java.lang.Object edu.northwestern.at.utils.corpuslinguistics.NGramExtractor
public class NGramExtractor
Extract ngrams from text.
Field Summary | |
---|---|
protected java.util.TreeMap | nGramCounts - The list of ngrams and associated counts. |
(package private) int | nGramSize - Number of words forming an ngram. |
protected int | numberOfNGrams - Total number of ngrams. |
(package private) int | windowSize - Window size within which to search for ngrams. |
protected WordCountExtractor | wordCountExtractor - The WordCountExtractor with the list of words to analyze. |
Constructor Summary | |
---|---|
NGramExtractor(java.util.ArrayList wordList, int nGramSize, int windowSize) | Create NGram analysis from an arraylist of words. |
NGramExtractor(java.lang.String[] words, int nGramSize, int windowSize) | Create NGram analysis from string array of words. |
NGramExtractor(java.lang.String fileName, java.lang.String encoding, int nGramSize, int windowSize) | Create NGram analysis of a text file. |
NGramExtractor(WordCountExtractor wordCountExtractor, int nGramSize, int windowSize) | Create NGram analysis from a WordCountExtractor. |
Method Summary | |
---|---|
protected void | generateNGrams() - Generate NGram analysis from string array of words. |
int | getNGramCount(java.lang.String ngram) - Return count for a specific ngram. |
java.util.SortedMap | getNGramMap() - Return NGram map. |
java.lang.String[] | getNGrams() - Return NGrams. |
int | getNumberOfNGrams() - Returns the total number of ngrams. |
int | getNumberOfUniqueNGrams() - Returns the number of unique ngrams. |
void | mergeNGramExtractor(NGramExtractor extractor) - Merge ngrams from another NGramExtractor. |
static java.lang.String[] | splitNGramIntoWords(java.lang.String ngram) - Returns the individual words comprising an ngram. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected WordCountExtractor wordCountExtractor
int nGramSize
int windowSize
protected java.util.TreeMap nGramCounts
Key=ngram string
Value=Integer(count)
The ngram string is two or more words with a tab character ("\t") separating the words.
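The key/value layout described above can be sketched with a plain TreeMap. This is an illustration only, not the class's actual source: the class name NGramKeySketch and the helpers makeKey and increment are hypothetical, and the generic type parameters are an assumption (the field itself is declared as a raw java.util.TreeMap).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the documented map layout:
// key = words joined by a tab character, value = Integer count.
class NGramKeySketch {
    // Builds a tab-separated key from the words of one ngram.
    static String makeKey(String... words) {
        return String.join("\t", words);
    }

    // Records one more occurrence of the ngram in the count map.
    static void increment(TreeMap<String, Integer> counts, String key) {
        counts.merge(key, 1, Integer::sum);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        increment(counts, makeKey("quick", "brown"));
        increment(counts, makeKey("quick", "brown"));
        increment(counts, makeKey("a", "quick"));
        // A TreeMap iterates its keys in sorted order.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey().replace('\t', ' ') + " = " + entry.getValue());
        }
    }
}
```

Because the key is a single string, the map stays sorted by the first word and then by the following words, which is one plausible reason for choosing a tab-joined key over a list of words.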
protected int numberOfNGrams
Constructor Detail |
---|
public NGramExtractor(java.lang.String[] words, int nGramSize, int windowSize)
words
- The string array with the words.
nGramSize
- The number of words forming an ngram.
windowSize
- The window size (number of words) within which to construct ngrams.
Example: nGramSize=2, windowSize=3, text="a quick brown fox".
The first window is "a quick brown". The ngrams are "a quick", "a brown", and "quick brown".
The second window is "quick brown fox". The ngrams are "quick brown", "quick fox", and "brown fox".
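The windowing in the example above can be sketched as follows. This is a minimal illustration under stated assumptions, not the class's generateNGrams() implementation: the class WindowedNGramSketch and its methods are hypothetical, the window is assumed to slide one word at a time, and an ngram that appears in two overlapping windows (such as "quick brown") is assumed to be counted once per window. The real class may handle both points differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: count order-preserving word combinations inside a sliding window.
class WindowedNGramSketch {
    static TreeMap<String, Integer> countNGrams(String[] words, int nGramSize, int windowSize) {
        if (nGramSize < 1) {
            throw new IllegalArgumentException("nGramSize must be at least 1");
        }
        TreeMap<String, Integer> counts = new TreeMap<>();
        // Assumption: a text shorter than the window is treated as one window.
        int width = Math.min(windowSize, words.length);
        for (int start = 0; start + width <= words.length; start++) {
            collect(words, start, start + width, nGramSize, new ArrayList<>(), counts);
        }
        return counts;
    }

    // Recursively picks `remaining` more words, in text order, from words[from..end).
    private static void collect(String[] words, int from, int end, int remaining,
                                List<String> picked, TreeMap<String, Integer> counts) {
        if (remaining == 0) {
            // Words are joined with a tab, matching the documented key format.
            counts.merge(String.join("\t", picked), 1, Integer::sum);
            return;
        }
        for (int i = from; i + remaining <= end; i++) {
            picked.add(words[i]);
            collect(words, i + 1, end, remaining - 1, picked, counts);
            picked.remove(picked.size() - 1);
        }
    }

    public static void main(String[] args) {
        String[] words = {"a", "quick", "brown", "fox"};
        TreeMap<String, Integer> counts = countNGrams(words, 2, 3);
        counts.forEach((ngram, count) ->
            System.out.println(ngram.replace('\t', ' ') + " = " + count));
    }
}
```

For the text "a quick brown fox" with nGramSize=2 and windowSize=3, this sketch yields the five distinct ngrams listed in the example, with "quick brown" counted twice because it falls inside both windows.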
public NGramExtractor(java.util.ArrayList wordList, int nGramSize, int windowSize)
wordList
- The arraylist with the words.
nGramSize
- The number of adjacent words forming an ngram.
windowSize
- The window size (number of words) within which to construct ngrams.
public NGramExtractor(java.lang.String fileName, java.lang.String encoding, int nGramSize, int windowSize)
fileName
- The file containing the text to analyze.
encoding
- The encoding for the text file (e.g., "utf-8").
nGramSize
- The number of adjacent words forming an ngram.
windowSize
- The window size (number of words) within which to construct ngrams.
public NGramExtractor(WordCountExtractor wordCountExtractor, int nGramSize, int windowSize)
wordCountExtractor
- The WordCountExtractor containing the words to analyze.
nGramSize
- The number of adjacent words forming an ngram.
windowSize
- The window size (number of words) within which to construct ngrams.
Method Detail |
---|
protected void generateNGrams()
public void mergeNGramExtractor(NGramExtractor extractor)
extractor
- The NGramExtractor whose ngrams are merged into this one.
public int getNGramCount(java.lang.String ngram)
ngram
- The ngram whose count is desired.
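Merging counts from a second extractor and then looking up a single ngram can be sketched on plain maps. This is an illustration, not the class's code: NGramMergeSketch, mergeCounts, and countOf are hypothetical names, summing the counts of ngrams present in both maps is an assumption about what merging means here, and returning 0 for an ngram that is absent is likewise an assumption about getNGramCount.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of merging two ngram count maps and looking up a count.
class NGramMergeSketch {
    // Adds every count from source into target, summing counts for shared ngrams.
    static void mergeCounts(TreeMap<String, Integer> target, Map<String, Integer> source) {
        for (Map.Entry<String, Integer> entry : source.entrySet()) {
            target.merge(entry.getKey(), entry.getValue(), Integer::sum);
        }
    }

    // Returns the stored count, or 0 when the ngram is not in the map (assumed behavior).
    static int countOf(Map<String, Integer> counts, String ngram) {
        return counts.getOrDefault(ngram, 0);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> first = new TreeMap<>();
        first.put("quick\tbrown", 2);
        TreeMap<String, Integer> second = new TreeMap<>();
        second.put("quick\tbrown", 1);
        second.put("brown\tfox", 1);
        mergeCounts(first, second);
        System.out.println(countOf(first, "quick\tbrown"));
        System.out.println(countOf(first, "lazy\tdog"));
    }
}
```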
public java.lang.String[] getNGrams()
public java.util.SortedMap getNGramMap()
public int getNumberOfNGrams()
public int getNumberOfUniqueNGrams()
public static java.lang.String[] splitNGramIntoWords(java.lang.String ngram)
ngram
- The ngram to parse.
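Since the ngram string separates its words with a tab character, recovering the words can be sketched as a split on "\t". The class NGramSplitSketch and the method splitIntoWords are hypothetical stand-ins; the actual splitNGramIntoWords may use a different parsing approach.

```java
import java.util.Arrays;

// Hypothetical sketch: recover the words of a tab-separated ngram string.
class NGramSplitSketch {
    static String[] splitIntoWords(String ngram) {
        // The tab character is the separator given in the Field Detail for nGramCounts.
        return ngram.split("\t");
    }

    public static void main(String[] args) {
        String[] words = splitIntoWords("quick\tbrown\tfox");
        System.out.println(Arrays.toString(words));
    }
}
```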