WordHoard - Finding Collocates

Introduction

The Latin maxim noscitur a sociis (one knows them by their associates) applies to words as well as people. Often we want to know not only how often a word form appears in a text, but also how frequently two or more specific word forms appear near each other in a text. Words which appear in proximity more frequently than expected are called collocates.

Some collocates appear in rigid or frozen forms. Examples include titles such as King of England and President of the United States, adverbial phrases such as in general, and verbal phrases such as freeze up. You may be interested in the frozen forms an author uses. You may be interested in collocates which are not frozen forms and which are unique to a specific author or a specific work. You can use the same WordHoard facilities to pursue either type of investigation. The WordHoard Find Multiword Units analysis is more helpful if you are primarily interested in uncovering multiword phrases and frozen forms. If you want to compare the relative frequency of collocates for a word in two different texts, see Comparing Collocates.

To locate collocates we need to define three quantities.

The neighborhood of a word in a text which defines the size of the word span to the left and right in which to search for potential collocates.
The reference frequency of the potential collocates.
The statistical method for comparing the observed and reference frequencies.

In WordHoard you start by choosing a specific spelling or lemma for which you want to find collocates. We call this the focus word or the node. You define the search neighborhood for collocates by specifying a span of words to the left and right of the focus word within which to search for collocates. Different words tend to require spans of different size. In English, short spans usually work well.
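The neighborhood search can be sketched in a few lines of Python. This is a minimal illustration and not WordHoard's own code. The function name count_neighbors and the toy token list are hypothetical.

```python
from collections import Counter

def count_neighbors(tokens, focus, left_span=1, right_span=1):
    """Count the words that fall inside the span around each focus word."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != focus:
            continue
        # Clip the window at the start and end of the text.
        start = max(0, i - left_span)
        end = min(len(tokens), i + right_span + 1)
        for j in range(start, end):
            if j != i:  # skip the focus word occurrence itself
                counts[tokens[j]] += 1
    return counts

# Toy text, already reduced to lemma-like tokens (made up for this sketch).
tokens = "i think it be so and you think it not".split()
neighbors = count_neighbors(tokens, "think", left_span=1, right_span=1)
print(neighbors.most_common())
```

Widening left_span or right_span enlarges the neighborhood and so admits more potential collocates.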

WordHoard provides five different commonly used statistical measures for assessing which words in the neighborhood of the focus word are possible collocates. The measures compare the observed frequency of the potential collocate in the neighborhood of the focus word with the frequency of the potential collocate in the entire text, which WordHoard uses as the reference frequency.

Find collocates dialog

In the following we search for collocates of the verb "think" in Shakespeare's "Hamlet". To find the collocates, select "Find Collocates" from the Analysis menu. WordHoard displays the following dialog.

Find Collocates Dialog

The dialog fields are as follows.

Word is the chosen focus word whose collocates we seek. In this example we select the word "think (v)", i.e., "think" used as a verb.
Word Form specifies the type of word form to find. You may specify lemma or spelling. We select lemma for our analysis. Choosing the lemma allows us to ignore spelling differences that might otherwise mask the recognition of a collocate because it appears in several different word forms, e.g., as a singular or plural noun, or in different verb tenses. If you are interested primarily in frozen forms, you probably want to choose spelling instead.
Left span specifies how many words to the left of the focus word in the text WordHoard should look for collocates.
Right span specifies how many words to the right of the focus word in the text WordHoard should look for collocates.
Cutoff specifies the minimum number of times a word must appear in the neighborhood of the focus word to be considered a collocate.
Analysis Text provides the text in which to search for collocates of the selected word. We select Shakespeare's play "Hamlet" as the analysis text.

Mark significant log-likelihood values appends asterisks to each significant log-likelihood value. When the significance values are not being adjusted (see the next option below), the asterisks indicate the following levels of significance.

****	Significant at 0.0001 level
***	Significant at 0.001 level
**	Significant at 0.01 level
*	Significant at 0.05 level

We enable this option.
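As a rough sketch of how the asterisks could be assigned, the Python below converts a log-likelihood value to a p-value and picks the matching marker. It assumes the value is referred to a chi-square distribution with one degree of freedom and that no adjustment for multiple comparisons is made. Both are assumptions of this sketch, and the function name significance_stars is hypothetical.

```python
import math

def significance_stars(g2):
    """Return the asterisk marker for an unadjusted log-likelihood value."""
    # Upper-tail probability of a chi-square variate with 1 degree of
    # freedom (an assumption of this sketch): p = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(g2 / 2.0))
    levels = ((0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*"))
    for level, stars in levels:
        if p < level:
            return stars
    return ""

for g2 in (2.0, 4.0, 7.0, 11.0, 48.62):
    print(g2, significance_stars(g2))
```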

Adjust chi-square for number of comparisons adjusts the breakpoints for assessing the significance of the log-likelihood statistics as described in the section Adjusting significance levels for many comparisons. We do not enable this option.
Show word classes for all words asks WordHoard to display the word class for spellings and lemmata in the output. If you do not enable this option, WordHoard displays only the spelling or lemma text. We do not enable this option.

Find collocates output and measures of association

WordHoard presents the output of the collocate analysis in a table with eight columns. The first column contains the potential collocates of "think (v)". This includes all the words which appeared at least "cutoff" number of times in the chosen span of words to the left and right of the focus word.

The second column shows the number of times the potential collocate occurs near "think (v)" within the specified span.

The third column presents the total number of times the potential collocate appears in the analysis text. That is the reference frequency for the collocate.

The next five columns present measures of association for the potential collocate. We discuss those below.

The header of the output table provides the frequency for the focus word. In this case, "think (v)" appears 56 times in Hamlet. There are 168 words which appear within the chosen span of 1 word to the right and 1 word to the left of "think (v)" anywhere in Hamlet.

Find Collocates Output

WordHoard provides five commonly used measures of association for assessing how well two words adhere. In each case, the higher the value of the measure of association, the more likely the words are to be collocates. You can sort the collocates on any one of the association measure values by holding down the shift key and clicking the column header for the measure. By default WordHoard sorts the collocates by descending log-likelihood value.

All of the measures depend upon an estimate of the probability of occurrence for each word and for the two words together. WordHoard uses the maximum likelihood estimate for the probability, which is just the frequency of occurrence divided by the total number of words in the text -- either the number of spellings or the number of lemmata, depending upon which type of word form we chose to analyze.

The calculations for all the association measures are based upon the following contingency table. Here w1 is the focus word and w2 is the potential collocate.

	w₂	~w₂
w₁	O₁₁	O₁₂	R₁
~w₁	O₂₁	O₂₂	R₂
	C₁	C₂	N

	w₂	~w₂
w₁	E₁₁ = R₁C₁/N	E₁₂ = R₁C₂/N
~w₁	E₂₁ = R₂C₁/N	E₂₂ = R₂C₂/N

O_ij are the observed counts.
- O₁₁ counts the number of times the focus word and potential collocate occur near each other in the selected span.
- O₁₂ counts the number of times the focus word appears but not near the potential collocate.
- O₂₁ counts the number of times the potential collocate appears but not near the focus word.
- O₂₂ counts the number of words other than the focus word and the potential collocate.
R₁ and R₂ are the row sums.
C₁ and C₂ are the column sums.
N is the total number of words (either spellings or lemmata) in the text.
E_ij are the expected counts under the hypothesis of independence -- that is, that the words are not collocates.

Here is the contingency table for the potential collocate "it" derived from the WordHoard output above. Entries in the left-hand table which appear in plain text come directly from the WordHoard output. Entries in italics are computed using the marginal constraints. The expected values under independence are computed from the formulae provided above. Internally WordHoard constructs these contingency table entries in order to compute the association measures.

	it	~it
think	14	42	56
~think	582	29249	29831
	596	29291	29887

	it	~it
think	E₁₁ = 1.11674	E₁₂ = 54.88326
~think	E₂₁ = 594.88326	E₂₂ = 29236.11674
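The tables for "it" can be reproduced from four numbers: the co-occurrence count, the two word frequencies, and the total word count. The Python sketch below fills in the remaining cells from the marginal constraints. The function name contingency is hypothetical, and the code only illustrates the arithmetic; it does not show WordHoard's internals.

```python
def contingency(o11, r1, c1, n):
    """Build observed and expected 2x2 tables from O11, R1, C1 and N."""
    o12 = r1 - o11   # focus word without the collocate nearby
    o21 = c1 - o11   # collocate away from the focus word
    r2 = n - r1      # second row sum
    c2 = n - c1      # second column sum
    o22 = r2 - o21   # all remaining words
    observed = [[o11, o12], [o21, o22]]
    # Expected counts under independence: E_ij = R_i * C_j / N.
    expected = [[r1 * c1 / n, r1 * c2 / n],
                [r2 * c1 / n, r2 * c2 / n]]
    return observed, expected

observed, expected = contingency(o11=14, r1=56, c1=596, n=29887)
print(observed)
for row in expected:
    print([round(e, 5) for e in row])
```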

Based upon the entries in this contingency table we can define the five association measures presented by WordHoard as follows.

Dice Coefficient

Dice coefficient = ( 2 * O₁₁ ) / ( R₁ + C₁ )

For our example, the Dice coefficient for "it" is computed as:

Dice coefficient = ( 2 * 14 ) / ( 56 + 596 )

= 28 / 652

= 0.0429

The Dice coefficient takes values from 0 through 1. The value increases as the frequency of the co-occurrences of the focus word and potential collocate increases relative to the counts of the focus word and potential collocate individually. A Dice score of zero means the words never appear together, while a Dice score of one means the words always appear together. Mathematically the Dice score is the harmonic mean of the conditional probabilities P(w1 | w2) and P(w2 | w1). P(w1 | w2) is the conditional probability that the first word in the bigram appears given the second word. P(w2 | w1) is the conditional probability that the second word in the bigram appears given the first word.
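The following Python sketch checks the harmonic mean relationship numerically for "it". It estimates the conditional probabilities as O₁₁ / R₁ and O₁₁ / C₁, and the function names are hypothetical.

```python
def dice(o11, r1, c1):
    """Dice coefficient from the joint count and the two marginal counts."""
    return 2 * o11 / (r1 + c1)

def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)

o11, r1, c1 = 14, 56, 596
p_w2_given_w1 = o11 / r1   # collocate given the focus word
p_w1_given_w2 = o11 / c1   # focus word given the collocate

dice_value = dice(o11, r1, c1)
mean_value = harmonic_mean(p_w1_given_w2, p_w2_given_w1)
print(round(dice_value, 4), round(mean_value, 4))
```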

For our example, the top five scoring words (lemmata) are not, it, on, you, shall.

Phi-squared (φ²)

φ² = ( O₁₁ - E₁₁ )² / ( E₁₁ * N )

For our example, the φ² value for "it" is computed as:

φ²	= ( 14 - 1.11674 )² / ( 1.11674 * 29887 )
	= 165.978 / 33376.01
	= 0.0050

φ² takes values from 0 through 1. φ² is the Pearson chi-square value of association divided by the number of words. In other words, you can recover the Pearson chi-square value for the contingency table from φ². WordHoard displays the log-likelihood chi-square value in preference to Pearson's chi-square because the log-likelihood value is more reliable for literary studies.
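A short Python sketch of the worked φ² example follows. It squares the difference between the observed and expected co-occurrence counts and divides by E₁₁ times N, which reproduces the figures 165.978 and 33376.01 given for "it". The function name phi_squared is hypothetical.

```python
def phi_squared(o11, r1, c1, n):
    """Phi-squared as in the worked example for "it"."""
    e11 = r1 * c1 / n                 # expected co-occurrence count
    return (o11 - e11) ** 2 / (e11 * n)

phi2 = phi_squared(o11=14, r1=56, c1=596, n=29887)
print(round(phi2, 4))
```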

For our example, the top five scoring words (lemmata) using φ² are it, not, I, you, on.

Log-likelihood

The log-likelihood ratio statistic G² measures the discrepancy of the observed word frequencies from the values which we would expect to see if the word frequencies (by percentage) were the same in the neighborhood of the focus word and the entire text. The larger the discrepancy, the larger the value of G², and the more statistically significant the difference between the frequency of the potential collocate's appearance in the neighborhood of the focus word and the collocate's appearance in the text as a whole.

Log-likelihood =

2 * (O₁₁ * ln(O₁₁/E₁₁) + O₁₂ * ln(O₁₂/E₁₂) + O₂₁ * ln(O₂₁/E₂₁) + O₂₂ * ln(O₂₂/E₂₂))

For our example, the log-likelihood value for "it" is computed as:

Log-likelihood =	2 * ( 14 * ln( 14 / 1.11674 ) + 42 * ln( 42 / 54.88326 ) + 582 * ln( 582 / 594.88326 ) + 29249 * ln( 29249 / 29236.11674 ) )
=	2 * ( 14 * 2.52864 - 42 * 0.267539 - 582 * 0.021895 + 29249 * 0.0004406 )
=	48.62

WordHoard ignores any zero observed count value in computing the log-likelihood value.
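The log-likelihood calculation, including the rule of skipping zero observed counts, might look like this in Python. It is a sketch of the formula rather than WordHoard's code, and the function name log_likelihood is hypothetical.

```python
import math

def log_likelihood(observed, expected):
    """G-squared for a 2x2 table, skipping cells with a zero observed count."""
    total = 0.0
    for o_row, e_row in zip(observed, expected):
        for o, e in zip(o_row, e_row):
            if o > 0:  # a zero count contributes nothing to the sum
                total += o * math.log(o / e)
    return 2.0 * total

# Observed table for "it" near "think (v)" in Hamlet.
observed = [[14, 42], [582, 29249]]
rows, cols, n = [56, 29831], [596, 29291], 29887
expected = [[r * c / n for c in cols] for r in rows]

g2 = log_likelihood(observed, expected)
print(round(g2, 2))
```

Skipping the zero cells avoids taking the logarithm of zero, which is undefined.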

For our example, the top five scoring words (lemmata) are it, I, not, you, on.

Specific Mutual Information

Specific Mutual Information compares the probability of finding the two words w₁ and w₂ in proximity to the expected probability that the two words appear independently of each other in the text.

Mutual Information = log₂( O₁₁ / E₁₁ )

For our example, the mutual information value for "it" is computed as:

Mutual information	= log₂( 14 / 1.11674 )
	= log₂( 12.53649 )
	= 3.6481

When the specific mutual information score is zero the two words are not collocates. When the score is greater than zero, the two words may be collocates. How do we determine which values indicate that a collocate relationship exists? One commonly applied rule of thumb is that a Specific Mutual Information score greater than 1.585 indicates the two words are possible collocates. Since 1.585 is the log₂ of 3, a score greater than 1.585 indicates the two words occur near each other more than three times as often as expected by chance. For our example, the top five scoring words (lemmata) are on, not, it, or, shall.
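The mutual information score and the rule of thumb can be sketched in Python as follows. The function name mutual_information is hypothetical, and the code assumes the co-occurrence count is at least 1 so that the logarithm is defined.

```python
import math

def mutual_information(o11, r1, c1, n):
    """Specific mutual information: log2 of observed over expected count."""
    e11 = r1 * c1 / n   # expected co-occurrence count under independence
    return math.log2(o11 / e11)

mi = mutual_information(o11=14, r1=56, c1=596, n=29887)
threshold = math.log2(3)   # the rule-of-thumb cutoff, about 1.585

print(round(mi, 4), round(threshold, 3), mi > threshold)
```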

Another approach is to compute the salience of the word pair by multiplying the mutual information score by the log₂ of the co-occurrence count:

Salience = log₂( O₁₁ ) * Mutual Information score

For example, the salience for "it" is given by:

Salience for "it"	= log₂( 14 ) * 3.6481
	= 3.8074 * 3.6481
	= 13.89
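In Python the salience calculation might be sketched as below. The function name salience is hypothetical. The second call uses made-up counts for a pair seen only once, to show that such a pair scores zero whatever its mutual information, because log₂( 1 ) is 0.

```python
import math

def salience(o11, r1, c1, n):
    """Mutual information weighted by log2 of the co-occurrence count."""
    e11 = r1 * c1 / n
    mi = math.log2(o11 / e11)
    return math.log2(o11) * mi

salience_it = salience(o11=14, r1=56, c1=596, n=29887)
# Hypothetical rare pair: one co-occurrence, collocate seen 10 times.
salience_rare = salience(o11=1, r1=56, c1=10, n=29887)

print(round(salience_it, 2), salience_rare)
```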

Frequent word pairs will take on a higher salience value. For our example, the words sorted in descending order by salience are:

Word	Salience
it	13.89
not	12.43
i	11.09
you	10.73
on	8.03
thou	6.01
shall	5.01
so	4.84
of	4.63
what	4.56
they	4.45
or	3.23
and	2.50
to	1.79
do	1.72
be	-0.05

Mutual information tends to weight rare events more highly than common events. That may be useful in detecting unusual frozen forms. The derived salience values are less sensitive to rare events.

Symmetric Conditional Probability

The Symmetric Conditional Probability is the product of the two conditional probabilities P(w1 | w2) and P(w2 | w1). P(w1 | w2) is the conditional probability that the first word in the bigram appears given the second word. P(w2 | w1) is the conditional probability that the second word in the bigram appears given the first word. Symmetric Conditional Probability takes values from 0 through 1. The closer the value is to 1, the more likely the two words are to be collocates.

Symmetric Conditional Probability = O₁₁² / ( R₁ * C₁ )

For our example, the symmetric conditional probability coefficient for "it" is computed as:

Symmetric conditional probability	= (14 * 14) / ( 56 * 596 )
	= 196 / 33376
	= 0.0059
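A Python sketch of the same calculation, written as the product of the two estimated conditional probabilities, follows. The function name is hypothetical.

```python
def symmetric_conditional_probability(o11, r1, c1):
    """Product of the estimates of P(w2 | w1) and P(w1 | w2)."""
    p_w2_given_w1 = o11 / r1
    p_w1_given_w2 = o11 / c1
    return p_w2_given_w1 * p_w1_given_w2

scp = symmetric_conditional_probability(o11=14, r1=56, c1=596)
print(round(scp, 4))
```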

For our example, the top five scoring words (lemmata) are it, not, I, you, on.

The different measures produce different rankings of the degree to which each potential collocate adheres with the focus word. Which measure is best for literary research? There doesn't seem to be a consensus yet among the experts. More research is needed to come up with recommendations.

While WordHoard reports five measures of association for collocates, over eighty such measures have been proposed. The paper by Pavel Pecina covers most of them. If you don't see your favorite measure in WordHoard, you can use WordHoard's scripting facilities to implement your own.

In our example, the top five most highly ranked words on each measure are mostly the same. None of them looks particularly unusual at first glance. We may be able to find out more by looking at the contexts in which the words appear in Hamlet.

Visualizing collocate measures using a tag cloud

As an alternative to looking at this dense table of numbers, WordHoard allows you to display the collocate results in a tag cloud. A tag cloud displays words or phrases in different font sizes. To create a tag cloud from the collocate output results, select the measure of association for the tag cloud using the "Cloud Association Measure" drop down list. We will use the log-likelihood values. Then select the "Cloud" button to generate the cloud.

Tag cloud for lemma comparison

The larger the text for a collocate, the higher its association measure value. This allows you to assess at a glance the relative importance of the collocate. WordHoard assigns a font size of 100 points to the word with the largest (scaled) association measure value. Words whose font size ends up smaller than 3 points are not displayed in the tag cloud.
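One way to picture the sizing rule is the Python sketch below. It assumes font sizes are proportional to the association measure value, which is an assumption of this sketch and may not be how WordHoard scales the text. The function name font_sizes and the scores are made up.

```python
def font_sizes(scores, max_points=100.0, min_points=3.0):
    """Map scores to font sizes, assuming simple proportional scaling."""
    largest = max(scores.values())
    sizes = {}
    for word, score in scores.items():
        points = max_points * score / largest
        if points >= min_points:   # words under 3 points are not shown
            sizes[word] = round(points, 1)
    return sizes

# Hypothetical association measure values.
scores = {"alpha": 50.0, "beta": 25.0, "gamma": 1.0}
sizes = font_sizes(scores)
print(sizes)
```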

The words comprising a collocation are separated by a small raised square in the tag cloud output. The collocate pairs are separated by spaces.

Notice we selected the checkbox "Compress log-likelihood value range in tag clouds" at the bottom of the tabular output. Selecting that option scales the log-likelihood values before generating the tag cloud using those values to determine the size of the text for each corresponding word. WordHoard uses a transformation based upon the inverse hyperbolic sine of the log-likelihood values. This helps to prevent exceptionally large log-likelihood values from dominating the tag cloud display. WordHoard does not scale measures other than log-likelihood.
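The effect of the inverse hyperbolic sine can be seen in a small Python sketch. The values are made up, and WordHoard's actual transformation may involve further scaling that this sketch omits.

```python
import math

def compress(values):
    """Apply the inverse hyperbolic sine to each value."""
    return {word: math.asinh(v) for word, v in values.items()}

# Hypothetical log-likelihood values, one of them exceptionally large.
raw = {"big": 1000.0, "small": 10.0}
squeezed = compress(raw)

raw_ratio = raw["big"] / raw["small"]
squeezed_ratio = squeezed["big"] / squeezed["small"]
print(raw_ratio, round(squeezed_ratio, 2))
```

The largest value is 100 times the smaller one before compression and less than 3 times afterwards, so the smaller word stays legible in the cloud.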

Different association measures can result in different tag clouds. For instance, if we select Symmetric Conditional Probability as the association measure, we get the following tag cloud.

Tag cloud for lemma comparison

The lemma "it" still dominates, but "not" and "I" have switched places.

Collocate contexts

WordHoard allows you to view the contexts in which a collocate appears in the analysis text. For example, to view the contexts in which "it" appears as a neighbor of "think (v)", highlight the first row in the output table.

Tag cloud for lemma comparison

The "Context" button is now available. Select the "Context" button to see the contexts in which the collocate "it" appears near "think" in Hamlet. The context words are displayed in their lemma form since we chose to find collocates based upon the lemma form of "think." From this we see that all the occurrences of "think" and "it" together in Hamlet are of the form "think it" or "think on it".

Tag cloud for lemma comparison

If you double-click on a context line, you will be taken to the full text for that context. For example, double-click on the first line and the text of Hamlet opens in a new window with the occurrence of "think" from the selected context highlighted.

Context for collocate

References

Pavel Pecina surveys over eighty different measures of association for collocates in:

Pecina, Pavel. 2005. An Extensive Empirical Study of Collocation Extraction Methods. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), Student Research Workshop, Ann Arbor, Michigan, June.

