WordHoard - Introduction to Analysis Methods

Exporting and Importing Sets

Table of Contents

Calculator Window

Introduction to Analysis Methods

Introduction
Analysis and reference texts
Saving the analysis output
Changing the output column sort order
References

Introduction

Corpus Linguistics can be characterized as the study of linguistic phenomena through large collections of machine-readable texts called corpora. Corpora are often enhanced with tagging information for each word such as the part of speech, the lemma (root word), speaker gender, and so on. Some of the interesting linguistic phenomena we can examine in a tagged corpus include:

word form frequencies and distributions;
syntactic analysis;
multiword units and co-occurring word forms;
character, word, clause, sentence, and paragraph lengths and distributions;
semantic content.

Many interesting linguistic features are revealed simply by counting and comparing occurrences of word forms such as spellings, lemmata, parts of speech, or speaker gender. We can compare these word forms among different corpora or subsets of a corpus using statistical methods. WordHoard provides a basic slate of statistical methods for analyzing word form counts through the Analysis menu. You can create word form lists, find collocates and statistically interesting multiword units, compare single or multiple word forms between two text collections, track word usage over time, and compare texts for similarity. In addition to these built-in analyses, you can create your own using the WordHoard script facility.
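The idea of counting word forms can be sketched in a few lines of Python. This is only an illustration and not how WordHoard itself is implemented: the token list, its three-field layout, and the function name `count_word_forms` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical tagged tokens: (spelling, lemma, part of speech).
# The values and tag names are illustrative, not WordHoard data.
tagged_tokens = [
    ("goes", "go", "verb"),
    ("went", "go", "verb"),
    ("king", "king", "noun"),
    ("kings", "king", "noun"),
    ("go", "go", "verb"),
]


def count_word_forms(tokens, field):
    """Count one word form type per token.

    field: 0 = spelling, 1 = lemma, 2 = part of speech.
    """
    return Counter(token[field] for token in tokens)


spelling_counts = count_word_forms(tagged_tokens, 0)
lemma_counts = count_word_forms(tagged_tokens, 1)
pos_counts = count_word_forms(tagged_tokens, 2)

print(lemma_counts.most_common())
```

Counting by lemma groups "goes", "went", and "go" together, while counting by spelling keeps them apart. This is why the choice of word form matters before any comparison is made.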

Analysis and reference texts

The first step in using an analysis procedure in WordHoard is to specify an analysis text and, in some cases, a reference text. The analysis text provides the focal point. The reference text provides the text for comparisons or overall word form counts. You uncover properties of the analysis text by comparing it with the reference text.

Most of the WordHoard analysis methods allow you to select a corpus, work, work set, or word set for both the analysis text and the reference text. Usually the analysis text differs from the reference text, although the reference text may include the analysis text as a member.

For illustrative purposes, we will use Shakespeare's Hamlet as the analysis text and the Shakespeare corpus as the reference text. In this case, the reference text (all of Shakespeare) contains the analysis text (Hamlet) as a member.
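To make the comparison concrete, the following sketch normalizes a word form count by text size so that a short analysis text and a much larger reference text can be compared. The counts are invented for illustration and are not real figures for Hamlet or Shakespeare. The function `relative_frequency` is a hypothetical helper, and this simple rate ratio is not necessarily one of the statistics WordHoard computes.

```python
def relative_frequency(count, total, per=10_000):
    """Return the number of occurrences per `per` words."""
    return count * per / total


# Hypothetical counts for one lemma (not real WordHoard figures).
analysis_count, analysis_total = 30, 30_000        # analysis text
reference_count, reference_total = 400, 800_000    # reference text

analysis_rate = relative_frequency(analysis_count, analysis_total)
reference_rate = relative_frequency(reference_count, reference_total)

# A ratio above 1 means the lemma is relatively more frequent
# in the analysis text than in the reference text.
ratio = analysis_rate / reference_rate

print(analysis_rate, reference_rate, ratio)
```

Because the reference text here contains the analysis text, the reference counts already include the analysis counts. That is acceptable for a rough comparison like this one, but worth keeping in mind when interpreting the ratio.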

Saving the analysis output

WordHoard displays the results of an analysis in tabular form. You can use the "Edit" menu to copy all or selected portions of the results to the system clipboard. This allows you to paste the results into a report or another program for further analysis. You can also save the results to a file using the "Save As" command on the "File" menu. You have a choice of three output formats for saving the results.

CSV saves the results as a comma-separated series of values. WordHoard writes each row of result values to a separate line in the output file.
TAB saves the results as a tab-separated series of values. (The tab is ASCII character 9.) As with CSV, WordHoard writes each row of output values to a separate line in the output file.
HTML saves the results as an XHTML formatted table suitable for inclusion in a longer (X)HTML document. WordHoard only creates the table formatting markup. You will need to add the remainder of the (X)HTML markup to create a complete HTML document.
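
Files saved in the CSV or TAB formats can be read back by most data tools. As a minimal sketch, the Python below parses tab-separated text with the standard `csv` module. The column names, the sample values, and the presence of a header row are assumptions made for this example; check an actual saved file before relying on them.

```python
import csv
import io

# Hypothetical contents of a file saved in TAB format.
# The header row and column names are assumed for illustration.
saved_output = (
    "Lemma\tAnalysis count\tReference count\n"
    "king\t12\t340\n"
    "queen\t7\t95\n"
)


def read_results(stream, delimiter="\t"):
    """Read delimited rows into a list of dicts keyed by the header row."""
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]


# io.StringIO stands in for an opened file so the example is self-contained.
rows = read_results(io.StringIO(saved_output))

for row in rows:
    print(row["Lemma"], int(row["Analysis count"]))
```

For a file saved in CSV format, the same function can be called with `delimiter=","`. When reading from disk, pass a file opened with `newline=""`, as the `csv` module documentation recommends.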

Changing the output column sort order

The analysis output tables may be sorted by the contents of a column. The sort column is marked by an up triangle, indicating the values appear in ascending order, or a down triangle, indicating the values appear in descending order. You may select a new column for sorting in ascending order by clicking on the header for that column. To sort in descending order, hold down the shift key while clicking on the column header.

Technical statistical or computational discussions appear in gray boxes. You may skip these discussions if you prefer.

References

The following texts provide good coverage of most topics concerning the statistical analysis of texts.

Manning, Christopher and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
Oakes, Michael P. 1998. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
