The main public WordHoard site contains the following four corpora:
WordHoard supports the following categories of tagging data:
Morphological tagging includes full lemmatization with parts of speech, word classes, and spellings. Every word is tagged with its lemma, part of speech, word class, major word class, and spelling.
Lemmas distinguish between word classes, with the word class indicated in parentheses following the lemma name. For example, the verb "love (v)" is a separate lemma from the noun "love (n)".
WordHoard distinguishes homonyms by adding a homonym number in parentheses following the lemma name and word class. For example, there are two verbs "lie." The first means "to recline" and is indicated in WordHoard by "lie (v) (1)". The second means "to tell a falsehood" and is indicated by "lie (v) (2)". Texts residing in corpora other than the WordHoard defaults may not have homonyms distinguished.
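The naming convention above (lemma name, word class in parentheses, optional homonym number in parentheses) can be split mechanically. The following is a minimal sketch in Python, not WordHoard's own code: the names `LemmaLabel` and `parse_lemma_label` are hypothetical, and the sketch assumes that labels always take the form shown in the examples.

```python
import re
from typing import NamedTuple, Optional


class LemmaLabel(NamedTuple):
    """Hypothetical container for the pieces of a lemma label."""
    name: str                # e.g. "lie"
    word_class: str          # e.g. "v"
    homonym: Optional[int]   # e.g. 2, or None when no number is given


# Assumed label shape: "<name> (<word class>)" with an optional " (<number>)".
# The word class may not start with a digit, so a trailing "(2)" is always
# read as the homonym number.
_LABEL = re.compile(
    r"^(?P<name>.+?) \((?P<wc>[^()\d][^()]*)\)(?: \((?P<hom>\d+)\))?$"
)


def parse_lemma_label(label: str) -> LemmaLabel:
    """Split a label such as 'lie (v) (2)' into its parts."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a recognized lemma label: {label!r}")
    hom = match.group("hom")
    return LemmaLabel(
        name=match.group("name"),
        word_class=match.group("wc"),
        homonym=int(hom) if hom is not None else None,
    )
```

With this sketch, `parse_lemma_label("love (v)")` and `parse_lemma_label("love (n)")` yield different word classes for the same name, and `parse_lemma_label("lie (v) (2)")` carries the homonym number 2.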
For Shakespeare, the morphological tagging handles contractions properly. For example, the first word of Hamlet, "who's," is tagged as having two word parts, each with its own lemma and part of speech. A search for the pronoun "who" finds this word occurrence, as does a search for the primary verb "be." The other corpora may not have this kind of tagging. Where they do not, contractions and other compound words are treated as having a single part, with a single lemma and part of speech.
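To illustrate why a contraction with two word parts is found by a search on either lemma, here is a minimal Python sketch. It is an illustration of the idea only, not WordHoard's data model: the classes `WordPart` and `Word`, the function `find_by_lemma`, and the part-of-speech strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WordPart:
    """One morphological part of a word (hypothetical structure)."""
    lemma: str
    pos: str  # placeholder part-of-speech label, not WordHoard's taxonomy


@dataclass(frozen=True)
class Word:
    """A word occurrence with one or more parts."""
    spelling: str
    parts: Tuple[WordPart, ...]


def find_by_lemma(words: List[Word], lemma: str) -> List[Word]:
    """Return every word occurrence in which any part has the given lemma."""
    return [w for w in words if any(p.lemma == lemma for p in w.parts)]


# The opening of Hamlet, "Who's there?", with the contraction split in two.
opening = [
    Word("who's", (WordPart("who", "pronoun"), WordPart("be", "verb"))),
    Word("there", (WordPart("there", "adverb"),)),
]
```

Here both `find_by_lemma(opening, "who")` and `find_by_lemma(opening, "be")` return the single occurrence "who's". If the contraction were stored as one part with one lemma, at most one of the two searches could find it.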
Each word is tagged with a flag that tells whether the word appears in narration or in a speech. For words in speeches in Shakespeare and the Early Greek Epic corpus, we also have tags for the speaker name, speaker gender, and speaker mortality.
Each word is tagged with a flag that tells whether the word appears in prose or verse.
Each word in the Early Greek Epic corpus is tagged with its metrical shape.
When known, each work is tagged with its publication date or date range, and each author is tagged with birth/death dates and earliest/latest publication dates.
The Chaucer corpus is tagged with Professor Larry Benson's word glosses (definitions) and with his lemmatization and part of speech taxonomies (in addition to our own). The Benson glosses are for display only: they cannot be used as criteria for searches or calculations.
All of the corpora include the basic morphological tagging data. The other tagging categories are optional. Some of the corpora support them, and some do not.
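The split between required and optional tagging can be pictured as a record whose morphological fields are always present and whose other fields may be absent. The following Python sketch rests on that assumption. The class `TaggedWord`, its field names, and the helper `count_where` are hypothetical and do not reflect WordHoard's actual schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TaggedWord:
    """Hypothetical word record: morphology required, the rest optional."""
    # Basic morphological tagging, present in every corpus.
    spelling: str
    lemma: str
    pos: str
    word_class: str
    major_word_class: str
    # Optional categories; None means the corpus does not supply the tag.
    in_speech: Optional[bool] = None
    is_verse: Optional[bool] = None
    speaker_gender: Optional[str] = None
    metrical_shape: Optional[str] = None


def count_where(words: Iterable[TaggedWord], **criteria) -> int:
    """Count words whose tags equal every given criterion.

    A word that lacks a tag (value None) does not match a criterion
    that asks for a concrete value.
    """
    return sum(
        all(getattr(w, field) == value for field, value in criteria.items())
        for w in words
    )
```

For example, `count_where(words, is_verse=True, in_speech=True)` counts only words that carry both optional tags with those values. Words from a corpus without prose/verse tagging are simply not counted.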