WordHoard - The Corpora and Tagging Data

The Corpora and Tagging Data

The main public WordHoard site contains the following four corpora:

Early Greek Epic. This corpus includes Homer, Hesiod, and the Homeric Hymns in the original Greek, with English and/or German translations for every text except the Shield of Herakles.
Chaucer. We have all the works of Chaucer, including all of The Canterbury Tales.
Spenser. We have all of the poetical works of Spenser, including The Faerie Queene.
Shakespeare. We have all the works of Shakespeare, including all of his plays and poems.

WordHoard supports the following categories of tagging data:

Morphological
Morphological tagging includes full lemmatization with parts of speech, word classes, and spellings. Every word is tagged with its lemma, part of speech, word class, major word class, and spelling.

Lemmas distinguish between word classes, with the word class indicated in parentheses following the lemma name. For example, the verb "love (v)" is a separate lemma from the noun "love (n)".

WordHoard distinguishes homonyms by appending a homonym number in parentheses to the lemma name. For example, there are two verbs "lie." The first means "to recline" and is indicated in WordHoard by "lie (v) (1)". The second means "to tell a falsehood" and is indicated by "lie (v) (2)". Texts in corpora other than the WordHoard defaults may not have homonyms distinguished.
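The lemma notation described above can be taken apart mechanically. The following is a minimal sketch, assuming lemma tags always look like the examples in this section ("love (v)", "lie (v) (1)"). The names `LemmaTag` and `parse_lemma` are hypothetical and are not part of WordHoard.

```python
import re
from typing import NamedTuple, Optional


class LemmaTag(NamedTuple):
    """Hypothetical container for the pieces of a lemma tag."""
    headword: str
    word_class: str
    homonym: Optional[int]  # None when the tag carries no homonym number


# Headword, then "(word class)", then an optional "(homonym number)".
# The word class must not start with a digit, so "(1)" is never read as a class.
_LEMMA_RE = re.compile(
    r"^(?P<headword>.+?)\s*"
    r"\((?P<word_class>[^()\d][^()]*)\)"
    r"(?:\s*\((?P<homonym>\d+)\))?$"
)


def parse_lemma(text: str) -> LemmaTag:
    """Split a tag such as 'lie (v) (1)' into headword, word class, and homonym number."""
    match = _LEMMA_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a lemma tag: {text!r}")
    homonym = match.group("homonym")
    return LemmaTag(
        headword=match.group("headword"),
        word_class=match.group("word_class"),
        homonym=int(homonym) if homonym is not None else None,
    )
```

With this sketch, `parse_lemma("love (v)")` and `parse_lemma("love (n)")` differ only in their word class, while the two verbs "lie" differ only in their homonym number.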

For Shakespeare, the morphological tagging handles contractions properly. For example, the first word of Hamlet, "who's," is tagged as having two word parts, each with its own lemma and part of speech. A search for the pronoun "who" finds this word occurrence, as does a search for the primary verb "be." The other corpora may not have this kind of tagging. Where they do not, contractions and other compound words are treated as having a single part, with a single lemma and part of speech.
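The effect of multi-part tagging on searches can be illustrated with a small data model. This is only a sketch under stated assumptions: the classes `WordPart` and `WordOccurrence`, the function `find_by_lemma`, and the part-of-speech labels are all hypothetical placeholders, not WordHoard's actual object model or tag names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WordPart:
    lemma: str            # placeholder lemma name
    part_of_speech: str   # placeholder label, not a real WordHoard tag


@dataclass(frozen=True)
class WordOccurrence:
    spelling: str
    parts: Tuple[WordPart, ...]  # one part for ordinary words, several for contractions


def find_by_lemma(words: List[WordOccurrence], lemma: str) -> List[WordOccurrence]:
    """Return every occurrence in which at least one word part has the given lemma."""
    return [w for w in words if any(p.lemma == lemma for p in w.parts)]


# The contraction carries two parts; the following word carries one.
opening = [
    WordOccurrence("who's", (WordPart("who", "pronoun"), WordPart("be", "verb"))),
    WordOccurrence("there", (WordPart("there", "adverb"),)),
]
```

Here both `find_by_lemma(opening, "who")` and `find_by_lemma(opening, "be")` return the occurrence spelled "who's". If the contraction were stored as a single part, it could match at most one lemma.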
Narrative
Each word is tagged with a flag that tells whether the word appears in narration or in a speech. For words in speeches in Shakespeare and the Early Greek Epic corpus, we also have tags for the speaker name, speaker gender, and speaker mortality.
- By convention, words that appear in narration are considered to have been "spoken by the narrator or poet."
- There are three gender values:
  - Male.
  - Female.
  - Uncertain, mixed or unknown.
- There are three mortality values:
  - Mortal.
  - Immortal or supernatural.
  - Unknown or other.
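The narrative tags listed above map naturally onto enumerated types. The sketch below is an assumption-laden illustration: the names `Gender`, `Mortality`, `TaggedWord`, and `count_by_mortality` are hypothetical, and the sample words are hand-made placeholders rather than data taken from WordHoard.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Gender(Enum):
    MALE = "male"
    FEMALE = "female"
    UNCERTAIN = "uncertain, mixed or unknown"


class Mortality(Enum):
    MORTAL = "mortal"
    IMMORTAL = "immortal or supernatural"
    UNKNOWN = "unknown or other"


@dataclass(frozen=True)
class TaggedWord:
    spelling: str
    in_speech: bool                      # False means the word appears in narration
    speaker: Optional[str] = None        # speaker tags exist only for words in speeches
    gender: Optional[Gender] = None
    mortality: Optional[Mortality] = None

    def speaker_label(self) -> str:
        """Apply the convention that narration is spoken by the narrator or poet."""
        if self.in_speech and self.speaker:
            return self.speaker
        return "narrator or poet"


def count_by_mortality(words: Iterable[TaggedWord]) -> Counter:
    """Count words in speeches by the mortality of their speaker."""
    return Counter(
        w.mortality for w in words if w.in_speech and w.mortality is not None
    )


# Hand-made placeholder words, for illustration only.
sample = [
    TaggedWord("word1", in_speech=False),
    TaggedWord("word2", True, "Athena", Gender.FEMALE, Mortality.IMMORTAL),
    TaggedWord("word3", True, "Hektor", Gender.MALE, Mortality.MORTAL),
    TaggedWord("word4", True, "Hektor", Gender.MALE, Mortality.MORTAL),
]
```

Calling `count_by_mortality(sample)` counts only the three words in speeches, and `sample[0].speaker_label()` falls back to the narrator convention.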
Prosodic
Each word is tagged with a flag that tells whether the word appears in prose or verse.
Metrical
Each word in the Early Greek Epic corpus is tagged with its metrical shape.
Dates
When known, each work is tagged with its publication date or date range, and each author is tagged with birth/death dates and earliest/latest publication dates.
Benson Glosses
The Chaucer corpus is tagged with Professor Larry Benson's word glosses (definitions) and with his lemmatization and part of speech taxonomies (in addition to our own). The Benson glosses are only displayed. They cannot be used as criteria for searches or calculations.

All of the corpora include the basic morphological tagging data. The other tagging categories are optional. Some of the corpora support them, and some do not.
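Because support varies by corpus, a tool built on this data would need to check what is available before using a tagging category. The sketch below records only the corpus-specific statements made explicitly in this section. It is not a complete support matrix, and the names `EXPLICIT_SUPPORT` and `categories_stated_for` are hypothetical.

```python
from typing import Dict, List, Set

ALL_CORPORA = ("Early Greek Epic", "Chaucer", "Spenser", "Shakespeare")

# Only what this section states outright. Categories whose per-corpus
# coverage is not spelled out here (for example prosodic tagging) are omitted.
EXPLICIT_SUPPORT: Dict[str, Set[str]] = {
    "morphological": set(ALL_CORPORA),                   # all corpora
    "speaker_tags": {"Shakespeare", "Early Greek Epic"}, # name, gender, mortality
    "metrical": {"Early Greek Epic"},
    "benson_glosses": {"Chaucer"},
}


def categories_stated_for(corpus: str) -> List[str]:
    """List the categories this section explicitly attributes to a corpus."""
    if corpus not in ALL_CORPORA:
        raise KeyError(f"unknown corpus: {corpus!r}")
    return sorted(
        category
        for category, corpora in EXPLICIT_SUPPORT.items()
        if corpus in corpora
    )
```

An absent entry here means only that this section does not mention the combination, not that the corpus lacks the tagging.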
