Introduction for Text Developers

WordHoard's texts and tagging data are defined in XML files in the data directory, which is organized as follows:

  • corpora.xml: Defines the corpora.
  • authors.xml: Defines the authors.
  • word-classes.xml: Defines the word classes.
  • pos.xml: Defines the parts of speech.
  • works: Work definition files.
    • cha: Chaucer work definition files.
      • aaa.xml: Work definition file for Anelida and Arcite.
      • ... more Chaucer work definition files.
    • ege: Early Greek Epic work definition files.
      • HH.xml: Work definition file for Homeric Hymns.
      • ... more Early Greek Epic work definition files.
    • sha: Shakespeare work definition files.
      • 1h4.xml: Work definition file for The First Part of King Henry the Fourth.
      • ... more Shakespeare work definition files.
    • spe: Spenser work definition files.
      • faq.xml: Work definition file for The Faerie Queene.
      • ... more Spenser work definition files.
  • spellings: Standard spelling definition files.
    • nu-spellings.xml: Standard spelling definitions for the three NU English language corpora.
  • benson-glosses.xml: Professor Larry Benson's Chaucer glosses definition file.
  • translations: Translation files.
    • ege: Early Greek Epic translation files.
      • english: English translations.
        • HH.xml: English translation of Homeric Hymns.
        • ... more Early Greek Epic English translations.
      • german: German translations.
        • IL.xml: German translation of The Iliad.
        • ... more Early Greek Epic German translations.
      • metrical: Metrical transcriptions.
        • HH.xml: Metrical transcription of Homeric Hymns.
        • ... more Early Greek Epic metrical transcriptions.
      • roman: Roman transliterations.
        • HH.xml: Roman transliteration of Homeric Hymns.
        • ... more Early Greek Epic Roman transliterations.
  • annotations: Static annotation files.
    • ek.xml: The E. K. annotations for Spenser's Shepheardes Calender.
    • iliad-scholia.xml: The Iliad Scholia.
  • work-sets.xml: Defines the system work sets.

All of WordHoard's XML files use the Unicode character set with UTF-8 encoding.
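Before a long build, it can be worth confirming that an edited file really is UTF-8. The following is a minimal Python sketch of such a check, not part of WordHoard; the function names is_valid_utf8 and find_non_utf8_files are hypothetical, and the data directory path is whatever you pass in.

```python
from pathlib import Path
from typing import List


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_non_utf8_files(data_dir: str) -> List[str]:
    """List the .xml files under data_dir whose bytes are not valid UTF-8.

    Hypothetical helper: data_dir is assumed to be the WordHoard data
    directory described above.
    """
    bad = []
    for path in sorted(Path(data_dir).rglob("*.xml")):
        if not is_valid_utf8(path.read_bytes()):
            bad.append(str(path))
    return bad
```

Calling find_non_utf8_files on the data directory returns an empty list when every XML file decodes cleanly.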

To add a new corpus, edit the corpora.xml file and create a new subdirectory of the works directory to hold the works in the corpus.

To add a new author, edit the authors.xml file.

To add a new work, create the XML definition file for the work and place it in the subdirectory of works for its corpus. Edit the corpora.xml file to add the work to the table of contents view(s) for its corpus.

We use the convention that corpus ids are used to name subdirectories of the works directory, and work ids are used to name the work definition files. For example, sha is the id for the Shakespeare corpus, and ham is the id for Hamlet, so the work definition file for Hamlet is located at works/sha/ham.xml.
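The naming convention can be written down as a small function. This is an illustrative Python sketch only; work_definition_path is a hypothetical name, not a WordHoard utility, and the path is relative to the data directory.

```python
from pathlib import PurePosixPath


def work_definition_path(corpus_id: str, work_id: str) -> str:
    """Build the conventional location of a work definition file.

    The corpus id names the subdirectory of works, and the work id
    names the XML file inside it.
    """
    return str(PurePosixPath("works") / corpus_id / f"{work_id}.xml")


# Hamlet (ham) in the Shakespeare corpus (sha):
print(work_definition_path("sha", "ham"))  # works/sha/ham.xml
```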

Whenever you change the raw data files, you must rebuild the static object model. Use the full-build alias:

% full-build

This alias runs the script at scripts/full-build.csh. This script takes a long time to run. We usually do full builds late at night, when we are done with our other work for the day.

The script first creates a new empty wordhoard database. It then runs a series of build tools which read the raw data XML files and fully populate the tables in the wordhoard database.

The full-build script calls the following helper scripts in order:

  1. create-client-database.csh (cdb)
  2. build-corpora.csh (bco)
  3. build-authors.csh (bau)
  4. build-word-classes.csh (bwc)
  5. build-pos.csh (bpo)
  6. build-benson-glosses.csh (bbg)
  7. build-all-works.csh (baw)
  8. build-annotations (ban)
  9. build-all-translations (bat)
  10. calculate-counts.csh (cc)
  11. build-work-sets.csh (bws)
  12. analyze-tables.csh (atb)

Steps 7 and 10 (build-all-works and calculate-counts) take a long time. The other steps are fast.

The short names in parentheses are aliases that can be used to run each helper script individually. The helpers must be run in order.

The full-build script generates a detailed report on stdout, which the alias redirects to the file misc/full-build.txt. The report contains error messages in the form "##### Message".
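Because the report is long, it helps to pull out just the error lines once the build finishes. The sketch below is a hypothetical Python helper, not part of the WordHoard scripts. It assumes each error message sits on its own line and that the line begins with the "#####" marker; adjust the matching if your report differs.

```python
from typing import List

ERROR_MARKER = "#####"


def extract_build_errors(report_text: str) -> List[str]:
    """Return the messages from all lines of the form '##### Message'."""
    errors = []
    for line in report_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(ERROR_MARKER):
            # Drop the marker and surrounding whitespace, keep the message.
            errors.append(stripped[len(ERROR_MARKER):].strip())
    return errors


def read_build_errors(report_path: str = "misc/full-build.txt") -> List[str]:
    """Read the build report from disk and extract its error messages."""
    with open(report_path, encoding="utf-8", errors="replace") as handle:
        return extract_build_errors(handle.read())
```

An empty list from read_build_errors means no lines carrying the marker were found in the report.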

When we do a full build, we often open a second terminal window and execute the following command to monitor the progress of the build:

% tail -f misc/full-build.txt

The alias bini runs helpers 1 through 6 and is useful during development to quickly generate a new database with no works:

alias bini "cdb;bco;bau;bwc;bpo;bbg"
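The relationship between the ordered helper list and an alias such as bini can be made explicit. The following Python sketch is illustrative only: it records the helper names and aliases exactly as listed above, and the function alias_chain is a hypothetical name, not something WordHoard provides.

```python
# Helper scripts and their aliases, in the order full-build runs them.
HELPERS = [
    ("create-client-database.csh", "cdb"),
    ("build-corpora.csh", "bco"),
    ("build-authors.csh", "bau"),
    ("build-word-classes.csh", "bwc"),
    ("build-pos.csh", "bpo"),
    ("build-benson-glosses.csh", "bbg"),
    ("build-all-works.csh", "baw"),
    ("build-annotations", "ban"),
    ("build-all-translations", "bat"),
    ("calculate-counts.csh", "cc"),
    ("build-work-sets.csh", "bws"),
    ("analyze-tables.csh", "atb"),
]


def alias_chain(first: int, last: int) -> str:
    """Join the aliases for steps first..last (1-based, inclusive) with ';'.

    Taking a contiguous slice keeps the helpers in their required order.
    """
    if not 1 <= first <= last <= len(HELPERS):
        raise ValueError("step numbers must satisfy 1 <= first <= last <= 12")
    return ";".join(alias for _script, alias in HELPERS[first - 1:last])


# Steps 1 through 6 reproduce the body of the bini alias.
print(alias_chain(1, 6))  # cdb;bco;bau;bwc;bpo;bbg
```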

When we work on new texts, change old texts or tagging data, or make structural changes to the object model, we almost always work with a stripped-down database. For example, a database with just one Shakespeare play is often adequate for testing a new feature or a change to that play's text or tagging data. Building such a small subset is much quicker and easier than working with the full production version.

For example, the following commands create a test database containing just Romeo and Juliet:

% bini
% bw sha roj
% cc

The bw alias in this example runs the build-work script, which builds a single work.

Sometimes, depending on what you're working on, you can get by without running calculate-counts (cc).
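The same recipe generalizes to any small set of works. As a rough Python sketch (the function subset_build_commands is hypothetical and only assembles the command lines; it does not run them), the sequence for a test database looks like this:

```python
from typing import Iterable, List, Tuple


def subset_build_commands(
    works: Iterable[Tuple[str, str]], run_counts: bool = True
) -> List[str]:
    """List the aliases to run for a test database with the given works.

    works is a sequence of (corpus id, work id) pairs. Set run_counts
    to False when the task at hand does not need calculate-counts.
    """
    commands = ["bini"]  # new database with no works
    for corpus_id, work_id in works:
        commands.append(f"bw {corpus_id} {work_id}")  # build one work
    if run_counts:
        commands.append("cc")  # calculate-counts
    return commands


# Romeo and Juliet only, as in the example above:
print(subset_build_commands([("sha", "roj")]))
```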

When you are done testing with such a stripped-down database, you can run a full build overnight to return to the full production database with all the corpora and works.

When you are working on text rendering issues, use the debug parameter with the bw alias. For example:

% bw sha roj debug

This example builds or rebuilds Romeo and Juliet, then runs the WordHoard client and opens the work display window for the new version of the text. With the debug parameter, tagging data is not saved in the database, which makes the build run much faster.

It is much faster to start with a database that contains no works, then build all of them at once using build-all-works (baw), than it is to build the works one at a time using build-work (bw). This is what the full-build script does.

A Note on Terminology

The XML files contain id attributes which assign short string identifiers to WordHoard objects (corpora, works, word classes, parts of speech, lemmas, words, etc.). In the source code for WordHoard, in the MySQL databases, and in the rest of this manual, we use the term "tag" for this concept, to avoid an unfortunate conflict with Hibernate's use of the term "id." It's important not to confuse these concepts.

For example:

% mysql
mysql> use wordhoard;
mysql> describe corpus;
+-------------------+--------------+------+-----+---------+----------------+
| Field             | Type         | Null | Key | Default | Extra          |
+-------------------+--------------+------+-----+---------+----------------+
| id                | bigint(20)   | NO   | PRI | NULL    | auto_increment |
| tag               | varchar(255) | YES  | MUL | NULL    |                |
| title             | varchar(255) | YES  |     | NULL    |                |
| charset           | tinyint(4)   | YES  |     | NULL    |                |
| posType           | tinyint(4)   | YES  |     | NULL    |                |
| taggingData_flags | bigint(20)   | YES  |     | NULL    |                |
| numWorkParts      | int(11)      | YES  |     | NULL    |                |
| numLines          | int(11)      | YES  |     | NULL    |                |
| numWords          | int(11)      | YES  |     | NULL    |                |
| maxWordPathLength | int(11)      | YES  |     | NULL    |                |
| translations      | varchar(255) | YES  |     | NULL    |                |
+-------------------+--------------+------+-----+---------+----------------+
11 rows in set (0.04 sec)
mysql> select id, tag, title from corpus;
+----+------+------------------+
| id | tag  | title            |
+----+------+------------------+
|  1 | ege  | Early Greek Epic |
|  2 | cha  | Chaucer          |
|  3 | spe  | Spenser          |
|  4 | sha  | Shakespeare      |
+----+------+------------------+
4 rows in set (0.00 sec)

The corpus table contains a column named id of type bigint(20), which is the unique Hibernate id for the corpus object. It also contains a column named tag of type varchar(255), which is the human-readable identifier for the corpus object. The tag column is the one that corresponds to the XML id attribute on the corpus elements in the corpora.xml definition file.

In this section of the user manual we mostly use the term "id," since we are discussing the XML files. Do not confuse this with the Hibernate id.
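To make the distinction concrete, the sketch below looks up a Hibernate id from a tag. It is an illustration under stated assumptions: it uses an in-memory SQLite table as a stand-in for the MySQL wordhoard database, keeps only the id, tag, and title columns, and the function names are hypothetical.

```python
import sqlite3


def make_demo_corpus_table() -> sqlite3.Connection:
    """Create an in-memory stand-in for the corpus table (three columns only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE corpus (id INTEGER PRIMARY KEY, tag TEXT, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO corpus (id, tag, title) VALUES (?, ?, ?)",
        [
            (1, "ege", "Early Greek Epic"),
            (2, "cha", "Chaucer"),
            (3, "spe", "Spenser"),
            (4, "sha", "Shakespeare"),
        ],
    )
    return conn


def hibernate_id_for_tag(conn: sqlite3.Connection, tag: str):
    """Map a tag (the XML id attribute) to the numeric Hibernate id."""
    row = conn.execute("SELECT id FROM corpus WHERE tag = ?", (tag,)).fetchone()
    return None if row is None else row[0]


conn = make_demo_corpus_table()
print(hibernate_id_for_tag(conn, "sha"))  # 4
```

The XML id "sha" is stored in the tag column, while the value returned here is the database-generated Hibernate id.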

Temporary Files

The various build scripts and programs use a number of temporary files. These files are created inside a subdirectory named temp of the WordHoard development directory.



