WordHoard's texts and tagging data are defined in XML files in the data
directory, which is organized as follows:
corpora.xml
: Defines the corpora.
authors.xml
: Defines the authors.
word-classes.xml
: Defines the word classes.
pos.xml
: Defines the parts of speech.
works
: Work definition files.
    cha
    : Chaucer work definition files.
        aaa.xml
        : Work definition file for Anelida and Arcite.
    ege
    : Early Greek Epic work definition files.
        HH.xml
        : Work definition file for Homeric Hymns.
    sha
    : Shakespeare work definition files.
        1h4.xml
        : Work definition file for The First Part of King Henry the Fourth.
    spe
    : Spenser work definition files.
        faq.xml
        : Work definition file for The Faerie Queene.
spellings
: Standard spelling definition files.
    nu-spellings.xml
    : Standard spelling definitions for the three NU English language corpora.
benson-glosses.xml
: Professor Larry Benson's Chaucer glosses definition file.
translations
: Translation files.
    ege
    : Early Greek Epic translation files.
        english
        : English translations.
            HH.xml
            : English translation of Homeric Hymns.
        german
        : German translations.
            IL.xml
            : German translation of The Iliad.
        metrical
        : Metrical transcriptions.
            HH.xml
            : Metrical transcription of Homeric Hymns.
        roman
        : Roman transliterations.
            HH.xml
            : Roman transliteration of Homeric Hymns.
annotations
: Static annotation files.
    ek.xml
    : The E. K. annotations for Spenser's Shepheardes Calender.
    iliad-scholia.xml
    : The Iliad Scholia.
work-sets.xml
: Defines the system work sets.
All of WordHoard's XML files use the Unicode character set with UTF-8 encoding.
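For example, each of these files can begin with a standard XML declaration naming that encoding (shown here generically; the element content of each file depends on its type):
<?xml version="1.0" encoding="utf-8"?>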
To add a new corpus, edit the corpora.xml file and create a new subdirectory of the works directory to hold the works in the corpus.
To add a new author, edit the authors.xml file.
To add a new work, create the XML definition file for the work and place it in the subdirectory of works for its corpus. Edit the corpora.xml file to add the work to the table of contents view(s) for its corpus.
We use the convention that corpus ids are used to name subdirectories of the works directory, and work ids are used to name the work definition files. For example, sha is the id for the Shakespeare corpus, and ham is the id for Hamlet, so the work definition file for Hamlet is located at works/sha/ham.xml.
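The same convention applies to the other corpora and works; the ids that appear elsewhere in this section give paths like these (listed purely to illustrate the naming pattern):
works/sha/ham.xml   (Shakespeare: Hamlet)
works/sha/roj.xml   (Shakespeare: Romeo and Juliet)
works/cha/aaa.xml   (Chaucer: Anelida and Arcite)
works/spe/faq.xml   (Spenser: The Faerie Queene)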
When you make any changes to the raw data files, you must rebuild the static object model. Use the full-build alias:
% full-build
This alias runs the script at scripts/full-build.csh. This script takes a long time to run. We usually do full builds late at night, when we are done with our other work for the day.
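If you need to recreate the alias in your own shell environment, a minimal csh sketch, assuming the report should be captured in misc/full-build.txt as described below, might look like this (the actual alias used by the WordHoard developers may differ):
alias full-build "csh scripts/full-build.csh >&! misc/full-build.txt"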
The script first creates a new empty wordhoard database. It then runs a series of build tools which read the raw data XML files and fully populate the tables in the wordhoard database.
The full-build script calls the following helper scripts in order:
1. create-client-database.csh (cdb)
2. build-corpora.csh (bco)
3. build-authors.csh (bau)
4. build-word-classes.csh (bwc)
5. build-pos.csh (bpo)
6. build-benson-glosses.csh (bbg)
7. build-all-works.csh (baw)
8. build-annotations (ban)
9. build-all-translations (bat)
10. calculate-counts.csh (cc)
11. build-work-sets.csh (bws)
12. analyze-tables.csh (atb)
Steps 7 and 10 (build-all-works and calculate-counts) take a long time. The other steps are fast.
The short names in parentheses are aliases that can be used to run each helper script individually. The helpers must be run in order.
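A driver script along these lines can be very simple. The following csh fragment is only an illustrative sketch of the ordering, not the actual contents of scripts/full-build.csh, and the scripts/ locations of the helpers are an assumption:
#!/bin/csh
# Hypothetical sketch: invoke the helper scripts in the required order.
csh scripts/create-client-database.csh
csh scripts/build-corpora.csh
csh scripts/build-authors.csh
csh scripts/build-word-classes.csh
csh scripts/build-pos.csh
csh scripts/build-benson-glosses.csh
csh scripts/build-all-works.csh
csh scripts/build-annotations
csh scripts/build-all-translations
csh scripts/calculate-counts.csh
csh scripts/build-work-sets.csh
csh scripts/analyze-tables.csh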
The full-build script generates a detailed report on stdout, which the alias redirects to the file misc/full-build.txt. The report contains error messages in the form "##### Message".
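Since every error message begins with that marker, a quick way to scan a finished report for problems is an ordinary grep, for example:
% grep '#####' misc/full-build.txt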
When we do a full build, we often open a second terminal window and execute the following command to monitor the progress of the build:
% tail -f misc/full-build.txt
The alias bini runs helpers 1 through 6 and is useful during development to quickly generate a new database with no works:
alias bini "cdb;bco;bau;bwc;bpo;bbg"
When we work on new texts, make changes to old texts or tagging data, or work on structural changes to the object model, we almost always work with a stripped-down database. For example, a database with just one Shakespeare play may be quite adequate to test some new feature or a change to the text or tagging data for the play. Building such a small subset of the database is much quicker and easier than trying to work with the full production version.
For example, the following commands create a test database containing just Romeo and Juliet:
% bini
% bw sha roj
% cc
The bw alias in this example runs the build-work script, which builds a single work.
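The alias takes a corpus id followed by a work id, so any single work can be built the same way. For example, to build just Anelida and Arcite from the Chaucer corpus (ids taken from the directory listing above):
% bw cha aaa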
Sometimes, depending on what you're working on, you can get by without running calculate-counts (cc).
When you are done testing some new feature with such a stripped-down database, you can do a full build while you are sleeping to go back to the full production database with all the corpora and works.
When you are working on text rendering issues, use the debug parameter with the bw alias. For example:
% bw sha roj debug
This example builds or rebuilds Romeo and Juliet, then runs the WordHoard client and opens the work display window for the new version of the text. In addition, tagging data is not saved in the database, which makes the build run much faster.
It is much faster to start with a database that contains no works, then build all of them at once using build-all-works (baw), than it is to build the works one at a time using build-work (bw). This is what the full-build script does.
A Note on Terminology
The XML files contain id attributes which assign short string identifiers to WordHoard objects (corpora, works, word classes, parts of speech, lemmas, words, etc.). In the source code for WordHoard, in the MySQL databases, and in the rest of this manual, we use the term "tag" for this concept, to avoid an unfortunate conflict with Hibernate's use of the term "id." It's important not to confuse these concepts.
For example:
% mysql

mysql> use wordhoard;

mysql> describe corpus;
+-------------------+--------------+------+-----+---------+----------------+
| Field             | Type         | Null | Key | Default | Extra          |
+-------------------+--------------+------+-----+---------+----------------+
| id                | bigint(20)   | NO   | PRI | NULL    | auto_increment |
| tag               | varchar(255) | YES  | MUL | NULL    |                |
| title             | varchar(255) | YES  |     | NULL    |                |
| charset           | tinyint(4)   | YES  |     | NULL    |                |
| posType           | tinyint(4)   | YES  |     | NULL    |                |
| taggingData_flags | bigint(20)   | YES  |     | NULL    |                |
| numWorkParts      | int(11)      | YES  |     | NULL    |                |
| numLines          | int(11)      | YES  |     | NULL    |                |
| numWords          | int(11)      | YES  |     | NULL    |                |
| maxWordPathLength | int(11)      | YES  |     | NULL    |                |
| translations      | varchar(255) | YES  |     | NULL    |                |
+-------------------+--------------+------+-----+---------+----------------+
11 rows in set (0.04 sec)

mysql> select id, tag, title from corpus;
+----+------+------------------+
| id | tag  | title            |
+----+------+------------------+
|  1 | ege  | Early Greek Epic |
|  2 | cha  | Chaucer          |
|  3 | spe  | Spenser          |
|  4 | sha  | Shakespeare      |
+----+------+------------------+
4 rows in set (0.00 sec)
The corpus table contains a column named id of type bigint(20) which is the unique Hibernate id for the corpus object. It also contains a column named tag of type varchar(255) which is the human-readable identifier for the corpus object. It is the tag column which corresponds to the XML id attribute on the corpus elements in the corpora.xml definition file.
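For example, to look up a corpus by its human-readable tag rather than by its Hibernate id, a query against the table shown above looks like this:
mysql> select id, tag, title from corpus where tag = 'sha';
Given the contents shown above, this returns the single Shakespeare row, whose Hibernate id happens to be 4.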
In this section of the user manual we mostly use the term "id," since we are discussing the XML files. Do not confuse this with the Hibernate id.
Temporary Files
The various build scripts and programs use a number of temporary files. These files are created inside a subdirectory named temp of the WordHoard development directory.