This repository contains scripts for a collation processing workflow and its evaluation.
```
# Recommended steps (use virtualenv)
virtualenv env -p python3.8
source env/bin/activate
```

Install the dependencies:

```
pip install -r requirements.txt
```

Models will be installed automatically if they are not already present. You can get a list of the available models with:

```
pie-extended list
```

```
# Lemmatise raw (txt) files for later collation
python3 main.py <path> [--lemmatise] [--lang] [--engine]

# Collate annotated files in XML
# containing (possibly human-corrected) linguistic information
python3 main.py <path> [--collate]

# Assign categories (graphematic, flexional, morphosyntactic, lexical) to the variation sites
python3 main.py <path> [--categorise]

# Or, alternatively, do it all in one go
python3 main.py <path> [--lemmatise] [--lang] [--engine] [--collate] [--categorise]
```

To evaluate the results:

```
python eval.py <path_to_gt_xml> <path_to_results_xml> [--print_diff]
```

For a simple collation from the txt sources, without preprocessing:

```
python main.py <path> [--simple]
```

More information about usage, together with examples, is available below.
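The evaluation essentially compares the ground-truth annotation with the produced one. The sketch below is a hypothetical illustration of what such a comparison can compute (it is not eval.py's actual logic), assuming both files store the category in @ana on <app> elements, in the same order:

```python
import xml.etree.ElementTree as ET

def category_accuracy(gt_xml: str, res_xml: str) -> float:
    """Fraction of <app> sites whose @ana category matches the ground truth.

    Hypothetical sketch: assumes both documents contain the same variation
    sites in the same order, so the <app> elements can be zipped together.
    """
    gt = [a.get("ana") for a in ET.fromstring(gt_xml).iter("app")]
    res = [a.get("ana") for a in ET.fromstring(res_xml).iter("app")]
    hits = sum(g == r for g, r in zip(gt, res))
    return hits / max(len(gt), 1)

gt = '<body><app ana="graphematic"/><app ana="lexical"/></body>'
res = '<body><app ana="graphematic"/><app ana="flexional"/></body>'
print(category_accuracy(gt, res))  # 0.5
```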
If you want to use annotated XML files directly, they must be in TEI and contain <w> tags with a @lemma attribute, and possibly @pos and @msd attributes:

```
<w
lemma="mëisme"
pos="ADJind"
msd="NOMB.=s|GENRE=m|CAS=r"
>meisme</w>
```

Alternatively, a single @type attribute can be used:

```
<w
lemma="mëisme"
type="ADJind|NOMB.=s|GENRE=m|CAS=r"
>meisme</w>
```
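Both attribute conventions carry the same information. As a minimal sketch (the helper below is hypothetical, not part of the repository, and TEI namespaces are omitted for brevity), the two shapes can be normalised to the same (form, lemma, pos, msd) tuple with the standard library:

```python
import xml.etree.ElementTree as ET

def read_word(w):
    """Normalise a TEI <w> element to a (form, lemma, pos, msd) tuple.

    Accepts either separate @pos/@msd attributes or a combined
    @type attribute of the form "POS|msd".
    """
    lemma = w.get("lemma")
    pos, msd = w.get("pos"), w.get("msd")
    if pos is None and w.get("type"):
        # @type packs the POS tag and the morphosyntactic description together
        pos, _, msd = w.get("type").partition("|")
    return (w.text, lemma, pos, msd)

w = ET.fromstring('<w lemma="mëisme" type="ADJind|NOMB.=s|GENRE=m|CAS=r">meisme</w>')
print(read_word(w))  # ('meisme', 'mëisme', 'ADJind', 'NOMB.=s|GENRE=m|CAS=r')
```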
```
# Lemmatise raw (txt) files for later collation
python3 main.py <path> [--lemmatise] [--lang] [--engine]

# Example
python3 main.py data/input --lemmatise --lang fro --engine pie
```

This step takes txt files and produces annotated XML files, which are saved in the directory lemmat. Attention: when you lemmatise different sources, specify a different output directory for each run, so that the results are stored separately; the default path is out, which will produce out/lemmat.

The only engine currently available is pie, via pie-extended. For a list of the available models, type:
```
pie-extended list
```

```
# Collate annotated files in XML
# containing (possibly human-corrected) linguistic information
python3 main.py <path> [--collate]

# Example
python3 main.py data/input --collate
```

This step takes XML files and collates them. The results are saved in XML and in txt (as a table) in the directory coll (the default path is out, which will produce out/coll).
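At its core, collation is an alignment of the witnesses' token sequences, with disagreements wrapped in <app>/<rdg> elements. The sketch below is a two-witness illustration using difflib, not the repository's actual algorithm (which also carries the linguistic annotation along):

```python
import difflib

def collate(wit_a, wit_b):
    """Align two token lists and mark disagreements as <app> sites.

    Returns a list of strings: plain tokens where the witnesses agree,
    and <app><rdg>...</rdg></app> blocks where they differ.
    Illustrative two-witness version; real collation handles n witnesses.
    """
    out = []
    sm = difflib.SequenceMatcher(a=wit_a, b=wit_b, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            out.extend(wit_a[i1:i2])
        else:
            rdg_a = " ".join(wit_a[i1:i2])
            rdg_b = " ".join(wit_b[j1:j2])
            out.append(f'<app><rdg wit="#A">{rdg_a}</rdg><rdg wit="#B">{rdg_b}</rdg></app>')
    return out

print(collate("li rois est venuz".split(), "li roi est venu".split()))
```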
Before collating, you might want to correct the XML generated by the previous step. To avoid overwriting, move the XML files to a new directory before editing them, and adjust the path accordingly before launching the command.
```
# Assign categories to the variation sites
python3 main.py <path> [--categorise]

# Example
python3 main.py out/coll/out.xml --categorise
```

This step takes the XML result of the collation and assigns a category to each variation site.
The linguistic information on each <rdg> inside the <app> is used to assign the category: for example, if the <rdg>s share the same values of @lemma, @pos and @msd, the variation is graphematic. The category is stored in the @ana attribute on the <app>. The currently supported categories are graphematic, flexional, morphosyntactic and lexical.
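The rule for the graphematic case can be sketched as a cascade of comparisons. The ordering of the remaining three rules below is an assumption for illustration (same lemma and POS but different msd as flexional, same lemma but different POS as morphosyntactic, different lemmas as lexical); it is not necessarily the repository's exact logic:

```python
def categorise(rdgs):
    """Assign a category to a variation site from its readings.

    Each reading is a dict with "lemma", "pos" and "msd" keys.
    Same lemma, pos and msd -> graphematic (spelling only)
    Same lemma and pos      -> flexional (inflection differs)
    Same lemma              -> morphosyntactic
    Different lemmas        -> lexical
    """
    def same(key):
        # True when all readings agree on this attribute
        return len({r.get(key) for r in rdgs}) == 1

    if same("lemma"):
        if same("pos"):
            return "graphematic" if same("msd") else "flexional"
        return "morphosyntactic"
    return "lexical"

print(categorise([
    {"lemma": "mëisme", "pos": "ADJind", "msd": "NOMB.=s|GENRE=m|CAS=r"},
    {"lemma": "mëisme", "pos": "ADJind", "msd": "NOMB.=s|GENRE=m|CAS=r"},
]))  # graphematic (the readings agree on all linguistic values)
```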
```
python3 main.py <path> [--lemmatise] [--lang] [--engine] [--collate] [--categorise]

# Example
python3 main.py data/input --lemmatise --lang fro --engine pie --collate --categorise
```