This document describes how to run and evaluate our lexicon-driven model, implemented as an adaptation of MorphTagger.
For this purpose, we use the ambiguous file ($AMBIG) and the unambiguous file ($UNAMBIG) created for Arabic or Hebrew. Note that the Hebrew data, unlike the Arabic data, is freely available and can be downloaded through our site.
You will need to install SRILM. For further details, see Bar Haim's site for MorphTagger.
An updated version of MorphTagger can be found here. The update improves the handling of punctuation and splits training and testing into separate scripts. The package also includes some additional scripts.
Assuming we have an ambiguous test file called "test" and an unambiguous training file called "learn", we first create a model and then tag the "test" file using that model.
Create Model
./MTLearn.pl -dir . learn
This creates the model files corpus.lm and corpus.lex.prob in the current directory.
Test Model
./MTTest.pl -dir . test
The tagging is written to "tagging-test". To use the gold segmentation instead, run:
./MTTest.pl -dir . -gold goldseg tagsfile test
"tagsfile" lists the possible tags for words for which the morphological analyzer offers no correct segmentation. Testing then uses the gold segmentation, and the tagging is written to "gold-tagging-test". You first need to create the goldseg file from the "gold" file, which contains the sentences of "test" with their correct analyses. To do so, run:
./RemoveTags.pl < gold > goldseg
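Because goldseg has to exist before MTTest.pl is called with -gold, the two commands above are run in the reverse of the order in which they are listed. A minimal shell sketch of that ordering follows. The function names `mt_missing` and `mt_gold_test_commands` are hypothetical helpers, not part of the MorphTagger package, and the idea that the model files from the training step are looked up in the current directory is an assumption based on the `-dir .` option.

```shell
# Hypothetical helper: print every file name given as an argument
# that does not exist in the current directory.
mt_missing() {
    for f in "$@"; do
        [ -f "$f" ] || echo "$f"
    done
}

# Hypothetical helper: print the gold-segmentation commands in the order
# they need to run (goldseg is created first, then passed to MTTest.pl).
mt_gold_test_commands() {
    cat <<'EOF'
./RemoveTags.pl < gold > goldseg
./MTTest.pl -dir . -gold goldseg tagsfile test
EOF
}
```

For example, `mt_missing gold tagsfile test corpus.lm corpus.lex.prob` lists the inputs that are still absent. Once it prints nothing, `mt_gold_test_commands | sh -e` would run both steps and stop at the first failure.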
Evaluate
./MTEval.pl gold-MTout tagging-test empty gold
The output is written to "tagging-test.(err,eval,erranal)", that is, to "tagging-test.err", "tagging-test.eval", and "tagging-test.erranal". "empty" is an empty file, and gold-MTout is the gold file in MorphTagger output format. To obtain it, run:
./MTunambig-to-MTout.pl < gold > gold-MTout
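Put together, the training, tagging, and evaluation steps for the run without gold segmentation could be sketched as the sequence below. It uses only the commands and file names given above ("learn", "test", "gold", "empty"). The wrapper function `mt_pipeline_commands` is a hypothetical convenience that only prints the commands, and `: > empty` is simply one way to create the empty file.

```shell
# Hypothetical helper: print the full learn / test / evaluate sequence.
# Nothing is executed here; pipe the output to a shell to run it.
mt_pipeline_commands() {
    cat <<'EOF'
./MTLearn.pl -dir . learn
./MTTest.pl -dir . test
./MTunambig-to-MTout.pl < gold > gold-MTout
: > empty
./MTEval.pl gold-MTout tagging-test empty gold
EOF
}
```

Running `mt_pipeline_commands | sh -e` from the MorphTagger directory would execute the steps in order and abort on the first error. Printing first makes it easy to inspect the sequence before anything is overwritten.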
We created a script that runs n-fold cross-validation (n is set to 10) over the given data. The script needs an ambiguous file and an unambiguous file. To run it, type:
./MTCross.pl -no_func -h -dir cv $AMBIG $UNAMBIG
To use the gold segmentation, add the "-gold" parameter and "-tags tagsfile". For additional information about the parameters, type "./MTCross.pl".
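As a rough sketch, a small wrapper can switch between the two cross-validation modes. The function name `mt_cross` and the `MTCROSS` override variable are hypothetical, and placing `-gold -tags tagsfile` before the two file arguments is an assumption. Check the usage message printed by "./MTCross.pl" before relying on it.

```shell
# Hypothetical wrapper around MTCross.pl.
# Usage: mt_cross AMBIG UNAMBIG [TAGSFILE]
# With a third argument, the gold-segmentation options are added
# (option placement is assumed, not confirmed by the script's usage text).
mt_cross() {
    ambig=$1
    unambig=$2
    tags=${3:-}
    if [ -n "$tags" ]; then
        "${MTCROSS:-./MTCross.pl}" -no_func -h -dir cv -gold -tags "$tags" "$ambig" "$unambig"
    else
        "${MTCROSS:-./MTCross.pl}" -no_func -h -dir cv "$ambig" "$unambig"
    fi
}
```

Calling `MTCROSS=echo mt_cross "$AMBIG" "$UNAMBIG" tagsfile` prints the argument list instead of running the script, which is a cheap way to check the command line first.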
Using the gold segmentation
| Fold# | Token accuracy | Morph Accuracy |
|---|---|---|
| 0 | 96.09 | 96.5 |
| 1 | 96.92 | 97.22 |
| 2 | 96.92 | 97.23 |
| 3 | 96.9 | 97.19 |
| 4 | 96.64 | 96.99 |
| 5 | 96.76 | 97.09 |
| 6 | 97.18 | 97.49 |
| 7 | 96.28 | 96.68 |
| 8 | 96.61 | 96.92 |
| 9 | 96.91 | 97.22 |
| AVG | 96.72 | 97.05 |
| STDEV | 0.33 | 0.29 |
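The AVG and STDEV rows can be recomputed from the per-fold values. The sketch below assumes that STDEV is the sample standard deviation (denominator n - 1), which is consistent with the token accuracy column of the table above. `mean_stdev` is a hypothetical helper name.

```shell
# Hypothetical helper: print the mean and the sample standard deviation
# (n - 1 denominator) of the numbers given as arguments, rounded to two decimals.
mean_stdev() {
    printf '%s\n' "$@" | awk '
        { x[NR] = $1; sum += $1 }
        END {
            mean = sum / NR
            for (i = 1; i <= NR; i++) ss += (x[i] - mean) * (x[i] - mean)
            printf "%.2f %.2f\n", mean, sqrt(ss / (NR - 1))
        }'
}

# Token accuracy per fold, copied from the table above.
mean_stdev 96.09 96.92 96.92 96.9 96.64 96.76 97.18 96.28 96.61 96.91
# prints: 96.72 0.33
```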
Not using the gold segmentation
| Fold# | Tok Tag Acc | Tok Seg Acc | Morph Tag Rec | Morph Tag Pre | Morph Tag F1 | Morph Seg Rec | Morph Seg Pre | Morph Seg F1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 95.42 | 99.24 | 95.71 | 95.6 | 95.66 | 99.11 | 99 | 99.05 |
| 1 | 96.23 | 99.37 | 96.47 | 96.38 | 96.43 | 99.29 | 99.2 | 99.25 |
| 2 | 96.39 | 99.52 | 96.67 | 96.56 | 96.62 | 99.47 | 99.36 | 99.42 |
| 3 | 96.5 | 99.54 | 96.75 | 96.7 | 96.73 | 99.46 | 99.41 | 99.44 |
| 4 | 96.05 | 99.61 | 96.38 | 96.35 | 96.37 | 99.55 | 99.51 | 99.53 |
| 5 | 96.39 | 99.56 | 96.69 | 96.59 | 96.64 | 99.53 | 99.43 | 99.48 |
| 6 | 96.57 | 99.14 | 96.69 | 96.66 | 96.67 | 98.98 | 98.94 | 98.96 |
| 7 | 95.96 | 99.53 | 96.27 | 96.26 | 96.27 | 99.44 | 99.43 | 99.44 |
| 8 | 96.2 | 99.45 | 96.4 | 96.4 | 96.4 | 99.31 | 99.32 | 99.32 |
| 9 | 96.31 | 99.2 | 96.36 | 96.39 | 96.38 | 98.94 | 98.98 | 98.96 |
| AVG | 96.20 | 99.42 | 96.44 | 96.39 | 96.42 | 99.31 | 99.26 | 99.29 |
| STDEV | 0.33 | 0.17 | 0.31 | 0.31 | 0.31 | 0.23 | 0.21 | 0.22 |
Using the gold segmentation
| Fold# | Token accuracy | Morph Accuracy |
|---|---|---|
| 0 | 93.55 | 95.12 |
| 1 | 93.61 | 95 |
| 2 | 92.07 | 93.84 |
| 3 | 92.62 | 94.26 |
| 4 | 92.63 | 94.15 |
| 5 | 93.12 | 94.62 |
| 6 | 93.41 | 94.95 |
| 7 | 93.34 | 94.76 |
| 8 | 94.01 | 95.39 |
| 9 | 93.83 | 95.23 |
| AVG | 93.22 | 94.73 |
| STDEV | 0.61 | 0.51 |
Not using the gold segmentation
| Fold# | Tok Tag Acc | Tok Seg Acc | Morph Tag Rec | Morph Tag Pre | Morph Tag F1 | Morph Seg Rec | Morph Seg Pre | Morph Seg F1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 91.53 | 96.71 | 93.17 | 93.49 | 93.33 | 97.23 | 97.56 | 97.4 |
| 1 | 91.25 | 96.26 | 92.37 | 93.07 | 92.72 | 96.57 | 97.31 | 96.94 |
| 2 | 89.68 | 96.16 | 91.4 | 92.14 | 91.77 | 96.56 | 97.35 | 96.96 |
| 3 | 90.94 | 96.44 | 92.36 | 93.09 | 92.73 | 96.68 | 97.44 | 97.06 |
| 4 | 90.53 | 96.18 | 92.22 | 92.55 | 92.39 | 96.78 | 97.12 | 96.95 |
| 5 | 91.17 | 96.65 | 92.43 | 93.34 | 92.88 | 96.78 | 97.73 | 97.25 |
| 6 | 90.73 | 96.41 | 92.57 | 93.26 | 92.91 | 97.07 | 97.81 | 97.44 |
| 7 | 91.27 | 96.8 | 92.57 | 93.34 | 92.95 | 96.95 | 97.76 | 97.35 |
| 8 | 92.02 | 96.76 | 93.59 | 93.87 | 93.73 | 97.42 | 97.71 | 97.57 |
| 9 | 91.78 | 96.64 | 93.26 | 93.78 | 93.52 | 97.09 | 97.62 | 97.35 |
| AVG | 91.09 | 96.50 | 92.59 | 93.19 | 92.89 | 96.91 | 97.54 | 97.23 |
| STDEV | 0.67 | 0.24 | 0.62 | 0.52 | 0.56 | 0.28 | 0.22 | 0.23 |
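The F1 columns are read here as the usual harmonic mean of recall and precision, F1 = 2 * Rec * Pre / (Rec + Pre). That reading is an assumption, since the page does not define the columns. The hypothetical helper `f1` below recomputes it from the rounded table entries. Because the inputs are already rounded, a recomputed value can differ from the tabulated one in the last digit for some folds.

```shell
# Hypothetical helper: harmonic mean of recall ($1) and precision ($2),
# rounded to two decimals.
f1() {
    awk -v r="$1" -v p="$2" 'BEGIN { printf "%.2f\n", 2 * r * p / (r + p) }'
}

# Morph Tag Rec and Pre for fold 0 of the table above.
f1 93.17 93.49
# prints: 93.33
```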