Intro

This document describes how to run and evaluate our Lexicon-driven model, implemented as an adaptation of MorphTagger.

To this end, we use the ambiguous file ($AMBIG) and the unambiguous file ($UNAMBIG) created for Arabic or Hebrew. Note that the Hebrew data, unlike the Arabic data, is freely available and can be downloaded from our site.

Implementation

You'll need to install SRILM. For further details, see Bar Haim's site for MorphTagger.

An updated version of MorphTagger can be found here. The update improves the handling of punctuation and splits training and testing into separate scripts. The package also includes some additional scripts.

Running MorphTagger

Assuming we have an ambiguous test file called "test", and an unambiguous learn file called "learn", we first create a model, and then tag the "test" file using the model.

Create Model

./MTLearn.pl -dir . learn

This will create the model files corpus.lm and corpus.lex.prob in the current directory.

Test Model

./MTTest.pl -dir . test

The tagging will be written to "tagging-test". To use the gold segmentation, run:

./MTTest.pl -dir . -gold goldseg tagsfile test

"tagsfile" lists the possible tags for words for which the morphological analyzer provides no correct segmentation. Testing will then use the gold segmentation, and the tagging will be written to "gold-tagging-test". You'll need to create the goldseg file from the "gold" file, which contains the sentences of "test" with their correct analyses. To do so, run:

./RemoveTags.pl < gold > goldseg

Evaluate

./MTEval.pl gold-MTout tagging-test empty gold

The output will be written to "tagging-test.err", "tagging-test.eval", and "tagging-test.erranal". "empty" is an empty file, and gold-MTout is the gold file in MorphTagger output format. To obtain it, run:

./MTunambig-to-MTout.pl < gold > gold-MTout
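
The steps above can be chained in a small driver. The following is a minimal sketch in Python and is not part of the MorphTagger package: it only assembles the command lines listed in this section for the run without gold segmentation. The helper names `morphtagger_commands` and `run_all`, the `(argv, stdin, stdout)` tuple layout, and the assumption that the tagger output is always named "tagging-" plus the test file name are illustrative. It also assumes that the scripts live in the current directory and that the files "learn", "test", "gold", and "empty" already exist.

```python
import subprocess
from contextlib import ExitStack


def morphtagger_commands(learn="learn", test="test", gold="gold", empty="empty"):
    """Return the learn, test, and evaluate commands from this section, in order.

    Each entry is (argv, stdin_path, stdout_path); None means the stream is
    not redirected. Hypothetical helper that mirrors the documented commands.
    """
    tagging = f"tagging-{test}"  # assumed naming pattern, taken from "tagging-test"
    gold_out = f"{gold}-MTout"   # gold file converted to MorphTagger output format
    return [
        (["./MTLearn.pl", "-dir", ".", learn], None, None),
        (["./MTTest.pl", "-dir", ".", test], None, None),
        (["./MTunambig-to-MTout.pl"], gold, gold_out),
        (["./MTEval.pl", gold_out, tagging, empty, gold], None, None),
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv, stdin_path, stdout_path in commands:
        with ExitStack() as stack:
            stdin = stack.enter_context(open(stdin_path, "rb")) if stdin_path else None
            stdout = stack.enter_context(open(stdout_path, "wb")) if stdout_path else None
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)


if __name__ == "__main__":
    # Print the plan only; call run_all(morphtagger_commands()) to execute it.
    for argv, stdin_path, stdout_path in morphtagger_commands():
        line = " ".join(argv)
        if stdin_path:
            line += f" < {stdin_path}"
        if stdout_path:
            line += f" > {stdout_path}"
        print(line)
```

Keeping the command list separate from its execution makes it possible to inspect the plan before anything is run.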

Cross-validation

We created a script that runs n-fold cross-validation (n is set to 10) over the given data. The script needs an ambiguous file and an unambiguous file. To run it, type:

./MTCross.pl -no_func -h -dir cv $AMBIG $UNAMBIG

To use the gold segmentation, add the "-gold" and "-tags tagsfile" parameters. For more information about the parameters, run "./MTCross.pl" without arguments.
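
For readers unfamiliar with the procedure, n-fold cross-validation can be sketched as follows. This is a generic illustration in Python, not the logic of MTCross.pl: how that script assigns sentences to folds is not described here, so the round-robin assignment and the function name `cross_validation_folds` are assumptions.

```python
def cross_validation_folds(sentences, n=10):
    """Yield (fold_index, train, test) for n-fold cross-validation.

    Sentence i is held out in fold i % n (round-robin assignment, an
    assumption made for illustration); all other sentences form the
    training set of that fold.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    for fold in range(n):
        test = [s for i, s in enumerate(sentences) if i % n == fold]
        train = [s for i, s in enumerate(sentences) if i % n != fold]
        yield fold, train, test


if __name__ == "__main__":
    toy_corpus = [f"sentence-{i}" for i in range(25)]  # placeholder data
    for fold, train, test in cross_validation_folds(toy_corpus, n=10):
        print(fold, len(train), len(test))
```

Every sentence is held out exactly once, so each fold's model is tested on data it did not see during learning.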

Results for cross-validating MorphTagger over ATB1v3.0

Using the gold segmentation

Fold    Token accuracy    Morph accuracy
0       96.09             96.5
1       96.92             97.22
2       96.92             97.23
3       96.9              97.19
4       96.64             96.99
5       96.76             97.09
6       97.18             97.49
7       96.28             96.68
8       96.61             96.92
9       96.91             97.22
AVG     96.72             97.05
STDEV   0.33              0.29
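
The AVG and STDEV rows can be reproduced from the per-fold values. The sketch below, in Python, recomputes them for the token accuracy column of the table above. It assumes that STDEV is the sample standard deviation (n - 1 in the denominator), which matches the reported 0.33 for this column; the population form would give 0.31.

```python
from statistics import mean, stdev

# Token accuracy per fold (ATB1v3.0, gold segmentation), copied from the table above.
token_accuracy = [96.09, 96.92, 96.92, 96.9, 96.64, 96.76, 97.18, 96.28, 96.61, 96.91]

avg = mean(token_accuracy)
std = stdev(token_accuracy)  # sample standard deviation (n - 1 in the denominator)

print(f"AVG   {avg:.2f}")  # AVG   96.72
print(f"STDEV {std:.2f}")  # STDEV 0.33
```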

Not using the gold segmentation

Fold   Tok Tag Acc    Tok Seg Acc    Morph Tag Rec  Morph Tag Pre  Morph Tag F1   Morph Seg Rec  Morph Seg Pre  Morph Seg F1
0      95.42          99.24          95.71          95.6           95.66          99.11          99             99.05
1      96.23          99.37          96.47          96.38          96.426         99.29          99.2           99.25
2      96.39          99.52          96.67          96.56          96.62          99.47          99.36          99.42
3      96.5           99.54          96.75          96.7           96.73          99.46          99.41          99.44
4      96.05          99.61          96.38          96.35          96.37          99.55          99.51          99.53
5      96.39          99.56          96.69          96.59          96.64          99.53          99.43          99.48
6      96.57          99.14          96.69          96.66          96.67          98.98          98.94          98.96
7      95.96          99.53          96.27          96.26          96.27          99.44          99.43          99.44
8      96.2           99.45          96.4           96.4           96.4           99.31          99.32          99.32
9      96.31          99.2           96.36          96.39          96.38          98.94          98.98          98.96
AVG    96.20          99.42          96.44          96.39          96.42          99.31          99.26          99.29
STDEV  0.33           0.17           0.31           0.31           0.31           0.23           0.21           0.22

Results for cross-validating MorphTagger over Hebrew

Using the gold segmentation

Fold    Token accuracy    Morph accuracy
0       93.55             95.12
1       93.61             95
2       92.07             93.84
3       92.62             94.26
4       92.63             94.15
5       93.12             94.62
6       93.41             94.95
7       93.34             94.76
8       94.01             95.39
9       93.83             95.23
AVG     93.22             94.73
STDEV   0.61              0.51

Not using the gold segmentation

Fold   Tok Tag Acc    Tok Seg Acc    Morph Tag Rec  Morph Tag Pre  Morph Tag F1   Morph Seg Rec  Morph Seg Pre  Morph Seg F1
0      91.53          96.71          93.17          93.49          93.33          97.23          97.56          97.4
1      91.25          96.26          92.37          93.07          92.72          96.57          97.31          96.94
2      89.68          96.16          91.4           92.14          91.77          96.56          97.35          96.96
3      90.94          96.44          92.36          93.09          92.73          96.68          97.44          97.06
4      90.53          96.18          92.22          92.55          92.39          96.78          97.12          96.95
5      91.17          96.65          92.43          93.34          92.88          96.78          97.73          97.25
6      90.73          96.41          92.57          93.26          92.91          97.07          97.81          97.44
7      91.27          96.8           92.57          93.34          92.95          96.95          97.76          97.35
8      92.02          96.76          93.59          93.87          93.73          97.42          97.71          97.57
9      91.78          96.64          93.26          93.78          93.52          97.09          97.62          97.35
AVG    91.09          96.50          92.59          93.19          92.89          96.91          97.54          97.23
STDEV  0.67           0.24           0.62           0.52           0.56           0.28           0.22           0.23
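
The F1 columns combine recall and precision. The sketch below, in Python, assumes the standard definition of F1 as the harmonic mean of the two, and the function name `f1_score` is illustrative. Because the table entries are already rounded, values recomputed from the printed recall and precision can differ from the printed F1 in the last digit.

```python
def f1_score(recall, precision):
    """Harmonic mean of recall and precision (both given in the same units)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Morph Tag Rec and Morph Tag Pre for Hebrew fold 0, copied from the table above.
print(round(f1_score(93.17, 93.49), 2))  # 93.33
```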