This document describes how to run and evaluate the SVM classifier (the data-driven model) over Arabic and Hebrew.
For this goal, two files are needed:
You'll need to install YAMCHA, a text chunker that uses SVM classification.
The scripts used below can be downloaded here
First, we need to prepare the raw text from the corpus. We use the unambiguous file created for MorphTagger; assume that the path to this file is stored in $UNAMBIG.
Prepare raw text:
perl GetTokens.pl < $UNAMBIG > tokens
Prepare correctly segmented text:
perl GetMorphs.pl < $UNAMBIG > morphs
Create training data for segmentation:
perl MTUnambig-to-YAMSEG.pl $UNAMBIG train-seg
Create training data for POS tagging:
perl MTUnambig-to-YAMPOS.pl $UNAMBIG train-pos
Assume we have a text file in Latin transliteration called "test" and two training files called "train-seg" and "train-pos". We first create two classifiers, one for segmentation and one for POS tagging. We then segment "test" using the segmentation classifier and run the POS classifier over the segmented text.
Create Classifiers
Segmentation classifier (model):
make CORPUS=train-seg MULTI_CLASS=2 MODEL=seg FEATURE="F:-5..5:0..0 T:-5..-1" train
This will create the model file "seg.model".
POS tagging classifier (model):
make CORPUS=train-pos MULTI_CLASS=2 MODEL=pos FEATURE="F:-2..2:0.. T:-2..-1" train
This will create the model file "pos.model".
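The FEATURE strings above are window templates: the `F:` part selects feature columns at row offsets relative to the current position, and the `T:` part selects tags already assigned to preceding positions. As a rough illustration only (this is not part of YAMCHA; the names `expand_template` and `num_columns` are hypothetical, and the `F:rows:columns` / `T:rows` reading of the syntax is an assumption), the following Python sketch expands a template into explicit offsets:

```python
def _parse_range(spec, open_end=None):
    """Parse 'a..b', open-ended 'a..', or a single 'a' into an inclusive int list."""
    if ".." not in spec:
        return [int(spec)]
    start, _, end = spec.partition("..")
    if end == "":
        # Open-ended range such as '0..': the upper bound must be supplied.
        if open_end is None:
            raise ValueError("open-ended range needs a known upper bound")
        end = open_end
    return list(range(int(start), int(end) + 1))


def expand_template(feature, num_columns=1):
    """Expand a FEATURE string into static (row, column) pairs and tag offsets.

    num_columns is the assumed number of feature columns in the training
    file (excluding the answer tag); it is only used for open-ended ranges.
    """
    static, dynamic = [], []
    for part in feature.split():
        fields = part.split(":")
        if fields[0] == "F":
            rows = _parse_range(fields[1])
            cols = _parse_range(fields[2], open_end=num_columns - 1)
            static.extend((r, c) for r in rows for c in cols)
        elif fields[0] == "T":
            dynamic.extend(_parse_range(fields[1]))
        else:
            raise ValueError(f"unknown template part: {part}")
    return static, dynamic
```

Under this reading, the segmentation template looks at column 0 of the five rows before and after the current row plus the five previous tags, while the POS template uses a narrower window of two rows on each side over all feature columns.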
Test Model
./TOKrun.pl -model seg.model test
This will output the segmentation to "test.TOK".
./POSrun.pl -model pos.model test.TOK
This will output the tagging to "test.TOK.POS".
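The order of the training and testing steps can be summarized in a small driver. The Python sketch below only assembles the commands shown in this document and prints them; it does not run anything. The function name `build_commands` is hypothetical, and the sketch assumes the commands are issued from a directory where the YAMCHA Makefile and the two run scripts are available.

```python
import shlex


def build_commands(test_file="test"):
    """Return the train and test commands from this document, in order."""
    segmented = f"{test_file}.TOK"  # written by TOKrun.pl
    return [
        # Train the segmentation classifier (creates seg.model).
        ["make", "CORPUS=train-seg", "MULTI_CLASS=2", "MODEL=seg",
         "FEATURE=F:-5..5:0..0 T:-5..-1", "train"],
        # Train the POS tagging classifier (creates pos.model).
        ["make", "CORPUS=train-pos", "MULTI_CLASS=2", "MODEL=pos",
         "FEATURE=F:-2..2:0.. T:-2..-1", "train"],
        # Segment the test file, then tag the segmented output.
        ["./TOKrun.pl", "-model", "seg.model", test_file],
        ["./POSrun.pl", "-model", "pos.model", segmented],
    ]


# Dry run: print each command as a shell-quoted line.
for cmd in build_commands():
    print(shlex.join(cmd))
```

To execute the steps instead of printing them, each list could be passed to `subprocess.run(cmd, check=True)`, which stops at the first failing step.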
To evaluate, convert "test.TOK.POS" to the MorphTagger output format and then run MTEval from the MorphTagger package:
svmo-mto.pl test.TOK.POS test
The output will be written to "test.TOK.POS.MT".
We created a script that runs n-fold cross-validation (n is set to 10) over the given data. The script needs an unambiguous file. To execute, type:
./SVMCross.pl -dir cv $UNAMBIG
The script currently performs POS tagging over the gold segmentation.
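For readers unfamiliar with n-fold cross-validation, the idea is that each fold is held out once for testing while the remaining folds are used for training. The Python sketch below illustrates this with a round-robin assignment of sentences to folds. How SVMCross.pl actually partitions the data is not described here, so the split strategy and the name `make_folds` are assumptions made for illustration.

```python
def make_folds(sentences, n=10):
    """Yield (fold_index, train, held_out) for n-fold cross-validation.

    Sentence i is assigned to fold i % n (round-robin); this is an
    illustrative choice, not necessarily what SVMCross.pl does.
    """
    for k in range(n):
        held_out = [s for i, s in enumerate(sentences) if i % n == k]
        train = [s for i, s in enumerate(sentences) if i % n != k]
        yield k, train, held_out
```

Each sentence ends up in exactly one held-out set, so the per-fold accuracies together cover the whole corpus once.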
Using the gold segmentation
| Fold# | Token accuracy (%) | Morph accuracy (%) |
|---|---|---|
| 0 | 95.96 | 96.33 |
| 1 | 96.71 | 97.00 |
| 2 | 96.57 | 96.90 |
| 3 | 96.38 | 96.71 |
| 4 | 96.75 | 97.06 |
| 5 | 96.83 | 97.14 |
| 6 | 97.12 | 97.43 |
| 7 | 96.27 | 96.65 |
| 8 | 96.18 | 96.53 |
| 9 | 96.75 | 97.05 |
| AVG | 96.55 | 96.88 |
| STDEV | 0.35 | 0.33 |
Using the gold segmentation
| Fold# | Token accuracy (%) | Morph accuracy (%) |
|---|---|---|
| 0 | 92.78 | 94.61 |
| 1 | 91.31 | 93.25 |
| 2 | 89.72 | 92.18 |
| 3 | 90.32 | 92.70 |
| 4 | 91.18 | 93.25 |
| 5 | 91.40 | 93.46 |
| 6 | 91.70 | 93.77 |
| 7 | 91.65 | 93.65 |
| 8 | 91.78 | 93.78 |
| 9 | 92.15 | 94.11 |
| AVG | 91.40 | 93.48 |
| STDEV | 0.87 | 0.69 |
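The AVG and STDEV rows can be recomputed from the per-fold values. The Python sketch below does this for the token accuracy column of the first table; the reported STDEV is consistent with the sample standard deviation (dividing by n-1), which is what `statistics.stdev` computes. The helper name `summarize` is hypothetical.

```python
import statistics

# Per-fold token accuracy from the first table (folds 0 to 9).
token_accuracy = [95.96, 96.71, 96.57, 96.38, 96.75,
                  96.83, 97.12, 96.27, 96.18, 96.75]


def summarize(values):
    """Return (mean, sample standard deviation), each rounded to 2 decimals."""
    return round(statistics.mean(values), 2), round(statistics.stdev(values), 2)


avg, stdev = summarize(token_accuracy)
print(f"AVG {avg:.2f}  STDEV {stdev:.2f}")
```

The same function can be applied to any other column by replacing the list of values.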