Intro

This document describes how to run and evaluate the SVM classifier (the data-driven model) on Arabic and Hebrew.

For this goal, two files are needed: a raw text file and a training file. Both are created in the steps described in this document.

Implementation

You'll need to install YAMCHA, a text chunker that uses SVM classification.

The scripts used below can be downloaded here.

Creating the raw text file

First, we need to prepare the raw text from the corpus. We use the unambiguous file created for MorphTagger; assume that the path to this file is $UNAMBIG.

Prepare raw text:

perl GetTokens.pl < $UNAMBIG > tokens

Prepare correctly segmented text:

perl GetMorphs.pl < $UNAMBIG > morphs

Creating the training file

Create training data for segmentation:

perl MTUnambig-to-YAMSEG.pl $UNAMBIG train-seg

Create training data for POS tagging:

perl MTUnambig-to-YAMPOS.pl $UNAMBIG train-pos

Running SVM

Assume we have a text file in Latin transliteration called "test", and two training files called "train-seg" and "train-pos". We first create two classifiers, one for segmentation and one for POS tagging. We then segment "test" using the segmentation classifier and run the POS classifier over the segmented text.

Create Classifiers

Segmentation classifier (model):

make CORPUS=train-seg MULTI_CLASS=2 MODEL=seg FEATURE="F:-5..5:0..0 T:-5..-1" train

This will create the model file "seg.model".

POS tagging classifier (model):

make CORPUS=train-pos MULTI_CLASS=2 MODEL=pos FEATURE="F:-2..2:0.. T:-2..-1" train

This will create the model file "pos.model".
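The FEATURE strings passed to make above can be read as window descriptions. The following is a minimal sketch, assuming the usual YAMCHA reading of the notation: "F:rows:columns" selects static features from token rows relative to the current position, and "T:rows" selects tags already assigned to earlier tokens. The function names and the last_feature_column parameter are hypothetical; the column layout of "train-seg" and "train-pos" is not stated here, so the values passed below are assumptions.

```python
def parse_range(spec, open_end=None):
    """Turn 'a..b', 'a..' or 'a' into an inclusive list of integers."""
    start, sep, end = spec.partition("..")
    low = int(start)
    if not sep:
        high = low                      # single value such as "0"
    elif end == "":
        if open_end is None:
            raise ValueError("open-ended range needs an explicit end: " + spec)
        high = open_end                 # "0.." runs to the last feature column
    else:
        high = int(end)
    return list(range(low, high + 1))


def expand_feature_template(template, last_feature_column):
    """Expand a FEATURE string into static (row, column) pairs and dynamic tag rows.

    last_feature_column is an assumption about the training file layout:
    the index of the last column that is not the answer tag.
    """
    static, dynamic = [], []
    for part in template.split():
        fields = part.split(":")
        if fields[0] == "F":
            rows = parse_range(fields[1])
            cols = parse_range(fields[2], open_end=last_feature_column)
            static.extend((row, col) for row in rows for col in cols)
        elif fields[0] == "T":
            dynamic.extend(parse_range(fields[1]))
    return static, dynamic


# Segmentation template: 11 rows (-5..5) of column 0, plus the 5 previous tags.
seg_static, seg_dynamic = expand_feature_template("F:-5..5:0..0 T:-5..-1", 0)
print(len(seg_static), seg_dynamic)

# POS template, assuming (hypothetically) two feature columns, 0 and 1.
pos_static, pos_dynamic = expand_feature_template("F:-2..2:0.. T:-2..-1", 1)
print(len(pos_static), pos_dynamic)
```

Under this reading, the segmentation model looks at a wider context (five tokens on each side) than the POS model (two tokens on each side).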

Test Model

./TOKrun.pl -model seg.model test

This will output the segmentation to "test.TOK".

./POSrun.pl -model pos.model test.TOK

This will output the tagging to "test.TOK.POS".

To evaluate, you can convert "test.TOK.POS" to the MorphTagger output format, and then use MTEval in the MorphTagger package:

svmo-mto.pl test.TOK.POS test

The output will be written to "test.TOK.POS.MT".
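The three steps above (segment, tag, convert) can be chained in a small wrapper. This is only a sketch: the script names and arguments are copied from the commands in this section, while the function names and the dry_run switch are hypothetical. It assumes the scripts are in the current directory or on the PATH exactly as invoked above.

```python
import subprocess


def pipeline_commands(test_file, seg_model="seg.model", pos_model="pos.model"):
    """Build the command lines for segmentation, POS tagging and conversion."""
    tok_file = test_file + ".TOK"       # written by TOKrun.pl
    pos_file = tok_file + ".POS"        # written by POSrun.pl
    return [
        ["./TOKrun.pl", "-model", seg_model, test_file],
        ["./POSrun.pl", "-model", pos_model, tok_file],
        ["svmo-mto.pl", pos_file, test_file],
    ]


def run_pipeline(test_file, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in pipeline_commands(test_file):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline as soon as one step fails
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline("test")                # dry run: only prints the commands
```

Stopping at the first failing step matters here because each stage reads the file written by the previous one.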

Cross-validation

We created a script to run n-fold cross-validation (n is set to 10) over the given data. The script needs an unambiguous file. To execute, type:

./SVMCross.pl -dir cv $UNAMBIG

The script currently performs POS tagging over the gold segmentation.
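For readers unfamiliar with n-fold cross-validation, the idea can be sketched as follows. How SVMCross.pl actually partitions the data is not described here, so the contiguous-block split below is an assumption, and the function name is hypothetical.

```python
def contiguous_folds(sentences, n_folds=10):
    """Yield (fold, train, test) triples; each sentence is held out exactly once."""
    base, extra = divmod(len(sentences), n_folds)
    start = 0
    for fold in range(n_folds):
        # the first `extra` folds take one additional sentence each
        size = base + (1 if fold < extra else 0)
        test = sentences[start:start + size]
        train = sentences[:start] + sentences[start + size:]
        yield fold, train, test
        start += size


# Toy data standing in for the sentences of the unambiguous file.
toy_sentences = ["sentence-%d" % i for i in range(25)]
for fold, train, test in contiguous_folds(toy_sentences):
    print(fold, len(train), len(test))
```

In each fold, a model would be trained on the train part and scored on the held-out part, which gives one accuracy figure per fold.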

Results for cross-validating the SVM model over ATB1v3.0

Using the gold segmentation

Fold#    Token accuracy (%)    Morph accuracy (%)
0        95.96                 96.33
1        96.71                 97
2        96.57                 96.9
3        96.38                 96.71
4        96.75                 97.06
5        96.83                 97.14
6        97.12                 97.43
7        96.27                 96.65
8        96.18                 96.53
9        96.75                 97.05
AVG      96.55                 96.88
STDEV    0.35                  0.33
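The AVG and STDEV rows can be re-derived from the per-fold figures. The sketch below uses the ATB1v3.0 values listed above; the sample standard deviation (n - 1 denominator) appears to be the one that matches the reported STDEV values. The function name is hypothetical.

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the ATB1v3.0 results above (folds 0 to 9).
token_acc = [95.96, 96.71, 96.57, 96.38, 96.75, 96.83, 97.12, 96.27, 96.18, 96.75]
morph_acc = [96.33, 97.0, 96.9, 96.71, 97.06, 97.14, 97.43, 96.65, 96.53, 97.05]


def summarize(values):
    """Return (average, sample standard deviation), both rounded to 2 decimals."""
    return round(mean(values), 2), round(stdev(values), 2)


print("token:", summarize(token_acc))   # (96.55, 0.35)
print("morph:", summarize(morph_acc))   # (96.88, 0.33)
```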

Results for cross-validating the SVM model over the Hebrew Corpus

Using the gold segmentation

Fold#    Token accuracy (%)    Morph accuracy (%)
0        92.78                 94.61
1        91.31                 93.25
2        89.72                 92.18
3        90.32                 92.7
4        91.18                 93.25
5        91.4                  93.46
6        91.7                  93.77
7        91.65                 93.65
8        91.78                 93.78
9        92.15                 94.11
AVG      91.40                 93.48
STDEV    0.87                  0.69