On Prediction Using Variable Order Markov Models - Companion Site

 
Datasets
In the paper we compare the performance of the VMM algorithms on real-life sequences from three domains: proteins, English text, and music pieces. Prediction quality is measured by the average log-loss. We also compare classification algorithms based on these predictors on a number of large protein classification tasks.
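To make the evaluation measure concrete, here is a minimal sketch of how an average log-loss could be computed for any sequential predictor. It assumes the usual definition, the negative mean of log P(x_i | x_1 ... x_{i-1}) over the test sequence, and uses base-2 logarithms so the result reads as bits per symbol; both the base and the function names (`average_log_loss`, `make_laplace_predictor`) are assumptions for illustration and are not taken from the code distributed on this site. The order-0 Laplace predictor is only a hypothetical stand-in for a real VMM algorithm.

```python
import math
from collections import Counter


def average_log_loss(predict, sequence):
    """Return the average log-loss of `predict` on `sequence`, in bits per symbol.

    `predict(context, symbol)` must return a strictly positive probability
    for `symbol` given the preceding symbols `context`.
    """
    total = 0.0
    for i, symbol in enumerate(sequence):
        # Each symbol is predicted from everything that came before it.
        total -= math.log2(predict(sequence[:i], symbol))
    return total / len(sequence)


def make_laplace_predictor(alphabet):
    """Hypothetical order-0 predictor with add-one smoothing (not a VMM)."""
    size = len(alphabet)

    def predict(context, symbol):
        counts = Counter(context)
        return (counts[symbol] + 1) / (len(context) + size)

    return predict


def uniform(context, symbol):
    """Ignore the context and assign probability 1/4 to every symbol."""
    return 0.25


# A uniform predictor over four symbols costs exactly 2 bits per symbol.
print(average_log_loss(uniform, "ACGTACGT"))  # 2.0

# The smoothed predictor adapts to a repetitive sequence and scores lower.
print(average_log_loss(make_laplace_predictor("AB"), "AAAA"))
```

A lower value means the predictor assigned higher probability to the symbols that actually occurred, which is why the measure also corresponds to the compression rate achievable with that predictor.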
Protein: The protein set includes proteins (amino-acid sequences) from the well-known Structural Classification of Proteins (SCOP) database. We used all sequences in release 1.63 of SCOP, after eliminating redundancy.
Text: For the English text we chose the well-known 'Calgary Corpus', which is traditionally used for benchmarking lossless compression algorithms.
Music: The music set was assembled from MIDI files of well-known pieces in a variety of styles: classical, jazz, and rock/pop. All the pieces we consider are polyphonic (played with several instruments simultaneously).
Click here to download or view the datasets.