Data Science



Importing data

One instance per file:

bin/mallet import-dir --input sample-data/web/* --output web.mallet

One file, one instance per line:

bin/mallet import-file --input /data/web/data.txt --output web.mallet
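
For the one-instance-per-line form, each line is expected to carry an instance name, a label, and the text, in that order (MALLET's default line regex is `^(\S*)[\s,]*(\S*)[\s,]*(.*)$`). A minimal sketch, assuming a hypothetical file `sample.txt` with made-up instances, that builds such a file and checks that every line has all three fields before importing:

```shell
#!/bin/sh
# Hypothetical sample data: <name> <label> <text...>, one instance per line.
workdir=$(mktemp -d)
data="$workdir/sample.txt"
printf '%s\n' \
  'doc1 sports the team won the match' \
  'doc2 politics the vote was postponed' \
  'doc3 sports a late goal decided it' > "$data"

# Count lines with fewer than three whitespace-separated fields;
# such lines would import with an empty label or empty text.
bad_lines=$(awk 'NF < 3 { n++ } END { print n + 0 }' "$data")
total_lines=$(awk 'END { print NR }' "$data")
echo "$total_lines lines, $bad_lines malformed"

# Run the import only if MALLET is present in the current directory.
if [ "$bad_lines" -eq 0 ] && [ -x bin/mallet ]; then
  bin/mallet import-file --input "$data" --output "$workdir/sample.mallet"
fi
```

The check is deliberately crude: it only counts fields and does not validate the labels themselves.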

Build the classifier

bin/mallet train-classifier --input acl/acl.mallet --output-classifier acl/acl.maxent.classifier --trainer MaxEnt

Test how it works with unseen data

bin/mallet classify-dir --input datadir --output - --classifier classifier

Evaluation of a Classification Algorithm

./bin/mallet train-classifier --input web.mallet --training-portion 0.9 --trainer MaxEnt

./bin/mallet train-classifier --input web.mallet --cross-validation 10 --trainer MaxEnt
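
The two commands differ in how data is held out: `--training-portion 0.9` makes a single 90/10 split, while `--cross-validation 10` rotates the held-out part over ten folds. A rough sketch of the arithmetic only, assuming a hypothetical collection of 250 instances (MALLET's own partitioning may distribute any remainder differently):

```shell
#!/bin/sh
total=250   # hypothetical number of instances
folds=10

# Single split at --training-portion 0.9: 90% train, the rest test.
train=$(( total * 9 / 10 ))
test_n=$(( total - train ))
echo "single split: $train train, $test_n test"

# k-fold: each fold is held out once; any remainder goes to the first folds.
base=$(( total / folds ))
extra=$(( total % folds ))
i=1
while [ "$i" -le "$folds" ]; do
  size=$base
  if [ "$i" -le "$extra" ]; then
    size=$(( base + 1 ))
  fi
  echo "fold $i: test $size, train $(( total - size ))"
  i=$(( i + 1 ))
done
```

With cross-validation every instance is used for testing exactly once, so the reported accuracy depends less on one lucky or unlucky split.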

Mallet Sequence Tagging

Using SimpleTagger, perform n-fold cross-validation with these parameters:

--train true --test lab --threads 2 --iterations 50 crf-input-data.txt
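
SimpleTagger reads one token per line, with the features first and the label as the last field, and a blank line between sequences. A small sketch under that assumption, using made-up toy data, that writes such a file and summarises it before training:

```shell
#!/bin/sh
# Hypothetical toy data: one token per line, features first, label last;
# a blank line ends each sequence.
workdir=$(mktemp -d)
input="$workdir/crf-input-data.txt"
printf '%s\n' \
  'Bill CAPITALIZED noun' \
  'slept non-noun' \
  'here LOWERCASE STOPWORD non-noun' \
  '' \
  'Anna CAPITALIZED noun' \
  'left non-noun' > "$input"

# Count sequences (blank-line separated records) and tokens (non-empty lines).
sequences=$(awk 'BEGIN { RS = ""; FS = "\n" } END { print NR }' "$input")
tokens=$(awk 'NF > 0 { n++ } END { print n + 0 }' "$input")
# Collect the distinct labels, i.e. the last field of each token line.
labels=$(awk 'NF > 0 { print $NF }' "$input" | sort -u | tr '\n' ' ')
echo "$sequences sequences, $tokens tokens, labels: $labels"
```

A quick summary like this helps catch a missing blank line or a token without a label before a long training run.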

Generalised Expectation

./bin/mallet import-file --input train.file.tsv --output train.file.mallet

./bin/mallet import-file --input test.file.tsv --use-pipe-from train.file.mallet --output test.file.mallet

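Hide the targets (labels) of the training vectors to get unlabeled data:

--input hockey-train.mallet --output hockey.unlabeled.vectors --hide-targets

Build the constraints from a file of labeled features:

--input lang-train.mallet --output lang.constraints --features-file labeled-features-lang.tsv --targets heuristic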

./bin/mallet train-classifier --training-file ham.train.unlabeled.vectors --testing-file ham.test.mallet --trainer "MaxEntGETrainer,gaussianPriorVariance=0.1,constraintsFile=\"ham.constraints\"" --report test:accuracy

Human generated features:

java \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.labeled_features \
--targets heuristic

Machine generated features:

Finally, we may estimate the expectations using the exact target expectations from the labeled data. The targets option to do this is oracle.

java \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.features \
--targets oracle


Mallet error "Line # does not match regex" when importing files


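tr -dc '[:alnum:][ ,.]\n' < ./inputfile.txt > ./inputfilefixed.txt

The quoted set keeps letters, digits, spaces, commas, periods, square brackets and newlines; -d with -c deletes every other character, which removes stray characters that can stop a line from matching the import regex.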
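
To see what the filter does, here is a small sketch that runs it on a made-up line containing an accented letter and curly quotes. `LC_ALL=C` is an added assumption so that only ASCII letters and digits count as alphanumeric:

```shell
#!/bin/sh
# Hypothetical input line: "doc1 sports café “hi”, ok." encoded as UTF-8.
cleaned=$(printf 'doc1 sports caf\303\251 \342\200\234hi\342\200\235, ok.\n' \
  | LC_ALL=C tr -dc '[:alnum:][ ,.]\n')
echo "$cleaned"
# prints: doc1 sports caf hi, ok.
```

Note that the filter is lossy: accented letters and all other punctuation disappear, so it is best applied to a copy of the data, as the command above does.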
