代写 algorithm Java Finite State Automaton theory The Universal Similarity Metric – 189 – Marcus Hutter

The Universal Similarity Metric – 189 – Marcus Hutter
6 THE UNIVERSAL SIMILARITY METRIC
• Kolmogorov Complexity
• The Universal Similarity Metric
• Tree-Based Clustering
• Genomics & Phylogeny: Mammals, SARS Virus & Others • Classification of Different File Types
• Language Tree (Re)construction
• Classify Music w.r.t. Composer
• Further Applications
• Summary

The Universal Similarity Metric – 190 – Marcus Hutter
The Similarity Metric: Abstract
The MDL method has been studied from very concrete and highly tuned practical applications to general theoretical assertions. Sequence prediction is just one application of MDL. The MDL idea has also been used to define the so called information distance or universal similarity metric, measuring the similarity between two individual objects. I will present some very impressive recent clustering applications based on standard Lempel-Ziv or bzip2 compression, including a completely automatic reconstruction (a) of the evolutionary tree of 24 mammals based on complete mtDNA, and (b) of the classification tree of 52 languages based on the declaration of human rights and (c) others.
Based on [Cilibrasi&Vitanyi’05]

The Universal Similarity Metric – 191 – Marcus Hutter
Kolmogorov Complexity
Question: When is object=string x similar to object=string y? Universal solution: x similar y ⇔ x can be easily (re)constructed from y
⇔ Kolmogorov complexity K(x|y) := min{l(p) : U(p,y) = x} is small Examples:
1) x is very similar to itself (K(x|x) =+ 0)
2) A processed x is similar to x (K(f(x)|x) =+ 0 if K(f) = O(1)).
e.g. doubling, reverting, inverting, encrypting, partially deleting x. 3) A random string is with high probability not similar to any other
string (K(random|y) =length(random)).
The problem with K(x|y) as similarity=distance measure is that it is neither symmetric nor normalized nor computable.

The Universal Similarity Metric – 192 – Marcus Hutter
The Universal Similarity Metric
• Symmetrization and normalization leads to a/the universal metric d: 0 ≤ d(x, y) := max{K(x|y), K(y|x)} ≤ 1
max{K(x), K(y)}
• Every effective similarity between x and y is detected by d
• Use K(x|y)≈K(xy)−K(y) (coding T) and K(x)≡KU (x)≈KT (x) =⇒ computable approximation: Normalized compression distance:
d(x,y) ≈ KT (xy) − min{KT (x),KT (y)} 􏰘 1 max{KT (x), KT (y)}
• For T choose Lempel-Ziv or gzip or bzip(2) (de)compressor in the applications below.
• Theory: Lempel-Ziv compresses asymptotically better than any probabilistic finite state automaton predictor/compressor.

The Universal Similarity Metric – 193 – Marcus Hutter
Tree-Based Clustering
• If many objects x1, …, xn need to be compared, determine the similarity matrix Mij= d(xi,xj) for 1 ≤ i,j ≤ n
• Now cluster similar objects.
• There are various clustering techniques.
• Tree-based clustering: Create a tree connecting similar objects, • e.g. quartet method (for clustering)

The Universal Similarity Metric – 194 – Marcus Hutter
Genomics & Phylogeny: Mammals
Let x1, …, xn be mitochondrial genome sequences of different mammals:
Partial distance matrix Mij using bzip2(?)
Cat Echidna Gorilla …
BrownBear 0.002
Carp 0.943
Cat 0.887
Chimpanzee 0.935
Cow 0.906
Echidna 0.944
FinbackWhale 0.915
Gibbon 0.939
Gorilla 0.940
HouseMouse 0.934
Human 0.930
… …
0.943 0.887
0.006 0.946
0.946 0.003
0.954 0.926
0.947 0.897
0.955 0.942
0.952 0.905
0.951 0.928
0.957 0.931
0.956 0.919
0.946 0.922
… …
0.935 0.906
0.954 0.947
0.926 0.897
0.006 0.926
0.926 0.006
0.948 0.936
0.926 0.885
0.849 0.931
0.731 0.927
0.943 0.925
0.667 0.920
… …
0.944 0.915
0.955 0.952
0.942 0.905
0.948 0.926
0.936 0.885
0.005 0.936
0.936 0.005
0.947 0.930
0.947 0.931
0.941 0.933
0.939 0.922
… …
0.939 0.940 0.934
0.951 0.957 0.956
0.928 0.931 0.919
0.849 0.731 0.943
0.931 0.927 0.925
0.947 0.947 0.941
0.930 0.931 0.933
0.005 0.859 0.948
0.859 0.006 0.944
0.948 0.944 0.006
0.844 0.737 0.932
… … …
0.930 …
0.946 …
0.922 …
0.667 …
0.920 …
0.939 …
0.922 …
0.844 …
0.737 …
0.932 …
0.005 …
… …
BrownBear
Carp Cow
FinWhale
Gibbon
HouseMouse …
Human …
Chimpanzee

The Universal Similarity Metric – 195 – Marcus Hutter
Genomics & Phylogeny: Mammals
Evolutionary tree built from complete mammalian mtDNA of 24 species:
Carp Cow BlueWhale FinbackWhale Cat BrownBear PolarBear GreySeal HarborSeal Horse WhiteRhino Gibbon Gorilla Human Chimpanzee PygmyChimp Orangutan SumatranOrangutan HouseMouse Rat Opossum Wallaroo Echidna Platypus
Ferungulates
Eutheria
Primates
Eutheria – Rodents Metatheria
Prototheria

The Universal Similarity Metric – 196 – Marcus Hutter
Genomics & Phylogeny: SARS Virus and Others
• Clustering of SARS virus in relation to potential similar virii based on complete sequenced genome(s) using bzip2:
• The relations are very similar to the definitive tree based on medical-macrobio-genomics analysis from biologists.

The Universal Similarity Metric – 197 – Marcus Hutter
Genomics & Phylogeny: SARS Virus and Others
MurineHep11
n10
MurineHep2
RatSialCorona
SARSTOR2v120403 HumanCorona1
n8
n7
AvianIB2 AvianIB1
n2
n13
n4
n0 n5
PRD1
n9
SIRV2
MeaslesSch
n3
n11
n12
SIRV1
MeaslesMora
DuckAdeno1
n1 AvianAdeno1CELO
n6
HumanAdeno40
BovineAdeno3

The Universal Similarity Metric – 198 – Marcus Hutter
Classification of Different File Types
Classification of files based on markedly different file types using bzip2 • Four mitochondrial gene sequences
• Four excerpts from the novel “The Zeppelin’s Passenger”
• Four MIDI files without further processing
• Two Linux x86 ELF executables (the cp and rm commands) • Two compiled Java class files
No features of any specific domain of application are used!

The Universal Similarity Metric – 199 – Marcus Hutter
Classification of Different File Types
GenesFoxC
MusicHendrixB
GenesRatD
n10
n5 n13
MusicHendrixA
MusicBergB
MusicBergA
n0
GenesPolarBearB
GenesBlackBearA
n3
n8 n2
n7 n12
ELFExecutableB
ELFExecutableA
n1
JavaClassB
n6
JavaClassA
n4
n9
TextA
n11
TextD
TextC
TextB
Perfect classification!

The Universal Similarity Metric – 200 – Marcus Hutter
Language Tree (Re)construction
• Let x1, …, xn be the “The Universal Declaration of Human Rights” in various languages 1, …, n.
• Distance matrix Mij based on gzip. Language tree constructed from Mij by the Fitch-Margoliash method. [Li 04]
• All main linguistic groups can be recognized (next slide)

BALTIC
SLAVIC
Basque [Spain]
Hungarian [Hungary]
Polish [Poland]
Sorbian [Germany]
Slovak [Slovakia]
Czech [Czech Rep]
Slovenian [Slovenia]
Serbian [Serbia]
Bosnian [Bosnia]
Croatian [Croatia]
Romani Balkan [East Europe] Albanian [Albany]
Lithuanian [Lithuania]
Latvian [Latvia]
Turkish [Turkey]
Uzbek [Utzbekistan]
Breton [France]
Maltese [Malta] OccitanAuvergnat [France] Walloon [Belgique]
English [UK]
French [France]
Asturian [Spain]
Portuguese [Portugal]
Spanish [Spain]
Galician [Spain]
Catalan [Spain]
Occitan [France]
Rhaeto Romance [Switzerland] Friulian [Italy]
Italian [Italy]
Sammarinese [Italy]
Corsican [France]
Sardinian [Italy]
Romanian [Romania]
Romani Vlach [Macedonia] Welsh [UK]
Scottish Gaelic [UK]
Irish Gaelic [UK]
German [Germany] Luxembourgish [Luxembourg] Frisian [Netherlands]
Dutch [Netherlands]
Afrikaans
Swedish [Sweden]
Norwegian Nynorsk [Norway] Danish [Denmark]
Norwegian Bokmal [Norway] Faroese [Denmark]
Icelandic [Iceland]
Finnish [Finland]
Estonian [Estonia]
ALTAIC
ROMANCE
GERMANIC
CELTIC
UGROFINNIC

The Universal Similarity Metric – 202 – Marcus Hutter
Classify Music w.r.t. Composer
Let m1, …, mn be pieces of music in MIDI format. Preprocessing the MIDI files:
• Delete identifying information (composer, title, …), instrument indicators, MIDI control signals, tempo variations, …
• Keep only note-on and note-off information.
• A note, k ∈ Z half-tones above the average note is coded as a
signed byte with value k.
• The whole piece is quantized in 0.05 second intervals.
• Tracks are sorted according to decreasing average volume, and then output in succession.
Processed files x1, …, xn still sounded like the original.

The Universal Similarity Metric – 203 – Marcus Hutter
Classify Music w.r.t. Composer
12 pieces of music: 4×Bach + 4×Chopin + 4×Debussy. Class. by bzip2 DebusBerg4
DebusBerg1
n7
n4
ChopPrel1
n6
ChopPrel22
DebusBerg2
ChopPrel24
n3
n2
n1
n9
n8
DebusBerg3
n5
ChopPrel15
n0
BachWTK2F2 BachWTK2F1
BachWTK2P1 BachWTK2P2
Perfect grouping of processed MIDI files w.r.t. composers.

The Universal Similarity Metric – 204 – Marcus Hutter
Further Applications
• Classification of Fungi
• Optical character recognition
• Classification of Galaxies
• Clustering of novels w.r.t. authors
• Larger data sets
See [Cilibrasi&Vitanyi’05]

The Universal Similarity Metric – 205 – Marcus Hutter
The Clustering Method: Summary
• based on the universal similarity metric,
• based on Kolmogorov complexity,
• approximated by bzip2,
• with the similarity matrix represented by tree, • approximated by the quartet method
• leads to excellent classification in many domains.

The Universal Similarity Metric – 206 – Marcus Hutter
Exercises
1. [C20] Prove that d(x, y) := max{K(x|y),K(y|x)}−K(x|x) is a metric. max{K(x),K(y)}
2. [C25] Reproduce the phylogenetic tree of mammals and the language tree using the CompLearn Toolkit available from http://www.complearn.org/.

The Universal Similarity Metric – 207 – Marcus Hutter
Literature
[Ben98] C. H. Bennett et al. Information distance. IEEE Transactions on Information Theory, 44(4):1407–1423, 1998.
[Li 04] M. Li et al. The similarity metric. IEEE Transactions on Information Theory, 50(12):3250–3264, 2004.
[CVW04] R. Cilibrasi, P. M. B. Vit ́anyi, and R. de Wolf. Algorithmic clustering of music based on string compression. Computer Music Journal, 28(4):49–67, 2004. http://arXiv.org/abs/cs/0303025.
[CV05] R. Cilibrasi and P. M. B. Vit ́anyi. Clustering by compression. IEEE Trans. Information Theory, 51(4):1523–1545, 2005.
[CV06] R. Cilibrasi and P. M. B. Vit ́anyi. Similarity of objects and the meaning of words. In Proc. 3rd Annual Conferene on Theory and Applications of Models of Computation (TAMC’06), LNCS. Springer, 2006.