Conditional Entropy

Presentation
in
C27_B251,  MSCS building, UNE, Armidale, NSW, Australia
on
Friday, 23rd February 2007 from 12 to 1pm.

by
Dr. Charles R. Watson
School of Maths, Stats and Computer Science
University of New England, Armidale NSW 2351
URL: http://mcs.une.edu.au/~cwatson7/I/ConditionalEntropy.html
Slides: http://mcs.une.edu.au/~cwatson7/I/ConditionalEntropy.ppt

 

 “This process of constructing instruction tables should be very fascinating. There need be no real danger of it ever becoming a drudge, for any processes that are quite mechanical may be turned over to the machine itself.” Alan Mathison Turing, OBE (1945).

The software described here was developed on turing.une.edu.au and bombe.une.edu.au.

Introduction

We describe an algorithm for searching large data sets, by example, using an inverted trie (coined by E. Fredkin:1960 from reTRIEval) to analyse information content. A dictionary is built with reference or training data then used to analyse test data. Only strings in the dictionary are counted. The probability distribution is conditional on the dictionary data. A reverse link to the suffix list allows efficient search to match the prefix. Conditional entropy and hit rate are used as measures of correspondence of the test data with the reference dictionary.  To search for a string of length N we traverse the N-1 length suffix list until we match the prefix character. The maximum number of attempted matches is the size of the alphabet; half for the average number of attempts. We iteratively find all strings by using the last calculated suffix list to find the next string. This is an online algorithm, in that the dictionary and analysis is up-to date before reading the next character from the input stream. Once the dictionary (reference or training data) is complete or there is no more virtual memory the classifier continues, ad infinitum, with linear complexity on the length of the input stream.

This is more than mere motif matching. The entire structure of the reference data is encapsulated by the dictionary. Patterns are discovered by this homology-by-example paradigm. Tandem repeats are easily identified by the recurring relative location of the patterns equal to the pattern length. For example, the string: {TCATCATCATC} records the pattern {CATC} at locations 1, 4  and 7 with the relative locations 3,  3.

Entropy as information content

We can deduce genomic structure from information content. Shannon (1948:11) defined information entropy as , where  is the probability of a random variable .

 

Shannon's definition of entropy is closely related to thermodynamic entropy as defined in physics and chemistry. The entropy of genomic data can be calculated from the distribution of repeated sequences and provides a standard measure of randomness or structure. Repeated sequences indicate homology between genes and species and can infer protein folding structure. This analysis is applied to intron phases, protein, mRNA, coding and non-coding DNA data. Independent fair coin flips have an entropy of 1 bit per flip. A source that always generates a long string of A's has an entropy of 0, since the next character will always be an 'A'.

 

Entropy effectively bounds the performance of the strongest lossless (or nearly lossless) compression possible, which can be realized in theory by using the typical set or in practice using Huffman, Lempel-Ziv or arithmetic coding. The performance of existing data compression algorithms is often used as a rough estimate of the entropy of a block of data.

 

Although entropy is often used as a characterization of the information content of a data source, this information content is not absolute: it depends crucially on the probabilistic model. A source that always generates the same symbol has an entropy of 0, but the definition of what a symbol is depends on the alphabet. Consider a source that produces the string CATCATCATCATCATCAT in which C is always followed by A and T. If the probabilistic model considers individual letters as independent, the entropy of the sequence is 1.585 bit per character. But if the sequence is considered as a sequence of the amino acid Histidine "HHHHHH...", then the entropy is 0.

To search for a string of length N we traverse the {N-1} length suffix list until we match the prefix character. The maximum number of attempted matches is the size of the alphabet; half for the average number of attempts. We iteratively find all strings by using the last calculated suffix list to find the next string. This is an online algorithm, in that the dictionary is up-to date before reading the next character from the input stream. Once the dictionary (reference or training data) is complete or the is no more virtual memory to extend the dictionary the classifier continues, ad infinitum,  with linear complexity on the length of the input stream.

Shannon (1948:7) used the following randomly generated example where the word transition probabilities are the same as the original text. This illustrates structural redundancy in English. The result is similar to English, but incomprehensible.

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER

OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT

THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

 

Conditional Entropy

Shannon defines the conditional entropy (equivocation) of y given x, as the average of the entropy of y for each value of x, weighted according to the probability of getting that particular x. This quantity measures how uncertain we are of y on the average when we know x.
That is

 

The uncertainty (or entropy) of the joint event x, y is the uncertainty of x plus the uncertainty of y when x is known.

 

 

Hence

 

Conditional probability is the probability of some event Y, given the occurrence of some other event X. Conditional probability is written P(Y|X), and is read "the probability of Y, given X".

Conditional entropy can be used as a classifier, like Bayesian inference.
Thomas Bayes (c. 1702 – April 17, 1761) formulated a special case of Bayes' theorem, which was published posthumously.

Maximal Entropy

If  = 1/N for i=1,N then E = log N

For any skewed (non-uniform) distribution E < log N

The following table shows the 12 most frequent strings of length 1 to 7

Total

F:1

 

F:2

 

F:3

 

F:4

 

F:5

 

F:6

 

F:7

 

 

l

5258

an

2199

ing

978

The_

566

nd_th

320

___The

247

tration

178

 

d

5996

in

2207

ng_

1009

d_th

596

and_t

323

of_the

248

stratio

178

 

r

7558

_h

2280

to_

1040

____

597

_____

324

ation_

253

ation__

183

 

s

8014

n_

2495

_to

1089

_The

617

_you_

337

nd_the

274

ration_

184

 

i

8604

t_

2991

___

1126

his_

652

tion_

337

_of_th

279

of_the_

206

 

n

8945

s_

3240

and

1295

_of_

831

n_the

342

and_th

286

_____Th

237

 

h

9031

_a

3694

ed_

1440

ing_

894

_with

350

n_the_

291

____The

237

 

a

10481

th

3935

_an

1451

_to_

957

d_the

487

__The_

296

nd_the_

242

 

o

10565

d_

3997

nd_

1484

_and

1207

_his_

524

_that_

297

___The_

246

 

t

12082

he

4493

the

2697

and_

1242

_The_

564

_with_

310

_of_the

248

 

e

16727

_t

4699

he_

2986

the_

1954

_and_

1207

_and_t

318

and_the

256

 

_

34390

e_

6430

_th

3130

_the

2457

_the_

1952

d_the_

416

_and_th

284

Entropy

 

4.25

 

7.65

 

10.26

 

12.20

 

13.66

 

14.75

 

15.55

MaxE

 

4.25

 

8.50

 

12.75

 

17.01

 

21.26

 

25.51

 

29.76


Maximum entropy for random text increases linearly. So, for strings of length 7, the maximum entropy of 29.76 is twice the observed entropy. This illustrates the high level of redundancy in English text. We also see redundancy in computer files and non-coding introns. However coding DNA has little redundancy. For mRNA in the 7th human chromosome the single nucleotide entropy is 1.9994.

The frequency distribution of nucleotides in the 7th human chromosome 

Symbol

nucleotide

Count

Probability

C

Cytosine

799912

0.2490

G

Guanine

802584

0.2498

T

Thymine

775123

0.2413

A

Adenine

834906

0.2599

Total

 

3212525

1.0000


 

Human Proteome

Entropy of Human proteome as a function of frame size.

frame

1

2

3

4

5

6

7

observed

4.191

8.356

12.505

16.613

20.402

22.508

22.954

random

4.322

8.644

12.965

17.279

21.423

23.446

23.633

 

The Human Proteome with 13,151,137 amino acids has low redundancy. The maximum entropy is 23.6487.

Applications

This has applications to computational genetics and could make a multidisciplinary research project for UNE. Some preliminary results of Bos taurus growth factor homologies can be found at http://mcs.une.edu.au/~cwatson7/I/Bos_taurus_growth.xls .

We use DNA repair genes of Deinococcus radiodurans to find homologous genes in mammalian species. These natural proteins can be used to develop human cancer vaccines.

Baseline comparison of the structure of model species is achieved by memory optimization.

We have detailed analysis of human chromosome 7, GTF2I gene, and protein homologies.

This process can be applied to pharmaceutical design, gene therapy, phylogenetics, evolution, molecular biology and quantitative genetics. Other application areas include data-mining, document retrieval, linguistics, plagiarism detection and automated generation of semantic web ontologies.

Work in progress:

  • polymorphic computer virus detection,
  • bacterial antibiotic resistance,
  • viral evolution,
  • homologous DNA repairing genes. Nucleotide excision repair proteins
  • Ankyrin repeats in SMART, PFAM and NRDB {TPLH}.{AA}..{GH}……{LL}..{GADVN}..{N}.
  • INK4 tumor suppressor family.


An Application for Plagiarism Detection

This algorithm can be applied to the problem of plagiarism detection. The candidate thesis/assignment is loaded into the dictionary. Then a linear complexity search is performed on all known test data. The algorithm provides efficient detection of verbatim plagiarism with an objective percentage measure of original content.

 

Redundancy occurs in English language text, computer files and non-coding DNA. This redundancy can be used to reverse engineering software, decrypt secret codes and identify individuals by their unique DNA fingerprint.

 

For DNA fingerprinting, tandem repeats can be identified by the recurring relative location of the pattern equal to the pattern length. For example, the string: {TCATCATCATC} matches the reference pattern {CATC} at locations 1, 4 and 7 with the relative locations 3, 3. Here follows a sequence of relative locations of the string {CAT} extracted from a sequence of non-coding human DNA. Notice the characteristic tandem repeat represented by “3 3 3 3 …” used to identify individual people and diagnose genetic disease.

 

293 27 9 22 17 20 6 20 19 3 3 3 3 3 3 3 3 3 3 6 3 3 142 6 46 80 80 44 70 39 129 58 144 58 62 69 93 128 89 4 6 40 121 5 24 44 13 21 139 110 33 60 37 11 14 61 73 41 25 12 36 11 325 59 46 3 65 10 15 84 68 54 195 182 33 67 15 50 127 46 51 13 120 115 109 68 9 6 20 72 96 41 79 24 79 67 9 51 24 3 43 8 37 4 10 101 326 11 22 25 23 25 28 44 50 19 55 65 108 33 243 4 8 136 66 45 141 19 51 101 14 18 32 21 56 10 35 37 12 16 8 126 7 361 150 13 76 3 3 62 4 61 39 34 47 3 3 163 70

 

We can deduce information structure without comparing large sequences. The relative location of repeats is sufficient to match large genetic structures. In the same way verbatim plagiarism can be efficiently detected by comparing the relative location of spaces and common letters such as ‘e’ and ‘t’. This technique is tolerant to the insertion and deletion of text after copying the original because most to the relative locations of patterns remain undisturbed.

 

Maximum entropy for random text increases linearly with the size of the string. We illustrate the high level of redundancy in English text by analysing the information content of Aesop’s Fables (http://www.gutenberg.org/). The total number of characters is 171580. The 3 most frequently occurring characters are: ‘t’ 12082, space 34390 and ‘e’ 16727. For strings of length 7, the maximum entropy of 29.76 is twice the observed entropy. The following table shows the observed and maximum entropy of all strings of length 1 to 7 in Aesop’s Fables.

 

References

Bayes, T. 1763, “An Essay towards solving a Problem in the Doctrine of Chances”. Philsophical Transaction of the Royal Society of London 53 (1763), 370-418.

 

Fredkin, E.  1960, “Trie Memory”. Communications of the ACM, 3(9):490-499, Sept. 1960.

 

Knuth, D. 1997, The Art of Computer Programming, Volume 3: Sorting and Searching, Third Edition. Addison-Wesley.

 

Shannon, C. E. 1948, “A mathematical theory of communication”, Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October, 1948.

 

Wikipedia. Retrieved regularly from http://en.wikipedia.org/wiki/

 

Conclusion

  1. An efficient algorithm
  2. Applicable to any type of data
  3. Learning (dictionary construction) is unsupervised
  4. This is a work in progress evolving as problems arise
  5. The Challenge for us now is to ask the right questions

 

Epilogue

“What has been will be again, what has been done will be done again; there is nothing new under the sun.” Ecclesiastes 1:9 circa 950BC