Presentation
in
C27_B251, MSCS building, UNE, Armidale, NSW,
on
Friday, 23rd February 2007 from 12 to 1pm.
by
Dr. Charles R. Watson
School of Maths, Stats and Computer Science
University of New England, Armidale NSW 2351
URL: http://mcs.une.edu.au/~cwatson7/I/ConditionalEntropy.html
Slides: http://mcs.une.edu.au/~cwatson7/I/ConditionalEntropy.ppt
“This process of constructing instruction tables should be
very fascinating. There need be no real danger of it ever becoming a drudge,
for any processes that are quite mechanical may be turned over to the machine
itself.” Alan Mathison Turing, OBE (1945).
The
software described here was developed on turing.une.edu.au
and bombe.une.edu.au.
We describe an algorithm for searching large data sets, by example,
using an inverted trie (coined by
This is more than mere motif matching. The entire structure
of the reference data is encapsulated by the dictionary. Patterns are
discovered by this homology-by-example paradigm. Tandem repeats are easily
identified by the recurring relative location of the patterns equal to the
pattern length. For example, the string: {TCATCATCATC} records the pattern
{CATC} at locations 1, 4 and 7 with the relative
locations 3, 3.
, where
is the probability of a random variable
.
Shannon's definition
of entropy is closely related to thermodynamic entropy as defined in physics and chemistry. The entropy of genomic data can be
calculated from the distribution of repeated sequences and provides a standard
measure of randomness or structure. Repeated sequences indicate homology
between genes and species and can infer protein folding structure. This
analysis is applied to intron phases, protein, mRNA,
coding and non-coding DNA data. Independent fair coin flips have an entropy of 1 bit per flip. A source
that always generates a long string of A's has an entropy of 0, since the next
character will always be an 'A'.
Entropy
effectively bounds the performance of the strongest lossless (or nearly
lossless) compression possible, which can be realized in theory by using the typical set
or in practice using Huffman, Lempel-Ziv or arithmetic
coding. The performance of existing data compression algorithms is often
used as a rough estimate of the entropy of a block of data.
Although entropy
is often used as a characterization of the information content of a data
source, this information content is not absolute: it depends crucially on the
probabilistic model. A source that always generates the same symbol has an
entropy of 0, but the definition of what a symbol is depends on the alphabet.
Consider a source that produces the string CATCATCATCATCATCAT in which C is always followed by A and T. If the
probabilistic model considers individual letters as independent, the entropy of the sequence
is 1.585 bit per character. But if the sequence is considered as a sequence of
the amino acid Histidine
"HHHHHH...", then the entropy is 0.
To search for a string
of length N we traverse the {N-1} length suffix list until we match the prefix
character. The maximum number of attempted matches is the size of the alphabet;
half for the average number of attempts. We iteratively find all strings by
using the last calculated suffix list to find the next string. This is an
online algorithm, in that the dictionary is up-to date before reading the next
character from the input stream. Once the dictionary (reference or training
data) is complete or the is no more virtual memory to extend the dictionary the
classifier continues, ad infinitum, with linear complexity on the length
of the input stream.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER
OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT
THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
as the average of the entropy of y for each value of x,
weighted according to the probability of getting that particular x. This
quantity measures how uncertain we are of y on the average when we know x.
That is ![]()
The
uncertainty (or entropy) of the joint event x, y is the uncertainty of x plus the uncertainty of y when x
is known.
![]()
Hence ![]()
Conditional probability is the probability of some event
Y, given the occurrence of some other event X. Conditional probability
is written P(Y|X), and is read
"the probability of Y, given X".
Conditional entropy can be used as a classifier, like Bayesian inference. Thomas Bayes (c. 1702 – April 17,
1761) formulated a special case of Bayes'
theorem, which was published posthumously.
If
= 1/N for i=1,N
then E = log N
For any skewed
(non-uniform) distribution E < log N
|
|
F:1 |
|
F:2 |
|
F:3 |
|
F:4 |
|
F:5 |
|
F:6 |
|
F:7 |
|
|
|
l |
5258 |
an |
2199 |
ing |
978 |
The_ |
566 |
nd_th |
320 |
___The |
247 |
tration |
178 |
|
|
d |
5996 |
in |
2207 |
ng_ |
1009 |
d_th |
596 |
and_t |
323 |
of_the |
248 |
stratio |
178 |
|
|
r |
7558 |
_h |
2280 |
to_ |
1040 |
____ |
597 |
_____ |
324 |
ation_ |
253 |
ation__ |
183 |
|
|
s |
8014 |
n_ |
2495 |
_to |
1089 |
_The |
617 |
_you_ |
337 |
nd_the |
274 |
ration_ |
184 |
|
|
i |
8604 |
t_ |
2991 |
___ |
1126 |
his_ |
652 |
tion_ |
337 |
_of_th |
279 |
of_the_ |
206 |
|
|
n |
8945 |
s_ |
3240 |
and |
1295 |
_of_ |
831 |
n_the |
342 |
and_th |
286 |
_____Th |
237 |
|
|
h |
9031 |
_a |
3694 |
ed_ |
1440 |
ing_ |
894 |
_with |
350 |
n_the_ |
291 |
____The |
237 |
|
|
a |
10481 |
th |
3935 |
_an |
1451 |
_to_ |
957 |
d_the |
487 |
__The_ |
296 |
nd_the_ |
242 |
|
|
o |
10565 |
d_ |
3997 |
nd_ |
1484 |
_and |
1207 |
_his_ |
524 |
_that_ |
297 |
___The_ |
246 |
|
|
t |
12082 |
he |
4493 |
the |
2697 |
and_ |
1242 |
_The_ |
564 |
_with_ |
310 |
_of_the |
248 |
|
|
e |
16727 |
_t |
4699 |
he_ |
2986 |
the_ |
1954 |
_and_ |
1207 |
_and_t |
318 |
and_the |
256 |
|
|
_ |
34390 |
e_ |
6430 |
_th |
3130 |
_the |
2457 |
_the_ |
1952 |
d_the_ |
416 |
_and_th |
284 |
|
Entropy |
|
4.25 |
|
7.65 |
|
10.26 |
|
12.20 |
|
13.66 |
|
14.75 |
|
15.55 |
|
MaxE |
|
4.25 |
|
8.50 |
|
12.75 |
|
17.01 |
|
21.26 |
|
25.51 |
|
29.76 |
Maximum entropy for random text increases linearly. So, for strings of length 7,
the maximum entropy of 29.76 is twice the observed entropy. This illustrates
the high level of redundancy in English text. We also see redundancy in
computer files and non-coding introns. However coding
DNA has little redundancy. For mRNA in the 7th human chromosome the single
nucleotide entropy is 1.9994.
The frequency distribution of nucleotides in the 7th human
chromosome
|
Symbol |
Count |
Probability |
|
|
C |
799912 |
0.2490 |
|
|
G |
802584 |
0.2498 |
|
|
T |
775123 |
0.2413 |
|
|
A |
834906 |
0.2599 |
|
|
Total |
|
3212525 |
1.0000 |


|
frame |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
|
observed |
4.191 |
8.356 |
12.505 |
16.613 |
20.402 |
22.508 |
22.954 |
|
random |
4.322 |
8.644 |
12.965 |
17.279 |
21.423 |
23.446 |
23.633 |
The Human
Proteome with 13,151,137 amino acids has low redundancy. The maximum entropy is
23.6487.
This has applications to
computational genetics and could make a multidisciplinary research project for
UNE. Some preliminary results of Bos taurus growth factor homologies
can be found at http://mcs.une.edu.au/~cwatson7/I/Bos_taurus_growth.xls
.
We use DNA repair genes
of Deinococcus radiodurans
to find homologous genes in mammalian species. These natural proteins can be
used to develop human cancer vaccines.
We have detailed
analysis of human chromosome 7, GTF2I gene, and protein homologies.
This process can be
applied to pharmaceutical design, gene therapy, phylogenetics,
evolution, molecular biology and quantitative genetics. Other application areas
include data-mining, document retrieval, linguistics, plagiarism detection and
automated generation of semantic web ontologies.
Work in progress:
This
algorithm can be applied to the problem of plagiarism detection. The candidate
thesis/assignment is loaded into the dictionary. Then a linear complexity
search is performed on all known test data. The algorithm provides efficient
detection of verbatim plagiarism with an objective percentage measure of
original content.
Redundancy
occurs in English language text, computer files and non-coding DNA. This
redundancy can be used to reverse engineering software, decrypt secret codes
and identify individuals by their unique DNA fingerprint.
For DNA
fingerprinting, tandem repeats can be identified by the recurring relative
location of the pattern equal to the pattern length. For example, the string:
{TCATCATCATC} matches the reference pattern {CATC} at locations 1, 4 and 7 with
the relative locations 3, 3. Here follows a sequence of relative locations of
the string {CAT} extracted from a sequence of non-coding human DNA. Notice the
characteristic tandem repeat represented by “3 3
3 3 …” used to
identify individual people and diagnose genetic disease.
293 27 9 22
17 20 6 20 19 3 3 3 3 3 3 3 3 3 3 6 3 3 142 6 46 80 80 44 70 39 129 58 144 58 62 69 93 128 89 4 6 40 121 5 24
44 13 21 139 110 33 60 37 11 14 61 73 41 25 12 36 11 325 59 46 3 65 10 15 84 68
54 195 182 33 67 15 50 127 46 51 13 120 115 109 68 9 6 20 72 96 41 79 24 79 67
9 51 24 3 43 8 37 4 10 101 326 11 22 25 23 25 28 44 50 19 55 65 108 33 243 4 8
136 66 45 141 19 51 101 14 18 32 21 56 10 35 37 12 16 8 126 7 361 150 13 76 3 3 62 4 61 39 34 47 3 3 163 70
We can
deduce information structure without comparing large sequences. The relative
location of repeats is sufficient to match large genetic structures. In the
same way verbatim plagiarism can be efficiently detected by comparing the
relative location of spaces and common letters such as ‘e’ and ‘t’. This technique is tolerant to the
insertion and deletion of text after copying the original because most to the
relative locations of patterns remain undisturbed.
Maximum
entropy for random text increases linearly with the size of the string. We
illustrate the high level of redundancy in English text by analysing
the information content of Aesop’s Fables (http://www.gutenberg.org/).
The total number of characters is 171580. The 3 most frequently occurring
characters are: ‘t’ 12082, space 34390 and
‘e’ 16727. For strings of length 7, the maximum entropy of 29.76 is
twice the observed entropy. The following table shows the observed and maximum
entropy of all strings of length 1 to 7 in Aesop’s Fables.
Bayes, T. 1763, “An Essay towards solving a
Problem in the Doctrine of Chances”. Philsophical
Transaction of the Royal Society of
Fredkin, E. 1960, “Trie Memory”.
Communications of the ACM,
3(9):490-499, Sept. 1960.
Knuth,
D. 1997, The Art of Computer Programming,
Volume 3: Sorting and Searching, Third Edition. Addison-Wesley.
Wikipedia. Retrieved regularly from http://en.wikipedia.org/wiki/
“What has been will be again, what has been done will be done again; there is nothing new under the sun.” Ecclesiastes 1:9 circa 950BC