In biological systems the instantiation of information is DNA, but what is this information about? To some extent, it is the blueprint of an organism and thus information about its own structure. More specifically, it is a blueprint of how to build an organism that can best survive in its native environment, and pass on that information to its progeny. This view corresponds essentially to Dawkins' view of selfish genes that "use" their environment (including the organism itself) for their own replication (18). Thus, those parts of the genome that do correspond to something (the non-neutral fraction, that is) correspond in fact to the environment the genome lives in. Deutsch (19) referred to this view by saying that "genes embody knowledge about their niches." This environment is extremely complex itself: it consists of the ribosomes the messages are translated in, the other chemicals and the abundance of nutrients inside and outside the cell, and the environment of the organism proper (e.g., the oxygen abundance in the air as well as ambient temperatures), among many other factors. An organism's DNA thus is not only a "book" about the organism, but is also a book about the environment it lives in, including the species it co-evolves with.

It is well known that not all of the symbols in an organism's DNA correspond to something. These sections, sometimes referred to as "junk DNA," usually consist of portions of the code that are unexpressed or untranslated (i.e., excised from the mRNA). More modern views concede that unexpressed and untranslated regions of the genome can have a multitude of uses, such as, for example, satellite DNA near the centromere, or the polyC polymerase intron excised from Tetrahymena rRNA. In the absence of a complete map of the function of each and every base pair in the genome, how can we then decide which stretch of code is "about something" (and thus contributes to the complexity of the code) and which is entropy (i.e., random code without function)?
A true test of whether a sequence represents information uses the success (fitness) of its bearer in its environment, which implies that a sequence's information content is conditional on the environment in which it is to be interpreted (4). Accordingly, Mycoplasma mycoides, for example (which causes pneumonia-like respiratory illnesses), has a complexity of somewhat less than one million base pairs in our nasal passages, but close to zero complexity almost everywhere else, because it cannot survive in any other environment, meaning its genome does not correspond to anything there. A genetic locus that codes for information essential to an organism's survival will be fixed in an adapting population, because all mutations of the locus result in the organism's inability to promulgate the tainted genome, whereas inconsequential (neutral) sites will be randomized by the constant mutational load. Examining an ensemble of sequences large enough to obtain statistically significant substitution probabilities would thus be sufficient to separate information from entropy in genetic codes. The neutral sections, which contribute only to the entropy, turn out to be exceedingly important for evolution to proceed, as has been pointed out, for example, by Maynard Smith (20).
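This ensemble test can be made concrete with a short sketch. The Python fragment below (the toy population and the 0.9 fixation threshold are illustrative assumptions, not part of the original analysis) tallies the base frequencies at each site of an aligned population and flags each site as conserved (information) or randomized (entropy):

```python
from collections import Counter

# Hypothetical aligned population of equilibrated genomes (one string per genome).
population = ["ACGT", "ACAT", "ACTT", "ACCT", "ACGT", "ACAT"]

for site in range(len(population[0])):
    counts = Counter(seq[site] for seq in population)
    # A site whose most frequent base is near fixation carries information;
    # a site whose bases have been randomized by mutation contributes only entropy.
    dominant = counts.most_common(1)[0][1] / len(population)
    label = "conserved (information)" if dominant > 0.9 else "randomized (entropy)"
    print(site, dict(counts), label)
```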
In Shannon’s information theory (22), the quantity entropy (H) represents the expected number of bits required to specify the state of a physical object given a distribution of probabilities; that is, it measures how much information can potentially be stored in it. In a genome, for a site i that can take on four nucleotides with probabilities
\{p_C(i), p_G(i), p_A(i), p_T(i)\},  [1]
the entropy of this site is
H_i = -\sum_{j=C,G,A,T} p_j(i) \log p_j(i).  [2]
The maximal entropy per site (if we agree to take our logarithms to base 4, i.e., the size of the alphabet) is 1, which occurs when all of the probabilities are equal to 1/4. If the entropy is instead measured in bits (logarithms to base 2), the maximal entropy per site is two bits, which is naturally also the maximal amount of information that can be stored at a site, as entropy is just potential information. A site stores maximal information if, in DNA, it is perfectly conserved across an equilibrated ensemble. Then we assign the probability p = 1 to one of the bases and zero to all others, rendering H_i = 0 for that site according to Eq. 2. The amount of information per site is thus (see, e.g., ref. 23)
I(i) = H_{\max} - H_i.  [3]
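As a minimal sketch of Eqs. 2 and 3 (assuming only that per-site base frequencies can be estimated from an aligned ensemble; the function names are ours):

```python
import math
from collections import Counter

def site_entropy(column, alphabet="ACGT"):
    """Per-site entropy H_i of Eq. 2, logarithms to base 4, estimated
    from the base frequencies observed in an aligned ensemble."""
    n = len(column)
    counts = Counter(column)
    h = 0.0
    for base in alphabet:
        p = counts[base] / n
        if p > 0.0:                  # lim_{p->0} p log p = 0
            h -= p * math.log(p, 4)
    return h

def site_information(column):
    """Per-site information I(i) of Eq. 3, with H_max = 1 in base-4 units."""
    return 1.0 - site_entropy(column)

print(site_information("AAAA"))  # perfectly conserved site: I = 1 (maximal)
print(site_information("ACGT"))  # equiprobable site: I = 0 (pure entropy)
```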
In the following, we measure the complexity of an organism's sequence by applying Eq. 3 to each site and summing over the sites. Thus, for an organism of \ell base pairs the complexity is

C = \ell - \sum_i H(i).  [4]
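Eq. 4 then amounts to summing per-site entropies over an alignment; a minimal sketch, again assuming a toy ensemble of aligned genomes:

```python
import math
from collections import Counter

def site_entropy(column):
    """Per-site entropy of Eq. 2, logarithms to base 4."""
    n = len(column)
    return -sum((c / n) * math.log(c / n, 4) for c in Counter(column).values())

def complexity(population):
    """Approximate complexity of Eq. 4: sequence length minus the
    summed per-site entropies, estimated from an ensemble of genomes."""
    length = len(population[0])
    return length - sum(
        site_entropy([g[i] for g in population]) for i in range(length)
    )

# Toy ensemble: sites 0, 1, and 3 are fixed; site 2 is fully randomized.
population = ["ACGT", "ACAT", "ACTT", "ACCT"]
print(complexity(population))  # 3.0: three conserved sites' worth of information
```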
It should be clear that this value can only be an approximation to the true physical complexity of an organism's genome. In reality, sites are not independent, and the probability of finding a certain base at one position may be conditional on which base is found at another position. Such correlations between sites are called epistatic, and they can render the entropy per molecule significantly different from the sum of the per-site entropies (4). This entropy per molecule, which takes into account all epistatic correlations between sites, is defined as
H = -\sum_g p(g|E) \log p(g|E)  [5]
and involves an average over the logarithm of the conditional probabilities p(g|E) of finding genotype g given the current environment E. In every finite population, estimating p(g|E) from the actual frequencies of the genotypes in the population (if those could be obtained) results in corrections to Eq. 5 larger than the quantity itself (24), rendering the estimate useless. Another avenue for estimating the entropy per molecule is the creation of mutational clones at several positions at the same time (7, 25) to measure epistatic effects. The latter approach is feasible in experiments with the simple ecosystems of digital organisms that we introduce in the following section, and it reveals significant epistatic effects. The technical details of the complexity calculation, including these effects, are relegated to the Appendix.
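To see how epistatic correlations separate Eq. 5 from the per-site sum underlying Eq. 4, consider a deliberately contrived ensemble in which two sites always covary (the population is our own illustrative assumption):

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy (logarithms to base 4) of any sequence of symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log(c / n, 4) for c in Counter(symbols).values())

# Toy ensemble with one epistatic pair: sites 0 and 1 always covary (AA or CC),
# so each site looks half-random on its own, but the pair carries only a
# single degree of freedom.
population = ["AA", "CC", "AA", "CC"]

sum_per_site = sum(entropy([g[i] for g in population]) for i in range(2))
per_molecule = entropy(population)  # each whole genotype g is one symbol (Eq. 5)

print(sum_per_site)   # 1.0: the per-site sum overestimates the entropy
print(per_molecule)   # 0.5: the correlation halves the per-molecule entropy
```

In a real population, of course, the naive frequency estimate used here suffers exactly the finite-sample corrections noted above, which is why the mutational-clone approach is the practical alternative.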