Chaos game representation

Chaos Game Representation is a graphical representation of a sequence. It is a Method of converting a long one- dimensional sequence (e.g. English text or Genetic sequence) into a graphical form.

Chaos game representation

Introduction

Chaos Game Representation is a graphical representation of a sequence. It is a Method of converting a long one-   dimensional sequence (e.g.  English text or Genetic sequence) into a graphical form. Chaos game representation (CGR) is a novel holistic approach that provides a visual image of a DNA sequence quite different from the traditional linear arrangement of nucleotides.Chaos Game Representation (CGR) can recognize patterns in the nucleotide sequences, obtained from databases, of a class of genes using the techniques of fractal structures and by considering DNA sequences as strings composed of four units, G, A, T and C. Such recognition of patterns relies on visual identification It is an application of non-random input to an iterated function system. The method was first proposed for DNA sequence, now it is being used for any arbitrary symbols. CGR exhibits striking pattern characteristics of the genome. Since the patterns are unique to genomes this can be used to identify genome fragments or to detect horizontal gene transfers. In addition, Quantitative analysis of CGR patterns can be used for sequence comparison both   Alignment-free and alignment-based sequence comparisons are possible with CGR.

Chaos

It is a word derived from the greek typically refers to unpredictability. In the meta physical sense, it is the opposite of law and order. Chaos refers to an apparent lack of order in a system that nevertheless obeys particular laws or rules. Mathematically chaos means an aperiodic deterministic behavior which is very sensitive to its initial conditions. A system is called chaotic if it is impossible to make accurate long term predictions about the behavior of the system.According to Ron Leadbetter, “Chaos is from the Greek word Khaos, meaning "gaping void". There are many explanations as to who or what Chaos is, but most theories state that it was the void from which all things developed into adistinctive entity, or in which they existed in a confused and amorphous shape before they were separated into genera. In other words, Chaos is or was "nothingness." .Important   properties of a chaotic system are,

  • Sensitive dependence on initial conditions.
  • The trajectory never repeats.
  • They are non linear.
  • The transition to chaos is preceded by infinite levels of bifurcation.

Chaos theory

Chaos theory is the collection of results, methods, and visualization techniques used to study dynamical systems. The name "chaos theory" comes from the fact that the systems that the theory describes are apparently disordered, but chaos theory is really about finding the underlying order in apparently random data. Systems - no matter how complex - rely upon an underlying order, and that very simple or small systems and events can cause very complex behaviors or events. Chaos theory attempts to explain how chaos is produced and how can it be understood and controlled. Chaos Theory is a name given to recent wide-ranging attempts to uncover the statistical regularity hidden in processes that otherwise appears random, such as turbulence in fluids, weather patterns, predator-prey cycles, the spread of disease, and even the onset of war. Systems described as "chaotic" are extremely susceptible or at risk to changes in initial conditions. As a result, small uncertainties in measurement are magnified over time, making chaotic systems predictable in principle but unpredictable in practice.

One of the first researchers in chaos theory was a meteorologist, Edward Lorenz. He was using a set of equations to model the weather. When he solved the system numerically, he found that a particle moving subject to atmospheric forces has a very complicated trajectory or path. He also found that any such a trajectory always approaches the same general pattern of an attractor. This pattern is butterfly shaped and is now known as the Lorenz attractor. The conclusion of these observations is that a very small change in initial conditions can produce unpredictable and sometimes drastic results by triggering a series of increasingly significant events. In theory, the flutter of a butterfly's wings in Australia could, for example, produce a snow storm in the Northeastern, thousands of miles away.

Thus,  non-linear dynamics , chaotic dynamical systems or simply ” chaos” deals with the structure of complex curves known as fractals, complex geometric shapes, which can be defined by simple algorithms. I.e., a fractal is the result of a mathematically calculated equation which can be transferred into an image to be shown. A dynamical system is a set of functions (rules or equations) that specify how variables change over time. A mathematical model of a human system where numerical quantities are used to represent the state of the system is an example of a dynamical system.

Usually, in a dynamical system, the changes of dynamical quantities after a short time period can be determined by combining the numerical quantities in a formula. The change over longer time periods can be determined by iterating the formula. Chaos theory deals with the behavior of certain non linear dynamical systems that exhibit the phenomenon known as chaos, and this phenomenon is highly sensitive to the initial conditions. Chaos theory tries to find the order in the systems, what appears to be completely random data. Even though, Chaos may look “random-like” it may occurs in deterministic systems with no randomness. The idea behind is that it is possible to get completely random results from normal equations. Thus, Chaos theory covers the reverse also. Chaos theory constructs:

· Phase Space: A conceptualization (often pictorial or geometric) of the possible states a system might take. It conveys that, at best, we see only a portion of “reality” at one time. And that will be always that part, on which we choose to focus.

· Strange Attractors and Basins of Attraction: Attractors are patterns in phase space to which values of variables settle into after transients die out. Strange attractors are focal points for the most challenging patterns generated by dynamical, chaotic systems. Basins of attraction are the areas containing those patterns within their boundaries.

· Fractals: Fractal boundaries (or simply fractals) are mathematical representations of the irregular "lines" of demarcation between separate units. Pictures result from iteration of non linear equations, usually in a feed back loop, a set of points produced. Graphing these points produces these images Fractals convey that reality is rarely as clear/clean cut as we picture it.

· Bifurcation/ Cascade: Bifurcation is a scientific way to say something splits in two—branches. If patterns bifurcate quickly enough, they can become complex very fast, leading to bifurcation cascade and chaos.

· Unpredictability: Unpredictability is the inability to state with certainty the next state (or, for that matter, the previous state) of a system given knowledge of its present state.

· Recursivity: It is self-reflexiveness, and self-relectiveness, feeding information from one’s patterns back into the process of producing them.

· Equilibrium: Equilibrium is the tendency or inertia of system not to change its patterns by staying near or returning to points of attraction. Patterns change significantly and most unpredictably in far from equilibrium in chaotic systems.

· Self-affinity: It denotes the tendency for phenomena to evidence recurring patterns. The affinity can be over size, time, different angles, or other ways more difficult to see or to grasp, for example, by the process that generates them or probabilistically.

Dynamical systems

The dynamic of any situation refers to how the situation changes over the course of time. A dynamical system is a physical setting together with rules for how the setting changes or evolves from one moment of time to the next. Dynamical system is a system in which the situation changes over the course of time. Physical settings along with a set of functions/rules that specifies how variables changes over time in a dynamical system. The current state of a dynamical system is specified by the current values of its variables. The system state at time t is an instantaneous description of the system which is sufficient to predict the future states of the system without recourse to states prior to t. The evolution of the system state refers to a sequence or continuous trajectory through the space of possible system states. E.g.: Pendulum, Mathematical model of human system, where numerical quantities represent the state of system, some varies with time while others remains fixed.

playing the chaos game with rolls of red, green, blue, blue.

Now continue in this fashion for a small number of rolls of the die. Five rolls are sufficient if you are playing the game “by hand'' or on a graphing calculator, and eight are sufficient if you are playing on a high-resolution computer screen. (If you start with a point outside the triangle, you will need more of these initial rolls.)

After a few initial rolls of the die, begin to record the track of these traveling points after each roll of the die. The goal of the chaos game is to roll the die many hundreds of times and predict what the resulting pattern of points will be. Most people who are unfamiliar with the game guess that the resulting image will be a random smear of points. Others predict that the points will eventually fill the entire triangle. Both guesses are quite natural, given the random nature of the chaos game. But both guesses are completely wrong. The resulting image is anything but a random smear; with probability one, the points form what mathematicians call the Sierpinski triangle .

The Sierpinski triangle [14]

In the Figure top triangle is in RED, lower left triangle in GREEN, and lower right triangle in BLUE.         We have used color merely to indicate the proximity or closeness of the vertex with the given color. For example, the portion of the triangle closest to the green vertex is colored green, and so forth.

The sequence of points generated by the chaos game is called the orbit of the seed. The process of repeating the rolls of the die and tracing the resulting orbit is called iteration. Iteration is important in many areas of mathematics. In fact, the branch of mathematics known as discrete dynamical systems theory is the study of such iterative processes.

There are two remarkable facets of the chaos game.

1.       The Sierpinski triangle is one of the most basic types of geometric images known as fractals.

2.       The second is the fact that this figure results no matter what seed is used to begin the game

With probability one, the orbit of any seed eventually fills out S. The words “with probability one'' are important here. Obviously, if we always roll ``red,'' the orbit will simply tend directly to the red vertex. Of course, we do not expect a fair die to yield the same two numbers at each roll.

The following figures  shows Sierpinski triangles in each iterations:

First recursion of Sierpinski triangle
Second recursion of Sierpinski triangle
Third recursion of Sierpinski triangle
fourth recursion of Sierpinski triangle

CGR - A brief history

Chaos Game Representation (CGR) was proposed as a scale-independent representation for genomic sequences by Jeffrey in 1990 (Jeffrey, H. J. 1990). In his method of representing gene structure using Chaos game,    H. Joel Jeffrey discovered some characteristic patterns presented below:-

The CGR of Human Beta Globin   exhibhits a characteristic pattern (almost empty area in the upper right quadrant (the g-quadrant). A smaller copy of this scoop appears in the left (c-quadrant), presenting a double scoop appearance. Also the CGR exhibhits a property of self-similarity, a concept  very important in the study of fractals and chaotic dynamics. Also it is stated that the features found in CGR of Human Beta Globin  are found in a number of other genetic sequences.

CGR of Human Beta Globin Region on Chromosome 11 [1]

The results come out can be summarized as:

  1. An empty area in the upper right quadrant corresponds to a comparative sparseness on guanine.
  2. Any base is plotted somewhere in the quadrant with its label. Any base is always plotted half the way towards its corner.
  3. Copies of double scoop one in T and one in A quadrant-exhibiting a property of self similarity.
  4. This kind of pattern is found only in Vertebrate sequences.

In his paper Nick Goldman confirms a double scoop pattern in almost all vertebrate DNA sequences and in the CGR of human T-cell lymphotropic virus (type III) genome and other human viruses is entirely attributable to a scarcity of CG dinucleotides The biological reasons for this scarcity are well-understood, being the selective disadvantage of CG dinucleotides which are prone to methylation and subsequent Mutation. Indeed, it must be doubtful whether more-complex patterns of information in DNA sequences will be visible in CGRs, since simpler patterns will tend to obscure more complex ones.

Therefore the possibility of using CGR for revealing patterns in genes can be extended to identify a pattern for Exon and Intron sequences. Later, three years after the original proposition; it was demonstrated and interpreted the translation of CGR quadrant frequencies into oligonucleotide frequencies. It was a good indication that CGRs can be more useful than simple evaluation of nucleotide, dinucleotide and trinucleotide frequencies’ (Goldman, 1993). So for several years, the possibility that the CGR format can be used for representing the nucleotide sequence as well as identifying the resulting sequence scheme had not been fully explored. In 2001 Almeida proved that the distributions of positions in the CGR plane are generalization of Markov chain probability tables that accommodates non-integer orders (Almeida 2001). This brought CGR to the main stream again.  

In CGR representation for promoter, quadrants are taken as A is at position (0,0) G at position (1,1) T at position (0,1) and C at (1,0) which is different way of usual representation, i.e., A at(0,0) G at (0,1) C at(1,0) and T at(1,1) . The reason for this variation is the AT rich and GC rich nature of the promoter region. The increase in their concentrations is represented as diagonals and can be easily observable. But in the standard case it is represented as bands and cannot be observed prominently.

CGR of nucleotide sequence

In CGR the four nucleotides A, G, C, and T are assigned to the corners of a square as in the given figure

CGR Plot with Each Corner Assigned to Nucleotides

Nucleotide A is at position (0,0) G at position (1,0) T at position (1,1) and C at (0,1).  Any arbitrary nucleotide is represented as one point in the square. The first point plotted halfway between the centre of the square and the corner corresponding to the first nucleotide of the sequence and successive points plotted halfway between the previous point and the corner corresponding to the base of each successive sequence site.

Every point of a CGR is a representation of sequence up to that position.  For example in the CGR of the sequence “ATTGCAGGCT” the sixth point represents the sequence “ATTGCA”.  Thus there is a on to one correspondence between the subsequences and the points in the CGR.   Since a base is always plotted in its quadrant, any sequence will always be plotted somewhere in the quadrant of its last base, and conversely any two points in the same quadrant must have the same last base.

Important features of CGR

Some of the striking features of CGR includes,

1.  Any visible pattern in the CGR corresponds to some pattern in the sequence base.

2. The kth point plotted on the CGR of a sequence corresponds to the first K long initial subsequences of the sequence and no other subsequence.

3.  A CGR square is subdivided in to four sub squares each corresponds to a particular nucleotide. This is shown in the following figure.

Depicts the sub quadrants of the CGR

From the above figure, we can learn that,

· The Bottom Left sub square stands for the nucleotide A. All points coming in this region represents a unique DNA sequence which ends in A.

· Bottom right sub square is owned by G nucleotide. So all points which fall in this region represents a unique DNA sequence that ends with the nucleotide G.

· Top left sub sub square is for Nucleotide C and is for the nucleotide sequence with last element C.

· Top Right sub square corresponds to the nucleotide T. Every point in this region uniquely identifies a DNA sequence which ends in T.

·  The vertex (0, 0) stands for the nucleotide sequence ‘A’. Similarly Vertex (0, 1) is for the sequence ‘G’, vertex (1, 0) for ‘C’ and vertex (1, 1) for the sequence ‘T’.

4.  Each base gives a point in the quadrant labeled with that base; the sub-quadrant is determined by the preceding base, the sub-sub-quadrant (e.g. 'TAG') by the base preceding that, etc. This correspondence continues recursively.

5.  If two points are within the same quadrant, they correspond to sequences with same last base; if they are in the same sub quadrant, the sequences have the same last two bases; if they are in the same sub- sub-quadrant they have the same last three bases, etc. The idea is depicted using the following figure.

Depicts the correspondence between oligonucleotides and areas of the CGR of DNA sequences

6. The number of different n-mers present in the sequence can be found by dividing the CGR square into sub-squares of side 2-n.  1-mer represents oligonucleotide with only one base, 2-mer represents oligonucleotide with two bases and so on.  The notion of quadrant is recursive; each quadrant can be divided into quadrants, etc. as shown in Figure (Goldman N. 1993) .

Distribution of n-mers in CGR quadrants and sub quadrants

7. There is no DNA associated with the centre of the square. So the centre of the square represents a null DNA sequence.

8. Adjacent bases in the sequence are not plotted adjacent to each other: Being close in the CGR does not mean being close in the sequence.

9.  We can trace the path of each point corresponding to each nucleotide in the   given sequence, to plot the CGR. We can draw a path from the final vertex where we plot the last nucleotide, passing through all the intermediate vertices corresponding to the successive nucleotides, to the centre. This path will be unique for individual DNA sequence. The length of the path indicates the size of the corresponding nucleotide sequence. The figure given below shows the path drawn for the sample sequence TACAGA.

plotting the sequence in CGR

10. CGR points can be generated by an iterated function system defined by the following equations,

Xi   = 0.5 (Xi-1+ gix)

Yi    = 0.5 (Yi-1 +giy)

Where, gix and giy correspond to the X and Y co-ordinates of the nucleotide at position i in the sequence.

11. The CGR has a fractal nature. A fractal is a geometric object which is rough or irregular on all scales of length, and so which appears to be 'broken up' in a radical way. Fractals are said to possess infinite detail, and they may actually have a self-similar structure that occurs at different levels of magnification. The plot of entire genome of an organism and 100Kb of the genome will have the same pattern for the CGR. Because of this property CGR can treated as genome signatures. Horizontal transfers in genomes can be easily identified using CGRs. It can also be used to identifying unknown DNA fragments.CGR pattern of a fraction of human Collagen shows the same pattern as the entire sequence of Human collagen. The following two figures, captured from Chaos Game Explorer, describe this fractal nature of the CGR.

CGR of entire sequence of Human Collagen
CGR of a fraction of Human Collagen

In the above two figures, both have the same pattern. The only thing which differs is the point density. The first one is highly dense as it depicts the entire sequence while the second one is less dense as it covers only partial sequence of the same. This is one of the striking features of CGR, and it is due to the fractal nature of it. Because of this property, CGR can treat as genome signatures.

CGR of proteins

CGR can be generalized and applied for visualizing and analyzing protein databases. In the paper “Chaos game representation of Protein structures” CGR can be extended in such a way that it becomes applicable for visualizing and analyzing both the primary and secondary structures of proteins.Examples of applications will be presented for investigating regularities and motifs in the primary structure of proteins and for analyzing possible structural attachments on the super secondary structure level of proteins. For a less detailed structure description, with reference to helix, sheet turn and random coil structures are used for characterizing the polypeptide structure. The original CGR suggested can be used by replacing the four nucleotides with the four secondary structure elements at the vertices of the square. A random sequence would result in a filled square and this attracter indicates the non randomness of the structural elements in proteins. Work has been carried out by plotting in triangles to represent 3D structures.Thus, CGR can be extended in such a way that it becomes applicable for visualizing and analyzing the amino acid sequence in a protein. This is achieved by dividing the 20 amino acids into 4 groups and represents each group in each quadrant. Different properties of amino acids can be used to divide the amino acids into 4 groups like, Hydrophobicity, Molecular weight, PI values,Alpha propensity  and Beta propensity etc. Grouping is performed based on these values .For example according to the hydrophobic scales the amino acids can be grouped as,

Different amino acid groups

Then each group is placed at the four corners of the square and applying the same principle of plotting, we will get the CGR for amino acid sequence.The CGR plot of a random amino acid sequence of length 1500 according to hydrophobic scale is depicted below,

(Group3)                          (Group 4)

                                                                                    (Group 1)                         (Group 2)

Applications

  • Characterization and Classification of Species.
  • Identify and compare different organisms.
  • Visualizing and analyzing Genomic Sequences.
  • To locate specific regions such as Intron, exon, promoter sites,transcription factor binding sites etc.
  • Visualizing and analyzing protein sequences like, Protein Pattern Identification
  • The plot of entire genome of an organism and 100Kb of the genome will have the same pattern for the CGR. Because of this property the CGR can be treated as Genome signatures.
  • It can be used to identify unknown DNA fragments.
  • Horizontal transfers in the genome can easily be identified.

Ongoing works

CGR on genes:

  1. Characterization and Classification of Species by Chaos Game Representation of Genomic Sequences.
  2. Using CGRs    to    locate specific   regions    such as promoter sites, transcription factor binding sites etc.

CGR on proteins:

  1. Visualizing and analyzing protein sequences.
  2. investigating regularities and motifs in the primary structure of proteins
  3. For analyzing possible structural attachments on the super secondary structure level of proteins.
  4. For a less detailed structure description, with reference to helix, sheet turn and random coil structures are used for characterizing the polypeptide structure, etc.

References

1.    Jeffrey, H. J. Chaos game representation of gene structure. Nucleic Acids Res. 18:2163–2170, 1990
2.    Almeida, J.S, “Analysis of genomic sequences by chaos game representation”. Bioinformatics 17: 429-437, et al.2001.
3.    Deschavanne P J, Giron A, Vilain J, Fagot G, Fertil B: Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol., 16:1391-1399, 1999
4.    Goldman N., Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res, 21:2487-2491, 1993
5.    Jijoy Joseph and Roschen Sasikumar, Chaos game representation for comparison of whole genomes BMC Bioinformatics, 7:243  2006
6.    Manus.J.Donahue,An introduction to Chaos theory and Fractal Geometry, 1:8, 1997
7.     Peggy Cenac, Test on the structure of Biological sequences via Caos Game representation , 1:4,2005
8.     A. Fiser, G. E. Tusnady, and I. Simon, “Chaos game representation of
9.    protein structures,” J. Mol. Graph., vol. 12, no. 4, 302–304, 1994
10.    Patrick.J.Deschavenne, ALIANgIRON,Joseph Vilain, Guillaume Fagot, Bernard Fetil , Genomic Signature: Charecterization and Classification of Species assessed by Chaos Game Representation, 1391:1399, (1999)
11.    A. Roy, C. Raychadhury A .Nandy , Novel Techniques of graphical representation and analysis of DNA Sequences-A review,  55:69, (1998)
12.    Manus.J.Donahue, An introduction to Chaos theory and Fractal Geometry, 1:8, (1997)
13.    Alexey Pasechnik1, Aleksandr Myll¨ari2 and Tapio Salakoski2, Dynamical visualization of the dna sequence and its nucleotide Content, 1:4, (2002).
14.    www.Wikipedia.com
15.    https://wiki.umn.edu/view/NicMcPhee/SierpinskiTriangleInScheme