The machine code of living things

You have arrived at this post with a single click. There is a lot of information here, not to mention on other sites, for example your Facebook news feed, where you probably found this link. Although Facebook is vastly more complex than early webpages or even early computer programs, you navigate these sites and programs as if it were second nature. But if you think about it, a computer does nothing other than process a flow of zeros and ones. The same is true for the twitch of your finger executing the click that brought you here, which is just an orchestrated stream of electric pulses in your brain. At their core, computers and living things share a very similar design principle. In this post, we shall take a trip to the domain of living things, seeing them through the eyes of an engineer.

First, we will get our magnifying glass and take a deeper look at computers. Let's see what a seemingly very easy task – multiplication – looks like at the deepest level of a computer.

1. If you think you know how to multiply, think again.

For a brief moment, let's recall how the Central Processing Unit (CPU) of a computer works.

Schematic of a CPU. Source: Wikipedia

In this picture you can see the block diagram of a simple CPU. The black arrows indicate the data flow between components. Suppose you have two integers, 5 and 16, and you want them multiplied. The numbers are stored in the main memory. The CPU takes these numbers and does the computation with the help of the arithmetic logic unit, storing partial results in the registers. The arithmetic logic unit is helpless in itself (mostly it just performs bitwise logical operations on the data given to it); it needs to be directed by the control unit. In the diagram above, the red arrows represent control flow. You can see that the control unit is the overlord. It does its job by using instructions from a predefined instruction set. The instruction set differs between CPU architectures, but the essence is similar. A program written using these basic instructions is called machine code. The code is executed sequentially.

Suppose that our CPU has the following (completely made up and probably quite inefficient) simple instruction set.
ADD (reg1) (reg2): adds the value in register 'reg2' to the value in register 'reg1'. (If 'reg2' is not a register but an integer N, it just adds N to 'reg1'.)
CMP (reg1) (reg2) (flagreg): if the value in 'reg1' equals the value in 'reg2', sets 'flagreg' to 1, otherwise sets 'flagreg' to 0
CPY (reg1) (reg2): copies the value of register ‘reg2’ to register ‘reg1’
IF (flagreg) (statement): if flagreg = 1, executes the statement
GOTO (label): jumps to the given line of the code
SET (reg1) (N): sets the register ‘reg1’ to the value N
SUB (reg1) (reg2): same as ADD, but with subtraction

To see the wonder in something as mundane as opening a new tab in your browser, try to come up with a program that multiplies two positive integers using only these instructions. If you are stuck, this flowchart might help.

The actual program MUL accepts two positive integers n and m as input in the registers IN1 and IN2, respectively. The output, n*m, ends up in the register OUT.

01 CPY REG1 IN1
02 CPY REG2 IN1
03 CPY TEMP IN2
04 SET TARGET_VALUE 1
05 CMP TEMP TARGET_VALUE FLAG
06 IF FLAG GOTO 12
07 ADD REG1 REG2
08 SUB TEMP 1
09 CMP TEMP TARGET_VALUE FLAG
10 IF FLAG GOTO 12
11 GOTO 07
12 CPY OUT REG1
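
Here is a minimal sketch of an interpreter for this made-up instruction set, written in Python, just to check that MUL really computes n*m. Everything below is illustrative: the register names come from the listing above, and since the only compound statement MUL uses is "IF flag GOTO line", that is the only form of IF this toy supports.

def run(program, registers):
    pc = 1                                   # the listing numbers its lines from 01
    while pc <= len(program):
        op, *args = program[pc - 1]
        if op == "SET":                      # reg <- N
            registers[args[0]] = args[1]
        elif op == "CPY":                    # reg1 <- reg2
            registers[args[0]] = registers[args[1]]
        elif op == "ADD":                    # reg1 <- reg1 + reg2 (or + N)
            registers[args[0]] += registers.get(args[1], args[1])
        elif op == "SUB":                    # reg1 <- reg1 - reg2 (or - N)
            registers[args[0]] -= registers.get(args[1], args[1])
        elif op == "CMP":                    # flag <- 1 if reg1 == reg2 else 0
            registers[args[2]] = int(registers[args[0]] == registers[args[1]])
        elif op == "IF":                     # only "IF flag GOTO line" is supported
            if registers[args[0]] == 1:
                pc = args[1]
                continue
        elif op == "GOTO":
            pc = args[0]
            continue
        pc += 1
    return registers

MUL = [
    ("CPY", "REG1", "IN1"), ("CPY", "REG2", "IN1"), ("CPY", "TEMP", "IN2"),
    ("SET", "TARGET_VALUE", 1),
    ("CMP", "TEMP", "TARGET_VALUE", "FLAG"), ("IF", "FLAG", 12),
    ("ADD", "REG1", "REG2"), ("SUB", "TEMP", 1),
    ("CMP", "TEMP", "TARGET_VALUE", "FLAG"), ("IF", "FLAG", 12),
    ("GOTO", 7),
    ("CPY", "OUT", "REG1"),
]

print(run(MUL, {"IN1": 5, "IN2": 16})["OUT"])   # prints 80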

If you want to challenge yourself, try to write a program using these instructions that takes the positive integers n and m as input and gives back n^m. I have also made a flowchart for this, in case you are stuck.

A simple instruction, for example ADD, is executed with the use of the arithmetic logic unit.

Arithmetic Logic Unit (ALU). Source: Wikipedia

Basically, an ALU takes two integer inputs, carries out a simple operation and stores the result somewhere. The point is that an ALU in itself is quite powerless, but by combining, connecting and controlling ALUs in clever and convoluted ways, anything is possible. (To define "anything" precisely, one has to look into the abstract concepts of Turing machines and Turing completeness. In brief, if your instruction set is rich enough, you can implement any algorithm.)
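
To make the "powerless parts, powerful whole" idea concrete, here is a sketch of one-bit AND, OR and XOR gates wired into a 4-bit adder, roughly the way real ALUs add numbers in hardware. The wiring is the textbook ripple-carry construction; nothing here is specific to any actual CPU.

def AND(a, b): return a & b
def OR(a, b):  return a | b
def XOR(a, b): return a ^ b

def full_adder(a, b, carry_in):
    # one column of binary addition, built from nothing but gates
    s = XOR(XOR(a, b), carry_in)
    carry_out = OR(AND(a, b), AND(carry_in, XOR(a, b)))
    return s, carry_out

def add4(x, y):
    # add two 4-bit integers bit by bit, least significant bit first
    carry, result = 0, 0
    for i in range(4):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result

print(add4(5, 9))   # prints 14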

Of course, programs written using instruction sets like the MUL program above are not exactly machine code. Aside from our instruction set being made up, real machine code contains numeric opcodes and memory addresses instead of keywords. A simple Hello, World! program (here in 16-bit DOS machine code, with the meaning of each byte sequence spelled out on the right) looks like this:

0100 ba 0c 01   ; mov dx, 0x010c – point DX at the message below
0103 b4 09      ; mov ah, 0x09   – select the DOS "print string" function
0105 cd 21      ; int 0x21       – call DOS
0107 b8 00 4c   ; mov ax, 0x4c00 – select the DOS "exit program" function
010a cd 21      ; int 0x21       – call DOS
010c 48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 21 0d 0a 24   ; "Hello, World!\r\n$"

This is certainly quite unreadable to humans, but it is the form that the computer actually executes. All other programming languages, with their human-readable keywords, are just abstractions built on top of this. As we shall see, something similar is true for another information-storing object: our DNA.

The bottom line is that an algorithm is basically a process that stores and manipulates information. As we shall see, very similar things are going on inside every living organism. Except they are infinitely more complicated and more colorful.

2. The machine code of genes.

It is kind of common knowledge that "your genes determine who you are". Of course, this is an oversimplification and the truth is not that straightforward. Chances are that unless you are professionally trained in biology, you do not know what exactly goes on with your genes. Let's take a look at how it works!

2.1. Protein synthesis. In fact, your body is made up of proteins, not genes, although the two are strongly interconnected. If proteins are the building blocks and your body is the house, then your genes are the blueprint. The process that realizes genes in the form of proteins is called protein synthesis. To explain it, let's start with the DNA. DNA (or deoxyribonucleic acid) is basically a double chain of molecules called nucleotides. Inside the DNA, you can find four types of them: adenine, thymine, cytosine and guanine. An adenine can only be paired with a thymine on the complementary side of the chain, and similarly, cytosines are always paired with guanines. Thus, you can think of the DNA as a very long string composed from a four-letter alphabet, something like this:

…ATGCTTAGACGATATATACCAGGAGTACCAT…
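
The pairing rule is simple enough to state in a couple of lines of Python. One caveat: the biological complement is also read in the reverse direction; this sketch only illustrates the A–T and C–G pairing itself.

PAIR = str.maketrans("ATCG", "TAGC")     # A <-> T, C <-> G

strand = "ATGCTTAGACGATATATACCAGGAGTACCAT"
print(strand.translate(PAIR))            # TACGAATCTGCTATATATGGTCCTCATGGTA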

A full strand of your DNA can be found in every one of your cells, and a gene is nothing else than a certain section of the DNA. Like machine code, it is not very useful for your body in itself; it needs to be decoded. The path from DNA to proteins starts with transcription and ends with translation, which produces the protein. Your proteins determine who you are and how you function at any given moment. Proteins are not only created when your body develops; you need them all the time. Moreover, you don't need them in constant concentrations: these have to change dynamically. For example, insulin is a protein whose production is triggered by digestion.

If this whole process happens and we have a protein as a result, we say that the gene has been expressed. The question is, how does the gene know when and how often it needs to be expressed? To answer this question, let's look at what is in the DNA besides your genes.

A gene is nothing else than a substring of the DNA. But as far as we know, not the whole genome contains protein-encoding information. The parts that encode proteins are called coding regions, and the parts that are not used to encode proteins are called noncoding regions. For example, 98% of the human genome is noncoding. But this doesn't mean that these parts are useless! They contain, for instance, the promoter regions, which are located right before a gene. Their purpose is to bind so-called transcription factors, to which RNA polymerase can bind in turn; the polymerase's role is to transcribe the DNA segment containing the information required to turn the gene into a protein. After the corresponding RNA chunk is created (which contains the nucleotide uracil instead of thymine), it is transported outside the nucleus and translated into a chain of amino acids (i.e. into a protein). The RNA is read three neighbouring nucleotides at a time, and each such triplet codes one of the 20 commonly occurring amino acids. A triplet is called an RNA codon. The "dictionary" from RNA codons to amino acids is called the RNA codon table.

RNA codon table

The translation always starts at the codon AUG and finishes at a stop codon, for example UAA.
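
Here is a minimal sketch of this reading process in Python. Only a handful of entries from the codon table are included, just enough for the made-up RNA string in the example; the real table has 64 entries.

CODON_TABLE = {                      # a small subset of the 64-entry table
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "AAA": "Lys",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def translate(rna):
    start = rna.find("AUG")          # translation begins at the start codon
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "STOP":     # ...and ends at a stop codon
            break
        protein.append(amino_acid)
    return protein

print(translate("CCAUGUUUGGCAAAUAAGG"))   # ['Met', 'Phe', 'Gly', 'Lys']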

So far, so good. Useful chunk of DNA in, useful amino acid chain out. Up until this point, there is not (that) much complexity to it; there is just a dictionary translating between portions of DNA and amino acids. To reveal the next level of complexity, let's take a closer look at the promoter regions and transcription factors we glossed over earlier.

2.2. The regulation of gene expression. Let's revisit the role of the promoter region before a gene. It is the part where transcription factors attach, if present, aiding the binding of the RNA polymerase, much like in the figure below. This effect is called upregulation, because it increases the production of the protein encoded by that gene.

Transcription factors and RNA polymerase

Similarly, transcription factors can prevent the RNA polymerase from binding, making the expression of the protein slower or even impossible. This is called downregulation.

The thing is, transcription factors are proteins themselves. In effect, the quantity of a protein in your body can depend on several other proteins! Overall, these relations form the so-called transcription network, which is quite complex even for primitive organisms. Below is a part of the transcription network of yeast.

Network of transcription regulatory factors in yeast. (Source)

Now we are not far away from our starting point: machines. From here, it is not hard to see that with clever genetic design, logic gates like AND, OR, NOR and XOR can be built! For example, a NOR gate would contain two transcription factor sites, both binding downregulating factors. Think back: what is a CPU made of? A control center and a bunch of arithmetic logic units, a.k.a. logic gates, and an algorithm is nothing else than the manipulation of data. In a computer, a given input produces an output; in a biological system, a given signal produces a response. This can be used to design more complex circuits like the one below.
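
As a toy model, the NOR gate just described fits in a few lines of Python: the gene is expressed (output 1) exactly when neither repressor occupies its binding site. The names are of course illustrative.

def nor_gate(repressor_a, repressor_b):
    # RNA polymerase can bind only when both repressor sites are empty
    return int(not (repressor_a or repressor_b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", nor_gate(a, b))   # 0 0 -> 1, everything else -> 0

NOR also happens to be universal: every other logic gate can be built out of NOR gates alone, so in principle a cell that can implement NOR can implement any circuit.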

A NOR-gate based combinatorial logical priority encoder. (Source)

Just like the AND, OR and other logic gates recurring in computers, transcriptional networks in real organisms also have several recurring patterns, which are called network motifs.

Transcriptional motifs in yeast. Downregulatory connections are denoted with dotted arrows. (Source)

To add another layer of complexity, consider that the concentration of a certain protein in a given organism is not binary, but can be thought of as a continuously varying quantity. This way, for instance by creating a periodic regulatory loop, one can obtain "molecular clocks" – and indeed this is how the circadian clock works, which organisms use to keep track of the time of day. (Yes, you also have one.)
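
To see such a clock tick, here is a back-of-the-envelope simulation of the simplest oscillating circuit of this kind, the so-called repressilator: three genes in a ring, each downregulating the next. The equations and parameter values are illustrative, not measured ones.

beta, n, dt = 10.0, 3, 0.01          # production rate, Hill coefficient, time step
x = [1.0, 1.5, 2.0]                  # initial protein concentrations

for step in range(5000):             # simple Euler integration
    rates = [beta / (1 + x[(i - 1) % 3] ** n) - x[i] for i in range(3)]
    x = [x[i] + dt * rates[i] for i in range(3)]
    if step % 500 == 0:
        print([round(v, 2) for v in x])   # the three concentrations chase each other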

Soon, biologists realized that living things are molecular machines, built on principles similar to those of electric circuits and computers. Parts make devices, devices make systems. This observation gave birth to the field of synthetic biology, whose purpose is to design and construct biological machines. These ideas have made a significant impact on biology, but biology was not the only field that benefited from this relationship. In fact, nature has been inspiring programmers ever since the creation of computers.

3. The stronger program gets to replicate.

In 1990, a scientist named Thomas Ray launched a very interesting computer experiment called Tierra. In it, computer programs were competing for CPU time and access to memory, that is, for resources. The twist was that these programs could mutate and breed, making potentially stronger offspring. This was one of the first attempts to simulate evolution, but the concept wasn't new. The idea is that by defining the notion of "stronger" appropriately, and by mixing, breeding and mutating the properties of an abstract object, one can design objects fitting a purpose very well. This is the essence of the genetic algorithm. To be more concrete, let's see an example.

You have probably seen antennas many times in your life, but you may not know exactly how they work. The purpose of an antenna is to convert radio waves into electric signals and vice versa. This enables, for example, the decoding of radio waves into sound in a radio, so you can actually hear it. Antennas are fundamental devices for telecommunication. Spacecraft and satellites also use them to communicate with Earth. But since the propagation of radio waves very much depends on the electromagnetic properties of the ambient space, you may need to design special antennas if you want to use them under certain circumstances, for instance outside the magnetosphere of Earth. So, how would you design the shape of such an object? Since there are way too many possibilities (as many as there are curves in 3D space), checking each and every one is out of the question. Instead, you can try to emulate what happens during evolution. The performance of a given antenna can be evaluated simply, so you have a scoring system. Designs can also breed and mutate, for example by mixing the features of two antennas (breeding) or growing a short, straight arm out from one of its ends (mutation). The search for the "strongest" design starts by selecting a bunch of candidates randomly, evaluating them, then crossing the best ones according to our score, occasionally mutating them. Iterating this process, we will eventually obtain specimens high in the fitness landscape, outperforming their peers.
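
Here is the skeleton of that search in Python. To keep the sketch self-contained, candidates are bit strings and the "antenna performance" score is simply the number of 1 bits – a stand-in fitness function; a real antenna design would score candidates with an electromagnetic simulation instead.

import random

LENGTH, POP, GENERATIONS, MUTATION_RATE = 30, 20, 50, 0.02

def fitness(candidate):
    return sum(candidate)                     # toy score: the number of 1 bits

def crossover(a, b):
    cut = random.randrange(1, LENGTH)         # breeding: splice two parents
    return a[:cut] + b[cut:]

def mutate(candidate):
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit
            for bit in candidate]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]          # selection: keep the strongest half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

print(fitness(max(population, key=fitness)))  # close to LENGTH = 30 by now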

Much to our surprise, this example is not a crazy thought experiment: this method was indeed used to design antennas for certain spacecraft. These are called evolved antennas.

Antenna of the NASA ST5 spacecraft. Its design was obtained with the use of evolutionary algorithms. (Source)

In fact, the genetic algorithm has been applied successfully several times. Here is a list on Wikipedia collecting some applications, including option pricing for financial markets and pop music production. (Hey, at least it was successful in making people laugh!)

4. Can a biologist fix a radio?

In a famous thought experiment, the biologist Yuri Lazebnik asked exactly this question. If you give a radio to biologists, can they figure out how it works? He concluded that after several experiments (like testing how the color affects the quality of the sound), many heated debates and millions of dollars, they would come up with something like the figure below.

A. The biologist’s view on a radio. B. The engineer’s view on the same radio. (Source)

He argued that the reason biologists cannot understand how a radio works is that they don't have an appropriate language and quantitative models to describe it. When complexity emerges, the tools and concepts used to understand very simple processes on the molecular level are not sufficient. Thinking patterns like "if I modify A, B also changes" are not enough. How can you understand the complex role of certain proteins when you think in simple patterns like that? You can only think what you can say. If you take away the language, you take away the understanding.

Of course, this thought experiment about biologists and radios happened a long time ago. A lot has happened since then. In the last few decades, scientists have understood the importance of an appropriate language for describing structure and function in organisms. As was the case in several other natural and social sciences, this language turned out to be mathematics. Complex biological systems are studied with the tools of network theory, where the interactions between components are described with quantitative models like differential equations.
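
One sentence of that quantitative language, made concrete: the simplest differential-equation model of an upregulated gene says that the protein concentration x grows by Hill-type production and shrinks by degradation, dx/dt = beta*a/(K + a) - alpha*x. The parameter values below are illustrative.

beta, K, alpha, dt = 2.0, 1.0, 0.5, 0.01   # production, binding, degradation, step
activator, x = 3.0, 0.0                    # fixed activator level, protein level

for _ in range(2000):                      # Euler steps toward the steady state
    dx = beta * activator / (K + activator) - alpha * x
    x += dt * dx

print(round(x, 2))   # approaches (beta/alpha) * a/(K + a) = 3.0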

With these changes in methods, the questions have also changed. Aside from asking "How does this work?", many scientists now also ask "How can I create an organism which does what I want?". For this purpose, engineering-inspired languages like the Synthetic Biology Open Language were created, providing a way to design living things fit to serve a purpose, for instance turning waste into biomass.

Synthetic Biology Open Language symbols. (Source)

The tools of synthetic biology are so widespread now that they are available even to young high school scientists. In the annual iGEM Synthetic Biology competition, there are dozens of high school teams coming up with innovative solutions for complex bioengineering problems.

Now is a really great time for engineers and scientists to work in the field of biology. If the invention of genetic engineering was the analogue of the industrial revolution, the recent developments are like the birth of the internet. Freeman Dyson, the legendary scientist, wrote in his article titled Our biotech future:

…imagine what will happen when the tools of genetic engineering become accessible to these people. There will be do-it-yourself kits for gardeners who will use genetic engineering to breed new varieties of roses and orchids. Also kits for lovers of pigeons and parrots and lizards and snakes to breed new varieties of pets. Breeders of dogs and cats will have their kits too.

Freeman Dyson

For the first time in science, this vision is within our reach, and it is incredibly exciting.

Author: Tivadar Danka

Applied Pure Mathematician
