50-year speed rate of increase: The trajectory of genetic information decoding!

Hirokazu Kobayashi
Jun 29, 2024

Hirokazu Kobayashi

CEO, Green Insight Japan Co., Ltd.

Professor Emeritus and Visiting Professor, University of Shizuoka

The Summer Olympics will be held in Paris beginning July 26. The first modern Summer Olympics were held in Athens in 1896, and the Paris Games will be the 33rd. World records have been broken in many events, and the eyes of the world will again be on the athletes' performances. My life as a researcher is approaching its 50th year, so let's compare how much various speeds have increased over the past 50 years.

Fifty years ago was around 1974. Japan's "first oil crisis" had begun in October 1973, and consumers rushed to buy up all the toilet paper in the country. As a student at the time, I noticed that one of the two elevators in my university building had been shut down and that the corridor lights were dimmed to save electricity. Even my mood darkened as a result. Many of you were not yet born, but for those who cannot quite recall those days, the hit songs of the time should help bring the era back. The biggest worldwide hit was “Top of the World” by the Carpenters. For people who lived in Japan, there were, in order of release, “Kandagawa” by Kaguyahime, “Anata (You)” by Akiko Kosaka, “Cape Erimo” by Shinichi Mori, “Tsumiki no Heya (Room of Building Blocks)” by Akira Fuse, “Nagori Yuki (Remaining Snow)” by Iruka, “Shoro Nagashi (Spirit Boat Procession)” by Grape, and others. Late-night radio shows were very popular with young people at that time. My favorites were Shinji Tanimura (Chinpei: 1948-2023) on Nippon Cultural Broadcasting's "Say! Young" and Tsurukoh Shofukutei (1948-) on Nippon Broadcasting System's "All Night Nippon." If I kept listening, the program gave way to "Singing Headlights" at 3:00 a.m., which took its toll on the next day's classes.

Comparison between 1974 and the present (June 2024)

100-meter dash: 9.95 sec → 9.58 sec, 1.04 times faster

200-meter dash: 19.72 sec → 19.19 sec, 1.03 times faster

400-meter dash: 43.86 sec → 43.03 sec, 1.02 times faster

100-meter freestyle: 49.44 sec → 46.91 sec, 1.05 times faster

200-meter freestyle: 1 min 45.85 sec → 1 min 42.00 sec, 1.04 times faster

100-meter backstroke: 55.49 sec → 51.85 sec, 1.07 times faster

Maximum speed of Shinkansen bullet train in operation: 210 km/h (130 mi/h) → 320 km/h (199 mi/h), 1.52 times faster

Commercial car's maximum speed: 302 km/h (188 mi/h) → 490 km/h (304 mi/h), 1.62 times faster

Computer processor speed: 1-5 MHz → 2-5 GHz, about 1,000 times faster

Decoding genetic information: 0.08 bases/day* → 90 billion bases/day, about 1 trillion times faster

*Calculated from the value given in the manuscript of Robert William Holley's (1922-1993) Nobel Prize lecture in Physiology or Medicine.
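The ratios in the list above are simple divisions, and the following short Python sketch shows how they can be reproduced. The function and variable names are my own illustrative choices, and for the processor clock I assume the upper ends of the quoted ranges (5 MHz and 5 GHz); all other figures are copied from the list.

```python
# Reproduce the "times faster" ratios from the comparison list.
# All names here are illustrative; the figures come from the list above.

def speedup_from_times(old_seconds: float, new_seconds: float) -> float:
    """Same distance covered in less time: speed ratio = old time / new time."""
    return old_seconds / new_seconds


def speedup_from_rates(old_rate: float, new_rate: float) -> float:
    """Quantities already expressed as a rate: ratio = new rate / old rate."""
    return new_rate / old_rate


# (1974 record, current record), in seconds
TIMED_EVENTS = {
    "100-meter dash": (9.95, 9.58),
    "200-meter dash": (19.72, 19.19),
    "400-meter dash": (43.86, 43.03),
    "100-meter freestyle": (49.44, 46.91),
    "200-meter freestyle": (105.85, 102.00),  # 1 min 45.85 s, 1 min 42.00 s
    "100-meter backstroke": (55.49, 51.85),
}

# (1974 value, current value), as rates
RATES = {
    "Shinkansen (km/h)": (210, 320),
    "Commercial car (km/h)": (302, 490),
    # Assumption: upper ends of the quoted ranges, 5 MHz and 5 GHz.
    "Processor clock (Hz)": (5e6, 5e9),
    "Genetic decoding (bases/day)": (0.08, 90e9),
}


def all_speedups() -> dict[str, float]:
    """Return every ratio from the list, keyed by item name."""
    result = {}
    for name, (old, new) in TIMED_EVENTS.items():
        result[name] = speedup_from_times(old, new)
    for name, (old, new) in RATES.items():
        result[name] = speedup_from_rates(old, new)
    return result


if __name__ == "__main__":
    for name, ratio in all_speedups().items():
        print(f"{name}: {ratio:,.2f} times faster")
```

Note that the timed events divide the old time by the new one, while the items given as speeds divide the new value by the old one; both express how many times faster the present is.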

The result is a resounding victory for the decoding of genetic information. In sports, technique and sportswear have improved, but the human body is approaching the limits of its capabilities. In transportation, there is a limit to how much power wheels and tires can transmit, although the speed can be raised to 603 km/h (375 mi/h) with a maglev (linear motor car), which does not rely on wheels. In air transportation, there was the supersonic airliner "Concorde," which was retired in 2003. I saw this plane at London Heathrow Airport. Reasons for its retirement include the crash in 2000 that killed 113 people, poor economics, and the sonic boom. The storage capacity of personal computers has increased millions of times, but their computation speed is only about a thousand times faster. By contrast, the speed of deciphering genetic information has increased about 1 trillion times, by far the largest gain.

My life as a researcher has paralleled the history of genetic information determination. DNA, the carrier of genetic information, can be thought of as a long string, but the technology of the time could not analyze it as a mixture of different molecules. Sequence analysis (determination of the base sequence) therefore began with RNA. The first significant achievement was the determination of the 77-base sequence of alanine transfer RNA (tRNA) by Holley (1965). For this work, a two-dimensional separation method for fragmented RNA, combining filter paper electrophoresis and paper chromatography, was used; its origins can be traced back to 1960. In the 1970s, the base sequence of E. coli lipoprotein mRNA was determined by Keiichi Takeishi (1940-), with whom I had the pleasure of working at the University of Shizuoka. Around that time, it became possible to fragment DNA and use E. coli to increase the amount of each fragment (cloning). Technology for deciphering genetic information (sequencing) could then be developed using uniform DNA fragments. DNA is written in four letters: A, C, G, and T. Chemical methods were developed to cut the strand precisely on the left (5') side of each of these letters (Maxam-Gilbert method). This method was published in 1977 and was widely used by researchers worldwide. I was surprised to find Alan Maxam's (1942-) doctoral dissertation in the library of the Biological Laboratories at Harvard University, where I was a postdoctoral fellow, and to learn that this method had been devised by a graduate student. Also in 1977, Frederick Sanger (1918-2013) developed a technique that reads the sequence by extending a DNA strand toward the right (3') side with an enzymatic reaction (Sanger method). At the time, all of these techniques were manual.
Walter Gilbert (1932-) and Sanger were awarded the Nobel Prize in Chemistry in 1980 for their work in developing a method for determining DNA sequences, along with Paul Berg (1926-2023), who created the cloning technique. This was Sanger's second Nobel Prize in Chemistry.

A single gene often carries genetic information of about 100 to 5,000 letters. In the 1980s, progress was made in determining the sequences of many genes from plants, animals, and microorganisms. Analysis began with the genes for ribosomal RNA (rRNA) and transfer RNA (tRNA), followed by the genes encoding proteins. In plants, the first target was the gene for the large (L) subunit of Rubisco, the enzyme that catalyzes the first step of carbon dioxide fixation in photosynthesis, the hallmark function of plants. Lawrence Bogorad (1921-2003) and his colleagues at Harvard University published this work in Nature in 1980. I was a graduate student at Nagoya University at the time and looked at this research with envy. In 1983, I had the opportunity to join Bogorad's laboratory. Later, regarding the origin of Rubisco, we were the first in the world to find that in primitive photosynthetic bacteria this gene first appeared as one set together with the small (S) subunit gene and was then duplicated into two sets (1989). Apart from that, I devoted little attention to gene decoding and focused instead on the regulatory mechanisms of gene expression.

The set of genetic information necessary for an organism to live is called the genome. Genome analysis therefore started with viruses, whose genomes are small, and moved on to larger genomes. In animals, mitochondria contain their own genome in addition to the one in the nucleus, and the human mitochondrial genome, consisting of 16,569 bases, was published by Sanger et al. in Nature in 1981. In plants, chloroplasts also have a genome, in addition to mitochondria. Chloroplast genomes are about 120,000 to 150,000 bases in size. Japan is a world leader in this field. The research group of Masahiro Sugiura (1936-) completed the whole-genome analysis of the tobacco chloroplast and published it in the EMBO Journal in 1986; in the same year, a joint group of Haruo Koseki (1925-2009) and Kanji Ohyama (1939-), with whom I have a close relationship, published that of the liverwort Marchantia polymorpha in Nature. The detection step of the Sanger method was mechanized in the 1990s, followed by the automation of reaction processing. Genome analysis in Japan has continued to lead the world, and people I know have been active in this field. The 3.57-million-base genome of a cyanobacterium (blue-green alga), a model of plant photosynthesis, was published in 1996 by a research group led by Tetsuyuki Tabata (1954-) at the Kazusa DNA Research Institute. The 4.64-million-base genome of E. coli, a model organism in molecular biology, was published in 1997 by a collaborative research team led by the United States. The nuclear genomes of animals and plants can be more than three hundred times larger, so their analysis took longer. For Arabidopsis thaliana, considered a model plant, Tabata's research group sequenced chromosomes 3 (23 million bases) and 5 (26 million bases). The results were published in Nature in 2000.
In the Human Genome Project, Yoshiyuki Sakaki (1942-: Human Genome Center, Institute of Medical Science, University of Tokyo / RIKEN Genomic Sciences Center, now University of Shizuoka, Member of the Management Council) was in charge of chromosomes 11 (134 million bases), 18 (76 million bases) and 21 (47 million bases). Their draft sequences were published in Nature in 2001. In rice, an important cereal and a model of monocotyledonous plants, Takuji Sasaki (1947-: National Institute of Agrobiological Sciences, NIAS / University of Tsukuba) and Takashi Gojobori (1951-: National Institute of Genetics, NIG) sequenced chromosome 1 (46 million bases), which was published in Nature in 2002.

In the late 1990s, next-generation sequencing (NGS) methods were developed. While conventional methods use a mass of identical DNA fragments, NGS starts with a mixture of different DNA fragments, which are attached to adapters, amplified by PCR, and sequenced using fluorescently labeled substrates or by detecting the release of pyrophosphate or protons. In addition, "third-generation sequencing" was announced in 2014. There are two approaches. One uses nanopores to identify bases. In the other, hairpin adapters are attached to both ends of a DNA fragment, which, after denaturation, becomes a circular single-stranded DNA template for repeated incorporation of fluorescently labeled substrates. The PacBio Revio sequencer uses this second approach, and its catalog performance is 90 billion bases/day. I applied this method to the tea plant genome (4 billion bases) and was able to read 71 billion bases with a misread rate of about 1/400. That amounts to about 18 times the genome size, a satisfactory result for use in our tea genome editing.
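As a quick check of the figures for the tea genome, the following Python sketch computes the depth of coverage (bases read divided by genome size) from the numbers given above. It also converts the misread rate into a Phred quality score, a standard conversion that is my addition and not part of the original analysis. The function names are illustrative.

```python
import math

# Figures from the text above.
GENOME_SIZE_BASES = 4e9   # tea plant genome, about 4 billion bases
BASES_READ = 71e9         # total bases read, about 71 billion
MISREAD_RATE = 1 / 400    # about one misread per 400 bases


def depth_of_coverage(bases_read: float, genome_size: float) -> float:
    """Average number of times each genome position is covered by reads."""
    return bases_read / genome_size


def phred_quality(error_rate: float) -> float:
    """Standard Phred conversion: Q = -10 * log10(error probability)."""
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    coverage = depth_of_coverage(BASES_READ, GENOME_SIZE_BASES)
    print(f"Coverage: {coverage:.2f} times the genome size")
    print(f"Phred quality for a 1/400 misread rate: Q{phred_quality(MISREAD_RATE):.1f}")
```

The coverage comes out at 17.75, which rounds to the "about 18 times" quoted above.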
