Science Talk

All of YouTube's data stored in teaspoon of DNA

Researchers here working on DNA-based data storage amid impending storage crisis

The world is facing a looming data storage crisis, and Singapore can help to avert it.

In 2018, people watched 4.33 million videos on YouTube, sent 159 million e-mails and posted 49,000 photographs on Instagram every minute of the year, among other data uses.

At this rate, we will produce 418 zettabytes of data this year, according to the World Economic Forum, and even more in the future. A single zettabyte is a trillion gigabytes.

Our current methods of storing all this data are not sustainable, for several reasons.

Most digital archives are now stored on magnetic and optical data storage systems, but we will run out of the materials used to produce these in less than a century.

Meanwhile, the environmental and economic costs of server farms, which make up 3 per cent of global electricity use and 2 per cent of greenhouse gas emissions, will soar.

ADVANTAGES

While scientists have been investigating alternative methods of storing data, one stands out.

DNA-based data storage, which stores information in man-made strands of DNA, has three key advantages. It has extremely high data storage density, remains stable for hundreds of years and requires very little power.

Last year, scientists in Israel announced that they had developed a way to store more than 10 petabytes, or 10 million gigabytes, in a single gram of DNA.

This means that, theoretically, all of YouTube's data could be stored in a teaspoon of DNA.
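As a rough sanity check on the figures quoted so far, the arithmetic can be sketched in a few lines of Python. The sketch assumes decimal units and treats the 10-petabyte figure as a floor, so the mass it prints is an upper bound; the variable names are illustrative only.

```python
# Decimal (SI) units, in bytes.
GIGABYTE = 10**9
PETABYTE = 10**15
ZETTABYTE = 10**21

# Figures quoted in the article.
density_bytes_per_gram = 10 * PETABYTE   # "more than 10 petabytes" per gram
yearly_data_bytes = 418 * ZETTABYTE      # World Economic Forum estimate

# Unit checks: a zettabyte is a trillion gigabytes,
# and 10 petabytes is 10 million gigabytes.
gigabytes_per_zettabyte = ZETTABYTE // GIGABYTE
gigabytes_per_gram = density_bytes_per_gram // GIGABYTE

# Mass of DNA needed to hold one year of data at that density.
grams_needed = yearly_data_bytes // density_bytes_per_gram
tonnes_needed = grams_needed / 10**6

print(f"{gigabytes_per_gram:,} GB per gram")
print(f"{tonnes_needed:,.1f} tonnes of DNA for 418 zettabytes")
```

At that density, a full year of the world's data would need at most about 42 tonnes of DNA.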

Even though scientists have been working on DNA-based data storage methods for nearly a decade, major obstacles remain.

  • Pioneer of coding techniques for CDs, DVDs 

    Dr Kees Immink, 73, a Dutch scientist, inventor and entrepreneur, is a visiting scholar at the Singapore University of Technology and Design's Advanced Coding and Signal Processing Laboratory.

    For more than 40 years, he has made vital contributions to the field of digital consumer electronics.

    His pioneering coding techniques provided the foundation for generations of audio, video and data recording media, including CDs, DVDs, and hard-disk and solid-state drives.

    He also established constrained codes as an important sub-field of information and coding theory, and created many coding instructions that accelerated the progress of digital data storage technology.

    For his groundbreaking work, he has been awarded the Institute of Electrical and Electronics Engineers Medal of Honour, the Edison Medal, and an Emmy Award for his contributions to television technology.

    He was also knighted in the Netherlands in 2000.

    He is currently working on coding and signal processing techniques for various kinds of data storage devices, including optical, non-volatile memory and DNA-based data storage.

    He will be a speaker at the upcoming Global Young Scientists Summit 2020 to be held in Singapore from next Tuesday to Friday. The annual event is organised by the National Research Foundation to connect young researchers and eminent scientists to spark new ideas and conversations.

KEY CHALLENGES

First, a quick explanation of how DNA-based data storage works.

Each DNA molecule consists of linked components called nucleotides, which come in four types: guanine, cytosine, adenine and thymine, represented by the letters G, C, A and T.

To store information in DNA, digital data, which consists of 0s and 1s, is translated into sequences made up of the G, C, A and T letters.

Companies or other organisations then manufacture synthetic DNA molecules representing those translated sequences and store them.

To retrieve the data, the synthetic DNA molecules are sequenced, and the output translated back into the original digital information.
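The translation step can be illustrated with a minimal Python sketch. It assumes the simplest possible mapping, two bits per nucleotide (00 to A, 01 to C, 10 to G, 11 to T). This mapping and the function names are chosen for illustration; real schemes are more elaborate because they must also guard against errors.

```python
# Hypothetical 2-bits-per-nucleotide mapping, chosen for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Translate bytes into a string of A, C, G and T letters."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Translate a string of A, C, G and T letters back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"Hi"
strand = encode(message)
print(strand)            # the nucleotide sequence to be synthesised
print(decode(strand))    # the original bytes, recovered after sequencing
```

Under this mapping, every byte of data becomes exactly four nucleotides.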

While this method has been tried and tested, there are significant challenges.

The costs of sequencing DNA have fallen dramatically in recent years. Producing the synthetic DNA molecules, however, is still prohibitively expensive.

Currently, it costs about US$5 million (S$6.8 million) to store just one gigabyte of data - a lot of money to store not even a full DVD movie!

Creating DNA molecules and sequencing them also involve biochemical and biophysical processes that are prone to errors.

The process of writing DNA to produce the synthetic molecules, for example, is vulnerable to substitution, insertion and deletion errors.
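The three error types are easy to picture on a short strand. The Python sketch below applies each one at a chosen position; the helper names and the example strand are made up for illustration and do not model any real synthesis chemistry.

```python
def substitute(strand: str, pos: int, base: str) -> str:
    """Substitution: one nucleotide is replaced by another."""
    return strand[:pos] + base + strand[pos + 1:]


def insert(strand: str, pos: int, base: str) -> str:
    """Insertion: an extra nucleotide appears, shifting the rest right."""
    return strand[:pos] + base + strand[pos:]


def delete(strand: str, pos: int) -> str:
    """Deletion: a nucleotide is lost, shifting the rest left."""
    return strand[:pos] + strand[pos + 1:]


original = "CAGACGGC"
print(substitute(original, 2, "T"))  # same length, one letter differs
print(insert(original, 2, "T"))      # one letter longer
print(delete(original, 2))           # one letter shorter
```

Insertions and deletions are especially awkward because every letter after the error shifts position, so the strand no longer lines up with what was written.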

THE SINGAPORE CONNECTION

In Singapore, several teams of researchers are hard at work on these problems.

At the National University of Singapore, Associate Professor Poh Chueh Loo, Associate Professor Yew Wen Shan and their colleagues are working on more efficient ways to synthesise DNA sequences.

The Singapore University of Technology and Design's Advanced Coding and Signal Processing Laboratory, where I am a visiting scholar, is another local nexus of research in the field. The laboratory, under the leadership of Associate Professor Cai Kui, its founder, has been developing algorithms to prevent, detect and correct errors in writing and sequencing DNA.

We have found, for instance, that when the same nucleotide is repeated more than four times in a row, the probability of sequencing errors rises substantially.
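A check for this condition is simple to write. The Python sketch below measures the longest stretch of a repeated nucleotide and compares it with a limit of four, the threshold mentioned above; the function names are illustrative.

```python
from itertools import groupby


def longest_run(strand: str) -> int:
    """Length of the longest stretch of one repeated nucleotide."""
    return max((len(list(group)) for _, group in groupby(strand)), default=0)


def within_run_limit(strand: str, max_run: int = 4) -> bool:
    """True if no nucleotide repeats more than max_run times in a row."""
    return longest_run(strand) <= max_run


print(within_run_limit("ACGTTTTAC"))   # four Ts in a row: within the limit
print(within_run_limit("ACGTTTTTAC"))  # five Ts in a row: over the limit
```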

We have also described how to design algorithms to translate data into strands of nucleotides that meet various error-limiting conditions. Furthermore, we calculated the maximum number of data bits that can be stored per nucleotide if a constraint is imposed to prevent too many repetitions of a nucleotide in a row.
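That maximum can be estimated with a standard counting argument, sketched below in Python. This is not necessarily the method used by the laboratory. It assumes the only constraint is a cap on run length, treats a strand as a series of runs of 1 to m identical letters in which each new run has three choices of letter, and finds the growth rate x of the number of valid strands by solving 3(1/x + 1/x^2 + ... + 1/x^m) = 1 by bisection. The function name and parameters are illustrative.

```python
import math


def bits_per_nucleotide(max_run: int, alphabet: int = 4) -> float:
    """Estimate the maximum data bits per nucleotide under a run-length cap.

    The count of valid strands of length n grows like x**n, where x solves
    (alphabet - 1) * sum(x**-i for i in 1..max_run) == 1.
    """
    def excess(x: float) -> float:
        return (alphabet - 1) * sum(x ** -i for i in range(1, max_run + 1)) - 1

    low, high = 1.0, float(alphabet)   # excess(low) > 0 > excess(high)
    for _ in range(100):               # bisection on a decreasing function
        mid = (low + high) / 2
        if excess(mid) > 0:
            low = mid
        else:
            high = mid
    return math.log2((low + high) / 2)


for max_run in range(1, 5):
    print(max_run, round(bits_per_nucleotide(max_run), 4))
```

Without any constraint, four letters carry exactly 2 bits each. Under this model, allowing up to four repeats in a row gives up only a small fraction of that.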

Much more work needs to be done to make DNA-based data storage viable, including in areas such as how to restore lost data.

In hard-disk drives, data is stored in fixed places, so even if you lose some data, knowing what is supposed to go where can help you to restore the missing pieces.

A pool of DNA, however, is like coffee in a pot, with free-floating molecules. This makes data restoration much more difficult.
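The difference can be illustrated with a toy Python sketch. It cuts a sequence into short strands, shuffles them to mimic molecules floating freely in a pool, and shows that the order can be recovered if each strand carries its own position. The index tag is an assumption made for illustration, not something described in this article, and all names are hypothetical.

```python
import random


def split_into_strands(sequence: str, size: int) -> list[str]:
    """Cut a long sequence into short strands of at most `size` letters."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def tag_with_index(strands: list[str]) -> list[tuple[int, str]]:
    """Attach each strand's position (a stand-in for an address sequence)."""
    return list(enumerate(strands))


def reassemble(tagged_pool: list[tuple[int, str]]) -> str:
    """Sort strands by their index tag and join them back together."""
    return "".join(strand for _, strand in sorted(tagged_pool))


sequence = "CAGACGGCTTAGCATG"
pool = tag_with_index(split_into_strands(sequence, 4))
random.shuffle(pool)                 # the pool has no inherent order
print(reassemble(pool) == sequence)  # True: the tags restore the order
```

If a tagged strand is lost or its tag is corrupted, the gap it leaves is far harder to fill than a missing sector at a known place on a disk.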

Still, DNA-based data storage remains one of the most promising solutions to our impending data storage crisis.

And Singapore, with its vibrant research sector and excellent expertise in the sciences, is well positioned to be a leader in this research field.

A version of this article appeared in the print edition of The Straits Times on January 11, 2020, with the headline 'All of YouTube's data stored in teaspoon of DNA'.