How computational biology is solving the big data dilemma, one question at a time.

Plus Q&As with Dan Knights and Chad Myers

When you log onto Facebook, your profile provides the company with a truckload of data about you — where you hang out, what you “Like”, and who your friends are. What’s more, the computational algorithms used by social media sites are getting better and better at identifying whom you should befriend, or what you should “Like.” Surprisingly, computational biologists in the University of Minnesota’s Biotechnology Institute (BTI) are using some of these same algorithmic techniques to power new paradigms in data acquisition and analysis for biology.

In the case of Facebook data, an “edge” is defined by connections to Friends or actions that a user takes, such as a “Like” or status update. The company stores all of this information and uses it to personalize your experience on the site. Consider Friend recommendations, for example. “The concept is simple,” says Chad Myers, a BTI member with appointments in both the College of Biological Sciences (CBS) and the College of Science and Engineering (CSE). “Facebook looks at your set of edges, and someone else’s set of edges, and it says, you share this many edges in common, so you are probably Friends, too. That approach can be really accurate in biological data, as well.”
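To make the idea concrete, here is a minimal Python sketch of that shared-edges scoring: rank the people a user is not yet connected to by how many friends they have in common. The names and friendships are invented for illustration, and Facebook’s actual system is, of course, far more elaborate.

```python
# Hypothetical friendship graph: each person maps to their set of friends
# (these are the "edges"). All data here is invented for illustration.
friends = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice", "carol", "erin"},
    "carol": {"alice", "bob"},
    "dave":  {"alice"},
    "erin":  {"bob"},
}

def common_neighbors(graph, u, v):
    """Count the friends u and v share -- their 'edges in common'."""
    return len(graph[u] & graph[v])

def recommend(graph, user):
    """Rank everyone `user` isn't already friends with by shared friends."""
    candidates = set(graph) - graph[user] - {user}
    scores = {c: common_neighbors(graph, user, c) for c in candidates}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# bob and carol each share a friend with dave (alice), so they rank first.
print(recommend(friends, "dave"))
```

Swap the people for genes and the friendships for interaction or co-expression edges, and the same counting argument carries over to biological networks.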

An “edge” in biological data could fall into a multitude of categories, but the core concept is the same. Biological molecules that perform similar functions often exhibit similar
patterns in large genome-scale datasets. Computational algorithms can then identify and analyze these patterns in the data.

“We can measure similarity in data, but we don’t necessarily have to understand every aspect of it,” says Myers. “Based on how closely associated certain unknown genes or proteins are to more thoroughly annotated genes or proteins, we can guess what functions the unknown molecules might serve.”
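In code, that “guilt by association” reasoning can be as simple as a vote among the annotated genes most similar to the unknown one. Everything in this sketch, gene names, similarity scores, and annotations alike, is invented; real pipelines work at genome scale.

```python
# Toy "guilt by association" function prediction. All names, scores, and
# annotations below are hypothetical illustration data.
from collections import Counter

# Known functions for a few annotated genes.
annotations = {
    "gene_a": "DNA repair",
    "gene_b": "DNA repair",
    "gene_c": "ribosome biogenesis",
}

# How similar the unannotated gene looks to each annotated gene,
# e.g. from co-expression across many experiments (values made up).
similarity_to_unknown = {"gene_a": 0.92, "gene_b": 0.88, "gene_c": 0.15}

def predict_function(similarity, annotations, k=2):
    """Guess a function by majority vote among the k most similar genes."""
    top = sorted(similarity, key=similarity.get, reverse=True)[:k]
    votes = Counter(annotations[g] for g in top)
    return votes.most_common(1)[0][0]

print(predict_function(similarity_to_unknown, annotations))  # "DNA repair"
```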

With the advent of genomics and other next-generation technologies, the biotech sector is collecting unprecedented quantities of high-dimensional biological data. The increasing complexity and volume of these data could yield significant insights into biology, yet they also introduce formidable computational challenges.

For example, think of a simple organism like yeast. One researcher might study a cellular process by using a high-powered microscope to generate video with thousands of frames, each containing millions of pixels. Another researcher might use state-of-the-art spectrometry to catalogue hundreds or thousands of protein interactions. Yet another might apply genome sequencing to measure quantitative gene expression levels across the genome for hundreds or thousands of samples. Each of these data collection endeavors is a hard problem to solve individually, yet perhaps the ultimate challenge is to understand how these data sets, taken together, tell an integrated story about biology.

Simply put, the human brain cannot process such vast collections of data points. Furthermore, Excel spreadsheets no longer suffice to make sense of these interrelated pools of information. “Computational biologists develop new methods and define new paradigms to look at data with computational techniques,” says Myers, whose research maps out millions of genetic interactions in organisms all the way from yeast to human cells.

One of the techniques Myers has developed relies — like Facebook — on similarities between clusters of genes. If certain genes display similar expression patterns under a variety of controlled circumstances, it can be inferred that their functions might be similar.
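A toy version of that inference appears below: measure co-expression with a Pearson correlation across conditions, so that genes whose expression rises and falls together score near 1. The expression profiles are made up, and real datasets span thousands of genes and conditions.

```python
# Co-expression as correlation: genes with similar expression patterns
# across conditions get similarity scores near 1. Profiles are invented.
# statistics.correlation requires Python 3.10+.
from statistics import correlation

# Expression level of each gene across five hypothetical conditions.
expression = {
    "gene_x": [1.0, 2.1, 3.0, 4.2, 5.1],
    "gene_y": [0.9, 2.0, 3.2, 4.0, 5.3],  # tracks gene_x closely
    "gene_z": [3.0, 0.5, 4.0, 1.0, 2.5],  # unrelated pattern
}

print(correlation(expression["gene_x"], expression["gene_y"]))  # close to 1
print(correlation(expression["gene_x"], expression["gene_z"]))  # close to 0
```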

Dan Knights, a member of BTI and the Department of Computer Science and Engineering, works in a different sector of computational biology. Knights’ research investigates the rapidly evolving microbiomes of human and non-human primates. Both Myers and Knights rely primarily on genomic sequencing information and advanced computational techniques to pave the way from raw data to an integrated understanding of biology.

“We’re interested in how you can define a healthy gut microbiome,” says Knights, who looks not just at one genome, but at all the genomes present within a subject’s gastrointestinal tract at a given time. “This turns out to be quite challenging, because a diverse gut microbiome with hundreds or thousands of different species living in it is actually healthier than one with only 100 species or 50 species.”
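One standard way to put a number on that kind of diversity is the Shannon index, which rewards a community for having many species represented evenly. The species counts in this sketch are invented, and Knights’ actual analyses are far more involved; this shows only the core arithmetic.

```python
# Shannon diversity: H = -sum(p_i * ln(p_i)) over species proportions p_i.
# Higher H means more species, more evenly represented. Counts are invented.
import math

def shannon_diversity(counts):
    """Shannon index over a list of per-species abundance counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

diverse_gut = [10] * 500   # 500 species, evenly abundant
sparse_gut = [100] * 50    # 50 species, evenly abundant
print(shannon_diversity(diverse_gut))  # ~6.2
print(shannon_diversity(sparse_gut))   # ~3.9
```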

Computational biologists fall along a spectrum. “Some researchers focus more on the biology, others focus entirely on developing new algorithms,” says Knights. Both Myers’ and Knights’ labs do a bit of both. Some of their work involves designing and executing wet lab experiments; the rest involves building tools to process data from those experiments and from experiments run by collaborators around the world.

“Researchers developing computational approaches often have an abstract perspective of biological systems,” says Myers. “While specific biological questions motivate our work, when we develop a method, we rarely only develop it for a specific biological system or even species. The problems that are most exciting are the ones where, if we solve it here, it will also solve someone’s problem out there.”

By deriving solutions that cut across disciplines, faculty in BTI work on a broad range of problems. The same technologies enabling the genomics revolution are also being applied to precision agriculture, sensors in manufacturing, bioremediation, and other environmental concerns like climate change.

Ultimately, researchers in the biology sphere aim to construct models with strong predictive power. For instance, Myers hopes to understand how genetic interactions influence phenotypes, in either normal or disease states. Likewise, Knights hopes to create diagnostic tests that can predict health consequences based on the community of microbes living within a person’s gastrointestinal tract.

Computational biology is an excellent tool for organizing data sets and for identifying and analyzing the trends within them. Computers won’t, however, eliminate the need for deeply experienced experts. “You can never automate human intuition, especially that of biologists,” says Myers. “There might be thousands of hypotheses that are consistent with the data you’ve measured, but a good geneticist or molecular biologist can really narrow that hypothesis space quickly based on intuition.”

Rather than replacing people, computers are changing how humans interact with data. Computational processes are becoming increasingly iterative, meaning that data analysis is not a simple one-off affair. Instead, one set of computations might reveal a pattern that a human must then detect and interpret; the next set of computations is designed to investigate that pattern further, in a cycle that repeats indefinitely.

And it’s no longer just one human detecting those patterns. “Projects are getting bigger and more multi-disciplinary,” says Knights, who leans heavily on both local and global collaborators. “Disciplines are converging. Everyone’s trying to put the pieces together right now, because it’s all one giant system.”

At the intersection of computer science and biology lie exquisite opportunities to build new technologies and solve major problems. Visit gateway.bti.umn.edu to read more about the exciting work being done by BTI faculty.