Information-theoretic & machine learning techniques for analysis and compression of biological data
Stochastic and Information-theoretic Modeling of DNA Mutations
Biological sequence data is generated over billions of years through mutation processes that alter sequences through the course of their evolutionary history. These mutation processes determine the statistical properties of the resulting sequences. To better exploit these statistical properties in analysis and compression tasks, we study both combinatorial and stochastic models of DNA mutation processes including duplication and point mutation. Our research in this area includes:
- We study combinatorial and stochastic models of mutation processes in order to quantify their ability in generating sequence diversity and to enable estimation of mutation rates and sequence analysis. We study biologically-inspired combinatorial sequence growth models to find their capacity and from the perspective of formal languages. We are also interested in extremal problems related to computing the duplication distance. Our work also includes the analysis of more faithful but also more complex representation of mutation processes through stochastic mutation models.
- Estimation of mutation rates and sequence analysis via stochastic mutation models. We have developed an efficient method for estimating from a single sequence the parameters of the model, i.e., the probabilities of various mutation events, as well as the total number of mutations. An interactive web page for exploring tandem repeats in human Chromosome 1 can be accessed here.
- Compressibility of biological sequences. Stochastic models of mutation processes enable us to compute bounds on the compressibility of biological sequence data. Given the growth of sequence data at an unprecedented rate, identifying the limits of compressibility is crucial in designing and evaluating the performance of compression methods.
Compression of metagenomic sequencing data
Metagenomics refers to the genomic study of environmental samples containing DNA from several organisms. Due to the public health applications of metagenomic studies, including studies of bacterial communities in water, soil, atmosphere, and the human body, the field is receiving increased attention. For example, the NIH Human Microbiome Project is focused on the development of metagenomic tools and databases for characterizing human microbiota and their role in human health. As the size of metagenomic datasets grow, so does the need for efficient compression tools. By integrating taxonomy identification, alignment, and source coding, we provide a parallelized compression platform that when tested on a variety of metagenomic datasets reduced file sizes by more than 87%, outperforming standard tools.
For the task of gene prioritization, we developed HyDRA, which identifies genes that contribute to the onset of a given disease among a set of test genes. For each database in a set of diverse biological databases, HyDRA creates a ranking of the test genes, for example, based on their similarity to a set of training genes as measured by that database. It then aggregates these rankings to produce a ranking of the test genes. HyDRA uses the aforementioned distances and algorithms for rank aggregation in order to achieve a higher accuracy in the top positions, which are typically selected to be experimentally tested. HyDRA also has the ability to augment the training set through gene discovery. We tested HyDRA for a number of diseases including autism and several types of cancer and showed it outperforms the state-of-the-art tools, such as Endeavour and ToppGene.
Data Storage in DNA
DNA is emerging as a potential solution to growing data storage needs, due to recent advances in DNA synthesis and sequencing. Compared to silicon-based storage, DNA offers higher density and data longevity. This project focuses on characterizing common error patterns as well as designing error-correcting codes for combating errors in DNA storage systems.
Ordinal data processing
Ordinal data refers to information presented as comparisons rather than values, such as rankings and partial orders. One of the advantages of ordinal data is that it does not rely on a scale, which makes it a very flexible representation. As Kendall (Rank correlation methods, 1955) puts it:
What the ranking loses in accuracy, it gains in generality, for if we stretch the scale of measurement (and even if we stretch it differently in different regions) the ranking remains unaltered. As a result, rankings have a wide range of applications, from identifying disease-causing genes to error-correcting codes. Since rankings depend on comparisons of items and not a subjective scale that may vary from user to user, they are also useful for representing user preference in recommender systems.
Distances on rankings and rank aggregation
To process ranked data, we often need a distance measure between rankings, for example, when evaluating the performance of search engines or in distance-based rank aggregation, which is the task of combining ordinal data obtained from multiple sources, such as a group of experts, voters, or databases, into one ranking.
My work in this area includes axiomatic derivation of distances capable of incorporating similarity information and positional factors, such as the non-uniform importance of different parts of rankings; algorithms for computing and approximating these distances; and distance-based rank aggregation algorithms. For example, I introduced the weighted Kendall distance, which can be tuned to be more sensitive to changes in top positions in rankings and is computable in time (GitHub). In rank aggregation and voting, this distance enforces a higher degree of agreement between the sources and the aggregate in the top positions compared to conventional metrics and can eliminate negative outlier votes. We use these distances and methods to develop HyDRA, a tool for gene prioritization.
|Kendall tau||Weighted Kendall|
Sorting streaming data with limited storageThe permutation polytopes for Kendall tau and weighted Kendall distances. Edge lengths denote distance between permutations differing in an adjacent transposition.
Sorting streaming data with limited storage
The problem of sorting a data stream with limited storage arises both in big data contexts where the ordering of the elements of a large data stream is to be approximated, and in learning user preference rankings for recommender systems where users observe a set of items one at a time, such as movies on Netflix or songs on Spotify, and can recall information about only a small number of previous items. A useful tool for studying this problem is the rate-distortion theory of permutations. To solve the problem, we first find average- and worst-case rate-distortion bounds for the permutation space. These results are then used to provide bounds on the performance of algorithms for sorting streaming data with limited storage. Furthermore, we provide a simple algorithm that achieves these performance bonds in the asymptotic sense up to a constant factor.
Coding for storage systems
Because of its advantages over other types of storage, including speed, robustness (due to lack of mechanical parts), and efficiency, flash memory has become a $30B industry and a widely used technology, especially in mobile and cloud applications. However, the flash technology suffers from loss of reliability as data density increases and from a relatively short life span. My research focuses on addressing these problems by designing i) error-correcting codes for flash memories with a large number of charge levels that allow high-density but reliable data storage and ii) rewriting codes that decrease the need for erasure cycles and thus improve the speed and endurance of the memory. To achieve these goals, we introduced the translocation error model, which can describe a wide class of flash memory errors, including multi-level charge drops, read- and write-disturbs, and over-injections. We also presented several code constructions and decoders for correcting errors in this model, as well as information-theoretic bounds on the maximum possible coding rate. The rewriting codes are based on multipermutations and allow writing new data without erasures for a number of times, thus increasing the speed and the endurance of the memory.