Karthik Devarajan, PhD
- PhD, Northern Illinois University, 2000
- MSc, Tech, Birla Institute of Technology & Science, India, 1992
- Statistical Scientist, Cancer Bioinformatics, AstraZeneca R&D Boston, Waltham, MA, 2002-2005
- Biostatistician, Bristol-Myers Squibb Pharmaceutical Research Institute, Bristol-Myers Squibb, Princeton, NJ, 1999-2002
Unsupervised dimension reduction
We have developed unsupervised dimension reduction methods for model-based clustering of gene expression data and for text mining applications in biomedical informatics using non-negative matrix factorization (NMF). An important, but often ignored, aspect of high-throughput genomic data is its heteroscedasticity, that is, the signal-dependent nature of the noise in the measurements. We have developed information-theoretic methods that extract relevant components from large-scale biological data by accounting for signal-dependent noise. In addition, we have developed computational tools for dimension reduction and visualization using NMF that are freely available to the academic research community. These include hpcNMF, a C++ package that uses high-performance computing clusters (devarajan.fccc.edu), and the R package gnmf (http://cran.r-project.org/web/packages/gnmf/index.html). Furthermore, by extending non-negative matrix factorization using the theory of generalized linear models, we are developing methods that provide a unified framework for the modeling and analysis of data obtained on different scales (Devarajan et al., 2015, 2015; Devarajan & Cheung, 2014, 2016; Cheung et al., 2015; Devarajan, 2008; Devarajan & Ebrahimi, 2008).
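As a rough illustration of how NMF can accommodate signal-dependent noise, the sketch below factorizes a non-negative matrix V into W and H using the classical multiplicative updates for the Kullback-Leibler divergence, which corresponds to a Poisson model in which the variance grows with the mean. This is a generic textbook algorithm, not the implementation in hpcNMF or gnmf; the function names `nmf_kl` and `kl_divergence`, the parameter defaults, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def kl_divergence(V, WH, eps=1e-10):
    """Generalized Kullback-Leibler divergence between V and its approximation WH."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))


def nmf_kl(V, rank, n_iter=200, seed=0, eps=1e-10):
    """Illustrative NMF via multiplicative updates for the KL divergence.

    V is an (n x m) non-negative array (e.g. genes x samples); returns
    non-negative factors W (n x rank) and H (rank x m) with V ~ W @ H.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Random strictly positive starting values (hypothetical choice).
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Update H: weight each entry by the ratio of data to current fit.
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        # Update W using the refreshed fit.
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


# Simulated count data standing in for an expression matrix (30 genes, 12 samples).
rng = np.random.default_rng(1)
V = rng.poisson(5.0, size=(30, 12)).astype(float)
W, H = nmf_kl(V, rank=3)
print("KL divergence after fitting:", kl_divergence(V, W @ H))
```

In a clustering application, each sample would typically be assigned to the row of H in which it has the largest coefficient; the choice of rank is left to the analyst in this sketch.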
Supervised and semi-supervised dimension reduction
In studies where prior knowledge on phenotype is available, the focus is on correlating the outcome variable of interest with covariates. For example, when information on an outcome variable such as time to an event (or survival time) is available, one of the goals of an investigator is to understand how the expression levels of genes and clinical and demographic variables (covariates) relate to an individual’s survival in the course of a disease. The analysis of time to event data, generally called survival analysis, arises in many fields of study including biology, medicine, public health, engineering, and economics. Its role and significance in cancer research cannot be overstated. The Cox proportional hazards (PH) model is the most celebrated and widely used statistical model linking survival time to covariates. It is a multiplicative hazards model that implies a constant hazard ratio, that is, it postulates that the risk (or hazard) of death of an individual given their covariate measurements is simply proportional to their baseline risk in the absence of any covariate. The model assumes that the hazard and survival curves corresponding to two different values of covariates do not cross. While this model has proved to be very useful in practice due to its simplicity and interpretability, the assumption of constant relative risk has been shown to be invalid in a variety of situations in medical studies. For example, non-proportional hazards are typical when the treatment effect increases or decreases over time, leading to converging or diverging hazards. This situation cannot be handled by the Cox PH model, and more general models that allow for non-proportionality of hazards are required for modeling survival data. To this end, we have developed a class of non-proportional hazards models that embeds the Cox PH model as a special case.
We proposed a theoretical and a computational framework for estimation using this generalized model that allows us to rigorously test the assumption of proportional hazards. This approach accounts for varying trends in the relative risk over time. Furthermore, we have developed information-theoretic methods to test the effect of an individual covariate or a group of covariates in the PH model.
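For reference, the proportional hazards assumption discussed above can be written out in standard textbook notation. The symbols here (baseline hazard h_0, coefficient vector β, covariate vectors x_1 and x_2, baseline survival function S_0) are notation introduced for illustration and do not represent the generalized model itself.

```latex
% Cox PH model: covariates act multiplicatively on a baseline hazard
h(t \mid x) = h_0(t)\,\exp(\beta^{\top} x)

% The hazard ratio for two covariate vectors does not depend on t
\frac{h(t \mid x_1)}{h(t \mid x_2)} = \exp\!\bigl(\beta^{\top}(x_1 - x_2)\bigr)

% Equivalent statement for survival functions, which therefore cannot cross
S(t \mid x) = S_0(t)^{\exp(\beta^{\top} x)}
```

Because the ratio in the second line is constant in t, any data in which the relative risk rises or falls over time falls outside this model, which is the situation a non-proportional hazards model is meant to address.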
Our preliminary work involved the development of a model for predicting patient survival by extracting components of gene expression that were strongly correlated with it. In this high-dimensional setting, it is unreasonable to expect the expression levels of the many thousands of genes to exhibit proportionality in hazards. Our current research interests in this area include the systematic comparison of several well-known models for correlating gene expression with patient survival and the identification of genes that demonstrate a time-varying effect using publicly available data from repositories such as Gene Expression Omnibus and The Cancer Genome Atlas. Indeed, our recent investigations involving the re-analysis or meta-analysis of existing gene expression data sets have revealed such a time-varying trend exhibited by several genes implicated in kidney cancer. We are currently developing methods using continuum regression (CR), a unified framework for supervised dimension reduction, in conjunction with complex survival models. CR embeds a spectrum of regression methods into a single framework that includes ordinary least squares, partial least squares and principal components regression as special cases, thereby enabling a powerful array of methods to be developed for this problem (Devarajan & Ebrahimi, 2009, 2011, 2013; Peri et al., 2013; Devarajan et al., 2010).
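To make the idea of a single framework concrete, continuum regression for an ordinary linear model is usually stated (following Stone and Brooks, 1990) as the maximization of one criterion indexed by a tuning parameter. The notation below is assumed for illustration: X is a centered covariate matrix, y a centered response, and c a unit-length direction. It describes the standard linear-model case only, not the survival-model extension under development.

```latex
% s = X^{\top} y (covariance with the response), S = X^{\top} X (covariate variation)
T_{\alpha}(c) = \bigl(c^{\top} s\bigr)^{2}\,\bigl(c^{\top} S c\bigr)^{\frac{\alpha}{1-\alpha}-1},
\qquad \lVert c \rVert = 1, \quad 0 \le \alpha < 1

% Special cases:
%   \alpha = 0     : T = (c^{\top} s)^2 / (c^{\top} S c)  -> ordinary least squares
%   \alpha = 1/2   : T = (c^{\top} s)^2                   -> partial least squares
%   \alpha \to 1   : the variance term dominates          -> principal components regression
```

Successive components are obtained by maximizing the same criterion subject to being uncorrelated with the components already extracted.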
Assessment of technical reproducibility and outlier detection in large-scale biological data
A problem that arises frequently in high-throughput studies is the assessment of technical reproducibility of data obtained under homogeneous experimental conditions. This is an important problem considering the exponential growth in the number of high-throughput technologies that have become available to the researcher in the past decade. Although methods for determining the quality of data obtained from microarrays have been in existence for many years, these methods do not necessarily translate to data obtained from other technologies. Moreover, these methods tend to be graphical in nature and do not employ rigorous statistical methods. There is an inherent need for the quantitative evaluation of the reproducibility of technical replicates obtained using novel approaches such as next-generation sequencing, high-throughput compound and siRNA screening, and SNP arrays. To this end, we have developed model-based methods that account for the technical variability and potential asymmetry that arise naturally in replicate data. This data-driven approach borrows strength from the large volume of data available in these studies and can be used for assessing technical reproducibility independent of the technology used to generate the data (Caretti et al., 2008; Anastassiadis et al., 2011, 2013; Duong-Ly et al., 2016).
Another area of active interest is the development of statistical models that detect the presence of cancer in a cohort of patients based on biomarker measurements and clinical variables. To this end, we have systematically compared the performance of various methods and developed algorithms that are better able to detect the presence of hepatocellular carcinoma in the background of cirrhosis using levels of established serum biomarkers and other relevant clinical characteristics of the patient. Our algorithm has been independently validated by several members of the Early Detection Research Network as well as the National Cancer Institute, and provides a significant improvement in prediction accuracy of up to 10% (Wang et al., 2012, 2013, 2016; Communale et al., 2013).
Advances in high-throughput technologies in the past decade have given rise to large-scale biological data that is measured on a variety of scales. Gene expression studies enable the simultaneous measurement of the expression profiles of tens of thousands of genes and proteins, often from only a handful of biological samples. Data is typically presented as a two-way numeric table in which the rows represent the genes, the columns represent the samples and each entry consists of the expression level of a given gene in a given sample. The samples may represent a phenotype such as tissue type, experimental condition or time point. Traditionally these studies have involved the use of microarray technology to measure mRNA expression and, more recently, the use of SNP arrays to measure allele-specific expression and DNA copy number variation, methylation arrays to quantify DNA methylation, and next-generation sequencing technologies, such as RNA-Seq and ChIP-Seq, for the measurement of digital gene expression. In addition, high-throughput compound and siRNA screening assays are specifically designed to detect interactions with compounds by directly measuring inhibition of siRNA or kinase activity.
These studies have resulted in massive amounts of data requiring analysis and interpretation while offering tremendous potential for growth in our understanding of the pathophysiology of many diseases. The focus of my research is the development of novel statistical methodology for the analysis of data stemming from such high-throughput studies. It includes methods for dimension reduction and molecular pattern discovery as well as for correlating a qualitative or quantitative outcome variable (including tissue type, presence of disease, patient response to treatment, survival time) with large numbers of covariates (genes, SNPs or sequence tags) based on supervised and unsupervised learning. The primary focus of my research activities consists of the following two problems from statistical learning theory: non-negative matrix factorization and continuum regression.
Devarajan K, Cheung VC. A Quasi-Likelihood Approach to Nonnegative Matrix Factorization. Neural Comput. 2016 Aug;28(8):1663-93. doi: 10.1162/NECO_a_00853. Epub 2016 Jun 27. PubMed
Duong-Ly KC, Devarajan K, Liang S, Horiuchi KY, Wang Y, Ma H, Peterson JR. Kinase Inhibitor Profiling Reveals Unexpected Opportunities to Inhibit Disease-Associated Mutant Kinases. Cell Rep. 2016 Feb 2;14(4):772-81. doi: 10.1016/j.celrep.2015.12.080. Epub 2016 Jan 14. PubMed
Wang M, Devarajan K, Singal AG, Marrero JA, Dai J, Feng Z, Rinaudo JA, Srivastava S, Evans A, Hann HW, Lai Y, Yang H, Block TM, Mehta A. The Doylestown Algorithm: A Test to Improve the Performance of AFP in the Detection of Hepatocellular Carcinoma. Cancer Prev Res (Phila). 2016 Feb;9(2):172-9. doi: 10.1158/1940-6207.CAPR-15-0186. Epub 2015 Dec 28. PubMed
Devarajan K, Wang G, Ebrahimi N. A unified statistical approach to non-negative matrix factorization and probabilistic latent semantic indexing. Mach Learn. 2015 Apr 1;99(1):137-163. PubMed. COBRA pre-print series, Article 80. (July 2011). http://biostats.bepress.com/cobra/art80.
Devarajan K, Cheung VC. On nonnegative matrix factorization algorithms for signal-dependent noise with application to electromyography data. Neural Comput. 2014 Jun;26(6):1128-68. doi: 10.1162/NECO_a_00576. Epub 2014 Mar 31. PubMed
Anastassiadis T, Deacon SW, Devarajan K, Ma H, Peterson JR. Comprehensive assay of kinase catalytic activity reveals features of kinase inhibitor selectivity. Nat Biotechnol. 2011 Oct 30;29(11):1039-45. doi: 10.1038/nbt.2017. PubMed
Cortellino S, Xu J, Sannai M, Moore R, Caretti E, Cigliano A, Le Coz M, Devarajan K, Wessels A, Soprano D, Abramowitz LK, Bartolomei MS, Rambow F, Bassi MR, Bruno T, Fanciulli M, Renner C, Klein-Szanto AJ, Matsumoto Y, Kobi D, Davidson I, Alberti C, Larue L, Bellacosa A. Thymine DNA glycosylase is essential for active DNA demethylation by linked deamination-base excision repair. Cell. 2011 Jul 8;146(1):67-79. doi: 10.1016/j.cell.2011.06.020. Epub 2011 Jun 30. PubMed
Devarajan K, Ebrahimi N. A semi-parametric generalization of the Cox proportional hazards regression model: Inference and Applications. Comput Stat Data Anal. 2011 Jan 1;55(1):667-676. PubMed
Astsaturov I, Ratushny V, Sukhanova A, Einarson MB, Bagnyukova T, Zhou Y, Devarajan K, Silverman JS, Tikhmyanova N, Skobeleva N, Pecherskaya A, Nasto RE, Sharma C, Jablonski SA, Serebriiskii IG, Weiner LM, Golemis EA. Synthetic lethal screen of an EGFR-centered network to improve targeted therapies. Sci Signal. 2010 Sep 21;3(140):ra67. doi: 10.1126/scisignal.2001083. PubMed
Devarajan K, Zhou Y, Chachra N, Ebrahimi N. A supervised approach for predicting patient survival with gene expression data. Proc IEEE Int Symp Bioinformatics Bioeng. 2010;2010(5521718):26-31. PubMed
Bellacosa A, Godwin AK, Peri S, Devarajan K, Caretti E, Vanderveer L, Bove B, Slater C, Zhou Y, Daly M, Howard S, Campbell KS, Nicolas E, Yeung AT, Clapper ML, Crowell JA, Lynch HT, Ross E, Kopelovich L, Knudson AG. Altered gene expression in morphologically normal epithelial cells from heterozygous carriers of BRCA1 or BRCA2 mutations. Cancer Prev Res (Phila). 2010 Jan;3(1):48-61. doi: 10.1158/1940-6207.CAPR-09-0078. PubMed
Altomare DA, Vaslet CA, Skele KL, De Rienzo A, Devarajan K, Jhanwar SC, McClatchey AI, Kane AB, Testa JR. A mouse model recapitulating molecular features of human mesothelioma. Cancer Res. 2005 Sep 15;65(18):8090-5. PubMed
Statistical and Computing Software
hpcNMF: a C++-based software package for generalized non-negative matrix factorization using high-performance computing clusters, available in Linux, Windows and Mac OS versions (jointly with G. Wang, http://devarajan.fccc.edu)
gnmf: an R package for generalized non-negative matrix factorization (jointly with J. Maisog and G. Wang, http://cran.r-project.org/web/packages/gnmf/).