Unsupervised dimension reduction and model-based clustering for high-dimensional data with applications in molecular pattern discovery, biomedical informatics, imaging and neuroscience
Assessment of technical reproducibility and probability-based methods for outlier detection in large-scale biological data
Supervised and semi-supervised learning methods
Analysis of censored survival data
Feature selection and predictive modeling for large-scale genomic data in the presence of censored survival outcomes
Integrative genomics analysis investigating the association between digital gene expression, single nucleotide polymorphisms, copy number variation, methylation and censored survival outcomes
Statistical machine learning methods for biomarker discovery
Advances in high-throughput genomic technologies over the past two decades have given rise to large-scale biological data measured on a variety of scales. Genome-wide studies enable the simultaneous measurement of the expression profiles of tens of thousands of genomic features from an ever-increasing number of biological samples that may represent phenotypes, experimental conditions or time points. Examples include studies of various types of gene and protein expression, methylation and copy number variation, and high-throughput compound screening assays, among others. Similarly, studies in biomedical imaging and computational neuroscience generate tens of thousands of signals from brain or muscle activity under a variety of experimental conditions across the time-frequency domain. These massive data sets offer tremendous potential for growth in our understanding of the pathophysiology of many diseases. My research spans the two major areas of statistical learning, unsupervised and supervised, as well as survival analysis, with applications in the aforementioned domains. Its principal focus is the development of statistical and computational approaches for high-dimensional data, including methods for dimension reduction as well as methods for correlating a quantitative or qualitative outcome variable (such as patient survival time, presence of disease or patient response to treatment) with a large number of covariates (genomic, clinical, laboratory and demographic variables). Our current research activities involve the development of methods for analyzing data from microbiome, radiomics and single-cell RNA-seq studies.
Unsupervised learning methods
Unsupervised dimension reduction
We have developed methods for unsupervised dimension reduction and model-based clustering of large-scale biological data and demonstrated their applications in high-throughput genomics, biomedical informatics, imaging and computational neuroscience using non-negative matrix factorization (NMF). An important, but often ignored, aspect of high-dimensional biological data is the signal-dependent and correlated nature of noise in the measurements. We addressed this problem by developing a variety of methods (i) using an information-theoretic approach, (ii) by extending NMF using the theory of generalized linear models and quasi-likelihood and (iii) by developing a statistical framework for NMF using generalized dual divergence. Our methods provide a unified framework for the modeling and analysis of data obtained on different scales and are broadly applicable to a variety of high-dimensional data. We have developed computational tools for dimension reduction and visualization using NMF that are freely available to the academic research community. These include hpcNMF, a C++ package that uses high-performance computing clusters (http://devarajan.fccc.edu/) and the R package gnmf (http://cran.r-project.org/web/packages/gnmf/index.html).
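As background for the NMF methods described above, the following is a minimal sketch of standard NMF with multiplicative updates for the generalized Kullback-Leibler divergence, the objective that corresponds to Poisson-type noise whose variance grows with the signal. It is not the algorithm implemented in hpcNMF or gnmf; the function names `nmf_kl` and `kl_divergence`, the random initialization and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np


def nmf_kl(V, rank, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative matrix V (features x samples) by W @ H.

    Uses the classical multiplicative updates for the generalized
    Kullback-Leibler divergence. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps  # basis vectors (e.g. metagenes)
    H = rng.random((rank, m)) + eps  # encodings, one column per sample
    for _ in range(n_iter):
        WH = W @ H + eps
        # Update H: weight each entry by how well WH currently explains V.
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # Update W symmetrically.
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


def kl_divergence(V, WH, eps=1e-10):
    """Generalized KL divergence between V and its reconstruction WH."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))
```

In a clustering application, each sample would typically be assigned to the row of `H` with the largest value in its column; the choice of `rank` is left to the analyst.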
A problem that arises frequently in high-throughput studies is the assessment of technical reproducibility of data obtained under homogeneous experimental conditions. This is an important problem considering the significant growth in the number of high-throughput technologies that have become available to the researcher in the past two decades. Existing methods for determining data quality are typically graphical, lack statistical rigor and do not necessarily translate to data obtained across multiple technologies; also, there is an inherent need for quantitative evaluation of reproducibility. To this end, we have developed empirical model-based methods as well as probability-based methods that account for technical variability and potential asymmetry that arise naturally in replicate data. This data-driven approach borrows strength from the large volume of available data and is broadly applicable to a variety of high-throughput studies, such as next-generation sequencing, compound and siRNA screening and other modern “omics” studies, for assessing technical reproducibility and identifying outliers. The R package replicateOutliers implements five different methods for outlier detection and is available at https://github.com/matthew-seth-smith/replicateOutliers.
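To make the idea of flagging discordant replicates concrete, here is a generic sketch that scores the difference between two replicate measurements with a robust z-score based on the median and the median absolute deviation (MAD). This is a textbook heuristic shown for orientation only; it is not one of the five methods in replicateOutliers, and the function name `flag_replicate_outliers` and the default threshold of 3.5 are assumptions.

```python
import numpy as np


def flag_replicate_outliers(rep1, rep2, threshold=3.5):
    """Flag features whose two replicate measurements disagree unusually.

    Generic robust z-score on replicate differences (hypothetical helper,
    not the replicateOutliers API).
    """
    diff = np.asarray(rep1, dtype=float) - np.asarray(rep2, dtype=float)
    center = np.median(diff)
    mad = np.median(np.abs(diff - center))
    scale = 1.4826 * mad  # makes the MAD comparable to a normal standard deviation
    if scale == 0:
        # No spread at all in the differences: nothing can be scored.
        return np.zeros(diff.shape, dtype=bool)
    return np.abs((diff - center) / scale) > threshold
```

Because the center and scale are estimated from all features at once, the rule borrows strength from the full data set, but unlike the methods described above it treats positive and negative deviations symmetrically.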
Supervised and semi-supervised learning methods
Analysis of censored survival data
In studies where information on an outcome variable such as time to an event (or survival time) is available, one of the goals of an investigator is to understand how the expression levels of genomic, clinical and demographic variables (covariates) relate to an individual’s survival in the course of a disease. The analysis of time to event (or survival) data arises in many fields of study such as biology, medicine and public health, and its role and significance in cancer research cannot be overstated. The Cox proportional hazards (PH) model is the most celebrated and widely used statistical model linking survival time to covariates. It is a multiplicative hazards model that implies a constant hazard ratio and assumes that the hazard and survival curves do not cross. While this model has proved to be very useful in practice due to its simplicity and interpretability, the assumption of a constant hazard ratio has been shown to be invalid in a variety of situations in medical studies. For example, non-proportional hazards are typical when the treatment effect increases or decreases over time, leading to converging or diverging hazards. This situation cannot be handled by the Cox PH model, and more general models that allow for non-proportionality of hazards are required for modeling survival data. To this end, we have developed a class of non-proportional hazards models that embeds the Cox PH model as a special case. We proposed a theoretical and a computational framework for estimation using this generalized model that allows us to rigorously test the PH assumption. Furthermore, we have developed information-theoretic methods to test the effect of an individual covariate or a group of covariates in the PH model as well as in complex survival models that account for varying trends in hazard over time.
By identifying different classes of probability link models with symmetric information divergence, we have proposed computationally efficient solutions to the problem of model averaging and model selection.
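The distinction between proportional and non-proportional hazards can be illustrated numerically. The sketch below compares two groups with Weibull hazards: when both groups share the same shape parameter the hazard ratio is constant in time, and when the shapes differ the ratio changes with time and the hazards cross. The Weibull form and all parameter values are assumptions chosen for illustration; this is not the generalized model described above.

```python
import numpy as np


def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    t = np.asarray(t, dtype=float)
    return (shape / scale) * (t / scale) ** (shape - 1.0)


t = np.linspace(0.5, 5.0, 10)

# Same shape in both groups: the ratio does not depend on t (proportional hazards).
hr_ph = weibull_hazard(t, shape=1.5, scale=2.0) / weibull_hazard(t, shape=1.5, scale=3.0)

# Different shapes: the ratio falls over time and passes through 1 (crossing hazards).
hr_nph = weibull_hazard(t, shape=0.8, scale=2.0) / weibull_hazard(t, shape=1.8, scale=2.0)
```

A single Cox PH coefficient summarizes the first comparison well, but for the second it would average over a ratio that starts above 1 and ends below it.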
Feature selection and predictive modeling for large-scale genomic data with censored survival outcomes
Within the context of high-throughput genomic data, our preliminary work involved the development of a model for predicting patient survival by extracting genomic components that were strongly correlated with it. In this high-dimensional setting, it is unreasonable to expect the expression levels of the many thousands of genomic features to exhibit proportionality in hazards. Our current research interests in this area include the systematic comparison of several well-known models for correlating genomic feature expression with patient survival and the identification of features that demonstrate a time-varying effect using publicly available data from repositories such as Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA).
We have developed an array of measures, using information divergence, for quantifying explained randomness in different survival models that incorporate time-varying effects of features, as well as a generalized pseudo-R2 index that covers a spectrum of such survival models. Indeed, our investigations involving the re-analysis or meta-analysis of existing data sets have revealed such a time-varying trend in several genomic features implicated in kidney, head and neck, ovarian and brain cancers. Furthermore, we have developed methods using continuum regression (CR), a unified framework for supervised dimension reduction, in conjunction with the accelerated failure time model for predictive modeling. CR embeds a spectrum of regression methods into a single framework that includes ordinary least squares, partial least squares and principal components regression as special cases, thereby enabling a powerful array of methods to be developed for this problem within the linear models framework. R packages implementing these methods are freely available to the research community at https://github.com/lburns27/Feature-Selection and at https://github.com/lburns27/ACPR-AFT.
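For readers unfamiliar with continuum regression, its criterion in the classical formulation of Stone and Brooks for an uncensored response can be written as below. This is given as background only: the notation (centered covariate matrix X, centered response y, unit-length direction c and tuning parameter α) is chosen here for illustration, and the adaptation to the accelerated failure time model with censored outcomes is not shown.

```latex
T_\alpha(c) \;=\; \bigl(c^\top X^\top y\bigr)^{2}\,
                  \bigl(c^\top X^\top X\, c\bigr)^{\frac{\alpha}{1-\alpha}-1},
\qquad \lVert c \rVert = 1, \quad 0 \le \alpha < 1 .
```

Maximizing this criterion over c at α = 0 gives the ordinary least squares direction, α = 1/2 gives the first partial least squares direction, and the limit α → 1 gives the first principal component, which is how a single parameter spans the three special cases named above.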
Integrative genomics analysis
In collaboration with the Ragin laboratory, we are investigating the association between digital gene expression, single nucleotide polymorphisms (SNP), copy number variation, methylation and survival in different cancers using data from TCGA and GEO. One such integrative genomic analysis identified several ancestry-related SNPs for the POLB gene and supported the association of genetic ancestry with survival disparity in head and neck cancer. A follow-up study analyzing genome-wide expression quantitative trait loci identified candidate genes associated with survival. In an ongoing study funded by the ACS (Molecular Modeling, Genomics and Racial Disparities in HNSCC, PI: Ragin), we are investigating (i) the genetic susceptibility of Blacks to developing HNSCC by aiming to identify distinctive polymorphic and metabolic profiles related to gene expression and function and (ii) the association between treatment and survival according to race by aiming to determine whether genetic variations related to relevant biological pathways modify this association.
Statistical machine learning methods for biomarker discovery
Another area of active interest is the development and novel application of modern statistical machine learning methods for detecting the presence of cancer in a cohort of patients based on biomarker measurements and clinical variables. In collaboration with colleagues at the Drexel College of Medicine, we have systematically compared the performance of various methods and developed the Doylestown algorithm, which is better able to detect the presence of hepatocellular carcinoma in the background of cirrhosis using levels of established serum biomarkers and other relevant clinical characteristics of the patient. Our algorithm has been independently validated by the Early Detection Research Network as well as the National Cancer Institute, and provides a significant improvement in prediction accuracy of up to 20%.
Spirko-Burns L, Devarajan K. Supervised dimension reduction for large-scale "omics" data with censored survival outcomes under possible non-proportional hazards. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2021;18(5):2032-2044. doi: 10.1109/TCBB.2020.2965934. PMID: 31940547. https://www.ncbi.nlm.nih.gov/pubmed/31940547.
Spirko-Burns L, Devarajan K. Unified methods for feature selection in large-scale genomic studies with censored survival outcomes. Bioinformatics, Volume 36, Issue 11, June 2020, Pages 3409–3417, https://doi.org/10.1093/bioinformatics/btaa161. PMID: 32154833.
Devarajan K, Cheung VC. A Quasi-Likelihood Approach to Nonnegative Matrix Factorization. Neural Computation. 2016 Aug;28(8):1663-93. Epub 2016 Jun 27. PubMed PMID: 27348511; PubMed Central PMCID: PMC5549860.
Devarajan K, Wang G, Ebrahimi N. A unified statistical approach to non-negative matrix factorization and probabilistic latent semantic indexing. Machine Learning. 2015 Apr 1;99(1):137-163. PMID: 25821345; PMCID: PMC4371760.
Devarajan K, Cheung VC. On nonnegative matrix factorization algorithms for signal-dependent noise with application to electromyography data. Neural Computation. 2014 Jun;26(6):1128-68. PMID: 24684448; PMCID: PMC5548326.
Devarajan K, Ebrahimi N. A semi-parametric generalization of the Cox proportional hazards regression model: Inference and Applications. Computational Statistics and Data Analysis. 2011 Jan 1;55(1):667-676. PMID: 21076652; PMCID: PMC2976538.
Devarajan K. Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLoS Computational Biology. 2008 Jul 25;4(7):e1000029. doi: 10.1371/journal.pcbi.1000029. Review. PMID: 18654623; PMCID: PMC2447881.
Chang WL, Jackson C, Riel S, Cooper HS, Devarajan K, Hensley HH, Zhou Y, Vanderveer LA, Nguyen MT, Clapper ML. Differential preventive activity of sulindac and atorvastatin in Apc(+/Min-FCCC) mice with or without colorectal adenomas. Gut. 2018 Jul;67(7):1290-1298. Epub 2017 Nov 9. PubMed PMID: 29122850; PubMed Central PMCID: PMC6031273.
Ramakodi MP, Devarajan K, Blackman E, Gibbs D, Luce D, Deloumeaux J, Duflo S, Liu JC, Mehra R, Kulathinal RJ, Ragin CC. Integrative genomic analysis identifies ancestry-related expression quantitative trait loci on DNA polymerase β and supports the association of genetic ancestry with survival disparities in head and neck squamous cell carcinoma. Cancer. 2017 Mar 1;123(5):849-860. PMID: 27906459; PMCID: PMC5319896.
Zook P, Pathak HB, Belinsky MG, Gersz L, Devarajan K, Zhou Y, Godwin AK, von Mehren M, Rink L. Combination of imatinib mesylate and akt inhibitor provides synergistic effects in preclinical study of gastrointestinal stromal tumor. Clinical Cancer Research. 2017;23(1):171-180. PMCID: PMC5203981.
Wang M, Devarajan K, Singal AG, Marrero JA, Dai J, Feng Z, Rinaudo JA, Srivastava S, Evans A, Hann HW, Lai Y, Yang H, Block TM, Mehta A. The Doylestown Algorithm: A Test to Improve the Performance of AFP in the Detection of Hepatocellular Carcinoma. Cancer Prevention Research (Phila). 2016 Feb;9(2):172-9. PMID: 26712941; PMCID: PMC4740237.
Duong-Ly KC, Devarajan K, Liang S, Horiuchi KY, Wang Y, Ma H, Peterson JR. Kinase Inhibitor Profiling Reveals Unexpected Opportunities to Inhibit Disease-Associated Mutant Kinases. Cell Reports. 2016 Feb 2;14(4):772-781. PMID: 26776524; PMCID: PMC4740242.
Peri S, Devarajan K, Yang DH, Knudson AG, Balachandran S. Meta-analysis identifies NF-κB as a therapeutic target in renal cancer. PLoS One. 2013 Oct 7;8(10):e76746. PMID: 24116146; PubMed Central PMCID: PMC3792024.
Anastassiadis T, Deacon SW, Devarajan K, Ma H, Peterson JR. Comprehensive assay of kinase catalytic activity reveals features of kinase inhibitor selectivity. Nature Biotechnology. 2011 Oct 30;29(11):1039-45. PMID: 22037377; PMCID: PMC3230241.
Cortellino S, Xu J, Sannai M, Moore R, Caretti E, Cigliano A, Le Coz M, Devarajan K, Wessels A, Soprano D, Abramowitz LK, Bartolomei MS, Rambow F, Bassi MR, Bruno T, Fanciulli M, Renner C, Klein-Szanto AJ, Matsumoto Y, Kobi D, Davidson I, Alberti C, Larue L, Bellacosa A. Thymine DNA glycosylase is essential for active DNA demethylation by linked deamination-base excision repair. Cell. 2011 Jul 8;146(1):67-79. PMID: 21722948; PMCID: PMC3230223.
Astsaturov I, Ratushny V, Sukhanova A, Einarson MB, Bagnyukova T, Zhou Y, Devarajan K, Silverman JS, Tikhmyanova N, Skobeleva N, Pecherskaya A, Nasto RE, Sharma C, Jablonski SA, Serebriiskii IG, Weiner LM, Golemis EA. Synthetic lethal screen of an EGFR-centered network to improve targeted therapies. Science Signaling. 2010 Sep 21;3(140):ra67. PMID: 20858866; PMCID: PMC2950064.
Altomare DA, Vaslet CA, Skele KL, De Rienzo A, Devarajan K, Jhanwar SC, McClatchey AI, Kane AB, Testa JR. A mouse model recapitulating molecular features of human mesothelioma. Cancer Research. 2005 Sep 15;65(18):8090-5. PMID: 16166281.
hpcNMF: C++ based software package for generalized non-negative matrix factorization using high-performance computing clusters, available in Linux, Windows and Mac OS versions (jointly with G. Wang). Available at devarajan.fccc.edu.
ACPR-AFT: Algorithms for supervised dimension reduction of large-scale “omics” data with censored survival outcomes under possible non-proportional hazards (jointly with L. Spirko-Burns). Available at github.com/lburns27/ACPR-AFT.
Feature-Selection: Methods for feature selection in large-scale “omics” data with censored survival outcomes under possible non-proportional hazards (jointly with L. Spirko-Burns). Available at github.com/lburns27/Feature-Selection.
Pre-prints available online
Smith, M., Devarajan, K. Probability-based methods for outlier detection in replicated high-throughput biological data. bioRxiv 240473; doi: https://doi.org/10.1101/2020.08.07.240473.
Asadi, M., Devarajan, K., Ebrahimi, N., Soofi, E., Spirko-Burns, L. Probability link models with symmetric information divergence. arXiv: 2008.04387v1 [stat.ML] 10 Aug 2020. https://arxiv.org/abs/2008.04387.
Devarajan, K. (2019). Non-negative matrix factorization based on generalized dual divergence. arXiv: 1905.07034v1 [stat.ML] 16 May 2019. https://arxiv.org/abs/1905.07034.