Karthik Devarajan, PhD

Associate Professor, Population Science


    Education and Training

    Educational Background

    • PhD, Northern Illinois University
    • MSc, Tech, Birla Institute of Technology & Science, India

    Industry Experience

    • Statistical Scientist, Cancer Bioinformatics, AstraZeneca R&D Boston, Waltham, MA
    • Biostatistician, Bristol-Myers Squibb Pharmaceutical Research Institute, Bristol-Myers Squibb, Princeton, NJ
    Research Profile

    Research Facility

    Research Interests

    Unsupervised dimension reduction

    We have developed methods for unsupervised dimension reduction and model-based clustering of large-scale biological data and demonstrated their applications in high-throughput genomics, biomedical informatics, imaging and neuroscience using non-negative matrix factorization (NMF). An important, but often ignored, aspect of high-dimensional biological data is the signal-dependent and correlated nature of noise in the measurements. We addressed this problem by developing a variety of methods (i) using an information-theoretic approach and (ii) by extending NMF using the theory of generalized linear models and quasi-likelihood. Our methods provide a unified framework for the modeling and analysis of data obtained on different scales and are broadly applicable to a variety of high-dimensional data. We have developed computational tools for dimension reduction and visualization using NMF that are freely available to the academic research community. These include hpcNMF, a C++ package that uses high-performance computing clusters, and the R package GNMF (Devarajan et al., 2015a,b; Devarajan & Cheung, 2014, 2016; Cheung et al., 2015; Devarajan, 2008, 2019; Devarajan & Ebrahimi, 2008).
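
    As a rough illustration of NMF under signal-dependent noise, the sketch below uses the classical multiplicative updates that minimize the generalized Kullback-Leibler divergence, which corresponds to a Poisson-type model in which the variance grows with the mean. It is not the quasi-likelihood or information-theoretic methods described above, and it does not reflect the interface of hpcNMF or GNMF; the function names, the toy count matrix and the choice of rank are assumptions made for this example. Plain Python lists are used only to keep the example dependency-free.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def ratio(V, WH, eps=1e-12):
    """Element-wise V / WH, guarded against division by zero."""
    return [[v / (wh + eps) for v, wh in zip(v_row, wh_row)]
            for v_row, wh_row in zip(V, WH)]


def kl_divergence(V, WH, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(V || WH)."""
    total = 0.0
    for v_row, wh_row in zip(V, WH):
        for v, wh in zip(v_row, wh_row):
            if v > 0:
                total += v * math.log(v / (wh + eps))
            total += wh - v
    return total


def nmf_kl(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Factor V (n x m, non-negative) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    history = []
    for _ in range(n_iter):
        # Update H: H <- H * (W^T (V / WH)) / (column sums of W)
        num = matmul(transpose(W), ratio(V, matmul(W, H), eps))
        w_sums = [sum(col) for col in zip(*W)]
        H = [[H[k][j] * num[k][j] / (w_sums[k] + eps) for j in range(m)]
             for k in range(rank)]
        # Update W: W <- W * ((V / WH) H^T) / (row sums of H)
        num = matmul(ratio(V, matmul(W, H), eps), transpose(H))
        h_sums = [sum(row) for row in H]
        W = [[W[i][k] * num[i][k] / (h_sums[k] + eps) for k in range(rank)]
             for i in range(n)]
        history.append(kl_divergence(V, matmul(W, H), eps))
    return W, H, history


# Toy count matrix (rows: features, columns: samples); the values are made up.
V = [
    [12, 10, 1, 0, 2],
    [9, 11, 0, 1, 1],
    [14, 13, 2, 1, 0],
    [1, 0, 8, 9, 10],
    [0, 2, 11, 10, 9],
    [2, 1, 9, 12, 11],
]
W, H, history = nmf_kl(V, rank=2)
print(f"KL divergence after first iteration: {history[0]:.3f}, after last: {history[-1]:.3f}")
```

    The columns of W can be read as basis profiles and the columns of H as sample-level weights, which is what makes the factorization usable for clustering and visualization.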

    Supervised and semi-supervised dimension reduction

    In studies where information on an outcome variable such as time to an event (or survival time) is available, one of the goals of an investigator is to understand how the expression levels of genomic, clinical and demographic variables (covariates) relate to an individual’s survival in the course of a disease. The analysis of time to event (or survival) data arises in many fields of study such as biology, medicine and public health, and its role and significance in cancer research cannot be overstated. The Cox proportional hazards (PH) model is the most celebrated and widely used statistical model linking survival time to covariates. It is a multiplicative hazards model that implies a constant hazard ratio and assumes that the hazard and survival curves do not cross. While this model has proved to be very useful in practice due to its simplicity and interpretability, the assumption of a constant hazard ratio has been shown to be invalid in a variety of situations in medical studies. For example, non-proportional hazards are typical when the treatment effect increases or decreases over time, leading to converging or diverging hazards. This situation cannot be handled by the Cox PH model, and more general models that consider non-proportionality of hazards are required for modeling survival data. To this end, we have developed a class of non-proportional hazards models that embeds the Cox PH model as a special case. We proposed a theoretical and a computational framework for estimation using this generalized model that allows us to rigorously test the PH assumption. Furthermore, we have developed information-theoretic methods to test the effect of an individual covariate or a group of covariates in the PH model as well as in complex survival models that account for varying trends in hazard over time.
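
    The constant hazard ratio and the non-crossing property can be read directly from the standard form of the Cox PH model. The notation below (baseline hazard h_0, baseline survival function S_0, coefficient vector beta, covariate vectors x, x_1 and x_2) is assumed for this illustration; the generalized non-proportional hazards model mentioned above is not shown.

```latex
\begin{align*}
  % Cox PH model: baseline hazard scaled by a covariate-dependent factor
  h(t \mid x) &= h_0(t)\,\exp(\beta^\top x) \\
  % The hazard ratio for two covariate vectors does not depend on t
  \frac{h(t \mid x_1)}{h(t \mid x_2)} &= \exp\{\beta^\top (x_1 - x_2)\} \\
  % Survival curves are powers of one baseline curve, so they stay ordered for all t
  S(t \mid x) &= S_0(t)^{\exp(\beta^\top x)}
\end{align*}
```

    Because the ratio in the second line is free of t, a treatment effect that strengthens or fades over time cannot be represented without relaxing the model.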

    Within the context of high-throughput genomic data, our preliminary work involved the development of a model for predicting patient survival by extracting genomic components that were strongly correlated with it. In this high-dimensional setting, it is unreasonable to expect the expression levels of the many thousands of genomic features to exhibit proportionality in hazards. Our current research interests in this area include the systematic comparison of several well-known models for correlating genomic feature expression with patient survival and the identification of features that demonstrate a time-varying effect using publicly available data from repositories such as Gene Expression Omnibus and The Cancer Genome Atlas. We have developed an array of measures for quantifying explained variation and predictive accuracy based on these survival models, which incorporate different time-varying effects of features. Indeed, our investigations involving the re-analysis or meta-analysis of existing data sets have revealed such a time-varying trend exhibited by several genomic features implicated in kidney, head and neck, ovarian and brain cancers. Furthermore, we have developed methods using continuum regression (CR), a unified framework for supervised dimension reduction, in conjunction with the accelerated failure time model for predictive modeling. CR embeds a spectrum of regression methods into a single framework that includes ordinary least squares, partial least squares and principal components regression as special cases, thereby enabling a powerful array of methods to be developed for this problem within the linear models framework. We are currently developing R packages implementing these methods that will be freely available to the research community (Devarajan & Ebrahimi, 2009, 2011, 2013; Peri et al., 2013; Devarajan et al., 2010; Spirko-Burns & Devarajan, 2019a,b).
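
    For readers unfamiliar with continuum regression, the criterion below is the usual textbook formulation for an uncensored response, written with assumed notation: X is the centered covariate matrix, y the centered response, c a unit-length direction and alpha a tuning parameter. It is a sketch of how the three special cases arise, not the censored-data extension with the accelerated failure time model described above.

```latex
\begin{align*}
  % Direction c maximizes a blend of covariance with y and variance of Xc
  T_\alpha(c) &= \left(c^\top X^\top y\right)^{2}
                 \left(c^\top X^\top X c\right)^{\frac{\alpha}{1-\alpha} - 1},
  \qquad \lVert c \rVert = 1,\quad 0 \le \alpha < 1 \\
  % Special cases
  \alpha = 0 &: \text{ordinary least squares (maximal squared correlation)} \\
  \alpha = \tfrac{1}{2} &: \text{partial least squares (maximal squared covariance)} \\
  \alpha \to 1 &: \text{principal components regression (variance dominates)}
\end{align*}
```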

    Assessment of technical reproducibility and outlier detection in large-scale biological data

    A problem that arises frequently in high-throughput studies is the assessment of technical reproducibility of data obtained under homogeneous experimental conditions. This is an important problem considering the significant growth in the number of high-throughput technologies that have become available to the researcher in the past two decades. Existing methods for determining data quality are typically graphical, lack statistical rigor and do not necessarily translate to data obtained across multiple technologies; also, there is an inherent need for quantitative evaluation of reproducibility. To this end, we have developed model-based methods that account for technical variability and potential asymmetry that arise naturally in replicate data. This data-driven approach borrows strength from the large volume of available data and is broadly applicable to a variety of high-throughput studies (such as next-generation sequencing, compound and siRNA screening and other modern “omics” studies) for assessing technical reproducibility and identifying outliers. The R package replicateOutliers implements five different methods for outlier detection and will be released soon on CRAN (Caretti et al., 2008; Anastassiadis et al., 2011, 2013; Duong-Ly et al., 2016; Smith & Devarajan, 2019).
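
    To make the idea of quantitative outlier detection in replicate data concrete, the sketch below flags duplicate measurements whose difference is extreme relative to a robust estimate of technical variability (median and median absolute deviation). This is a generic robust z-score rule chosen for illustration; it is not one of the five methods in replicateOutliers, it does not model asymmetry, and the function name, cutoff and data are assumptions.

```python
import statistics


def flag_replicate_outliers(rep1, rep2, threshold=3.5):
    """Flag pairs whose replicate difference is extreme on a robust z-scale."""
    if len(rep1) != len(rep2):
        raise ValueError("replicate vectors must have the same length")
    diffs = [a - b for a, b in zip(rep1, rep2)]
    center = statistics.median(diffs)
    mad = statistics.median([abs(d - center) for d in diffs])
    # 1.4826 rescales the MAD to match the standard deviation under normality.
    scale = 1.4826 * mad
    if scale == 0:
        # Degenerate case: no spread to judge against, so flag nothing.
        return [False] * len(diffs)
    return [abs(d - center) / scale > threshold for d in diffs]


# Made-up duplicate measurements of the same eight features.
rep1 = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 15.0, 10.1]
rep2 = [10.0, 9.9, 10.2, 10.1, 10.0, 10.1, 10.0, 10.0]
flags = flag_replicate_outliers(rep1, rep2)
print([i for i, flagged in enumerate(flags) if flagged])
```

    Estimating the spread from all pairs at once is a simple instance of borrowing strength from the full data set rather than judging each pair in isolation.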

    Biomarker discovery

    Another area of active interest is the development and novel application of modern statistical machine learning methods for detecting the presence of cancer in a cohort of patients based on biomarker measurements and clinical variables. To this end, we have systematically compared the performance of various methods and developed the Doylestown algorithm that is better able to detect the presence of hepatocellular carcinoma in the background of cirrhosis using levels of established serum biomarkers and other relevant clinical characteristics of the patient. Our algorithm has been independently validated by the Early Detection Research Network as well as the National Cancer Institute, and provides a significant improvement in prediction accuracy of up to 20% (Wang et al., 2012, 2013, 2016; Communale et al., 2013).

    Lab Overview

    Advances in high-throughput genomic technologies in the past two decades have given rise to large-scale biological data that is measured on a variety of scales. Genome-wide studies enable the simultaneous measurement of the expression profiles of tens of thousands of genomic variables, from an ever-increasing number of biological samples that may represent phenotypes, experimental conditions or time points. Examples include studies of various types of gene and protein expression, methylation and copy number variation, and high-throughput compound screening assays, among others. Similarly, studies in biomedical imaging and neuroscience generate tens of thousands of signals from brain or muscle activity under a variety of experimental conditions across the time-frequency domain. These massive data sets offer tremendous potential for growth in our understanding of the pathophysiology of many diseases. My research spans the two major areas of statistical learning, unsupervised and supervised, as well as survival analysis, with applications in the aforementioned domains. Its principal focus is on the development of statistical and computational approaches for high-dimensional data and includes methods for dimension reduction as well as for correlating a quantitative or qualitative outcome variable (such as patient survival time, presence of disease or patient response to treatment) with a large number of covariates (genomic, clinical, laboratory and demographic variables).

    Lab Staff

    Matthew Smith, BS, MPH


    Selected Publications

    Devarajan K, Cheung VC. A Quasi-Likelihood Approach to Nonnegative Matrix Factorization. Neural Computation. 2016 Aug;28(8):1663-93. Epub 2016 Jun 27. PMID: 27348511; PMCID: PMC5549860.

    Devarajan K, Wang G, Ebrahimi N. A unified statistical approach to non-negative matrix factorization and probabilistic latent semantic indexing. Machine Learning. 2015 Apr 1;99(1):137-163. PMID: 25821345; PMCID: PMC4371760.

    Devarajan K, Cheung VC. On nonnegative matrix factorization algorithms for signal-dependent noise with application to electromyography data. Neural Computation. 2014 Jun;26(6):1128-68. PMID: 24684448; PMCID: PMC5548326.

    Devarajan K. Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLoS Computational Biology. 2008 Jul 25;4(7):e1000029. doi: 10.1371/journal.pcbi.1000029. Review. PMID: 18654623; PMCID: PMC2447881.

    Devarajan K, Ebrahimi N. A semi-parametric generalization of the Cox proportional hazards regression model: Inference and Applications. Computational Statistics and Data Analysis. 2011 Jan 1;55(1):667-676. PMID: 21076652; PMCID: PMC2976538.

    Devarajan K, Zhou Y, Chachra N, Ebrahimi N. A supervised approach for predicting patient survival with gene expression data. Proceedings of the IEEE International Symposium in Bioinformatics and Bioengineering. 2010;2010(5521718):26-31. PMID: 20865131; PMCID: PMC2941901.

    Peri S, Devarajan K, Yang DH, Knudson AG, Balachandran S. Meta-analysis identifies NF-κB as a therapeutic target in renal cancer. PLoS One. 2013 Oct 7;8(10):e76746. PMID: 24116146; PMCID: PMC3792024.

    Anastassiadis T, Deacon SW, Devarajan K, Ma H, Peterson JR. Comprehensive assay of kinase catalytic activity reveals features of kinase inhibitor selectivity. Nature Biotechnology. 2011 Oct 30;29(11):1039-45. PMID: 22037377; PMCID: PMC3230241.

    Duong-Ly KC, Devarajan K, Liang S, Horiuchi KY, Wang Y, Ma H, Peterson JR. Kinase Inhibitor Profiling Reveals Unexpected Opportunities to Inhibit Disease-Associated Mutant Kinases. Cell Reports. 2016 Feb 2;14(4):772-781. PMID: 26776524; PMCID: PMC4740242.

    Chang WL, Jackson C, Riel S, Cooper HS, Devarajan K, Hensley HH, Zhou Y, Vanderveer LA, Nguyen MT, Clapper ML. Differential preventive activity of sulindac and atorvastatin in Apc(+/Min-FCCC) mice with or without colorectal adenomas. Gut. 2018 Jul;67(7):1290-1298. Epub 2017 Nov 9. PMID: 29122850; PMCID: PMC6031273.

    Ramakodi MP, Devarajan K, Blackman E, Gibbs D, Luce D, Deloumeaux J, Duflo S, Liu JC, Mehra R, Kulathinal RJ, Ragin CC. Integrative genomic analysis identifies ancestry-related expression quantitative trait loci on DNA polymerase β and supports the association of genetic ancestry with survival disparities in head and neck squamous cell carcinoma. Cancer. 2017 Mar 1;123(5):849-860. PMID: 27906459; PMCID: PMC5319896.

    Wang M, Devarajan K, Singal AG, Marrero JA, Dai J, Feng Z, Rinaudo JA, Srivastava S, Evans A, Hann HW, Lai Y, Yang H, Block TM, Mehta A. The Doylestown Algorithm: A Test to Improve the Performance of AFP in the Detection of Hepatocellular Carcinoma. Cancer Prevention Research (Phila). 2016 Feb;9(2):172-9. PMID: 26712941; PMCID: PMC4740237.

    Cortellino S, Xu J, Sannai M, Moore R, Caretti E, Cigliano A, Le Coz M, Devarajan K, Wessels A, Soprano D, Abramowitz LK, Bartolomei MS, Rambow F, Bassi MR, Bruno T, Fanciulli M, Renner C, Klein-Szanto AJ, Matsumoto Y, Kobi D, Davidson I, Alberti C, Larue L, Bellacosa A. Thymine DNA glycosylase is essential for active DNA demethylation by linked deamination-base excision repair. Cell. 2011 Jul 8;146(1):67-79. doi: 10.1016/j.cell.2011.06.020. Epub 2011 Jun 30.

    Astsaturov I, Ratushny V, Sukhanova A, Einarson MB, Bagnyukova T, Zhou Y, Devarajan K, Silverman JS, Tikhmyanova N, Skobeleva N, Pecherskaya A, Nasto RE, Sharma C, Jablonski SA, Serebriiskii IG, Weiner LM, Golemis EA. Synthetic lethal screen of an EGFR-centered network to improve targeted therapies. Sci Signal. 2010 Sep 21;3(140):ra67. doi: 10.1126/scisignal.2001083.

    Altomare DA, Vaslet CA, Skele KL, De Rienzo A, Devarajan K, Jhanwar SC, McClatchey AI, Kane AB, Testa JR. A mouse model recapitulating molecular features of human mesothelioma. Cancer Res. 2005 Sep 15;65(18):8090-5.

    Additional Publications

    My NCBI

    Statistical and Computing Software

    hpcNMF: a C++-based software package for generalized non-negative matrix factorization using high-performance computing clusters, available in Linux, Windows and Mac OS versions (jointly with G. Wang).

    gnmf: an R package for generalized non-negative matrix factorization (jointly with J. Maisog and G. Wang).

    The Doylestown Algorithm: A program for evaluating the performance of biomarkers in the detection of hepatocellular carcinoma (jointly with M. Wang and A. Mehta).

    replicateOutliers: an R package implementing probability-based outlier detection methods for replicated data (jointly with M. Smith) (available in October, 2019).

    survPred.I: Algorithms for supervised dimension reduction of large-scale “omics” data with censored survival outcomes under possible non-proportional hazards (jointly with L. Spirko-Burns) (available in November, 2019).

    survPred.II: Methods for feature selection in large-scale “omics” data with censored survival outcomes under possible non-proportional hazards (jointly with L. Spirko-Burns) (available in November, 2019).

    Pre-prints available online

    Spirko-Burns, L., Devarajan, K. Unified methods for variable selection in large-scale genomic studies with censored survival outcomes. Under review. COBRA pre-print series, Article 120 (June 2019).

    Spirko-Burns, L., Devarajan, K. Supervised dimension reduction for large-scale “omics” data with censored survival outcomes under possible non-proportional hazards. Under review. bioRxiv 586529; COBRA pre-print series, Article 119 (March 2019).

    Devarajan, K. (2019). Non-negative matrix factorization based on generalized dual divergence. Under review. arXiv: 1905.07034v1 [stat.ML] 16 May 2019.

    Devarajan, K., Wang, G. (2016). hpcNMF – a high performance toolbox for non-negative matrix factorization. COBRA pre-print series, Article 115 (April 2016).

    This Fox Chase professor participates in the Undergraduate Summer Research Fellowship