Researchers Demonstrate Capabilities of Machine Learning to Identify Cancer Risk Factors

March 11, 2021


PHILADELPHIA (March 11, 2021)—Scientists have learned that the socioeconomic circumstances of a neighborhood, including housing and employment, are related to cancer risk and outcomes. But to date, only a handful of measures, such as poverty, have been analyzed.

A new study by Fox Chase Cancer Center scientists has shown how machine learning can mine large volumes of social and environmental data to potentially identify new neighborhood risk factors related to cancer. Machine learning uses computer algorithms to find patterns in large volumes of data.

“We have a general idea that living in neighborhoods with lower socioeconomic conditions is often associated with poor health outcomes, but in terms of how to measure that, there isn’t a consensus,” said Elizabeth Handorf, PhD, an associate professor in the Biostatistics and Bioinformatics Facility at Fox Chase. “We’re trying to find an objective way to determine what the most helpful measures are to identify people who are at higher risk for cancer because of their environment.”


Handorf co-authored the study with Shannon Lynch, PhD, MPH, an assistant professor in the Cancer Prevention and Control Program. The research team tested several popular machine learning methods to determine which worked best for analyzing the links between socioeconomic data and cancer risk.

They linked prostate cancer patients from the Pennsylvania Cancer Registry to the socioeconomic circumstances of the neighborhoods where they lived. The comprehensive study included more than 14,000 neighborhood variables related to housing quality, education level, median household income, marital status, and renting versus owning a home. The goal was to identify which variables were associated with a diagnosis of advanced prostate cancer.

Census data has previously been used to study disease risk, but data sets are so large that scientists rarely use all possible variables. “In the data set we worked with, there are tens of thousands of variables to investigate,” Handorf said. “Prior studies tended to only select a handful of these variables based on their assumptions about how social factors might be impacting health, and different research studies would choose different variables. We were missing an objective way to select the best factors.”

Of the models they tested, Handorf and her colleagues found that a penalized regression method known as a “lasso” model worked best at identifying key variables and eliminating false positives. This type of regression analysis applies a penalty to the estimated effect of each variable, which shrinks weak effects toward zero and drops those variables from the model. As a result, it automatically selects the variables that best predict an outcome. “The biggest takeaway was how well the lasso-type regressions worked,” she said.
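For readers curious how a lasso-type penalty selects variables, the following is a minimal sketch, not the study's actual analysis. It assumes a simple linear outcome and made-up data with 20 candidate variables, only two of which truly matter. The study's outcome was a diagnosis, so its models may have differed. The function names, the penalty value of 0.3, and the data are all hypothetical and chosen only for illustration.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; small values become exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100):
    """Minimize (1/2n) * sum((y - Xb)^2) + penalty * sum(|b|).

    X is a list of rows; y is a list of outcomes. Illustrative only:
    no intercept, no standardization, fixed number of sweeps.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - Xb while every coefficient is 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Association of variable j with the residual, adding back
            # variable j's own current contribution.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n
            rho += col_sq[j] * coef[j]
            new_value = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_value - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                coef[j] = new_value
    return coef


# Made-up data (assumption): 200 observations, 20 candidate variables.
# Only variables 0 and 3 truly affect the outcome.
rng = random.Random(0)
n_obs, n_vars = 200, 20
X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_obs)]
y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 0.5) for row in X]

coef = lasso_coordinate_descent(X, y, penalty=0.3)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print("Variables kept by the lasso:", selected)
```

Raising the hypothetical `penalty` value keeps fewer variables, and lowering it keeps more. In practice, analysts typically choose the penalty with a procedure such as cross-validation rather than fixing it by hand.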

Handorf added that the findings demonstrated that machine learning can be used to identify which variables in a large set of socio-environmental data have a measurable effect on health outcomes in cancer and other diseases. She said she hopes the study will help scientists select variables more objectively. “Considering the social environment is important, but you have to think about how to measure that well.”