TEAM
Machine learning techniques in lung cancer prevalence studies
Zhuoran Wang, Fekadu Bayisa, Adekunle Ajiboye

Lung cancer is a major public health concern. This work investigates lung cancer prevalence across Virginia counties (2014--2018) using county-level aggregated data on populations aged 18+. The predictors span four domains: demographic (\% male, female, Black, White, Hispanic, and aged 65+), behavioral (smoking, binge drinking, obesity), socioeconomic (poverty rate, Social Deprivation Index, median income), and environmental (PM2.5 air quality). After preprocessing, we fit two models: a Poisson GLM with elastic-net regularization and XGBoost with a Poisson loss. XGBoost achieves a lower mean absolute error (MAE) than the GLM (5.963 vs. 6.313) and identifies smoking, PM2.5, obesity, and income as key predictors. In the GLM, prevalence is positively associated with smoking, the proportion aged 65+, and racial composition, and negatively associated with poverty rate and Hispanic proportion. These results support targeting high-risk groups and integrating behavioral and environmental data into prevention strategies.
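
For concreteness, the two models and the evaluation metric can be written out as follows. This is a sketch under stated assumptions, not necessarily the exact specification used in the study: the notation ($y_i$ for the observed outcome in county $i$, $x_i$ for its predictor vector, $N$ for the number of counties, $\lambda \ge 0$ and $\alpha \in [0,1]$ for the penalty parameters) is introduced here for illustration, and the penalty is written in the common glmnet-style parameterization. With a log link, $\log \mu_i = \eta_i = \beta_0 + x_i^\top \beta$, the elastic-net Poisson GLM minimizes the penalized negative log-likelihood
\[
(\hat\beta_0, \hat\beta) = \arg\min_{\beta_0,\, \beta} \; -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \eta_i - e^{\eta_i} \right) + \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right).
\]
Gradient boosting with a Poisson loss minimizes the same unpenalized negative log-likelihood, but with $\eta_i = F(x_i)$ given by a sum of regression trees rather than a linear predictor. In both cases the fitted mean is $\hat\mu_i = e^{\hat\eta_i}$, and the reported metric is
\[
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat\mu_i \right| .
\]
If the outcome is modeled as a case count rather than a rate, a population offset $\log n_i$ would typically be added to $\eta_i$; whether an offset was used here is not stated and is an assumption of this sketch.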

