Pre-CV Screening Inflates Drug Response AUCs by ≥0.10

Today's Overview

  • Pre-CV Feature Screening Creates Widespread Leakage in Cancer Drug Response Models: screening features before cross-validation makes reported MSE look 16.6% better on average across 265 cancer drugs.

Also Worth Noting

02
Benchmarking pKa prediction on 90,000 public data points | General AIDD

Seven pKa prediction algorithms (three commercial, four open-source ML) were benchmarked on a curated 90,000-entry public data set from 31,000 molecules to quantify accuracy across charge states and polyprotic species. link (Chem)

Today's Observation

Cancer drug response prediction is a canonical benchmark for multi-omics machine learning, yet a widespread data-leakage pitfall undercuts its utility. A survey of 32 recent studies finds that 72 % perform feature selection before cross-validation, a practice that makes reported mean-squared error look 16.6 % better on average across 265 compounds in GDSC and CCLE. Leaky pipelines select five times more genes than leakage-corrected ones, and the two gene sets overlap by <20 %, indicating that the inflated scores reflect sample-specific noise rather than generalizable signal. The gains reported over plain elastic-net baselines disappear once leakage is removed, implying that many “state-of-the-art” improvements are illusory.

Practically, any project that screens thousands of molecular features must nest selection inside each CV fold or use an external validation cohort. The same issue applies to other omics-assisted tasks where pre-filtering is tempting, such as predicting CRISPR essentiality or patient outcome. Until journals and competitions enforce stricter code inspection, practitioners should treat published MSE or Pearson r values as optimistic estimates and retrain models with properly nested CV before deploying biomarkers or moving into expensive in vitro confirmation.
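To make the difference between pre-CV screening and fold-nested screening concrete, here is a minimal sketch using scikit-learn. It is an illustration only, not the pipeline from the surveyed studies: the data are synthetic pure noise (so no feature is truly predictive), and the sample count, feature count, number of kept features, univariate F-test filter, and elastic-net penalty are all assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for an omics matrix (assumed sizes, not from any study):
# 60 "cell lines" x 2000 "genes", with a response that is pure noise.
rng = np.random.default_rng(0)
n_samples, n_features, k = 60, 2000, 20
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)

cv = KFold(n_splits=5, shuffle=True, random_state=0)


def cv_mse(model, features, target):
    """Mean cross-validated MSE (scikit-learn reports it negated)."""
    scores = cross_val_score(
        model, features, target, cv=cv, scoring="neg_mean_squared_error"
    )
    return -scores.mean()


# Leaky: the filter sees every sample, including those later used as test folds.
X_screened = SelectKBest(f_regression, k=k).fit_transform(X, y)
leaky_mse = cv_mse(ElasticNet(alpha=0.1), X_screened, y)

# Nested: the filter is a pipeline step, so it is refit on each training fold only.
nested_model = make_pipeline(SelectKBest(f_regression, k=k), ElasticNet(alpha=0.1))
nested_mse = cv_mse(nested_model, X, y)

print(f"leaky  CV MSE: {leaky_mse:.3f}")
print(f"nested CV MSE: {nested_mse:.3f}")
```

Because the response here contains no signal, any apparent advantage of the leaky estimate over the nested one comes from the screening step having seen the test folds. Wrapping the filter in a pipeline is one common way to keep selection inside each fold; an external validation cohort remains the stronger check.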

The above is personal commentary for reference only. Refer to the original papers for authoritative content.