Avoid information leakage pitfalls while doing AI in bioinformatics
- Track: Bioinformatics & Computational Biology
- Room: K.4.601
- Day: Saturday
- Start: 17:25
- End: 17:40
- Video only: k4601
AI is gaining importance in bioinformatics, with new methods and tools popping up every day. While applications of AI in bioinformatics have inherited many technological solutions from other AI-driven fields, such as image recognition or natural language processing, this particular domain has its own challenges. An alarming example is a study showing that most AI models for detecting COVID from radiographs do not rely on medically relevant pathological signals, but rather on shortcuts such as text tokens on the images (DeGrave et al., Nat Mach Intell, 2021, doi: 10.1038/s42256-021-00338-7), which stresses the importance of the data on which the AI models are trained. The data used for training biological language models is equally special: first, it is not that large compared to natural language corpora (e.g. one of the most successful protein language models, ESM-2, has been trained on only 250M sequences), and second, it is highly structured by evolution and natural selection, and thus has a relatively low intrinsic dimension.
In my talk, I will speak about the consequences of this underlying structure of the data for the measured performance of models trained on it (spoiler alert: it is terribly overestimated). The reason for this is information leakage, also called data leakage: when the test data closely resemble the training data, the model can memorize irrelevant features that are highly correlated with the target variable instead of learning biologically meaningful properties that transfer to out-of-distribution data. I will present our own checklist (see our paper Bernett et al., Nat Methods, 2024, doi: 10.1038/s41592-024-02362-y) and a solution (https://github.com/kalininalab/DataSAIL, Joeres et al., Nat Commun, 2025, doi: 10.1038/s41467-025-58606-8) for avoiding the information leakage pitfall. I will discuss examples and applications from protein function prediction and drug discovery.
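To make the leakage idea concrete, here is a minimal sketch of a similarity-aware train/test split in plain Python. It is not DataSAIL's algorithm or API: all function names are hypothetical, the toy sequences are made up, and Jaccard similarity over 3-mers is only a crude stand-in for a proper sequence identity measure. The point it illustrates is that whole clusters of similar sequences are kept on one side of the split, so no test sequence has a close relative in the training set.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sequences(seqs, threshold=0.4, k=3):
    """Single-linkage clustering: any pair with similarity >= threshold
    ends up in the same cluster (implemented with union-find)."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    profiles = [kmers(s, k) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(profiles[i], profiles[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def cluster_split(seqs, test_fraction=0.34, threshold=0.4, k=3):
    """Assign whole clusters to train or test; largest clusters first.
    A cluster goes to test only if it still fits the test budget."""
    clusters = sorted(cluster_sequences(seqs, threshold, k), key=len, reverse=True)
    budget = test_fraction * len(seqs)
    train, test = [], []
    for cluster in clusters:
        if len(test) + len(cluster) <= budget:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


def max_cross_similarity(seqs, train, test, k=3):
    """Highest similarity between any training and any test sequence."""
    profiles = [kmers(s, k) for s in seqs]
    return max((jaccard(profiles[i], profiles[j]) for i in train for j in test),
               default=0.0)


# Toy data (invented): two "families" of near-identical sequences plus a singleton.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR",
        "GSHMLEDPVA", "GSHMLEDPVG", "WWPPCCNNDD"]

# Naive split by position: a near-duplicate pair is torn across train and test.
naive_train, naive_test = [0, 1, 2, 3], [4, 5]
naive_leak = max_cross_similarity(seqs, naive_train, naive_test)

# Cluster-based split: similar sequences stay on the same side.
train, test = cluster_split(seqs)
cluster_leak = max_cross_similarity(seqs, train, test)

print(f"naive split:   max train/test similarity = {naive_leak:.2f}")
print(f"cluster split: max train/test similarity = {cluster_leak:.2f}")
```

By construction, the cluster-based split guarantees that every train/test pair is below the similarity threshold, whereas the naive split offers no such guarantee. A real pipeline would replace the k-mer similarity with an alignment-based or structure-based measure and would also balance class labels and split sizes, which this sketch ignores.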
Speakers
Olga Kalinina