M2 internship: LLM-based prediction of metabolite-level drug responses in cancer cell lines from biomedical literature
Anti-cancer drugs often induce strong and specific rewiring of cellular metabolism, which is reflected in changes in metabolite abundances measured by metabolomics. However, experimental datasets are fragmented across thousands of heterogeneous publications, each using different cell lines, drugs, doses and culture conditions, making it difficult to obtain a global, structured view of “drug → metabolite response” relationships.
Recent advances in large language models (LLMs), dense retrieval and biomedical NLP (e.g. LitSense 2.0 for sentence/paragraph-level PubMed/PMC search, and MedCPT for contrastive pre-trained biomedical retrieval) show that it is now realistic to mine fine-grained evidence directly from full-text articles at scale [1]. In parallel, domain-specific language models such as BioBERT or PubMedBERT have demonstrated strong performance on entity recognition and relation extraction in biomedical text [2, 3].
This internship is positioned at the intersection of biomedical NLP, LLMs and metabolomics. The goal is to automatically construct, using LLM-based methods, a structured knowledge base of metabolite-level responses to anti-cancer drugs in cell lines (including drug dose and other experimental conditions), and to develop LLM-based predictors that can infer which metabolite is likely to go up or down under a specified treatment context (multi-label / multi-output classification). The project will build on existing in-house pipelines for entity recognition and normalisation.
Work for the internship
The intern will first collect and curate a corpus of biomedical articles by programmatically querying the literature (e.g. PubMed/PMC) for full-text papers that report metabolite-level changes in cancer cell lines treated with specific drugs and doses. From this raw corpus, they will then build a structured knowledge base of metabolite responses by designing an information extraction workflow that combines the LLM-based entity recognition and entity normalisation modules already available in the team (for metabolites, drugs and cell lines) with relation extraction.
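As a minimal sketch of the corpus-collection step, the snippet below builds an NCBI E-utilities search URL pairing a drug with a cell line; the query template, function name and search fields are illustrative assumptions, not the team's actual query design:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_esearch_url(drug: str, cell_line: str, retmax: int = 100) -> str:
    """Build a PubMed esearch URL pairing a drug and a cell line with a
    metabolomics keyword (query template is an illustrative assumption)."""
    term = (
        f'"{drug}"[Title/Abstract] AND "{cell_line}"[Title/Abstract] '
        f"AND metabolom*[Title/Abstract]"
    )
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

# Example: candidate papers on cisplatin-treated A549 cells
url = build_esearch_url("cisplatin", "A549")
```

The PMIDs returned by such a query would then be passed to efetch (or the PMC full-text services) to retrieve the articles themselves.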
On top of this literature-derived knowledge base, the intern will then develop and evaluate prediction methods that use LLMs to infer, for a given treatment context (cell line, drug, dose, and possibly time and culture conditions), which metabolites are likely to go up or down, and with what confidence. This will involve choosing suitable representations of the experimental context and metabolite panel, implementing LLM-based classifiers through prompting or lightweight fine-tuning, and benchmarking their performance against fine-tuned biomedical language models and simple non-LLM baselines derived from aggregated statistics in the knowledge base. Depending on progress, the intern will also explore prompt-engineering strategies to obtain uncertainty-aware outputs (e.g. “highly likely up at low doses”, “unlikely to change given current literature”).
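To make the prompting-based multi-label setup concrete, here is a hypothetical sketch: the prompt wording, JSON answer schema and label set below are assumptions for illustration, and the LLM call itself is left out:

```python
import json

# Allowed labels for each metabolite (an illustrative assumption).
LABELS = {"up", "down", "unchanged"}

def build_prompt(cell_line: str, drug: str, dose: str, metabolites: list) -> str:
    """Format a treatment context as a prompt asking for one label per metabolite."""
    panel = ", ".join(metabolites)
    return (
        f"Cell line: {cell_line}\nDrug: {drug}\nDose: {dose}\n"
        f"For each metabolite in [{panel}], predict whether its abundance "
        f'goes "up", "down" or stays "unchanged" under this treatment. '
        f"Answer as a JSON object mapping each metabolite to its label."
    )

def parse_response(text: str, metabolites: list) -> dict:
    """Parse the model's JSON answer, keeping only valid metabolite/label pairs."""
    raw = json.loads(text)
    return {m: raw[m] for m in metabolites if raw.get(m) in LABELS}

prompt = build_prompt("A549", "cisplatin", "10 uM", ["lactate", "glutamine"])
# A model reply would be parsed like this:
preds = parse_response('{"lactate": "up", "glutamine": "down"}',
                       ["lactate", "glutamine"])
```

A non-LLM baseline in the same output space could simply return, for each (drug, metabolite) pair, the majority direction aggregated over the knowledge base.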
Candidate Profile
- M2 / engineering school student in Computer Science, Applied Mathematics, AI or bioinformatics.
- Solid Python skills, including familiarity with deep learning frameworks (PyTorch, TensorFlow) and/or LLM APIs (Hugging Face, OpenAI).
- Knowledge of machine learning.
- Prior knowledge of NLP / LLMs is desirable.
- Prior exposure to biomedical text mining or omics data is a plus.
- Background in molecular/cellular biology or metabolism is a plus; motivation to work in an interdisciplinary setting is essential.
The internship will provide hands-on experience at the intersection of biomedical NLP, LLMs, and computational metabolomics.
How to apply
Send your CV and a motivation letter to macha.nikolski@u-bordeaux.fr, mikael.georges@ibgc.cnrs.fr and gauthier.delrot@u-bordeaux.fr
References
[1] Yeganova, Lana, et al. "LitSense 2.0: AI-powered biomedical information retrieval with sentence and passage level knowledge discovery." Nucleic Acids Research (2025): gkaf417.
[2] Lee, Jinhyuk, et al. "BioBERT: a pre-trained biomedical language representation model for biomedical text mining." Bioinformatics 36.4 (2020): 1234-1240.
[3] Gu, Yu, et al. "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing." ACM Transactions on Computing for Healthcare 3.1 (2021): 1-23. https://doi.org/10.1145/3458754