CAMDA 2022 – Literature AI Data and Leaderboards
The CAMDA Challenge has ended, but the leaderboards remain open!
Drug-Induced Liver Injury – Literature Challenge 2022
Introduction
Drug-Induced Liver Injury (DILI) is an adverse effect that severely limits the applicability of drugs – no matter how effective and versatile the drug, it is only useful as long as we can contain the damage it causes. The first step in containing the problem is recognizing it, which is not guaranteed. Even though DILI itself is a widely known problem and drugs undergo rigorous testing during clinical trials, there are too many uncontrollable factors (such as unknown interactions) to be certain of a drug’s safety.
The main way to track drug safety after clinical trials is through scientific publications – either as a case study or a more comprehensive analysis. However, this type of data is fraught with problems. Because the publications are in the form of free text, it is extremely difficult to reliably extract information from them: in the vast majority of cases, the only available method is to have people curate the literature. Unfortunately, this is not an optimal solution – manual curation has very limited throughput and is highly error-prone.
This year’s CAMDA Challenge brings up yet another critical question: Can humans be relieved of this task?
This problem was already analysed by CAMDA 2021 participants. However, this time we offer you the opportunity to assess the viability of your algorithm in depth, on a scale that has never been published before.
Datasets
We have collected a dataset of over 300,000 publications. Positive samples are DILI-related publications that were screened by liver toxicity experts. Negative data are miscellaneous publications that are not related to DILI and were selected by sampling PubMed and filtering it through MeSH terms, vocabulary filters, and language model filtering. The data has been split into 5 distinct datasets – Training, T1, T2, T3, and T4.
- Training: the first part of the dataset will help you develop your algorithms. It consists of 14,000 publications, 7,000 of which are related to DILI and serve as positive examples. The remaining 7,000 are negatives: in addition to a random population sample, we selected a subset that retains the general characteristics of the DILI articles while still functioning as negative examples.
- Test – T1, T2, T3: throughout the CAMDA Challenge you will be able to use our automated leaderboard to test your algorithm’s performance on three challenges with increasing class imbalance, and thus increasing difficulty.
- Validation – T1, T2, T3, T4: at the end of the competition, you will receive independent performance ratings based on the withheld validation data, again delivered through an automated leaderboard system. In addition to the hidden parts of the T1-T3 datasets, we have prepared T4 – a domain transfer challenge that will allow you to assess the generalizability of the “DILI relevance detection” models.
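To see why increasing class imbalance makes the test sets harder, it can help to compute the precision a fixed classifier would be expected to achieve as positives become rarer. The sketch below is only an illustration: the sensitivity and specificity of 0.95 and the positive fractions are assumed values, not the actual characteristics of T1-T3, and `precision_at_prevalence` is a hypothetical helper name.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected precision of a classifier on data with the given positive fraction."""
    true_pos = sensitivity * prevalence                    # positives correctly flagged
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # negatives wrongly flagged
    return true_pos / (true_pos + false_pos)


# Assumed, illustrative positive fractions (not the real T1-T3 ratios).
for prevalence in (0.5, 0.1, 0.01):
    value = precision_at_prevalence(0.95, 0.95, prevalence)
    print(f"{prevalence:.0%} positives -> expected precision {value:.3f}")
```

The same classifier loses precision as the share of negatives grows, which is why a model tuned on the balanced training set may need a different decision threshold on the more imbalanced challenges.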
It is imperative to use the released data for both training and (nested) cross-validation in order to prevent overfitting. Feel free to use any external data you can find, but keep in mind that the CAMDA Challenge is not about raw scores: it is about the ingenuity and generalizability of your algorithms.
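For readers unfamiliar with nested cross-validation, here is a minimal sketch of its structure in plain Python. It is not the challenge's evaluation code. Everything in it is an assumption made for illustration: each publication is reduced to a single hypothetical relevance score, the "model" is a threshold placed between the class means, and `alpha` is the only hyperparameter. The point is the two loops: the inner loop selects `alpha` using training rows only, and the outer loop scores that choice on rows the selection never saw.

```python
import random
from statistics import mean


def k_folds(rows, k, seed):
    """Shuffle the row indices and deal them into k near-equal folds."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def fit_threshold(scores, labels, rows, alpha):
    """Toy 'training': place a threshold between the class means of the given rows."""
    pos = mean(scores[i] for i in rows if labels[i] == 1)
    neg = mean(scores[i] for i in rows if labels[i] == 0)
    return neg + alpha * (pos - neg)


def accuracy(scores, labels, rows, threshold):
    """Fraction of rows where 'score >= threshold' agrees with the label."""
    return mean(int((scores[i] >= threshold) == (labels[i] == 1)) for i in rows)


def cv_accuracy(scores, labels, rows, alpha, k, seed):
    """Plain k-fold cross-validation of one alpha value, restricted to `rows`."""
    results = []
    for held_out in k_folds(rows, k, seed):
        held = set(held_out)
        train = [i for i in rows if i not in held]
        threshold = fit_threshold(scores, labels, train, alpha)
        results.append(accuracy(scores, labels, held_out, threshold))
    return mean(results)


def nested_cv(scores, labels, alphas, outer_k=5, inner_k=3, seed=0):
    """Outer loop estimates performance; inner loop picks alpha on training rows only."""
    rows = range(len(scores))
    outer_results = []
    for test_rows in k_folds(rows, outer_k, seed):
        test = set(test_rows)
        train_rows = [i for i in rows if i not in test]
        # Hyperparameter selection never touches the outer test fold.
        best_alpha = max(
            alphas,
            key=lambda a: cv_accuracy(scores, labels, train_rows, a, inner_k, seed + 1),
        )
        threshold = fit_threshold(scores, labels, train_rows, best_alpha)
        outer_results.append(accuracy(scores, labels, test_rows, threshold))
    return mean(outer_results)


# Toy, made-up data: alternating labels with well-separated scores.
toy_labels = [i % 2 for i in range(40)]
toy_scores = [2.0 * label + 0.01 * i for i, label in enumerate(toy_labels)]
print(nested_cv(toy_scores, toy_labels, alphas=[0.25, 0.5, 0.75]))
```

In practice the threshold model would be replaced by a real text classifier, and the splits should be stratified so that every fold contains both classes. The separation between selection and evaluation stays the same.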
Timeline
- 09.03.2022 – Release of training dataset and T1-T3 test leaderboard
- 14.04.2022 – Release of T1-T4 validation leaderboard
- 20.05.2022 – Abstract submission deadline
- 27.05.2022 – Leaderboards are reopened
Submission
Please submit your research by 20 May, and clearly indicate the presenting author. Extended abstracts are limited to 3-5 pages. Full submission guidelines can be found on the main CAMDA webpage.
In case of any problems or questions, feel free to contact:
Witek Wydmanski (witold.wydmanski(at)uj(.)edu(.)pl) and Alexander Aldoshin (alexander.aldoshin(at)boku(.)ac(.)at).