The CAMDA Contest Challenges

Annual International Conference on Critical Assessment of Massive Data Analysis
Montreal, Canada | July 15-16, 2024

For CAMDA 2024, we present:

The Synthetic Clinical Health Records Challenge provides a rich set of highly realistic Electronic Health Records (EHR) tracing the diagnosis trajectories of diabetic patients, created with dual-adversarial auto-encoders trained on data from 1.2 million real patients in the Population Health Database of the Andalusian Ministry of Health. Predict relevant diabetes endpoints like blindness or cardiopathy from past diagnosis trajectories!
The Anti-Microbial Resistance Prediction Challenge features sequences of clinical isolates. Predict resistance genes/markers and identify resistant bacteria!
The Gut Microbiome based Health Index Challenge features hundreds of WMS-based taxonomic and functional profiles of healthy and unhealthy individuals. Take advantage of the Theatre of Activity concept and compete with existing taxonomy-based metrics!

CAMDA encourages an open contest, where all analyses of the contest data sets are of interest, not limited to the questions suggested here. There is an online forum for the free discussion of the contest data sets and their analysis, in which you are encouraged to participate.

We look forward to a lively contest!

The Synthetic Clinical Health Records

Although data protection is necessary to safeguard patients’ privacy, the resulting regulations are also an obstacle to biomedical research. An interesting alternative is the use of synthetic patients. However, conventional synthetic patients are of little use for discovery because they are built from known data distributions. Generative Adversarial Networks (GANs) and related developments have emerged as powerful tools for generating synthetic data in a way that captures relationships between the variables, even if such relationships were previously unknown. GANs became popular for generating highly realistic synthetic pictures but have since been applied in many fields, including the generation of synthetic patients with applications such as medGAN and others.

Two datasets of synthetic patients have been created for this challenge, starting with CAMDA 2023. Both datasets were generated from a real cohort retrieved from the Population Health Database (Base Poblacional de Salud, BPS) at the Andalusian Health System (Spain), using a Dual Adversarial AutoEncoder (DAAE) approach:

The first dataset (1st generation) was originally created for CAMDA 2023. It includes a list of pathologies for 999,936 synthetic patients ordered by visits, which was generated from a total of 979,308 real diabetes patients. Additionally, all visits feature an age-range (decades) label. This dataset is still available and usable for this challenge, either by itself or combined with the second generation. This first generation can be downloaded here.
The second dataset (2nd generation) includes a new list of pathologies for 999,936 synthetic patients, labeled with the age resolution increased to years. This dataset was generated from an extended cohort of 984,414 real diabetes patients. It is provided both as it was generated (raw version) and after minor pre-processing to clean up inconsistencies (pre-processed version).

Two challenges are suggested on both datasets, although any other original analysis you may think of is also welcome:

1) Finding strong relationships among diabetes-associated pathologies that allow a pathology to be predicted before it is diagnosed. Well-known pathological consequences of diabetes that can be considered relevant endpoints to predict include: a) Retinopathy (Code “703”), b) Chronic kidney disease (Code “1401”), c) Ischemic heart disease (Code “910”), d) Amputations (Code “1999”)

2) Another proposed challenge is the prediction of disease trajectories in diabetes patients (see for example: Jensen et al. Nat Commun. 2014)
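
As a starting point for the first suggested task, the sketch below shows one way to turn a visit-ordered list of pathology codes into a (history, label) pair for a chosen endpoint. The input layout (a list of visits, each a list of code strings), the function name split_at_endpoint and the placeholder codes "X1" to "X4" are assumptions made for illustration; only the four endpoint codes come from the challenge description, and the actual file format on the Download Site may differ.

```python
from typing import List, Tuple

# Endpoint codes quoted in the challenge description.
ENDPOINTS = {
    "703": "retinopathy",
    "1401": "chronic kidney disease",
    "910": "ischemic heart disease",
    "1999": "amputations",
}


def split_at_endpoint(
    visits: List[List[str]], endpoint: str
) -> Tuple[List[List[str]], bool]:
    """Return (history, label) for one patient.

    history: the visits strictly before the first visit that contains
             the endpoint code (all visits if the endpoint never occurs).
    label:   True if the endpoint code occurs in any visit.
    """
    for i, codes in enumerate(visits):
        if endpoint in codes:
            return visits[:i], True
    return visits, False


# Hypothetical patient: "X1".."X4" are placeholder codes, not real ones.
patient = [["X1"], ["X2", "X3"], ["703", "X4"], ["1401"]]
history, label = split_at_endpoint(patient, "703")
# history holds the two visits before retinopathy was recorded; label is True.
```

Cutting the history before the endpoint visit keeps information recorded at or after diagnosis out of the model input, which matters when the goal is to predict a pathology before it is diagnosed.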

Prediction proposals submitted together with the trained model and the code required to run it can be tested on the real dataset by the organisers and take part in a collective publication.

Please sign up to announcements from the forum for alerts.

Please read and accept the data download agreement for access to the Download Site.

We thank the Institute of Advanced Research in Artificial Intelligence (IARAI) for its support in the preparation of this Challenge.

Anti-Microbial Resistance Prediction

Antimicrobial resistance is one of the biggest challenges facing modern medicine. Because the management of COVID-19 became increasingly dependent on pharmacological interventions, there is a greater risk of accelerating the evolution and spread of antimicrobial resistance [Afshinnekoo et al. 2021]. A study in a tertiary hospital environment revealed concerning colonisation patterns of microbes over extended periods [Chng et al. 2020]. It also highlighted the diversity of antimicrobial resistance gene reservoirs in hospitals that could facilitate the emergence and transmission of new modes of antibiotic resistance and AMR burden in cities in general [Danko et al. 2021]. This year we would like the CAMDA community to look into AMR-related challenges.

This challenge consists of developing and testing models for predicting antimicrobial resistance (AMR) in 8 different bacterial pathogens and 2 drugs from the WHO’s Priority Pathogen List. You will be provided with a training dataset taken from public databases, about 6,000 isolates in total, containing WGS accession numbers for each isolate (for about a quarter of those you get accession numbers for assemblies rather than short reads), as well as the antibiotic susceptibility phenotype. The developed models will then be tested on data from another collection of the same pathogens, over 1,500 isolates in total, for which you will be provided with the genotypes, but not the phenotypes.
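
One common baseline for turning a genome sequence into model input is a fixed-length k-mer frequency vector. The sketch below is only an illustration of that idea, not the organisers' method: the function name kmer_profile, the choice of k and the toy sequence are assumptions, and downloading and parsing the sequences behind the accession numbers is left out.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_profile(seq: str, k: int = 3) -> List[float]:
    """Relative frequencies of all 4**k DNA k-mers, in lexicographic order.

    K-mers containing characters other than A, C, G, T (for example N)
    are ignored, so the returned values sum to 1 whenever at least one
    valid k-mer is present.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


# Toy sequence for illustration only.
profile = kmer_profile("ACGTACGT", k=2)
# profile has 16 entries, one per possible 2-mer.
```

Vectors like this can be fed to any standard classifier together with the susceptibility phenotype from the training set; gene-based or alignment-based features are equally valid alternatives.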

Each team will be able to submit predictions up to 3 times for accuracy evaluation on the withheld phenotypes, and the final submission from each team will qualify as the CAMDA challenge submission; the teams with the best submissions will be invited to present their methods at the CAMDA conference at ISMB.

NEW!!! – 01.05.2024

We now provide an additional testing data set; the challenge is to predict the antibiotic susceptibility phenotype.

Can you predict antibiotic susceptibility phenotype?

Submit your prediction for verification.

Please sign up to announcements from the forum for alerts.

Please read and accept the data download agreement for access to the Download Site.

The Gut Microbiome based Health Index

The onset of diseases linked to microbiome health, such as obesity or Inflammatory Bowel Disease (IBD), is continuously on the rise (M’koma, 2013; Hong et al., 2019). Because the gut microbiome is strongly linked to the functioning of the human body, the ability to evaluate one’s health status based on a stool sample is of high clinical value. Stool is becoming a reasonable alternative to other diagnostic tools: it can be collected non-invasively and frequently, and is now becoming affordable.

There are a number of approaches to evaluate microbiome health from stool. Alpha diversity is a frequent choice as it is closely related to dysbiosis, and microbiome richness is described as a key component of microbiome health and robustness (Li et al., 2022, Gong et al., 2016). The most robust indices to date are the Gut Microbiome Health Index, or the GMHI (Gupta et al., 2020), and the hiPCA (Zhu et al., 2023). Those indices rely on the presence of beneficial or harmful bacteria, and classify samples based on their relative ratios.
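
For reference, the alpha diversity baseline can be computed directly from a taxonomic profile. The sketch below implements the Shannon entropy H = -sum(p_i * ln p_i) over relative abundances; the function name shannon_entropy and the toy abundance vectors are assumptions for illustration, and the precomputed values shipped with the challenge data may use a different logarithm base or taxonomic level.

```python
import math
from typing import Sequence


def shannon_entropy(abundances: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a vector of non-negative abundances.

    The values are normalised to relative abundances first; taxa with
    zero abundance contribute nothing to the sum.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must contain at least one positive value")
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)


# Toy profiles for illustration only.
even = shannon_entropy([1, 1, 1, 1])    # four equally abundant taxa: ln(4)
dominated = shannon_entropy([5, 0, 0])  # a single taxon: 0
```

An even community maximises the entropy for a given number of taxa, while a community dominated by one taxon scores zero.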

However, a recently revisited definition of the microbiome emphasizes the importance of not just the microbiota (a community of microorganisms), but the whole Theatre of Activity, ToA (Berg et al., 2020). This means that the microbiome functions, and the interactions of the microbiota, are a more accurate representation of microbiome state.

In this challenge, we provide a data set with 613 samples originating from the Human Microbiome Project 2 and two American Gut Project cohorts. We include precomputed taxonomic profiles with health predictions made by existing health indices (Shannon entropy, GMHI and hiPCA). We also extend this by providing functional profiles.

We ask the CAMDA community to develop a gut microbiome-based health index that will outperform the existing ones, ideally by taking advantage of the Theatre of Activity concept.

NEW!!! – 24.04.2024

We now provide an additional data set and the challenge is to explore it and split 35 patients into two groups: healthy controls and COVID-19 patients.

There are two samples for each patient: a) a sample from the day of admission, b) the last sample from the ward stay (in the case of the healthy controls, two samples were taken a few days apart).

Can you identify the healthy controls?

Can you say anything interesting about the COVID-19 group?
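
Since every patient has two samples, one simple way to start exploring the additional data set is to measure how much each patient's profile changes between the first and the last sample. The sketch below uses the Bray-Curtis dissimilarity for that; the profile layout (a dict mapping taxon name to abundance), the function names and the toy values are assumptions for illustration, and nothing here implies that this quantity alone separates the two groups.

```python
from typing import Dict, Tuple

Profile = Dict[str, float]


def bray_curtis(p: Profile, q: Profile) -> float:
    """Bray-Curtis dissimilarity between two abundance profiles (0 to 1)."""
    taxa = set(p) | set(q)
    num = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)
    den = sum(p.get(t, 0.0) + q.get(t, 0.0) for t in taxa)
    return num / den if den else 0.0


def per_patient_shift(samples: Dict[str, Tuple[Profile, Profile]]) -> Dict[str, float]:
    """Dissimilarity between the first and last sample of each patient."""
    return {pid: bray_curtis(first, last) for pid, (first, last) in samples.items()}


# Hypothetical patients and taxa, for illustration only.
samples = {
    "P01": ({"taxon_a": 0.5, "taxon_b": 0.5}, {"taxon_a": 0.5, "taxon_b": 0.5}),
    "P02": ({"taxon_a": 0.5, "taxon_b": 0.5}, {"taxon_a": 0.5, "taxon_c": 0.5}),
}
shifts = per_patient_shift(samples)
# P01 did not change (0.0); P02 replaced half of its community (0.5).
```

The resulting per-patient values can be inspected alongside other features, such as diversity or functional profiles, before attempting any grouping.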

Please sign up to announcements from the forum for alerts.

Please read and accept the data download agreement for access to the Download Site.