Annual International Conference on Critical Assessment of Massive Data Analysis
Liverpool, U.K. | July 23-24, 2025

The CAMDA Contest Challenges

For CAMDA 2025, we present:

  • The Health Privacy Challenge presents an interactive platform for achieving trust and robustness in the generation of privacy-preserving synthetic gene expression datasets. Join us either as a Blue Team defending or as a Red Team attacking!
  • The Synthetic Clinical Health Records Challenge provides a rich set of highly realistic Electronic Health Records (EHRs) tracing the diagnosis trajectories of diabetic patients, created with dual-adversarial auto-encoders trained on data from 1.2 million real patients in the Population Health Database of the Andalusian Ministry of Health. Predict relevant diabetes endpoints like blindness or cardiopathy from past diagnosis trajectories!
  • The Gut Microbiome Health Index Challenge features hundreds of WMS-based taxonomic and functional profiles of healthy and unhealthy individuals. Take advantage of the Theatre of Activity concept and explore microbiome synergies to compete with the best taxonomy-based metrics!
  • The Anti-Microbial Resistance Prediction Challenge features thousands of clinical isolate sequences. Predict resistance genes and markers to identify resistant bacteria!

CAMDA encourages an open contest, where all analyses of the contest data sets are of interest, not limited to the questions suggested here. There is an online forum for the free discussion of the contest data sets and their analysis, in which you are encouraged to participate.

We look forward to a lively contest!

The Health Privacy Challenge

📢 Computational health research is centered on sensitive health-care data, including genomic, medical and phenotypic data. Progress in the field hinges on the ability to access these data to advance health care using analytical innovations, while simultaneously ensuring that sensitive information of data subjects is not disclosed. 

Synthetic data generation is a widely adopted approach to privacy preservation: it generates data points that are consistent with the distribution of the real data. Generative models, such as Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN), can be used for this purpose, making it possible to generate synthetic data that maintains the utility of the original data while protecting privacy. However, the effectiveness of synthetic data generators in biology, and the extent to which they can protect against adversarial attacks such as membership inference, remain underexplored.

The Health Privacy Challenge, which is organized in the context of the European Lighthouse on Safe and Secure AI (ELSA, https://elsa-ai.eu), invites participants to advance this field by contributing in a “Blue Team (🫐) vs Red Team (🍅)” scheme:

  • The blue teams develop privacy-preserving generative methods that produce synthetic gene expression datasets balancing biological utility and privacy.
  • The red teams assess the privacy risks that these generative methods might pose by developing novel and effective membership inference attack (MIA) techniques.

Both teams also explore the robustness and reliability of evaluation metrics in the context of privacy preservation in synthetic biological datasets.
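To make the blue-team task concrete, here is a minimal sketch of a generative method, assuming the data arrive as a list of per-donor expression rows with one class label each. It fits an independent Gaussian per gene and per class and samples from it. This is an illustration only, not one of the challenge's baseline methods; the function names, the `extra_noise` parameter, and the toy values are all assumptions, and the added noise carries no formal privacy guarantee.

```python
import random
import statistics


def fit_class_gaussians(rows, labels):
    """Estimate a per-gene mean and standard deviation separately for each class."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    params = {}
    for label, members in grouped.items():
        columns = list(zip(*members))  # transpose: one tuple of values per gene
        means = [statistics.fmean(column) for column in columns]
        stds = [statistics.pstdev(column) for column in columns]
        params[label] = (means, stds, len(members))
    return params


def sample_synthetic(params, extra_noise=0.0, seed=0):
    """Draw as many synthetic donors per class as were seen during fitting."""
    rng = random.Random(seed)
    synthetic_rows, synthetic_labels = [], []
    for label in sorted(params):
        means, stds, count = params[label]
        for _ in range(count):
            # extra_noise widens each gene's distribution (hypothetical knob).
            row = [rng.gauss(mean, std + extra_noise) for mean, std in zip(means, stds)]
            synthetic_rows.append(row)
            synthetic_labels.append(label)
    return synthetic_rows, synthetic_labels


# Toy stand-in for a <donors x genes> matrix with class labels.
real_rows = [[1.0, 5.0], [1.2, 5.5], [8.0, 0.5], [7.5, 0.7]]
real_labels = ["subtype_a", "subtype_a", "subtype_b", "subtype_b"]
params = fit_class_gaussians(real_rows, real_labels)
synthetic_rows, synthetic_labels = sample_synthetic(params, extra_noise=0.1, seed=42)
```

Treating genes as independent discards gene-gene correlations, so a generator like this would likely score poorly on biological utility; it is meant only to show the shape of the fit-then-sample workflow.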

🧩 Challenge Structure: The Health Privacy Challenge consists of two phases, where Blue and Red team members must participate in benchmark method submissions.
Phase 1:

  • (🫐) Blue teams work on methods that improve on the baseline generative methods and generate novel insights into privacy preservation in biological datasets.
  • (🍅) Red teams launch membership inference attacks (MIA) against the synthetic datasets generated by the baseline generative methods.*

Phase 2:

  • After the end of Phase 1, a set of Blue team solutions will be selected, based on their leaderboard performance as well as novelty of their methods.
  • (🍅) During Phase 2, in which only Red teams participate, Red teams will launch MIA against these selected Blue teams’ solutions.

* MIA aims to re-identify the training data points used to generate synthetic datasets from the original dataset. This re-identification process pertains only to identifying the pseudo-identities within the dataset and does not, in any way, attempt to re-identify the original donors.
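As an illustration of the idea behind MIA, the sketch below scores each candidate record by its distance to the nearest synthetic record, on the assumption that a generator which overfits will place synthetic points close to its training points. This is a simple distance-based heuristic, not the challenge's baseline attack; the function names, the `fraction` parameter, and the toy vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def membership_scores(candidates, synthetic):
    """Higher score = closer to some synthetic record = more likely a training member."""
    return [-min(euclidean(c, s) for s in synthetic) for c in candidates]


def predict_members(scores, fraction=0.5):
    """Flag the top `fraction` of candidates by score as predicted members."""
    k = round(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:k])
    return [i in flagged for i in range(len(scores))]


# Toy example: the first candidate sits next to the synthetic data, the second does not.
candidates = [[0.0, 0.0], [10.0, 10.0]]
synthetic = [[0.1, 0.0], [0.0, 0.2]]
scores = membership_scores(candidates, synthetic)
predictions = predict_members(scores, fraction=0.5)
```

In this toy case the first candidate is flagged as a member and the second is not. With real data, the distance metric and the threshold would need to be chosen with care.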

🎢 Participation: To participate successfully in the challenge, participants must:

  • Register through the ELSA Benchmark Platform to access the challenge datasets and detailed instructions. We recommend registering with an organizational email if possible.
  • Submit their methods (code and relevant files) through the ELSA Benchmark Platform.
    • (🫐) Blue teams must participate in the benchmark submission by the Phase 1 deadline.
    • (🍅) Red teams must participate in two benchmark submissions, by the Phase 1 and Phase 2 deadlines.
  • Submit a CAMDA extended abstract that details their benchmark method submissions during Phases 1 and 2 by the CAMDA submission deadline (both teams, 🫐 and 🍅).

We provide a GitHub Starter Package Repo for both teams, which includes baseline methods and evaluation metrics, as well as guidelines on which to base their method development.

🗂️ Datasets: We redistribute two open-access TCGA bulk RNA-seq datasets in pre-processed form. The raw counts can be accessed from the GDC portal (portal.gdc.cancer.gov). Each donor in the datasets has a single sample.

  1. TCGA-BRCA: Breast cancer dataset of size <1,089 (donors) x 978 (genes)> with five subtypes, suitable for cancer subtype prediction task;
  2. TCGA COMBINED: A collection of ten different cancer tissues of size <4,323 (donors) x 978 (genes)>, suitable for cancer tissue-of-origin prediction task. 

More details about the datasets and preprocessing steps can be found on the ELSA Benchmark Platform and in the GitHub Starter Package Repo.

🏆 Evaluation:  The teams with the best solutions will be determined based on multiple criteria, including,

  • 🎯 leaderboard ranking,
  • 💡 novelty of methods,
  • 🌱 generation of novel insights into privacy-preservation in biology. 

Therefore we strongly encourage the participants to submit their CAMDA extended abstracts to be evaluated even if they might not have achieved a high ranking on the leaderboards.  

The winning blue and red teams will be invited to present their methods at the CAMDA Conference at ISMB 2025, and will be awarded travel fellowships sponsored by ELSA.

⏳ Timeline: 

  🎉 Get started: 

We look forward to engaging with members of both the computational biology and privacy communities, and to working together to deepen our understanding of privacy in health care. 🤗

 

The Synthetic Clinical Health Records (TBC)

UPDATE FOR 2025! Please note that we will provide an improved synthetic data set this year. If you are interested, please sign up to the CAMDA Google group to receive a message when it is released!

Although data protection is necessary to safeguard patients’ privacy, the resulting regulations are also an obstacle to biomedical research. An interesting alternative is the use of synthetic patients. However, conventional synthetic patients are useless for discovery because they are built from known data distributions. Generative Adversarial Networks (GANs) and related developments have emerged as powerful tools for generating synthetic data in a way that captures relationships between variables even if those relationships were previously unknown. GANs became popular for generating highly realistic synthetic pictures but have since been applied in many fields, including the generation of synthetic patients with tools such as medGAN.

Three datasets of synthetic patients have been created in succession for this challenge since CAMDA 2023. All three were generated from a real cohort retrieved from the Health Population Database (Base Poblacional de Salud, BPS) of the Andalusian Health System (Spain), using a Dual Adversarial AutoEncoder (DAAE) approach:

  1. The first dataset (1st generation, 2023) was originally created for CAMDA 2023. It includes a list of pathologies for 999,936 synthetic patients ordered by visits, which was generated from a total of 979,308 real diabetes patients. The visits of the real diabetes patients were collected up to the end of 2019. Additionally, all visits carry an age-range (decades) label. This dataset is still available and usable for this challenge, either by itself or combined with the second generation. This first generation can be downloaded here.
  2. The second dataset (2nd generation, 2024) includes a new list of pathologies for 999,936 synthetic patients, labeled with the age resolution increased to years. This dataset was generated from an extended cohort of 984,414 real diabetes patients. It is provided both as generated (raw version) and after minor pre-processing to clean up inconsistencies (pre-processed version). The second generation can be downloaded here.
  3. The third dataset (3rd generation, 2025) includes a new dataset for 999,936 synthetic patients, generated by extending patient visits up to the end of 2022. This extension allows a better description of the long-term consequences of diabetes, which could result in more accurate endpoint predictions. As in the previous generation, the dataset is provided both as generated (raw version) and after minor pre-processing to clean up inconsistencies (pre-processed version).

Two challenges are suggested for these datasets, although any other original analysis you may think of is also welcome:

1) Finding strong relationships among diabetes-associated pathologies that allow a pathology to be predicted before it is diagnosed. Well-known pathological consequences of diabetes that can be considered relevant endpoints to predict include: a) Retinopathy (Code “703”), b) Chronic kidney disease (Code “1401”), c) Ischemic heart disease (Code “910”), d) Amputations (Code “1999”).
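One possible way to frame the first task is sketched below, assuming each patient is represented as a visit-ordered list of visits and each visit as a list of pathology code strings. That layout is an assumption, as the actual file format is described on the Download Site. The history is cut at the first visit containing the endpoint code, so that only earlier diagnoses are used as predictors. The endpoint codes come from the list above; the other codes in the toy patient and all function names are made up for illustration.

```python
from collections import Counter

# Endpoint codes as listed in the challenge description.
ENDPOINTS = {
    "703": "retinopathy",
    "1401": "chronic kidney disease",
    "910": "ischemic heart disease",
    "1999": "amputation",
}


def split_at_endpoint(visits, endpoint_code):
    """Return (visits before the first endpoint diagnosis, whether the endpoint occurs)."""
    for index, visit in enumerate(visits):
        if endpoint_code in visit:
            return visits[:index], True
    return visits, False


def history_features(history):
    """Bag-of-codes features: how often each pathology code appears in the history."""
    return dict(Counter(code for visit in history for code in visit))


# Toy patient; "A1" and "B2" are placeholder codes, not real dataset codes.
toy_visits = [["A1"], ["B2", "A1"], ["703"], ["910"]]
history, has_retinopathy = split_at_endpoint(toy_visits, "703")
features = history_features(history)
```

Cutting the trajectory before the endpoint visit avoids leaking the diagnosis itself, or anything recorded after it, into the features.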

2) Another proposed challenge is the prediction of disease trajectories in diabetes patients (see for example: Jensen et al. Nat Commun. 2014)
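A simple starting point for trajectory analysis is to count, across patients, how often one pathology is first diagnosed before another. The sketch below does only that counting step, under the same assumed layout (a patient is a visit-ordered list of code lists). It is loosely inspired by temporal disease-pair analyses and is not a reimplementation of the cited method; the codes "A", "B" and "C" are placeholders.

```python
from collections import Counter
from itertools import combinations


def first_occurrence(visits):
    """Map each code to the index of the first visit in which it appears."""
    seen = {}
    for index, visit in enumerate(visits):
        for code in visit:
            seen.setdefault(code, index)
    return seen


def count_directed_pairs(patients):
    """Count patients in which code X is first seen strictly before code Y."""
    pair_counts = Counter()
    for visits in patients:
        first = first_occurrence(visits)
        for a, b in combinations(sorted(first), 2):
            if first[a] < first[b]:
                pair_counts[(a, b)] += 1
            elif first[b] < first[a]:
                pair_counts[(b, a)] += 1
            # Codes first seen in the same visit have no direction and are skipped.
    return pair_counts


toy_patients = [
    [["A"], ["B"]],
    [["A"], ["C"], ["B"]],
    [["B"], ["A"]],
]
pairs = count_directed_pairs(toy_patients)
```

Pairs whose counts are strongly asymmetric (many more X before Y than Y before X) are candidates for directed links; any real analysis would also need a statistical test against chance co-occurrence.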

Prediction proposals submitted together with the trained model and the code required to run it can be tested on the real dataset by the organisers and included in a collective publication.

Please sign up to announcements from the forum for alerts.

Please read and accept the data download agreement for access to the Download Site.

We thank the Institute of Advanced Research in Artificial Intelligence (IARAI) for its support in the preparation of this Challenge.

The Gut Microbiome-based Health Index Challenge

 
The incidence of diseases linked to microbiome health, such as obesity or Inflammatory Bowel Disease (IBD), is continuously on the rise (M’koma, 2013; Hong et al., 2019). Because the gut microbiome is strongly linked to the functioning of the human body, the ability to evaluate one’s health status based on a stool sample is of high clinical value. Stool is becoming a reasonable alternative to other diagnostic tools – it can be collected non-invasively and frequently, and is now becoming affordable.
 
There are a number of approaches to evaluate microbiome health from stool. Alpha diversity is a frequent choice as it is closely related to dysbiosis, and microbiome richness is described as a key component of microbiome health and robustness (Li et al., 2022; Gong et al., 2016). The most robust indices to date are the Gut Microbiome Health Index, GMHI (Gupta et al., 2020), and its successor, the Gut Microbiome Wellness Index, GMWI2 (Chang et al., 2024, https://www.nature.com/articles/s41467-024-51651-9), as well as hiPCA (Zhu et al., 2023). Those indices rely on the presence of beneficial or harmful bacteria, and classify samples based on their relative ratios.
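For reference, alpha diversity is commonly summarised by the Shannon entropy of a sample's abundance profile, H = -Σ p_i ln p_i, where p_i is the relative abundance of taxon i. A minimal sketch follows. The natural logarithm is used here, and whether the precomputed scores in the challenge files use the same base is an assumption to check against the data documentation.

```python
import math


def shannon_entropy(abundances):
    """Shannon entropy (natural log) of a non-negative abundance vector."""
    total = sum(abundances)
    if total <= 0:
        return 0.0
    entropy = 0.0
    for value in abundances:
        if value > 0:  # zero-abundance taxa contribute nothing
            p = value / total
            entropy -= p * math.log(p)
    return entropy


even_sample = [25.0, 25.0, 25.0, 25.0]   # four equally abundant species
dominated_sample = [97.0, 1.0, 1.0, 1.0]  # one species dominates
even_h = shannon_entropy(even_sample)
dominated_h = shannon_entropy(dominated_sample)
```

The evenly distributed sample reaches the maximum entropy for four species, ln 4, while the dominated sample scores much lower.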
 
However, a recently revisited definition of the microbiome emphasizes the importance of not just the microbiota (a community of microorganisms), but the whole Theatre of Activity, ToA (Berg et al., 2020). This means that the microbiome functions, and the interactions of the microbiota, are a more accurate representation of microbiome state.
 
In this challenge, we provide a dataset with 4,398 samples originating from numerous cohorts with various diseases (from the curatedMetagenomicData database, https://www.nature.com/articles/nmeth.4468). We include precomputed taxonomic profiles with health predictions made by existing health indices (Shannon entropy on species and functions as well as GMHI and hiPCA). We also extend this by providing functional profiles.
 
We ask the CAMDA Community to develop a gut microbiome-based health index that outperforms the existing ones, ideally by taking advantage of the Theatre of Activity concept. This year, however, we would like to put an emphasis on developing novel ways of combining the taxonomic and functional profiles, as well as on exploring synergies between them and different microbiome components. Classification is a supplementary goal – the greatest value will be placed on creative perspectives that advance our understanding of the microbiome in health and disease.

Please sign up to announcements from the forum for alerts.

Please read and accept the data download agreement for access to the Download Site.

Here we provide a dataset consisting of 4,398 samples originating from numerous cohorts with various diseases. Individuals are assigned to one of the following categories:
  • Healthy (category “1”)
  • Diseased (category “0”)
The details about specific diseases and cohorts can be found in the metadata file.
We provide 3 files:
  • taxonomy.txt: species-level contribution to the taxonomic profile, calculated using MetaPhlAn
  • pathways.txt: functional profiles of the samples, calculated using HUMAnN.
  • metadata.txt: contains sample names, cohort and diagnosis assignment for each sample, along with scores predicted by the existing taxonomic health indices. Note that higher scores indicate better health for Shannon entropies and GMHI, whereas higher hiPCA scores indicate worse health.
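Because the indices point in opposite directions, any comparison between them needs to align their orientation first. The sketch below computes a rank-based AUC for separating healthy (category “1”) from diseased (category “0”) samples, with a flag to flip scores such as hiPCA. The function name, the flag, and the toy values are assumptions, and the AUC is just one reasonable metric, not necessarily the one the organisers will use.

```python
def rank_auc(scores, is_healthy, higher_is_healthier=True):
    """Probability that a random healthy sample outranks a random diseased one."""
    if not higher_is_healthier:
        scores = [-s for s in scores]  # flip indices where high means worse health
    healthy = [s for s, h in zip(scores, is_healthy) if h]
    diseased = [s for s, h in zip(scores, is_healthy) if not h]
    if not healthy or not diseased:
        raise ValueError("need at least one healthy and one diseased sample")
    wins = 0.0
    for h in healthy:
        for d in diseased:
            if h > d:
                wins += 1.0
            elif h == d:
                wins += 0.5  # ties count as half a win
    return wins / (len(healthy) * len(diseased))


# Toy labels: 1 = healthy, 0 = diseased, as in the metadata categories.
labels = [1, 1, 0, 0]
gmhi_like = [2.1, 0.4, 0.9, -1.3]   # higher = healthier
hipca_like = [0.2, 0.6, 0.5, 0.9]   # higher = worse health
auc_gmhi = rank_auc(gmhi_like, labels)
auc_hipca = rank_auc(hipca_like, labels, higher_is_healthier=False)
```

Both toy indices reach an AUC of 0.75 once their direction is aligned. The pairwise loop is quadratic in the number of samples, which remains manageable for a few thousand samples.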

Anti-Microbial Resistance Prediction

Antimicrobial resistance is one of the biggest challenges facing modern medicine. As the management of COVID-19 became increasingly dependent on pharmacological interventions, the risk of accelerating the evolution and spread of antimicrobial resistance grew [Afshinnekoo et al. 2021]. A study in a tertiary hospital environment revealed concerning colonisation patterns of microbes during extended periods [Chng et al. 2020]. It also highlighted the diversity of antimicrobial resistance gene reservoirs in hospitals that could facilitate the emergence and transmission of new modes of antibiotic resistance and AMR burden in cities in general [Danko et al. 2021]. This year we would like the CAMDA Community to look into AMR-related challenges.

This challenge consists in developing and testing models for predicting antimicrobial resistance (AMR) in 10 different bacterial pathogens and 3 drugs from the WHO’s Priority Pathogen List. You will be provided with a training dataset taken from public databases, about 8,000 isolates in total, containing both the WGS accession number of each isolate (for about a quarter of them, the accession number refers to an assembly rather than short reads) and the antibiotic susceptibility phenotype. The developed models will then be tested on data from another collection of the same pathogens, over 2,000 isolates in total, for which you will be provided with the genotypes, but not the phenotypes.
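One common way to turn an isolate's sequence into model features is to count k-mers. The sketch below shows only that featurisation step, assuming the sequence has already been downloaded and read into a string; it is an illustration, not a method prescribed by the challenge. The function names, the choice of k, and the toy sequence are assumptions.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_counts(sequence, k=4):
    """Count overlapping k-mers, skipping any window with ambiguous bases such as N."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts


def to_feature_vector(counts, vocabulary):
    """Relative k-mer frequencies in a fixed vocabulary order, ready for a classifier."""
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts.get(kmer, 0) / total for kmer in vocabulary]


toy_sequence = "acgtACGN"
counts = kmer_counts(toy_sequence, k=4)
vector = to_feature_vector(counts, vocabulary=["ACGT", "AAAA"])
```

Using a fixed vocabulary keeps training and test isolates in the same feature space, which matters here because the test phenotypes are withheld.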

Each team will be able to submit predictions up to 3 times for accuracy evaluation on the withheld phenotypes, and the final submission from each team will qualify as the CAMDA challenge submission. The teams with the best submissions will be invited to present their methods at the CAMDA conference at ISMB.

The data set and leaderboard will become available late January!

IMPORTANT DATES

  • Call for Abstracts Opens: 14 January 2025
  • CAMDA Extended Abstracts Deadline: 15 May 2025
  • Late Poster Submissions Deadline: 15 May 2025
  • Late Poster Acceptance Notifications: 22 May 2025
  • CAMDA Acceptance Notification: 22 May 2025
  • CAMDA Conference: 23-24 July 2025

ISMB/ECCB 2025 MAIN EVENT

CAMDA PARTNERS

https://www.fda.gov/
https://www.frontiersin.org/journals/genetics
http://www.smashing-studio.com/
http://f1000research.com/