The CAMDA Contest Challenges

Annual International Conference on Critical Assessment of Massive Data Analysis
Washington, DC | July 12-16, 2026

For CAMDA 2026, we present:

The Health Privacy Challenge presents an interactive platform for achieving trust and robustness in the generation of privacy-preserving synthetic gene expression datasets. Join us in the bulk RNA-seq track to generate synthetic datasets and evaluate them for biological utility and adversarial privacy risk. Or join the single-cell track to develop generative methods with a focus on biological realism and privacy-aware data sharing.
The Synthetic Clinical Health Records Challenge provides a rich set of highly realistic Electronic Health Records (EHRs) tracing the diagnosis trajectories of diabetic patients, created with dual-adversarial auto-encoders trained on data from 1.2 million real patients in the Population Health Database of the Andalusian Ministry of Health. Predict relevant diabetes endpoints like blindness or cardiopathy from past diagnosis trajectories!
The Gut Microbiome Interaction Network Challenge offers thousands of metagenomic samples with taxonomic and functional profiles from healthy and diseased individuals. This time, the challenge shifts the focus from composition to microbiome interaction networks, embracing the Theatre of Activity concept to uncover taxon-taxon, taxon-function, and function-function relationships driving health and disease.
The Anti-Microbial Resistance Prediction Challenge features thousands of clinical isolate sequences. Predict resistance genes and markers to identify resistant bacteria from the WHO’s Priority Pathogen List!

CAMDA encourages an open contest, where all analyses of the contest data sets are of interest, not limited to the questions suggested here. There is an online forum for the free discussion of the contest data sets and their analysis, in which you are encouraged to participate.

The Health Privacy Challenge

📢 Synthetic data generation is a widely adopted approach to privacy preservation: it generates data points that are consistent with the distribution of the real data. However, the effectiveness of synthetic data generators in biology, and the extent to which they can protect against adversarial attacks such as membership inference, remain underexplored.

The Health Privacy Challenge, organized in the context of the European Lighthouse on Safe and Secure AI (ELSA, https://elsa-ai.eu), invites participants to advance this field by contributing to three tasks:

Blue Team (🫐): develop novel privacy-preserving methods for generating synthetic bulk gene-expression datasets that maintain meaningful biological signals,
Red Team (🍅): design robust and realistic membership inference attack (MIA) methods to evaluate and expose potential privacy risks in the synthetic bulk gene-expression datasets generated during the Health Privacy Challenge 2025,
Single-cell Team (🧬): (1) develop novel privacy-preserving generative approaches for synthetic single-cell gene-expression datasets that effectively balance privacy and utility, and (2) conduct systematic assessments of privacy risks in synthetic single-cell RNA-seq data.

Choose your task and get started! Participants are welcome and encouraged to participate in multiple tasks!

🎢 PARTICIPATION: To participate successfully in the challenge, the participants of all three tasks (🫐, 🍅, 🧬) must:

Register through the ELSA Benchmark Platform to access the challenge datasets and detailed instructions. We recommend registering with an organizational email address if possible.
Submit their methods (code and relevant files) through the ELSA Benchmark Platform.
Submit a CAMDA extended abstract that details their benchmark method submissions by the CAMDA submission deadline.

We provide a GitHub Starter Package Repo for all tasks.

🗂️ DATASETS: We re-distribute two open access TCGA bulk RNA-seq datasets in the pre-processed form, which can be accessed from the GDC portal (portal.gdc.cancer.gov) as raw counts. Each donor in the datasets has a single sample.

TCGA-BRCA: Breast cancer dataset of size <1,089 (donors) x 978 (genes)> with five subtypes, suitable for cancer subtype prediction task;
TCGA COMBINED: A collection of ten different cancer tissues of size <4,323 (donors) x 978 (genes)>, suitable for cancer tissue-of-origin prediction task.

For the single-cell track, we re-distribute:

Raw counts of the OneK1K single-cell RNA-seq dataset, a cohort containing 1.26 million peripheral blood mononuclear cells (PBMCs) from 981 donors (Yazar et al., 2022).

We gratefully acknowledge the authors for granting permission to redistribute this valuable dataset for the challenge.

🏆 EVALUATION: The teams with the best solutions will be determined based on multiple criteria, including:

🎯 leaderboard performance,
💡 novelty of methods,
🌱 generation of novel insights into privacy preservation in biology,
📃 CAMDA extended abstract.

🎉 GET STARTED:

Please visit the ELSA Benchmark Platform to register and to access the datasets. Detailed information about benchmark method submissions can also be found here.
Visit the GitHub Starter Package Repo to reproduce the baseline generative and membership inference methods and to find further instructions.
Make sure to connect with us in the CAMDA Health Privacy Challenge Google Groups for questions and discussions, and to follow upcoming announcements!

We are looking forward to engaging with members of both the computational biology and the privacy communities, and to working together to deepen our understanding of privacy in health care. 🤗

References

Yazar S., Alquicira-Hernández J., Wing K., Senabouth A., Gordon G., Andersen S., Lu Q., Rowson A., Taylor T., Clarke L., Maccora L., Chen C., Cook A., Ye J., Fairfax K., Hewitt A., Powell J. “Single cell eQTL mapping identified cell type specific control of autoimmune disease.” Science. (2022) (https://onek1k.org)

NOTE: MIA aims to identify which data points of the original dataset were used to train the generator of a synthetic dataset. This identification pertains only to the pseudo-identities within the dataset and does not, in any way, attempt to re-identify the original donors.
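To make the idea of membership inference concrete, here is a minimal sketch of one simple attack style: score each candidate record by its distance to the closest synthetic record, on the assumption that training records tend to lie closer to the synthetic data than non-members do. This is an illustration only and not one of the challenge baselines; the function names and the toy values are hypothetical.

```python
import math

def min_distance(record, synthetic):
    """Euclidean distance from one record to its nearest synthetic record."""
    return min(math.dist(record, s) for s in synthetic)

def membership_scores(candidates, synthetic):
    """Return one score per candidate; a higher score means 'more likely a member'.

    The score is the negated nearest-neighbour distance, so candidates that
    sit very close to some synthetic record receive the highest scores.
    """
    return [-min_distance(c, synthetic) for c in candidates]

# Toy expression vectors (hypothetical values, not challenge data).
synthetic = [[1.0, 2.0], [5.0, 5.0]]
candidates = [[1.1, 2.1], [9.0, 0.0]]
scores = membership_scores(candidates, synthetic)
```

In this toy case the first candidate lies almost on top of a synthetic record and therefore ranks above the second. A realistic attack would also need a calibrated decision threshold or a reference population.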

The Synthetic Clinical Health Records Challenge

Access to real clinical data is essential for biomedical discovery, yet it remains constrained by legitimate privacy and data-governance requirements. This tension has made synthetic patients one of the most promising frontiers in health data science. But not all synthetic data are equally useful: traditional simulators can reproduce known statistical properties, yet they often fail to preserve the subtle, higher-order structure where new biology and clinically relevant patterns may reside. Recent advances in generative AI, particularly GAN-based and related deep generative models, have changed this landscape. These methods can learn complex dependencies directly from large-scale electronic health records and generate synthetic patient trajectories that are both realistic and analytically useful. Originally popularized through synthetic image generation, generative models are now reshaping biomedical data sharing, method benchmarking, and privacy-conscious discovery, opening the door to challenges in which participants can extract real clinical insight from synthetic yet information-rich patient records. What began as a revolution in image synthesis is now enabling a new generation of biomedical challenges: can we learn, predict, stratify, and discover from synthetic patients in ways that matter clinically?

Three datasets of synthetic patients have been created successively for this challenge since CAMDA 2023. All of them were generated from a real cohort retrieved from the Health Population Database (Base Poblacional de Salud, BPS) of the Andalusian Health System (Spain), using a Dual Adversarial AutoEncoder (DAAE) approach:

The first dataset (1st generation, 2023) was originally created for CAMDA 2023. It includes a list of pathologies, ordered by visits, for 999,936 synthetic patients, generated from a total of 979,308 real diabetes patients. The visits of the real diabetes patients were collected up to the end of 2019. Additionally, all visits feature an age-range (decades) label. This dataset is still available and usable for this challenge, either by itself or combined with the second generation. This first generation can be downloaded here.
The second dataset (2nd generation, 2024) includes a new list of pathologies for 999,936 synthetic patients, labeled with an age resolution increased to years. This dataset was generated from an extended cohort of 984,414 real diabetes patients. It is provided both as generated (raw version) and after minor pre-processing to clean up inconsistencies (pre-processed version). The second generation can be downloaded here.
The third dataset (3rd generation, 2025/2026) includes new data for 999,936 synthetic patients, generated by extending patient visits up to the end of 2022. This extension allows a better description of the long-term consequences of diabetes, which could result in more accurate endpoint predictions. As in the previous generation, the dataset is provided both as generated (raw version) and after minor pre-processing to clean up inconsistencies (pre-processed version). More details on the generation can be found here: Orduno et al., Advanced Science, 2026.

Two challenges are suggested on these datasets, although any other original analysis is also welcome:

1) Finding strong relationships among diabetes-associated pathologies that allow a pathology to be predicted before it is diagnosed. Some well-known pathological consequences of diabetes that can be considered relevant endpoints to predict are: a) Retinopathy (Code “703”), b) Chronic kidney disease (Code “1401”), c) Ischemic heart disease (Code “910”), d) Amputations (Code “1999”).

2) Predicting disease trajectories in diabetes patients (see, for example, Jensen et al. Nat Commun. 2014).
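As a rough illustration of how the first suggested task could be framed as supervised learning, the sketch below splits one patient's trajectory at the first occurrence of an endpoint code, so that only earlier diagnoses serve as predictors. It assumes a trajectory is held in memory as a chronologically ordered list of visits, each visit being a list of code strings. That layout, the function name, and every code except "703" are hypothetical and not taken from the dataset documentation.

```python
def split_at_endpoint(visits, endpoint):
    """Return (history_codes, label) for one patient.

    visits   : list of visits in chronological order, each a list of code strings
    endpoint : code of the pathology to predict, e.g. "703" (retinopathy)

    If the endpoint occurs, only codes from visits strictly before its first
    occurrence are kept and the label is 1. Otherwise all codes are kept and
    the label is 0.
    """
    for i, visit in enumerate(visits):
        if endpoint in visit:
            history = [code for v in visits[:i] for code in v]
            return history, 1
    return [code for v in visits for code in v], 0

# Toy trajectories; "101", "205" and "330" are placeholder codes.
patient_a = [["101", "205"], ["330"], ["703", "205"]]
patient_b = [["101"], ["330"]]

history_a, label_a = split_at_endpoint(patient_a, "703")
history_b, label_b = split_at_endpoint(patient_b, "703")
```

Cutting the history before the endpoint visit avoids leaking the diagnosis into the features. Patients whose first visit already contains the endpoint end up with an empty history and would probably be excluded from training.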

Prediction proposals submitted together with the trained model and the code required to run it can be tested on the real dataset by the organisers and can take part in a collective publication.

Please read and accept the data download agreement for access to the Download Site.

We thank the Institute of Advanced Research in Artificial Intelligence (IARAI) for its support in the preparation of this Challenge.

The Gut Microbiome Interaction Network Challenge

Diseases associated with gut microbiome dysbiosis such as obesity, type 2 diabetes, and inflammatory bowel diseases are becoming increasingly prevalent health problems. Traditional analytical approaches based solely on taxonomic composition or diversity metrics (alpha and beta diversity) are often insufficient to understand disease mechanisms.

Growing evidence indicates that the key to understanding the gut ecosystem is not a list of species, but rather the network of their interactions and metabolic functions (Faust & Raes, 2012; Berry & Widder, 2014). According to the modern definition of the microbiome, it represents a dynamic Theatre of Activity, a complex system of interactions between microorganisms and their environment (Berg et al., 2020).

The provided metagenomic datasets contain metadata, taxonomic and functional profiles. The task for participants is to go beyond a standard microbiome description and propose an innovative interaction analysis.

The main evaluation criteria are the idea, innovativeness, and a non-conventional approach to the problem.

CHALLENGE OBJECTIVES

Define biologically meaningful interactions within the gut microbiome
Analyze changes in interactions between health and disease states
Assess whether interactions can support the prediction of health status

Participants may choose one, two, or all three analytical tracks, each addressing a different biological question. There is no single correct solution to this challenge, and the project does not have to be completed: what matters is the idea, an attempt at its implementation, and the biological interpretation of the results. The highest value will be attributed to projects that offer a new perspective on microbiome interactions and help to better understand the role of the microbiome in health and disease.

THREE TRACKS TO CHOOSE FROM

Track 1: Taxon-Taxon Interactions

This track focuses on identifying ecological relationships between microorganisms. Studies such as Faust & Raes, 2012 and Berry & Widder, 2014 have shown that network topology rather than species presence alone can reflect ecosystem stability or dysbiosis. Network inference methods such as SparCC, SPIEC-EASI, or CoNet have become standard tools in microbiome analysis and may serve as a starting point for more innovative modeling approaches.
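As a starting point for experimentation, a taxon-taxon association network can be sketched with a centred log-ratio (CLR) transform followed by thresholded Pearson correlation. This is a deliberately simplified stand-in and not an implementation of SparCC, SPIEC-EASI, or CoNet. The pseudocount, the threshold, the function names, and the toy abundances are all assumptions made for illustration.

```python
import math

def clr(sample, pseudocount=1e-6):
    """Centred log-ratio transform of one sample's relative abundances."""
    logs = [math.log(x + pseudocount) for x in sample]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def cooccurrence_edges(samples, taxa, threshold=0.8):
    """Return (taxon_i, taxon_j, r) for every pair with |r| >= threshold.

    samples : list of per-sample abundance lists, in the same order as taxa
    """
    transformed = [clr(s) for s in samples]
    columns = list(zip(*transformed))  # one tuple of CLR values per taxon
    edges = []
    for i in range(len(taxa)):
        for j in range(i + 1, len(taxa)):
            r = pearson(columns[i], columns[j])
            if abs(r) >= threshold:
                edges.append((taxa[i], taxa[j], r))
    return edges

# Toy relative abundances for three hypothetical taxa in four samples.
taxa = ["Taxon_A", "Taxon_B", "Taxon_C"]
samples = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2], [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]
edges = cooccurrence_edges(samples, taxa)
```

Building one such edge list per health state and comparing the two is one simple way to look for changes in network topology. Correlation on compositional data remains prone to spurious edges, which is the problem the dedicated methods named above try to address.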

Track 2: Taxon-Function Interactions

In this track, microorganisms are not analyzed solely as taxonomic entities but as carriers of specific metabolic potentials. The goal is to understand which taxa are responsible for particular functions and how this map of associations changes in disease.

Does dysbiosis result from the loss of organisms or from the loss of functions, and can these functions be taken over by other taxa?

Studies based on functional profiling and the concept of functional redundancy (Louca et al., 2018) demonstrate that similar metabolic potential can be realized by different taxonomic communities.

Track 3: Function-Function Interactions

This track completely departs from the taxonomic level and treats the microbiome as an integrated network of biochemical processes. The analysis focuses on relationships between functions that may reinforce one another or compete for substrates.

Please read and accept the data download agreement for access to the Download Site.

Here we provide a dataset consisting of 3752 samples originating from numerous cohorts [source] with various diseases. The details about specific diseases and cohorts can be found in the metadata file.

We provide 3 files:

taxonomy._relative_abundance.tsv: species-level contribution to the taxonomic profile, calculated using MetaPhlAn.
pathways_relative_abundance.tsv: functional profiles of the samples, calculated using HUMAnN.
metadata.tsv: contains detailed information about the samples. Not all columns need to be used in analyses. We provide additional data so that users have access to more information if needed.
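The sketch below shows one way such a tab-separated abundance table could be read with the Python standard library. It assumes that the first column holds feature names (taxa or pathways) and that each remaining column is one sample. The actual orientation and header of the challenge files may differ, so check them before reusing this. The function name and the toy table are hypothetical.

```python
import csv
import io

def read_abundance_table(handle):
    """Parse a TSV abundance table into {feature: {sample_id: value}}.

    Assumed layout (not confirmed by the challenge documentation):
    header row = feature-column name followed by sample IDs,
    each later row = feature name followed by one float per sample.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = dict(zip(sample_ids, map(float, row[1:])))
    return table

# Toy in-memory table standing in for a real file.
toy = io.StringIO("feature\tS1\tS2\nTaxon_A\t0.7\t0.1\nTaxon_B\t0.3\t0.9\n")
table = read_abundance_table(toy)
```

For a real file, the same function can be called on an open file handle, for example one obtained with `open(path, newline="")`.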

An example approach that can be applied within this challenge will be presented in the form of a preprint in the near future!

The Anti-Microbial Resistance Prediction Challenge

Antimicrobial resistance is one of the biggest challenges facing modern medicine. Because the management of COVID-19 became increasingly dependent on pharmacological interventions, including inappropriately prescribed antibiotics, there was a greater risk of accelerating the evolution and spread of antimicrobial resistance [Afshinnekoo et al. 2021]. A study in a tertiary hospital environment revealed concerning colonisation patterns of microbes over extended periods [Chang et al. 2020]. It also highlighted the diversity of antimicrobial resistance gene reservoirs in hospitals that could facilitate the emergence and transmission of new modes of antibiotic resistance and AMR burden in cities in general [Danko et al. 2021]. This year we would once again like to invite the CAMDA Community to take part in the challenge of predicting AMR phenotypes from genotypes.

This challenge consists of developing and testing models for predicting antimicrobial resistance (AMR) in six different bacterial pathogen-drug combinations, with the pathogens taken from the WHO’s Priority Pathogen List. You will be provided with a training dataset selected from the CABBAGE database [Dickens et al. 2026], which represents the largest public AMR dataset in a reconciled, uniform format. This training dataset contains the assembly for each isolate as well as its antibiotic susceptibility phenotype (R or S), for a total of 4800 samples across the 6 combinations. The developed models will then be tested on a second collection covering the same pathogen-drug combinations, derived from the public domain but not included in the CABBAGE database, for which you are provided with the assemblies but not the phenotypes. This testing dataset contains a total of 1500 samples and is provided courtesy of Seigla Systems, the sponsor of this year’s challenge.
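One common way to turn an assembly into model input is to count short subsequences (k-mers) and use the counts as features for an R/S classifier. The sketch below illustrates that idea under stated assumptions: contigs are already loaded as plain strings, k = 4 is chosen only to keep the toy example readable, and the function names are hypothetical. It is not a method prescribed by the organisers.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_counts(contigs, k=4):
    """Count canonical k-mers over all contigs of one assembly.

    K-mers containing characters other than A, C, G, T (for example N)
    are skipped.
    """
    counts = Counter()
    for seq in contigs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts

# Toy contigs (hypothetical sequences, not challenge data).
toy_contigs = ["ACGTACGT", "TTTTNAAAA"]
features = kmer_counts(toy_contigs, k=4)
```

Using canonical k-mers makes the features independent of the strand on which a contig happens to be reported. In practice a larger k and a sparse representation would be needed, and models would be trained separately or jointly across the six pathogen-drug combinations.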