Annual International Conference on Critical Assessment of Massive Data Analysis
Washington, DC | July 12-16, 2026

The CAMDA Contest Challenges

For CAMDA 2026, we present:

  • The Health Privacy Challenge presents an interactive platform for achieving trust and robustness in the generation of privacy-preserving synthetic gene expression datasets. Join us in the bulk RNA-seq track to generate synthetic datasets and evaluate them for biological utility and adversarial privacy risk. Or join the single-cell track to develop generative methods with a focus on biological realism and privacy-aware data sharing. 
  • The Synthetic Clinical Health Records Challenge provides a rich set of highly realistic Electronic Health Records (EHRs) tracing the diagnosis trajectories of diabetic patients, created with dual-adversarial auto-encoders trained on data from 1.2 million real patients in the Population Health Database of the Andalusian Ministry of Health. Predict relevant diabetes endpoints like blindness or cardiopathy from past diagnosis trajectories!
  • The Gut Microbiome Interaction Network Challenge offers thousands of metagenomic samples with taxonomic and functional profiles from healthy and diseased individuals. This time, the challenge shifts the focus from composition to microbiome interaction networks, embracing the Theatre of Activity concept to uncover taxon-taxon, taxon-function, and function-function relationships driving health and disease.
  • The Anti-Microbial Resistance Prediction Challenge features thousands of clinical isolate sequences. Predict resistance genes and markers to identify resistant bacteria from WHO’s Priority Pathogen List!

CAMDA encourages an open contest, where all analyses of the contest data sets are of interest, not limited to the questions suggested here. There is an online forum for the free discussion of the contest data sets and their analysis, in which you are encouraged to participate.

We look forward to a lively contest!

The Health Privacy Challenge

📢 Synthetic data generation is a widely adopted approach to privacy preservation: it generates data points that are consistent with the distribution of the real data. However, the effectiveness of synthetic data generators in biology, and the extent to which they can protect against adversarial attacks such as membership inference, remain underexplored.

The Health Privacy Challenge, which is organized in the context of the European Lighthouse on Safe and Secure AI (ELSA, https://elsa-ai.eu), invites participants to advance this field by contributing to three tasks:

  • Blue Team (🫐): develop novel privacy-preserving methods for generating synthetic bulk gene-expression datasets that maintain meaningful biological signals,
  • Red Team (🍅): design robust and realistic membership inference attack (MIA) methods to evaluate and expose potential privacy risks in synthetic bulk gene-expression datasets generated during the Health Privacy Challenge 2025,
  • Single-cell Team (🧬): (1) develop novel privacy-preserving generative approaches for synthetic single-cell gene-expression datasets that effectively balance privacy and utility, and (2) conduct systematic assessments of privacy risks in synthetic single-cell RNA-seq data.

Choose your task and get started! Participants are welcome and encouraged to take part in multiple tasks.

🎢 PARTICIPATION: To participate successfully in the challenge, the participants of all three tasks (🫐, 🍅, 🧬) must:

  • Register through the ELSA Benchmark Platform to access the challenge datasets and detailed instructions. We recommend registering with an organizational email address if possible.
  • Submit their methods (code and relevant files) through the ELSA Benchmark Platform.
  • Submit a CAMDA extended abstract that details their benchmark method submissions by the CAMDA submission deadline.

We provide a GitHub Starter Package Repo for all tasks.

🗂️ DATASETS:  We re-distribute two open access TCGA bulk RNA-seq datasets in the pre-processed form, which can be accessed from the GDC portal (portal.gdc.cancer.gov) as raw counts. Each donor in the datasets has a single sample. 

  1. TCGA-BRCA: Breast cancer dataset of size <1,089 (donors) x 978 (genes)> with five subtypes, suitable for cancer subtype prediction task;
  2. TCGA COMBINED: A collection of ten different cancer tissues of size <4,323 (donors) x 978 (genes)>, suitable for cancer tissue-of-origin prediction task. 
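One common way to probe the biological utility of a synthetic dataset for prediction tasks like these is to train a classifier on the synthetic samples and test it on real ones. The sketch below illustrates that idea with a nearest-centroid classifier on made-up two-gene data. It is not the challenge's official evaluation, and the subtype labels, values, and function names are illustrative assumptions.

```python
import math
from collections import defaultdict


def fit_centroids(X, y):
    """Return the mean expression vector of each class label."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in vec] for label, vec in sums.items()}


def predict(centroids, row):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))


def accuracy(centroids, X, y):
    hits = sum(predict(centroids, row) == label for row, label in zip(X, y))
    return hits / len(y)


# Toy data (hypothetical): 2 genes instead of 978, 2 subtypes instead of 5.
synthetic_X = [[5.0, 1.0], [4.8, 1.2], [1.0, 5.0], [1.1, 4.7]]
synthetic_y = ["LumA", "LumA", "Basal", "Basal"]
real_X = [[5.2, 0.9], [0.8, 5.1], [4.5, 1.5]]
real_y = ["LumA", "Basal", "LumA"]

# Train on synthetic data, evaluate on held-out real data.
centroids = fit_centroids(synthetic_X, synthetic_y)
tstr_accuracy = accuracy(centroids, real_X, real_y)
print(f"train-on-synthetic, test-on-real accuracy: {tstr_accuracy:.2f}")
```

Comparing this score with the accuracy of the same classifier trained on real data gives a rough sense of how much predictive signal the generator preserved.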

For the single-cell track, we re-distribute:

  1. Raw counts of OneK1K single-cell RNA-seq dataset, a cohort containing 1.26 million peripheral blood mononuclear cells (PBMCs) of 981 donors (Yazar et al., 2022). 

We gratefully acknowledge the authors for granting permission to redistribute this valuable dataset for the challenge.

🏆 EVALUATION:  The teams with the best solutions will be determined based on multiple criteria, including,

  • 🎯 leaderboard performance,
  • 💡 novelty of methods,
  • 🌱 generation of novel insights into privacy preservation in biology,
  • 📃 CAMDA extended abstract.

  🎉 GET STARTED: 

We are looking forward to engaging with both members of the computational biology and the privacy community, and working together to deepen our understanding of privacy in health care. 🤗

 

References

Yazar S., Alquicira-Hernández J., Wing K., Senabouth A., Gordon G., Andersen S., Lu Q., Rowson A., Taylor T., Clarke L., Maccora L., Chen C., Cook A., Ye J., Fairfax K., Hewitt A., Powell J. “Single-cell eQTL mapping identifies cell type-specific genetic control of autoimmune disease.” Science (2022). https://onek1k.org

NOTE: MIA aims to re-identify the training data points used to generate synthetic datasets from the original dataset. This re-identification process pertains only to identifying the pseudo-identities within the dataset and does not, in any way, attempt to re-identify the original donors.
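To make the idea of membership inference concrete, here is a minimal sketch of one simple baseline: a candidate record is scored by its distance to the closest synthetic record, on the assumption that a generator may place synthetic points unusually close to its training points. This is only an illustration, not the attack used or expected in the challenge. The function names, toy values, and threshold are hypothetical.

```python
import math


def min_distance_to_synthetic(candidate, synthetic):
    """Euclidean distance from a candidate record to its nearest synthetic record."""
    return min(math.dist(candidate, s) for s in synthetic)


def membership_scores(candidates, synthetic):
    """Higher score means 'more likely to have been in the training set'."""
    return [-min_distance_to_synthetic(c, synthetic) for c in candidates]


def predict_members(candidates, synthetic, threshold):
    """Flag a candidate as a training member if a synthetic record lies within threshold."""
    return [min_distance_to_synthetic(c, synthetic) <= threshold for c in candidates]


# Toy example with 2-dimensional "expression profiles" (hypothetical values).
synthetic = [[1.0, 2.0], [3.0, 3.0]]
candidates = [[1.1, 2.0], [8.0, 8.0]]

scores = membership_scores(candidates, synthetic)
flags = predict_members(candidates, synthetic, threshold=0.5)
print(flags)
```

In practice the threshold would have to be calibrated, for example on records known not to be in the training set, and the scores are usually evaluated with ranking metrics rather than a single cut-off.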

 

The Synthetic Clinical Health Records (TBC)

UPDATE FOR 2025! Please note that we will provide an improved synthetic data set this year. If you are interested, please sign up to the CAMDA Google group to receive a message when it is released!

Although data protection is necessary to preserve patients’ privacy, privacy regulations are also an obstacle to biomedical research. An interesting alternative is the use of synthetic patients. However, conventional synthetic patients are of little use for discovery because they are built from known data distributions. Generative Adversarial Networks (GANs) and related developments have emerged as powerful tools to generate synthetic data in a way that captures relationships between the variables even if such relationships were previously unknown. GANs became popular for generating highly realistic synthetic pictures but have been applied in many fields, including the generation of synthetic patients with applications such as medGAN and others.

Three datasets of synthetic patients have been created in succession for this challenge since CAMDA 2023. All three datasets were generated from a real cohort retrieved from the Health Population Database (Base Poblacional de Salud, BPS) of the Andalusian Health System (Spain), using a Dual Adversarial AutoEncoder (DAAE) approach:

  1. The first dataset (1st generation, 2023) was originally created for CAMDA 2023. It includes a list of pathologies, ordered by visit, for 999,936 synthetic patients, generated from a total of 979,308 real diabetes patients. The visits of the real diabetes patients were collected up to the end of 2019. Additionally, all visits carry an age-range (decades) label. This dataset is still available and usable for this challenge, either by itself or combined with the second generation. The first generation can be downloaded here.
  2. The second dataset (2nd generation, 2024) includes a new list of pathologies for 999,936 synthetic patients, which were labeled increasing the age resolution to years. This dataset was generated from an extended cohort of 984,414 real diabetes patients. This dataset is provided as it was generated (raw version) as well as after a minor pre-processing to clean up inconsistencies (pre-processed version). The second generation can be downloaded here.
  3. The third dataset (3rd generation, 2025) includes a new dataset for 999,936 synthetic patients, generated by extending patient visits up to the end of 2022. This extension allows a better description of the long-term consequences of diabetes, which could result in more accurate endpoint predictions. As in the previous generation, the dataset is provided as it was generated (raw version) as well as after a minor pre-processing to clean up inconsistencies (pre-processed version).

Two challenges are suggested for these datasets, although any other original analysis you can think of is also welcome:

1) Finding strong relationships among diabetes-associated pathologies that allow a pathology to be predicted before it is diagnosed. Well-known pathological consequences of diabetes that can serve as relevant endpoints to predict include: a) Retinopathy (Code “703”), b) Chronic kidney disease (Code “1401”), c) Ischemic heart disease (Code “910”), d) Amputations (Code “1999”).
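As a rough illustration of how such an endpoint prediction task could be framed, the sketch below turns one patient's visit sequence into a (history, label) training example: the history is every visit before the endpoint code first appears, and the label records whether it appears at all. The in-memory representation (a list of visits, each a list of code strings) is an assumption, not the actual file format of the challenge data, and the non-endpoint codes are made up.

```python
# Endpoint codes as listed in the challenge description.
ENDPOINT_CODES = {
    "703": "Retinopathy",
    "1401": "Chronic kidney disease",
    "910": "Ischemic heart disease",
    "1999": "Amputations",
}


def make_example(visits, endpoint_code):
    """Split a patient's visits at the first occurrence of the endpoint.

    Returns (history, label): history holds the visits strictly before the
    endpoint is first diagnosed, label is 1 if the endpoint occurs, else 0.
    """
    for i, visit in enumerate(visits):
        if endpoint_code in visit:
            return visits[:i], 1
    return visits, 0


# Hypothetical patient: codes "101", "205", and "310" are placeholders.
patient = [["101"], ["205", "310"], ["703"], ["1401"]]

history, label = make_example(patient, "703")
print(ENDPOINT_CODES["703"], history, label)
```

Cutting the history before the first endpoint visit avoids leaking the diagnosis itself into the features.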

2) Another proposed challenge is the prediction of disease trajectories in diabetes patients (see for example: Jensen et al. Nat Commun. 2014)
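A very simple starting point for trajectory analysis is to count, across patients, how often one diagnosis is first recorded before another. The sketch below does only that counting. It is not the method of the cited paper, which applies additional statistics. The visit representation and the non-endpoint codes are hypothetical.

```python
from collections import Counter


def first_occurrence(visits):
    """Map each diagnosis code to the index of the visit where it first appears."""
    first = {}
    for i, visit in enumerate(visits):
        for code in visit:
            first.setdefault(code, i)
    return first


def count_directed_pairs(patients):
    """Count patients in which code a is first seen strictly before code b."""
    pairs = Counter()
    for visits in patients:
        first = first_occurrence(visits)
        for a, index_a in first.items():
            for b, index_b in first.items():
                if index_a < index_b:
                    pairs[(a, b)] += 1
    return pairs


# Three hypothetical patients; "101" and "205" are placeholder codes.
patients = [
    [["101"], ["703"]],
    [["101"], ["205"], ["703"]],
    [["703"], ["101"]],
]

pairs = count_directed_pairs(patients)
print(pairs.most_common(3))
```

Pairs whose counts are strongly asymmetric in one direction are natural candidates for building longer trajectories, after correcting for how common each diagnosis is.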

Prediction proposals submitted together with the trained model and the code required to run it can be tested on the real dataset by the organisers, and their authors can take part in a collective publication.

The data sets will become available soon!

 

The Gut Microbiome Interaction Network Challenge

Diseases associated with gut microbiome dysbiosis such as obesity, type 2 diabetes, and inflammatory bowel diseases are becoming increasingly prevalent health problems. Traditional analytical approaches based solely on taxonomic composition or diversity metrics (alpha and beta diversity) are often insufficient to understand disease mechanisms. 

Growing evidence indicates that the key to understanding the gut ecosystem is not a list of species, but rather the network of their interactions and metabolic functions (Faust & Raes, 2012; Berry & Widder, 2014). According to the modern definition of the microbiome, it represents a dynamic Theatre of Activity, a complex system of interactions between microorganisms and their environment (Berg et al., 2020).

The provided metagenomic datasets contain metadata, taxonomic and functional profiles. The task for participants is to go beyond a standard microbiome description and propose an innovative interaction network-based analysis.

The main evaluation criteria are the idea, innovativeness, and a non-conventional approach to the problem.

CHALLENGE OBJECTIVES

  1. Define biologically meaningful interactions within the gut microbiome
  2. Analyze changes in interactions between health and disease states
  3. Assess whether interactions can support the prediction of health status

Participants may choose one, two, or all three analytical tracks, each addressing a different biological question. There is no single correct solution to this challenge. The project does not have to be completed. What matters is the idea, an attempt at its implementation, and the biological interpretation of the results. The highest value will be attributed to projects that offer a new perspective on microbiome interactions and help better understand its role in health and disease.

THREE TRACKS TO CHOOSE FROM

Track 1: Taxon-Taxon Interactions
This track focuses on identifying ecological relationships between microorganisms. Studies such as Faust & Raes, 2012 and Berry & Widder, 2014 have shown that network topology rather than species presence alone can reflect ecosystem stability or dysbiosis. Network inference methods such as SparCC, SPIEC-EASI, or CoNet have become standard tools in microbiome analysis and may serve as a starting point for more innovative modeling approaches.
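As a naive baseline for comparison with such tools, the sketch below builds a co-occurrence network by applying a centered log-ratio (CLR) transform to each sample and then thresholding Pearson correlations between taxa. It is not an implementation of SparCC, SPIEC-EASI, or CoNet, which treat compositionality and sparsity far more carefully. The taxon names, abundances, pseudocount, and threshold are illustrative assumptions.

```python
import math
from itertools import combinations


def clr(sample, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's relative abundances."""
    logs = [math.log(v + pseudocount) for v in sample]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def correlation_network(samples, taxa, threshold=0.8):
    """Return edges (taxon_i, taxon_j, r) with |r| >= threshold after CLR."""
    transformed = [clr(sample) for sample in samples]
    columns = list(zip(*transformed))  # one abundance vector per taxon
    edges = []
    for i, j in combinations(range(len(taxa)), 2):
        r = pearson(columns[i], columns[j])
        if abs(r) >= threshold:
            edges.append((taxa[i], taxa[j], round(r, 3)))
    return edges


# Hypothetical data: rows are samples, columns are taxa.
taxa = ["taxon_A", "taxon_B", "taxon_C"]
samples = [
    [0.60, 0.30, 0.10],
    [0.50, 0.25, 0.25],
    [0.20, 0.10, 0.70],
    [0.30, 0.15, 0.55],
]

edges = correlation_network(samples, taxa)
for edge in edges:
    print(edge)
```

Running the same function separately on healthy and diseased samples and comparing the resulting edge sets is one simple way to look for interactions that change with health status.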
 
Track 2: Taxon-Function Interactions
In this track, microorganisms are not analyzed solely as taxonomic entities but as carriers of specific metabolic potentials. The goal is to understand which taxa are responsible for particular functions and how this map of associations changes in disease.

Does dysbiosis result from the loss of organisms or from the loss of functions, and can these functions be taken over by other taxa?

 
Studies based on functional profiling and the concept of functional redundancy (Louca et al., 2018) demonstrate that similar metabolic potential can be realized by different taxonomic communities.
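One simple way to start exploring taxon-function links is to compute a rank correlation between a taxon's abundance and a pathway's abundance across samples, separately for each health group, and then look at how the association differs. The sketch below does this with a hand-written Spearman correlation. The group labels, values, and function names are hypothetical, and a real analysis would need multiple-testing control and confounder handling.

```python
import math


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 0.0 if sd_x == 0 or sd_y == 0 else cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def association_by_group(taxon, pathway, groups):
    """Spearman correlation between a taxon and a pathway within each group."""
    result = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        result[group] = spearman([taxon[i] for i in idx], [pathway[i] for i in idx])
    return result


# Hypothetical abundances for one taxon and one pathway across 8 samples.
taxon = [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4]
pathway = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
groups = ["healthy"] * 4 + ["disease"] * 4

association = association_by_group(taxon, pathway, groups)
print(association)
```

A pathway whose abundance stays stable while its association with a given taxon weakens in disease would be consistent with the function being taken over by other taxa.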
 
Track 3: Function-Function Interactions
This track completely departs from the taxonomic level and treats the microbiome as an integrated network of biochemical processes. The analysis focuses on relationships between functions that may reinforce one another or compete for substrates.

Please read and accept the data download agreement for access to the Download Site.

Here we provide a dataset consisting of 3752 samples originating from numerous cohorts [source] with various diseases. The details about specific diseases and cohorts can be found in the metadata file.
We provide 3 files:
  • taxonomy._relative_abundance.tsv: species-level contribution to the taxonomic profile, calculated using MetaPhlAn
  • pathways_relative_abundance.tsv: functional profiles of the samples, calculated using HUMAnN.
  • metadata.tsv: contains detailed information about the samples. Not all columns need to be used in analyses. We provide additional data so that users have access to more information if needed.
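A minimal sketch of reading one of the abundance tables with the Python standard library follows. It assumes a tab-separated layout with features in rows and samples in columns. The actual orientation and header of the provided files should be checked before use, and the sample IDs and feature names in the toy table are made up.

```python
import csv
import io


def read_feature_table(handle):
    """Read a TSV table (features in rows, samples in columns).

    Returns a dict mapping sample ID to a dict of feature name to abundance.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]
    table = {sample: {} for sample in sample_ids}
    for row in reader:
        if not row:
            continue  # skip blank lines
        feature = row[0]
        for sample, value in zip(sample_ids, row[1:]):
            table[sample][feature] = float(value)
    return table


# Toy stand-in for a real file (hypothetical content).
toy_tsv = (
    "feature\tS1\tS2\n"
    "s__Example_species_one\t0.7\t0.2\n"
    "s__Example_species_two\t0.3\t0.8\n"
)

table = read_feature_table(io.StringIO(toy_tsv))
print(sorted(table))
```

With the real data, the same function could be called on a file opened with `open("pathways_relative_abundance.tsv", newline="")`, and the resulting sample IDs matched against the rows of metadata.tsv.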

An example approach that can be applied within this challenge will be presented in the form of a preprint in the near future!

Anti-Microbial Resistance Prediction

Antimicrobial resistance is one of the biggest challenges facing modern medicine. Because the management of COVID-19 became increasingly dependent on pharmacological interventions, there is a greater risk of accelerating the evolution and spread of antimicrobial resistance [Afshinnekoo et al. 2021]. A study in a tertiary hospital environment revealed concerning colonisation patterns of microbes over extended periods [Chng et al. 2020]. It also highlighted the diversity of antimicrobial resistance gene reservoirs in hospitals that could facilitate the emergence and transmission of new modes of antibiotic resistance and AMR burden in cities in general [Danko et al. 2021]. This year we would like the CAMDA community to look into AMR-related challenges.

This challenge consists of developing and testing models for predicting antimicrobial resistance (AMR) in different bacterial pathogens and drugs from the WHO’s Priority Pathogen List. You will be provided with a training dataset taken from public databases containing both the WGS accession numbers for each isolate and the antibiotic susceptibility phenotype. The developed models will then be tested on data from another collection of the same pathogens, for which you will be provided with the genotypes but not the phenotypes.
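To illustrate the train-on-labelled-isolates, predict-on-unlabelled-isolates setup, here is a deliberately tiny sketch: each sequence is reduced to its set of k-mers, and a query isolate receives the phenotype of the most similar training isolate by Jaccard similarity. Real genomes, feature extraction, and models would be far larger and more careful. The sequences, labels, and function names are made up for illustration.

```python
def kmers(sequence, k=4):
    """Set of all overlapping k-mers in a nucleotide sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def predict_phenotype(query_sequence, training, k=4):
    """Return the phenotype of the training isolate most similar to the query."""
    query_kmers = kmers(query_sequence, k)
    best = max(training, key=lambda item: jaccard(query_kmers, kmers(item[0], k)))
    return best[1]


# Hypothetical training set: (sequence, phenotype) pairs.
training = [
    ("ATGCGTACGTTAGC", "Resistant"),
    ("TTTTGGGGCCCCAAAA", "Susceptible"),
]

prediction = predict_phenotype("ATGCGTACGTTAGG", training)
print(prediction)
```

A nearest-neighbour rule like this mostly captures relatedness between isolates, so a stronger model would typically focus on known resistance genes and mutations or learn which features matter for each drug.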

This year we will take advantage of the newly developed CABBAGE, which represents the largest public AMR dataset in a reconciled, uniform format [Dickens et al. 2025].

The data sets and leaderboard will become available soon!

IMPORTANT DATES

  • Call for Abstracts Opens: 1 February 2026
  • CAMDA Extended Abstracts Deadline: 7 May 2026
  • Late Poster Submissions Deadline: 7 May 2026
  • Late Poster Acceptance Notifications: 14 May 2026
  • CAMDA Acceptance Notification: 14 May 2026
  • CAMDA Conference: 12-16 July 2026

CAMDA PARTNERS

https://www.fda.gov/
https://www.frontiersin.org/journals/genetics
http://www.smashing-studio.com/
http://f1000research.com/