The CAMDA Contest Challenges

For 2020, we present

the Hi-Res Cancer Data Integration Challenge – now with 300Mb of novel human genomic sequence and read level data for splice events!
the CMap Drug Safety Challenge – now with preselected cell-lines, extended structural information, toxicity data, and new labels incorporating the severity of the toxic effect!
the Metagenomic Geolocation Challenge – now with additional climate data!

CAMDA encourages an open contest, where all analyses of the contest data sets are of interest, not limited to the questions suggested here. There is an online forum for the free discussion of the contest data sets and their analysis, in which you are encouraged to participate.

We look forward to a lively contest!

Hi-Res Cancer Data Integration Challenge

Building on the comprehensive description of genomic, transcriptomic and epigenomic changes of cancers provided by the Genomic Data Commons (GDC, formerly at TCGA), the main goal of this challenge is to develop and demonstrate novel methods for gaining new biological insights or improving support for Precision Medicine. Innovation can come from

Individual human genomic sequence not found in the standard reference genome - We provide reads matched to the standard reference genome plus over 300Mb of novel human genomic sequence.
High resolution expression profiling - Anonymized read level data allow the exploration of aberrations in splicing and regulation of alternative gene transcripts.
More meaningful integration of multiple matched molecular profiles and complementary patient data

Examine algorithm performance in real-world clinical settings! We know that many approaches work well on some data sets yet not on others. We here challenge you to demonstrate a single unified approach that matches or outperforms the current state-of-the-art for

Breast cancer

and for at least one of the less well studied

Lung Adenocarcinoma
Kidney Renal Clear Cell Carcinoma

Please visit and participate in the open CAMDA data integration forum for free discussion related to this contest.

Analysis suggestions:
Biological:

What known and new disease mechanisms can you identify?
How can the integration of matched molecular profiles and patient data yield a more meaningful readout, including likely causal changes?
What can we learn about the role of aberrant splicing and regulation of alternative gene transcripts in cancer?
How can individual human genomic sequence aid Precision Medicine and the development of personalized rational drug treatment plans?

Technical:

Can we apply approaches and insights developed from one type of cancer (e.g., a common, well studied cancer) to other diseases (e.g., less-well studied cancers)?

How large a distortion is observed from restriction of gene expression readout to the standard human reference sequence (vs mapping to individual human genome sequences)?

Contest data comprise raw and pre-processed data from matched molecular profiles with complementary clinical information.

For convenience, we provide a local copy of the data. In addition, anonymized RNA-seq read level data are now available.

Please sign up to announcements from the CAMDA data integration forum for alerts.

Please read and accept the data download agreement for access.

CMap Drug Safety Challenge

Due to safety / toxicity issues, attrition in drug discovery and development remains a significant concern, and there are strong efforts to identify and mitigate risk as early as possible. Drug-induced liver injury (DILI) is one of the primary liabilities in drug development and regulatory clearance due to the limited performance of mandated preclinical models. There is a pressing need to evaluate alternative methods for predicting severe DILI, the main concern of the regulatory agencies. Increasing evidence suggests that multiple factors, including the interactions between drug properties and host factors (i.e., patient information), contribute to the DILI effect of a drug (Journal of Hepatology 63). Great hopes are being placed in modern approaches from statistics and machine learning applied to genome-scale profiling data. Whether we can better integrate, understand, and exploit information from multiple complementary studies of chemical compounds thus remains a critical question. Specifically, this challenge explores chemical descriptors of the drugs (Mold2, Journal of Chemical Information and Modeling 48), cell-based screening of pathway perturbations of the drugs (Toxicology in the 21st Century/Tox21, Nature Communications 7), gene expression patterns induced by them (Broad Institute Connectivity Map/CMap, Science 313, Nature Reviews Cancer 7, Cell 171), as well as host factors from the FDA Adverse Event Reporting System database (FAERS).

This CAMDA challenge focuses on understanding or predicting a drug’s potential to cause acute liver failure, the most severe type of DILI. To support the development of supervised machine learning approaches, we retrieved DILI severity information from the FDA-approved drug labeling and now provide a new set of training labels for 422 drugs, indicating their potential to cause acute liver failure. In addition, we acquired a validation set of 195 drugs with blinded labels, which should be predicted. In the 2020 challenge, instead of relying solely on gene expression data, we have extended the predictors with Mold2 chemical descriptors, host factor information (age and gender of the patients) from the FDA FAERS database, and Tox21 pathway perturbation data. Moreover, we have narrowed down last year's CMap L1000 gene expression data set to six cell lines potentially most relevant to liver (i.e., PHH, HEPG2, HA1E, A375, MCF7, PC3). Analysis teams are encouraged to develop models using these predictors individually and/or in combination.

Analysis suggestions:

Integration of potentially complementary assays. Assessment of the relative values of the complementary data types for prediction.
Identification and interpretation of differences in cell-line response across drugs and across different predictors.
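
As an illustration of how the relative value of complementary data types might be assessed, the following minimal sketch compares two predictor blocks individually and in combination, using leave-one-out accuracy of a 1-nearest-neighbour classifier. All drug names, feature values and labels are made up for illustration and are not contest data; the helper names are hypothetical, and any classifier or validation scheme could be substituted. In practice, blocks would typically need scaling before concatenation.

```python
from math import dist

def combine_blocks(blocks, drug_ids):
    """Concatenate the feature vectors of several predictor blocks per drug."""
    return {d: [x for block in blocks for x in block[d]] for d in drug_ids}

def loo_1nn_accuracy(features, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier (Euclidean)."""
    drugs = sorted(features)
    hits = 0
    for held_out in drugs:
        others = [d for d in drugs if d != held_out]
        nearest = min(others, key=lambda d: dist(features[held_out], features[d]))
        hits += labels[nearest] == labels[held_out]
    return hits / len(drugs)

# Toy, made-up data: 1 = potential to cause acute liver failure, 0 = none.
labels = {"drugA": 1, "drugB": 1, "drugC": 0, "drugD": 0}
# Hypothetical stand-ins for a chemical-descriptor block and an expression block.
chem = {"drugA": [0.9, 0.1], "drugB": [0.8, 0.2],
        "drugC": [0.1, 0.9], "drugD": [0.2, 0.8]}
expr = {"drugA": [0.5], "drugB": [0.1], "drugC": [0.4], "drugD": [0.0]}

scores = {
    "chem": loo_1nn_accuracy(chem, labels),
    "expr": loo_1nn_accuracy(expr, labels),
    "chem+expr": loo_1nn_accuracy(combine_blocks([chem, expr], labels), labels),
}
for name, acc in scores.items():
    print(f"{name}: leave-one-out accuracy = {acc:.2f}")
```

Reporting the same metric per block and for their combination makes it easy to see whether an additional data type adds information or merely noise.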

Contest data comprise anonymized processed expression profiles from the Broad Institute Human L1000 epsilon platform. Complementary information includes Mold2 chemical descriptors of the drugs, Tox21 cell-based screening of pathway perturbations of the drugs, and FAERS information. Toxicity labels were compiled by the US FDA.

A local copy of relevant subsets of the data, including labels, is available now.

Please sign up to announcements from the CAMDA toxicogenomics forum for alerts and for free discussion related to this contest.

Please read and accept the data download agreement for access.

Metagenomic Geolocation Challenge

MetaSUB is creating a global genetic cartography of urban spaces, based on extensive sampling of mass-transit systems and other public areas across the globe. In a strategic partnership, an extended set of data from global City Sampling Days is first introduced through the annual CAMDA contests. CAMDA delegates thus receive access to over a thousand novel MetaSUB samples, comprising over a terabase of whole genome shotgun (WGS) metagenomics data. The primary data set covers over 20 cities around the world, with tens of samples per city (over 1000 samples in total), providing a unique resource for the study of biodiversity within and across geographic locations as well as ecological niches.

For a better understanding of the relation between metagenomic profiles and location specificity / ecological niche, a set of over a thousand features describing climate conditions is provided, as well as classifications of city and neighbouring biomes.

Global coverage can be extended further with complementary 16S rRNA studies contributing thousands of samples: the Earth Microbiome Project and "A global atlas of the dominant bacteria found in soil". For a range of MetaSUB Boston reference samples we now provide both WGS and 16S profiles, allowing a first systematic link of WGS and 16S resources.

Together, these unique multi-source data sets will allow you to build novel models to predict ecological niche type or even origin locations of samples from cities seen for the very first time. Performance can be tested on an independent test set of over 50 new 'mystery' samples, including locations from cities not sampled before.

Please visit and participate in the open CAMDA meta-genomics forum for free discussion related to this contest.

Analysis suggestions:
A key challenge in metagenomic forensics is the construction of a microbiome fingerprint that allows the prediction of the geographical origin of a sample even when no reference samples from that location are known.

Typical considerations include:

How can we exploit metagenomic fingerprints for identifying the origin of a sample?
How reliable are such predictions of sample origins?
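
One possible way to quantify the reliability of origin predictions, when a model outputs coordinates rather than a city label, is the great-circle distance between predicted and true locations. The sketch below uses the standard haversine formula and reports the median error over a set of samples. The sample names and coordinates are made up for illustration, and the function names are hypothetical; the contest does not prescribe this metric.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))

def median_error_km(predicted, truth):
    """Median great-circle error over all samples present in `truth`."""
    return median(haversine_km(*predicted[s], *truth[s]) for s in truth)

# Toy, made-up coordinates (latitude, longitude) for three hypothetical samples.
truth = {"m1": (0.0, 0.0), "m2": (0.0, 0.0), "m3": (0.0, 0.0)}
predicted = {"m1": (0.0, 1.0), "m2": (0.0, 0.0), "m3": (0.0, 2.0)}

print(f"median error: {median_error_km(predicted, truth):.1f} km")
```

The median is used here because a few grossly misplaced samples would dominate a mean; other summaries, such as the fraction of samples within a fixed radius, are equally reasonable.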

The primary data set is now available. It contains: i) hundreds of samples with WGS raw reads from urban locations from the MetaSUB Consortium, ii) over a thousand weather/climate features for cities, as well as city and neighbouring biome classifications.

In addition, 16S sequencing-based OTUs for thousands of soil samples from all over the world, taken from the two projects mentioned above, are also available.

With the set of mystery samples (now available), try to:

discover which samples come from the same location,
predict for mystery samples weather/climate features and/or biome,
predict mystery sample geographic origin.
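
As a starting point for these tasks, the following minimal sketch assigns a mystery sample to the closest known city by comparing taxon relative-abundance profiles with the Bray-Curtis dissimilarity against per-city mean profiles. It is only a baseline under stated assumptions: the profiles, taxon names and city labels are made up, the function names are hypothetical, and by construction the approach can only return cities present in the reference set, so it does not address samples from cities not sampled before.

```python
def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two {taxon: abundance} profiles."""
    taxa = set(p) | set(q)
    num = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)
    den = sum(p.get(t, 0.0) + q.get(t, 0.0) for t in taxa)
    return num / den if den else 0.0

def city_centroids(profiles, cities):
    """Mean abundance profile per city; taxa absent from a sample count as 0."""
    sums, counts = {}, {}
    for sample, profile in profiles.items():
        city = cities[sample]
        counts[city] = counts.get(city, 0) + 1
        acc = sums.setdefault(city, {})
        for taxon, abundance in profile.items():
            acc[taxon] = acc.get(taxon, 0.0) + abundance
    return {c: {t: v / counts[c] for t, v in acc.items()} for c, acc in sums.items()}

def predict_city(profile, centroids):
    """Return the city whose centroid is least dissimilar to the profile."""
    return min(centroids, key=lambda c: bray_curtis(profile, centroids[c]))

# Toy, made-up reference profiles (relative abundances) and city labels.
reference = {
    "s1": {"taxonA": 0.7, "taxonB": 0.3},
    "s2": {"taxonA": 0.6, "taxonB": 0.4},
    "s3": {"taxonB": 0.2, "taxonC": 0.8},
    "s4": {"taxonB": 0.1, "taxonC": 0.9},
}
cities = {"s1": "city1", "s2": "city1", "s3": "city2", "s4": "city2"}

centroids = city_centroids(reference, cities)
mystery = {"taxonA": 0.5, "taxonB": 0.4, "taxonC": 0.1}
print(predict_city(mystery, centroids))
```

The same dissimilarity can be computed between pairs of mystery samples to explore which of them are likely to come from the same location.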

Please sign up to announcements from the CAMDA meta-genomics forum for alerts.

For a copy of our data, please read and accept the data download agreement.

Extended Abstract Proposals due	25 May 2020
Notification of Accepted Contributions	4 Jun 2020
Early Registration Closes	11 Jun 2020
CAMDA2020 Conference	13–14 Jul 2020
ISMB 2020 Conference	12–16 Jul 2020
Full Paper Submission	30 Oct 2020