Augustin Luna

Dept of Systems Biology, Harvard Medical School
Bioinformatics Scientist / Research Associate
augustin_luna AT hms.harvard.edu

Experiences

Research Fellow / Dana-Farber Cancer Institute & Research Associate / Harvard Medical School (2015-Present)

Work with Chris Sander. Develop R packages for cancer pharmacology research, including cgdsr (cBioPortal, TCGA studies), paxtoolsr (Pathway Commons), rcellminer (NCI-60 cell lines). Develop R-based integrative analyses and bioinformatics pipelines with a focus on reproducibility using Docker. Develop web applications for cancer pharmacology databases and analyses. Work on tools to conduct complex pathway analyses using the aggregated Pathway Commons database. Work to identify novel drug-target interactions using compound activity from the NCI Developmental Therapeutics Program (DTP) and omic profiling data of the NCI-60 cancer cell lines in collaboration with the Developmental Therapeutics Branch at the NCI, the Genomics of Drug Sensitivity in Cancer Project at the Sanger Institute, and the Institute of Molecular Systems Biology at ETH Zurich. Work to identify novel drug combinations for cancer using proteomics data.

Research Fellow / Memorial Sloan-Kettering Cancer Center (2013-2015)

Worked with Chris Sander. Worked on standardized representations of biological data (BioPAX), the aggregation of multiple pathway databases into Pathway Commons, and related software, including Paxtools and paxtoolsr. In collaboration with the Yves Pommier lab of the Developmental Therapeutics Branch at the NCI, worked to identify novel drug-target interactions using compound activity from the NCI Developmental Therapeutics Program (DTP) and genomic profiling data of the NCI-60 cancer cell lines. Compounds were prioritized for further development based on an understanding of pharmacology, target pathway context, relevance to cancer, and novelty.

Pre-Doctoral Cancer Research Training Award Fellow / National Cancer Institute (2008-2013)

Worked with Mirit I. Aladjem and Kurt W. Kohn. Worked on standardized representations for biological data (Molecular Interaction Maps (MIMs), Systems Biology Graphical Notation (SBGN), and BioPAX) and related software for MIMs: XML-based format, API, graphical editor, validator, scripting interface, and faceted browser. Worked on an ODE-based model of interactions that connect components of the circadian clock to those of the DNA damage response, in collaboration with Geoffrey B. McFadden (NIST). Mentored student projects through the Google Summer of Code Project (2011 and 2012).

Intern / Pfizer (2010)

Worked with Jacob Glanville. Developed a wiki-based platform for the analysis and collaborative annotation of experimental results.

Bioinformatics PhD Student / Boston University (2007-2013)

Worked with Daniel Segrè. Worked on a metabolic network to represent a kidney cell using the BiGG human metabolic network and microarray data from the GEO repository for flux balance analysis.

Post-Baccalaureate Intramural Research Training Award Fellow / National Institute of Mental Health (2005-2007)

Worked with Daniel Weinberger and Kristen K. Nicodemus. Wrote an automated statistical analysis pipeline for candidate gene studies. Wrote software to visualize the results of SNP analyses. Performed analysis in a simulation study assessing power and type I error of single SNP transmission disequilibrium tests (TDTs).

Undergraduate Research Assistant / Georgia Institute of Technology (2004-2005)

Worked with Andreas Bommarius and Karen M. Polizzi. Created an expression vector for mutant variants in a high-throughput screen. Collected information on antibiotic resistance in E. coli due to beta-lactamase.

Undergraduate Research Assistant / Georgia Institute of Technology (2003-2004)

Worked with Allen Tannenbaum and Eli Hershkovitz. Worked on the classification of RNA into motif libraries. Wrote programs to parse tRNA X-ray crystallography data.

Education

Bioinformatics, PhD / Boston University / 2007-2013
Biomedical Engineering, BS / Georgia Institute of Technology / 2002-2005
Chemistry, AS / Middle Georgia College / 2000-2002

Honors

Postdoctoral Ruth L. Kirschstein National Research Service Award / 2014
Ford Foundation Dissertation Fellowship / 2012
NCI Pre-Doctoral Cancer Research Training Award / 2008
NIH Post-Baccalaureate Intramural Research Training Award / 2005
Alpha Eta Mu Beta Biomedical Engineering Honor Society / 2005
Petit Undergraduate Research Scholar / 2004
Goizueta Foundation Scholarship / 2004
Georgia Governor’s Scholar / 2001
Tau Beta Pi Engineering Honor Society / 2005
Golden Key Academic Honor Society / 2005
HOPE Scholarship / 2001

Publications

Precision Combination Therapies Based on Recurrent Oncogenic Coalterations. (Cancer Discov 2022)

PMID: 35412613

Cancer cells depend on multiple driver alterations whose oncogenic effects can be suppressed by drug combinations. Here, we provide a comprehensive resource of precision combination therapies tailored to oncogenic coalterations that are recurrent across patient cohorts. To generate the resource, we developed Recurrent Features Leveraged for Combination Therapy (REFLECT), which integrates machine learning and cancer informatics algorithms. Using multiomic data, the method maps recurrent coalteration signatures in patient cohorts to combination therapies. We validated the REFLECT pipeline using data from patient-derived xenografts, in vitro drug screens, and a combination therapy clinical trial. These validations demonstrate that REFLECT-selected combination therapies have significantly improved efficacy, synergy, and survival outcomes. In patient cohorts with immunotherapy response markers, DNA repair aberrations, and HER2 activation, we have identified therapeutically actionable and recurrent coalteration signatures. REFLECT provides a resource and framework to design combination therapies tailored to tumor cohorts in data-driven clinical trials and preclinical studies.

BET inhibition induces vulnerability to MCL1 targeting through upregulation of fatty acid synthesis pathway in breast cancer. (Cell Rep 2022)

PMID: 36103824

Therapeutic options for treatment of basal-like breast cancers remain limited. Here, we demonstrate that bromodomain and extra-terminal (BET) inhibition induces an adaptive response leading to MCL1 protein-driven evasion of apoptosis in breast cancer cells. Consequently, co-targeting MCL1 and BET is highly synergistic in breast cancer models. The mechanism of adaptive response to BET inhibition involves the upregulation of lipid synthesis enzymes including the rate-limiting stearoyl-coenzyme A (CoA) desaturase. Changes in lipid synthesis pathway are associated with increases in cell motility and membrane fluidity as well as re-localization and activation of HER2/EGFR. In turn, the HER2/EGFR signaling results in the accumulation of and vulnerability to the inhibition of MCL1. Drug response and genomics analyses reveal that MCL1 copy-number alterations are associated with effective BET and MCL1 co-targeting. The high frequency of MCL1 chromosomal amplifications (>30%) in basal-like breast cancers suggests that BET and MCL1 co-targeting may have therapeutic utility in this aggressive subtype of breast cancer.

COVIDpro: Database for mining protein dysregulation in patients with COVID-19. (bioRxiv 2022)

PMID: 36203550

The ongoing pandemic of the coronavirus disease 2019 (COVID-19) caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) still has limited treatment options partially due to our incomplete understanding of the molecular dysregulations of the COVID-19 patients. We aimed to generate a repository and data analysis tools to examine the modulated proteins underlying COVID-19 patients for the discovery of potential therapeutic targets and diagnostic biomarkers.

CellMiner Cross-Database (CellMinerCDB) version 1.2: Exploration of patient-derived cancer cell line pharmacogenomics. (Nucleic Acids Res 2021)

PMID: 33196823

CellMiner Cross-Database (CellMinerCDB, discover.nci.nih.gov/cellminercdb) allows integration and analysis of molecular and pharmacological data within and across cancer cell line datasets from the National Cancer Institute (NCI), Broad Institute, Sanger/MGH and MD Anderson Cancer Center (MDACC). We present CellMinerCDB 1.2 with updates to datasets from NCI-60, Broad Cancer Cell Line Encyclopedia and Sanger/MGH, and the addition of new datasets, including NCI-ALMANAC drug combination, MDACC Cell Line Project proteomic, NCI-SCLC DNA copy number and methylation data, and Broad methylation, genetic dependency and metabolomic datasets. CellMinerCDB (v1.2) includes several improvements over the previously published version: (i) new and updated datasets; (ii) support for pattern comparisons and multivariate analyses across data sources; (iii) updated annotations with drug mechanism of action information and biologically relevant multigene signatures; (iv) analysis speedups via caching; (v) a new dataset download feature; (vi) improved visualization of subsets of multiple tissue types; (vii) breakdown of univariate associations by tissue type; and (viii) enhanced help information. The curation and common annotations (e.g. tissues of origin and identifiers) provided here across pharmacogenomic datasets increase the utility of the individual datasets to address multiple researcher question types, including data reproducibility, biomarker discovery and multivariate analysis of drug activity.

CellBox: Interpretable Machine Learning for Perturbation Biology with Application to the Design of Cancer Combination Therapy. (Cell Syst 2021)

PMID: 33373583

Systematic perturbation of cells followed by comprehensive measurements of molecular and phenotypic responses provides informative data resources for constructing computational models of cell biology. Models that generalize well beyond training data can be used to identify combinatorial perturbations of potential therapeutic interest. Major challenges for machine learning on large biological datasets are to find global optima in a complex multidimensional space and mechanistically interpret the solutions. To address these challenges, we introduce a hybrid approach that combines explicit mathematical models of cell dynamics with a machine-learning framework, implemented in TensorFlow. We tested the modeling framework on a perturbation-response dataset of a melanoma cell line after drug treatments. The models can be efficiently trained to describe cellular behavior accurately. Even though completely data driven and independent of prior knowledge, the resulting de novo network models recapitulate some known interactions. The approach is readily applicable to various kinetic models of cell biology. A record of this paper's Transparent Peer Review process is included in the Supplemental Information.

Synthetic biology open language visual (SBOL Visual) version 2.3. (J Integr Bioinform 2021)

PMID: 34098590

People who are engineering biological organisms often find it useful to communicate in diagrams, both about the structure of the nucleic acid sequences that they are engineering and about the functional relationships between sequence features and other molecular species. Some typical practices and conventions have begun to emerge for such diagrams. The Synthetic Biology Open Language Visual (SBOL Visual) has been developed as a standard for organizing and systematizing such conventions in order to produce a coherent language for expressing the structure and function of genetic designs. This document details version 2.3 of SBOL Visual, which builds on the prior SBOL Visual 2.2 in several ways. First, the specification now includes higher-level "interactions with interactions," such as an inducer molecule stimulating a repression interaction. Second, binding with a nucleic acid backbone can be shown by overlapping glyphs, as with other molecular complexes. Finally, a new "unspecified interaction" glyph is added for visualizing interactions whose nature is unknown, the "insulator" glyph is deprecated in favor of a new "inert DNA spacer" glyph, and the polypeptide region glyph is recommended for showing 2A sequences.

Causal interactions from proteomic profiles: Molecular data meet pathway knowledge. (Patterns (N Y) 2021)

PMID: 34179843

We present a computational method to infer causal mechanisms in cell biology by analyzing changes in high-throughput proteomic profiles on the background of prior knowledge captured in biochemical reaction knowledge bases. The method mimics a biologist's traditional approach of explaining changes in data using prior knowledge but does this at the scale of hundreds of thousands of reactions. This is a specific example of how to automate scientific reasoning processes and illustrates the power of mapping from experimental data to prior knowledge via logic programming. The identified mechanisms can explain how experimental and physiological perturbations, propagating in a network of reactions, affect cellular responses and their phenotypic consequences. Causal pathway analysis is a powerful and flexible discovery tool for a wide range of cellular profiling data types and biological questions. The automated causation inference tool, as well as the source code, are freely available at http://causalpath.org.

COVID19 Disease Map, a computational knowledge repository of virus-host interaction mechanisms. (Mol Syst Biol 2021)

PMID: 34664389

We need to effectively combine the knowledge from surging literature with complex datasets to propose mechanistic models of SARS-CoV-2 infection, improving data interpretation and predicting key targets of intervention. Here, we describe a large-scale community effort to build an open access, interoperable and computable repository of COVID-19 molecular mechanisms. The COVID-19 Disease Map (C19DMap) is a graphical, interactive representation of disease-relevant molecular mechanisms linking many knowledge sources. Notably, it is a computational resource for graph-based analyses and disease modelling. To this end, we established a framework of tools, platforms and guidelines necessary for a multifaceted community of biocurators, domain experts, bioinformaticians and computational biologists. The diagrams of the C19DMap, curated from the literature, are integrated with relevant interaction and text mining databases. We demonstrate the application of network analysis and modelling approaches by concrete examples to highlight new testable hypotheses. This framework helps to find signatures of SARS-CoV-2 predisposition, treatment response or prioritisation of drug candidates. Such an approach may help deal with new waves of COVID-19 or similar pandemics in the long-term perspective.

Author-sourced capture of pathway knowledge in computable form using Biofactoid. (Elife 2021)

PMID: 34860157

Making the knowledge contained in scientific papers machine-readable and formally computable would allow researchers to take full advantage of this information by enabling integration with other knowledge sources to support data analysis and interpretation. Here we describe Biofactoid, a web-based platform that allows scientists to specify networks of interactions between genes, their products, and chemical compounds, and then translates this information into a representation suitable for computational analysis, search and discovery. We also report the results of a pilot study to encourage the wide adoption of Biofactoid by the scientific community.

Analyzing causal relationships in proteomic profiles using CausalPath. (STAR Protoc 2021)

PMID: 34877547

CausalPath (causalpath.org) evaluates proteomic measurements against prior knowledge of biological pathways and infers causality between changes in measured features, such as global protein and phospho-protein levels. It uses pathway resources to determine potential causality between observable omic features, which are called prior relations. The subset of the prior relations that are supported by the proteomic profiles are reported and evaluated for statistical significance. The end result is a network model of signaling that explains the patterns observed in the experimental dataset. For complete details on the use and execution of this protocol, please refer to Babur et al. (2021).

COVID-19 Disease Map, a computational knowledge repository of virus-host interaction mechanisms. (Mol Syst Biol 2021)

PMID: 34939300


A pan-cancer survey of cell line tumor similarity by feature-weighted molecular profiles. (Cell Rep Methods 2021)

PMID: 35475239

Patient-derived cell lines are often used in pre-clinical cancer research, but some cell lines are too different from tumors to be good models. Comparison of genomic and expression profiles can guide the choice of pre-clinical models, but typically not all features are equally relevant. We present TumorComparer, a computational method for comparing cellular profiles with higher weights on functional features of interest. In this pan-cancer application, we compare ∼600 cell lines and ∼8,000 tumor samples of 24 cancer types, using weights to emphasize known oncogenic alterations. We characterize the similarity of cell lines and tumors within and across cancers by using multiple datum types and rank cell lines by their inferred quality as representative models. Beyond the assessment of cell lines, the weighted similarity approach is adaptable to patient stratification in clinical trials and personalized medicine.

Pathway Commons 2019 Update: integration, analysis and exploration of pathway data. (Nucleic Acids Res 2020)

PMID: 31647099

Pathway Commons (https://www.pathwaycommons.org) is an integrated resource of publicly available information about biological pathways including biochemical reactions, assembly of biomolecular complexes, transport and catalysis events and physical interactions involving proteins, DNA, RNA, and small molecules (e.g. metabolites and drug compounds). Data is collected from multiple providers in standard formats, including the Biological Pathway Exchange (BioPAX) language and the Proteomics Standards Initiative Molecular Interactions format, and then integrated. Pathway Commons provides biologists with (i) tools to search this comprehensive resource, (ii) a download site offering integrated bulk sets of pathway data (e.g. tables of interactions and gene sets), (iii) reusable software libraries for working with pathway information in several programming languages (Java, R, Python and Javascript) and (iv) a web service for programmatically querying the entire dataset. Visualization of pathways is supported using the Systems Biology Graphical Notation (SBGN). Pathway Commons currently contains data from 22 databases with 4794 detailed human biochemical processes (i.e. pathways) and ∼2.3 million interactions. To enhance the usability of this large resource for end-users, we develop and maintain interactive web applications and training materials that enable pathway exploration and advanced analysis.

Synthetic biology open language visual (SBOL visual) version 2.2. (J Integr Bioinform 2020)

PMID: 32543457

People who are engineering biological organisms often find it useful to communicate in diagrams, both about the structure of the nucleic acid sequences that they are engineering and about the functional relationships between sequence features and other molecular species. Some typical practices and conventions have begun to emerge for such diagrams. The Synthetic Biology Open Language Visual (SBOL Visual) has been developed as a standard for organizing and systematizing such conventions in order to produce a coherent language for expressing the structure and function of genetic designs. This document details version 2.2 of SBOL Visual, which builds on the prior SBOL Visual 2.1 in several ways. First, the grounding of molecular species glyphs is changed from BioPAX to SBO, aligning with the use of SBO terms for interaction glyphs. Second, new glyphs are added for proteins, introns, and polypeptide regions (e.g., protein domains), the prior recommended macromolecule glyph is deprecated in favor of its alternative, and small polygons are introduced as alternative glyphs for simple chemicals.

Systems biology graphical notation markup language (SBGNML) version 0.3. (J Integr Bioinform 2020)

PMID: 32568733

This document defines Version 0.3 Markup Language (ML) support for the Systems Biology Graphical Notation (SBGN), a set of three complementary visual languages developed for biochemists, modelers, and computer scientists. SBGN aims at representing networks of biochemical interactions in a standard, unambiguous way to foster efficient and accurate representation, visualization, storage, exchange, and reuse of information on all kinds of biological knowledge, from gene regulation, to metabolism, to cellular signaling. SBGN is defined neutrally to programming languages and software encoding; however, it is oriented primarily towards allowing models to be encoded using XML, the eXtensible Markup Language. The notable changes from the previous version include the addition of attributes to better specify metadata about maps, as well as support for multiple maps, sub-maps, colors, and annotations. These changes enable a more efficient exchange of data to other commonly used systems biology formats (e.g., BioPAX and SBML) and between tools supporting SBGN (e.g., CellDesigner, Newt, Krayon, SBGN-ED, STON, cd2sbgnml, and MINERVA). More details on SBGN and related software are available at http://sbgn.org. With this effort, we hope to increase the adoption of SBGN in bioinformatics tools, ultimately enabling more researchers to visualize biological knowledge in a precise and unambiguous manner.

SCLC-CellMiner: A Resource for Small Cell Lung Cancer Cell Line Genomics and Pharmacology Based on Genomic Signatures. (Cell Rep 2020)

PMID: 33086069

CellMiner-SCLC (https://discover.nci.nih.gov/SclcCellMinerCDB/) integrates drug sensitivity and genomic data, including high-resolution methylome and transcriptome from 118 patient-derived small cell lung cancer (SCLC) cell lines, providing a resource for research into this "recalcitrant cancer." We demonstrate the reproducibility and stability of data from multiple sources and validate the SCLC consensus nomenclature on the basis of expression of master transcription factors NEUROD1, ASCL1, POU2F3, and YAP1. Our analyses reveal transcription networks linking SCLC subtypes with MYC and its paralogs and the NOTCH and HIPPO pathways. SCLC subsets express specific surface markers, providing potential opportunities for antibody-based targeted therapies. YAP1-driven SCLCs are notable for differential expression of the NOTCH pathway, epithelial-mesenchymal transition (EMT), and antigen-presenting machinery (APM) genes and sensitivity to mTOR and AKT inhibitors. These analyses provide insights into SCLC biology and a framework for future investigations into subtype-specific SCLC vulnerabilities.

AlignmentViewer: Sequence Analysis of Large Protein Families. (F1000Res 2020)

PMID: 33123346

AlignmentViewer is a web-based tool to view and analyze multiple sequence alignments of protein families. The particular strengths of AlignmentViewer include flexible visualization at different scales as well as analysis of conservation patterns and of the distribution of proteins in sequence space. The tool is directly accessible in web browsers without the need for software installation. It can handle protein families with tens of thousands of sequences and is particularly suitable for evolutionary coupling analysis, e.g. via EVcouplings.org.

netboxr: Automated discovery of biological process modules by network analysis in R. (PLoS One 2020)

PMID: 33137091

Large-scale sequencing projects, such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), have generated high throughput sequencing and molecular profiling data sets, but it is still challenging to identify potentially causal changes in cellular processes in cancer as well as in other diseases in an automated fashion. We developed the netboxr package written in the R programming language, which makes use of the NetBox algorithm to identify candidate cancer-related functional modules. The algorithm makes use of a data-driven, network-based approach that combines prior knowledge with a network clustering algorithm, obviating the need for and the limitation of independently curated functionally labeled gene sets. The method can combine multiple data types, such as mutations and copy number alterations, leading to more reliable identification of functional modules. We make the tool available in the Bioconductor R ecosystem for applications in cancer research and cell biology.

LLGL2 rescues nutrient stress by promoting leucine uptake in ER (Nature 2019)

PMID: 30996345

Drosophila Lgl and its mammalian homologues, LLGL1 and LLGL2, are scaffolding proteins that regulate the establishment of apical-basal polarity in epithelial cells.

Synthetic Biology Open Language Visual (SBOL Visual) Version 2.1. (J Integr Bioinform 2019)

PMID: 31199768

People who are engineering biological organisms often find it useful to communicate in diagrams, both about the structure of the nucleic acid sequences that they are engineering and about the functional relationships between sequence features and other molecular species. Some typical practices and conventions have begun to emerge for such diagrams. The Synthetic Biology Open Language Visual (SBOL Visual) has been developed as a standard for organizing and systematizing such conventions in order to produce a coherent language for expressing the structure and function of genetic designs. This document details version 2.1 of SBOL Visual, which builds on the prior SBOL Visual 2.0 standard by expanding diagram syntax to include methods for showing modular structure and mappings between elements of a system, interaction arrows that can split or join (with the glyph at the split or join indicating either superposition or a chemical process), and adding new glyphs for indicating genomic context (e.g., integration into a plasmid or genome) and for stop codons.

Systems Biology Graphical Notation: Process Description language Level 1 Version 2.0. (J Integr Bioinform 2019)

PMID: 31199769

The Systems Biology Graphical Notation (SBGN) is an international community effort that aims to standardise the visualisation of pathways and networks for readers with diverse scientific backgrounds as well as to support an efficient and accurate exchange of biological knowledge between disparate research communities, industry, and other players in systems biology. SBGN comprises the three languages Entity Relationship, Activity Flow, and Process Description (PD) to cover biological and biochemical systems at distinct levels of detail. PD is closest to metabolic and regulatory pathways found in biological literature and textbooks. Its well-defined semantics offer a superior precision in expressing biological knowledge. PD represents mechanistic and temporal dependencies of biological interactions and transformations as a graph. Its different types of nodes include entity pools (e.g. metabolites, proteins, genes and complexes) and processes (e.g. reactions, associations and influences). The edges describe relationships between the nodes (e.g. consumption, production, stimulation and inhibition). This document details Level 1 Version 2.0 of the PD specification, including several improvements, in particular: 1) the addition of the equivalence operator, subunit, and annotation glyphs, 2) modification to the usage of submaps, and 3) updates to clarify the use of various glyphs (i.e. multimer, empty set, and state variable).

Communicating Structure and Function in Synthetic Biology Diagrams. (ACS Synth Biol 2019)

PMID: 31348656

Biological engineers often find it useful to communicate using diagrams. These diagrams can include information both about the structure of the nucleic acid sequences they are engineering and about the functional relationships between features of these sequences and/or other molecular species. A number of conventions and practices have begun to emerge within synthetic biology for creating such diagrams, and the Synthetic Biology Open Language Visual (SBOL Visual) has been developed as a standard to organize, systematize, and extend such conventions in order to produce a coherent visual language. Here, we describe SBOL Visual version 2, which expands previous diagram standards to include new functional interactions, categories of molecular species, support for families of glyph variants, and the ability to indicate modular structure and mappings between elements of a system. SBOL Visual 2 also clarifies a number of requirements and best practices, significantly expands the collection of glyphs available to describe genetic features, and can be readily applied using a wide variety of software tools, both general and bespoke.

The Immune Landscape of Cancer. (Immunity 2019)

PMID: 31433971

Quantitative Proteome Landscape of the NCI-60 Cancer Cell Lines. (iScience 2019)

PMID: 31733513

Here we describe a proteomic data resource for the NCI-60 cell lines generated by pressure cycling technology and SWATH mass spectrometry. We developed the DIA-expert software to curate and visualize the SWATH data, leading to reproducible detection of over 3,100 SwissProt proteotypic proteins and systematic quantification of pathway activities. Stoichiometric relationships of interacting proteins for DNA replication, repair, the chromatin remodeling NuRD complex, β-catenin, RNA metabolism, and prefoldins are more evident than that at the mRNA level. The data are available in CellMiner (discover.nci.nih.gov/cellminercdb and discover.nci.nih.gov/cellminer), allowing casual users to test hypotheses and perform integrative, cross-database analyses of multi-omic drug response correlations for over 20,000 drugs. We demonstrate the value of proteome data in predicting drug response for over 240 clinically relevant chemotherapeutic and targeted therapies. In summary, we present a novel proteome resource for the NCI-60, together with relevant software tools, and demonstrate the benefit of proteome analyses.

A Landscape of Metabolic Variation across Tumor Types. (Cell Syst 2018)

PMID: 29396322

Tumor metabolism is reorganized to support proliferation in the face of growth-related stress. Unlike the widespread profiling of changes to metabolic enzyme levels in cancer, comparatively less attention has been paid to the substrates/products of enzyme-catalyzed reactions, small-molecule metabolites. We developed an informatic pipeline to concurrently analyze metabolomics data from over 900 tissue samples spanning seven cancer types, revealing extensive heterogeneity in metabolic changes relative to normal tissue across cancers of different tissues of origin. Despite this heterogeneity, a number of metabolites were recurrently differentially abundant across many cancers, such as lactate and acyl-carnitine species. Through joint analysis of metabolomic data alongside clinical features of patient samples, we also identified a small number of metabolites, including several polyamines and kynurenine, which were associated with aggressive tumors across several tumor types. Our findings offer a glimpse onto common patterns of metabolic reprogramming across cancers, and the work serves as a large-scale resource accessible via a web application (http://www.sanderlab.org/pancanmet).

Synthetic Biology Open Language Visual (SBOL Visual) Version 2.0. (J Integr Bioinform 2018)

PMID: 29549707

People who are engineering biological organisms often find it useful to communicate in diagrams, both about the structure of the nucleic acid sequences that they are engineering and about the functional relationships between sequence features and other molecular species. Some typical practices and conventions have begun to emerge for such diagrams. The Synthetic Biology Open Language Visual (SBOL Visual) has been developed as a standard for organizing and systematizing such conventions in order to produce a coherent language for expressing the structure and function of genetic designs. This document details version 2.0 of SBOL Visual, which builds on the prior SBOL Visual 1.0 standard by expanding diagram syntax to include functional interactions and molecular species, making the relationship between diagrams and the SBOL data model explicit, supporting families of symbol variants, clarifying a number of requirements and best practices, and significantly expanding the collection of diagram glyphs.

Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. (Cell Syst 2018)

PMID: 29596782

The Cancer Genome Atlas (TCGA) cancer genomics dataset includes over 10,000 tumor-normal exome pairs across 33 different cancer types, in total >400 TB of raw data files requiring analysis. Here we describe the Multi-Center Mutation Calling in Multiple Cancers project, our effort to generate a comprehensive encyclopedia of somatic mutation calls for the TCGA data to enable robust cross-tumor-type analyses. Our approach accounts for variance and batch effects introduced by the rapid advancement of DNA extraction, hybridization-capture, sequencing, and analysis methods over time. We present best practices for applying an ensemble of seven mutation-calling algorithms with scoring and artifact filtering. The dataset created by this analysis includes 3.5 million somatic variants and forms the basis for PanCan Atlas papers. The results have been made available to the research community along with the methods used to generate them. This project is the result of collaboration from a number of institutes and demonstrates how team science drives extremely large genomics projects.

Pan-cancer Alterations of the MYC Oncogene and Its Proximal Network across the Cancer Genome Atlas. (Cell Syst 2018)

PMID: 29596783

Although the MYC oncogene has been implicated in cancer, a systematic assessment of alterations of MYC, related transcription factors, and co-regulatory proteins, forming the proximal MYC network (PMN), across human cancers is lacking. Using computational approaches, we define genomic and proteomic features associated with MYC and the PMN across the 33 cancers of The Cancer Genome Atlas. Pan-cancer, 28% of all samples had at least one of the MYC paralogs amplified. In contrast, the MYC antagonists MGA and MNT were the most frequently mutated or deleted members, suggesting a role as tumor suppressors. MYC alterations were mutually exclusive with PIK3CA, PTEN, APC, or BRAF alterations, suggesting that MYC is a distinct oncogenic driver. Expression analysis revealed MYC-associated pathways in tumor subtypes, such as immune response and growth factor signaling; chromatin, translation, and DNA replication/repair were conserved pan-cancer. This analysis reveals insights into MYC biology and is a reference for biomarkers and therapeutics for cancers with alterations of MYC or the PMN.

Machine Learning Detects Pan-cancer Ras Pathway Activation in The Cancer Genome Atlas. (Cell Rep 2018)

PMID: 29617658

Precision oncology uses genomic evidence to match patients with treatment but often fails to identify all patients who may respond. The transcriptome of these "hidden responders" may reveal responsive molecular states. We describe and evaluate a machine-learning approach to classify aberrant pathway activity in tumors, which may aid in hidden responder identification. The algorithm integrates RNA-seq, copy number, and mutations from 33 different cancer types across The Cancer Genome Atlas (TCGA) PanCanAtlas project to predict aberrant molecular states in tumors. Applied to the Ras pathway, the method detects Ras activation across cancer types and identifies phenocopying variants. The model, trained on human tumors, can predict response to MEK inhibitors in wild-type Ras cell lines. We also present data that suggest that multiple hits in the Ras pathway confer increased Ras activity. The transcriptome is underused in precision oncology and, combined with machine learning, can aid in the identification of hidden responders.

Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images. (Cell Rep 2018)

PMID: 29617659

Beyond sample curation and basic pathologic characterization, the digitized H&E-stained images of TCGA samples remain underutilized. To highlight this resource, we present mappings of tumor-infiltrating lymphocytes (TILs) based on H&E images from 13 TCGA tumor types. These TIL maps are derived through computational staining using a convolutional neural network trained to classify patches of images. Affinity propagation revealed local spatial structure in TIL patterns and correlation with overall survival. TIL map structural patterns were grouped using standard histopathological parameters. These patterns are enriched in particular T cell subpopulations derived from molecular measures. TIL densities and spatial structure were differentially enriched among tumor types, immune subtypes, and tumor molecular subtypes, implying that spatial infiltrate state could reflect particular tumor cell aberration states. Obtaining spatial lymphocytic patterns linked to the rich genomic characterization of TCGA samples demonstrates one use for the TCGA image archives with insights into the tumor-immune microenvironment.

Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas. (Cell Rep 2018)

PMID: 29617660

This integrated, multiplatform PanCancer Atlas study co-mapped and identified distinguishing molecular features of squamous cell carcinomas (SCCs) from five sites associated with smoking and/or human papillomavirus (HPV). SCCs harbor 3q, 5p, and other recurrent chromosomal copy-number alterations (CNAs), DNA mutations, and/or aberrant methylation of genes and microRNAs, which are correlated with the expression of multi-gene programs linked to squamous cell stemness, epithelial-to-mesenchymal differentiation, growth, genomic integrity, oxidative damage, death, and inflammation. Low-CNA SCCs tended to be HPV(+) and display hypermethylation with repression of TET1 demethylase and FANCF, previously linked to predisposition to SCC, or harbor mutations affecting CASP8, RAS-MAPK pathways, chromatin modifiers, and immunoregulatory molecules. We uncovered hypomethylation of the alternative promoter that drives expression of the ΔNp63 oncogene and embedded miR944. Co-expression of immune checkpoint, T-regulatory, and myeloid suppressor cell signatures may explain reduced efficacy of immune therapy. These findings support possibilities for molecular classification and therapeutic approaches.

Integrated Genomic Analysis of the Ubiquitin Pathway across Cancer Types. (Cell Rep 2018)

PMID: 29617661

Protein ubiquitination is a dynamic and reversible process of adding single ubiquitin molecules or various ubiquitin chains to target proteins. Here, using multidimensional omic data of 9,125 tumor samples across 33 cancer types from The Cancer Genome Atlas, we perform comprehensive molecular characterization of 929 ubiquitin-related genes and 95 deubiquitinase genes. Among them, we systematically identify top somatic driver candidates, including mutated FBXW7 with cancer-type-specific patterns and amplified MDM2 showing a mutually exclusive pattern with BRAF mutations. Ubiquitin pathway genes tend to be upregulated in cancer mediated by diverse mechanisms. By integrating pan-cancer multiomic data, we identify a group of tumor samples that exhibit worse prognosis. These samples are consistently associated with the upregulation of cell-cycle and DNA repair pathways, characterized by mutated TP53, MYC/TERT amplification, and APC/PTEN deletion. Our analysis highlights the importance of the ubiquitin pathway in cancer development and lays a foundation for developing relevant therapeutic strategies.

Driver Fusions and Their Implications in the Development and Treatment of Human Cancers. (Cell Rep 2018)

PMID: 29617662

Gene fusions represent an important class of somatic alterations in cancer. We systematically investigated fusions in 9,624 tumors across 33 cancer types using multiple fusion calling tools. We identified a total of 25,664 fusions, with a 63% validation rate. Integration of gene expression, copy number, and fusion annotation data revealed that fusions involving oncogenes tend to exhibit increased expression, whereas fusions involving tumor suppressors have the opposite effect. For fusions involving kinases, we found 1,275 with an intact kinase domain, the proportion of which varied significantly across cancer types. Our study suggests that fusions drive the development of 16.5% of cancer cases and function as the sole driver in more than 1% of them. Finally, we identified druggable fusions involving genes such as TMPRSS2, RET, FGFR3, ALK, and ESR1 in 6.0% of cases, and we predicted immunogenic peptides, suggesting that fusions may provide leads for targeted drug and immune therapy.

Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas. (Cell Rep 2018)

PMID: 29617664

DNA damage repair (DDR) pathways modulate cancer risk, progression, and therapeutic response. We systematically analyzed somatic alterations to provide a comprehensive view of DDR deficiency across 33 cancer types. Mutations with accompanying loss of heterozygosity were observed in over 1/3 of DDR genes, including TP53 and BRCA1/2. Other prevalent alterations included epigenetic silencing of the direct repair genes EXO5, MGMT, and ALKBH3 in ∼20% of samples. Homologous recombination deficiency (HRD) was present at varying frequency in many cancer types, most notably ovarian cancer. However, in contrast to ovarian cancer, HRD was associated with worse outcomes in several other cancers. Protein structure-based analyses allowed us to predict functional consequences of rare, recurrent DDR mutations. A new machine-learning-based classifier developed from gene expression data allowed us to identify alterations that phenocopy deleterious TP53 mutations. These frequent DDR gene alterations in many human cancers have functional consequences that may determine cancer progression and guide therapy.

Molecular Characterization and Clinical Relevance of Metabolic Expression Subtypes in Human Cancers. (Cell Rep 2018)

PMID: 29617665

Metabolic reprogramming provides critical information for clinical oncology. Using molecular data of 9,125 patient samples from The Cancer Genome Atlas, we identified tumor subtypes in 33 cancer types based on mRNA expression patterns of seven major metabolic processes and assessed their clinical relevance. Our metabolic expression subtypes correlated extensively with clinical outcome: subtypes with upregulated carbohydrate, nucleotide, and vitamin/cofactor metabolism most consistently correlated with worse prognosis, whereas subtypes with upregulated lipid metabolism showed the opposite. Metabolic subtypes correlated with diverse somatic drivers but exhibited effects convergent on cancer hallmark pathways and were modulated by highly recurrent master regulators across cancer types. As a proof-of-concept example, we demonstrated that knockdown of SNAI1 or RUNX1, master regulators of carbohydrate metabolic subtypes, modulates metabolic activity and drug sensitivity. Our study provides a system-level view of metabolic heterogeneity within and across cancer types and identifies pathway cross-talk, suggesting related prognostic, therapeutic, and predictive utility.

Systematic Analysis of Splice-Site-Creating Mutations in Cancer. (Cell Rep 2018)

PMID: 29617666

For the past decade, cancer genomic studies have focused on mutations leading to splice-site disruption, overlooking those having splice-creating potential. Here, we applied a bioinformatic tool, MiSplice, for the large-scale discovery of splice-site-creating mutations (SCMs) across 8,656 TCGA tumors. We report 1,964 originally mis-annotated mutations having clear evidence of creating alternative splice junctions. TP53 and GATA3 have 26 and 18 SCMs, respectively, and ATRX has 5 from lower-grade gliomas. Mutations in 11 genes, including PARP1, BRCA1, and BAP1, were experimentally validated for splice-site-creating function. Notably, we found that neoantigens induced by SCMs are likely several-fold more immunogenic than missense mutations, exemplified by the recurrent GATA3 SCM. Further, high expression of PD-1 and PD-L1 was observed in tumors with SCMs, suggesting candidates for immune blockade therapy. Our work highlights the importance of integrating DNA and RNA data for understanding the functional and the clinical implications of mutations in human diseases.

Somatic Mutational Landscape of Splicing Factor Genes and Their Functional Consequences across 33 Cancer Types. (Cell Rep 2018)

PMID: 29617667

Hotspot mutations in splicing factor genes have been recently reported at high frequency in hematological malignancies, suggesting the importance of RNA splicing in cancer. We analyzed whole-exome sequencing data across 33 tumor types in The Cancer Genome Atlas (TCGA), and we identified 119 splicing factor genes with significant non-silent mutation patterns, including mutation over-representation, recurrent loss of function (tumor suppressor-like), or hotspot mutation profile (oncogene-like). Furthermore, RNA sequencing analysis revealed altered splicing events associated with selected splicing factor mutations. In addition, we were able to identify common gene pathway profiles associated with the presence of these mutations. Our analysis suggests that somatic alteration of genes involved in the RNA-splicing process is common in cancer and may represent an underappreciated hallmark of tumorigenesis.

Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context. (Cell Rep 2018)

PMID: 29617668

Long noncoding RNAs (lncRNAs) are commonly dysregulated in tumors, but only a handful are known to play pathophysiological roles in cancer. We inferred lncRNAs that dysregulate cancer pathways, oncogenes, and tumor suppressors (cancer genes) by modeling their effects on the activity of transcription factors, RNA-binding proteins, and microRNAs in 5,185 TCGA tumors and 1,019 ENCODE assays. Our predictions included hundreds of candidate onco- and tumor-suppressor lncRNAs (cancer lncRNAs) whose somatic alterations account for the dysregulation of dozens of cancer genes and pathways in each of 14 tumor contexts. To demonstrate proof of concept, we showed that perturbations targeting OIP5-AS1 (an inferred tumor suppressor) and TUG1 and WT1-AS (inferred onco-lncRNAs) dysregulated cancer genes and altered proliferation of breast and gynecologic cancer cells. Our analysis indicates that, although most lncRNAs are dysregulated in a tumor-specific manner, some, including OIP5-AS1, TUG1, NEAT1, MEG3, and TSIX, synergistically dysregulate cancer pathways in multiple tumor contexts.

The Cancer Genome Atlas Comprehensive Molecular Characterization of Renal Cell Carcinoma. (Cell Rep 2018)

PMID: 29617669

Renal cell carcinoma (RCC) is not a single disease, but several histologically defined cancers with different genetic drivers, clinical courses, and therapeutic responses. The current study evaluated 843 RCC from the three major histologic subtypes, including 488 clear cell RCC, 274 papillary RCC, and 81 chromophobe RCC. Comprehensive genomic and phenotypic analysis of the RCC subtypes reveals distinctive features of each subtype that provide the foundation for the development of subtype-specific therapeutic and management strategies for patients affected with these cancers. Somatic alteration of BAP1, PBRM1, and PTEN and altered metabolic pathways correlated with subtype-specific decreased survival, while CDKN2A alteration, increased DNA hypermethylation, and increases in the immune-related Th2 gene expression signature correlated with decreased survival within all major histologic subtypes. CIMP-RCC demonstrated an increased immune signature, and a uniform and distinct metabolic expression pattern identified a subset of metabolically divergent (MD) ChRCC that associated with extremely poor survival.

Genomic and Functional Approaches to Understanding Cancer Aneuploidy. (Cancer Cell 2018)

PMID: 29622463

Aneuploidy, whole chromosome or chromosome arm imbalance, is a near-universal characteristic of human cancers. In 10,522 cancer genomes from The Cancer Genome Atlas, aneuploidy was correlated with TP53 mutation, somatic mutation rate, and expression of proliferation genes. Aneuploidy was anti-correlated with expression of immune signaling genes, due to decreased leukocyte infiltrates in high-aneuploidy samples. Chromosome arm-level alterations show cancer-specific patterns, including loss of chromosome arm 3p in squamous cancers. We applied genome engineering to delete 3p in lung cells, causing decreased proliferation rescued in part by chromosome 3 duplication. This study defines genomic and phenotypic correlates of cancer aneuploidy and provides an experimental approach to study chromosome arm aneuploidy.

A Comprehensive Pan-Cancer Molecular Study of Gynecologic and Breast Cancers. (Cancer Cell 2018)

PMID: 29622464

We analyzed molecular data on 2,579 tumors from The Cancer Genome Atlas (TCGA) of four gynecological types plus breast. Our aims were to identify shared and unique molecular features, clinically significant subtypes, and potential therapeutic targets. We found 61 somatic copy-number alterations (SCNAs) and 46 significantly mutated genes (SMGs). Eleven SCNAs and 11 SMGs had not been identified in previous TCGA studies of the individual tumor types. We found functionally significant estrogen receptor-regulated long non-coding RNAs (lncRNAs) and gene/lncRNA interaction networks. Pathway analysis identified subtypes with high leukocyte infiltration, raising potential implications for immunotherapy. Using 16 key molecular features, we identified five prognostic subtypes and developed a decision tree that classified patients into the subtypes based on just six features that are assessable in clinical laboratories.

lncRNA Epigenetic Landscape Analysis Identifies EPIC1 as an Oncogenic lncRNA that Interacts with MYC and Promotes Cell-Cycle Progression in Cancer. (Cancer Cell 2018)

PMID: 29622465

We characterized the epigenetic landscape of genes encoding long noncoding RNAs (lncRNAs) across 6,475 tumors and 455 cancer cell lines. In stark contrast to the CpG island hypermethylation phenotype in cancer, we observed a recurrent hypomethylation of 1,006 lncRNA genes in cancer, including EPIC1 (epigenetically-induced lncRNA1). Overexpression of EPIC1 is associated with poor prognosis in luminal B breast cancer patients and enhances tumor growth in vitro and in vivo. Mechanistically, EPIC1 promotes cell-cycle progression by interacting with MYC through EPIC1's 129-283 nt region. EPIC1 knockdown reduces the occupancy of MYC at its target genes (e.g., CDKN1A, CCNA2, CDC20, and CDC45). MYC depletion abolishes EPIC1's regulation of MYC targets and luminal breast cancer tumorigenesis in vitro and in vivo.

Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas. (Cancer Cell 2018)

PMID: 29622466

We analyzed 921 adenocarcinomas of the esophagus, stomach, colon, and rectum to examine shared and distinguishing molecular characteristics of gastrointestinal tract adenocarcinomas (GIACs). Hypermutated tumors were distinct regardless of cancer type and comprised those enriched for insertions/deletions, representing microsatellite instability cases with epigenetic silencing of MLH1 in the context of CpG island methylator phenotype, plus tumors with elevated single-nucleotide variants associated with mutations in POLE. Tumors with chromosomal instability were diverse, with gastroesophageal adenocarcinomas harboring fragmented genomes associated with genomic doubling and distinct mutational signatures. We identified a group of tumors in the colon and rectum lacking hypermutation and aneuploidy termed genome stable and enriched in DNA hypermethylation and mutations in KRAS, SOX9, and PCBP1.

Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. (Cell 2018)

PMID: 29625048

We conducted comprehensive integrative molecular analyses of the complete set of tumors in The Cancer Genome Atlas (TCGA), consisting of approximately 10,000 specimens and representing 33 types of cancer. We performed molecular clustering using data on chromosome-arm-level aneuploidy, DNA hypermethylation, mRNA, and miRNA expression levels and reverse-phase protein arrays, of which all, except for aneuploidy, revealed clustering primarily organized by histology, tissue type, or anatomic origin. The influence of cell type was evident in DNA-methylation-based clustering, even after excluding sites with known preexisting tissue-type-specific methylation. Integrative clustering further emphasized the dominant role of cell-of-origin patterns. Molecular similarities among histologically or anatomically related cancer types provide a basis for focused pan-cancer analyses, such as pan-gastrointestinal, pan-gynecological, pan-kidney, and pan-squamous cancers, and those related by stemness features, which in turn may inform strategies for future therapeutic development.

Perspective on Oncogenic Processes at the End of the Beginning of Cancer Genomics. (Cell 2018)

PMID: 29625049

The Cancer Genome Atlas (TCGA) has catalyzed systematic characterization of diverse genomic alterations underlying human cancers. At this historic junction marking the completion of genomic characterization of over 11,000 tumors from 33 cancer types, we present our current understanding of the molecular processes governing oncogenesis. We illustrate our insights into cancer through synthesis of the findings of the TCGA PanCancer Atlas project on three facets of oncogenesis: (1) somatic driver mutations, germline pathogenic variants, and their interactions in the tumor; (2) the influence of the tumor genome and epigenome on transcriptome and proteome; and (3) the relationship between tumor and the microenvironment, including implications for drugs targeting driver events and immunotherapies. These results will anchor future characterization of rare and common tumor types, primary and relapsed tumors, and cancers across ancestry groups and will guide the deployment of clinical genomic sequencing.

Oncogenic Signaling Pathways in The Cancer Genome Atlas. (Cell 2018)

PMID: 29625050

Genetic alterations in signaling pathways that control cell-cycle progression, apoptosis, and cell growth are common hallmarks of cancer, but the extent, mechanisms, and co-occurrence of alterations in these pathways differ between individual tumors and tumor types. Using mutations, copy-number changes, mRNA expression, gene fusions and DNA methylation in 9,125 tumors profiled by The Cancer Genome Atlas (TCGA), we analyzed the mechanisms and patterns of somatic alterations in ten canonical pathways: cell cycle, Hippo, Myc, Notch, Nrf2, PI-3-Kinase/Akt, RTK-RAS, TGFβ signaling, p53 and β-catenin/Wnt. We charted the detailed landscape of pathway alterations in 33 cancer types, stratified into 64 subtypes, and identified patterns of co-occurrence and mutual exclusivity. Eighty-nine percent of tumors had at least one driver alteration in these pathways, and 57% of tumors had at least one alteration potentially targetable by currently available drugs. Thirty percent of tumors had multiple targetable alterations, indicating opportunities for combination therapy.

Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation. (Cell 2018)

PMID: 29625051

Cancer progression involves the gradual loss of a differentiated phenotype and acquisition of progenitor and stem-cell-like features. Here, we provide novel stemness indices for assessing the degree of oncogenic dedifferentiation. We used an innovative one-class logistic regression (OCLR) machine-learning algorithm to extract transcriptomic and epigenetic feature sets derived from non-transformed pluripotent stem cells and their differentiated progeny. Using OCLR, we were able to identify previously undiscovered biological mechanisms associated with the dedifferentiated oncogenic state. Analyses of the tumor microenvironment revealed unanticipated correlation of cancer stemness with immune checkpoint expression and infiltrating immune cells. We found that the dedifferentiated oncogenic phenotype was generally most prominent in metastatic tumors. Application of our stemness indices to single-cell data revealed patterns of intra-tumor molecular heterogeneity. Finally, the indices allowed for the identification of novel targets and possible targeted therapies aimed at tumor differentiation.

Pathogenic Germline Variants in 10,389 Adult Cancers. (Cell 2018)

PMID: 29625052

We conducted the largest investigation of predisposition variants in cancer to date, discovering 853 pathogenic or likely pathogenic variants in 8% of 10,389 cases from 33 cancer types. Twenty-one genes showed single or cross-cancer associations, including novel associations of SDHA in melanoma and PALB2 in stomach adenocarcinoma. The 659 predisposition variants and 18 additional large deletions in tumor suppressors, including ATM, BRCA1, and NF1, showed low gene expression and frequent (43%) loss of heterozygosity or biallelic two-hit events. We also discovered 33 such variants in oncogenes, including missense variants in MET, RET, and PTPN11 associated with high gene expression. We nominated 47 additional predisposition variants from prioritized VUSs supported by multiple lines of evidence involving case-control frequency, loss of heterozygosity, expression effect, and co-localization with mutations and modified residues. Our integrative approach links rare predisposition variants to functional consequences, informing future guidelines of variant classification and germline genetic testing in cancer.

Comprehensive Characterization of Cancer Driver Genes and Mutations. (Cell 2018)

PMID: 29625053

Identifying molecular cancer drivers is critical for precision oncology. Multiple advanced algorithms to identify drivers now exist, but systematic attempts to combine and optimize them on large datasets are few. We report a PanCancer and PanSoftware analysis spanning 9,423 tumor exomes (comprising all 33 of The Cancer Genome Atlas projects) and using 26 computational tools to catalog driver genes and mutations. We identify 299 driver genes with implications regarding their anatomical sites and cancer/cell types. Sequence- and structure-based analyses identified >3,400 putative missense driver mutations supported by multiple lines of evidence. Experimental validation confirmed 60%-85% of predicted mutations as likely drivers. We found that >300 MSI tumors are associated with high PD-1/PD-L1, and 57% of tumors analyzed harbor putative clinically actionable events. Our study represents the most comprehensive discovery of cancer genes and mutations to date and will serve as a blueprint for future biological and clinical endeavors.

A Pan-Cancer Analysis of Enhancer Expression in Nearly 9000 Patient Samples. (Cell 2018)

PMID: 29625054

The role of enhancers, a key class of non-coding regulatory DNA elements, in cancer development has increasingly been appreciated. Here, we present the detection and characterization of a large number of expressed enhancers in a genome-wide analysis of 8928 tumor samples across 33 cancer types using TCGA RNA-seq data. Compared with matched normal tissues, global enhancer activation was observed in most cancers. Across cancer types, global enhancer activity was positively associated with aneuploidy, but not mutation load, suggesting a hypothesis centered on "chromatin-state" to explain their interplay. Integrating eQTL, mRNA co-expression, and Hi-C data analysis, we developed a computational method to infer causal enhancer-gene interactions, revealing enhancers of clinically actionable genes. Having identified an enhancer ∼140 kb downstream of PD-L1, a major immunotherapy target, we validated it experimentally. This study provides a systematic view of enhancer activity in diverse tumor contexts and suggests the clinical implications of enhancers.

An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. (Cell 2018)

PMID: 29625055

For a decade, The Cancer Genome Atlas (TCGA) program collected clinicopathologic annotation data along with multi-platform molecular profiles of more than 11,000 human tumors across 33 different cancer types. TCGA clinical data contain key features representing the democratized nature of the data collection process. To ensure proper use of this large clinical dataset associated with genomic features, we developed a standardized dataset named the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR), which includes four major clinical outcome endpoints. In addition to detailing major challenges and statistical limitations encountered during the effort of integrating the acquired clinical data, we present a summary that includes endpoint usage recommendations for each cancer type. These TCGA-CDR findings appear to be consistent with cancer genomics studies independent of the TCGA effort and provide opportunities for investigating cancer biology using clinical correlates at an unprecedented scale.

The Immune Landscape of Cancer. (Immunity 2018)

PMID: 29628290

We performed an extensive immunogenomic analysis of more than 10,000 tumors comprising 33 diverse cancer types by utilizing data compiled by TCGA. Across cancer types, we identified six immune subtypes (wound healing, IFN-γ dominant, inflammatory, lymphocyte depleted, immunologically quiet, and TGF-β dominant) characterized by differences in macrophage or lymphocyte signatures, Th1:Th2 cell ratio, extent of intratumoral heterogeneity, aneuploidy, extent of neoantigen load, overall cell proliferation, expression of immunomodulatory genes, and prognosis. Specific driver mutations correlated with lower (CTNNB1, NRAS, or IDH1) or higher (BRAF, TP53, or CASP8) leukocyte levels across all cancers. Multiple control modalities of the intracellular and extracellular networks (transcription, microRNAs, copy number, and epigenetic processes) were involved in tumor-immune cell interactions, both across and within immune subtypes. Our immunogenomics pipeline to characterize these heterogeneous tumors and the resulting data are intended to serve as a resource for future targeted studies to further advance the field.

The Cancer Genome Atlas Comprehensive Molecular Characterization of Renal Cell Carcinoma. (Cell Rep 2018)

PMID: 29925010

Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients. (Cancer Cell 2018)

PMID: 30078747

Our comprehensive analysis of alternative splicing across 32 The Cancer Genome Atlas cancer types from 8,705 patients detects alternative splicing events and tumor variants by reanalyzing RNA and whole-exome sequencing data. Tumors have up to 30% more alternative splicing events than normal samples. Association analysis of somatic variants with alternative splicing events confirmed known trans associations with variants in SF3B1 and U2AF1 and identified additional trans-acting variants (e.g., TADA1, PPP2R1A). Many tumors have thousands of alternative splicing events not detectable in normal samples; on average, we identified ≈930 exon-exon junctions ("neojunctions") in tumors not typically found in GTEx normals. From Clinical Proteomic Tumor Analysis Consortium data available for breast and ovarian tumor samples, we confirmed ≈1.7 neojunction- and ≈0.6 single nucleotide variant-derived peptides per tumor sample that are also predicted major histocompatibility complex-I binders ("putative neoantigens").

Comprehensive Characterization of Cancer Driver Genes and Mutations. (Cell 2018)

PMID: 30096302

A Pan-Cancer Analysis Reveals High-Frequency Genetic Alterations in Mediators of Signaling by the TGF-β Superfamily. (Cell Syst 2018)

PMID: 30268436

We present an integromic analysis of gene alterations that modulate transforming growth factor β (TGF-β)-Smad-mediated signaling in 9,125 tumor samples across 33 cancer types in The Cancer Genome Atlas (TCGA). Focusing on genes that encode mediators and regulators of TGF-β signaling, we found at least one genomic alteration (mutation, homozygous deletion, or amplification) in 39% of samples, with highest frequencies in gastrointestinal cancers. We identified mutation hotspots in genes that encode TGF-β ligands (BMP5), receptors (TGFBR2, ACVR2A, and BMPR2), and Smads (SMAD2 and SMAD4). Alterations in the TGF-β superfamily correlated positively with expression of metastasis-associated genes and with decreased survival. Correlation analyses showed the contributions of mutation, amplification, deletion, DNA methylation, and miRNA expression to transcriptional activity of TGF-β signaling in each cancer type. This study provides a broad molecular perspective relevant for future functional and therapeutic studies of the diverse cancer pathways mediated by the TGF-β superfamily.

Comprehensive Molecular Characterization of the Hippo Signaling Pathway in Cancer. (Cell Rep 2018)

PMID: 30380420

Hippo signaling has been recognized as a key tumor suppressor pathway. Here, we perform a comprehensive molecular characterization of 19 Hippo core genes in 9,125 tumor samples across 33 cancer types using multidimensional "omic" data from The Cancer Genome Atlas. We identify somatic drivers among Hippo genes and the related microRNA (miRNA) regulators, and using functional genomic approaches, we experimentally characterize YAP and TAZ mutation effects and miR-590 and miR-200a regulation for TAZ. Hippo pathway activity is best characterized by a YAP/TAZ transcriptional target signature of 22 genes, which shows robust prognostic power across cancer types. Our elastic-net integrated modeling further reveals cancer-type-specific pathway regulators and associated cancer drivers. Our results highlight the importance of Hippo signaling in squamous cell cancers, characterized by frequent amplification of YAP/TAZ, high expression heterogeneity, and significant prognostic patterns. This study represents a systems-biology approach to characterizing key cancer signaling pathways in the post-genomic era.

CellMinerCDB for Integrative Cross-Database Genomics and Pharmacogenomics Analyses of Cancer Cell Lines. (iScience 2018)

PMID: 30553813

CellMinerCDB provides a web-based resource (https://discover.nci.nih.gov/cellminercdb/) for integrating multiple forms of pharmacological and genomic analyses, and unifying the richest cancer cell line datasets (the NCI-60, NCI-SCLC, Sanger/MGH GDSC, and Broad CCLE/CTRP). CellMinerCDB enables data queries for genomics and gene regulatory network analyses, and exploration of pharmacogenomic determinants and drug signatures. It leverages overlaps of cell lines and drugs across databases to examine reproducibility and expand pathway analyses. We illustrate the value of CellMinerCDB for elucidating gene expression determinants, such as DNA methylation and copy number variations, and highlight complexities in assessing mutational burden. We demonstrate the value of CellMinerCDB in selecting drugs with reproducible activity, expand on the dominant role of SLFN11 for drug response, and present novel response determinants and genomic signatures for topoisomerase inhibitors and schweinfurthins. We also introduce LIX1L as a gene associated with mesenchymal signature and regulation of cellular migration and invasiveness.

The NCI-60 Methylome and Its Integration into CellMiner. (Cancer Res 2017)

PMID: 27923837

A unique resource for systems pharmacology and genomic studies is the NCI-60 cancer cell line panel, which provides data for the largest publicly available library of compounds with cytotoxic activity (∼21,000 compounds), including 108 FDA-approved and 70 clinical trial drugs as well as genomic data, including whole-exome sequencing, gene and miRNA transcripts, DNA copy number, and protein levels. Here, we provide the first readily usable genome-wide DNA methylation database for the NCI-60, including 485,577 probes from the Infinium HumanMethylation450k BeadChip array, which yielded DNA methylation signatures for 17,559 genes integrated into our open access CellMiner version 2.0 (https://discover.nci.nih.gov/cellminer). Among new insights, transcript versus DNA methylation correlations revealed the epithelial/mesenchymal gene functional category as being influenced most heavily by methylation. DNA methylation and copy number integration with transcript levels yielded an assessment of their relative influence for 15,798 genes, including tumor suppressor, mitochondrial, and mismatch repair genes. Four forms of molecular data were combined, providing rationale for microsatellite instability for 8 of the 9 cell lines in which it occurred. Individual cell line analyses showed global methylome patterns with overall methylation levels ranging from 17% to 84%. A six-gene model, including PARP1, EP300, KDM5C, SMARCB1, and UHRF1 matched this pattern. In addition, promoter methylation of two translationally relevant genes, Schlafen 11 (SLFN11) and methylguanine methyltransferase (MGMT), served as indicators of therapeutic resistance or susceptibility, respectively. Overall, our database provides a resource of pharmacologic data that can reinforce known therapeutic strategies and identify novel drugs and drug targets across multiple cancer types. Cancer Res; 77(3); 601-12. ©2016 AACR.

Erratum to: Tumor immune microenvironment characterization in clear cell renal cell carcinoma identifies prognostic and immunotherapeutically relevant messenger RNA signatures. (Genome Biol 2017)

PMID: 28249590

The digital revolution in phenotyping. (Brief Bioinform 2016)

PMID: 26420780

Phenotypes have gained increased notoriety in the clinical and biological domain owing to their application in numerous areas such as the discovery of disease genes and drug targets, phylogenetics and pharmacogenomics. Phenotypes, defined as observable characteristics of organisms, can be seen as one of the bridges that lead to a translation of experimental findings into clinical applications and thereby support 'bench to bedside' efforts. However, to build this translational bridge, a common and universal understanding of phenotypes is required that goes beyond domain-specific definitions. To achieve this ambitious goal, a digital revolution is ongoing that enables the encoding of data in computer-readable formats and the data storage in specialized repositories, ready for integration, enabling translational research. While phenome research is an ongoing endeavor, the true potential hidden in the currently available data still needs to be unlocked, offering exciting opportunities for the forthcoming years. Here, we provide insights into the state-of-the-art in digital phenotyping, by means of representing, acquiring and analyzing phenotype data. In addition, we provide visions of this field for future research work that could enable better applications of phenotype data.

rcellminer: exploring molecular profiles and drug response of the NCI-60 cell lines in R. (Bioinformatics 2016)

PMID: 26635141

The rcellminer R package provides a wide range of functionality to help R users access and explore molecular profiling and drug response data for the NCI-60. The package enables flexible programmatic access to CellMiner's unparalleled breadth of NCI-60 data, including gene and protein expression, copy number, whole exome mutations, as well as activity data for ∼21K compounds, with information on their structure, mechanism of action and repeat screens. Functions are available to easily visualize compound structures, activity patterns and molecular feature profiles. Additionally, embedded R Shiny applications allow interactive data exploration.

PaxtoolsR: pathway analysis in R using Pathway Commons. (Bioinformatics 2016)

PMID: 26685306

The PaxtoolsR package gives users of the R language access to pathway data represented in the BioPAX format and made available through the Pathway Commons webservice, to aid in advanced pathway analyses. Features include the extraction, merging and validation of pathway data represented in the BioPAX format. This package also provides novel pathway datasets and advanced querying features for R users through the Pathway Commons webservice, allowing users to query, extract and retrieve data and integrate these data with local BioPAX datasets.

An Integrated Metabolic Atlas of Clear Cell Renal Cell Carcinoma. (Cancer Cell 2016)

PMID: 26766592

Dysregulated metabolism is a hallmark of cancer, manifested through alterations in metabolites. We performed metabolomic profiling on 138 matched clear cell renal cell carcinoma (ccRCC)/normal tissue pairs and found that ccRCC is characterized by broad shifts in central carbon metabolism, one-carbon metabolism, and antioxidant response. Tumor progression and metastasis were associated with metabolite increases in glutathione and cysteine/methionine metabolism pathways. We develop an analytic pipeline and visualization tool (metabolograms) to bridge the gap between TCGA transcriptomic profiling and our metabolomic data, which enables us to assemble an integrated pathway-level metabolic atlas and to demonstrate discordance between transcriptome and metabolome. Lastly, expression profiling was performed on a high-glutathione cluster, which corresponds to a poor-survival subgroup in the ccRCC TCGA cohort.

Tumor immune microenvironment characterization in clear cell renal cell carcinoma identifies prognostic and immunotherapeutically relevant messenger RNA signatures. (Genome Biol 2016)

PMID: 27855702

Tumor-infiltrating immune cells have been linked to prognosis and response to immunotherapy; however, the levels of distinct immune cell subsets and the signals that draw them into a tumor, such as the expression of antigen presenting machinery genes, remain poorly characterized. Here, we employ a gene expression-based computational method to profile the infiltration levels of 24 immune cell populations in 19 cancer types.

Using drug response data to identify molecular effectors, and molecular "omic" data to identify candidate drugs in cancer. (Hum Genet 2015)

PMID: 25213708

The current convergence of molecular and pharmacological data provides unprecedented opportunities to gain insights into the relationships between the two types of data. Multiple forms of large-scale molecular data, including but not limited to gene and microRNA transcript expression, DNA somatic and germline variations from next-generation DNA and RNA sequencing, and DNA copy number from array comparative genomic hybridization are all potentially informative when one attempts to recognize the panoply of potentially influential events both for cancer progression and therapeutic outcome. Concurrently, there has also been a substantial expansion of the pharmacological data being accrued in a systematic fashion. For cancer cell lines, the National Cancer Institute cell line panel (NCI-60), the Cancer Cell Line Encyclopedia (CCLE), and the collaborative Genomics of Drug Sensitivity in Cancer (GDSC) databases all provide subsets of these forms of data. For the patient-derived data, The Cancer Genome Atlas (TCGA) provides analogous forms of genomic information along with treatment histories. Integration of these data in turn relies on the fields of statistics and statistical learning. Multiple algorithmic approaches may be chosen, depending on the data being considered, and the nature of the question being asked. Combining these algorithms with prior biological knowledge, the results of molecular biological studies, and the consideration of genes as pathways or functional groups provides both the challenge and the potential of the field. The ultimate goal is to provide a paradigm shift in the way that drugs are selected to provide a more targeted and efficacious outcome for the patient.

Alterations of DNA repair genes in the NCI-60 cell lines and their predictive value for anticancer drug activity. (DNA Repair (Amst) 2015)

PMID: 25758781

Loss of function of DNA repair (DNAR) genes is associated with genomic instability and cancer predisposition; it also makes cancer cells reliant on a reduced set of DNAR pathways to resist DNA-targeted therapy, which remains the core of the anticancer armamentarium. Because the landscape of DNAR defects across numerous types of cancers and its relation with drug activity have not been systematically examined, we took advantage of the unique drug and genomic databases of the US National Cancer Institute cancer cell lines (the NCI-60) to characterize 260 DNAR genes with respect to deleterious mutations and expression down-regulation; 169 genes exhibited a total of 549 function-affecting alterations, with 39 of them scoring as putative knockouts across 31 cell lines. Those mutations were compared to tumor samples from 12 studies of The Cancer Genome Atlas (TCGA) and The Cancer Cell Line Encyclopedia (CCLE). Based on this compendium of alterations, we determined which DNAR genomic alterations predicted drug response for 20,195 compounds present in the NCI-60 drug database. Among 242 DNA damaging agents, 202 showed associations with at least one DNAR genomic signature. In addition to SLFN11, the Fanconi anemia-scaffolding gene SLX4 (FANCP/BTBD12) stood out among the genes most significantly related with DNA synthesis and topoisomerase inhibitors. Depletion and complementation experiments validated the causal relationship between SLX4 defects and sensitivity to raltitrexed and cytarabine in addition to camptothecin. Therefore, we propose new rational uses for existing anticancer drugs based on a comprehensive analysis of DNAR genomic parameters.

Predicted Role of NAD Utilization in the Control of Circadian Rhythms during DNA Damage Response. (PLoS Comput Biol 2015)

PMID: 26020938

The circadian clock is a set of regulatory steps that oscillate with a period of approximately 24 hours influencing many biological processes. These oscillations are robust to external stresses, and in the case of genotoxic stress (i.e. DNA damage), the circadian clock responds through phase shifting with primarily phase advancements. The effect of DNA damage on the circadian clock and the mechanism through which this effect operates remains to be thoroughly investigated. Here we build an in silico model to examine damage-induced circadian phase shifts by investigating a possible mechanism linking circadian rhythms to metabolism. The proposed model involves two DNA damage response proteins, SIRT1 and PARP1, that are each consumers of nicotinamide adenine dinucleotide (NAD), a metabolite involved in oxidation-reduction reactions and in ATP synthesis. This model builds on two key findings: 1) that SIRT1 (a protein deacetylase) is involved in both the positive (i.e. transcriptional activation) and negative (i.e. transcriptional repression) arms of the circadian regulation and 2) that PARP1 is a major consumer of NAD during the DNA damage response. In our simulations, we observe that increased PARP1 activity may be able to trigger SIRT1-induced circadian phase advancements by decreasing SIRT1 activity through competition for NAD supplies. We show how this competitive inhibition may operate through protein acetylation in conjunction with phosphorylation, consistent with reported observations. These findings suggest a possible mechanism through which multiple perturbations, each dominant during different points of the circadian cycle, may result in the phase advancement of the circadian clock seen during DNA damage.

Systems Biology Graphical Notation: Entity Relationship language Level 1 Version 2. (J Integr Bioinform 2015)

PMID: 26528562

The Systems Biology Graphical Notation (SBGN) is an international community effort for standardized graphical representations of biological pathways and networks. The goal of SBGN is to provide unambiguous pathway and network maps for readers with different scientific backgrounds as well as to support efficient and accurate exchange of biological knowledge between different research communities, industry, and other players in systems biology. Three SBGN languages, Process Description (PD), Entity Relationship (ER) and Activity Flow (AF), allow for the representation of different aspects of biological and biochemical systems at different levels of detail. The SBGN Entity Relationship language (ER) represents biological entities and their interactions and relationships within a network. SBGN ER focuses on all potential relationships between entities without considering temporal aspects. The nodes (elements) describe biological entities, such as proteins and complexes. The edges (connections) provide descriptions of interactions and relationships (or influences), e.g., complex formation, stimulation and inhibition. Among all three languages of SBGN, ER is the closest to protein interaction networks in biological literature and textbooks, but its well-defined semantics offer a superior precision in expressing biological knowledge.

Systems Biology Graphical Notation: Activity Flow language Level 1 Version 1.2. (J Integr Bioinform 2015)

PMID: 26528563

The Systems Biology Graphical Notation (SBGN) is an international community effort for standardized graphical representations of biological pathways and networks. The goal of SBGN is to provide unambiguous pathway and network maps for readers with different scientific backgrounds as well as to support efficient and accurate exchange of biological knowledge between different research communities, industry, and other players in systems biology. Three SBGN languages, Process Description (PD), Entity Relationship (ER) and Activity Flow (AF), allow for the representation of different aspects of biological and biochemical systems at different levels of detail. The SBGN Activity Flow language represents the influences of activities among various entities within a network. Unlike SBGN PD and ER, which focus on the entities and their relationships with others, SBGN AF puts the emphasis on the functions (or activities) performed by the entities, and their effects on the functions of the same or other entities. The nodes (elements) describe the biological activities of the entities, such as protein kinase activity, binding activity or receptor activity, which can be easily mapped to Gene Ontology molecular function terms. The edges (connections) provide descriptions of relationships (or influences) between the activities, e.g., positive influence and negative influence. Among all three languages of SBGN, AF is the closest to signaling pathways in biological literature and textbooks, but its well-defined semantics offer a superior precision in expressing biological knowledge.

PathVisio-Faceted Search: an exploration tool for multi-dimensional navigation of large pathways. (Bioinformatics 2013)

PMID: 23547033

The PathVisio-Faceted Search plugin helps users explore and understand complex pathways by overlaying experimental data and data from webservices, such as Ensembl BioMart, onto diagrams drawn using formalized notations in PathVisio. The plugin then provides a filtering mechanism, known as a faceted search, to find and highlight diagram nodes (e.g. genes and proteins) of interest based on imported data. The tool additionally provides a flexible scripting mechanism to handle complex queries.

SIRT1/PARP1 crosstalk: connecting DNA damage and metabolism. (Genome Integr 2013)

PMID: 24360018

An intricate network regulates the activities of SIRT1 and PARP1 proteins and continues to be uncovered. Both SIRT1 and PARP1 share a common co-factor, nicotinamide adenine dinucleotide (NAD+), and several common substrates, including regulators of DNA damage response and circadian rhythms. We review this complex network using an interactive Molecular Interaction Map (MIM) to explore the interplay between these two proteins. Here we discuss how NAD+ competition and post-transcriptional/translational feedback mechanisms create a regulatory network sensitive to environmental cues, such as genotoxic stress and metabolic states, and examine the role of those interactions in DNA repair and ultimately, cell fate decisions.

PathVisio-Validator: a rule-based validation plugin for graphical pathway notations. (Bioinformatics 2012)

PMID: 22199389

The PathVisio-Validator plugin aims to simplify the task of producing biological pathway diagrams that follow graphical standardized notations, such as Molecular Interaction Maps or the Systems Biology Graphical Notation. This plugin assists in the creation of pathway diagrams by ensuring correct usage of a notation, and thereby reducing ambiguity when diagrams are shared among biologists. Rulesets, needed in the validation process, can be generated for any graphical notation that a developer desires, using either Schematron or Groovy. The plugin also provides support for filtering validation results, validating on a subset of rules, and distinguishing errors and warnings.

Gene expression profiles of the NCI-60 human tumor cell lines define molecular interaction networks governing cell migration processes. (PLoS One 2012)

PMID: 22570691

Although there is extensive information on gene expression and molecular interactions in various cell types, integrating those data in a functionally coherent manner remains challenging. This study explores the premise that genes whose expression at the mRNA level is correlated over diverse cell lines are likely to function together in a network of molecular interactions. We previously derived expression-correlated gene clusters from the database of the NCI-60 human tumor cell lines and associated each cluster with function categories of the Gene Ontology (GO) database. From a cluster rich in genes associated with GO categories related to cell migration, we extracted 15 genes that were highly cross-correlated; prominent among them were RRAS, AXL, ADAM9, FN14, and integrin-beta1. We then used those 15 genes as bait to identify other correlated genes in the NCI-60 database. A survey of current literature disclosed not only that many of the expression-correlated genes engaged in molecular interactions related to migration, invasion, and metastasis, but that highly cross-correlated subsets of those genes engaged in specific cell migration processes. We assembled this information in molecular interaction maps (MIMs) that depict networks governing 3 cell migration processes: degradation of extracellular matrix, production of transient focal complexes at the leading edge of the cell, and retraction of the rear part of the cell. Also depicted are interactions controlling the release and effects of calcium ions, which may regulate migration in a spatiotemporal manner in the cell. The MIMs and associated text comprise a detailed and integrated summary of what is currently known or surmised about the role of the expression cross-correlated genes in molecular networks governing those processes.

Software support for SBGN maps: SBGN-ML and LibSBGN. (Bioinformatics 2012)

PMID: 22581176

LibSBGN is a software library for reading, writing and manipulating Systems Biology Graphical Notation (SBGN) maps stored using the recently developed SBGN-ML file format. The library (available in C++ and Java) makes it easy for developers to add SBGN support to their tools, whereas the file format facilitates the exchange of maps between compatible software applications. The library also supports validation of maps, which simplifies the task of ensuring compliance with the detailed SBGN specifications. With this effort we hope to increase the adoption of SBGN in bioinformatics tools, ultimately enabling more researchers to visualize biological knowledge in a precise and unambiguous manner.

A formal MIM specification and tools for the common exchange of MIM diagrams: an XML-Based format, an API, and a validation method. (BMC Bioinformatics 2011)

PMID: 21586134

The Molecular Interaction Map (MIM) notation offers a standard set of symbols and rules on their usage for the depiction of cellular signaling network diagrams. Such diagrams are essential for disseminating biological information in a concise manner. A lack of software tools for the notation restricts wider usage of the notation. Development of software is facilitated by a more detailed specification regarding software requirements than has previously existed for the MIM notation.

PathVisio-MIM: PathVisio plugin for creating and editing Molecular Interaction Maps (MIMs). (Bioinformatics 2011)

PMID: 21636591

A plugin for the Java-based PathVisio pathway editor has been developed to help users draw diagrams of bioregulatory networks according to the Molecular Interaction Map (MIM) notation. Together with the core PathVisio application, this plugin presents a simple to use and cross-platform application for the construction of complex MIM diagrams with the ability to annotate diagram elements with comments, literature references and links to external databases. This tool extends the capabilities of the PathVisio pathway editor by providing both MIM-specific glyphs and support for a MIM-specific markup language file format for exchange with other MIM-compatible tools and diagram validation.

Evidence of statistical epistasis between DISC1, CIT and NDEL1 impacting risk for schizophrenia: biological validation with functional neuroimaging. (Hum Genet 2010)

PMID: 20084519

The etiology of schizophrenia likely involves genetic interactions. DISC1, a promising candidate susceptibility gene, encodes a protein which interacts with many other proteins, including CIT, NDEL1, NDE1, FEZ1 and PAFAH1B1, some of which also have been associated with psychosis. We tested for epistasis between these genes in a schizophrenia case-control study using machine learning algorithms (MLAs: random forest, generalized boosted regression and Monte Carlo logic regression). Convergence of MLAs revealed a subset of seven SNPs that were subjected to 2-SNP interaction modeling using likelihood ratio tests for nested unconditional logistic regression models. Of the C(7,2) = 21 interactions, four were significant at the α = 0.05 level: DISC1 rs1411771-CIT rs10744743 OR = 3.07 (1.37, 6.98) p = 0.007; CIT rs3847960-CIT rs203332 OR = 2.90 (1.45, 5.79) p = 0.003; CIT rs3847960-CIT rs440299 OR = 2.16 (1.04, 4.46) p = 0.038; one survived Bonferroni correction (NDEL1 rs4791707-CIT rs10744743 OR = 4.44 (2.22, 8.88) p = 0.00013). Three of four interactions were validated via functional magnetic resonance imaging (fMRI) in an independent sample of healthy controls; risk associated alleles at both SNPs predicted prefrontal cortical inefficiency during the N-back task, a schizophrenia-linked intermediate biological phenotype: rs3847960-rs440299; rs1411771-rs10744743, rs4791707-rs10744743 (SPM5 p < 0.05, corrected), although we were unable to statistically replicate the interactions in other clinical samples. Interestingly, the CIT SNPs are proximal to exons that encode the DISC1 interaction domain. In addition, the 3' UTR DISC1 rs1411771 is predicted to be an exonic splicing enhancer and the NDEL1 SNP is ~3,000 bp from the exon encoding the region of NDEL1 that interacts with the DISC1 protein, giving a plausible biological basis for epistasis signals validated by fMRI.
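
As a rough illustration of the multiple-testing arithmetic in this abstract, the sketch below counts the 2-SNP interactions among seven SNPs and applies a plain Bonferroni threshold to the four reported p-values. The p-values, SNP pairs and α come from the abstract; the variable names are illustrative, and the paper's exact correction procedure may differ in detail.

```python
from math import comb

ALPHA = 0.05   # significance level stated in the abstract
N_SNPS = 7     # SNPs retained after the machine learning step

# Number of distinct 2-SNP interactions among seven SNPs.
n_tests = comb(N_SNPS, 2)

# Bonferroni-adjusted per-test threshold (assumed: simple alpha / n_tests).
threshold = ALPHA / n_tests

# p-values as reported in the abstract, keyed by SNP pair.
reported = {
    "rs1411771-rs10744743": 0.007,
    "rs3847960-rs203332": 0.003,
    "rs3847960-rs440299": 0.038,
    "rs4791707-rs10744743": 0.00013,
}

# Pairs whose p-value falls below the corrected threshold.
survivors = [pair for pair, p in reported.items() if p < threshold]
print(n_tests, round(threshold, 5), survivors)
```

Under these assumptions only the NDEL1 rs4791707-CIT rs10744743 interaction falls below the corrected threshold, which matches the abstract's statement that one interaction survived Bonferroni correction.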

The BioPAX community standard for pathway data sharing. (Nat Biotechnol 2010)

PMID: 20829833

Biological Pathway Exchange (BioPAX) is a standard language to represent biological pathways at the molecular and cellular level and to facilitate the exchange of pathway data. The rapid growth of the volume of pathway data has spurred the development of databases and computational tools to aid interpretation; however, use of these data is hampered by the current fragmentation of pathway information across many databases with incompatible formats. BioPAX, which was created through a community process, solves this problem by making pathway data substantially easier to collect, index, interpret and share. BioPAX can represent metabolic and signaling pathways, molecular and genetic interactions and gene regulation networks. Using BioPAX, millions of interactions, organized into thousands of pathways, from many organisms are available from a growing number of databases. This large amount of pathway data in a computable form will support visualization, analysis and biological discovery.

Biological validation of increased schizophrenia risk with NRG1, ERBB4, and AKT1 epistasis via functional neuroimaging in healthy controls. (Arch Gen Psychiatry 2010)

PMID: 20921115

NRG1 is a schizophrenia candidate gene and plays an important role in brain development and neural function. Schizophrenia is a complex disorder, with etiology likely due to epistasis.

The Systems Biology Graphical Notation. (Nat Biotechnol 2009)

PMID: 19668183

Circuit diagrams and Unified Modeling Language diagrams are just two examples of standard visual languages that help accelerate work by promoting regularity, removing ambiguity and enabling software tool support for communication of complex information. Ironically, despite having one of the highest ratios of graphical to textual information, biology still lacks standard graphical notations. The recent deluge of biological knowledge makes addressing this deficit a pressing concern. Toward this goal, we present the Systems Biology Graphical Notation (SBGN), a visual language developed by a community of biochemists, modelers and computer scientists. SBGN consists of three complementary languages: process diagram, entity relationship diagram and activity flow diagram. Together they enable scientists to represent networks of biochemical interactions in a standard, unambiguous way. We believe that SBGN will foster efficient and accurate representation, visualization, storage, exchange and reuse of information on all kinds of biological knowledge, from gene regulation, to metabolism, to cellular signaling.

An evaluation of power and type I error of single-nucleotide polymorphism transmission/disequilibrium-based statistical methods under different family structures, missing parental data, and population stratification. (Am J Hum Genet 2007)

PMID: 17160905

Researchers conducting family-based association studies have a wide variety of transmission/disequilibrium (TD)-based methods to choose from, but few guidelines exist in the selection of a particular method to apply to available data. Using a simulation study design, we compared the power and type I error of eight popular TD-based methods under different family structures, frequencies of missing parental data, genetic models, and population stratifications. No method was uniformly most powerful under all conditions, but type I error was appropriate for nearly every test statistic under all conditions. Power varied widely across methods, with a 46.5% difference in power observed between the most powerful and the least powerful method when 50% of families consisted of an affected sib pair and one parent genotyped under an additive genetic model and a 35.2% difference when 50% of families consisted of a single affection-discordant sibling pair without parental genotypes available under an additive genetic model. Methods were generally robust to population stratification, although some slightly less so than others. The choice of a TD-based test statistic should be dependent on the predominant family structure ascertained, the frequency of missing parental genotypes, and the assumed genetic model.
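
For orientation, the classic transmission/disequilibrium test can be sketched as a McNemar-type statistic on allele transmissions from heterozygous parents to affected offspring. The abstract does not name the eight methods compared, so this is only the textbook baseline, not necessarily one of them; the function name and the counts are hypothetical.

```python
def tdt_statistic(transmitted: int, untransmitted: int) -> float:
    """Classic TDT chi-square statistic (1 degree of freedom).

    transmitted:   times heterozygous parents passed the candidate allele
                   to an affected child
    untransmitted: times they passed the other allele instead
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative transmissions")
    return (transmitted - untransmitted) ** 2 / total


# Hypothetical counts, for illustration only.
stat = tdt_statistic(60, 40)
print(stat)
```

Only heterozygous parents contribute to this statistic. That dependence on informative parental genotypes helps explain why missing parental genotypes and family structure, the factors varied in the simulation, can affect power.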

snp.plotter: an R-based SNP/haplotype association and linkage disequilibrium plotting package. (Bioinformatics 2007)

PMID: 17234637

snp.plotter is a newly developed R package which produces high-quality plots of results from genetic association studies. The main features of the package include options to display a linkage disequilibrium (LD) plot below the P-value plot using either the r² or D' LD metric, to set the X-axis to equal spacing or to use the physical map of markers, and to specify plot labels, colors, symbols and LD heatmap color scheme. snp.plotter can plot single SNP and/or haplotype data and simultaneously plot multiple sets of results. R is a free software environment for statistical computing and graphics available for most platforms. The proposed package provides a simple way to convey both association and LD information in a single appealing graphic for genetic association studies.
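
The two LD metrics mentioned above have standard definitions for a pair of biallelic loci, sketched below from haplotype and allele frequencies. This is not snp.plotter's code (the package is written in R); the function name and inputs are hypothetical, and the sketch returns the absolute value of D'.

```python
def ld_metrics(p_ab: float, p_a: float, p_b: float) -> tuple[float, float]:
    """Return (|D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    # Raw disequilibrium coefficient.
    d = p_ab - p_a * p_b

    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # Squared correlation between alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d ** 2 / denom if denom > 0 else 0.0
    return d_prime, r2


# Two loci in complete LD, and two loci in linkage equilibrium.
print(ld_metrics(0.5, 0.5, 0.5))
print(ld_metrics(0.25, 0.5, 0.5))
```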