2026 NCI Summer School on Algorithmic Cancer Biology

Link to SSACB26 Program

Russell Schwartz, Carnegie Mellon University

"Constrained Optimization Problems in Cancer Clonal Evolution from Multimodal Data Integration"

This lecture will examine how tools for constrained optimization, a broadly useful framework for describing computational inferences, apply in a variety of contexts in studying clonal evolution in cancers. We will review some of the basic biology and biotechnology applied in studying cancer clonal evolution today. We will then explore some basic concepts in defining constrained optimization problems and tools for solving them. Finally, we will explore how this framework applies to diverse problems in studying cancer evolution, including genomic deconvolution, tumor phylogenetics, integrating multimodal data sources, and optimizing complex multimodal study designs.
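
As a toy illustration of the framing in this abstract (not taken from the lecture itself), genomic deconvolution can be posed as a constrained least-squares problem: choose a mixture fraction f, constrained to lie in [0, 1], so that f*a + (1 - f)*b best matches a bulk profile. The function name, the two-component setup, and the numbers below are all assumptions made for this sketch.

```python
def deconvolve_two_components(bulk, comp_a, comp_b):
    """Return f in [0, 1] minimizing sum((bulk - (f*a + (1 - f)*b))**2).

    Hypothetical helper for illustration only: two known component
    profiles and a single unknown mixture fraction.
    """
    diff = [a - b for a, b in zip(comp_a, comp_b)]
    resid = [y - b for y, b in zip(bulk, comp_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("components are identical; the fraction is not identifiable")
    # Unconstrained least-squares minimizer of the one-dimensional quadratic.
    f_unconstrained = sum(r * d for r, d in zip(resid, diff)) / denom
    # The objective is a convex quadratic in f, so the optimum under the
    # box constraint 0 <= f <= 1 is the unconstrained minimizer clipped to it.
    return min(1.0, max(0.0, f_unconstrained))


comp_a = [2.0, 0.0, 4.0]   # assumed profile of clone A
comp_b = [0.0, 2.0, 0.0]   # assumed profile of clone B
bulk = [0.5, 1.5, 1.0]     # constructed as 0.25*A + 0.75*B
fraction = deconvolve_two_components(bulk, comp_a, comp_b)
```

With more than two components the constraint set becomes a simplex and a closed-form clip no longer suffices; a general-purpose solver would typically be used instead.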

Dario Ghersi, University of Nebraska

“More Than Meets the Slide: Computational Approaches in Digital Pathology for Segmentation and Registration”

Digital pathology is transforming our ability to study tissue architecture of human cancers, yet much of the information encoded in histological slides remains underutilized. In this talk, I will present computational approaches that move beyond single-slide analysis by focusing on semantic segmentation and cross-slide registration. First, I will describe the development of a large-scale, human-in-the-loop segmentation dataset for pancreatic cancer, and how it enabled systematic benchmarking of a range of machine learning models for accurate and biologically meaningful annotation of tumor and microenvironmental structures. I will then discuss methods for registering structures across serial sections to support 3D reconstruction of solid tumors. This problem introduces significant algorithmic challenges, including sparse correspondences, structural heterogeneity, and deformation across slices, but also opens the door to new biological insights into tumor organization and progression. Together, these approaches highlight how integrating segmentation and registration can reveal spatial features of cancer that are not apparent from individual slides alone.

Rachel Karchin, Johns Hopkins University

Tumors are not static targets. They are evolving populations of competing clonal lineages that shift dominance under therapeutic pressure. By integrating serial tissue biopsies and liquid biopsies with computational clonal tracking, we can uncover expanding and contracting lineages in real time. I will present some examples of how computational models can potentially support adaptive, evolution-informed therapy selection, transforming cancer treatment from a fixed strategy into a medicine that learns.

Jingyi Jessica Li, Fred Hutchinson Cancer Center

“FDR Calibration with Synthetic Null Data: Controlling False Discoveries While Maintaining Power in High-Throughput Biology”

False discovery rate (FDR) control is essential for reliable inference in high-throughput biology, yet it is increasingly compromised in modern analyses due to data reuse, selection bias, and model misspecification. Common remedies such as data splitting or knockoff constructions often achieve FDR control at the cost of power loss and changes to existing workflows. In this talk, I present a unified framework for calibrated inference via synthetic null data, which achieves FDR control while preserving power and leaving original data and analysis pipelines intact. The central idea is to generate data-driven synthetic null data as in silico negative controls, apply the same estimation or testing procedure to both observed and synthetic data, and use their parallel contrast to calibrate significance thresholds. This framework was motivated by a common “double-dipping” issue in single-cell RNA-seq analysis, where the same data are used both to identify cell clusters and to test for cluster-specific marker genes, leading to clustering-induced bias. This challenge led to ClusterDE, which mitigates post-clustering bias in marker discovery across single-cell, spatial, bulk, and microbiome data. Building on this idea, we developed Nullstrap, a general framework for FDR-controlled variable selection in high-dimensional models without data splitting or knockoffs. I will then present Nullstrap-DE, an application of this framework for RNA-seq differential expression (DE) analysis, which calibrates popular tools such as DESeq2 and edgeR to improve FDR control under mild model violations while retaining high power. Together, these methods illustrate how synthetic null data provide a flexible and principled route to FDR calibration in high-throughput biological data analysis.
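
The "parallel contrast" idea in this abstract can be sketched with a generic contrast-score threshold. This is an illustrative assumption, not the actual ClusterDE or Nullstrap procedure: each feature gets a score from the observed data and a score from the synthetic null data, and the cutoff is the smallest contrast value at which the estimated false discovery proportion falls below the target. The function names, the "+1" conservative correction, and the toy scores are all hypothetical.

```python
def contrast_threshold(observed, synthetic_null, target_fdr):
    """Smallest cutoff t with estimated FDR <= target_fdr, or None if none exists."""
    contrasts = [o - n for o, n in zip(observed, synthetic_null)]
    candidates = sorted({abs(c) for c in contrasts if c != 0})
    for t in candidates:
        # Features whose null score beats the observed score by at least t
        # serve as a proxy for false discoveries (plus one, to be conservative).
        false_proxy = 1 + sum(1 for c in contrasts if c <= -t)
        discoveries = sum(1 for c in contrasts if c >= t)
        if discoveries > 0 and false_proxy / discoveries <= target_fdr:
            return t
    return None


def select_features(observed, synthetic_null, target_fdr):
    """Indices of features whose contrast reaches the calibrated cutoff."""
    t = contrast_threshold(observed, synthetic_null, target_fdr)
    if t is None:
        return []
    return [i for i, (o, n) in enumerate(zip(observed, synthetic_null)) if o - n >= t]


# Toy scores: the first six features carry signal, the rest do not.
observed = [9.0, 8.5, 8.0, 7.5, 7.0, 6.5, 1.0, 0.75, 1.25, 0.5, 1.0, 1.0]
synthetic_null = [1.0, 0.5, 1.0, 0.5, 1.0, 0.5, 1.5, 0.5, 1.0, 1.25, 0.5, 1.25]
selected = select_features(observed, synthetic_null, target_fdr=0.2)
```

Note that the analysis pipeline producing the scores is untouched; only the significance cutoff is calibrated, which mirrors the workflow-preserving property emphasized in the abstract.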

Victoria Popic, Broad Institute

Deep Reinforcement Learning for Hard Combinatorial Problems in Genomics

Many key problems in genomics are computationally hard and commonly tackled with hand-crafted heuristics. In this tutorial, we will introduce an alternative approach based on deep reinforcement learning (DRL), where an agent learns to construct a solution step by step by selecting actions that maximize its expected cumulative reward. We will define the core components of this framework -- states, actions, and rewards -- and show how combinatorial problems in genomics can be formulated in these terms. We will focus on graph-based state representations, which naturally capture the structure of many genomic problems, using haplotype assembly and cancer phylogeny inference as driving examples.

Kai Tan, Children’s Hospital of Philadelphia

“Algorithms for modeling tumor microenvironment from spatial omics maps”

Understanding the tumor microenvironment (TME) is critical for advancing cancer diagnostics, prognostics, and therapeutic strategies. Spatial omics technologies have emerged as powerful tools for characterizing the complex cellular and molecular landscape of tumors. This seminar will discuss advancements and challenges in this rapidly evolving field. Specifically, I will introduce three algorithms designed for 1) automated segmentation and cell annotation of imaging-based spatial omics maps; 2) identification and comparison of tissue cellular neighborhoods; and 3) de novo construction of cell-specific signaling pathways. I will present case studies from recent research to illustrate how these algorithms can reveal insights into how dynamic cell-cell interactions in the TME contribute to tumor heterogeneity, therapy response, and patient outcomes.

Atul Deshpande, Johns Hopkins University

Trust, but calibrate: from spatial measurement to mechanistic models of the tumor microenvironment

Predicting which patients respond to cancer immunotherapy increasingly depends on what we can read from a piece of their tumor — but a description of that tumor, however high-dimensional, is not the same as a mechanism we can simulate, perturb, and trust. This talk traces an arc through a small set of methods developed in our lab and with collaborators. I will first describe SpaceMarkers, which analyzes spatially overlapping cell populations to recover molecular signatures of cell–cell interactions in spot-based spatial transcriptomics, and FunCN, which quantifies functional cellular neighborhoods by spatial weighting that goes beyond simple proportions. I will then argue that descriptors from both can serve as constraints on a spatial quantitative-systems-pharmacology model of the tumor microenvironment, calibrated to patient samples by approximate Bayesian computation, with a hepatocellular carcinoma example where FunCN descriptors both inform calibration and validate the simulated tumors. The talk closes by examining the fidelity of the data source upstream: 2D cell segmentations on a 3D tissue section result in an inaccurate picture of the tissue landscape, and I will describe TRACER, a new method for inferring a more complete tissue landscape by refining cellular profiles and detecting cells missed by standard segmentation pipelines.

Wenyi Wang, MD Anderson

“Transfer Learning for Survival-based Clustering of Predictors with an Application to TP53 Mutation Annotation”

TP53 is the most frequently mutated gene in human cancers, and germline mutations in TP53 cause Li-Fraumeni syndrome (LFS), a hereditary predisposition to diverse cancers. Accurate annotation of TP53 mutations based on their survival effects is critical for informed LFS patient management. Motivated by this need, we develop a new approach for Survival-based Clustering of Predictors (SCP) by identifying homogeneous coefficients in Cox regression. We formulate this task as a fusion penalized Cox regression problem and provide an efficient computational algorithm. A nonconvex distance-to-set penalty is adopted to facilitate parameter tuning and improve estimation accuracy. To overcome data limitations, we further develop TL-SCP, a transfer learning extension that borrows coefficient ranking information from a source dataset under the assumption of similar ranking patterns between source and target. TL-SCP integrates ranking information through weighted rank averaging, allowing flexibility in accommodating cohort heterogeneity while maintaining model simplicity. Simulation studies demonstrate TL-SCP’s superior performance over SCP in clustering recovery and coefficient estimation. In the application of TP53 mutation annotation where we utilize non-LFS germline TP53 mutation carriers as a source cohort for the target LFS cohort, TL-SCP identifies biologically meaningful TP53 mutation clusters and offers improved clinical interpretability compared to experiment-based annotations.

Roded Sharan, Tel Aviv University

“Multi Modal Data Integration for Gene Representation Learning”

The data deluge in biology calls for computational approaches that can integrate multiple datasets of different types to build a holistic view of biological processes or structures of interest. An emerging paradigm in this domain is the unsupervised learning of data embeddings that can be used for downstream clustering and classification tasks. In this talk I will describe recent work in my group to integrate diverse data types for gene representation learning with applications to gene module detection and gene function prediction.

Peter Van Loo, MD Anderson

“Molecular Archeology of Cancer”

The cancer genome carries an archeological record of the tumor’s past. Over the past years, we have developed several approaches to mine that archeological record, which we collectively call 'molecular archeology of cancer'. Using these approaches, we are able to infer the subclonal architecture and evolutionary history of tumors. We applied these approaches in a large-scale pan-cancer setting, showing that intra-tumor heterogeneity is pervasive across cancers and that the timelines of tumor evolution span multiple years to decades. Key driver events in tumor evolution typically occur early, and copy number gains often accumulate as punctuated bursts, commonly after genome doubling. Late genome doubling is frequent in cancer evolution and is typically followed by an increase in chromosomal instability. Our approaches increase the evolutionary information that can be obtained from tumor genome sequences and, therefore, improve our understanding of the developmental history of cancer.

Yuichi Shiraishi, National Cancer Center Japan

"Centromere Variation and Its Role in Cancer Chromosomal Rearrangements"

Large-scale pangenome efforts, including the Human Pangenome Reference Consortium (HPRC) and the Telomere-to-Telomere (T2T) Consortium, are producing an expanding collection of near-complete genome assemblies. For the first time, these resources have resolved highly repetitive and structurally complex regions that were largely inaccessible to short-read sequencing, including centromeres, segmental duplications, and satellite arrays. This progress is now providing new opportunities in cancer genomics, where structural rearrangements involving these difficult-to-analyze regions are frequently observed yet remain poorly characterized.

As an example of the application of pangenome resources, we introduce a k-mer-based computational framework, ascairn, to infer centromere haplotypes from short-read whole-genome sequencing (WGS) data. We applied ascairn to investigate the genomic structure underlying the 1p/19q co-deletion, a highly recurrent centromere-involving translocation in oligodendrogliomas. Analyzing short-read WGS data from 142 cases with 1p/19q co-deletion using rare k-mers, we showed that the breakpoints of the 1p/19q co-deletion map to aHOR arrays on chromosome 1 (D1Z7) and chromosome 19 (D19Z3), with a clear positional relationship to kinetochore attachment sites; this finding was validated by long-read sequencing of two 1p/19q co-deletion–positive cases. We also demonstrate that ascairn can be effectively used for interrogating the diversity of centromere structures and their geographic distributions across populations.

As pangenome resources expand, they will enable systematic analysis of previously inaccessible repetitive regions in large cancer cohorts, revealing mechanisms of chromosomal rearrangements.

Uthsav Chitra, Johns Hopkins University

“Mapping the geometry of spatial gene expression”

Recent spatial transcriptomics (ST) technologies make high-throughput measurements of gene expression at thousands of locations in a 2-D tissue slice. However, due to cost and technological limitations, these measurements are highly sparse—thus complicating the identification and analysis of spatial gene expression patterns, particularly in spatially heterogeneous tissues such as tumors. In this talk, I will present machine learning approaches that overcome these limitations by modeling the underlying geometry of a 2-D tissue slice. First, I will present GASTON and GASTON-Mix, unsupervised and interpretable deep learning algorithms which learn "topographic maps" of a 2-D tissue slice. Then I will present SLOPER, an algorithm which leverages point processes and score matching to learn spatial gradient vector fields that characterize spatial variation in the expression of individual genes. I will show how our algorithms uncover subtle spatial gene expression patterns across tumors and other biological systems.

Ben Raphael, Princeton University

Spatial multi-omics of cancer

Tumors have an intricate spatial organization consisting of cancerous cells growing within the microenvironment of the host tissue. The cell types and cell states within this microenvironment are often related to treatment response and resistance, and thus characterizing the spatial organization of tumors is of great importance. Recent spatial technologies yield high-throughput and high-dimensional measurements of RNA, proteins, metabolites, and other modalities at thousands of locations within tissue sections, revealing spatial patterns of cell types and molecular activity. However, current datasets are often sparse and incomplete due to technological and cost constraints. I will present computational approaches that use machine learning and AI technologies to overcome this sparsity by integrating measurements across multiple slices and measurement modalities. We apply the resulting methods to analyze spatial variation in gene/protein expression across multiple cancer types and to reconstruct three-dimensional multi-modal tumor atlases.

Mona Singh, Princeton University

Details of the talk will be posted soon

Mohammed El Kebir, University of Illinois at Urbana-Champaign

“GReinSS: Generative Modeling via Reinforcement Learning for Latent Structured States”

Many scientific problems require inferring unobserved mechanistic latent states from indirect observations. While classical approaches, including expectation-maximization, do not scale to combinatorially large spaces, deep learning approaches such as variational autoencoders typically form artificial latent states rather than reconstructing the mechanistic ground-truth states. Here, we introduce GReinSS, a policy learning framework that uses dynamically rescaled rewards to learn latent state distributions that maximize the observed data likelihood. We show that GReinSS accurately reconstructs simulated latent sets and latent graphs, outperforming alternative policy learning and generative modeling baselines. Additionally, GReinSS reconstructs isoforms from real short-read RNA sequencing data that better match orthogonal long-read sequencing detected isoforms than the standard RSEM algorithm. Overall, GReinSS is a principled and practically effective approach for generative modeling and inference of combinatorial latent states from indirect observations.

Gregoire Altan-Bonnet, NCI

"Decoding immune response using the latent geometry of T cells’ activation trajectories: from bulk data to single-cell resolution"

Immune responses arise from the coordinated behavior of diverse leukocytes responding to complex tissue environments. How heterogeneous and dynamic activation states are integrated into coherent functional immune programs remains poorly understood. In this lecture, I will discuss how we built and leveraged the IMMUNOtron (a high-throughput lab automation platform for scalable immune profiling, combined with machine learning) to show that stochastic T cell activation states contain sufficient structure to predict antigen identity, functional outputs, and antagonistic immune responses across immunological environments. Critically, our framework can be generalized: it is applicable across cell types, antigenic contexts, and experimental systems, establishing a broadly deployable strategy for decoding immune information at single-cell resolution. Our analysis reveals a time-dependent combinatorial code in which a subset of markers encodes a highly resolved continuous pattern of activation, organized along a 1D interpretable manifold that captures the hierarchical nature of antigen discrimination by T cells. This latent geometry is robust across computational settings and experimental conditions, enabling alignment of datasets and direct comparison of equivalent activation states under ligand mixtures, including antagonistic combinations. Modes inferred from single-cell embeddings quantitatively match independent collective cytokine measurements and further identify single-cell signatures of immune antagonism (collaboration with Paul François’s group - Université de Montréal & MILA).
Together, these results show how an apparently digital self/non-self decision can coexist with graded, high-dimensional ligand discrimination within a low-dimensional manifold, and demonstrate how machine learning approaches (deployed at scale through laboratory automation) can decode the structure of immune information processing from single-cell data to bulk responses, across diverse immunological settings.

Justin Zook, NIST

Somatic variant benchmarks from haplotype-resolved assemblies of a pancreatic tumor cell line and normal tissues

Standard cancer genome sequencing often misses mutations and structural changes hidden within highly repetitive DNA. To overcome this limitation, the Genome in a Bottle Consortium constructed near-complete, haplotype-resolved genome assemblies of a pancreatic cancer cell line alongside its matched normal tissue. We resolved the sequence of all 35 tumor chromosomes, uncovering complex structural phenomena like fused centromeres and massive chromosomal inversions. By directly comparing these tumor and normal assemblies, we identified over 7,000 variants difficult to resolve on GRCh38 and revealed that standard reference genomes misrepresent nearly half of all small somatic mutations. Ultimately, these paired benchmarks demonstrate that conventionally ignored genomic regions harbor extensive cancer-associated changes, establishing a foundation for precision oncology algorithms.

Misha Kolmogorov, CDSL, NCI

The power of somatic and germline variant phasing to improve reconstruction of cancer genome architecture

A common signature of cancer genomes is a complex, rearranged karyotype, characterized by acquired gains or losses of chromosomal material, referred to as somatic copy number alterations (CNAs). Identification of haplotype-specific CNAs from bulk sequencing data is a key step in many short-read cancer genomic workflows; however, short reads have a limited phasing range. In contrast, long reads can directly phase genomic variants into contiguous haplotypes.

In this tutorial, I will present several principles for how cancer genomic analysis can be enhanced using various types of phasing, from population panels to direct phasing of both germline and somatic variants with long reads. I will illustrate the applications in haplotype-specific CNA profiling, mutational timing, clonal deconvolution, and reconstruction of complex SV events.

Cenk Sahinalp, CDSL, NCI

Details of the talk will be posted soon

Salem Malikic, CDSL, NCI

A Bi-partition Function Algorithm to Evaluate Inferred Subclonal Structures in Single-Cell Sequencing Data

Single-cell sequencing enables reconstruction of tumor evolutionary histories, but existing phylogenetic methods typically infer a single maximum-likelihood tree without quantifying uncertainty in mutation-clade relationships. In this presentation, I introduce the bi-partition function, a probabilistic framework that measures the confidence that a mutation is associated with a specific group of cells across all plausible tumor evolutionary histories. I present a sampling-based algorithm that efficiently estimates these probabilities with theoretical convergence guarantees, despite the underlying problem being #P-hard. Using simulated and experimental single-cell datasets, I demonstrate how this approach identifies robust and uncertain evolutionary relationships, improves interpretation of inferred tumor phylogenies, and provides a principled framework for evaluating subclonal structure beyond a single inferred tree.

Teresa Przytycka, NCBI, NLM

Details of the talk will be posted soon

Vishaka Gopalan, CDSL, NCI

Manifold-based representation of cellular morphology for in situ spatial transcriptomics

In situ spatial (ISS) sequencing can uncover co-variation between cellular morphology and gene expression in vivo. However, a principled and interpretable mathematical representation of morphology has not yet been applied in this context. In particular, current deep learning-based representations of cell images confound a cell's shape with its size. I will go over an interpretable representation of nuclear and cellular boundary contours, based on tangent principal component analysis (TPCA) in a Kendall shape manifold, that captures size-independent contour shape features. This approach recovers shape-perturbing genes in an RNAi screen more successfully than a previous metric geometry-based approach. We build on TPCA to develop STITCH (Shape-TranscriptomIc Correlation and Harmonization), an approach to reveal covariation between cell morphology and gene expression in ISS datasets.

Eytan Ruppin, Cedars-Sinai Medical Center

"Inferring Tissue Omics Without Sequencing at both bulk and spatial resolution"

Our understanding of cancer and other human diseases has made significant advances in recent years, mainly based on quite costly and time/labor intensive DNA and RNA sequencing. I will start by briefly describing published approaches for fast and low-cost inference of bulk tissue omics data from the ubiquitous tumor pathology H&E slides, with translational applications in precision oncology. Those include (1) DeepPT-Enlight/Path2Omics: inferring bulk tumor transcriptomics and patient response [Nat Cancer 2024, Cancer Research 2025] and (2) TIME-ACT: inferring the immune activation of the tumor microenvironment (TME ‘hotness’) and predicting checkpoint immunotherapy response [bioRxiv 2025]. I will devote the main part of my talk to describing more recent approaches for inferring omics data at a spatial resolution, including (3) Path2Space: inferring spatial transcriptomics to identify spatially grounded biomarkers of treatment response [Cell, 2026], and finally, (4) Path2Marker and Path2Cell: inferring protein biomarkers and annotating cells at single-cell resolution [ongoing work]. Taken together, these approaches lay the basis for democratizing and accelerating medical and translational research in the next few years.

Sridhar Hannenhalli, CDSL, NCI

Context Specificity of Biological Functions

The effect of a genetic mutation can be highly tissue-specific. Think of BRCA mutations. At some level, all biological phenotypes are emergent properties of interacting parts. This issue of context-specificity undergirds all biological investigations. Drawing on our own and others’ work over the years, we will discuss this important, annoying, and under-appreciated issue.

Ben Greenbaum, Memorial Sloan Kettering Cancer Center

Details of the talk will be posted soon

Peng Jiang, CDSL, NCI

"Data-Driven Discovery of Secreted Proteins as Cancer Immunotherapies"

My research focuses on developing data-integration and artificial intelligence frameworks to study intercellular signaling mediated by secreted proteins in antitumor immunity. Data-driven analyses estimate that about two thousand human genes encode secreted proteins. Yet, our literature mining revealed that 61% of these genes lack known roles in cancer. To address this gap, we develop computational methods and apply diverse immunological models to dissect cytokine networks, secreted proteins, and ligand–receptor interactions in cancer. Ultimately, our goal is to uncover new mechanisms of immune regulation and identify therapeutic opportunities that harness intercellular communication against tumors.

Erin Molloy, University of Maryland

Establishing Statistical Guarantees for Cell Lineage Tree Reconstruction via Triplets and Quartets

Cell lineage tree reconstruction is often formulated as a combinatorial optimization problem. But under what conditions can we recover the true lineage tree with high probability, and how much data is required? This lecture presents tools for addressing these fundamental questions, focusing on two common models for cell lineage tracing. We show how to derive statistical consistency and sample complexity guarantees for methods based on triplets (rooted three-leaf induced subtrees) and quartets (unrooted four-leaf induced subtrees). Along the way, we highlight recent statistical advances for cell lineage tracing as well as their implications for algorithm design and outstanding open questions from both the theoretical and practical perspectives.
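
To make the notion of a triplet concrete, here is a minimal sketch, not taken from the lecture, that reads the rooted three-leaf induced subtree off a larger rooted tree. The nested-tuple tree encoding, the function names, and the assumption of unique leaf labels are choices made for this illustration.

```python
def leaf_set(node):
    """Set of leaf labels below a node; leaves are strings, internal nodes are tuples."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaf_set(child)
    return found


def rooted_triplet(tree, x, y, z):
    """Return the pair of leaves that are siblings in the triplet induced on {x, y, z}.

    Returns None when the three leaves separate at the same node
    (an unresolved triplet, which can only occur at a multifurcation).
    """
    wanted = {x, y, z}
    if len(wanted) != 3:
        raise ValueError("the three leaves must be distinct")
    node = tree
    while True:
        if isinstance(node, str):
            raise ValueError("all three leaves must be present in the tree")
        hits = [(child, wanted & leaf_set(child)) for child in node]
        hits = [(child, found) for child, found in hits if found]
        if sum(len(found) for _, found in hits) != 3:
            raise ValueError("all three leaves must be present in the tree")
        if len(hits) == 1:
            # All three leaves lie below one child: keep descending.
            node = hits[0][0]
            continue
        for _, found in hits:
            if len(found) == 2:
                # Two leaves stay together while the third splits off here.
                return frozenset(found)
        return None


lineage = (("a", "b"), ("c", ("d", "e")))
triplet_abc = rooted_triplet(lineage, "a", "b", "c")
triplet_acd = rooted_triplet(lineage, "a", "c", "d")
```

Triplet-based reconstruction methods work in the opposite direction: they estimate which pair is most closely related for many leaf triples from the data and then seek a tree consistent with as many of those estimates as possible.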

Lichun Ma, CDSL, NCI

Resolving the Spatial Organization of Cellular Communities in Liver Cancer

Liver tumors comprise a complex ecosystem of malignant, stromal, and immune cells whose spatial organization and interactions shape tumor progression, heterogeneity, and therapeutic response. Recent advances in spatial technologies enable high-resolution characterization of cellular states and neighborhoods within intact tissues. Here, we apply these approaches to define the spatial landscape of liver cancer, with the aim of uncovering the cellular and molecular mechanisms driving tumor heterogeneity and evolution, ultimately informing the development of more effective precision therapies.

SSACB Organizing Committee:

Mohammed El Kebir (University of Illinois Urbana-Champaign)

Vishaka Gopalan (NCI)

Mikhail Kolmogorov (NCI)

Salem Malikic (NCI)

Teresa Przytycka (NLM)

Ben Raphael (Princeton)

Cenk Sahinalp (NCI)

Mona Singh (Princeton)