Today's Overview
- **Correcting Tn5 Accessibility Bias in CUT&Tag Epigenomic Profiling**: Tn5 transposase introduces a systematic open chromatin bias in CUT&Tag data, particularly problematic for repressive histone marks and single-cell applications.
- **Reusing Homolog Fitness Data to Predict Variant Effects in Protein Engineering**: Fitness translocation uses protein language model embeddings to transfer experimental variant fitness data from homologous proteins to a target protein, generating synthetic training examples by applying homolog mutation vectors to the target wild type.
- **Do Genomic Foundation Models Actually Learn Biology? A Reality Check**: Randomly initialized character-token models often match pretrained k-mer/BPE genomic foundation models across 52 tasks, calling into question the cost-efficiency of current pretraining approaches.
- **Cenote-Taker 3 for Fast and Accurate Virus Discovery and Annotation of the Virome**
Featured
01 Correcting Tn5 Accessibility Bias in CUT&Tag Epigenomic Profiling
Mapping **histone modifications** across the genome is fundamental to understanding gene regulation, chromatin states, and cellular identity. **CUT&Tag** has emerged as a powerful alternative to ChIP-seq for profiling histone marks and transcription factors, offering lower input requirements and compatibility with single-cell applications. However, CUT&Tag relies on the **Tn5 transposase** for DNA tagmentation, and Tn5 exhibits a strong preference for accessible chromatin regions. This **open chromatin bias** systematically distorts read distributions, artificially enriching signal at accessible sites regardless of the true occupancy of the target histone mark or protein. The problem is particularly acute for **repressive modifications** like H3K27me3 and H3K9me3, which naturally localize to closed chromatin, and becomes more severe in sparse single-cell datasets, where signal-to-noise ratios are already challenging.
The authors demonstrate that this accessibility bias pervades published CUT&Tag datasets, including those generated with optimized high-salt protocols intended to reduce background. To address this, they developed **PATTY (Propensity Analyzer for Tn5 Transposase Yielded bias)**, a computational method that corrects CUT&Tag data by leveraging paired **ATAC-seq** measurements of chromatin accessibility from the same samples. PATTY models the Tn5 insertion propensity and removes accessibility-driven artifacts from the CUT&Tag signal. The authors validated PATTY's performance across multiple histone marks including the active mark **H3K27ac** and repressive marks **H3K27me3** and **H3K9me3**, showing improved peak calling accuracy and consistency with orthogonal experimental data.
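The published method is more involved, but the core idea, regressing the accessibility-driven component out of the CUT&Tag signal using paired ATAC-seq, can be sketched as follows. This is a minimal illustration under that assumption, not the actual PATTY implementation; `cuttag` and `atac` are hypothetical per-bin log-scaled read counts:

```python
import numpy as np

def correct_accessibility_bias(cuttag, atac):
    """Remove the component of CUT&Tag signal explained by
    chromatin accessibility via simple linear regression.

    cuttag, atac: 1-D arrays of log-scaled per-bin read counts.
    Returns the bias-corrected residual signal (same shape).
    """
    # Design matrix: intercept + accessibility covariate
    X = np.column_stack([np.ones_like(atac), atac])
    # Least-squares fit of CUT&Tag signal on accessibility
    beta, *_ = np.linalg.lstsq(X, cuttag, rcond=None)
    # Residuals = signal not explained by Tn5's open-chromatin preference
    return cuttag - X @ beta
```

Because least-squares residuals are orthogonal to the regressors, the corrected signal carries no linear dependence on accessibility; PATTY models Tn5 insertion propensity more directly, so treat this purely as intuition for what "removing accessibility-driven artifacts" means.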
Using machine learning integration of transcriptomic and corrected epigenomic profiles, the authors show that PATTY-corrected data better predict gene expression patterns and chromatin states. For single-cell applications, they developed an analysis framework incorporating PATTY correction and demonstrate **improved cell type clustering** compared to uncorrected data, addressing a critical bottleneck in single-cell epigenomics. Validation includes comparison with known biological ground truth and experimental confirmation of predicted binding sites. While PATTY requires paired ATAC-seq data (adding experimental cost), the method provides a systematic solution to a pervasive technical artifact. The approach is applicable beyond CUT&Tag to other Tn5-based assays, potentially including ATAC-seq itself, establishing a framework for bias correction in widely adopted epigenomic technologies.
Source: PATTY corrects open chromatin bias for improved bulk and single-cell CUT&Tag profiling
02 Reusing Homolog Fitness Data to Predict Variant Effects in Protein Engineering
Protein engineering relies on predicting which amino acid substitutions will improve or impair function, but **fitness data scarcity** remains a fundamental bottleneck. Experimentally measuring variant effects through deep mutational scanning or directed evolution is resource-intensive, often yielding datasets of only hundreds to thousands of variants for a single protein. This data limitation severely constrains supervised machine learning models that could otherwise guide rational design of enzymes, fluorescent proteins, or therapeutic antibodies. The core challenge is whether fitness information from evolutionary relatives can be systematically transferred to a target protein of interest.
This study introduces **fitness translocation**, a biologically grounded data augmentation strategy that exploits homologous proteins to synthetically expand training datasets. The method operates in the embedding space of protein language models (PLMs), which capture evolutionary and structural patterns from billions of natural sequences. For a variant in a homologous protein, the approach computes the embedding difference between the homolog's wild type and mutant, then applies this delta vector to the target protein's wild-type embedding to generate a synthetic variant. The fitness value from the homolog variant is assigned to this synthetic target variant, effectively translating experimental measurements across protein families. This differs from naive sequence-alignment approaches by leveraging the rich, context-aware representations learned by transformer-based PLMs like ESM-2.
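The delta-vector construction can be sketched in a few lines. The random vectors below are stand-ins for real PLM embeddings (a real pipeline would use, e.g., mean-pooled ESM-2 token embeddings), and `translocate` and `ridge_fit` are hypothetical helper names, not from the paper's code:

```python
import numpy as np

def translocate(target_wt_emb, homolog_wt_emb, homolog_mut_embs):
    """Shift the target wild-type embedding by each homolog variant's
    mutation delta, producing synthetic target-variant embeddings."""
    deltas = homolog_mut_embs - homolog_wt_emb  # per-variant shift vectors
    return target_wt_emb + deltas               # synthetic target variants

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: (X^T X + lam*I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy stand-ins for PLM embeddings
rng = np.random.default_rng(42)
d = 64                                              # toy embedding dimension
target_wt = rng.normal(size=d)
homolog_wt = rng.normal(size=d)
homolog_muts = homolog_wt + rng.normal(scale=0.1, size=(200, d))
homolog_fitness = rng.normal(size=200)              # measured fitness labels

# Synthetic training set for the target protein, labeled with homolog fitness
X_syn = translocate(target_wt, homolog_wt, homolog_muts)
w = ridge_fit(X_syn, homolog_fitness)
preds = X_syn @ w
```

The key design choice is that the homolog's experimental label travels with its delta vector, so the supervised model sees (synthetic embedding, real fitness) pairs centered on the target protein.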
The authors validate fitness translocation across three protein families with distinct functional contexts: **IGPS** (indole-3-glycerol phosphate synthase, enzymatic activity), **GFP** (green fluorescent protein, fluorescence intensity), and **SARS-CoV-2 spike proteins** (viral entry, ACE2 binding). Evaluation is performed **in silico** using held-out experimental fitness measurements as ground truth, testing multiple prediction models including ridge regression, random forests, and gradient boosting. Across all families and model architectures, fitness translocation consistently improves Spearman correlation between predicted and measured fitness, with gains most pronounced when training data is limited (10-100 variants). Remarkably, the method works even between remote homologs sharing only **35% sequence identity**, suggesting broad applicability across diverse protein families. The approach demonstrates that historical fitness data from related proteins—often generated in different labs for different purposes—can be systematically repurposed to accelerate engineering of new targets, offering a path toward more data-efficient computational protein design.
03 Do Genomic Foundation Models Actually Learn Biology? A Reality Check
**Genomic foundation models (GFMs)** promise to transform computational biology by learning universal representations of DNA sequences through large-scale pretraining, analogous to how GPT models revolutionized natural language processing. The hypothesis is compelling: train on vast genomic datasets, then fine-tune for specific tasks like variant effect prediction, gene expression forecasting, or regulatory element identification. However, this approach assumes that unsupervised pretraining on genomic sequences captures biologically meaningful patterns that transfer to downstream applications. This study rigorously tests that assumption by comparing seven prominent GFMs against a surprisingly simple baseline—randomly initialized models with identical architectures.
The authors evaluated models across **52 diverse genomic tasks** spanning regulatory genomics, variant effect prediction, and sequence classification. The results challenge prevailing assumptions about GFM utility. **Character-token models with random initialization often matched or exceeded the performance of pretrained k-mer and byte-pair encoding (BPE) models**, despite the latter requiring substantial computational investment for pretraining. Only subword tokenization approaches showed consistent benefits from pretraining, suggesting that **tokenizer choice fundamentally determines whether pretraining provides value**. This finding has immediate practical implications: practitioners may achieve comparable performance with simpler, faster-to-train character-level models for many genomic tasks.
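The tokenizer distinction at the center of this result is easy to make concrete: character tokenization keeps single nucleotides, while k-mer tokenization (as in DNABERT-style models) groups overlapping windows, changing both the vocabulary size and what pretraining can memorize. A minimal sketch (these helper names are illustrative, not from the paper):

```python
def char_tokenize(seq):
    """One token per nucleotide: a vocabulary of just {A, C, G, T, N}."""
    return list(seq)

def kmer_tokenize(seq, k=3, stride=1):
    """Overlapping k-mer tokens: the vocabulary grows as 4**k."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens_char = char_tokenize("ACGTAC")  # 6 single-nucleotide tokens
tokens_kmer = kmer_tokenize("ACGTAC")  # 4 overlapping 3-mer tokens
```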
More concerning for clinical applications, the study reveals that **current GFMs fail to capture clinically relevant genetic mutations**. When tested on annotated variants from databases like ClinVar, model embeddings and log-likelihood ratios showed limited sensitivity to pathogenic versus benign mutations. This represents a critical gap, as variant effect prediction is among the most important applications for genomic AI in precision medicine. The models appear to learn statistical patterns in DNA sequences without necessarily encoding the functional consequences of sequence changes.
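Variant scoring of this kind typically reduces to a log-likelihood ratio at the mutated position. A minimal sketch with toy numbers; in practice `ref_prob` and `alt_prob` would come from a masked language model's output distribution at the masked site, not be hand-set:

```python
import math

def variant_llr(ref_prob, alt_prob):
    """LLR = log P(alt | context) - log P(ref | context).

    Strongly negative values indicate the model finds the alternate
    allele implausible in its sequence context (a candidate signal of
    deleteriousness); values near zero are uninformative.
    """
    return math.log(alt_prob) - math.log(ref_prob)

# Toy probabilities a model might assign at a masked variant site
score = variant_llr(ref_prob=0.7, alt_prob=0.05)  # negative: alt disfavored
```

The study's point is not that this recipe is flawed, but that the likelihoods current GFMs produce separate ClinVar pathogenic from benign variants poorly.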
All evaluations were **in silico**, comparing model predictions against curated genomic annotations and experimental datasets. The findings suggest that simply scaling up NLP-style pretraining on genomic sequences may not suffice. Instead, the authors advocate for **biologically informed tokenization strategies** that respect functional units like codons or regulatory motifs, and **variant-aware training objectives** that explicitly teach models about mutation effects. For practitioners currently investing in or deploying GFMs, these results recommend careful baseline comparisons and tokenizer selection as first steps before committing to expensive pretraining regimes.
Source: Tokenization to Transfer: Do Genomic Foundation Models Learn Good Representations?
04 Cenote-Taker 3 for Fast and Accurate Virus Discovery and Annotation of the Virome
Source: Cenote-Taker 3 for Fast and Accurate Virus Discovery and Annotation of the Virome
Today's Observation
The convergence of machine learning and experimental biology continues to surface fundamental questions about what our models actually learn and how we validate them. A sobering reality check on **genomic foundation models** reveals that random initialization baselines can match or exceed the performance of pretrained models on downstream tasks, particularly when fine-tuning data is abundant. This challenges the assumption that self-supervised pretraining on DNA sequences inherently captures biologically meaningful representations. For practitioners in AI-driven drug discovery, this suggests that **task-specific architectures and training strategies may matter more than pretraining scale** when working with genomic data. The implication is clear: before investing computational resources in foundation model pretraining, teams should rigorously benchmark against simpler baselines and critically assess whether pretraining objectives align with downstream prediction tasks like variant pathogenicity or regulatory element identification.
Transfer learning shows more promise in the protein engineering domain, where **fitness landscape data from homologous proteins can improve variant effect prediction even at 35% sequence identity**. This work demonstrates that evolutionary information encoded in homolog fitness measurements, obtained through deep mutational scanning or directed evolution, transfers across protein families to enhance predictions for target proteins with limited experimental data. The practical value lies in data efficiency: rather than conducting exhaustive mutagenesis screens for every engineering target, teams can leverage existing fitness datasets from related proteins. The method works across diverse protein types including enzymes, fluorescent proteins, and viral proteins, with performance gains most pronounced when training data for the target protein is scarce. This validates a **meta-learning approach to protein design** where accumulated experimental knowledge across homologs serves as inductive bias.
Meanwhile, the experimental measurement side faces its own biases that AI must account for. The PATTY algorithm addresses **Tn5 transposase open chromatin bias** in CUT&Tag epigenomic profiling, a widely used technique for mapping histone modifications and transcription factor binding. Tn5's preference for accessible chromatin creates systematic distortions in the measured signal, confounding true occupancy with technical artifact. By modeling and correcting this bias, PATTY improves the accuracy of chromatin state inference, which feeds into AI models predicting gene regulation and expression. For teams building models on epigenomic data, whether for target identification or understanding drug mechanism of action, this highlights the importance of **preprocessing pipelines that remove technical confounders** before training.
The broader lesson applies to virome analysis as well, where Cenote-Taker 3 automates virus genome discovery and annotation in metagenomic sequencing data, addressing the challenge of identifying novel viral sequences without relying on close reference genomes. Accurate viral genome characterization matters for understanding host-pathogen interactions and identifying therapeutic targets, but requires computational tools that can handle the extreme diversity and rapid evolution of viral sequences.