Denisovan Fossil Shows Enigmatic Human Cousins Lived from Siberia to Subtropics

A fossilized jawbone found off the coast of Taiwan nearly two decades ago belonged to a male Denisovan, scientists have found, confirming that this enigmatic group of archaic humans thrived across a vast geographic range, from Siberian snowfields to subtropical jungles. Unlike their cousins the Neandertals, however, Denisovans left behind few physical clues: this is only the third location to yield verifiable remains since their discovery 15 years ago. Yet scientists had no inkling of their existence until 2010, when researchers realized that a finger bone—and later some other bone fragments and teeth—from Denisova Cave in southern Siberia represented an entirely unknown branch on the hominin tree. “That's very little to go on,” says Frido Welker, a molecular anthropologist at the University of Copenhagen and a co-author of the new jawbone analysis, which was published on Thursday in Science. The Siberian and Tibetan fossils revealed that Denisovans started roaming the Eurasian continent at least 200,000 years ago and survived long enough to interbreed with anatomically modern humans as the latter ventured out of Africa some 50,000 years ago. “It shows that they were extraordinarily adaptive,” says Bence Viola, a paleoanthropologist at the University of Toronto, who was not involved in this study. Besides Homo sapiens, no other hominin group—not even the hardy Neandertals—mastered such diverse environments.

Photos of the Penghu 1 mandible viewed from right side (l) and top (r).

So Welker and his colleagues focused instead on proteins, complex biomolecules that take longer to break down than DNA.
For many researchers who work in this field, that confirmation comes as no surprise. “We all pretty much expected this would be a Denisovan,” Viola says, noting that its robust structure was similar to the Tibetan mandible. But like Penghu 1, its DNA was too far gone, so there's no conclusive evidence that it belonged to a Denisovan. It's possible, however, that the molar could eventually be tested using Welker's protein analysis technique. This new study provides another anchor point for where Denisovans and early modern humans could have met and swapped genes, says Emilia Huerta-Sánchez, a population geneticist at Brown University, who was not involved in the study. Her work has shown that several genetically distinct Denisovan populations engaged in these interhominin unions—and that some of the DNA we acquired from them offered an evolutionary advantage (for instance, by enabling Tibetans to breathe the thin air of their homeland). The proteins in Penghu 1 don't, however, do much to advance researchers' understanding of gene flow between Denisovans and early modern humans, according to Huerta-Sánchez. “This is a little bit of data,” she says, but to really flesh out those ancient interactions, “it would be nice to get a whole genome from a different geographic location,” somewhere outside Siberia and Tibet. That's an inherently difficult task, given the fragility of DNA, especially in warmer climates. Welker says he'll wait until more data come in to speculate on precisely how our most cryptic relatives fit into the story of human evolution. Cody Cottier is a freelance journalist based in Fort Collins, Colo.
Three massive volcanic eruptions around 540 A.D. likely triggered the brief but impactful climate shift, as volcanic ash blocked sunlight and lowered global temperature for around 200 to 300 years. “We knew these rocks seemed somewhat out of place because the rock types are unlike anything found in Iceland today,” Christopher Spencer, lead author of the research, said in a statement. “Zircons are essentially time capsules that preserve vital information including when they crystalized as well as their compositional characteristics,” Spencer said. “The combination of age and chemical composition allows us to fingerprint currently exposed regions of the Earth's surface, much like is done in forensics.” “This is the first direct evidence of icebergs carrying large Greenlandic cobbles to Iceland,” Spencer said. “On one hand, you're surprised to see anything but basalt in Iceland, but having seen them for the first time, you instantly suspect they arrived by iceberg from Greenland,” Ross Mitchell, a co-author of the study, said in a statement. “The fact that the rocks come from nearly all geological regions of Greenland provides evidence of their glacial origins,” Gernon said. “As glaciers move, they erode the landscape, breaking up rocks from different areas and carrying them along, creating a chaotic and diverse mixture—some of which ends up stuck inside the ice.” “This timing coincides with a known major episode of ice-rafting, where vast chunks of ice break away from glaciers, drift across the ocean, and eventually melt, scattering debris along distant shores,” Gernon said. Tim Newcomb is a journalist based in the Pacific Northwest. He covers stadiums, sneakers, gear, infrastructure, and more for a variety of publications, including Popular Mechanics.
The patterns of crossing threads that make up a knot can be explored mathematically. Credit: Ilie Lupescu/500px via Getty

However, the machines could prove to be particularly suited to solving problems in mathematics — especially in topology, the branch of maths that studies shapes. In a preprint posted on arXiv in March1, researchers at Quantinuum, a company headquartered in Cambridge, UK, report using their quantum machine H2-2 to distinguish between different types of knot on the basis of topological properties, and show that the method could be faster than those that run on ordinary, or ‘classical’, computers. Quantinuum chief product officer Ilyas Khan says that Helios, a quantum computer that the company expects to release later this year, could get much closer to beating classical supercomputers at analysing fiendishly complicated knots. Although other groups have already made similar claims of ‘quantum advantage’, typically for ad hoc calculations that have no practical use, classical algorithms tend to catch up eventually. But theoretical results2,3 suggest that for some topology problems, quantum algorithms could be faster than any possible classical counterpart. This is owing to mysterious connections between topology and quantum physics. In that work, Meichanetzidis and his colleagues used a quantum computer to calculate knot ‘invariants’ — numbers that describe particular types of knot. This is still well within the scope of classical computing, but the company's machines should eventually be able to handle 3,000 crossings or so, at which point even the fastest classical supercomputers will run out of steam, says Meichanetzidis.
Mathematically, the theoretical equivalence between knot crossings and quantum algorithms has been known for decades, but only now has the team been able to fully put it into practice, says Aharonov, who is at the Hebrew University of Jerusalem.

Schmidhuber, A., Reilly, M., Zanardi, P., Lloyd, S. & Lauda, A. Preprint at arXiv https://doi.org/10.48550/arXiv.2501.12378 (2025).
The idea is so ubiquitous in science fiction that it's become nearly synonymous with the word “hologram.” In almost every news story written about hologram technology and how far it has come, at some point, a disclaimer has to be made explaining that ‘it's not quite Tony Stark tech, but it's still cool!’ For the first time, a team of engineers has managed to create a hologram that you can directly interact with using your hands. If you're projecting a holo-cube, you can reach into that display and slide it back and forth or turn it around. “What we see in films and call holograms are typically volumetric displays,” Elodie Bouzbib, the first author on the paper describing this new tech, said in a press release. “These are graphics that appear in mid-air and can be viewed from various angles without the need for wearing virtual reality glasses.” Volumetric displays have existed (at least in prototype form) for some time. They work by projecting a rapid sequence of images onto something called a diffuser, which is basically a projector screen that moves up and down faster than the human eye can see. This change in material facilitates the breakthrough at the heart of this work—direct interaction, which lead researcher Asier Marzo defined as “being able to insert our hands to grab and drag virtual objects.” “We are used to direct interaction with our phones,” he continued, “where we tap a button or drag a document directly with our finger on the screen—it is natural and intuitive for humans. This project enables us to use this natural interaction with 3D graphics to leverage our innate abilities of 3D vision and manipulation.” Obviously, this tech is still in the very early stages of development. And the team has plenty of plans for where to take it next, like adding haptic feedback to further enhance the tactile experience. But at this moment, for the first time ever, you can poke a cube that isn't there and it will move.
The major spliceosome includes five small nuclear RNAs (snRNAs), U1, U2, U4, U5 and U6, each of which is encoded by multiple genes. Here, we report that recurrent germline mutations in RNU2-2 (previously known as pseudogene RNU2-2P), a 191-bp gene that encodes the U2-2 snRNA, are responsible for a related disorder. By genetic association, we identified recurrent de novo single-nucleotide mutations at nucleotide positions 4 and 35 of RNU2-2 in nine cases. We replicated this finding in 16 additional cases, bringing the total to 25. The disorder is characterized by intellectual disability, autistic behavior, microcephaly, hypotonia, epilepsy and hyperventilation. All cases display a severe and complex seizure phenotype. We found that U2-2 and canonical U2-1 were similarly expressed in blood. Despite mutant U2-2 being expressed in patient blood samples, we found no evidence of missplicing. Our findings cement the role of major spliceosomal snRNAs in the etiologies of neurodevelopmental disorders. More than 4,000 genes have been established as etiological for a rare disease, of which only 69 are noncoding1. Three of these noncoding genes—RNU4ATAC, RNU12 and RNU4-2—encode snRNAs that have crucial roles in pre-messenger RNA (mRNA) splicing. Variants in RNU4ATAC are responsible for microcephalic osteodysplastic primordial dwarfism type I (refs. These pathologies are inherited in an autosomal-recessive manner. Both RNU4ATAC and RNU12 encode components of the minor spliceosome, a molecular complex that catalyzes splicing for fewer than 1% of all introns in humans8. However, more than 99% of introns are spliced by the major spliceosome.
Recently, we reported that de novo mutations in RNU4-2, which is transcribed into the U4-2 snRNA component of the major spliceosome, cause one of the most prevalent monogenic neurodevelopmental disorders (NDDs)9. The discovery was published independently by a separate group10. To explore whether other noncoding genes might also be causal for NDDs, we performed a refined statistical analysis of the 100,000 Genomes Project (100KGP) data in the National Genomic Research Library (NGRL)11. Following a previously described approach9,12, we used the BeviMed genetic association method13 to compare rare variant genotypes in the 41,132 canonical transcript entries in Ensembl v.104 with a biotype other than ‘protein_coding' (Supplementary Data), which included 14,307 entries annotated as pseudogene transcripts, between 7,452 unrelated, unexplained cases annotated with the ‘Neurodevelopmental abnormality' (NDA) Human Phenotype Ontology (HPO) term and 43,727 unrelated participants without the NDA term. Notably, whereas our previous analyses filtered out single-nucleotide variants with combined annotation-dependent depletion (CADD)14 score < 10, our present analysis removed this threshold to expand the variant search space. RNU4-2, which we have reported previously9, had a PPA of ~1, and RNU2-2P (now called RNU2-2) had a PPA of 0.97. Conditional on the association, two variants, at nucleotide positions 4 and 35, had a BeviMed posterior probability of pathogenicity (PPP) > 0.5 (Fig. The nine NDA cases with either of the variants had a significantly greater phenotypic homogeneity based on HPO terms than expected under random selection of nine NDA cases from unexplained and unrelated NDA study participants in the 100KGP (P = 1.33 × 10−3, Fig. RNU2-2 has a 191-bp sequence that is identical to that of the canonical gene RNU2-1, except for eight single-nucleotide substitutions (all within n.108–191). 
Although at the time of analysis, RNU2-2 was known as RNU2-2P and annotated as one of many U2 pseudogenes in bioinformatics databases15, it has recently been shown to be expressed in cell lines, and its transcripts, U2-2P (now U2-2), have been shown to have the greatest abundance and stability of all noncanonical U2 snRNAs16. After aggregation over the 11 copies of RNU2-1 in the GRCh38 build of the reference genome, RNU2-1 and RNU2-2 show comparable levels of expression in whole blood and in blood cells (Fig. RNU2-2 resides in a 5′ untranslated exon of WDR74 that had previously been identified as being enriched for hotspot mutations in cancer, although the existence of RNU2-2 at that locus was not known at the time17. A recent study showed that both RNU2-1 and RNU2-2 carry recurrent somatic mutations (n.28C>T) that drive B cell-derived tumors, prostate cancers and pancreatic cancers18. The same study showed that RNU2-2 is a functional gene that is transcribed independently of WDR74—a finding that we recapitulated in blood and blood cells (Extended Data Fig. All other noncoding genes and pseudogenes had PPA < 0.5. Only two RNU2-2 variants had conditional PPP > 0.5: n.4G>A and n.35A>G. b, Distribution of phenotypic homogeneity scores for 100,000 randomly selected sets of nine participants chosen from 9,112 unrelated NDA-coded participants. The color-coded track shows the aggregated (over distinct alleles at a position) minor allele count (aMAC) in gnomAD v.4.1.0 (gn.) at each position, and the black bars show the numbers of distinct alternate alleles in gnomAD at each position (multiple insertions and multiple deletions at a given position each count as one). Above and below the RNU2-2 cDNA sequence (Seq. ), the alternate alleles in 100KGP participants and the distinct alleles in gnomAD are shown, respectively; ‘+' indicates insertions, and the variant that failed QC in gnomAD is indicated. 
e, Pedigrees for participants with a rare alternate allele n.4 or n.35 in RNU2-2. Pedigrees used for discovery have a ‘G' prefix and are labeled in black. Pedigrees used for replication in the IMPaCT-GENóMICA, URDCat and ENoD-CIBERER aggregate collection; the 100KGP; the NBR; Erasmus MC UMC; the GMS; Radboud UMC; deCODE or the ZOEMBA study have an ‘I', ‘M', ‘N', ‘R', ‘S', ‘W', ‘Y' or ‘Z' prefix, respectively, and are labeled in blue. The locus has a markedly reduced density of population genetic variation in gnomAD19, consistent with the effects of negative selection (Fig. Published secondary structure data of the U2 snRNA show that r.4 is located within the helix II U2–U6 interaction domain, whereas r.35 is part of the highly conserved recognition domain GUAGUA that binds the branch sites of introns20,21,22 (Extended Data Fig. Trio sequencing of four of the five cases with n.4G>A and three of the four cases with n.35A>G showed that the variants were de novo in each case. A variant with a different alternate allele at nucleotide 35, n.35A>T, was called in eight unaffected participants; it was also present in gnomAD but failed quality control (QC) (Fig. Analysis of whole-genome sequencing (WGS) and Sanger sequencing data suggested that n.35A>G is a germline variant, but n.35A>T is a recurring somatic mosaic variant. This somatic variant is observed only in individuals over the age of 40 years, consistent with clonal hematopoiesis (Extended Data Fig. 
To replicate our findings in the nine NDD cases, we examined eight additional rare disease collections: a component of the 100KGP not included in the discovery dataset (10,373 participants, of whom 1,736 have an NDA); the NIHR BioResource-Rare Diseases (NBR) data23 (7,388 participants, of whom 731 have an NDA); the UK Genomic Medicine Service (GMS) data (32,030 participants, of whom 6,469 have an NDA); data from the Erasmus MC UMC (1,527 participants, of whom approximately 400 have an NDA); an aggregate of the IMPaCT-GENóMICA, URDCat and ENoD-CIBERER programs for undiagnosed rare diseases24 (1,707 probands with NDDs and WGS data); clinical data from Radboud UMC Nijmegen (1,037 probands with an NDA); WGS data from deCODE genetics (73,821 participants, of whom 4,416 have an NDA) and data from the ZOEMBA study (127 participants, of whom 71 have an NDA). We identified a further 16 cases in these replication collections (Fig. 1e), all but two of whom were confirmed to have a de novo variant. There were no unaffected carriers of either variant. Although this case represented the only individual harboring n.35A>C, modeling of the interactions between U2-2 snRNA and canonical branch site sequences suggested that n.35A>C has a destabilizing effect on binding that is greater than that of the n.35A>G variant and in many cases similar in magnitude to that of the n.4G>A variant with respect to its cognate partner U6 (Extended Data Fig. All these variants were called confidently by WGS (Extended Data Fig. In the 100KGP, RNU2-2 was a more prevalent etiological gene than all but 29 of the ~1,400 known etiological genes for intellectual disability, explaining about one-fifth the number of cases as RNU4-2, the etiological gene for RNU4-2 syndrome, also known as ReNU syndrome (Fig. 
This relative prevalence was consistent with observations in the IMPaCT-GENóMICA, URDCat and ENoD-CIBERER aggregate collection, which identified 27 cases with RNU4-2 syndrome and six cases (that is, 4.5 times fewer) with RNU2-2 syndrome. Of the 9,112 unrelated NDA-coded cases in the 100KGP, the numbers solved through pathogenic or likely pathogenic variants in a gene are shown, provided at least nine cases were diagnosed. Analysis of HPO terms for the nine uniformly phenotyped 100KGP cases revealed that 100% were assigned ‘Intellectual disability' and ‘Global developmental delay', 89% were assigned ‘Delayed speech and language development', 78% were assigned ‘Motor delay' and 56% were assigned ‘Autistic behavior', in line with frequencies among NDA cases generally (Fig. No HPO terms were significantly underrepresented in the RNU2-2 cases. Of the terms that were enriched among cases of RNU4-2 syndrome, ‘Seizure', ‘Microcephaly' and ‘Generalized hypotonia' were also enriched in RNU2-2 cases. However, ‘Severe global developmental delay' and ‘Hyperventilation' were only enriched in RNU2-2 cases, suggesting that these may be differentiating phenotypic features. Graph showing the ‘is-a' relationships among HPO terms present in at least three of the nine NDA-coded RNU2-2 cases in the discovery collection or significantly enriched among them relative to 9,112 unrelated NDA-coded participants of the 100KGP. Detailed clinical vignettes for the 15 cases in pedigrees G1–2, G4, I1–6, M2, R1, S3, W1, Y1 and Z1 are provided in Supplementary Note and Supplementary Table 1. These indicate that the neurodevelopmental phenotype caused by the RNU2-2 variants typically manifests from 3 to 6 months of age but is progressive, frequently severe and accompanied by characteristic dysmorphic features (Fig. All the cases displayed prominent epilepsy, usually from the first few months of life, and seizures were severe and pharmacoresistant. 
Seizures were characteristically complex and included spasms, tonic, tonic clonic, myoclonic and absence types, classified in some probands as Lennox–Gastaut syndrome. These features distinguish the RNU2-2 cases from previously reported cases of RNU4-2 syndrome, in which the developmental phenotype was reported as less severe, some of the dysmorphic features were different, and epilepsy was typically later in onset, less severe and more commonly focal9,10,25. Extraordinarily, case M2 also harbored a de novo truncating variant in SPEN predicted to cause Radio–Tartaglia syndrome26. However, the individual in this case had short stature (<−2.65 s.d.) ), which are not characteristic of Radio–Tartaglia syndrome, as well as having a craniofacial morphology that more closely resembled that of other RNU2-2 patients than Radio–Tartaglia syndrome patients (Supplementary Note). This atypical presentation is consistent with a dual rare genetic diagnosis. Clinical photographs of individuals from pedigrees G1, G4, S3, R1 and I1–6. The individuals in these cases show common features of long palpebral fissures with slight eversion of the lateral lower lids, long eyelashes, broad nasal root, large low set ears, wide mouth and wide spaced teeth. Photographs of individual M2, who has Radio–Tartaglia syndrome in addition to RNU2-2 syndrome, are included in the Supplementary Note. We have obtained specific consent from the families to publish these clinical photographs. Using trio WGS data, which were available for 17 families, we were able to determine the parental origin of the de novo mutations for ten of those families. Echoing observations in cases with RNU4-2 syndrome, the pathogenic RNU2-2 mutations were ubiquitously of maternal origin, suggesting that they may affect spermatogenesis. Analysis of uniquely aligned reads at heterozygous sites in whole-blood RNA sequencing (RNA-seq) data revealed that both alleles of RNU2-2 were expressed robustly in cases (Extended Data Fig. 
However, a genome-wide comparison of the RNA-seq alignments between five cases and 495 unrelated unexplained NDA-coded participants did not reveal differential gene expression, differential splice junction usage or any pattern of aberrant splicing in the cases (Extended Data Fig. 8), suggesting that transcriptomic analysis of other tissue types will be required to uncover the underlying molecular mediators of disease. U2 is involved in all stages of pre-mRNA splicing and contains distinct domains that interact with the catalytic U6, intronic branch sites and scaffolding of several protein assemblies27. Notably, the U6 binding domain and the branch site recognition domain of U2-2 are transcribed from a region in RNU2-2 exhibiting markedly reduced population genetic variation (Fig. Studies in the 1990s of yeast U2 snRNA showed that variants in branch site recognition sequence GUAGUA inhibit splicing and even generate a dominant lethal phenotype when the recognition sequence is changed entirely28,29. Position r.35 in the human U2 sequence corresponds to r.36 in the yeast U2 sequence, where n.36A>G and n.36A>T result in 0–10% and 10–20% splicing activity, respectively, compared with the wild-type sequence29. Although the U2–U6 recognition sequences are not conserved between yeast and human, a similar organization is retained. Mice with variants in a direct ortholog of RNU2-2 do not exist; however, mice with a homozygous 5-bp deletion in U2 ortholog Rnu2-8 present with ataxia and neurodegeneration32. Although it remains unclear how this splicing defect might cause neuronal death, it has been hypothesized that premature translation termination codons within the retained introns could trigger the nonsense-mediated decay (NMD) pathway. We and others have shown that the recessive human disorders caused by variants in RNU4ATAC and RNU12 result in minor intron retention in blood cells and fibroblasts2,4,6,33,34. 
By contrast, we have been unable to detect any significant and reproducible large-scale splicing defect in the blood cells of patients with dominant germline variants in the major spliceosome gene RNU2-2. Although a recent study described systematic disruption of 5′ splice site usage in the whole blood of some patients with de novo RNU4-2 variants10, RNA-seq of fibroblasts in a separate case study could not detect any defect in splicing25. Moreover, transcriptomic analysis of primary hematological tumors and cell lines transfected with vectors expressing the n.28C>T RNU2-2 mutation did not reveal any significant differences in splicing18. Therefore, further studies are required to understand how RNU4-2 and RNU2-2 mutations affect splicing. It might be that, in contrast to recessive splicing disorders, it is challenging to detect widespread splicing defects in these newly discovered dominant disorders because wild-type transcripts are expressed in combination with misspliced transcripts from the same gene that are subjected to NMD. In certain cell types, the effects of NMD might be overcome such that the overall expression levels of mRNAs remain unchanged, owing to rapid mRNA turnover and dosage compensation35. However, certain cell types, such as stem cells, which we have not yet been able to study, might be more sensitive to high NMD dosage than terminally differentiated cells. Neuronal stem cells and mouse models of RNU4-2 and RNU2-2 pathologies may be needed to resolve these mechanistic questions. We obtained written informed consent to publish additional clinical data from a subset of the affected cases in the NGRL following local best practices. NBR participants were enrolled under a protocol approved by the East of England–Cambridge South Research Ethics Committee (ref. 
Informed consent at that institution was obtained for all diagnostics, and written informed consent was obtained from the parents of participants for publication of medical data including photographs, in line with the Declaration of Helsinki. Participants in the IMPaCT-GENóMICA, URDCat and ENoD-CIBERER programs were enrolled through clinical services under a protocol approved by the Instituto de Salud Carlos III Research Ethics Committee (CEI-PI01_2022) and endorsed by the institutional review boards of the participating hospitals. Written informed consent to publish clinical data and photographs of the affected individuals were obtained following local best practices. The available enrollment criteria for replication cohorts are given in refs. The genetic association analysis was conducted as described previously9,12, except that variants were not thresholded on CADD score. Cases explained by variants in a given gene were reassigned to the control group in the genetic association analyses for genes other than that gene. To assess the phenotypic homogeneity of the nine participants in the discovery collection with n.4G>A or n.35A>G in RNU2-2, we computed a phenotype homogeneity score for that group with respect to unexplained and unrelated NDA study participants. We then obtained a Monte Carlo P value as the proportion of random sets of nine unexplained unrelated NDA cases with a homogeneity score greater than or equal to the homogeneity score of the group carrying either of the RNU2-2 variants. To identify enriched or depleted HPO terms among the nine NDA-annotated cases with n.4G>A or n.35A>G in RNU2-2 in the discovery collection, compared with unrelated NDA-coded participants without either of these two variants, we computed P values of association using Fisher's two-sided exact test. We only tested enrichment for terms that were attached to at least three of the nine cases and belonged to the set of nonredundant terms at each level of frequency among the cases. 
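As an illustration of the Monte Carlo P value described above, the following Python sketch draws random groups from a pool of participants and reports the proportion whose homogeneity score is greater than or equal to the observed one. It is a minimal sketch under stated assumptions, not the authors' implementation: the analysis in the paper used the ontologyX R packages, whereas `mean_pairwise_jaccard` is a hypothetical stand-in score that ignores the structure of the HPO graph, and the function names, the seed and the data layout (one set of HPO term identifiers per participant) are chosen only for the example. The default of 100,000 draws mirrors the number of random sets mentioned in the figure legend.

```python
import random
from itertools import combinations
from typing import Callable, FrozenSet, Optional, Sequence


def mean_pairwise_jaccard(term_sets: Sequence[FrozenSet[str]]) -> float:
    """Hypothetical stand-in homogeneity score: mean Jaccard index over all pairs."""
    pairs = list(combinations(term_sets, 2))
    if not pairs:
        raise ValueError("need at least two participants to score a group")
    total = 0.0
    for a, b in pairs:
        union = a | b
        # Two empty term sets share nothing informative; score the pair as 0.
        total += len(a & b) / len(union) if union else 0.0
    return total / len(pairs)


def monte_carlo_p_value(
    observed_score: float,
    pool: Sequence[FrozenSet[str]],
    group_size: int,
    score_fn: Callable[[Sequence[FrozenSet[str]]], float],
    n_draws: int = 100_000,
    seed: Optional[int] = 0,
) -> float:
    """Proportion of random groups scoring >= the observed group."""
    rng = random.Random(seed)
    pool = list(pool)
    hits = 0
    for _ in range(n_draws):
        # Sample participants without replacement within each random group.
        group = rng.sample(pool, group_size)
        if score_fn(group) >= observed_score:
            hits += 1
    return hits / n_draws
```

In use, the observed score would be computed once for the nine carriers and then passed in together with the pool of unexplained, unrelated NDA participants and `group_size=9`. With a finite number of draws the smallest non-zero P value this estimator can return is `1 / n_draws`.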
To account for multiple comparisons, we adjusted the P values by multiplying them by the number of tests. An adjusted P < 0.05 was deemed to indicate statistical significance. To visualize both common and distinctive HPO terms for RNU2-2 cases, we selected terms that were either statistically significant or present in at least 50% of the cases, removed redundant terms at each level of frequency among the nine cases, and arranged the terms along with a nonredundant set of ancestral terms as a directed acyclic graph of ‘is-a' relations. These analyses were conducted using the ontologyX R packages37. Approximately 5,000 study participants in the NGRL also underwent whole-blood RNA-seq. We did the same for NGRL participants using RNA-seq reads aligned by DRAGEN to the GRCh38 reference genome. Both the NBR and the NGRL data were generated following a ribosomal RNA depletion and fragment size selection protocol that enables sequencing of short RNAs. To quantify expression of U2-1 and U2-2 in the NBR and the NGRL participants, we used the kallisto v.0.51.1 pseudoaligner to map reads against a GRCh38 reference transcriptome composed of all transcript sequences in Ensembl v.104 after removing duplicate sequences using the rmdup function from seqkit v.2.9.0. To compute the proportions of WGS reads supporting alternate alleles, we extracted the sequencing depth and the number of reads supporting each alternate allele at n.4 and n.35 of RNU2-2 from BAM files using ‘samtools mpileup' with default settings. We used the following primers to amplify genomic DNA containing the RNU2-2 gene before Sanger sequencing: forward primer, 5′-CCAATCCCAGGATCCTAAAAA-3′; reverse primer, 5′-GAAGACCACATGGAGATACTACG-3′. 11:62841419–62842071 in version GRCh38 of the human reference genome. 
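The HPO term enrichment procedure described in the statistical methods above (Fisher's two-sided exact test for each term, followed by multiplying each P value by the number of tests) can be sketched in Python as follows. This is a minimal sketch rather than the code used in the study: the function names are hypothetical, the two-sided P value is taken as the sum of the probabilities of all tables no more likely than the observed one (a common convention, assumed here), and capping adjusted values at 1 is an added assumption that the text does not state.

```python
from math import comb
from typing import List, Sequence


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    For one HPO term: a = carriers with the term, b = carriers without it,
    c = other participants with the term, d = other participants without it.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more probable than the observed one;
    # the small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Multiply each P value by the number of tests (capped at 1 by assumption)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

A term would then be called significant when its adjusted value falls below 0.05, matching the threshold given in the text. Because the adjustment scales with the number of tests, restricting testing to terms attached to at least three of the nine cases also limits how strongly each P value is penalized.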
We calculated the free energy of duplex formation (ΔG)38 with U6-1 and with branch site sequences for wild-type and mutant U2-2 using the RNA.fold_compound.eval_structure function in the ViennaRNA (v.2.6.4) Python package. This enabled us to calculate the difference in stability change on mutation, ΔΔG. For each proband for which trio WGS data were available, we selected read pairs overlapping the position of the de novo variant in question. If across all of these maternally inherited variants, the number of reads supporting linkage between the reference allele for one variant and the alternate allele for the other variant was equal to zero, and if at least one read supported linkage between the de novo alternate allele and at least one maternally inherited alternate allele, then the origin was determined to be maternal. If across all of the paternally inherited variants, the number of reads supporting linkage between the two reference alleles was equal to zero and the number of reads supporting linkage between the two alternate alleles was equal to zero, and at least one read supported linkage between the reference allele at the de novo variant position and at least one paternally inherited alternate allele, then the origin was determined to be maternal. The same logic was applied to determine a paternal origin. If none of the above conditions was met, the origin was determined to be inconclusive. We performed QC on RNA-seq data derived from the whole blood of 5,546 participants in the NGRL as follows. Based on visual inspection of QC parameter distributions, we filtered out samples with a percentage of RNA fragments larger than 200 bases (as measured using an Agilent TapeStation 4200) of ≤65%, a total read count outside the range (108M, 592M), a genome mapping rate <0.85 or a high-quality read rate <0.9 (where reads were deemed to be of high quality if they aligned as proper pairs, had fewer than seven mismatches and had a mapping quality ≥60).
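The read-backed rules for assigning the parental origin of a de novo variant, described earlier in this section, can be expressed as a small decision function. The Python sketch below is an interpretation under stated assumptions, not the authors' code: the `LinkCounts` record, its field names and the function names are hypothetical, each record is assumed to hold counts of read pairs linking the allele at the de novo site (first) to the allele at one nearby inherited heterozygous variant (second), and returning "inconclusive" when the evidence supports both parents is an added assumption that the text does not specify.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LinkCounts:
    """Read-pair counts linking the de novo site to one inherited variant.

    Field names give the allele at the de novo site first, then the allele
    at the inherited variant (for example, alt_ref = de novo alt + inherited ref).
    """
    inherited_from: str  # "mother" or "father"
    ref_ref: int
    ref_alt: int
    alt_ref: int
    alt_alt: int


def _supports(counts: List[LinkCounts], parent: str) -> bool:
    """True if the reads place the de novo allele on `parent`'s haplotype."""
    other = "father" if parent == "mother" else "mother"
    same = [c for c in counts if c.inherited_from == parent]
    opposite = [c for c in counts if c.inherited_from == other]
    # Rule 1: de novo alt travels with this parent's alt alleles, and no
    # read pair links a reference allele to an alternate allele.
    in_cis = (
        bool(same)
        and all(c.ref_alt == 0 and c.alt_ref == 0 for c in same)
        and any(c.alt_alt > 0 for c in same)
    )
    # Rule 2: the other parent's alt alleles travel with the reference allele
    # at the de novo site, and no read pair links ref-ref or alt-alt.
    in_trans = (
        bool(opposite)
        and all(c.ref_ref == 0 and c.alt_alt == 0 for c in opposite)
        and any(c.ref_alt > 0 for c in opposite)
    )
    return in_cis or in_trans


def parental_origin(counts: Iterable[LinkCounts]) -> str:
    """Return "maternal", "paternal" or "inconclusive"."""
    counts = list(counts)
    maternal = _supports(counts, "mother")
    paternal = _supports(counts, "father")
    if maternal and not paternal:
        return "maternal"
    if paternal and not maternal:
        return "paternal"
    # Assumption: no evidence, or conflicting evidence, is left unresolved.
    return "inconclusive"
```

For instance, a single paternally inherited variant whose alternate allele is only ever seen on read pairs carrying the reference allele at the de novo site yields "maternal", because the de novo allele must then sit on the other haplotype. The zero-count requirements make the rule strict, so a single discordant read pair is enough to leave a case unresolved.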
After QC filtering, 5,165 samples remained for analysis, including five cases with implicated variants in RNU2-2. We assessed allele-specific expression in cases by counting genome-aligned RNA-seq reads overlapping heterozygous sites using ‘samtools mpileup' with default settings. We selected 500 samples for differential gene expression and splice junction usage analysis by taking samples from the five cases and 495 samples selected at random from those passing the QC criteria and belonging to unrelated NDA-coded individuals presently unexplained. For the differential splicing analysis, we used the 905,036 junctions observed (that is, supported by at least one spliced read) in at least five of the 500 samples. The numbers of reads for each sample were normalized by dividing by the total number of uniquely aligned reads supporting splice junctions genome-wide. These values were then compared with equivalents for 500 randomly selected sets of five samples from among all 500 samples to assess whether there was at least one splice junction with extreme usage among the five RNU2-2 cases. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article. Access to blood cell RNA-seq data generated by the NIHR BioResource can be requested by contacting the NIHR BioResource Data Access Committee at dac@bioresource.nihr.ac.uk. 
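The normalization and random-set comparison described for splice junction usage could look roughly like the following sketch. The function names are hypothetical, the rank-based summary is one plausible reading of the comparison, and ties are given the maximum rank; none of this is presented as the authors' code.

```python
import random


def normalize_junction_counts(counts, total_spliced_reads):
    """Divide each sample's junction read count by its genome-wide spliced-read total."""
    return [c / t for c, t in zip(counts, total_spliced_reads)]


def mean_rank(usage, sample_idx):
    """Mean rank (1 = lowest usage) of the chosen samples; ties get the maximum rank."""
    ranks = [sum(1 for v in usage if v <= usage[i]) for i in sample_idx]
    return sum(ranks) / len(ranks)


def empirical_p_low(usage, case_idx, n_sets=500, seed=0):
    """Fraction of random same-size sample sets whose mean rank is at most that of the cases."""
    rng = random.Random(seed)
    observed = mean_rank(usage, case_idx)
    hits = 0
    for _ in range(n_sets):
        idx = rng.sample(range(len(usage)), len(case_idx))
        if mean_rank(usage, idx) <= observed:
            hits += 1
    return hits / n_sets
```

A small value from `empirical_p_low` would indicate that the cases use a junction unusually little relative to randomly chosen sets of the same size; the mirror-image test for unusually high usage follows by negating the usage values.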
HPO phenotype data in the NGRL were obtained from the ‘rare_diseases_participant_phenotype' table (Main Programme v.14), ‘observation' table (GMS v.3) and ‘hpo' table (Rare Diseases Pilot v.3); specific disease class data from the ‘rare_diseases_participant_disease' table (Main Programme v.13); ICD-10 codes from the ‘hes_apc' table (Main Programme v.13); pedigree information from the ‘rare_diseases_pedigree_member' table (Main Programme v.13), ‘referral_participant' table (GMS v.3), and ‘pedigree' table (Rare Diseases Pilot v.3); and explained and/or unexplained status of cases from the ‘gmc_exit_questionnaire' tables (Main Programme v.18, GMS v.3). Ensembl v.104 (http://may2021.archive.ensembl.org/index.html), gnomAD v.3.0 (https://gnomad.broadinstitute.org/) and CADD v.1.6 (https://cadd.gs.washington.edu/) were used for transcript selection and variant annotation against reference genome GRCh38. A more recent version of gnomAD, v.4.1.0, was used to assign the variant allele frequencies in RNU2-2 shown in Fig. Data presented in this paper were requested from the Genomics England Airlock on 13 August 2024 at 03:39 BST. The manuscript was submitted to the Genomics England Publication Committee on 21 August 2024 at 23:51 BST and approved for submission on 27 August 2024 at 15:52 BST. Software packages rsvr v.1.0, bcftools v.1.16, samtools v.1.9/1.16.1 and Perl v.5 were used to build the 100KGP Rareservoir. The Rareservoir software is available from https://github.com/turrogroup/rsvr. R v.3.6.2 and v.4.3.3 and all R packages that were used for data analysis and visualization (Matrix v.1.2-18, dplyr v.0.8.5, bit64 v.0.9-7, bit v.1.1-14, DBI v.1.1.0, RSQLite v.2.1.4, BeviMed v.5.7, ontologyIndex v.2.12, ontologySimilarity v.2.7, ontologyPlot v.1.7, ggplot2 v.3.5.0, tximport v.1.32.0 and DESeq2 v.1.44) are available via the Comprehensive R Archive Network site (https://cran.r-project.org/) or Bioconductor (https://bioconductor.org). 
The ViennaRNA v.2.6.4, salmon v.1.10.0, seqkit v.2.9.0 and kallisto v0.51.1 packages can be installed via the conda package manager, available from https://anaconda.org/anaconda/conda. Martin, A. R. et al. PanelApp crowdsources expert knowledge to establish consensus diagnostic gene panels. He, H. et al. Mutations in U4atac snRNA, a component of the minor spliceosome, in the developmental disorder MOPD I. Edery, P. et al. Association of TALS developmental disorder with defect in minor splicing component U4atac snRNA. Compound heterozygous mutations in the noncoding RNU4ATAC cause Roifman syndrome by disrupting minor intron splicing. The expanding phenotype of RNU4ATAC pathogenic variants to Lowry Wood syndrome. Elsaid, M. F. et al. Mutation in noncoding RNA RNU12 causes early onset cerebellar ataxia. Xing, C. et al. Biallelic variants in RNU12 cause CDAGS syndrome. Moyer, D. C. et al. Comprehensive database and evolutionary dynamics of U12-type introns. Greene, D. et al. Mutations in the U4 snRNA gene RNU4-2 cause one of the most prevalent monogenic neurodevelopmental disorders. Chen, Y. et al. De novo variants in the RNU4-2 snRNA cause a frequent neurodevelopmental syndrome. Caulfield, M. et al. National Genomics Research Library. Greene, D. et al. Genetic association analysis of 77,539 genomes reveals rare disease etiologies. Greene, D., Richardson, S. & Turro, E. A fast association test for identifying pathogenic variants involved in rare diseases. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. & Dvinge, H. Human spliceosomal snRNA sequence variants generate variant spliceosomes. Weinhold, N. et al. Genome-wide analysis of noncoding regulatory mutations in cancer. Bousquets-Muñoz, P. et al. PanCancer analysis of somatic mutations in repetitive regions reveals recurrent mutations in snRNA U2. Chen, S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. 
Martínez-Lumbreras, S., Morguet, C. & Sattler, M. Dynamic interactions drive early spliceosome assembly. Large-scale analysis of branchpoint usage across species and cell lines. Turro, E. et al. Whole-genome sequencing of patients with rare diseases in a national health system. Luque, J. et al. CIBERER: Spanish national network for research on rare diseases: a highly productive collaborative initiative. Schot, R. et al. Re-analysis of whole genome sequencing ends a diagnostic odyssey: case report of an RNU4-2 related neurodevelopmental disorder. Radio, F. C. et al. SPEN haploinsufficiency causes a neurodevelopmental disorder overlapping proximal 1p36 deletion syndrome with an episignature of X chromosomes in females. Structural and functional modularity of the U2 snRNP in pre-mRNA splicing. Miraglia, L., Seiwert, S., Igel, A. H. & Ares, M. Limited functional equivalence of phylogenetic variation in small nuclear RNA: yeast U2 RNA with altered branchpoint complementarity inhibits splicing and produces a dominant lethal phenotype. McPheeters, D. S. & Abelson, J. Mutational analysis of the yeast U2 snRNA suggests a structural similarity to the catalytic core of group I introns. & Manley, J. L. Base pairing between U2 and U6 snRNAs is necessary for splicing of a mammalian pre-mRNA. & Weiner, A. M. Genetic evidence for base pairing between U2 and U6 snRNA in mammalian mRNA splicing. Jia, Y., Mu, J. C. & Ackerman, S. L. Mutation of a U2 snRNA gene causes global disruption of alternative splicing and neurodegeneration. Heremans, J. et al. Abnormal differentiation of B cells and megakaryocytes in patients with Roifman syndrome. Cologne, A. et al. New insights into minor splicing—a transcriptomic analysis of cells derived from TALS patients. The rules and impact of nonsense-mediated mRNA decay in human cancers. Devereau, A., Scott, R. & Thomas, E.
Rare Disease Eligibility Criteria: 100,000 Genomes Project (Genomics England, 2018); https://files.genomicsengland.co.uk/forms/Rare-Disease-Eligibility-Criteria.pdf Greene, D., Richardson, S. & Turro, E. ontologyX: a suite of R packages for working with ontological data. Tinoco, I., Uhlenbeck, O. C. & Levine, M. D. Estimation of secondary structure in ribonucleic acids. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Patro, R. et al. Salmon provides fast and bias-aware quantification of transcript expression. & Robinson, M. D. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. Choi, S., Cho, N. & Kim, K. K. The implications of alternative pre-mRNA splicing in cell signal transduction. Rhode, B. M., Hartmuth, K., Westhof, E. & Lührmann, R. Proximity of conserved U6 and U2 snRNA elements to the 5′ splice site region in activated spliceosomes. Wilkinson, M. E., Charenton, C. & Nagai, K. RNA splicing by the spliceosome. Zhang, Z. et al. Cryo-EM analyses of dimerized spliceosomes provide new insights into the functions of B complex proteins. Boesler, C. et al. A spliceosome intermediate with loosely associated tri-snRNP accumulates in the absence of Prp28 ATPase activity. This research was made possible through access to data in the NGRL, which is managed by Genomics England Limited (a wholly owned company of the Department of Health and Social Care). The NGRL holds data provided by patients and collected by the NHS as part of their care and data collected as part of their participation in research. We thank NIHR BioResource volunteers for their participation, and gratefully acknowledge NIHR BioResource centers, NHS Trusts and staff for their contribution. We thank the National Institute for Health and Care Research, NHS Blood and Transplant, and Health Data Research UK as part of the Digital Innovation Hub Programme.
The Barakat laboratory was supported by the Netherlands Organisation for Scientific Research (ZonMw Vidi, grant 09150172110002) and acknowledges support from EpilepsieNL and CURE Epilepsy. These funding bodies had no influence over the study design, results, data interpretation or final manuscript. We thank all participants and families involved in the programs ‘Infraestructura de Medicina de Precisión asociada a la Ciencia y la Tecnología en Medicina Genómica (IMPaCT-GENóMICA)' and ‘Programes de Malalties Rares no Diagnosticades de Catalunya i CIBERER (URDCat/ENoD-CIBERER)'. IMPaCT-GENóMICA was supported by Instituto de Salud Carlos III, Ministerio de Ciencia e Innovación and the European Union European Regional Development Fund (IMP/00009) (principal investigator: Á.C.). The ENoD-CIBERER program was funded by the Biomedical Network Research Center for Rare Diseases-CIBER-ER-ISCIII (principal investigator: L.A.P.-J.). The ZOEMBA study was funded by Metakids and the United for Metabolic Diseases consortium, who thank M. Oud for bioinformatic support. was supported by Katholieke Universiteit (KU) Leuven Special Research Fund (BOF) (C14/23/121), Research Foundation – Flanders (G072921N) and NIH award R01HL161365. was supported by the Belgian American Education Foundation and NIH award R01HL161365. was further supported by the Lowy Foundation USA. These authors jointly supervised this work: Andrew D. Mumford, Ernest Turro. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA Daniel Greene, Koenraad De Wispelaere & Ernest Turro Department of Clinical and Molecular Genetics, Hospital Universitari Vall d'Hebron, Barcelona, Spain Medicine Genetics Group Vall d'Hebron Research Institute, Barcelona, Spain Emma Hales, Andrea Katrinecz, Sonia Pascoal, Natasha P. Morgan & Kathy Stirrups Emma Hales, Andrea Katrinecz, Sonia Pascoal, Natasha P. 
Morgan & Kathy Stirrups Department of Human Genetics, Radboud University Medical Center, Nijmegen, the Netherlands Department of Clinical Genetics, Erasmus MC University Medical Center, Rotterdam, the Netherlands Rachel Schot, Frank Sleutels, Sarina G. Kant & Tahsin Stefan Barakat CIBER-ER (Biomedical Network Research Center for Rare Diseases), Instituto de Salud Carlos III (ISCIII), Madrid, Spain Marta Sevilla Porras, Anna Duat Rodríguez, Elena González Alguacil, Irene Madrigal Bajo, Nelmar Valentina Ortiz Cabrera, Laia Rodríguez-Revenga Bodi, Ángel Carracedo, Pablo Lapunzina, Beatriz Morte & Luis Alberto Pérez-Jurado Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Barcelona, Spain Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, the Netherlands Pediatric Department, Hospital San Pedro de Alcántara, Cáceres, Spain Genetics Department, Hospital Niño Jesús, Madrid, Spain Anna Duat Rodríguez, Bárbara Fernández Garoz, Elena González Alguacil & Nelmar Valentina Ortiz Cabrera Biochemistry and Molecular Genetics Department, Hospital Clinic of Barcelona and Institut de Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain Neuropediatric Department, Pediatric Service, Hospital Universitario Marqués de Valdecilla, Santander, Spain Genomic Medicine Group, Center for Research in Molecular Medicine and Chronic Diseases, University of Santiago de Compostela, Santiago de Compostela, Spain Institute for Medical and Molecular Genetics (INGEMM), IdiPAZ, Madrid, Spain NHS South West Genomic Medicine Service Alliance, Bristol, UK Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
analyzed RNA-seq data, generated expression scatterplots and made the illustration showing molecular interactions. E.H. oversaw recruitment to the NBR RNA-seq project. designed primers, selected cases for sequencing and provided early access to detailed phenotype data on RNU4-2 cases for comparative analysis. provided data for the family that gave consent at Radboud UMC Nijmegen. recruited and provided data for the ZOEMBA study participants. and K. Stefansson provided data for the deCODE study participant. obtained consent and provided detailed phenotype information. obtained consent from and provided clinical information on individuals recruited to the IMPaCT-GENóMICA, URDCat and ENoD-CIBERER programs. provided biological interpretation and cowrote the paper. The other authors declare no competing interests. Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Histograms of the posterior probability of association (PPA) between the 41,132 canonical Ensembl transcripts not annotated as being protein-coding and neurodevelopmental abnormality (NDA), with and without filtering out variants with a CADD v1.6 score <10. The more recent CADD v1.7 gives scores >10 for these variants. Coverage of uniquely aligned RNA-seq reads from the whole blood of five RNU2-2 cases in the NGRL and in four blood cell types of an exemplar participant in the NBR demonstrating that RNU2-2 (previously annotated as the pseudogene RNU2-2P) is expressed abundantly in blood cells. The branch site sequence is depicted as the human YNYUNAY consensus motif (Y means C or T; N means any ribonucleotide), which interacts with the GUAGUA sequence at positions 33 to 38 in the U2-2 snRNA (depicted in red)20. Tethering of U4/U6.U5 tri-snRNP to U2-2 within the spliceosome pre-B complex enables displacement of U1 to enable a new interaction between U6 snRNA with the 5′SS and reconfiguration of U4/U6.U5 tri-snRNP to form the catalytically active spliceosome B complex, which is a prerequisite for the splicing reaction44. The variants responsible for RNU2-2 syndrome occur at critical interaction sites between U2-2 snRNA near r.4 and U6 snRNA and between U2-2 snRNA near r.35 and intronic branch sites. These interactions are necessary for intron recognition and the correct assembly of the catalytically active spliceosome B complex46. a, For each of the three rare variants at positions n.4 and n.35 of RNU2-2 called in the discovery collection, truncated bar charts showing the distribution of the proportions of reads supporting the alternate allele over participants, partitioned into 0% and all left-open intervals of size 4% up to 100%. 
Furthermore, seven participants with a homozygous reference call at n.35 have at least 8% of aligned reads at that position supporting the ‘T' allele, suggesting that n.35 A > T is not a germline variant, but rather a low-frequency somatic mosaic variant. These participants are significantly older than expected by chance (P = 1.3 × 10⁻³, Kolmogorov–Smirnov test). To comply with Genomics England's rules on identifiability, all ages of at least 95 years are included in the same x = 95 bin. a, Differential binding stability (ΔΔG) values between U2-2 and U6-1 for the A4 mutant allele compared to the reference G4 allele and between U2-2 and each of 16 branch site sequences consistent with the human YUNAY motif. Hydrogen bonding between cognate nucleotides is depicted with dotted lines. c, For each of the germline alleles observed at r.35 (the reference A35 and the mutant G35 and C35 alleles), a graphical representation of Watson–Crick interactions between the branch site recognition region in U2-2 (GUAG at n.33–36) and an example branch site sequence (CUUAU). Hydrogen bonding between cognate nucleotides is depicted with dotted lines. Sequencing read pileups for cases identified in the replication collections. Coverage of RNA-seq reads from whole blood aligned to the genome near RNU2-2 in five cases. The coverage levels of reads containing alternate alleles at heterozygous sites are shown in red. The aligned reads overlapping heterozygous sites show that both alleles are expressed robustly in the cases in pedigrees G6, M1 and S3. The cases in pedigrees G1 and G5 were heterozygous only at n.4, where coverage was too low to assess allele-specific expression. a, Histogram of the number of differentially expressed genes controlling FDR at 0.05 with the Benjamini–Hochberg procedure for randomly selected sets of five from 500 RNA-seq samples (five cases with implicated variants in RNU2-2 and 495 unexplained unrelated NDD cases).
b, Histogram of the proportion of unique RNA-seq alignments that contain a splice junction in the 500 RNA-seq samples. c, Histogram of the mean (over randomly selected sets of five samples) rank of normalized splice junction (SJ) usage of the splice junction with the lowest (left) and highest (right) mean rank. d, One-sided P values obtained by permutation of case labels within the 500 NGRL samples for the lowness of the sum of ranks of normalized numbers of reads supporting groups of splice junctions ranked from high to low (the upward facing blue triangles) and low to high (the downward facing red triangles), assigning the maximum rank in the event of ties. The splice junctions were grouped by: dinucleotide pairs at the splice sites (for N ≥ 5), quantile of GC content in the region encompassed by the splice junction, and quantile of splice junction length. GRCh38 coordinates for the 41,132 canonical transcript entries in Ensembl v.104 with a biotype other than ‘protein_coding'. Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. 
To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. Greene, D., De Wispelaere, K., Lees, J. et al. Mutations in the small nuclear RNA gene RNU2-2 cause a severe neurodevelopmental disorder with prominent epilepsy.
Pursuit of honest and truthful decision-making is crucial for governance and accountability in democracies. However, people sometimes take different perspectives of what it means to be honest and how to pursue truthfulness. Here we explore a continuum of perspectives from evidence-based reasoning, rooted in ascertainable facts and data, at one end, to intuitive decisions that are driven by feelings and subjective interpretations, at the other. We analyse the linguistic traces of those contrasting perspectives in congressional speeches from 1879 to 2022. We find that evidence-based language has continued to decline since the mid-1970s, together with a decline in legislative productivity. The decline was accompanied by increasing partisan polarization in Congress and rising income inequality in society. A collective commitment to truth cultivates discourse grounded in empirical evidence and fosters social cohesion through a shared understanding of reality1. In many democracies, there is currently much concern about ‘truth decay'2: the blurring of the boundary between fact and fiction3, not only fuelling polarization but also undermining public trust in institutions3,4. We adopt a framework that distinguishes two rhetorical approaches with which politicians can express their pursuit of truth5,6,7,8. One approach, which we call evidence-based, pursues truth by relying on evidence, facts, data and other elements of external reality. An alternative approach, called intuition-based, pursues truth by relying on feelings, instincts, personal values and other elements drawn mainly from a person's internal experiences.
Productive democratic discourse balances between evidence-based and intuition-based conceptions of truth. While evidence-based discourse provides a foundation for ‘reasoned' debate, intuition contributes emotional and experiential dimensions that can be critical for exploring and resolving societal issues. However, although the mix of evidence-based and intuition-based pathways to truth ranges along a continuum, exclusive reliance on intuition may prevent productive political debate because evidence and data can no longer adjudicate between competing political positions and eventually lead to an agreement. Here, we examine these developments by analysing the basic conceptions of truth that politicians deploy in political speech. We are not concerned with the truth value of individual assertions but with how the pursuit of truth is reflected in political rhetoric. We apply computational text analysis9,10 to measure the relative prevalence of evidence-based and intuition-based language in 145 years of speeches on the floor of the US Congress. The conceptions of truth used in congressional rhetoric are relevant to various measures of political and societal welfare. We analyse congressional rhetoric in relation to two likely drivers of democratic backsliding11: partisan polarization and income inequality. Polarization, characterized by growing ideological divisions and partisan animosity, undermines constructive dialogue, hampers compromise and erodes trust in political institutions, ultimately weakening democratic processes12,13. Previous research underscores the link between political polarization and language use, highlighting the influence of ideological divisions on communication patterns and political behaviour14,15. Economic inequality is also negatively associated with various individual and social outcomes16. For example, individuals in environments characterized by high inequality tend to project individualistic norms onto society17. 
This fosters greater competition and reduces cooperation, which, in turn, may damage democracy11. Polarization can play a role in increasing inequality through lower congressional productivity18, which could be affected by a shift from evidence-based language to intuition-based language in congressional rhetoric. This motivates our analysis of congressional rhetoric in relation to congressional productivity, as assessed through the quantity and quality of enacted laws over time19,20. Our analysis involves 8 million congressional speech transcripts between 1879 and 2022. We measure the relative salience of evidence-based language over intuition-based language as the evidence-minus-intuition (EMI) score, building on a text analysis approach that combines dictionaries with word embeddings to represent documents and concepts10 as used in previous work on political communication21,22. We constructed dictionaries to capture evidence-based and intuition-based language styles that underlie the two conceptions of truth (for example, ‘fact' and ‘proof' in the evidence-based dictionary and ‘guess' and ‘believe' in the intuition-based dictionary; see the Methods for the full dictionaries). We adopt the approach for construction and validation of dictionaries used in ref. 22 (see the Methods for details). Our final dictionaries consist of 49 keywords for evidence-based language and 35 keywords for intuition-based language (see Methods). This approach converts each conception of truth into a vector representation by averaging the embeddings of the corresponding dictionary keywords. Similarly, the target text is represented as the average of the word embeddings of its content words. A positive EMI score indicates a higher prevalence of evidence-based language (Supplementary Table 1), whereas a negative score suggests reliance on intuition-based language (Supplementary Table 2). See the Methods for further details, including validation of the EMI score against human ratings.
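The embedding-averaging idea behind the EMI score can be illustrated with a short sketch. It assumes that the similarity between a text and each dictionary is measured with cosine similarity and that EMI is the difference of the two similarities; the two-dimensional toy embeddings and all function names are hypothetical, and any scaling or length adjustment described in the Methods is omitted.

```python
import math

# Hypothetical 2-D embeddings, for illustration only.
TOY_EMBEDDINGS = {
    "fact": [1.0, 0.0], "proof": [0.9, 0.1], "data": [0.8, 0.2],
    "guess": [0.0, 1.0], "believe": [0.1, 0.9], "feel": [0.2, 0.8],
}


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def emi_score(text_words, embeddings, evidence_words, intuition_words):
    """Similarity to the evidence dictionary minus similarity to the intuition dictionary."""
    evidence_vec = mean_vector([embeddings[w] for w in evidence_words])
    intuition_vec = mean_vector([embeddings[w] for w in intuition_words])
    # Words without an embedding are skipped (an assumption of this sketch).
    text_vec = mean_vector([embeddings[w] for w in text_words if w in embeddings])
    return cosine(text_vec, evidence_vec) - cosine(text_vec, intuition_vec)
```

With these toy vectors, a text made of ‘data' and ‘fact' scores above zero against dictionaries of (‘fact', ‘proof') and (‘guess', ‘believe'), while a text made of ‘feel' and ‘guess' scores below zero.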
Figure 1a shows the trend of EMI score over time, reflecting the relative prevalence of evidence-based language. EMI was high and relatively stable from 1879 through the early part of the twentieth century. Subsequently, an upward trend from the 1940s culminated in a peak in the mid-1970s. Since then, evidence-based language has been on the decline. We also include a plot showing the trends of the evidence and intuition scores in the Supplementary Note 3 (Supplementary Fig.
Fig. 1: a–d, Time series of the EMI score in each congressional session between 1879 and 2022 (a), EMI scores separated by party (b), congressional polarization and inequality (c) and congressional productivity, measured as the MLI and the number of public laws passed by each session (d). We compute bootstrapped 95% CIs for EMI with 10,000 samples, which may appear too small to be visible owing to the large sample size.
These two periods align with notable historical events. In the 1890s, the USA experienced the Gilded Age, marked by rapid industrialization and economic growth, but also social unrest and increasing economic inequality. These economic and social upheavals probably influenced the language used in Congress during these periods. The profound impact of these events might have led to a greater emphasis on intuition-based language, consonant with previous research that has documented shifts in language use among individuals facing stressful situations23 as well as among political leaders confronted with crises24,25. An examination of a sample of speeches with low EMI scores in specific periods shows a tendency to focus on the crisis of the time (see Supplementary Table 3 for illustrative examples). Focusing on the period past 1970, one striking observation is that the level of EMI has recently fallen to its historical minimum, following a decreasing linear trend that started in the peak session of 1975–1976 (b = −0.032, P < 0.001, R2 = 0.927).
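A bootstrap confidence interval of the kind mentioned in the figure legend can be sketched as follows. The percentile method is an assumption of this sketch, since the text states only that 95% CIs were bootstrapped with 10,000 samples; the function name and default seed are hypothetical.

```python
import random


def bootstrap_ci_mean(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choices(values, k=n)
        means.append(sum(resample) / n)
    means.sort()
    # Approximate percentile positions in the sorted resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Because the interval width shrinks roughly with the square root of the number of speeches in a session, intervals computed over many thousands of speeches would be very narrow, which is consistent with the remark that they may be too small to see.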
Figure 1b illustrates the temporal trend of the EMI score for Democrats and Republicans separately. There is a strong positive correlation between the EMI scores for both parties (Pearson's r = 0.778, 95% confidence interval (CI) = [0.666, 0.855], P < 0.005). We observe some divergence between parties in the early periods. However, since the mid-1970s, both parties have moved largely in the same downward direction in their rhetoric. The same pattern holds for both parties across the House and Senate (Supplementary Fig. It is, however, noticeable that the EMI of Republicans dropped substantially, and more steeply than for Democrats, in the last session (2021–2022). A Mann–Whitney test shows that the difference in median EMI score (−0.435 for Democrats and −0.753 for Republicans) is significant (P < 0.001). To exclude a dependence of the trend observed in Fig. 1a on topic composition of the speeches, we aggregate the EMI score by taking a macroaverage over topics such that topics have equal weighting. The results of this approach show a very similar trend (Supplementary Note 10). To address potential concerns about semantic change over the extended timescale of this study, we perform an analysis of the stability of the meaning of dictionary keywords and also compute the EMI score using temporal embeddings. Figure 1c shows partisan polarization in Congress over time, measured as the difference between the first dimension of DW-NOMINATE (Dynamic Weighted NOMINAl Three-step Estimation) scores26,27 for the two major parties averaged across the House and Senate. It is important to clarify that the political polarization indicator used in this study, DW-NOMINATE, measures polarization using voting behaviour within a legislative context. Additional indicators can also be derived from computational text analysis, as well as from opinion, structural and interactional dynamics.
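The topic-balanced aggregation mentioned above (a macroaverage in which every topic carries equal weight) differs from a plain average over speeches, as the following sketch shows. The input format, a list of (topic, EMI) pairs, and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def macro_average_emi(speeches):
    """Average the per-topic mean EMI so that each topic has equal weight."""
    by_topic = defaultdict(list)
    for topic, emi in speeches:
        by_topic[topic].append(emi)
    topic_means = [sum(scores) / len(scores) for scores in by_topic.values()]
    return sum(topic_means) / len(topic_means)


def micro_average_emi(speeches):
    """Average over all speeches, so frequent topics dominate."""
    scores = [emi for _, emi in speeches]
    return sum(scores) / len(scores)
```

If three speeches on one topic score 1.0 and a single speech on another topic scores −1.0, the macroaverage is 0.0 while the speech-level average is 0.5, which is why the macroaverage guards against shifts in topic composition.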
The recent decline of EMI is accompanied by a corresponding upward trend in partisan polarization in Congress and rising income inequality in society, which is statistically supported as follows. EMI and polarization are negatively cross-correlated (Pearson's r = −0.615, 95% CI = [−0.741, −0.447], P < 0.005), and a lagged correlation analysis shows that lag zero has the highest correlation (Supplementary Fig. When included in lagged regression models, EMI does not explain a significant amount of the empirical variance of polarization, but polarization has a significant coefficient in the EMI model (b = −0.15, 95% CI = [−0.29, −0.01], P < 0.05). EMI values are informative of future inequality. Figure 2 shows the historical values of inequality as a function of EMI in the previous session, that is, the previous two years (Pearson's r = −0.948, 95% CI = [−0.973, −0.902], P < 0.001). A lagged correlation analysis shows that the strongest correlations appear when EMI precedes inequality (Supplementary Fig. This is buttressed by a lagged regression model including the level of inequality, polarization and EMI from the previous session, as well as their interaction. The results of that fit in comparison with an autoregressive (AR) model reveal a negative coefficient of EMI with inequality 2 years later (b = −0.11, 95% CI [−0.20, −0.02]). Details of these models are in the Methods section. The interaction with polarization is not significant and weak enough for the slope of EMI to stay negative (Supplementary Fig. These regression results are robust to other specifications of the analysis, for example, when using the Gini index instead of the top 1% share of income (restricted to the time since full income data became available), when using all available data since 1912 and when considering a longer lag for polarization (Supplementary Table 4). 
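A lagged correlation analysis of the kind described above can be sketched in a few lines. This is an illustrative implementation under the assumption that both series are aligned lists with one value per session; the function names are hypothetical and no confidence intervals or P values are computed.

```python
from math import sqrt
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lagged_correlation(emi, other, lag):
    """Correlate emi(t) with other(t + lag).

    A positive lag means EMI precedes the other series.
    """
    if lag > 0:
        x, y = emi[:-lag], other[lag:]
    elif lag < 0:
        x, y = emi[-lag:], other[:lag]
    else:
        x, y = emi, other
    return pearson_r(x, y)


# Toy series in which `other` mirrors EMI one step later.
emi = [1, 3, 2, 5, 4, 6, 8, 7]
other = [0] + [-v for v in emi[:-1]]
for lag in (-1, 0, 1):
    print(lag, round(lagged_correlation(emi, other, lag), 3))
```

Scanning a range of lags and picking the one with the largest absolute correlation shows which series tends to lead the other.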
Evidence-based language can be a tool to identify factual constraints for Congress to formulate legislation, which often requires some form of bipartisan agreement. We examine the relationship between EMI and congressional productivity as measured by three indicators. First is the major legislation index (MLI)19 which measures the productivity of Congress in terms of important legislation. Second is the legislative productivity index (LPI)19, which combines assessments of important legislation and number of laws enacted. Previous research analyses congressional productivity as a function of polarization, party composition in the legislature and executive branch19 and public mood towards more regulation as measured in surveys29. From these indicators, polarization and public mood towards regulation are the most important predictors, explaining a significant amount of the variance of productivity over time19. Figure 3 shows the relationship between all three congressional productivity metrics and EMI measured in the same session. All three cases have positive and significant correlations (MLI: Pearson's r = 0.454, 95% CI = [0.09, 0.711], P < 0.05; LPI: Pearson's r = 0.836, 95% CI = [0.667, 0.923], P < 0.001; log-transformed number of laws: Pearson's r = 0.796, 95% CI = [0.633, 0.891], P < 0.001). However, polarization and public mood about regulation play an important role in congressional productivity, which is shown by the colour of the plotting symbols in Fig. Points representing high public mood (blue) tend to lie above the regression line, and points with low public mood (red) tend to lie below. For that reason, we fitted the base models of ref. 19 and tested if adding the EMI of a session has a positive association with the LPI. 
The results (see Methods for details) reveal that, after controlling for known correlates in productivity and for an interaction between polarization and EMI, the coefficient of EMI is positive and significant for MLI (b = 0.67, 95% CI = [0.14, 1.20], P < 0.05) and LPI (b = 0.83, 95% CI = [0.40, 1.26], P < 0.05), and positive for the number of laws but not statistically significant (P < 0.1). We see this as an indication that EMI plays a role in congressional productivity, with the association being more salient when considering major legislation in comparison with minor laws where parliamentary debate might not play as big a role.

Fig. 3 caption: Points are coloured according to public mood towards regulation during the legislative period. The grey lines represent linear regression models of each productivity variable as a function of EMI alone, and the shaded areas indicate the 95% CIs for the regression fits.

We introduce an approach for quantifying the conception of truth that members of Congress embrace and deploy in their rhetoric. Using embedded dictionaries in conjunction with embedding of congressional speeches, we calculate and validate the EMI score from transcripts of congressional speeches spanning the years 1879–2022. The EMI score reflects the prevalence of evidence-based language when positive and intuition-based language when negative. We study the temporal trends of the EMI score and investigate its relationships with measures of polarization and inequality as well as congressional productivity. We find that EMI shows a pattern of relative stability until the 1940s, which is followed by a clear upward trajectory that reached a maximum in the 1970s. Since then, EMI trends downwards, indicating a decline in the prevalence of evidence-based language for both parties. The degree of synchronization in the linguistic styles used by both Democrats and Republicans during this period points to their alignment around messaging strategies30.
We examine the decline in EMI in relation to three outcome variables that are indicative of democratic health and find a concerning association in all cases: a decline in evidence-based language is associated with increasing polarization and increasing income inequality but decreased congressional productivity. The temporal sequence of those trends differs between variables. For polarization, the association with EMI is strongest at lag zero, and we find that polarization is a significant predictor of EMI, but not vice versa. This suggests that polarization and politicians' rhetoric evolve in tandem. By contrast, EMI precedes shifts in income inequality, such that a stronger emphasis on evidence-based reasoning is associated with subsequent reduction in income inequality whereas greater reliance on intuition seems to be associated with the persistence of existing social disparities. This finding aligns with existing research on language and social inequality31, which underscores how language patterns have consequences for understanding social issues and may either promote or inhibit necessary changes. Intuition-based language may help to explain the relationship between polarization and inequality, as it is linked to legislative inaction and can hinder policies that address income inequality through redistribution18. Finally, the association of evidence-based language with congressional productivity is again contemporaneous. In the Habermasian view of communicative action32, evidence-based language serves as the foundation for ‘reasoned’ debates and can steer discussions away from personal and political hostilities. In this communicative process, evidence-based language serves as a tool to establish a shared understanding of the state of the world and contributes to the formulation of well-informed decisions.
The positive correlation that we observe in our study between the EMI score and legislative productivity (in terms of quality and quantity) is in line with this viewpoint. The observed patterns in congressional language are the result of a complex interplay of various factors, some of which are unique to the political and societal context of the USA. One contributing factor to these patterns is the control exerted by party leadership over who speaks on the congressional floor33, potentially shaping the content and tone of speeches. This control mechanism is likely to influence the language used by congressional members in aligning with the strategic objectives of their party. In addition to the influence exerted by party leaders, members of Congress may find themselves compelled to cater to their base, encompassing constituents, donors and lobbyists, particularly in a highly polarized environment driven by partisanship34. Modifications to congressional rules and procedures, particularly around the length of debates, can influence the breadth and depth of discussions on the congressional floor. For example, the introduction of the ‘cloture' rule in the Senate in 1917 provided a mechanism to limit debate time and expedite legislative processes. Before this, there was no formal method to end a debate or force a vote on an issue, which allowed extended deliberations. While such rules may improve efficiency, they can also shorten discussions and potentially limit the richness of legislative debates. The evolving nature of congressional rules and procedures can influence the characteristics of discourse on the congressional floor over time. Presidents have increasingly sought to expand their powers, often justified by their role as commander-in-chief, particularly during crises or in an attempt to unilaterally advance their policy agendas35. 
Mechanisms such as executive orders and the creation of administrative agencies under presidential control have facilitated this expansion. While some of these actions are supported by congressional authorization, the steady accumulation of executive power may have implications for the legislative branch. This expansion may limit the sphere of influence of Congress, potentially reducing its role to rubber-stamping presidential initiatives. Conversely, it can also lead to tensions and heightened oversight efforts by Congress on activities of the executive branch and agencies. Furthermore, the impact of media on politicians, particularly their adoption of media logic36, introduces an additional dimension to the nature of political representation. In an era characterized by increasing polarization, politicians might find themselves driven to embrace a perpetual campaign style of representation37, transforming congressional speeches into orchestrated performances aimed at capturing media attention. Consequently, this shift may result in a reduced focus on meaningful intellectual discourse and nuanced policy discussions within the legislative body. This interpretation meshes well with a recent analysis of the Twitter/X communications of US Congress members from 2011 to 2022, which similarly differentiated between evidence-based ‘fact-speaking' and authentic ‘belief-speaking' as alternative expressions of honesty22. That study discovered an association between the prevalence of authentic belief-speaking and a decrease in the quality of shared sources in tweets, particularly among Republicans. This suggests a potential link between belief-based language and the dissemination of low-quality information to the public. The findings presented in this study highlight important correlational associations. The absence of causal evidence underscores the need for future research to further establish definitive causal relationships. 
We have highlighted concerning trends in Congress where evidence-based language is declining and partisan polarization is increasing. The decline in the quality and quantity of legislative output at a time of multiple global crises should be of concern. On a more positive note, understanding the complex relationship between the language of political discourse and partisan polarization points to avenues for interventions focused on fostering more constructive and productive debate. Initiatives such as those promoting collaboration and communication across partisan boundaries38 can contribute to rebuilding a more robust democratic discourse. Ultimately, the challenge lies in having a Congress (and, by extension, a deliberative public) where truth is valued, polarization is in check and legislative outcomes reflect the diverse needs of the citizens. We initially rely on the dataset compiled by Gentzkow et al.39 and supplement it with recent data obtained by accessing the congressional records' website using an automated script40. The dataset includes essential metadata such as speaker information (including party) and dates. To ensure the quality of our dataset, we use a number of preprocessing steps. Procedural speeches are speeches delivered by members of Congress that mainly deal with the rules and procedures that govern legislative proceedings. These may include discussions on amendments to rules, requests for unanimous consent, or the announcement of votes. We train three classifiers, following the methodology outlined by Card et al.41, to identify procedural speeches. We remove procedural speeches by using a majority vote ensemble of the classifiers. In general, the congressional record is of high quality. However, in the earlier years, it contains some instances of optical character recognition errors that result in unintelligible content (for example, in the rendering of a table). 
To mitigate the potential noise from lengthy speeches that consist mainly of lists of names or numbers, we use a filtering mechanism. This filter evaluates the ratio of common (top 100) English words (for example, ‘the’, ‘and’ and ‘is’) to the total token length of a speech. We set a threshold of 0.05, ensuring that speeches with substantive content are retained for further analysis. We keep speeches that are attributed to members of the two major parties. We filter out speeches with fewer than 11 tokens and remove duplicate entries. Our final dataset consists of 8,435,769 speeches with an average length of approximately 199 tokens. Speeches made by Democrats account for 53% of the dataset and 47% of speeches are by Republicans. Supplementary Fig. 1 shows the number of speeches for each congressional session across both chambers (House and Senate) from 1879 to 2022. The number of speeches for each session varies. Nevertheless, a substantial number of speeches is available, with at least 35,000 speeches for each session, to enable a reliable analysis. To facilitate further analysis, we split longer speeches (consisting of more than 150 tokens) into chunks of approximately 150 tokens each. We set a minimum chunk size of 50 tokens, such that a chunk smaller than the minimum size is merged with the immediately preceding chunk. We start with seed keywords, one set for each conception of truth, generated by the researchers involved in this work. The goal is to capture linguistic cues that may signal the pursuit of truth in a speaker. Initial keywords for evidence-based language include ‘reality’, ‘assess’, ‘examine’, ‘evidence’, ‘fact’, ‘truth’ and ‘proof’. For intuition-based language, the initial keywords include ‘believe’, ‘opinion’, ‘consider’, ‘feel’, ‘intuition’ and ‘common sense’. We expand these lists computationally using a combination of fastText embeddings42 and Colexification networks43.
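The speech filter and the chunking rule described in the preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `COMMON_WORDS` is a tiny stand-in for the top-100 list, whitespace tokens are assumed, and the direction of the threshold (keep speeches whose common-word ratio is at least 0.05) is an assumption based on the stated goal of retaining substantive speeches.

```python
# Stand-in for the top-100 common English words (assumption: the real list is longer).
COMMON_WORDS = {"the", "and", "is", "of", "to", "a", "in"}


def common_word_ratio(tokens):
    """Share of tokens that are common English words."""
    if not tokens:
        return 0.0
    return sum(tok.lower() in COMMON_WORDS for tok in tokens) / len(tokens)


def keep_speech(tokens, threshold=0.05, min_tokens=11):
    """Keep speeches that are long enough and not mere lists of names or numbers."""
    return len(tokens) >= min_tokens and common_word_ratio(tokens) >= threshold


def chunk_tokens(tokens, size=150, min_size=50):
    """Split into chunks of `size` tokens; merge a short final chunk into the previous one."""
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_size:
        chunks[-2] = chunks[-2] + chunks[-1]
        chunks.pop()
    return chunks


print([len(c) for c in chunk_tokens(list(range(320)))])  # short tail is merged
print([len(c) for c in chunk_tokens(list(range(360)))])  # tail is long enough to stand alone
```

Merging the short tail keeps every chunk at or above the minimum size, at the cost of one chunk that is somewhat longer than the target length.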
Using fastText embeddings, we expand the seed words by including those words with a cosine similarity score above 0.75. Colexification networks connect words within a language on the basis of their common translations across other languages, thus identifying words that express related concepts. For instance, the words ‘air' and ‘breath' are considered colexifications because they translate into the same word in multiple languages. Incorporating colexification networks into lexicon expansion results in word lists with a better trade-off between precision and recall compared with methods relying solely on word embeddings44. We filter the expanded lists by removing duplicates and terms appearing in both categories. In addition, we retain only one variant of lemma inflections (for example, ‘investigate', ‘investigates' and ‘investigated'). Following the same approach used in ref. 22, we then recruited participants on Prolific to rate each keyword on their representativeness on two scales, one for evidence-based and one for intuition-based language. We then keep only words rated as statistically more representative for their respective construct than the other. Informed consent was obtained from all participants before their participation in the annotation task. The annotation task was performed in accordance with relevant ethical guidelines and regulations. Our final dictionaries consist of 49 keywords for evidence-based language and 35 keywords for intuition-based language (Table 1). The difference in the number of keywords is not a concern in our approach, because we use the distributed dictionary representation method10, which effectively normalizes the impact of varying keyword counts by representing each dictionary with a single vector. This ensures a consistent measure of evidence-based and intuition-based language, enabling meaningful comparisons across both constructs. 
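The embedding-based part of the lexicon expansion and the removal of terms shared by both lists can be sketched as below. The two-dimensional toy vectors stand in for fastText embeddings, and the function names are hypothetical; the colexification step and the human rating step are not covered by this sketch.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_seeds(seeds, vectors, threshold=0.75):
    """Add every vocabulary word whose similarity to some seed exceeds the threshold."""
    expanded = set(seeds)
    for word, vec in vectors.items():
        if any(s in vectors and cosine(vec, vectors[s]) > threshold for s in seeds):
            expanded.add(word)
    return expanded


def drop_shared_terms(evidence, intuition):
    """Remove terms that ended up in both expanded lists."""
    shared = evidence & intuition
    return evidence - shared, intuition - shared


# Toy vectors (made up for illustration).
toy_vectors = {
    "fact": (1.0, 0.0),
    "evidence": (0.9, 0.1),
    "feel": (0.0, 1.0),
    "believe": (0.1, 0.9),
    "consider": (0.7, 0.7),
}
print(sorted(expand_seeds({"fact"}, toy_vectors)))
print(sorted(expand_seeds({"feel"}, toy_vectors)))
```

In the toy data, ‘consider’ sits between the two seeds and clears the 0.75 threshold for neither, which is the kind of ambiguous term a strict threshold is meant to exclude.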
In our methodology, we start by training 300-dimensional word embeddings using the Word2Vec9 algorithm on the corpus of congressional speeches. Word2Vec is an algorithm that generates dense vector representations of words, known as word embeddings. The rationale behind using Word2Vec lies in its ability to capture semantic relationships among words by representing them in a continuous vector space. This algorithm learns to predict the context of a word on the basis of its surrounding words or vice versa. The resulting word embeddings encode semantic similarities, making them valuable for computational analysis of language. Following this, we compute a representation for the concepts of interest by averaging the word embeddings for the relevant keywords in the respective dictionaries for evidence-based and intuition-based language. For a given text, we compute its representation by taking the average of the word embeddings for its content words. This representation allows a graded measure of relatedness to each construct as we can calculate the cosine similarity between each construct representation and the representation of a given target text that is computed in the same manner. To generate the representations and compute cosine similarities, we use the sentence-transformers library46, leveraging our trained Word2Vec model. This approach offers efficiency and effectiveness in capturing the semantic content of textual data. This set-up allows us to obtain textual embeddings with minimal computational resources and ensures the scalability of our analysis. To address variations in the length of speeches, we perform length adjustments for the cosine similarities. This involves binning the similarities by length and subtracting the mean similarity within each bin from the cosine similarity of each instance. Subsequently, we apply a Z-transform to the cosine similarities to derive the evidence and intuition scores. 
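The scoring steps above (averaged dictionary vector, averaged text vector, cosine similarity, length adjustment, Z-transform) can be sketched end to end with toy vectors. Several points are assumptions rather than statements about the authors' code: the bin width of 50 tokens, the use of the population standard deviation, the lack of stop-word filtering, and taking EMI as the evidence score minus the intuition score, which is what the name of the measure suggests.

```python
from collections import defaultdict
from math import sqrt
from statistics import fmean, pstdev


def mean_vector(words, vectors):
    """Average the embeddings of the words that are in the vocabulary."""
    found = [vectors[w] for w in words if w in vectors]
    dim = len(next(iter(vectors.values())))
    return [fmean(col) for col in zip(*found)] if found else [0.0] * dim


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def length_adjust(sims, lengths, bin_width=50):
    """Subtract the mean similarity of each length bin (bin width is an assumption)."""
    bins = defaultdict(list)
    for s, n in zip(sims, lengths):
        bins[n // bin_width].append(s)
    means = {b: fmean(v) for b, v in bins.items()}
    return [s - means[n // bin_width] for s, n in zip(sims, lengths)]


def z_scores(xs):
    mu, sd = fmean(xs), pstdev(xs)
    return [(x - mu) / sd if sd else 0.0 for x in xs]


def emi_scores(texts, evidence_words, intuition_words, vectors):
    """EMI per text, taken here as evidence z-score minus intuition z-score (assumption)."""
    ev = mean_vector(evidence_words, vectors)
    it = mean_vector(intuition_words, vectors)
    docs = [mean_vector(t, vectors) for t in texts]
    lengths = [len(t) for t in texts]
    ev_z = z_scores(length_adjust([cosine(d, ev) for d in docs], lengths))
    it_z = z_scores(length_adjust([cosine(d, it) for d in docs], lengths))
    return [e - i for e, i in zip(ev_z, it_z)]


toy_vectors = {"fact": (1.0, 0.0), "evidence": (0.9, 0.1), "data": (0.8, 0.2),
               "feel": (0.0, 1.0), "believe": (0.1, 0.9), "think": (0.2, 0.8)}
scores = emi_scores([["data", "fact"], ["think", "feel"]],
                    ["fact", "evidence"], ["feel", "believe"], toy_vectors)
print(scores)
```

Because each dictionary is collapsed into a single averaged vector, the two constructs are compared on the same footing even though the keyword lists differ in length.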
Supplementary Tables 1 and 2 contain illustrative examples of speeches with positive and negative EMI scores, respectively. For further analysis, we take the mean of the EMI score per 2-year period, corresponding to the typical duration of congressional sessions. For completeness, we also include a plot of the trends for each of the component scores (Supplementary Fig. Given the extended timescale of this study, concerns may arise about semantic change and the possibility that the embeddings model relies mostly on more recent data. To address these concerns, we train temporal embeddings on two-decade slices of the speeches and downsample the data to ensure comparable token counts across time periods. We split the EMI score into four bins per decade. We sample five (four for the most recent decade) (quasi)sentences from each bin per party (Democrats versus Republicans) and decade, resulting in a sample of size 592. We ask participants on Prolific to rate to what extent a given text is evidence-based and intuition-based (or evidence-free) on two Likert scales ranging from 1 to 7. We received an ethics review exemption from the University of Konstanz ethics review board for the annotation task used to validate the EMI score. Informed consent was obtained from all participants before their participation in the annotation task. The annotation task was performed in accordance with relevant ethical guidelines and regulations. Each text has at least five ratings. We collected a total of 4,563 human ratings from 156 participants. As the average of the ratings for each scale are negatively correlated at the document level (−0.85, P < 0.001), we derive human judgement by assigning a label of evidence-based if the average evidence-based rating is greater than the average intuition-based rating; otherwise, we classify the item as intuition-based. Annotators have relatively high levels of agreement. We compute the AUC score per decade and for all samples. 
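The validation step described above, deriving a binary human label from the two rating scales and scoring EMI against it with the AUC, can be sketched as follows. The function names are hypothetical; the AUC is computed with the standard pairwise (Mann–Whitney) formulation, which is an illustrative choice rather than a claim about the authors' implementation.

```python
from statistics import fmean


def human_label(evidence_ratings, intuition_ratings):
    """1 (evidence-based) if the mean evidence rating exceeds the mean intuition rating, else 0."""
    return 1 if fmean(evidence_ratings) > fmean(intuition_ratings) else 0


def auc(scores, labels):
    """Probability that a random positive item gets a higher score than a random negative one."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one item of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy ratings on 1-7 Likert scales and toy EMI scores (made up).
labels = [human_label([6, 7, 5], [2, 1, 3]),
          human_label([5, 5, 6], [3, 4, 2]),
          human_label([2, 3, 1], [6, 6, 5]),
          human_label([1, 2, 2], [7, 5, 6])]
print(labels, auc([0.9, 0.1, 0.4, -0.5], labels))
```

An AUC of 0.5 corresponds to chance and 1 to perfect separation of evidence-based from intuition-based items.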
The AUC is a measure of the reliability of our computed EMI score, with a score of 1 indicating perfect accuracy and 0.5 representing performance equivalent to random chance. Our method achieves an overall AUC of 0.79 across decades, ranging from 0.60 to 0.94 (Table 2). Compared with the random baseline AUC of 0.5, our method demonstrates acceptable to excellent discrimination levels47. Supplementary Table 3 presents examples of speeches with low EMI score (in the bottom 1%) in periods with overall low EMI scores in Fig. Consistent with previous research23,24,25 that highlighted changes in the language of individuals and political leaders during crises, these examples suggest a tendency for discussions about the crisis of the time to rely more on intuition-based language rather than evidence-based language. Supplementary Fig. 3 shows the trend of EMI by party in both chambers of the US Congress over time. We fit time series as linear regression models that include lagged dependent variables to consider autocorrelation. For each time series, we fit AR models with increasing lags up to a point in which the quality of models does not improve with additional lags. In all cases we report, inclusion of one lag generated the best univariate AR model. We next extend these models with other variables including EMI and other covariates. We measure variance inflation factors (VIFs) of the independent variables of the models and include interaction terms when any of the covariates, excluding the lagged dependent variable, has a VIF above 5. After fitting a model specified in this way, we measure standard errors and P values with a heteroskedasticity and autocorrelation (HAC)-adjusted estimator. We assess the stationarity of residuals with augmented Dickey–Fuller (ADF) tests and Kwiatkowski–Phillips–Schmidt–Shin (KPSS) tests, and the normality of residual distributions with Jarque–Bera (JB) tests.
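The one-lag AR models used as baselines above amount to regressing each value on the previous one. The sketch below shows that point estimate only; the function name is hypothetical, and the HAC-adjusted standard errors and the ADF, KPSS and JB diagnostics mentioned in the text are not reproduced here.

```python
from statistics import fmean


def fit_ar1(series):
    """Fit y(t) = c + phi * y(t-1) by ordinary least squares.

    Returns (c, phi). Standard errors are not computed in this sketch.
    """
    prev, curr = series[:-1], series[1:]
    m_prev, m_curr = fmean(prev), fmean(curr)
    s_pp = sum((p - m_prev) ** 2 for p in prev)
    s_pc = sum((p - m_prev) * (c - m_curr) for p, c in zip(prev, curr))
    phi = s_pc / s_pp
    return m_curr - phi * m_prev, phi


# Noise-free toy series generated with c = 1 and phi = 0.5.
toy = [0.0]
for _ in range(10):
    toy.append(1.0 + 0.5 * toy[-1])
print(fit_ar1(toy))
```

Extending such a baseline with further lagged covariates, and checking whether their coefficients remain significant, is the comparison the text describes.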
Models generally passed these regression diagnostics, being able to reject the null hypothesis of the ADF test at a 0.05 level and failing to reject the null of the KPSS and JB tests at a 0.1 level. We report here any relevant cases where those diagnostics are different. In our primary analysis of inequality, we consider the fraction of income of the top 1% from 1944, which is the year when tax declaration exemption rules qualitatively changed and led to more reliable inequality metrics48. We add one more specification to robustly test how the role of polarization influences our results about EMI and inequality. A lagged correlation analysis between inequality and polarization indicates that the correlation between these two is strongest when considering a lag of eight legislative sessions (Supplementary Fig.). To consider this longer lag, we fitted an additional regression model of inequality with EMI and the previous value of inequality, but with the value of polarization eight sessions prior. Results of this fit are reported in Supplementary Table 4. The session with the highest EMI score is 1975–1976, with a score of 0.358, closely followed by the previous session with an EMI score of 0.355 but substantially higher than the mean session, which has a slightly negative EMI of −0.017. The peak EMI is more than two standard deviations (s.d.) above that mean. From that peak, a downward trend is noticeable and is confirmed by a linear regression model of the form EMI(t) = a + b t + ε(t), where t indexes time from the peak session onwards. The fit has an intercept a = 0.258 and a slope b = −0.032, both with P < 0.001. This is further corroborated by breakpoint analyses (Supplementary Note 12) that identified the session 1973–1974 (the session before 1975–1976) as a breakpoint. To measure partisan polarization in Congress, we use the first dimension of the DW-NOMINATE score26, which measures the ideological position of members of Congress derived from their roll-call votes.
A higher difference in the first dimension indicates a greater ideological distance or polarization between the parties. We use the DW-NOMINATE data from Voteview27, which offers a comprehensive and widely used resource for studying the ideological landscape and partisan dynamics within the US Congress. To understand the relationship between EMI and polarization (scatter plot in Supplementary Fig. 5), we fitted lagged regression models that predict each variable from lagged values of itself and of the other variable, and compared them with AR models ignoring the other variable. Results of the fits (Table 3) show that EMI does not have a significant coefficient in the polarization model and that the EMI model has a significant negative coefficient for polarization, but of small magnitude compared with the AR coefficient of EMI. A KPSS test of residuals in this model rejects the null hypothesis (P = 0.036), but an ADF test also rejects the null (P = 0.02). While residuals deviate a bit from being stationary, we use HAC covariance matrix estimation and residuals do not significantly deviate from normality, as a JB test is not significant (P = 0.645). A lagged correlation analysis shows that the strongest correlation between EMI and inequality is at lag 2, where EMI precedes inequality (Supplementary Fig.). Inequality is also known to be correlated with polarization18, which we also observe in our lagged correlation analysis in Supplementary Fig. For that reason, we study the role of EMI in inequality while considering polarization, as EMI and polarization are negatively cross-correlated. The VIF of a specification including lagged measures of inequality, EMI and polarization is 9.67, indicating that we need to include an interaction term between EMI and polarization. Thus, our model regresses inequality on the lagged values of inequality, EMI and polarization and on the interaction between lagged EMI and lagged polarization. We compare this model with a simple AR model including lagged values of inequality and polarization. The results are presented in Table 4.
The lagged value of EMI has a negative and significant coefficient on inequality, and the interaction with polarization is not significant. The interaction between EMI and polarization, while positive, does not lead to an important mediation in the role of EMI, as shown in Supplementary Fig. The results of our analysis of inequality remain qualitatively similar with different specifications for the decisions we took in our analysis above. In this specification, the VIF of predictors is 13.17, indicating the need for inclusion of an interaction term between polarization and EMI. The coefficient of EMI is negative and significant, but it has a significant positive interaction with polarization. Supplementary Fig. 7b shows the shape of this interaction, revealing that high levels of polarization do not reverse the direction of association with EMI. In this model, the VIF is 3.175, but we keep the interaction term between polarization and EMI for comparability to other models. Supplementary Fig. 7c shows the shape of this interaction, where high levels of polarization do not reverse the slope of inequality with EMI. For that reason, we performed a bootstrapping test on the coefficient of EMI with 10,000 samples, which indicates that the negative coefficient for EMI is robust to non-normal residuals (95% CI = [−0.34, −0.13]). The model with a longer lag for polarization also has a high VIF of 10.02, motivating the inclusion of the interaction between EMI and polarization. In this model, the coefficient of EMI is also negative and significant, and the interaction between polarization and EMI is significant only at the 0.1 level. Supplementary Fig. 7d shows the shape of this interaction, revealing the same pattern in which, even for high polarization, the slope of EMI is negative.
Following ref. 19, we fit a base model of three congressional productivity indices (MLI, LPI and log-transformed number of laws) as a function of the lagged dependent variable, polarization, policy mood and two indicator variables for whether the same party controls both the presidency and the majority in Congress and for a change in this variable. We extend this model by adding the EMI score of the same session in which productivity is measured. Thus, for each variable Y (MLI, LPI and log laws), our model is the base model plus a term for the EMI of the same session. Note that, in this model specification, we use the EMI in the same session as the congressional productivity metric, as we aim to identify a correlation between variables that is robust to the known associations with other indicators. Across the three models, explanatory variables reached VIF values up to 12.98, so we included an interaction term between polarization and EMI. Tests of stationarity of residuals had lower significance due to the smaller sample sizes (ADF P = 0.26 for MLI, P = 0.07 for LPI and P = 0.03 for number of laws), but KPSS tests were not significant in any of the three cases (P > 0.1) and JB tests were not significant either (P > 0.5). These small deviations from stationarity of residuals are corrected with the HAC covariance estimator. While our analysis of productivity includes the important variable of mood, data on public policy mood are available only since the 1950s, as they were collected via surveys. To analyse further the role of EMI in productivity, we adopt the approach of ref. 19, using the logarithm of the number of patents (from https://www.uspto.gov/web/offices/ac/ido/oeip/taf/h_counts.htm) approved during each session as an approximation of public mood regarding regulation. While this is an imperfect approximation, it allows us to study a much longer period, dating back to the nineteenth century.
Thus, for each dependent variable, we now have models of the same form in which policy mood is replaced by npatents(t), the logarithm of the number of patents approved during the congressional session t. Covariates in this model have VIF up to 7.67 (LPI), and therefore we include an interaction term between EMI and Pol in each model. Results are presented in Supplementary Table 5. Residuals were approximately stationary, with significant ADF tests for MLI (P = 0.014) and number of laws (P = 0.01), and significant at the 0.1 level for LPI (P = 0.09). KPSS tests were not significant for any of the three models (P > 0.1), and JB tests were not significant for LPI (P = 0.33) and number of laws (P = 0.52). For MLI, a JB test was significant (P < 0.01), indicating non-normal residuals. For that reason, we performed a bootstrap test with 10,000 samples, which gave a 95% CI for the coefficient of EMI of [0.026, 0.214], indicating that the significant coefficient of the MLI model is robust to deviations from normality in the residuals. The coefficients of interaction terms between EMI and polarization are not significant and the coefficient for EMI is significant only for MLI, but not for LPI or the number of laws. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article. Congressional speeches are available at https://data.stanford.edu/congress_text (ref.). DW-NOMINATE scores are from https://voteview.com (ref.). The Gini index is from https://www.census.gov/data/tables/time-series/demo/income-poverty/historical-income-inequality.html. Data on the number of patents are from https://www.uspto.gov/web/offices/ac/ido/oeip/taf/h_counts.htm. Data on public policy mood are available at https://stimson.web.unc.edu/data/ (ref.). Data on legislative productivity are available at https://doi.org/10.7910/DVN/ILILUD (ref.).
All the data used in this study are deposited in an Open Science Framework (OSF) repository at https://doi.org/10.17605/OSF.IO/Z6UTW (ref. The codes used to perform the analyses reported in this Article are available via GitHub at https://github.com/saroyehun/EvidenceMinusIntuition (with a snapshot available via Zenodo at https://doi.org/10.5281/zenodo.14288137 (ref. Higgins, E. T., Rossignac-Milon, M. & Echterhoff, G. Shared reality: from sharing-is-believing to merging minds. Beyond misinformation: understanding and coping with the ‘post-truth' era. Bennett, W. L. & Livingston, S. The disinformation order: disruptive communication and the decline of democratic institutions. Garrett, R. K. & Weeks, B. E. Epistemic beliefs' role in promoting misperceptions and conspiracist ideation. Deliberate Ignorance: Choosing Not to Know (MIT Press, 2020) Cooper, B., Cohen, T. R., Huppert, E., Levine, E. E. & Fleeson, W. Honest behavior: truth-seeking, belief-speaking, and fostering understanding of the truth in others. & Carrella, F. When liars are considered honest. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. 26 (eds Burges, C. J. et al.) 3111– 3119 (Curran Associates, 2013). Garten, J. et al. Dictionaries and distributions: combining expert knowledge and large scale textual data content analysis: distributed dictionary representation. Szostak, R. Restoring democratic stability: a backcasting wheel approach. Partisanship, polarization, and the robustness of support for democracy in the United States. Gentzkow, M. & Shapiro, J. M. What drives media slant? Political polarization and the dynamics of political language: evidence from 130 years of partisan speech [with comments and discussion]. Polacko, M. Causes and consequences of income inequality—an overview. & Rodríguez-Bailón, R. Economic inequality enhances inferences that the normative climate is individualistic and competitive. McCarty, N., Poole, K. T. & Rosenthal, H. 
Polarized America: The Dance of Ideology and Unequal Riches (MIT Press, 2016). Legislative productivity of the US Congress, 1789–2004. A comprehensive dataset of US federal laws (1789–2022). Gennaro, G. & Ash, E. Emotion and reason in political language. Lasser, J. et al. From alternative conceptions of honesty to alternative facts in communications by US politicians. Pennebaker, J. W., Mehl, M. R. & Niederhoffer, K. G. Psychological aspects of natural language use: our words, our selves. Wallace, M. D., Suedfeld, P. & Thachuk, K. Political rhetoric of leaders under stress in the Gulf crisis. Pennebaker, J. W. & Lay, T. C. Language use and personality during crises: analyses of Mayor Rudolph Giuliani's press conferences. Poole, K. T. & Rosenthal, H. L. Ideology and Congress 1 (Transaction Publishers, 2011). Lewis Jeffrey, B., Keith, P., Adam, B., Aaron, R. & Luke, S. Voteview: congressional roll-call votes database. Alvaredo, F. et al. Distributional National Accounts (DINA) Guidelines: Concepts And Methods Used in WID.world (HAL, 2016). Public Opinion in America: Moods, Cycles, and Swings (Routledge, 2018). & Hibbing, J. R. Speaking different languages or reading from the same script? Word usage of Democratic and Republican politicians. Augoustinos, M. & Callaghan, P. The Language of Social Inequality. In The Social Psychology of Inequality (eds Jetten, J. & Feinberg, M. Polarization in the contemporary political and media landscape. & Thrower, S. Checks in the Balance: Legislative Capacity and the Dynamics of Executive Power 193 (Princeton Univ. Altheide, D. L. Media logic and political communication. Gentzkow, M., Shapiro, J. M. & Taddy, M. Congressional record for the 43rd–114th congresses: parsed speeches and phrase counts. & Young, L. Congressional-Record: A Parser for the Congressional Record. & Mikolov, T. Enriching word vectors with subword information. François, A. 
in From Polysemy to Semantic Change: Towards a Typology of Lexical Semantic Associations 163 (John Benjamins Publishing Company, 2008); https://doi.org/10.1075/slcs.106.09fra & Garcia, D. Lexpander: applying colexification networks to automated lexicon expansion. Řehůřek, R. & Sojka, P. Software framework for topic modelling with large corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks 45–50 (ELRA, 2010). Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (eds Inui, K. et al.) 3982–3992 (Association for Computational Linguistics, 2019). Piketty, T. & Saez, E. Income inequality in the United States, 1913–1998. Aroyehun, S. T. et al. Data repository for ‘Computational analysis of US congressional speeches reveals a shift from evidence to intuition'. Aroyehun, S. T. et al. Code repository for ‘Computational analysis of US congressional speeches reveals a shift from evidence to intuition'. acknowledges financial support from the European Research Council (ERC Advanced Grant 101020961 PRODEMINFO), the Humboldt Foundation through a research award, the Volkswagen Foundation (grant ‘Reclaiming individual autonomy and democratic discourse online: How to rebalance human and algorithmic decision making') and the European Commission (Horizon 2020 grant 101094752 SoMe4Dem). also receives funding from Jigsaw (a technology incubator created by Google) and from UK Research and Innovation through EU Horizon replacement funding grant number 10049415. is also a beneficiary of the ERC Advanced Grant 101020961 PRODEMINFO, and S.T.A. also received funding from the Deutsche Forschungsgemeinschaft (DFG – German Research Foundation) under Germany's Excellence Strategy – EXC-2035/1 – 390681379. was supported by the Marie Skłodowska-Curie grant number 101026507. 
The funders had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript. We thank T. Brown for useful input on the manuscript. provided advice on the statistical analyses and visualization. acquired funding and supervised the project. All authors contributed to preparing and editing the final version of the manuscript. The authors declare no competing interests. Nature Human Behaviour thanks Renáta Németh and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. Aroyehun, S.T., Simchon, A., Carrella, F. et al. 
Computational analysis of US congressional speeches reveals a shift from evidence to intuition. Nature Human Behaviour (Nat Hum Behav)
Despite their deleterious effects, small insertions and deletions (InDels) have received far less attention than substitutions. Here we generated isogenic CRISPR-edited human cellular models of postreplicative repair dysfunction (PRRd), including individual and combined gene edits of DNA mismatch repair (MMR) and replicative polymerases (Pol ε and Pol δ). Unique, diverse InDel mutational footprints were revealed. However, the prevailing InDel classification framework was unable to discriminate these InDel signatures from background mutagenesis and from each other. To address this, we developed an alternative InDel classification system that considers flanking sequences and informative motifs (for example, longer homopolymers), enabling unambiguous InDel classification into 89 subtypes. Through focused characterization of seven tumor types from the 100,000 Genomes Project, we uncovered 37 InDel signatures; 27 were new. In addition to unveiling previously hidden biological insights, we also developed PRRDetect, a highly specific classifier of PRRd status in tumors, with potential implications for immunotherapies. Small insertions and deletions (InDels; <100 bp) represent the second most prevalent form of genetic variation after substitutions1. InDel mutagenesis is substantial and nonrandom, reflecting underlying mutational processes, be they benign by-products of normal physiology or malign consequences of exogenous exposures and/or endogenous dysfunctions. 
While studies of mutational processes over the last decade have focused primarily on substitutions2,3, recent advancements in InDel detection and annotation4,5,6 have led to the identification of 18 small InDel signatures (IDS) in human cancers7. These signatures were defined using an 83-channel classification system (that is, 83 InDel subtypes, herein referred to as COSMIC-83), founded upon characteristics such as the InDel size, nucleotides affected, lengths of flanking mononucleotide/polynucleotide repeat and sequence homology at the InDel junctions. Subsequent re-analysis of the same dataset using a revised algorithm reported nine additional de novo IDS8. Accurate InDel characterization is crucial for biological and clinical purposes. For instance, a high proportion of microhomology-mediated deletions is a key predictor of clinically actionable homologous recombination deficiency, carrying the greatest weight in homologous recombination deficiency detection and classification algorithms9,10. Additionally, the detection of microsatellite instability (MSI) in mismatch repair (MMR)-deficient tumors relies on measuring genome-wide InDel mutagenesis and/or analyzing panels of mononucleotide/dinucleotide repeats11,12,13. Recent studies have also highlighted how 2–5 bp short deletions and 2–4 bp duplications are distinct readouts of TOP1 activity (amplified in RNaseH2-null cells)14 and TOP2A dysfunction15, respectively. Given the increasing importance of InDel mutagenesis in tumor classification and prediction of therapeutic sensitivities16,17,18,19, we established a ‘ground truth' set of experimental IDS focused on ‘postreplicative repair deficiency' (PRRd), which encompasses defects in MMR and replicative polymerase proofreading—biological abnormalities that often display exquisite sensitivity to immune checkpoint inhibition (ICI)20,21,22. 
We uncovered inherent limitations in the prevailing InDel classification schema (COSMIC-83) that hamper its ability to distinguish biologically distinct signatures. To address this, we explored whether incorporating sequences 5′ and 3′ to an InDel–features crucial for substitution classification–could contribute additional understanding and resolution power. Additionally, sequence motifs known to increase mutational vulnerability and their genome-wide prevalence were factored into our proposition. Our approach unambiguously classifies every InDel into a specific subcategory. Here we demonstrate that our alternative InDel taxonomy uncovers new etiologies of InDel mutagenesis, offering mechanistic insights and potential clinical added value. We generated a ‘ground truth' set of isogenic cellular models by introducing CRISPR edits to key PRRd-associated genes in an hTERT-immortalized RPE1 (TP53−/−) cell line23. We created four single MMR gene knockouts (ΔMLH1, ΔMSH2, ΔMSH3 and ΔSETD2 (ref. Successfully edited clones were propagated in culture for approximately 45–50 days to permit mutation accumulation. Subsequently, two to five daughter subclones were isolated per genotype for whole-genome sequencing (WGS) and mutational signature analyses (Fig. a, Mutation accumulation experiment in TP53-null hTERT-immortalized retinal pigment epithelial cell (hTERT-RPE1TP53-null, herewith referred to as the background control). b, InDel burden and average InDel fold increase of CRISPR gene edits (n = 2–5 subclones per genotype; Supplementary Tables 1–3). Red dashed line represents the mean InDel burden of control subclones. c, Distinguishing COSMIC-83 InDel profiles of edited subclones from background control. Light blue error bars depict the mean ±3 s.d. of cosine similarities between n = 100 bootstrapped InDel profiles of unedited controls and the background profile (Extended Data Fig. 
d, COSMIC-83 InDel mutational signatures associated with gene edits following background subtraction (Supplementary Table 4). e, Key features of COSMIC ID1, ID2 and ID7 (v.3.3). Known, proposed etiologies are annotated above the heatmap (blue). g, Decomposed solution of gene-edit InDel signatures in d into COSMIC IDS (v.3.3). Except for ΔSETD2, we observed elevated InDel burdens in all gene edits compared to an unedited control (background) (Fig. All lines except ΔSETD2 showed variations in their COSMIC-83 InDel signature profiles compared to control (Fig. We noted discriminative characteristics between gene edits (Fig. Dominant 1 bp T deletions at homopolymers of 6 bp or more (poly-T6+) were observed for ΔMLH1, ΔMSH2 and ΔMSH3, while POLD1S478N and POLEP286R showed exclusive 1 bp T insertions at poly-T5+. POLD1R689W, POLEL424V and all three combined polymerase/MMRd edits predominantly exhibited 1 bp T insertions at long homopolymers, although not exclusively, with variations of 1 bp T deletions between different genotypes. Together, these experiments revealed unique, diverse InDel signatures among different PRRd mutants. Remarkably, mutations within the same gene but affecting different functional protein domains manifested signature variations (that is, POLD1 exonuclease p.S478N versus polymerase p.R689W). We also examined the substitution profiles of all gene edits (Extended Data Fig. Intriguingly, MMRd lines showed lower substitution-to-InDel ratios compared to control, while polymerase-dysfunction (Pol-dys) lines exhibited markedly increased ratios (Extended Data Fig. This suggests that genome instability is predominantly driven by an excess of InDel mutagenesis in MMRd, whereas substitution mutagenesis plays a more significant role in polymerase proofreading dysfunction. 
Furthermore, mutational asymmetry analyses revealed enrichment of both substitutions and InDels on the leading strand for POLE mutants while POLD1 mutants exhibited lagging strand bias, specifically T insertions at homopolymeric tracts of 5–7 nts (Supplementary Fig. This is in keeping with the hypothesized preferential activity of Pol ε and Pol δ in leading and lagging strand synthesis, respectively25,26, suggesting that POLE/POLD1 mutants tend to accumulate 1 bp A insertions on the nascent strand while replicating through 5–7 nts poly-T-tracts27,28. This lends support to the proposition that polymerase ε and δ are more proficient at detecting incorrectly paired bases at template adenines27. Nevertheless, while the diversity of experimental InDel profiles was appreciable among PRRd genotypes, it was difficult to disambiguate gene-edit signatures from background mutagenesis. Clustering analyses and direct comparisons of gene edit and control InDel profiles showed extremely high similarity (cosine similarity > 0.9; Extended Data Fig. Discrimination between MMRd and Pol-dys signatures was also limited (Extended Data Fig. Unsupervised clustering using cosine distance revealed mainly two groups of signatures—deletion-driven MMRd signatures and insertion-driven polymerase mutant signatures (Extended Data Fig. We thus investigated the sufficiency of COSMIC-83 taxonomy, given that signal variation among the ten gene edits was primarily observed in two channels—1 bp T insertions at poly-T5+ and 1 bp T deletions at poly-T6+. We compared experimental gene-edit InDel signatures with COSMIC IDS7. InDel signatures of ΔMSH2 and ΔMLH1 showed no similarity to the purported MMRd-associated ID7 (Fig. The COSMIC-83 taxonomy aggregates 1 bp InDels at homopolymers >5 bp into single channels (that is, T6+ for deletions and T5+ for insertions, respectively; Fig. 
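The effect of pooling long homopolymers into a single channel can be sketched with a toy example. The counts below are invented for illustration and are not taken from the study. The function `aggregate_long_tracts` is a hypothetical helper that mimics the T6+ pooling described above for 1 bp T deletions, and the cosine similarity is computed before and after pooling.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm


def aggregate_long_tracts(counts_by_length, threshold):
    """Pool counts at tract lengths >= threshold into one trailing channel."""
    short = [counts_by_length[n] for n in sorted(counts_by_length) if n < threshold]
    pooled = sum(c for n, c in counts_by_length.items() if n >= threshold)
    return short + [pooled]


# Hypothetical 1 bp T-deletion counts keyed by poly-T tract length (made-up numbers).
profile_a = {4: 5, 5: 10, 6: 80, 7: 15, 8: 5, 9: 0}    # deletions peak at T6
profile_b = {4: 5, 5: 10, 6: 5, 7: 15, 8: 30, 9: 50}   # deletions peak at T9

lengths = sorted(profile_a)
split_a = [profile_a[n] for n in lengths]
split_b = [profile_b[n] for n in lengths]

# Pooling everything at T6 and longer, as a single "T6+" channel would.
pooled_a = aggregate_long_tracts(profile_a, threshold=6)
pooled_b = aggregate_long_tracts(profile_b, threshold=6)

sim_split = cosine_similarity(split_a, split_b)
sim_pooled = cosine_similarity(pooled_a, pooled_b)

print(f"per-length channels: {sim_split:.3f}")
print(f"pooled T6+ channel:  {sim_pooled:.3f}")
```

In this toy case the two profiles differ only in where along the long tracts the deletions fall, so pooling makes them identical while the per-length representation keeps them clearly apart.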
We surmised that the conflation of discriminatory signals within longer homopolymers into single ‘insertion at T5+' or ‘deletion at T6+' channel likely reduces the separative capacity for signature extraction. This contrasts with corresponding PRRd-associated substitution signatures, which manifest as distinct and diverse patterns amongst MMRd and/or Pol-dys cancers3,7,31 (Extended Data Fig. Notably, ID7 lacks signal within reputedly the most informative homopolymer channel (>5 bp). Instead, signals are only present in channels associated with ID1 and ID2 (Fig. 1e), resulting in systematic misattribution of all MMRd gene-edit signatures to ID1 and ID2 (Fig. Moreover, InDel signatures of POLE, POLD1 mutants and all combined polymerase/MMRd edits were indistinguishable from ID1 using COSMIC-83 taxonomy and sometimes indistinguishable from each other (Extended Data Fig. The signature of polymerase mutant POLD1R689W did not resemble any reported signatures. Because InDel mutagenesis of gene edits occurred predominantly at longer homopolymers and was erroneously assigned to ID1 and/or ID2 (Fig. As with substitutions, incorporating surrounding sequence characteristics may enhance the discriminatory capacity of InDel catalogs for signature analyses32. We first classified InDels according to whether they were insertions, deletions or complex InDels (simultaneous insertions and deletions; Fig. For InDels ≥2 bp, we identified the maximally repetitive motif within the InDel and accounted for its repeat length in the 3′ sequence (Supplementary Note 1). b, Distinguishing 89-channel InDel profiles of edited subclones from background control. Light blue error bars depict the mean ± 3 s.d. of cosine similarities between n = 100 bootstrapped InDel profiles of unedited controls and the background profile (Extended Data Fig. c, Cosine similarities of edited subclones and bootstrapped controls in COSMIC-83 InDel profiles against 89-channel InDel profiles. 
d, The 89-channel InDel mutational signatures associated with PRRd gene edits following background subtraction (Supplementary Table 4; https://signal.mutationalsignatures.com/explore/main/experimental/experiments?study=7). We examined whether all 476 channels were informative. By analyzing the InDel distribution across all channels in 18,522 tumors covering most cancer types from the International Cancer Genome Consortium (ICGC)/The Cancer Genome Atlas (TCGA)33, Hartwig34 and the Genomics England (GEL) 100,000 Genomes Project35 (Extended Data Fig. 3e), we identified noninformative channels (that is, channels with no signal) and consolidated those with low signal to reduce the total number of InDel channels to 89 (Fig. Overall, compared to COSMIC-83, the 89-channel taxonomy expands upon channels that had most of the signals, here 1 bp A/T InDels, into a larger array of channels, and condenses longer InDels and/or genome motifs infrequent in the genome (where signals were scant or nonexistent) into fewer InDel subcategories (Extended Data Fig. Although the final numbers are not vastly different between the two classification systems, our data-driven approach, incorporating sequence contexts and enhancing signal distribution of mononucleotide/polynucleotide repeat tracts into additional channels, provides alternative information to the mutational signature extraction and assignment process, potentially increasing the likelihood of detecting new biologically meaningful signatures. To test this, we applied the new 89-channel InDel taxonomy to our ground truth gene-edit dataset (Supplementary Table 4). Cosine similarities between experimental InDel profiles and control were much lower with the 89-channel format than with COSMIC-83 (Fig. 
5a,b), indicating that the new classification improved separation of gene edits from the background (mean cosine similarity, 0.68 ± 0.08 for 89-channel versus 0.89 ± 0.11 for COSMIC-83; two-tailed Wilcoxon signed-rank test, P = 1.917 × 10−7). We subsequently determined signatures associated with each gene edit using the 89-channel format. Gene-edit signatures were also more readily discernible from one another (mean signature pairwise cosine similarity, 0.57 ± 0.25 for 89-channel versus 0.64 ± 0.3 for COSMIC-83; two-tailed Wilcoxon signed-rank test, P = 1.483 × 10−5; Extended Data Fig. Notably, InDel signatures of combined MMRd/polymerase mutants were not simply the sum of the individual mutational processes, likely reflecting the biological interactions of Pol ε and Pol δ with MMR in suppressing InDel formation during the replication of repetitive DNA. The higher InDel rates in shorter homopolymers conferred by defective proofreading of Pol ε and Pol δ likely reflect the distance over which they interact with duplex DNA upstream of the polymerase active site29. Indeed, crystal structures of Pol ε and Pol δ have shown numerous contacts made within 5–7 bp of the polymerase active sites with duplex DNA36,37, with experimental model reinforcing this optimal distance29, explaining how proofreading may offer reduced protection against InDels outside of this ‘footprint' (that is, unpaired bases further upstream of the active site38; longer runs where MMR plays a more crucial role29). These unique insights were only appreciable due to the new 89-channel format, offering enhanced capturing of biological variation. To compare the discriminatory capacity of both classification systems, we also performed de novo signature extraction on our ground truth experimental dataset (n = 37; Extended Data Fig. With COSMIC-83, only two de novo signatures were extracted—one dominated by T insertions at poly-T5+ (ID83A) and the other by T deletions at poly-T6+ (ID83B; Extended Data Fig. 
In contrast, the 89-channel format yielded four signatures, matching our expectation of a predominantly deletion-driven MMRd signature (InD89B), a predominantly insertion-driven polymerase signature (InD89D) and two distinct signatures with differing proportions of InDels (InD89A and InD89C), likely reflecting the combined polymerase/MMRd phenotypes (Extended Data Fig. Finally, to determine whether this observed relationship between channel information content and signature extraction extended to other datasets and workflows, we applied three different algorithms3,8,39 to an unrelated cohort of 52 colorectal WGS from ICGC33 (Extended Data Fig. All three algorithms failed to discern all available signatures using COSMIC-83, reaching a discrimination limit of five, yielding sparse signatures with signal density highly concentrated in two channels (Extended Data Fig. Contrarily, the 89-channel format consistently enabled the detection of more de novo signatures across all algorithms used (Extended Data Fig. The extracted signatures also displayed signals across more channels, highlighting the superior performance of the 89-channel classification over COSMIC-83 in uncovering additional, true mutational processes. To explore the impact of our new InDel taxonomy on signature discovery beyond PRRd phenotypes in human cancers, we analyzed seven tumor types (n = 4,775) known to display clinically relevant high tumor mutational burden (TMB) due to a range of abnormalities (for example, MMRd, environmental ultraviolet (UV) radiation, APOBEC-related mutagenesis)—bladder (n = 347), brain (CNS, n = 392), colorectal (n = 2,146), endometrial (n = 695), lung (n = 958), stomach (n = 181) and skin (n = 56) cancers from the GEL 100,000 Genomes Project35 (Fig. a, InDel burden across seven cancer types (n = 4,775; left) and the number of mutations contributed by each InD to the GEL tumors. 
b, Profiles of 37 consensus InDel mutational signatures (InDS) extracted and curated from seven GEL cancer cohorts (Supplementary Table 10; https://signal.mutationalsignatures.com/explore/main/cancer/signatures?mutationType=3&study=7). Putative etiologies are provided in the top-left squircles. We performed mutational signature analysis per tumor type as previously described3 (Fig. We identified 37 consensus InDel signatures, referred to as InDS (to distinguish from COSMIC IDS; Fig. Ten signatures shared characteristics mappable to known IDS (InD1, InD2a, InD3a/InD3b, InD4a, InD6, InD8, InD9a, InD13 and InD18)7. InD3a and InD3b often co-occurred in lung cancers with tobacco exposure. InD3a/InD3b clustered with experimental signatures induced by benzo(a)pyrene and its metabolite benzo(a)pyrene diol epoxide (Extended Data Figs. 8 and 9), supporting the notion that they represent modulated versions of tobacco-related DNA damage. InD13, characterized by T deletions at TT dinucleotides, is linked to UV damage, and InD18, found exclusively in colorectal samples, is due to colibactin exposure40. InD32 was identified in samples with prior exposure to platinum and was associated with a new platinum-associated signature, SBS112 (ref. Twenty InDS had probable endogenous origins (Extended Data Fig. Several have been described, including InD1 and InD2a, errors associated with nascent and template strand slippage during normal DNA replication, respectively7. InD1 and InD2a were seen universally across all tumor types except CNS and skin cancers, which showed a tissue-specific variant, InD2b (Fig. InD6, marked by microhomology-mediated deletions, is associated with deficiency in HR repair7. InD8, which had deletions with little to no microhomology at deletion junctions, likely reflects the footprint of nonhomologous end-joining activity and/or radiotherapy41. 
InD9a, correlated with SBS2 and SBS13 hypermutation, featured 1 bp C deletions at TCT and TCA (the central C is the mutated base), identical to mutable motifs characteristic of SBS2/SBS13, particularly at short poly-T tracts. It was presumptively induced by APOBEC (Extended Data Fig. 8c), corroborated by experimental evidence from an APOBEC overexpression DT40 model42. We proposed a mutagenesis mechanism wherein, following C-to-U deamination at TCT by APOBEC, uracil removal by UNG leaves an uninformative abasic site. Template strand slippage can then occur at this short repetitive T tract, leading to a C deletion (Extended Data Fig. For reasons currently unclear, we also found the similar C-deletion-dominated InD9b/InD9c, which, although they resembled InD9a, lacked the predilection for a preceding T and were possibly caused by an alternative mechanism. Interestingly, we extracted eight gene-specific MMRd and Pol-dys InDS. InD7 is characterized by the expected excess of 1 bp and 2 bp deletions, particularly at longer mononucleotide/dinucleotide repeat tracts. InD7 clustered with experimental signatures of ΔMLH1, ΔMSH2 and ΔMSH6 (Extended Data Fig. We also identified InD19 (due to PMS2 deficiency), InD14 (associated with POLD1 exonuclease mutations), InD15 (associated with POLE exonuclease mutations), InD16a and InD16b (resulting from concurrent loss of POLE proofreading and MMR), InD21 (associated with combined POLD1 proofreading defect and MMRd) and InD20, which experimental investigations showed was due to MMRd occurring on a POLE dysfunction background. The remaining 12 signatures were of uncertain etiology. Five were probably artifacts: InD27 and InD28 often co-occurred, incurring thousands of InDels, and were related to SBS57, potentially an amplification or a sequencing artifact7. InD28m was likely a mixed signature of InD28 and InD4, remaining to be resolved with larger cohorts. 
While C insertions dominated both InD26 and InD30 at poly-C tracts followed by a 3′A, InD30 C insertions induced thousands of insertions at homopolymers CCC and CCCC, whereas InD26 C insertions mainly occurred at longer CCCCC and were not associated with hypermutation. Three InDS (InD31, InD24 and InD12) showed striking correlations with signatures of other classes. InD31 displayed distinct C deletions at short homopolymers (<5 bp) followed by 3′G and T deletions at short homopolymers (<5 bp) followed by 3′A. It was only reported in samples with novel rare SBS105 (ref. InD24 deletions peaked prominently at GTA and GTG and were strongly correlated with DBS8, which shows double substitutions at the same motifs (TGTG > TAGG/TTGG). InD12 exhibited C deletions between dinucleotides AA and AT and was associated with DBS25 featuring a tall peak at TT dinucleotide. Despite clear co-occurrence, the causes for these signatures remain cryptic. Whether they represented tissue-specific variants, were mixed or caused by different mechanisms requires further investigation. InD11 appeared related to InD1 and might be an oversplit signature frequently enriched in high InDel burden samples, such as those with MMRd and Pol-dys. Seen in bladder and colorectal cancers, InD23 showed a striking pattern of longer insertions (≥5 bp) at nonrepeats. These insertions were almost exclusively tandemly duplicated from immediate neighboring sequences. InD33 was most prominent in one CNS tumor treated with temozolomide; however, its etiology remains unknown. PRRd subtypes, typified by MSI, are clinically actionable with potential selective sensitivity to immunotherapies20,21,22. Current methods of detecting PRRd mainly rely on immunohistochemistry (IHC) staining of MMR proteins (but not for polymerase mutants) and/or PCR-based assays to determine MSI at selected genomic loci. These assays are not sensitive or robust enough, especially in nonepithelial tissues16. 
Using insights from this study, we therefore explored constructing a classifier for tumor PRRd stratification, reporting MMRd, Pol-dys and mixed MMRd/Pol-dys as distinct classes versus PRR proficiency. Samples treated as controls had neither MMRd nor Pol-dys, as confirmed by the absence of driver mutations in key MMR genes (that is, MLH1, MSH2, MSH6 and PMS2), POLE and POLD1, and displayed no evidence of MSI associated with these abnormalities43. We trained multiple multinomial elastic net regression models, applying 7:3 partitioning iteratively across the dataset. Through exploring all possible features/models (Supplementary Table 13), we identified exposures of SBS and InDS associated with MMRd, Pol-dys and mixed MMRd/Pol-dys, as well as the ratio of total InDels to substitutions, as the most predictive features (Fig. The final model, termed PRRDetect (postreplicative repair detect), was retrained on the entire dataset (n = 571). Then, in an independent validation cohort of 504 ICGC breast cancers44,45 and 847 GEL cancers, for which the true labels of PRRd were known, PRRDetect achieved an AUROC (area under the ROC curve) of 1 and an AUPRC (area under the precision–recall curve) of 0.99 at distinguishing PRR-dysfunctional from PRR-proficient samples, outperforming other MSI/MMRd detection tools, including MSIseq43, MMRDetect46 and TMB, an approved biomarker for immunotherapies20,21,47,48,49 (Fig. (1) Initial exploratory training using 571 ground truth samples. (2) Final retraining to produce the PRRDetect classifier. b, Distribution of coefficients across seven genomic features contributing to the final PRRDetect classifier. Green error bars depict the mean ± s.d. from ten replicates of training in cross-validation. Red dots indicate the final coefficients chosen for each class prediction (Supplementary Table 14). c, Validation and application of PRRDetect on independent cancer cohorts. 
d, ROC curves demonstrating the superior performance of PRRDetect on the independent cancer cohort (n = 1,351) against alternative biomarker strategies. P values were calculated using a two-sided nonparametric test based on the bootstrap distribution (10,000) of the difference in AUCs53. e, PRRDetect results of n = 1,335 ICGC and Hartwig cancers, ordered from the lowest to the highest prediction probability across the x axis (left to right) for MMRd (purple), combined MMRd/Pol-dys (blue) and Pol-dys samples (orange). Negative samples were ordered by TMB in increasing order from left to right. Results of MSIseq, MMRDetect, cancer gene driver annotation and cancer tissue origin are labeled at the bottom tracks. Dashed rectangle highlights the extent of false positive overcalling if using TMB > 10 mutations per Mb as a cutoff. f, Concordance of calls among TMB-high (>10 mutations per Mb), positive exposure to SBS signatures that impart hypermutation and PRRDetect prediction across n = 1,335 ICGC and Hartwig cancers. g, Concordance of calls among TMB-high (>10 mutations per Mb), positive exposure to SBS signatures that impart hypermutation and PRRDetect prediction across n = 4,775 GEL tumors.

Next, to survey the prevalence of PRRd across alternative cancer cohorts, we applied PRRDetect on seven cancer types commonly enriched with hypermutator samples from ICGC33 and Hartwig34 (Fig. PRRDetect predicted 3.7% (50/1,335) of samples as PRR-dysfunctional, correctly identifying all Pol-dys and MMRd/Pol-dys samples and missing two subclonal MMRd samples (based on available published driver information for PRRd status). Unsurprisingly, PRRDetect captured all MMRDetect-positive cases. However, MMRDetect failed to identify all PRRd cases, as it was not designed to detect Pol-dys/mixed phenotypes, and missed seven MMRd samples. Crucially, we noted that many PRRDetect-positive cases did not have an associated driver mutation identified (33/50).
Of 50 PRRDetect-positive cases, 39 were MMRd (only 8 had an associated driver mutation), 7 were Pol-dys (all had driver mutations in polymerase proofreading domains) and 4 were predicted as mixed MMRd/Pol-dys (2 had POLE exonuclease mutations and none had MMR drivers). If PRRDetect predictions were all true and sequencing approaches focused exclusively on identifying driver events associated with these deficiencies were used, a significant proportion of cases (66%) could be missed. Given that PRRd cancers often present with high TMB, and TMB is used as a biomarker for immunotherapies, we explored the limits of TMB-based patient stratification. With an FDA-approved TMB cutoff of 10 mutations per Mb49, just over a tenth of 459 cases classified as TMB-high (50/459, 10.9%) had predicted PRR dysfunction (Fig. The majority of other cases (353/459, 76.9%) had high TMB from tobacco, UV and APOBEC exposure; 56 (12.2%) were due to alternative causes. Thus, across independent cancer cohorts where MMRd and Pol-dys are known to occur at higher frequencies, ~89% of the samples classified as TMB-high may not have the intrinsic biological underpinnings associated with response to immunotherapies, with implications for the use of TMB as a selective biomarker for ICI50,51. We asked whether this trend extended to the larger GEL cohort (n = 4,775). Among the 1,371 TMB-high cases, nearly half (677, 49.4%) were predicted as having MMRd and/or Pol-dys (Fig. The remaining 564 (41.1%) had high TMB due to alternative mutagenic exposures; 130 (9.5%) were due to other undetermined causes. Furthermore, beyond revealing PRR dysfunction in typical tumor types such as colorectal cancers (19%, 400/2,146) and uterine cancers (37%, 255/695), PRRDetect predicted PRRd in a small but notable proportion of stomach (11/181, 6%), bladder (3/347, 1%), CNS (3/392, 1%) and lung cancers (8/958, 1%; Extended Data Fig. 
This reinforces two important clinical points—first, PRRd is not restricted to colorectal and uterine cancers despite being more prevalent in these tumor types; second, WGS can serve as a tumor-agnostic assay uncovering PRRd and any other actionable biological abnormalities in the future. The ability to distill biologically relevant signatures is heavily influenced by how mutations are represented or classified, more so than the underlying algorithms used for signature extraction. Here we showed that a classification schema that aggregates potentially discriminatory signals into only a few channels and/or does not take surrounding sequence context into account is limited in its ability to discern biologically insightful InDel patterns, irrespective of the extraction algorithms used. Consequently, some of the currently reported InDel signatures may correspond to multiple mutational processes, affecting the specificity of their assignments. To overcome this limitation, we proposed an alternative InDel taxonomy that incorporates flanking sequence context and distributes signals across a broader set of channels, offering increased discriminative capability without sacrificing power for signature extraction. Using this framework, we captured the distinct MSI phenotypes and true biological diversity of PRRd-associated InDel patterns, evident in both isogenic cellular models and patient tumors. Furthermore, we deciphered 37 consensus InDS from seven cancer types. We confirmed ten previously described IDS7,14,42, including those associated with tobacco use, UV exposure and APOBEC activity and reported eight new InDS of MMRd and polymerase proofreading dysfunction. While we have offered putative causes and associations for several new signatures, our current understanding of InDel mutagenesis remains incomplete. Future studies incorporating more cancer types and/or larger sample cohorts will help uncover additional signatures and illuminate new etiologies. 
The possibility of adapting the taxonomy in the future, to include features currently not explorable due to the limitations incurred by technological error rates of calling InDels using short-read WGS (that is, at longer simple repeats), could also be revealing. Our classifier, PRRDetect, is highly sensitive and specific. It utilizes both SBS and InD signatures to stratify tumors by PRRd subtypes and, to our knowledge, is the only tool with this capability. Importantly, we found a lack of concordance between the current biomarkers of MSI/MMRd and the true biological state. Particularly, TMB, despite being FDA approved, is nonspecific50,51. This has profound clinical implications as more than 50% of TMB-high (>10 mutations per Mb) cancers arise from biological abnormalities and environmental exposures that have no substantiated biological basis for immunotherapies, potentially impacting patient outcomes. PRRDetect can also detect samples that have signatures of PRR deficiency but for which no drivers can be detected (nearly 50% of all PRRd cases). Finally, our classifier does not distinguish between MMRd genotypes despite clear differences between, for example, MLH1, MSH2, MSH6 and PMS2; currently, there is no clinical indication to do so. However, should it become clinically important to distinguish between these genes, it shall be possible to do so. In summary, our study highlights how mutation classification directly impacts the accuracy of signature analysis. Our decision to leverage the surrounding sequence context for classifying InDels stems from mechanistic work demonstrating the relationship between InDel formation and flanking 3′ and 5′ sequences32,52. Nevertheless, optimal classification remains an active research area. Alternative schema could unveil additional mutational processes in the future. 
Unraveling the landscape of InDel mutagenesis through the refined framework described here will hopefully translate into meaningful benefits for cancer patients.

The experiments described herein did not require approval from a specific ethics board.

All cell line models generated and used in this study are provided in Supplementary Table 1. All cells were grown in DMEM/F12 medium (Gibco/Thermo Fisher Scientific) supplemented with 10% FBS, at 37 °C and 5% CO2 in a humidified incubator. The original hTERT-RPE1 ΔTP53 cell line was generated in a previous study23. To generate the remaining isogenic CRISPR-edited cell lines, 200,000 RPE1 ΔTP53 cells per edit were electroporated with preformed ribonucleoprotein complex (RNP with final concentrations of 120 pmol gRNAs and 100 pmol Alt-R Cas9) in supplemented SE buffer, using a nucleocuvette and the AMAXA 4D-Nucleofector (Lonza) on program EH-158 according to the manufacturer's instructions. Following electroporation, cells were replated into fully supplemented DMEM/F12 medium to recover for 48 h. For knock-ins, homology-directed repair (HDR) donor oligos were supplied with the RNP for electroporation, and cells were replated for recovery in medium spiked with final concentrations of 2 μM M3814 (Selleckchem) and 0.5 μM Alt-R HDR Enhancer (Integrated DNA Technologies) for the first 24 h. All cells were then cultured for an additional 2 to 4 days to allow for gene editing and eventually subjected to limiting dilution on 96-well plates to isolate single-cell clones. All gRNAs, sequencing primers and genotypes of cell lines generated in this study are summarized in Supplementary Tables 1 and 2. Double mutants were harder to establish but had similar doubling times to single mutants.
Edited clones were cultured for 45–55 days (~40 to 50 doublings) to allow for mutation accumulation before a second round of single-cell limiting dilution was performed to isolate two to three daughter subclones per edit genotype, providing a bottleneck to capture mutations that had occurred since the isolation of the initial edited parental clones. Genomic DNA was isolated from all samples using the Quick-DNA Miniprep Plus Kit (ZymoResearch) following the manufacturer's protocol. WGS libraries were prepared and sequenced (150 bp, paired end, 25×) on the Illumina NovaSeq 6000 platform by Novogene. Short reads were aligned to GRCh38/hg38 using BWA-MEM 0.7.17-r1188. Postprocessing filters were applied to improve the specificity of mutation calling. Specifically, for single-nucleotide variant (SNV) calls by CaVEMan54 (v.1.13.15), we used CLPM = 0 and ASMD ≥ 140. To reduce false positive calls by Pindel55 (v.3.2.0), we used QUAL ≥ 250 and REP < 10. Cell clones with an average variant allele frequency of <0.4 were designated as polyclonal and excluded from all subsequent quantitative analyses (that is, estimation of mutation burden). A variant allele frequency filter of 0.2 was applied to substitutions and InDels. De novo substitutions and InDels in subclones were obtained by subtracting mutations found in the respective parental clones whenever available, or by removing mutations shared among subclones. De novo mutation counts are provided in Supplementary Table 3. The derivation of gene edit-associated mutational signatures with cosine similarity, bootstrapping and background subtraction was performed using a published framework (https://github.com/xqzou/COMSIG_KO)46.
In short, we first (1) determined the background mutational signature in background control by aggregating the unedited and untreated subclone mutational profiles, (2) determined the difference between the mutational profiles of the edited clones and background mutation profile and if an edit generates a signature, we (3) removed background mutation profile from the mutation profile of the edited subclone (Supplementary Tables 4 and 5). We also used Uniform Manifold Approximation and Projection (UMAP)56 to cluster the InDel profiles. Experimentally derived signatures were compared to published reference signatures3,7 using signature.tools.lib (v.2.4.4) from https://rdrr.io/github/Nik-Zainal-Group/signature.tools.lib/. Replicative strand bias analysis was performed for 1 bp InDels only. InDels were mapped to leading or lagging strand using Repli-seq data (MCF-7) of the ENCODE project57. IntersectBed58 in BEDTools (v.2.26.0-114-g4c407ce) was utilized to identify mutations overlapping specific genomic features. To assess a specific mutational signature, the ‘expected' ratio of InDels between lagging and leading strands was calculated according to the distribution of the repeats in these regions. The ‘observed' ratio of InDels among different strands was determined by mapping InDels to genomic coordinates of all leading/lagging regions. The asymmetry between different strands was quantified by calculating the odds ratio of InDels occurring on one strand (for example, leading) versus the other strand (for example, lagging). P values were computed using binomial test or Χ2 test and corrected for multiple testing using the Benjamini–Hochberg method. De novo signature extraction and decomposition of mutational signatures were performed using SigProfilerExtractor39 (v.1.1.18), along with SigProfilerMatrixGenerator59 (v.1.2.4). The recommended default settings (including 500 NMF replicates) were applied (https://github.com/AlexandrovLab/SigProfilerExtractor). 
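The strand-asymmetry calculation described above (observed versus expected lagging/leading ratios, an odds ratio, a binomial test and Benjamini–Hochberg correction) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the exact binomial test enumerates every outcome and so only suits small counts, and the two-sided definition used here (summing outcomes no more likely than the observed one) is an assumption.

```python
from math import comb


def strand_odds_ratio(obs_leading, obs_lagging, exp_leading, exp_lagging):
    """Observed leading:lagging InDel ratio relative to the expected ratio."""
    return (obs_leading / obs_lagging) / (exp_leading / exp_lagging)


def binomial_two_sided_p(k, n, p):
    """Exact two-sided binomial P value (small n only).

    Sums the probabilities of all outcomes that are no more likely
    than observing k successes in n trials.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= threshold))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical counts: 60 leading vs 40 lagging InDels, equal expectation.
    print(strand_odds_ratio(60, 40, 50, 50))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

An odds ratio above 1 in this sketch indicates enrichment on the leading strand relative to what the repeat distribution alone would predict.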
Signatures were also extracted using Indel.signature.tools3 (v.2.4.4) with default settings (20 bootstraps, 500 repeats per bootstrap, matched clustering). The number of InDel signatures selected per channel set was determined by maximal drop in average silhouette width of clustered signatures. For extraction using MuSiCal8 (https://github.com/parklab/MuSiCal, v.1.0.0), the default hyperparameters included random initialization (init hyperparameter), the minimum volume nonnegative matrix factorization (MVNMF) algorithm (method hyperparameter), 20 MVNMF replicate runs (n_replicates hyperparameter) and between 10,000 and 1,000,000 MVNMF iterations per replicate run (min_iter and max_iter hyperparameters, respectively). Indels were called with Strelka60 (v.2.4.7) using somatic calling mode. We exploited the relationship between InDels and their associated 3′ sequence context to determine, for each InDel, the minimal InDel prefix that is maximally repetitive in the 3′ sequence context and within the InDel itself. This applies to all InDel variants that are left-aligned and parsimonious. These values, along with the sequence context, may be used to group InDels into biologically relevant, nonoverlapping InDel subcategories or ‘channels'. Using the segmentation values for each InDel, we constructed a set of 476 nonoverlapping InDel channels. By surveying the InDel frequency distribution across all 476 channels in 17,253 tumors covering most cancer types from ICGC/TCGA, Hartwig and the GEL 100,000 Genomes Project, we discarded channels with no signal and consolidated those with low signal to reduce the total number of InDel channels to 89. Channels were constructed such that each InDel could only be unambiguously assigned to a single channel. A complete description of each channel and exemplar InDels are included in Supplementary Table 8. A description of the reasoning behind channel construction is presented in Supplementary Note. 
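To make the segmentation idea concrete, the following is a simplified sketch of finding the shortest InDel prefix that is maximally repetitive within the InDel and its 3′ context. It is an illustration under stated assumptions, not the implementation in indelsig.tools.lib: the function names are hypothetical, the input is assumed to be left-aligned already, and ties are broken in favor of the shorter unit.

```python
def tandem_copies(unit, seq):
    """Count consecutive copies of `unit` starting at the beginning of `seq`."""
    if not unit:
        return 0
    n = 0
    while seq.startswith(unit, n * len(unit)):
        n += 1
    return n


def segment_indel(indel, flank3):
    """Return (unit, copies_in_indel, copies_in_3prime_flank).

    Tries every prefix of the InDel as a candidate repeat unit and keeps
    the one covering the most bases across the InDel plus its 3' flank.
    Flank repeats are only counted when the InDel consists entirely of
    whole copies of the unit.
    """
    best = None
    for size in range(1, len(indel) + 1):
        unit = indel[:size]
        in_indel = tandem_copies(unit, indel)
        whole = in_indel * size == len(indel)
        in_flank = tandem_copies(unit, flank3) if whole else 0
        covered = (in_indel + in_flank) * size
        # Strict ">" keeps the shortest unit when coverage is tied.
        if best is None or covered > best[0]:
            best = (covered, unit, in_indel, in_flank)
    return best[1:]


if __name__ == "__main__":
    print(segment_indel("C", "CCCA"))       # 1 bp C event at a C homopolymer
    print(segment_indel("CACA", "CACAGT"))  # dinucleotide repeat unit
    print(segment_indel("GTA", "CCG"))      # nonrepetitive context
```

Values like the unit length and the two copy counts are the kind of segmentation output that, together with flanking bases, could then be binned into channels.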
Our approach to signature extraction was motivated by a previous study3. We observed that hypermutator samples strongly influenced signature extraction using the standard β-divergence NMF model. We therefore sought to filter out hypermutator samples from our initial extraction. For each tissue, we first removed samples with a total InDel burden <100, and then clustered sample profiles according to their cosine similarity (cosine distance, 1 − cosine similarity) using hierarchical clustering with complete linkage. Clusters of similar samples were determined by thresholding the resulting dendrogram such that the average silhouette width was maximized, and within-cluster variation was minimized. To determine cluster-specific hypermutators, for each cluster, we fit the total burden per sample using a two-component Gaussian mixture model (mixtools61) compared to a one-component mixture model, using the Bayesian information criterion for model selection. Hypermutators were defined as the union of samples with a total burden more than third quartile + 1.5 × IQR, where quartiles were calculated over the total dataset, and samples with a greater than 50% probability of being generated by the higher burden Gaussian distribution. Normal and hypermutator clusters were then manually reviewed per tissue, and only normal clusters were used in primary extraction. Signature extraction was performed per tissue as described, with an increased number of bootstraps (40) and repetitions per bootstrap (1,000), to increase final solution stability. We sought to determine whether excess variation indicative of rare or unextracted signatures was present in sample catalogs using a published framework3. In refitting tissue-specific catalogs (hypermutator samples included) with signatures extracted from nonhypermutator samples using FitMS3, we observed that profiles with lower InDel counts displayed higher degrees of error (as measured by total residual normalized by sample burden).
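The quartile-based half of the hypermutator rule in this section (total burden above the third quartile + 1.5 × IQR) can be sketched as follows. This covers only that outlier criterion. The Gaussian mixture criterion and the manual review are omitted, the function name is hypothetical, and Python's default quartile method may differ from the one the authors used.

```python
from statistics import quantiles


def flag_hypermutators(burdens):
    """Return indices of samples whose burden exceeds Q3 + 1.5 * IQR."""
    q1, _, q3 = quantiles(burdens, n=4)  # quartile cut points
    fence = q3 + 1.5 * (q3 - q1)
    return [i for i, b in enumerate(burdens) if b > fence]


if __name__ == "__main__":
    # Hypothetical per-sample InDel burdens; the last sample is an outlier.
    burdens = [100, 120, 130, 150, 160, 170, 180, 200, 5000]
    print(flag_hypermutators(burdens))
```

In the procedure described in the text, samples flagged this way would be combined (as a union) with those assigned to the higher-burden mixture component.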
Generally, error decreased logarithmically as InDel burden increased. Therefore, using a single threshold on fitting error to define samples with excess variation excluded a large proportion of samples. To more accurately calibrate our expectation of fitting error, and therefore, our threshold for detecting excess variation, we performed a parametric bootstrapping procedure to generate a sample-specific expected error distribution. For each sample catalog, we constructed a multinomial distribution using the per-channel density from the normalized reconstructed profile produced by FitMS. Using this distribution, we simulated 10,000 sample profiles with a total burden equal to the sample burden, fit these profiles with FitMS and calculated the resulting error distribution. Comparing the experimentally derived error distribution to the resulting null distribution allowed us to estimate an empirical P value. This procedure was repeated for all samples in the GEL cohort, and P values were corrected for multiple testing. To control the false discovery rate at 5%, samples with an adjusted P value less than 0.05 were selected for further analysis to determine rare signatures. For each tissue, samples with excess variation were clustered using the residual signal after subtracting out FitMS reconstructed profiles. Hierarchical clustering with average linkage and Euclidean distance resulted in multiple clusters per tissue. An additional one to five rare signatures were determined per cluster, and the number of rare signatures was determined as the minimal value of n, such that rare signatures were not found to perfectly recapitulate cluster members or match common signatures. All extracted rare signatures across a tissue were subject to manual curation to identify recurrent patterns, and the rare signature exemplars that displayed minimal mixing with common signatures were selected. 
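The parametric bootstrap used above to obtain a sample-specific null distribution of fitting error can be sketched as below. This is a simplified illustration, not the published procedure: the function names are hypothetical, and each simulated profile is compared directly with the reconstructed profile rather than being refitted with FitMS, which is a substitution made here to keep the example self-contained.

```python
import random
from collections import Counter


def simulate_profile(probs, burden, rng):
    """Draw one multinomial sample of `burden` InDels over the channels."""
    draws = rng.choices(range(len(probs)), weights=probs, k=burden)
    counts = Counter(draws)
    return [counts.get(i, 0) for i in range(len(probs))]


def normalized_error(counts, probs):
    """Total absolute residual against the expected profile, per InDel."""
    total = sum(counts)
    return sum(abs(c - total * p) for c, p in zip(counts, probs)) / total


def empirical_p(observed_counts, reconstructed, n_sim=1000, seed=0):
    """Empirical P value for the observed error under the bootstrap null."""
    rng = random.Random(seed)
    scale = sum(reconstructed)
    probs = [r / scale for r in reconstructed]  # per-channel density
    burden = sum(observed_counts)
    observed = normalized_error(observed_counts, probs)
    null = [
        normalized_error(simulate_profile(probs, burden, rng), probs)
        for _ in range(n_sim)
    ]
    exceed = sum(e >= observed for e in null)
    return (exceed + 1) / (n_sim + 1)  # add-one to avoid P = 0


if __name__ == "__main__":
    # Hypothetical three-channel example: observed profile vs reconstruction.
    print(empirical_p([0, 0, 100], [50, 30, 20], n_sim=200))
```

Because the null is simulated at each sample's own burden, low-burden samples are judged against a wider expected error distribution, which is the calibration issue the text describes.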
Using this consensus set of common and rare signatures per tissue, all samples in each catalog were refitted using FitMS to determine per-sample signature exposures. We clustered InDel signatures from seven GEL cancer cohorts (111 tissue-specific InDS) using hierarchical clustering with average linkage and cosine distance and derived a set of consensus signatures following a published framework3. Variants in samples generated in previous studies46,52 were reclassified and analyzed using our refined InDel classification scheme. Experimental signatures were obtained via background subtraction and determined using a bootstrapping and cosine similarity-based framework previously described46. Consensus InDel signatures and all experimentally derived signatures were clustered using hierarchical clustering and cosine distance. We trained PRRDetect, a multinomial elastic net regression model, on a subset of 571 GEL cancers confidently assigned as MMRd (n = 214), Pol-dys (n = 36), mixed MMRd/Pol-dys (n = 41) or PRR-proficient (negative controls, n = 280) based on manual curation of relevant driver mutations and/or supporting immunohistochemistry where possible. Samples with neither Pol-dys nor MMRd were confirmed to lack driver mutations in key MMR genes (that is, MLH1, MSH2, MSH6 and PMS2), POLE and POLD1, and displayed no evidence of MSI associated with these abnormalities43. To create our classifier, we explored a range of feature combinations as model inputs, including (1) summed exposures of SBS and InDel signatures related to PRR deficiency (MMRd, MMRd/Pol-dys, Pol-dys); (2) feature set (1) combined with TMB; (3) feature set (1) combined with total InDel/SNV ratio; (4) summed exposure of SBS signatures related to PRRd; (5) summed exposure of InD signatures related to PRRd.
For each feature set, we constructed the model using either proportion (that is, normalized signature exposure) or the absolute values of the features (that is, raw mutation count contributed by each signature). In total, ten model structures were attempted (five sets of features × two normalizations; Supplementary Table 13). For all models, the feature values were first log2 transformed, then z-score normalized using the formula \(x' = \frac{x - \mu}{\sigma}\). We used the implementation of multinomial elastic net regression (glmnet) in caret (https://topepo.github.io/caret/). In each training iteration, we first partitioned the cohort into 70% for training and 30% for testing, retaining relative proportions of MMRd, Pol-dys, mixed and negative categories across the training and test datasets. A ten-repeat tenfold cross-validation strategy was adopted within the 70% training group. A grid search approach was used to determine the best combination of two hyperparameters (that is, α, which acts as a balancing factor between a lasso and a ridge penalty; and λ, which defines the strength of the penalty), aiming to minimize the log loss. of λ, median coefficient s.d., median multiclass area under the curve (AUC) on the test set and training multiclass AUC of the final model (Supplementary Table 13). Eventually, PRRDetect was selected as the one having the following input variables: (1) summed exposures to MMRd-associated SBS 6, 15, 26, 44, 97; (2) summed exposures to Pol-dys-associated SBS 10a, 10d; (3) summed exposures to combined MMRd/Pol-dys SBS 14, 20; (4) summed exposures to MMRd-associated InD7, InD19; (5) summed exposures to Pol-dys InD14, InD15; (6) summed exposures to combined MMRd/Pol-dys InD16a, InD16b, InD20, InD21; and (7) total InDel to total SNV ratio, with proportional normalization of the first six features (Supplementary Table 14).
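As a small illustration of the preprocessing and the two hyperparameters described above, the sketch below applies a log2 transform followed by z-scoring and evaluates an elastic net penalty for given α and λ. It is not the PRRDetect source code. The pseudocount of 1 (to avoid taking the log of zero) and the use of the sample standard deviation are assumptions not stated in the text, and the penalty follows the glmnet parameterization, λ[(1 − α)/2 · ‖β‖² + α · ‖β‖₁].

```python
from math import log2
from statistics import mean, stdev


def log2_zscore(values, pseudocount=1.0):
    """log2-transform a feature, then z-score it: x' = (x - mu) / sigma.

    `pseudocount` is an assumption of this sketch, added so that zero
    exposures do not produce log2(0).
    """
    logged = [log2(v + pseudocount) for v in values]
    mu, sigma = mean(logged), stdev(logged)
    return [(x - mu) / sigma for x in logged]


def elastic_net_penalty(coefs, alpha, lam):
    """Elastic net penalty in the glmnet parameterization.

    alpha = 1 gives a pure lasso (L1) penalty, alpha = 0 a pure ridge
    (L2) penalty; lam scales the overall penalty strength.
    """
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * ((1 - alpha) / 2 * l2_squared + alpha * l1)


if __name__ == "__main__":
    # Hypothetical exposures for one feature across three samples.
    print(log2_zscore([0, 1, 3]))
    print(elastic_net_penalty([1.0, -2.0], alpha=0.5, lam=2.0))
```

A grid search of the kind described in the text would evaluate cross-validated log loss over a grid of (α, λ) pairs and keep the best-scoring combination.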
The model outputs a categorical distribution across the four PRRd subclasses (that is, MMRd, Pol-dys, MMRd/Pol-dys or Neg). We validated PRRDetect in an independent ICGC breast cohort (n = 504)44,45 and a subset of held-out samples from GEL (n = 847), for which the true PRRd labels were established based on immunohistochemistry staining of four MMR proteins (PMS2, MLH1, MSH2 and MSH6) and driver mutations. The final cohort consisted of 1,351 samples, for which we also computed the MSIseq and MMRDetect prediction results. The ROC curves and their relative AUC values were calculated using the R package ‘pROC'53. To survey the prevalence of PRRd in other cancer cohorts, we applied PRRDetect to two additional datasets not included in InD signature extraction and PRRDetect training: the ICGC/TCGA pan-cancer dataset33 and the Hartwig Medical Foundation metastatic cancer cohort34 (n = 1,335), focusing on seven cancer types commonly enriched with samples with high InDel burdens. InDels from individual samples in these cohorts were processed to 89-channel profiles as was done for the GEL cohort samples. For these datasets, we used published driver annotations as PRRd labels. All comparisons were between biologically independent samples. No statistical method was used to predetermine sample size. No data were excluded from the analyses. The investigators were not blinded to allocation during experiments and outcome assessment. Further details are provided in the Reporting Summary. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article. Raw sequence files from the hTERT-RPE1 mutation accumulation experiment are deposited at the European Genome-Phenome Archive with accession EGAD50000000209. Mutation calls have been deposited at Mendeley (https://doi.org/10.17632/3k2tpx9ssr.2). RPE1 cells can be obtained directly from the authors.
The curated data are available for general browsing from our reference mutational signatures website, Signal (https://signal.mutationalsignatures.com). Primary data from the 100,000 Genomes Project, which are held in a secure research environment, are available to registered users. See https://www.genomicsengland.co.uk/research for further information or contact research-network@genomicsengland.co.uk. ICGC/TCGA WGS data can be obtained from https://dcc.icgc.org/releases/PCAWG. Hartwig metastasis WGS data can be obtained from Hartwig Medical Foundation through standardized procedures and request forms that can be found at https://www.hartwigmedicalfoundation.nl/en/appyling-for-data/. Mutagen signatures from human induced pluripotent stem cells (iPS)52 can be accessed via https://data.mendeley.com/datasets/m7r4msjb4c/2. Human iPS knockout signatures can be obtained directly from https://doi.org/10.1038/s43018-021-00200-0 (ref. The results of RPE1 experimental signatures can be browsed at https://signal.mutationalsignatures.com/explore/main/experimental/experiments?study=7. InD signatures of the seven cancer types are accessible at https://signal.mutationalsignatures.com/explore/main/cancer/signatures?mutationType=3&study=7. Source data are provided with this paper. The R source code of PRRDetect is available via GitHub at https://github.com/Nik-Zainal-Group/PRRDetect and Zenodo at https://doi.org/10.5281/zenodo.14906103 (ref. InDel segmentation and signature classification script can be accessed via GitHub at https://github.com/Nik-Zainal-Group/indelsig.tools.lib and Zenodo at https://doi.org/10.5281/zenodo.14906117 (ref. Mills, R. E. et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Albers, C. A. et al. Dindel: accurate indel calls from short-read data. & Shomron, N. 
Analysis of insertion-deletion from deep-sequencing data: software evaluation for optimal detection. The repertoire of mutational signatures in human cancer. Accurate and sensitive mutational signature analysis with MuSiCal. Davies, H. et al. HRDetect is a predictor of BRCA1 and BRCA2 deficiency based on mutational signatures. & Cuppen, E. Pan-cancer landscape of homologous recombination deficiency. Thibodeau, S. N., Bren, G. & Schaid, D. Microsatellite instability in cancer of the proximal colon. Clues to the pathogenesis of familial colorectal cancer. Boland, C. R. et al. A National Cancer Institute Workshop on Microsatellite Instability for cancer detection and familial predisposition: development of international criteria for the determination of microsatellite instability in colorectal cancer. Reijns, M. A. M. et al. Signatures of TOP1 transcription-associated mutagenesis in cancer and germline. Boot, A. et al. Recurrent mutations in topoisomerase IIalpha cause a previously undescribed mutator phenotype in human cancers. Chung, J. et al. DNA Polymerase and Mismatch Repair Exert Distinct Microsatellite Instability Signatures in Normal and Malignant Human Cells. Turajlic, S. et al. Insertion-and-deletion-derived tumour-specific neoantigens and the immunogenic phenotype: a pan-cancer analysis. Chan, E. M. et al. WRN helicase is a synthetic lethal target in microsatellite unstable cancers. Van Wietmarschen, N. et al. Repeat expansions confer WRN dependence in microsatellite-unstable cancers. Mutational landscape determines sensitivity to PD-1 blockade in non-small cell lung cancer. Cristescu, R. et al. Pan-tumor genomic biomarkers for PD-1 checkpoint blockade-based immunotherapy. Evaluation of POLE and POLD1 Mutations as Biomarkers for Immunotherapy Outcomes Across Multiple Cancer Types. Zimmermann, M. et al. CRISPR screens identify genomic ribonucleotides as a source of PARP-trapping lesions. 
The histone mark H3K36me3 regulates human DNA mismatch repair through its interaction with MutSalpha. A. et al. Genome-wide model for the normal eukaryotic DNA replication fork. Korona, D. A., Lecompte, K. G. & Pursell, Z. F. The high fidelity and unique error signature of human DNA polymerase epsilon. Herzog, M. et al. Mutagenic mechanisms of cancer-associated DNA polymerase ϵ alleles. Differences in genome-wide repeat sequence instability conferred by proofreading and mismatch repair defects. Strand, M., Prolla, T. A., Liskay, R. M. & Petes, T. D. Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Distinct mutational signatures characterize concurrent loss of polymerase proofreading and mismatch repair. & Siggia, E. D. Sequence context affects the rate of short insertions and deletions in flies and primates. The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Priestley, P. et al. Pan-cancer whole-genome analyses of metastatic solid tumours. Turnbull, C. Introducing whole-genome sequencing into routine cancer care: the Genomics England 100 000 Genomes Project. Swan, M. K., Johnson, R. E., Prakash, L., Prakash, S. & Aggarwal, A. K. Structural basis of high-fidelity DNA synthesis by yeast DNA polymerase delta. Kroutil, L. C., Register, K., Bebenek, K. & Kunkel, T. A. Exonucleolytic proofreading during replication of repetitive DNA. Islam, S. M. A. et al. Uncovering novel mutational signatures by de novo extraction with SigProfilerExtractor. Pleguezuelos-Manzano, C. et al. Mutational signature in colorectal cancer caused by genotoxic pks+ E. coli. Kocakavuk, E. et al. Radiotherapy is associated with a deletion signature that contributes to poor outcomes in patients with cancer. DeWeerd, R. A. et al. Prospectively defined patterns of APOBEC3A mutagenesis are prevalent in human cancers. Huang, M. N. et al. MSIseq: software for assessing microsatellite instability from catalogs of somatic mutations. 
Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences.
Davies, H. et al. Whole-genome sequencing reveals breast cancers with mismatch repair deficiency.
Zou, X. et al. A systematic CRISPR screen defines mutational mechanisms underpinning signatures caused by replication errors and endogenous DNA damage.
& Kurzrock, R. The FDA approval of pembrolizumab for adult and pediatric patients with tumor mutational burden (TMB) ≥ 10: a decision centered on empowering patients and their physicians.
Samstein, R. M. et al. Tumor mutational load predicts survival after immunotherapy across multiple cancer types.
Marabelle, A. et al. Association of tumour mutational burden with outcomes in patients with advanced solid tumours treated with pembrolizumab: prospective biomarker analysis of the multicohort, open-label, phase 2 KEYNOTE-158 study.
McGrail, D. J. et al. High tumor mutation burden fails to predict immune checkpoint blockade response across all cancer types.
Addeo, A., Banna, G. L. & Weiss, G. J. Tumor mutation burden-from hopes to doubts.
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves.
Jones, D. et al. cgpCaVEManWrapper: simple execution of CaVEMan in order to detect somatic single nucleotide variants in NGS data.
Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads.
& Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction.
An integrated encyclopedia of DNA elements in the human genome.
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features.
Bergstrom, E. N. et al. SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events.
Benaglia, T., Chauveau, D., Hunter, D. R. & Young, D. S. mixtools: an R package for analyzing finite mixture models.
is supported by the Sir Jeffrey Cheah Foundation. 's laboratory was funded by the Cancer Research UK (CRUK) Advanced Clinician Scientist Award (C60100/A23916), Dr. Josef Steiner Cancer Research Award 2019, Basser Gray Prime Award 2020, CRUK Pioneer Award (C60100/A23433), CRUK Grand Challenge Awards (C60100/A25274 and CGCATF-2021/100013), CRUK Early Detection Project Award (C60100/A27815) and the National Institute of Health Research (NIHR) Research Professorship (NIHR301627). This work was also supported by the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014). This research was made possible through access to data in the National Genomic Research Library, which is managed by Genomics England Limited (a wholly owned company of the Department of Health and Social Care). The National Genomic Research Library holds data provided by patients and collected by the NHS as part of their care and data collected as part of their participation in research. The National Genomic Research Library is funded by the National Institute for Health Research and NHS England. The 100,000 Genomes Project uses data provided by patients and collected by the National Health Service as part of their care and support.
These authors contributed equally: Gene Ching Chiek Koh, Arjun Scott Nanda.
Gene Ching Chiek Koh, Arjun Scott Nanda, Giuseppe Rinaldi, Soraya Boushaki, Andrea Degasperi, Cherif Badja, Andrew Marcel Pregnall, Salome Jingchen Zhao, Lucia Chmelova, Daniella Black, Laura Heskin, João Dias, Jamie Young, Yasin Memari, Scott Shooter, Jan Czarnecki, Helen Ruth Davies, Xueqing Zou & Serena Nik-Zainal
performed gene editing and mutation accumulation experiments.
designed and implemented computational analyses, with input from A.D., A.M.P., L.C., D.B., L.H., J.D., Y.M., S.S., J.C., M.B. Data interpretation and write-up were provided by G.C.C.K., A.S.N., X.Z., G.R. and S.N.-Z., with inputs from all the other authors.
hold patents or have submitted applications on clinical algorithms of mutational signatures: MMRDetect (PCT/EP2022/057387), HRDetect (PCT/EP2017/060294), clinical use of signatures (PCT/EP2017/060289), rearrangement signature methods (PCT/EP2017/060279), clinical predictor (PCT/EP2017/060298), hotspots for chromosomal rearrangements (PCT/EP2017/060298), InDel signature methods (PCT/EP2024/077959) and PRRDetect (PCT/EP2024/078030). All other authors declare no competing interests.
Nature Genetics thanks Geoff Macintyre and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
UMAP, uniform manifold approximation and projection for dimension reduction. b, Cosine similarities between individual gene-edit subclones and the averaged InDel profile of background controls (d). deletion; R, repeat; M, microhomology; C, cytosine; T, thymine. d, Aggregated InDel profile of hTERT-RPE1 TP53-null background control. e, Cosine similarities among experimentally generated COSMIC-83 PRRd InDel signatures (Fig. f, Unsupervised clustering of experimentally generated COSMIC-83 PRRd InDel signatures with cosine distance.
a, De novo substitution burden, and average substitution fold-increase of CRISPR gene edits and unedited control (n = 2–5 subclones per gene edit; Supplementary Table 3). Red dashed line represents the mean substitution burden of control. b, Distinguishing substitution profiles of edited subclones from background controls. Light green error bars depict the mean ±3 s.d.
of cosine similarities between n = 100 bootstrapped substitution profiles of unedited controls and the background profile aggregated from n = 7 unedited subclones. UMAP, uniform manifold approximation and projection for dimension reduction. e, Substitution signatures associated with PRRd gene edits following background subtraction. f, Cosine similarities among experimentally generated PRRd substitution signatures in e. g, Cosine similarities between experimentally generated PRRd substitution signatures and cancer-derived SBS reference signatures (RefSig; ref. 3). h, Averaged SNV to InDel ratio of gene edits. Bar represents the averaged ratio for each genotype.
a, Aggregated COSMIC-83 InDel profile of ICGC samples, with (left, n = 3,323) and without (right, n = 3,287) hypermutator samples. b, Aggregated 89-channel InDel profile of ICGC samples, with (left, n = 3,323) and without (right, n = 3,287) hypermutator samples. c, Percentage of total InDels per channel for ICGC cohort, with COSMIC-83 (left) or 89-channel (right) format. d, Percentage of total InDels per channel for Hartwig cohort, with COSMIC-83 (left) or 89-channel (right) format. e, Aggregated 476-full-channel InDel profiles of ICGC, Hartwig and GEL cohorts. f, Final, consolidated 89-channel InDel profiles of aggregated samples from GEL cohort, with (left, n = 11,585) and without (right, n = 10,792) hypermutator samples.
There is no simple, direct 1-to-1 mapping between the methods. In general, the 89-channel taxonomy expands upon channels that had most of the signals, here 1 bp A/T InDels, into a larger array of channels, and condenses longer InDels and/or genome motifs infrequent in the genome (where signals were scant or nonexistent) into fewer InDel subcategories.
b, Aggregated 89-channel background InDel profile of hTERT-RPE1 TP53-null control subclones (n = 7).
c, Gini index of experimentally generated InDel signatures (n = 10) in COSMIC-83 versus 89-channel format. Two-tailed Wilcoxon signed-rank test, P = 0.001953. d, Cosine similarities among experimentally generated 89-channel PRRd InDel signatures (Fig. e, Unsupervised clustering of experimentally generated 89-channel PRRd InDel signatures using cosine distance.
a, Signature selection plot of de novo extraction by SigProfilerExtractor (ref. 39) of n = 37 experimental samples in COSMIC-83 or 89-channel InDel catalogs. Suggested solutions are shaded in orange or purple for COSMIC-83 and 89-channel catalogs, respectively. b, Suggested de novo solution (two signatures) by SigProfilerExtractor for COSMIC-83 experimental cohort (n = 37). c, Suggested de novo solution (four signatures) by SigProfilerExtractor for 89-channel experimental cohort (n = 37). d, Signature selection plots of de novo extraction by signature.tools.lib (ref. 3), SigProfilerExtractor (ref. 39) and MuSiCal (ref. 8) of n = 52 ICGC colorectal cancers in COSMIC-83 versus 89-channel InDel catalogs. Suggested solutions are shaded in orange or purple for COSMIC-83 and 89-channel InDel catalogs, respectively. e, Selected de novo solution (five signatures) from signature.tools.lib for ICGC colorectal cancer COSMIC-83 catalogs. f, Selected de novo solution (eight signatures) from signature.tools.lib for ICGC colorectal cancer 89-channel catalogs. g, Suggested de novo solution (five signatures) by SigProfilerExtractor for ICGC colorectal cancer COSMIC-83 catalogs. h, Suggested de novo solution (six signatures) by SigProfilerExtractor for ICGC colorectal cancer 89-channel catalogs. Signatures from MuSiCal extraction are not shown.
Hierarchical clustering and dendrogram partitioning of n = 111 tissue-specific signatures into n = 40 distinct patterns using cosine distance (1 − cosine similarity) as distance metric, with a cut-off of 0.15.
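The clustering step described in this caption can be illustrated with a minimal sketch. It assumes average linkage and a plain list-of-lists input; the caption names only the distance metric and the 0.15 cut-off, so the linkage method, the function names `cosine_distance` and `cluster_profiles`, and the toy profiles are all hypothetical, not the authors' implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_profiles(profiles, cutoff=0.15):
    """Agglomerative clustering (average linkage, an assumption) cut at `cutoff`.

    Returns a list of clusters, each a sorted list of row indices.
    """
    n = len(profiles)
    # Full pairwise cosine-distance matrix between profiles.
    dist = [[cosine_distance(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > cutoff:
            break  # every remaining pair is further apart than the cut-off
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


# Toy 4-channel profiles (made up): the first two are nearly identical.
toy = [
    [0.70, 0.20, 0.05, 0.05],
    [0.68, 0.22, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
]
print(cluster_profiles(toy))
```

Stopping the merges once the closest pair exceeds the cut-off is equivalent to cutting an average-linkage dendrogram at that height. A different linkage rule could partition the same signatures differently.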
The 40 distinct patterns were manually reviewed, revised and inspected for mixed patterns to produce the final 37 consensus InDS (Fig.
a, InDel signatures of exposures to environmental mutagens in human induced pluripotent stem cells (hiPSC). b, InDel signatures of DNA repair/replication gene knockouts in human induced pluripotent stem cells. c, Extended sequence context of 1 bp C deletions of InD9a showed TTCT/TTCA enrichment for APOBEC deamination. APOBEC, apolipoprotein B mRNA-editing enzyme, catalytic polypeptide; TLS, translesion synthesis; UNG, uracil-N-glycosylase.
Heatmap clustering of n = 26 experimentally generated signatures and 37 cancer-derived InDS. Experimentally generated signatures include n = 10 from the current study, n = 10 from the mutagen study (ref. 52) (Extended Data Fig. 8a) and n = 6 from iPSC knockout study (ref. 46) (Extended Data Fig. Putative sources (left) and etiologies (right) of the signatures are annotated if known. HRd, homologous recombination deficiency; PAH, polycyclic aromatic hydrocarbons; N-Slip, nascent strand slippage; T-Slip, template strand slippage; MMRd, mismatch repair deficiency.
a, Venn diagrams showing the concordance and discordance between different predictors in selected ICGC and Hartwig cohort (n = 1,335). b, PRRDetect prediction of ICGC and Hartwig seven cancer types (n = 1,335; Supplementary Table 17). c, PRRDetect prediction of GEL seven cancer types (n = 4,775; Supplementary Table 12).
Supplementary Table 1: CRISPR-edited RPE1 isogenic cellular models and their genotypes.
Supplementary Table 2: CRISPR gene-editing guide RNAs, HDR donor templates and genotyping sequencing primers.
Supplementary Table 3: WGS coverage, de novo mutation count, and SNV-to-InDel ratio of experimental subclones.
Supplementary Table 5: Experimental gene-edit single base substitution mutation catalogs and signatures.
Supplementary Table 6: Replicative strand asymmetry analysis result.
Supplementary Table 8: InDel examples of redefined InDel classification of small insertions and deletions (<100 bp) for mutational signature analysis.
Supplementary Table 9: Tissue-specific InDS of Genomics England seven cancer types (n = 4,775).
Supplementary Table 10: Thirty-seven consensus InDS in numerical format.
Supplementary Table 12: PRRDetect of GEL samples (n = 4,775).
Supplementary Table 13: Performance metrics of attempted prediction models using different input parameters.
Supplementary Table 15: Performance metrics of different algorithms for predicting PRR dysfunction in the validation cohort (n = 1,351).
Supplementary Table 16: PRRDetect of ICGC breast samples (n = 504).
Supplementary Table 17: PRRDetect of ICGC and Hartwig selected samples (n = 1,335).
InDel signatures (89-channel) of mutagen exposures and isogenic knockouts in human induced pluripotent stem cells.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Koh, G.C.C., Nanda, A.S., Rinaldi, G. et al. A redefined InDel taxonomy provides insights into mutational signatures.
Upgrades to electricity grids might not keep up with the demands of power-hungry data centres. The electricity consumption of data centres is projected to more than double by 2030, according to a report from the International Energy Agency published today. IEA's models project that data centres will use 945 terawatt-hours (TWh) in 2030, roughly equivalent to the current annual electricity consumption of Japan. The projections largely focus on data centres, which also run computing tasks other than AI. Alex de Vries, a researcher at VU Amsterdam and the founder of Digiconomist, who was not involved with the report, thinks this is an underestimate. The report “is a bit vague when it comes to AI specifically,” he says. Even with these uncertainties, “we should be mindful about how much energy is ultimately being consumed by all these data centers,” says de Vries. Of the predicted growth in consumption, developing economies will account for around 5% by 2030, while advanced economies will account for more than 20% (see ‘Data-centre energy growth'). Countries are building power plants and upgrading electricity grids to meet the forecasted energy demand for data centres.
Heavy drinkers who have eight or more alcoholic drinks per week have an increased risk of brain lesions called hyaline arteriolosclerosis, signs of brain injury that are associated with memory and thinking problems, according to a study published on April 9, 2025, online in Neurology®, the medical journal of the American Academy of Neurology. The study does not prove that heavy drinking causes brain injury; it only shows an association. Hyaline arteriolosclerosis is a condition that causes the small blood vessels to narrow, becoming thick and stiff. This makes it harder for blood to flow, which can damage the brain over time. It appears as lesions, areas of damaged tissue in the brain. "Heavy alcohol consumption is a major global health concern linked to increased health problems and death," said study author Alberto Fernando Oliveira Justo, PhD, of University of Sao Paulo Medical School in Brazil. "We looked at how alcohol affects the brain as people get older." The study included 1,781 people who had an average age of 75 at death. Researchers measured brain weight and the height of each participant. They then divided the participants into four groups: 965 people who never drank; 319 moderate drinkers who had seven or fewer drinks per week; 129 heavy drinkers who had eight or more drinks per week; and 368 former heavy drinkers. Of those who never drank, 40% had vascular brain lesions. Of the former heavy drinkers, 50% had vascular brain lesions. After adjusting for factors that could affect brain health such as age at death, smoking and physical activity, heavy drinkers had 133% higher odds of having vascular brain lesions compared to those who never drank, former heavy drinkers had 89% higher odds and moderate drinkers had 60% higher odds. Researchers also found heavy and former heavy drinkers had higher odds of developing tau tangles, a biomarker associated with Alzheimer's disease, with 41% and 31% higher odds, respectively.
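The "percent higher odds" figures above are odds ratios expressed as percentages, and the arithmetic can be sketched in a few lines. The function names below are hypothetical, and the crude ratio is computed only from the raw percentages quoted in this article; it is not the study's adjusted estimate, which accounts for age at death, smoking and physical activity.

```python
def odds(probability):
    """Convert a probability (0 to 1, exclusive of 1) into odds."""
    return probability / (1.0 - probability)


def odds_ratio_from_percent_higher(percent_higher):
    """'133% higher odds' corresponds to an odds ratio of 1 + 133/100."""
    return 1.0 + percent_higher / 100.0


def crude_odds_ratio(p_group, p_reference):
    """Unadjusted odds ratio between two groups, from their raw proportions."""
    return odds(p_group) / odds(p_reference)


# Adjusted figure quoted for heavy drinkers versus never drinkers.
heavy_or = odds_ratio_from_percent_higher(133)

# Crude comparison from the quoted proportions with lesions:
# 50% of former heavy drinkers versus 40% of never drinkers.
former_crude_or = crude_odds_ratio(0.50, 0.40)

print(f"heavy drinkers, adjusted odds ratio: {heavy_or:.2f}")
print(f"former heavy drinkers, crude odds ratio: {former_crude_or:.2f}")
```

In this sketch the crude ratio for former heavy drinkers comes out at about 1.5, while the adjusted figure reported above corresponds to 1.89. Adjusted and unadjusted odds ratios generally differ in this way. An odds ratio is also not a ratio of probabilities, so 133% higher odds does not mean a 133% higher chance of lesions.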
Researchers also found that heavy drinkers died an average of 13 years earlier than those who never drank. "We found heavy drinking is directly linked to signs of injury in the brain, and this can cause long-term effects on brain health, which may impact memory and thinking abilities," said Justo. A limitation of the study was that it did not look at participants before death and did not have information on the duration of alcohol consumption and cognitive abilities.