Nature Communications, Article number: (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

We measure the temperature profile and investigate the thermal conductivity of suspended monoisotopic hexagonal boron nitride (h¹⁰BN) heterostructures by combining the suspended microbridge technique with Raman spectroscopy. The thermal conductivities exceed 1650 W m−1 K−1 at room temperature, significantly higher than in previous reports, highlighting the crucial influence of the measurement conditions on the experimental results. By including more data points, we refine our models beyond the accuracy of conventional approaches. Our results show a striking deviation of thermal transport from the classical diffusion regime described by Fourier's law: while the temperature profiles are linear above 300 K, they become clearly nonlinear below this temperature, indicating a strong non-diffusive heat transport regime. This behavior underscores the need for a new theoretical framework to fully account for heat transport in two-dimensional materials. Ultimately, our findings pave the way for innovative heat dissipation technologies and challenge conventional paradigms in nano-heat engineering. This study establishes a practical framework linking Raman-based temperature mapping, the number of measurement points, and thermal simulations to reliably determine the in-plane thermal conductivity of 2D materials.
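As a minimal worked sketch of why a linear temperature profile is read as the signature of diffusive transport, consider an idealized case. The assumptions here are illustrative and not taken from the manuscript: a one-dimensional suspended bridge of length \(L\) and cross-section \(A\), a temperature-independent conductivity \(\kappa\), steady state, a heater at \(x=0\) delivering power \(P\), and negligible losses along the bridge.

```latex
\begin{aligned}
q &= -\kappa \frac{dT}{dx} && \text{(Fourier's law)} \\
\frac{d}{dx}\left(\kappa \frac{dT}{dx}\right) &= 0
  \;\Rightarrow\;
  T(x) = T_{\mathrm{hot}} - \left(T_{\mathrm{hot}} - T_{\mathrm{cold}}\right)\frac{x}{L} \\
\frac{P}{A} &= \kappa \frac{T_{\mathrm{hot}} - T_{\mathrm{cold}}}{L}
  \;\Rightarrow\;
  \kappa = \frac{P\,L}{A\left(T_{\mathrm{hot}} - T_{\mathrm{cold}}\right)}
\end{aligned}
```

Under these assumptions the profile is a straight line whose slope fixes \(\kappa\). Measurable curvature in \(T(x)\) therefore means that at least one assumption fails, for example through a temperature-dependent \(\kappa\), parasitic losses, or non-diffusive transport.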
The data that support the findings of this study are available from the corresponding author. There are no restrictions on accessing the data. The COMSOL simulations are provided in the Supplementary Information. Source data are provided with this paper.
We thank Anis Chiout, Jérôme Saint-Martin and Michele Lazzeri for fruitful discussions. The work was supported by French grants ANR ANETHUM (ANR-19-CE24-0021, J.C.), ANR Deus-nano (ANR-19-CE42-0005, J.C.), ANR 2DHeco (ANR-20-CE05-0045, J.C.), ANR Comodes (ANR-22-CE09-0021, J.C.), ANR ELEPHANT (ANR-21-CE30-0012-01, J.C.) and the FastNano project (ANR-22-PEXD-0006, J.C.), as well as the French technological network RENATECH (J.C.). Support for the monoisotopic hBN crystal growth was provided by the USA Office of Naval Research award N00014-22-1-2582 (J.H.E.).

Université Paris-Saclay, CNRS, Centre de Nanosciences et de Nanotechnologies, Palaiseau, France: Cléophanie Brochard-Richard, Gaia Di Berardino, Etienne Herth, Chen Wei, Federico Panciera, Fabrice Oehler, Abdelkarim Ouerghi & Julien Chaste

Tim Taylor Department of Chemical Engineering, Kansas State University, Durland Hall, Manhattan, KS, USA: Thomas Poirier & James H. Edgar

Laboratoire Charles Coulomb (L2C), UMR 5221 CNRS-Université de Montpellier, Montpellier, France: Bernard Gil & Guillaume Cassabois

Université Paris Cité, CNRS, Laboratoire Matériaux et Phénomènes Quantiques, Paris, France: Maria Luisa Della Rocca

Université Grenoble Alpes, CNRS, Grenoble INP, Institut NEEL, Grenoble, France: Suman Sarkar, Nedjma Bendiab & Laëtitia Marty

initiated the work. C.B.-R. fabricated the 2D heterostructures, developed the soft 2D transfer and did the measurements with calibration. fabricated the microheater. have grown the isotopic hBN samples with the help of B.G. have grown the CVD WSe2 flakes. proceeded to the thermal reflectance measurements. are responsible for the graphene sample and measurements. C.B.-R., J.C., and E.H. did the PDMS 2D stamp and sample preparation. J.C. guided the research and wrote the manuscript with the input from all the authors.

Correspondence to Julien Chaste.

The authors declare no competing interests.

Nature Communications thanks Michel Kazan and Zexiao Wang for their contribution to the peer review of this work. A peer review file is available.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Brochard-Richard, C., Di Berardino, G., Herth, E. et al. Extreme longitudinal thermal conductivity and non-diffusive heat transport in isotopic hBN. DOI: https://doi.org/10.1038/s41467-026-69907-x

Nature Communications © 2026 Springer Nature Limited
Deep learning models that infer clinically relevant biomarker status from tissue images are being explored as rapid and low-cost alternatives to molecular testing. Here we show, through statistical analysis across multiple cancer types, datasets and modelling approaches, that the datasets used to train these models contain strong dependencies between biomarkers and clinicopathological features, which prevent models from isolating the effect of a single biomarker and lead them to learn confounded signals. Consequently, their prediction accuracy varies substantially with the status of codependent biomarkers and clinicopathological variables, and for several biomarkers, the gain over what a pathologist can already infer from routine histopathological features, such as grade, remains modest. These findings indicate that current approaches are not yet suitable as substitutes for molecular testing but can support triage or complementary decision-making with caution. Unconfounded biomarker prediction will require models that learn causal rather than correlational relationships between biomarkers and tissue morphology. Fuelled by developments in computational pathology, several studies have proposed methods to predict clinically relevant biomarkers1,2,3,4,5,6,7,8,9, such as gene mutations and expression levels, directly from routine haematoxylin and eosin (H&E)-stained whole-slide images (WSIs)1,2,3,4,10. These approaches take a WSI as input and predict the status of clinically relevant biomarkers such as microsatellite instability (MSI), hormonal receptors or mutations in TP53, BRAF, KRAS, EGFR and other genes, as their target.
Such methods are typically motivated by two main objectives: first, to identify or mine histological patterns associated with specific biomarkers11, and second, to rule out certain biomarkers from routine WSIs, avoiding the need for additional stains or molecular testing, which can be tissue-destructive, costly and associated with longer turnaround time12. For example, the accurate prediction of MSI1,4,13 and mutations in genes such as BRAF1 and KRAS/NRAS10 from WSIs can inform personalized treatment decisions while reducing cost and waiting time compared with sequencing12. Several methods have demonstrated that, in specific cancers10,14,15, biomarker status2,3,16 and alterations in certain genes are predictable from WSIs using deep learning pipelines trained in a weakly supervised fashion on imaging and molecular data from The Cancer Genome Atlas (TCGA) or other similar data repositories, such as the Clinical Proteomic Tumour Analysis Consortium (CPTAC)17. However, for most biomarkers, the prediction accuracy of these methods remains low, with the area under the receiver operating characteristic curve (AUROC) values ranging from 0.50 to 0.90. Moreover, the true generalization of such methods to external datasets is further challenged by factors such as mutation prevalence, limited multicentric data, class imbalance between positive (mutated or high expression) and negative (wild-type or low expression) cases, quality of WSIs (such as pen markings and tissue tears) and domain shifts18. In this Article, we demonstrate that even if these challenges have been handled, there are underlying fundamental issues that require addressing. In a WSI, disease phenotypes manifest as different visual patterns arising from the interaction of multiple codependent genes rather than from a single factor. These interactions are often characterized by patterns of mutual exclusivity or co-occurrence among molecular factors19,20,21. 
Despite this, current approaches primarily focus on predicting the status of an individual biomarker or gene mutation from WSIs, neglecting codependencies between covariates. Although several recent studies5,22 have used multi-output models and leveraged representations from multimodal foundation models to predict biomarker status from WSIs, these studies remain limited to optimizing aggregated accuracies and do not assess the stability of model performance across patient groups stratified by the status of a codependent biomarker. In this study, we show that overlooking interdependencies among biomarkers can influence the predictive performance of machine learning (ML) models. We argue that interdependencies among biomarker statuses in the training data, when ignored during model development, can lead to models capturing the aggregated influence of multiple interdependent biomarkers rather than patterns linked to a single biomarker. Moreover, this could also spuriously inflate or deflate models' apparent performance in certain subgroups when the interdependency structure among molecular factors shifts in the test cohorts. Finally, when clinicopathological variables (for example, tumour mutational burden (TMB) or tumour grade) are themselves associated with biomarker status, models may rely on phenotypes associated with these correlated variables as predictive proxies, instead of capturing the intended biological signal. To illustrate these effects, we first analyse interdependencies among biomarkers by assessing their patterns of mutual exclusivity and co-occurrence23. We then use permutation testing and stratification analysis to demonstrate failure modes of WSI-based predictors by showing that their accuracy for a given biomarker varies substantially when conditioned on the status of other biomarkers.
We also highlight the need for appropriate causal adjustments in WSI-based predictors to ensure reliable inferences necessary for informing clinical decisions, such as treatment selection and pathobiological understanding. To this end, we propose a stratification-based evaluation framework to report bias and support the development of more transparent and trustworthy ML models to advance WSI-based precision diagnostics.

We analysed the limitations of existing ML approaches for predicting molecular biomarkers (for example, mutations, genomic instability indicators and protein expression) from H&E stained WSIs. A high-level concept diagram of these approaches is provided in Fig. We hypothesize that interdependencies among biomarker statuses and clinicopathological variables in the training data, and the disregard of such associations during model development, bias ML models towards relying on aggregated influences of multiple factors in WSIs rather than patterns linked to individual biomarkers. To illustrate this, we retrospectively analysed n = 8,221 patients with breast cancer (BRCA), colorectal cancer (CRC), endometrial cancer (UCEC) and lung cancer across four cohorts for which WSIs and/or molecular information (for example, receptor status, gene mutations and so on) were available (Methods). Using these datasets, we performed the four major steps listed below:

1. An analysis of the interdependency among biomarkers and somatic mutation status of genes in samples;
2. Training deep learning models to predict biomarker status from WSIs;
3. Stratification analysis and permutation testing to assess whether the model trained to predict a certain biomarker is biased by the status of other biomarkers or clinicopathological variables;
4. An analysis of the added value of using ML models in predicting various biomarkers over and above the pathologist-assigned grade.
a, The ML-based prediction of molecular biomarkers from WSIs involves using training data of WSIs with known biomarker statuses. b, An ideal predictor should be able to predict the status of a molecular biomarker (\(Y\)) from the histological effects of that biomarker contained in the WSI, and its output (\(Z\)) should be independent of unrelated confounding factors (lumped into a variable \(C\)), as shown in the simplified causal diagram. Conversely, if the predictor's output depends not only on the histological effects of \(Y\) but also on other confounding factors (for example, histological grade, TMB or status of other biomarkers), then the prediction is confounded because the model is relying on these additional covariates rather than solely on the effects of \(Y\).

Drawing from established methods in gene functional analysis20,21,29,30,31, we quantified the interdependency among molecular factor labels across patients by evaluating their pattern of co-occurrence and mutual exclusivity. We used log odds ratios (LOR) to quantify these relationships, where positive LOR values indicate co-occurrence, and negative values indicate mutual exclusivity. Statistical significance was assessed with a two-sided Fisher's exact test, and the resulting P values were corrected for multiple hypothesis testing.

To assess whether biomarker interdependencies introduce bias into WSI-based models, we analysed three deep learning algorithms with different principles of operation: attention-based (CLAM32), graph neural network-based (\({\mathrm{SlideGraph}}^{\infty }\)33) and a WSI-level multimodal foundation model (TITAN22). These algorithms represent existing ML approaches that do not explicitly consider interdependencies between prediction variables.
As CLAM and \({\mathrm{SlideGraph}}^{\infty }\) rely on a patch-level encoder, we trained them with two different encoders: CTransPath34 (trained on histology images) and ShuffleNet35 (trained on ImageNet)36 to minimize encoder-specific bias. For each biomarker, we train these models with both encoders on the TCGA cohort using fourfold cross-validation and report AUROC as a performance metric. We further evaluated the trained models on two independent validation cohorts, CPTAC37 and the Australian Breast Cancer Tissue Bank (ABCTB)38. Finally, we used WSI-level representation from a multimodal foundation model (TITAN)22, trained on 330,000 image–text pairs, under the hypothesis that these embeddings better capture biomarker-related morphology, and trained both single-output and multi-output biomarker predictors on them. To investigate whether WSI-based biomarker prediction models are confounded by the interdependency among molecular factors or clinicopathological variables (for example, histological grade or TMB), we performed a stratification analysis and permutation testing. For each model, we define two types of variable: the prediction variable, which is the biomarker the model is trained to predict, and stratification variables, which are biomarkers or clinicopathological features showing significant mutual exclusivity or co-occurrence with the prediction variable and may act as confounders (identified in step 1). The motivation for considering interdependent variables as confounders is that they may be associated with a shared phenotypic pattern in WSIs, which the model can exploit as proxies for the prediction variable, potentially leading to biased predictions when such signals are absent or decoupled at test time. To detect such confounders, we evaluate model performance at two levels: (1) across the entire cohort and (2) within subgroups defined by stratification variables. 
Examining model performance within these subgroups allows us to isolate the effect of the prediction variable from confounders. If the model truly captures prediction variable specific patterns in WSIs, its subgroup-level performance should closely match the cohort-level performance. By contrast, substantial differences between subgroups and overall performance indicate the influence of confounding effects or Simpson's paradox39,40. To quantify these effects, we perform permutation testing and report their statistical significance. For example, to evaluate whether the performance of a WSI-based predictor for oestrogen receptor (ER) status (prediction variable) is influenced by TP53 mutation status (stratification variable), we first divide the cohort into two subgroups on the basis of the stratification variable: patients with a TP53-mutant status and patients with a TP53 wild-type status. We then compute the AUROC of the ER predictor within each of these subgroups. Finally, we compare these subgroup-level AUROCs to the model's overall AUROC across the entire cohort. A substantial difference between subgroup and cohort-level AUROCs indicates a potential bias, suggesting the model captures the combined effects of ER and TP53 rather than ER-specific features alone. To establish statistical significance, we run a permutation test with 10,000 trials (see Methods for more details). This definition of the ‘prediction variable' (ER status in this example) and the ‘stratification variable' (TP53 status in this example) will be used consistently in subsequent results and figures to ensure clarity. Repeating this across alternative stratification variables (for example, grade and TMB) provides a systematic way of detecting the influence of confounding factors on different WSI-based models. 
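The stratification analysis and permutation test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the permutation scheme (comparing the subgroup's AUROC gap with that of random subgroups of equal size) is an assumption, since the exact procedure is given in the Methods and not reproduced here. The function names and the toy data are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")  # undefined when one class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auroc(labels, scores, strata):
    """Cohort-level AUROC ('All') plus the AUROC inside each stratum."""
    result = {"All": auroc(labels, scores)}
    for group in sorted(set(strata)):
        idx = [i for i, s in enumerate(strata) if s == group]
        result[group] = auroc([labels[i] for i in idx], [scores[i] for i in idx])
    return result


def permutation_pvalue(labels, scores, strata, group, n_trials=10000, seed=0):
    """Two-sided test (assumed scheme): is the gap between the subgroup AUROC and
    the cohort AUROC larger than the gap seen for random subgroups of the same size?"""
    rng = random.Random(seed)
    overall = auroc(labels, scores)
    idx = [i for i, s in enumerate(strata) if s == group]
    observed = abs(auroc([labels[i] for i in idx], [scores[i] for i in idx]) - overall)
    hits, valid = 0, 0
    for _ in range(n_trials):
        sample = rng.sample(range(len(labels)), len(idx))
        value = auroc([labels[i] for i in sample], [scores[i] for i in sample])
        if value != value:  # skip draws where the AUROC is undefined (NaN)
            continue
        valid += 1
        if abs(value - overall) >= observed:
            hits += 1
    return (hits + 1) / (valid + 1)


# Hypothetical toy cohort: scores track the stratification variable, not the label.
toy_strata = ["mut"] * 4 + ["wt"] * 4
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_scores = [0.8, 0.7, 0.9, 0.85, 0.2, 0.1, 0.3, 0.15]
toy_result = stratified_auroc(toy_labels, toy_scores, toy_strata)
```

In the toy cohort the predictor looks reasonable on the whole cohort yet performs poorly inside each stratum, which is the pattern the stratification analysis is designed to expose. The pairwise AUROC is quadratic in cohort size, so a rank-based implementation would be preferable for real cohorts.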
To assess the added value of ML models in predicting various biomarkers over and above pathologist-assigned grades, we used a support vector machine with one-hot encoded histological grades to predict various clinical biomarkers following the same protocols used for weakly supervised models.

Our analysis revealed significant interdependencies (\(P\ll 0.05\)) among biomarkers across cancer types (Fig. In BRCA, elevated ER and progesterone receptor (PR) expression co-occur with mutations in CDH1, MAP3K1 and PIK3CA, but not with TP53, which is mutually exclusive with CDH1, GATA3, MAP3K1 and PIK3CA41. In CRC, MSI-high (MSI-H) cases frequently carry BRAF, ATM, ARID1A and RNF43 mutations and are less likely to harbour KRAS mutations; BRAF-mutant tumours also show higher TMB and co-occur with ATM, RNF43 and ARID1A. Similar patterns of interdependencies are also observed in UCEC and lung adenocarcinoma (LUAD) (Supplementary Fig. For instance, in UCEC, PTEN mutations co-occur with APC, ATM, JAK1, KRAS and ARID1A, whereas in LUAD, STK11 mutations co-occur with KEAP1 but rarely with EGFR.

The heat maps display a set of biomarkers and genes along the axes, with cell colours within the heat map showing the strength of association (dark red colours for co-occurrence and dark blue for mutual exclusivity). Cells marked with asterisks indicate statistically significant associations (Benjamini–Hochberg FDR-corrected P values from two-sided Fisher's exact test \(P\ll 0.05\)). The top bar above each heat map shows the percentage of cases mutated for a specific gene in case of gene mutations, whereas for biomarkers, it indicates the percentage of patients with elevated ER, PR and HER2 in case of breast tumours, high MSI, hypermutation and CIMP activity and CIN for colorectal tumours. CINGS, chromosomally instable versus genome stable; HM, hypermutated.
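The co-occurrence analysis described above (log odds ratio, two-sided Fisher's exact test, Benjamini–Hochberg correction) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the 0.5 pseudo-count used to avoid division by zero and the example counts are assumptions, and the function names are hypothetical. The 2x2 table is laid out as a = both positive, b = only the first biomarker positive, c = only the second positive, d = both negative.

```python
from math import comb, log


def log_odds_ratio(a, b, c, d, pseudo=0.5):
    """LOR of the 2x2 table [[a, b], [c, d]]; positive means co-occurrence,
    negative means mutual exclusivity. The pseudo-count is an assumption."""
    return log(((a + pseudo) * (d + pseudo)) / ((b + pseudo) * (c + pseudo)))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test: sum the hypergeometric probabilities of all
    tables (with the same margins) that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(low, high + 1)
                if prob(x) <= p_observed * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR-adjusted P values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts for three biomarker pairs (not taken from the study).
tables = [(30, 10, 10, 30), (5, 25, 25, 5), (12, 13, 11, 14)]
lors = [log_odds_ratio(*t) for t in tables]
adjusted_p = benjamini_hochberg([fisher_exact_two_sided(*t) for t in tables])
```

Each pair then has a signed effect size (`lors`) and an FDR-adjusted P value (`adjusted_p`), which is the information encoded by the colour and the asterisk of a heat-map cell.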
Our analysis further showed that, within the same tissue type, biomarker associations can vary across datasets, showing sampling variations. In the TCGA-BRCA cohort, MAP3K1 mutations showed mutual exclusivity with AKT1 and ARID1A, whereas in the METABRIC cohort, they showed a tendency towards co-occurrence (Fig. ER status and high TMB showed mild co-occurrence in the TCGA-BRCA cohort but mutual exclusivity in the METABRIC cohort. In the TCGA-CRC cohort, BRAF-mutant tumours were significantly less likely to harbour TP53 mutations, whereas this association is less pronounced in the DFCI cohort and lacks statistical significance. Similar cross-dataset differences were observed in UCEC and LUAD (Supplementary Fig. For instance, in TCGA-LUAD, BRAF and STK11 showed a weak tendency towards mutual exclusivity, whereas in the MSK cohort, they showed a weak tendency towards co-occurrence. These results show that biomarker statuses are significantly interdependent and that their association patterns can vary across datasets. Consequently, ML models trained on WSIs may learn composite phenotypes driven by multiple interdependent biomarkers, introducing cohort-specific biases and limiting their generalizability to unseen cases. To demonstrate that the ML models analysed in the study were properly trained, we report biomarker prediction performance across algorithms, feature embeddings and modelling approaches (Fig. Different model configurations achieved AUROCs >0.80 for multiple biomarkers in both cross-validation and independent validation cohorts. The plots show the AUROC for two weakly supervised models (CLAM and \({\mathrm{SlideGraph}}^{\infty }\)), each trained with two different patch-level encoders: ShuffleNet, a convolutional neural network-based encoder pretrained on natural images, and CTransPath, a transformer-based model pretrained on WSIs through self-supervised learning. 
For each biomarker or gene mutation, the comparative predictive performance for these four model-encoder combinations is shown. Bar heights represent mean AUROC values, whereas error bars indicate the 95% confidence interval (two-sided, using Student's t-distribution) calculated across 100 class-stratified bootstrap sampling runs. Similar AUROCs were observed for \({\mathrm{SlideGraph}}^{\infty }\) (CTransPath). Beyond breast tumours, these models also achieved high AUROC values for predicting biomarkers and gene mutations in CRC, lung cancer and UCEC (Fig. A strong predictive performance was also observed for other biomarkers, including BRAF, CpG island methylator phenotype pathway (CIMP), CINGS and hypermutation status (Fig. Apart from weakly supervised approaches, single-output and multi-output models trained on TITAN WSI-level feature representation showed roughly similar performance (Supplementary Fig. For example, the multi-output model predicts the ER and PR status of TCGA-BRCA cases with an AUROC of 0.89 and 0.81, respectively, closely matching the AUROC values of models trained under the single-output setting (ER 0.89 and PR 0.79). Next, on the basis of AUROC, we selected the best model for each biomarker and assessed the influence of biomarker interdependencies through permutation testing and stratification analysis. Our confounding factor analysis reveals that WSI-based predictors are strongly influenced by biomarker interdependencies. Across multiple biomarkers, the higher cohort-level AUROCs achieved by these models drop substantially in subgroups defined by the statuses of various stratification variables (Fig. For example, \({\mathrm{SlideGraph}}^{\infty }\) predicts colorectal tumours' MSI status (the ‘prediction variable’) with an AUROC of 0.88 (0.873–0.886). However, when the same patient set is divided into hypermutated and non-hypermutated subgroups (the ‘stratification variable’), the AUROC for MSI status prediction drops to 0.72 within each subgroup.
A similar effect is observed in stratification by other biomarkers showing co-occurrence with MSI (for example, CIMP activity, hypermutation and APC statuses) and those showing mutual exclusivity (for example, BRAF and CINGS) (Fig. The predictive performance of each predictor on all the cases in the cohort (denoted by ‘All' in the plot) over 100 bootstrap runs is shown using a violin plot, whereas its performance in different stratification groups is depicted with a doughnut chart, with the centre representing the AUROC values. Doughnuts marked with an asterisk at the top indicate statistically significant variation in results in the stratification analysis (Benjamini–Hochberg FDR-corrected P values from two-sided permutation testing \(P\ll 0.05\)). Red and blue colours in each doughnut indicate the proportion of positive and negative cases in each stratified group concerning prediction variables. These observations extend beyond colorectal tumours and are evident in biomarker predictors of breast and endometrial tumours, irrespective of the specific model architecture, feature embeddings or training methodology used. For instance, in breast tumours, the performance of the ER predictor substantially declines in cases with GATA3, CDH1 and PIK3CA mutations (Fig. Likewise, the ER predictor's AUROC drops substantially in both PR-positive and negative cases, as well as in TP53-mutant and wild-type cases. Similar trends are apparent for PR, TP53, CDH1 and PIK3CA predictors (Fig. This trend of inconsistent subgroup performance is also observed for other single- and multi-output models, such as those utilizing TITAN WSI-level feature representation (Supplementary Figs. For example, the AUROC of the ER predictor drops from 0.89 to 0.57 in single-output settings, whereas it drops from 0.88 to 0.58 under multi-output settings. 
These results suggest that the biomarker prediction from ML models is contingent on the status of other interdependent biomarkers, and these models are probably relying on composite phenotypes arising from potentially interacting biomarkers rather than learning biomarker-specific morphology. WSI-based models predict breast tumour receptor status (ER, PR) with high cohort-level AUROCs of 0.87 and 0.79 in the TCGA-BRCA cohort, and 0.90 and 0.78 in the ABCTB cohort, respectively. However, the stratification analysis by tumour grade reveals marked subgroup-level performance drops (Fig. The ER predictor AUROC drops to 0.76 for medium-grade cases in both cohorts, and the PR predictor AUROCs in low and medium-grade cases drop to 0.59 and 0.69 in the TCGA-BRCA cohort and to 0.65 and 0.73 in the ABCTB cohort. Mutation predictors show similar grade-specific performance declines; for example, AUROC of the TP53 predictor drops from 0.81 (cohort-level) to 0.73, 0.73 and 0.72 for low-, medium- and high-grade cases. These patterns extend beyond breast tumours and are evident in the mutation predictors of endometrial tumours, irrespective of model architecture, feature embeddings or training methodology (Fig. For example, TP53 predictors trained on TITAN WSI-level embeddings also show performance drops in high-grade cases, with AUROCs decreasing from 0.83 to 0.77 in single-output settings and from 0.86 to 0.77 in multi-output settings. The predictive performance of each predictor on all the cases in the cohort (denoted by ‘All' in the plot) over 100 bootstrap runs is shown using a violin plot, whereas its performance in a group of patients with a certain histological grade is depicted with a doughnut chart, with the centre representing the AUROC values. Doughnuts marked with an asterisk at the top indicate statistically significant differences in results (Benjamini–Hochberg FDR-corrected P values from two-sided permutation testing \(P\ll 0.05\)). 
Red and blue colours in each doughnut indicate the proportion of positive and negative cases in each stratified group in relation to prediction variables. b, Heat maps highlighting the shift in the association structure between histological grade and biomarker status across two distinct datasets. The colour intensity reflects the strength of association, with dark red indicating strong co-occurrence and dark blue indicating strong mutual exclusivity. Our analysis further shows that the apparent AUROCs of WSI-based models are sensitive to shifts in biomarker-grade associations between training and test cohorts. For example, in high-grade UCEC cases, the TP53 predictor attains an AUROC of 0.70 in the TCGA cohort but only 0.36 in the CPTAC cohort, a pattern consistent with a shift in TP53-grade relationship from strong co-occurrence in the training cohort to moderate mutual exclusivity in the test cohort. Consistent with these, single- and multi-output models trained on TITAN WSI-level feature representations showed similar sensitivity (Supplementary Fig. For example, in TCGA-UCEC, TP53 AUROC drops from 0.83 to 0.77 in high-grade cases for the single-output model and from 0.86 to 0.77 for the multi-output model. The confounding influence of grade is further supported by experiments in which, for selected biomarkers, we trained separate models for grade 1, 2 and 3 patients; these grade-specific models attained lower AUROCs than the pooled model (Supplementary Table 1). For example, in TCGA-BRCA, the TP53 grade-specific predictors achieved AUROCs of ~0.73 compared with 0.84 for the pooled model, and ER and PR showed similar reductions. To evaluate whether these disparities could be attributed to demographic differences, we examined the demographic balance between biomarker-positive and biomarker-negative cases and found moderate racial differences (Supplementary Table 2). 
We therefore repeated the grade-stratified experiment only on patients in a single racial subgroup (white). The same trends persisted (Supplementary Table 3); for example, the ER predictor trained only on grade 1 cases achieved an AUROC of 0.66, substantially lower than the pooled AUROC of 0.85, suggesting that demographic factors are unlikely to drive these performance differences (Supplementary Table 3). These results, reminiscent of Simpson's paradox, indicate that WSI-based biomarker prediction models rely heavily on grade-associated morphology rather than biomarker-specific phenotypic signatures, making them less generalizable to external cohorts where grade–biomarker associations differ from those in the training data. Our analysis shows that the status of several biomarkers across cancer types can be inferred with accuracy higher than expected from pathologist-assigned grade, and in several cases, approaches the performance of deep learning models. Grade also predicts TP53 mutations with an AUROC of 0.75, nearly matching the 0.81 achieved by weakly supervised ML models. Similar AUROC patterns were seen for TP53 and PTEN predictors in the TCGA-UCEC and CPTAC-UCEC cohorts. These results suggest that, for some biomarkers, ML algorithms offer limited additional predictive value over pathologist-assigned grade (Fig. The strong grade–biomarker association also risks ML models linking grade-associated phenotypic differences to biomarker status; therefore, WSI-based models are expected to exceed this grade-derived baseline and establish robust phenotype–genotype associations that are independent of tumour grade. The plots illustrate the AUROC achieved by a support vector machine classifier trained to predict a biomarker/gene mutation from one-hot encoded histological grades. 
Bar heights represent mean AUROC values, whereas error bars indicate the 95% confidence interval (two-sided, using Student's t-distribution) calculated across 100 class-stratified bootstrap sampling runs. WSI-based models infer BRAF and TP53 mutations in colorectal tumours (TCGA-CRC) from WSIs with high confidence, achieving AUROCs of 0.774 (0.764–0.785) and 0.717 (0.711–0.722), respectively (Fig. However, stratification analysis reveals a significant challenge: for cases with low mutation density in genes other than BRAF (denoted as \({\mathrm{TMB}}_{\widetilde{{BRAF}}}\)), the BRAF predictor accuracy drops to an AUROC of 0.65 (Fig. Similarly, the TP53 predictor AUROC drops to 0.50 for high TMB cases. In the CPTAC-CRC cohort, similar trends were observed, with BRAF and TP53 predictors' performance dropping in low and high TMB cases, respectively. In addition, APC and KRAS mutation predictors are also influenced by TMB. This observation also extends to UCEC, where the PTEN predictor achieved AUROCs of 0.803 in TCGA-UCEC and 0.731 in CPTAC-UCEC but drops to 0.63 and 0.32 for low TMB cases in the respective cohorts (Fig. a, AUROC values are plotted on the y axis, with the top x axis indicating the prediction variables and the bottom x axis showing patients' stratification with respect to TMB. The predictive performance of each predictor on all the cases in the cohort (denoted by ‘All' in the plot) over 100 bootstrap runs is shown using a violin plot, whereas its performance in patients with high and low TMB is depicted with a doughnut chart, with the centre representing the AUROC values. Doughnuts marked with an asterisk at the top indicate statistically significant variation in results (Benjamini–Hochberg FDR-corrected P values from two-sided permutation testing \(P\ll 0.05\)). Red and blue colours in each doughnut indicate the proportion of positive and negative cases in each stratified group based on prediction variables. 
b, Heat maps highlighting the shift in the association structure between TMB and gene mutations across two distinct datasets. The colour intensity reflects the strength of association, with dark red indicating strong co-occurrence and dark blue indicating strong mutual exclusivity. We further show that varying associations between TMB and biomarker status across datasets significantly influence the prediction accuracy of WSI-based predictors. In CRC, the association between KRAS mutation and TMB is slightly stronger in the CPTAC-CRC cohort compared with the TCGA-CRC cohort (Fig. This stronger association could explain the KRAS predictor's significantly improved prediction accuracy (AUROC: 0.83) in high TMB cases in the CPTAC-CRC cohort, compared with an AUROC of 0.63 for high TMB cases in the TCGA-CRC cohort. Deep learning models trained on routine WSIs of H&E-stained tissue sections are increasingly discussed as rapid and cost-effective tools to infer molecular biomarker status in patients with cancer. In this study, we identified key limitations of these approaches for clinical and preclinical use, in particular, their failure to account for biomarker interdependencies during model training and inference. Through statistical analysis, we first demonstrated significant interdependencies among molecular factors across tissue types and datasets (TCGA, METABRIC, MSK and DFCI), manifested as patterns of mutual exclusivity and co-occurrence that reflect both pathobiological and spurious associations. Subsequently, using permutation testing and stratification analysis, we showed that these associations in the training data lead to models whose predictions for a given biomarker are contingent on the status of other codependent biomarkers. For example, the PR predictor showed a marked drop in performance in CDH1-mutant cases, with AUROC decreasing from 0.79 to 0.50. 
This decline in subgroup performance suggests that the current ML models cannot fully disentangle biomarker-specific signals from the multifaceted influence of molecular characteristics and other factors on tissue phenotypes in WSIs. The inability of WSI-based models to discern biomarker-specific signals has direct clinical implications when codependent biomarkers have divergent therapeutic roles. An example is the BRAF-MSI association in CRC. Our analysis shows that MSI predictions from WSI-based models are contingent on BRAF status, with AUROCs dropping in both BRAF-mutant and wild-type subgroups, and a similar pattern was observed for the BRAF predictor when stratified by MSI status (Figs. This reflects their well-known biological co-occurrence: MSI-H CRCs frequently harbour BRAF V600E mutations, whereas MSI-stable CRCs rarely harbour BRAF mutations. Crucially, however, MSI-H and BRAF mutations have distinct therapeutic implications. MSI-H is a strong predictor of the response to immune checkpoint inhibitors such as pembrolizumab or nivolumab, whereas BRAF V600E mutations are targeted using BRAF and MEK inhibitors in combination with EGFR blockade. Combinations of immunotherapy and BRAF inhibitors are currently being tested for the double mutant. A model that cannot disentangle MSI-H from BRAF status may achieve high aggregate AUROC but lacks clinical utility, as confusing the two would misguide treatment selection. This example underscores the broader need for bias-aware evaluation: predictors must be assessed not only for overall accuracy but also for their ability to distinguish correlated biomarkers with divergent therapeutic pathways42. Beyond the influence of biomarker interdependencies, we showed that these models exploit prominent grade- or TMB-associated features in WSIs as proxies for biomarker prediction (Figs. 
In breast tumours, AUROCs of ER and TP53 predictors drop markedly within grade-stratified subgroups and shifts in the grade–biomarker association across cohorts lead to apparent improvements or declines in accuracy. Likewise, TMB-stratified analysis shows substantial AUROC declines for BRAF, TP53 and other markers, with shifts in TMB–biomarker association across cohorts influencing apparent accuracy. These patterns, observed across different ML models and feature representations, reflect a broader challenge in computational pathology: models tend to exploit confounding variables (grade, TMB) and conflate them with biomarkers of interest (for example, ER, PR, TP53 and PTEN status), thereby obscuring true genotype–phenotype relationships, limiting generalizability and introducing bias. This also raises concerns about their suitability for routine clinical use, because substantial heterogeneity in biomarker profiles can exist among tumours with the same grade or TMB, and both grade and TMB can evolve over the disease course or treatment. Consequently, models that rely on these prominent features are vulnerable to distribution shifts and may produce inconsistent predictions for the same patient at different time points, irrespective of the true biomarker status. These findings underscore the need to interpret external validation results with caution. In our analysis, the ER predictor achieved a high AUROC of 0.87 in cross-validation on TCGA-BRCA and 0.90 in a larger independent cohort (ABCTB), which could be interpreted as an excellent generalizability of the model. However, upon closer examination, we found that the apparent improvement in AUROC was largely driven by a stronger grade-ER association in the ABCTB than in the training cohort. Moreover, within grade-stratified subgroups, the predictive performance of this sophisticated ER predictor was not substantially more informative than a simple grade-based classifier. 
This illustrates that external validation must be complemented by bias-aware evaluations, such as grade- and TMB-stratified analyses, before claiming clinical utility. The confounding influence of biomarker interdependencies and clinicopathological variables (for example, grade and TMB) on current WSI-based biomarker prediction suggests that current models are not yet ready to replace genomic testing in routine care. Instead, they are better positioned for triaging, screening or supplementary decision support, provided that their performance is rigorously assessed and key clinical decisions remain supported by confirmatory testing. To ensure true clinical utility, we suggest bias-aware evaluation, including reporting grade- and TMB-stratified metrics and subgroup calibration rather than relying solely on aggregate AUROC (Figs. Our findings also have implications for studies and trials that link disease phenotypes to biomarkers or assess treatment response conditioned on biomarker status. In both contexts, establishing robust relationships requires that the biomarker of interest is not tightly coupled to cohort-specific covariates, as such dependencies can lead to false conclusions. To mitigate this risk, we recommend: (1) preserving variation in the target biomarker relative to correlated variables during enrolment; (2) prespecifying stratification factors (for example, grade, TMB, site, key comutations) and conducting prospective subgroup analyses; (3) including a dependency-aware analysis plan (for example, stratified permutation tests, subgroup confidence intervals, comparison with simple clinical baselines such as grade-only models); and (4) conducting per stratum power calculations rather than only aggregate targets. Although ML methods for predicting biomarker status from WSI have limitations, they can still provide substantial value. 
They can facilitate research and hypothesis generation by uncovering associations between histology and molecular factors, particularly in tissue-limited or retrospective scenarios where running additional assays is not feasible. WSI-based models also offer a scalable and cost-effective surrogate for large-scale preclinical and translational studies and can serve as rapid prescreening tools in early phase trials or resource-constrained settings43. In drug development, they can help narrow the pool of candidates for more resource-intensive molecular analyses and, with appropriate safeguards and clinician oversight, can support triage by guiding decisions on when confirmatory testing is essential43. To support safe use, we recommend bias-aware evaluation and interpretation of prediction results, including subgroup-stratified metrics and permutation-based checks, and comparisons against simple baselines such as grade-based classifiers. Although predicting biomarker status from routine H&E WSIs may appear to be a simple image-to-label mapping, it is considerably more complex because phenotypes in WSIs are rarely driven by a single factor and instead reflect combined effects of multiple codependent molecular factors. Our analyses show that current approaches, including single and multi-output models, as well as ML and graph-based methods across different feature representations, fail to reliably learn biomarker-specific genotype–phenotype mapping; instead, they exploit aggregated phenotypes of interdependent biomarkers or cohort-specific association as proxies for prediction. This results in biased models whose performance drops across patients' strata defined by codependent variables. 
These findings motivate the need for methods that formalize the problem as causal, structured multilabel learning: explicitly encode dependencies among biomarkers in the label space, learn disentangled image representations guided by conditional-independence objectives, mitigate confounding via causal adjustment and counterfactual data augmentation and optimize for invariance and distributional robustness, coupled with evaluation protocols based on conditional metrics and subgroup calibration44,45. Although we demonstrated the generalizability of our findings across multiple cancer types, datasets and modelling approaches, this study still has limitations. First, our analyses were limited to H&E WSIs with WSI-level (coarse) labels, and we did not evaluate immunohistochemistry (IHC) slides or models trained with fine-grained labels (for example, spatial omics supervision). Second, although we used a large multicentre dataset (n = 8,221), prospective studies are needed to define clinical and deployment guidelines. Third, we note that learning disentangled genotype–phenotype mapping using ML will probably require combinatorially richer datasets with the exhaustive coverage of comutation or biomarker-pair combinations than current cohorts; however, curating such datasets would necessitate significant long-term efforts. Last, we suggested several methodological directions for ML researchers to explore and words of caution for clinicians, but their effectiveness remains to be established; it is premature to recommend definitive clinical guidelines. All samples used in the study were obtained with research consent and ethics approvals as indicated in the consent and ethics statements for TCGA, METABRIC, COAD-DFCI, MSK-LUAD, CPTAC and ABCTB. We analysed data of four cancer types (BRCA, CRC, LUAD and UCEC), sourced from six cohorts: TCGA46, METABRIC24,25, COAD-DFCI28, MSK-LUAD26, CPTAC and ABCTB. 
Biomarkers and gene mutation status information, except for the ABCTB cohort, were collected from cBioportal30. WSIs of formalin-fixed paraffin-embedded (FFPE) H&E-stained tissue for TCGA atlas cases were collected from TCGA46,47, whereas for CPTAC atlas cases, they were retrieved from The Cancer Imaging Archive (TCIA)17. Within the ABCTB cohort, WSIs and receptor status (ER, PR and human epidermal growth factor receptor 2 (HER2) status) information were available for 2,303 patients. In terms of biomarkers, for breast tumours, ER, PR and HER2 status were recorded. For colorectal cases, MSI, hypermutation (HM), chromosomal instability (CIN) and CIMP activity statuses were documented. Given the status of two biomarkers \({\rm{A}}\) and \({\rm{B}}\), in a given dataset, we calculated the LOR as follows: In the above equation, \({n}_{{\rm{A}}}\) and \({n}_{{\rm{B}}}\) denote the number of cases that are positive for \({\rm{A}}\) and \({\rm{B}}\), respectively, whereas \({n}_{ \sim {\rm{A}}}\) and \({n}_{ \sim {\rm{B}}}\) denote the number of cases that are negative for those biomarkers. A higher positive LOR between gene pairs indicates mutation co-occurrence (that is, if one gene is mutated, the other is likely to be mutated), whereas a negative value signifies mutual exclusivity of mutation (that is, if one gene is mutated, the other is less likely to be mutated). In addition to the LOR analysis, we statistically assessed the interdependence among the mutational status of different genes using a two-sided Fisher's exact test. All gene pairs were enumerated, and a Fisher's exact test was performed on each pair. Subsequently, we reported the multi-hypothesis corrected P values for each pair using the Benjamini–Hochberg method, with a significance threshold set at \(P\ll 0.05\). 
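The pairwise association analysis described above (log odds ratio, two-sided Fisher's exact test, Benjamini–Hochberg correction) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's implementation: it uses the standard 2×2 contingency-table definition of the log odds ratio, and the function names, the cell naming (`n11` = both positive, `n10` = A only, `n01` = B only, `n00` = both negative) and the optional `pseudo` count for empty cells are hypothetical choices made here.

```python
from math import comb, log


def log_odds_ratio(n11, n10, n01, n00, pseudo=0.5):
    """Log odds ratio of a 2x2 table (standard definition, assumed here).

    `pseudo` is an illustrative continuity correction that avoids division
    by zero when a cell is empty; set it to 0 for the uncorrected value.
    """
    return log(((n11 + pseudo) * (n00 + pseudo)) /
               ((n10 + pseudo) * (n01 + pseudo)))


def fisher_exact_two_sided(n11, n10, n01, n00):
    """Two-sided Fisher's exact test on a 2x2 table.

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, col1 = n11 + n10, n11 + n01
    n = n11 + n10 + n01 + n00

    def prob(k):
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p_obs = prob(n11)
    total = sum(prob(k) for k in range(lo, hi + 1)
                if prob(k) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest to smallest P
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up 2x2 tables for three hypothetical gene pairs.
tables = [(30, 5, 6, 59), (2, 20, 25, 53), (10, 15, 14, 61)]
lors = [log_odds_ratio(*t) for t in tables]
adj_p = benjamini_hochberg([fisher_exact_two_sided(*t) for t in tables])
```

A positive entry in `lors` corresponds to co-occurrence and a negative entry to mutual exclusivity, and a pair would be reported as significant when its entry in `adj_p` falls below the chosen threshold.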
We assessed the predictability of biomarkers and gene alteration status from WSIs within their respective cohorts using two algorithms with different principles of operation: CLAM32 and \({\mathrm{SlideGraph}}^{\infty }\)33. To avoid drawing conclusions specific to a certain approach or type of features, the predictive performance of both algorithms was evaluated over different types of feature: deep features (a convolutional neural network-based encoder trained on ImageNet)36 and self-supervised features (a transformer-based model trained on histology images in a self-supervised manner)34. Our predictive pipeline comprises three main steps: (1) preprocessing of WSIs, (2) embedding of WSI patches, (3) biomarkers and gene mutation prediction from WSIs using CLAM and \({\mathrm{SlideGraph}}^{\infty }\). In our preprocessing pipeline, utilizing a U-Net-based segmentation model from TIAToolbox48, we first segment viable tissue areas of each WSI and exclude regions with artefacts (pen-marking, tissue folding and so on). The model-generated tissue mask highlights informative tissue areas within the WSI using a pixel value of 1, whereas those with a value of 0 represent background or regions with artefacts. We selectively keep patches (both benign and tumour) that have more than 40% viable tissue in terms of pixel proportion. We utilized various encoders to extract feature representation from WSI patches. Specifically, we used ShuffleNet35 pretrained on ImageNet36 as a patch-level encoder to extract the 1,024-dimensional feature representation (deep features) from WSI patches of size \(512\,\mathrm{pixels}\times 512\) pixels. Moreover, we also extracted a 768-dimensional self-supervised feature representation from each patch of size \(\mathrm{1,024}\,\mathrm{pixels}\times \mathrm{1,024}\) pixels using CTransPath (a transformer-based self-supervised model trained on histology images)34. 
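The patch-selection rule in the preprocessing step (keep patches with more than 40% viable tissue according to the binary tissue mask) can be illustrated as follows. This is a sketch under assumptions that the text does not state: patches are tiled without overlap, the mask is at the same resolution as the patch grid, and `viable_patches` is a hypothetical helper rather than a TIAToolbox function.

```python
def viable_patches(mask, patch, min_fraction=0.40):
    """Return (row, col) origins of patches with enough viable tissue.

    mask         : 2D list of 0/1 values (1 = viable tissue, 0 = background
                   or artefact), as produced by a tissue segmentation model.
    patch        : patch side length in mask pixels.
    min_fraction : a patch is kept only if its tissue fraction exceeds this.
    """
    height, width = len(mask), len(mask[0])
    keep = []
    # Non-overlapping tiling (assumption); partial border patches are skipped.
    for r in range(0, height - patch + 1, patch):
        for c in range(0, width - patch + 1, patch):
            tissue = sum(mask[i][j]
                         for i in range(r, r + patch)
                         for j in range(c, c + patch))
            if tissue / (patch * patch) > min_fraction:
                keep.append((r, c))
    return keep


# Toy 4x4 mask split into four 2x2 patches.
toy_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
kept = viable_patches(toy_mask, patch=2)
```

In the toy mask, only the top-left patch (three of four pixels viable) and the bottom-right patch (all four viable) pass the threshold. The retained origins would then be handed to the patch encoder.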
We trained \({\mathrm{SlideGraph}}^{\infty }\) and CLAM for predicting the status of different clinical biomarkers using both deep features and self-supervised features. In case of \({\mathrm{SlideGraph}}^{\infty }\), we first construct a graph representation of the WSI and then pass the WSI graph to a graph neural network for predicting the status of a certain biomarker as output. In cases where patients had multiple WSIs, we constructed a serial graph incorporating all WSIs and predicted the target label accordingly. Apart from these weakly supervised models, we also analysed alternative modelling approaches using feature representations from TITAN22, a state-of-the-art multimodal foundation model trained on more than 330,000 WSIs paired with pathology reports. We leveraged TITAN-derived features to train both single-output and multi-output models for biomarker prediction. In the single-output settings, WSI-level features were fed into a logistic regression model to predict the status of a single biomarker. In the multi-output settings, we used a multilayer perceptron (MLP) model that takes WSI-level representations as input and simultaneously predicts the status of all biomarkers as output. The model architecture consists of a single hidden layer that projects the input to half its dimension, followed by a rectified linear unit activation function and then an output layer. The model was trained using a pairwise ranking loss function33. We trained and evaluated the performance of both \({\mathrm{SlideGraph}}^{\infty }\) and CLAM using fourfold cross-validation, in which the dataset is partitioned into four 75/25 non-overlapping splits. We trained the model for 300 epochs on the training set, with a batch size of 8 and a learning rate of 0.001 using the adaptive momentum-based optimizer49. To limit overfitting, we stop the model training if its performance on the validation cohort is not improving over ten consecutive epochs. 
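The stopping rule described above (at most 300 epochs, halting when validation performance has not improved for ten consecutive epochs) can be sketched as a small helper. The class and function names are hypothetical, and the sketch assumes that the monitored validation metric is one where higher is better, such as AUROC; the text does not say which quantity was monitored.

```python
class EarlyStopping:
    """Track a higher-is-better validation metric and signal when to stop."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_metric):
        """Record one epoch; return True when training should stop."""
        if val_metric > self.best:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_curve, max_epochs=300, patience=10):
    """Number of epochs actually trained for a given validation curve."""
    stopper = EarlyStopping(patience)
    for epoch, metric in enumerate(val_curve[:max_epochs], start=1):
        if stopper.step(metric):
            return epoch
    return min(len(val_curve), max_epochs)


# Made-up validation curve: improves for two epochs, then plateaus.
curve = [0.60, 0.70] + [0.65] * 20
stopped_at = epochs_run(curve)
```

In a real training loop, `step` would be called once per epoch with the validation score, and the weights from the best epoch would typically be restored after stopping.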
We quantitatively assess model performance on the test set using AUROC as a performance metric. Our motivation for using AUROC as the primary metric was twofold: (1) it allows us to maintain comparability with existing literature and align with established benchmarking practices, and (2) it serves as a threshold-free, rank-based statistic for bias detection, enabling subgroup evaluation and stratified permutation testing. We used the same train, validation and test splits for both \({\mathrm{SlideGraph}}^{\infty }\) and CLAM. To assess the predictability of biomarkers and gene mutation status on the basis of histology grade, we used a linear model (specifically, a support vector machine). This model uses the one-hot encoded histological grade as input to predict the status of a certain clinical biomarker as the target. We followed the same training and evaluation protocols used for our weakly supervised models. To investigate whether WSI-based biomarker prediction models are confounded by biomarker interdependency or clinicopathological variables (for example, histology grade or TMB), we used a stratification-based permutation testing approach. A high-level conceptual overview of the approach is shown in Fig. 8, and complete algorithmic details are presented in Supplementary Table 4. Using the procedure outlined in that table, we evaluate the robustness of model performance to confounding influence from biomarkers or clinicopathological features that exhibit mutual exclusivity or co-occurrence with the prediction variable (hereafter referred to as stratification variables). The algorithm takes as input a dataset containing prediction scores (\(Z\)), ground truth labels (\(Y\)) and a confounding or stratification variable (\(C\)). In step 1, the algorithm computes foreground statistics, such as AUROC within each stratum defined by the values of \({\rm{C}}\). 
In step 2, AUROCs are computed in each permuted dataset, in which any association between \(C\) and \(Y\) has been randomized, to form a null distribution reflecting expected model performance under the assumption of no association between \(C\) and \(Y\). In step 3, the algorithm compares the observed AUROCs against null distributions to assess how extreme they are. If they lie in the tails, the effect of \(C\) is considered statistically significant, and a two-sided multiple hypothesis corrected P value is computed. The variable \({C}_{i}\in {V}_{C}\) denotes the stratification variable (for example, status of a codependent biomarker or clinicopathological feature), and \({V}_{C}\) is the set of all unique values that \(C\) can take (for example, mutant or wild-type for mutation status). For each subgroup \(v\in {V}_{C}\), we compute a stratified performance measure using AUROC as a performance metric. We define the foreground metric as \({M}_{C=v}=\mathrm{AUROC}\left(\{\left({Z}_{i},{Y}_{i}\right)\mid {C}_{i}=v\}\right)\), which reflects model performance restricted to a subgroup where \(C=v\). To determine whether \({M}_{C=v}\) significantly deviates from what would be expected under the null hypothesis, that is, when the model predictions \(Z\) are independent of \(C\), we conduct a stratified permutation test. Let \(Q=10{,}000\) be the number of permutations. For each permutation trial \(q=1,\ldots ,Q\), we define a permutation function \({\pi }_{q}:\{1,\ldots ,N\}\to \{1,\ldots ,N\}\), which randomly shuffles the assignment of \(C\) while preserving the correspondence between \(Z\) and \(Y\). A permuted dataset is constructed as \({D}^{\left(q\right)}=\{\left({Z}_{i},{Y}_{i},{C}_{{\pi }_{q}\left(i\right)}\right)\mid i=1,\ldots ,N\}\), and for each \(v\in {V}_{C}\), we compute the permuted AUROC: \({M}_{C=v}^{\left(q\right)}=\mathrm{AUROC}\left(\{\left({Z}_{i},{Y}_{i}\right)\mid {C}_{{\pi }_{q}\left(i\right)}=v\}\right)\). 
To quantify whether the observed stratified performance \({M}_{C=v}\) is significantly different from the null distribution, we compute a two-sided P value, \({p}_{v}\). A lower value of \({p}_{v}\) suggests that the model's predictions are influenced by the stratification variable, implying reliance on proxy features rather than those directly linked with the prediction variable50,51. Using the stratified permutation test discussed above, we examined three key factors that could introduce bias into an ML model: first, the bias due to interdependency among biomarkers and the somatic mutation status of genes in the training dataset; second, a likely bias due to patients' tumour histological grades; and third, an expected bias due to the TMB of a patient with cancer. To assess the influence of interdependence among biomarker statuses on model predictive performance, we select the model with the highest AUROC score for each biomarker and run a permutation test, treating other biomarkers with codependent statuses as confounding variables. Subsequently, to analyse the influence of histological grade on WSI-based biomarker predictors, we use a similar approach, utilizing histology grade as a confounding variable. Finally, to evaluate the impact of TMB on histology image-based biomarker predictors, we first calculate patient-level TMB excluding genetic alterations of the gene of interest used for prediction, then use this \({\mathrm{TMB}}_{\widetilde{\mathrm{voi}}}\) as a confounding variable. On the basis of \({\mathrm{TMB}}_{\widetilde{\mathrm{voi}}}\), we divide the patients into low and high TMB cases using a threshold of ten mutations per megabase. As this procedure is repeated across multiple stratification variables and subgroups, all P values \({p}_{v}\) are corrected for multiple hypothesis testing using the Benjamini–Hochberg procedure. Adjusted P values below a false discovery rate (FDR) threshold of 0.05 are considered statistically significant. 
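The stratified permutation test described above can be sketched in a few lines of plain Python. This is a hedged illustration, not the released implementation. The function names are hypothetical, and the two-sided P value is an assumption made here: it is taken as the fraction of permuted AUROCs at least as far from the null mean as the observed AUROC, with an add-one correction. The demo uses made-up data and a small `Q` for speed, whereas the text specifies \(Q=10{,}000\).

```python
import random


def auroc(scores, labels):
    """Pairwise AUROC; returns None if a class is missing in the subset."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratum_auroc(Z, Y, C, v):
    """Foreground metric M_{C=v}: AUROC restricted to cases with C == v."""
    idx = [i for i, c in enumerate(C) if c == v]
    return auroc([Z[i] for i in idx], [Y[i] for i in idx])


def stratified_permutation_test(Z, Y, C, Q=10_000, seed=0):
    """Return {v: (observed AUROC, two-sided P value)} for each stratum v."""
    rng = random.Random(seed)
    observed = {v: stratum_auroc(Z, Y, C, v) for v in set(C)}
    null = {v: [] for v in observed}
    perm = list(C)
    for _ in range(Q):
        rng.shuffle(perm)                 # shuffle C, keep (Z, Y) pairs intact
        for v in observed:
            m = stratum_auroc(Z, Y, perm, v)
            if m is not None:
                null[v].append(m)
    results = {}
    for v, m_obs in observed.items():
        if m_obs is None or not null[v]:
            results[v] = (m_obs, None)
            continue
        centre = sum(null[v]) / len(null[v])
        extreme = sum(abs(m - centre) >= abs(m_obs - centre) for m in null[v])
        # Assumed two-sided P value with add-one correction.
        results[v] = (m_obs, (1 + extreme) / (1 + len(null[v])))
    return results


# Made-up cohort: the score Z depends only on the stratification variable C.
C = [1] * 50 + [0] * 50
Y = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40
Z = [0.9 if c == 1 else 0.1 for c in C]
demo = stratified_permutation_test(Z, Y, C, Q=200, seed=0)
```

The per-stratum P values returned here would then be pooled across stratification variables and subgroups and adjusted with the Benjamini–Hochberg procedure before applying the 0.05 FDR threshold.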
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article. WSIs of TCGA patients used in the study can be downloaded from the NIH Genomic Data Commons Portal at this link: https://portal.gdc.cancer.gov/. The genomic data and clinical data of patients in TCGA, METABRIC, COAD-DFCI, MSK-LUAD and CPTAC cohorts can be downloaded from cBioPortal at https://www.cbioportal.org/. Code and documentation of all Python scripts used in the study are available via GitHub at https://github.com/imuhdawood/HistBiases. Any additional information required to reproduce the data reported in this work is available from the corresponding author upon request. Bilal, M. et al. Development and validation of a weakly supervised deep learning framework to predict the status of molecular pathways and key mutations in colorectal cancer from routine histology images: a retrospective study. Lu, W. et al. SlideGraph+: whole slide image level graphs to predict HER2 status in breast cancer. Wagner, S. J. et al. Transformer-based biomarker prediction from colorectal cancer histology: a large-scale multicentric study. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. McCaw, Z. et al. Machine learning enabled prediction of digital biomarkers from whole slide histopathology image. Deep learning using histological images for gene mutation prediction in lung cancer: a multicentre retrospective study. Real-world deployment of a fine-tuned pathology foundation model for lung cancer biomarker detection. Deep learning in cancer pathology: a new generation of clinical biomarkers. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Saldanha, O. L. et al. Self-supervised attention-based deep learning for pan-cancer mutation prediction from histopathology. Fu, Y. et al. 
Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Lim, C. et al. Biomarker testing and time to treatment decision in patients with advanced nonsmall-cell lung cancer†. Echle, A. et al. Clinical-grade detection of microsatellite instability in colorectal tumors by deep learning. Kather, J. N. et al. Pan-cancer image-based detection of clinically actionable genetic alterations. Maximum mean discrepancy kernels for predictive and prognostic modeling of whole slide images. Deep learned tissue ‘fingerprints' classify breast cancers by ER/PR/Her2 status from H&E images. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. Jahanifar, M. et al. Domain generalization in computational pathology: survey and guidelines. Sanchez-Vega, F. et al. Oncogenic signaling pathways in The Cancer Genome Atlas. Systematic identification of cancer driving signaling pathways based on mutual exclusivity of genomic alterations. Ciriello, G., Cerami, E., Sander, C. & Schultz, N. Mutual exclusivity analysis identifies oncogenic network modules. Ding, T. et al. A multimodal whole-slide foundation model for pathology. Tekle, G. E. et al. Co-occurrence and mutual exclusivity: what cross-cancer mutation patterns can tell us. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes. The underlying tumor genomics of predominant histologic subtypes in lung adenocarcinoma. Weigelt, B. et al. Molecular characterization of endometrial carcinomas in black and white patients reveals disparate drivers with therapeutic implications. Giannakis, M. et al. Genomic correlates of immune-cell infiltrates in colorectal carcinoma. A novel independence test for somatic alterations in cancer shows that biology drives mutual exclusivity but chance explains most co-occurrence. 
The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Gao, J. et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Dawood, M. et al. Cross-linking breast tumor transcriptomic states and tissue histology. Wang, X. et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Zhang, X., Zhou, X., Lin, M. & Sun, J. ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 6848–6856 (IEEE, 2018); https://doi.org/10.1109/CVPR.2018.00716. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Mertins, P. et al. Proteogenomics connects somatic mutations to signaling in breast cancer. Carpenter, J. E. & Clarke, C. L. Biobanking sustainability—experiences of the Australian Breast Cancer Tissue Bank (ABCTB). Bonovas, S. & Piovani, D. Simpson's paradox in clinical research: a cautionary tale. The molecular landscape of Asian breast cancers reveals clinically relevant population-specific differences. Phase 1/2 trial of encorafenib, cetuximab, and nivolumab in microsatellite stable BRAFV600E metastatic colorectal cancer. Screening of normal endoscopic large bowel biopsies with interpretable graph learning: a retrospective study. The impact of site-specific digital histology signatures on deep learning model accuracy and bias. Schölkopf, B. in Probabilistic and Causal Inference: The Works of Judea Pearl 765–804 (Association for Computing Machinery, 2022). Koboldt, D. C. et al. Comprehensive molecular portraits of human breast tumours. Hoadley, K. A. et al. Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Pocock, J. et al. 
TIAToolbox as an end-to-end library for advanced tissue image analytics. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR, 2015). Ojala, M. & Garriga, G. C. Permutation tests for studying classifier performance. Chaibub Neto, E. et al. A permutation approach to assess confounding in machine learning applications for digital health. 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 54–64 (Association for Computing Machinery, 2019); https://doi.org/10.1145/3292500.3330903 acknowledges support from the GSK-Warwick PhD Studentship and the Department of Computer Science, University of Warwick. were partially supported by the PathLAKE consortium, which was funded by the Data to Early Diagnosis and Precision Medicine strand of the government's Industrial Strategy Challenge Fund, managed and delivered by UK Research and Innovation (UKRI). also acknowledges funding support from EPSRC grant no. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. Predictive Systems in Biomedicine Lab, Department of Computer Science, University of Warwick, Coventry, UK Muhammad Dawood & Fayyaz ul Amir Afsar Minhas Tissue Image Analytics Centre, University of Warwick, Coventry, UK Muhammad Dawood, Nasir Rajpoot & Fayyaz ul Amir Afsar Minhas Artificial Intelligence and Machine Learning, GSK, San Francisco, CA, USA designed the study with support from co-authors. visualized and verified the underlying data. drafted the manuscript with input from co-authors. All authors had full access to all the data in the study and had the final decision to submit for publication. 
Correspondence to Muhammad Dawood or Fayyaz ul Amir Afsar Minhas. conducted this study during his PhD at the University of Warwick, UK. received PhD studentship support from GSK. is the founding Director, CEO and CSO of Histofy Ltd. FM holds shares in Histofy Ltd with no operational involvement. The other authors declare no competing interests. Nature Biomedical Engineering thanks Lee Cooper, Nikos Paragios and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. Dawood, M., Branson, K., Tejpar, S. et al. Confounding factors and biases abound when predicting molecular biomarkers from histological images.
A simple handheld copper “bow drill” from ancient Egypt is yielding new insights more than 100 years after it was first found. The small copper alloy object was originally discovered at Badari in Upper Egypt in the 1920s, inside an adult male's grave dubbed Grave 3932. “The ancient Egyptians are famous for stone temples, painted tombs, and dazzling jewelry, but behind those achievements lay practical, everyday technologies that rarely survive in the archaeological record,” Martin Odler, visiting fellow in Newcastle University's School of History, Classics, and Archaeology, said in a statement about the research he co-authored. “One of the most important was the drill: A tool used to pierce wood, stone, and beads, enabling everything from furniture-making to ornament production.” That all changed when the research team took a liking to it and found that it featured wear consistent with drilling, highlighted by fine striations, rounded edges, and a slight curvature at the working tip. All the evidence pointed to rotary motion, not just simple puncturing. It also meant the drill could penetrate a surface more forcefully. “The re-analysis has provided strong evidence that this object was used as a bow drill—which would have produced a faster, more controlled drilling action than simply pushing or twisting an awl-like tool by hand,” Odler said. “This suggests that Egyptian craftspeople mastered reliable rotary drilling more than two millennia before some of the best-preserved drill sets.” The oldest bow drills previously known to archaeologists are 2,000 years newer, and depictions of such drills even appeared in tomb scenes created around 1500 B.C.E. But archaeologists are now aware that those examples are fairly recent pieces of a much longer history of tool use in Egypt. 
“This re-evaluation not only enriches our understanding of early Egyptian tool use but also raises intriguing questions about early metallurgical knowledge and interregional interactions in the ancient Near East.” To further explore the bow drill, researchers subjected it to chemical analysis and portable X-ray fluorescence, finding it was crafted with an unusual copper alloy. Tim Newcomb is a journalist based in the Pacific Northwest. He covers stadiums, sneakers, gear, infrastructure, and more for a variety of publications, including Popular Mechanics.
Incredible discoveries keep happening in the most unlikely places. AT FIRST, THE skeleton in grave 134 seemed unremarkable. In 2018, archaeologists in Germany descended upon the ancient town of Nida, about a mile and a half outside Frankfurt. The archaeologists focused their attention on excavating a cemetery. Here they would look at more than 100 plots where people had been buried, often with objects. So when the archaeologists came upon grave 134 and saw what appeared to be a necklace resting below the skeleton's chin, they weren't shocked or overly curious. The male, estimated to be 35 to 40 years old at the time of his death, had no observable injuries or abnormalities. The inch-long silver charm held a fragile, broken piece of foil that seemed to have writing on it. The amulet, it turned out, was a phylactery—and that 1,800-year-old trinket would alter the course of what we thought we knew about Christianity and European history. At the time of the excavation, however, archaeologist Markus Scholz could only determine that the amulet contained a small piece of rolled-up foil with 18 lines of Latin written on it. “[I]t quickly became clear that it would be impossible to physically unroll the scroll—which would have simply crumbled to pieces,” said Scholz, a professor at Goethe University in Frankfurt who was part of the excavation team. But they still couldn't make out the text. Then in 2024, Scholz and the team tried computed tomography, which combines x-rays and computer processing to analyze cross-sections of an object. “I couldn't believe my own eyes for a while,” Scholz said. After years of debate, most scholars now agree that the lines, now known as the Frankfurt Silver Inscription, invoke Christ. The inscription has powerful implications, suggesting that Christianity had spread from the Roman Empire's base in Italy much earlier than historians thought, into the heart of what we now know as Germany. “It remains an absolute stroke of luck,” Scholz said. 
“Many archaeological findings are generated more like a large jigsaw puzzle.” That kind of serendipitous luck fuels many modern archaeological finds that have altered what we think we know about history. It's not an uncommon story—farmers digging a well stumble upon China's Terracotta Army; a man hunting for his lost chickens finds the underground city of Derinkuyu in Turkey; a construction crew discovers 51 decapitated Viking warriors near Weymouth in southern England. And these auspicious finds are still happening regularly. “This is the most insane thing ever,” he pants while piling up gold coins in front of the camera. Indeed, it was quite unbelievable: The farmer had found piles of Civil War–era gold coins hiding just under the surface of his field. There was no treasure chest, not even a can or cigar box to hold the loot. Their quality was astonishing: more than 700 near-pristine coins dated between 1840 and 1863, including 18 prized 1863 gold Liberty Double Eagles. The Liberty Double Eagles were particularly valuable, especially in their condition: In a 2014 auction, just one Double Eagle fetched thousands of dollars. The coins drew attention for their historical value too. The fact that the coins were in near-mint condition suggested that they hadn't been in circulation for long, if at all. And their owner was likely a Confederate; Southerners at that time during the war had few options for depositing money into banks and expecting to be able to retrieve it later. Ryan McNutt, an associate professor at Georgia Southern University who focuses on conflict archaeology, believes that whoever had them “was potentially engaged in selling goods or trading goods with the U.S. government or military. LEONARDO DA VINCI'S Vitruvian Man is among history's most famous works of art, as recognizable as the Mona Lisa. Unlike the Mona Lisa, Vitruvian Man isn't completely original. 
Da Vinci's Vitruvian Man shows a spread-eagled man with his arms and legs in two positions: feet closed and arms extended at 90-degree angles, like a tree, and angled out like an X. Outside the body, Da Vinci drew a circle and a square, showing how human anatomy could fit neatly within those incongruous shapes. Most art historians believed those two shapes were the only ones present in Vitruvian Man. But within its clean lines was a secret dating back to Vitruvius, who suggested that the perfect proportions of the human form could be achieved by a geometric relationship. Vitruvius, however, kept the geometric relationship a tantalizing secret. That is, until the summer of 2025, when Rory Mac Sweeney, a British dentist, took a long look at Vitruvian Man. The shapes embedded within the artwork are positioned in a way that was, in true Vitruvian form, simple yet beautiful. The answer had been right there all along. Explaining the stunning geometry to dentists and art historians alike, however, was more difficult. “Clinical dentistry focuses on immediate treatment concerns, not evolutionary geometry,” he said. “Art historians I approached were unfamiliar with craniofacial biomechanics. “The enthusiasm I felt was matched by isolation in discussing it.” Nevertheless, Mac Sweeney published an article in the Journal of Mathematics and the Arts offering his unique insight into Da Vinci's cryptic puzzle. For his part, Mac Sweeney doesn't think his coming upon the hidden geometry of Vitruvian Man was luck. “Discovery feels like communion with nature's underlying logic,” he reflected. “You're not inventing connections; you're recognizing patterns that were always present, waiting for the right perspective to render them visible.” HOME RENOVATION PROJECTS can feel like a money sink. The plumber had arrived to work on a basement demolition when he spied a rope sticking out of the basement floor. 
What he uncovered was stunning: a treasure chest filled with 66 pounds of gold coins, each stamped with composer Wolfgang Mozart's image and dating back before World War II. The haul is worth an estimated $2.7 million today. Tim Newcomb is a journalist based in the Pacific Northwest. He covers stadiums, sneakers, gear, infrastructure, and more for a variety of publications, including Popular Mechanics. Tanya Basu is a features editor at Popular Mechanics. Prior to being a journalist, she toyed with the idea of being an economist before deciding journalism was more her speed. Aside from her work, she is a competent knitter, an overly ambitious cook, and has a bad habit of buying more books than she can actually read.
The metaphor captured the promise of personal computing: tools that enable people to go further and faster with less effort. But the deeper brilliance of bicycles lies in what they do not do: they do not mimic human biology, nor any form found in nature. By comparison, I propose that artificial-intelligence agents are aeroplanes for the mind — they can speed things up for humans even more than bicycles do, but they are harder to control and the consequences of mistakes can be huge. And scientists are particularly poised to benefit from these tools. Scientific research is, at its core, a journey into the unknown. Yet working in new terrains brings unexpected challenges2 and frequent failures3. To push the frontiers of knowledge forwards quickly and responsibly, science and scientists urgently need a playbook for flying these aeroplanes. The real question is not whether machines will replace scientists, but what kind of scientists we will become when we learn to fly them. To put this into practice, my team developed SciSciGPT4, a prototype multi-agent system in which several specialized AI agents divide and coordinate research workflows. We used the science of science5 — a field that combines large data sets and computational methods to probe the dynamics of scientific progress. It orchestrates the workflow, dividing a researcher's natural-language query entered through a chat interface into tasks then delegating them to agents that specialize in literature review, data extraction or analysis. 
These agents plan and execute sub-tasks — retrieving publications, writing code, running analyses and generating figures — while the EvaluationSpecialist continuously audits their output. Each step is logged, creating a transparent end-to-end provenance record. It completed these research tasks faster and with higher-quality results than experienced researchers did using AI tools. Here, I outline the lessons that my team learnt from building a research-focused AI agent and the principles that scientists should consider when using agents for science. The temptation today is to fully automate scientific workflows6–8, switching to ‘AI scientists' or ‘self-driving laboratories' that generate hypotheses, design experiments and draft manuscripts end-to-end. These systems can be dazzling, but science is not an assembly line, nor does it have fixed objectives to optimize. For example, a fully automated system could conduct Newton's prism experiments, measuring how white light splits as it passes through a prism and fitting those data to a model. But Newton did something categorically different: he reversed the set-up, recomposing the coloured beams back into white light, decisively showing that colour belongs to light itself, not to the glass. That act — deciding that an apparent anomaly was the phenomenon rather than an error to eliminate — was a leap of interpretation, not computation. Automated workflows, by design, smooth out anomalies and optimize towards fit. As AI tools become central to research, science faces not only a technological inflection point but also a civic one. The legitimacy of science rests on a shared social contract: that conclusions are open to scrutiny, that authors stand behind their evidence and that knowledge is produced in good faith for the public good. 
In an era when public confidence in science is already fragile, this is the moment to strengthen the foundations that sustain it and to renew that contract by embedding transparency, traceability and accountability into the infrastructure of discovery itself. Full automation might deliver some answers, but it would erode the credibility that gives those answers meaning. Interfaces should be built for steerability and disagreement, inviting researchers to inspect reasoning, compare alternatives and override conclusions. Making this model robust will require deliberate collaboration between scientists specializing in the domain of the study, engineers who work on AI, designers and ethicists to ensure that agents amplify human creativity rather than replace it. Throughout history, discoveries have been made by humans. As AI becomes capable of contributing to discovery, the central question is not what machines can do alone, but how we design them to keep science accountable and reproducible. When the cost of failure collapses, riskier and more ambitious ideas become rational, making it practical to test questions that were once too costly or time-consuming. Genomics illustrates this shift: decoding the first human genome took more than a decade and billions of dollars; today, sequencing costs less than US$1,000 and takes hours, transforming the field from studies of individual genes to broad exploration of entire genomes. And with the shift came fresh vantage points, enabling researchers to see connections across the scientific landscape that were otherwise invisible. Speed also changes who can ask the questions. But the same forces that accelerate discovery can also amplify error. Fast science without reflection risks converging on mistakes at scale. This reinforces the importance of human–AI collaboration rather than full automation. 
SciSciGPT was a natural first test case: the science of science is rich in data and methodologically diverse, and it studies how discovery itself works. But the same idea applies across disciplines, although the training data that grounds these agents will differ. In chemistry, this might mean databases tied to kinetic models that predict reaction rates and highlight where experiments tend to fail; in biomedicine, clinical guidelines linked to trial data, diagnostic protocols and multimodal patient information; in mathematics, formalized proof libraries. AI research agents will look different in each field, but they should follow the same basic rules: results should be traceable, methods verifiable and responsibilities assigned clearly. Establishing those rules will require coordination between scientific societies, funders, journals, public research infrastructures and the AI labs building today's models. The goal is a shared public–private framework for interoperability — for instance, common standards for logging agent decisions so that an analysis run in one lab can be audited or reproduced by another. Some laboratories are trying to automate research work by using ‘AI scientists' that perform projects from start to finish. Credit: Qilai Shen/Bloomberg via Getty. My team's research shows that AI's benefits to science are widespread across disciplines9. But when we analysed university syllabuses to examine how much each discipline teaches AI-related courses, we found a systematic mismatch: AI education is concentrated in computer science, mathematics and engineering, even though disciplines that could benefit just as much — from medicine and psychology to economics — offer much less training9. At the same time, academia remains organized around departmental silos that drift farther apart as the burden of knowledge rises. When bicycles crash, the consequences are generally localized. 
Aeroplanes are different: when they crash, it can be catastrophic for everyone on board, often with collateral damage on the ground. As they flourish, their failures won't just inconvenience a single researcher; they could mislead fields, redirect funding and erode public trust in science. One crucial advantage of large language models (LLMs) is that they can write. Even when my best students do an experiment, I cannot expect to see or reconstruct every step that led them to a result. Yet this brings another challenge: too much information. The solution is to log not more, but better: to design systems that turn raw provenance into understanding. Lu, C. et al. Preprint at arXiv https://doi.org/10.48550/arXiv.2408.06292 (2024).
He built the ultimate test for humanoid robots, and they beat it in months Roboticist Benjie Holson created the “Humanoid Olympic Games” thinking home robots were 15 years away. Dressed as a robot, Benjie Holson demonstrates the silver medal challenge in his proposed Humanoid Olympics. Last September roboticist Benjie Holson posted the “Humanoid Olympic Games”: a set of increasingly difficult tests for humanoid robots that he demonstrated himself while dressed in a silver bodysuit. While other competitions feature robots playing sports and dancing, Holson argued that the robots we actually want are the ones that can do laundry and cook meals. Instead, within months, robotics company Physical Intelligence completed 11 of the 15 challenges—from bronze to gold—with a robot that washed windows, spread peanut butter and used a dog poop bag. Scientific American spoke to Holson about why vision-only, or camera-based, systems are outperforming his expectations and how close we are to a genuinely useful machine. He has since released a new, more difficult set of challenges. Were you surprised by how quickly the results came in? When I chose the challenges, I was trying to calibrate them so some bronze ones would get done in the first month or two, then silver and gold in the next six months, and the most difficult ones might take a year or a year and a half. I started with the premise that we have things that look impressive at a fairly narrow set of tasks—vision-only, no touch, simple manipulator, not incredible precision. That limits what you can be good at. I tried to think of tasks that would require us to break forward out of that set. It turns out I wildly underestimated what's possible with vision-only and simple manipulators. 
They're doing all of that 100 percent vision-based. There is a lot of confusion about whether large language models (LLMs) are useless for robots. On the other hand, we've started doing vision-action models using the same transformer architecture [as that used in LLMs]. The neat thing is they're starting with models pretrained on text, images, maybe video. Before you even start training your specific task, the AI already understands what a teapot is, what water is, that you might want to fill a teapot with water. So while training your task, it doesn't have to start from, “Let me figure out what geometry is.” It can start with, “I see, we're moving teapots around”—which is wild that it works. How did you come up with the “Olympic” tasks? Humans rely on touch to do things such as finding keys in a pocket. How do we get around that in robotics? That's a very good question we don't know the answer to yet. Touch technology is way worse, more expensive, delicate and far behind cameras. Cameras, we've been working on for a long time. Both Physical Intelligence and Sunday Robotics [which completed the bronze-medal task of rolling matched socks] have made the bet that putting a camera on the wrist, very close to the fingers, lets you kind of see forces by seeing how everything smushes. It works way better than I expected. The energy needed to stay balanced is often quite high. If a robot is falling, that's a very fast, hard acceleration to get the leg in front in time. Your system has to inject a lot of energy into the world—and that's what's unsafe. For safety, that's such an easier way to get there quickly. If a humanoid loses power, it's going to fall down. They're dangerous but so valuable that we tolerate the risk. Have these results changed your time line? I used to think home robots were at least 15 years away. It takes a long time to get reliability squared away. 
Reliability and safety—the stuff Physical Intelligence shows is incredibly impressive, but if you put it on a different table with different lighting and use a different sock, it might not work. Each step toward generalization seems to take an order of magnitude more data, turning days of data collection into weeks or months. Deni Ellis Béchard is Scientific American's senior writer for technology. He holds two master's degrees in literature, as well as a master's degree in biology from Harvard University. His most recent novel, We Are Dreams in the Eternal Machine, explores the ways that artificial intelligence could transform humanity. You can follow him on X, Instagram and Bluesky @denibechard
Researchers at Oregon State University have created a new nanomaterial designed to destroy cancer cells from the inside. The material activates two separate chemical reactions once inside a tumor cell, overwhelming it with oxidative stress while leaving surrounding healthy tissue unharmed. This emerging cancer treatment strategy takes advantage of the unique chemical conditions found inside tumors. Compared with normal tissue, cancer cells tend to be more acidic and contain higher levels of hydrogen peroxide. These reactive oxygen species damage cells through oxidation, stripping electrons from essential components such as lipids, proteins, and DNA. More recent chemodynamic therapy (CDT) approaches have also succeeded in generating singlet oxygen inside tumors. Consequently, preclinical studies often only show partial tumor regression and not a durable therapeutic benefit. To address these shortcomings, the team developed a new CDT nanoagent built from an iron-based metal-organic framework (MOF). This structure is capable of producing both hydroxyl radicals and singlet oxygen, increasing its cancer-fighting potential. The MOF demonstrated strong toxicity across multiple cancer cell lines while causing minimal harm to noncancerous cells. "When we systemically administered our nanoagent in mice bearing human breast cancer cells, it efficiently accumulated in tumors, robustly generated reactive oxygen species and completely eradicated the cancer without adverse effects," Olena Taratula said. "We saw total tumor regression and long-term prevention of recurrence, all without seeing any systemic toxicity." Other contributors to the study included Oregon State researchers Kongbrailatpam Shitaljit Sharma, Yoon Tae Goo, Vladislav Grigoriev, Constanze Raitmayr, Ana Paula Mesquita Souza, and Manali Parag Phawde.
They play a vital role in basic biological functions, including cell growth and specialization. In recent years, scientists have focused on these compounds, especially spermidine, for their potential to support healthy aging. Often described as 'geroprotectors,' they have been shown to stimulate autophagy, a cellular recycling process that clears out damaged components. This benefit largely depends on a protein called eukaryotic translation initiation factor 5A (eIF5A1). How can the same molecules that appear to promote longevity also be associated with cancer? However, exactly how polyamines influence this metabolic shift has not been fully understood. Adding to the complexity, eIF5A1 has well established functions in normal, healthy cells. A closely related protein, eIF5A2, shares 84% of its amino acid sequence but has been linked to cancer development. Why two nearly identical proteins behave so differently has been a major unanswered question. To investigate, a team led by Associate Professor Kyohei Higashi from the Faculty of Pharmaceutical Sciences at Tokyo University of Science in Japan carried out an in-depth study using advanced molecular and proteomic methods. The findings clarify how polyamines stimulate cancer cell growth through biological routes that differ from those involved in healthy aging. The researchers worked with human cancer cell lines to examine how polyamines affect protein production and metabolism. They first reduced polyamine levels using a drug, then restored them by adding spermidine. This approach allowed them to directly measure the impact of polyamines on cancer cells. Using high-resolution proteomic techniques, they analyzed changes across more than 6,700 proteins. Their results showed that polyamines primarily boost glycolysis, the process that quickly converts glucose into energy, rather than enhancing mitochondrial respiration, which is more closely tied to healthy aging. 
"The biological activity of polyamines via eIF5A differs between normal and cancer tissues," explains Dr. Higashi. "In normal tissues, eIF5A1, activated by polyamines, activates mitochondria via autophagy, whereas in cancer tissues, eIF5A2, whose synthesis is promoted by polyamines, controls gene expression at the translational level to facilitate the proliferation of cancer cells." In other words, polyamines trigger very different effects depending on which protein they influence. Under typical conditions, production of the eIF5A2 protein is restrained by a small regulatory RNA molecule called miR-6514-5p. The researchers found that polyamines disrupt this natural brake, allowing eIF5A2 to be produced in greater amounts. These findings carry important implications for both cancer treatment and the use of polyamine supplements. In tissues that are cancerous or at risk of becoming malignant, the same molecules can stimulate tumor growth through eIF5A2. This dual behavior helps explain why polyamines have been so challenging to interpret in medical research. Targeting eIF5A2 specifically could, in theory, slow cancer growth without interfering with the beneficial effects linked to eIF5A1. Overall, this research marks a significant advance in understanding the complex and sometimes contradictory roles of polyamines. This study was supported in part by a Grant-in-Aid for Scientific Research (C) (No.
Nature Nanotechnology (2026)
The development of compact and highly sensitive microwave detectors compatible with complementary metal–oxide–semiconductor (CMOS) processes remains a major challenge in microwave technology. Spin-torque diodes are emerging nanoscale spintronic devices capable of surpassing the theoretical thermodynamic sensitivity limits of Schottky diodes. However, their practical use in compact systems is limited by the need for external antennas or probes. Here we demonstrate a magnetoelectric (ME) spin-torque microwave detector that monolithically integrates a ME antenna with a magnetic tunnel junction (MTJ). The device directly converts wireless electromagnetic signals into a d.c. output at sub-microwatt power levels, achieving a sensitivity greater than 90 kV W−1, a noise equivalent power of 3 pW Hz−1/2 and a compact footprint of 0.4 mm2. This performance is due to the non-linear coupling between incoherent magnetization dynamics, driven by a d.c. current in the MTJ, and the combined effects of the microwave voltage and strain generated by the ME antenna under incident electromagnetic waves. We further show that this design is scalable, enabling the cointegration of a ME antenna with an array of MTJs. A detector incorporating four MTJs exhibits an increased sensitivity exceeding 400 kV W−1. Our results may contribute to the development of a new generation of highly sensitive, compact and scalable microwave detectors that combine ME antennas and spintronic diodes.
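To put the reported figures in context, the following is a minimal sketch relating sensitivity, input power and noise equivalent power (NEP). It assumes a linear small-signal response (output voltage equals sensitivity times input power) and the usual definition of NEP as voltage noise density divided by sensitivity. The 0.1 µW input and the function names are illustrative assumptions, not values or code from the article, and because 90 kV W−1 is a lower bound the results are rough estimates only.

```python
# Back-of-the-envelope relations between the reported detector figures.
# Assumption: linear response, V_out = sensitivity * P_in.

SENSITIVITY_V_PER_W = 90e3   # reported lower bound: > 90 kV/W
NEP_W_PER_RT_HZ = 3e-12      # reported: 3 pW/Hz^0.5


def output_voltage(p_in_watts: float,
                   sensitivity: float = SENSITIVITY_V_PER_W) -> float:
    """Rectified d.c. voltage (V) for a given incident microwave power (W)."""
    return sensitivity * p_in_watts


def implied_noise_density(nep: float = NEP_W_PER_RT_HZ,
                          sensitivity: float = SENSITIVITY_V_PER_W) -> float:
    """Voltage noise density (V/Hz^0.5) implied by NEP = v_noise / sensitivity."""
    return nep * sensitivity


if __name__ == "__main__":
    p_in = 0.1e-6  # hypothetical input power of 0.1 microwatt
    print(f"V_out ~ {output_voltage(p_in) * 1e3:.1f} mV")
    print(f"noise density ~ {implied_noise_density() * 1e9:.0f} nV/sqrt(Hz)")
```

Under these assumptions a sub-microwatt signal would still yield a millivolt-scale d.c. output, which is one way to read the claim that the device works at sub-microwatt power levels.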
All data are available in the Article or its Supplementary Information, and are also available from the corresponding authors upon request.
The work was supported by the National Natural Science Foundation of China (NNSFC) (numbers 52371206, 12474127, 12204357 and U24A6001). This work was supported in part by the National Key Research and Development Program of China under grant 2023YFB2407700, Frontier Technologies R&D Program of Jiangsu (number BF2025031), CAS Young Talent Program and the Gusu Leading Talents Program (number ZXL2023172). A.C. acknowledges support from the National Key Research and Development Program of China (number 2024YFA1408503) and the Sichuan Province Science and Technology Support Program (number 2025YFHZ0147).
The work of G.F. was supported by the project PRIN_20225YF2S4—Magneto-mechanical accelerometers, gyroscopes and computing based on nanoscale magnetic tunnel junctions (MMagyc) funded by the Italian Ministry of University and Research and the MUR-PNRR project SPINERGY ‘SPINtronic Electromagnetic eneRGY harvesting with magnetic tunnel junctions for next generation of green IoT nodes', CUP D93C22000900001 by Nest—Network 4 Energy Sustainable Transition, Parternariato Esteso–PE000002. The work of R.T. and M.C. was partially supported by the Project PE0000021, ‘Network 4 Energy Sustainable Transition—NEST', funded by the European Union—NextGenerationEU, under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.3—Call for tender number 1561 of 11 October 2022 of Ministero dell'Università e della Ricerca (MUR) (CUP C93C22005230007). and G.F. are with the Petaspin TEAM and are thankful for the support of the PETASPIN association (www.petaspin.com). These authors contributed equally: Shuhui Liu, Riccardo Tomasello. 
Nanofabrication Facility, Suzhou Institute of Nano-Tech and Nano-Bionics, Chinese Academy of Sciences, Suzhou, China: Shuhui Liu, Bin Fang, Zhenhao Liu, Rui Hu, Wenkui Lin, Baoshun Zhang & Zhongming Zeng
School of Nano Technology and Nano Bionics, University of Science and Technology of China, Hefei, China: Shuhui Liu, Bin Fang, Rui Hu, Wenkui Lin, Baoshun Zhang & Zhongming Zeng
Department of Electrical and Information Engineering, Politecnico di Bari, Bari, Italy: Riccardo Tomasello & Mario Carpentieri
State Key Laboratory of Electronic Thin Film and Integrated Devices, School of Physics, University of Electronic Science and Technology of China, Chengdu, China: Aitian Chen
School of Integrated Circuit Science and Engineering, Wuxi University, Wuxi, China: Like Zhang
Physical Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia: Xixiang Zhang
Department of Mathematical and Computer Sciences, Physical Sciences and Earth Sciences, University of Messina, Messina, Italy: Giovanni Finocchio
and G.F. designed the experiments. prepared the films. performed the device fabrication and TEM characterization. and B.F. performed the electrical characterizations. and G.F. designed the micromagnetic solver. carried out the micromagnetic simulations. analysed the data. B.F., S.L., G.F. and R.T. wrote the manuscript with the help of Z.Z. All the authors commented on the final version of the manuscript.
The work was performed under the supervision of B.F., Z.Z. Correspondence to Bin Fang, Giovanni Finocchio or Zhongming Zeng. The authors declare no conflicts of interest. Nature Nanotechnology thanks Shinji Miwa and the other, anonymous, reviewers for their contribution to the peer review of this work. Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Supplementary Figs. 1–10, Notes 1–9 and Table 1. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. Liu, S., Tomasello, R., Fang, B. et al. A CMOS-compatible, scalable and compact magnetoelectric spin-torque microwave detector. Received: 18 July 2025 Accepted: 12 January 2026 Published: 02 March 2026 Version of record: 02 March 2026 Nature Nanotechnology ISSN 1748-3395 (online) ISSN 1748-3387 (print) © 2026 Springer Nature Limited
Bimagrumab is an investigational antibody targeting type II activin receptors, intended to reduce total body and visceral fat mass and promote muscle growth. In this double-blind, placebo-controlled phase 2 trial, 507 adults with obesity (body mass index ≥30 kg m−2, or ≥27 kg m−2 with at least one obesity-associated complication, excluding diabetes) were randomized to nine groups (1:1:1:1:1:1:1:1:1 ratio) to receive treatment for 48 weeks: placebo, bimagrumab (10 mg kg−1 or 30 mg kg−1 intravenously every 12 weeks), semaglutide (1.0 mg or 2.4 mg subcutaneously once a week) and combinations thereof. An open-label treatment extension to week 72 followed. Randomization was stratified by sex across the treatment groups. The primary and secondary endpoints were absolute change from baseline in body weight at week 48 and week 72, respectively. The least squares mean absolute changes in body weight at week 48 were −9.3 kg (bimagrumab 30 mg kg−1), −14.2 kg (semaglutide 2.4 mg) and −17.8 kg (bimagrumab 30 mg kg−1 plus semaglutide 2.4 mg; that is, the high-dose combination) versus −3.3 kg (placebo) (all P < 0.001 versus placebo). Continued improvements were observed through week 72. Bimagrumab plus semaglutide resulted in substantial reductions in body weight, and safety was consistent with the known safety profiles of both drugs. Obesity is a chronic disease projected to affect nearly 3.3 billion adults worldwide by 2035, with an estimated economic impact exceeding USD $4 trillion1. Excess adipose tissue, particularly visceral adipose tissue (VAT), increases the risk of obesity complications and related diseases, including metabolic and cardiovascular diseases2,3.
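The allocation described in the trial design above (nine groups in equal ratio, stratified by sex) can be illustrated with a short sketch. The article does not state the randomization algorithm, so permuted blocks within each sex stratum are assumed here purely for illustration. The arm labels are shorthand, and the exact dose pairing of the four combination arms is inferred from the group count rather than quoted from the source.

```python
import random
from collections import defaultdict

# Shorthand labels for the nine groups; the combination pairings are an
# assumption inferred from "nine groups", not a list given in the article.
ARMS = [
    "placebo",
    "bima10", "bima30",
    "sema1.0", "sema2.4",
    "bima10+sema1.0", "bima10+sema2.4",
    "bima30+sema1.0", "bima30+sema2.4",
]


def make_allocator(arms, seed=0):
    """Return assign(stratum) implementing permuted-block randomization.

    Each stratum draws from its own shuffled block containing every arm once,
    so allocation stays balanced 1:1:...:1 within each stratum.
    """
    rng = random.Random(seed)
    pending = defaultdict(list)  # stratum -> unused assignments in current block

    def assign(stratum):
        if not pending[stratum]:
            block = list(arms)
            rng.shuffle(block)
            pending[stratum] = block
        return pending[stratum].pop()

    return assign


if __name__ == "__main__":
    assign = make_allocator(ARMS, seed=42)
    counts = defaultdict(int)
    for i in range(507):
        sex = "F" if i % 2 == 0 else "M"  # hypothetical enrollment order
        counts[assign(sex)] += 1
    for arm in ARMS:
        print(arm, counts[arm])
```

Keeping a separate block per stratum is what prevents one sex from being over-represented in any single arm, which is the purpose of stratifying.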
Most weight reduction with caloric restriction, including with pharmacotherapy, is attributable to reduction in body fat mass; the remainder (approximately 25–40%) can be attributed to lean tissues, including skeletal muscle and visceral organs4,5,6. Patients with obesity who are at risk for low muscle mass, affecting both physical and metabolic function, may benefit from treatments that maximize fat mass reduction while preserving skeletal muscle7. Bimagrumab is a fully human, recombinant monoclonal antibody that targets activin type II receptors (ActRIIA and ActRIIB), preventing binding of natural ligands. Due to its inhibition of myostatin and activin A signaling via the ActRII−activin receptor-like kinase 4 (ALK4) pathway, which leads to anabolic effects in skeletal muscle, bimagrumab was initially developed to treat muscle-related disorders8. More recently, activin signaling via the ActRII−activin receptor-like kinase 7 (ALK7) pathway in adipose tissue has been recognized as an important regulator of abdominal obesity, based on exome-wide sequencing analysis indicating the importance of the INHBE (inhibin subunit beta E) gene that encodes activin E and the ACVR1C (activin A receptor type 1C) gene that encodes ALK7 (refs. By blocking ligand signaling in adipose tissue, bimagrumab increases lipid mobilization and lipolysis, leading to reduction in fat mass11. In a phase 2, 48-week study in adults with obesity and type 2 diabetes, treatment with bimagrumab significantly reduced total body fat mass, VAT and intrahepatic fat while increasing lean mass and lowering glycated hemoglobin (HbA1c)12. The study provided evidence that uncoupling of fat and lean mass loss is feasible during weight reduction. Incretin-based therapies reduce body weight primarily by targeting the central mechanisms that regulate energy balance, thereby decreasing appetite and food intake13. 
Conversely, bimagrumab does not appear to affect food intake but primarily targets activin signaling in adipose tissue and skeletal muscle directly, leading to fat mass reduction and muscle growth8,11,12. In preclinical animal models with diet-induced obesity, combining bimagrumab with incretins such as semaglutide or tirzepatide resulted in enhanced fat loss and preservation of muscle mass14,15. Treatment with bimagrumab also prevented decreases in thigh muscle volume (assessed by magnetic resonance imaging (MRI)) in a study of low dietary protein intake in healthy volunteers16. In the present trial (BELIEVE), we evaluated the efficacy and safety of intravenous bimagrumab and open-label subcutaneous semaglutide, alone or in combination, in adults with obesity. From 16 November 2022 to 16 May 2024, 730 participants were screened for eligibility. Overall, 377 (74.4%) participants completed the primary treatment period at week 48. Treatment discontinuations due to adverse events were higher in the bimagrumab groups (14.0–21.4%) than in the semaglutide (3.6–8.8%), combination (5.3–12.5%) and placebo (3.6%) groups. During the extension period, 25 (7.1%) participants discontinued treatment (Fig. *12 participants met eligibility criteria but were not randomized, and three participants did not meet all eligibility criteria but were randomized. a, Participant disposition during primary treatment period (to week 48). b, Participant disposition during extension treatment period (weeks 48 to 72). Demographic and clinical baseline characteristics were largely similar across treatment groups; most participants were female (57.4%) and White (75.1%) (Table 1). Mean values for the trial were as follows: age 47.5 years, body weight 107.5 kg, body mass index (BMI) 37.3 kg m−2, waist circumference 118.1 cm, total body fat mass (by dual-energy X-ray absorptiometry (DXA)) 45.8 kg and total body lean mass (by DXA) 58.3 kg (Table 1). 
For the efficacy results at week 48, nominal P values versus placebo and versus semaglutide 2.4 mg are reported in Table 2 and Fig. P values for comparisons with placebo and semaglutide 2.4 mg were calculated using two-sided t-tests without multiplicity adjustment. a−e, The LSM percent or absolute changes from baseline to week 48 in efficacy endpoints are based on an MMRM analysis for the efficacy estimand and an ANCOVA model with multiple imputation for the treatment regimen estimand. f, The LSM percent changes in hsCRP from baseline to week 48 are based on MMRM analysis using log transformation for the efficacy estimand. ANCOVA, analysis of covariance; hsCRP, High-sensitivity C-Reactive Protein; LSM, least-squares mean; MMRM, mixed model for repeated measures; VAT, visceral adipose tissue. The LSM change in absolute body weight was greater with the high-dose combination versus semaglutide 2.4 mg (−17.8 kg versus −14.2 kg; P < 0.05). By week 48, ≥15% weight reduction was achieved in 23.3% (bimagrumab 30 mg kg−1), 43.4% (semaglutide 2.4 mg) and 63.9% (high-dose combination) of participants (Extended Data Fig. The LSM percent change in body weight at week 48 was −5.0% to −9.7% (bimagrumab), −11.0% to −14.8% (semaglutide) and −14.3% to −20.2% (combination) versus −2.5% (placebo) (P < 0.001 for the high-dose combination versus semaglutide 2.4 mg; Fig. By week 72, ≥15% weight reduction was achieved in 21.8% (bimagrumab 30 mg kg−1), 51.8% (semaglutide 2.4 mg) and 84.9% (high-dose combination) of participants (Extended Data Fig. Data are presented as LSM change from baseline ± standard error. n represents the number of participants with baseline and post-baseline values at week 72. a−e, The LSM percent or absolute changes from baseline at week 72 in efficacy endpoints are based on MMRM analysis for the efficacy estimand. f, The LSM percent changes in hsCRP from baseline to week 72 are based on MMRM analysis using log transformation for the efficacy estimand. 
hsCRP, high-sensitivity C-reactive protein; LSM, least-squares mean; MMRM, mixed model for repeated measures; VAT, visceral adipose tissue. Results for improvements in waist-to-height ratio categories at week 48 are provided in Extended Data Fig. The LSM percent reduction in total body fat mass at week 48 was −13.0% to −18.9% (bimagrumab), −16.0% to −21.1% (semaglutide) and −24.7% to −33.7% (combination) compared to −5.6% (placebo) (P < 0.001 for the high-dose combination versus semaglutide 2.4 mg; Fig. Mean percent body fat decreased from 43.7% at baseline to 32.5% at week 48 with high-dose combination compared to 41.9% to 37.9% with semaglutide 2.4 mg. By week 48, fat mass reduction ≥25% was achieved in 30.8% (bimagrumab 30 mg kg−1), 36.3% (semaglutide 2.4 mg) and 73.6% (high-dose combination) of participants (Extended Data Fig. The LSM percent changes in total body lean mass at week 48 were +1.0% to +1.1% (bimagrumab), −4.7% to −6.9% (semaglutide) and −0.8% to −2.3% (combination) versus −0.9% (placebo) (P < 0.001 for all combination groups versus semaglutide 2.4 mg; Fig. At week 48, the proportion of weight loss due to fat mass (fat loss index) was 100% (bimagrumab 30 mg kg−1), 71.1% (semaglutide 2.4 mg) and 92.3% (high-dose combination). Results for appendicular lean mass at week 48 are presented in Table 2. The percent LSM changes in estimated VAT were −15.7% to −26.7% (bimagrumab), −19.3% to −24.5% (semaglutide) and −33.1% to −43.8% (combination) versus −3.3% (placebo) (P < 0.01 for the high-dose combination versus semaglutide 2.4 mg; Fig. At week 72, LSM percent reductions in total body fat mass were −28.5% (bimagrumab 30 mg kg−1), −27.8% (semaglutide 2.4 mg) and −45.7% (high-dose combination) (Fig. By week 72, fat mass reduction ≥30% was achieved in 50.0% (bimagrumab 30 mg kg−1), 36.4% (semaglutide 2.4 mg) and 94.0% (high-dose combination) of participants (Extended Data Fig. 
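The fat loss index reported above is described as the proportion of weight loss due to fat mass. A minimal sketch of that ratio follows. The −14.2 kg weight change is the reported week-48 value for semaglutide 2.4 mg, but the −10.1 kg fat-mass change is an illustrative number chosen for the example, not a figure from the article, and the function name is hypothetical.

```python
def fat_loss_index(delta_fat_kg: float, delta_weight_kg: float) -> float:
    """Percentage of the total weight change accounted for by fat-mass change."""
    if delta_weight_kg == 0:
        raise ValueError("no weight change; index is undefined")
    return 100.0 * delta_fat_kg / delta_weight_kg


if __name__ == "__main__":
    # -14.2 kg: reported week-48 weight change for semaglutide 2.4 mg.
    # -10.1 kg: illustrative fat-mass change, NOT a value from the article.
    index = fat_loss_index(-10.1, -14.2)
    print(f"fat loss index: {index:.1f}%")
```

An index near 100% means lean mass was essentially preserved, while lower values mean a larger share of the lost weight was lean tissue.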
The LSM percent changes in total body lean mass at week 48 were +2.3% to +2.7% (bimagrumab), −5.3% to −7.9% (semaglutide) and −1.1% to −2.6% (combination) versus −0.5% (placebo) (P < 0.001 for all combination groups versus semaglutide 2.4 mg; Fig. At week 72, LSM changes in total body lean mass were +2.5% (bimagrumab 30 mg kg−1), −7.4% (semaglutide 2.4 mg) and −2.9% (high-dose combination) (Fig. Results for appendicular lean mass at week 72 are presented in Extended Data Table 1. At week 72, the proportion of weight loss due to fat mass was 100% (bimagrumab 30 mg kg−1), 75.6% (semaglutide 2.4 mg) and 92.2% (high-dose combination) (Extended Data Table 1). At week 48, the percent LSM changes in estimated VAT were −23.0% to −40.2% (bimagrumab), −21.5% to −29.5% (semaglutide) and −41.0% to −54.8% (combination) versus −2.1% (placebo) (P < 0.01 for all combination groups versus semaglutide 2.4 mg; Fig. At week 72, reductions in estimated VAT were −45.1% (bimagrumab 30 mg kg−1), −35.8% (semaglutide 2.4 mg) and −58.2% (high-dose combination) (Fig. HbA1c levels improved in semaglutide and combination groups at week 48 (Table 2; treatment regimen estimand). The LSM decreases in HbA1c levels at week 72 were −0.23% (bimagrumab 30 mg kg−1), −0.40% (semaglutide 2.4 mg) and −0.55% (high-dose combination) (Extended Data Table 1; efficacy estimand). Among participants with HbA1c ≥ 5.7% at baseline (prediabetes), normoglycemia (defined as HbA1c < 5.7%) at week 48 was achieved in 19 of 29 participants (66%, bimagrumab groups), in 23 of 27 participants (85%, semaglutide groups) and in 44 of 45 participants (98%, combination groups), compared to six of 15 participants (40%) in the placebo group (Table 2; efficacy estimand). 
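The week-48 normoglycemia percentages quoted above follow directly from the participant counts. A short sketch, using only the counts given in the text and a hypothetical helper name, reproduces them by rounding to the nearest whole percent.

```python
# Week-48 counts from the text: (participants reaching HbA1c < 5.7%, participants
# with HbA1c >= 5.7% at baseline), pooled by treatment type as in the article.
WEEK48 = {
    "bimagrumab": (19, 29),
    "semaglutide": (23, 27),
    "combination": (44, 45),
    "placebo": (6, 15),
}


def percent(responders: int, total: int) -> int:
    """Responder rate rounded to the nearest whole percent."""
    return round(100 * responders / total)


if __name__ == "__main__":
    for group, (k, n) in WEEK48.items():
        print(f"{group}: {k}/{n} = {percent(k, n)}%")
```

The denominators are small (15 to 45 participants), so a difference of a few participants shifts these percentages noticeably.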
At week 72, normoglycemia was achieved in 22 of 29 participants (76%, bimagrumab groups), in 26 of 27 participants (96%, semaglutide groups) and in 45 of 45 participants (100%, combination groups) compared to eight of 15 participants (53%) in the placebo group (Extended Data Table 1; efficacy estimand). Improvements in 36-Item Short Form Health Survey (SF-36) Physical Functioning score were similar across treatment groups at week 48 (P value not significant versus placebo; Table 2). Combination groups containing bimagrumab 30 mg kg−1 showed greater improvements in Impact of Weight on Quality of Life-Lite Clinical Trials Version (IWQOL-Lite-CT) Physical Function scores at week 48 compared to the placebo group (P < 0.05; Table 2). At week 72, improvements in SF-36 Physical Functioning scores and IWQOL-Lite-CT Physical Function scores were greater in the high-dose combination group than in the remaining groups (Extended Data Table 1). Overall, safety results during the primary treatment period were consistent with known safety profiles of the two drugs. The incidence of treatment-emergent adverse events during the primary treatment period was similar among active drug treatment groups (91.1−98.2%) and greater than placebo (74.5%) (Table 3). Common adverse events included muscle spasms (commonly, muscle cramps), diarrhea and acne with bimagrumab and nausea, diarrhea, constipation and fatigue with semaglutide, with similar events in the combination groups. All treatment discontinuations due to nausea (N = 6) occurred in the combination groups, and treatment discontinuations due to muscle spasms (N = 5) occurred in the bimagrumab monotherapy groups (Table 3). Four discontinuations were due to acne: two in the bimagrumab monotherapy groups and two in the combination groups. 
Thirteen participants had severe (grade 3) gastrointestinal-related events, including three with pancreatitis serious adverse events (one each in placebo, bimagrumab 10 mg kg−1 and semaglutide 1.0 mg groups). One participant had severe acne and five had severe muscle-related events (muscle spasms and back pain) in bimagrumab-containing groups (Table 3). Four participants reported basal or squamous cell skin carcinoma (all in bimagrumab-only or semaglutide-only groups); no other malignancies were reported (Table 3). There were no new safety signals during weeks 48–72 (Extended Data Table 2). No clinically relevant changes in hematologic or renal parameters were observed. Mean magnesium levels decreased in bimagrumab-containing groups but remained within the normal range across treatment groups. Bimagrumab-containing groups showed mean increases in alkaline phosphatase (ALP) and creatine kinase, transient increases in alanine aminotransferase (ALT) (Extended Data Fig. Serum lipase increased transiently with bimagrumab but increased and remained elevated with semaglutide treatment (Extended Data Fig. The mean reduction in diastolic blood pressure (DBP) was greater in the high-dose combination group versus the placebo and semaglutide 2.4 mg groups at week 48 (−6.7 mmHg versus −2.8 mmHg and −3.4 mmHg, respectively; P < 0.05) (Table 2). The LSM percent changes in total body and lumbar spine bone mineral density (BMD) were ≤1.1% in all groups at week 48 (Table 2). The LSM percent decreases in total hip BMD were significantly greater in the semaglutide 2.4 mg (−2.1%) group, the bimagrumab 10 mg kg−1 plus semaglutide 1 mg (−2.0%) group and the two combination groups containing bimagrumab 30 mg kg−1 (−2.2% to −2.3%) versus placebo (−0.8%) (P < 0.05). The LSM percent changes from baseline in femoral neck BMD were not significantly different in the treatment groups versus placebo (Table 2). 
Changes in these BMD outcomes at week 72 were similar, with greater decreases in total hip and/or femoral neck BMD than in total body or lumbar spine BMD across groups (Extended Data Table 1). Total and low-density lipoprotein (LDL) cholesterol levels increased in the first 12 weeks in the bimagrumab-containing groups and then decreased toward baseline in the combination groups containing semaglutide 2.4 mg while remaining above baseline in the bimagrumab-only groups and the combination groups containing semaglutide 1.0 mg (Table 2, Extended Data Table 1 and Extended Data Fig. By contrast, increases in high-density lipoprotein (HDL) cholesterol and decreases in triglyceride levels were similar in the combination and semaglutide-only groups at weeks 48 and 72 (Table 2, Extended Data Table 1 and Extended Data Fig. At week 72, LSM percent changes in LDL cholesterol were 17.6% (bimagrumab 30 mg kg−1), −8.9% (semaglutide 2.4 mg) and 0.1% (high-dose combination) (Extended Data Fig. For triglycerides, these were −1.2% (bimagrumab 30 mg kg−1), −20.8% (semaglutide 2.4 mg) and −25.3% (high-dose combination) (Extended Data Fig. The LSM percent reductions in high-sensitivity C-reactive protein (hsCRP) at week 48 were −52.9% to −69.0% (bimagrumab), −54.5% to −55.4% (semaglutide) and −71.5% to −83.1% (combination) versus −15.6% (placebo group) (Fig. At week 72, the LSM percent reductions in hsCRP were −72.0% (bimagrumab 30 mg kg−1), −59.3% (semaglutide 2.4 mg) and −84.0% (high-dose combination) (Fig. At week 72, the LSM decreases in fasting insulin were greater in the high-dose combination group (−34.4 pmol l−1) versus bimagrumab 30 mg kg−1 (−26.3 pmol l−1) and semaglutide 2.4 mg (−27.6 pmol l−1) groups (Extended Data Table 1). The LSM changes in free testosterone levels are presented in Table 2 (week 48) and Extended Data Table 1 (week 72). 
The bimagrumab 30 mg kg−1 plus semaglutide 1.0 mg group showed the greatest increase in grip strength among treatment groups at week 48 (4.8 kg; P < 0.05 versus placebo (1.7 kg)) (Table 2); all remaining groups were similar to placebo. There was a trend for increased grip strength in the bimagrumab monotherapy groups that was not significantly different compared to semaglutide 2.4 mg at week 48. Results at week 72 are presented in Extended Data Table 1. At week 24, the median change in total calories (kcal d−1) in the higher dose groups was −182.0 (bimagrumab 30 mg kg−1), −482.5 (semaglutide 2.4 mg) and −487.0 (high-dose combination) versus −238.5 (placebo). At week 48, participants in the semaglutide 2.4 mg and placebo groups had a greater median reduction in total caloric intake than those in the other groups (Table 2). The median increase in protein intake (as % total calories) was highest in the bimagrumab and placebo groups compared to the semaglutide and combination groups (Table 2). Obesity is a disease of excess adiposity, which can be confirmed by measurement of body fat by methods such as DXA17. In the treatment of obesity with caloric restriction (including lifestyle intervention, incretin-based therapies18,19 and bariatric surgery20), most weight loss is fat mass, with lean mass comprising approximately 25–40% of total weight loss. All three treatment approaches can induce substantial weight reduction in individuals with obesity, but the associated reduction in lean mass may attenuate the metabolic benefits of substantial weight loss and diminish physical function in those with low muscle mass. Obesity management therapies that preserve lean mass would be expected to cause less overall weight loss unless it was accompanied by increased fat mass reduction7. 
In this phase 2 trial, treatment with the combination of an activin pathway inhibitor (bimagrumab) plus an incretin (semaglutide) in adults with obesity achieved substantial weight loss by augmenting fat mass reduction while preserving lean mass. Although bimagrumab 30 mg kg−1 (10.8%) achieved numerically less weight reduction than semaglutide 2.4 mg (15.7%), weight reduction with the high-dose combination (22.1%) was greater than semaglutide 2.4 mg at week 72. Notably, the total body fat mass reduction achieved with bimagrumab 30 mg kg−1 (28.5%) was similar to semaglutide 2.4 mg (27.8%) at week 72 and resulted in nearly additive fat mass reduction with the high-dose combination (45.7%) owing to the distinct mechanisms of action of each drug. This reduction in fat mass achieved with the high-dose combination was in the range of results reported for bariatric surgery at similar timepoints21,22. Notably, a higher proportion of the weight reduction in each of the combination groups was due to fat mass loss versus in the semaglutide 2.4 mg group. Treatment with bimagrumab 30 mg kg−1 resulted in a small increase above baseline in lean mass and largely preserved lean mass in the combination groups compared to the greater reduction in lean mass observed with semaglutide 2.4 mg at weeks 48 and 72. Similar changes were observed for appendicular lean mass, a proxy measure of skeletal muscle mass23. The combination therapy resulted in preservation of lean mass despite achieving a greater reduction in fat mass, including intra-abdominal fat (VAT), supporting the premise that measures of body composition (and waist circumference) can be more informative regarding optimal obesity management than body weight or BMI. DXA measurements also provide information regarding BMD, with observed changes possibly related to reduced mechanical loading with weight loss. In the present trial, estimated VAT reduction was notable, particularly for bimagrumab-containing groups. 
The higher adiponectin levels in the bimagrumab-containing groups at week 48 are likely associated with effects of bimagrumab on adipose tissue and may have downstream effects on insulin sensitivity and inflammation. These outcomes could reflect a positive impact on inflammatory mechanisms underpinning many obesity complications and related diseases, including cardiovascular and metabolic diseases24. However, HbA1c lowering was similar or greater in the combination groups compared to semaglutide alone, suggesting an additive effect on glycemic control. Among participants with prediabetes (HbA1c ≥ 5.7%) at baseline, 100% reversion to normoglycemia was achieved only in the combination groups (except for the low-dose combination) by week 48. Elevated total and LDL cholesterol levels observed in bimagrumab-containing groups returned toward baseline in the combination groups containing semaglutide 2.4 mg but remained elevated in the bimagrumab-only groups by week 72. Total and HDL cholesterol (and derived non-HDL cholesterol) normalized in the high-dose combination group by week 48. HDL cholesterol and triglycerides improved relative to baseline in the combination and semaglutide-only groups by week 72; HDL cholesterol also improved in the bimagrumab 30 mg kg−1 group. The magnitude of lipid changes with bimagrumab may be explained in part by effects of high intravenous doses used in this study, with likely direct effects on lipid metabolism in adipose tissue and/or liver. Future analyses will assess the mechanism and durability of effects of treatment, including post-drug withdrawal, on insulin resistance, lipid metabolism and systemic inflammation as mechanistic determinants of potential cardiovascular benefits or risks. Adverse events related to bimagrumab and/or semaglutide were infrequent reasons for treatment discontinuation in combination groups. 
Muscle spasms (for example, muscle cramps) were the primary reason for discontinuation for five participants in the bimagrumab monotherapy groups but none in the combination groups. Muscle spasms have also been reported with blockade of the muscle ligands myostatin and activin A together and activin A alone25. The incidence of pancreatitis was infrequent and balanced across treatment groups. There were no instances of telangiectasias or hematologic abnormalities, as reported with activin receptor ligand traps26. Additional research is needed to understand the mechanisms underlying specific adverse events such as muscle spasms and acne. Strengths of this trial include a full factorial study design with two dose levels of each drug (including semaglutide 2.4 mg as approved for obesity); serial direct measurements of body composition using DXA in all participants; treatment extension period to week 72, allowing evaluation of continued weight and fat mass reduction and comparisons with published studies; and the post-treatment follow-up period (ongoing) evaluating weight loss maintenance and composition of weight regain. Limitations include the use of open-label semaglutide due to unavailability of matched placebo at trial initiation, potentially leading to early discontinuation of some participants who may have been disappointed not to receive semaglutide. Administering bimagrumab intravenously every 12 weeks, with additional loading doses at randomization (week 1) and week 4, may have contributed to early laboratory abnormalities and adverse events; subcutaneous dosing may attenuate these effects, as shown in a comparison study27. A phase 2 trial of bimagrumab and tirzepatide, alone or in combination, will evaluate the subcutaneous dosing of both drugs in adults with obesity or overweight (NCT06643728). 
Unlike previous bimagrumab studies that used MRI for direct assessment of skeletal muscle volume and intramuscular fat12,28,29,30,31, our study used DXA, where muscle mass was measured as part of lean mass. Future studies should include MRI to better characterize changes in muscle mass and quality along with physical function measures in populations at risk. Additionally, decreased hepatic fat fraction by MRI was reported in a previous bimagrumab study12; forthcoming studies could also evaluate changes in visceral and ectopic fat in various regions using MRI. The lack of significant improvement in patient-reported outcomes and grip strength in this study may be due to the broad population studied; specific subpopulations may be more responsive for these measures. Given the sample size for this study, analyses by age and gender would have resulted in subgroups that were too small to support reliable or meaningful conclusions, although these populations should be explored in future studies. These findings support further development of bimagrumab, alone or in combination with incretin therapy, to achieve optimal weight loss, with augmented reduction in adiposity and preserved lean mass, in people living with obesity. BELIEVE (NCT05616013) is a phase 2, multicenter, randomized, double-blind, placebo-controlled trial conducted at 26 sites in the United States, Australia and New Zealand. The trial adhered to the Declaration of Helsinki, Council for International Organizations of Medical Sciences international ethical guidelines and Good Clinical Practice guidelines. 
An independent ethics committee (New Zealand Southern Health and Disability Ethics Committee, New Zealand; Bellberry Limited Human Research Ethics Committee, Australia; and Austin Health Human Research Ethics Committee, Australia) or institutional review board (WCG Institutional Review Board, United States, and Pennington Biomedical Research Center Institutional Review Board, United States) for each site approved the protocol. All participants provided written informed consent before participation. The trial consisted of four sequential periods: a 6-week screening period (ends at randomization), a 48-week blinded primary treatment period, a 24-week open-label treatment extension period (through week 72) and a 32-week treatment-withdrawal follow-up period (end of study at week 104) (Extended Data Fig. We report results from the 48-week primary treatment period and the 24-week extension period (weeks 48−72) (November 2022−November 2024). Versanis Bio Inc., a wholly owned subsidiary of Eli Lilly and Company (sponsor), designed and oversaw the trial conduct. All authors contributed to data interpretation and authoring and/or critical review of the manuscript. The trial included adults (aged ≥18 years and ≤80 years) with obesity: BMI ≥30 kg m−2 or BMI ≥27 kg m−2 with at least one obesity-associated comorbidity (for example, hypertension, insulin resistance, sleep apnea or dyslipidemia). All participants maintained a stable body weight (±5 kg) within 90 days of screening, had body weight less than 150 kg and had at least one previous unsuccessful behavioral effort to lose weight. Key exclusion criteria included diagnosis of diabetes requiring current use of an antihyperglycemic drug or HbA1c ≥6.5%. A complete list of the eligibility criteria is provided below. Participants were eligible to be included in the study only if all of the following criteria applied: Written informed consent must be obtained before any study-related assessments are performed. 
Men and women between 18 years and 80 years of age, inclusive; women of childbearing potential (defined as those who are not postmenopausal or postsurgical sterilization) must meet both of the following criteria: Use of an intrauterine device, from ≥3 months before the baseline visit through ≥4 months after the last dose of bimagrumab/placebo intravenous, and an additional contraceptive (barrier) method from screening through ≥4 months after the last dose of bimagrumab/placebo intravenous. BMI ≥30 kg m−2 or BMI ≥27 kg m−2 with at least one obesity-associated comorbidity (for example, hypertension, insulin resistance, sleep apnea or dyslipidemia). Stable body weight (±5 kg) within 90 days of screening and body weight less than 150 kg. Have a history of at least one self-reported unsuccessful behavioral effort to lose body weight. Capable of using common software applications on a mobile device (smartphone). Access to an internet-enabled smartphone, tablet or computer for the duration of the study, meeting minimum operating system requirements. Use of other investigational drugs at the time of enrollment or within 30 days or five half-lives of enrollment, whichever is longer, or longer if required by local regulations. Not able or willing to comply with protocol requirements, including lifestyle interventions. Diseases known to cause cachexia or muscle atrophy or diseases known to cause gastrointestinal malabsorption. Use of any prescription drugs known to adversely affect muscle mass or body weight. Treatment with any medication for the indication of obesity within the past 30 days before screening. Previous or planned (during the trial period) obesity treatment with surgery or a weight loss device. However, the following are allowed: (1) liposuction and/or abdominoplasty, if performed more than 1 year before screening, and (2) lap banding, intragastric balloon or duodenal-jejunal bypass sleeve, if removed more than 1 year before screening. 
Patients with hypothyroidism treated with thyroid hormone replacement therapy must be on a stable dose for at least 6 weeks prior to screening. Diagnosis of diabetes, requiring current use of any antidiabetic drug or HbA1c ≥6.5% Note: Metabolic syndrome is not an exclusion, even if managed with an antidiabetic drug such as metformin or a sodium-glucose co-transporter 2 inhibitor. A diagnosis of prediabetes or impaired glucose tolerance managed exclusively with non-pharmacologic approaches (for example, diet and exercise) is not an exclusion. History of malignancy of any organ system, treated or untreated within the past 5 years, regardless of whether there was evidence of local recurrence or metastases, except non-melanoma skin cancer treated only with local therapy— specifically, multiple endocrine neoplasia type 2 or a personal or family history of medullary thyroid cancer or known elevation of blood calcitonin higher than 50 ng l−1. Known heart failure classified as New York Heart Association class III and IV or a history of chronic hypotension (SBP <100 mmHg or DBP <50 mmHg). Electrocardiogram showing clinically significant abnormalities or any history of resuscitated cardiac arrest or presence of an automated internal cardioverter-defibrillator. Prolonged QT syndrome or QTcF > 450 ms (Fridericia correction) for males and QTcF >470 ms for females at screening. History of unstable angina, myocardial infarction, coronary artery bypass graft surgery or percutaneous coronary intervention (such as angioplasty or stent placement) within 180 days of screening. History or presence of significant coagulopathy—for example, prothrombin time/international normalized ratio (PT/INR) >1.5. Liver injury as indicated by abnormal liver function tests, such as AST, ALT, GGT, ALP or serum bilirubin: Total bilirubin concentration increased above 1.5× ULN (except for cases of known Gilbert syndrome). Any chronic infections likely to interfere with study conduct or interpretation. 
Donation or loss of 400 ml or more of blood within 8 weeks prior to initial dosing, or longer if required by local regulations, or plasma donation (>250 ml) within 14 days prior to the first dose. Smoking more than one pack of cigarettes daily. Drinking five or more alcoholic beverages on each of five or more days in the past 30 days. Using cannabis more than twice weekly. Any use of heroin, cocaine, etc. Randomization was stratified by sex across the treatment groups. Bimagrumab or matching placebo was administered by 30-minute intravenous infusion at the clinical trial sites. Loading doses for bimagrumab or placebo were administered at randomization (week 1) and week 4, followed by dosing every 12 weeks (weeks 16, 28, 40, 52 and 64). After week 48, with the start of the open-label treatment extension period, group 1 (placebo) and group 2 (bimagrumab 10 mg kg−1) switched to receive bimagrumab 30 mg kg−1 every 12 weeks without the loading dose. The participant, investigator and sponsor were blinded to bimagrumab dose or placebo−bimagrumab until database lock to avoid bias in reporting adverse events and efficacy. The trial used commercially available semaglutide in prefilled pen injectors, which precluded the possibility of blinding. Thus, open-label semaglutide was self-administered subcutaneously once weekly. Participants had monthly counseling sessions to follow a diet with a daily deficit of approximately 500 kcal and ≥1.2 g kg−1 d−1 of protein and to engage in at least 150 minutes of physical activity weekly. Grip strength was measured using the Jamar Plus Digital Hand Dynamometer, with participants seated and using their dominant hand. Each participant had one practice trial before the recorded official measurement. The investigator and study staff were trained to use the dynamometer, and the same staff member conducted all assessments for a given participant. The primary endpoint was absolute change from baseline in body weight at week 48. 
The secondary efficacy endpoints included here are as follows: percent change in body weight at week 48; absolute and percent change in body weight at week 72; absolute and percent changes at weeks 48 and 72 in total body fat and lean mass, appendicular lean mass and estimated VAT as assessed by DXA; absolute changes in waist circumference at weeks 48 and 72; proportion of participants who achieved body weight reduction thresholds and fat mass reduction thresholds at weeks 48 and 72; percentage of weight loss due to fat mass or lean mass at weeks 48 and 72; proportion of participants in waist-to-height ratio categories at week 48; and absolute changes in HbA1c and patient-reported outcomes (SF-36 Physical Functioning score and IWQOL-Lite-CT Physical Function score) at weeks 48 and 72. Safety assessments included treatment-emergent adverse events, serious adverse events and changes in vital signs and laboratory assessments according to the protocol. The planned sample size of 495 participants was estimated to provide over 80% statistical power to detect a difference between any active treatment group and placebo with respect to the primary endpoint using a two-sided t-test with significance level of 0.05. Efficacy endpoints were analyzed using data from all randomized participants. Safety endpoints were analyzed using data from participants who received at least one dose of study treatment. All statistical tests were performed using a two-sided 5% significance level, with corresponding 95% confidence intervals. A preplanned unblinded interim analysis was conducted when approximately 80% of participants completed the week 24 visit or prematurely discontinued the study treatment. An independent team evaluated the efficacy and safety profile of monotherapy and combination groups during this interim analysis for internal decision-making. All week 48 analyses were prespecified in the statistical analysis plan, and all week 72 analyses were considered post hoc. 
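The sample-size statement above can be illustrated with a short calculation. The sketch below is not the sponsor's actual power analysis: it replaces the t-test with a normal approximation, assumes equal allocation of the 495 planned participants across the nine treatment groups (55 per group), and works in standardized effect-size units, because the assumed mean difference and standard deviation are not reported in this section. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample comparison.

    Uses a normal approximation to the t-test; effect_size is the
    standardized mean difference (difference divided by the common SD).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of rejecting in either tail under the alternative.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def min_detectable_effect(n_per_group: int, power: float = 0.80, alpha: float = 0.05) -> float:
    """Smallest standardized difference detectable at the requested power
    (ignores the negligible contribution of the opposite tail)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / sqrt(n_per_group / 2)


# Assumption: equal allocation of the planned sample across nine groups.
N_PER_GROUP = 495 // 9  # 55 participants per group

mde = min_detectable_effect(N_PER_GROUP)
print(f"n per group: {N_PER_GROUP}")
print(f"standardized difference detectable at 80% power: {mde:.2f}")
print(f"power at that difference: {two_sample_power(mde, N_PER_GROUP):.3f}")
```

Under these assumptions, a pairwise comparison of one active group with placebo reaches 80% power for a standardized difference of a little over half a standard deviation. An exact t-based calculation would give a slightly larger value.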
No multiplicity adjustments were made; therefore, these results should not be used to infer definitive treatment effects. Two estimands (treatment regimen estimand and efficacy estimand), based on the ICH E9(R1) guidance32, were used to assess treatment efficacy from different perspectives and accounted for intercurrent events differently. Both estimands were used for the primary treatment period analysis for primary and secondary endpoints, unless specified otherwise. This estimand is used to assess the average treatment effect of bimagrumab, semaglutide or bimagrumab plus semaglutide for all randomized participants at week 48, regardless of treatment adherence and/or premature discontinuation of study treatment or placebo. For the analyses of this estimand, missing values (unobserved due to patient loss to follow-up or other reasons) were assumed to be missing at random and were handled by multiple imputation using observed data in the placebo group. Due to insufficient retrieved dropout data (that is, data from participants who discontinued treatment but remained in the study), a control-based imputation approach using the placebo group was selected as a more conservative strategy. Continuous endpoints were analyzed using the analysis of covariance (ANCOVA) model, and categorical endpoints were analyzed by logistic regression. Both models included treatment group, gender and country as fixed effects and baseline value as covariate. The analyses were conducted with multiple imputation of missing values at week 48 and statistical inference over multiple imputation of missing data guided by Rubin33. This estimand is used to assess the average treatment effect of bimagrumab, semaglutide or bimagrumab plus semaglutide for all randomized participants at weeks 48 and 72 had they received at least one dose of study treatment, adhered to protocol-defined treatment and did not discontinue treatment prematurely. 
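For readers unfamiliar with the pooling step cited to Rubin in the treatment regimen estimand analysis, the following minimal sketch shows how per-imputation estimates are combined. It is an illustration of Rubin's rules in general, not the trial's analysis code; the function name and the numbers in the usage example are hypothetical, and the number of imputations used in the trial is not stated in this section.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine results from m imputed datasets using Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    if between == 0:
        df = float("inf")            # no between-imputation uncertainty
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return q_bar, total, df


# Hypothetical example: three imputed estimates of a treatment difference (kg).
pooled, total_var, df = pool_rubin([-10.0, -11.0, -12.0], [1.0, 1.0, 1.0])
print(pooled, round(total_var, 3), round(df, 3))
```

The total variance is larger than the average within-imputation variance because it adds the spread between imputations. That added term is how the uncertainty due to missing data enters the confidence intervals.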
Data after intercurrent events (for example, permanent treatment discontinuation) were excluded from analysis. Continuous endpoints were analyzed using a mixed model for repeated measures (MMRM), and missing values were implicitly handled by MMRM under the assumption of missing at random. The MMRM includes treatment group, gender, country, visit and visit-by-treatment as fixed effects and baseline value as covariate. A logistic regression model with treatment group, gender and country as fixed effects and baseline value as covariate was used for categorical outcomes. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article. Eli Lilly and Company provides access to all individual participant data collected during the trial, after anonymization, with the exception of pharmacokinetic or genetic data. Data are available upon reasonable request 6 months after the indication studied has been approved in the United States and the European Union and after primary publication acceptance, whichever is later. Access is provided after a proposal has been approved by an independent review committee identified for this purpose and after receipt of a signed data-sharing agreement. Data and documents, including the trial protocol, statistical analysis plan, clinical study report and blank or annotated case report forms, will be provided in a secure data-sharing environment. For details on submitting a request, see the instructions provided at https://vivli.org/. Source data are provided with this paper. Powell-Wiley, T. M. et al. Obesity and cardiovascular disease: a scientific statement from the American Heart Association. Blüher, M. Obesity: global epidemiology and pathogenesis. Wilding, J. P. H. et al. Once-weekly semaglutide in adults with overweight or obesity. Jastreboff, A. M. et al. Tirzepatide once weekly for the treatment of obesity. Heymsfield, S. B., Gonzalez, M. C., Shen, W., Redman, L. & Thomas, D. 
Weight loss composition is one-fourth fat-free mass: a critical review and critique of this widely cited rule. Stefanakis, K., Kokkorakis, M. & Mantzoros, C. S. The impact of weight loss on fat-free mass, muscle, bone and hematopoiesis health: Implications for emerging pharmacotherapies aiming at fat reduction and lean mass preservation. Lach-Trifilieff, E. et al. An antibody blocking activin type II receptors induces strong skeletal muscle hypertrophy and protects from atrophy. Deaton, A. M. et al. Rare loss of function variants in the hepatokine gene INHBE protect from abdominal obesity. Akbari, P. et al. Multiancestry exome sequencing reveals INHBE mutations associated with favorable fat distribution and protection from diabetes. Garito, T. et al. Bimagrumab improves body composition and insulin sensitivity in insulin-resistant individuals. Heymsfield, S. B. et al. Effect of bimagrumab vs placebo on body fat mass among adults with type 2 diabetes and obesity: a phase 2 randomized clinical trial. The incretin/glucagon system as a target for pharmacotherapy of obesity. Nguyen, K., Wang, X., Xu, D. & Klickstein, L. OR10-03 Murine bimagrumab co-administration with incretin agonists results in additive efficacy and superior quality weight loss in the mouse diet-induced obesity model. Nunn, E. et al. Antibody blockade of activin type II receptors preserves skeletal muscle mass and enhances fat loss during GLP-1 receptor agonism. Coleman, L. et al. Bimagrumab prevents muscle loss associated with low dietary protein intake in healthy volunteers or with weight loss in obesity. Definition and diagnostic criteria of clinical obesity. Mechanick, J. I. et al. Strategies for minimizing muscle loss during use of incretin-mimetic drugs for treatment of obesity. Beavers, K. M. et al. GLP1Ra-based therapies and DXA-acquired musculoskeletal health outcomes: a focused meta-analysis of placebo-controlled trials. Carter, J. et al. 
American Society for Metabolic and Bariatric Surgery review of body composition. Body fat mass and distribution as predictors of metabolic outcome and weight loss after Roux-en-Y gastric bypass. Schneider, J. et al. Laparoscopic sleeve gastrectomy and Roux-en-Y gastric bypass lead to equal changes in body composition and energy metabolism 17 months postoperatively: a prospective randomized trial. McCarthy, C. et al. Total and regional appendicular skeletal muscle mass prediction from dual-energy X-ray absorptiometry body composition models. Lempesis, I. G. & Georgakopoulou, V. E. Physiopathological mechanisms related to inflammation in obesity and type 2 diabetes mellitus. Gonzalez Trotter, D. et al. GDF8 and activin A are the key negative regulators of muscle mass in postmenopausal females: a randomized phase I trial. Petricoul, O. et al. Pharmacokinetics and pharmacodynamics of bimagrumab (BYM338). Treatment of sarcopenia with bimagrumab: results from a phase II, randomized, controlled, proof-of-concept study. Rooks, D. S. et al. Effect of bimagrumab on thigh muscle volume and composition in men with casting-induced atrophy. Hofbauer, L. C. et al. Bimagrumab to improve recovery after hip fracture in older adults: a multicentre, double-blind, randomised, parallel-group, placebo-controlled, phase 2a/b trial. Rooks, D. et al. Bimagrumab vs optimized standard of care for treatment of sarcopenia in community-dwelling older adults: a randomized clinical trial. Multiple Imputation for Nonresponse in Surveys (Wiley & Sons, 1987); https://onlinelibrary.wiley.com/doi/book/10.1002/9780470316696 We thank M. Pruzanski, formerly with Versanis Bio, for his contributions to the trial design and conduct. 
We thank the contract research organizations—Harvest Integrated Research Organization (HiRO; formerly PharmaSols), New Zealand; ABio Clinical Research Partners LLC, Richmond, VA, USA; McCloud Consulting Group, Sydney, Australia; My Medical Department, Queensland, Australia; Parexel PVSG, Durham, NC, USA; and Calyx Medical Imaging, Irvine, CA, USA—for site monitoring, data collation and data analysis. Versanis Bio Inc., a wholly owned subsidiary of Eli Lilly and Company (sponsor), designed and oversaw the trial conduct and partially funded the study before its acquisition by Lilly. A list of authors and their affiliations appears at the end of the paper. Pennington Biomedical Research Center, Louisiana State University, Baton Rouge, LA, USA Weill Cornell Medicine, New York, NY, USA Optimal Clinical Trials, Auckland, New Zealand Eli Lilly and Company, Indianapolis, IN, USA Laura A. Coleman, Kiran Dole, Xingyuan Li & Kenneth M. Attie Applied Statistics and Consulting, Spruce Pine, NC, USA : Conception and design of the work; acquisition of data for the work; analysis and interpretation of data for the work; and critical revision of the work for important intellectual content. : Conception and design of the work; analysis and interpretation of data for the work; and critical revision of the work for important intellectual content. : Analysis and interpretation of data for the work; drafting of the work; and critical revision of the work for important intellectual content. 
: Conception and design of the work; analysis and interpretation of data for the work; drafting of the work; and critical revision of the work for important intellectual content. : Analysis and interpretation of data for the work and drafting of the work. : Conception and design of the work; analysis and interpretation of data for the work; drafting of the work; and critical revision of the work for important intellectual content. has a contract with Eli Lilly and Company for clinical trials (institutional support). He has received honoraria for serving on the medical advisory boards of Tanita Corporation, Novo Nordisk, Lilly, Regeneron, Abbott and Medifast. He is also on the Data Safety Monitoring Committee for Novo Nordisk. has received research funding from Lilly, Novo Nordisk, Altimmune and Skye Bioscience. He has received payments or honoraria from Boehringer Ingelheim for his role as a consultant/advisory board member as well as from Skye Bioscience, Zealand Pharmaceuticals, Jamieson Wellness and Pfizer for lectures. He has received support for attending meetings and/or travel from Jamieson Wellness for his role as a consultant/advisory board member. He has patents pending with FlyteHealth and has served on the board of directors for FlyteHealth, Jamieson Wellness and ERX Pharmaceuticals. He holds equity interests in Jamieson Wellness, FlyteHealth, Kallyope, Mediflix, Metsera, MBX Bioscience, Syntis, Veru Pharmaceuticals and Skye Bioscience. is a former employee and shareholder of Versanis Bio and a former employee of Lilly. He is an inventor or co-inventor on the following patents assigned to Versanis Bio: US20240368291A1 (ActRII antibody treatments), WO2024044782A1 (ActRII antibody fixed-dose treatments) and US20240325530A1 (combination therapies). is an employee and shareholder of Lilly. She is a former employee of Versanis Bio with equity holdings. She also has a pending patent (PAT058683-US-PSP). is an employee and shareholder of Lilly. 
She is also a former employee of Versanis Bio with equity holdings.
is a former consultant to Versanis Bio with equity holdings. She is now a consultant to Lilly.
S.S. was a former consultant to Versanis Bio and is now a consultant to Lilly.
is an employee and stockholder of Lilly.
is an employee and shareholder of Lilly. He is also a former employee of Versanis Bio with equity holdings.
Nature Medicine thanks Rhonda Bacher, W. Garvey and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ashley Castellanos-Jankiewicz, in collaboration with the Nature Medicine team.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
BELIEVE is a phase 2, multicenter, randomized, double-blind, placebo-controlled trial evaluating the efficacy and safety of intravenous bimagrumab and open-label subcutaneous semaglutide, alone or in combination, in adults with obesity or overweight plus at least one obesity-associated comorbidity. The trial includes a 48-week primary treatment period, blinded to bimagrumab study drug, followed by a 24-week open-label treatment extension period, and a subsequent 32-week follow-up period after treatment withdrawal. Participants were randomly assigned (1:1:1:1:1:1:1:1:1 ratio) to one of the nine treatment groups using a centralized interactive web randomization system. Stratification across the treatment groups was based on gender. After Week 48, participants in treatment groups 1 (placebo) and 2 (bimagrumab 10 mg/kg) switched to bimagrumab 30 mg/kg, while all other groups continued their original treatments without the placebo infusions during the 24-week open-label extension period. Abbreviations: DXA, dual-energy X-ray absorptiometry; N, number of randomized participants.
Panel a: The percentage of participants reaching weight-reduction thresholds at Week 48 is based on logistic regression with multiple imputation using a treatment-regimen estimand. N represents number of participants with baseline value. Panel b: The percentage of participants reaching weight-reduction thresholds at Week 72 is calculated using logistic regression with mixed model repeated measures for missing data imputation using an efficacy estimand. N represents number of participants based on imputed data.
N = number of participants with measurement at the timepoint.
Panel a: The proportion of participants achieving total body fat mass reduction thresholds at Week 48 by dual-energy X-ray absorptiometry (DXA) is based on logistic regression with multiple imputation using a treatment-regimen estimand. N represents number of participants with baseline value. Panel b: The proportion of participants achieving total body fat mass reduction thresholds at Week 72 is calculated based on logistic regression with mixed model repeated measures for missing data imputation using an efficacy estimand. N represents number of participants based on imputed data.
Panel a: Data are presented as arithmetic means for ALT. Panel b: Data are presented as arithmetic means for lipase. Abbreviations: ALT, alanine aminotransferase; BL, baseline; ULN, upper limit of normal.
Data are presented as LSM percent change from baseline ± SE. The primary treatment period starts at the first dose of Week 1 and ends at the last visit on or prior to Week 48. The open-label extension treatment period starts at the first dose after Week 48 and ends at the last visit on or prior to Week 72. Panels a–d: The LSM percent changes in lipids from baseline to Week 72 are based on a MMRM using log-transformation for the efficacy estimand. Abbreviations: CV, coefficient of variation; HDL, high-density lipoprotein; LDL, low-density lipoprotein; MMRM, mixed model for repeated measures; SE, standard error.
Data are presented as LSM change from baseline ± SE. n represents number of participants with baseline and post-baseline value at Week 48. Panels a–b: LSM changes from baseline at Week 48 in leptin and adiponectin are based on a MMRM using log-transformation for the efficacy estimand. Abbreviations: CV, coefficient of variation; MMRM, mixed model for repeated measures; SE, standard error.
Investigator list, protocol and statistical analysis plan
Source data (n, LSM, s.e., P values) included in the submission.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Heymsfield, S.B., Aronne, L.J., Montgomery, P. et al. Bimagrumab plus semaglutide alone or in combination for the treatment of obesity: a randomized phase 2 trial.