What if it could solve all of our energy problems?

Nestled between Hawaii and the western coast of Mexico lies the Pacific Ocean's Clarion-Clipperton Zone (CCZ), a 4.5-million-square-kilometer stretch of abyssal plain bordered by the Clarion and Clipperton Fracture Zones. Although this stretch of sea is a vibrant ecosystem filled with marine life, the CCZ is best known for its immense collection of potato-sized rocks known as polymetallic nodules. These rocks, of which there are potentially trillions, are filled with rich deposits of nickel, manganese, copper, zinc, and cobalt. Those particular metals are vital for the batteries needed to power a green energy future, leading some mining companies to refer to the nodules as a “battery in a rock.”

However, a study reports that these nodules might be much more than simply a collection of valuable materials for electric cars—they also produce oxygen 4,000 meters below the surface, where sunlight can't reach. This unexpected source of “dark oxygen,” as it's called, redefines the role these nodules play in the CCZ. The rocks could also rewrite the script on not only how life began on this planet, but also its potential to take hold on other worlds within our Solar System, such as Enceladus or Europa. The results of this study were published in the journal Nature Geoscience.

“For aerobic life to begin on the planet,” Andrew Sweetman, a deep-sea ecologist with the Scottish Association for Marine Science and lead author of the study, said in a press statement, “there had to be oxygen, and our understanding has been that Earth's oxygen supply began with photosynthetic organisms. But we now know that there is oxygen produced in the deep sea, where there is no light. I think we therefore need to revisit questions like: where could aerobic life have begun?”

The journey toward this discovery began more than a decade ago, when Sweetman started analyzing how oxygen levels decreased further into the depths of the ocean. So it came as a surprise in 2013 when sensors returned increased levels of oxygen in the CCZ. At the time, Sweetman dismissed the data as the result of faulty sensors, but subsequent studies showed that this abyssal plain somehow produced oxygen. Taking note of the nodules' “battery in a rock” tagline, Sweetman wondered if the minerals found in these nodules were somehow acting as a kind of “geobattery,” separating hydrogen and oxygen via seawater electrolysis. A 2023 study showed that various bacteria and archaea can create “dark oxygen,” so Sweetman and his team recreated the conditions of the CCZ in a laboratory and killed off any microorganisms with mercury chloride—surprisingly, oxygen levels continued rising. According to Scientific American, Sweetman found a voltage of roughly 0.95 volts on the surface of these nodules, a charge likely built up as the nodules grow, with different deposits accumulating irregularly throughout, and this natural charge is enough to split seawater.

This discovery adds more fuel to the already-fiery debate over what to do with these nodules. Mining outfits like the Metals Company, whose CEO coined the phrase “battery in a rock,” see these nodules as the answer to our energy problems.
However, 25 countries want the governing body—the International Seabed Authority (ISA) Council—to implement a moratorium, or at the very least a precautionary pause, so more research can be conducted into how mining these nodules could affect the ocean. This is especially vital considering that the world's seas are already facing a litany of climate challenges, including acidification, deoxygenation, and pollution. In response to this discovery, Scripps Institution of Oceanography's Lisa Levin, who wasn't involved with the study, highlighted in a comment to the Deep Sea Conservation Coalition why such a moratorium is so important for protecting these deep-sea nodules.

The ISA is still negotiating with key players on deep-sea mining regulations. So while the future of the world's oceans approaches a critical moment of conservation or exploitation, science has shown once again that disrupting these ecosystems could have consequences we can't even imagine.
April 5, 2025

Dennis Gaitsgory, Who Proved Part of Math's Grand Unified Theory, Wins Breakthrough Prize

By solving part of the Langlands program, a mathematical proof that was long thought to be unachievable, Dennis Gaitsgory snags a prestigious Breakthrough Prize

By Manon Bischoff, edited by Jeanna Bryner

Dennis Gaitsgory, of the Max Planck Institute for Mathematics, has won the Breakthrough Prize in Mathematics for numerous breakthrough contributions to the geometric Langlands program. The Langlands program has been described by mathematician Edward Frenkel as the “grand unified theory of mathematics.” Conceived by Robert Langlands in 1967, the program includes numerous conjectures that were intended to connect disparate mathematical realms: number theory and harmonic analysis. In the 1990s, a similar connection between geometry and harmonic analysis was noticed, and the geometric Langlands program was born.

Decades later, in 2024, Gaitsgory, of the Max Planck Institute for Mathematics in Bonn, Germany, and eight of his colleagues achieved a breakthrough. In five preprint papers consisting of nearly 1,000 pages, they proved that a large class of geometric objects is related to quantities from calculus. Gaitsgory has now been awarded the Breakthrough Prize in Mathematics, which includes a $3-million award, for this outstanding achievement. Scientific American's German-language sister publication Spektrum der Wissenschaft spoke to Gaitsgory about his math career, the Langlands achievement and the prestigious Breakthrough Prize. [An edited transcript of the interview follows.]

You've been working on the geometric Langlands program for 30 years. When was the moment that you realized you'd be able to prove it?

There was a very crucial step that was always a mystery. This got solved by a former graduate student of mine, [mathematician] Sam Raskin, and his graduate students in the winter of 2022. They proved that something is nonzero. After this, it was clear that we would be able to work out a proof.

How did you feel when you realized that it could really be done?

I've always perceived it as some kind of long-term project for self-entertainment. So I obviously felt happy, but it was not like a very strong emotion or anything. It wasn't a eureka moment. The conjecture that we proved is one particular case of something much, much bigger. It has received a lot of attention because it's one well-formulated thing. But it's just one step. I was happy that this step had been done, but there's much more to do.

So there was no champagne popping? You just sat down and continued working?

There was no champagne but something similar. When [Raskin] said that he could prove this crucial part, we made a bet: if he could really do it, I promised him a bottle of scotch.

The proof is huge, almost 1,000 pages. Did you oversee everything in it?

I wrote 95 percent of it. [That was] not for a good reason but because I had an injury from skiing, and I was just lying in bed. So what else was there to do? I was watching Star Wars with my son and writing this thing.

Do you mean you did both at the same time?
Initially, some sections in our papers were named after Star Wars episodes, but at the end, we deleted [that element], mostly out of copyright concerns. But one paper still has a quote from Star Wars: “Fear will keep the local systems in line.” It was a really good fit, because in this paper, we had to control the moduli space of local systems.

It's one thing to understand something but another to write everything down in detail. Did any problems pop up?

Of course. We had a road map, but there were still a lot of blanks to fill, many theories to be developed. But I don't think there was a moment of actual panic. Sometimes I was not sure if one thing would require three more pages, 20 more pages or 50 more pages. There was just an uncertainty of how much more work had to be done.

Did you do all of this from your bed?

No, actually it was a cooperative process. The proof has nine co-authors: Every day I was writing to this guy and to that guy. They all have different perspectives and a slightly different kind of expertise. In some sense, it was as if I was lying in bed, and my colleagues were visiting me, so I didn't get bored. It really held my spirits up that I could talk to them by e-mail. There are some people who go to a bar to drink; we instead just talk about math. They talk about soccer; we talk about math. It's the same thing; it's human interaction.

Speaking of human interaction, do you talk about your work with your friends and family?

No. They're not mathematicians. They can't technically understand. My wife was close by my side and knows the story and the development of the topic. She knows how these things look from the outside, but I can't describe the content [to her].

A lot of people would say that the Langlands program is one of the most complex research topics in the world. Would you agree?

The question is: What do you mean by complex? Yes, one cannot come from the street and just study this. But the same applies to what other mathematicians, such as Peter Scholze [who studies arithmetic geometry at the University of Bonn in Germany and the Max Planck Institute for Mathematics], are doing. I don't have the background to just come to a talk he is giving and understand what he says because there are lots of technical details. It's the same here. One has to invest some effort to understand how things work, and then one should be able to understand. But that doesn't say that whatever we're doing is intrinsically more complex. I think all frontier math is equally complicated. We're all trying to push a boundary at different points.

How many people can understand the technical parts of your work?

Now the community is growing because people are studying our proof. But up until last year, apart from [my] eight co-authors, there may have been five or six people who would have the capacity to understand the technical details.

Do you wish that more people were involved in this type of research?

Yes, definitely. So far it has been a very small community: The people who pushed the boundaries were basically my former students plus Dima Arinkin [a math professor at the University of Wisconsin–Madison], who is my age. He was a close friend and collaborator for many years. So some ideas get recycled. It would be just nice to have an influx of people from the outside. They could bring in something totally new. I would be very excited to see new ideas.

What could be done to get more people interested in the geometric Langlands problem?

More lectures and workshops on that topic, I guess.
There will be a master class in Copenhagen, for example, in August. And there will be a conference in Berkeley, Calif. But now our research gets more attention because our proof is out. I regularly receive e-mails, mostly from younger people. [At the time of my interview], for example, [I am set to give] a talk to a big audience of graduate students in Graz, Austria. I will talk about the foundations of derived algebraic geometry. So the graduate students want to study these foundations, and hopefully some of them will proceed to study the geometric Langlands program. But they need derived algebraic geometry to understand this. [Editor's Note: This talk was scheduled for April 2.]

So you hope to catch the interest of young students by teaching them derived algebraic geometry. How did you become interested in the Langlands program in the first place?

It was back in the 1990s, when [Alexander] Sasha Beilinson [a mathematician now at the University of Chicago] came to Tel Aviv [University], where I was a graduate student. Beilinson gave two talks; he was at the very beginning of his own work on the subject. And I was completely captivated. I had learned about the classical Langlands program..., but before his talk, I had no idea that it could be related to geometry. It was the first time I heard about it. The objects he talked about seemed so appealing to me. It was exactly the type of mathematical object that I wanted to study. And they all came together miraculously in this. And I was like, “Wow.” I had to work on that.

Does the same fascination still drive your research?

Of course things evolved. It's one thing when you're 20, another thing when you're 50. I don't know what drives me now. It's like an actual desire. It's like appetite. I want to do math. And if I can't, if I'm prevented from doing math, such as when I'm on a family vacation for a week with my kids, and I can't do math, I suffer.

Really? That happens after one week?

One week is maybe still okay. But after two weeks, I become a terrible human being.

Well, it's wonderful to find such a passion in life.

It's not really passion.

Is it maybe more like some kind of addiction?

Yes, maybe. It's more like: man needs to eat, and man needs to do math.

What are you working on now? Did you plunge into an abyss?

I am trying to generalize our work. I have several projects at different stages. There's a lot of theory to be developed, but at least we now have a program. We know what we want.

You have a new road map?

Let's say that we have a road map of desires but not a road map of methods like the one I [described] in 2013 [and published in 2015]. Back then, I knew exactly what needed to be proved. Now I know what I want, but I don't know how to get there.

Maybe you will get new ideas from new researchers.

That would be very nice. But I think, in some sense, it's like a Darwinian process: If the math is valuable, it will get studied. And more people will understand and get attracted. And if the math is boring, then too bad. Time will show.

This article originally appeared in Spektrum der Wissenschaft and was reproduced with permission. Manon Bischoff is a theoretical physicist and an editor at Spektrum der Wissenschaft, the German-language sister publication of Scientific American.
Nature Communications volume 16, Article number: 3280 (2025)

The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs—GPT and LLaMA representatives—on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues like missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.

Biomedical literature presents substantial obstacles to curation, interpretation, and knowledge discovery due to its vast volume and domain-specific challenges. PubMed alone sees an increase of approximately 5000 articles every day, totaling over 36 million as of March 2024 (ref. 1). In specialized fields such as COVID-19, roughly 10,000 dedicated articles are added each month, bringing the total to over 0.4 million as of March 2024 (ref. 2). In addition to volume, the biomedical domain also poses challenges with ambiguous language. For example, a single entity such as Long COVID can be referred to using 763 different terms3. Additionally, the same term can describe different entities, as seen with the term AP2, which can refer to a gene, a chemical, or a cell line4. Beyond entities, identifying novel biomedical relations and capturing semantics in biomedical literature present further challenges5,6. To overcome these challenges, biomedical natural language processing (BioNLP) techniques are used to assist with manual curation, interpretation, and knowledge discovery. Biomedical language models are considered the backbone of BioNLP methods; they leverage massive amounts of biomedical literature and capture biomedical semantic representations in an unsupervised or self-supervised manner. Early biomedical language models are non-contextual embeddings built with shallow, fully connected neural networks (e.g., word2vec and fastText), such as BioWordVec and BioSentVec4,7,8.
Since the inception of transformers, biomedical language models have adopted their architecture and can be categorized into (1) encoder-based, masked language models using the encoder from the transformer architecture, such as the biomedical bidirectional encoder representations from transformers (BERT) family, including BioBERT and PubMedBERT9,10,11; (2) decoder-based, generative language models using the decoder from the transformer architecture, such as the generative pre-trained transformer (GPT) family, including BioGPT and BioMedLM12,13; and (3) encoder-decoder-based models using both encoders and decoders, such as BioBART and SciFive14,15. BioNLP studies fine-tuned those language models and demonstrated that they achieved state-of-the-art (SOTA) performance in various BioNLP applications10,16, and those models have been successfully employed in PubMed-scale downstream applications such as biomedical sentence search17 and COVID-19 literature mining2.

Recently, the latest closed-source GPT models, including GPT-3 and, more notably, GPT-4, have made significant strides and garnered considerable attention. A key characteristic of these models is the exponential growth of their parameters. For instance, GPT-3 has ~175 billion parameters, more than a hundred times as many as GPT-2. Models of this magnitude are commonly referred to as Large Language Models (LLMs)18. Moreover, the enhancement of LLMs is achieved through reinforcement learning from human feedback, thereby aligning text generation with human preferences19. For instance, GPT-3.5 builds upon the foundation of GPT-3 using reinforcement learning techniques, resulting in significantly improved performance in natural language understanding20. The launch of ChatGPT—a chatbot using GPT-3.5 and GPT-4—has marked a milestone in generative artificial intelligence. It has demonstrated strong capabilities in tasks that its predecessors failed at; for instance, GPT-4 passed over 20 academic and professional exams, including the Uniform Bar Exam, SAT Evidence-Based Reading & Writing, and the Medical Knowledge Self-Assessment Program21. These remarkable advancements have sparked extensive discussions across society, with excitement and concern alike. In addition to closed-source LLMs, open-source LLMs such as LLaMA22 and Mixtral23 have been widely adopted in downstream applications and also used as the basis for continued pretraining on domain-specific resources. In the biomedical domain, PMC LLaMA (7B and 13B) is one of the first biomedical domain-specific LLMs; it continuously pre-trained LLaMA on 4.8 million biomedical papers and 30,000 medical textbooks24. Meditron (7B and 70B), a more recent biomedical domain-specific LLM, employed a similar continuous pretraining strategy on LLaMA 2.

Pioneering studies have conducted early experiments on LLMs in the biomedical domain and reported encouraging results. For instance, Bubeck et al. studied the abilities of GPT-4 across a wide spectrum of tasks, such as coding, mathematics, and interactions with humans. This early study reported biomedical-related results, indicating that GPT-4 achieved an accuracy of approximately 80% on the US Medical Licensing Exam (Steps 1, 2, and 3), along with an example of using GPT-4 to verify claims in a medical note. Lee et al. also demonstrated use cases of GPT-4 for answering medical questions, generating summaries from patient reports, assisting clinical decision-making, and creating educational materials24. Wong et al.
conducted a study on GPT-3.5 and GPT-4 for end-to-end clinical trial matching, handling complex eligibility criteria, and extracting complex matching logic25. Liu et al. explored the performance of GPT-4 on radiology domain-specific use cases26. Nori et al. further found that general-domain LLMs with advanced prompt engineering can achieve the highest accuracy in medical question answering without fine-tuning27. Recent reviews also summarize related studies in detail28,29,30. These results demonstrate the potential of using LLMs in BioNLP applications, particularly when minimal manually curated gold standard data is available and fine-tuning or retraining for every new task is not required. In the biomedical domain, a primary challenge is the limited availability of labeled datasets, which are at a much smaller scale than those in the general domain (e.g., a biomedical sentence similarity dataset only has 100 labeled instances in total31)32,33. This challenges the fine-tuning approach because (1) models fine-tuned on limited labeled datasets may not be generalizable, and (2) fine-tuning becomes more challenging as model size grows.

Motivated by these early experiments, it is important to systematically assess the effectiveness of LLMs in BioNLP tasks and understand their impact on BioNLP method development and downstream users. Table 1 provides a detailed comparison of representative studies in this context. While our primary focus is on the biomedical domain, specifically the evaluation of LLMs using biomedical literature, we have also included two representative studies in the clinical domain (evaluating LLMs using clinical records) for reference. These studies share several primary limitations. First, most evaluation studies primarily assessed GPT-3 or GPT-3.5, which may not provide a full spectrum of representative LLMs from different categories. For instance, few studies evaluated more advanced closed-source LLMs such as GPT-4, LLM representatives from the general domain such as LLaMA22, or biomedical domain-specific LLMs such as PMC-LLaMA34. Second, the existing studies mostly assessed extraction tasks, where the gold standard is fixed. Few of these studies evaluated generative tasks such as text summarization and text simplification, where the gold standard is free text. Arguably, existing transformer models have already demonstrated satisfactory performance in extractive tasks, while generative tasks remain a challenge in terms of achieving similar levels of proficiency. Therefore, it is imperative to assess how effective LLMs are for generative tasks in BioNLP and to examine whether they can complement existing models. Third, most existing studies only reported quantitative assessments such as the F1-score, with limited emphasis on qualitative evaluations. However, conducting qualitative evaluations (e.g., assessing the quality of LLM-generated text and categorizing inconsistent or hallucinated responses) to understand the errors and impacts of LLMs on downstream applications in the biomedical domain is arguably more critical than mere quantitative metrics. For instance, studies on LLMs found a relatively low correlation between human judgments and automatic measures, such as ROUGE-L, commonly applied to text summarization tasks in the clinical domain35. Finally, it is worth noting that several studies did not provide public access to their associated data or code. For example, few studies have made the prompts or selected examples for few-shot learning available.
This hinders reproducibility and also presents challenges in evaluating new LLMs under the same settings for a fair comparison.

In this study, we conducted a comprehensive evaluation of LLMs in BioNLP applications to examine their potential as well as their limitations and errors. Our study has three main contributions. First, we performed comprehensive evaluations on four representative LLMs: GPT-3.5 and GPT-4 (representatives of closed-source LLMs), LLaMA 2 (a representative of open-source LLMs), and PMC LLaMA (a representative of biomedical domain-specific LLMs). We evaluated them on 12 BioNLP datasets across six applications: (1) named entity recognition, which extracts biological entities of interest from free text; (2) relation extraction, which identifies relations among entities; (3) multi-label document classification, which categorizes documents into broad categories; (4) question answering, which provides answers to medical questions; (5) text summarization, which produces a coherent summary of an input text; and (6) text simplification, which rewrites an input text into more accessible language. The models were evaluated under four settings: zero-shot, static few-shot, dynamic K-nearest few-shot, and fine-tuning where applicable. We compared these models against the state-of-the-art (SOTA) approaches that use fine-tuned, domain-specific BERT or BART models. Both BERT and BART models are well established in BioNLP research. Our results suggest that SOTA fine-tuning approaches outperformed zero- and few-shot LLMs in most of the BioNLP tasks. These approaches achieved a macro-average approximately 15% higher than the best zero- and few-shot LLM performance across the 12 benchmarks (0.65 vs. 0.51) and over 40% higher in information extraction tasks such as relation extraction (0.79 vs. 0.33). However, closed-source LLMs such as GPT-3.5 and GPT-4 demonstrated better zero- and few-shot performance in reasoning-related tasks such as medical question answering, where they outperformed the SOTA fine-tuning approaches. In addition, they exhibited lower-than-SOTA but reasonable performance in generation-related tasks such as text summarization and simplification, showing competitive accuracy and readability, as well as potential in semantic understanding tasks such as document-level classification. Among the LLMs, GPT-4 showed the highest overall performance, especially due to its remarkable reasoning capability. However, this comes with a trade-off: it is 60 to 100 times more expensive than GPT-3.5. In contrast, open-source LLMs such as LLaMA 2 did not demonstrate robust zero- and few-shot performance; they still require fine-tuning to bridge the performance gap for BioNLP applications.
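To make the zero- and few-shot settings above concrete, the sketch below sends a single relation-extraction instance to a closed-source LLM with zero or one in-context example, using the openai Python client (v1-style API). The instruction wording, example sentence, label set, and model name are illustrative assumptions only; they are not the prompts or instances used in this study.

```python
# Minimal sketch of zero- vs. one-shot prompting with the openai Python client
# (v1-style API). Instruction, example, labels, and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = ("Classify the relation between the chemical and the protein in the "
               "sentence as one of: activator, inhibitor, substrate, or none.")

ONE_SHOT_EXAMPLE = ("Sentence: Aspirin irreversibly inhibits COX-1.\n"
                    "Relation: inhibitor")

def classify(sentence: str, shots: list[str] | None = None) -> str:
    """Zero-shot if shots is None; otherwise the labeled examples are prepended."""
    parts = [INSTRUCTION] + (shots or []) + [f"Sentence: {sentence}", "Relation:"]
    response = client.chat.completions.create(
        model="gpt-4",  # hypothetical choice; any chat-completion model would work
        messages=[{"role": "user", "content": "\n\n".join(parts)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

test_sentence = "Metformin activates AMP-activated protein kinase."
print(classify(test_sentence))                            # zero-shot
print(classify(test_sentence, shots=[ONE_SHOT_EXAMPLE]))  # one-shot
```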
Second, we conducted a thorough manual validation of, collectively, hundreds of thousands of sample outputs from the LLMs. For extraction and classification tasks where the gold standard is fixed (e.g., relation extraction and multi-label document classification), we examined (1) missing output, when LLMs fail to provide the requested output; (2) inconsistent output, when LLMs produce different outputs for similar instances; and (3) hallucinated output, when LLMs fail to address the user input and may include repetitions and misinformation in the output36. For text summarization tasks, two healthcare professionals performed manual evaluations assessing accuracy, completeness, and readability. The results revealed prevalent cases of missing, inconsistent, and hallucinated outputs, especially for LLaMA 2 under the zero-shot setting. For instance, it had 102 hallucinated cases (32% of the total testing instances) and 69 inconsistent cases (22%) for a multi-label document classification dataset.

Finally, we provided recommendations for downstream users on best practices for using LLMs in BioNLP applications. We also noted two open problems. First, the current data and evaluation paradigms in BioNLP are tailored to supervised methods and may not be fair to LLMs. For instance, the results showed that automatic metrics for text summarization may not align with manual evaluations. Also, datasets that specifically target tasks where LLMs excel, such as reasoning, are limited in the biomedical domain. Revisiting data and evaluation paradigms in BioNLP is key to maximizing the benefits of LLMs in BioNLP applications. Second, addressing errors, missing information, and inconsistencies is crucial to minimize the risks associated with LLMs in biomedical and clinical applications. We strongly encourage a community effort to find better solutions to mitigate these issues. We believe that the findings of this study will be beneficial for BioNLP downstream users and will also contribute to further enhancing the performance of LLMs in BioNLP applications. The established benchmarks and baseline performance could serve as the basis for evaluating new LLMs in the biomedical domain. To ensure reproducibility and facilitate benchmarking, we have made the relevant data, models, and results publicly accessible through https://doi.org/10.5281/zenodo.14025500 (ref. 37).

Table 2 presents the primary evaluation metric results and their macro-averages for the LLMs under zero-/few-shot (static one- and five-shot) and fine-tuning settings over the 12 datasets. The results on specific datasets were consistent with those independently reported by other studies, such as an accuracy of 0.4462 and 0.7471 on MedQA for GPT-3.5 zero-shot and GPT-4 zero-shot, respectively (0.4988 and 0.7156 in our study, respectively)38. Similarly, micro-F1 scores of 0.6224 and 0.6720 were reported on HoC and LitCovid, respectively, for GPT-3.5 zero-shot (0.6605 and 0.6707 in our study, respectively)39. An accuracy of 0.7790 on PubMedQA was also reported for the fine-tuned PMC LLaMA 13B (which combined multiple question answering datasets for fine-tuning)34; our study reported a similar accuracy of 0.7680 using the PubMedQA training set only. We further summarize detailed results in Supplementary Information S2 (Quantitative evaluation results), including secondary metric results in S2.2; performance mean, variance, and confidence intervals in S2.3; statistical test results in S2.4; and dynamic K-nearest few-shot results in S2.5.

SOTA vs. LLMs. The results of the SOTA fine-tuning approaches used for comparison are provided in Table 2. Recall that the SOTA approaches utilized fine-tuned (domain-specific) language models. For the extractive and classification tasks, the SOTA approaches fine-tuned biomedical domain-specific BERT models such as BioBERT and PubMedBERT. For text summarization and simplification tasks, the SOTA approaches fine-tuned BART models. As demonstrated in Table 2, the SOTA fine-tuning approaches had a macro-average of 0.6536 across the 12 datasets, whereas the best LLM counterparts were 0.4561, 0.4750, 0.4862, and 0.5131 under zero-shot, one-shot, five-shot, and fine-tuning settings, respectively.
They outperformed the zero- and few-shot LLMs in 10 out of the 12 datasets, with much higher performance especially in information extraction tasks. For instance, for NCBI Disease, the SOTA approach achieved an entity-level F1-score of 0.9090, whereas the best result of the LLMs (GPT-4) under zero- and one-shot settings was 30% lower (0.5988). The performance of the LLMs is closer under the fine-tuning setting, with LLaMA 2 13B achieving an entity-level F1-score of 0.8682, but it is still lower. Notably, the SOTA fine-tuning approaches are very strong baselines – they were much more sophisticated than simple fine-tuning over a foundation model. Continuing with the example of NCBI Disease, the SOTA fine-tuning approach generated large-scale weakly labeled examples and used contrastive learning to learn a general representation.

In contrast, the LLMs outperformed the SOTA fine-tuning approaches in question answering. For MedQA, the SOTA approach had an accuracy of 0.4195. GPT-4 under the zero-shot setting had almost 30% higher accuracy in absolute difference (0.7156), and GPT-3.5 also had approximately 8% higher accuracy (0.4988) under the zero-shot setting. For PubMedQA, the SOTA approach had an accuracy of 0.7340. GPT-4 under the one-shot setting had a similar accuracy (0.7100) and showed higher accuracy with more shots (0.7580 under the five-shot setting), as we will show later. Both LLaMA 2 13B and PMC LLaMA 13B also had higher accuracy under the fine-tuning setting (0.8040 and 0.7680, respectively). In this case, GPT-3.5 did not achieve higher accuracy than the SOTA approach, but it already had a competitive accuracy (0.6950) under the five-shot setting.

Comparisons among the LLMs. Under zero-/few-shot settings, the results demonstrate that GPT-4 consistently had the highest performance. Under the zero-shot setting, the macro-average of GPT-4 was 0.4561, which is approximately 7% higher than that of GPT-3.5 (0.3814) and almost double that of LLaMA 2 13B (0.2362). It achieved the highest performance in nine out of the 12 datasets, and its performance was also within 3% of the best result for the remaining three datasets. The one-shot and five-shot settings showed very similar patterns. In addition, LLaMA 2 13B exhibited substantially lower performance than GPT-3.5 (15% and 10% lower) and GPT-4 (22% and 17% lower) under zero- and one-shot settings, respectively. It had up to six times lower performance on specific datasets compared with the best LLM results; for example, 0.1286 vs. 0.7109 for HoC under the zero-shot setting. These results suggest that LLaMA 2 13B still requires fine-tuning to achieve similar performance and bridge the performance gap. Fine-tuning improved LLaMA 2 13B's macro-average from 0.2837 to 0.5131. Notably, its performance under the fine-tuning setting is slightly higher than the zero- and few-shot performance of GPT-4. Fine-tuning LLaMA 2 13B generally improved its performance in all tasks except text summarization and text simplification. A key reason for this limitation is that these datasets have much longer input contexts than its maximum input length (4,096 tokens), so fine-tuning did not help in this case. This observation also motivates further research on extending LLMs' context windows40,41. Under the fine-tuning setting, the results also indicate that PMC LLaMA 13B, as a continuously pretrained biomedical domain-specific LLM, did not achieve an overall higher performance than LLaMA 2 13B.
Fine-tuned LLaMA 2 13B had better performance than PMC LLaMA 13B in 10 out of the 12 datasets. As mentioned, we reproduced results similar to those reported in the PMC LLaMA study34. For instance, that study reported an accuracy of 0.7790 on PubMedQA when fine-tuning on multiple question answering datasets together. We obtained a very similar accuracy of 0.7680 when fine-tuning PMC LLaMA 13B on the PubMedQA dataset only. However, we also found that directly fine-tuning LLaMA 2 13B using the exact same setting resulted in better or at least similar performance.

Figure 1 further illustrates the performance of dynamic K-nearest few-shot prompting and the associated cost as the number of shots increases. The detailed results are also provided in Supplementary Information S2. Dynamic K-nearest few-shot was conducted for K values of one, two, and five. For comparison, we also provide the zero-shot and static one-shot performance in the figure. Detailed methods for the few-shot and cost analysis are summarized in the Data and Methods section.

Fig. 1: Dynamic K-nearest few-shot selects the K closest training instances as examples for each testing instance. The performance of static one-shot (using the same one-shot example for each testing instance) is shown as a dashed horizontal line for comparison, and the input and output types for each benchmark are displayed at the bottom of each subplot. Detailed performance figures are provided in Supplementary Information S2.

The results suggest that dynamic K-nearest few-shot is most effective for multi-label document classification and question answering. For instance, for the LitCovid dataset, GPT-4 had a macro-F1 of 0.5901 under the static one-shot setting; in contrast, its macro-F1 under dynamic one-nearest shot was 0.6500 and further increased to 0.7055 with five nearest shots. Similarly, GPT-3.5 exhibited improvements, with its macro-F1 under the static one-shot setting at 0.6009, compared with 0.6364 and 0.6484 for dynamic one-shot and five-shot, respectively. For question answering, the improvement was not as high as for multi-label document classification, but the overall trend showed a steady increase, especially considering that GPT-4 already had similar or higher performance than the SOTA approaches with zero-shot. For instance, its accuracy on PubMedQA was 0.71 with a static one-shot; the accuracy increased to 0.72 and 0.75 under dynamic one-shot and five-shot, respectively.

In contrast, the results show that dynamic K-nearest few-shot was less effective for other tasks. For instance, the dynamic one-shot performance is lower than the static one-shot performance for both GPT models on the two named entity recognition datasets, and increasing the number of dynamic shots does not help either. Similar findings are also observed in relation extraction. For text summarization and text simplification tasks, the dynamic K-nearest few-shot performance was slightly higher in two datasets, but in general, it was very similar to the static one-shot performance. In addition, the results also suggest that increasing the number of shots does not necessarily improve performance. For instance, GPT-4 with dynamic five-shot did not have the highest performance in eight out of the 12 datasets. Similar findings were reported in other studies, where the performance of GPT-3.5 with five-shot learning was lower than that of zero-shot learning for natural language inference tasks39.
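As a rough illustration of the dynamic K-nearest few-shot selection evaluated above, the sketch below retrieves the K most similar training instances for a test instance and assembles them into a prompt. The study's exact retrieval method is not restated in this section, so the sketch assumes TF-IDF cosine similarity; all texts, labels, and function names are illustrative.

```python
# Minimal sketch of dynamic K-nearest few-shot selection: for each test instance,
# retrieve the K most similar training instances and use them as in-context examples.
# Similarity here is TF-IDF cosine similarity (an assumption, not the paper's method).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_k_nearest(test_text, train_texts, train_labels, k=5):
    """Return the k (text, label) training pairs most similar to test_text."""
    vectorizer = TfidfVectorizer()
    train_matrix = vectorizer.fit_transform(train_texts)   # fit on the training pool
    test_vector = vectorizer.transform([test_text])
    similarities = cosine_similarity(test_vector, train_matrix).ravel()
    top_indices = similarities.argsort()[::-1][:k]          # indices of the k nearest
    return [(train_texts[i], train_labels[i]) for i in top_indices]

def build_prompt(instruction, examples, test_text):
    """Assemble instruction, retrieved examples, and the query into one prompt."""
    parts = [instruction]
    parts += [f"Input: {text}\nOutput: {label}" for text, label in examples]
    parts.append(f"Input: {test_text}\nOutput:")
    return "\n\n".join(parts)

# Hypothetical usage with a tiny training pool.
train_texts = ["covid vaccine efficacy", "tumor suppressor gene", "drug interaction risk"]
train_labels = ["prevention", "mechanism", "treatment"]
examples = select_k_nearest("mRNA vaccines and protection", train_texts, train_labels, k=2)
print(build_prompt("Assign a topic label.", examples, "mRNA vaccines and protection"))
```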
Figure 1 further compares the costs per 100 instances of using GPT-3.5 and GPT-4. The cost is calculated based on the number of input and output tokens and the unit price. We used gpt-4-0613 for extractive tasks and gpt-4-32k-0613 for generative tasks because the input and output contexts are much longer, especially with more shots. GPT-4 generally exhibited the highest performance, as shown in both Table 2 and Fig. 1; however, the cost analysis results also demonstrate a clear trade-off, with GPT-4 being 60 to 100 times more expensive. For extractive and classification tasks, the actual cost per 100 instances of GPT-4 with five shots ranges from approximately $2 for sentence-level inputs to around $10 for abstract-level inputs. This cost is 60 to 70 times higher than that of GPT-3.5, which costs approximately $0.03 for sentence-level inputs and around $0.16 for abstract-level inputs with five shots. For generative tasks, the cost difference is even more pronounced, reaching 100 times or more. One reason is that GPT-4 32K has a higher unit price, and tasks like text summarization involve much longer input and output tokens. Taking the PubMed Text Summarization dataset as an example, GPT-4 cost $84.02 per 100 instances with five shots, amounting to approximately $5600 to run inference on the entire testing set. In comparison, GPT-3.5 cost only $0.71 per 100 instances with five shots, totaling around $48 for the entire testing set. Taken together, the performance and cost results indicate that the cost difference does not necessarily scale with the performance difference, except for question answering tasks. GPT-4 exhibited 20% to 30% higher accuracy than GPT-3.5 in question-answering tasks, and higher accuracy than the SOTA approaches; for other tasks, the performance difference is much smaller at a significantly higher cost. For instance, the performance of GPT-4 on both text simplification tasks was within 2% of that of GPT-3.5, but the actual cost was more than 100 times higher.

Figure 2A further shows an error analysis on the named entity recognition benchmark NCBI Disease, where the performance of LLMs under zero- and few-shot settings was substantially lower than the SOTA results (e.g., the LLaMA 2 13B zero-shot performance is almost 70% lower). Recall that named entity recognition extracts entities from free text, and the benchmarks evaluate the accuracy of these extracted entities. We examined all the predictions on the full test sets and categorized them into four types: (1) correct entities, where the predicted entities are correct in both text spans and entity types; (2) wrong entities, where the predicted entities are incorrect; (3) missing entities, where the true entities are not predicted; and (4) boundary issues, where the predicted entities are correct but with different text spans than the gold standard, as shown in Fig. 2A. The results reveal that the LLMs predicted at most 512 entities correctly out of 960 in total, explaining the low F1-scores. As the SOTA model is not publicly available, we used an alternative fine-tuned BioBERT model on NCBI Disease from an independent study (https://huggingface.co/ugaray96/biobert_ncbi_disease_ner), which had an entity-level F1-score of 0.8920, for comparison. It predicted 863 entities out of 960 correctly. The wrong entities, missing entities, and boundary issues were 111, 97, and 269, respectively.
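For illustration, the sketch below buckets predicted entity spans into the four categories above. The span representation and overlap rules are simplifying assumptions, not the exact evaluation procedure used for Fig. 2A.

```python
# Minimal sketch of bucketing NER predictions into the four error categories above.
# Entities are assumed to be (start, end, type) spans; the matching rules here are
# illustrative and may differ from the paper's exact evaluation procedure.
def categorize_predictions(gold, predicted):
    """gold, predicted: sets of (start, end, entity_type) tuples for one document."""
    correct = predicted & gold                      # exact span and type match
    boundary = {
        p for p in predicted - correct
        if any(p[2] == g[2] and p[0] < g[1] and g[0] < p[1]  # same type, overlapping span
               for g in gold)
    }
    wrong = predicted - correct - boundary          # no overlapping gold entity of same type
    matched_gold = {
        g for g in gold
        if any(p[0] < g[1] and g[0] < p[1] and p[2] == g[2] for p in predicted)
    }
    missing = gold - matched_gold                   # gold entities never predicted
    return {"correct": correct, "wrong": wrong, "boundary": boundary, "missing": missing}

# Example: one gold disease mention, one prediction with a shifted span.
gold = {(10, 25, "Disease")}
pred = {(10, 20, "Disease")}
print({k: len(v) for k, v in categorize_predictions(gold, pred).items()})
# {'correct': 0, 'wrong': 0, 'boundary': 1, 'missing': 0}
```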
Fig. 2: A, Error analysis on the named entity recognition benchmark NCBI Disease. Correct entities: the predicted entities are correct in both text spans and entity types; wrong entities: the predicted entities are incorrect; missing entities: true entities are not predicted; boundary issues: the predicted entities are correct but with different text spans than the gold standard. B–D, Qualitative evaluation on ChemProt, HoC, and MedQA, where the gold standard is a fixed classification type or multiple-choice option. Inconsistent responses: the responses are in different formats; missingness: the responses are missing; hallucinations: LLMs fail to address the prompt and may include repetitions and misinformation in the output.

In addition, Fig. 2A shows that GPT-4 had the lowest number of wrong entities, whereas the other categories have a prevalence similar to GPT-3.5, which explains its higher overall F1-score. Furthermore, providing one shot did not alter the errors for GPT-3.5 and GPT-4 compared with their zero-shot settings, but it dramatically changed the results for LLaMA 2 13B. Under one-shot, LLaMA 2 13B had 449 correctly predicted entities, compared with 148 under zero-shot. Its missing entities also decreased from 812 to 511 with one shot, but with a trade-off of more boundary issues and wrong entities.

Figure 2B–D present the qualitative evaluation results on ChemProt, HoC, and MedQA, respectively. Recall that we categorized inconsistencies, missing information, and hallucinations for the tasks where the gold standard is a fixed classification type or a multiple-choice option. Table 3 provides detailed examples. The findings show prevalent inconsistent, missing, or hallucinated responses, particularly in LLaMA 2 13B zero-shot responses. For instance, it exhibited 506 hallucinated responses (~3% of the total 16,943 instances) and 2376 inconsistent responses (14%) for ChemProt. In the case of HoC, there were 102 (32%) hallucinated responses and 69 (22%) inconsistent responses. Similarly, for MedQA, there were 402 (32%) inconsistent responses. In comparison, GPT-3.5 and GPT-4 exhibited substantially fewer such cases. GPT-3.5 showed a small number of inconsistent responses for ChemProt and HoC, and a few missing responses for MedQA. GPT-4 did not exhibit any such cases for ChemProt and HoC, while displaying a few missing responses for MedQA. It is worth noting that inconsistent responses do not necessarily fail to address the prompts; rather, they answer the prompt but in different formats. In contrast, hallucinated cases do not address the prompts and may repeat the prompts or contain irrelevant information. All such instances pose challenges for automatic extraction or postprocessing and may require manual review. As a potential solution, we observed that adding just one shot could significantly reduce such cases, especially for LLaMA 2 13B, which exhibited prevalent instances under zero-shot. As illustrated in Fig. 2B, one shot dramatically reduced these cases for LLaMA 2 13B on ChemProt and MedQA. Similarly, its hallucinated responses decreased from 102 to 0, and its inconsistent cases decreased from 69 to 23, on HoC with one shot. Another solution is fine-tuning, for which we did not find any such cases during the manual examination, albeit at the cost of additional computational resources.
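The kind of postprocessing such outputs require can be sketched as follows: raw responses are normalized against a fixed label set, and anything unparseable is flagged for manual review. The label names and matching rules below are hypothetical and far simpler than a production pipeline.

```python
# Minimal sketch of postprocessing free-text LLM classification responses.
# The label set and matching rules are illustrative, not the paper's exact pipeline.
import re

LABELS = ["activator", "inhibitor", "agonist", "antagonist", "substrate"]  # hypothetical labels

def normalize_response(response: str):
    """Map a raw LLM response to a known label, or flag it for manual review."""
    text = response.strip().lower()
    matches = [label for label in LABELS if re.search(rf"\b{label}\b", text)]
    if len(matches) == 1:
        # Answer found, possibly in an unexpected format (an "inconsistent" response).
        return {"label": matches[0], "flag": None if text == matches[0] else "inconsistent"}
    if len(matches) == 0 and not text:
        return {"label": None, "flag": "missing"}        # no answer at all
    # No recognizable label, or several contradictory ones: treat as hallucinated.
    return {"label": None, "flag": "hallucinated"}

print(normalize_response("The relation is Inhibitor."))   # inconsistent format, label recovered
print(normalize_response(""))                              # missing
print(normalize_response("As an AI model, I cannot..."))   # hallucinated / off-task
```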
Figure 3 presents the qualitative evaluation results on the PubMed Text Summarization dataset. Figure 3A depicts the overall results in accuracy, completeness, and readability for the four models on 50 random samples. The corresponding numerical results are provided in Table 4, and detailed results with statistical analysis and examples are available in Supplementary Information S3. The fine-tuned BART model used in the SOTA approach42, serving as the baseline, achieved an accuracy of 4.76 (out of 5), a completeness of 4.02, and a readability of 4.05. In contrast, both GPT-3.5 and GPT-4 demonstrated similar and slightly higher accuracy (4.79 and 4.83, respectively) and statistically significantly higher readability than the fine-tuned BART model (4.66 and 4.73), but statistically significantly lower completeness (3.61 and 3.57) under the zero-shot setting. The LLaMA 2 13B zero-shot performance is substantially lower in all three aspects.

Fig. 3: A, The overall results of the fine-tuned BART, GPT-3.5 zero-shot, GPT-4 zero-shot, and LLaMA 2 zero-shot models on a scale of 1 to 5, based on 50 random testing instances from the PubMed Text Summarization dataset. B, C, The number of winning, tying, and losing cases when comparing GPT-4 zero-shot with GPT-3.5 zero-shot and with the fine-tuned BART model, respectively. Table 4 shows the numerical results; detailed results, including statistical tests and examples, are provided in Supplementary Information S3.

Figure 3B, C further compare GPT-4 with GPT-3.5 and with the fine-tuned BART model in detail. In the comparison between GPT-4 and GPT-3.5, GPT-4 had a slightly higher number of winning cases in the three aspects (4 winning cases vs. 1 losing case for accuracy, 17 vs. 13 for completeness, and 13 vs. 6 for readability). Most of the cases resulted in a tie. When comparing GPT-4 with the fine-tuned BART model, GPT-4 had significantly more winning cases for readability (34 vs. 1) but far fewer winning cases for completeness (9 vs. 22).

First, the SOTA fine-tuning approaches outperformed the zero- and few-shot performance of LLMs in most BioNLP applications. As demonstrated in Table 2, they had the best performance in 10 out of the 12 benchmarks. In particular, they outperformed zero- and few-shot LLMs by a large margin in information extraction and classification tasks such as named entity recognition and relation extraction, which is consistent with existing studies43,44. In contrast to other tasks such as medical question answering, named entity recognition and relation extraction require limited reasoning and extract information directly from inputs at the sentence level. Zero- and few-shot learning may not be appropriate or sufficient for these conditions. For those tasks, arguably, fine-tuned biomedical domain-specific language models are still the first choice and have already set a high bar, according to the literature32.

In addition, closed-source LLMs such as GPT-3.5 and GPT-4 demonstrated reasonable zero- and few-shot capabilities for three BioNLP tasks. The most promising task, where they outperformed the SOTA fine-tuning approaches, is medical question answering, which involves reasoning45. As shown in Table 2 and Fig. 1, GPT-4 already outperformed previous fine-tuned SOTA approaches in MedQA and PubMedQA with zero- or few-shot learning. This is also supported by existing studies on medical question answering38,46. The second potential use case is text summarization and simplification.
As shown in Table 2, those tasks are still less favored by the automatic evaluation measures; however, manual evaluation results show that both GPT-3.5 and GPT-4 had higher readability and competitive accuracy compared with the SOTA fine-tuning approaches. Other studies reported similar findings regarding the low correlation between automatic and manual evaluations35,47. The third possible use case – though LLMs still trail the previous fine-tuned SOTA approaches here – is document-level classification, which involves semantic understanding. As shown in Fig. 1, GPT-4 achieved over a 0.7 F1-score with dynamic K-nearest few-shot for both multi-label document-level classification benchmarks.

Unlike the closed-source LLMs, open-source LLMs such as LLaMA 2 did not demonstrate strong zero- and few-shot capabilities. While there are other open-source LLMs available, LLaMA 2 remains a strong representative48. The results in Table 2 suggest that its overall zero-shot performance is 15% and 22% lower than that of GPT-3.5 and GPT-4, respectively, and up to 60% lower on specific BioNLP tasks. Not only does it exhibit suboptimal performance, but the results in Fig. 2 also demonstrate that its zero-shot responses frequently contain inconsistencies, missing elements, and hallucinations, accounting for up to 30% of the full testing set instances. Therefore, fine-tuning open-source LLMs for BioNLP tasks is still necessary to bridge the gap. Only after fine-tuning does LLaMA 2's overall performance become slightly higher (by about 4%) than that of one-shot GPT-4. However, it is worth noting that the model sizes of LLaMA 2 and PMC LLaMA are significantly smaller than those of GPT-3.5 and GPT-4, making it challenging to evaluate them on the same level. Additionally, open-source LLMs have the advantage of continued development and local deployment.

Another primary finding on open-source LLMs is that the results do not indicate significant performance improvement from continued biomedical pretraining (PMC LLaMA 13B vs. LLaMA 2 13B). As mentioned, our study reproduced results similar to those reported for PMC LLaMA 13B; however, we also found that directly fine-tuning LLaMA 2 yielded better or at least similar performance—and this is consistent across all 12 benchmarks. In the biomedical domain, representative foundation LLMs such as PMC LLaMA used 32 A100 GPUs34, and Meditron used 128 A100 GPUs, to continuously pretrain from LLaMA or LLaMA 249. Our evaluation did not find significant performance improvement for PMC LLaMA; the Meditron study itself also reported only a ~3% improvement and evaluated only on question answering datasets. At a minimum, these results suggest the need for a more effective and sustainable approach to developing biomedical domain-specific LLMs.

The automatic metrics for text summarization and simplification tasks may not align with manual evaluations. As the quantitative results on text summarization and generation demonstrated, commonly used automatic evaluations such as ROUGE, BERTScore, and BARTScore consistently favored the fine-tuned BART model's generated text, while manual evaluations showed different results, indicating that GPT-3.5 and GPT-4 had competitive accuracy and much higher readability even under the zero-shot setting. Existing studies also reported that automatic measures on LLM-generated text may not correlate with human preference35,47. The MS^2 benchmark used in this study also discussed the limitations of automatic measures, specifically for text summarization50.
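For reference, the snippet below shows one common way to compute the ROUGE-L score discussed above, using the open-source rouge-score package; the reference and candidate texts are made-up examples, not drawn from the benchmarks. BERTScore and BARTScore, by contrast, compare texts using model-based representations rather than lexical overlap.

```python
# Minimal sketch of ROUGE-L scoring with the open-source `rouge-score` package.
# The reference and candidate texts below are made-up examples.
from rouge_score import rouge_scorer

reference = "The treatment reduced mortality in patients with severe sepsis."
candidate = "Mortality was reduced by the treatment in severe sepsis patients."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)          # target first, prediction second
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```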
Additionally, the results highlight that completeness is a primary limitation when adapting GPT models to biomedical text generation tasks, despite their competitive accuracy and readability scores.

Last, our evaluation of both performance and cost demonstrates a clear trade-off when using LLMs in practice. GPT-4 had the best overall performance across the 12 benchmarks, with an 8% improvement over GPT-3.5, but at a much higher cost (60 to 100 times that of GPT-3.5). Notably, GPT-4 showed significantly higher performance particularly in question-answering tasks that involve reasoning, such as an over 20% improvement in MedQA compared with GPT-3.5. This observation is consistent with findings from other studies27,38. Note that newer versions of GPT-4, such as GPT-4 Turbo, may further reduce the cost of using GPT-4.

These findings lead to recommendations for downstream users on applying LLMs in BioNLP applications, summarized in Fig. 4. It provides suggestions on which BioNLP applications are recommended (or not) for LLMs, categorized by conditions (e.g., the zero-/few-shot setting when computational resources are limited) and additional tips (e.g., when advanced prompt engineering is more effective). It presents specific task-based recommendations across different settings and offers general guidance on effectively applying LLMs in BioNLP. We also recognize the following two open problems and encourage a community effort toward better usage of LLMs in BioNLP applications.

Adapting both data and evaluation paradigms is essential to maximize the benefits of LLMs in BioNLP applications. Arguably, the current datasets and evaluation settings in BioNLP are tailored to supervised (fine-tuning) methods and are not fair to LLMs. These issues challenge the direct comparison between fine-tuned biomedical domain-specific language models and zero-/few-shot LLMs. The datasets for the tasks where LLMs excel are also limited in the biomedical domain. Further, the manual measures on biomedical text summarization showed results different from those of all three automatic measures. These findings collectively suggest that current BioNLP evaluation frameworks have limitations when applied to LLMs35,51. They may not be able to accurately assess the full benefits of LLMs in biomedical applications, calling for the development of new evaluation datasets and methods for LLMs in BioNLP tasks.

Addressing inconsistencies, missingness, and hallucinations produced by LLMs is critical. The prevalence of inconsistencies, missingness, and hallucinations generated by LLMs is of concern, and we argue that they must be addressed before deployment. Our results demonstrate that providing just one shot could significantly reduce the occurrence of such issues, offering a simple solution. However, thorough examination in real-world validation scenarios is still necessary. Additionally, more advanced approaches for validating LLMs' responses are needed to further improve their reliability and usability47.

This study also has several limitations that should be acknowledged. While this study examined strong LLM representatives from each category (closed-source, open-source, and biomedical domain-specific), it is important to note that there are other LLMs, such as BARD52 and Mistral53, that have demonstrated strong performance in the literature. Additionally, while we investigated zero-shot, one-shot, dynamic K-nearest few-shot, and fine-tuning techniques, each of them has variations, and there are also new approaches54.
Given the rapidly growing nature of this area, our study cannot cover all of them. Instead, our aim is to establish baseline performance on the main BioNLP applications using commonly used LLMs and methods as representatives, and to make the datasets, methods, codes, and results publicly available. This enables downstream users to understand when and how to apply LLMs in their own use cases and to compare new LLMs and associated methods on the same benchmarks. In the future, we also plan to assess LLMs in real-world scenarios in the biomedical domain to further broaden the scope of the study. Table 5 presents a summary of the evaluation tasks, datasets, and metrics. We benchmarked the models on the full testing sets of the twelve datasets from six BioNLP applications: BC5CDR-chemical and NCBI-disease for named entity recognition, ChemProt and DDI2013 for relation extraction, HoC and LitCovid for multi-label document classification, MedQA and PubMedQA for question answering, PubMed Text Summarization and MS^2 for text summarization, and Cochrane PLS and PLOS Text Simplification for text simplification. These datasets have been widely used in benchmarking biomedical text mining challenges55,56,57 and evaluating biomedical language models9,10,11,16. The datasets are also available in the repository. We evaluated the datasets using the official evaluation metrics provided by the original dataset description papers, as well as commonly used metrics for method development or applications with the datasets, as documented in Table 5. Note that it is challenging to have a single one-size-fits-all metric, and some datasets and related studies used multiple evaluation metrics. Therefore, we also adopted secondary metrics for additional evaluations. A detailed description is below. Named entity recognition. Named entity recognition is a task that involves identifying entities of interest from free text. Biomedical entities can be described in various ways, and resolving the ambiguities is crucial58. Named entity recognition is typically a sequence labeling task, where each token is classified into a specific entity type. BC5CDR-chemical59 and NCBI-disease60 are manually annotated named entity recognition datasets for chemicals and diseases mentioned in biomedical literature, respectively. The exact-match F1-score (that is, the predicted tokens must have the same text spans as the gold standard) was used to quantify model performance. Relation extraction. Relation extraction involves identifying the relationships between entities, which is important for drug repurposing and knowledge discovery61. Relation extraction is typically a multi-class classification problem, where a sentence or passage is given with identified entities and the goal is to classify the relation type between them. ChemProt55 and DDI201362 are manually curated relation extraction datasets for chemical-protein interactions and drug-drug interactions from biomedical literature, respectively. Macro and micro F1-scores were used to quantify the model performance. Multi-label document classification. Multi-label document classification identifies semantic categories at the document level. The semantic categories are effective for grasping the main topics and searching for relevant literature in the biomedical domain63. Unlike multi-class classification, which assigns only one label to an instance, multi-label classification can assign up to N labels to an instance.
HoC64 and LitCovid56 are manually annotated multi-label document classification datasets for hallmarks of cancer (10 labels) and COVID-19 topics (7 labels), respectively. Macro and micro F1-scores were used as the primary and secondary evaluation metrics, respectively. Question answering. Question answering evaluates the knowledge and reasoning capabilities of a system in answering a given biomedical question with or without associated contexts45. Biomedical QA datasets such as MedQA and PubMedQA have been widely used in the evaluation of language models65. The MedQA dataset is collected from questions in the United States Medical Licensing Examination (USMLE), where each instance contains a question (usually a patient description) and five answer choices (e.g., five potential diagnoses)66. The PubMedQA dataset includes biomedical research questions from PubMed, and the task is to answer yes, no, or maybe given the corresponding abstracts67. Accuracy and macro F1-score were used as the primary and secondary evaluation metrics, respectively. Text summarization. Text summarization produces a concise and coherent summary of a longer document or multiple documents while preserving the essential content. We used two primary biomedical text summarization datasets: the PubMed text summarization benchmark68 and MS^250. The PubMed text summarization benchmark focuses on single-document summarization, where the input is a full PubMed article and the gold standard output is its abstract. MS^2, in contrast, focuses on multi-document summarization, where the input is a collection of PubMed articles and the gold standard output is the abstract of a systematic review study that cites those articles. Both benchmarks used the ROUGE-L score as the primary evaluation metric; BERT score and BART score were used as secondary evaluation metrics. Text simplification. Text simplification rephrases complex texts into simpler language while maintaining the original meaning, making the information more accessible to a broader audience. We used two primary biomedical text simplification datasets: Cochrane PLS69 and the PLOS text simplification benchmark70. Cochrane PLS consists of medical documents from the Cochrane Database of Systematic Reviews and the corresponding plain-language summaries (PLS) written by the authors. The PLOS text simplification benchmark consists of articles from PLOS journals and the corresponding technical summary and PLS written by the authors. The ROUGE-L score was used as the primary evaluation metric. Flesch-Kincaid Grade Level (FKGL) and Dale-Chall Readability Score (DCRS), two commonly used readability metrics71, were used as the secondary evaluation metrics. For each dataset, we used the SOTA fine-tuning result reported before the rise of LLMs as the baseline. The SOTA approaches involved fine-tuning (domain-specific) language models such as PubMedBERT16, BioBERT9, or BART72 as the backbone. Fine-tuning still requires manually labeled instances at scale, which are challenging to obtain in the biomedical domain32. In contrast, LLMs may have the advantage when minimal manually labeled instances are available, and, through zero-/few-shot learning, they do not require fine-tuning or retraining for every new task. Therefore, we used the existing SOTA results achieved by the fine-tuning approaches to quantify the benefits and challenges of LLMs in BioNLP applications. Representative LLMs and their versions. Both GPT-3.5 and GPT-4 have been regularly updated.
For reproducibility, we used the snapshots gpt-3.5-turbo-16k-0613 and gpt-4-0613 for extractive tasks, and gpt-4-32k-0613 for generative tasks, considering their input and output token sizes. LLaMA 2 is available in 7B, 13B, and 70B versions. We evaluated LLaMA 2 13B, a choice based on the computational resources required for fine-tuning, which is arguably the most common scenario for downstream BioNLP applications. For PMC LLaMA, both 7B and 13B versions are available. Similarly, we used PMC LLaMA 13B, specifically evaluating it under the fine-tuning setting – the same setting used in its original study34. In the original study, PMC LLaMA was only evaluated on medical question answering tasks, combining multiple question answering datasets for fine-tuning. In our case, we fine-tuned on each dataset separately and reported the results individually. Prompts. To date, prompt design remains an open research problem73,74,75. We developed a prompt template that can be used across different tasks based on the existing literature74,75,76,77. An annotated prompt example is provided in Supplementary Information S1 Prompt engineering, and we have made all the prompts publicly available in the repository. The prompt template contains (1) task descriptions (e.g., classifying relations), (2) input specifications (e.g., a sentence with labeled entities), (3) output specifications (e.g., the relation type), (4) task guidance (e.g., detailed descriptions or documentation of relation types), and (5) example demonstrations if examples from training sets are provided. This approach aligns with previous studies in the biomedical domain, which have demonstrated that incorporating task guidance into the prompt leads to improved performance74,76; it was also employed and evaluated in our previous study, which focused specifically on named entity recognition77. We also adapted the SOTA example selection approach in the biomedical domain, described below27. Zero-shot and static few-shot. We comparatively evaluated the zero-shot, one-shot, and five-shot learning performances. Only a few studies have made their selected examples available. For reproducibility and benchmarking, we first randomly selected the required number of examples from the training sets, used the same selected examples for few-shot learning, and made the selected examples publicly available. Dynamic K-nearest few-shot. In addition to zero- or static few-shot learning, where fixed instructions are used for every instance, we further evaluated the LLMs under a dynamic few-shot learning setting. The dynamic few-shot learning is based on the MedPrompt approach, the SOTA method that demonstrated robust performance in medical question answering tasks without fine-tuning27. The essence is to use the K training instances that are most similar to the test instance as the selected examples. We denote this setting as dynamic K-nearest few-shot, as the prompts for different test instances differ. Specifically, for each dataset, we used the SOTA text embedding model text-embedding-ada-00254 to encode the instances and used cosine similarity as the metric for finding training instances similar to a given test instance. We tested dynamic K-nearest few-shot prompts with K equal to one, two, and five.
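To make the dynamic K-nearest example-selection step concrete, the following is a minimal sketch assuming instances are encoded with text-embedding-ada-002 through the OpenAI Python client; the function names, caching strategy, and prompt assembly are illustrative and are not the exact implementation released in the repository.

```python
# Minimal sketch of dynamic K-nearest few-shot example selection
# (illustrative; prompt assembly happens elsewhere).
import numpy as np
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Encode texts with text-embedding-ada-002 and return one vector per text."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

def select_k_nearest(test_text: str, train_texts: list[str], k: int = 5) -> list[int]:
    """Return indices of the k training instances most similar to the test
    instance by cosine similarity in the embedding space. In practice the
    training embeddings would be computed once and cached."""
    train_vecs = embed(train_texts)
    test_vec = embed([test_text])[0]
    sims = train_vecs @ test_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(test_vec)
    )
    return np.argsort(-sims)[:k].tolist()

# The selected training instances (with their labels) are then prepended to the
# prompt as demonstrations, so each test instance receives its own prompt.
```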
Parameters for prompt engineering. For zero-, one-, and few-shot approaches, we used a temperature parameter of 0 to minimize variance for both GPT and LLaMA-based models. Additionally, for LLaMA models, we kept the other parameters unchanged, set the maximum number of generated tokens per task, and truncated the instances in the five-shot setting due to the input length limit. Further details are provided in Supplementary Information S1 Prompt engineering, and the related code is available in the repository. Fine-tuning. We further conducted instruction fine-tuning on LLaMA 2 13B and PMC LLaMA 13B. For each dataset, we fine-tuned LLaMA 2 13B and PMC LLaMA 13B using the corresponding training set. The goal of instruction fine-tuning is defined by the objective function \(\arg\max_{\theta}\sum_{({x}^{i},{y}^{i})\in (X,Y)}\log p({y}^{i}|{x}^{i};\theta )\), where \({x}^{i}\) represents the input instruction, \({y}^{i}\) is the ground-truth response, and \(\theta\) is the parameter set of the model. This function aims to maximize the likelihood of accurately predicting responses based on the given instructions. Fine-tuning was performed on eight H100 80G GPUs over three epochs, with a learning rate of 1e−5, a weight decay of 1e−5, a warmup ratio of 0.01, and Low-Rank Adaptation (LoRA) for parameter-efficient tuning78. Output parsing. For extractive and classification tasks, we extracted the targeted predictions (e.g., classification types or multiple-choice options) from the raw outputs of LLMs with a combination of manual and automatic processing. We manually reviewed the processed outputs. Manual review showed that LLMs provided answers in inconsistent formats in some cases. For example, when presenting multiple-choice option C, the raw output examples included variations such as: “Based on the information provided, the most likely … is C. The thyroid gland is a common site for metastasis, and …”, “Great! Let's go through the options. A. … B. …Therefore, the most likely diagnosis is C.”, and “I'm happy to help! Based on the patient's symptoms and examination findings, … Therefore, option A is incorrect. …, so option D is incorrect. The correct answer is option C.” (adapted from real responses with unnecessary details omitted). In such cases, automatic processing might overlook the answer, potentially lowering LLM accuracy. Thus, we manually extracted outputs in these instances to ensure fair credit. Additionally, we qualitatively evaluated the prevalence of such cases (responses provided in inconsistent formats), as described below. Quantitative evaluations. We computed the evaluation metrics summarized in Table 5 under the zero-shot, static few-shot, dynamic K-nearest few-shot, and fine-tuning settings. The metrics were computed on the entire testing sets of the 12 datasets. We further conducted bootstrapping, using a subsample size of 30 repeated 100 times, to report performance variance at a 95% confidence interval, and performed a two-tailed Wilcoxon rank-sum test using SciPy79. Further details are provided in Supplementary Information S2 Quantitative evaluation results (S2.1. Result reporting).
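As an illustration of this evaluation protocol, the sketch below shows one way to compute the bootstrap estimate and the rank-sum test from per-instance scores; the function names, and the assumption that per-instance scores are available as arrays, are ours rather than the released evaluation code.

```python
# Minimal sketch of the bootstrap variance estimate and significance test.
import numpy as np
from scipy import stats

def bootstrap_ci(scores, subsample_size=30, n_repeats=100, alpha=0.05, seed=0):
    """Bootstrap per-instance scores with a subsample of 30, repeated 100 times,
    and return the mean with a 95% confidence interval."""
    rng = np.random.default_rng(seed)
    means = [
        rng.choice(scores, size=subsample_size, replace=True).mean()
        for _ in range(n_repeats)
    ]
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.mean(means)), (float(lower), float(upper))

def compare_models(scores_a, scores_b):
    """Two-tailed Wilcoxon rank-sum test between two models' per-instance scores."""
    statistic, p_value = stats.ranksums(scores_a, scores_b)
    return statistic, p_value
```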
Qualitative evaluations on inconsistency, missing information, and hallucinations. For the tasks where the gold standard is fixed, e.g., a classification type or multiple-choice option, we conducted qualitative evaluations on collectively hundreds of thousands of raw outputs of the LLMs (the raw outputs from three LLMs under zero- and one-shot conditions across three benchmarks) to categorize errors beyond inaccurate predictions. Specifically, we examined (1) inconsistent responses, where the responses are in different formats, (2) missingness, where the responses are missing, and (3) hallucinations, where LLMs fail to address the prompt and the output may contain repetitions and misinformation36. We evaluated and reported the results on selected datasets: ChemProt, HoC, and MedQA. Qualitative evaluations on accuracy, completeness, and readability. For the tasks with free-text gold standards, such as summaries, we conducted qualitative evaluations on the quality of the generated text. Specifically, one senior resident and one junior resident evaluated four models – the fine-tuned BART model reported in the SOTA approach, GPT-3.5 zero-shot, GPT-4 zero-shot, and LLaMA 2 13B zero-shot – on 50 random samples from the PubMed Text Summarization benchmark. Each annotator therefore provided 600 annotations (4 models × 50 samples × 3 dimensions). To mitigate potential bias, the model outputs were all lowercased, their order was randomly shuffled, and the annotators were unaware of which models were being evaluated. They assessed three dimensions on a scale of 1–5: (1) accuracy, does the generated text contain correct information from the original input; (2) completeness, does the generated text capture the key information from the original input; and (3) readability, is the generated text easy to read. The detailed evaluation guideline is provided in Supplementary Information S3 Qualitative evaluation on the PubMed Text Summarization Benchmark. We further conducted a cost analysis to quantify the trade-off between cost and accuracy when using GPT models. The cost of GPT models is determined by the number of input and output tokens. We tracked the tokens in the input prompts and output completions using the official model tokenizers provided by OpenAI (https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) and used the pricing table (https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/) to compute the overall cost. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article. All data supporting the findings of this study, including source data, are available in the article and Supplementary Information, and can be accessed publicly via https://doi.org/10.5281/zenodo.14025500 (ref. 37). Additional data can also be obtained from the corresponding authors upon request. Source data are provided with this paper. The code is publicly available via https://doi.org/10.5281/zenodo.14025500 (ref. 37). References. Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information in 2023. Nucleic Acids Res. 51, D29–D38 (2023). CAS PubMed MATH Google Scholar Chen, Q. et al. LitCovid in 2022: an information resource for the COVID-19 literature. Nucleic Acids Res. 51, D1512–D1518 (2023). PubMed Google Scholar Leaman, R. et al. Comprehensively identifying long COVID articles with human-in-the-loop machine learning. Patterns 4, 100659 (2023). Chen, Q. et al. BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput. Biol. 16, e1007617 (2020). CAS PubMed PubMed Central Google Scholar Blake, C. Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles. J. Biomed. Inform. 43, 173–189 (2010). CAS PubMed MATH Google Scholar Su, Y. et al. Deep learning joint models for extracting entities and relations in biomedical: a survey and comparison. Brief. Bioinforma.
23, bbac342 (2022). Google Scholar Zhang, Y., Chen, Q., Yang, Z., Lin, H., & Lu, Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci. Data. 6, 1–9 (2019). Chen, Q., Peng, Y. & Lu, Z. BioSentVec: creating sentence embeddings for biomedical texts.In 2019 IEEE International Conference on Healthcare Informatics (ICHI) 1–5 (IEEE, 2019). Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020). CAS PubMed MATH Google Scholar Peng, Y., Yan, S., & Lu, Z. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In Proc. 18th BioNLP Workshop and Shared Task, 58–65 (Association for Computational Linguistics, Florence, Italy, 2019). Fang, L., Chen, Q., Wei, C.-H., Lu, Z. & Wang, K. Bioformer: an efficient transformer language model for biomedical text mining, arXiv preprint arXiv:2302.01588 (2023). Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinforma. 23, bbac409 (2022). Google Scholar Venigalla, A., Frankle, J., & Carbin, M. Biomedlm: a domain-specific large language model for biomedical text, MosaicML. Accessed: Dec, 23 (2022). Yuan, H. et al. BioBART: Pretraining and evaluation of a biomedical generative language model. In Proc. 21st Workshop on Biomedical Language Processing, 97–109 (2022). Phan, L.N. et al. Scifive: a text-to-text transformer model for biomedical literature, arXiv preprint arXiv:2106.03598 (2021). Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. HEALTH, 3, 1–23 (2021). Allot, A. et al. LitSense: making sense of biomedical literature at sentence level. Nucleic Acids Res. 47, W594–W599 (2019). CAS PubMed PubMed Central Google Scholar Zhao, W. X. et al. A survey of large language models, arXiv preprint arXiv:2303.18223 (2023). Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022). MATH Google Scholar Chen, X. et al. How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks, arXiv preprint arXiv:2303.00293 (2023). OpenAI, GPT-4 Technical Report, ArXiv, abs/2303.08774, (2023). Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023). Jiang, A. Q. et al. Mixtral of experts arXiv preprint arXiv:2401.04088, 2024. Lee, P, Goldberg, C. & Kohane, I. The AI revolution in medicine: GPT-4 and beyond (Pearson, 2023). Wong, C. et al. Scaling clinical trial matching using large language models: A case study in oncology. In Machine Learning for Healthcare Conference 846–862 (PMLR, 2023). Liu, Q. et al. Exploring the Boundaries of GPT-4 in Radiology. In Proc. of the 2023 Conference on Empirical Methods in Natural Language Processing 14414–14445 (2023). Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine, arXiv preprint arXiv:2311.16452 (2023). Tian, S. et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief. Bioinforma. 25, bbad493 (2024). Google Scholar He, K. et al. A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics, arXiv preprint arXiv:2310.05694 (2023). Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. 
Large language models in medicine: the potentials and pitfalls: a narrative review. Ann. Intern. Med. 177, 210–220 (2024). PubMed Google Scholar Soğancıoğlu, G., Öztürk, H. & Özgür, A. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics 33, i49–i58 (2017). PubMed PubMed Central Google Scholar Tinn, R. et al. Fine-tuning large neural language models for biomedical natural language processing. Patterns. 4, 100729 (2023). Chen, Q., Rankine, A., Peng, Y., Aghaarabi, E. & Lu, Z. Benchmarking effectiveness and efficiency of deep learning models for semantic textual similarity in the clinical domain: validation study. JMIR Med. Inform. 9, e27386 (2021). PubMed PubMed Central Google Scholar Wu, C. et al. PMC-LLaMA: toward building open-source language models for medicine, J. Am. Med. Inform. Associat. ocae045 (2024). Fleming, S. L. et al. MedAlign: A clinician-generated dataset for instruction following with electronic medical records. In Proc. AAAI Conference on Artificial Intelligence Vol. 38 22021–22030 (2023). Zhang, Y. et al. Siren's song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219 (2023). Chen, Q. et al. A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations. https://doi.org/10.5281/zenodo.14025500 (2024). Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375 (2023). Labrak, Y., Rouvier, M. & Dufour, R. A zero-shot and few-shot study of instruction-finetuned large language models applied to clinical and biomedical tasks. In Proc. 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) 2049–2066 (ELRA and ICCL, 2024). Jin, H. et al. Llm maybe longlm: Self-extend llm context window without tuning. In Proc. of Machine Learning Research, 235 22099–22114 (2024). Ding, Y. et al. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, arXiv preprint arXiv:2402.13753 (2024). Xie, Q., Huang, J., Saha, T. & Ananiadou, S. Gretel: Graph contrastive topic enhanced language model for long document extractive summarization. In Proc. 29th International Conference on Computational Linguistics, 6259–6269 (International Committee on Computational Linguistics, 2022). Jimenez Gutierrez, B. et al. Thinking about GPT-3 in-context learning for biomedical IE? Think again. In Findings of the Association for Computational Linguistics: EMNLP 2022, 4497–4512 (Association for Computational Linguistics, 2022). Rehana, H. et al. Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text, arXiv preprint arXiv:2303.17728 (2023). Jin, Q. et al. Biomedical question answering: a survey of approaches and challenges. ACM Comput. Surv. (CSUR) 55, 1–36 (2022). MATH Google Scholar Singhal, K. et al. Large language models encode clinical knowledge, Nature 620, 1–9 (2023). Chang, Y. et al. A survey on evaluation of large language models, ACM Trans. Intell. Syst. Technol. (2023). Minaee, S. et al. Large language models: A survey, arXiv preprint arXiv:2402.06196 (2024). Chen, Z. et al. Meditron-70b: Scaling medical pretraining for large language models, arXiv preprint arXiv:2311.16079 (2023). DeYoung, J., Beltagy, I., van Zuylen, M., Kuehl, B. & Wang, L. L. Ms2: Multi-document summarization of medical studies. In Proc. 
2021 Conference on Empirical Methods in Natural Language Processing, 7494–7513 (2021). Wornow, M. et al. The shaky foundations of large language models and foundation models for electronic health records. npj Digit. Med. 6, 135 (2023). PubMed PubMed Central Google Scholar Manyika, J. An overview of Bard: an early experiment with generative AI. https://ai.google/static/documents/google-about-bard.pdf (2023). Jiang, A. Q. et al. Mistral 7B, arXiv preprint arXiv:2310.06825, (2023). Neelakantan, A. et al. Text and code embeddings by contrastive pre-training, arXiv preprint arXiv:2201.10005 (2022). Krallinger, M. et al. Overview of the BioCreative VI chemical-protein interaction Track. In Proc. of the sixth BioCreative challenge evaluation workshop Vol. 1, 141–146 (2017). Chen, Q. et al. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations, Database 2022, baac069 (2022). Islamaj Doğan, R. et al. Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine, Database 2019, bay147 (2019). International Society for Biocuration, Biocuration: Distilling data into knowledge, Plos Biol., 16, e2002846 (2018). Li, J. et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction, Database, 2016 (2016). Doğan, R. I., Leaman, R. & Lu, Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. J. Biomed. Inform. 47, 1–10 (2014). PubMed PubMed Central MATH Google Scholar Li, X., Rousseau, J. F., Ding, Y., Song, M. & Lu, W. Understanding drug repurposing from the perspective of biomedical entities and their evolution: Bibliographic research using aspirin. JMIR Med. Inform. 8, e16739 (2020). PubMed PubMed Central Google Scholar Segura-Bedmar, I., Martínez, P. & Herrero-Zazo, M. Semeval-2013 task 9: extraction of drug-drug interactions from biomedical texts (ddiextraction 2013). In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proc. Seventh International Workshop on Semantic Evaluation (SemEval 2013) 341–350 (Association for Computational Linguistics, 2013). Du, J. et al. ML-Net: multi-label classification of biomedical texts with deep neural networks. J. Am. Med. Inform. Assoc. 26, 1279–1285 (2019). PubMed PubMed Central MATH Google Scholar Baker, S. et al. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics 32, 432–440 (2016). CAS PubMed MATH Google Scholar Kaddari, Z., Mellah, Y., Berrich, J., Bouchentouf, T. & Belkasmi, M. G. Biomedical question answering: A survey of methods and datasets. In 2020 Fourth International Conference On Intelligent Computing in Data Sciences (ICDS) 1–8 (IEEE, 2020). Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021). CAS MATH Google Scholar Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. Pubmedqa: A dataset for biomedical research question answering. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2567–2577 (EMNLP-IJCNLP, 2019). Cohan, A. et al. A discourse-aware attention model for abstractive summarization of long documents. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Languag Technologies Vol. 
2, 615–621 (2018). Devaraj, A., Wallace, B. C., Marshall, I. J. & Li, J. J. Paragraph-level simplification of medical texts. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 4972–4984 (Association for Computational Linguistics, 2021). Luo, Z., Xie, Q., & Ananiadou, S. Readability controllable biomedical document summarization. In Findings of the Association for Computational Linguistics: EMNLP, 4667–4680 (2022). Goldsack, T. et al. Overview of the biolaysumm 2024 shared task on lay summarization of biomedical research articles. In Proc. 23rd Workshop on Biomedical Natural Language Processing 122–131 (2024). Lewis, M. et al. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proc. 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880 (2020). Liu, P. et al. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35 (2023). MATH Google Scholar Hu, Y. et al. Improving large language models for clinical named entity recognition via prompt engineering, J. Am. Med. Inform. Assoc. 31, ocad259 (2024). Wang, L. et al. Investigating the impact of prompt engineering on the performance of large language models for standardizing obstetric diagnosis text: comparative study. JMIR Format Res. 8, e53216 (2024). Google Scholar Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing, 1998–2022 (2022). Keloth, V. K. et al. Advancing entity recognition in biomedicine via instruction tuning of large language models. Bioinformatics 40, btae163 (2024). CAS PubMed PubMed Central Google Scholar Hu, E. J. et al. Lora: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021). Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020). CAS PubMed PubMed Central MATH Google Scholar Lehman, E. et al. Do we still need clinical language models? In Conference on health, inference, and learning, 578–597 (PMLR, 2023). Chen, S. et al. Evaluating the ChatGPT family of models for biomedical reasoning and classification. J. Am. Med. Inform. Assoc. 31, ocad256 (2024). Chen, Q. et al. A comprehensive benchmark study on biomedical text generation and mining with ChatGPT, bioRxiv, pp. 2023.04. 19.537463 (2023). Zhang, S., Cheng, H., Gao, J. & Poon H. Optimizing bi-encoder for named entity recognition via contrastive learning. In Proc. 11th International Conference on Learning Representations, (ICLR, 2023). He, J. et al. Chemical-protein relation extraction with pre-trained prompt tuning. Proc IEEE Int. Conf. Healthc. Inform. 2022, 608–609 (2022). Mingliang, D., Jijun, T. & Fei, G. Document-level DDI relation extraction with document-entity embedding. pp. 392–397. Chen, Q., Du, J., Allot, A. & Lu, Z. LitMC-BERT: transformer-based multi-label classification of biomedical literature with an application on COVID-19 literature curation, IEEE/ACM Trans. Comput. Biol. Bioinform. 19, 2584–2595 (2022). Yasunaga, M. et al. Deep bidirectional language-knowledge graph pretraining. Adv. Neural Inf. Process. Syst. 35, 37309–37323 (2022). Google Scholar Flores, L. J. Y., Huang, H., Shi, K., Chheang, S. & Cohan, A. 
Medical text simplification: optimizing for readability with unlikelihood training and reranked beam search decoding. In Findings of the Association for Computational Linguistics: EMNLP, 4859–4873 (2023). Wei, C.-H. et al. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database 2016, baw032 (2016). PubMed PubMed Central Google Scholar He, J. et al. Prompt tuning in biomedical relation extraction, J. Healthcare Inform. Res. 8, 1–19 (2024). Guo, Z., Wang, P., Wang, Y. & Yu, S. Improving small language models on PubMedQA via Generative Data Augmentation, arXiv, 12 (2023). Koh, H. Y., Ju, J., Liu, M. & Pan, S. An empirical survey on long document summarization: Datasets, models, and metrics. ACM Comput. Surv. 55, 1–35 (2022). MATH Google Scholar Bishop, J. A., Xie, Q. & Ananiadou, S. LongDocFACTScore: Evaluating the factuality of long document abstractive summarisation. In Proc. of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) 10777–10789 (2024). Wang, L. L, DeYoung, J. & Wallace, B. Overview of MSLR2022: A shared task on multidocument summarization for literature reviews. In Proc. Third Workshop on Scholarly Document Processing 175–180 (Association for Computational Linguistics, 2022). Ondov, B., Attal, K. & Demner-Fushman, D. A survey of automated methods for biomedical text simplification. J. Am. Med. Inform. Assoc. 29, 1976–1988 (2022). PubMed PubMed Central MATH Google Scholar Stricker, J., Chasiotis, A., Kerwer, M. & Günther, A. Scientific abstracts and plain language summaries in psychology: A comparison based on readability indices. PLoS One 15, e0231160 (2020). CAS PubMed PubMed Central Google Scholar Download references This study is supported by the following National Institutes of Health grants: 1R01LM014604 (Q.C., R.A.A., and H.X), 4R00LM014024 (Q.C.), R01AG078154 (R.Z., and H.X), 1R01AG066749 (W.J.Z), W81XWH-22-1-0164 (W.J.Z), and the Intramural Research Program of the National Library of Medicine (Q.C., Q.J., P.L., Z.W., and Z.L). Open access funding provided by the National Institutes of Health. These authors contributed equally: Zhiyong Lu, Hua Xu. Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA Qingyu Chen, Xueqing Peng, Qianqian Xie, Xuguang Ai, Vipina K. Keloth, Kalpana Raja, Jimin Huang, Huan He, Fongci Lin & Hua Xu National Library of Medicine, National Institutes of Health, Bethesda, MD, USA Qingyu Chen, Qiao Jin, Po-Ting Lai, Zhizheng Wang & Zhiyong Lu McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA Yan Hu, Jingcheng Du & W. Jim Zheng Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA Aidan Gilson, Maxwell B. Singer & Ron A. 
Adelman. Division of Computational Health Sciences, Department of Surgery, Medical School, University of Minnesota, Minneapolis, MN, USA Rui Zhang. Center for Learning Health System Sciences, University of Minnesota, Minneapolis, MN, 55455, USA Rui Zhang. Q.C., Z.L., and H.X. designed the research. Q.C., Y.H., X.P., Q.X., Q.J., A.G., M.B.S., X.A., P.L., Z.W., V.K.K., K.P., J.H., H.H., F.L., and J.D. performed experiments and data analysis. Q.C., Z.L., and H.X. wrote and edited the manuscript. All authors contributed to the discussion and manuscript preparation. Correspondence to Zhiyong Lu or Hua Xu. Dr. Jingcheng Du and Dr. Hua Xu have research-related financial interests at Melax Technologies Inc. The remaining authors declare no competing interests. Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available. Chen, Q., Hu, Y., Peng, X. et al. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat Commun 16, 3280 (2025).
https://doi.org/10.1038/s41467-025-56989-2. Received: 17 November 2023. Accepted: 07 February 2025. Published: 06 April 2025.
Nature Communications volume 16, Article number: 3274 (2025). The UniProt database is a valuable resource for biocatalyst discovery, yet predicting enzymatic functions remains challenging, especially for low-similarity sequences. Identifying superior enzymes with enhanced catalytic properties is even harder. To overcome these challenges, we develop ESM-Ezy, an enzyme mining strategy leveraging the ESM-1b protein language model and similarity calculations in semantic space. Using ESM-Ezy, we identify novel multicopper oxidases (MCOs) with superior catalytic properties, achieving a 44% success rate in outperforming query enzymes (QEs) in at least one property, including catalytic efficiency, heat and organic solvent tolerance, and pH stability. Notably, 51% of the MCOs excel in environmental remediation applications, and some exhibit unique structural motifs and active centers that enhance their functions. Beyond MCOs, 40% of the L-asparaginases identified show higher specific activity and catalytic efficiency than their QEs. ESM-Ezy thus provides a promising approach for discovering high-performance biocatalysts with low sequence similarity, accelerating enzyme discovery for industrial applications. Enzymes are increasingly playing pivotal roles across diverse industries, including the food, agriculture, chemical, and pharmaceutical sectors. Despite the successful use of various enzymes, their catalytic properties often do not meet the stringent demands of diverse industrial applications. Directed evolution has been effectively employed to enhance enzymatic catalytic properties. However, the lack of high-throughput screening methods makes the process labor-intensive and costly in many cases. Thanks to advances in next-generation sequencing, UniProtKB now encompasses over 227 million protein sequences, including more than 214 million entries complemented by AlphaFold-predicted structures1. Therefore, discovering enzymes from UniProtKB could offer advanced biocatalysts ready for direct application, or prime candidates for subsequent directed evolution, potentially streamlining extensive follow-up engineering. Strategies for enzyme mining based solely on protein sequence information often lead to inaccurate function annotations despite extensive sequence collections from a wide array of organisms. Traditional bioinformatic tools excel at identifying protein domains and assigning gene ontology terms but face challenges in accurately predicting functions for enzymes with low sequence similarity to characterized ones2,3. Consequently, these enzymes remain underexplored, highlighting the need for a convenient strategy to accurately identify enzymes with low sequence similarities and to investigate the likelihood of discovering novel enzymes with enhanced properties from this pool. Deep learning presents a promising avenue for establishing connections between protein sequences and their functions, particularly for proteins exhibiting low sequence similarities to previously characterized entities.
By leveraging the structural similarities between protein sequences and natural language, the development of protein language models (PLMs) has emerged as a powerful approach to addressing the challenges in predicting protein structure and function4,5,6. One such model, evolutionary scale modeling (ESM-1b), is designed to learn an embedding space from extensive protein sequence databases, enabling the Euclidean distance within this space to reflect functional similarities7,8. These embeddings, numerical vectors derived from diverse amino acid sequences by PLMs, encapsulate critical protein properties9,10. Recently, ESMs have been successfully utilized in protein engineering11,12,13, enzyme function classification14, and remote homology prediction4. However, strategies for discovering enzymes with improved catalytic properties compared to existing ones are rarely developed15,16. Since low sequence similarity can indicate structural and functional novelty, potentially leading to breakthroughs in properties, we aimed to develop a highly accurate, sequence-based in silico tool for discovering enzymes with enhanced catalytic properties from collections with low sequence similarity. Multicopper oxidases (MCOs) are key biocatalysts widely used in the food and chemical industries. In this study, we developed ESM-Ezy, a deep learning strategy assisted by ESM-1b, to explore novel MCOs with low sequence similarity for improved catalytic properties at a high success rate. By fine-tuning ESM-1b with a small but high-quality dataset and selecting candidates based on shorter Euclidean distances to the query enzymes (QEs), we successfully identified new MCOs with low sequence similarity, mostly below 35%, exhibiting superior catalytic properties. Almost 89% of the tested MCOs catalyzed 2,2'-Azino-bis (3-ethylbenzothiazoline-6-sulfonic acid) diammonium salt (ABTS) oxidation successfully, with 44% of them showing enhancements in at least one catalytic property compared to the QEs, including higher catalytic efficiency, improved heat and organic solvent tolerance, and a broader pH range. We identified Sulfur as one of the most heat-resistant MCOs reported so far17,18, with a remarkable half-life of 156.9 min at 80 °C, and characterized Bfre, with a unique Cu-Mn heteroatom center, for the first time. Furthermore, we evaluated the performance of the newly discovered MCOs in mediator-free bioremediation applications. 53% of the MCOs decolorized Remazol Brilliant Blue R (RBBR) more efficiently than the QEs17. Talbi degraded chloramphenicol (CAP) at room temperature and outperformed previous fungal systems19. Additionally, Bcece demonstrated a capacity to degrade 39.4% of aflatoxin B1 (AFB1) after 48 h of incubation, exhibiting superior mediator-free degradation efficacy in comparison to other bacterial MCOs under mild conditions20. In addition to MCOs, 40% of the l-asparaginase enzymes discovered by ESM-Ezy outperformed their QE in terms of specific activity and catalytic efficiency. Overall, ESM-Ezy assisted us in enriching the libraries of MCOs with both improved catalytic activities and structural diversity, facilitating their application in various industrial settings. In this study, a two-stage strategy named ESM-Ezy was developed, involving fine-tuning and searching, to discover novel MCOs from an extensive sequence database (Fig. 1a).
Initially, sequence embeddings were extracted from the transformer layers of ESM-1b, and a binary classification layer was incorporated to distinguish MCOs from other sequences in the database. The fine-tuning stage utilized a high-quality positive dataset of 147 characterized MCOs from literature, along with a large negative dataset of 550,000 non-MCO sequences from Swiss-Prot. This process swiftly reached optimal accuracy and demonstrated significant robustness based on our analysis of five-fold cross-validation (Fig. S1, Tables S4 and S5). Additionally, dimensionality reduction employing the UMAP algorithm21 effectively displayed distinct clustering of MCOs, confirming the model's efficacy (Fig. 1b). a In the fine-tuning stage, the ESM-1b model was fine-tuned through binary classification on positive and negative data sets. In the searching phase, the Fine-tuned ESM-1b Backbone was used to generate query embeddings and candidate embeddings, and Euclidean distance in the embedding space was employed to identify the closest sequences for further validation. In the searching stage, the Binary Classification Head was omitted, and the ESM-1b Backbone from the fine-tuning stage was retained as the Fine-tuned ESM-1b Backbone. b After fine-tuning, the MCOs cluster (positive) became distinctly separated from the non-MCOs cluster (negative). c The embeddings of the selected sequences generated by the fine-tuned model clustered closely with the QEs. d The sequence and structure similarity matrix of the MCOs and QEs. The newly discovered enzymes exhibit low sequence similarity but are structurally conserved. Source data are provided as a Source Data file. Subsequently, to identify novel MCOs with enhanced catalytic properties in the searching stage, three representative MCOs, specifically Eclac from Escherichia coli K12 (UniProt: P36649)22, HR03 from Bacillus sp. HR03 (UniProt: B9W2C5)23 and DSM13 from Bacillus licheniformis DSM13 (UniProt: Q65MU7)24 were selected as QEs. These QEs, along with all sequences from UniRef50, were embedded into the fine-tuned ESM-1b backbone. Euclidean distances were calculated between the UniRef50 sequences and the QEs, leading to the selection of 18 neighboring sequences with similarity in protein semantic space for further analysis (Tables S2 and S3). It is important to note that the primary goal of the fine-tuned classification task in the first step is to learn a representation space for sequences, rather than merely predicting whether a sequence is positive or negative. Given that the UniRef50 database is extremely large, even applying stringent criteria to filter positive candidates yields thousands of sequences, which is impractical for wet lab experiments. To address this issue, we employed QEs to identify the nearest sequences within the fine-tuned representation space for wet lab experimentation. After fine-tuning, the selected candidates are much more closely positioned around the QEs compared to those in the pretrained and random ESM-1b model, which indicates the necessity of the fine-tuning step (Fig. 1c, Fig. S2). The majority of the selected MCOs showed low sequence similarities both to the QEs and among themselves, ranging from 25% to 35% (Fig. 1d). Despite low sequence similarities, high structure similarity (TM-score > 0.8) was observed (Fig. 1d). 
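As a concrete illustration of the searching stage described above, the sketch below embeds a query enzyme and candidate sequences with the public fair-esm package and ranks candidates by Euclidean distance. Mean-pooled per-residue representations, the placeholder sequences, and the use of the pretrained rather than the fine-tuned backbone are simplifying assumptions, not the released pipeline.

```python
# Minimal sketch of embedding-based candidate ranking by Euclidean distance.
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()  # downloads weights on first use
model.eval()
batch_converter = alphabet.get_batch_converter()

def embed(sequences):
    """Return one mean-pooled embedding vector per (name, sequence) pair."""
    _, _, tokens = batch_converter(sequences)
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]
    vectors = []
    for i, (_, seq) in enumerate(sequences):
        # average over residue positions, skipping the BOS token and anything after the sequence
        vectors.append(reps[i, 1 : len(seq) + 1].mean(0))
    return torch.stack(vectors)

# Placeholder sequences; in practice these would be the QE and UniRef50 entries.
query = [("QE", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
candidates = [("cand1", "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS"),
              ("cand2", "MKQHKAMIVALIVICITAVVAALVTRKDLCEVH")]

q_vec = embed(query)
c_vecs = embed(candidates)
distances = torch.cdist(q_vec, c_vecs).squeeze(0)  # Euclidean distances to the query
nearest = torch.argsort(distances)                 # closest candidates first
print([candidates[i][0] for i in nearest.tolist()])
```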
Enzymes with low sequence similarity but conserved structures are often considered to possess evolutionary novelty, and the selected enzymes and their mutants in this study have not been reported previously. Phylogenetic analysis indicated that the selected MCOs are distributed across various bacterial MCO groups and do not cluster closely with their respective QEs (Fig. S3). This suggests that MCOs with shorter Euclidean distances may not be closely related evolutionarily. Additionally, analysis of sequence similarity networks (SSN) within UniRef50 indicated that MCOs with close evolutionary relationships do not consistently exhibit short Euclidean distances (Fig. S4). Finally, AlphaFold2 structures of the selected MCOs and the QEs were used to analyze their structural relationships. A taxonomic tree (Fig. S5a) categorized the MCOs into six clades, with over half located in categories separate from those of the QEs. To determine whether this strategy could yield new MCOs with enhanced catalytic properties compared to the QEs, all selected MCOs were successfully expressed and purified using E. coli (BL21). We conducted a comprehensive and quantitative assessment of the catalytic properties of these MCOs using the standard ABTS oxidation reaction. Almost all MCOs effectively oxidized ABTS, with approximately 40% of the MCOs in each series demonstrating superior catalytic efficiency or increased thermal stability relative to their QEs (Fig. 2a, Table S6). Notably, several candidates stood out in this study: Sulfur, Bcece, Tocean, and Bfre. As shown in Fig. 2a, Scla exhibited a 3.0-fold longer half-life at 80 °C compared to DSM13 while maintaining higher kcat and specific activity values. Additionally, Tocean and Bfre demonstrated catalytic efficiencies 5.8 and 95.2 times higher than that of HR03, respectively, while maintaining comparable thermostability. Remarkably, Sulfur was found to be 32.9 times more active than Eclac and stands out as one of the most heat-tolerant MCOs reported to date20, with a half-life of 156.9 ± 9.0 min at 80 °C. Fig. 2: a Kinetic parameters of representative MCOs. b Profiles of relative activities under different pH. c Profiles of relative activity under different temperatures. d Profiles of relative activity under different organic solvents. The bar plots show mean ± standard deviation (n = 3 biological replicates). The query enzyme of each series is labeled in dark blue. Source data are provided as a Source Data file. Given that MCOs are often utilized in industrial settings with harsh conditions, we evaluated the optimal operating temperatures and pH, as well as the organic solvent tolerance, of both the newly discovered MCOs and their corresponding QEs. Like the QEs, Sulfur, Bcece, Tocean, and Bfre showed optimal activity at temperatures between 80 and 90 °C. Notably, Sulfur also maintained relatively high activity at lower temperatures, ranging from 30 to 50 °C (Fig. 2b). Bacterial MCOs typically catalyze ABTS oxidation under acidic conditions and lose activity as the pH increases. However, in our study, Sulfur and Mint achieved optimal activity at a pH of 5.0 and remained active even when the pH was as high as 8 (Fig. 2c). Furthermore, Sulfur, Tocean, Slac, and Bfre demonstrated exceptional resistance to organic solvents such as methanol, ethanol, acetonitrile, dimethyl sulfoxide (DMSO), and acetone. After a 2-day incubation, they retained at least approximately 80% of their initial activity in 50% (v/v) solutions of the tested organic solvents, surpassing the performance of both QEs.
The broader operational temperature and pH ranges, as well as the enhanced organic solvent tolerance, highlight the potential of these enzymes for robust industrial applications. Overall, Sulfur, Tocean, Bfre, Bcece, and Scla outperform both the QEs and most reported MCOs, excelling in at least two factors among catalytic efficiency, heat and pH tolerance, and organic solvent resistance. Among them, Sulfur significantly outperforms all QEs across all catalytic properties. We resolved the crystal structures of Bfre (PDB: 8Z5B) and Sulfur (PDB: 8Z59) because of their exceptional catalytic properties. The X-ray diffraction and structural refinement statistics are listed in Tables S9 and S10. As illustrated in Fig. 3, both Bfre and Sulfur have Greek key β-barrel domains interconnected by α-helices and extensive coiled sections. They also have a highly conserved mononuclear copper ion center (T1 Cu) essential for substrate oxidation and a Cu-Cys-His pathway facilitating electron transfer. While these structural features are typical of MCOs from E. coli and Bacillus species25,26, Bfre and Sulfur possess unique characteristics that warrant further investigation. Fig. 3: a The crystal structures of Bfre and Sulfur, including their active centers. b Superimposition of Sulfur (PDB: 8Z59, colored in orange) with Eclac (AF-P36649-F1-model_v4, colored in cyan). The loop regions of Ile335–Val340 (Sulfur) and Asp333–Ala384 (Eclac) are highlighted, with the B-factors of the regions represented as thickness. Source data are provided as a Source Data file. The conventional active center of an MCO comprises four copper ions, including a T1 Cu and a trinuclear Cu cluster27. The trinuclear Cu cluster consists of one T2 Cu ion and two T3 Cu ions, which bind and reduce molecular oxygen to water28. In contrast, Bfre's active center contains only three metal ions: one T1 Cu and a unique diatomic center composed of one Cu ion and one Mn ion. Atomic absorption spectroscopy revealed that the MCO from Trametes hirsuta LG-9 contains copper and manganese, but it could not pinpoint the exact location of these metals29. Meanwhile, the PDB database has not yet included any MCOs with heteronuclear active centers. Therefore, this unique configuration distinguishes Bfre from all previously documented MCO active centers30 (Fig. 3a and Figs. S6 and S9). Sulfur is structurally similar to Eclac, with a TM-score of 0.92, despite a low sequence similarity of 28.49% (Fig. 3a). A detailed structural comparison between Sulfur and Eclac reveals a significant difference in a loop region, where Eclac contains an additional 36 amino acid residues compared to Sulfur. Molecular dynamics simulations were conducted on both enzymes to further investigate the impact of this region on their properties. Root mean square fluctuation (RMSF) analysis indicated notable instability within the D333–A384 loop region of Eclac (0.07 < RMSF < 0.34), whereas the shorter I335–V340 loop in Sulfur exhibited greater stability (0.10 < RMSF < 0.18) (Fig. S8). Additionally, the B-factor of the I335–V340 loop in Sulfur is lower than that of the D333–A384 loop in Eclac (Fig. 3b), indicating reduced mobility in Sulfur's corresponding region. The contribution of shorter loops to enhanced thermal stability in MCOs has not been addressed in other studies. Furthermore, Sulfur features 1.6 times more salt bridges than Eclac (Table S11), underscoring its exceptional thermostability31.
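For readers who want to reproduce a per-residue fluctuation analysis of this kind, the sketch below uses MDAnalysis on a hypothetical topology/trajectory pair; the simulation package, file names, and atom selections are illustrative assumptions and not the setup used in this study.

```python
# Minimal sketch of a per-residue RMSF calculation with MDAnalysis.
import MDAnalysis as mda
from MDAnalysis.analysis import align, rms

u = mda.Universe("sulfur.pdb", "sulfur_md.xtc")   # hypothetical input files
ref = mda.Universe("sulfur.pdb")

# Align the trajectory on C-alpha atoms so RMSF reflects internal fluctuations.
align.AlignTraj(u, ref, select="protein and name CA", in_memory=True).run()

# Compute RMSF per C-alpha atom, then inspect the loop of interest (I335-V340).
calphas = u.select_atoms("protein and name CA")
rmsf = rms.RMSF(calphas).run()
for atom, value in zip(calphas, rmsf.results.rmsf):
    if 335 <= atom.resid <= 340:
        print(atom.resname, atom.resid, round(float(value), 3))
```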
To assess the efficacy of this strategy in identifying more efficient enzymes for potential industrial use, we selected enzymes from each group and assessed their catalytic performance in key bioremediation applications: organic dye decolorization, antibiotic degradation, and toxin degradation. Specifically, we evaluated remazol brilliant blue R (RBBR), chloramphenicol (CAP), and aflatoxin B1 (AFB1) as representatives of each category. In contrast to traditional methods, we performed the bioremediation tests without a mediator to reduce unnecessary pollution. Our findings showed that multiple MCOs in each series outperformed the corresponding QEs in removing these environmentally harmful compounds (Fig. 4). Fig. 4: a Evaluation of the DSM13, Eclac, and HR03 series on RBBR decolorization (from left to right). b Comparison of chloramphenicol degradation by the HR03 series. c Comparison of aflatoxin B1 degradation by the DSM13 series. d Live/dead cell staining evaluation of HepG2 cells with aflatoxin B1 (Calcein AM: green, PI: red). e Cell viability assessment with different concentrations of aflatoxin B1. The line, bar, and violin plots show mean ± standard deviation (n = 3 biological replicates). Asterisk (*) denotes statistical significance (P < 0.05, one-tailed test). Exact p-values: 10 mM, P = 0.096; 50 mM, P = 0.002. The QE in each series is highlighted in a distinct color, with Ctrl representing control groups. Source data are provided as a Source Data file. RBBR, an anthraquinone dye commonly used in the textile industry, harms aquatic and vegetative life. In a two-hour test, all MCOs from the three series catalyzed rapid decolorization within the first 10 min (Fig. 4a), outperforming some reported bacterial and fungal MCOs32,33,34. Each series featured multiple new enzymes that achieved faster decolorization rates and higher decolorization percentages than Eclac, DSM13, and HR03. Notably, in the DSM13 series, three out of five MCOs showed superior performance despite DSM13 itself exhibiting minimal activity. Sulfur decolorized 46.7% of RBBR within 10 min, marking it as the most efficient MCO reported to date for mediator-free RBBR decolorization22,34,35. Furthermore, CAP is known for its stability even at elevated temperatures18. All tested MCOs degraded CAP at room temperature. In the HR03 series, four out of five MCOs performed better than the QE, with Talbi achieving the highest degradation percentage of approximately 24.3% within 48 h (Fig. 4b), surpassing the performance of fungal MCO-catalyzed systems in the presence of mediators19. We also evaluated the degradation efficiency of AFB1, a major agricultural toxin. In the Eclac series, Psoli, Mint, and Faest outperformed the query, degrading nearly 33.5% of AFB1 after 48 h of treatment. In the DSM13 series, all candidates showed better degradation capacity than the query, with Bcece showing the highest capacity at 39.4% (Fig. 4c). To assess the cytotoxicity of AFB1 and its degradation products, we incubated HepG2 cells with media supplemented with 10 mM or 50 mM AFB1 that had been treated with or without Bcece. After a 48-h incubation period, there was an observed increase in the viability of HepG2 cells, with the survival rate rising from 116.1 ± 5.6% to 126.4 ± 6.1% (10 mM) and from 27.6 ± 3.4% to 65.3 ± 8.5% (50 mM) (Fig. 4d, e). This increase indicates a reduction in the cytotoxic effects associated with the degradation products of AFB135. Researchers often randomly select enzymatic candidates from clusters identified through SSN analysis.
However, this trial-and-error approach is generally inefficient and can prolong the discovery process. In addition, although PLMs have recently been successfully used to guide protein engineering12, antibody design11, enzyme functional assessment (such as CLEAN36 and TM-Vec14), and remote homology detection (such as PLMSearch)4,14, they have not been used to explore the catalytic properties of enzymes in low-sequence-homology regions. ESM-Ezy addresses these previously unmet challenges. In this study, ESM-Ezy uses a fine-tuned ESM-1b model and similarity calculations in protein semantic space to efficiently discover novel MCOs with enhanced catalytic properties at a high success rate. Overall, 44% of the selected MCOs outperformed the QEs and surpassed most previously reported MCOs in at least one property, including catalytic efficiency, heat and organic solvent tolerance, and pH stability. Notably, the enzyme Sulfur significantly outperformed all QEs across all evaluated catalytic properties. For bioremediation applications, approximately 44% of the MCOs decolorized RBBR more efficiently, while 22% and 33% of the MCOs demonstrated superior CAP and AFB1 degradation capabilities, respectively. For comparison, five sequences located in the same SSN cluster as Eclac but remote from it in Euclidean distance according to the ESM-Ezy analysis were tested, and no activity was detected (Fig. S4, Table S7). This indicates that ESM-Ezy improves the likelihood of identifying candidates with enhanced catalytic properties compared to conventional SSN-based strategies.

We found that MCO representations from randomly initialized, pretrained, and fine-tuned models clustered increasingly closely with the QEs. This indicates that, in well-trained models, Euclidean distance in the representation space reflects more semantic information related to MCOs (Fig. S2). These results are consistent with findings from other studies14,37 and suggest that Euclidean distance could serve as a distinct metric for assessing enzyme functionality, separate from methods based on structural and sequence similarities alone. Moreover, it is noteworthy that the majority of MCOs with improved catalytic properties shared only 25–35% sequence similarity with the QEs. For example, Sulfur (TM-score: 0.91, identity: 0.26), Scla (TM-score: 0.82, identity: 0.32), Bfre (TM-score: 0.96, identity: 0.39), Tocean (TM-score: 0.85, identity: 0.36), and Bcece (TM-score: 0.57, identity: 0.27) demonstrate low sequence similarity to their corresponding QEs. This suggests that enzymes with breakthrough properties might often be found in regions characterized by low sequence similarity but high structural conservation.

To evaluate the general applicability of ESM-Ezy, we applied it to l-asparaginase (l-asparagine amidohydrolase; EC 3.5.1.1), which catalyzes the hydrolysis of l-asparagine into ammonia and l-aspartic acid38. This enzyme has shown significant therapeutic potential, particularly in the treatment of childhood acute lymphoblastic leukemia39. As described above, we fine-tuned ESM-1b to search for l-asparaginases (Table S12). We selected an l-asparaginase (UniProt: O34482) from Bacillus subtilis 168 as the QE and identified five candidate enzymes. Compared to the QE, A0A3N5F6J4 and H1D2G7 exhibited approximately 2.0-fold and 4.1-fold higher specific activity and 2.0-fold and 3.0-fold higher kcat, respectively (Table S13).
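In practice, the candidate search underlying these results reduces to embedding sequences with ESM-1b and ranking them by Euclidean distance to the QE embedding. The sketch below illustrates that idea with the publicly released ESM-1b weights and mean-pooled per-residue representations; the placeholder sequences, the pooling choice, and the use of pretrained rather than fine-tuned weights are assumptions for illustration, not the exact ESM-Ezy pipeline.

    # Illustrative sketch: rank candidates by Euclidean distance to a query enzyme
    # in ESM-1b representation space (fair-esm package; pretrained weights used here,
    # whereas ESM-Ezy uses a fine-tuned checkpoint).
    import torch
    import esm

    model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    def embed(named_seqs):
        """Mean-pooled layer-33 embedding for each (name, sequence) pair."""
        _, _, tokens = batch_converter(named_seqs)
        with torch.no_grad():
            out = model(tokens, repr_layers=[33])
        reps = out["representations"][33]
        pooled = [reps[i, 1:len(seq) + 1].mean(0)      # drop BOS/EOS positions
                  for i, (_, seq) in enumerate(named_seqs)]
        return torch.stack(pooled)

    # Placeholder sequences; real inputs would be the QE and the predicted-positive UniRef50 hits.
    query = [("QE", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
    candidates = [("cand1", "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQWERPSG"),
                  ("cand2", "MSGRGKGGKGLGKGGAKRHRKVLRDNIQGITKPAIR")]

    dists = torch.cdist(embed(query), embed(candidates)).squeeze(0)  # Euclidean distances
    ranking = dists.argsort()                                        # nearest candidates first
    print([candidates[i][0] for i in ranking], dists[ranking])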
The l-asparaginase results suggest that ESM-Ezy can successfully identify high-performing enzymes beyond oxidoreductases. In conclusion, ESM-Ezy integrates ESM-1b with experimental validation to study enzymes with low sequence similarities, leading to the successful identification of novel, high-performing MCOs and l-asparaginases. This approach suggests that combining PLMs with Euclidean-distance calculations to explore low-sequence-similarity space is a promising strategy for discovering high-performance enzymes and uncovering new enzymes with distinctive structural features. The technique can potentially be extended to other enzyme families, speeding up the discovery of innovative biocatalysts with superior properties.

All chemicals used in this study were of analytical grade or higher. Plasmids for the expression of the various multicopper oxidases (MCOs) were synthesized by SynbioB (Tianjin, China). LB broth powder, isopropyl β-d-1-thiogalactopyranoside (IPTG, Cat. No. A600168), and kanamycin (Cat. No. A506636) were obtained from Sangon Biotech (Shanghai, China). 2,2'-Azino-bis(3-ethylbenzothiazoline-6-sulfonic acid) diammonium salt (ABTS, Cat. No. A109612) and remazol brilliant blue R (RBBR, Cat. No. R169089) were obtained from Aladdin Biotech (Shanghai, China). Chloramphenicol (Cat. No. A600118) and aflatoxin B1 (Cat. No. A832707) were obtained in HPLC grade from Solarbio (Beijing, China) and Innochem (Beijing, China), respectively. Dulbecco's Modified Eagle Medium (DMEM, Gibco™), fetal bovine serum (FBS, Gibco™), and penicillin–streptomycin (Gibco™) were purchased from Thermo Fisher (USA).

In our ESM-Ezy strategy, the objective of the fine-tuning stage differs slightly from that of other methods, as it focuses on learning a representation space for sequences rather than simply predicting whether a sequence is positive or negative. Given the vast size of the UniRef50 dataset, even with stringent criteria to filter positives, the number of resulting candidates remains in the thousands, rendering wet-lab experiments impractical due to the sheer volume. To mitigate this challenge, during the searching stage we used the QEs to identify the nearest sequences within the fine-tuned representation space from the previous stage, thereby obtaining a manageable number of sequences for wet-lab experimentation.

The positive dataset of MCOs consisted of 147 sequences reported in the literature. This dataset was randomly divided into training and test subsets of 117 sequences (~80%) and 30 sequences (~20%), respectively. The negative dataset of MCOs was sampled from the Swiss-Prot database (Release 2022_02), based on the assumption that unlabeled entries are likely to be negatives, a premise supported by findings in both the recommendation-system40,41 and natural language processing42 literature. The sequences in Swiss-Prot have been expertly reviewed; thus, sequences without MCO labels and with low sequence similarity to known MCOs are very likely to be negatives. Sequences labeled as MCOs and those with sequence identities exceeding 40% relative to the positive set were excluded, resulting in a total of 567,235 sequences. From this adjusted negative set, 1000 sequences were randomly selected to form the test negative set, and the remaining 566,235 sequences were designated as the training negative set. We conducted a fivefold cross-validation to validate the robustness of our pipeline.
For the cross-validation, the positive dataset was split into five folds, while the negative training and test sets were kept unchanged. In each experiment, one fold of the positive dataset was set aside as the test set, and the remaining folds were used as the training set. We then measured the ROC-AUC score, accuracy, precision, recall, and F1-score under this setup (Table S5). The high average ROC-AUC of 0.9838 and F1-score of 0.9787 indicate the strong robustness of the pipeline.

The UniRef50 database was used as our candidate pool. We used the fine-tuned binary classification model to sift through the database and identify predicted positive sequences. Subsequently, we used their representations to calculate Euclidean distances within the high-dimensional semantic space, facilitating the comparison between QEs and the candidate sequences. This method allowed for the efficient identification of potential biocatalysts by assessing their proximity in a semantic landscape shaped by protein function and structure.

To balance the large volume of negative data against the smaller positive sample set, a dynamic negative sampling strategy43 was adopted. Specifically, for each epoch we sampled a number of negative samples equal to the number of positive samples (117 for training) from the negative sample pool of the adjusted Swiss-Prot training set (566,235 sequences). These were then shuffled with the positive samples for training. This approach ensures that the model is exposed to new negative samples in each training epoch rather than a fixed negative set, thus addressing the data imbalance and enhancing the robustness of the training process. With dynamic negative sampling, even if a few unknown positives were present in the negative sample pool, the likelihood of their being sampled during training would be extremely low due to their rarity. During training, the learning rate was set to \(1\times {10}^{-5}\) and the Adam optimizer was used44.

We used the TM-score to quantify structural similarity and sequence identity to measure sequence similarity. TM-scores were calculated with TM-align45, a structural alignment program designed to compare two proteins with potentially differing sequences. Sequence identities between all sequence pairs were calculated with the Bio.pairwise2 module46.

Plasmids and strains are listed in Table S1. Synthetic genes were inserted into the BamHI–HindIII sites of plasmids pET-28a or pET-28a-sumo. Kanamycin (50 μg/mL) was added to the growth media when necessary. E. coli BL21(DE3) cultures transformed with the expression plasmids were grown overnight and then inoculated into LB medium supplemented with 1 mM CuCl2 and incubated at 37 °C with continuous shaking (250 rpm) until late logarithmic phase. Expression of the target protein was induced by adding IPTG (final concentration 0.1 mM), followed by further incubation at 16 °C for 16 h. After centrifugation at 8000×g for 10 min at 4 °C, cells were resuspended in Tris-HCl buffer (20 mM Tris-HCl, 500 mM NaCl, pH 7.6) and subjected to sonication. The resulting crude extract was clarified by additional centrifugation to remove cell debris. All purification steps were performed at 4 °C. The crude enzymes were purified with an IMAC column (HisTrap™ HP, 5 mL, Cytiva) on an FPLC system (ÄKTA™ Pure, Cytiva).
After washing with buffer A (20 mM Tris-HCl, 300 mM NaCl, 20 mM imidazole, pH 7.6), the target enzyme was eluted with a linear gradient of buffer B (20 mM Tris-HCl, 300 mM NaCl, 500 mM imidazole, pH 7.6). The purified enzymes were concentrated and desalted by ultrafiltration into an exchange buffer (300 mM NaCl, 20 mM Tris-HCl, pH 7.6). The samples were then analyzed by SDS-PAGE. Enzyme concentration was determined with the Pierce™ BCA Protein Assay Kit47.

The specific activity of the MCOs was evaluated at 37 °C using ABTS as the substrate. The assay solution comprised 40 mM ABTS in citrate-NaOH buffer (pH 4.0, 50 mM). The increase in absorbance at 420 nm per minute resulting from ABTS oxidation (ε420 = 36,000 M−1 cm−1) was recorded after the addition of 0.1–1 μM enzyme, ensuring a linear increase in absorbance. One unit of enzyme activity was defined as the oxidation of 1 μmol of substrate per minute, and specific activity was calculated as units per milligram of protein. Enzyme activity was determined from the initial linear phase of the reaction curve. The kinetic parameters Km and kcat of the recombinant enzymes were determined by measuring enzymatic activity over a gradient of ABTS concentrations (0.1–5 mM). A Lineweaver–Burk plot was used to fit the experimental data and calculate Km and kcat.

The optimal pH for MCO activity was determined at 37 °C using 50 mM citrate–phosphate buffer spanning a pH range of 3.0–8.0. The change in absorbance at 420 nm per minute was measured to determine enzyme activity, and the highest activity was set to 100% for calculating the relative activity at each pH value. The optimum temperature for each enzyme was determined across a range of 40 °C to 90 °C, with the maximum activity of each enzyme set to 100% to calculate the relative activity at each temperature.

The half-life of each enzyme was assessed by incubating the purified enzyme (1 mg/mL) in an 80 °C water bath and sampling at regular intervals. Residual activity (ΔA420/min/mg protein) was measured, and the activity ratio was calculated. Thermostability was evaluated at 80 °C. The experimental data were fitted to the first-order inactivation equation \(Y=100\times {e}^{-{k}_{d}X}\), where Y is the percentage of residual activity, X is the incubation time, and kd is the inactivation rate constant. The half-life was then calculated as \({t}_{1/2}=\ln 2/{k}_{d}\).

The impact of various chemicals on enzyme activity was evaluated by preincubating the enzyme (1 mg/mL) with organic solvents at 50% concentration (methanol, ethanol, acetone, acetonitrile, and DMSO) for 48 h at room temperature. Enzyme tolerance to organic solvents was assessed as the relative residual activity of treated versus untreated samples.

For crystallization, enzymes were purified using a fast protein liquid chromatography (FPLC) system equipped with an immobilized metal affinity chromatography (IMAC) column (HisTrap™ HP) and a size-exclusion column (Superdex™). The purified enzymes were subjected to crystallization trials using commercial crystal screening kits (PEGRx and ProPlex). In each well of a 96-well plate (SWISSCI 3drop), equal volumes (150 nL) of enzyme solution and reservoir solution were mixed using the sitting-drop vapor-diffusion method at 16 °C. Initial crystallization hits were observed under several conditions after 7–10 days. Following several rounds of optimization, the best-quality single protein crystals were harvested under varying conditions.
For example, crystals of Sulfur were obtained from a condition containing 1% w/v tryptone, 0.001 M sodium azide, 0.05 M HEPES sodium pH 7.0, and 12% w/v polyethylene glycol 3350, while crystals of Bfre were grown in 0.1 M Tris pH 8.5 and 20% v/v ethanol. For X-ray diffraction data collection, crystals of Bfre and Sulfur were briefly soaked in their respective reservoir solutions supplemented with 25% glycerol for cryoprotection. The crystals were then mounted on loops and flash-frozen in liquid nitrogen. X-ray diffraction data were collected at the Shanghai Synchrotron Radiation Facility and processed using the HKL2000 software48. The data-collection statistics are detailed in Supplementary Tables S8–S10. The closest homologous structure to Sulfur (PDB entry 6SYY49) exhibited 43.9% sequence identity and 89.8% coverage; for Bfre, the closest structure (PDB entry 2FQG) showed 38.4% sequence identity and 77.5% coverage. The structures were solved by molecular replacement with the Phenix software suite50, using 6SYY and 2FQG as search models. Initial phases were refined by rigid-body refinement, followed by manual model rebuilding in COOT51. Subsequent rounds of refinement were conducted using the Phenix suite. The final coordinates and structure factors for Bfre and Sulfur have been deposited in the Protein Data Bank under accession codes 8Z5B and 8Z59, respectively.

Because certain segments are absent from the available Eclac crystal structures (1KV7, 2FQD, 5YS1), the three-dimensional model of Eclac was generated with AlphaFold52,53. The AlphaFold model aligned well with the available crystal structures, and its low RMSD further validated the model's accuracy (Fig. S7). MD simulations were therefore carried out for Eclac (AlphaFold: AF-P36649-F1-model_v4) and Sulfur (PDB: 8Z59) using GROMACS 2022.354,55,56. Each structure was placed in a box with a 0.8 nm margin, solvated with TIP3P water molecules57, and subjected to 2000 steps of steepest-descent energy minimization. The systems were then equilibrated and run for 50 ns with a 2 fs time step in the NPT ensemble at 310.15 K under the Amber ff14SB force field58; each simulation was replicated three times. Trajectories were analyzed with the GROMACS built-in tools. Structures were visualized with the PyMOL Molecular Graphics System (www.pymol.org).

A solution of RBBR (10 mM in 50 mM sodium citrate buffer, pH 4.0) was prepared. For the experiments, RBBR solution was added to each well (7 μL/well, giving a final concentration of 100 μM), followed by the addition of MCO solution (693 μL/well, 1 μM in 50 mM sodium citrate buffer). Each MCO candidate was tested in four wells, with supernatants sampled at 10, 30, 60, and 120 min. The reactions were conducted in deep-well 96-well plates incubated in an 85 °C water bath. All experiments were performed in triplicate. RBBR decolorization was measured using a UV–vis plate reader (Bio-Tek H1). To determine the maximal absorbance wavelength of RBBR, a solution of the dye (200 μL/well, 100 μM in 50 mM sodium citrate buffer) was transferred to 96-microwell plates and subjected to a spectral scan, which confirmed a maximal absorbance at 594 nm.
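The decolorization percentages plotted in Fig. 4a are derived from these A594 readings. As a simple worked example, one common definition computes the relative drop in absorbance with respect to the initial reading; the sketch below uses placeholder values, not measured data, and is not necessarily the authors' exact calculation.

    # Illustrative sketch: RBBR decolorization (%) from A594 readings over time,
    # computed relative to the initial absorbance. All values are placeholders.
    import numpy as np

    time_min = np.array([0, 10, 30, 60, 120])
    a594 = np.array([1.05, 0.56, 0.41, 0.33, 0.27])   # illustrative readings for one MCO

    decolorization_pct = (a594[0] - a594) / a594[0] * 100.0
    for t, d in zip(time_min[1:], decolorization_pct[1:]):
        print(f"{t:>3d} min: {d:5.1f}% decolorized")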
To assess the degradation of chloramphenicol by the MCOs, a reaction mixture was prepared by adding 677.4 μL of MCO solution (1 μM in 50 mM sodium citrate buffer, pH 4.0) to a 2 mL crimp vial, followed by the addition of 22.6 μL of chloramphenicol stock solution (1 mg/mL in ethanol). The vial was then agitated at 250 rpm at room temperature for 48 h. To terminate the reaction, each sample was heated at 95 °C for 20 min. Subsequently, the reaction mixture was filtered through a 0.22 μm PES membrane to remove the denatured enzyme. All experiments were conducted in triplicate. The degradation of chloramphenicol was analyzed by high-performance liquid chromatography (HPLC, Shimadzu LC-20AT), including samples from both the control (untreated with MCOs) and the MCO-treated groups. The mobile phase consisted of water, methanol, and acetic acid in a ratio of 55:45:0.1 (v/v/v). Chromatographic separation was achieved using an analytical reverse-phase C18 column (ZORBAX Eclipse Plus C18, 95 Å, 3.0 × 150 mm, 5 µm) equipped with a guard column (ZORBAX Eclipse Plus C18, 95 Å, 4.6 × 12.5 mm, 5 µm). The column temperature was maintained at 30 °C, the injection volume was 10 μL, the flow rate was 0.5 mL/min, and detection was performed at 278 nm.

For AFB1 degradation, MCO solution (697.8 μL, 1 μM in 50 mM sodium citrate buffer, pH 4.0) and aflatoxin B1 stock solution (2.2 μL, 1 mg/mL in DMSO) were added to a 2 mL crimp vial, and the vial was shaken at 250 rpm at room temperature for 48 h. All experiments were performed in triplicate. Following incubation, the control (not treated with MCOs) and MCO-treated samples were analyzed by HPLC (Shimadzu LC-20AT) using acetonitrile:methanol:water (1:1:2, v/v/v) as the mobile phase and an analytical reverse-phase C18 column (ZORBAX Eclipse Plus C18, 95 Å, 3.0 × 150 mm, 5 µm) with a guard column (ZORBAX Eclipse Plus C18, 95 Å, 4.6 × 12.5 mm, 5 µm) as the stationary phase. The column oven was maintained at 30 °C, the injection volume was 10 μL, the flow rate was 0.5 mL/min, and the analyte was detected at 365 nm. To further verify the degradation effect of the MCO treatment, the experiments were repeated with AFB1 samples at concentrations of 10 and 50 µM. Following MCO treatment, the samples were extracted with an equal volume of chloroform, dried under nitrogen gas, and redissolved in DMSO at one-tenth of the original chloroform volume. The samples were thoroughly vortexed to ensure complete dissolution for further analysis.

The cytotoxicity of AFB1 was tested using HepG2 cells (HB-8065), a human hepatocellular carcinoma cell line obtained from ATCC (Manassas, VA, USA)59,60. Cells were cultured in flasks in DMEM containing 10% FBS, 100 U/mL penicillin, and 0.1 mg/mL streptomycin, and maintained in a 5% CO2 incubator at 37 °C. Upon reaching approximately 90% confluence, HepG2 cells were harvested by washing with phosphate-buffered saline and trypsinization, then resuspended and diluted in DMEM (1:3 ratio). Cells were passaged weekly until they were stable enough for testing. For the cell viability test, extracts of the different AFB1 concentrations following Bcece treatment were mixed with culture media to a final DMSO concentration of 1% and used as the experimental groups. Cell viability was measured with a cell counting kit (CCK-8, Beijing Lablead Biotech), which quantifies live cells via the formazan produced by mitochondrial dehydrogenases.
The cells were incubated with the corresponding media for 24 h, and cell viability was measured according to the manufacturer's instructions. Fluorescence imaging was performed with a standard Calcein AM/PI assay kit (Beijing Lablead Biotech), using Calcein AM to identify metabolically active cells and PI to assess cell membrane integrity. After incubation with the Calcein AM and PI solution, samples were observed under a fluorescence microscope (Keyence BZ-X810) at 490 ± 10 nm; living and dead cells were stained green (Calcein AM) and red (PI), respectively.

All experiments were conducted at least three times, and error bars in the figures represent the standard errors. Statistical analysis was performed using one-way analysis of variance (ANOVA) followed by a one-tailed t-test. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

The atomic coordinates and associated density maps have been deposited in the Protein Data Bank (PDB) under accession codes 8Z5B and 8Z59. Molecular dynamics simulation trajectory files have also been provided and can be downloaded from https://doi.org/10.5281/zenodo.14808161. The AlphaFold structure of Eclac (UniProt P36649) was used in this study. All data that support the findings of this study are provided in the Supplementary Information. Source data are provided with this paper.

The code used to develop the model, perform the analyses, and generate the results in this study is publicly available in the ESM-Ezy repository under the MIT license. The specific version of the code associated with this publication is archived in Zenodo61.

1. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
2. de Crécy-Lagard, V. et al. A Roadmap for the Functional Annotation of Protein Families: A Community Perspective (Oxford University Press, UK, 2022).
3. Shi, Z. et al. Data-driven synthetic cell factories development for industrial biomanufacturing. BioDesign Res. 2022, 9898461 (2022).
4. Liu, W. et al. PLMSearch: protein language model powers accurate and fast sequence search for remote homology. Nat. Commun. 15, 2775 (2024).
5. Bepler, T. & Berger, B. Learning the protein language: evolution, structure, and function. Cell Syst. 12, 654–669 (2021).
6. Su, J. et al. SaProt: protein language modeling with structure-aware vocabulary. The Twelfth International Conference on Learning Representations (2024).
7. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
8. Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
9. Hie, B. L., Yang, K. K. & Kim, P. S. Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. Cell Syst. 13, 274–285 (2022).
10. Hsu, C., Nisonoff, H., Fannjiang, C. & Listgarten, J. Learning protein fitness models from evolutionary and assay-labeled data. Nat. Biotechnol. 40, 1114–1122 (2022).
11. Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. 42, 275–283 (2024).
12. He, Y. et al.
Protein language models-assisted optimization of a uracil-N-glycosylase variant enables programmable T-to-G and T-to-C base editing. Mol. Cell 84, 1257–1270 (2024).
13. Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T. & Rives, A. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021).
14. Hamamsy, T. et al. Protein remote homology detection and structural alignment using deep learning. Nat. Biotechnol. 1, 1–11 (2023).
15. De Crécy-Lagard, V. et al. A roadmap for the functional annotation of protein families: a community perspective. Database (Oxford) 2022, 1–16 (2022).
16. Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nat. Biotechnol. 40, 932–937 (2022).
17. Zhang, C., Diao, H., Lu, F., Bie, X., Wang, Y. & Lu, Z. Degradation of triphenylmethane dyes using a temperature and pH stable spore laccase from a novel strain of Bacillus vallismortis. Bioresour. Technol. 126, 80–86 (2012).
18. Hirose, J., Nasu, M. & Yokoi, H. Reaction of substituted phenols with thermostable laccase bound to Bacillus subtilis spores. Biotechnol. Lett. 25, 1609–1612 (2003).
19. Navada, K. K. & Kulal, A. Enzymatic degradation of chloramphenicol by laccase from Trametes hirsuta and comparison among mediators. Int. Biodeterior. Biodegrad. 138, 63–69 (2019).
20. Bian, L., Zheng, M., Chang, T., Zhou, J. & Zhang, C. Degradation of aflatoxin B1 by recombinant laccase extracellular produced from Escherichia coli. Ecotoxicol. Environ. Saf. 244, 114062 (2022).
21. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. J. Open Source Softw. 3, 861 (2018).
22. Ma, X. et al. High-level expression of a bacterial laccase, CueO from Escherichia coli K12, in Pichia pastoris GS115 and its application on the decolorization of synthetic dyes. Enzym. Microb. Technol. 103, 34–41 (2017).
23. Mollania, N., Khajeh, K., Ranjbar, B. & Hosseinkhani, S. Enhancement of a bacterial laccase thermostability through directed mutagenesis of a surface loop. Enzym. Microb. Technol. 49, 446–452 (2011).
24. Koschorreck, K., Richter, S. M., Ene, A. B., Roduner, E., Schmid, R. D. & Urlacher, V. B. Cloning and characterization of a new laccase from Bacillus licheniformis catalyzing dimerization of phenolic acids. Appl. Microbiol. Biotechnol. 79, 217–224 (2008).
25. Akter, M. et al. Biochemical, spectroscopic and X-ray structural analysis of deuterated multicopper oxidase CueO prepared from a new expression construct for neutron crystallography. Acta Crystallogr. Sect. F 72, 788–794 (2016).
26. Li, J., Liu, Z., Zhao, J., Wang, G. & Xie, T. Molecular insights into substrate promiscuity of CotA laccase catalyzing lignin-phenol derivatives. Int. J. Biol. Macromol. 256, 128487 (2024).
27. Guan, Z. B., Luo, Q., Wang, H. R., Chen, Y. & Liao, X. R. Bacterial laccases: promising biological green tools for industrial applications. Cell. Mol. Life Sci. 75, 3569–3592 (2018).
28. Brugnari, T. et al. Laccases as green and versatile biocatalysts: from lab to enzyme market, an overview. Bioresour. Bioprocess. 8, 1–29 (2021).
29. Haibo, Z., Yinglong, Z., Feng, H., Peiji, G. & Jiachuan, C.
Purification and characterization of a thermostable laccase with unique oxidative characteristics from Trametes hirsuta. Biotechnol. Lett. 31, 837–843 (2009).
30. Solomon, E. I. et al. Copper active sites in biology. Chem. Rev. 114, 3659–3853 (2014).
31. Ban, X. et al. Evolutionary stability of salt bridges hints its contribution to stability of proteins. Comput. Struct. Biotechnol. J. 17, 895–903 (2019).
32. Liu, H. et al. Overexpression of a novel thermostable and chloride-tolerant laccase from Thermus thermophilus SG0.5JP17-16 in Pichia pastoris and its application in synthetic dye decolorization. PLoS ONE 10, e0119833 (2015).
33. Yadav, A., Yadav, P., Singh, A. K., Sonawane, V. C., Bharagava, R. N. & Raj, A. Decolourisation of textile dye by laccase: process evaluation and assessment of its degradation bioproducts. Bioresour. Technol. 340, 125591 (2021).
34. Peng, Q. et al. Optimization of laccase from Ganoderma lucidum decolorizing remazol brilliant blue R and Glac1 as main laccase-contributing gene. Molecules 24, 3914 (2019).
35. Nishimwe, K., Agbemafle, I., Reddy, M. B., Keener, K. & Maier, D. E. Cytotoxicity assessment of aflatoxin B1 after high voltage atmospheric cold plasma treatment. Toxicon 194, 17–22 (2021).
36. Yu, T., Cui, H., Li, J. C., Luo, Y., Jiang, G. & Zhao, H. Enzyme function prediction using contrastive learning. Science 379, 1358–1363 (2023).
37. Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).
38. de Melo, D. W., Fernandez-Lafuente, R. & Rodrigues, R. C. Enhancing biotechnological applications of l-asparaginase: immobilization on amino-epoxy-agarose for improved catalytic efficiency and stability. Biocatal. Agric. Biotechnol. 52, 102821 (2023).
39. Hosseini, K., Zivari-Ghader, T., Akbarzadehlaleh, P., Ebrahimi, V., Sharafabad, B. & Dilmaghani, A. A comprehensive review of L-asparaginase: production, applications and therapeutic potential in cancer treatment. Appl. Biochem. Microbiol. 1–15 (2024).
40. Rendle, S., Freudenthaler, C., Gantner, Z. & Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 452–461 (2009).
41. Weston, J., Bengio, S. & Usunier, N. WSABIE: scaling up to large vocabulary image annotation. IJCAI 11, 2764–2770 (2011).
42. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2, 3111–3119 (2013).
43. Zhang, W., Chen, T., Wang, J. & Yu, Y. Optimizing top-N collaborative filtering via dynamic negative item sampling. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (2013).
44. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. The Third International Conference on Learning Representations (2015).
45. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
46. Cock, P. J. et al.
Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422 (2009).
47. Smith, P. E. et al. Measurement of protein using bicinchoninic acid. Anal. Biochem. 150, 76–85 (1985).
48. Otwinowski, Z. & Minor, W. Processing of X-ray diffraction data collected in oscillation mode. In Methods in Enzymology (Elsevier, 1997).
49. Borges, P. T. et al. Methionine-rich loop of multicopper oxidase McoA follows open-to-close transitions with a role in enzyme catalysis. ACS Catal. 10, 7162–7176 (2020).
50. Liebschner, D. et al. Macromolecular structure determination using X-rays, neutrons and electrons: recent developments in Phenix. Acta Crystallogr. Sect. D 75, 861–877 (2019).
51. Casañal, A., Lohkamp, B. & Emsley, P. Current developments in Coot for macromolecular model building of electron cryo-microscopy and crystallographic data. Protein Sci. 29, 1055–1064 (2020).
52. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
53. Wang, T. et al. Comprehensive assessment of protein loop modeling programs on large-scale datasets: prediction accuracy and efficiency. Brief. Bioinform. 25, bbad486 (2024).
54. Bauer, P., Hess, B. & Lindahl, E. GROMACS 2022.3 source code (2022.3). Zenodo https://doi.org/10.5281/zenodo.7037338 (2022).
55. Corbella, M., Pinto, G. P. & Kamerlin, S. C. Loop dynamics and the evolution of enzyme activity. Nat. Rev. Chem. 7, 536–547 (2023).
56. Crean, R. M., Biler, M., van der Kamp, M. W., Hengge, A. C. & Kamerlin, S. C. Loop dynamics and enzyme catalysis in protein tyrosine phosphatases. J. Am. Chem. Soc. 143, 3830–3845 (2021).
57. Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., Impey, R. W. & Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926–935 (1983).
58. Maier, J. A., Martinez, C., Kasavajhala, K., Wickstrom, L., Hauser, K. E. & Simmerling, C. ff14SB: improving the accuracy of protein side chain and backbone parameters from ff99SB. J. Chem. Theory Comput. 11, 3696–3713 (2015).
59. Choi, J. M. et al. HepG2 cells as an in vitro model for evaluation of cytochrome P450 induction by xenobiotics. Arch. Pharm. Res. 38, 691–704 (2015).
60. Liu, Y., Du, M. & Zhang, G. Proapoptotic activity of aflatoxin B1 and sterigmatocystin in HepG2 cells. Toxicol. Rep. 1, 1076–1086 (2014).
61. Zhou, X. ESM-Ezy: a deep learning strategy for the mining of novel multicopper oxidases with superior properties. Zenodo https://doi.org/10.5281/zenodo.14807568 (2024).

The crystal structures were elucidated by Dr. Shilong Fan from Tsinghua University. We thank Nan Li and the Westlake University High-Performance Computing Center for the computing resources. We thank Dr. Yinjuan Chen and Cuili Wang from the Instrumentation and Service Center for Molecular Sciences at Westlake University for assistance with product measurements.
This research was funded by the following grants: Key Project on Glucose Water Hydrogen Production [10311053A022301/002], Special Fund for Synthetic Biology [211000006022301/010], National Key Research and Development Program of China [2022ZD0115100], Westlake Center of Synthetic Biology and Integrated Bioengineering (WE-SynBio), and Zhejiang Key Laboratory of Low-Carbon Intelligent Synthetic Biology (2024ZY01025).

These authors contributed equally: Hui Qian, Yuxuan Wang, Xibin Zhou.

Author affiliations: School of Engineering, Westlake University, Hangzhou, 310014, Zhejiang, China (Hui Qian, Yuxuan Wang, Xibin Zhou, Tao Gu, Hao Lyu, Zhikai Li, Xiuxu Li, Chengchen Guo, Fajie Yuan & Yajie Wang); The Center for Synthetic Biology and Integrated Bioengineering, Westlake University, Hangzhou, 310014, Zhejiang, China (Hui Qian, Yuxuan Wang, Tao Gu, Xiuxu Li, Fajie Yuan & Yajie Wang); Beijing Academy of Artificial Intelligence, Beijing, China (Hui Wang); Westlake Laboratory of Life Sciences and Biomedicine, Xihu District, Hangzhou, 310024, Zhejiang Province, China (Huan Zhou & Yajie Wang); School of Life Science, Westlake University, Hangzhou, 310014, Zhejiang, China (Yajie Wang); Muyuan Laboratory, Zhengzhou, Henan, China (Yajie Wang).

The project was conceived by Yajie Wang and Fajie Yuan. The study design was developed by Hui Qian, Yuxuan Wang, and Xibin Zhou. Model training and database retrieval were conducted by Xibin Zhou, Yuxuan Wang, Hui Wang, and Zhikai Li. Hui Qian was responsible for the collection of the MCO dataset, as well as enzyme expression, purification, and functional testing. Yuxuan Wang handled the enzyme treatments for environmental pollutant remediation. Tao Gu conducted the molecular dynamics simulations, while Hao Lyu and Chengchen Guo carried out the cellular experiments. Xiuxu Li was involved in the analysis of structural clustering. Huan Zhou assisted in enzyme expression and purification. The manuscript was collaboratively written and has been reviewed and approved by all contributing authors.

Correspondence to Fajie Yuan or Yajie Wang. The authors declare no competing interests. Nature Communications thanks Lígia Martins and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available. Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material.
You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. Reprints and permissions information is available.

Qian, H., Wang, Y., Zhou, X. et al. ESM-Ezy: a deep learning strategy for the mining of novel multicopper oxidases with superior properties. Nat. Commun. 16, 3274 (2025). https://doi.org/10.1038/s41467-025-58521-y

Received: 08 July 2024; Accepted: 21 March 2025; Published: 06 April 2025.