LLMs' impact on science: Booming publications, stagnating quality


2025-12-23 · technology
雷总
Good morning, Norris1. I'm 雷总, and welcome to your personalized Goose Pod. Today is Tuesday, December 23, ten in the morning. Our topic today is a serious one that concerns the very future of science: the impact of large language models on scientific research.
小撒
Hey Norris1, I'm 小撒. What we're unpacking today is this: with AI in the mix, scientific papers are springing up like mushrooms after rain, yet actual quality seems stuck in the mud. Let's dive straight into this hardcore, and frankly a bit uncomfortable, discussion.
雷总
A research team from Berkeley and Cornell recently ran a large-scale scan of several million documents across three major preprint archives. Using model-based analysis, they found a clear turning point: once large language models became widespread, paper output surged, with submissions from Asian research institutions in particular nearly doubling.
小撒
Sounds like every scientist got an efficiency cheat code. But I suspect the story isn't that simple. Is it like water-injected meat, bulkier but less nutritious? It's as if everyone is running their papers through a beauty filter: the photos look better and better, but the substance underneath hasn't changed.
雷总
Exactly right. The research shows that although AI makes the language more polished and complex, these papers are actually less likely to end up published in peer-reviewed journals. Linguistic inflation is masking a stagnation in scientific value. A pile of fancy wording has nowhere to hide under rigorous academic scrutiny.
小撒
A classic case of gold on the outside, rot on the inside. Everyone is using AI to overcome language barriers and dress up their papers, but the core scientific innovation may not be keeping pace. It's like a chef who has mastered decorative carving while the food itself tastes no better, and sometimes isn't even seasoned right.
雷总
This phenomenon has deep historical roots. Academia has long lived under "publish or perish," and that enormous career pressure pushes many people to take risks. Over the past two decades, retraction rates have risen roughly tenfold. By 2023, the number of retractions worldwide had exceeded thirteen thousand, which is staggering.
小撒
Thirteen thousand? That number alone is hair-raising. Retractions used to come from miscalculated experimental data; now is it because the AI fabrications got too absurd? I've heard some papers even contained AI nonsense that never got cleaned up, like figures labeled with words that don't exist.
雷总
Indeed. Today's paper mills use AI to mass-produce fraudulent papers at frighteningly low cost. The ethical debate about AI entering science actually goes back a long way, to the Turing test in the 1950s, when people were already worried about machines misleading humans. But the pace now is far faster; the technology sports car is already doing two hundred.
小撒
The traffic lights of regulation aren't even built yet, and the car is already on the highway. AI can help scientists who aren't native English speakers break the language barrier, which is genuinely good for scientific equity. But if the tool is used to cut corners and churn out soulless cut-and-paste research, scientific rigor is finished.
雷总
That reminds me of my days as a programmer: beautiful code doesn't mean the logic is right. If research data gets polluted by AI-generated nonsense, then the data future AIs learn from will be garbage too. That creates a frightening degradation loop, possibly even leading to irreversible "brain rot" in the AIs themselves.
小撒
"Brain rot" is such a vivid way to put it. If we don't build transparent disclosure mechanisms so everyone can tell what was written by AI and what was thought up by humans, the trust that academia rests on will crumble like sand. Many journals still haven't figured out how to regulate this and are fumbling through the fog.
雷总
The sharpest flashpoint is that AI frequently hallucinates. One study found that among the citations in medical papers generated by a certain large model, fully half were entirely fabricated, down to made-up links and identifiers. In academia that amounts to deepfakery, a direct assault on the baseline of truthfulness.
小撒
That's an enormous trap for academia. If a researcher won't even manually verify the references, are they pursuing truth or just playing word games? Many people treat AI as an all-knowing search engine, when it's really just predicting the next word, with no regard for whether the facts are right.
雷总
And reviewers have it harder than ever. With so many papers to read each year, they simply don't have the bandwidth to check whether every impressive-looking citation actually exists. That creates a vicious cycle: AI churns out junk, reviewers can't keep up, and the trust cost across the entire academic system rises exponentially.
小撒
That's the classic disaster of cognitive outsourcing. Everyone wants to save effort and ends up creating bigger problems. The scientific community is now fiercely debating whether AI should be treated as an assistive tool or a potential fraud machine. That deep split leaves many researchers agonizing over whether to use it at all.
雷总
The current situation is delicate. Data suggest that more than half of reviewers are themselves quietly using AI to help draft their review comments. While AI can raise the writing quality of non-native English speakers, which does promote fairness to a degree, it also means complex language is no longer a marker of high-quality research.
小撒
Well, now things are lively: AI writes the papers, AI reviews the papers, and we humans just stand by to sign off? This lack of transparency has everyone anxious. We used to assume a well-written paper meant solid research; that traditional quality heuristic has now completely collapsed.
雷总
Exactly, and the shock is comprehensive. It changes not only how papers are produced but also shakes the foundations of research integrity. If we can no longer distinguish original human insight from a machine's probabilistic output, the pace of scientific progress may slow dramatically under all that noise. That's what worries us most.
小撒
Looking ahead, I think it comes down to how the rules get set. The likely trend is a human-machine copilot mode: AI handles the massive data, humans handle the core logic and the ethical judgment. As long as regulation keeps up, AI can still be the wings that lift science rather than the weight that drags us off course.
雷总
I fully agree. The AI market in the life sciences is projected to exceed eleven billion US dollars by 2030. If we can build rigorous validation systems like the FDA's and approach every technological advance honestly, AI can still help us crack scientific problems we could never solve before.
小撒
That's all for today. I hope this episode helps you keep a cool head amid the AI wave. Thanks for listening, Norris1. Keep thinking independently, and we'll see you next time on Goose Pod.
雷总
Thank you for listening. The road of science is winding, but as long as we hold to honesty and rigor, technology will ultimately benefit humanity. That's all for today. Thanks for listening to Goose Pod, and see you tomorrow.

This episode examines the impact of large language models on science. AI has boosted paper output, with submissions from Asia nearly doubling, but polished language masks stagnating quality: publication rates are falling while retractions soar. AI hallucinations and fabricated citations challenge research integrity. Going forward, transparent disclosure and rigorous oversight are needed so that human-machine collaboration lets AI advance science rather than hold it back.

LLMs’ impact on science: Booming publications, stagnating quality

Read original at Ars Technica

There have been a number of high-profile cases where scientific papers have had to be retracted because they were filled with AI-generated slop—the most recent coming just two weeks ago. These instances raise serious questions about the quality of peer review in some journals—how could anyone let a figure with terms like “runctitional,” “fexcectorn,” and “frymblal” through, especially given the ‘m’ in frymblal has an extra hump?

But it has not been clear whether these high-profile examples are representative. How significantly has AI use been influencing the scientific literature? A collaboration of researchers at Berkeley and Cornell decided to take a look. They scanned three of the largest archives of pre-publication papers and identified ones that are likely to have been produced using Large Language Models.

And they found that, while researchers produced far more papers after starting to use AI and the quality of the language went up, the publication rate of these papers dropped.

Searching the archives

The researchers began by obtaining the abstracts of everything placed in three major pre-publication archives between 2018 and mid-2024.

At the arXiv, this netted them 1.2 million documents; another 675,000 were found in the Social Science Research Network; and bioRxiv provided another 220,000. So this provided both a lot of material to work with and coverage of many different fields of research. It also included documents that were submitted before Large Language Models were likely to be able to produce output that would be deemed acceptable.

The researchers took the abstracts from the pre-ChatGPT period and trained a model to recognize the statistics of human-generated text. Those same abstracts were then fed into GPT 3.5, which rewrote them, and the same process was repeated. The model could then be used to estimate whether a given abstract was likely to have been produced by an AI or an actual human.
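The study's actual detector isn't reproduced here, but the core idea, fitting a statistical model of human-written text and then scoring new abstracts against it, can be illustrated with a deliberately toy sketch. The tiny "human" corpus, the unigram model, and the example sentences below are all illustrative assumptions, not the paper's method or data:

```python
import math
from collections import Counter

def train_unigram(texts):
    """Fit a unigram frequency model on a corpus of human-written abstracts."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    # Laplace smoothing so unseen words still get a small nonzero probability.
    return lambda w: (counts[w] + 1) / (total + vocab + 1)

def avg_logprob(model, text):
    """Mean per-word log-probability; lower = less typical of the human corpus."""
    words = text.lower().split()
    return sum(math.log(model(w)) for w in words) / len(words)

# Hypothetical human-written training abstracts.
human = ["we measure the effect of treatment on cell growth",
         "we report results from a randomized trial of the drug"]
model = train_unigram(human)

plain = "we measure the effect of the drug on growth"
florid = "herein we comprehensively elucidate multifaceted paradigms"
# Text using the human corpus's vocabulary scores higher than unfamiliar wording.
assert avg_logprob(model, plain) > avg_logprob(model, florid)
```

A real detector would use a far richer model (and the study paired each human abstract with a GPT-3.5 rewrite to learn the contrast), but the scoring-against-a-reference-distribution shape is the same.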

The research team then used this to identify a key transition point: when a given author at one of these archives first started using an LLM to produce a submission. They then compared the researchers’ prior productivity to what happened once they turned to AI. “LLM adoption is associated with a large increase in researchers’ scientific output in all three preprint repositories,” they conclude.
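The adoption-point comparison can be sketched minimally: find each author's first LLM-flagged submission, then compare submission rates before and after that year. The records, function names, and numbers below are hypothetical stand-ins, not the study's data or code:

```python
# Hypothetical submission records: (author, year, flagged_as_llm_assisted).
subs = [
    ("alice", 2019, False), ("alice", 2020, False),
    ("alice", 2023, True),  ("alice", 2023, True), ("alice", 2024, True),
    ("bob",   2021, False), ("bob",   2022, False),
]

def adoption_year(records, author):
    """First year an author's submission is flagged as LLM-assisted, else None."""
    years = [y for a, y, llm in records if a == author and llm]
    return min(years) if years else None

def papers_per_year(records, author, start, end):
    """Average submissions per year over [start, end], inclusive."""
    n = sum(1 for a, y, _ in records if a == author and start <= y <= end)
    return n / (end - start + 1)

t = adoption_year(subs, "alice")                      # first flagged year: 2023
before = papers_per_year(subs, "alice", 2019, t - 1)  # 2 papers over 4 years
after = papers_per_year(subs, "alice", t, 2024)       # 3 papers over 2 years
assert after > before  # productivity rises after the transition point
```

The study aggregates this before/after contrast across many authors and all three archives; this sketch just shows the per-author bookkeeping.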

This effect was likely to be most pronounced in people who weren't native speakers of English. When the researchers limited the analysis to people with Asian names working at institutions in Asia, their rate of submissions to bioRxiv and SSRN nearly doubled once they started using AI and rose by over 40 percent at the arXiv.

This suggests that people who may not have the strongest English skills are using LLMs to overcome a major bottleneck: producing compelling text.

Quantity vs. quality

The value of producing compelling text should not be underestimated. “Papers with clear but complex language are perceived to be stronger and are cited more frequently,” the researchers note, suggesting that we may use the quality of writing as a proxy for the quality of the research it’s describing.

And they found some indication of that here, as non-LLM-assisted papers were more likely to be published in the peer-reviewed literature if they used complex language (the abstracts were scored for language complexity using a couple of standard measures).

But the dynamic was completely different for LLM-produced papers.
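The excerpt doesn't name the specific complexity measures used. One standard readability measure that such analyses commonly employ is the Flesch Reading Ease score; the implementation below is an assumed stand-in with a deliberately crude vowel-run syllable heuristic, not the study's scoring code:

```python
import re

def count_syllables(word):
    """Crude syllable count: runs of vowels (a common heuristic, not exact)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean simpler text.
    Formula: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

simple = "We ran the test. It worked well."
ornate = ("This investigation comprehensively demonstrates unprecedented "
          "methodological sophistication across heterogeneous experimental paradigms.")
# Short sentences with short words score far higher (i.e., read as less complex).
assert flesch_reading_ease(simple) > flesch_reading_ease(ornate)
```

Lower reading-ease scores correspond to the "complex language" the study associates with LLM-assisted abstracts.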

The complexity of language in papers written with an LLM was generally higher than for those using natural language. But they were less likely to end up being published. “For LLM-assisted manuscripts,” the researchers write, “the positive correlation between linguistic complexity and scientific merit not only disappears, it inverts.”

But not all of the differences were bleak. When the researchers checked the references being used in AI-assisted papers, they found that the LLMs weren’t just citing the same papers that everyone else did. They instead cited a broader range of sources, and were more likely to cite books and recent papers.

So, there’s a chance that AI use could ultimately diversify the published research that other researchers consider (assuming they check their own references, which they clearly should).

What does this tell us?

There are a couple of cautions for interpreting these results. One, acknowledged by the researchers, is that people may be using AI to produce initial text that’s then heavily edited, and that may be mislabeled as human-produced text here.

So, the overall prevalence of AI use is likely to be higher. The other is that some manuscripts may take a while to get published, so their use of that as a standard for scientific quality may penalize more recent drafts—which are more likely to involve AI use. These may ultimately bias some of the results, but the effects the authors saw were so large that they’re unlikely to go away entirely.

Beyond those cautions, the situation these results describe is a bit mixed. On the plus side, the ability of AIs to help researchers express their ideas could help more scientific work come to the attention of the wider community. The authors also note that the use of LLMs trained on general language may limit their reliance on jargon, and thus open up scientific disciplines to people with other specializations, potentially enabling new collaborations.

That said, the disconnect between writing quality and scientific quality may make it harder for researchers to take their usual shortcuts to estimating scientific quality. With nothing obvious to replace it, this could cause some significant challenges for researchers.

Left completely unmentioned is the issue of how this plays out in the peer review process.

The low cost of starting online-only journals has led to their proliferation, with a corresponding growth in the need for peer reviewers. Editors regularly complain about not getting reviews back in a timely manner, and faculty complain that they’re swamped with requests to review papers. If LLMs boost researchers’ ability to produce manuscripts for review, the situation is only going to get worse.

In any case, the authors point out this is an entirely new capability, and we’re only just starting to see it put to use. “As models improve and scientists discover new ways to integrate them into their work,” they write, “the future impact of these technologies will likely dwarf the effects that we have highlighted here.”

Science, 2025. DOI: 10.1126/science.adw3000 (About DOIs).

John is Ars Technica's science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.
