The Impact of Large Language Models on Science: A Surge in Papers, Stagnant Quality

2025-12-21 · technology
Mr. Cha
Good morning, Norris. This is Mr. Cha. Today is Sunday, December 21st, and it is exactly eight o'clock in the morning. On this winter morning, welcome to Goose Pod. We won't be wandering the wuxia world of rivers and lakes today, but there is every bit as much flashing of blades in this topic as in any jianghu tale.
Huiyin
Good morning, Norris. I'm Huiyin. Mr. Cha is right: the scientific world is indeed going through a quiet transformation. Today's topic is right at the cutting edge: "The Impact of Large Language Models on Science: A Surge in Papers, Stagnant Quality." Goose Pod has prepared this deep dive just for you.
Mr. Cha
Speaking of which, I recently came across a rather amusing word: "frymblal." Norris, don't go thinking it's some profound physics term. It's a word an AI simply made up in a recently retracted paper; even the words in the figures were misspelled, and the 'm' was drawn with an extra hump. It's like a martial artist whose training has gone astray, with his qi flowing backwards through the meridians.
Huiyin
It really is absurd. Researchers at Berkeley and Cornell recently ran a large-scale survey, scanning three major preprint platforms, including arXiv, and covering roughly two million documents in all, 1.2 million of them from arXiv alone. They found that since large language models entered the picture, the number of papers has surged, but the publication rate of the papers suspected of being AI-assisted has actually fallen.
Mr. Cha
This is what they mean by "more haste, less speed." The copyist scribes of old have been replaced by tireless machines. Output has gone up, but the quality doesn't stand up to scrutiny. It reminds me of Elon Musk's Grok model, which we talked about before: it claimed to know everything, yet turned out to be full of bias and hallucinations, swallowing political propaganda whole.
Huiyin
Exactly. It's the AI "hallucination" problem projected onto science. The data show that researchers who aren't native English speakers use AI the most; at some Asian institutions, submission rates nearly doubled after researchers started using AI. They set out to use AI to cross the language barrier, which is a good thing in itself, but it seems they also nudged open Pandora's box.
Mr. Cha
Crossing the barrier is real enough, but if you hand your thinking over to the machine as well, it's no longer just language polishing. The study found that these AI-assisted papers are full of ornate, complicated wording and look impressively profound, yet their scientific value is shrinking. Isn't that exactly what we mean by "gold and jade on the outside, rotten cotton within"?
Huiyin
Mr. Cha puts it aptly. The research shows that before AI got involved, linguistic complexity and scientific merit were usually positively correlated; for AI-assisted manuscripts, that relationship not only disappears, it inverts. In other words, the more florid the AI's writing, the lower the paper's odds of making it through peer review. Norris, that is a deeply counterintuitive finding.
Mr. Cha
It's like a pedantic scholar who can only parrot the classics, all flowery phrases and not a shred of real insight. And it isn't just a matter of words. Worse still, these AIs fabricate references out of thin air. That's like writing a martial arts manual that cites legendary masters no one has ever heard of, a grave taboo in the jianghu.
Huiyin
Indeed. The numbers are alarming: in one analysis of references that ChatGPT generated for medical papers, nearly half were fabricated, another 46 percent existed but were cited incorrectly, and only 7 percent were fully accurate. This kind of nonsense delivered with a straight face is polluting the foundations of the scientific literature and leaving those who come after with no solid path to follow.
Mr. Cha
Looking back, from the Turing test to today's ChatGPT, artificial intelligence was meant to support human intellect. But academia now seems warped by the pressure of "publish or perish." To pad their counts and hit their metrics, some scholars even turn to paper mills. It's a disgrace to scholarship.
Huiyin
That publish-or-perish culture really is a chief culprit fanning the flames. In 2023, more than ten thousand papers were retracted worldwide, a record high, and a large share of them traced back to so-called paper mills. The spread of AI has driven the cost of mass-producing junk papers close to zero and made fraud easier than ever.
Mr. Cha
Faking it used to at least require making up your own data; now you just tap a few keys and the AI does it all for you. It reminds me of the old snake-oil peddlers in the jianghu, who at least had to put on a bit of a show. Today's swindlers skip the show entirely and go straight to illusion. Once this spreads, the people quietly doing real scholarship will be the ones drowned in the garbage.
Huiyin
And spotting AI-generated text is getting harder and harder. Detection tools like GPTZero exist, but against more advanced models like GPT-4 their accuracy is slipping. It's a cat-and-mouse game: the faking technology keeps upgrading while the detection technology is always half a step behind. Norris, this is an enormous challenge for the system of scientific integrity.
Mr. Cha
What worries me more is that this isn't just a technical problem; it's a problem of the human heart. When tools become this convenient, human laziness gets amplified without limit. Some scholars hand their literature reviews and even their data interpretation entirely over to the AI. That's called "cognitive offloading," which is to say they've set their brains aside, and that is the truly frightening part.
Huiyin
Yes, Mr. Cha has named a key concept: cognitive offloading. Its counterpart is "cognitive extension," using AI to augment our abilities, for example by polishing language. But the line between the two has blurred. Many journals now require authors to disclose whether they used AI, rather like requiring martial artists to announce their school and forbidding them from wounding anyone with hidden weapons.
Mr. Cha
But how many people actually keep that rule of announcing themselves? Surveys suggest that around 40 percent of scholars who have used AI never acknowledge it in their papers. That's like a swordsman in a duel who uses a hidden weapon and denies it, something the jianghu would spit upon. Integrity is the foundation of one's character, and it is the soul of science.
Huiyin
Which brings us to the central conflict today: the breakdown of peer review. Peer review was meant to be the gatekeeper of scientific quality, but faced with a flood of AI-generated manuscripts, reviewers themselves have started using AI to write their reviews. One figure puts it at 53 percent of peer reviewers using AI tools. It's turning into an absurdist play in which AI writes and AI reviews.
Mr. Cha
Ha, what a spectacle! Machines write, machines review, and humans stand by and watch the show? Isn't that just the left hand sparring with the right? If even the review gate has fallen, what is left of scientific rigor? It's a mockery of human intelligence. Norris, don't you find this closed loop deeply ironic?
Huiyin
Ironic, and dangerous. It can lead to a kind of "model collapse" effect: future AI models may end up trained on this low-quality, AI-generated data and degrade over time. And those fabricated references, what we might call "ghost citations," send researchers off wasting enormous amounts of time hunting for knowledge that simply doesn't exist.
Mr. Cha
"Ghost citations." The very phrase sends a chill down the spine. Imagine a young student, full of enthusiasm, chasing down sources only to find they're all mirages. That doesn't just waste time; it shakes their faith in science. Once that trust collapses, rebuilding it is harder than climbing to the heavens.
Huiyin
There's also a question of fairness. AI does help non-native speakers cross the language gulf, but if that help turns into ghostwriting, is it fair to the scholars who insist on writing and thinking through every sentence themselves? In chasing efficiency, are we sacrificing too much of our respect for originality?
Mr. Cha
"Where in this world is the way to have it both ways, failing neither the Buddha nor you?" Efficiency and depth are rarely had together. Academia today is far too restless, wishing it could publish ten papers a day. It forgets that real scholarship, like a good old soup, has to be simmered slowly over a gentle flame. Too much fast food ruins the palate.
Huiyin
The effects of that fast-food approach are already showing. Beyond the rising retraction rate, public trust in science is wavering. If even the most authoritative journals are filling up with AI-generated nonsense and faulty figures, how is the general public supposed to tell true from false? This isn't just an academic problem; it's a social one.
Mr. Cha
Exactly. Science is supposed to be a lighthouse in the search for truth. If even the lighthouse flickers, where are the sailors to steer? Just like the Grok we mentioned earlier: if the source of knowledge is polluted, how can the river downstream run clear? It corrodes society's entire system of knowing.
Huiyin
On the other hand, we can't dismiss AI's positive effects outright. For excellent scientists whose voices were muffled by the language barrier, AI really is a crutch. The question is whether we use it as a crutch, as a prosthesis, or as a replacement brain. And the transparency gap is huge: 76 percent of researchers don't even know whether publishers are using AI in their own workflows.
Mr. Cha
Which brings us back to the question of degree. Too much is as bad as too little. Right now the tools are running ahead of the rules. Everyone is using them, but nobody dares say so out loud, and nobody has a clear standard. It's like a crowd sprinting through the night, torches in hand, with no idea whether a cliff lies ahead.
Huiyin
Still, the future isn't all darkness. Regulation is catching up: the FDA, for example, has already cleared hundreds of AI medical devices, a sign that official evaluation standards are taking shape. The likely direction is a "co-pilot" model, in which humans keep their hands on the wheel while the AI handles navigation and early warnings.
Mr. Cha
"Co-pilot" is a good metaphor. The roles must not be reversed. It's like swordsmanship: the sword is the sharp instrument, but it's the person who must wield it. Over the next few years we may see stricter verification frameworks, so-called electronic verification. The scientific community needs a thorough housecleaning to squeeze out the false froth.
Huiyin
Yes, 2025 to 2030 will be a critical period. We may see new techniques emerge, such as federated learning and explainable AI, to address data privacy and trust. For you, Norris, it means reading the latest science news with a more critical eye, and not putting blind faith in the word "published."
Huiyin
Today's discussion shows that large language models can be a booster for science, or a stumbling block. What matters is whether the people using them still hold truth in reverence. This is Goose Pod. Thank you for listening, Norris.
Mr. Cha
The green hills do not change, and the clear waters flow on. However the technology changes, the heart that seeks truth must not. Norris, may your new day have both the efficiency of AI and the wisdom of a human being. See you tomorrow. Goose Pod is always here for you.

Large language models are driving a surge in scientific papers, but quality is a concern. AI-generated content is prone to errors and fabricated citations, pushing publication rates down while retractions hit record highs. Over-reliance on AI fuels "cognitive offloading" and threatens scientific integrity. Going forward, rigorous verification frameworks will be needed to balance AI efficiency with human judgment and keep science honest.

LLMs’ impact on science: Booming publications, stagnating quality

Read original at Ars Technica

There have been a number of high-profile cases where scientific papers have had to be retracted because they were filled with AI-generated slop—the most recent coming just two weeks ago. These instances raise serious questions about the quality of peer review in some journals—how could anyone let a figure with terms like “runctitional,” “fexcectorn,” and “frymblal” through, especially given the ‘m’ in frymblal has an extra hump?

But it has not been clear whether these high-profile examples are representative. How significantly has AI use been influencing the scientific literature? A collaboration of researchers at Berkeley and Cornell have decided to take a look. They’ve scanned three of the largest archives of pre-publication papers and identified ones that are likely to have been produced using Large Language Models.

And they found that, while researchers produce far more papers after starting to use AI and the quality of the language used went up, the publication rate of these papers has dropped.

Searching the archives

The researchers began by obtaining the abstracts of everything placed in three major pre-publication archives between 2018 and mid-2024.

At the arXiv, this netted them 1.2 million documents; another 675,000 were found in the Social Science Research Network; and bioRxiv provided another 220,000. So, this was both a lot of material to work with and covered a lot of different fields of research. It also included documents that were submitted before Large Language Models were likely to be able to produce output that would be deemed acceptable.

The researchers took the abstracts from the pre-ChatGPT period and trained a model to recognize the statistics of human-generated text. Those same abstracts were then fed into GPT 3.5, which rewrote them, and the same process was repeated. The model could then be used to estimate whether a given abstract was likely to have been produced by an AI or an actual human.
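
The article describes this method only at a high level. As a rough, hedged sketch of the general idea (not the authors' actual model), the snippet below fits a simple bag-of-words classifier on human-written abstracts versus LLM-rewritten versions of the same abstracts, then scores new text; the abstract lists themselves are placeholders you would supply.

```python
# A rough illustration of the general idea described above, NOT the authors'
# actual model: fit a simple text classifier on human-written abstracts versus
# LLM-rewritten versions of the same abstracts, then score new abstracts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_detector(human_abstracts, llm_abstracts):
    """Fit a classifier distinguishing human text (label 0) from LLM rewrites (label 1)."""
    human, llm = list(human_abstracts), list(llm_abstracts)
    texts = human + llm
    labels = [0] * len(human) + [1] * len(llm)
    # Word-frequency statistics stand in for the richer stylometric
    # features a real study would use.
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=1),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model


def llm_probability(model, abstract):
    """Estimated probability that a single abstract is LLM-assisted."""
    return model.predict_proba([abstract])[0][1]
```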

The research team then used this to identify a key transition point: when a given author at one of these archives first started using an LLM to produce a submission. They then compared the researchers’ prior productivity to what happened once they turned to AI. “LLM adoption is associated with a large increase in researchers’ scientific output in all three preprint repositories,” they conclude.

This effect was likely to be most pronounced in people that weren’t native speakers of English. If the researchers limited the analysis to people with Asian names working at institutions in Asia, their rate of submissions to bioRxiv and SSRN nearly doubled once they started using AI and rose by over 40 percent at the arXiv.
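
As a hedged illustration of how such a before/after comparison might be set up (the table layout and column names here are assumptions, not the study's actual pipeline), one could tabulate each author's average yearly output before and after their first LLM-flagged submission:

```python
# Hypothetical input: one row per preprint, with an author identifier, the
# submission date, and the date of that author's first LLM-flagged submission
# (NaT for authors who never adopted an LLM). Column names are assumptions.
import pandas as pd


def adoption_effect(preprints: pd.DataFrame) -> pd.DataFrame:
    """Average papers per active year before vs. after each author's LLM adoption."""
    adopters = preprints.dropna(subset=["llm_adopted"]).copy()
    adopters["period"] = (adopters["submitted"] >= adopters["llm_adopted"]).map(
        {True: "after", False: "before"}
    )
    adopters["year"] = adopters["submitted"].dt.year
    # Count papers per author per calendar year, then average those yearly
    # counts within each author's before/after period.
    yearly_counts = adopters.groupby(["author_id", "period", "year"]).size()
    return yearly_counts.groupby(level=["author_id", "period"]).mean().unstack("period")
```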

This suggests that people who may not have the strongest English skills are using LLMs to overcome a major bottleneck: producing compelling text.

Quantity vs. quality

The value of producing compelling text should not be underestimated. “Papers with clear but complex language are perceived to be stronger and are cited more frequently,” the researchers note, suggesting that we may use the quality of writing as a proxy for the quality of the research it’s describing.

And they found some indication of that here, as non-LLM-assisted papers were more likely to be published in the peer reviewed literature if they used complex language (the abstracts were scored for language complexity using a couple of standard measures). But the dynamic was completely different for LLM-produced papers.

The complexity of language in papers written with an LLM was generally higher than for those using natural language. But they were less likely to end up being published. “For LLM-assisted manuscripts,” the researchers write, “the positive correlation between linguistic complexity and scientific merit not only disappears, it inverts.”
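
The article doesn't name the complexity measures used, so the snippet below is only a stand-in: it scores an abstract with two common readability indices from the textstat package rather than the study's actual metrics.

```python
# Illustrative only: two common readability indices used as stand-ins for the
# unnamed complexity measures in the study.
import textstat


def complexity_scores(abstract: str) -> dict:
    """Return simple readability statistics for one abstract."""
    return {
        # Flesch reading ease: lower scores mean harder, more complex text.
        "flesch_reading_ease": textstat.flesch_reading_ease(abstract),
        # Gunning fog index: higher scores mean more complex text.
        "gunning_fog": textstat.gunning_fog(abstract),
    }
```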

But not all of the differences were bleak. When the researchers checked the references being used in AI-assisted papers, they found that the LLMs weren’t just citing the same papers that everyone else did. They instead cited a broader range of sources, and were more likely to cite books and recent papers.
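
As a purely illustrative sketch of the kind of reference-profile comparison described here (the data format is hypothetical, not taken from the study), one could summarize each paper's reference list by venue breadth, book share, and average citation age:

```python
# Hypothetical data format: each reference is a (venue, year, is_book) tuple.
from statistics import mean


def reference_profile(references, current_year=2024):
    """Summarize one paper's reference list: breadth, book share, and recency."""
    refs = list(references)
    if not refs:
        return {"distinct_venues": 0, "book_share": 0.0, "mean_age": None}
    return {
        # Breadth: how many distinct venues the paper draws on.
        "distinct_venues": len({venue for venue, _, _ in refs}),
        # Share of references that are books rather than journal papers.
        "book_share": sum(is_book for _, _, is_book in refs) / len(refs),
        # Average age of cited work; lower means more recent citations.
        "mean_age": mean(current_year - year for _, year, _ in refs),
    }
```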

So, there’s a chance that AI use could ultimately diversify the published research that other researchers consider (assuming they check their own references, which they clearly should).

What does this tell us?

There are a couple of cautions for interpreting these results. One, acknowledged by the researchers, is that people may be using AI to produce initial text that’s then heavily edited, and that may be mislabeled as human-produced text here.

So, the overall prevalence of AI use is likely to be higher. The other is that some manuscripts may take a while to get published, so their use of that as a standard for scientific quality may penalize more recent drafts—which are more likely to involve AI use. These may ultimately bias some of the results, but the effects the authors saw were so large that they’re unlikely to go away entirely.

Beyond those cautions, the situation these results describe is a bit mixed. On the plus side, the ability of AIs to help researchers express their ideas could help more scientific work come to the attention of the wider community. The authors also note that the use of LLMs trained on general language may limit their reliance on jargon, and thus open up scientific disciplines to people with other specializations, potentially enabling new collaborations.

That said, the disconnect between writing quality and scientific quality may make it harder for researchers to take their usual shortcuts to estimating scientific quality. With nothing obvious to replace it, this could cause some significant challenges for researchers.

Left completely unmentioned is the issue of how this plays out in the peer review process.

The low cost of starting online-only journals has led to their proliferation, with a corresponding growth in the need for peer reviewers. Editors regularly complain about not getting reviews back in a timely manner, and faculty complain that they’re swamped with requests to review papers. If LLMs boost researchers’ ability to produce manuscripts for review, the situation is only going to get worse.

In any case, the authors point out this is an entirely new capability, and we’re only just starting to see it put to use. “As models improve and scientists discover new ways to integrate them into their work,” they write, “the future impact of these technologies will likely dwarf the effects that we have highlighted here.”

Science, 2025. DOI: 10.1126/science.adw3000 (About DOIs).

John is Ars Technica's science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.
