Scientists Develop Brain-Inspired AI Whose Reasoning Outperforms Large Models Like ChatGPT

2025-08-29 · Technology
金姐
Good morning, 老王. I'm 金姐, and welcome to the Goose Pod made just for you. Today is Saturday, August 30, 7:02 a.m. Our topic today is a really interesting one: is bigger always better for AI? A "small but beautiful" brain-inspired AI has burst onto the scene, and its reasoning ability reportedly beats even ChatGPT. Now that is interesting.
雷总
Good morning, everyone. I'm 雷总. That's right: you could say today's topic slams the brakes on the current "large-model arms race." Let's find out what secret weapon this small model is hiding.
金姐
Then let's get straight to it. 雷总, what exactly is this "small but beautiful" AI, and who built it? Give us an introduction.
雷总
Sure. An AI company in Singapore called Sapient has developed a brand-new AI model called the Hierarchical Reasoning Model, or HRM for short. The model is inspired entirely by the human brain: it mimics the way different brain regions process information, some handling long-range planning, others rapid computation.
金姐
Biomimicry again. Sounds very impressive. But how "small" is it, exactly? Compared with today's models with trillions of parameters, where does it rank?
雷总
You've hit the key point, 金姐! The gap is enormous. HRM has only 27 million parameters, while the most advanced large models, like GPT-5, are estimated to have 3 to 5 trillion. That's like comparing a ping-pong ball to the Earth; they're not in the same league at all. And HRM needs only 1,000 training samples to learn to solve complex problems.
金姐
My goodness, what a gap. That reminds me of a news item I saw the other day: American AI experts came back from a trip to China stunned, saying the U.S. grid is too fragile to support AI's enormous energy demands, while China's surplus power capacity from recent years is now serving as a "granary" for AI data centers. Perfect! Seen that way, developing energy-efficient small models like this is the real fix at the root.
雷总
Exactly. In the end, AI competition is not just about algorithms; it's about energy and infrastructure. If models like HRM catch on, they will drastically lower AI's barriers to entry and its energy cost. And AI applications are moving into finer-grained fields too: the BBC recently reported that scientists have begun using AI to design entirely new antibiotics to fight highly drug-resistant "superbugs."
金姐
See, that's what technology should look like: solving real problems, not just stacking parameters and comparing sizes. All right, 雷总, back to the main topic. In what contest did HRM beat those lumbering giants? There has to be a referee, right?
雷总
There is, and it's no small one: a widely recognized, extremely challenging benchmark called ARC-AGI. It doesn't test how much knowledge a model has memorized; it tests something called "fluid intelligence": put simply, how efficiently a model can learn new skills and solve unfamiliar problems.
金姐
Fluid intelligence? That sounds a bit abstract. Can you explain in plain language what kind of exam this is? Is it the same thing as the IQ tests we usually talk about?
雷总
You can think of it that way. Unlike a college entrance exam, you can't raise your score by grinding practice problems. The ARC test is more like being dropped alone on a desert island with only a few basic tools, to see whether you can quickly learn to start a fire and build a shelter. It tests the most fundamental, general-purpose abilities: abstract reasoning and solving problems you've never seen.
金姐
Oh, I get it! It's a "survival challenge" for AI! So it's full of visual pattern puzzles that humans find trivially easy at a glance, but that leave AI completely stumped? Isn't that deliberately picking on AI?
雷总
Exactly; its core design philosophy is "easy for humans, hard for AI." The benchmark was proposed in 2019 by Google AI researcher François Chollet, precisely to measure how far AI still is from true artificial general intelligence (AGI), and to squeeze the water out of high scores earned through rote memorization.
金姐
That's a good idea. You need a "truth mirror" like this to test what these models are really made of. So how have the ARC scores looked? Has it always been a graveyard for AI?
雷总
Absolutely a graveyard. Competitions have been held every year since 2020. In the first one, the winning team's success rate was only 21%. In the years since, despite every trick in the book, the top score had only barely broken 50% by 2024. Humans, meanwhile, score close to 100% on this test. That shows how hard a nut it is to crack!
金姐
So if HRM can beat GPT on a test this hard, that really is big news. Which raises the core question: if a small model can do it, does that mean the brute-force "scale works miracles" path the big models are on is wrong at the root?
雷总
That's exactly the biggest debate in AI right now. Today's mainstream large language models, like ChatGPT, reason using a technique called chain-of-thought (CoT). It's like solving a math problem in school: you write out the solution step by step. The model does the same, "thinking" by generating intermediate steps.
金姐
That sounds pretty logical; isn't that how we humans think? What could possibly be wrong with it? I'd have thought writing out the steps makes errors less likely.
雷总
The problem is that this approach is actually quite brittle. It depends heavily on massive amounts of data to learn the solution format, and if a single intermediate step goes wrong, the whole reasoning chain can collapse. More to the point, some research has found that forcing a model to do chain-of-thought reasoning can actually reduce how accurately it follows instructions.
金姐
I see, so it's a "bookworm"! The steps look perfect, but it's rigid and can't adapt: one wrong step and everything after it is wrong. So what about this clever HRM? How does it think? Does it have some secret inner technique?
雷总
It does. HRM's biggest difference is that it doesn't "say" its reasoning out loud step by step. Instead, different modules inside the model cooperate to complete complex reasoning directly. It's closer to human intuition, the brain's rapid internal integration of information, than to a verbal chain of logic.
金姐
Oh? Digest it internally and just hand over the answer? That sounds almost too good. But 雷总, let me pour some cold water here. The materials say the HRM paper still hasn't been peer-reviewed. Doesn't that mean its conclusions deserve a question mark?
雷总
Yes, and that point is crucial; it's why the scientific community remains cautious. Without peer review, the rigor and reproducibility of the research haven't yet been broadly validated by the field. That's an important caveat.
金姐
And I saw an even more interesting claim. People who tried to reproduce the experiment found that the much-touted "hierarchical architecture" itself contributed almost nothing to the performance gains. What made the decisive difference was an under-documented optimization process used during training. In other words, the dish isn't delicious because of the fancy wok; it's because the chef secretly added a seasoning that isn't in the recipe!
雷总
金姐, what a sharp analogy! That is indeed the biggest controversy around HRM right now. Still, look at it another way: even if its success ultimately turns out to have a different cause, HRM's arrival has already shaken the whole field. It has proved at least one thing: "stacking parameters" is not the only road forward for AI.
金姐
I agree. It's like two schools of martial arts. One trains iron-body toughness: impervious to blades, but slow and clumsy; that's today's large models. The other cultivates deep internal strength, using softness to overcome hardness and a light touch to deflect great force; HRM clearly wants to walk the second path. Perfect!
雷总
Precisely! It pursues architectural intelligence rather than parameter brute force. HRM's dual-module design pairs a high-level module doing "slow thinking," the long-range strategic planning, with a low-level module doing "fast thinking," the immediate detailed computation. Working in concert, they're like a general and a vanguard in perfect coordination.
金姐
That design really does sound closer to how we humans think. When we make decisions, we weigh the big picture while executing the details. So what tangible benefits does this brain-like design actually bring?
雷总
Remarkable ones! On tasks demanding deep, multi-step logical reasoning, such as solving extremely hard Sudoku puzzles or finding the optimal path through a maze, HRM reaches near-perfect, essentially 100% accuracy. And the large models with tens of thousands of times more parameters? On these tasks their accuracy is 0%!
金姐
Zero... zero percent?! That's a total rout, night and day. It seems that when it comes to genuine logical reasoning, sheer bulk really doesn't help.
雷总
Yes, which suggests that in certain key capabilities, today's LLM architecture has a fundamental ceiling. HRM's success, however preliminary, points to another possible road toward artificial general intelligence: one that is more efficient, more energy-frugal, and frankly smarter.
金姐
So, 雷总, make a prediction for us. What changes will these "small but beautiful" models bring to our lives? Will they wipe out the lumbering giants?
雷总
"Wipe out" is probably too strong; each has its own strengths. But I believe the spread of models like HRM will, first of all, democratize AI. A model this small could run locally on our phones, our cars, even our smart appliances, with no need to upload data to the cloud.
金姐
Now that I like! My personal AI assistant would live entirely on my own phone instead of on some tech giant's server. That means faster responses and, more importantly, my privacy stays protected. That feels much safer.
雷总
Exactly. Sapient, the company behind HRM, open-sourced it this July, which is a very positive signal: it invites more people to study and improve this class of efficient models. AI's future may not be monopolized by a few giants hoarding compute; it could blossom in many directions. As one commentator put it: the final victory of the AI revolution may belong not to the model with the most parameters, but to the model with the most elegant reasoning.
金姐
Well said! The HRM model we discussed today is like a breath of fresh air in the AI world. It tells us that instead of chasing "bigger," we should circle back and pursue "smarter." The model itself is still contested, but the direction it points in looks promising to me. Perfect!
雷总
Yes, it shows us a brand-new, exciting possibility for the future of AI. That's all for today's discussion. Thanks for listening, 老王. This is Goose Pod.
金姐
See you tomorrow!

## AI Breakthrough: Hierarchical Reasoning Model (HRM) Outperforms LLMs

**News Title:** Scientists just developed a new AI modeled on the human brain — it’s outperforming LLMs like ChatGPT at reasoning tasks
**Report Provider/Author:** livescience.com, by Keumars Afifi-Sabet
**Date/Time Period Covered:** The study was uploaded to arXiv on June 26; the article was published August 27, 2025.

### Main Findings and Conclusions

Scientists at Sapient, an AI company in Singapore, have developed a novel artificial intelligence (AI) model called a **Hierarchical Reasoning Model (HRM)**. This new AI is designed to **reason differently** from most Large Language Models (LLMs) like ChatGPT, inspired by the **hierarchical and multi-timescale processing in the human brain**.

The HRM has demonstrated **significantly better performance** on key benchmarks, particularly reasoning tasks, and is **more efficient**, requiring far fewer parameters and training examples than advanced LLMs. This approach challenges the "chain-of-thought" (CoT) reasoning used by most LLMs, which the Sapient scientists argue suffers from "brittle task decomposition, extensive data requirements, and high latency."

### Key Statistics and Metrics

* **HRM parameters:** 27 million
* **HRM training samples:** 1,000
* **Comparison to LLMs:** Most advanced LLMs have billions or even trillions of parameters. Estimates suggest GPT-5 has between 3 trillion and 5 trillion parameters.

**Performance on the ARC-AGI benchmark:**

| Model        | ARC-AGI-1 Score | ARC-AGI-2 Score |
| :----------- | :-------------- | :-------------- |
| **HRM**      | **40.3%**       | **5%**          |
| o3-mini-high | 34.5%           | 3%              |
| Claude 3.7   | 21.2%           | 0.9%            |
| Deepseek R1  | 15.8%           | 1.3%            |

The HRM also achieved **near-perfect performance** on challenging tasks like complex Sudoku puzzles and excelled at optimal path-finding in mazes, tasks that conventional LLMs struggled with.

### Important Recommendations

While no explicit recommendations are stated, the research suggests a shift in AI development toward models that mimic the human brain's processing for improved reasoning and efficiency.

### Significant Trends or Changes

This development signals a potential **paradigm shift in AI reasoning**: away from enormous parameter counts and explicit step-by-step natural-language decomposition, toward a more integrated, hierarchical approach that mirrors biological intelligence.

### Notable Risks or Concerns

The article does not explicitly mention risks or concerns tied to the HRM itself. It does, however, reference broader societal discussions about AI entering an "unprecedented regime" and whether it should be stopped before it causes destruction.

### Material Financial Data

No financial data is presented in this news report.

### Model Operation and Architecture

The HRM operates through two modules:

1. **High-level module:** responsible for slow, abstract planning.
2. **Low-level module:** handles rapid, detailed computations.

It executes sequential reasoning tasks in a **single forward pass** without explicit supervision of intermediate steps. The model employs **iterative refinement**, improving accuracy by repeatedly refining an initial approximation over short bursts of "thinking"; each burst decides whether to continue the process or submit a "final" answer.

### Peer Review and Reproducibility

The study was uploaded to the arXiv preprint database and **has yet to be peer-reviewed**. However, after the model was open-sourced on GitHub, the organizers of the ARC-AGI benchmark attempted to **recreate the results**. They reproduced the numbers but found that the **hierarchical architecture had minimal performance impact**, attributing the substantial gains to an **under-documented refinement process during training**.

Scientists just developed a new AI modeled on the human brain — it’s outperforming LLMs like ChatGPT at reasoning tasks

Read original at livescience.com

Scientists have developed a new type of artificial intelligence (AI) model that can reason differently from most large language models (LLMs) like ChatGPT, resulting in much better performance in key benchmarks.

The new reasoning AI, called a hierarchical reasoning model (HRM), is inspired by the hierarchical and multi-timescale processing in the human brain — the way different brain regions integrate information over varying durations (from milliseconds to minutes).

Scientists at Sapient, an AI company in Singapore, say this reasoning model can achieve better performance and can work more efficiently, thanks to the model requiring fewer parameters and training examples.

The HRM model has 27 million parameters while using 1,000 training samples, the scientists said in a study uploaded June 26 to the preprint arXiv database (which has yet to be peer-reviewed).

In comparison, most advanced LLMs have billions or even trillions of parameters. Although an exact figure has not been made public, some estimates suggest that the newly released GPT-5 has between 3 trillion and 5 trillion parameters.

A new way of thinking for AI

When the researchers tested HRM in the ARC-AGI benchmark — a notoriously tough examination that aims to test how close models are to achieving artificial general intelligence (AGI) — the system achieved impressive results, according to the study.

HRM scored 40.3% in ARC-AGI-1, compared with 34.5% for OpenAI's o3-mini-high, 21.2% for Anthropic's Claude 3.7 and 15.8% for Deepseek R1. In the tougher ARC-AGI-2 test, HRM scored 5% versus o3-mini-high's 3%, Deepseek R1's 1.3% and Claude 3.7's 0.9%.

Most advanced LLMs use chain-of-thought (CoT) reasoning, in which a complex problem is broken down into multiple, much simpler intermediate steps that are expressed in natural language.

It emulates the human thought process by breaking down elaborate problems into digestible chunks. But the Sapient scientists argue in the study that CoT has key shortcomings — namely "brittle task decomposition, extensive data requirements, and high latency."
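In practice, chain-of-thought is a prompting convention: the model is asked to write its intermediate steps in natural language before the final answer, and the answer is then read off the end of the generated text. A minimal sketch of that pattern (the prompt wording, the toy completion, and the parser below are illustrative assumptions, not any particular vendor's API):

```python
def build_cot_prompt(question: str) -> str:
    # Ask the model to emit intermediate steps before the final answer.
    return (
        f"Q: {question}\n"
        "Let's think step by step, then give the final answer "
        "on a line starting with 'Answer:'."
    )

def parse_final_answer(completion: str):
    # The answer is read off the end of a generated step sequence, so one
    # wrong intermediate step typically corrupts everything after it: the
    # brittleness the Sapient authors criticize.
    for line in completion.splitlines():
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return None

# A completion a CoT-prompted model might produce for "What is 17 * 3 + 9?":
completion = "Step 1: 17 * 3 = 51\nStep 2: 51 + 9 = 60\nAnswer: 60"
```

Because the answer depends on every step that precedes it, and learning this format demands large amounts of step-annotated data, CoT trades latency and data for its apparent transparency.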

Instead, HRM executes sequential reasoning tasks in a single forward pass, without any explicit supervision of the intermediate steps, through two modules. One high-level module is responsible for slow, abstract planning, while a low-level module handles rapid and detailed computations. This is similar to the way in which the human brain processes information in different regions.

It operates by applying iterative refinement — a computing technique that improves the accuracy of a solution by repeatedly refining an initial approximation — over several short bursts of "thinking." Each burst considers whether the process of thinking should continue or be submitted as a "final" answer to the initial prompt.
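The alternation described above, slow planning plus fast computation with a per-burst halting decision, can be illustrated with a deliberately simple toy. Everything concrete here (the integer-square-root task, both update rules, and the halting test) is an assumption made for illustration; it is not Sapient's implementation:

```python
# Toy sketch of HRM-style iterative refinement: refine an approximation over
# short "bursts of thinking," each burst mixing a slow coarse plan with fast
# detailed updates, then deciding whether to submit a final answer.

def hrm_style_refine(n: int, max_bursts: int = 32) -> int:
    """Compute the integer square root of n by iterative refinement."""
    guess = max(n, 1)  # initial approximation
    for _ in range(max_bursts):
        # "High-level module" (slow): coarse plan -- bound the search this burst.
        upper_bound = guess
        # "Low-level module" (fast): a few detailed Newton steps within the plan.
        for _ in range(3):
            guess = min((guess + n // guess) // 2, upper_bound) if guess else 0
        # Halting decision: submit a "final" answer once the guess is consistent.
        if guess * guess <= n < (guess + 1) ** 2:
            return guess
    return guess
```

The point of the sketch is the control flow, not the arithmetic: no intermediate step is ever expressed in natural language or supervised; the loop simply refines internal state until its own halting test says the answer is good enough.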

HRM achieved near-perfect performance on challenging tasks like complex Sudoku puzzles — which conventional LLMs could not accomplish — as well as excelling at optimal path-finding in mazes.

The paper has not been peer-reviewed, but the organizers of the ARC-AGI benchmark attempted to recreate the results for themselves after the study scientists open-sourced their model on GitHub.

Although they reproduced the numbers, representatives said in a blog post, they made some surprising findings, including that the hierarchical architecture had minimal performance impact — instead, there was an under-documented refinement process during training that drove substantial performance gains.

Keumars is the technology editor at Live Science. He has written for a variety of publications including ITPro, The Week Digital, ComputerActive, The Independent, The Observer, Metro and TechRadar Pro. He has worked as a technology journalist for more than five years, having previously held the role of features editor with ITPro.

He is an NCTJ-qualified journalist and has a degree in biomedical sciences from Queen Mary, University of London. He's also registered as a foundational chartered manager with the Chartered Management Institute (CMI), having qualified as a Level 3 Team leader with distinction in 2023.
