Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models
Ziqian Bi, Keyu Chen, Chiung-Yi Tseng, Danyang Zhang, Tianyang Wang, Hongying Luo, Lu Chen, Junming Huang, Jibin Guan, Junfeng Hao, Junhao Song
In August 2025, OpenAI released the GPT-OSS models, its first open weight large language models since GPT-2 in 2019, comprising two mixture of experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open source large language models ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantized form under standardized inference settings, with statistical validation using McNemar's test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further research into optimization strategies and informing more efficient model selection for future open source deployments.
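
As a rough illustration of the statistical protocol named in the abstract, the sketch below shows how paired per-item correctness from two models could be compared with McNemar's test, with a discordant-pair odds ratio as a simple effect size. The `compare_models` helper and the toy data are assumptions for illustration, not the authors' actual evaluation harness.

```python
# Minimal sketch of a paired significance test between two models,
# assuming per-item binary correctness records on the same benchmark.
# Names and data here are illustrative, not from the paper's code.
from statsmodels.stats.contingency_tables import mcnemar

def compare_models(correct_a, correct_b):
    """Run McNemar's test on paired per-item correctness vectors.

    correct_a, correct_b: lists of booleans, one entry per benchmark
    item, indicating whether each model answered that item correctly.
    """
    # Build the 2x2 contingency table of agreement/disagreement.
    both    = sum(a and b for a, b in zip(correct_a, correct_b))
    only_a  = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b  = sum(b and not a for a, b in zip(correct_a, correct_b))
    neither = sum(not a and not b for a, b in zip(correct_a, correct_b))
    table = [[both, only_a], [only_b, neither]]

    # The exact binomial variant is appropriate when the discordant
    # counts (only_a, only_b) are small.
    result = mcnemar(table, exact=True)

    # Odds ratio of discordant pairs as a simple effect size.
    odds_ratio = only_a / only_b if only_b else float("inf")
    return result.pvalue, odds_ratio

# Toy example: per-item correctness for gpt-oss-120B vs gpt-oss-20B.
p, effect = compare_models(
    [True, True, False, True, False, False, True, False],
    [True, False, True, True, True, False, True, True],
)
print(f"McNemar p-value: {p:.3f}, discordant odds ratio: {effect:.2f}")
```

McNemar's test is the natural choice here because the two models answer the same benchmark items, so only the discordant pairs (items one model gets right and the other gets wrong) carry information about which model is stronger.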