SuperCLUE Download - SuperCLUE Source code download

SuperCLUE

Other source code

1.0.0

Download

SuperCLUE

Chinese general model comprehensive benchmark SuperCLUE

【Langya Bang】-Chinese big model special arena, the leading models you care about are all here

"Chinese Big Model Benchmark Evaluation April 2024 Report"

The latest list of SuperCLUE Chinese big model evaluation benchmarks (May 2024)

Official website address: www.cluebenchmarks.com/superclue.html

Technical Report: SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

[2023-12-27] "Chinese Big Model Benchmark Evaluation Report 2023 Annual Report" released

【2023-12-28】 Release SuperCLUE-December 2023 List

【2023-10-19】 SuperCLUE-Agent: Agent Chinese native task evaluation benchmark

【2023-9-12】 SuperCLUE-Safety: Chinese big model multi-round confrontation safety benchmark

[2023-9-26], SuperCLUE released the September list of Chinese models.

SuperCLUE is a comprehensive big model evaluation benchmark. This evaluation mainly focuses on four ability quadrants of big model, including language understanding and generation, professional skills and knowledge, Agent agents and security, and is further refined into 12 basic abilities.

Compared with last month, the AI Agent intelligent body was added

SuperCLUE capability evaluation structure diagram

SuperCLUE multi-dimensional evaluation plan

Why are the new AI Agent agent capabilities added?

AI agents are currently cutting-edge research hotspots related to large language models. They have the ability of human super assistants in science fiction movies such as Jarvis, and can independently complete tasks according to needs. However, for AI agents, extensive evaluation of Chinese big models is lacking. To solve this problem, we have added a new evaluation of the capabilities of AI agents to SuperCLUE's new list. This list will focus on evaluating the performance of AI agents in two key abilities, [tool use] and [task planning]. This work aims to provide a basis and possibility for evaluating the performance of Chinese big models as agents.

SuperCLUE overall ranking list (December 2023)

Ranking	Model	mechanism	Total points	OPEN multi-round opening problem	Three objective questions about OPT ability	use
-	GPT4-Turbo	OpenAI	90.63	90.89	90.03	API
-	GPT4 (web page)	OpenAI	83.92	80.76	91.28	Web page
-	GPT4 (API)	OpenAI	79.84	76.24	88.24	API
?️	Wen Xin Yiyan 4.0 (API)	Baidu	79.02	75.00	88.38	API
?	Tongyi Qianwen 2.0	Alibaba	76.54	71.78	87.64	API
?	AndesGPT	OPPO	75.04	70.01	86.76	API
4	Wisdom and clear words	Tsinghua & Wisdom	74.11	69.91	83.92	Web page
5	Moonshot(KimiChat)	The dark side of the moon	71.92	67.25	82.81	Web page
-	Wen Xin Yiyan 4.0 (web page)	Baidu	70.28	62.59	88.22	Web page
6	Qwen-72B-Chat	Alibaba	69.69	62.31	86.90	API
7	Sequence Monkey	Go out and ask	68.98	61.01	87.59	API
8	Yi-34B-Chat	Zero ten thousand things	68.46	61.99	83.56	Model
9	PCI-TransGPT	Jiadu Technology	68.33	60.41	86.81	API
9	360GPT_Pro	360	68.32	61.36	84.56	API
-	Claude2	Anthropic	67.43	65.14	72.77	API
11	Skylark big model (bean bun)	ByteDance	66.35	58.53	84.60	Web page
-	Gemini-pro	Google	65.29	59.33	79.20	API
-	GPT3.5-Turbo	OpenAI	61.44	55.63	74.98	API
12	Qwen-14B-Chat	Alibaba	61.27	52.04	82.81	API
13	Baichuan2-13B-Chat	Baichuan Intelligent	61.12	54.45	76.67	Model
14	XVERSE-13B-2-Chat	Yuanxiang Technology	60.46	53.00	77.87	Model
15	iFLYTEK Spark V3.0	iFlytek	59.33	51.74	77.03	API
16	Minimax(as for matters)	Xiyu Technology	58.91	50.00	79.69	Web page
17	ChatGLM3-6B	Tsinghua & Wisdom	49.50	42.30	66.31	Model
18	Chinese-Alpaca-2-13B	yiming cui	45.36	38.91	60.40	Model
-	Llama_2_13B_Chat	Meta	37.36	34.91	43.09	Model

Note: If the scores in the forefront are relatively close (less than 0.03 points), they will be marked as a tiered name when ranking.

SuperCLUE-OPEN Multi-round Open Issues Ranking (December 2023)

Ranking	Model	mechanism	OPEN multi-round opening problem	Language and knowledge	Professional and Skills	Tool usage	Traditional security	use
-	GPT4-Turbo	OpenAI	90.89	90.21	97.00	100.00	62.75	API
-	GPT4 (web page)	OpenAI	80.76	79.49	82.87	94.63	64.71	Web page
-	GPT4 (API)	OpenAI	76.24	73.96	81.15	93.34	53.92	API
?️	Wen Xin Yiyan 4.0 (API)	Baidu	75.00	69.54	79.62	80.92	68.00	API
?	Tongyi Qianwen 2.0	Alibaba	71.78	71.58	73.40	76.32	52.94	API
?	AndesGPT	OPPO	70.01	72.23	68.80	70.71	55.88	API
4	Wisdom and clear words	Tsinghua & Wisdom	69.91	66.98	68.63	83.78	65.31	Web page
5	Moonshot(KimiChat)	The dark side of the moon	67.25	69.72	72.57	62.19	43.14	Web page
-	Claude2	Anthropic	65.14	55.28	73.27	65.13	83.00	API
-	Wen Xin Yiyan 4.0 (web page)	Baidu	62.59	65.05	63.26	47.37	64.00	Web page
6	Qwen-72B-Chat	Alibaba	62.31	59.43	65.59	60.67	52.00	API
7	Yi-34B-Chat	Zero ten thousand things	61.99	63.90	54.55	71.05	65.31	Model
8	360GPT_Pro	360	61.36	62.09	58.70	69.33	60.00	API
9	Sequence Monkey	Go out and ask	61.01	65.81	59.99	56.58	45.10	API
10	PCI-TransGPT	Jiadu Technology	60.41	60.39	61.56	64.66	50.98	API
-	Gemini-pro	Google	59.33	60.50	61.43	46.53	62.50	API
11	Skylark big model (bean bun)	ByteDance	58.53	57.75	56.42	55.26	67.65	Web page
-	GPT3.5-Turbo	OpenAI	55.63	55.30	56.24	55.26	52.00	API
12	Baichuan2-13B-Chat	Baichuan Intelligent	54.45	57.35	48.69	56.58	54.90	Model
13	XVERSE-13B-2-Chat	Yuanxiang Technology	53.00	54.63	45.82	63.33	57.84	Model
14	Qwen-14B-Chat	Alibaba	52.04	54.29	48.38	45.33	56.86	API
15	iFLYTEK Spark V3.0	iFlytek	51.74	57.40	48.41	44.00	43.14	API
16	Minimax(as for matters)	Xiyu Technology	50.00	53.54	45.05	40.13	50.00	Web page
17	ChatGLM3-6B	Tsinghua & Wisdom	42.30	46.67	36.15	34.25	53.92	Model
18	Chinese-Alpaca-2-13B	yiming cui	38.91	46.46	29.35	27.63	46.94	Model
-	Llama_2_13B_Chat	Meta	34.91	36.55	30.21	32.67	53.92	Model

Ranking of objective questions for the three major abilities of SuperCLUE-OPT (December 2023)

Ranking	Model	mechanism	OPT score	Basic abilities	Chinese characteristics	Academic and professional competence	use
-	GPT4 (web page)	OpenAI	91.28	97.62	82.38	93.85	Web page
-	GPT4-Turbo	OpenAI	90.03	96.99	79.16	93.93	API
?️	Wen Xin Yiyan 4.0 (API)	Baidu	88.38	91.65	86.18	87.32	API
-	GPT4 (API)	OpenAI	88.24	92.92	81.84	89.95	API
-	Wen Xin Yiyan 4.0 (web page)	Baidu	88.22	76.48	78.32	57.05	Web page
?	Tongyi Qianwen 2.0	Alibaba	87.64	78.65	81.28	63.48	API
?	Sequence Monkey	Go out and ask	87.59	91.46	80.28	90.57	API
4	Qwen-72B-Chat	Alibaba	86.90	92.21	76.65	91.05	API
5	PCI-TransGPT	Jiadu Technology	86.81	90.76	80.88	88.42	API
6	AndesGPT	OPPO	86.76	92.55	76.17	90.81	API
7	Skylark big model (bean bun)	ByteDance	84.60	88.75	70.89	93.06	Web page
8	360GPT_Pro	360	84.56	91.70	73.32	87.93	API
9	Wisdom and clear words	Tsinghua & Wisdom	83.92	89.14	73.10	88.72	Web page
10	Yi-34B-Chat	Zero ten thousand things	83.56	86.90	72.81	90.12	Model
11	Qwen-14B-Chat	Alibaba	82.81	91.14	68.67	87.31	API
12	Moonshot(KimiChat)	The dark side of the moon	82.81	87.77	73.39	86.41	Web page
13	Minimax(as for matters)	Xiyu Technology	79.69	86.52	66.18	85.18	Web page
-	Gemini-pro	Google	79.20	83.72	70.78	82.51	API
14	XVERSE-13B-2-Chat	Yuanxiang Technology	77.87	84.46	62.96	83.85	Model
15	iFLYTEK Spark V3.0	iFlytek	77.03	84.04	63.43	82.48	API
16	Baichuan2-13B-Chat	Baichuan Intelligent	76.67	80.61	63.79	84.50	Model
-	GPT3.5-Turbo	OpenAI	74.98	83.78	62.83	77.60	API
-	Claude2	Anthropic	72.77	82.13	65.83	70.10	API
17	ChatGLM3-6B	Tsinghua & Wisdom	66.31	72.63	54.05	71.38	Model
18	Chinese-Alpaca-2-13B	yiming cui	60.40	70.39	47.75	62.31	Model
-	Llama_2_13B_Chat	Meta	43.09	50.41	37.22	41.48	Model

SuperCLUE Top Ten Basic Capabilities Ranking (December 2023)

Model	calculate	Logical reasoning	Code	Knowledge Encyclopedia	Language comprehension	Generate creation	dialogue	role play	Tool usage	Traditional security
GPT4-Turbo	97.24	97.59	96.18	89.62	87.82	89.93	89.22	94.46	100.00	62.75
GPT4 (web page)	81.16	85.62	81.84	79.17	81.91	78.91	78.38	79.09	94.63	64.71
Wen Xin Yiyan 4.0 (API)	77.84	87.84	73.19	98.63	71.93	66.36	57.03	53.77	80.92	68.00
GPT4 (API)	77.60	85.37	80.49	78.08	73.04	72.73	75.78	70.17	93.34	53.92
Claude2	70.10	80.14	69.57	62.33	72.32	39.81	54.76	47.17	65.13	83.00
Tongyi Qianwen 2.0	70.10	73.29	76.81	93.15	71.93	62.73	68.75	61.32	76.32	52.94
Wisdom and clear words	69.07	77.40	59.42	89.73	64.91	61.11	57.81	61.32	83.78	65.31
Qwen-72B-Chat	68.56	68.06	60.14	95.89	63.16	42.59	48.44	47.06	60.67	52.00
Moonshot(KimiChat)	68.54	79.65	69.52	100.00	66.78	59.65	61.33	60.84	62.19	43.14
AndesGPT	62.59	72.26	71.55	88.36	74.82	64.23	68.56	65.19	70.71	55.88
GPT3.5-Turbo	60.31	54.05	54.35	60.27	59.82	55.45	50.00	50.96	55.26	52.00
360GPT_Pro	56.43	64.97	54.70	93.84	62.79	55.73	55.75	42.32	69.33	60.00
Gemini-pro	56.32	58.45	69.53	73.91	61.61	54.63	52.54	59.80	46.53	62.50
Sequence Monkey	55.38	67.12	57.48	92.47	58.77	57.81	56.75	63.27	56.58	45.10
Skylark big model (bean bun)	54.69	68.92	45.65	86.99	56.14	48.18	53.12	44.34	55.26	67.65
Yi-34B-Chat	50.00	64.38	49.28	88.36	65.18	62.73	58.87	44.34	71.05	65.31
PCI-TransGPT	49.99	72.19	62.49	82.88	60.45	57.18	54.76	46.69	64.66	50.98
Qwen-14B-Chat	49.48	56.85	38.81	76.71	61.40	45.45	43.75	44.12	45.33	56.86
Wen Xin Yiyan 4.0 (web page)	48.45	79.73	61.59	97.26	65.79	60.91	53.17	48.11	47.37	64.00
XVERSE-13B-2-Chat	43.30	50.68	43.48	72.92	57.02	47.27	46.88	49.06	63.33	57.84
Minimax(as for matters)	43.30	61.43	30.43	100.00	55.26	33.33	45.16	33.96	40.13	50.00
Baichuan2-13B-Chat	40.62	66.22	39.23	78.77	53.51	52.78	55.47	46.23	56.58	54.90
iFLYTEK Spark V3.0	38.54	57.43	49.26	83.57	62.28	47.17	46.83	47.17	44.00	43.14
ChatGLM3-6B	34.74	41.10	32.61	56.94	54.39	38.18	41.41	42.45	34.25	53.92
Llama_2_13B_Chat	24.74	40.54	25.36	36.11	41.07	43.64	28.91	33.02	32.67	53.92
Chinese-Alpaca-2-13B	22.40	45.21	20.45	51.37	51.75	39.09	47.66	42.45	27.63	46.94

SuperCLUE open source model rankings (December 2023)

Ranking	Model	mechanism	Total points	OPEN Multiple rounds of opening issues	OPT Three objective questions about abilities
?️	Qwen-72B-Chat	Alibaba	69.69	62.31	86.90
?	Yi-34B-Chat	Zero ten thousand things	68.46	61.99	83.56
?	Qwen-14B-Chat	Alibaba	61.27	52.04	82.81
4	Baichuan2-13B-Chat	Baichuan Intelligent	61.12	54.45	76.67
5	XVERSE-13B-2-Chat	Yuanxiang Technology	60.46	53.00	77.87
6	ChatGLM3-6B	Tsinghua & Wisdom	49.50	42.30	66.31
7	Chinese-Alpaca-2-13B	yiming cui	45.36	38.91	60.40
-	Llama_2_13B_Chat	Meta	37.36	34.91	43.09

23-November evaluation improvement

 1. 本次测评中SuperCLUE-Open的超级模型（裁判模型）由10月的GPT4升级为能力更强的GPT4-Turbo，进一步提升开放主观题评估的精确性。

2. 本次SuperCLUE-Open测评集总量由10月的3754道题扩展至4265道题。

3. 与10月相比，本次测评新增了腾讯的混元、阿里云的通义千问2.0(v1030)、零一万物的Yi-34B-Chat、清华&智谱AI的ChatGLM3-Turbo和ChatGLM3-6B、
元象科技的XVERSE-13B-2-Chat。

Example

Ability 1: Semantic understanding and extraction

This is a language ability that can understand and parse the meaning of input text information. The model needs to be able to identify the meaning of phrases, sentences, and paragraphs, while also being able to extract key information and topics from larger blocks of text.

Multi-round dialogue example

Note: In this example, multiple rounds of dialogue capabilities can be evaluated simultaneously.

Ability 2: AI agent capability

AI agents are currently cutting-edge research hotspots related to large language models. They have the ability of human super assistants in science fiction movies such as Jarvis, and can independently complete tasks according to needs.

Focus on evaluating the performance of AI agents in two key abilities: [Tool Usage] and [Task Planning]