Chinese general model comprehensive benchmark SuperCLUE
【Langya Bang】-Chinese big model special arena, the leading models you care about are all here
"Chinese Big Model Benchmark Evaluation April 2024 Report"
The latest list of SuperCLUE Chinese big model evaluation benchmarks (May 2024)
Official website address: www.cluebenchmarks.com/superclue.html
Technical Report: SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark
[2023-12-27] "Chinese Big Model Benchmark Evaluation Report 2023 Annual Report" released
【2023-12-28】 Release SuperCLUE-December 2023 List
【2023-10-19】 SuperCLUE-Agent: Agent Chinese native task evaluation benchmark
【2023-9-12】 SuperCLUE-Safety: Chinese big model multi-round confrontation safety benchmark
[2023-9-26], SuperCLUE released the September list of Chinese models.
SuperCLUE is a comprehensive big model evaluation benchmark. This evaluation mainly focuses on four ability quadrants of big model, including language understanding and generation, professional skills and knowledge, Agent agents and security, and is further refined into 12 basic abilities.
Compared with last month, the AI Agent intelligent body was added



AI agents are currently cutting-edge research hotspots related to large language models. They have the ability of human super assistants in science fiction movies such as Jarvis, and can independently complete tasks according to needs. However, for AI agents, extensive evaluation of Chinese big models is lacking. To solve this problem, we have added a new evaluation of the capabilities of AI agents to SuperCLUE's new list. This list will focus on evaluating the performance of AI agents in two key abilities, [tool use] and [task planning]. This work aims to provide a basis and possibility for evaluating the performance of Chinese big models as agents.
| Ranking | Model | mechanism | Total points | OPEN multi-round opening problem | Three objective questions about OPT ability | use |
|---|---|---|---|---|---|---|
| - | GPT4-Turbo | OpenAI | 90.63 | 90.89 | 90.03 | API |
| - | GPT4 (web page) | OpenAI | 83.92 | 80.76 | 91.28 | Web page |
| - | GPT4 (API) | OpenAI | 79.84 | 76.24 | 88.24 | API |
| ?️ | Wen Xin Yiyan 4.0 (API) | Baidu | 79.02 | 75.00 | 88.38 | API |
| ? | Tongyi Qianwen 2.0 | Alibaba | 76.54 | 71.78 | 87.64 | API |
| ? | AndesGPT | OPPO | 75.04 | 70.01 | 86.76 | API |
| 4 | Wisdom and clear words | Tsinghua & Wisdom | 74.11 | 69.91 | 83.92 | Web page |
| 5 | Moonshot(KimiChat) | The dark side of the moon | 71.92 | 67.25 | 82.81 | Web page |
| - | Wen Xin Yiyan 4.0 (web page) | Baidu | 70.28 | 62.59 | 88.22 | Web page |
| 6 | Qwen-72B-Chat | Alibaba | 69.69 | 62.31 | 86.90 | API |
| 7 | Sequence Monkey | Go out and ask | 68.98 | 61.01 | 87.59 | API |
| 8 | Yi-34B-Chat | Zero ten thousand things | 68.46 | 61.99 | 83.56 | Model |
| 9 | PCI-TransGPT | Jiadu Technology | 68.33 | 60.41 | 86.81 | API |
| 9 | 360GPT_Pro | 360 | 68.32 | 61.36 | 84.56 | API |
| - | Claude2 | Anthropic | 67.43 | 65.14 | 72.77 | API |
| 11 | Skylark big model (bean bun) | ByteDance | 66.35 | 58.53 | 84.60 | Web page |
| - | Gemini-pro | 65.29 | 59.33 | 79.20 | API | |
| - | GPT3.5-Turbo | OpenAI | 61.44 | 55.63 | 74.98 | API |
| 12 | Qwen-14B-Chat | Alibaba | 61.27 | 52.04 | 82.81 | API |
| 13 | Baichuan2-13B-Chat | Baichuan Intelligent | 61.12 | 54.45 | 76.67 | Model |
| 14 | XVERSE-13B-2-Chat | Yuanxiang Technology | 60.46 | 53.00 | 77.87 | Model |
| 15 | iFLYTEK Spark V3.0 | iFlytek | 59.33 | 51.74 | 77.03 | API |
| 16 | Minimax(as for matters) | Xiyu Technology | 58.91 | 50.00 | 79.69 | Web page |
| 17 | ChatGLM3-6B | Tsinghua & Wisdom | 49.50 | 42.30 | 66.31 | Model |
| 18 | Chinese-Alpaca-2-13B | yiming cui | 45.36 | 38.91 | 60.40 | Model |
| - | Llama_2_13B_Chat | Meta | 37.36 | 34.91 | 43.09 | Model |
Note: If the scores in the forefront are relatively close (less than 0.03 points), they will be marked as a tiered name when ranking.
| Ranking | Model | mechanism | OPEN multi-round opening problem | Language and knowledge | Professional and Skills | Tool usage | Traditional security | use |
|---|---|---|---|---|---|---|---|---|
| - | GPT4-Turbo | OpenAI | 90.89 | 90.21 | 97.00 | 100.00 | 62.75 | API |
| - | GPT4 (web page) | OpenAI | 80.76 | 79.49 | 82.87 | 94.63 | 64.71 | Web page |
| - | GPT4 (API) | OpenAI | 76.24 | 73.96 | 81.15 | 93.34 | 53.92 | API |
| ?️ | Wen Xin Yiyan 4.0 (API) | Baidu | 75.00 | 69.54 | 79.62 | 80.92 | 68.00 | API |
| ? | Tongyi Qianwen 2.0 | Alibaba | 71.78 | 71.58 | 73.40 | 76.32 | 52.94 | API |
| ? | AndesGPT | OPPO | 70.01 | 72.23 | 68.80 | 70.71 | 55.88 | API |
| 4 | Wisdom and clear words | Tsinghua & Wisdom | 69.91 | 66.98 | 68.63 | 83.78 | 65.31 | Web page |
| 5 | Moonshot(KimiChat) | The dark side of the moon | 67.25 | 69.72 | 72.57 | 62.19 | 43.14 | Web page |
| - | Claude2 | Anthropic | 65.14 | 55.28 | 73.27 | 65.13 | 83.00 | API |
| - | Wen Xin Yiyan 4.0 (web page) | Baidu | 62.59 | 65.05 | 63.26 | 47.37 | 64.00 | Web page |
| 6 | Qwen-72B-Chat | Alibaba | 62.31 | 59.43 | 65.59 | 60.67 | 52.00 | API |
| 7 | Yi-34B-Chat | Zero ten thousand things | 61.99 | 63.90 | 54.55 | 71.05 | 65.31 | Model |
| 8 | 360GPT_Pro | 360 | 61.36 | 62.09 | 58.70 | 69.33 | 60.00 | API |
| 9 | Sequence Monkey | Go out and ask | 61.01 | 65.81 | 59.99 | 56.58 | 45.10 | API |
| 10 | PCI-TransGPT | Jiadu Technology | 60.41 | 60.39 | 61.56 | 64.66 | 50.98 | API |
| - | Gemini-pro | 59.33 | 60.50 | 61.43 | 46.53 | 62.50 | API | |
| 11 | Skylark big model (bean bun) | ByteDance | 58.53 | 57.75 | 56.42 | 55.26 | 67.65 | Web page |
| - | GPT3.5-Turbo | OpenAI | 55.63 | 55.30 | 56.24 | 55.26 | 52.00 | API |
| 12 | Baichuan2-13B-Chat | Baichuan Intelligent | 54.45 | 57.35 | 48.69 | 56.58 | 54.90 | Model |
| 13 | XVERSE-13B-2-Chat | Yuanxiang Technology | 53.00 | 54.63 | 45.82 | 63.33 | 57.84 | Model |
| 14 | Qwen-14B-Chat | Alibaba | 52.04 | 54.29 | 48.38 | 45.33 | 56.86 | API |
| 15 | iFLYTEK Spark V3.0 | iFlytek | 51.74 | 57.40 | 48.41 | 44.00 | 43.14 | API |
| 16 | Minimax(as for matters) | Xiyu Technology | 50.00 | 53.54 | 45.05 | 40.13 | 50.00 | Web page |
| 17 | ChatGLM3-6B | Tsinghua & Wisdom | 42.30 | 46.67 | 36.15 | 34.25 | 53.92 | Model |
| 18 | Chinese-Alpaca-2-13B | yiming cui | 38.91 | 46.46 | 29.35 | 27.63 | 46.94 | Model |
| - | Llama_2_13B_Chat | Meta | 34.91 | 36.55 | 30.21 | 32.67 | 53.92 | Model |
| Ranking | Model | mechanism | OPT score | Basic abilities | Chinese characteristics | Academic and professional competence | use |
|---|---|---|---|---|---|---|---|
| - | GPT4 (web page) | OpenAI | 91.28 | 97.62 | 82.38 | 93.85 | Web page |
| - | GPT4-Turbo | OpenAI | 90.03 | 96.99 | 79.16 | 93.93 | API |
| ?️ | Wen Xin Yiyan 4.0 (API) | Baidu | 88.38 | 91.65 | 86.18 | 87.32 | API |
| - | GPT4 (API) | OpenAI | 88.24 | 92.92 | 81.84 | 89.95 | API |
| - | Wen Xin Yiyan 4.0 (web page) | Baidu | 88.22 | 76.48 | 78.32 | 57.05 | Web page |
| ? | Tongyi Qianwen 2.0 | Alibaba | 87.64 | 78.65 | 81.28 | 63.48 | API |
| ? | Sequence Monkey | Go out and ask | 87.59 | 91.46 | 80.28 | 90.57 | API |
| 4 | Qwen-72B-Chat | Alibaba | 86.90 | 92.21 | 76.65 | 91.05 | API |
| 5 | PCI-TransGPT | Jiadu Technology | 86.81 | 90.76 | 80.88 | 88.42 | API |
| 6 | AndesGPT | OPPO | 86.76 | 92.55 | 76.17 | 90.81 | API |
| 7 | Skylark big model (bean bun) | ByteDance | 84.60 | 88.75 | 70.89 | 93.06 | Web page |
| 8 | 360GPT_Pro | 360 | 84.56 | 91.70 | 73.32 | 87.93 | API |
| 9 | Wisdom and clear words | Tsinghua & Wisdom | 83.92 | 89.14 | 73.10 | 88.72 | Web page |
| 10 | Yi-34B-Chat | Zero ten thousand things | 83.56 | 86.90 | 72.81 | 90.12 | Model |
| 11 | Qwen-14B-Chat | Alibaba | 82.81 | 91.14 | 68.67 | 87.31 | API |
| 12 | Moonshot(KimiChat) | The dark side of the moon | 82.81 | 87.77 | 73.39 | 86.41 | Web page |
| 13 | Minimax(as for matters) | Xiyu Technology | 79.69 | 86.52 | 66.18 | 85.18 | Web page |
| - | Gemini-pro | 79.20 | 83.72 | 70.78 | 82.51 | API | |
| 14 | XVERSE-13B-2-Chat | Yuanxiang Technology | 77.87 | 84.46 | 62.96 | 83.85 | Model |
| 15 | iFLYTEK Spark V3.0 | iFlytek | 77.03 | 84.04 | 63.43 | 82.48 | API |
| 16 | Baichuan2-13B-Chat | Baichuan Intelligent | 76.67 | 80.61 | 63.79 | 84.50 | Model |
| - | GPT3.5-Turbo | OpenAI | 74.98 | 83.78 | 62.83 | 77.60 | API |
| - | Claude2 | Anthropic | 72.77 | 82.13 | 65.83 | 70.10 | API |
| 17 | ChatGLM3-6B | Tsinghua & Wisdom | 66.31 | 72.63 | 54.05 | 71.38 | Model |
| 18 | Chinese-Alpaca-2-13B | yiming cui | 60.40 | 70.39 | 47.75 | 62.31 | Model |
| - | Llama_2_13B_Chat | Meta | 43.09 | 50.41 | 37.22 | 41.48 | Model |
| Model | calculate | Logical reasoning | Code | Knowledge Encyclopedia | Language comprehension | Generate creation | dialogue | role play | Tool usage | Traditional security |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT4-Turbo | 97.24 | 97.59 | 96.18 | 89.62 | 87.82 | 89.93 | 89.22 | 94.46 | 100.00 | 62.75 |
| GPT4 (web page) | 81.16 | 85.62 | 81.84 | 79.17 | 81.91 | 78.91 | 78.38 | 79.09 | 94.63 | 64.71 |
| Wen Xin Yiyan 4.0 (API) | 77.84 | 87.84 | 73.19 | 98.63 | 71.93 | 66.36 | 57.03 | 53.77 | 80.92 | 68.00 |
| GPT4 (API) | 77.60 | 85.37 | 80.49 | 78.08 | 73.04 | 72.73 | 75.78 | 70.17 | 93.34 | 53.92 |
| Claude2 | 70.10 | 80.14 | 69.57 | 62.33 | 72.32 | 39.81 | 54.76 | 47.17 | 65.13 | 83.00 |
| Tongyi Qianwen 2.0 | 70.10 | 73.29 | 76.81 | 93.15 | 71.93 | 62.73 | 68.75 | 61.32 | 76.32 | 52.94 |
| Wisdom and clear words | 69.07 | 77.40 | 59.42 | 89.73 | 64.91 | 61.11 | 57.81 | 61.32 | 83.78 | 65.31 |
| Qwen-72B-Chat | 68.56 | 68.06 | 60.14 | 95.89 | 63.16 | 42.59 | 48.44 | 47.06 | 60.67 | 52.00 |
| Moonshot(KimiChat) | 68.54 | 79.65 | 69.52 | 100.00 | 66.78 | 59.65 | 61.33 | 60.84 | 62.19 | 43.14 |
| AndesGPT | 62.59 | 72.26 | 71.55 | 88.36 | 74.82 | 64.23 | 68.56 | 65.19 | 70.71 | 55.88 |
| GPT3.5-Turbo | 60.31 | 54.05 | 54.35 | 60.27 | 59.82 | 55.45 | 50.00 | 50.96 | 55.26 | 52.00 |
| 360GPT_Pro | 56.43 | 64.97 | 54.70 | 93.84 | 62.79 | 55.73 | 55.75 | 42.32 | 69.33 | 60.00 |
| Gemini-pro | 56.32 | 58.45 | 69.53 | 73.91 | 61.61 | 54.63 | 52.54 | 59.80 | 46.53 | 62.50 |
| Sequence Monkey | 55.38 | 67.12 | 57.48 | 92.47 | 58.77 | 57.81 | 56.75 | 63.27 | 56.58 | 45.10 |
| Skylark big model (bean bun) | 54.69 | 68.92 | 45.65 | 86.99 | 56.14 | 48.18 | 53.12 | 44.34 | 55.26 | 67.65 |
| Yi-34B-Chat | 50.00 | 64.38 | 49.28 | 88.36 | 65.18 | 62.73 | 58.87 | 44.34 | 71.05 | 65.31 |
| PCI-TransGPT | 49.99 | 72.19 | 62.49 | 82.88 | 60.45 | 57.18 | 54.76 | 46.69 | 64.66 | 50.98 |
| Qwen-14B-Chat | 49.48 | 56.85 | 38.81 | 76.71 | 61.40 | 45.45 | 43.75 | 44.12 | 45.33 | 56.86 |
| Wen Xin Yiyan 4.0 (web page) | 48.45 | 79.73 | 61.59 | 97.26 | 65.79 | 60.91 | 53.17 | 48.11 | 47.37 | 64.00 |
| XVERSE-13B-2-Chat | 43.30 | 50.68 | 43.48 | 72.92 | 57.02 | 47.27 | 46.88 | 49.06 | 63.33 | 57.84 |
| Minimax(as for matters) | 43.30 | 61.43 | 30.43 | 100.00 | 55.26 | 33.33 | 45.16 | 33.96 | 40.13 | 50.00 |
| Baichuan2-13B-Chat | 40.62 | 66.22 | 39.23 | 78.77 | 53.51 | 52.78 | 55.47 | 46.23 | 56.58 | 54.90 |
| iFLYTEK Spark V3.0 | 38.54 | 57.43 | 49.26 | 83.57 | 62.28 | 47.17 | 46.83 | 47.17 | 44.00 | 43.14 |
| ChatGLM3-6B | 34.74 | 41.10 | 32.61 | 56.94 | 54.39 | 38.18 | 41.41 | 42.45 | 34.25 | 53.92 |
| Llama_2_13B_Chat | 24.74 | 40.54 | 25.36 | 36.11 | 41.07 | 43.64 | 28.91 | 33.02 | 32.67 | 53.92 |
| Chinese-Alpaca-2-13B | 22.40 | 45.21 | 20.45 | 51.37 | 51.75 | 39.09 | 47.66 | 42.45 | 27.63 | 46.94 |
| Ranking | Model | mechanism | Total points | OPEN Multiple rounds of opening issues | OPT Three objective questions about abilities |
|---|---|---|---|---|---|
| ?️ | Qwen-72B-Chat | Alibaba | 69.69 | 62.31 | 86.90 |
| ? | Yi-34B-Chat | Zero ten thousand things | 68.46 | 61.99 | 83.56 |
| ? | Qwen-14B-Chat | Alibaba | 61.27 | 52.04 | 82.81 |
| 4 | Baichuan2-13B-Chat | Baichuan Intelligent | 61.12 | 54.45 | 76.67 |
| 5 | XVERSE-13B-2-Chat | Yuanxiang Technology | 60.46 | 53.00 | 77.87 |
| 6 | ChatGLM3-6B | Tsinghua & Wisdom | 49.50 | 42.30 | 66.31 |
| 7 | Chinese-Alpaca-2-13B | yiming cui | 45.36 | 38.91 | 60.40 |
| - | Llama_2_13B_Chat | Meta | 37.36 | 34.91 | 43.09 |
1. 本次测评中SuperCLUE-Open的超级模型(裁判模型)由10月的GPT4升级为能力更强的GPT4-Turbo,进一步提升开放主观题评估的精确性。
2. 本次SuperCLUE-Open测评集总量由10月的3754道题扩展至4265道题。
3. 与10月相比,本次测评新增了腾讯的混元、阿里云的通义千问2.0(v1030)、零一万物的Yi-34B-Chat、清华&智谱AI的ChatGLM3-Turbo和ChatGLM3-6B、
元象科技的XVERSE-13B-2-Chat。
This is a language ability that can understand and parse the meaning of input text information. The model needs to be able to identify the meaning of phrases, sentences, and paragraphs, while also being able to extract key information and topics from larger blocks of text.

Note: In this example, multiple rounds of dialogue capabilities can be evaluated simultaneously.
AI agents are currently cutting-edge research hotspots related to large language models. They have the ability of human super assistants in science fiction movies such as Jarvis, and can independently complete tasks according to needs.
Focus on evaluating the performance of AI agents in two key abilities: [Tool Usage] and [Task Planning]

This is a language ability that requires understanding and remembering the previous conversational information in order to maintain coherence in the answers. This involves understanding the overall process and context of the conversation, or generating corresponding conversations.

This is a language ability that can create new textual content, such as articles, copywriting, short stories, and poetry. This involves the creative use of language, while also taking into account style, context and target readers.

This is a knowledge ability that can provide knowledge information like an encyclopedia. This involves understanding and answering questions about a wide range of topics, as well as providing accurate, detailed and up-to-date information.

This is a professional ability to understand and generate programming code. This involves understanding the syntax, structure, and habits of multiple programming languages, and how to solve programming problems.

Note: In this example, multiple rounds of dialogue capabilities can be evaluated simultaneously.
This is a professional ability to understand and apply logical principles to reason. This involves analyzing problems, identifying problems, and reasoning.

This is a professional ability that enables it to perform mathematical operations such as addition, subtraction, multiplication and division, and even more complex mathematical problems. This involves understanding the expression of mathematical problems and how to solve them step by step.

Note: In this example, multiple rounds of dialogue capabilities can be evaluated simultaneously.
This is a perceptual ability that allows it to play a role in a specific simulated environment or scenario. This involves understanding the behavior of a particular character, speaking style, and appropriate responses in a particular situation.

This is a security capability to prevent the generation of content that may cause trouble or harm. This involves identifying and avoiding requests that may contain sensitive or inappropriate content, as well as complying with the user's privacy and security policies.

The list will be updated regularly and more available Chinese models will be included. Individuals and institutions interested in big model evaluation are welcome to contact and communicate.
