ChineseGLUE

Language Understanding Evaluation benchmark for Chinese: datasets, baselines, pre-trained models, corpus and leaderboard

A benchmark for evaluating Chinese language understanding, including representative datasets, baseline (pre-trained) models, a corpus, and a leaderboard.

Update, November 22, 2019:

1) [Recommended] The new version is more systematic and comprehensive, with better technical support. It has moved to a new address: https://github.com/CLUEbenchmark/CLUE

2) The original classic version, which focuses mainly on practical tasks such as classification and sentence-pair tasks, will continue to be maintained and updated in this project.

We select datasets for a series of representative tasks to form our benchmark. These datasets cover different tasks, data sizes, and levels of difficulty.

Chinese Task Benchmark Evaluation (ChineseGLUE)-Leaderboard

The leaderboard is updated regularly. Data source: www.CLUEbenchmarks.com

Classification Tasks (v0, first version)

Model	Score	Parameters	TNEWS	LCQMC	XNLI	INEWS	BQ	MSRANER	THUCNEWS	iFLYTEKData
BERT-base	84.57	108M	89.78	86.9	77.8	82.7	85.08	95.38	95.35	63.57
BERT-wwm-ext	84.89	108M	89.81	87.3	78.7	83.46	85.21	95.26	95.57	63.83
ERNIE-base	84.63	108M	89.83	87.2	78.6	85.14	84.47	95.17	94.9	61.75
RoBERTa-large	85.08	334M	89.91	87.2	79.9	84	85.2	96.07	94.56	63.8
XLNet-mid	81.07	209M	86.26	85.98	78.7	84	77.85	92.11	94.54	60.16
ALBERT-xlarge	84.08	59M	88.3	86.76	74.0?	82.4	84.21	89.51	95.45	61.94
ALBERT-tiny	78.22	1.8M	87.1	85.4	68	81.4	80.76	84.77	93.54	44.83
RoBERTa-wwm-ext	84.55	108M	89.79	86.33	79.28	82.28	84.02	95.06	95.52	64.18
RoBERTa-wwm-large	85.13	330M	90.11	86.82	80.04	82.78	84.9	95.32	95.93	65.19

DRCD & CMRC2018: extractive reading comprehension (F1, EM); CHID: idiom multi-class reading comprehension (Acc); BQ: intelligent customer service question matching (Acc); MSRANER: named entity recognition (F1); iFLYTEK: long text classification (Acc).

Score is the average of the scores on datasets 1-9.
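As a quick sanity check of how Score is derived, the sketch below takes the unweighted average of the eight per-task numbers listed for BERT-base in the table above, which reproduces the listed score. The function and variable names are illustrative and are not part of the project code.

```python
from statistics import mean

# Per-task scores for BERT-base, copied from the classification table above.
bert_base = {
    "TNEWS": 89.78, "LCQMC": 86.9, "XNLI": 77.8, "INEWS": 82.7,
    "BQ": 85.08, "MSRANER": 95.38, "THUCNEWS": 95.35, "iFLYTEKData": 63.57,
}

def overall_score(task_scores):
    """Unweighted average of the per-task scores, rounded to two decimals."""
    return round(mean(task_scores.values()), 2)

print(overall_score(bert_base))  # 84.57
```

The same averaging applied to the three reading comprehension columns (DRCD, CMRC2018, CHID) gives the Score column of the reading comprehension table.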

Reading Comprehension Task

Model	Score	Parameters	DRCD	CMRC2018	CHID
BERT-base	79.08	108M	85.49	69.72	82.04
BERT-wwm-ext	-	108M	87.15	73.23	-
ERNIE-base	-	108M	86.03	73.32	-
RoBERTa-large	83.32	334M	89.35	76.11	84.5
XLNet-mid	-	209M	83.28	66.51	-
ALBERT-xlarge	-	59M	89.78	75.22	-
ALBERT-xxlarge	-	-	-	-	-
ALBERT-tiny	-	1.8M	70.08	53.68	-
RoBERTa-wwm-ext	81.88	108M	88.12	73.89	83.62
RoBERTa-wwm-large	84.22	330M	90.70	76.58	85.37

Note: When F1 and EM coexist in the above indicators, EM is taken as the final indicator.
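The per-task tables later in this document report reading comprehension cells in the form `F1:91.46 EM:85.49`. A minimal sketch of the rule in the note, assuming that cell format (the helper name `final_metric` is hypothetical):

```python
import re

def final_metric(cell):
    """Return EM when the cell reports it, otherwise the single number in the cell."""
    metrics = {name: float(value)
               for name, value in re.findall(r"(F1|EM):\s*([\d.]+)", cell)}
    if "EM" in metrics:
        return metrics["EM"]
    return float(cell)

print(final_metric("F1:91.46 EM:85.49"))  # 85.49
print(final_metric("82.04"))              # 82.04
```

The first call uses the BERT-base DRCD test cell, and the result matches the DRCD column of the table above.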

ChineseGLUE Vision

To better serve Chinese language understanding, its tasks, and the industry, and to complement general language model evaluation, ChineseGLUE aims to promote the development of Chinese language models by improving the infrastructure for Chinese language understanding.

*** 2019-10-13: Added an official evaluation website and an INEWS baseline model ***

Evaluation portal

Why do we need a benchmark for Chinese language understanding evaluation?

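First, Chinese is a major language with its own specific and extensive applications.

For example, Chinese has nearly 1.4 billion users and is one of the official languages of the United Nations, and many people in industry work on Chinese tasks.
Chinese is written with pictographic characters that have a graphical form; there are no delimiters between characters, and different segmentations (into characters or into words) affect downstream tasks.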

Second, compared with English, relatively few Chinese datasets are publicly available.

 Many datasets are not public or lack benchmark evaluations. Most papers test and evaluate their models on English datasets, so how well those models work on Chinese is unknown.

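Third, at the current stage of development, pre-trained models have greatly advanced natural language understanding.

 New pre-trained models keep appearing, but many state-of-the-art models have no official Chinese version, and there are no public tests of these pre-trained models on different tasks.
 As a result, there is still a considerable gap between technical progress and application; in other words, application lags behind the technology.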

Therefore, a benchmark for Chinese tasks that includes a set of datasets the public can widely use and evaluate on, fits the characteristics of Chinese tasks, and keeps pace with current international developments

 would alleviate some of the problems Chinese tasks face today and promote the development of related applications.

Benchmark for Chinese tasks: contents

The Language Understanding Evaluation benchmark for Chinese (ChineseGLUE) draws on ideas from GLUE, a collection of resources for training, evaluating, and analyzing natural language understanding systems. ChineseGLUE consists of:

1) A benchmark of Chinese tasks, covering multiple language tasks at varying levels

A benchmark of several sentence or sentence-pair language understanding tasks. Currently the datasets used in these tasks come from public sources. We will include datasets with private test sets before the end of 2019.

2) Public leaderboard

A public leaderboard for tracking performance. You will be able to submit your prediction files for these tasks; each task will be evaluated and scored, and a final score will also be available.

3) Baseline models, including starter code and pre-trained models

Baselines for ChineseGLUE tasks. Baselines will be available in TensorFlow, PyTorch, Keras, and PaddlePaddle.

4) Corpus for language modeling, pre-training or generative tasks

A large raw corpus for pre-training or language modeling research. It will contain around 10G of raw text in 2019.

In the first half of 2020, it will include at least 30G of raw text. By the end of 2020, we will include enough raw text, such as 100G, that you will need no further raw corpus for general-purpose language modeling. You can use it for general purposes, for domain adaptation, or even for text generation. For domain adaptation, you will be able to select the parts of the corpus you are interested in.

Introduction to the datasets

1. LCQMC: Semantic Similarity Task for colloquial descriptions

The input is two sentences and the output is 0 or 1, where 0 means the sentences are not semantically similar and 1 means they are.

    Data size: training set (238,766), validation set (8,802), test set (12,500)
    Examples:
     1.聊天室都有哪些好的 [separator] 聊天室哪个好 [separator] 1
     2.飞行员没钱买房怎么办？ [separator] 父母没钱买房子 [separator] 0

2. XNLI: Natural Language Inference

A dataset for cross-lingual understanding: given a premise and a hypothesis, determine whether the relationship between them is entailment, contradiction, or neutral.

    Data size: training set (392,703), validation set (2,491), test set (5,011)
    Examples:
     1.从 概念 上 看 , 奶油 收入 有 两 个 基本 方面 产品 和 地理 .[separator] 产品 和 地理 是 什么 使 奶油 抹 霜 工作 . [separator] neutral
     2.我们 的 一个 号码 会 非常 详细 地 执行 你 的 指示 [separator] 我 团队 的 一个 成员 将 非常 精确 地 执行 你 的 命令  [separator] entailment
    
    The original XNLI covers 15 languages (including low-resource languages). We select the Chinese portion and convert its format so that it is easy to use for training and testing.

3. TNEWS: Toutiao Chinese News (Short Text) Classification

    Data size: training set (266,000), validation set (57,000), test set (57,000)
    Example:
    6552431613437805063_!_102_!_news_entertainment_!_谢娜为李浩菲澄清网络谣言，之后她的两个行为给自己加分_!_佟丽娅,网络谣言,快乐大本营,李浩菲,谢娜,观众们
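    Each line is one record whose fields are separated by _!_; from first to last they are: news ID, category code, category name, news text (title only), news keywords.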
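A minimal sketch of reading one TNEWS record, assuming exactly five `_!_`-separated fields and comma-separated keywords as in the sample line above. The class and field names are illustrative and do not come from the project code.

```python
from typing import List, NamedTuple

class TnewsRecord(NamedTuple):
    news_id: str
    category_code: str
    category_name: str
    title: str
    keywords: List[str]

def parse_tnews_line(line: str) -> TnewsRecord:
    """Split one line on the _!_ separator into its five fields."""
    news_id, code, name, title, keywords = line.rstrip("\n").split("_!_")
    # Keywords are comma-separated; drop empty entries if the field is blank.
    return TnewsRecord(news_id, code, name, title,
                       [k for k in keywords.split(",") if k])

sample = ("6552431613437805063_!_102_!_news_entertainment_!_"
          "谢娜为李浩菲澄清网络谣言，之后她的两个行为给自己加分_!_"
          "佟丽娅,网络谣言,快乐大本营,李浩菲,谢娜,观众们")
rec = parse_tnews_line(sample)
print(rec.category_code, rec.category_name, len(rec.keywords))
# 102 news_entertainment 6
```

The same `split("_!_")` approach should carry over to the other `_!_`-separated datasets below (INEWS, THUCNEWS, iFLYTEK), with the field order changed to match each description.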

4. INEWS: Sentiment Analysis for Internet News

    Data size: training set (5,356), validation set (1,000), test set (1,000)
    Example:
    1_!_00005a3efe934a19adc0b69b05faeae7_!_九江办好人民满意教育_!_近3年来，九江市紧紧围绕“人本教育、公平教育、优质教育、幸福教育”的目标，努力办好人民满意教育，促进了义务教育均衡发展，农村贫困地区办学条件改善。目前，该市特色教育学校有70所 ......
    Each line is one record whose fields are separated by _!_; from first to last they are: sentiment label, data ID, news title, news content.

5. DRCD: Reading Comprehension for Traditional Chinese

The Delta Reading Comprehension Dataset (DRCD) (https://github.com/DRCKnowledgeTeam/DRCD) is a general-purpose Traditional Chinese machine reading comprehension dataset. It is intended to serve as a standard Chinese reading comprehension dataset suitable for transfer learning.

Data size: training set (8,016 paragraphs, 26,936 questions), validation set (1,000 paragraphs, 3,524 questions), test set (1,000 paragraphs, 3,493 questions)
Example:
{
  "version": "1.3",
  "data": [
    {
      "title": "基督新教",
      "id": "2128",
      "paragraphs": [
        {
          "context": "基督新教與天主教均繼承普世教會歷史上許多傳統教義，如三位一體、聖經作為上帝的啟示、原罪、認罪、最後審判等等，但有別於天主教和東正教，新教在行政上沒有單一組織架構或領導，而且在教義上強調因信稱義、信徒皆祭司， 以聖經作為最高權威，亦因此否定以教宗為首的聖統制、拒絕天主教教條中關於聖傳與聖經具同等地位的教導。新教各宗派間教義不盡相同，但一致認同五個唯獨：唯獨恩典：人的靈魂得拯救唯獨是神的恩典，是上帝送給人的禮物。唯獨信心：人唯獨藉信心接受神的赦罪、拯救。唯獨基督：作為人類的代罪羔羊，耶穌基督是人與上帝之間唯一的調解者。唯獨聖經：唯有聖經是信仰的終極權威。唯獨上帝的榮耀：唯獨上帝配得讚美、榮耀",
          "id": "2128-2",
          "qas": [
            {
              "id": "2128-2-1",
              "question": "新教在教義上強調信徒皆祭司以及什麼樣的理念?",
              "answers": [
                {
                  "id": "1",
                  "text": "因信稱義",
                  "answer_start": 92
                }
              ]
            },
            {
              "id": "2128-2-2",
              "question": "哪本經典為新教的最高權威?",
              "answers": [
                {
                  "id": "1",
                  "text": "聖經",
                  "answer_start": 105
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

The data format is the same as SQuAD. If you evaluate with a Simplified Chinese model, you can convert the data from Traditional to Simplified Chinese (a conversion is provided in this project).
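A minimal sketch of walking a SQuAD-style file laid out like the DRCD example above. The tiny document below is hypothetical and exists only to keep the snippet self-contained; `iter_examples` is an illustrative helper, not a project function. It also checks that `answer_start` is a character offset into the context, which is how SQuAD-style data is normally indexed.

```python
import json

# Tiny hypothetical document in the same layout as the DRCD example.
raw = json.dumps({
    "version": "1.3",
    "data": [{
        "title": "示例",
        "id": "1",
        "paragraphs": [{
            "context": "新教強調因信稱義",
            "id": "1-1",
            "qas": [{
                "id": "1-1-1",
                "question": "新教強調什麼?",
                "answers": [{"id": "1", "text": "因信稱義", "answer_start": 4}],
            }],
        }],
    }],
}, ensure_ascii=False)

def iter_examples(dataset):
    """Yield (question, answer_text, answer_start, context) for every answer."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (qa["question"], answer["text"],
                           answer["answer_start"], context)

examples = list(iter_examples(json.loads(raw)))
for question, text, start, context in examples:
    # The answer text should appear in the context at the given offset.
    assert context[start:start + len(text)] == text
print(len(examples))  # 1
```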

6. CMRC2018: Reading Comprehension for Simplified Chinese

https://hfl-rc.github.io/cmrc2018/

Data size: training set (2,403 passages, 10,142 questions), trial set (256 passages, 1,002 questions), development set (848 passages, 3,219 questions)
Example:
{
  "version": "1.0",
  "data": [
    {
        "title": "傻钱策略",
        "context_id": "TRIAL_0",
        "context_text": "工商协进会报告，12月消费者信心上升到78.1，明显高于11月的72。另据《华尔街日报》报道，2013年是1995年以来美国股市表现最好的一年。这一年里，投资美国股市的明智做法是追着“傻钱”跑。所谓的“傻钱”策略，其实就是买入并持有美国股票这样的普通组合。这个策略要比对冲基金和其它专业投资者使用的更为复杂的投资方法效果好得多。",
        "qas":[
                {
                "query_id": "TRIAL_0_QUERY_0",
                "query_text": "什么是傻钱策略？",
                "answers": [
                     "所谓的“傻钱”策略，其实就是买入并持有美国股票这样的普通组合",
                     "其实就是买入并持有美国股票这样的普通组合",
                     "买入并持有美国股票这样的普通组合"
                    ]
                },
                {
                "query_id": "TRIAL_0_QUERY_1",
                "query_text": "12月的消费者信心指数是多少？",
                "answers": [
                    "78.1",
                    "78.1",
                    "78.1"
                    ]
                },
                {
                "query_id": "TRIAL_0_QUERY_2",
                "query_text": "消费者信心指数由什么机构发布？",
                "answers": [
                    "工商协进会",
                    "工商协进会",
                    "工商协进会"
                    ]
                }
            ]
        }
    ]
}

The data format is the same as SQuAD.

7. BQ: Question Matching for Intelligent Customer Service

This dataset is a corpus from an automatic question answering system. It contains 120,000 sentence pairs, each labeled with a similarity value of 0 or 1 (0 means dissimilar, 1 means similar). The data contain typos and irregular grammar, but are closer to industrial scenarios.

    Data size: training set (100,000), validation set (10,000), test set (10,000)
    Examples:
     1.我存钱还不扣的 [separator] 借了每天都要还利息吗 [separator] 0
     2.为什么我的还没有额度 [separator] 为啥没有额度！！ [separator] 1

8. MSRANER: Named Entity Recognition

This dataset contains more than 50,000 Chinese examples annotated for named entity recognition. Person names, place names, and organization names are tagged nr, ns, and nt respectively; everything else is tagged o.

    Data size: training set (46,364), test set (4,365)
    Examples:
     1.据说/o 应/o 老友/o 之/o 邀/o ，/o 梁实秋/nr 还/o 坐/o 着/o 滑竿/o 来/o 此/o 品/o 过/o 玉峰/ns 茶/o 。/o
     2.他/o 每年/o 还/o 为/o 河北农业大学/nt 扶助/o 多/o 名/o 贫困/o 学生/o 。/o
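A minimal sketch of turning one tagged line into parallel token and tag lists, assuming the space-separated `token/tag` layout shown in the examples above (with the leading example number removed). The helper names are illustrative.

```python
def parse_tagged_sentence(line):
    """Split 'token/tag token/tag ...' into parallel token and tag lists."""
    tokens, tags = [], []
    for item in line.split():
        token, tag = item.rsplit("/", 1)  # split on the last slash only
        tokens.append(token)
        tags.append(tag)
    return tokens, tags

def entities(tokens, tags):
    """Return (token, tag) pairs for every token not tagged 'o'."""
    return [(tok, tag) for tok, tag in zip(tokens, tags) if tag != "o"]

sample = "他/o 每年/o 还/o 为/o 河北农业大学/nt 扶助/o 多/o 名/o 贫困/o 学生/o 。/o"
tokens, tags = parse_tagged_sentence(sample)
print(entities(tokens, tags))  # [('河北农业大学', 'nt')]
```

Using `rsplit("/", 1)` keeps a token intact if it happens to contain a slash itself.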

9. THUCNEWS: Long Text Classification

This dataset contains more than 40,000 labeled long Chinese news texts in 14 categories: "Sports":0, "Entertainment":1, "Home":2, "Lottery":3, "Real Estate":4, "Education":5, "Fashion":6, "Current Affairs":7, "Zodiac":8, "Game":9, "Society":10, "Technology":11, "Stock":12, "Financial":13.

    Data size: training set (33,437), validation set (4,180), test set (4,180)
    Example:
 11_!_科技_!_493337.txt_!_爱国者A-Touch MK3533高清播放器试用　　爱国者MP5简介:　　"爱国者"北京华旗资讯，作为国内知名数码产品制造商。1993年创立于北京中关村，是一家致力于......
 Each line is one record whose fields are separated by _!_; from first to last they are: category ID, category name, text ID, text content.

10. iFLYTEK: Long Text Classification

This dataset contains more than 17,000 labeled long texts describing apps, covering a variety of application topics related to daily life, in 119 categories: "Taxi": 0, "Map Navigation": 1, "Free WIFI": 2, "Car Rental": 3,...., "Female": 115, "Business": 116, "Cash Collection": 117, "Others": 118 (labeled 0-118 respectively).

    Data size: training set (12,133), validation set (2,599), test set (2,600)
    Example:
17_!_休闲益智_!_玩家需控制一只酷似神龙大侠的熊猫人在科技感十足的未来城市中穿越打拼。感觉很山寨功夫熊猫，自由度非常高，可以做很多你想做的事情......
Each line is one record whose fields are separated by _!_; from first to last they are: category ID, category name, text content.

11. CHID: Chinese IDiom Dataset for Cloze Test (idiom reading comprehension)

https://arxiv.org/abs/1906.01265
An idiom cloze task: several idioms in the text are masked, and the candidates include near-synonyms.

    Data size: training set (84,709), validation set (3,218), test set (3,231)
    Example:
    {
      "content": [
        # passage 0
        "……在热火22年的历史中，他们已经100次让对手得分在80以下，他们在这100次中都取得了胜利，今天他们希望能#idiom000378#再进一步。", 
        # passage 1
        "在轻舟发展过程之中，是和业内众多企业那样走相似的发展模式，去#idiom000379#？还是迎难而上，另走一条与众不同之路。诚然，#idiom000380#远比随大流更辛苦，更磨难，更充满风险。但是有一条道理却是显而易见的：那就是水往低处流，随波逐流，永远都只会越走越低。只有创新，只有发展科技，才能强大自己。", 
        # passage 2
        "最近十年间，虚拟货币的发展可谓#idiom000381#。美国著名经济学家林顿·拉鲁什曾预言：到2050年，基于网络的虚拟货币将在某种程度上得到官方承认，成为能够流通的货币。现在看来，这一断言似乎还嫌过于保守……", 
        # passage 3
        "“平时很少能看到这么多老照片，这次图片展把新旧照片对比展示，令人印象深刻。”现场一位参观者对笔者表示，大多数生活在北京的人都能感受到这个城市#idiom000382#的变化，但很少有人能具体说出这些变化，这次的图片展按照区域发展划分，展示了丰富的信息，让人形象感受到了60年来北京的变化和发展。", 
        # passage 4
        "从今天大盘的走势看，市场的热点在反复的炒作之中，概念股的炒作#idiom000383#，权重股走势较为稳健，大盘今日早盘的震荡可以看作是多头关前的蓄势行为。对于后市，大盘今日蓄势震荡后，明日将会在权重和题材股的带领下亮剑冲关。再创反弹新高无悬念。", 
        # passage 5
        "……其中，更有某纸媒借尤小刚之口指出“根据广电总局的这项要求，2009年的荧屏将很难出现#idiom000384#的情况，很多已经制作好的非主旋律题材电视剧想在卫视的黄金时段播出，只能等到2010年了……"],
      "candidates": [
        "百尺竿头", 
        "随波逐流", 
        "方兴未艾", 
        "身体力行", 
        "一日千里", 
        "三十而立", 
        "逆水行舟", 
        "日新月异", 
        "百花齐放", 
        "沧海一粟"
      ]
    }
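A minimal sketch of locating the masked idioms, assuming every blank has the `#idiom` + digits + `#` form seen in the example above. The passages are abbreviated from that example, and the filled-in idiom is just one of the listed candidates chosen for illustration; gold answers are not shown in the excerpt. The helper names are hypothetical.

```python
import re

BLANK = re.compile(r"#idiom\d+#")

def find_blanks(passages):
    """Map each passage index to the blank placeholders it contains, in order."""
    return {i: BLANK.findall(text) for i, text in enumerate(passages)}

def fill_blanks(text, answers):
    """Replace each placeholder with its chosen idiom; unknown blanks stay as-is."""
    return BLANK.sub(lambda m: answers.get(m.group(0), m.group(0)), text)

passages = [
    "今天他们希望能#idiom000378#再进一步。",
    "去#idiom000379#？还是迎难而上。诚然，#idiom000380#远比随大流更辛苦。",
]
print(find_blanks(passages))
# {0: ['#idiom000378#'], 1: ['#idiom000379#', '#idiom000380#']}
print(fill_blanks(passages[0], {"#idiom000378#": "百尺竿头"}))
# 今天他们希望能百尺竿头再进一步。
```

Note that the `# passage N` comment lines in the example are annotations, so the excerpt as printed is not strict JSON.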

12. CMNLI: Chinese Multi-Genre NLI

The CMNLI data are converted from the original English MNLI data into Chinese. The data come from the fiction, telephone, travel, government, and slate genres, among others. The task is to judge whether the relationship between two given sentences is entailment, neutral, or contradiction.

    Data size: train (391,783), matched (9,336), mismatched (8,870)
    Example:
    {"sentence1": "新的权利已经足够好了", "sentence2": "每个人都很喜欢最新的福利", "gold_label": "neutral"}
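A minimal sketch of reading CMNLI records, assuming the files hold one JSON object per line as the single-line example above suggests. `read_jsonl` is an illustrative helper, not a project function.

```python
import json
from collections import Counter

def read_jsonl(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

lines = [
    '{"sentence1": "新的权利已经足够好了", "sentence2": "每个人都很喜欢最新的福利", "gold_label": "neutral"}',
    "",
]
records = read_jsonl(lines)
label_counts = Counter(record["gold_label"] for record in records)
print(label_counts)  # Counter({'neutral': 1})
```

In practice `lines` would be an open file object; a label count like this is a cheap way to check class balance before training.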

13. More datasets are being added. Coming soon!

If you have a well-defined dataset, please contact us.

Dataset download (all datasets)

Or use the command:

 wget https://storage.googleapis.com/chineseglue/chineseGLUEdatasets.v0.0.1.zip

ChineseGLUE leaderboard by task: evaluation of different models on each dataset

TNEWS Short Text Classification for News (Accuracy):

Model	Development Set (dev)	Test set (test)	Training parameters
ALBERT-xlarge	88.30	88.30	batch_size=32, length=128, epoch=3
BERT-base	89.80	89.78	batch_size=32, length=128, epoch=3
BERT-wwm-ext-base	89.88	89.81	batch_size=32, length=128, epoch=3
ERNIE-base	89.77	89.83	batch_size=32, length=128, epoch=3
RoBERTa-large	90.00	89.91	batch_size=16, length=128, epoch=3
XLNet-mid	86.14	86.26	batch_size=32, length=128, epoch=3
RoBERTa-wwm-ext	89.82	89.79	batch_size=32, length=128, epoch=3
RoBERTa-wwm-large-ext	90.05	90.11	batch_size=16, length=128, epoch=3

XNLI Natural Language Inference (Accuracy):

Model	Development Set (dev)	Test set (test)	Training parameters
ALBERT-xlarge	74.0?	74.0?	batch_size=64, length=128, epoch=2
BERT-base	77.80	77.80	batch_size=64, length=128, epoch=2
BERT-wwm-ext-base	79.4	78.7	batch_size=64, length=128, epoch=2
ERNIE-base	79.7	78.6	batch_size=64, length=128, epoch=2
RoBERTa-large	80.2	79.9	batch_size=64, length=128, epoch=2
XLNet-mid	79.2	78.7	batch_size=64, length=128, epoch=2
RoBERTa-wwm-ext	79.56	79.28	batch_size=64, length=128, epoch=2
RoBERTa-wwm-large-ext	80.20	80.04	batch_size=16, length=128, epoch=2

Note: ALBERT-xlarge still has training problems on the XNLI task.

LCQMC Semantic Similarity Task (Accuracy):

Model	Development Set (dev)	Test set (test)	Training parameters
ALBERT-xlarge	89.00	86.76	batch_size=64, length=128, epoch=3
BERT-base	89.4	86.9	batch_size=64, length=128, epoch=3
BERT-wwm-ext-base	89.1	87.3	batch_size=64, length=128, epoch=3
ERNIE-base	89.8	87.2	batch_size=64, length=128, epoch=3
RoBERTa-large	89.9	87.2	batch_size=64, length=128, epoch=3
XLNet-mid	86.14	85.98	batch_size=64, length=128, epoch=3
RoBERTa-wwm-ext	89.08	86.33	batch_size=64, length=128, epoch=3
RoBERTa-wwm-large-ext	89.79	86.82	batch_size=16, length=128, epoch=3

INEWS Sentiment Analysis for Internet News (Accuracy):

Model	Development Set (dev)	Test set (test)	Training parameters
ALBERT-xlarge	81.80	82.40	batch_size=32, length=512, epoch=8
BERT-base	81.29	82.70	batch_size=16, length=512, epoch=3
BERT-wwm-ext-base	81.93	83.46	batch_size=16, length=512, epoch=3
ERNIE-base	84.50	85.14	batch_size=16, length=512, epoch=3
RoBERTa-large	81.90	84.00	batch_size=4, length=512, epoch=3
XLNet-mid	82.00	84.00	batch_size=8, length=512, epoch=3
RoBERTa-wwm-ext	82.98	82.28	batch_size=16, length=512, epoch=3
RoBERTa-wwm-large-ext	83.73	82.78	batch_size=4, length=512, epoch=3

DRCD Reading Comprehension for Traditional Chinese (F1, EM):

Model	Development Set (dev)	Test set (test)	Training parameters
BERT-base	F1:92.30 EM:86.60	F1:91.46 EM:85.49	batch=32, length=512, epoch=2 lr=3e-5 warmup=0.1
BERT-wwm-ext-base	F1:93.27 EM:88.00	F1:92.63 EM:87.15	batch=32, length=512, epoch=2 lr=3e-5 warmup=0.1
ERNIE-base	F1:92.78 EM:86.85	F1:92.01 EM:86.03	batch=32, length=512, epoch=2 lr=3e-5 warmup=0.1
ALBERT-large	F1:93.90 EM:88.88	F1:93.06 EM:87.52	batch=32, length=512, epoch=3 lr=2e-5 warmup=0.05
ALBERT-xlarge	F1:94.63 EM:89.68	F1:94.70 EM:89.78	batch_size=32, length=512, epoch=3 lr=2.5e-5 warmup=0.06
ALBERT-tiny	F1:81.51 EM:71.61	F1:80.67 EM:70.08	batch=32, length=512, epoch=3 lr=2e-4 warmup=0.1
RoBERTa-large	F1:94.93 EM:90.11	F1:94.25 EM:89.35	batch=32, length=256, epoch=2 lr=3e-5 warmup=0.1
xlnet-mid	F1:92.08 EM:84.40	F1:91.44 EM:83.28	batch=32, length=512, epoch=2 lr=3e-5 warmup=0.1
RoBERTa-wwm-ext	F1:94.26 EM:89.29	F1:93.53 EM:88.12	batch=32, length=512, epoch=2 lr=3e-5 warmup=0.1
RoBERTa-wwm-large-ext	F1:95.32 EM:90.54	F1:95.06 EM:90.70	batch=32, length=512, epoch=2 lr=2.5e-5 warmup=0.1

CMRC2018 Reading Comprehension for Simplified Chinese (F1, EM):

Model	Development Set (dev)	Test set (test)	Training parameters
BERT-base	F1:85.48 EM:64.77	F1:87.17 EM:69.72	batch=32, length=512, epoch=2 lr=3e-5 warmup=0.1
BERT-wwm-ext-base	F1:86.68 EM:66.96	F1:88.78 EM:73.23	batch=32, length=512, epoch=2 lr=3e-5 warmup=0.1
ERNIE-base	F1:87.30 EM:66.89	F1:89.62 EM:73.32	batch=32, length=512, epoch=2 lr=3e-5 warmup=0.1
ALBERT-large	F1:87.86 EM:67.75	F1:90.17 EM:73.66	epoch=3, batch=32, length=512, lr=2e-5, warmup=0.05
ALBERT-xlarge	F1:88.66 EM:68.90	F1:90.92 EM:75.22	epoch=3, batch=32, length=512, lr=2e-5, warmup=0.1
ALBERT-tiny	F1:73.95 EM:48.31	F1:75.73 EM:53.68	epoch=3, batch=32, length=512, lr=2e-4, warmup=0.1
RoBERTa-large	F1:88.61 EM:69.94	F1:90.94 EM:76.11	epoch=2, batch=32, length=256, lr=3e-5, warmup=0.1
xlnet-mid	F1:85.63 EM:65.31	F1:86.09 EM:66.51	epoch=2, batch=32, length=512, lr=3e-5, warmup=0.1
RoBERTa-wwm-ext	F1:87.28 EM:67.89	F1:89.74 EM:73.89	epoch=2, batch=32, length=512, lr=3e-5, warmup=0.1
RoBERTa-wwm-large-ext	F1:89.42 EM:70.59	F1:91.56 EM:76.58	epoch=2, batch=32, length=512, lr=2.5e-5, warmup=0.1

CHID Chinese IDiom Dataset for Cloze Test (Accuracy):

Model	Development Set (dev)	Test set (test)	Training parameters
BERT-base	82.2	82.04	batch=24, length=64, epoch=3 lr=2e-5
BERT-wwm-ext-base	-	-	-
ERNIE-base	-	-	-
ALBERT-large	-	-	-
ALBERT-xlarge	-	-	-
ALBERT-tiny	-	-	-
RoBERTa-large	85.31	84.5	batch=24, length=64, epoch=3 lr=2e-5
xlnet-mid	-	-	-
RoBERTa-wwm-ext	83.78	83.62	batch=24, length=64, epoch=3 lr=2e-5
RoBERTa-wwm-large-ext	85.81	85.37	batch=24, length=64, epoch=3 lr=2e-5

CMNLI Chinese Multi-Genre NLI (Accuracy):

Model	Matched	Mismatched	Training parameters
BERT-base	79.39	79.76	batch=32, length=128, epoch=3 lr=2e-5
BERT-wwm-ext-base	81.41	80.67	batch=32, length=128, epoch=3 lr=2e-5
ERNIE-base	79.65	80.70	batch=32, length=128, epoch=3 lr=2e-5
ALBERT-xxlarge	-	-	-
ALBERT-tiny	72.71	72.72	batch=32, length=128, epoch=3 lr=2e-5
RoBERTa-large	-	-	-
xlnet-mid	78.15	76.93	batch=16, length=128, epoch=3 lr=2e-5
RoBERTa-wwm-ext	81.09	81.38	batch=32, length=128, epoch=3 lr=2e-5
RoBERTa-wwm-large-ext	83.4	83.42	batch=32, length=128, epoch=3 lr=2e-5

BQ Question Matching for Intelligent Customer Service (Accuracy):

Model	Development Set (dev)	Test set (test)	Training parameters
BERT-base	85.86	85.08	batch_size=64, length=128, epoch=3
BERT-wwm-ext-base	86.05	85.21	batch_size=64, length=128, epoch=3
ERNIE-base	85.92	84.47	batch_size=64, length=128, epoch=3
RoBERTa-large	85.68	85.20	batch_size=8, length=128, epoch=3
XLNet-mid	79.81	77.85	batch_size=32, length=128, epoch=3
ALBERT-xlarge	85.21	84.21	batch_size=16, length=128, epoch=3
ALBERT-tiny	82.04	80.76	batch_size=64, length=128, epoch=5
RoBERTa-wwm-ext	85.31	84.02	batch_size=64, length=128, epoch=3
RoBERTa-wwm-large-ext	86.34	84.90	batch_size=16, length=128, epoch=3

MSRANER Named Entity Recognition (F1):

Model	Test set (test)	Training parameters
BERT-base	95.38	batch_size=16, length=256, epoch=5, lr=2e-5
BERT-wwm-ext-base	95.26	batch_size=16, length=256, epoch=5, lr=2e-5
ERNIE-base	95.17	batch_size=16, length=256, epoch=5, lr=2e-5
RoBERTa-large	96.07	batch_size=8, length=256, epoch=5, lr=2e-5
XLNet-mid	92.11	batch_size=8, length=256, epoch=5, lr=2e-5
ALBERT-xlarge	89.51	batch_size=16, length=256, epoch=8, lr=7e-5
ALBERT-base	92.47	batch_size=32, length=256, epoch=8, lr=5e-5
ALBERT-tiny	84.77	batch_size=32, length=256, epoch=8, lr=5e-5
RoBERTa-wwm-ext	95.06	batch_size=16, length=256, epoch=5, lr=2e-5
RoBERTa-wwm-large-ext	95.32	batch_size=8, length=256, epoch=5, lr=2e-5

THUCNEWS Long Text Classification (Accuracy):

Model	Development Set (dev)	Test set (test)	Training parameters
ALBERT-xlarge	95.74	95.45	batch_size=32, length=512, epoch=8
ALBERT-tiny	92.63	93.54	batch_size=64, length=128, epoch=5
BERT-base	95.28	95.35	batch_size=8, length=128, epoch=3
BERT-wwm-ext-base	95.38	95.57	batch_size=8, length=128, epoch=3
ERNIE-base	94.35	94.90	batch_size=16, length=256, epoch=3
RoBERTa-large	94.52	94.56	batch_size=2, length=256, epoch=3
XLNet-mid	94.04	94.54	batch_size=16, length=128, epoch=3
RoBERTa-wwm-ext	95.59	95.52	batch_size=16, length=256, epoch=3
RoBERTa-wwm-large-ext	96.10	95.93	batch_size=32, length=512, epoch=8

iFLYTEKData Long Text Classification (Accuracy):

Model	Development Set (dev)	Test set (test)	Training parameters
ALBERT-xlarge	61.94	61.34	batch_size=32, length=128, epoch=3
ALBERT-tiny	44.83	44.62	batch_size=32, length=256, epoch=3
BERT-base	63.57	63.48	batch_size=32, length=128, epoch=3
BERT-wwm-ext-base	63.83	63.75	batch_size=32, length=128, epoch=3
ERNIE-base	61.75	61.80	batch_size=24, length=256, epoch=3
RoBERTa-large	63.80	63.91	batch_size=32, length=128, epoch=3
XLNet-mid	60.16	60.04	batch_size=16, length=128, epoch=3
RoBERTa-wwm-ext	64.18	-	batch_size=16, length=128, epoch=3
RoBERTa-wwm-large-ext	65.19	65.10	batch_size=32, length=128, epoch=3

Baseline models: starter code

We provide scripts that run with one click to help you run a specific task on a specified model faster.

For example, to run the "BQ Intelligent Customer Service Question Matching" task on the BERT model, run the run_classifier_bq.sh script directly under chineseGLUE/baselines/models/bert/.

 cd chineseGLUE/baselines/models/bert/
 sh run_classifier_bq.sh

The script automatically downloads the "BQ Intelligent Customer Service Question Matching" dataset (saved in the chineseGLUE/baselines/glue/chineseGLUEdatasets/bq/ folder) and the BERT model (saved in chineseGLUE/baselines/models/bert/prev_trained_model/).

For details, please refer to: Benchmark Model-Model Training

Open evaluation submission portal: I want to submit

Corpus for Language Modeling, Pre-training, and Generation Tasks

The corpus can be used for language modeling, pre-training, generative tasks, and so on. It exceeds 10G in size, and most of it comes from the nlp_chinese_corpus project.

The current corpus has been processed into pre-training format and contains multiple folders. Each folder holds many small files of no more than 4M each, and every file follows the pre-training format: one sentence per line, with a blank line between documents.

It contains the following sub-corpora (14G in total):

1. News corpus: 8G of text, divided into two parts, with a total of 2,000 small files.

2. Community interaction corpus: 3G of text, with a total of more than 900 small files.

3. Wikipedia: about 1.1G of text, in about 300 small files.

4. Comment data: about 2.3G of text in 811 small files, produced by merging multiple comment datasets from ChineseNLPCorpus, then cleaning, converting the format, and splitting into small files.

You can build these corpora yourself by cleaning the data and converting the format from the two projects mentioned above;

You can also request the corpus of a single item by email (chineseGLUE#163.com), stating your organization or school, your name, and the intended use of the corpus;

To obtain all the corpora under the ChineseGLUE project, you must become a member of the ChineseGLUE organization and complete a (small) task.
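A minimal sketch of reading one file in this pre-training format, assuming exactly the layout described above: one sentence per line and a blank line between documents. The function name and the sample text are illustrative.

```python
def split_documents(lines):
    """Group sentences into documents; a blank line ends the current document."""
    documents, current = [], []
    for line in lines:
        sentence = line.strip()
        if sentence:
            current.append(sentence)
        elif current:
            documents.append(current)
            current = []
    if current:  # the last document may not be followed by a blank line
        documents.append(current)
    return documents

sample = "第一句。\n第二句。\n\n另一篇文档的第一句。\n".splitlines()
print(split_documents(sample))
# [['第一句。', '第二句。'], ['另一篇文档的第一句。']]
```

Because each file is small (no more than 4M), reading a file fully into memory before splitting is a reasonable choice here.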

Become a founding member of the ChineseGLUE organization

Benefits you will receive:

1. Become a founding member of China's first benchmark evaluation for Chinese tasks

2. Contribute together with other professionals to promote the development of Chinese natural language processing

3. After taking part in some of the work, obtain a pre-training corpus of the same scale as English wiki & BookCorpus, already cleaned and ready for pre-training, for research purposes.

4. Priority access to state-of-the-art Chinese pre-trained models, including trial or unpublished versions

How to join us:

Send an email to CLUEbenchmark#163.com briefly introducing yourself, your background, your work or research direction, your organization, and where you can contribute to the community. We will contact you after reviewing it.

TODO list

1. Collect and mine 1 representative dataset, generally a classification or sentence-pair task (5 more datasets are needed)

2. Convert reading comprehension tasks into sentence-pair tasks (such as clue and question, or clue and answer) and evaluate them. The data should be split into training, validation, and test sets.

3. Baseline training and prediction methods and scripts for specific tasks and models (supporting PyTorch and Keras);

4. Test the accuracy of current mainstream models (such as bert/bert_wwm_ext/roberta/albert/ernie/ernie2.0, etc.) on the ChineseGLUE datasets.

For example: test XLNet-mid on the LCQMC dataset

5. Are there any models participating in the evaluation?

Other

6. Ranking landing page

7. Introduction to the Chinese Language Understanding Assessment Benchmark (ChineseGLUE)

8. Development of main functions of the evaluation system

Timeline:

2019-10-20 to 2019-12-31: beta version of ChineseGLUE

2020-01-01 to 2020-12-31: official version of ChineseGLUE

2021-01-01 to 2021-12-31: super version of ChineseGLUE

Contribute, starting today

Share your dataset with the community or make a contribution today! Just send an email to chineseGLUE#163.com,

or join QQ group: 836811304

More volunteers are joining one after another...

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)

How to cite us?

See: https://aclanthology.org/2020.coling-main.419.bib

Reference:

1. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

2. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

3. LCQMC: A Large-scale Chinese Question Matching Corpus

4. XNLI: Evaluating Cross-lingual Sentence Representations

5. TNEWS: toutiao-text-classfication-dataset

6. nlp_chinese_corpus: Large Scale Chinese Corpus for NLP

7. ChineseNLPCorpus

8. ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations

9. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

10. RoBERTa: A Robustly Optimized BERT Pretraining Approach
