A collection of Chinese and English NLP datasets. Click a category to search.
You can contribute by uploading dataset information. After you upload five or more datasets and they pass review, you will be listed and displayed as a project contributor.
clueai toolkit: complete NLP development in three minutes with three lines of code (zero-shot learning)
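
The clueai toolkit itself is documented elsewhere; as a rough illustration of the zero-shot idea it advertises, the sketch below uses Hugging Face's `zero-shot-classification` pipeline as a stand-in (not the clueai API), and the multilingual model name is an assumed choice.

```python
# A rough zero-shot classification sketch. This is NOT the clueai API; it uses
# Hugging Face's zero-shot-classification pipeline only to illustrate the
# "no labeled training data" workflow. The model name is an assumed choice.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")  # multilingual NLI model (assumption)

result = classifier("这家餐厅的菜很好吃,下次还会再来。",
                    candidate_labels=["正面", "负面"])
print(result["labels"][0], result["scores"][0])  # top label and its score
```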

If you find a problem with any dataset, please open an issue.
All datasets are collected from the Internet and are organized here only for convenient access. If there is any infringement or other issue, please contact us and we will remove the content promptly.
| ID | Title | Update date | Dataset provider | License | Description | Keywords | Category | Paper | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 1 | CCKS2017 Chinese electronic medical record named entity recognition | May 2017 | Beijing Jimuyun Health Technology Co., Ltd. | The data come from real electronic medical records on the cloud hospital platform, 800 records in total (one record per patient visit), and have been desensitized. | Electronic medical record | Named entity recognition | Chinese | ||
| 2 | CCKS2018 Chinese electronic medical record named entity recognition | 2018 | Yidu Cloud (Beijing) Technology Co., Ltd. | The CCKS2018 electronic medical record named entity recognition evaluation task provides 600 annotated electronic medical record texts and requires extracting five entity types: anatomical parts, independent symptoms, symptom descriptions, surgeries, and drugs. | Electronic medical record | Named entity recognition | Chinese | ||
| 3 | MSRA named entity recognition dataset from Microsoft Research Asia | MSRA | The data come from MSRA, labeled in BIO format, with 46,365 entries in total | MSRA | Named entity recognition | Chinese | |||
| 4 | 1998 People's Daily corpus named entity annotation set | January 1998 | People's Daily | The data source is the 1998 People's Daily, labeled in BIO format, with 23,061 entries in total. | 98 People's Daily | Named entity recognition | Chinese | ||
| 5 | Boson | Boson Data | The data source is Boson, labeled in BMEO format, with 2,000 entries in total | Boson | Named entity recognition | Chinese | |||
| 6 | CLUE Fine-Grain NER | 2020 | CLUE | The CLUENER2020 dataset is based on THUCTC, the text classification dataset from Tsinghua University, with a subset selected for fine-grained named entity annotation. The original data come from Sina News RSS. The data contain 10 label categories; the training set has 10,748 samples and the validation set has 1,343 samples. | Fine-grained; CLUE | Named entity recognition | Chinese | ||
| 7 | CoNLL-2003 | 2003 | CNTS - Language Technology Group | The data comes from the CoNLL-2003 task, which annotates four categories including PER, LOC, ORG and MISC | CoNLL-2003 | Named entity recognition | paper | English | |
| 8 | Weibo entity recognition | 2015 | https://github.com/hltcoe/golden-horse | EMNLP-2015 | Named entity recognition | ||||
| 9 | SIGHAN Bakeoff 2005 | 2005 | MSR/PKU | bakeoff-2005 | Named entity recognition |
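
Several of the NER datasets above are distributed in BIO (or BMEO) tagging schemes. The sketch below is a minimal loader assuming the common one-token-per-line `token<TAB>tag` layout with blank lines between sentences; the actual file layout of each dataset may differ, and the file name in the example is hypothetical.

```python
# Minimal loader sketch for BIO-tagged NER data. Assumes the common
# one-token-per-line "token<TAB>tag" layout with blank lines separating
# sentences; each dataset above may use a slightly different layout.
from typing import List, Tuple

def read_bio(path: str) -> List[Tuple[List[str], List[str]]]:
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                     # blank line closes a sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split("\t")    # e.g. "北\tB-LOC"
            tokens.append(token)
            tags.append(tag)
    if tokens:                               # file may not end with a blank line
        sentences.append((tokens, tags))
    return sentences

# Usage (hypothetical file name):
# for tokens, tags in read_bio("msra_train.txt")[:3]:
#     print(list(zip(tokens, tags)))
```
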
| ID | Title | Update date | Dataset provider | License | Description | Keywords | Category | Paper | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 1 | NewsQA | 2019/9/13 | Microsoft Research | The Maluuba NewsQA dataset aims to help the research community build algorithms that can answer questions requiring human-level understanding and reasoning. It contains more than 12,000 news articles and 120,000 answers, with an average of 616 words per article and 2 to 3 answers per question. | English | QA | paper | ||
| 2 | SQuAD | Stanford | The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset composed of questions posed on a set of Wikipedia articles, where the answer to each question is a span of text from the corresponding reading passage, or the question may be unanswerable. | English | QA | paper | |||
| 3 | SimpleQuestions | From the work on large-scale simple question answering with memory networks; the dataset provides 100K simple questions with answers for multi-task question answering. | English | QA | paper | ||||
| 4 | WikiQA | 2016/7/14 | Microsoft Research | To reflect the real information needs of ordinary users, WikiQA uses Bing query logs as the source of questions. Each question is linked to a Wikipedia page that may contain the answer. Because the summary section of a Wikipedia page provides the basic, and often the most important, information about the topic, the sentences in this section are used as candidate answers. With the help of crowdsourcing, the dataset includes 3,047 questions and 29,258 sentences, of which 1,473 sentences are marked as answer sentences for their corresponding questions. | English | QA | paper | ||
| 5 | cMedQA | 2019/2/25 | Zhang Sheng | Data from an online medical forum, containing 54,000 questions and approximately 100,000 corresponding answers. | Chinese | QA | paper | ||
| 6 | cMedQA2 | 2019/1/9 | Zhang Sheng | An extended version of cMedQA containing about 100,000 medical questions and about 200,000 corresponding answers. | Chinese | QA | paper | ||
| 7 | webMedQA | 2019/3/10 | He Junqing | An online medical question answering dataset containing 60,000 questions and 310,000 answers, together with question categories. | Chinese | QA | paper | ||
| 8 | XQA | 2019/7/29 | Tsinghua University | A cross-lingual dataset built for open-domain question answering. The dataset (training and test sets) covers nine languages and more than 90,000 question-answer pairs. | Multilingual | QA | paper | ||
| 9 | AmazonQA | 2019/9/29 | Amazon | A review-based QA task proposed by Carnegie Mellon University to address repetitive product questions on the Amazon platform: given previous Q&A about a product, the QA system automatically summarizes an answer for customers. | English | QA | paper | ||
| ID | Title | Update date | Dataset provider | License | Description | Keywords | Category | Paper | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 1 | NLPCC2013 | 2013 | CCF | Weibo corpus, marked with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear. Size: 14 000 Weibo posts, 45 431 sentences | NLPCC2013, Emotion | Sentiment Analysis | paper | ||
| 2 | NLPCC2014 Task1 | 2014 | CCF | Weibo corpus, marked with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear. Size: 20,000 Weibo posts | NLPCC2014, Emotion | Sentiment Analysis | |||
| 3 | NLPCC2014 Task2 | 2014 | CCF | Weibo corpus marked with positive and negative | NLPCC2014, Sentiment | Sentiment Analysis | |||
| 4 | Weibo Emotion Corpus | 2016 | The Hong Kong Polytechnic University | Weibo corpus, marked with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear. Size: More than 40,000 Weibo posts | weibo emotion corpus | Sentiment Analysis | Emotion Corpus Construction Based on Selection from Noisy Natural Labels | ||
| 5 | RenCECPs (Fuji Ren can be contacted at [email protected] for a license agreement) | 2009 | Fuji Ren | A blog corpus annotated with emotion and sentiment at the document, paragraph, and sentence levels. It contains 1,500 blogs, 11,000 paragraphs, and 35,000 sentences. | RenCECPs, emotion, sentiment | Sentiment Analysis | Construction of a blog emotion corpus for Chinese emotional expression analysis | ||
| 6 | weibo_senti_100k | Unknown | Unknown | Sina Weibo posts tagged with sentiment, with about 50,000 positive and 50,000 negative comments | weibo senti, sentiment | Sentiment Analysis | |||
| 7 | BDCI2018 - Automobile industry user opinion and emotion recognition | 2018 | CCF | Car comments from an automotive forum, annotated with the aspect of the car being discussed: power, price, interior, configuration, safety, appearance, handling, fuel consumption, space, and comfort. Each aspect is given a sentiment label from 3 categories, with 0, 1, and -1 representing neutral, positive, and negative respectively. | Attribute sentiment analysis; topic sentiment analysis | Sentiment Analysis | |||
| 8 | AI Challenger Fine-grained User Comment Sentiment Analysis | 2018 | Meituan | Restaurant reviews with 6 first-level attributes and 20 second-level attributes; each attribute is labeled positive, negative, neutral, or not mentioned. | Attribute sentiment analysis | Sentiment Analysis | |||
| 9 | BDCI2019 Financial Information Negative and Subject Determination | 2019 | Central Bank | Financial field news, each sample tags the list of entities as well as the list of negative entities. The task is to determine whether a sample is negative and the corresponding negative entity. | Entity sentiment analysis | Sentiment Analysis | |||
| 10 | Zhijiang Cup E-commerce Review and Opinion Digging Competition | 2019 | Zhijiang Laboratory | The task of exploring the opinions of brand reviews is to extract product attribute characteristics and consumer opinions from product reviews, and confirm their emotional polarity and attribute types. For a certain attribute feature of a product, there are a series of opinion words that describe it, which represent consumers' views on the attribute feature. Each set of {product attribute characteristics, consumer opinion} has corresponding emotional polarity (negative, neutral, positive), representing the consumer's satisfaction with this attribute. In addition, multiple attribute features can be classified into a certain attribute type, such as appearance, box and other attribute features can be classified into the packaging attribute type. The participating teams will eventually submit the extracted prediction information of the test data, including four fields: attribute characteristic word, opinion word, opinion polarity and attribute type. | Attribute sentiment analysis | Sentiment Analysis | |||
| 11 | 2019 Sohu Campus Algorithm Competition | 2019 | Sohu | Given several articles, the goal is to judge the core entity of the article and its emotional attitude towards the core entity. Each article identifies up to three core entities and determines the emotional tendencies of the article towards the above core entities (positive, neutral, and negative). Entity: People, objects, regions, institutions, groups, enterprises, industries, certain specific events, etc. are fixed and can be used as the entity word for the subject of the article. Core entity: The entity word that mainly describes or acts as the main role of the article. | Entity sentiment analysis | Sentiment Analysis |
| ID | Title | Update date | Dataset provider | License | Description | Keywords | Category | Paper | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 1 | [2018 "Daguan Cup" Text Intelligent Processing Challenge](https://www.pkbigdata.com/common/cmpt/ "Daguan Cup" Text Intelligent Processing Challenge_Shiti and Data.html) | July 2018 | Daguan Data | The dataset comes from Daguan Data and is a long-text classification task. It mainly includes four fields: id, article, word_seg, and class. The data contain 19 categories, totaling 102,275 samples. | Long text; desensitized | Text classification | Chinese | ||
| 2 | Toutiao Chinese News (Text) Classification | May 2018 | Toutiao | The dataset comes from Toutiao and is a short-text classification task. The data contain 15 categories, totaling 382,688 samples. | Short text; news | Text classification | Chinese | ||
| 3 | THUCNews Chinese text classification | 2016 | Tsinghua University | THUCNews was generated by filtering historical data from the Sina News RSS subscription channels between 2005 and 2011, all in UTF-8 plain text format. Based on the original Sina news classification system, the data were reorganized into 14 candidate categories: finance, lottery, real estate, stocks, home, education, technology, society, fashion, current affairs, sports, horoscope, games, and entertainment, with a total of 740,000 news documents (2.19 GB) | Documents; news | Text classification | Chinese | ||
| 4 | Fudan University Chinese text classification | Natural Language Processing Group, International Database Center, Department of Computer Information and Technology, Fudan University | The dataset is from Fudan University and is a text classification task. The data contain 20 categories, with a total of 9,804 documents. | Documents; news | Text classification | Chinese | |||
| 5 | News Title Short Text Classification | December 2019 | chenfengshf | CC0 Public Domain Sharing | The dataset is from the Kesci platform and is a short-text classification task over news titles. Most items are short titles (length < 50); the data contain 15 categories and a total of 380,000 samples | Short text; news title | Text classification | Chinese | |
| 6 | 2017 Zhihu Kanshan Cup Machine Learning Challenge | June 2017 | Chinese Association for Artificial Intelligence; Zhihu | The dataset comes from Zhihu and consists of annotated question-to-topic-tag bindings. Each question has 1 or more tags, with 1,999 tags in total and about 3 million questions. | Questions; short text | Text classification | Chinese | ||
| 7 | 2019 Zhijiang Cup - E-commerce Review Opinion Mining Competition | August 2019 | Zhijiang Laboratory | The task of exploring the opinions of brand reviews is to extract product attribute characteristics and consumer opinions from product reviews, and confirm their emotional polarity and attribute types. For a certain attribute feature of a product, there are a series of opinion words that describe it, which represent consumers' views on the attribute feature. Each group of {product attribute characteristics, consumer opinion} has corresponding emotional polarity (negative, neutral, positive), which represents the degree of satisfaction of consumers with this attribute. | Comments; short text | Text classification | Chinese | ||
| 8 | IFLYTEK' Long Text Classification | iFlytek | This data set has more than 17,000 long text labeled data about app application descriptions, including various application topics related to daily life, with a total of 119 categories | Long text | Text classification | Chinese | |||
| 9 | Web-wide news classification data (SogouCA) | August 16, 2012 | Sogou | The data come from news in 18 channels including domestic, international, sports, society, and entertainment, from June to July 2012. | News | Text classification | Chinese | ||
| 10 | Sohu News Data (SogouCS) | August 2012 | Sogou | The data source is Sohu News from 18 channels including domestic, international, sports, social, entertainment, etc. from June to July 2012. | news | Text classification | Chinese | ||
| 11 | University of Science and Technology News Classification Corpus | November 2017 | Liu Yu, Comprehensive Information Center, Institute of Automation, Chinese Academy of Sciences | Not downloadable for the time being; the author has been contacted and we are waiting for feedback | News | ||||
| 12 | ChnSentiCorp_htl_all | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | More than 7,000 hotel reviews: more than 5,000 positive and more than 2,000 negative | |||||
| 13 | waimai_10k | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | User reviews collected from a food-delivery platform: about 4,000 positive and about 8,000 negative. | |||||
| 14 | online_shopping_10_cats | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | 10 categories, more than 60,000 comments in total, with about 30,000 positive and 30,000 negative, covering books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu, clothes, computers, and hotels | |||||
| 15 | weibo_senti_100k | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | More than 100,000 Sina Weibo posts labeled with sentiment, about 50,000 positive and 50,000 negative | |||||
| 16 | simplifyweibo_4_moods | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | More than 360,000 Sina Weibo posts labeled with 4 emotions: about 200,000 labeled joy and about 50,000 each labeled anger, disgust, and sadness. | |||||
| 17 | dmsc_v2 | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | 28 movies, over 700,000 users, over 2 million ratings/comments data | |||||
| 18 | yf_dianping | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | 240,000 restaurants, 540,000 users, 4.4 million comments/rating data | |||||
| 19 | yf_amazon | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | 520,000 items, more than 1,100 categories, 1.42 million users, 7.2 million comments/rating data |
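
For the short-text classification datasets above (THUCNews, Toutiao titles, and so on), a simple TF-IDF plus logistic regression baseline is often a useful sanity check. The sketch below assumes you have already loaded `texts` and `labels` from a dataset's own format; the toy sentences, and the choice of jieba for segmentation, are assumptions for illustration.

```python
# Minimal TF-IDF + logistic regression baseline sketch. The toy sentences and
# the use of jieba for word segmentation are illustrative assumptions; in
# practice `texts`/`labels` come from one of the datasets listed above.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["央行宣布下调存款准备金率", "股市今日大涨",
         "国足比赛失利", "球队晋级决赛"]
labels = ["finance", "finance", "sports", "sports"]

vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
features = vectorizer.fit_transform(texts)

clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(vectorizer.transform(["国足晋级决赛"])))  # expected: ['sports']
```
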
| ID | Title | Update date | Dataset provider | License | Description | Keywords | Category | Paper | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 1 | LCQMC | 2018/6/6 | Harbin Institute of Technology (Shenzhen) Intelligent Computing Research Center | Creative Commons Attribution 4.0 International License | This dataset contains 260,068 Chinese question pairs from multiple fields. The sentence pairs with the same inquiry intention are marked as 1, otherwise they are 0; and they are segmented into training set: 238,766 pairs, validation set: 8802 pairs, test set: 12,500 pairs. | Large-scale question matching; intention matching | Short text matching; question matching | paper | |
| 2 | The BQ Corpus | 2018/9/4 | Harbin Institute of Technology (Shenzhen) Intelligent Computing Research Center; WeBank | There are 120,000 sentence pairs in this dataset, from the bank's consulting service log for one year; sentence pairs contain different intentions, marked with a ratio of 1:1 positive and negative samples. | Bank service questions; intention matching | Short text matching; question consistency detection | paper | ||
| 3 | AFQMC Ant Financial Semantic Similarity | 2018/4/25 | Ant Financial | Provides 100,000 labeled sentence pairs (released in batches) as training data, including synonymous pairs and non-synonymous pairs | Financial questions | Short text matching; question matching | |||
| 4 | The third Paipaidai "Magic Mirror Cup" Competition | 2018/6/10 | Paipaidai Smart Finance Research Institute | The train.csv file contains 3 columns: the label (whether question 1 and question 2 have the same meaning, 1 for same and 0 for different), the ID of question 1 (q1), and the ID of question 2 (q2). All question IDs appearing in this file also appear in question.csv | Financial products | Short text matching; question matching | |||
| 5 | CAIL2019 Similar Case Matching Competition | 2019/6 | Tsinghua University; China Judgements Online | Each sample is a triplet (A, B, C), where A, B, and C each correspond to a document. The similarity between documents A and B is always greater than the similarity between A and C, i.e., sim(A,B) > sim(A,C) | Legal documents; similar cases | Long text matching | |||
| 6 | CCKS 2018 WeBank Intelligent Customer Service Question Matching Competition | 2018/4/5 | Harbin Institute of Technology (Shenzhen) Intelligent Computing Research Center; WeBank | Bank service questions; intention matching | Short text matching; question matching | ||||
| 7 | ChineseTextualInference | 2018/12/15 | Liu Huanyong, Institute of Software, Chinese Academy of Sciences | Chinese text inference project, including a dataset of 880,000 Chinese textual-entailment pairs built through translation and construction, and a deep-learning-based textual-entailment judgment model | Chinese NLI | Chinese text inference; textual entailment | |||
| 8 | NLPCC-DBQA | 2016/2017/2018 | NLPCC | Given question-answer pairs with a label indicating whether the answer is one of the answers to the question: 1 means yes, 0 means no | DBQA | Q&A Match | |||
| 9 | Correlation model between "technical requirement" and "technical achievement" projects | 201/8/32 | CCF | Technical requirements and technical achievements given in text form, together with a correlation label between requirement and achievement; the correlation is divided into four levels: strong, relatively strong, weak, and none | Long text; requirement-achievement matching | Long text matching | |||
| 10 | CNSD/CLUE-CMNLI | 2019/12 | ZengJunjun | A Chinese natural language inference dataset generated by translating the original English datasets with partial manual correction, which to some extent alleviates the shortage of data for Chinese natural language inference and semantic similarity computation. | Chinese NLI | Chinese natural language inference | paper | ||
| 11 | cMedQA v1.0 | 2017/4/5 | XunYiWenYao (xywy.com) and the College of Information Systems and Management, National University of Defense Technology | The dataset consists of questions and answers posted on the XunYiWenYao website, anonymized. The training set provides 50,000 questions and 94,134 answers, with an average of 120 characters per question and 212 per answer; the validation set has 2,000 questions and 3,774 answers (117 and 212 characters on average); the test set has 2,000 questions and 3,835 answers (119 and 211 characters on average); in total the dataset has 54,000 questions and 101,743 answers (119 and 212 characters on average). | Medical Q&A Match | Q&A Match | paper | ||
| 12 | cMedQA2 | 2018/11/8 | XunYiWenYao (xywy.com) and the College of Information Systems and Management, National University of Defense Technology | The dataset consists of questions and answers posted on the XunYiWenYao website, anonymized. The training set provides 100,000 questions and 188,490 answers, with an average of 48 characters per question and 101 per answer; the validation set has 4,000 questions and 7,527 answers (49 and 101 characters on average); the test set has 4,000 questions and 7,552 answers (49 and 100 characters on average); in total the dataset has 108,000 questions and 203,569 answers (49 and 101 characters on average). | Medical Q&A Match | Q&A Match | paper | ||
| 13 | ChineseSTS | 2017/9/21 | Tang Shancheng, Bai Yunyue, Ma Fuyu, Xi'an University of Science and Technology | This dataset provides 12,747 pairs of similar Chinese sentences, each with a similarity score given by the authors; the corpus consists of short sentences. | Short sentence similarity matching | Similarity matching | |||
| 14 | Dataset of the Medical Question Similarity Measurement Competition held by the China Health Information Processing Conference | 2018 | CHIP 2018 - The 4th China Health Information Processing Conference (CHIP) | The main goal of this evaluation task is to match the intent of question sentences based on real health consultation corpora from Chinese patients. Given two sentences, the task is to determine whether their intents are the same or similar. All corpora come from real patient questions on the Internet, screened and manually labeled for intent matching. The data have been desensitized and questions are represented by numeric IDs; the training set contains about 20,000 labeled pairs (desensitized, punctuation included) and the test set about 10,000 unlabeled pairs (desensitized, punctuation included). | Similarity Match for Medical Problems | Similarity matching | |||
| 15 | COS960: A Chinese Word Similarity Dataset of 960 Word Pairs | 2019/6/6 | Tsinghua University | The dataset contains 960 word pairs, each rated for similarity by 15 native speakers. The 960 pairs are divided into three groups by part of speech: 480 noun pairs, 240 verb pairs, and 240 adjective pairs. | Similarity between words | Synonyms | paper | ||
| 16 | OPPO mobile search ranking query-title semantic matching dataset (https://pan.baidu.com/s/1Hg2Hubsn3GEuu4gubbHCzw Password 7p3n) | 2018/11/6 | OPPO | This dataset comes from the real-time search scenario of OPPO's mobile search ranking optimization, in which query results are returned in real time as the user types. The dataset has been simplified accordingly and provides a query-title semantic matching task, i.e., a CTR-prediction-style problem. | Question title matching, ctr prediction | Similarity matching | |||
| 17 | Web search results evaluation (SogouE) | 2012 | Sogou | Sogou Laboratory Data License Agreement | This dataset contains query terms, related URLs, and query-category data. The format is: query term \t related URL \t query category, where the URLs are guaranteed to exist in the corresponding Internet corpus; in the query-category field, "1" represents a navigational query and "2" represents an informational query. | Automatic Search Engine Performance Evaluation with Click-through Data Analysis | Query type matching prediction |
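
Most of the question-matching datasets above (LCQMC, BQ, AFQMC, ...) are sentence pairs with a 0/1 label. The sketch below assumes a tab-separated `sentence1<TAB>sentence2<TAB>label` layout, which is common but not guaranteed for every dataset here, and adds a crude character-overlap baseline; the file name in the usage comment is hypothetical.

```python
# Minimal sketch for sentence-pair matching data such as LCQMC / BQ / AFQMC.
# Assumes a tab-separated "sentence1<TAB>sentence2<TAB>label" layout per line;
# check each dataset's own documentation for its actual distribution format.
def load_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            s1, s2, label = line.rstrip("\n").split("\t")
            pairs.append((s1, s2, int(label)))
    return pairs

def char_jaccard(s1: str, s2: str) -> float:
    """Character-level Jaccard overlap: a crude similarity baseline."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Usage (hypothetical file name): call a pair "matched" when overlap > 0.5
# pairs = load_pairs("lcqmc_dev.tsv")
# acc = sum((char_jaccard(s1, s2) > 0.5) == bool(y) for s1, s2, y in pairs) / len(pairs)
```
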
| ID | Title | Update date | Dataset provider | License | Description | Keywords | Category | Paper | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 1 | LCSTS | 2015/8/6 | Qingcai Chen | The dataset is from Sina Weibo and contains about two million real Chinese short texts. Each item has two fields, the author-written summary and the text. For 10,666 items, the relevance between the short text and the summary has been manually annotated with scores from 1 to 5 (higher means more relevant). | Single text summary; short text; text relevance | Text Summary | paper | ||
| 2 | Chinese short text summary dataset | 2018/6/20 | He Zhengfang | The data comes from Weibo published by Sina Weibo mainstream media, with a total of 679,898 pieces of data. | Single text summary; short text | Text Summary | |||
| 3 | Education and training industry automatic summarization Chinese corpus | 2018/6/5 | anonymous | The corpus collects historical articles from mainstream vertical media in the education and training industry, about 24,500 items in total, each with two fields: the author-written summary and the body. | Single text summary; education and training | Text Summary | |||
| 4 | NLPCC2017 Task3 | 2017/11/8 | NLPCC2017 organizer | The data set is derived from the news field and is a task data provided by NLPCC 2017 and can be used for single-text summary. | Single text summary; news | Text Summary | |||
| 5 | Shence Cup 2018 | 2018/10/11 | DC Contest Organizer | The data comes from news text and is provided by the DC competition organizer. It simulates business scenarios and aims to extract core words from news texts. The final result is to improve the effect of recommendations and user portraits. | Text keywords; news | Text Summary | |||
| 6 | Byte Cup 2018 International Machine Learning Competition | 2018/12/4 | ByteDance | The data come from ByteDance's TopBuzz and openly licensed articles. The training set includes about 1.3 million texts, the validation set 1,000 articles, and the test set 800 articles. Each validation and test item is manually labeled by editors with multiple candidate titles as reference answers. | Single text summary; video; news | Text Summary | English | ||
| 7 | NEWSROOM | 2018/6/1 | Grusky | The data were obtained from search and social media metadata between 1998 and 2017 and comprise 1.3 million articles with summaries written by authors and editors in the newsrooms of 38 major publications; the summaries combine extractive and abstractive strategies. | Single text summary; social metadata; search | Text Summary | paper | English | |
| 8 | DUC/TAC (https://duc.nist.gov/, https://tac.nist.gov/) | 2014/9/9 | NIST | The full name is Document Understanding Conference / Text Analysis Conference. The data are drawn from the newswire and web texts in the corpora used in the annual TAC KBP (TAC Knowledge Base Population) competition. | Single text/multi-text summary; news | Text Summary | English | ||
| 9 | CNN/Daily Mail | 2017/7/31 | Stanford | GNU v3 | The dataset consists of about one million news items from CNN and the Daily Mail, originally collected as a corpus for machine reading comprehension. | Multi-text summary; long text; news | Text Summary | paper | English |
| 10 | Amazon SNAP Review | 2013/3/1 | Stanford | The data come from Amazon shopping reviews; data can be obtained per major category (such as food, movies, etc.) or all at once. | Multi-text summary; shopping reviews | Text Summary | English | ||
| 11 | Gigaword | 2003/1/28 | David Graff, Christopher Cieri | The data set includes about 950,000 news articles, which are abstracted by the article title, and belong to the single sentence summary data set. | Single text summary; news | Text Summary | English | ||
| 12 | RA-MDS | 2017/9/11 | Piji Li | The full name is Reader-Aware Multi-Document Summarization. The data set is derived from news articles and is collected, marked and reviewed by experts. 45 topics are covered, each with 10 news documents and 4 model summary, each news document contains an average of 27 sentences and an average of 25 words per sentence. | Multi-text summary; news; manual labeling | Text Summary | paper | English | |
| 13 | TIPSTER SUMMAC | 2003/5/21 | The MITRE Corporation and the University of Edinburgh | The data consists of 183 documents marked by Computation and Language (cmp-lg) collection, and the documents are taken from papers published by the ACL conference. | Multi-text summary; long text | Text Summary | English | ||
| 14 | WikiHow | 2018/10/18 | Mahnaz Koupaee | Each data is an article, each article consists of multiple paragraphs, each paragraph begins with a sentence that summarizes it. By merging paragraphs to form articles and paragraph outlines to form abstracts, the final version of the dataset contains more than 200,000 long sequence pairs. | Multi-text summary; long text | Text Summary | paper | English | |
| 15 | Multi-News | 2019/12/4 | Alex Fabbri | Data are from input articles from over 1500 different websites and professional summary of 56,216 of these articles obtained from the website newser.com. | Multi-text summary | Text Summary | paper | English | |
| 16 | MED Summaries | 2018/8/17 | D.Potapov | The dataset is used for dynamic video summary evaluation and contains annotations for 160 videos, including 60 validation sets, 100 test sets, and 10 event categories in the test set. | Single text summary; video comments | Text Summary | paper | English | |
| 17 | BIGPATENT | 2019/7/27 | Sharma | The dataset includes 1.3 million U.S. patent document records with human-written abstractive summaries that contain richer discourse structure and more frequently used entities. | Single text summary; patent; written | Text Summary | paper | English | |
| 18 | [NYT](https://catalog.ldc.upenn.edu/LDC2008T19) | 2008/10/17 | Evan Sandhaus | The full name is The New York Times; the dataset contains 150 business articles from the New York Times, covering articles captured from the New York Times website from November 2009 to January 2010. | Single text summary; business article | Text Summary | English | ||
| 19 | The AQUAINT Corpus of English News Text | 2002/9/26 | David Graff | The dataset consists of English news text from the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press World News Service, containing approximately 375 million words. The dataset is not free of charge. | Single text summary; news | Text Summary | Chinese and English | ||
| 20 | Legal Case Reports Data Set | 2012/10/19 | Filippo Galgani | The data set comes from the Australian legal cases of the Federal Court of Australia (FCA) from 2006 to 2009, and contains approximately 4,000 legal cases and their summary. | Single text summary; legal case | Text Summary | English | ||
| 21 | 17 Timelines | 2015/5/29 | GB Tran | The data are content extracted from news article web pages, covering news about four countries: Egypt, Libya, Yemen, and Syria. | Single text summary; news | Text Summary | paper | Multilingual | |
| 22 | PTS Corpus | 2018/10/9 | Fei Sun | The full name is Product Title Summarization Corpus; the data cover summarization of product titles for mobile e-commerce applications | Single text summary; short text | Text Summary | paper | ||
| 23 | Scientific Summarization DataSets | 2019/10/26 | Santosh Gupta | The dataset is taken from the Semantic Scholar Corpus and ArXiv: title/abstract pairs from the Semantic Scholar Corpus filtered for papers in the biomedical field (5.8 million records), and title/abstract pairs from ArXiv for every paper from 1991 to 5 July 2019. The dataset contains 10k records in finance, 26k in biology, 417k in mathematics, 1.57 million in physics, and 221k in CS. | Single text summary; paper | Text Summary | English | ||
| 24 | Scientific Document Summarization Corpus and Annotations from the WING NUS group | 2019/3/19 | Jaidka | The dataset includes research papers on computational linguistics and natural language processing from ACL, together with their cited papers and three kinds of summaries: the traditional author-written abstract, a community summary (a collection of citation sentences, "citances"), and a human summary written by a trained annotator. The training set contains 40 articles and their cited papers. | Single text summary; paper | Text Summary | paper | English |
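
The summarization datasets above are usually evaluated with ROUGE. The sketch below uses the `rouge-score` package on a toy English pair; its default tokenization is word-based, so for Chinese summaries (e.g. LCSTS) you would typically insert spaces between characters or segmented words before scoring.

```python
# Minimal ROUGE evaluation sketch using the rouge-score package
# (pip install rouge-score). Its tokenization is word-based, so for Chinese
# summaries (e.g. LCSTS) you would usually insert spaces between characters
# or segmented words before scoring.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the cat sat on the mat"
candidate = "a cat was sitting on the mat"

scores = scorer.score(reference, candidate)   # (target, prediction)
for name, score in scores.items():
    print(name, round(score.fmeasure, 3))
```
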
| ID | Title | Update date | Dataset provider | License | Description | Keywords | Category | Paper | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 1 | WMT2017 | 2017/2/1 | EMNLP 2017 Workshop on Machine Translation | The data come mainly from the Europarl corpus and the UN corpus, supplemented with articles re-extracted for the 2017 task from the News Commentary corpus. This is a translation corpus provided by the EMNLP workshop and serves as a benchmark against which many papers report results | Benchmark, WMT2017 | Chinese-English translation materials | paper | ||
| 2 | WMT2018 | 2018/11/1 | EMNLP 2018 Workshop on Machine Translation | The data come mainly from the Europarl corpus and the UN corpus, supplemented with articles re-extracted for the 2018 task from the News Commentary corpus. This is a translation corpus provided by the EMNLP workshop and serves as a benchmark against which many papers report results | Benchmark, WMT2018 | Chinese-English translation materials | paper | ||
| 3 | WMT2019 | 2019/1/31 | EMNLP 2019 Workshop on Machine Translation | The data come mainly from the Europarl corpus and the UN corpus, plus data obtained from the accompanying news-commentary corpus and the ParaCrawl corpus | Benchmark, WMT2019 | Chinese-English translation materials | paper | ||
| 4 | UM-Corpus: A Large English-Chinese Parallel Corpus | 2014/5/26 | Department of Computer and Information Science, University of Macau | A high-quality English-Chinese parallel corpus released by the University of Macau | UM-Corpus; English; Chinese; large | Chinese-English translation materials | paper | ||
| 5 | [Ai challenger translation 2017](https://pan.baidu.com/s/1E5gD5QnZvNxT3ZLtxe_boA Extraction Code: stjf) | 2017/8/14 | AI competition jointly initiated by Innovation Works, Sogou, and Toutiao | The largest English-Chinese bilingual dataset in the spoken-language domain. More than 10 million English-Chinese parallel sentence pairs are provided. All bilingual sentence pairs have been manually checked, so the dataset is guaranteed in terms of scale, relevance, and quality. Training set: 10,000,000 sentence pairs; validation set (simultaneous interpretation): 934 sentences; validation set (text translation): 8,000 sentences | AI challenger 2017 | Chinese-English translation materials | |||
| 6 | MultiUN | 2010 | Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden | This dataset is provided via the German Research Center for Artificial Intelligence (DFKI). Besides this dataset, the site also offers parallel corpora in many other language pairs for download. | MultiUN | Chinese-English translation materials | MultiUN: A Multilingual corpus from United Nation Documents, Andreas Eisele and Yu Chen, LREC 2010 | ||
| 7 | NIST 2002 Open Machine Translation (OpenMT) Evaluation | 2010/5/14 | NIST Multimodal Information Group | LDC User Agreement for Non-Members | The data consist of 70 news stories from the Xinhua News Service and 30 news stories from the Zaobao News Service. The 100 stories selected from the two collections are between 212 and 707 Chinese characters long; Xinhua totals 25,247 characters and Zaobao 39,256 characters. | NIST | Chinese-English translation materials | paper | This series has multiple years of data; using the data requires a fee |
| 8 | The Multitarget TED Talks Task (MTTT) | 2018 | Kevin Duh, JHU | This dataset contains parallel corpora in multiple languages based on TED talks, covering 20 languages including Chinese and English. | TED | Chinese-English translation materials | The Multitarget TED Talks Task | ||
| 9 | ASPEC Chinese-Japanese | 2019 | Workshop on Asian Translation | This dataset mainly studies languages in Asian regions, such as translation tasks between Chinese and Japanese, and between Japanese and English. Translation corpus mainly comes from language science and technology papers (paper abstract; invention description; patents, etc.) | Asian scientific patent Japanese | Chinese and Japanese translation materials | http://lotus.kuee.kyoto-u.ac.jp/WAT/ | ||
| 10 | casia2015 | 2015 | Research group in Institute of Automation , Chinese Academy of Sciences | Corpus contains approximately one million sentence pairs automatically collected from the network | casia CWMT 2015 | Chinese-English translation materials | |||
| 11 | casict2011 | 2011 | Research group at the Institute of Computing Technology, Chinese Academy of Sciences | The corpus contains 2 sections, each with approximately 1 million sentence pairs (2 million in total) automatically collected from the web. Sentence-level alignment accuracy is about 90%. | casict CWMT 2011 | Chinese-English translation materials | |||
| 12 | casict2015 | 2015 | Research group at the Institute of Computing Technology, Chinese Academy of Sciences | The corpus contains approximately 2 million sentence pairs, including sentences collected from the web (60%), movie subtitles (20%), and an English/Chinese thesaurus (20%). Sentence-level alignment accuracy is higher than 99%. | casict CWMT 2015 | Chinese-English translation materials | |||
| 13 | datum2015 | 2015 | Datum Data Co., Ltd. | The corpus contains one million sentence pairs covering different genres, such as language-education textbooks, bilingual books, technical documents, bilingual news, government white papers, government documents, and bilingual resources from the Internet. Note that the Chinese side of part of the data has been word-segmented. | datum CWMT 2015 | Chinese-English translation materials | |||
| 14 | datum2017 | 2017 | Datum Data Co., Ltd. | The corpus contains 20 files covering different genres such as news, dialogue, legal documents, and novels, with 50,000 sentences per file and one million sentence pairs in total. The Chinese text in the first 10 files (Book1-Book10) has been word-segmented. | datum CWMT 2017 | Chinese-English translation materials | |||
| 15 | neu2017 | 2017 | NLP Lab of Northeastern University, China | The corpus contains 2 million sentence pairs automatically collected from the web, including news, technical documents, etc. Sentence-level alignment accuracy is about 90%. | neu CWMT 2017 | Chinese-English translation materials | |||
| 16 | Translation corpus (translation2019zh) | 2019 | Xu Liang | Can be used to train Chinese-English translation systems in either direction; since it contains millions of Chinese sentences, the Chinese side alone can serve as a general Chinese corpus for training word vectors or for pre-training, and the English side can be used similarly for English tasks. | | Chinese-English translation materials | ||
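
Translation output on the corpora above is normally scored with BLEU. The sketch below uses sacrebleu on a toy example; the hypothesis/reference strings are placeholders, and for Chinese output you would pass sacrebleu's Chinese tokenizer as noted in the comments.

```python
# Minimal BLEU evaluation sketch using sacrebleu (pip install sacrebleu).
# The hypothesis and reference strings are placeholders; pass one inner list
# per reference set.
import sacrebleu

hypotheses = ["The cat is on the mat."]
references = [["There is a cat on the mat."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)

# For English-to-Chinese output, use sacrebleu's Chinese tokenizer:
# sacrebleu.corpus_bleu(zh_hypotheses, [zh_references], tokenize="zh")
```
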
| ID | Title | Update date | Dataset provider | License | Description | Keywords | Category | Paper | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 1 | NLPIR Weibo follow-relationship corpus (1 million records) | 2017/12/2 | Dr. Zhang Huaping, Internet Search Mining and Security Laboratory, Beijing Institute of Technology | NLPIR Weibo follow-relationship corpus description: 1. The corpus was obtained through public collection and extraction from Sina Weibo and Tencent Weibo. To advance research on Weibo computing, 10 million records are publicly shared through the Natural Language Processing and Information Retrieval sharing platform (127.0.0.1/wordpress) (nearly 1 billion records are currently held; a large amount of redundant data has been removed). 2. During release, technical measures were used to mask users' real names and URLs as far as possible; users who need full privacy protection can email Dr. Zhang Huaping ([email protected]) to have their data deleted; we apologize for any inconvenience and hope for your understanding. 3. For research and teaching purposes only; commercial use is not permitted. When citing this corpus, please indicate the source in your software, paper, or other output as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval sharing platform (http://www.nlpir.org/). 4. Field description: person_id - the user's ID; guanzhu_id - the ID of the user being followed |
| ID | Title | Update date | Dataset provider | License | Description | Keywords | Category | Paper | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 1 | NLPIR Weibo content corpus - 230,000 items | December 2017 | Dr. Zhang Huaping, Internet Search Mining and Security Laboratory, Beijing Institute of Technology | | NLPIR Weibo content corpus description: 1. The corpus was obtained through public collection and extraction from Sina Weibo and Tencent Weibo. To advance research on Weibo computing, 230,000 records are publicly shared through the Natural Language Processing and Information Retrieval sharing platform (127.0.0.1/wordpress) (nearly 10 million records are currently held; a large amount of redundant data has been removed). 2. During release, technical measures were used to mask users' real names and URLs as far as possible; users who need full privacy protection can email Dr. Zhang Huaping ([email protected]) to have their data deleted; we apologize for any inconvenience and hope for your understanding. 3. For research and teaching purposes only; commercial use is not permitted. When citing this corpus, please indicate the source as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval sharing platform (http://www.nlpir.org/). 4. Field description: id - article ID; article - body text; discuss - number of comments; insertTime - time the body was inserted; origin - source; person_id - ID of the posting user; time - publication time; transmit - number of reposts | | | | |
| 2 | 5 million Weibo posts corpus | January 2018 | Dr. Zhang Huaping, Internet Search Mining and Security Laboratory, Beijing Institute of Technology | | [5 million Weibo corpus] Dr. Zhang Huaping (@ICTCLAS), head of the BIT search mining laboratory, provides a corpus of 5 million Weibo posts. The file is a SQL dump that can only be imported into a MySQL database; it contains table-creation statements and 5 million records in total. For research and teaching purposes only; commercial use is not permitted; when citing this corpus, please indicate the source in your software, paper, or other output. [This data appears noisier than the corpus above and has not been cleaned.] | | | | |
| 3 | NLPIR news corpus - 24 million characters | July 2017 | www.NLPIR.org | | NLPIR news corpus description: 1. About 48 MB after decompression, roughly 24 million characters of news. 2. The news spans October 12, 2009 to December 14, 2009. 3. File names are the dates of the news; each file contains the body text of multiple news items (junk content has been removed). 4. Copyright of the news content belongs to the original authors or news organizations. 5. Copyright of the curated corpus belongs to www.NLPIR.org. 6. Can serve as test data for news analysis, natural language processing, search, and similar applications. For a larger-scale corpus, contact the NLPIR.org administrator. | | | | |
| 4 | NLPIR Weibo follow-relationship corpus - 1 million records | December 2017 | Dr. Zhang Huaping, Internet Search Mining and Security Laboratory, Beijing Institute of Technology | | NLPIR Weibo follow-relationship corpus description: 1. The corpus was obtained by Dr. Zhang Huaping of the Internet Search Mining and Security Laboratory, Beijing Institute of Technology, through public collection and extraction from Sina Weibo and Tencent Weibo. To advance research on Weibo computing, 10 million records are publicly shared through the Natural Language Processing and Information Retrieval sharing platform (127.0.0.1/wordpress) (nearly 1 billion records are currently held; a large amount of redundant data has been removed). 2. During release, technical measures were used to mask users' real names and URLs as far as possible; users who need full privacy protection can email Dr. Zhang Huaping ([email protected]) to have their data deleted; we apologize for any inconvenience and hope for your understanding. 3. For research and teaching purposes only; commercial use is not permitted. When citing this corpus, please indicate the source as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval sharing platform (http://www.nlpir.org/). 4. Field description: person_id - the user's ID; guanzhu_id - the ID of the user being followed | | | | |
| 5 | NLPIR Weibo blogger corpus - 1 million records | September 2017 | Dr. Zhang Huaping, Internet Search Mining and Security Laboratory, Beijing Institute of Technology | | NLPIR Weibo blogger corpus description: 1. The corpus was obtained by Dr. Zhang Huaping of the Internet Search Mining and Security Laboratory, Beijing Institute of Technology, through public collection and extraction from Sina Weibo and Tencent Weibo. To advance research on Weibo computing, 1 million records are publicly shared through the Natural Language Processing and Information Retrieval sharing platform (127.0.0.1/wordpress) (nearly 100 million records are currently held; large amounts of redundancy and bot followers have been removed). 2. During release, technical measures were used to mask users' real names and URLs as far as possible; users who need full privacy protection can email Dr. Zhang Huaping ([email protected]) to have their data deleted; we apologize for any inconvenience and hope for your understanding. 3. For research and teaching purposes only; commercial use is not permitted. When citing this corpus, please indicate the source as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval sharing platform (http://www.nlpir.org/). 4. Field description: id - internal ID; sex - gender; address - home address; fansNum - number of followers; summary - personal bio; wbNum - number of posts; gzNum - number of accounts followed; blog - blog URL; edu - education; work - employment; renZh - whether verified; brithday - birthday | | | | |
| 6 | NLPIR short text corpus - 400,000 characters | August 2017 | Internet Search Mining and Security Laboratory, Beijing Institute of Technology (SMS@BIT) | | NLPIR short text corpus description: 1. About 480,000 characters after decompression, roughly 8,704 short texts. 2. Copyright of the curated corpus belongs to www.NLPIR.org. 3. Can serve as test data for short-text natural language processing, search, public opinion analysis, and similar applications. | | | | |
| 7 | Wikipedia corpus | | Wikipedia | | Wikipedia regularly packages and releases corpus dumps | | | | |
| 8 | Classical Chinese poetry database | 2020 | Crawled by the GitHub repository owner, http://shici.store | | | | | | |
| 9 | Insurance industry corpus | 2017 | | | This corpus contains questions and answers collected from the Insurance Library website. As far as we know, this is the first open QA corpus in the insurance domain: the questions were asked by real-world users, and the high-quality answers were provided by professionals with deep domain knowledge, so this is a corpus of real value rather than a toy. In the associated paper, the corpus is used for the answer-selection task; other uses are also possible, for example autonomous learning through reading comprehension of answers or observational learning, so that a system can eventually produce its own answers to unseen questions. The dataset is split into two parts, the "QA corpus" and the "QA-pair corpus". The QA corpus is translated from the original English data without further processing. The QA-pair corpus is built on the QA corpus with word segmentation, removal of punctuation and stop words, and added labels, so it can be fed directly to machine-learning tasks. If the data format or the segmentation is unsatisfactory, you can process the "QA corpus" with other methods to obtain data usable for training models. | | | | |
| 10 | Chinese character decomposition dictionary | July 1905 | | | This repository contains the character-decomposition dictionary database used by the Open Dictionary website for radical and component lookup, which is useful for looking up characters that are hard to type. The database currently covers decompositions of 17,803 distinct Chinese characters, in a traditional-character version (chaizi-ft.txt) and a simplified-character version (chaizi-jt.txt). Decomposition differs from conventional stroke-order dictionaries: it focuses on splitting each character into two or more component parts rather than into the strokes used in handwriting. | | | | |
| 11 | News corpus | 2016 | Xu Liang | | Can be used as a general Chinese corpus for training word vectors or as pre-training data; can also be used to train title-generation models or keyword-generation models (selecting data whose keywords differ from the title); news types can also be distinguished by news channel. | | | | |
| 12 | Encyclopedia QA, JSON version (baike2018qa) | 2018 | Xu Liang | | Can be used as a general Chinese corpus for training word vectors or as pre-training data; can also be used to build encyclopedia-style question answering. The category information is quite useful and can serve as supervision for building better sentence-representation models, sentence-similarity tasks, and so on. | | | | |
| 13 | Community QA, JSON version (webtext2019zh): a large-scale, high-quality dataset | 2019 | Xu Liang | | 1) Build encyclopedia-style QA: given a question, build a retrieval system to fetch a reply or generate one, or filter domain-relevant data from the community QA base by keyword. 2) Train a topic-prediction model: given a question (and/or its description), predict its topic. 3) Train a community QA (cQA) system: for one-question-many-answers scenarios, given a question, find the most relevant question, and then, based on the quality of the different replies and the relevance between question and answers, find the best answer. 4) Use as a general Chinese corpus for pre-training large models or training word vectors; the category information is also useful for supervised training to build better sentence-representation models, sentence-similarity tasks, and so on. 5) Combined with the additional like-count signal, predict reply popularity or train an answer-scoring system. | | | | |
| 14 | Wikipedia, JSON version (wiki2019zh) | 2019 | Xu Liang | | Can be used as a general Chinese corpus for pre-training or for building word vectors, and also for building knowledge-based question answering. [Unlike the raw dumps released by Wikipedia, this version has already been processed.] | | | | |
| ID | Title | Update date | Dataset provider | License | Description | Keywords | Category | Paper | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Baidu WebQA | 2016 | Baidu | | From Baidu Zhidao; the format is one question paired with multiple passages of essentially the same meaning, split into human-annotated and search-retrieved parts | Reading comprehension; real Baidu Zhidao questions | Chinese reading comprehension | paper | |
| 2 | DuReader 1.0 | 2018/3/1 | Baidu | Apache2.0 | The competition dataset comes from real search-engine application scenarios: the questions are real queries from Baidu Search users, and each question is paired with 5 candidate document texts and manually curated high-quality answers. | Reading comprehension; real Baidu Search questions | Chinese reading comprehension | paper | |
| 3 | SogouQA | 2018 | Sogou | | Data from the CIPS-SOGOU question answering competition, drawn from real user queries submitted to the Sogou search engine; contains both factoid and non-factoid data | Reading comprehension; real Sogou search-engine questions | Chinese reading comprehension | | |
| 4 | Chinese judicial reading comprehension dataset (CJRC) | 2019/8/17 | Joint Laboratory of HIT and iFLYTEK (HFL) | | The dataset contains about 10,000 documents, mainly first-instance civil and criminal judgments. By extracting the fact-description sections of the judgment documents and annotating questions against them, about 50,000 question-answer pairs were produced | Reading comprehension; Chinese legal domain | Chinese reading comprehension | paper | |
| 5 | 2019 "iFLYTEK Cup" Chinese machine reading comprehension dataset (CMRC 2019) | October 2019 | Joint Laboratory of HIT and iFLYTEK (HFL) | CC-BY-SA-4.0 | The task is sentence-level cloze-style reading comprehension: given a narrative passage and several sentences extracted from it, participants must build a model that accurately fills the candidate sentences back into the passage to restore the complete article. | Sentence-level cloze reading comprehension | Chinese reading comprehension | Competition site: https://hfl-rc.github.io/cmrc2019/ | |
| 6 | 2018 "iFLYTEK Cup" Chinese machine reading comprehension dataset (CMRC 2018) | 2018/10/19 | Joint Laboratory of HIT and iFLYTEK (HFL) | CC-BY-SA-4.0 | The CMRC 2018 dataset contains about 20,000 questions manually annotated on Wikipedia text. A challenge set was also annotated, containing more demanding questions that require multi-sentence reasoning to answer correctly | Reading comprehension; span extraction | Chinese reading comprehension | paper | Competition site: https://hfl-rc.github.io/cmrc2018/ |
| 7 | 2017 "iFLYTEK Cup" Chinese machine reading comprehension dataset (CMRC 2017) | 2017/10/14 | Joint Laboratory of HIT and iFLYTEK (HFL) | CC-BY-SA-4.0 | The first Chinese cloze-style reading comprehension dataset, PD&CFT | Cloze-style reading comprehension | Chinese reading comprehension | paper | Competition site |
| 8 | LES Cup: the 2nd national "Military Intelligence Machine Reading" challenge | 2019/9/3 | CETC LES Information System Co., Ltd. | | A large-scale Chinese reading comprehension dataset for military application scenarios; the competition centers on multi-document machine reading comprehension, involving complex techniques such as comprehension and reasoning. | Multi-document machine reading comprehension | Chinese reading comprehension | Competition site | |
| 9 | ReCO | 2020 | Sogou | | Derived from queries entered by Sogou browser users; includes both multiple-choice and direct answers | Reading comprehension; Sogou search | Chinese reading comprehension | paper | |
| 10 | DuReader-checklist | 2021/3 | Baidu | Apache-2.0 | A fine-grained, multi-dimensional evaluation dataset that probes model weaknesses along dimensions such as lexical understanding, phrase understanding, semantic-role understanding, and logical reasoning, pushing reading comprehension evaluation into a "fine-grained" era | Fine-grained reading comprehension | Chinese reading comprehension | Competition site | |
| 11 | DuReader-Robust | 2020/8 | Baidu | Apache-2.0 | Test data built to measure the robustness of reading comprehension models along the dimensions of over-sensitivity, over-stability, and generalization | Baidu Search; robust reading comprehension | Chinese reading comprehension | paper | Competition site |
| 12 | DuReader-YesNo | 2020/8 | Baidu | Apache-2.0 | DuReader-YesNo is a dataset whose target task is opinion-polarity judgment; it compensates for the shortcomings of evaluation metrics on extraction-style datasets and thus better evaluates a model's understanding of opinion polarity. | Opinion-based reading comprehension | Chinese reading comprehension | Competition site | |
| 13 | DuReader 2.0 | 2021 | Baidu | Apache-2.0 | DuReader 2.0 is a brand-new large-scale Chinese reading comprehension dataset drawn from real user queries in real scenarios | Reading comprehension | Chinese reading comprehension | paper | Competition site |
| 14 | CAIL2020 | 2020 | Joint Laboratory of HIT and iFLYTEK (HFL) | | A Chinese judicial reading comprehension task. This year's upgraded edition extends the document types from civil and criminal to civil, criminal, and administrative, and the question types from single-step prediction to multi-step reasoning, raising the difficulty. | Legal reading comprehension | Chinese reading comprehension | Competition site | |
| 15 | CAIL2021 | 2021 | Joint Laboratory of HIT and iFLYTEK (HFL) | | The Chinese legal reading comprehension competition introduces multi-span questions, i.e., some questions require extracting multiple spans from the passage and combining them into the final answer, in the hope of broadening the applicability of Chinese machine reading comprehension. Single-span, yes/no, and unanswerable question types are retained. | Legal reading comprehension | Chinese reading comprehension | Competition site | |
| 16 | CoQA | 2018/9 | Stanford University | CC BY-SA 4.0, Apache, etc. | CoQA is a large dataset for building conversational question answering systems; the challenge measures a machine's ability to understand a text passage and to answer a series of interrelated questions that appear in a conversation | Conversational QA | English reading comprehension | paper | Official website |
| 17 | SQuAD2.0 | 2018/1/11 | Stanford University | | Widely recognized as a top-level benchmark for machine reading comprehension; it is a large-scale dataset of one hundred thousand questions drawn from more than 500 Wikipedia articles. The answer to each question is a short span of text from the given passage, and SQuAD 2.0 additionally requires judging whether the question can be answered from the passage at all | QA; includes unanswerable questions | English reading comprehension | paper | |
| 18 | SQuAD1.0 | 2016 | Stanford University | | A reading comprehension dataset released by Stanford University in 2016: given an article and a question, the algorithm must produce the answer. All articles are taken from Wikipedia; there are 107,785 questions over 536 articles | QA; span extraction | English reading comprehension | paper | |
| 19 | MCTest | 2013 | Microsoft | | 100,000 Bing questions with human-generated answers. Since then, a 1,000,000-question dataset, a natural language generation dataset, a passage-ranking dataset, a keyphrase-extraction dataset, a crawl dataset, and conversational search have been released successively. | QA; search | English reading comprehension | paper | |
| 20 | CNN/Dailymail | 2015 | DeepMind | Apache-2.0 | A large-scale cloze-style English machine comprehension dataset in which the answer is a word from the original text. The CNN portion contains CNN news articles and associated questions, about 90k articles and 380k questions; the Daily Mail portion contains Daily Mail articles and associated questions, about 197k articles and 879k questions. | QA pairs; cloze-style reading comprehension | English reading comprehension | paper | |
| 21 | RACE | 2017 | Carnegie Mellon University | / | English reading comprehension questions for Chinese middle- and high-school students: given a passage and five 4-choice questions, with 28,000+ passages and about 100,000 questions in total. | Multiple-choice format | English reading comprehension | paper | Download requires an email application |
| 22 | HEAD-QA | 2019 | aghie | MIT | A multiple-choice healthcare QA dataset aimed at complex reasoning; data are provided in both English and Spanish | Medical domain; multiple-choice format | English and Spanish reading comprehension | paper | |
| 23 | Consensus Attention-based Neural Networks for Chinese Reading Comprehension | 2018 | Joint Laboratory of HIT and iFLYTEK (HFL) | / | Chinese cloze-style reading comprehension | Cloze-style reading comprehension | Chinese reading comprehension | paper | |
| 24 | WikiQA | 2015 | Microsoft | / | The WikiQA corpus is a new, publicly available set of question-sentence pairs, collected and annotated for open-domain question answering research | Span-extraction reading comprehension | English reading comprehension | paper | |
| 25 | Children's Book Test (CBT) | 2016 | Facebook | / | Tests how well language models capture meaning in children's books. Unlike standard language-modeling benchmarks, it separates the task of predicting syntactic function words from that of predicting lower-frequency words with richer semantic content | Cloze-style reading comprehension | English reading comprehension | paper | |
| 26 | NewsQA | 2017 | Maluuba Research | / | A challenging machine comprehension dataset containing over 100,000 human-generated question-answer pairs based on more than 10,000 CNN news articles; answers consist of text spans from the corresponding articles. | Span-extraction reading comprehension | English reading comprehension | paper | |
| 27 | Frames dataset | 2017 | Microsoft | / | Introduces the Frames dataset of 1,369 human-human dialogues, with an average of 15 turns per dialogue. The dataset was developed to study the role of memory in goal-oriented dialogue systems. | Reading comprehension; dialogue | English reading comprehension | paper | |
| 28 | Quasar | 2017 | Carnegie Mellon University | BSD-2-Clause | Two large-scale datasets are proposed. Quasar-S consists of 37,000 cloze-style queries constructed from definitions of software entity tags on the popular website Stack Overflow; posts and comments on the site serve as the background corpus for answering the cloze questions. Quasar-T contains 43,000 open-domain trivia questions with answers obtained from various Internet sources. | Span-extraction reading comprehension | English reading comprehension | paper | |
| 29 | MS MARCO | 2018 | Microsoft | / | A large-scale English reading comprehension dataset built by Microsoft from the Bing search engine, containing 100,000 questions and 200,000 unique documents. All questions come from Bing search logs, simulating real search-engine scenarios based on real user queries, making it one of the most application-relevant datasets in the field. | Multi-document | English reading comprehension | paper | |
| 30 | Chinese cloze test | 2016 | Yiming Cui | | The first Chinese cloze-style reading comprehension dataset, PD&CFT (People Daily and Children's Fairy Tale); data come from the People's Daily and children's stories. | Cloze-style reading comprehension | Chinese cloze test | paper | |
| 31 | NLPCC ICCPOL2016 | 2016/12/2 | NLPCC organizers | | 14,659 questions manually constructed from sentences in documents, covering 14K Chinese passages. | QA-pair reading comprehension | Chinese reading comprehension | | |
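
Several of the extractive datasets above (SQuAD1.0/2.0, CMRC 2018, CJRC) share a SQuAD-style JSON layout. The sketch below iterates over that layout; the field names follow the public SQuAD format, so verify each dataset's actual schema before relying on it, and the file name in the example is hypothetical.

```python
# Minimal sketch for iterating over SQuAD-style JSON, the layout shared by
# several extractive datasets above (SQuAD1.0/2.0, CMRC 2018, CJRC). Field
# names follow the public SQuAD format; verify each dataset's actual schema.
import json

def iter_squad_examples(path):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)["data"]
    for article in data:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa.get("answers", [])]
                yield context, qa["question"], answers

# Usage (hypothetical file name):
# for context, question, answers in iter_squad_examples("cmrc2018_dev.json"):
#     print(question, answers[:1])
```
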
Thanks to the following contributors (in no particular order):
郑少棉、李明磊、李露、叶琛、薛司悦、章锦川、李小昌、李俊毅
You can contribute by uploading dataset information. After you upload five or more datasets and they pass review, you will be listed and displayed as a project contributor.
Share your dataset with the community or make a contribution today! Just send an email to chineseGLUE#163.com,
or join QQ group: 836811304