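Datasets for Entity Recognition
This repository contains datasets from several domains annotated with a variety of entity types, useful for entity recognition and named entity recognition (NER) tasks.
NOTE: I am no longer actively adding datasets to this list; many more NER datasets are likely to have appeared since 2020. However, I am happy to add more datasets via issues or pull requests.
Datasets for NER in English
The following table lists datasets for English-language entity recognition (for NER datasets in other languages, see below). The data directory contains information on where to obtain those datasets that could not be shared due to licensing restrictions, as well as code to convert them to the CoNLL 2003 format. Links to NER corpora in other languages are also listed below.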
| Dataset | Domain | License | Reference | Availability |
|---|---|---|---|---|
| CoNLL 2003 | News | DUA | Sang and Meulder, 2003 | Easy to find |
| NIST-IEER | News | None | NIST 1999 IE-ER | NLTK data |
| MUC-6 | News | LDC | Grishman and Sundheim, 1996 | LDC 2003T13 |
| OntoNotes 5 | Various | LDC | Weischedel et al., 2013 | LDC 2013T19 |
| BBN | Various | LDC | Weischedel and Brunstein, 2005 | LDC 2005T33 |
| GMB-1.0.0 | Various | None | Bos et al., 2017 | http://gmb.let.rug.nl/data.php |
| GUM-3.1.0 | Wiki | Several (*2) | Zeldes, 2017 | ✔ included here |
| wikigold | Wikipedia | CC-BY 4.0 | Balasuriya et al., 2009 | ✔ included here |
| Ritter | Twitter | None | Ritter et al., 2011 | no splits; train/test/dev splits |
| BTC | Twitter | CC-BY 4.0 | Derczynski et al., 2016 | ✔ included here |
| WNUT17 | Social media | CC-BY 4.0 | Derczynski et al., 2017 | ✔ included here |
| i2b2-2006 | Medical | DUA | Uzuner et al., 2007 | http://www.i2b2.org |
| i2b2-2014 | Medical | DUA | Stubbs et al., 2015 | http://www.i2b2.org |
| CADEC | Medical | CSIRO | Karimi et al., 2015 | http://data.csiro.au/ |
| AnEM | Anatomical | CC-SA 3.0 | Ohta et al., 2012 | ✔ included here |
| MITRestaurant | Queries | None | Liu et al., 2013a | http://groups.csail.mit.edu/sls/ |
| MITMovie | Queries | None | Liu et al., 2013b | http://groups.csail.mit.edu/sls/ |
| MalwareTextDB | Malware | None | Lim et al., 2017 | http://www.statnlp.org/ |
| re3d | Defense | Several (*1) | DSTL, 2017 | ✔ included here |
| SEC-filings | Finance | CC-BY 3.0 | Alvarado et al., 2015 | ✔ included here |
| Assembly | Robotics | X | Costa et al., 2017 | X |
| WikiNEuRal | Wikipedia | CC BY-SA-NC 4.0 | Tedeschi et al., 2021 | https://github.com/Babelscape/wikineural |
| MultiNERD | Wikipedia | CC BY-SA-NC 4.0 | Tedeschi et al., 2022 | https://github.com/Babelscape/multinerd |
| HIPE-2022 | Historical | CC BY-SA-NC 4.0 | Ehrmann et al., 2022 | https://github.com/hipe-eval/hipe-2022-data |
| Music-NER | Music | MIT | Epure and Hennequin, 2023 | https://github.com/deezer/music-ner-eacl2023 |
| WIESP2022-NER | Astrophysics | CC BY-SA-NC 4.0 | Grezes et al., 2022 | https://huggingface.co/datasets/adsabs/wiesp2022-ner |
| NNE | News | CC 4.0 / LDC | Ringland et al., 2019 | https://github.com/nickyringland/nested_named_entities |
| Worldwide | News | CC BY-SA-NC 4.0 | Shan et al., 2023 | https://github.com/stanfordnlp/en-worldwide-newswire https://arxiv.org/abs/2404.13465 |
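The conversion code itself lives in the data directory and is not reproduced here. As a rough illustration of the target format only, the sketch below reads CoNLL 2003-style text, assuming whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, and `-DOCSTART-` lines between documents. The names `read_conll` and `SAMPLE` are hypothetical, and the sample sentences and their BIO tags are made up for the example rather than taken from any dataset file.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, ner_tag) pairs."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        # Assumption: first column is the token, last column is the NER tag.
        sentence.append((columns[0], columns[-1]))
    # Emit the final sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Illustrative sample only (token, POS tag, chunk tag, NER tag).
SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

sentences = list(read_conll(SAMPLE.splitlines()))
print(len(sentences))
print(sentences[0])
```

Reading only the first and last columns keeps the sketch tolerant of files that omit the POS or chunk columns; datasets with a different column order would need their own handling.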
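Licenses
License notes:
(*1) re3d ("Relationship and Entity Extraction Evaluation Dataset") contains several datasets with different licenses. These are:
- CC-BY-SA 3.0 (Wikipedia dataset)
- CC BY-NC 3.0 (BBC_Online dataset)
- CC BY 3.0 AU (Australian_Department_of_Foreign_Affairs dataset)
- Public domain (US_State_Department dataset, CENTCOM dataset)
- UK Open Government Licence v3.0 (UK_Government dataset)
- Delegation_of_the_European_Union_to_Syria: see https://eeas.europa.eu/delegations/syria/8157/legal-notice_en
(*2) GUM 3.1.0 consists of three datasets, with licenses CC-BY 3.0, CC-BY-SA 3.0 and CC-BY-NC-SA 3.0. The annotations are licensed under CC-BY 4.0.
More detailed license information for each dataset can be found in the corresponding subdirectory.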
To add later...
- Tabassum et al., Code and Named Entity Recognition in StackOverflow: https://cocoxu.github.io/publications/acl2020_stackoverflow_ner.pdf
- LitBank: https://github.com/dbamman/litbank (Bamman, Popat and Shen, An Annotated Dataset of Literary Entities)
- NNE: A Dataset for Nested Named Entity Recognition in English Newswire, 2019: https://github.com/nickyringland/nested_named_entities
- Mars Target Encyclopedia - LPSC abstracts labeled dataset: https://zenodo.org/record/1048419
- Best Buy e-commerce NER dataset: https://www.kaggle.com/dataturks/best-buy-ecommerce-ner-dataset/home
- Resume entities for NER: https://www.kaggle.com/dataturks/resume-entities-for-ner/home
- Few-NERD: A Few-shot Named Entity Recognition Dataset: https://aclanthology.org/2021.acl-long.248/
Datasets for NER in other languages
Lexical named entity resources
- HeiNER: http://heiner.cl.uni-heidelberg.de/index.shtml
- NECKAR: https://event.ifi.uni-heidelberg.de/?page_id=532#wikidata_ne_dataset
Code-switching
- English-Spanish tweets (CALCS 2018): https://code-switching.github.io/2018/; https://code-switching.github.io/2018/files/spa-eng/release.zip; http://www.aclweb.org/anthology/w18-3219
- Arabic-Egyptian tweets (CALCS 2018): https://code-switching.github.io/2018/; https://code-switching.github.io/2018/files/msa-egy/arabictweetstokenaskigner.zip; http://www.aclweb.org/anthology/w18-3219
- Hindi-English social media text: https://github.com/silentflame/named-entity-recognition; http://aclweb.org/anthology/w18-2405
- EMNLP 2014 shared task, code-switched tweets (Nepali-English, Spanish-English, Mandarin-English, Arabic-Arabic dialects): http://emnlp2014.org/workshops/codeswitch/call.html
German
- CoNLL 2003 (English, German): https://www.clips.uantwerpen.be/conll2003/ner/
- GermEval 2014: https://sites.google.com/site/germeval2014ner/data
- Tübingen Treebank of Written German (TüBa-D/Z): http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-dz.html
- Europeana Newspapers (Dutch, French, German): https://github.com/europeananewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- German Europarl transcripts (subset): https://nlpado.de/~sebastian/software/ner_german.shtml
- Named Entity Model for German, Politics (NEMGP): https://www.thomas-zastrow.de/nlp/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- DFKI SmartData Corpus (German): https://dfki-lt-re-group.bitbucket.io/smartdata-corpus/ (a German corpus for fine-grained named entity recognition and relation extraction of traffic and industry events; authors include Gabryszak and Leonhard Hennig)
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset, densely annotated Wikipedia texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- Elena Leitner, Georg Rehm, et al. (legal entity recognition). Data: https://github.com/elenanereiss/legal-entity-recognition
- HIPE-2022, named entity recognition and linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Dutch
- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
- Europeana Newspapers (Dutch, French, German): https://github.com/europeananewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- MEANTIME corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- Dutch parliamentary documents 2015-2016, starting from 1848.
- SoNaR-1: Desmet and Hoste, Fine-grained Dutch named entity recognition, 2014 (class hierarchy)
- Corpus eBooks and Corpus Gutenberg Dutch: http://blog.namescape.nl/?page_id=85; http://portal.clarin.nl/node/1940
Afrikaans
- NCHLT Afrikaans Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/299
Spanish
- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
- AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en
- DEFT Spanish Treebank (LDC2018T01): https://catalog.ldc.upenn.edu/ldc2018t01
- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-es
- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-es
- MEANTIME corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/ldc2014t18
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- http://www.grupolys.org/~marcos/pub/lrec16.tar.bz2 (data for "Incorporating Lexico-semantic Heuristics into Coreference Resolution Sieves for Named Entity Recognition at Document-level")
- Multilingual corpus with coreference annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2
- DrugSemantics gold standard (Moreno et al., DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics, 2017): https://data.mendeley.com/datasets/fwc7jrc5jr/1
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset, densely annotated Wikipedia texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- CANTEMIST (Cancer Text Mining Shared Task, tumor named entity recognition), named entity recognition of a key concept type related to cancer, namely tumor morphology, in Spanish medical texts: https://temu.bsc.es/cantemist/
Catalan
- AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en
Galician
- Galician NER corpus: https://gramatica.usc.es/~marcos/resources/corpus_gal_nec.txt.gz
- Multilingual corpus with coreference annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2
Basque
- Basque Named Entities Corpus (EIEC): http://ixa.eus/node/4486?language=en
- Basque Disambiguated Named Entities Corpus (EDIEC): http://ixa.si.ehu.es/node/4485?language=en
- Egunkaria 2000 corpus (383 news texts), see http://qtleap.eu/wp-content/uploads/2014/04/qtleap-2013-d5.1.pdf
Portuguese
- HAREM: https://www.linguateca.pt/aval_conjunta/harem/harem_ing.html
- CINTIL corpus: http://cintil.ul.pt/cintilfeatures.html#corpus
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- Multilingual corpus with coreference annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2
- Bosque 8.0 (EAGLES format): https://gramatica.usc.es/~marcos/resources/corpora_flpt.tgz
- LeNER-Br (Brazilian legal documents): https://cic.unb.br/~teodecampos/lener-br/
- Paramopama: a Brazilian Portuguese corpus for named entity recognition
French
- ESTER: http://catalogue.elra.info/en-us/repository/browse/elra-s0241/
- ESTER 2: http://catalogue.elra.info/en-us/repository/browse/elra-s0338/
- ETAPE: http://catalogue.elra.info/en-us/repository/browse/elra-e0046/
- Europeana Newspapers (Dutch, French, German): https://github.com/europeananewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- Quaero French Medical Corpus: https://quaerofrenchmed.limsi.fr/
- Quaero Broadcast News Extended Named Entity corpus: http://catalog.elra.info/en-us/repository/browse/elra-s0349/
- Quaero Old Press Extended Named Entity corpus: http://catalog.elra.info/en-us/repository/browse/elra-w0073/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNER-fr-gold: https://arxiv.org/abs/2411.00030 https://huggingface.co/datasets/danrun/wikiner-fr-gold
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset, densely annotated Wikipedia texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- CAP 2017 (Twitter data), Lopez et al., CAP 2017 challenge: Twitter Named Entity Recognition, 2017: http://cap2017.imag.fr/competition.html
- HIPE-2022, named entity recognition and linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Italian
- KIND: https://github.com/dhfbk/kind
- EVALITA: http://www.evalita.it/2009/tasks/entity
- MEANTIME corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-it
- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-it
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset, densely annotated Wikipedia texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
Romanian
- RONEC (Dumitrescu and Avram, Introducing RONEC: the Romanian Named Entity Corpus. LREC 2020). Paper: https://arxiv.org/pdf/1909.01247.pdf Data: https://github.com/dumitrescustefan/ronec
- Romanian journalistic corpus (ROCO): http://metashare.elda.org/repository/browse/romanian-journalistic-corpus-roco/038baa80dc7311e5aa0b00237df3e3583781d7c0f2084057aa018a2d63d987e9/
- Romanian balanced corpus (ROMBAC): http://metashare.elda.org/repository/browse/romanian-balanced-corpus-corpus-corpus-corpus-rombac/0a7dd85edc7311e5aaa0b00233e35873166666243524229dbubu
Greek
- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-el
- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-el
Hungarian
- Hungarian Named Entity Corpora: http://rgai.inf.u-szeged.hu/index.php?lang=en&page=corpus_ne
- hunNERwiki: http://hlt.sztaki.hu/resources/hunnerwiki.html
- NYTK-NerKor: https://github.com/nytud/nytk-nerkor
Czech
- Czech Named Entity Corpus: http://ufal.mff.cuni.cz/cnec
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- CzEng 1.0 (parallel corpus: Czech-English): http://ufal.mff.cuni.cz/czeng/czeng10
- PERO OCR NER (Czech historical OCR chronicles): https://github.com/roman-janik/poner https://dspace.vut.cz/items/6092e1b0-1b0-1b0-1b0-1b0-3d75-3d75-4451-8582-28582-28582-28573ac3044
Polish
- Polish Sejm Corpus: http://clip.ipipan.waw.pl/psc
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- Polish Coreference Corpus: http://zil.ipipan.waw.pl/polishcoreferencecorpus
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- Corpus of Economic News (CEN Corpus): http://www.nlp.pwr.wroc.pl/narzedzia-i-zasoby/zasoby/cen
- KPWr (Korpus Języka Polskiego Politechniki Wrocławskiej / Polish Corpus of Wrocław University of Technology): http://plwordnet.pwr.wroc.pl/index.php?option=com_content&view=article&id=35; http://plwordnet.pwr.wroc.pl/attachments/article/35/kpwr-1.1.7z (Broda et al., KPWr: Towards a Free Corpus of Polish, 2012)
- NKJP: http://clip.ipipan.waw.pl/nationalcorpusofpolish?action=AttachFile&do=view&target=nkjp-podkorpusmilionowy-1.2.tar.gz
Croatian
- hr500k 1.0: http://hdl.handle.net/11356/1183
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- ReLDI-NormTagNER-hr (Croatian tweets): http://hdl.handle.net/11356/1170
Slovak
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- Slovak Categorized News Corpus: https://nlp.web.tuke.sk/pages/categorizednews
Slovene
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- ssj500k: http://www.slovenscina.eu/tehnologije/ucni-korpus; http://eng.slovenscina.eu/tehnologije/ucni-korpus; https://www.clarin.si/repository/xmlui/handle/11356/1029; note: for v2.2 see http://hdl.handle.net/11356/1210
- Slovene news: http://zitnik.si/mediawiki/index.php?title=datasets#slovene_news; http://zitnik.si/mediawiki/images/7/7d/rtvslo_dec2011.tsv; http://zitnik.si/mediawiki/images/5/5e/rtvslo_dec2011_v2.tsv
- Janes-Tag 2.0 (social media text): https://www.clarin.si/repository/xmlui/handle/11356/1123; see also: Fišer et al., The Janes project: language resources and tools for Slovene user generated content, 2018.
Ukrainian
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- Ukrainian Brown NER corpus: https://github.com/lang-uk/ner-uk; http://lang.org.ua/en/corpora/
Serbian
- SETimes.SR: http://hdl.handle.net/11356/1200
- Named entity evaluation corpus for Serbian: http://www.korpus.matf.bg.ac.rs/srpneval/
- ReLDI-NormTagNER-sr (Serbian tweets): http://hdl.handle.net/11356/1171
Bulgarian
Icelandic
- MIM-GOLD-NER (Ingólfsdóttir, Svanhvít Lilja, Sigurjón Þorsteinsson, and Hrafn Loftsson): http://www.malfong.is/index.php?pg=mim_gold_ner
Danish
- DaNE: Hvingelby et al., [DaNE: A Named Entity Resource for Danish.](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.565.pdf)
- Danish PropBank (DPB): http://catalog.elra.info/en-us/repository/browse/elra-w0117/
- Arboretum Treebank: http://catalog.elra.info/en-us/repository/browse/elra-w0084/
Norwegian
- Bjarte Johansen, Named-Entity Recognition for Norwegian, in Proceedings of the 22nd Nordic Conference on Computational Linguistics, 2019 (https://www.aclweb.org/anthology/w19-6123.pdf). Data: https://github.com/ljos/navnkjenner
- Fredrik Jørgensen et al., NorNE: Annotating Named Entities for Norwegian, 2019 (https://arxiv.org/pdf/1911.12146.pdf). Data: https://github.com/ltgoslo/norne/; https://www.nb.no/sprakbanken/show?serial=Oai%3anb.no%3ASBR-49
Swedish
- Stockholm Internet Corpus: https://www.ling.su.se/english/nlp/corpora-and-resources/sic
- SUC 3.0: https://spraakbanken.gu.se/eng/resource/suc3
- Swedish manually annotated NER: https://github.com/klintan/swedish-ner-corpus/
- Medical Wikipedia data (Almgren et al., Named Entity Recognition in Swedish Health Records with Character-Based Deep Bidirectional LSTMs, 2016): https://github.com/olofmogren/biomedical-ner-data-swedish
- HIPE-2022, named entity recognition and linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Finnish
- Finnish named entity recognition dataset (FiNER data): https://github.com/mpsilfve/finer-data
- Turku NER corpus: https://github.com/turkunlp/turku-ner-corpus
- HIPE-2022, named entity recognition and linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Estonian
- Estonian NER corpus: https://metashare.ut.ee/repository/browse/estonian-ner-corpus/88D030C0ACDE11E2A6E2A6E2A6E4005056B40024F1DEF1DEF472ED254E77A8952E1003E1003D9F89F89F89F889F881E//
Latvian and Lithuanian
- https://github.com/accurat-toolkit/tildener/tree/master/test (Pinnis, Latvian and Lithuanian Named Entity Recognition with TildeNER, LREC 2012)
- Training data for LV Tagger: https://github.com/peterisp/lvtagger/tree/master/nertrainingdata
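Turkish
- Küçük and Can, A Tweet Dataset Annotated for Named Entity Recognition and Stance Detection, 2019: https://github.com/dkucuk/tweet-dataset-ner-sd
- Küçük et al., Named Entity Recognition on Turkish Tweets: http://optima.jrc.it/resources/2014_jrc_twitter_tr_ner-dataset.zip
- English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset (http://arxiv.org/abs/1702.02363): https://data.mendeley.com/datasets/cdcztymf4k/1
- Çoban et al., Named Entity Recognition over FbNER: A New Facebook Dataset in Turkish: https://ieeexplore.ieee.org/document/9598971 (data available upon request for research purposes)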
Kazakh
- KazNERD: https://arxiv.org/pdf/2111.13419.pdf, https://github.com/is2ai/kaznerd
Uyghur
- Uyghur Named Entity Relation corpus: https://github.com/kaharjan/uynerel (Abiderexiti et al., Annotation schemes for constructing Uyghur named entity relation corpus, 2016)
Armenian
- pioNER (gold-standard and silver-standard datasets): https://github.com/ispras-texterra/pioner (Ghukasyan et al., pioNER: Datasets and Baselines for Armenian Named Entity Recognition, 2018)
- ArmTDP-NER: https://github.com/myavrum/armtdp-ner
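Coptic
- Coptic Universal Dependencies Treebank: https://github.com/universaldependencies/ud_coptic-scriptorium/tree/dev (see also https://copticscriptorium.org/treebank.html). It contains 46,000 tokens with nested named and non-named entities, including Wikified entities.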
Amharic
- SAY corpus (see "Named Entity Recognition for Amharic Using Deep Learning"): https://github.com/geezorg/data/tree/master/amharic/tagged/nmsu-say; http://data.geez.org/
Arabic
- AQMAR Arabic Wikipedia Named Entity Corpus: http://www.cs.cmu.edu/~ark/arabicner/
- NE3L named entities Arabic corpus (Arabic, Chinese, Russian): http://catalog.elra.info/en-us/repository/browse/elra-w0078/
- REFLEX Entity Translation (parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2009t11
- ANERcorp: http://users.dsic.upv.es/~ybenajiba/downloads.html (see also: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html)
- ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2004t09
- ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2005t09
- ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2006t06
- ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/ldc2014t18
- OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2013t19
- DAWT dataset, densely annotated Wikipedia texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- Wojood, a 2022 nested Arabic named entity corpus: https://dlnlp.ai/st/wojood/ https://aclanthology.org/2022.lrec-1.387.pdf https://codalab.lisn.upsaclay.fr/competitions/11740
Persian
- ArmanPersoNERCorpus: http://islrn.org/resources/399-379-640-828-6/; https://github.com/haniehp/persianner
Sindhi
- SiNER: https://aclanthology.org/2020.lrec-1.361/, https://github.com/aliwazir/siner-dataset
Urdu
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
- UNER dataset (Khan et al., Named Entity Dataset for Urdu Named Entity Recognition Task, 2016). Available at http://www.iiu.edu.pk/?page_id=5181
- MK-PUCIT: https://www.dropbox.com/sh/1ivw7ykm2tugg94/aab9t5wnn7fynespo7tjjjw8la; see: Kanwal et al., Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications, 2019
Indic
- Naamapadam: a named entity recognition (NER) dataset for 11 major Indian languages from two language families. https://research.ibm.com/publications/naamapadam-a-large-scale-named-entity-annotated-data-for-indic-languages https://ai4bharat.iitm.ac.in/naamapadam
Hindi
- HiNER: https://github.com/cfiltnlp/hiner
- Hindi Health Dataset: https://www.kaggle.com/aijain/hindi-health-dataset/home
- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
Bengali
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
- Bengali-NER: https://github.com/rifat1493/bengali-ner, https://ieeexplore.ieee.org/document/8944804
- NER-Bangla: https://github.com/misabic/ner-bangla-dataset, https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs179349
Telugu
- NER_Telugu: https://github.com/anikethjr/ner_telugu
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
- Named-entity annotated corpora for Telugu: http://www.tdil-dc.in/index.php?option=com_download&task=showresourcedetails&toolid=982&lang=en
Maithili
- The First Named Entity Recognizer in Maithili: Resource Creation and System Development: https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs210051
Nepali
- EverestNER: https://journals.flvc.org/flairs/article/view/130725, https://github.com/nowalab/everest-ner
Marathi
- Named-entity annotated corpora for Marathi: http://www.tdil-dc.in/index.php?option=com_download&task=showresourcedetails&toolid=979&lang=en
- L3Cube MahaNER: https://arxiv.org/abs/2204.06029 https://github.com/l3cube-pune/marathinlp
Punjabi
- Named-entity annotated corpora for Punjabi: http://www.tdil-dc.in/index.php?option=com_download&task=showresourcedetails&toolid=980&lang=en
Tamil
- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
Malayalam
- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
Oriya/Odia
- IJCNLP 2008 SSEAL:http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
Sinhala/Sinhalese
Thai
- Thai-Named-Entity-Recognition-data: https://github.com/pythainlp/thai-named-entity-recognition-data
- Thai Named Entity Corpus: http://pioneer.chula.ac.th/~awirote/resources/corpora-data.html; http://pioneer.chula.ac.th/~awirote/data-nutcha.zip; http://pioneer.chula.ac.th/~awirote/data-sasiwimon.zip; http://pioneer.chula.ac.th/~awirote/data-nattadaporn.zip
- LST20: https://huggingface.co/datasets/lst20; https://arxiv.org/abs/2008.05055
- Thai-NNER: https://github.com/vistec-ai/thai-nner, https://aclanthology.org/2022.findings-acl.116
Indonesian
- IDENTIC: http://metashare.elda.org/repository/browse/entic/fed3fada7ef111e5aa3b001dd8b71c6666666666666abd4242f18ff1f18ffd9a9a9a95da9104cc/
- https://github.com/yohanesgultom/nlp-experiments/tree/master/data/ner
- Indonesia-NER: Syaifudin & Nurwidyantoro https://ieeexplore.ieee.org/document/7828656 https://github.com/yusufsyaifudin/indonesia-ner
- IDNER-News-2K: a dataset of Indonesian news for named entity recognition tasks. Syaifudin & Nurwidyantoro https://dl.acm.org/doi/10.1145/3592854#fn8 https://github.com/khairunnisaor/idner-news-2k/
- NERP and NER-Grit: IndoNLP/IndoNLU https://github.com/indonlp/indonlu/tree/master/dataset https://aclanthology.org/2020.aacl-main.85/
Vietnamese
- VLSP 2016: http://vlsp.org.vn/resources-vlsp2016; https://github.com/undertheseanlp/ner
- VLSP 2018: http://vlsp.org.vn/resources-vlsp2018; https://github.com/undertheseanlp/ner
- PhoNER_COVID19: https://github.com/vinairesearch/phoner_covid19
Japanese
- IREX: https://nlp.cs.nyu.edu/irex/package/
- MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/
- BCCWJ Basic NE corpus: https://sites.google.com/site/projectnextnlpne/en (Iwakura et al., Constructing a Japanese Basic Named Entity Corpus of Various Genres, NEWS 2016)
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- Data from: Mai et al., An Empirical Study on Fine-Grained Named Entity Recognition, COLING 2018 (English, Japanese): https://fgner.alt.ai/duc/ene/testsets/comp/
- Wikipedia NER corpus: https://github.com/stockmarkteam/ner-wikipedia-dataset
- WikiANN: https://elisa-ie.github.io/wikiann/
- GSD: the UD GSD dataset as converted by Megagon Labs: https://github.com/megagonlabs/ud_japanese-gsd
- KWDLC: Kyoto University Web Document Leads Corpus: https://nlp.ist.i.kyoto-u.ac.jp/en/index.php?kwdlc https://github.com/ku-nlp/kwdlc
Korean
- National Institute of the Korean Language (ROK) NER corpus: https://github.com/digitalprk/koreaner; https://ithub.korean.go.kr/user/total/referenceview.do?boardseq=5&articleseq=118&boardgb=t&isinsupd&boardType=corpus
- KMOU NER: https://github.com/kmounlp/ner
- Korean Language Understanding Evaluation, KLUE NER: https://klue-benchmark.com/tasks/69/overview/description
- https://github.com/songys/entity
- HLCT 2016 corpus, with updates: https://github.com/machinereading/koreannercorpus
Chinese
- ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2004t09
- ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2005t09
- ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2006t06
- OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2013t19
- MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/
- REFLEX Entity Translation (parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2009t11
- NE3L named entities Chinese corpus (Arabic, Chinese, Russian): http://catalogue.elra.info/en-us/repository/browse/elra-w0079/
- Raw phrase data collection I, Chinese (named entities): http://catalog.elra.info/en-us/repository/browse/elra-w0045_04/
- Raw phrase data collection II, Chinese (named entities): http://catalog.elra.info/en-us/repository/browse/elra-w0045_08/
- ERE DEFT corpora (parallel corpus: English, Chinese): Mott et al., Parallel Chinese-English Entities, Relations and Events Corpora, 2016 (LDC2015E78, LDC2014E114)
- Chinese Weibo: annotations of named and nominal mentions (Weibo messages): https://github.com/hltcoe/golden-horse
- Chinese EduNER: a 2023 dataset for the education domain: https://link.springer.com/article/10.1007/s00521-023-08635-5
- Chinese aerospace NER: https://www.nature.com/articles/s41598-023-50705-0
- SciCN: a Chinese dataset and benchmark for scientific information extraction
- ENP-NER: Historical Chinese https://aclanthology.org/2024.lrec-main.35.pdf https://gitlab.com/enpchina/ENP-NER
Tagalog
- TLUnified-NER: https://arxiv.org/abs/2311.07161 https://huggingface.co/datasets/ljvmiranda921/tlunified-ner
Russian
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- NE3L named entities Russian corpus (Arabic, Chinese, Russian): https://catalog.elra.info/en-us/repository/browse/ELRA-W0080/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- factRuEval-2016: https://github.com/dialogue-evaluation/factRuEval-2016
- RuREBus 2020 (Russian Relation Extraction for Business) corpus https://github.com/dialogue-evaluation/RuREBus
Yoruba
- GV-Yorùbá-NER. Data: https://github.com/ajesujoba/YorubaTwi-Embedding/tree/master/Yoruba/Yor%C3%B9b%C3%A1-NER ; Data statement: https://drive.google.com/file/d/177xu-O2FTJ7VJQ-0ohCWjVd1qu61Tvml/view ; Paper: Jesujoba O. Alabi, Kwabena Amponsah-Kaakyire, David I. Adelani, and Cristina España-Bonet. Massive vs. curated word embeddings for low-resourced languages. The case of Yorùbá and Twi. In LREC, 2020 (https://arxiv.org/abs/1912.02481)
Swahili
- Helsinki Corpus of Swahili 2.0 (HCS 2.0) Annotated Version: http://metashare.csc.fi/repository/browse/helsinki-corpus-of-swahili-20-hcs-20-annotated-version/232c1910b9eb11e5915e005056be118e59fb2e920f1f4c0cafc94915fc6f5cac/ See: Shah et al., 2010. SYNERGY: A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation
Igbo
- IgboNER: https://aclanthology.org/2022.lrec-1.547/ https://github.com/Chiamakac/IgboNER-Models later updated in https://openreview.net/pdf?id=tHUS9-vmUfC from https://sites.google.com/view/africanlp2023/home
isiNdebele
- NCHLT isiNdebele Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/306
isiXhosa
- NCHLT isiXhosa Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/312
isiZulu
- NCHLT isiZulu Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/319
Sepedi
- NCHLT Sepedi Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/328
Sesotho
- NCHLT Sesotho Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/334
Setswana
- NCHLT Setswana Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/341
Siswati
- NCHLT Siswati Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/346
Venda
- NCHLT Tshivenda Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/355
- MPHAYANER: Named Entity Recognition for Tshivenḓa: https://openreview.net/pdf?id=0nneuL3bSLt https://github.com/rendanim/MphayaNER from https://sites.google.com/view/africanlp2023/home
Xitsonga
- NCHLT Xitsonga Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/362
Latin
- Herodotos Project: https://github.com/alexerdmann/Herodotos_Project_Annotation
A long list can be found here: http://damien.nouvels.net/resourcesen/corpora.html
References
[Alvarado et al., 2015] Alvarado, Julio Cesar Salinas, Karin Verspoor, and Timothy Baldwin. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015, pp. 84-90. 2015. Accessed: August 2018.
[Balasuriya et al., 2009] Balasuriya, Dominic, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran. Named entity recognition in wikipedia. In Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pp. 10-18. Association for Computational Linguistics, 2009
[Bos et al., 2017] Bos, Johan, Valerio Basile, Kilian Evang, Noortje J. Venhuizen, and Johannes Bjerva. The Groningen meaning bank. In Handbook of linguistic annotation, pp. 463-496. Springer, Dordrecht, 2017.
[Derczynski et al., 2016] Derczynski, Leon, Kalina Bontcheva, and Ian Roberts. Broad twitter corpus: A diverse named entity recognition resource. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1169-1179. 2016. Available at: https://github.com/GateNLP/broad_twitter_corpus Accessed: August 2018.
[Derczynski et al., 2017] Leon Derczynski, Eric Nichols, Marieke van Erp, Nut Limsopatham (2017) Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition, in Proceedings of the 3rd Workshop on Noisy, User-generated Text. Available at: https://noisy-text.github.io/2017/emerging-rare-entities.html
[DSTL, 2017] Defence Science and Technology Laboratory. 2017. Relationship and Entity Extraction Evaluation Dataset. https://github.com/dstl/re3d. Accessed: January 2018.
[Grishman and Sundheim, 1996] Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference - 6: A brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.
[Karimi et al., 2015] Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. Cadec: A corpus of adverse drug event annotations. Journal of biomedical informatics, 55:73-81. Available at https://data.csiro.au Accessed: November 2017.
[Lim et al., 2017] Lim, Swee Kiat, Aldrian Obaja Muis, Wei Lu, and Chen Hui Ong. MalwareTextDB: A database for annotated malware articles. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1557-1567. 2017.
[Liu et al., 2013a] Jingjing Liu, Panupong Pasupat, Scott Cyphers, and Jim Glass. 2013. Asgard: A portable architecture for multilingual dialogue systems. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8386-8390. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/restaurant/ Accessed: January 2018.
[Liu et al., 2013b] Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers, and Jim Glass. 2013. Query understanding enhanced by hierarchical parsing structures. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 72-77. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/movie/ We used the trivia10k13 portion. Accessed: January 2018.
[NIST, 1999 IE-ER] NIST. 1999. Information Extraction - Entity Recognition Evaluation. http://www.nist.gov/speech/tests/ieer/er_99/er_99.htm. The newswire development test data only (included in the NLTK package).
[Ohta et al., 2012] Tomoko Ohta, Sampo Pyysalo, Jun'ichi Tsujii and Sophia Ananiadou. 2012. Open-domain Anatomical Entity Mention Detection. In Proceedings of ACL 2012 Workshop on Detecting Structure in Scholarly Discourse (DSSD), pp. 27-36. Available at: http://www.nactem.ac.uk/anatomy/ and https://github.com/openbiocorpora/anem Accessed: November 2017.
[Ritter et al., 2011] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524-1534, Edinburgh, Scotland, UK, July. Association for Computational Linguistics. Accessed: January 2018.
[Sang and Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.
[Stubbs et al., 2015] Amber Stubbs and Ozlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of biomedical informatics, 58:S20-S29. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.
[Uzuner et al., 2007] Ozlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association, 14(5):550-563. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.
[Weischedel and Brunstein, 2005] Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia.
[Weischedel et al., 2013] Weischedel, Ralph, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue et al. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA (2013).
[Zeldes, 2017] Amir Zeldes. 2017. The GUM corpus: creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581-612. Available at https://github.com/amir-zeldes/gum/tree/master/coref/tsv/ Accessed: November 2017.