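Entity Recognition Datasets

This repository contains datasets from several domains, annotated with a variety of entity types, that are useful for entity recognition and named entity recognition (NER) tasks.

Note: I am no longer actively adding datasets to this list, and more NER datasets have likely appeared since 2020. However, I am happy to add more datasets via issues or pull requests.

Datasets for NER in English

The table below lists datasets for English-language entity recognition (for NER datasets in other languages, see below). The data directory contains information on where to obtain the datasets that cannot be shared due to licensing restrictions, as well as code to convert them to the CoNLL 2003 format. Links to NER corpora in other languages are also listed below.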

Dataset	Domain	License	Reference	Availability
CoNLL 2003	News	DUA	Sang and Meulder, 2003	Easy to find
NIST-IEER	News	None	NIST 1999 IE-ER	NLTK data
MUC-6	News	LDC	Grishman and Sundheim, 1996	LDC 2003T13
OntoNotes 5	Various	LDC	Weischedel et al., 2013	LDC 2013T19
BBN	Various	LDC	Weischedel and Brunstein, 2005	LDC 2005T33
GMB-1.0.0	Various	None	Bos et al., 2017	http://gmb.let.rug.nl/data.php
GUM-3.1.0	Wiki	Several (*2)	Zeldes, 2016	✔ Included here
wikigold	Wikipedia	CC-BY 4.0	Balasuriya et al., 2009	✔ Included here
Ritter	Twitter	None	Ritter et al., 2011	No splits, train/test/dev splits
BTC	Twitter	CC-BY 4.0	Derczynski et al., 2016	✔ Included here
WNUT17	Social media	CC-BY 4.0	Derczynski et al., 2017	✔ Included here
i2b2-2006	Medical	DUA	Uzuner et al., 2007	http://www.i2b2.org
i2b2-2014	Medical	DUA	Stubbs et al., 2015	http://www.i2b2.org
CADEC	Medical	CSIRO	Karimi et al., 2015	http://data.csiro.au/
AnEM	Anatomical	CC-SA 3.0	Ohta et al., 2012	✔ Included here
MITRestaurant	Queries	None	Liu et al., 2013a	http://groups.csail.mit.edu/sls/
MITMovie	Queries	None	Liu et al., 2013b	http://groups.csail.mit.edu/sls/
MalwareTextDB	Malware	None	Lim et al., 2017	http://www.statnlp.org/
re3d	Defense	Several (*1)	DSTL, 2017	✔ Included here
SEC-filings	Finance	CC-BY 3.0	Alvarado et al., 2015	✔ Included here
Assembly	Robotics	X	Costa et al., 2017	X
WikiNEuRal	Wikipedia	CC BY-SA-NC 4.0	Tedeschi et al., 2021	https://github.com/Babelscape/wikineural
MultiNERD	Wikipedia	CC BY-SA-NC 4.0	Tedeschi et al., 2022	https://github.com/Babelscape/multinerd
HIPE-2022	Historical	CC BY-SA-NC 4.0	Ehrmann et al., 2022	https://github.com/hipe-eval/hipe-2022-data
Music-NER	Music	MIT	Epure and Hennequin, 2023	https://github.com/deezer/music-ner-eacl2023
WIESP2022-NER	Astrophysics	CC BY-SA-NC 4.0	Grezes et al., 2022	https://huggingface.co/datasets/adsabs/wiesp2022-ner
NNE	News	CC 4.0 / LDC	Ringland et al., 2019	https://github.com/nickyringland/nested_named_entities
Worldwide	News	CC BY-SA-NC 4.0	Shan et al., 2023	https://github.com/stanfordnlp/en-worldwide-newswire https://arxiv.org/abs/2404.13465
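
As a rough illustration of the CoNLL 2003 format that the conversion code targets, the sketch below reads such data into sentences of (token, tag) pairs. It assumes the usual layout: one token per line with whitespace-separated columns (token first, NER tag last), blank lines between sentences, and -DOCSTART- lines marking document boundaries. The function name read_conll and the sample text are hypothetical and are not taken from this repository; tagging schemes (IOB1, BIO, and so on) vary between distributions of the data.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, NER tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-2003-style lines into sentences of (token, tag) pairs.

    Assumption: the token is the first column and the NER tag is the last one.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and -DOCSTART- markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    # Flush the last sentence if the input does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample only (columns: token, POS tag, chunk tag, NER tag).
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

sentences = read_conll(sample.splitlines())
for sentence in sentences:
    print(sentence)
```

Taking the last column as the tag keeps the reader usable for converted datasets that drop the POS and chunk columns and keep only the token and the NER tag.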

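Licenses

License notes:

(1) re3d ("Relationship and Entity Extraction Evaluation Dataset") contains several datasets with different licenses. These are:

CC-BY-SA 3.0 (Wikipedia dataset)
CC BY-NC 3.0 (BBC_Online dataset)
CC BY 3.0 AU (Australian_Department_of_Foreign_Affairs dataset)
Public domain (US_State_Department dataset, CENTCOM dataset)
UK Open Government Licence v3.0 (UK_Government dataset)
Delegation_of_the_European_Union_to_Syria: see https://eeas.europa.eu/delegations/syria/8157/legal-notice_en

(2) GUM-3.1.0 includes three datasets, with licenses CC-BY 3.0, CC-BY-SA 3.0 and CC-BY-NC-SA 3.0. The annotations are licensed under CC-BY 4.0.

More detailed license information for each dataset can be found in the corresponding subdirectory.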

Coming later...

Tabassum et al., Code and Named Entity Recognition in StackOverflow: https://cocoxu.github.io/publications/ACL2020_StackOverflow_NER.pdf
LitBank: https://github.com/dbamman/litbank (Bamman, Popat and Shen, An Annotated Dataset of Literary Entities)
NNE: A Dataset for Nested Named Entity Recognition in English Newswire, 2019: https://github.com/nickyringland/nested_named_entities
Mars Target Encyclopedia - LPSC Abstracts Labeled Data Set: https://zenodo.org/record/1048419
Best Buy e-commerce NER dataset: https://www.kaggle.com/dataturks/best-buy-ecommerce-ner-dataset/home
Resume Entities for NER: https://www.kaggle.com/dataturks/resume-entities-for-ner/home
Few-NERD: A Few-shot Named Entity Recognition Dataset: https://aclanthology.org/2021.acl-long.248/

NER datasets in other languages

Lexical named entity resources

HeiNER: http://heiner.cl.uni-heidelberg.de/index.shtml
NECKAr: https://event.ifi.uni-heidelberg.de/?page_id=532#wikidata_ne_dataset

Code-switching

English-Spanish tweets (CALCS 2018): https://code-switching.github.io/2018/; https://code-switching.github.io/2018/files/spa-eng/release.zip; http://www.aclweb.org/anthology/w18-3219
Arabic-Egyptian tweets (CALCS 2018): https://code-switching.github.io/2018/; https://code-switching.github.io/2018/files/msa-egy/arabictweetstokenaskigner.zip; http://www.aclweb.org/anthology/w18-3219
Hindi-English social media text: https://github.com/silentflame/named-entity-recognition; http://aclweb.org/anthology/w18-2405
EMNLP 2014 shared task on code-switched tweets (Nepali-English, Spanish-English, Mandarin-English, Arabic-Arabic dialects): http://emnlp2014.org/workshops/codeswitch/call.html

German

CoNLL 2003 (English, German): https://www.clips.uantwerpen.be/conll2003/ner/
GermEval 2014: https://sites.google.com/site/germeval2014ner/data
Tübingen Treebank of Written German (TüBa-D/Z): http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-dz.html
Europeana Newspapers (Dutch, French, German): https://github.com/europeananewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
German Europarl transcripts (subset): https://nlpado.de/~sebastian/software/ner_german.shtml
Named Entity Model for German, Politics (NEMGP): https://www.thomas-zastrow.de/nlp/
WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
WikiNEuRal: https://github.com/Babelscape/wikineural
MultiNERD: https://github.com/Babelscape/multinerd
DFKI SmartData Corpus: https://dfki-lt-re-group.bitbucket.io/smartdata-corpus/ (a German corpus for fine-grained named entity recognition and relation extraction of traffic and industry events; authors include Gabryszak and Leonhard Hennig)
DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
Elena Leitner, Georg Rehm, Julián Moreno-Schneider, Fine-grained Named Entity Recognition in Legal Documents. Data: https://github.com/elenanereiss/legal-entity-recognition
HIPE-2022, Named Entity Recognition and Entity Linking in Multilingual Historical Documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data

Dutch

CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
Europeana Newspapers (Dutch, French, German): https://github.com/europeananewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
MEANTIME corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
WikiNEuRal: https://github.com/Babelscape/wikineural
MultiNERD: https://github.com/Babelscape/multinerd
DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
Dutch parliamentary documents 2015-2016, starting from 1848.
SoNaR-1: Desmet and Hoste, Fine-grained Dutch named entity recognition, 2014 (class hierarchy)
Corpus Books and Corpus Gutenberg Dutch: http://blog.namescape.nl/?page_id=85; http://portal.clarin.nl/node/1940

Afrikaans

NCHLT Afrikaans Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/299

Spanish

CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en
DEFT Spanish Treebank (LDC2018T01): https://catalog.ldc.upenn.edu/ldc2018t01
PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-es
PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-es
MEANTIME corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/ldc2014t18
WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
WikiNEuRal: https://github.com/Babelscape/wikineural
MultiNERD: https://github.com/Babelscape/multinerd
http://www.grupolys.org/~marcos/pub/lrec16.tar.bz2 (used in "Incorporating Lexico-semantic Heuristics into Coreference Resolution Sieves for Named Entity Recognition at Document-level")
Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2
DrugSemantics gold standard (Moreno et al., DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics, 2017): https://data.mendeley.com/datasets/fwc7jrc5jr/1
DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
CANTEMIST (Cancer Text Mining Shared Task - tumor named entity recognition), named entity recognition of a key cancer-related concept type, tumor morphology, in Spanish medical texts: https://temu.bsc.es/cantemist/

Catalan

AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en

Galician

Galician NER corpus: https://gramatica.usc.es/~marcos/resources/corpus_gal_nec.txt.gz
Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2

Basque

Basque Named Entities Corpus (EIEC): http://ixa.eus/node/4486?language=en
Basque Disambiguated Named Entities Corpus (EDIEC): http://ixa.si.ehu.es/node/4485?language=en
Egunkaria 2000 corpus (383 news texts): http://qtleap.eu/wp-content/uploads/2014/04/qtleap-2013-d5.1.pdf

Portuguese

HAREM: https://www.linguateca.pt/aval_conjunta/harem/harem_ing.html
CINTIL corpus: http://cintil.ul.pt/cintilfeatures.html#corpus
WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
WikiNEuRal: https://github.com/Babelscape/wikineural
MultiNERD: https://github.com/Babelscape/multinerd
Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2
Bosque 8.0 in EAGLES format: https://gramatica.usc.es/~marcos/resources/corpora_flpt.tgz
LeNER-Br (Brazilian legal documents): https://cic.unb.br/~teodecampos/lener-br/
Paramopama: a Brazilian-Portuguese corpus for named entity recognition

French

ESTER: http://catalogue.elra.info/en-us/repository/browse/elra-s0241/
ESTER 2: http://catalogue.elra.info/en-us/repository/browse/elra-s0338/
ETAPE: http://catalogue.elra.info/en-us/repository/browse/elra-e0046/
Europeana Newspapers (Dutch, French, German): https://github.com/europeananewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
Quaero French Medical Corpus: https://quaerofrenchmed.limsi.fr/
Quaero Broadcast News Extended Named Entity corpus: http://catalog.elra.info/en-us/repository/browse/elra-s0349/
Quaero Old Press Extended Named Entity corpus: http://catalog.elra.info/en-us/repository/browse/elra-w0073/
WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
WikiNER-fr-gold: https://arxiv.org/abs/2411.00030 https://huggingface.co/datasets/danrun/wikiner-fr-gold
WikiNEuRal: https://github.com/Babelscape/wikineural
MultiNERD: https://github.com/Babelscape/multinerd
DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
CAp 2017 (Twitter data), Lopez et al., CAp 2017 challenge: Twitter Named Entity Recognition, 2017: http://cap2017.imag.fr/competition.html
HIPE-2022, Named Entity Recognition and Entity Linking in Multilingual Historical Documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data

Italian

KIND: https://github.com/dhfbk/kind
EVALITA: http://www.evalita.it/2009/tasks/entity
MEANTIME corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-it
PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-it
WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
WikiNEuRal: https://github.com/Babelscape/wikineural
MultiNERD: https://github.com/Babelscape/multinerd
DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation

Romanian

RONEC (Dumitrescu and Avram, Introducing RONEC - the Romanian Named Entity Corpus, LREC 2020). Paper: https://arxiv.org/pdf/1909.01247.pdf Data: https://github.com/dumitrescustefan/ronec
Romanian journalistic corpus (ROCO): http://metashare.elda.org/repository/browse/romanian-journalistic-corpus-roco/038baa80dc7311e5aa0b00237df3e3583781d7c0f2084057aa018a2d63d987e9/
Romanian balanced corpus (ROMBAC): http://metashare.elda.org/repository/browse/romanian-balanced-corpus-corpus-corpus-corpus-rombac/0a7dd85edc7311e5aaa0b00233e35873166666243524229dbubu

Greek

PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-el
PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-el

Hungarian

Hungarian Named Entity Corpora: http://rgai.inf.u-szeged.hu/index.php?lang=en&page=corpus_ne
hunNERwiki: http://hlt.sztaki.hu/resources/hunnerwiki.html
NYTK-NerKor: https://github.com/nytud/nytk-nerkor

Czech

Czech Named Entity Corpus: http://ufal.mff.cuni.cz/cnec
BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
CzEng 1.0 (parallel corpus: Czech-English): http://ufal.mff.cuni.cz/czeng/czeng10
PERO OCR NER (Czech historical OCR chronicles): https://github.com/roman-janik/poner https://dspace.vut.cz/items/6092e1b0-1b0-1b0-1b0-1b0-3d75-3d75-4451-8582-28582-28582-28573ac3044

Polish

Polish Sejm Corpus: http://clip.ipipan.waw.pl/psc
BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
Polish Coreference Corpus: http://zil.ipipan.waw.pl/polishcoreferencecorpus
WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
WikiNEuRal: https://github.com/Babelscape/wikineural
MultiNERD: https://github.com/Babelscape/multinerd
Corpus of Economic News (CEN Corpus): http://www.nlp.pwr.wroc.pl/narzedzia-i-zasoby/zasoby/cen
KPWr (Korpus Języka Polskiego Politechniki Wrocławskiej / Polish Corpus of Wrocław University of Technology): http://plwordnet.pwr.wroc.pl/index.php?option=com_content&view=article&id=35; http://plwordnet.pwr.wroc.pl/attachments/article/35/kpwr-1.1.7z (Broda et al., KPWr: Towards a Free Corpus of Polish, 2012)
NKJP: http://clip.ipipan.waw.pl/nationalcorpusofpolish?action=AttachFile&do=view&target=nkjp-podkorpusmilionowy-1.2.tar.gz

Croatian

hr500k 1.0: http://hdl.handle.net/11356/1183
BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
ReLDI-NormTagNER-hr (Croatian tweets): http://hdl.handle.net/11356/1170

Slovak

BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
Slovak Categorized News Corpus: https://nlp.web.tuke.sk/pages/categorizednews

Slovene

BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
ssj500k: http://www.slovenscina.eu/tehnologije/ucni-korpus; http://eng.slovenscina.eu/tehnologije/ucni-korpus; https://www.clarin.si/repository/xmlui/handle/11356/1029; note: for v2.2 see http://hdl.handle.net/11356/1210
Slovene news: http://zitnik.si/mediawiki/index.php?title=datasets#slovene_news; http://zitnik.si/mediawiki/images/7/7d/rtvslo_dec2011.tsv; http://zitnik.si/mediawiki/images/5/5e/rtvslo_dec2011_v2.tsv
Janes-Tag 2.0 (social media text): https://www.clarin.si/repository/xmlui/handle/11356/1123; see also: Fišer et al., The Janes project: language resources and tools for Slovene user generated content, 2018.

Ukrainian

BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
Ukrainian Brown NER corpus: https://github.com/lang-uk/ner-uk; http://lang.org.ua/en/corpora/

Serbian

SETimes.SR: http://hdl.handle.net/11356/1200
Serbian named entity evaluation corpus: http://www.korpus.matf.bg.ac.rs/srpneval/
ReLDI-NormTagNER-sr (Serbian tweets): http://hdl.handle.net/11356/1171

Bulgarian

BulTreeBank (BTB)

Icelandic

MIM-GOLD-NER (Ingólfsdóttir, Svanhvít Lilja, Sigurjón Þorsteinsson, and Hrafn Loftsson): http://www.malfong.is/index.php?pg=mim_gold_ner

Danish

DaNE: Hvingelby et al., DaNE: A Named Entity Resource for Danish: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.565.pdf
Danish PropBank (DPB): http://catalog.elra.info/en-us/repository/browse/elra-w0117/
Arboretum treebank: http://catalog.elra.info/en-us/repository/browse/elra-w0084/

Norwegian

Bjarte Johansen, Named-Entity Recognition for Norwegian, Proceedings of the 22nd Nordic Conference on Computational Linguistics, 2019 (https://www.aclweb.org/anthology/w19-6123.pdf). Data: https://github.com/ljos/navnkjenner
Fredrik Jørgensen et al., NorNE: Annotating Named Entities for Norwegian, 2019 (https://arxiv.org/pdf/1911.12146.pdf). Data: https://github.com/ltgoslo/norne/; https://www.nb.no/sprakbanken/show?serial=Oai%3anb.no%3ASBR-49

Swedish

Stockholm Internet Corpus: https://www.ling.su.se/english/nlp/corpora-and-resources/sic
SUC 3.0: https://spraakbanken.gu.se/eng/resource/suc3
Swedish manually annotated NER: https://github.com/klintan/swedish-ner-corpus/
Medical Wikipedia data (Almgren et al., Named Entity Recognition in Swedish Health Records with Character-Based Deep Bidirectional LSTMs, 2016): https://github.com/olofmogren/biomedical-ner-data-swedish
HIPE-2022, Named Entity Recognition and Entity Linking in Multilingual Historical Documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data

Finnish

A Finnish dataset for named entity recognition: https://github.com/mpsilfve/finer-data
Turku NER corpus: https://github.com/turkunlp/turku-ner-corpus
HIPE-2022, Named Entity Recognition and Entity Linking in Multilingual Historical Documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data

Estonian

Estonian NER corpus: https://metashare.ut.ee/repository/browse/estonian-ner-corpus/88D030C0ACDE11E2A6E2A6E2A6E4005056B40024F1DEF1DEF472ED254E77A8952E1003E1003D9F89F89F89F889F881E//

Latvian and Lithuanian

https://github.com/accurat-toolkit/tildener/tree/master/test (Pinnis, Latvian and Lithuanian Named Entity Recognition with TildeNER, LREC 2012)
Training data for LV Tagger: https://github.com/peterisp/lvtagger/tree/master/nertrainingdata

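Turkish

Küçük and Can, A Tweet Dataset Annotated for Named Entity Recognition and Stance Detection, 2019: https://github.com/dkucuk/tweet-dataset-ner-sd
Küçük et al., Named Entity Recognition on Turkish Tweets: http://optima.jrc.it/resources/2014_jrc_twitter_tr_ner-dataset.zip
English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset (http://arxiv.org/abs/1702.02363): https://data.mendeley.com/datasets/cdcztymf4k/1
Çoban et al., named entity recognition with FbNER, a new Facebook dataset for Turkish: https://ieeexplore.ieee.org/document/9598971 (data available on request for research purposes)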

Kazakh

KazNERD: https://arxiv.org/pdf/2111.13419.pdf, https://github.com/is2ai/kaznerd

Uyghur

Uyghur Named Entity Relation corpus: https://github.com/kaharjan/uynerel (Abiderexiti et al., Annotation schemes for constructing Uyghur named entity relation corpus, 2016)

Armenian

pioNER (gold-standard and silver-standard datasets): https://github.com/ispras-texterra/pioner (Ghukasyan et al., pioNER: Datasets and Baselines for Armenian Named Entity Recognition, 2018)
ArmTDP-NER: https://github.com/myavrum/armtdp-ner

Coptic

Coptic Universal Dependencies treebank: https://github.com/universaldependencies/ud_coptic-scriptorium/tree/dev (see also https://copticscriptorium.org/treebank.html). It contains 46,000 tokens annotated with nested (non-)named and Wikified entities.

Amharic

SAY corpus (see "Named Entity Recognition for Amharic Using Deep Learning"): https://github.com/geezorg/data/tree/master/amharic/tagged/nmsu-say; http://data.geez.org/

Arabic

AQMAR Arabic Wikipedia Named Entity Corpus: http://www.cs.cmu.edu/~ark/arabicner/
NE3L named entities Arabic corpus (Arabic, Chinese, Russian): http://catalog.elra.info/en-us/repository/browse/elra-w0078/
REFLEX Entity Translation (parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2009t11
ANERcorp: http://users.dsic.upv.es/~ybenajiba/downloads.html (see also: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html)
ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2004t09
ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2005t09
ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2006t06
ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/ldc2014t18
OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2013t19
DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
Wojood, a nested Arabic named entity corpus (2022): https://dlnlp.ai/st/wojood/ https://aclanthology.org/2022.lrec-1.387.pdf https://codalab.lisn.upsaclay.fr/competitions/11740

Persian

ArmanPersoNERCorpus: http://islrn.org/resources/399-379-640-828-6/; https://github.com/haniehp/persianner

Sindhi

SiNER: https://aclanthology.org/2020.lrec-1.361/, https://github.com/aliwazir/siner-dataset

Urdu

IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
UNER dataset (Khan et al., Named entity dataset for Urdu named entity recognition task, 2016). Available at http://www.iiu.edu.pk/?page_id=5181
MK-PUCIT: https://www.dropbox.com/sh/1ivw7ykm2tugg94/aab9t5wnn7fynespo7tjjjw8la; see: Kanwal et al., Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications, 2019

Indic

Naamapadam: a named entity recognition (NER) dataset for the 11 major Indian languages from two language families. https://research.ibm.com/publications/naamapadam-a-large-scale-named-entity-annotated-data-for-indic-languages https://ai4bharat.iitm.ac.in/naamapadam

Hindi

HiNER: https://github.com/cfiltnlp/hiner
Hindi health dataset: https://www.kaggle.com/aijain/hindi-health-dataset/home
FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5

Bengali

FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
Bengali-NER: https://github.com/rifat1493/bengali-ner, https://ieeexplore.ieee.org/document/8944804
NER-Bangla: https://github.com/misabic/ner-bangla-dataset, https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs179349

Telugu

NER_Telugu: https://github.com/anikethjr/ner_telugu
IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
Named Entity Annotated Corpora for Telugu: http://www.tdil-dc.in/index.php?option=com_download&task=showresourcedetails&toolid=982&lang=en

Maithili

First named entity recognizer in Maithili: resource creation and system development: https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs210051

Nepali

EverestNER: https://journals.flvc.org/flairs/article/view/130725, https://github.com/nowalab/everest-ner

Marathi

Named Entity Annotated Corpora for Marathi: http://www.tdil-dc.in/index.php?option=com_download&task=showresourcedetails&toolid=979&lang=en
L3Cube MahaNER: https://arxiv.org/abs/2204.06029 https://github.com/l3cube-pune/marathinlp

Punjabi

Named Entity Annotated Corpora for Punjabi: http://www.tdil-dc.in/index.php?option=com_download&task=showresourcedetails&toolid=980&lang=en

Tamil

FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/

Malayalam

FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/

Oriya/Odia

IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5

Sinhala/Sinhalese

LORELEI (LDC2018E57)

Thai

Thai-named-entity-recognition-data: https://github.com/pythainlp/thai-named-entity-recognition-data
Thai named entity corpora: http://pioneer.chula.ac.th/~awirote/resources/corpora-data.html; http://pioneer.chula.ac.th/~awirote/data-nutcha.zip; http://pioneer.chula.ac.th/~awirote/data-sasiwimon.zip; http://pioneer.chula.ac.th/~awirote/data-nattadaporn.zip
LST20: https://huggingface.co/datasets/lst20; https://arxiv.org/abs/2008.05055
Thai-NNER: https://github.com/vistec-ai/thai-nner, https://aclanthology.org/2022.findings-acl.116

Indonesian

IDENTIC: http://metashare.elda.org/repository/browse/entic/fed3fada7ef111e5aa3b001dd8b71c6666666666666abd4242f18ff1f18ffd9a9a9a95da9104cc/
https://github.com/yohanesgultom/nlp-experiments/tree/master/data/ner
Indonesia-NER: Syaifudin & Nurwidyantoro https://ieeexplore.ieee.org/document/7828656 https://github.com/yusufsyaifudin/indonesia-ner
IDNER-News-2k: a dataset of Indonesian news for named entity recognition tasks. Syaifudin & Nurwidyantoro https://dl.acm.org/doi/10.1145/3592854#fn8 https://github.com/khairunnisaor/idner-news-2k/
NERP and NER-Grit: IndoNLP/IndoNLU https://github.com/indonlp/indonlu/tree/master/dataset https://aclanthology.org/2020.aacl-main.85/

Vietnamese

VLSP 2016: http://vlsp.org.vn/resources-vlsp2016; https://github.com/undertheseanlp/ner
VLSP 2018: http://vlsp.org.vn/resources-vlsp2018; https://github.com/undertheseanlp/ner
PhoNER_COVID19: https://github.com/vinairesearch/phoner_covid19

Japanese

IREX: https://nlp.cs.nyu.edu/irex/package/
MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/
BCCWJ Basic NE corpus: https://sites.google.com/site/projectnextnlpne/en (Iwakura et al., Constructing a Japanese Basic Named Entity Corpus of Various Genres, NEWS 2016)
DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
Data from: Mai et al., An Empirical Study on Fine-Grained Named Entity Recognition, COLING 2018 (English, Japanese): https://fgner.alt.ai/duc/ene/testsets/comp/
Wikipedia NER corpus: https://github.com/stockmarkteam/ner-wikipedia-dataset
WikiANN: https://elisa-ie.github.io/wikiann/
GSD: conversion of the UD GSD dataset by Megagon Labs: https://github.com/megagonlabs/ud_japanese-gsd
KWDLC: Kyoto University Web Document Leads Corpus: https://nlp.ist.i.kyoto-u.ac.jp/en/index.php?kwdlc https://github.com/ku-nlp/kwdlc

Korean

National Institute of Korean Language (ROK) NER corpus: https://github.com/digitalprk/koreaner; https://ithub.korean.go.kr/user/total/referenceview.do?boardseq=5&articleseq=118&boardgb=t&isinsupd&boardType=corpus
KMOU NER: https://github.com/kmounlp/ner
Korean Language Understanding Evaluation (KLUE) NER: https://klue-benchmark.com/tasks/69/overview/description
https://github.com/songys/entity
HLCT 2016 corpus, with updates: https://github.com/machinereading/koreannercorpus

Chinese

ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2004t09
ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2005t09
ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2006t06
OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2013t19
MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/
REFLEX Entity Translation (parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2009t11
NE3L named entities Chinese corpus (Arabic, Chinese, Russian): http://catalogue.elra.info/en-us/repository/browse/elra-w0079/
Original Phrase Data Collation I Chinese (Named Entity): http://catalog.elra.info/en-us/repository/browse/elra-w0045_04/
Original Phrase Data Collation II Chinese (Named Entity): http://catalog.elra.info/en-us/repository/browse/elra-w0045_08/
ERE DEFT corpora (parallel corpus: English, Chinese): Mott et al., A Parallel Chinese-English Entities, Relations and Events Corpus, 2016 (LDC2015E78, LDC2014E114)
Chinese Weibo: named and nominal mention annotations for Chinese microblogs (Weibo): https://github.com/hltcoe/golden-horse
Chinese EduNER: a 2023 dataset for the education domain: https://link.springer.com/article/10.1007/s00521-023-08635-5
Chinese aerospace NER: https://www.nature.com/articles/s41598-023-50705-0
SciCN: a Chinese dataset and benchmark for scientific information extraction
EMP NER: Historical Chinese https://aclanthology.org/2024.lrec-main.35.pdf https://gitlab.com/enpchina/ENP-NER

Tagalog

TLUnified: https://arxiv.org/abs/2311.07161 https://huggingface.co/datasets/ljvmiranda921/tlunified-ner

Russian

BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
NE3L named entities Russian corpus (Arabic, Chinese, Russian): https://catalog.elra.info/en-us/repository/browse/ELRA-W0080/
WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
WikiNEuRal: https://github.com/Babelscape/wikineural
MultiNERD: https://github.com/Babelscape/multinerd
factRuEval-2016: https://github.com/dialogue-evaluation/factRuEval-2016
RuREBus 2020 (Russian Relation Extraction for Business) corpus https://github.com/dialogue-evaluation/RuREBus

Yoruba

GV-Yorùbá-NER. Data: https://github.com/ajesujoba/YorubaTwi-Embedding/tree/master/Yoruba/Yor%C3%B9b%C3%A1-NER ; Data statement: https://drive.google.com/file/d/177xu-O2FTJ7VJQ-0ohCWjVd1qu61Tvml/view Paper: Jesujoba O Alabi, Kwabena Amponsah-Kaakyire, David I Adelani, and Cristina Espãna-Bonet. Massive vs. curated word embeddings for low-resourced languages. the case of Yorùbá and Twi. In LREC, 2020 (https://arxiv.org/abs/1912.02481)

Swahili

Helsinki Corpus of Swahili 2.0 (HCS 2.0) Annotated Version: http://metashare.csc.fi/repository/browse/helsinki-corpus-of-swahili-20-hcs-20-annotated-version/232c1910b9eb11e5915e005056be118e59fb2e920f1f4c0cafc94915fc6f5cac/ See: Shah et al., 2010. SYNERGY: A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation

Igbo

IgboNER: https://aclanthology.org/2022.lrec-1.547/ https://github.com/Chiamakac/IgboNER-Models later updated in https://openreview.net/pdf?id=tHUS9-vmUfC from https://sites.google.com/view/africanlp2023/home

isiNdebele

NCHLT isiNdebele Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/306

Xhosa

NCHLT isiXhosa Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/312

Zulu

NCHLT isiZulu Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/319

Sepedi

NCHLT Sepedi Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/328

Sesotho

NCHLT Sesotho Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/334

Setswana

NCHLT Setswana Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/341

Siswati

NCHLT Siswati Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/346

Venda

NCHLT Tshivenda Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/355
MPHAYANER: Named Entity Recognition for Tshivenḓa: https://openreview.net/pdf?id=0nneuL3bSLt https://github.com/rendanim/MphayaNER from https://sites.google.com/view/africanlp2023/home

Xitsonga

NCHLT Xitsonga Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/362

Latin

Herodotos Project: https://github.com/alexerdmann/Herodotos_Project_Annotation

A long list can be found here: http://damien.nouvels.net/resourcesen/corpora.html

References

[Alvarado et al., 2015] Alvarado, Julio Cesar Salinas, Karin Verspoor, and Timothy Baldwin. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015, pp. 84-90. 2015. Accessed: August 2018.

[Balasuriya et al., 2009] Balasuriya, Dominic, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran. Named entity recognition in wikipedia. In Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pp. 10-18. Association for Computational Linguistics, 2009

[Bos et al., 2017] Bos, Johan, Valerio Basile, Kilian Evang, Noortje J. Venhuizen, and Johannes Bjerva. The Groningen meaning bank. In Handbook of linguistic annotation, pp. 463-496. Springer, Dordrecht, 2017.

[Derczynski et al., 2016] Derczynski, Leon, Kalina Bontcheva, and Ian Roberts. Broad twitter corpus: A diverse named entity recognition resource. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1169-1179. 2016. Available at: https://github.com/GateNLP/broad_twitter_corpus Accessed: August 2018.

[Derczynski et al., 2017] Leon Derczynski, Eric Nichols, Marieke van Erp, Nut Limsopatham (2017) Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition, in Proceedings of the 3rd Workshop on Noisy, User-generated Text. Available at: https://noisy-text.github.io/2017/emerging-rare-entities.html

[DSTL, 2017] Defence Science and Technology Laboratory. 2017. Relationship and Entity Extraction Evaluation Dataset. https://github.com/dstl/re3d. Accessed: January 2018.

[Grishman and Sundheim, 1996] Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: A brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.

[Karimi et al., 2015] Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. Cadec: A corpus of adverse drug event annotations. Journal of biomedical informatics, 55:73-81. Available at https://data.csiro.au Accessed: November 2017.

[Lim et al., 2017] Lim, Swee Kiat, Aldrian Obaja Muis, Wei Lu, and Chen Hui Ong. MalwareTextDB: A database for annotated malware articles. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1557-1567. 2017.

[Liu et al., 2013a] Jingjing Liu, Panupong Pasupat, Scott Cyphers, and Jim Glass. 2013. Asgard: A portable architecture for multilingual dialogue systems. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8386-8390. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/restaurant/ Accessed: January 2018

[Liu et al., 2013b] Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers, and Jim Glass. 2013. Query understanding enhanced by hierarchical parsing structures. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 72-77. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/movie/ We used the trivia10k13 portion. Accessed: January 2018

[NIST, 1999 IE-ER] NIST. 1999. Information Extraction - Entity Recognition Evaluation. http://www.nist.gov/speech/tests/ieer/er_99/er_99.htm. The newswire development test data only (included in the NLTK package).

[Ohta et al., 2012] Tomoko Ohta, Sampo Pyysalo, Jun'ichi Tsujii and Sophia Ananiadou. 2012. Open-domain Anatomical Entity Mention Detection. In Proceedings of ACL 2012 Workshop on Detecting Structure in Scholarly Discourse (DSSD), pp. 27-36. Available at: http://www.nactem.ac.uk/anatomy/ and https://github.com/openbiocorpora/anem Accessed: November 2017.

[Ritter et al., 2011] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524-1534, Edinburgh, Scotland, UK, July. Association for Computational Linguistics. Accessed January 2018.

[Sang and Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.

[Stubbs et al., 2015] Amber Stubbs and Ozlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of biomedical informatics, 58:S20-S29. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.

[Uzuner et al., 2007] Ozlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association, 14(5):550-563. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.

[Weischedel and Brunstein, 2005] Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia.

[Weischedel et al., 2013] Weischedel, Ralph, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue et al. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA (2013).

[Zeldes, 2017] Amir Zeldes. 2017. The GUM corpus: creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581-612. Available at https://github.com/amir-zeldes/gum/tree/master/coref/tsv/ Accessed: November 2017.


Additional information

Version: 1.0.0
Type: Other source code
Updated: 2025-04-17
Size: 2.39 MB
Source: GitHub