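Entity Recognition Datasets
This repository contains datasets from several domains annotated with a variety of entity types, useful for entity recognition and named entity recognition (NER) tasks.
Note: I am no longer actively adding datasets to this list, and many more NER datasets have probably appeared since 2020. However, I am happy to add more datasets via issues or pull requests.
Datasets for English NER
The table below lists datasets for English-language entity recognition (for NER datasets in other languages, see below). The data directory contains information on where to obtain the datasets that could not be shared due to licensing restrictions, as well as code to convert them to the CoNLL 2003 format. Links to NER corpora in other languages are also listed below.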
| Dataset | Domain | License | Reference | Availability |
|---|---|---|---|---|
| CoNLL 2003 | News | DUA | Sang and Meulder, 2003 | Easy to find |
| NIST-IEER | News | None | NIST 1999 IE-ER | NLTK data |
| MUC-6 | News | LDC | Grishman and Sundheim, 1996 | LDC 2003T13 |
| OntoNotes 5 | Various | LDC | Weischedel et al., 2013 | LDC 2013T19 |
| BBN | Various | LDC | Weischedel and Brunstein, 2005 | LDC 2005T33 |
| GMB-1.0.0 | Various | None | Bos et al., 2017 | http://gmb.let.rug.nl/data.php |
| GUM-3.1.0 | Wiki | Several (*2) | Zeldes, 2017 | ✔ Included here |
| wikigold | Wikipedia | CC-BY 4.0 | Balasuriya et al., 2009 | ✔ Included here |
| Ritter | Twitter | None | Ritter et al., 2011 | Without splits; with train/test/dev splits |
| BTC | Twitter | CC-BY 4.0 | Derczynski et al., 2016 | ✔ Included here |
| WNUT17 | Social media | CC-BY 4.0 | Derczynski et al., 2017 | ✔ Included here |
| i2b2-2006 | Medical | DUA | Uzuner et al., 2007 | http://www.i2b2.org |
| i2b2-2014 | Medical | DUA | Stubbs et al., 2015 | http://www.i2b2.org |
| CADEC | Medical | CSIRO | Karimi et al., 2015 | http://data.csiro.au/ |
| AnEM | Anatomical | CC-SA 3.0 | Ohta et al., 2012 | ✔ Included here |
| MITRestaurant | Queries | None | Liu et al., 2013a | http://groups.csail.mit.edu/sls/ |
| MITMovie | Queries | None | Liu et al., 2013b | http://groups.csail.mit.edu/sls/ |
| MalwareTextDB | Malware | None | Lim et al., 2017 | http://www.statnlp.org/ |
| re3d | Defense | Several (*1) | DSTL, 2017 | ✔ Included here |
| SEC-filings | Finance | CC-BY 3.0 | Alvarado et al., 2015 | ✔ Included here |
| Assembly | Robotics | X | Costa et al., 2017 | X |
| WikiNEuRal | Wikipedia | CC BY-SA-NC 4.0 | Tedeschi et al., 2021 | https://github.com/babelscape/wikineural |
| MultiNERD | Wikipedia | CC BY-SA-NC 4.0 | Tedeschi et al., 2022 | https://github.com/babelscape/multinerd |
| HIPE-2022 | Historical | CC BY-SA-NC 4.0 | Ehrmann et al., 2022 | https://github.com/hipe-eval/hipe-2022-data |
| Music-NER | Music | MIT | Epure and Hennequin, 2023 | https://github.com/deezer/music-ner-eacl2023 |
| WIESP2022-NER | Astrophysics | CC BY-SA-NC 4.0 | Grezes et al., 2022 | https://huggingface.co/datasets/adsabs/wiesp2022-ner |
| NNE | News | CC 4.0 / LDC | Ringland et al., 2019 | https://github.com/nickyringland/nested_named_entities |
| Worldwide | News | CC BY-SA-NC 4.0 | Shan et al., 2023 | https://github.com/stanfordnlp/en-worldwide-newswire https://arxiv.org/abs/2404.13465 |
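
Since the data directory converts datasets to the CoNLL 2003 format, a small reader can help when inspecting the converted files. The following is a minimal sketch, not the repository's own conversion code: it assumes whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, `-DOCSTART-` lines at document boundaries, and `B-`/`I-`/`O` tags. The function names `read_conll` and `extract_entities` and the sample sentence are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one sentence at a time from CoNLL 2003-style lines."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        # Assumption: token is the first column, NER tag the last.
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity text, entity type) spans."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    current = None
    # The trailing ("", "O") sentinel flushes a span that ends the sentence.
    for token, tag in sentence + [("", "O")]:
        prefix, _, label = tag.partition("-")
        # A span ends at "O", at an explicit "B-", or when the type changes.
        if tag == "O" or prefix == "B" or label != current:
            if tokens:
                entities.append((" ".join(tokens), current))
            tokens, current = [], None
        if tag != "O":
            tokens.append(token)
            current = label
    return entities


# Hypothetical sample in the four-column CoNLL 2003 layout.
SAMPLE = """\
-DOCSTART- -X- -X- O

Ada NNP B-NP B-PER
Lovelace NNP I-NP I-PER
visited VBD B-VP O
London NNP B-NP B-LOC
. . O O
"""

for sentence in read_conll(SAMPLE.splitlines()):
    print(extract_entities(sentence))  # [('Ada Lovelace', 'PER'), ('London', 'LOC')]
```

Because a span is also closed whenever the entity type changes, the sketch accepts both tagging variants in which `B-` marks every entity start and variants in which `B-` only separates adjacent entities of the same type.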
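Licenses
License notes:
(1) re3d ("Relationship and Entity Extraction Evaluation Dataset") contains several datasets with different licenses. These are:
- CC-BY-SA 3.0 (Wikipedia dataset)
- CC BY-NC 3.0 (BBC_Online dataset)
- CC BY 3.0 AU (Australian_Department_of_Foreign_Affairs dataset)
- Public domain (US_State_Department dataset, CENTCOM dataset)
- UK Open Government Licence v3.0 (UK_Government dataset)
- Delegation_of_the_European_Union_to_Syria: see https://eeas.europa.eu/delegations/syria/8157/legal-notice_en
(2) GUM 3.1.0 includes three datasets with the licenses CC-BY 3.0, CC-BY-SA 3.0 and CC-BY-NC-SA 3.0. The annotations are licensed under CC-BY 4.0.
More detailed license information for each dataset can be found in the respective subdirectories.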
To add later:
- Tabassum et al., Code and Named Entity Recognition in StackOverflow, 2020: https://cocoxu.github.io/publications/acl2020_stackoverflow_ner.pdf
- LitBank: https://github.com/dbamman/litbank (Bamman, Popat and Shen, An Annotated Dataset of Literary Entities, 2019)
- NNE: A Dataset for Nested Named Entity Recognition in English Newswire, 2019: https://github.com/nickyringland/nested_named_entities
- Mars Target Encyclopedia - LPSC Abstract Labeled Data Set: https://zenodo.org/record/1048419
- Best Buy e-commerce NER dataset: https://www.kaggle.com/dataturks/best-buy-ecommerce-ner-dataset/home
- Resume entities for NER: https://www.kaggle.com/dataturks/resume-entities-for-ner/home
- Few-NERD: A Few-shot Named Entity Recognition Dataset: https://aclanthology.org/2021.acl-long.248/
NER datasets for other languages
Lexical named entity resources
- HeiNER: http://heiner.cl.uni-heidelberg.de/index.shtml
- NECKAR: https://event.ifi.uni-heidelberg.de/?page_id=532#wikidata_ne_dataset
Code-switching
- English-Spanish tweets (CALCS 2018): https://code-switching.github.io/2018/; https://code-switching.github.io/2018/files/spa-eng/release.zip; http://www.aclweb.org/anthology/w18-3219
- Arabic-Egyptian tweets (CALCS 2018): https://code-switching.github.io/2018/; https://code-switching.github.io/2018/files/msa-egy/arabictweetstokenaskigner.zip; http://www.aclweb.org/anthology/w18-3219
- Hindi-English social media text: https://github.com/silentflame/named-entity-recognition; http://aclweb.org/anthology/w18-2405
- EMNLP 2014 shared task - code-switched tweets (Nepali-English, Spanish-English, Mandarin-English, Arabic-Arabic dialects): http://emnlp2014.org/workshops/codeswitch/call.html
German
- CoNLL 2003 (English, German): https://www.clips.uantwerpen.be/conll2003/ner/
- GermEval 2014: https://sites.google.com/site/germeval2014ner/data
- Tübingen Treebank of Written German (TüBa-D/Z): http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-dz.html
- Europeana Newspapers (Dutch, French, German): https://github.com/europeananewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- German Europarl transcripts (subset): https://nlpado.de/~sebastian/software/ner_german.shtml
- Named Entity Model for German, Politics (NEMGP): https://www.thomas-zastrow.de/nlp/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- DFKI SmartData Corpus (geo-entities): https://dfki-lt-re-group.bitbucket.io/smartdata-corpus/ (Schiersch et al., A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events)
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- Elena Leitner, Georg Rehm and Julián Moreno-Schneider, Fine-grained Named Entity Recognition in Legal Documents. Data: https://github.com/elenanereiss/legal-entity-recognition
- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Dutch
- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
- Europeana Newspapers (Dutch, French, German): https://github.com/europeananewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- MEANTIME corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- Dutch parliamentary documents 2015-2016, from 1848 onwards.
- SoNaR-1 - Desmet and Hoste, Fine-grained Dutch named entity recognition, 2014 (class hierarchy)
- Corpus Books and Corpus Gutenberg Dutch: http://blog.namescape.nl/?page_id=85; http://portal.clarin.nl/node/1940
Afrikaans
- NCHLT Afrikaans Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/299
Spanish
- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
- AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en
- DEFT Spanish Treebank (LDC2018T01): https://catalog.ldc.upenn.edu/ldc2018t01
- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-es
- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-es
- MEANTIME corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/ldc2014t18
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- http://www.grupolys.org/~marcos/pub/lrec16.tar.bz2 (used in "Incorporating Lexico-semantic Heuristics into Coreference Resolution Sieves for Named Entity Recognition at Document-level")
- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2
- DrugSemantics gold standard (Moreno et al., DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics, 2017): https://data.mendeley.com/datasets/fwc7jrc5jr/1
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- CANTEMIST (Cancer Text Mining Shared Task - tumor named entity recognition) - named entity recognition of a key concept type related to cancer, namely tumor morphology, in Spanish medical texts: https://temu.bsc.es/cantemist/
Catalan
- AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en
Galician
- Galician NER corpus: https://gramatica.usc.es/~marcos/resources/corpus_gal_nec.txt.gz
- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2
Basque
- Basque Named Entities Corpus (EIEC): http://ixa.eus/node/4486?language=en
- Basque Disambiguated Named Entities Corpus (EDIEC): http://ixa.si.ehu.es/node/4485?language=en
- Egunkaria 2000 corpus (383 news texts), http://qtleap.eu/wp-content/uploads/2014/04/qtleap-2013-d5.1.pdf
Portuguese
- HAREM: https://www.linguateca.pt/aval_conjunta/harem/harem_ing.html
- CINTIL corpus: http://cintil.ul.pt/cintilfeatures.html#corpus
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2
- Bosque 8.0 (EAGLES format): https://gramatica.usc.es/~marcos/resources/corpora_flpt.tgz
- LeNER-Br (Brazilian legal documents): https://cic.unb.br/~teodecampos/lener-br/
- Paramopama: a Brazilian-Portuguese corpus for named entity recognition
French
- ESTER: http://catalogue.elra.info/en-us/repository/browse/elra-s0241/
- ESTER 2: http://catalogue.elra.info/en-us/repository/browse/elra-s0338/
- ETAPE: http://catalogue.elra.info/en-us/repository/browse/elra-e0046/
- Europeana Newspapers (Dutch, French, German): https://github.com/europeananewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- Quaero French Medical Corpus: https://quaerofrenchmed.limsi.fr/
- Quaero Broadcast News Extended Named Entity corpus: http://catalog.elra.info/en-us/repository/browse/elra-s0349/
- Quaero Old Press Extended Named Entity corpus: http://catalog.elra.info/en-us/repository/browse/elra-w0073/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNER-fr-gold: https://arxiv.org/abs/2411.00030 https://huggingface.co/datasets/danrun/wikiner-fr-gold
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- CAp 2017 (Twitter data), Lopez et al., CAp 2017 challenge: Twitter Named Entity Recognition, 2017: http://cap2017.imag.fr/competition.html
- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Italian
- KIND: https://github.com/dhfbk/kind
- EVALITA: http://www.evalita.it/2009/tasks/entity
- MEANTIME corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-it
- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-it
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
Romanian
- RONEC (Dumitrescu and Avram, Introducing RONEC - the Romanian Named Entity Corpus. LREC 2020). Paper: https://arxiv.org/pdf/1909.01247.pdf Data: https://github.com/dumitrescustefan/ronec
- Romanian journalistic corpus (ROCO): http://metashare.elda.org/repository/browse/romanian-journalistic-corpus-roco/038baa80dc7311e5aa0b00237df3e3583781d7c0f2084057aa018a2d63d987e9/
- Romanian Balanced Corpus (ROMBAC): http://metashare.elda.org/repository/browse/romanian-balanced-corpus-corpus-corpus-corpus-rombac/0a7dd85edc7311e5aaa0b00233e35873166666243524229dbubu
Greek
- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-el
- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-el
Hungarian
- Hungarian Named Entity Corpora: http://rgai.inf.u-szeged.hu/index.php?lang=en&page=corpus_ne
- hunNERwiki: http://hlt.sztaki.hu/resources/hunnerwiki.html
- NYTK-NerKor: https://github.com/nytud/nytk-nerkor
Czech
- Czech Named Entity Corpus: http://ufal.mff.cuni.cz/cnec
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- CzEng 1.0 (parallel corpus: Czech-English): http://ufal.mff.cuni.cz/czeng/czeng10
- Pero OCR NER (Czech historical chronicles, OCR): https://github.com/roman-janik/poner https://dspace.vut.cz/items/6092e1b0-1b0-1b0-1b0-1b0-3d75-3d75-4451-8582-28582-28582-28573ac3044
Polish
- Polish Sejm Corpus: http://clip.ipipan.waw.pl/psc
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- Polish Coreference Corpus: http://zil.ipipan.waw.pl/polishcoreferencecorpus
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- Corpus of Economic News (CEN Corpus): http://www.nlp.pwr.wroc.pl/narzedzia-i-zasoby/zasoby/cen
- KPWr (Korpus Języka Polskiego Politechniki Wrocławskiej / Polish Corpus of Wrocław University of Technology): http://plwordnet.pwr.wroc.pl/index.php?option=com_content&view=article&id=35; http://plwordnet.pwr.wroc.pl/attachments/article/35/kpwr-1.1.7z (Broda et al., KPWr: Towards a Free Corpus of Polish, 2012)
- NKJP: http://clip.ipipan.waw.pl/nationalcorpusofpolish?action=AttachFile&do=view&target=nkjp-podkorpusmilionowy-1.2.tar.gz
Croatian
- hr500k 1.0: http://hdl.handle.net/11356/1183
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- ReLDI-NormTagNER-hr (Croatian tweets): http://hdl.handle.net/11356/1170
Slovak
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- Slovak Categorized News Corpus: https://nlp.web.tuke.sk/pages/categorizednews
Slovene
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- ssj500k: http://www.slovenscina.eu/tehnologije/ucni-korpus; http://eng.slovenscina.eu/tehnologije/ucni-korpus; https://www.clarin.si/repository/xmlui/handle/11356/1029; note: for v2.2 see http://hdl.handle.net/11356/1210
- Slovene news: http://zitnik.si/mediawiki/index.php?title=datasets#slovene_news; http://zitnik.si/mediawiki/images/7/7d/rtvslo_dec2011.tsv; http://zitnik.si/mediawiki/images/5/5e/rtvslo_dec2011_v2.tsv
- Janes-Tag 2.0 (social media text): https://www.clarin.si/repository/xmlui/handle/11356/1123; see also: Fišer et al., The Janes project: language resources and tools for Slovene user generated content, 2018.
Ukrainian
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- Ukrainian Brown NER corpus: https://github.com/lang-uk/ner-uk; http://lang.org.ua/en/corpora/
Serbian
- SETimes.SR: http://hdl.handle.net/11356/1200
- Named entity evaluation corpus for Serbian: http://www.korpus.matf.bg.ac.rs/srpneval/
- ReLDI-NormTagNER-sr (Serbian tweets): http://hdl.handle.net/11356/1171
Bulgarian
Icelandic
- MIM-GOLD-NER (Ingólfsdóttir, Svanhvít Lilja, Sigurjón Þorsteinsson and Hrafn Loftsson): http://www.malfong.is/index.php?pg=mim_gold_ner
Danish
- DaNE: Hvingelby et al., [DaNE: A Named Entity Resource for Danish.](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.565.pdf)
- Danish PropBank (DPB): http://catalog.elra.info/en-us/repository/browse/elra-w0117/
- Arboretum Treebank: http://catalog.elra.info/en-us/repository/browse/elra-w0084/
Norwegian
- Bjarte Johansen, Named-Entity Recognition for Norwegian, in Proceedings of the 22nd Nordic Conference on Computational Linguistics, 2019 (https://www.aclweb.org/anthology/w19-6123.pdf). Data: https://github.com/ljos/navnkjenner
- Fredrik Jørgensen et al., NorNE: Annotating Named Entities for Norwegian, 2019 (https://arxiv.org/pdf/1911.12146.pdf). Data: https://github.com/ltgoslo/norne/; https://www.nb.no/sprakbanken/show?serial=Oai%3anb.no%3ASBR-49
Swedish
- Stockholm Internet Corpus: https://www.ling.su.se/english/nlp/corpora-and-resources/sic
- SUC 3.0: https://spraakbanken.gu.se/eng/resource/suc3
- Swedish manually annotated NER: https://github.com/klintan/swedish-ner-corpus/
- Medical Wikipedia data (Almgren et al., Named Entity Recognition in Swedish Health Records with Character-Based Deep Bidirectional LSTMs, 2016): https://github.com/olofmogren/biomedical-ner-data-swedish
- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Finnish
- Finnish dataset for named entity recognition: https://github.com/mpsilfve/finer-data
- Turku NER corpus: https://github.com/turkunlp/turku-ner-corpus
- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Estonian
- Estonian NER corpus: https://metashare.ut.ee/repository/browse/estonian-ner-corpus/88D030C0ACDE11E2A6E2A6E2A6E4005056B40024F1DEF1DEF472ED254E77A8952E1003E1003D9F89F89F89F889F881E//
Latvian and Lithuanian
- https://github.com/accurat-toolkit/tildener/tree/master/test (Pinnis, Latvian and Lithuanian Named Entity Recognition with TildeNER, LREC 2012)
- Training data of the LV Tagger: https://github.com/peterisp/lvtagger/tree/master/nertrainingdata
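Turkish
- Küçük and Can, A Tweet Dataset Annotated for Named Entity Recognition and Stance Detection, 2019: https://github.com/dkucuk/tweet-dataset-ner-sd
- Küçük et al., Named Entity Recognition on Turkish Tweets: http://optima.jrc.it/resources/2014_jrc_twitter_tr_ner-dataset.zip
- English/Turkish Wikipedia named-entity recognition and text categorization dataset (http://arxiv.org/abs/1702.02363): https://data.mendeley.com/datasets/cdcztymf4k/1
- Çoban et al., Named entity recognition over FbNER: a new Facebook dataset for Turkish: https://ieeexplore.ieee.org/document/9598971 (data available on request for research purposes)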
Kazakh
- KazNERD: https://arxiv.org/pdf/2111.13419.pdf, https://github.com/is2ai/kaznerd
Uyghur
- Uyghur Named Entity Relation corpus: https://github.com/kaharjan/uynerel (Abiderexiti et al., Annotation schemes for constructing Uyghur named entity relation corpus, 2016)
Armenian
- pioNER (gold-standard and silver-standard datasets): https://github.com/ispras-texterra/pioner (Ghukasyan et al., pioNER: Datasets and Baselines for Armenian Named Entity Recognition, 2018)
- ArmTDP-NER: https://github.com/myavrum/armtdp-ner
Coptic
- Coptic Universal Dependencies treebank: https://github.com/universaldependencies/ud_coptic-scriptorium/tree/dev (see also https://copticscriptorium.org/treebank.html). It contains 46,000 tokens with nested (non-)named and Wikified entities.
Amharic
- SAY corpus (see "Named Entity Recognition for Amharic Using Deep Learning"): https://github.com/geezorg/data/tree/master/amharic/tagged/nmsu-say; http://data.geez.org/
Arabic
- AQMAR Arabic Wikipedia Named Entity Corpus: http://www.cs.cmu.edu/~ark/arabicner/
- NE3L named entities Arabic corpus (Arabic, Chinese, Russian): http://catalog.elra.info/en-us/repository/browse/elra-w0078/
- REFLEX Entity Translation (parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2009t11
- ANERCorp: http://users.dsic.upv.es/~ybenajiba/downloads.html (see also: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html)
- ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2004t09
- ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2005t09
- ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2006t06
- ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/ldc2014t18
- OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2013t19
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- Wojood - 2022 nested Arabic named entity corpus: https://dlnlp.ai/st/wojood/ https://aclanthology.org/2022.lrec-1.387.pdf https://codalab.lisn.upsaclay.fr/competitions/11740
Persian
- ArmanPersoNERCorpus: http://islrn.org/resources/399-379-640-828-6/; https://github.com/haniehp/persianner
Sindhi
- SiNER: https://aclanthology.org/2020.lrec-1.361/, https://github.com/aliwazir/siner-dataset
Urdu
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
- UNER dataset (Khan et al., Named Entity Dataset for Urdu Named Entity Recognition Task, 2016). Available at http://www.iiu.edu.pk/?page_id=5181
- MK-PUCIT: https://www.dropbox.com/sh/1ivw7ykm2tugg94/aab9t5wnn7fynespo7tjjjw8la; see: Kanwal et al., Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications, 2019
Indic
- Naamapadam: a named entity recognition (NER) dataset for 11 major Indian languages from two language families. https://research.ibm.com/publications/naamapadam-a-large-scale-named-entity-annotated-data-for-indic-languages https://ai4bharat.iitm.ac.in/naamapadam
Hindi
- HiNER: https://github.com/cfiltnlp/hiner
- Hindi health dataset: https://www.kaggle.com/aijain/hindi-health-dataset/home
- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
Bengali
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
- Bengali NER: https://github.com/rifat1493/bengali-ner, https://ieeexplore.ieee.org/document/8944804
- NER-Bangla: https://github.com/misabic/ner-bangla-dataset, https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs179349
Telugu
- NER_Telugu: https://github.com/anikethjr/ner_telugu
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
- Named-entity annotated corpora for Telugu: http://www.tdil-dc.in/index.php?option=com_download&task=showresourcedetails&toolid=982&lang=en
Maithili
- The first named entity recognizer in Maithili: resource creation and system development: https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs210051
Nepali
- EverestNER: https://journals.flvc.org/flairs/article/view/130725, https://github.com/nowalab/everest-ner
Marathi
- Named-entity annotated corpora for Marathi: http://www.tdil-dc.in/index.php?option=com_download&task=showresourcedetails&toolid=979&lang=en
- L3Cube-MahaNER: https://arxiv.org/abs/2204.06029 https://github.com/l3cube-pune/marathinlp
Punjabi
- Named-entity annotated corpora for Punjabi: http://www.tdil-dc.in/index.php?option=com_download&task=showresourcedetails&toolid=980&lang=en
Tamil
- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
Malayalam
- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
Oriya/Odia
- IJCNLP 2008 SSEAL:http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
Sinhala/Sinhalese
Thai
- thai-named-entity-recognition-data: https://github.com/pythainlp/thai-named-entity-recognition-data
- Thai Named Entity Corpus: http://pioneer.chula.ac.th/~awirote/resources/corpora-data.html; http://pioneer.chula.ac.th/~awirote/data-nutcha.zip; http://pioneer.chula.ac.th/~awirote/data-sasiwimon.zip; http://pioneer.chula.ac.th/~awirote/data-nattadaporn.zip
- LST20: https://huggingface.co/datasets/lst20; https://arxiv.org/abs/2008.05055
- Thai-NNER: https://github.com/vistec-ai/thai-nner, https://aclanthology.org/2022.findings-acl.116
Indonesian
- IDENTIC: http://metashare.elda.org/repository/browse/entic/fed3fada7ef111e5aa3b001dd8b71c6666666666666abd4242f18ff1f18ffd9a9a9a95da9104cc/
- https://github.com/yohanesgultom/nlp-experiments/tree/master/data/ner
- Indonesia-NER: Syaifudin & Nurwidyantoro https://ieeexplore.ieee.org/document/7828656 https://github.com/yusufsyaifudin/indonesia-ner
- IDNER-News-2K: a dataset of Indonesian news for the named entity recognition task. Syaifudin & Nurwidyantoro https://dl.acm.org/doi/10.1145/3592854#fn8 https://github.com/khairunnisaor/idner-news-2k/
- NERP and NER-Grit: IndoNLP/IndoNLU https://github.com/indonlp/indonlu/tree/master/dataset https://aclanthology.org/2020.aacl-main.85/
Vietnamese
- VLSP 2016: http://vlsp.org.vn/resources-vlsp2016; https://github.com/undertheseanlp/ner
- VLSP 2018: http://vlsp.org.vn/resources-vlsp2018; https://github.com/undertheseanlp/ner
- PhoNER_COVID19: https://github.com/vinairesearch/phoner_covid19
Japanese
- IREX: https://nlp.cs.nyu.edu/irex/package/
- MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/
- BCCWJ Basic NE corpus: https://sites.google.com/site/projectnextnlpne/en (Iwakura et al., Constructing a Japanese Basic Named Entity Corpus of Various Genres, NEWS 2016)
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- Data from: Mai et al., An Empirical Study on Fine-Grained Named Entity Recognition, COLING 2018 (English, Japanese): https://fgner.alt.ai/duc/ene/testsets/comp/
- Wikipedia NER corpus: https://github.com/stockmarkteam/ner-wikipedia-dataset
- WikiANN: https://elisa-ie.github.io/wikiann/
- GSD: the UD GSD dataset as converted by Megagon Labs: https://github.com/megagonlabs/ud_japanese-gsd
- KWDLC: Kyoto University Web Document Leads Corpus: https://nlp.ist.i.kyoto-u.ac.jp/en/index.php?kwdlc https://github.com/ku-nlp/kwdlc
Korean
- National Institute of Korean Language (ROK) - NER corpus: https://github.com/digitalprk/koreaner; https://ithub.korean.go.kr/user/total/referenceview.do?boardseq=5&articleseq=118&boardgb=t&isinsupd&boardType=corpus
- KMOU NER: https://github.com/kmounlp/ner
- Korean Language Understanding Evaluation - KLUE NER: https://klue-benchmark.com/tasks/69/overview/description
- https://github.com/songys/entity
- HLCT 2016 corpus, with updates: https://github.com/machinereading/koreannercorpus
Chinese
- ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2004t09
- ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2005t09
- ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2006t06
- OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2013t19
- MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/
- REFLEX Entity Translation (parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2009t11
- NE3L named entities Chinese corpus (Arabic, Chinese, Russian): http://catalogue.elra.info/en-us/repository/browse/elra-w0079/
- Original phrase data collection I, Chinese (named entities): http://catalog.elra.info/en-us/repository/browse/elra-w0045_04/
- Original phrase data collection II, Chinese (named entities): http://catalog.elra.info/en-us/repository/browse/elra-w0045_08/
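- ERE DEFT corpora (parallel corpus: English, Chinese): Mott et al., Parallel Chinese-English Entities, Relations and Events Corpora, 2016 (LDC2015E78, LDC2014E114)
- Chinese Weibo: annotation of named and nominal mentions in Chinese (Weibo): https://github.com/hltcoe/golden-horse
- Chinese EduNER: a 2023 dataset for the education domain: https://link.springer.com/article/10.1007/s00521-023-08635-5
- Chinese aerospace NER: https://www.nature.com/articles/s41598-023-50705-0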
- SciCN: A Chinese Dataset and Benchmark for Scientific Information Extraction https://file.techscience.com/files/cmc/2024/TSP_CMC-78-3/TSP_CMC_35594/TSP_CMC_35594.pdf https://github.com/yangjingla/SciCN
- ENP-NER: Historical Chinese https://aclanthology.org/2024.lrec-main.35.pdf https://gitlab.com/enpchina/ENP-NER
Tagalog
- TLUnified-NER: https://arxiv.org/abs/2311.07161 https://huggingface.co/datasets/ljvmiranda921/tlunified-ner
Russian
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- NE3L named entities Russian corpus (Arabic, Chinese, Russian): https://catalog.elra.info/en-us/repository/browse/ELRA-W0080/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- factRuEval-2016: https://github.com/dialogue-evaluation/factRuEval-2016
- RuREBus 2020 (Russian Relation Extraction for Business) corpus https://github.com/dialogue-evaluation/RuREBus
Yoruba
- GV-Yorùbá-NER. Data: https://github.com/ajesujoba/YorubaTwi-Embedding/tree/master/Yoruba/Yor%C3%B9b%C3%A1-NER ; Data statement: https://drive.google.com/file/d/177xu-O2FTJ7VJQ-0ohCWjVd1qu61Tvml/view Paper: Jesujoba O Alabi, Kwabena Amponsah-Kaakyire, David I Adelani, and Cristina España-Bonet. Massive vs. curated word embeddings for low-resourced languages. The case of Yorùbá and Twi. In LREC, 2020 (https://arxiv.org/abs/1912.02481)
Swahili
- Helsinki Corpus of Swahili 2.0 (HCS 2.0) Annotated Version: http://metashare.csc.fi/repository/browse/helsinki-corpus-of-swahili-20-hcs-20-annotated-version/232c1910b9eb11e5915e005056be118e59fb2e920f1f4c0cafc94915fc6f5cac/ See: Shah et al., 2010. SYNERGY: A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation
Igbo
- IgboNER: https://aclanthology.org/2022.lrec-1.547/ https://github.com/Chiamakac/IgboNER-Models later updated in https://openreview.net/pdf?id=tHUS9-vmUfC from https://sites.google.com/view/africanlp2023/home
isiNdebele
- NCHLT isiNdebele Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/306
isiXhosa
- NCHLT isiXhosa Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/312
isiZulu
- NCHLT isiZulu Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/319
Sepedi
- NCHLT Sepedi Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/328
Sesotho
- NCHLT Sesotho Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/334
Setswana
- NCHLT Setswana Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/341
Siswati
- NCHLT Siswati Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/346
Venda
- NCHLT Tshivenda Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/355
- MPHAYANER: Named Entity Recognition for Tshivenḓa: https://openreview.net/pdf?id=0nneuL3bSLt https://github.com/rendanim/MphayaNER from https://sites.google.com/view/africanlp2023/home
Xitsonga
- NCHLT Xitsonga Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/362
Latin
- Herodotos Project: https://github.com/alexerdmann/Herodotos_Project_Annotation
A long list can be found here: http://damien.nouvels.net/resourcesen/corpora.html
References
[Alvarado et al., 2015] Alvarado, Julio Cesar Salinas, Karin Verspoor, and Timothy Baldwin. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015, pp. 84-90. 2015. Accessed: August 2018.
[Balasuriya et al., 2009] Balasuriya, Dominic, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran. Named entity recognition in wikipedia. In Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pp. 10-18. Association for Computational Linguistics, 2009
[Bos et al., 2017] Bos, Johan, Valerio Basile, Kilian Evang, Noortje J. Venhuizen, and Johannes Bjerva. The Groningen meaning bank. In Handbook of linguistic annotation, pp. 463-496. Springer, Dordrecht, 2017.
[Derczynski et al., 2016] Derczynski, Leon, Kalina Bontcheva, and Ian Roberts. Broad twitter corpus: A diverse named entity recognition resource. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1169-1179. 2016. Available at: https://github.com/GateNLP/broad_twitter_corpus Accessed: August 2018.
[Derczynski et al., 2017] Leon Derczynski, Eric Nichols, Marieke van Erp, Nut Limsopatham (2017) Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition, in Proceedings of the 3rd Workshop on Noisy, User-generated Text. Available at: https://noisy-text.github.io/2017/emerging-rare-entities.html
[DSTL, 2017] Defence Science and Technology Laboratory. 2017. Relationship and Entity Extraction Evaluation Dataset. https://github.com/dstl/re3d. Accessed: January 2018.
[Grishman and Sundheim, 1996] Ralph Grishman and Beth Sundheim. 1996. Message understanding conference- 6: A brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.
[Karimi et al., 2015] Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. Cadec: A corpus of adverse drug event annotations. Journal of biomedical informatics, 55:73-81. Available at https://data.csiro.au Accessed: November 2017.
[Lim et al., 2017] Lim, Swee Kiat, Aldrian Obaja Muis, Wei Lu, and Chen Hui Ong. MalwareTextDB: A database for annotated malware articles. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1557-1567. 2017.
[Liu et al., 2013a] Jingjing Liu, Panupong Pasupat, Scott Cyphers, and Jim Glass. 2013. Asgard: A portable architecture for multilingual dialogue systems. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8386-8390. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/restaurant/ Accessed: January 2018
[Liu et al., 2013b] Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers, and Jim Glass. 2013. Query understanding enhanced by hierarchical parsing structures. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 72-77. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/movie/ We used the trivia10k13 portion. Accessed: January 2018
[NIST, 1999 IE-ER] NIST. 1999. Information Extraction - Entity Recognition Evaluation. http://www.nist.gov/speech/tests/ieer/er_99/er_99.htm. The newswire development test data only (included in the NLTK package).
[Ohta et al., 2012] Tomoko Ohta, Sampo Pyysalo, Jun'ichi Tsujii and Sophia Ananiadou. 2012. Open-domain Anatomical Entity Mention Detection. In Proceedings of ACL 2012 Workshop on Detecting Structure in Scholarly Discourse (DSSD), pp. 27-36. Available at: http://www.nactem.ac.uk/anatomy/ and https://github.com/openbiocorpora/anem Accessed: November 2017.
[Ritter et al., 2011] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524-1534, Edinburgh, Scotland, UK., July. Association for Computational Linguistics. Accessed January 2018.
[Sang and Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.
[Stubbs et al., 2015] Amber Stubbs and Ozlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of biomedical informatics, 58:S20-S29. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.
[Uzuner et al., 2007] Ozlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association, 14(5):550-563. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.
[Weischedel and Brunstein, 2005] Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia.
[Weischedel et al., 2013] Weischedel, Ralph, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue et al. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA (2013).
[Zeldes, 2017] Amir Zeldes. 2017. The GUM corpus: creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581-612. Available at https://github.com/amir-zeldes/gum/tree/master/coref/tsv/ Accessed: November 2017.