Datasets for Entity Recognition
This repository contains datasets from several domains annotated with a variety of entity types, useful for entity recognition and named entity recognition (NER) tasks.
Note: I am no longer actively adding datasets to this list - many more NER datasets have likely appeared since 2020. However, I am happy to add more datasets via issues or pull requests.
Datasets for NER in English
The following table lists datasets for English-language entity recognition (for NER datasets in other languages, see below). The data directory contains information on where to obtain the datasets that could not be shared due to licensing restrictions, as well as code to convert them (if necessary) to the CoNLL 2003 format. Links to NER corpora in other languages are also listed below.
| Dataset | Domain | License | Reference | Availability |
|---|---|---|---|---|
| CoNLL 2003 | News | DUA | Sang and Meulder, 2003 | Easy to find |
| NIST-IEER | News | None | NIST 1999 IE-ER | NLTK data |
| MUC-6 | News | LDC | Grishman and Sundheim, 1996 | LDC 2003T13 |
| OntoNotes 5 | Various | LDC | Weischedel et al., 2013 | LDC 2013T19 |
| BBN | Various | LDC | Weischedel and Brunstein, 2005 | LDC 2005T33 |
| GMB-1.0.0 | Various | None | Bos et al., 2017 | http://gmb.let.rug.nl/data.php |
| GUM-3.1.0 | Wiki | Multiple (*2) | Zeldes, 2016 | ✔ Included here |
| WikiGold | Wikipedia | CC-BY 4.0 | Balasuriya et al., 2009 | ✔ Included here |
| Ritter | Twitter | None | Ritter et al., 2011 | No split, train/test/dev split |
| BTC | Twitter | CC-BY 4.0 | Derczynski et al., 2016 | ✔ Included here |
| WNUT17 | Social media | CC-BY 4.0 | Derczynski et al., 2017 | ✔ Included here |
| i2b2-2006 | Medical | DUA | Uzuner et al., 2007 | http://www.i2b2.org |
| i2b2-2014 | Medical | DUA | Stubbs et al., 2015 | http://www.i2b2.org |
| CADEC | Medical | CSIRO | Karimi et al., 2015 | http://data.csiro.au/ |
| AnEM | Anatomical | CC-BY-SA 3.0 | Ohta et al., 2012 | ✔ Included here |
| MITRestaurant | Queries | None | Liu et al., 2013a | http://groups.csail.mit.edu/sls/ |
| MITMovie | Queries | None | Liu et al., 2013b | http://groups.csail.mit.edu/sls/ |
| MalwareTextDB | Malware | None | Lim et al., 2017 | http://www.statnlp.org/ |
| re3d | Defense | Multiple (*1) | DSTL, 2017 | ✔ Included here |
| SEC-filings | Finance | CC-BY 3.0 | Alvarado et al., 2015 | ✔ Included here |
| Assembly | Robotics | X | Costa et al., 2017 | X |
| WikiNEuRal | Wikipedia | CC BY-SA-NC 4.0 | Tedeschi et al., 2021 | https://github.com/babelscape/wikineural |
| MultiNERD | Wikipedia | CC BY-SA-NC 4.0 | Tedeschi et al., 2022 | https://github.com/babelscape/multinerd |
| HIPE-2022 | Historical | CC BY-SA-NC 4.0 | Ehrmann et al., 2022 | https://github.com/hipe-eval/hipe-2022-data |
| Music-NER | Music | MIT | Epure and Hennequin, 2023 | https://github.com/deezer/music-ner-eacl2023 |
| WIESP2022-NER | Astrophysics | CC BY-SA-NC 4.0 | Grezes et al., 2022 | https://huggingface.co/datasets/adsabs/wiesp2022-ner |
| NNE | News | CC 4.0 / LDC | Ringland et al., 2019 | https://github.com/nickyringland/nested_named_entities |
| Worldwide | News | CC BY-SA-NC 4.0 | Shan et al., 2023 | https://github.com/stanfordnlp/en-worldwide-newswire https://arxiv.org/abs/2404.13465 |
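Since many of the datasets above are distributed in (or converted to) a CoNLL 2003-style column format, a small reader can be useful when inspecting them. The following is a minimal sketch, not the repository's own conversion code. It assumes whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, and B-/I-/O tags; the names `read_conll`, `bio_spans`, and `SAMPLE` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs

# Tiny hand-written sample in a CoNLL 2003-like layout (assumed, not real data).
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group lines into sentences, keeping the first and last column."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def bio_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from B-/I-/O tags."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        prefix, _, kind = tag.partition("-")
        # An I- tag of the same type continues the open entity.
        if prefix == "I" and label == kind and tokens:
            tokens.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if tokens:
            spans.append((" ".join(tokens), label))
        if prefix in ("B", "I"):
            tokens, label = [token], kind
        else:
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


if __name__ == "__main__":
    for sent in read_conll(SAMPLE.splitlines()):
        print(bio_spans(sent))
```

Treating a stray `I-` tag as the start of an entity lets the same function cope with files that use the older IOB1 convention; datasets with extra columns or nested annotations would need a different reader.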
Licenses
License notes:
(1) re3d ("Relationship and Entity Extraction Evaluation Dataset") contains several datasets, with different licenses. These are:
- CC-BY-SA 3.0 (Wikipedia dataset)
- CC BY-NC 3.0 (BBC_ONLINE dataset)
- CC BY 3.0 AU (Australia_Department_of_foreign_affairs dataset)
- Public domain (US_State_Department dataset, CENTCOM dataset)
- UK Open Government Licence v3.0 (UK_Government dataset)
- Delegation_of_the_european_union_to_syria: see https://eeas.europa.eu/delegations/syria/8157/legal-notice_en

(2) GUM-3.1.0 consists of three datasets, licensed under CC-BY 3.0, CC-BY-SA 3.0 and CC-BY-NC-SA 3.0. The annotations are licensed under CC-BY 4.0.
More detailed license information for each dataset can be found in the corresponding subdirectory.
To add later
- Tabassum et al., Code and Named Entity Recognition in StackOverflow: https://cocoxu.github.io/publications/acl2020_stackovlow_ner.pdf
- LitBank: https://github.com/dbamman/litbank (Bamman, Popat and Shen, An Annotated Dataset of Literary Entities, 2019)
- Ringland et al., NNE: A Dataset for Nested Named Entity Recognition in English Newswire, 2019: https://github.com/nickyringland/nested_named_entities
- Mars Target Encyclopedia - labeled LPSC abstracts: https://zenodo.org/record/1048419
- Best Buy e-commerce NER dataset: https://www.kaggle.com/dataturks/best-buy-ecommerce-ner-dataset/home
- Resume entities for NER: https://www.kaggle.com/dataturks/resume-entities-for-ner/home
- Few-NERD: a few-shot named entity recognition dataset: https://aclanthology.org/2021.acl-long.248/
Datasets for NER in other languages
Lexical Named Entity Resources
- HeiNER: http://heiner.cl.uni-heidelberg.de/index.shtml
- NECKAr: https://event.ifi.uni-heidelberg.de/?page_id=532#wikidata_ne_dataset
Code-switching
- English-Spanish tweets (CALCS 2018): https://code-switching.github.io/2018/; https://code-switching.github.io/2018/files/spa-eng/release.zip; http://www.aclweb.org/anthology/w18-3219
- Arabic-Egyptian tweets (CALCS 2018): https://code-switching.github.io/2018/; https://code-switching.github.io/2018/files/msa-egy/arabictweetstokenAsigner.zip; http://www.aclweb.org/anthology/w18-3219
- Hindi-English social media text: https://github.com/silentflame/named-entity-cognition; http://aclweb.org/anthology/w18-2405
- EMNLP 2014 shared task - code-switched tweets (Nepali-English, Spanish-English, Mandarin-English, Arabic-dialectal Arabic): http://emnlp2014.org/workshops/codeswitch/call.html
German
- CoNLL 2003 (English, German): https://www.clips.uantwerpen.be/conll2003/ner/
- GermEval 2014: https://sites.google.com/site/germeval2014ner/data
- Tübingen Treebank of Written German (TüBa-D/Z): http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-dz.html
- Europeana Newspapers (Dutch, French, German): https://github.com/europeaneanewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- German Europarl transcripts (subset): https://nlpado.de/~sebastian/software/ner_german.shtml
- Named Entity Model for German, Politics (NEMGP): https://www.thomas-zastrow.de/nlp/
- WikiNER: https://figshare.com/articles/learning_multilingual_named_entity_recognition_from_wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- DFKI SmartData Corpus (geo-entities): https://dfki-lt-re-group.bitbucket.io/smartdata-corpus/ (A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events, Martin Schiersch, Veselina Mironova, Maximilian Schmitt, Philippe Thomas, Aleksandra Gabryszak, Leonhard Hennig)
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- Elena Leitner, Georg Rehm, Julián Moreno-Schneider, A Dataset of German Legal Documents for Named Entity Recognition, LREC 2020: http://georg-re.hm/pdf/lrec-2020-leitner-et-al-preprint.pdf; Data: https://github.com/elenanereiss/legal-entity-recognition
- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Dutch
- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
- Europeana Newspapers (Dutch, French, German): https://github.com/europeaneanewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- MEANTIME Corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- WikiNER: https://figshare.com/articles/learning_multilingual_named_entity_recognition_from_wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- Dutch parliamentary documents 2015-2016, from 1848.nl (Jonkers, Named entity recognition on Dutch parliamentary documents using Frog, thesis, University of Amsterdam, 2016): https://github.com/poezedoez/ner/blob/master/code/datab.com
- SoNaR-1 - Desmet and Hoste, Fine-grained Dutch named entity recognition, 2014 (class hierarchy)
- Corpus-SoNaR Books and Corpus Gutenberg Dutch: http://blog.namescape.nl/?page_id=85; http://portal.clarin.nl/node/1940
Afrikaans
- NCHLT Afrikaans Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/299
Spanish
- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
- AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en
- DEFT Spanish Treebank (LDC2018T01): https://catalog.ldc.upenn.edu/ldc2018t01
- PANACEA (lab): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-es
- PANACEA (env): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-es
- MEANTIME Corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/ldc2014t18
- WikiNER: https://figshare.com/articles/learning_multilingual_named_entity_recognition_from_wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- http://www.grupolys.org/~marcos/pub/lrec16.tar.bz2 (used in "Incorporating lexico-semantic heuristics into coreference resolution for named entity recognition at document level")
- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatatica.usc.es/~marcos/lrec.tar.bz2
- DrugSemantics gold standard (Moreno et al., DrugSemantics: a corpus for named entity recognition in Spanish summaries of product characteristics, 2017): https://data.mendeley.com/datasets/fwc7jrc5jr/1
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- CANTEMIST (Cancer Text Mining Shared Task - tumor named entity recognition) - named entity recognition of a critical type of concept related to cancer, namely tumor morphology, in Spanish medical texts: https://temu.bsc.es/cantemist/
Catalan
- AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en
Galician
- Galician NER corpus: https://gramatatica.usc.es/~marcos/resources/corpus_gal_nec.txt.gz
- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatatica.usc.es/~marcos/lrec.tar.bz2
Basque
- Basque Named Entities Corpus (EIEC): http://ixa.eus/node/4486?Language=en
- Basque Disambiguated Entities Corpus (EDIEC): http://ixa.si.ehu.es/node/4485?language=en
- Egunkaria 2000 corpus (383 newswire texts), mentioned in http://qtleap.eu/wp-content/uploads/2014/04/qtleap-2013-d5.1.pdf
Portuguese
- HAREM: https://www.linguateca.pt/aval_conjunta/harem/harem_ing.html
- CINTIL corpus: http://cintil.ul.pt/cintilfeatures.html#corpus
- WikiNER: https://figshare.com/articles/learning_multilingual_named_entity_recognition_from_wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatatica.usc.es/~marcos/lrec.tar.bz2
- Bosque 8.0 (EAGLES format): https://gramatatica.usc.es/~marcos/resources/corpora_flpt.tgz
- LeNER-Br (Brazilian legal documents): https://cic.unb.br/~teodecampos/lener-r/
- Paramopama: a Brazilian-Portuguese corpus for named entity recognition
French
- ESTER: http://catalogue.elra.info/en-us/repository/browse/elra-s0241/
- ESTER 2: http://catalogue.elra.info/en-us/repository/browse/elra-s0338/
- ETAPE: http://catalogue.elra.info/en-us/repository/browse/elra-e0046/
- Europeana Newspapers (Dutch, French, German): https://github.com/europeaneanewspapers/ner-corpora; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- Quaero French Medical Corpus: https://quaerofrenchmed.limsi.fr/
- Quaero Broadcast News Extended Named Entity corpus: http://catalog.elra.info/en-us/repository/browse/elra-s0349/
- Quaero Old Press Extended Named Entity corpus: http://catalog.elra.info/en-us/repository/browse/elra-w0073/
- WikiNER: https://figshare.com/articles/learning_multilingual_named_entity_recognition_from_wikipedia/5462500
- WikiNER-fr-gold: https://arxiv.org/abs/2411.00030 https://huggingface.co/datasets/danrun/wikiner-fr-gold
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- CAp 2017 - (Twitter data), Lopez et al., CAp 2017 challenge: Twitter Named Entity Recognition, 2017: http://cap2017.imag.fr/competition.html
- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Italian
- KIND: https://github.com/dhfbk/kind
- EVALITA: http://www.evalita.it/2009/tasks/entity
- MEANTIME Corpus (parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- PANACEA (env): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-it
- PANACEA (lab): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-it
- WikiNER: https://figshare.com/articles/learning_multilingual_named_entity_recognition_from_wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
Romanian
- RONEC (Dumitrescu and Avram, Introducing RONEC - the Romanian Named Entity Corpus. LREC 2020). Paper: https://arxiv.org/pdf/1909.01247.pdf Data: https://github.com/dumitrescustefan/ronec
- Romanian Journalistic Corpus (Roco): http://metashare.elda.org/repository/browse/romanian-journalistic-corpus-roco/038baa80dc7311e5aa0b00847df3e3583781d7c0b0084405df3e3e3583781d7c7c0b00844012
- Romanian Balanced Corpus (ROMBAC): http://metashare.elda.org/repository/browse/romanian-balanced-corpus-rombac/0a7dd85edc7311e5aa0b00237df3e35873a0d662435d42dd94fba48c29dc0065/
Greek
- Panacea (env): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-el
- Panacea (lab): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-el
Hungarian
- Hungarian Named Entity Corpora: http://rgai.inf.u-szeged.hu/index.php?lang=en&page=corpus_ne
- hunNERwiki: http://hlt.sztaki.hu/resources/hunnerwiki.html
- NYTK: https://github.com/nytud/nytk-nerkor
Czech
- Czech Named Entity Corpus: http://ufal.mff.cuni.cz/cnec
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- CzEng 1.0 (parallel corpus: Czech-English): http://ufal.mff.cuni.cz/czeng/czeng10
- PERO OCR NER (Czech historical OCR chronicles): https://github.com/roman-janik/poner https://dspace.vut.cz/items/6092e1b0-3d75-4451-8582-28573AC30404
Polish
- The Polish Sejm Corpus: http://clip.ipipan.waw.pl/psc
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- Polish Coreference Corpus: http://zil.ipipan.waw.pl/polishcoreferenceCorpus
- WikiNER: https://figshare.com/articles/learning_multilingual_named_entity_recognition_from_wikipedia/5462500
- WikiNEuRal: https://github.com/babelscape/wikineural
- MultiNERD: https://github.com/babelscape/multinerd
- Corpus of Economic News (CEN Corpus): http://www.nlp.pwr.wroc.pl/narzedzia-i-zasoby/zasoby/cen
- KPWr (Korpus Języka Polskiego Politechniki Wrocławskiej / Polish Corpus of Wrocław University of Technology): http://plwordnet.pwr.wroc.pl/index.php?option=com_content&view=article&id=35&ipid=1818181 http://plwordnet.pwr.wroc.pl/attachments/article/35/kpwr-1.1.7z (Broda et al., KPWr: Towards a Free Corpus of Polish, 2012)
- NKJP: http://clip.ipipan.waw.pl/nationalcorpusofpolish?action=attachfile&do=view&target=nkjp-podkorpusmilionowy-1.2.tar.gz
Croatian
- hr500k 1.0: http://hdl.handle.net/11356/1183
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- ReLDI-NormTagNER-hr (Croatian tweets): http://hdl.handle.net/11356/1170
Slovak
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- Slovak Categorized News Corpus: https://nlp.web.tone.sk/pages/categorizedNews
Slovene
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- ssj500k: http://www.slovenscina.eu/tehnologije/ucni-korpus; http://eng.slovenscina.eu/tehnologije/ucni-korpus; https://www.clarin.si/repository/xmlui/handle/11356/1029; Note: for v2.2 see: http://hdl.handle.net/11356/1210
- Slovene News: http://zitnik.si/mediawiki/index.php?title=datasets#slovene_news; http://zitnik.si/mediawiki/images/7/7d/rtvslo_dec2011.tsv; http://zitnik.si/mediawiki/images/5/5e/rtvslo_dec2011_v2.tsv
- Janes-Tag 2.0 (social media text): https://www.clarin.si/repository/xmlui/handle/11356/1123; See also: Fišer et al., The Janes project: language resources and tools for Slovene user generated content, 2018.
Ukrainian
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- Ukrainian Brown NER corpus: https://github.com/lang-uk/ner-uk; http://lang.org.ua/en/corpora/
Serbian
- SETimes.SR: http://hdl.handle.net/11356/1200
- Named Entity Evaluation Corpus for Serbian: http://www.korpus.matf.bg.ac.rs/srpneval/
- ReLDI-NormTagNER-sr (Serbian tweets): http://hdl.handle.net/11356/1171
Bulgarian
Icelandic
- MIM-GOLD-NER (Ingólfsdóttir, Svanhvít Lilja, Sigurjón Þorsteinsson, and Hrafn Loftsson. "Towards High Accuracy Named Entity Recognition for Icelandic." Proceedings of the 22nd Nordic Conference on Computational Linguistics. 2019): http://www.malfong.is/index.php?pg=mim_gold_ner
Danish
- DaNE: Hvingelby et al., [DaNE: A Named Entity Resource for Danish](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.565.pdf), LREC 2020: https://github.
- Danish PropBank (DPB): http://catalog.elra.info/en-us/repository/browse/elra-w0117/
- Arboretum Treebank: http://catalog.elra.info/en-us/repository/browse/elra-w0084/
Norwegian
- Bjarte Johansen, Named-Entity Recognition for Norwegian, Proceedings of the 22nd Nordic Conference on Computational Linguistics. 2019 (https://www.aclweb.org/anthology/w19-6123.pdf) Data: https://github.com/ljos/navnkjenner
- Fredrik Jørgensen et al., NorNE: Annotating Named Entities for Norwegian, 2019 (https://arxiv.org/pdf/1911.12146.pdf). Data: https://github.com/ltgoslo/norne/; https://www.nb.no/sprakbanken/show?serial=oai%3anb.no%3asbr-49
Swedish
- Stockholm Internet Corpus: https://www.ling.su.se/english/nlp/corpora-and-sources/sic
- SUC 3.0: https://spraakbanken.gu.se/eng/resource/suc3
- Swedish manually annotated NER: https://github.com/klintan/swedish-ner-corpus/
- Medical Wikipedia data (Almgren et al., Named entity recognition in Swedish health records with character-based deep bidirectional LSTMs, 2016): https://github.com/olofmogren/biomedical-ner-data-swedish
- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Finnish
- Data for Finnish named entity recognition: https://github.com/mpsilfve/finer-data
- Turku NER corpus: https://github.com/turkunlp/turku-ner-corpus
- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/hipe-2022/ https://github.com/hipe-eval/hipe-2022-data
Estonian
- Estonian NER corpus: https://metashare.ut.ee/repository/browse/estonian-ner-corpus/88d030c0acde11e2a6e4005056b40024f1def472ed254e77a8952e1003d9f82ed254e7a8952e1
Latvian and Lithuanian
- https://github.com/accurat-toolkit/tildener/tree/master/test (Pinnis, Latvian and Lithuanian Named Entity Recognition with TildeNER, LREC 2012)
- Training data for LVTagger: https://github.com/peterisp/lvtagger/tree/master/nertraindata
Turkish
- Küçük and Can, A tweet dataset annotated for named entity recognition and stance detection, 2019: https://github.com/dkucuk/tweet-dataset-ner-sd
- Küçük et al., Named entity recognition on Turkish tweets: http://optima.jrc.it/resources/2014_jrc_twitter_tr_ner-dataset.zip
- English/Turkish Wikipedia named entity recognition and text categorization dataset (http://arxiv.org/abs/1702.02363): https://data.mendeley.com/datasets/cdcztymf4k/1
- Çoban et al., Named entity recognition over FbNER: a new Facebook dataset in Turkish: https://ieeexplore.ieee.org/document/9598971 Data available for research purposes upon request
Kazakh
- KazNERD: https://arxiv.org/pdf/2111.13419.pdf, https://github.com/is2ai/kaznerd
Uyghur
- Uyghur Named Entity Relation corpus: https://github.com/kaharjan/uynerel (Abiderexiti et al., Annotation schemes for constructing Uyghur named entity relation corpus. IALP 2016)
Armenian
- pioNER (gold-standard and silver-standard datasets): https://github.com/ispras-texterra/pioner (Ghukasyan et al., pioNER: Datasets and Baselines for Armenian Named Entity Recognition, 2018)
- ArmTDP-NER: https://github.com/myavrum/armtdp-ner
Coptic
- Coptic Universal Dependencies Treebank: https://github.com/universaldependencies/ud_coptic-scriptorium/tree/dev (see also https://copticscriptorium.org/treebank.html). It contains 46,000 tokens annotated with nested (non-)named and wikified entities from Sahidic Coptic texts.
Amharic
- SAY corpus (see "Named Entity Recognition for Amharic Using Deep Learning"): https://github.com/geezorg/data/tree/master/amharic/tagged/nmsu-say; http://data.geez.org/
Arabic
- AQMAR Arabic Wikipedia Named Entity Corpus: http://www.cs.cmu.edu/~ark/arabicner/
- NE3L named entities Arabic corpus (Arabic, Chinese, Russian): http://catalog.elra.info/en-us/repository/browse/elra-w0078/
- REFLEX Entity Translation (parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2009t11
- ANERCorp: http://users.dsic.upv.es/~ybenjaban/downloads.html (see also: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html)
- ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2004t09
- ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2005t09
- ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2006t06
- ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/ldc2014t18
- OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2013t19
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- Wojood - 2022 nested Arabic named entity corpus: https://dlnlp.ai/st/wojood/ https://aclanthology.org/2022.lrec-1.387.pdf https://codalab.lisn.upsaclay.fr/competitions/11740
Persian
- ArmanPersoNERCorpus: http://islrn.org/resources/399-379-640-828-6/; https://github.com/haniehp/persianner
Sindhi
- SiNER: https://aclanthology.org/2020.lrec-1.361/, https://github.com/aliwazir/siner-dataset
Urdu
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
- UNER dataset (Khan et al., Named entity dataset for Urdu named entity recognition task, 2016). Available at http://www.iiu.edu.pk/?page_id=5181
- MK-PUCIT: https://www.dropbox.com/sh/1ivw7ykm2tugg94/aab9t5wnn7fypo7tjjw8la; see: Kanwal et al., Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications, 2019
Indic
- Naamapadam: named entity recognition (NER) dataset for the 11 major Indian languages from two language families. https://research.ibm.com/publications/naamapadam-a-large-scale-named-entity-annotated-data-for-indic-languages https://ai4bharat.iitm.ac.in/naamapadam
Hindi
- HiNER: https://github.com/cfiltnlp/hiner
- Hindi Health Dataset: https://www.kaggle.com/aijain/hindi-health-dataset/home
- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
Bengali
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
- Bengali-NER: https://github.com/rifat1493/bengali-ner, https://ieeexplore.ieee.org/document/8944804
- NER-Bangla: https://github.com/misabic/ner-bangla-dataset, https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs179349
Telugu
- NER_Telugu: https://github.com/anikethjr/ner_telugu
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
- Named Entity Annotated Corpora for Telugu: http://www.tdil-dc.in/index.php?option=com_download&task=showResourceDetails&toolid=982&lang=en
Maithili
- First named entity recognizer in Maithili: resource creation and system development: https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs210051
Nepali
- EverestNER: https://journals.flvc.org/flairs/article/view/130725, https://github.com/nowalab/everest-ner
Marathi
- Named Entity Annotated Corpora for Marathi: http://www.tdil-dc.in/index.php?option=com_download&task=showResourceDetails&toolid=979&lang=en
- L3Cube-MahaNER: https://arxiv.org/abs/2204.06029 https://github.com/l3cube-pune/marathinlp
Punjabi
- Named Entity Annotated Corpora for Punjabi: http://www.tdil-dc.in/index.php?option=com_download&task=showResourceDetails&toolid=980&lang=en
Tamil
- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
Malayalam
- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam): http://au-kbc.org/nlp/esm-fire2015/#traincorpus
- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/ner-fire2013/
Oriya/Odia
- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
Sinhala/Sinhalese
Thai
- thai-named-entity-recognition-data: https://github.com/pythainlp/thai-named-entity-recognition-data
- Thai Named Entity Corpora: http://pioneer.chula.ac.th/~awirote/resources/corpora--data.html; http://pioneer.chula.ac.th/~awirote/data-nutcha.zip; http://pioneer.chula.ac.th/~awirote/data-sasiwimon.zip; http://pioneer.chula.ac.th/~awirote/data-nattadaporn.zip
- LST20: https://huggingface.co/datasets/lst20; https://arxiv.org/abs/2008.05055
- Thai-NNER: https://github.com/vistec-ai/thai-nner, https://aclanthology.org/2022.findings-acl.116
Indonesian
- IDENTIC: http://metashare.elda.org/repository/browse/ididenc/fed3fada7ef111e5aa3b001dd8b71c66c98EeEEEEE36EAD42F18FFD9A95DA9104CC/
- https://github.com/yohanesgultom/nlp-experiments/tree/master/data/ner
- Indonesia-NER: Syaifudin & Nurwidyantoro https://ieeexplore.ieee.org/document/7828656 https://github.com/yusufsyaifudin/indonesia-ner
- IDNER-News-2k: an Indonesian news dataset for the named-entity recognition task. Reannotation of Syaifudin & Nurwidyantoro https://dl.acm.org/doi/10.1145/3592854#fn8 https://github.com/khairunnisaor/idner-news-2k/
- NERP and NER-Grit: two Indonesian datasets from IndoNLP/IndoNLU https://github.com/indonlp/indonlu/tree/master/dataset https://aclanthology.org/2020.aacl-main.85/
Vietnamese
- VLSP 2016: http://vlsp.org.vn/resources-vlsp2016; https://github.com/undertheseanlp/ner
- VLSP 2018: http://vlsp.org.vn/resources-vlsp2018; https://github.com/undertheseanlp/ner
- PhoNER_COVID19: https://github.com/vinairesearch/phoner_covid19
Japanese
- IREX: https://nlp.cs.nyu.edu/irex/package/
- MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/
- BCCWJ Basic NE corpus: https://sites.google.com/site/projectnextnlpne/en (Iwakura et al., Constructing a Japanese Basic Named Entity Corpus of Various Genres, NEWS 2016)
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- Data from: Mai et al., An Empirical Study on Fine-Grained Named Entity Recognition, COLING 2018 (English, Japanese): https://fgner.alt.ai/duc/ene/testsets/comp/
- Wikipedia NER corpus: https://github.com/stockmarkteam/ner-wikipedia-dataset
- WikiANN: https://elisa-ie.github.io/wikiann/
- GSD: conversion of the UD GSD dataset into named entities by Megagon Labs https://github.com/megagonlabs/ud_japanese-gsd
- KWDLC: Kyoto University Web Document Leads Corpus https://nlp.ist.i.kyoto-u.ac.jp/en/index.php?kwdlc https://github.com/ku-nlp/kwdlc
Korean
- National Institute of Korean Language (ROK) - NER corpus: https://github.com/digitalprk/koreaner; https://ithub.korean.go.kr/user/total/referenceview.do?boardseq=5&articleSeq=11&boardgb=T&isinsupd&boardType=corpus
- KMOU NER - https://github.com/kmounlp/ner
- Korean Language Understanding Evaluation - KLUE NER - https://klue-benchmark.com/tasks/69/overview/description
- https://github.com/songys/entity
- HLCT 2016 corpus, with updates - https://github.com/machinereading/koreannercorpus
Chinese
- ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2004t09
- ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2005t09
- ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/ldc2006t06
- OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2013t19
- MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/
- REFLEX Entity Translation (parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/ldc2009t11
- NE3L named entities Chinese corpus (Arabic, Chinese, Russian): http://catalogue.elra.info/en-us/repository/browse/elra-w0079/
- Collection of original message data I in Chinese (named entities): http://catalog.elra.info/en-us/repository/browse/elra-w0045_04/
- Collection of original short message data II in Chinese (named entities): http://catalog.elra.info/en-us/repository/browse/elra-w0045_08/
- DEFT ERE corpora (parallel corpus: English, Chinese): Mott et al., Parallel Chinese-English Entities, Relations and Events Corpora, 2016 (LDC2015E78, LDC2014E114)
- Weibo Chinese: DEFT-style annotation for named and nominal mentions in Chinese social media (Weibo): https://github.com/hltcoe/golden-horse
- Chinese EduNER: 2023 dataset in the education domain: https://link.springer.com/article/10.1007/s00521-023-08635-5 https://github.com/anonymous-xl/eduner
- Chinese Aerospace NER: https://www.nature.com/articles/s41598-023-50705-0 https://github.com/Coder-XIAOKAI/Aerospace_NERdatasets
- SciCN: A Chinese Dataset and Benchmark for Scientific Information Extraction https://file.techscience.com/files/cmc/2024/TSP_CMC-78-3/TSP_CMC_35594/TSP_CMC_35594.pdf https://github.com/yangjingla/SciCN
- ENP NER: Historical Chinese https://aclanthology.org/2024.lrec-main.35.pdf https://gitlab.com/enpchina/ENP-NER
Tagalog
- TLUnified: https://arxiv.org/abs/2311.07161 https://huggingface.co/datasets/ljvmiranda921/tlunified-ner
Russian
- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html
- NE3L named entities Russian corpus (Arabic, Chinese, Russian): https://catalog.elra.info/en-us/repository/browse/ELRA-W0080/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- factRuEval-2016: https://github.com/dialogue-evaluation/factRuEval-2016
- RuREBus 2020 (Russian Relation Extraction for Business) corpus https://github.com/dialogue-evaluation/RuREBus
Yoruba
- GV-Yorùbá-NER. Data: https://github.com/ajesujoba/YorubaTwi-Embedding/tree/master/Yoruba/Yor%C3%B9b%C3%A1-NER ; Data statement: https://drive.google.com/file/d/177xu-O2FTJ7VJQ-0ohCWjVd1qu61Tvml/view ; Paper: Jesujoba O Alabi, Kwabena Amponsah-Kaakyire, David I Adelani, and Cristina España-Bonet. Massive vs. curated word embeddings for low-resourced languages. The case of Yorùbá and Twi. In LREC, 2020 (https://arxiv.org/abs/1912.02481)
Swahili
- Helsinki Corpus of Swahili 2.0 (HCS 2.0) Annotated Version: http://metashare.csc.fi/repository/browse/helsinki-corpus-of-swahili-20-hcs-20-annotated-version/232c1910b9eb11e5915e005056be118e59fb2e920f1f4c0cafc94915fc6f5cac/ See: Shah et al., 2010. SYNERGY: A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation
Igbo
- IgboNER: https://aclanthology.org/2022.lrec-1.547/ https://github.com/Chiamakac/IgboNER-Models later updated in https://openreview.net/pdf?id=tHUS9-vmUfC from https://sites.google.com/view/africanlp2023/home
isiNdebele
- NCHLT isiNdebele Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/306
Xhosa
- NCHLT isiXhosa Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/312
Zulu
- NCHLT isiZulu Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/319
Sepedi
- NCHLT Sepedi Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/328
Sesotho
- NCHLT Sesotho Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/334
Setswana
- NCHLT Setswana Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/341
Siswati
- NCHLT Siswati Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/346
Venda
- NCHLT Tshivenda Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/355
- MPHAYANER: Named Entity Recognition for Tshivenḓa: https://openreview.net/pdf?id=0nneuL3bSLt https://github.com/rendanim/MphayaNER from https://sites.google.com/view/africanlp2023/home
Xitsonga
- NCHLT Xitsonga Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/362
Latin
- Herodotos Project: https://github.com/alexerdmann/Herodotos_Project_Annotation
A long list can be found here: http://damien.nouvels.net/resourcesen/corpora.html
References
[Alvarado et al., 2015] Alvarado, Julio Cesar Salinas, Karin Verspoor, and Timothy Baldwin. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015, pp. 84-90. 2015. Accessed: August 2018.
[Balasuriya et al., 2009] Balasuriya, Dominic, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran. Named entity recognition in wikipedia. In Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pp. 10-18. Association for Computational Linguistics, 2009
[Bos et al., 2017] Bos, Johan, Valerio Basile, Kilian Evang, Noortje J. Venhuizen, and Johannes Bjerva. The Groningen meaning bank. In Handbook of linguistic annotation, pp. 463-496. Springer, Dordrecht, 2017.
[Derczynski et al., 2016] Derczynski, Leon, Kalina Bontcheva, and Ian Roberts. Broad twitter corpus: A diverse named entity recognition resource. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1169-1179. 2016. Available at: https://github.com/GateNLP/broad_twitter_corpus Accessed: August 2018.
[Derczynski et al., 2017] Leon Derczynski, Eric Nichols, Marieke van Erp, Nut Limsopatham (2017) Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition, in Proceedings of the 3rd Workshop on Noisy, User-generated Text. Available at: https://noisy-text.github.io/2017/emerging-rare-entities.html
[DSTL, 2017] Defence Science and Technology Laboratory. 2017. Relationship and Entity Extraction Evaluation Dataset. https://github.com/dstl/re3d. Accessed: January 2018.
[Grishman and Sundheim, 1996] Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: A brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.
[Karimi et al., 2015] Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. Cadec: A corpus of adverse drug event annotations. Journal of biomedical informatics, 55:73-81. Available at https://data.csiro.au Accessed: November 2017.
[Lim et al., 2017] Lim, Swee Kiat, Aldrian Obaja Muis, Wei Lu, and Chen Hui Ong. MalwareTextDB: A database for annotated malware articles. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1557-1567. 2017.
[Liu et al., 2013a] Jingjing Liu, Panupong Pasupat, Scott Cyphers, and Jim Glass. 2013. Asgard: A portable architecture for multilingual dialogue systems. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8386-8390. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/restaurant/ Accessed: January 2018
[Liu et al., 2013b] Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers, and Jim Glass. 2013. Query understanding enhanced by hierarchical parsing structures. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 72-77. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/movie/ We used the trivia10k13 portion. Accessed: January 2018
[NIST, 1999 IE-ER] NIST. 1999. Information Extraction - Entity Recognition Evaluation. http://www.nist.gov/speech/tests/ieer/er_99/er_99.htm. The newswire development test data only (included in the NLTK package).
[Ohta et al., 2012] Tomoko Ohta, Sampo Pyysalo, Jun'ichi Tsujii and Sophia Ananiadou. 2012. Open-domain Anatomical Entity Mention Detection. In Proceedings of ACL 2012 Workshop on Detecting Structure in Scholarly Discourse (DSSD), pp. 27-36. Available at: http://www.nactem.ac.uk/anatomy/ and https://github.com/openbiocorpora/anem Accessed: November 2017.
[Ritter et al., 2011] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524-1534, Edinburgh, Scotland, UK., July. Association for Computational Linguistics. Accessed January 2018.
[Sang and Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.
[Stubbs et al., 2015] Amber Stubbs and Ozlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of biomedical informatics, 58:S20-S29. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.
[Uzuner et al., 2007] Ozlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association, 14(5):550-563. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.
[Weischedel and Brunstein, 2005] Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia.
[Weischedel et al., 2013] Weischedel, Ralph, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue et al. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA (2013).
[Zeldes, 2017] Amir Zeldes. 2017. The GUM corpus: creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581-612. Available at https://github.com/amir-zeldes/gum/tree/master/coref/tsv/ Accessed: November 2017.