corus
1.0.0
Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350GB+ of text.
For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (link in the Reference section):
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
Use corus to load the data:
>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)
LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
Iterate over texts:
>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
For links to other datasets and their loaders, see the Reference section.
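The example above calls load_lenta a second time before looping, which suggests the loader returns a one-pass iterator. Below is a minimal sketch of working with such an iterator lazily. It does not import corus: `Record` and `fake_load_lenta` are hypothetical stand-ins with made-up rows, and only the field names are taken from the LentaRecord output shown above.

```python
from collections import Counter, namedtuple
from itertools import islice

# Stand-in for a loader record; field names copied from the LentaRecord output.
Record = namedtuple('Record', ['url', 'title', 'text', 'topic', 'tags'])


def fake_load_lenta():
    """Hypothetical generator that mimics a loader, using made-up rows."""
    rows = [
        ('https://example.org/1', 'Title 1', 'Text one', 'Россия', 'Общество'),
        ('https://example.org/2', 'Title 2', 'Text two', 'Мир', 'Политика'),
        ('https://example.org/3', 'Title 3', 'Text three', 'Россия', 'Общество'),
    ]
    for row in rows:
        yield Record(*row)


# Take only the first two records without reading the rest.
head = list(islice(fake_load_lenta(), 2))

# Stream over all records once and count topics.
topic_counts = Counter(record.topic for record in fake_load_lenta())

# A generator is exhausted after one pass, so a second pass sees nothing.
records = fake_load_lenta()
first_pass = sum(1 for _ in records)
second_pass = sum(1 for _ in records)

print(len(head), topic_counts.most_common(1), first_pass, second_pass)
```

With a real archive the same pattern keeps memory use flat, because only one record is held at a time.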
Materials in Russian:
corus supports Python 3.5+, PyPy 3.
$ pip install corus
| Dataset | API (from corus import) | Tags | Texts | Uncompressed | Description |
|---|---|---|---|---|---|
| Lenta.ru | | | | | |
| Lenta.ru v1.0 | load_lenta | news | 739 351 | 1.66 GB | wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz |
| Lenta.ru v1.1+ | load_lenta2 | news | 800 975 | 1.94 GB | wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2 |
| Lib.rus.ec | load_librusec | fiction | 301 871 | 144.92 GB | Dump of lib.rus.ec prepared for the RUSSE workshop <br> wget http://panchenko.me/data/russe/librusec_fb2.plain.gz |
| Rossiya Segodnya | load_ria_raw, load_ria | news | 1 003 869 | 3.70 GB | wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz |
| Mokoron Russian Twitter Corpus | load_mokoron | sentiment social | 17 633 417 | 1.86 GB | Russian Twitter sentiment markup <br> Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql |
| Wikipedia | load_wiki | | 1 541 401 | 12.94 GB | Russian Wiki dump <br> wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2 |
| GramEval2020 | load_gramru | | 162 372 | 30.04 MB | wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip <br> unzip master.zip <br> mv GramEval2020-master/dataTrain train <br> mv GramEval2020-master/dataOpenTest dev <br> rm -r master.zip GramEval2020-master <br> wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu |
| OpenCorpora | load_corpora | morph | 4 030 | 20.21 MB | wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip |
| RusVectores SimLex-965 | load_simlex | emb sim | | | wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv <br> wget https://rusvectores.org/static/testsets/ru_simlex965.tsv |
| Omnia Russica | load_omnia | morph web fiction | | 489.62 GB | Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventscontent/events/2019/corpora/corp_sborn.pdff <br> Manually download http://bit.ly/2zt4by9 |
| factRuEval-2016 | load_factru | news ner | 254 | 969.27 KB | Manual PER, LOC, ORG markup prepared for the Dialogue 2016 competition <br> wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip <br> unzip master.zip <br> rm master.zip |
| Gareev | load_gareev | news ner | 97 | 455.02 KB | Manual PER, ORG markup (no LOC) <br> Email Rinat Gareev ([email protected]) and ask for the dataset <br> tar -xvf rus-ner-news-corpus.iob.tar.gz <br> rm rus-ner-news-corpus.iob.tar.gz |
| Collection5 | load_ne5 | news ner | 1 000 | 2.96 MB | News articles with manual PER, LOC, ORG markup <br> wget http://www.labinform.ru/pub/named_entities/collection5.zip <br> unzip collection5.zip <br> rm collection5.zip |
| WiNER | load_wikiner | ner | 203 287 | 36.15 MB | Sentences from Wiki auto-annotated with PER, LOC, ORG tags <br> wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2 |
| BSNLP-2019 | load_bsnlp | ner | 464 | 1.16 MB | Markup prepared for the BSNLP 2019 shared task <br> wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip <br> wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip <br> unzip TRAININGDATA_BSNLP_2019_shared_task.zip <br> unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg <br> rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip |
| Persons-1000 | load_persons | news ner | 1 000 | 2.96 MB | Same as Collection5, only PER markup + normalized names <br> wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip |
| The Russian Drug Reaction Corpus (RuDReC) | load_rudrec | ner | 4 809 | 1.73 KB | RuDReC is a new annotated corpus of consumer reviews in Russian about pharmaceutical products, built for detecting health-related entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part; to get the raw part (1.4M reviews), please refer to https://github.com/cimm-kzn/rudrec <br> wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json |
| Taiga | | | | | Large collection of Russian texts from various sources: news sites, magazines, literature, social networks <br> wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz <br> tar -xzvf retagged_taiga.tar.gz |
| Arzamas | load_taiga_arzamas | news | 311 | 4.50 MB | |
| Fontanka | load_taiga_fontanka | news | 342 683 | 786.23 MB | |
| Interfax | load_taiga_interfax | news | 46 429 | 77.55 MB | |
| KP | load_taiga_kp | news | 45 503 | 61.79 MB | |
| Lenta | load_taiga_lenta | news | 36 446 | 95.15 MB | |
| N+1 | load_taiga_nplus1 | news | 7 696 | 24.96 MB | |
| Magazines | load_taiga_magazines | | 39 890 | 2.19 GB | |
| Subtitles | load_taiga_subtitles | | 19 011 | 909.08 MB | |
| Social | load_taiga_social | social | 1 876 442 | 648.18 MB | |
| Proza | load_taiga_proza | fiction | 1 732 434 | 38.25 GB | |
| Stihi | load_taiga_stihi | | 9 157 686 | 12.80 GB | |
| Russian NLP Datasets | | | | | Several Russian news datasets from webhose.io, lenta.ru and other news sites. |
| News | load_buriy_news | news | 2 154 801 | 6.84 GB | Dump of top 40 news sites + 20 fashion news sites <br> wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2 <br> wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2 <br> wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2 |
| Webhose | load_buriy_webhose | news | 285 965 | 859.32 MB | Dump from webhose.io, 300 sources for one month <br> wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2 |
| ODS #proj_news_viz | | | | | Several news sites scraped by members of the ODS #proj_news_viz project. |
| Interfax | load_ods_interfax | news | 543 961 | 1.22 GB | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz |
| Gazeta | load_ods_gazeta | news | 865 847 | 1.63 GB | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz |
| Izvestia | load_ods_izvestia | news | 86 601 | 307.19 MB | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz |
| Meduza | load_ods_meduza | news | 71 806 | 270.11 MB | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz |
| RIA | load_ods_ria | news | 101 543 | 233.88 MB | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz |
| Russia Today | load_ods_rt | news | 106 644 | 187.12 MB | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz |
| TASS | load_ods_tass | news | 1 135 635 | 3.27 GB | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz |
| Universal Dependencies | | | | | |
| GSD | load_ud_gsd | syntax morph | 5 030 | 1.01 MB | wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu <br> wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu <br> wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu |
| Taiga | load_ud_taiga | syntax morph | 3 264 | 353.80 KB | wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu <br> wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu <br> wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu |
| PUD | load_ud_pud | syntax morph | 1 000 | 207.78 KB | wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu |
| SynTagRus | load_ud_syntag | syntax morph | 61 889 | 11.33 MB | wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu <br> wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu <br> wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu |
| morphoRuEval-2017 | | | | | |
| General Internet-Corpus | load_morphoru_gicrya | morph | 83 148 | 10.58 MB | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip <br> unzip GIKRYA_texts_new.zip <br> rm GIKRYA_texts_new.zip |
| Russian National Corpus | load_morphoru_rnc | morph | 98 892 | 12.71 MB | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar <br> unrar x RNC_texts.rar <br> rm RNC_texts.rar |
| OpenCorpora | load_morphoru_corpora | morph | 38 510 | 4.80 MB | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar <br> unrar x OpenCorpora_Texts.rar <br> rm OpenCorpora_Texts.rar |
| RUSSE Russian Semantic Relatedness | | | | | |
| HJ: Human Judgements of Word Pairs | load_russe_hj | emb sim | | | wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv |
| RT: Synonyms and Hypernyms from the Thesaurus RuThes | load_russe_rt | emb sim | | | wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv |
| AE: Cognitive Associations from the Sociation.org Experiment | load_russe_ae | emb sim | | | wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv <br> wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv <br> wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv |
| Toloka Datasets | | | | | |
| Lexical Relations from the Wisdom of the Crowd (LRWC) | load_toloka_lrwc | emb sim | | | wget https://tlk.s3.yandex.net/dataset/LRWC.zip <br> unzip LRWC.zip <br> rm LRWC.zip |
| The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT) | load_ruadrect | social | 9 515 | 2.09 MB | This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020 <br> wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip <br> unzip RuADReCT.zip <br> rm RuADReCT.zip |
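
Many of the archives in the table are gzipped CSV files. The sketch below shows one way a streaming loader for such a file could be written with the standard library only. It is not corus's implementation: `NewsRecord`, `load_news_csv_gz`, the sample file, and the assumed column layout (a header row followed by url, title, text, topic, tags) are all hypothetical, chosen to match the fields of the LentaRecord example.

```python
import csv
import gzip
import os
import tempfile
from collections import namedtuple

# Hypothetical record type; field names follow the LentaRecord example.
NewsRecord = namedtuple('NewsRecord', ['url', 'title', 'text', 'topic', 'tags'])


def load_news_csv_gz(path):
    """Yield one NewsRecord per CSV row, decompressing the archive lazily."""
    with gzip.open(path, mode='rt', encoding='utf-8', newline='') as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row (assumed to be present)
        for row in reader:
            yield NewsRecord(*row)


# Build a tiny sample archive so the sketch runs without any download.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample-news.csv.gz')
with gzip.open(sample_path, mode='wt', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['url', 'title', 'text', 'topic', 'tags'])
    writer.writerow(['https://example.org/1', 'First', 'Body one', 'Россия', 'Общество'])
    writer.writerow(['https://example.org/2', 'Second', 'Body, with comma', 'Мир', 'Политика'])

records = list(load_news_csv_gz(sample_path))
print(len(records), records[1].text)
```

Because the function is a generator, a multi-gigabyte archive is never held in memory at once; the caller decides how many records to consume.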
corus/sources/<source>.py
corus/sources/__init__.py
corus/source/meta.py
docs.ipynb (check that the meta table is correct)
Dev env
python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate
pip install -r requirements/dev.txt
pip install -e .
python -m ipykernel install --user --name natasha-corus
Lint + update docs
make lint
make exec-docs
Release
# Update setup.py version
git commit -am 'Up version'
git tag v0.10.0
git push
git push --tags