corus

Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350GB+ of text.

Usage

For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (link in the Reference section):

wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Use corus to load the data:

>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
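
The title in the record above contains non-breaking spaces (U+00A0, written `\xa0` in a Python repr). If downstream code expects plain spaces, a minimal sketch like the following could normalize them. The helper name `normalize_spaces` is an assumption made for illustration and is not part of corus:

```python
def normalize_spaces(text):
    # Replace non-breaking spaces (U+00A0) with regular spaces;
    # every other character is left untouched.
    return text.replace('\xa0', ' ')


# Title copied from the example record above.
title = 'Названы регионы России с\xa0самой высокой смертностью от\xa0рака'
clean_title = normalize_spaces(title)
print(clean_title)
```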

Iterate over texts:

>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...

For links to other datasets and their loaders see the Reference section.
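
Because `next(records)` works in the session above, the loader appears to return a lazy iterator, so it should be possible to peek at a few records without reading the whole archive. The sketch below assumes only that each record has a `.text` attribute. `FakeRecord`, `fake_records` and `head_texts` are hypothetical stand-ins so the example runs without corus or the dataset; in practice you would pass `load_lenta(path)` instead of `fake_records()`.

```python
from collections import namedtuple
from itertools import islice

# Stand-in for a corus record; only the fields shown in the example output.
FakeRecord = namedtuple('FakeRecord', ['url', 'title', 'text', 'topic', 'tags'])


def fake_records():
    # Generator that mimics a lazy loader by yielding records one at a time.
    for index in range(5):
        yield FakeRecord(
            url=None,
            title='title %d' % index,
            text='text %d' % index,
            topic=None,
            tags=None,
        )


def head_texts(records, limit):
    # islice stops after `limit` items, so the rest of the iterator is never read.
    return [record.text for record in islice(records, limit)]


sample = head_texts(fake_records(), 2)
print(sample)
```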

Documentation

Materials are in Russian:

Corus page on natasha.github.io
Corus section of Datafest 2020 talk

Install

corus supports Python 3.5+, PyPy 3.

$ pip install corus

Reference

Dataset	API `from corus import`	Tags	Texts	Uncompressed	Description
Lenta.ru
Lenta.ru v1.0	`load_lenta` `#`	`news`	739 351	1.66 GB	`wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz`
Lenta.ru v1.1+	`load_lenta2` `#`	`news`	800 975	1.94 GB	`wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2`
Lib.rus.ec	`load_librusec` `#`	`fiction`	301 871	144.92 GB	Dump of lib.rus.ec prepared for RUSSE workshop `wget http://panchenko.me/data/russe/librusec_fb2.plain.gz`
Rossiya Segodnya	`load_ria_raw` `#` `load_ria` `#`	`news`	1 003 869	3.70 GB	`wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz`
Mokoron Russian Twitter Corpus	`load_mokoron` `#`	`social` `sentiment`	17 633 417	1.86 GB	Russian Twitter sentiment markup. Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql
Wikipedia	`load_wiki` `#`		1 541 401	12.94 GB	Russian wiki dump `wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2`
GramEval2020	`load_gramru` `#`		162 372	30.04 MB	`wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip` `unzip master.zip` `mv GramEval2020-master/dataTrain train` `mv GramEval2020-master/dataOpenTest dev` `rm -r master.zip GramEval2020-master` `wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu`
OpenCorpora	`load_corpora` `#`	`morph`	4 030	20.21 MB	`wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip`
RusVectores SimLex-965	`load_simlex` `#`	`emb` `sim`			`wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv` `wget https://rusvectores.org/static/testsets/ru_simlex965.tsv`
Omnia Russian	`load_omnia` `#`	`morph` `web` `fiction`		489.62 GB	Taiga + Wiki + Araneum. Read "Even larger Russian corpus" (https://events.spbu.ru/eventscontent/events/2019/corpora/corp_sborn.pdf). Manually download http://bit.ly/2zt4by9
factRuEval-2016	`load_factru` `#`	`ner` `news`	254	969.27 KB	Manual PER, LOC, ORG markup prepared for 2016 Dialog competition `wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip` `unzip master.zip` `rm master.zip`
Gareev	`load_gareev` `#`	`ner` `news`	97	455.02 KB	Manual PER, ORG markup (no LOC). Email Rinat Gareev ([email protected]) to ask for the dataset `tar -xvf rus-ner-news-corpus.iob.tar.gz` `rm rus-ner-news-corpus.iob.tar.gz`
Collection5	`load_ne5` `#`	`ner` `news`	1 000	2.96 MB	News articles with manual PER, LOC, ORG markup `wget http://www.labinform.ru/pub/named_entities/collection5.zip` `unzip collection5.zip` `rm collection5.zip`
WiNER	`load_wikiner` `#`	`ner`	203 287	36.15 MB	Sentences from Wiki auto-annotated with PER, LOC, ORG tags `wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2`
BSNLP-2019	`load_bsnlp` `#`	`ner`	464	1.16 MB	Markup prepared for 2019 BSNLP Shared Task `wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip` `wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip` `unzip TRAININGDATA_BSNLP_2019_shared_task.zip` `unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg` `rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip`
Persons-1000	`load_persons` `#`	`ner` `news`	1 000	2.96 MB	Same as Collection5, only PER markup + normalized names `wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip`
The Russian Drug Reaction Corpus (RuDReC)	`load_rudrec` `#`	`ner`	4 809	1.73 KB	RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part; to get the raw part (1.4M reviews), please refer to https://github.com/cimm-kzn/rudrec. `wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json`
Taiga	Large collection of Russian texts from various sources: news sites, magazines, literature, social networks `wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz` `tar -xzvf retagged_taiga.tar.gz`
Arzamas	`load_taiga_arzamas` `#`	`news`	311	4.50 MB
Fontanka	`load_taiga_fontanka` `#`	`news`	342 683	786.23 MB
Interfax	`load_taiga_interfax` `#`	`news`	46 429	77.55 MB
KP	`load_taiga_kp` `#`	`news`	45 503	61.79 MB
Lenta	`load_taiga_lenta` `#`	`news`	36 446	95.15 MB
N+1	`load_taiga_nplus1` `#`	`news`	7 696	24.96 MB
Magazines	`load_taiga_magazines` `#`		39 890	2.19 GB
Subtitles	`load_taiga_subtitles` `#`		19 011	909.08 MB
Social	`load_taiga_social` `#`	`social`	1 876 442	648.18 MB
Proza	`load_taiga_proza` `#`	`fiction`	1 732 434	38.25 GB
Stihi	`load_taiga_stihi` `#`		9 157 686	12.80 GB
Russian NLP Datasets	Several Russian news datasets from webhose.io, lenta.ru and other news sites.
News	`load_buriy_news` `#`	`news`	2 154 801	6.84 GB	Dump of top 40 news + 20 fashion news sites. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2`
Webhose	`load_buriy_webhose` `#`	`news`	285 965	859.32 MB	Dump from webhose.io, 300 sources for one month. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2`
ODS #proj_news_viz	Several news sites scraped by members of the #proj_news_viz ODS project.
Interfax	`load_ods_interfax` `#`	`news`	543 961	1.22 GB	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz`
Gazeta	`load_ods_gazeta` `#`	`news`	865 847	1.63 GB	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz`
Izvestia	`load_ods_izvestia` `#`	`news`	86 601	307.19 MB	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz`
Meduza	`load_ods_meduza` `#`	`news`	71 806	270.11 MB	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz`
RIA	`load_ods_ria` `#`	`news`	101 543	233.88 MB	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz`
Russia Today	`load_ods_rt` `#`	`news`	106 644	187.12 MB	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz`
TASS	`load_ods_tass` `#`	`news`	1 135 635	3.27 GB	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz`
Universal Dependencies
GSD	`load_ud_gsd` `#`	`morph` `syntax`	5 030	1.01 MB	`wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu`
Taiga	`load_ud_taiga` `#`	`morph` `syntax`	3 264	353.80 KB	`wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu`
PUD	`load_ud_pud` `#`	`morph` `syntax`	1 000	207.78 KB	`wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu`
SynTagRus	`load_ud_syntag` `#`	`morph` `syntax`	61 889	11.33 MB	`wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu`
morphoRuEval-2017
General Internet-Corpus	`load_morphoru_gicrya` `#`	`morph`	83 148	10.58 MB	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip` `unzip GIKRYA_texts_new.zip` `rm GIKRYA_texts_new.zip`
Russian National Corpus	`load_morphoru_rnc` `#`	`morph`	98 892	12.71 MB	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar` `unrar x RNC_texts.rar` `rm RNC_texts.rar`
OpenCorpora	`load_morphoru_corpora` `#`	`morph`	38 510	4.80 MB	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar` `unrar x OpenCorpora_Texts.rar` `rm OpenCorpora_Texts.rar`
RUSSE Russian Semantic Relatedness
HJ: Human Judgements of Word Pairs	`load_russe_hj` `#`	`emb` `sim`			`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv`
RT: Synonyms and Hypernyms from the Thesaurus RuThes	`load_russe_rt` `#`	`emb` `sim`			`wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv`
AE: Cognitive Associations from the Sociation.org Experiment	`load_russe_ae` `#`	`emb` `sim`			`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv` `wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv` `wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv`
Toloka Datasets
Lexical Relations from the Wisdom of the Crowd (LRWC)	`load_toloka_lrwc` `#`	`emb` `sim`			`wget https://tlk.s3.yandex.net/dataset/LRWC.zip` `unzip LRWC.zip` `rm LRWC.zip`
The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)	`load_ruadrect` `#`	`social`	9 515	2.09 MB	This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020 `wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip` `unzip RuADReCT.zip` `rm RuADReCT.zip`

Support

Chat - https://t.me/natural_language_processing
Issues - https://github.com/natasha/corus/issues
Commercial Support - https://lab.alexkuk.ru

Add New Source

Implement corus/sources/<source>.py
Add import into corus/sources/__init__.py
Add meta into corus/sources/meta.py
Add example into docs.ipynb (check meta table is correct)
Run tests (readme is updated)
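
As a rough illustration of the first step, a source module could expose a generator that reads the downloaded archive lazily and yields one record per row. This is only a sketch under stated assumptions, not how corus itself implements its sources: `ExampleRecord`, `load_example` and the `url,title,text` column layout are all hypothetical, and the snippet builds its own tiny archive so it runs without downloading anything.

```python
import csv
import gzip
import os
import tempfile
from collections import namedtuple

# Hypothetical record type; real corus sources define their own record classes.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    # Lazily yield one record per row of a gzipped CSV file with a header row.
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        for row in csv.DictReader(file):
            yield ExampleRecord(row['url'], row['title'], row['text'])


# Build a two-row demo archive in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'example.csv.gz')
with gzip.open(demo_path, mode='wt', encoding='utf8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['url', 'title', 'text'])
    writer.writerow(['https://example.org/1', 'First', 'Первый текст'])
    writer.writerow(['https://example.org/2', 'Second', 'Второй текст'])

demo_records = list(load_example(demo_path))
print(demo_records[0].title)
```

Yielding records instead of returning a list keeps memory use flat, which matters for the multi-gigabyte dumps listed in the Reference section.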

Development

Dev env

python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate

pip install -r requirements/dev.txt
pip install -e .

python -m ipykernel install --user --name natasha-corus

Lint + Update Docs

make lint
make exec-docs

Release

# Update setup.py version

git commit -am 'Up version'
git tag v0.10.0

git push
git push --tags
