This repository contains only the datasets that I created (usually automatically, sometimes with manual editing) for solving various tasks with Russian-language texts.
Dialogs from an imageboard, strictly 18+. A certain number of broken dialogs remain, since it is very difficult to filter them out automatically:
Part 1 Part 2 Part 3 Part 4 Part 5 Part 6
Relevance and specificity scores for the replies in these dialogs, as a JSONL file for selecting the highest-quality dialogs:
Part 1 Part 2 Part 3 Part 4 Part 5 Part 6 Part 7 Part 8 Part 9 Part 10 Part 11 Part 12
Scoring code: tinkoff_model_dialogues_scoring.py
To unpack this archive, first combine the files into one:
cat chan_dialogues_scored.zip* > 1.zip
Then unpack it to get a 700 MB JSON file:
unzip 1.zip
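After unpacking, the per-reply scores can be used to keep only the best dialogs. Below is a minimal sketch, assuming one JSON record per line with the dialog text and its relevance/specificity scores; the field names (dialog, scores, relevance, specificity) and the unpacked file name are illustrative assumptions, not the actual schema:

import json

def load_good_dialogs(path, min_relevance=0.5, min_specificity=0.5):
    # Keep a dialog only if every scored reply passes both thresholds.
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            scores = record.get('scores', [])  # assumed field name
            if scores and all(s.get('relevance', 0.0) >= min_relevance and
                              s.get('specificity', 0.0) >= min_specificity
                              for s in scores):
                yield record['dialog']  # assumed field name

good_dialogs = list(load_good_dialogs('chan_dialogues_scored.json'))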
Dialogs from jokes: about 90,000 dialogs collected from various entertainment sites. An expanded version of this dataset with different formatting is available at Inkoziev/Jokes_Dialogues.
Cornell Movie Corpus cleaned dialogs: cleaned subtitles; many dialogs start "from the middle".
Dialogs from fiction (Flibusta), about 400 MB after unpacking:
Part 1 Part 2
More Russian-language dialogs from fiction: over 130 MB, collected from fiction and similar sources. There are some relatively short dialogs, and a small amount of garbage remains after automatic cleaning.
Example code for training a chitchat model on one of the above datasets: train_chitchat_rugpt.py. In the code you need to adjust the paths to the dataset and to the directory where the model will be saved, as well as the BATCH_SIZE.
You can try out a trained chitchat model using the run_chitchat_query.py code. For example, for the request «Дай денег в долг» ("lend me some money"), a chitchat model trained on the "jokes" dialogs gives approximately the following reply options:
[1] - Откуда у меня деньги?!
[2] - А ты мне что, должен?
[3] - А зачем?
[4] - Что, опять?
[5] - На себя и детей?
[6] - У меня денег нет.
[7] - Откуда у меня деньги?
[8] - Нет.
[9] - Не дам!
[10] - Не дам!
Synthetic question-answer pairs with arithmetic problems: Train, Test. A significantly expanded version of this dataset with long dialogs can be found in the Inkoziev/Arithmetic repository.
A ready-made generative chitchat model, trained on some of the above datasets, is available here: https://huggingface.co/inkoziev/rugpt_chitchat
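If you want to query this model programmatically, here is a minimal sketch using the transformers library. The dash-prefixed dialog format in the prompt is an assumption; consult the model card for the exact expected format:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'inkoziev/rugpt_chitchat'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = '- Дай денег в долг\n-'  # assumed prompt format
input_ids = tokenizer.encode(prompt, return_tensors='pt')
with torch.no_grad():
    output = model.generate(input_ids, do_sample=True, top_p=0.9,
                            max_new_tokens=30,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))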
The dataset is available in the Inkoziev/Paraphrases repository. It is used to train the Inkoziev/SBERT_SYNONYMY model and for paraphrase generation in the Inkoziev/Paraphraser project.
These datasets are used to train a chatbot. They contain short sentences extracted from a large text corpus, as well as some patterns and phrases.
The archive Templates.clause_with_np.100000.zip contains a portion of these data:
52669 есть#NP,Nom,Sing#.
25839 есть#NP,Nom,Plur#.
18371 NP,Masc,Nom,Sing#пожал#NP,Ins#.
17709 NP,Masc,Nom,Sing#покачал#NP,Ins#.
The first column is the frequency. In total, approximately 21 million sentences were collected.
The second column contains the result of shallow parsing, in which noun phrases are replaced with substitution masks of the form NP,<tags>. The tags specify the case, as well as the number and grammatical gender in those cases where they are needed for proper agreement with the verb. For example, the mask NP,Nom,Sing describes a noun phrase in the nominative case and singular number. The '#' symbol is used as a separator between words and chunks.
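A minimal sketch of reading such template lines, based only on the format described above (a frequency column, then '#'-separated words and NP masks):

import re

MASK_RE = re.compile(r'^NP(,[A-Za-z]+)*$')

def parse_template(line):
    # '52669 есть#NP,Nom,Sing#.' -> (52669, ['есть', 'NP,Nom,Sing', '.'])
    freq_str, template = line.strip().split(None, 1)
    chunks = [c for c in template.split('#') if c]
    masks = [c for c in chunks if MASK_RE.match(c)]
    return int(freq_str), chunks, masks

freq, chunks, masks = parse_template('52669 есть#NP,Nom,Sing#.')
print(freq, chunks, masks)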
The archive PRN+Preposadj+V.zip contains samples of the following form:
Я на автобус опоздаю
Я из автобуса пришел
Мы из автобуса вышли
Я из автобуса вышла
Я из автобуса видел
Я на автобусах езжу
Они на автобусах приезжают
Мы на автобусах объездили
The archive ADV+Verb.zip contains adverb + verb (in finite form) samples:
Прямо арестовали
Лично атаковал
Немо атаковал
Ровно атаковала
Сегодня атакует
Ближе аттестует
Юрко ахнул
The archive ADJ+Noun.zip contains samples of the following form:
Почетным абонентом
Вашим абонентом
Калининским абонентом
Калининградских аборигенов
Тунисских аборигенов
Байкальских аборигенов
Марсианских аборигенов
Голландские аборигены
A newer and expanded version of this set, collected in a different way, is located in the archive Patterns.adj_noun.zip. This dataset looks like this:
8 смутное предчувствие
8 городская полиция
8 среднеазиатские государства
8 чудесное средство
8 <<<null>>> претендентка
8 испанский король
The token <<<null>>> in place of an adjective means that the noun is used without an attributive adjective. Such records are needed for correct marginalization of the phrase usage frequencies.
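To make the role of the <<<null>>> records concrete, here is a minimal sketch of marginalizing the adjective away to obtain total per-noun frequencies (the three-column layout is taken from the fragment above):

from collections import Counter

def noun_frequencies(lines):
    # Sum frequencies over all adjectives, including <<<null>>>, for each noun.
    totals = Counter()
    for line in lines:
        freq, adj, noun = line.strip().split(None, 2)
        totals[noun] += int(freq)
    return totals

sample = ['8 смутное предчувствие', '8 <<<null>>> претендентка', '8 испанский король']
print(noun_frequencies(sample))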
The archive PREP+Noun.zip contains patterns such as:
У аборигенных народов
У аборигенных кобыл
Из аборигенных пород
С помощью аборигенов
На аборигенов
Для аборигенов
От аборигенов
У аборигенов
The archive Patterns.noun_gen.zip contains patterns of two nouns, the second of which is in the genitive case:
4 французские <<<null>>>
4 дворец фестивалей
4 названье мест
4 классы вагонов
4 доступность магазина
Note that if the genitive noun had dependent adjectives or prepositional phrases in the source sentence, they are removed in this dataset. The token <<<null>>> in the genitive column marks the situation where the first noun is used without a genitive. These records simplify the marginalization of frequencies.
The archive Patterns.noun_np_gen.zip contains patterns consisting of a noun and its full genitive noun phrase on the right:
окно браузера
течение дня
укус медведки
изюминка такой процедуры
суть декларации
рецепт вкусного молочного коктейля
музыка самого высокого уровня
The archive S+V.zip contains samples of this type:
Мы абсолютно не отказали.
Мужчина абсолютно не пострадал.
Они абсолютно совпадают.
Михаил абсолютно не рисковал.
Я абсолютно не выспалась.
Они абсолютно не сочетаются.
Я абсолютно не обижусь...
The archive S+V+Inf.zip contains samples such as:
Заславский бахвалился превратить
Ленка бегает поспать
Она бегает умываться
Альбина бегает мерить
Вы бегаете жаловаться
Димка бегал фотографироваться
The archive S+V+Indobj.zip contains automatically collected patterns of the form subject + verb + preposition + noun:
Встревоженный аббат пошел навстречу мэру.
Бывший аббат превратился в настоятеля.
Старый Абдуррахман прохаживался возле дома.
Лопоухий абориген по-прежнему был в прострации.
Высокий абориген вернулся с граблями;
Сморщенный абориген сидел за столиком.
The archive S+V+Accus.zip contains samples of the following form:
Мой агент кинул меня.
Ричард аккуратно поднял Диану.
Леха аккуратно снял Аленку...
Они активируют новые мины!
Адмирал активно поддержал нас.
The archive S+V+Instr.zip contains samples like:
Я вертел ими
Они вертели ими
Вы вертели мной
Он вертит нами
Она вертит тобой
Она вертит мной
Он вертит ими
Она вертит ими
The archive S+Instr+V.zip contains samples such as:
Я тобой брезгую
Они ими бреются
Они ими вдохновляются
Мы ими вертим
Она тобой вертит
Он мной вертит
Он ими вертит
The remaining samples are complete sentences. For convenience in training dialogue models, the data are divided into 3 groups (first-person sentences, second-person sentences, and everything else):
Я только продаю!
Я не курю.
Я НЕ ОТПРАВЛЯЮ!
Я заклеил моментом.
Ездил только я.
Как ты поступишь?
Ты это читаешь?
Где ты живешь?
Док ты есть.
Ты видишь меня.
Фонарь имел металлическую скобу.
Щенок ищет добрых хозяев.
Массажные головки имеют встроенный нагрев
Бусины переливаются очень красиво!
The sentences in the datasets facts4_1s.txt, facts5_1s.txt, facts5_2s.txt, facts4.txt, facts6_1s.txt, facts6_2s.txt are sorted with the sort_facts_by_lsa_tsne.py code. The idea of the sorting is as follows. We first run LSA on the sentences in the file, obtaining vectors of dimension 60 (see the LSA_DIMS constant in the code). These vectors are then embedded into a one-dimensional space with t-SNE, so that each sentence ends up with a real number, such that sentences close to each other in the LSA space have a small difference between their t-SNE coordinates. Finally, the sentences are sorted by this t-SNE coordinate and the resulting list is saved.
The sentences in the remaining files are sorted by the sort_samples_by_kenlm.py program in order of decreasing probability. The probability of a sentence is computed with a pretrained 3-gram KenLM language model.
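A minimal sketch of this sorting idea with scikit-learn (this is an illustration, not the actual sort_facts_by_lsa_tsne.py; the vectorizer and t-SNE settings are illustrative):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

def sort_sentences(sentences, lsa_dims=60):
    # LSA: tf-idf matrix reduced to lsa_dims dimensions with truncated SVD
    tfidf = TfidfVectorizer().fit_transform(sentences)
    lsa_vectors = TruncatedSVD(n_components=lsa_dims).fit_transform(tfidf)
    # Embed into one dimension with t-SNE: nearby LSA vectors get close coordinates
    coords = TSNE(n_components=1, init='random').fit_transform(lsa_vectors)
    order = np.argsort(coords[:, 0])
    return [sentences[i] for i in order]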
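A minimal sketch of such sorting with the kenlm Python module (the model file name is a placeholder):

import kenlm

model = kenlm.Model('ru_3gram.binary')  # placeholder path to a 3-gram model

def sort_by_probability(sentences):
    # model.score returns the log10 probability of a sentence
    return sorted(sentences,
                  key=lambda s: model.score(s, bos=True, eos=True),
                  reverse=True)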
The file questions_2s.txt with questions containing a finite verb in the second person singular is posted separately. These questions were collected from a large text corpus scraped from forums, subtitles and so on. For convenience, the samples are sorted by the finite verb:
Берёшь 15 долларов ?
Берёшь денёк на отгул?
Берёшь отпуск за свой счёт?
Берёшь с собой что-нибудь на букву «К»?
Беспокоишься за меня?
Беспокоишься из-за Питера?
Беспокоишься из-за чего?
The questions were selected automatically using a POS tagger and may contain a small number of erroneous samples.
The task and the dataset are described on the official page of the competition. The original dataset provided by the organizers is available at the link. Anaphora in it were resolved with the extract_anaphora.py script, which produced a dataset that is easier to use for training the chatbot. For example, a data fragment:
1 159 Кругом кругом R
1 166 она она P-3fsnn одинокую дачу
1 170 была быть Vmis-sfa-e
1 175 обнесена обнесена Vmps-sfpsp
1 184 высоким высокий Afpmsif
1 192 забором забор Ncmsin
You can see that the pronoun «она» ("she") is resolved to the phrase «одинокую дачу» ("lonely dacha"). Bringing the resolved phrase to the correct grammatical form is left for the next stage.
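A minimal sketch of reading rows of this fragment; the meaning of the first two columns (index and offset) is inferred from the fragment, not from a spec:

def parse_row(line):
    parts = line.split()
    idx, offset, form, lemma, tag = parts[:5]
    antecedent = ' '.join(parts[5:]) or None  # resolved anaphora, if present
    return int(idx), int(offset), form, lemma, tag, antecedent

print(parse_row('1 166 она она P-3fsnn одинокую дачу'))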
Packed TSV file.
The data were collected for the problem from the ClassicAI contest. Open data sources were used: Wikipedia and Wiktionary. In cases where the stress was known only for the normal form of a word (its lemma), I used the inflection table from the grammatical dictionary and generated records with the stress mark. This assumes that the stress position in a word does not change when it is declined or conjugated. For a certain number of Russian words this is not the case, for example:
р^еки (nominative case, plural)
рек^и (genitive case, singular)
In such cases, the dataset contains only one of the stress variants.
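A minimal sketch of this propagation, under the same assumption that the index of the stressed vowel stays fixed across inflected forms (the '^' mark before the stressed vowel follows the example above; this illustrates the assumption, not the actual generation code):

RU_VOWELS = set('аеёиоуыэюя')

def vowel_index(word, stressed_pos):
    # Which vowel (by count) is stressed in the lemma
    return sum(1 for ch in word[:stressed_pos] if ch in RU_VOWELS)

def mark_stress(form, vowel_idx):
    # Put '^' before the vowel with the given index in an inflected form
    seen = -1
    for i, ch in enumerate(form):
        if ch in RU_VOWELS:
            seen += 1
            if seen == vowel_idx:
                return form[:i] + '^' + form[i:]
    return form  # fewer vowels than expected: leave unmarked

# lemma 'река' stressed on its second vowel (рек^а)
idx = vowel_index('река', 3)
print(mark_stress('реки', idx))  # 'рек^и': right for the genitive, wrong for 'р^еки'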
The datasets contain numerical estimates of how much more often words are used together than separately. For details about the contents and how the datasets were obtained, see the separate page.
The sentence pairs in these samples can be useful for training models that form part of a chatbot. The data look like this:
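The exact measure is defined on that page; purely as an illustration of the general idea, here is a sketch of pointwise mutual information (PMI), a standard estimate of how much more often two words occur together than they would independently. Treating the dataset's measure as PMI-like is an assumption:

import math

def pmi(pair_count, w1_count, w2_count, total_pairs, total_words):
    # log2( P(w1, w2) / (P(w1) * P(w2)) )
    p_pair = pair_count / total_pairs
    p1 = w1_count / total_words
    p2 = w2_count / total_words
    return math.log2(p_pair / (p1 * p2))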
Я часто захожу ! ты часто заходишь !
Я сам перезвоню . ты сам перезвонишь .
Я Вам перезвоню ! ты Вам перезвонишь !
Я не пью . ты не пьешь .
Each line contains two sentences separated by a tab character.
The datasets are generated automatically from a large corpus of sentences.
Premise-question-answer triples for sentences of 3 words
Premise-question-answer triples for sentences of 4 words
An example of data in the above files:
T: Собственник заключает договор аренды
Q: собственник заключает что?
A: договор аренды
T: Спереди стоит защитное бронестекло
Q: где защитное бронестекло стоит?
A: спереди
The premise-question-answer groups are separated by empty lines. The premise is prefixed with the label T:, the question with the label Q:, and the answer with the label A:.
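A minimal sketch of reading this T:/Q:/A: format:

def read_triples(path):
    triple = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # an empty line closes the current group
                if triple:
                    yield triple['T'], triple['Q'], triple['A']
                    triple = {}
            else:
                label, text = line.split(':', 1)
                triple[label.strip()] = text.strip()
    if triple:  # last group if the file does not end with an empty line
        yield triple['T'], triple['Q'], triple['A']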
Dataset with lemmas
The archive contains a list of word forms and their lemmas, taken from the grammatical dictionary of the Russian language. A certain number of words (a few percent) have ambiguous lemmatization; for example, «рой» can be a form of the verb «рыть» ("to dig") or a noun ("swarm"). In such cases the context of the word has to be taken into account; this is how the rulemma library works, for example.
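A minimal sketch of loading such a form-lemma list into a lookup table; the two-column, tab-separated layout and the file name are assumptions:

from collections import defaultdict

def load_lemmas(path):
    # Map each word form to the set of its possible lemmas.
    form2lemmas = defaultdict(set)
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 2:
                continue  # skip malformed lines
            form2lemmas[parts[0].lower()].add(parts[1].lower())
    return form2lemmas

lemmas = load_lemmas('wordform_lemmas.tsv')  # placeholder file name
print(lemmas.get('рой'))  # ambiguous: both the verb 'рыть' and the noun 'рой'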
Dataset with chunk markup
The dataset contains sentences in which NP chunks are marked. The first field in each record contains the word's label:
0 - the word does not belong to an NP chunk
1 - beginning of an NP chunk
2 - continuation of an NP chunk
The markup was obtained by automatic conversion from dependency trees and may contain some artifacts.
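A minimal sketch of turning the 0/1/2 labels into chunk word spans (standard BIO-style decoding; the in-memory word/label lists are assumed):

def labels_to_chunks(words, labels):
    chunks, current = [], []
    for word, label in zip(words, labels):
        if label == 1:  # a new NP chunk starts here
            if current:
                chunks.append(current)
            current = [word]
        elif label == 2 and current:  # continuation of the open chunk
            current.append(word)
        else:  # label 0 (or a stray 2): close any open chunk
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [' '.join(c) for c in chunks]

print(labels_to_chunks(['высоким', 'забором', 'была'], [1, 2, 0]))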
Hand-crafted paraphrases
Word frequencies with parts of speech taken into account
Bringing words to the neutral form («стали» - «сталь»)
Word roots