This repository contains only the datasets that I created (usually automatically, sometimes with manual editing) for solving various tasks with Russian-language texts.
Dialogs from an imageboard, strictly 18+. A certain number of broken dialogs remain, since it is very difficult to filter them out automatically:
Part 1 Part 2 Part 3 Part 4 Part 5 Part 6
Relevance and specificity scores for the replies in these dialogs, as a JSONL file for selecting the highest-quality dialogs:
Part 1 Part 2 Part 3 Part 4 Part 5 Part 6 Part 7 Part 8 Part 9 Part 10 Part 11 Part 12
Scoring code: tinkoff_model_dialogues_scoring.py
To unpack this archive, first combine the files into one:
cat chan_dialogues_scored.zip* > 1.zip
Then unpack it to get a 700 MB JSON file:
unzip 1.zip
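After unpacking, the per-reply scores can be used to keep only the best dialogs. Below is a minimal sketch, assuming one JSON record per line with the dialog text and its relevance/specificity scores; the field names (dialog, scores, relevance, specificity) and the unpacked file name are illustrative assumptions, not the actual schema:

import json

def load_good_dialogs(path, min_relevance=0.5, min_specificity=0.5):
    # Keep a dialog only if every scored reply passes both thresholds.
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            scores = record.get('scores', [])  # assumed field name
            if scores and all(s.get('relevance', 0.0) >= min_relevance and
                              s.get('specificity', 0.0) >= min_specificity
                              for s in scores):
                yield record['dialog']  # assumed field name

good_dialogs = list(load_good_dialogs('chan_dialogues_scored.json'))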
Dialogs from jokes: about 90,000 dialogs collected from various entertainment sites. An expanded version of this dataset with different formatting is available at Inkoziev/Jokes_Dialogues.
Cornell Movie Corpus cleaned dialogs: cleaned subtitles; many dialogs start "from the middle".
Dialogs from fiction (Flibusta), about 400 MB after unpacking:
Part 1 Part 2
More Russian-language dialogs from fiction: over 130 MB, collected from fiction and similar sources. There are some relatively short dialogs, and a small amount of garbage remains after automatic cleaning.
Example code for training a chitchat model on one of the above datasets: train_chitchat_rugpt.py. In the code you need to adjust the paths to the dataset and to the directory where the model will be saved, as well as the BATCH_SIZE.
You can try out a trained chitchat model using the run_chitchat_query.py code. For example, for the request «Дай денег в долг» ("lend me some money"), a chitchat model trained on the "jokes" dialogs gives approximately the following reply options:
[1] - Откуда у меня деньги?!
[2] - А ты мне что, должен?
[3] - А зачем?
[4] - Что, опять?
[5] - На себя и детей?
[6] - У меня денег нет.
[7] - Откуда у меня деньги?
[8] - Нет.
[9] - Не дам!
[10] - Не дам!
Synthetic question-answer pairs with arithmetic problems: Train, Test. A significantly expanded version of this dataset with long dialogs can be found in the Inkoziev/Arithmetic repository.
A ready-made generative chitchat model, trained on some of the above datasets, is available here: https://huggingface.co/inkoziev/rugpt_chitchat
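If you want to query this model programmatically, here is a minimal sketch using the transformers library. The dash-prefixed dialog format in the prompt is an assumption; consult the model card for the exact expected format:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'inkoziev/rugpt_chitchat'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = '- Дай денег в долг\n-'  # assumed prompt format
input_ids = tokenizer.encode(prompt, return_tensors='pt')
with torch.no_grad():
    output = model.generate(input_ids, do_sample=True, top_p=0.9,
                            max_new_tokens=30,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))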
The dataset is available in the Inkoziev/Paraphrases repository. It is used to train the Inkoziev/SBERT_SYNONYMY model and for paraphrase generation in the Inkoziev/Paraphraser project.
These datasets are used to train a chatbot. They contain short sentences extracted from a large text corpus, as well as some patterns and phrases.
The archive Templates.clause_with_np.100000.zip contains a portion of these data:
52669 есть#NP,Nom,Sing#.
25839 есть#NP,Nom,Plur#.
18371 NP,Masc,Nom,Sing#пожал#NP,Ins#.
17709 NP,Masc,Nom,Sing#покачал#NP,Ins#.
The first column is the frequency. In total, approximately 21 million sentences were collected.
The second column contains the result of shallow parsing, in which noun phrases are replaced with substitution masks of the form NP,<tags>. The tags specify the case, as well as the number and grammatical gender in those cases where they are needed for proper agreement with the verb. For example, the mask NP,Nom,Sing describes a noun phrase in the nominative case and singular number. The '#' symbol is used as a separator between words and chunks.
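A minimal sketch of reading such template lines, based only on the format described above (a frequency column, then '#'-separated words and NP masks):

import re

MASK_RE = re.compile(r'^NP(,[A-Za-z]+)*$')

def parse_template(line):
    # '52669 есть#NP,Nom,Sing#.' -> (52669, ['есть', 'NP,Nom,Sing', '.'])
    freq_str, template = line.strip().split(None, 1)
    chunks = [c for c in template.split('#') if c]
    masks = [c for c in chunks if MASK_RE.match(c)]
    return int(freq_str), chunks, masks

freq, chunks, masks = parse_template('52669 есть#NP,Nom,Sing#.')
print(freq, chunks, masks)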
The archive PRN+Preposadj+V.zip contains samples of the following form:
Я на автобус опоздаю
Я из автобуса пришел
Мы из автобуса вышли
Я из автобуса вышла
Я из автобуса видел
Я на автобусах езжу
Они на автобусах приезжают
Мы на автобусах объездили
The archive ADV+Verb.zip contains adverb + verb (in finite form) samples:
Прямо арестовали
Лично атаковал
Немо атаковал
Ровно атаковала
Сегодня атакует
Ближе аттестует
Юрко ахнул
The archive ADJ+Noun.zip contains samples of the following form:
Почетным абонентом
Вашим абонентом
Калининским абонентом
Калининградских аборигенов
Тунисских аборигенов
Байкальских аборигенов
Марсианских аборигенов
Голландские аборигены
A newer and expanded version of this set, collected in a different way, is located in the archive Patterns.adj_noun.zip. This dataset looks like this:
8 смутное предчувствие
8 городская полиция
8 среднеазиатские государства
8 чудесное средство
8 <<<null>>> претендентка
8 испанский король
The token <<<null>>> in place of an adjective means that the noun is used without an attributive adjective. Such records are needed for correct marginalization of the phrase usage frequencies.
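To make the role of the <<<null>>> records concrete, here is a minimal sketch of marginalizing the adjective away to obtain total per-noun frequencies (the three-column layout is taken from the fragment above):

from collections import Counter

def noun_frequencies(lines):
    # Sum frequencies over all adjectives, including <<<null>>>, for each noun.
    totals = Counter()
    for line in lines:
        freq, adj, noun = line.strip().split(None, 2)
        totals[noun] += int(freq)
    return totals

sample = ['8 смутное предчувствие', '8 <<<null>>> претендентка', '8 испанский король']
print(noun_frequencies(sample))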
The archive PREP+Noun.zip contains patterns such as:
У аборигенных народов
У аборигенных кобыл
Из аборигенных пород
С помощью аборигенов
На аборигенов
Для аборигенов
От аборигенов
У аборигенов
The archive Patterns.noun_gen.zip contains patterns of two nouns, the second of which is in the genitive case:
4 французские <<<null>>>
4 дворец фестивалей
4 названье мест
4 классы вагонов
4 доступность магазина
Note that if the genitive noun had dependent adjectives or prepositional phrases in the source sentence, they are removed in this dataset. The token <<<null>>> in the genitive column marks the situation where the first noun is used without a genitive. These records simplify the marginalization of frequencies.
The archive Patterns.noun_np_gen.zip contains patterns consisting of a noun and its full genitive noun phrase on the right:
окно браузера
течение дня
укус медведки
изюминка такой процедуры
суть декларации
рецепт вкусного молочного коктейля
музыка самого высокого уровня
The archive S+V.zip contains samples of this type:
Мы абсолютно не отказали.
Мужчина абсолютно не пострадал.
Они абсолютно совпадают.
Михаил абсолютно не рисковал.
Я абсолютно не выспалась.
Они абсолютно не сочетаются.
Я абсолютно не обижусь...
The archive S+V+Inf.zip contains samples such as:
Заславский бахвалился превратить
Ленка бегает поспать
Она бегает умываться
Альбина бегает мерить
Вы бегаете жаловаться
Димка бегал фотографироваться
The archive S+V+Indobj.zip contains automatically collected patterns of the form subject + verb + preposition + noun:
Встревоженный аббат пошел навстречу мэру.
Бывший аббат превратился в настоятеля.
Старый Абдуррахман прохаживался возле дома.
Лопоухий абориген по-прежнему был в прострации.
Высокий абориген вернулся с граблями;
Сморщенный абориген сидел за столиком.
The archive S+V+Accus.zip contains samples of the following form:
Мой агент кинул меня.
Ричард аккуратно поднял Диану.
Леха аккуратно снял Аленку...
Они активируют новые мины!
Адмирал активно поддержал нас.
The archive S+V+Instr.zip contains samples like:
Я вертел ими
Они вертели ими
Вы вертели мной
Он вертит нами
Она вертит тобой
Она вертит мной
Он вертит ими
Она вертит ими
The archive S+Instr+V.zip contains samples such as:
Я тобой брезгую
Они ими бреются
Они ими вдохновляются
Мы ими вертим
Она тобой вертит
Он мной вертит
Он ими вертит
The remaining samples are complete sentences. For convenience in training dialogue models, the data are divided into 3 groups (first-person sentences, second-person sentences, and everything else):
Я только продаю!
Я не курю.
Я НЕ ОТПРАВЛЯЮ!
Я заклеил моментом.
Ездил только я.
Как ты поступишь?
Ты это читаешь?
Где ты живешь?
Док ты есть.
Ты видишь меня.
Фонарь имел металлическую скобу.
Щенок ищет добрых хозяев.
Массажные головки имеют встроенный нагрев
Бусины переливаются очень красиво!
The sentences in the datasets facts4_1s.txt, facts5_1s.txt, facts5_2s.txt, facts4.txt, facts6_1s.txt, facts6_2s.txt are sorted with the sort_facts_by_lsa_tsne.py code. The idea of the sorting is as follows. We first run LSA on the sentences in the file, obtaining vectors of dimension 60 (see the LSA_DIMS constant in the code). These vectors are then embedded into a one-dimensional space with t-SNE, so that each sentence ends up with a real number, such that sentences close to each other in the LSA space have a small difference between their t-SNE coordinates. Finally, the sentences are sorted by this t-SNE coordinate and the resulting list is saved.
The sentences in the remaining files are sorted by the sort_samples_by_kenlm.py program in order of decreasing probability. The probability of a sentence is computed with a pretrained 3-gram KenLM language model.
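A minimal sketch of this sorting idea with scikit-learn (this is an illustration, not the actual sort_facts_by_lsa_tsne.py; the vectorizer and t-SNE settings are illustrative):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

def sort_sentences(sentences, lsa_dims=60):
    # LSA: tf-idf matrix reduced to lsa_dims dimensions with truncated SVD
    tfidf = TfidfVectorizer().fit_transform(sentences)
    lsa_vectors = TruncatedSVD(n_components=lsa_dims).fit_transform(tfidf)
    # Embed into one dimension with t-SNE: nearby LSA vectors get close coordinates
    coords = TSNE(n_components=1, init='random').fit_transform(lsa_vectors)
    order = np.argsort(coords[:, 0])
    return [sentences[i] for i in order]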
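A minimal sketch of such sorting with the kenlm Python module (the model file name is a placeholder):

import kenlm

model = kenlm.Model('ru_3gram.binary')  # placeholder path to a 3-gram model

def sort_by_probability(sentences):
    # model.score returns the log10 probability of a sentence
    return sorted(sentences,
                  key=lambda s: model.score(s, bos=True, eos=True),
                  reverse=True)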
The file questions_2s.txt with questions containing a finite verb in the second person singular is posted separately. These questions were collected from a large text corpus scraped from forums, subtitles and so on. For convenience, the samples are sorted by the finite verb:
Берёшь 15 долларов ?
Берёшь денёк на отгул?
Берёшь отпуск за свой счёт?
Берёшь с собой что-нибудь на букву «К»?
Беспокоишься за меня?
Беспокоишься из-за Питера?
Беспокоишься из-за чего?
The questions were selected automatically using a POS tagger and may contain a small number of erroneous samples.
The task and the dataset are described on the official page of the competition. The original dataset provided by the organizers is available at the link. Anaphora in it were resolved with the extract_anaphora.py script, which produced a dataset that is easier to use for training the chatbot. For example, a data fragment:
1 159 Кругом кругом R
1 166 она она P-3fsnn одинокую дачу
1 170 была быть Vmis-sfa-e
1 175 обнесена обнесена Vmps-sfpsp
1 184 высоким высокий Afpmsif
1 192 забором забор Ncmsin
You can see that the pronoun «она» ("she") is resolved to the phrase «одинокую дачу» ("lonely dacha"). Bringing the resolved phrase to the correct grammatical form is left for the next stage.
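A minimal sketch of reading rows of this fragment; the meaning of the first two columns (index and offset) is inferred from the fragment, not from a spec:

def parse_row(line):
    parts = line.split()
    idx, offset, form, lemma, tag = parts[:5]
    antecedent = ' '.join(parts[5:]) or None  # resolved anaphora, if present
    return int(idx), int(offset), form, lemma, tag, antecedent

print(parse_row('1 166 она она P-3fsnn одинокую дачу'))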
Packed TSV file.
The data were collected for the problem from the ClassicAI contest. Open data sources were used: Wikipedia and Wiktionary. In cases where the stress was known only for the normal form of a word (its lemma), I used the inflection table from the grammatical dictionary and generated records with the stress mark. This assumes that the stress position in a word does not change when it is declined or conjugated. For a certain number of Russian words this is not the case, for example:
р^еки (nominative case, plural)
рек^и (genitive case, singular)
In such cases, the dataset contains only one of the stress variants.
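A minimal sketch of this propagation, under the same assumption that the index of the stressed vowel stays fixed across inflected forms (the '^' mark before the stressed vowel follows the example above; this illustrates the assumption, not the actual generation code):

RU_VOWELS = set('аеёиоуыэюя')

def vowel_index(word, stressed_pos):
    # Which vowel (by count) is stressed in the lemma
    return sum(1 for ch in word[:stressed_pos] if ch in RU_VOWELS)

def mark_stress(form, vowel_idx):
    # Put '^' before the vowel with the given index in an inflected form
    seen = -1
    for i, ch in enumerate(form):
        if ch in RU_VOWELS:
            seen += 1
            if seen == vowel_idx:
                return form[:i] + '^' + form[i:]
    return form  # fewer vowels than expected: leave unmarked

# lemma 'река' stressed on its second vowel (рек^а)
idx = vowel_index('река', 3)
print(mark_stress('реки', idx))  # 'рек^и': right for the genitive, wrong for 'р^еки'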
The datasets contain numerical estimates of how much more often words are used together than separately. For details about the contents and how the datasets were obtained, see the separate page.
The sentence pairs in these samples can be useful for training models that form part of a chatbot. The data look like this:
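The exact measure is defined on that page; purely as an illustration of the general idea, here is a sketch of pointwise mutual information (PMI), a standard estimate of how much more often two words occur together than they would independently. Treating the dataset's measure as PMI-like is an assumption:

import math

def pmi(pair_count, w1_count, w2_count, total_pairs, total_words):
    # log2( P(w1, w2) / (P(w1) * P(w2)) )
    p_pair = pair_count / total_pairs
    p1 = w1_count / total_words
    p2 = w2_count / total_words
    return math.log2(p_pair / (p1 * p2))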
Я часто захожу ! ты часто заходишь !
Я сам перезвоню . ты сам перезвонишь .
Я Вам перезвоню ! ты Вам перезвонишь !
Я не пью . ты не пьешь .
Each line contains two sentences separated by a tab character.
The datasets are generated automatically from a large corpus of sentences.
Premise-question-answer triples for sentences of 3 words
Premise-question-answer triples for sentences of 4 words
An example of data in the above files:
T: Собственник заключает договор аренды
Q: собственник заключает что?
A: договор аренды
T: Спереди стоит защитное бронестекло
Q: где защитное бронестекло стоит?
A: спереди
The premise-question-answer groups are separated by empty lines. The premise is prefixed with the label T:, the question with the label Q:, and the answer with the label A:.
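A minimal sketch of reading this T:/Q:/A: format:

def read_triples(path):
    triple = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # an empty line closes the current group
                if triple:
                    yield triple['T'], triple['Q'], triple['A']
                    triple = {}
            else:
                label, text = line.split(':', 1)
                triple[label.strip()] = text.strip()
    if triple:  # last group if the file does not end with an empty line
        yield triple['T'], triple['Q'], triple['A']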
Dataset with lemmas
The archive contains a list of word forms and their lemmas, taken from the grammatical dictionary of the Russian language. A certain number of words (a few percent) have ambiguous lemmatization; for example, «рой» can be a form of the verb «рыть» ("to dig") or a noun ("swarm"). In such cases the context of the word has to be taken into account; this is how the rulemma library works, for example.
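A minimal sketch of loading such a form-lemma list into a lookup table; the two-column, tab-separated layout and the file name are assumptions:

from collections import defaultdict

def load_lemmas(path):
    # Map each word form to the set of its possible lemmas.
    form2lemmas = defaultdict(set)
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 2:
                continue  # skip malformed lines
            form2lemmas[parts[0].lower()].add(parts[1].lower())
    return form2lemmas

lemmas = load_lemmas('wordform_lemmas.tsv')  # placeholder file name
print(lemmas.get('рой'))  # ambiguous: both the verb 'рыть' and the noun 'рой'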
Dataset with chunk markup
The dataset contains sentences in which NP chunks are marked. The first field in each record contains the word's label:
0 - the word does not belong to an NP chunk
1 - beginning of an NP chunk
2 - continuation of an NP chunk
The markup was obtained by automatic conversion from dependency trees and may contain some artifacts.
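A minimal sketch of turning the 0/1/2 labels into chunk word spans (standard BIO-style decoding; the in-memory word/label lists are assumed):

def labels_to_chunks(words, labels):
    chunks, current = [], []
    for word, label in zip(words, labels):
        if label == 1:  # a new NP chunk starts here
            if current:
                chunks.append(current)
            current = [word]
        elif label == 2 and current:  # continuation of the open chunk
            current.append(word)
        else:  # label 0 (or a stray 2): close any open chunk
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [' '.join(c) for c in chunks]

print(labels_to_chunks(['высоким', 'забором', 'была'], [1, 2, 0]))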
Hand-crafted paraphrases
Word frequencies with parts of speech taken into account
Bringing words to the neutral form («стали» - «сталь»)
Word roots