Open Korean Instructions
1.0.0
Open Korean Instructions is a repository that collects Korean instruction datasets for training language models.
Many of them were created by translating existing English datasets or by generating data with GPT. If you have new data, please let me know via a PR.
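Most of the datasets listed below are hosted on the Hugging Face Hub and can be loaded with the `datasets` library. A minimal sketch, assuming one example repository ID from the table (split names and column layouts vary per dataset, so check each dataset card first):

```python
# pip install datasets
from datasets import load_dataset

# The dataset ID below is one entry from the table; most others load the
# same way. Split names and columns differ per dataset, so verify against
# the dataset card on the Hugging Face Hub.
ds = load_dataset("beomi/KoAlpaca-v1.1a", split="train")

print(ds)     # row count and column names
print(ds[0])  # first instruction/answer record
```

The same call works for most entries in the table; only the repository ID changes.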
| Name | Size | Type | Details |
|---|---|---|---|
| KoAlpaca v1.0 | 52K | Single-turn | Answers generated after translating the Alpaca instructions |
| KoAlpaca v1.1 | 21K | Single-turn | Questions collected from Naver Knowledge iN; answers generated with ChatGPT |
| ShareGPT DeepL translation | 620K (single-turn), 84K (multi-turn) | Single-turn, multi-turn | ShareGPT data translated with DeepL |
| ShareGPT-74k-ko | 74K; 55K (code removed) | Multi-turn | Cleaned ShareGPT 90K translated with Google Translate |
| KoChatGPT | 13K | Single-turn, multi-turn, RM | Questions collected from Korean question datasets; answers generated with ChatGPT |
| OIG-small-chip2-ko | 210K | Single-turn | LAION AI's OIG-small-chip2 English data translated with Google Translate |
| KorQuAD-chat | 9.6K | Multi-turn, knowledge-grounded | Dialogues generated from KorQuAD v1 contexts (news and Wikipedia paragraphs) |
| airc-keti/kowow | ? | Multi-turn, knowledge-grounded | Korean translation of WoW (Wizard of Wikipedia), a knowledge-grounded dialogue dataset |
| CounselGPT | 13K (single-turn), 8.7K (multi-turn) | Single-turn, multi-turn | Counseling data generated with GPT |
| evolve-instruct | 37K | Single-turn | Data generated with GPT after augmenting instructions with the evol-instruct technique used in WizardLM |
| KULLM v2 | 153K | Single-turn | GPT4All, Dolly, and Vicuna (ShareGPT) data translated with DeepL |
| nlpai-lab/openassistant-guanaco-ko | 9.85K | Multi-turn | Korean translation of Guanaco via the DeepL API |
| psymon/namuwiki_alpaca_dataset | 79K | Single-turn | Namuwiki dump files reworked to fit the Stanford Alpaca training format |
| changpt/ko-lima-vicuna | 1K | Single-turn, multi-turn (a small portion) | lima_vicuna_format data regenerated in Korean with the GPT-4 API |
| taeshahn/ko-lima | 1K | Single-turn, multi-turn (a small portion) | Korean translation of the LIMA dataset from "LIMA: Less Is More for Alignment" (Zhou et al., 2023) |
| ko-strategyqa | 2.2K (questions), 9K (documents) | Multi-hop QA, yes/no short answers | Korean version of StrategyQA; all questions and paragraphs of the original dataset translated with DeepL |
| HAERAE-HUB/KoInstruct-Base | 52K | Single-turn | Appears to be a translation of the Alpaca data |
| HAERAE-HUB/KoInstruct-QA | 50.3K | Single-turn | Source data unknown; may overlap with the dataset above |
| kyujinpy/KOpen-platypus | 24.9K | Single-turn | Translation of the garage-bAInd/Open-Platypus data |
| ziozzang/EverythingLM-data-V2-Ko | 1K | Single-turn | EverythingLM-data-V2 translated with DeepL |
| human-rights-corpus/HRC | 1.5K | Single-turn | Human rights corpus for conversational models: decisions and counseling cases of the National Human Rights Commission of Korea, restyled and converted into question-answer pairs with GPT-3.5-turbo and reviewed with the surrounding context taken into account |
| kyujinpy/OpenOrca-KO | 21.6K | Single-turn | About 20K examples sampled from the OpenOrca dataset and translated |
| kyujinpy/KoCoT_2000 | 2.16K | Single-turn | KAIST-CoT translated with DeepL |
| RLHF-Korean-Friendly-LLM | 2.4K (SFT), 3.8K (RM), 3.6K (RLHF) | Single-turn | Various data collected to build thousands-scale datasets for RLHF |
| jojo0217/korean_rlhf_dataset | 107K | Single-turn | Dataset built for Korean LLM SFT training during a Sungkyunkwan University industry-academic cooperation project |
| maywell/ko_hh-rlhf-20k_filtered | 20K | Multi-turn, RM | 20K examples of the hh-rlhf dataset translated with the Synatra translation model |
| squarelike/OpenOrca-gugugo-ko | 640K+ (translation in progress) | Single-turn | OpenOrca translated with Gugugo-koen-7B-V1.1 |
| maywell/ko_Ultrafeedback_binarized | 62K (RM) | Single-turn | ultrafeedback_binarized translated and refined with the Synatra-7B-Translation model |
| MrBananaHuman/kor_ethical_question_answer | 29.1K | Single-turn | Ethical/unethical AI query-answer dataset for RLHF training |
| HumanF-MarkrAI/WIKI_QA_Near_dedup | 138K | Single-turn | QA data built from maywell/wikidata_QA by Maywell (Jeonghwan Park) |
| kaist-ai/Multilingual-CoT-Collection | 77.2K | Single-turn | Multilingual CoT Collection released by KAIST; 77.2K Korean examples |
| heegyu/PKU-SafeRLHF-ko | 164K (RM) | Single-turn | Translation of the PKU-Alignment/PKU-SafeRLHF data |
| heegyu/hh-rlhf-ko | 113K (RM) | Multi-turn | Translation of the Anthropic/hh-rlhf data |
| heegyu/webgpt_comparisons_ko | 19.6K (RM) | Single-turn | openai/webgpt_comparisons translated with a translation model |
| heegyu/glaive-function-calling-v2-ko | 15.2K (function calling) | Multi-turn | 15.2K examples of glaiveai/glaive-function-calling-v2 translated with ChatGPT |
| squarelike/ko_medical_chat | 3.04K | Multi-turn | jwj7140/ko-medical-chat: the MedText and ChatDoctor datasets converted into Korean dialogues via GPT-3.5 |
| MarkrAI/KoCommercial-Dataset | 1.44M | Single-turn | Commercially usable datasets collected, processed, and merged |
| maywell/koVast | 685K | Multi-turn | 685K large-scale multi-turn Korean conversations |
| SJ-Donald/orca-dpo-pairs-ko | 36K | Single-turn | Combines mncai/orca_dpo_pairs_ko and Ja-ck/Orca-DPO-Pairs-KO |
| lcw99/wikipedia-korean-20240501-1million-qna | 1M | Single-turn QA | Korean Wikipedia split into about a million sections, from which a million Q&A pairs were generated |
| nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k | 196K | Single-turn | Translation of the WizardLM/WizardLM_evol_instruct_V2_196k dataset |
| HAERAE-HUB/QARV-Instruct-100K | 100K | Single-turn | Instruction-answer pairs that require knowledge of Korea (includes English) |
| kuotient/orca-math-word-problems-193k-korean | 193K | Single-turn | Translation of microsoft/orca-math-word-problems-200k |
| kuotient/orca-math-korean-preference | 193K | Single-turn (DPO) | DPO dataset built from the translated microsoft/orca-math-word-problems-200k |
| jojo0217/korean_safe_conversation | 26K | Single-turn | Everyday-dialogue data built during a Sungkyunkwan University industry-academic cooperation project with the company VAIV; a dataset for building natural and ethical chatbots |
| HAERAE-HUB/K2-Feedback | 100K | Single-turn | K^2-Feedback extends the Feedback Collection with instructions specialized for Korean culture and language, designed to improve the evaluation ability of Korean models. (Note: originally data for training a Prometheus-style model; it can be used for instruction tuning by keeping only outputs with a score of 5.) |
| maywell/kiqu_samples | 24.9K | Single-turn | Output samples from the kiqu-70b model |
| CarrotAI/ko-instruction-dataset | 7K | Single-turn | High-quality Korean dataset generated with the WizardLM-2-8x22B model ("WizardLM: Empowering Large Language Models to Follow Complex Instructions") |
| HAERAE-HUB/HR-Instruct-Math-v0.1 | 30K | Single-turn | Korean mathematics instruction data (PoC version) |
| iknow-lab/qarv-instruct-ko-mt | 10K | Multi-turn | HAERAE-HUB/QARV-Instruct-KO extended with two-turn conversations generated by GPT-3.5-turbo for 10K examples |
| iknow-lab/ko-evol-writing-wiki | 30K | Single-turn | Writing and creative-writing data created with GPT-3.5-turbo |
| AIHub RLHF dataset | 13K (SFT), 33K (RM), 33K (PPO) | Single-turn | The RM data contains an instruction with five ranked answers; the PPO data contains only instructions, without answers |
| beomi/KoAlpaca-RealQA | 18K | Single-turn | Dataset based on real Korean user conversations from the ChatKoAlpaca service in 2023-2024 |
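The Type column above mixes single-turn instruction-answer pairs, multi-turn conversations, and preference (RM/DPO) pairs, and the field names differ from dataset to dataset. When combining several of them for SFT, a common first step is normalizing every record into one chat-message format. A minimal sketch, assuming a single-turn schema with `instruction` and `output` fields (an assumption; verify each dataset's card, since other datasets use different keys):

```python
from datasets import load_dataset

def to_chat(example):
    # Field names are assumptions: KoAlpaca-style records use
    # `instruction`/`output`; other datasets in the table may differ.
    return {
        "messages": [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["output"]},
        ]
    }

ds = load_dataset("beomi/KoAlpaca-v1.1a", split="train")
# Drop the original columns so only the unified `messages` field remains.
chat_ds = ds.map(to_chat, remove_columns=ds.column_names)
print(chat_ds[0]["messages"])
```

Multi-turn and RM/DPO datasets need their own mapping functions, but once everything shares the `messages` layout the datasets can be concatenated for training.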
The following collections gather multiple translated Korean datasets:

| Collection | Description |
|---|---|
| Yoo Jun-hyuk's translation data | English datasets translated into Korean |
| Yoo Jun-hyuk's translation data 2 (Magpie) | Korean translation of the Magpie datasets (using @nayohan's translation model) |
| songys/HuggingFace_koreandataset | Song Young-sook's collection of Korean datasets on Hugging Face, as of October 10, 2024 |
| Lee Yohan (@nayohan)'s translation data | Datasets translated from English to Korean using `llama3-instrucTrans-enko-8b` |
The following datasets are benchmarks for evaluating Korean language models:

| Name | Size | Type | Details |
|---|---|---|---|
| HAERAE-HUB/KMMLU | 243K | MCQA | Benchmark for evaluating Korean-language performance across 45 subjects |
| HAETAE-project/hae-rae-bench | 1.5K | MCQA | HAE-RAE Bench is a benchmark designed to evaluate the Korean language skills (vocabulary, history, general knowledge, and reading comprehension) of language models |
| HAERAE-HUB/CSAT-QA | 0.9K | MCQA | Problems from the Korean college entrance exam (CSAT) |
| HAERAE-HUB/K2-Eval | 90 | Generation | 90 instructions that require in-depth knowledge of Korean culture; reference answers written by humans or GPT-4 |
| sean0042/KorMedMCQA | <1K | MCQA | Korean medical QA benchmark |
| HAERAE-HUB/Korean-Human-Judgements | <1K | Human preference | Questions, answer A, answer B, and human preference judgments |
| HAERAE-HUB/Kudge | 2.8K | Human preference | 5.6K Korean human annotations |
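Most of the MCQA benchmarks above share a question-plus-options layout. A minimal sketch of rendering such a record as an evaluation prompt, assuming the KMMLU-style columns `question`, `A`-`D`, and `answer` (an assumption; other benchmarks use different schemas), with "Accounting" as one example KMMLU subject configuration:

```python
from datasets import load_dataset

def format_mcqa(example):
    # Column names (`question`, `A`..`D`, `answer`) follow the common
    # KMMLU-style layout; other benchmarks may use a different schema.
    choices = "\n".join(f"{c}. {example[c]}" for c in "ABCD")
    # "정답:" means "Answer:" and prompts the model to pick a choice.
    return f"{example['question']}\n{choices}\n정답:", example["answer"]

ds = load_dataset("HAERAE-HUB/KMMLU", "Accounting", split="test")
prompt, gold = format_mcqa(ds[0])
print(prompt)
print(gold)  # the gold answer is stored as a choice index
```

Scoring then reduces to comparing the model's chosen option against the gold index for each record.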