[ English | Français | Japanese]
A summary of this repository is also published as a preprint: Exploring Open Large Language Models for the Japanese Language: A Practical Guide
If you refer to this repository, please cite:
@article{awesomeJapanese2024,
    title={{Exploring Open Large Language Models for the Japanese Language: A Practical Guide}},
    author={Kaito Sugimoto},
    doi={10.51094/jxiv.682},
    journal={Jxiv preprint},
    year={2024}
}
Some changes have been made to the architecture. For details, see: Pre-training of the proprietary 100-billion-parameter LLM "PLaMo-100B" ↩
For details, see the following article: Strategy notes on pre-training and post-training when developing large language models, including the positioning and development guidelines for the large language models Tanuki-8B and 8x8B, particularly regarding synthetic data ↩ ↩ 2
However, changes have been made to the original Llama architecture to speed up the model. For details, see: PLaMo-13B has been released ↩
Although details are not disclosed, the press release states: 'In addition to open datasets, the training data includes original datasets created by Stability AI Japan, as well as data created with the cooperation of the Japanese team of the EleutherAI Polyglot project and members of Stable Community Japan.' ↩
This study evaluates a language model trained to predict words from right to left instead of the usual left to right. Both the normal and the reversed language models are published. ↩
Before Instruction Tuning, a Chat Vector (the difference between Llama 3 Instruct and Llama 3 Base) is added. ↩ ↩ 2
After Instruction Tuning, a Chat Vector (the difference between Llama 3 Instruct and Llama 3 Base) is added. ↩ ↩ 2
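The two notes above describe the Chat Vector approach: the weight difference between an instruction-tuned model and its base model is added to another model derived from the same base. Below is a minimal sketch of that idea, not the developers' exact procedure; the Llama 3 model IDs are illustrative (and gated on HuggingFace), and the target path is a placeholder for a Japanese continually pre-trained model.

```python
# Minimal sketch of adding a Chat Vector; model IDs and paths are illustrative only.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16)
inst = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16)
target = AutoModelForCausalLM.from_pretrained("path/to/japanese-cpt-model", torch_dtype=torch.bfloat16)

base_sd, inst_sd, target_sd = base.state_dict(), inst.state_dict(), target.state_dict()

with torch.no_grad():
    for name, param in target_sd.items():
        # Embedding and output layers are commonly skipped when the vocabulary differs.
        if "embed_tokens" in name or "lm_head" in name:
            continue
        chat_vector = inst_sd[name] - base_sd[name]
        param.add_(chat_vector)  # add the Chat Vector in place

target.save_pretrained("japanese-model-plus-chat-vector")
```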
However, if you would like to use KARAKURI LM for commercial purposes, you will need to contact Karakuri Co., Ltd., the developer. ↩
Because the data used for Instruction Tuning is generated by OpenAI models such as GPT-3.5 and GPT-4, it may violate OpenAI's terms of use. ↩ ↩ 2 ↩ 3 ↩ 4 ↩ 5 ↩ 6 ↩ 7 ↩ 8 ↩ 9 ↩ 10
Before performing ORPO, a Chat Vector (the difference between Gemma 2 Instruct and Gemma 2 Base) is added. ↩
○: The model has been uploaded to HuggingFace's Model Hub and can be loaded directly with AutoModel.from_pretrained() etc. △: The model is not uploaded to the Model Hub, but it follows the HuggingFace (transformers, formerly pytorch-transformers) format. ✕: The model does not follow the HuggingFace format. ↩
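The "○" case above refers to loading a model with AutoModel.from_pretrained(); a minimal sketch follows. The model ID is just one example from the Hub (the Tohoku University BERT), and its tokenizer additionally requires the fugashi and unidic-lite packages.

```python
# Minimal sketch of the "○" case: loading a Hub-hosted Japanese model with transformers.
# The model ID is an example; its tokenizer also needs `pip install fugashi unidic-lite`.
from transformers import AutoModel, AutoTokenizer

model_name = "cl-tohoku/bert-base-japanese-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("日本語の大規模言語モデルの一覧です。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```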
This study experiments with combinations of various morphological analyzers and subword tokenization methods. Since it is difficult to list models for every combination, only the Juman++ + BPE model, which achieved the highest average task performance in the experiments, is listed here. ↩
However, the maximum sequence length has been extended to 2048, and various other changes have been made to the original BERT architecture. See the README in the HuggingFace repository for details. ↩
nlp-waseda/roberta-base-japanese and nlp-waseda/roberta-large-japanese were pre-trained with a maximum input token length of 128, while nlp-waseda/roberta-large-japanese-seq512 was pre-trained with a maximum of 512. ↩
However, the maximum sequence length has been extended from the usual 512 to 1282, allowing longer input texts to be handled. ↩
The small model is trained from scratch on Japanese Wikipedia and a Japanese financial corpus, while the base model is obtained by further pre-training Tohoku University BERT on a Japanese financial corpus. ↩
The ManbyoWordPiece model segments text into words with MeCab (IPA dictionary + Manbyo dictionary) and then applies WordPiece subwording, whereas the SentencePiece model converts text directly into subwords with a Unigram model, without word segmentation. ↩
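As a rough illustration of the difference between the two tokenization schemes (word segmentation followed by WordPiece vs. SentencePiece applied directly), the sketch below uses generic Japanese models as stand-ins rather than the JMedRoBERTa models themselves; running it requires the fugashi, unidic-lite, and sentencepiece packages.

```python
# Generic stand-ins, not the JMedRoBERTa models: one tokenizer segments words with MeCab
# and then applies WordPiece; the other applies SentencePiece directly to the raw text.
from transformers import AutoTokenizer

text = "高血圧症の患者に降圧薬を投与した。"

# MeCab word segmentation + WordPiece (requires fugashi and unidic-lite)
wordpiece_tok = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3")
print(wordpiece_tok.tokenize(text))

# SentencePiece applied directly, no word segmentation (requires sentencepiece)
sentencepiece_tok = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-medium", use_fast=False)
print(sentencepiece_tok.tokenize(text))
```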
For details on each model, see Chapter 4 of the author's paper. Note that the SC-2M-wiki model is only pre-trained on Wikipedia, so it is not strictly a domain-specific model. ↩
The embedding models were classified with reference to Dense Text Retrieval based on Pretrained Language Models: A Survey (Zhao+, 2022). A Bi-Encoder is an architecture in which the two inputs are fed into the model separately and each is converted into a vector; the closeness of the inputs is then computed as the dot product or cosine similarity of those vectors. In contrast, a Cross-Encoder feeds the two inputs into the model together and computes their closeness directly inside the model. In information retrieval, the Cross-Encoder is more computationally costly, but because it is expected to model the closeness of the inputs more precisely, it is often used as a reranker to re-order the retrieval results. Among Bi-Encoders, there are also architectures that represent an input as multiple vectors (for example, ColBERT) rather than as a single vector, so they are further divided into Single-representation bi-encoders and Multi-representation bi-encoders. ↩
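As a rough API-level illustration of the Bi-Encoder / Cross-Encoder distinction described above, here is a minimal sketch using the sentence-transformers library; the model IDs are generic examples (the cross-encoder shown is an English reranker), not models from this list.

```python
# Bi-Encoder vs. Cross-Encoder, sketched with sentence-transformers.
# Model IDs are illustrative only; the cross-encoder here is an English reranker.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "日本語の大規模言語モデル"
docs = ["日本語LLMの一覧をまとめたリポジトリです。", "今日の天気は晴れです。"]

# Bi-Encoder: encode query and documents independently, then compare the vectors.
# (E5-style models expect "query: " / "passage: " prefixes.)
bi_encoder = SentenceTransformer("intfloat/multilingual-e5-small")
q_vec = bi_encoder.encode("query: " + query, convert_to_tensor=True)
d_vecs = bi_encoder.encode(["passage: " + d for d in docs], convert_to_tensor=True)
print(util.cos_sim(q_vec, d_vecs))  # cosine similarities

# Cross-Encoder: feed each (query, document) pair into the model jointly,
# typically to rerank the candidates retrieved by the Bi-Encoder.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross_encoder.predict([(query, d) for d in docs]))  # relevance scores
```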
However, users are asked to keep in mind that the model is intended for research and educational use. Also note that some of the models used as merge sources are not licensed under Apache 2.0. ↩ ↩ 2 ↩ 3