Large language models (LLMs) are attracting more and more attention these days, but they are usually evaluated for their zero/few-shot ability and are rarely fine-tuned for text classification the way existing pre-trained models such as BERT are. To find out how well an LLM handles plain text classification, we ran an experiment that applies an LLM to text classification in the same way BERT is typically used.
The purpose of this experiment is to investigate what happens when an LLM, which usually draws attention for its zero/few-shot ability, is used for ordinary fine-tuned text classification.
BERT, which has been the standard choice for text classification, is a bidirectional model, and the embedding of its special [CLS] token is commonly used for classification. Recent LLMs such as LLaMA, however, are unidirectional models, so a [CLS]-style token prepended to the input would be pointless. In this implementation we therefore use the embedding of the final token of the input for classification, following transformers' LlamaForSequenceClassification class. In a unidirectional language model, only the last token can attend to the entire sequence, so it is a natural substitute for [CLS].
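As a rough illustration (not the repository's actual code), the minimal sketch below shows last-token pooling for classification: the hidden state of the final non-padding token is fed to a linear head. The model name and the nine labels follow the experiments described later; everything else (padding handling, `classify` helper) is an assumption for illustration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Minimal sketch: classify a sequence from the hidden state of its last
# non-padding token, since unidirectional LMs have no [CLS]-style token.
model_name = "rinna/japanese-gpt-neox-3.6b"
num_labels = 9  # number of livedoor news categories

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS for padding
backbone = AutoModel.from_pretrained(model_name)
classifier = nn.Linear(backbone.config.hidden_size, num_labels)

def classify(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = backbone(**batch).last_hidden_state       # (batch, seq_len, dim)
    last_idx = batch["attention_mask"].sum(dim=1) - 1   # last non-padding position
    pooled = hidden[torch.arange(hidden.size(0)), last_idx]
    return classifier(pooled)                           # (batch, num_labels) logits
```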
Furthermore, since full fine-tuning of an LLM is extremely demanding in terms of memory and compute, we use LoRA, a fine-tuning technique that reaches performance comparable to full fine-tuning while updating only additional low-rank matrices. PEFT is used for LoRA fine-tuning.
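A hedged sketch of attaching LoRA with PEFT is shown below. The rank r=32 matches the setting used in the experiments; the alpha and dropout values are illustrative assumptions, and the repository's actual training code may differ.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Sketch: wrap a sequence-classification model with LoRA adapters via PEFT.
base = AutoModelForSequenceClassification.from_pretrained(
    "rinna/japanese-gpt-neox-3.6b", num_labels=9
)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32,             # rank used in the experiments below
    lora_alpha=32,    # illustrative value
    lora_dropout=0.05,  # illustrative value
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank matrices (and the head) are trainable
```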
In the evaluation experiment, we perform nine-class classification on the livedoor news corpus. The setup is almost identical to the author's BERT-based text classification tutorial.
Seven Japanese LLMs were used in the evaluation: the four rinna 3.6B models and the CyberAgent OpenCALM 7B, 3B, and 1B models.
For hyperparameter tuning, we ran experiments with learning rates of 1e-4, 3e-4, 5e-4, and 1e-3. We also compared three input formats. Specifically, for each article in the livedoor news corpus, the title was stored in a variable title and the article body in a variable body, and they were injected into the following three templates (a code sketch of this formatting follows the table).
| Template Type | Template |
|---|---|
| 0 | `f"Title: {title}\nBody: {body}\nLabel: "` |
| 1 | `f"Title: {title}\nBody: {body}"` |
| 2 | `f"{title}\n{body}"` |
We ran one experiment for every combination of the learning rates and templates above, and the hyperparameter setting with the highest macro-average F1 on the development set was used for the final evaluation on the test set. LoRA's rank r is fixed at 32.
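For reference, a minimal sketch of this selection criterion using scikit-learn's macro-average F1; `dev_runs` and its contents are dummy values purely for illustration, not actual results.

```python
from sklearn.metrics import f1_score

# Hypothetical results: (learning_rate, template_type) -> (gold labels, predictions)
# on the development set. The label lists below are dummies for illustration only.
dev_runs = {
    (5e-4, 2): ([0, 1, 2, 1], [0, 1, 2, 1]),
    (1e-3, 0): ([0, 1, 2, 1], [0, 2, 2, 1]),
}

# Pick the setting with the highest macro-average F1 on the development set.
best_setting = max(
    dev_runs,
    key=lambda s: f1_score(dev_runs[s][0], dev_runs[s][1], average="macro"),
)
print(best_setting)
```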
Note that the results are not statistically rigorous: each experiment was run only once with a single random seed and no cross-validation was performed. Please treat the numbers below as a rough reference rather than definitive results.
The results are shown in the table below, sorted in descending order of macro-average F1. All detailed results can be found in the CSV files stored in the results directory.
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 | 97.96 | 97.77 | 97.76 | 97.75 |
| rinna/japanese-gpt-neox-3.6b | 97.55 | 97.24 | 97.39 | 97.30 |
| rinna/japanese-gpt-neox-3.6b-instruction-sft | 97.55 | 97.32 | 97.27 | 97.27 |
| rinna/japanese-gpt-neox-3.6b-instruction-ppo | 97.55 | 97.03 | 97.37 | 97.18 |
| cyberagent/open-calm-7b | 97.01 | 96.76 | 96.42 | 96.55 |
| cyberagent/open-calm-3b | 96.88 | 96.38 | 96.51 | 96.42 |
| cyberagent/open-calm-1b | 94.43 | 94.24 | 93.80 | 93.98 |
The table shows that the instruction-tuned rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 achieved the highest F1. On the other hand, the comparatively large 7B model, cyberagent/open-calm-7b, has a slightly lower F1; it may need further tuning of hyperparameters such as LoRA's rank r to reach its full performance.
Incidentally, the F1 of rinna/japanese-gpt-neox-3.6b-instruction-sft-v2, 97.75, is higher than the 97.47 of studio-ousia/luke-japanese-large-lite, which achieved the best performance in the author's BERT-style text classification tutorial. Of course, the two models differ in parameter count by roughly a factor of nine, so this is not an apples-to-apples comparison, but if you want to push text classification performance, using an LLM with LoRA as an alternative to BERT may be a good option.
Next, the per-template results for three representative models from this experiment, rinna/japanese-gpt-neox-3.6b-instruction-sft-v2, cyberagent/open-calm-7b, and rinna/japanese-gpt-neox-3.6b, are shown in the table below.
| Model | Template | Val. F1 | F1 |
|---|---|---|---|
| rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 | 2 | 97.27 | 97.75 |
| rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 | 1 | 97.18 | 97.14 |
| rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 | 0 | 97.05 | 96.80 |
| rinna/japanese-gpt-neox-3.6b | 1 | 97.14 | 97.30 |
| rinna/japanese-gpt-neox-3.6b | 2 | 96.92 | 97.36 |
| rinna/japanese-gpt-neox-3.6b | 0 | 96.61 | 96.69 |
| cyberagent/open-calm-7b | 1 | 97.22 | 96.55 |
| cyberagent/open-calm-7b | 0 | 97.07 | 96.56 |
| cyberagent/open-calm-7b | 2 | 96.88 | 96.85 |
In general, the inference ability of an LLM is strongly influenced by the template (prompt). Since this experiment is not a zero/few-shot setting, one would expect the template to matter less, yet the results still show a noticeable gap of up to about one F1 point between templates. template_type=0 is a relatively elaborate template, while template_type=2 simply concatenates title and body with a line break, and the simpler template_type=2 tends to perform better. Prompts are crucial in zero/few-shot settings, but when you can fine-tune, it may be better to keep the prompt as simple as possible.
Next, let's look at performance for each learning rate with the model fixed to rinna/japanese-gpt-neox-3.6b and template_type fixed to 2.
| LR | Val. F1 | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 5e-2 | 2.18 | 12.91 | 1.43 | 11.11 | 2.54 |
| 3e-2 | 2.18 | 12.91 | 1.43 | 11.11 | 2.54 |
| 1e-2 | 2.18 | 12.91 | 1.43 | 11.11 | 2.54 |
| 5e-3 | 24.78 | 32.20 | 36.30 | 30.27 | 28.21 |
| 3e-3 | 2.18 | 12.91 | 1.43 | 11.11 | 2.54 |
| 1e-3 | 96.92 | 97.69 | 97.51 | 97.27 | 97.36 |
| 5e-4 | 96.77 | 98.23 | 98.02 | 97.87 | 97.93 |
| 3e-4 | 96.74 | 96.88 | 96.46 | 96.21 | 96.30 |
| 1e-4 | 94.79 | 97.01 | 96.85 | 96.72 | 96.76 |
| 5e-5 | 94.28 | 95.92 | 95.73 | 95.50 | 95.58 |
| 3e-5 | 93.74 | 94.02 | 93.50 | 93.61 | 93.55 |
| 1e-5 | 78.94 | 81.25 | 80.21 | 79.43 | 79.62 |
The table shows that although a fairly large learning rate works well with LoRA, the upper limit is around 1e-3; with a very large learning rate such as 1e-2, training fails outright. It would be good to see results on a broader range of models, but when classifying with LLM+LoRA, a learning rate of around 5e-4 seems like a sensible first choice.
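For example, assuming `model` is a PEFT-wrapped model as in the earlier sketch, the optimizer setup might look like the following; only the learning rate reflects the observation above, and the choice of AdamW is an assumption rather than the repository's exact configuration.

```python
import torch

# Only the LoRA parameters (and the classification head) require gradients,
# so the optimizer sees a small parameter set. lr=5e-4 follows the table above.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-4)
```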
Furthermore, let's look at the differences in performance per batch size with the model fixed to rinna/japanese-gpt-neox-3.6b, template_type to 2, and LoRA's r to 32.
| batch size | LR | Val. F1 | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 2 | 5e-4 | 97.12 | 98.10 | 98.02 | 97.48 | 97.70 |
| 16 | 1e-3 | 97.12 | 97.83 | 97.77 | 97.37 | 97.52 |
| 32 | 1e-3 | 96.92 | 97.69 | 97.51 | 97.27 | 97.36 |
| 64 | 5e-4 | 96.57 | 97.55 | 97.39 | 97.35 | 97.35 |
| 4 | 5e-4 | 97.08 | 97.42 | 97.37 | 97.01 | 97.15 |
| 8 | 3e-4 | 97.20 | 97.28 | 96.99 | 96.87 | 96.91 |
This table is sorted in descending order of F1. The batch size may produce some difference in performance, but since each setting was run only once with a single random seed, it is hard to draw a firm conclusion. In general, smaller batch sizes take longer to train and tend to be less stable, so a batch size of around 16 or 32 is probably a reasonable default.
Finally, let's look at performance for each LoRA rank r with the model fixed to rinna/japanese-gpt-neox-3.6b and template_type fixed to 2.
| LoRA r | LR | Val. F1 | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 8 | 5e-4 | 97.45 | 97.15 | 96.97 | 96.75 | 96.83 |
| 64 | 1e-3 | 97.22 | 97.28 | 96.96 | 96.85 | 96.89 |
| 16 | 1e-3 | 97.20 | 97.69 | 97.59 | 97.27 | 97.38 |
| 4 | 3e-4 | 97.12 | 97.69 | 97.64 | 97.24 | 97.40 |
| 32 | 1e-3 | 96.92 | 97.69 | 97.51 | 97.27 | 97.36 |
Here, the development-set F1 and the test-set F1 do not seem to correlate much. Since the appropriate value of LoRA's r is often said to shrink as the model grows, setting it to 32 or higher seems safe for mid-sized LLMs of a few billion parameters, but further experiments would be needed to say more.
In this implementation, we applied an LLM to conventional text classification. Fine-tuning with LoRA achieved quite high performance while updating only a small number of parameters, so "using an LLM as an alternative to BERT" seems like a reasonable option. We also observed that the template still affects performance even in a fine-tuning setting, that LoRA fine-tuning requires a fairly large learning rate, and that the rank r likely affects performance as well.
Author: Hayato Tsukagoshi
email: [email protected]
If you would like to refer to this implementation in a paper, etc., please use the following:
@misc{hayato-tsukagoshi-2023-llm-lora-classification,
    title = {{Text Classification with LLMs and LoRA}},
    author = {Hayato Tsukagoshi},
    year = {2023},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/hppRC/llm-lora-classification}},
    url = {https://github.com/hppRC/llm-lora-classification},
}