
fewshot textclassification

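Few Shot Text Classification

Playing around with the SetFit approach for few-shot transfer in text classification.

Edit: I also ran some experiments with active learning, so there is active learning code in here now too. I'll tidy it all up some fine day.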

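Implemented approaches

In main.py

Case 0: the SetFit method as originally outlined, i.e. a sentence transformer fine-tuned in a self-supervised, contrastive manner. We then slap a logistic classifier on top of the encoded sentences and perform the actual task.
Case 1: regular task-specific fine-tuning of a sentence transformer:
1. Skip the self-supervised fine-tuning of the transformer and train directly on the task.
2. Instead of a logistic classifier, use a regular dense layer and train it together with the encoder.
Case 2: like case 0, but skip the self-supervised fine-tuning of the transformer: go straight to encoding the text and fit a logistic classifier on top.
Case 3: instead of all of the above, formulate a few-shot prompt and ask a model hosted on HuggingFace to classify the text.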
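
The contrastive step in case 0 needs sentence pairs built from the handful of labelled examples. The following is a minimal sketch of one common way to build them, not the code in main.py: the function name `make_contrastive_pairs` and its parameters are hypothetical, and `num_iters` only loosely mirrors the `--num-iters` option. Same-label pairs get a target similarity of 1.0, different-label pairs get 0.0; a sentence transformer would then be fine-tuned on these pairs before the logistic classifier is fitted on its embeddings.

```python
import random
from typing import List, Tuple

# (sentence_a, sentence_b, target similarity)
Pair = Tuple[str, str, float]


def make_contrastive_pairs(
    texts: List[str], labels: List[int], num_iters: int = 1, seed: int = 42
) -> List[Pair]:
    """Per iteration, give every sentence one same-label and one different-label partner."""
    rng = random.Random(seed)
    pairs: List[Pair] = []
    for _ in range(num_iters):
        for i, (text, label) in enumerate(zip(texts, labels)):
            # Candidates that share the label (excluding the sentence itself).
            positives = [
                t for j, (t, l) in enumerate(zip(texts, labels)) if l == label and j != i
            ]
            # Candidates with any other label.
            negatives = [t for t, l in zip(texts, labels) if l != label]
            if positives:
                pairs.append((text, rng.choice(positives), 1.0))
            if negatives:
                pairs.append((text, rng.choice(negatives), 0.0))
    return pairs


# Toy data, made up for illustration.
texts = ["great phone", "love it", "broke in a day", "waste of money"]
labels = [1, 1, 0, 0]
pairs = make_contrastive_pairs(texts, labels, num_iters=2)
for sentence_a, sentence_b, target in pairs[:4]:
    print(f"{target:.1f}  {sentence_a!r} <-> {sentence_b!r}")
```

The number of pairs grows linearly with both the train set size and `num_iters`, which is one reason larger values get expensive to train.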

In active.py

Case 4: uses contrastive active learning. The small-text implementation is <3 (I hope you have a huge GPU).
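
To give a feel for what the query step in case 4 does, here is a simplified, pure-Python sketch of the idea behind contrastive active learning as I understand it: a pool example is worth labelling when the model's prediction for it disagrees with the predictions for its nearest labelled neighbours in embedding space. This is not small-text's implementation; every name below (`contrastive_scores`, `pick_queries`, the toy vectors) is hypothetical, and in practice the embeddings and probabilities would come from the model being trained.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def kl_divergence(p: Vector, q: Vector, eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def contrastive_scores(
    pool_emb: Sequence[Vector],
    pool_probs: Sequence[Vector],
    labeled_emb: Sequence[Vector],
    labeled_probs: Sequence[Vector],
    k: int = 2,
) -> List[float]:
    """Score each pool candidate by its mean KL divergence from its k nearest labelled neighbours."""
    k = min(k, len(labeled_emb))  # assumes the labelled set is not empty
    scores = []
    for emb, probs in zip(pool_emb, pool_probs):
        # Indices of the k labelled points closest to this candidate (Euclidean distance).
        nearest = sorted(
            range(len(labeled_emb)), key=lambda j: math.dist(emb, labeled_emb[j])
        )[:k]
        scores.append(sum(kl_divergence(labeled_probs[j], probs) for j in nearest) / k)
    return scores


def pick_queries(scores: Sequence[float], num_to_label: int) -> List[int]:
    """Return the pool indices with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_to_label]


# Toy 2-d embeddings and class probabilities, made up for illustration.
labeled_emb = [[0.0, 0.0], [1.0, 1.0]]
labeled_probs = [[0.9, 0.1], [0.1, 0.9]]
pool_emb = [[0.1, 0.0], [0.9, 1.0]]
pool_probs = [[0.9, 0.1], [0.8, 0.2]]  # the second candidate contradicts its neighbour

scores = contrastive_scores(pool_emb, pool_probs, labeled_emb, labeled_probs, k=1)
queries = pick_queries(scores, num_to_label=1)
print(scores, queries)
```

Scoring needs embeddings and predictions for the whole unlabelled pool at every query round, which is where the GPU appetite comes from.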

Usage

$ python main.py --help
Usage: main.py [OPTIONS]

Options:
  -d, --dataset-name TEXT         The name of the dataset as it appears on the
                                  HuggingFace hub e.g. SetFit/SentEval-CR |
                                  SetFit/bbc-news | SetFit/enron_spam ...

  -c, --case INTEGER              0, 1, 2, or 3: which experiment are we
                                  running. See readme or docstrings to know
                                  more but briefly: **0**: SentTF ->
                                  Constrastive Pretrain -> +LogReg on task.
                                  **1**: SentTF -> +Dense on task. **2**:
                                  SentTF -> +LogReg on task. **3**:
                                  FewShotPrompting based Clf over Flan-t5-xl
                                  [required]

  -r, --repeat INTEGER            The number of times we should run the entire
                                  experiment (changing the seed).

  -bs, --batch-size INTEGER       ... you know what it is.
  -ns, --num-sents INTEGER        Size of our train set. Set short values
                                  (under 100)

  -e, --num-epochs INTEGER        Epochs for fitting Clf+SentTF on the main
                                  (classification) task.

  -eft, --num-epochs-finetune INTEGER
                                  Epochs for both contrastive pretraining of
                                  SentTF.

  -ni, --num-iters INTEGER        Number of text pairs to generate for
                                  contrastive learning. Values above 20 can
                                  get expensive to train.

  -tot, --test-on-test            If true, we report metrics on testset. If
                                  not, on a 20% split of train set. Off by
                                  default.

  -ft, --full-test                We truncate the testset of every dataset to
                                  have 100 instances. If you know what you're
                                  doing, you can test on the full dataset.NOTE
                                  that if you're running this in case 3 you
                                  should probably be a premium member and not
                                  be paying per use.

  --help                          Show this message and exit.

Note: if you want to query LLMs hosted on HuggingFace (case 3), you need to create an account on the HuggingFace Hub and generate an access token, kept in the file ./hf_token.key.
PS: don't worry, I've added this file to .gitignore.
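
As a rough illustration of the two pieces case 3 needs before any request is sent, the sketch below reads the token from the key file and assembles a few-shot prompt. The helper names and the prompt template are assumptions made for this example, not what main.py does, and the example deliberately stops short of calling any hosted model.

```python
from pathlib import Path
from typing import List, Tuple


def read_hf_token(path: str = "./hf_token.key") -> str:
    """Return the access token stored in the key file, stripped of surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


def build_fewshot_prompt(
    examples: List[Tuple[str, str]], query: str, label_names: List[str]
) -> str:
    """Lay out labelled examples followed by the unlabelled query (hypothetical template)."""
    lines = [f"Classify the text as one of: {', '.join(label_names)}.", ""]
    for text, label in examples:
        lines += [f"Text: {text}", f"Label: {label}", ""]
    lines += [f"Text: {query}", "Label:"]
    return "\n".join(lines)


# Toy examples, made up for illustration.
examples = [
    ("great phone, love the camera", "positive"),
    ("stopped charging after two days", "negative"),
]
prompt = build_fewshot_prompt(
    examples, "battery died after a week", ["negative", "positive"]
)
print(prompt)
```

The model's completion after the final `Label:` would then be mapped back to one of the label names.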


$ python active.py --help
Usage: active.py [OPTIONS]

Options:
  -d, --dataset-name TEXT     The name of the dataset as it appears on the
                              HuggingFace hub e.g. SetFit/SentEval-CR |
                              SetFit/bbc-news | SetFit/enron_spam | imdb ...

  -ns, --num-sents INTEGER    Size of our train set. I.e., the dataset at the
                              END of AL. Not the start of it.

  -nq, --num-queries INTEGER  Number of times we query the unlabeled set and
                              pick some labeled examples. Set short values
                              (under 10)

  -ft, --full-test            We truncate the testset of every dataset to have
                              100 instances. If you know what you're doing,
                              you can test on the full dataset.NOTE that if
                              you're running this in case 3 you should
                              probably be a premium member and not be paying
                              per use.

  --help                      Show this message and exit.

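Alternatively, after installing the required libraries (see requirements.txt), you can simply run ./run.sh.

Afterwards, you can run the notebook summarise.ipynb to summarise the results and, once that code has been added, visualise them.

PS: pay attention to --full-test. By default, every test set is truncated to its first 100 instances.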
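
The evaluation behaviour described by `--full-test` and `--test-on-test` can be pictured with the following sketch. It is an illustration under assumptions, not the repository's code: the function names are hypothetical, and whether the 20% hold-out is shuffled (as done here) is a guess.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def truncate_testset(test: Sequence[T], full_test: bool = False, limit: int = 100) -> List[T]:
    """Keep only the first `limit` test instances unless full_test is set."""
    return list(test) if full_test else list(test[:limit])


def holdout_split(
    train: Sequence[T], fraction: float = 0.2, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Split the train set into (train, eval), with `fraction` of it held out for evaluation."""
    items = list(train)
    random.Random(seed).shuffle(items)  # assumption: shuffle before splitting
    n_eval = int(len(items) * fraction)
    return items[n_eval:], items[:n_eval]


test_ids = range(250)
print(len(truncate_testset(test_ids)))                  # default: truncated
print(len(truncate_testset(test_ids, full_test=True)))  # with --full-test

train_part, eval_part = holdout_split(range(50))
print(len(train_part), len(eval_part))
```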

Datasets used

SetFit/SentEval-CR
SetFit/bbc-news
SetFit/enron_spam
SetFit/sst2
imdb

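They are all classification datasets that have been cleaned up by the nice and kind people who made the SetFit lib. That said, you can use any HF dataset as long as it has these three fields.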

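Conclusions?

Here are my results:

This table shows the results for the cases above plus the active learning setup. Unless specified otherwise, each experiment is repeated 5 times. The numbers are task accuracy with only 100 instances in the train set.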

Method	BBC-News	SST2	SentEval-CR	IMDB	enron_spam
SetFit FT	0.978±0.004	0.860±0.018	0.882±0.029	0.924±0.026	0.960±0.017
No contrastive SetFit FT	0.932±0.015	0.854±0.019	0.886±0.005	0.902±0.019	0.942±0.020
Regular FT	0.466±0.133	0.628±0.098	0.582±0.054	0.836±0.166	0.776±0.089
LLM prompting	0.950±0.000	0.930±0.000	0.900±0.000	0.930±0.000	0.820±0.000
Contrastive AL	0.980±0.000	0.910±0.000	0.910±0.000	0.870±0.000	0.980±0.000
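
Each cell above has the form mean±spread over the repeated runs. A minimal sketch of how such a cell could be produced from per-seed accuracies is shown below; the function name `summarise` is hypothetical, the input numbers are made up, and the use of the sample standard deviation is an assumption (the notebook may compute the spread differently).

```python
from statistics import mean, stdev
from typing import Sequence


def summarise(accuracies: Sequence[float]) -> str:
    """Format per-run accuracies as 'mean±std' with three decimals."""
    # A single run has no spread, so report 0.000 rather than failing.
    spread = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return f"{mean(accuracies):.3f}±{spread:.3f}"


# Made-up accuracies from five hypothetical seeds.
runs = [0.97, 0.98, 0.98, 0.975, 0.985]
print(summarise(runs))
print(summarise([0.95]))
```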