fewshot textclassification
1.0.0
Text classification via few-shot learning, using the SetFit method.
Edit: I have also run some experiments with active learning, so there is now an active.py as well. I will organize this better one fine day.
~/Dev/projects/setfit$ python main.py --help
Usage: main.py [OPTIONS]
Options:
-d, --dataset-name TEXT The name of the dataset as it appears on the
HuggingFace hub e.g. SetFit/SentEval-CR |
SetFit/bbc-news | SetFit/enron_spam ...
-c, --case INTEGER 0, 1, 2, or 3: which experiment are we
running. See readme or docstrings to know
more but briefly: **0**: SentTF ->
Contrastive Pretrain -> +LogReg on task.
**1**: SentTF -> +Dense on task. **2**:
SentTF -> +LogReg on task. **3**:
FewShotPrompting based Clf over Flan-t5-xl
[required]
-r, --repeat INTEGER The number of times we should run the entire
experiment (changing the seed).
-bs, --batch-size INTEGER ... you know what it is.
-ns, --num-sents INTEGER Size of our train set. Set short values
(under 100)
-e, --num-epochs INTEGER Epochs for fitting Clf+SentTF on the main
(classification) task.
-eft, --num-epochs-finetune INTEGER
Epochs for the contrastive pretraining of
SentTF.
-ni, --num-iters INTEGER Number of text pairs to generate for
contrastive learning. Values above 20 can
get expensive to train.
-tot, --test-on-test If true, we report metrics on testset. If
not, on a 20% split of train set. Off by
default.
-ft, --full-test We truncate the testset of every dataset to
have 100 instances. If you know what you're
doing, you can test on the full dataset. NOTE
that if you're running this in case 3 you
should probably be a premium member and not
be paying per use.
--help Show this message and exit.
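For reference, the contrastive pretraining in case 0 works by turning the handful of labeled sentences into many sentence pairs, which is why `--num-iters` gets expensive quickly. A minimal, self-contained sketch of that pair generation (the function name and sampling details here are illustrative, not the repo's actual code):

```python
import random

def make_contrastive_pairs(texts, labels, num_iters, seed=0):
    """For every sentence, sample `num_iters` positive pairs (same
    label) and `num_iters` negative pairs (different label) -- the way
    SetFit-style contrastive pretraining builds its training data."""
    rng = random.Random(seed)
    pairs = []  # (sentence_a, sentence_b, similarity_label)
    for i, (text, label) in enumerate(zip(texts, labels)):
        same = [j for j, l in enumerate(labels) if l == label and j != i]
        diff = [j for j, l in enumerate(labels) if l != label]
        for _ in range(num_iters):
            if same:  # positive pair: both sentences share a class
                pairs.append((text, texts[rng.choice(same)], 1.0))
            if diff:  # negative pair: classes differ
                pairs.append((text, texts[rng.choice(diff)], 0.0))
    return pairs
```

Note the quadratic-ish blow-up: `num_sents * 2 * num_iters` pairs, which is why values of `--num-iters` above 20 get slow to train on.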
Note: if you want to query an LLM hosted on HuggingFace (case 3), you must create an account on the HuggingFace Hub and generate an access token, which you should then paste into the file
./hf_token.key. PS: don't worry, I have already added this file to .gitignore
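Reading the token back is then just plain file I/O; roughly (a sketch -- the repo's actual loading code may differ):

```python
from pathlib import Path

def load_hf_token(path="./hf_token.key"):
    """Read the HuggingFace access token from disk, stripping stray
    whitespace/newlines so it can be passed straight to the hub client."""
    token = Path(path).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"No token found in {path}")
    return token
```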
$ python active.py --help
Usage: active.py [OPTIONS]
Options:
-d, --dataset-name TEXT The name of the dataset as it appears on the
HuggingFace hub e.g. SetFit/SentEval-CR |
SetFit/bbc-news | SetFit/enron_spam | imdb ...
-ns, --num-sents INTEGER Size of our train set. I.e., the dataset at the
END of AL. Not the start of it.
-nq, --num-queries INTEGER Number of times we query the unlabeled set and
pick some examples to label. Set short values
(under 10)
-ft, --full-test We truncate the testset of every dataset to have
100 instances. If you know what you're doing,
you can test on the full dataset. NOTE that if
you're running this in case 3 you should
probably be a premium member and not be paying
per use.
--help Show this message and exit.
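Conceptually, each of those queries scores the unlabeled pool and moves the most uncertain examples into the train set. A bare-bones sketch using least-confidence sampling (the acquisition function actually used in active.py may differ):

```python
def least_confidence_query(probs, k):
    """Pick the k pool indices whose top predicted probability is
    lowest, i.e. where the classifier is least confident."""
    scored = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    return scored[:k]

# One AL round: given per-class probabilities for the unlabeled pool,
# select which examples to send for labeling next.
pool_probs = [[0.90, 0.10], [0.55, 0.45], [0.60, 0.40], [0.99, 0.01]]
picked = least_confidence_query(pool_probs, k=2)
# picks the rows closest to a 50/50 split
```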
Alternatively, after installing the libraries in requirements.txt, you can just run ./run.sh
After that, you can run the notebook summarise.ipynb to aggregate and visualize (if I have added that code) the results.
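The aggregation itself is just a mean ± std over the repeated runs; in pure Python (the notebook presumably uses pandas -- this only shows the idea):

```python
from statistics import mean, stdev

def summarise(accuracies):
    """Collapse per-seed accuracies into the 'mean±std' cells
    reported in the results table."""
    mu = mean(accuracies)
    sd = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return f"{mu:.3f}\u00b1{sd:.3f}"
```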
PS: note
--full-test. By default, we truncate every test set to its first 100 instances.
These are all cleaned-up classification datasets, courtesy of the kind folks who made the SetFit lib. But you can use any HF dataset, as long as it has these three fields: (i) text (str), (ii) label (int), and (iii) label_text (str).
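A quick way to check that schema, shown here on plain dicts rather than an actual datasets.Dataset row (illustrative only):

```python
# The three fields this repo expects on every dataset row.
REQUIRED_FIELDS = {"text": str, "label": int, "label_text": str}

def has_expected_schema(example):
    """True if a dataset row carries the three required fields,
    each with the expected type."""
    return all(
        field in example and isinstance(example[field], expected)
        for field, expected in REQUIRED_FIELDS.items()
    )
```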
Here are my results:
The table below presents results for this + the active learning setup. Unless stated otherwise, we repeat each experiment 5 times. The numbers report task accuracy when we have only 100 instances in the train set.
|  | BBC News | SST2 | SentEval-CR | IMDB | Enron_spam |
|---|---|---|---|---|---|
| SetFit ft | 0.978±0.004 | 0.860±0.018 | 0.882±0.029 | 0.924±0.026 | 0.960±0.017 |
| SetFit ft w/o contrastive | 0.932±0.015 | 0.854±0.019 | 0.886±0.005 | 0.902±0.019 | 0.942±0.020 |
| Regular ft | 0.466±0.133 | 0.628±0.098 | 0.582±0.054 | 0.836±0.166 | 0.776±0.089 |
| LLM prompting [1] | 0.950±0.000 | 0.930±0.000 | 0.900±0.000 | 0.930±0.000 | 0.820±0.000 |
| Constrained [2] | 0.980±0.000 | 0.910±0.000 | 0.910±0.000 | 0.870±0.000 | 0.980±0.000 |
[1]: LLM prompting was done with only 10 instances (the actual prompt may contain fewer, to keep its length down). It is also not repeated across different seeds.
[2]: The constrained setup is likewise not repeated across different seeds.