fewshot textclassification
1.0.0
Few-shot text classification using the SetFit method.
Edit: I also ran some experiments with active learning, so there is now an active.py as well. I will organize it better one fine day.
```
~/Dev/projects/setfit$ python main.py --help
Usage: main.py [OPTIONS]

Options:
  -d, --dataset-name TEXT         The name of the dataset as it appears on
                                  the HuggingFace hub e.g. SetFit/SentEval-CR
                                  | SetFit/bbc-news | SetFit/enron_spam ...
  -c, --case INTEGER              0, 1, 2, or 3: which experiment are we
                                  running. See readme or docstrings to know
                                  more but briefly: **0**: SentTF ->
                                  Contrastive Pretrain -> +LogReg on task.
                                  **1**: SentTF -> +Dense on task. **2**:
                                  SentTF -> +LogReg on task. **3**:
                                  FewShotPrompting based Clf over Flan-t5-xl
                                  [required]
  -r, --repeat INTEGER            The number of times we should run the
                                  entire experiment (changing the seed).
  -bs, --batch-size INTEGER       ... you know what it is.
  -ns, --num-sents INTEGER        Size of our train set. Set short values
                                  (under 100).
  -e, --num-epochs INTEGER        Epochs for fitting Clf+SentTF on the main
                                  (classification) task.
  -eft, --num-epochs-finetune INTEGER
                                  Epochs for the contrastive pretraining of
                                  SentTF.
  -ni, --num-iters INTEGER        Number of text pairs to generate for
                                  contrastive learning. Values above 20 can
                                  get expensive to train.
  -tot, --test-on-test            If true, we report metrics on the test set.
                                  If not, on a 20% split of the train set.
                                  Off by default.
  -ft, --full-test                We truncate the test set of every dataset
                                  to have 100 instances. If you know what
                                  you're doing, you can test on the full
                                  dataset. NOTE that if you're running this
                                  in case 3 you should probably be a premium
                                  member and not be paying per use.
  --help                          Show this message and exit.
```
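For orientation, case 0 is the standard SetFit recipe: sample sentence pairs from the few labeled examples, fine-tune a sentence-transformer contrastively on them, then fit a logistic-regression head on the resulting embeddings. Below is a minimal sketch of that idea, assuming sentence-transformers + scikit-learn; it is not the actual code in main.py, and the model name, pair sampling and hyperparameters are placeholders.

```python
# Minimal sketch of the case-0 recipe: SentTF -> contrastive pretrain -> +LogReg.
# Not the repo's code; model name, pair sampling and hyperparameters are illustrative.
import random

from sentence_transformers import SentenceTransformer, InputExample, losses
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader

def make_pairs(texts, labels, num_iters=20):
    """Sample text pairs; label 1.0 when the two classes match, else 0.0."""
    pairs = []
    for _ in range(num_iters):
        (t1, y1), (t2, y2) = random.sample(list(zip(texts, labels)), 2)
        pairs.append(InputExample(texts=[t1, t2], label=float(y1 == y2)))
    return pairs

texts = ["great phone", "battery died in a day", "love the camera", "total waste of money"]
labels = [1, 0, 1, 0]

model = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")

# 1. Contrastive pretraining of the sentence encoder on the generated pairs.
loader = DataLoader(make_pairs(texts, labels), batch_size=8, shuffle=True)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))], epochs=1)

# 2. Fit a logistic-regression head on the (now adapted) sentence embeddings.
clf = LogisticRegression().fit(model.encode(texts), labels)
print(clf.predict(model.encode(["the screen is gorgeous"])))
```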
Note: if you want to query an LLM hosted on HuggingFace (case 3), you must create an account on the HuggingFace Hub and generate an access token, then paste it into the file
./hf_token.key. PS: don't worry, I have already added this file to .gitignore.
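For case 3, that token is what gets sent along with the request to the hosted model. A rough sketch of such a call, assuming the serverless Inference API and a hand-rolled few-shot prompt (illustrative only, not necessarily how main.py builds its prompts):

```python
# Rough sketch of querying a HuggingFace-hosted LLM for classification (case 3).
# The prompt format and post-processing are illustrative, not main.py's exact code.
import requests

with open("./hf_token.key") as f:
    hf_token = f.read().strip()

API_URL = "https://api-inference.huggingface.co/models/google/flan-t5-xl"
headers = {"Authorization": f"Bearer {hf_token}"}

# A tiny few-shot prompt: labeled examples followed by the instance to classify.
prompt = (
    "Classify the review as positive or negative.\n"
    "Review: the battery barely lasts an hour. Label: negative\n"
    "Review: best purchase I have made this year. Label:"
)

response = requests.post(API_URL, headers=headers, json={"inputs": prompt})
response.raise_for_status()
print(response.json())  # e.g. [{"generated_text": "positive"}]
```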
```
$ python active.py --help
Usage: active.py [OPTIONS]

Options:
  -d, --dataset-name TEXT     The name of the dataset as it appears on the
                              HuggingFace hub e.g. SetFit/SentEval-CR |
                              SetFit/bbc-news | SetFit/enron_spam | imdb ...
  -ns, --num-sents INTEGER    Size of our train set. I.e., the dataset at the
                              END of AL. Not the start of it.
  -nq, --num-queries INTEGER  Number of times we query the unlabeled set and
                              pick some examples to label. Set short values
                              (under 10).
  -ft, --full-test            We truncate the test set of every dataset to
                              have 100 instances. If you know what you're
                              doing, you can test on the full dataset. NOTE
                              that if you're running this in case 3 you
                              should probably be a premium member and not be
                              paying per use.
  --help                      Show this message and exit.
```
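Under the hood this is the usual pool-based active-learning loop: start from a small labeled seed, train a cheap classifier, score the unlabeled pool, and move the examples the model is least sure about into the train set until it reaches --num-sents. Here is a sketch of that loop with least-confident sampling; the query strategy and the LogisticRegression probe are assumptions, active.py's actual choices may differ.

```python
# Sketch of a pool-based active-learning loop matching the --num-sents / --num-queries
# semantics above. Least-confident sampling and the LogisticRegression probe are
# assumptions for illustration, not necessarily what active.py does.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_pool, y_pool, num_sents=100, num_queries=10, seed_size=20):
    """X_pool: (n, d) features (e.g. sentence embeddings); y_pool: oracle labels."""
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), size=seed_size, replace=False))
    per_query = (num_sents - seed_size) // num_queries  # examples labeled per round

    for _ in range(num_queries):
        clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        probs = clf.predict_proba(X_pool[unlabeled])
        # Least-confident sampling: query the points whose top-class probability is lowest.
        picked = unlabeled[np.argsort(probs.max(axis=1))[:per_query]]
        labeled.extend(picked.tolist())

    return labeled  # indices of the num_sents examples we ended up labeling

# Toy usage with random features standing in for embeddings.
X = np.random.default_rng(1).normal(size=(500, 16))
y = (X[:, 0] > 0).astype(int)
print(len(active_learning_loop(X, y)))  # -> 100
```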
Alternatively, after installing the libraries in requirements.txt, you can simply run ./run.sh.
After that, you can run the notebook summarise.ipynb to aggregate and visualize (if I've added that code) the results.
PS: note the --full-test flag. By default, we truncate every test set to its first 100 instances.

These are all classification datasets cleaned up by the kind folks who made the SetFit lib. But you can use any HF dataset, as long as it has these three fields: (i) text (str), (ii) label (int), and (iii) label_text (str).
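To check whether a dataset will work, it is enough to load it and look at those columns; the snippet below also shows the default 100-instance test truncation mentioned above. The dataset name is one of the examples from the option help, the rest is illustrative.

```python
# Quick check that a HuggingFace dataset has the three expected columns, plus the
# default 100-instance test truncation described above.
from datasets import load_dataset

dataset = load_dataset("SetFit/enron_spam")

assert {"text", "label", "label_text"} <= set(dataset["train"].column_names)
print(dataset["train"][0])  # -> {'text': ..., 'label': 0 or 1, 'label_text': ...}

# Unless --full-test is passed, only the first 100 test instances are evaluated.
small_test = dataset["test"].select(range(100))
print(len(small_test))  # -> 100
```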
Here are my results:
The table below presents the results for this setup (+ active learning). Unless noted otherwise, we repeat every experiment 5 times. The numbers report task accuracy when we have only 100 instances in the train set.
|  | BBC News | SST2 | SentEval-CR | IMDB | Enron_spam |
|---|---|---|---|---|---|
| SetFit FT | 0.978±0.004 | 0.860±0.018 | 0.882±0.029 | 0.924±0.026 | 0.960±0.017 |
| SetFit FT without contrastive | 0.932±0.015 | 0.854±0.019 | 0.886±0.005 | 0.902±0.019 | 0.942±0.020 |
| Regular FT | 0.466±0.133 | 0.628±0.098 | 0.582±0.054 | 0.836±0.166 | 0.776±0.089 |
| LLM Prompting | 0.950±0.000 | 0.930±0.000 | 0.900±0.000 | 0.930±0.000 | 0.820±0.000 |
| Constrained | 0.980±0.000 | 0.910±0.000 | 0.910±0.000 | 0.870±0.000 | 0.980±0.000 |
[1]: LLM prompting is done with only 10 instances (the actual prompt may end up shorter than that). It is also not repeated across different seeds.
[2]: Constrained is also not repeated across different seeds.