fewshot textclassification

Few-shot text classification

Playing around with the SetFit approach to few-shot transfer for text classification.

EDIT: I also ran some experiments with active learning, so there is now an active.py as well. I'll organize it better one sunny day.

Implemented methods

In main.py

Case 0: The SetFit method as described in its paper, i.e., a sentence transformer fine-tuned in a contrastive, self-supervised manner. We then fit a logistic classifier on top of the encoded sentences and perform the actual task.
Case 1: Regular task-specific fine-tuning. Starting from the sentence transformer, we
1. Skip the self-supervised fine-tuning of the transformer and train directly on the task.
2. Replace the logistic classifier with a regular dense network, trained alongside the encoder.
Case 2: Similar to Case 0, but we skip the self-supervised fine-tuning of the transformer and go straight to encoding the text and fitting a logistic classifier.
Case 3: Instead of all this, we formulate a few-shot prompt and ask a model hosted on HuggingFace to classify the text.
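The contrastive step in Case 0 needs sentence pairs built from the handful of labeled examples. The snippet below is a minimal sketch of one way to do that, loosely modeled on the pair sampling described in the SetFit paper. It is not the code from main.py: the function name `generate_pairs` and its parameters are hypothetical, and the `num_iters` argument is only assumed to play the same role as the `--num-iters` option.

```python
import random
from typing import List, Tuple


def generate_pairs(
    texts: List[str], labels: List[int], num_iters: int, seed: int = 0
) -> List[Tuple[str, str, float]]:
    """Build (text_a, text_b, similarity) triples for contrastive fine-tuning.

    Similarity is 1.0 when both texts share a label and 0.0 otherwise.
    Hypothetical helper: an illustration, not the repository's implementation.
    """
    rng = random.Random(seed)
    pairs: List[Tuple[str, str, float]] = []
    for _ in range(num_iters):
        for i, (text, label) in enumerate(zip(texts, labels)):
            # Candidates with the same label, excluding the anchor itself.
            positives = [
                t for j, (t, l) in enumerate(zip(texts, labels))
                if l == label and j != i
            ]
            # Candidates with any other label.
            negatives = [t for t, l in zip(texts, labels) if l != label]
            if positives:
                pairs.append((text, rng.choice(positives), 1.0))
            if negatives:
                pairs.append((text, rng.choice(negatives), 0.0))
    return pairs


if __name__ == "__main__":
    demo_texts = ["good", "great", "bad", "awful"]
    demo_labels = [1, 1, 0, 0]
    for pair in generate_pairs(demo_texts, demo_labels, num_iters=1):
        print(pair)
```

Each pass over the data yields at most one positive and one negative pair per sentence, so the number of pairs grows linearly with `num_iters`, which is consistent with the warning that large values get expensive to train.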

In active.py

Case 4: Use contrastive active learning. The implementation from small-text is <3 (hope you have enormous GPUs).

Usage

$ python main.py --help
Usage: main.py [OPTIONS]

Options:
  -d, --dataset-name TEXT         The name of the dataset as it appears on the
                                  HuggingFace hub e.g. SetFit/SentEval-CR |
                                  SetFit/bbc-news | SetFit/enron_spam ...

  -c, --case INTEGER              0, 1, 2, or 3: which experiment are we
                                  running. See readme or docstrings to know
                                  more but briefly: **0**: SentTF ->
                                  Constrastive Pretrain -> +LogReg on task.
                                  **1**: SentTF -> +Dense on task. **2**:
                                  SentTF -> +LogReg on task. **3**:
                                  FewShotPrompting based Clf over Flan-t5-xl
                                  [required]

  -r, --repeat INTEGER            The number of times we should run the entire
                                  experiment (changing the seed).

  -bs, --batch-size INTEGER       ... you know what it is.
  -ns, --num-sents INTEGER        Size of our train set. Set short values
                                  (under 100)

  -e, --num-epochs INTEGER        Epochs for fitting Clf+SentTF on the main
                                  (classification) task.

  -eft, --num-epochs-finetune INTEGER
                                  Epochs for both contrastive pretraining of
                                  SentTF.

  -ni, --num-iters INTEGER        Number of text pairs to generate for
                                  contrastive learning. Values above 20 can
                                  get expensive to train.

  -tot, --test-on-test            If true, we report metrics on testset. If
                                  not, on a 20% split of train set. Off by
                                  default.

  -ft, --full-test                We truncate the testset of every dataset to
                                  have 100 instances. If you know what you're
                                  doing, you can test on the full dataset.NOTE
                                  that if you're running this in case 3 you
                                  should probably be a premium member and not
                                  be paying per use.

  --help                          Show this message and exit.

NOTE: If you want to query LLMs hosted on HuggingFace (Case 3), you must create an account on the HuggingFace Hub and generate access tokens, then paste them into a file named ./hf_token.key.
PS: Don't worry, I've added this file to .gitignore.
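For illustration, reading the token back from that file could look like the sketch below. The helper name `read_hf_token` is hypothetical and the error handling is an assumption; the scripts in this repo may load the file differently.

```python
from pathlib import Path


def read_hf_token(path: str = "./hf_token.key") -> str:
    """Return the access token stored in `path`, stripped of whitespace.

    Hypothetical helper: only the file name comes from the note above.
    """
    token_file = Path(path)
    if not token_file.is_file():
        raise FileNotFoundError(f"Expected a HuggingFace access token in {token_file}")
    token = token_file.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"{token_file} exists but is empty")
    return token
```

Failing early with a clear message is cheaper than discovering a missing token halfway through a Case 3 run.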


$ python active.py --help
Usage: active.py [OPTIONS]

Options:
  -d, --dataset-name TEXT     The name of the dataset as it appears on the
                              HuggingFace hub e.g. SetFit/SentEval-CR |
                              SetFit/bbc-news | SetFit/enron_spam | imdb ...

  -ns, --num-sents INTEGER    Size of our train set. I.e., the dataset at the
                              END of AL. Not the start of it.

  -nq, --num-queries INTEGER  Number of times we query the unlabeled set and
                              pick some labeled examples. Set short values
                              (under 10)

  -ft, --full-test            We truncate the testset of every dataset to have
                              100 instances. If you know what you're doing,
                              you can test on the full dataset.NOTE that if
                              you're running this in case 3 you should
                              probably be a premium member and not be paying
                              per use.

  --help                      Show this message and exit.
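To make the relationship between `--num-sents` and `--num-queries` concrete, here is a rough sketch of a pool-based active learning loop with a fixed labeling budget. Everything in it is an assumption for illustration: the function `active_learning_loop`, the equal split of the budget between an initial random set and the queries, and the pluggable `score` callback, which stands in for a real acquisition function such as the contrastive one used in Case 4. It is not how active.py or small-text is implemented.

```python
import random
from typing import Callable, List


def active_learning_loop(
    pool_size: int,
    num_sents: int,
    num_queries: int,
    score: Callable[[int, List[int]], float],
    seed: int = 0,
) -> List[int]:
    """Return indices of the labeled set after `num_queries` rounds.

    The final labeled set has exactly `num_sents` items (assumes
    pool_size >= num_sents). `score(idx, labeled)` is a placeholder
    acquisition function: higher means more worth labeling.
    """
    rng = random.Random(seed)
    batch = num_sents // (num_queries + 1)  # items added per query (assumed split)
    unlabeled = list(range(pool_size))
    rng.shuffle(unlabeled)

    # The initial random set absorbs any remainder of the budget.
    initial_size = num_sents - batch * num_queries
    labeled = unlabeled[:initial_size]
    unlabeled = unlabeled[initial_size:]

    for _ in range(num_queries):
        # Rank the remaining pool by the acquisition score, best first.
        unlabeled.sort(key=lambda idx: score(idx, labeled), reverse=True)
        labeled.extend(unlabeled[:batch])
        unlabeled = unlabeled[batch:]
    return labeled
```

In a real run, the model would be retrained on `labeled` after every query so that `score` reflects its current uncertainty; that retraining step is left out here to keep the budget arithmetic visible.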

Or you can simply run ./run.sh after installing the required libraries (see requirements.txt).

Afterwards, you can run the summarise.ipynb notebook to summarize and visualize (if I ever add that code) the results.

PS: Pay attention to --full-test. By default, we truncate every test set to its first 100 instances.
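The interaction between `--test-on-test` and `--full-test` can be sketched as follows. This is a guess at the logic based only on the help text: the function name `pick_eval_split` is hypothetical, and two details are assumptions, namely that the 20% holdout is taken from the end of the train set and that the 100-instance truncation applies only to the real test set.

```python
from typing import List, Sequence, Tuple


def pick_eval_split(
    train: Sequence,
    test: Sequence,
    test_on_test: bool = False,
    full_test: bool = False,
    holdout_frac: float = 0.2,
    max_test: int = 100,
) -> Tuple[List, List]:
    """Return (fit_set, eval_set) according to the two flags.

    Hypothetical helper mirroring the CLI help text, not the repo's code.
    """
    if test_on_test:
        fit_set = list(train)
        eval_set = list(test)
        if not full_test:
            # Default behaviour: keep only the first `max_test` test instances.
            eval_set = eval_set[:max_test]
    else:
        # Hold out the last `holdout_frac` of the train set for evaluation.
        holdout = int(len(train) * holdout_frac)
        cut = len(train) - holdout
        fit_set = list(train[:cut])
        eval_set = list(train[cut:])
    return fit_set, eval_set
```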

Datasets used

SetFit/SentEval-CR
SetFit/bbc-news
SetFit/enron_spam
SetFit/sst2
imdb

These are all classification datasets that have been cleaned up by the nice and kind people who made the SetFit library. But you can use any HF dataset, as long as it has these three fields: (i) text (str), (ii) label (int), and (iii) label_text (str).
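Before pointing the scripts at a new dataset, it can help to check a few rows against those three fields. The snippet below is a small sketch that works on plain dictionaries; `REQUIRED_FIELDS` and `check_row` are hypothetical names, not part of this repo or of the HuggingFace `datasets` library.

```python
from typing import Dict, List

# Field names and types as listed above.
REQUIRED_FIELDS = {"text": str, "label": int, "label_text": str}


def check_row(row: Dict[str, object]) -> List[str]:
    """Return a list of problems found in one dataset row (empty if fine)."""
    problems: List[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field '{name}'")
        elif not isinstance(row[name], expected):
            actual = type(row[name]).__name__
            problems.append(f"field '{name}' should be {expected.__name__}, got {actual}")
    return problems


if __name__ == "__main__":
    print(check_row({"text": "great phone", "label": 1, "label_text": "positive"}))
    print(check_row({"text": "great phone", "label": "1"}))
```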

Conclusions?

Here are my results:

This table presents the results of the cases above plus the active learning setup. Unless specified otherwise, we repeat each experiment 5 times. The numbers report task accuracy with only 100 instances in the train set.

Method	BBC-News	SST2	SentEval-CR	IMDB	enron_spam
SetFit FT	0.978 ± 0.004	0.860 ± 0.018	0.882 ± 0.029	0.924 ± 0.026	0.960 ± 0.017
No contrastive SetFit FT	0.932 ± 0.015	0.854 ± 0.019	0.886 ± 0.005	0.902 ± 0.019	0.942 ± 0.020
Regular FT	0.466 ± 0.133	0.628 ± 0.098	0.582 ± 0.054	0.836 ± 0.166	0.776 ± 0.089
LLM prompting [1]	0.950 ± 0.000	0.930 ± 0.000	0.900 ± 0.000	0.930 ± 0.000	0.820 ± 0.000
AL [2]	0.980 ± 0.000	0.910 ± 0.000	0.910 ± 0.000	0.870 ± 0.000	0.980 ± 0.000

[1]: LLM prompting is done with only 10 instances (the actual prompt may contain fewer, depending on length). It is also not repeated across different seeds.

[2]: AL is also not repeated across different seeds.
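For reference, entries of the form "mean ± spread" over repeated runs can be produced with a few lines like the sketch below. The helper `summarise_runs` is hypothetical, and the choice of the sample standard deviation is an assumption, since the README does not say which estimator summarise.ipynb uses. A single run is reported with a spread of 0.000, which matches how the non-repeated rows appear in the table.

```python
from statistics import mean, stdev
from typing import Sequence


def summarise_runs(accuracies: Sequence[float]) -> str:
    """Format accuracies from repeated runs as 'mean ± std' with 3 decimals.

    Assumption: sample standard deviation; a single run gets a spread of 0.
    """
    spread = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return f"{mean(accuracies):.3f} ± {spread:.3f}"


if __name__ == "__main__":
    # Made-up accuracies for illustration only, not results from this repo.
    print(summarise_runs([0.8, 1.0]))
    print(summarise_runs([0.95]))
```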
