fewshot textclassification

1.0.0

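Few Shot Text Classification

Uses the SetFit approach to few-shot transfer for text classification.

Edit: I also experimented with active learning, so there is now an active.py as well. I'll organise it better on a sunny day.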

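Methods implemented

In main.py

Case 0: The SetFit method as outlined in the paper, i.e. a sentence transformer is fine-tuned in a self-supervised, contrastive manner. We then slap a logistic classifier on top of the encoded sentences and do the actual task.
Case 1: Regular task-specific fine-tuning of the sentence transformer, i.e.
1. We skip the self-supervised fine-tuning of the transformer and train it directly on the task.
2. Instead of a logistic classifier, we use a regular dense layer and train it together with the encoder.
Case 2: Similar to case 0, but we skip the self-supervised fine-tuning of the transformer and go straight to encoding the texts and fitting a logistic classifier.
Case 3: Instead of all this, we formulate a few-shot prompt and ask a model on HuggingFace to classify the text.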
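As a rough illustration of case 3, here is a minimal sketch of how a few-shot classification prompt could be assembled and how the model's free-text answer could be mapped back to a label. The prompt wording and the helper names `build_fewshot_prompt` and `parse_label` are assumptions made for this example, not the format main.py actually uses.

```python
from typing import List, Tuple


def build_fewshot_prompt(
    examples: List[Tuple[str, str]], query: str, label_names: List[str]
) -> str:
    """Assemble a few-shot prompt (hypothetical format, not the repo's)."""
    lines = [f"Classify the text into one of: {', '.join(label_names)}.", ""]
    # Each labelled example becomes a "Text / Label" demonstration.
    for text, label_text in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label_text}")
        lines.append("")
    # The query is left open-ended so the model completes the label.
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)


def parse_label(generated: str, label_names: List[str]) -> int:
    """Map the generated answer to a label index, or -1 if nothing matches."""
    answer = generated.strip().lower()
    for idx, name in enumerate(label_names):
        if answer.startswith(name.lower()):
            return idx
    return -1


if __name__ == "__main__":
    demo = [("Great phone, love it.", "positive"), ("Broke in a week.", "negative")]
    print(build_fewshot_prompt(demo, "Battery life is awful.", ["negative", "positive"]))
```

The resulting string would then be sent to whichever hosted model you use; answers that match no label name come back as -1 so they can be counted as errors rather than silently guessed.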

In active.py

Case 4: Uses contrastive active learning. Implementation by small-text <3 (hope you have a huge GPU).
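To give a feel for what a contrastive acquisition step does, here is a toy sketch: each unlabeled candidate is scored by how much the model's predicted distribution differs (KL divergence) from that of its nearest labeled neighbours in embedding space, and the highest-scoring candidates are queried. This is my reading of the general idea, written with plain lists; it is not small-text's code, and all function names and the toy numbers are made up for illustration.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions; eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_scores(pool_emb, pool_prob, lab_emb, lab_prob, k: int = 2) -> List[float]:
    """Score each pool item by mean KL between its k nearest labeled
    neighbours' predictions and its own prediction."""
    scores = []
    for emb, prob in zip(pool_emb, pool_prob):
        nearest = sorted(range(len(lab_emb)), key=lambda i: euclidean(emb, lab_emb[i]))[:k]
        scores.append(sum(kl_divergence(lab_prob[i], prob) for i in nearest) / len(nearest))
    return scores


def select_queries(scores: List[float], batch_size: int) -> List[int]:
    """Indices of the batch_size highest-scoring pool items."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


if __name__ == "__main__":
    lab_emb, lab_prob = [[0.0, 0.0], [1.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]
    pool_emb, pool_prob = [[0.1, 0.0], [0.9, 1.0]], [[0.9, 0.1], [0.8, 0.2]]
    scores = contrastive_scores(pool_emb, pool_prob, lab_emb, lab_prob, k=1)
    print(select_queries(scores, batch_size=1))
```

In the toy data the second pool item sits next to a labeled point but gets a very different prediction, so it is the one picked for labelling.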

Usage

$ python main.py --help
Usage: main.py [OPTIONS]

Options:
  -d, --dataset-name TEXT         The name of the dataset as it appears on the
                                  HuggingFace hub e.g. SetFit/SentEval-CR |
                                  SetFit/bbc-news | SetFit/enron_spam ...

  -c, --case INTEGER              0, 1, 2, or 3: which experiment are we
                                  running. See readme or docstrings to know
                                  more but briefly: **0**: SentTF ->
                                  Constrastive Pretrain -> +LogReg on task.
                                  **1**: SentTF -> +Dense on task. **2**:
                                  SentTF -> +LogReg on task. **3**:
                                  FewShotPrompting based Clf over Flan-t5-xl
                                  [required]

  -r, --repeat INTEGER            The number of times we should run the entire
                                  experiment (changing the seed).

  -bs, --batch-size INTEGER       ... you know what it is.
  -ns, --num-sents INTEGER        Size of our train set. Set short values
                                  (under 100)

  -e, --num-epochs INTEGER        Epochs for fitting Clf+SentTF on the main
                                  (classification) task.

  -eft, --num-epochs-finetune INTEGER
                                  Epochs for both contrastive pretraining of
                                  SentTF.

  -ni, --num-iters INTEGER        Number of text pairs to generate for
                                  contrastive learning. Values above 20 can
                                  get expensive to train.

  -tot, --test-on-test            If true, we report metrics on testset. If
                                  not, on a 20% split of train set. Off by
                                  default.

  -ft, --full-test                We truncate the testset of every dataset to
                                  have 100 instances. If you know what you're
                                  doing, you can test on the full dataset.NOTE
                                  that if you're running this in case 3 you
                                  should probably be a premium member and not
                                  be paying per use.

  --help                          Show this message and exit.

Note: put your HuggingFace token in ./hf_token.key.
PS: I have added this file to .gitignore.
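The -ni/--num-iters option in the main.py help above controls how many text pairs get generated for contrastive learning. As a sketch of what such pair generation can look like, the snippet below pairs every training sentence with one random same-label sentence (target 1.0) and one random different-label sentence (target 0.0), once per iteration. The function name `generate_pairs` and the exact sampling scheme are assumptions for illustration; the repo's own procedure may differ.

```python
import random
from typing import List, Tuple


def generate_pairs(
    texts: List[str], labels: List[int], num_iters: int, seed: int = 0
) -> List[Tuple[str, str, float]]:
    """Build (anchor, other, similarity) triples for contrastive fine-tuning."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iters):
        for i, (text, label) in enumerate(zip(texts, labels)):
            # Same label, excluding the anchor itself.
            positives = [t for j, (t, l) in enumerate(zip(texts, labels)) if l == label and j != i]
            # Any sentence with a different label.
            negatives = [t for t, l in zip(texts, labels) if l != label]
            if positives:
                pairs.append((text, rng.choice(positives), 1.0))
            if negatives:
                pairs.append((text, rng.choice(negatives), 0.0))
    return pairs


if __name__ == "__main__":
    texts = ["good", "great", "bad", "awful"]
    labels = [1, 1, 0, 0]
    print(len(generate_pairs(texts, labels, num_iters=3)))
```

Under this scheme the number of pairs grows as roughly 2 x train size x num_iters, which is why large values get expensive to train on.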


$ python active.py --help
Usage: active.py [OPTIONS]

Options:
  -d, --dataset-name TEXT     The name of the dataset as it appears on the
                              HuggingFace hub e.g. SetFit/SentEval-CR |
                              SetFit/bbc-news | SetFit/enron_spam | imdb ...

  -ns, --num-sents INTEGER    Size of our train set. I.e., the dataset at the
                              END of AL. Not the start of it.

  -nq, --num-queries INTEGER  Number of times we query the unlabeled set and
                              pick some labeled examples. Set short values
                              (under 10)

  -ft, --full-test            We truncate the testset of every dataset to have
                              100 instances. If you know what you're doing,
                              you can test on the full dataset.NOTE that if
                              you're running this in case 3 you should
                              probably be a premium member and not be paying
                              per use.

  --help                      Show this message and exit.

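Alternatively, after installing the required libraries (see requirements.txt), you can simply run ./run.sh.

After that, you can run the notebook summarise.ipynb to summarise and visualise the results (if and when this code is added).

PS: Pay attention to --full-test. By default, we truncate every test set to its first 100 instances.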
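To make the interplay of --test-on-test and --full-test concrete, here is a small sketch of how the evaluation set could be chosen. The function `pick_eval_set`, taking the last 20% of the train set as the held-out part, and applying the 100-instance cap to that held-out part too are all assumptions for this example; check main.py for the real behaviour.

```python
from typing import List, Sequence, Tuple


def pick_eval_set(
    train: Sequence,
    test: Sequence,
    test_on_test: bool = False,
    full_test: bool = False,
    limit: int = 100,
    holdout: float = 0.2,
) -> Tuple[List, List]:
    """Return (fit_set, eval_set) following the two CLI flags."""
    if test_on_test:
        # Report on the dataset's own test split.
        fit, evaluation = list(train), list(test)
    else:
        # Otherwise hold out a share of the train set (here: its tail).
        n_hold = round(len(train) * holdout)
        cut = len(train) - n_hold
        fit, evaluation = list(train[:cut]), list(train[cut:])
    if not full_test:
        # Default: keep only the first `limit` evaluation instances.
        evaluation = evaluation[:limit]
    return fit, evaluation


if __name__ == "__main__":
    fit, ev = pick_eval_set(list(range(50)), list(range(500)), test_on_test=True)
    print(len(fit), len(ev))
```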

Datasets used

SetFit/SentEval-CR
SetFit/bbc-news
SetFit/enron_spam
SetFit/sst2
imdb

These are all classification datasets that have been cleaned by the nice and kind people who made the SetFit lib. But you can use any HF dataset, provided it has these three fields: (i) text (str), (ii) label (int), and (iii) label_text (str).
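If you want to try your own dataset, a quick sanity check of the three fields might look like the sketch below. It works on plain dicts (one per example); the helper name `check_example` is made up for illustration and is not part of this repo.

```python
from typing import Dict, List

# Field name -> expected Python type, as listed above.
REQUIRED_FIELDS = {"text": str, "label": int, "label_text": str}


def check_example(example: Dict) -> List[str]:
    """Return a list of problems found in one example (empty if it is fine)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in example:
            problems.append(f"missing field: {name}")
            continue
        value = example[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"field {name} should be {expected.__name__}")
    return problems


if __name__ == "__main__":
    row = {"text": "Works as advertised.", "label": 1, "label_text": "positive"}
    print(check_example(row))
```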

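Conclusion?

My results are as follows:

This table presents the results of the main.py cases plus the active learning setup. Unless stated otherwise, each experiment is repeated 5 times. The numbers report task accuracy when the train set has only 100 instances.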

Method	bbc-news	sst2	SentEval-CR	imdb	enron_spam
SetFit FT	0.978 ± 0.004	0.860 ± 0.018	0.882 ± 0.029	0.924 ± 0.026	0.960 ± 0.017
Contrastive SetFit FT	0.932 ± 0.015	0.854 ± 0.019	0.886 ± 0.005	0.902 ± 0.019	0.942 ± 0.020
Regular FT	0.466 ± 0.133	0.628 ± 0.098	0.582 ± 0.054	0.836 ± 0.166	0.776 ± 0.089
LLM prompting	0.950 ± 0.000	0.930 ± 0.000	0.900 ± 0.000	0.930 ± 0.000	0.820 ± 0.000
Active learning (AL)	0.980 ± 0.000	0.910 ± 0.000	0.910 ± 0.000	0.870 ± 0.000	0.980 ± 0.000
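Each cell in the table has the form "mean ± spread" over the repeated runs. A sketch of how such a cell could be produced from a list of per-run accuracies is below. I am assuming the population standard deviation here; whether the notebook uses that or the sample standard deviation is not stated, and the helper name `summarise` is just for this example.

```python
from statistics import mean, pstdev
from typing import Sequence


def summarise(accuracies: Sequence[float]) -> str:
    """Format per-run accuracies as 'mean ± std' with three decimals."""
    # pstdev is 0.0 for a single run, which gives cells like '0.950 ± 0.000'.
    return f"{mean(accuracies):.3f} ± {pstdev(accuracies):.3f}"


if __name__ == "__main__":
    print(summarise([0.5, 0.7]))
    print(summarise([0.95]))
```

A spread of 0.000 in the table is therefore what you would expect from a setup that was run only once.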