Kaggle-PII_Data_Detection

This repo is for Kaggle - PII Data Detection.

Python Environment

1. Install packages

 pip install -r requirements.txt

Prepare Data

1. Set up the Kaggle API

 export KAGGLE_USERNAME="your_kaggle_username"
export KAGGLE_KEY="your_api_key"

2. Download the datasets

 cd kaggle_dataset
kaggle datasets download -d lizhecheng/pii-data-detection-dataset
unzip pii-data-detection-dataset.zip
kaggle datasets download -d lizhecheng/piidd-reliable-cv
unzip piidd-reliable-cv.zip

 cd kaggle_dataset
cd competition
kaggle competitions download -c pii-detection-removal-from-educational-data
unzip pii-detection-removal-from-educational-data.zip

3. Vanilla DeBERTa

 cd kaggle_notebook
run train-0.3-validation.ipynb

4. Run a Wandb sweep

 cd models
wandb sweep --project PII config.yaml

wandb agent xxx/PII/xxxxxxxx

5. 4-fold cross-validation

 cd kfold
wandb login --relogin
(input your wandb api key)
wandb init -e (input your wandb username)

export KFOLD=0  # choose one fold: 0, 1, 2, or 3
wandb sweep --project PII hypertuning_kfold.yaml
wandb agent xxx/PII/xxxxxxxx

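This is the [Public 9th Private 25th Solution] for The Learning Agency Lab - PII Data Detection.

First of all, thank you to Kaggle and THE LEARNING AGENCY LAB for hosting this competition, and thanks to everyone on the team. The result was not perfect, but we still learned a lot and will keep moving forward. Congratulations to all the winners!

Full Code

The GitHub repo for this competition is below. Nearly all of the code can be found there: https://github.com/lizhecheng02/kaggle-pii_data_detection

Fine-tuning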

AWP (Adversarial Weight Perturbation)

Use a custom AWP class to improve model robustness, and write your own CustomTrainer around it. Our team uses this method often in NLP competitions, and it gives good results. (The code is in the models directory of the GitHub repo.)
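To make the idea concrete, here is a framework-free toy sketch of one AWP-style update on a small linear regression. It is not the repo's AWP class, which operates on a PyTorch model inside a custom Trainer; the function names, the `adv_lr` parameter, and the data are all assumptions made for illustration. The step perturbs the weights in the loss-increasing direction, takes a second gradient at the perturbed point, discards the perturbation, and updates the original weights with both gradients.

```python
import math


def mse(w, xs, ys):
    """Mean squared error of the linear model y_hat = w . x."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


def mse_grad(w, xs, ys):
    """Gradient of mse with respect to w."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for k, xk in enumerate(x):
            grad[k] += 2.0 * err * xk / len(xs)
    return grad


def awp_step(w, xs, ys, lr=0.1, adv_lr=0.01):
    """One update that adds the gradient taken at adversarially perturbed weights."""
    grad = mse_grad(w, xs, ys)
    g_norm = math.sqrt(sum(g * g for g in grad))
    w_norm = math.sqrt(sum(v * v for v in w))
    if g_norm == 0.0:
        return list(w)  # already at a stationary point
    # Perturb the weights in the direction that increases the loss,
    # scaled relative to the weight norm (hypothetical scaling choice).
    w_adv = [wi + adv_lr * w_norm * gi / g_norm for wi, gi in zip(w, grad)]
    adv_grad = mse_grad(w_adv, xs, ys)
    # The perturbation is discarded; the original weights are updated with
    # the clean and adversarial gradients combined.
    return [wi - lr * (gi + ai) for wi, gi, ai in zip(w, grad, adv_grad)]


# Toy data generated by the weights [2, -1] (illustrative only).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
w = [0.5, 0.5]
for _ in range(200):
    w = awp_step(w, xs, ys)
print(w, mse(w, xs, ys))
```

In a real training loop the same pattern is usually expressed as: backward pass, perturb parameters, second backward pass, restore parameters, optimizer step.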

Wandb Sweep

This tool lets you try many different hyperparameter combinations and select the ones that produce the best fine-tuning results. (The code is in the models directory of the GitHub repo.)
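As a rough illustration of what a grid-style sweep enumerates, the sketch below expands a search space into individual configurations using only the standard library. It does not call the wandb API, and the hyperparameter names and values are hypothetical; the real search space is defined in the repo's config.yaml.

```python
from itertools import product

# Hypothetical search space, for illustration only.
search_space = {
    "learning_rate": [1e-5, 2e-5],
    "max_length": [1024, 2048],
}


def grid(space):
    """Yield one config dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


for config in grid(search_space):
    print(config)
```

A sweep agent would train one model per yielded configuration and report its validation metric, so the best combination can be picked afterwards.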

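Replace \n\n with | in all documents

With this change, we trained a set of models with 4-fold cross-validation that reached an LB score of 0.977. There was a slight improvement on the LB, but the results did not improve on the PB.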
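A minimal sketch of this replacement, assuming each document is stored as a list of token strings with a parallel list of labels. The function name is hypothetical and not taken from the repo.

```python
def replace_double_newlines(tokens, replacement="|"):
    # Swap every double-newline token for the replacement string.
    return [replacement if tok == "\n\n" else tok for tok in tokens]


tokens = ["Jane", "Doe", "\n\n", "Reflection", "essay"]
labels = ["B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O", "O"]

cleaned = replace_double_newlines(tokens)
# The token count is unchanged, so the labels stay aligned.
assert len(cleaned) == len(labels)
print(cleaned)
```

Replacing token by token, instead of editing the raw text, keeps the token-to-label alignment intact.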

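Post-processing

Correct wrong labels within student-name sequences. ( B, B -> B, I )
Pay special attention to \n appearing inside addresses; it should be given the I label.
Filter email addresses and phone numbers to remove results that do not belong to those two categories. ( no significant improvement )
Handle cases where titles such as Dr. are predicted with a B label. ( no significant improvement )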

 def pp(new_pred_df):
    df = new_pred_df.copy()
    i = 0
    while i < len(df):
        st = i
        doc = df.loc[st, "document"]
        tok = df.loc[st, "token"]
        pred_tok = df.loc[st, "label"]
        if pred_tok == 'O':
            i += 1
            continue
        lab = pred_tok.split('-')[1]
        cur_doc = doc
        cur_lab = lab
        last_tok = tok
        cur_tok = last_tok

        while i < len(df) and cur_doc == doc and cur_lab == lab and last_tok == cur_tok:
            last_tok = cur_tok + 1
            i += 1
            # Stop before indexing past the last row or when the entity ends.
            if i >= len(df) or df.loc[i, "label"] == 'O':
                break
            cur_doc = df.loc[i, "document"]
            cur_tok = df.loc[i, "token"]
            cur_lab = df.loc[i, "label"].split('-')[1]

        if st - 2 >= 0 and df.loc[st - 2, "document"] == df.loc[st, "document"] and df.loc[st - 1, "token_str"] == '\n' and df.loc[st - 2, "label"] != 'O' and df.loc[st - 2, "label"].split('-')[1] == lab:
            df.loc[st - 1, "label"] = 'I-' + lab
            df.loc[st - 1, "score"] = 1
            for j in range(st, i):
                if df.loc[j, "label"] != 'I-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'I-' + lab
            continue

        for j in range(st, i):
            if j == st:
                if df.loc[j, "label"] != 'B-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'B-' + lab
            else:
                if df.loc[j, "label"] != 'I-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'I-' + lab

        if lab == 'NAME_STUDENT' and any(len(item) == 2 and item[0].isupper() and item[1] == "." for item in df.loc[st:i-1, 'token_str']):
            for j in range(st, i):
                df.loc[j, "score"] = 0
                df.loc[j, "label"] = 'O'

    return df
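The first rule ( B, B -> B, I ) can be illustrated on a plain list of labels. This is a simplified sketch, assuming the labels of a single document in token order; the function name is hypothetical, and unlike pp it ignores document ids, token indices, and scores.

```python
def fix_bio_sequence(labels):
    """Rewrite runs of same-type entity labels so only the first keeps B-."""
    fixed = []
    prev_type = None  # entity type of the previous token, None after an "O"
    for label in labels:
        if label == "O":
            fixed.append(label)
            prev_type = None
            continue
        ent_type = label.split("-", 1)[1]
        prefix = "I" if ent_type == prev_type else "B"
        fixed.append(f"{prefix}-{ent_type}")
        prev_type = ent_type
    return fixed


print(fix_bio_sequence(["O", "B-NAME_STUDENT", "B-NAME_STUDENT", "O"]))
```

Like pp, this treats adjacent tokens of the same type as one entity, so two genuinely separate names that touch would be merged.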

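Ensemble

Average Ensemble

Average the predicted probabilities to get the final result. Because recall matters more than precision in this competition, we set the threshold to 0.0 so that potentially correct recalls are not missed.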

 for text_id in final_token_pred:
    for word_idx in final_token_pred[text_id]:
        pred = final_token_pred[text_id][word_idx].argmax(-1)
        pred_without_O = final_token_pred[text_id][word_idx][:12].argmax(-1)
        if final_token_pred[text_id][word_idx][12] < 0.0:
            final_pred = pred_without_O
            tmp_score = final_token_pred[text_id][word_idx][final_pred]
        else:
            final_pred = pred
            tmp_score = final_token_pred[text_id][word_idx][final_pred]
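The same selection rule can be sketched for a single token with plain lists. This assumes 13 classes with "O" at index 12, as the snippet above implies; the function names and the example probabilities are made up for illustration.

```python
O_INDEX = 12  # index of the "O" class, matching the snippet above


def average_probs(model_probs):
    """Element-wise mean of several per-class probability lists."""
    n = len(model_probs)
    return [sum(col) / n for col in zip(*model_probs)]


def pick_label(probs, threshold=0.0):
    """Return (class index, score); skip "O" when its probability is below threshold."""
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[O_INDEX] < threshold:
        non_o = probs[:O_INDEX]
        best = max(range(len(non_o)), key=non_o.__getitem__)
    return best, probs[best]


# Two hypothetical models disagreeing on one token.
model_a = [0.4] + [0.0] * 11 + [0.6]
model_b = [0.8] + [0.0] * 11 + [0.2]
avg = average_probs([model_a, model_b])
print(pick_label(avg))
```

Raising the threshold makes the rule prefer an entity label whenever the averaged "O" probability falls below it, which trades precision for recall.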

Vote Ensemble

In the final submission we ensembled 7 models and accepted a label as the correct prediction when at least two models predicted the same label.

 for tmp_pred in single_pred:
    for text_id in tmp_pred:
        max_id = 0
        for word_idx in tmp_pred[text_id]:
            max_id = tmp_pred[text_id][word_idx].argmax(-1)
            tmp_pred[text_id][word_idx] = np.zeros(tmp_pred[text_id][word_idx].shape)
            tmp_pred[text_id][word_idx][max_id] = 1.0
        for word_idx in tmp_pred[text_id]:
            final_token_pred[text_id][word_idx] += tmp_pred[text_id][word_idx]

 for text_id in final_token_pred:
    for word_idx in final_token_pred[text_id]:
        pred = final_token_pred[text_id][word_idx].argmax(-1)
        pred_without_O = final_token_pred[text_id][word_idx][:12].argmax(-1)
        if final_token_pred[text_id][word_idx][pred] >= 2:
            final_pred = pred
            tmp_score = final_token_pred[text_id][word_idx][final_pred]
        else:
            final_pred = 12
            tmp_score = final_token_pred[text_id][word_idx][final_pred]
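For a single token, the voting rule in the snippet above can be sketched with a Counter. This assumes each model contributes one predicted class index and that "O" sits at index 12; the function name and `min_votes` parameter are hypothetical.

```python
from collections import Counter

O_INDEX = 12  # index of the "O" class, matching the snippet above


def vote(predicted_classes, min_votes=2):
    """Return the most voted class if it has at least min_votes, else "O"."""
    counts = Counter(predicted_classes)
    top = max(counts.values())
    # Break ties toward the lowest class index, as argmax on a count vector does.
    winner = min(c for c, n in counts.items() if n == top)
    return winner if top >= min_votes else O_INDEX


print(vote([3, 3, 0, 1, 2, 4, 5]))       # class 3 has two votes
print(vote([0, 1, 2, 3, 4, 5, 6]))       # no class reaches two votes
```

Votes for "O" count like any other class here, so a label with two votes still loses to "O" when more models predict "O".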

Inference

Two GPU Inference

Using T4*2 GPUs doubles inference speed compared with a single GPU. When ensembling 8 models, the maximum Max_Length is 896. When ensembling 7 models, Max_Length can be set to 1024, which is the more ideal value. (The code is in the submissions directory of the GitHub repo.)

Convert Non-English Characters (lowers the LB score)

 import string

def replace_non_english_chars(text):
    mapping = {
        'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a', 'ä': 'a', 'å': 'a',
        'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
        'ì': 'i', 'í': 'i', 'î': 'i', 'ï': 'i',
        'ò': 'o', 'ó': 'o', 'ô': 'o', 'õ': 'o', 'ö': 'o', 'ø': 'o',
        'ù': 'u', 'ú': 'u', 'û': 'u', 'ü': 'u',
        'ÿ': 'y',
        'ç': 'c',
        'ñ': 'n',
        'ß': 'ss'
    }

    result = []
    for char in text:
        if char not in string.ascii_letters:
            replacement = mapping.get(char.lower())
            if replacement:
                result.append(replacement)
            else:
                result.append(char)
        else:
            result.append(char)

    return ''.join(result)
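An alternative sketch of the same conversion uses the standard library's unicodedata module instead of a hand-written table. It is not the repo's code: the function name and the small fallback table are assumptions, and its behavior differs slightly, since it preserves case (É becomes E) while the mapping above looks characters up in lowercase.

```python
import unicodedata

# Characters that canonical decomposition does not split into base + accent.
# This small table is an assumption added for the sketch.
FALLBACK = {"ø": "o", "Ø": "O", "ß": "ss"}


def strip_accents(text):
    """Replace accented Latin letters with their ASCII base letters."""
    out = []
    for char in text:
        if char in FALLBACK:
            out.append(FALLBACK[char])
            continue
        decomposed = unicodedata.normalize("NFD", char)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Only substitute when the result is plain ASCII; keep other scripts as-is.
        out.append(base if base and base.isascii() else char)
    return "".join(out)


print(strip_accents("José Müller"))
```

Because the substitution is applied per character, a replacement such as ß to ss changes the string length, which matters if character offsets are used later.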

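Two-stage LLM (failed)

Because student names are the most common label type, we used the GPT-4 API to annotate about 10,000 non-student names in the dataset, hoping to improve the model's accuracy when predicting this specific label type.

We tried fine-tuning a Mistral-7b model on the name-related labels, but the LB score dropped significantly.

Therefore, I used Mistral-7b with few-shot learning to decide whether content predicted as the label name student is actually a name. (Here we cannot expect the model to tell whether a name belongs to a student; we only expect it to exclude predictions that are clearly not names.)

The prompt is shown below. Doing this gave a slight improvement of less than 0.001 on the LB.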

 f"I'll give you a name, and you need to tell me if it's a normal person name, cited name or even not a name. Do not consider other factors.\nExample:\n- Is Matt Johnson a normal person name? Answer: Yes\n- Is Johnson. T a normal person name? Answer: No, this is likely a cited name.\n- Is Andsgjdu a normal person name? Answer: No, it is even not a name.\nNow the question is:\n- Is {name} a normal person name? Answer:"
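A small sketch of how this prompt might be built and how a model reply might be turned into a keep-or-drop decision. The helper names and the parsing rule are assumptions; the source does not say how the Mistral-7b answers were interpreted, and no model is called here.

```python
FEW_SHOT_PROMPT = (
    "I'll give you a name, and you need to tell me if it's a normal person name, "
    "cited name or even not a name. Do not consider other factors.\n"
    "Example:\n"
    "- Is Matt Johnson a normal person name? Answer: Yes\n"
    "- Is Johnson. T a normal person name? Answer: No, this is likely a cited name.\n"
    "- Is Andsgjdu a normal person name? Answer: No, it is even not a name.\n"
    "Now the question is:\n"
    "- Is {name} a normal person name? Answer:"
)


def build_prompt(name):
    """Fill the few-shot template with the predicted name."""
    return FEW_SHOT_PROMPT.format(name=name)


def keep_prediction(model_answer):
    """Hypothetical rule: drop a prediction only when the reply says it is not a name."""
    return "not a name" not in model_answer.lower()


print(build_prompt("Maria Lopez"))
print(keep_prediction("No, it is even not a name."))
```

Dropping only the clear non-names follows the stated goal of excluding obvious errors instead of separating student names from other names.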

Submission

Models	LB	PB	Chosen
Seven single models that exceed 0.974 on the LB	0.978	0.964	Yes
Two 4-fold cross-validation models, with LB scores of 0.977 and 0.974 respectively	0.978	0.961	Yes
Three single models with ensemble LB score of 0.979, plus one set of 4-fold cross-validation models with an LB score of 0.977 (vote ensemble)	0.979	0.963	Yes
Two single models ensemble	0.972	0.967	No
Four single models ensemble	0.979	0.967	No