Kaggle PII_Data_Detection

This repo is for Kaggle - PII Data Detection.

Python Environment

1. Install packages

pip install -r requirements.txt

Prepare Data

1. Set up the Kaggle API

export KAGGLE_USERNAME="your_kaggle_username"
export KAGGLE_KEY="your_api_key"

2. Download the datasets

cd kaggle_dataset
kaggle datasets download -d lizhecheng/pii-data-detection-dataset
unzip pii-data-detection-dataset.zip
kaggle datasets download -d lizhecheng/piidd-reliable-cv
unzip piidd-reliable-cv.zip

cd kaggle_dataset
cd competition
kaggle competitions download -c pii-detection-removal-from-educational-data
unzip pii-detection-removal-from-educational-data.zip

3. Simple DeBERTa

cd kaggle_notebook
run train-0.3-validation.ipynb

4. Run a WandB sweep

cd models
wandb sweep --project PII config.yaml

wandb agent xxx/PII/xxxxxxxx

5. 4-fold cross-validation

cd kfold
wandb login --relogin
(input your wandb api key)
wandb init -e (input your wandb username)

export KFOLD=0  # one of 0, 1, 2, 3
wandb sweep --project PII hypertuning_kfold.yaml
wandb agent xxx/PII/xxxxxxxx
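
How the training script consumes KFOLD is not shown here. As a rough sketch, assuming the variable selects which of the four folds is held out for validation, and using a hypothetical fold_of rule (document id modulo 4) in place of the repository's real split, the logic could look like this:

```python
import os

N_FOLDS = 4  # matches the 4-fold setup described above


def val_fold_from_env(default="0"):
    """Read the held-out fold index from the KFOLD environment variable."""
    return int(os.environ.get("KFOLD", default))


def fold_of(document_id, n_folds=N_FOLDS):
    """Hypothetical fold assignment: document id modulo the fold count."""
    return document_id % n_folds


def split_by_fold(document_ids, val_fold, n_folds=N_FOLDS):
    """Return (train_ids, val_ids) with fold `val_fold` held out."""
    if not 0 <= val_fold < n_folds:
        raise ValueError(f"KFOLD must be in 0..{n_folds - 1}, got {val_fold}")
    train_ids = [d for d in document_ids if fold_of(d, n_folds) != val_fold]
    val_ids = [d for d in document_ids if fold_of(d, n_folds) == val_fold]
    return train_ids, val_ids
```

Calling split_by_fold(ids, val_fold_from_env()) once per exported value of KFOLD would then train on three folds and validate on the fourth.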

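This is the [Public 9th Private 25th Solution] for The Learning Agency Lab - PII Data Detection

First of all, thanks to Kaggle and THE LEARNING AGENCY LAB for hosting this competition, and thanks to everyone on the team for their efforts. Although the result was not perfect, we still learned a lot and will keep moving forward. Congratulations to all the winners!

Full Code

Almost all of the code can be found in the GitHub repo for this competition: https://github.com/lizhecheng02/kaggle-pii_data_detection

Fine-tuning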

AWP (Adversarial Weight Perturbation)

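We use a custom AWP class to make the model more robust and write our own CustomTrainer around it. Our team often uses this method in NLP competitions, and it has given some good results. (The corresponding code is in my GitHub repo under the models directory.)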
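
To show the idea without any framework, here is a toy sketch of adversarial weight perturbation on a two-weight linear model. It is not the repository's AWP class, which works on a PyTorch model inside a custom Trainer; the function names, the toy data and the values of adv_lr and adv_eps are all assumptions chosen for illustration. Each step nudges the weights in the loss-increasing direction, measures the gradient there, restores the weights and updates them with both gradients.

```python
import math


def loss_and_grad(w, data):
    """Mean squared error of the linear model y = w . x, and its gradient."""
    n = len(data)
    loss = 0.0
    grad = [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        loss += err * err / n
        for k, xk in enumerate(x):
            grad[k] += 2.0 * err * xk / n
    return loss, grad


def awp_step(w, data, lr=0.05, adv_lr=0.1, adv_eps=0.01):
    """One gradient step that also uses the gradient at perturbed weights."""
    _, clean_grad = loss_and_grad(w, data)
    g_norm = math.sqrt(sum(g * g for g in clean_grad))
    w_norm = math.sqrt(sum(wi * wi for wi in w))

    # Attack: push each weight in the loss-increasing direction, scaled by
    # the weight norm, and keep it within adv_eps * |w_i| of its old value.
    scale = adv_lr * w_norm / (g_norm + 1e-6)
    perturbed = []
    for wi, gi in zip(w, clean_grad):
        bound = adv_eps * abs(wi)
        delta = max(-bound, min(bound, scale * gi))
        perturbed.append(wi + delta)

    # Gradient of the loss at the perturbed weights.
    _, adv_grad = loss_and_grad(perturbed, data)

    # Update the original (unperturbed) weights with both gradients.
    return [wi - lr * (cg + ag) for wi, cg, ag in zip(w, clean_grad, adv_grad)]


# Toy data generated by y = 2 * x1 - x2.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
weights = [0.5, 0.5]
initial_loss, _ = loss_and_grad(weights, data)
for _ in range(300):
    weights = awp_step(weights, data)
final_loss, _ = loss_and_grad(weights, data)
```

Because the perturbation is bounded relative to each weight, training still converges close to the true weights while every update has seen a slightly worse-case version of the model.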

Wandb Sweep

This tool lets us try many combinations of hyperparameters and select the one that produces the best fine-tuning result. (The corresponding code is in my GitHub repo under the models directory.)
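
For illustration, a sweep definition has roughly the following shape. The metric name, hyperparameter names and values are placeholders, not the contents of the repository's config.yaml, and the helper only enumerates the combinations a grid search would visit; it does not call wandb.

```python
from itertools import product

# Hypothetical sweep definition in the dictionary form that wandb accepts.
sweep_config = {
    "method": "grid",
    "metric": {"name": "eval_f5", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [1e-5, 2e-5, 3e-5]},
        "max_length": {"values": [1024, 2048]},
    },
}


def grid_combinations(config):
    """List every hyperparameter combination of a grid sweep."""
    names = list(config["parameters"])
    value_lists = [config["parameters"][n]["values"] for n in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]
```

With three learning rates and two sequence lengths, grid_combinations(sweep_config) yields six runs; each wandb agent picks up runs from the sweep until none are left.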

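Replace \n\n with | in all documents

With this change, we trained a set of 4-fold cross-validation models that reached an LB score of 0.977. There was some improvement on the LB, but the results did not improve on the PB.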
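
As a minimal sketch of this preprocessing step, assuming each document is available as a list of token strings and that paragraph breaks appear as whitespace tokens equal to "\n\n" (both are assumptions about the data layout), the replacement can be done per token so that token positions, and therefore labels, stay aligned:

```python
def replace_paragraph_breaks(tokens, old="\n\n", new="|"):
    """Return a copy of `tokens` with every `old` token swapped for `new`."""
    return [new if tok == old else tok for tok in tokens]
```

Single "\n" tokens are left untouched, so only paragraph breaks are rewritten.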

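Post-processing

Correct wrongly ordered labels for student names. (B, B -> B, I)
Pay special attention to \n appearing in addresses; it should be given an I label.
Filter email addresses and phone numbers, removing results that are clearly not part of these two categories. (No significant improvement)
Handle cases where a title such as Dr. is predicted with a B label. (No significant improvement)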

def pp(new_pred_df):
    """Repair BIO labels span by span.

    Expects a default 0..n-1 index and the columns "document", "token",
    "token_str", "label" and "score".
    """
    df = new_pred_df.copy()
    i = 0
    while i < len(df):
        st = i
        doc = df.loc[st, "document"]
        tok = df.loc[st, "token"]
        pred_tok = df.loc[st, "label"]
        if pred_tok == 'O':
            i += 1
            continue
        lab = pred_tok.split('-')[1]
        cur_doc = doc
        cur_lab = lab
        last_tok = tok
        cur_tok = last_tok

        # Advance i past the run of consecutive tokens that share the same
        # document and entity type; the span is rows st .. i-1.
        while i < len(df) and cur_doc == doc and cur_lab == lab and last_tok == cur_tok:
            last_tok = cur_tok + 1
            i += 1
            # Check the bound before reading row i to avoid a KeyError
            # when the span reaches the last row.
            if i >= len(df) or df.loc[i, "label"] == 'O':
                break
            cur_doc = df.loc[i, "document"]
            cur_tok = df.loc[i, "token"]
            cur_lab = df.loc[i, "label"].split('-')[1]

        # A "\n" token between two spans of the same entity type joins them:
        # relabel the "\n" and the whole current span as I-.
        if st - 2 >= 0 and df.loc[st - 2, "document"] == df.loc[st, "document"] and df.loc[st - 1, "token_str"] == '\n' and df.loc[st - 2, "label"] != 'O' and df.loc[st - 2, "label"].split('-')[1] == lab:
            df.loc[st - 1, "label"] = 'I-' + lab
            df.loc[st - 1, "score"] = 1
            for j in range(st, i):
                if df.loc[j, "label"] != 'I-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'I-' + lab
            continue

        # Enforce B- on the first token of the span and I- on the rest.
        for j in range(st, i):
            if j == st:
                if df.loc[j, "label"] != 'B-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'B-' + lab
            else:
                if df.loc[j, "label"] != 'I-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'I-' + lab

        # Drop NAME_STUDENT spans that contain a two-character token made of
        # a capital letter and a period (for example "J.").
        if lab == 'NAME_STUDENT' and any(len(item) == 2 and item[0].isupper() and item[1] == "." for item in df.loc[st:i-1, 'token_str']):
            for j in range(st, i):
                df.loc[j, "score"] = 0
                df.loc[j, "label"] = 'O'

    return df
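
The email and phone-number filter from the list above is not part of pp. A minimal sketch of such a filter follows, assuming the label names EMAIL and PHONE_NUM and applying deliberately loose patterns to the joined text of a predicted span. The helper names and the seven-digit minimum are illustrative choices, not the authors' rules.

```python
import re

# Deliberately loose patterns: the goal is only to reject obvious mistakes.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_CHARS_RE = re.compile(r"^[\d\s()+\-.x]+$")


def looks_like_email(text):
    """True if the text has the basic shape local@domain.tld."""
    return EMAIL_RE.match(text) is not None


def looks_like_phone(text, min_digits=7):
    """True if the text uses only phone characters and has enough digits."""
    digits = sum(ch.isdigit() for ch in text)
    return PHONE_CHARS_RE.match(text) is not None and digits >= min_digits


def keep_prediction(label, text):
    """Reject EMAIL / PHONE_NUM predictions whose text cannot be one."""
    entity = label.split("-")[-1]
    if entity == "EMAIL":
        return looks_like_email(text)
    if entity == "PHONE_NUM":
        return looks_like_phone(text)
    return True
```

Predictions for every other label pass through unchanged, so the filter can only remove email and phone false positives.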

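Ensemble

Average Ensemble

We average the predicted probabilities to obtain the final result. Because recall matters more than precision in this competition, we set the threshold to 0.0 so as not to miss any potentially correct recalls.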

# Index 12 is the "O" (non-PII) class; indices 0-11 are the PII labels.
O_THRESHOLD = 0.0
for text_id in final_token_pred:
    for word_idx in final_token_pred[text_id]:
        probs = final_token_pred[text_id][word_idx]
        pred = probs.argmax(-1)
        pred_without_O = probs[:12].argmax(-1)
        # Fall back to the best PII label only when the "O" probability is
        # below the threshold; with a threshold of 0.0 and non-negative
        # probabilities this never happens, so the plain argmax is used.
        if probs[12] < O_THRESHOLD:
            final_pred = pred_without_O
        else:
            final_pred = pred
        tmp_score = probs[final_pred]
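
The snippet above starts from probabilities that have already been averaged. A sketch of the averaging step itself might look like the following, assuming each model's output is a nested dictionary of the form text_id -> word_idx -> class probabilities. Plain Python lists are used here instead of arrays, and the function name is hypothetical.

```python
def average_predictions(single_preds):
    """Average per-token class probabilities over several models.

    `single_preds` is a list with one nested dict per model:
    {text_id: {word_idx: [p_class0, p_class1, ...]}}.
    All models are assumed to cover the same tokens.
    """
    n_models = len(single_preds)
    averaged = {}
    for text_id, words in single_preds[0].items():
        averaged[text_id] = {}
        for word_idx in words:
            per_model = [pred[text_id][word_idx] for pred in single_preds]
            averaged[text_id][word_idx] = [
                sum(class_probs) / n_models for class_probs in zip(*per_model)
            ]
    return averaged
```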

Vote Ensemble

For the final submission we ensembled 7 models. If at least two models predicted the same label, we accepted that label as the correct prediction.

import numpy as np

# Turn each model's probabilities into a one-hot vote and sum the votes.
for tmp_pred in single_pred:
    for text_id in tmp_pred:
        for word_idx in tmp_pred[text_id]:
            max_id = tmp_pred[text_id][word_idx].argmax(-1)
            tmp_pred[text_id][word_idx] = np.zeros(tmp_pred[text_id][word_idx].shape)
            tmp_pred[text_id][word_idx][max_id] = 1.0
        for word_idx in tmp_pred[text_id]:
            final_token_pred[text_id][word_idx] += tmp_pred[text_id][word_idx]

# Keep the most-voted class if it has at least 2 votes; otherwise predict "O" (index 12).
for text_id in final_token_pred:
    for word_idx in final_token_pred[text_id]:
        votes = final_token_pred[text_id][word_idx]
        pred = votes.argmax(-1)
        if votes[pred] >= 2:
            final_pred = pred
        else:
            final_pred = 12
        tmp_score = votes[final_pred]

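Inference

Two GPU Inference

Using T4*2 GPUs doubles inference speed compared with a single GPU. When ensembling 8 models, the largest max_length that fits is 896. When ensembling 7 models, max_length can be set to 1024, which is a better value. (The corresponding code is in my GitHub repo under the submissions directory.)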
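
One common way to use two GPUs for inference is to split the documents into two shards and run one worker per device. The sketch below only illustrates that sharding pattern: predict_on_device is a hypothetical placeholder for real model inference, the device names are assumed, and threads stand in for whatever process or device management the actual submission code uses.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_on_device(device, documents):
    """Hypothetical stand-in for model inference on one device."""
    return {doc_id: f"predicted on {device}" for doc_id in documents}


def shard(items, n_shards):
    """Split items into n_shards interleaved lists of near-equal size."""
    return [items[k::n_shards] for k in range(n_shards)]


def predict_on_all_devices(documents, devices=("cuda:0", "cuda:1")):
    """Run one worker per device and merge the per-shard results."""
    shards = shard(list(documents), len(devices))
    merged = {}
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        for result in pool.map(predict_on_device, devices, shards):
            merged.update(result)
    return merged
```

Interleaving the documents keeps the two shards close in size, which matters because the slower shard determines the total inference time.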

Convert Non-English Characters (lowers the LB score)

import string

def replace_non_english_chars(text):
    """Replace accented Latin letters with ASCII equivalents.

    The lookup uses the lowercased character, so uppercase accented
    letters come out as lowercase ASCII (for example "É" becomes "e").
    """
    mapping = {
        'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a', 'ä': 'a', 'å': 'a',
        'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
        'ì': 'i', 'í': 'i', 'î': 'i', 'ï': 'i',
        'ò': 'o', 'ó': 'o', 'ô': 'o', 'õ': 'o', 'ö': 'o', 'ø': 'o',
        'ù': 'u', 'ú': 'u', 'û': 'u', 'ü': 'u',
        'ÿ': 'y',
        'ç': 'c',
        'ñ': 'n',
        'ß': 'ss'
    }

    result = []
    for char in text:
        if char not in string.ascii_letters:
            # Characters without a mapping (digits, punctuation, other
            # scripts) are kept as they are.
            result.append(mapping.get(char.lower(), char))
        else:
            result.append(char)

    return ''.join(result)
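
An alternative to a hand-written table is Unicode decomposition from the standard library. This is a different technique from the mapping above, shown only for comparison. It preserves letter case and handles any accented letter that decomposes into a base letter plus combining marks, but it leaves characters such as "ø" and "ß" unchanged because they have no such decomposition.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, strip_accents("José Müller") gives "Jose Muller".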

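Two-stage LLM (failed)

Because student names are the most common label type, we used the GPT-4 API to annotate about 10,000 non-student names from the dataset, hoping to improve the model's accuracy when predicting this specific label type.

We tried fine-tuning the Mistral-7b model on name-related labels, but the LB score dropped significantly.

We therefore used Mistral-7b with few-shot learning instead, to judge whether content predicted as the name student label is actually a name. (Here we do not expect the model to tell whether a name belongs to a student; we only want to filter out predictions that are clearly not names.)

The prompt is shown below. Doing this improved the LB very slightly, by less than 0.001.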

f"I'll give you a name, and you need to tell me if it's a normal person name, cited name or even not a name. Do not consider other factors.\nExample:\n- Is Matt Johnson a normal person name? Answer: Yes\n- Is Johnson. T a normal person name? Answer: No, this is likely a cited name.\n- Is Andsgjdu a normal person name? Answer: No, it is even not a name.\nNow the question is:\n- Is {name} a normal person name? Answer:"
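
A sketch of how such a prompt could be wired into a filter follows. The build_prompt and is_person_name helpers, and the rule of treating any answer that starts with "Yes" as a name, are assumptions for illustration; the call to the model itself is omitted.

```python
PROMPT_TEMPLATE = (
    "I'll give you a name, and you need to tell me if it's a normal person "
    "name, cited name or even not a name. Do not consider other factors.\n"
    "Example:\n"
    "- Is Matt Johnson a normal person name? Answer: Yes\n"
    "- Is Johnson. T a normal person name? Answer: No, this is likely a cited name.\n"
    "- Is Andsgjdu a normal person name? Answer: No, it is even not a name.\n"
    "Now the question is:\n"
    "- Is {name} a normal person name? Answer:"
)


def build_prompt(name):
    """Fill the few-shot template with the predicted name."""
    return PROMPT_TEMPLATE.format(name=name)


def is_person_name(model_answer):
    """Assumed parsing rule: keep the prediction only if the answer starts with Yes."""
    return model_answer.strip().lower().startswith("yes")
```

A predicted name would be kept only when is_person_name returns True for the model's completion of build_prompt(name).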

Submission

Models	LB	PB	Chosen
`Seven single models that exceed 0.974 on the LB`	`0.978`	`0.964`	Yes
`Two 4-fold cross-validation models, with LB scores of 0.977 and 0.974 respectively.`	`0.978`	`0.961`	Yes
`Three single models with ensemble LB score of 0.979, plus one set of 4-fold cross-validation models with an LB score of 0.977. (Use vote ensemble)`	`0.979`	`0.963`	Yes
`Two single models ensemble`	`0.972`	`0.967`	No
`Four single models ensemble`	`0.979`	`0.967`	No