Kaggle PII_Data_Detection

This repo is for the Kaggle competition PII Data Detection.

Python Environment

1. Install packages

 pip install -r requirements.txt

Prepare Data

1. Set up the Kaggle API

 export KAGGLE_USERNAME="your_kaggle_username"
export KAGGLE_KEY="your_api_key"

2. Download the datasets

 cd kaggle_dataset
kaggle datasets download -d lizhecheng/pii-data-detection-dataset
unzip pii-data-detection-dataset.zip
kaggle datasets download -d lizhecheng/piidd-reliable-cv
unzip piidd-reliable-cv.zip

 cd kaggle_dataset
cd competition
kaggle competitions download -c pii-detection-removal-from-educational-data
unzip pii-detection-removal-from-educational-data.zip

3. Simple DeBERTa baseline

 cd kaggle_notebook
run train-0.3-validation.ipynb

4. Run Wandb Sweep

 cd models
wandb sweep --project PII config.yaml

wandb agent xxx/PII/xxxxxxxx

5. Run the K-fold Wandb Sweep

 cd kfold
wandb login --relogin
# enter your wandb API key when prompted
wandb init -e <your_wandb_username>

# pick one fold index: 0, 1, 2, or 3
export KFOLD=0
wandb sweep --project PII hypertuning_kfold.yaml
wandb agent xxx/PII/xxxxxxxx

This is the [Public 9th, Private 25th Solution] for The Learning Agency Lab - PII Data Detection

First of all, thank you to Kaggle and THE LEARNING AGENCY LAB for hosting this competition, and thanks to everyone on the team for their efforts. Although the result was not perfect, we still learned a lot and will keep moving forward. Congratulations to all the winners!

Full Code

Here is the GitHub repo for this competition, where you can find almost all of the code: https://github.com/lizhecheng02/kaggle-pii_data_detection

Fine-tuning

AWP (Adversarial Weight Perturbation)

We improved model robustness with a custom AWP class and our own CustomTrainer. Our team often uses this method in NLP competitions, and it has given us some good results. (The corresponding code is on GitHub under the models directory.)
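As a rough illustration of the idea behind AWP, here is a toy sketch on a linear model in plain Python. It is not the repository's AWP class or CustomTrainer. The function names and the `adv_lr` and `adv_eps` defaults are assumptions made for this example. The weights are nudged in the loss-increasing direction, the gradient is taken at the perturbed weights, and the update is applied to the original weights.

```python
import math

def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the linear model y = w . x."""
    n = len(xs)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for k, xk in enumerate(x):
            grad[k] += 2.0 * err * xk / n
    return grad

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def awp_perturb(w, grad, adv_lr=0.1, adv_eps=0.01, tiny=1e-6):
    """Move w along the gradient, scaled by ||w|| / ||grad||, then clip each
    weight to within adv_eps * |w_k| of its original value."""
    g_norm, w_norm = l2_norm(grad), l2_norm(w)
    if g_norm == 0.0:
        return list(w)
    perturbed = []
    for wk, gk in zip(w, grad):
        step = adv_lr * gk / (g_norm + tiny) * (w_norm + tiny)
        bound = adv_eps * abs(wk)
        perturbed.append(min(max(wk + step, wk - bound), wk + bound))
    return perturbed

def awp_train_step(w, xs, ys, lr=0.05):
    """One update: gradient evaluated at the perturbed weights, applied to
    the original (restored) weights."""
    w_adv = awp_perturb(w, mse_grad(w, xs, ys))
    g_adv = mse_grad(w_adv, xs, ys)
    return [wk - lr * gk for wk, gk in zip(w, g_adv)]

# Toy data generated by the weights [2, -1] (illustrative only).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
w = [1.0, 1.0]
for _ in range(500):
    w = awp_train_step(w, xs, ys)
print(w)
```

In a real fine-tuning run the same three steps (perturb, backpropagate the adversarial loss, restore) would wrap the model's parameters inside the training loop.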

Wandb Sweep

With this tool, we can try different combinations of hyperparameters and pick the ones that give the best fine-tuning results. (The corresponding code is on GitHub under the models directory.)
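To make "combinations of hyperparameters" concrete, the sketch below writes a sweep definition as a Python dict with the standard wandb sweep keys (`method`, `metric`, `parameters`) and enumerates what a grid search over it would try. The metric name, the parameter names, and their values are hypothetical. The repository's actual config.yaml may differ.

```python
from itertools import product

# Hypothetical sweep definition; the metric and parameter values are
# illustrative and not taken from the repository's config.yaml.
sweep_config = {
    "method": "grid",
    "metric": {"name": "eval_f5", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [1e-5, 2e-5, 3e-5]},
        "max_length": {"values": [1024, 1536]},
    },
}

def grid_runs(config):
    """Enumerate every hyperparameter combination a grid sweep would try."""
    names = list(config["parameters"])
    value_lists = [config["parameters"][name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

runs = grid_runs(sweep_config)
print(len(runs))  # 3 learning rates x 2 lengths = 6 runs
print(runs[0])
```

Each `wandb agent` process picks up one of these combinations at a time and reports the metric back to the sweep.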

Replace \n\n with | in all documents

With this replacement, we trained a set of 4-fold cross-validation models that scored 0.977 on the LB (public leaderboard). Although this improved the LB score somewhat, it showed no improvement on the PB (private leaderboard).

Post-processing

Correct wrongly ordered labels for student names. ( B, B -> B, I )
Pay special attention to \n in addresses, which should be labeled with an I tag.
Filter email addresses and phone numbers, removing predictions that clearly do not belong to these two categories. ( no significant improvement )
Handle cases where titles such as Dr. are predicted with a B tag. ( no significant improvement )

 def pp(new_pred_df):
    df = new_pred_df.copy()
    i = 0
    while i < len(df):
        st = i
        doc = df.loc[st, "document"]
        tok = df.loc[st, "token"]
        pred_tok = df.loc[st, "label"]
        if pred_tok == 'O':
            i += 1
            continue
        lab = pred_tok.split('-')[1]
        cur_doc = doc
        cur_lab = lab
        last_tok = tok
        cur_tok = last_tok

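        # Advance i to the end of the run of consecutive tokens that share the same document and entity type.
        while i < len(df) and cur_doc == doc and cur_lab == lab and last_tok == cur_tok:
            last_tok = cur_tok + 1
            i += 1
            # Check bounds before reading row i, otherwise the last row raises a KeyError.
            if i >= len(df) or df.loc[i, "label"] == 'O':
                break
            cur_doc = df.loc[i, "document"]
            cur_tok = df.loc[i, "token"]
            cur_lab = df.loc[i, "label"].split('-')[1]

        # If a newline token sits between two spans of the same type, merge them into one entity.
        if st - 2 >= 0 and df.loc[st - 2, "document"] == df.loc[st, "document"] and df.loc[st - 1, "token_str"] == '\n' and df.loc[st - 2, "label"] != 'O' and df.loc[st - 2, "label"].split('-')[1] == lab: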
            df.loc[st - 1, "label"] = 'I-' + lab
            df.loc[st - 1, "score"] = 1
            for j in range(st, i):
                if df.loc[j, "label"] != 'I-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'I-' + lab
            continue

        for j in range(st, i):
            if j == st:
                if df.loc[j, "label"] != 'B-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'B-' + lab
            else:
                if df.loc[j, "label"] != 'I-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'I-' + lab

        if lab == 'NAME_STUDENT' and any(len(item) == 2 and item[0].isupper() and item[1] == "." for item in df.loc[st:i-1, 'token_str']):
            for j in range(st, i):
                df.loc[j, "score"] = 0
                df.loc[j, "label"] = 'O'

    return df
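The B, B -> B, I correction at the core of the function above can be seen in isolation on a plain list of labels. This is a simplified sketch, not the authors' code. `fix_bio_order` is a hypothetical helper that ignores document boundaries, token positions, and scores, all of which `pp` handles.

```python
def fix_bio_order(labels):
    """Rewrite each run of same-type entity labels as B-, I-, I-, ..."""
    fixed = []
    prev_type = None  # entity type of the previous token, None after an "O"
    for label in labels:
        if label == "O":
            fixed.append(label)
            prev_type = None
            continue
        ent_type = label.split("-", 1)[1]
        # Continue the current entity if the type matches, else start a new one.
        prefix = "I-" if ent_type == prev_type else "B-"
        fixed.append(prefix + ent_type)
        prev_type = ent_type
    return fixed

print(fix_bio_order(["O", "B-NAME_STUDENT", "B-NAME_STUDENT", "O", "I-EMAIL"]))
# ['O', 'B-NAME_STUDENT', 'I-NAME_STUDENT', 'O', 'B-EMAIL']
```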

Ensemble

Average Ensemble

Average the predicted probabilities to get the final result. Since recall matters more than precision in this competition, I set the threshold to 0.0 to avoid missing any potentially correct prediction.

 for text_id in final_token_pred:
    for word_idx in final_token_pred[text_id]:
        pred = final_token_pred[text_id][word_idx].argmax(-1)
        pred_without_O = final_token_pred[text_id][word_idx][:12].argmax(-1)
        if final_token_pred[text_id][word_idx][12] < 0.0:
            final_pred = pred_without_O
            tmp_score = final_token_pred[text_id][word_idx][final_pred]
        else:
            final_pred = pred
            tmp_score = final_token_pred[text_id][word_idx][final_pred]
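The snippet above starts from already averaged probabilities. The sketch below shows the averaging and the threshold check for a single token in plain Python, assuming 13 classes with "O" at index 12 as in that snippet. The helper names and the example probabilities are made up for illustration.

```python
O_INDEX = 12  # index of the "O" class, as in the snippet above

def average_probs(model_probs):
    """Element-wise mean of the per-model probability vectors for one token."""
    n_models = len(model_probs)
    return [sum(column) / n_models for column in zip(*model_probs)]

def pick_label(avg, threshold):
    """Return the argmax class, unless the averaged "O" probability is below
    the threshold, in which case return the best non-"O" class."""
    if avg[O_INDEX] < threshold:
        return max(range(O_INDEX), key=avg.__getitem__)
    return max(range(len(avg)), key=avg.__getitem__)

# Two hypothetical models: both lean towards "O", with class 3 as runner-up.
model_a = [0.0] * 13
model_a[3], model_a[O_INDEX] = 0.45, 0.55
model_b = [0.0] * 13
model_b[3], model_b[O_INDEX] = 0.35, 0.65

avg = average_probs([model_a, model_b])
print(pick_label(avg, threshold=0.0))  # 12: plain argmax keeps "O"
print(pick_label(avg, threshold=0.7))  # 3: "O" is overridden
```

Because probabilities are never negative, a threshold of 0.0 means the override branch cannot trigger, so the result is the plain argmax of the averaged probabilities. Raising the threshold trades precision for recall.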

Vote Ensemble

In our final submission, we ensembled 7 models and accepted a label as a correct prediction if at least two of the models predicted that same label.

 for tmp_pred in single_pred:
    for text_id in tmp_pred:
        max_id = 0
        for word_idx in tmp_pred[text_id]:
            max_id = tmp_pred[text_id][word_idx].argmax(-1)
            tmp_pred[text_id][word_idx] = np.zeros(tmp_pred[text_id][word_idx].shape)
            tmp_pred[text_id][word_idx][max_id] = 1.0
        for word_idx in tmp_pred[text_id]:
            final_token_pred[text_id][word_idx] += tmp_pred[text_id][word_idx]

 for text_id in final_token_pred:
    for word_idx in final_token_pred[text_id]:
        pred = final_token_pred[text_id][word_idx].argmax(-1)
        pred_without_O = final_token_pred[text_id][word_idx][:12].argmax(-1)
        if final_token_pred[text_id][word_idx][pred] >= 2:
            final_pred = pred
            tmp_score = final_token_pred[text_id][word_idx][final_pred]
        else:
            final_pred = 12
            tmp_score = final_token_pred[text_id][word_idx][final_pred]
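The same voting rule can be written compactly for one token when each model's prediction is given as a label string. This is a simplified sketch with a hypothetical `vote` helper. Ties are broken by first occurrence here, whereas the array version above breaks them by lowest class index.

```python
from collections import Counter

O_LABEL = "O"

def vote(labels_per_model, min_votes=2):
    """Return the most-voted label if at least min_votes models agree on it,
    otherwise fall back to "O"."""
    label, count = Counter(labels_per_model).most_common(1)[0]
    return label if count >= min_votes else O_LABEL

# Seven hypothetical model predictions for one token.
print(vote(["B-NAME_STUDENT", "B-NAME_STUDENT", "O", "B-EMAIL",
            "I-NAME_STUDENT", "B-URL_PERSONAL", "B-ID_NUM"]))
# B-NAME_STUDENT

# No label reaches two votes, so the token is left as "O".
print(vote(["B-EMAIL", "O", "B-NAME_STUDENT"]))
# O
```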

Inference

Two GPU Inference

Using two T4 GPUs (T4*2) doubles inference speed compared with a single GPU. When ensembling 8 models, the maximum MAX_LENGTH is 896; when ensembling 7 models, MAX_LENGTH can be set to 1024, which is an ideal value. (The corresponding code is on my GitHub under the submissions directory.)

Convert Non-English Characters (lowers the LB score)

 import string

def replace_non_english_chars(text):
    mapping = {
        'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a', 'ä': 'a', 'å': 'a',
        'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
        'ì': 'i', 'í': 'i', 'î': 'i', 'ï': 'i',
        'ò': 'o', 'ó': 'o', 'ô': 'o', 'õ': 'o', 'ö': 'o', 'ø': 'o',
        'ù': 'u', 'ú': 'u', 'û': 'u', 'ü': 'u',
        'ÿ': 'y',
        'ç': 'c',
        'ñ': 'n',
        'ß': 'ss'
    }

    result = []
    for char in text:
        if char not in string.ascii_letters:
            replacement = mapping.get(char.lower())
            if replacement:
                result.append(replacement)
            else:
                result.append(char)
        else:
            result.append(char)

    return ''.join(result)
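For comparison, a similar effect can be obtained with the standard library's `unicodedata` module instead of a hand-written mapping. This is an alternative sketch, not the method used in the competition. Unlike the mapping above, it preserves letter case, and it leaves characters without a Unicode decomposition, such as 'ø' and 'ß', unchanged.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks after NFKD decomposition (for example 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("José Müller"))  # Jose Muller
```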

Two-Stage LLM (unsuccessful)

We annotated about 10,000 non-student names from the dataset using the GPT-4 API, because student names are the most common label type. We hoped this would improve the model's accuracy on this type of label.

I tried fine-tuning a Mistral-7b model on the name-related labels, but the LB score dropped significantly.

So I tried few-shot learning with Mistral-7b to decide whether content predicted with the name student label is actually a name. (We cannot expect the model to tell whether a name belongs to a student, only to exclude predictions that are clearly not names.)

The prompt is shown below. It produced a very slight improvement on the LB, less than 0.001.

 f"I'll give you a name, and you need to tell me if it's a normal person name, cited name or even not a name. Do not consider other factors.\nExample:\n- Is Matt Johnson a normal person name? Answer: Yes\n- Is Johnson. T a normal person name? Answer: No, this is likely a cited name.\n- Is Andsgjdu a normal person name? Answer: No, it is even not a name.\nNow the question is:\n- Is {name} a normal person name? Answer:"
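One way to use such a prompt as a second-stage filter is sketched below. The prompt text is the one given above. `build_prompt`, `looks_like_name`, and the rule "keep the prediction only if the reply starts with Yes" are assumptions for illustration. The call to the language model itself is left out.

```python
PROMPT_TEMPLATE = (
    "I'll give you a name, and you need to tell me if it's a normal person "
    "name, cited name or even not a name. Do not consider other factors.\n"
    "Example:\n"
    "- Is Matt Johnson a normal person name? Answer: Yes\n"
    "- Is Johnson. T a normal person name? Answer: No, this is likely a cited name.\n"
    "- Is Andsgjdu a normal person name? Answer: No, it is even not a name.\n"
    "Now the question is:\n"
    "- Is {name} a normal person name? Answer:"
)

def build_prompt(name):
    """Fill the few-shot template with the predicted name span."""
    return PROMPT_TEMPLATE.format(name=name)

def looks_like_name(answer):
    """Assumed rule: keep the prediction only when the reply starts with 'Yes'."""
    return answer.strip().lower().startswith("yes")

print(build_prompt("Jane Doe").splitlines()[-1])
print(looks_like_name(" Yes"))                        # True
print(looks_like_name("No, it is even not a name."))  # False
```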

Submission

Models	LB	PB	Chosen
`Seven single models that exceed 0.974 on the LB`	`0.978`	`0.964`	Yes
`Two 4-fold cross-validation models, with LB scores of 0.977 and 0.974 respectively.`	`0.978`	`0.961`	Yes
`Three single models with ensemble LB score of 0.979, plus one set of 4-fold cross-validation models with an LB score of 0.977. (Use vote ensemble)`	`0.979`	`0.963`	Yes
`Two single models ensemble`	`0.972`	`0.967`	No
`Four single models ensemble`	`0.979`	`0.967`	No

Code

LB 0.978 PB 0.964

LB 0.978 PB 0.961

LB 0.979 PB 0.963

Conclusion

Thanks to my teammates. We have known each other through Kaggle for more than half a year, and I feel lucky to be able to learn and make progress with all of you. @Rdxsun, @bianshengtao, @xuanmingzhang777, @tonyarobertson.

Going too far is as wrong as not going far enough. - Confucius
