PyTorchText

Chinese readers, please see readme-zh.md.

This is the solution for the 2017 Zhihu Machine Learning Challenge. We won first place among 963 teams.

1. Setup

Install PyTorch from pytorch.org (Python 2, CUDA).
Install the other dependencies:
```
pip2 install -r requirements.txt
```

You may need tf.contrib.keras.preprocessing.sequence.pad_sequences for data preprocessing.
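If installing TensorFlow only for this one helper is inconvenient, the padding step can be approximated in NumPy. The sketch below is a stand-in, not the code the preprocessing scripts use: the function name `pad_sequences_np` is hypothetical, and it only mirrors the Keras defaults (padding and truncation both at the front, pad value 0) for a positive `maxlen`.

```python
import numpy as np


def pad_sequences_np(sequences, maxlen, value=0):
    """Pad or truncate each sequence to exactly `maxlen` integer ids.

    Mirrors the Keras defaults: padding and truncation both happen at the
    front ("pre"), so the last `maxlen` tokens of a long sequence are kept.
    """
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # leave the row as pure padding
        trimmed = seq[-maxlen:]             # "pre" truncation keeps the tail
        out[row, -len(trimmed):] = trimmed  # "pre" padding fills the front
    return out


padded = pad_sequences_np([[1, 2, 3], [4, 5, 6, 7, 8], []], maxlen=4)
print(padded)
```

Short sequences are left-padded with zeros, long ones lose their leading tokens, and empty ones become all padding.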

Start visdom for visualization:
```
python2 -m visdom.server
```

2. Data preprocessing

Modify the data paths in the relevant files.

2.1 Word vector file -> numpy file

python scripts/data_process/embedding2matrix.py main char_embedding.txt char_embedding.npz 
python scripts/data_process/embedding2matrix.py main word_embedding.txt word_embedding.npz
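As a rough illustration of what this conversion involves, the sketch below reads a word2vec-style text file and saves it as a compressed .npz archive. It is not the code in embedding2matrix.py. The file layout (an optional "count dim" header, then one token and its vector per line), the function name `embedding_to_npz`, and the archive keys `vector` and `tokens` are all assumptions made for this example.

```python
import io
import os
import tempfile

import numpy as np


def embedding_to_npz(txt_path, npz_path):
    """Convert a word2vec-style text file into a compressed .npz archive."""
    tokens, vectors = [], []
    with io.open(txt_path, encoding="utf-8") as f:
        first = f.readline().split()
        # Assumption: a two-field first line is the "count dim" header.
        if len(first) > 2:
            tokens.append(first[0])
            vectors.append([float(x) for x in first[1:]])
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            tokens.append(parts[0])
            vectors.append([float(x) for x in parts[1:]])
    matrix = np.asarray(vectors, dtype=np.float32)
    # Hypothetical key names; the real script may store different keys.
    np.savez_compressed(npz_path, vector=matrix, tokens=np.asarray(tokens))
    return matrix.shape


# Tiny demo file with two 3-dimensional vectors.
tmp_dir = tempfile.mkdtemp()
txt_file = os.path.join(tmp_dir, "demo_embedding.txt")
npz_file = os.path.join(tmp_dir, "demo_embedding.npz")
with io.open(txt_file, "w", encoding="utf-8") as f:
    f.write(u"2 3\nc1 0.1 0.2 0.3\nc2 0.4 0.5 0.6\n")

shape = embedding_to_npz(txt_file, npz_file)
loaded = np.load(npz_file)
print(shape, loaded["tokens"])
```

Storing the matrix once as float32 lets later steps load it with a single `np.load` call rather than re-parsing text.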

2.2 Question set -> numpy file

This step is memory-intensive; make sure you have more than 32 GB of RAM.

python scripts/data_process/question2array.py main question_train_set.txt train.npz
python scripts/data_process/question2array.py main question_eval_set.txt test.npz
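The core of a step like this is mapping each token to an integer id and writing the ids into fixed-length rows. The sketch below shows that idea only; it is not the logic of question2array.py. It assumes a tab-separated line whose fields are comma-separated tokens, that id 0 is reserved for padding and unknown tokens, and that rows are right-padded. The helper name `tokens_to_ids` is hypothetical.

```python
import numpy as np


def tokens_to_ids(token_field, token2id, maxlen, unk_id=0):
    """Turn one comma-separated token field into a fixed-length id vector."""
    tokens = [t for t in token_field.split(",") if t]
    ids = [token2id.get(t, unk_id) for t in tokens][:maxlen]
    row = np.zeros(maxlen, dtype=np.int32)
    row[:len(ids)] = ids  # right-pad with zeros
    return row


# Hypothetical vocabulary and input line (format is an assumption).
token2id = {"w11": 1, "w54": 2, "w6": 3}
line = "q001\tw11,w54\tw6,w999,w11,w54"
qid, title, content = line.split("\t")

title_ids = tokens_to_ids(title, token2id, maxlen=4)
content_ids = tokens_to_ids(content, token2id, maxlen=3)
print(qid, title_ids, content_ids)
```

Using a compact integer dtype such as int32 for the id arrays keeps memory use down, which matters given the requirement stated above.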

2.3 Labels -> JSON

python scripts/data_process/label2id.py main question_topic_train_set.txt labels.json
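A label-to-id conversion of this kind can be sketched as follows. The input layout (question id, a tab, then comma-separated topic labels), the function name `build_label_index`, and the JSON structure are assumptions for illustration; the layout of the actual labels.json may differ.

```python
import json


def build_label_index(lines):
    """Assign each distinct label a consecutive integer id."""
    label2id = {}
    question2labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        qid, labels = line.split("\t")
        names = labels.split(",")
        for name in names:
            # First time a label is seen, it gets the next free id.
            label2id.setdefault(name, len(label2id))
        question2labels[qid] = [label2id[n] for n in names]
    return {"label2id": label2id, "question2labels": question2labels}


# Hypothetical sample lines.
sample = ["q1\t100,200", "q2\t200,300"]
index = build_label_index(sample)
as_json = json.dumps(index, sort_keys=True)
print(as_json)
```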

2.4 Validation data

python scripts/data_process/get_val.py
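How get_val.py selects the validation examples is not described here. One common approach, shown below as a sketch under that caveat, is to shuffle the sample indices with a fixed seed and hold out a fixed number of them. The function name `split_train_val` and its parameters are hypothetical.

```python
import numpy as np


def split_train_val(num_samples, val_size, seed=0):
    """Return sorted (train_idx, val_idx) with no overlap between them."""
    rng = np.random.RandomState(seed)  # fixed seed makes the split repeatable
    order = rng.permutation(num_samples)
    val_idx = np.sort(order[:val_size])
    train_idx = np.sort(order[val_size:])
    return train_idx, val_idx


train_idx, val_idx = split_train_val(num_samples=10, val_size=3)
print(train_idx, val_idx)
```

The returned index arrays can then be used to slice the title, content, and label arrays consistently.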

3. Training

Modify config.py to set the model paths.
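The training commands in this section pass options as `--key=value` flags, which suggests defaults live in config.py and are overridden at launch. The sketch below shows one way such a config object could work. It is an assumption, not the contents of config.py: the class layout, the default values, and the `parse` method are hypothetical, and only the option names are taken from the flags used in this README.

```python
class Config(object):
    """Hypothetical defaults; option names mirror flags used in this README."""
    model = "LSTMText"
    model_path = None
    batch_size = 128
    lr = 1e-3
    type_ = "word"

    def parse(self, overrides):
        """Apply command-line style overrides, accepting '-' or '_' in names."""
        for key, value in overrides.items():
            name = key.replace("-", "_")
            if not hasattr(self, name):
                raise AttributeError("unknown option: %s" % key)
            setattr(self, name, value)
        return self


opt = Config().parse({"model": "RCNN", "batch-size": 64})
print(opt.model, opt.batch_size, opt.lr)
```

Rejecting unknown names catches misspelled flags early instead of silently ignoring them.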

Paths to the models we used:

CNN: models/MultiCNNTextBNDeep.py
RNN (LSTM): models/LSTMText.py
RCNN: models/RCNN.py
Inception: models/CNNText_inception.py
fastText: models/FastText3.py

3.1 Train models without data augmentation

# LSTM char
python2 main.py main --max_epoch=5 --plot_every=100 --env='lstm_char' --weight=1 --model='LSTMText' --batch-size=128 --lr=0.001 --lr2=0 --lr_decay=0.5 --decay_every=10000 --type_='char' --zhuge=True --linear-hidden-size=2000 --hidden-size=256 --kmax-pooling=3 --num-layers=3 --augument=False

# LSTM word
python2 main.py main --max_epoch=5 --plot_every=100 --env='lstm_word' --weight=1 --model='LSTMText' --batch-size=128 --lr=0.001 --lr2=0.0000 --lr_decay=0.5 --decay_every=10000 --type_='word' --zhuge=True --linear-hidden-size=2000 --hidden-size=320 --kmax-pooling=2 --augument=False

# RCNN char
python2 main.py main --max_epoch=5 --plot_every=100 --env='rcnn_char' --weight=1 --model='RCNN' --batch-size=128 --lr=0.001 --lr2=0 --lr_decay=0.5 --decay_every=5000 --title-dim=1024 --content-dim=1024 --type_='char' --zhuge=True --kernel-size=3 --kmax-pooling=2 --linear-hidden-size=2000 --debug-file='/tmp/debugrcnn' --hidden-size=256 --num-layers=3 --augument=False

# RCNN word
python2 main.py main --max_epoch=5 --plot_every=100 --env='RCNN-word' --weight=1 --model='RCNN' --zhuge=True --num-workers=4 --batch-size=128 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8 --decay-every=5000 --title-dim=1024 --content-dim=512 --kernel-size=3 --debug-file='/tmp/debugrc' --kmax-pooling=1 --type_='word' --augument=False

# CNN word
python main.py main --max_epoch=5 --plot_every=100 --env='MultiCNNText' --weight=1 --model='MultiCNNTextBNDeep' --batch-size=64 --lr=0.001 --lr2=0.000 --lr_decay=0.8 --decay_every=10000 --title-dim=250 --content-dim=250 --weight-decay=0 --type_='word' --debug-file='/tmp/debug' --linear-hidden-size=2000 --zhuge=True --augument=False

# inception word
python2 main.py main --max_epoch=5 --plot_every=100 --env='inception-word' --weight=1 --model='CNNText_inception' --zhuge=True --num-workers=4 --batch-size=512 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8 --decay-every=2500 --title-dim=1200 --content-dim=1200 --type_='word' --augument=False

# inception char
python2 main.py main --max_epoch=5 --plot_every=100 --env='inception-char' --weight=1 --model='CNNText_inception' --zhuge=True --num-workers=4 --batch-size=512 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8 --decay-every=2500 --title-dim=1200 --content-dim=1200 --type_='char' --augument=False

# FastText3 word
python2 main.py main --max_epoch=5 --plot_every=100 --env='fasttext3-word' --weight=5 --model='FastText3' --zhuge=True --num-workers=4 --batch-size=512 --lr2=1e-4 --lr=1e-3 --lr-decay=0.8 --decay-every=2500 --linear_hidden_size=2000 --type_='word' --debug-file=/tmp/debugf --augument=False
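Several of the commands above set `--kmax-pooling`. How the models in models/ apply it is not shown in this README, so the following is a NumPy sketch of the general k-max pooling technique only: keep the k largest activations along the sequence axis while preserving their original order. The function name `kmax_pooling` is chosen for this example.

```python
import numpy as np


def kmax_pooling(x, k):
    """Keep the k largest values along the last axis, in original order."""
    top = np.argsort(x, axis=-1)[..., -k:]  # positions of the k largest values
    top = np.sort(top, axis=-1)             # restore left-to-right order
    return np.take_along_axis(x, top, axis=-1)


# One feature channel over five time steps.
features = np.array([[3.0, 1.0, 5.0, 2.0, 4.0]])
pooled = kmax_pooling(features, k=3)
print(pooled)  # [[3. 5. 4.]]
```

With k=1 this reduces to ordinary max pooling over the sequence; larger k keeps several strong activations along with their relative order.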

In most cases, the score can be improved by fine-tuning. For example:

python2 main.py main --max_epoch=2 --plot_every=100 --env='LSTMText-word-ft' --model='LSTMText' --zhuge=True --num-workers=4 --batch-size=256 --lr2=5e-5 --lr=5e-5 --decay-every=5000 --type_='word' --model-path='checkpoints/LSTMText_word_0.409196378421'

3.2 Train models with data augmentation

Add --augument to the training command.

3.3 Scores

| Model | Score |
|---|---|
| cnn_word | 0.4103 |
| rnn_word | 0.4119 |
| rcnn_word | 0.4115 |
| inception_word | 0.4109 |
| FastText_word | 0.4091 |
| rnn_char | 0.4031 |
| rcnn_char | 0.4037 |
| inception_char | 0.4024 |
| rcnn_word_aug | 0.41344 |
| cnn_word_aug | 0.41051 |
| rnn_word_aug | 0.41368 |
| inception_word_aug | 0.41254 |
| fastText3_word_aug | 0.40853 |
| cnn_char_aug | 0.38738 |
| rcnn_char_aug | 0.39854 |

With model ensembling, the score can reach up to 0.433.

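4. Test and submit

4.1 Test

model: the model to use, one of LSTMText, RCNN, MultiCNNTextBNDeep, FastText3, CNNText_inception
model-path: path to the trained model
result-path: where to save the result
val: whether to test on the validation set or the test set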

# LSTM
python2 test.1.py main --model='LSTMText' --batch-size=512 --model-path='checkpoints/LSTMText_word_0.411994005382' --result-path='/data_ssd/zhihu/result/LSTMText0.4119_word_test.pth' --val=False --zhuge=True

python2 test.1.py main --model='LSTMText' --batch-size=256 --type_=char --model-path='checkpoints/LSTMText_char_0.403192339135' --result-path='/data_ssd/zhihu/result/LSTMText0.4031_char_test.pth' --val=False --zhuge=True

# RCNN
python2 test.1.py main --model='RCNN' --batch-size=512 --model-path='checkpoints/RCNN_word_0.411511574999' --result-path='/data_ssd/zhihu/result/RCNN_0.4115_word_test.pth' --val=False --zhuge=True

python2 test.1.py main --model='RCNN' --batch-size=512 --model-path='checkpoints/RCNN_char_0.403710422571' --result-path='/data_ssd/zhihu/result/RCNN_0.4037_char_test.pth' --val=False --zhuge=True

# DeepText
python2 test.1.py main --model='MultiCNNTextBNDeep' --batch-size=512 --model-path='checkpoints/MultiCNNTextBNDeep_word_0.410330780091' --result-path='/data_ssd/zhihu/result/DeepText0.4103_word_test.pth' --val=False --zhuge=True
# more to go ...

4.2 Ensemble

See notebooks/val_ensemble.ipynb and notebooks/test_ensemble.ipynb for more details.
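The notebooks hold the actual ensembling procedure. As a generic illustration only, one simple way to combine models is a weighted average of their per-label score matrices followed by picking the highest-scoring labels for each question. In the sketch below the function name `weighted_ensemble`, the weights, and the toy scores are all hypothetical.

```python
import numpy as np


def weighted_ensemble(score_matrices, weights, top_k):
    """Average (n_questions, n_labels) score matrices and pick top labels."""
    total = np.zeros_like(score_matrices[0], dtype=np.float64)
    for scores, weight in zip(score_matrices, weights):
        total += weight * scores
    total /= float(sum(weights))
    # Negate so argsort returns label indices from highest to lowest score.
    top = np.argsort(-total, axis=1)[:, :top_k]
    return total, top


# Toy scores from two models for 2 questions and 3 labels.
model_a = np.array([[0.9, 0.1, 0.3], [0.2, 0.8, 0.4]])
model_b = np.array([[0.5, 0.7, 0.1], [0.1, 0.6, 0.9]])

averaged, top_labels = weighted_ensemble([model_a, model_b], [2, 1], top_k=2)
print(top_labels)  # [[0 1]
                   #  [1 2]]
```

In practice the weights would be tuned on the validation set before being applied to the test set predictions.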

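5. Main files

main.py: main script (for training)
config.py: configuration file
test.1.py: for testing
data/: data loaders
scripts/: data preprocessing
utils/: score calculation and a wrapper for visualization
models/: models
- models/BasicModel: base class for the models
- models/MultiCNNTextBNDeep: CNN
- models/LSTMText: RNN
- models/RCNN: RCNN
- models/CNNText_inception: Inception
- models/MultiModelALL and models/MultiModelAll2
- other models
rep.py: code for reproducing the results
del/: methods that failed or are no longer used
notebooks/: notebooks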

Pretrained models

https://pan.baidu.com/s/1mjvtjgs passwd: tayb
