
A "Swiss Army knife" for machine learning
The ktrain.text.qa.generative_qa module has been removed. Use our OnPrem.LLM package for generative question-answering tasks. See the example notebook.

ktrain is a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) to help build, train, and deploy neural networks and other machine learning models. Inspired by ML framework extensions like fastai and ludwig, ktrain is designed to make deep learning and AI both more accessible and easier to apply for newcomers and experienced practitioners alike. With only a few lines of code, ktrain allows you to easily and quickly:
- employ fast, accurate, and easy-to-use pre-canned models for text, vision, graph, and tabular data
- estimate an optimal learning rate for your model given your data using a learning-rate finder
- employ learning-rate schedules such as the triangular policy, the 1cycle policy, and SGDR to effectively minimize loss and improve generalization
- build text classifiers for any language (e.g., Arabic sentiment analysis with BERT, Chinese sentiment analysis with NBSVM)
- easily train NER models for any language (e.g., Dutch NER)
- load and preprocess text and image data from a variety of formats
- inspect data points that were misclassified, with explanations, to help improve your model
- leverage a simple prediction API for saving and deploying both models and data-preprocessing steps to make predictions on new raw data
- built-in support for exporting models to ONNX and TensorFlow Lite (see the example notebook for more information)
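As an illustration of the triangular policy mentioned above, here is a minimal, framework-free sketch: the learning rate ramps linearly up to a peak at mid-training and back down. The function name and parameters are illustrative, not ktrain's API.

```python
def triangular_lr(step, total_steps, max_lr, base_lr=0.0):
    """Linearly ramp the learning rate from base_lr up to max_lr over the
    first half of training, then back down over the second half."""
    half = total_steps / 2.0
    if step <= half:
        frac = step / half                   # rising phase
    else:
        frac = (total_steps - step) / half   # falling phase
    return base_lr + (max_lr - base_lr) * frac

# the schedule peaks at max_lr exactly mid-training
schedule = [triangular_lr(s, total_steps=10, max_lr=0.01) for s in range(11)]
```

The 1cycle policy is a refinement of this shape (one triangular cycle plus a final low-rate annealing phase); SGDR instead restarts a cosine decay periodically.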
Please see the following tutorial notebooks for a guide on how to use ktrain in your projects:
Some blog tutorials and other guides about ktrain are shown below:
- ktrain: A Lightweight Wrapper for Keras to Help Train Neural Networks
- BERT Text Classification in 3 Lines of Code
- Text Classification with Hugging Face Transformers in TensorFlow 2 (Without Tears)
- Build an Open-Domain Question-Answering System With BERT in 3 Lines of Code
- Finetuning BERT using ktrain for Disaster Tweets Classification by Hamiz Ahmed
- Indonesian NLP Examples with ktrain by Sandy Khosasi
Using ktrain on Google Colab? See these Colab examples:
- an example of transformer word embeddings

Tasks such as text classification and image classification can be accomplished easily with only a few lines of code:
Example: text classification of IMDb movie reviews using BERT:

```python
import ktrain
from ktrain import text as txt

# load data
(x_train, y_train), (x_test, y_test), preproc = txt.texts_from_folder('data/aclImdb',
                                                                      maxlen=500,
                                                                      preprocess_mode='bert',
                                                                      train_test_names=['train', 'test'],
                                                                      classes=['pos', 'neg'])

# load model
model = txt.text_classifier('bert', (x_train, y_train), preproc=preproc)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model,
                             train_data=(x_train, y_train),
                             val_data=(x_test, y_test),
                             batch_size=6)

# find good learning rate
learner.lr_find()  # briefly simulate training to find good learning rate
learner.lr_plot()  # visually identify best learning rate

# train using 1cycle learning rate schedule for 3 epochs
learner.fit_onecycle(2e-5, 3)
```
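The lr_find()/lr_plot() steps sweep the learning rate upward while tracking loss; you then pick a rate where the loss is still falling sharply. A toy, framework-free sketch of one common selection heuristic (the helper name and data are illustrative, not ktrain's implementation):

```python
def suggest_lr(lrs, losses):
    """Pick the learning rate at the steepest loss decrease --
    a common heuristic for reading a learning-rate-finder plot."""
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        slope = losses[i] - losses[i - 1]
        if slope < best_slope:           # steepest descent so far
            best_slope, best_lr = slope, lrs[i]
    return best_lr

# simulated finder sweep: loss falls fastest around 1e-2, then diverges
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
losses = [1.00, 0.98, 0.90, 0.60, 2.50]
print(suggest_lr(lrs, losses))  # → 0.01
```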
Example: classifying images of dogs and cats using a pretrained ResNet50:

```python
import ktrain
from ktrain import vision as vis

# load data
(train_data, val_data, preproc) = vis.images_from_folder(
                                      datadir='data/dogscats',
                                      data_aug=vis.get_data_aug(horizontal_flip=True),
                                      train_test_names=['train', 'valid'],
                                      target_size=(224, 224), color_mode='rgb')

# load model
model = vis.image_classifier('pretrained_resnet50', train_data, val_data, freeze_layers=80)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model=model, train_data=train_data, val_data=val_data,
                             workers=8, use_multiprocessing=False, batch_size=64)

# find good learning rate
learner.lr_find()  # briefly simulate training to find good learning rate
learner.lr_plot()  # visually identify best learning rate

# train using triangular policy with ModelCheckpoint and implicit ReduceLROnPlateau and EarlyStopping
learner.autofit(1e-4, checkpoint_folder='/tmp/saved_weights')
```
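The get_data_aug(horizontal_flip=True) call enables random horizontal flips during training. The underlying transform is just a left-right mirror; a minimal sketch on a nested-list "image" (illustrative only; in practice the augmentation is applied to array-valued images by the data pipeline):

```python
def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left-to-right."""
    return [row[::-1] for row in image]

img = [[1, 2, 3],
       [4, 5, 6]]
print(horizontal_flip(img))  # → [[3, 2, 1], [6, 5, 4]]
```

Flipping twice recovers the original image, which is why the transform is safe to apply randomly per batch.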
Example: sequence labeling (NER) with a BiLSTM-CRF:

```python
import ktrain
from ktrain import text as txt

# load data
(trn, val, preproc) = txt.entities_from_txt('data/ner_dataset.csv',
                                            sentence_column='Sentence #',
                                            word_column='Word',
                                            tag_column='Tag',
                                            data_format='gmb',
                                            use_char=True)  # enable character embeddings

# load model
model = txt.sequence_tagger('bilstm-crf', preproc)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model, train_data=trn, val_data=val)

# conventional training for 1 epoch using a learning rate of 0.001 (Keras default for the Adam optimizer)
learner.fit(1e-3, 1)
```
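A sequence tagger such as the bilstm-crf above predicts one tag per token in a BIO-style scheme; turning those tags back into entity spans is a small decoding step. A minimal sketch (hypothetical helper, not ktrain's API):

```python
def bio_to_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            if current:                      # close any open entity
                entities.append((' '.join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith('I-') and current:
            current.append(tok)              # continue the open entity
        else:                                # 'O' tag ends any open entity
            if current:
                entities.append((' '.join(current), label))
            current, label = [], None
    if current:                              # entity running to end of sentence
        entities.append((' '.join(current), label))
    return entities

tokens = ['Arun', 'lives', 'in', 'New', 'York']
tags = ['B-PER', 'O', 'O', 'B-LOC', 'I-LOC']
print(bio_to_entities(tokens, tags))  # → [('Arun', 'PER'), ('New York', 'LOC')]
```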
Example: node classification on the Cora citation graph using GraphSAGE:

```python
import ktrain
from ktrain import graph as gr

# load data with a 10% supervision ratio
(trn, val, preproc) = gr.graph_nodes_from_csv(
                          'cora.content',  # node attributes/labels
                          'cora.cites',    # edge list
                          sample_size=20,
                          holdout_pct=None,
                          holdout_for_inductive=False,
                          train_pct=0.1, sep='\t')

# load model
model = gr.graph_node_classifier('graphsage', trn)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=64)

# find good learning rate
learner.lr_find(max_epochs=100)  # briefly simulate training to find good learning rate
learner.lr_plot()  # visually identify best learning rate

# train using triangular policy with ModelCheckpoint and implicit ReduceLROnPlateau and EarlyStopping
learner.autofit(0.01, checkpoint_folder='/tmp/saved_weights')
```
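The sample_size=20 argument above controls how many neighbors GraphSAGE-style aggregators draw per node: each node aggregates a fixed-size sample of its neighborhood rather than all neighbors. A framework-free sketch of fixed-size neighbor sampling (the function and toy graph are illustrative, not ktrain's API):

```python
import random

def sample_neighbors(adj, node, sample_size, seed=0):
    """Sample a fixed number of neighbors; sample with replacement
    when the neighborhood is smaller than sample_size."""
    rng = random.Random(seed)
    neighbors = adj[node]
    if len(neighbors) >= sample_size:
        return rng.sample(neighbors, sample_size)       # without replacement
    return [rng.choice(neighbors) for _ in range(sample_size)]  # with replacement

adj = {'a': ['b', 'c'], 'b': ['a'], 'c': ['a', 'b']}
print(len(sample_neighbors(adj, 'a', sample_size=5)))  # always 5
```

Fixed-size sampling keeps every batch the same shape, which is what makes minibatch training on large graphs feasible.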
Example: text classification with Hugging Face transformers:

```python
# load text data
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
test_b = fetch_20newsgroups(subset='test', categories=categories, shuffle=True)
(x_train, y_train) = (train_b.data, train_b.target)
(x_test, y_test) = (test_b.data, test_b.target)

# build, train, and validate model (Transformer is wrapper around transformers library)
import ktrain
from ktrain import text
MODEL_NAME = 'distilbert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=train_b.target_names)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(5e-5, 4)
learner.validate(class_names=t.get_classes())  # class_names must be string values

# Output from learner.validate()
#                         precision    recall  f1-score   support
#
#            alt.atheism       0.92      0.93      0.93       319
#          comp.graphics       0.97      0.97      0.97       389
#                sci.med       0.97      0.95      0.96       396
# soc.religion.christian       0.96      0.96      0.96       398
#
#               accuracy                           0.96      1502
#              macro avg       0.95      0.96      0.95      1502
#           weighted avg       0.96      0.96      0.96      1502
```
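The macro avg and weighted avg rows in the validation report above are simple aggregates of the per-class scores: the macro average treats every class equally, while the weighted average weights each class by its support (sample count). A small sketch using the F1 column from the report:

```python
def macro_avg(scores):
    """Unweighted mean over classes."""
    return sum(scores) / len(scores)

def weighted_avg(scores, supports):
    """Mean over classes weighted by class support (sample counts)."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# per-class F1 and support taken from the report above
f1 = [0.93, 0.97, 0.96, 0.96]
support = [319, 389, 396, 398]
print(macro_avg(f1))               # ≈ 0.955, shown as 0.95 in the report
print(weighted_avg(f1, support))   # ≈ 0.956, shown as 0.96 in the report
```

On class-imbalanced data the two can differ substantially; the weighted average is pulled toward the scores of the largest classes.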
Example: tabular classification on the Titanic survival dataset:

```python
import ktrain
from ktrain import tabular
import pandas as pd
train_df = pd.read_csv('train.csv', index_col=0)
train_df = train_df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
trn, val, preproc = tabular.tabular_from_df(train_df, label_columns=['Survived'], random_state=42)
learner = ktrain.get_learner(tabular.tabular_classifier('mlp', trn), train_data=trn, val_data=val)
learner.lr_find(show_plot=True, max_epochs=5)  # estimate learning rate
learner.fit_onecycle(5e-3, 10)

# evaluate held-out labeled test set
tst = preproc.preprocess_test(pd.read_csv('heldout.csv', index_col=0))
learner.evaluate(tst, class_names=preproc.get_classes())
```

Make sure pip is up-to-date with: `pip install -U pip`
Install TensorFlow 2 if it is not already installed (e.g., `pip install tensorflow`).
Install ktrain: `pip install ktrain`
If using `tensorflow>=2.16`: `pip install tf_keras` and set the `TF_USE_LEGACY_KERAS` environment variable to true.

The above should be all you need on Linux systems and cloud computing environments like Google Colab and AWS EC2. If you are using ktrain on a Windows computer, you can follow these more detailed instructions, which include some extra steps.
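For example, the environment variable can be set in Python at the top of your script, before ktrain is imported (a minimal sketch; the import itself is shown commented out):

```python
import os

# must be set before ktrain (and therefore TensorFlow/Keras) is imported
os.environ['TF_USE_LEGACY_KERAS'] = '1'

# import ktrain  # safe to import now that legacy Keras is selected
```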
Some important notes about installation:

As of `tensorflow>=2.11`, you must only use legacy optimizers such as `tf.keras.optimizers.legacy.Adam`; the newer `tf.keras.optimizers.Optimizer` base class is not supported at this time. For example, when using TensorFlow 2.11 and above, use `tf.keras.optimizers.legacy.Adam()` instead of the string `"adam"` in `model.compile`. ktrain does this automatically when using out-of-the-box models (e.g., models from the `transformers` library).

As of `tensorflow>=2.16`, install the `tf_keras` package and set the environment variable `TF_USE_LEGACY_KERAS=True` before importing ktrain (e.g., add `export TF_USE_LEGACY_KERAS=1` to `.bashrc`, or add `os.environ['TF_USE_LEGACY_KERAS']="1"` at the top of your script, etc.).

ktrain supports TensorFlow 2 through forked versions of the `eli5` and `stellargraph` libraries. Some optional libraries used for specific tasks can be installed as needed:
```shell
# for graph module:
pip install https://github.com/amaiya/stellargraph/archive/refs/heads/no_tf_dep_082.zip

# for text.TextPredictor.explain and vision.ImagePredictor.explain:
pip install https://github.com/amaiya/eli5-tf/archive/refs/heads/master.zip

# for tabular.TabularPredictor.explain:
pip install shap

# for text.zsl (ZeroShotClassifier), text.summarization, text.translation, text.speech:
pip install torch

# for text.speech:
pip install librosa

# for tabular.causal_inference_model:
pip install causalnlp

# for text.summarization.core.LexRankSummarizer:
pip install sumy

# for text.kw.KeywordExtractor:
pip install textblob

# for text.generative_ai:
pip install onprem
```

ktrain purposely pins to a lower version of `transformers` to include support for older versions of TensorFlow. If you need a newer version of `transformers`, it is usually safe to upgrade `transformers` after installing ktrain.
As of v0.30.x, TensorFlow installation is optional and only required if training neural networks. Although ktrain uses TensorFlow for neural network training, it also includes a variety of useful pretrained PyTorch models and sklearn models, which can be used without TensorFlow being installed, as summarized in this table:
| Feature | TensorFlow | PyTorch | Sklearn |
|---|---|---|---|
| Training any neural network (e.g., text or image classification) | ✅ | | |
| End-to-end question-answering (pretrained) | ✅ | ✅ | |
| QA-based information extraction (pretrained) | ✅ | ✅ | |
| Zero-shot classification (pretrained) | | ✅ | |
| Language translation (pretrained) | | ✅ | |
| Summarization (pretrained) | | ✅ | |
| Speech transcription (pretrained) | | ✅ | |
| Image captioning (pretrained) | | ✅ | |
| Object detection (pretrained) | | ✅ | |
| Sentiment analysis (pretrained) | | ✅ | |
| Generative AI (sentence-transformers) | | ✅ | |
| Topic modeling (sklearn) | | | ✅ |
| Keyphrase extraction (textblob/nltk/sklearn) | | | ✅ |
As noted above, end-to-end question-answering and information extraction in ktrain can be used with either TensorFlow (using `framework='tf'`) or PyTorch (using `framework='pt'`).
If you use ktrain in your work, please cite the following paper:
```bibtex
@article{maiya2020ktrain,
  title={ktrain: A Low-Code Library for Augmented Machine Learning},
  author={Arun S. Maiya},
  year={2020},
  eprint={2004.10703},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  journal={arXiv preprint arXiv:2004.10703},
}
```
Creator: Arun S. Maiya
Email: arun [at] maiya [dot] net