TransmogrifAI
0.7.0
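TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Apache Spark. It was developed to accelerate machine learning developer productivity through machine learning automation and an API that enforces compile-time type safety, modularity, and reuse. Through automation, it achieves accuracies close to those of hand-tuned models in far less time.
Use TransmogrifAI if you need a machine learning library to:
To understand the motivation behind TransmogrifAI, check out these: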
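The Titanic dataset is an often-cited dataset in the machine learning community. The goal is to build a machine-learned model that predicts survivors from the Titanic passenger manifest. Here is how you would build the model using TransmogrifAI: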
import com.salesforce.op._
import com.salesforce.op.readers._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
import spark.implicits._
// Read Titanic data as a DataFrame
// (Passenger, a case class matching the CSV columns, and pathToData are assumed to be defined elsewhere)
val passengersData = DataReaders.Simple.csvCase[Passenger](path = pathToData).readDataset().toDF()
// Extract response and predictor Features
val (survived, predictors) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
// Automated feature engineering
val featureVector = predictors.transmogrify()
// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)
// Automated model selection
val pred = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()
// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(pred).train()
println("Model summary:\n" + model.summaryPretty())
Model summary:
Evaluated Logistic Regression, Random Forest models with 3 folds and AuPR metric.
Evaluated 3 Logistic Regression models with AuPR between [0.6751930383321765, 0.7768725281794376]
Evaluated 16 Random Forest models with AuPR between [0.7781671467343991, 0.8104798040316159]
Selected model Random Forest classifier with parameters:
|-----------------------|--------------|
| Model Param | Value |
|-----------------------|--------------|
| modelType | RandomForest |
| featureSubsetStrategy | auto |
| impurity | gini |
| maxBins | 32 |
| maxDepth | 12 |
| minInfoGain | 0.001 |
| minInstancesPerNode | 10 |
| numTrees | 50 |
| subsamplingRate | 1.0 |
|-----------------------|--------------|
Model evaluation metrics:
|-------------|--------------------|---------------------|
| Metric Name | Hold Out Set Value | Training Set Value |
|-------------|--------------------|---------------------|
| Precision | 0.85 | 0.773851590106007 |
| Recall | 0.6538461538461539 | 0.6930379746835443 |
| F1 | 0.7391304347826088 | 0.7312186978297163 |
| AuROC | 0.8821603927986905 | 0.8766642291593114 |
| AuPR | 0.8225075757571668 | 0.850331080886535 |
| Error | 0.1643835616438356 | 0.19682151589242053 |
| TP | 17.0 | 219.0 |
| TN | 44.0 | 438.0 |
| FP | 3.0 | 64.0 |
| FN | 9.0 | 97.0 |
|-------------|--------------------|---------------------|
Top model insights computed using correlation:
|-----------------------|----------------------|
| Top Positive Insights | Correlation |
|-----------------------|----------------------|
| sex = "female" | 0.5177801026737666 |
| cabin = "OTHER" | 0.3331391338844782 |
| pClass = 1 | 0.3059642953159715 |
|-----------------------|----------------------|
| Top Negative Insights | Correlation |
|-----------------------|----------------------|
| sex = "male" | -0.5100301587292186 |
| pClass = 3 | -0.5075774968534326 |
| cabin = null | -0.31463114463832633 |
|-----------------------|----------------------|
Top model insights computed using CramersV:
|-----------------------|----------------------|
| Top Insights | CramersV |
|-----------------------|----------------------|
| sex | 0.525557139885501 |
| embarked | 0.31582347194683386 |
| age | 0.21582347194683386 |
|-----------------------|----------------------|
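As a sanity check on the evaluation table above, the precision, recall, F1 and error values can be re-derived from the confusion-matrix counts (TP, TN, FP, FN) using their standard definitions. The following is a minimal plain-Scala sketch, independent of TransmogrifAI; the `Confusion` class and `HoldOutMetrics` object are hypothetical names introduced only for this illustration.

```scala
// Hypothetical helper, not part of TransmogrifAI: standard binary
// classification metrics computed from confusion-matrix counts.
final case class Confusion(tp: Double, tn: Double, fp: Double, fn: Double) {
  def precision: Double = tp / (tp + fp)
  def recall: Double = tp / (tp + fn)
  def f1: Double = 2 * precision * recall / (precision + recall)
  def error: Double = (fp + fn) / (tp + tn + fp + fn)
}

object HoldOutMetrics {
  // Counts copied from the "Hold Out Set Value" column of the summary.
  val holdOut: Confusion = Confusion(tp = 17, tn = 44, fp = 3, fn = 9)

  def main(args: Array[String]): Unit = {
    println(s"Precision: ${holdOut.precision}")
    println(s"Recall:    ${holdOut.recall}")
    println(s"F1:        ${holdOut.f1}")
    println(s"Error:     ${holdOut.error}")
  }
}
```

Up to floating-point rounding, the printed values agree with the hold-out column of the table. Substituting the training-set counts (219, 438, 64, 97) reproduces the training column in the same way.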
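While the automated workflow above may seem a bit too magical, TransmogrifAI also gives those who want more control the flexibility to fully specify every feature that is extracted and every algorithm that is applied in the ML pipeline. Visit our docs site for full documentation, getting started, examples, FAQ and other information.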
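To illustrate what specifying features and algorithms by hand might look like, here is a hedged sketch rather than a definitive recipe. It assumes a reduced, hypothetical `Passenger` schema with only the columns shown, and restricts model selection to logistic regression with a train/validation split. The object name `ManualTitanicFeatures` is invented for this example, and how the data is supplied to the workflow is left out.

```scala
import com.salesforce.op._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry._

// Hypothetical schema: only the columns used in this sketch.
case class Passenger(
  id: Int,
  survived: Int,
  pClass: Option[Int],
  sex: Option[String],
  age: Option[Double]
)

object ManualTitanicFeatures {
  // Response feature: the survival flag as a non-nullable real.
  val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse

  // Hand-picked predictors, each with an explicit feature type.
  val pClass = FeatureBuilder.PickList[Passenger].extract(_.pClass.map(_.toString).toPickList).asPredictor
  val sex = FeatureBuilder.PickList[Passenger].extract(_.sex.map(_.toString).toPickList).asPredictor
  val age = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor

  // Manual feature engineering: impute missing ages with the mean, then z-normalize.
  val normedAge = age.fillMissingWithMean().zNormalize()

  // Vectorize only the chosen features, then validate them against the response.
  val featureVector = Seq(pClass, sex, normedAge).transmogrify()
  val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

  // Restrict model selection to a single model family instead of the default search.
  val prediction = BinaryClassificationModelSelector
    .withTrainValidationSplit(modelTypesToUse = Seq(OpLogisticRegression))
    .setInput(survived, checkedFeatures)
    .getOutput()
}
```

The resulting `prediction` feature plays the same role as `pred` in the quick start snippet and would be handed to an `OpWorkflow` through `setResultFeatures`.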
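You can simply add TransmogrifAI as a regular dependency to an existing project. Start by picking the TransmogrifAI version that matches your project dependencies from the version matrix below (if unsure, take the stable version):
| TransmogrifAI Version | Spark Version | Scala Version | Java Version |
|---|---|---|---|
| 0.7.1 (unreleased, master), 0.7.0 (stable) | 2.4 | 2.11 | 1.8 |
| 0.6.1, 0.6.0, 0.5.3, 0.5.2, 0.5.1, 0.5.0 | 2.3 | 2.11 | 1.8 |
| 0.4.0, 0.3.4 | 2.2 | 2.11 | 1.8 |
For Gradle, add the following to build.gradle: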
repositories {
jcenter()
mavenCentral()
}
dependencies {
// TransmogrifAI core dependency
compile 'com.salesforce.transmogrifai:transmogrifai-core_2.11:0.7.0'
// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
// compile 'com.salesforce.transmogrifai:transmogrifai-models_2.11:0.7.0'
}
For SBT, add the following to build.sbt:
scalaVersion := "2.11.12"
resolvers += Resolver.jcenterRepo
// TransmogrifAI core dependency
libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.7.0"
// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
// libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-models" % "0.7.0"
Then import TransmogrifAI into your code:
// TransmogrifAI functionality: feature types, feature builders, feature dsl, readers, aggregators etc.
import com.salesforce.op._
import com.salesforce.op.aggregators._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.readers._
// Spark enrichments (optional)
import com.salesforce.op.utils.spark.RichDataset._
import com.salesforce.op.utils.spark.RichRDD._
import com.salesforce.op.utils.spark.RichRow._
import com.salesforce.op.utils.spark.RichMetadata._
import com.salesforce.op.utils.spark.RichStructType._
Visit our docs site for full documentation, getting started, examples, FAQ and other information.
See the Scaladoc for the programming API.
BSD 3-Clause © Salesforce.com, Inc.