Data Science Utils: Frequently Used Methods for Data Science

Data Science Utils extends the Scikit-Learn API and the Matplotlib API to provide simple methods that simplify common tasks and visualizations in data science projects.

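Code Examples and Documentation

Let's explore some code examples and their outputs.

The full documentation, with all code examples, is available at: https://datascienceutils.readthedocs.io/en/latest/

The documentation covers more methods and additional examples.

The package API is built to work with the Scikit-Learn API and the Matplotlib API. Here are some of the capabilities of this package: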

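Metrics

Plot Confusion Matrix

Computes and plots a confusion matrix together with the false positive rate, false negative rate, accuracy, and F1 score of a classification.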

from ds_utils.metrics import plot_confusion_matrix

plot_confusion_matrix(y_test, y_pred, [0, 1, 2])
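To make the reported quantities concrete, here is a minimal pure-Python sketch of how a confusion matrix and the one-vs-rest false positive rate, false negative rate, and F1 score can be derived from it. The helper names (`confusion_matrix`, `per_class_rates`) and the toy labels are illustrative assumptions, not part of ds_utils, and the library's own computation may differ in detail.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_class_rates(matrix, class_idx):
    """One-vs-rest FPR, FNR, and F1 for a single class."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[class_idx][class_idx]
    fn = sum(matrix[class_idx]) - tp
    fp = sum(row[class_idx] for row in matrix) - tp
    tn = total - tp - fn - fp
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return fpr, fnr, f1


# Hypothetical toy data with three classes, mirroring the labels [0, 1, 2].
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, [0, 1, 2])
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(y_true)
fpr, fnr, f1 = per_class_rates(cm, 1)
print(cm, accuracy, fpr, fnr, f1)
```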

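Plot Metric Growth per Labeled Instances

Receives train and test sets, and plots how a given metric changes as the number of labeled training instances increases.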

from ds_utils.metrics import plot_metric_growth_per_labeled_instances
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

plot_metric_growth_per_labeled_instances(
    x_train, y_train, x_test, y_test,
    {
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
        "RandomForestClassifier": RandomForestClassifier(random_state=0, n_estimators=5)
    }
)
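The idea behind this plot can be sketched without any library: train on progressively larger slices of the training set and score each model on the same fixed test set. The sketch below is only an illustration; `majority_class_model` is a hypothetical stand-in for a real classifier (it ignores the features entirely), and the slice sizes are arbitrary assumptions.

```python
from collections import Counter


def majority_class_model(x, y):
    """Hypothetical stand-in classifier: always predicts the most common label.

    The features x are accepted for a fit-like signature but are not used.
    """
    most_common = Counter(y).most_common(1)[0][0]
    return lambda rows: [most_common for _ in rows]


def metric_growth(x_train, y_train, x_test, y_test, sizes):
    """Accuracy on the fixed test set after training on the first n instances."""
    scores = []
    for n in sizes:
        predict = majority_class_model(x_train[:n], y_train[:n])
        predictions = predict(x_test)
        correct = sum(p == t for p, t in zip(predictions, y_test))
        scores.append((n, correct / len(y_test)))
    return scores


# Hypothetical toy data.
y_train = [0, 1, 1, 1, 0, 1]
x_train = [[i] for i in range(len(y_train))]
x_test, y_test = [[10], [11], [12], [13]], [1, 1, 0, 1]
growth = metric_growth(x_train, y_train, x_test, y_test, sizes=[1, 3, 6])
print(growth)
```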

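Visualize Accuracy Grouped by Probability

Receives the true test labels and the classifier's probability predictions, bins and classifies the results, and finally plots them as a stacked bar chart.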

from ds_utils.metrics import visualize_accuracy_grouped_by_probability

visualize_accuracy_grouped_by_probability(
    test["target"],
    1,
    classifier.predict_proba(test[selected_features]),
    display_breakdown=False
)

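The chart can be drawn without the breakdown (display_breakdown=False, as in the call above) or with the breakdown (display_breakdown=True).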
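As a rough illustration of what "grouped by probability" means, the following sketch bins the predicted probability of the positive class and counts correct and incorrect predictions in each bin. The bin count, the 0.5 decision threshold, and the function name are assumptions made for this example; they are not taken from the ds_utils implementation.

```python
def accuracy_by_probability_bin(y_true, positive_label, probabilities,
                                threshold=0.5, n_bins=5):
    """Count correct and incorrect predictions per equal-width probability bin."""
    bins = [{"correct": 0, "incorrect": 0} for _ in range(n_bins)]
    for truth, prob in zip(y_true, probabilities):
        # Clamp so that a probability of exactly 1.0 falls in the last bin.
        bin_idx = min(int(prob * n_bins), n_bins - 1)
        predicted_positive = prob >= threshold
        is_correct = predicted_positive == (truth == positive_label)
        bins[bin_idx]["correct" if is_correct else "incorrect"] += 1
    return bins


# Hypothetical labels and positive-class probabilities.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.95, 0.10, 0.55, 0.30, 0.65, 0.05]
bins = accuracy_by_probability_bin(y_true, 1, probs)
for i, counts in enumerate(bins):
    print(f"[{i / 5:.1f}, {(i + 1) / 5:.1f}): {counts}")
```

Each bin's pair of counts corresponds to one stacked bar in the chart.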

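Receiver Operating Characteristic (ROC) Curve with Probabilities (Thresholds) Annotations

Plots ROC curves with threshold annotations for multiple classifiers, using Plotly as the backend.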

from ds_utils.metrics import plot_roc_curve_with_thresholds_annotations

classifiers_names_and_scores_dict = {
    "Decision Tree": tree_clf.predict_proba(X_test)[:, 1],
    "Random Forest": rf_clf.predict_proba(X_test)[:, 1],
    "XGBoost": xgb_clf.predict_proba(X_test)[:, 1]
}
fig = plot_roc_curve_with_thresholds_annotations(
    y_true,
    classifiers_names_and_scores_dict,
    positive_label=1
)
fig.show()
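Each annotated point on a ROC curve pairs a probability threshold with the false positive rate and true positive rate obtained at that threshold. The sketch below computes those triples by hand for one classifier; the function name and the toy scores are assumptions for illustration, and it expects both classes to be present in `y_true`.

```python
def roc_points(y_true, scores, positive_label=1):
    """Return (threshold, fpr, tpr) for each distinct score, highest threshold first."""
    positives = sum(1 for y in y_true if y == positive_label)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # An instance is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores)
                 if s >= threshold and y == positive_label)
        fp = sum(1 for y, s in zip(y_true, scores)
                 if s >= threshold and y != positive_label)
        points.append((threshold, fp / negatives, tp / positives))
    return points


# Hypothetical labels and positive-class scores.
roc = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
for threshold, fpr, tpr in roc:
    print(f"threshold={threshold:.2f}  fpr={fpr:.2f}  tpr={tpr:.2f}")
```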

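Precision-Recall Curve with Probabilities (Thresholds) Annotations

Plots Precision-Recall curves with threshold annotations for multiple classifiers, using Plotly as the backend.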

from ds_utils.metrics import plot_precision_recall_curve_with_thresholds_annotations

classifiers_names_and_scores_dict = {
    "Decision Tree": tree_clf.predict_proba(X_test)[:, 1],
    "Random Forest": rf_clf.predict_proba(X_test)[:, 1],
    "XGBoost": xgb_clf.predict_proba(X_test)[:, 1]
}
fig = plot_precision_recall_curve_with_thresholds_annotations(
    y_true,
    classifiers_names_and_scores_dict,
    positive_label=1
)
fig.show()
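For intuition about the annotations, precision and recall at a given threshold can be computed directly from the scores. This is a minimal sketch with a hypothetical helper name and toy data; it is not the routine ds_utils uses internally.

```python
def precision_recall_points(y_true, scores, positive_label=1):
    """Return (threshold, precision, recall) for each distinct score, ascending."""
    positives = sum(1 for y in y_true if y == positive_label)
    points = []
    for threshold in sorted(set(scores)):
        # Labels of the instances predicted positive at this threshold.
        # The list is never empty because the threshold is itself one of the scores.
        predicted = [y for y, s in zip(y_true, scores) if s >= threshold]
        tp = sum(1 for y in predicted if y == positive_label)
        points.append((threshold, tp / len(predicted), tp / positives))
    return points


# Hypothetical labels and positive-class scores.
pr = precision_recall_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
for threshold, precision, recall in pr:
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```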

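Preprocess

Visualize Feature

Receives a feature and visualizes its values on a graph:

If the feature is a float, the method plots a distribution plot.
If the feature is a datetime, the method plots a line plot of its progression over time.
If the feature is an object, categorical, boolean, or integer, the method plots a count plot (histogram).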

from ds_utils.preprocess import visualize_feature

visualize_feature(X_train["feature"])

[Example plots by feature type: Float, Integer, Datetime, Category / Object, Boolean]
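The type-based choice of plot described above can be pictured as a small dispatch function. The library works on pandas Series and their dtypes; the sketch below substitutes plain Python value types as a simplifying assumption, and `choose_plot_kind` is a hypothetical name.

```python
from datetime import datetime


def choose_plot_kind(values):
    """Pick a plot kind from the Python type of the first non-missing value."""
    sample = next((v for v in values if v is not None), None)
    if isinstance(sample, float):
        return "distribution plot"
    if isinstance(sample, datetime):
        return "line plot over time"
    # Integers, booleans, strings (object), and other categorical values.
    return "count plot"


print(choose_plot_kind([1.5, 2.0, None]))
print(choose_plot_kind([datetime(2020, 1, 1), datetime(2020, 1, 2)]))
print(choose_plot_kind(["red", "blue", "red"]))
```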

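Get Correlated Features

Calculates which features are correlated above a threshold and extracts a data frame with those correlations and each feature's correlation with the target feature.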

from ds_utils.preprocess import get_correlated_features

correlations = get_correlated_features(train, features, target)

level_0	level_1	level_0_level_1_corr	level_0_target_corr	level_1_target_corr
income_category_Low	income_category_Medium	1.0	0.1182165609358650	0.11821656093586504
term_ 36 months	term_ 60 months	1.0	0.1182165609358650	0.11821656093586504
interest_payments_High	interest_payments_Low	1.0	0.1182165609358650	0.11821656093586504
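The shape of this output can be reproduced with a short sketch: compute the pairwise Pearson correlation of the features, keep the pairs at or above a threshold, and attach each feature's correlation with the target. The 0.95 threshold, the use of absolute values, and the helper names are assumptions for this illustration rather than documented ds_utils behaviour.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, features, target, threshold=0.95):
    """List feature pairs whose absolute correlation reaches the threshold."""
    rows = []
    for a, b in combinations(features, 2):
        corr = abs(pearson(columns[a], columns[b]))
        if corr >= threshold:
            rows.append({
                "level_0": a,
                "level_1": b,
                "level_0_level_1_corr": corr,
                "level_0_target_corr": abs(pearson(columns[a], columns[target])),
                "level_1_target_corr": abs(pearson(columns[b], columns[target])),
            })
    return rows


# Hypothetical columns: "b" is an exact multiple of "a".
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, 1.0, -1.0],
    "target": [0.0, 0.0, 1.0, 1.0],
}
pairs = correlated_pairs(columns, ["a", "b", "c"], "target")
print(pairs)
```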

Visualize Correlations

Computes the pairwise correlation of columns, excluding NA/null values, and visualizes it with a heat map.

from ds_utils.preprocess import visualize_correlations

visualize_correlations(data)

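Plot Correlation Dendrogram

Plots a dendrogram of a correlation matrix. The chart shows hierarchically, through connecting trees, which variables are most correlated. The closer a connection is to the right, the more correlated the features are.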

from ds_utils.preprocess import plot_correlation_dendrogram

plot_correlation_dendrogram(data)

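Plot Features' Interaction

Plots the joint distribution between two features:

If both features are categorical, boolean, or object, the method plots a shared histogram.
If one feature is categorical, boolean, or object and the other is numeric, the method plots a boxplot chart.
If one feature is datetime and the other is numeric or datetime, the method plots a line plot.
If one feature is datetime and the other is categorical, boolean, or object, the method plots a violin plot (a combination of a boxplot and a kernel density estimate).
If both features are numeric, the method plots a scatter plot.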

from ds_utils.preprocess import plot_features_interaction

plot_features_interaction("feature_1", "feature_2", data)

[Grid of example plots for each pairing of feature types: Numeric, Categorical, Boolean, Datetime]
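The five rules for pairing feature types can be written down as one small decision function. This sketch assumes each feature has already been labeled with one of the kind names "numeric", "categorical", "boolean", "object", or "datetime"; those labels and the function name are hypothetical and only mirror the rules listed in this section.

```python
CATEGORICAL_LIKE = {"categorical", "boolean", "object"}


def choose_interaction_plot(kind_1, kind_2):
    """Map a pair of feature kinds to the plot type described in the rules."""
    kinds = {kind_1, kind_2}
    if kinds <= CATEGORICAL_LIKE:
        return "shared histogram"
    if "datetime" in kinds:
        if kinds & CATEGORICAL_LIKE:
            return "violin plot"
        return "line plot"  # the other feature is numeric or datetime
    if kinds == {"numeric"}:
        return "scatter plot"
    return "box plot"  # one categorical-like feature and one numeric feature


for pair in [("numeric", "numeric"), ("categorical", "numeric"),
             ("datetime", "numeric"), ("datetime", "boolean"),
             ("object", "categorical")]:
    print(pair, "->", choose_interaction_plot(*pair))
```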

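Strings

Append Tags to Frame

This method extracts tags from a given field and appends them to the data frame as new columns.

Consider a dataset that looks like this:

x_train:

article_name	article_tags
1	ds,ml,dl
2	ds,ml

x_test:

article_name	article_tags
3	ds,ml,py

Using this code: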

import pandas as pd
from ds_utils.strings import append_tags_to_frame

x_train = pd.DataFrame([{"article_name": "1", "article_tags": "ds,ml,dl"},
                        {"article_name": "2", "article_tags": "ds,ml"}])
x_test = pd.DataFrame([{"article_name": "3", "article_tags": "ds,ml,py"}])

x_train_with_tags, x_test_with_tags = append_tags_to_frame(x_train, x_test, "article_tags", "tag_")

The result will be:

x_train_with_tags:

article_name	tag_ds	tag_ml	tag_dl
1	1	1	1
2	1	1	0

x_test_with_tags:

article_name	tag_ds	tag_ml	tag_dl
3	1	1	0
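Note that the test frame gets no tag_py column: judging from the tables above, the tag vocabulary comes from the training data only. A minimal pure-Python sketch of that behaviour follows; `append_tags` is a hypothetical helper operating on lists of dicts, not the pandas-based library function, and the comma separator is an assumption taken from the example data.

```python
def append_tags(train_rows, test_rows, field, prefix, sep=","):
    """One-hot encode tags, learning the vocabulary from the training rows only."""
    vocabulary = []
    for row in train_rows:
        for tag in row[field].split(sep):
            if tag not in vocabulary:
                vocabulary.append(tag)  # keep first-seen order

    def encode(row):
        tags = set(row[field].split(sep))
        encoded = {key: value for key, value in row.items() if key != field}
        for tag in vocabulary:
            encoded[prefix + tag] = int(tag in tags)
        return encoded

    return [encode(r) for r in train_rows], [encode(r) for r in test_rows]


train = [{"article_name": "1", "article_tags": "ds,ml,dl"},
         {"article_name": "2", "article_tags": "ds,ml"}]
test = [{"article_name": "3", "article_tags": "ds,ml,py"}]
train_out, test_out = append_tags(train, test, "article_tags", "tag_")
print(train_out)
print(test_out)
```

Restricting the vocabulary to the training set keeps the train and test frames aligned on the same columns, which is what a downstream model expects.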

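Extract Significant Terms from Subset

This method returns interesting or unusual occurrences of terms in a subset. It is based on the Elasticsearch significant_text aggregation.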

import pandas as pd
from ds_utils.strings import extract_significant_terms_from_subset

corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
data_frame = pd.DataFrame(corpus, columns=["content"])
# Let's differentiate between the last two documents from the full corpus
subset_data_frame = data_frame[data_frame.index > 1]
terms = extract_significant_terms_from_subset(data_frame, subset_data_frame,
                                              "content")

The output for terms will be the following table:

third	one	and	this	the	is	first	document	second
1.0	1.0	1.0	0.67	0.67	0.67	0.5	0.25	0.0

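Unsupervised

Cluster Cardinality

Cluster cardinality is the number of examples per cluster. This method plots the number of points per cluster as a bar chart.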

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from ds_utils.unsupervised import plot_cluster_cardinality

data = pd.read_csv("path/to/dataset")
estimator = KMeans(n_clusters=8, random_state=42)
estimator.fit(data)

plot_cluster_cardinality(estimator.labels_)

plt.show()
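The quantity behind each bar is just a count of labels. As a minimal sketch, with a hypothetical helper name and made-up labels:

```python
from collections import Counter


def cluster_cardinality(labels):
    """Number of points assigned to each cluster, ordered by cluster id."""
    counts = Counter(labels)
    return {cluster: counts[cluster] for cluster in sorted(counts)}


# Hypothetical cluster assignments, such as the labels_ of a fitted KMeans.
cardinality = cluster_cardinality([0, 1, 1, 2, 1, 0])
print(cardinality)
```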

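Plot Cluster Magnitude

Cluster magnitude is the sum of distances from all examples to the centroid of the cluster. This method plots the total point-to-centroid distance per cluster as a bar chart.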

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean
from ds_utils.unsupervised import plot_cluster_magnitude

data = pd.read_csv("path/to/dataset")
estimator = KMeans(n_clusters=8, random_state=42)
estimator.fit(data)

plot_cluster_magnitude(data, estimator.labels_, estimator.cluster_centers_, euclidean)

plt.show()
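To illustrate the definition, the sketch below sums the Euclidean distance from each point to the centroid of its assigned cluster. It uses the standard library's `math.dist` in place of SciPy's `euclidean`, and the helper name and toy points are assumptions for this example.

```python
from math import dist


def cluster_magnitude(points, labels, centers):
    """Sum of point-to-centroid distances for each cluster."""
    magnitude = {cluster: 0.0 for cluster in range(len(centers))}
    for point, label in zip(points, labels):
        magnitude[label] += dist(point, centers[label])
    return magnitude


# Hypothetical 2-D points, their cluster labels, and the cluster centroids.
points = [(0.0, 0.0), (0.0, 2.0), (3.0, 4.0), (3.0, 8.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 1.0), (3.0, 6.0)]
magnitude = cluster_magnitude(points, labels, centers)
print(magnitude)
```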

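Magnitude vs. Cardinality

A higher cluster cardinality tends to result in a higher cluster magnitude, which makes intuitive sense. A cluster is considered anomalous when its cardinality does not correlate with its magnitude relative to the other clusters. This method helps find anomalous clusters by plotting magnitude against cardinality as a scatter plot.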

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean
from ds_utils.unsupervised import plot_magnitude_vs_cardinality

data = pd.read_csv("path/to/dataset")
estimator = KMeans(n_clusters=8, random_state=42)
estimator.fit(data)

plot_magnitude_vs_cardinality(data, estimator.labels_, estimator.cluster_centers_, euclidean)

plt.show()

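Optimum Number of Clusters

K-means clustering requires you to decide the number of clusters k beforehand. This method runs the KMeans algorithm, increasing the number of clusters at each iteration. The total magnitude, or sum of distances, is used as the loss metric.

Note: Currently, this method only works with sklearn.cluster.KMeans.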

import pandas as pd
from matplotlib import pyplot as plt
from scipy.spatial.distance import euclidean
from ds_utils.unsupervised import plot_loss_vs_cluster_number

data = pd.read_csv("path/to/dataset")

plot_loss_vs_cluster_number(data, 3, 20, euclidean)

plt.show()
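The loop behind such a plot can be sketched in pure Python: fit k-means for each candidate k and record the sum of point-to-centroid distances. The tiny Lloyd's-algorithm implementation below, with its naive "first k points" initialisation, is a deliberately simplified assumption made for illustration; the library relies on sklearn.cluster.KMeans instead.

```python
from math import dist


def kmeans_loss(points, k, iterations=10):
    """Run a basic Lloyd's k-means and return the sum of point-to-centroid distances."""
    centers = list(points[:k])  # naive deterministic initialisation (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: dist(point, centers[i]))
            clusters[nearest].append(point)
        # Recompute each centroid; keep the old one if a cluster became empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return sum(min(dist(point, center) for center in centers) for point in points)


# Hypothetical data: two well-separated groups of two points each.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
losses = {k: kmeans_loss(points, k) for k in range(1, 5)}
for k, loss in losses.items():
    print(f"k={k}  loss={loss:.2f}")
```

On data like this the loss drops sharply up to the true number of groups and only slowly afterwards, which is the "elbow" one looks for in the plot.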

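XAI (Explainable AI)

Plot Feature Importance

This method plots feature importance as a bar chart, helping to visualize which features have the greatest impact on the model's decisions.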

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from ds_utils.xai import plot_features_importance

# Load the dataset
data = pd.read_csv("path/to/dataset")
target = data["target"]
features = data.columns.tolist()
features.remove("target")

# Train a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(data[features], target)

# Plot feature importance
plot_features_importance(features, clf.feature_importances_)

plt.show()

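This visualization helps in understanding which features are most influential in the model's decision-making process, providing valuable insight for feature selection and model interpretation.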
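If only a quick textual ranking is needed, the same feature/importance pairs can be sorted and printed without a plot. The feature names and importance values below are hypothetical, and `rank_features` is an illustrative helper rather than part of ds_utils.

```python
def rank_features(features, importances):
    """Pair each feature with its importance and sort from most to least important."""
    return sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)


# Hypothetical values, such as those from a fitted model's feature_importances_.
ranked = rank_features(["age", "income", "tenure"], [0.2, 0.7, 0.1])
for name, score in ranked:
    # A crude text bar: one '#' per 5% of importance.
    print(f"{name:<8} {'#' * round(score * 20)} {score:.2f}")
```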

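Explore More

Excited about what you've seen so far? There's even more to discover! Dive deeper into each module to unlock the full potential of DataScienceUtils:

Metrics - Powerful methods for calculating and visualizing algorithm performance evaluation. Gain insight into how your models are performing.
Preprocess - Essential data preprocessing techniques to prepare your data for training. Improve your model's input for better results.
Strings - Efficient methods for manipulating and processing strings in data frames. Handle text data with ease.
Unsupervised - Tools for calculating and visualizing the performance of unsupervised models. Better understand your clustering and dimensionality reduction results.
XAI - Methods that help explain model decisions, making your AI more interpretable and trustworthy.

Each module is designed to streamline your data science workflow, giving you the tools you need to preprocess data, train models, evaluate performance, and interpret results. Check out the detailed documentation for each module to see how DataScienceUtils can enhance your projects!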

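Contributing

We're thrilled that you're interested in contributing to Data Science Utils! Your contributions help make this project better for everyone. Whether you're a seasoned developer or just getting started, there's a place for you here.

How to Contribute

Find an area to contribute to: check our issues page for open tasks, or think of a feature you'd like to add.
Fork the repository: make your own copy of the project.
Create a branch: make your changes in a new git branch.
Make your changes: add your improvements or fixes. We appreciate:
- Bug reports and fixes
- Feature requests and implementations
- Documentation improvements
- Performance optimizations
- User experience enhancements
Test your changes: ensure your code works as expected and doesn't introduce new issues.
Submit a pull request: open a PR with a clear title and a description of your changes.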

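Coding Guidelines

We follow the Python Software Foundation Code of Conduct and the Matplotlib Usage Guide. Please adhere to these guidelines in your contributions.

Getting Help

If you're new to open source or need any help, don't hesitate to ask questions in the issues section or reach out to the maintainers. We're here to help!

Why Contribute?

Improve your skills: gain experience working on a real-world project.
Be part of a community: connect with other developers and data scientists.
Make a difference: your contributions will help others in their data science journey.
Get recognition: all contributors are acknowledged in our project.

Remember, no contribution is too small. Whether it's fixing a typo in the documentation or adding a major feature, all contributions are valued and appreciated.

Thank you for helping make Data Science Utils better for everyone!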

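Installation Guide

Here are several ways to install the package:

1. Install from PyPI (Recommended)

The simplest way to install Data Science Utils and its dependencies is with pip, Python's preferred package installer: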

pip install data-science-utils

To upgrade Data Science Utils to the latest version, use:

pip install -U data-science-utils

2. Install from Source

If you prefer to install from source, you can clone the repository and install it:

git clone https://github.com/idanmoradarthas/DataScienceUtils.git
cd DataScienceUtils
pip install .

Alternatively, you can install directly from GitHub using pip:

pip install git+https://github.com/idanmoradarthas/DataScienceUtils.git

3. Install using Anaconda

If you're using Anaconda, you can install the package with conda:

conda install idanmorad::data-science-utils

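Note on Dependencies

Data Science Utils has several dependencies, including NumPy, pandas, Matplotlib, Plotly, and scikit-learn. They are installed automatically when you install the package using any of the methods above.

Staying Updated

Data Science Utils is an active project that routinely publishes new releases with additional methods and improvements. We recommend checking for updates periodically to get the latest features and bug fixes.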
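One way to see which release is currently installed before deciding to upgrade is to query the package metadata from Python. This is a small sketch using the standard library's `importlib.metadata`; it assumes the distribution name `data-science-utils` used in the pip commands above, and `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(distribution="data-science-utils"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version else "data-science-utils is not installed")
```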

If you encounter any issues during installation, please check our GitHub issues page or open a new issue to ask for help.
