
Data Science Utils: Frequently Used Methods for Data Science


Data Science Utils extends the Scikit-Learn API and the Matplotlib API to provide simple methods that simplify common tasks and visualizations in data science projects.

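Code Examples and Documentation

Let's explore some code examples and their outputs.

The full documentation, with all the code examples, is available at: https://datascienceutils.readthedocs.io/en/latest/

The documentation covers more methods and additional examples.

The package's API is built to work with the Scikit-Learn API and the Matplotlib API. Here are some of the package's capabilities: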

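Metrics

Plot Confusion Matrix

Computes and plots a confusion matrix together with the false positive rate, false negative rate, accuracy, and F1 score of a classification.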

from ds_utils.metrics import plot_confusion_matrix

# y_test holds the true labels, y_pred the predicted labels, and [0, 1, 2] the class labels.
plot_confusion_matrix(y_test, y_pred, [0, 1, 2])
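
To make the reported quantities concrete, here is a minimal, dependency-free sketch of how a confusion matrix and the one-vs-rest false positive rate, false negative rate, and F1 score can be derived from two label lists. The helper names `confusion_matrix` and `per_class_rates` are hypothetical and written for this illustration only; they are not part of ds_utils, and the library may compute or present these values differently.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix whose rows are true labels and columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_rates(matrix, labels):
    """Return one-vs-rest FPR, FNR and F1 for every label (hypothetical helper)."""
    total = sum(sum(row) for row in matrix)
    rates = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp
        fp = sum(row[i] for row in matrix) - tp
        tn = total - tp - fn - fp
        rates[label] = {
            "fpr": fp / (fp + tn) if fp + tn else 0.0,
            "fnr": fn / (fn + tp) if fn + tp else 0.0,
            "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
        }
    return rates


# Toy labels, invented for this example.
y_test = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
labels = [0, 1, 2]

matrix = confusion_matrix(y_test, y_pred, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_test)
rates = per_class_rates(matrix, labels)

for row in matrix:
    print(row)
print("accuracy:", round(accuracy, 3))
print(rates)
```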

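Plot Metric Growth per Labeled Instances

Receives the train and test sets and plots how the given metric changes as the number of training instances increases.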

from ds_utils.metrics import plot_metric_growth_per_labeled_instances
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Map a display name to each classifier whose metric growth should be plotted.
plot_metric_growth_per_labeled_instances(
    x_train, y_train, x_test, y_test,
    {
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
        "RandomForestClassifier": RandomForestClassifier(random_state=0, n_estimators=5)
    }
)

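Visualize Accuracy Grouped by Probability

Receives the true test labels and the classifier's probability predictions, divides the predictions into probability groups, classifies each as correct or incorrect, and plots the result as a stacked bar chart. Original code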

from ds_utils.metrics import visualize_accuracy_grouped_by_probability

# The second argument (1) is the class whose predicted probability is examined.
visualize_accuracy_grouped_by_probability(
    test["target"],
    1,
    classifier.predict_proba(test[selected_features]),
    display_breakdown=False
)

[Figure: output chart without breakdown]

[Figure: output chart with breakdown]
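
The grouping step can be sketched without any plotting. The example below is an illustration only: the function name `accuracy_by_probability_bin`, the ten equal-width bins, and the 0.5 decision threshold are assumptions made for this sketch, and it takes the positive-class probabilities directly rather than the full `predict_proba` matrix. It counts correct and incorrect predictions per bin, which is the data a stacked bar chart of this kind would display.

```python
def accuracy_by_probability_bin(y_true, positive_label, positive_probabilities,
                                threshold=0.5, n_bins=10):
    """Count correct and incorrect predictions in each probability bin."""
    bins = [{"correct": 0, "incorrect": 0} for _ in range(n_bins)]
    for truth, probability in zip(y_true, positive_probabilities):
        # Clamp the index so that a probability of exactly 1.0 lands in the last bin.
        index = min(int(probability * n_bins), n_bins - 1)
        predicted_positive = probability >= threshold
        is_correct = predicted_positive == (truth == positive_label)
        bins[index]["correct" if is_correct else "incorrect"] += 1
    return bins


# Toy data, invented for this example.
y_true = [1, 0, 1, 1, 0, 0]
probabilities = [0.95, 0.15, 0.55, 0.35, 0.65, 0.05]

bins = accuracy_by_probability_bin(y_true, 1, probabilities)
for i, counts in enumerate(bins):
    print(f"{i * 10:>3}%-{(i + 1) * 10:>3}%: {counts}")
```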

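Receiver Operating Characteristic (ROC) Curve with Probabilities (Thresholds) Annotations

Plots ROC curves with threshold annotations for multiple classifiers, using Plotly as the backend.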

from ds_utils.metrics import plot_roc_curve_with_thresholds_annotations

# Map each classifier name to its predicted probabilities for the positive class.
classifiers_names_and_scores_dict = {
    "Decision Tree": tree_clf.predict_proba(X_test)[:, 1],
    "Random Forest": rf_clf.predict_proba(X_test)[:, 1],
    "XGBoost": xgb_clf.predict_proba(X_test)[:, 1]
}
fig = plot_roc_curve_with_thresholds_annotations(
    y_true,
    classifiers_names_and_scores_dict,
    positive_label=1
)
fig.show()
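
For readers who want to see what the threshold annotations refer to, the following dependency-free sketch computes one ROC point per distinct score. The function `roc_points` is hypothetical and written for this illustration; it assumes both classes are present in `y_true` and is not how ds_utils or scikit-learn implement the curve internally.

```python
def roc_points(y_true, scores, positive_label=1):
    """Return (threshold, fpr, tpr) for every distinct score, highest threshold first."""
    positives = sum(1 for label in y_true if label == positive_label)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Every sample scoring at or above the threshold is predicted positive.
        tp = sum(1 for label, s in zip(y_true, scores)
                 if s >= threshold and label == positive_label)
        fp = sum(1 for label, s in zip(y_true, scores)
                 if s >= threshold and label != positive_label)
        points.append((threshold, fp / negatives, tp / positives))
    return points


# Toy data, invented for this example.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
for threshold, fpr, tpr in points:
    print(f"threshold={threshold:.2f}  fpr={fpr:.2f}  tpr={tpr:.2f}")
```

Each printed row corresponds to one annotated point: lowering the threshold moves the point up and to the right along the curve.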

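Precision-Recall Curve with Probabilities (Thresholds) Annotations

Plots Precision-Recall curves with threshold annotations for multiple classifiers, using Plotly as the backend.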

from ds_utils.metrics import plot_precision_recall_curve_with_thresholds_annotations

# Map each classifier name to its predicted probabilities for the positive class.
classifiers_names_and_scores_dict = {
    "Decision Tree": tree_clf.predict_proba(X_test)[:, 1],
    "Random Forest": rf_clf.predict_proba(X_test)[:, 1],
    "XGBoost": xgb_clf.predict_proba(X_test)[:, 1]
}
fig = plot_precision_recall_curve_with_thresholds_annotations(
    y_true,
    classifiers_names_and_scores_dict,
    positive_label=1
)
fig.show()
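
As a rough illustration of the numbers behind such a curve, the sketch below computes precision and recall at every distinct score threshold. The helper `precision_recall_points` is hypothetical and assumes at least one positive sample in `y_true`; the library's own computation may differ in details such as how endpoints are handled.

```python
def precision_recall_points(y_true, scores, positive_label=1):
    """Return (threshold, precision, recall) for every distinct score, lowest first."""
    positives = sum(1 for label in y_true if label == positive_label)
    points = []
    for threshold in sorted(set(scores)):
        # Labels of the samples predicted positive at this threshold (never empty,
        # because the threshold is itself one of the scores).
        predicted = [label for label, s in zip(y_true, scores) if s >= threshold]
        tp = sum(1 for label in predicted if label == positive_label)
        points.append((threshold, tp / len(predicted), tp / positives))
    return points


# Toy data, invented for this example.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pr_points = precision_recall_points(y_true, scores)
for threshold, precision, recall in pr_points:
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```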

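Preprocess

Visualize Feature

Receives a feature and visualizes its values on a graph:

If the feature is a float, the method plots a distribution plot.
If the feature is a datetime, the method plots a line plot of its progression over time.
If the feature is an object, categorical, boolean, or integer, the method plots a count plot (histogram).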

from ds_utils.preprocess import visualize_feature

# Pass a single column; the plot type is chosen from the column's data type.
visualize_feature(X_train["feature"])

[Table: example output plot for each feature type (float, integer, datetime, category/object, boolean)]
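
The type-to-plot rules listed above can be written down as a small lookup. This is only a sketch of the decision logic on plain Python values: the function `choose_plot_kind` is hypothetical, and the library itself inspects the column's data type rather than individual values.

```python
from datetime import datetime


def choose_plot_kind(values):
    """Pick a plot kind from the first non-missing value (hypothetical helper)."""
    sample = next((v for v in values if v is not None), None)
    if isinstance(sample, float):
        return "distribution plot"
    if isinstance(sample, datetime):
        return "line plot"
    # Objects, categories, booleans and integers are all counted per distinct value.
    return "count plot"


print(choose_plot_kind([1.5, 2.25, None]))
print(choose_plot_kind([datetime(2020, 1, 1), datetime(2020, 1, 2)]))
print(choose_plot_kind([True, False, True]))
print(choose_plot_kind([3, 1, 2]))
print(choose_plot_kind(["a", "b", "a"]))
```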

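Get Correlated Features

Calculates which features are correlated above a threshold and extracts a data frame with those correlations and each feature's correlation to the target feature.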

from ds_utils.preprocess import get_correlated_features

# train is the data frame, features the list of feature column names, target the target column name.
correlations = get_correlated_features(train, features, target)

level_0	level_1	level_0_level_1_corr	level_0_target_corr	level_1_target_corr
income_category_low	income_category_medium	1.0	0.1182165609358650	0.11821656093586504
term_ 36 months	term_ 60 months	1.0	0.1182165609358650	0.11821656093586504
ivey_payments_high	ivey_payments_low	1.0	0.1182165609358650	0.11821656093586504
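
To show how a table with these columns can arise, here is a dependency-free sketch that computes Pearson correlations between feature pairs and keeps the pairs above a threshold. The helpers `pearson` and `correlated_pairs`, the use of the absolute value, and the 0.95 default threshold are assumptions made for this illustration; consult the documentation for the library's actual parameters and behaviour.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, features, target, threshold=0.95):
    """List feature pairs whose absolute correlation exceeds the threshold."""
    rows = []
    for first, second in combinations(features, 2):
        corr = pearson(columns[first], columns[second])
        if abs(corr) > threshold:
            rows.append({
                "level_0": first,
                "level_1": second,
                "level_0_level_1_corr": corr,
                "level_0_target_corr": pearson(columns[first], columns[target]),
                "level_1_target_corr": pearson(columns[second], columns[target]),
            })
    return rows


# Toy columns, invented for this example: "b" is exactly twice "a".
columns = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 1, 3, 2],
    "target": [0, 0, 1, 1],
}

rows = correlated_pairs(columns, ["a", "b", "c"], "target")
for row in rows:
    print(row)
```

Perfectly correlated pairs such as the ones in the table above carry the same information about the target, which is why both target correlations in a row are identical.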

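Visualize Correlations

Computes the pairwise correlation of columns, excluding NA/null values, and visualizes it as a heat map. Original code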

from ds_utils.preprocess import visualize_correlations

visualize_correlations(data)

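Plot Correlation Dendrogram

Plots a dendrogram of a correlation matrix. The chart shows hierarchically which variables are most correlated by connecting them in a tree. The closer to the right a connection is, the more correlated the features are. Original code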

from ds_utils.preprocess import plot_correlation_dendrogram

plot_correlation_dendrogram(data)

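Plot Features' Interaction

Plots the joint distribution between two features:

If both features are categorical, boolean, or object, the method plots a shared histogram.
If one feature is categorical, boolean, or object and the other is numeric, the method plots a boxplot chart.
If one feature is a datetime and the other is numeric or a datetime, the method plots a line plot.
If one feature is a datetime and the other is categorical, boolean, or object, the method plots a violin plot (a combination of a boxplot and a kernel density estimate).
If both features are numeric, the method plots a scatter plot.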

from ds_utils.preprocess import plot_features_interaction

# The first two arguments are column names in the data frame passed as the third.
plot_features_interaction("feature_1", "feature_2", data)

[Table: example output plot for each pairing of numeric, categorical, boolean, and datetime features]
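
The pairing rules above can be summarised as a small decision function. This is a sketch of the rules as stated in this document, not the library's code: `choose_interaction_plot` and the kind names it accepts are hypothetical labels chosen for the example.

```python
# Kinds that are treated as discrete in the rules above.
DISCRETE = {"categorical", "boolean", "object"}


def choose_interaction_plot(kind_1, kind_2):
    """Map two feature kinds to a plot; each kind is 'numeric', 'datetime' or a discrete kind."""
    pair = (kind_1, kind_2)
    n_discrete = sum(kind in DISCRETE for kind in pair)
    n_datetime = pair.count("datetime")
    if n_discrete == 2:
        return "shared histogram"
    if n_datetime and n_discrete == 1:
        return "violin plot"
    if n_datetime:
        return "line plot"      # datetime with numeric, or datetime with datetime
    if n_discrete == 1:
        return "box plot"       # discrete with numeric
    return "scatter plot"       # both numeric


for pair in [("numeric", "numeric"), ("object", "numeric"),
             ("datetime", "numeric"), ("datetime", "boolean"),
             ("categorical", "boolean")]:
    print(pair, "->", choose_interaction_plot(*pair))
```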

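Strings

Append Tags to Frame

This method extracts tags from a given field and appends them as new columns to the data frame.

Consider a dataset that looks like this:

x_train:

article_name	article_tags
1	ds,ml,dl
2	ds,ml

x_test:

article_name	article_tags
3	ds,ml,py

Using this code: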

import pandas as pd
from ds_utils.strings import append_tags_to_frame

x_train = pd.DataFrame([{"article_name": "1", "article_tags": "ds,ml,dl"},
                        {"article_name": "2", "article_tags": "ds,ml"}])
x_test = pd.DataFrame([{"article_name": "3", "article_tags": "ds,ml,py"}])

# "article_tags" is the field holding the tags; "tag_" is the prefix for the new columns.
x_train_with_tags, x_test_with_tags = append_tags_to_frame(x_train, x_test, "article_tags", "tag_")

The result will be:

x_train_with_tags:

article_name	tag_ds	tag_ml	tag_dl
1	1	1	1
2	1	1	0

x_test_with_tags:

article_name	tag_ds	tag_ml	tag_dl
3	1	1	0
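
Note that the test frame has no `tag_py` column even though the test article carries the `py` tag. One plausible reading, sketched below in plain Python, is that the set of tag columns is learned from the training frame only, so tags unseen during training are dropped. The helper `append_tags` is hypothetical and works on lists of dictionaries; it reproduces the tables above but is not the library's implementation.

```python
def append_tags(train_rows, test_rows, field, prefix):
    """Expand a comma-separated tag field into 0/1 columns learned from train_rows only."""
    # Collect the tag vocabulary from the training rows, in first-seen order.
    vocabulary = []
    for row in train_rows:
        for tag in row[field].split(","):
            if tag not in vocabulary:
                vocabulary.append(tag)

    def expand(row):
        tags = set(row[field].split(","))
        expanded = {key: value for key, value in row.items() if key != field}
        for tag in vocabulary:
            expanded[prefix + tag] = int(tag in tags)
        return expanded

    return [expand(row) for row in train_rows], [expand(row) for row in test_rows]


train_rows = [{"article_name": "1", "article_tags": "ds,ml,dl"},
              {"article_name": "2", "article_tags": "ds,ml"}]
test_rows = [{"article_name": "3", "article_tags": "ds,ml,py"}]

train_with_tags, test_with_tags = append_tags(train_rows, test_rows, "article_tags", "tag_")
print(train_with_tags)
print(test_with_tags)
```

Keeping the columns identical across both frames means a model fitted on the training frame can be applied to the test frame without realignment.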

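Extract Significant Terms from Subset

This method returns interesting or unusual occurrences of terms in a subset. It is based on the Elasticsearch significant_text aggregation.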

import pandas as pd
from ds_utils.strings import extract_significant_terms_from_subset

corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
data_frame = pd.DataFrame(corpus, columns=["content"])
# Let's differentiate between the last two documents from the full corpus
subset_data_frame = data_frame[data_frame.index > 1]
terms = extract_significant_terms_from_subset(data_frame, subset_data_frame,
                                              "content")

The output for terms will be the following table:

third	one	and	this	the	is	first	document	second
1.0	1.0	1.0	0.67	0.67	0.67	0.5	0.25	0.0

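Unsupervised

Cluster Cardinality

Cluster cardinality is the number of examples per cluster. This method plots the number of points per cluster as a bar chart.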

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from ds_utils.unsupervised import plot_cluster_cardinality

data = pd.read_csv("path/to/dataset")  # placeholder path; point this at your own dataset
estimator = KMeans(n_clusters=8, random_state=42)
estimator.fit(data)

# labels_ holds the cluster index assigned to each row of data.
plot_cluster_cardinality(estimator.labels_)

plt.show()
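
The quantity being plotted is simply a count per cluster label. A minimal sketch with the standard library, using invented labels in place of `estimator.labels_`:

```python
from collections import Counter

# Toy cluster assignments, invented for this example.
labels = [0, 1, 1, 2, 1, 0]

# Cardinality: how many points were assigned to each cluster.
cardinality = Counter(labels)
for cluster in sorted(cardinality):
    print(f"cluster {cluster}: {cardinality[cluster]} points")
```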

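Plot Cluster Magnitude

Cluster magnitude is the sum of the distances from all examples in a cluster to the cluster's centroid. This method plots the total point-to-centroid distance per cluster as a bar chart.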

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean
from ds_utils.unsupervised import plot_cluster_magnitude

data = pd.read_csv("path/to/dataset")  # placeholder path; point this at your own dataset
estimator = KMeans(n_clusters=8, random_state=42)
estimator.fit(data)

# euclidean is the distance function used between each point and its centroid.
plot_cluster_magnitude(data, estimator.labels_, estimator.cluster_centers_, euclidean)

plt.show()
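
The definition above translates directly into a short loop. The sketch below uses `math.dist` (Euclidean distance, Python 3.8+) on invented points; the function name `cluster_magnitudes` is hypothetical and stands in for the computation behind the bar chart.

```python
import math


def cluster_magnitudes(points, labels, centroids):
    """Sum the distance from every point to the centroid of its assigned cluster."""
    magnitudes = [0.0] * len(centroids)
    for point, label in zip(points, labels):
        magnitudes[label] += math.dist(point, centroids[label])
    return magnitudes


# Toy data, invented for this example: two clusters of two points each.
points = [(0, 0), (0, 2), (10, 10), (10, 14)]
labels = [0, 0, 1, 1]
centroids = [(0, 1), (10, 12)]

magnitudes = cluster_magnitudes(points, labels, centroids)
print(magnitudes)
```

Both clusters here have the same cardinality, yet the second has twice the magnitude because its points lie farther from their centroid.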

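Magnitude vs. Cardinality

Higher cluster cardinality tends to result in higher cluster magnitude, which intuitively makes sense. A cluster is considered anomalous when its cardinality does not correlate with its magnitude relative to the other clusters. This method helps find anomalous clusters by plotting magnitude against cardinality as a scatter plot.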

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean
from ds_utils.unsupervised import plot_magnitude_vs_cardinality

data = pd.read_csv("path/to/dataset")  # placeholder path; point this at your own dataset
estimator = KMeans(n_clusters=8, random_state=42)
estimator.fit(data)

plot_magnitude_vs_cardinality(data, estimator.labels_, estimator.cluster_centers_, euclidean)

plt.show()

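Optimum Number of Clusters

K-means clustering requires you to decide the number of clusters k beforehand. This method runs the KMeans algorithm repeatedly, increasing the number of clusters at each iteration. The total magnitude, or sum of distances, is used as the loss metric.

Note: Currently, this method only works with sklearn.cluster.KMeans.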

import pandas as pd
from matplotlib import pyplot as plt
from scipy.spatial.distance import euclidean
from ds_utils.unsupervised import plot_loss_vs_cluster_number

data = pd.read_csv("path/to/dataset")  # placeholder path; point this at your own dataset

# 3 and 20 bound the range of cluster numbers to try.
plot_loss_vs_cluster_number(data, 3, 20, euclidean)

plt.show()

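XAI (Explainable AI)

Plot Feature Importance

This method plots feature importance as a bar chart, helping to visualize which features have the most significant impact on the model's decisions.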

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from ds_utils.xai import plot_features_importance

# Load the dataset
data = pd.read_csv("path/to/dataset")  # placeholder path; point this at your own dataset
target = data["target"]
features = data.columns.tolist()
features.remove("target")

# Train a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(data[features], target)

# Plot feature importance
plot_features_importance(features, clf.feature_importances_)

plt.show()

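This visualization helps in understanding which features are most influential in the model's decision-making process, providing valuable insights for feature selection and model interpretation.

Explore More

Excited about what you've seen so far? There's even more to discover! Dive deeper into each module to unlock the full potential of DataScienceUtils:

Metrics - Powerful methods for calculating and visualizing algorithm performance evaluation. Gain insights into how your models are performing.
Preprocess - Essential data preprocessing techniques to prepare your data for training. Improve your model's input for better results.
Strings - Efficient methods for manipulating and processing strings in data frames. Handle text data with ease.
Unsupervised - Tools for calculating and visualizing the performance of unsupervised models. Understand your clustering and dimensionality reduction results better.
XAI - Methods to help explain model decisions, making your AI more interpretable and trustworthy.

Each module is designed to streamline your data science workflow, providing the tools you need to preprocess data, train models, evaluate performance, and interpret results. Check out the detailed documentation for each module to see how DataScienceUtils can enhance your projects!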
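
When no plotting backend is at hand, the same ranking can be inspected as text. The sketch below pairs feature names with importance values and sorts them; the feature names and numbers are invented, and `rank_features` is a hypothetical helper rather than part of ds_utils.

```python
def rank_features(names, importances):
    """Return (name, importance) pairs sorted from most to least important."""
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Toy values standing in for `features` and `clf.feature_importances_`.
feature_names = ["age", "income", "tenure"]
importances = [0.2, 0.7, 0.1]

ranked = rank_features(feature_names, importances)
for name, importance in ranked:
    # Draw a crude horizontal bar, 20 characters wide at importance 1.0.
    print(f"{name:<8} {'#' * round(importance * 20)} {importance:.2f}")
```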

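Contributing

We're thrilled that you're interested in contributing to Data Science Utils! Your contributions help make this project better for everyone. Whether you're a seasoned developer or just getting started, there's a place for you here.

How to Contribute

Find an area to contribute to: Check our issues page for open tasks, or think of a feature you'd like to add.
Fork the repository: Make your own copy of the project to work on.
Create a branch: Make your changes in a new Git branch.
Make your changes: Add your improvements or fixes. We appreciate:
- Bug reports and fixes
- Feature requests and implementations
- Documentation improvements
- Performance optimizations
- User experience enhancements
Test your changes: Ensure your code works as expected and doesn't introduce new issues.
Submit a pull request: Open a PR with a clear title and a description of your changes.

Coding Guidelines

We follow the Python Software Foundation Code of Conduct and the Matplotlib Usage Guide. Please adhere to these guidelines in your contributions.

Getting Help

If you're new to open source or need any help, don't hesitate to ask questions in the issues section or reach out to the maintainers. We're here to help!

Why Contribute?

Improve your skills: Gain experience working on a real-world project.
Be part of a community: Connect with other developers and data scientists.
Make a difference: Your contributions will help others in their data science journey.
Get recognition: All contributors are acknowledged in our project.

Remember, no contribution is too small. Whether it's fixing a typo in the documentation or adding a major feature, all contributions are valued and appreciated.

Thank you for helping make Data Science Utils better for everyone!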

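Installation Guide

Here are several ways to install the package:

1. Install from PyPI (recommended)

The simplest way to install Data Science Utils and its dependencies is with pip, Python's preferred package installer: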

pip install data-science-utils

To upgrade Data Science Utils to the latest version, use:

pip install -U data-science-utils

2. Install from source

If you prefer to install from source, you can clone the repository and install it:

git clone https://github.com/idanmoradarthas/DataScienceUtils.git
cd DataScienceUtils
pip install .

Alternatively, you can install directly from GitHub using pip:

pip install git+https://github.com/idanmoradarthas/DataScienceUtils.git

3. Install using Anaconda

If you are using Anaconda, you can install with conda:

conda install idanmorad::data-science-utils

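A Note on Dependencies

Data Science Utils has several dependencies, including NumPy, pandas, Matplotlib, Plotly, and scikit-learn. These are installed automatically when you install the package using any of the methods above.

Staying Updated

Data Science Utils is an active project that routinely publishes new releases with additional methods and improvements. We recommend checking for updates periodically to get the latest features and bug fixes.

If you run into any problems during installation, please check our GitHub issues page or open a new issue for assistance.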
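
One way to see which release is currently installed before upgrading is to query the package metadata from Python. This sketch uses the standard library's `importlib.metadata` (Python 3.8+) with the distribution name from the pip command above; the helper `installed_version` is a hypothetical convenience wrapper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


current = installed_version("data-science-utils")
print(current or "data-science-utils is not installed")
```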
