MLstatkit is a Python library that integrates established statistical methods into machine learning projects. It provides Delong's test for comparing the areas under two correlated Receiver Operating Characteristic (ROC) curves; Bootstrapping for calculating confidence intervals around performance metrics; AUC2OR for converting the Area Under the ROC Curve (AUC) into related effect-size statistics such as Cohen's d, Pearson's r_pb, the odds-ratio, and the natural log odds-ratio; and Permutation_test for assessing the statistical significance of the difference between two models' metrics by randomly shuffling predictions and recomputing the metric to build a distribution of differences. With its modular design, MLstatkit gives researchers and data scientists a flexible, powerful toolkit for augmenting their analyses and model evaluations across a broad range of statistical testing needs in machine learning.
Install MLstatkit directly from PyPI using pip:
pip install MLstatkit

The Delong_test function enables a statistical evaluation of the difference between the areas under two correlated Receiver Operating Characteristic (ROC) curves derived from distinct models, providing a deeper understanding of comparative model performance.
Parameters:

true : array-like of shape (n_samples,)
    True binary labels in range {0, 1}.
prob_A : array-like of shape (n_samples,)
    Predicted probabilities by the first model.
prob_B : array-like of shape (n_samples,)
    Predicted probabilities by the second model.

Returns:

z_score : float
    The z-score from comparing the AUCs of the two models.
p_value : float
    The p-value from comparing the AUCs of the two models.
import numpy as np
from MLstatkit.stats import Delong_test
# Example data
true = np.array([0, 1, 0, 1])
prob_A = np.array([0.1, 0.4, 0.35, 0.8])
prob_B = np.array([0.2, 0.3, 0.4, 0.7])
# Perform DeLong's test
z_score, p_value = Delong_test(true, prob_A, prob_B)
print(f"Z-Score: {z_score}, P-Value: {p_value}")This demonstrates the usage of Delong_test to statistically compare the AUCs of two models based on their probabilities and the true labels. The returned z-score and p-value help in understanding if the difference in model performances is statistically significant.
The Bootstrapping function calculates confidence intervals for specified performance metrics using bootstrapping, providing a measure of the estimation's reliability. It supports calculation for AUROC (area under the ROC curve), AUPRC (area under the precision-recall curve), and F1 score metrics.
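Conceptually, the bootstrap resamples the data with replacement many times, recomputes the metric on each resample, and reports a percentile interval over the resampled scores. Below is a minimal sketch of that general idea (not MLstatkit's internal implementation); usage of the library's own Bootstrapping function follows.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.3, 0.4, 0.7, 0.05])

scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))  # resample indices with replacement
    if len(np.unique(y_true[idx])) < 2:              # skip resamples that contain only one class
        continue
    scores.append(roc_auc_score(y_true[idx], y_prob[idx]))
lower, upper = np.percentile(scores, [2.5, 97.5])    # 95% percentile confidence interval
print(f"Bootstrapped AUROC 95% CI: [{lower:.3f} - {upper:.3f}]")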
import numpy as np
from MLstatkit.stats import Bootstrapping
# Example data
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.3, 0.4, 0.7, 0.05])
# Calculate confidence intervals for AUROC
original_score, confidence_lower, confidence_upper = Bootstrapping(y_true, y_prob, 'roc_auc')
print(f"AUROC: {original_score:.3f}, Confidence interval: [{confidence_lower:.3f} - {confidence_upper:.3f}]")
# Calculate confidence intervals for AUPRC
original_score, confidence_lower, confidence_upper = Bootstrapping(y_true, y_prob, 'pr_auc')
print(f"AUPRC: {original_score:.3f}, Confidence interval: [{confidence_lower:.3f} - {confidence_upper:.3f}]")
# Calculate confidence intervals for F1 score with a custom threshold
original_score, confidence_lower, confidence_upper = Bootstrapping(y_true, y_prob, 'f1', threshold=0.5)
print(f"F1 Score: {original_score:.3f}, Confidence interval: [{confidence_lower:.3f} - {confidence_upper:.3f}]")
# Calculate confidence intervals for AUROC, AUPRC, F1 score
for score in ['roc_auc', 'pr_auc', 'f1']:
    original_score, conf_lower, conf_upper = Bootstrapping(y_true, y_prob, score, threshold=0.5)
print(f"{score.upper()} original score: {original_score:.3f}, confidence interval: [{conf_lower:.3f} - {conf_upper:.3f}]")The Permutation_test function assesses the statistical significance of the difference between two models' metrics by randomly shuffling the data and recalculating the metrics to create a distribution of differences. This method does not assume a specific distribution of the data, making it a robust choice for comparing model performance.
import numpy as np
from MLstatkit.stats import Permutation_test

# Example data
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0])
prob_model_A = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.3, 0.4, 0.7, 0.05])
prob_model_B = np.array([0.2, 0.3, 0.25, 0.85, 0.15, 0.35, 0.45, 0.65, 0.01])
# Conduct a permutation test to compare F1 scores
metric_a, metric_b, p_value, benchmark, samples_mean, samples_std = Permutation_test(
    y_true, prob_model_A, prob_model_B, 'f1'
)
print(f"F1 Score Model A: {metric_a:.5f}, Model B: {metric_b:.5f}")
print(f"Observed Difference: {benchmark:.5f}, p-value: {p_value:.5f}")
print(f"Permuted Differences Mean: {samples_mean:.5f}, Std: {samples_std:.5f}")The AUC2OR function converts an Area Under the Curve (AUC) value to an Odds Ratio (OR) and optionally returns intermediate values such as t, z, d, and ln_OR. This conversion is useful for understanding the relationship between AUC, a common metric in binary classification, and OR, which is often used in statistical analyses.
from MLstatkit.stats import AUC2OR
AUC = 0.7 # Example AUC value
# Convert AUC to OR and retrieve all intermediate values
t, z, d, ln_OR, OR = AUC2OR(AUC, return_all=True)
print(f"t: {t:.5f}, z: {z:.5f}, d: {d:.5f}, ln_OR: {ln_OR:.5f}, OR: {OR:.5f}")
# Convert AUC to OR without intermediate values
OR = AUC2OR(AUC)
print(f"OR: {OR:.5f}")The implementation of Delong_test in MLstatkit is based on the following publication:
The Bootstrapping method for calculating confidence intervals does not directly reference a single publication but is a widely accepted statistical technique for estimating the distribution of a metric by resampling with replacement. For a comprehensive overview of bootstrapping methods, see:
Permutation tests assess the significance of the difference in performance metrics between two models by randomly reallocating observations between groups and recomputing the metric. Because this approach makes no specific distributional assumptions, it is versatile across data types. For a foundational discussion on permutation tests, refer to:
These references lay the groundwork for the statistical tests and methodologies implemented in MLstatkit, providing users with a deep understanding of their scientific basis and applicability.
The AUC2OR function converts the Area Under the Receiver Operating Characteristic Curve (AUC) into several related statistics, including Cohen's d, Pearson's r_pb, the odds-ratio, and the natural log odds-ratio. This conversion is particularly useful for interpreting the performance of classification models. For a detailed explanation of the mathematical formulas used in this conversion, refer to:
These references provide the mathematical foundation for the AUC2OR function, ensuring that users can accurately interpret the statistical significance and practical implications of their model performance metrics.
We welcome contributions to MLstatkit! Please see our contribution guidelines for more details.
MLstatkit is distributed under the MIT License. For more information, see the LICENSE file in the GitHub repository.
0.1.7   Update README.md.
0.1.6   Debug.
0.1.5   Update README.md, Add AUC2OR function.
0.1.4   Update README.md, Add Permutation_tests function, Re-do Bootstrapping Parameters.
0.1.3   Update README.md.
0.1.2   Add Bootstrapping operation process progress display.
0.1.1   Update README.md, setup.py. Add CONTRIBUTING.md.
0.1.0   First edition.