Data Profiler | What's in your data?

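DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.

Loading data takes a single command: the library automatically formats the file and loads it into a DataFrame. When profiling the data, the library identifies the schema, statistics, entities (PII / NPI), and more. The resulting data profiles can then be used in downstream applications or reports.

Getting started takes only a few lines of code (example CSV):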

import json
from dataprofiler import Data, Profiler

data = Data("your_file.csv")  # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL

print(data.data.head(5))  # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data)  # Calculate Statistics, Entity Recognition, etc

readable_report = profile.report(report_options={"output_format": "compact"})

print(json.dumps(readable_report, indent=4))

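Note: The Data Profiler comes with a pre-trained deep learning model used to efficiently identify sensitive data (PII / NPI). If desired, it is easy to add new entities to the existing pre-trained model or to plug in an entirely new entity recognition pipeline.

For API documentation, visit the documentation page.

If you have suggestions or find a bug, please open an issue.

If you want to contribute, visit the contributing page.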

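Install

To install the full package from PyPI: pip install DataProfiler[full]

To install the ML dependencies without report generation, use DataProfiler[ml]

If the ML requirements are too strict (say, you don't want to install TensorFlow), you can install a slimmer package with DataProfiler[reports]. The slimmer package disables the default sensitive data detection / entity recognition (labeler).

Install from PyPI: pip install DataProfiler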
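
Which optional dependencies are present can be checked from Python before profiling. The sketch below uses only the standard library. The helper name has_module is hypothetical, and treating an importable tensorflow as a sign that the ML extras are installed is an assumption based on the note above, not a documented check.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(name) is not None


# Assumption: the default labeler relies on TensorFlow, as the install note suggests.
for module in ("dataprofiler", "tensorflow"):
    status = "available" if has_module(module) else "missing"
    print(f"{module}: {status}")
```

If tensorflow is reported as missing, sensitive data detection is likely to be unavailable until the ml or full extra is installed.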

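What is a Data Profile?

In the context of this library, a data profile is a dictionary containing statistics and predictions about the underlying dataset. It has "global statistics", or global_stats, which hold dataset-level data, and "column/row level statistics", or data_stats (each column is a new key-value entry).

The format of a structured profile is shown below:

"global_stats": {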
    "samples_used": int,
    "column_count": int,
    "row_count": int,
    "row_has_null_ratio": float,
    "row_is_null_ratio": float,
    "unique_row_ratio": float,
    "duplicate_row_count": int,
    "file_type": string,
    "encoding": string,
    "correlation_matrix": list[list[int]], (*)
    "chi2_matrix": list[list[float]],
    "profile_schema": {
        string: list[int]
    },
    "times": dict[string, float],
},
"data_stats": [
    {
        "column_name": string,
        "data_type": string,
        "data_label": string,
        "categorical": bool,
        "order": string,
        "samples": list[str],
        "statistics": {
            "sample_size": int,
            "null_count": int,
            "null_types": list[string],
            "null_types_index": {
                string: list[int]
            },
            "data_type_representation": dict[string, float],
            "min": [null, float, str],
            "max": [null, float, str],
            "mode": float,
            "median": float,
            "median_absolute_deviation": float,
            "sum": float,
            "mean": float,
            "variance": float,
            "stddev": float,
            "skewness": float,
            "kurtosis": float,
            "num_zeros": int,
            "num_negatives": int,
            "histogram": {
                "bin_counts": list[int],
                "bin_edges": list[float],
            },
            "quantiles": {
                int: float
            },
            "vocab": list[char],
            "avg_predictions": dict[string, float],
            "data_label_representation": dict[string, float],
            "categories": list[str],
            "unique_count": int,
            "unique_ratio": float,
            "categorical_count": dict[string, int],
            "gini_impurity": float,
            "unalikeability": float,
            "precision": {
                'min': int,
                'max': int,
                'mean': float,
                'var': float,
                'std': float,
                'sample_size': int,
                'margin_of_error': float,
                'confidence_level': float
            },
            "times": dict[string, float],
            "format": string
        },
        "null_replication_metrics": {
            "class_prior": list[int],
            "class_sum": list[list[int]],
            "class_mean": list[list[int]]
        }
    }
]

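(*) The correlation matrix update is currently toggled off. It will be reset in a later update. Users can still use it as desired by setting the is_enable option to True.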
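
Because a structured report is a plain dictionary, it can be traversed with ordinary Python. The following is a minimal sketch under stated assumptions: summarize_columns is a hypothetical helper, and sample_report is a hand-written dictionary that follows the layout above with made-up values rather than real profiler output.

```python
def summarize_columns(report: dict) -> list[dict]:
    """Collect a few per-column fields from a structured report."""
    rows = []
    for column in report["data_stats"]:
        stats = column["statistics"]
        rows.append(
            {
                "column": column["column_name"],
                "type": column["data_type"],
                "label": column["data_label"],
                "null_ratio": stats["null_count"] / stats["sample_size"],
            }
        )
    return rows


# Hand-written sample following the documented layout; the values are made up.
sample_report = {
    "global_stats": {"column_count": 2, "row_count": 4},
    "data_stats": [
        {
            "column_name": "age",
            "data_type": "int",
            "data_label": "INTEGER",
            "statistics": {"sample_size": 4, "null_count": 1},
        },
        {
            "column_name": "email",
            "data_type": "string",
            "data_label": "EMAIL_ADDRESS",
            "statistics": {"sample_size": 4, "null_count": 0},
        },
    ],
}

for row in summarize_columns(sample_report):
    print(row)
```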

The format of an unstructured profile is shown below:

"global_stats": {
    "samples_used": int,
    "empty_line_count": int,
    "file_type": string,
    "encoding": string,
    "memory_size": float, # in MB
    "times": dict[string, float],
},
"data_stats": {
    "data_label": {
        "entity_counts": {
            "word_level": dict[string, int],
            "true_char_level": dict[string, int],
            "postprocess_char_level": dict[string, int]
        },
        "entity_percentages": {
            "word_level": dict[string, float],
            "true_char_level": dict[string, float],
            "postprocess_char_level": dict[string, float]
        },
        "times": dict[string, float]
    },
    "statistics": {
        "vocab": list[char],
        "vocab_count": dict[string, int],
        "words": list[string],
        "word_count": dict[string, int],
        "times": dict[string, float]
    }
}
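
As an illustration of reading this layout, the sketch below ranks entities by their share of the text. The helper top_entities and the dictionary sample_text_report are hypothetical; the numbers are invented and only mirror the nesting shown above.

```python
def top_entities(report: dict, level: str = "word_level", n: int = 3) -> list[tuple[str, float]]:
    """Return the n labels with the highest percentage at the given level."""
    percentages = report["data_stats"]["data_label"]["entity_percentages"][level]
    ranked = sorted(percentages.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hand-written sample following the documented layout; the values are made up.
sample_text_report = {
    "data_stats": {
        "data_label": {
            "entity_percentages": {
                "word_level": {"UNKNOWN": 0.7, "PERSON": 0.2, "EMAIL_ADDRESS": 0.1},
            }
        }
    }
}

print(top_entities(sample_text_report, n=2))
```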

The format of a graph profile is shown below:

"num_nodes": int,
"num_edges": int,
"categorical_attributes": list[string],
"continuous_attributes": list[string],
"avg_node_degree": float,
"global_max_component_size": int,
"continuous_distribution": {
    "<attribute_1>": {
        "name": string,
        "scale": float,
        "properties": list[float, np.array]
    },
    "<attribute_2>": None,
    ...
},
"categorical_distribution": {
    "<attribute_1>": None,
    "<attribute_2>": {
        "bin_counts": list[int],
        "bin_edges": list[float]
    },
    ...
},
"times": dict[string, float]

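In the graph layout above, an attribute maps to None in the distribution it does not belong to. A small sketch of filtering on that, assuming a hand-written dictionary: fitted_distributions and sample_graph_profile are hypothetical names, and the distribution name and numbers are invented.

```python
def fitted_distributions(profile: dict) -> dict[str, str]:
    """Map each attribute with a continuous fit to the name of its distribution."""
    return {
        attribute: details["name"]
        for attribute, details in profile["continuous_distribution"].items()
        if details is not None
    }


# Hand-written sample following the documented layout; the values are made up.
sample_graph_profile = {
    "num_nodes": 4,
    "num_edges": 3,
    "categorical_attributes": ["kind"],
    "continuous_attributes": ["weight"],
    "continuous_distribution": {
        "weight": {"name": "norm", "scale": -12.5, "properties": [0.0, 1.0]},
        "kind": None,
    },
    "categorical_distribution": {
        "weight": None,
        "kind": {"bin_counts": [2, 1], "bin_edges": [0.0, 0.5, 1.0]},
    },
}

print(fitted_distributions(sample_graph_profile))
```
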
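Profile Statistic Descriptions

Structured Profile

global_stats:

samples_used - number of input data samples used to generate this profile
column_count - the number of columns contained in the input dataset
row_count - the number of rows contained in the input dataset
row_has_null_ratio - the proportion of rows that contain at least one null value to the total number of rows
row_is_null_ratio - the proportion of rows that consist entirely of null values (null rows) to the total number of rows
unique_row_ratio - the proportion of distinct rows in the input dataset to the total number of rows
duplicate_row_count - the number of rows that occur more than once in the input dataset
file_type - the format of the file containing the input dataset (ex: .csv)
encoding - the encoding of the file containing the input dataset (ex: UTF-8)
correlation_matrix - matrix of shape column_count x column_count containing the correlation coefficients between each pair of columns in the dataset
chi2_matrix - matrix of shape column_count x column_count containing the chi-square statistics between each pair of columns in the dataset
profile_schema - a description of the format of the input dataset, labeling each column and its index in the dataset
- string - the label of the column in question and its index in the profile schema
times - the duration of time it took to generate the global statistics for this dataset, in milliseconds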
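
To make the ratio definitions above concrete, here is a small standard-library sketch that computes them for four rows. It illustrates the definitions only and is not the library's implementation: row_ratios is a hypothetical helper, and it treats None as the only null value, whereas the profiler recognizes several null types.

```python
def row_ratios(rows: list[tuple]) -> dict:
    """Compute row-level ratios as defined in the global_stats descriptions."""
    total = len(rows)
    has_null = sum(1 for row in rows if any(value is None for value in row))
    is_null = sum(1 for row in rows if all(value is None for value in row))
    distinct = len(set(rows))
    return {
        "row_count": total,
        "row_has_null_ratio": has_null / total,
        "row_is_null_ratio": is_null / total,
        "unique_row_ratio": distinct / total,
    }


rows = [(1, "a"), (1, "a"), (2, None), (None, None)]
print(row_ratios(rows))
```

Two of the four rows contain a null, one row is entirely null, and three rows are distinct, which gives ratios of 0.5, 0.25, and 0.75.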

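data_stats:

column_name - the label/title of this column in the input dataset
data_type - the primitive Python data type contained within this column
data_label - the label/entity of the data in this column as determined by the Labeler component
categorical - 'true' if this column contains categorical data
order - the way in which the data in this column is ordered, if any
samples - a small subset of data entries from this column
statistics - statistical information on the column
- sample_size - number of input data samples used to generate this profile
- null_count - the number of null entries in the sample
- null_types - a list of the different null types present within this sample
- null_types_index - a dict mapping each null type to a list of the indices at which it is present within this sample
- data_type_representation - the percentage of samples identified as each data_type
- min - minimum value in the sample
- max - maximum value in the sample
- mode - mode of the entries in the sample
- median - median of the entries in the sample
- median_absolute_deviation - the median absolute deviation of the entries in the sample
- sum - the total of all sampled values from the column
- mean - the average of all entries in the sample
- variance - the variance of all entries in the sample
- stddev - the standard deviation of all entries in the sample
- skewness - the statistical skewness of all entries in the sample
- kurtosis - the statistical kurtosis of all entries in the sample
- num_zeros - the number of entries in this sample that have the value 0
- num_negatives - the number of entries in this sample that have a value less than 0
- histogram - contains histogram-relevant information
  - bin_counts - the number of entries within each bin
  - bin_edges - the thresholds of each bin
- quantiles - the value at each percentile, in the order listed, based on the entries in the sample
- vocab - a list of the characters used within the entries in this sample
- avg_predictions - average of the data label prediction confidences across all data points sampled
- categories - a list of each distinct category within the sample if categorical = 'true'
- unique_count - the number of distinct entries in the sample
- unique_ratio - the proportion of the number of distinct entries in the sample to the total number of entries in the sample
- categorical_count - the number of entries sampled for each category if categorical = 'true'
- gini_impurity - a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset
- unalikeability - a value denoting how frequently entries differ from one another within the sample
- precision - a dict of statistics on the number of digits in each number sampled
- times - the duration of time it took to generate this sample's statistics, in milliseconds
- format - list of possible datetime formats
null_replication_metrics - statistics of the data partitioned by whether the column value is null (index 1 of the lists referenced by the dict keys) or not (index 0)
- class_prior - a list containing the probabilities of a column value being null and not null
- class_sum - a list containing the sum of all other rows based on whether the column value is null or not
- class_mean - a list containing the mean of all other rows based on whether the column value is null or not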
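
Two of the categorical statistics, gini_impurity and unalikeability, are easier to grasp with numbers. The sketch below uses their textbook definitions on a four-element sample. That the library computes them in exactly this way is an assumption, and both function names are chosen here for illustration.

```python
from collections import Counter


def gini_impurity(values: list) -> float:
    """One minus the sum of squared category proportions."""
    total = len(values)
    counts = Counter(values)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def unalikeability(values: list) -> float:
    """Share of ordered pairs of distinct positions whose values differ."""
    total = len(values)
    if total < 2:
        return 0.0
    counts = Counter(values)
    differing_pairs = sum(count * (total - count) for count in counts.values())
    return differing_pairs / (total * (total - 1))


sample = ["a", "a", "b", "c"]
print(gini_impurity(sample))
print(unalikeability(sample))
```

For this sample the proportions are 0.5, 0.25, and 0.25, so the Gini impurity is 1 - 0.375 = 0.625, and 10 of the 12 ordered pairs differ, so the unalikeability is 10/12.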

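Unstructured Profile

global_stats:

samples_used - number of input data samples used to generate this profile
empty_line_count - the number of empty lines in the input data
file_type - the file type of the input data (ex: .txt)
encoding - the file encoding of the input data file (ex: UTF-8)
memory_size - size of the input data in MB
times - the duration of time it took to generate this profile, in milliseconds

data_stats:

data_label - labels and statistics on the labels of the input data
- entity_counts - the number of times a specific label or entity appears in the input data
  - word_level - the number of words counted within each label or entity
  - true_char_level - the number of characters counted within each label or entity as determined by the model
  - postprocess_char_level - the number of characters counted within each label or entity as determined by the postprocessor
- entity_percentages - the percentage of each label or entity within the input data
  - word_level - the percentage of words in the input data that are contained within each label or entity
  - true_char_level - the percentage of characters in the input data that are contained within each label or entity as determined by the model
  - postprocess_char_level - the percentage of characters in the input data that are contained within each label or entity as determined by the postprocessor
- times - the duration of time it took for the data labeler to predict on the data
statistics - statistics of the input data
- vocab - a list of each character in the input data
- vocab_count - the number of occurrences of each distinct character in the input data
- words - a list of each word in the input data
- word_count - the number of occurrences of each distinct word in the input data
- times - the duration of time it took to generate the vocab and word statistics, in milliseconds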

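Graph Profile

num_nodes - number of nodes in the graph
num_edges - number of edges in the graph
categorical_attributes - list of categorical edge attributes
continuous_attributes - list of continuous edge attributes
avg_node_degree - average degree of the nodes in the graph
global_max_component_size - size of the global max component

continuous_distribution:

<attribute_N> - name of the N-th edge attribute in the list of attributes
- name - name of the distribution for the attribute
- scale - negative log likelihood used to scale and compare distributions
- properties - list of statistical properties describing the distribution
  - [shape (optional), loc, scale, mean, variance, skew, kurtosis]

categorical_distribution:

<attribute_N> - name of the N-th edge attribute in the list of attributes
- bin_counts - counts in each bin of the distribution histogram
- bin_edges - edges of each bin of the distribution histogram
times - the duration of time it took to generate this profile, in milliseconds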

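Support

Supported Data Formats

Any delimited file (CSV, TSV, etc.)
JSON object
Avro file
Parquet file
Text file
Pandas DataFrame
A URL that points to one of the supported file types above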

Data Types

Data types are determined at the column level for structured data.

Int
Float
String
DateTime

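Data Labels

Data labels are determined per cell for structured data (per column/row when the profiler is used) or at the character level for unstructured data.

UNKNOWN
ADDRESS
BAN (bank account number, 10-18 digits)
CREDIT_CARD
EMAIL_ADDRESS
UUID
HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
IPV4
IPV6
MAC_ADDRESS
PERSON
PHONE_NUMBER
SSN
URL
US_STATE
DRIVERS_LICENSE
DATE
TIME
DATETIME
INTEGER
FLOAT
QUANTITY
ORDINAL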

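Get Started

Load a File

The Data Profiler can profile the following data/file types:

CSV file (or any delimited file)
JSON object
Avro file
Parquet file
Text file
Pandas DataFrame
A URL that points to one of the supported file types above

The profiler should automatically identify the file type and load the data into a Data class.

Along with other attributes, the Data class makes the data accessible as a valid Pandas DataFrame.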

from dataprofiler import Data

# Load a csv file, return a CSVData object
csv_data = Data('your_file.csv')

# Print the first 10 rows of the csv file
print(csv_data.data.head(10))

# Load a parquet file, return a ParquetData object
parquet_data = Data('your_file.parquet')

# Sort the data by the name column
parquet_data.data.sort_values(by='name', inplace=True)

# Print the sorted first 10 rows of the parquet data
print(parquet_data.data.head(10))

# Load a json file from a URL, return a JSONData object
json_data = Data('https://github.com/capitalone/DataProfiler/blob/main/dataprofiler/tests/data/json/iris-utf-8.json')

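If the file type is not automatically identified (rare), you can specify it explicitly; see the section "Specifying a Filetype or Delimiter".

Profile a File

The example uses a CSV file, but JSON, Avro, Parquet, or text files work as well.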

import json
from dataprofiler import Data, Profiler

# Load file (CSV should be automatically identified)
data = Data("your_file.csv")

# Profile the dataset
profile = Profiler(data)

# Generate a report and use json to prettify.
report = profile.report(report_options={"output_format": "pretty"})

# Print the report
print(json.dumps(report, indent=4))

Updating Profiles

Currently, the Data Profiler can update its profile in batches.

import json
from dataprofiler import Data, Profiler

# Load and profile a CSV file
data = Data("your_file.csv")
profile = Profiler(data)

# Update the profile with new data:
new_data = Data("new_data.csv")
profile.update_profile(new_data)

# Print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

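Note that if the data you update the profile with contains integer indices that overlap with the indices of the data originally profiled, the indices are "shifted" to unoccupied values when null rows are calculated, so that null counts and ratios remain accurate.

Merging Profiles

If you have two files with the same schema (but different data), you can merge the two profiles with the addition operator.

This also allows profiles to be computed in a distributed manner.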

import json
from dataprofiler import Data, Profiler

# Load a CSV file with a schema
data1 = Data("file_a.csv")
profile1 = Profiler(data1)

# Load another CSV file with the same schema
data2 = Data("file_b.csv")
profile2 = Profiler(data2)

profile3 = profile1 + profile2

# Print the report using json to prettify.
report = profile3.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

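Note that if the merged profiles had overlapping integer indices, the indices are "shifted" to unoccupied values when null rows are calculated, so that null counts and ratios remain accurate.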
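
Merging through + suits distributed work because partial results can be combined in any grouping. The sketch below shows the idea with a hypothetical stand-in class, RunningStats, that tracks only a count, a sum, a minimum, and a maximum. It is not a DataProfiler class, and real profiles carry far more state. The assumption is that the same reduce pattern applies to real profiles, since they support the addition operator.

```python
from dataclasses import dataclass
from functools import reduce
from operator import add


@dataclass
class RunningStats:
    """Hypothetical stand-in for a profile: statistics that can be merged."""

    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    @classmethod
    def from_values(cls, values):
        values = list(values)
        if not values:
            return cls()
        return cls(len(values), float(sum(values)), min(values), max(values))

    def __add__(self, other):
        # Each field merges independently, so the grouping of chunks does not matter.
        return RunningStats(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


# Each chunk could be summarized on a different worker before merging.
chunks = [[1, 2, 3], [10, 20], [4]]
partials = [RunningStats.from_values(chunk) for chunk in chunks]
merged = reduce(add, partials)
print(merged, merged.mean)
```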

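Profiler Differences

To find the changes between profiles with the same schema, we can use the profile's diff function. The diff provides overall file and sampling differences as well as detailed differences in the data's statistics. For example, numerical columns have a t-test applied to evaluate similarity, and PSI (Population Stability Index) to quantify column distribution shift. More information is given in the Profiler section of the GitHub Pages documentation.

Create the difference report like this: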

import json
import dataprofiler as dp

# Load a CSV file
data1 = dp.Data("file_a.csv")
profile1 = dp.Profiler(data1)

# Load another CSV file
data2 = dp.Data("file_b.csv")
profile2 = dp.Profiler(data2)

diff_report = profile1.diff(profile2)
print(json.dumps(diff_report, indent=4))
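
For readers unfamiliar with PSI, the sketch below computes it from two histograms that share the same bins, using the common formula: the sum over bins of (actual share - expected share) * ln(actual share / expected share). This is a standalone illustration. The function name, the epsilon clamp for empty bins, and the sample counts are assumptions, and the library's own binning and edge-case handling may differ.

```python
import math


def population_stability_index(expected_counts, actual_counts, epsilon=1e-6):
    """PSI between two histograms defined over identical bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Clamp empty bins to a small share so the logarithm stays defined.
        expected_share = max(expected / expected_total, epsilon)
        actual_share = max(actual / actual_total, epsilon)
        psi += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return psi


baseline = [10, 20, 40, 20, 10]
shifted = [5, 15, 40, 25, 15]
print(population_stability_index(baseline, baseline))
print(population_stability_index(baseline, shifted))
```

Identical histograms give a PSI of 0, and the value grows as the two distributions move apart.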

Profile a Pandas DataFrame

import pandas as pd
import dataprofiler as dp
import json

my_dataframe = pd.DataFrame([[1, 2.0], [1, 2.2], [-1, 3]])
profile = dp.Profiler(my_dataframe)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

# read a specified column, in this case it is labeled 0:
print(json.dumps(report["data_stats"][0], indent=4))

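Unstructured Profiler

In addition to the structured profiler, the Data Profiler provides unstructured profiling for TextData objects or strings. Unstructured profiling also works with list(string), pd.Series(string), or pd.DataFrame(string) when the profiler_type option is set to unstructured. Below is an example of the unstructured profiler with a text file.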

import dataprofiler as dp
import json

my_text = dp.Data('text_file.txt')
profile = dp.Profiler(my_text)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Another example of the unstructured profiler, this time using a pd.Series of strings with profiler_type='unstructured':

import dataprofiler as dp
import pandas as pd
import json

text_data = pd.Series(['first string', 'second string'])
profile = dp.Profiler(text_data, profiler_type='unstructured')

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

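Graph Profiler

DataProfiler also provides the ability to profile graph data from a CSV file. Below is an example of the graph profiler with a graph data CSV file: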

import dataprofiler as dp
import pprint

my_graph = dp.Data('graph_file.csv')
profile = dp.Profiler(my_graph)

# print the report using pretty print (json dump does not work on numpy array values inside dict)
report = profile.report()
printer = pprint.PrettyPrinter(sort_dicts=False, compact=True)
printer.pprint(report)
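
If JSON output is still wanted for a report that holds array values, one option is a fallback passed to json.dumps through its default parameter. The sketch below is a workaround under stated assumptions, not something the library provides: to_jsonable is a hypothetical helper, and array.array from the standard library stands in for a NumPy array, which also exposes a tolist() method.

```python
import json
from array import array


def to_jsonable(value):
    """Fallback for json.dumps: convert array-like objects via tolist()."""
    if hasattr(value, "tolist"):
        return value.tolist()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


# Stand-in report: array.array plays the role of a NumPy array here.
report = {"bin_edges": array("d", [0.0, 0.5, 1.0]), "num_nodes": 3}
print(json.dumps(report, indent=4, default=to_jsonable))
```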

Visit the documentation page for additional examples and API details.

References

Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions
Authors: Anh Truong, Austin Walters, Jeremy Goodsitt
2020 https://arxiv.org/abs/2012.09597
The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
