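Data Profiler | What's in your data?

DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.

A single command loads the data: the library detects the file format automatically and loads the file into a DataFrame. Profiling the data identifies the schema, statistics, entities (PII / NPI), and more. The resulting data profiles can then be used in downstream applications or reports.

Getting started takes only a few lines of code (example CSV):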

import json
from dataprofiler import Data, Profiler

data = Data("your_file.csv")  # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL

print(data.data.head(5))  # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data)  # Calculate Statistics, Entity Recognition, etc

readable_report = profile.report(report_options={"output_format": "compact"})

print(json.dumps(readable_report, indent=4))

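Note: The Data Profiler comes with a pre-trained deep learning model used to efficiently identify sensitive data (PII / NPI). If desired, it is easy to add new entities to the existing pre-trained model or to plug in an entirely new entity recognition pipeline.

For API documentation, visit the documentation page.

If you have suggestions or find a bug, please open an issue.

If you want to contribute, visit the contributing page.

Install

To install the full package from PyPI: pip install DataProfiler[full]

If you want to install the ML dependencies without report generation, use DataProfiler[ml]

If the ML requirements are too strict (say, you don't want to install TensorFlow), you can install a slimmer package with DataProfiler[reports]. The slimmer package disables the default sensitive data detection / entity recognition (labeler).

To install the base package from PyPI: pip install DataProfiler

What is a Data Profile?

In the context of this library, a data profile is a dictionary containing statistics and predictions about the underlying dataset. There are "global statistics", or global_stats, which contain dataset-level data, and there are "column/row level statistics", or data_stats (each column is a new key-value entry).

The format for a structured profile is below:

"global_stats": {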
    "samples_used": int,
    "column_count": int,
    "row_count": int,
    "row_has_null_ratio": float,
    "row_is_null_ratio": float,
    "unique_row_ratio": float,
    "duplicate_row_count": int,
    "file_type": string,
    "encoding": string,
    "correlation_matrix": list[list[int]], (*)
    "chi2_matrix": list[list[float]],
    "profile_schema": {
        string: list[int]
    },
    "times": dict[string, float],
},
"data_stats": [
    {
        "column_name": string,
        "data_type": string,
        "data_label": string,
        "categorical": bool,
        "order": string,
        "samples": list[str],
        "statistics": {
            "sample_size": int,
            "null_count": int,
            "null_types": list[string],
            "null_types_index": {
                string: list[int]
            },
            "data_type_representation": dict[string, float],
            "min": [null, float, str],
            "max": [null, float, str],
            "mode": float,
            "median": float,
            "median_absolute_deviation": float,
            "sum": float,
            "mean": float,
            "variance": float,
            "stddev": float,
            "skewness": float,
            "kurtosis": float,
            "num_zeros": int,
            "num_negatives": int,
            "histogram": {
                "bin_counts": list[int],
                "bin_edges": list[float],
            },
            "quantiles": {
                int: float
            },
            "vocab": list[char],
            "avg_predictions": dict[string, float],
            "data_label_representation": dict[string, float],
            "categories": list[str],
            "unique_count": int,
            "unique_ratio": float,
            "categorical_count": dict[string, int],
            "gini_impurity": float,
            "unalikeability": float,
            "precision": {
                'min': int,
                'max': int,
                'mean': float,
                'var': float,
                'std': float,
                'sample_size': int,
                'margin_of_error': float,
                'confidence_level': float
            },
            "times": dict[string, float],
            "format": string
        },
        "null_replication_metrics": {
            "class_prior": list[int],
            "class_sum": list[list[int]],
            "class_mean": list[list[int]]
        }
    }
]

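(*) Currently the correlation matrix update is toggled off. It will be reset in a later update. Users can still use it as desired with the is_enable option set to True.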
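Because a structured report is a plain dictionary, per-column values can be read with ordinary dictionary access. The sketch below runs on a small hand-written dictionary that follows the layout above. The column names and numbers are made up, and `null_ratio_by_column` is a hypothetical helper, not part of the library.

```python
# Hand-written stand-in for the output of profile.report(); values are illustrative only.
sample_report = {
    "global_stats": {"samples_used": 4, "column_count": 2, "row_count": 4},
    "data_stats": [
        {
            "column_name": "age",
            "data_type": "int",
            "statistics": {"sample_size": 4, "null_count": 1},
        },
        {
            "column_name": "city",
            "data_type": "string",
            "statistics": {"sample_size": 4, "null_count": 0},
        },
    ],
}


def null_ratio_by_column(report):
    """Map each column name to null_count / sample_size (hypothetical helper)."""
    ratios = {}
    for column in report["data_stats"]:
        stats = column["statistics"]
        size = stats["sample_size"]
        ratios[column["column_name"]] = stats["null_count"] / size if size else 0.0
    return ratios


print(null_ratio_by_column(sample_report))
```

The same loop should work on a real report as long as it contains the `data_stats` list shown in the format above.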

The format for an unstructured profile is below:

"global_stats": {
    "samples_used": int,
    "empty_line_count": int,
    "file_type": string,
    "encoding": string,
    "memory_size": float, # in MB
    "times": dict[string, float],
},
"data_stats": {
    "data_label": {
        "entity_counts": {
            "word_level": dict[string, int],
            "true_char_level": dict[string, int],
            "postprocess_char_level": dict[string, int]
        },
        "entity_percentages": {
            "word_level": dict[string, float],
            "true_char_level": dict[string, float],
            "postprocess_char_level": dict[string, float]
        },
        "times": dict[string, float]
    },
    "statistics": {
        "vocab": list[char],
        "vocab_count": dict[string, int],
        "words": list[string],
        "word_count": dict[string, int],
        "times": dict[string, float]
    }
}

The format for a graph profile is below:

"num_nodes": int,
"num_edges": int,
"categorical_attributes": list[string],
"continuous_attributes": list[string],
"avg_node_degree": float,
"global_max_component_size": int,
"continuous_distribution": {
    "<attribute_1>": {
        "name": string,
        "scale": float,
        "properties": list[float, np.array]
    },
    "<attribute_2>": None,
    ...
},
"categorical_distribution": {
    "<attribute_1>": None,
    "<attribute_2>": {
        "bin_counts": list[int],
        "bin_edges": list[float]
    },
    ...
},
"times": dict[string, float]

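Profile Statistic Descriptions

Structured Profile

global_stats:

samples_used - number of input data samples used to generate this profile
column_count - the number of columns contained in the input dataset
row_count - the number of rows contained in the input dataset
row_has_null_ratio - the proportion of rows that contain at least one null value to the total number of rows
row_is_null_ratio - the proportion of rows that consist entirely of null values (null rows) to the total number of rows
unique_row_ratio - the proportion of distinct rows in the input dataset to the total number of rows
duplicate_row_count - the number of rows that occur more than once in the input dataset
file_type - the format of the file containing the input dataset (ex: .csv)
encoding - the encoding of the file containing the input dataset (ex: UTF-8)
correlation_matrix - a matrix of shape column_count x column_count containing the correlation coefficients between each pair of columns in the dataset
chi2_matrix - a matrix of shape column_count x column_count containing the chi-square statistics between each pair of columns in the dataset
profile_schema - a description of the format of the input dataset, labeling each column and its index in the dataset
- string - the label of the column in question and its index in the profile schema
times - the duration of time it took to generate the global statistics for this dataset, in milliseconds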
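The row-level ratios above can be illustrated on a toy dataset. This is a minimal sketch that follows the textual definitions only. The library's own handling of null types and sampling is not shown in this document and may differ, and `row_ratios` is a hypothetical helper.

```python
# Toy dataset: each row is a tuple, None marks a null value.
rows = [
    (1, "a"),
    (2, None),
    (None, None),
    (1, "a"),
]


def row_ratios(rows):
    """Compute three global_stats ratios from their textual definitions."""
    total = len(rows)
    has_null = sum(1 for row in rows if any(value is None for value in row))
    is_null = sum(1 for row in rows if all(value is None for value in row))
    return {
        "row_has_null_ratio": has_null / total,
        "row_is_null_ratio": is_null / total,
        "unique_row_ratio": len(set(rows)) / total,
    }


print(row_ratios(rows))
```

Here two of the four rows contain a null, one row is entirely null, and three rows are distinct.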

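data_stats:

column_name - the label/title of this column in the input dataset
data_type - the primitive Python data type contained within this column
data_label - the label/entity of the data in this column as determined by the Labeler component
categorical - 'true' if this column contains categorical data
order - the way in which the data in this column is ordered, if any
samples - a small subset of data entries from this column
statistics - statistical information on the column
- sample_size - number of input data samples used to generate this profile
- null_count - the number of null entries in the sample
- null_types - a list of the different null types present within this sample
- null_types_index - a dict containing each null type and a respective list of the indices at which it is present within this sample
- data_type_representation - the percentage of samples identified as each data_type
- min - minimum value in the sample
- max - maximum value in the sample
- mode - mode of the entries in the sample
- median - median of the entries in the sample
- median_absolute_deviation - the median absolute deviation of the entries in the sample
- sum - the total of all sampled values from the column
- mean - the average of all entries in the sample
- variance - the variance of all entries in the sample
- stddev - the standard deviation of all entries in the sample
- skewness - the statistical skewness of all entries in the sample
- kurtosis - the statistical kurtosis of all entries in the sample
- num_zeros - the number of entries in this sample that have the value 0
- num_negatives - the number of entries in this sample that have a value less than 0
- histogram - contains histogram-relevant information
  - bin_counts - the number of entries within each bin
  - bin_edges - the thresholds of each bin
- quantiles - the value at each percentile, in the order listed, based on the entries in the sample
- vocab - a list of the characters used within the entries in this sample
- avg_predictions - average of the data label prediction confidences across all data points sampled
- categories - a list of each distinct category within the sample if categorical = 'true'
- unique_count - the number of distinct entries in the sample
- unique_ratio - the proportion of the number of distinct entries in the sample to the total number of entries in the sample
- categorical_count - the number of entries sampled for each category if categorical = 'true'
- gini_impurity - a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset
- unalikeability - a value denoting how frequently entries differ from one another within the sample
- precision - a dict of statistics on the number of digits in each number in the sample
- times - the duration of time it took to generate this sample's statistics, in milliseconds
- format - list of possible datetime formats
null_replication_metrics - statistics of the data partitioned by whether the column value is null (index 1 of the lists referenced by the dict keys) or not (index 0)
- class_prior - a list containing the probability of a column value being null and not null
- class_sum - a list containing the sum of all other rows based on whether the column value is null or not
- class_mean - a list containing the mean of all other rows based on whether the column value is null or not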
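Two of the categorical statistics above, gini_impurity and unalikeability, are easier to read as formulas. The sketch below uses the common textbook definitions (Gini impurity as one minus the sum of squared category frequencies, unalikeability as the share of ordered pairs of entries that differ). It assumes these match the descriptions above; the library's exact implementation is not shown in this document.

```python
from collections import Counter


def gini_impurity(values):
    """1 - sum(p_i^2) over the relative frequency p_i of each category."""
    total = len(values)
    if total == 0:
        return None
    return 1.0 - sum((count / total) ** 2 for count in Counter(values).values())


def unalikeability(values):
    """Share of ordered pairs of entries (i != j) whose values differ."""
    total = len(values)
    if total <= 1:
        return 0.0
    differing = sum(count * (total - count) for count in Counter(values).values())
    return differing / (total * total - total)


sample = ["a", "a", "b", "b"]
print(gini_impurity(sample), unalikeability(sample))
```

Both values are 0 when every entry is identical and grow as the entries spread over more categories.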

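Unstructured Profile

global_stats:

samples_used - number of input data samples used to generate this profile
empty_line_count - the number of empty lines in the input data
file_type - the file type of the input data (ex: .txt)
encoding - the file encoding of the input data file (ex: UTF-8)
memory_size - size of the input data in MB
times - the duration of time it took to generate this profile, in milliseconds

data_stats:

data_label - labels and statistics on the labels of the input data
- entity_counts - the number of times a specific label or entity appears in the input data
  - word_level - the number of words counted within each label or entity
  - true_char_level - the number of characters counted within each label or entity, as determined by the model
  - postprocess_char_level - the number of characters counted within each label or entity, as determined by the postprocessor
- entity_percentages - the percentage of each label or entity within the input data
  - word_level - the percentage of words in the input data contained within each label or entity
  - true_char_level - the percentage of characters in the input data contained within each label or entity, as determined by the model
  - postprocess_char_level - the percentage of characters in the input data contained within each label or entity, as determined by the postprocessor
- times - the duration of time it took for the data labeler to predict on the data
statistics - statistics of the input data
- vocab - a list of each character in the input data
- vocab_count - the number of occurrences of each distinct character in the input data
- words - a list of each word in the input data
- word_count - the number of occurrences of each distinct word in the input data
- times - the duration of time it took to generate the vocab and word statistics, in milliseconds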
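The vocab and word statistics above amount to character and word tallies. The following is a minimal sketch with the standard library, assuming plain whitespace tokenization. The library may normalize case, drop stop words, or tokenize differently, and `text_statistics` is a hypothetical helper.

```python
from collections import Counter


def text_statistics(text):
    """Character and word tallies for one string (whitespace tokenization)."""
    vocab_count = Counter(text)
    word_count = Counter(text.split())
    return {
        "vocab": sorted(vocab_count),
        "vocab_count": dict(vocab_count),
        "words": sorted(word_count),
        "word_count": dict(word_count),
    }


stats = text_statistics("to be or not to be")
print(stats["word_count"])
```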

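Graph Profile

num_nodes - number of nodes in the graph
num_edges - number of edges in the graph
categorical_attributes - list of categorical edge attributes
continuous_attributes - list of continuous edge attributes
avg_node_degree - average degree of the nodes in the graph
global_max_component_size - size of the global max component

continuous_distribution:

<attribute_N> - name of the N-th edge attribute in the list of attributes
- name - name of the distribution for the attribute
- scale - negative log likelihood used to scale and compare distributions
- properties - list of statistical properties describing the distribution
  - [shape (optional), loc, scale, mean, variance, skew, kurtosis]

categorical_distribution:

<attribute_N> - name of the N-th edge attribute in the list of attributes
- bin_counts - counts in each bin of the distribution histogram
- bin_edges - edges of each bin of the distribution histogram
times - the duration of time it took to generate this profile, in milliseconds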
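The top-level graph statistics can be reproduced by hand for a small edge list. This sketch assumes an undirected graph without duplicate edges and reads global_max_component_size as the size of the largest connected component. Both are assumptions, and `graph_summary` is a hypothetical helper rather than the library's code.

```python
from collections import defaultdict


def graph_summary(edges):
    """Summarize an undirected graph given as (source, target) pairs."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    # Depth-first search to find the size of the largest connected component.
    seen, largest = set(), 0
    for start in neighbors:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        largest = max(largest, size)

    num_nodes, num_edges = len(neighbors), len(edges)
    return {
        "num_nodes": num_nodes,
        "num_edges": num_edges,
        "avg_node_degree": 2 * num_edges / num_nodes if num_nodes else 0.0,
        "global_max_component_size": largest,
    }


summary = graph_summary([(1, 2), (2, 3), (4, 5)])
print(summary)
```

Each undirected edge contributes to the degree of two nodes, which is why the average degree is twice the edge count divided by the node count.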

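Support

Supported Data Formats

Any delimited file (CSV, TSV, etc.)
JSON object
Avro file
Parquet file
Text file
Pandas DataFrame
A URL that points to one of the supported file types above

Data Types

Data types are determined at the column level for structured data

Int
Float
String
DateTime

Data Labels

Data labels are determined per cell for structured data (per column/row when the profiler is used) or at the character level for unstructured data.

UNKNOWN
ADDRESS
BAN (bank account number, 10-18 digits)
CREDIT_CARD
EMAIL_ADDRESS
UUID
HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
IPV4
IPV6
MAC_ADDRESS
PERSON
PHONE_NUMBER
SSN
URL
US_STATE
DRIVERS_LICENSE
DATE
TIME
DATETIME
INTEGER
FLOAT
QUANTITY
ORDINAL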

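Get Started

Load a File

The Data Profiler can profile the following data/file types:

CSV file (or any delimited file)
JSON object
Avro file
Parquet file
Text file
Pandas DataFrame
A URL that points to one of the supported file types above

The profiler should automatically identify the file type and load the data into a Data class.

Along with other attributes, the Data class makes the data accessible as a valid Pandas DataFrame.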

from dataprofiler import Data

# Load a csv file, return a CSVData object
csv_data = Data('your_file.csv')

# Print the first 10 rows of the csv file
print(csv_data.data.head(10))

# Load a parquet file, return a ParquetData object
parquet_data = Data('your_file.parquet')

# Sort the data by the name column
parquet_data.data.sort_values(by='name', inplace=True)

# Print the sorted first 10 rows of the parquet data
print(parquet_data.data.head(10))

# Load a json file from a URL, return a JSONData object
json_data = Data('https://github.com/capitalone/DataProfiler/blob/main/dataprofiler/tests/data/json/iris-utf-8.json')

If the file type is not automatically identified (rare), you can specify it explicitly; see the section "Specifying a Filetype or Delimiter".
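To give a feel for what automatic file type identification involves, here is a deliberately simple content-based guess written with the standard library. It is only an illustration of the idea and is not the detection logic used by the library. `guess_format` and its delimiter list are assumptions made for this sketch.

```python
import json


def guess_format(text, delimiters=(",", "\t", ";", "|")):
    """Guess 'json', 'csv' (any delimited text) or 'text' from raw content."""
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass

    lines = [line for line in text.splitlines() if line.strip()]
    for delimiter in delimiters:
        counts = {line.count(delimiter) for line in lines}
        # Delimited data has the same, non-zero delimiter count on every line.
        if len(lines) > 1 and len(counts) == 1 and counts != {0}:
            return "csv"
    return "text"


print(guess_format('{"a": 1}'))
print(guess_format("name,age\nann,31\nbob,27"))
print(guess_format("Just a sentence.\nAnd another one, with a comma."))
```

A heuristic like this can misfire on unusual files, which is one reason an explicit override is useful.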

Profile a File

The example below uses a CSV file, but JSON, Avro, Parquet, or text files work as well.

import json
from dataprofiler import Data, Profiler

# Load file (CSV should be automatically identified)
data = Data("your_file.csv")

# Profile the dataset
profile = Profiler(data)

# Generate a report and use json to prettify.
report = profile.report(report_options={"output_format": "pretty"})

# Print the report
print(json.dumps(report, indent=4))

Updating Profiles

Currently, the data profiler is equipped to update its profile in batches.

import json
from dataprofiler import Data, Profiler

# Load and profile a CSV file
data = Data("your_file.csv")
profile = Profiler(data)

# Update the profile with new data:
new_data = Data("new_data.csv")
profile.update_profile(new_data)

# Print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

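Note that if the data you update the profile with has integer indices that overlap with the indices of the data originally profiled, the indices are "shifted" to unoccupied values when null rows are calculated, so that null counts and ratios remain accurate.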
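The reason for shifting can be seen with a small model in which null rows are tracked as a set of integer indices. If a second batch reuses indices 0, 1, 2, its null rows would collapse onto the first batch's entries unless they are moved. The sketch below shows one possible shifting scheme, assuming non-negative indices. How the library actually picks the shifted values is not described here, and `merge_null_indices` is a hypothetical helper.

```python
def merge_null_indices(seen_indices, null_indices, batch_indices, batch_nulls):
    """Add a batch whose integer index may overlap rows that were already profiled.

    seen_indices / null_indices describe the rows profiled so far; batch_indices /
    batch_nulls describe the new batch. When any batch index overlaps, the whole
    batch is shifted past the largest index seen so far, so no two rows share one.
    """
    overlap = seen_indices & set(batch_indices)
    offset = max(seen_indices) + 1 if overlap else 0
    shifted = {index: index + offset for index in batch_indices}
    seen = seen_indices | set(shifted.values())
    nulls = null_indices | {shifted[index] for index in batch_nulls}
    return seen, nulls


# First batch: rows 0-2 with a null row at index 1. Second batch reuses 0-2.
seen, nulls = merge_null_indices({0, 1, 2}, {1}, [0, 1, 2], [1])
print(sorted(nulls), len(nulls) / len(seen))
```

Without the shift, both null rows would be recorded at index 1 and the null count would be 1 instead of 2.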

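Merging Profiles

If you have two files with the same schema (but different data), you can merge their two profiles with the addition operator.

This also allows profiles to be computed in a distributed manner.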

import json
from dataprofiler import Data, Profiler

# Load a CSV file with a schema
data1 = Data("file_a.csv")
profile1 = Profiler(data1)

# Load another CSV file with the same schema
data2 = Data("file_b.csv")
profile2 = Profiler(data2)

profile3 = profile1 + profile2

# Print the report using json to prettify.
report = profile3.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

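Note that if the merged profiles have overlapping integer indices, the indices are "shifted" to unoccupied values when null rows are calculated, so that null counts and ratios remain accurate.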
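Merging with `+` works for distributed profiling because many statistics can be combined from partial results without revisiting the raw rows. The sketch below demonstrates the idea for count, mean, and variance using Welford's update and the pairwise combination formula. `RunningStats` is a hypothetical class written for this illustration and is not the library's implementation.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count, mean and sum of squared deviations (m2) for one numeric column."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    @classmethod
    def from_values(cls, values):
        stats = cls()
        for value in values:  # Welford's online update
            stats.count += 1
            delta = value - stats.mean
            stats.mean += delta / stats.count
            stats.m2 += delta * (value - stats.mean)
        return stats

    def __add__(self, other):
        # Pairwise combination of two partial results; raw rows are not needed.
        count = self.count + other.count
        if count == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / count
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / count
        return RunningStats(count, mean, m2)

    @property
    def variance(self):
        """Sample variance (divides by count - 1)."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


part_a = RunningStats.from_values([1.0, 2.0, 3.0])
part_b = RunningStats.from_values([4.0, 5.0])
merged = part_a + part_b
print(merged.count, merged.mean, merged.variance)
```

The merged result matches what a single pass over all five values would give, so each partition can be summarized on a separate worker.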

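Profiler Differences

To find the changes between profiles with the same schema, use the profile's diff function. The diff provides overall file and sampling differences as well as detailed differences in the data's statistics. For example, numerical columns have a t-test applied to evaluate similarity and a PSI (Population Stability Index) to quantify the shift in the column's distribution. More information is available in the Profiler section of the GitHub Pages documentation.

Create a difference report like this: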

import json
import dataprofiler as dp

# Load a CSV file
data1 = dp.Data("file_a.csv")
profile1 = dp.Profiler(data1)

# Load another CSV file
data2 = dp.Data("file_b.csv")
profile2 = dp.Profiler(data2)

diff_report = profile1.diff(profile2)
print(json.dumps(diff_report, indent=4))
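For readers unfamiliar with the Population Stability Index mentioned above, the following sketch computes it from two histograms that share the same bins, using the commonly cited formula PSI = sum((a_i - e_i) * ln(a_i / e_i)) over bin proportions. The binning and the `epsilon` guard for empty bins are assumptions of this sketch, and the library's own calculation may differ in those details.

```python
import math


def population_stability_index(expected_counts, actual_counts, epsilon=1e-6):
    """PSI = sum((a_i - e_i) * ln(a_i / e_i)) over bins shared by both histograms."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Convert counts to proportions; epsilon avoids log(0) for empty bins.
        e = max(expected / expected_total, epsilon)
        a = max(actual / actual_total, epsilon)
        psi += (a - e) * math.log(a / e)
    return psi


print(population_stability_index([50, 30, 20], [50, 30, 20]))
print(population_stability_index([50, 30, 20], [20, 30, 50]))
```

Identical distributions give a PSI of 0, and the value grows as mass moves between bins.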

Profiling a Pandas DataFrame

import pandas as pd
import dataprofiler as dp
import json

my_dataframe = pd.DataFrame([[1, 2.0], [1, 2.2], [-1, 3]])
profile = dp.Profiler(my_dataframe)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

# read a specified column, in this case it is labeled 0:
print(json.dumps(report["data_stats"][0], indent=4))

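Unstructured Profiler

In addition to the structured profiler, DataProfiler provides unstructured profiling for TextData objects or strings. Unstructured profiling also works with list[string], pd.Series(string), or pd.DataFrame(string) input when the profiler_type option is set to unstructured. Below is an example of the unstructured profiler with a text file.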

import dataprofiler as dp
import json

my_text = dp.Data('text_file.txt')
profile = dp.Profiler(my_text)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Another example of the unstructured profiler, this time with a pd.Series of strings and the option profiler_type='unstructured':

import dataprofiler as dp
import pandas as pd
import json

text_data = pd.Series(['first string', 'second string'])
profile = dp.Profiler(text_data, profiler_type='unstructured')

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

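Graph Profiler

DataProfiler also provides the ability to profile graph data from a CSV file. Below is an example of the graph profiler with a graph data CSV file: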

import dataprofiler as dp
import pprint

my_graph = dp.Data('graph_file.csv')
profile = dp.Profiler(my_graph)

# print the report using pretty print (json dump does not work on numpy array values inside dict)
report = profile.report()
printer = pprint.PrettyPrinter(sort_dicts=False, compact=True)
printer.pprint(report)
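If JSON output is still wanted for a report that contains numpy arrays, one option is to pass a `default` fallback to `json.dumps` that converts array-like values through their `tolist()` method. The sketch below uses a small stand-in class so that it runs without numpy. `to_serializable` and `FakeArray` are hypothetical names, and the report dictionary is made up for illustration.

```python
import json


def to_serializable(value):
    """Fallback for json.dumps: convert array-like objects through .tolist()."""
    if hasattr(value, "tolist"):
        return value.tolist()
    raise TypeError(f"{type(value).__name__} is not JSON serializable")


class FakeArray:
    """Minimal stand-in for numpy.ndarray so the sketch runs without numpy."""

    def __init__(self, items):
        self._items = list(items)

    def tolist(self):
        return list(self._items)


report = {"properties": [1.5, FakeArray([0.0, 2.0])]}
print(json.dumps(report, indent=4, default=to_serializable))
```

Because numpy arrays expose `tolist()`, the same fallback should apply to them as well.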

Visit the documentation page for additional examples and API details

References

Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions
Authors: Anh Truong, Austin Walters, Jeremy Goodsitt
2020 https://arxiv.org/abs/2012.09597
The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
