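Data Profiler | What's in your data?

The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.

Loading data takes a single command: the library automatically detects the file format and loads the file into a DataFrame. When profiling the data, the library identifies the schema, statistics, entities (PII / NPI), and more. Data profiles can then be used in downstream applications or reports.

Getting started only takes a few lines of code (CSV example):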

import json
from dataprofiler import Data, Profiler

data = Data("your_file.csv")  # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL

print(data.data.head(5))  # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data)  # Calculate Statistics, Entity Recognition, etc

readable_report = profile.report(report_options={"output_format": "compact"})

print(json.dumps(readable_report, indent=4))

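Note: The Data Profiler comes with a pre-trained deep learning model, used to efficiently identify sensitive data (PII / NPI). If desired, it is easy to add new entities to the existing pre-trained model or to insert an entirely new pipeline for entity recognition.

For API documentation, visit the documentation page.

If you have suggestions or find a bug, please open an issue.

If you want to contribute, visit the contributing page.

Install

To install the full package from PyPI: pip install DataProfiler[full]

If you want to install the ML dependencies without report generation, use DataProfiler[ml]

If the ML requirements are too strict (say, you don't want to install TensorFlow), you can install a slimmer package with DataProfiler[reports]. The slimmer package disables the default sensitive data detection / entity recognition (labeler).

Install from PyPI: pip install DataProfiler

What is a Data Profile?

In the case of this library, a data profile is a dictionary containing statistics and predictions about the underlying dataset. There are "global statistics" or global_stats, which contain dataset-level data, and there are "column/row level statistics" or data_stats (each column is a new key-value entry).

The format for a structured profile is below: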

"global_stats": {
    "samples_used": int,
    "column_count": int,
    "row_count": int,
    "row_has_null_ratio": float,
    "row_is_null_ratio": float,
    "unique_row_ratio": float,
    "duplicate_row_count": int,
    "file_type": string,
    "encoding": string,
    "correlation_matrix": list[list[int]], (*)
    "chi2_matrix": list[list[float]],
    "profile_schema": {
        string: list[int]
    },
    "times": dict[string, float],
},
"data_stats": [
    {
        "column_name": string,
        "data_type": string,
        "data_label": string,
        "categorical": bool,
        "order": string,
        "samples": list[str],
        "statistics": {
            "sample_size": int,
            "null_count": int,
            "null_types": list[string],
            "null_types_index": {
                string: list[int]
            },
            "data_type_representation": dict[string, float],
            "min": [null, float, str],
            "max": [null, float, str],
            "mode": float,
            "median": float,
            "median_absolute_deviation": float,
            "sum": float,
            "mean": float,
            "variance": float,
            "stddev": float,
            "skewness": float,
            "kurtosis": float,
            "num_zeros": int,
            "num_negatives": int,
            "histogram": {
                "bin_counts": list[int],
                "bin_edges": list[float],
            },
            "quantiles": {
                int: float
            },
            "vocab": list[char],
            "avg_predictions": dict[string, float],
            "data_label_representation": dict[string, float],
            "categories": list[str],
            "unique_count": int,
            "unique_ratio": float,
            "categorical_count": dict[string, int],
            "gini_impurity": float,
            "unalikeability": float,
            "precision": {
                'min': int,
                'max': int,
                'mean': float,
                'var': float,
                'std': float,
                'sample_size': int,
                'margin_of_error': float,
                'confidence_level': float
            },
            "times": dict[string, float],
            "format": string
        },
        "null_replication_metrics": {
            "class_prior": list[int],
            "class_sum": list[list[int]],
            "class_mean": list[list[int]]
        }
    }
]

(*) Currently the correlation matrix update is toggled off. It will be reset in a later update. Users can still use it as desired by setting the is_enable option to True.
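
Because a structured report is a plain dictionary, it can be walked with ordinary Python. The sketch below uses a hypothetical, heavily trimmed report that follows the format above (the column names and values are made up for illustration) and derives a per-column null ratio from null_count and sample_size. The helper null_ratio_by_column is not part of the library.

```python
# Hypothetical, heavily trimmed report following the structured format above.
# Real reports contain many more keys; the values here are invented.
sample_report = {
    "global_stats": {"samples_used": 4, "column_count": 2, "row_count": 4},
    "data_stats": [
        {
            "column_name": "age",
            "data_type": "int",
            "statistics": {"sample_size": 4, "null_count": 1},
        },
        {
            "column_name": "city",
            "data_type": "string",
            "statistics": {"sample_size": 4, "null_count": 0},
        },
    ],
}


def null_ratio_by_column(report):
    """Map each column name to null_count / sample_size (illustrative helper)."""
    ratios = {}
    for column in report["data_stats"]:
        stats = column["statistics"]
        size = stats["sample_size"]
        ratios[column["column_name"]] = stats["null_count"] / size if size else 0.0
    return ratios


print(null_ratio_by_column(sample_report))
```

The same loop works on the dictionary returned by a real report call, as long as the keys shown in the format above are present.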

The format for an unstructured profile is below:

"global_stats": {
    "samples_used": int,
    "empty_line_count": int,
    "file_type": string,
    "encoding": string,
    "memory_size": float, # in MB
    "times": dict[string, float],
},
"data_stats": {
    "data_label": {
        "entity_counts": {
            "word_level": dict[string, int],
            "true_char_level": dict[string, int],
            "postprocess_char_level": dict[string, int]
        },
        "entity_percentages": {
            "word_level": dict[string, float],
            "true_char_level": dict[string, float],
            "postprocess_char_level": dict[string, float]
        },
        "times": dict[string, float]
    },
    "statistics": {
        "vocab": list[char],
        "vocab_count": dict[string, int],
        "words": list[string],
        "word_count": dict[string, int],
        "times": dict[string, float]
    }
}

The format for a graph profile is below:

"num_nodes": int,
"num_edges": int,
"categorical_attributes": list[string],
"continuous_attributes": list[string],
"avg_node_degree": float,
"global_max_component_size": int,
"continuous_distribution": {
    "<attribute_1>": {
        "name": string,
        "scale": float,
        "properties": list[float, np.array]
    },
    "<attribute_2>": None,
    ...
},
"categorical_distribution": {
    "<attribute_1>": None,
    "<attribute_2>": {
        "bin_counts": list[int],
        "bin_edges": list[float]
    },
    ...
},
"times": dict[string, float]

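Profile Statistic Descriptions

Structured Profile

global_stats:

samples_used - number of input data samples used to generate this profile
column_count - the number of columns contained in the input dataset
row_count - the number of rows contained in the input dataset
row_has_null_ratio - the proportion of rows that contain at least one null value to the total number of rows
row_is_null_ratio - the proportion of rows that are fully comprised of null values (null rows) to the total number of rows
unique_row_ratio - the proportion of distinct rows in the input dataset to the total number of rows
duplicate_row_count - the number of rows that occur more than once in the input dataset
file_type - the format of the file containing the input dataset (ex: .csv)
encoding - the encoding of the file containing the input dataset (ex: UTF-8)
correlation_matrix - matrix of shape column_count x column_count containing the correlation coefficients between each column in the dataset
chi2_matrix - matrix of shape column_count x column_count containing the chi-square statistics between each column in the dataset
profile_schema - a description of the format of the input dataset, labeling each column and its index in the dataset
- string - the label of the column in question and its index in the profile schema
times - the duration of time it took to generate the global statistics for this dataset, in milliseconds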
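
To make the three row-level ratios concrete, here is a minimal sketch that computes them for a small in-memory table. It assumes rows are tuples and that None stands in for a null cell; it only illustrates the definitions above and is not the library's implementation, which also handles sampling and multiple null types.

```python
def row_ratios(rows):
    """Compute the row-level ratios described above for a list of tuples.

    Assumption of this sketch: None represents a null cell.
    """
    total = len(rows)
    if total == 0:
        return {
            "row_has_null_ratio": 0.0,
            "row_is_null_ratio": 0.0,
            "unique_row_ratio": 0.0,
        }
    # Rows with at least one null cell.
    has_null = sum(1 for row in rows if any(cell is None for cell in row))
    # Rows made up entirely of null cells.
    is_null = sum(1 for row in rows if all(cell is None for cell in row))
    # Distinct rows (tuples are hashable, so a set removes duplicates).
    unique = len(set(rows))
    return {
        "row_has_null_ratio": has_null / total,
        "row_is_null_ratio": is_null / total,
        "unique_row_ratio": unique / total,
    }


example_rows = [(1, "a"), (1, "a"), (2, None), (None, None)]
print(row_ratios(example_rows))
```

In this example two of four rows contain a null, one row is entirely null, and three of the four rows are distinct.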

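data_stats:

column_name - the label/title of this column in the input dataset
data_type - the primitive Python data type that is contained within this column
data_label - the label/entity of the data in this column, as determined by the labeler component
categorical - 'true' if this column contains categorical data
order - the way in which the data in this column is ordered, if any
samples - a small subset of data entries from this column
statistics - statistical information on the column
- sample_size - number of input data samples used to generate this profile
- null_count - the number of null entries in the sample
- null_types - a list of the different null types present within this sample
- null_types_index - a dict containing each null type and a respective list of the indices at which it is present within this sample
- data_type_representation - the percentage of samples identified as each data_type
- min - minimum value in the sample
- max - maximum value in the sample
- mode - mode of the entries in the sample
- median - median of the entries in the sample
- median_absolute_deviation - the median absolute deviation of the entries in the sample
- sum - the total of all sampled values from the column
- mean - the average of all entries in the sample
- variance - the variance of all entries in the sample
- stddev - the standard deviation of all entries in the sample
- skewness - the statistical skewness of all entries in the sample
- kurtosis - the statistical kurtosis of all entries in the sample
- num_zeros - the number of entries in this sample that have the value 0
- num_negatives - the number of entries in this sample that have a value less than 0
- histogram - contains histogram-relevant information
  - bin_counts - the number of entries within each bin
  - bin_edges - the thresholds of each bin
- quantiles - the value at each percentile, in the order they are listed, based on the entries in the sample
- vocab - a list of the characters used within the entries in this sample
- avg_predictions - average of the data label prediction confidences across all data points sampled
- categories - a list of each distinct category within the sample if categorical = 'true'
- unique_count - the number of distinct entries in the sample
- unique_ratio - the proportion of the number of distinct entries in the sample to the total number of entries in the sample
- categorical_count - number of entries sampled for each category if categorical = 'true'
- gini_impurity - measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset
- unalikeability - a value denoting how frequently entries differ from one another within the sample
- precision - a dict of statistics with respect to the number of digits in a number for each sample
- times - the duration of time it took to generate this sample's statistics, in milliseconds
- format - list of possible datetime formats
null_replication_metrics - statistics of data partitioned based on whether the column value is null (index 1 of the lists referenced by the dict keys) or not (index 0)
- class_prior - a list containing the probability of a column value being null and not null
- class_sum - a list containing the sum of all other rows based on whether the column value is null or not
- class_mean - a list containing the mean of all other rows based on whether the column value is null or not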
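
Two of the categorical statistics above, gini_impurity and unalikeability, are easier to grasp with numbers. The sketch below computes both from a list of category values. It assumes the textbook Gini impurity (one minus the sum of squared category proportions) and a pairwise definition of unalikeability (the share of ordered pairs of distinct entries whose values differ). The library's exact formulas are not given in this document, so treat this as an illustration of the descriptions only.

```python
from collections import Counter


def gini_impurity(values):
    """1 - sum(p_i^2) over category proportions p_i; None for empty input."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return None
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def unalikeability(values):
    """Share of ordered pairs of distinct entries whose values differ.

    Assumed definition for this sketch: sum(c * (n - c)) / (n * (n - 1)).
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total < 2:
        return 0.0
    differing_pairs = sum(count * (total - count) for count in counts.values())
    return differing_pairs / (total * (total - 1))


categories = ["a", "a", "b", "b"]
print(gini_impurity(categories), unalikeability(categories))
```

Both values are 0 when every entry is identical and grow as the entries spread more evenly across categories.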

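Unstructured Profile

global_stats:

samples_used - number of input data samples used to generate this profile
empty_line_count - the number of empty lines in the input data
file_type - the file type of the input data (ex: .txt)
encoding - the file encoding of the input data file (ex: UTF-8)
memory_size - size of the input data in MB
times - the duration of time it took to generate this profile, in milliseconds

data_stats:

data_label - labels and statistics on the labels of the input data
- entity_counts - the number of times a specific label or entity appears inside the input data
  - word_level - the number of words counted within each label or entity
  - true_char_level - the number of characters counted within each label or entity, as determined by the model
  - postprocess_char_level - the number of characters counted within each label or entity, as determined by the postprocessor
- entity_percentages - the percentages of each label or entity within the input data
  - word_level - the percentage of words in the input data that are contained within each label or entity
  - true_char_level - the percentage of characters in the input data that are contained within each label or entity, as determined by the model
  - postprocess_char_level - the percentage of characters in the input data that are contained within each label or entity, as determined by the postprocessor
- times - the duration of time it took for the data labeler to predict on the data
statistics - statistics of the input data
- vocab - a list of each character in the input data
- vocab_count - the number of occurrences of each distinct character in the input data
- words - a list of each word in the input data
- word_count - the number of occurrences of each distinct word in the input data
- times - the duration of time it took to generate the vocab and words statistics, in milliseconds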
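
The vocab and word statistics above can be approximated with collections.Counter. This is a minimal sketch that assumes words are separated by whitespace and that every character (spaces included) counts toward the vocab; the library's own tokenization, case handling, and any stop-word filtering may differ.

```python
from collections import Counter


def text_statistics(text):
    """Character and word counts for a string (whitespace tokenization assumed)."""
    vocab_count = Counter(text)          # occurrences of each distinct character
    word_count = Counter(text.split())   # occurrences of each distinct word
    return {
        "vocab": sorted(vocab_count),
        "vocab_count": dict(vocab_count),
        "words": sorted(word_count),
        "word_count": dict(word_count),
    }


stats = text_statistics("to be or not to be")
print(stats["word_count"])
```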

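Graph Profile

num_nodes - number of nodes in the graph
num_edges - number of edges in the graph
categorical_attributes - list of categorical edge attributes
continuous_attributes - list of continuous edge attributes
avg_node_degree - average degree of the nodes in the graph
global_max_component_size - size of the global max component

continuous_distribution:

<attribute_N> - name of the N-th edge attribute in the list of attributes
- name - name of the distribution for the attribute
- scale - negative log likelihood used to scale and compare distributions
- properties - list of statistical properties describing the distribution
  - [shape (optional), loc, scale, mean, variance, skew, kurtosis]

categorical_distribution:

<attribute_N> - name of the N-th edge attribute in the list of attributes
- bin_counts - counts in each bin of the distribution histogram
- bin_edges - edges of each bin of the distribution histogram
times - the duration of time it took to generate this profile, in milliseconds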
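
The structural graph statistics (num_nodes, num_edges, avg_node_degree, global_max_component_size) can be illustrated on a small edge list. The sketch below assumes an undirected graph with no duplicate edges or self-loops, given as (source, destination) pairs, and interprets the global max component as the largest connected component. It is a standard-library illustration, not the library's implementation.

```python
from collections import defaultdict


def graph_summary(edges):
    """Summarize an undirected graph given as (source, destination) pairs."""
    adjacency = defaultdict(set)
    num_edges = 0
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
        num_edges += 1
    num_nodes = len(adjacency)
    # Every edge adds one to the degree of each of its two endpoints.
    avg_node_degree = 2 * num_edges / num_nodes if num_nodes else 0.0

    # Depth-first search to find the largest connected component.
    seen = set()
    largest = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        largest = max(largest, size)

    return {
        "num_nodes": num_nodes,
        "num_edges": num_edges,
        "avg_node_degree": avg_node_degree,
        "global_max_component_size": largest,
    }


summary = graph_summary([(1, 2), (2, 3), (3, 1), (4, 5)])
print(summary)
```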

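Support

Supported Data Formats

Delimited files (CSV, TSV, etc.)
JSON objects
Avro files
Parquet files
Text files
Pandas DataFrame
A URL that points to one of the supported file types above

Data Types

Data types are determined at the column level for structured data

Int
Float
String
DateTime

Data Labels

Data labels are determined per cell for structured data (column/row when the profiler is used) or at the character level for unstructured data.

UNKNOWN
ADDRESS
BAN (bank account number, 10-18 digits)
CREDIT_CARD
EMAIL_ADDRESS
UUID
HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
IPV4
IPV6
MAC_ADDRESS
PERSON
PHONE_NUMBER
SSN
URL
US_STATE
DRIVERS_LICENSE
DATE
TIME
DATETIME
INTEGER
FLOAT
QUANTITY
ORDINAL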
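
Since labels are predicted per cell but reported per column, it can help to see how cell-level predictions might roll up into the avg_predictions and data_label_representation fields of the structured profile. The sketch below is one plausible reading, not the library's code: it assumes avg_predictions is the mean confidence per label across cells and data_label_representation is the share of cells whose top-scoring label is each label. The per-cell confidences are invented for illustration.

```python
def summarize_label_predictions(cell_confidences):
    """Aggregate per-cell label confidences into column-level summaries.

    cell_confidences: list of non-empty dicts mapping label -> confidence.
    Returns (avg_predictions, label_representation) under the assumptions
    stated in the text above.
    """
    if not cell_confidences:
        return {}, {}
    labels = sorted({label for cell in cell_confidences for label in cell})
    n = len(cell_confidences)
    # Mean confidence per label; a label missing from a cell counts as 0.
    avg_predictions = {
        label: sum(cell.get(label, 0.0) for cell in cell_confidences) / n
        for label in labels
    }
    # Share of cells in which each label has the highest confidence.
    top_counts = {label: 0 for label in labels}
    for cell in cell_confidences:
        top_counts[max(cell, key=cell.get)] += 1
    label_representation = {label: top_counts[label] / n for label in labels}
    return avg_predictions, label_representation


# Hypothetical confidences for four cells of one column.
cells = [
    {"EMAIL_ADDRESS": 0.9, "UNKNOWN": 0.1},
    {"EMAIL_ADDRESS": 0.7, "UNKNOWN": 0.3},
    {"EMAIL_ADDRESS": 0.2, "UNKNOWN": 0.8},
    {"EMAIL_ADDRESS": 0.6, "UNKNOWN": 0.4},
]
avg, representation = summarize_label_predictions(cells)
print(avg, representation)
```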

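Get Started

Load a File

The Data Profiler can profile the following data/file types:

CSV file (or any delimited file)
JSON object
Avro file
Parquet file
Text file
Pandas DataFrame
A URL that points to one of the supported file types above

The profiler should automatically identify the file type and load the data into a Data class.

Along with other attributes, the Data class enables data to be accessed via a valid Pandas DataFrame.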

from dataprofiler import Data

# Load a csv file, return a CSVData object
csv_data = Data('your_file.csv')

# Print the first 10 rows of the csv file
print(csv_data.data.head(10))

# Load a parquet file, return a ParquetData object
parquet_data = Data('your_file.parquet')

# Sort the data by the name column
parquet_data.data.sort_values(by='name', inplace=True)

# Print the sorted first 10 rows of the parquet data
print(parquet_data.data.head(10))

# Load a json file from a URL, return a JSONData object
json_data = Data('https://github.com/capitalone/DataProfiler/blob/main/dataprofiler/tests/data/json/iris-utf-8.json')

If the file type is not automatically identified (rare), you can specify it explicitly; see the section Specifying a Filetype or Delimiter.

Profile a File

The example uses a CSV file, but JSON, Avro, Parquet, or text files work as well.

import json
from dataprofiler import Data, Profiler

# Load file (CSV should be automatically identified)
data = Data("your_file.csv")

# Profile the dataset
profile = Profiler(data)

# Generate a report and use json to prettify.
report = profile.report(report_options={"output_format": "pretty"})

# Print the report
print(json.dumps(report, indent=4))

Updating Profiles

Currently, the data profiler is equipped to update its profile in batches.

import json
from dataprofiler import Data, Profiler

# Load and profile a CSV file
data = Data("your_file.csv")
profile = Profiler(data)

# Update the profile with new data:
new_data = Data("new_data.csv")
profile.update_profile(new_data)

# Print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

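Note that if the data you update the profile with contains integer indices that overlap with the indices of the data originally profiled, the indices will be "shifted" to unused values when null rows are calculated, so that null counts and ratios remain accurate.

Merging Profiles

If you have two files with the same schema (but different data), you can merge the two profiles via the addition operator.

This also allows profiles to be computed in a distributed manner.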

import json
from dataprofiler import Data, Profiler

# Load a CSV file with a schema
data1 = Data("file_a.csv")
profile1 = Profiler(data1)

# Load another CSV file with the same schema
data2 = Data("file_b.csv")
profile2 = Profiler(data2)

profile3 = profile1 + profile2

# Print the report using json to prettify.
report = profile3.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

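Note that if the merged profiles had overlapping integer indices, the indices will be "shifted" to unused values when null rows are calculated, so that null counts and ratios remain accurate.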
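
Merging through the addition operator works in a distributed setting when each statistic can be combined from partial results without revisiting the raw data. The sketch below illustrates that idea for count, mean, and variance using the well-known pairwise update for combining two partial summaries. RunningStats is a hypothetical class written for this illustration; it is not the library's Profiler and says nothing about how the library merges its own statistics.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count, mean, and sum of squared deviations (m2) for one batch of numbers."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    @classmethod
    def from_values(cls, values):
        """Build the summary in one pass (Welford's online update)."""
        stats = cls()
        for value in values:
            stats.count += 1
            delta = value - stats.mean
            stats.mean += delta / stats.count
            stats.m2 += delta * (value - stats.mean)
        return stats

    def __add__(self, other):
        """Combine two summaries without access to the original values."""
        total = self.count + other.count
        if total == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / total
        m2 = self.m2 + other.m2 + delta ** 2 * self.count * other.count / total
        return RunningStats(total, mean, m2)

    @property
    def variance(self):
        """Sample variance; NaN when fewer than two values were seen."""
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")


part_a = RunningStats.from_values([1, 2, 3])
part_b = RunningStats.from_values([4, 5, 6, 7])
merged = part_a + part_b
whole = RunningStats.from_values([1, 2, 3, 4, 5, 6, 7])
print(merged.mean, merged.variance, whole.variance)
```

The merged summary matches the one computed over all values at once, which is the property that lets each worker profile its own file and combine results afterwards.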

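Profiler Differences

To find the changes between profiles with the same schema, use the profile's diff function. The diff provides overall file and sampling differences as well as detailed differences in the data's statistics. For example, numerical columns get both a t-test to evaluate similarity and a PSI (Population Stability Index) to quantify column distribution shift. More information is described in the Profiler section of the GitHub Pages.

Create the difference report like this: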

import json
import dataprofiler as dp

# Load a CSV file
data1 = dp.Data("file_a.csv")
profile1 = dp.Profiler(data1)

# Load another CSV file
data2 = dp.Data("file_b.csv")
profile2 = dp.Profiler(data2)

diff_report = profile1.diff(profile2)
print(json.dumps(diff_report, indent=4))
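
For intuition about the PSI mentioned above, here is a minimal sketch of the commonly used formula: the sum over bins of (actual share - expected share) * ln(actual share / expected share). It assumes both histograms were built with the same bin edges and floors each share at a small epsilon to avoid dividing by zero. The function name, the epsilon value, and the binning are assumptions of this sketch; the library's diff may bin and regularize differently.

```python
import math


def population_stability_index(expected_counts, actual_counts, epsilon=1e-6):
    """PSI between two histograms that share the same bin edges."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must have the same number of bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    if expected_total == 0 or actual_total == 0:
        raise ValueError("histograms must not be empty")
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Convert counts to shares and floor them so the logarithm is defined.
        expected_share = max(expected / expected_total, epsilon)
        actual_share = max(actual / actual_total, epsilon)
        psi += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return psi


same = population_stability_index([10, 20, 30], [10, 20, 30])
shifted = population_stability_index([10, 20, 30], [30, 20, 10])
print(same, shifted)
```

Identical distributions give a PSI of 0, and the value grows as the two distributions move apart.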

Profile a Pandas DataFrame

import pandas as pd
import dataprofiler as dp
import json

my_dataframe = pd.DataFrame([[1, 2.0], [1, 2.2], [-1, 3]])
profile = dp.Profiler(my_dataframe)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

# read a specified column, in this case it is labeled 0:
print(json.dumps(report["data_stats"][0], indent=4))

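Unstructured Profiler

In addition to the structured profiler, the DataProfiler provides unstructured profiling for the TextData object or a string. Unstructured profiling also works with list(string), pd.Series(string), or pd.DataFrame(string) when the profiler_type option is specified as unstructured. Below is an example of the unstructured profiler with a text file.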

import dataprofiler as dp
import json

my_text = dp.Data('text_file.txt')
profile = dp.Profiler(my_text)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Another example of the unstructured profiler, this time with a pd.Series of strings and the profiler option profiler_type='unstructured', is given below.

import dataprofiler as dp
import pandas as pd
import json

text_data = pd.Series(['first string', 'second string'])
profile = dp.Profiler(text_data, profiler_type='unstructured')

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Graph Profiler

DataProfiler also provides the ability to profile graph data from a CSV file. Below is an example of the graph profiler with a graph data CSV file.

import dataprofiler as dp
import pprint

my_graph = dp.Data('graph_file.csv')
profile = dp.Profiler(my_graph)

# print the report using pretty print (json dump does not work on numpy array values inside dict)
report = profile.report()
printer = pprint.PrettyPrinter(sort_dicts=False, compact=True)
printer.pprint(report)

Visit the documentation page for additional examples and API details.

References

Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions
Authors: Anh Truong, Austin Walters, Jeremy Goodsitt
2020 https://arxiv.org/abs/2012.09597
The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
