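Data Profiler | What's in your data?

The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.

Loading data takes a single command: the library automatically identifies the file format and loads the file into a DataFrame. When profiling the data, the library identifies the schema, statistics, entities (PII / NPI), and more. The resulting data profile can then be used in downstream applications or reports.

Getting started takes only a few lines of code (CSV example):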

import json
from dataprofiler import Data, Profiler

data = Data("your_file.csv")  # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL

print(data.data.head(5))  # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data)  # Calculate Statistics, Entity Recognition, etc

readable_report = profile.report(report_options={"output_format": "compact"})

print(json.dumps(readable_report, indent=4))

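Note: The Data Profiler comes with a pre-trained deep learning model used to efficiently identify sensitive data (PII / NPI). If desired, it is easy to add new entities to the existing pre-trained model or to insert an entirely new pipeline for entity recognition.

For API documentation, visit the documentation page.

If you have suggestions or find a bug, please open an issue.

If you want to contribute, visit the contributing page.

Install

To install the full package from PyPI: pip install DataProfiler[full]

To install the ML dependencies without report generation, use DataProfiler[ml].

If the ML requirements are too strict (for example, you do not want to install TensorFlow), you can install a slimmer package with DataProfiler[reports]. The slimmer package disables the default sensitive data detection / entity recognition (labeler).

Install from PyPI: pip install DataProfiler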
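
Because the slimmer install leaves out the ML dependencies, it can help to check for them before relying on the labeler. The following is a minimal sketch using only the standard library; the helper name has_module is hypothetical, and treating the presence of the tensorflow module as a proxy for "ML extras installed" is an assumption, not something the library documents here.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumption: the ML extras pull in TensorFlow, so its presence is used as a hint.
labeler_likely_available = has_module("tensorflow")
print("ML dependencies found:", labeler_likely_available)
```

If the check fails, installing DataProfiler[ml] or DataProfiler[full] as described above should provide the missing dependencies.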

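What is a Data Profile?

In the case of this library, a data profile is a dictionary containing statistics and predictions about the underlying dataset. It has "global statistics", or global_stats, which contain dataset-level data, and "column/row-level statistics", or data_stats (each column is a new key-value entry).

The format for a structured profile is below: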

 "global_stats": {
    "samples_used": int,
    "column_count": int,
    "row_count": int,
    "row_has_null_ratio": float,
    "row_is_null_ratio": float,
    "unique_row_ratio": float,
    "duplicate_row_count": int,
    "file_type": string,
    "encoding": string,
    "correlation_matrix": list[list[int]], (*)
    "chi2_matrix": list[list[float]],
    "profile_schema": {
        string: list[int]
    },
    "times": dict[string, float],
},
"data_stats": [
    {
        "column_name": string,
        "data_type": string,
        "data_label": string,
        "categorical": bool,
        "order": string,
        "samples": list[str],
        "statistics": {
            "sample_size": int,
            "null_count": int,
            "null_types": list[string],
            "null_types_index": {
                string: list[int]
            },
            "data_type_representation": dict[string, float],
            "min": [null, float, str],
            "max": [null, float, str],
            "mode": float,
            "median": float,
            "median_absolute_deviation": float,
            "sum": float,
            "mean": float,
            "variance": float,
            "stddev": float,
            "skewness": float,
            "kurtosis": float,
            "num_zeros": int,
            "num_negatives": int,
            "histogram": {
                "bin_counts": list[int],
                "bin_edges": list[float],
            },
            "quantiles": {
                int: float
            },
            "vocab": list[char],
            "avg_predictions": dict[string, float],
            "data_label_representation": dict[string, float],
            "categories": list[str],
            "unique_count": int,
            "unique_ratio": float,
            "categorical_count": dict[string, int],
            "gini_impurity": float,
            "unalikeability": float,
            "precision": {
                'min': int,
                'max': int,
                'mean': float,
                'var': float,
                'std': float,
                'sample_size': int,
                'margin_of_error': float,
                'confidence_level': float
            },
            "times": dict[string, float],
            "format": string
        },
        "null_replication_metrics": {
            "class_prior": list[int],
            "class_sum": list[list[int]],
            "class_mean": list[list[int]]
        }
    }
]

(*) The correlation matrix update is currently toggled off. It will be reset in a later update. Users can still turn it on as desired through the is_enable option.
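
Since a structured profile is a plain dictionary, ordinary dictionary and list access is enough to pull values out of it. The sketch below works on a hand-written stand-in that follows the shape shown above; the values are made up and the helper null_ratio_by_column is hypothetical, not part of the library.

```python
# Hand-written stand-in for a structured report; the values are invented.
report = {
    "global_stats": {"column_count": 2, "row_count": 4},
    "data_stats": [
        {
            "column_name": "id",
            "data_type": "int",
            "statistics": {"sample_size": 4, "null_count": 0},
        },
        {
            "column_name": "email",
            "data_type": "string",
            "statistics": {"sample_size": 4, "null_count": 1},
        },
    ],
}


def null_ratio_by_column(report: dict) -> dict:
    """Map each column name to null_count / sample_size."""
    ratios = {}
    for column in report["data_stats"]:
        stats = column["statistics"]
        size = stats["sample_size"]
        ratios[column["column_name"]] = stats["null_count"] / size if size else 0.0
    return ratios


print(null_ratio_by_column(report))
```

The same pattern applies to any other key under statistics, such as unique_ratio or mean.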

The format for an unstructured profile is below:

 "global_stats": {
    "samples_used": int,
    "empty_line_count": int,
    "file_type": string,
    "encoding": string,
    "memory_size": float, # in MB
    "times": dict[string, float],
},
"data_stats": {
    "data_label": {
        "entity_counts": {
            "word_level": dict[string, int],
            "true_char_level": dict[string, int],
            "postprocess_char_level": dict[string, int]
        },
        "entity_percentages": {
            "word_level": dict[string, float],
            "true_char_level": dict[string, float],
            "postprocess_char_level": dict[string, float]
        },
        "times": dict[string, float]
    },
    "statistics": {
        "vocab": list[char],
        "vocab_count": dict[string, int],
        "words": list[string],
        "word_count": dict[string, int],
        "times": dict[string, float]
    }
}

The format for a graph profile is below:

 "num_nodes": int,
"num_edges": int,
"categorical_attributes": list[string],
"continuous_attributes": list[string],
"avg_node_degree": float,
"global_max_component_size": int,
"continuous_distribution": {
    "<attribute_1>": {
        "name": string,
        "scale": float,
        "properties": list[float, np.array]
    },
    "<attribute_2>": None,
    ...
},
"categorical_distribution": {
    "<attribute_1>": None,
    "<attribute_2>": {
        "bin_counts": list[int],
        "bin_edges": list[float]
    },
    ...
},
"times": dict[string, float]

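Profile Statistic Descriptions

Structured Profile

global_stats:

samples_used - number of input data samples used to generate this profile
column_count - the number of columns contained in the input dataset
row_count - the number of rows contained in the input dataset
row_has_null_ratio - the proportion of rows that contain at least one null value to the total number of rows
row_is_null_ratio - the proportion of rows that consist entirely of null values (null rows) to the total number of rows
unique_row_ratio - the proportion of distinct rows in the input dataset to the total number of rows
duplicate_row_count - the number of rows that occur more than once in the input dataset
file_type - the format of the file containing the input dataset (e.g. .csv)
encoding - the encoding of the file containing the input dataset (e.g. UTF-8)
correlation_matrix - matrix of shape column_count x column_count containing the correlation coefficients between each pair of columns in the dataset
chi2_matrix - matrix of shape column_count x column_count containing the chi-square statistics between each pair of columns in the dataset
profile_schema - a description of the format of the input dataset, labeling each column and its index in the dataset
- string - the label of the column in question and its index in the profile schema
times - the time it took to generate the global statistics for this dataset, in milliseconds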
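
The row-level ratios above can be illustrated on a small in-memory table. This is a minimal sketch, assuming rows are tuples and that None is the only null marker; the function global_row_ratios is hypothetical and does not reproduce the library's sampling or its detection of other null types.

```python
def global_row_ratios(rows: list) -> dict:
    """Compute row-level null and uniqueness ratios for a list of tuples."""
    total = len(rows)
    if total == 0:
        return {
            "row_has_null_ratio": 0.0,
            "row_is_null_ratio": 0.0,
            "unique_row_ratio": 0.0,
        }
    # Rows with at least one null value.
    has_null = sum(1 for row in rows if any(value is None for value in row))
    # Rows made up entirely of null values.
    is_null = sum(1 for row in rows if all(value is None for value in row))
    # Distinct rows; tuples are hashable, so a set is enough.
    distinct = len(set(rows))
    return {
        "row_has_null_ratio": has_null / total,
        "row_is_null_ratio": is_null / total,
        "unique_row_ratio": distinct / total,
    }


rows = [(1, "a"), (1, "a"), (2, None), (None, None)]
print(global_row_ratios(rows))
```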

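data_stats:

column_name - the label/title of this column in the input dataset
data_type - the primitive Python data type contained within this column
data_label - the label/entity of the data in this column, as determined by the Labeler component
categorical - 'true' if this column contains categorical data
order - the way in which the data in this column is ordered, if any, otherwise "random"
samples - a small subset of data entries from this column
statistics - statistical information on the column
- sample_size - number of input data samples used to generate this profile
- null_count - the number of null entries in the sample
- null_types - a list of the different null types present within this sample
- null_types_index - a dict containing each null type and a respective list of the indices at which it is present within this sample
- data_type_representation - the percentage of samples identifying as each data_type
- min - minimum value in the sample
- max - maximum value in the sample
- mode - mode of the entries in the sample
- median - median of the entries in the sample
- median_absolute_deviation - the median absolute deviation of the entries in the sample
- sum - the total of all sampled values from the column
- mean - the average of all entries in the sample
- variance - the variance of all entries in the sample
- stddev - the standard deviation of all entries in the sample
- skewness - the statistical skewness of all entries in the sample
- kurtosis - the statistical kurtosis of all entries in the sample
- num_zeros - the number of entries in this sample that have the value 0
- num_negatives - the number of entries in this sample that have a value less than 0
- histogram - contains histogram-relevant information
  - bin_counts - the number of entries within each bin
  - bin_edges - the thresholds of each bin
- quantiles - the value at each percentile, in the order they are listed, based on the entries in the sample
- vocab - a list of the characters used within the entries in this sample
- avg_predictions - average of the data label prediction confidences across all data points sampled
- categories - a list of each distinct category within the sample if categorical = 'true'
- unique_count - the number of distinct entries in the sample
- unique_ratio - the proportion of distinct entries in the sample to the total number of entries in the sample
- categorical_count - number of entries sampled for each category if categorical = 'true'
- gini_impurity - measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset
- unalikeability - a value denoting how frequently entries differ from one another within the sample
- precision - a dict of statistics on the number of digits in each sampled number
- times - the time it took to generate this sample's statistics, in milliseconds
- format - list of possible datetime formats
null_replication_metrics - statistics of the data partitioned by whether the column value is null (index 1 of the lists referenced by the dict keys) or not (index 0)
- class_prior - a list containing the probability of a column value being null and not null
- class_sum - a list containing the sum of all other rows based on whether the column value is null or not
- class_mean - a list containing the mean of all other rows based on whether the column value is null or not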
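
Two of the categorical statistics above, gini_impurity and unalikeability, are easy to misread from their descriptions alone. The sketch below uses the textbook definitions: Gini impurity as 1 minus the sum of squared category proportions, and unalikeability as the share of ordered pairs of entries that differ. Whether the library uses exactly this normalization is an assumption, and both function bodies are illustrative only.

```python
from collections import Counter


def gini_impurity(values: list) -> float:
    """1 - sum(p_k^2) over the category proportions p_k."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def unalikeability(values: list) -> float:
    """Fraction of ordered pairs (i != j) whose entries differ."""
    n = len(values)
    if n < 2:
        return 0.0
    counts = Counter(values)
    # n*n ordered pairs in total, minus the pairs that share a category.
    differing_pairs = n * n - sum(count * count for count in counts.values())
    return differing_pairs / (n * n - n)


sample = ["a", "a", "b", "b"]
print(gini_impurity(sample), unalikeability(sample))
```

Both values are 0 when every entry is identical and grow as the categories become more evenly mixed.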

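Unstructured Profile

global_stats:

samples_used - number of input data samples used to generate this profile
empty_line_count - the number of empty lines in the input data
file_type - the file type of the input data (e.g. .txt)
encoding - the file encoding of the input data file (e.g. UTF-8)
memory_size - size of the input data in MB
times - the time it took to generate this profile, in milliseconds

data_stats:

data_label - labels and statistics on the labels of the input data
- entity_counts - the number of times a specific label or entity appears inside the input data
  - word_level - the number of words counted within each label or entity
  - true_char_level - the number of characters counted within each label or entity, as determined by the model
  - postprocess_char_level - the number of characters counted within each label or entity, as determined by the postprocessor
- entity_percentages - the percentages of each label or entity within the input data
  - word_level - the percentage of words in the input data that are contained within each label or entity
  - true_char_level - the percentage of characters in the input data that are contained within each label or entity, as determined by the model
  - postprocess_char_level - the percentage of characters in the input data that are contained within each label or entity, as determined by the postprocessor
- times - the time it took for the data labeler to predict on the data
statistics - statistics of the input data
- vocab - a list of each character in the input data
- vocab_count - the number of occurrences of each distinct character in the input data
- words - a list of each word in the input data
- word_count - the number of occurrences of each distinct word in the input data
- times - the time it took to generate the vocab and words statistics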
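
The vocab and word statistics above amount to character and word frequency counts. The sketch below shows the idea with collections.Counter, assuming plain whitespace tokenization with no case folding or stop-word handling; the library's own tokenization may differ, and text_statistics is a hypothetical helper.

```python
from collections import Counter


def text_statistics(samples: list) -> dict:
    """Count characters and whitespace-separated words across text samples."""
    vocab_count = Counter()
    word_count = Counter()
    for text in samples:
        vocab_count.update(text)          # iterating a str yields characters
        word_count.update(text.split())   # naive whitespace tokenization
    return {
        "vocab": sorted(vocab_count),
        "vocab_count": dict(vocab_count),
        "words": sorted(word_count),
        "word_count": dict(word_count),
    }


stats = text_statistics(["to be", "or not to be"])
print(stats["word_count"])
```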

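Graph Profile

num_nodes - number of nodes in the graph
num_edges - number of edges in the graph
categorical_attributes - list of categorical edge attributes
continuous_attributes - list of continuous edge attributes
avg_node_degree - average degree of nodes in the graph
global_max_component_size - size of the global max component

continuous_distribution:

<attribute_N> - name of the N-th edge attribute in the list of attributes
- name - name of the distribution for the attribute
- scale - negative log likelihood used to scale and compare distributions
- properties - list of statistical properties describing the distribution
  - [shape (optional), loc, scale, mean, variance, skew, kurtosis]

categorical_distribution:

<attribute_N> - name of the N-th edge attribute in the list of attributes
- bin_counts - counts in each bin of the distribution histogram
- bin_edges - edges of each bin of the distribution histogram
times - the time it took to generate this profile, in milliseconds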
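
To make num_nodes, num_edges, avg_node_degree, and global_max_component_size concrete, here is a minimal sketch that derives them from an edge list. It assumes an undirected graph given as (source, target) pairs with no duplicate edges; the function graph_summary is hypothetical and is not how the library computes its graph profile.

```python
from collections import defaultdict


def graph_summary(edges: list) -> dict:
    """Summarize an undirected graph given as (source, target) pairs."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    num_nodes = len(adjacency)
    num_edges = len(edges)
    # Each undirected edge contributes to the degree of two nodes.
    avg_degree = 2 * num_edges / num_nodes if num_nodes else 0.0

    # Largest connected component, found with an iterative depth-first search.
    seen = set()
    largest = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        largest = max(largest, size)

    return {
        "num_nodes": num_nodes,
        "num_edges": num_edges,
        "avg_node_degree": avg_degree,
        "global_max_component_size": largest,
    }


print(graph_summary([(1, 2), (2, 3), (4, 5)]))
```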

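Support

Supported Data Formats

Delimited files (CSV, TSV, etc.)
JSON objects
Avro files
Parquet files
Text files
Pandas DataFrames
A URL that points to one of the supported file types above

Data Types

Data types are determined at the column level for structured data.

Int
Float
String
DateTime

Data Labels

Data labels are determined per cell for structured data (per column/row when the profiler is used) or at the character level for unstructured data.

UNKNOWN
ADDRESS
BAN (bank account number, 10-18 digits)
CREDIT_CARD
EMAIL_ADDRESS
UUID
HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
IPV4
IPV6
MAC_ADDRESS
PERSON
PHONE_NUMBER
SSN
URL
US_STATE
DRIVERS_LICENSE
DATE
TIME
DATETIME
INTEGER
FLOAT
QUANTITY
ORDINAL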
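
A common use of data labels is to flag columns that may hold sensitive values. The sketch below filters the data_stats entries of a structured profile by label. The set of labels treated as sensitive is a hypothetical choice, the upper-case label spellings are assumed, and splitting a label on "|" to handle a column that carries more than one label is also an assumption about the report format.

```python
# Hypothetical policy: which labels this application treats as sensitive.
SENSITIVE_LABELS = frozenset(
    {"SSN", "CREDIT_CARD", "BAN", "EMAIL_ADDRESS", "PHONE_NUMBER", "DRIVERS_LICENSE"}
)


def sensitive_columns(data_stats: list, sensitive=SENSITIVE_LABELS) -> list:
    """Return the names of columns whose data_label matches a sensitive label."""
    flagged = []
    for column in data_stats:
        # Assumption: multiple labels, if present, are joined with "|".
        labels = set(str(column.get("data_label", "")).split("|"))
        if labels & sensitive:
            flagged.append(column["column_name"])
    return flagged


# Hand-written stand-in for the data_stats list of a structured profile.
example_stats = [
    {"column_name": "id", "data_label": "INTEGER"},
    {"column_name": "contact", "data_label": "EMAIL_ADDRESS"},
    {"column_name": "notes", "data_label": "UNKNOWN"},
]
print(sensitive_columns(example_stats))
```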

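Get Started

Load a File

The Data Profiler can profile the following data/file types:

CSV files (or any delimited file)
JSON objects
Avro files
Parquet files
Text files
Pandas DataFrames
A URL that points to one of the supported file types above

The profiler should automatically identify the file type and load the data into a Data class.

Along with other attributes, the Data class makes the data accessible as a valid Pandas DataFrame.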

# Load a csv file, return a CSVData object
csv_data = Data('your_file.csv')

# Print the first 10 rows of the csv file
print(csv_data.data.head(10))

# Load a parquet file, return a ParquetData object
parquet_data = Data('your_file.parquet')

# Sort the data by the name column
parquet_data.data.sort_values(by='name', inplace=True)

# Print the sorted first 10 rows of the parquet data
print(parquet_data.data.head(10))

# Load a json file from a URL, return a JSONData object
json_data = Data('https://github.com/capitalone/DataProfiler/blob/main/dataprofiler/tests/data/json/iris-utf-8.json')

If the file type is not automatically identified (rare), you can specify it explicitly; see the section Specifying a Filetype or Delimiter.

Profile a File

The example below uses a CSV file, but CSV, JSON, Avro, Parquet, or text files work as well.

import json
from dataprofiler import Data, Profiler

# Load file (CSV should be automatically identified)
data = Data("your_file.csv")

# Profile the dataset
profile = Profiler(data)

# Generate a report and use json to prettify.
report = profile.report(report_options={"output_format": "pretty"})

# Print the report
print(json.dumps(report, indent=4))

Updating Profiles

Currently, the Data Profiler can update its profile in batches.

import json
from dataprofiler import Data, Profiler

# Load and profile a CSV file
data = Data("your_file.csv")
profile = Profiler(data)

# Update the profile with new data:
new_data = Data("new_data.csv")
profile.update_profile(new_data)

# Print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

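Note that if the data you update the profile with contains integer indices that overlap with the indices of the data originally profiled, the indices are "shifted" to uninhabited values when null rows are calculated, so that null counts and ratios remain accurate.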
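
The idea of shifting overlapping indices can be sketched as follows. This only illustrates the concept: the offset policy (moving the incoming batch just past the largest existing index) is an assumption, and shift_overlapping_indices is a hypothetical helper, not the library's internal routine.

```python
def shift_overlapping_indices(existing: set, incoming: list) -> list:
    """Move incoming integer indices past the existing ones if they collide."""
    if not existing or existing.isdisjoint(incoming):
        # Nothing overlaps, so the incoming indices can be kept as they are.
        return list(incoming)
    # Assumed policy: shift the whole batch so it starts after the largest
    # index that is already in use.
    offset = max(existing) + 1 - min(incoming)
    return [index + offset for index in incoming]


print(shift_overlapping_indices({0, 1, 2}, [1, 2, 3]))
```

Shifting the whole batch by one offset keeps the relative order of the incoming rows, so a null row recorded at a given position still refers to a single, unambiguous row.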

Merging Profiles

If you have two files with the same schema (but different data), you can merge the two profiles with the addition operator.

This also allows profiles to be computed in a distributed manner.

import json
from dataprofiler import Data, Profiler

# Load a CSV file with a schema
data1 = Data("file_a.csv")
profile1 = Profiler(data1)

# Load another CSV file with the same schema
data2 = Data("file_b.csv")
profile2 = Profiler(data2)

profile3 = profile1 + profile2

# Print the report using json to prettify.
report = profile3.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

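Note that if the merged profiles had overlapping integer indices, the indices are "shifted" to uninhabited values when null rows are calculated, so that null counts and ratios remain accurate.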
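
Merging with an addition operator works for distributed profiling because many statistics can be rebuilt from small, additive summaries. The sketch below shows that principle with a hypothetical RunningStats class that keeps a count, a sum, and a sum of squares; it is not the library's Profiler, and the sum-of-squares variance formula is chosen for brevity rather than numerical robustness.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Additive summary from which mean and sample variance can be derived."""

    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    @classmethod
    def from_values(cls, values):
        values = list(values)
        return cls(len(values), float(sum(values)), float(sum(v * v for v in values)))

    def __add__(self, other):
        # Merging two summaries only requires adding their fields.
        return RunningStats(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")

    @property
    def variance(self):
        """Sample variance (n - 1 denominator)."""
        if self.count < 2:
            return float("nan")
        return (self.total_sq - self.total ** 2 / self.count) / (self.count - 1)


part_a = RunningStats.from_values([1, 2, 3])  # e.g. computed on one worker
part_b = RunningStats.from_values([4, 5])     # e.g. computed on another worker
merged = part_a + part_b
print(merged.mean, merged.variance)
```

The merged summary matches one computed over all five values at once, which is the property that lets partial profiles be built independently and combined afterwards.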

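Profiler Differences

To find the changes between profiles with the same schema, use the profile's diff function. The diff reports overall file and sampling differences as well as detailed differences in the data's statistics. For example, numerical columns get both a t-test to evaluate similarity and a PSI (Population Stability Index) to quantify column distribution shift. More information is given in the Profilers section of the GitHub Pages.

Create the difference report like this: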

import json
import dataprofiler as dp

# Load a CSV file
data1 = dp.Data("file_a.csv")
profile1 = dp.Profiler(data1)

# Load another CSV file
data2 = dp.Data("file_b.csv")
profile2 = dp.Profiler(data2)

diff_report = profile1.diff(profile2)
print(json.dumps(diff_report, indent=4))
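
For readers unfamiliar with PSI, the sketch below computes the standard Population Stability Index from two histograms that share the same bins: the sum over bins of (q - p) * ln(q / p), where p and q are the bin proportions of the two samples. The function name, the eps floor for empty bins, and the requirement of identical bin edges are assumptions of this sketch; the library's own binning and smoothing may differ.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between two histograms defined over the same bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    if expected_total == 0 or actual_total == 0:
        raise ValueError("histograms must not be empty")

    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor the proportions so empty bins do not produce log(0).
        p = max(expected / expected_total, eps)
        q = max(actual / actual_total, eps)
        psi += (q - p) * math.log(q / p)
    return psi


print(population_stability_index([10, 20, 30], [30, 20, 10]))
```

Identical distributions give a PSI of 0, and the value grows as the two histograms diverge.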

Profile a Pandas DataFrame

import pandas as pd
import dataprofiler as dp
import json

my_dataframe = pd.DataFrame([[1, 2.0], [1, 2.2], [-1, 3]])
profile = dp.Profiler(my_dataframe)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

# read a specified column, in this case it is labeled 0:
print(json.dumps(report["data_stats"][0], indent=4))

Unstructured Profiler

In addition to the structured profiler, the DataProfiler provides unstructured profiling for a TextData object or a string. Unstructured profiling also works with list(string), pd.Series(string), or pd.DataFrame(string) when the profiler_type option is set to unstructured. Below is an example of the unstructured profiler with a text file.

import dataprofiler as dp
import json

my_text = dp.Data('text_file.txt')
profile = dp.Profiler(my_text)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Another example of the unstructured profiler, this time with a pd.Series of strings and the profiler option profiler_type='unstructured', is given below.

import dataprofiler as dp
import pandas as pd
import json

text_data = pd.Series(['first string', 'second string'])
profile = dp.Profiler(text_data, profiler_type='unstructured')

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Graph Profiler

The DataProfiler can also profile graph data from a CSV file. Below is an example of the graph profiler with a graph data CSV file.

import dataprofiler as dp
import pprint

my_graph = dp.Data('graph_file.csv')
profile = dp.Profiler(my_graph)

# print the report using pretty print (json dump does not work on numpy array values inside dict)
report = profile.report()
printer = pprint.PrettyPrinter(sort_dicts=False, compact=True)
printer.pprint(report)
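
If JSON output is still wanted for a report that holds array values, one option is to give json.dumps a fallback converter through its default parameter. This is a minimal sketch: to_jsonable is a hypothetical helper, and FakeArray merely stands in for a NumPy array (which also exposes tolist()) so the example runs without NumPy.

```python
import json


def to_jsonable(obj):
    """Fallback for json.dumps: convert array-like objects exposing tolist()."""
    if hasattr(obj, "tolist"):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


class FakeArray:
    """Stand-in for a NumPy array so the sketch runs without NumPy."""

    def __init__(self, values):
        self._values = list(values)

    def tolist(self):
        return list(self._values)


# Hand-written stand-in for part of a graph report; the values are invented.
graph_report = {"num_nodes": 3, "properties": FakeArray([0.5, 1.5])}
print(json.dumps(graph_report, default=to_jsonable, indent=4))
```

json.dumps calls the default function only for objects it cannot serialize on its own, so ordinary values in the report are unaffected.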

Visit the documentation page for additional examples and API details.

References

Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions
Authors: Anh Truong, Austin Walters, Jeremy Goodsitt
2020 https://arxiv.org/abs/2012.09597
The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
