imodelsX

A scikit-learn-friendly library to explain, predict, and steer text models/data.
It also includes a number of utilities for getting started with text data.

Demo notebooks

Explainable modeling/steering

Model	Output	Description
Tree-Prompt	Explanation + Steering	Generates a tree of prompts to steer an LLM (Official)
iPrompt	Explanation + Steering	Generates a prompt that explains patterns in data (Official)
AutoPrompt	Explanation + Steering	Finds a natural-language prompt using input gradients
D3	Explanation	Explains the difference between two distributions
SASC	Explanation	Explains a black-box text module using an LLM (Official)
Aug-Linear	Linear model	Fits a better linear model by using an LLM to extract embeddings (Official)
Aug-Tree	Decision tree	Fits a better decision tree by using an LLM to expand features (Official)
QAEmb	Explainable embedding	Generates interpretable embeddings by asking LLMs questions (Official)
KAN	Small network	Fits a 2-layer Kolmogorov-Arnold network
We plan to support other interpretable algorithms such as RLPrompt, CBMs, and NBDT. If you would like to contribute an algorithm, feel free to open a PR.

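General utilities

Utility	Description
LLM wrapper	Easily call different LLMs
Dataset wrapper	Download minimally processed Hugging Face datasets
Bag of ngrams	Learn a linear model of ngrams
Linear finetune	Finetune a single linear layer on top of LLM embeddings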

Quickstart

Installation: pip install imodelsx (or, for more control, clone the repository and install from source)

Demos: see the demo notebooks

Natural-language explanations

Tree-Prompt

from imodelsx import TreePromptClassifier
import datasets
import numpy as np
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# set up data: subsample 100 training and 100 validation examples
rng = np.random.default_rng(seed=42)
dset_train = datasets.load_dataset('rotten_tomatoes')['train']
dset_train = dset_train.select(rng.choice(
    len(dset_train), size=100, replace=False))
dset_val = datasets.load_dataset('rotten_tomatoes')['validation']
dset_val = dset_val.select(rng.choice(
    len(dset_val), size=100, replace=False))

# set up arguments
prompts = [
    "This movie is",
    " Positive or Negative? The movie was",
    " The sentiment of the movie was",
    " The plot of the movie was really",
    " The acting in the movie was",
]
verbalizer = {0: " Negative.", 1: " Positive."}
checkpoint = "gpt2"

# fit model
m = TreePromptClassifier(
    checkpoint=checkpoint,
    prompts=prompts,
    verbalizer=verbalizer,
    cache_prompt_features_dir=None,  # 'cache_prompt_features_dir/gp2',
)
m.fit(dset_train["text"], dset_train["label"])

# compute accuracy
preds = m.predict(dset_val['text'])
print('\nTree-Prompt acc (val) ->',
      np.mean(preds == dset_val['label']))  # -> 0.7

# compare to accuracy for individual prompts
for i, prompt in enumerate(prompts):
    print(i, prompt, '->', m.prompt_accs_[i])  # -> 0.65, 0.5, 0.5, 0.56, 0.51

# visualize decision tree
plot_tree(
    m.clf_,
    fontsize=10,
    feature_names=m.feature_names_,
    class_names=list(verbalizer.values()),
    filled=True,
)
plt.show()
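
To make the idea of a tree of prompts concrete, the following is a minimal sketch in plain Python. It assumes each prompt yields a binary answer for an input, and that a tree routes on those answers. The tree structure, the `fake_outputs` list, and the function `predict_with_prompt_tree` are all hypothetical and are not part of imodelsX; how TreePromptClassifier evaluates prompts internally is not described here.

```python
# Hypothetical hand-written tree: internal nodes name a prompt index,
# leaves are class labels (0 = Negative, 1 = Positive).
toy_tree = {
    "prompt": 0,
    "yes": 1,
    "no": {"prompt": 3, "yes": 1, "no": 0},
}


def predict_with_prompt_tree(tree, ask_prompt):
    """Walk the tree, calling ask_prompt(i) only for prompts on the path."""
    node = tree
    n_calls = 0
    while isinstance(node, dict):
        answer = ask_prompt(node["prompt"])
        n_calls += 1
        node = node["yes"] if answer == 1 else node["no"]
    return node, n_calls


# Stand-in for LLM calls: made-up per-prompt answers for one input text.
fake_outputs = [0, 1, 1, 1, 0]
label, n_calls = predict_with_prompt_tree(toy_tree, lambda i: fake_outputs[i])
print(label, n_calls)
```

In this toy setting only the prompts on the root-to-leaf path are queried, which is one reason a tree over prompts can be easier to inspect than an ensemble that always uses every prompt.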

iPrompt

from imodelsx import explain_dataset_iprompt, get_add_two_numbers_dataset

# get a simple dataset of adding two numbers
input_strings, output_strings = get_add_two_numbers_dataset(num_examples=100)
for i in range(5):
    print(repr(input_strings[i]), repr(output_strings[i]))

# explain the relationship between the inputs and outputs
# with a natural-language prompt string
prompts, metadata = explain_dataset_iprompt(
    input_strings=input_strings,
    output_strings=output_strings,
    checkpoint='EleutherAI/gpt-j-6B',  # which language model to use
    num_learned_tokens=3,  # how long of a prompt to learn
    n_shots=3,  # shots per example
    n_epochs=15,  # how many epochs to search
    verbose=0,  # how much to print
    llm_float16=True,  # whether to load the model in float16
)
# prompts is a list of the natural-language prompt strings that were found

D3 (DescribeDistributionalDifferences)

from imodelsx import explain_dataset_d3

# positive_samples and negative_samples must be defined by the caller
hypotheses, hypothesis_scores = explain_dataset_d3(
    pos=positive_samples,  # List[str] of positive examples
    neg=negative_samples,  # another List[str]
    num_steps=100,
    num_folds=2,
    batch_size=64,
)

SASC

Here, we explain a module rather than a dataset.

import numpy as np
from imodelsx import explain_module_sasc

# a toy module that responds to the length of a string
mod = lambda str_list: np.array([len(s) for s in str_list])

# a toy dataset where the longest strings are animals
text_str_list = ["red", "blue", "x", "1", "2", "hippopotamus", "elephant", "rhinoceros"]
explanation_dict = explain_module_sasc(
    text_str_list,
    mod,
    ngrams=1,
)
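
To see why the animals are the natural explanation for this toy module, it helps to rank the inputs by the module's response. The sketch below does only that, in plain Python without NumPy; it illustrates what the toy module responds to and is not a description of how explain_module_sasc works internally. The names `toy_module` and `top_inputs` are introduced here for illustration.

```python
def toy_module(str_list):
    """Same toy module as above: responds to the length of each string."""
    return [len(s) for s in str_list]


text_str_list = ["red", "blue", "x", "1", "2",
                 "hippopotamus", "elephant", "rhinoceros"]

# Score every input and sort from strongest to weakest response
scores = toy_module(text_str_list)
ranked = sorted(zip(text_str_list, scores), key=lambda pair: pair[1], reverse=True)

# The three inputs that elicit the largest response are all animals
top_inputs = [text for text, _ in ranked[:3]]
print(top_inputs)  # ['hippopotamus', 'rhinoceros', 'elephant']
```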

Aug-imodels

Use these just like a scikit-learn model. During training, they use an LLM to fit better features, but at test time they are extremely fast and completely transparent.

from imodelsx import AugLinearClassifier, AugTreeClassifier, AugLinearRegressor, AugTreeRegressor
import datasets
import numpy as np

# set up data: subsample 300 training and 300 validation examples
dset = datasets.load_dataset('rotten_tomatoes')['train']
dset = dset.select(np.random.choice(len(dset), size=300, replace=False))
dset_val = datasets.load_dataset('rotten_tomatoes')['validation']
dset_val = dset_val.select(np.random.choice(len(dset_val), size=300, replace=False))

# fit model
m = AugLinearClassifier(
    checkpoint='textattack/distilbert-base-uncased-rotten-tomatoes',
    ngrams=2,  # use bigrams
)
m.fit(dset['text'], dset['label'])

# predict
preds = m.predict(dset_val['text'])
print('acc_val', np.mean(preds == dset_val['label']))

# interpret
print('Total ngram coefficients: ', len(m.coefs_dict_))
print('Most positive ngrams')
for k, v in sorted(m.coefs_dict_.items(), key=lambda item: item[1], reverse=True)[:8]:
    print('\t', k, round(v, 2))
print('Most negative ngrams')
for k, v in sorted(m.coefs_dict_.items(), key=lambda item: item[1])[:8]:
    print('\t', k, round(v, 2))
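
The claim that such a model is fast and transparent at test time can be illustrated with a small sketch. It assumes, as a simplification, that a prediction is an intercept plus the sum of per-ngram coefficients looked up in a dictionary like `m.coefs_dict_`. The coefficient values, the whitespace tokenizer, and the helper names `extract_ngrams` and `linear_score` are hypothetical and are not the library's implementation.

```python
def extract_ngrams(text, max_n=2):
    """Return all unigrams up to max_n-grams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def linear_score(text, coefs, intercept=0.0):
    """Sum the coefficients of every ngram found in the text (no LLM call)."""
    return intercept + sum(coefs.get(g, 0.0) for g in extract_ngrams(text))


# Hypothetical coefficients, standing in for a fitted coefficient dictionary
toy_coefs = {"great": 1.2, "not great": -2.0, "boring": -1.5}

score_pos = linear_score("a great film", toy_coefs)
score_neg = linear_score("not great", toy_coefs)
print(score_pos, score_neg)
```

Under this assumption, every prediction decomposes into named ngram contributions, so the bigram "not great" can visibly override the unigram "great".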

KAN

import imodelsx
from sklearn.datasets import make_classification, make_regression
from sklearn.metrics import accuracy_score
import numpy as np

X, y = make_classification(n_samples=5000, n_features=5, n_informative=3)
model = imodelsx.KANClassifier(hidden_layer_size=64, device='cpu',
                               regularize_activation=1.0, regularize_entropy=1.0)
model.fit(X, y)
y_pred = model.predict(X)
print('Test acc', accuracy_score(y, y_pred))

# now try regression
X, y = make_regression(n_samples=5000, n_features=5, n_informative=3)
model = imodelsx.kan.KANRegressor(hidden_layer_size=64, device='cpu',
                                  regularize_activation=1.0, regularize_entropy=1.0)
model.fit(X, y)
y_pred = model.predict(X)
print('Test correlation', np.corrcoef(y, y_pred.flatten())[0, 1])

General utilities

Easy baselines

Easy-to-fit baselines that follow the sklearn API.

from imodelsx import LinearFinetuneClassifier, LinearNgramClassifier

# dset and dset_val are the splits loaded in the Aug-imodels example above
# fit a simple one-layer finetune on top of LLM embeddings
m = LinearFinetuneClassifier(
    checkpoint='distilbert-base-uncased',
)
m.fit(dset['text'], dset['label'])
preds = m.predict(dset_val['text'])
acc = (preds == dset_val['label']).mean()
print('validation acc', acc)

LLM wrapper

An easy API for calling different language models with caching (much more lightweight than LangChain).

import imodelsx.llm

# supports any huggingface checkpoint or openai checkpoint (including chat models)
llm = imodelsx.llm.get_llm(
    checkpoint="gpt2-xl",  # text-davinci-003, gpt-3.5-turbo, ...
    CACHE_DIR=".cache",
)
out = llm("May the Force be")
llm("May the Force be")  # when computing the same string again, uses the cache
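
The caching behavior described above can be sketched as a thin wrapper that stores each output on disk, keyed by a hash of the prompt. This is a generic illustration under that assumption, not the implementation behind imodelsx.llm.get_llm; the class `CachedLLM`, the `generate_fn` argument, and the JSON file layout are hypothetical.

```python
import hashlib
import json
import os
import tempfile


class CachedLLM:
    """Wrap a text-generation function with a simple on-disk cache."""

    def __init__(self, generate_fn, cache_dir):
        self.generate_fn = generate_fn
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def __call__(self, prompt):
        # One cache file per prompt, named by the prompt's SHA-256 hash
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        path = os.path.join(self.cache_dir, key + ".json")
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return json.load(f)["output"]
        output = self.generate_fn(prompt)
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"prompt": prompt, "output": output}, f)
        return output


# Stand-in for a real model call; records how often it is actually invoked
calls = []


def fake_generate(prompt):
    calls.append(prompt)
    return prompt + " with you"


cached_llm = CachedLLM(fake_generate, cache_dir=tempfile.mkdtemp())
first = cached_llm("May the Force be")
second = cached_llm("May the Force be")  # served from the cache
print(first, len(calls))
```

Keying on the prompt alone is a simplification; a real cache would likely also need to account for the checkpoint and generation settings.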

Dataset wrapper

An API for loading Hugging Face datasets with basic preprocessing.

import imodelsx.data

dset, dataset_key_text = imodelsx.data.load_huggingface_dataset('ag_news')
# Ensures that dset has a split named 'train' and 'validation',
# and that the input data is contained for each split in a column given by {dataset_key_text}