embcompare
v1.0.1
一個簡單的Python工具,用於嵌入比較
EmbCompare是一個高度靈感來自嵌入式比較器工具的小型Python軟件包,可幫助您在視覺和數值上比較嵌入。
Embcompare可以保持類似的狀態。所有計算都是在內存中進行的,該軟件包不會帶來任何嵌入式存儲管理。
如果您需要存儲,比較和跟踪實驗的工具,則可能喜歡矢量項目。
# basic install
pip install embcompare
# installation with the gui tool
pip install embcompare[gui]Embcompare提供了一個具有三個子命令的CLI:
embcompare add用於創建或更新包含所有嵌入式信息的YAML文件:路徑,格式,標籤,術語頻率,...;embcompare report用於生成包含比較指標的JSON報告;embcompare gui用於啟動簡化的WebApp,以視覺上比較您的嵌入。Embcare使用YAML文件來引用嵌入和相關信息。默認情況下,Embcompare正在尋找當前工作目錄中名為Embcompare.yaml的文件。
embeddings :
first_embedding :
name : My first embedding
path : /abspath/to/firstembedding.json
format : json
frequencies : /abspath/to/freqs.json
frequencies_format : json
labels : /abspath/to/labels.pkl
labels_format : pkl
second_embedding :
name : My second embedding
path : /abspath/to/secondembedding.json
format : word2vec
frequencies : /abspath/to/freqs.pkl
frequencies_format : pkl
labels : /abspath/to/labels.json
labels_format : json embcompare add命令允許以程序為程序更新此文件(如果不存在,甚至創建它)。
Embcompare的目的是通過數字指標來幫助比較嵌入,這些指標可用於檢查新生成的嵌入是否與上一個嵌入有很大不同。命令embcompare report可以通過兩種方式使用:
embcompare report first_embedding
# creates a first_embedding_report.json file containing some infos about the embeddingembcompare report first_embedding second_embedding
# creates a first_embedding_second_embedding_report.json file containing comparison metrics
GUI也非常方便地比較嵌入。要啟動GUI,請使用Commande embcompare gui 。它將啟動一個簡化的應用程序,該應用程序將允許您視覺上比較配置文件中添加的嵌入式。
EmbCompare提供了幾個類來加載和比較嵌入。
Embedding類是gensim.KeyedVectors類的孩子。
它增加了幾個功能:
import json
import gensim . downloader as api
from embcompare import Embedding
word_vectors = api . load ( "glove-wiki-gigaword-100" )
with open ( "frequencies.json" , "r" ) as f :
word_frequencies = json . load ( f )
embedding = Embedding . load_from_keyedvectors ( word_vectors , frequencies = word_frequencies )
neigh_dist , neigh_ind = embedding . compute_neighborhoods ()EmbeddingComparison分類類旨在比較兩個Embedding對象:
from embcompare import EmbeddingComparison , load_embedding
emb1 = load_embedding ( "first_emb.bin" , embedding_format = "fasttext" , frequencies_path = "freqs.pkl" )
emb2 = load_embedding ( "second_emb.bin" , embedding_format = "word2vec" , frequencies_path = "freqs.pkl" )
comparison = EmbeddingComparison ({ "emb1" : emb1 , "emb2" : emb2 }, n_neighbors = 25 )
comparison . neighborhoods_similarities [ "word" ]
# 0.867EmbeddingReport類用於生成有關嵌入的小報告:
from embcompare import EmbeddingReport , load_embedding
emb1 = load_embedding ( "first_emb.bin" , embedding_format = "fasttext" , frequencies_path = "freqs.pkl" )
report = EmbeddingReport ( emb1 )
report . to_dict ()
# {
# "vector_size": 300,
# "mean_frequency": 0.00012,
# "mean_distance_neighbors": 0.023,
# ...
# } EmbeddingComparisonReport類用於生成兩個嵌入的小比較報告:
from embcompare import EmbeddingComparison , EmbeddingComparisonReport , load_embedding
emb1 = load_embedding ( "first_emb.bin" , embedding_format = "fasttext" , frequencies_path = "freqs.pkl" )
emb2 = load_embedding ( "second_emb.bin" , embedding_format = "word2vec" , frequencies_path = "freqs.pkl" )
comparison = EmbeddingComparison ({ "emb1" : emb1 , "emb2" : emb2 })
report = EmbeddingComparisonReport ( comparison )
report . to_dict ()
# {
# "embeddings" : [
# {
# "vector_size": 300,
# "mean_frequency": 0.00012,
# "mean_distance_neighbors": 0.023,
# ...
# },
# ...
# ],
# "neighborhoods_similarities_median": 0.012,
# ...
# } GUI是通過簡化的。我們嘗試將該應用程序模塊化,以便您可以更輕鬆地為您的自定義簡化應用重複使用某些功能:
# embcompare/gui/app.py
from embcompare . gui . features import (
display_custom_elements_comparison ,
display_elements_comparison ,
display_embeddings_config ,
display_frequencies_comparison ,
display_neighborhoods_similarities ,
display_numbers_of_elements ,
display_parameters_selection ,
display_spaces_comparison ,
display_statistics_comparison ,
)
from embcompare . gui . helpers import create_comparison
def main ():
"""Streamlit app for embeddings comparison"""
config_embeddings = config [ CONFIG_EMBEDDINGS ]
(
tab_infos ,
tab_stats ,
tab_spaces ,
tab_neighbors ,
tab_compare ,
tab_compare_custom ,
tab_frequencies ,
) = st . tabs (
[
"Infos" ,
"Statistics" ,
"Spaces" ,
"Similarities" ,
"Elements" ,
"Search elements" ,
"Frequencies" ,
]
)
# Embedding selection (inside the sidebar)
with st . sidebar :
parameters = display_parameters_selection ( config_embeddings )
# Display informations about embeddings
with tab_infos :
display_embeddings_config (
config_embeddings , parameters . emb1_id , parameters . emb2_id
)
comparison = create_comparison (
config_embeddings ,
emb1_id = parameters . emb1_id ,
emb2_id = parameters . emb2_id ,
n_neighbors = parameters . n_neighbors ,
max_emb_size = parameters . max_emb_size ,
min_frequency = parameters . min_frequency ,
)
# Display number of element in both embedding and common elements
with tab_infos :
display_numbers_of_elements ( comparison )
# Display statistics
with tab_stats :
display_statistics_comparison ( comparison )
if not comparison . common_keys :
st . warning ( "The embeddings have no element in common" )
st . stop ()
# Comparison below are based on common elements comparison
with tab_spaces :
display_spaces_comparison ( comparison )
with tab_neighbors :
display_neighborhoods_similarities ( comparison )
with tab_compare :
display_elements_comparison ( comparison )
with tab_compare_custom :
display_custom_elements_comparison ( comparison )
with tab_frequencies :
display_frequencies_comparison ( comparison )