embcompare
v1.0.1
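A simple Python tool for embedding comparison
EmbCompare is a small Python package, heavily inspired by the Embedding Comparator tool, that helps you compare embeddings both visually and numerically.
EmbCompare keeps things simple. All computations are done in memory, and the package does not provide any embedding storage management.
If you need a tool to store, compare, and track your experiments, you may like the vectory project.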
# basic install
pip install embcompare
# installation with the gui tool
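pip install embcompare[gui]

EmbCompare provides a CLI with three sub-commands:
embcompare add creates or updates a YAML file containing all the embedding information: path, format, labels, term frequencies, ...;
embcompare report generates JSON reports containing comparison metrics;
embcompare gui starts a Streamlit web app to compare your embeddings visually.

EmbCompare uses a YAML file to reference embeddings and their related information. By default, EmbCompare looks for a file named embcompare.yaml in the current working directory.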
embeddings:
  first_embedding:
    name: My first embedding
    path: /abspath/to/firstembedding.json
    format: json
    frequencies: /abspath/to/freqs.json
    frequencies_format: json
    labels: /abspath/to/labels.pkl
    labels_format: pkl
  second_embedding:
    name: My second embedding
    path: /abspath/to/secondembedding.json
    format: word2vec
    frequencies: /abspath/to/freqs.pkl
    frequencies_format: pkl
    labels: /abspath/to/labels.json
    labels_format: json

The embcompare add command lets you update this file programmatically (and even creates it if it does not exist).
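To make the structure of the configuration file more concrete, here is a minimal sketch that inspects an already-parsed configuration (a plain dict, as a YAML parser would return it) and reports entries lacking some keys. The function name find_incomplete_entries and the choice of path and format as mandatory keys are assumptions made for illustration; EmbCompare's own validation rules may differ.

```python
from __future__ import annotations

from typing import Any

# Keys treated as mandatory in this sketch (an assumption, not EmbCompare's rule).
REQUIRED_KEYS = ("path", "format")


def find_incomplete_entries(config: dict[str, Any]) -> dict[str, list[str]]:
    """Return, for each embedding entry, the required keys it is missing."""
    problems: dict[str, list[str]] = {}
    for emb_id, entry in config.get("embeddings", {}).items():
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            problems[emb_id] = missing
    return problems


# In-memory equivalent of a parsed embcompare.yaml file.
example_config = {
    "embeddings": {
        "first_embedding": {
            "name": "My first embedding",
            "path": "/abspath/to/firstembedding.json",
            "format": "json",
        },
        "second_embedding": {
            "name": "My second embedding",
            "frequencies": "/abspath/to/freqs.pkl",
        },
    }
}

print(find_incomplete_entries(example_config))
# {'second_embedding': ['path', 'format']}
```

A check like this can run before calling the CLI, so that a missing path is reported with the identifier of the entry that caused it.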
EmbCompare aims to help compare embeddings through numerical metrics that can be used to check whether a newly generated embedding differs significantly from the previous one. The embcompare report command can be used in two ways:

embcompare report first_embedding
# creates a first_embedding_report.json file containing some information about the embedding

embcompare report first_embedding second_embedding
# creates a first_embedding_second_embedding_report.json file containing comparison metrics
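A generated comparison report could then feed a simple monitoring check. The sketch below is only an illustration: the helper has_drifted and the 0.5 threshold are hypothetical, and it assumes the JSON file exposes a neighborhoods_similarities_median key, as the EmbeddingComparisonReport example output in this document suggests. A temporary file stands in for a real report.

```python
import json
import tempfile
from pathlib import Path


def has_drifted(report_path: Path, min_median_similarity: float = 0.5) -> bool:
    """Return True when the median neighborhood similarity is below the threshold."""
    report = json.loads(report_path.read_text(encoding="utf-8"))
    # Key name assumed to match the EmbeddingComparisonReport output.
    median = report["neighborhoods_similarities_median"]
    return median < min_median_similarity


# Stand-in for a file produced by
# `embcompare report first_embedding second_embedding`.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "first_embedding_second_embedding_report.json"
    path.write_text(
        json.dumps({"neighborhoods_similarities_median": 0.012}),
        encoding="utf-8",
    )
    print(has_drifted(path))  # True, since 0.012 < 0.5
```

The right threshold depends on the embeddings and on the number of neighbors considered, so it is best calibrated on pairs of embeddings already known to be close.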
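The GUI is also very convenient for comparing embeddings. To start it, use the embcompare gui command. It launches a Streamlit app that lets you visually compare the embeddings added to the configuration file.
EmbCompare provides several classes to load and compare embeddings.
The Embedding class is a subclass of gensim's KeyedVectors class (gensim.models.KeyedVectors).
It adds several features: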
import json

import gensim.downloader as api
from embcompare import Embedding

word_vectors = api.load("glove-wiki-gigaword-100")
with open("frequencies.json", "r") as f:
    word_frequencies = json.load(f)

embedding = Embedding.load_from_keyedvectors(word_vectors, frequencies=word_frequencies)
neigh_dist, neigh_ind = embedding.compute_neighborhoods()

The EmbeddingComparison class is meant to compare two Embedding objects:
from embcompare import EmbeddingComparison, load_embedding

emb1 = load_embedding("first_emb.bin", embedding_format="fasttext", frequencies_path="freqs.pkl")
emb2 = load_embedding("second_emb.bin", embedding_format="word2vec", frequencies_path="freqs.pkl")

comparison = EmbeddingComparison({"emb1": emb1, "emb2": emb2}, n_neighbors=25)
comparison.neighborhoods_similarities["word"]
# 0.867

The EmbeddingReport class is used to generate a small report about an embedding:
from embcompare import EmbeddingReport, load_embedding

emb1 = load_embedding("first_emb.bin", embedding_format="fasttext", frequencies_path="freqs.pkl")

report = EmbeddingReport(emb1)
report.to_dict()
# {
#   "vector_size": 300,
#   "mean_frequency": 0.00012,
#   "mean_distance_neighbors": 0.023,
#   ...
# }

The EmbeddingComparisonReport class is used to generate a small comparison report for two embeddings:
from embcompare import EmbeddingComparison, EmbeddingComparisonReport, load_embedding

emb1 = load_embedding("first_emb.bin", embedding_format="fasttext", frequencies_path="freqs.pkl")
emb2 = load_embedding("second_emb.bin", embedding_format="word2vec", frequencies_path="freqs.pkl")

comparison = EmbeddingComparison({"emb1": emb1, "emb2": emb2})
report = EmbeddingComparisonReport(comparison)
report.to_dict()
# {
#   "embeddings": [
#     {
#       "vector_size": 300,
#       "mean_frequency": 0.00012,
#       "mean_distance_neighbors": 0.023,
#       ...
#     },
#     ...
#   ],
#   "neighborhoods_similarities_median": 0.012,
#   ...
# }

The GUI is built with Streamlit. We tried to keep the app modular so that you can more easily reuse some of its features in your own custom Streamlit app:
# embcompare/gui/app.py
import streamlit as st

from embcompare.gui.features import (
    display_custom_elements_comparison,
    display_elements_comparison,
    display_embeddings_config,
    display_frequencies_comparison,
    display_neighborhoods_similarities,
    display_numbers_of_elements,
    display_parameters_selection,
    display_spaces_comparison,
    display_statistics_comparison,
)
from embcompare.gui.helpers import create_comparison

# `config` and `CONFIG_EMBEDDINGS` are not defined in this excerpt;
# they come from elsewhere in the module.


def main():
    """Streamlit app for embeddings comparison"""
    config_embeddings = config[CONFIG_EMBEDDINGS]

    (
        tab_infos,
        tab_stats,
        tab_spaces,
        tab_neighbors,
        tab_compare,
        tab_compare_custom,
        tab_frequencies,
    ) = st.tabs(
        [
            "Infos",
            "Statistics",
            "Spaces",
            "Similarities",
            "Elements",
            "Search elements",
            "Frequencies",
        ]
    )

    # Embedding selection (inside the sidebar)
    with st.sidebar:
        parameters = display_parameters_selection(config_embeddings)

    # Display information about the embeddings
    with tab_infos:
        display_embeddings_config(
            config_embeddings, parameters.emb1_id, parameters.emb2_id
        )

    comparison = create_comparison(
        config_embeddings,
        emb1_id=parameters.emb1_id,
        emb2_id=parameters.emb2_id,
        n_neighbors=parameters.n_neighbors,
        max_emb_size=parameters.max_emb_size,
        min_frequency=parameters.min_frequency,
    )

    # Display the number of elements in each embedding and the number of common elements
    with tab_infos:
        display_numbers_of_elements(comparison)

    # Display statistics
    with tab_stats:
        display_statistics_comparison(comparison)

    if not comparison.common_keys:
        st.warning("The embeddings have no element in common")
        st.stop()

    # The comparisons below are based on the common elements
    with tab_spaces:
        display_spaces_comparison(comparison)

    with tab_neighbors:
        display_neighborhoods_similarities(comparison)

    with tab_compare:
        display_elements_comparison(comparison)

    with tab_compare_custom:
        display_custom_elements_comparison(comparison)

    with tab_frequencies:
        display_frequencies_comparison(comparison)
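As a rough intuition for what a neighborhood similarity (the quantity behind neighborhoods_similarities and the "Similarities" tab) can measure, the sketch below computes, for one element, the Jaccard overlap between its k nearest neighbors in two toy embeddings. This is one common definition and is not taken from EmbCompare's source: the helper names, the use of cosine similarity, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(vectors, key, k):
    """Return the set of the k keys most similar to `key`."""
    others = [other for other in vectors if other != key]
    others.sort(key=lambda other: cosine(vectors[key], vectors[other]), reverse=True)
    return set(others[:k])


def neighborhood_similarity(emb1, emb2, key, k):
    """Jaccard overlap between the neighborhoods of `key` in both embeddings."""
    n1 = nearest_neighbors(emb1, key, k)
    n2 = nearest_neighbors(emb2, key, k)
    union = n1 | n2
    if not union:  # no neighbors at all (k == 0 or a single key)
        return 1.0
    return len(n1 & n2) / len(union)


# Toy embeddings sharing the same keys (hypothetical data).
emb1 = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "bus": [0.1, 0.9]}
emb2 = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "car": [0.9, 0.1], "bus": [0.1, 0.9]}

print(neighborhood_similarity(emb1, emb1, "cat", 2))  # 1.0
print(neighborhood_similarity(emb1, emb2, "cat", 1))  # 0.0
```

With k = 2, "cat" keeps one of its two neighbors ("bus") across the toy embeddings, so the overlap is 1 shared neighbor out of 3 distinct ones. A low value for many elements suggests that the two spaces organize those elements differently.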