Based on PyTorch and 🤗 Transformers.

Install the library from PyPI:

```bash
pip install transformers-embedder
```

or from conda:

```bash
conda install -c riccorl transformers-embedder
```

It provides a PyTorch layer and a tokenizer that support almost every pretrained model from the Hugging Face 🤗 Transformers library. Here is a quick example:
```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
model = tre.TransformersEmbedder(
    "bert-base-cased", subword_pooling_strategy="sparse", layer_pooling_strategy="mean"
)

example = "This is a sample sentence"
inputs = tokenizer(example, return_tensors=True)
```

```text
{
'input_ids': tensor([[ 101, 1188, 1110, 170, 6876, 5650, 102]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
'scatter_offsets': tensor([[0, 1, 2, 3, 4, 5, 6]]),
'sparse_offsets': {
'sparse_indices': tensor(
[
[0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 3, 4, 5, 6]
]
),
'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]),
'sparse_size': torch.Size([1, 7, 7])
},
'sentence_length': 7 # with special tokens included
}
```

```python
outputs = model(**inputs)

# outputs.word_embeddings.shape[1:-1]  # remove [CLS] and [SEP]
# torch.Size([1, 5, 768])
# len(example)
# 5
```
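Because `TransformersEmbedder` is a regular `torch.nn.Module`, it can be dropped into a larger model. Below is a minimal sketch of a word-level classifier built on top of it; the `TokenClassifier` class, its linear head, and `num_labels` are illustrative and not part of the library, while the constructor arguments and the `word_embeddings` output follow the quick example above (768 is the hidden size of `bert-base-cased`).

```python
import torch
import transformers_embedder as tre


class TokenClassifier(torch.nn.Module):
    """Illustrative word-level classifier on top of TransformersEmbedder."""

    def __init__(self, model_name: str = "bert-base-cased", num_labels: int = 2):
        super().__init__()
        self.embedder = tre.TransformersEmbedder(
            model_name, subword_pooling_strategy="sparse", layer_pooling_strategy="mean"
        )
        # bert-base-cased produces 768-dimensional hidden states
        self.classifier = torch.nn.Linear(768, num_labels)

    def forward(self, **inputs):
        # word-level embeddings, shape (batch, words, hidden), special tokens included
        word_embeddings = self.embedder(**inputs).word_embeddings
        return self.classifier(word_embeddings)


tokenizer = tre.Tokenizer("bert-base-cased")
model = TokenClassifier()
inputs = tokenizer("This is a sample sentence", return_tensors=True)
logits = model(**inputs)  # (1, number_of_words_incl_special_tokens, num_labels)
```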
One of the annoyances of working with transformer-based models is that computing word-level embeddings from the sub-token embeddings they output is not trivial. With this API, getting word-level embeddings from (theoretically) every transformer model supported by 🤗 Transformers becomes as easy as the example above.
The `TransformersEmbedder` class offers three ways to retrieve the embeddings, controlled by `subword_pooling_strategy`:

- `subword_pooling_strategy="sparse"`: computes the mean of the sub-token embeddings of each word using a sparse matrix multiplication. This is the default strategy.
- `subword_pooling_strategy="scatter"`: computes the mean of the sub-token embeddings of each word using a scatter operation. It is not deterministic, but it can be used with ONNX export.
- `subword_pooling_strategy="none"`: returns the raw output of the transformer model, without any sub-word pooling.

Here is a summary of the strategies (a sketch of the pooling idea follows the table):
| Pooling strategy | Deterministic | ONNX-compatible |
|---|---|---|
| `sparse` | ✅ | ❌ |
| `scatter` | ❌ | ✅ |
| `none` | ✅ | ✅ |
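To make the `sparse` strategy concrete, here is a small self-contained sketch in plain PyTorch. It is not the library's internal code: the word-to-sub-token mapping and the tensor sizes are invented for illustration, and the sparse matrix simply plays a role analogous to the `sparse_offsets` returned by the tokenizer above.

```python
import torch

# Toy setup: 5 sub-tokens forming 3 words -> word 0: [0], word 1: [1, 2], word 2: [3, 4]
subword_embeddings = torch.randn(5, 8)    # (num_subwords, hidden_size)
word_ids = torch.tensor([0, 1, 1, 2, 2])  # word index of each sub-token

# Sparse (num_words, num_subwords) matrix whose row w holds 1 / |subwords(w)|
# at the positions of word w's sub-tokens.
counts = torch.bincount(word_ids).float()
indices = torch.stack([word_ids, torch.arange(len(word_ids))])
values = 1.0 / counts[word_ids]
pooling_matrix = torch.sparse_coo_tensor(indices, values, size=(3, 5))

# One sparse matmul averages the sub-token embeddings of every word at once.
word_embeddings = torch.sparse.mm(pooling_matrix, subword_embeddings)
print(word_embeddings.shape)  # torch.Size([3, 8])
```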
You can also choose how the transformer layers are combined with the `layer_pooling_strategy` parameter:

- `layer_pooling_strategy="last"`: returns the last hidden state of the transformer model
- `layer_pooling_strategy="concat"`: returns the concatenation of the selected `output_layers`
- `layer_pooling_strategy="sum"`: returns the sum of the selected `output_layers` of the transformer model
- `layer_pooling_strategy="mean"`: returns the average of the selected `output_layers` of the transformer model
- `layer_pooling_strategy="scalar_mix"`: returns the output of a parameterized scalar mixture layer over the selected `output_layers` of the transformer model

If you also want all the outputs from the Hugging Face model, you can set `return_all=True` to get them. A rough illustration of these strategies follows.
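The sketch below shows, in plain PyTorch, roughly what each `layer_pooling_strategy` computes. It is not the library's implementation: it assumes a list of per-layer hidden states and the default `output_layers=(-4, -3, -2, -1)`, and the scalar-mix weights are shown as plain tensors rather than trained parameters.

```python
import torch

# Fake per-layer hidden states from a 12-layer encoder: (batch, seq_len, hidden) each,
# index 0 being the embedding layer output.
hidden_states = [torch.randn(1, 7, 768) for _ in range(13)]
output_layers = (-4, -3, -2, -1)
selected = torch.stack([hidden_states[i] for i in output_layers])  # (4, 1, 7, 768)

last = hidden_states[-1]                                           # "last"
concat = torch.cat([hidden_states[i] for i in output_layers], -1)  # "concat" -> (1, 7, 4 * 768)
summed = selected.sum(dim=0)                                       # "sum"    -> (1, 7, 768)
mean = selected.mean(dim=0)                                        # "mean"   -> (1, 7, 768)

# "scalar_mix": softmax-normalised per-layer weights plus a global scale.
weights = torch.softmax(torch.zeros(len(output_layers)), dim=0)
gamma = torch.ones(1)
scalar_mix = gamma * (weights.view(-1, 1, 1, 1) * selected).sum(dim=0)
```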
```python
class TransformersEmbedder(torch.nn.Module):
    def __init__(
        self,
        model: Union[str, tr.PreTrainedModel],
        subword_pooling_strategy: str = "sparse",
        layer_pooling_strategy: str = "last",
        output_layers: Tuple[int] = (-4, -3, -2, -1),
        fine_tune: bool = True,
        return_all: bool = True,
    )
```

The `Tokenizer` class provides the `tokenize` method to preprocess the input for the `TransformersEmbedder` layer. You can pass raw sentences, pre-tokenized sentences, and batches of sentences. It will preprocess them and return a dictionary with the inputs for the model. By passing `return_tensors=True`, the inputs are returned as `torch.Tensor`.
By default, if you pass the text (or a batch of texts) as strings, it tokenizes them using the Hugging Face tokenizer:
text = "This is a sample sentence"
tokenizer ( text )
text = [ "This is a sample sentence" , "This is another sample sentence" ]
tokenizer ( text )您可以通过设置is_split_into_words=True传递预施式的句子(或一批句子)
text = [ "This" , "is" , "a" , "sample" , "sentence" ]
tokenizer ( text , is_split_into_words = True )
text = [
[ "This" , "is" , "a" , "sample" , "sentence" , "1" ],
[ "This" , "is" , "sample" , "sentence" , "2" ],
]
tokenizer ( text , is_split_into_words = True )首先,初始化令牌仪
```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")

text = "This is a sample sentence"
tokenizer(text)
```

```text
{
'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1]],
'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6]],
'sparse_offsets': {
'sparse_indices': tensor(
[
[0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 3, 4, 5, 6]
]
),
'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]),
'sparse_size': torch.Size([1, 7, 7])
},
'sentence_lengths': [7],
}
text = "This is a sample sentence A"
text_pair = "This is a sample sentence B"
tokenizer ( text , text_pair ) {
'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 138, 102, 1188, 1110, 170, 6876, 5650, 139, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]],
'sparse_offsets': {
'sparse_indices': tensor(
[
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
]
),
'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'sparse_size': torch.Size([1, 15, 15])
},
'sentence_lengths': [15],
}
```

By setting `padding=True` and `return_tensors=True`, the tokenizer returns the text ready for the model:

```python
batch = [
    ["This", "is", "a", "sample", "sentence", "1"],
    ["This", "is", "sample", "sentence", "2"],
    ["This", "is", "a", "sample", "sentence", "3"],
    # ...
    ["This", "is", "a", "sample", "sentence", "n", "for", "batch"],
]
tokenizer(batch, padding=True, return_tensors=True)

batch_pair = [
    ["This", "is", "a", "sample", "sentence", "pair", "1"],
    ["This", "is", "sample", "sentence", "pair", "2"],
    ["This", "is", "a", "sample", "sentence", "pair", "3"],
    # ...
    ["This", "is", "a", "sample", "sentence", "pair", "n", "for", "batch"],
]
tokenizer(batch, batch_pair, padding=True, return_tensors=True)
```

Custom fields can be added to the model inputs, and you can tell the tokenizer how to pad them with `add_padding_ops`. First, initialize the tokenizer with the model name:
```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
```

Then add the custom fields:
```python
custom_fields = {
    "custom_filed_1": [
        [0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    ]
}
```

Now we can add the padding logic for our custom field `custom_filed_1`. The `add_padding_ops` method takes in input:

- `key`: the name of the field in the tokenizer input
- `value`: the value to use for padding
- `length`: the length to pad to. It can be an `int`, or one of two strings: `subword`, where the element is padded to match the length of the sub-words, or `word`, where the element is padded relative to the length of the batch after sub-word merging.

```python
tokenizer.add_padding_ops("custom_filed_1", 0, "word")
```

Finally, we can tokenize the input together with the custom field:
```python
text = [
    "This is a sample sentence",
    "This is another example sentence just make it longer, with a comma too!",
]
tokenizer(text, padding=True, return_tensors=True, additional_inputs=custom_fields)
```

The inputs are now ready for the model, including the custom field, already padded:
```text
>>> inputs
{
'input_ids': tensor(
[
[ 101, 1188, 1110, 170, 6876, 5650, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 101, 1188, 1110, 1330, 1859, 5650, 1198, 1294, 1122, 2039, 117, 1114, 170, 3254, 1918, 1315, 106, 102]
]
),
'token_type_ids': tensor(
[
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
]
),
'attention_mask': tensor(
[
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
]
),
'scatter_offsets': tensor(
[
[ 0, 1, 2, 3, 4, 5, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16]
]
),
'sparse_offsets': {
'sparse_indices': tensor(
[
[ 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[ 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16],
[ 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
]
),
'sparse_values': tensor(
[1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
1.0000, 1.0000, 0.5000, 0.5000, 1.0000, 1.0000, 1.0000]
),
'sparse_size': torch.Size([2, 17, 18])
},
'sentence_lengths': [7, 17],
}
```

Some of the code in the `TransformersEmbedder` class is taken from the PyTorch Scatter library. The pretrained models and the core of the tokenizer come from 🤗 Transformers.