gpt tokenizer下载gpt tokenizer源代码下载

gpt-tokenizer

gpt-tokenizer是一个令牌字节对编码器/解码器，支持所有OpenAI模型（包括GPT-3.5，GPT-4，GPT-4，GPT-4O和O1）。它是所有JavaScript环境可用的最快，最小和最低的足迹GPT Tokenizer。它用打字稿编写。

该图书馆已被信任：

Coderabbit（赞助商？）
微软（团队，Genaiscript）
弹性（基巴纳）
效应ts
铁克的铆钉

请考虑？赞助该项目，如果您觉得有用。

特征

截至2023年，它是NPM上最完整的开源GPT Tokenizer。该软件包是Openai tiktoken的一个港口，顶部撒上一些其他独特功能：

多亏了encodeChat功能，支持轻松地聊天
支持所有当前的OpenAI型号（可用编码： r50k_base ， p50k_base ， p50k_edit ， cl100k_base和o200k_base ）
可以加载并同步工作！（即在非异步/等待上下文中）
解码器和编码器函数的发电机功能版本
提供了解码异步数据流的能力（使用任何可估计输入使用decodeAsyncGenerator和decodeGenerator ）
没有全局缓存（与原始的GPT-3编码实现一样，没有意外内存泄漏）
包括高性能的isWithinTokenLimit函数，以评估令牌限制而无需编码整个文本/聊天
通过消除及物阵列来提高整体性能
类型安全（用打字稿编写）
在开箱即用的浏览器中工作

安装

作为NPM软件包

npm install gpt-tokenizer

作为UMD模块

 < script src =" https://unpkg.com/gpt-tokenizer " > </ script >

< script >
  // the package is now available as a global:
  const { encode , decode } = GPTTokenizer_cl100k_base
</ script >

如果您想使用自定义编码，请获取相关脚本。

https://unpkg.com/gpt-tokenizer/dist/o200k_base.js（用于gpt-4o和o1 ）
https://unpkg.com/gpt-tokenizer/dist/cl100k_base.js（用于gpt-4-*和gpt-3.5-turbo ）
https://unpkg.com/gpt-tokenizer/dist/p50k_base.js
https://unpkg.com/gpt-tokenizer/dist/p50k_edit.js
https://unpkg.com/gpt-tokenizer/dist/r50k_base.js

全局名称是一个串联： GPTTokenizer_${encoding} 。

有关更多信息，请参阅支持的模型及其编码部分。

操场

操场在令人难忘的URL下出版：https：//gpt-tokenizer.dev/

您可以使用Codesandbox Playground在浏览器中使用该软件包。

GPT Tokenizer Playground

操场模仿了官方的Openai令牌。

用法

该库提供了各种功能，可以将文本转换为一系列整数（令牌），这些序列可以馈入LLM模型。使用OpenAI使用的字节对编码（BPE）算法进行转换。

 import {
  encode ,
  encodeChat ,
  decode ,
  isWithinTokenLimit ,
  encodeGenerator ,
  decodeGenerator ,
  decodeAsyncGenerator ,
} from 'gpt-tokenizer'
// note: depending on the model, import from the respective file, e.g.:
// import {...} from 'gpt-tokenizer/model/gpt-4o'

const text = 'Hello, world!'
const tokenLimit = 10

// Encode text into tokens
const tokens = encode ( text )

// Decode tokens back into text
const decodedText = decode ( tokens )

// Check if text is within the token limit
// returns false if the limit is exceeded, otherwise returns the actual number of tokens (truthy value)
const withinTokenLimit = isWithinTokenLimit ( text , tokenLimit )

// Example chat:
const chat = [
  { role : 'system' , content : 'You are a helpful assistant.' } ,
  { role : 'assistant' , content : 'gpt-tokenizer is awesome.' } ,
] as const

// Encode chat into tokens
const chatTokens = encodeChat ( chat )

// Check if chat is within the token limit
const chatWithinTokenLimit = isWithinTokenLimit ( chat , tokenLimit )

// Encode text using generator
for ( const tokenChunk of encodeGenerator ( text ) ) {
  console . log ( tokenChunk )
}

// Decode tokens using generator
for ( const textChunk of decodeGenerator ( tokens ) ) {
  console . log ( textChunk )
}

// Decode tokens using async generator
// (assuming `asyncTokens` is an AsyncIterableIterator<number>)
for await ( const textChunk of decodeAsyncGenerator ( asyncTokens ) ) {
  console . log ( textChunk )
}

默认情况下，从gpt-tokenizer导入的使用cl100k_base编码，由gpt-3.5-turbo和gpt-4使用。

要获得不同模型的令牌，请直接导入：例如：

 import {
  encode ,
  decode ,
  isWithinTokenLimit ,
  // etc...
} from 'gpt-tokenizer/model/gpt-3.5-turbo'

如果您正在处理不支持软件包的解析器。JSON exports解决方案，您可能需要从相应的cjs或esm目录中导入，例如：

 import {
  encode ,
  decode ,
  isWithinTokenLimit ,
  // etc...
} from 'gpt-tokenizer/cjs/model/gpt-3.5-turbo'

懒惰加载

如果您不介意异步加载令牌，则可以在功能中使用动态导入，例如：

 const {
  encode ,
  decode ,
  isWithinTokenLimit ,
  // etc...
} = await import ( 'gpt-tokenizer/model/gpt-3.5-turbo' )

加载编码

如果软件包不支持您的模型，但是您知道它使用了哪种BPE，则可以直接加载编码，例如：

 import {
  encode ,
  decode ,
  isWithinTokenLimit ,
  // etc...
} from 'gpt-tokenizer/encoding/cl100k_base'

支持的模型及其编码

o1-* （ o200k_base ）
gpt-4o （ o200k_base ）
gpt-4-* （ cl100k_base ）
gpt-3.5-turbo （ cl100k_base ）
text-davinci-003 （ p50k_base ）
text-davinci-002 （ p50k_base ）
text-davinci-001 （ r50k_base ）
...还有许多其他模型，请参见Models.ts，以获取最新的支持模型及其编码的列表。

注意：如果您使用的是gpt-3.5-*或gpt-4-* ，并且看不到您要寻找的模型，请直接使用cl100k_base 。

API

`encode(text: string): number[]`

将给定文本编码为一系列令牌。当您需要将一件文本转换为GPT模型可以处理的令牌格式时，请使用此方法。

例子：

 import { encode } from 'gpt-tokenizer'

const text = 'Hello, world!'
const tokens = encode ( text )

`decode(tokens: number[]): string`

将一系列令牌解码为文本。当您想将输出令牌从GPT模型转换回人类可读文本时，请使用此方法。

例子：

 import { decode } from 'gpt-tokenizer'

const tokens = [ 18435 , 198 , 23132 , 328 ]
const text = decode ( tokens )

`isWithinTokenLimit(text: string, tokenLimit: number): false | number`

检查文本是否在令牌限制之内。如果超过限制，则返回false ，否则返回令牌的数量。使用此方法快速检查给定文本是否在GPT模型施加的令牌限制内，而无需编码整个文本。

例子：

 import { isWithinTokenLimit } from 'gpt-tokenizer'

const text = 'Hello, world!'
const tokenLimit = 10
const withinTokenLimit = isWithinTokenLimit ( text , tokenLimit )

`countTokens(text: string | Iterable<ChatMessage>): number`

计算输入文本或聊天中的令牌数量。当您需要确定代币数量而无需检查限制时，请使用此方法。

例子：

 import { countTokens } from 'gpt-tokenizer'

const text = 'Hello, world!'
const tokenCount = countTokens ( text )

`encodeChat(chat: ChatMessage[], model?: ModelName): number[]`

将给定的聊天编码为一系列令牌。

如果您没有直接导入模型版本，或者在初始化过程中未提供model ，则必须在此处提供它以正确地将聊天介绍给给定模型。当您需要将聊天转换为GPT模型可以处理的令牌格式时，请使用此方法。

例子：

 import { encodeChat } from 'gpt-tokenizer'

const chat = [
  { role : 'system' , content : 'You are a helpful assistant.' } ,
  { role : 'assistant' , content : 'gpt-tokenizer is awesome.' } ,
]
const tokens = encodeChat ( chat )

请注意，如果您编码一个空聊天，它仍然包含最少的特殊令牌数量。

`encodeGenerator(text: string): Generator<number[], void, undefined>`

使用发电机来编码给定文本，产生大量令牌。当您想在块中编码文本时，请使用此方法，这对于处理大型文本或流数据可能很有用。

例子：

 import { encodeGenerator } from 'gpt-tokenizer'

const text = 'Hello, world!'
const tokens = [ ]
for ( const tokenChunk of encodeGenerator ( text ) ) {
  tokens . push ( ... tokenChunk )
}

`encodeChatGenerator(chat: Iterator<ChatMessage>, model?: ModelName): Generator<number[], void, undefined>`

与encodeChat相同，但使用生成器作为输出，并且可以使用任何迭代器作为输入chat 。

`decodeGenerator(tokens: Iterable<number>): Generator<string, void, undefined>`

使用发电机来解码一系列令牌，产生大量的解码文本。当您想在块中解码令牌时，请使用此方法，这对于处理大型输出或流数据可能很有用。

例子：

 import { decodeGenerator } from 'gpt-tokenizer'

const tokens = [ 18435 , 198 , 23132 , 328 ]
let decodedText = ''
for ( const textChunk of decodeGenerator ( tokens ) ) {
  decodedText += textChunk
}

`decodeAsyncGenerator(tokens: AsyncIterable<number>): AsyncGenerator<string, void, undefined>`

使用发电机对一系列令牌进行解码，从而产生大量的解码文本。当您想在块中解码令牌时使用此方法，这对于在异步上下文中处理大型输出或流数据可能很有用。

例子：

 import { decodeAsyncGenerator } from 'gpt-tokenizer'

async function processTokens ( asyncTokensIterator ) {
  let decodedText = ''
  for await ( const textChunk of decodeAsyncGenerator ( asyncTokensIterator ) ) {
    decodedText += textChunk
  }
}

特殊令牌

GPT模型使用了一些特殊的令牌。并非所有模型都支持所有这些令牌。

自定义允许集

gpt-tokenizer允许您在编码文本时指定允许的特殊令牌的自定义集。为此，将包含允许的特殊令牌作为参数的Set传递给encode函数：

 import {
  EndOfPrompt ,
  EndOfText ,
  FimMiddle ,
  FimPrefix ,
  FimSuffix ,
  ImStart ,
  ImEnd ,
  ImSep ,
  encode ,
} from 'gpt-tokenizer'

const inputText = `Some Text ${ EndOfPrompt } `
const allowedSpecialTokens = new Set ( [ EndOfPrompt ] )
const encoded = encode ( inputText , allowedSpecialTokens )
const expectedEncoded = [ 8538 , 2991 , 220 , 100276 ]

expect ( encoded ) . toBe ( expectedEncoded )

定制的禁止套装

同样，您可以在编码文本时指定自定义的特殊令牌。传递包含不允许特殊令牌作为encode函数的参数的Set ：

 import { encode , EndOfText } from 'gpt-tokenizer'

const inputText = `Some Text ${ EndOfText } `
const disallowedSpecial = new Set ( [ EndOfText ] )
// throws an error:
const encoded = encode ( inputText , undefined , disallowedSpecial )