sailor2
1.0.0
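This repository contains the code for Sailor2, a powerful and inclusive open language model for South-East Asia.
Sailor2 is a joint project in collaboration with Sea AI Lab, SCB10X, WiseSight, Hugging Face, and the Sailor2 community.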

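The logo was generated by Midjourney.
Sailor2 is a community-driven initiative that brings cutting-edge multilingual language models to South-East Asia (SEA). Our research highlights a strong demand for models in the 8B and 20B parameter range for production use, alongside 1B models for specialized applications such as speculative decoding and research. These models are released under the Apache 2.0 license, making advanced language technology more accessible across the region.
Sailor2 builds on the multilingual model Qwen 2.5 and is continually pre-trained on 500B tokens to better support 15 languages with a single unified model. These languages are English, Chinese, Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray. By addressing the growing demand for diverse, robust, and accessible language models, Sailor2 aims to serve underserved communities in SEA with open, inclusive, and accessible multilingual LLMs.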
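For more training details, please refer to the Sailor2 homepage.
We are working hard to prepare the release of all the code. Stay tuned!
Sailor2 is supported in the latest Hugging Face transformers, and we recommend installing transformers==4.46.3.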
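To confirm which transformers version is actually installed before loading the model, a small check along these lines may help. This is only a sketch: the helper names `release_tuple` and `installed_version` are illustrative and not part of Sailor2 or transformers, and the comparison assumes you want an exact match with the recommended release.

```python
import re
from importlib.metadata import PackageNotFoundError, version

RECOMMENDED = "4.46.3"  # version recommended in this README


def release_tuple(version_string):
    """Return the leading numeric components, e.g. '4.47.0.dev0' -> (4, 47, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def installed_version(package="transformers"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("transformers is not installed")
elif release_tuple(found) == release_tuple(RECOMMENDED):
    print(f"transformers {found} matches the recommended release")
else:
    print(f"transformers {found} is installed; {RECOMMENDED} is recommended")
```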
The following code snippet shows how to load the tokenizer and the model, and how to generate content.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # device the input tensors are moved to

# Load the chat model in bfloat16; device_map="auto" places the weights on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    'sail/Sailor2-20B-Chat',
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained('sail/Sailor2-20B-Chat')

# Adjacent string literals inside parentheses are joined into one string.
system_prompt = (
    'You are an AI assistant named Sailor2, created by Sea AI Lab. '
    'As an AI assistant, you can answer questions in English, Chinese, and Southeast Asian languages '
    'such as Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray. '
    'Your responses should be friendly, unbiased, informative, detailed, and faithful.'
)

prompt = "Beri saya pengenalan singkat tentang model bahasa besar."
# prompt = "Hãy cho tôi một giới thiệu ngắn gọn về mô hình ngôn ngữ lớn."
# prompt = "ให้ฉันแนะนำสั้น ๆ เกี่ยวกับโมเดลภาษาขนาดใหญ่"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template, ending with the assistant turn marker.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(device)
input_ids = model_inputs.input_ids

generated_ids = model.generate(
    input_ids,
    max_new_tokens=512,
)

# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(prompt_ids):] for prompt_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Sailor2 is distributed under the terms of the Apache License 2.0, with no restrictions on research or commercial use.
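A note on the generation snippet above: for decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, which is why each output is sliced by the prompt length before decoding. The sketch below shows that slicing step on plain Python lists so it can be run without downloading a model. The helper name `strip_prompt` and the token IDs are made up for illustration and are not part of Sailor2 or transformers.

```python
def strip_prompt(prompt_batch, output_batch):
    """Remove each prompt's tokens from the front of the matching output sequence."""
    return [
        output_ids[len(prompt_ids):]
        for prompt_ids, output_ids in zip(prompt_batch, output_batch)
    ]


# Toy token IDs standing in for tokenizer output and model.generate output.
prompt_batch = [[101, 7, 8, 9]]
output_batch = [[101, 7, 8, 9, 42, 43, 2]]

new_tokens = strip_prompt(prompt_batch, output_batch)
print(new_tokens)  # [[42, 43, 2]]
```

The same slicing works on tensors, because indexing a row with `[n:]` keeps everything after its first `n` positions.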
If you find Sailor2 useful, please cite our work as follows:
@misc{sailor2report,
title={Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLM},
author={{Sailor2 Team}},
year={2024}
}
If you have any questions, please raise an issue or contact us at [email protected] or [email protected].