sailor2
1.0.0
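This repository contains the code for Sailor2, a powerful and inclusive open language model for South-East Asia.
Sailor2 is a collaborative project involving Sea AI Lab, SCB10X, WiseSight, Hugging Face, and the Sailor2 community.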

The logo was generated by Midjourney.
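Sailor2 is a community-driven initiative that brings cutting-edge multilingual language models to South-East Asia (SEA). Our research highlights a strong demand for models in the 8B and 20B parameter range for production use, alongside 1B models for specialized applications such as speculative decoding and research. These models are released under the Apache 2.0 license, making advanced language technology more accessible across the region.
Sailor2 builds on the multilingual model Qwen 2.5 and is continually pre-trained on 500B tokens so that a single unified model better supports 15 languages: English, Chinese, Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray. By addressing the growing demand for diverse, robust, and accessible language models, Sailor2 aims to serve the underserved communities of SEA with open, inclusive, and accessible multilingual LLMs.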
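For more training details, see the Sailor2 homepage.
We are working hard to prepare all of the code for release. Stay tuned!
Sailor2 is supported in recent releases of Hugging Face transformers, and we recommend installing transformers==4.46.3.
The following code snippet shows how to load the tokenizer and model and how to generate content.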
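As a quick sanity check before loading the model, the installed transformers version can be compared against the recommended one. This is only a minimal sketch using the standard library; the helper names installed_version and matches_required are hypothetical and not part of Sailor2 or transformers, and an exact-match check is an assumption (other recent versions may also work).

```python
from importlib import metadata
from typing import Optional

# Version recommended in this README.
REQUIRED_TRANSFORMERS = "4.46.3"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_required(installed: Optional[str],
                     required: str = REQUIRED_TRANSFORMERS) -> bool:
    """True only when the installed version string equals the recommended one."""
    return installed == required


found = installed_version("transformers")
if matches_required(found):
    print("transformers matches the recommended version")
else:
    print(f"found transformers=={found}; recommended is {REQUIRED_TRANSFORMERS}")
```

If the versions differ, running pip install transformers==4.46.3 brings the environment in line with the recommendation.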
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the chat model in bfloat16; device_map="auto" places it on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    'sail/Sailor2-20B-Chat',
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained('sail/Sailor2-20B-Chat')

# Adjacent string literals are joined into a single system prompt.
system_prompt = (
    'You are an AI assistant named Sailor2, created by Sea AI Lab. '
    'As an AI assistant, you can answer questions in English, Chinese, and Southeast Asian languages '
    'such as Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray. '
    'Your responses should be friendly, unbiased, informative, detailed, and faithful.'
)

prompt = "Beri saya pengenalan singkat tentang model bahasa besar."
# prompt = "Hãy cho tôi một giới thiệu ngắn gọn về mô hình ngôn ngữ lớn."
# prompt = "ให้ฉันแนะนำสั้น ๆ เกี่ยวกับโมเดลภาษาขนาดใหญ่"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template, ending with the assistant turn marker.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(device)
input_ids = model_inputs.input_ids

generated_ids = model.generate(
    input_ids,
    max_new_tokens=512,
)

# Keep only the newly generated tokens by dropping the prompt tokens from each output.
generated_ids = [
    output_ids[len(prompt_ids):] for prompt_ids, output_ids in zip(input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Sailor2 is distributed under the terms of the Apache License 2.0. There are no restrictions on research or commercial use.
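In the generation snippet above, the output of model.generate contains the prompt tokens followed by the new tokens, which is why the prompt is sliced off before decoding. The slicing step can be sketched on its own with plain Python lists; the helper name strip_prompt_tokens and the token IDs below are made up for illustration and are not real tokenizer output.

```python
def strip_prompt_tokens(prompt_batch, output_batch):
    """Drop the leading prompt tokens from each generated sequence."""
    return [
        output_ids[len(prompt_ids):]
        for prompt_ids, output_ids in zip(prompt_batch, output_batch)
    ]


# Toy token IDs (hypothetical): each output repeats its prompt, then adds new tokens.
prompt_batch = [[101, 7, 8, 9]]
output_batch = [[101, 7, 8, 9, 42, 43, 2]]

new_tokens = strip_prompt_tokens(prompt_batch, output_batch)
print(new_tokens)  # [[42, 43, 2]]
```

The same slicing works on tensors, since indexing with output_ids[len(prompt_ids):] behaves the same way for a one-dimensional tensor as for a list.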
If you find Sailor2 useful, please cite our work as follows:
@misc{sailor2report,
title={Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLM},
author={{Sailor2 Team}},
year={2024}
}
If you have any questions, please raise an issue or contact us at [email protected] or [email protected].