Jeju Dialect and Standard Korean Bidirectional Voice Translation Model Project
Model Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Set up the device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Junhoee/Kobart-Jeju-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("Junhoee/Kobart-Jeju-translation").to(device)

# Set up the input text
# Prepend a [제주] (Jeju) or [표준] (Standard) token matching the translation
# direction before entering the sentence
input_text = "[표준] 안녕하세요"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
```
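Continuing from the snippet above, the opposite direction works the same way with the [제주] token (this assumes the token tags the input language, as in the [표준] example); the sample sentence is only an illustration, not taken from the project's data:

```python
# Jeju -> Standard: tag the input with the [제주] token instead.
input_text = "[제주] 혼저 옵서예"  # illustrative Jeju greeting, not from the dataset
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
outputs = model.generate(input_ids, max_length=64)
print("Model Output:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```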
My Role
- Dataset creation
  - Collected and prepared Jeju-standard parallel data to build a new dataset
  - Gathered data from sources such as AI-Hub and GitHub
- Translation model logic design
  - Fine-tuned a KoBART model
    - Among Korean text-to-text models, KoBART was the best-performing and fastest option.
  - While designing the bidirectional logic, prepended a [제주] (Jeju) or [표준] (Standard) token to each sentence so the model can distinguish the translation direction (BLEU score rose from 0.5 to 0.7, where 1 is the maximum); see the sketch after this list
  - Because of RAM limits, only 700,000 examples were used for training; changing the dataset's storage dtype from float16 to uint16 relieved the memory shortage (saving GPU memory and resources)
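A minimal sketch of the two ideas above, with hypothetical names not taken from the project code: one parallel sentence pair becomes two direction-tagged training examples, and token IDs are stored in a narrower unsigned dtype to save memory.

```python
import numpy as np

def make_bidirectional_pairs(jeju: str, standard: str):
    # One parallel pair yields two training examples, one per direction;
    # the leading token tells the model which language the input is in.
    return [
        {"source": f"[제주] {jeju}", "target": standard},
        {"source": f"[표준] {standard}", "target": jeju},
    ]

pairs = make_bidirectional_pairs("혼저 옵서예", "어서 오세요")

# Storing token IDs as uint16 (valid while the vocabulary fits in 65,535)
# halves the footprint of each ID compared with a 32-bit dtype.
example_ids = np.array([0, 14071, 239, 1], dtype=np.uint16)
```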
1. Project Introduction
Team Members
- Vitamin 12: Lee Seo-hyun (Leader), Lee Yerin
- Vitamin 13: Kim Yun-young, Kim Jae-gyeom, Lee Hyung-seok
Period
Theme
- Build a bidirectional translation model for the Jeju dialect and standard Korean
Goals
- Promote understanding of the Jeju dialect and contribute to the preservation of Jeju culture.
- Support smooth communication with residents of Jeju.
- Develop a bidirectional translation model connecting the Jeju dialect and standard Korean.
- Implement speech recognition and a user interface.
2. Data Collection
Data collected from AI-Hub
- Korean dialect utterance data
- Middle-aged and elderly Korean dialect data
Data collected from GitHub
- Kakao Brain JIT Jejueo data
Other data
- Everyday Jeju dialect data (crawled from the Jeju province dialect web page)
- Mworaenghamaen data (collected by referring to the lyric-translation videos of the Jeju dialect YouTuber 뭐랭하맨)
- Data collected from the book 'Taste and Style of the Jeju Dialect'
- Data collected from the book 'Even If It Passes'
- 2018 Jeju Language Oral Data Collection (held out for evaluation)
3. Model Training
3-1. Model Selection
We trained by taking a pre-trained model and fine-tuning it.
Pre-trained model used to develop the translation model: KoBART (gogamza/kobart-base-v2)
Selection criteria for the pre-trained model
- Is the model well suited to translation?
- Was it pre-trained on Korean?
- Is the model small enough that training is fast?
Models considered but not selected:
- T5 (training time was too long)
- JeBERT (performance was not satisfactory)
3-2. Training Method
Training methodology
- Trained in source -> target format
- Prepended a [제주] (Jeju) or [표준] (Standard) token before each sentence to specify the translation direction, and trained with the token included
- Converted the data into a `Dataset` from the `datasets` package, a form optimized for language-model training
Main hyperparameter settings (a fine-tuning sketch using these values follows the list)
- max_length: 64
- batch_size: 32
- learning_rate: started at 2e-5 and was gradually reduced as training progressed
- epochs: 3
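A condensed fine-tuning sketch using the hyperparameters above; the toy dataset, column names, and output directory are assumptions for illustration, not the project's actual pipeline:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gogamza/kobart-base-v2")
model = AutoModelForSeq2SeqLM.from_pretrained("gogamza/kobart-base-v2")

# Toy source -> target pairs; "source" already carries the direction token.
raw = Dataset.from_dict({
    "source": ["[표준] 어서 오세요", "[제주] 혼저 옵서예"],
    "target": ["혼저 옵서예", "어서 오세요"],
})

def preprocess(batch):
    # Tokenize inputs and targets with the project's max_length of 64.
    model_inputs = tokenizer(batch["source"], max_length=64, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="kobart-jeju",          # assumed path
    per_device_train_batch_size=32,
    learning_rate=2e-5,                # decays over training by default
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```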
4. Main Achievements
- Final BLEU scores (measured on the 2018 Jeju Language Oral Data Collection)
  - Jeju -> Standard: 0.76
  - Standard -> Jeju: 0.50
- BLEU score performance over time
| Date | 04-13 | 05-03 | 05-06 | 05-13 | 05-21 | 05-24 | 05-26 | 05-30 |
|---|---|---|---|---|---|---|---|---|
| Jeju -> Standard BLEU | 0.56 | 0.59 | 0.42 | 0.64 | 0.70 | 0.74 | 0.76 | 0.74 |
| Standard -> Jeju BLEU | 0.35 | 0.37 | 0.26 | 0.37 | 0.39 | 0.46 | 0.50 | 0.49 |
- Overall, the BLEU scores improved steadily over the course of the project.
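The project's exact scoring script is not shown here; below is a hypothetical evaluation sketch using NLTK's corpus BLEU, which, like the table above, reports scores on a 0-1 scale. The sentences are placeholders:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical evaluation data: each hypothesis is a model output, each
# reference list holds the gold translation(s), all whitespace-tokenized.
references = [["어서 오세요 반갑습니다".split()]]
hypotheses = ["어서 오세요 반갑습니다".split()]

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
print(corpus_bleu(references, hypotheses, smoothing_function=smooth))
```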

5. Future Plans
- Collect additional data and apply fine-grained grammatical preprocessing to secure higher-quality data
- Improve the speech-recognition model's handling of the Jeju accent
- Implement a web version and develop a mobile app
6. References
- Data sources
  - Korean dialect utterance data (AI-Hub): https://www.aihub.or.kr/aihubdata/data/View.do?curmenu=115&topmenu
  - Middle-aged and elderly Korean dialect data (AI-Hub): https://www.aihub.or.kr/aihubdata/data/View.do?curmenu=115&topmenu
  - Kakao Brain JIT Jejueo data (Kakao Brain GitHub): https://github.com/kakaobrain/jejuo
  - Everyday Jeju dialect data (Jeju province dialect page): https://www.jeju.go.kr/culture/dialect/lifedialect.htm
- Model sources
  - KoBART (Hugging Face): https://huggingface.co/gogamza/kobart-base-v2
  - Whisper (Hugging Face): https://huggingface.co/openai/whisper-large-v2
  - KoBART (GitHub): https://github.com/skt-ai/kobart