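Low Resource Languages
Resources for the conservation, development, and documentation of low resource (human) languages.
By some estimates, half of the 7,000 languages spoken today are expected to become extinct this century. However, scholars, independent researchers, organizations, communities, and individuals are doing a great deal of work to stop or slow this trend. This list aims to catalogue open source code that can be used to document, conserve, develop, preserve, or work with endangered languages.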
Slack Group
We have a Slack group for live discussion. Join us!
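Publications
A white paper describing this repository was published at the LREC 2016 CCURL workshop (Collaboration and Computing for Under-Resourced Languages). The paper is in this repository, in the papers folder. Download the original paper here: Open Source Code Serving Endangered Languages.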
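Contributing
To edit this list on GitHub, just click here. If you want to discuss anything related to it, please open an issue. If you know of any resources that are not on this list, please use the link above or submit a pull request.
More details about contributing are in the contributing guide.
If you are interested in discussing the list offline in some capacity, get in touch with @richardlitt. I am happy to have a call or an email exchange.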
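Table of Contents
Table of contents generated with DocToc
- Definitions
- Generic Repositories
- Keyboard Layout Configuration Helpers
- Annotation
- Format Specifications
- i18n-related Repositories
- Audio automation
- Text-to-Speech (TTS)
- Automatic Speech Recognition (ASR)
- Text automation
- Experimentation
- Flashcards
- Natural language generation
- Computing systems
- Android Applications
- Chrome Extensions
- FieldDB
- Academic Research Paper-Specific Repositories
- Example Repositories
- Fonts
- Corpora
- Organizations
- Tutorials
- Language Specific Projects
- Afrikaans
- Albanian
- Alutiiq
- Amharic
- Basque
- Bengali
- Chichewa
- Galician
- Georgian
- Guarani
- Hausa
- Hindi
- Høgnorsk
- Icelandic
- Inuktitut
- Irish
- Kinyarwanda
- Kurdish
- Lingala
- Lushootseed
- Malay
- Malagasy
- Manx
- Migmaq
- Minderico
- Nishnaabe
- Oromo
- Quechua
- Sami
- Scottish Gaelic
- Secwepemctsín
- Somali
- Tigrinya
- Uralic
- Zulu
- License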
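Definitions
Endangered languages are human languages at risk of extinction. This list also includes minority languages, which are spoken by stable but small populations (for example, Maltese or Hawaiian), and low-resource or under-resourced languages, which may be spoken by large populations but are under-represented digitally (for example, Quechua). These languages share certain characteristics; the most relevant are sparse data and a lack of resources, from spell checkers to grammars to machine translation corpora. Other under-resourced languages that do not belong on this list include constructed languages (for example, Klingon or Na'vi), computer languages (for example, JavaScript or Lua), and languages whose data is so sparse that they are irrelevant for most purposes (for example, Tocharian).
Open source "promotes universal access via a free license to a product's design or blueprint, and universal redistribution of that design or blueprint, including subsequent improvements to it by anyone." (Wiki). This is important because money and resources allocated to languages or projects that are not open source are spent at the cost of possible scalability elsewhere.
This list used to be named endangered-languages. It was renamed to reflect that "endangered" is a loaded term, and one that may not reflect the views of the language communities of minority languages. The name low-resource-languages focuses this list on the lack of digital resources, in comparison with other, higher-resourced languages.
Tools which are built for these languages are not included (unless relevant for dialects or variants): Arabic, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, Flemish, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Norwegian (Bokmål), Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, Ukrainian, Valencian, Vietnamese. This list is drawn from the list of the most popular languages for website content on this Wikipedia page. Other metrics could be used; if you have another metric, please suggest it!
This list is particularly good at one thing: showing the general variety of tools that exist in the field. It is not, however, good for in-depth research into a particular language or tool suite. For example, it would be unhelpful to list all of the Firefox language packs or Apertium language modules for every low-resource language, or to include all of the available tools noted in the ACL Wiki, which are mostly provided by the IXA Group, some of which are open source and some of which are not. Instead, consider this list a starting point for more research.
Looking for resources on coding languages? Check out the awesome list collection.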
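Generic Repositories
Monolingual Dictionary Projects and Utilities
Utilities
- Free Electronic Dictionary Project - A project providing a Java MIDlet for mobile phones, for indigenous language dictionaries.
- A website that hosts digital dictionaries for individual languages.
- WeSay - Allows language communities to build their own dictionaries. https://software.sil.org/wesay/ (SIL International).
Software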
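- 4lang - Concept dictionary using Eilenberg machines.
- accentuate - Statistical unicodification of plain text in many languages.
- alignment-with-openfst - This is an implementation of a CRF autoencoder framework for four tasks: bitext word alignment, part-of-speech tagging, code switching, and dependency parsing.
- Apertium - Apertium is a toolbox for building open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs.
- ark-tweet-nlp - CMU ARK Twitter Part-of-Speech Tagger (fork).
- ArtOfReading - Indexing and processing scripts related to the Art Of Reading illustration collection.
- bayesline - Multinomial Bayes classification for language identification.
- bible-corpus-tools - A collection of tools for reading and processing the multilingual Bible corpus.
- BloomDesktop - Bloom Desktop is a hybrid C#/JavaScript/HTML/CSS Windows application that dramatically "lowers the bar" for language communities who want books in their own languages. Bloom delivers a low-training, high-output system where mother tongue speakers and their advocates work together to foster both community authorship and access to external material… https://bloomlibrary.org/.
- BloomLibrary - Bloom Library single page app, using AngularJS and Bootstrap, with a Parse.com backend. https://bloomlibrary.org/.
- brain - Neural networks in JavaScript.
- Bristol Uni MT morphology tools - This repository is a mirror of scripts previously available at http://www.cs.bris.ac.uk/research/machinelearning/morphology/morphology/resources.jsp. Includes: Ukwabelana, an open-source morphological Zulu corpus, and EMMA, a novel evaluation metric for morphological analysis.
- brown-cluster - C++ implementation of the Brown word clustering algorithm.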
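- CasualConc - CasualConc is a concordance program that runs natively on Mac OS X 10.5 Leopard or later. It was originally designed for casual use (preliminary analysis or non-research purposes), although [the maintainer] has been using it for their own research (and possibly others have too). It can generate KWIC concordance lines, word clusters, collocation analysis, and word counts.
- cdec - Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms.
- charlint - Charlint is a character normalization and checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15 and serves as a test bed for Early Uniform Normalization in the W3C Character Model.
- Chorus - A version control system designed to enable workflows appropriate for typical language development teams that are geographically distributed.
- CLAM - Computational Linguistics Application Mediator - Quickly turn command-line NLP applications into RESTful web services with a web-application front end. You provide a specification of your command-line application, its input, output, and parameters, and CLAM wraps around your application to form a fully fledged RESTful web service.
- CMU Sphinx - CMUSphinx is a speaker-independent large vocabulary continuous speech recognizer released under a BSD-style license. It is also a collection of open source tools and resources that allows researchers and developers to build speech recognition systems.
- cnminlangwebcollect - Chinese minority website language detection and website collection.
- Cog - Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties. http://sillsdev.github.io/cog/.
- convertextract - Converts Excel, Word, and PowerPoint files that use non-Unicode text (for example, text requiring SIL fonts) to Unicode while preserving the original file's formatting.
- CorpusTools - Phonological CorpusTools http://phonologicalcorpustools.github.io/corpustools/.
- CTK - Built around the LDC's Champollion sentence aligner kernel, the Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible. (Original project is on SourceForge: http://champollion.sourceforge.net).
- DataTags - A system for assessing the sensitivity and privacy risk of datasets, and for assigning tags that describe how a dataset must be transferred, stored, and accessed. (Fork).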
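- A data repository framework for sharing and publishing research data.
- dative - Dative: software for linguistic fieldwork http://www.dative.ca.
- dattion - A single page application that interacts with multiple linguistic fieldwork web service databases. Website.
- DeepLearnToolbox - Matlab/Octave toolbox for deep learning. Includes Deep Belief Nets, Stacked Autoencoders, Convolutional Neural Nets, Convolutional Autoencoders, and vanilla Neural Nets. Each method has examples to get you started.
- Desmeme - Database and tools for exploring linguistic templates.
- dictdb - Dictionary database for language translation.
- discourse - Python-based tools for converting and merging multi-layer annotated linguistic data.
- divvun-gramcheck - This program does FST lookup on forms specified as Constraint Grammar format readings, and looks up error tags in an XML file with human-readable messages. It is used as a late stage of a grammar checker pipeline.
- divvun-keyboard - Keyboard apps for iOS and Android, with keyboard layouts for indigenous and minority languages.
- divvunspell - hfst-ospell (below) rewritten in Rust, for robust concurrency and memory management. In practical use it is 10 times faster than hfst-ospell. It uses the same ZHFST files as hfst-ospell, which are available for all languages in the GiellaLT GitHub org (see below).
- DLTK - Deutsch Language Toolkit. More.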
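- epitran - Grapheme-to-phoneme (G2P) conversion for many low-resource languages.
- ELDER - Endangered Language Data Electronic Repository: a web-based, ontology-based collaborative linguistic data cataloguing tool.
- enchant - Enchant spellchecking library https://abiword.github.io/enchant/.
- ExSite9 - ExSite9 is a desktop application designed to make it easy and quick to describe data files with descriptive metadata, and then package the data files and associated metadata ready for submission to a repository. ExSite9 also lets you organize the structure of files on your local file store, so that files and metadata are properly organized before packaging.
- fast_align - Simple, fast unsupervised word aligner.
- fastText - Library for fast text representation and classification.
- FieldWorks - FieldWorks is a suite of software tools for linguistic and cultural data, with support for complex scripts. https://software.sil.org/fieldworks/ FieldWorks Language Explorer (or FLEx for short) is designed to help field linguists perform many common language documentation and analysis tasks. It can help you elicit and record lexical information, create dictionaries, interlinearize texts, analyze discourse features, and study morphology.
- franc - Natural language detection https://wooorm.com/franc/.
- FwDocumentation - Developer documentation for FieldWorks (software tools for linguistic and cultural data, with support for complex scripts).
- FwLocalization - Localization of FieldWorks.
- FwSupportTools - Additional tools for FieldWorks development.
- Gaia - Gaia is the HTML5-based phone UI for the Boot 2 Gecko project. Note: for details on released branches, see the wiki. If you are interested in setting up a keyboard in a new language, see this information.
- giellakbd-android - A fork of LatinIME (by Google, for Android) targeting marginalised languages, which deserve first-class status on mobile operating systems too. Used by kbdgen (see elsewhere on this page).
- giellakbd-ios - An open source reimplementation of Apple's native iOS keyboard, with a special focus on support for localised keyboards. Used by kbdgen (see elsewhere on this page).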
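- giza-pp - GIZA++ is a statistical machine translation toolkit used to train IBM Models 1-5 and an HMM word alignment model. The package also contains the source for the mkcls tool, which generates the word classes needed to train some of the alignment models.
- gv-crawl - Global Voices bitext crawler for creating parallel corpora.
- GlotLID - fastText language identification with support for more than 2000 labels.
- Glottolog data - Glottolog provides comprehensive reference information for the world's languages.
- Gramadóir - A grammar checking engine designed for minority languages and other languages with limited computational resources.
- grind - An InDesign 5.5 plugin designed to allow the use of Graphite smart fonts in Adobe InDesign. The project combines SIL's Graphite 2 smart font technology with our own paragraph composer plugin implementation.
- HermitCrab - HermitCrab.NET is a flexible morphological/phonological parser that takes an item-and-process approach.
- hfst-ospell - HFST spell checker library and command line tool.
- hfst-ospell-js - Node bindings for hfst-ospell.
- hfst-optimized-lookup - HFST optimized-lookup standalone library and command line tool.
- hundict - Bilingual dictionary extractor from parallel corpora.
- Hunspell - Spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding.
- HunTag - A sequential tagger for NLP using maximum entropy learning and hidden Markov models.
- icu-dotnet - C# wrapper for ICU4C.
- icu4c - Mirror of the SVN project at http://source.icu-project.org/repos/icu/icu/. The FieldWorks branch has some FieldWorks-specific enhancements.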
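- iLanguage - A semi-unsupervised, language-independent morphological analyzer useful for stemming unknown language text, or for getting a rough estimate of the possible morphemes in a word. Input: a corpus. Uses techniques including compression and maximum entropy.
- ipa-help - IPA Help.
- itweets-geodata - Geodata from Indigenous Tweets.
- jquery.ime - jQuery-based input method library.
- kbdgen - Generates keyboards and keyboard layouts for a variety of operating systems.
- Koreksyon - A tool for developing and implementing spelling and grammar checking for low-resource languages.
- l20n.js - L20n reinvents software localization. Users should be able to benefit from the entire expressive power of a natural language. L20n keeps simple things simple, and at the same time makes complex things possible. This is the JavaScript implementation of L20n. http://l20n.org.
- langid.py - Standalone language identification system.
- Langtech - An SVN repository provided by the University of Tromsø, with many resources. Details are here, and here in English.
- lego-unified-concepticon - Material related to the LEGO Unified Concepticon.
- lex4all - Pronunciation lexicons for any low-resource language http://lex4all.github.io/lex4all/.
- LexDB - LexDB is a lexical cognate tracking database. It stores full provenance for all lexical items and cognate judgments, and allows export to a number of Nexus dialects. The database is written in the flexible Python/Django web framework.
- LfMerge - Send/Receive for languageforge.org.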
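- liblevenshtein - A library for generating finite state transducers based on Levenshtein automata.
- libpalaso - Palaso library: a set of .NET libraries useful to developers of language software.
- LinGO Grammar Matrix - The LinGO Grammar Matrix is a framework for developing broad-coverage, precision, implemented grammars for diverse languages.
- lingpy - LingPy: a Python library for quantitative tasks in historical linguistics http://lingpy.org.
- Linguistica - Linguistica is a program designed to explore the unsupervised learning of natural language, with a primary focus on morphology (word structure). It runs under Windows, Mac OS X, and Linux, and is written in C++ within the Qt development framework. Its memory requirements depend on the size of the corpus analyzed.
- long-press - A jQuery plugin that eases the writing of accented or rare characters. http://toki-woki.net/lab/long-press/.
- low-resource-pos-tagging-2014 - Low-resource POS tagging: 2014.
- LRL - Work on low-resource languages.
- MacVoikko - OS X spelling server based on Voikko.
- machine - Machine is a natural language processing library for .NET that focuses on providing tools for processing resource-poor languages (used by FLEx).
- extension - Scripts for generating spell-checking extensions.
- mgiza - A word alignment tool based on the famous GIZA++, extended to support multi-threading, resume training, and incremental training.
- Minority Translate - Minority Translate is a simple program that helps generate content for smaller (in terms of actual size) Wikipedias by presenting existing articles from other language Wikipedias, so that users can easily translate or adapt existing text, thereby improving the size and usability of their Wikipedia edition.
- Morfessor - Morfessor is a tool for unsupervised and semi-supervised morphological segmentation.
- MOMPHOLM - Morphology-aware language model.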
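- morph-test - Python scripts for running generation and analysis tests on morphological transducers built with the Giella infrastructure. Works with HFST, Xerox's FST tools, and Foma.
- mosesdecoder - Moses, the machine translation system.
- moz-l10n-layer - Creates a pseudo-template to evaluate string priority for l10n.
- mukurtucms - Mukurtu Content Management System (CMS) is an internet-based platform designed to archive digital cultural resources.
- MyThes - MyThes is a simple thesaurus that uses a structured text data file and an index file with binary search to look up words and phrases and return information on part of speech, meanings, and synonyms.
- myWorkSafe - Smart and simple backup for language development workers. http://software.sil.org/myworksafe/.
- Nabu - Nabu is a digital media item management system that provides a catalog of audio and video items, metadata for those items, and information on the workflow status of items. www.paradisec.org.au
- natural - General natural language facilities for Node, in JavaScript.
- NIST 2008 Open Machine Translation Evaluation
- NLTK - Python Natural Language Toolkit. NLTK source http://www.nltk.org/.
- node-panlex - Node.js client for PanLex.
- Norma - A tool for automatic spelling normalization.
- nplm - A fork of https://nlg.isi.edu/software/nplm/ with some efficiency tweaks and adaptations for use in mosesdecoder.
- octothorpe - CouchDB-powered wiki thing.
- odtxslt - Performs XSLT transformations on the contents of packages (such as ODT, DOCX, etc.).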
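- old-webapp - Online Linguistic Database: software for creating web applications for collaborative language documentation.
- old - Online Linguistic Database (OLD): software for linguistic fieldwork. http://www.onlinelinguisticdatabase.org.
- old-pyramid - The Online Linguistic Database migrated to the Pyramid framework.
- omegat-hfst-tokenizer - Provides FST-based tokenization in OmegaT.
- OpenDataKit - Open Data Kit (ODK) is a suite of open-source tools that helps organizations author, field, and manage mobile data collection solutions.
- OpenNLP - The Apache OpenNLP library is a machine learning based toolkit for processing natural language text. Website.
- ops-devbox - Ansible playbook for (Linux) developer machines.
- panlex-tools - This package contains scripts to convert lexical resources into a format suitable for import into PanLex. Documentation can be found at https://dev.panlex.org.
- pdsc-collection-viewer - PARADISEC collection viewer.
- Paradigm - Paradigm is Joseph E. Grimes's …
- Pathway - Prepares language data for publication.
- PdfDroplet - Library and GUI for imposing PDF pages (for example, 2-up) http://software.sil.org/pdfdroplet/.
- Pepper - Pepper is a Java-based open source converter framework for linguistic data.
- Phonology Assistant - Phonology Assistant is a discovery tool. Given a corpus of phonetic data, it automatically charts the sounds and, through its search capabilities, helps a user discover and test the rules of sound in a language.
- Presagio - Presagio is a library for text prediction based on n-gram models. For example, you can send a string and the library will return the most likely word completions for the last token in the string.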
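- PrimerPro - The purpose of PrimerPro is to help literacy workers develop a primer for a given language.
- pydelphin - Python libraries for DELPH-IN (friendly fork).
- RBGParser - A graph-based dependency parser.
- Rosetta Pangloss - The Rosetta Project's Pangloss system.
- SALM - SALM: suffix arrays and their applications in empirical language processing.
- Salt - A graph-based model for storing and manipulating linguistic data.
- SayMore - A tool that makes common language documentation tasks easier, such as keeping track of all the resulting files and metadata, converting files to archival formats, and transcription.
- secwepemc-facebook - Translates Facebook into an unsupported language.
- SegParser - Randomized greedy algorithm for joint segmentation, POS tagging, and dependency parsing.
- SeedLing - Building and using a seed corpus for the Human Language Project.
- Skype in your language - Translates Skype into an unsupported language.
- Solid - Solid is a software tool for checking, cleaning up, and converting Standard Format (for example, Toolbox) dictionary data.
- Sphere conversion tools - Many LDC corpora contain speech files in NIST SPHERE format. The programs below convert SPHERE files to other formats.
- StandardFormatLib - Standard Format library.
- Stanford CoreNLP - Stanford CoreNLP: a Java suite of core NLP tools. https://stanfordnlp.github.io/corenlp/.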
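- Stanford CoreNLP Python - Python wrapper for Stanford University's CoreNLP tools.
- Stanza - The Stanford NLP Group's shared Python tools.
- str2ipa - Pronunciation dictionaries for languages with near-phonetic writing systems.
- sugali - This is the old repository of a language identification project covering many (many) languages, written for the software project course on NLP for low-resource languages.
- sugarlike - Language identification for low-resource languages (Susanne, Guy, and Liging).
- syllable - Python interface to a generic syllabification algorithm.
- tasty-imitation-keyboard - A custom keyboard for iOS8+ that serves as a tasty imitation of the default Apple keyboard. Built using Swift and the latest Apple technologies!
- TECkit - Text Encoding Conversion toolkit.
- Teny - Low-resource machine translation tools.
- teradict - Translate English words into hundreds of languages!
- tesseract.js - Pure JavaScript OCR for 62 languages http://tesseract.projectnaptha.com/.
- texnlp - TexNLP: Texas Natural Language Processing tools.
- TiMBL - TiMBL is an open source software package implementing several memory-based learning algorithms, among them IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases.
- Toney - Software for tone classification.
- Field Linguist's Toolbox - Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data.
- Toolbox scripts for ELAN - Mirror of Alexander Koenig's Toolbox scripts https://tla.mpi.nl/tools/tla-tools/elan/thirdparty/.
- ToolsForFieldLinguistics - A collection of scripts and recipes for linguistics.
- transcriber - An HTML5 transcription tool for Aikuma.
- Transliteration Engine - A transliteration engine written in JavaScript.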
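- tsammalex-data - Tsammalex is a multilingual lexical database on plants and animals.
- Tweet2Learn - An app that makes it easier to use your mother tongue on Twitter.
- twitter_langid - Hierarchical character-word neural network for language identification.
- UniversalDependencies docs - Universal Dependencies online documentation http://universaldependencies.org/docs/.
- UniversalDependencies tools - Various utilities for processing the data.
- VocBench - VocBench is a web-based, multilingual editing and workflow tool that manages thesauri, authority lists, and glossaries using SKOS-XL.
- wavesurfer.js - Navigable waveform built on Web Audio and Canvas https://wavesurfer-js.org/ (also has an ELAN plugin).
- web-template - This is a web-based template that can be used to present language learning resources in support of language revitalization efforts. It includes a talking dictionary and a phrasebook containing sentences and phrases.
- WebCorpus - This project is a collection of scripts and programs for creating a web corpus from crawled data.
- wikt2dict - Wiktionary parser tool for many language editions.
- WikiPron - Retrieves IPA pronunciations from Wiktionary entries.
- Word Generator - WordGenerator generates hypothetical words from a specification of their syllable structure.
- wordboundary - Experiments in word boundary detection and segmentation.
- WordByWord - WordByWord is a free, easy-to-use multimedia vocabulary trainer developed by Vera Ferreira, Peter Bouda, and Ricardo Filipe of CIDLeS, with support from the Foundation for Endangered Languages.
- wsi4urlang - Word sense induction (WSI) for under-resourced languages (URLANG).
- xdxf_makedict - XDXF dictionary format and "makedict" dictionary converting software (official repository).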
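Keyboard Layout Configuration Helpers
- jquery.ime - jQuery input method editor, used on Wikipedia.
- kbdgen - Generates keyboards and keyboard layouts for Windows, macOS, X11, iOS, Android, and Chrome from a simple YAML file. It also registers languages unknown to Windows, so that after installation there is a correct and stable association between the specified BCP 47 code (including full support for ISO 639-3) and installed language tools such as keyboards, spell checkers, and other tools.
- Keyboard - Virtual keyboard using jQuery ~ https://mottie.github.io/keyboard/.
- Keyboards - Open source keyboards.
- Keyman - Keyman cross-platform input methods. Keyman lets you type in over 1000 languages on Windows, iPhone, iPad, Android tablets and phones, and even instantly in your web browser. Website.
- keyboardlayouteditor - Keyboard layout editor https://code.google.com/archive/p/keyboardlayouteditor/.
- keyboard-layout-editor - Keyboard layout editor http://www.keyboard-layout-editor.com
- Lipika-IME - An input method engine (IME) for Mac OS X with built-in support for all Indic languages.
- XKeyboardConfig - Non-arch keyboard configuration database for X Window. The goal is to provide consistent, well-structured, open source X keyboard configuration data for X Window System implementations (free, open source, and commercial). The project targets XKB-based systems.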
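Annotation
- AGTK - AGTK is a suite of software components for building tools for annotating linguistic signals, that is, time-series data which documents any kind of linguistic behavior (for example, audio or video). The internal data structures are based on annotation graphs. (Original project is on SourceForge: https://sourceforge.net/projects/agtk/).
- Brendano - Graph Fragment Language for easy syntactic annotation https://www.cs.cmu.edu/~ark/fudg/.
- ELAN - ELAN is a professional tool for creating complex annotations on video and audio resources.
- EOPAS - EthnoER Online Presentation and Annotation System.
- FLAT - FoLiA Linguistic Annotation Tool - FLAT is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia/), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich them with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure.
- gfl_syntax - Graph Fragment Language for easy syntactic annotation https://www.cs.cmu.edu/~ark/fudg/.
- graf-python - graf-python is an open source Python library for parsing and writing GrAF/XML files as described in ISO 24612. The library's parser creates an annotation graph from the files. The user can then query the annotation graph via the graf-python API.
- Kwaras - Tools for ELAN corpus management.
- LDC Word Aligner - LDC Word Aligner is a software tool for manual annotation of word alignment, supporting Arabic-English and Chinese-English word alignment tasks. It has a clean, easy-to-use interface. Since its development in 2009, LDC has used LDC Word Aligner to produce more than 1,000,000 annotated word alignments from a variety of genres, including broadcast, newswire, and web-based sources. Website.
- Poio-Analyzer - Poio is a collection of software tools for linguists working in language documentation, descriptive linguistics, and/or language typology. It allows linguists to manage and analyze their data. The Poio Interlinear Editor allows adding morpho-syntactic annotations to transcriptions. It supports various file formats for input, but will only output standardized XML defined by the Corpus Encoding Standard and the Text Encoding Initiative. Several tools for analyzing linguistic data will be made available to further process annotated data. Poio tools are written in Python and are based on PyQt.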
- poio-api - Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan's EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called “Graph Annotation F…
- pyannotation - PyAnnotation is a Python Library to access and manipulate linguistically annotated corpus files.
- XTrans - XTrans is a next generation multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. The XTrans toolkit provides new and efficient solutions to common transcription challenges and addresses critical gaps in existing tools. Designed with input from experienced human transcribers working with real world data, XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks including (virtual) segmentation of audio into smaller units like turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript like topics.
Format Specifications
- spec - The official specification for the DLx linguistic data format. https://digitallinguistics.github.io/spec/.
- FoLiA - FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. http://proycon.github.io/folia/
- xdxf_makedict - XDXF dictionary format and "makedict" dictionary converting software (official repository).
i18n-related Repositories
- Express-Lingua - An i18n middleware for the Express.js framework.
- Polyglot.js - Give your JavaScript the ability to speak many languages.
- Transifex - System for providing a nice, user-friendly, project-oriented approach to translating .po files. Great for non-technical users, free for open-source projects, decent for minority languages; however, it can take a while to get a new language added to the Transifex system because the ticketing system Transifex uses results in them losing tickets sometimes. Provides translation memory, ability to appoint reviewers, etc. Transifex used to have an open source system that you could host on your own, but that seems to have disappeared.
Audio automation
- arctic-prompts - Generate prompts PDF for CMU ARCTIC dataset.
- AudioWebService - a simple nodejs server which accepts upload of audio and runs it through praat.
- AuToBI - Automatic prosodic annotation tool written in Java.
- BashScriptsForPhonetics - (Fork of a dormant project).
- esv-text-audio-aligner - ESV Text/Audio Aligner to programmatically obtain the timings for each word in the corresponding audio.
- html5-audio-read-along - HTML5 Audio Read-Along.
- ipa-chart - International Phonetic Alphabet (IPA) Unicode Chart and Character Picker.
- kaldi-svn-archive - A read-only archive of the original Kaldi SVN repository (mainly to keep sandboxes available).
- lex4all - Pronunciation LEXicons for Any Low-resource Language (Fork of a student project).
- Montreal-Forced-Aligner - Python interface for forced text/speech alignment.
- node-pocketsphinx
- opensauce - GNU Octave-compatible version of VoiceSauce.
- pocketsphinx - PocketSphinx is a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop.
- pocketsphinx-ios-demo - Simple demo for iOS.
- pocketsphinx-python - Python module installed with setup.py.
- pocketsphinx-ruby - Ruby speech recognition with Pocketsphinx.
- pocketsphinx-wp-demo - Demo to run pocketsphinx on WP8 platform.
- pocketsphinx.js - Speech recognition in JavaScript.
- praat-py - From my PhD days: Praat-Py is a custom build of Praat, the computer program used by linguists for doing phonetic analysis on sound files, to allow for scripts to be written in the Python programming language, rather than in Praat's built-in language. (Fork of a dormant project).
- Praat-Scripts - Mietta's Scripts.
- PraatTextGridJS - A small library which can parse TextGrid into json and json into TextGrid.
- PraatontheWeb - Web implementation of Praat. Source code, running demo scripts on web, samples and documentation.
- prosodicParsing - different kinds of HMMs to use for incorporating prosody into basic parsing.
- Prosodylab-Aligner - Python interface for forced audio alignment using HTK and SoX.
- prosodylab.alignertools
- Recordmp3js - Record MP3 files directly from the browser using JS and HTML.
- sphinx4 - Pure Java speech recognition library.
- sphinxbase
- sphinxtrain
- TLSphinx - Swift wrapper around Pocketsphinx.
Text-to-Speech (TTS)
- espeak - eSpeak is a compact open source software speech synthesizer for English and other languages, for Linux and Windows. http://espeak.sourceforge.net.
- MARY TTS - MARY TTS -- an open-source, multilingual text-to-speech synthesis system written in pure java http://mary.dfki.de.
- Ossian - Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision.
Automatic Speech Recognition (ASR)
- Elpis - Elpis is software for creating speech recognition models and applying them to the transcription of audio. As of 2022, it gives access to Kaldi and Huggingface Transformers.
- kaldi - This is now the official location of the Kaldi project.
- Persephone - Persephone aims to make state-of-the-art phonemic transcription accessible to people involved in language documentation, who have a training corpus of about one to four hours of transcribed speech. As of 2022, Persephone is superseded by Elpis.
Text automation
- clld - Cross Linguistic Linked Data python library.
- LaTeX2HTML5 - LaTeX web components.
- MultilingualCorporaExtractor - Node io Spider for extracting multilingual corpora (Fork of a student project).
- SeedLing - Building and Using A Seed Corpus for the Human Language Project (Fork of a student project).
Experimentation
- experigen - A framework for creating linguistic experiments.
- GamifyPsycholinguisticsExperiments - A simple node server to gamify linguistics experiments, runs offline on a laptop for small scale experiments and online on a server for large scale experiments. Data is sent to a Google spreadsheet. (Fork of a dormant project).
- OpenSesame - Graphical experiment builder for the social sciences.
- OPrime - Open Source Experimentation Libraries - Online and Offline for Android and HTML5.
- psychopyMegProsody - Runs MegProsody using PsychoPy.
- PsychScript - A HTML5/Javascript library for running behavioural experiments online.
Flashcards
- Anki - Anki is a program to make and share flashcard decks (including audio) for any language or writing system. https://apps.ankiweb.net/.
- awesome-anki - A curated list of awesome Anki add-ons, decks and resources.
- VocabLift - Language-learning tool that uses vocabulary from LIFT-format dictionaries produced by programs such as Fieldworks Language Explorer and WeSay.
Natural language generation
- OpenCCG - OpenCCG library for parsing and realization with CCG. Includes mini-grammars for Inuit, Nezperce, Basque and others.
Computing systems
- Common Language Resources and Technology Infrastructure Norway / Clarino - One of their projects (not clearly listed here) is about providing an online system for language analysis, so users can connect resources visually, dump in text, and get a result. Kind of like the Yahoo! Pipes but for language processing. Uses the ABEL cluster.
Android Applications
- Aikuma - Android software for recording and translation.
- Android Speech Recognition Trainer - Speech recognition training app for low resource languages which interfaces with FieldDB corpora.
- android-template - This is a template of an Android word-learning app that may be used as a way to introduce a language. It includes a quiz. For the documentation, go to http://eddersko.github.io/android-template/.
- AndroidFieldDB - An Android app which lets the user build a custom visual and auditory vocabulary, useful for guided anomia treatment and self designed language lessons by heritage speakers.
- AndroidFieldDBElicitationRecorder - A general purpose video recording tool.
- AndroidLanguageLessons - Lets heritage speakers create self designed language lessons.
- AndroidProductionExperiment - Android App to run perception experiments.
- Bevara - Android Phone Application designed for Linguistic Fieldwork to help preserve, maintain, and save endangered languages.
- ojoVoz - A mobile app for sending georeferenced image and voice recordings from an Android phone to an email address. For more information, please go to http://sautiyawakulima.net/ojovoz/.
- pocketsphinx-android - pocketsphinx build for Android.
- pocketsphinx-android-demo
Chrome Extensions
- babelfrog - Chrome extension to help learn languages as you browse.
- DictionaryChromeExtension - Dictionary for websites in low-resource languages. App and codebase which connects to a Wiktionary to provide definitions of any term on any website (current languages: Cherokee 194,426 entries, Inuktitut 251 entries, Kartuli 7,363 entries, Plains Cree (incubation) 0 entries).
FieldDB
FieldDB is actively worked on by the FieldDB (formerly known as OpenSourceFieldlinguistics) group. These repos explicitly work with it but could be repurposed for other projects.
- FieldDB - An offline/online field database which adapts to its user's terminology and I-Language, has plugins for various data automation routines along the process of primary data collection to cleaning to publication and archival.
FieldDB Webservices/Components/Plugins
- AndroidLanguageLearningClientForFieldDB-sikuli - Sikuli tests for AndroidLanguageLearningClientForFieldDB.
- AuthenticationWebService - A node.js web service which manages users and corpora creation and authentication.
- bower-fielddb-angular - A bower repository which hosts fielddb-angular components, bower install fielddb-angular --save.
- bower-fielddb - A bower repository which hosts fielddb core components, bower install fielddb --save.
- fielddb-spreadsheet-sikuli - Sikuli tests for the spreadsheet module.
- FieldDBActivityFeed - A fielddb activity feed widget which can be embedded in other codebases, websites, etc.
- FieldDBGlosser - A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. bower install fielddb-glosser --save.
- FieldDBLexicon - A lexicon browser/editor web widget for FieldDB databases.
- LanguageClassDashboard - App which provides a view of FieldDB corpora for language teachers.
- LexiconWebService - A node.js ElasticSearch wrapper for indexing/training lexicons from corpora.
- LexiconWebServiceSample - A node.js web server which implements the fieldlinguist's lexicon API for the FieldDB project.
Academic Research Paper-Specific Repositories
- Gargantua - Fast Unsupervised Sentence Aligner described in "Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora", COLING 2010.
- ldc-kiy - Materials for: The experimental state of mind in elicitation: illustrations from tonal fieldwork. Submitted to Language Documentation & Conservation, How to study a tone language.
- Learning to map into a Universal POS tagset - Yuan Zhang, Roi Reichart, Regina Barzilay and Amir Globerson
- low-resource-pos-tagging-2014 and low-resource-pos-tagging-2014 Published in: Learning a Part-of-Speech Tagger from Two Hours of Annotation. Dan Garrette and Jason Baldridge . In Proceedings of NAACL 2013. And in: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages. Dan Garrette, Jason Mielens, and Jason Baldridge . In Proceedings of ACL 2013.
- orthotree - Linguistic family tree based on orthographic distance.
- type-supervised-tagging-2012emnlp This repository contains the code, scripts, and instructions needed to reproduce the results in the paper: Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries. Dan Garrette and Jason Baldridge . In Proceedings of EMNLP 2012. This code is frozen as of the version used to obtain the results in the paper. It will not be maintained. To see the updated code, visit nlp
- visualizing-language - For visualizations of WALS and other typological databases.
- WALS-APiCS - Code for working with WALS-APiCS (Atlas of Pidgin and Creole Language Structures) complexity metrics.
Example Repositories
These are repositories that are generally only interesting for training purposes or seeing how something is done.
- CorpusWebService - über-simple node.js-Proxy to enable CORS request for couchdb.
- CorporaForFieldLinguistics - Small corpora from diverse language typologies, useful for testing scripts.
- startR
- lucenerevolution-2013 - Demo examples for linguistics in Lucene and Solr.
- berlin-buzzwords-2013 - Demo examples for Lucene, Solr, ElasticSearch and OpenNLP from Berlin Buzzwords 2013 talk.
Fonts
- fontinline - Make inline stroke paths from an outline font.
- Noto Fonts - Noto is Google's free font family that aims to support all the world's scripts. Its design goal is to achieve visual harmonization across languages. Noto fonts are under Apache License 2.0.
- Unicodify Unicodify is a suite of programs for converting text in a variety of 8-bit encodings to Unicode (using the UTF-16 encoding). Unicodify was particularly designed to handle HTML-based text using non-ISCII 8-bit fonts to render South Asian scripts. However, elements of the suite can map other types of non-ASCII 8-bit encodings, such as Latin-2, ISCII and PASCII.
Corpora
These corpora are useful for working with tools on endangered languages. Monolingual corpora that are more for archival efforts should most likely not be included here.
- bible-corpus - A multilingual parallel corpus created from translations of the Bible.
- poio-corpus - The Poio Corpus is a freely available collection of language resources for the lesser-used languages. The data is extracted from free sources like Wikipedia, dictionaries, documents, websites and others.
Organizations
On GitHub
- batumi - Speech recognition and natural language processing for low-resource languages
- BloomBooks
- unicode-cldr - Unicode Common Locale Data Repository (CLDR) Project http://cldr.unicode.org
- cmusphinx - Mirror of the SourceForge repositories
- dativebase - Tools for working with OLD.
- divvun - The Divvun group at UiT develops proofing tools, keyboard apps and other language technology solutions for indigenous and minority languages, especially the Sámi languages. Website.
- FieldDB
- GiellaLT - Home for keyboard layouts, lexicons and morphologies for indigenous and minority languages, especially for morphologically complex languages, using mainly rule-based technologies. The resources are used by Divvun (above) and Giellatekno (below) to build a number of tools for the language communities. Almost everything is open source.
- HFST - Helsinki Finite-State Technology. Website.
- hunspell
- keymanapp - Website.
- langtech - Language Technology Group, University of Melbourne
- lex4all
- longnow
- MontrealCorpusTools
- moses-smt - Statistical Machine Translation.
- mukurtucms
- NLTK - Natural Language Toolkit.
- PhonologicalCorpusTools
- Projet de recherche sur l'écriture - Crowdsourcing or conducting large scale psycholinguistics experiments (or statistically significant field linguistics).
- prosodylab - Prosodylab at McGill University, Canada
- SIL International (Dev) - Another SIL International organization, with many repositories.
- SIL International - SIL (originally known as the Summer Institute of Linguistics, Inc.) is probably the leading organization which provides software and tools tailored for use by field linguists and lexicographers working on endangered languages. A little known fact is that much of its code is open sourced on GitHub, and SIL is happy to receive open source contributions and collaborate on open source projects.
- SIL NRSI - SIL Non-Roman Script Initiative. The NRSI is a department of SIL International, whose task is to provide assistance, research and development for SIL International and its partners to support the use of non-Roman and complex scripts in language development.
- StanfordNLP https://nlp.stanford.edu
- ucsd-field-lab - University of California, San Diego
- UniversalDependencies - Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on an evolution of (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary.
- utcompling - The University of Texas at Austin's Computational Linguistics Lab. Website.
Other OSS Organizations
- Giellatekno - Giellatekno combines cutting-edge linguistic and computational research into the analysis of Saami and other morphologically-rich languages, with the development of practical applications. We focus on deep linguistic modeling and on highly efficient and robust computational analysis with a wide empirical coverage. They use svn for their code: all of it can be found here, sorted by language.
- LOWLANDS - LOWLANDS – Parsing low-resource languages and domains https://ccc.ku.dk/research/lowlands/
- LTRC: Language Technologies Research Center IIIT Hyderabad LTRC addresses the complex problem of understanding and processing natural languages in both speech and text mode. LTRC conducts research on both basic and applied aspects of language technology. It is the largest academic centre of speech and language technology in South Asia. LTRC carries out its work through four labs, which work in synergy with each other, as listed above.
- The Language Archive - Part of the MPI.
Tutorials
- How to Write a Spelling Corrector by Peter Norvig.
Language Specific Projects
For each language, we include the ISO 639-3 code, and the main autonym for that language.
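As a rough illustration of this convention, the `code :: autonym` lines that open each language entry (for example `afr :: Afrikaans`) can be read mechanically. The sketch below is only an assumption about how one might parse them; `LanguageTag` and `parse_language_tag` are hypothetical names, not part of any tool in this list.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    """One 'code :: autonym' line (hypothetical helper type)."""
    iso639_3: str
    autonym: str


def parse_language_tag(line: str) -> Optional[LanguageTag]:
    """Return the tag on a 'code :: autonym' line, or None for any other line."""
    code, sep, autonym = line.partition("::")
    code, autonym = code.strip(), autonym.strip()
    # ISO 639-3 codes are exactly three lowercase ASCII letters.
    looks_like_code = (
        len(code) == 3 and code.isascii() and code.isalpha() and code.islower()
    )
    if not sep or not looks_like_code or not autonym:
        return None
    return LanguageTag(code, autonym)
```

For instance, `parse_language_tag("afr :: Afrikaans")` yields a tag with `iso639_3 == "afr"`, while an ordinary list item such as `- aspell-gl - Galician dictionary for aspell` yields `None`.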
Afrikaans
afr :: Afrikaans
- Afrikaanse rekenaarlinguïstiek (Afrikaans computational linguistics) — wordlists, corpora, morphological analyser, tagger, word decompounder. Available upon email.
Albanian
sqi :: shqip
- Apertium rules for Albanian - Machine Translation rules
- out-of-copyright-albanian-authors - Out-of-copyright authors scraped from the Albanian-language Wikipedia.
- Plis keyboard - The Plis keyboard is a keyboard or computer keyboard layout for the Albanian language.
- spell checking - Here you find a collection of Albanian words and information about them. Aspell, Ispell, and MySpell are included.
Alutiiq
ems :: sugpiaq
- wiinaq - Word Wiinaq is a Kodiak Alutiiq dictionary web application with automatically generated ending tables and souped-up search capabilities. It is written in Python using Django.
Amharic
amh :: አማርኛ
- HornMorpho - Morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs
Basque
eus :: euskara
- Matxin - An open-source transfer machine translation engine. Linguistic information for translation from Spanish to Basque (es-eu) is included.
Bengali
ben :: বাংলা
- Bangla-অঙ্কুর for Mac - This project aims to develop a phonetic-based Bangla typing system for Macintosh computers, which could be developed into a transliteration technique in the future.
- Bengali Writer - 'Bengali Writer' is a set of utilities for computerized editing and typesetting in Bengali, a language of India and Bangladesh. It comprises a set of fonts for Bengali in several formats (METAFONT, BDF, PS), a text editor with spell-checking, export, and more. (Original project is on SourceForge: https://sourceforge.net/projects/bengaliwriter/).
- Ekushey - Bangla Computing and Localization Project for Bangla-speaking people.
- Lekho - A collection of tools and resources for using bangla on computers (Original project is on SourceForge: https://sourceforge.net/projects/lekho/).
Chichewa
nya :: chicheŵa
- Chichewa - NLP resources for Chichewa.
Galician
glg :: galego
- an-metri-gal - Análise métrico de texto en verso en lingua galega (Galician language) gl-ES
- android_gl_dict - Android Galician (gl_ES) Keyboard Dictionary
- aspell-gl - Galician dictionary for aspell
- CitiusSentiment - Sentiment analysis (opinion mining) for Portuguese, English, Spanish, and Galician
- CitiusTagger - A PoS-Tagger and Named Entity Classification tool for Portuguese, English, Galician, and Spanish
- Conshuga - Galician verb conjugator
- corpora - A collection of corpora of Galician (or Galicia-related) words / Colección de corpus de palabras en galego (ou relacionadas con Galicia)
- DepPattern - Dependency Syntactic Parsing for Portuguese, Spanish, English, and Galician, including MetaRomance parser
- DOGA_scraper - Galician Official journal scraper
- elFinder-language - Galician - Gallego / language for elFinder
- EuroWordNetLemon - EuroWordNet lemon lexicons generated from the LMF versions of the Multilingual Central Repository (MCR) EuroWordNet lexicons. It includes lexicons for Spanish, Catalan, Basque & Galician.
- GalegoDroid - Galician Translator for Android
- galeXtra - Multiword Extractor for Portuguese, English, Spanish, Galician, French
- Galician-Dependency-Treebank - This Galician Dependency Treebank has been developed by transliterating and adapting lexically the Portuguese part (Bosque 7.3 by the Floresta sintá(c)tica project) of the CONLL-X 2006.
- Galician-Fuzzy-Text-watch - Based on Fuzzy Text International by Jesse Hallett, uses the galician language to display time.
- galician-locale-for-mac - Galician locale for Mac OS X
- gl-syllabler - Split galician language words into syllables
- gl- Galician OmegaT Localisation
- hunspell-gl-ciencias - Project oriented into developing a science and maths Galician language Hunspell dictionary
- hunspell-gl - Galician hunspell dictionaries
- hyphen-gl - Galician hyphenation rules
- javagalician-java6 - The Java Galician Locale is an implementation of Java localization SPIs which will allow the Java VM to use the Galician Language (locales "gl" and "gl_ES"), one of the official languages of Spain, which is not included in Sun's JVM distribution.
- Linguakit - Multilingual toolkit for NLP: dependency parser, PoS tagger, NERC, multiword extractor, sentiment analysis, etc.
- ParlamentoGalicia - Project based on the information extracted from the transcriptions of the sessions held in the Galician Parliament
- poss-gl - Galician translation of Producing Open Source Software, by Karl Fogel
- rima - Find rhyming words in galician language.
- stopwords-gl - Galician stopwords collection
- texlive-babel-galician - TeXLive babel-galician package
- UD_Galician-CTG - The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus created at the University of Vigo by the TALG NLP research group.
- UD_Galician-TreeGal - The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña).
- UL_Galician-TreeGal - CoNLL-UL Repository for UD_Galician-TreeGal
Apertium
- apertium-cat-glg - Apertium translation pair for Catalan and Galician
- apertium-dict-en-gl - English-Galician language pair for Apertium
- apertium-dict-es-gl - Spanish-Galician language pair for Apertium
- apertium-dict-pt-gl - Portuguese-Galician language pair for Apertium
- apertium-en-gl - Apertium translation pair for English and Galician
- apertium-es-gl - Apertium translation pair for Spanish and Galician
- apertium-glg - Apertium linguistic data for Galician
- Apertium-pt-gl.pt-gl-LMF - This is the LMF version of the Apertium bilingual dictionary for the Portuguese and Galician languages
- apertium-pt-gl - Apertium translation pair for Portuguese and Galician
Georgian
kat :: ქართული
- awesome-georgia - A curated list of awesome libraries and packages specific/related to Georgia (country).
- Gadatsqvetilebebi - გადაწყვეტილებები; Web spider and corpora importer for public legal decisions.
- GeoWordsDatabase - Around 310 000 unique Georgian words https://bumbeishvili.github.io/GeoWordsDatabase/.
- Kartuli Speech Recognition - Building a speech recognition system for Georgian Android users. Codebase to turn any webpage from any alphabet into another alphabet; the default is to turn Latin letters into Kartuli. Use case: "Do your friends keep commenting on Facebook with English keyboards (either because they forgot to switch, or because they didn't/can't install a Georgian keyboard)? Now you can read the web through კართული eyes."
- KartuliChromeExtension - Chrome application that displays every English letter as the corresponding Georgian letter.
- QartuliDaBunebismetkveleba - Interactive textbook of mathematics and natural science for 2nd and 3rd grade pupils.
- SakartvelosUzenaesiSasamartloSarke - Mirror of the Supreme Court of Georgia.
- SamartlosSakonstitutsioSasamartdoSarke - Mirror of the Constitutional Court.
- translitit-latin-to-mkhedruli-georgian - A Latin to ქართული (Mkhedruli Georgian) transliteration function written in JavaScript.
- translitit-mkhedruli-georgian-to-ipa - A ქართული (Mkhedruli Georgian) to IPA transliteration function written in JavaScript.
- Declensions - Methods to generate declensions for Georgian language
Fonts
- Stichoza/font-larisome - Iconic font for Georgian currency inspired by Font-Awesome (CSS).
- Lotuashvili/BPGNateli - Bower package for BPG Nateli font (CSS).
- thecotne/georgian-webfonts - Package for Georgian fonts (CSS).
Internationalization and Localization (i18n/l10n)
- Stichoza/money-num-to-string - Convert a number/money to localized string (PHP, JavaScript).
- natchkebiailia/NumberToWord - Convert numbers to localized strings (JavaScript).
- d0ragon/number-to-words-ka - Convert numbers to localized strings (PHP).
- dimakura/ka - Common functionality for Georgian projects (Ruby).
- dimakura/ka.js - Georgian language support for node and browser (JavaScript).
- akalongman/kautilities - Convert Georgian letters to Latin and vice-versa (PHP).
- Landish/Laravel-Ka - Laravel Georgian Language Pack.
- Landish/RedactorJS-GE - Redactor WYSIWYG HTML Editor Georgian Language Pack (JavaScript).
- wenzhixin/bootstrap-table - Bootstrap table with extra features. l10n by @Lotuashvili and @Stichoza.
- moment/moment - A lightweight date library (JavaScript).
- ioseb/geokbd - Georgian keyboard library (JavaScript).
Guarani
grn :: Guarani
- ParaMorfo - morphological analysis and generation of Spanish and Guarani verbs, nouns, and adjectives.
Hausa
hau :: هَرْشَن هَوْسَ
- Hausa - Repository for Hausa NLP tools.
Hindi
hin :: हिन्दी
- hindi-morph - An open source morphological analyzer for Hindi.
Høgnorsk
nno :: Høgnorsk
- hunspell-hn_NO - The beginnings of a spellchecking tool for Høgnorsk, a conservative variant of Norwegian Nynorsk, based on a set of corpora.
Icelandic
isl :: íslenska
- IceNLP - IceNLP is an open source Natural Language Processing (NLP) toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java.
Inuktitut
iku :: Inuktitut
- InuktitutAlignerData - Scripts for alignment of laboratory speech production data.
- InuktitutComputing - Inuktitut Morphological Analyser, transcoder, transliterator, corpus tools, and lexical lists for working with Inuktitut. Usable online at http://inuktitutcomputing.ca/index.php.
Irish
gle :: Gaeilge
- aimsigh - Source for the now-defunct aimsigh.com Irish search engine.
- caighdean - Code for standardizing Irish language text.
- fleiscin - Irish hyphenation patterns for TeX https://cadhan.com/fleiscin/.
- GaelSpell - Sources for an Irish language spell checker.
- tesseract-gle-uncial - OCR for old Irish fonts.
Kinyarwanda
kin :: Ikinyarwanda
- kin-morph-fst - Kinyarwanda morphological analyzer.
- TurboTagger & TurboParser for Kinyarwanda (download)
Kurdish
kur :: Kurdî
- Kurlex - Morphological analyser and lexicon, written in the Alexina framework, licensed under the LGPL-LR.
- kurmanji-stemmer - NLTK-based Kurmanji stemmer.
Lingala
lin :: Lingála
- Lingala NLP - NLP tools and resources for Lingala.
Lushootseed
lut :: Lushootseed
- Lushootseed - Joshua Crowgey's work on Lushootseed http://students.washington.edu/jcrowgey/lushootseed/.
Malay
msa :: Bahasa Melayu
- MorfoMalayu - morphological analysis of Malay words.
Malagasy
mlg :: Malagasy
- Global Voices Malagasy Project - This page provides a link to a corpus of parallel news articles in Malagasy and English from the Global Voices project. This corpus was collected and aligned at the sentence level by Victor Chahuneau.
Manx
glv :: Gaelg
- aspell-gv - Manx Gaelic dictionary for aspell.
- gaelg - NLP resources for Manx Gaelic, mainly in support of the gv2ga MT engine.
Migmaq
mic :: Mi'kmaq
- migmaq-lessons - Repository for website building Mi'gmaq language lessons.
Minderico
drc :: Piação do Ninhou
- fredericajordarzambarino - A web-based game for mobile devices in Minderico, based on the "Who Wants to be a Millionaire" TV show.
Nishnaabe
oji :: Ojibwe, Oddawa, Chippewa, Anishinaabemowin, ᐊᓂᔑᓈᐯᒧᐎᓐ
- Ojibway-iphone-app - An iPhone app with audio and images for learning the Ojibway language.
- OjibwayMap - An iPhone app with audio and images for learning Ojibway language and culture.
- nishanimate - A desktop app to facilitate Nishnaabe-language acquisition via animations produced by the natural language processing of audio-accompanied text.
Oromo
orm :: Oromo
- HornMorpho - Morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs.
Quechua
que :: Runa Simi
- AntiMorfo - morphological analysis and generation of Quechua nouns, adjectives, and verbs and Spanish verbs.
- Morphology, spellchecker - XFST and FOMA, plus OpenOffice plugin.
Sámi
sma :: Sámi/Saami
- divvun-webdemo - Simple web demo for the Divvun grammar checker. Website.
- Giellatekno A host of Sámi tools.
- Mobile keyboards (iOS and Android), learning apps, dictionaries, morphologies, syntax disambiguators, some amount of project collaboration with Apertium on shallow translation between Saami languages, and
- Oahpa! - A learning portal for Saami languages. Includes WordPress-based, media-rich, lesson-based learning, and morphological and syntactic exercises generated from the morphological and syntactic tools
- Neahttadigisánit - A morphologically sensitive dictionary, with a mode for 'social media input' (which allows users to type a 'relaxed' version of the orthography: acdnstz will also be recognized as áčđŋšŧž). It also includes a JavaScript bookmarklet that offers click-to-read dictionary lookup. Also available for other Uralic and non-Uralic languages.
Giellatekno does a lot for other minority Uralic languages. Following are some keywords for CTRL+F friendliness:
- Saami languages: North Saami, Lule Saami, South Saami // Inari Saami, Kildin Saami, Pite Saami, Skolt Saami.
- Other Uralic languages: Erzya, Finnish, Hill Mari, Ingrian, Khanty, Kven, Komi, Livonian, Meadow Mari, Moksha, Nenets, Nganasan, Olonetsian, Udmurt, Veps.
- Other languages: Buriat, Cornish, Faroese, Greenlandic, Iñupiaq, Northern Haida, Ojibwe, Plains Cree, Russian.
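The 'relaxed' orthography input described for Neahttadigisánit above can be sketched in a few lines. This is a minimal illustration and not Giellatekno's implementation: the folding table, the function names, and the three sample headwords are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical folding table: North Saami letters -> the plain letters
# a user might type on a keyboard without Saami support.
FOLD = str.maketrans({
    "á": "a", "č": "c", "đ": "d", "ŋ": "n", "š": "s", "ŧ": "t", "ž": "z",
})


def fold(word: str) -> str:
    """Lowercase a word and replace Saami letters with their relaxed forms."""
    return word.lower().translate(FOLD)


def build_index(headwords):
    """Map each relaxed spelling to every headword that folds to it."""
    index = defaultdict(list)
    for headword in headwords:
        index[fold(headword)].append(headword)
    return dict(index)


def lookup(index, typed: str):
    """Return the headwords matching what the user typed, relaxed or exact."""
    return index.get(fold(typed), [])


# Illustrative headwords only; a real dictionary would supply its own list.
INDEX = build_index(["čáhci", "áhčči", "guolli"])
```

With this index, both `lookup(INDEX, "cahci")` and `lookup(INDEX, "čáhci")` return the entry `čáhci`. Folding both the headwords and the query to the same relaxed key keeps exact input working while tolerating missing diacritics; several headwords can share one key, so the lookup returns a list.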
Scottish Gaelic
gla :: Gàidhlig
- aspell-gd - Scottish Gaelic dictionary for aspell.
- briathrachan - This is the source code to Briathrachan, a Gaelic-English dictionary app for iOS.
- gaidhlig - NLP resources for Scottish Gaelic, mainly in support of gd2ga/ga2gd MT engines.
- gd-fcfg - Context-free feature-based grammar of Scottish Gaelic in the NLTK format.
- gdbank - Some tools and resources for natural language processing of Scottish Gaelic. https://www.tantallon.org.uk/cggblog/.
- hunspell-gd - Files for building Scottish Gaelic spell checkers.
Secwepemctsín
shs :: Secwepemctsín
- secwepemctsnem - A project to help people learn Secwepemctsín.
Somali
som :: Soomaaliga
- somorph - Somali morphological and syntactic analyzers and generators built on XFST and VISL-CG Constraint Grammar. The up-to-date version is checked in to Giellatekno's repository.
- qaamuus.net - Morphologically aware dictionary based on lexical resources found online and on the Somali morphology.
Tigrinya
tir :: ትግርኛ
- HornMorpho - morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs.
Uralic
urj :: Uralic languages
- UralicNLP - A Python library for processing Uralic languages (Finnish, Skolt Sami, Erzya, Moksha, Komi-Zyrian and so on). The library provides an easy programmatic access to Giellatekno resources such as FST morphology and CG disambiguators. Other functionalities include UD parser, API for the Online Dictionary of Uralic Languages and interface to SemFi and SemUr semantic databases. The library is under active development and new features are added from time to time.
Zulu
zul :: isiZulu
- Ukwabelana - An open-source morphological Zulu corpus.
License
© Richard Littauer 2014-2017