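Low Resource Languages
Resources for the conservation, development, and documentation of low-resource (human) languages.
By some estimates, half of the 7,000 languages currently spoken are expected to go extinct this century. However, academics, independent scholars, organizations, communities, and individuals are doing a lot of work to stop or slow this trend. This list collects open source code that may be useful for documenting, conserving, developing, preserving, or working with endangered languages.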
Slack Group
We have a Slack group for live discussions. Join us!
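Publication
A white paper describing this repository was published in the LREC 2016 CCURL Workshop (Collaboration and Computing for Under-Resourced Languages). The paper is in this repository, in the papers folder. Download the raw paper here: Open Source Code Serving Endangered Languages.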
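Contribute
To edit this list on GitHub, simply click here. If you would like to discuss anything related to this list, open an issue. If you know of any resources that are not in this list, please add them using the link above or by submitting a pull request.
More details on contributing are in the contributing guidelines.
If you are interested in discussing the list in some offline capacity, get in touch with @richardlitt. I am happy to have a call or an email exchange.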
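Table of Contents
Table of Contents generated with DocToc
- Definitions
- Generic Repositories
- Keyboard Layout Configuration Helpers
- Annotation
- Format Specifications
- i18n-related Repositories
- Audio automation
- Text-to-Speech (TTS)
- Automatic Speech Recognition (ASR)
- Text automation
- Experimentation
- Flashcards
- Natural language generation
- Computing systems
- Android Applications
- Chrome Extensions
- FieldDB
- Academic Research Paper-Specific Repositories
- Example Repositories
- Fonts
- Corpora
- Organizations
- Tutorials
- Language Specific Projects
- Afrikaans
- Albanian
- Alutiiq
- Amharic
- Basque
- Bengali
- Chichewa
- Galician
- Georgian
- Guarani
- Hausa
- Hindi
- Høgnorsk
- Icelandic
- Inuktitut
- Irish
- Kinyarwanda
- Kurdish
- Lingala
- Lushootseed
- Malay
- Malagasy
- Manx
- Migmaq
- Minderico
- Nishnaabe
- Oromo
- Quechua
- Sami
- Scottish Gaelic
- Secwepemctsín
- Somali
- Tigrinya
- Uralic
- Zulu
- License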
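Definitions
An endangered language is a human language that is at risk of extinction. This list also includes minority languages, which are spoken by a stable but small population (e.g., Maltese or Hawaiian), and low-resource or under-resourced languages, which may be spoken by large populations but are digitally under-resourced (e.g., Quechua). These languages share certain characteristics; most relevantly, sparse data and a lack of resources, from spell checkers to grammars to machine translation corpora. Other under-resourced languages that do not belong on this list include constructed languages (e.g., Klingon or Na'vi), computer languages (e.g., JavaScript or Lua), and languages so sparsely attested as to be irrelevant for most purposes (e.g., Tocharian).
Open source "promotes a universal access via a free license to a product's design or blueprint, and universal redistribution of that design or blueprint, including subsequent improvements to it by anyone." (Wiki). This is important, because money and resources allocated to languages or projects that are not open source are spent at the cost of possible extensibility elsewhere.
This list used to be named endangered-languages. It was renamed to reflect that "endangered" is a loaded term, one that may not reflect the views of the language communities of minority languages. The name low-resource-languages focuses this list on the dearth of digital resources, in contrast to other, high-resource languages.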
Tools which are built for these languages are not included (unless relevant for dialects or variants): Arabic, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, Flemish, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Norwegian (Bokmål), Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, Ukrainian, Valencian, Vietnamese. This list comes from the list of the most popular languages for website content on this Wikipedia page. Other metrics could be used; if you have another, please suggest it!
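This list is exceptional at one thing in particular: showing, in general terms, the wide variety of tools that exist in the field. It is not exceptional, however, at deep dives into specific languages or tool suites. For instance, it would not be helpful to list every Firefox language pack or Apertium language module for every low-resource language, nor would it help to include every tool noted in the ACL Wiki (largely tools from the IXA group), some of which are open source and some of which are not. Instead, treat this list as a starting point for further research.
Looking for resources for code languages? Check out the awesome collection of lists.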
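Generic Repositories
Single language lexicography projects and utilities
Utilities
- The Free Electronic Dictionary project is a Java MIDlet for mobile phones - indigenous language dictionaries.
- Websites hosting digital dictionaries for single languages.
- WeSay - Allows language communities to build their own dictionaries. https://software.sil.org/wesay/ (SIL International).
Software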
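- 4lang - Concept dictionary using Eilenberg machines.
- accentuate.us - Statistical unicodification of plain text in many languages.
- alignment-with-openfst - An implementation of a CRF autoencoder framework for four tasks: bitext word alignment, part-of-speech tagging, code switching, and dependency parsing.
- Apertium - Apertium is a toolbox for building open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs.
- ark-tweet-nlp - CMU ARK Twitter Part-of-Speech Tagger (fork).
- ArtOfReading - Indexes and processing scripts related to the Art of Reading illustration collection.
- Bayesline - Multinomial Bayesian classification for language identification.
- bible-corpus-tools - A collection of tools for reading and processing the multilingual Bible corpus.
- BloomDesktop - Bloom Desktop is a hybrid C#/JavaScript/HTML/CSS Windows application that dramatically "lowers the bar" for language communities who want books in their own languages. Bloom delivers a low-training, high-output system in which mother tongue speakers and their advocates work together to foster community authorship and access to external material… https://bloomlibrary.org/.
- BloomLibrary - Bloom Library single page app, using AngularJS and Bootstrap, with a Parse.com backend. https://bloomlibrary.org/.
- brain - Neural networks in JavaScript.
- Bristol Uni MT Morphology Tools - This repository is a mirror of the scripts that used to be available at http://www.cs.bris.ac.uk/research/machinelearning/morphology/morphology/resources.jsp. Included: Ukwabelana, an open-source morphological Zulu corpus, and EMMA: a novel evaluation metric for morphological analysis.
- brown-cluster - C++ implementation of the Brown word clustering algorithm.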
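- CasualConc - CasualConc is a concordance program that runs natively on Mac OS X 10.5 Leopard or later. It was originally designed for casual use (preliminary analysis or non-research purposes), though [the maintainer] has been using it for their own research (and possibly others have too). It can generate KWIC concordance lines, word clusters, collocation analyses, and word counts.
- cdec - Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms.
- Charlint - Charlint is a character normalization and checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15, as a test bed for Early Uniform Normalization in the W3C Character Model.
- chorus - A version control system designed to enable workflows appropriate for typical language development teams that are geographically distributed.
- CLAM - Computational Linguistics Application Mediator - Quickly turn NLP applications into RESTful web services with a web-enabled front end. You provide a specification of your command line application, its input, output, and parameters, and CLAM wraps around your application to form a fully fledged RESTful web service.
- CMU Sphinx - CMUSphinx is a speaker-independent large vocabulary continuous speech recognizer released under a BSD-style license. It is also a collection of open source tools and resources that allow researchers and developers to build speech recognition systems.
- CNMinlangWebCollect - Chinese minority website language detection and website collection.
- cog - Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties. http://sillsdev.github.io/cog/.
- convertextract - Converts Excel, Word, and PowerPoint files with non-Unicode text (e.g., text requiring SIL fonts) to Unicode while preserving the original file's formatting.
- CorpusTools - Phonological CorpusTools http://phonologicalcorpustools.github.io/corpustools/.
- CTK - Built around the LDC's Champollion sentence aligner kernel, the Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible. (Original project is on SourceForge: http://champollion.sourceforge.net).
- DataTags - A system that assesses the sensitivity and privacy risks of datasets and assigns tags describing how the datasets must be transferred, stored, and accessed. (Fork).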
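- A data repository framework for sharing and publishing research data.
- dative - Dative: software for linguistic fieldwork http://www.dative.ca.
- dative - A single page app for interacting with multiple linguistic fieldwork web service databases. Website.
- DeepLearnToolbox - Matlab/Octave toolbox for deep learning. Includes deep belief nets, stacked autoencoders, convolutional neural nets, convolutional autoencoders, and vanilla neural nets. Each method has examples to get you started.
- Desmeme - Database and tools for exploring linguistic templates.
- DictDB - Dictionary database for language translation.
- discoursegraphs - Python-based tools for converting and merging multi-layer annotated linguistic data.
- divvun-gramcheck - This program does FST lookup on forms specified as Constraint Grammar format readings, and looks up error tags in an XML file with human-readable messages. It is meant to be used as a late stage of a grammar checker pipeline.
- divvun-keyboard - Keyboard apps for iOS and Android, with keyboard layouts for indigenous and minority languages.
- divvunspell - hfst-ospell (below) rewritten in Rust, for robust concurrency and memory management. In practice it is about 10 times faster than hfst-ospell. It uses the same ZHFST files as hfst-ospell, which are available for all languages in the GiellaLT GitHub org (see below).
- DLTK - Deutsch Language Tool Kit. More.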
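- epitran - Grapheme-to-phoneme (G2P) transduction for many low-resource languages.
- ELDER: Endangered Language Data Electronic Repository - A web-based, ontological, collaborative tool for cataloguing linguistic data.
- enchant - Enchant spellchecking library https://abiword.github.io/enchant/.
- ExSite9 - ExSite9 is a desktop application designed to make it easy and fast for researchers to structure their data files with descriptive metadata, and then package the data files and associated metadata ready for submission to a repository. ExSite9 also supports structured organization of the files' physical locations on your local file store, so that files and metadata are properly organized and ready for packaging.
- fast_align - Simple, fast, unsupervised word aligner.
- fastText - Library for fast text representation and classification.
- FieldWorks - FieldWorks is a suite of software tools for linguistic and cultural data, with support for complex scripts. https://software.sil.org/fieldworks/ FieldWorks Language Explorer (FLEx for short) is designed to help field linguists perform many common language documentation and analysis tasks. It can help you elicit and record lexical information, create dictionaries, interlinearize texts, analyze discourse features, and study morphology.
- franc - Natural language detection https://wooorm.com/franc/.
- FwDocumentation - Developer documentation for FieldWorks (software tools for linguistic and cultural data, with support for complex scripts).
- FwLocalization - Localizations for FieldWorks.
- FwSupportTools - Additional tools for FieldWorks development.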
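- Gaia - Gaia is an HTML5-based phone UI for the Boot 2 Gecko Project. Note: for details of which branches are used for which releases, see the wiki. If you are interested in setting up a keyboard in a new language, see this information.
- giellakbd-android - A fork of LatinIME (by Google, for Android) targeting marginalised languages, which also deserve first-class status on mobile operating systems. Used by kbdgen (see elsewhere on this page).
- giellakbd-ios - An open source reimagining of Apple's native iOS keyboard, with a specific focus on support for localised keyboards. Used by kbdgen (see elsewhere on this page).
- giza-pp - GIZA++ is a statistical machine translation toolkit used to train IBM Models 1-5 and an HMM word alignment model. The package also contains the source for the mkcls tool, which generates the word classes needed to train some of the alignment models.
- gv-crawl - Global Voices bitext crawler for creating parallel corpora.
- GlotLID - fastText language identification with support for more than 2,000 labels.
- Glottolog data - Glottolog provides comprehensive reference information for the world's languages.
- An Gramadóir - A grammar checking engine designed for minority languages and other languages with limited computational resources.
- grind - An InDesign 5.5 plugin designed to allow the use of Graphite smart fonts in Adobe InDesign. The project combines SIL's Graphite 2 smart font technology with our own paragraph composer plugin implementation.
- HermitCrab - HermitCrab.NET is a flexible morphological and phonological parser that uses an item-and-process approach.
- hfst-ospell - HFST spell checker library and command line tool.
- hfst-ospell-js - Node bindings for hfst-ospell.
- hfst-optimized-lookup - HFST optimized-lookup standalone library and command line tool.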
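- hundict - Bilingual dictionary extractor from parallel corpora.
- hunspell - Spell checker and morphological analyzer library and program, designed for languages with rich morphology and complex word compounding or character encoding.
- huntag - Sequential tagger for NLP using maximum entropy learning and hidden Markov models.
- icu-dotnet - C# wrapper for ICU4C.
- icu4c - Mirror of the SVN project at http://source.icu-project.org/repos/icu/icu/. The FieldWorks branch has some FieldWorks-specific enhancements.
- iLanguage - A semi-unsupervised, language-independent morphological analyzer useful for stemming text in an unknown language, or for getting a rough estimate of possible parses for the morphemes in a word. Input: a corpus. Uses compression, maximum entropy, and field linguistics.
- ipa-help - IPA Help.
- itweets-geodata - Geodata from Indigenous Tweets.
- jquery.ime - jQuery-based input methods library.
- kbdgen - Generates keyboards and keyboard layouts for various operating systems.
- Koreksyon - Tools for developing and implementing spelling and grammar checking for low-resource languages.
- l20n.js - L20n reinvents software localization. Users should be able to benefit from the entire expressive power of natural language. L20n keeps simple things simple, while making complex things possible. This is the JavaScript implementation of L20n. http://l20n.org.
- langid.py - Stand-alone language identification system.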
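- langtech - An SVN repository hosted by the University of Tromsø, with many resources. Details here, and in English here.
- lego-unified-concepticon - Materials related to the LEGO Unified Concepticon.
- lex4all - Pronunciation lexicons for any low-resource language http://lex4all.github.io/lex4all/.
- LexDB - LexDB is a lexical cognate tracking database. It stores full provenance for all lexemes and cognate judgements, and allows export to a number of NEXUS dialects. The database is written in the flexible Python/Django web framework.
- LfMerge - Send/Receive for languageforge.org.
- liblevenshtein - A library for generating finite state transducers based on Levenshtein automata.
- libpalaso - Palaso Library: a set of .NET libraries useful for developers of language software.
- LinGO Grammar Matrix - The LinGO Grammar Matrix is a framework for developing broad-coverage, precision, implemented grammars for diverse languages.
- lingpy - LingPy: a Python library for quantitative tasks in historical linguistics http://lingpy.org.
- Linguistica - Linguistica is a program designed to explore the unsupervised learning of natural language, with a primary focus on morphology (word structure). It runs under Windows, Mac OS X, and Linux, and is written in C++ within the Qt development framework. Its memory demands depend on the size of the corpus analyzed.
- long-press - jQuery plugin to ease the writing of accented or rare characters. http://toki-woki.net/lab/long-press/.
- low-resource-pos-tagging-2014 - Low-resource POS tagging: 2014.
- lrl - Work on low-resource languages.
- MacVoikko - Voikko-based spelling server for OS X.
- machine - Machine is a natural language processing library for .NET that focuses on providing tools for processing resource-poor languages (used by FLEx).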
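- Extensions - Scripts for generating Hunspell spell-checking extensions.
- mgiza - A word alignment tool based on the famous GIZA++, extended to support multi-threading, resuming training, and incremental training.
- Minority Translate - Minority Translate is a simple program that aids content generation for smaller Wikipedias (or, in fact, Wikipedias of any size) by presenting existing articles from other-language Wikipedias, so that users can easily translate or adapt existing text and thereby improve the size and usability of their Wikipedia edition.
- Morfessor - Morfessor is a tool for unsupervised and semi-supervised morphological segmentation.
- morpholm - Morphology-aware language models.
- morph-test - A Python script for running generation and analysis tests on morphological transducers built with the Giella infrastructure. Works with HFST, Xerox's FST tools, and Foma.
- mosesdecoder - Moses, the machine translation system.
- moz-l10n-tiers - Creates a pseudo-template to assess string priority for l10n.
- mukurtucms - Mukurtu Content Management System (CMS) is an internet-based platform designed for archiving digital cultural resources.
- mythes - MyThes is a simple thesaurus that uses a structured text data file and an index file with binary search to look up words and phrases and return information on part of speech, meanings, and synonyms.
- myWorkSafe - Smart and simple backup for language development workers. http://software.sil.org/myworksafe/.
- nabu - Nabu is a digital media item management system that provides a catalog of audio and video items, metadata for those items, and information on the workflow status of items. www.paradisec.org.au
- natural - General natural language facilities for Node (JavaScript).
- NIST 2008 Open Machine Translation Evaluation
- NLTK - Python Natural Language Toolkit. NLTK source http://www.nltk.org/.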
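- node-panlex - Node.js client for PanLex.
- Norma - A tool for automatic spelling normalization.
- nplm - A fork of https://nlg.isi.edu/software/nplm/ with some efficiency tweaks and adaptations for use in mosesdecoder.
- octothorpe - CouchDB-powered wiki thing.
- odtxslt - Performs XSLT transforms on the contents of packaged documents (e.g., ODT, DOCX, etc.).
- old-webapp - Online Linguistic Database: software for creating web applications for collaborative language documentation.
- old - Online Linguistic Database (OLD): software for linguistic fieldwork. http://www.onlinelinguisticdatabase.org.
- old-pyramid - Online Linguistic Database, migrated to the Pyramid framework.
- omegat-hfst-tokenizer - omegat-hfst-tokenizer provides FST-based tokenization in OmegaT.
- OpenDataKit - Open Data Kit (ODK) is a suite of open source tools that helps organizations author, field, and manage mobile data collection solutions.
- opennlp - The Apache OpenNLP library is a machine learning based toolkit for processing natural language text. Website.
- ops-devbox - Ansible playbook for (Linux) developer machines.
- panlex-tools - This package contains scripts to convert lexical resources into a format suitable for import into PanLex. Documentation can be found at https://dev.panlex.org.
- pdsc-collection-viewer - PARADISEC collection viewer.
- Paradigm - Paradigm is Joseph E. Grimes's …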
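- pathway - Prepares language data for publication.
- PdfDroplet - Library and GUI for imposing PDF pages (e.g., 2-up) http://software.sil.org/pdfdroplet/.
- pepper - Pepper is a Java-based open source converter framework for linguistic data.
- Phonology Assistant - Phonology Assistant is a discovery tool. Given a corpus of phonetic data, it automatically charts the sounds and, through its search capabilities, helps a user discover and test the sound rules of a language.
- Pressagio - Pressagio is a library that predicts text based on n-gram models. For example, you can send it a string and the library will return the most likely word completions for the last token in the string.
- PrimerPro - The purpose of PrimerPro is to assist literacy workers in developing a primer for a given language.
- pydelphin - Python libraries for DELPH-IN (friendly fork).
- RBGParser - Graph-based dependency parser.
- rosetta-pangloss - The Rosetta Project's Pangloss system.
- SALM - SALM: Suffix Array and its Applications in Empirical Language Processing.
- salt - Graph-based model for storing and manipulating linguistic data.
- SayMore - A tool that makes common language documentation tasks easier, such as keeping all the resulting files and metadata organized, converting files to archive formats, and transcribing.
- secwepemc-facebook - Translates Facebook into an unsupported language.
- SegParser - Randomized greedy algorithm for joint segmentation, POS tagging, and dependency parsing.
- SeedLing - Building and using a seed corpus for the Human Language Project.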
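- Skype in your language - Translates Skype into an unsupported language.
- solid - Solid is a software tool for checking, cleaning up, and converting Standard Format (e.g., Toolbox) dictionary data.
- Sphere Conversion Tools - Many LDC corpora contain speech files in NIST SPHERE format. The programs below convert SPHERE files to other formats.
- StandardFormatLib - Standard Format library.
- Stanford CoreNLP - Stanford CoreNLP: a Java suite of core NLP tools. https://stanfordnlp.github.io/corenlp/.
- Stanford CoreNLP Python - Python wrapper for Stanford University's CoreNLP tools.
- stanza - The Stanford NLP Group's shared Python tools.
- str2ipa - Pronunciation dictionaries for languages with near-phonetic writing systems.
- sugali - The old repository of the language identification project for many (many) languages, for the software project course on NLP for low-resource languages.
- sugarlike - Language identification for low-resource languages (Susanne, Guy, and Liling).
- syllabipy - Python interface to universal syllabification algorithms.
- tasty-imitation-keyboard - A custom keyboard for iOS8+ that serves as a tasty imitation of the default Apple keyboard. Built using Swift and the latest Apple technologies!
- teckit - TECkit: Text Encoding Conversion toolkit.
- teny - Low-resource machine translation tools.
- TeraDict - Translate English words into hundreds of languages!
- tesseract.js - Pure JavaScript OCR for 62 languages http://tesseract.projectnaptha.com/.
- texnlp - TexNLP: Texas Natural Language Processing tools.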
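- TiMBL - TiMBL is an open source software package implementing several memory-based learning algorithms, among which are IB1-IG (an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces) and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases.
- toney - Tone classification software.
- Toolbox for Field Linguists - Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data.
- Toolbox scripts for ELAN - Mirror of Alexander Koenig's Toolbox scripts https://tla.mpi.nl/tools/tla-tools/elan/thirdparty/.
- ToolsForFieldLinguistics - A collection of linguistics scripts and recipes.
- transcriber - HTML5 transcription tool for Aikuma.
- Transliteration engine - A transliteration engine written in JavaScript.
- tsammalex-data - Tsammalex is a multilingual lexical database of plants and animals.
- tweet2learn - An app that makes it easier to use your native language on Twitter.
- twitter_langid - Hierarchical character-word neural network for language identification.
- UniversalDependencies docs - Universal Dependencies online documentation http://universaldependencies.org/docs/.
- UniversalDependencies tools - Various utilities for processing the data.
- VocBench - VocBench is a web-based, multilingual editing and workflow tool that manages thesauri, authority lists, and glossaries using SKOS-XL.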
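- wavesurfer.js - Navigable waveform built on Web Audio and Canvas https://wavesurfer-js.org/ (also has an ELAN plugin).
- web-template - A web-based template that may be used to present language learning resources in support of language revitalization efforts. It includes a talking dictionary and a phrasebook with sentences and phrases.
- WebCorpus - This project is a collection of scripts and programs for creating web corpora from crawled data.
- wikt2dict - Wiktionary parser tool for many language editions.
- WikiPron - Retrieves IPA pronunciations from Wiktionary entries.
- Word Generator - WordGenerator generates hypothetical words from a specification of their syllable structure.
- WordBoundary - Experiments in word boundary detection and segmentation.
- WordByWord - WordByWord is a free, easy-to-use multimedia vocabulary trainer developed by Vera Ferreira, Peter Bouda, and Ricardo Filipe of CIDLeS, with the support of the Foundation for Endangered Languages.
- wsi4urlang - Word sense induction (WSI) for under-resourced languages (URLang).
- xdxf_makedict - XDXF dictionary format and "makedict" dictionary converting software (official repository).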
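Keyboard Layout Configuration Helpers
- jquery.ime - jQuery input method editor, used on Wikipedia.
- kbdgen - Generates keyboards and keyboard layouts for Windows, macOS, X11, iOS, Android, and Chrome from a simple YAML file. It also registers languages unknown to Windows, so that after installation there is a correct and stable association between the specified BCP 47 code (including full support for ISO 639-3) and the installed language tools, such as keyboards, spell checkers, and other tools.
- Keyboard - Virtual keyboard using jQuery ~ https://mottie.github.io/keyboard/.
- keyboards - Open source keyboards.
- keyman - Keyman cross-platform input methods. Keyman lets you type in over 1,000 languages on Windows, iPhone, iPad, Android tablets and phones, and even instantly in your web browser. Website.
- keyboardlayouteditor - Keyboard layout editor https://code.google.com/archive/p/keyboardlayouteditor/.
- keyboard-layout-editor - Keyboard layout editor http://www.keyboard-layout-editor.com
- lipika-ime - An input method engine (IME) for Mac OS X with built-in support for all Indic languages.
- XKeyboardConfig - The non-arch keyboard configuration database for X Window. The goal is to provide consistent, well-structured, open source X keyboard configuration data for X Window System implementations (free, open source, and commercial). The project targets XKB-based systems.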
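Annotation
- AGTK - AGTK is a suite of software components for building tools for annotating linguistic signals: time-series data that documents any kind of linguistic behavior (e.g., audio, video). The internal data structures are based on annotation graphs. (Original project is on SourceForge: https://sourceforge.net/projects/agtk/).
- brendano - Graph Fragment Language for easy syntactic annotation https://www.cs.cmu.edu/~ark/fudg/.
- ELAN - ELAN is a professional tool for creating complex annotations on video and audio resources.
- EOPAS - EthnoER Online Presentation and Annotation System.
- FLAT - FoLiA Linguistic Annotation Tool - FLAT is a web-based linguistic annotation environment built around the FoLiA format (http://proycon.github.io/folia/), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich them with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure.
- gfl_syntax - Graph Fragment Language for easy syntactic annotation https://www.cs.cmu.edu/~ark/fudg/.
- graf-python - graf-python is an open source Python library for reading and writing GrAF/XML files, as described in ISO 24612. The library's parser creates an annotation graph from the files. Users can then query the annotation graph through graf-python's API.
- Kwaras - Tools for ELAN corpus management.
- LDC Word Aligner - LDC Word Aligner is a software tool for manually annotating word alignments, supporting Arabic-English and Chinese-English word alignment tasks. It has a clean, easy-to-use interface. Since its development in 2009, LDC has used LDC Word Aligner to produce more than 1,000,000 words of annotated word alignment data from a variety of genres, including broadcast, newswire, and web-based sources. Website.
- Poio-Analyzer - Poio is a collection of software tools for linguists working in language documentation, descriptive linguistics, and/or language typology. It allows linguists to manage and analyze their data. The Poio Interlinear Editor allows adding morpho-syntactic annotations to transcriptions. It supports various file formats for input, but will only output standardized XML defined by the Corpus Encoding Standard and the Text Encoding Initiative. Several tools for analyzing linguistic data will be made available to further process annotated data. Poio tools are written in Python and are based on PyQt.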
- poio-api - Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan's EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called “Graph Annotation F…
- pyannotation - PyAnnotation is a Python Library to access and manipulate linguistically annotated corpus files.
- XTrans - XTrans is a next generation multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. The XTrans toolkit provides new and efficient solutions to common transcription challenges and addresses critical gaps in existing tools. Designed with input from experienced human transcribers working with real world data, XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks including (virtual) segmentation of audio into smaller units like turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript like topics.
Format Specifications
- spec - The official specification for the DLx linguistic data format. https://digitallinguistics.github.io/spec/.
- FoLiA - FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. http://proycon.github.io/folia/
- xdxf_makedict - XDXF dictionary format and "makedict" dictionary converting software (official repository).
i18n-related Repositories
- Express-Lingua - An i18n middleware for the Express.js framework.
- Polyglot.js - Give your JavaScript the ability to speak many languages.
- Transifex - A system that provides a user-friendly, project-oriented approach to translating .po files. Great for non-technical users, free for open-source projects, decent for minority languages; however, it can take a while to get a new language added to the Transifex system, because its ticketing system sometimes loses tickets. Provides translation memory, the ability to appoint reviewers, etc. Transifex used to have an open source system that you could host on your own, but that seems to have disappeared.
Audio automation
- arctic-prompts - Generate prompts PDF for CMU ARCTIC dataset.
- AudioWebService - a simple nodejs server which accepts upload of audio and runs it through praat.
- AuToBI - Automatic prosodic annotation tool written in Java.
- BashScriptsForPhonetics - ( Fork of a dormant project).
- esv-text-audio-aligner - ESV Text/Audio Aligner to programmatically obtain the timings for each word in the corresponding audio.
- html5-audio-read-along - HTML5 Audio Read-Along.
- ipa-chart - International Phonetic Alphabet (IPA) Unicode Chart and Character Picker.
- kaldi-svn-archive - A read-only archive of the original Kaldi SVN repository (mainly to keep sandboxes available).
- lex4all - pronunciation LEXicons for Any Low-resource Language ( Fork of a student project).
- Montreal-Forced-Aligner - Python interface for forced text/speech alignment.
- node-pocketsphinx
- opensauce - GNU Octave-compatible version of VoiceSauce.
- pocketsphinx - PocketSphinx is a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop.
- pocketsphinx-ios-demo - Simple demo for iOS.
- pocketsphinx-python - Python module installed with setup.py.
- pocketsphinx-ruby - Ruby speech recognition with Pocketsphinx.
- pocketsphinx-wp-demo - Demo to run pocketsphinx on WP8 platform.
- pocketsphinx.js - Speech recognition in JavaScript.
- praat-py - From my PhD days: Praat-Py is a custom build of Praat, the computer program used by linguists for doing phonetic analysis on sound files, to allow for scripts to be written in the Python programming language, rather than in Praat's built-in language. ( Fork of a dormant project).
- Praat-Scripts - Mietta's Scripts.
- PraatTextGridJS - A small library which can parse TextGrid into json and json into TextGrid.
- PraatontheWeb - Web implementation of Praat. Source code, running demo scripts on web, samples and documentation.
- prosodicParsing - different kinds of HMMs to use for incorporating prosody into basic parsing.
- Prosodylab-Aligner - Python interface for forced audio alignment using HTK and SoX.
- prosodylab.alignertools
- Recordmp3js - Record MP3 files directly from the browser using JS and HTML.
- sphinx4 - Pure Java speech recognition library.
- sphinxbase
- sphinxtrain
- TLSphinx - Swift wrapper around Pocketsphinx.
Text-to-Speech (TTS)
- espeak - eSpeak is a compact open source software speech synthesizer for English and other languages, for Linux and Windows. http://espeak.sourceforge.net.
- MARY TTS - MARY TTS -- an open-source, multilingual text-to-speech synthesis system written in pure java http://mary.dfki.de.
- Ossian - Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision.
Automatic Speech Recognition (ASR)
- Elpis - Elpis is software for creating speech recognition models and applying them to the transcription of audio. As of 2022, it gives access to Kaldi and Huggingface Transformers.
- kaldi - This is now the official location of the Kaldi project.
- Persephone - Persephone aims to make state-of-the-art phonemic transcription accessible to people involved in language documentation, who have a training corpus of about one to four hours of transcribed speech. As of 2022, Persephone is superseded by Elpis.
Text automation
- clld - Cross Linguistic Linked Data python library.
- LaTeX2HTML5 - LaTeX web components.
- MultilingualCorporaExtractor - Node io Spider for extracting multilingual corpora ( Fork of a student project).
- SeedLing - Building and Using A Seed Corpus for the Human Language Project ( Fork of a student project).
Experimentation
- experigen - A framework for creating linguistic experiments.
- GamifyPsycholinguisticsExperiments - A simple node server to gamify linguistics experiments; runs offline on a laptop for small scale experiments and online on a server for large scale experiments. Data is sent to a Google spreadsheet. ( Fork of a dormant project).
- OpenSesame - Graphical experiment builder for the social sciences.
- OPrime - Open Source Experimentation Libraries - Online and Offline for Android and HTML5.
- psychopyMegProsody - Runs MegProsody using PsychoPy.
- PsychScript - An HTML5/JavaScript library for running behavioural experiments online.
Flashcards
- Anki - Anki is a program to make and share flashcard decks (including audio) for any language or writing system. https://apps.ankiweb.net/.
- awesome-anki - A curated list of awesome Anki add-ons, decks and resources.
- VocabLift - Language-learning tool that uses vocabulary from LIFT-format dictionaries produced by programs such as Fieldworks Language Explorer and WeSay.
Natural language generation
- OpenCCG - OpenCCG library for parsing and realization with CCG. Includes mini-grammars for Inuit, Nezperce, Basque and others.
Computing systems
- Common Language Resources and Technology Infrastructure Norway / Clarino - One of their projects (not clearly listed here) is about providing an online system for language analysis, so users can connect resources visually, dump in text, and get a result. Kind of like the Yahoo! Pipes but for language processing. Uses the ABEL cluster.
Android Applications
- Aikuma - Android software for recording and translation.
- Android Speech Recognition Trainer - Speech recognition training app for low resource languages which interfaces with FieldDB corpora.
- android-template - This is a template of an Android word-learning app that may be used as a way to introduce a language. It includes a quiz. For the documentation, go to http://eddersko.github.io/android-template/.
- AndroidFieldDB - An Android app which lets the user build a custom visual and auditory vocabulary, useful for guided anomia treatment and self designed language lessons by heritage speakers.
- AndroidFieldDBElicitationRecorder - A general purpose video recording tool.
- AndroidLanguageLessons - Lets heritage speakers create self designed language lessons.
- AndroidProductionExperiment - Android App to run perception experiments.
- Bevara - Android Phone Application designed for Linguistic Fieldwork to help preserve, maintain, and save endangered languages.
- ojoVoz - A mobile app for sending georeferenced image and voice recordings from an Android phone to an email address. For more information, please go to http://sautiyawakulima.net/ojovoz/.
- pocketsphinx-android - pocketsphinx build for Android.
- pocketsphinx-android-demo
Chrome Extensions
- babelfrog - Chrome extension to help learn languages as you browse.
- DictionaryChromeExtension - Dictionary for websites in low-resource languages. App and codebase which connects to a Wiktionary to provide definitions of any term on any website (current languages Cherokee 194,426 entries, Inuktitut 251 entries, Kartuli 7,363 entries, Plains Cree (incubation) 0 entries) use.
FieldDB
FieldDB is actively worked on by the FieldDB group (formerly known as OpenSourceFieldlinguistics). These repos explicitly work with it but could be repurposed for other projects.
- FieldDB - An offline/online field database which adapts to its user's terminology and I-Language, and has plugins for various data automation routines along the process from primary data collection to cleaning to publication and archival. Use.
FieldDB Webservices/Components/Plugins
- AndroidLanguageLearningClientForFieldDB-sikuli - Sikuli tests for AndroidLanguageLearningClientForFieldDB.
- AuthenticationWebService - A node.js web service which manages users and corpora creation and authentication.
- bower-fielddb-angular - A bower repository which hosts fielddb-angular components, bower install fielddb-angular --save.
- bower-fielddb - A bower repository which hosts fielddb core components, bower install fielddb --save.
- fielddb-spreadsheet-sikuli - sikuli tests for the spreadsheet module use.
- FieldDBActivityFeed - A fielddb activity feed widget which can be embedded in other codebases, websites etc use.
- FieldDBGlosser - A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. bower install fielddb-glosser --save.
- FieldDBLexicon - A lexicon browser/editor web widget for FieldDB databases.
- LanguageClassDashboard - App which provides a view of FieldDB corpora for language teachers use.
- LexiconWebService - A node.js ElasticSearch wrapper for indexing/training lexicons from corpora.
- LexiconWebServiceSample - A node.js web server which implements the fieldlinguist's lexicon API for the FieldDB project.
Academic Research Paper-Specific Repositories
- Gargantua - Fast Unsupervised Sentence Aligner described in "Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora", COLING 2010.
- ldc-kiy - Materials for: The experimental state of mind in elicitation: illustrations from tonal fieldwork. Submitted to Language Documentation & Conservation, How to study a tone language.
- Learning to map into a Universal POS tagset - Yuan Zhang, Roi Reichart, Regina Barzilay and Amir Globerson
- low-resource-pos-tagging-2014 and low-resource-pos-tagging-2014 Published in: Learning a Part-of-Speech Tagger from Two Hours of Annotation. Dan Garrette and Jason Baldridge . In Proceedings of NAACL 2013. And in: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages. Dan Garrette, Jason Mielens, and Jason Baldridge . In Proceedings of ACL 2013.
- orthotree - Linguistic family tree based on orthographic distance.
- type-supervised-tagging-2012emnlp This repository contains the code, scripts, and instructions needed to reproduce the results in the paper: Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries. Dan Garrette and Jason Baldridge . In Proceedings of EMNLP 2012. This code is frozen as of the version used to obtain the results in the paper. It will not be maintained. To see the updated code, visit nlp
- visualizing-language - For visualizations of WALS and other typological databases.
- WALS-APiCS - Code for working with WALS-APiCS (Atlas of Pidgin and Creole Language Structures) complexity metrics.
Example Repositories
These are repositories that are generally only interesting for training purposes or seeing how something is done.
- CorpusWebService - über-simple node.js-Proxy to enable CORS request for couchdb.
- CorporaForFieldLinguistics - Small corpora from diverse language typologies, useful for testing scripts.
- startR
- lucenerevolution-2013 - Demo examples for linguistics in Lucene and Solr.
- berlin-buzzwords-2013 - Demo examples for Lucene, Solr, ElasticSearch and OpenNLP from Berlin Buzzwords 2013 talk.
Fonts
- fontinline - Make inline stroke paths from an outline font.
- Noto Fonts - Noto is Google's free font family that aims to support all the world's scripts. Its design goal is to achieve visual harmonization across languages. Noto fonts are under Apache License 2.0.
- Unicodify - Unicodify is a suite of programs for converting text in a variety of 8-bit encodings to Unicode (using the UTF-16 encoding). Unicodify was particularly designed to handle HTML-based text using non-ISCII 8-bit fonts to render South Asian scripts. However, elements of the suite can map other types of non-ASCII 8-bit encodings, such as Latin-2, ISCII and PASCII.
Corpora
These corpora are useful for working with tools on endangered languages. Monolingual corpora that are more for archival efforts should most likely not be included here.
- bible-corpus - A multilingual parallel corpus created from translations of the Bible.
- poio-corpus - The Poio Corpus is a freely available collection of language resources for the lesser-used languages. The data is extracted from free sources like Wikipedia, dictionaries, documents, websites and others.
Organizations
On GitHub
- batumi - Speech recognition and natural language processing for low-resource languages
- BloomBooks
- unicode-cldr - Unicode Common Locale Data Repository (CLDR) Project http://cldr.unicode.org
- cmusphinx - Mirror of the SourceForge repositories
- dativebase - Tools for working with OLD.
- divvun - The Divvun group at UiT develops proofing tools, keyboard apps and other language technology solutions for indigenous and minority languages, especially the Sámi languages. Website.
- FieldDB
- GiellaLT - Home for keyboard layouts, lexicons and morphologies for indigenous and minority languages, especially for morphologically complex languages, using mainly rule-based technologies. The resources are used by Divvun (above) and Giellatekno (below) to build a number of tools for the language communities. Almost everything is open source.
- HFST - Helsinki Finite-State Technology. Website.
- hunspell
- keymanapp - Website.
- langtech - Language Technology Group, University of Melbourne
- lex4all
- longnow
- MontrealCorpusTools
- moses-smt - Statistical Machine Translation.
- mukurtucms
- NLTK - Natural Language Toolkit.
- PhonologicalCorpusTools
- Projet de recherche sur l'écriture - Crowdsourcing or conducting large scale psycholinguistics experiments (or statistically significant field linguistics).
- prosodylab - Prosodylab at McGill University, Canada
- SIL International (Dev) - Another SIL International organization, with many repositories.
- SIL International - SIL (originally known as the Summer Institute of Linguistics, Inc.) is probably the leading organization providing software and tools tailored for field linguists and lexicographers working on endangered languages. A little-known fact is that much of its code is open sourced on GitHub, and SIL is happy to receive open source contributions and collaborate on open source projects.
- SIL NRSI - SIL Non-Roman Script Initiative. The NRSI is a department of SIL International, whose task is to provide assistance, research and development for SIL International and its partners to support the use of non-Roman and complex scripts in language development.
- StanfordNLP https://nlp.stanford.edu
- ucsd-field-lab - University of California, San Diego
- UniversalDependencies - Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on an evolution of (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary.
- utcompling - The University of Texas at Austin's Computational Linguistics Lab. Website.
Other OSS Organizations
- Giellatekno - Giellatekno combines cutting-edge linguistic and computational research into the analysis of Saami and other morphologically-rich languages, with the development of practical applications. We focus on deep linguistic modeling and on highly efficient and robust computational analysis with a wide empirical coverage. They use svn for their code: all of it can be found here, sorted by language.
- LOWLANDS - LOWLANDS – Parsing low-resource languages and domains https://ccc.ku.dk/research/lowlands/
- LTRC: Language Technologies Research Center, IIIT Hyderabad - LTRC addresses the complex problem of understanding and processing natural languages in both speech and text mode. LTRC conducts research on both basic and applied aspects of language technology. It is the largest academic centre of speech and language technology in South Asia. LTRC carries out its work through four labs, which work in synergy with each other.
- The Language Archive - Part of the MPI.
Tutorials
- How to Write a Spelling Corrector by Peter Norvig.
Language Specific Projects
For each language, we include the ISO 639-3 code, and the main autonym for that language.
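The tag line under each language heading (for example `afr :: Afrikaans`) is regular enough to read programmatically. The following is a minimal sketch, assuming the `code :: autonym` layout used in this list; the names `LanguageTag` and `parse_language_tag` are hypothetical and are not part of any tool listed here.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    """One 'code :: autonym' line, split into its two parts."""
    iso_code: str
    autonym: str


def parse_language_tag(line: str) -> Optional[LanguageTag]:
    """Return a LanguageTag for lines shaped like 'afr :: Afrikaans', else None."""
    code, separator, autonym = line.partition("::")
    code, autonym = code.strip(), autonym.strip()
    # Codes in this list are three lowercase ASCII letters.
    looks_like_code = (
        len(code) == 3 and code.isascii() and code.isalpha() and code.islower()
    )
    if not separator or not looks_like_code or not autonym:
        return None
    return LanguageTag(code, autonym)


if __name__ == "__main__":
    print(parse_language_tag("afr :: Afrikaans"))
    print(parse_language_tag("- aspell-gl - Galician dictionary for aspell"))
```

The sketch only checks the shape of the code, not whether it is actually assigned in ISO 639-3.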
Afrikaans
afr :: Afrikaans
- Afrikaanse rekenaarlinguïstiek (Afrikaans computational linguistics) - Wordlists, corpora, morphological analyser, tagger, word decompounder. Available on request by email.
Albanian
sqi :: shqip
- Apertium rules for Albanian - Machine Translation rules
- out-of-copyright-albanian-authors - Out-of-copyright authors scraped from the Albanian-language Wikipedia.
- Plis keyboard - A computer keyboard layout for the Albanian language.
- spell checking - A collection of Albanian words and information about them. Aspell, Ispell, and MySpell are included.
Alutiiq
ems :: sugpiaq
- wiinaq - Word Wiinaq is a Kodiak Alutiiq dictionary web application with automatically generated ending tables and souped-up search capabilities. It is written in Python using Django.
Amharic
amh :: አማርኛ
- HornMorpho - Morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs
Basque
eus :: euskara
- Matxin - An open-source transfer machine translation engine. Linguistic information for translation from Spanish to Basque (es-eu) is included.
Bengali
ben :: বাংলা
- Bangla-অঙ্কুর for Mac - This project aims to develop a phonetic-based Bangla typing system for Macintosh computers, which could be developed into a transliteration technique in the future.
- Bengali Writer - "Bengali Writer" is a set of utilities for computerized editing and typesetting in Bengali, a language of India and Bangladesh. It comprises a set of fonts for Bengali in several formats (METAFONT, BDF, PS), a text editor with spell-checking, export, and more. (Original project is on SourceForge: https://sourceforge.net/projects/bengaliwriter/).
- Ekushey - Bangla computing and localization project for Bangla-speaking people.
- Lekho - A collection of tools and resources for using Bangla on computers (Original project is on SourceForge: https://sourceforge.net/projects/lekho/).
Chichewa
nya :: chicheŵa
- Chichewa - NLP resources for Chichewa.
Galician
glg :: galego
- an-metri-gal - Análise métrico de texto en verso en lingua galega (Galician language) gl-ES
- android_gl_dict - Android Galician (gl_ES) Keyboard Dictionary
- aspell-gl - Galician dictionary for aspell
- CitiusSentiment - Sentiment analysis (opinion mining) for Portuguese, English, Spanish, and Galician
- CitiusTagger - A PoS-Tagger and Named Entity Classification tool for Portuguese, English, Galician, and Spanish
- Conshuga - Galician verb conjugator
- corpora - A collection of corpora of Galician (or Galicia-related) words / Colección de corpus de palabras en galego (ou relacionadas con Galicia)
- DepPattern - Dependency Syntactic Parsing for Portuguese, Spanish, English, and Galician, including MetaRomance parser
- DOGA_scraper - Galician Official journal scraper
- elFinder-language - Galician - Gallego / language for elFinder
- EuroWordNetLemon - EuroWordNet lemon lexicons generated from the LMF versions of the Multilingual Central Repository (MCR) EuroWordNet lexicons. It includes lexicons for Spanish, Catalan, Basque & Galician.
- GalegoDroid - Galician Translator for Android
- galeXtra - Multiword Extractor for Portuguese, English, Spanish, Galician, French
- Galician-Dependency-Treebank - This Galician Dependency Treebank has been developed by transliterating and adapting lexically the Portuguese part (Bosque 7.3 by the Floresta sintá(c)tica project) of the CONLL-X 2006.
- Galician-Fuzzy-Text-watch - Based on Fuzzy Text International by Jesse Hallett; uses the Galician language to display the time.
- galician-locale-for-mac - Galician locale for Mac OS X
- gl-syllabler - Split Galician words into syllables
- gl- Galician OmegaT Localisation
- hunspell-gl-ciencias - Project aimed at developing a Galician-language Hunspell dictionary for science and maths
- hunspell-gl - Galician hunspell dictionaries
- hyphen-gl - Galician hyphenation rules
- javagalician-java6 - The Java Galician Locale is an implementation of Java localization SPIs which will allow the Java VM to use the Galician Language (locales "gl" and "gl_ES"), one of the official languages of Spain, which is not included in Sun's JVM distribution.
- Linguakit - Multilingual toolkit for NLP: dependency parser, PoS tagger, NERC, multiword extractor, sentiment analysis, etc.
- ParlamentoGalicia - Project based on information extracted from the transcriptions of the sessions held in the Galician Parliament
- poss-gl - Galician translation of Producing Open Source Software, by Karl Fogel
- rima - Find rhyming words in the Galician language.
- stopwords-gl - Galician stopwords collection
- texlive-babel-galician - TeXLive babel-galician package
- UD_Galician-CTG - The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus created at the University of Vigo by the TALG NLP research group.
- UD_Galician-TreeGal - The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña).
- UL_Galician-TreeGal - CoNLL-UL Repository for UD_Galician-TreeGal
Apertium
- apertium-cat-glg - Apertium translation pair for Catalan and Galician
- apertium-dict-en-gl - English-Galician language pair for Apertium
- apertium-dict-es-gl - Spanish-Galician language pair for Apertium
- apertium-dict-pt-gl - Portuguese-Galician language pair for Apertium
- apertium-en-gl - Apertium translation pair for English and Galician
- apertium-es-gl - Apertium translation pair for Spanish and Galician
- apertium-glg - Apertium linguistic data for Galician
- Apertium-pt-gl.pt-gl-LMF - This is the LMF version of the Apertium bilingual dictionary for the Portuguese and Galician languages
- apertium-pt-gl - Apertium translation pair for Portuguese and Galician
Georgian
kat :: ქართული
- awesome-georgia - A curated list of awesome libraries and packages specific/related to Georgia (country).
- Gadatsqvetilebebi - გადაწყვეტილებები; Web spider and corpora importer for public legal decisions.
- GeoWordsDatabase - Around 310 000 unique Georgian words https://bumbeishvili.github.io/GeoWordsDatabase/.
- Kartuli Speech Recognition - ანდროიდის ქართველი მომხმარებლებისთვის სიტყვის ამოცნობის სისტემის შექმნა. Codebase to turn any webpage from one alphabet into another; the default is to turn Latin letters into Kartuli. Use case: "Do your friends keep commenting on Facebook with English keyboards (either because they forgot to switch, or because they didn't/can't install a Georgian keyboard)? Now you can read the web through კართული eyes."
- KartuliChromeExtension - Chrome აპლიკაცია, რომელიც ყველა ინგლისურ ასო-ბგერას აჩვენებს ქართულ ასო-ბგერად.
- QartuliDaBunebismetkveleba - მათემატიკისა და ბუნებისმეტყველების ინტერაქტიული სახელმძღვანელო მე-2 - მე-3 კლასის მოსწავლეებისათვის.
- SakartvelosUzenaesiSasamartloSarke - საქართველოს უზენაესი სასამართლო სარკე.
- SamartlosSakonstitutsioSasamartdoSarke - სამართლოს საკონსტიტუციო სასამართდო სარკე.
- translitit-latin-to-mkhedruli-georgian - A Latin to ქართული (Mkhedruli Georgian) transliteration function written in JavaScript.
- translitit-mkhedruli-georgian-to-ipa - A ქართული (Mkhedruli Georgian) to IPA transliteration function written in JavaScript.
- Declensions - Methods to generate declensions for the Georgian language.
Fonts
- Stichoza/font-larisome - Iconic font for Georgian currency inspired by Font-Awesome (CSS).
- Lotuashvili/BPGNateli - Bower package for BPG Nateli font (CSS).
- thecotne/georgian-webfonts - Package for Georgian fonts (CSS).
Internationalization and Localization (i18n/l10n)
- Stichoza/money-num-to-string - Convert a number/money to localized string (PHP, JavaScript).
- natchkebiailia/NumberToWord - Convert numbers to localized strings (JavaScript).
- d0ragon/number-to-words-ka - Convert numbers to localized strings (PHP).
- dimakura/ka - Common functionality for Georgian projects (Ruby).
- dimakura/ka.js - Georgian language support for node and browser (JavaScript).
- akalongman/kautilities - Convert Georgian letters to Latin and vice-versa (PHP).
- Landish/Laravel-Ka - Laravel Georgian Language Pack.
- Landish/RedactorJS-GE - Redactor WYSIWYG HTML Editor Georgian Language Pack (JavaScript).
- wenzhixin/bootstrap-table - Bootstrap table with extra features. l10n by @Lotuashvili and @Stichoza.
- moment/moment - A lightweight date library (JavaScript).
- ioseb/geokbd - Georgian keyboard library (JavaScript).
Guarani
grn :: Guarani
- ParaMorfo - morphological analysis and generation of Spanish and Guarani verbs, nouns, and adjectives.
Hausa
hau :: هَرْشَن هَوْسَ
- Hausa - Repository for Hausa NLP tools.
Hindi
hin :: हिन्दी
- hindi-morph - An open source morphological analyzer for Hindi.
Høgnorsk
nno :: Høgnorsk
- hunspell-hn_NO - An early-stage spellchecking tool for Høgnorsk, a conservative variant of Norwegian Nynorsk, based on a set of corpora.
Icelandic
isl :: íslenska
- IceNLP - IceNLP is an open source Natural Language Processing (NLP) toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java.
Inuktitut
iku :: Inuktitut
- InuktitutAlignerData - Scripts for alignment of laboratory speech production data.
- InuktitutComputing - Inuktitut Morphological Analyser, transcoder, transliterator, corpus tools, and lexical lists for working with Inuktitut. Usable online at http://inuktitutcomputing.ca/index.php.
Irish
gle :: Gaeilge
- aimsigh - Source for the now-defunct aimsigh.com Irish search engine.
- caighdean - Code for standardizing Irish language text.
- fleiscin - Irish hyphenation patterns for TeX https://cadhan.com/fleiscin/.
- GaelSpell - Sources for an Irish language spell checker.
- tesseract-gle-uncial - OCR for old Irish fonts.
Kinyarwanda
kin :: Ikinyarwanda
- kin-morph-fst - Kinyarwanda morphological analyzer.
- TurboTagger & TurboParser for Kinyarwanda (download).
Kurdish
kur :: Kurdî
- Kurlex - Morphological analyser and lexicon, written in the Alexina framework, licensed under the LGPL-LR.
- kurmanji-stemmer - NLTK-based Kurmanji stemmer.
Lingala
lin :: Lingála
- Lingala NLP - NLP tools and resources for Lingala.
Lushootseed
lut :: Lushootseed
- Lushootseed - Joshua Crowgey's work on Lushootseed http://students.washington.edu/jcrowgey/lushootseed/.
Malay
msa :: Bahasa Melayu
- MorfoMalayu - morphological analysis of Malay words.
Malagasy
mlg :: Malagasy
- Global Voices Malagasy Project - This page provides a link to a corpus of parallel news articles in Malagasy and English from the Global Voices project. The corpus was collected and aligned at the sentence level by Victor Chahuneau.
Manx
glv :: Gaelg
- aspell-gv - Manx Gaelic dictionary for aspell.
- gaelg - NLP resources for Manx Gaelic, mainly in support of the gv2ga MT engine.
Migmaq
mic :: Mi'kmaq
- migmaq-lessons - Repository for website building Mi'gmaq language lessons.
Minderico
drc :: Piação do Ninhou
- fredericajordarzambarino - A web-based game for mobile devices in Minderico, based on the "Who Wants to be a Millionaire" TV show.
Nishnaabe
oji :: Ojibwe, Oddawa, Chippewa, Anishinaabemowin, ᐊᓂᔑᓈᐯᒧᐎᓐ
- Ojibway-iphone-app - An iPhone app with audio and images for learning the Ojibway language.
- OjibwayMap - An iPhone app with audio and images for learning Ojibway language and culture.
- nishanimate - A desktop app to facilitate Nishnaabe-language acquisition via animations produced by the natural language processing of audio-accompanied text.
Oromo
orm :: Oromo
- HornMorpho - Morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs.
Quechua
que :: Runa Simi
- AntiMorfo - morphological analysis and generation of Quechua nouns, adjectives, and verbs and Spanish verbs.
- Morphology, spellchecker - XFST and FOMA, plus OpenOffice plugin.
Saami
sma :: Sámi/Saami
- divvun-webdemo - Simple web demo for the Divvun grammar checker. Website.
- Giellatekno - A host of Sámi tools: mobile keyboards (iOS and Android), learning apps, dictionaries, morphologies, syntax disambiguators, and some project collaboration with Apertium on shallow translation between Saami languages. Highlights include:
- Oahpa! - A learning portal for Saami languages. Includes WordPress-based, media-rich, lesson-based learning, and morphological and syntactic exercises generated from the morphological and syntactic tools.
- Neahttadigisánit - A morphologically sensitive dictionary with modes for 'social media input', which allows users to type a 'relaxed' version of the orthography (acdnstz will also be recognized as áčđŋšŧz̄). It also includes a JavaScript bookmarklet that offers click-to-read dictionary lookup. Also available for other Uralic and non-Uralic languages.
Giellatekno does a lot for other minority Uralic languages. The following are some keywords for CTRL+F friendliness:
- Saami languages: North Saami, Lule Saami, South Saami // Inari Saami, Kildin Saami, Pite Saami, Skolt Saami.
- Other Uralic languages: Erzya, Finnish, Hill Mari, Ingrian, Khanty, Kven, Komi, Livonian, Meadow Mari, Moksha, Nenets, Nganasan, Olonetsian, Udmurt, Veps.
- Other languages: Buriat, Cornish, Faroese, Greenlandic, Iñupiaq, Northern Haida, Ojibwe, Plains Cree, Russian.
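The 'relaxed' orthography input described for Neahttadigisánit above can be approximated by folding both the dictionary headwords and the typed query to plain ASCII letters before comparing them. The following is a rough sketch of that general idea, not Giellatekno's actual implementation; the function names and the tiny headword list are hypothetical. Letters such as đ, ŋ and ŧ have no Unicode decomposition, so they are mapped explicitly.

```python
import unicodedata

# Letters without a Unicode decomposition need an explicit mapping.
_NO_DECOMPOSITION = str.maketrans({"đ": "d", "ŋ": "n", "ŧ": "t"})


def fold(word: str) -> str:
    """Lowercase a word and reduce its letters to their plain base letters."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    # Drop combining marks (acute, caron, macron, ...) left by decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(_NO_DECOMPOSITION)


def build_index(headwords):
    """Map each folded form to every headword that folds to it."""
    index = {}
    for headword in headwords:
        index.setdefault(fold(headword), []).append(headword)
    return index


def lookup(index, typed):
    """Return all headwords matching the typed (possibly relaxed) form."""
    return index.get(fold(typed), [])


# Illustrative headwords only; a real dictionary would supply these.
HEADWORDS = ["čáhci", "giella", "áhčči", "šaddat"]
INDEX = build_index(HEADWORDS)

if __name__ == "__main__":
    print(lookup(INDEX, "cahci"))
```

Folding can make distinct headwords collide, which is why the index keeps a list of candidates per folded form rather than a single entry.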
Scottish Gaelic
gla :: Gàidhlig
- aspell-gd - Scottish Gaelic dictionary for aspell.
- briathrachan - This is the source code to Briathrachan, a Gaelic-English dictionary app for iOS.
- gaidhlig - NLP resources for Scottish Gaelic, mainly in support of gd2ga/ga2gd MT engines.
- gd-fcfg - Context-free feature-based grammar of Scottish Gaelic in the NLTK format.
- gdbank - Some tools and resources for natural language processing of Scottish Gaelic. https://www.tantallon.org.uk/cggblog/.
- hunspell-gd - Files for building Scottish Gaelic spell checkers.
Secwepemctsín
shs :: Secwepemctsín
- secwepemctsnem - A project to help people learn Secwepemctsín.
Somali
som :: Soomaaliga
- somorph - Somali morphological and syntactic analyzers and generators built on XFST and VISL-CG Constraint Grammar. An up-to-date version is checked in to Giellatekno's repository.
- qaamuus.net - Morphologically aware dictionary based on lexical resources found online and the Somali morphology.
Tigrinya
tir :: ትግርኛ
- HornMorpho - morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs.
Uralic
urj :: Uralic languages
- UralicNLP - A Python library for processing Uralic languages (Finnish, Skolt Sami, Erzya, Moksha, Komi-Zyrian and so on). The library provides easy programmatic access to Giellatekno resources such as FST morphology and CG disambiguators. Other functionality includes a UD parser, an API for the Online Dictionary of Uralic Languages, and an interface to the SemFi and SemUr semantic databases. The library is under active development and new features are added from time to time.
Zulu
zul :: isiZulu
- Ukwabelana - An open-source morphological Zulu corpus.
License
© Richard Littauer 2014-2017