v0.2.0

Langumo

The unified corpus building environment for Language Models.

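Introduction

langumo is a unified corpus building environment for language models. It provides pipelines for building text-based datasets. Constructing a dataset requires a complicated pipeline (e.g. parsing, shuffling and tokenization). Moreover, when corpora are collected from different sources, extracting data from their various formats becomes a problem. langumo builds a single dataset from these diverse formats in one run.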
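The stages named above (parsing, shuffling, tokenization, then a train/eval split) can be pictured with a toy sketch. This is not langumo's implementation: the function name `build_toy_corpus`, the whitespace "tokenizer" and the 10% default split are assumptions made purely for illustration.

```python
import random


def build_toy_corpus(documents, eval_ratio=0.1, seed=0):
    """Toy stand-in for a corpus pipeline: parse, shuffle, tokenize, split."""
    # "Parse": strip surrounding whitespace and drop empty documents.
    lines = [doc.strip() for doc in documents if doc.strip()]

    # Shuffle with a fixed seed so the result is reproducible.
    random.Random(seed).shuffle(lines)

    # "Tokenize": lowercase and split on whitespace. A real build would
    # train and apply a subword model here instead.
    tokenized = [" ".join(line.lower().split()) for line in lines]

    # Hold out a slice for evaluation (at least one line when possible).
    n_eval = max(1, int(len(tokenized) * eval_ratio)) if tokenized else 0
    return tokenized[n_eval:], tokenized[:n_eval]


train, evaluation = build_toy_corpus(["A b", "  ", "C d e", "F"], eval_ratio=0.25)
print(len(train), len(evaluation))
```

Each stage consumes the previous stage's output, which is why a single build tool that chains them is convenient.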

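Main features

Easy to build, and simple to add new corpus formats.
Fast builds through performance optimizations, even though it is written in Python.
Supports multi-processing when parsing corpora.
Very low memory usage.
All-in-one environment. Never mind the internal procedures!
No need to write code for a new corpus. Simply add it to the build configuration instead.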

Dependencies

nltk
colorama
pyyaml>=5.3.1
tqdm>=4.46.0
tokenizers==0.8.1
mwparserfromhell>=0.5.4
kss==1.3.1
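Two of these requirements are exact pins (`tokenizers==0.8.1` and `kss==1.3.1`), so an existing environment can conflict with them. The following is a minimal sketch for checking what is installed, using only the standard library; the helper name `check_pins` and the shape of the report are assumptions, not part of langumo.

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the dependency list above.
PINNED = {"tokenizers": "0.8.1", "kss": "1.3.1"}


def check_pins(pins, get_version=version):
    """Map each package to (installed version or None, matches the pin?)."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (installed, installed == expected)
    return report


for package, (installed, ok) in check_pins(PINNED).items():
    print(f"{package}: installed={installed} matches_pin={ok}")
```

Installing into a fresh virtual environment, as the quick start suggests, sidesteps most such conflicts.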

Installation

With pip

langumo can be installed using pip as follows:

$ pip install langumo

From source

You can install langumo from source by cloning the repository and running:

$ git clone https://github.com/affjljoo3581/langumo.git
$ cd langumo
$ python setup.py install

Quick start guide

Let's build a Wikipedia dataset. First, install langumo in your virtual environment.

$ pip install langumo

After installing langumo, create a workspace to use for the build.

$ mkdir workspace
$ cd workspace

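Before creating the dataset, we need a Wikipedia dump file, which is the source of the dataset. Various versions of the Wikipedia dump files are available from dumps.wikimedia.org. In this tutorial, we will use one part of an English Wikipedia dump. Download the file with your browser and move it to workspace/src. Alternatively, fetch it with wget: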

$ wget -P src https://dumps.wikimedia.org/enwiki/20200901/enwiki-20200901-pages-articles1.xml-p1p30303.bz2

langumo needs a build configuration file that contains the details of the dataset. Create a build.yml file in workspace with the following contents:

langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser

  build:
    parsing:
      num-workers: 8 # The number of CPU cores you have.

    tokenization:
      vocab-size: 32000 # The vocabulary size.
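Since `num-workers` is meant to match the number of CPU cores, the same file can be generated rather than typed by hand. This is a minimal sketch that only reproduces the keys shown in the configuration above; the function name `render_build_config` and its parameters are hypothetical, and it uses plain string formatting so that it needs nothing beyond the standard library.

```python
import os


def render_build_config(dump_path, num_workers=None, vocab_size=32000):
    """Return build.yml text with the same keys as the example above."""
    if num_workers is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        num_workers = os.cpu_count() or 1
    return (
        "langumo:\n"
        "  inputs:\n"
        f"  - path: {dump_path}\n"
        "    parser: langumo.parsers.WikipediaParser\n"
        "\n"
        "  build:\n"
        "    parsing:\n"
        f"      num-workers: {num_workers}\n"
        "\n"
        "    tokenization:\n"
        f"      vocab-size: {vocab_size}\n"
    )


config_text = render_build_config(
    "src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2"
)
print(config_text)
# To save it: open("build.yml", "w", encoding="utf-8").write(config_text)
```

Adding another corpus would mean adding another `path`/`parser` entry under `inputs`, in line with the feature list's note that new corpora are added through the build configuration.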

Now we are ready to create our first dataset. Run langumo!

$ langumo

You should then see output like the following:

[*] import file from src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
[*] parse raw-formatted corpus file with WikipediaParser
[*] merge 1 files into one
[*] shuffle raw corpus file: 100%|██████████████████████████████| 118042/118042 [00:01<00:00, 96965.15it/s]
[00:00:10] Reading files (256 Mo)                   ███████████████████████████████████                 100
[00:00:00] Tokenize words                           ███████████████████████████████████ 418863   /   418863
[00:00:01] Count pairs                              ███████████████████████████████████ 418863   /   418863
[00:00:02] Compute merges                           ███████████████████████████████████ 28942    /    28942
[*] export the processed file to build/vocab.txt
[*] tokenize sentences with WordPiece model: 100%|███████████████| 236084/236084 [00:23<00:00, 9846.67it/s]
[*] split validation corpus - 23609  of 236084 lines
[*] export the processed file to build/corpus.train.txt
[*] export the processed file to build/corpus.eval.txt

After building the dataset, workspace will contain the following files:

workspace
├── build
│   ├── corpus.eval.txt
│   ├── corpus.train.txt
│   └── vocab.txt
├── src
│   └── enwiki-20200901-pages-articles1.xml-p1p30303.bz2
└── build.yml
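A quick way to sanity-check the build is to count lines in the three output files and compare them with the split reported in the log (23609 of 236084 lines held out). The sketch below assumes one example per line in the corpus files and one token per line in vocab.txt; the helper names are hypothetical.

```python
from pathlib import Path


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def split_summary(build_dir="build"):
    """Summarize the sizes of the files produced in the build directory."""
    build = Path(build_dir)
    train = count_lines(build / "corpus.train.txt")
    evaluation = count_lines(build / "corpus.eval.txt")
    vocab = count_lines(build / "vocab.txt")  # assumed: one token per line
    total = train + evaluation
    return {
        "train": train,
        "eval": evaluation,
        "vocab": vocab,
        "eval_ratio": evaluation / total if total else 0.0,
    }


# Run from inside the workspace after a build has finished:
# print(split_summary("build"))
```

For the run shown above, an `eval_ratio` close to 0.1 would be consistent with the logged split.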

Usage

usage: langumo [-h] [config]

The unified corpus building environment for Language Models.

positional arguments:
  config      langumo build configuration

optional arguments:
  -h, --help  show this help message and exit
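Because the command takes a single optional `config` argument, a build can also be launched from a Python script, for example as one step of a larger data preparation job. This is a minimal sketch that assumes the `langumo` executable is on PATH; the helper names `langumo_command` and `run_build` are hypothetical.

```python
import subprocess


def langumo_command(config=None):
    """Build the argument list for the CLI described in the usage text."""
    command = ["langumo"]
    if config is not None:
        command.append(str(config))  # the optional positional argument
    return command


def run_build(config=None, cwd=None):
    """Run langumo in `cwd`; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(langumo_command(config), cwd=cwd, check=True)


# Example, from the directory that contains the workspace:
# run_build("build.yml", cwd="workspace")
```

In the quick start, running the command with no argument inside a directory containing build.yml was enough, so passing `config` explicitly is mainly useful when the file lives elsewhere or has a different name.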