The unified corpus building environment for Language Models.
langumo is a unified corpus building environment for Language Models.
langumo provides pipelines for building text-based datasets. Constructing a
dataset requires a complicated pipeline (e.g. parsing, shuffling, and
tokenization). Moreover, when corpora are collected from different sources,
extracting text from the various formats becomes a problem of its own. langumo
helps you build a dataset from these diverse formats simply, all at once.
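To make the pipeline concrete, here is a toy sketch (plain Python, not langumo's API) of the three stages the library automates; the record format and the whitespace tokenization are illustrative assumptions only:
import random

# Raw records as they might arrive from a source dump (the format is made up here).
raw_records = ["<doc>Hello world.</doc>", "<doc>Corpus building.</doc>"]

# 1. Parsing: strip the source format down to plain text.
texts = [r.removeprefix("<doc>").removesuffix("</doc>") for r in raw_records]

# 2. Shuffling: randomize the order of examples for training.
random.shuffle(texts)

# 3. Tokenization: split text into model-ready units (naive whitespace here;
#    langumo trains a WordPiece subword model instead).
tokens = [text.split() for text in texts]
print(tokens)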
langumo can be installed using pip as follows:
$ pip install langumo
You can install langumo from source by cloning the repository and running:
$ git clone https://github.com/affjljoo3581/langumo.git
$ cd langumo
$ python setup.py install
Let's build a Wikipedia dataset. First, install langumo in your virtual
environment.
$ pip install langumo
After installing langumo, create a workspace for the build.
$ mkdir workspace
$ cd workspace
Before creating the dataset, we need a Wikipedia dump file (the source of the
dataset). You can get various versions of Wikipedia dump files from
https://dumps.wikimedia.org/.
In this tutorial, we will use a part of a Wikipedia dump file. Download the
file with your browser and move it to workspace/src, or simply use wget to
fetch it from the terminal:
$ wget -P src https://dumps.wikimedia.org/enwiki/20200901/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
langumo needs a build configuration file which contains the details of the
dataset. Create a build.yml file in the workspace and write the following:
langumo:
  inputs:
  - path: src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
    parser: langumo.parsers.WikipediaParser

  build:
    parsing:
      num-workers: 8 # The number of CPU cores you have.

    tokenization:
      vocab-size: 32000 # The vocabulary size.
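If you script builds for several dumps, you can also generate this file programmatically. Below is a minimal sketch using PyYAML (not part of langumo); the keys simply mirror the configuration above:
import yaml

config = {
    "langumo": {
        "inputs": [{
            "path": "src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2",
            "parser": "langumo.parsers.WikipediaParser",
        }],
        "build": {
            "parsing": {"num-workers": 8},          # CPU cores to use.
            "tokenization": {"vocab-size": 32000},  # Vocabulary size.
        },
    }
}

# Write the configuration without reordering the keys.
with open("build.yml", "w") as fp:
    yaml.safe_dump(config, fp, sort_keys=False)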
Now we are ready to create our first dataset. Run langumo!
$ langumo
Then you will see the following output:
[*] import file from src/enwiki-20200901-pages-articles1.xml-p1p30303.bz2
[*] parse raw-formatted corpus file with WikipediaParser
[*] merge 1 files into one
[*] shuffle raw corpus file: 100%|██████████████████████████████| 118042/118042 [00:01<00:00, 96965.15it/s]
[00:00:10] Reading files (256 Mo) ███████████████████████████████████ 100
[00:00:00] Tokenize words ███████████████████████████████████ 418863 / 418863
[00:00:01] Count pairs ███████████████████████████████████ 418863 / 418863
[00:00:02] Compute merges ███████████████████████████████████ 28942 / 28942
[*] export the processed file to build/vocab.txt
[*] tokenize sentences with WordPiece model: 100%|███████████████| 236084/236084 [00:23<00:00, 9846.67it/s]
[*] split validation corpus - 23609 of 236084 lines
[*] export the processed file to build/corpus.train.txt
[*] export the processed file to build/corpus.eval.txt
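Note how the log mirrors the pipeline: the dump is imported, parsed with WikipediaParser, merged, shuffled, tokenized with the trained WordPiece model, and finally split, with 23609 of the 236084 lines (about 10%) held out as the evaluation corpus.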
After building the dataset, the workspace will contain the following files:
workspace
├── build
│   ├── corpus.eval.txt
│   ├── corpus.train.txt
│   └── vocab.txt
├── src
│   └── enwiki-20200901-pages-articles1.xml-p1p30303.bz2
└── build.yml
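The build/vocab.txt produced above is a WordPiece vocabulary, so it can be loaded with Hugging Face's tokenizers library for a quick sanity check. The snippet below is a sketch that assumes BERT-style defaults (casing and special tokens may differ from your build settings):
from tokenizers import BertWordPieceTokenizer

# Load the vocabulary trained by langumo.
tokenizer = BertWordPieceTokenizer("build/vocab.txt")

# Peek at the first training example.
with open("build/corpus.train.txt", encoding="utf-8") as fp:
    print(fp.readline().strip())

# Encode an arbitrary sentence with the learned vocabulary.
encoding = tokenizer.encode("langumo builds corpora for language models.")
print(encoding.tokens)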
usage: langumo [-h] [config]
The unified corpus building environment for Language Models.
positional arguments:
config langumo build configuration
optional arguments:
-h, --help show this help message and exit
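As the tutorial above suggests, running langumo without an argument appears to pick up the build.yml in the current directory; pass a path explicitly (e.g. langumo my-build.yml) to use a different configuration file.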
You can find the langumo documentation
on the website.
langumo is Apache-2.0 Licensed.