Wordsiv is a Python package for generating text with a limited character set. It is designed for type proofing, but it may be useful for generating lipograms.
Let's say you have the letters HAMBURGERFONTSIVhamburgerfontsiv and
punctuation ., in your font. Wordsiv might generate the following drivel:
True enough, for a fine enough to him at the thrones above, some time for at that first the business. She is that he hath set as a thought he even from its substratums It is the rest of it is not the things the savings of a movement that he measures about the matter, Ahab gives to the boats at noon, or from the Green Forest, this high as on Ahab.
While designing a typeface, it is useful to examine text with a partial character set. Wordsiv tries its best to generate realistic-looking text with whatever glyphs are available.
First, install wordsiv with pip:
# we install straight from git (for now!)
$ pip install git+https://github.com/tallpauley/wordsiv # byexample: +pass

Next, install one or more source packages from the releases page of the source packages repo:
$ base=https://github.com/tallpauley/wordsiv-source-packages/releases/download
$ pkg=en_markov_gutenberg-0.1.0/en_markov_gutenberg-0.1.0-py3-none-any.whl
$ pip install $base/$pkg # byexample: +pass

Now you can make bogus sentences in Python!
>>> import wordsiv
>>> wsv = wordsiv.WordSiv(limit_glyphs='HAMBURGERFONTSIVhamburgerfontsiv')
>>> wsv.sentence(source='en_markov_gutenberg')
('I might go over the instant to the streets in the air of those the same be '
'haunting')

If you prefer to work in the DrawBot app, you can follow this procedure to install Wordsiv:
Install the wordsiv package via Python->Install Python Packages:
git+https://github.com/tallpauley/wordsiv and click Go!
Install the desired Source packages in the same window, but add --no-deps
at the end:
https://github.com/tallpauley/wordsiv-source-packages/releases/download/en_wordcount_web-0.1.0/en_wordcount_web-0.1.0-py3-none-any.whl --no-deps
You can find .whl or .tar.gz package URLs on the
source packages repo's releases page under
the assets drop-down.

When you write your DrawBot script, you will add each source using
add_source_module():
import wordsiv
import en_wordcount_web
wsv = wordsiv.WordSiv()
wsv.add_source_module(en_wordcount_web)
print(wsv.sentence(source="en_wordcount_web"))
Wordsiv first needs some words, which come in the form of Sources: objects which supply the raw word data.
These Sources are available via Source Packages, which are simply Python Packages. Let's install some:
$ base=https://github.com/tallpauley/wordsiv-source-packages/releases/download
# A markov model trained on public domain books
$ pkg=en_markov_gutenberg-0.1.0/en_markov_gutenberg-0.1.0-py3-none-any.whl
$ pip install $base/$pkg # byexample: +pass
# Most common English words compiled by Peter Norvig with data from Google
$ pkg=en_wordcount_web-0.1.0/en_wordcount_web-0.1.0-py3-none-any.whl
$ pip install $base/$pkg # byexample: +pass
# Most common Trigrams compiled by Peter Norvig with data from Google
$ pkg=en_wordcount_trigrams-0.1.0/en_wordcount_trigrams-0.1.0-py3-none-any.whl
$ pip install $base/$pkg # byexample: +pass

Wordsiv auto-discovers these installed packages and can use these sources right away. Let's try a source with the most common words in the English language in modern usage:
>>> from wordsiv import WordSiv
>>> wsv = WordSiv()
>>> wsv.sentence(source='en_wordcount_web')
('Maple canvas sporting pages transferred with superior government brand with '
'women for key assign.')

How does Wordsiv know how to arrange words from a Source into a sentence? This is where Models come into play.
The source en_wordcount_web uses model rand by default. Here we explicitly
select the model rand to achieve the same result as above:
>>> wsv = WordSiv()
>>> wsv.sentence(source='en_wordcount_web', model="rand")
('Maple canvas sporting pages transferred with superior government brand with '
'women for key assign.')

Notice we get the same sentence when we initialize a new WordSiv() object. This is because Wordsiv is designed to be deterministic.
If we want text that is somewhat natural-looking, we might use our
MarkovModel (model='mkv').
>>> wsv.paragraph(source="en_markov_gutenberg", model="mkv") # byexample: +skip
"Why don't think so desirous of hugeness. Our pie is worship..."A Markov model is trained on real text, and forecasts each word by looking at the preceding word(s). We keep the model as stupid as possible though (one word state) to generate as many different sentences as possible.
WordCount Sources and Models work with simple lists of words and occurrence counts to generate words.
The RandomModel (model='rand') uses occurrence counts to
randomly choose words, favoring more popular words (conceptually a weighted random choice; see the sketch after these examples):
# Default: probability by occurence count
>>> wsv.paragraph(source='en_wordcount_web', model='rand') # byexample: +skip
'Day music, commencement protection to threads who and dimension...'

The RandomModel can also be set to ignore occurrence counts and choose words completely randomly:
>>> wsv.sentence(source='en_wordcount_web', sent_len=5, prob=False) # byexample: +skip
'Conceivably championships consecration ects— anointed.'
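As referenced above, the occurrence-count weighting is conceptually just a weighted random choice over the word list; a minimal sketch with hypothetical counts (not Wordsiv's actual code):

>>> import random
>>> words = ['the', 'of', 'zygote']
>>> counts = [500, 300, 2]  # hypothetical occurrence counts
>>> random.choices(words, weights=counts, k=3)  # frequent words win most draws # byexample: +skip
['the', 'the', 'of']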
The SequentialModel (model='seq') spits out words in the order they appear in the Source. We could use this Model to display the top 5 trigrams in the English language:
>>> wsv.words(source='en_wordcount_trigrams', num_words=5) # byexample: +skip
['the', 'ing', 'and', 'ion', 'tio']

Wordsiv is built around the idea of selecting words which can be rendered with the glyphs in an incomplete font file. Wordsiv can automatically determine what glyphs are in a font file.
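Under the hood, this kind of filtering boils down to simple set containment; a minimal sketch (not Wordsiv's actual code):

>>> available = set('HAMBURGERFONTSIVhamburgerfontsiv')
>>> [w for w in ['hamburger', 'kitchen', 'fonts'] if set(w) <= available]
['hamburger', 'fonts']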
Let's load a font with the characters HAMBURGERFONTSIVhamburgerfontsiv:
>>> wsv = WordSiv(font_file='tests/data/noto-sans-subset.ttf')
>>> wsv.sentence(source='en_markov_gutenberg', max_sent_len=10)
'Nor is fair to be in as these annuities'

We can limit the glyphs in the same way, but manually, with limit_glyphs:
>>> wsv = WordSiv(limit_glyphs='HAMBURGERFONTSIVhamburgerfontsiv')
>>> wsv.sentence(source='en_wordcount_web')
'Manage miss ago are motor to rather at first to be of has forget'

font_file AND limit_glyphs

It can be useful at times to specify the character set we want to display, while
only using those characters that are actually in the font file. We can do
this by specifying both font_file and limit_glyphs:
>>> wsv = WordSiv(
... font_file='tests/data/noto-sans-subset.ttf',
... limit_glyphs='abcdefghijklmnop'
... )
>>> wsv.sentence('en_wordcount_web', cap_sent=False, min_wl=3)
'eng gnome gene game egg one aim him again one game one image boom'

There are a variety of ways text can be manipulated. Here are some examples:
Both the MarkovModel and WordCount models allow us to uppercase or lowercase text, whether or not the source words are capitalized:
>>> wsv = WordSiv()
>>> wsv.sentence('en_wordcount_web', uc=True, max_sent_len=8)
'MAPLE CANVAS SPORTING PAGES TRANSFERRED, WITH SUPERIOR GOVERNMENT.'
>>> wsv.sentence(
... 'en_markov_gutenberg', lc=True, min_sent_len=7, max_sent_len=10
... )
'i besought the bosom of the sun so'

The RandomModel capitalizes sentences by default, but we can turn this off:
>>> wsv.sentence('en_wordcount_web', cap_sent=False, sent_len=10)
'egcs very and mortgage expressed about and online truss controls.'

By default, the WordCount models insert punctuation with probabilities roughly derived from usage in the English language.
We can turn this off by passing our own function for punctuation:
>>> def only_period(words, *args): return ' '.join(words) + '.'
>>> wsv.paragraph(
... source='en_wordcount_web', punc_func=only_period, sent_len=5, para_len=2
... )
'By schools sign I avoid. Or about fascism writers what.'

For more details on punc_func, see punctuation.py.
This only applies to WordCount models, as the
MarkovModel uses the punctuation in its source data.
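For instance, here is another punc_func following the same signature as only_period above (the output shown is illustrative):

>>> def shouted(words, *args): return ' '.join(words).upper() + '!'
>>> wsv.paragraph(
...     source='en_wordcount_web', punc_func=shouted, sent_len=5, para_len=2
... ) # byexample: +skip
'BY SCHOOLS SIGN I AVOID! OR ABOUT FASCISM WRITERS WHAT!'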
Models take care of generating sentences and words, so parameters
relating to these are handled by the models. For now, please refer to the source
code for these models to learn the parameters accepted by the word(), words(),
and sentence() APIs.
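In the meantime, Python's built-in help() is a quick way to inspect these parameters from a session:

>>> from wordsiv import WordSiv
>>> help(WordSiv.sentence) # byexample: +skip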
The WordSiv object itself handles sentences(), paragraph(),
paragraphs(), and text() calls with their parameters. See the WordSiv
class source code to learn how to customize the text output.
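For example (method names as listed above; parameters beyond source are omitted):

>>> wsv.paragraphs(source='en_wordcount_web') # byexample: +skip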
When proofing type, we probably want our proof to stay the same as long as we have the same character set. This helps us compare changes in the type.
For this reason, Wordsiv uses a single pseudo-random number generator that is seeded upon creation of the WordSiv object. This means that a Python script using this library will produce the same output wherever it runs.
If you want your script to generate different words, you can seed the WordSiv object:
>>> wsv = WordSiv(seed=6)
>>> wsv.sentence(source="en_markov_gutenberg", min_sent_len=7)
'even if i forgot the go in their'

After watching the documentary Coded Bias, I considered whether we should even generate text based on historical (or even current) data, because of the sexism, racism, colonialism, homophobia, etc., contained within the texts.
This section attempts to address some of the ethical questions that arose for me (Chris Pauley), and to steer this project away from generating offensive text.
First off, this library was designed for generating dummy text for type proofing.
Of course, we naturally read words (duh), so it goes without saying that you should supervise text generated by this library.
I considered whether there were more progressive texts I could train a Markov model on. However, we are scrambling the source text into meaninglessness anyway, to maximize the number of sentences we can make with a limited character set.
Even the most positive text can get dark really quick when scrambled. A Markov model of state size 1 (ideal for limited character sets) trained with the UN Universal Declaration of Human Rights made this sentence:
Everyone is entitled to torture
or other limitation of brotherhood.
The point is, semi-random word generation ruins the meaning of text, so why bother picking a thoughtful source? However, we should really try to stay away from offensive source material, because offensive patterns will show up if there is any probability involved.
If you want to contribute a source and/or model to this project, here are some guidelines:
For example, we keep the MarkovModel from picking up too much context from the original text by keeping a state size of 1. Having a single-word state also increases the number of potential sentences, so it works out.
The sentences we generate make less sense, but since this is designed for dummy text for proofing, this is a good thing!
It's tricky filtering out "offensive" words.
Since we're generating nonsensical text for proofing, we should try our best to filter wordlists by offensive words lists. If you really need swears in your text, you can create sources for your own purposes.
We can't prevent random words from forming offensive sentences, but we can at least restrict words that tend to form offensive sentences.
Statistical models like those used in WordSiv will pick up on patterns in text, especially MarkovModels. Try to pick source material that is fairly neutral (not that anything really is).
I trained en_markov_gutenberg with these public domain texts from nltk, which seemed safe enough for a dumb, single-word-state Markov model:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']
If you notice any particular models generating offensive sentences more often than not, please file an issue at the wordsiv-source-packages repo.
I'm definitely not the first to generate words for proofing. Check out these cool projects, and let me know if you know of more I should add!
I probably wouldn't have got very far without the inspiration of word-o-mat, and a nice DrawBot script that Rob Stenson shared with me. The latter is where I got the idea to seed the random number generator to make it deterministic.
I also borrowed heavily from spaCy in how I set up the Source packages.
Also want to thank my wife Pammy for kindly listening as I explain each esoteric challenge I've tackled, and lending me emotional support when I almost wiped out 4 hours of work with a careless Git mistake.