We're going to write a set of simple command line tools to display basic statistics about a text file or set of text files. Some basic statistics include...
We'll also work towards adding the ability to...
Here's a screenshot of a program that downloads the entire text of Moby Dick from Project Gutenberg and prints out a histogram of the letter frequencies.
It turns out that the letter "t" makes up 9.25% of all the letters in Moby Dick.
To get started, you'll need to...
Run npm install to install the required packages.
textalyze.js is the source code for this project.
sample_data is a directory containing sample text files to analyze, mostly from Project Gutenberg.
The textalyze.js file that comes with this repository is filled with comments designed to help you get started. You should feel free to delete them in order to make the program easier to read.
Think about the questions you'd need to be able to answer in order to make it work, though:
These questions run the gamut from nitty-gritty JavaScript to user experience, while also starting us down the path of becoming comfortable with how the web works.
To request feedback on your code, use the standard GitHub flow process.
This project is structured as a sequence of iterations, each of which builds on previous iterations. Iterations serve three important roles:
Using hard-coded examples, write a function that takes an Array containing
arbitrary and possibly duplicated items as input and returns an Object containing
item/count pairs. We've written some
This iteration has tests written for you. Run npm test to see the failing tests. Remember to run npm install first!
That is, if the input has 100 entries and 20 of them are the letter "a", then the resulting Object should contain
{ 'a': 20 }
"Sensible" is up to you to define, but here's a suggested format, pretending we hard-coded the input as ["a", "a", "a", "b", "b", "c"].
user@host project-js-textalyze $ node textalyze.js
The counts for ["a", "a", "a", "b", "b", "c"] are...
a 3
b 2
c 1
user@host project-js-textalyze $
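Here's a minimal sketch of one way this iteration could look. It is not the only approach: itemCounts follows the lib/itemCounts.js name used later in this README, while formatCounts is a hypothetical helper name introduced just for this example.

```javascript
// One possible implementation of the counting function.
// formatCounts is a hypothetical helper, not part of the base repository.
function itemCounts(items) {
  const counts = {};
  for (const item of items) {
    // Object keys are always strings, so 1 and '1' end up in the same entry.
    // hasOwnProperty avoids confusing an item such as 'constructor' with a
    // property inherited from Object.prototype.
    if (Object.prototype.hasOwnProperty.call(counts, item)) {
      counts[item] += 1;
    } else {
      counts[item] = 1;
    }
  }
  return counts;
}

// Turn an Object of item/count pairs into printable "item count" lines.
function formatCounts(counts) {
  return Object.keys(counts).map((item) => `${item} ${counts[item]}`);
}

const input = ['a', 'a', 'a', 'b', 'b', 'c'];
console.log(`The counts for ${JSON.stringify(input)} are...`);
formatCounts(itemCounts(input)).forEach((line) => console.log(line));
```

Keeping the counting and the printing in separate functions makes the counting function easier to test and to reuse in later iterations.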
Using hard-coded examples, write a function that takes an arbitrary String as input and returns an Array of all the characters in the string, including spaces and punctuation.
Feed this into the array-counting function from the previous iteration to get an Object containing letter/count pairs. Print out those pairs in a sensible way.
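As a rough sketch of this iteration, the string can be split into characters and handed to the counting function. stringToCharacters is a hypothetical name, and the itemCounts shown here is only a stand-in for whatever you wrote in the previous iteration.

```javascript
// stringToCharacters is a hypothetical helper name chosen for this sketch.
function stringToCharacters(text) {
  // Array.from iterates by Unicode code point, so characters such as emoji
  // are kept whole instead of being split into two halves.
  return Array.from(text);
}

// Stand-in for the array-counting function from the previous iteration.
function itemCounts(items) {
  const counts = {};
  for (const item of items) {
    if (Object.prototype.hasOwnProperty.call(counts, item)) {
      counts[item] += 1;
    } else {
      counts[item] = 1;
    }
  }
  return counts;
}

const characters = stringToCharacters('Hello, world!');
const counts = itemCounts(characters);
for (const [character, count] of Object.entries(counts)) {
  // JSON.stringify puts quotes around each character so spaces stay visible.
  console.log(`${JSON.stringify(character)} ${count}`);
}
```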
Create a file lib/sanitize.js and define a function called sanitize inside. As in lib/itemCounts.js, the last line should be
module.exports = sanitize
The sanitize function should take an arbitrary String (perhaps containing spaces, punctuation, line breaks, etc.) and return a "sanitized" string that replaces all upper-case letters with their lower-case equivalents. This will ensure that the letters 'A' and 'a' are not treated as two distinct letters when we analyze our text. We'll handle punctuation and other bits in a later iteration.
It should work like this:
sanitize('This is a sentence.') // => 'this is a sentence.'
sanitize('WHY AM I YELLING?') // => 'why am i yelling?'
sanitize('HEY: ThIs Is hArD tO rEaD!') // => 'hey: this is hard to read!'
Lucky for us, JavaScript comes with a built-in function to help us: String.prototype.toLowerCase.
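A minimal sketch of what lib/sanitize.js might contain at this stage, assuming lower-casing is the only sanitizing step so far:

```javascript
// lib/sanitize.js (sketch): lower-case the input and nothing else for now.
function sanitize(text) {
  // toLowerCase returns a new string; the original string is not modified.
  return text.toLowerCase();
}

module.exports = sanitize;
```

Because all of the sanitizing lives in this one function, later steps such as removing punctuation can be added here without touching the counting code.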
Integrate this function into your current program so that the Object of results contains, e.g.,
{ 'a': 25 }
instead of
{ 'a': 19, 'A': 6 }
Oftentimes the data we want isn't in a format that makes it easy to analyze. The process of taking poorly-formatted data and transforming it into something we can make use of is called sanitizing our data.
What counts as "sanitizing" varies depending on the underlying data and our needs. For example, if we wanted to look at all the text in an HTML document, we wouldn't want to be counting all the HTML tags. Conversely, if we wanted a report on the most commonly-used tags in an HTML document, we'd want to keep the tags but remove the text.
In our case, we've designed our program such that it treats upper-case letters and lower-case letters as distinct letters, i.e., our results Object might contain
{ 'a': 20, 'A': 5 }
but we'd probably rather it just contain
{ 'a': 25 }
Likewise, we probably don't care about punctuation (periods, commas, hyphens, colons, etc.), although this is harder to deal with than differences between upper-case and lower-case letters.
The base repository contains a directory called sample_data that contains a handful of text files. Hard-code the name of one of these files into your program and read the contents of that file into a string. Pass that string into your current program, so that it now prints out the letter-count statistics for that specific file instead of the hard-coded strings you had in the previous iteration.
To read the contents of a file into a string, see fs.readFile and fs.readFileSync.
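For illustration, here's a small sketch using the synchronous variant. readTextFile is a hypothetical helper name, and the file name in the comment is a placeholder rather than a file known to be in sample_data. fs.readFile does the same job asynchronously and hands the contents to a callback instead of returning them.

```javascript
const fs = require('fs');

// readTextFile is a hypothetical helper name used only in this sketch.
function readTextFile(filePath) {
  // Passing an encoding makes readFileSync return a String.
  // Without it, the call returns a Buffer of raw bytes.
  return fs.readFileSync(filePath, 'utf8');
}

// Example usage (replace the placeholder with a real file from sample_data):
//   const text = readTextFile('sample_data/your-file-here.txt');
//   console.log(text.length);
```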
We don't want to edit our JavaScript code every time we need to change the file from which we're reading data. Let's change it so that the user running the program can pass in the name of the file from which to read. We'll do this using command line arguments.
This iteration marks v1.0 of our program. As it stands, our program — although limited — is self-contained enough that you could give it to another person and they could use it as you intended without having to know how to edit JavaScript code.
Congrats!
Consider the following command run from the Terminal:
node some-program.js first_argument second_argument banana
The command line arguments are first_argument, second_argument, and banana, with a space denoting the separation between each argument. first_argument is the first command line argument and banana is the third command line argument.
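In Node.js, these arguments are available through process.argv. The sketch below shows one way to pull them out. userArguments is a hypothetical helper name, and the usage message is just a suggestion.

```javascript
// process.argv[0] is the path to the node executable and process.argv[1] is
// the path to the script, so the user's own arguments start at index 2.
function userArguments(argv) {
  return argv.slice(2);
}

const [fileName] = userArguments(process.argv);
if (fileName === undefined) {
  console.log('Usage: node textalyze.js <file name>');
} else {
  console.log(`Would analyze: ${fileName}`);
}
```

For the command above, userArguments would return ['first_argument', 'second_argument', 'banana'].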
Using hard-coded examples, write a function that takes an Array containing arbitrary and possibly duplicated entries as input and returns an Object containing item/frequency pairs. Print out those pairs in a sensible way.
That is, if the input has 100 entries and 20 of them are the letter "a", then the returned Object should have
{ 'a': 0.20 }
You've already written a function that takes an Array and returns an Object containing entry/count pairs, and you'll need these counts (one way or another) in order to calculate the overall frequency. If you want to stretch yourself, try writing your "frequency statistics" function in a way that makes use of your "counting statistics" function, so that you don't have to duplicate as much code or work in your program.
This is a "stretch approach," which means it's absolutely not necessary for you to write your program this way. Like we've been saying, it's much better to write something and get feedback on it than get stuck while trying to puzzle out a "better", "faster", "more elegant", etc. approach.
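If you do try the stretch approach, one possible shape is sketched below. itemFrequencies is a hypothetical name, and itemCounts is a stand-in for the counting function you wrote earlier.

```javascript
// Stand-in for the counting function from the earlier iteration.
function itemCounts(items) {
  const counts = {};
  for (const item of items) {
    if (Object.prototype.hasOwnProperty.call(counts, item)) {
      counts[item] += 1;
    } else {
      counts[item] = 1;
    }
  }
  return counts;
}

// itemFrequencies (hypothetical name) reuses itemCounts and divides each
// count by the total number of items.
function itemFrequencies(items) {
  const counts = itemCounts(items);
  const frequencies = {};
  for (const [item, count] of Object.entries(counts)) {
    frequencies[item] = count / items.length;
  }
  // An empty input produces an empty Object, so there is no division by zero.
  return frequencies;
}

console.log(itemFrequencies(['a', 'a', 'a', 'b', 'b', 'c']));
```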
Print out a histogram of letter frequencies that looks something like the following:
The goal is to produce a useful, well-designed output. It doesn't have to look identical to the above output.
Hint: You can use the frequency for each item as a way to scale the length of the histogram.
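To make the hint concrete, here's a sketch that scales each bar relative to the most frequent item. formatHistogram, its maxBarWidth parameter, and the exact line layout are all assumptions made for this example, so your own output can look quite different.

```javascript
// formatHistogram and maxBarWidth are hypothetical names for this sketch.
// frequencies is an Object of item/frequency pairs, e.g. { a: 0.5, b: 0.25 }.
function formatHistogram(frequencies, maxBarWidth = 50) {
  const items = Object.keys(frequencies).sort();
  if (items.length === 0) {
    return [];
  }
  // Scale bars so that the most frequent item is exactly maxBarWidth wide.
  const largest = Math.max(...items.map((item) => frequencies[item]));
  return items.map((item) => {
    const frequency = frequencies[item];
    const barLength = Math.round((frequency / largest) * maxBarWidth);
    // Show the frequency as a percentage with two decimal places.
    const percent = (frequency * 100).toFixed(2).padStart(6);
    return `${item} [${percent}%] ${'='.repeat(barLength)}`;
  });
}

formatHistogram({ a: 0.5, b: 0.25, c: 0.25 }, 20).forEach((line) => console.log(line));
```

Scaling against the largest frequency, rather than against 100%, keeps the bars readable even when no single letter makes up more than a few percent of the text.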
Here are some additional features you might add:
Install and use the request module to add support for passing in URLs as well as file names. For example, rather than having to download Moby Dick first, you could run
node textalyze.js http://www.gutenberg.org/cache/epub/2701/pg2701.txt
Add support for displaying the 5 (or N) most common words instead of just letter frequencies.
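As a starting point for the most-common-words feature, here's a rough sketch. mostCommonWords is a hypothetical name, and the definition of a "word" used here (runs of the letters a-z, optionally joined by apostrophes) is a simplifying assumption that won't suit every language or text.

```javascript
// mostCommonWords is a hypothetical helper; n defaults to 5.
function mostCommonWords(text, n = 5) {
  // Assumption: a word is a run of a-z letters, optionally containing
  // apostrophes between letters (e.g. "don't"). match returns null when
  // nothing matches, hence the fallback to an empty Array.
  const words = text.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) || [];

  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  // Sort by count (highest first), breaking ties alphabetically.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .slice(0, n);
}

console.log(mostCommonWords('The cat and the hat and the bat.', 2));
```

The function returns an Array of [word, count] pairs rather than an Object, because an Array keeps the ranking order explicit.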
Add support for exporting the data in a format you can load into Excel, like a CSV file. You can install and use the csv-writer module to do this.
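To give a feel for what the exported data could look like, the sketch below builds the CSV text by hand. It does not use the csv-writer module, so treat it as an illustration of the format only. countsToCsv, escapeCsvField, the item,count header, and the output file name are all hypothetical.

```javascript
// Hand-rolled CSV output; this sketch does NOT use the csv-writer module.
// escapeCsvField and countsToCsv are hypothetical helper names.
function escapeCsvField(value) {
  const text = String(value);
  // Fields containing commas, double quotes, or line breaks must be wrapped
  // in double quotes, with any embedded double quotes doubled.
  if (/[",\r\n]/.test(text)) {
    return `"${text.replace(/"/g, '""')}"`;
  }
  return text;
}

// Convert an Object of item/count pairs into CSV text with a header row.
function countsToCsv(counts) {
  const rows = Object.entries(counts).map(
    ([item, count]) => `${escapeCsvField(item)},${escapeCsvField(count)}`
  );
  return ['item,count', ...rows].join('\n') + '\n';
}

// Example usage (file name is a placeholder):
//   require('fs').writeFileSync('letter-counts.csv', countsToCsv({ a: 3, b: 2 }));
console.log(countsToCsv({ a: 3, b: 2 }));
```

The escaping matters for this project in particular, because commas and quotation marks are themselves characters you may end up counting.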
Find texts from multiple languages and compare the letter frequency between languages. A language's letter frequency acts as a kind of fingerprint, and you'd be surprised at how little text it takes to identify a language once you know the letter frequencies.
Use a charting library like AnyChart to export a graphical histogram.
To install a module, run the following command (replacing nameOfModule with the name of the desired module):
npm install --save nameOfModule
This will update package.json and add the module as a dependency. Read each module's documentation to see how to require and use it.