How to Use Python to Find the Zipf Distribution of a Text File

You might be wondering about the term Zipf distribution. To understand what we mean by this term, we need to define Zipf's law first. Don't worry, I'll keep everything simple. 

Zipf's Law

Zipf's law simply states that, given some corpus (a large and structured set of texts) of natural language utterances, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, four times as often as the fourth most frequent word, and so forth. In other words, a word's frequency is roughly inversely proportional to its rank in the frequency table.

Let's look at an example of that. If you look into the Brown Corpus of American English, you will notice that the most frequent word is the (69,971 occurrences). The second most frequent word, of, occurs 36,411 times.

The word the accounts for around 7% of the words in the Brown Corpus (69,971 out of slightly over 1 million words). The word of, in turn, accounts for around 3.6% of the corpus, roughly half the share of the. Thus, we can see that Zipf's law applies to this situation.
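To make this concrete, here is a quick back-of-the-envelope check of the two counts quoted above:

```python
# Counts quoted above from the Brown Corpus.
count_the = 69971  # most frequent word: "the"
count_of = 36411   # second most frequent word: "of"

# Zipf's law predicts that "of" occurs about half as often as "the".
print(count_the / 2)         # 34985.5 -- in the same ballpark as 36,411
print(count_of / count_the)  # ~0.52, close to the predicted 1/2
```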

In essence, Zipf's law tells us that a small number of items usually accounts for the bulk of the activity we observe. For instance, a small number of diseases (cancer, cardiovascular diseases) account for the bulk of deaths. Likewise, a small number of words account for the bulk of all word occurrences in literature, and so on for many other examples in our lives.

Data Preparation

Before moving forward, let me introduce the data we will be experimenting with in this tutorial. Our data this time comes from the National Library of Medicine, which distributes what's called a MeSH (Medical Subject Heading) ASCII file. In particular, we will be downloading d2016.bin (28 MB).

I will not go into detail describing this file, since that is beyond the scope of this tutorial; we just need it to experiment with our code.

Building the Program

Now that you have downloaded the data, let's start building the Python script that will find the Zipf distribution of the data in d2016.bin.

The first step is to open the file (the variable names in the snippets below are illustrative):
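```python
# Open the MeSH ASCII file for reading.
open_file = open('d2016.bin', 'r')
```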

In order to carry out the necessary operations on the bin file, we need to load its contents into a string variable. This can simply be achieved using the read() function, as follows:
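```python
# Load the entire file contents into a single string, then close the file.
file_to_string = open_file.read()
open_file.close()
```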

Since we will be looking for some pattern (i.e. words), regular expressions come into play. We will thus be making use of Python's re module.
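```python
# Python's built-in regular expression module.
import re
```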

At this point we have read the bin file and loaded its contents into a string variable. Finding the Zipf distribution means finding the frequency of occurrence of each word in the file; the regular expression will thus be used to locate the words.

The function we will be using to make such a match is findall(). As the re module documentation says about findall(), it will:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
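For instance, here is a quick toy illustration of findall() (an example of my own, unrelated to our data):

```python
import re

# findall() returns every non-overlapping match as a list of strings.
print(re.findall(r'\d+', 'MeSH 2016, 28 MB'))  # prints ['2016', '28']
```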

What we want to do is write a regular expression that will locate all the individual words in the text string variable. One regular expression that fits this description is:
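```
\b[A-Za-z][a-z]{2,9}\b
```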

where \b is an anchor for word boundaries. In Python, this can be represented as follows (using the file_to_string variable we loaded earlier):
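```python
# Find all words matching the pattern in the file's contents.
words = re.findall(r'\b[A-Za-z][a-z]{2,9}\b', file_to_string)
```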

This regular expression basically tells us to find all words that start with a letter (upper-case or lower-case) followed by a sequence of at least 2 and at most 9 lower-case letters. In other words, the words included in the output will range from 3 to 10 characters in length.

We can now run a loop that calculates the frequency of occurrence of each word; one simple way is with a dictionary and its get() method:
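```python
# Tally how many times each word occurs.
frequency = {}
for word in words:
    count = frequency.get(word, 0)  # 0 if the word hasn't been seen yet
    frequency[word] = count + 1
```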

Here, if the word has not been seen before, get() returns the default value 0 instead of raising a KeyError. The count is then incremented by 1, keeping track of the number of times the word has occurred in the list so far.

Finally, we will print the key-value pairs of the dictionary, showing each word (key) and the number of times it appeared in the list (value):
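```python
from operator import itemgetter

# Sort the (word, count) pairs by count, then print the most frequent first.
for word, count in reversed(sorted(frequency.items(), key=itemgetter(1))):
    print(word, count)
```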

The part sorted(frequency.items(), key = itemgetter(1)) sorts the output by value in ascending order, that is, from the least frequent word to the most frequent. In order to list the most frequent words first, we use the reversed() function.

Putting It All Together

After going through the different building blocks of the program, let's see how it all looks put together (this is simply the snippets above assembled into one script):
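```python
import re
from operator import itemgetter

# Load the contents of the MeSH file into a string.
open_file = open('d2016.bin', 'r')
file_to_string = open_file.read()
open_file.close()

# Find all words of 3 to 10 letters.
words = re.findall(r'\b[A-Za-z][a-z]{2,9}\b', file_to_string)

# Count the frequency of occurrence of each word.
frequency = {}
for word in words:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Print the words from most frequent to least frequent.
for word, count in reversed(sorted(frequency.items(), key=itemgetter(1))):
    print(word, count)
```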

I will show here the first ten words and their frequencies returned by the program:

From this Zipf distribution, we can validate Zipf's law: a small number of high-frequency words (the, and, was, for) account for the bulk of word occurrences. The same goes for the sequences abcdef, abbcdef, and abcdefv, which are highly frequent letter sequences with a meaning particular to this file.

Conclusion

In this tutorial, we have seen how Python makes it easy to work with statistical concepts such as Zipf's law. Python comes in especially handy when working with large text files, which would require a lot of time and effort if we were to find the Zipf distribution manually. As we saw, we were able to quickly load, parse, and find the Zipf distribution of a 28 MB file, and sorting the output was simple thanks to Python's dictionaries.
