Counting Word Frequency in a File Using Python




Ever wondered how to quickly tell what a document is mainly about? Here's a simple trick: list the unique words mentioned in the document, and then check how many times each word appears (its frequency). The word frequencies give a good indication of the document's main topic. Doing that manually, however, would be slow and tedious, so we need an automated process, don't we?

Yes, an automated process makes this much easier. Let's see how we can list the unique words in a text file and check each word's frequency using Python.

1. Get the Test File

In this tutorial, we are going to use test.txt as our test file. Go ahead and download it, but don't open it! Let's play a small game. The text inside this test file is taken from one of my tutorials on Envato Tuts+. Based on the frequency of words, let's guess which of my tutorials this text was extracted from.

Let the game begin!

About Regular Expressions

Since we are going to apply a pattern in our game, we need to use regular expressions (regex). If “regular expressions” is a new term to you, this is a nice definition from Wikipedia:

A sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text processing utilities ed, an editor, and grep, a filter.

If you want to know more about regular expressions before moving ahead with this tutorial, you can see my other tutorial Regular Expressions In Python, and come back again to continue this tutorial.

2. Building the Program

Let’s work step by step on building this game. The first thing we want to do is to store the text file in a string variable.
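For example, a minimal sketch (the variable name document_text is my own; the write step just creates a tiny stand-in file so the snippet runs on its own — with the downloaded test.txt in place, only the read is needed):

```python
# Create a tiny stand-in for test.txt so this sketch is self-contained.
with open('test.txt', 'w') as f:
    f.write('Python is simple. Python is powerful.')

# Store the whole file in a single string variable.
with open('test.txt') as f:
    document_text = f.read()

print(document_text)
```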

Now, to make our regular expression easier to apply, let's convert all the letters in our document to lowercase using the lower() method, as follows:
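A sketch of this step, using an inline stand-in for the file contents:

```python
# document_text stands in for the file contents loaded earlier.
document_text = 'Python is Simple. Python is Powerful.'

# Convert every letter to lowercase so 'Python' and 'python' count as one word.
text_string = document_text.lower()
print(text_string)  # → 'python is simple. python is powerful.'
```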

Let's write a regular expression that returns all the words with between 3 and 15 characters. Starting at three characters helps us skip short words whose frequency we probably don't care about, such as if, of, and in, while words longer than 15 characters are unlikely to be real words. The regular expression for such a pattern looks as follows:
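One pattern consistent with that description (a reconstruction; the original tutorial's exact pattern may differ) is:

```
\b[a-z]{3,15}\b
```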

\b matches a word boundary. For more information on word boundaries, you can check this tutorial.

The above regular expression can be written as follows:
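In Python, the pattern is typically written as a raw string so the backslashes reach the regex engine intact; for instance:

```python
import re

# Raw string keeps the backslashes intact for the regex engine.
pattern = r'\b[a-z]{3,15}\b'

# Quick check: 'in' is too short to match, 'the' and 'moonlight' qualify.
print(re.findall(pattern, 'in the moonlight'))  # → ['the', 'moonlight']
```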

Since we want to find every matching word in the document, we can use the re module's findall() function, which the Python documentation describes as follows:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
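Applied to our lower-cased text, a sketch looks like this (text_string stands in for the document contents):

```python
import re

# text_string stands in for the lower-cased document from the earlier steps.
text_string = 'python is simple and python is powerful'

# findall returns every non-overlapping match as a list of strings.
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
print(match_pattern)  # → ['python', 'simple', 'and', 'python', 'powerful']
```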

At this point, we want to find the frequency of each word in the document. The suitable data structure here is a Python dictionary, since we need key-value pairs, where the key is the word and the value is the number of times that word appears in the document.

Assuming we have declared an empty dictionary frequency = {}, the above paragraph would look in code as follows:
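A sketch of that counting loop (match_pattern stands in for the findall result from the previous step):

```python
# match_pattern stands in for the list returned by re.findall earlier.
match_pattern = ['python', 'simple', 'and', 'python', 'powerful']

frequency = {}
for word in match_pattern:
    # get() returns the current count, or 0 the first time a word is seen.
    count = frequency.get(word, 0)
    frequency[word] = count + 1

print(frequency)  # → {'python': 2, 'simple': 1, 'and': 1, 'powerful': 1}
```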

We can now see our keys using:
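For instance:

```python
frequency = {'python': 2, 'simple': 1, 'and': 1, 'powerful': 1}

# The dictionary keys are the unique words found in the document.
frequency_list = frequency.keys()
print(list(frequency_list))  # → ['python', 'simple', 'and', 'powerful']
```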

Finally, in order to get the word and its frequency (number of times it appeared in the text file), we can do the following:
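A sketch:

```python
frequency = {'python': 2, 'simple': 1, 'and': 1, 'powerful': 1}

# Print each unique word alongside the number of times it appeared.
for word in frequency:
    print(word, frequency[word])
```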

Let’s put the program together in the next section, and see what the output looks like.

3. Putting It All Together

Having discussed the program step by step, let's now see what it looks like as a whole:
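A self-contained sketch of the whole program (a tiny stand-in file is written first so it runs on its own; with the downloaded test.txt in place, drop that step):

```python
import re

# Stand-in for the downloaded test file, so this sketch runs on its own.
with open('test.txt', 'w') as f:
    f.write('Python is simple. Python is powerful.')

# Step 1: read the file and lower-case the text.
with open('test.txt') as f:
    text_string = f.read().lower()

# Step 2: collect every word of 3 to 15 lowercase letters.
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Step 3: count how many times each word appears.
frequency = {}
for word in match_pattern:
    frequency[word] = frequency.get(word, 0) + 1

# Step 4: print each word with its frequency.
for word in frequency:
    print(word, frequency[word])
```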

If you run the program, you should get something like the following: 

Let’s come back to our game. Going through the word frequencies, what do you think the test file (with content from my other Python tutorial) was talking about?

(Hint: check the word with the maximum frequency).

4. Get the Most Frequent Words

In the above example, the list of unique words was fairly small due to a small text sample. So, we could pick the most frequent word after glancing through the list relatively quickly.

What if the text sample is quite large? In that case, it would be much easier to find the most frequent words by baking a simple sort into the program. Here is some example code that gets the most frequently used words from an excerpt of Dracula.
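A sketch of that idea (the Dracula excerpt's file name is not shown here, so a short inline string stands in for its contents):

```python
import re

# Stand-in text; in practice, read the Dracula excerpt from its file.
text_string = ('the night was dark and the wind howled '
               'and the castle stood silent')

frequency = {}
for word in re.findall(r'\b[a-z]{3,15}\b', text_string):
    frequency[word] = frequency.get(word, 0) + 1

# Sort the (word, count) pairs by count, highest first.
most_frequent = sorted(frequency.items(), key=lambda pair: pair[1], reverse=True)

for word, count in most_frequent[:3]:
    print(word, count)
```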

I got the following list of words after executing the program.

(Screenshot: Most Frequent Words)

5. Exclude Specific Words From the Count

You can usually expect the most common word in any large piece of text to be "the". You can filter out such unwanted filler words for a better analysis by creating a blacklist and only adding words to your dictionary if they are not in it.
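A sketch (the blacklist contents are my own choice; extend the set as needed):

```python
import re

# Filler words to ignore; membership tests on a set are fast.
blacklist = {'the', 'and', 'was', 'for', 'that', 'with'}

text_string = 'the night was dark and the wind howled and the castle stood silent'

frequency = {}
for word in re.findall(r'\b[a-z]{3,15}\b', text_string):
    # Only count words that are not on the blacklist.
    if word not in blacklist:
        frequency[word] = frequency.get(word, 0) + 1

print(frequency)
```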

Here is the output after running the above code on the same file.

(Screenshot: Most Frequent Words Without Blacklist)

Final Thoughts

In this tutorial, we learned how to get the frequency of words in a text sample by using a simple Python program. We also modified the original code to get a list of the most frequent words, or to count only words that are not in our blacklist. Hopefully, you will now be able to adapt the program to your own needs and analyze any piece of text.
