This lesson is still being designed and assembled (Pre-Alpha version)

Library Carpentry: Text & Data Mining

Introduction

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is text mining?

Objectives
  • Gain a basic understanding of what text mining is.

  • Learn some text and data mining terminology.

Introduction

Welcome to this hands-on lesson to learn some text and data mining skills. We will first run through some of the basics that you will need when exploring and analysing text.

What is Text Mining?

Text mining (also referred to as text and data mining) covers a range of methods for analysing text automatically, for example splitting text into tokens, counting how often particular words occur, viewing words in their context, and visualising patterns across a collection of documents. In this lesson we will try out some of these methods using Python and the NLTK library.

Terminology

To start with, here is some basic terminology that will be used in this lesson:

Token: a single word, letter, number or punctuation mark.

String: a sequence of characters, which can include words, letters, numbers, and punctuation.

Integer: a positive or negative whole number without a decimal point.

Stop words: generally the most common words in a language (e.g. “the”, “of”, “and” etc.) which are sometimes filtered out during text analysis in order to focus on the vocabulary that conveys more of the content of a piece of text.

Document: a single file containing some text.

Corpus: a collection of documents.

Questions:

  • How many people have used text mining for their work before?
  • Who wants to use it in future?
  • What types of text analyses would you want to do?

Key Points

  • Text mining refers to different methods used for analysing text automatically.

Jupyter Notebook

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is Jupyter Notebook?

Objectives
  • Learn what Jupyter Notebook is

  • Start Jupyter Notebook

Introduction to Jupyter Notebook

This lesson uses Jupyter Notebook, an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

Starting Jupyter Notebook server

To start the lesson within a Jupyter Notebook you first need to start a Jupyter Notebook server. To do that, open a terminal window and type:

jupyter notebook

This should open a browser window showing the base directory on your computer where you can store your notebooks.

You can learn how to open a terminal window on your computer in the setup instructions for this lesson.

Creating a new notebook

To create a new notebook, navigate to a location for it on your computer via the browser window that opened up, click on “New” in the top right corner of the browser and select Python 3.

Starting a new notebook

Once the new notebook opens you can give it a name by changing the word “Untitled” in its first line.

A new notebook

You can see the first cell in your new notebook. You can enter Python code into this cell and press “Run”, as long as the cell is marked as “Code” in the menu at the top of your notebook. This runs your code and any output created by the code appears immediately below the cell.

This should be all you need for using a notebook in this lesson, but more information on how to use Jupyter Notebook and how to store a notebook can be found in this Data Carpentry Overview of Jupyter Notebook.

Task: Testing it works

To check that it works, tell the notebook to print the string “Works!” by typing the following code into the cell:

print("Works!")

and press run.

Answer

The output of your code appears below the cell.

Works!

Key Points

  • Jupyter Notebook is a tool to run small pieces of code and create visualisations more easily than via the command line. It is useful for running tutorials and lessons such as this one.

Python Fundamentals

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I create a new variable in Python?

  • How do I print the value of a variable?

  • How can I create a list and iterate through it?

Objectives
  • Learn Python fundamentals

  • Learn Python syntax

  • Run simple python code

Introduction to Python

Python is a programming language. We will use this as a way to interact with and analyse text documents. We can use Python either through the command line in your terminal window, by writing Python scripts or via Jupyter Notebook. In this lesson we will be using it via Jupyter Notebook.

This introduction to Python is important for understanding all the steps that follow in the text mining lesson. If you are new to programming it may not be entirely clear why you need to learn Python first. Bear with us: you will soon see why it is needed and useful, and it will help to speed things up later on.

We will limit this introduction to the level of Python required for this lesson. For a more in-depth introduction to Python, we recommend the Library Carpentry introduction to Python lesson (https://librarycarpentry.org/lc-python-intro/).

Printing text

We first want to test that Python works by asking it to print a string. Type print("Hello World") into the next Jupyter Notebook cell as shown below and run it as code.

print("Hello World")
Hello World

Variables and how to assign them

In python, you can assign information to variables. A variable is a memory location used to hold data which has a name associated with it. You can think of it like a box which can contain information and which also has a name, so you can refer to it in code.

For example we can put the string “text mining” into the box named “text”.

We do that by assigning the string to the variable named text. When assigning variables in python, the name of the variable is on the left followed by an equal sign and the value of the variable, as follows:

text = "Text Mining"

You can see the content of the variable (what’s in the box) by printing the value of it using the print() function and the variable name as shown:

print(text)
Text Mining

In addition to strings we can assign other data types, such as integers and floats, to variables.

text = "Text Mining"  # An example of a string
number = 42  # An example of an integer
pi_value = 3.1415  # An example of a float

In this lesson we will be concentrating mainly on strings.

Note

Anything that follows a # sign in Python code is a comment and is not part of the code. Comments are used to explain the code but are not needed to run it.

A list

Data can be grouped together in an ordered way using a list. Lists are very common data structures used in Python, for example to represent text.

In the following example the list named “sentence” stores all the words and punctuation of the sentence “Just think happy thoughts and you’ll smile.”

Lists are created by typing comma separated values inside square brackets. You can print out all elements in the list.

sentence = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
print(sentence) # print all elements

['Just', 'think', 'happy', 'thoughts', 'and', 'you', '’ll', 'smile', '.']

The list holds an ordered sequence of elements (in this case words or punctuation) and each element can be accessed using its index or position in the list. To do that you need to specify the index.

Note

Python indices start at 0 instead of 1, so the first element in the list is accessed using a 0 in square brackets after the name of the list.

print(sentence[0]) # print the first element
Just

You can also print a slice of the list (e.g. the first two elements of the list):

sentence[0:2]
['Just', 'think']

You can remove an item from the list using the pop() method, specifying the index of the element to remove (if no index is given, the last element is removed).

sentence.pop(7)
print(sentence) # prints the entire list at once
['Just', 'think', 'happy', 'thoughts', 'and', 'you', '’ll', '.']

Items can be added to a list either by inserting them at a specific position (index) or by appending them to the end of the list.

sentence.insert(7,'fly')
print(sentence)
sentence.append('[J.M. Barrie]')
print(sentence)
['Just', 'think', 'happy', 'thoughts', 'and', 'you', '’ll', 'fly', '.']
['Just', 'think', 'happy', 'thoughts', 'and', 'you', '’ll', 'fly', '.', '[J.M. Barrie]']

A for loop

A for loop can be used to access the elements in a list one at a time.

The syntax for a for loop starts with for x in y: where y is the thing you want to loop through (in our case a list) and x is a new variable which each element of y is assigned to while looping. So when looping, x keeps getting overwritten by the next element of y until the end of y is reached.

The code after the initial line then specifies what is to be done with each element of y. This code needs to be indented (conventionally by four spaces, or a tab) so that Python knows that the instructions that follow are to be executed for each element.

For example, you can loop through the sentence list by assigning each element to the word variable and then print each element by calling the print() function:

for word in sentence:
    print(word) # prints each element of the list one after the other
Just
think
happy
thoughts
and
you
’ll
fly
.
[J.M. Barrie]

if/elif/else statements

An if/elif/else statement can be used to condition some code on something being true or false. This is useful when wanting to only run some code if a specific condition is met or for running different bits of code depending on what condition is met.

If the test that follows the statement is true, then its body (i.e., the lines indented underneath it) is executed. If the test is false, the indented code is not executed and the program continues.

In the following example, the code specifies that if the sentence list contains 5 elements, a message confirming that is printed; else, if (elif) the list contains more than 5 elements, a message saying so is printed. If neither of those two conditions is true (else), which means the sentence list contains fewer than 5 elements, a message saying that is printed. Finally the entire list is printed to show which elements it contains. The last line of code is not indented as it is supposed to run independently of the if/elif/else statements.

We use the len() function to get the length of the list, which returns an integer, and we use comparison operators as part of the tests (== for “is equal to” and > for “is greater than”).

if len(sentence) == 5:
    print("The sentence list contains 5 tokens.")
elif len(sentence) > 5:
    print("The sentence list contains more than 5 tokens.")
else:
    print("The sentence list contains less than 5 tokens.")

print(sentence)
The sentence list contains more than 5 tokens.
['Just', 'think', 'happy', 'thoughts', 'and', 'you', '’ll', 'fly', '.', '[J.M. Barrie]']

A tuple

A tuple is similar to a list as it is an ordered sequence of elements. The difference is that tuples can’t be changed once created and they are created by placing comma-separated values inside parentheses.

colour_tuple = ('blue', 'green', 'red')
print(colour_tuple[1]) # prints the 2nd entry in the tuple
green
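
Because tuples are immutable, trying to change one of their elements raises an error. As a minimal sketch (the final line of the error message is shown; the full traceback in your notebook will be longer):

colour_tuple[1] = 'yellow' # trying to modify a tuple is not allowed
TypeError: 'tuple' object does not support item assignment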

Dictionary

A dictionary works a lot like a list but it contains ‘key/value’ pairs. A key acts as a name for a value (which can itself be a collection of values) in the dictionary.

In the example below the values are integers (e.g. counts). So a dictionary could be used to store the number of times (the value) a word (the key) occurs in a text.

pets = {'cats':45, 'dogs':24, 'mice':33}
print(pets['cats']) # prints the value of the key 'cats'
45

We can use a for loop to print the keys and values of a dictionary:

for key, value in pets.items():
    print(key, value)
cats 45
dogs 24
mice 33
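
As a small illustration of how a dictionary can be used to store word counts, the following sketch counts how often each token appears in the sentence list created earlier (every count is 1 here because no token is repeated):

word_counts = {}
for token in sentence:
    if token in word_counts:
        word_counts[token] = word_counts[token] + 1 # seen before: increase the count
    else:
        word_counts[token] = 1 # first time we see this token
print(word_counts['happy'])
1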

Task 1: Printing text

Print a bit of text of your choice using print()

Answer

print("Humpty Dumpty sat on the wall")
Humpty Dumpty sat on the wall

Task 2: Create a list

Create a list containing different first names, iterate through them

Answer

names = ['Mary', 'John', 'Bob']
for name in names:
  print(name)
Mary
John
Bob

Key Points

  • Use name = value to assign a value to a variable with a specific name in order to record it in memory

  • Use the print(variable) function to print the value of the variable

  • Create a list by assigning it comma-separated values inside square brackets (list_name = ['value1','value2','value3']) and use a for loop to iterate through each value of the list

Tokenising Text

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is tokenisation?

  • How can a string of raw text be tokenised?

Objectives
  • Learn how to tokenise text

Tokenising text

But first … importing packages

Python comes with a selection of pre-written code that can be used. This comes in the form of in-built functions and a library of packages containing modules. We have already used the in-built function print(). In-built functions are available as soon as you start Python. There is also a (software) library of modules that contain other functions, but these modules need to be imported.

For this course we need to import a few libraries into Python. To do this, we need to use the import command.

NLTK is the tool which we’ll be using to do much of the text processing in this workshop so we need to run import nltk. We will also use numpy to represent information in arrays and matrices, string to process some strings and matplotlib to visualise the output.

If there is a problem importing any of these modules you may need to revisit the appropriate install in the prerequisites list.

import nltk
import numpy
import string
import matplotlib.pyplot as plt

Tokenising a string

In order to process text we need to break it down into tokens. As we explained at the start, a token is a letter, word, number, or punctuation which is contained in a string.

To tokenise text we first need to import the word_tokenize method from NLTK’s tokenize package, which allows us to do this without writing the code ourselves.

from nltk.tokenize import word_tokenize

We will also download a specific tokeniser that NLTK uses as default. There are different ways of tokenising text and today we will use NLTK’s in-built punkt tokeniser by calling:

nltk.download('punkt')

Now we can assign some text to a string variable and tokenise it. We will save the tokenised output in a list using the humpty_tokens variable, which we can then inspect, for example by looking at a slice of it.

humpty_string = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall; All the king's horses and all the king's men couldn't put Humpty together again."
humpty_tokens = word_tokenize(humpty_string)
# Show first 10 entries of the tokens list
humpty_tokens[0:10]
['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall', ',', 'Humpty', 'Dumpty', 'had']

As you can see, some of the words are uppercase and some are lowercase. To further analyse the data, for example counting the occurrences of a word, we need to normalise the data and make it all lowercase.

You can lowercase the strings in the list by going through it and calling the .lower() method on each entry. The code below does this using a list comprehension, a compact way of looping through each word in the list and building a new list.

lower_humpty_tokens = [word.lower() for word in humpty_tokens]
# Show first 6 entries of the lowercased tokens list
lower_humpty_tokens[0:6]
['humpty', 'dumpty', 'sat', 'on', 'a', 'wall']
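
If you prefer, the same result can be produced with an explicit for loop and the append() method introduced earlier:

lower_humpty_tokens = []
for word in humpty_tokens:
    lower_humpty_tokens.append(word.lower()) # lowercase each token and add it to the new list
lower_humpty_tokens[0:6]
['humpty', 'dumpty', 'sat', 'on', 'a', 'wall']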

Task: Printing token in list

Print the 13th token of the nursery rhyme (remember that a list index starts with 0).

Answer

print(lower_humpty_tokens[12])
fall

Key Points

  • Tokenisation means to split a string into separate words and punctuation, for example to be able to count them.

  • Text can be tokenised using a tokeniser, e.g. the punkt tokeniser in NLTK.

Pre-processing Data Collections

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I load a file and tokenise it?

  • How can I load a text collection made up of multiple text files and tokenise them?

Objectives
  • Learn how to tokenise a text file and a collection of text files

Data Preparation

Text data comes in different forms. You might want to analyse a document in one file or an entire collection of documents (a corpus) stored in multiple files. In this part of the lesson we will show you how to load a single document and how to load the text of an entire corpus into Python for further analysis.

Download some data

Firstly, please download a dataset and make a note of where it is saved on your computer. We need the path to the dataset in order to load and read it for further processing.

We will use the Medical History of British India collection provided by the National Library of Scotland as an example:

This dataset forms the first half of the Medical History of British India collection, which itself is part of the broader India Papers collection held by the Library. A Medical History of British India consists of official publications varying from short reports to multi-volume histories related to disease, public health and medical research between circa 1850 and 1950. These are historical sources for a period which witnessed the transition from a humoral to a biochemical tradition based on laboratory science, and they document the important breakthroughs in bacteriology, parasitology and the development of vaccines in a colonial context.

This collection has been made available as part of NLS’s DataFoundry platform which provides access to a number of their digitised collections.

We are only interested in the text of the Medical History of British India collection for this course, so at the bottom of the website download the “Just the text” data, or download it directly here.

Note that this dataset requires approx. 120 MB of free file space on your computer once it has been unzipped. Most computers automatically uncompress .zip files such as the one you have downloaded. If your computer does not do that, right-click on the file and click on uncompress or unzip.

You should be left with a folder called nls-text-indiaPapers containing all the .txt files for this collection. Please check that you have that on your computer and find out what its path is. In my case it is /Users/balex/Downloads/nls-text-indiaPapers/.

Loading and tokenising a single document

You can use the open() function to open one file in the Medical History of British India corpus. You need to specify the path to a file in the downloaded dataset and the mode of opening it (‘r’ for read). The path will be different to the one below depending on where you saved the data on your computer.

The read() function is used to read the file. The file’s content (the text) is then stored as a string variable called india_raw.

You can then tokenise the text and convert it to lowercase. You can check it has worked by printing out a slice of the list lower_india_tokens.

file = open('/Users/balex/Downloads/nls-text-indiaPapers/74457530.txt','r')  # replace the path with the one on your computer
india_raw = file.read()
india_tokens = word_tokenize(india_raw)
lower_india_tokens = [word.lower() for word in india_tokens]
lower_india_tokens[0:10]
['no', '.', '1111', '(', 'sanitary', ')', ',', 'dated', 'ootacamund', ',']

Loading and tokenising a corpus

We can do the same for an entire collection of documents (a corpus). Here we choose a collection of raw text documents in a given directory. We will use the entire Medical History of British India collection as our dataset.

To read the text files in this collection we can use the PlaintextCorpusReader class provided in the corpus package of NLTK. You need to specify the collection directory name and a wildcard for which files to read in the directory (e.g. .* for all files) and the text encoding of the files (in this case latin1). Using the words() method provided by NLTK, the text is automatically tokenised and stored in a list of words. As before, we can then lowercase the words in the list.

from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/balex/Downloads/nls-text-indiaPapers/'
wordlists = PlaintextCorpusReader(corpus_root, '.*', encoding='latin1')
corpus_tokens = wordlists.words()
print(corpus_tokens[:10])
['No', '.', '1111', '(', 'Sanitary', '),', 'dated', 'Ootacamund', ',', 'the']
lower_corpus_tokens = [str(word).lower() for word in corpus_tokens]
lower_corpus_tokens[0:10]

Task 1: Print slice of tokens in list

Print out a larger slice of the list of tokens in the Medical History of British India collection, e.g. the first 30 tokens.

Answer

print(corpus_tokens[:30])

Task 2: Print slice of lowercase tokens in list

Print out the same slice but for the lower-cased version.

Answer

print(lower_corpus_tokens[0:30])

Key Points

  • To open and read a file on your computer, the open() and read() functions can be used.

  • To read an entire collection of text files you can use the PlaintextCorpusReader class provided by NLTK and its words() function to extract all the words from the text in the collection.

Tokens in Context: Concordance Lists

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is a concordance list?

  • How can a concordance list be created for a particular search term?

Objectives
  • Understand what a concordance list is.

  • Learn how to create one.

Concordance list for a text collection

Next, we will display concordances for a particular token, i.e. all contexts a particular token appears in. We can do this using the Text class in NLTK’s text package. We can represent the list of lowercased tokens from the document loaded previously using the Text class. The concordance list of a token can then be displayed using the concordance() method of this class, as shown below.

from nltk.text import Text
t = Text(lower_india_tokens)
t.concordance('woman')
Displaying 20 of 20 matches:
s of age , a sweeper , who married a woman who had leprosy , and at the age of
e of sitabu , aged 40 , a muhammadan woman . her grand- father and father were
ung man deliberately married a leper woman , and became himself a leper at the
contrary . in no . 6 a man marries a woman whose grandfather and father had bee
 lepers . in no . 10 a man marries a woman whose father had died of leprosy . i
applies to these cases . in no . 2 a woman marries a man whose father and elder
n in the case of a man who marries a woman of notoriously leper family . in no
toriously leper family . in no . 5 a woman marries a man whose elder brother wa
d continued to cohabit with a native woman after she had been attacked with lep
isen from intermarriage of a man and woman in both of whom leprosy was heredita
s a leper ; he is now married to the woman , and they both live in the asylum .
een accompanied by a healthy looking woman , and by this means , although all h
editary transmission . in one case a woman got the disease about two years afte
 passed their thirtieth year , one a woman about 25 years of age , and the seve
 fracture of femur in middle third . woman well nourished and skin healthy , no
 ; had a brother with same disease . woman recovered and able to move about ; o
re or less related to each other . a woman got leprosy first from a leprous hus
s village assured me that before the woman returned home after her husband 's d
against it . this case was that of a woman with two leprous children , aged abo
ions had been lepers , who married a woman with tubercular leprosy , he being a

In the output for the next bit of code, which creates a concordance list for the word “he”, we can see that there are many more results in the list than displayed on screen (Displaying 25 of 170 matches). The concordance() method only prints the first 25 results by default (or fewer if there are fewer than 25 matches).

t = Text(lower_india_tokens)
t.concordance('he')
Displaying 25 of 170 matches:
leprosy treated by gurjun oil , which he was able to watch for a length of tim
 diminished . during these two months he gained three pounds in weight , which
does not seem much , considering that he did no work and was fairly well fed o
se from jail on the 23rd january 1876 he was again suffering from the sores th
n 5th and died on 20th october 1875 . he was seriously ill when he was brought
ober 1875 . he was seriously ill when he was brought to the hospital , and cou
itted on the 8th september 1875 , and he went home of his own accord on 20th d
is own accord on 20th december 1875 . he was much improved under treat- ment b
evalence of leprosy in the district , he had had but very few opportunities of
even half this number . the natives , he says , call every chronic skin diseas
in the legs , the feet and the ears . he has perfect taste , hearing , sight a
te laboured under it . the leper says he was quite free from leprosy until he
 he was quite free from leprosy until he associated with this man and took din
prosy of 15 years ' standing . states he had first gonorrha , then syphillis ,
been affected 6 years ; was well when he married . had two children who died ,
ve of the territory beyond the hubb . he had lost some parts of his hands and
 feet previous to his incarceration . he was treated with large doses of iodid
ease be removed . dr. bloomfield says he sent two interesting specimens of thi
ant medical college museum , but that he never heard of them after , nor did h
e never heard of them after , nor did he discover them in the museum when he v
d he discover them in the museum when he visited it afterwards . should the di
er and elder sister were lepers , and he himself became a leper at 30 years of
. his elder brother was a leper , and he himself became a leper at 32 . his wi
one year after she was affected , and he suffered from leprosy . no . 7.-the c
ied . afterwards , at the age of 43 , he him- self was attacked with leprosy .

You can specify the number of lines using an additional lines parameter, e.g.:

t.concordance('he',lines=170)

Task: Create a new concordance list

Create a concordance list for a different search term, e.g. the word “great” or choose your own.

Answer

t.concordance('great')

Key Points

  • A concordance list is a list of all contexts in which a particular token appears in a corpus or text.

  • A concordance list can be created using the concordance() method of the Text class in NLTK.

Searching Text using Regular Expressions

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I search for tokens in text more flexibly? For example, to find all mentions of woman and women.

Objectives
  • Learn how to search for tokens in a data set using regular expressions

Searching text using regular expressions

This episode provides a taster to the use of regular expression searching. For a more detailed overview and use of regular expressions, we refer to the Programming Historian lesson Understanding Regular Expressions.

You may want to look for the word “women” as well as “woman” in a corpus simultaneously, to find out how many times they occur. You can do this using regular expressions. Regular expressions define a search term that can have some variety in it. To use regular expression search in Python, you first need to import the re module.

import re

The next line of code is a bit more complex, combining a for loop and an if statement in a single list comprehension, so let’s explain it in a bit more detail.

The code goes through each element in the lower_india_tokens list using a for loop, assigning each element in turn to the variable w. The if statement then checks whether the string assigned to w matches “woman” or “women”. If it does, the word is added to the womaen_strings list.

The regular expression search string ^wom[ae]n$ contains square brackets to indicate that the letter to match could be “a” or “e” (so either woman or women). ^ means start of string and $ means end of string, so the search is for the exact tokens “women” or “woman” but not for words containing them.

You can see all the strings found in the corpus assigned to the womaen_strings list by printing it.

womaen_strings=[w for w in lower_india_tokens if re.search('^wom[ae]n$', w)]
print(womaen_strings)
['women', 'women', 'women', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'women', 'woman', 'women', 'women', 'woman', 'women', 'woman', 'women', 'woman', 'woman', 'woman', 'woman', 'women', 'women', 'women', 'women', 'women', 'woman', 'woman', 'women', 'women', 'women', 'women', 'women', 'women', 'woman', 'woman', 'women', 'women', 'women']

You can see how the search results change if you remove the ^ and $ characters from the regular expression.

Now that the results are stored in a list you can count them. We will see how to do that in the next section of the course.

womaen_strings=[w for w in lower_india_tokens if re.search('wom[ae]n', w)]
print(womaen_strings)
['women', 'women', 'women', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'women', 'woman', 'women', 'women', 'woman', 'women', 'woman', 'women', 'washerwoman', 'woman', 'woman', 'woman', 'woman', 'women', 'women', 'women', 'women', 'women', 'woman', 'woman', 'women', 'women', 'women', 'women', 'women', 'women', 'woman', 'woman', 'women', 'women', 'women']

Regular expressions can be very specific and we will not cover them in detail here, but they are very powerful for carrying out complex searches, e.g. finding all tokens that start with “a” and are 13 characters long (. matches any single character), or finding all tokens which are 13 characters long but do not start with a lowercase letter ([^a-z] matches any character that is not one of the letters a-z).

[w for w in lower_india_tokens if re.search('^a............$', w)]
['antiscorbutic',
 'approximately',
 'approximately',
 'agriculturist',
 'ages.-chiefly',
 'approximately',
 'accommodation']
[w for w in lower_india_tokens if re.search('^[^a-z]............$', w)]
['24-pergunnahs',
 '19.-commenced',
 '24-pergunuahs',
 '24-pergunnahs',
 '24-pergunnahs',
 '1875.-patches']

Task: Search for specific tokens using regular expressions

Search for all tokens starting with the string “man” or “men”

Answer

maen_strings=[w for w in lower_india_tokens if re.search('^m[ae]n', w)]
print(maen_strings)

Key Points

  • To search for tokens in text using regular expressions you need the re module and its search function.

  • You will need to learn how to construct regular expressions. E.g. you can use . to match any single character and * to match zero or more of the preceding character, or you can use a range of letters, e.g. [ae] (for a or e) or [a-z] (for a to z), or numbers, e.g. [0-9] (for any single digit), etc. Regular expressions can be very powerful if used correctly. To find all mentions of the words woman or women you need to use the following regular expression: wom[ae]n.

Counting Tokens in Text

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I count tokens in text?

Objectives
  • Learn how to count tokens in text.

Counting tokens in text

You can also do other useful things like count the number of tokens in a text, determine the number and percentage count of particular tokens and plot the count distributions as a graph. To do this we have to import the FreqDist class from the NLTK probability package. When calling this class, a list of tokens from a text or corpus needs to be specified as a parameter in brackets.

from nltk.probability import FreqDist
fdist = FreqDist(lower_india_tokens)
fdist
FreqDist({'the': 5923, ',': 5332, '.': 5258, 'of': 4062, 'and': 2118, 'in': 2117, 'to': 1891, 'is': 1124, 'a': 1049, 'that': 816, ...})

The results show the most frequent tokens and their frequency counts.
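
If you want to list just the top few tokens and their counts explicitly, FreqDist also provides a most_common() method (shown here for the top five tokens of the document loaded above):

fdist.most_common(5)
[('the', 5923), (',', 5332), ('.', 5258), ('of', 4062), ('and', 2118)]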

We can count the total number of tokens in a corpus using the N() method:

fdist.N()
93571

And count the number of times a token appears in a corpus:

fdist['she']
26

We can also determine the relative frequency of a token in a corpus, i.e. what proportion of the corpus a token makes up:

fdist.freq('she')
0.0002778638680787851
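
The relative frequency is simply the count of the token divided by the total number of tokens, so the same value can be computed manually as a quick check:

fdist['she'] / fdist.N()
0.0002778638680787851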

If you have a list of tokens created using regular expression matching as in the previous section and you’d like to count them then you can also simply count the length of the list:

len(womaen_strings)
43

Frequency counts of tokens are useful for comparing different corpora in terms of the occurrences of different words or expressions, for example to see if a word appears much more rarely in one corpus than in another. Counts of tokens, documents and an entire corpus can also be used to compute simple pairwise document similarity between two documents (e.g. see Jana Vembunarayanan’s blogpost for a hands-on example of how to do that, and the small sketch below).
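
As a taster of what such a calculation might look like, here is a minimal sketch of pairwise cosine similarity computed from the token counts of two short example strings (the example strings and the cosine_similarity() helper are ours, purely for illustration, and are not part of the collection):

import math
from collections import Counter
from nltk.tokenize import word_tokenize

def cosine_similarity(counts_a, counts_b):
    # dot product over the tokens that appear in both documents
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    # magnitude (length) of each count vector
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

doc_a = Counter(word_tokenize("all the king's horses and all the king's men"))
doc_b = Counter(word_tokenize("all the king's men"))
cosine_similarity(doc_a, doc_b)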

Key Points

  • To count tokens, one can make use of NLTK’s FreqDist class from the probability package. The N() method can then be used to count how many tokens a text or corpus contains.

  • Counts for a specific token can be obtained using fdist["token"].

Visualising Frequency Distributions

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I draw a frequency distribution of the most frequent words in a collection?

  • How can I visualise this data as a word cloud?

Objectives
  • Learn how to draw frequency distributions of tokens in text.

  • Learn how to create a word cloud showing the most frequent words in the text.

Visualising Frequency distributions of tokens in text

Graph

The plot() method can be called to draw the frequency distribution as a graph for the most common tokens in the text.

fdist.plot(30,title='Frequency distribution for 30 most common tokens in our text collection')

Frequency distribution for 30 most common tokens in our text collection

You can see that the distribution contains a lot of non-content words like “the”, “of”, “and” etc. (we call these stop words) and punctuation. We can remove them before drawing the graph. We need to import stopwords from the corpus package to do this. The list of stop words is combined with a list of punctuation and a list of single digits using + signs into a new list of items to be ignored.

nltk.download('stopwords')
from nltk.corpus import stopwords
remove_these = set(stopwords.words('english') + list(string.punctuation) + list(string.digits))
filtered_text = [w for w in lower_india_tokens if not w in remove_these]
fdist_filtered = FreqDist(filtered_text)
fdist_filtered.plot(30,title='Frequency distribution for 30 most common tokens in our text collection (excluding stopwords and punctuation)')

Frequency distribution for 30 most common tokens in our text collection (excluding stopwords and punctuation).

Note

While it makes sense to remove stop words for this type of frequency analysis, it is essential to keep them in the data for other text analysis tasks. Retaining the original text is crucial, for example, when deriving part-of-speech tags for a text or for recognising names in a text. We will look at these types of text processing on day 2 of this course.

Word cloud

We can also present the filtered tokens as a word cloud, which gives us an overview of the corpus, using the WordCloud().generate_from_frequencies() method. The input to this method is a frequency dictionary of all tokens and their counts in the text. This needs to be created first by importing the Counter class from Python’s collections module and creating a dictionary using the filtered_text variable as input.

We generate the WordCloud from the frequency dictionary and plot the figure at a chosen size. We can show the plot using plt.show().

from collections import Counter
dictionary=Counter(filtered_text)
import matplotlib.pyplot as plt
from wordcloud import WordCloud

cloud = WordCloud(max_font_size=80,colormap="hsv").generate_from_frequencies(dictionary)
plt.figure(figsize=(16,12))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()


Shaped word cloud

And now a shaped word cloud for a bit of fun, if there is time at the end of day 1. This will present your word cloud in the shape of a given image.

You need a shape file which we provide for you in the form of the medical symbol:

The mask image needs to have a transparent background so that only the black shape is used as a mask for the word cloud.

To display the shaped word cloud you need to import the Image module from PIL as well as numpy. The image first needs to be opened and converted into a numpy array which we call med_mask. A customised colour map (cmap) is created to present the words in black and dark grey. Then the word cloud is created with a white background, the mask and the colour map set as parameters, and generated from the dictionary containing the number of occurrences for each word.

from PIL import Image
import numpy as np
med_mask = np.array(Image.open("medical.png"))

# Custom Colormap
from matplotlib.colors import LinearSegmentedColormap
colors = ["#000000", "#111111", "#101010", "#121212", "#212121", "#222222"]
cmap = LinearSegmentedColormap.from_list("mycmap", colors)

wc = WordCloud(background_color="white", mask=med_mask, colormap=cmap)
wc.generate_from_frequencies(dictionary)
plt.figure(figsize=(16,12))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")

Task 1: Filter the frequency distribution further

Change the last frequency distribution plot so that it does not show any of the following strings: “…”, “1876”, “1877”, “one”, “two”, “three”. Consider adding them to the remove_these list. Hint: You can create a list of the strings of all numbers from 0 to 999999 by calling list(map(str, range(0,1000000)))

Answer

numbers=list(map(str, range(0,1000000)))
otherTokens=["...", "one", "two", "three"]
remove_these = set(stopwords.words('english') + list(string.punctuation) + numbers + otherTokens)
filtered_text = [w for w in lower_india_tokens if not w in remove_these]
fdist_filtered = FreqDist(filtered_text)
fdist_filtered.plot(30,title='Frequency distribution for 30 most common tokens in our text collection (excluding stopwords, punctuation, numbers etc.)')

Frequency distribution for 30 most common tokens in our text collection (excluding stopwords, punctuation, numbers etc.)

Task 2: Redraw word cloud

Redraw the word cloud with the updated filtered_text variable (after removing the strings in Task 1).

Answer

dictionary=Counter(filtered_text)
import matplotlib.pyplot as plt
from wordcloud import WordCloud
cloud = WordCloud(max_font_size=80,colormap="hsv").generate_from_frequencies(dictionary)
plt.figure(figsize=(16,12))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

New word cloud

Key Points

  • A frequency distribution plot of the most common tokens can be drawn using the plot() method of a FreqDist object.

  • In this episode you have also learned how to clean data by removing stopwords and other types of tokens from the text.

  • A word cloud can be used to visualise tokens in text and their frequency in a different way.

Lexical Dispersion Plot

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I measure how frequently a word appears across the parts of a corpus?

  • How can I plot the occurrences of a word against its offset (in words) from the beginning of the corpus?

Objectives
  • Learn how to plot the occurrences of specific words as they appear across a document or a corpus.

  • We will use the US Presidential Inaugural Addresses, which are provided with NLTK.

Lexical Dispersion Plot

We can plot the lexical dispersion of particular tokens. Lexical dispersion is a measure of how frequently a word appears across the parts of a corpus. This plot marks each occurrence of a word by how many words from the beginning of the corpus it appears (its word offset). This is particularly useful for a corpus that covers a longer time period and for which you want to analyse how specific terms were used more or less frequently over time.

To create a lexical dispersion plot, you will first load and import a different corpus, the inaugural corpus, which contains all US Presidential Inaugural Addresses and is provided with NLTK.

nltk.download('inaugural')
from nltk.corpus import inaugural
from nltk.text import Text
inaugural_tokens=inaugural.words()
inaugural_texts = Text(inaugural_tokens)
[nltk_data] Downloading package inaugural to /Users/<USERNAME>/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!

To create the lexical dispersion plot for this corpus you also need to load dispersion_plot from the nltk.draw.dispersion package. You can then call the dispersion_plot method given a set of parameters, including the target words you want to plot across the corpus, whether this should be done case-sensitively, and specifying the title of the plot.

from nltk.draw.dispersion import dispersion_plot

# the following command can be used to increase the size of the plot using width and height specifications
plt.figure(figsize=(12, 9))
targets=['great','good','tax','work','change']
dispersion_plot(inaugural_texts, targets, ignore_case=True, title='Lexical Dispersion Plot')

Key Points

  • Lexical dispersion is a visualisation that allows us to see where a particular term appears across a document or set of documents

  • We used NLTK’s dispersion_plot.

Plotting Frequency Over Time

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I extract and plot the frequency of specific terms over time?

Objectives
  • We will use NLTK’s ConditionalFreqDist class to extract the frequency of defined words.

  • We will use the US Presidential Inaugural Addresses, which are provided with NLTK.

Plotting frequency over time

Similarly to lexical dispersion, you can also plot the frequency of terms over time. This is similar to the Google n-gram visualisation for the Google Books corpus, but here we will show you how to do something comparable for your own corpus.

You first need to import NLTK’s ConditionalFreqDist class from the nltk.probability package. To generate the graph, you have to specify the list of words to be plotted (see targets) and the x-axis labels (in this case the year each inaugural address was given, which appears at the start of each file name: fileid[:4]).

The plot is created by:

  • looping through each file (speech)
  • then looping through each word in each speech
  • then looping though the list of specified target words and
  • checking if each target word matches the start of each word in the speeches (after being lower-cased).

The ConditionalFreqDist object (cfd) stores the number of times each of the target words appears in each of the speeches, and the plot() method is used to visualise the graph.

from nltk.probability import ConditionalFreqDist

# type this to set the figure size
plt.rcParams["figure.figsize"] = (12, 9)

targets=['great','good','tax','work','change', 'wom[ae]n']

cfd = nltk.ConditionalFreqDist((target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in targets
    if word.lower().startswith(target))
cfd.plot()

Task 1: See how the plot changes when choosing different target words.

Answer

plt.rcParams["figure.figsize"] = (12, 9)
targets=['god','work']
cfd = nltk.ConditionalFreqDist((target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in targets
    if word.lower().startswith(target))
cfd.plot()

Task 2: Use regular expression searching to search for target words exactly instead of matching on words that start with the target words.

Answer

plt.rcParams["figure.figsize"] = (12, 9)
targets=['^m[ea]n$']
cfd = nltk.ConditionalFreqDist((target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in targets
    if re.search(target, word.lower()))
cfd.plot()

Key Points

  • Here we extracted the terms and the years from the files using NLTK’s ConditionalFreqDist class from the nltk.probability package

  • We then plotted these on a graph to visualise how their use changes over time