Introduction
Jupyter Notebook
Python Fundamentals
Use name = value to assign a value to a variable with a specific name so that it is recorded in memory.
Use the print() function to display the value of a variable, e.g. print(variable).
Create a list by assigning it several values (list_name = ['value1', 'value2', 'value3']) and use a for loop to iterate over each value in the list.
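A minimal sketch of these three ideas (the variable and list names here are just examples):

greeting = "Hello, text mining!"   # assign a value to a named variable
print(greeting)                    # print the value of the variable

# create a list of values and loop over each one
names = ['Austen', 'Dickens', 'Woolf']
for name in names:
    print(name)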
Tokenising Text
Tokenisation means splitting a string into separate words and punctuation marks, for example so that they can be counted.
Text can be tokenised using a tokeniser, e.g. the punkt tokeniser in NLTK.
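A minimal sketch, assuming NLTK is installed and the punkt data has been downloaded; the sentence is an invented example:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')   # the punkt tokeniser models used by word_tokenize

text = "It's 1816, and the weather is dreadful."
tokens = word_tokenize(text)
print(tokens)        # separate words and punctuation
print(len(tokens))   # how many tokens the string contains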
Pre-processing Data Collections
To open and read a file on your computer, the open() and read() functions can be used.
To read an entire collection of text files you can use the PlaintextCorpusReader class provided by NLTK and its words() method to extract all the words from the texts in the collection.
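A minimal sketch of both approaches; the directory and file names are placeholders for your own collection:

from nltk.corpus import PlaintextCorpusReader

# read a single file
with open('corpus/example.txt', encoding='utf-8') as f:
    text = f.read()

# read every .txt file in a directory as one collection
corpus = PlaintextCorpusReader('corpus/', r'.*\.txt')
words = corpus.words()   # all word tokens across the collection
print(len(words), words[:10])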
Tokens in Context: Concordance Lists
Searching Text using Regular Expressions
To search for tokens in text using regular expressions you need the re module and its search function.
You will need to learn how to construct regular expressions, e.g. using character classes such as [ae] (a or e), [a-z] (any letter from a to z) or [0-9] (any single digit), and quantifiers such as * (zero or more repetitions of the preceding pattern). Regular expressions can be very powerful if used correctly. For example, to find all mentions of the words woman or women you can use the regular expression wom[ae]n.
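A minimal sketch using an invented token list:

import re

tokens = ['The', 'women', 'spoke', 'to', 'a', 'woman', 'nearby']

# keep every token that matches the pattern wom[ae]n
matches = [t for t in tokens if re.search(r'wom[ae]n', t)]
print(matches)   # ['women', 'woman']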
Counting Tokens in Text
To count tokens, one can make use of NLTK’s FreqDist class from the nltk.probability package. The N() method can then be used to count how many tokens a text or corpus contains.
Counts for a specific token can be obtained using fdist["token"].
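A minimal sketch with an invented token list:

from nltk.probability import FreqDist

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
fdist = FreqDist(tokens)

print(fdist.N())              # total number of tokens: 6
print(fdist['the'])           # count for a specific token: 2
print(fdist.most_common(3))   # the three most frequent tokens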
Visualising Frequency Distributions
A plot of a frequency distribution can be created using FreqDist’s plot() method.
In this episode you have also learned how to clean data by removing stopwords and other types of tokens from the text.
A word cloud can be used to visualise tokens in text and their frequency in a different way.
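A minimal sketch of cleaning, plotting and a word cloud, assuming matplotlib and the wordcloud package are installed; the token list is invented:

import nltk
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from wordcloud import WordCloud

nltk.download('stopwords')

tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship', 'whale']

# remove stopwords (the, and, ...) before counting
stops = set(stopwords.words('english'))
cleaned = [t for t in tokens if t not in stops]

fdist = FreqDist(cleaned)
fdist.plot(10)   # frequency distribution plot of the top 10 tokens

# word cloud of the same counts
cloud = WordCloud().generate_from_frequencies(fdist)
plt.figure()
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()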
Lexical Dispersion Plot
Plotting Frequency Over Time
Here we extracted the terms and the years from the files and counted them using NLTK’s ConditionalFreqDist class from the nltk.probability package.
We then plotted these counts on a graph to visualise how the use of the terms changes over time.
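A minimal sketch with invented (term, year) pairs standing in for what would be extracted from the files:

from nltk.probability import ConditionalFreqDist

pairs = [('liberty', 1861), ('liberty', 1865), ('union', 1861),
         ('union', 1861), ('liberty', 1933), ('union', 1933)]

# one frequency distribution of years per term
cfd = ConditionalFreqDist(pairs)
cfd.plot()   # one line per term, year on the x-axis, frequency on the y-axis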
Collocations
We used NLTK’s BigramAssocMeasures() and BigramCollocationFinder to find words that commonly occur together in this document set.
We then scored these collocations using bigram_measures.likelihood_ratio.
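A minimal sketch on an invented token list:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ['machine', 'learning', 'is', 'fun', 'and', 'machine',
          'learning', 'is', 'useful']

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# score candidate bigrams with the likelihood ratio and show the strongest
scored = finder.score_ngrams(bigram_measures.likelihood_ratio)
print(scored[:3])
print(finder.nbest(bigram_measures.likelihood_ratio, 3))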
Part-of-Speech Tagging Text
We used NLTK’s part-of-speech tagger, averaged_perceptron_tagger, to label each word with its part of speech, tense, number (plural/singular) and case.
We used the text from the US Presidential Inaugural speeches, in particular that of the most recent speech, given by Trump.
We then extracted all nouns, both plural (NNS) and singular (NN).
We then visualised the nouns from these speeches using a frequency distribution plot and a word cloud.
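A minimal sketch, assuming the NLTK inaugural corpus and tagger data have been downloaded; the fileid '2017-Trump.txt' is an assumption, so check inaugural.fileids() for what is available:

import nltk
from nltk.corpus import inaugural
from nltk.probability import FreqDist

nltk.download('inaugural')
nltk.download('averaged_perceptron_tagger')

tokens = list(inaugural.words('2017-Trump.txt'))
tagged = nltk.pos_tag(tokens)   # (word, tag) pairs

# keep singular (NN) and plural (NNS) nouns and count them
nouns = [word for word, tag in tagged if tag in ('NN', 'NNS')]
fdist = FreqDist(nouns)
fdist.plot(20)   # the 20 most frequent nouns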