Counting Tokens in Text
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How can I count tokens in text?
Objectives
Learn how to count tokens in text.
Counting tokens in text
You can also do other useful things, such as counting the number of tokens in a text, determining the count and percentage frequency of particular tokens, and plotting the count distribution as a graph. To do this we first import the FreqDist class from NLTK's probability package. FreqDist is called with a list of tokens from a text or corpus as its argument.
from nltk.probability import FreqDist
fdist = FreqDist(lower_india_tokens)
fdist
FreqDist({'the': 5923, ',': 5332, '.': 5258, 'of': 4062, 'and': 2118, 'in': 2117, 'to': 1891, 'is': 1124, 'a': 1049, 'that': 816, ...})
The results show the most frequent tokens and their counts.
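If you are working through this episode on its own, lower_india_tokens will not yet exist, since it was built in an earlier episode. The following is a minimal, self-contained sketch using a hand-made token list (illustrative only) that shows the same steps, along with the most_common() and plot() methods for inspecting and plotting the distribution mentioned above:
from nltk.probability import FreqDist

# An illustrative token list; in the lesson this comes from a tokenised corpus.
tokens = "the woman spoke and the women listened to the story of the women".split()
fdist = FreqDist(tokens)

print(fdist.most_common(3))   # the three most frequent tokens and their counts
# fdist.plot(10)              # plot the 10 most frequent tokens (requires matplotlib)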
We can count the total number of tokens in a corpus using the N() method:
fdist.N()
93571
We can also count the number of times a particular token appears in the corpus:
fdist['she']
26
We can also determine the relative frequency of a token in the corpus, that is, what proportion of the corpus a token makes up:
fdist.freq('she')
0.0002778638680787851
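Since freq() is simply the token's count divided by the total number of tokens, the same value can be computed directly:
fdist['she'] / fdist.N()   # 26 / 93571, the same value as fdist.freq('she')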
If you have a list of tokens created using regular expression matching, as in the previous section, and you would like to count them, you can simply take the length of the list:
len(womaen_strings)
43
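For completeness, here is an illustrative, self-contained sketch of that counting step. The text and the wom[ae]n pattern below are stand-ins chosen to match the variable name, not the lesson's actual corpus or the exact code from the previous section:
import re

text = "The woman met two women, and the women spoke."
womaen_strings = re.findall(r"wom[ae]n", text.lower())

print(womaen_strings)        # ['woman', 'women', 'women']
print(len(womaen_strings))   # 3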
Frequency counts of tokens are useful for comparing different corpora in terms of the occurrences of different words or expressions, for example to see whether a word appears much more rarely in one corpus than in another. Counts of tokens, documents and an entire corpus can also be used to compute a simple pairwise similarity between two documents (see, for example, Jana Vembunarayanan's blog post for a hands-on example of how to do that).
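As a rough sketch of that idea (not the blog post's exact method), two short documents can be represented as token-count vectors and compared with cosine similarity:
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    # Represent each document as a bag of lowercase token counts
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # Dot product over the tokens the two documents share
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[tok] * counts_b[tok] for tok in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("the history of india", "a history of modern india"))   # about 0.67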
Key Points
To count tokens, one can make use of NLTK's FreqDist class from the probability package.
The N() method can then be used to count how many tokens a text or corpus contains.
Counts for a specific token can be obtained using fdist["token"].