Regular Expressions
Last updated on 2023-04-24 | Edit this page
Overview
Questions
- How can you imagine using regular expressions in your work?
Objectives
- Use regular expressions in searches
Regular Expressions
One of the reason why I have stressed the value of consistent and predictable directory and filenaming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file name. So, for example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from ‘2014’, then you can. Or if you have ‘journal’ somewhere in a filename when you have data about journals, you can use the computer to select just those files then do something with them. Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data within files.
A powerful means of doing this selecting based on file
characteristics is to use regular expressions, often abbreviated to
regex. A regular expression is a sequence of characters that define a
search pattern, mainly for use in pattern matching with strings, or
string matching, i.e. “find and replace”-like operations. Regular
expressions are typically surrounded by /
characters,
though we will (mostly) ignore those for ease of comprehension. Regular
expressions will let you:
- Match on types of character (e.g. ‘upper case letters’, ‘digits’, ‘spaces’, etc.)
- Match patterns that repeat any number of times
- Capture the parts of the original string that match your pattern
As most computational software has regular expression functionality built in and as many computational tasks in libraries are built around complex matching, it is good place for Library Carpentry to start in earnest.
A very simple use of a regular expression would be to locate the same
word spelled two different ways. For example the regular expression
organi[sz]e
matches both “organise” and “organize”.
But it would also match reorganise
,
reorganize
, organises
, organizes
,
organised
, organized
, et cetera, because we’ve
not specified the beginning or end of our string. So there are a bunch
of special syntax that help us be more precise.
The first we’ve seen: square brackets can be used to define a list or range of characters to be found. So:
-
[ABC]
matches A or B or C -
[A-Z]
matches any upper case letter -
[A-Za-z0-9]
matches any upper or lower case letter or any digit (note: this is case-sensitive)
Then there are:
-
.
matches any character -
\d
matches any single digit -
\w
matches any part of word character (equivalent to[A-Za-z0-9]
) -
\s
matches any space, tab, or newline -
\
NB: this is also used to escape the following character when that character is a special character. So, for example, a regular expression that found.com
would be\.com
because.
is a special character that matches any character. -
^
asserts the position at the start of the line. So what you put after it will only match if they are the first characters of a line. -
$
asserts the position at the end of the line. So what you put before it will only match if they are the last characters of a line. -
\b
adds a word boundary. Putting this either side of a word stops the regular expression matching longer variants of words. So:- the regular expression
foobar
will matchfoobar
and find666foobar
,foobar777
,8thfoobar8th
et cetera - the regular expression
\bfoobar
will matchfoobar
and findfoobar777
- the regular expression
foobar\b
will matchfoobar
and find666foobar
- the regular expression
\bfoobar\b
will findfoobar
- the regular expression
So, what is ^[Oo]rgani.e\b
going to match.
Using special characters in regular expression matches
Can you guess what the regular expression ^[Oo]rgani.e\b
will match?
organise
organize
Organise
Organize
organife
Organike
Or, any other string that starts a line, begins with a letter
o
in lower or capital case, proceeds with
rgani
, has any character in the 7th position, and ends with
the letter e
.
Other useful special characters are:
-
*
matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc. -
+
matches the preceding element one or more times. For example, ab+c matches “abc”, “abbbc” but not “ac”. -
?
matches when the preceding character appears zero or one time. -
{VALUE}
matches the preceding character the number of times define by VALUE; ranges can be specified with the syntax{VALUE,VALUE}
-
|
means or.
So, what are these going to match?
^[Oo]rgani.e\w*
Can you guess what the regular expression
^[Oo]rgani.e\w*
will match?
organise
Organize
organifer
Organi2ed111
Or, any other string that starts a line, begins with a letter
o
in lower or capital case, proceeds with
rgani
, has any character in the 7th position, follows with
letter e
and zero or more characters from the range
[A-Za-z0-9]
.
[Oo]rgani.e\w+$
Can you guess what the regular expression
[Oo]rgani.e\w+$
will match?
organiser
Organized
organifer
Organi2ed111
Or, any other string that ends a line, begins with a letter
o
in lower or capital case, proceeds with
rgani
, has any character in the 7th position, follows with
letter e
and one or more characters from the range
[A-Za-z0-9]
.
^[Oo]rgani.e\w?\b
Can you guess what the regular expression
^[Oo]rgani.e\w?\b
will match?
organise
Organized
organifer
Organi2ek
Or, any other string that starts a line, begins with a letter
o
in lower or capital case, proceeds with
rgani
, has any character in the 7th position, follows with
letter e
, and ends with zero or one characters from the
range [A-Za-z0-9]
.
^[Oo]rgani.e\w?$
Can you guess what the regular expression
^[Oo]rgani.e\w?$
will match?
organise
Organized
organifer
Organi2ek
Or, any other string that starts and ends a line, begins with a
letter o
in lower or capital case, proceeds with
rgani
, has any character in the 7th position, follows with
letter e
and zero or one characters from the range
[A-Za-z0-9]
.
\b[Oo]rgani.e\w{2}\b
Can you guess what the regular expression
\b[Oo]rgani.e\w{2}\b
will match?
organisers
Organizers
organifers
Organi2ek1
Or, any other string that begins with a letter o
in
lower or capital case after a word boundary, proceeds with
rgani
, has any character in the 7th position, follows with
letter e
, and ends with two characters from the range
[A-Za-z0-9]
.
\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b
Can you guess what the regular expression
\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b
will match?
organise
Organi1e
Organizer
organifed
Or, any other string that begins with a letter o
in
lower or capital case after a word boundary, proceeds with
rgani
, has any character in the 7th position, and end with
letter e
, or any other string that begins with a letter
o
in lower or capital case after a word boundary, proceeds
with rgani
, has any character in the 7th position, follows
with letter e
, and ends with a single character from the
range [A-Za-z0-9]
.
This logic is super useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. Or for looking at cells in spreadsheets for certain values. Or for extracting some data from a column of a spreadsheet to make new columns. I could go on. The point is, it is super useful in many contexts. To embed this knowledge we won’t - however - be using computers. Instead we’ll use pen and paper. Work in teams of 4-6 on the exercises below. When you think you have the right answer, check it against the solution. When you finish, I’d like you to split your team into two groups and write each other some tests. These should include a) strings you want the other team to write regex for and b) regular expressions you want the other team to work out what they would match. Then test each other on the answers. If you want to check your logic, use regex101, myregexp, or regex pal regexper.com: the first three help you see what text your regular expression will match, the latter visualises the workflow of a regular expression.
Exercise
Pair up with the person next to you to work through the following problems.
Using square brackets
Can you guess what the regular expression Fr[ea]nc[eh]
will match?
French
France
Frence
Franch
This will also find words where there are characters either side of
the solutions above, such as Francer
,
foobarFrench
, and Franch911
.
Using dollar signs
Can you guess what the regular expression Fr[ea]nc[eh]$
will match?
French
France
Frence
Franch
This will also find strings at the end of a line. It will find words
where there were characters before these, for example
foobarFrench
.
Introducing options
What would match the strings French
and
France
only that appear at the beginning of a line?
^France|^French
This will also find words where there were characters after
French
such as Frenchness
.
Case insensitivity
How do you match the whole words colour
and
color
(case insensitive)?
\b[Cc]olou?r\b|\bCOLOU?R\b
/colou?r/i
In real life, you should only come across the case
insensitive variations colour
, color
,
Colour
, Color
, COLOUR
, and
COLOR
(rather than, say, coLour
). So based on
what we know, the logical regular expression is
\b[Cc]olou?r\b|\bCOLOU?R\b
. An alternative more elegant
option we’ve not discussed is to take advantage of the /
delimiters and add an ignore case flag: so /colou?r/i
will
match all case insensitive variants of colour
and
color
.
Word boundaries
How would you find the whole-word headrest
and or the
2-gram head rest
but not head rest
(that is,
with two spaces between head
and rest
?
\bhead ?rest\b
Note that although \bhead\s?rest\b
does work, it will
also match zero or one tabs or newline characters between
head
and rest
. So again, although in most real
world cases it will be fine, it isn’t strictly correct.
Matching non-linguistic patterns
How would you find a string that ends with 4 letters preceded by at least one zero?
0+[a-z]{4}\b
Matching digits
How do you match any 4 digit string anywhere?
\d{4}
Note this will also match 4 digit strings within longer strings of numbers and letters.
Matching dates
How would you match the date format dd-MM-yyyy
?
\b\d{2}-\d{2}-\d{4}\b
Depending on your data, you may chose to remove the word bounding.
Matching multiple date formats
How would you match the date format dd-MM-yyyy
or
dd-MM-yy
at the end of a string only?
\d{2}-\d{2}-\d{2,4}$
Note this will also find strings such as 31-01-198
at
the end of a line, so you may wish to check your data and revise the
expression to exclude false positives. Depending on your data, you may
chose to add word bounding at the start of the expression.
Matching publication formats
How would you match publication formats such as
British Library : London, 2015
and
Manchester University Press: Manchester, 1999
?
.* ?: .*, \d{4}
Without word boundaries you will find that this matches any text you
put before British
or Manchester
.
Nevertheless, the regular expression does a good job on the first look
up and may be need to be refined on a second depending on your data.
References
James Baker , “Preserving Your Research Data,” Programming Historian (30 April 2014), http://programminghistorian.org/lessons/preserving-your-research-data.html. The sub-sections ‘Plain text formats are your friend’ and ‘Naming files sensible things is good for you and for your computers’ are reworked from this lesson.
Owen Stephens, “Working with Data using OpenRefine”, *Overdue Ideas” (19 November 2014), http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/. The section on ‘Regular Expressions’ is reworked from this lesson developed by Owen Stephens on behalf of the British Library
Andromeda Yelton, “Coding for Librarians: Learning by Example”, Library Technology Reports 51:3 (April 2015), doi: 10.5860/ltr.51n3
Fiona Tweedie, “Why Code?”, The Research Bazaar (October 2014), http://melbourne.resbaz.edu.au/post/95320810834/why-code
Key Points
- Regular expressions are powerful tools for pattern matching