Content from Regular Expressions
Last updated on 2024-09-30
Estimated time: 35 minutes
Overview
Questions
- How can you imagine using regular expressions in your work?
Objectives
- Identify potential use cases for regular expressions
- Recognize common regex metacharacters
- Use regular expressions in searches
Regular expressions
Regular expressions are a concept and an implementation used in many different programming environments for sophisticated pattern matching. They are an incredibly powerful tool that can amplify your capacity to find, manage, and transform data and files.
A regular expression, often abbreviated to regex, is a method of using a sequence of characters to define a search to match strings, i.e. “find and replace”-like operations. In computing, a ‘string’ is a contiguous sequence of symbols or values: for example, a word, a date, a set of numbers (e.g., a phone number), or an alphanumeric value (e.g., an identifier). A string can be any length, ranging from empty (zero characters) to one that spans many lines of text (including line break characters). The terms ‘string’ and ‘line’ are sometimes used interchangeably, even when they are not strictly the same thing.
In library searches, we are most familiar with a small part of regular expressions known as the “wild card character,” but there are many more features to the complete regular expressions syntax. Regular expressions will let you:
- Match on types of characters (e.g. ‘upper case letters’, ‘digits’, ‘spaces’, etc.).
- Match patterns that repeat any number of times.
- Capture the parts of the original string that match your pattern.
Regex can also be useful for daily work. For example, say your organization wants to change the way it displays telephone numbers on its website by removing the parentheses around the area code. Rather than searching for each specific phone number (that could take forever and be prone to error) or searching for every open parenthesis character (that could also take forever and return many false positives), you could search for the pattern of a phone number.
Regular expressions rely on the use of literal characters and metacharacters. A metacharacter is any American Standard Code for Information Interchange (ASCII) character that has a special meaning. By using metacharacters and possibly literal characters, you can construct a regex for finding strings or files that match a pattern rather than a specific string.
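To make the telephone-number example above concrete, here is a minimal sketch in Python using the standard `re` module. The sample text and the exact number format (area code in parentheses, a space, then three digits, a dash, and four digits) are assumptions chosen for illustration; real data would likely need a more forgiving pattern.

```python
import re

# Hypothetical sample text; the phone numbers are made up for illustration.
text = "Call (555) 555-0100 or (555) 555-0199 (weekdays only)."

# \( and \) match literal parentheses; the unescaped ( ) capture
# the area code and the local number so they can be reused.
pattern = re.compile(r"\((\d{3})\) (\d{3}-\d{4})")

# \1 and \2 refer back to the two captured parts.
cleaned = pattern.sub(r"\1 \2", text)
print(cleaned)  # Call 555 555-0100 or 555 555-0199 (weekdays only).
```

Because the pattern describes the shape of a phone number, the parentheses around "weekdays only" are left alone, which a plain search for every open parenthesis would not achieve.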
Since regular expressions define some ASCII characters as “metacharacters” that have more than their literal meaning, it is also important to be able to “escape” these metacharacters to use them for their normal, literal meaning. For example, the period . means “match any character”, but if you want to match a period then you will need to put a \ in front of it to signal to the regular expression processor that you want to use the period as a plain old period and not a metacharacter. That notation is called “escaping” the special character. The concept of “escaping” special characters is shared across a variety of computational settings, including Markdown and Hypertext Markup Language (HTML).
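The difference between an escaped and an unescaped period can be seen in a short Python sketch. The test string is invented for illustration, and Python's `re` module is only one of many regex implementations.

```python
import re

# Hypothetical test string: one real domain-like string, one look-alike.
text = "example.com examplexcom"

# Unescaped: the period matches any character, so both strings are found.
unescaped = re.findall(r"example.com", text)
print(unescaped)  # ['example.com', 'examplexcom']

# Escaped: \. matches only a literal period.
escaped = re.findall(r"example\.com", text)
print(escaped)  # ['example.com']
```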
Regex Syntax and interoperability
Most regular expression implementations employ similar syntaxes and metacharacters (generally influenced by the regex syntax of a programming language called Perl), and they behave similarly for most pattern-matching in this lesson. But there are differences, often subtle, in each, so it’s always a good practice to read the application or language’s documentation whenever available, especially when you start using more advanced regex features. Some programs, notably many UNIX command line programs (for more on UNIX see our ‘Shell Lesson’), use an older regex standard (called ‘POSIX regular expressions’) which is less feature-rich and uses different metacharacters than Perl-influenced implementations. For the purposes of our lesson, you do not need to worry too much about all this, but if you want to follow up on this see this detailed syntax comparison.
A very simple use of a regular expression would be to locate the same word spelled two different ways. For example, the regular expression organi[sz]e matches both organise and organize. But because it locates all matches for the pattern in the file, not just for that word, it would also find a match inside reorganise, reorganize, organises, organizes, organised, organized, etc.
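As a rough illustration of the point above, this Python sketch runs organi[sz]e over an invented sentence. Note that the matches returned are the parts of the text that fit the pattern, not the whole words that contain them.

```python
import re

# Hypothetical sentence containing several spelling variants.
text = "We organise events; they reorganized the archive and organizes talks."

# [sz] accepts either spelling; no anchors, so matches inside longer words count.
matches = re.findall(r"organi[sz]e", text)
print(matches)  # ['organise', 'organize', 'organize']
```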
Learning common regex metacharacters
Square brackets can be used to define a list or range of characters to be found. So:
- [ABC] matches A or B or C.
- [A-Z] matches any upper case letter.
- [A-Za-z] matches any upper or lower case letter.
- [A-Za-z0-9] matches any upper or lower case letter or any digit.
Then there are:
- . matches any single character.
- \d matches any single digit.
- \w matches any single word character (equivalent to [A-Za-z0-9_], that is, letters, digits, and the underscore).
- \s matches any space, tab, or newline.
- \ is used to escape the following character when that character is a special character. So, for example, a regular expression that finds .com would be \.com because . is a special character that matches any character.
- ^ is an “anchor” which asserts the position at the start of the line. So what you put after the caret will only match if it comes first on a line. The caret is also known as a circumflex.
- $ is an “anchor” which asserts the position at the end of the line. So what you put before it will only match if it comes last on a line.
- \b asserts that the pattern must match at a word boundary. Putting this either side of a word stops the regular expression matching longer variants of words. So:
  - the regular expression mark will match not only mark but also find marking, market, unremarkable, and so on.
  - the regular expression \bword will match word, wordless, and wordlessly.
  - the regular expression comb\b will match comb and honeycomb but not combine.
  - the regular expression \brespect\b will match respect but not respectable or disrespectful.
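The word-boundary and anchor examples above can be checked with a small Python sketch. The helper name `matching` and the extra candidate words (such as sword) are assumptions added for illustration.

```python
import re

def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [c for c in candidates if re.search(pattern, c)]

# No boundary: every word containing "mark" is found.
mark = matching(r"mark", ["mark", "marking", "market", "unremarkable"])

# Boundary at the start only: "sword" is excluded, longer variants are kept.
word = matching(r"\bword", ["word", "wordless", "wordlessly", "sword"])

# Boundary at the end only.
comb = matching(r"comb\b", ["comb", "honeycomb", "combine"])

# Boundaries on both sides: whole word only.
respect = matching(r"\brespect\b", ["respect", "respectable", "disrespectful"])

# Start-of-line anchor combined with a character class and a wildcard.
organise = matching(r"^[Oo]rgani.e\b", ["organise", "Organike", "organised", "reorganise"])

print(mark, word, comb, respect, organise, sep="\n")
```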
So, what is ^[Oo]rgani.e\b going to match?
Using special characters in regular expression matches
What will the regular expression ^[Oo]rgani.e\b match?
organise
organize
Organise
Organize
organife
Organike
Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, and ends with the letter e. See the solution visualised on Regexper.com.
Other useful special characters are:
- * matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc.
- + matches the preceding element one or more times. For example, ab+c matches “abc”, “abbbc” but not “ac”.
- ? matches when the preceding character appears zero or one time.
- {VALUE} matches the preceding character the number of times defined by VALUE; ranges, say, 1-6, can be specified with the syntax {VALUE,VALUE}, e.g. \d{1,9} will match any number between one and nine digits in length.
- | means or.
- /i renders an expression case-insensitive, so that, for example, [a-z] behaves like [A-Za-z]. It is a flag placed after the closing / delimiter rather than a metacharacter inside the expression.
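A quick way to see the quantifiers and the | operator from the list above in action is to test them against whole strings. This sketch uses Python's `re.fullmatch`, so a string counts only if the entire string fits the pattern; the helper name and the test strings are assumptions for illustration.

```python
import re

def is_full_match(pattern, text):
    """True when the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ["ac", "abc", "abbbc"]
star = [s for s in samples if is_full_match(r"ab*c", s)]  # zero or more b
plus = [s for s in samples if is_full_match(r"ab+c", s)]  # one or more b

# ? makes the u optional, but does not allow two of them.
optional = [s for s in ["color", "colour", "colouur"] if is_full_match(r"colou?r", s)]

# {1,9} accepts one to nine digits; the ten-digit string is rejected.
counted = [s for s in ["7", "123456789", "1234567890"] if is_full_match(r"\d{1,9}", s)]

# | accepts either alternative.
either = [s for s in ["dog", "cat", "cow"] if is_full_match(r"dog|cat", s)]

print(star, plus, optional, counted, either, sep="\n")
```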
So, what are these going to match?
^[Oo]rgani.e\w*
What will the regular expression ^[Oo]rgani.e\w* match?
organise
Organize
organifer
Organi2ed111
Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with the letter e and zero or more characters from the range [A-Za-z0-9_].
[Oo]rgani.e\w+$
What will the regular expression [Oo]rgani.e\w+$ match?
organiser
Organized
organifer
Organi2ed111
Or, any other string that ends a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with the letter e and one or more characters from the range [A-Za-z0-9_].
^[Oo]rgani.e\w?\b
What will the regular expression ^[Oo]rgani.e\w?\b match?
organise
Organized
organifer
Organi2ek
Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with the letter e, and ends with zero or one characters from the range [A-Za-z0-9_].
^[Oo]rgani.e\w?$
What will the regular expression ^[Oo]rgani.e\w?$ match?
organise
Organized
organifer
Organi2ek
Or, any other string that starts and ends a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with the letter e and zero or one characters from the range [A-Za-z0-9_].
\b[Oo]rgani.e\w{2}\b
What will the regular expression \b[Oo]rgani.e\w{2}\b match?
organisers
Organizers
organifers
Organi2ek1
Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with the letter e, and ends with two characters from the range [A-Za-z0-9_].
\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b
What will the regular expression \b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b match?
organise
Organi1e
Organizer
organifed
Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, and ends with the letter e, or any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with the letter e, and ends with a single character from the range [A-Za-z0-9_].
This logic is useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. It can be used for looking at cells in spreadsheets for certain values, or for extracting some data from a column of a spreadsheet to make new columns. There are many other contexts in which regex is useful when using a computer to search through a document, spreadsheet, or file structure. Some real-world use cases for regex were included in an ACRL Tech Connect blog post, now archived at Library Hat.
To embed this knowledge we will not, however, be using computers. Instead we’ll use pen and paper for now.
Exercise
Work in teams of four to six on the exercises below. When you think you have the right answer, check it against the solution.
When you finish, split your team into two groups and write each other some tests. These should include a) strings you want the other team to write regex for and b) regular expressions you want the other team to work out what they would match.
Then test each other on the answers. If you want to check your logic use regex101, myregexp, regex pal or regexper.com: the first three help you see what text your regular expression will match, the latter visualises the workflow of a regular expression.
Using square brackets
What will the regular expression Fr[ea]nc[eh] match?
French
France
Frence
Franch
Note that the way this regular expression is constructed, it will match misspellings such as Franch and Frence. Lacking an “anchor” such as ^ or \b, this will also find strings where there are characters to either side of the regular expression, such as in French, France's, French-fried.
Using dollar signs
What will the regular expression Fr[ea]nc[eh]$ match?
French
France
Frence
Franch
This will match the pattern only when it appears at the end of a line. It will also find strings with other characters coming before the pattern, for example, in French or faux-French.
Introducing options
What would match the strings French and France that appear at the beginning of a line?
^France|^French
This will also find words where there were characters after French such as Frenchness.
Case insensitivity
How do you match the whole words colour and color (case insensitive)?
\b[Cc]olou?r\b|\bCOLOU?R\b
/colou?r/i
In real life, you should only come across the case insensitive variations colour, color, Colour, Color, COLOUR, and COLOR (rather than, say, coLour). So based on what we know, the logical regular expression is \b[Cc]olou?r\b|\bCOLOU?R\b.
An alternative, more elegant option we’ve not discussed is to take advantage of the / delimiters and add an ‘ignore case’ flag. To use these flags, include / delimiters before and after the expression, then add letters after the closing delimiter to raise each flag (where i is ‘ignore case’): so /colou?r/i will match all case insensitive variants of colour and color. Add \b at each end (/\bcolou?r\b/i) if you need whole words only.
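Not every tool uses the / delimiters. As one example, Python's `re` module takes the ignore-case option as a separate `flags` argument instead. The sketch below compares the two approaches from this solution on an invented sentence; the \b anchors in the flagged version are added here to keep the match to whole words.

```python
import re

# Hypothetical sentence with mixed capitalisation.
text = "Colour, color, COLOUR and coLour; but not colourful."

# Explicit alternatives: only the three expected capitalisations are found.
explicit = re.findall(r"\b[Cc]olou?r\b|\bCOLOU?R\b", text)
print(explicit)  # ['Colour', 'color', 'COLOUR']

# Ignore-case flag: any mix of upper and lower case is found.
flagged = re.findall(r"\bcolou?r\b", text, flags=re.IGNORECASE)
print(flagged)  # ['Colour', 'color', 'COLOUR', 'coLour']
```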
Word boundaries
How would you find the whole word headrest or head rest but not head  rest (that is, with two spaces between head and rest)?
\bhead ?rest\b
Note that although \bhead\s?rest\b does work, it will also match a tab or newline character between head and rest. So again, although in most real world cases it will be fine, it isn’t strictly correct.
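The difference between a literal space and \s can be checked with a short Python sketch. The sample strings are assumptions chosen to include a double space and a tab.

```python
import re

# Hypothetical samples: no space, one space, two spaces, and a tab.
samples = ["headrest", "head rest", "head  rest", "head\trest"]

# A literal space followed by ? allows zero or one space only.
space_only = [s for s in samples if re.search(r"\bhead ?rest\b", s)]

# \s? also allows a single tab (or newline) between the two words.
any_whitespace = [s for s in samples if re.search(r"\bhead\s?rest\b", s)]

print(space_only)      # ['headrest', 'head rest']
print(any_whitespace)  # ['headrest', 'head rest', 'head\trest']
```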
Matching non-linguistic patterns
How would you find a string that ends with four letters preceded by at least one zero?
0+[A-Za-z]{4}\b
Matching digits
How do you match any four-digit string anywhere?
\d{4}
Note: this will also match four-digit strings within longer strings of numbers and letters.
Matching dates
How would you match the date format dd-MM-yyyy?
\b\d{2}-\d{2}-\d{4}\b
Depending on your data, you may choose to remove the word bounding.
Matching multiple date formats
How would you match the date format dd-MM-yyyy or dd-MM-yy at the end of a line only?
\d{2}-\d{2}-\d{2,4}$
Note this will also find strings such as 31-01-198 at the end of a line, so you may wish to check your data and revise the expression to exclude false positives. Depending on your data, you may choose to add word bounding at the start of the expression.
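The false positive mentioned above is easy to reproduce. In this Python sketch each invented line is searched separately, so $ means the end of that line. The stricter variant, which uses parentheses with | to demand exactly two or exactly four year digits, is one possible revision and an assumption of this example rather than part of the exercise solution.

```python
import re

# Hypothetical lines of text, searched one at a time.
lines = [
    "Received 31-01-1985",
    "Received 31-01-85",
    "Received 31-01-198",
    "31-01-1985 received",
]

# Pattern from the exercise: two to four year digits at the end of the line.
loose = re.compile(r"\d{2}-\d{2}-\d{2,4}$")

# Possible revision (an assumption): word boundary at the start,
# and a year of exactly two OR exactly four digits.
strict = re.compile(r"\b\d{2}-\d{2}-(\d{2}|\d{4})$")

loose_hits = [line for line in lines if loose.search(line)]
strict_hits = [line for line in lines if strict.search(line)]

print(loose_hits)   # includes the three-digit year
print(strict_hits)  # only the two valid formats
```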
Matching publication formats
How would you match publication formats such as British Library : London, 2015 and Manchester University Press: Manchester, 1999?
.* ?: .*, \d{4}
Without word boundaries you will find that this matches any text you put before British or Manchester. Nevertheless, the regular expression does a good job on the first look up and may need to be refined on a second, depending on your data.
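To see both the strengths and the looseness of this expression, here is a small Python sketch. The third and fourth records are invented for illustration: one adds text before the publisher, the other lacks the colon and year.

```python
import re

pattern = re.compile(r".* ?: .*, \d{4}")

# The first two records come from the exercise; the last two are made up.
records = [
    "British Library : London, 2015",
    "Manchester University Press: Manchester, 1999",
    "See also British Library : London, 2015",
    "British Library, London",
]

matched = [r for r in records if pattern.search(r)]

# The leading .* happily swallows any text placed before the publisher.
with_prefix = pattern.search(records[2]).group()

print(matched)
print(with_prefix)  # See also British Library : London, 2015
```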
Key Points
- Regular expressions are a language for pattern matching.
Content from Matching & Extracting Strings
Last updated on 2024-10-15
Estimated time: 30 minutes
Overview
Questions
- How can you use regular expressions to match and extract strings?
Objectives
- Use regular expressions to match words, email addresses, and phone numbers.
- Use regular expressions to extract substrings from strings (e.g. addresses).
Exercise Using Regex101.com
For this exercise, open a browser and go to https://regex101.com. Regex101.com is a free regular expression debugger with real time explanation, error detection, and highlighting.
Open the swcCoC.md file, copy the text, and paste that into the test string box.
For a quick test to see if it is working, type the string community into the regular expression box.
If you look in the box on the right of the screen, you see that the expression matches six instances of the string ‘community’ (the instances are also highlighted within the text).
Taking spaces into consideration
Type community followed by a space (note the trailing space). You get three matches. Why not six?
The string ‘community-led’ matches the first search, but drops out of this result because the space does not match the character -.
Taking any character into consideration
If you want to match ‘community-led’ by adding other regex characters to the expression community, what would they be?
For instance, \S+\b. This would match one or more non-space characters followed by a word boundary.
Exploring effect of expressions matching different words
Change the expression to communi and you get 15 full matches of several words. Why?
Because the string ‘communi’ is present in all of those words, including communication and community. Because the expression does not have a word boundary, this expression would also match incommunicado, were it present in this text. If you want to test this, type incommunicado into the text somewhere and see if it is found.
Taking capitalization into consideration
Type the expression [Cc]ommuni. You get 16 matches. Why?
The pattern communi with either a capital C or lowercase c is present in the text 16 times.
Regex characters that indicate location
Type the expression ^[Cc]ommuni. You get no matches. Why?
There is no matching string present at the start of a line. Look at the text and replace the string after the ^ with something that matches a word at the start of a line. Does it find a match?
Finding plurals
Find all of the words starting with Comm or comm that are plural.
[Cc]omm\w+s\b
[Cc] finds capital and lowercase c
omm is a straightforward character match
\w+ matches a word character one or more times
s is a straightforward character match
\b ensures the ‘s’ is located at the end of the word.
Exercise finding email addresses using regex101.com
For this exercise, open a browser and go to https://regex101.com.
Open the swcCoC.md file, copy it, and paste it into the test string box.
Start with what you know
What character do you know is held in common with all email addresses?
The ‘@’ character.
Add to what you know
The string before the “@” could contain any kind of word character, special character or digit in any combination and length. How would you express this in regex? Hint: often addresses will have a dash (-) or dot (.) in them, and neither of these is included in the word character expression (\w). How do you capture this in the expression?
[\w.-]+@
\w matches any word character (letters, digits, and underscore)
. matches a literal period (when used in between square brackets, . does not mean “any character”, it literally means “.”)
- matches a dash (placed last inside the brackets, it is a literal dash rather than a range)
[] the brackets enclose the set of alternatives: a word character OR a dot OR a dash
+ matches the preceding bracketed set (any word character, dot, or dash) one or more times
@ matches the literal ‘@’ character.
Finish the expression
The string after the “@” could contain any kind of word character, special character or digit in any combination and length as well as the dash. In addition, we know that it will have some characters after a period (.). Most common top-level domains (the part after the last period) have two or three characters, but many more are now possible (access the latest list). What expression would capture this? Hint: the . is also a metacharacter, so you will have to use the escape \ to express a literal period. Note: for the string after the period, we did not try to match a - character, since those rarely appear in the characters after the period at the end of an email address.
[\w.-]+\.\w{2,3} OR [\w.-]+\.\w+
See the previous exercise for the explanation of the expression up to the first +.
\. matches the literal period (‘.’), not the regex metacharacter .
\w matches any word character (letters, digits, and underscore)
{2,3} limits the number of word characters after the period to two or three
+ in the second alternative matches the preceding word character one or more times, so longer endings are also accepted.
Appended to the expression from the previous exercise, this gives [\w.-]+@[\w.-]+\.\w{2,3} or [\w.-]+@[\w.-]+\.\w+.
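Putting the two halves of the email exercise together, a minimal Python sketch might look like the following. The combined pattern is assembled here from the parts built in the exercises, and the addresses are invented (using the reserved example.org and example.co.uk domains). This is a teaching pattern, not a full email validator.

```python
import re

# Hypothetical text; the addresses are made up for illustration.
text = (
    "Contact jane.doe@example.org or the-team@lists.example.co.uk; "
    "a bare handle like @example. is not an address."
)

# Part before the @, the @ itself, then the domain ending in a period
# followed by one or more word characters.
pattern = re.compile(r"[\w.-]+@[\w.-]+\.\w+")

addresses = pattern.findall(text)
print(addresses)  # ['jane.doe@example.org', 'the-team@lists.example.co.uk']
```

Using \w+ rather than \w{2,3} for the final part keeps longer endings from being cut short, at the cost of accepting some strings that are not real domains.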
Exercise finding phone numbers using regex101.com
Does this Code of Conduct contain a phone number?
What to consider:
- It may or may not have a country code, perhaps starting with a “+”.
- It will have an area code, potentially enclosed in parentheses.
- It may have the sections all separated with a “-”.
Start with what you know: find strings of digits
Start with what we know, which is the most basic format of a phone number: three digits, a dash, and four digits. How would we write a regular expression that matches this?
\d{3}-\d{4}
\d matches any digit
{3} repeats the preceding \d so that it matches 3 digits
- matches the character ‘-’
\d matches any digit
{4} repeats the preceding \d so that it matches 4 digits.
This expression should find three matches in the document.
Match a string that includes an area code with a dash
Start with what we know, which is the most basic format of a phone number: three digits, a dash, and four digits. How would we expand the expression to include an area code (three digits and a dash)?
\d{3}-\d{3}-\d{4}
\d matches any digit
{3} repeats the preceding \d so that it matches 3 digits
- matches the character ‘-’
\d matches any digit
{4} repeats the preceding \d so that it matches 4 digits.
This expression should find one match in the document.
Match a string that includes an area code within parentheses, with or without a space before the rest of the phone number
Start with what we know, which is the most basic format of a phone number: three digits, a dash, and four digits. How would we expand the expression to include a phone number with an area code in parentheses, separated from the rest of the phone number with or without a space?
\(\d{3}\) ?\d{3}-\d{4}
\( the escape character followed by the opening parenthesis, making it a straightforward character match
\d matches any digit
{3} repeats the preceding \d so that it matches 3 digits
\) the escape character followed by the closing parenthesis, making it a straightforward character match
a space followed by ? matches zero or one spaces
See the previous exercise for the explanation of the rest of the expression.
This expression should find two matches in the document.
Match a phone number containing a country code.
Country codes are preceded by a “+” and can have up to three digits. We also have to consider that there may or may not be a space between the country code and anything appearing next.
\+\d{1,3} ?\(\d{3}\)\s?\d{3}-\d{4}
\+ the escape character followed by the plus sign, making it a straightforward character match
\d matches any digit
{1,3} repeats the preceding \d so that it matches 1 to 3 digits
a space followed by ? matches zero or one spaces
See the previous exercise for the explanation of the rest of the expression.
This expression should find one match in the document.
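The four phone-number expressions built up in this exercise can be compared side by side. In the Python sketch below the sample numbers are fictional and the dictionary keys are labels invented for this example; each pattern is reported with the samples in which it is found.

```python
import re

# Hypothetical phone numbers in the formats discussed in the exercise.
samples = [
    "555-0100",
    "800-555-0199",
    "(555) 555-0123",
    "(555)555-0123",
    "+1 (555) 555-0147",
]

# The expressions from the exercise, with made-up labels.
patterns = {
    "local": r"\d{3}-\d{4}",
    "area_dash": r"\d{3}-\d{3}-\d{4}",
    "area_parens": r"\(\d{3}\) ?\d{3}-\d{4}",
    "country": r"\+\d{1,3} ?\(\d{3}\)\s?\d{3}-\d{4}",
}

# For each pattern, collect the samples that contain a match.
found = {
    label: [s for s in samples if re.search(pattern, s)]
    for label, pattern in patterns.items()
}

for label, hits in found.items():
    print(label, hits)
```

The basic pattern is found inside every sample, because each longer format ends with three digits, a dash, and four digits. That is why the more specific expressions are needed to tell the formats apart.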
Using regular expressions when working with files and directories
One of the reasons we stress the value of consistent and predictable directory and filenaming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file names. For example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from ‘2017’, then you can. Or if you have ‘journal’ somewhere in a filename when you have data about journals, you can use the computer to select just those files. Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data within those files. See Workshop Overview: File Naming & Formatting for further background.
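As a rough sketch of the file-selection idea described above, the following Python example filters a list of file names. The names are invented; in practice the list might come from a directory listing.

```python
import re

# Hypothetical file names following a year-first naming convention.
filenames = [
    "2017_journal_usage.csv",
    "2016_journal_usage.csv",
    "2017-budget.txt",
    "notes_2017.txt",
    "journal-list-2018.csv",
]

# ^2017 selects only names whose first four characters are 2017.
from_2017 = [name for name in filenames if re.search(r"^2017", name)]

# No anchor: 'journal' may appear anywhere in the name.
journals = [name for name in filenames if re.search(r"journal", name)]

print(from_2017)  # ['2017_journal_usage.csv', '2017-budget.txt']
print(journals)
```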
Extracting a substring in Google Sheets using regex
- Export and unzip the 2017 Public Library Survey (originally from the IMLS data site) as a CSV file.
- Upload the CSV file to Google Sheets and open as a Google Sheet if it does not do this by default.
- Look in the ADDRESS column and notice that the values contain the latitude and longitude in parentheses after the library address.
- Construct a regular expression to match and extract the latitude and longitude into a new column named ‘latlong’. HINT: Look up the function REGEXEXTRACT in Google Sheets. That function expects the first argument to be a string (a cell in the ADDRESS column) and a quoted regular expression in the second.
This is one way to solve this challenge. You might have found others. Inside the cell you can use the below to extract the latitude and longitude into a single cell. You can then copy the formula down to the end of the column.
=REGEXEXTRACT(G2,"-?\d+\.\d+, -?\d+\.\d+")
Latitude and longitude are in decimal degree format and can be positive or negative, so we start with an optional dash for negative values (-?), then use \d+ for a one or more digit match followed by a period \. (note we had to escape the period using \). After the period we look for one or more digits \d+ again, followed by a literal comma ,. We then have a literal space match followed by an optional dash -? (there are a few 0.0 latitude/longitudes that are probably errors, but we’d want to retain them so we can deal with them). We then repeat the \d+\.\d+ we used for the latitude match.
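The same expression works outside Google Sheets. This Python sketch applies it to a single invented address value; the address layout (street, town, then coordinates in parentheses) is an assumption based on the description of the ADDRESS column, and converting the result to numbers is an extra step not required by the exercise.

```python
import re

# Hypothetical ADDRESS cell value; not taken from the survey data.
address = "100 Example St, Anytown, AK 99999 (61.2181, -149.9003)"

# Same expression as in the REGEXEXTRACT formula.
pattern = re.compile(r"-?\d+\.\d+, -?\d+\.\d+")

match = pattern.search(address)
if match:
    latlong = match.group()  # the text that matched the pattern
    # Optional extra step: split into two numbers.
    latitude, longitude = (float(part) for part in latlong.split(", "))
    print(latlong)              # 61.2181, -149.9003
    print(latitude, longitude)  # 61.2181 -149.9003
```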
Key Points
- Regular expressions are useful for searching and cleaning data.
- Test regular expressions interactively with regex101.com or RegExr.com, and visualize them with regexper.com.
- Test yourself with RegexCrossword.com or via the quiz and exercises in this lesson.
Content from Multiple Choice Quiz
Last updated on 2023-05-03
Estimated time: 60 minutes
Overview
Questions
- How do you find and match strings with regular expressions?
Objectives
- Test knowledge of use of regular expressions.
Multiple Choice Quiz
This multiple choice quiz is designed to embed the regex knowledge you learned during this module. We recommend you work through it sometime after class (within a week or so).
Q1. What is the special character that matches the preceding element zero or more times?
A. ^
B. #
C. *
Answer: C
Q2. Which of the following matches any space, tab, or newline?
A. \s
B. \b
C. $
Answer: A
Q3. How do you match the string Confident appearing at the beginning of a line?
A. $Confident
B. ^Confident
C. #Confident
Answer: B
Q4. How do you match the word Confidential appearing at the beginning of a line?
A. ^Confidential\d
B. ^Confidential\b
C. ^Confidential\w
Answer: B
Q5. What does the regular expression [a-z] match?
A. The characters a and z only
B. All characters between the ranges a to z and A to Z
C. All characters between the range a to z
Answer: C
Q6. Which of these will match the strings revolution, revolutionary, and revolutionaries?
A. revolution[a-z]?
B. revolution[a-z]*
C. revolution[a-z]+
Answer: B
Q7. Which of these will match the strings revolution, Revolution, and their plural variants only?
A. [rR]evolution[s]+
B. revolution[s]?
C. [rR]evolution[s]?
Answer: C
Q8. What regular expression matches the strings dog or cat?
A. dog|cat
B. dog,cat
C. dog | cat
Answer: A
Q9. What regular expression matches the whole words dog or cat?
A. \bdog|cat\b
B. \bdog\b | \bcat\b
C. \bdog\b|\bcat\b
Answer: C
Q10. What do we put after a character to match strings where that character appears two to four times in sequence?
A. {2,4}
B. {2-4}
C. [2,4]
Answer: A
Q11. The regular expression \d{4} will match what?
A. Any four character sequence?
B. Any four digit sequence?
C. The letter d four times?
Answer: B
Q12. If parentheses are used to define a group, what would match the regular expression (,\s[0-9]{1,4}){4},\s[0-9]{1,3}\.[0-9]?
A. , 135, 1155, 915, 513, 18.8
B. , 135, 11557, 915, 513, 18.8
C. , 135, 1155, 915, 513, 188
Answer: A
Key Points
- Regular expressions answers
Content from Exercises
Last updated on 2023-05-03
Estimated time: 50 minutes
Overview
Questions
- How do you find and match strings with regular expressions?
Objectives
- Test knowledge of use of regular expressions
Exercises
The exercises are designed to embed the regex knowledge you learned during this module. We recommend you work through them sometime after class (within a week or so).
What does Fr[ea]nc[eh] match?
This matches France and French, in addition to the misspellings Frence and Franch. It would also find strings where there were characters to either side of the pattern, such as France's, in French, or French-fried.
What does Fr[ea]nc[eh]$ match?
This matches France, French, Frence, and Franch only at the end of a line. It would also match strings with other characters appearing before the pattern, such as in French or Sino-French.
What would match the strings French and France only when they appear at the beginning of a line?
^France|^French
This would also find strings with other characters coming after French, such as Frenchness or France's economy.
How do you match the whole words colour and color (case insensitive)?
In real life, you should only come across the case insensitive variations colour, color, Colour, Color, COLOUR, and COLOR (rather than, say, coLour). So one option would be \b[Cc]olou?r\b|\bCOLOU?R\b. This can, however, quickly get quite complex. An option we’ve not discussed is to take advantage of the / delimiters and add an ignore case flag: so /colou?r/i will match all case insensitive variants of colour and color.
How would you find the whole word headrest or head rest but not head  rest (that is, with two spaces between head and rest)?
\bhead ?rest\b
Note that although \bhead\s?rest\b does work, it would also match a tab or newline character between head and rest. In most real world cases it should, however, be fine.
How would you find a 4-letter word that ends a string and is preceded by at least one zero?
0+[a-z]{4}\b
How do you match any 4-digit string anywhere?
\d{4}
Note this will match 4 digit strings only but will find them within longer strings of numbers.
How would you match the date format dd-MM-yyyy?
\b\d{2}-\d{2}-\d{4}\b
In most real world situations, you are likely to want word bounding here (but it may depend on your data).
How would you match the date format dd-MM-yyyy or dd-MM-yy at the end of a line only?
\d{2}-\d{2}-\d{2,4}$
How would you match publication formats such as British Library : London, 2015 and Manchester University Press: Manchester, 1999?
.* ?: .*, \d{4}
The optional space ( ?) before the colon is needed because only the first example has a space there. You will find that this matches any text you put before British or Manchester. In this case, this regular expression does a good job on the first look up and may need to be refined on a second depending on your real world application.
Key Points
- Regular expressions answers