Library Carpentry: Introduction to Regular Expressions: All in One View

Last updated on 2025-12-12

Overview

Questions

How can you imagine using regular expressions in your work?

Objectives

Identify potential use cases for regular expressions
Recognize common regex metacharacters
Use regular expressions in searches

Regular expressions

Regular expressions are a concept and an implementation used in many different programming environments for sophisticated pattern matching. They are an incredibly powerful tool that can amplify your capacity to find, manage, and transform data and files.

A regular expression, often abbreviated to regex, is a method of using a sequence of characters to define a search to match strings, i.e. “find and replace”-like operations. In computation, a ‘string’ is a contiguous sequence of symbols or values: for example, a word, a date, a set of numbers (e.g., a phone number), or an alphanumeric value (e.g., an identifier). A string could be any length, ranging from empty (zero characters) to one that spans many lines of text (including line break characters). The terms ‘string’ and ‘line’ are sometimes used interchangeably, even when they are not strictly the same thing.

In library searches, we are most familiar with a small part of regular expressions known as the “wild card character,” but there are many more features to the complete regular expressions syntax. Regular expressions will let you:

Match on types of characters (e.g. ‘upper case letters’, ‘digits’, ‘spaces’, etc.).
Match patterns that repeat any number of times.
Capture the parts of the original string that match your pattern.

Regex can also be useful for daily work. For example, say your organization wants to change the way they display telephone numbers on their website by removing the parentheses around the area code. Rather than search for each specific phone number (that could take forever and be prone to error) or searching for every open parenthesis character (could also take forever and return many false-positives), you could search for the pattern of a phone number. Regular expressions rely on the use of literal characters and metacharacters. A metacharacter is any American Standard Code for Information Interchange (ASCII) character that has a special meaning. By using metacharacters and possibly literal characters, you can construct a regex for finding strings or files that match a pattern rather than a specific string.

Since regular expressions define some ASCII characters as “metacharacters” that have more than their literal meaning, it is also important to be able to “escape” these metacharacters to use them for their normal, literal meaning. For example, the period . means “match any character”, but if you want to match a period then you will need to use a \ in front of it to signal to the regular expression processor that you want to use the period as a plain old period and not a metacharacter. That notation is called “escaping” the special character. The concept of “escaping” special characters is shared across a variety of computational settings, including Markdown and Hypertext Markup Language (HTML).
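The difference between an escaped and an unescaped period can be sketched in Python with its built-in re module. This is only an illustration: the sample text is made up, and other tools may spell things slightly differently.

```python
import re

# Hypothetical sample text, chosen so that "com" appears twice.
text = "Visit example.com or telecommute"

# Unescaped: "." matches ANY character, so "ecom" in "telecommute" also matches.
unescaped = re.findall(r".com", text)

# Escaped: "\." matches only a literal period.
escaped = re.findall(r"\.com", text)

# re.escape adds the backslashes for you when the search text is literal.
literal_pattern = re.escape(".com")

print(unescaped)        # ['.com', 'ecom']
print(escaped)          # ['.com']
print(literal_pattern)  # \.com
```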

Callout

Regex Syntax and interoperability

Most regular expression implementations employ similar syntaxes and metacharacters (generally influenced by the regex syntax of a programming language called Perl), and they behave similarly for most pattern-matching in this lesson. But there are differences, often subtle, in each, so it’s always a good practice to read the application or language’s documentation whenever available, especially when you start using more advanced regex features. Some programs, notably many UNIX command line programs (for more on UNIX see our ‘Shell Lesson’), use an older regex standard (called ‘POSIX regular expressions’) which is less feature-rich and uses different metacharacters than Perl-influenced implementations. For the purposes of our lesson, you do not need to worry too much about all this, but if you want to follow up on this see this detailed syntax comparison.

A very simple use of a regular expression would be to locate the same word spelled two different ways. For example the regular expression organi[sz]e matches both organise and organize. But because it locates all matches for the pattern in the file, not just for that word, it would also match reorganise, reorganize, organises, organizes, organised, organized, etc.
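As a minimal sketch of this behaviour, the Python snippet below runs organi[sz]e over a made-up sentence. The sentence and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample sentence.
text = "We organise events, reorganize shelves, and met the organizers."

# [sz] matches exactly one character: either "s" or "z".
matches = re.findall(r"organi[sz]e", text)

# Adding \w* on both sides returns the whole word that contains each match.
containing_words = re.findall(r"\w*organi[sz]e\w*", text)

print(matches)           # ['organise', 'organize', 'organize']
print(containing_words)  # ['organise', 'reorganize', 'organizers']
```

The second result shows why the pattern alone finds more than the two spellings: it matches wherever the character sequence occurs, including inside longer words.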

Learning common regex metacharacters

Square brackets can be used to define a list or range of characters to be found. So:

[ABC] matches A or B or C.
[A-Z] matches any upper case letter.
[A-Za-z] matches any upper or lower case letter.
[A-Za-z0-9] matches any upper or lower case letter or any digit.

Then there are:

. matches any character.
\d matches any single digit.
\w matches any word character (equivalent to [A-Za-z0-9_], that is, letters, digits, and the underscore).
\s matches any space, tab, or newline.
\ used to escape the following character when that character is a special character. So, for example, a regular expression that found .com would be \.com because . is a special character that matches any character.
^ is an “anchor” which asserts the position at the start of the line. So what you put after the caret will only match if they are the first characters of a line. The caret is also known as a circumflex.
$ is an “anchor” which asserts the position at the end of the line. So what you put before it will only match if they are the last characters of a line.
\b asserts that the pattern must match at a word boundary. Putting this either side of a word stops the regular expression matching longer variants of words. So:
- the regular expression mark will match not only mark but also find marking, market, unremarkable, and so on.
- the regular expression \bword will match word, wordless, and wordlessly.
- the regular expression comb\b will match comb and honeycomb but not combine.
- the regular expression \brespect\b will match respect but not respectable or disrespectful.
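The anchors and word boundaries listed above can be tried out with a short Python sketch. It assumes Python's re module, where ^ and $ only work line by line when the re.MULTILINE flag is set; the two-line sample text is made up.

```python
import re

# Words taken from the examples above.
words = "mark marking market unremarkable"
anywhere = re.findall(r"mark", words)        # found inside every word
whole_word = re.findall(r"\bmark\b", words)  # only the standalone word

# \w* ("zero or more word characters") is added here only so that the
# whole surrounding word is returned, not just "comb".
ends_in_comb = re.findall(r"\w*comb\b", "comb honeycomb combine")

# Hypothetical two-line text to show the line anchors.
text = "first line\nsecond line"
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(anywhere)      # ['mark', 'mark', 'mark', 'mark']
print(whole_word)    # ['mark']
print(ends_in_comb)  # ['comb', 'honeycomb']
print(line_starts)   # ['first', 'second']
print(line_ends)     # ['line', 'line']
```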

So, what is ^[Oo]rgani.e\b going to match?

Challenge

Using special characters in regular expression matches

What will the regular expression ^[Oo]rgani.e\b match?

Show me the solution

organise
organize
Organise
Organize
organife
Organike

Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, and ends with the letter e. See solution visualised on Regexper.com

Getting help with regular expressions

Although powerful, the mix of special and regular characters in a regular expression can make them difficult to interpret and to troubleshoot when we get them wrong. The Regexper tool linked in the solution above visualises and annotates a regular expression provided by the user, which can be very helpful as you learn to read and write them. Other online tools such as Regex101 provide a more powerful – and more complex! – interface that can help you develop a regular expression to fit your use case.

If you get stuck when writing a regular expression, it is helpful to prepare a collection of “test strings”: words, sentences, or other blocks of text that represent the things you do and do not want to match. To test for matches you do not want, think about words/sentences/text that is similar to the matches you do want to make, and include those in your test strings. For example, you might include “disorganised” in a set of test strings for the regular expression in the exercise above, if you only wanted to match whole words.

Paste your test strings into a new file in your favourite text editor or the Test String box on Regex101 then paste or compose your regular expression to see which of these test strings are matched.
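If you prefer to keep your test strings in a script, the same idea can be sketched in Python. The helper name check_pattern and the test strings are hypothetical; this is one possible way to organise such a check, not a standard tool.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return the test strings that behave unexpectedly for `pattern`."""
    regex = re.compile(pattern)
    missed = [s for s in should_match if not regex.search(s)]
    unwanted = [s for s in should_not_match if regex.search(s)]
    return missed, unwanted

wanted = ["organise", "Organize"]
not_wanted = ["disorganised", "organic"]

# Without word boundaries, "disorganised" is matched by mistake.
first_try = check_pattern(r"[Oo]rgani.e", wanted, not_wanted)

# With word boundaries, every test string behaves as intended.
second_try = check_pattern(r"\b[Oo]rgani.e\b", wanted, not_wanted)

print(first_try)   # ([], ['disorganised'])
print(second_try)  # ([], [])
```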

As well as trying to figure things out on your own, you might also get help by talking to somebody! If you have a colleague or friend with more expertise in regular expressions than you have, show them the problem you are having and ask them for help. Sometimes, the act of articulating your question can help you to identify what is going wrong. This is known as “rubber duck debugging” among programmers.

Generative AI

It is increasingly common for people to use generative AI chatbots such as ChatGPT to get help with regular expressions. You will probably receive some useful guidance by presenting your regular expression to the chatbot and asking why it matches (or does not match) a given string. However, the way this help is provided by the chatbot is different from that provided by a human. Generative AI chatbots, which are based on an advanced statistical model, respond by generating the most likely sequence of text that would follow the prompt they are given.

While responses from generative AI tools can often be helpful, they are not always reliable. These tools sometimes generate plausible but incorrect or misleading information, so (just as with an answer found on the internet or shared with you by somebody else) it is essential to verify their accuracy. You need the knowledge and skills to be able to understand these responses, to judge whether or not they are accurate, and to fix any errors in the regular expression the chatbot offers you.

In addition to asking for help, generative AI tools can be used to generate regular expressions from scratch (especially when given clear and precise instructions and examples); extend, improve and reorganise existing regular expressions; and more. However, there are drawbacks that you should be aware of.

The models used by these tools have been “trained” on very large volumes of data, much of it taken from the internet, and the responses they produce reflect that training data, and may recapitulate its inaccuracies or biases. The environmental costs (energy and water use) of LLMs are a lot higher than other technologies, both during development (known as training) and when an individual user uses one (also called inference). For more information see the AI Environmental Impact Primer developed by researchers at HuggingFace, an AI hosting platform. Concerns also exist about the way the data for this training was obtained, with questions raised about whether the people developing the LLMs had permission to use it. Other ethical concerns have also been raised, such as reports that workers were exploited during the training process.

We recommend that you avoid getting help from generative AI during the workshop because the foundational knowledge and skills you will learn in this lesson by writing and fixing your own regular expressions are essential to be able to evaluate the correctness and safety of any answers you receive from other people or a generative AI chatbot. If you choose to use these tools in the future, the expertise you gain from learning and practising these fundamentals on your own will help you use them more effectively.

More special characters

Other useful special characters are:

* matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc.
+ matches the preceding element one or more times. For example, ab+c matches “abc”, “abbbc” but not “ac”.
? matches when the preceding character appears zero or one time.
{VALUE} matches the preceding character the number of times defined by VALUE; ranges, say, 1-6, can be specified with the syntax {VALUE,VALUE}, e.g. \d{1,9} will match any number between one and nine digits in length.
| means or.
/i renders an expression case-insensitive (for example, /[a-z]/i is equivalent to [A-Za-z]).
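A minimal Python sketch of these quantifiers follows. Note that Python does not use the /…/i delimiter syntax; case-insensitivity is requested with the re.IGNORECASE flag instead. The candidate strings are made up for illustration.

```python
import re

candidates = ["ac", "abc", "abbbc"]

# fullmatch requires the whole string to fit the pattern.
star = [s for s in candidates if re.fullmatch(r"ab*c", s)]      # zero or more b
plus = [s for s in candidates if re.fullmatch(r"ab+c", s)]      # one or more b
optional = [s for s in candidates if re.fullmatch(r"ab?c", s)]  # zero or one b

# {2,4} between word boundaries: whole numbers of two to four digits.
numbers = re.findall(r"\b\d{2,4}\b", "7 42 2015 123456")

# | means "or".
pets = re.findall(r"cat|dog", "cat, dog, bird")

# Python's equivalent of the /i flag.
ignore_case = re.fullmatch(r"abc", "ABC", flags=re.IGNORECASE) is not None

print(star)         # ['ac', 'abc', 'abbbc']
print(plus)         # ['abc', 'abbbc']
print(optional)     # ['ac', 'abc']
print(numbers)      # ['42', '2015']
print(pets)         # ['cat', 'dog']
print(ignore_case)  # True
```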

So, what are these going to match?

Challenge

`^[Oo]rgani.e\w*`

What will the regular expression ^[Oo]rgani.e\w* match?

Show me the solution

organise
Organize
organifer
Organi2ed111

Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and zero or more characters from the range [A-Za-z0-9_].

Challenge

`[Oo]rgani.e\w+$`

What will the regular expression [Oo]rgani.e\w+$ match?

Show me the solution

organiser
Organized
organifer
Organi2ed111

Or, any other string that ends a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and one or more characters from the range [A-Za-z0-9_].

Challenge

`^[Oo]rgani.e\w?\b`

What will the regular expression ^[Oo]rgani.e\w?\b match?

Show me the solution

organise
Organized
organifer
Organi2ek

Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with zero or one characters from the range [A-Za-z0-9_].

Challenge

`^[Oo]rgani.e\w?$`

What will the regular expression ^[Oo]rgani.e\w?$ match?

Show me the solution

organise
Organized
organifer
Organi2ek

Or, any other string that starts and ends a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and zero or one characters from the range [A-Za-z0-9_].

Challenge

`\b[Oo]rgani.e\w{2}\b`

What will the regular expression \b[Oo]rgani.e\w{2}\b match?

Show me the solution

organisers
Organizers
organifers
Organi2ek1

Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with two characters from the range [A-Za-z0-9_].

Challenge

`\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b`

What will the regular expression \b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b match?

Show me the solution

organise
Organi1e
Organizer
organifed

Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, and ends with letter e, or any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with a single character from the range [A-Za-z0-9_].
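The example answers in the challenges above can be checked mechanically. The sketch below, which assumes Python's re module, treats each example answer as a one-line string and confirms that its pattern finds a match; the dictionary layout is just one convenient way to organise the check.

```python
import re

# Patterns and example answers copied from the challenges above.
challenges = {
    r"^[Oo]rgani.e\b": ["organise", "organize", "Organise", "Organize",
                        "organife", "Organike"],
    r"^[Oo]rgani.e\w*": ["organise", "Organize", "organifer", "Organi2ed111"],
    r"[Oo]rgani.e\w+$": ["organiser", "Organized", "organifer", "Organi2ed111"],
    r"^[Oo]rgani.e\w?\b": ["organise", "Organized", "organifer", "Organi2ek"],
    r"^[Oo]rgani.e\w?$": ["organise", "Organized", "organifer", "Organi2ek"],
    r"\b[Oo]rgani.e\w{2}\b": ["organisers", "Organizers", "organifers",
                              "Organi2ek1"],
    r"\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b": ["organise", "Organi1e",
                                              "Organizer", "organifed"],
}

# True for a pattern when every one of its example answers matches.
results = {
    pattern: all(re.search(pattern, example) for example in examples)
    for pattern, examples in challenges.items()
}
all_match = all(results.values())

# A counter-example: two characters after the "e" are one too many for \w?$.
non_match = re.search(r"^[Oo]rgani.e\w?$", "organisers")

print(all_match)  # True
print(non_match)  # None
```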

This logic is useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. It can be used for looking at cells in spreadsheets for certain values, or for extracting some data from a column of a spreadsheet to make new columns. There are many other contexts in which regex is useful when using a computer to search through a document, spreadsheet, or file structure. Some real-world use cases for regex were included in an ACRL Tech Connect blog post, now archived at Library Hat.

To embed this knowledge we will not, however, be using computers. Instead we’ll use pen and paper for now.

Exercise

Work in teams of four to six on the exercises below. When you think you have the right answer, check it against the solution.

\b[Cc]olou?r\b|\bCOLOU?R\b
/colou?r/i

In real life, you should only come across the case insensitive variations colour, color, Colour, Color, COLOUR, and COLOR (rather than, say, coLour). So based on what we know, the logical regular expression is \b[Cc]olou?r\b|\bCOLOU?R\b.

An alternative more elegant option we’ve not discussed is to take advantage of the / delimiters and add an ‘ignore case’ flag. To use these flags, include / delimiters before and after the expression then letters after to raise each flag (where i is ‘ignore case’): so /colou?r/i will match all case insensitive variants of colour and color.
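For readers who want to check both answers later on a computer, here is a hedged Python sketch. Python has no / delimiters, so the ‘ignore case’ flag is passed as re.IGNORECASE; the sample text is made up.

```python
import re

# Hypothetical sample text, including the unlikely spelling "coLour".
text = "colour Color COLOUR coLour"

# The explicit expression only allows the listed capitalisations.
explicit = re.findall(r"\b[Cc]olou?r\b|\bCOLOU?R\b", text)

# Python's counterpart of /colou?r/i.
ignore_case = re.findall(r"colou?r", text, flags=re.IGNORECASE)

print(explicit)     # ['colour', 'Color', 'COLOUR']
print(ignore_case)  # ['colour', 'Color', 'COLOUR', 'coLour']
```

The flag version is shorter but also accepts mixed-case spellings, which may or may not be what you want.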

Challenge

Word boundaries

How would you find the whole word headrest or head rest but not head  rest (that is, with two spaces between head and rest)?

Show me the solution

\bhead ?rest\b

Note that although \bhead\s?rest\b does work, it will also match zero or one tabs or newline characters between head and rest. So again, although in most real world cases it will be fine, it isn’t strictly correct.
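The difference between the two expressions can be seen in a small Python sketch. The sample strings are assumptions chosen to include a tab.

```python
import re

# Hypothetical test strings: no space, one space, two spaces, and a tab.
samples = ["headrest", "head rest", "head  rest", "head\trest"]

# " ?" allows zero or one literal space only.
space_only = [s for s in samples if re.search(r"\bhead ?rest\b", s)]

# "\s?" allows zero or one of ANY whitespace character, including a tab.
any_whitespace = [s for s in samples if re.search(r"\bhead\s?rest\b", s)]

print(space_only)      # ['headrest', 'head rest']
print(any_whitespace)  # ['headrest', 'head rest', 'head\trest']
```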

Challenge

Matching non-linguistic patterns

How would you find a string that ends with four letters preceded by at least one zero?

Show me the solution

0+[A-Za-z]{4}\b

Challenge

Matching digits

How do you match any four-digit string anywhere?

Show me the solution

\d{4}

Note: this will also match four-digit strings within longer strings of numbers and letters.

Challenge

Matching dates

How would you match the date format dd-MM-yyyy?

Show me the solution

\b\d{2}-\d{2}-\d{4}\b

Depending on your data, you may choose to remove the word bounding.

Challenge

Matching multiple date formats

How would you match the date format dd-MM-yyyy or dd-MM-yy at the end of a line only?

Show me the solution

\d{2}-\d{2}-\d{2,4}$

Note this will also find strings such as 31-01-198 at the end of a line, so you may wish to check your data and revise the expression to exclude false positives. Depending on your data, you may choose to add word bounding at the start of the expression.
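The false positive mentioned above can be reproduced in Python. The stricter pattern shown here is only one possible revision (an assumption, not part of the lesson's solution): it uses brackets and | to allow exactly four or exactly two year digits. The sample lines are made up.

```python
import re

# Hypothetical lines of text.
lines = [
    "Due 31-01-1998",
    "Due 31-01-98",
    "Due 31-01-198",
    "31-01-1998 was the date",
]

# The solution from the challenge: two to four year digits at end of line.
loose = re.compile(r"\d{2}-\d{2}-\d{2,4}$")

# One possible revision: exactly four OR exactly two year digits.
strict = re.compile(r"\b\d{2}-\d{2}-(\d{4}|\d{2})$")

loose_hits = [line for line in lines if loose.search(line)]
strict_hits = [line for line in lines if strict.search(line)]

print(loose_hits)   # ['Due 31-01-1998', 'Due 31-01-98', 'Due 31-01-198']
print(strict_hits)  # ['Due 31-01-1998', 'Due 31-01-98']
```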

Challenge

Matching publication formats

How would you match publication formats such as British Library : London, 2015 and Manchester University Press: Manchester, 1999?

Show me the solution

.* ?: .*, \d{4}

Without word boundaries you will find that this matches any text you put before British or Manchester. Nevertheless, the regular expression does a good job on the first look up and may need to be refined on a second, depending on your data.
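A short Python sketch shows the behaviour described here. The leading words "Published by" are made up to stand in for any text that might precede the publisher name.

```python
import re

pattern = re.compile(r".* ?: .*, \d{4}")

# The two formats from the challenge.
first = pattern.search("British Library : London, 2015")
second = pattern.search("Manchester University Press: Manchester, 1999")

# Hypothetical extra text in front: the greedy .* swallows it too.
with_prefix = "Published by British Library : London, 2015"
prefixed = pattern.search(with_prefix)

print(first.group())     # British Library : London, 2015
print(second.group())    # Manchester University Press: Manchester, 1999
print(prefixed.group())  # Published by British Library : London, 2015
```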

Key Points

Regular expressions are a language for pattern matching.

Content from Matching & Extracting Strings

Last updated on 2024-10-15

Overview

Questions

How can you use regular expressions to match and extract strings?

Objectives

Use regular expressions to match words, email addresses, and phone numbers.
Use regular expressions to extract substrings from strings (e.g. addresses).

Exercise Using Regex101.com

For this exercise, open a browser and go to https://regex101.com. Regex101.com is a free regular expression debugger with real time explanation, error detection, and highlighting.

Open the swcCoC.md file, copy the text, and paste that into the test string box.

For a quick test to see if it is working, type the string community into the regular expression box.

[Cc]omm\w+s\b

[Cc] finds capital and lowercase c

omm is a straightforward character match

\w+ matches a word character (\w) one or more times (+)

s is a straightforward character match

\b ensures the ‘s’ is located at the end of the word.

Exercise finding email addresses using regex101.com

For this exercise, open a browser and go to https://regex101.com.

Open the swcCoC.md file, copy it, and paste it into the test string box.

Challenge

Start with what you know

What character do you know is held in common with all email addresses?

Show me the solution

The ‘@’ character.

Challenge

Add to what you know

The string before the “@” could contain any kind of word character, special character or digit in any combination and length. How would you express this in regex? Hint: often addresses will have a dash (-) or dot (.) in them, and neither of these is included in the word character expression (\w). How do you capture this in the expression?

Show me the solution

[\w.-]+@

\w matches any word character (including digits and underscore)

. matches a literal period (when used in between square brackets, . does not mean “any character”, it literally means “.”)

- matches a dash

[] the brackets enclose a set of alternatives: a word character OR a dot OR a dash.

+ matches the preceding bracketed set (any word character, dot, or dash) one or more times

Challenge

Finish the expression

The string after the “@” could contain any kind of word character, special character or digit in any combination and length as well as the dash. In addition, we know that it will have some characters after a period (.). Most common domain names have two or three characters, but many more are now possible (access the latest list). What expression would capture this? Hint: the . is also a metacharacter, so you will have to use the escape \ to express a literal period. Note: for the string after the period, we did not try to match a - character, since those rarely appear in the characters after the period at the end of an email address.

Show me the solution

[\w.-]+\.\w{2,3} OR [\w.-]+\.\w+

See the previous exercise for the explanation of the expression up to the +

\. matches the literal period (‘.’) not the regex expression .

\w matches any word character (including digits and underscore)

{2,3} limits the number of word characters after the period to two or three (first expression)

+ after \w matches one or more word characters after the period (second expression)
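Putting the pieces from the last two challenges together gives one expression for a whole address. The Python sketch below is an assumption about how the parts combine, and the addresses are invented; real email addresses can be more varied than this pattern allows.

```python
import re

# Hypothetical text containing two invented addresses.
text = "Contact jane.doe@example.org or admin@library-services.co.uk for help."

# Part before the "@", the "@" itself, then the part after it,
# ending in a literal period followed by word characters.
email = re.compile(r"[\w.-]+@[\w.-]+\.\w+")

addresses = email.findall(text)

print(addresses)
# ['jane.doe@example.org', 'admin@library-services.co.uk']
```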

Exercise finding phone numbers using regex101.com

Does this Code of Conduct contain a phone number?

What to consider:

It may or may not have a country code, perhaps starting with a “+”.
It will have an area code, potentially enclosed in parentheses.
It may have the sections all separated with a “-”.

Challenge

Start with what you know: find strings of digits

Start with what we know, which is the most basic format of a phone number: three digits, a dash, and four digits. How would we write a regex expression that matches this?

Show me the solution

\d{3}-\d{4}

\d matches digits

{3} matches 3 digits

- matches the character ‘-’

\d matches any digit

{4} matches 4 digits.

This expression should find three matches in the document.

Challenge

Match a string that includes an area code with a dash

Start with what we know, which is the most basic format of a phone number: three digits, a dash, and four digits. How would we expand the expression to include an area code (three digits and a dash)?

Show me the solution

\d{3}-\d{3}-\d{4}

\d matches digits

{3} matches 3 digits

- matches the character ‘-’

\d matches any digit

{4} matches 4 digits.

This expression should find one match in the document

Challenge

Match a string that includes an area code within parentheses, separated from the rest of the phone number with or without a space

Start with what we know, which is the most basic format of a phone number: three digits, a dash, and four digits. How would we expand the expression to include a phone number with an area code in parentheses, separated from the rest of the phone number with or without a space?

Show me the solution

\(\d{3}\) ?\d{3}-\d{4}

\( escape character with the parenthesis as straightforward character match

\d matches digits

{3} matches 3 digits

\) escape character with the parenthesis as a straightforward character match

? matches zero or one spaces

See the previous exercise for the explanation of the rest of the expression.

This expression should find two matches in the document.

Challenge

Match a phone number containing a country code.

Country codes are preceded by a “+” and can have up to three digits. We also have to consider that there may or may not be a space between the country code and anything appearing next.

Show me the solution

\+\d{1,3} ?\(\d{3}\)\s?\d{3}-\d{4}

\+ escape character with the plus sign as straightforward character match

\d matches digits

{1,3} matches 1 to 3 digits

? matches zero or one spaces

See the previous exercise for the explanation of the rest of the expression.

This expression should find one match in the document.
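The four expressions built up in this exercise can be compared side by side in Python. The phone numbers below are invented and are not the ones in the Code of Conduct file, so the counts differ from those quoted above.

```python
import re

# Hypothetical text with invented phone numbers in several formats.
text = ("Call 555-1234, 212-555-1234, (212) 555-1234, "
        "(212)555-1234 or +1 (212) 555-1234.")

# The expressions from the challenges above, from simplest to most specific.
patterns = {
    "basic": r"\d{3}-\d{4}",
    "dashed area code": r"\d{3}-\d{3}-\d{4}",
    "area code in parentheses": r"\(\d{3}\) ?\d{3}-\d{4}",
    "country code": r"\+\d{1,3} ?\(\d{3}\)\s?\d{3}-\d{4}",
}

# Count how many matches each expression finds in the sample text.
counts = {name: len(re.findall(p, text)) for name, p in patterns.items()}

print(counts)
# {'basic': 5, 'dashed area code': 1,
#  'area code in parentheses': 3, 'country code': 1}
```

The basic expression matches the tail of every number, which is why each later expression finds fewer, more specific matches.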

Callout

Using regular expressions when working with files and directories

One of the reasons we stress the value of consistent and predictable directory and filenaming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file names. For example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from ‘2017’, then you can. Or if you have ‘journal’ somewhere in a filename when you have data about journals, you can use the computer to select just those files. Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data within those files. See Workshop Overview: File Naming & Formatting for further background.
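As a rough sketch of selecting files by name, the Python example below filters a list of file names. The names are hypothetical; in practice the list might come from a directory listing.

```python
import re

# Hypothetical file names following a "year first" convention.
filenames = [
    "2017-journal-usage.csv",
    "2016-journal-usage.csv",
    "2017-budget.txt",
    "notes-2017.txt",
]

# ^2017 only matches when the name STARTS with 2017.
from_2017 = [name for name in filenames if re.search(r"^2017", name)]

# "journal" may appear anywhere in the name.
journals = [name for name in filenames if re.search(r"journal", name)]

print(from_2017)  # ['2017-journal-usage.csv', '2017-budget.txt']
print(journals)   # ['2017-journal-usage.csv', '2016-journal-usage.csv']
```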

Extracting a substring in Google Sheets using regex

Challenge

Extracting a substring in Google Sheets using regex

Export and unzip the 2017 Public Library Survey (originally from the IMLS data site) as a CSV file.
Upload the CSV file to Google Sheets and open as a Google Sheet if it does not do this by default.
Look in the ADDRESS column and notice that the values contain the latitude and longitude in parentheses after the library address.
Construct a regular expression to match and extract the latitude and longitude into a new column named ‘latlong’. HINT: Look up the function REGEXEXTRACT in Google Sheets. That function expects the first argument to be a string (a cell in ADDRESS column) and a quoted regular expression in the second.

Show me the solution

This is one way to solve this challenge. You might have found others. Inside the cell you can use the below to extract the latitude and longitude into a single cell. You can then copy the formula down to the end of the column.

=REGEXEXTRACT(G2,"-?\d+\.\d+, -?\d+\.\d+")

Latitude and longitude are in decimal degree format and can be positive or negative, so we start with an optional dash for negative values, then use \d+ for a match of one or more digits followed by a period \.. Note we had to escape the period using \. After the period we look for one or more digits \d+ again, followed by a literal comma ,. We then have a literal space match followed by an optional dash - (there are a few 0.0 latitude/longitudes that are probably errors, but we’d want to retain them so we can deal with them). We then repeat the \d+\.\d+ we used for the latitude match.
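The same expression can be tried outside Google Sheets. The Python sketch below applies it to a single invented address value; the exact layout of the real ADDRESS column is an assumption here.

```python
import re

# Hypothetical ADDRESS value; the real survey data may be laid out differently.
address = "123 Example St, Springfield (39.7817, -89.6501)"

# Same expression as in the REGEXEXTRACT formula above.
match = re.search(r"-?\d+\.\d+, -?\d+\.\d+", address)
latlong = match.group() if match else None

# Optionally split the extracted text into two numbers.
latitude, longitude = (float(part) for part in latlong.split(", "))

print(latlong)              # 39.7817, -89.6501
print(latitude, longitude)  # 39.7817 -89.6501
```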

Key Points

Regular expressions are useful for searching and cleaning data.
Test regular expressions interactively with regex101.com or RegExr.com, and visualize them with regexper.com.
Test yourself with RegexCrossword.com or via the quiz and exercises in this lesson.

Content from Multiple Choice Quiz

Last updated on 2023-05-03

Overview

Questions

How do you find and match strings with regular expressions?

Objectives

Test knowledge of use of regular expressions.

Multiple Choice Quiz

This multiple choice quiz is designed to embed the regex knowledge you learned during this module. We recommend you work through it sometime after class (within a week or so).

Challenge

Q1. What is the special character that matches the preceding character zero or more times?

1. ^
1. #
1. *

Answer

Challenge

Q2. Which of the following matches any space, tab, or newline?

1. \s
1. \b
1. $

Answer

Challenge

Q3. How do you match the string `Confident` appearing at the beginning of a line?

1. $Confident
1. ^Confident
1. #Confident

Answer

Challenge

Q4. How do you match the word `Confidential` appearing at the beginning of a line?

1. ^Confidential\d
1. ^Confidential\b
1. ^Confidential\w

Answer

Challenge

Q5. What does the regular expression `[a-z]` match?

1. The characters a and z only
1. All characters between the ranges a to z and A to Z
1. All characters between the range a to z

Answer

Challenge

Q6. Which of these will match the strings `revolution`, `revolutionary`, and `revolutionaries`?

1. revolution[a-z]?
1. revolution[a-z]*
1. revolution[a-z]+

Answer

Challenge

Q7. Which of these will match the strings `revolution`, `Revolution`, and their plural variants only?

1. [rR]evolution[s]+
1. revolution[s]?
1. [rR]evolution[s]?

Answer

Challenge

Q8. What regular expression matches the strings `dog` or `cat`?

1. dog|cat
1. dog,cat
1. dog | cat

Answer

Challenge

Q9. What regular expression matches the whole words `dog` or `cat`?

1. \bdog|cat\b
1. \bdog\b | \bcat\b
1. \bdog\b|\bcat\b

Answer

Challenge

Q10. What do we put after a character to match strings where that character appears two to four times in sequence?

1. {2,4}
1. {2-4}
1. [2,4]

Answer

Challenge

Q11. The regular expression `\d{4}` will match what?

1. Any four character sequence?
1. Any four digit sequence?
1. The letter d four times?

Answer

Challenge

Q12. If brackets are used to define a group, what would match the regular expression `(,\s[0-9]{1,4}){4},\s[0-9]{1,3}\.[0-9]`?

1. , 135, 1155, 915, 513, 18.8
1. , 135, 11557, 915, 513, 18.8
1. , 135, 1155, 915, 513, 188

Answer

Key Points

Regular expressions answers

Content from Exercises

Last updated on 2023-05-03

Overview

Questions

How do you find and match strings with regular expressions?

Objectives

Test knowledge of use of regular expressions