Data Intro for Archivists: All in One View

Content from Introduction to Library Carpentry

Last updated on 2023-04-24

Overview

Questions

What do archivists gain from code?

Objectives

Explain why software skills are valuable to archivists
Know where to go for help during Library Carpentry

Introduction

Welcome to Library Carpentry! This series of introductory workshops on software skills for librarians and archivists started life as an exploratory programme funded by the Software Sustainability Institute and supported by Software Carpentry and City University London. Thanks also go to the British Library and the University of Sussex where James Baker, who developed the workshops, worked when planning and delivering the workshops. The aim of Library Carpentry is to create a set of tools the community can manage, support, enrich, and reuse as it sees fit. Periodically during the sessions we will collect anonymous feedback that will go into improving the classes and ensuring that they best fit the evolving needs and requirements of the library and information science community.

The rationale for Library Carpentry is twofold. First, as Andromeda Yelton argues in her excellent ALA Library Technology Report ‘Coding for Librarians: learning by example’, code is a means for librarians to take control of practice and to empower themselves and their organisation to meet user needs in flexible ways. Second, librarians play a crucial role in cultivating world class research. And in most research areas today world class research relies on the use of software. Librarians with software skills are then well placed to continue that cultivation of world class research.

Where to go for help

First, identify people on your table who can help: you will all be working from the same material, so someone around you may have figured out the point you are stuck at.

Second, there are helpers on hand to help if those around you can’t. You should all have access to coloured sticky notes: attaching a red sticky note to your laptop indicates that you need help (it might also attract the attention of someone around you!). So, please use them.

Third, each part of Library Carpentry may require you to install software or download data. Breaks are a good time to ask for help.

Fourth, we encourage you to finish up or repeat tasks after class time: if you run into any problems, please report them on the relevant GitHub issues page (see the bottom of each lesson page for a link).

Most Library Carpentry lessons will require you to follow along while your instructor demonstrates a software tool or approach. Sometimes you will fall behind. If you put your red sticky note up on your computer, this lets a helper know you need assistance. Your issue may be specific to your computer. Computers are stupid, can frustrate, and as you all have different machines it can be tricky to resolve problems. Please be patient, particularly if your issue is local. Stepping outside and taking a gulp of fresh air always helps.

Key Points

Don’t be scared to ask for help

Content from Don't think you work with data?

Last updated on 2023-04-24

Overview

Questions

What sort of data do you work with?
What do you do with it?
What tools do you use to help you?

Objectives

Recognise that they work with data
Compare what tasks they perform on data and the tools they use

Don’t think you work with data? Think again

Task 1

This group task is an opportunity for you to think about the sort of data you have, what you do with it, and what tools you use to do that.

Start by getting into pairs.
Brainstorm all the different sorts of data you work with (examples might include metadata, catalogue data, legacy data, data output from DROID etc.)
Your instructor will gather in these ideas and lead a discussion to establish that we are all talking about roughly the same thing when we talk about data
Get into groups of 4-6.
Discuss your own data, trying to answer questions including: How much data do you have? Where is it stored? Who has access to it? How is it formatted or stored? Can you move it about easily, in and out of systems? In particular, think about the tools you use to help you manage your data as well as any problems you have with it.
Each group then reports back on two problems they have with their data.
The instructor will collate these on a whiteboard and facilitate a discussion about: a) how starting to think in terms of data is a good first step for what we will be learning, b) what it is we will be learning, and c) how what we will be learning will help us to solve some of the problems we are facing.

Task 2

This follow-on task aims to guide learners in thinking about data as conceptually separate from the systems that produce, store, and preserve it. It offers an opportunity to think about how data move through archival systems and the value of archival data outside of those systems.

As a group, consider the types of data you discussed in the previous task and select one representative example.
Using sticky notes, map the lifecycle of a data point from the moment of creation to its long-term home or to disposition (long term transfer, destruction, etc.)
Discuss: How many people or organizations have been custodians of the data? How many systems has it moved through? Is there a relationship between the individual(s) creating the data and those who make preservation or disposition decisions? How does the lifecycle of the dataset impact documentation, metadata, or the data itself?
Each group attaches their data lifecycle map to the whiteboard
The instructor will lead a discussion about lifecycles of archival data and highlight the potential value of these data outside of the systems we typically associate with archival data.

Key Points

We all have data and it is not enough just to put it into a system and forget about it

Content from Foundations

Last updated on 2023-04-24

Overview

Questions

What best practice and generic skills underpin your encounters with data and research?

Objectives

identify and use best practice in data structures
identify and understand a data-driven mindset

Foundations

In the last episode, we discussed what we each think of as data. We came up with a lot of different ideas of what data looks like and how it can be used. Before we crack on with using the computational tools at our disposal, I want to spend some time on some foundation level stuff - a combination of best practice and generic skills that frame what you’ll encounter across Archive Carpentry.

Trainer Note: we recommend using this section as an opportunity to discuss foundational skills that you think are relevant.

Data are Collected Through Research

To summarize the brainstorming session that we had in the last episode, data are information collected through research. As archivists, we support research. When we start to think of our collections as data, we can start to support new methods of providing access to our data. Data can be manipulated using automated or computational methods, allowing us to improve our workflows. When approaching our work with a data-aware mindset, we should think of the systems that we are using to do our work.

The computer and the systems inside it are stupid

This does not mean that the computer isn’t useful. Given a repetitive task, an enumerative task, or a task that relies on memory, it can produce results faster, more accurately, and less grudgingly than you or I. Rather, when I say that you should keep in mind that the computer is stupid, I mean that the computer only does what you tell it to. If it throws up an error, it is often not your fault; in most cases, the computer has failed to interpret what you mean because it can only work with what it knows (ergo, it is bad at interpreting). This is not to say that the people who told the computer what to tell you when it doesn’t know what to do couldn’t have done a better job with error messages – they could. So keep in mind as we go along that if you find an error message frustrating, it isn’t the computer’s fault that it is giving you an arcane and incomprehensible error message, it is a human person’s.

The correct language to learn is the one that works in your local context. There truly isn’t a best language, just languages with different strengths and weaknesses, all of which incorporate the same fundamental principles.
Knowing the structure of the interface that you are using will assist you in learning. Databases and computer systems can seem opaque. Knowing what data structures they were built to support can help you to troubleshoot.
Automate to make the time to do something else! Taking the time to gather together even the simplest programming skills can save time to do more interesting stuff (even if that more interesting stuff is often learning more programming skills …).
Understanding the interface can help you to communicate with developers and engineers. Taking the time to gather together even the simplest programming skills can help you to better communicate your needs to developers.

Beyond the Interface

Much of the work that you do with data may be completed through a software interface. Your archival catalog and Excel spreadsheets are interfaces that allow you to view your data more easily. The data itself is organized into structures that many of you will be familiar with, but is much more text-heavy and may not be as simple for humans to read.

Plain text formats are your friend

Why? Because computers can process them! Structures and formats that may be easier for humans to read often cannot be read by computers.

If you want computers to be able to process your stuff, try to get into the habit of using platform-agnostic formats where possible, such as .txt for notes and .csv or .tsv for tabulated data (the latter pair are just spreadsheet formats, separated by commas and tabs respectively). These plain text formats are preferable to the proprietary formats used as defaults by Microsoft Office because they can be opened by many software packages and have a strong chance of remaining viewable and editable in the future. Most standard office suites include the option to save files in .txt, .csv and .tsv formats, meaning you can continue to work with familiar software and still take appropriate action to make your work accessible. Compared to .doc or .xls, these formats have the additional benefit of containing only machine-readable elements.
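As a minimal sketch of why these formats are easy to process, the Python below reads the same small table from comma-separated and tab-separated text using the standard library csv module. The sample records and the read_table helper are invented for illustration and are not part of the lesson.

```python
import csv
import io

# Hypothetical sample data: the same two records as CSV and as TSV.
CSV_TEXT = "title,year\nLetters of a collector,1887\nEstate accounts,1902\n"
TSV_TEXT = "title\tyear\nLetters of a collector\t1887\nEstate accounts\t1902\n"


def read_table(text, delimiter):
    """Parse delimited plain text into a list of dictionaries, one per row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


csv_rows = read_table(CSV_TEXT, ",")
tsv_rows = read_table(TSV_TEXT, "\t")

# Both formats yield the same records; only the separator differs.
print(csv_rows == tsv_rows)
print(csv_rows[0]["title"], csv_rows[0]["year"])
```

Because the files contain nothing but the values and their separators, no special software is needed to get at the data.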

Whilst it is common practice to use bold, italics, and colouring to signify headings or to make a visual connection between data elements, these display-orientated annotations are not (easily) machine-readable, and hence can neither be queried and searched nor are appropriate for large quantities of information (the rule of thumb is, if you can’t find it by CTRL+F, it isn’t machine readable). It is preferable to use standards that signify heading levels, as these standards are not only machine-readable, but also translate easily across web browsers and potential future content migrations.

In archival practice, standards have been developed in order for computers to understand the methods that we use to describe our collections. ISAD(G) – General International Standard Archival Description – has helped archivists to determine how to describe their collections but EAD – Encoded Archival Description – has given archivists a standard way to format their description.

Key Points

data are used in research
archival collections and archival description are data
data structures should be consistent and predictable
consider the standards and structures used in your own data
identify and use computational methods in your work
identify how standards and structures can be used in research

Content from Regular Expressions

Last updated on 2023-04-24

Overview

Questions

How can you imagine using regular expressions in your work?

Objectives

Use regular expressions in searches

Regular Expressions

One of the reasons why I have stressed the value of consistent and predictable directory and filenaming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file name. So, for example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from ‘2014’, then you can. Or if you have ‘journal’ somewhere in a filename when you have data about journals, you can use the computer to select just those files then do something with them. Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data within files.
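The selection described above can be sketched in a few lines of Python. The file names here are hypothetical and simply assume a year-first naming convention; plain string tests are enough at this stage.

```python
# Hypothetical file names following a year-first naming convention.
filenames = [
    "2014-01-31_journal_articles.csv",
    "2014-02-28_accessions.tsv",
    "2015-01-31_journal_articles.csv",
    "notes.txt",
]

# Files whose name starts with the four digits 2014.
from_2014 = [name for name in filenames if name.startswith("2014")]

# Files with 'journal' anywhere in the name.
journals = [name for name in filenames if "journal" in name]

print(from_2014)
print(journals)
```

Both selections only work because the names are predictable; a file called notes.txt tells the computer nothing about its year or subject.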

A powerful means of selecting based on file characteristics is to use regular expressions, often abbreviated to regex. A regular expression is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations. Regular expressions are typically surrounded by / characters, though we will (mostly) ignore those for ease of comprehension. Regular expressions will let you:

Match on types of character (e.g. ‘upper case letters’, ‘digits’, ‘spaces’, etc.)
Match patterns that repeat any number of times
Capture the parts of the original string that match your pattern

As most computational software has regular expression functionality built in and as many computational tasks in libraries are built around complex matching, it is a good place for Library Carpentry to start in earnest.

A very simple use of a regular expression would be to locate the same word spelled two different ways. For example the regular expression organi[sz]e matches both “organise” and “organize”.

But it would also match reorganise, reorganize, organises, organizes, organised, organized, et cetera, because we’ve not specified the beginning or end of our string. So there is a bunch of special syntax that helps us be more precise.
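To see this behaviour for yourself after the session, here is a small sketch using Python's built-in re module (one regex implementation among many). The sample sentence is invented for illustration.

```python
import re

text = "We organise, they organize, and both reorganised the organization."

# organi[sz]e matches wherever the pattern occurs, including inside longer words.
matches = re.findall(r"organi[sz]e", text)
print(matches)

# Adding \w* on either side shows the whole words that contain a match.
containing_words = re.findall(r"\w*organi[sz]e\w*", text)
print(containing_words)
```

Note that "organization" is not matched at all, because the pattern requires an e straight after the s or z.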

The first we’ve seen: square brackets can be used to define a list or range of characters to be found. So:

[ABC] matches A or B or C
[A-Z] matches any upper case letter
[A-Za-z0-9] matches any upper or lower case letter or any digit (note: this is case-sensitive)

Then there are:

. matches any character
\d matches any single digit
\w matches any word character (equivalent to [A-Za-z0-9_])
\s matches any space, tab, or newline
\ NB: this is also used to escape the following character when that character is a special character. So, for example, a regular expression that found .com would be \.com because . is a special character that matches any character.
^ asserts the position at the start of the line. So what you put after it will only match if they are the first characters of a line.
$ asserts the position at the end of the line. So what you put before it will only match if they are the last characters of a line.
\b adds a word boundary. Putting this either side of a word stops the regular expression matching longer variants of words. So:
- the regular expression foobar will match foobar and find 666foobar, foobar777, 8thfoobar8th et cetera
- the regular expression \bfoobar will match foobar and find foobar777
- the regular expression foobar\b will match foobar and find 666foobar
- the regular expression \bfoobar\b will find foobar
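The word-boundary and escaping behaviour listed above can be checked with a short Python sketch. The matching helper is a hypothetical convenience function; re.search is the standard library call that looks for a pattern anywhere in a string.

```python
import re

candidates = ["foobar", "666foobar", "foobar777", "8thfoobar8th"]


def matching(pattern, strings):
    """Return the strings in which the pattern is found somewhere."""
    return [s for s in strings if re.search(pattern, s)]


no_boundary = matching(r"foobar", candidates)
left_boundary = matching(r"\bfoobar", candidates)
right_boundary = matching(r"foobar\b", candidates)
both_boundaries = matching(r"\bfoobar\b", candidates)

print(no_boundary)
print(left_boundary)
print(right_boundary)
print(both_boundaries)

# An unescaped . matches any character; \. matches only a literal full stop.
print(bool(re.search(r".com", "telecom")))   # True: . matches the e
print(bool(re.search(r"\.com", "telecom")))  # False: no literal .com
```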

So, what is ^[Oo]rgani.e\b going to match?

Using special characters in regular expression matches

Can you guess what the regular expression ^[Oo]rgani.e\b will match?

Show me the solution

organise
organize
Organise
Organize
organife
Organike

Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, and ends with the letter e.

Other useful special characters are:

* matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc.
+ matches the preceding element one or more times. For example, ab+c matches “abc”, “abbbc” but not “ac”.
? matches when the preceding character appears zero or one time.
{VALUE} matches the preceding character the number of times defined by VALUE; ranges can be specified with the syntax {VALUE,VALUE}
| means or.
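As a hedged illustration of the characters just listed, the Python sketch below tests a handful of patterns against whole strings. The matches_whole helper and the test strings beyond those in the list above are invented for this example; re.fullmatch only succeeds when the pattern accounts for the entire string.

```python
import re


def matches_whole(pattern, text):
    """Return True when the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


# (pattern, string) pairs mapped to whether we expect a match.
checks = {
    ("ab*c", "ac"): True,          # * allows zero b characters
    ("ab*c", "abbbc"): True,
    ("ab+c", "ac"): False,         # + needs at least one b
    ("ab+c", "abc"): True,
    ("ab?c", "ac"): True,          # ? allows zero or one b
    ("ab?c", "abbc"): False,
    ("ab{2}c", "abbc"): True,      # exactly two b characters
    ("ab{2}c", "abc"): False,
    ("ab{2,3}c", "abbbc"): True,   # two to three b characters
    ("ab{2,3}c", "abbbbc"): False,
    ("cat|dog", "dog"): True,      # | means or
    ("cat|dog", "cow"): False,
}

results = {key: matches_whole(*key) for key in checks}
for (pattern, text), outcome in results.items():
    print(pattern, text, outcome)
```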

So, what are these going to match?

`^[Oo]rgani.e\w*`

Can you guess what the regular expression ^[Oo]rgani.e\w* will match?

Show me the solution

organise
Organize
organifer
Organi2ed111

Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and zero or more word characters (letters, digits, or underscores).

`[Oo]rgani.e\w+$`

Can you guess what the regular expression [Oo]rgani.e\w+$ will match?

Show me the solution

organiser
Organized
organifer
Organi2ed111

Or, any other string that ends a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and one or more word characters (letters, digits, or underscores).

`^[Oo]rgani.e\w?\b`

Can you guess what the regular expression ^[Oo]rgani.e\w?\b will match?

Show me the solution

organise
Organized
organifer
Organi2ek

Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with zero or one word characters (letters, digits, or underscores).

`^[Oo]rgani.e\w?$`

Can you guess what the regular expression ^[Oo]rgani.e\w?$ will match?

Show me the solution

organise
Organized
organifer
Organi2ek

Or, any other string that starts and ends a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and zero or one word characters (letters, digits, or underscores).

`\b[Oo]rgani.e\w{2}\b`

Can you guess what the regular expression \b[Oo]rgani.e\w{2}\b will match?

Show me the solution

organisers
Organizers
organifers
Organi2ek1

Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with two word characters (letters, digits, or underscores).

`\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b`

Can you guess what the regular expression \b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b will match?

Show me the solution

organise
Organi1e
Organizer
organifed

Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, and ends with letter e, or any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with a single word character (a letter, digit, or underscore).

This logic is super useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. Or for looking at cells in spreadsheets for certain values. Or for extracting some data from a column of a spreadsheet to make new columns. I could go on. The point is, it is super useful in many contexts. To embed this knowledge we won’t - however - be using computers. Instead we’ll use pen and paper. Work in teams of 4-6 on the exercises below. When you think you have the right answer, check it against the solution. When you finish, I’d like you to split your team into two groups and write each other some tests. These should include a) strings you want the other team to write regex for and b) regular expressions you want the other team to work out what they would match. Then test each other on the answers. If you want to check your logic, use regex101, myregexp, regex pal, or regexper.com: the first three help you see what text your regular expression will match, the last visualises the workflow of a regular expression.
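If you would rather check your logic in code after the session, the sketch below is one possible approach using Python's re module. The patterns and strings are copied from the worked examples above; the unmatched helper is a hypothetical name introduced here.

```python
import re

# Patterns and example strings taken from the worked examples above.
WORKED_EXAMPLES = {
    r"^[Oo]rgani.e\b": ["organise", "Organize", "organife", "Organike"],
    r"^[Oo]rgani.e\w*": ["organise", "Organize", "organifer", "Organi2ed111"],
    r"[Oo]rgani.e\w+$": ["organiser", "Organized", "organifer", "Organi2ed111"],
    r"\b[Oo]rgani.e\w{2}\b": ["organisers", "Organizers", "organifers", "Organi2ek1"],
}


def unmatched(pattern, strings):
    """Return the strings that the pattern fails to match anywhere."""
    return [s for s in strings if re.search(pattern, s) is None]


# An empty list for every pattern means all the listed solutions match.
failures = {p: unmatched(p, strings) for p, strings in WORKED_EXAMPLES.items()}
for pattern, missed in failures.items():
    print(pattern, "missed:", missed)

# Counter-examples: strings the patterns should reject.
print(unmatched(r"^[Oo]rgani.e\b", ["organised", "reorganise"]))
print(unmatched(r"[Oo]rgani.e\w+$", ["organise"]))
```

Writing the expected matches and the expected rejections side by side is the same exercise as the team tests described above, just automated.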

Exercise

Pair up with the person next to you to work through the following problems.

Using square brackets

Can you guess what the regular expression Fr[ea]nc[eh] will match?

Show me the solution

French
France
Frence
Franch

This will also find words where there are characters either side of the solutions above, such as Francer, foobarFrench, and Franch911.

Using dollar signs

Can you guess what the regular expression Fr[ea]nc[eh]$ will match?

Show me the solution

French
France
Frence
Franch

This will find these strings only at the end of a line. It will also find words where there were characters before these, for example foobarFrench.

Introducing options

What would match the strings French and France only when they appear at the beginning of a line?

Show me the solution

^France|^French

This will also find words where there were characters after French such as Frenchness.

Case insensitivity

How do you match the whole words colour and color (case insensitive)?

Solutions

\b[Cc]olou?r\b|\bCOLOU?R\b
/colou?r/i

In real life, you should only come across the case insensitive variations colour, color, Colour, Color, COLOUR, and COLOR (rather than, say, coLour). So based on what we know, the logical regular expression is \b[Cc]olou?r\b|\bCOLOU?R\b. An alternative more elegant option we’ve not discussed is to take advantage of the / delimiters and add an ignore case flag: so /colou?r/i will match all case insensitive variants of colour and color.
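The two approaches can be compared in Python, which does not use / delimiters; the ignore case flag is passed as re.IGNORECASE instead. This is a sketch with invented sample words, and the flagged pattern adds \b on both sides (not present in /colou?r/i) so that it still matches whole words only.

```python
import re

words = ["colour", "color", "Colour", "COLOR", "coLour", "colourful"]

# Spelling out the expected case variants by hand.
explicit = re.compile(r"\b[Cc]olou?r\b|\bCOLOU?R\b")

# Letting the ignore case flag handle every variant.
ignore_case = re.compile(r"\bcolou?r\b", re.IGNORECASE)

explicit_hits = [w for w in words if explicit.search(w)]
flag_hits = [w for w in words if ignore_case.search(w)]

print(explicit_hits)
print(flag_hits)
```

The only difference is coLour, which the flag catches and the hand-written alternatives do not.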

Word boundaries

How would you find the whole-word headrest or the 2-gram head rest but not head  rest (that is, with two spaces between head and rest)?

Show me the solution

\bhead ?rest\b

Note that although \bhead\s?rest\b does work, it will also match zero or one tabs or newline characters between head and rest. So again, although in most real world cases it will be fine, it isn’t strictly correct.
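The difference between a literal space and \s can be sketched in Python as follows. The sample strings are invented, with \t standing for a tab character.

```python
import re

samples = ["headrest", "head rest", "head  rest", "head\trest"]

space_only = r"\bhead ?rest\b"        # zero or one literal space
any_whitespace = r"\bhead\s?rest\b"   # zero or one space, tab, or newline

space_hits = [s for s in samples if re.search(space_only, s)]
whitespace_hits = [s for s in samples if re.search(any_whitespace, s)]

print(space_hits)
print(whitespace_hits)
```

Neither pattern matches the version with two spaces, but only the \s version accepts the tab.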

Matching non-linguistic patterns

How would you find a string that ends with 4 letters preceded by at least one zero?

Show me the solution

0+[a-z]{4}\b

Matching digits

How do you match any 4 digit string anywhere?

Show me the solution

\d{4}

Note this will also match 4 digit strings within longer strings of numbers and letters.

Matching dates

How would you match the date format dd-MM-yyyy?

Show me the solution

\b\d{2}-\d{2}-\d{4}\b

Depending on your data, you may choose to remove the word bounding.

Matching multiple date formats

How would you match the date format dd-MM-yyyy or dd-MM-yy at the end of a string only?

Show me the solution

\d{2}-\d{2}-\d{2,4}$

Note this will also find strings such as 31-01-198 at the end of a line, so you may wish to check your data and revise the expression to exclude false positives. Depending on your data, you may choose to add word bounding at the start of the expression.
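The false positive mentioned above is easy to reproduce. In the Python sketch below, the sample lines are invented, and the stricter pattern is only one possible revision: it uses round brackets to group two alternatives, a piece of syntax not covered in this lesson.

```python
import re

lines = [
    "Received 31-01-2015",
    "Received 31-01-15",
    "Received 31-01-198",
    "31-01-2015 received",
]

loose = r"\d{2}-\d{2}-\d{2,4}$"

# Assumed revision: the year must be exactly four or exactly two digits.
strict = r"\b\d{2}-\d{2}-(\d{4}|\d{2})$"

loose_hits = [line for line in lines if re.search(loose, line)]
strict_hits = [line for line in lines if re.search(strict, line)]

print(loose_hits)
print(strict_hits)
```

The last line is rejected by both patterns because the date is not at the end of the string.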

Matching publication formats

How would you match publication formats such as British Library : London, 2015 and Manchester University Press: Manchester, 1999?

Show me the solution

.* ?: .*, \d{4}

Without word boundaries you will find that this matches any text you put before British or Manchester. Nevertheless, the regular expression does a good job on the first look up and may need to be refined on a second depending on your data.
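A short Python sketch shows both the successful matches and the over-matching described above. The third and fourth strings are invented for illustration.

```python
import re

pattern = re.compile(r".* ?: .*, \d{4}")

statements = [
    "British Library : London, 2015",
    "Manchester University Press: Manchester, 1999",
    "See also British Library : London, 2015",
    "British Library, London 2015",
]

# Record the matched text for each statement, or None when nothing matches.
found = []
for statement in statements:
    match = pattern.search(statement)
    found.append(match.group() if match else None)

for statement, matched in zip(statements, found):
    print(repr(statement), "->", repr(matched))
```

The third result includes the words "See also", because the leading .* happily consumes any text before the publisher name.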

References

James Baker, “Preserving Your Research Data,” Programming Historian (30 April 2014), http://programminghistorian.org/lessons/preserving-your-research-data.html. The sub-sections ‘Plain text formats are your friend’ and ‘Naming files sensible things is good for you and for your computers’ are reworked from this lesson.

Owen Stephens, “Working with Data using OpenRefine”, Overdue Ideas (19 November 2014), http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/. The section on ‘Regular Expressions’ is reworked from this lesson developed by Owen Stephens on behalf of the British Library.

Andromeda Yelton, “Coding for Librarians: Learning by Example”, Library Technology Reports 51:3 (April 2015), doi: 10.5860/ltr.51n3

Fiona Tweedie, “Why Code?”, The Research Bazaar (October 2014), http://melbourne.resbaz.edu.au/post/95320810834/why-code

Key Points

Regular expressions are powerful tools for pattern matching

Content from Introduction to Data - Multiple Choice Quiz

Last updated on 2023-04-24

Overview

Questions

What does Fr[ea]nc[eh] match?

Objectives

Test knowledge of use of regular expressions in searches

Multiple Choice Quiz

This multiple choice quiz is designed to embed the regex knowledge you learned during this module. We recommend you work through it sometime after class (within a week or so). Answers are on the answer sheet.

Q1. What is the special character that matches the preceding character zero or more times?

A. ^
B. #
C. *

Q2. Which of the following matches any space, tab, or newline?

A. \s
B. \b
C. $

Q3. How do you match the string Foobar appearing at the beginning of a line?

A. $Foobar
B. ^Foobar
C. #Foobar

Q4. How do you match the word Foobar appearing at the beginning of a line?

A. ^Foobar\d
B. ^Foobar\b
C. ^Foobar\w

Q5. What does the regular expression [a-z] match?

A. The characters a and z only
B. All characters between the ranges a to z and A to Z
C. All characters between the range a to z

Q6. Which of these will match the strings revolution, revolutionary, and revolutionaries?

A. revolution[a-z]?
B. revolution[a-z]*
C. revolution[a-z]+

Q7. Which of these will match the strings revolution, Revolution, and their plural variants only?

A. [rR]evolution[s]+
B. revolution[s]?
C. [rR]evolution[s]?

Q8. What regular expression matches the strings dog or cat?

A. dog|cat
B. dog,cat
C. dog | cat

Q9. What regular expression matches the whole words dog or cat?

A. \bdog|cat\b
B. \bdog\b | \bcat\b
C. \bdog\b|\bcat\b

Q10. What do we put after a character to match strings where that character appears 2 to 4 times in sequence?

A. {2,4}
B. {2-4}
C. [2,4]

Q11. The regular expression \d{4} will match what?

A. Any four character sequence?
B. Any four digit sequence?
C. The letter d four times?

Q12. If brackets are used to define a group, what would match the regular expression (,\s[0-9]{1,4}){4},\s[0-9]{1,3}\.[0-9]?

A. , 135, 1155, 915, 513, 18.8
B. , 135, 11557, 915, 513, 18.8
C. , 135, 1155, 915, 513, 188

Key Points

Regular expressions reference guide

Content from Introduction to Data - Multiple Choice Quiz (answers)

Last updated on 2023-04-24

Overview

Questions

What does Fr[ea]nc[eh] match?

Objectives

Test knowledge of use of regular expressions in searches

Library Carpentry Week One: Introduction to Data

Exercise Answers

What does Fr[ea]nc[eh] match?

This matches France, French, Frence, and Franch. It would also find words where there were characters either side of these, such as Francer, foobarFrench, or Franch911.

What does Fr[ea]nc[eh]$ match?

This matches France, French, Frence, and Franch at the end of a line. It would also find words where there were characters before these, such as foobarFrench.

What would match the strings French and France only when they appear at the beginning of a line?

^France|^French This would also find words where there were characters after French such as Frenchness.

How do you match the whole words colour and color (case insensitive)?

In real life, you should only come across the case insensitive variations colour, color, Colour, Color, COLOUR, and COLOR (rather than, say, coLour). So one option would be \b[Cc]olou?r\b|\bCOLOU?R\b. This can, however, quickly get quite complex. An option we’ve not discussed is to take advantage of the / delimiters and add an ignore case flag: so /colou?r/i will match all case insensitive variants of colour and color.

How would you find the whole-word headrest or the 2-gram head rest but not head  rest (that is, with two spaces between head and rest)?

\bhead ?rest\b. Note that although \bhead\s?rest\b does work, it would also match zero or one tabs or newline characters between head and rest. In most real world cases it should, however, be fine :)

How would you find a 4 letter word that ends a string and is preceded by at least one zero?

0+[a-z]{4}\b

How do you match any 4 digit string anywhere?

\d{4}. Note this will match 4 digit strings only but will find them within longer strings of numbers.

How would you match the date format dd-MM-yyyy?

\b\d{2}-\d{2}-\d{4}\b In most real world situations, you are likely to want word bounding here (but it may depend on your data).

How would you match the date format dd-MM-yyyy or dd-MM-yy at the end of a string only?

\d{2}-\d{2}-\d{2,4}$

How would you match publication formats such as British Library : London, 2015 and Manchester University Press: Manchester, 1999?

.* ?: .*, \d{4} You will find that this matches any text you put before British or Manchester. In this case, this regular expression does a good job on the first look up and may need to be refined on a second depending on your real world application.

Multiple Choice Quiz Answers

Q1. C
Q2. A
Q3. B
Q4. B
Q5. C
Q6. B
Q7. C
Q8. A
Q9. C
Q10. A
Q11. B
Q12. A
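
The answer to Q12 can be confirmed with a short Python sketch. The option strings are copied from the quiz; re.fullmatch is used here on the assumption that the whole option should be matched, not just part of it.

```python
import re

# The grouped pattern from Q12: four repeats of ", " plus 1-4 digits,
# then ", " plus 1-3 digits, a literal full stop, and one more digit.
pattern = re.compile(r"(,\s[0-9]{1,4}){4},\s[0-9]{1,3}\.[0-9]")

options = {
    "A": ", 135, 1155, 915, 513, 18.8",
    "B": ", 135, 11557, 915, 513, 18.8",
    "C": ", 135, 1155, 915, 513, 188",
}

matching_options = [
    label for label, text in options.items() if pattern.fullmatch(text)
]
print(matching_options)
```

Option B fails because 11557 has five digits, and option C fails because it has no full stop before the final digit.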

Key Points

Regular expressions answer sheet