Transformations - Handling Arrays
Last updated on 2024-06-13 | Edit this page
Estimated time: 20 minutes
Overview
Questions
- How do I use Arrays in data transformation?
Objectives
- Understand the purpose of Arrays in OpenRefine
- Use arrays as part of transformations in GREL
Preview
The following example is chosen to demonstrate how to go from a list with duplicated values to a list with each value just once by using an array in a transformation.
Ask the students what transformation means to them currently. Many may only know it from Excel to convert columns into rows or vice versa. Discuss how in OpenRefine, transformation is specifically the working window–these values are neither stored nor displayed in the cells or output.
It does this using a function called uniques() which can be used to remove duplicates from an array. In this example’s Subject column, we start with a list of subject words:
crystal structure|clozapinium|crystal structure|molecular configuration|hydrogen bonding|supramolecular assembly|Chemistry|QD1-999
Examining this by eye we can see it contains “crystal structure” twice. If we assume that each cell in the subject column might have duplicates in it, and in each case the subject word/phrase that is duplicated could be different, then it’s not practical to “fix” this problem (remove the duplicates from each) by find and replace. However, we can do it using an array. The lesson’s goal is: an array as something you can manipulate.
To remove the repetition we show how to do a GREL transformation like:
value.split("|").uniques().join("|")
In total this transformation does three steps:
-
split("|")
creates an array ->["crystal structure", "clozapinium", "crystal structure", "molecular configuration", "hydrogen bonding", "supramolecular assembly", "Chemistry", "QD1-999"]
-
uniques()
takes the array created, and from it generates an array with any duplicates removed (so each value in the resulting array is unique) so the result is ->["crystal structure", "clozapinium", "molecular configuration", "hydrogen bonding", "supramolecular assembly", "Chemistry", "QD1-999"]
-
join("|")
turns the array created by the uniques() command back into a string of pipe separated values which can be stored in a cell ->crystal structure|clozapinium|molecular configuration|hydrogen bonding|supramolecular assembly|Chemistry|QD1-999
Let us now move from a list with duplicated values to a list with each value just once using an array in transformation.
Arrays
An ‘Array’ is a data type (as mentioned in the previous lesson) which can contain a list of values. In OpenRefine an array is represented by the use of square brackets containing a list of values separated by commas.
For example:
- an array containing a list of strings (in this case subject keywords
or phrases) could look like:
["crystal structure", "clozapinium", "crystal structure", "molecular configuration", "hydrogen bonding", "supramolecular assembly", "Chemistry", "QD1-999"]
- an array containing a list of numbers could look like:
[1, 2, 3, 4]
Arrays can be sorted, de-duplicated, and manipulated in other ways in
GREL expressions, but cannot be stored directly in an OpenRefine cell.
Arrays in OpenRefine are usually the result of a transformation written
with GREL. For example the split
function takes a string,
and changes it into an array based on a ‘separator’. For example if a
cell has the value:
"crystal structure|clozapinium|crystal structure|molecular configuration|hydrogen bonding|supramolecular assembly|Chemistry|QD1-999"
This can be transformed into an array using the split
function specifying the pipe character ( | ) as the separating
character. Recall the cautionary note about separator choice from Working with
Data.
value.split("|")
This would create the array containing a list of subject headings, separated by a pipe character | (as in the first example above). In the transformation preview the array will display as a list of comma separated values in double quotes, with the whole array surrounded by square brackets.
This subject string can be found for the title “The crystal structures of three clozapinium salts: different molecular configurations, and supramolecular assembly in one, two and three dimensions” in the original project.
This can be combined with array operations like uniques
.
For example, assuming the cell contains the same value as above, then
the function
value.split("|").uniques()
would result in the following array: [“crystal structure”, “clozapinium”, “molecular configuration”, “hydrogen bonding”, “supramolecular assembly”, “Chemistry”, “QD1-999”]
Compared to the first example, now the second ‘crystal structure’ has been removed.
You can extract a specific item from the array using the square bracket notation and number for position in sequence:
value.split("|")[0]
would result in the string: “crystal structure”
You can also sort and join arrays together to make a ‘String’. The GREL expression would look like
value.split("|").uniques().sort().join("|")
Taking the same example again, this would result in a string with the subjects in alphabetical order (note that the sort function is case-sensitive), listed with pipes between each subject.
Recall previous discussion of dangers of changing separators and ensuring you avoid using a separator character that is already used in the text. A possible question to pose to learners could be: Which subject would be broken if a hyphen were used as a separator?
Key Points
- Arrays cannot appear directly in an OpenRefine cell
- Arrays can be used in many ways using GREL expressions