Introduction to R
Last updated on 2024-03-12
Estimated time: 80 minutes
Overview
Questions
- What is an object?
- What is a function and how can we pass arguments to functions?
- How can values be initially assigned to variables of different data types?
- How can a vector be created? What are the available data types?
- How can subsets be extracted from vectors?
- How does R treat missing values?
- How can we deal with missing values in R?
Objectives
- Assign values to objects in R.
- Learn how to name objects.
- Use comments to annotate scripts.
- Solve simple arithmetic operations in R.
- Call functions and use arguments to change their default options.
- Inspect the content of vectors and manipulate their content.
- Subset and extract values from vectors.
- Analyze vectors with missing data.
- Define the following terms as they relate to R: object, vector, assign, call, function.
Creating objects in R
You can get output from R simply by typing math in the console:
R
3 + 5
OUTPUT
[1] 8
R
7 * 2 # multiply 7 by 2
OUTPUT
[1] 14
R
sqrt(36) # take the square root of 36
OUTPUT
[1] 6
However, to do useful and interesting things, we need to assign values to objects. To create an object, we need to give it a name followed by the assignment operator <-, and the value we want to give it:
R
time_minutes <- 5 # assign the number 5 to the object time_minutes
<- is the assignment operator. It assigns values on the right to objects on the left. Here we are creating a symbol called time_minutes and assigning it the numeric value 5. Some R users would say “time_minutes gets 5.” time_minutes is now a numeric vector with one element. Or you could say time_minutes is a numeric vector, and the first element is the number 5.
When you assign something to a symbol, nothing happens in the console, but in the Environment pane in the upper right, you will notice a new object, time_minutes.
In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.
Objects can be given any name such as x, checkouts, or isbn. You want your object names to be explicit and not too long. Here are some tips for assigning values:
- Do not use names of functions that already exist in R: There are some names that cannot be used because they are the names of fundamental functions in R (e.g., if, else, for; see here for a complete list). In general, even if it’s allowed, it’s best to not use other function names (e.g., c, T, mean, data, df, weights). If in doubt, check the help to see if the name is already in use.
- R is case sensitive: age is different from Age and y is different from Y.
- No blank spaces or symbols other than underscores: R users get around this in a couple of ways, either through capitalization (e.g. myData) or underscores (e.g. my_data). It’s also best to avoid dots (.) within an object name as in my.dataset. There are many functions in R with dots in their names for historical reasons, but dots have a special meaning in R (for methods) and other programming languages.
- Do not begin with numbers or symbols: 2x is not valid, but x2 is.
- Be descriptive, but make your variable names short: It’s good practice to be descriptive with your variable names. If you’re loading in a lot of data, choosing myData or x as a name may not be as helpful as, say, ebookUsage. Finally, keep your variable names short, since you will likely be typing them in frequently.
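The naming tips above can be tried directly in the console. The following is a small sketch; the object names and values (age, Age, ebook_usage, x2) are made up for illustration and are not part of the lesson's data.

```r
# R is case sensitive: these are two separate objects
age <- 30
Age <- 41
identical(age, Age)   # FALSE, because the two objects hold different values

# underscores are allowed, and digits may appear after the first character
ebook_usage <- 1200
x2 <- 4

# a name that starts with a digit is a syntax error, so it is left commented out
# 2x <- 4
```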
Objects vs. variables
What are known as objects in R are known as variables in many other programming languages. Depending on the context, object and variable can have drastically different meanings. However, in this lesson, the two words are used synonymously. For more information see: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects
Evaluating Expressions
If you now type time_minutes into the console, and press Enter on your keyboard, R will evaluate the expression. In this case, R will print the elements that are assigned to time_minutes (the number 5). We can do this easily since time_minutes only has one element, but if you do this with a large dataset loaded into R, it will overload your console because it will print the entire thing. The [1] indicates that the number 5 is the first element of this vector.
When assigning a value to an object, R does not print anything to the console. You can force R to print the value by using parentheses or by typing the object name:
R
time_minutes <- 5 # doesn't print anything
(time_minutes <- 5) # putting parentheses around the call prints the value of time_minutes
OUTPUT
[1] 5
R
time_minutes # so does typing the name of the object
OUTPUT
[1] 5
R
print(time_minutes) # so does using the print() function.
OUTPUT
[1] 5
Now that R has time_minutes in memory, we can do arithmetic with it. For instance, we may want to convert it into seconds (60 seconds in 1 minute):
R
60 * time_minutes
OUTPUT
[1] 300
We can also change an object’s value by assigning it a new one:
R
time_minutes <- 10
60 * time_minutes
OUTPUT
[1] 600
This overwrites the previous value without prompting you, so be careful! Also, assigning a value to one object does not change the values of other objects. For example, let’s store the time in seconds in a new object, time_seconds:
R
time_seconds <- 60 * time_minutes
Then change time_minutes to 30:
R
time_minutes <- 30
The value of time_seconds is still 600 because you have not re-run the line time_seconds <- 60 * time_minutes since changing the value of time_minutes.
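The whole sequence can be run as one block to see this behaviour. This is a sketch that simply repeats the assignments from above and then re-runs the calculation.

```r
time_minutes <- 10
time_seconds <- 60 * time_minutes   # computed once, while time_minutes is 10

time_minutes <- 30                  # changing time_minutes afterwards...
time_seconds                        # ...leaves time_seconds at 600

time_seconds <- 60 * time_minutes   # re-running the assignment updates it
time_seconds                        # now 1800
```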
Exercise
Create two variables my_length and my_width and assign them any numeric values you want. Create a third variable my_area and give it a value based on the multiplication of my_length and my_width. Show that changing the values of either my_length or my_width does not affect the value of my_area.
R
my_length <- 2.5
my_width <- 3.2
my_area <- my_length * my_width
my_area
OUTPUT
[1] 8
R
# change the values of my_length and my_width
my_length <- 7.0
my_width <- 6.5
# the value of my_area isn't changed
my_area
OUTPUT
[1] 8
Removing objects from the environment
To remove an object from your R environment, use the rm() function. Remove multiple objects with rm(list = c("add", "objects", "here")), adding the objects in c() using quotation marks. To remove all objects, use rm(list = ls()) or click the broom icon in the Environment Pane, next to “Import Dataset.”
R
x <- 5
y <- 10
z <- 15
rm(x) # remove x
rm(list = c("y", "z")) # remove y and z
rm(list = ls()) # remove all objects
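One way to confirm that an object is gone is the base R function exists(), which takes the object's name as a character string. This is a minimal sketch; tmp_count is a hypothetical object created only for this check.

```r
tmp_count <- 5            # hypothetical object, created only for this check
exists("tmp_count")       # TRUE: the object is in the environment
rm(tmp_count)
exists("tmp_count")       # FALSE: the object has been removed
```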
Functions and their arguments
R is a “functional programming language,” meaning it contains a number of functions you use to do something with your data. Functions are “canned scripts” that automate more complicated sets of commands. Many functions are predefined, or can be made available by importing R packages as we saw in the “Before We Start” lesson.
Call a function on a variable by entering the function name into the console, followed by parentheses containing the variables. A function usually gets one or more inputs called arguments. For example, if you want to take the sum of 3 and 4, you can type in sum(3, 4). In this case, the arguments must be numbers, and the return value (the output) is the sum of those numbers. An example of a function call is:
R
sum(3, 4)
The function is.function() will check if an argument is a function in R. If it is a function, it will print TRUE to the console.
Functions can be nested within each other. For example, sqrt() takes the square root of the number provided in the function call. Therefore you can run sum(sqrt(9), 4) to take the square root of 9 and add it to 4.
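A nested call is equivalent to computing the inner result first and passing it on. The sketch below shows both forms; the object names root, total, and nested_total are chosen here for illustration.

```r
# step by step, storing the intermediate result
root <- sqrt(9)          # 3
total <- sum(root, 4)    # 7

# nested: the inner call is evaluated first and its result is passed to sum()
nested_total <- sum(sqrt(9), 4)
nested_total == total    # TRUE
```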
Typing a question mark before a function will pull the help page up in the Navigation Pane in the lower right. Type ?sum to view the help page for the sum function. You can also call help(sum). This will provide the description of the function, how it is to be used, and the arguments.
In the case of sum(), the ellipsis ... represents an unlimited number of numeric elements.
R
is.function(sum) # check to see if sum() is a function
sum(3, 4, 5, 6, 7) # sum takes an unlimited number (...) of numeric elements
Arguments
Some functions take arguments which may either be specified by the user, or, if left out, take on a default value. However, if you want something specific, you can specify a value of your choice which will be used instead of the default. This is called passing an argument to the function.
For example, sum() takes the optional argument na.rm. If you check the help page for sum (call ?sum), you can see that na.rm requires a logical (TRUE/FALSE) value specifying whether NA values (missing data) should be removed when the sum is calculated.
By default, na.rm is set to FALSE, so evaluating a sum with missing values will return NA:
R
sum(3, 4, NA)
OUTPUT
[1] NA
Even though we do not see the argument here, it is operating in the background, as the NA value remains. 3 + 4 + NA is NA.
But setting the argument na.rm to TRUE will remove the NA:
R
sum(3, 4, NA, na.rm = TRUE)
OUTPUT
[1] 7
It is very important to understand the different arguments that functions take, the values that can be passed to those functions, and the default arguments. Arguments can be anything, not only TRUE or FALSE, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation.
It’s good practice to put the non-optional arguments first in your function call, and to specify the names of all optional arguments. If you don’t, someone reading your code might have to look up the definition of a function with unfamiliar arguments to understand what you’re doing.
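As an illustration of naming optional arguments, consider the base R function round(), which is not used elsewhere in this lesson and is chosen here only as an example. Its optional argument digits defaults to 0.

```r
# positional: the reader has to know that the second argument is 'digits'
round(3.14159, 2)

# named: the intent is clear without consulting the help page
round(3.14159, digits = 2)

# leaving the optional argument out falls back to the default, digits = 0
round(3.14159)
```

Both of the first two calls return the same value; the named form is simply easier to read.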
Vectors and data types
A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is a sequence of elements of the same type: vectors can only contain “homogeneous” data. The type of a vector determines what kind of analysis you can do on it. For example, you can perform mathematical operations on numeric objects, but not on character objects.
We can assign a series of values to a vector using the c() function. c() stands for combine. If you read the help files for c() by calling help(c), you can see that it takes an unlimited number (...) of arguments.
For example, we can create a vector of checkouts for a collection of books and assign it to a new object checkouts:
R
checkouts <- c(25, 15, 18)
checkouts
OUTPUT
[1] 25 15 18
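Because checkouts is numeric, arithmetic works on it directly, one element at a time. The lines below are a short sketch of that; the commented-out line shows the kind of operation that would fail on a character vector.

```r
checkouts <- c(25, 15, 18)

checkouts * 2     # each element is doubled: 50 30 36
sum(checkouts)    # total number of checkouts: 58

# arithmetic on character data is an error, so this line is left commented out
# c("Macbeth", "Dracula") * 2
```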
A vector can also contain characters. For example, we can have a vector of the book titles (title) and authors (author):
R
title <- c("Macbeth","Dracula","1984")
The quotes around “Macbeth”, etc. are essential here. Without the quotes R will assume there are objects called Macbeth and Dracula in the environment. As these objects don’t yet exist in R’s memory, there will be an error message.
There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:
R
length(checkouts) # print the number of values in the checkouts vector
OUTPUT
[1] 3
An important feature of a vector is that all of the elements are the same type of data. The function class() indicates the class (the type of element) of an object:
R
class(checkouts)
OUTPUT
[1] "numeric"
R
class(title)
OUTPUT
[1] "character"
Type ?str into the console to read the description of the str function. You can call str() on an R object to compactly display information about it, including the data type, the number of elements, and a printout of the first few elements.
R
str(checkouts)
OUTPUT
num [1:3] 25 15 18
R
str(title)
OUTPUT
chr [1:3] "Macbeth" "Dracula" "1984"
You can use the c() function to add other elements to your vector:
R
author <- "Stoker"
author <- c(author, "Orwell") # add to the end of the vector
author <- c("Shakespeare", author)
author
OUTPUT
[1] "Shakespeare" "Stoker" "Orwell"
In the first line, we create a character vector author with a single value "Stoker". In the second line, we add the value "Orwell" to it, and save the result back into author. Then we add the value "Shakespeare" to the beginning, again saving the result back into author.
We can do this over and over again to grow a vector, or assemble a dataset. As we program, this may be useful to add results that we are collecting or calculating.
An atomic vector is the simplest R data type and is a linear vector of a single type. Above, we saw 2 of the 6 main atomic vector types that R uses: "character" and "numeric" (or "double"). These are the basic building blocks that all R objects are built from. The other 4 atomic vector types are:
- "logical" for TRUE and FALSE (the boolean data type)
- "integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
- "complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that’s all we’re going to say about them
- "raw" for bitstreams that we won’t discuss further
You can check the type of your vector using the typeof() function and inputting your vector as the argument.
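As a quick sketch, typeof() can be called on one small example of each type discussed above. The values are taken from the examples in this section.

```r
typeof(c(25, 15, 18))   # "double"
typeof("Macbeth")       # "character"
typeof(TRUE)            # "logical"
typeof(2L)              # "integer"
typeof(1 + 4i)          # "complex"

# class() reports "numeric" for doubles, while typeof() reports "double"
class(c(25, 15, 18))    # "numeric"
```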
Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), factors (factor) and arrays (array).
When values of different types are combined in one vector, R implicitly converts them to all be the same type.
Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a “common denominator” that doesn’t lose any information.
There is no memory of past data types, and the coercion happens the first time the vector is evaluated. Therefore, if a vector num_logical that mixes numbers with a TRUE is later combined with character values into combined_logical, the TRUE in num_logical gets converted into a 1 before it gets converted into "1" in combined_logical.
You’ve probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. This hierarchy is: logical < integer < numeric < complex < character < list.
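The coercion hierarchy can be observed by combining values of different types with c(). The objects num_logical and combined_logical are named in the text but not defined in this section, so the definitions below (and the helper char_logical) are assumptions made for illustration.

```r
# assumed definitions, for illustration only
num_logical <- c(1, 2, 3, TRUE)          # TRUE is coerced to the number 1
char_logical <- c("a", "b", "c", TRUE)   # TRUE is coerced to the string "TRUE"
combined_logical <- c(num_logical, char_logical)

class(num_logical)     # "numeric"
class(char_logical)    # "character"
combined_logical       # "1" "2" "3" "1" "a" "b" "c" "TRUE"
```

Under these assumed definitions, the TRUE from num_logical ends up as "1", while the TRUE from char_logical ends up as "TRUE".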
You can also coerce a vector to be a specific data type with as.character(), as.logical(), as.numeric(), etc. For example, to coerce a number to a character:
R
x <- as.character(200)
We can test this in a few ways: if we print x to the console, we see quotation marks around it, letting us know it is a character:
R
x
OUTPUT
[1] "200"
We can also call class():
R
class(x)
OUTPUT
[1] "character"
And if we try to add a number to x, we will get the error message non-numeric argument to binary operator. In other words, x is non-numeric and cannot be added to a number.
R
x + 5
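To do arithmetic with a value like this, it can be coerced back to a number first. This is a brief sketch using as.numeric(); the name x_number is chosen here for illustration.

```r
x <- as.character(200)
x_number <- as.numeric(x)   # coerce the character value back to a number
class(x_number)             # "numeric"
x_number + 5                # 205
```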
Subsetting vectors
If we want to subset (or extract) one or several values from a vector, we must provide one or several indices in square brackets. For this example, we will use the state data, which is built into R and includes data related to the 50 states of the U.S.A. Type ?state to see the included datasets. state.name is a built-in vector in R of all U.S. states:
R
state.name
OUTPUT
[1] "Alabama" "Alaska" "Arizona" "Arkansas"
[5] "California" "Colorado" "Connecticut" "Delaware"
[9] "Florida" "Georgia" "Hawaii" "Idaho"
[13] "Illinois" "Indiana" "Iowa" "Kansas"
[17] "Kentucky" "Louisiana" "Maine" "Maryland"
[21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
[25] "Missouri" "Montana" "Nebraska" "Nevada"
[29] "New Hampshire" "New Jersey" "New Mexico" "New York"
[33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
[37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
[41] "South Dakota" "Tennessee" "Texas" "Utah"
[45] "Vermont" "Virginia" "Washington" "West Virginia"
[49] "Wisconsin" "Wyoming"
R
state.name[1]
OUTPUT
[1] "Alabama"
You can use the : colon to create a vector of consecutive numbers.
R
state.name[1:5]
OUTPUT
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
If the numbers are not consecutive, you must use the c() function:
R
state.name[c(1, 10, 20)]
OUTPUT
[1] "Alabama" "Georgia" "Maryland"
We can also repeat the indices to create an object with more elements than the original one:
R
state.name[c(1, 2, 3, 2, 1, 3)]
OUTPUT
[1] "Alabama" "Alaska" "Arizona" "Alaska" "Alabama" "Arizona"
R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
Conditional subsetting
Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:
R
five_states <- state.name[1:5]
five_states[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
OUTPUT
[1] "Alabama" "Arizona" "California"
Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. state.area is a vector of state areas in square miles. We can use the < operator to return a logical vector with TRUE for the indices that meet the condition:
R
state.area < 10000
OUTPUT
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[49] FALSE FALSE
R
state.area[state.area < 10000]
OUTPUT
[1] 5009 2057 6450 8257 9304 7836 1214 9609
The first expression gives us a logical vector of length 50, where TRUE represents those states with areas less than 10,000 square miles. The second expression subsets state.area to include only those areas where the value is TRUE.
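The same logical vector can be stored and reused to subset a different vector of the same length. In the sketch below, the name small is chosen for illustration; state.area and state.name are the built-in vectors used above.

```r
small <- state.area < 10000   # logical vector, one value per state
state.name[small]             # names of the states under 10,000 square miles
sum(small)                    # TRUE counts as 1, so this is the number of such states: 8
```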
You can also specify character values. state.region
gives the region that each state belongs to:
R
state.region == "Northeast"
OUTPUT
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
[37] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[49] FALSE FALSE
R
state.name[state.region == "Northeast"]
OUTPUT
[1] "Connecticut" "Maine" "Massachusetts" "New Hampshire"
[5] "New Jersey" "New York" "Pennsylvania" "Rhode Island"
[9] "Vermont"
Again, the first expression gives a TRUE/FALSE index of all 50 states, marking those whose region is the Northeast, and the second subsets state.name to return only the names where that value is TRUE.
Sometimes you need to do multiple logical tests (think Boolean logic). You can combine multiple tests using | (at least one of the conditions is true, OR) or & (both conditions are true, AND). Use help(Logic) to read the help file.
R
state.name[state.area < 10000 | state.region == "Northeast"]
OUTPUT
[1] "Connecticut" "Delaware" "Hawaii" "Maine"
[5] "Massachusetts" "New Hampshire" "New Jersey" "New York"
[9] "Pennsylvania" "Rhode Island" "Vermont"
R
state.name[state.area < 10000 & state.region == "Northeast"]
OUTPUT
[1] "Connecticut" "Massachusetts" "New Hampshire" "New Jersey"
[5] "Rhode Island" "Vermont"
The first result includes both states with fewer than 10,000 sq. mi. and all states in the Northeast. New York, Pennsylvania, and Maine have areas greater than 10,000 square miles, but are in the Northeastern U.S. Delaware and Hawaii are not in the Northeast region, but each has fewer than 10,000 square miles. The second result includes only states that are in the Northeast and have fewer than 10,000 sq. mi.
R contains a number of operators you can use to compare values. Use help(Comparison) to read the R help file. Note that two equal signs (==) are used for evaluating equality (because one equals sign (=) is used for assigning variables).
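A few of the comparison operators are sketched below on the checkouts vector from earlier in the lesson. Each comparison returns one TRUE or FALSE per element.

```r
checkouts <- c(25, 15, 18)

checkouts == 15   # FALSE  TRUE FALSE
checkouts != 15   #  TRUE FALSE  TRUE
checkouts >= 18   #  TRUE FALSE  TRUE
```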
A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The %in% operator tests, for each element of the vector on its left, whether it is found in the search vector on its right:
R
west_coast <- c("California", "Oregon", "Washington")
state.name[state.name == "California" | state.name == "Oregon" | state.name == "Washington"]
OUTPUT
[1] "California" "Oregon" "Washington"
R
state.name %in% west_coast
OUTPUT
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[49] FALSE FALSE
R
state.name[state.name %in% west_coast]
OUTPUT
[1] "California" "Oregon" "Washington"
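The result of %in% always has one value per element of the left-hand vector, so the order of the two sides matters. The sketch below illustrates this; "Ontario" is included only as an example of a value that is not a U.S. state.

```r
west_coast <- c("California", "Oregon", "Washington")

# one result per element of the left-hand vector
c("Oregon", "Ontario") %in% state.name   # TRUE FALSE

# with state.name on the left there are 50 results, 3 of them TRUE
length(state.name %in% west_coast)       # 50
sum(state.name %in% west_coast)          # 3
```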
Missing data
As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as NA. R functions have special actions when they encounter NA.
When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This feature makes it harder to overlook the cases where you are dealing with missing data. As we saw above, you can add the argument na.rm=TRUE to calculate the result while ignoring the missing values.
R
rooms <- c(2, 1, 1, NA, 4)
mean(rooms)
OUTPUT
[1] NA
R
max(rooms)
OUTPUT
[1] NA
R
mean(rooms, na.rm = TRUE)
OUTPUT
[1] 2
R
max(rooms, na.rm = TRUE)
OUTPUT
[1] 4
If your data include missing values, you may want to become familiar with the functions is.na(), na.omit(), and complete.cases(). See below for examples.
R
## Use any() to check if any values are missing
any(is.na(rooms))
OUTPUT
[1] TRUE
R
## Use table() to tell you how many are missing vs. not missing
table(is.na(rooms))
OUTPUT
FALSE TRUE
4 1
R
## Identify those elements that are not missing values.
complete.cases(rooms)
OUTPUT
[1] TRUE TRUE TRUE FALSE TRUE
R
## Identify those elements that are missing values.
is.na(rooms)
OUTPUT
[1] FALSE FALSE FALSE TRUE FALSE
R
## Extract those elements that are not missing values.
rooms[complete.cases(rooms)]
OUTPUT
[1] 2 1 1 4
You can also use !is.na(rooms), which for a vector gives exactly the same result as complete.cases(rooms). The exclamation mark indicates logical negation.
R
!c(TRUE, FALSE)
OUTPUT
[1] FALSE TRUE
How you deal with missing data in your analysis is a decision you will have to make: do you remove it entirely? Do you replace it with zeros? That will depend on your own methodological questions.
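The choice matters because it changes the result. The sketch below compares two options on the small rooms vector from above; replacing NA with zero is shown only for comparison and is an assumption about the data, not a recommendation. The name rooms_zero is chosen for illustration.

```r
rooms <- c(2, 1, 1, NA, 4)

# option 1: drop the missing value before averaging
mean(rooms, na.rm = TRUE)            # 2

# option 2: treat the missing value as zero (an assumption about the data)
rooms_zero <- rooms
rooms_zero[is.na(rooms_zero)] <- 0   # replace every NA with 0
mean(rooms_zero)                     # 1.6
```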
R
# 1. remove the missing values
rooms <- c(1, 2, 1, 1, NA, 3, 1, 3, 2, 1, 1, 8, 3, 1, NA, 1)
rooms_no_na <- rooms[!is.na(rooms)]
# or
rooms_no_na <- na.omit(rooms)
# 2. median of the values, ignoring the missing ones
median(rooms, na.rm = TRUE)
OUTPUT
[1] 1
R
# 3. count the values greater than 2
rooms_above_2 <- rooms_no_na[rooms_no_na > 2]
length(rooms_above_2)
OUTPUT
[1] 4
Now that we have learned how to write scripts, and the basics of R’s data structures, we are ready to start working with the library catalog dataset and learn about data frames.
Key Points
- Use the assignment operator <- to assign values to objects. You can then manipulate those objects in R.
- R contains a number of functions you use to do something with your data. Functions automate more complicated sets of commands. Many functions are predefined, or can be made available by importing R packages.
- A vector is a sequence of elements of the same type. All data in a vector must be of the same type: character, numeric (or double), integer, or logical. Create vectors with c(). Use [ ] to subset values from vectors.
Comments
All programming languages allow the programmer to include comments in their code. To do this in R we use the # character. Anything to the right of the # sign and up to the end of the line is treated as a comment and will not be evaluated by R. You can start lines with comments or include them after any code on the line.
Comments are essential to helping you remember what your code does, and explaining it to others. Commenting code, along with documenting how data is collected and explaining what each variable represents, is essential to reproducible research. See the Software Carpentry lesson on R for Reproducible Scientific Analysis.
RStudio makes it easy to comment or uncomment a paragraph: after selecting the lines you want to comment, press at the same time on your keyboard Ctrl + Shift + C. If you only want to comment out one line, you can put the cursor at any location of that line (i.e. no need to select the whole line), then press Ctrl + Shift + C.
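A short sketch of the three ways comments typically appear in a script: on their own line, after code on the same line, and as a way to switch a line off. The objects reuse the time_minutes example from earlier in the lesson.

```r
# a comment on its own line: convert a duration from minutes to seconds
time_minutes <- 5
time_seconds <- 60 * time_minutes   # a comment after code on the same line

# time_seconds <- 0   (this line is commented out, so R does not run it)
time_seconds                        # 300
```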