Chapter 4 dplyr - A brief introduction to tidy data manipulation

4.1 Loading and checking the data
We will work with our koala dataset. You can download it here. Furthermore you need to install the tidyverse package, which contains dplyr.
install.packages('tidyverse')First we need to load the dataset we’re working with.
koala<-read.csv('data/koala.csv', stringsAsFactors = T)It should contain the following columns:
names(koala)## [1] "species" "X" "Y" "state" "region" "sex" "weight"
## [8] "size" "fur" "tail" "age" "color" "joey" "behav"
## [15] "obs"
Lets look at a structure and summary of this dataset:
str(koala)## 'data.frame': 242 obs. of 15 variables:
## $ species: Factor w/ 1 level "Phascolarctos cinereus": 1 1 1 1 1 1 1 1 1 1 ...
## $ X : num 153 148 153 153 153 ...
## $ Y : num -27.5 -22.5 -27.5 -27.5 -27.5 ...
## $ state : Factor w/ 4 levels "New South Wales",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ region : Factor w/ 2 levels "northern","southern": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 2 2 1 1 ...
## $ weight : num 7.12 5.45 6.63 6.47 5.62 ...
## $ size : num 70.8 70.4 68.7 73 65.2 ...
## $ fur : num 1.86 1.85 2.48 1.92 1.95 ...
## $ tail : num 1.17 1.56 1.06 1.8 1.63 ...
## $ age : int 8 10 1 1 10 12 9 1 1 1 ...
## $ color : Factor w/ 6 levels "chocolate brown",..: 3 4 6 3 4 4 6 4 3 3 ...
## $ joey : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ behav : Factor w/ 3 levels "Feeding","Just Chillin",..: 3 3 2 3 3 1 2 3 1 3 ...
## $ obs : Factor w/ 3 levels "Opportunistic",..: 2 1 2 3 3 1 3 2 2 2 ...
summary(koala)## species X Y
## Phascolarctos cinereus:242 Min. :138.6 Min. :-39.00
## 1st Qu.:150.0 1st Qu.:-34.49
## Median :152.0 Median :-32.67
## Mean :150.3 Mean :-32.36
## 3rd Qu.:152.9 3rd Qu.:-30.31
## Max. :153.6 Max. :-21.39
## state region sex weight
## New South Wales:181 northern:165 female:127 Min. : 5.406
## Queensland : 16 southern: 77 male :115 1st Qu.: 6.574
## South Australia: 14 Median : 7.277
## Victoria : 31 Mean : 7.923
## 3rd Qu.: 8.765
## Max. :17.889
## size fur tail age
## Min. :64.81 Min. :1.110 Min. :1.004 Min. : 1.00
## 1st Qu.:68.43 1st Qu.:2.410 1st Qu.:1.272 1st Qu.: 3.00
## Median :70.27 Median :2.797 Median :1.534 Median : 7.00
## Mean :70.94 Mean :2.896 Mean :1.507 Mean : 6.43
## 3rd Qu.:72.33 3rd Qu.:3.217 3rd Qu.:1.750 3rd Qu.: 9.00
## Max. :81.91 Max. :5.876 Max. :1.981 Max. :12.00
## color joey behav obs
## chocolate brown:21 No :185 Feeding : 48 Opportunistic:65
## dark grey :36 Yes: 57 Just Chillin: 67 Spotlighting :94
## grey :69 Sleeping :127 Stagwatching :83
## grey-brown :53
## light brown :20
## light grey :43
This dataset contains the positions of each koala in Latitude and Longitude (X and Y) as well as variables describing their physiology, behavior and how they were recorded. It’s typical presence-only wildlife data, combining observations with some data describing each individual, which could e.g. be used for distribution modeling or to test influences of other variables such as climate on behavior and physiology of this particular species. You can read more about the dataset and typical R data types in the separate koala dataset tutorial, which will introduce you to the data and explain R data and variable types.
Often in these types of studies, we are not interested in all the recorded variables and thus first need to ‘clean’ our data to make it easier to work with it. dplyr is a package designed to make data ‘cleaning’ and manipulation of large datasets easier by introducing specific syntax. Let’s see how it works and compares to base R functionality!
4.2 Working with dplyr
Base R sub-setting can be very tedious. Imagine we want to got the mean age for our koalas, but split it by sex. Getting one mean is easy:
mean(koala[koala$sex == 'male', "age"],na.rm = TRUE)## [1] 6.626087
Summarizing both sexes and saving it in a table takes a few lines of code:
female_mean<-mean(koala[koala$sex == 'female', "age"],na.rm = TRUE)
male_mean<-mean(koala[koala$sex == 'male', "age"],na.rm = TRUE)
means<-rbind(c(female_mean, male_mean))
means<-as.data.frame(means)
names(means)<-c('female', 'male')Lets have a look at the results.
means## female male
## 1 6.251969 6.626087
That’s all good, but with that many lines of code quite error prone … dplyr makes data manipulation simpler. For this example, we would only require one line of code!
library(dplyr)
mean_age_koala<-koala%>%group_by(sex)%>%summarise(mean_age = mean(age))## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## sex mean_age
## <fct> <dbl>
## 1 female 6.25
## 2 male 6.63
So simple, and looks even better than our base R table too! The main functions we will explore here are dplyr’s pipe %>%, select(), filter(), group_by(), summarise() and mutate().
4.2.1 select() and dplyr’s pipe
If, for example, we wanted to move forward with only a few of the variables in our dataframe we could use the select() function. This will keep only the variables you select.
koala_select<-select(koala, species, sex, age)## species sex age
## 1 Phascolarctos cinereus male 8
## 2 Phascolarctos cinereus female 10
## 3 Phascolarctos cinereus male 1
## 4 Phascolarctos cinereus male 1
## 5 Phascolarctos cinereus female 10
## 6 Phascolarctos cinereus male 12
If we open up koala_select we’ll see that it only contains the species, sex and age columns. Above we used ‘normal’ R grammar, but the strengths of dplyr lie in combining several functions using pipes. Since the pipes grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.
koala_select_pipe<-koala%>%select(species, sex, age)To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon the koala dataframe and pass it on, using the pipe syntax %>%, to the next step, which is the select() function. In this case we don’t specify which data object we use in the select() function since in gets that from the previous pipe.
4.2.2 filter()
filter() is one of the most useful dplyr functions for data manipulation. Say you’re conducting a study of only male koalas. You won’t need any data on female koalas. So lets get rid of it!
koala_filter<-koala%>%filter(sex == 'male')Did it work?
summary(koala_filter$sex)## female male
## 0 115
No more females in the data! Let’s test our knowledge with a challenge.
4.2.3 Challenge 1
Write a single command (which can span multiple lines and includes pipes) that will produce a dataframe that has the values for age, size and color for females only. How many rows and columns does your dataframe have and why?
Extra challenge: out of this new dataset, filter only koalas >70cm in size. How many are there?
This should be your data structure:
nrow(challenge1)## [1] 127
ncol(challenge1)## [1] 3
We removed all the males, so our row number reduces from 242 to 127. Then we filter our desired columns and are now at 3 instead of 15.
nrow(challenge1.2)## [1] 46
You can find the solutions to all challenges posed here at the end of the document. Don’t peek!
4.2.4 group_by() and summarise()
Now, we were supposed to be reducing the error prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each sex. Instead of filter(), which will only pass observations that meet your criteria (in the above: sex=="female"), we can use group_by(), which will essentially use every unique criteria that you could have used in filter(). Let’s see what happens with our data structure when using dplyr’s group_by().
koala_group<-koala%>%group_by(sex)
str(koala_group)## tibble [242 x 15] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ species: Factor w/ 1 level "Phascolarctos cinereus": 1 1 1 1 1 1 1 1 1 1 ...
## $ X : num [1:242] 153 148 153 153 153 ...
## $ Y : num [1:242] -27.5 -22.5 -27.5 -27.5 -27.5 ...
## $ state : Factor w/ 4 levels "New South Wales",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ region : Factor w/ 2 levels "northern","southern": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 2 2 1 1 ...
## $ weight : num [1:242] 7.12 5.45 6.63 6.47 5.62 ...
## $ size : num [1:242] 70.8 70.4 68.7 73 65.2 ...
## $ fur : num [1:242] 1.86 1.85 2.48 1.92 1.95 ...
## $ tail : num [1:242] 1.17 1.56 1.06 1.8 1.63 ...
## $ age : int [1:242] 8 10 1 1 10 12 9 1 1 1 ...
## $ color : Factor w/ 6 levels "chocolate brown",..: 3 4 6 3 4 4 6 4 3 3 ...
## $ joey : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ behav : Factor w/ 3 levels "Feeding","Just Chillin",..: 3 3 2 3 3 1 2 3 1 3 ...
## $ obs : Factor w/ 3 levels "Opportunistic",..: 2 1 2 3 3 1 3 2 2 2 ...
## - attr(*, "groups")= tibble [2 x 2] (S3: tbl_df/tbl/data.frame)
## ..$ sex : Factor w/ 2 levels "female","male": 1 2
## ..$ .rows: list<int> [1:2]
## .. ..$ : int [1:127] 2 5 9 10 12 13 15 17 20 22 ...
## .. ..$ : int [1:115] 1 3 4 6 7 8 11 14 16 18 ...
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
You will notice that the structure of the dataframe where we used group_by() (koala_group) is not the same as the original koala dataset. A grouped dataset can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to the a particular value ‘Sex’ (at least in the example above).
The above was a bit on the uneventful side because group_by() is only really useful in conjunction with summarise(). This will allow you to create new variable(s) by using functions that repeat for each of the sex-specific data frames. That is to say, using the group_by() function, we split our original dataframe into multiple pieces, then we can run functions such as mean() or sd() within summarise():
koala_group_sum<-koala%>%group_by(sex)%>%
summarise(mean_age=mean(age))## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## sex mean_age
## <fct> <dbl>
## 1 female 6.25
## 2 male 6.63
And there we go. We got what we wanted and summarised the mean age of our koalas for both sexes separately. And we did that using only one simple line of code! I think it is time for another challenge to test our skills!
4.2.5 Challenge 2
Calculate the average weight value per state and Sex. Which combination of state and sex has the heaviest and which combination had the lightest koalas?
## # A tibble: 8 x 3
## # Groups: state [4]
## state sex mean_weight
## <fct> <fct> <dbl>
## 1 New South Wales female 6.54
## 2 New South Wales male 9.07
## 3 Queensland female 5.68
## 4 Queensland male 6.82
## 5 South Australia female 7.59
## 6 South Australia male 16.8
## 7 Victoria female 7.58
## 8 Victoria male 7.48
That is already quite powerful, but it gets even better! You’re not limited to defining only one new variable in summarise():
challenge2_ext<-koala%>%group_by(state, sex)%>%
summarise(mean_weight = mean(weight),
sd_weight = sd(weight),
sample_no = n())## `summarise()` regrouping output by 'state' (override with `.groups` argument)
We can create a new dataframe with as many new variables as we want. Very useful for our initial data exploration! Let’s get our hands another very useful function: mutate().
4.2.6 mutate()
We can create an entirely new variables in our initial dataset prior to (or even after) summarizing information using mutate(). Let’s say we’re interested in the weight:size ratio of our Koalas. Also we want to give each individual a numeric identifier to be able to better work with our data later on.
koala_mutate<-koala%>%mutate(weight_size_ratio = size/weight, ID = row_number())## [1] "species" "X" "Y"
## [4] "state" "region" "sex"
## [7] "weight" "size" "fur"
## [10] "tail" "age" "color"
## [13] "joey" "behav" "obs"
## [16] "weight_size_ratio" "ID"
Our dataset now has two extra columns containing the variables we were interested in. If you do not want to manipulate your raw data, you can use mutate before grouping and summarising to create the summary table straight away:
koala_mutate_weight_size<-koala%>%mutate(weight_size_ratio = size/weight)%>%
group_by(sex)%>%
summarise(mean_weight = mean(weight),
sd_weight = sd(weight),
mean_weight_size = mean (weight_size_ratio),
max_weight_size = max(weight_size_ratio))## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 5
## sex mean_weight sd_weight mean_weight_size max_weight_size
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 female 6.67 0.533 10.4 12.9
## 2 male 9.31 2.38 8.21 11.9
Great! Let’s end the lesson with another challenge, combining all the functions we have looked at today.
4.2.7 Challenge 3
Calculate the average tail length and fur thickness for a group of 20 randomly selected males and females from New South Wales. Then arrange the mean tail length in descending order.
Hint: Use the dplyr functions arrange() and sample_n(), they have similar syntax to other dplyr functions. Look at the help by calling ‘?function’, e.g. ?arrange.
## # A tibble: 2 x 3
## sex mean_tail mean_fur
## <fct> <dbl> <dbl>
## 1 male 1.54 2.95
## 2 female 1.47 2.48
Since we are sampling randomly, these will look different for each of you.
4.2.8 Pivoting
Another really useful processing option when manipulating or re-arranging data is pivoting, especially when you come from an Microsoft Excel background. For this, a bit of knowledge on tidy data is required. When we process data in R, it is extremely beneficial to have a long dataset, which we call tidy. These type of data follow three principles:
- Variables make up our columns
- Observations are in the rows
- Values are in the cells
Our koala dataset is an example of tidy data. Let’s reduce our dataset to make an example:
koala_subsample<-koala%>%select(species, region, sex, tail, age)
head(koala_subsample)## species region sex tail age
## 1 Phascolarctos cinereus northern male 1.168241 8
## 2 Phascolarctos cinereus northern female 1.562456 10
## 3 Phascolarctos cinereus northern male 1.056640 1
## 4 Phascolarctos cinereus northern male 1.801244 1
## 5 Phascolarctos cinereus northern female 1.625600 10
## 6 Phascolarctos cinereus northern male 1.086675 12
Tidy datasets, are also called indexed dataset. We can look up a measure using both region and sex as an index (like when we were grouping before). Let’s assume we collect another measure of tail-length on another visit to our sites. In data-entry manner, this would probably go into another column:
koala_subsample<-koala_subsample%>%mutate(tail2 = runif(nrow(koala_subsample), min = 2, max = 3))
head(koala_subsample)## species region sex tail age tail2
## 1 Phascolarctos cinereus northern male 1.168241 8 2.176022
## 2 Phascolarctos cinereus northern female 1.562456 10 2.128106
## 3 Phascolarctos cinereus northern male 1.056640 1 2.425834
## 4 Phascolarctos cinereus northern male 1.801244 1 2.913316
## 5 Phascolarctos cinereus northern female 1.625600 10 2.007734
## 6 Phascolarctos cinereus northern male 1.086675 12 2.901951
Our table is now wide, because we added a variable, that is not indexed. This can cause problems, especially when we want to visualize these two measurements. We will introduce ggplot2 for visualizing in later tutorials, but here is just a basic boxplot
ggplot(koala_subsample, aes(x=species, y=tail)) +
geom_boxplot()
ggplot(koala_subsample, aes(x=species, y=tail2)) +
geom_boxplot()
We can plot either column, but not both. Nevertheless, we can use another column that is already in long format, to separate the boxes:
ggplot(koala_subsample, aes(x=species, y=tail, fill = region)) +
geom_boxplot()
Southern koalas have shorter tails, interesting :) If we want to compare our two measurements though, we need them in one column, and that is where pivoting comes in handy:
koala_pivot<-koala_subsample%>%pivot_longer(cols = c(tail, tail2), names_to = 'tail_measure', values_to = 'tail_length')
head(koala_pivot)## # A tibble: 6 x 6
## species region sex age tail_measure tail_length
## <fct> <fct> <fct> <int> <chr> <dbl>
## 1 Phascolarctos cinereus northern male 8 tail 1.17
## 2 Phascolarctos cinereus northern male 8 tail2 2.18
## 3 Phascolarctos cinereus northern female 10 tail 1.56
## 4 Phascolarctos cinereus northern female 10 tail2 2.13
## 5 Phascolarctos cinereus northern male 1 tail 1.06
## 6 Phascolarctos cinereus northern male 1 tail2 2.43
We create the index column tail_measure to address our two measurements. The column containing the measurements is now called tail_length. Let’s see what has changed between the first and second measurement:
ggplot(koala_pivot, aes(x=species, y=tail_length, fill = tail_measure)) +
geom_boxplot()
Wow, seems the tails have grown a lot in the past ;) We can also facet to check both the different measures and the genders:
ggplot(koala_pivot, aes(x=species, y=tail_length, fill = tail_measure)) +
geom_boxplot()+
facet_grid(~sex)
Great, now we know how to make our dataset long and thus tidy!
Let’s create a dataset for you to practice:
koala_subsample2<-koala%>%select(species, region, sex, age, tail, joey,)%>%
mutate(age2 = age+sample(3, nrow(koala_subsample), replace = T),
tail2 = runif(nrow(koala_subsample), min = 2, max = 3),
tail3 = runif(nrow(koala_subsample), min = 1, max = 1.5),
joey2 = ifelse(koala$sex == 'female',
sample(c('Yes', 'No'), nrow(koala), prob=c(0.3, 0.7), replace = T), 'No'),
joey3 = ifelse(koala$sex == 'female',
sample(c('Yes', 'No'), nrow(koala), prob=c(0.5, 0.5), replace = T), 'No'),)
head(koala_subsample2)## species region sex age tail joey age2 tail2
## 1 Phascolarctos cinereus northern male 8 1.168241 No 9 2.844038
## 2 Phascolarctos cinereus northern female 10 1.562456 Yes 13 2.847502
## 3 Phascolarctos cinereus northern male 1 1.056640 No 4 2.486123
## 4 Phascolarctos cinereus northern male 1 1.801244 No 3 2.456892
## 5 Phascolarctos cinereus northern female 10 1.625600 No 13 2.523897
## 6 Phascolarctos cinereus northern male 12 1.086675 No 15 2.251426
## tail3 joey2 joey3
## 1 1.187125 No No
## 2 1.079192 No Yes
## 3 1.313226 No No
## 4 1.318932 No No
## 5 1.318252 No Yes
## 6 1.322062 No No
So we continued our measurements over 3 years, each koala aged 1-3 years which is expressed in the age2 column. When we revisited, we took additional measurements of tail length, and checked weather females had a joey. Can you tidy up this table?
4.2.9 Challenge 4
To make the changes more visible, give each individual an ID that describes the individual using mutatue(), as we did before. It’s easiest in steps! Refer to ?pivot_longer for help. Please sort the dataset as follows in the end: species, region, sex, all thew index columns, all measurement columns.
## # A tibble: 6 x 10
## ID species region sex age_check tail_measure joey_check age tail joey
## <int> <fct> <fct> <fct> <chr> <chr> <chr> <int> <dbl> <chr>
## 1 1 Phasco~ north~ male age tail joey 8 1.17 No
## 2 1 Phasco~ north~ male age tail joey2 8 1.17 No
## 3 1 Phasco~ north~ male age tail joey3 8 1.17 No
## 4 1 Phasco~ north~ male age tail2 joey 8 2.84 No
## 5 1 Phasco~ north~ male age tail2 joey2 8 2.84 No
## 6 1 Phasco~ north~ male age tail2 joey3 8 2.84 No
With this meany measurements, we can illustrate the long table format well. Each individual now has 18 rows, because our maximum measurement was 3x3 for joey and tail. Each ID number is one koala.
You can easily reverse your changes using pivot_wider(). Let’s reverse the age changes:
koala_pivot_wide<-koala_pivot_subsample%>%pivot_wider(names_from = age_check, values_from = age)
head(koala_pivot_wide)## # A tibble: 6 x 10
## ID species region sex tail_measure joey_check tail joey age age2
## <int> <fct> <fct> <fct> <chr> <chr> <dbl> <chr> <int> <int>
## 1 1 Phascolarc~ north~ male tail joey 1.17 No 8 9
## 2 1 Phascolarc~ north~ male tail joey2 1.17 No 8 9
## 3 1 Phascolarc~ north~ male tail joey3 1.17 No 8 9
## 4 1 Phascolarc~ north~ male tail2 joey 2.84 No 8 9
## 5 1 Phascolarc~ north~ male tail2 joey2 2.84 No 8 9
## 6 1 Phascolarc~ north~ male tail2 joey3 2.84 No 8 9
Having each variable in a separate column is important e.g. in modeling, when you have to specify you predictor variables by column. Let’s practice pivot_wider() with some made-up koala data data:
new_data<-rbind(cbind.data.frame(ID = 1:40, sex = 'male',
measurement = c('tail1', 'tail2', 'tail3', 'tail4')),
cbind.data.frame(ID = 1:40, sex = 'female',
measurement = c('tail1', 'tail2', 'tail3', 'tail4')))%>%
mutate(tail = runif(80, min = 1, max = 2), animal = rep(c(1:20), each = 4))
head(new_data)## ID sex measurement tail animal
## 1 1 male tail1 1.777274 1
## 2 2 male tail2 1.799369 1
## 3 3 male tail3 1.795316 1
## 4 4 male tail4 1.961941 1
## 5 5 male tail1 1.415826 2
## 6 6 male tail2 1.722455 2
4.2.10 Challenge 5
Create a wide dataset with one column per tail measurement, add an individual ID for each koala. Which column do you need to remove for this to work?
## # A tibble: 6 x 6
## sex animal tail1 tail2 tail3 tail4
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 male 1 1.78 1.80 1.80 1.96
## 2 male 2 1.42 1.72 1.60 1.91
## 3 male 3 1.23 1.81 1.31 1.29
## 4 male 4 1.76 1.12 1.85 1.03
## 5 male 5 1.44 1.43 1.23 1.07
## 6 male 6 1.62 1.26 1.91 1.78
Good job on all this! Let’s move on to data visualization using ggplot2 :)
4.2.11 Solution to all challenges
4.2.11.1 Challenge 1
challenge1<-koala%>%filter(sex == 'female')%>%
select(age, size, color)challenge1.2<-challenge1%>%filter(size>70)4.2.11.2 Challenge 2
challenge2<-koala%>%group_by(state, sex)%>%
summarise(mean_weight = mean(weight))## `summarise()` regrouping output by 'state' (override with `.groups` argument)
4.2.11.3 Challenge 3
challenge3<-koala%>%filter(state == 'New South Wales')%>%
group_by(sex)%>%
sample_n(20)%>%
summarise(mean_tail = mean(tail), mean_fur = mean(fur))%>%
arrange(desc(mean_tail))## `summarise()` ungrouping output (override with `.groups` argument)
4.2.11.4 Challenge 4
koala_pivot_subsample<-koala_subsample2%>%pivot_longer(cols = c(age, age2), names_to = 'age_check', values_to = 'age')
koala_pivot_subsample<-koala_pivot_subsample%>%pivot_longer(cols = c(tail, tail2, tail3), names_to = 'tail_measure', values_to = 'tail')
koala_pivot_subsample<-koala_pivot_subsample%>%pivot_longer(cols = c(joey, joey2, joey3), names_to = 'joey_check', values_to = 'joey')
koala_pivot_subsample<-koala_pivot_subsample%>%select(1:3, 4,6,8,5,7,9) #instead of typing the names you can use the column number!4.2.11.5 Challenge 5
new_data_wide<-new_data%>%select(-ID)%>%pivot_wider(names_from = measurement, values_from = tail)