Add columns
We often want to add new columns to our data set and calculate their value based on existing columns. This chapter introduces how to do this with dplyr.
Summary
This chapter introduces the following new concepts and functions:
The
mutate-function for adding new columns or changing existing ones.The
transmute-function as an alternative tomutateto only keep the new columns and the columns involved in the calculation.Conversion functions
as.character,as.double,as.integer,as.logical, andfactorto cast a data type onto a column.Functions to extract parts of a date column, such as
year,month,day,weekday, etc., mainly from thelubridate-package.
The mutate command
mutate commandWe find ourselves in the need to expand the existing set of columns with new information. This information is usually derived from the existing columns. A good example is the turnover for our Campusbier data set. The table line_items contains the price and the quantity, but not the turnover. Thus, we can calculate a new column turnover as price * quantity.
line_items |>
mutate(turnover = price * quantity)Arithmetic calculations
Keep only used columns
line_items |>
mutate(turnover = price * quantity, .keep = "used")Statistical calculations
Extract parts of a date
For working with date and time columns, the package lubridate offers useful functions:
A compact introduction is also provided in chapter 16 Dates and times from the book "R for Data Science" by Hadley Wickham.
Calendar week
We can extract the calendar week with the function week:
covid %>%
transmute(date, week = week(date))
# A tibble: 151,776 x 2
# date week
# <date> <dbl>
# 1 2020-02-24 8
# 2 2020-02-25 8
# 3 2020-02-26 9
# 4 2020-02-27 9Month
For the month, the equivalent function is month:
covid %>%
transmute(date, month = month(date))
# A tibble: 151,776 x 2
# date month
# <date> <dbl>
# 1 2020-02-24 2
# 2 2020-02-25 2Quarter
A function with the corresponding name quarter:
covid %>%
transmute(date, quarter = quarter(date))
# A tibble: 151,776 x 2
# date quarter
# <date> <int>
# 1 2020-02-24 1
# 2 2020-02-25 1Year
No surprises here:
covid %>%
transmute(date, year = year(date))
# A tibble: 151,776 x 2
# date year
# <date> <dbl>
# 1 2020-02-24 2020
# 2 2020-02-25 2020Cutting off dates
We often want to bring a date to a higher level of aggregation. For example, if we aim to aggregate dates (daily) to full weeks, months, quarters, or years, we can use ceiling_date and floor_date to round a date down or up to achieve the desired format:
covid %>%
filter(location == "Germany") %>%
mutate(week_end = ceiling_date(date, unit = "week") ) %>%
select(date, year_month, location, new_cases_per_million)Change a column's data type
To perform some operations, such as statistical applications, or to visualize data properly, we may need to transform a column into a different data type. When loading a data set, the readr functions like read_csv are good at guessing the underlying data type for each column. Occasionally, they guess wrong, and that's when mutate comes to the rescue:
orders |>
mutate(order_id = as.character(order_id))There is a function to cast a column to each existing data type in R:
as.characterto cast a column to a string. This should always work, as strings are the most general data type.as.doubleto cast a column to a number with decimal places. This only works when the original values are, in fact, numeric. Values that cannot be converted into a decimal number are replaced withNA.as.integerto cast a column to a whole number. Analogous toas.double, this only works when the values contain solely the characters 0-9.as.logicalto cast a column to a boolean value. This works when the values contain strings with the values "TRUE" and "FALSE" or "0" and "1". Similar, the numeric values 0 and 1 can also be converted toTRUEandFALSE.lubridate::as.Datelubridate::as.POSIXct
Working with factors
Recoding factors
Reordering factors
Last updated
Was this helpful?