We often want to add new columns to our data set and calculate their value based on existing columns. This chapter introduces how to do this with dplyr.
Summary
This chapter introduces the following new concepts and functions:
The mutate-function for adding new columns or changing existing ones.
The transmute-function as an alternative to mutate to only keep the new columns and the columns involved in the calculation.
Conversion functions as.character, as.double, as.integer, as.logical, andfactor to cast a data type onto a column.
Functions to extract parts of a date column, such as year, month, day, weekday, etc., mainly from the lubridate-package.
The mutate command
We find ourselves in the need to expand the existing set of columns with new information. This information is usually derived from the existing columns. A good example is the turnover for our Campusbier data set. The table line_items contains the price and the quantity, but not the turnover. Thus, we can calculate a new column turnover as price * quantity.
A compact introduction is also provided in chapter 16 Dates and times from the book "R for Data Science" by Hadley Wickham.
Calendar week
We can extract the calendar week with the function week:
Month
For the month, the equivalent function is month:
Quarter
A function with the corresponding name quarter:
Year
No surprises here:
Cutting off dates
We often want to bring a date to a higher level of aggregation. For example, if we aim to aggregate dates (daily) to full weeks, months, quarters, or years, we can use ceiling_date and floor_date to round a date down or up to achieve the desired format:
An advantage of using ceiling_date and floor_date is that the data type remains of the type date or date time. This can be used by ggplot2 to create a properly formatted axis..
Change a column's data type
To perform some operations, such as statistical applications, or to visualize data properly, we may need to transform a column into a different data type. When loading a data set, the readr functions like read_csv are good at guessing the underlying data type for each column. Occasionally, they guess wrong, and that's when mutate comes to the rescue:
There is a function to cast a column to each existing data type in R:
as.character to cast a column to a string. This should always work, as strings are the most general data type.
as.double to cast a column to a number with decimal places. This only works when the original values are, in fact, numeric. Values that cannot be converted into a decimal number are replaced with NA.
as.integer to cast a column to a whole number. Analogous to as.double, this only works when the values contain solely the characters 0-9.
as.logical to cast a column to a boolean value. This works when the values contain strings with the values "TRUE" and "FALSE" or "0" and "1". Similar, the numeric values 0 and 1 can also be converted to TRUE and FALSE.