Select columns
This chapter introduces tools to remove unnecessary columns from the data set. Or, positively stated, we learn how to specify the columns we need for our analysis.
Last updated
Was this helpful?
This chapter introduces tools to remove unnecessary columns from the data set. Or, positively stated, we learn how to specify the columns we need for our analysis.
Last updated
Was this helpful?
This chapter introduces the following new concepts and functions:
select()
starts_with
, ends_with
, contains
, matches
where
in conjunction with is.numeric
, is.character
etc.
last_col
all_of
, any_of
Excluding columns with the minus sign -
select
commandThe function select()
is the designated tool to select columns with . By passing different things to the function, we can efficiently define the set of columns in the resulting data frame.
The intuitive way to select the columns we want is by listing their names. We can pass one or more column names to the select()
function. In case of two or more, we use commas to separate the names:
When we only want a few columns, this approach works fine and is usually a good choice. I expect you apply this method in more than 90% of all cases. However, there are cases when you’d wish there was something more flexible. Luckily, there is.
Occasionally,R we want to select columns based on a pattern of their names. Take the orders data set as an example. Here, all variables that contain information about the shipping address have the prefix shipping
. We leverage this with the helper function starts_with()
:
Similar to start_with()
, the function ends_with()
looks for a string at the end of a column name. For example, all columns that contain a date/time information in the data set end with the suffix _at
. We can take advantage of that in case we wanted to select all these columns efficiently:
To complete the picture, we can also search for string somewhere in a column name. The contains()
function does exactly that:
In some cases, it might not be enough to just match strings in column names. It is easy to imagine more complex patterns, involving wildcards or a specific order in which symbols must appear in a column name. For all this, regular expressions are a wonderful, albeit complex, solution. If you regularly encounter such complex scenarios, I recommend you familiarize yourself with the basics of regular expressions. I rarely need them myself, and if I do, I look up the expression on the internet using a good Google search.
I cannot think of a useful example in the context of the orders data set. However, the following regular expressions will look for the string _at
at the end of the column name. Thus, it mirrors the example from above, but solves it with a regular expression:
We can combine the functions that look for strings in column names to create more specific pattern searches. The example below uses the &
operator to connect two functions with a logical and. This means, both expressions must evaluate to true for the column to be selected:
In contrast to filter
, where a comma-separated list of expressions combines them with a logical and, when using this approach with select
, the resulting columns are combined to a unified set of columns. This means a logical or is applied. For example, listing starts_with("customer")
and ends_with("_at")
separated by a comma keeps all columns that start with “customer” or that end with “_at”.
Another flexible way to select columns is by their data type. Say we want to select all numeric columns because we wish to calculate the mean value across all of them in the next step of the pipeline. There is a shortcut for this, using the where()
function together with is.numeric
:
Of course, there are functions for all other data types as well:
Another way we can address columns is by their position or index:
The previous sections introduced ways to select columns, that is, specifying what we want. Often, it is more efficient to tell R what we don’t want. The minus sign -
negates any selection from the previous sections. The following command gives us all columns except the order_id
:
We can combine positive and negative selections as we need them: