Filter rows

Filtering rows within a data set is essential for data analysis. R and the package dplyr offer comprehensive possibilities for this.

Summary

This chapter introduces the following new concepts and functions:

  • Arithmetic operators such as ==, !=, >, <, >=, and <=

  • Logical operators such as &, |, and xor()

  • The between() function

  • slice() and its variants

The filter command

Besides selecting the columns we need, we need tools to restrict the rows in a data frame. For this, the dplyr package offers the filter command.

The filter command takes one or more expressions, which must evaluate to TRUE or FALSE. These types of expressions are called boolean expressions, named after George Boole, who invented the Boolean algebra. Every expression passed to the filter command is evaluated for every row in the data frame. Only if the expression returns TRUE for a row, this row is included in the resulting data frame.

To form expressions, we can use a number of operators and functions. This chapter introduces the basic ways to express filter conditions on our data.

Equals operator

The simplest way to filter data is to compare the column to a given value. This way, we can get all orders from female customers:

orders %>% 
  (customer_gender == "f")

#> filter: removed 1,613 rows (56%), 1,261 rows remaining

As you can see, the equals-operator in R consists of two equal signs in a row (==). This is important, as using only one equals sign results in an error. A single equals sign is reserved for assignments, such as when we create a new column with mutate.

In the example above, the customer_gender column is of the data type chr, which means it contains alphanumeric symbols. For such columns, when comparing values, we must enclose the literal values with quotations marks. This is because the data type chr can contain spaces. If we didn’t use quotation marks, R wouldn’t know where the string of alphanumeric character starts and ends.

The equals comparison == is useful mostly for discrete data types. Un R, these include strings (or chr), whole numbers (integer), dates, and factors. Data types such as decimal numbers (double) or date-time can in principle compared to a specific value using the comparison operator ==, but given their continuous nature, it usually doesn’t make much sense. Arithmetic operators, such as less than or greater than, are much more useful in these cases.

Arithmetic operators

The following filter removes all rows where the total price is below 50 euros:

We can combine filter conditions when listing them comma-separated:

This is equivalent to having two subsequent filter-statements in a pipeline:

Logical combinations of filter expressions

As shown above, When we list two filter expressions separated by comma, they are connected with the logical operator and:

We can do that explicitly by using the official and operator, which is denoted by the symbol &.

Or by having two subsequent filter-commands in our pipeline:

An advantage of two filter-commands is that the tidylog package prints the effect for each of the two filter expressions separately. So if we are interested in that, this is a practical option.

Another way to logically combine filter expressions is the OR-operator, which is symbolized by the | character:

The OR-operator is fundamentally different from the AND-operator. In contrast to the example with the AND, a row in the OR example must only meet one of the two conditions to be kept in the result. It can meet both, but only one is required. Only if both evaluate to FALSE, the row is removed.

The between function

If we want to keep records whose value for a numerical column is within a give range, we can achieve this with the logical AND:

For filtering on ranges, the between() function is an alternative:

Filtering based on a record’s index

We can leverage the order of the rows in the data frame (or tibble) and apply the slice-function to create a filter based on a row's index:

Last updated

Was this helpful?