Filter rows
Filtering rows within a data set is essential for data analysis. R and the package dplyr offer comprehensive possibilities for this.
Last updated
Was this helpful?
Filtering rows within a data set is essential for data analysis. R and the package dplyr offer comprehensive possibilities for this.
Last updated
Was this helpful?
This chapter introduces the following new concepts and functions:
Arithmetic operators such as ==
, !=
, >
, <
, >=
, and <=
Logical operators such as &
, |
, and
The between()
function
slice()
and its variants
filter
commandBesides we need, we need tools to restrict the rows in a data frame. For this, the package offers the filter
command.
The filter
command takes one or more expressions, which must evaluate to TRUE
or FALSE
. These types of expressions are called boolean expressions, named after , who invented the . Every expression passed to the filter
command is evaluated for every row in the data frame. Only if the expression returns TRUE
for a row, this row is included in the resulting data frame.
To form expressions, we can use a number of operators and functions. This chapter introduces the basic ways to express filter conditions on our data.
The simplest way to filter data is to compare the column to a given value. This way, we can get all orders from female customers:
In the example above, the customer_gender
column is of the data type chr
, which means it contains alphanumeric symbols. For such columns, when comparing values, we must enclose the literal values with quotations marks. This is because the data type chr
can contain spaces. If we didn’t use quotation marks, R wouldn’t know where the string of alphanumeric character starts and ends.
The equals comparison ==
is useful mostly for discrete data types. Un R, these include strings (or chr
), whole numbers (integer
), dates, and factors. Data types such as decimal numbers (double
) or date-time can in principle compared to a specific value using the comparison operator ==
, but given their continuous nature, it usually doesn’t make much sense. Arithmetic operators, such as less than or greater than, are much more useful in these cases.
The following filter removes all rows where the total price is below 50 euros:
We can combine filter conditions when listing them comma-separated:
This is equivalent to having two subsequent filter-
statements in a pipeline:
As shown above, When we list two filter expressions separated by comma, they are connected with the logical operator and:
We can do that explicitly by using the official and operator, which is denoted by the symbol &
.
Or by having two subsequent filter
-commands in our pipeline:
Another way to logically combine filter expressions is the OR
-operator, which is symbolized by the |
character:
The OR
-operator is fundamentally different from the AND
-operator. In contrast to the example with the AND
, a row in the OR
example must only meet one of the two conditions to be kept in the result. It can meet both, but only one is required. Only if both evaluate to FALSE
, the row is removed.
between
functionIf we want to keep records whose value for a numerical column is within a give range, we can achieve this with the logical AND
:
For filtering on ranges, the between()
function is an alternative:
We can leverage the order of the rows in the data frame (or tibble) and apply the slice
-function to create a filter based on a row's index:
As you can see, the equals-operator in R consists of two equal signs in a row (==
). This is important, as using only one equals sign results in an error. A single equals sign is reserved for assignments, such as when we create a new column with .
An advantage of two filter
-commands is that the package prints the effect for each of the two filter expressions separately. So if we are interested in that, this is a practical option.