In [1]:
import pandas as pd
import numpy as np
import datetime
Summary¶
There are many ways to filter rows of a pandas DataFrame. The slicing way is the universal way, however, the method
DataFrame.querycan be more convenient for simple filtering conditions and it is more efficient too.DataFrame.whereis not intened for filtering but rather a analog tocasestatement in SQL andif/elseor ternary expressions.
In [2]:
df = pd.DataFrame(
{
"id": [1, 2, 3, 4, 5],
"word": ["Mary", "is", "a", "fox", "."],
"date": [datetime.date.today() + datetime.timedelta(days=i) for i in range(5)],
}
)
df
Out[2]:
DataFrame.query¶
In [3]:
df.query("id % 2 == 0")
Out[3]:
In [4]:
df.query("id in [3, 5, 7]")
Out[4]:
In [29]:
x = [3, 5, 7]
df.query("id in @x")
Out[29]:
In [30]:
df.query("word.str.len() == id")
Out[30]:
In [31]:
df.query("word.str.lower() == 'mary'")
Out[31]:
In [10]:
df.query("word.str.startswith('f')")
Out[10]:
In [12]:
df[df.word.str.startswith("f")]
Out[12]:
The below example is convenient way to compare a datetime column with a literal string.
In [32]:
df.query("date.astype('str') >= '2020-06-08'")
Out[32]:
Use Slicing¶
In [7]:
df[df.x % 2 == 0]
Out[7]:
DataFrame.where¶
The method DataFrame.where is not intended for filtring.
It is an analog to case statement in SQL and if/else and ternary expression in programming languages.
The condition that DataFrame.where takes can be a bool Series/DataFrame, array-like, or callable.
In [14]:
df.where(lambda r: r.id % 2 == 0)
Out[14]:
In [18]:
df.where(lambda r: r.id % 2 == 0, df * 3)
Out[18]:
In [9]:
df.where(lambda r: r.x % 2 == 0).dropna()
Out[9]:
In [19]:
df.id.where(lambda x: x % 2 == 0)
Out[19]:
In [20]:
df.id.where(lambda x: x % 2 == 0, 0)
Out[20]:
In [ ]: