Ben Chuanlong Du's Blog

It is never too late to learn.

Filter pandas DataFrames in Python

In [1]:
import pandas as pd
import numpy as np
import datetime

Summary

  1. There are many ways to filter rows of a pandas DataFrame. The slicing way is the universal way, however, the method DataFrame.query can be more convenient for simple filtering conditions and it is more efficient too.

  2. DataFrame.where is not intened for filtering but rather a analog to case statement in SQL and if/else or ternary expressions.

In [2]:
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "word": ["Mary", "is", "a", "fox", "."],
        "date": [datetime.date.today() + datetime.timedelta(days=i) for i in range(5)],
    }
)
df
Out[2]:
id word date
0 1 Mary 2020-11-27
1 2 is 2020-11-28
2 3 a 2020-11-29
3 4 fox 2020-11-30
4 5 . 2020-12-01

DataFrame.query

In [3]:
df.query("id % 2 == 0")
Out[3]:
id word date
1 2 is 2020-11-28
3 4 fox 2020-11-30
In [4]:
df.query("id in [3, 5, 7]")
Out[4]:
id word date
2 3 a 2020-11-29
4 5 . 2020-12-01
In [29]:
x = [3, 5, 7]
df.query("id in @x")
Out[29]:
id word date
2 3 a 2020-06-07
4 5 . 2020-06-09
In [30]:
df.query("word.str.len() == id")
Out[30]:
id word date
1 2 is 2020-06-06
In [31]:
df.query("word.str.lower() == 'mary'")
Out[31]:
id word date
0 1 Mary 2020-06-05
In [10]:
df.query("word.str.startswith('f')")
Out[10]:
id word date
3 4 fox 2020-11-24
In [12]:
df[df.word.str.startswith("f")]
Out[12]:
id word date
3 4 fox 2020-11-24

The below example is convenient way to compare a datetime column with a literal string.

In [32]:
df.query("date.astype('str') >= '2020-06-08'")
Out[32]:
id word date
3 4 fox 2020-06-08
4 5 . 2020-06-09

Use Slicing

In [7]:
df[df.x % 2 == 0]
Out[7]:
x y
1 2 4
3 4 2

DataFrame.where

The method DataFrame.where is not intended for filtring. It is an analog to case statement in SQL and if/else and ternary expression in programming languages. The condition that DataFrame.where takes can be a bool Series/DataFrame, array-like, or callable.

In [14]:
df.where(lambda r: r.id % 2 == 0)
Out[14]:
id word
0 NaN NaN
1 2.0 is
2 NaN NaN
3 4.0 fox
4 NaN NaN
In [18]:
df.where(lambda r: r.id % 2 == 0, df * 3)
Out[18]:
id word
0 3 MaryMaryMary
1 2 is
2 9 aaa
3 4 fox
4 15 ...
In [9]:
df.where(lambda r: r.x % 2 == 0).dropna()
Out[9]:
x y
1 2.0 4.0
3 4.0 2.0
In [19]:
df.id.where(lambda x: x % 2 == 0)
Out[19]:
0    NaN
1    2.0
2    NaN
3    4.0
4    NaN
Name: id, dtype: float64
In [20]:
df.id.where(lambda x: x % 2 == 0, 0)
Out[20]:
0    0
1    2
2    0
3    4
4    0
Name: id, dtype: int64
In [ ]:
 

Comments