Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Tips and Traps¶
polars.DataFrame.unique and polars.Series.unique do not maintain the original order by default. To maintain the original order, pass the option maintain_order=True.
Polars¶
Polars is a blazingly fast DataFrame library implemented in Rust that uses Apache Arrow as its memory model.
It is the best replacement for pandas for small data at this time.
Polars supports multithreading and lazy computation.
Polars CANNOT handle data larger than memory at this time (even though this might change in the future).
Comparison with pandas DataFrame¶
Polars intentionally leaves out the concept of a (row) index. There are no methods such as loc and iloc in Polars. You can use df.get_column / df[col] to access a single column and df[[col1, col2]] to select several columns (df.get_columns returns all columns as a list of Series).
Similar to pandas DataFrame, chained access works but chained assignment doesn't. To assign a value to a single element, use df[row_index, col_name] = val instead. However, note that this is inefficient because it updates the whole column under the hood. If you have to update many values of a column in a Polars DataFrame, do NOT loop through each cell to update it. Instead, create a Series that contains the updated values and then update the column only once. For more discussion, please refer to Efficient way to update a single cell of a Polars DataFrame?.
Polars provides polars.from_pandas and DataFrame.to_pandas to convert between Polars and pandas DataFrames.
Polars' APIs for parsing CSV files are not as flexible as pandas's. Fortunately, we can parse CSV files using pandas and then convert the pandas DataFrames into Polars DataFrames.
!pip3 install --user polars
import itertools as it
import polars as pl
Series¶
[m for m in dir(pl.Series) if not m.startswith("_")]
s = pl.Series([1, 2, 3])
s
s[0] = 100
s
DataFrame¶
[m for m in dir(pl.DataFrame) if not m.startswith("_")]
df = pl.read_csv("https://j.mp/iriscsv")
df
df["sepal_length"]
Similar to pandas DataFrame, chaining assignment does NOT work!
df["sepal_length"][0] = 10000
df
You can slice by row and column at the same time.
df[0, "sepal_length"]
dir(df)
df[0, "sepal_length"] = 1000
df
df.columns
s = df.get_column("sepal_length")
s
s[0] = 1000
s
df
df.get_column("sepal_length")[0] = 2000
df
type(s)
pl.all¶
comp.select(pl.all().all())
DataFrame.frame_equal¶
Check whether a DataFrame is equal to another DataFrame, elementwise.
df.filter((df["i0"] == 1) & (df["i1"] == 2) & (df["i2"] == 13))[
["j0", "j1", "ranks"]
].frame_equal(
df.filter((df["i0"] == 1) & (df["i1"] == 2) & (df["i2"] == 26))[
["j0", "j1", "ranks"]
]
)
df = pl.DataFrame(
{
"id": [0, 1, 2, 3, 4],
"color": ["red", "green", "green", "red", "red"],
"shape": ["square", "triangle", "square", "triangle", "square"],
}
)
df
df.filter(pl.col("sepal_length") > 5).groupby("species").sum()
df = pl.DataFrame(
{
"A": [1, 2, 3, 4, 5],
"fruits": ["banana", "banana", "apple", "apple", "banana"],
"B": [5, 4, 3, 2, 1],
"cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
}
)
df
df.sort("fruits").select(
[
"fruits",
"cars",
pl.lit("fruits").alias("literal_string_fruits"),
pl.col("B").filter(pl.col("cars") == "beetle").sum(),
pl.col("A")
.filter(pl.col("B") > 2)
.sum()
.over("cars")
.alias("sum_A_by_cars"), # groups by "cars"
pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"), # groups by "fruits"
pl.col("A")
.reverse()
.over("fruits")
.flatten()
.alias("rev_A_by_fruits"), # groups by "fruits"
pl.col("A")
.sort_by("B")
.over("fruits")
.flatten()
.alias("sort_A_by_B_by_fruits"), # groups by "fruits"
]
)
df.to_dict("records")
df.to_dicts()
dir(df)
ss = df.to_struct("ss")
ss
type(ss[0])
sort¶
DataFrame.sort is not in-place. It returns a new DataFrame.
?pl.DataFrame.sort
to_pandas¶
df = pl.DataFrame(
{
"foo": [1, 2, 3],
"bar": [6, 7, 8],
"ham": ["a", "b", "c"],
}
)
dfp = df.to_pandas()
dfp
pl.from_pandas(dfp)