Tips and Traps¶
An element in a pandas DataFrame can be any (complicated) type in Python. To save a pandas DataFrame with arbitrary (complicated) types as it is, you have to use the pickle module. The method pandas.DataFrame.to_pickle (which is simply a wrapper over pickle.dump) serializes a DataFrame to a pickle file, while the function pandas.read_pickle (which is simply a wrapper over pickle.load) deserializes a pickle file into a pandas DataFrame.
Apache Parquet is a binary file format that stores data in a columnar fashion for compressed, efficient columnar data representation. It is a very popular file format in the big data (Hadoop/Spark, etc.) ecosystem. However, be aware that a Parquet file does not support arbitrary data types in Python! For example, an element of the list type is converted to a numpy array first, which requires the types of the elements of a column to be consistent. For this reason, numpy.ndarray is preferred to list if you want to write the pandas DataFrame to a Parquet file later.
It is good practice to have consistent and specific types when working with Parquet files in Python, especially when you have to deal with the Parquet files in Spark/PySpark later.
- numpy.ndarray is preferred to list and tuple.
- Avoid mixing different types (numpy.ndarray, list, tuple, etc.) in the same column, even if it still might work.
- An empty numpy.ndarray is preferred to None as the handling of None can be inconsistent in different situations. Specifically, avoid a column with all None's. When written to a Parquet file and then read into Spark/PySpark, a column with all None's is inferred as IntegerType (due to the lack of specific type information), which might or might not be what you want (see the sketch after this list).
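Below is a minimal sketch of the all-None trap. It is not part of the original notebook, and the file path is made up for illustration.
import pandas as pd
# A DataFrame whose "val" column contains nothing but None values.
pdf_none = pd.DataFrame({"id": [1, 2, 3], "val": [None, None, None]})
pdf_none.to_parquet("/tmp/pdf_none.parquet")
# When this file is read into PySpark, the "val" column carries no useful
# type information, so (per the note above) it may come back as an integer type:
#     spark.read.parquet("/tmp/pdf_none.parquet").printSchema()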
You can specify a schema to help Spark/PySpark read a Parquet file. However, I don't think this is a good practice. One advantage of the Parquet format is that it has a schema, so the accurate schema should be stored in the Parquet file itself. Otherwise, it is hard for other people to figure out the correct schema to use. A sketch of writing a Parquet file with an explicit schema follows.
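As an illustration (my own addition, not code from the original notebook; the output path is made up), pyarrow lets you attach an explicit schema when writing, so the Parquet file itself carries accurate type information.
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
pdf = pd.DataFrame(
    {
        "x1": [1, 2, 3],
        "x2": [True, False, True],
        "x3": [np.array([]), np.array([0.1, 0.2]), np.array([0.3])],
    }
)
# Declare the intended types explicitly instead of relying on inference.
schema = pa.schema(
    [
        pa.field("x1", pa.int64()),
        pa.field("x2", pa.bool_()),
        pa.field("x3", pa.list_(pa.float64())),
    ]
)
pq.write_table(
    pa.Table.from_pandas(pdf, schema=schema, preserve_index=False),
    "/tmp/pdf_with_schema.parquet",
)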
Types in pandas DataFrame¶
import pandas as pd
import numpy as np
import pickle
Complicated Data Types¶
An element in a pandas DataFrame can be any (complicated) type in Python. Below is an example pandas DataFrame with complicated data types.
pdf_1 = pd.DataFrame(
{
"x1": [1, 2, 3, 4, 5],
"x2": [
None,
np.array([]),
{"key": 1},
np.array([0.1, 0.2, 0.3]),
["how", 0.5, 0.6],
],
}
)
pdf_1.head()
pdf_1.dtypes
Mixed Data Types in pandas DataFrame¶
The pandas DataFrame pdf_1 contains mixed data types in its column x2. It cannot be written into a Parquet file as the data types in the column x2 are not compatible with Parquet.
pdf_1.to_parquet("/tmp/pdf_1.parquet")
However, you can serialize and deserialize the pandas DataFrame using pickle. As a matter of fact, almost all Python objects can be serialized and deserialized using pickle.
pdf_1.to_pickle("/tmp/pdf_1.pickle")
with open("/tmp/pdf_1.pickle", "rb") as fin:
pdf_1c = pickle.load(fin)
pdf_1c
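Equivalently (a small addition for completeness, not in the original notebook), pandas.read_pickle can be used instead of calling pickle.load directly.
# pandas.read_pickle is a thin wrapper over pickle.load,
# so this produces the same DataFrame as the cell above.
pdf_1c = pd.read_pickle("/tmp/pdf_1.pickle")
pdf_1c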
Some data types are compatible in Parquet. For example, None, numpy.ndarray and list can be mixed in a pandas DataFrame column and can be written into a Parquet file.
pdf_2 = pd.DataFrame(
{
"x1": [1, 2, 3, 4, 5],
"x2": [True, False, True, False, True],
"x3": [None, np.array([]), [], np.array([0.1, 0.2, 0.3]), [0.4, 0.5, 0.6]],
}
)
pdf_2.head()
pdf_2.dtypes
pdf_2.to_parquet("/tmp/pdf_2.parquet", flavor="spark")
pdf_2c = pd.read_parquet("/tmp/pdf_2.parquet")
pdf_2c
pdf_2c.dtypes
type(pdf_2.x3[0])
type(pdf_2.x3[1])
type(pdf_2.x3[2])
type(pdf_2.x3[3])
type(pdf_2.x3[4])
Even if some data types can be mixed together and are compatible with the Parquet format, you should avoid doing this. It is good practice to have consistent and specific types when working with Parquet files in Python, especially when you have to deal with the Parquet files in Spark/PySpark later.
- numpy.ndarray is preferred to list and tuple.
- Avoid mixing different types (numpy.ndarray, list, tuple, etc.) in the same column, even if it still might work.
- An empty numpy.ndarray is preferred to None as the handling of None can be inconsistent in different situations. Specifically, avoid a column with all None's. When written to a Parquet file and then read into Spark/PySpark, a column with all None's is inferred as IntegerType (due to the lack of specific type information), which might or might not be what you want. One way to normalize such a column is sketched after this list.
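For instance, a mixed column such as x3 above can be normalized into numpy.ndarray values of a consistent element type before writing. This is a sketch under my own assumptions (not from the original notebook), and the output path is made up.
# Replace None with an empty float array and convert lists/arrays to float64
# numpy arrays so that every element of x3 has the same type.
pdf_2_norm = pdf_2.copy()
pdf_2_norm["x3"] = pdf_2_norm["x3"].apply(
    lambda v: np.asarray(v, dtype=np.float64)
    if v is not None
    else np.array([], dtype=np.float64)
)
pdf_2_norm.to_parquet("/tmp/pdf_2_normalized.parquet", flavor="spark")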
Read the Parquet File into PySpark¶
from pathlib import Path
import findspark
findspark.init(str(next(Path("/opt").glob("spark-*"))))
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("PySpark").enableHiveSupport().getOrCreate()
df = spark.read.parquet("/tmp/pdf_2.parquet")
df.show()
df.schema
Notice that the None value is represented as null in the above PySpark DataFrame. The x3 column is represented as an array of double values.
You can provide a customized schema when reading a table into Spark.
schema = StructType(
[
StructField("x1", LongType(), False),
StructField("x2", BooleanType(), False),
StructField("x3", ArrayType(DoubleType()), True),
]
)
df_2 = spark.read.schema(schema).parquet("/tmp/pdf_2.parquet")
df_2.show()
df_2.schema
An empty array is NOT considered null!
df_2.select(col("x1"), col("x2"), col("x3"), col("x3").isNull().alias("is_null")).show()
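If you also need to flag empty arrays explicitly, one option (my own addition, not part of the original notebook) is the size function; note that its behavior on null input differs between Spark versions (it returns -1 or null depending on the spark.sql.legacy.sizeOfNull setting).
df_2.select(
    col("x3"),
    col("x3").isNull().alias("is_null"),
    # size(...) == 0 is only true for genuinely empty (non-null) arrays.
    (size(col("x3")) == 0).alias("is_empty"),
).show()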
You can write the PySpark DataFrame into a Parquet file and then load it into a pandas DataFrame.
- null is converted to None.
- An array is represented as numpy.ndarray.
df_2.write.mode("overwrite").parquet("/tmp/df_2.parquet")
pdf_3 = pd.read_parquet("/tmp/df_2.parquet")
pdf_3
pdf_3.dtypes
type(pdf_3.x3[0])
type(pdf_3.x3[1])
type(pdf_3.x3[2])
type(pdf_3.x3[3])
type(pdf_3.x3[4])
Schema of PySpark DataFrames¶
df.show()
df.schema
DataFrame.schema is of the pyspark.sql.types.StructType type.
type(df.schema)
A StructType is iterable and each element is of the StructField type.
for field in df.schema:
print(field)
f = df.schema[0]
f
f.name
f.dataType
type(f.dataType)
f.dataType.typeName()
f.dataType.simpleString()
f.nullable
f.simpleString()
DecimalType(18, 3)
f2 = StructField("gmb", DecimalType(18, 2), False)
f2
f2.name
f2.dataType
f2.dataType.typeName()
f2.dataType.simpleString()
f2.simpleString()
for field in df.schema:
print(f"{field.name} {field.dataType.simpleString()}")
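As a follow-up sketch (my own assumption, not from the original notebook), the name/type pairs printed above form a DDL-style schema string, which spark.read.schema also accepts in place of a StructType.
# Build a DDL schema string, e.g. "x1 bigint, x2 boolean, x3 array<double>".
ddl = ", ".join(f"{field.name} {field.dataType.simpleString()}" for field in df.schema)
df_3 = spark.read.schema(ddl).parquet("/tmp/pdf_2.parquet")
df_3.schema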