import io
import pandas as pd
import sys
Tips and Traps¶
Use the Parquet format as much as possible instead of other binary or text formats.
Both Python libraries pyarrow and fastparquet can handle Parquet files; pyarrow is preferred to fastparquet (see the sketch after these tips).
When writing a pandas DataFrame to a Parquet file, you can specify the arguments schema and version to control the output schema of the DataFrame precisely (see DataFrame.to_parquet below). Notice that the version argument is required to avoid a known issue.
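As an illustration of engine selection, here is a minimal sketch (the DataFrame and the file name df.parquet are made up for the example):
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
# write using the pyarrow engine (the preferred one)
df.to_parquet("df.parquet", engine="pyarrow")
# read it back; engine="auto" (the default) tries pyarrow first
df2 = pd.read_parquet("df.parquet", engine="pyarrow")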
pandas.read_parquet¶
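pandas.read_parquet loads a Parquet file into a DataFrame. A minimal sketch (it assumes the file rank_j0.parquet produced in the next section):
import pandas as pd

# read the whole file
df = pd.read_parquet("rank_j0.parquet")
# read only a subset of columns, which avoids deserializing unneeded data
df_sub = pd.read_parquet("rank_j0.parquet", columns=["id", "rank"])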
DataFrame.to_parquet¶
When writing a pandas DataFrame to a Parquet file, you can specify the arguments schema and version to control the output schema of the DataFrame precisely. Notice that the version argument is required to avoid a known issue. Of course, you can also manually cast the data type of each column before outputting the DataFrame, but it is not as efficient as the first approach.
import pandas as pd
import pyarrow as pa

df_j0 = pd.read_csv("rank_j0.csv")
# define the exact Parquet types for each output column
schema = pa.schema(
    [
        ("id", pa.uint64()),
        ("mod", pa.uint32()),
        ("dups", pa.uint8()),
        ("rank", pa.uint32()),
    ]
)
# schema forces the column types; version="2.6" pins the Parquet format version
df_j0.to_parquet("rank_j0.parquet", version="2.6", schema=schema)
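For comparison, a minimal sketch of the manual-cast approach mentioned above (the pandas dtypes here mirror the pyarrow schema; that mapping is an assumption for illustration):
# cast each column explicitly before writing; this works,
# but is less efficient than passing a schema to to_parquet
df_j0 = df_j0.astype({"id": "uint64", "mod": "uint32", "dups": "uint8", "rank": "uint32"})
df_j0.to_parquet("rank_j0.parquet", version="2.6")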
Output Types of Columns¶
A trap to be aware of: a column of object dtype containing only nulls can be read into PySpark as a column of null integers. Specifying a schema when writing, as above, avoids the ambiguity; see https://stackoverflow.com/questions/50110044/how-to-force-parquet-dtypes-when-saving-pd-dataframe for a discussion.
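A minimal sketch of the workaround (the DataFrame and column names are made up for illustration): declare the intended type of the all-null column in a pyarrow schema so it is not guessed at write time.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "comment": [None, None, None]})
# without a schema, the type of the all-null "comment" column must be inferred;
# declaring it pa.string() stores the intended type in the Parquet file
schema = pa.schema([("id", pa.int64()), ("comment", pa.string())])
df.to_parquet("comments.parquet", version="2.6", schema=schema)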
DataFrame.to_csv¶
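A minimal sketch of common DataFrame.to_csv usage (the DataFrame and the file name are made up for the example):
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
# index=False drops the row index, which is usually noise in a CSV file
df.to_csv("df.csv", index=False)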
Read/Write A Pandas DataFrame From/To A Binary Stream¶
# an empty in-memory binary stream
bio = io.BytesIO()
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1], "z": [1, 1, 1, 1, 1]})
df.head()
# size of the empty buffer
sys.getsizeof(bio)
# write the DataFrame into the buffer in the Parquet format
df.to_parquet(bio)
# the buffer has grown to hold the Parquet bytes
sys.getsizeof(bio)
# read the DataFrame back directly from the buffer
pd.read_parquet(bio)
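Relatedly, if you only need the raw bytes rather than a stream, DataFrame.to_parquet returns them directly when no path is given:
# with no path argument, to_parquet returns the Parquet file as bytes
data = df.to_parquet()
pd.read_parquet(io.BytesIO(data))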
References¶
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery
http://www.legendu.net/misc/blog/python-pandas-read_csv/
http://www.legendu.net/misc/blog/read-and-write-parquet-files-in-python/
https://docs.python.org/3/library/io.html
https://www.devdungeon.com/content/working-binary-data-python