import io
import pandas as pd
import sys
Tips and Traps¶
Use the Parquet format as much as possible instead of other binary or text formats.
Both Python libraries pyarrow and fastparquet can handle Parquet files; pyarrow is preferred to fastparquet (see the sketch after these tips).
When writing a pandas DataFrame to a Parquet file, you can specify the arguments schema and version to control the output schema of the DataFrame precisely (see DataFrame.to_parquet below). Notice that the version argument is required to avoid a known issue.
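As an illustration of engine selection, here is a minimal sketch (the DataFrame and the file name df.parquet are made up for the example):
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
# write using the pyarrow engine (the preferred one)
df.to_parquet("df.parquet", engine="pyarrow")
# read it back; engine="auto" (the default) tries pyarrow first
df2 = pd.read_parquet("df.parquet", engine="pyarrow")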
pandas.read_parquet¶
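pandas.read_parquet loads a Parquet file into a DataFrame. A minimal sketch (it assumes the file rank_j0.parquet produced in the next section):
import pandas as pd

# read the whole file
df = pd.read_parquet("rank_j0.parquet")
# read only a subset of columns, which avoids deserializing unneeded data
df_sub = pd.read_parquet("rank_j0.parquet", columns=["id", "rank"])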
DataFrame.to_parquet¶
When writing a pandas DataFrame to a Parquet file, you can specify the arguments schema and version to control the output schema of the DataFrame precisely. Notice that the version argument is required to avoid a known issue. Of course, you can also manually cast the data type of each column before outputting the DataFrame, but it is not as efficient as the first approach.
import pandas as pd
import pyarrow as pa

df_j0 = pd.read_csv("rank_j0.csv")
# define the exact Parquet types for each output column
schema = pa.schema(
    [
        ("id", pa.uint64()),
        ("mod", pa.uint32()),
        ("dups", pa.uint8()),
        ("rank", pa.uint32()),
    ]
)
# schema forces the column types; version="2.6" pins the Parquet format version
df_j0.to_parquet("rank_j0.parquet", version="2.6", schema=schema)
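For comparison, a minimal sketch of the manual-cast approach mentioned above (the pandas dtypes here mirror the pyarrow schema; that mapping is an assumption for illustration):
# cast each column explicitly before writing; this works,
# but is less efficient than passing a schema to to_parquet
df_j0 = df_j0.astype({"id": "uint64", "mod": "uint32", "dups": "uint8", "rank": "uint32"})
df_j0.to_parquet("rank_j0.parquet", version="2.6")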
Output Types of Columns¶
A trap to be aware of: a column of object dtype containing only nulls can be read into PySpark as a column of null integers. Specifying a schema when writing, as above, avoids the ambiguity; see https://stackoverflow.com/questions/50110044/how-to-force-parquet-dtypes-when-saving-pd-dataframe for a discussion.
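A minimal sketch of the workaround (the DataFrame and column names are made up for illustration): declare the intended type of the all-null column in a pyarrow schema so it is not guessed at write time.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "comment": [None, None, None]})
# without a schema, the type of the all-null "comment" column must be inferred;
# declaring it pa.string() stores the intended type in the Parquet file
schema = pa.schema([("id", pa.int64()), ("comment", pa.string())])
df.to_parquet("comments.parquet", version="2.6", schema=schema)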
DataFrame.to_csv¶
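A minimal sketch of common DataFrame.to_csv usage (the DataFrame and the file name are made up for the example):
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
# index=False drops the row index, which is usually noise in a CSV file
df.to_csv("df.csv", index=False)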
Read/Write A Pandas DataFrame From/To A Binary Stream¶
# an empty in-memory binary stream
bio = io.BytesIO()
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1], "z": [1, 1, 1, 1, 1]})
df.head()
# size of the empty buffer
sys.getsizeof(bio)
# write the DataFrame into the buffer in the Parquet format
df.to_parquet(bio)
# the buffer has grown to hold the Parquet bytes
sys.getsizeof(bio)
# read the DataFrame back directly from the buffer
pd.read_parquet(bio)
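Relatedly, if you only need the raw bytes rather than a stream, DataFrame.to_parquet returns them directly when no path is given:
# with no path argument, to_parquet returns the Parquet file as bytes
data = df.to_parquet()
pd.read_parquet(io.BytesIO(data))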
References¶
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery
http://www.legendu.net/misc/blog/python-pandas-read_csv/
http://www.legendu.net/misc/blog/read-and-write-parquet-files-in-python/
https://docs.python.org/3/library/io.html
https://www.devdungeon.com/content/working-binary-data-python