Ben Chuanlong Du's Blog

It is never too late to learn.

Hands on Python IO

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

StringIO
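
As a quick illustration of this section's topic, here is a minimal sketch of io.StringIO, an in-memory text stream that supports the usual file API:

```python
import io

# An in-memory text buffer that behaves like a file object
buf = io.StringIO()
buf.write("hello\n")
buf.write("world\n")

# Rewind to the beginning before reading, just like a real file
buf.seek(0)
print(buf.read())

# getvalue() returns the entire buffer regardless of the current position
print(buf.getvalue())
```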

IO in Rust

Handling Complicated Data Types in Python and PySpark

Tips and Traps

  1. An element in a pandas DataFrame can be any (complicated) Python type. To save a pandas DataFrame with arbitrary (complicated) types as-is, you have to use the pickle module. The method pandas.DataFrame.to_pickle (which is simply a wrapper over pickle.dump) serializes the DataFrame to a pickle file, while the function pandas.read_pickle loads it back (see the sketch below).
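
A minimal sketch of that round trip; the file name df.pkl is hypothetical:

```python
import pandas as pd

# Cells can hold arbitrary Python objects, e.g., lists
df = pd.DataFrame({"id": [1, 2], "tags": [["a", "b"], ["c"]]})

# Serialize the DataFrame as-is to a pickle file
df.to_pickle("df.pkl")

# Deserialize; the complicated cell types survive the round trip
df2 = pd.read_pickle("df.pkl")
assert df2.equals(df)
```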

Parsing YAML in Python

  1. PyYAML (which currently supports YAML 1.1) and ruamel.yaml (which supports YAML 1.2) are two Python libraries for parsing YAML. PyYAML is more widely used.

  2. PyYAML is preferred over the json module for serialization and deserialization, for multiple reasons.

    • YAML is a superset of JSON, so any valid JSON document is also valid YAML (see the sketch after this list).
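
A minimal sketch with PyYAML; safe_load and safe_dump are the safe entry points (they do not execute arbitrary tags):

```python
import yaml  # PyYAML

doc = """
name: example
ports:
  - 80
  - 443
"""

# Deserialize YAML text into Python objects
data = yaml.safe_load(doc)
print(data["ports"])  # [80, 443]

# Serialize Python objects back to YAML text
print(yaml.safe_dump(data))

# Since YAML is a superset of JSON, a JSON document parses as YAML too
print(yaml.safe_load('{"name": "example", "ports": [80, 443]}'))
```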

Read/Write Files/Tables in Spark

References

  • DataFrameReader APIs
  • DataFrameWriter APIs
  • https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources

Comments

  1. It is suggested that you specify a schema when reading text files. If no schema is specified, the column types are inferred, so it is good practice to verify them after reading (see the first sketch after these comments).

  2. Do NOT read data from and write data to the same path in Spark! Due to Spark's lazy evaluation, the path will likely be cleared before it is read into Spark, which throws IO exceptions. And the worst part is that your data on HDFS is removed and unrecoverable (see the second sketch below).
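
On comment 1, a minimal sketch of reading a CSV file with an explicit schema in PySpark; the path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("read_with_schema").getOrCreate()

# Hypothetical schema: declare column names and types up front
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
])

# With an explicit schema, Spark skips type inference entirely
df = spark.read.csv("/path/to/data.csv", schema=schema, header=True)
df.printSchema()
```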
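
On comment 2, a sketch of the hazard and a safer alternative, assuming hypothetical Parquet paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("same_path_hazard").getOrCreate()

# DANGEROUS: reading from and writing to the same path.
# mode("overwrite") may clear /data/events before the lazy read executes.
# df = spark.read.parquet("/data/events")
# df.filter(df.status == "ok").write.mode("overwrite").parquet("/data/events")

# Safer: write to a different (hypothetical) path and swap afterwards.
df = spark.read.parquet("/data/events")
df.filter(df.status == "ok").write.parquet("/data/events_clean")
```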