Ben Chuanlong Du's Blog

It is never too late to learn.

Hands on Python IO

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

StringIO
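
As a quick illustration of this section's topic, here is a minimal sketch of io.StringIO, an in-memory text stream that supports the usual file API:

```python
import io

# An in-memory text buffer that behaves like a file object
buf = io.StringIO()
buf.write("hello\n")
buf.write("world\n")

# Rewind to the beginning before reading, just like a real file
buf.seek(0)
print(buf.read())

# getvalue() returns the entire buffer regardless of the current position
print(buf.getvalue())
```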

IO in Rust

Handling Complicated Data Types in Python and PySpark

Tips and Traps

  1. An element in a pandas DataFrame can be any (complicated) Python type. To save a pandas DataFrame with arbitrary (complicated) types as-is, you have to use the pickle module. The method pandas.DataFrame.to_pickle (which is simply a wrapper over pickle.dump) serializes the DataFrame to a pickle file, while the function pandas.read_pickle loads it back (see the sketch below).
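
A minimal sketch of that round trip; the file name df.pkl is hypothetical:

```python
import pandas as pd

# Cells can hold arbitrary Python objects, e.g., lists
df = pd.DataFrame({"id": [1, 2], "tags": [["a", "b"], ["c"]]})

# Serialize the DataFrame as-is to a pickle file
df.to_pickle("df.pkl")

# Deserialize; the complicated cell types survive the round trip
df2 = pd.read_pickle("df.pkl")
assert df2.equals(df)
```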

Parsing YAML in Python

  1. PyYAML (which currently supports YAML 1.1) and ruamel.yaml (which supports YAML 1.2) are two Python libraries for parsing YAML. PyYAML is more widely used.

  2. PyYAML is preferred over the json module for serialization and deserialization, for multiple reasons.

    • YAML is a superset of JSON, so any valid JSON document is also valid YAML (see the sketch after this list).
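
A minimal sketch with PyYAML; safe_load and safe_dump are the safe entry points (they do not execute arbitrary tags):

```python
import yaml  # PyYAML

doc = """
name: example
ports:
  - 80
  - 443
"""

# Deserialize YAML text into Python objects
data = yaml.safe_load(doc)
print(data["ports"])  # [80, 443]

# Serialize Python objects back to YAML text
print(yaml.safe_dump(data))

# Since YAML is a superset of JSON, a JSON document parses as YAML too
print(yaml.safe_load('{"name": "example", "ports": [80, 443]}'))
```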

Read/Write Files/Tables in Spark

References

  • DataFrameReader APIs
  • DataFrameWriter APIs
  • https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources

Comments

  1. It is suggested that you specify a schema when reading text files. If no schema is specified, the column types are inferred, so it is good practice to verify them after reading (see the first sketch after these comments).

  2. Do NOT read data from and write data to the same path in Spark! Due to Spark's lazy evaluation, the path will likely be cleared before it is read into Spark, which throws IO exceptions. And the worst part is that your data on HDFS is removed and unrecoverable (see the second sketch below).
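
On comment 1, a minimal sketch of reading a CSV file with an explicit schema in PySpark; the path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("read_with_schema").getOrCreate()

# Hypothetical schema: declare column names and types up front
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
])

# With an explicit schema, Spark skips type inference entirely
df = spark.read.csv("/path/to/data.csv", schema=schema, header=True)
df.printSchema()
```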
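
On comment 2, a sketch of the hazard and a safer alternative, assuming hypothetical Parquet paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("same_path_hazard").getOrCreate()

# DANGEROUS: reading from and writing to the same path.
# mode("overwrite") may clear /data/events before the lazy read executes.
# df = spark.read.parquet("/data/events")
# df.filter(df.status == "ok").write.mode("overwrite").parquet("/data/events")

# Safer: write to a different (hypothetical) path and swap afterwards.
df = spark.read.parquet("/data/events")
df.filter(df.status == "ok").write.parquet("/data/events_clean")
```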