References¶
DataFrameReader APIs
DataFrameWriter APIs
https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources
Comments¶
It is suggested that you specify a schema when reading text files. If no schema is specified, it is good practice to check the types of the resulting columns, as they are inferred and may not be what you expect.
Do NOT read data from and write data to the same path in Spark! Due to Spark's lazy evaluation, the path will likely be cleared before the data is actually read, which throws IO exceptions. And the worst part is that your data on HDFS is removed and unrecoverable.
UDF in Spark
Comments¶
Use the higher-level standard Column-based functions with Dataset operators whenever possible, and resort to custom UDFs only when necessary: UDFs are a black box to Spark, so it does not even try to optimize them.
Rename and Drop Columns in Spark DataFrames
Comment¶
You can use withColumnRenamed to rename a column in a DataFrame. You can also rename a column using alias when selecting columns, and remove columns with drop.
Docker Images for Programming Languages
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Python
continuumio/miniconda3
- It is hard to figure out the version of Python from the version of the Docker image.
continuumio/anaconda3
- It is hard to figure out the version of …