Ben Chuanlong Du's Blog

It is never too late to learn.

Read/Write Files/Tables in Spark

References

DataFrameReader APIs

DataFrameWriter APIs

https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources

Comments

  1. It is suggested that you specify a schema when reading text files. If no schema is specified, the column types are inferred, so it is good practice to verify the inferred types afterwards.

  2. Do NOT read data from and write data to the same path in Spark! Due to Spark's lazy evaluation, the path will likely be cleared before it is actually read, which throws IO exceptions. And the worst part is that your data on HDFS is removed and is NOT recoverable.
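The safe pattern is to write the results to a temporary location first and only swap them into place once the job has finished. The sketch below illustrates the idea with plain Python file operations rather than Spark APIs; the function name `overwrite_safely` and the `transform` callback are hypothetical.

```python
import os
import shutil
import tempfile

def overwrite_safely(src_dir: str, dst_dir: str, transform) -> None:
    """Write transformed data to a temporary directory first, then swap it
    into place, so the source is never cleared before it is fully read."""
    tmp_dir = tempfile.mkdtemp()
    try:
        transform(src_dir, tmp_dir)  # read from src, write results to tmp
        shutil.rmtree(dst_dir, ignore_errors=True)
        shutil.move(tmp_dir, dst_dir)  # swap the finished output into place
    except Exception:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
```

With this pattern, `dst_dir` may even equal `src_dir`, because the source is only removed after `transform` has completed.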

Tips on Nox

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

It is suggested that you leverage professional CI/CD tools instead of nox for testing.

https://github.com/theacodes/nox

https://nox.thea.codes/en/stable/index.html

https://cjolowicz.github.io …

UDF in Spark

Comments

Use the higher-level standard Column-based functions with Dataset operators whenever possible before reverting to custom UDFs. UDFs are a blackbox to Spark, so it does not even try to optimize them.

Read Text File into a pandas DataFrame

Advanced Options

  1. The argument sep supports regular expressions! For example,

     :::python
     # a regex separator requires the (slower) Python parsing engine
     df = pd.read_csv(file, sep=" +", engine="python")

nrows: controls the number of rows to read.

skiprows: line numbers to skip (0-indexed), or the number of lines to skip at the start of the file.

skip_blank_lines: skip blank lines rather than interpreting them as NaN values (the default behavior).
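A small example combining the two row-limiting options; the inline StringIO data is a stand-in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a CSV file on disk.
csv = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

# skiprows=[1] drops the first data line (line 0 is the header);
# nrows=2 then reads only the next two rows.
df = pd.read_csv(csv, skiprows=[1], nrows=2)
```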

names: array-like, optional. List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
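For example, when the file already has a header row, passing names together with header=0 replaces the existing column names; again the inline StringIO data is a stand-in for a real file.

```python
import io
import pandas as pd

# The file has a header row ("a,b"); header=0 tells pandas to
# discard it and use the names passed via `names` instead.
csv = io.StringIO("a,b\n1,2\n3,4\n")
df = pd.read_csv(csv, names=["col1", "col2"], header=0)
```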