
Row-based Mapping and Filtering on DataFrames in Spark


Spark DataFrame is an alias for Dataset[Row]. Even though a Spark DataFrame is stored as Rows in a Dataset, the built-in operations/functions (in org.apache.spark.sql.functions) for Spark DataFrames are Column-based. Occasionally a transformation on a DataFrame is hard to express as Column expressions but very convenient to express as Row expressions. The traditional way to handle this is to wrap the row-based function in a UDF. It is worth knowing that Spark DataFrame also supports map/flatMap/filter APIs which work on Rows. However, they are still experimental as of Spark 2.4.3, so it is suggested that you stick to Column-based operations/functions until the Row-based methods mature.
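Below is a minimal sketch of row-based mapping and filtering in Scala (the DataFrame, column names, and values are illustrative, not from any particular dataset):

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder()
  .appName("row-based-ops")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A toy DataFrame for illustration.
val df = Seq(("Alice", 1), ("Bob", 2)).toDF("name", "count")

// Row-based map: extract fields by name and build a new record.
// An implicit Encoder for the result type (provided here by
// spark.implicits._) is required.
val doubled = df
  .map(row => (row.getAs[String]("name"), row.getAs[Int]("count") * 2))
  .toDF("name", "count")

// Row-based filter: a predicate on the whole Row instead of a Column expression.
val filtered = df.filter((row: Row) => row.getAs[Int]("count") > 1)
```

Note that map on a Dataset[Row] returns a Dataset of the lambda's result type, so toDF is used above to get back to a DataFrame with named columns.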

Collections in Kotlin

Fold vs Reduce

fold takes an initial value, and the first invocation of the lambda you pass to it receives that initial value and the first element of the collection as parameters. reduce doesn't take an initial value; instead, it starts with the first element of the collection as the accumulator (called sum in the following example).
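A minimal sketch of the difference (the list of numbers is illustrative):

```kotlin
fun main() {
    val numbers = listOf(1, 2, 3, 4, 5)

    // fold: the accumulator starts at the initial value 0,
    // so the first call of the lambda receives (0, 1).
    val folded = numbers.fold(0) { sum, element -> sum + element }
    println(folded) // 15

    // reduce: the accumulator starts as the first element,
    // so the first call of the lambda receives (1, 2).
    val reduced = numbers.reduce { sum, element -> sum + element }
    println(reduced) // 15
}
```

Because reduce seeds the accumulator with the first element, it throws on an empty collection, whereas fold simply returns the initial value.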

Functional Programming in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. The built-in function map applies a function to each element of an iterable. map is mostly not useful in Python since list comprehensions and generator expressions cover the same use cases. map supports iterating over multiple iterables … (see the sketch below)
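A minimal sketch of map versus a list comprehension, and of map over multiple iterables (the values are illustrative):

```python
numbers = [1, 2, 3]

# map applies a function to each element and returns a lazy iterator.
squares = map(lambda x: x * x, numbers)
print(list(squares))  # [1, 4, 9]

# The equivalent list comprehension is usually preferred as more readable.
print([x * x for x in numbers])  # [1, 4, 9]

# With multiple iterables, map passes one element from each per call,
# stopping at the shortest iterable.
print(list(map(lambda x, y: x + y, [1, 2, 3], [10, 20, 30])))  # [11, 22, 33]
```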