Ben Chuanlong Du's Blog

It is never too late to learn.

Row-based Mapping and Filtering on DataFrames in Spark


A Spark DataFrame is an alias for Dataset[Row]. Even though a DataFrame is stored as Rows in a Dataset, the built-in operations/functions (in org.apache.spark.sql.functions) for DataFrames are Column-based. Sometimes a transformation on a DataFrame is hard to express as Column expressions but very convenient to express as Row expressions. The traditional way to resolve this is to wrap the row-based function in a UDF. It is worth knowing that Spark DataFrames also support map/flatMap APIs which work directly on Rows. However, these are still experimental as of Spark 2.4.3, so it is suggested that you stick to Column-based operations/functions until the Row-based methods mature.
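Below is a minimal sketch contrasting the two approaches, assuming Spark 2.4.x running locally; the DataFrame contents and column names are illustrative:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.udf

object RowBasedDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("row-based-demo")
      .getOrCreate()
    import spark.implicits._ // brings in the Encoders needed by map/filter

    val df = Seq((1, "a"), (2, "bb"), (3, "ccc")).toDF("id", "word")

    // Column-based: wrap the per-value logic in a UDF and use it as a Column expression.
    val wordLen = udf((w: String) => w.length)
    df.withColumn("len", wordLen($"word")).show()

    // Row-based: map works on Row objects and returns a Dataset (here Dataset[(Int, Int)]).
    df.map(row => (row.getAs[Int]("id"), row.getAs[String]("word").length))
      .toDF("id", "len")
      .show()

    // Row-based filter: a predicate on Row instead of a Column condition.
    df.filter((row: Row) => row.getAs[Int]("id") > 1).show()

    spark.stop()
  }
}
```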

Use Kotlin in a Scala Project

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. Methods of a Kotlin object can be called from a Scala project via KotlinObject.INSTANCE.methodToCall() (see the sketch after this list).

  2. You might need to provide the Kotlin standard library kotlin-stdlib.jar in order to run …
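As an illustration of item 1, here is a minimal sketch assuming a hypothetical Kotlin object named Greeter is compiled into the same project; the Kotlin compiler turns an object declaration into a class with a static INSTANCE field, which is what the Scala side accesses:

```scala
// Hypothetical Kotlin side (Greeter.kt), compiled alongside the Scala sources:
//
//     object Greeter {
//         fun greet(name: String): String = "Hello, $name!"
//     }
//
// Scala side: the Kotlin `object` is exposed as a class whose single
// instance lives in the static INSTANCE field.
object Main {
  def main(args: Array[String]): Unit = {
    println(Greeter.INSTANCE.greet("Scala")) // prints "Hello, Scala!"
  }
}
```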