Row-based Mapping and Filtering on DataFrames in Spark
Spark DataFrame is an alias for Dataset[Row].
Even though a Spark DataFrame is stored as Rows in a Dataset,
the built-in operations/functions (in org.apache.spark.sql.functions) for Spark DataFrames are Column-based.
Sometimes there are transformations on a DataFrame that are hard to express as Column expressions
but very convenient to express as Row expressions.
The traditional way to resolve this issue is to wrap the row-based function in a UDF.
It is worth knowing that Spark DataFrame also supports map/flatMap APIs
which work on Rows.
However, they are still experimental as of Spark 2.4.3,
so it is suggested that you stick to Column-based operations/functions until the Row-based methods mature.
Both approaches are illustrated in the sketch below.
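Below is a minimal Scala sketch of both approaches, assuming a toy DataFrame with integer columns x and y; the names addUdf and z are illustrative, not part of any Spark API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("row-based-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 2), (3, 4)).toDF("x", "y")

// Approach 1: wrap the row-wise logic in a UDF and use it as a Column expression.
val addUdf = udf((x: Int, y: Int) => x + y)
df.withColumn("z", addUdf($"x", $"y")).show()

// Approach 2: use the (experimental) Row-based map API.
// Mapping over Rows discards the Row schema, so an explicit result type
// (here a tuple, whose Encoder comes from spark.implicits._) is required.
val mapped = df
  .map(row => (row.getInt(0), row.getInt(1), row.getInt(0) + row.getInt(1)))
  .toDF("x", "y", "z")
mapped.show()
```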
Row Object in Spark
Case of Column Names in Spark DataFrames
Even though the Spark DataFrame/SQL APIs resolve column names case-insensitively (by default), the column names saved into HDFS preserve their original case and are case-sensitive!
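Here is a minimal sketch of this behavior, assuming the default setting spark.sql.caseSensitive=false and an illustrative output path /tmp/case_demo:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("case-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a")).toDF("userId", "name")

// The DataFrame/SQL APIs resolve column names case-insensitively by default,
// so all of these select the same column:
df.select("userId").show()
df.select("userid").show()
df.select("USERID").show()

// However, the Parquet files written out preserve the original casing
// ("userId"), so downstream tools with case-sensitive semantics
// must match it exactly.
df.write.mode("overwrite").parquet("/tmp/case_demo")
```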
Use Kotlin in a Scala Project
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
- Methods of a Kotlin object can be called in a Scala project via KotlinObject.INSTANCE.methodToCall(), as shown in the sketch after this list.
- You might need to provide the Kotlin standard library kotlin-stdlib.jar in order to run …
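A minimal sketch of the interop, assuming a hypothetical Kotlin object StringUtils compiled into the same project (the object and method names are made up for illustration):

```scala
// Hypothetical Kotlin side, compiled together with the Scala sources:
//
//   object StringUtils {
//       fun shout(s: String): String = s.uppercase() + "!"
//   }
//
// A Kotlin `object` compiles to a JVM class exposing a static INSTANCE
// field holding the singleton, so Scala reaches its methods through it:
object KotlinInteropDemo {
  def main(args: Array[String]): Unit = {
    println(StringUtils.INSTANCE.shout("hello")) // prints "HELLO!"
  }
}
```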