Comments
Prefer the higher-level standard Column-based functions and Dataset operators whenever possible before resorting to custom UDFs: a UDF is a black box to Spark, so the Catalyst optimizer does not even try to optimize it.
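For comparison, the uppercase transformation built as a UDF later in this notebook can be expressed with the built-in org.apache.spark.sql.functions.upper, which Catalyst can optimize. A minimal sketch, assuming the df with a string column text and the spark.implicits._ import defined below:

import org.apache.spark.sql.functions.upper
// Built-in Column function: no UDF, fully visible to the optimizer
df.withColumn("upper", upper($"text")).show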
In [2]:
interp.load.ivy("org.apache.spark" %% "spark-core" % "3.0.0")
interp.load.ivy("org.apache.spark" %% "spark-sql" % "3.0.0")
In [5]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.master("local[2]")
.appName("Spark UDF Examples")
.getOrCreate()
import spark.implicits._
Out[5]:
In [4]:
val df = Seq(
(0, "hello"),
(1, "world")
).toDF("id", "text")
df.show
Out[4]:
In [4]:
import org.apache.spark.sql.functions.udf
val upper: String => String = _.toUpperCase
val upperUDF = udf(upper)
Out[4]:
In [5]:
df.withColumn("upper", upperUDF($"text")).show
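Note that Scala UDFs like this are not null-safe: if text contained a null, _.toUpperCase would throw a NullPointerException. A minimal sketch of a null-tolerant variant (the name upperSafe is illustrative; Spark encodes a None result as null):

// Wrap the input in Option so null inputs map to null outputs
val upperSafe = udf((s: String) => Option(s).map(_.toUpperCase))
df.withColumn("upper", upperSafe($"text")).show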
In [6]:
val someUDF = udf((arg1: Long, arg2: Long) => {
arg1 + arg2
})
Out[6]:
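A usage sketch for someUDF, applying it to the id column with a literal second argument (lit comes from org.apache.spark.sql.functions; Spark inserts a safe upcast from the integer column to Long):

import org.apache.spark.sql.functions.lit
// Pass one column and one literal as the UDF's two Long arguments
df.withColumn("id_plus_10", someUDF($"id", lit(10L))).show

A UDF can also be registered under a name for use from SQL. A minimal sketch, assuming the temp view name t is free (the name addLongs is illustrative):

spark.udf.register("addLongs", (a: Long, b: Long) => a + b)
df.createOrReplaceTempView("t")
spark.sql("SELECT id, addLongs(id, 10) AS id_plus_10 FROM t").show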
References
https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html
https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-udfs.html
https://blog.cloudera.com/blog/2017/02/working-with-udfs-in-apache-spark/