Ben Chuanlong Du's Blog

It is never too late to learn.

Subtle Differences Between Spark DataFrame and PySpark DataFrame

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. Besides using the col function to reference a column, the Spark/Scala DataFrame supports the $"col_name" syntax (based on an implicit conversion, which requires import spark.implicits._), while the PySpark DataFrame supports using …
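A minimal sketch of the Scala-side syntax (the DataFrame df and its columns are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._ // enables the $"col_name" syntax

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Equivalent ways to reference a column in Scala:
df.select(col("id")) // the col function
df.select($"id")     // via the implicit conversion from spark.implicits._
df.select(df("id"))  // applying the DataFrame itself
```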

Distributed Training of Models on Spark

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

XGBoost

http://www.legendu.net/misc/blog/use-xgboost-with-spark/

LightGBM

http://www.legendu.net/misc/blog/use-lightgbm-with-spark/

BigDL

MMLSpark

Ray

You can run Ray on top of Spark via analytics-zoo …

Aggregate DataFrames in Spark

Aggregation Without Grouping

  1. You can aggregate over all rows of a DataFrame without grouping: just use aggregation functions in select without groupBy, which is very similar to the SQL syntax. See the sketch after this list.

  2. The aggregation functions all and any are available since Spark 3.0. However, they can be emulated using other aggregation functions such as sum, as shown below.
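A minimal sketch illustrating both points; the DataFrame, its column names, and the sum-based emulation of any/all are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, sum, when}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, true), (2, false), (3, true)).toDF("x", "flag")

// Aggregation without grouping: aggregation functions in select,
// no groupBy needed.
df.select(sum($"x").as("total"), max($"x").as("max_x")).show()

// Emulating any/all with sum (works before Spark 3.0):
df.select(
  (sum(when($"flag", 1).otherwise(0)) > 0).as("any_flag"),  // any: at least one true
  (sum(when($"flag", 0).otherwise(1)) === 0).as("all_flag") // all: no false values
).show()
```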

DataFrame for JVM

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Spark DataFrame

Spark DataFrame is a great implementation of a distributed DataFrame, if you don't mind having a dependency on Spark. It can, of course, also be used in a non-distributed way. Spark DataFrame …
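For instance, a minimal sketch of using a Spark DataFrame in a non-distributed (local) way:

```scala
import org.apache.spark.sql.SparkSession

// Run Spark locally, without a cluster.
val spark = SparkSession
  .builder()
  .appName("local-dataframe")
  .master("local[*]") // use all cores of the local machine
  .getOrCreate()

import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
df.show()
```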

Row-based Mapping and Filtering on DataFrames in Spark

Comments

A Spark DataFrame is an alias for Dataset[Row]. Even though a Spark DataFrame is stored as Rows in a Dataset, the built-in operations/functions (in org.apache.spark.sql.functions) for Spark DataFrames are Column-based. Sometimes a transformation on a DataFrame is hard to express as Column expressions but very convenient to express as Row expressions. The traditional way to resolve this issue is to wrap the row-based function into a UDF. It is worth knowing that Spark DataFrame supports map/flatMap APIs which work on Rows. However, they are still experimental as of Spark 2.4.3, so it is suggested that you stick to Column-based operations/functions until the Row-based methods mature.
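As an illustration, here is a sketch of both approaches; the DataFrame and the row-based function are invented for the example:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 2), (3, 4)).toDF("x", "y")

// 1. The traditional approach: wrap the row-based logic into a UDF
//    and keep working with Column expressions.
val add = udf((x: Int, y: Int) => x + y)
df.select(add($"x", $"y").as("sum")).show()

// 2. The (experimental) Row-based approach: map over Rows directly.
//    An Encoder for the result type is picked up from spark.implicits._.
df.map((row: Row) => row.getInt(0) + row.getInt(1)).show()
```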