Ben Chuanlong Du's Blog

It is never too late to learn.

Subtle Differences Between Spark DataFrame and PySpark DataFrame

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. Besides using the col function to reference a column, the Spark/Scala DataFrame supports the $"col_name" syntax (based on an implicit conversion, which requires import spark.implicits._), while the PySpark DataFrame supports using …
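A minimal sketch of the Scala-side syntax (the DataFrame df and its columns are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._ // enables the $"col_name" syntax

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Equivalent ways to reference a column in Scala:
df.select(col("id")) // the col function
df.select($"id")     // via the implicit conversion from spark.implicits._
df.select(df("id"))  // applying the DataFrame itself
```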

Distributed Training of Models on Spark

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

XGBoost

http://www.legendu.net/misc/blog/use-xgboost-with-spark/

LightGBM

http://www.legendu.net/misc/blog/use-lightgbm-with-spark/

BigDL

MMLSpark

Ray

You can run Ray on top of Spark via analytics-zoo …

Aggregate DataFrames in Spark

Aggregation Without Grouping

  1. You can aggregate over all rows of a DataFrame without grouping: just use aggregation functions in select without groupBy, which is very similar to the SQL syntax. See the sketch after this list.

  2. The aggregation functions all and any are available since Spark 3.0. However, they can be emulated using other aggregation functions such as sum, as shown below.
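A minimal sketch illustrating both points; the DataFrame, its column names, and the sum-based emulation of any/all are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, sum, when}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, true), (2, false), (3, true)).toDF("x", "flag")

// Aggregation without grouping: aggregation functions in select,
// no groupBy needed.
df.select(sum($"x").as("total"), max($"x").as("max_x")).show()

// Emulating any/all with sum (works before Spark 3.0):
df.select(
  (sum(when($"flag", 1).otherwise(0)) > 0).as("any_flag"),  // any: at least one true
  (sum(when($"flag", 0).otherwise(1)) === 0).as("all_flag") // all: no false values
).show()
```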

DataFrame for JVM

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Spark DataFrame

Spark DataFrame is a great implementation of a distributed DataFrame, if you don't mind having a dependency on Spark. It can, of course, also be used in a non-distributed way. Spark DataFrame …
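For instance, a minimal sketch of using a Spark DataFrame in a non-distributed (local) way:

```scala
import org.apache.spark.sql.SparkSession

// Run Spark locally, without a cluster.
val spark = SparkSession
  .builder()
  .appName("local-dataframe")
  .master("local[*]") // use all cores of the local machine
  .getOrCreate()

import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
df.show()
```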

Row-based Mapping and Filtering on DataFrames in Spark

Comments

A Spark DataFrame is an alias for Dataset[Row]. Even though a Spark DataFrame is stored as Rows in a Dataset, the built-in operations/functions (in org.apache.spark.sql.functions) for Spark DataFrames are Column-based. Sometimes a transformation on a DataFrame is hard to express as Column expressions but very convenient to express as Row expressions. The traditional way to resolve this issue is to wrap the row-based function into a UDF. It is worth knowing that Spark DataFrame supports map/flatMap APIs which work on Rows. However, they are still experimental as of Spark 2.4.3, so it is suggested that you stick to Column-based operations/functions until the Row-based methods mature.
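As an illustration, here is a sketch of both approaches; the DataFrame and the row-based function are invented for the example:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 2), (3, 4)).toDF("x", "y")

// 1. The traditional approach: wrap the row-based logic into a UDF
//    and keep working with Column expressions.
val add = udf((x: Int, y: Int) => x + y)
df.select(add($"x", $"y").as("sum")).show()

// 2. The (experimental) Row-based approach: map over Rows directly.
//    An Encoder for the result type is picked up from spark.implicits._.
df.map((row: Row) => row.getInt(0) + row.getInt(1)).show()
```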