Ben Chuanlong Du's Blog

It is never too late to learn.

Use LightGBM With Spark

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md

MMLSpark seems to be the best option for training LightGBM models on a Spark cluster. Note that MMLSpark requires …

Build Spark from Source

You can download a prebuilt Spark binary at https://spark.apache.org/downloads.html. This is where you should start, and it will likely satisfy your needs most of the time …

Subtle Differences Between Spark DataFrame and PySpark DataFrame

  1. Besides using the col function to reference a column, a Spark/Scala DataFrame supports $"col_name" (which relies on an implicit conversion and requires import spark.implicits._), while a PySpark DataFrame supports using …
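
The Scala side of the point above can be sketched as follows. This is a minimal illustration, not code from the original post: the object name `ColumnReferenceSketch`, the helper `selectBothWays`, and the sample data are all hypothetical, and it assumes Spark SQL is on the classpath and a local master is acceptable.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ColumnReferenceSketch {
  // Selects the "value" column twice: once via col(...), once via $"...".
  def selectBothWays(spark: SparkSession): (Seq[Int], Seq[Int]) = {
    // Brings the $"..." string interpolator and toDF into scope.
    import spark.implicits._

    // Hypothetical sample data, for illustration only.
    val df = Seq(("a", 1), ("b", 2)).toDF("name", "value")

    val viaCol = df.select(col("value")).as[Int].collect().toSeq.sorted
    val viaDollar = df.select($"value").as[Int].collect().toSeq.sorted
    (viaCol, viaDollar)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-reference-sketch")
      .getOrCreate()
    try {
      val (viaCol, viaDollar) = selectBothWays(spark)
      // Both styles resolve to the same column.
      assert(viaCol == viaDollar)
    } finally {
      spark.stop()
    }
  }
}
```

Without the `import spark.implicits._` line, the `$"value"` expression does not compile, while `col("value")` only needs the import from `org.apache.spark.sql.functions`.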

Distributed Training of Models on Spark

XGBoost

http://www.legendu.net/misc/blog/use-xgboost-with-spark/

LightGBM

http://www.legendu.net/misc/blog/use-lightgbm-with-spark/

BigDL

MMLSpark

Ray

You can run Ray on top of Spark via analytics-zoo …

Dataframe for JVM

Spark DataFrame

Spark DataFrame is a great implementation of a distributed DataFrame, if you don't mind a dependency on Spark. It can, of course, also be used in a non-distributed way. Spark DataFrame …

Spark vs Redshift

https://www.quora.com/Spark-vs-Redshift-Should-I-be-using-both-for-big-data-Which-is-better

Performance

https://dbseer.com/benchmark-comparison-spark-sql-redshift-cluster/

Redshift vs …