Ben Chuanlong Du's Blog

It is never too late to learn.

Build Spark from Source

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

You can download a prebuilt Spark binary from https://spark.apache.org/downloads.html. This is where you should start, and it will likely satisfy your needs most of the time …

Subtle Differences Between Spark DataFrame and PySpark DataFrame

  1. Besides using the col function to reference a column, a Spark/Scala DataFrame supports $"col_name" (based on an implicit conversion, which requires import spark.implicits._), while a PySpark DataFrame supports using …

Spark vs Redshift

https://www.quora.com/Spark-vs-Redshift-Should-I-be-using-both-for-big-data-Which-is-better

Performance

https://dbseer.com/benchmark-comparison-spark-sql-redshift-cluster/

Redshift vs …

Docker Images for Zeppelin

Official Zeppelin Docker Image

  1. Pull the official Zeppelin Docker image.

    docker pull apache/zeppelin
    
  2. Launch the image in a container.

    docker run -d -p 8080:8080 \
        -v $PWD/logs:/logs \
        -v …

Tips on Spark MLlib

  1. The RDD-based API of Spark MLlib supports stratified sampling, but the DataFrame-based API had not implemented it as of Spark 2.4.3.

sample keys (not rows) with equal probability
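
As a rough illustration of what stratified sampling by key means, here is a minimal sketch in plain Python rather than Spark. The function name `stratified_sample` and the sample data are hypothetical. It only mimics the general idea behind the RDD method `sampleByKey` (each record is kept independently with a probability that depends on its key) and should not be read as Spark's implementation.

```python
import random
from collections import Counter


def stratified_sample(pairs, fractions, seed=None):
    """Keep each (key, value) pair independently with probability fractions[key].

    Keys missing from `fractions` are treated as having fraction 0.0,
    which is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions.get(k, 0.0)]


# Hypothetical data: stratum "a" is ten times larger than stratum "b".
pairs = [("a", i) for i in range(1000)] + [("b", i) for i in range(100)]

# Sample roughly 10% of "a" and roughly 50% of "b".
sample = stratified_sample(pairs, {"a": 0.1, "b": 0.5}, seed=42)
counts = Counter(k for k, _ in sample)
print(counts)
```

The resulting stratum sizes are only approximately proportional to the requested fractions, because each record is drawn independently.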

References

https://spark …