
Build Spark from Source

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

You can download prebuilt Spark binaries from https://spark.apache.org/downloads.html. That is where you should start, and it will likely satisfy your needs most of the time. However, if you do need to build Spark from source, follow the instructions below.
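For example, a prebuilt package can be fetched and extracted roughly as in the sketch below. The mirror URL, package name, and install location are assumptions; pick the release and Hadoop profile that match your environment from the download page.

    # Download a prebuilt Spark 2.4.5 package from the Apache archive
    # (URL, file name, and target directory are illustrative; adjust as needed).
    wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
    tar -xzf spark-2.4.5-bin-hadoop2.7.tgz -C /opt
    export SPARK_HOME=/opt/spark-2.4.5-bin-hadoop2.7
    export PATH="$SPARK_HOME/bin:$PATH"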

The official build guide is at https://spark.apache.org/docs/latest/building-spark.html.

  1. Clone Spark 2.4.5 (the v2.4.5 release tag).

    git clone --depth 1 --branch v2.4.5 https://github.com/apache/spark.git
    
  2. Use Scala 2.11.

    ./dev/change-scala-version.sh 2.11
    
  3. Run dev/make-distribution.sh to build a deployable distribution tarball.

    ./dev/make-distribution.sh \
        --name hadoop2.7 \
        --tgz \
        -Pyarn \
        -Phive \
        -Phive-thriftserver \
        -Phadoop-2.7 \
        -Dhadoop.version=2.7.3 \
        -Dscala-2.11
    
  4. Alternatively, build Spark directly with Maven if you only need the compiled jars (this does not produce a distribution tarball).

    ./build/mvn \
        -Pyarn \
        -Phive \
        -Phive-thriftserver \
        -Phadoop-2.7 \
        -Dhadoop.version=2.7.3 \
        -Dscala-2.11 \
        -DskipTests \
        clean package
    

    The tarball produced by make-distribution.sh in step 3 (named spark-2.4.5-bin-hadoop2.7.tgz with the options above) is what you deploy: copy it to your Spark client machine and extract it to the right location (see the deployment sketch after this list).

  5. If you are building Spark for use inside your company, you probably need to replace the conf directory with your company's customized configuration, as shown in the sketch below.
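As referenced in steps 4 and 5, deploying the freshly built distribution is just a matter of copying the tarball, unpacking it, and swapping in your configuration. This is a minimal sketch: the host name spark-client, the /opt install prefix, and the /path/to/company/conf directory are all assumptions, so adjust them to your environment.

    # Ship the distribution tarball to a Spark client host and unpack it
    # (host name and paths are assumptions).
    scp spark-2.4.5-bin-hadoop2.7.tgz user@spark-client:/tmp/
    ssh user@spark-client tar -xzf /tmp/spark-2.4.5-bin-hadoop2.7.tgz -C /opt

    # Overwrite the default conf directory with the company-specific one
    # (/path/to/company/conf is a hypothetical location).
    ssh user@spark-client cp -r /path/to/company/conf/. /opt/spark-2.4.5-bin-hadoop2.7/conf/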
