Static Analyzer¶
If we can get the execution plan, then it is quite easy to analyze ... (see the sketch after the links below)
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-lineage.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-dependencies.html
http://hydronitrogen.com/in-the-code-spark-sql-query-planning-and-execution.html
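As a minimal sketch of where such analysis could start: Spark SQL exposes its plans directly on a Dataset. explain(true) prints the parsed, analyzed, optimized, and physical plans, and queryExecution exposes them programmatically (the example DataFrame below is purely illustrative):

import org.apache.spark.sql.SparkSession

object PlanInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("PlanInspection")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "label").filter($"id" > 1)

    // Print parsed/analyzed/optimized logical plans plus the physical plan.
    df.explain(true)

    // Grab the plans programmatically -- this is the hook a static analyzer would use.
    val qe = df.queryExecution
    println(qe.optimizedPlan) // Catalyst logical plan after optimization
    println(qe.executedPlan)  // physical SparkPlan

    spark.stop()
  }
}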
Spark Testing Frameworks/Tools¶
You can use the Scala testing frameworks ScalaTest (recommended) and Specs, or you can use frameworks/tools built on top of them specifically for Spark. Various discussions suggest that Spark Testing Base is a good one.
https://www.slideshare.net/SparkSummit/beyond-parallelize-and-collect-by-holden-karau
Spark Unit Testing¶
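A minimal sketch of a unit test with Spark Testing Base, following the style of the Cloudera post linked further down this page; the class and helper names are illustrative, and newer ScalaTest versions rename FunSuite to AnyFunSuite:

import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.rdd.RDD
import org.scalatest.FunSuite

// SharedSparkContext provides one local SparkContext (as `sc`) shared across
// the suite, avoiding the cost of starting a new context per test case.
class SampleRDDTest extends FunSuite with SharedSparkContext {

  // The transformation under test -- a purely illustrative helper.
  def tokenize(lines: RDD[String]): RDD[List[String]] =
    lines.map(_.split(" ").toList)

  test("tokenize splits lines on spaces") {
    val input    = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}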
Spark Performance Test¶
https://github.com/databricks/spark-perf
Spark Integration Test¶
https://github.com/databricks/spark-integration-tests
Spark Job Validation¶
https://www.slideshare.net/SparkSummit/beyond-parallelize-and-collect-by-holden-karau
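One common approach discussed in talks like the one linked above is counter-based validation: track data-quality counters with accumulators while the job runs, then check them against expected bounds before declaring the run good. A hedged sketch, where the input/output paths and the 1% threshold are made up for illustration:

import org.apache.spark.sql.SparkSession

object ValidatedJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ValidatedJob")
      .getOrCreate()
    val sc = spark.sparkContext

    val invalidRecords = sc.longAccumulator("invalidRecords")
    val totalRecords   = sc.longAccumulator("totalRecords")

    // Note: accumulators updated inside transformations can over-count when
    // tasks are retried; treat them as validation signals, not exact counts.
    val parsed = sc.textFile("input.txt").flatMap { line =>
      totalRecords.add(1)
      line.split(",") match {
        case Array(id, value) => Some((id, value))
        case _                => invalidRecords.add(1); None
      }
    }

    parsed.saveAsTextFile("output")

    // Validate only after an action has run; accumulator values are
    // meaningless before the job actually executes.
    val badRatio = invalidRecords.value.toDouble / math.max(totalRecords.value, 1)
    require(badRatio < 0.01, s"Too many invalid records: $badRatio")

    spark.stop()
  }
}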
QuickCheck/ScalaCheck
- QuickCheck generates test data under a set of constraints
- The Scala version is ScalaCheck, which is supported by the two Spark unit testing libraries (see the sketch after this list):
  - sscheck
    - Awesome people
    - supports generating DStreams too!
  - spark-testing-base
    - Awesome people
    - generates more pathological RDDs (e.g. with empty partitions)
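A minimal sketch of plain ScalaCheck (no Spark involved), just to show the property-based style that sscheck and spark-testing-base extend to RDDs and DStreams; the property itself is illustrative:

import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

// ScalaCheck generates many random inputs, checks the property on each,
// and shrinks any failing input down to a minimal counterexample.
object StringProps extends Properties("String") {
  property("concat length") = forAll { (a: String, b: String) =>
    (a + b).length == a.length + b.length
  }
}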
Testing Spark Applications¶
Good Discussions¶
http://blog.ippon.tech/testing-strategy-apache-spark-jobs/
http://blog.ippon.tech/testing-strategy-for-spark-streaming/
https://www.youtube.com/watch?v=rOQEiTXNS0g
https://www.slideshare.net/SparkSummit/beyond-parallelize-and-collect-by-holden-karau
https://medium.com/@mrpowers/validating-spark-dataframe-schemas-28d2b3c69d2a
More¶
https://medium.com/@mrpowers/testing-spark-applications-8c590d3215fa
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
https://dzone.com/articles/testing-spark-code
https://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
https://opencredo.com/spark-testing/
http://eugenezhulenev.com/blog/2014/10/18/run-tests-in-standalone-spark-cluster/
Data Generator¶
Please refer to Data for Testing for data generator tools.
Data Quality¶
Please refer to Data Quality for data quality related tools.
Locust¶
Locust is a tool/framework for writing code that simulates real user behaviour in a fairly realistic way. For example, it is very common to store state for each simulated user. Once you have written your user-behaviour code, you can simulate many simultaneous users by running it distributed across multiple machines, and hopefully get realistic load sent to your system.
If I wanted to just send a lot of requests per second to one or a very few URL endpoints, I would use something like ApacheBench instead, and I'm the author of Locust.
ApacheBench¶
ApacheBench (ab) is a single-threaded command-line program for measuring the performance of HTTP web servers. Originally designed to test the Apache HTTP Server, it is generic enough to test any web server.
Other¶
- PipelineAI looks really interesting!
A SparkSession configuration commonly used for local tests:

import org.apache.spark.sql.SparkSession

val sparkSession: SparkSession = SparkSession.builder()
  .master("local[2]")                          // small local "cluster" for tests
  .appName("TestSparkApp")
  .config("spark.sql.shuffle.partitions", "1") // keep shuffles cheap in tests
  // Resolve the JVM temp dir; passing the literal string "java.io.tmpdir"
  // would create a directory with that name instead of using the system property.
  .config("spark.sql.warehouse.dir", System.getProperty("java.io.tmpdir"))
  .getOrCreate()

import sparkSession.implicits._