Static Analyzer¶
If we can get the execution plan, then it is quite easy to analyze ... (see the sketch after the links below)
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-lineage.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-dependencies.html
http://hydronitrogen.com/in-the-code-spark-sql-query-planning-and-execution.html
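As a minimal sketch of where such analysis could start: Spark SQL exposes its plans directly on a Dataset. explain(true) prints the parsed, analyzed, optimized, and physical plans, and queryExecution exposes them programmatically (the example DataFrame below is purely illustrative):

import org.apache.spark.sql.SparkSession

object PlanInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("PlanInspection")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "label").filter($"id" > 1)

    // Print parsed/analyzed/optimized logical plans plus the physical plan.
    df.explain(true)

    // Grab the plans programmatically -- this is the hook a static analyzer would use.
    val qe = df.queryExecution
    println(qe.optimizedPlan) // Catalyst logical plan after optimization
    println(qe.executedPlan)  // physical SparkPlan

    spark.stop()
  }
}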
Spark Testing Frameworks/Tools¶
You can use the Scala testing frameworks ScalaTest (recommended) and Specs, or you can use frameworks/tools built on top of them specifically for Spark. Various discussions suggest that Spark Testing Base is a good one.
https://www.slideshare.net/SparkSummit/beyond-parallelize-and-collect-by-holden-karau
Spark Unit Testing¶
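A minimal sketch of a unit test with Spark Testing Base, following the style of the Cloudera post linked further down this page; the class and helper names are illustrative, and newer ScalaTest versions rename FunSuite to AnyFunSuite:

import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.rdd.RDD
import org.scalatest.FunSuite

// SharedSparkContext provides one local SparkContext (as `sc`) shared across
// the suite, avoiding the cost of starting a new context per test case.
class SampleRDDTest extends FunSuite with SharedSparkContext {

  // The transformation under test -- a purely illustrative helper.
  def tokenize(lines: RDD[String]): RDD[List[String]] =
    lines.map(_.split(" ").toList)

  test("tokenize splits lines on spaces") {
    val input    = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}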
Spark Performance Test¶
https://github.com/databricks/spark-perf
Spark Integration Test¶
https://github.com/databricks/spark-integration-tests
Spark Job Validation¶
https://www.slideshare.net/SparkSummit/beyond-parallelize-and-collect-by-holden-karau
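One common approach discussed in talks like the one linked above is counter-based validation: track data-quality counters with accumulators while the job runs, then check them against expected bounds before declaring the run good. A hedged sketch, where the input/output paths and the 1% threshold are made up for illustration:

import org.apache.spark.sql.SparkSession

object ValidatedJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ValidatedJob")
      .getOrCreate()
    val sc = spark.sparkContext

    val invalidRecords = sc.longAccumulator("invalidRecords")
    val totalRecords   = sc.longAccumulator("totalRecords")

    // Note: accumulators updated inside transformations can over-count when
    // tasks are retried; treat them as validation signals, not exact counts.
    val parsed = sc.textFile("input.txt").flatMap { line =>
      totalRecords.add(1)
      line.split(",") match {
        case Array(id, value) => Some((id, value))
        case _                => invalidRecords.add(1); None
      }
    }

    parsed.saveAsTextFile("output")

    // Validate only after an action has run; accumulator values are
    // meaningless before the job actually executes.
    val badRatio = invalidRecords.value.toDouble / math.max(totalRecords.value, 1)
    require(badRatio < 0.01, s"Too many invalid records: $badRatio")

    spark.stop()
  }
}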
QuickCheck/ScalaCheck
- QuickCheck generates test data under a set of constraints
- The Scala version is ScalaCheck, which is supported by the two Spark unit testing libraries (see the sketch after this list):
  - sscheck
    - Awesome people
    - supports generating DStreams too!
  - spark-testing-base
    - Awesome people
    - generates more pathological RDDs (e.g. with empty partitions)
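A minimal sketch of plain ScalaCheck (no Spark involved), just to show the property-based style that sscheck and spark-testing-base extend to RDDs and DStreams; the property itself is illustrative:

import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

// ScalaCheck generates many random inputs, checks the property on each,
// and shrinks any failing input down to a minimal counterexample.
object StringProps extends Properties("String") {
  property("concat length") = forAll { (a: String, b: String) =>
    (a + b).length == a.length + b.length
  }
}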
Testing Spark Applications¶
Good Discussions¶
http://blog.ippon.tech/testing-strategy-apache-spark-jobs/
http://blog.ippon.tech/testing-strategy-for-spark-streaming/
https://www.youtube.com/watch?v=rOQEiTXNS0g
https://www.slideshare.net/SparkSummit/beyond-parallelize-and-collect-by-holden-karau
https://medium.com/@mrpowers/validating-spark-dataframe-schemas-28d2b3c69d2a
More¶
https://medium.com/@mrpowers/testing-spark-applications-8c590d3215fa
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
https://dzone.com/articles/testing-spark-code
https://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
https://opencredo.com/spark-testing/
http://eugenezhulenev.com/blog/2014/10/18/run-tests-in-standalone-spark-cluster/
Data Generator¶
Please refer to Data for Testing for data generator tools.
Data Quality¶
Please refer to Data Quality for data quality related tools.
Locust¶
Locust is a tool/framework for writing code that simulates real user behaviour in a fairly realistic way. For example, it is very common to store state for each simulated user. Once you have written your user-behaviour code, you can simulate many simultaneous users by running it distributed across multiple machines, and hopefully get realistic load sent to your system.
If I wanted to just send a lot of requests per second to one or a very few URL endpoints, I would use something like ApacheBench instead, and I'm the author of Locust.
ApacheBench¶
ApacheBench (ab) is a single-threaded command-line program for measuring the performance of HTTP web servers. Originally designed to test the Apache HTTP Server, it is generic enough to test any web server.
Other¶
- PipelineAI looks really interesting!
A SparkSession configuration commonly used for local tests:

import org.apache.spark.sql.SparkSession

val sparkSession: SparkSession = SparkSession.builder()
  .master("local[2]")                          // small local "cluster" for tests
  .appName("TestSparkApp")
  .config("spark.sql.shuffle.partitions", "1") // keep shuffles cheap in tests
  // Resolve the JVM temp dir; passing the literal string "java.io.tmpdir"
  // would create a directory with that name instead of using the system property.
  .config("spark.sql.warehouse.dir", System.getProperty("java.io.tmpdir"))
  .getOrCreate()

import sparkSession.implicits._