Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Static Analyzer¶
If we can get the execution plan, then it is quite easy to analyze ...
https://
https://
http://
Spark Testing Frameworks/Tools¶
You can use the Scala testing frameworks ScalaTest (recommended) and Specs, or you can use frameworks/tools built on top of them specifically for Spark. Various discussions suggest that Spark Testing Base is a good one.
https://
Spark Unit Testing¶
Spark Performance Test¶
https://
Spark Integration Test¶
https://
Spark Job Validation¶
https://
QuickCheck/ScalaCheck

- QuickCheck generates test data under a set of constraints.
- ScalaCheck is the Scala version; it is supported by the two unit testing libraries for Spark:
    - sscheck
        - Awesome people
        - supports generating DStreams too!
    - spark-testing-base
        - Awesome people
        - generates more pathological RDDs (e.g. with empty partitions etc.)
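The core idea can be sketched in plain Python (a toy hand-rolled generator, not ScalaCheck or either library above; all names here are mine): generate many random inputs under a set of constraints, check a property against each, and report the first counterexample.

```python
import random


def gen_int_list(rng, max_len=20, lo=-100, hi=100):
    """Generator with constraints: bounded length and bounded element values."""
    return [rng.randint(lo, hi) for _ in range(rng.randint(0, max_len))]


def for_all(gen, prop, runs=200, seed=42):
    """Check `prop` on `runs` generated inputs; return a counterexample or None."""
    rng = random.Random(seed)
    for _ in range(runs):
        case = gen(rng)
        if not prop(case):
            return case  # first failing input found
    return None


# Property that holds: reversing a list twice is the identity.
assert for_all(gen_int_list,
               lambda xs: list(reversed(list(reversed(xs)))) == xs) is None

# Property that fails: "sorting never changes the first element".
counterexample = for_all(gen_int_list,
                         lambda xs: xs == [] or sorted(xs)[0] == xs[0])
print(counterexample is not None)
```

Real property-based libraries add shrinking (minimizing the counterexample) on top of this loop; the Spark libraries above extend the generators to RDDs and DStreams.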
Testing Spark Applications¶
Good Discussions¶
http://
http://
https://
https://
https://medium.com/mrpowers/validating-spark-dataframe-schemas-28d2b3c69d2a
More¶
https://medium.com/mrpowers/testing-spark-applications-8c590d3215fa
http://
https://
https://
https://
http://
Data Generator¶
Please refer to Data for Testing for data generator tools.
Data Quality¶
Please refer to Data Quality for data quality related tools.
Load Testing Tools¶
Locust¶
Locust is a tool/framework for writing code that simulates real user behaviour in a fairly realistic way. For example, it's very common to store state for each simulated user. Once you have written your "user behaviour code", you can then simulate a lot of simultaneous users by running it distributed across multiple machines, and hopefully get realistic load sent to your system.
As Locust's author puts it: "If I wanted to just send a lot of requests/s to one or very few URL endpoints, I would also use something like ApacheBench, and I'm author of Locust."
ApacheBench¶
ApacheBench (ab) is a single-threaded command-line program for measuring the performance of HTTP web servers. Originally designed to test the Apache HTTP Server, it is generic enough to test any web server.
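As a toy illustration of what such tools measure (plain Python stdlib, not ab or Locust): spin up a throwaway local server, fire concurrent requests at it, and compute requests per second.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import urlopen


class Hello(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


# Start a throwaway local server on an OS-assigned port.
server = ThreadingHTTPServer(("127.0.0.1", 0), Hello)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()


def hit(_):
    with urlopen(f"http://127.0.0.1:{port}/") as resp:
        return resp.status


n, concurrency = 200, 10
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    statuses = list(pool.map(hit, range(n)))
elapsed = time.perf_counter() - start

print(f"{n} requests, {concurrency} workers: {n / elapsed:.0f} req/s")
server.shutdown()
```

ab does essentially this (many requests, fixed concurrency, throughput and latency stats) from a single thread with non-blocking I/O; Locust layers scripted per-user behaviour and distributed workers on top.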
Other¶
PipelineAI looks really interesting!
import org.apache.spark.sql.SparkSession

val sparkSession: SparkSession = SparkSession.builder()
  .master("local[2]")
  .appName("TestSparkApp")
  // Use a single shuffle partition so small test jobs run fast.
  .config("spark.sql.shuffle.partitions", "1")
  // Point the warehouse at the system temp directory; note that the property's
  // value must be read here, not the literal string "java.io.tmpdir".
  .config("spark.sql.warehouse.dir", System.getProperty("java.io.tmpdir"))
  .getOrCreate()

import sparkSession.implicits._