Ben Chuanlong Du's Blog

It is never too late to learn.

Unit Testing for Spark

Spark Testing Frameworks/Tools

You can use the Scala testing frameworks ScalaTest (recommended) and Specs2, or you can use frameworks/tools built on top of them specifically for Spark. Various discussions suggest that spark-testing-base is a good choice.

https://www.slideshare.net/SparkSummit/beyond-parallelize-and-collect-by-holden-karau
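As a rough sketch of what a unit test looks like with spark-testing-base and ScalaTest (the suite name and test data below are made up; `SharedSparkContext` is the trait spark-testing-base provides to share one `SparkContext`, exposed as `sc`, across the tests in a suite):

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.funsuite.AnyFunSuite

// A minimal Spark unit test sketch. SharedSparkContext starts a local
// SparkContext once per suite, so individual tests stay fast.
class WordCountSpec extends AnyFunSuite with SharedSparkContext {
  test("word count aggregates counts per word") {
    val words = sc.parallelize(Seq("a", "b", "a"))
    val counts = words.map((_, 1)).reduceByKey(_ + _).collectAsMap()
    assert(counts("a") == 2)
    assert(counts("b") == 1)
  }
}
```

Note that `AnyFunSuite` is the ScalaTest 3.1+ name; older ScalaTest versions call it `FunSuite`.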

QuickCheck/ScalaCheck

  1. QuickCheck generates test data under a set of constraints
  2. Its Scala port is ScalaCheck, which is supported by two unit testing libraries for Spark:
    • sscheck
      • supports generating DStreams too!
    • spark-testing-base
      • generates more pathological RDDs (e.g., RDDs with empty partitions)
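To illustrate the property-based style that both libraries build on, here is a plain ScalaCheck property with no Spark involved (the property and object name are made up for illustration): ScalaCheck generates many arbitrary inputs and checks that the stated property holds for each.

```scala
import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

// A property-based test: ScalaCheck generates arbitrary lists of Ints
// and verifies that reversing a list twice yields the original list.
object ReverseSpec extends Properties("List.reverse") {
  property("reversing twice is the identity") = forAll { (xs: List[Int]) =>
    xs.reverse.reverse == xs
  }
}
```

sscheck and spark-testing-base extend this idea by supplying generators for RDDs and DStreams instead of plain collections.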

Testing Spark Applications

Data Generator

Please refer to Data for Testing for data generator tools.

Data Quality

Please refer to Data Quality for data quality related tools.

Locust

Locust is a tool/framework for writing code that simulates real user behaviour in a fairly realistic way. For example, it is very common to store state for each simulated user. Once you have written your "user behaviour code", you can simulate many simultaneous users by running it distributed across multiple machines, hopefully producing realistic load on your system.

If I just wanted to send a lot of requests per second to one or a few URL endpoints, I would use something like ApacheBench instead, and I'm an author of Locust.

ApacheBench

ApacheBench (ab) is a single-threaded command-line program for measuring the performance of HTTP web servers. Originally designed for testing the Apache HTTP Server, it is generic enough to test any web server.
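A typical ab invocation looks like the following (the URL is a placeholder; `-n` is the total number of requests and `-c` the concurrency level):

```shell
# Send 100 requests, 10 at a time, and report throughput and latency.
ab -n 100 -c 10 http://localhost:8080/
```

ab then prints requests per second, mean time per request, and a latency percentile table.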

Other

  1. PipelineAI looks really interesting!
```scala
import org.apache.spark.sql.SparkSession

val sparkSession: SparkSession = SparkSession.builder()
  .master("local[2]")
  .appName("TestSparkApp")
  .config("spark.sql.shuffle.partitions", "1")
  .config("spark.sql.warehouse.dir", System.getProperty("java.io.tmpdir"))
  .getOrCreate()
import sparkSession.implicits._
```
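With a local session like the one above in scope, a quick sanity check might look as follows (a sketch; the sample data and column names are made up). Setting `spark.sql.shuffle.partitions` to 1 keeps such tiny test jobs fast and deterministic.

```scala
// Assumes the local `sparkSession` built above is in scope.
import sparkSession.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "letter")
// Group and count; with one shuffle partition the job stays small.
val counts = df.groupBy("letter").count().collect()
  .map(r => r.getString(0) -> r.getLong(1)).toMap
assert(counts("a") == 2)
assert(counts("b") == 1)
```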
