Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Issue¶
Total size of serialized results is bigger than `spark.driver.maxResultSize`.
Solutions¶
- Eliminate unnecessary `broadcast` or `collect` calls.
- If one of the tables being joined contains too many partitions (which results in too many jobs), repartition it to reduce the number of partitions before joining.
- Make `spark.driver.maxResultSize` larger.
    - Set it via SparkConf: `conf.set("spark.driver.maxResultSize", "3g")`
    - Set it in `spark-defaults.conf`: `spark.driver.maxResultSize 3g`
    - Set it when calling spark-submit: `--conf spark.driver.maxResultSize=3g`
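The SparkConf route above can be sketched in PySpark as follows. This is a minimal configuration sketch, not a recommendation: the app name is hypothetical and `"3g"` is an example value that should be tuned to the driver's available memory. It requires a working Spark installation to run.

```python
from pyspark.sql import SparkSession

# Raise the driver result-size cap when the session is created.
# "3g" is an example value; tune it to your workload and driver memory.
spark = (
    SparkSession.builder
    .appName("max-result-size-example")  # hypothetical app name
    .config("spark.driver.maxResultSize", "3g")
    .getOrCreate()
)

# To reduce the number of partitions of a table before joining
# (column name "join_key" is hypothetical):
# df = df.repartition(200, "join_key")
```

Note that `spark.driver.maxResultSize` must be set before the SparkContext is created; changing it on an already-running session has no effect.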