Coalesce and Repartition in Spark DataFrame

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

References

https://stackoverflow.com/questions/42171499/get-current-number-of-partitions-of-a-dataframe

coalesce vs repartition

https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4m

%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1

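import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Start (or reuse) a local Spark session with 2 worker threads.
val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Coalesce and Repartition Example")
    // Placeholder setting that Spark does not act on; it can be removed.
    .config("spark.some.config.option", "some-value")
    .getOrCreate()

// Enables implicit conversions such as the $"column" syntax and toDF.
import spark.implicits._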

val df = spark.read.json("../../data/people.json")
df.show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

Get Number of Partitions

df.rdd.getNumPartitions

1
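
The partition count alone does not show how rows are spread across partitions. As a rough sketch, the cell below counts rows per partition with spark_partition_id from org.apache.spark.sql.functions. The data set and the names numbers, numPartitions and rowsPerPartition are illustrative assumptions, not part of the original notebook, and the cell assumes the Spark dependencies loaded earlier.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

// Reuses the running session if one exists; master and appName are assumptions.
val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Partition Count Sketch")
    .getOrCreate()

// Hypothetical data: 100 rows redistributed over 4 partitions.
val numbers = spark.range(0, 100).repartition(4)

// Number of partitions of the underlying RDD.
val numPartitions = numbers.rdd.getNumPartitions
println(numPartitions)

// Tag each row with the id of the partition it lives in, then count per id.
val rowsPerPartition = numbers
    .groupBy(spark_partition_id().alias("partition_id"))
    .count()
    .orderBy("partition_id")
rowsPerPartition.show()
```

The per-partition counts add up to the total row count, but the exact split between partitions may differ from run to run.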

Repartition

val df2 = df.repartition(4)

[age: bigint, name: string]

df2.rdd.getNumPartitions

4
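
Coalesce

The notes above only exercise repartition. As a minimal sketch of the other half of the title, the cell below contrasts coalesce with repartition. The session settings, the data set, and the names four, two, stillFour and eight are illustrative assumptions, not part of the original notebook, and the cell assumes the Spark dependencies loaded earlier.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if one exists; master and appName are assumptions.
val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Coalesce Sketch")
    .getOrCreate()

// Hypothetical data: 100 rows spread over 4 partitions by a full shuffle.
val four = spark.range(0, 100).repartition(4)

// coalesce merges existing partitions without a full shuffle,
// so it can only lower the partition count.
val two = four.coalesce(2)

// Asking coalesce for more partitions than exist leaves the count unchanged.
val stillFour = four.coalesce(8)

// repartition shuffles all rows, so it can raise the count as well.
val eight = four.repartition(8)

println(four.rdd.getNumPartitions)
println(two.rdd.getNumPartitions)
println(stillFour.rdd.getNumPartitions)
println(eight.rdd.getNumPartitions)
```

On a local session this should print 4, 2, 4 and 8. In short, coalesce is the cheaper choice when only reducing the number of partitions, while repartition is needed to increase it or to rebalance rows evenly.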