Union does not deduplicate rows (for efficiency) when combining RDDs/DataFrames in Spark 2.1.0+; call distinct() afterwards if duplicates must be removed.
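A minimal PySpark sketch of this behaviour (the column name "x" and the toy values are made up for illustration): union keeps the shared row, and an explicit distinct() drops it when deduplication is actually wanted.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1,), (2,)], ["x"])
df2 = spark.createDataFrame([(2,), (3,)], ["x"])
df1.union(df2).count()             # 4 -- the shared row (2,) appears twice
df1.union(df2).distinct().count()  # 3 -- deduplicate explicitly if needed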
-
Union two DataFrames/RDDs.
df1.union(df2)    # DataFrames
rdd1.union(rdd2)  # or, for old-fashioned RDDs
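A short runnable sketch of the RDD form (the local SparkSession and toy values are assumptions, not from the source):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd1 = sc.parallelize([1, 2])
rdd2 = sc.parallelize([2, 3])
rdd1.union(rdd2).collect()  # [1, 2, 2, 3] -- duplicates are preserved here too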
-
Union multiple DataFrames/RDDs.
df = reduce(DataFrame.union, [df1, df2, df3])  # needs from functools import reduce and from pyspark.sql import DataFrame; SparkSession has no union() method
rdd = sc.union([rdd1, rdd2, rdd3])  # or for old-fashioned RDDs; sc is a SparkContext object and takes a list of RDDs
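A self-contained sketch of the multi-way case (the one-column toy DataFrames and single-element RDDs are assumptions): fold DataFrame.union over a Python list, and pass the RDD list straight to SparkContext.union.
from functools import reduce
from pyspark.sql import DataFrame, SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
dfs = [spark.createDataFrame([(i,)], ["x"]) for i in range(3)]
df = reduce(DataFrame.union, dfs)   # no single-call DataFrame equivalent, so fold union over the list
rdds = [sc.parallelize([i]) for i in range(3)]
rdd = sc.union(rdds)                # SparkContext.union accepts a list of RDDs directly
df.count(), rdd.count()             # (3, 3)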