Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Union RDDs in Spark

Unioning RDDs/DataFrames in Spark 2.1.0+ does not deduplicate rows (to be efficient), so the result behaves like SQL UNION ALL. Call distinct() on the result if duplicates must be removed.

  1. Union 2 RDDs or DataFrames.

    df1.union(df2)
    # or for old-fashioned RDDs
    rdd1.union(rdd2)
    
  2. Union multiple RDDs or DataFrames.

    from functools import reduce
    from pyspark.sql import DataFrame
    df = reduce(DataFrame.union, [df1, df2, df3])  # SparkSession has no union method
    # or for old-fashioned RDDs
    rdd = sc.union([rdd1, rdd2, rdd3])  # sc is a SparkContext object
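
The two cases above can be folded into one small helper. What follows is only a sketch in PySpark: the names `union_all` and `demo`, the column names `id` and `label`, and the sample rows are all made up for illustration and are not part of Spark. It relies only on the fact that both DataFrame and RDD objects have a `union(other)` method, and it assumes a local Spark installation when `demo()` is called.

```python
from functools import reduce


def union_all(datasets):
    """Union a non-empty list of DataFrames (or RDDs) pairwise, keeping duplicates.

    Hypothetical helper for illustration; it is not a Spark API.
    """
    datasets = list(datasets)
    if not datasets:
        raise ValueError("union_all needs at least one DataFrame or RDD")
    # Both DataFrame and RDD expose .union(other), so the same fold works for either.
    return reduce(lambda left, right: left.union(right), datasets)


def demo():
    """Run a small local example; requires pyspark and a Java runtime."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("union-demo").getOrCreate()
    )
    try:
        # Made-up sample data with overlapping rows.
        df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
        df2 = spark.createDataFrame([(2, "b"), (3, "c")], ["id", "label"])
        df3 = spark.createDataFrame([(3, "c")], ["id", "label"])

        combined = union_all([df1, df2, df3])
        # union keeps repeated rows; distinct() removes them explicitly.
        return combined.count(), combined.distinct().count()
    finally:
        spark.stop()
```

With the sample rows above, `demo()` should return a count of 5 for the plain union and 3 after `distinct()`, since the overlapping rows are kept until they are removed explicitly. Note that DataFrame `union` matches columns by position, so the inputs are assumed to share the same column order.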
    

