The set Collection in Python

Apr 30, 2021

General Tips and Traps

The set class is implemented on top of a hash table, which means that its elements must be hashable (that is, they implement the methods __hash__ and __eq__). The set class models the mathematical concept of a set, so its elements are unordered and their insertion order is not preserved. Notice that this is different from the dict class, which is also implemented on top of a hash table but does keep the insertion order of its elements! The article Why don't Python sets preserve insertion order?
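The points above can be checked with a few lines of Python. This is only a minimal sketch; the variable names (s, pairs, words, ordered_unique) are made up for illustration.

```python
# Set elements must be hashable: ints, strings and tuples of hashables qualify.
s = {3, 1, 2, 3}  # the duplicate 3 is dropped
print(len(s))

# A list is mutable and therefore unhashable, so it cannot be a set element.
try:
    {[1, 2], [3, 4]}
except TypeError as err:
    print("TypeError:", err)

# A tuple (or a frozenset) is the usual hashable stand-in.
pairs = {(1, 2), (3, 4)}

# dict keeps insertion order (a language guarantee since Python 3.7), so it can
# deduplicate while keeping first-seen order, which a set does not promise.
words = ["pear", "apple", "pear", "fig", "apple"]
ordered_unique = list(dict.fromkeys(words))
print(ordered_unique)
```

Using dict.fromkeys is one common way to get "set-like" deduplication when the order of first appearance matters.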

Jun 03, 2017

No deduplication is done (to be efficient) when unioning RDDs/DataFrames in Spark 2.1.0+.

Union 2 RDDs.

df1.union(df2)
// or for old-fashioned RDD
rdd1.union(rdd2)

Union multiple RDDs.

val df = Seq(df1, df2, df3).reduce(_ union _) // SparkSession itself has no union method
// or for old-fashioned RDD
rdd …

Oct 30, 2020

union relies on column order rather than column names. This is the same as in SQL. For columns whose types don't match, the super type is used. However, this is really dangerous if you are not careful. It is suggested that you define a function called unionByName to handle this.
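The difference between matching by position and matching by name can be illustrated with a toy model in plain Python. This sketch does not use Spark at all: a "table" here is just an assumed (columns, rows) pair, and both functions are hypothetical helpers written for this illustration.

```python
# Toy model only: a "table" is a (columns, rows) pair of plain Python lists.
# It mimics the semantics described above; it does not call Spark.

def union_by_position(left, right):
    """Append right's rows to left's, matching columns by position."""
    cols, rows = left
    _, other_rows = right
    return cols, rows + other_rows


def union_by_name(left, right):
    """Reorder right's values into left's column order before appending."""
    cols, rows = left
    other_cols, other_rows = right
    if set(cols) != set(other_cols):
        raise ValueError("column names differ")
    idx = [other_cols.index(c) for c in cols]
    return cols, rows + [tuple(r[i] for i in idx) for r in other_rows]


t1 = (["name", "city"], [("Ann", "Oslo")])
t2 = (["city", "name"], [("Rome", "Bob")])

# Positional union silently puts the city "Rome" under the name column.
print(union_by_position(t1, t2)[1])
# By-name union realigns the second table first.
print(union_by_name(t1, t2)[1])
```

With real DataFrames the same idea would presumably amount to selecting the second DataFrame's columns in the first one's order before calling union. Spark 2.3 and later also provide a built-in DataFrame.unionByName method.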