Column Functions and Operators in Spark

Apr 26, 2021

Quickly Create a Scala Project Using Gradle in IntelliJ IDEA

Jan 26, 2019

Create a directory (e.g., demo_proj) for your project.
Run gradle init --type scala-library in a terminal in that directory.
Import the directory as a Gradle project in IntelliJ IDEA. Alternatively, you can add apply plugin: 'idea' to build.gradle and then run the command ./gradlew openIdea to …

Jan 24, 2021

Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.

Persist/Cache in Spark is lazy and doesn't truncate the lineage while checkpoint is eager (by default) and truncates the lineage.
Generally speaking, DataFrame.persist performs better than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.

Jun 03, 2017

No deduplication is done (to be efficient) when unioning RDDs/DataFrames in Spark 2.1.0+.
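Since union keeps duplicates, deduplication has to be requested explicitly when it is wanted. Below is a minimal sketch, assuming PySpark-style objects that expose union and distinct (both DataFrames and RDDs do); the helper name union_distinct is hypothetical and not part of Spark.

```python
def union_distinct(left, right):
    """Union two DataFrames (or two RDDs) and drop duplicate rows.

    Hypothetical helper for illustration; `left` and `right` are assumed
    to be of the same kind (both DataFrames or both RDDs).
    """
    # union() keeps every row from both sides, duplicates included;
    # distinct() then removes the repeated rows, at the cost of a shuffle.
    return left.union(right).distinct()
```

Skipping the distinct step is what keeps a plain union cheap, so it is probably best applied only when duplicates actually matter.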

Union 2 RDDs.

df1.union(df2)
# or for old-fashioned RDD
rdd1.union(rdd2)

Union multiple RDDs.

from functools import reduce
df = reduce(lambda a, b: a.union(b), [df1, df2, df3])  # SparkSession has no union method, so fold the list pairwise
# or for old-fashioned RDD
rdd …

Oct 30, 2020

union matches columns by position rather than by name, the same as in SQL. Where the types of two aligned columns don't match, their common super type is used. This is really dangerous if you are not careful, because columns that appear in a different order can be combined without any error. It is suggested that you define a function called unionByName to handle this.
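One way such a function might look is sketched below. It assumes PySpark-style DataFrames exposing columns, select, and union, and it assumes both sides have the same set of column names; the name union_by_name and the mismatch check are illustrative choices, not the original post's implementation.

```python
def union_by_name(left, right):
    """Union two DataFrames by column name instead of column position.

    Hypothetical helper: `left` and `right` are assumed to be DataFrames
    with the same column names, possibly in a different order.
    """
    if set(left.columns) != set(right.columns):
        raise ValueError(
            f"column mismatch: {sorted(left.columns)} vs {sorted(right.columns)}"
        )
    # Reorder the right-hand columns to match the left-hand side, so the
    # position-based union lines up columns that share a name.
    return left.union(right.select(*left.columns))
```

Spark 2.3 and later also ship a built-in DataFrame.unionByName method, which is likely the better choice when it is available.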