Quickly Create a Scala Project Using Gradle in IntelliJ IDEA
Easy Way
- Create a directory (e.g., `demo_proj`) for your project.
- Run `gradle init --type scala-library` in a terminal in the above directory.
- Import the directory as a Gradle project in IntelliJ IDEA. Alternatively, you can add `apply plugin: 'idea'` to `build.gradle` and then run the command `./gradlew openIdea` to …
Persist and Checkpoint DataFrames in Spark
Persist vs Checkpoint
Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.
Persist/Cache in Spark is lazy and doesn't truncate the lineage while checkpoint is eager (by default) and truncates the lineage.
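As a rough illustration of the difference, the PySpark sketch below wraps both calls in one helper. The helper name `cache_or_checkpoint` and its `checkpoint_dir` parameter are hypothetical; the sketch assumes `spark` is a SparkSession and `df` is a DataFrame, and it relies only on `DataFrame.persist`, `DataFrame.checkpoint` and `SparkContext.setCheckpointDir`.

```python
def cache_or_checkpoint(spark, df, checkpoint_dir=None):
    """Persist df lazily, or checkpoint it eagerly when a directory is given.

    Hypothetical helper for illustration only; `spark` is assumed to be a
    SparkSession and `df` a DataFrame.
    """
    if checkpoint_dir is None:
        # Lazy: nothing is computed until an action runs, and the lineage is kept.
        return df.persist()
    # Checkpointing needs a directory, ideally on reliable storage such as HDFS.
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    # Eager by default: the data is computed and written now, and the returned
    # DataFrame has a truncated lineage.
    return df.checkpoint(eager=True)
```

Keep working with the returned DataFrame: for `checkpoint` it is a new object, while the original `df` still carries the full lineage.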
Generally speaking, `DataFrame.persist` performs better than `DataFrame.checkpoint`. However, `DataFrame.checkpoint` is more robust and is preferred in any of the following situations.
Union RDDs in Spark
No deduplication is done (to be efficient) when unioning RDDs/DataFrames in Spark 2.1.0+.
- Union 2 RDDs.
    df1.union(df2)
    // or for old-fashioned RDD
    rdd1.union(rdd2)
- Union multiple RDDs.
    df = spark.union([df1, df2, df3]) // spark is a SparkSession object
    // or for old-fashioned RDD
    rdd …
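One way to union an arbitrary number of DataFrames without relying on any multi-way helper is to fold the list with the pairwise `union` method. The sketch below is a minimal version of that idea; the function name `union_all` and its `dedup` flag are hypothetical, and it assumes every element exposes `union` and `distinct`, as PySpark DataFrames and RDDs do.

```python
from functools import reduce


def union_all(dfs, dedup=False):
    """Union any number of DataFrames (or RDDs) by folding pairwise `union`."""
    dfs = list(dfs)
    if not dfs:
        raise ValueError("need at least one DataFrame to union")
    # Pairwise union keeps duplicates, like SQL's UNION ALL.
    result = reduce(lambda left, right: left.union(right), dfs)
    # distinct() removes duplicate rows when SQL UNION semantics are wanted.
    return result.distinct() if dedup else result
```

For example, `union_all([df1, df2, df3])` keeps every row, while `union_all([df1, df2, df3], dedup=True)` pays for an extra `distinct()` to drop duplicates.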
Union DataFrames in Spark
Comment
`union` relies on column order rather than column names, the same as in SQL. For columns whose types don't match, the super type is used. However, this is really dangerous if you are not careful. It is suggested that you define a function called unionByName to handle this.
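One possible shape for such a function is sketched below in PySpark style. The name `union_by_name` and the strict same-columns check are assumptions made for illustration; the sketch uses only `DataFrame.columns`, `DataFrame.select` and `DataFrame.union`. Spark 2.3 and later also ship a built-in `DataFrame.unionByName`, which is likely the better choice when it is available.

```python
def union_by_name(left, right):
    """Union two DataFrames, matching columns by name instead of position.

    Hypothetical helper: both inputs must have exactly the same column names,
    in any order. The result uses the column order of `left`.
    """
    if set(left.columns) != set(right.columns):
        raise ValueError(
            f"column names differ: {sorted(left.columns)} vs {sorted(right.columns)}"
        )
    # Reorder the right-hand columns to match the left before the positional union.
    return left.union(right.select(*left.columns))
```

With `df1` having columns `(id, name)` and `df2` having `(name, id)`, `union_by_name(df1, df2)` lines the columns up correctly, whereas a plain `df1.union(df2)` would mix ids with names.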