Date Functions in Spark
Tips and Traps
- An HDFS table might contain data that is invalid with respect to the column types (e.g., Date and Timestamp); I'm not clear about the reasons at this time. This causes issues when Spark tries to load the data. For more discussion, please refer to Unrecognized column type:TIMESTAMP_TYP.
datetime.datetime or datetime.date
Compare Data Frames Using DataCompy in Python
Installation
Persist and Checkpoint DataFrames in Spark
Persist vs Checkpoint
Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.
Persist/Cache in Spark is lazy and doesn't truncate the lineage while checkpoint is eager (by default) and truncates the lineage.
Generally speaking, DataFrame.persist has better performance than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.
Types of Joins of Spark DataFrames
Comments
- It is suggested that you always pass a list of columns to the parameter on, even if there's only one column for joining.
- None in a pandas DataFrame is converted to NaN instead of null!
- Spark allows using the following join types:
    - inner (default)
    - cross
    - outer
    - full, fullouter, full_outer
    - left, leftouter, left_outer
    - right, rightouter, right_outer
    - semi, leftsemi, left_semi
    - anti, leftanti, left_anti