Column Functions and Operators in Spark

Apr 26, 2021

Date Functions in Spark

Apr 27, 2021

An HDFS table might contain data that is invalid with respect to the column types (e.g., Date and Timestamp); I'm not clear about the reasons at this time. This causes issues when Spark tries to load the data. For more discussion, please refer to Unrecognized column type:TIMESTAMP_TYP.

Jul 11, 2020

data-diff is a similar tool that efficiently diffs rows across two different databases.

Jan 24, 2021

Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.

Persist/Cache in Spark is lazy and doesn't truncate the lineage, while checkpoint is eager (by default) and truncates the lineage.
Generally speaking, DataFrame.persist performs better than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.

Apr 22, 2021

It is suggested that you always pass a list of columns to the on parameter, even if there is only one join column.
None in a pandas DataFrame is converted to NaN instead of null!
Spark supports the following join types:
- inner (default)
- cross
- outer
- full, fullouter, full_outer
- left, leftouter, left_outer
- right, rightouter, right_outer
- semi, leftsemi, left_semi
- anti, leftanti, left_anti

Apr 13, 2021
