Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Tips on pex

Steps to Build a pex Environment File

  1. Start a Docker container from a Python image that ships the interpreter version you need. For example,

    docker run -it -v "$(pwd)":/workdir python:3.5-buster /bin/bash
    
  2. Install pex.

    pip3 install pex
    
  3. Build a pex environment file.

    pex --python=python3 -v pyspark findspark -o …

Persist and Checkpoint DataFrames in Spark

Persist vs Checkpoint

Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.

  1. Persist/cache in Spark is lazy and does not truncate the lineage, whereas DataFrame.checkpoint is eager (by default) and does truncate the lineage.

  2. Generally speaking, DataFrame.persist performs better than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.

Types of Joins of Spark DataFrames

Comments

  1. Always pass a list of columns to the "on" parameter, even if you are joining on a single column.

  2. None in a pandas DataFrame is converted to NaN instead of null!

  3. Spark supports the following join types:

    • inner (default)
    • cross
    • outer, full, fullouter, full_outer
    • left, leftouter, left_outer
    • right, rightouter, right_outer
    • semi, leftsemi, left_semi
    • anti, leftanti, left_anti

Hands on dict in Python

Tips and Traps

  1. Starting from Python 3.7, dict preserves insertion order (i.e., dict is ordered), so OrderedDict is rarely needed in Python 3.7+ unless you rely on its extras such as order-sensitive equality. However, set in Python is implemented as an unordered hash set and thus is neither ordered nor sorted. A trick to dedup an iterable values
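
The excerpt above is cut off before it shows the trick. A minimal sketch of one common order-preserving dedup that relies on dict's insertion order is given below; it is an assumption that this is the approach the post goes on to describe, and the helper name dedup_keep_order is hypothetical.

```python
from collections import OrderedDict


def dedup_keep_order(values):
    """Return the distinct items of `values` in first-seen order.

    Hypothetical helper for illustration; items must be hashable.
    """
    # dict.fromkeys adds each item as a key. Repeated items are ignored,
    # and on Python 3.7+ the keys keep their insertion order.
    return list(dict.fromkeys(values))


values = [3, 1, 3, 2, 1]
deduped = dedup_keep_order(values)
print(deduped)  # [3, 1, 2]

# A set removes duplicates too, but its iteration order is not guaranteed.
print(set(values) == set(deduped))  # True

# One remaining difference from OrderedDict: equality between plain dicts
# ignores order, while equality between OrderedDicts is order-sensitive.
print({"a": 1, "b": 2} == {"b": 2, "a": 1})  # True
print(OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1))  # False
```

Using list(set(values)) would also drop duplicates, but it loses the original order, which is why the dict-based form is usually preferred when order matters.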