Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Useful Visual Studio Code Extensions

Places to Find Extensions

The Visual Studio Code Marketplace and the Open VSX Registry are two places to find VSCode-compatible extensions.

Install VSCode Extensions from Command-line

https://stackoverflow.com/questions/34286515/how-to-install-visual-studio-code-extensions-from-command-line/34339780#34339780
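
As a minimal sketch of the approach in the linked answer, the helper below builds one code --install-extension command per extension. The function name build_install_commands and the example extension ID are illustrative assumptions, and the script only prints the commands rather than running them.

```python
import shlex


def build_install_commands(extension_ids, cli="code", force=False):
    """Return one argv list per extension: <cli> --install-extension <id>."""
    commands = []
    for ext_id in extension_ids:
        argv = [cli, "--install-extension", ext_id]
        if force:
            # --force asks the VS Code CLI to skip prompts during installation.
            argv.append("--force")
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # The extension ID below is only an example; replace it with your own list.
    for argv in build_install_commands(["ms-python.python"]):
        # Printed for review; pass argv to subprocess.run(argv, check=True) to install.
        print(shlex.join(argv))
```

Passing cli="code-server" should produce the equivalent command for code-server, assuming it accepts the same --install-extension flag.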

Install Code-Server Extensions from Command-line

If you install extensions in a Dockerfile as root, the extensions are installed …

Tips on pex

Steps to Build a pex Environment File

  1. Start a Python Docker image that has the required version of the Python interpreter installed. For example,

    docker run -it -v $(pwd):/workdir python:3.5-buster /bin/bash
    
  2. Install pex.

    pip3 install pex
    
  3. Build a pex environment file.

    pex --python=python3 -v pyspark findspark -o …

Persist and Checkpoint DataFrames in Spark

Persist vs Checkpoint

Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.

  1. Persist/Cache in Spark is lazy and doesn't truncate the lineage while checkpoint is eager (by default) and truncates the lineage.

  2. Generally speaking, DataFrame.persist performs better than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.
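
To make the lazy-versus-eager point in item 1 concrete, here is a toy model in plain Python. It is not the Spark API: ToyDataset and all of its methods are hypothetical stand-ins that only mimic how persist keeps the lineage and waits for an action, while checkpoint computes immediately and drops the lineage.

```python
class ToyDataset:
    """Toy stand-in for a Spark DataFrame (hypothetical, not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = list(source)      # starting rows
        self.lineage = list(lineage)     # transformations not yet applied
        self._persist_requested = False
        self._cache = None
        self.compute_count = 0           # how many times the lineage was replayed

    def map(self, fn):
        # A transformation: only extends the lineage, computes nothing.
        return ToyDataset(self._source, self.lineage + [fn])

    def _compute(self):
        self.compute_count += 1
        rows = self._source
        for fn in self.lineage:
            rows = [fn(row) for row in rows]
        return rows

    def persist(self):
        # Lazy: records the request; the lineage is kept intact.
        self._persist_requested = True
        return self

    def collect(self):
        # An action: reuses the cache if present, otherwise replays the lineage.
        if self._cache is not None:
            return list(self._cache)
        rows = self._compute()
        if self._persist_requested:
            self._cache = rows
        return list(rows)

    def checkpoint(self):
        # Eager: computes now and returns a dataset with an empty lineage.
        return ToyDataset(self._compute(), lineage=[])


if __name__ == "__main__":
    ds = ToyDataset([1, 2, 3]).map(lambda x: x * 2).map(lambda x: x + 1)
    persisted = ds.persist()
    print(persisted.compute_count, len(persisted.lineage))
    persisted.collect()
    persisted.collect()
    print(persisted.compute_count)
    checkpointed = ToyDataset([1, 2, 3]).map(lambda x: x * 2).checkpoint()
    print(checkpointed.lineage, checkpointed.collect())
```

In PySpark the corresponding calls are DataFrame.persist() and DataFrame.checkpoint(eager=True); checkpoint also requires a checkpoint directory to be set first with SparkContext.setCheckpointDir.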

Types of Joins of Spark DataFrames

Comments

  1. Always pass a list of columns to the on parameter of DataFrame.join, even when joining on a single column.

  2. None in a pandas DataFrame is converted to NaN instead of null!

  3. Spark supports the following join types:

    • inner (default)
    • cross
    • outer, full, fullouter, full_outer
    • left, leftouter, left_outer
    • right, rightouter, right_outer
    • semi, leftsemi, left_semi
    • anti, leftanti, left_anti
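
The alias groups above can be captured in a small lookup, shown here as a hedged sketch in plain Python. The helper names canonical_join_type and as_column_list are hypothetical, and grouping outer with the full outer aliases is an assumption based on Spark treating outer as a full outer join.

```python
# Alias groups follow the list above. Placing "outer" in the full-outer group
# is an assumption: Spark treats "outer" as a full outer join.
_JOIN_ALIASES = {
    "inner": ["inner"],
    "cross": ["cross"],
    "full_outer": ["outer", "full", "fullouter", "full_outer"],
    "left_outer": ["left", "leftouter", "left_outer"],
    "right_outer": ["right", "rightouter", "right_outer"],
    "left_semi": ["semi", "leftsemi", "left_semi"],
    "left_anti": ["anti", "leftanti", "left_anti"],
}

# Reverse index: alias -> canonical name.
_CANONICAL = {
    alias: name for name, aliases in _JOIN_ALIASES.items() for alias in aliases
}


def canonical_join_type(how="inner"):
    """Map any listed alias to one canonical name; inner is the default."""
    key = how.strip().lower()
    if key not in _CANONICAL:
        raise ValueError(f"unsupported join type: {how!r}")
    return _CANONICAL[key]


def as_column_list(on):
    """Wrap a single column name in a list, as comment 1 recommends."""
    return [on] if isinstance(on, str) else list(on)


if __name__ == "__main__":
    print(canonical_join_type("leftouter"))
    print(as_column_list("id"))
```

With PySpark, these helpers would feed a call such as left.join(right, on=as_column_list("id"), how="left_semi"), where left and right are placeholder DataFrames.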