Ben Chuanlong Du's Blog

And let it direct your passion with reason.

The set Collection in Python

General Tips and Traps

  1. The set class is implemented based on hash table which means that its elements must be hashable (has methods __hash__ and __eq__). The set class implements the mathematical concepts of set which means that its elements are unordered and does not perserve insertion order of elements. Notice that this is different from the dict class which is also implemented based on hash table but keeps insertion order of elements! The article Why don't Python sets preserve insertion order?

Useful Visual Studio Code Extensions

Places to Find Extensoins

Visual Studio Code Marketplace and Open VSX Registry are 2 places to find VSCode compatible extensions.

Install VSCode Extensions from Command-line

https://stackoverflow.com/questions/34286515/how-to-install-visual-studio-code-extensions-from-command-line/34339780#34339780

Install Code-Server Extensions from Command-line

If you install extension in Dockerfile using root, the extensions are installed …

Tips on pex

Steps to Build a pex Environment File

  1. Start a Python Docker image with the right version of Python interpreter installed. For example,

    docker run -it -v $(pwd):/workdir python:3.5-buster /bin/bash
    
  2. Install pex.

    pip3 install pex
    
  3. Build a pex environment file.

    pex --python=python3 -v pyspark findspark -o …

Persist and Checkpoint DataFrames in Spark

Persist vs Checkpoint

Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.

  1. Persist/Cache in Spark is lazy and doesn't truncate the lineage while checkpoint is eager (by default) and truncates the lineage.

  2. Generally speaking, DataFrame.persist has a better performance than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.