General Tips and Traps
The set class is implemented on top of a hash table, which means that its elements must be hashable (that is, they must have the methods __hash__ and __eq__). The set class implements the mathematical concept of a set, which means that its elements are unordered and it does not preserve the insertion order of elements. Notice that this is different from the dict class, which is also implemented on top of a hash table but keeps the insertion order of elements! The article Why don't Python sets preserve insertion order?
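The points above can be illustrated with a short sketch. The Point class and the sample values are hypothetical and exist only for illustration; the behavior shown is that of the built-in set and dict types.

```python
# Elements of a set must be hashable.
s = {3, 1, 2}
s.add((4, 5))  # a tuple of hashable values is itself hashable

try:
    s.add([6, 7])  # a list is mutable and therefore unhashable
except TypeError as err:
    error_message = str(err)


class Point:
    """Hypothetical class that defines __eq__ and __hash__ consistently."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))


# Two equal points hash to the same value, so the set keeps only one of them.
points = {Point(1, 2), Point(1, 2)}

# A dict keeps insertion order (guaranteed by the language since Python 3.7),
# so dict.fromkeys can deduplicate a sequence while preserving order.
unique_in_order = list(dict.fromkeys(["b", "a", "b", "c", "a"]))
```

Iterating over a set may yield elements in any order, so code that needs both uniqueness and a stable order is usually better served by the dict.fromkeys idiom shown on the last line.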
Useful Visual Studio Code Extensions
Places to Find Extensions
Visual Studio Code Marketplace and Open VSX Registry are two places to find VSCode-compatible extensions.
Install VSCode Extensions from Command-line
https://stackoverflow.com/questions/34286515/how-to-install-visual-studio-code-extensions-from-command-line/34339780#34339780
Install Code-Server Extensions from Command-line
If you install extensions in a Dockerfile as root, the extensions are installed …
Tips on conda-pack
It is suggested that you use python-build-standalone instead of conda-pack to build portable Python environments. Please refer to Packaging Python Dependencies for PySpark Using Python-Build-Standalone for more details.
- All packages in a virtual environment must be managed by conda (rather than pip) so that the environment can be packed using conda-pack …
Tips on pex
Steps to Build a pex Environment File
- Start a Python Docker image with the right version of the Python interpreter installed. For example,
    docker run -it -v $(pwd):/workdir python:3.5-buster /bin/bash
- Install pex.
    pip3 install pex
- Build a pex environment file.
    pex --python=python3 -v pyspark findspark -o …
Auto Rename eTrade Employee Stock Plan Release Confirmations Using pdftotext
Install pdftotext
Persist and Checkpoint DataFrames in Spark
Persist vs Checkpoint
Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.
Persist/cache in Spark is lazy and does not truncate the lineage, while checkpoint is eager (by default) and truncates the lineage.
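The difference can be sketched in PySpark as follows. This is a minimal illustration, not a tuned setup: the function name, the local master, the sample data, and the checkpoint directory /tmp/spark-checkpoints are all assumptions made for the example, and running it requires pyspark to be installed.

```python
def persist_then_checkpoint(checkpoint_dir="/tmp/spark-checkpoints"):
    """Hypothetical helper contrasting DataFrame.persist and DataFrame.checkpoint."""
    # Imported lazily so that defining the function does not require pyspark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("persist-vs-checkpoint")
        .getOrCreate()
    )
    # A checkpoint directory must be set before calling DataFrame.checkpoint.
    spark.sparkContext.setCheckpointDir(checkpoint_dir)

    df = spark.range(1000).selectExpr("id", "id % 7 AS bucket")

    # persist() is lazy: nothing is cached until an action runs,
    # and the lineage of the DataFrame is kept.
    cached = df.persist()
    n_cached = cached.count()  # this action materializes the cache

    # checkpoint() is eager by default: the data is written to the
    # checkpoint directory right away, and the returned DataFrame
    # has a truncated lineage.
    checkpointed = df.checkpoint()
    n_checkpointed = checkpointed.count()

    cached.unpersist()
    spark.stop()
    return n_cached, n_checkpointed
```

Calling persist_then_checkpoint() should return the same row count twice; comparing cached.explain() with checkpointed.explain() before stopping the session is one way to see the shortened plan of the checkpointed DataFrame.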
Generally speaking, DataFrame.persist has better performance than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.