Comments¶
It is suggested that you always pass a list of columns to the parameter on, even if there's only one column for joining.
None in a pandas DataFrame is converted to NaN instead of null!
Spark allows using the following join types:
- inner (default)
- cross
- outer
- full, fullouter, full_outer
- left, leftouter, left_outer
- right, rightouter, right_outer
- semi, leftsemi, left_semi
- anti, leftanti, left_anti
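The None-to-NaN behavior mentioned above can be observed on the pandas side with a minimal sketch like the one below. It assumes pandas is installed; the DataFrame `pdf` and its column names are hypothetical and only serve as an illustration.

```python
import math

import pandas as pd

# Hypothetical data: a numeric column that contains a missing value.
pdf = pd.DataFrame({"id": [1, 2], "score": [0.5, None]})

# pandas stores the missing numeric value as a floating-point NaN, not as None.
missing = pdf["score"].iloc[1]
print(math.isnan(missing))

# isna() flags the entry as missing, which makes it easy to check a column
# before handing the data over to Spark.
print(pdf["score"].isna().tolist())
```

Checking columns with isna() before converting to a Spark DataFrame is one way to notice where NaN, rather than null, might show up.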
Hands on the Python Library toml
Tips & Traps¶
- Please refer to Parse TOML File in Python for general tips on parsing TOML in Python.
Installation¶
Replace Single Quotes With Double Quotes in Python Code
There are 2 ways.
- Format the Python code using black, which will automatically convert single quotes to double quotes when applicable. (Note that you can format the code again using yapf if you want the code to be formatted by yapf in the end.)
- Use the tool myint/unify to help you.
Inner Join of Spark DataFrames
Tips and Traps¶
Select only needed columns before joining.
Rename joining column names to be identical (if different) before joining.
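The two tips above can be sketched as follows. This is a minimal illustration, assuming two PySpark DataFrames; the function name and the column names (`customer_id`, `amount`, `id`, `name`) are hypothetical. PySpark is imported for type hints only, so the function itself relies just on the DataFrame methods select, withColumnRenamed and join.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # needed for type hints only
    from pyspark.sql import DataFrame


def join_orders_to_customers(
    orders: "DataFrame", customers: "DataFrame"
) -> "DataFrame":
    """Inner-join two DataFrames after trimming and aligning their columns."""
    # Tip 1: select only the columns that are needed before joining.
    orders_slim = orders.select("customer_id", "amount")
    customers_slim = customers.select("id", "name")

    # Tip 2: rename the joining column so both sides use the same name.
    customers_slim = customers_slim.withColumnRenamed("id", "customer_id")

    # Pass the joining column as a list, even though there is only one.
    return orders_slim.join(customers_slim, on=["customer_id"], how="inner")
```

Joining on a list of identical column names keeps a single copy of the joining column in the result, so no ambiguous duplicate column needs to be dropped afterwards.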
Debug Python Project in Visual Studio Code
Ways to Open a Command Palette
- Use the menu: Menu -> View -> Command Palette....
- Use the shortcut Shift + Command + P (on macOS).

You can search for commands in the Command Palette, which makes things very convenient.
Run Tests or a Python File
- Open the Command Palette.
- Search for Python: Run in the …
Comparing Similarity of Two Different Clusterings
The paper Comparing Clusterings - An Overview gives a good overview of different metrics for comparing the similarity of 2 clusterings. Overall, Normalized Mutual Information sounds like a good one. It is implemented in sklearn as sklearn.metrics.normalized_mutual_info_score. Of course, there are many more metrics for measuring similarity of 2 …
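As a quick illustration of Normalized Mutual Information, the sketch below calls sklearn.metrics.normalized_mutual_info_score on small hand-made label lists. It assumes scikit-learn is installed; the label lists are made up for the example.

```python
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical cluster assignments for six items.
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]  # same grouping as labels_a, different label names
labels_c = [0, 1, 0, 1, 0, 1]  # a grouping that cuts across labels_a

# The score depends only on how items are grouped, not on the label values.
same_grouping = normalized_mutual_info_score(labels_a, labels_b)
different_grouping = normalized_mutual_info_score(labels_a, labels_c)

print(round(same_grouping, 3), round(different_grouping, 3))
```

Two clusterings that group the items identically score 1 even when the cluster labels are permuted, which is why the metric suits comparing the output of different clustering runs.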