Types of Joins of Spark DataFrames

Apr 22, 2021

Comments

It is suggested that you always pass a list of columns to the on parameter, even if there is only one join column.
None in a pandas DataFrame is converted to NaN instead of null!
Spark supports the following join types:
- inner (default)
- cross
- outer
- full, fullouter, full_outer
- left, leftouter, left_outer
- right, rightouter, right_outer
- semi, leftsemi, left_semi
- anti, leftanti, left_anti
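The advice about always passing a list to on can be sketched as a small helper. This is only an illustration: join_on and JOIN_TYPES are hypothetical names, left and right are assumed to be pyspark.sql.DataFrame objects, and the strict check against the names listed above is a simplification of my own rather than Spark's own validation.

```python
# Join type names copied from the list above.
JOIN_TYPES = {
    "inner", "cross", "outer",
    "full", "fullouter", "full_outer",
    "left", "leftouter", "left_outer",
    "right", "rightouter", "right_outer",
    "semi", "leftsemi", "left_semi",
    "anti", "leftanti", "left_anti",
}


def join_on(left, right, on, how="inner"):
    """Join two DataFrames, always passing `on` as a list of column names.

    `left` and `right` are assumed to be pyspark.sql.DataFrame objects
    (anything with a compatible `join(other, on=..., how=...)` method works).
    """
    # Wrap a single column name in a list, as the tip above suggests.
    if isinstance(on, str):
        on = [on]
    if how not in JOIN_TYPES:
        raise ValueError(f"Unsupported join type: {how!r}")
    return left.join(right, on=list(on), how=how)
```

With real Spark DataFrames df1 and df2 sharing an id column, join_on(df1, df2, "id", how="left_anti") would be equivalent to df1.join(df2, on=["id"], how="left_anti").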

Hands on the Python Library toml

Apr 16, 2021

Tips & Traps

Please refer to Parse TOML File in Python for general tips on parsing TOML in Python.

Installation

Replace Single Quotes With Double Quotes in Python Code

Apr 15, 2021

There are 2 ways.

Format the Python code using black, which automatically converts single quotes to double quotes when applicable. (Note that you can format the code again with yapf afterwards if you want the final result to follow yapf's style.)
Use the tool myint/unify to help you.

Inner Join of Spark DataFrames

Apr 13, 2021

Tips and Traps

Select only needed columns before joining.
Rename joining column names to be identical (if different) before joining.
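The two tips above can be sketched as follows. This is a minimal illustration, not the post's own code: inner_join_trimmed and its parameters are hypothetical names, and left and right are assumed to be pyspark.sql.DataFrame objects.

```python
def inner_join_trimmed(left, right, left_key, right_key, left_cols, right_cols):
    """Inner join after trimming columns and aligning the key name.

    `left` and `right` are assumed to be pyspark.sql.DataFrame objects.
    `left_cols` / `right_cols` list the non-key columns to keep.
    """
    # Tip 1: select only the needed columns before joining.
    left_small = left.select(left_key, *left_cols)
    right_small = right.select(right_key, *right_cols)
    # Tip 2: rename the joining column so both sides use the same name.
    if right_key != left_key:
        right_small = right_small.withColumnRenamed(right_key, left_key)
    # Joining on a list of names keeps a single copy of the key column.
    return left_small.join(right_small, on=[left_key], how="inner")
```

For example, with a users DataFrame keyed by user_id and an orders DataFrame keyed by uid (both hypothetical), inner_join_trimmed(users, orders, "user_id", "uid", ["name"], ["amount"]) would join on user_id and carry only name and amount along.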

Debug Python Project in Visual Studio Code

Apr 23, 2021

Ways to Open a Command Palette

Use the menu: View -> Command Palette....
Use the shortcut Shift + Command + P (on macOS).

Command Palette

You can search for commands in the Command Palette, which makes things very convenient.

Run Tests or a Python File

Open the Command Palette.
Search for Python: Run in the …

Comparing Similarity of Two Different Clusterings

Oct 30, 2020

The paper Comparing Clusterings - An Overview gives a good overview of different metrics for comparing the similarity of 2 clusterings. Overall, Normalized Mutual Information sounds like a good one. It is implemented in sklearn as sklearn.metrics.normalized_mutual_info_score. Of course, there are many more metrics for measuring similarity of 2 …
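To make the idea behind Normalized Mutual Information concrete, here is a minimal pure-Python sketch. It assumes normalization by the arithmetic mean of the two entropies (which, as far as I know, matches sklearn's default); the function names are my own, and for real work sklearn.metrics.normalized_mutual_info_score is the better choice.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ab / n) * log(n * n_ab / (count_a[x] * count_b[y]))
        for (x, y), n_ab in joint.items()
    )


def normalized_mutual_info(a, b):
    """NMI normalized by the arithmetic mean of the entropies (an assumption)."""
    if len(a) != len(b):
        raise ValueError("Both labelings must cover the same items.")
    h_a, h_b = entropy(a), entropy(b)
    # Two trivial clusterings (one cluster each) are treated as identical.
    if h_a == 0.0 and h_b == 0.0:
        return 1.0
    return mutual_information(a, b) / ((h_a + h_b) / 2)
```

The score depends only on how items are grouped, not on the label values themselves, so [0, 0, 1, 1] and [1, 1, 0, 0] count as the same clustering.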


Ben Chuanlong Du's Blog

And let it direct your passion with reason.