Process Big Data Using PySpark

Apr 30, 2021

PySpark 2.4 and older do not support Python 3.8, so you have to use Python 3.7 with those versions.
It can be extremely helpful to run a PySpark application locally to detect possible issues before submitting it to the Spark cluster.
```
#!/usr/bin/env bash …
```
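As a rough illustration of what a local run can look like, the sketch below builds a `spark-submit` command line that uses the `local[N]` master. The helper name `local_submit_command`, the script name `my_app.py`, and the configuration value are made up for this example and are not taken from the post.

```python
import shlex


def local_submit_command(app_path, cores=2, extra_conf=None):
    """Build a spark-submit command line that runs app_path in local mode."""
    # local[N] runs Spark inside a single JVM with N worker threads.
    cmd = ["spark-submit", "--master", f"local[{cores}]"]
    # Each Spark setting is passed as a separate --conf key=value pair.
    for key, value in (extra_conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)
    return cmd


# Hypothetical application script and setting, for illustration only.
cmd = local_submit_command(
    "my_app.py", cores=2, extra_conf={"spark.sql.shuffle.partitions": 8}
)
# Quote each part so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

Running the same application with a small `local[N]` master first tends to surface import errors and schema mistakes before any cluster resources are requested.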

Control Number of Partitions of a DataFrame in Spark

Dec 11, 2021

Tips and Traps

DataFrame.repartition repartitions the DataFrame by hash code of each row. If you pass one or more columns (instead of a number of partitions) to DataFrame.repartition, then the hash code of those columns is used for repartitioning. In some situations, there are lots of hash collisions even if the total number of rows is small (e.g., a few thousand), which means that partitions generated might be skewed
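The skew described above can be illustrated without Spark. The sketch below assigns rows to partitions by `hash(key) % num_partitions`, using `zlib.crc32` as a stand-in hash (an assumption for illustration; it is not the hash function Spark uses) and a made-up key column with only five distinct values.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count rows per partition when each row goes to hash(key) % num_partitions."""
    # crc32 is a deterministic stand-in hash, not Spark's own hash function.
    counts = Counter(
        zlib.crc32(str(key).encode()) % num_partitions for key in keys
    )
    return [counts.get(i, 0) for i in range(num_partitions)]


# 3,000 rows but only 5 distinct key values, spread over 200 partitions
# (200 is the default value of spark.sql.shuffle.partitions).
keys = [f"country_{i % 5}" for i in range(3000)]
sizes = partition_sizes(keys, 200)
non_empty = sum(1 for size in sizes if size > 0)
print(f"non-empty partitions: {non_empty} of {len(sizes)}, largest: {max(sizes)}")
```

Because all rows sharing a key value land in the same partition, at most five of the 200 partitions can receive data here, and each non-empty partition holds at least 600 rows.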

User-defined Function (UDF) in PySpark

Nov 27, 2020

Tips and Traps

The easiest way to define a UDF in PySpark is to use the @udf decorator, and similarly the easiest way to define a pandas UDF in PySpark is to use the @pandas_udf decorator. Pandas UDFs are preferred to UDFs for several reasons. First, pandas UDFs are typically much faster than UDFs. Second, pandas UDFs are more flexible than UDFs on parameter passing. Both UDFs and pandas UDFs can take multiple columns as parameters. In addition, pandas UDFs can take a DataFrame as parameter (when passed to the apply

PySpark Issue: Java Gateway Process Exited Before Sending the Driver Its Port Number

Oct 10, 2021

I encountered the issue when using PySpark locally (the issue can happen on a cluster as well). It turned out to be caused by a misconfiguration of the environment variable JAVA_HOME in Docker.
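A quick way to rule out this kind of misconfiguration is to check JAVA_HOME before starting PySpark. The sketch below is a minimal check under the assumption of a Linux-style layout where the launcher lives at `$JAVA_HOME/bin/java`; the helper name `check_java_home` is made up for this example.

```python
import os
from pathlib import Path


def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME points at a Java install."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    # Assumes a Linux-style layout with the launcher at $JAVA_HOME/bin/java.
    java_bin = Path(java_home) / "bin" / "java"
    if not java_bin.is_file():
        return False, f"{java_bin} does not exist"
    return True, f"found {java_bin}"


ok, message = check_java_home()
print(ok, message)
```

If the check fails inside a Docker container, the JAVA_HOME value set in the image is a good first place to look.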

References

PySpark: Exception: Java gateway process exited before sending the driver its port number

Date Functions in Spark

Apr 27, 2021

Tips and Traps

An HDFS table might contain invalid data (I'm not clear about the reasons at this time) with respect to the column types (e.g., Date and Timestamp). This will cause issues when Spark tries to load the data. For more discussion, please refer to Unrecognized column type:TIMESTAMP_TYP.

datetime.datetime or datetime.date

Packaging Python Dependencies for PySpark Using conda-pack

Apr 30, 2021

python-build-standalone is a better alternative to conda-pack for managing Python dependencies for PySpark. Please refer to Packaging Python Dependencies for PySpark Using python-build-standalone for tutorials on how to use python-build-standalone to manage Python dependencies for PySpark.

Build Portable Python Environments Using conda-pack

Please refer to the GitHub repo dclong/conda_environ …

Ben Chuanlong Du's Blog

And let it direct your passion with reason.
