Process Big Data Using PySpark

Apr 30, 2021

PySpark 2.4 and older do not support Python 3.8, so you have to use Python 3.7 with those versions.
It can be extremely helpful to run a PySpark application locally to detect possible issues before submitting it to the Spark cluster.
```
#!/usr/bin/env bash …
```
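
As a rough Python-side illustration of the two points above, the sketch below first checks the interpreter against the PySpark 2.4 rule and then builds a local session. The helper names `python_ok_for_pyspark` and `create_local_session` are hypothetical, and the check only encodes the rule stated above; it assumes nothing about the constraints of newer PySpark releases.

```python
import sys


def python_ok_for_pyspark(pyspark_version, python_version=None):
    """Return False when PySpark <= 2.4 is paired with Python >= 3.8.

    Hypothetical helper: it only encodes the rule from the text and
    treats every other combination as "no known restriction".
    """
    if python_version is None:
        python_version = sys.version_info[:2]
    spark_major_minor = tuple(int(part) for part in pyspark_version.split(".")[:2])
    if spark_major_minor <= (2, 4):
        return tuple(python_version[:2]) < (3, 8)
    return True


def create_local_session(app_name="local_test"):
    """Create a SparkSession that runs on the local machine.

    PySpark is imported lazily so the check above can run without it.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # use all local cores instead of a cluster
        .appName(app_name)
        .getOrCreate()
    )


# Possible usage (requires PySpark to be installed):
# spark = create_local_session()
# spark.range(5).show()
# spark.stop()
```

Running against `local[*]` exercises the same DataFrame code path as a cluster run, which is usually enough to catch schema and logic errors early.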

User-defined Function (UDF) in PySpark

Nov 27, 2020

Tips and Traps

The easiest way to define a UDF in PySpark is to use the @udf decorator, and similarly the easiest way to define a pandas UDF in PySpark is to use the @pandas_udf decorator. Pandas UDFs are preferred to UDFs for several reasons. First, pandas UDFs are typically much faster than UDFs. Second, pandas UDFs are more flexible than UDFs on parameter passing. Both UDFs and pandas UDFs can take multiple columns as parameters. In addition, pandas UDFs can take a DataFrame as parameter (when passed to the apply
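
A minimal sketch of the two decorators side by side, assuming Spark 3.x with pandas and PyArrow installed. The names `plus_one`, `plus_one_udf`, `plus_one_pandas_udf`, and `build_udfs` are made up for illustration, and PySpark is imported inside the function so the plain Python logic can be tried without it.

```python
def plus_one(x):
    """Plain Python logic shared by both UDF flavors."""
    return x + 1


def build_udfs():
    """Wrap plus_one as a row-at-a-time UDF and as a pandas UDF."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, udf

    @udf("long")
    def plus_one_udf(x):
        # Called once per row with a single Python value.
        return plus_one(x)

    @pandas_udf("long")
    def plus_one_pandas_udf(s: pd.Series) -> pd.Series:
        # Called once per batch with a whole pandas Series.
        return plus_one(s)

    return plus_one_udf, plus_one_pandas_udf


# Possible usage (requires a running SparkSession named `spark`):
# plus_one_udf, plus_one_pandas_udf = build_udfs()
# df = spark.range(3)
# df.select(plus_one_udf("id"), plus_one_pandas_udf("id")).show()
```

The pandas version receives data in batches, which is where the speed difference mentioned above tends to come from.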

Concurrency and Parallel Computing in Python

Nov 16, 2016

The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations. Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Therefore it is only in multithreaded programs that spend …
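
To illustrate the point about blocking operations, here is a small sketch using only the standard library. `fake_io` is a hypothetical stand-in for a blocking call; `time.sleep` releases the GIL while it waits, so the threads can overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(task_id, delay=0.05):
    """Pretend to do blocking I/O; sleeping releases the GIL."""
    time.sleep(delay)
    return task_id


def run_threaded(n_tasks=8):
    """Run the blocking tasks on a thread pool, one thread per task."""
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        # pool.map returns results in the order the inputs were given.
        return list(pool.map(fake_io, range(n_tasks)))


results = run_threaded(4)
```

With pure-Python CPU-bound work in place of `time.sleep`, the same pool would not be expected to give a comparable speedup in CPython.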

Parallel Computing Using Multithreading

Jun 26, 2012

Not all jobs are suitable for parallel computing. The more communication the threads have to make, the more dependent the jobs are and the less efficient the parallel computing is.
Generally speaking, commercial software (Mathematica, MATLAB, Revolution R, etc.) has very good support for parallel computing.
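
The point about communication between threads can be sketched in Python as follows. Both functions compute the same sum, but `independent_sum` lets each worker finish its own chunk, while `shared_sum` makes every worker go through one lock for every element. The function names and chunk sizes are illustrative assumptions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def independent_sum(chunks):
    """Each worker sums its own chunk; results are combined once at the end."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return sum(pool.map(sum, chunks))


def shared_sum(chunks):
    """Every worker updates one shared total, so threads contend for the lock."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        for x in chunk:
            with lock:  # one synchronization point per element
                total += x

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(work, chunks))  # consume the iterator to wait for all workers
    return total


chunks = [list(range(start, start + 1000)) for start in range(0, 4000, 1000)]
```

The second version is correct but spends much of its time on synchronization, which is the kind of dependence between jobs that makes parallel computing less efficient.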

Python

Please refer …

Ben Chuanlong Du's Blog

And let it direct your passion with reason.