Installation
Process Big Data Using PySpark
- PySpark 2.4 and older do not support Python 3.8. You have to use Python 3.7 or earlier with PySpark 2.4 and older.
- It can be extremely helpful to run a PySpark application locally to detect possible issues before submitting it to the Spark cluster.
#!/usr/bin/env bash …
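In Python, the quickest local smoke test is simply to point the SparkSession at a local master; a minimal sketch (the app name and sample data below are made up for illustration):

```python
from pyspark.sql import SparkSession

# Build a session against a local master; "local[2]" runs Spark with two
# threads on this machine instead of submitting to a cluster.
spark = (
    SparkSession.builder
    .appName("local_smoke_test")  # hypothetical app name
    .master("local[2]")
    .getOrCreate()
)

# A tiny DataFrame is enough to catch import, schema, and UDF errors early.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

spark.stop()
```

For code that will later be submitted to a cluster, the master is usually left out of the code and passed via spark-submit --master instead, so the same entry point works in both environments.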
User-defined Function (UDF) in PySpark
Tips and Traps
The easiest way to define a UDF in PySpark is to use the @udf decorator, and similarly the easiest way to define a pandas UDF in PySpark is to use the @pandas_udf decorator. Pandas UDFs are preferred to UDFs for several reasons. First, pandas UDFs are typically much faster than UDFs. Second, pandas UDFs are more flexible than UDFs in parameter passing. Both UDFs and pandas UDFs can take multiple columns as parameters. In addition, pandas UDFs can take a DataFrame as a parameter (e.g., when passed to the apply method of a GroupedData object).
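For example (a minimal sketch assuming Spark 3.x with pyarrow installed; the column and function names are made up):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("udf_demo").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])

# Row-at-a-time UDF: called once per row with plain Python values.
@udf(returnType=LongType())
def plus_one(x):
    return x + 1

# Vectorized pandas UDF: called with whole pandas Series, usually much faster.
@pandas_udf(LongType())
def plus_one_vec(x: pd.Series) -> pd.Series:
    return x + 1

df.select(
    "x",
    plus_one("x").alias("udf"),
    plus_one_vec("x").alias("pandas_udf"),
).show()
```

The DataFrame-as-parameter case mentioned above corresponds to grouped operations such as df.groupBy(...).applyInPandas(func, schema) in Spark 3.x, where func receives each group as a pandas DataFrame.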
Regular Expression Equivalent
- The order of precedence of operators in POSIX extended regular expressions is as follows, from highest to lowest.
    - Collation-related bracket symbols [==], [::], [..]
    - Escaped characters \
    - Character set (bracket expression) []
    - Grouping ()
    - Single-character-ERE duplication *, +, ?, {m,n}
    - Concatenation
    - Anchoring ^, $
    - Alternation |
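Python's re module is not POSIX ERE, but the relative precedence of duplication, concatenation, and alternation is the same, which is easy to verify:

```python
import re

# Alternation has the lowest precedence: "ab|cd" means "(ab)|(cd)", not "a(b|c)d".
assert re.fullmatch(r"ab|cd", "ab")
assert re.fullmatch(r"ab|cd", "cd")
assert re.fullmatch(r"ab|cd", "abd") is None

# Duplication binds tighter than concatenation: "ab*" means "a(b*)", not "(ab)*".
assert re.fullmatch(r"ab*", "abbb")
assert re.fullmatch(r"ab*", "abab") is None

# Grouping overrides the default precedence.
assert re.fullmatch(r"(ab)*", "abab")
```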
- Some regular expression patterns are defined using a single leading backslash, e.g., \s, \b, etc. However, since special …
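A common trap with single-backslash patterns such as \s and \b is that the backslash is also a string-escape character, so in Python the pattern is usually written as a raw string; a small illustration:

```python
import re

# In a normal string literal the backslash must be doubled; a raw string avoids that.
assert re.search("\\d+", "abc123").group() == "123"
assert re.search(r"\d+", "abc123").group() == "123"

# \b is especially error-prone: in a non-raw string "\b" is the backspace
# character (0x08), not the regex word-boundary token.
assert "\b" == "\x08"
assert re.search(r"\bcat\b", "a cat sat") is not None
```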
PySpark Issue: Java Gateway Process Exited Before Sending the Driver Its Port Number
I encountered this issue when using PySpark locally (it can happen on a cluster as well). It turned out to be caused by a misconfiguration of the environment variable JAVA_HOME in Docker.
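One quick way to see what the PySpark driver will actually find before a SparkSession is created (a minimal sketch using only the standard library):

```python
import os
import shutil
import subprocess

# The "Java gateway process exited" error is often just a missing or
# misconfigured JAVA_HOME, so check what the driver will see.
java_home = os.environ.get("JAVA_HOME")
print("JAVA_HOME =", java_home)

if java_home:
    # JAVA_HOME should point to a JDK/JRE that contains bin/java.
    print("bin/java exists:", os.path.exists(os.path.join(java_home, "bin", "java")))

# Either way, a java binary should be resolvable and its version printable.
java_bin = shutil.which("java")
print("java on PATH:", java_bin)
if java_bin:
    subprocess.run([java_bin, "-version"], check=False)
```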
References
PySpark: Exception: Java gateway process exited before sending the driver its port number