Tips and Traps
The easiest way to define a UDF in PySpark is to use the
@udf decorator, and similarly the easiest way to define a pandas UDF in PySpark is to use the @pandas_udf decorator. Pandas UDFs are preferred to UDFs for several reasons. First, pandas UDFs are typically much faster than UDFs. Second, pandas UDFs are more flexible than UDFs in parameter passing. Both UDFs and pandas UDFs can take multiple columns as parameters. In addition, pandas UDFs can take a DataFrame as a parameter (when passed to the apply …
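For a concrete comparison, below is a minimal sketch of both decorators. The sample data, the column name x, and the function names are made up for illustration, and the pandas UDF form shown assumes Spark 3.x with pyarrow installed.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf_vs_pandas_udf").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

# Plain UDF: invoked once per row with Python scalars.
@udf(DoubleType())
def plus_one(x):
    return x + 1.0

# Pandas UDF: invoked with a pandas Series per batch, so it is vectorized.
@pandas_udf(DoubleType())
def plus_one_vec(x: pd.Series) -> pd.Series:
    return x + 1.0

df.withColumn("y1", plus_one("x")).withColumn("y2", plus_one_vec("x")).show()
```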
Column Functions and Operators in Spark
Terminal Multiplexers

- There are two mature, popular terminal multiplexer apps: screen and tmux. Both of them are very useful if you want to work on multiple tasks over one SSH connection. Screen is relatively simple to use, while tmux is much more powerful but also more complicated to use.
- Besides enabling users to …
Regular Expression Equivalent
- The order of precedence of operators in POSIX extended regular expressions, from highest to lowest, is as follows (a short Python illustration follows this list).
  - Collation-related bracket symbols: [==], [::], [..]
  - Escaped characters: \
  - Character set (bracket expression): []
  - Grouping: ()
  - Single-character-ERE duplication: *, +, ?, {m,n}
  - Concatenation
  - Anchoring: ^, $
  - Alternation: |
- Some regular expression patterns are defined using a single leading backslash, e.g., \s, \b, etc. However, since special …
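To make the precedence concrete, here is a small sketch using Python's re module, which follows the same relative precedence for duplication, concatenation, grouping, and alternation; the patterns and test strings are made up for illustration.

```python
import re

# Concatenation binds tighter than alternation:
# "ab|cd" means "(ab)|(cd)", not "a(b|c)d".
assert re.fullmatch(r"ab|cd", "ab") is not None
assert re.fullmatch(r"ab|cd", "cd") is not None
assert re.fullmatch(r"ab|cd", "abd") is None

# Grouping overrides the default precedence:
# "a(b|c)d" matches "abd" and "acd".
assert re.fullmatch(r"a(b|c)d", "abd") is not None
assert re.fullmatch(r"a(b|c)d", "acd") is not None

# Duplication (*) binds tighter than concatenation:
# "ab*" means "a(b*)", not "(ab)*".
assert re.fullmatch(r"ab*", "abbb") is not None
assert re.fullmatch(r"ab*", "abab") is None
```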
Check Whether a Linux Is Using upstart, systemd or SysV
The simplest way to check whether a Linux system is running systemd, upstart, or SysV init is to run the following command.
ps -p1 | grep "init\|upstart\|systemd"
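As an alternative, here is a minimal Python sketch that reads the command name of PID 1 directly; it assumes a Linux system with /proc mounted.

```python
# The name of PID 1 is typically "systemd" on systemd systems and
# "init" on upstart or SysV init systems.
with open("/proc/1/comm") as f:
    print(f.read().strip())
```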
References
How to determine which system manager is running on Linux System
PySpark Issue: Java Gateway Process Exited Before Sending the Driver Its Port Number
I encountered this issue when using PySpark locally
(it can happen on a cluster as well).
It turned out to be caused by a misconfigured JAVA_HOME environment variable in Docker.
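Below is a minimal sketch of the kind of sanity check that surfaces this misconfiguration early; the JDK path and the app name are examples only and will differ per Docker image.

```python
import os
import subprocess

# Point JAVA_HOME at a real JDK before starting Spark; the path below
# is only an example and depends on the image being used.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-11-openjdk-amd64")

# Fail early with a clear message instead of the opaque
# "Java gateway process exited" error later.
java = os.path.join(os.environ["JAVA_HOME"], "bin", "java")
if not os.path.isfile(java):
    raise FileNotFoundError(f"JAVA_HOME seems wrong: {java} does not exist")
subprocess.run([java, "-version"], check=True)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("java_home_check").getOrCreate()
```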
References
PySpark: Exception: Java gateway process exited before sending the driver its port number