Tips and Traps
The easiest way to define a UDF in PySpark is to use the
@udf
decorator, and similarly the easiest way to define a pandas UDF in PySpark is to use the
@pandas_udf
decorator. Pandas UDFs are preferred to UDFs for several reasons. First, pandas UDFs are typically much faster than UDFs. Second, pandas UDFs are more flexible than UDFs in parameter passing. Both UDFs and pandas UDFs can take multiple columns as parameters. In addition, pandas UDFs can take a DataFrame as a parameter (when passed to the apply …
Regular Expression Equivalent
-
The order of precedence of operators in POSIX extended regular expressions, from highest to lowest, is as follows.
- Collation-related bracket symbols: [==], [::], [..]
- Escaped characters: \
- Character set (bracket expression): []
- Grouping: ()
- Single-character-ERE duplication: *, +, ?, {m,n}
- Concatenation
- Anchoring: ^, $
- Alternation: |
-
Some regular expression patterns are defined using a single leading backslash, e.g., \s, \b, etc. However, since special …
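The precedence order listed above can be checked with a few small patterns. The sketch below uses Python's re module, which is not a POSIX ERE engine but, as far as these operators go, groups them the same way. The helper names full_match and found are made up for this illustration.

```python
import re


def full_match(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# Duplication binds tighter than concatenation:
# in "ab*" the * applies to "b" only, not to "ab".
dup_single = full_match(r"ab*", "abbb")      # True
dup_not_pair = full_match(r"ab*", "abab")    # False

# Grouping binds tighter than duplication:
# in "(ab)*" the * applies to the whole group.
group_repeat = full_match(r"(ab)*", "abab")  # True

# Concatenation binds tighter than alternation:
# "ab|cd" means "(ab)|(cd)", not "a(b|c)d".
alt_whole = full_match(r"ab|cd", "cd")       # True
alt_not_inner = full_match(r"ab|cd", "acd")  # False

# Anchoring binds tighter than alternation:
# "^a|b$" means "(^a)|(b$)".
anchor_left = found(r"^a|b$", "axx")         # True, via ^a
anchor_right = found(r"^a|b$", "xxb")        # True, via b$
anchor_none = found(r"^a|b$", "xax")         # False
```

When in doubt about how a pattern groups, adding explicit parentheses is usually the safer choice.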
PySpark Issue: Java Gateway Process Exited Before Sending the Driver Its Port Number
I encountered this issue when using PySpark locally
(it can happen on a cluster as well).
It turned out to be caused by a misconfiguration of the environment variable JAVA_HOME
in Docker.
References
PySpark: Exception: Java gateway process exited before sending the driver its port number
Serialize and Deserialize Object Using Pickle in Python
Tips and Traps
- Make sure to open pickle files in binary mode: rb when reading and wb when writing.
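A minimal sketch of the tip above, using only the standard library. The helper names save_object and load_object, the sample record, and the file name record.pkl are made up for this illustration.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path):
    """Serialize obj to path. Pickle data is bytes, so open with "wb"."""
    with open(path, "wb") as fout:
        pickle.dump(obj, fout)


def load_object(path):
    """Deserialize an object from path. Open with "rb" to read bytes."""
    with open(path, "rb") as fin:
        return pickle.load(fin)


record = {"name": "example", "values": [1, 2, 3]}

# Round-trip the object through a file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "record.pkl"
    save_object(record, path)
    restored = load_object(path)
```

Opening the file in text mode (r or w) instead makes pickle.dump and pickle.load fail in Python 3, because they expect a file object that works with bytes.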
Visual Studio Code for Python
Extensions
Please refer to Useful Visual Studio Code Extensions .
Set Python Environment for Visual Studio Code Server
- Open File -> Preferences -> Settings.
- Click on Workspace.
- Search for Python Path.
- Change Python Path to the one you want to use.
Debug a Python Project
Visual Studio Live Share
Install Python Packages Behind Firewall
It is recommended that you use pip
to install Python packages.
-
If you don't already know the proxy in use (in your company), read the post Find out Proxy in Use to figure it out.
-
Set proxy environment variables.
set http_proxy=http://user:password@proxy_ip:port
set https_proxy=https://user …
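If you drive pip from a Python script, the same idea can be sketched as below. The helpers proxy_environment and pip_install_command are hypothetical names for this illustration, the proxy URL is the placeholder from above rather than a working address, and requests is only an example package name. The sketch builds the command and environment without running anything.

```python
import os
import sys


def proxy_environment(http_proxy, https_proxy, base_env=None):
    """Return a copy of the environment with the proxy variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["http_proxy"] = http_proxy
    env["https_proxy"] = https_proxy
    return env


def pip_install_command(package, proxy=None):
    """Build a pip install command, optionally passing pip's --proxy option."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if proxy is not None:
        cmd += ["--proxy", proxy]
    cmd.append(package)
    return cmd


# Placeholder value: replace with the proxy used in your company.
PROXY = "http://user:password@proxy_ip:port"

env = proxy_environment(PROXY, PROXY, base_env={})
cmd = pip_install_command("requests", proxy=PROXY)

# To actually run the installation (needs network access and a real proxy):
# import subprocess
# subprocess.run(cmd, env=proxy_environment(PROXY, PROXY), check=True)
```

Setting the environment variables and passing --proxy are two alternative ways to tell pip about the proxy. Either one is normally enough on its own.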