Process Big Data Using PySpark
- PySpark 2.4 and older do not support Python 3.8, so you have to use Python 3.7 with those versions.
- It can be extremely helpful to run a PySpark application locally to detect possible issues before submitting it to the Spark cluster.
#!/usr/bin/env bash …
Regular Expression Equivalent
- The order of precedence of operators in POSIX extended regular expressions, from highest to lowest, is as follows.
    - Collation-related bracket symbols: [==], [::], [..]
    - Escaped characters: \
    - Character set (bracket expression): []
    - Grouping: ()
    - Single-character-ERE duplication: *, +, ?, {m,n}
    - Concatenation
    - Anchoring: ^, $
    - Alternation: |
- Some regular expression patterns are defined using a single leading backslash, e.g., \s, \b, etc. However, since special …
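The precedence order above can be checked with a few small patterns. The sketch below uses Python's `re` module, which is not a POSIX ERE engine, but it treats duplication, concatenation, anchoring, and alternation in the same relative order, so it works as a quick sanity check. The helper names and sample strings are made up for illustration.

```python
import re


def fully_matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


def found_in(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None


# Duplication binds tighter than concatenation:
# "ab*" means "a" followed by zero or more "b", not "(ab)*".
dup_plain = fully_matches(r"ab*", "abbb")      # True
dup_repeat = fully_matches(r"ab*", "abab")     # False

# Grouping overrides that: "(ab)*" repeats the whole group.
group_repeat = fully_matches(r"(ab)*", "abab")  # True

# Concatenation binds tighter than alternation:
# "ab|cd" means "(ab)|(cd)", not "a(b|c)d".
alt_branch = fully_matches(r"ab|cd", "cd")     # True
alt_mixed = fully_matches(r"ab|cd", "acd")     # False
alt_grouped = fully_matches(r"a(b|c)d", "acd")  # True

# Anchoring binds tighter than alternation:
# "^a|b$" means "(^a)|(b$)", so each anchor applies to one branch only.
anchor_left = found_in(r"^a|b$", "axx")        # True
anchor_right = found_in(r"^a|b$", "xxb")       # True
anchor_none = found_in(r"^a|b$", "xax")        # False
```

When a pattern does not behave as expected, adding explicit parentheses is usually the simplest way to make the intended grouping unambiguous.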
PySpark Issue: Java Gateway Process Exited Before Sending the Driver Its Port Number
I encountered this issue when using PySpark locally
(it can happen on a cluster as well).
It turned out to be caused by a misconfiguration of the environment variable JAVA_HOME in Docker.
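A quick way to rule out this cause is to inspect JAVA_HOME before starting PySpark. The following is only a minimal sketch: the function name `java_home_problem` is hypothetical, and the check assumes that a usable Java home contains `bin/java` (or `bin/java.exe` on Windows).

```python
import os
from pathlib import Path


def java_home_problem(env=None):
    """Return a description of what looks wrong with JAVA_HOME, or None if it looks usable."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        # Not necessarily fatal: as far as I know, Spark's launcher scripts
        # fall back to a `java` found on PATH when JAVA_HOME is unset.
        return "JAVA_HOME is not set"
    home = Path(java_home)
    if not home.is_dir():
        return f"JAVA_HOME points to a missing directory: {java_home}"
    # Assumption: a JDK/JRE home contains bin/java (bin/java.exe on Windows).
    if not any((home / "bin" / name).is_file() for name in ("java", "java.exe")):
        return f"no java executable found under {home / 'bin'}"
    return None
```

Calling `java_home_problem()` inside the Docker container, before creating a Spark session, shows whether the variable points at a real Java installation.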
References
PySpark: Exception: Java gateway process exited before sending the driver its port number
Visual Studio Code for Python
Extensions
Please refer to Useful Visual Studio Code Extensions.
Set Python Environment for Visual Studio Code Server
- Open File -> Preferences -> Settings.
- Click on Workspace.
- Search for Python Path.
- Change Python Path to the one you want to use.
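The same change can also be made by editing the workspace settings file directly. The sketch below assumes the workspace keeps its settings in `.vscode/settings.json` and that the "Python Path" entry corresponds to the `python.pythonPath` key, which is what older releases of the Python extension used (newer releases use `python.defaultInterpreterPath` instead). The helper name and the interpreter path in the usage note are hypothetical.

```python
import json
from pathlib import Path


def set_workspace_python(workspace_dir, interpreter, key="python.pythonPath"):
    """Write the interpreter path into <workspace_dir>/.vscode/settings.json."""
    settings_file = Path(workspace_dir) / ".vscode" / "settings.json"
    settings_file.parent.mkdir(parents=True, exist_ok=True)
    # Keep existing settings; this assumes the file is plain JSON without comments.
    settings = json.loads(settings_file.read_text()) if settings_file.exists() else {}
    settings[key] = interpreter
    settings_file.write_text(json.dumps(settings, indent=4) + "\n")
    return settings_file
```

For example, `set_workspace_python(".", "/usr/bin/python3")` would point the current workspace at that interpreter. Pass `key="python.defaultInterpreterPath"` if your extension version expects the newer setting.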

Debug a Python Project
Visual Studio Live Share
Install Python Packages Behind Firewall
It is recommended that you use pip to install Python packages.
- If you don't already know the proxy in use (in your company), read the post Find out Proxy in Use to figure it out.
- Set proxy environment variables.
set http_proxy=http://user:password@proxy_ip:port
set https_proxy=https://user …
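For scripts that install packages, the same proxy settings can be handed to pip from Python. This is a minimal sketch, assuming the proxy URL has the `http://user:password@proxy_ip:port` form used above. The helper names are hypothetical, while `--proxy` is pip's own command-line option for specifying a proxy.

```python
import os
import sys


def proxy_env(http_proxy, https_proxy=None, base_env=None):
    """Return a copy of the environment with the proxy variables set."""
    env = dict(os.environ if base_env is None else base_env)
    https_proxy = http_proxy if https_proxy is None else https_proxy
    # Set both spellings, since tools differ in which case they read.
    env["http_proxy"] = env["HTTP_PROXY"] = http_proxy
    env["https_proxy"] = env["HTTPS_PROXY"] = https_proxy
    return env


def pip_install_command(package, proxy_url):
    """Build a pip command that passes the proxy explicitly."""
    return [sys.executable, "-m", "pip", "install", "--proxy", proxy_url, package]
```

With real values substituted for the placeholders, `subprocess.run(pip_install_command("requests", proxy), env=proxy_env(proxy), check=True)` would run the install through the proxy; `requests` here is just an example package.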