It is recommended to use Python scripts instead of shell scripts whenever possible.
If you do have to stick with shell scripts,
you can use =~ for regular expression matching in Bash.
This makes Bash extremely flexible and powerful.
For example,
you can match multiple strings using …
Check Whether a Linux Is Using upstart, systemd or SysV
The simplest way to check whether a Linux system is running systemd, upstart or SysV is to inspect the process with PID 1 using the following command.
ps -p1 | grep "init\|upstart\|systemd"
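For scripting, the same check can be sketched in Python. This is a minimal sketch, not a definitive detector: the helper names `classify_init` and `detect_init_system` are hypothetical, and it simply looks for the same three names the grep pattern above matches in the output of `ps -p 1`.

```python
import subprocess

# Names this sketch looks for, mirroring the grep pattern above.
# "systemd" and "upstart" are checked before the more generic "init".
KNOWN_INIT_NAMES = ("systemd", "upstart", "init")


def classify_init(ps_output: str) -> str:
    """Return the first known name found in the `ps` output, or "unknown"."""
    text = ps_output.lower()
    for name in KNOWN_INIT_NAMES:
        if name in text:
            return name
    return "unknown"


def detect_init_system() -> str:
    """Run `ps -p 1` and classify its output (assumes `ps` is available)."""
    result = subprocess.run(
        ["ps", "-p", "1"], capture_output=True, text=True, check=False
    )
    return classify_init(result.stdout)
```

Calling `detect_init_system()` on a Linux machine returns one of the three names or "unknown". A plain "init" result only says that PID 1 is named init; it does not by itself distinguish between init implementations.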
References
How to determine which system manager is running on Linux System
PySpark Issue: Java Gateway Process Exited Before Sending the Driver Its Port Number
I encountered this issue when using PySpark locally
(it can happen on a cluster as well).
It turned out to be caused by a misconfiguration of the environment variable JAVA_HOME
in Docker.
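A quick sanity check of JAVA_HOME before starting PySpark can help narrow this down. The following is a minimal sketch; the helper names are hypothetical, and it assumes that a usable Java installation has an executable at `$JAVA_HOME/bin/java`.

```python
import os
from pathlib import Path
from typing import Optional


def java_home_problem(java_home: Optional[str]) -> Optional[str]:
    """Describe what looks wrong with a JAVA_HOME value, or return None if it looks fine."""
    if not java_home:
        return "JAVA_HOME is not set"
    home = Path(java_home)
    if not home.is_dir():
        return f"JAVA_HOME is not a directory: {java_home}"
    # Assumption: a usable installation provides $JAVA_HOME/bin/java.
    java_bin = home / "bin" / "java"
    if not java_bin.is_file():
        return f"no java executable found at {java_bin}"
    return None


def check_current_environment() -> Optional[str]:
    """Apply the check to the JAVA_HOME of the current process."""
    return java_home_problem(os.environ.get("JAVA_HOME"))
```

Running `check_current_environment()` inside the Docker container (or wherever PySpark is launched) returns a short message if the variable is missing or points to the wrong place, and `None` otherwise.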
References
PySpark: Exception: Java gateway process exited before sending the driver its port number
Serialize and Deserialize Object Using Pickle in Python
Tips and Traps
- Make sure to use the binary modes rb (read) and wb (write) when reading and writing pickle files.
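A minimal sketch of the tip above: pickle produces and consumes bytes, so the file is opened in binary mode in both directions. The helper names `save_object` and `load_object` are hypothetical.

```python
import pickle


def save_object(obj, path):
    """Serialize obj to path; "wb" because pickle writes bytes."""
    with open(path, "wb") as fout:
        pickle.dump(obj, fout)


def load_object(path):
    """Deserialize an object from path; "rb" because pickle reads bytes."""
    with open(path, "rb") as fin:
        return pickle.load(fin)
```

Opening the file in text mode ("w" or "r") instead raises a TypeError in Python 3, because pickle hands bytes to a text file or receives str where it expects bytes.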
Date Functions in Spark
Tips and Traps
- An HDFS table might contain data that is invalid with respect to the column types (e.g., Date and Timestamp); the cause is not clear to me at this time. This causes issues when Spark tries to load the data. For more discussion, please refer to Unrecognized column type:TIMESTAMP_TYP.
datetime.datetime or datetime.date
Best Filesystem Format for Cross-platform Data Exchanging
FAT32
FAT32 is an outdated filesystem. The maximum size of a single file is 4 GiB (minus one byte). You should use exFAT instead of FAT32 where possible.
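The limit is easy to check before copying a file to a FAT32 drive. This is a small sketch with a hypothetical helper name; it relies only on the fact that FAT32 records file sizes in a 32-bit field.

```python
# FAT32 stores the file size in 32 bits, so one file can hold at most 2**32 - 1 bytes.
FAT32_MAX_FILE_SIZE = 2**32 - 1


def fits_on_fat32(size_in_bytes: int) -> bool:
    """Return True if a single file of this size can be stored on FAT32."""
    return 0 <= size_in_bytes <= FAT32_MAX_FILE_SIZE
```

In practice the size of an existing file can be obtained with `os.path.getsize(path)` and passed to `fits_on_fat32`.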
exFAT
exFAT is a great cross-platform filesystem that is supported out of the box by Windows, Linux and macOS. There is practically no limit (big enough for average users) on …