Tips and Traps
DataFrame.repartition repartitions the DataFrame by the hash code of each row. If you pass one or more columns (instead of a number of partitions) to DataFrame.repartition, the hash code of those columns is used to assign rows to partitions. In some situations there are many hash collisions even when the total number of rows is small (e.g., a few thousand), so the resulting partitions may be skewed.
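The skew described above can be illustrated without Spark. The sketch below is only an analogy: it assigns keys to partitions with Python's built-in hash modulo the partition count, whereas Spark uses its own hash function, so the exact keys that collide will differ. The helper name partition_sizes and the key values are made up for illustration.

```python
from collections import Counter


def partition_sizes(keys, num_partitions, hash_fn=hash):
    """Count how many keys land in each partition under hash-mod assignment."""
    counts = Counter(hash_fn(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical key column: 2,000 integer IDs that are all multiples of 8.
keys = [i * 8 for i in range(2000)]

# With 8 partitions every key collides into the same partition.
skewed = partition_sizes(keys, 8)

# With 7 partitions the same keys spread out almost evenly.
balanced = partition_sizes(keys, 7)

print("8 partitions:", skewed)
print("7 partitions:", balanced)
```

On a real DataFrame, one way to check for this kind of skew is to group by pyspark.sql.functions.spark_partition_id() and count the rows in each partition after repartitioning.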
Spark Issue: Shell Related
Symptom 1
/bin/sh: hdfs: command not found
Possible Causes of Symptom 1
The command hdfs is not on the search path.
Possible Solutions to Symptom 1
- Use the full path to the command.
- Configure the environment variable PATH before you use the command.
- Find other alternatives to the command …
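The first two solutions can be sketched in Python as a lookup that returns the full path of a command, optionally searching extra directories ahead of PATH. The helper name resolve_command and the directory /opt/hadoop/bin are assumptions for illustration; replace the directory with wherever hdfs is installed on your cluster.

```python
import os
import shutil


def resolve_command(name, extra_dirs=()):
    """Return the full path of `name`, or None if it cannot be found.

    Directories in `extra_dirs` are searched before those in PATH.
    """
    dirs = [*extra_dirs, *os.environ.get("PATH", "").split(os.pathsep)]
    search_path = os.pathsep.join(d for d in dirs if d)
    return shutil.which(name, path=search_path)


# "/opt/hadoop/bin" is a hypothetical install location, not a known default.
hdfs_path = resolve_command("hdfs", extra_dirs=["/opt/hadoop/bin"])
if hdfs_path is None:
    print("hdfs not found: extend PATH or call it by its full path")
else:
    print(f"call hdfs as {hdfs_path}")
```

Passing the resolved full path to the shell avoids depending on the PATH that the Spark driver or executors happen to inherit.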
Spark Issue: Namespace Quota Is Exceeded
Symptom
Caused by: org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: The NameSpace quota (directories and files) of directory /user/user_name is exceeded: quota=163840 file count=163841
Cause
The namespace quota of the directory /user/user_name is exceeded.
Solutions
- Remove unneeded files from the directory /user/user_name to release some namespace …
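Before and after cleaning up, it can help to check how much of the namespace quota is in use. The command hdfs dfs -count -q /user/user_name prints one line per path with the columns QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME. The sketch below parses such a line. The function name parse_count_q is hypothetical, and the sample line is made up for illustration rather than real cluster output.

```python
def parse_count_q(line):
    """Parse one output line of `hdfs dfs -count -q` into a dict.

    Quota columns may contain "none" or "inf" when no quota is set;
    those are mapped to None.
    """
    names = [
        "quota", "remaining_quota", "space_quota", "remaining_space_quota",
        "dir_count", "file_count", "content_size",
    ]
    # Split into at most 8 fields so a path containing spaces stays intact.
    fields = line.split(None, 7)
    info = {
        name: None if value in ("none", "inf") else int(value)
        for name, value in zip(names, fields[:7])
    }
    info["path"] = fields[7].strip()
    return info


# Made-up sample line (not real output) for a directory that is at its quota.
sample = "163840 0 none inf 1200 162640 987654321 /user/user_name"
info = parse_count_q(sample)
used = info["dir_count"] + info["file_count"]
print(f"{info['path']}: {used} of {info['quota']} namespace entries used")
```

Both directories and files count against the namespace quota, so deleting many small files or empty directories frees entries even if it frees little disk space.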
Spark Issue: Rust Panic
If you use Rust with Spark/PySpark and there are issues in the Rust code, you might get Rust panic error messages.
Symptom
Error: b"thread 'main' panicked at 'index out of bounds: the len is 15 but the index is 15', src/game.rs:131:39\nnote: run with …
Spark Issue: RuntimeException: Unsupported Literal Type Class
Symptom
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [1]
Possible Causes
This happens in PySpark when a Python list is provided where a scalar is required. Assuming id0 is an integer column in the DataFrame df, the following code throws the above error.
v = [1, 2, 3 …
Terminal Multiplexers
- There are two mature, popular terminal multiplexers: screen and tmux. Both are very useful if you want to work on multiple tasks over a single SSH connection. Screen is relatively simple to use, while tmux is much more powerful but also more complicated.
- Besides enabling users to …