Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Get CentOS Version

You can get the version of CentOS using the following command.

rpm -q centos-release

This trick can be used to get the version of the CentOS distribution on a Spark cluster. Basically, you run this command on the driver or the workers to print the versions and then parse the log …
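
A minimal PySpark sketch of this trick, assuming a working SparkSession (the app name and number of partitions are made up for illustration):

import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("centos_version").getOrCreate()

def _centos_version(_):
    # Run the command on whichever worker this partition lands on.
    out = subprocess.run(
        ["rpm", "-q", "centos-release"], capture_output=True, text=True
    ).stdout.strip()
    yield out

# One row per partition; the collected output (or the executor logs)
# shows the CentOS release of each worker node that ran a partition.
versions = spark.sparkContext.parallelize(range(8), 8).mapPartitions(_centos_version).collect()
print(set(versions))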

Control Number of Partitions of a DataFrame in Spark

Tips and Traps

  1. DataFrame.repartition repartitions the DataFrame by the hash code of each row. If you pass one or more columns (instead of a number of partitions) to the method DataFrame.repartition, rows are distributed by the hash code of those columns. In some situations there are lots of hash collisions even when the total number of rows is small (e.g., a few thousand), which means the generated partitions can be skewed, as illustrated in the sketch below.
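
A minimal sketch of the two flavors of repartition and how to inspect the resulting partition sizes (the DataFrame here is a made-up example):

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.appName("repartition_demo").getOrCreate()
df = spark.range(10_000).withColumnRenamed("id", "id0")

# Repartition by number of partitions: rows are distributed round-robin,
# so partitions are well balanced.
df1 = df.repartition(10)

# Repartition by column: rows are placed by the hash of id0, so hash
# collisions can leave some partitions much larger than others (skew).
df2 = df.repartition(10, "id0")

# Count rows per partition to check for skew.
df2.groupBy(spark_partition_id().alias("pid")).count().show()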

Spark Issue: Shell Related

Symptom 1

/bin/sh: hdfs: command not found

Possible Causes of Symptom 1

The command hdfs is not on the search path.

Possible Solutions to Symptom 1

  1. Use the full path to the command.
  2. Configure the environment variable PATH before you use the command (see the sketch after this list).
  3. Find other alternatives to the command …
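
For example, when invoking hdfs from Python (e.g., inside a PySpark job), you can use the full path or prepend the command's directory to PATH first. The path /usr/local/hadoop/bin below is an assumption; use the actual location on your cluster.

import os
import subprocess

# Option 1: use the full path to the command.
subprocess.run(["/usr/local/hadoop/bin/hdfs", "dfs", "-ls", "/"], check=True)

# Option 2: put the directory containing hdfs on the search path first.
os.environ["PATH"] = "/usr/local/hadoop/bin:" + os.environ["PATH"]
subprocess.run(["hdfs", "dfs", "-ls", "/"], check=True)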

Spark Issue: Namespace Quota Is Exceeded

Symptom

Caused by: org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: The NameSpace quota (directories and files) of directory /user/user_name is exceeded: quota=163840 file count=163841

Cause

The namespace quota of the directory /user/user_name is exceeded.

Solutions

  1. Remove unneeded files from the directory /user/user_name to release some namespace …
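
A sketch of how you might inspect the quota and clean up from Python (the target directory old_dir is hypothetical; the hdfs subcommands themselves are standard):

import subprocess

# Show the namespace quota and current usage of the directory.
subprocess.run(["hdfs", "dfs", "-count", "-q", "/user/user_name"], check=True)

# Remove unneeded files. Note that -skipTrash matters here: files moved
# to the trash still live under /user/user_name/.Trash and thus still
# count against the same namespace quota.
subprocess.run(
    ["hdfs", "dfs", "-rm", "-r", "-skipTrash", "/user/user_name/old_dir"],
    check=True,
)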

Spark Issue: Rust Panic

If you use Rust with Spark/PySpark and there are issues in the Rust code, you might get Rust panic error messages.

Symptom

Error: b"thread 'main' panicked at 'index out of bounds: the len is 15 but the index is 15', src/game.rs:131:39\nnote: run with …

Spark Issue: RuntimeException: Unsupported Literal Type Class

Symptom

java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [1]

Possible Causes

This happens in PySpark when a Python list is provided where a scalar is required. Assuming id0 is an integer column in the DataFrame df, the following code throws the above error.

v = [1, 2, 3 …
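
A minimal sketch of the failure and the usual fix (the DataFrame below is a made-up example with an integer column id0): in the PySpark versions this post targets, comparing a column to a Python list with == raises the error above, while Column.isin is the intended API for list membership.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("literal_demo").getOrCreate()
df = spark.range(5).withColumnRenamed("id", "id0")

v = [1, 2, 3]

# Throws java.lang.RuntimeException: Unsupported literal type class
# java.util.ArrayList, because a list cannot be converted to a scalar literal.
# df.filter(df.id0 == v).show()

# The intended way to test membership in a list is Column.isin.
df.filter(df.id0.isin(v)).show()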