Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Spark Issue: RuntimeException: Unsupported Literal Type Class

Symptom

java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [1]

Possible Causes

This happens in PySpark when a Python list is provided where a scalar is required. Assuming id0 is an integer column in the DataFrame df, the following code throws the above error.

v = [1, 2, 3 …

User-defined Function (UDF) in PySpark

Tips and Traps

  1. The easiest way to define a UDF in PySpark is to use the @udf decorator, and similarly the easiest way to define a pandas UDF in PySpark is to use the @pandas_udf decorator. Pandas UDFs are preferred to UDFs for several reasons. First, pandas UDFs are typically much faster than UDFs. Second, pandas UDFs are more flexible than UDFs on parameter passing. Both UDFs and pandas UDFs can take multiple columns as parameters. In addition, pandas UDFs can take a DataFrame as parameter (when passed to the apply

Regular Expression in Bash

It is suggested that you use Python scripts instead of shell scripts as much as possible. If you do have to stick with shell scripts, you can use =~ for regular expression matching in Bash. This makes Bash syntax extremely flexible and powerful. For example, you can match multiple strings using …
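
As a minimal sketch of =~ in practice (not taken from the original post), the snippet below matches a file name against a regular expression inside [[ ]] and reads a capture group from BASH_REMATCH. The function name extract_version and the file name pattern are hypothetical, chosen only for illustration.

```bash
#!/usr/bin/env bash

# Hypothetical helper: pull the version out of a name such as "spark-3.5.1.tgz".
extract_version() {
    local name="$1"
    # Keep the pattern in a variable and leave it unquoted on the right of =~;
    # a quoted right-hand side is matched as a literal string, not a regex.
    local pattern='^([a-z]+)-([0-9]+\.[0-9]+\.[0-9]+)\.tgz$'
    if [[ "$name" =~ $pattern ]]; then
        # BASH_REMATCH[0] holds the whole match; [1], [2], ... hold the groups.
        echo "${BASH_REMATCH[2]}"
    else
        # Signal "no match" through the exit status.
        return 1
    fi
}

extract_version "spark-3.5.1.tgz"   # prints 3.5.1
```

Storing the pattern in a variable avoids most quoting surprises, since Bash treats any quoted part of the right-hand side of =~ as literal text.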