Ben Chuanlong Du's Blog

It is never too late to learn.

String Functions in Spark

Tips and Traps

  1. You can use the split function to split a delimited string into an array. It is suggested that you remove trailing separators before applying the split function. Please refer to the split section below for more detailed discussion.

  2. Some string functions (e.g., right) are available in the Spark SQL API but not as Spark DataFrame APIs.

  3. Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively. First, they trim spaces only rather than all whitespace characters by default. Second, when explicitly passing the characters to trim, the 1st parameter is the characters to trim and the 2nd parameter is the string from which to trim characters.

  4. instr and locate behave similarly to each other except that the order of their parameters is reversed.

  5. Notice that replace (the DataFrame method) is for replacing whole values in a column, NOT for replacement inside each string element. To replace a substring with another one in a string, you have to use either regexp_replace or translate (see the sketch after this list).

  6. The operator + does not work as concatenation for string columns. You have to use the function concat instead.
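
A minimal sketch contrasting the two behaviors from tip 5, assuming the SparkSession spark and the pyspark.sql.functions import created later in this post (the tiny DataFrame df_demo is made up for illustration):

from pyspark.sql.functions import regexp_replace

df_demo = spark.createDataFrame([("2017/01/01",), ("2018/02/05",)], ["col1"])
# DataFrame.replace swaps whole column values: only the exact value "2017/01/01" changes.
df_demo.replace("2017/01/01", "2017-01-01", subset=["col1"]).show()
# regexp_replace rewrites substrings inside every value: all "/" become "-".
df_demo.select(regexp_replace("col1", "/", "-").alias("col1")).show()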

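The next few cells are a quick check of Python string escaping, which matters when writing regular expressions for functions such as rlike and regexp_replace below: \s is not a recognized Python escape sequence, so "\s" and "\\s" are the same string, whereas "\n" (a newline) and "\\n" are different.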
In [59]:
import re
In [65]:
re.search("\\s", "nima ")
Out[65]:
<re.Match object; span=(4, 5), match=' '>
In [66]:
s = "\s"
In [70]:
"\s\\s"
Out[70]:
'\\s\\s'
In [69]:
"\s" == "\\s"
Out[69]:
True
In [71]:
"\n" == "\\n"
Out[71]:
False
In [73]:
"\\n"
Out[73]:
'\\n'
In [72]:
"\n"
Out[72]:
'\n'
In [1]:
import pandas as pd
In [2]:
from pathlib import Path
import findspark

findspark.init(str(next(Path("/opt").glob("spark-3*"))))

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType

spark = (
    SparkSession.builder.appName("PySpark_Str_Func").enableHiveSupport().getOrCreate()
)
21/10/04 20:31:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/10/04 20:31:39 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
In [20]:
df = spark.createDataFrame(
    pd.DataFrame(
        data=[
            ("2017/01/01", 1),
            ("2017/02/01", 2),
            ("2018/02/05", 3),
            (None, 4),
            ("how \t", 5),
        ],
        columns=["col1", "col2"],
    )
)
df.show()
+----------+----+
|      col1|col2|
+----------+----+
|2017/01/01|   1|
|2017/02/01|   2|
|2018/02/05|   3|
|      null|   4|
|     how 	|   5|
+----------+----+

The + operator does not work as concatenation for 2 string columns.

In [5]:
df.withColumn("col", col("date") + col("month")).show()
+----------+-----+----+
|      date|month| col|
+----------+-----+----+
|2017/01/01|    1|null|
|2017/02/01|    2|null|
+----------+-----+----+

The function concat concatenates 2 string columns.

In [6]:
df.withColumn("col", concat(col("date"), col("month"))).show()
+----------+-----+-----------+
|      date|month|        col|
+----------+-----+-----------+
|2017/01/01|    1|2017/01/011|
|2017/02/01|    2|2017/02/012|
+----------+-----+-----------+

In [7]:
df.withColumn("col", concat(col("date"), lit("_"), col("month"))).show()
+----------+-----+------------+
|      date|month|         col|
+----------+-----+------------+
|2017/01/01|    1|2017/01/01_1|
|2017/02/01|    2|2017/02/01_2|
+----------+-----+------------+

instr

instr behaves similarly to locate except that the order of their parameters is reversed.

In [8]:
spark.sql(
    """
    select instr("abcd", "ab") as index
    """
).show()
+-----+
|index|
+-----+
|    1|
+-----+

In [9]:
spark.sql(
    """
    select instr("abcd", "AB") as index
    """
).show()
+-----+
|index|
+-----+
|    0|
+-----+

In [6]:
spark.sql(
    """
    select 
        left("how are you doing?", 7) as phrase
    """
).show()
+-------+
| phrase|
+-------+
|how are|
+-------+

In [18]:
val df = Seq(
    ("2017", 1),
    ("2017/02", 2),
    ("2018/02/05", 3),
    (null, 4)
).toDF("date", "month")
df.show
+----------+-----+
|      date|month|
+----------+-----+
|      2017|    1|
|   2017/02|    2|
|2018/02/05|    3|
|      null|    4|
+----------+-----+

In [19]:
import org.apache.spark.sql.functions.length

df.select($"date", length($"date")).show
+----------+------------+
|      date|length(date)|
+----------+------------+
|      2017|           4|
|   2017/02|           7|
|2018/02/05|          10|
|      null|        null|
+----------+------------+


ltrim

Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively. First, they trim spaces only rather than all whitespace characters by default. Second, when explicitly passing the characters to trim, the 1st parameter is the characters to trim and the 2nd parameter is the string from which to trim characters.

In [7]:
spark.sql(
    """
    select ltrim("a ", "a a abcd") as after_ltrim
"""
).show()
+-----------+
|after_ltrim|
+-----------+
|        bcd|
+-----------+

locate

locate behaves similarly to instr except that the order of their parameters is reversed.
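
A minimal sketch of locate for comparison (same SparkSession as above). Since the substring comes first, locate("ab", "abcd") should return 1, matching instr("abcd", "ab") above.

spark.sql(
    """
    select locate("ab", "abcd") as index
    """
).show()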

In [8]:
df.withColumn("date", translate($"date", "/", "-")).show
+----------+-----+
|      date|month|
+----------+-----+
|2017-01-01|    1|
|2017-02-01|    2|
+----------+-----+

For reference, the signature of the related function regexp_extract in the Scala/Java API is: public static Column regexp_extract(Column e, String exp, int groupIdx)
In [9]:
df.withColumn("date", regexp_replace(col("date"), "/", "-")).show()
+----------+-----+
|      date|month|
+----------+-----+
|2017-01-01|    1|
|2017-02-01|    2|
+----------+-----+

In [14]:
spark.sql(
    """
    select right("abcdefg", 3) 
"""
).show()
+-------------------+
|right('abcdefg', 3)|
+-------------------+
|                efg|
+-------------------+

In [21]:
df.show()
+----------+----+
|      col1|col2|
+----------+----+
|2017/01/01|   1|
|2017/02/01|   2|
|2018/02/05|   3|
|      null|   4|
|     how 	|   5|
+----------+----+

In [23]:
df.filter(col("col1").rlike("\\d{4}/02/\\d{2}")).show()
+----------+----+
|      col1|col2|
+----------+----+
|2017/02/01|   2|
|2018/02/05|   3|
+----------+----+

In [51]:
df.filter(col("col1").rlike(r"\s")).show()
+-----+----+
| col1|col2|
+-----+----+
|how 	|   5|
+-----+----+

In [37]:
df.createOrReplaceTempView("t1")
In [52]:
spark.sql(
    r"""
    select 
        *
    from 
        t1 
    where
        col1 rlike '\\d'
    """
).show()
+----------+----+
|      col1|col2|
+----------+----+
|2017/01/01|   1|
|2017/02/01|   2|
|2018/02/05|   3|
+----------+----+

rtrim

Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively. First, they trim spaces only rather than all whitespace characters by default. Second, when explicitly passing the characters to trim, the 1st parameter is the characters to trim and the 2nd parameter is the string from which to trim characters.

In [7]:
spark.sql(
    """
    select rtrim("abcd\t ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
|     abcd	|
+----------+

In [6]:
spark.sql(
    """
    select rtrim(" \t", "abcd\t ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
|      abcd|
+----------+

21/10/04 20:32:27 WARN Analyzer$ResolveFunctions: Two-parameter TRIM/LTRIM/RTRIM function signatures are deprecated. Use SQL syntax `TRIM((BOTH | LEADING | TRAILING)? trimStr FROM str)` instead.
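
As the warning suggests, the two-parameter form is deprecated; a minimal sketch of the recommended TRIM ... FROM SQL syntax (same SparkSession as above), which should also yield abcd:

spark.sql(
    """
    select trim(TRAILING " \t" FROM "abcd\t ") as after_trim
"""
).show()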
In [8]:
spark.sql(
    """
    select rtrim("a ", "a a abcda a a") as after_ltrim
"""
).show()
+-----------+
|after_ltrim|
+-----------+
|   a a abcd|
+-----------+

split

If there is a trailing separator, then an empty string is generated at the end of the array. It is suggested that you get rid of the trailing separator before applying split to avoid generating an unnecessary empty string (see the sketch after the list below). The benefit of doing this is 2-fold.

  1. Avoid generating unneeded data (empty strings).
  2. Too many empty strings can cause serious data skew issues if the corresponding column is used for joining with another table. By avoiding generating those empty strings, we avoid potential Spark issues from the beginning.
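
A minimal sketch of stripping the trailing separator first (same SparkSession as above; the regular expression ";+$" matches one or more trailing semicolons), which should yield [ab, cd, ef] without a trailing empty string:

spark.sql(
    """
    select split(regexp_replace("ab;cd;ef;", ";+$", ""), ";") as elements
"""
).show()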
In [26]:
spark.sql(
    """
    select split("ab;cd;ef", ";") as elements
"""
).show()
+------------+
|    elements|
+------------+
|[ab, cd, ef]|
+------------+

In [27]:
spark.sql(
    """
    select split("ab;cd;ef;", ";") as elements
"""
).show()
+--------------+
|      elements|
+--------------+
|[ab, cd, ef, ]|
+--------------+

substring

  1. Uses a 1-based index.

  2. substring on null returns null.

In [9]:
import org.apache.spark.sql.functions._

val df = Seq(
    ("2017/01/01", 1),
    ("2017/02/01", 2),
    (null, 3)
).toDF("date", "month")
df.show
+----------+-----+
|      date|month|
+----------+-----+
|2017/01/01|    1|
|2017/02/01|    2|
|      null|    3|
+----------+-----+

In [10]:
df.withColumn("year", substring($"date", 1, 4)).show
+----------+-----+----+
|      date|month|year|
+----------+-----+----+
|2017/01/01|    1|2017|
|2017/02/01|    2|2017|
|      null|    3|null|
+----------+-----+----+

In [11]:
df.withColumn("month", substring($"date", 6, 2)).show
+----------+-----+
|      date|month|
+----------+-----+
|2017/01/01|   01|
|2017/02/01|   02|
|      null| null|
+----------+-----+

In [12]:
df.withColumn("month", substring($"date", 9, 2)).show
+----------+-----+
|      date|month|
+----------+-----+
|2017/01/01|   01|
|2017/02/01|   01|
|      null| null|
+----------+-----+


translate

Notice that translate does character-by-character substitution, which is different from usual substring replacement!
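
A minimal sketch of the difference (same SparkSession as above): translate substitutes individual characters rather than the substring "ab", so every a becomes x and every b becomes y, and the result should be xycx.

spark.sql(
    """
    select translate("abca", "ab", "xy") as translated
"""
).show()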

trim

Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively. First, they trim spaces only rather than all whitespace characters by default. Second, when explicitly passing the characters to trim, the 1st parameter is the characters to trim and the 2nd parameter is the string from which to trim characters.

In [23]:
spark.sql(
    """
    select trim("abcd\t  ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
|     abcd	|
+----------+

In [21]:
spark.sql(
    """
    select trim(" \t", "abcd\t ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
|      abcd|
+----------+
