Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Tips and Traps¶
You can use the split function to split a delimited string into an array. It is suggested that you remove trailing separators before applying the split function. Please refer to the split section below for more detailed discussions.

Some string functions (e.g., right) are available in the Spark SQL API but not as Spark DataFrame APIs.

Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively. First, they trim spaces only rather than all whitespace by default. Second, when explicitly passing the characters to trim, the 1st parameter is the characters to trim and the 2nd parameter is the string from which to trim characters.

instr and locate behave similarly to each other except that their parameters are reversed.

Notice that replace is for replacing elements in a column, NOT for replacement inside each string element. To replace a substring with another one in a string, you have to use either regexp_replace or translate.

The operator + does not work as concatenation for string columns. You have to use the function concat instead.
import re

re.search("\\s", "nima ")
<re.Match object; span=(4, 5), match=' '>

s = "\s"

"\s\\s"
'\\s\\s'

"\s" == "\\s"
True

"\n" == "\\n"
False

"\\n"
'\\n'

"\n"
'\n'

import pandas as pd
from pathlib import Path
import findspark
findspark.init(str(next(Path("/opt").glob("spark-3*"))))
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = (
SparkSession.builder.appName("PySpark_Str_Func").enableHiveSupport().getOrCreate()
)
21/10/04 20:31:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/10/04 20:31:39 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
df = spark.createDataFrame(
pd.DataFrame(
data=[
("2017/01/01", 1),
("2017/02/01", 2),
("2018/02/05", 3),
(None, 4),
("how \t", 5),
],
columns=["col1", "col2"],
)
)
df.show()
+----------+----+
| col1|col2|
+----------+----+
|2017/01/01| 1|
|2017/02/01| 2|
|2018/02/05| 3|
| null| 4|
| how | 5|
+----------+----+
The + operator does not work as concatenation for two string columns. It attempts numeric addition instead, which yields null here.
df.withColumn("col", col("date") + col("month")).show()
+----------+-----+----+
| date|month| col|
+----------+-----+----+
|2017/01/01| 1|null|
|2017/02/01| 2|null|
+----------+-----+----+
The function concat concatenates two string columns.
df.withColumn("col", concat(col("date"), col("month"))).show()
+----------+-----+-----------+
| date|month| col|
+----------+-----+-----------+
|2017/01/01| 1|2017/01/011|
|2017/02/01| 2|2017/02/012|
+----------+-----+-----------+
df.withColumn("col", concat(col("date"), lit("_"), col("month"))).show()
+----------+-----+------------+
| date|month| col|
+----------+-----+------------+
|2017/01/01| 1|2017/01/01_1|
|2017/02/01| 2|2017/02/01_2|
+----------+-----+------------+
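Besides concat, Spark also provides concat_ws, which takes the separator as its first argument and skips nulls. The snippet below is a minimal plain-Python sketch of those semantics (the helper name concat_ws here is only an illustration, not the PySpark function itself):

```python
def concat_ws(sep, *vals):
    """Sketch of Spark SQL concat_ws: join non-null values with sep."""
    return sep.join(str(v) for v in vals if v is not None)

print(concat_ws("_", "2017/01/01", 1))  # 2017/01/01_1
print(concat_ws("_", "a", None, "b"))   # a_b
```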
spark.sql(
"""
select instr("abcd", "ab") as index
"""
).show()
+-----+
|index|
+-----+
| 1|
+-----+
spark.sql(
"""
select instr("abcd", "AB") as index
"""
).show()
+-----+
|index|
+-----+
| 0|
+-----+
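As the two examples above show, instr returns the 1-based position of the first occurrence (case-sensitively) and 0 when the substring is absent; locate is the same except its parameters are reversed. A plain-Python sketch of these semantics (the helper names are illustrative):

```python
def instr(s, substr):
    """Sketch of Spark SQL instr: 1-based position of the first match, 0 if absent."""
    return s.find(substr) + 1

def locate(substr, s):
    """Sketch of Spark SQL locate: same as instr with parameters reversed."""
    return instr(s, substr)

print(instr("abcd", "ab"))   # 1
print(instr("abcd", "AB"))   # 0
print(locate("ab", "abcd"))  # 1
```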
spark.sql(
"""
select
left("how are you doing?", 7) as phrase
"""
).show()
+-------+
| phrase|
+-------+
|how are|
+-------+
val df = Seq(
("2017", 1),
("2017/02", 2),
("2018/02/05", 3),
(null, 4)
).toDF("date", "month")
df.show
+----------+-----+
| date|month|
+----------+-----+
| 2017| 1|
| 2017/02| 2|
|2018/02/05| 3|
| null| 4|
+----------+-----+
import org.apache.spark.sql.functions.length
df.select($"date", length($"date")).show
+----------+------------+
| date|length(date)|
+----------+------------+
| 2017| 4|
| 2017/02| 7|
|2018/02/05| 10|
| null| null|
+----------+------------+
ltrim¶
Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively.
First,
they trim spaces only rather than all whitespace by default.
Second,
when explicitly passing the characters to trim,
the 1st parameter is the characters to trim
and the 2nd parameter is the string from which to trim characters.
spark.sql(
"""
select ltrim("a ", "a a abcd") as after_ltrim
"""
).show()
+-----------+
|after_ltrim|
+-----------+
| bcd|
+-----------+
df.withColumn("date", translate($"date", "/", "-")).show
+----------+-----+
| date|month|
+----------+-----+
|2017-01-01| 1|
|2017-02-01| 2|
+----------+-----+
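translate does character-by-character replacement, mapping each character in the matching set to the character at the same position in the replacement set. Python's built-in str.translate behaves the same way, as a rough sketch:

```python
# Map "/" to "-" character by character, like Spark's translate(col, "/", "-").
table = str.maketrans("/", "-")
print("2017/01/01".translate(table))  # 2017-01-01
```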
The signature of regexp_extract:

public static Column regexp_extract(Column e, String exp, int groupIdx)

df.withColumn("date", regexp_replace(col("date"), "/", "-")).show()
+----------+-----+
| date|month|
+----------+-----+
|2017-01-01| 1|
|2017-02-01| 2|
+----------+-----+
spark.sql(
"""
select right("abcdefg", 3)
"""
).show()
+-------------------+
|right('abcdefg', 3)|
+-------------------+
| efg|
+-------------------+
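left and right are available in Spark SQL but not as DataFrame APIs; their behavior is easy to mimic with Python slicing. A sketch under that reading (helper names are illustrative):

```python
def left(s, n):
    """Sketch of Spark SQL left: the first n characters."""
    return s[:n]

def right(s, n):
    """Sketch of Spark SQL right: the last n characters."""
    return s[-n:] if n > 0 else ""

print(left("how are you doing?", 7))  # how are
print(right("abcdefg", 3))            # efg
```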
df.show()
+----------+----+
| col1|col2|
+----------+----+
|2017/01/01| 1|
|2017/02/01| 2|
|2018/02/05| 3|
| null| 4|
| how | 5|
+----------+----+
df.filter(col("col1").rlike("\\d{4}/02/\\d{2}")).show()
+----------+----+
| col1|col2|
+----------+----+
|2017/02/01| 2|
|2018/02/05| 3|
+----------+----+
df.filter(col("col1").rlike(r"\s")).show()
+-----+----+
| col1|col2|
+-----+----+
|how | 5|
+-----+----+
df.createOrReplaceTempView("t1")

spark.sql(
r"""
select
*
from
t1
where
col1 rlike '\\d'
"""
).show()
+----------+----+
| col1|col2|
+----------+----+
|2017/01/01| 1|
|2017/02/01| 2|
|2018/02/05| 3|
+----------+----+
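rlike returns true when the regular expression matches anywhere in the string, and null rows never match. This corresponds to Python's re.search rather than re.fullmatch, as sketched below:

```python
import re

rows = ["2017/01/01", "2017/02/01", "2018/02/05", None, "how \t"]

# Keep rows where the pattern is found anywhere in the string,
# dropping None the way Spark drops null.
matched = [s for s in rows if s is not None and re.search(r"\d{4}/02/\d{2}", s)]
print(matched)  # ['2017/02/01', '2018/02/05']
```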
rtrim¶
spark.sql(
"""
select rtrim("abcd\t ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
| abcd |
+----------+
spark.sql(
"""
select rtrim(" \t", "abcd\t ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
| abcd|
+----------+
21/10/04 20:32:27 WARN Analyzer$ResolveFunctions: Two-parameter TRIM/LTRIM/RTRIM function signatures are deprecated. Use SQL syntax `TRIM((BOTH | LEADING | TRAILING)? trimStr FROM str)` instead.
spark.sql(
    """
    select rtrim("a ", "a a abcda a a") as after_rtrim
    """
).show()
+-----------+
|after_rtrim|
+-----------+
|  a a abcd|
+-----------+
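Python's str.rstrip mirrors the two-argument rtrim: it strips any of the given characters from the right. Note one difference: with no argument, Python strips all whitespace, while Spark's one-argument rtrim strips spaces only. A sketch:

```python
# Strip any of the characters "a" and " " from the right, like rtrim("a ", ...).
print("a a abcda a a".rstrip("a "))  # a a abcd

print(repr("abcd\t ".rstrip(" ")))   # 'abcd\t'  (spaces only, like Spark's default)
print(repr("abcd\t ".rstrip()))      # 'abcd'    (Python's default strips all whitespace)
```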
split¶
If there is a trailing separator,
then an empty string is generated at the end of the array.
It is suggested that you get rid of the trailing separator
before applying split
to avoid generating unnecessary empty strings.
The benefit of doing this is 2-fold.
Avoid generating non-needed data (empty strings).
Too many empty strings can cause serious data skew issues if the corresponding column is used for joining with another table. By avoiding generating those empty strings, we avoid potential Spark issues from the beginning.
spark.sql(
"""
select split("ab;cd;ef", ";") as elements
"""
).show()
+------------+
| elements|
+------------+
|[ab, cd, ef]|
+------------+
spark.sql(
"""
select split("ab;cd;ef;", ";") as elements
"""
).show()
+--------------+
| elements|
+--------------+
|[ab, cd, ef, ]|
+--------------+
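In plain Python the same trap and the suggested fix look like this:

```python
s = "ab;cd;ef;"
print(s.split(";"))              # ['ab', 'cd', 'ef', ''] -- trailing empty string
print(s.rstrip(";").split(";"))  # ['ab', 'cd', 'ef']     -- remove separator first
```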
import org.apache.spark.sql.functions._
val df = Seq(
("2017/01/01", 1),
("2017/02/01", 2),
(null, 3)
).toDF("date", "month")
df.show
+----------+-----+
| date|month|
+----------+-----+
|2017/01/01| 1|
|2017/02/01| 2|
| null| 3|
+----------+-----+
df.withColumn("year", substring($"date", 1, 4)).show
+----------+-----+----+
| date|month|year|
+----------+-----+----+
|2017/01/01| 1|2017|
|2017/02/01| 2|2017|
| null| 3|null|
+----------+-----+----+
df.withColumn("month", substring($"date", 6, 2)).show
+----------+-----+
| date|month|
+----------+-----+
|2017/01/01| 01|
|2017/02/01| 02|
| null| null|
+----------+-----+
df.withColumn("month", substring($"date", 9, 2)).show
+----------+-----+
| date|month|
+----------+-----+
|2017/01/01| 01|
|2017/02/01| 01|
| null| null|
+----------+-----+
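Note that substring in Spark takes a 1-based position, unlike Python slicing. A plain-Python sketch of the semantics (the helper name is illustrative):

```python
def substring(s, pos, length):
    """Sketch of Spark SQL substring: pos is 1-based."""
    return s[pos - 1 : pos - 1 + length]

date = "2017/01/01"
print(substring(date, 1, 4))  # 2017
print(substring(date, 6, 2))  # 01
print(substring(date, 9, 2))  # 01
```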
trim¶
spark.sql(
"""
select trim("abcd\t ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
| abcd |
+----------+
spark.sql(
"""
select trim(" \t", "abcd\t ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
| abcd|
+----------+
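The same caveat in plain Python: Spark's one-argument trim strips spaces only, which corresponds to str.strip(" "), not Python's default str.strip():

```python
print(repr("abcd\t ".strip(" ")))    # 'abcd\t'  (spaces only, like Spark's trim)
print(repr("abcd\t ".strip(" \t")))  # 'abcd'    (explicit characters, like trim(" \t", ...))
```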