Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Tips and Traps¶
You can use the split function to split a delimited string into an array. It is suggested that you remove trailing separators before applying the split function. Please refer to the split section below for more detailed discussions.

Some string functions (e.g., right) are available in the Spark SQL API but not as Spark DataFrame APIs.

Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively. First, they trim spaces only rather than all whitespace by default. Second, when explicitly passing the characters to trim, the 1st parameter is the characters to trim and the 2nd parameter is the string from which to trim characters.

instr and locate behave similarly to each other except that their parameters are reversed.

Notice that replace is for replacing elements in a column, NOT for replacement inside each string element. To replace a substring with another one in a string, you have to use either regexp_replace or translate.

The operator + does not work as concatenation for string columns. You have to use the function concat instead.
import re

re.search("\\s", "nima ")
<re.Match object; span=(4, 5), match=' '>

s = "\s"

"\s\\s"
'\\s\\s'

"\s" == "\\s"
True

"\n" == "\\n"
False

"\\n"
'\\n'

"\n"
'\n'

import pandas as pd
from pathlib import Path
import findspark
findspark.init(str(next(Path("/opt").glob("spark-3*"))))
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = (
SparkSession.builder.appName("PySpark_Str_Func").enableHiveSupport().getOrCreate()
)
21/10/04 20:31:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/10/04 20:31:39 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
df = spark.createDataFrame(
pd.DataFrame(
data=[
("2017/01/01", 1),
("2017/02/01", 2),
("2018/02/05", 3),
(None, 4),
("how \t", 5),
],
columns=["col1", "col2"],
)
)
df.show()
+----------+----+
| col1|col2|
+----------+----+
|2017/01/01| 1|
|2017/02/01| 2|
|2018/02/05| 3|
| null| 4|
| how | 5|
+----------+----+
The + operator does not work as concatenation for two string columns. It attempts numeric addition instead, which yields null here.
df.withColumn("col", col("date") + col("month")).show()
+----------+-----+----+
| date|month| col|
+----------+-----+----+
|2017/01/01| 1|null|
|2017/02/01| 2|null|
+----------+-----+----+
The function concat concatenates two string columns.
df.withColumn("col", concat(col("date"), col("month"))).show()
+----------+-----+-----------+
| date|month| col|
+----------+-----+-----------+
|2017/01/01| 1|2017/01/011|
|2017/02/01| 2|2017/02/012|
+----------+-----+-----------+
df.withColumn("col", concat(col("date"), lit("_"), col("month"))).show()
+----------+-----+------------+
| date|month| col|
+----------+-----+------------+
|2017/01/01| 1|2017/01/01_1|
|2017/02/01| 2|2017/02/01_2|
+----------+-----+------------+
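Besides concat, Spark also provides concat_ws, which takes the separator as its first argument and skips nulls. The snippet below is a minimal plain-Python sketch of those semantics (the helper name concat_ws here is only an illustration, not the PySpark function itself):

```python
def concat_ws(sep, *vals):
    """Sketch of Spark SQL concat_ws: join non-null values with sep."""
    return sep.join(str(v) for v in vals if v is not None)

print(concat_ws("_", "2017/01/01", 1))  # 2017/01/01_1
print(concat_ws("_", "a", None, "b"))   # a_b
```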
spark.sql(
"""
select instr("abcd", "ab") as index
"""
).show()
+-----+
|index|
+-----+
| 1|
+-----+
spark.sql(
"""
select instr("abcd", "AB") as index
"""
).show()
+-----+
|index|
+-----+
| 0|
+-----+
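As the two examples above show, instr returns the 1-based position of the first occurrence (case-sensitively) and 0 when the substring is absent; locate is the same except its parameters are reversed. A plain-Python sketch of these semantics (the helper names are illustrative):

```python
def instr(s, substr):
    """Sketch of Spark SQL instr: 1-based position of the first match, 0 if absent."""
    return s.find(substr) + 1

def locate(substr, s):
    """Sketch of Spark SQL locate: same as instr with parameters reversed."""
    return instr(s, substr)

print(instr("abcd", "ab"))   # 1
print(instr("abcd", "AB"))   # 0
print(locate("ab", "abcd"))  # 1
```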
spark.sql(
"""
select
left("how are you doing?", 7) as phrase
"""
).show()
+-------+
| phrase|
+-------+
|how are|
+-------+
val df = Seq(
("2017", 1),
("2017/02", 2),
("2018/02/05", 3),
(null, 4)
).toDF("date", "month")
df.show
+----------+-----+
| date|month|
+----------+-----+
| 2017| 1|
| 2017/02| 2|
|2018/02/05| 3|
| null| 4|
+----------+-----+
import org.apache.spark.sql.functions.length
df.select($"date", length($"date")).show
+----------+------------+
| date|length(date)|
+----------+------------+
| 2017| 4|
| 2017/02| 7|
|2018/02/05| 10|
| null| null|
+----------+------------+
ltrim¶
Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively.
First,
they trim spaces only rather than all whitespace by default.
Second,
when explicitly passing the characters to trim,
the 1st parameter is the characters to trim
and the 2nd parameter is the string from which to trim characters.
spark.sql(
"""
select ltrim("a ", "a a abcd") as after_ltrim
"""
).show()
+-----------+
|after_ltrim|
+-----------+
| bcd|
+-----------+
df.withColumn("date", translate($"date", "/", "-")).show
+----------+-----+
| date|month|
+----------+-----+
|2017-01-01| 1|
|2017-02-01| 2|
+----------+-----+
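translate does character-by-character replacement, mapping each character in the matching set to the character at the same position in the replacement set. Python's built-in str.translate behaves the same way, as a rough sketch:

```python
# Map "/" to "-" character by character, like Spark's translate(col, "/", "-").
table = str.maketrans("/", "-")
print("2017/01/01".translate(table))  # 2017-01-01
```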
The signature of regexp_extract:

public static Column regexp_extract(Column e, String exp, int groupIdx)

df.withColumn("date", regexp_replace(col("date"), "/", "-")).show()
+----------+-----+
| date|month|
+----------+-----+
|2017-01-01| 1|
|2017-02-01| 2|
+----------+-----+
spark.sql(
"""
select right("abcdefg", 3)
"""
).show()
+-------------------+
|right('abcdefg', 3)|
+-------------------+
| efg|
+-------------------+
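left and right are available in Spark SQL but not as DataFrame APIs; their behavior is easy to mimic with Python slicing. A sketch under that reading (helper names are illustrative):

```python
def left(s, n):
    """Sketch of Spark SQL left: the first n characters."""
    return s[:n]

def right(s, n):
    """Sketch of Spark SQL right: the last n characters."""
    return s[-n:] if n > 0 else ""

print(left("how are you doing?", 7))  # how are
print(right("abcdefg", 3))            # efg
```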
df.show()
+----------+----+
| col1|col2|
+----------+----+
|2017/01/01| 1|
|2017/02/01| 2|
|2018/02/05| 3|
| null| 4|
| how | 5|
+----------+----+
df.filter(col("col1").rlike("\\d{4}/02/\\d{2}")).show()
+----------+----+
| col1|col2|
+----------+----+
|2017/02/01| 2|
|2018/02/05| 3|
+----------+----+
df.filter(col("col1").rlike(r"\s")).show()
+-----+----+
| col1|col2|
+-----+----+
|how | 5|
+-----+----+
df.createOrReplaceTempView("t1")

spark.sql(
r"""
select
*
from
t1
where
col1 rlike '\\d'
"""
).show()
+----------+----+
| col1|col2|
+----------+----+
|2017/01/01| 1|
|2017/02/01| 2|
|2018/02/05| 3|
+----------+----+
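rlike returns true when the regular expression matches anywhere in the string, and null rows never match. This corresponds to Python's re.search rather than re.fullmatch, as sketched below:

```python
import re

rows = ["2017/01/01", "2017/02/01", "2018/02/05", None, "how \t"]

# Keep rows where the pattern is found anywhere in the string,
# dropping None the way Spark drops null.
matched = [s for s in rows if s is not None and re.search(r"\d{4}/02/\d{2}", s)]
print(matched)  # ['2017/02/01', '2018/02/05']
```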
rtrim¶
spark.sql(
"""
select rtrim("abcd\t ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
| abcd |
+----------+
spark.sql(
"""
select rtrim(" \t", "abcd\t ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
| abcd|
+----------+
21/10/04 20:32:27 WARN Analyzer$ResolveFunctions: Two-parameter TRIM/LTRIM/RTRIM function signatures are deprecated. Use SQL syntax `TRIM((BOTH | LEADING | TRAILING)? trimStr FROM str)` instead.
spark.sql(
    """
    select rtrim("a ", "a a abcda a a") as after_rtrim
    """
).show()
+-----------+
|after_rtrim|
+-----------+
|  a a abcd|
+-----------+
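Python's str.rstrip mirrors the two-argument rtrim: it strips any of the given characters from the right. Note one difference: with no argument, Python strips all whitespace, while Spark's one-argument rtrim strips spaces only. A sketch:

```python
# Strip any of the characters "a" and " " from the right, like rtrim("a ", ...).
print("a a abcda a a".rstrip("a "))  # a a abcd

print(repr("abcd\t ".rstrip(" ")))   # 'abcd\t'  (spaces only, like Spark's default)
print(repr("abcd\t ".rstrip()))      # 'abcd'    (Python's default strips all whitespace)
```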
split¶
If there is a trailing separator,
then an empty string is generated at the end of the array.
It is suggested that you get rid of the trailing separator
before applying split
to avoid generating unnecessary empty strings.
The benefit of doing this is 2-fold.
Avoid generating non-needed data (empty strings).
Too many empty strings can cause serious data skew issues if the corresponding column is used for joining with another table. By avoiding generating those empty strings, we avoid potential Spark issues from the beginning.
spark.sql(
"""
select split("ab;cd;ef", ";") as elements
"""
).show()
+------------+
| elements|
+------------+
|[ab, cd, ef]|
+------------+
spark.sql(
"""
select split("ab;cd;ef;", ";") as elements
"""
).show()
+--------------+
| elements|
+--------------+
|[ab, cd, ef, ]|
+--------------+
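In plain Python the same trap and the suggested fix look like this:

```python
s = "ab;cd;ef;"
print(s.split(";"))              # ['ab', 'cd', 'ef', ''] -- trailing empty string
print(s.rstrip(";").split(";"))  # ['ab', 'cd', 'ef']     -- remove separator first
```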
import org.apache.spark.sql.functions._
val df = Seq(
("2017/01/01", 1),
("2017/02/01", 2),
(null, 3)
).toDF("date", "month")
df.show
+----------+-----+
| date|month|
+----------+-----+
|2017/01/01| 1|
|2017/02/01| 2|
| null| 3|
+----------+-----+
df.withColumn("year", substring($"date", 1, 4)).show
+----------+-----+----+
| date|month|year|
+----------+-----+----+
|2017/01/01| 1|2017|
|2017/02/01| 2|2017|
| null| 3|null|
+----------+-----+----+
df.withColumn("month", substring($"date", 6, 2)).show
+----------+-----+
| date|month|
+----------+-----+
|2017/01/01| 01|
|2017/02/01| 02|
| null| null|
+----------+-----+
df.withColumn("month", substring($"date", 9, 2)).show
+----------+-----+
| date|month|
+----------+-----+
|2017/01/01| 01|
|2017/02/01| 01|
| null| null|
+----------+-----+
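Note that substring in Spark takes a 1-based position, unlike Python slicing. A plain-Python sketch of the semantics (the helper name is illustrative):

```python
def substring(s, pos, length):
    """Sketch of Spark SQL substring: pos is 1-based."""
    return s[pos - 1 : pos - 1 + length]

date = "2017/01/01"
print(substring(date, 1, 4))  # 2017
print(substring(date, 6, 2))  # 01
print(substring(date, 9, 2))  # 01
```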
trim¶
spark.sql(
"""
select trim("abcd\t ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
| abcd |
+----------+
spark.sql(
"""
select trim(" \t", "abcd\t ") as after_trim
"""
).show()
+----------+
|after_trim|
+----------+
| abcd|
+----------+
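The same caveat in plain Python: Spark's one-argument trim strips spaces only, which corresponds to str.strip(" "), not Python's default str.strip():

```python
print(repr("abcd\t ".strip(" ")))    # 'abcd\t'  (spaces only, like Spark's trim)
print(repr("abcd\t ".strip(" \t")))  # 'abcd'    (explicit characters, like trim(" \t", ...))
```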