Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Comments
Use the higher-level, standard Column-based functions with Dataset operators whenever possible before falling back to your own custom UDFs. UDFs are a black box to Spark, so it does not even try to optimize them.
interp.load.ivy("org.apache.spark" %% "spark-core" % "3.0.0")
interp.load.ivy("org.apache.spark" %% "spark-sql" % "3.0.0")

import org.apache.spark.sql.SparkSession
val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Spark UDF Examples")
    .getOrCreate()
import spark.implicits._

20/09/07 11:51:53 WARN SparkSession$Builder: Using an existing SparkSession; some spark core configurations may not take effect.
import org.apache.spark.sql.SparkSession
spark: SparkSession = org.apache.spark.sql.SparkSession@72b56dce
import spark.implicits._

val df = Seq(
    (0, "hello"),
    (1, "world")
).toDF("id", "text")
df.show

20/09/07 11:43:35 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/workdir/archives/blog/misc/content/spark-warehouse/').
20/09/07 11:43:35 INFO SharedState: Warehouse path is 'file:/workdir/archives/blog/misc/content/spark-warehouse/'.
20/09/07 11:43:37 INFO CodeGenerator: Code generated in 405.037 ms
20/09/07 11:43:38 INFO CodeGenerator: Code generated in 17.1985 ms
20/09/07 11:43:38 INFO CodeGenerator: Code generated in 23.8635 ms
+---+-----+
| id| text|
+---+-----+
|  0|hello|
|  1|world|
+---+-----+
df: org.apache.spark.sql.package.DataFrame = [id: int, text: string]

import org.apache.spark.sql.functions.udf
val upper: String => String = _.toUpperCase
val upperUDF = udf(upper)

UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

df.withColumn("upper", upperUDF($"text")).show

+---+-----+-----+
| id| text|upper|
+---+-----+-----+
|  0|hello|HELLO|
|  1|world|WORLD|
+---+-----+-----+
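
For comparison with the advice under Comments, here is a minimal sketch that produces the same column with Spark's built-in `upper` function instead of a UDF, then prints both physical plans. It assumes the Spark 3.0.0 dependencies loaded earlier on this page. The `functions` object is imported under the alias `F` (my choice, not from the original) so it does not clash with the `val upper` defined above, and the names `withUdf` and `withBuiltin` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{functions => F}

val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Built-in vs UDF sketch")
    .getOrCreate()
import spark.implicits._

// Same sample data as on this page.
val df = Seq((0, "hello"), (1, "world")).toDF("id", "text")

// UDF version: the optimizer only sees an opaque function call.
val upperUDF = F.udf((s: String) => s.toUpperCase)
val withUdf = df.withColumn("upper", upperUDF($"text"))

// Built-in version: F.upper is a native Column expression Spark understands.
val withBuiltin = df.withColumn("upper", F.upper($"text"))

// Both give the same rows; the printed plans show how each is executed.
withUdf.explain()
withBuiltin.explain()
```

The rows are identical for this data, so the only thing worth inspecting is the plan output, where the UDF shows up as an opaque call rather than a native expression.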
val someUDF = udf((arg1: Long, arg2: Long) => {
    arg1 + arg2
})

UserDefinedFunction(<function2>,LongType,Some(List(LongType, LongType)))

Map vs UDF
https://
https://
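
As a rough illustration of the trade-off this heading names, the sketch below computes the same sum twice: once with a two-argument UDF shaped like `someUDF` above, and once with a typed `map` over the Dataset. The sample data and the names `pairs`, `addUDF`, `viaUdf` and `viaMap` are hypothetical, and the comparison reflects my reading of the two APIs rather than anything stated in the linked pages.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Map vs UDF sketch")
    .getOrCreate()
import spark.implicits._

// Hypothetical sample data with two Long columns.
val pairs = Seq((1L, 2L), (3L, 4L)).toDF("a", "b")

// UDF route: stays in the untyped DataFrame API and only appends one column.
val addUDF = udf((arg1: Long, arg2: Long) => arg1 + arg2)
val viaUdf = pairs.withColumn("sum", addUDF($"a", $"b"))

// map route: typed Dataset API; each output row is rebuilt in full by hand.
val viaMap = pairs
    .as[(Long, Long)]
    .map { case (a, b) => (a, b, a + b) }
    .toDF("a", "b", "sum")

viaUdf.show()
viaMap.show()
```

Both routes run opaque JVM code as far as the optimizer is concerned. The UDF touches only the columns it is given, while `map` converts every row to a Scala object and back, so the UDF is usually the lighter choice when only one column needs to be derived.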