Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Comment¶
You can use withColumnRenamed to rename a column in a DataFrame.
You can also do renaming using alias when select columns.
import pandas as pdfrom pathlib import Path
import findspark
# findspark.init(str(next(Path("/opt").glob("spark-3*"))))
findspark.init("/opt/spark-3.0.1-bin-hadoop3.2/")
# findspark.init("/opt/spark-2.3.0-bin-hadoop2.7")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import (
IntegerType,
StringType,
StructType,
StructField,
ArrayType,
)
spark = (
SparkSession.builder.appName("PySpark_Str_Func").enableHiveSupport().getOrCreate()
)df = spark.createDataFrame(
pd.DataFrame(
data=(
(1, "a", "foo", 3.0),
(1, "b", "bar", 4.0),
(3, "c", "foo", 5.0),
(4, "d", "bar", 7.0),
),
columns=["_0", "_1", "_2", "_3"],
)
)
df.show()+---+---+---+---+
| _0| _1| _2| _3|
+---+---+---+---+
| 1| a|foo|3.0|
| 1| b|bar|4.0|
| 3| c|foo|5.0|
| 4| d|bar|7.0|
+---+---+---+---+
Get Column Name¶
There is no good way to get the name of a Column, unfortunately.
Column._jc.toString()
df._0Column<b'_0'>str(df._0)"Column<b'_0'>"df._0._jc.toString()'_0'c = max(df._0).alias("_0_max")
cColumn<b'max(_0) AS `_0_max`'>c._jc.toString()'max(_0) AS `_0_max`'c.alias()Column<b'multialias(max(_0) AS `_0_max`)'>str(c)"Column<b'max(_0) AS `_0_max`'>"Drop Columns¶
df.drop("_1", "_3").show20/07/19 14:27:45 INFO CodeGenerator: Code generated in 22.343436 ms
20/07/19 14:27:45 INFO CodeGenerator: Code generated in 25.541113 ms
+---+---+
| _2| _4|
+---+---+
| a|3.0|
| b|4.0|
| c|5.0|
| d|7.0|
+---+---+
Renaming One Column Using withColumnRenamed¶
df.withColumnRenamed("_1", "x1").show+---+---+---+---+
| x1| _2| _3| _4|
+---+---+---+---+
| 1| a|foo|3.0|
| 2| b|bar|4.0|
| 3| c|foo|5.0|
| 4| d|bar|7.0|
+---+---+---+---+
Renaming One Column Using alias¶
df.select(
$"_1".alias("x1"),
$"_2",
$"_3",
$"_4"
).show+---+---+---+---+
| x1| _2| _3| _4|
+---+---+---+---+
| 1| a|foo|3.0|
| 2| b|bar|4.0|
| 3| c|foo|5.0|
| 4| d|bar|7.0|
+---+---+---+---+
Batch Renaming Using withColumnRenamed¶
val lookup = Map(
"_1" -> "x1",
"_2" -> "x2",
"_3" -> "x3",
"_4" -> "x4"
)Loading...
lookup.foldLeft(df) {
(acc, ca) => acc.withColumnRenamed(ca._1, ca._2)
}.show+---+---+---+---+
| x1| x2| x3| x4|
+---+---+---+---+
| 1| a|foo|3.0|
| 2| b|bar|4.0|
| 3| c|foo|5.0|
| 4| d|bar|7.0|
+---+---+---+---+
Batch Renaming Using alias¶
df.select(df.columns.map(c => col(c).alias(lookup.getOrElse(c, c))): _*).show+---+---+---+---+
| x1| x2| x3| x4|
+---+---+---+---+
| 1| a|foo|3.0|
| 2| b|bar|4.0|
| 3| c|foo|5.0|
| 4| d|bar|7.0|
+---+---+---+---+