Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Rename and Drop Columns in Spark DataFrames

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Comment

You can use withColumnRenamed to rename a column in a DataFrame. You can also do renaming using alias when select columns.

import pandas as pd
from pathlib import Path
import findspark

# findspark.init(str(next(Path("/opt").glob("spark-3*"))))
findspark.init("/opt/spark-3.0.1-bin-hadoop3.2/")
# findspark.init("/opt/spark-2.3.0-bin-hadoop2.7")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import (
    IntegerType,
    StringType,
    StructType,
    StructField,
    ArrayType,
)

spark = (
    SparkSession.builder.appName("PySpark_Str_Func").enableHiveSupport().getOrCreate()
)
df = spark.createDataFrame(
    pd.DataFrame(
        data=(
            (1, "a", "foo", 3.0),
            (1, "b", "bar", 4.0),
            (3, "c", "foo", 5.0),
            (4, "d", "bar", 7.0),
        ),
        columns=["_0", "_1", "_2", "_3"],
    )
)
df.show()
+---+---+---+---+
| _0| _1| _2| _3|
+---+---+---+---+
|  1|  a|foo|3.0|
|  1|  b|bar|4.0|
|  3|  c|foo|5.0|
|  4|  d|bar|7.0|
+---+---+---+---+

Get Column Name

  1. There is no good way to get the name of a Column, unfortunately.

  2. Column._jc.toString()

df._0
Column<b'_0'>
str(df._0)
"Column<b'_0'>"
df._0._jc.toString()
'_0'
c = max(df._0).alias("_0_max")
c
Column<b'max(_0) AS `_0_max`'>
c._jc.toString()
'max(_0) AS `_0_max`'
c.alias()
Column<b'multialias(max(_0) AS `_0_max`)'>
str(c)
"Column<b'max(_0) AS `_0_max`'>"

Drop Columns

df.drop("_1", "_3").show
20/07/19 14:27:45 INFO CodeGenerator: Code generated in 22.343436 ms
20/07/19 14:27:45 INFO CodeGenerator: Code generated in 25.541113 ms
+---+---+
| _2| _4|
+---+---+
|  a|3.0|
|  b|4.0|
|  c|5.0|
|  d|7.0|
+---+---+

Renaming One Column Using withColumnRenamed

df.withColumnRenamed("_1", "x1").show
+---+---+---+---+
| x1| _2| _3| _4|
+---+---+---+---+
|  1|  a|foo|3.0|
|  2|  b|bar|4.0|
|  3|  c|foo|5.0|
|  4|  d|bar|7.0|
+---+---+---+---+

Renaming One Column Using alias

df.select(
    $"_1".alias("x1"),
    $"_2",
    $"_3",
    $"_4"
).show
+---+---+---+---+
| x1| _2| _3| _4|
+---+---+---+---+
|  1|  a|foo|3.0|
|  2|  b|bar|4.0|
|  3|  c|foo|5.0|
|  4|  d|bar|7.0|
+---+---+---+---+

Batch Renaming Using withColumnRenamed

val lookup = Map(
    "_1" -> "x1",
    "_2" -> "x2",
    "_3" -> "x3",
    "_4" -> "x4"
)
Loading...
lookup.foldLeft(df) {
    (acc, ca) => acc.withColumnRenamed(ca._1, ca._2)
}.show
+---+---+---+---+
| x1| x2| x3| x4|
+---+---+---+---+
|  1|  a|foo|3.0|
|  2|  b|bar|4.0|
|  3|  c|foo|5.0|
|  4|  d|bar|7.0|
+---+---+---+---+

Batch Renaming Using alias

df.select(df.columns.map(c => col(c).alias(lookup.getOrElse(c, c))): _*).show
+---+---+---+---+
| x1| x2| x3| x4|
+---+---+---+---+
|  1|  a|foo|3.0|
|  2|  b|bar|4.0|
|  3|  c|foo|5.0|
|  4|  d|bar|7.0|
+---+---+---+---+