Case of Column Names in Spark DataFrames

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Even though the Spark DataFrame/SQL APIs do not distinguish the case of column names by default, the column names written to HDFS (or any other storage) are case-sensitive!

interp.load.ivy("org.apache.spark" % "spark-core_2.12" % "3.0.0")
interp.load.ivy("org.apache.spark" % "spark-sql_2.12" % "3.0.0")
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Spark Column Example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()

import spark.implicits._
import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions._ spark: SparkSession = org.apache.spark.sql.SparkSession@4e94577b import spark.implicits._

Create a Spark DataFrame whose column names are in lower case.

val df = Seq(
    (1L, "a", "foo", 3.0),
    (2L, "b", "bar", 4.0),
    (3L, "c", "foo", 5.0),
    (4L, "d", "bar", 7.0)
).toDF("col1", "col2", "col3", "col4")
df.show
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   a| foo| 3.0|
|   2|   b| bar| 4.0|
|   3|   c| foo| 5.0|
|   4|   d| bar| 7.0|
+----+----+----+----+

df: org.apache.spark.sql.package.DataFrame = [col1: bigint, col2: string ... 2 more fields]

Even though the column names of the DataFrame are in lower case, you can access them case-insensitively through the Spark DataFrame/SQL APIs. This behavior is controlled by the configuration option spark.sql.caseSensitive, which defaults to false.

df.select("col1", "COL2", "Col3").show
+----+----+----+
|col1|COL2|Col3|
+----+----+----+
|   1|   a| foo|
|   2|   b| bar|
|   3|   c| foo|
|   4|   d| bar|
+----+----+----+
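Conceptually, the resolution shown above is just a lookup that ignores case when comparing a requested name against the schema. The following plain-Python sketch illustrates the idea; the function name and error handling are illustrative, not Spark's actual implementation:

```python
def resolve_column(columns, name):
    """Resolve `name` against `columns`, ignoring case.

    Mimics Spark's default behavior (spark.sql.caseSensitive=false):
    a requested name matches a schema column if the two are equal
    after lower-casing both sides.
    """
    matches = [c for c in columns if c.lower() == name.lower()]
    if not matches:
        raise KeyError(f"Cannot resolve column name {name!r}")
    return matches[0]

columns = ["col1", "col2", "col3", "col4"]
print(resolve_column(columns, "COL2"))  # col2
```

With spark.sql.caseSensitive set to true, the comparison would instead be an exact string match, and "COL2" would fail to resolve.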

The case of column names is preserved when you write a Spark DataFrame to disk.

df.select("col1", "COL2", "Col3").write.mode("overwrite").parquet("/tmp/df")
%%python

import pandas as pd
pd.read_parquet("/tmp/df")