Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Comments
Even though Spark DataFrame/SQL APIs do not distinguish the case of column names when resolving them, column names are written to HDFS with their case preserved, so downstream tools may treat them case-sensitively!
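Concretely, Spark's analyzer resolves column references case-insensitively by default (this is governed by the `spark.sql.caseSensitive` configuration, which defaults to `false`), while the names themselves keep their original case in the schema. A minimal pure-Python sketch of this resolution rule (the `resolve` helper below is a hypothetical illustration, not a Spark API):

```python
def resolve(columns, name, case_sensitive=False):
    """Return the stored column name matching `name`.

    Mimics how a case-insensitive analyzer resolves a column
    reference: the lookup folds case, but the stored name keeps
    its original spelling. Hypothetical helper, not a Spark API.
    """
    if case_sensitive:
        matches = [c for c in columns if c == name]
    else:
        matches = [c for c in columns if c.lower() == name.lower()]
    if not matches:
        raise KeyError(name)
    return matches[0]

cols = ["col1", "col2", "col3", "col4"]
print(resolve(cols, "COL2"))  # case-insensitive lookup finds "col2"
```

With `case_sensitive=True` the same lookup of `"COL2"` fails, which is analogous to what happens in Spark when `spark.sql.caseSensitive` is enabled.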
interp.load.ivy("org.apache.spark" % "spark-core_2.12" % "3.0.0")
interp.load.ivy("org.apache.spark" % "spark-sql_2.12" % "3.0.0")
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Spark Column Example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
import spark.implicits._
Create a Spark DataFrame whose column names are in lower case.
val df = Seq(
    (1L, "a", "foo", 3.0),
    (2L, "b", "bar", 4.0),
    (3L, "c", "foo", 5.0),
    (4L, "d", "bar", 7.0)
).toDF("col1", "col2", "col3", "col4")
df.show
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| a| foo| 3.0|
| 2| b| bar| 4.0|
| 3| c| foo| 5.0|
| 4| d| bar| 7.0|
+----+----+----+----+
df: org.apache.spark.sql.package.DataFrame = [col1: bigint, col2: string ... 2 more fields]
Even though the column names of the DataFrame are in lower case, they are case-insensitive when you access them using Spark DataFrame/SQL APIs.
df.select("col1", "COL2", "Col3").show
+----+----+----+
|col1|COL2|Col3|
+----+----+----+
| 1| a| foo|
| 2| b| bar|
| 3| c| foo|
| 4| d| bar|
+----+----+----+
The case of column names is preserved when you write a Spark DataFrame to disk. Notice that the columns written below are named "col1", "COL2", and "Col3", matching the case used in the select.
df.select("col1", "COL2", "Col3").write.mode("overwrite").parquet("/tmp/df")
%%python
import pandas as pd
pd.read_parquet("/tmp/df")
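Because Parquet stores the exact column names, tools that read the files back match those names case-sensitively. A small pandas sketch of the consequence (building an equivalent frame in memory rather than reading `/tmp/df`, so it runs standalone):

```python
import pandas as pd

# The same column names the Spark job wrote: exact case is part of the name.
pdf = pd.DataFrame({"col1": [1, 2], "COL2": ["a", "b"], "Col3": ["foo", "bar"]})

print(pdf["COL2"].tolist())  # exact-case lookup works
# pdf["col2"] would raise KeyError: pandas does not fold case.
```

So a pipeline that writes with mixed-case column names and later reads them back with a case-sensitive tool must use the exact spelling that was written.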