Ben Chuanlong Du's Blog

It is never too late to learn.

Spark Configuration

In [1]:
import pandas as pd
import findspark

findspark.init("/opt/spark-3.0.1-bin-hadoop3.2/")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType

spark = (
    SparkSession.builder.appName("Spark_Configuration")
    .enableHiveSupport()
    .getOrCreate()
)

Tips and Traps

  1. The Environment tab of the Spark application monitoring UI shows environment variables and Spark configurations. It is helpful if you forget which configurations were set for your Spark application, or if you want to confirm that they are correct.

  2. Please refer to Tips on Spark Configuration to Avoid Issues for suggestions on Spark configurations.

Get SparkConf from a SparkSession Object

In [2]:
conf = spark.sparkContext.getConf()
conf
Out[2]:
<pyspark.conf.SparkConf at 0x7fd375ce7520>
In [18]:
conf.getAll()
Out[18]:
[('spark.driver.port', '40851'),
 ('spark.sql.warehouse.dir', '/opt/spark-3.0.1-bin-hadoop3.2/warehouse'),
 ('spark.driver.extraJavaOptions',
  '-Dderby.system.home=/opt/spark-3.0.1-bin-hadoop3.2/metastore_db'),
 ('spark.app.name', 'Spark_Configuration'),
 ('spark.app.id', 'local-1604190737869'),
 ('spark.executor.id', 'driver'),
 ('spark.sql.catalogImplementation', 'hive'),
 ('spark.driver.host', 'ws-cb963871-b2cd-4e16-b4f8-a606431f8ba4'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true')]
In [9]:
print(conf.toDebugString())
spark.app.id=local-1604190737869
spark.app.name=Spark_Configuration
spark.driver.extraJavaOptions=-Dderby.system.home=/opt/spark-3.0.1-bin-hadoop3.2/metastore_db
spark.driver.host=ws-cb963871-b2cd-4e16-b4f8-a606431f8ba4
spark.driver.port=40851
spark.executor.id=driver
spark.master=local[*]
spark.rdd.compress=True
spark.serializer.objectStreamReset=100
spark.sql.catalogImplementation=hive
spark.sql.warehouse.dir=/opt/spark-3.0.1-bin-hadoop3.2/warehouse
spark.submit.deployMode=client
spark.submit.pyFiles=
spark.ui.showConsoleProgress=true
In [16]:
conf.get("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT")
In [15]:
conf.get("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT")
Both calls above return None (hence no Out cell) because neither key was set for this application; SparkConf.get returns the default value, which is None unless specified otherwise, when a key is absent.

-XX:MaxDirectMemorySize

The JVM option -XX:MaxDirectMemorySize limits the amount of off-heap direct buffer memory the JVM can allocate. It can be passed to Spark executors via spark.executor.extraJavaOptions:

--conf spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=8G

SparkConf.set

SparkConf.setAll

SparkSession vs SparkContext vs SparkConf

