You can build a portable Python environment by following the steps below (a shell sketch follows the list).
- Install python-build-standalone.
- Install Python packages using the pip of the installed python-build-standalone distribution.
- Pack the whole python-build-standalone directory into a compressed file, e.g., env.tar.gz.
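The sketch below illustrates the three steps on Linux. The release URL, Python version, and the pandas package are assumptions for illustration only; pick a python-build-standalone release matching your cluster's OS and architecture.

# Step 1: download and unpack a python-build-standalone release into ./env
# (the URL/version here is an assumption -- choose one from
# https://github.com/indygreg/python-build-standalone/releases).
curl -sSL -o python.tar.gz \
    https://github.com/indygreg/python-build-standalone/releases/download/20220318/cpython-3.9.11%2B20220318-x86_64-unknown-linux-gnu-install_only.tar.gz
mkdir -p env
tar -xzf python.tar.gz -C env --strip-components=1

# Make sure bin/python exists, since the submit script below references
# ./env/bin/python (some releases ship only bin/python3).
ln -sf python3 env/bin/python

# Step 2: install the packages your PySpark job needs with the
# distribution's own pip (pandas is just an example).
./env/bin/python -m pip install pandas

# Step 3: pack the whole directory into env.tar.gz; the archive contents sit
# at the top level so that YARN unpacks them directly into ./env.
tar -czf env.tar.gz -C env .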
The GitHub repo dclong/python-portable has good examples of building portable Python environments, leveraging the Docker image dclong/python-portable (which has python-build-standalone installed).
Submit a PySpark Application Using a Portable Python Environment
Below is an example shell script for submitting a PySpark job using a pre-built portable Python environment named env.tar.gz. The option --archives env.tar.gz#env tells YARN to ship the archive and unpack it into a directory named env in each container's working directory, which is why both spark.pyspark.python and spark.pyspark.driver.python point to ./env/bin/python (in cluster deploy mode the driver also runs in a YARN container).
/apache/spark2.3/bin/spark-submit \
--files "file:///apache/hive/conf/hive-site.xml,file:///apache/hadoop/etc/hadoop/ssl-client.xml,file:///apache/hadoop/etc/hadoop/hdfs-site.xml,file:///apache/hadoop/etc/hadoop/core-site.xml,file:///apache/hadoop/etc/hadoop/federation-mapping.xml" \
--master yarn \
--deploy-mode cluster \
--queue YOUR_QUEUE \
--num-executors 200 \
--executor-memory 10G \
--driver-memory 15G \
--executor-cores 4 \
--conf spark.yarn.maxAppAttempts=2 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=1000 \
--conf spark.network.timeout=300s \
--conf spark.executor.memoryOverhead=2G \
--conf spark.pyspark.driver.python=./env/bin/python \
--conf spark.pyspark.python=./env/bin/python \
--archives env.tar.gz#env \
"$1"
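Assuming the script above is saved as submit.sh (the file name is an arbitrary choice for illustration), the example application below can be submitted with:

bash submit.sh _pyspark.py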
And below is a simple example of _pyspark.py, the application script passed to spark-submit as $1.
from pyspark.sql import SparkSession

# Create a SparkSession with Hive support so that Hive tables are accessible.
spark = SparkSession.builder.appName('Test PySpark').enableHiveSupport().getOrCreate()

# Sample 100,000 rows from the table and write them out as Parquet.
sql = """
SELECT *
FROM some_table
TABLESAMPLE (100000 ROWS)
"""
spark.sql(sql).write.mode("overwrite").parquet("output")
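To confirm that the job actually runs on the shipped environment instead of a system Python, a quick check like the minimal sketch below can help. sys.executable and the RDD-based probe are standard Python/PySpark facilities; the app name is a placeholder.

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Check Portable Python').getOrCreate()

# Python interpreter used by the driver; it should point into ./env/bin
# when the portable environment is picked up.
print(sys.executable)

# Python interpreter used by the executors: run a trivial task on two
# partitions and report each task's sys.executable.
print(spark.sparkContext.parallelize(range(2), 2).map(lambda _: sys.executable).collect())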