python-build-standalone is a better alternative than conda-pack for managing Python dependencies for PySpark. Please refer to Packaging Python Dependencies for PySpark Using python-build-standalone for a tutorial on using python-build-standalone with PySpark.
Build Portable Python Environments Using conda-pack
Please refer to the GitHub repo dclong/conda_environ for instructions on leveraging the Docker image dclong/conda to build portable conda environments. A minimal conda-pack workflow is also sketched below.
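If you prefer to build the environment directly on a compatible Linux machine, the following is a minimal sketch of a typical conda-pack workflow. The environment name env, the Python version, and the packages installed are only illustrative; replace them with whatever your job actually needs.

# Create a conda environment and install the Python packages your job needs.
# The environment name "env" and the packages below are only examples.
conda create -y -n env python=3.8
conda activate env
pip install pandas pyarrow

# Install conda-pack and archive the environment into env.tar.gz,
# which can then be shipped to the cluster via --archives.
pip install conda-pack
conda pack -n env -o env.tar.gz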
Submit a PySpark Application Using a conda Environment
Below is an example shell script for submitting a PySpark job using a pre-built conda-pack Python environment named env.tar.gz.
/apache/spark2.3/bin/spark-submit \
--files "file:///apache/hive/conf/hive-site.xml,file:///apache/hadoop/etc/hadoop/ssl-client.xml,file:///apache/hadoop/etc/hadoop/hdfs-site.xml,file:///apache/hadoop/etc/hadoop/core-site.xml,file:///apache/hadoop/etc/hadoop/federation-mapping.xml" \
--master yarn \
--deploy-mode cluster \
--queue YOUR_QUEUE \
--num-executors 200 \
--executor-memory 10G \
--driver-memory 15G \
--executor-cores 4 \
--conf spark.yarn.maxAppAttempts=2 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=1000 \
--conf spark.network.timeout=300s \
--conf spark.executor.memoryOverhead=2G \
--conf spark.pyspark.driver.python=./env/bin/python \
--conf spark.pyspark.python=./env/bin/python \
--archives env.tar.gz#env \
$1
The --archives env.tar.gz#env option ships the packed environment to the cluster and unpacks it into a directory named env in each container's working directory, which is why the interpreter paths above point to ./env/bin/python. And below is a simple example of _pyspark.py.
from pyspark.sql import SparkSession

# Create a SparkSession with Hive support so that Hive tables can be queried.
spark = SparkSession.builder.appName("Test PySpark").enableHiveSupport().getOrCreate()

# Sample 100,000 rows from a Hive table.
sql = """
SELECT *
FROM some_table
TABLESAMPLE (100000 ROWS)
"""

# Write the sampled rows to the directory "output" as Parquet files.
spark.sql(sql).write.mode("overwrite").parquet("output")
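Assuming the shell script above is saved as submit.sh (the file name is only an example), the application can be submitted as follows; the Python file is passed as $1 to the script.

# Submit _pyspark.py to YARN using the pre-built conda environment.
bash submit.sh _pyspark.py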