
Packaging Python Dependencies for PySpark Using conda-pack

python-build-standalone is a better alternative to conda-pack for managing Python dependencies for PySpark. Please refer to Packaging Python Dependencies for PySpark Using python-build-standalone for a tutorial on that approach.

Build Portable Python Environments Using conda-pack

Please refer to the GitHub repo dclong/conda_environ for instructions on leveraging the Docker image dclong/conda to build portable conda environments.
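
If you prefer to build the environment manually rather than via the Docker image, below is a minimal sketch using conda and conda-pack. The environment name env, the Python version, and the packages are illustrative; adapt them to your Spark cluster. The resulting env.tar.gz is the archive referenced by the spark-submit script in the next section.

# create a conda environment; the Python version and packages are illustrative
conda create -y -n env python=3.7 numpy pandas pyarrow
# install conda-pack and pack the environment into a relocatable tarball
conda install -y -c conda-forge conda-pack
conda pack -n env -o env.tar.gz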

Submit a PySpark Application Using a conda Environment

Below is an example shell script for submitting a PySpark job using a pre-built conda-pack Python environment named env.tar.gz.

/apache/spark2.3/bin/spark-submit \
    --files "file:///apache/hive/conf/hive-site.xml,file:///apache/hadoop/etc/hadoop/ssl-client.xml,file:///apache/hadoop/etc/hadoop/hdfs-site.xml,file:///apache/hadoop/etc/hadoop/core-site.xml,file:///apache/hadoop/etc/hadoop/federation-mapping.xml" \
    --master yarn \
    --deploy-mode cluster \
    --queue YOUR_QUEUE \
    --num-executors 200 \
    --executor-memory 10G \
    --driver-memory 15G \
    --executor-cores 4 \
    --conf spark.yarn.maxAppAttempts=2 \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.maxExecutors=1000 \
    --conf spark.network.timeout=300s \
    --conf spark.executor.memoryOverhead=2G \
    --conf spark.pyspark.driver.python=./env/bin/python \
    --conf spark.pyspark.python=./env/bin/python \
    --archives env.tar.gz#env \
    "$1"
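
The PySpark application file is passed to the script as its first argument. Assuming the script is saved as submit.sh (a hypothetical name) and env.tar.gz sits in the current working directory, it can be invoked as follows.

# submit.sh is a hypothetical name for the shell script above
bash submit.sh _pyspark.py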

Below is a simple example of _pyspark.py, the PySpark application passed to the shell script above.

from pyspark.sql import SparkSession

# Create a SparkSession with Hive support so that Hive tables can be queried.
spark = SparkSession.builder.appName("Test PySpark").enableHiveSupport().getOrCreate()
# Sample 100,000 rows from a Hive table.
sql = """
    SELECT *
    FROM some_table
    TABLESAMPLE (100000 ROWS)
    """
# Write the sampled rows to the directory "output" in Parquet format.
spark.sql(sql).write.mode("overwrite").parquet("output")

