Ben Chuanlong Du's Blog

It is never too late to learn.

Read/Write CSV in PySpark

Load Data in CSV Format

  1. .load is a general method for reading data in different formats. You have to specify the format of the data via the method .format, of course. The methods .csv (which handles both CSV and TSV), .json and .parquet are specializations of .load; .format is optional when you use one of these format-specific loading functions (csv, json, etc.).

  2. No header is assumed by default; set the option header to "true" if your file has one.

  3. Use .coalesce(1) or .repartition(1) if you want to write the output to only 1 file.

In [3]:
import findspark

# A symbolic link to the Spark home is created at /opt/spark for convenience
findspark.init("/opt/spark")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("PySpark").enableHiveSupport().getOrCreate()

Using load

In [4]:
schema = StructType(
    [
        StructField("denominator", IntegerType(), False),
        StructField("max_mod", IntegerType(), False),
        StructField("num_distinct", IntegerType(), False),
        StructField("max_dup", IntegerType(), False),
        StructField("avg_dup", DoubleType(), False),
    ]
)
In [7]:
flight = (
    spark.read.format("csv")
    .option("header", "true")
    .schema(schema)
    .load("../../home/media/data/flights14.csv")
)
flight.show()
+-----------+-------+------------+-------+-------+
|denominator|max_mod|num_distinct|max_dup|avg_dup|
+-----------+-------+------------+-------+-------+
|       2014|      1|           1|    914|   14.0|
|       2014|      1|           1|   1157|   -3.0|
|       2014|      1|           1|   1902|    2.0|
|       2014|      1|           1|    722|   -8.0|
|       2014|      1|           1|   1347|    2.0|
|       2014|      1|           1|   1824|    4.0|
|       2014|      1|           1|   2133|   -2.0|
|       2014|      1|           1|   1542|   -3.0|
|       2014|      1|           1|   1509|   -1.0|
|       2014|      1|           1|   1848|   -2.0|
|       2014|      1|           1|   1655|   -5.0|
|       2014|      1|           1|   1752|    7.0|
|       2014|      1|           1|   1253|    3.0|
|       2014|      1|           1|   1907|  142.0|
|       2014|      1|           1|   1720|   -5.0|
|       2014|      1|           1|   1733|   18.0|
|       2014|      1|           1|   1640|   25.0|
|       2014|      1|           1|   1714|   -1.0|
|       2014|      1|           1|   1611|  191.0|
|       2014|      1|           1|    553|   -7.0|
+-----------+-------+------------+-------+-------+
only showing top 20 rows

Using csv

In [6]:
flight = (
    spark.read.option("header", "true")
    .schema(schema)
    .csv("../../home/media/data/flights14.csv")
)
flight.show()
+-----------+-------+------------+-------+-------+
|denominator|max_mod|num_distinct|max_dup|avg_dup|
+-----------+-------+------------+-------+-------+
|       2014|      1|           1|    914|   14.0|
|       2014|      1|           1|   1157|   -3.0|
|       2014|      1|           1|   1902|    2.0|
|       2014|      1|           1|    722|   -8.0|
|       2014|      1|           1|   1347|    2.0|
|       2014|      1|           1|   1824|    4.0|
|       2014|      1|           1|   2133|   -2.0|
|       2014|      1|           1|   1542|   -3.0|
|       2014|      1|           1|   1509|   -1.0|
|       2014|      1|           1|   1848|   -2.0|
|       2014|      1|           1|   1655|   -5.0|
|       2014|      1|           1|   1752|    7.0|
|       2014|      1|           1|   1253|    3.0|
|       2014|      1|           1|   1907|  142.0|
|       2014|      1|           1|   1720|   -5.0|
|       2014|      1|           1|   1733|   18.0|
|       2014|      1|           1|   1640|   25.0|
|       2014|      1|           1|   1714|   -1.0|
|       2014|      1|           1|   1611|  191.0|
|       2014|      1|           1|    553|   -7.0|
+-----------+-------+------------+-------+-------+
only showing top 20 rows

