Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Load Data in CSV Format¶
.loadis a general method for reading data in different format. You have to specify the format of the data via the method.formatof course..csv(both for CSV and TSV),.jsonand.parquetare specializations of.load..formatis optional if you use a specific loading function (csv, json, etc.).No header by default.
.coalesece(1)orrepartition(1)if you want to write to only 1 file.
import findspark
# A symbolic link of the Spark Home is made to /opt/spark for convenience
findspark.init("/opt/spark")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("PySpark").enableHiveSupport().getOrCreate()Using load¶
schema = StructType(
[
StructField("denominator", IntegerType(), False),
StructField("max_mod", IntegerType(), False),
StructField("num_distinct", IntegerType(), False),
StructField("max_dup", IntegerType(), False),
StructField("avg_dup", DoubleType(), False),
]
)flight = (
spark.read.format("csv")
.option("header", "true")
.schema(schema)
.load("../../home/media/data/flights14.csv")
)
flight.show()+-----------+-------+------------+-------+-------+
|denominator|max_mod|num_distinct|max_dup|avg_dup|
+-----------+-------+------------+-------+-------+
| 2014| 1| 1| 914| 14.0|
| 2014| 1| 1| 1157| -3.0|
| 2014| 1| 1| 1902| 2.0|
| 2014| 1| 1| 722| -8.0|
| 2014| 1| 1| 1347| 2.0|
| 2014| 1| 1| 1824| 4.0|
| 2014| 1| 1| 2133| -2.0|
| 2014| 1| 1| 1542| -3.0|
| 2014| 1| 1| 1509| -1.0|
| 2014| 1| 1| 1848| -2.0|
| 2014| 1| 1| 1655| -5.0|
| 2014| 1| 1| 1752| 7.0|
| 2014| 1| 1| 1253| 3.0|
| 2014| 1| 1| 1907| 142.0|
| 2014| 1| 1| 1720| -5.0|
| 2014| 1| 1| 1733| 18.0|
| 2014| 1| 1| 1640| 25.0|
| 2014| 1| 1| 1714| -1.0|
| 2014| 1| 1| 1611| 191.0|
| 2014| 1| 1| 553| -7.0|
+-----------+-------+------------+-------+-------+
only showing top 20 rows
Using csv¶
flight = (
spark.read.option("header", "true")
.schema(schema)
.csv("../../home/media/data/flights14.csv")
)
flight.show()+-----------+-------+------------+-------+-------+
|denominator|max_mod|num_distinct|max_dup|avg_dup|
+-----------+-------+------------+-------+-------+
| 2014| 1| 1| 914| 14.0|
| 2014| 1| 1| 1157| -3.0|
| 2014| 1| 1| 1902| 2.0|
| 2014| 1| 1| 722| -8.0|
| 2014| 1| 1| 1347| 2.0|
| 2014| 1| 1| 1824| 4.0|
| 2014| 1| 1| 2133| -2.0|
| 2014| 1| 1| 1542| -3.0|
| 2014| 1| 1| 1509| -1.0|
| 2014| 1| 1| 1848| -2.0|
| 2014| 1| 1| 1655| -5.0|
| 2014| 1| 1| 1752| 7.0|
| 2014| 1| 1| 1253| 3.0|
| 2014| 1| 1| 1907| 142.0|
| 2014| 1| 1| 1720| -5.0|
| 2014| 1| 1| 1733| 18.0|
| 2014| 1| 1| 1640| 25.0|
| 2014| 1| 1| 1714| -1.0|
| 2014| 1| 1| 1611| 191.0|
| 2014| 1| 1| 553| -7.0|
+-----------+-------+------------+-------+-------+
only showing top 20 rows
Output with a Single Header¶
https://
https://
References¶
https://
https://
https://
https://