Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1
org.apache.spark spark-hive_2.11 2.3.1Loading...
Loading...
Load Data in TSV Format¶
.loadis a general method for reading data in different format. You have to specify the format of the data via the method.formatof course..csv(both for CSV and TSV),.jsonand.parquetare specializations of.load..formatis optional if you use a specific loading function (csv, json, etc.).No header by default.
.coalesece(1)orrepartition(1)if you want to write to only 1 file.
Load Data in TSV Format¶
%%classpath add mvn
org.apache.spark spark-core_2.11 2.1.1
org.apache.spark spark-sql_2.11 2.1.1Loading...
Loading...
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.master("local")
.appName("spark load tsv")
.config("spark-config-some-option", "some-value")
.getOrCreate()
// sparkorg.apache.spark.sql.SparkSession@4549fbd9val flights = spark.read.
format("csv").
option("header", "true").
option("delimiter", "\t").
option("mode", "DROPMALFORMED").
csv("f2.tsv")
flights.show(5)+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|cancelled|carrier|tailnum|flight|origin|dest|air_time|distance|hour|min|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
|2014| 6| 13| 1724| 114| 2125| 158| 0| B6| N709JB| 1729| JFK| RSW| 154| 1074| 17| 24|
|2014| 6| 13| 1942| 223| 2211| 282| 0| B6| N304JB| 1734| JFK| BTV| 49| 266| 19| 42|
|2014| 6| 13| 1345| 4| 1641| 11| 0| B6| N796JB| 1783| JFK| MCO| 137| 944| 13| 45|
|2014| 6| 13| 1552| 0| 1916| 8| 0| B6| N184JB| 1801| JFK| FLL| 166| 1069| 15| 52|
|2014| 6| 13| 119| 151| 215| 137| 0| B6| N203JB| 1816| JFK| SYR| 41| 209| 1| 19|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
only showing top 5 rows
nullval flights = spark.read.
option("header", "true").
option("delimiter", "\t").
csv("f2.tsv")
flights.show(5)+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|cancelled|carrier|tailnum|flight|origin|dest|air_time|distance|hour|min|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
|2014| 3| 27| 538| -7| 824| -21| 0| AA| N640AA| 2243| JFK| MIA| 147| 1089| 5| 38|
|2014| 3| 27| 812| -8| 1113| -22| 0| AA| N3GLAA| 2267| LGA| MIA| 151| 1096| 8| 12|
|2014| 3| 27| 952| -7| 1310| 1| 0| AA| N3KGAA| 2335| LGA| MIA| 153| 1096| 9| 52|
|2014| 3| 27| 1506| -4| 1810| -15| 0| AA| N3CWAA| 1327| LGA| PBI| 148| 1035| 15| 6|
|2014| 3| 27| 704| -6| 1013| -7| 0| AA| N3GHAA| 2279| LGA| MIA| 151| 1096| 7| 4|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
only showing top 5 rows
val flights = spark.read.
option("delimiter", "\t").
csv("f2.tsv")
flights.show(5)+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+----+
| _c0| _c1|_c2| _c3| _c4| _c5| _c6| _c7| _c8| _c9| _c10| _c11|_c12| _c13| _c14|_c15|_c16|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+----+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|cancelled|carrier|tailnum|flight|origin|dest|air_time|distance|hour| min|
|2014| 3| 27| 538| -7| 824| -21| 0| AA| N640AA| 2243| JFK| MIA| 147| 1089| 5| 38|
|2014| 3| 27| 812| -8| 1113| -22| 0| AA| N3GLAA| 2267| LGA| MIA| 151| 1096| 8| 12|
|2014| 3| 27| 952| -7| 1310| 1| 0| AA| N3KGAA| 2335| LGA| MIA| 153| 1096| 9| 52|
|2014| 3| 27| 1506| -4| 1810| -15| 0| AA| N3CWAA| 1327| LGA| PBI| 148| 1035| 15| 6|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+----+
only showing top 5 rows
Write DataFrame to TSV¶
val flights = spark.read.
format("csv").
option("header", "true").
option("mode", "DROPMALFORMED").
csv("flights14.csv")
flights.write.
option("header", "true").
option("delimiter", "\t").
csv("f2.tsv")