In [1]:

%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1
org.apache.spark spark-hive_2.11 2.3.1

Load Data in TSV Format¶

.load is a general method for reading data in different format. You have to specify the format of the data via the method .format of course. .csv (both for CSV and TSV), .json and .parquet are specializations of .load. .format is optional if you use a specific loading function (csv, json, etc.).
No header by default.
.coalesece(1) or repartition(1) if you want to write to only 1 file.

Load Data in TSV Format¶

In [2]:

%%classpath add mvn
org.apache.spark spark-core_2.11 2.1.1
org.apache.spark spark-sql_2.11 2.1.1

In [3]:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
    .builder()
    .master("local")
    .appName("spark load tsv")
    .config("spark-config-some-option", "some-value")
    .getOrCreate()
// spark

Out[3]:

org.apache.spark.sql.SparkSession@4549fbd9

In [4]:

val flights = spark.read.
    format("csv").
    option("header", "true").
    option("delimiter", "\t").
    option("mode", "DROPMALFORMED").
    csv("f2.tsv")
flights.show(5)

+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|cancelled|carrier|tailnum|flight|origin|dest|air_time|distance|hour|min|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
|2014|    6| 13|    1724|      114|    2125|      158|        0|     B6| N709JB|  1729|   JFK| RSW|     154|    1074|  17| 24|
|2014|    6| 13|    1942|      223|    2211|      282|        0|     B6| N304JB|  1734|   JFK| BTV|      49|     266|  19| 42|
|2014|    6| 13|    1345|        4|    1641|       11|        0|     B6| N796JB|  1783|   JFK| MCO|     137|     944|  13| 45|
|2014|    6| 13|    1552|        0|    1916|        8|        0|     B6| N184JB|  1801|   JFK| FLL|     166|    1069|  15| 52|
|2014|    6| 13|     119|      151|     215|      137|        0|     B6| N203JB|  1816|   JFK| SYR|      41|     209|   1| 19|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
only showing top 5 rows

Out[4]:

null

In [6]:

val flights = spark.read.
    option("header", "true").
    option("delimiter", "\t").
    csv("f2.tsv")
flights.show(5)

+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|cancelled|carrier|tailnum|flight|origin|dest|air_time|distance|hour|min|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
|2014|    3| 27|     538|       -7|     824|      -21|        0|     AA| N640AA|  2243|   JFK| MIA|     147|    1089|   5| 38|
|2014|    3| 27|     812|       -8|    1113|      -22|        0|     AA| N3GLAA|  2267|   LGA| MIA|     151|    1096|   8| 12|
|2014|    3| 27|     952|       -7|    1310|        1|        0|     AA| N3KGAA|  2335|   LGA| MIA|     153|    1096|   9| 52|
|2014|    3| 27|    1506|       -4|    1810|      -15|        0|     AA| N3CWAA|  1327|   LGA| PBI|     148|    1035|  15|  6|
|2014|    3| 27|     704|       -6|    1013|       -7|        0|     AA| N3GHAA|  2279|   LGA| MIA|     151|    1096|   7|  4|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
only showing top 5 rows

In [7]:

val flights = spark.read.
    option("delimiter", "\t").
    csv("f2.tsv")
flights.show(5)

+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+----+
| _c0|  _c1|_c2|     _c3|      _c4|     _c5|      _c6|      _c7|    _c8|    _c9|  _c10|  _c11|_c12|    _c13|    _c14|_c15|_c16|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+----+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|cancelled|carrier|tailnum|flight|origin|dest|air_time|distance|hour| min|
|2014|    3| 27|     538|       -7|     824|      -21|        0|     AA| N640AA|  2243|   JFK| MIA|     147|    1089|   5|  38|
|2014|    3| 27|     812|       -8|    1113|      -22|        0|     AA| N3GLAA|  2267|   LGA| MIA|     151|    1096|   8|  12|
|2014|    3| 27|     952|       -7|    1310|        1|        0|     AA| N3KGAA|  2335|   LGA| MIA|     153|    1096|   9|  52|
|2014|    3| 27|    1506|       -4|    1810|      -15|        0|     AA| N3CWAA|  1327|   LGA| PBI|     148|    1035|  15|   6|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+----+
only showing top 5 rows

Write DataFrame to TSV¶

In [25]:

val flights = spark.read.
    format("csv").
    option("header", "true").
    option("mode", "DROPMALFORMED").
    csv("flights14.csv")
flights.write.
    option("header", "true").
    option("delimiter", "\t").
    csv("f2.tsv")

References¶

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html

https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html

Ben Chuanlong Du's Blog

It is never too late to learn.

Read/Write TSV in Spark

Load Data in TSV Format¶

Load Data in TSV Format¶

Write DataFrame to TSV¶

References¶

Comments