In [1]:
%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1
org.apache.spark spark-hive_2.11 2.3.1
Load Data in TSV Format
.load is a general method for reading data in different formats; you specify the format of the data via the .format method. .csv (for both CSV and TSV), .json, and .parquet are specializations of .load, so .format is optional when you use one of these format-specific loading functions.
No header by default: unless you set option("header", "true"), the reader treats the first line as data and the writer emits no header line.
Use .coalesce(1) or .repartition(1) before writing if you want the output in only 1 file.
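As a rough illustration of the first note, the sketch below reads one file twice, once through the general format(...).load(...) path and once through the csv shortcut. The two-line sample file and the names sample, viaLoad, and viaCsv are made up for this example; they are not part of the flights data used in this notebook.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one, otherwise start a local one.
val spark = SparkSession.builder().master("local").appName("spark load tsv").getOrCreate()

// Hypothetical sample data so the sketch runs on its own.
val sample = Files.createTempFile("sample", ".tsv")
Files.write(sample, "origin\tdest\nJFK\tLAX\n".getBytes("UTF-8"))

// General form: name the format explicitly, then call load.
val viaLoad = spark.read.
    format("csv").
    option("header", "true").
    option("delimiter", "\t").
    load(sample.toString)

// Shortcut form: csv(...) selects the CSV format by itself.
val viaCsv = spark.read.
    option("header", "true").
    option("delimiter", "\t").
    csv(sample.toString)

// Both readers should report the same schema.
println(viaLoad.schema == viaCsv.schema)
```

The shortcut is usually the more readable choice; the general form is mainly useful when the format name is only known at run time.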
In [2]:
%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1
In [3]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.master("local")
.appName("spark load tsv")
.config("spark-config-some-option", "some-value")
.getOrCreate()
// spark
In [4]:
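// Read the tab-separated file; the first line holds the column names.
// format("csv") is redundant here because csv(...) already implies it.
// DROPMALFORMED skips rows that cannot be parsed.
val flights = spark.read.
    format("csv").
    option("header", "true").
    option("delimiter", "\t").
    option("mode", "DROPMALFORMED").
    csv("f2.tsv")
flights.show(5)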
In [6]:
// Same read without format("csv") and with the default PERMISSIVE parse mode.
val flights = spark.read.
    option("header", "true").
    option("delimiter", "\t").
    csv("f2.tsv")
flights.show(5)
In [7]:
// No header option: the first line is read as data
// and the columns are named _c0, _c1, ...
val flights = spark.read.
    option("delimiter", "\t").
    csv("f2.tsv")
flights.show(5)
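To make the effect of the missing header option concrete, here is a small sketch. It writes a two-line, made-up TSV file to a temporary path and reads it with and without option("header", "true"). The file contents and the names tmp, noHeader, and withHeader are assumptions for illustration only.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one, otherwise start a local one.
val spark = SparkSession.builder().master("local").appName("spark load tsv").getOrCreate()

// Hypothetical sample data: one header line and one record.
val tmp = Files.createTempFile("sample", ".tsv")
Files.write(tmp, "origin\tdest\nJFK\tLAX\n".getBytes("UTF-8"))

// Without the header option, the first line is treated as data
// and Spark generates the column names.
val noHeader = spark.read.
    option("delimiter", "\t").
    csv(tmp.toString)
println(noHeader.columns.mkString(", "))   // _c0, _c1
println(noHeader.count())                  // 2

// With the header option, the first line supplies the column names.
val withHeader = spark.read.
    option("header", "true").
    option("delimiter", "\t").
    csv(tmp.toString)
println(withHeader.columns.mkString(", ")) // origin, dest
println(withHeader.count())                // 1
```

If a file really has no header line, the generated names can be replaced afterwards with toDF, passing one name per column.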
Write DataFrame to TSV
In [25]:
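// Read the comma-separated source file.
val flights = spark.read.
    format("csv").
    option("header", "true").
    option("mode", "DROPMALFORMED").
    csv("flights14.csv")
// Write it back out tab-separated. Spark creates a directory named f2.tsv
// that contains part files, and the write fails if that path already exists
// unless a save mode such as mode("overwrite") is set.
flights.write.
    option("header", "true").
    option("delimiter", "\t").
    csv("f2.tsv")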
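Writing to a single file with coalesce(1) might look like the sketch below. It is an illustration under stated assumptions rather than this notebook's own code: the tiny DataFrame df, the temporary output location outDir, and the choice of mode("overwrite") are all made up for the example. Because Spark writes a directory rather than a single file, the sketch finishes by counting the part files inside it.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one, otherwise start a local one.
val spark = SparkSession.builder().master("local").appName("spark load tsv").getOrCreate()
import spark.implicits._

// Hypothetical data standing in for the flights DataFrame.
val df = Seq(("JFK", "LAX", 13), ("LGA", "PBI", -3)).toDF("origin", "dest", "dep_delay")

// Hypothetical output location; Spark creates a directory at this path.
val outDir = Files.createTempDirectory("tsv-out").resolve("f2.tsv").toString

// coalesce(1) collapses the data to one partition, hence one part file.
// mode("overwrite") replaces the output directory if it already exists.
df.coalesce(1).
    write.
    mode("overwrite").
    option("header", "true").
    option("delimiter", "\t").
    csv(outDir)

// Count the data files Spark produced inside the output directory.
val partFiles = new java.io.File(outDir).listFiles.map(_.getName).filter(_.startsWith("part-"))
println(partFiles.length)
```

coalesce(1) avoids a full shuffle, while repartition(1) forces one; for large data either choice funnels everything through a single task, so this is best kept for small outputs.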