In [1]:
%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1
org.apache.spark spark-hive_2.11 2.3.1
Load Data
.load is a general method for reading data in different formats; you have to specify the format of the data via the .format method.
.csv (for both CSV and TSV), .json and .parquet are specializations of .load; .format is optional if you use one of these format-specific loading functions.
CSV files are read with no header by default; pass option("header", "true") when the first line contains column names.
Use .coalesce(1) or .repartition(1) before writing if you want the output to go into a single file (see the sketch below).
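A minimal sketch of the two reading styles and of coalescing before a write, assuming a SparkSession named spark (created in the next cell) and an illustrative file data.csv:
// Generic loader: the format has to be named explicitly.
val generic = spark.read.format("csv").option("header", "true").load("data.csv")
// Format-specific shorthand: .format is not needed.
val shorthand = spark.read.option("header", "true").csv("data.csv")
// Collapse to one partition so the output directory holds a single data file.
shorthand.coalesce(1).write.parquet("data-single.parquet")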
Load Data in Parquet Format
In [7]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().master("local")
.appName("IO")
.getOrCreate()
spark
Out[7]:
In [9]:
val df = spark.read.parquet("f2.parquet")
df.show
Out[9]:
In [10]:
df.count
Out[10]:
In [11]:
df.select(input_file_name()).show
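input_file_name() reports which file each row was read from, which is useful when a whole directory of Parquet files is loaded as one DataFrame. An illustrative variant that keeps the original columns (the column name source_file is arbitrary):
df.withColumn("source_file", input_file_name()).show(5, false)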
In [1]:
val df = spark.read.load("namesAndAges.parquet")
df.show
In [9]:
val df = spark.sql("SELECT * FROM parquet.`namesAndAges.parquet`")
df.show
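Instead of querying the file path directly, the DataFrame can be registered as a temporary view and queried by name; the view name and the filter below are illustrative, using the name and age columns this file is given later in the notebook:
df.createOrReplaceTempView("names_and_ages")
spark.sql("SELECT name, age FROM names_and_ages WHERE age > 20").show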
In [20]:
// List the CSV files sitting in the current working directory.
import java.io.File
new File(".").listFiles.filter(_.getPath.endsWith(".csv"))
Out[20]:
Write DataFrame to Parquet
In [32]:
// Read the CSV (first line is the header, malformed rows are dropped),
// then save it in Parquet format. The .format("csv") call is redundant
// here because the .csv shorthand already fixes the format.
val flights = spark.read.
  format("csv").
  option("header", "true").
  option("mode", "DROPMALFORMED").
  csv("flights14.csv")
flights.write.parquet("f2.parquet")
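As noted in the load section, coalescing to a single partition makes the output directory contain one data file, and mode("overwrite") lets the cell be re-run even though the target path already exists; the output path below is illustrative:
flights.coalesce(1).write.mode("overwrite").parquet("flights-single.parquet")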
In [3]:
// peopleDF is assumed to be an existing DataFrame with name and age columns
// (it is not defined in this section).
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
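Just as .format(...).load is the generic reader, .format(...).save is the generic writer; the .parquet shorthand is equivalent, as in this sketch with the same assumed peopleDF:
peopleDF.select("name", "age").write.parquet("namesAndAges.parquet")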