In [1]:
%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1
org.apache.spark spark-hive_2.11 2.3.1
Load Data
.load is a general method for reading data in different formats; you have to specify the format of the data via the .format method. .csv (for both CSV and TSV), .json and .parquet are specializations of .load, so .format is optional when you use one of these format-specific readers. CSV files are read with no header by default.
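As a minimal sketch of the two styles, the generic loader and the CSV-specific reader below are equivalent; the path reuses the flights14.csv file that appears later in this notebook, and the variable names flightsGeneric and flightsCsv are made up for the example.
// Generic loader: the format has to be given explicitly via .format
val flightsGeneric = spark.read.
  format("csv").
  option("header", "true").
  load("flights14.csv")

// Format-specific reader: .csv implies the format, so .format is not needed
val flightsCsv = spark.read.
  option("header", "true").
  csv("flights14.csv")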
Use .coalesce(1) or .repartition(1) if you want to write the output to a single file.
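A minimal sketch of a single-file write, assuming a flights DataFrame like the one built in the write section below; the output directory name flights_single is hypothetical.
// flights is assumed to be an already-loaded DataFrame (see the write section below).
// coalesce(1) collapses the data into one partition, so only one part file is written.
flights.coalesce(1).
  write.
  option("header", "true").
  csv("flights_single")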
Load Data in Parquet Format
In [7]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().master("local")
.appName("IO")
.getOrCreate()
spark
Out[7]:
In [9]:
val df = spark.read.parquet("f2.parquet")
df.show
Out[9]:
In [10]:
df.count
Out[10]:
In [11]:
df.select(input_file_name()).show
In [1]:
val df = spark.read.load("namesAndAges.parquet")
df.show
In [9]:
val df = spark.sql("SELECT * FROM parquet.`namesAndAges.parquet`")
df.show
In [20]:
import java.io.File
// List the CSV files in the current working directory
new File(".").listFiles.filter(_.getPath.endsWith(".csv"))
Out[20]:
Write DataFrame to Parquet
In [32]:
// Read the CSV with its header row as column names; DROPMALFORMED silently drops rows that fail to parse
val flights = spark.read.
  option("header", "true").
  option("mode", "DROPMALFORMED").
  csv("flights14.csv")
flights.write.parquet("f2.parquet")
In [3]:
// peopleDF is assumed to be an existing DataFrame with "name" and "age" columns (e.g. read from a JSON file)
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")