Read/Write CSV in PySpark
Load Data in CSV Format¶
.load is a general method for reading data in different formats. You have to specify the format of the data via the .format method. .csv (for both CSV and TSV), .json, and .parquet are specializations of .load, so .format is optional if you use one of these format-specific loading functions (csv, json, etc.).
BufferedReader in Java IO
Comment¶
- The methods BufferedReader.readLine and BufferedReader.lines are very helpful for reading text files.
public String BufferedReader.readLine¶
Reads a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.
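The terminator rules above can be checked with a small loop. This is a minimal sketch: the class name ReadLineExample and the helper readAll are made up for illustration, and a StringReader stands in for a real file. It relies on readLine returning null at the end of the stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReadLineExample {
    // Hypothetical helper: collects every line the reader yields.
    static List<String> readAll(BufferedReader reader) {
        List<String> lines = new ArrayList<>();
        try {
            String line;
            // readLine() returns null once the end of the stream is reached.
            while ((line = reader.readLine()) != null) {
                lines.add(line); // the line terminator is not part of the result
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // "\n", "\r", and "\r\n" each terminate exactly one line.
        String text = "alpha\nbeta\rgamma\r\ndelta";
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            System.out.println(readAll(reader)); // prints [alpha, beta, gamma, delta]
        }
    }
}
```

The last line ("delta") has no terminator and is still returned, because reaching the end of the input also ends a line.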
public Stream<String> BufferedReader.lines¶
Returns a Stream, the elements of which are lines read from this BufferedReader. The Stream is lazily populated, i.e., read only occurs during the terminal stream operation. The reader must not be operated on during the execution of the terminal stream operation. Otherwise, the result of the terminal stream operation is undefined.
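To illustrate the lazy behavior, here is a small sketch that builds a pipeline on top of lines() and only reads the input when the terminal operation runs. The class name LinesExample and the helper nonBlankLines are hypothetical, and the input comes from a StringReader rather than a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LinesExample {
    // Hypothetical helper: returns the trimmed, non-blank lines of the input.
    static List<String> nonBlankLines(BufferedReader reader) {
        Stream<String> lines = reader.lines(); // nothing has been read yet
        return lines
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList()); // the terminal operation triggers the reads
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(" a \n\n b\n"))) {
            System.out.println(nonBlankLines(reader)); // prints [a, b]
        }
    }
}
```

The reader is not touched between creating the stream and collecting it, which is what the contract above asks for.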
The Source Class in Scala IO
Java IO
References¶
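- Generally speaking, BufferedReader/BufferedWriter are preferred over using InputStreamReader/OutputStreamWriter directly, since buffering avoids a costly read or write on the underlying stream for each call.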
https://docs.oracle.com/javase/8/docs/api/java/io/BufferedWriter.html
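In practice the buffered classes usually wrap the stream-based ones rather than replace them. The sketch below shows that pattern under stated assumptions: the class name BufferedWrapExample and both helpers are invented for illustration, and in-memory byte arrays stand in for real files or sockets.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BufferedWrapExample {
    // Hypothetical helper: encodes the lines as UTF-8, one "\n" after each line.
    static byte[] writeLines(List<String> lines) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // the writer was flushed when it was closed
    }

    // Hypothetical helper: decodes UTF-8 bytes back into lines.
    static List<String> readLines(byte[] data) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeLines(Arrays.asList("x", "y"));
        System.out.println(readLines(bytes)); // prints [x, y]
    }
}
```

Passing the charset explicitly to InputStreamReader and OutputStreamWriter keeps the result independent of the platform default encoding.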
public static Stream<String> java.nio.file.Files.lines¶
public static List<String> java.nio.file.Files.readAllLines¶
Reads all lines from a file. Bytes from the file are decoded into characters using the UTF-8 charset. Note that this method returns a List of Strings, unlike java.nio.file.Files.lines, which returns a Stream of Strings.
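The difference between the two methods can be sketched as follows. The class name FilesLinesExample and its helpers are hypothetical, and the sample data is written to a temporary file created just for the demonstration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class FilesLinesExample {
    // Hypothetical helper: creates a temporary file holding three lines.
    static Path writeSample() {
        try {
            Path file = Files.createTempFile("lines-demo", ".txt");
            Files.write(file, Arrays.asList("one", "two", "three")); // UTF-8 by default
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Eager: the whole file is loaded into a List before the method returns.
    static List<String> readEagerly(Path file) {
        try {
            return Files.readAllLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lazy: lines are pulled on demand; the Stream holds the file open,
    // so it is closed with try-with-resources.
    static long countLazily(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeSample();
        System.out.println(readEagerly(file)); // prints [one, two, three]
        System.out.println(countLazily(file)); // prints 3
    }
}
```

readAllLines is convenient for small files, while Files.lines tends to suit large files better because the lines do not all need to be held in memory at once.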