Ben Chuanlong Du's Blog

It is never too late to learn.

Read/Write CSV in PySpark

Load Data in CSV Format

  1. .load is a general method for reading data in different formats. You specify the format of the data via the method .format; if you omit it, Spark falls back to its default data source (Parquet, unless spark.sql.sources.default has been changed). .csv (for both CSV and TSV), .json and .parquet are specializations of .load, so you do not need .format when you use one of these specific loading functions.

BufferedReader in Java IO


  1. The methods BufferedReader.readLine and BufferedReader.lines are very helpful for reading text files.

public String BufferedReader.readLine

Reads a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.
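
As a minimal sketch of the terminator rules above, the following reads a string that mixes all three line endings. The class name ReadLineDemo and the helper readAll are hypothetical names chosen for illustration; a StringReader stands in for a real file so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReadLineDemo {
    // Hypothetical helper: collects every line until readLine returns null,
    // which is how BufferedReader signals the end of the stream.
    static List<String> readAll(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line); // the terminator itself is not part of the line
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // "\n", "\r" and "\r\n" each terminate exactly one line.
        System.out.println(readAll("a\nb\rc\r\nd")); // prints [a, b, c, d]
    }
}
```

Note that "\r\n" counts as a single terminator, so no empty line appears between c and d.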

public Stream<String> BufferedReader.lines

Returns a Stream, the elements of which are lines read from this BufferedReader. The Stream is lazily populated, i.e., read only occurs during the terminal stream operation. The reader must not be operated on during the execution of the terminal stream operation. Otherwise, the result of the terminal stream operation is undefined.
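
A short sketch of how the lazy stream might be used: nothing is read until the terminal operation (collect here) runs, and the reader is left alone while it does. LinesDemo and nonEmptyUpper are hypothetical names, and the input again comes from a StringReader rather than a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

public class LinesDemo {
    // Hypothetical helper: drops empty lines and upper-cases the rest.
    static List<String> nonEmptyUpper(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.lines()                     // lazy: no read happens yet
                    .filter(line -> !line.isEmpty())  // intermediate, still lazy
                    .map(String::toUpperCase)         // intermediate, still lazy
                    .collect(Collectors.toList());    // terminal: lines are read here
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(nonEmptyUpper("alpha\n\nbeta\n")); // prints [ALPHA, BETA]
    }
}
```

The terminal operation runs inside the try-with-resources block, so the reader is still open while the stream is consumed and is closed afterwards.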

Java IO

public static Stream<String> java.nio.file.Files.lines

public static List<String> java.nio.file.Files.readAllLines

Reads all lines from a file. Bytes from the file are decoded into characters using the UTF-8 charset. Note that this method returns a List of Strings, unlike java.nio.file.Files.lines, which returns a Stream of Strings.
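
To make the difference concrete, here is a small sketch that reads the same temporary file both ways. The class FilesLinesDemo, its helper methods, and the sample contents are all hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

public class FilesLinesDemo {
    // Hypothetical helper: writes three sample lines (UTF-8) to a temp file.
    static Path writeSample() {
        try {
            Path path = Files.createTempFile("lines-demo", ".txt");
            Files.write(path, Arrays.asList("first", "second", "third"));
            path.toFile().deleteOnExit();
            return path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Eager: the whole file is loaded into a List before the method returns.
    static List<String> readEagerly(Path path) {
        try {
            return Files.readAllLines(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lazy: the Stream holds the file open, so it is closed via try-with-resources.
    static Optional<String> firstStartingWith(Path path, String prefix) {
        try (Stream<String> lines = Files.lines(path)) {
            return lines.filter(line -> line.startsWith(prefix)).findFirst();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path path = writeSample();
        System.out.println(readEagerly(path));            // prints [first, second, third]
        System.out.println(firstStartingWith(path, "s")); // prints Optional[second]
    }
}
```

Files.readAllLines is convenient for small files, while Files.lines is usually the better fit when the file is large or only part of it is needed.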