Ben Chuanlong Du's Blog

It is never too late to learn.

Python Virtual Environment

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. venv is a standard Python library and is the recommended way for managing virtual environments in Python3.

  2. When developing a Python project, it is recommended that you use Poetry to manage …

Dataframe for JVM

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Spark DataFrame

Spark DataFrame is a great implementation of distributed DataFrame, if you don't mind having dependency on Spark. It can be used in a non-distributed way of course. Spark DataFrame …

Row-based Mapping and Filtering on DataFrames in Spark

Comments

Spark DataFrame is an alias to Dataset[Row]. Even though a Spark DataFrame is stored as Rows in a Dataset, built-in operations/functions (in org.apache.spark.sql.functions) for Spark DataFrame are Column-based. Sometimes, there might be transformations on a DataFrame that is hard to express as Column expressions but rather evey convenient to express as Row expressions. The traditional way to resolve this issue is to wrap the row-based function into a UDF. It is worthing knowing that Spark DataFrame supports map/flatMap APIs which works on Rows. They are still experimental as Spark 2.4.3. It is suggested that you stick to Column-based operations/functions until the Row-based methods mature.