DataFrame.replace works differently from DataFrame.update.
Sort a pandas DataFrame
Subtle Differences Between Spark DataFrame and PySpark DataFrame
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Besides using the col function to reference a column, Spark/Scala DataFrame supports using $"col_name" (based on implicit conversions; requires import spark.implicits._), while PySpark DataFrame supports using …
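A minimal sketch of the equivalent column-reference styles in Scala (the local SparkSession setup and the column names id/name are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("col-refs").getOrCreate()
// The $"..." syntax is enabled by implicit conversions defined on the SparkSession instance.
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

df.select(col("id")).show()  // the col function
df.select($"id").show()      // $"..." interpolation, requires import spark.implicits._
df.select(df("id")).show()   // apply on the DataFrame itself
```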
Aggregate DataFrames in Spark
Aggregation Without Grouping
You can aggregate all values in the Columns of a DataFrame. Just use aggregation functions in select without groupBy, which is very similar to SQL syntax. The boolean aggregation functions every and any are available since Spark 3.0. However, they can be achieved using other aggregation functions such as sum.
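A minimal sketch of both approaches (the DataFrame and column names x/flag are made up; the bool_and/bool_or expressions assume Spark 3.0+):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("agg-no-group").getOrCreate()
import spark.implicits._

val df = Seq((1, true), (2, true), (3, false)).toDF("x", "flag")

// Aggregation functions in select without groupBy aggregate over the whole DataFrame.
df.select(count("*").as("n"), sum($"x").as("sum_x"), max($"x").as("max_x")).show()

// Spark 3.0+: the bool_and/bool_or aggregates (every/any are aliases) via SQL expressions.
df.select(expr("bool_and(flag)").as("every_flag"), expr("bool_or(flag)").as("any_flag")).show()

// The same results achieved with sum, as the note suggests:
// every(flag) <=> sum(int(flag)) == count(flag); any(flag) <=> sum(int(flag)) > 0.
df.select(
  (sum($"flag".cast("int")) === count($"flag")).as("every_flag"),
  (sum($"flag".cast("int")) > 0).as("any_flag")
).show()
```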
DataFrame for JVM
Spark DataFrame
Spark DataFrame is a great implementation of a distributed DataFrame if you don't mind having a dependency on Spark. It can, of course, be used in a non-distributed way. Spark DataFrame …
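For example, a minimal sketch of the non-distributed use case: create a SparkSession with a local master so that the driver and executors all run in a single JVM (the app name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// Local, non-distributed mode: no cluster needed,
// local[*] uses all available cores on the machine.
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("local-only")
  .getOrCreate()

import spark.implicits._
val df = Seq(1, 2, 3).toDF("x")
df.show()
```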
Row-based Mapping and Filtering on DataFrames in Spark
Spark DataFrame is an alias for Dataset[Row].
Even though a Spark DataFrame is stored as Rows in a Dataset, the built-in operations/functions (in org.apache.spark.sql.functions) for Spark DataFrame are Column-based. Sometimes there are transformations on a DataFrame that are hard to express as Column expressions but very convenient to express as Row expressions. The traditional way to resolve this issue is to wrap the row-based function into a UDF.
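A minimal sketch of the UDF approach (the logic and the column names x/y are made up; the UDF takes the column values it needs rather than a whole Row):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("udf-wrap").getOrCreate()
import spark.implicits._

val df = Seq((2, 3), (4, 5)).toDF("x", "y")

// Row-level logic that is awkward as a Column expression,
// wrapped into a UDF over the columns it needs.
val categorize = udf((x: Int, y: Int) => if (x * y > 10) "big" else "small")

df.withColumn("size", categorize($"x", $"y")).show()
```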
It is worth knowing that Spark DataFrame supports map/flatMap APIs which work on Rows. They are still experimental as of Spark 2.4.3, so it is suggested that you stick to Column-based operations/functions until the Row-based methods mature.
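A minimal sketch of the Row-based APIs (the data is made up; the Encoders for the tuple and String results come from spark.implicits._):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("row-map").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// map: one output record per input Row; the tuple Encoder is supplied implicitly.
val mapped = df.map(row => (row.getInt(0) * 10, row.getString(1).toUpperCase))
  .toDF("id10", "name_upper")
mapped.show()

// flatMap: zero or more output records per input Row.
val flattened = df.flatMap(row => Seq.fill(row.getInt(0))(row.getString(1)))
  .toDF("name")
flattened.show()
```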