Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Dataframe for JVM
Spark DataFrame
Spark DataFrame is a great implementation of a distributed DataFrame, if you don't mind having a dependency on Spark. It can of course be used in a non-distributed way too. Spark DataFrame …
Row-based Mapping and Filtering on DataFrames in Spark
Comments
Spark DataFrame is an alias to Dataset[Row].
Even though a Spark DataFrame is stored as Rows in a Dataset,
built-in operations/functions (in org.apache.spark.sql.functions) for Spark DataFrame are Column-based.
Sometimes,
there are transformations on a DataFrame that are hard to express as Column expressions
but very convenient to express as Row expressions.
The traditional way to resolve this issue is to wrap the row-based function into a UDF.
It is worth knowing that Spark DataFrame supports the map/flatMap APIs,
which work on Rows.
They are still experimental as of Spark 2.4.3.
It is suggested that you stick to Column-based operations/functions until the Row-based methods mature.
Cut and qcut in pandas DataFrame
Row Object in Spark
Aggregation in pandas DataFrame
Comment
The order of elements within each group is preserved (as in the original order).
groupby works exactly the same on the index if the index is named.
The order of columns in groupby matters if you want to unstack the results later.
groupby works on columns too and it can group by some level of a MultiIndex.
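The notes above can be illustrated with a small sketch. The toy data and the column names (city, year, sales) are made up for illustration and do not come from the original post; the grouping over columns is left out here.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "city": ["NY", "LA", "NY", "LA", "NY"],
    "year": [2020, 2020, 2021, 2021, 2021],
    "sales": [5, 1, 3, 2, 4],
})

# 1. Rows keep their original relative order inside each group.
per_city = df.groupby("city")["sales"].apply(list)

# 2. A named index level can be used as a grouping key just like a column.
by_column = df.groupby("city")["sales"].sum()
by_index = df.set_index("city").groupby("city")["sales"].sum()

# 3. unstack() pivots the last grouping key into columns,
#    so the order of the keys decides the shape of the result.
city_rows = df.groupby(["city", "year"])["sales"].sum().unstack()
year_rows = df.groupby(["year", "city"])["sales"].sum().unstack()

# 4. Group by a single level of a MultiIndex.
indexed = df.set_index(["city", "year"])
by_year = indexed.groupby(level="year")["sales"].sum()

print(per_city)
print(city_rows)
print(year_rows)
print(by_year)
```

In this sketch, city_rows has cities as rows and years as columns, while year_rows is the transposed layout, which is why the key order matters before unstacking.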