Comments¶
- After sorting, rows in a DataFrame are ordered by partition ID, and rows within each partition are sorted. This property can be leveraged to implement a global ranking of rows. For more details, please refer to Computing global rank of a row in a DataFrame with Spark SQL. However, note that multi-layer ranking is often more efficient than a global ranking in big data applications.
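The idea behind the partition-based global ranking can be sketched in plain Python (a hedged illustration only; the partition layout below is made up to mimic the result of Spark's range-partitioned sort, not produced by Spark):

```python
# Hedged sketch: after Spark sorts a DataFrame, partition 0 holds the
# smallest keys, partition 1 the next key range, and so on, with rows
# sorted inside each partition. The partitions below are an illustrative
# stand-in for that layout.
partitions = [[1, 2], [3, 5], [7, 8, 9]]

# Global rank = number of rows in earlier partitions
#             + rank within the row's own partition,
# so ranking needs only per-partition row counts, not another shuffle.
global_rank = {}
offset = 0
for part in partitions:
    for i, value in enumerate(part):
        global_rank[value] = offset + i + 1
    offset += len(part)

print(global_rank)  # {1: 1, 2: 2, 3: 3, 5: 4, 7: 5, 8: 6, 9: 7}
```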
In [2]:
import findspark
findspark.init("/opt/spark")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = (
SparkSession.builder.appName("PySpark_Sorting").enableHiveSupport().getOrCreate()
)
In [3]:
import pandas as pd
In [12]:
df_p = pd.DataFrame(
[
("Ben", "Du", 1),
("Ben", "Du", 2),
("Ken", "Xu", 1),
("Ken", "Xu", 9),
("Ben", "Tu", 3),
("Ben", "Tu", 4),
],
columns=["first_name", "last_name", "id"],
)
df_p
Out[12]:
In [13]:
df = spark.createDataFrame(df_p)
df.show()
In [14]:
df.orderBy(["first_name", "last_name"]).show()
Note: the ascending
keyword below cannot be omitted! Since orderBy takes columns as positional arguments, the list [False, False] must be passed by keyword, otherwise it is interpreted as another column.
In [16]:
df.orderBy(["first_name", "last_name"], ascending=[False, False]).show()
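Plain Python can mimic what ascending=[False, False] does (a hedged illustration of the resulting row order, not Spark's implementation): sorted with reverse=True on the (first_name, last_name) key yields the same ordering on the sample data.

```python
# Hedged illustration: the same two-key descending sort in plain Python,
# using the sample rows from the DataFrame above.
rows = [
    ("Ben", "Du", 1),
    ("Ben", "Du", 2),
    ("Ken", "Xu", 1),
    ("Ken", "Xu", 9),
    ("Ben", "Tu", 3),
    ("Ben", "Tu", 4),
]
# Sort descending on (first_name, last_name); sorted() is stable, so rows
# with equal keys keep their original relative order.
ordered = sorted(rows, key=lambda r: (r[0], r[1]), reverse=True)
for row in ordered:
    print(row)  # Ken Xu rows first, then Ben Tu, then Ben Du
```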