Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
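- The RDD-based Spark MLlib API supports stratified sampling (`sampleByKey` and `sampleByKeyExact` on key-value RDDs), but the DataFrame-based MLlib API had not implemented it as of Spark 2.4.3. Outside MLlib, `DataFrame.stat.sampleBy` can do stratified sampling on a DataFrame column.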
- Sample keys (not rows) with equal probability.
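To make the idea of stratified sampling concrete, here is a minimal plain-Python sketch (not Spark code). It groups key-value pairs by key and draws a fixed fraction from each group without replacement, roughly in the spirit of exact per-key sampling in the RDD-based API. The function name `stratified_sample`, the example data, and the choice to drop keys that have no fraction are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def stratified_sample(rows, fractions, seed=42):
    """Sample without replacement from each stratum (key) separately.

    rows: iterable of (key, value) pairs.
    fractions: dict mapping a key to its sampling fraction in [0, 1].
    Keys missing from `fractions` are dropped (an assumption of this sketch).
    """
    rng = random.Random(seed)

    # Group the values by key so that each key forms one stratum.
    strata = defaultdict(list)
    for key, value in rows:
        strata[key].append(value)

    sample = []
    for key, values in strata.items():
        # Take ceil(fraction * stratum size) items from this stratum.
        k = math.ceil(fractions.get(key, 0.0) * len(values))
        sample.extend((key, v) for v in rng.sample(values, k))
    return sample


# Hypothetical data: a large stratum "a" and a small stratum "b".
rows = [("a", i) for i in range(100)] + [("b", i) for i in range(10)]
sample = stratified_sample(rows, {"a": 0.25, "b": 0.5})
```

Because each key is sampled on its own, the small stratum `"b"` is guaranteed to appear in the result, which a plain uniform sample over all rows does not guarantee.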
References
https://spark.apache.org/docs/latest/ml-guide.html
https://spark.apache.org/docs/latest/ml-statistics.html