Tips on Spark MLlib

May 16, 2019

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

The RDD-based API of Spark MLlib supports stratified sampling (sampleByKey and sampleByKeyExact on pair RDDs), but the DataFrame-based API has not implemented it as of Spark 2.4.3.
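To make the idea concrete without a cluster, here is a minimal plain-Python sketch of per-key (stratified) Bernoulli sampling, which is roughly what sampleByKey does without replacement. The function name sample_by_key, the toy data and the choice to treat keys missing from the fractions map as 0.0 are assumptions made for illustration, not Spark's implementation. On a PySpark pair RDD the corresponding call would be rdd.sampleByKey(False, fractions, seed).

```python
import random
from collections import Counter


def sample_by_key(pairs, fractions, seed=None):
    """Keep each (key, value) pair independently with probability fractions[key].

    Illustrative stand-in for stratified sampling on a pair RDD.
    Assumption of this sketch: keys absent from `fractions` are never sampled.
    """
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions.get(k, 0.0)]


# Toy data: stratum "a" is ten times larger than stratum "b".
pairs = [("a", i) for i in range(1000)] + [("b", i) for i in range(100)]

# A different sampling fraction per stratum.
fractions = {"a": 0.1, "b": 0.5}

sample = sample_by_key(pairs, fractions, seed=42)
counts = Counter(k for k, _ in sample)
print(counts)  # sizes are approximate: about 100 for "a" and about 50 for "b"
```

Because every row is kept or dropped independently, the per-stratum sample sizes are only approximately fraction times stratum size. That is the same trade-off the RDD API makes between sampleByKey (approximate) and sampleByKeyExact (exact sizes at extra cost).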

sample keys (not rows) with equal probability
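One possible reading of this note is: pick each distinct key with the same probability, then keep every row that belongs to a picked key. The sketch below shows that reading in plain Python. The function name sample_keys and the toy data are hypothetical, and this is not an existing Spark API.

```python
import random


def sample_keys(pairs, fraction, seed=None):
    """Select each distinct key with probability `fraction`, then keep ALL rows of selected keys.

    Hypothetical helper: sampling happens at the key level, so a key with
    many rows is no more likely to be chosen than a key with a single row.
    """
    rng = random.Random(seed)
    # Sort the distinct keys so the result is reproducible for a given seed
    # (assumes keys are mutually comparable, e.g. all strings).
    distinct_keys = sorted({k for k, _ in pairs})
    chosen = {k for k in distinct_keys if rng.random() < fraction}
    return [(k, v) for k, v in pairs if k in chosen]


# Toy data: key "k0" has 1 row, "k1" has 2 rows, ..., "k9" has 10 rows.
rows = [("k%d" % j, i) for j in range(10) for i in range(j + 1)]

picked = sample_keys(rows, fraction=0.3, seed=7)
picked_keys = {k for k, _ in picked}
print(sorted(picked_keys))
```

In Spark, the same idea could be approximated by sampling the distinct keys first and then joining or filtering the original data on the sampled keys.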

References

https://spark.apache.org/docs/latest/ml-guide.html

https://spark.apache.org/docs/latest/ml-statistics.html
