Ben Chuanlong Du's Blog


Tips on Spark MLlib

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

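  1. The RDD-based API supports stratified sampling (sampleByKey and sampleByKeyExact on pair RDDs), but the DataFrame-based MLlib API has not implemented an equivalent as of Spark 2.4.3. Outside MLlib, DataFrame.stat.sampleBy offers approximate stratified sampling on a single column.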

sample keys (not rows) with equal probability
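
To make the two ideas above concrete, here is a minimal plain-Python sketch. It does not call Spark. The function names sample_by_key and sample_keys, the toy rows, and the fractions are all hypothetical and chosen only for illustration. The first function keeps each row independently with a probability that depends on its key, which is the general idea of stratified sampling by key. The second is one possible reading of the note on sampling keys rather than rows: each distinct key is selected with the same probability, and every row of a selected key is kept.

```python
import random


def sample_by_key(rows, fractions, seed=None):
    """Keep each (key, value) row independently with probability fractions[key].

    Keys missing from `fractions` are treated as having fraction 0.0.
    """
    rng = random.Random(seed)
    return [(k, v) for k, v in rows if rng.random() < fractions.get(k, 0.0)]


def sample_keys(rows, fraction, seed=None):
    """Select each distinct key with the same probability, then keep all its rows."""
    rng = random.Random(seed)
    # Sort the distinct keys so the result is reproducible for a given seed.
    keys = sorted({k for k, _ in rows})
    kept = {k for k in keys if rng.random() < fraction}
    return [(k, v) for k, v in rows if k in kept]


# Toy data: a large stratum "a" and a small stratum "b".
rows = [("a", i) for i in range(1000)] + [("b", i) for i in range(100)]

# Row-level stratified sample: roughly 10% of "a" rows and 50% of "b" rows.
stratified = sample_by_key(rows, {"a": 0.1, "b": 0.5}, seed=42)

# Key-level sample: each key is either kept with all of its rows or dropped.
by_key = sample_keys(rows, 0.5, seed=42)
```

The row-level version gives sample sizes that are only approximately proportional to the requested fractions, since each row is an independent coin flip. The key-level version ignores how many rows a key has, so large and small keys are equally likely to appear in the result.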

References

https://spark.apache.org/docs/latest/ml-guide.html

https://spark.apache.org/docs/latest/ml-statistics.html
