Ben Chuanlong Du's Blog

It is never too late to learn.

Ensemble Machine Learning Models

Prediction error reflects a trade-off between bias and variance. In statistics we often focus on unbiased estimators, especially in linear regression. Doing so restricts the estimators/predictors to a (small) class and looks for the optimal solution within that class, known as the BLUE (best linear unbiased estimator) or BLUP (best linear unbiased predictor).
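
For squared-error loss the trade-off can be written out explicitly. The notation here is an assumption added for illustration, not the author's: suppose $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and let $\hat f$ be fitted on a random training set that is independent of $\varepsilon$. The expected prediction error at a point $x$ then splits into squared bias, variance, and irreducible noise:

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

Requiring unbiasedness sets the first term to zero, which leaves the variance as the only quantity to minimize within the restricted class.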

Generally speaking …

Tips on Spark MLlib

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. The RDD-based API of Spark MLlib supports stratified sampling, but the DataFrame-based API does not implement it as of Spark 2.4.3.

sample keys (not rows) with equal probability
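
As a rough illustration of what stratified sampling by key means, here is a minimal pure-Python sketch. It is not Spark code: the function name `stratified_sample_exact` and its parameters are hypothetical, and it only loosely mirrors the behavior of the RDD method `sampleByKeyExact`, assuming each stratum should contribute `ceil(fraction * stratum_size)` records.

```python
import math
import random
from collections import defaultdict


def stratified_sample_exact(pairs, fractions, seed=0):
    """Sample (key, value) pairs without replacement, stratum by stratum.

    Hypothetical helper for illustration only; not part of Spark.
    `fractions` maps each key to the share of its records to keep.
    Keys missing from `fractions` contribute no records.
    """
    rng = random.Random(seed)

    # Group the values by key so that each key forms one stratum.
    by_key = defaultdict(list)
    for key, value in pairs:
        by_key[key].append(value)

    sample = []
    for key, values in by_key.items():
        # Take exactly ceil(fraction * stratum size) records from this stratum.
        n_keep = math.ceil(fractions.get(key, 0.0) * len(values))
        sample.extend((key, v) for v in rng.sample(values, n_keep))
    return sample
```

For example, with 8 records under key "a" and 4 under key "b", the fractions `{"a": 0.5, "b": 0.25}` keep 4 and 1 records respectively, whatever the seed.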

References

https://spark …

Cross Validation in Machine Learning

Training and Testing Data Set

  • Works well when you have a large amount of data.

  • Typically 1/5 to 1/3 of the data is held out as the testing data set.
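
A simple hold-out split along these lines might look like the sketch below. The function name `holdout_split` and the default `test_fraction=0.2` (the 1/5 end of the range above) are choices made for illustration, not something prescribed by the notes.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into (train, test) lists.

    Hypothetical helper for illustration; `test_fraction` is the share
    of rows held out for testing.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")

    # Copy before shuffling so the caller's data is left untouched.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    # The first n_test shuffled rows become the testing data set.
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Calling `holdout_split(range(100))` yields 80 training rows and 20 testing rows; fixing `seed` makes the split reproducible.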

K-fold CV

  • suitable when …