Ben Chuanlong Du's Blog

It is never too late to learn.

Optimization Methods in Machine Learning

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

L-BFGS converges faster and to better solutions on small datasets. However, Adam is very robust on relatively large datasets: it usually converges quickly and gives pretty good performance. SGD with momentum …
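As a concrete (if toy) illustration, scikit-learn's MLPClassifier exposes all three of these solvers, so the trade-off can be observed directly; the data and settings below are made-up examples, not a benchmark:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Made-up toy data just to compare solvers on the same task.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# "lbfgs" tends to shine on small datasets, "adam" is a robust default
# on larger ones, and "sgd" takes momentum via the `momentum` parameter.
for solver in ("lbfgs", "adam", "sgd"):
    clf = MLPClassifier(solver=solver, max_iter=500, random_state=42)
    clf.fit(X_train, y_train)
    print(solver, clf.score(X_test, y_test))
```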

Tips on XGBoost

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. It is suggested that you use the sklearn wrapper classes XGBClassifier and XGBRegressor so that you can fully leverage other tools in the sklearn ecosystem (see the sketch after this list).

  2. There are 2 types of boosters …
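A minimal sketch of why the wrapper classes pay off, assuming the xgboost package is installed; the data is a made-up example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Because XGBClassifier implements the sklearn estimator API, it plugs
# directly into sklearn utilities such as cross_val_score or GridSearchCV.
clf = XGBClassifier(n_estimators=100, booster="gbtree")  # tree booster; "gblinear" is the linear one
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```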

Libraries for Gradient Boosting

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

XGBoost

https://xgboost.ai/

XGBoost Documentation

Speeding Up XGBoost

https://machinelearningmastery.com/best-tune-multithreading-support-xgboost-python/

https://medium.com/data-design/xgboost-gpu-performance-on-low-end-gpu-vs-high-end-cpu-a7bc5fcd425b

XGBoost on GPU is fast. Very fast. As long as the data fits in RAM and …
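A hedged sketch of the two speedup knobs those links discuss, CPU multithreading and GPU training; note that the GPU parameter names vary across XGBoost versions:

```python
from xgboost import XGBClassifier

# n_jobs controls the number of CPU threads used for training.
clf_cpu = XGBClassifier(tree_method="hist", n_jobs=8)

# On XGBoost >= 2.0, GPU training is requested via device="cuda";
# older releases used tree_method="gpu_hist" instead.
clf_gpu = XGBClassifier(tree_method="hist", device="cuda")
```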

Ensemble Machine Learning Models

Prediction error is a trade-off between bias and variance. In statistics we often focus on unbiased estimators (especially in linear regression): we restrict the estimators/predictors to a (small) class and find the optimal solution within that class (the BLUE, best linear unbiased estimator, or BLUP, best linear unbiased predictor).
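Concretely, for squared loss the expected prediction error at a point decomposes into bias, variance, and irreducible noise (a standard identity; here \hat{f} is the fitted predictor and \sigma^2 the noise variance):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{variance}}
  + \sigma^2
```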

Generally speaking …

Tips on Spark MLlib

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. Spark MLlib's RDD-based API supports stratified sampling, but the DataFrame-based API has not implemented it yet as of Spark 2.4.3 (see the sketch below).

Sample keys (not rows) with equal probability
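A minimal PySpark sketch of stratified sampling through the RDD API's sampleByKey; the keys and fractions are made-up examples:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stratified-sampling").getOrCreate()

# Pair RDD of (key, value); the keys define the strata.
rdd = spark.sparkContext.parallelize(
    [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)]
)

# Keep roughly 50% of the "a" rows and all of the "b" rows.
fractions = {"a": 0.5, "b": 1.0}
sample = rdd.sampleByKey(withReplacement=False, fractions=fractions, seed=42)
print(sample.collect())
```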

References

https://spark …

Cross Validation in Machine Learning

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Training and Testing Data Sets

  • good when you have a large amount of data

  • usually use 1/5 to 1/3 of the data as the test set (see the sketch below)
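A minimal sketch of such a split with scikit-learn, holding out 1/5 of made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, random_state=0)

# Hold out 1/5 of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```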

K-fold CV

  • suitable when …
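A minimal K-fold CV sketch with scikit-learn (5 folds and LogisticRegression are just example choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Each of the 5 folds serves once as the held-out validation set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```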