The contents of this page are the author's fragmentary and immature notes and thoughts. Please read with your own judgement!
- GPUs are more accessible to the average individual, and they are still the main tool for deep learning right now.
- Python distributed computing frameworks (Ray, Modin, etc.) serve as a middle-ground solution between a single GPU and Spark: they can handle more data than a GPU but less than Spark, while being easier to use and maintain than Spark.
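As a taste of how lightweight these frameworks are, here is a minimal Ray sketch (assuming `ray` is installed; the `square` task is purely illustrative):

```python
import ray

ray.init()  # starts a local Ray runtime; connects to a cluster if one is configured

@ray.remote
def square(x):
    # Each call becomes a task scheduled across the available cores/nodes.
    return x * x

# Launch 8 tasks in parallel and block until all results arrive.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```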
- Even though there are many libraries that make it possible to run deep learning on Spark, I still don't think it is the right choice unless you have data so large that no other framework can handle it, and such situations are rare. Truly big data mostly occurs in the ETL and preprocessing stages rather than in the model-training stage.
- Python and Rust are good choices. C is not productive, and C++ is too complicated. JVM-based languages are first-class citizens for production. Rust seems to have a bright future.
- As Kubernetes develops, there will be distributed computing frameworks that do not limit you to a specific language. Once that becomes common, people will start shifting away from JVM languages toward solutions that perform better and are easier to use. Rust is a good language choice for performance, while Python is a good, easy-to-use glue language.
Machine Learning Frameworks
scikit-learn
LightGBM / XGBoost
PyTorch
TensorFlow 2
Caffe2 is often used to productionize models trained in PyTorch, and it is now part of the PyTorch project.
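Since Caffe2 merged into PyTorch, the usual export path for putting a PyTorch model into production is TorchScript. A minimal sketch (the toy model and file name here are illustrative):

```python
import torch

# Toy model standing in for a trained network.
model = torch.nn.Linear(4, 2).eval()
example_input = torch.randn(1, 4)

# Trace the model into TorchScript so it can run without Python,
# e.g. from the C++ libtorch runtime or on mobile.
traced = torch.jit.trace(model, example_input)
traced.save("model.pt")
```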
Note that H2O-3 (less popular and lower quality than the libraries above), AI-Blocks, and NVIDIA DIGITS provide user-friendly UIs for training models.
Computing Frameworks
Multi-threading and multi-processing are not discussed here since they are relatively simple in the context of scientific computing.
GPU
DeepSpeed, with its ZeRO optimizer, is a deep learning optimization library that makes distributed training on GPU clusters easy, efficient, and effective.
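A minimal sketch of wrapping a model with DeepSpeed and ZeRO stage 2 (the toy model and config values are illustrative; real runs are usually launched with the `deepspeed` CLI):

```python
import torch
import deepspeed

# Toy model; in practice this is your full network.
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_batch_size": 32,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 2},  # shard optimizer states + gradients
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# A training step then looks like:
#   loss = model_engine(batch).sum()
#   model_engine.backward(loss)
#   model_engine.step()
```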
Ray
Python Distributed Computing Frameworks (Ray, Celery, Dask, Modin, etc.)
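Modin illustrates how low the entry barrier is: it parallelizes pandas behind a one-line import change (the file and column names below are hypothetical):

```python
# Requires `pip install "modin[ray]"`; Modin partitions the DataFrame
# and executes pandas operations in parallel on a Ray (or Dask) backend.
import modin.pandas as pd

df = pd.read_csv("data.csv")       # hypothetical input file
print(df.groupby("key").sum())     # same pandas API, parallel execution
```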
Spark
TPU
Model Serving
https://github.com/cortexlabs/cortex
Multi-framework machine learning model serving infrastructure.
Ray Serve (a minimal sketch follows this list)
TFX
TorchServe
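For Ray Serve, a minimal deployment sketch (assumes `ray[serve]` is installed; the `Echo` deployment is purely illustrative):

```python
from ray import serve

@serve.deployment
class Echo:
    async def __call__(self, request):
        # `request` is a Starlette Request; echo the JSON body back.
        body = await request.json()
        return {"echo": body}

# Starts Serve and exposes the deployment over HTTP at 127.0.0.1:8000.
serve.run(Echo.bind())
```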
Programming Languages
Python
Rust
JVM (Java, Scala, Kotlin)
C/C++
References
- [Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey](https://link.springer.com/article/10.1007/s10462-018-09679-z)
- caffe2/AICamera is a demonstration of using Caffe2 inside an Android application.
- Accelerating Deep Learning Using Distributed SGD — An Overview
- Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
- Scalable Distributed DL Training: Batching Communication and Computation
- Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools
- A Hitchhiker’s Guide On Distributed Training of Deep Neural Networks
- Distributed training of Deep Learning models with PyTorch
- Awesome Distributed Deep Learning