Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Training a Model Implemented in PyTorch
https://github.com/ray-project/ray/tree/master/python/ray/util/sgd/pytorch/examples
Distributed PyTorch Using Ray
RaySGD: Distributed Training Wrappers
Hyperparameter Optimization for Models Implemented in PyTorch
https://ray.readthedocs.io/en/latest/tune-examples.html
Does the following example run distributed or not? Do I need to use tags to tell Ray to run it on multiple machines?
import torch.optim as optim
from ray import tune
from ray.tune.examples.mnist_pytorch import (
    get_data_loaders, ConvNet, train, test)


def train_mnist(config):
    train_loader, test_loader = get_data_loaders()
    model = ConvNet()
    optimizer = optim.SGD(model.parameters(), lr=config["lr"])
    for i in range(10):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)
        tune.track.log(mean_accuracy=acc)


analysis = tune.run(
    train_mnist, config={"lr": tune.grid_search([0.001, 0.01, 0.1])})

print("Best config: ", analysis.get_best_config(metric="mean_accuracy"))

# Get a dataframe for analyzing trial results.
df = analysis.dataframe()
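To the question above: as written, this script is parallel across trials but not within a trial. `tune.run` schedules each of the three grid-search trials as its own Ray actor, and the trials spread over whatever Ray cluster the script is connected to (just the local machine if nothing else is configured); each individual `train_mnist` call is still single-process training. No special tags are needed, but `resources_per_trial` controls what each trial asks for. A minimal sketch, assuming a cluster already started with `ray start` and placeholder resource numbers:

```python
import ray
from ray import tune

# Connect to an existing Ray cluster (head node started with `ray start --head`,
# workers joined with `ray start --address=<head-ip>:<port>`). Without this,
# Ray runs everything on the local machine only.
ray.init(address="auto")

# Each trial becomes a Ray actor; resources_per_trial tells the scheduler what a
# single trial needs, so the three lr trials can land on different machines.
analysis = tune.run(
    train_mnist,  # the function defined in the snippet above
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
    resources_per_trial={"cpu": 2, "gpu": 0},  # placeholder values
)
```

To make a single trial itself distributed, one would reach for RaySGD (linked above) or torch.distributed inside the training function.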
- data parallelism vs. model parallelism
- use ring all-reduce (instead of a parameter server or peer-to-peer exchange) to synchronize gradients among processes (CPUs/GPUs on the same node or on different nodes); see the sketch after this list
- distributed optimization algorithms
    - synchronous SGD
    - asynchronous SGD
    - 1-bit SGD
    - the Hogwild! algorithm
    - Downpour SGD
    - synchronous SGD with large minibatches, to reduce how often parameters have to be synchronized
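Putting the pieces above together: in data parallelism every worker holds a full model replica and only gradients (or parameter updates) cross the network, and synchronous SGD with ring all-reduce is the combination that Horovod and torch.distributed (Gloo/NCCL backends) use by default. Below is a minimal sketch of synchronous data-parallel SGD using the standard torch.distributed API; the two-process setup, the toy linear model, and the random data are illustrative assumptions, not part of the notes above.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim


def run(rank, world_size):
    # One process per worker; each holds a full replica of the model (data parallelism).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 1)
    # Start all replicas from identical weights.
    for p in model.parameters():
        dist.broadcast(p.data, src=0)
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    # Toy local batch; in real training each rank would read a different data shard.
    x = torch.randn(8, 10)
    y = torch.randn(8, 1)

    for step in range(3):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()

        # Synchronous SGD: average gradients across all workers before stepping.
        # The Gloo/NCCL backends implement this collective with a ring all-reduce,
        # so per-worker communication stays roughly constant as workers are added.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

        optimizer.step()
        if rank == 0:
            print(f"step {step}, loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```

Asynchronous variants such as Hogwild! and Downpour SGD drop the all-reduce barrier and let workers update shared parameters independently, trading gradient staleness for throughput; 1-bit SGD instead keeps the synchronous structure but compresses the exchanged gradients.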
References
Parallel and Distributed Deep Learning
A Comparison of Distributed Machine Learning Platforms
Performance Analysis and Comparison of Distributed Machine Learning Systems
Multiprocessing failed with Torch.distributed.launch module
https://jdhao.github.io/2019/11/01/pytorch_distributed_training/
Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups
Distributed communication package - torch.distributed
Distributed data parallel training in Pytorch
Visual intuition on ring-Allreduce for distributed Deep Learning
Technologies behind Distributed Deep Learning: AllReduce
Writing Distributed Applications with PyTorch
https://github.com/ray-project/ray/issues/3609
https://github.com/ray-project/ray/issues/3520
Accelerating Deep Learning Using Distributed SGD — An Overview
Distributed training of Deep Learning models with PyTorch
Scalable Distributed DL Training: Batching Communication and Computation
https://github.com/dmmiller612/sparktorch
Awesome Distributed Deep Learning