Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Training a Model Implemented in PyTorch
https://github.com/ray-project/ray/tree/master/python/ray/util/sgd/pytorch/examples
Distributed PyTorch Using Ray
RaySGD: Distributed Training Wrappers
Hyperparameter Optimization for Models Implemented in PyTorch
https://ray.readthedocs.io/en/latest/tune-examples.html
Does the following example run distributed or not? Do I need to use tags to tell Ray to run it on multiple machines?
import torch.optim as optim
from ray import tune
from ray.tune.examples.mnist_pytorch import (
    get_data_loaders, ConvNet, train, test)


def train_mnist(config):
    train_loader, test_loader = get_data_loaders()
    model = ConvNet()
    optimizer = optim.SGD(model.parameters(), lr=config["lr"])
    for i in range(10):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)
        tune.track.log(mean_accuracy=acc)


analysis = tune.run(
    train_mnist, config={"lr": tune.grid_search([0.001, 0.01, 0.1])})

print("Best config: ", analysis.get_best_config(metric="mean_accuracy"))

# Get a dataframe for analyzing trial results.
df = analysis.dataframe()
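To the question above: as written, this script is parallel across trials but not within a trial. `tune.run` schedules each of the three grid-search trials as its own Ray actor, and the trials spread over whatever Ray cluster the script is connected to (just the local machine if nothing else is configured); each individual `train_mnist` call is still single-process training. No special tags are needed, but `resources_per_trial` controls what each trial asks for. A minimal sketch, assuming a cluster already started with `ray start` and placeholder resource numbers:

```python
import ray
from ray import tune

# Connect to an existing Ray cluster (head node started with `ray start --head`,
# workers joined with `ray start --address=<head-ip>:<port>`). Without this,
# Ray runs everything on the local machine only.
ray.init(address="auto")

# Each trial becomes a Ray actor; resources_per_trial tells the scheduler what a
# single trial needs, so the three lr trials can land on different machines.
analysis = tune.run(
    train_mnist,  # the function defined in the snippet above
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
    resources_per_trial={"cpu": 2, "gpu": 0},  # placeholder values
)
```

To make a single trial itself distributed, one would reach for RaySGD (linked above) or torch.distributed inside the training function.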
- data parallelism vs. model parallelism
- use ring all-reduce (instead of a parameter server or peer-to-peer exchange) to synchronize gradients among processes (CPUs/GPUs on the same node or on different nodes); see the sketch after this list
- distributed optimization algorithms
    - synchronous SGD
    - asynchronous SGD
    - 1-bit SGD
    - the Hogwild! algorithm
    - Downpour SGD
    - synchronous SGD with large minibatches, to reduce how often parameters have to be synchronized
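Putting the pieces above together: in data parallelism every worker holds a full model replica and only gradients (or parameter updates) cross the network, and synchronous SGD with ring all-reduce is the combination that Horovod and torch.distributed (Gloo/NCCL backends) use by default. Below is a minimal sketch of synchronous data-parallel SGD using the standard torch.distributed API; the two-process setup, the toy linear model, and the random data are illustrative assumptions, not part of the notes above.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim


def run(rank, world_size):
    # One process per worker; each holds a full replica of the model (data parallelism).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 1)
    # Start all replicas from identical weights.
    for p in model.parameters():
        dist.broadcast(p.data, src=0)
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    # Toy local batch; in real training each rank would read a different data shard.
    x = torch.randn(8, 10)
    y = torch.randn(8, 1)

    for step in range(3):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()

        # Synchronous SGD: average gradients across all workers before stepping.
        # The Gloo/NCCL backends implement this collective with a ring all-reduce,
        # so per-worker communication stays roughly constant as workers are added.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

        optimizer.step()
        if rank == 0:
            print(f"step {step}, loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```

Asynchronous variants such as Hogwild! and Downpour SGD drop the all-reduce barrier and let workers update shared parameters independently, trading gradient staleness for throughput; 1-bit SGD instead keeps the synchronous structure but compresses the exchanged gradients.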
References
Parallel and Distributed Deep Learning
A Comparison of Distributed Machine Learning Platforms
Performance Analysis and Comparison of Distributed Machine Learning Systems
Multiprocessing failed with Torch.distributed.launch module
https://jdhao.github.io/2019/11/01/pytorch_distributed_training/
Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups
Distributed communication package - torch.distributed
Distributed data parallel training in Pytorch
Visual intuition on ring-Allreduce for distributed Deep Learning
Technologies behind Distributed Deep Learning: AllReduce
Writing Distributed Applications with PyTorch
https://github.com/ray-project/ray/issues/3609
https://github.com/ray-project/ray/issues/3520
Accelerating Deep Learning Using Distributed SGD — An Overview
Distributed training of Deep Learning models with PyTorch
Scalable Distributed DL Training: Batching Communication and Computation
https://github.com/dmmiller612/sparktorch
Awesome Distributed Deep Learning