
Train PyTorch Distributedly Using Apache Ray

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Training a Model Implemented in PyTorch

https://github.com/ray-project/ray/tree/master/python/ray/util/sgd/pytorch/examples

Distributed PyTorch Using Apache Ray

RaySGD: Distributed Training Wrappers
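
The exact RaySGD wrapper API has moved around between Ray releases, so the following is only a rough sketch (not the linked example): it assumes a version where ray.util.sgd exposes TorchTrainer with creator functions, and it uses a toy model and random data as placeholders.

import ray
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd import TorchTrainer


def model_creator(config):
    # Placeholder model; replace with your own nn.Module.
    return nn.Linear(1, 1)


def data_creator(config):
    # Return train and validation loaders; random data as a placeholder.
    # (Some Ray versions expect Datasets here instead of DataLoaders.)
    dataset = TensorDataset(torch.randn(256, 1), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=32)
    return loader, loader


def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=config.get("lr", 0.01))


ray.init()
trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_workers=2,  # number of distributed training workers
    use_gpu=False,
)
stats = trainer.train()  # one pass over the training data on every worker
print(stats)
trainer.shutdown()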

Hyperparameter Optimization for Models Implemented in PyTorch

https://ray.readthedocs.io/en/latest/tune-examples.html

Does the following example run in a distributed fashion or not? Do I need to use tags to tell Ray to run it on multiple machines?

import torch.optim as optim
from ray import tune
from ray.tune.examples.mnist_pytorch import (
    get_data_loaders, ConvNet, train, test)


def train_mnist(config):
    # Tune calls this function once per trial with one sampled config.
    train_loader, test_loader = get_data_loaders()
    model = ConvNet()
    optimizer = optim.SGD(model.parameters(), lr=config["lr"])
    for i in range(10):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)
        # Report the metric back to Tune after each epoch.
        tune.track.log(mean_accuracy=acc)


# Launch one trial per value in the learning-rate grid.
analysis = tune.run(
    train_mnist, config={"lr": tune.grid_search([0.001, 0.01, 0.1])})

print("Best config: ", analysis.get_best_config(metric="mean_accuracy"))

# Get a DataFrame for analyzing trial results.
df = analysis.dataframe()
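
My understanding is that Tune launches each trial as a Ray actor, so the trials above already run in parallel (one per learning rate), and on a multi-node Ray cluster the Ray scheduler spreads them across machines. There is no special tag; you connect to the cluster and request resources per trial. A rough sketch (the cluster address and resource numbers are placeholders):

import ray
from ray import tune

# Connect to an existing Ray cluster instead of starting a local one.
ray.init(address="auto")

analysis = tune.run(
    train_mnist,
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
    resources_per_trial={"cpu": 2, "gpu": 1},  # resources requested per trial
)

Note that each trial itself is still a single-process training loop; to make the training inside a trial distributed, it has to be wrapped with something like RaySGD.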

Notes on Distributed Training

  • data parallelism vs model parallelism
  • use Ring Allreduce (rather than a Parameter Server or peer-to-peer exchange) for synchronization among processes (CPUs/GPUs on the same node or on different nodes); see the allreduce sketch after this list
  • distributed optimization algorithms
    • synchronous SGD
    • asynchronous SGD
    • 1-bit SGD
    • the Hogwild! algorithm
    • Downpour SGD
    • synchronous SGD with large minibatches to reduce the parameter update frequency
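
A bare-bones illustration of allreduce-based gradient synchronization (essentially synchronous SGD), along the lines of the "Writing Distributed Applications with PyTorch" tutorial listed below. The process group is assumed to be initialized already, e.g. by torch.distributed.launch.

import torch.distributed as dist


def average_gradients(model):
    # Average gradients across all workers after loss.backward().
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is not None:
            # Backends such as NCCL and Gloo implement this collective
            # with a ring allreduce under the hood.
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size


# Typical training step on every worker:
#     optimizer.zero_grad()
#     loss = criterion(model(batch_x), batch_y)
#     loss.backward()
#     average_gradients(model)  # synchronize gradients before the update
#     optimizer.step()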

References

Parallel and Distributed Deep Learning

A Comparison of Distributed Machine Learning Platforms

Performance Analysis and Comparison of Distributed Machine Learning Systems

Multiprocessing failed with torch.distributed.launch module

https://jdhao.github.io/2019/11/01/pytorch_distributed_training/

Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups

Distributed Communication Package - torch.distributed

DistributedDataParallel

Distributed data parallel training in Pytorch

Visual intuition on ring-Allreduce for distributed Deep Learning

Technologies behind Distributed Deep Learning: AllReduce

Writing Distributed Applications with PyTorch

https://github.com/ray-project/ray/issues/3609

https://github.com/ray-project/ray/issues/3520

Accelerating Deep Learning Using Distributed SGD — An Overview

Distributed training of Deep Learning models with PyTorch

Scalable Distributed DL Training: Batching Communication and Computation

https://github.com/dmmiller612/sparktorch

Awesome Distributed Deep Learning

Intro to Distributed Deep Learning Systems

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
