Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
- Modules can hold parameters of different types on different devices, so it is not always possible to determine the device unambiguously. The recommended workflow in PyTorch is to create the device object separately and use it everywhere. However, if you know that all the parameters in a model are on the same device, you can use `next(model.parameters()).device` to get that device. In that situation, you can also use `next(model.parameters()).is_cuda` to check whether the model is on CUDA.
- It is suggested that you use the `.to` method to move a model/tensor to a specific device, e.g. `model.to("cuda")` or `tensor = tensor.to("cpu")`. Notice that `Module.to` is in-place while `Tensor.to` returns a copy! (See the sketch below.)
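A minimal sketch of both notes above, using a throwaway `torch.nn.Linear` as a stand-in for your model:

```python
import torch

model = torch.nn.Linear(4, 2)
device = next(model.parameters()).device          # valid if all parameters share one device
print(device, next(model.parameters()).is_cuda)   # cpu False

if torch.cuda.is_available():
    model.to("cuda")                              # Module.to moves the parameters in place
    cpu_tensor = torch.zeros(4)
    gpu_tensor = cpu_tensor.to("cuda")            # Tensor.to returns a new tensor (a copy)
    print(cpu_tensor.device, gpu_tensor.device)   # cpu cuda:0
    print(next(model.parameters()).is_cuda)       # True
```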
Functions for Managing Devices
- `torch.cuda.current_device`: Returns the index of the currently selected device.
- `torch.cuda.device`: Context manager that changes the selected device.
- `torch.cuda.device_count`: Returns the number of GPUs on the machine (whether or not they are busy).
- `torch.cuda.device_of`: Context manager that changes the current device to that of the given object.
- `torch.cuda.get_device_capability`: Gets the CUDA compute capability of a device.
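A quick sketch of how these functions can be called (return values depend on the machine; the comments show example outputs only):

```python
import torch

if torch.cuda.is_available():
    print(torch.cuda.device_count())              # e.g. 2 visible GPUs
    print(torch.cuda.current_device())            # e.g. 0
    print(torch.cuda.get_device_capability(0))    # e.g. (8, 0) for an A100

    # Temporarily select the last GPU, then ask which device a tensor lives on.
    with torch.cuda.device(torch.cuda.device_count() - 1):
        x = torch.ones(3, device="cuda")          # allocated on the selected device
    with torch.cuda.device_of(x):
        print(torch.cuda.current_device())        # index of the GPU holding x
```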
Use Multiple GPUs on the Same Machine
Below is a typical pattern of code to train/run your model on multiple GPUs.
```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)                      # move the parameters to GPU 0 (or the CPU)
model = torch.nn.DataParallel(model)  # replicate the model on all visible GPUs
model(data)                           # the input batch is split across the GPUs
```
- `torch.nn.DataParallel` parallelizes a model over GPU devices only. It does not matter which device the data is on when the model is wrapped by `torch.nn.DataParallel`; it can be on the CPU or on any GPU device, and it will get split and distributed to all GPU devices anyway (see the sketch below).
- If the GPU devices have different capabilities, it is best to have the most powerful GPU as device 0.
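A small sketch of the two notes above, assuming a machine with at least two GPUs and a throwaway `nn.Linear` as the model:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).to("cuda:0")
# List the most capable GPU first: device_ids[0] is the source/output device.
dp_model = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(64, 128)   # the input may live on the CPU (or any GPU) ...
y = dp_model(x)            # ... it is split and scattered to the GPUs anyway
print(y.device)            # the outputs are gathered on device_ids[0], i.e. cuda:0
```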
See also:
- Does DataParallel matters in CPU-mode
- My recurrent network doesn’t work with data parallelism
Use Multiple Processes or GPUs on Different Machines
https://pytorch.org/docs/stable/nn.html#distributeddataparallel
- Similar to `torch.nn.DataParallel`, `torch.nn.DistributedDataParallel` works for GPUs only.
- It is suggested that you spawn multiple processes on each node and have each process operate on a single GPU (see the setup sketch below).
- `nccl` is the suggested backend to use. If it is not available, use the `gloo` backend instead.
- If you use `torch.save` on one process to checkpoint the module and `torch.load` on some other processes to recover it, make sure that `map_location` is configured properly for every process. Without `map_location`, `torch.load` would recover the module onto the devices where it was saved from (see the checkpoint sketch below).
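A minimal sketch of the one-process-per-GPU setup described above, on a single node; `MyModel` and the address/port values are placeholders:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # Prefer the nccl backend; fall back to gloo when nccl is unavailable.
    backend = "nccl" if dist.is_nccl_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)               # each process drives exactly one GPU
    model = MyModel().to(rank)                # MyModel is a placeholder for your model
    ddp_model = DDP(model, device_ids=[rank])
    # ... training loop using ddp_model ...
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()    # one process per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```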
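And a sketch of the checkpointing note, meant to run inside `worker()` from the sketch above (single node, so `rank` doubles as the GPU index; the file name is a placeholder):

```python
CKPT = "checkpoint.pt"                        # hypothetical path shared by all processes

if dist.get_rank() == 0:                      # save from a single process only
    torch.save(ddp_model.module.state_dict(), CKPT)
dist.barrier()                                # make sure the file exists before loading

# Remap tensors saved from cuda:0 onto this process's own GPU.
map_location = {"cuda:0": f"cuda:{rank}"}
state = torch.load(CKPT, map_location=map_location)
ddp_model.module.load_state_dict(state)
```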
https://pytorch.org/docs/stable/distributed.html
References
[Feature Request] nn.Module should also get a device attribute #7460
Which device is model / tensor stored on?
How to get the device type of a pytorch module conveniently?