(P1) torch.distributed: Overview and Low-Level Utilities
Motivation, References, and Scope
Reading the fairscale source code requires familiarity with the torch.distributed module. Higher-level techniques such as DP/DDP/FSDP/Pipe/Tensor Parallel are all built on this shared infrastructure (for example, the communication primitives), and this post covers only that infrastructure. The semantics of these interfaces appear broadly similar to those of MPI/OpenMP.
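To give a taste of how MPI-like these primitives are, here is a minimal sketch of an all-reduce across all ranks. It assumes the default process group has already been initialized as described in the APIs section below; with the nccl backend the tensor would additionally need to live on a GPU.

import torch
import torch.distributed as dist

# Each rank contributes its own rank id; after the call every rank holds the sum,
# mirroring MPI_Allreduce semantics. (Shown with a CPU tensor, i.e. the gloo backend;
# nccl requires CUDA tensors.)
t = torch.tensor([float(dist.get_rank())])
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: {t.item()}")  # sum of 0..world_size-1 on every rank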
Some additional resources on distributed training:
- https://www.cnblogs.com/rossiXYZ/p/15815013.html
- https://lilianweng.github.io/posts/2021-09-25-train-large/
On PyTorch FSDP:
- (2021/07/15) Blog post on the early FairSeq implementation: https://engineering.fb.com/2021/07/15/open-source/fsdp/
- (2022/03/14) PyTorch blog post introducing FSDP in version 1.11: https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/
- PyTorch tutorials
  - Getting started: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
  - Advanced: https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html
APIs
process_group
import os
import torch.distributed as dist
# Initialize the default process_group; this must happen at the start of the program.
# rank and world_size are typically supplied by the launcher (see the sketch below).
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
dist.init_process_group("nccl", rank=rank, world_size=world_size)
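The rank and world_size arguments above are not defined by the snippet itself; in a typical single-node setup they come from the launcher. A minimal sketch using torch.multiprocessing.spawn (the worker name run_worker and the world size of 2 are illustrative):

import os
import torch.distributed as dist
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    # spawn passes the process index as the first argument; it serves as the rank.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # ... communication / training code ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # illustrative: usually one process per GPU
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)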
import torch
import torch.distributed as dist
# Retrieve the default process_group that has already been created
dist.group.WORLD  # a torch._C._distributed_c10d.ProcessGroupNCCL instance for the "nccl" backend
isinstance(dist.group.WORLD, torch._C._distributed_c10d.ProcessGroup)  # True
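Collectives take an optional group argument that defaults to this default group; additional subgroups can be created with dist.new_group. A minimal sketch (the split into even ranks is purely illustrative):

import torch
import torch.distributed as dist

# new_group must be called by *all* processes, even those not included in the new group.
even_ranks = list(range(0, dist.get_world_size(), 2))
even_group = dist.new_group(ranks=even_ranks)

if dist.get_rank() in even_ranks:
    # nccl requires CUDA tensors; mapping rank to GPU index assumes a single node.
    t = torch.ones(1, device=f"cuda:{dist.get_rank()}")
    # group defaults to dist.group.WORLD; here the all_reduce is restricted to even ranks.
    dist.all_reduce(t, group=even_group)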