Motivation, References, and Scope

These notes exist to support reading the fairscale source code, which relies on the torch.distributed module. Higher-level techniques such as DP/DDP/FSDP/Pipe/Tensor Parallel are all built on this infrastructure (e.g. the communication primitives), and only that infrastructure is covered here. The semantics of these interfaces are broadly analogous to MPI collectives.

Some additional resources on distributed training:

  • https://www.cnblogs.com/rossiXYZ/p/15815013.html
  • https://lilianweng.github.io/posts/2021-09-25-train-large/

On PyTorch FSDP:

  • (2021/07/15) Blog post introducing the early version implemented in FairSeq: https://engineering.fb.com/2021/07/15/open-source/fsdp/
  • (2022/03/14) Blog post announcing the FSDP API in PyTorch 1.11: https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/
  • PyTorch tutorials
    • Getting started: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
    • Advanced: https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html

APIs

  • process_group
import os
import torch
import torch.distributed as dist

# Initialize the default process_group; this must run at the very start of the program.
# rank and world_size are assumed to be provided by the launcher (e.g. torch.multiprocessing.spawn or torchrun).
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
dist.init_process_group("nccl", rank=rank, world_size=world_size)

# Retrieve the default process_group that was just created
dist.group.WORLD  # torch._C._distributed_c10d.ProcessGroupNCCL
isinstance(dist.group.WORLD, torch._C._distributed_c10d.ProcessGroup)  # True
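
As a usage sketch (the backend choice, tensor shape, port, and world_size below are illustrative assumptions; "gloo" is used so it runs on CPU-only machines), the default process group created above is what collective primitives such as dist.all_reduce operate on when no group argument is passed:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each spawned process joins the same default process group.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # A communication primitive on the default group: sum a tensor across all ranks.
    t = torch.ones(2) * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # implicitly uses dist.group.WORLD
    print(f"rank {rank}: {t}")  # every rank ends up with the same summed tensor

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)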