Arguably, the MPI interface is one of the most successful abstractions in computer science: most communication patterns in HPC and AI can be expressed with it. In this article, I introduce common implementation algorithms for MPI collectives and then briefly compare their bandwidth and latency costs.
Suppose that each node's inbound and outbound bandwidths are equal (i.e., the link is bidirectional), and that the interconnect latency is a fixed constant.
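To make these two parameters concrete, here is a minimal sketch of the usual latency-plus-bandwidth cost model. The names `bandwidth`, `latency`, `transfer_time`, and `ring_allreduce_time` are my own placeholders for illustration, not notation from the references. The ring allreduce estimate assumes the textbook schedule (a reduce-scatter followed by an allgather, each taking one step per remaining node) and ignores the cost of the reduction arithmetic itself.

```python
def transfer_time(message_bytes: float, bandwidth: float, latency: float) -> float:
    """Estimated time to send one message: a fixed latency plus size / bandwidth.

    bandwidth is in bytes per second and latency is in seconds (assumed units).
    """
    return latency + message_bytes / bandwidth


def ring_allreduce_time(message_bytes: float, num_nodes: int,
                        bandwidth: float, latency: float) -> float:
    """Estimated time of a ring allreduce under the model above.

    Assumes the textbook schedule: (num_nodes - 1) reduce-scatter steps and
    (num_nodes - 1) allgather steps, each sending one chunk of
    message_bytes / num_nodes bytes. Reduction compute time is ignored.
    """
    if num_nodes < 2:
        return 0.0  # a single node has nothing to communicate
    chunk_bytes = message_bytes / num_nodes
    steps = 2 * (num_nodes - 1)
    return steps * transfer_time(chunk_bytes, bandwidth, latency)


if __name__ == "__main__":
    # Hypothetical numbers: 1 GB message, 4 nodes, 1 GB/s links, 1 microsecond latency.
    estimate = ring_allreduce_time(1e9, 4, 1e9, 1e-6)
    print(f"estimated ring allreduce time: {estimate:.6f} s")
```

The point of the model is that each algorithm's cost splits into a latency term (how many steps it takes) and a bandwidth term (how many bytes each node moves), which is the basis for comparing them.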
References
- http://blog.sysu.tech/Research/Allreduce%E7%AE%97%E6%B3%95%E8%B0%83%E7%A0%94/ by Dr. Guangnan Feng (in Chinese)
- https://zhuanlan.zhihu.com/p/469942194 (in Chinese)