Hyunji Jung, Sungbin Shin, Namhoon Lee
This study addresses gradient staleness in asynchronous pipeline parallelism: by rotating the basis of the optimization problem to improve alignment, it accelerates convergence and enables faster training of large models.
In large-scale machine learning, asynchronous pipeline parallelism is a technique used to improve efficiency by keeping hardware busy. However, this method suffers from 'gradient staleness': gradients are computed on outdated model parameters and applied only after a delay, which slows convergence. The researchers found that the problem worsens as the pipeline gets deeper, which limits scalability. They propose a solution called 'basis rotation', which rotates the coordinate basis of the optimization problem to better align its mathematical structure, allowing faster and more stable training of large models.
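To make the staleness phenomenon concrete, the following minimal sketch (not the paper's method, and not an actual pipeline implementation) simulates delayed gradient updates on a toy least-squares problem: a gradient computed on parameters from several steps ago is applied to the current parameters, mimicking how a deeper pipeline increases the delay between computing and applying an update.

```python
import numpy as np

# Toy illustration of gradient staleness (hypothetical setup, not from the paper).
# With D pipeline stages kept busy, a micro-batch's gradient is typically applied
# several steps after the parameters it was computed on; we model that delay directly.

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

def grad(w):
    # Gradient of the least-squares loss 0.5 * ||A w - b||^2
    return A.T @ (A @ w - b)

def train(staleness, steps=1000, lr=0.002):
    """Gradient descent where each gradient is `staleness` steps out of date."""
    w = np.zeros(10)
    history = [w.copy()]  # past parameter versions
    for t in range(steps):
        # Evaluate the gradient on old parameters, as an asynchronous pipeline would.
        w_old = history[max(0, t - staleness)]
        w = w - lr * grad(w_old)
        history.append(w.copy())
    return 0.5 * np.linalg.norm(A @ w - b) ** 2

# Larger staleness (deeper pipeline) typically yields slower, less stable convergence.
for s in (0, 2, 8):
    print(f"staleness {s} -> final loss {train(s):.4f}")
```

The sketch only reproduces the problem the paper targets; how basis rotation reshapes the optimization problem to mitigate this delay is described in the paper itself.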