Haosong Zhang, Shenxi Wu, Yichi Zhang, Wei Lin
Arithmetic-Mean $\mu$P provides a unified learning-rate scaling method for CNNs and ResNets, enabling consistent performance across varying network depths.
Choosing the right learning rate is crucial for training deep neural networks effectively, especially as they grow deeper and more complex. Traditional scaling rules struggle with modern architectures such as convolutional and residual networks, where update magnitudes can vary sharply from layer to layer. This paper introduces a new approach called Arithmetic-Mean $\mu$P, which constrains the average update across the entire network rather than enforcing a condition on every individual layer. The result is a learning-rate prescription that transfers reliably as network depth changes, simplifying training and improving performance without extensive per-depth tuning.
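To make the central idea concrete, here is a minimal illustrative sketch, not the paper's actual parametrization: given hypothetical per-layer update magnitudes, a single global learning rate is rescaled so that the *arithmetic mean* of updates across all layers hits a fixed target, regardless of depth. The function name `am_scaled_lr`, the magnitude lists, and the target value are all invented for illustration.

```python
def am_scaled_lr(update_mags: list[float], target_mean: float) -> float:
    """Pick a global learning rate eta so that the network-wide arithmetic
    mean of per-layer updates, mean(eta * m for m in update_mags), equals
    target_mean. (Illustrative only; the paper's rule may differ.)"""
    mean_mag = sum(update_mags) / len(update_mags)
    return target_mean / mean_mag

# Hypothetical per-layer raw update magnitudes for a shallow and a deep net.
# The deep net adds many layers with small raw updates, which would drag the
# average down under a fixed learning rate.
shallow = [1.0, 0.5, 0.25]
deep = [1.0, 0.5, 0.25] + [0.05] * 9

eta_shallow = am_scaled_lr(shallow, target_mean=0.01)
eta_deep = am_scaled_lr(deep, target_mean=0.01)

# Under the arithmetic-mean constraint, both depths realize the same
# average update, so one prescription transfers across depth.
mean_shallow = eta_shallow * sum(shallow) / len(shallow)
mean_deep = eta_deep * sum(deep) / len(deep)
print(mean_shallow, mean_deep)  # both equal 0.01
```

The point of the sketch is the invariant, not the numbers: fixing the mean update (rather than each layer's update) leaves one global degree of freedom, which is why a single learning rate can be tuned once and reused as depth changes.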