Damien Ferbach, Courtney Paquette, Gauthier Gidel, Katie Everett, Elliot Paquette
ADANA, an optimizer with time-varying hyperparameter schedules, improves large-scale language model training efficiency by up to 40% compared to AdamW.
When training large language models, the choice of hyperparameters is crucial for performance. Traditionally, several of AdamW's hyperparameters are held constant throughout training, but this research shows that varying them over time can lead to better results. Using logarithmic-time scheduling, the researchers developed a new optimizer named ADANA, which adjusts these hyperparameters as training progresses. The result is faster and more efficient language model training, with the efficiency gains growing as models become larger.
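As a rough illustration of the general idea, the sketch below replaces one normally-fixed AdamW hyperparameter (here beta2) with a hypothetical logarithmic-in-time ramp that is updated before every optimizer step. The specific schedules, hyperparameters, and functional forms that ADANA actually uses are not given in this summary, so every choice in the example (the scheduled quantity, endpoints, and ramp shape) is an assumption, not the paper's method.

```python
# Minimal sketch, assuming a hypothetical logarithmic-in-time ramp for beta2.
# This only illustrates "time-varying hyperparameter schedules" on top of a
# standard PyTorch AdamW; it is not the ADANA update rule itself.
import math
import torch

def log_time_schedule(step, start=0.95, end=0.999, total_steps=100_000):
    """Hypothetical schedule: interpolate between start and end on a
    logarithmic time axis, so the value changes quickly early in training
    and slowly later on."""
    t = math.log1p(step) / math.log1p(total_steps)  # progress in [0, 1]
    return start + (end - start) * min(t, 1.0)

model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))

for step in range(1_000):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()  # placeholder loss for the sketch
    optimizer.zero_grad()
    loss.backward()

    # Overwrite a normally-fixed hyperparameter before each step; PyTorch's
    # AdamW reads betas from the param group on every step() call.
    beta2 = log_time_schedule(step)
    for group in optimizer.param_groups:
        group["betas"] = (group["betas"][0], beta2)

    optimizer.step()
```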