Chuanghao Ding, Jiaping Wang, Ziqing Yang, Xiaoliang Wang, Dahua Lin, Cam-Tu Nguyen, Fei Tan
Consultant Decoding (CD) improves inference speed and quality for large language models by using token-level likelihoods for draft verification, achieving significant efficiency gains over traditional speculative decoding.
Consultant Decoding (CD) is a new method that makes large language models run faster without losing output quality. It improves on an existing method called Speculative Decoding, whose verification step often rejects drafted tokens and forces slow re-checks by the large model. CD instead judges each drafted token by the likelihood the large model assigns to it, so more drafts are accepted and results arrive faster at the same quality. Surprisingly, CD works well even when the two models differ greatly in size, and it reduces how often the largest model must be called, making it more efficient for complex tasks.
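The likelihood-based verification idea can be illustrated with a minimal sketch. This is not the paper's actual algorithm: the function name, the acceptance threshold, and the stop-at-first-rejection policy are all illustrative assumptions; the only grounded element is that draft tokens are judged by the token-level likelihood the large model assigns to them.

```python
import math

def consultant_verify(draft_tokens, target_logprobs, threshold=math.log(0.1)):
    """Toy sketch: keep each drafted token if the large (target) model
    assigns it a log-likelihood above `threshold` (a hypothetical
    parameter), stopping at the first rejected token. Contrast with
    speculative decoding, which verifies drafts via rejection sampling."""
    accepted = []
    for tok, logprob in zip(draft_tokens, target_logprobs):
        if logprob >= threshold:  # large model "approves" this draft token
            accepted.append(tok)
        else:
            break  # first rejection ends the accepted prefix
    return accepted

# Example: draft tokens with the target model's log-probabilities for each
tokens = ["The", "cat", "sat", "down"]
logprobs = [math.log(0.9), math.log(0.5), math.log(0.05), math.log(0.7)]
print(consultant_verify(tokens, logprobs))  # → ['The', 'cat']
```

In this sketch the third token's probability (0.05) falls below the 0.1 cutoff, so only the first two drafts survive; accepting long prefixes like this is what lets the small model do most of the generation work.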