
Speculative decoding

Speculative decoding is a technique for accelerating LLM inference and serving.

Paper link

Blog link

Key idea

During inference, use a low-cost smaller model to propose tokens, and only use the large model to check them, replacing the tokens where it does not agree with the smaller model.

Efficiency

It is not a technique for saving compute, but a way to process multiple tokens in parallel so that we save wall-clock time. It consumes the same amount of compute (or slightly more) and requires more GPU capacity at once, since verification runs in parallel rather than in sequence. The small overhead comes from running the smaller model.
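To make the time saving concrete, here is a rough back-of-the-envelope sketch. The acceptance rate alpha and draft length gamma below are illustrative assumptions, not numbers from this note: if each drafted token is accepted with probability alpha and the smaller model drafts gamma tokens per round, one large-model verification step yields about (1 - alpha^(gamma+1)) / (1 - alpha) tokens instead of 1.

```python
# Back-of-the-envelope: expected tokens produced per large-model call.
# alpha (per-token acceptance probability) and gamma (draft length) are
# illustrative assumptions, not numbers from this note.
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    # Draft tokens accepted before the first rejection (capped at gamma),
    # plus the one token the large model itself contributes.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

print(expected_tokens_per_step(alpha=0.8, gamma=4))  # ~3.4 tokens per call, vs. 1 without drafting
```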

The process

The smaller model generates n tokens: (x1, x2, ..., xn).

Put them into n prefix sequences: (x1), (x1, x2), (x1, x2, x3), ..., (x1, x2, ..., xn).

Feed these n sequences to the large model in parallel.

Compare the large model's output probability over the vocabulary for each of the n next tokens with that of the smaller model.

If the larger model assigns a lower probability to a proposed token than the smaller model did, which means the larger model does not agree with that token, replace it with a token the larger model prefers.

If a token is replaced, all of the smaller model's tokens that follow it are discarded.

Repeat this draft-and-verify process for the entire inference session.
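Below is a minimal, runnable sketch of this draft-and-verify loop. The toy stand-ins VOCAB, toy_model, small_model, large_model, and gamma are illustrative assumptions, not part of this note. It follows the deterministic accept rule described above (keep a drafted token only if the large model is at least as confident in it as the smaller model); the original speculative decoding papers instead accept probabilistically with probability min(1, p/q) and resample rejected tokens from an adjusted distribution.

```python
import numpy as np

VOCAB = 16  # toy vocabulary size (illustrative assumption)

def toy_model(sharpness):
    """Build a stand-in "model": a function that maps a token prefix to a
    next-token probability distribution over the toy vocabulary."""
    def next_token_probs(prefix):
        # Hash the prefix into a deterministic pseudo-random distribution.
        seed = hash(tuple(prefix)) % (2**32)
        logits = np.random.default_rng(seed).normal(size=VOCAB) * sharpness
        exps = np.exp(logits - logits.max())
        return exps / exps.sum()
    return next_token_probs

small_model = toy_model(1.0)   # stand-in for the cheap draft model
large_model = toy_model(1.5)   # stand-in for the expensive target model

def speculative_step(prefix, gamma=4):
    """One draft-and-verify round following the process described above."""
    # 1. The smaller model drafts gamma tokens autoregressively.
    drafted, draft_conf = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        q = small_model(ctx)
        tok = int(np.argmax(q))          # greedy drafting, for simplicity
        drafted.append(tok)
        draft_conf.append(q[tok])
        ctx.append(tok)

    # 2.-4. Score every drafted position with the large model and compare
    #       its probability for each drafted token with the draft model's.
    accepted = []
    for i, tok in enumerate(drafted):
        p = large_model(list(prefix) + drafted[:i])
        if p[tok] >= draft_conf[i]:
            accepted.append(tok)         # large model agrees: keep the token
        else:
            # 5.-6. Large model is less confident in this token: replace it
            #       and discard the rest of the draft.
            accepted.append(int(np.argmax(p)))
            return list(prefix) + accepted
    # Every drafted token was accepted; the large model's last score gives
    # one extra token for free.
    p = large_model(list(prefix) + drafted)
    accepted.append(int(np.argmax(p)))
    return list(prefix) + accepted

# 7. Repeat the draft-and-verify round for the whole generation.
seq = [0]                                # start from an arbitrary seed token
while len(seq) < 32:
    seq = speculative_step(seq)
print(seq)
```

With a real transformer, scoring the n prefixes does not require n separate runs: causal attention lets a single batched forward pass over the drafted sequence produce the next-token distribution at every position, which is what makes the verification step cheap in wall-clock time.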