1. Store high-quality outputs of a large model using the store parameter in the Chat Completions API.
2. Evaluate the stored completions with both the large and the small model to establish a baseline.
3. Select the stored completions that you'd like to use for distillation and use them to fine-tune the smaller model.
4. Evaluate the performance of the fine-tuned model to see how it compares to the large model.
Store high-quality outputs of a large model
The first step in the distillation process is to generate good results with a large model like o1-preview or gpt-4o that meet your bar. As you generate these results, you can store them using the store: true option in the Chat Completions API. We also recommend you use the metadata property to tag these completions for easy filtering later. These stored completions can then be viewed and filtered in the dashboard.
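For example, a request that stores its completion might look like the sketch below, using the official openai Python SDK; the prompt and metadata tags are placeholders to adapt to your task.

```python
from openai import OpenAI

client = OpenAI()

# Generate a completion with the large model and store it for later
# distillation. The metadata tags are placeholders; choose tags that
# make these completions easy to filter in the dashboard.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
    ],
    store=True,
    metadata={"use-case": "distillation", "version": "v1"},
)

print(response.choices[0].message.content)
```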
When using the store: true option, completions are stored for 30 days. Your completions may contain sensitive information, so you may want to consider creating a new Project with limited access to store these completions.

Evaluate to establish a baseline
You can use your stored completions to evaluate the performance of both the larger model and a smaller model on your task to establish a baseline. This can be done using the evals product. Typically, the large model will outperform the smaller model on your evaluations. Establishing this baseline lets you measure the improvements gained through the distillation and fine-tuning process.
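The evals product handles this comparison for you, but conceptually a baseline measurement is a loop like the sketch below; grade is a hypothetical grader function, and examples stands in for prompt/reference pairs drawn from your stored completions.

```python
from openai import OpenAI

client = OpenAI()

def grade(reference: str, candidate: str) -> bool:
    # Hypothetical grader: swap in your own criteria, e.g. exact
    # match, a regex check, or an LLM-as-judge call.
    return candidate.strip().lower() == reference.strip().lower()

def score_model(model: str, examples: list[dict]) -> float:
    # Re-run each stored prompt through `model` and grade the new
    # output against the stored reference answer.
    correct = 0
    for example in examples:
        response = client.chat.completions.create(
            model=model,
            messages=example["messages"],
        )
        if grade(example["reference"], response.choices[0].message.content):
            correct += 1
    return correct / len(examples)

# examples = [{"messages": [...], "reference": "..."}, ...]
# baseline_large = score_model("gpt-4o", examples)
# baseline_small = score_model("gpt-4o-mini", examples)
```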
Create training dataset to fine-tune smaller model

Next you can select a subset of your stored completions to use as training data for fine-tuning a smaller model like gpt-4o-mini. Filter your stored completions to those that you would like to use to train the small model, and click the "Distill" button. A few hundred samples might be sufficient, but sometimes a more diverse range of thousands of samples can yield better results.
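The Distill button handles this from the dashboard; the equivalent API flow looks roughly like the sketch below, where the file name and model snapshot are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of chat-formatted training examples exported
# from your stored completions. Each line has the shape:
# {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
training_file = client.files.create(
    file=open("distillation_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on the smaller model. The snapshot name
# is a placeholder; use the model you baselined against.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id)
```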
Evaluate the fine-tuned small model
When your fine-tuning job is complete, you can run evals against it to see how it stacks up against the base small and large models. You can select fine-tuned models in the Evals product to generate new completions with the fine-tuned small model (see the sketch after the list below). If the results fall short of your expectations, consider revisiting:

- The diversity of the training data
- Your prompts and outputs on the large model
- The accuracy of your eval graders
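For reference, a fine-tuned model is invoked like any other model when generating completions for comparison; a minimal sketch, where the model id is a placeholder for the one returned by your fine-tuning job:

```python
from openai import OpenAI

client = OpenAI()

# The model id is a placeholder; use the fine-tuned model name
# returned when your fine-tuning job succeeds.
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:my-org::abc123",
    messages=[
        {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
    ],
)
print(response.choices[0].message.content)
```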