Create eval
Create the structure of an evaluation that can be used to test a model’s performance. An evaluation is a set of testing criteria and a data source. After creating an evaluation, you can run it on different models and model parameters. We support several types of graders and data sources. For more information, see the Evals guide.
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
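As a minimal sketch of the header format described above, the helper below builds the Authorization header from a token. The function name bearer_headers and the Content-Type value are assumptions for illustration; only the "Bearer <token>" form comes from this page.

```python
def bearer_headers(token: str) -> dict[str, str]:
    """Build request headers carrying a bearer token."""
    # Reject empty tokens and tokens with stray whitespace, a common
    # copy-paste mistake that typically results in an authentication error.
    if not token or token != token.strip():
        raise ValueError("token must be non-empty with no surrounding whitespace")
    return {
        "Authorization": f"Bearer {token}",
        # Assumption: the request body is sent as JSON.
        "Content-Type": "application/json",
    }
```

In practice the token would usually be read from an environment variable or a secret store rather than written into source code.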
Body
A CustomDataSourceConfig object that defines the schema for the data source used for the evaluation runs. This schema defines the shape of the data that is:
- Used to define your testing criteria
- Required when creating a run
Available data source config types:
- CustomDataSourceConfig
- StoredCompletionsDataSourceConfig
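To make the idea of a data source schema concrete, here is a rough sketch of what a custom config might look like as a Python dict. The field names "type" and "item_schema", the value "custom", and the question/expected_answer item shape are all assumptions, since this page does not list field names; check them against the CustomDataSourceConfig schema before use.

```python
import json

# Hypothetical item shape: each test item has a question and an expected answer.
item_schema = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "expected_answer": {"type": "string"},
    },
    "required": ["question", "expected_answer"],
}

# Assumed field names ("type", "item_schema"); not confirmed by this page.
custom_data_source_config = {
    "type": "custom",
    "item_schema": item_schema,
}

print(json.dumps(custom_data_source_config, indent=2))
```

The item schema here is written as JSON Schema, so every item supplied when creating a run would need both fields to be present and to be strings.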
A list of graders for all eval runs in this group.
A LabelModelGrader object which uses a model to assign labels to each item in the evaluation.
- LabelModelGrader
- StringCheckGrader
- TextSimilarityGrader
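As an illustration of how a grader could be expressed, the sketch below defines one StringCheckGrader and wraps it in a list. Every field name, the "eq" operation, and the double-brace template references are assumptions not stated on this page, and the item field expected_answer is hypothetical.

```python
# Assumed shape of a StringCheckGrader; verify field names against the
# grader schema before relying on them.
string_check_grader = {
    "type": "string_check",
    "name": "exact_match",
    # Assumed template syntax referring to the model output for a sample.
    "input": "{{sample.output_text}}",
    # Assumed template syntax referring to a field of the test item.
    "reference": "{{item.expected_answer}}",
    # Assumed operation name meaning "input equals reference".
    "operation": "eq",
}

# The graders are supplied as a list, one entry per testing criterion.
testing_criteria = [string_check_grader]
```

A string check suits tasks with a single correct answer; a LabelModelGrader or TextSimilarityGrader would be the natural choice when answers can vary in wording.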
The name of the evaluation.
A set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and for querying objects via the API or the dashboard.
Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
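The limits above can be checked on the client before sending a request. The following is a minimal sketch; validate_metadata is a hypothetical helper, and it assumes that "characters" means Unicode code points as counted by Python's len, which this page does not specify.

```python
MAX_PAIRS = 16
MAX_KEY_LEN = 64
MAX_VALUE_LEN = 512


def validate_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata fits the limits."""
    problems = []
    if len(metadata) > MAX_PAIRS:
        problems.append(f"{len(metadata)} pairs exceeds the limit of {MAX_PAIRS}")
    for key, value in metadata.items():
        # Both keys and values must be strings.
        if not isinstance(key, str) or not isinstance(value, str):
            problems.append(f"{key!r}: keys and values must both be strings")
            continue
        if len(key) > MAX_KEY_LEN:
            problems.append(f"key {key[:20]!r}... is longer than {MAX_KEY_LEN} characters")
        if len(value) > MAX_VALUE_LEN:
            problems.append(f"value for {key[:20]!r} is longer than {MAX_VALUE_LEN} characters")
    return problems
```

For example, validate_metadata({"team": "support"}) returns an empty list, while a dict with 17 entries returns one problem describing the pair count.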
Indicates whether the evaluation is shared with OpenAI.
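Putting the body fields together, a request could be assembled roughly as follows. This sketch only builds the request object and does not send it. The endpoint URL and the body field names ("name", "data_source_config", "testing_criteria", "metadata") are assumptions, since this page gives the field descriptions without their names.

```python
import json
import urllib.request

# Assumed endpoint; this page does not state the URL.
EVALS_URL = "https://api.openai.com/v1/evals"


def build_create_eval_request(token, name, data_source_config,
                              testing_criteria, metadata=None):
    """Build (but do not send) a POST request that creates an eval."""
    # Assumed field names for the request body.
    body = {
        "name": name,
        "data_source_config": data_source_config,
        "testing_criteria": testing_criteria,
    }
    if metadata:
        body["metadata"] = metadata
    return urllib.request.Request(
        EVALS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; keeping construction separate from sending makes the payload easy to inspect or test first.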
Response
OK
An Eval object with a data source config and testing criteria. An Eval represents a task to be done for your LLM integration, for example:
- Improve the quality of my chatbot
- See how well my chatbot handles customer support
- Check if o3-mini is better at my use case than gpt-4o
The object type.
"eval"
Unique identifier for the evaluation.
The name of the evaluation.
"Chatbot effectiveness Evaluation"
A CustomDataSourceConfig that specifies the schema of your item and, optionally, sample namespaces.
The response schema defines the shape of the data that is:
- Used to define your testing criteria
- Required when creating a run
Available data source config types:
- CustomDataSourceConfig
- StoredCompletionsDataSourceConfig
A list of testing criteria.
A LabelModelGrader object which uses a model to assign labels to each item in the evaluation.
- LabelModelGrader
- StringCheckGrader
- TextSimilarityGrader
The Unix timestamp (in seconds) for when the eval was created.
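Because the creation time is given in Unix seconds, it usually needs converting before display. A minimal sketch, using a hypothetical helper name, that turns the value into a timezone-aware UTC datetime:

```python
from datetime import datetime, timezone


def unix_seconds_to_utc(unix_seconds: int) -> datetime:
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


print(unix_seconds_to_utc(0).isoformat())  # 1970-01-01T00:00:00+00:00
```

Passing tz explicitly avoids the local-time interpretation that datetime.fromtimestamp applies by default.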
A set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and for querying objects via the API or the dashboard.
Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
Indicates whether the evaluation is shared with OpenAI.