Learn how to use vision capabilities to understand images.
File types | Size limits | Other requirements |
---|---|---|
PNG (.png); JPEG (.jpeg and .jpg); WEBP (.webp); non-animated GIF (.gif) | Up to 20 MB per image; up to 500 individual images per request; up to 50 MB of image bytes per request; low-resolution: 512px x 512px; high-resolution: 768px (short side) x 2000px (long side) | No watermarks or logos; no text; no NSFW content; clear enough for a human to understand |

The `detail` parameter tells the model what level of detail to use when processing and understanding the image (`low`, `high`, or `auto` to let the model decide). If you skip the parameter, the model will use `auto`. Put it right after your `image_url`, like this: `"detail": "low"`.

With `"detail": "low"`, the model processes the image with a budget of 85 tokens. The model receives a low-resolution 512px x 512px version of the image. This is fine if your use case doesn’t require the model to see with high-resolution detail (for example, if you’re asking about the dominant shape or color in the image).

Or give the model more detail to generate its understanding by using `"detail": "high"`. In this mode the model first sees the low-resolution image (using 85 tokens) and then receives detailed crops that cost 170 tokens for each 512px x 512px tile.
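
As a rough illustration of where the parameter sits, the sketch below builds a single image entry as a plain Python dictionary and prints it as JSON. Only `image_url` and `detail` (with the values `low`, `high`, and `auto`) come from the text above. The `"type": "input_image"` key, the helper name `image_part`, and the example URL are assumptions made for illustration, so check the API reference for the exact request shape you are using.

```python
import json

# Values the text above lists for the detail parameter.
ALLOWED_DETAIL = {"low", "high", "auto"}


def image_part(image_url: str, detail: str = "auto") -> dict:
    """Build one image entry with `detail` placed right after `image_url`.

    Hypothetical helper: the "type" key is an assumption, not taken
    from this section.
    """
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"detail must be one of {sorted(ALLOWED_DETAIL)}")
    return {
        "type": "input_image",   # assumed content-part type
        "image_url": image_url,
        "detail": detail,        # defaults to "auto" when not specified
    }


# Placeholder URL, used only to show the resulting structure.
part = image_part("https://example.com/photo.png", detail="low")
print(json.dumps(part, indent=2))
```

Defaulting `detail` to `"auto"` in the helper mirrors the behavior described above when the parameter is skipped.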
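
For `gpt-4.1-mini`, we multiply image tokens by 1.62 to get total tokens, and for `gpt-4.1-nano`, we multiply image tokens by 2.46 to get total tokens, which are then billed at normal text token rates.

Cost calculation examples, using 32px x 32px patches and a cap of 1536 patches per image:

- A 1024 x 1024 image costs 1024 tokens
  - Width is 1024, resulting in `(1024 + 32 - 1) // 32 = 32` patches
  - Height is 1024, resulting in `(1024 + 32 - 1) // 32 = 32` patches
  - Tokens are calculated as `32 * 32 = 1024`, below the cap of 1536
- An 1800 x 2400 image costs 1452 tokens
  - Width is 1800, resulting in `(1800 + 32 - 1) // 32 = 57` patches
  - Height is 2400, resulting in `(2400 + 32 - 1) // 32 = 75` patches
  - We need `57 * 75 = 4275` patches to cover the full image. Since that exceeds 1536, we need to scale down the image while preserving the aspect ratio.
  - The shrink factor is `sqrt(token_budget * patch_size^2 / (width * height))`. In our example, the shrink factor is `sqrt(1536 * 32^2 / (1800 * 2400)) = 0.603`.
  - Width is now 1086, resulting in `1086 / 32 = 33.94` patches
  - Height is now 1448, resulting in `1448 / 32 = 45.25` patches
  - To make the image fit in a whole number of patches, we scale again by `33 / 33.94 = 0.97` to fit the width in 33 patches.
  - The final width is `1086 * (33 / 33.94) = 1056` and the final height is `1448 * (33 / 33.94) = 1408`
  - The image now requires `1056 / 32 = 33` patches to cover the width and `1408 / 32 = 44` patches to cover the height
  - The total number of tokens is `33 * 44 = 1452`, below the cap of 1536

`"detail": "low"` costs 85 tokens. To calculate the cost of an image with `"detail": "high"`, we charge 170 tokens for each 512px x 512px tile and add the 85 tokens for the low-resolution image.

- An image that needs 4 tiles in `"detail": "high"` mode costs 765 tokens: `170 * 4 + 85 = 765`.
- An image that needs 6 tiles in `"detail": "high"` mode costs 1105 tokens: `170 * 6 + 85 = 1105`.
- An image in `"detail": "low"` mode costs 85 tokens.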
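
The two cost rules in this section can be sketched as small Python functions. This is an illustration under stated assumptions, not a reference implementation: the function and constant names are hypothetical, rounding the scaled dimensions to whole pixels is inferred from the worked numbers (1086 and 1448), and the second scaling step always fits the width because that is the only case the example shows. The tile-based function takes the tile count as an input, since this section does not spell out how an image is resized before tiling.

```python
import math

PATCH_SIZE = 32       # patch edge in pixels, as in the examples above
PATCH_BUDGET = 1536   # cap on patches (and tokens) per image
MODEL_MULTIPLIER = {"gpt-4.1-mini": 1.62, "gpt-4.1-nano": 2.46}

BASE_TOKENS = 85      # cost of the low-resolution image
TILE_TOKENS = 170     # cost of each 512px x 512px tile


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, the (a + b - 1) // b form used above."""
    return (a + b - 1) // b


def patch_tokens(width: int, height: int) -> int:
    """Image tokens under the patch-based rule."""
    cols = ceil_div(width, PATCH_SIZE)
    rows = ceil_div(height, PATCH_SIZE)
    if cols * rows <= PATCH_BUDGET:
        return cols * rows

    # Over budget: shrink while preserving the aspect ratio.
    shrink = math.sqrt(PATCH_BUDGET * PATCH_SIZE**2 / (width * height))
    # Assumption: scaled dimensions are rounded to whole pixels.
    scaled_w = max(1, round(width * shrink))
    scaled_h = max(1, round(height * shrink))

    # Scale again so the width lands on a whole number of patches.
    cols = max(1, scaled_w // PATCH_SIZE)
    final_w = cols * PATCH_SIZE
    final_h = round(scaled_h * final_w / scaled_w)
    rows = max(1, ceil_div(final_h, PATCH_SIZE))
    return min(cols * rows, PATCH_BUDGET)


def billed_tokens(width: int, height: int, model: str) -> float:
    """Image tokens multiplied by the per-model factor quoted above."""
    return patch_tokens(width, height) * MODEL_MULTIPLIER[model]


def detail_tokens(detail: str, tiles: int = 0) -> int:
    """Image tokens under the tile-based rule; `tiles` is supplied by the caller."""
    if detail == "low":
        return BASE_TOKENS
    if detail == "high":
        return TILE_TOKENS * tiles + BASE_TOKENS
    raise ValueError("detail must be 'low' or 'high' for a fixed cost")


print(patch_tokens(1024, 1024))         # 1024
print(patch_tokens(1800, 2400))         # 1452
print(detail_tokens("high", tiles=4))   # 765
print(detail_tokens("high", tiles=6))   # 1105
print(detail_tokens("low"))             # 85
```

The printed values match the worked examples above. `billed_tokens` returns an unrounded float because the text does not say how the multiplied total is rounded.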