Claude 3 "Vision" uses Google's Cloud Vision API
# this page is a work-in-progress; I have a lot of data to go through. ~85% sure about the current conclusion.
Anthropic's Claude 3 family of models ships with a Vision capability, allowing them to "understand and analyze images, opening up exciting possibilities for multimodal interaction". In Amazon AWS' announcement about the availability of Claude 3 Opus on Amazon Bedrock, they state that the Claude 3 models "were trained to be extremely capable at understanding photos, charts, graphs, technical diagrams, and optical character recognition (OCR)".
Interestingly, in benchmarks the image processing capabilities of Claude Opus - according to the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (MMMU) - were found to be very similar to those of Gemini Ultra in all but one of five vision-related tests. The exact same MMMU score of 59.4 resulted in a shared first place in math and reasoning, out of six models. The "Turbo" of the Claude 3 family, Haiku, has also been found to be unexpectedly good at image analysis for a less capable model - punching way above its weight - becoming the go-to for fast, low-cost image processing with Claude. This indicates a nonlinear relationship between the Claude 3 models' general capability and their ability to analyze images, which is somewhat confounding.
Why could this be? An analysis of ~50 raw and unmodified outputs from Claude Vision, acquired by asking nicely¹, consistently shows that Claude 3 does not use any in-house image processing modules - it simply calls Google's Cloud Vision API and parses the output. Separate tests using different standardized question formats also failed to disprove this conclusion. Anthropic's claim that "All Claude 3 models are capable of understanding and analyzing images" is therefore somewhat misleading, since the models' current output consistently indicates API calls to an external service. That Google would be the one to provide this service to Anthropic makes sense, given Google's $2B investment. Google is to Anthropic what Microsoft is to OpenAI - massive investments, service integration (Bing Chat, DALL-E 3, etc.), and hardware utilization (GPT-4 VMs run on Microsoft Azure servers, for example).
The request format Claude uses for Vision is identical to the one used by Cloud Vision. Claude requires a base64-encoded string as one of its arguments, which matches the "send the contents of the image file as a base64 encoded string in the body of your request" instruction from the Google Vision API. The output consistently matches Cloud Vision's response layout and 200 OK message. The labelAnnotations dict contains the keys mid (machine-generated identifier), description (plaintext label), score (confidence), and topicality (importance of the concept in relation to the full image context). The MIDs used appear to be based on the ones used by Cloud Vision (Google's Knowledge Graph), with exact MIDs also matching those in TensorFlow libraries and the HF Conceptual Captions dataset.
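For reference, Google's documented images:annotate request takes the base64-encoded image content alongside a list of requested features. The sketch below only illustrates that documented Cloud Vision request shape with placeholder values - it is not a captured Claude request:
{
  "requests": [
    {
      "image": {"content": "<base64-encoded image data>"},
      "features": [
        {"type": "LABEL_DETECTION", "maxResults": 20},
        {"type": "SAFE_SEARCH_DETECTION"},
        {"type": "TEXT_DETECTION"}
      ]
    }
  ]
}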
Ratings for the Cloud Vision class safeSearchAnnotation are also consistent, and the attributes used line up with Claude Vision's guidelines for allowed content: "adult", "spoof", "medical", "violence", "racy". imagePropertiesAnnotation (RGBA color information) is also used in the Google Vision-specified format, as is landmarkingConfidence, which rates how confidently the various parts of a face were located. The OCR (Optical Character Recognition) format checks out as well, with strings matching labels for bounding boxes in the same format as Cloud Vision.
A variety of tests were performed on Haiku, Sonnet, and Opus, using both Amazon Bedrock's setup and Anthropic's API. Multiple message regenerations (swipes) and different phrasings were used, in an attempt to prove the conclusion wrong while accounting for model hallucinations. While Claude certainly provided contradictory information and template/mimic/example/placeholder information at times - especially common for the "low-end" models - the general tendency (a consistent theme) became quite clear with repeated tests.
Please note that the snippets below were generally not generated using the same image. The full output for one image was extensive.
API Formatting used by Claude
{"type": "text", "text": "Please describe this image."}
Sample output: safeSearchAnnotation
https://cloud.google.com/vision/docs/detecting-safe-search
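For reference, the safeSearchAnnotation block documented at the link above has this shape (the likelihood values here are placeholders, not a captured response):
{
  "safeSearchAnnotation": {
    "adult": "VERY_UNLIKELY",
    "spoof": "UNLIKELY",
    "medical": "VERY_UNLIKELY",
    "violence": "UNLIKELY",
    "racy": "POSSIBLE"
  }
}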
Sample output: OCR Format
https://cloud.google.com/vision/docs/reference/rest/v1/AnnotateImageResponse
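Per the documentation linked above, text detection (OCR) returns textAnnotations entries that pair a detected string with a pixel-coordinate bounding polygon; the values below are placeholders for illustration:
{
  "textAnnotations": [
    {
      "locale": "en",
      "description": "EXAMPLE TEXT",
      "boundingPoly": {
        "vertices": [
          {"x": 12, "y": 8},
          {"x": 210, "y": 8},
          {"x": 210, "y": 42},
          {"x": 12, "y": 42}
        ]
      }
    }
  ]
}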
Example output from Claude:
Sample output: localizedObjectAnnotations
https://cloud.google.com/vision/docs/object-localizer#vision_localize_objects-drest
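Per the documentation linked above, localizedObjectAnnotations pairs each detected object with a normalized bounding polygon (coordinates in the 0-1 range); the values below are placeholders:
{
  "localizedObjectAnnotations": [
    {
      "mid": "/m/01g317",
      "name": "Person",
      "score": 0.91,
      "boundingPoly": {
        "normalizedVertices": [
          {"x": 0.12, "y": 0.05},
          {"x": 0.87, "y": 0.05},
          {"x": 0.87, "y": 0.98},
          {"x": 0.12, "y": 0.98}
        ]
      }
    }
  ]
}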
Sample output: imagePropertiesAnnotation
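Per Google's documentation, imagePropertiesAnnotation reports dominant colors with per-channel values, a score, and a pixel fraction; the values below are placeholders:
{
  "imagePropertiesAnnotation": {
    "dominantColors": {
      "colors": [
        {
          "color": {"red": 145, "green": 126, "blue": 105},
          "score": 0.35,
          "pixelFraction": 0.07
        }
      ]
    }
  }
}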
Sample output: webDetection
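Per Google's documentation, webDetection returns web entities (with Knowledge Graph entity IDs) and best-guess labels; the values below are placeholders:
{
  "webDetection": {
    "webEntities": [
      {"entityId": "/m/0dzct", "score": 0.72, "description": "Face"}
    ],
    "bestGuessLabels": [
      {"label": "street fashion", "languageCode": "en"}
    ]
  }
}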
Sample output: labelAnnotations
https://cloud.google.com/vision/docs/labels
"labelAnnotations": [
{
"mid": "/m/01g317",
"description": "Person",
"score": 0.9694548,
"topicality": 0.9694548
},
{
"mid": "/m/014sv8",
"description": "Facial hair",
"score": 0.89984846,
"topicality": 0.89984846
},
{
"mid": "/m/027x7n",
"description": "Beard",
"score": 0.8961298,
"topicality": 0.8961298
},
{
"mid": "/m/04yx4",
"description": "Man",
"score": 0.8809903,
"topicality": 0.8809903
},
{
"mid": "/m/04hgtk",
"description": "Head",
"score": 0.8691441,
"topicality": 0.8691441
},
{
"mid": "/m/0dzct",
"description": "Face",
"score": 0.83789873,
"topicality": 0.83789873
},
{
"mid": "/m/0k0pj",
"description": "Nose",
"score": 0.8358519,
"topicality": 0.8358519
},
{
"mid": "/m/0dsw8",
"description": "Cool",
"score": 0.80728,
"topicality": 0.80728
},
{
"mid": "/m/0dxrf",
"description": "Street fashion",
"score": 0.7849332,
"topicality": 0.7849332
},
{
"mid": "/m/0h_x9",
"description": "Eyebrow",
"score": 0.7698962,
"topicality": 0.7698962
},
{
"mid": "/m/0242c",
"description": "Fashion",
"score": 0.7621175,
"topicality": 0.7621175
},
{
"mid": "/m/04h4w",
"description": "Jacket",
"score": 0.7491627,
"topicality": 0.7491627
},
{
"mid": "/m/015h_t",
"description": "Facial expression",
"score": 0.74665624,
"topicality": 0.74665624
},
{
"mid": "/m/0g861",
"description": "Photography",
"score": 0.73927814,
"topicality": 0.73927814
},
{
"mid": "/m/01n5jq",
"description": "Wrinkle",
"score": 0.7363771,
"topicality": 0.7363771
},
{
"mid": "/m/02r0_6m",
"description": "Male",
"score": 0.728889,
"topicality": 0.81997025
},
{
"mid": "/m/025_jnm",
"description": "Photography poses",
"score": 0.7270868,
"topicality": 0.7270868
},
{
"mid": "/m/016p0m",
"description": "Skin",
"score": 0.7255238,
"topicality": 0.7255238
},
{
"mid": "/m/03gq5hm",
"description": "Neck",
"score": 0.7185524,
"topicality": 0.7185524
},
{
"mid": "/m/02kq0",
"description": "Hip hop fashion",
"score": 0.7097955,
"topicality": 0.7097955
}
],
Sample output: objects
https://cloud.google.com/vision/docs/reference/rest/v1p2beta1/images/annotate#normalizedvertex
Unsorted References
¹ This was once revealed to me in a dream.
https://cloud.google.com/vision/docs/features-list
https://cloud.google.com/vision/docs/labels
~ desuAnon