Kosmos-2-Patch14-224: Multi-modal Object Detection
By Tim Dolan
Is Kosmos-2 a better use of compute than detr-resnet-50?

Introduction
Facebook, Google, and Microsoft all have industry-leading object detection platforms, but what gives Microsoft the advantage over Facebook (Meta) in this race?
Microsoft has changed the dynamic by introducing Kosmos-2, a multimodal large language model trained on GRIT, a web-scale corpus of grounded image-text pairs. Rather than simply predicting class labels, it links the phrases it generates to regions of the image, a capability most object detection models available today do not offer.
Kosmos-2 Features
Multimodal Grounding
Phrase Grounding
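The examples in this and the following sections call a run_example(prompt) helper that is not defined in this post. Here is a minimal sketch of what such a helper might look like, based on the Hugging Face transformers API for this model; snowman.png stands in for a local copy of the demo image and is an assumption of this sketch:
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
image = Image.open("snowman.png")  # assumed local copy of the demo image

def run_example(prompt):
    # Encode the prompt together with the image and generate grounded text
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_embeds=None,
        image_embeds_position_mask=inputs["image_embeds_position_mask"],
        use_cache=True,
        max_new_tokens=128,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    # Cleaned text plus the grounded entities (label, text span, normalized boxes)
    processed_text, entities = processor.post_process_generation(generated_text)
    print(processed_text)
    print(entities)
    # Raw text with the <phrase>/<object>/<patch_index_*> markers left in place
    print(processor.post_process_generation(generated_text, cleanup_and_extract=False))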
Here's an example of how to ground a phrase:
prompt = "<grounding><phrase> a snowman</phrase>"
run_example(prompt)
Output and annotations:
a snowman is warming himself by the fire
[('a snowman', (0, 9), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('the fire', (32, 40), [(0.203125, 0.015625, 0.453125, 0.859375)])]
<grounding><phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> is warming himself by<phrase> the fire</phrase><object><patch_index_0006><patch_index_0878></object>
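The <patch_index_XXXX> tokens in the raw output refer to bins on a 32 x 32 grid of location tokens; each <object> span holds the top-left and bottom-right bins of a box. A small sketch of how an index pair maps to the normalized coordinates shown above (the bin-center convention here is inferred from the example output rather than taken from the paper):
def patch_index_to_point(index, num_bins=32):
    # Map a flat location-token index onto normalized (x, y) coordinates,
    # using the center of the bin the index refers to
    row, col = divmod(index, num_bins)
    return (col + 0.5) / num_bins, (row + 0.5) / num_bins

# <patch_index_0044><patch_index_0863> bound "a snowman" in the output above
x_min, y_min = patch_index_to_point(44)
x_max, y_max = patch_index_to_point(863)
print(x_min, y_min, x_max, y_max)
# 0.390625 0.046875 0.984375 0.828125 -- the same box shown in the annotations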
Referring Expression Comprehension
In this task, the model locates the region described by a referring expression:
prompt = "<grounding><phrase> a snowman next to a fire</phrase>"
run_example(prompt)
Output:
a snowman next to a fire
[('a snowman next to a fire', (0, 24), [(0.390625, 0.046875, 0.984375, 0.828125)])]
<grounding><phrase> a snowman next to a fire</phrase><object><patch_index_0044><patch_index_0863></object>
Multimodal Referring
Referring Expression Generation
Generating a referring expression:
prompt = "<grounding><phrase> It</phrase><object><patch_index_0044><patch_index_0863></object> is"
run_example(prompt)
Resulting expression:
It is a snowman in a hat and scarf
[('It', (0, 2), [(0.390625, 0.046875, 0.984375, 0.828125)])]
<grounding><phrase> It</phrase><object><patch_index_0044><patch_index_0863></object> is a snowman in a hat and scarf
Perception-Language Tasks
Grounded Visual Question Answering (VQA)
Here we answer a question based on visual and textual cues:
prompt = "<grounding> Question: What is special about this image? Answer:"
run_example(prompt)
Answer with details:
Question: What is special about this image? Answer: The image features a snowman sitting by a campfire in the snow.
[('a snowman', (71, 80), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a campfire', (92, 102), [(0.109375, 0.640625, 0.546875, 0.984375)])]
<grounding> Question: What is special about this image? Answer: The image features<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> sitting by<phrase> a campfire</phrase><object><patch_index_0643><patch_index_1009></object> in the snow.
Grounded Image Captioning
Brief Captioning
For a concise summary:
prompt = "<grounding> An image of"
run_example(prompt)
Caption output:
An image of a snowman warming himself by a campfire.
[('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a campfire', (41, 51), [(0.109375, 0.640625, 0.546875, 0.984375)])]
<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming himself by<phrase> a campfire</phrase><object><patch_index_0643><patch_index_1009></object>.
Detailed Captioning
For an in-depth description:
prompt = "<grounding> Describe this image in detail:"
run_example(prompt)
Detailed caption output:
# Describe this image in detail: The image features a snowman sitting by a campfire in the snow. He is wearing a hat, scarf, and gloves, with a pot nearby and a cup nearby. The snowman appears to be enjoying the warmth of the fire, and it appears to have a warm and cozy atmosphere.
# [('a campfire', (71, 81), [(0.171875, 0.015625, 0.484375, 0.984375)]), ('a hat', (109, 114), [(0.515625, 0.046875, 0.828125, 0.234375)]), ('scarf', (116, 121), [(0.515625, 0.234375, 0.890625, 0.578125)]), ('gloves', (127, 133), [(0.515625, 0.390625, 0.640625, 0.515625)]), ('a pot', (140, 145), [(0.078125, 0.609375, 0.265625, 0.859375)]), ('a cup', (157, 162), [(0.890625, 0.765625, 0.984375, 0.984375)])]
# <grounding> Describe this image in detail: The image features a snowman sitting by<phrase> a campfire</phrase><object><patch_index_0005><patch_index_1007></object> in the snow. He is wearing<phrase> a hat</phrase><object><patch_index_0048><patch_index_0250></object>,<phrase> scarf</phrase><object><patch_index_0240><patch_index_0604></object>, and<phrase> gloves</phrase><object><patch_index_0400><patch_index_0532></object>, with<phrase> a pot</phrase><object><patch_index_0610><patch_index_0872></object> nearby and<phrase> a cup</phrase><object><patch_index_0796><patch_index_1023></object> nearby. The snowman appears to be enjoying the warmth of the fire, and it appears to have a warm and cozy atmosphere.
Kosmos-2 Code Example
First, we will import the required libraries and modules.
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image, ImageDraw, ImageFont
import base64
import io
import random
Declare the FastAPI app and transformers model
app = FastAPI()
# Load the model and processor
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
Draw bounding boxes on the image
Before we write the endpoint, we also need a function to draw the bounding boxes on the image.
def draw_bounding_boxes(image: Image.Image, entities):
    draw = ImageDraw.Draw(image)
    width, height = image.size

    # Define a color bank of hex codes
    color_bank = [
        "#0AC2FF", "#30D5C8", "#F3C300", "#47FF0A",
        "#C2FF0A", "#F7CA18", "#D91E18", "#FF0AC2",
        "#FF0A47", "#DB0A5B", "#1E824C"
    ]

    # Use arial.ttf if it is available, otherwise fall back to PIL's built-in default font
    font_size = 20
    try:
        font = ImageFont.truetype("arial.ttf", font_size)
    except IOError:
        font = ImageFont.load_default()

    for entity in entities:
        label, _, boxes = entity
        for box in boxes:
            # The model returns normalized coordinates; scale them to pixel values
            box_coords = [
                box[0] * width,   # x_min
                box[1] * height,  # y_min
                box[2] * width,   # x_max
                box[3] * height   # y_max
            ]

            # Randomly choose colors for the outline and text fill
            outline_color = random.choice(color_bank)
            text_fill_color = random.choice(color_bank)
            draw.rectangle(box_coords, outline=outline_color, width=4)

            # Draw the label just above the top-left corner of the box
            text_position = (box_coords[0] + 5, box_coords[1] - font_size - 5)
            draw.text(text_position, label, fill=text_fill_color, font=font)
    return image
This uses the normalized coordinates returned by the model to annotate the original image and return it. Microsoft also provides its own drawing utility in the Kosmos-2 demo code, but a custom function gives us full control over colors and fonts.
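As a quick standalone check, the function can be exercised with the entity tuples from the phrase-grounding output earlier; a small sketch, assuming snowman.png is a local copy of the demo image:
entities = [('a snowman', (0, 9), [(0.390625, 0.046875, 0.984375, 0.828125)]),
            ('the fire', (32, 40), [(0.203125, 0.015625, 0.453125, 0.859375)])]
annotated = draw_bounding_boxes(Image.open("snowman.png").convert("RGB"), entities)
annotated.save("snowman_annotated.png")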
Write the endpoint
This endpoint will accept an image, process it with the model, draw the bounding boxes, and return the annotated image.
# Main endpoint for object detection and drawing
@app.post("/detect/")
async def detect_and_draw_objects(file: UploadFile = File(...)):
    if file.content_type.startswith('image/'):
        # Read the image file and convert to RGB so it can be saved as JPEG later
        image_data = await file.read()
        image = Image.open(io.BytesIO(image_data)).convert("RGB")

        # Process the image with the model
        # (a broader prompt such as "<grounding> An image of" grounds the salient
        # objects instead of a single phrase)
        prompt = "<grounding><phrase> a snowman</phrase>"
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        generated_ids = model.generate(
            pixel_values=inputs["pixel_values"],
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            image_embeds=None,
            image_embeds_position_mask=inputs["image_embeds_position_mask"],
            use_cache=True,
            max_new_tokens=128,
        )
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

        # Process the generated text into cleaned text and grounded entities
        processed_text, entities = processor.post_process_generation(generated_text)

        # Draw bounding boxes on the image
        annotated_image = draw_bounding_boxes(image, entities)

        # Convert the annotated image to base64
        buffered = io.BytesIO()
        annotated_image.save(buffered, format="JPEG")
        img_str = base64.b64encode(buffered.getvalue()).decode()

        # Prepare the response data
        response_data = {
            "description": processed_text,
            "entities": entities,
            "image_base64": img_str
        }

        # Return the JSON response with the image and detection data
        return JSONResponse(content=response_data)
    else:
        raise HTTPException(status_code=400, detail="File type not supported")
Testing the API and Getting a Response
Now that we have the endpoint written, we can test it out. We will use the requests library to send a POST request to the endpoint. Save the API code above as api.py and start the server:
uvicorn api:app --reload
This should start the API server on port 8000, with the detection endpoint available at:
http://localhost:8000/detect/
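A minimal sketch of such a request (snowman.jpg stands in for any local JPEG test image):
import base64
import requests

with open("snowman.jpg", "rb") as f:
    files = {"file": ("snowman.jpg", f, "image/jpeg")}
    response = requests.post("http://localhost:8000/detect/", files=files)

data = response.json()
print(data["description"])
print(data["entities"])

# Decode and save the annotated image returned by the endpoint
with open("annotated.jpg", "wb") as out:
    out.write(base64.b64decode(data["image_base64"]))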
Adding a Front-End
Now that we have the API working, we can add a front-end to make it easier to use. We will use the streamlit library to create a simple front-end.
First, we need to import the required libraries and modules.
import streamlit as st
import requests
from PIL import Image
import io
import base64
If you do not have Streamlit installed already, the command to install it is:
pip install streamlit
# Streamlit interface
st.title('Object Detection API Client')

# Endpoint URL
api_url = "http://localhost:8000/detect/"

# File uploader allows user to add their own image
uploaded_file = st.sidebar.file_uploader("Choose an image...", type=["jpg", "jpeg"])

if uploaded_file is not None:
    # Display the uploaded image
    image = Image.open(uploaded_file)
    st.image(image, caption='Uploaded Image.', use_column_width=True)

    # Convert the uploaded image to bytes
    buffered = io.BytesIO()
    image.save(buffered, format=image.format)
    img_byte = buffered.getvalue()

    # Prepare the file in the correct format for uploading
    files = {'file': ('image.jpg', img_byte, 'image/jpeg')}

    # Post the image to the endpoint
    st.write("Sending image to the API...")
    response = requests.post(api_url, files=files)

    # Check the response
    if response.status_code == 200:
        st.write("Response received from the API!")
        response_data = response.json()

        # Display the base64 image
        base64_image = response_data['image_base64']
        st.write("Annotated Image:")
        st.image(base64.b64decode(base64_image), caption='Processed Image.', use_column_width=True)

        # Display the other response data
        st.write("Description:", response_data['description'])
        st.write("Entities Detected:")
        st.json(response_data['entities'])
    else:
        st.error("Failed to get response from the API")
Finally, you will need a script that starts both servers. Create a file named run.sh and make it executable with this command:
chmod +x run.sh
Open the file in your favorite text editor and add the following code:
#!/bin/bash
# Execute the streamlit command
streamlit run app.py &
# Execute the uvicorn command
uvicorn api:app --reload
Once you have made the file executable:
./run.sh
Testing
The frontend and API should now be available at http://localhost:8501/ and http://localhost:8000/, respectively.
How does this compare to Meta's Object Detection API?
DETR-ResNet-50 is a compact, closed-set detector trained on COCO: it is far cheaper to run, but it can only predict a fixed list of categories. Kosmos-2 spends considerably more compute per image, and in exchange it grounds free-form language, answers questions, and writes captions. Whether it is a better use of compute therefore comes down to whether you need open-vocabulary, language-aware grounding or fast, fixed-category detection.
Citations
Microsoft Kosmos-2
@article{kosmos-2,
  title={Kosmos-2: Grounding Multimodal Large Language Models to the World},
  author={Zhiliang Peng and Wenhui Wang and Li Dong and Yaru Hao and Shaohan Huang and Shuming Ma and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2306.14824}
}
@article{kosmos-1,
  title={Language Is Not All You Need: Aligning Perception with Language Models},
  author={Shaohan Huang and Li Dong and Wenhui Wang and Yaru Hao and Saksham Singhal and Shuming Ma and Tengchao Lv and Lei Cui and Owais Khan Mohammed and Qiang Liu and Kriti Aggarwal and Zewen Chi and Johan Bjorck and Vishrav Chaudhary and Subhojit Som and Xia Song and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2302.14045}
}
@article{metalm,
  title={Language Models are General-Purpose Interfaces},
  author={Yaru Hao and Haoyu Song and Li Dong and Shaohan Huang and Zewen Chi and Wenhui Wang and Shuming Ma and Furu Wei},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.06336}
}