Kosmos-2-Patch14-224: Multi-modal Object Detection

Is Kosmos-2 a better use of compute than detr-resnet-50?

Introduction

Facebook, Google, and Microsoft all have industry-leading object detection platforms, but what gives Microsoft the advantage over Facebook (Meta) in this race?

Microsoft has changed the dynamic by introducing a new model called Kosmos-2. The model is trained on GRIT, a large-scale dataset of grounded image-text pairs, and it can both describe an image and ground the phrases in that description to bounding boxes, a capability that goes beyond most object detection models available today.

Kosmos-2 Features

Multimodal Grounding

Phrase Grounding

Here's an example of how to ground a phrase:

prompt = "<grounding><phrase> a snowman</phrase>"
run_example(prompt)

Output and annotations:

a snowman is warming himself by the fire
[('a snowman', (0, 9), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('the fire', (32, 40), [(0.203125, 0.015625, 0.453125, 0.859375)])]

<grounding><phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> is warming himself by<phrase> the fire</phrase><object><patch_index_0006><patch_index_0878></object>
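The run_example calls in this and the following sections assume a small helper that feeds a prompt and the demo image into Kosmos-2. Here is a minimal sketch of what such a helper might look like, using the same processor and model calls as the API section later in this post (the image URL is an assumption based on the Hugging Face model card):

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

# Demo snowman image (URL assumed from the Kosmos-2 model card)
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)

def run_example(prompt):
    # Encode the prompt together with the image
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    # Generate grounded text
    generated_ids = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_embeds=None,
        image_embeds_position_mask=inputs["image_embeds_position_mask"],
        use_cache=True,
        max_new_tokens=128,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    # Split the raw output into clean text and (phrase, span, boxes) entities
    processed_text, entities = processor.post_process_generation(generated_text)
    print(processed_text)
    print(entities)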

Referring Expression Comprehension

In this task, we comprehend a referring expression:

prompt = "<grounding><phrase> a snowman next to a fire</phrase>"
run_example(prompt)

Output:

a snowman next to a fire
[('a snowman next to a fire', (0, 24), [(0.390625, 0.046875, 0.984375, 0.828125)])]

<grounding><phrase> a snowman next to a fire</phrase><object><patch_index_0044><patch_index_0863></object>
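The <patch_index_XXXX> tokens in these outputs encode positions on a 32 x 32 grid of location bins laid over the image: the first index marks the top-left corner of the box and the second the bottom-right, each taken at the center of its grid cell. A small sketch of that mapping, consistent with the coordinates shown above (the helper name is our own):

def patch_index_to_point(index, num_bins=32):
    # Indices walk the grid row by row; return the normalized (x, y)
    # center of the corresponding cell.
    row, col = divmod(index, num_bins)
    return (col + 0.5) / num_bins, (row + 0.5) / num_bins

# <object><patch_index_0044><patch_index_0863></object> from the snowman example
x_min, y_min = patch_index_to_point(44)   # (0.390625, 0.046875)
x_max, y_max = patch_index_to_point(863)  # (0.984375, 0.828125)
print((x_min, y_min, x_max, y_max))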

Multimodal Referring

Referring Expression Generation

Generating a referring expression:

prompt = "<grounding><phrase> It</phrase><object><patch_index_0044><patch_index_0863></object> is"
run_example(prompt)

Resulting expression:

It is a snowman in a hat and scarf
[('It', (0, 2), [(0.390625, 0.046875, 0.984375, 0.828125)])]

<grounding><phrase> It</phrase><object><patch_index_0044><patch_index_0863></object> is a snowman in a hat and scarf

Perception-Language Tasks

Grounded Visual Question Answering (VQA)

Here we answer a question based on visual and textual cues:

prompt = "<grounding> Question: What is special about this image? Answer:"
run_example(prompt)

Answer with details:

Question: What is special about this image? Answer: The image features a snowman sitting by a campfire in the snow.
[('a snowman', (71, 80), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a campfire', (92, 102), [(0.109375, 0.640625, 0.546875, 0.984375)])]

<grounding> Question: What is special about this image? Answer: The image features<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> sitting by<phrase> a campfire</phrase><object><patch_index_0643><patch_index_1009></object> in the snow.

Grounded Image Captioning

Brief Captioning

For a concise summary:

prompt = "<grounding> An image of"
run_example(prompt)

Caption output:

An image of a snowman warming himself by a campfire.
[('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a campfire', (41, 51), [(0.109375, 0.640625, 0.546875, 0.984375)])]

<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming himself by<phrase> a campfire</phrase><object><patch_index_0643><patch_index_1009></object>.

Detailed Captioning

For an in-depth description:

prompt = "<grounding> Describe this image in detail:"
run_example(prompt)

Detailed caption output:

Describe this image in detail: The image features a snowman sitting by a campfire in the snow. He is wearing a hat, scarf, and gloves, with a pot nearby and a cup nearby. The snowman appears to be enjoying the warmth of the fire, and it appears to have a warm and cozy atmosphere.
[('a campfire', (71, 81), [(0.171875, 0.015625, 0.484375, 0.984375)]), ('a hat', (109, 114), [(0.515625, 0.046875, 0.828125, 0.234375)]), ('scarf', (116, 121), [(0.515625, 0.234375, 0.890625, 0.578125)]), ('gloves', (127, 133), [(0.515625, 0.390625, 0.640625, 0.515625)]), ('a pot', (140, 145), [(0.078125, 0.609375, 0.265625, 0.859375)]), ('a cup', (157, 162), [(0.890625, 0.765625, 0.984375, 0.984375)])]

<grounding> Describe this image in detail: The image features a snowman sitting by<phrase> a campfire</phrase><object><patch_index_0005><patch_index_1007></object> in the snow. He is wearing<phrase> a hat</phrase><object><patch_index_0048><patch_index_0250></object>,<phrase> scarf</phrase><object><patch_index_0240><patch_index_0604></object>, and<phrase> gloves</phrase><object><patch_index_0400><patch_index_0532></object>, with<phrase> a pot</phrase><object><patch_index_0610><patch_index_0872></object> nearby and<phrase> a cup</phrase><object><patch_index_0796><patch_index_1023></object> nearby. The snowman appears to be enjoying the warmth of the fire, and it appears to have a warm and cozy atmosphere.

Kosmos-2 Code Example

First, we will import the required libraries and modules.

from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image, ImageDraw, ImageFont
import base64
import io
import random

Declare the FastAPI app and load the Transformers model and processor

app = FastAPI()

# Load the model and processor
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

Draw bounding boxes on the image

Before we write the endpoint, we also need a function to draw the bounding boxes on the image.

def draw_bounding_boxes(image: Image.Image, entities):
    draw = ImageDraw.Draw(image)
    width, height = image.size

    # Define a color bank of hex codes
    color_bank = [
        "#0AC2FF", "#0AC2FF", "#30D5C8", "#F3C300",
        "#47FF0A", "#C2FF0A", "#F7CA18", "#D91E18",
        "#FF0AC2", "#FF0A47", "#DB0A5B", "#1E824C"
    ]

    # Use Arial if it is available, otherwise fall back to PIL's built-in default font
    font_size = 20
    try:
        font = ImageFont.truetype("arial.ttf", font_size)
    except IOError:
        font = ImageFont.load_default()

    for entity in entities:
        label, _, boxes = entity
        for box in boxes:
            box_coords = [
                box[0] * width,  # x_min
                box[1] * height, # y_min
                box[2] * width,  # x_max
                box[3] * height  # y_max
            ]

            # Randomly choose colors for the outline and text fill
            outline_color = random.choice(color_bank)
            text_fill_color = random.choice(color_bank)

            draw.rectangle(box_coords, outline=outline_color, width=4)

            # Adjust the position to draw text based on font size
            text_position = (box_coords[0] + 5, box_coords[1] - font_size - 5)
            draw.text(text_position, label, fill=text_fill_color, font=font)

    return image

This function scales the normalized coordinates returned by the model to the image size, draws the boxes and labels on the image, and returns the annotated result.
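As a quick sanity check, the helper can be exercised on its own with the entity tuples from the phrase grounding output earlier in this post (this assumes the imports and function above are in scope; the file names are placeholders):

# Entities in the (label, text_span, normalized_boxes) format returned by post_process_generation
entities = [
    ('a snowman', (0, 9), [(0.390625, 0.046875, 0.984375, 0.828125)]),
    ('the fire', (32, 40), [(0.203125, 0.015625, 0.453125, 0.859375)]),
]

image = Image.open("snowman.jpg")  # placeholder path
annotated = draw_bounding_boxes(image, entities)
annotated.save("snowman_annotated.jpg")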

Microsoft also provides its own function for drawing the bounding boxes, which you can use in place of the one above.

Write the endpoint

This endpoint will accept an image, process it with the model, draw the bounding boxes, and return the annotated image.

# Main endpoint for object detection and drawing
@app.post("/detect/")
async def detect_and_draw_objects(file: UploadFile = File(...)):
    if file.content_type.startswith('image/'):
        # Read the image file
        image_data = await file.read()
        image = Image.open(io.BytesIO(image_data)).convert("RGB")

        # Run the model with a grounded captioning prompt so any object can be detected
        prompt = "<grounding> An image of"
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        generated_ids = model.generate(
            pixel_values=inputs["pixel_values"],
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            image_embeds=None,
            image_embeds_position_mask=inputs["image_embeds_position_mask"],
            use_cache=True,
            max_new_tokens=128,
        )
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

        # Process the generated text
        processed_text, entities = processor.post_process_generation(generated_text)

        # Draw bounding boxes on the image
        annotated_image = draw_bounding_boxes(image, entities)

        # Convert the annotated image to base64
        buffered = io.BytesIO()
        annotated_image.save(buffered, format="JPEG")
        img_str = base64.b64encode(buffered.getvalue()).decode()

        # Prepare the response data
        response_data = {
            "description": processed_text,
            "entities": entities,
            "image_base64": img_str
        }

        # Return the JSON response with the image and detection data
        return JSONResponse(content=response_data)
    else:
        raise HTTPException(status_code=400, detail="File type not supported")

Testing the API and Getting a Response

Now that we have the endpoint written, we can test it out. Save the API code above as api.py, then start the server:

uvicorn api:app --reload

This should start the API server on port 8000. We can now use the requests library to send a POST request to the endpoint:

http://localhost:8000/detect/
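Here is a minimal sketch of such a request, assuming a local test image named snowman.jpg (the file name is a placeholder):

import requests

# Send a local image to the running endpoint
with open("snowman.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8000/detect/",
        files={"file": ("snowman.jpg", f, "image/jpeg")},
    )

data = response.json()
print(data["description"])
print(data["entities"])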

Adding a Front-End

Now that we have the API working, we can add a front-end to make it easier to use. We will use the streamlit library to create a simple front-end.

First, we need to import the required libraries and modules.

import streamlit as st
import requests
from PIL import Image
import io
import base64

If you do not have Streamlit installed already, the command is:

pip install streamlit

Now we can build the interface:

# Streamlit interface
st.title('Object Detection API Client')

# Endpoint URL
api_url = "http://localhost:8000/detect/"



# File uploader allows user to add their own image
uploaded_file = st.sidebar.file_uploader("Choose an image...", type=["jpg", "jpeg"])

if uploaded_file is not None:
    # Display the uploaded image
    image = Image.open(uploaded_file)
    st.image(image, caption='Uploaded Image.', use_column_width=True)

    # Convert the uploaded image to bytes
    buffered = io.BytesIO()
    image.save(buffered, format=image.format)
    img_byte = buffered.getvalue()

    # Prepare the file in the correct format for uploading
    files = {'file': ('image.jpg', img_byte, 'image/jpeg')}

    # Post the image to the endpoint
    st.write("Sending image to the API...")
    response = requests.post(api_url, files=files)

    # Check the response
    if response.status_code == 200:
        st.write("Response received from the API!")
        response_data = response.json()

        # Display the base64 image
        base64_image = response_data['image_base64']
        st.write("Annotated Image:")
        st.image(base64.b64decode(base64_image), caption='Processed Image.', use_column_width=True)

        # Display the other response data
        st.write("Description:", response_data['description'])
        st.write("Entities Detected:")
        st.json(response_data['entities'])
    else:
        st.error("Failed to get response from the API")

Finally, you will need a script to start both servers. This assumes the API code is saved as api.py and the Streamlit code as app.py, matching the commands below.

Create a file named run.sh in your favorite text editor and add the following code:

#!/bin/bash

# Execute the streamlit command
streamlit run app.py &

# Execute the uvicorn command
uvicorn api:app --reload

Make the file executable and run it:

chmod +x run.sh
./run.sh

Testing

The frontend and the API should now be available at http://localhost:8501/ and http://localhost:8000/, respectively.

How does this compare to Meta's Object Detection API?

Citations

Microsoft Kosmos-2

@article{kosmos-2,
  title={Kosmos-2: Grounding Multimodal Large Language Models to the World},
  author={Zhiliang Peng and Wenhui Wang and Li Dong and Yaru Hao and Shaohan Huang and Shuming Ma and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2306.14824}
}

@article{kosmos-1,
  title={Language Is Not All You Need: Aligning Perception with Language Models},
  author={Shaohan Huang and Li Dong and Wenhui Wang and Yaru Hao and Saksham Singhal and Shuming Ma and Tengchao Lv and Lei Cui and Owais Khan Mohammed and Qiang Liu and Kriti Aggarwal and Zewen Chi and Johan Bjorck and Vishrav Chaudhary and Subhojit Som and Xia Song and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2302.14045}
}

@article{metalm,
  title={Language Models are General-Purpose Interfaces},
  author={Yaru Hao and Haoyu Song and Li Dong and Shaohan Huang and Zewen Chi and Wenhui Wang and Shuming Ma and Furu Wei},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.06336}
}