Comprehensive Zephyr-7b API tutorial
Author: Tim Dolan
Table of Contents
- Introduction
- Prerequisites
- Import the Required Libraries
- Set Up the Pipeline and FastAPI Application
- Add the CORS Middleware
- Set Up a Pydantic Model
- Response Endpoint
- Testing
- Adding a Frontend
- Conclusion
- Citations
Introduction
Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). The Zephyr team found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful.

More details can be found in the model card or the technical report.
The GitHub repository for this project can be found here.
Prerequisites
This tutorial uses conda for the virtual environment, but you can use whichever environment manager you prefer.
Make sure that you have these frameworks installed:
- Transformers 4.35.0.dev0
pip install transformers --upgrade
- Pytorch 2.0.1+cu118
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia
- Datasets 2.12.0
pip install datasets
- Tokenizers 0.14.0
pip install tokenizers
The remainder of the requirements are:
pip install fastapi uvicorn streamlit
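Before moving on, it can help to confirm the environment is set up as expected. A quick sanity-check script (the versions printed will depend on what you actually installed):
import torch
import transformers
import datasets
import tokenizers

print("Transformers:", transformers.__version__)
print("PyTorch:", torch.__version__)
print("Datasets:", datasets.__version__)
print("Tokenizers:", tokenizers.__version__)
print("CUDA available:", torch.cuda.is_available())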
Import the required libraries
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import Optional
from transformers import pipeline
import torch
Set up the pipeline and FastAPI application
Now, let's get into the actual setup of our application.
app = FastAPI()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Here, you're initializing a FastAPI application and setting up the device for PyTorch, which will be either a GPU (cuda) or a CPU, based on availability.
Following that, you'll establish the pipeline for text generation using the Zephyr model.
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map=device,
)
This code sets up a text-generation pipeline with the specified Zephyr model. You can substitute zephyr-7b-beta with zephyr-7b-alpha if you prefer using the alpha version.
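Before wiring the pipeline into an endpoint, it's worth a quick smoke test in a Python shell. A minimal sketch following the chat-template pattern from the model card (the output will vary because sampling is enabled):
messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Say hello in one sentence."},
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
output = pipe(prompt, max_new_tokens=32, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])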
Add the CORS middleware
To ensure your API can communicate with other services, especially if they are hosted on different origins, you'll want to set up CORS.
# Add CORS middleware
origins = [
    "http://localhost",
    "http://localhost:8000",
    "http://localhost:8501",
    # Add any other origins from which you want to allow requests
]

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
This code snippet specifies which origins are allowed to communicate with your API. As of now, it's set up for local development.
Set up a Pydantic model
To ensure data consistency and validation, you'll define a Pydantic model. This model will dictate the structure of incoming data.
class QueryModel(BaseModel):
    user_message: str
    max_new_tokens: int = 256
    do_sample: bool = True
    temperature: float = 0.7
    top_k: int = 50
    top_p: float = 0.95
    system_message: Optional[str] = "You are a friendly chatbot who always responds in the style of a python developer that uses a combination of natural language and markdown to answer questions."
This model defines the input format for the chatbot, allowing users to customize the system message and other parameters.
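For example, only user_message is required; every other field falls back to its default. A quick sketch to illustrate (the printed values are just the defaults declared above):
query = QueryModel(user_message="Write a one-line hello world in Python.")
print(query.temperature)     # 0.7
print(query.max_new_tokens)  # 256
print(query.do_sample)       # True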
Response Endpoint
With the necessary setup out of the way, you can now define the main interaction endpoint.
# This endpoint allows the user to specify a custom system message
@app.post("/zephyr/system-message", description="Get a response from the Zephyr chatbot with a custom system message. The default message is geared towards python code.")
async def get_custom_response(query: QueryModel):
    messages = [
        {"role": "system", "content": query.system_message},
        {"role": "user", "content": query.user_message},
    ]
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    outputs = pipe(
        prompt,
        max_new_tokens=query.max_new_tokens,
        do_sample=query.do_sample,
        temperature=query.temperature,
        top_k=query.top_k,
        top_p=query.top_p,
    )
    # The generated text includes the prompt, so split at the last '</s>'
    # and keep everything after it
    response = outputs[0]["generated_text"].split("</s>")[-1]
    # Strip leading newlines and the assistant role tag left by the template
    response = response.lstrip("\n")
    if response.startswith("<|assistant|>"):
        response = response[len("<|assistant|>"):].lstrip()
    return {"response": response}
This endpoint uses the QueryModel to generate a chatbot response. It takes in a system and user message, processes them, and returns the model's response.
Testing
With everything in place, it's time to test your application.
Execute the following command in your terminal:
uvicorn filename:app --reload
Replace filename with the name of the Python file containing your app. This will start the FastAPI application. You can then access the interactive documentation and test the endpoints at:
http://localhost:8000/docs
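You can also exercise the endpoint directly from a script. A minimal sketch using requests (assuming the server is running locally on port 8000; the example prompt is just an illustration):
import requests

payload = {"user_message": "Write a function that reverses a string."}
r = requests.post("http://localhost:8000/zephyr/system-message", json=payload)
print(r.status_code)
print(r.json()["response"])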
Technically, this API is now a complete, self-hosted implementation of the model.
However, we may want to continue the development of this application and add some more features.
Adding a frontend
Import the required libraries
import streamlit as st
import requests
import time
import json
Set page config information
BASE_URL = "http://localhost:8000"

# Set page title and tab icon
st.set_page_config(
    page_title="Zephyr-API Demo",
    page_icon="🪂",  # You can specify a URL or an emoji as the tab icon
)
You can change this information to whatever you would like, including a different emoji.
Define the main structure
In this section, we're setting up the primary structure of our application's interface. Imagine you're walking into a control room. The first thing you'll see are the main controls and selection switches.
st.title("zephyr-7B-β demo")
endpoints = ["zephyr/system-message"]
selected_endpoint = st.sidebar.selectbox("Choose Endpoint:", endpoints)

# Endpoint-specific configurations
if selected_endpoint == "zephyr/system-message":
    system_message = st.sidebar.text_input(
        "System Message:",
        value="You are a friendly chatbot who always responds in the style of a python developer that uses a combination of natural language and python to answer questions.",
    )
Here, when the "zephyr/system-message" endpoint is selected, we're presented with a textbox to type in a system message. This is akin to setting the theme or mood of our chatbot.
Next, we will add the parameters in the sidebar to allow further configuration of the model.
These are universal controls that shape the behavior of our chatbot, no matter which endpoint we're using.
- max_new_tokens: Think of this as setting the maximum length of a response. If you want shorter replies, slide it down. If you're okay with longer answers, push it up.
- do_sample: This is like a toggle switch. When turned on, the chatbot gets a bit more creative with its responses.
- temperature, top_k, and top_p: These are fine-tuning knobs. They control the randomness, diversity, and focus of the chatbot's answers. If you've ever played with the bass, treble, and balance knobs on a stereo, it's kind of like that.
# Global model configurations (applies to all endpoints)
max_new_tokens = st.sidebar.slider("Max New Tokens:", 50, 500, 256)
do_sample = st.sidebar.checkbox("Do Sample:", value=True)
temperature = st.sidebar.slider("Temperature:", 0.1, 1.0, 0.7, 0.1)
top_k = st.sidebar.slider("Top K:", 1, 100, 50)
top_p = st.sidebar.slider("Top P:", 0.1, 1.0, 0.95, 0.05)
Initialize the session state
Have you ever wished that machines had memory? With Streamlit's session state, they sort of do!
# Initialize chat history
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])
Just like opening a chat window and seeing your previous messages, this section initializes the chat history so you can refer back to past interactions.
Build the request data
In this section, we're preparing to converse with our chatbot:
# Accept user input
if prompt := st.chat_input("What is up?"):
    # Construct the data dictionary based on the selected endpoint
    if selected_endpoint == "zephyr/system-message":
        data = {
            "user_message": prompt,
            "max_new_tokens": max_new_tokens,
            "do_sample": do_sample,
            "temperature": temperature,
            "top_k": top_k,
            "top_p": top_p,
            "system_message": system_message,  # Ensure system_message is defined
        }
Behind the scenes, this is how your message is packaged with all the settings you've chosen before sending it off to the chatbot.
Package the JSON data and send the request
Now, it's time to send our message:
    # Convert the data payload into a compact JSON string
    compact_json_payload = json.dumps(data)
    url = f"{BASE_URL}/{selected_endpoint}"

    # Send POST request to the API
    response = requests.post(url, data=compact_json_payload)
    response_data = response.json()
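Note that requests does not raise on an HTTP error status by itself, so it can help to guard the call before parsing the body. One way to do it with Streamlit's built-in helpers (st.error and st.stop are standard Streamlit calls; the check itself is an addition of mine):
    # Inside the `if prompt := ...:` block
    response = requests.post(url, data=compact_json_payload)
    if response.status_code != 200:
        st.error(f"API request failed with status {response.status_code}")
        st.stop()
    response_data = response.json()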
Display the response
Finally, we get to hear back from our chatbot:
with st.chat_message("user"):
st.markdown(prompt)
# Add user message to chat history
st.session_state.messages.append({"role": "user", "content": prompt})
# Display assistant response in chat message container
with st.chat_message("assistant"):
message_placeholder = st.empty()
full_response = ""
assistant_response = response_data["response"]
for chunk in assistant_response.split():
full_response += chunk + " "
time.sleep(0.05)
message_placeholder.markdown(full_response + "▌")
message_placeholder.markdown(full_response)
# Add assistant response to chat history
st.session_state.messages.append({"role": "assistant", "content": full_response})
Your message is displayed, just like in a chat window.
The text appears gradually, giving the illusion of a real-time conversation.
This is because of the simulated streaming in this chunk:
for chunk in assistant_response.split():
    full_response += chunk + " "
    time.sleep(0.05)
    message_placeholder.markdown(full_response + "▌")
message_placeholder.markdown(full_response)
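Keep in mind this is simulated streaming: the full response has already arrived, and we're just revealing it gradually. On newer Streamlit releases, st.write_stream accepts a generator and handles the placeholder bookkeeping for you. A minimal sketch, assuming your Streamlit version is recent enough to include st.write_stream:
def simulated_stream(text):
    # Yield the response word by word to mimic token streaming
    for word in text.split():
        yield word + " "
        time.sleep(0.05)

with st.chat_message("assistant"):
    full_response = st.write_stream(simulated_stream(assistant_response))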
Conclusion
With this setup, you can have a dynamic, flowing conversation with the chatbot, tweaking settings on the go and watching how it responds. Think of it as your personal chatroom where the chatbot is always eager to converse, and you're in control of its personality and style!
If you enjoyed this tutorial, please consider giving it a star on GitHub and sharing it with your friends and colleagues. If you would like more content using this model, there is more available on GitHub, and if you reach out, I would be happy to create more complex guides.
Citations
@misc{tunstall2023zephyr,
  title={Zephyr: Direct Distillation of LM Alignment},
  author={Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Clémentine Fourrier and Nathan Habib and Nathan Sarrazin and Omar Sanseviero and Alexander M. Rush and Thomas Wolf},
  year={2023},
  eprint={2310.16944},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}