
Getting Hands-On with Mistral AI: From API to Self-Hosted in One Afternoon

A practical walkthrough of two paths to working with Mistral — the managed API for fast prototyping and self-hosted deployment for full control — with real code covering prompting, model selection, function calling, RAG, and INT8 quantization.

Alexandre Agius

AWS Solutions Architect

9 min read

Mistral AI gives you two distinct ways to use their models: a managed API (La Plateforme) and open-weight models you can self-host. Most tutorials cover one or the other. This post covers both, side by side, so you can make an informed choice for your use case.

The Problem

You want to build with Mistral models. You go to their docs and immediately face a fork: do you use the API (mistralai Python SDK) or download the weights and run them yourself (HuggingFace Transformers)? Each path has different capabilities, costs, and trade-offs — but you won’t discover them until you’ve invested hours going down one road.

The API gives you function calling, JSON mode, and embeddings out of the box. Self-hosted gives you full control over quantization, latency, and data residency. Knowing which features live where — and what the code actually looks like — saves you from making the wrong architectural choice.

The Solution

Work through both paths in a single session. Start with the API for rapid iteration (prompting, model selection, function calling, RAG), then switch to self-hosted for deployment control (FP16 loading, INT8 quantization, local RAG).

Two Paths to Mistral AI

The decision comes down to your constraints: if you need speed to market and don’t want to manage GPUs, use the API. If you need data sovereignty, predictable costs, or custom inference optimization, self-host.

How It Works

Path 1: The Mistral API

The API is the fastest way to get a response from a Mistral model. Install the SDK, provide an API key, and you're generating text in a few lines of code.

# mistralai Python SDK v0.x interface (the 1.x releases later renamed the client)
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

client = MistralClient(api_key="your-key")  # better: read from the MISTRAL_API_KEY env var

response = client.chat(
    model="mistral-small-latest",
    messages=[ChatMessage(role="user", content="What is the capital of France?")]
)
print(response.choices[0].message.content)

This is the “hello world” of Mistral. The model parameter is where it gets interesting.

Model Selection: Small vs Medium vs Large

Mistral offers tiered models optimized for different workloads. The naming is straightforward — smaller models are faster and cheaper, larger models are more capable.

| Model | Best for | Relative cost |
|---|---|---|
| mistral-small-latest | Classification, simple extraction, routing | Lowest |
| mistral-medium-latest | Email composition, summarization, language tasks | Medium |
| mistral-large-latest | Complex reasoning, math, multi-step logic | Highest |

The practical difference shows up in tasks that require reasoning. Given a dataset of transactions and asked to find the two closest payment amounts and calculate the date difference, mistral-small gets confused. mistral-large solves it correctly by first sorting, then comparing, then calculating.

The cost difference between small and large is roughly 10x. The rule of thumb: start with small, escalate only when quality drops. Classification and extraction rarely need large. Reasoning and math usually do.
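The rule of thumb above can be encoded as a simple router. A minimal sketch — the task categories are illustrative, not an official taxonomy; only the model aliases come from Mistral's naming:

```python
# Route a task to the cheapest model tier likely to handle it well.
# Categories are hypothetical labels for this sketch, not a Mistral API concept.
TIER_BY_TASK = {
    "classification": "mistral-small-latest",
    "extraction": "mistral-small-latest",
    "summarization": "mistral-medium-latest",
    "composition": "mistral-medium-latest",
    "reasoning": "mistral-large-latest",
    "math": "mistral-large-latest",
}

def pick_model(task_type: str) -> str:
    """Default to the large tier when the task type is unknown."""
    return TIER_BY_TASK.get(task_type, "mistral-large-latest")
```

Escalation then becomes a one-line change: move a task category up a tier when its output quality drops, instead of paying large-tier prices for every call.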

Function Calling: Connecting Models to Your Code

Function calling is where the API becomes genuinely useful for production systems. Instead of the model guessing at data, it calls your functions to retrieve real information.

The flow has four steps:

Step 1 — Define tools as JSON schemas:

tools = [{
    "type": "function",
    "function": {
        "name": "retrieve_payment_status",
        "description": "Get payment status of a transaction",
        "parameters": {
            "type": "object",
            "properties": {
                "transaction_id": {
                    "type": "string",
                    "description": "The transaction id."
                }
            },
            "required": ["transaction_id"]
        }
    }
}]

Step 2 — Model generates function arguments (not the answer):

response = client.chat(
    model="mistral-large-latest",
    messages=chat_history,
    tools=tools,
    tool_choice="auto"
)
# response contains: name="retrieve_payment_status", arguments={"transaction_id": "T1001"}

Step 3 — You execute the function with those arguments:

function_result = retrieve_payment_status(df, transaction_id="T1001")
# Returns: {"status": "Paid"}

Step 4 — Feed the result back, model generates the final answer:

chat_history.append({"role": "tool", "content": function_result, "tool_call_id": tool_id})
response = client.chat(model=model, messages=chat_history)
# "The status of your transaction T1001 is Paid."

The model never touches your database. It just decides which function to call and what arguments to pass. You execute, you return, the model synthesizes. This separation is what makes function calling safe for production.
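The mechanical piece connecting steps 2 and 3 is turning the model's tool call (a function name plus a JSON argument string) into a real invocation. A self-contained sketch of that dispatch, using a hypothetical in-memory transaction table in place of the post's df:

```python
import json

# Hypothetical stand-in for the transaction dataframe
TRANSACTIONS = {"T1001": {"status": "Paid"}, "T1002": {"status": "Pending"}}

def retrieve_payment_status(transaction_id: str) -> str:
    """Tool implementation: look up a transaction, return a JSON string result."""
    record = TRANSACTIONS.get(transaction_id)
    return json.dumps({"status": record["status"] if record else "Unknown"})

# Registry mapping tool names (from the JSON schemas) to Python callables
TOOL_REGISTRY = {"retrieve_payment_status": retrieve_payment_status}

def execute_tool_call(name: str, arguments: str) -> str:
    """Run the function the model selected, with the arguments it generated."""
    return TOOL_REGISTRY[name](**json.loads(arguments))

# Step 2 output fed into step 3; the string result becomes the "tool" message
result = execute_tool_call("retrieve_payment_status", '{"transaction_id": "T1001"}')
```

The registry pattern is what keeps this safe: the model can only select from functions you explicitly listed, and unknown transaction IDs degrade to a harmless "Unknown" rather than an exception.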

RAG via the API: Embeddings + FAISS

The API includes an embeddings endpoint (mistral-embed) that produces 1024-dimensional vectors. Combined with FAISS for similarity search, you get a RAG pipeline in about 30 lines.

import numpy as np
import faiss

# Embed documents
def get_text_embedding(text):
    response = client.embeddings(model="mistral-embed", input=text)
    return response.data[0].embedding

# Chunk, embed, index (FAISS expects float32)
chunks = [text[i:i+512] for i in range(0, len(text), 512)]
embeddings = np.array([get_text_embedding(chunk) for chunk in chunks]).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Query: D holds distances, I holds indices of the 2 nearest chunks
query_embedding = np.array([get_text_embedding(question)]).astype("float32")
D, I = index.search(query_embedding, k=2)
retrieved = [chunks[i] for i in I[0]]

Then inject the retrieved chunks into the prompt as context. The model answers based on your documents rather than its training data. This is the most common enterprise pattern for Mistral deployments: grounded answers that sharply reduce hallucination on your proprietary data.
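The injection step itself is plain string assembly. A minimal sketch of a prompt builder — the template wording is illustrative, not from Mistral's docs:

```python
def build_rag_prompt(retrieved_chunks, question):
    """Assemble retrieved chunks and the user question into one grounded prompt."""
    context = "\n---\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Answer the query using only the context above.\n"
        f"Query: {question}\n"
    )

prompt = build_rag_prompt(
    ["Chunk about refunds.", "Chunk about payment terms."],
    "What is the refund policy?",
)
```

The "using only the context above" instruction is doing the grounding work; pass the resulting string as the user message content in the chat call.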

Path 2: Self-Hosted Deployment

Switching to self-hosted means downloading model weights and running inference on your own GPU. The trade-off: more setup, but full control.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

On a T4 GPU (16 GB VRAM), Mistral-7B in FP16 occupies ~14.3 GB — a tight fit. The model loads, but you have barely any headroom for KV cache during generation.
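That headroom concern is quantifiable. Assuming the published Mistral-7B v0.1 architecture (32 layers, grouped-query attention with 8 KV heads, head dimension 128) and FP16 cache entries, a back-of-the-envelope KV-cache estimate looks like this — a sketch, not a measurement:

```python
# Per-layer KV cache: 2 tensors (K and V), each kv_heads x head_dim, 2 bytes in FP16
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

def kv_cache_bytes(seq_len: int) -> int:
    """Estimated KV-cache size in bytes for one sequence of seq_len tokens."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16 * seq_len

per_token = kv_cache_bytes(1)        # 131072 bytes = 128 KiB per token
full_context = kv_cache_bytes(4096)  # 0.5 GiB for a 4096-token context
```

With roughly 1.7 GB free after the FP16 weights on a 16 GB T4, a single full-length context fits, but batching several does not — which is exactly the pressure quantization relieves below.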

Key parameters for inference:

# Tokenize with Mistral's [INST] instruction format, then generate
inputs = tokenizer("[INST] Explain quantization briefly. [/INST]",
                   return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,      # 0.1 = deterministic, 1.0 = creative
    do_sample=True,       # False = greedy decoding
    pad_token_id=tokenizer.eos_token_id  # Mistral has no pad token
)

The pad_token_id line is important — Mistral-7B doesn’t define a pad token by default, so you need to set it explicitly or you’ll get warnings that clutter your output.

INT8 Quantization: Half the Memory, Same Quality

With FP16 eating 14.3 GB on a 16 GB GPU, you have no room for anything else. INT8 quantization cuts that in half:

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
# GPU memory: ~7.5 GB (vs 14.3 GB for FP16)

The output quality is nearly identical. Ask both models to explain quantization and you get the same structure, same key points, same level of detail. The INT8 version just uses half the memory and leaves room for longer contexts and batch processing.
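The halving follows directly from bytes per parameter. A rough estimator, assuming ~7.24 billion parameters for Mistral-7B and ignoring quantization overhead such as outlier weights kept in higher precision:

```python
PARAMS = 7.24e9  # approximate parameter count of Mistral-7B

def weight_memory_gb(bytes_per_param: float) -> float:
    """Weight memory only — activations and KV cache come on top of this."""
    return PARAMS * bytes_per_param / 1e9

fp16_gb = weight_memory_gb(2)  # ~14.5 GB at 2 bytes/param
int8_gb = weight_memory_gb(1)  # ~7.2 GB at 1 byte/param
```

The estimates land close to the measured 14.3 GB and 7.5 GB above; the small gaps come from non-quantized layers, CUDA buffers, and framework overhead.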

Real numbers from a T4:

| Precision | VRAM Used | Quality | Use case |
|---|---|---|---|
| FP16 | 14.3 GB | Baseline | Dev/testing |
| INT8 | 7.5 GB | ~99% of FP16 | Production on budget GPUs |

Self-Hosted RAG: No API Dependency

For self-hosted RAG, you swap mistral-embed for a local embedding model like sentence-transformers/all-MiniLM-L6-v2 (384 dimensions, runs on CPU) and use the same FAISS retrieval pattern:

from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Normalize so the dot product below is a true cosine similarity
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

# Same retrieval logic: cosine similarity + top-k
query_embedding = embedder.encode([question], normalize_embeddings=True)
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
top_k = np.argsort(similarities)[-2:][::-1]

The advantage: no API calls, no data leaving your infrastructure. The embedding model is tiny (~80 MB), runs on CPU, and produces results in milliseconds. For European enterprises with data sovereignty requirements, this is often the deciding factor.
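The retrieval core is independent of which embedding model produced the vectors. A self-contained sketch with toy 3-dimensional vectors standing in for real embeddings — normalizing first so the dot product equals cosine similarity:

```python
import numpy as np

def top_k_cosine(doc_embeddings, query_embedding, k=2):
    """Return indices of the k documents most similar to the query (cosine)."""
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    similarities = docs @ query
    return np.argsort(similarities)[-k:][::-1]  # best match first

# Toy "embeddings": doc 0 points nearly the same direction as the query
docs = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.1])
ranked = top_k_cosine(docs, query)  # doc 0 ranks first
```

Swapping mistral-embed for all-MiniLM-L6-v2 (or any other model) changes only how `docs` and `query` are produced; this ranking logic stays identical.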

API vs Self-Hosted: The Decision Matrix

| Factor | API (La Plateforme) | Self-Hosted |
|---|---|---|
| Setup time | Minutes | Hours |
| GPU required | No | Yes |
| Function calling | Built-in | You build it |
| JSON mode | Built-in | Prompt engineering |
| Embeddings | mistral-embed (1024d) | Bring your own model |
| Data residency | Mistral's servers | Your infrastructure |
| Cost model | Per token | Fixed GPU cost |
| Model selection | Small/Medium/Large | Any open-weight model |
| Quantization control | None | Full (FP16/INT8/INT4) |
| Max throughput | Rate limited | Hardware limited |

For prototyping and most SaaS products: start with the API. For regulated industries, high-volume inference, or when you need to control every parameter: self-host.

What I Learned

  • Function calling is the killer feature of the API — it’s not just chat completion. The four-step flow (define tools, model generates args, you execute, model synthesizes) is what makes Mistral viable for production systems that need to interact with real data.
  • INT8 quantization is free performance on budget hardware — going from 14.3 GB to 7.5 GB with no perceptible quality loss means you can run Mistral-7B on a T4 with headroom to spare. There’s no reason to run FP16 in production on memory-constrained GPUs.
  • The two paths aren’t mutually exclusive — the most practical architecture uses the API for rapid development and function calling, then migrates latency-sensitive or high-volume inference to self-hosted once the use case is validated. Start managed, graduate to self-hosted.

Do It Yourself

Try it now:

  1. Test the API with function calling: Sign up for Mistral API access at console.mistral.ai, grab an API key, and run the function calling example from the official docs. Define a tool for retrieving data (weather, stock prices, database query), watch the model generate the function arguments, execute them yourself, and feed the result back.
  2. Run Mistral-7B locally in INT8: Install transformers and bitsandbytes, then run the INT8 quantization code from the post. You need a GPU with 8GB+ VRAM (T4, RTX 3060, or better). Compare generation quality and memory usage against FP16. HuggingFace model hub: mistralai/Mistral-7B-Instruct-v0.1.
  3. Build a simple RAG pipeline: Use the API’s mistral-embed endpoint with FAISS for local search (example in the post), or swap to sentence-transformers/all-MiniLM-L6-v2 for fully self-hosted. Embed a few documents, store in FAISS, query with a question, and inject the retrieved text into the prompt. Full walkthrough in Mistral’s RAG guide.
