
Getting Hands-On with Mistral AI: From API to Self-Hosted in One Afternoon

A practical walkthrough of two paths to working with Mistral — the managed API for fast prototyping and self-hosted deployment for full control — with real code covering prompting, model selection, function calling, RAG, and INT8 quantization.

Alexandre Agius

AWS Solutions Architect

9 min read

Mistral AI gives you two distinct ways to use their models: a managed API (La Plateforme) and open-weight models you can self-host. Most tutorials cover one or the other. This post covers both, side by side, so you can make an informed choice for your use case.

The Problem

You want to build with Mistral models. You go to their docs and immediately face a fork: do you use the API (mistralai Python SDK) or download the weights and run them yourself (HuggingFace Transformers)? Each path has different capabilities, costs, and trade-offs — but you won’t discover them until you’ve invested hours going down one road.

The API gives you function calling, JSON mode, and embeddings out of the box. Self-hosted gives you full control over quantization, latency, and data residency. Knowing which features live where — and what the code actually looks like — saves you from making the wrong architectural choice.

The Solution

Work through both paths in a single session. Start with the API for rapid iteration (prompting, model selection, function calling, RAG), then switch to self-hosted for deployment control (FP16 loading, INT8 quantization, local RAG).

Two Paths to Mistral AI

The decision comes down to your constraints: if you need speed to market and don’t want to manage GPUs, use the API. If you need data sovereignty, predictable costs, or custom inference optimization, self-host.

How It Works

Path 1: The Mistral API

The API is the fastest way to get a response from a Mistral model. Install the SDK, provide an API key, and you're generating text in a few lines of code.

# mistralai Python SDK v0.x interface (the 1.x releases later renamed the client)
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

client = MistralClient(api_key="your-key")  # better: read from the MISTRAL_API_KEY env var

response = client.chat(
    model="mistral-small-latest",
    messages=[ChatMessage(role="user", content="What is the capital of France?")]
)
print(response.choices[0].message.content)

This is the “hello world” of Mistral. The model parameter is where it gets interesting.

Model Selection: Small vs Medium vs Large

Mistral offers tiered models optimized for different workloads. The naming is straightforward — smaller models are faster and cheaper, larger models are more capable.

| Model | Best for | Relative cost |
|---|---|---|
| mistral-small-latest | Classification, simple extraction, routing | Lowest |
| mistral-medium-latest | Email composition, summarization, language tasks | Medium |
| mistral-large-latest | Complex reasoning, math, multi-step logic | Highest |

The practical difference shows up in tasks that require reasoning. Given a dataset of transactions and asked to find the two closest payment amounts and calculate the date difference, mistral-small gets confused. mistral-large solves it correctly by first sorting, then comparing, then calculating.

The cost difference between small and large is roughly 10x. The rule of thumb: start with small, escalate only when quality drops. Classification and extraction rarely need large. Reasoning and math usually do.
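The rule of thumb above can be encoded as a simple router. A minimal sketch — the task categories are illustrative, not an official taxonomy; only the model aliases come from Mistral's naming:

```python
# Route a task to the cheapest model tier likely to handle it well.
# Categories are hypothetical labels for this sketch, not a Mistral API concept.
TIER_BY_TASK = {
    "classification": "mistral-small-latest",
    "extraction": "mistral-small-latest",
    "summarization": "mistral-medium-latest",
    "composition": "mistral-medium-latest",
    "reasoning": "mistral-large-latest",
    "math": "mistral-large-latest",
}

def pick_model(task_type: str) -> str:
    """Default to the large tier when the task type is unknown."""
    return TIER_BY_TASK.get(task_type, "mistral-large-latest")
```

Escalation then becomes a one-line change: move a task category up a tier when its output quality drops, instead of paying large-tier prices for every call.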

Function Calling: Connecting Models to Your Code

Function calling is where the API becomes genuinely useful for production systems. Instead of the model guessing at data, it calls your functions to retrieve real information.

The flow has four steps:

Step 1 — Define tools as JSON schemas:

tools = [{
    "type": "function",
    "function": {
        "name": "retrieve_payment_status",
        "description": "Get payment status of a transaction",
        "parameters": {
            "type": "object",
            "properties": {
                "transaction_id": {
                    "type": "string",
                    "description": "The transaction id."
                }
            },
            "required": ["transaction_id"]
        }
    }
}]

Step 2 — Model generates function arguments (not the answer):

response = client.chat(
    model="mistral-large-latest",
    messages=chat_history,
    tools=tools,
    tool_choice="auto"
)
# response contains: name="retrieve_payment_status", arguments={"transaction_id": "T1001"}

Step 3 — You execute the function with those arguments:

function_result = retrieve_payment_status(df, transaction_id="T1001")
# Returns: {"status": "Paid"}

Step 4 — Feed the result back, model generates the final answer:

chat_history.append({"role": "tool", "content": function_result, "tool_call_id": tool_id})
response = client.chat(model=model, messages=chat_history)
# "The status of your transaction T1001 is Paid."

The model never touches your database. It just decides which function to call and what arguments to pass. You execute, you return, the model synthesizes. This separation is what makes function calling safe for production.
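The mechanical piece connecting steps 2 and 3 is turning the model's tool call (a function name plus a JSON argument string) into a real invocation. A self-contained sketch of that dispatch, using a hypothetical in-memory transaction table in place of the post's df:

```python
import json

# Hypothetical stand-in for the transaction dataframe
TRANSACTIONS = {"T1001": {"status": "Paid"}, "T1002": {"status": "Pending"}}

def retrieve_payment_status(transaction_id: str) -> str:
    """Tool implementation: look up a transaction, return a JSON string result."""
    record = TRANSACTIONS.get(transaction_id)
    return json.dumps({"status": record["status"] if record else "Unknown"})

# Registry mapping tool names (from the JSON schemas) to Python callables
TOOL_REGISTRY = {"retrieve_payment_status": retrieve_payment_status}

def execute_tool_call(name: str, arguments: str) -> str:
    """Run the function the model selected, with the arguments it generated."""
    return TOOL_REGISTRY[name](**json.loads(arguments))

# Step 2 output fed into step 3; the string result becomes the "tool" message
result = execute_tool_call("retrieve_payment_status", '{"transaction_id": "T1001"}')
```

The registry pattern is what keeps this safe: the model can only select from functions you explicitly listed, and unknown transaction IDs degrade to a harmless "Unknown" rather than an exception.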

RAG via the API: Embeddings + FAISS

The API includes an embeddings endpoint (mistral-embed) that produces 1024-dimensional vectors. Combined with FAISS for similarity search, you get a RAG pipeline in about 30 lines.

import numpy as np
import faiss

# Embed documents
def get_text_embedding(text):
    response = client.embeddings(model="mistral-embed", input=text)
    return response.data[0].embedding

# Chunk, embed, index (FAISS expects float32)
chunks = [text[i:i+512] for i in range(0, len(text), 512)]
embeddings = np.array([get_text_embedding(chunk) for chunk in chunks]).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Query: D holds distances, I holds indices of the 2 nearest chunks
query_embedding = np.array([get_text_embedding(question)]).astype("float32")
D, I = index.search(query_embedding, k=2)
retrieved = [chunks[i] for i in I[0]]

Then inject the retrieved chunks into the prompt as context. The model answers based on your documents rather than its training data. This is the most common enterprise pattern for Mistral deployments: grounded answers that sharply reduce hallucination on your proprietary data.
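The injection step itself is plain string assembly. A minimal sketch of a prompt builder — the template wording is illustrative, not from Mistral's docs:

```python
def build_rag_prompt(retrieved_chunks, question):
    """Assemble retrieved chunks and the user question into one grounded prompt."""
    context = "\n---\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Answer the query using only the context above.\n"
        f"Query: {question}\n"
    )

prompt = build_rag_prompt(
    ["Chunk about refunds.", "Chunk about payment terms."],
    "What is the refund policy?",
)
```

The "using only the context above" instruction is doing the grounding work; pass the resulting string as the user message content in the chat call.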

Path 2: Self-Hosted Deployment

Switching to self-hosted means downloading model weights and running inference on your own GPU. The trade-off: more setup, but full control.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

On a T4 GPU (16 GB VRAM), Mistral-7B in FP16 occupies ~14.3 GB — a tight fit. The model loads, but you have barely any headroom for KV cache during generation.
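That headroom concern is quantifiable. Assuming the published Mistral-7B v0.1 architecture (32 layers, grouped-query attention with 8 KV heads, head dimension 128) and FP16 cache entries, a back-of-the-envelope KV-cache estimate looks like this — a sketch, not a measurement:

```python
# Per-layer KV cache: 2 tensors (K and V), each kv_heads x head_dim, 2 bytes in FP16
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

def kv_cache_bytes(seq_len: int) -> int:
    """Estimated KV-cache size in bytes for one sequence of seq_len tokens."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16 * seq_len

per_token = kv_cache_bytes(1)        # 131072 bytes = 128 KiB per token
full_context = kv_cache_bytes(4096)  # 0.5 GiB for a 4096-token context
```

With roughly 1.7 GB free after the FP16 weights on a 16 GB T4, a single full-length context fits, but batching several does not — which is exactly the pressure quantization relieves below.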

Key parameters for inference:

# Tokenize with Mistral's [INST] instruction format, then generate
inputs = tokenizer("[INST] Explain quantization briefly. [/INST]",
                   return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,      # 0.1 = deterministic, 1.0 = creative
    do_sample=True,       # False = greedy decoding
    pad_token_id=tokenizer.eos_token_id  # Mistral has no pad token
)

The pad_token_id line is important — Mistral-7B doesn’t define a pad token by default, so you need to set it explicitly or you’ll get warnings that clutter your output.

INT8 Quantization: Half the Memory, Same Quality

With FP16 eating 14.3 GB on a 16 GB GPU, you have no room for anything else. INT8 quantization cuts that in half:

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
# GPU memory: ~7.5 GB (vs 14.3 GB for FP16)

The output quality is nearly identical. Ask both models to explain quantization and you get the same structure, same key points, same level of detail. The INT8 version just uses half the memory and leaves room for longer contexts and batch processing.
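The halving follows directly from bytes per parameter. A rough estimator, assuming ~7.24 billion parameters for Mistral-7B and ignoring quantization overhead such as outlier weights kept in higher precision:

```python
PARAMS = 7.24e9  # approximate parameter count of Mistral-7B

def weight_memory_gb(bytes_per_param: float) -> float:
    """Weight memory only — activations and KV cache come on top of this."""
    return PARAMS * bytes_per_param / 1e9

fp16_gb = weight_memory_gb(2)  # ~14.5 GB at 2 bytes/param
int8_gb = weight_memory_gb(1)  # ~7.2 GB at 1 byte/param
```

The estimates land close to the measured 14.3 GB and 7.5 GB above; the small gaps come from non-quantized layers, CUDA buffers, and framework overhead.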

Real numbers from a T4:

| Precision | VRAM Used | Quality | Use case |
|---|---|---|---|
| FP16 | 14.3 GB | Baseline | Dev/testing |
| INT8 | 7.5 GB | ~99% of FP16 | Production on budget GPUs |

Self-Hosted RAG: No API Dependency

For self-hosted RAG, you swap mistral-embed for a local embedding model like sentence-transformers/all-MiniLM-L6-v2 (384 dimensions, runs on CPU) and use the same FAISS retrieval pattern:

from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Normalize so the dot product below is a true cosine similarity
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

# Same retrieval logic: cosine similarity + top-k
query_embedding = embedder.encode([question], normalize_embeddings=True)
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
top_k = np.argsort(similarities)[-2:][::-1]

The advantage: no API calls, no data leaving your infrastructure. The embedding model is tiny (~80 MB), runs on CPU, and produces results in milliseconds. For European enterprises with data sovereignty requirements, this is often the deciding factor.
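The retrieval core is independent of which embedding model produced the vectors. A self-contained sketch with toy 3-dimensional vectors standing in for real embeddings — normalizing first so the dot product equals cosine similarity:

```python
import numpy as np

def top_k_cosine(doc_embeddings, query_embedding, k=2):
    """Return indices of the k documents most similar to the query (cosine)."""
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    similarities = docs @ query
    return np.argsort(similarities)[-k:][::-1]  # best match first

# Toy "embeddings": doc 0 points nearly the same direction as the query
docs = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.1])
ranked = top_k_cosine(docs, query)  # doc 0 ranks first
```

Swapping mistral-embed for all-MiniLM-L6-v2 (or any other model) changes only how `docs` and `query` are produced; this ranking logic stays identical.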

API vs Self-Hosted: The Decision Matrix

| Factor | API (La Plateforme) | Self-Hosted |
|---|---|---|
| Setup time | Minutes | Hours |
| GPU required | No | Yes |
| Function calling | Built-in | You build it |
| JSON mode | Built-in | Prompt engineering |
| Embeddings | mistral-embed (1024d) | Bring your own model |
| Data residency | Mistral's servers | Your infrastructure |
| Cost model | Per token | Fixed GPU cost |
| Model selection | Small/Medium/Large | Any open-weight model |
| Quantization control | None | Full (FP16/INT8/INT4) |
| Max throughput | Rate limited | Hardware limited |

For prototyping and most SaaS products: start with the API. For regulated industries, high-volume inference, or when you need to control every parameter: self-host.

What I Learned

  • Function calling is the killer feature of the API — it’s not just chat completion. The four-step flow (define tools, model generates args, you execute, model synthesizes) is what makes Mistral viable for production systems that need to interact with real data.
  • INT8 quantization is free performance on budget hardware — going from 14.3 GB to 7.5 GB with no perceptible quality loss means you can run Mistral-7B on a T4 with headroom to spare. There’s no reason to run FP16 in production on memory-constrained GPUs.
  • The two paths aren’t mutually exclusive — the most practical architecture uses the API for rapid development and function calling, then migrates latency-sensitive or high-volume inference to self-hosted once the use case is validated. Start managed, graduate to self-hosted.

Do It Yourself

Try it now:

  1. Test the API with function calling: Sign up for Mistral API access at console.mistral.ai, grab an API key, and run the function calling example from the official docs. Define a tool for retrieving data (weather, stock prices, database query), watch the model generate the function arguments, execute them yourself, and feed the result back.
  2. Run Mistral-7B locally in INT8: Install transformers and bitsandbytes, then run the INT8 quantization code from the post. You need a GPU with 8GB+ VRAM (T4, RTX 3060, or better). Compare generation quality and memory usage against FP16. HuggingFace model hub: mistralai/Mistral-7B-Instruct-v0.1.
  3. Build a simple RAG pipeline: Use the API’s mistral-embed endpoint with FAISS for local search (example in the post), or swap to sentence-transformers/all-MiniLM-L6-v2 for fully self-hosted. Embed a few documents, store in FAISS, query with a question, and inject the retrieved text into the prompt. Full walkthrough in Mistral’s RAG guide.
