Fine-Tuning Mistral with Transformers and Serving with vLLM on AWS

End-to-end guide: fine-tune Mistral models with LoRA using Hugging Face Transformers, then deploy at scale with vLLM on AWS — from training to production serving on SageMaker, ECS, or Bedrock.

Alexandre Agius

AWS Solutions Architect

11 min read

If you’re building production AI systems in 2026, chances are you don’t need GPT-4-class models for every task. What you need is a small, fast, domain-specific model that runs on your infrastructure, costs a fraction, and doesn’t send your data to a third party. That’s the sweet spot for fine-tuned Mistral models served with vLLM on AWS.

This is the pipeline I recommend to most teams: take a Mistral base model, fine-tune it with LoRA on your data, and serve it with vLLM behind a SageMaker endpoint (or ECS if you want more control). It’s battle-tested, cost-effective, and you can go from zero to production in a week.

Why Mistral? The Mistral family — Mistral 7B, Mixtral 8x7B, Mistral Small, Mistral Large — offers arguably the best price/performance ratio in open-weight LLMs. Strong multilingual support (especially French, which matters if you’re working with European clients), Apache 2.0 licensing for the smaller models, and architectures that fine-tune cleanly. Mistral 7B punches well above its weight class after fine-tuning — I’ve seen it match GPT-3.5-turbo on domain-specific tasks.

Why vLLM? Because naive model.generate() calls with Transformers leave 60-70% of your GPU idle. vLLM’s PagedAttention and continuous batching squeeze 2-4x more throughput from the same hardware. That’s the difference between needing 4 GPUs and needing 1.

Let’s build this.

Fine-Tuning with Transformers + PEFT/LoRA

Why LoRA, Not Full Fine-Tuning

Full fine-tuning of Mistral 7B requires ~60GB of GPU memory (model weights + optimizer states + gradients). That means you need at least an A100 80GB or multiple GPUs. LoRA (Low-Rank Adaptation) freezes the base model and injects small trainable matrices into the attention layers. The result:

  • Memory: ~16GB VRAM instead of ~60GB — fits on a single A10G (ml.g5.xlarge)
  • Speed: 2-3x faster training since you’re only updating ~0.5% of parameters
  • Cost: Train on a ~$1.21/hr ml.g5.2xlarge instead of a ~$33/hr p4d.24xlarge
  • Quality: For most downstream tasks, LoRA matches full fine-tuning

If even 16GB is too much, QLoRA (4-bit quantized base model + LoRA) drops it to ~8GB. I use QLoRA for experimentation and LoRA (with bf16) for production training runs.

Dataset Preparation

Your dataset needs to match Mistral’s chat template. Mistral uses a specific [INST] / [/INST] format. Here’s what your training data should look like:

# Each example as a conversation
dataset = [
    {
        "messages": [
            {"role": "user", "content": "Summarize the key risks in this contract clause: ..."},
            {"role": "assistant", "content": "The clause presents three main risks: ..."}
        ]
    }
]

Save this as JSONL. For the chat template, let the tokenizer handle it — don’t manually format [INST] tags:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
tokenizer.pad_token = tokenizer.eos_token

# The tokenizer.apply_chat_template() handles formatting
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=False
)

Tip: Aim for 1,000-10,000 high-quality examples. More data isn’t always better — 2,000 carefully curated examples often beat 50,000 noisy ones.
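If you're assembling the file by hand, a few lines of Python produce valid JSONL (the example content here is placeholder; `train.jsonl` is the filename the training script loads):

```python
import json

# Placeholder examples in the messages format shown above
dataset = [
    {
        "messages": [
            {"role": "user", "content": "Summarize the key risks in this contract clause: ..."},
            {"role": "assistant", "content": "The clause presents three main risks: ..."},
        ]
    },
]

# JSONL means exactly one JSON object per line, which is
# the format load_dataset("json", ...) expects
with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in dataset:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```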

Training Setup

Here’s the actual training code I use. No pseudocode — copy this and adapt:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# === Model & Tokenizer ===
model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# For QLoRA, uncomment the quantization config:
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16,
#     bnb_4bit_use_double_quant=True,
# )

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    # quantization_config=bnb_config,  # Uncomment for QLoRA
    device_map="auto",
    attn_implementation="flash_attention_2",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# === LoRA Config ===
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

# model = prepare_model_for_kbit_training(model)  # Uncomment for QLoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With all seven target modules above, expect roughly 42M trainable params (~0.6%);
# adapting only the four attention projections gives ~13.6M (~0.19%)

# === Dataset ===
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# === Training ===
training_args = SFTConfig(
    output_dir="./mistral-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size: 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    max_seq_length=2048,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    # dataset has a "messages" column, so recent TRL applies the chat template automatically
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)

trainer.train()
trainer.save_model()

Key Hyperparameters

These are my go-to defaults for Mistral fine-tuning:

  • LoRA rank (r): 16 — good balance. Use 8 for quick experiments, 32 if you have a complex task
  • LoRA alpha: 2x the rank (so 32 for r=16). This is the scaling factor
  • Learning rate: 2e-4 for LoRA, 1e-4 for QLoRA
  • Epochs: 3 for datasets under 5K examples, 1-2 for larger datasets
  • Batch size: As large as VRAM allows, use gradient accumulation to reach effective batch size of 16-32
  • Max sequence length: 2048 for most tasks, 4096 if you need longer context
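To see what those rank choices cost in trainable parameters, here's a back-of-the-envelope sketch using Mistral 7B's published dimensions (hidden size 4096, 8 KV heads of dim 128, MLP width 14336, 32 layers); the helper function is mine, not a PEFT API:

```python
def lora_params(shapes, r):
    # A LoRA adapter on a (d_out x d_in) weight adds A: (r x d_in) and B: (d_out x r)
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# Mistral 7B per-layer weight shapes as (d_out, d_in)
attention = [(4096, 4096), (1024, 4096), (1024, 4096), (4096, 4096)]  # q, k, v, o
mlp = [(14336, 4096), (14336, 4096), (4096, 14336)]                   # gate, up, down
n_layers = 32

attn_only = lora_params(attention, r=16) * n_layers
all_seven = (lora_params(attention, r=16) + lora_params(mlp, r=16)) * n_layers

print(f"attention-only: {attn_only:,}")  # 13,631,488
print(f"all seven:      {all_seven:,}")  # 41,943,040
```

Doubling `r` to 32 doubles both counts, which is why rank is the first knob to turn when a task needs more capacity.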

Training Infrastructure

Local/dev: Single A10G or RTX 4090. Fine for experiments and small datasets.

Production training: SageMaker Training Jobs on ml.g5.2xlarge (single A10G, 24GB VRAM). For Mixtral 8x7B, use ml.g5.12xlarge (4x A10G). SageMaker handles spot instances, checkpointing, and artifact management:

from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.47",
    pytorch_version="2.5",
    py_version="py311",
    hyperparameters={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "epochs": 3,
        "lr": 2e-4,
    },
    use_spot_instances=True,
    max_wait=7200,
    max_run=3600,
)

estimator.fit({"training": "s3://my-bucket/training-data/"})

Spot instances save 60-70% on g5 instances. Always use them for training jobs.

Merge & Export

After training, you have LoRA adapter weights (~50MB) separate from the base model (~14GB). For serving, merge them into a single model:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model (full precision for merging)
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="cpu",  # CPU to avoid VRAM issues
)

# Load and merge LoRA
model = PeftModel.from_pretrained(base_model, "./mistral-finetuned")
model = model.merge_and_unload()

# Save merged model
model.save_pretrained("./mistral-merged")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
tokenizer.save_pretrained("./mistral-merged")

# Push to Hub or S3
# model.push_to_hub("your-org/mistral-7b-custom")
# Or: aws s3 sync ./mistral-merged s3://my-bucket/models/mistral-7b-custom/

Important: Always merge on CPU. Loading the base model + LoRA adapter on GPU can OOM during the merge operation. The merged model is the same size as the original — ~14GB for Mistral 7B in bf16.
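If you're heading to SageMaker next, the merged directory needs to be packed as a model.tar.gz with the files at the archive root. A minimal sketch (the `package_model` helper is mine, not a SageMaker API):

```python
import tarfile
from pathlib import Path

def package_model(model_dir: str, out_path: str = "model.tar.gz") -> str:
    # SageMaker expects artifacts at the archive root, not nested in a subdirectory
    with tarfile.open(out_path, "w:gz") as tar:
        for item in Path(model_dir).iterdir():
            tar.add(item, arcname=item.name)
    return out_path

# package_model("./mistral-merged")
# then: aws s3 cp model.tar.gz s3://my-bucket/models/mistral-7b-custom/
```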

Serving with vLLM

Why vLLM Wins

vLLM (developed by Woosuk Kwon and team at UC Berkeley) solves the GPU memory waste problem in LLM inference through two key innovations:

PagedAttention: Traditional serving pre-allocates contiguous GPU memory for each request’s KV cache based on max sequence length. If your max is 4096 tokens but the average request uses 500, you’re wasting 88% of KV cache memory. PagedAttention borrows the concept of virtual memory paging from operating systems — it allocates KV cache in non-contiguous blocks and maps them with a page table. Memory waste drops from ~60% to under 4%.

Continuous batching: Instead of waiting for an entire batch to finish (static batching), vLLM inserts new requests into the batch as soon as any request completes. This keeps the GPU saturated and reduces latency for short requests.

The result: 2-4x higher throughput versus naive Transformers inference, with an OpenAI-compatible API out of the box.
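The continuous-batching win is easy to see with a toy scheduler: treat each decode step as one tick and each request as needing `length` ticks. Static batching waits for the slowest request in each batch; continuous batching refills a slot the tick it frees up. The numbers are illustrative, not vLLM internals:

```python
import heapq

def static_steps(lengths, batch_size):
    # The whole batch waits for its longest request before the next batch starts
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )

def continuous_steps(lengths, batch_size):
    # A finished request's slot is refilled immediately with the next waiting request
    slots = lengths[:batch_size]
    heapq.heapify(slots)
    for length in lengths[batch_size:]:
        finish = heapq.heappop(slots)
        heapq.heappush(slots, finish + length)
    return max(slots)

requests = [100, 100, 100, 700, 100, 100, 100, 100]  # output lengths in tokens
print(static_steps(requests, 4))      # 800: short requests stuck behind the 700
print(continuous_steps(requests, 4))  # 700: freed slots absorb the short requests
```

The gap widens as request lengths get more skewed, which is exactly the shape of real chat traffic.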

Quick Start

pip install vllm

# Serve your fine-tuned model
vllm serve ./mistral-merged \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.9

# For multi-GPU (Mixtral 8x7B on 2 GPUs):
# vllm serve ./mixtral-merged --tensor-parallel-size 2

Call it like OpenAI:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="./mistral-merged",
    messages=[{"role": "user", "content": "Analyze this contract clause..."}],
    max_tokens=512,
    temperature=0.1,
)

Key Serving Parameters

  • --tensor-parallel-size: Split model across GPUs. Use 2 for Mixtral on 2x A10G
  • --max-model-len: Cap sequence length to save memory. Set to your actual max, not the model’s theoretical max
  • --quantization awq: Serve AWQ-quantized models for ~2x memory savings with minimal quality loss
  • --gpu-memory-utilization 0.9: How much VRAM vLLM can use. 0.9 is safe, 0.95 for aggressive packing
  • --enable-prefix-caching: Reuse KV cache for shared prefixes (great for system prompts)
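Those memory flags are easier to reason about with the KV cache arithmetic. Per token, the cache holds one key and one value vector per layer per KV head; plugging in Mistral 7B's dimensions (32 layers, 8 KV heads of dim 128, 2 bytes per value in bf16) gives a rough capacity estimate. This is a sketch, not how vLLM actually partitions memory:

```python
def kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# On a 24 GB A10G at --gpu-memory-utilization 0.9, bf16 weights (~14 GB)
# leave roughly 7-8 GB for KV cache
budget_gb = 24 * 0.9 - 14
max_cached_tokens = int(budget_gb * 1024**3 / per_token)
print(max_cached_tokens)  # roughly 62K tokens shared across all concurrent requests
```

This is why lowering `--max-model-len` helps: a smaller cap per request lets the scheduler admit more concurrent sequences into that shared token budget.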

vLLM vs Alternatives

|               | vLLM                | TGI (Hugging Face) | TensorRT-LLM (NVIDIA)      |
|---------------|---------------------|--------------------|----------------------------|
| Throughput    | Excellent           | Good               | Best (with compilation)    |
| Ease of use   | Very easy           | Easy               | Complex                    |
| Model support | Broad               | Broad              | Narrower                   |
| Quantization  | AWQ, GPTQ, FP8      | GPTQ, AWQ          | FP8, INT8, INT4            |
| Best for      | Most production use | HF ecosystem       | Max performance on NVIDIA  |

My recommendation: start with vLLM. Move to TensorRT-LLM only if you’ve profiled and proven you need the extra 15-20% throughput and can afford the engineering complexity.

Deploying on AWS — 3 Options

Option A: SageMaker Endpoints

Managed infrastructure, autoscaling, built-in monitoring. The example below uses the Hugging Face TGI Deep Learning Container; SageMaker's LMI container also supports a vLLM backend if you want the same engine as the self-managed options:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

model = HuggingFaceModel(
    model_data="s3://my-bucket/models/mistral-7b-custom/model.tar.gz",
    role=role,
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.4-tgi2.4-gpu-py311-cu124-ubuntu22.04",
    env={
        "HF_MODEL_ID": "/opt/ml/model",
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_TOKENS": "2048",
        "MAX_TOTAL_TOKENS": "4096",
    },
)

predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    endpoint_name="mistral-custom-endpoint",
    container_startup_health_check_timeout=600,
)

# Invoke
response = predictor.predict({
    "inputs": "<s>[INST] Analyze this... [/INST]",
    "parameters": {"max_new_tokens": 512, "temperature": 0.1}
})

Best for: Teams that want managed infra, autoscaling, and integration with the SageMaker ecosystem. Setup time: ~1 hour.

Option B: ECS on G5 Instances

More control, potentially 30-40% cheaper at scale with reserved instances or Savings Plans:

FROM vllm/vllm-openai:latest

COPY ./mistral-merged /model

ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server", \
    "--model", "/model", \
    "--host", "0.0.0.0", \
    "--port", "8080", \
    "--max-model-len", "4096"]

Deploy with ECS task definition on g5.xlarge instances, ALB in front, and ECS Service Auto Scaling based on custom GPUUtilization metrics from CloudWatch.

Best for: Teams with container expertise who want cost optimization and full control over the serving stack.

Option C: Bedrock Custom Model Import

The simplest option if you’re already in the Bedrock ecosystem:

  1. Upload your merged model to S3 in Hugging Face format
  2. Create a custom model import job in Bedrock
  3. Bedrock provisions a managed endpoint
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_model_import_job(
    jobName="mistral-custom-import",
    importedModelName="mistral-7b-custom",
    roleArn="arn:aws:iam::123456789012:role/BedrockImportRole",  # your account ID here
    modelDataSource={
        "s3DataSource": {
            "s3Uri": "s3://my-bucket/models/mistral-7b-custom/"
        }
    },
)

Best for: Teams already using Bedrock who want zero infra management. Trade-off: less control over serving parameters.

Comparison

|                | SageMaker      | ECS/EKS                 | Bedrock Import     |
|----------------|----------------|-------------------------|--------------------|
| Setup effort   | Medium         | High                    | Low                |
| Cost control   | Good           | Best                    | Limited            |
| Autoscaling    | Built-in       | Custom (but flexible)   | Managed            |
| Customization  | Medium         | Full                    | Low                |
| Best instance  | ml.g5.2xlarge  | g5.xlarge               | Managed            |
| When to choose | Default choice | Cost-sensitive at scale | Already on Bedrock |

Performance & Cost

Throughput Benchmarks (Mistral 7B, Single A10G)

| Method                  | Throughput (tokens/sec) | Latency p50 (ms) |
|-------------------------|-------------------------|------------------|
| Transformers generate() | ~35 tok/s               | ~850 ms          |
| vLLM (bf16)             | ~120 tok/s              | ~250 ms          |
| vLLM (AWQ 4-bit)        | ~180 tok/s              | ~170 ms          |

That’s a 3-5x improvement just by switching the serving engine. AWQ quantization adds another 50% on top with negligible quality loss for most tasks.

AWS Instance Pricing (us-east-1, on-demand)

| Instance   | GPU     | VRAM  | On-demand $/hr | Spot $/hr |
|------------|---------|-------|----------------|-----------|
| g5.xlarge  | 1x A10G | 24 GB | $1.006         | ~$0.35    |
| g5.2xlarge | 1x A10G | 24 GB | $1.212         | ~$0.45    |
| g6e.xlarge | 1x L40S | 48 GB | $1.862         | ~$0.70    |

For Mistral 7B, g5.xlarge is the sweet spot — same GPU as g5.2xlarge, just fewer vCPUs (which doesn’t matter for inference).

Cost Estimation: 1M Tokens/Day

Assuming Mistral 7B with vLLM on g5.xlarge:

  • Throughput: ~120 tokens/sec, or ~10.3M tokens/day at full utilization
  • 1M tokens/day is only ~10% of one instance's capacity, but the minimum footprint is still one always-on instance
  • With ~1-year reserved pricing (~$0.63/hr): ~$15/day, or ~$460/month
  • Compare with GPT-3.5-turbo at $0.50/1M input tokens: roughly $15/month at this volume, but the API bill scales linearly with usage (plus $1.50/1M for output tokens), while the instance cost stays flat up to ~310M tokens/month

Custom fine-tuned Mistral becomes cost-competitive with the cheapest commercial APIs as volume grows, and you own the model, control the data, and can scale horizontally.
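The arithmetic here is simple enough to sanity-check in a few lines. The 120 tok/s throughput and ~$0.63/hr reserved rate are the assumptions from this section, not quoted AWS prices:

```python
throughput_tps = 120   # tokens/sec, vLLM bf16 on one A10G (benchmark above)
reserved_rate = 0.63   # $/hr, assumed ~1-yr reserved/Savings Plan g5.xlarge rate

tokens_per_day = throughput_tps * 86_400
print(f"{tokens_per_day / 1e6:.1f}M tokens/day")  # 10.4M tokens/day at full utilization

daily_cost = reserved_rate * 24
monthly_cost = daily_cost * 30.4
print(f"${daily_cost:.2f}/day, ~${monthly_cost:.0f}/month")  # $15.12/day, ~$460/month

# Effective cost per 1M tokens if you actually push the instance to capacity
per_million = monthly_cost / (tokens_per_day * 30.4 / 1e6)
print(f"${per_million:.2f} per 1M tokens")
```

At full utilization the effective rate lands under $1.50 per 1M tokens; the break-even against an API depends entirely on how close to that utilization you run.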

Optimization Tips

  • AWQ quantization: Run autoawq on your merged model before serving. Halves VRAM usage, negligible quality impact
  • Prefix caching: If all requests share a system prompt, enable --enable-prefix-caching for 20-30% throughput improvement
  • Tensor parallelism: For Mixtral 8x7B, use 2x A10G with --tensor-parallel-size 2
  • Right-size max_model_len: Don’t set 32K if your requests never exceed 2K. Lower = more concurrent requests

Key Takeaways

Use this pipeline when: You have a well-defined task, 1K+ training examples, need data privacy, and want predictable inference costs. Classification, extraction, summarization, domain-specific Q&A — all perfect fits.

My recommendations:

  • Start with Mistral 7B Instruct v0.3 — best balance of size and capability
  • Fine-tune with LoRA (r=16, alpha=32) on SageMaker Training Jobs (ml.g5.2xlarge, spot)
  • Serve with vLLM on SageMaker Endpoints to start, migrate to ECS when you need cost optimization
  • Always AWQ quantize your production model — free performance

What to watch:

  • SGLang is emerging as a strong vLLM alternative with better structured output support
  • Mistral’s own fine-tuning API (la Plateforme) is convenient but locks you into their ecosystem
  • NVIDIA NIM offers optimized containers but adds licensing complexity
  • Bedrock Custom Model Import keeps improving — watch for native vLLM support

The Mistral + LoRA + vLLM + AWS stack is the production workhorse for custom LLMs in 2026. It’s not the most exciting stack — it’s the one that ships.

Alexandre Agius

AWS Solutions Architect

Passionate about AI & Security. Building scalable cloud solutions and helping organizations leverage AWS services to innovate faster. Specialized in Generative AI, serverless architectures, and security best practices.
