A Practical Guide to Fine-Tuning LLMs: From Full Training to LoRA
Understand how LLM fine-tuning works, when to use it, and how to choose between full fine-tuning, LoRA, soft prompts, and other PEFT methods.
Fine-tuning is how you take a general-purpose LLM and make it yours. But “fine-tuning” is an umbrella term covering wildly different approaches — from retraining every parameter to learning a handful of virtual tokens. Choosing the wrong method means burning compute for marginal gains.
This post breaks down the options and when to use each one.
The Problem
Base foundation models are impressive but generic. They don’t know your domain vocabulary, your output format requirements, or your organization’s tone. Prompt engineering gets you far, RAG adds knowledge — but sometimes you need the model itself to behave differently.
The challenge: fine-tuning methods range from trivial to massive in cost and complexity. Without understanding what each method actually does inside the model, it’s hard to pick the right one.
The Solution
Map each fine-tuning method to where it intervenes in the transformer architecture, then choose based on your quality requirements, compute budget, and how different your target behavior is from the base model.

The decision ladder:
- Prompt engineering — just ask the model better. Zero cost. Try this first.
- RAG — give the model reference documents at inference time. Moderate effort, no training.
- PEFT (LoRA) — train a small set of added parameters (~0.1–1% of model size). Low cost, high impact. Default choice.
- Full fine-tuning — retrain everything. Maximum quality, maximum cost.
How It Works
Inside a Transformer
Every transformer-based LLM is built from repeated layers with two main components:
- Self-Attention — figures out which tokens are relevant to each other
- Feedforward Network (FFN) — processes what that information means
The FFN is deceptively simple: expand the representation to a larger dimension, apply an activation function, compress it back down. These layers store most of the model’s factual knowledge and account for roughly two-thirds of its parameters.
The Attention → FFN pair forms one layer, and these layers are stacked dozens of times in modern LLMs.
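To make the "roughly two-thirds" figure concrete, here is a back-of-the-envelope parameter count for a single layer. The dimensions are illustrative assumptions in the style of a 7B-class model (hidden size 4096, gated FFN inner size 11008), not values taken from any specific checkpoint:

```python
# Rough per-layer parameter count for a 7B-class transformer layer.
# Dimensions are illustrative assumptions, not exact for any model.
d_model = 4096    # hidden size
d_ff = 11008      # FFN inner dimension

# Self-attention: four d_model x d_model projections (Q, K, V, output)
attn_params = 4 * d_model * d_model

# Gated FFN (Llama-style): three matrices (up, gate, down projections)
ffn_params = 3 * d_model * d_ff

total = attn_params + ffn_params
print(f"attention: {attn_params/1e6:.0f}M  FFN: {ffn_params/1e6:.0f}M")
print(f"FFN share of layer parameters: {ffn_params / total:.0%}")  # ~67%
```

With these numbers the FFN holds about 135M of the layer's 202M parameters, which is where the two-thirds rule of thumb comes from.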

Different fine-tuning methods intervene at different points in this pipeline.
Fine-Tuning Flavors
Think of it this way:
- Pre-training = general education (school, university)
- Fine-tuning = job-specific training (learning your company’s processes)
| Method | What it does |
|---|---|
| Supervised Fine-Tuning (SFT) | Train on input/output pairs — the most common starting point |
| RLHF | Humans rank outputs, model learns to prefer higher-ranked ones |
| Instruction Tuning | SFT subset focused on instruction-following — turns base models into chat models |
Typical pipeline: Base Model → SFT → RLHF → Final Model
The “What” vs the “How”
A common source of confusion: people compare instruction fine-tuning, LoRA, and soft prompts as if they’re all alternatives. They’re not — they answer different questions.
- Instruction fine-tuning = the what — the type of training data you use (instruction/response pairs)
- LoRA, soft prompts, full fine-tuning = the how — the technical method you use to modify the model
You can do instruction fine-tuning with LoRA. Or with full fine-tuning. Or with soft prompts. The training data and the training method are independent choices.
| Method | What it is | Analogy |
|---|---|---|
| Instruction fine-tuning | Training data format | The textbook |
| LoRA | Training method — modifies model internals | Teaching new skills |
| Soft prompts | Training method — modifies model input | Writing a cheat sheet |
In practice, instruction fine-tuning + LoRA is what most people mean when they say “fine-tuning” today.
Full Fine-Tuning vs PEFT
This is the fundamental fork in the road.
Full fine-tuning unfreezes every parameter. Maximum flexibility, but a 7B model needs 100+ GB VRAM for training.
PEFT freezes the original model, trains only a small number of new parameters. Gets surprisingly close to full fine-tuning quality.
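The "100+ GB" figure follows from a standard memory budget for mixed-precision training with Adam: you hold the weights, their gradients, and three extra optimizer copies in memory at once. A rough sketch (activations excluded, so real usage is even higher):

```python
# Why full fine-tuning a 7B model needs 100+ GB of VRAM:
# a rough memory budget for mixed-precision Adam training.
params = 7e9

weights   = params * 2   # fp16 weights: 2 bytes per parameter
gradients = params * 2   # fp16 gradients: 2 bytes per parameter
optimizer = params * 12  # Adam: fp32 master weights + two fp32 moments

total_gb = (weights + gradients + optimizer) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB
```

Activations and temporary buffers add more on top, which is why full fine-tuning of even a "small" 7B model is a multi-GPU job.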
| | Full Fine-Tuning | PEFT |
|---|---|---|
| Params trained | 100% | ~0.1–2% |
| Memory | Very high | 10-100x less |
| Quality | Best possible | 90-99% of full fine-tuning |
| Training speed | Slow | Fast |
| Catastrophic forgetting risk | Higher | Lower (original weights frozen) |
| Multi-task | Separate model copies | Swap small adapters on one base model |
LoRA (Low-Rank Adaptation)
The most popular PEFT method. Injects small trainable matrices alongside existing attention layers. The original weights stay frozen; LoRA matrices modify how information flows through the model.
Think of it as teaching someone new skills — the model internalizes new capabilities.
- Trains ~0.1-1% of total parameters
- Produces small adapter files (MBs, not GBs)
- Multiple adapters can be swapped on a single base model
QLoRA goes further: quantizes the base model to 4-bit, letting you fine-tune a 65B model on a single GPU.
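Both the "~0.1-1% of parameters" and the single-GPU QLoRA claims are easy to sanity-check with arithmetic. For a frozen d × d weight matrix, LoRA trains two low-rank factors B (d × r) and A (r × d) instead of the full matrix; QLoRA then shrinks the frozen base from 16-bit to 4-bit. The dimensions below are illustrative:

```python
# LoRA: per-matrix trainable fraction for an illustrative hidden size.
d, r = 4096, 8                       # hidden size 4096, LoRA rank 8
full_matrix = d * d                  # frozen original weights
lora_params = d * r + r * d          # trainable low-rank factors B and A
print(f"trainable fraction per matrix: {lora_params / full_matrix:.2%}")  # ~0.39%

# QLoRA: memory for the frozen base model at 16-bit vs 4-bit.
params_65b = 65e9
fp16_gb = params_65b * 2 / 1e9       # 2 bytes per parameter -> ~130 GB
nf4_gb = params_65b * 0.5 / 1e9      # 4 bits per parameter  -> ~32.5 GB
print(f"65B base model: {fp16_gb:.0f} GB fp16 -> {nf4_gb:.1f} GB 4-bit")
```

At 4-bit, a 65B base model fits in roughly 32.5 GB, which is how QLoRA squeezes fine-tuning onto a single 48 GB GPU.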
Soft Prompts (Prompt Tuning)
Learns virtual tokens prepended to your input. The model itself is completely frozen.
Think of it as giving someone a really good briefing — they don’t change, they just get better instructions.
[learned token 1][learned token 2]...[learned token N] + your actual input
- Trains only thousands of parameters (vs millions for LoRA)
- Adapter size in KBs
- Limited for complex generation — the change is surface-level
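The "thousands of parameters" claim is just the virtual token count times the embedding width. A quick sketch with illustrative values (20 virtual tokens, hidden size 4096):

```python
# Prompt tuning trains only the embeddings of N virtual tokens.
d_model = 4096      # illustrative hidden size
n_virtual = 20      # typical soft-prompt length

soft_prompt_params = n_virtual * d_model
print(f"trainable params: {soft_prompt_params:,}")  # 81,920

# Adapter file size at fp16 (2 bytes per parameter)
size_kb = soft_prompt_params * 2 / 1024
print(f"adapter size: {size_kb:.0f} KB")  # 160 KB
```

Compare that to LoRA's millions of trainable parameters and MB-scale adapters: soft prompts are tiny, which is both their appeal and their limitation.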
PEFT Methods Compared
| Method | How it works | Best for |
|---|---|---|
| LoRA | Adds trainable matrices to attention layers | General-purpose fine-tuning |
| QLoRA | LoRA + quantized base model | Large models on limited hardware |
| Prefix Tuning | Learns virtual tokens at every layer | Generation tasks |
| Adapters | Inserts small trainable layers between frozen layers | Multi-task setups |
| Soft Prompts | Learns input embeddings only | Simple classification |
The key difference: soft prompts change what the model sees (input). LoRA changes how the model thinks (internal processing).
Choosing Your Method
| Situation | Method |
|---|---|
| Most use cases, best cost/quality tradeoff | LoRA |
| Large model, limited GPU budget | QLoRA |
| Simple classification or routing | Soft Prompts |
| Maximum quality, unlimited budget | Full Fine-Tuning |
| Multiple tasks on one base model | LoRA (swap adapters) |
On AWS
- Amazon Bedrock Custom Models — fine-tune foundation models without managing infrastructure
- Amazon SageMaker — full control, bring your own training scripts, Hugging Face PEFT library
- AWS Trainium / Inferentia — cost-effective custom silicon for training and inference at scale
For most teams, Bedrock’s managed fine-tuning is the path of least resistance. Use SageMaker when you need full control over hyperparameters or cutting-edge PEFT techniques.
What I Learned
- Separate the “what” from the “how” — instruction fine-tuning is the training data, LoRA is the training method. They’re independent choices, not alternatives
- LoRA is the default answer — it covers the vast majority of use cases at a fraction of the cost of full fine-tuning
- Don’t fine-tune first — prompt engineering and RAG solve most problems without any training
- FFN layers are the knowledge store — understanding where knowledge lives in a transformer helps explain why different methods work differently
- Soft prompts are niche — interesting for minimal adapters, but LoRA’s quality advantage makes it the practical choice
- QLoRA is a game-changer for access — fine-tuning large models on consumer hardware was effectively out of reach before it
Do It Yourself
Key Takeaways
- LoRA is the practical default — it trains ~1% of parameters, costs 10-100x less than full fine-tuning, and reaches 90-99% of full fine-tuning quality. Start here unless you have a compelling reason not to.
- “What” vs “how” confusion is common — instruction fine-tuning (the training data) and LoRA (the training method) are independent choices, not alternatives. You can do instruction fine-tuning with LoRA.
- Don’t fine-tune first — prompt engineering and RAG solve most problems without training. Fine-tune only when you need consistent behavior changes the model can’t achieve through instructions alone.
Try It Now
- Start with prompt engineering — before fine-tuning, iterate on your prompt structure. Use Claude Code, ChatGPT, or any frontier model to refine instructions until quality plateaus. Only then consider fine-tuning.
- Fine-tune with LoRA on SageMaker — use the Hugging Face PEFT library on SageMaker. Full tutorial: SageMaker + Hugging Face PEFT. Start with rank r=8, lora_alpha=16, and 3 epochs.
- Use Amazon Bedrock for managed fine-tuning — if you want zero infrastructure management, Bedrock Custom Models handles the training pipeline end-to-end. Bedrock Custom Models guide
- Measure quality rigorously — create a held-out test set before training, define task-specific metrics (accuracy, F1, ROUGE, human eval), and compare fine-tuned vs base model. Never trust vibes.
- Try QLoRA for large models — if you want to fine-tune a 70B model on a single GPU, QLoRA quantizes the base model to 4-bit and applies LoRA adapters. Reference implementation: QLoRA paper code
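The LoRA setup above can be sketched with the Hugging Face PEFT library. This is a minimal configuration outline, assuming `transformers` and `peft` are installed; the model name is a placeholder and the target modules are a common but model-dependent choice:

```python
# Minimal LoRA configuration sketch with Hugging Face PEFT.
# "your-base-model" is a placeholder; target_modules vary by architecture.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("your-base-model")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of total
```

From here, training proceeds with your usual `Trainer` or custom loop; only the adapter weights receive gradients.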