One of the projects that genuinely gives me hope about the future of AI is llama.cpp. At a time when most discussions revolve around burning money on massive closed models, I’ve increasingly found that small, well-tuned models running on my machine are often enough.

Many models in the 4B–8B range are already competent at everyday reasoning tasks. This experiment was about pushing that idea further: can a small model solve LeetCode problems reliably, and how small can it get without falling apart?

The Task

The constraints were intentionally strict:

  • Solve LeetCode-style problems
  • Train on a single T4 GPU
  • Finish training within an hour
  • Use a small dataset (~5K samples)
  • Export to GGUF and run locally via Ollama
  • Dataset used: OpenCoder-LLM

Model and Training Setup

The base model was LLaMA-3.1-8B, loaded using Unsloth with 4-bit weights to keep VRAM usage low:

from unsloth import FastLanguageModel
import torch

max_seq_length = 768  # shared with the LoRA and trainer configuration below

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=max_seq_length,
    dtype=torch.float16,
    load_in_4bit=True,
)

Fine-tuning was done using LoRA-based supervised fine-tuning (SFT) via Unsloth and TRL. The base model was first prepared for parameter-efficient training by attaching LoRA adapters to the attention and MLP projection layers, precisely the components that most influence reasoning and generation.

from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

Only a small fraction of the model’s parameters were made trainable, keeping memory usage low while still allowing the model to adapt effectively to LeetCode-style reasoning tasks.
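
As a quick sanity check, the trainable fraction can be printed. This relies on the print_trainable_parameters helper that PEFT-wrapped models normally expose, which is an assumption about the object Unsloth returns here:

# Rough check: with r=16 adapters on the seven projection modules above,
# on the order of ~40M of the 8B parameters (well under 1%) should be trainable.
# Assumes the Unsloth-wrapped model exposes PEFT's helper method.
model.print_trainable_parameters()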

Training itself was intentionally simple and lightweight. A small batch size with gradient accumulation was used to stay within the limits of a single T4 GPU, and the optimizer was chosen to be both memory-efficient and stable.
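
The trainer below also expects TRAIN_DATASET_PREP_5K to expose a single "text" column. A minimal preparation sketch follows; the dataset id, subset, column names, and prompt template are illustrative assumptions rather than the exact preprocessing used:

from datasets import load_dataset

# Dataset id, subset, and column names are assumptions; adjust to the actual OpenCoder-LLM split used.
raw = load_dataset("OpenCoder-LLM/opc-sft-stage2", "educational_instruct", split="train")

def to_text(example):
    # Collapse each problem/solution pair into the single "text" field SFTTrainer reads.
    return {
        "text": (
            "### Problem:\n" + example["instruction"] + "\n\n"
            "### Solution:\n" + example["output"] + tokenizer.eos_token
        )
    }

TRAIN_DATASET_PREP_5K = (
    raw.shuffle(seed=3407)
       .select(range(5000))  # keep ~5K samples
       .map(to_text, remove_columns=raw.column_names)
)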

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = TRAIN_DATASET_PREP_5K,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = True,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)
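
With everything configured, training reduces to a single call:

# Runs the 60-step LoRA fine-tune and returns loss/runtime statistics.
trainer_stats = trainer.train()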

Despite training on a ~5K sample dataset and running for only 60 steps, the model showed strong generalization on unseen LeetCode problems. This reinforced a key takeaway from the experiment: when the base model is already strong, fine-tuning is less about scale and more about alignment—guiding the model toward the right problem-solving patterns rather than teaching it from scratch.

Results on Unseen Problems

The most interesting outcome was generalization.

Even with:

  • a relatively small dataset
  • repeated passes over the same examples
  • no massive curriculum or synthetic data

…the model performed consistently well on unseen LeetCode questions. It produced structured reasoning, handled edge cases reasonably, and often reached correct solutions without excessive verbosity.
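
A quick way to spot-check such problems is Unsloth's inference mode. The snippet below is a sketch; the prompt is an illustrative example, not a problem from the evaluation set:

# Switch Unsloth into its optimized inference mode before generating.
FastLanguageModel.for_inference(model)

# Illustrative prompt, mirroring the assumed training template.
prompt = (
    "### Problem:\n"
    "Given an array of integers nums and a target, return the indices of the "
    "two numbers that add up to the target.\n\n"
    "### Solution:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))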

Merging and Quantization

After training, the LoRA adapters were merged into the base model:

model.save_pretrained_merged(
    "model",
    tokenizer,
    save_method="merged_16bit",
)

The merged model was then converted to GGUF and quantized to q4_k_m, reducing the size by roughly 70% (to ~4 GiB) while preserving most of the performance.
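
One way to do this is Unsloth's save_pretrained_gguf helper, which wraps llama.cpp's conversion and quantization tooling. A sketch of that route, with the output directory name as a placeholder:

# Export a GGUF file quantized to q4_k_m via Unsloth's llama.cpp wrapper.
model.save_pretrained_gguf(
    "model_gguf",  # output directory (placeholder name)
    tokenizer,
    quantization_method="q4_k_m",
)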

Running Locally

The final GGUF model runs locally via Ollama, entirely offline, on CPU. No Python, no GPU, no cloud APIs—just a compact model answering LeetCode-style questions on demand.

Future Objectives

  • Evaluation framework: build a structured benchmark to measure accuracy and consistency across LeetCode difficulty levels.
  • Further size reduction: explore parameter pruning on top of quantization to shrink the model further with minimal performance loss.