# QLoRA: Quantized Low-Rank Adaptation
QLoRA enables fine-tuning of large language models on consumer GPUs by combining 4-bit quantization with LoRA adapters. A 65B model can be fine-tuned on a single 48GB GPU while matching 16-bit fine-tuning performance.
Prerequisites: This skill assumes familiarity with LoRA. See the lora skill for LoRA fundamentals (`LoraConfig`, `target_modules`, training patterns).
## Table of Contents
- Core Innovations
- BitsAndBytesConfig Deep Dive
- Memory Requirements
- Complete Training Example
- Inference and Merging
- Troubleshooting
- Best Practices
- References
## Core Innovations
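As a minimal sketch of how the pieces described above fit together, the following loads a frozen base model in 4-bit NF4 precision and attaches trainable LoRA adapters. It assumes the standard Hugging Face `transformers`, `peft`, and `bitsandbytes` APIs; the model name and hyperparameters are illustrative placeholders, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config: NF4 data type with double quantization,
# computing forward/backward passes in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder base model; substitute your own
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training: casts layer norms to fp32
# and enables gradient checkpointing for stability.
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters; only these small matrices receive gradients.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The quantized base weights stay frozen throughout training; only the adapters are updated in higher precision, which is what keeps the memory footprint within a single consumer GPU.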