hf_mem estimates the memory required for inference, including model weights and an optional KV cache, for Safetensors and GGUF models on the Hugging Face Hub. It reads what it needs through HTTP Range requests, i.e., without downloading or loading any weights locally.
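
To illustrate why Range requests are enough, here is a minimal sketch of reading a Safetensors header without fetching the tensor data. It relies on the published Safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function names and the `url` argument are hypothetical, and this is not necessarily how hf_mem implements it.

```python
import json
import struct
import urllib.request
from typing import Callable

# A fetcher returns the bytes start..end (inclusive), like an HTTP Range request.
RangeFetcher = Callable[[int, int], bytes]


def http_range_fetcher(url: str) -> RangeFetcher:
    """Build a fetcher for a remote file (hypothetical helper; url is caller-supplied)."""
    def fetch(start: int, end: int) -> bytes:
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    return fetch


def read_safetensors_header(fetch: RangeFetcher) -> dict:
    """Read only the JSON header: 8 bytes for its length, then the header itself."""
    (header_len,) = struct.unpack("<Q", fetch(0, 7))
    return json.loads(fetch(8, 8 + header_len - 1))


def weight_bytes(header: dict) -> int:
    """Sum the tensor byte sizes declared in the header's data_offsets."""
    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        start, end = info["data_offsets"]
        total += end - start
    return total


def make_demo_file() -> bytes:
    """Build a tiny in-memory Safetensors-like file so the sketch runs offline."""
    header = {
        "__metadata__": {"format": "pt"},
        "w": {"dtype": "F16", "shape": [4, 8], "data_offsets": [0, 64]},
        "b": {"dtype": "F32", "shape": [8], "data_offsets": [64, 96]},
    }
    raw = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw + bytes(96)


if __name__ == "__main__":
    blob = make_demo_file()
    local_fetch = lambda start, end: blob[start:end + 1]
    print(weight_bytes(read_safetensors_header(local_fetch)))
```

Passing the fetcher as a parameter keeps the parsing logic independent of the transport, so the same code works against a local byte string or a remote file.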

When to use?

User asks how much VRAM or memory a model needs to run
User wants to know if a model fits on their GPU or a given instance
User references a Hugging Face model ID or URL and asks about inference requirements
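
For a rough sense of what "fits" means in these questions, the sketch below applies the standard KV cache size formula for a decoder-only transformer and compares the total against a device budget. The function and parameter names are hypothetical, the check ignores activations and runtime overhead, and hf_mem's own estimate may differ.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Keys and values (factor 2) for every layer, KV head, and cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def fits(weights: int, kv_cache: int, device_bytes: int) -> bool:
    """Lower-bound check only: weights plus KV cache against available memory."""
    return weights + kv_cache <= device_bytes


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, 8192 tokens.
    kv = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
    print(kv / 1024**3, "GiB of KV cache")
    print(fits(weights=16 * 1024**3, kv_cache=kv, device_bytes=24 * 1024**3))
```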

What are the requirements?

uv installed (for uvx)
HF_TOKEN env var or --hf-token flag (for gated or private models only)

How to run?

Run hf-mem (through uvx) with --model-id set to the Hugging Face Hub repository ID. The tool first checks that the repository contains either Safetensors weights (model.safetensors, model.safetensors.index.json if sharded, or model_index.json for Diffusers) or GGUF weights.
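
As a minimal sketch of the invocation, the helper below assembles the command from the pieces named above and runs it. It assumes the tool is exposed through uvx as an executable called hf-mem; the helper names and the model ID are hypothetical placeholders, and only the --model-id and --hf-token flags mentioned in this document are used.

```python
import shutil
import subprocess
from typing import List, Optional


def build_hf_mem_command(model_id: str, hf_token: Optional[str] = None) -> List[str]:
    """Assemble the argument list; the token flag is only needed for gated or private models."""
    cmd = ["uvx", "hf-mem", "--model-id", model_id]
    if hf_token:
        cmd += ["--hf-token", hf_token]
    return cmd


def run_hf_mem(model_id: str, hf_token: Optional[str] = None) -> str:
    """Run the command and return its standard output as text."""
    if shutil.which("uvx") is None:
        raise RuntimeError("uvx not found; install uv first")
    result = subprocess.run(
        build_hf_mem_command(model_id, hf_token),
        capture_output=True,
        text=True,
        check=True,  # raise if the tool exits with a non-zero status
    )
    return result.stdout


if __name__ == "__main__":
    # "org/model" is a placeholder, not a real repository.
    print(" ".join(build_hf_mem_command("org/model")))
```

Setting the HF_TOKEN environment variable instead of passing --hf-token keeps the token out of the process argument list.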
