
Why I built stressllm

Cloud | AWS | DevOps | AI 📍 Toronto 🇨🇦 🚀 Cloud Architect @ AWS 👨🏽‍🏫 Professor

A few weeks ago I wrote about building a multi-agent system on my old personal laptop; you can read more about it here. The project failed miserably: my hardware looked sufficient on paper, but the system was unusable in practice. I realized I was flying blind against the KV cache, the "working memory" that scales with context. I needed a way to know exactly where my hardware would "choke" before I spent hours debugging a lagging agent.

So, I spent the weekend vibe-coding stressllm: a Python CLI tool that performs context saturation tests to find your hardware's breaking point.

The Core Bottleneck: VRAM vs. Context

Most people only look at "Static VRAM," i.e., the memory needed to load the model weights. But the real killer is Dynamic VRAM.

As your conversation grows, the model needs to store the "Keys" and "Values" (the KV cache) for every token to avoid re-processing the entire prompt. This cache grows linearly with context length. On my 6-year-old laptop with 2GB of VRAM, the moment that cache spills over into system RAM, performance falls off a cliff.
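To make that linear growth concrete, here is a back-of-the-envelope estimate (a sketch of the standard KV-cache formula, not part of stressllm; the model shape below is an assumed Llama-7B-like configuration):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    # 2x for Keys AND Values, stored per layer, per KV head, per token.
    # bytes_per_elem=2 assumes fp16 cache entries.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, fp16
gib = kv_cache_bytes(32, 32, 128, n_tokens=4096) / 2**30
print(f"{gib:.1f} GiB")  # → 2.0 GiB of cache at just 4k context
```

At that rate, a 4k-token conversation alone already outgrows a 2GB card before you count the weights.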

Under the Hood: The Architecture

I built StressLLM with a clean separation of concerns, splitting it into two components: probe.py (the Watcher, which observes the hardware) and engine.py (the Worker, which runs the stress test).

1. probe.py: Hardware Telemetry & Constraints

This module is responsible for heartbeat-style monitoring of your system.

  • NVIDIA Only: The tool currently relies on pynvml (NVIDIA Management Library). It specifically looks for nvml.dll on Windows or the standard drivers on Linux.

  • Graceful Fallback: When the GPU isn't available, the tool automatically pivots to reporting system-wide RAM and CPU usage via psutil.

  • Error Isolation: Each sensor (VRAM, Temperature, CPU) is independent. If a single telemetry call fails, the rest of the stats are still yielded.
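The error-isolation idea can be sketched like this (hypothetical code, not the actual probe.py; the sensor helpers and names are illustrative):

```python
def read_sensors(sensors):
    """Run each sensor independently; a failure in one doesn't drop the rest."""
    stats = {}
    for name, fn in sensors.items():
        try:
            stats[name] = fn()
        except Exception:
            stats[name] = None  # this sensor is unavailable; keep going
    return stats

def _gpu_vram_mb():
    import pynvml  # NVIDIA-only; raises if nvml.dll / drivers are missing
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1e6

def _ram_percent():
    import psutil  # graceful fallback path when there is no GPU
    return psutil.virtual_memory().percent

# On a box with no NVIDIA GPU, "vram_mb" comes back as None
# while "ram_pct" still reports normally.
stats = read_sensors({"vram_mb": _gpu_vram_mb, "ram_pct": _ram_percent})
```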

2. engine.py: The Saturation Logic

This is the "stress" part of the tester. It uses a specific strategy to find the breaking point:

  • Synthetic Pressure: Instead of asking the model to "think," it uses a wordpool.txt to generate a random prompt of a specific token length. The instruction is simple: "Read this and say 'done'." This isolates Prompt Evaluation speed and KV Cache overhead from actual generation.

  • The Generator Pattern: The test is written as a Python generator (yield). This is a safety feature. If the model causes a total system hang or an Out-of-Memory (OOM) error at 128k context, the tool has already "yielded" the successful results for 2k, 8k, and 32k.

  • Direct GGUF vs. Ollama: It supports two paths. It can call the Ollama API (testing your local server) or load a .gguf file directly via llama-cpp-python, which allows for manual control over n_gpu_layers.

  • The Verdict: The code maps Tokens-Per-Second (TPS) to a status. Anything under 5 TPS is flagged as the "Cliff," meaning the KV cache has likely spilled into your slower system RAM.
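The generator pattern above can be sketched in a few lines (a simplified illustration, not the real engine.py; the word list stands in for wordpool.txt and `run_model` is an assumed callable that returns the number of tokens generated):

```python
import random
import time

def saturation_test(run_model, depths=(2_048, 8_192, 32_768, 131_072)):
    """Yield one result per context depth, so earlier results survive
    a later OOM or total system hang."""
    words = ["alpha", "beta", "gamma", "delta"]  # stand-in for wordpool.txt
    for n_tokens in depths:
        filler = " ".join(random.choices(words, k=n_tokens))
        prompt = filler + " Read this and say 'done'."
        t0 = time.perf_counter()
        n_generated = run_model(prompt)  # hypothetical: returns tokens generated
        tps = n_generated / (time.perf_counter() - t0)
        yield {"context": n_tokens, "tps": tps}

# Consume incrementally: if the 128k step crashes the machine,
# the 2k/8k/32k results have already been yielded and printed.
# for result in saturation_test(my_model):
#     print(result)
```

Because each result is yielded before the next (larger) context is attempted, a crash at the top depth costs you nothing but that one data point.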

The Verdict: Mapping the Performance Cliff

The tool doesn't just give you numbers; it gives you a "vibe check" based on the _verdict logic:

  • ✅ Smooth (>15 TPS): Native GPU speeds. Your agents will feel snappy.

  • ⚠️ Slowing (5-15 TPS): You’ve likely hit the "Knee" where the KV cache is spilling into system RAM.

  • 💀 Cliff (<5 TPS): Total saturation. Time to lower your num_ctx or buy a better GPU.
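The three buckets boil down to a tiny mapping; a minimal sketch of what the _verdict logic might look like, using the thresholds above:

```python
def verdict(tps: float) -> str:
    # Thresholds from the post: >15 smooth, 5-15 slowing, <5 cliff
    if tps > 15:
        return "✅ Smooth"
    if tps >= 5:
        return "⚠️ Slowing"
    return "💀 Cliff"
```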

Getting Started

I’ve pushed the tool to PyPI and GitHub. You can test your own local setup in seconds:

```shell
# install via pip
pip install stressllm

# list available models
stressllm models

# run test
stressllm run <modelname> --depth 2
```

Whether you are building agentic frameworks or just running local LLMs for privacy, you need to know your limits. Stop guessing and start stressing.

Find it on PyPI: pypi.org/project/stressllm/