
Why I built stressllm

Cloud | AWS | DevOps | AI 📍 Toronto 🇨🇦 🚀 Cloud Architect @ AWS 👨🏽‍🏫 Professor

A few weeks ago I wrote about building a multi-agent system on my old personal laptop; you can read more about it here. The project failed miserably: my hardware looked sufficient on paper, but the system was unusable in practice. I realized I was flying blind against the KV cache, the "working memory" that scales with context. I needed a way to know exactly where my hardware would "choke" before I spent hours debugging a lagging agent.

So, I spent the weekend vibe-coding stressllm: a Python CLI tool that performs context saturation tests to find your hardware's breaking point.

The Core Bottleneck: VRAM vs. Context

Most people only look at "Static VRAM," i.e., the memory needed to load the model weights. But the real killer is Dynamic VRAM.

As your conversation grows, the model needs to store the "Keys" and "Values" (the KV cache) for every token to avoid re-processing the entire prompt. This cache grows linearly with context length. On my 6-year-old laptop with 2GB of VRAM, the moment that cache spills over into system RAM, performance falls off a cliff.
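To make that linear growth concrete, here is a back-of-the-envelope estimate (a sketch of the standard KV-cache formula, not part of stressllm; the model shape below is an assumed Llama-7B-like configuration):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    # 2x for Keys AND Values, stored per layer, per KV head, per token.
    # bytes_per_elem=2 assumes fp16 cache entries.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, fp16
gib = kv_cache_bytes(32, 32, 128, n_tokens=4096) / 2**30
print(f"{gib:.1f} GiB")  # → 2.0 GiB of cache at just 4k context
```

At that rate, a 4k-token conversation alone already outgrows a 2GB card before you count the weights.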

Under the Hood: The Architecture

I built StressLLM with a clean separation of concerns, splitting it into two components: probe.py (the Watcher, which observes the hardware) and engine.py (the Worker, which runs the stress test).

1. probe.py: Hardware Telemetry & Constraints

This module is responsible for heartbeat-style monitoring of your system.

  • NVIDIA Only: The tool currently relies on pynvml (NVIDIA Management Library). It specifically looks for nvml.dll on Windows or the standard drivers on Linux.

  • Graceful Fallback: When the GPU isn't available, the tool automatically pivots to reporting system-wide RAM and CPU usage via psutil.

  • Error Isolation: Each sensor (VRAM, Temperature, CPU) is independent. If a single telemetry call fails, the rest of the stats are still yielded.
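The error-isolation idea can be sketched like this (hypothetical code, not the actual probe.py; the sensor helpers and names are illustrative):

```python
def read_sensors(sensors):
    """Run each sensor independently; a failure in one doesn't drop the rest."""
    stats = {}
    for name, fn in sensors.items():
        try:
            stats[name] = fn()
        except Exception:
            stats[name] = None  # this sensor is unavailable; keep going
    return stats

def _gpu_vram_mb():
    import pynvml  # NVIDIA-only; raises if nvml.dll / drivers are missing
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1e6

def _ram_percent():
    import psutil  # graceful fallback path when there is no GPU
    return psutil.virtual_memory().percent

# On a box with no NVIDIA GPU, "vram_mb" comes back as None
# while "ram_pct" still reports normally.
stats = read_sensors({"vram_mb": _gpu_vram_mb, "ram_pct": _ram_percent})
```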

2. engine.py: The Saturation Logic

This is the "stress" part of the tester. It uses a specific strategy to find the breaking point:

  • Synthetic Pressure: Instead of asking the model to "think," it uses a wordpool.txt to generate a random prompt of a specific token length. The instruction is simple: "Read this and say 'done'." This isolates Prompt Evaluation speed and KV Cache overhead from actual generation.

  • The Generator Pattern: The test is written as a Python generator (yield). This is a safety feature. If the model causes a total system hang or an Out-of-Memory (OOM) error at 128k context, the tool has already "yielded" the successful results for 2k, 8k, and 32k.

  • Direct GGUF vs. Ollama: It supports two paths. It can call the Ollama API (testing your local server) or load a .gguf file directly via llama-cpp-python, which allows for manual control over n_gpu_layers.

  • The Verdict: The code maps Tokens-Per-Second (TPS) to a status. Anything under 5 TPS is flagged as the "Cliff," meaning the KV cache has likely spilled into your slower system RAM.
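The generator pattern above can be sketched in a few lines (a simplified illustration, not the real engine.py; the word list stands in for wordpool.txt and `run_model` is an assumed callable that returns the number of tokens generated):

```python
import random
import time

def saturation_test(run_model, depths=(2_048, 8_192, 32_768, 131_072)):
    """Yield one result per context depth, so earlier results survive
    a later OOM or total system hang."""
    words = ["alpha", "beta", "gamma", "delta"]  # stand-in for wordpool.txt
    for n_tokens in depths:
        filler = " ".join(random.choices(words, k=n_tokens))
        prompt = filler + " Read this and say 'done'."
        t0 = time.perf_counter()
        n_generated = run_model(prompt)  # hypothetical: returns tokens generated
        tps = n_generated / (time.perf_counter() - t0)
        yield {"context": n_tokens, "tps": tps}

# Consume incrementally: if the 128k step crashes the machine,
# the 2k/8k/32k results have already been yielded and printed.
# for result in saturation_test(my_model):
#     print(result)
```

Because each result is yielded before the next (larger) context is attempted, a crash at the top depth costs you nothing but that one data point.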

The Verdict: Mapping the Performance Cliff

The tool doesn't just give you numbers; it gives you a "vibe check" based on the _verdict logic:

  • ✅ Smooth (>15 TPS): Native GPU speeds. Your agents will feel snappy.

  • ⚠️ Slowing (5-15 TPS): You’ve likely hit the "Knee" where the KV cache is spilling into system RAM.

  • 💀 Cliff (<5 TPS): Total saturation. Time to lower your num_ctx or buy a better GPU.
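The three buckets boil down to a tiny mapping; a minimal sketch of what the _verdict logic might look like, using the thresholds above:

```python
def verdict(tps: float) -> str:
    # Thresholds from the post: >15 smooth, 5-15 slowing, <5 cliff
    if tps > 15:
        return "✅ Smooth"
    if tps >= 5:
        return "⚠️ Slowing"
    return "💀 Cliff"
```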

Getting Started

I’ve pushed the tool to PyPI and GitHub. You can test your own local setup in seconds:

```shell
# install via pip
pip install stressllm

# list available models
stressllm models

# run test
stressllm run <modelname> --depth 2
```

Whether you are building agentic frameworks or just running local LLMs for privacy, you need to know your limits. Stop guessing and start stressing.

Find it on PyPI: pypi.org/project/stressllm/