ConsumerBench is the first comprehensive benchmarking framework for generative AI (GenAI) applications running directly on end-user devices. Unlike traditional benchmarks that assume exclusive GPU access in cloud datacenters, ConsumerBench captures the realistic, concurrent execution of multiple GenAI apps, such as chatbots, image generation, live captioning, and research agents, competing for limited local resources.
Generative AI has shifted from cloud servers to local devices such as laptops and smartphones. This migration is motivated by privacy, low latency, and offline availability—but it introduces new systems challenges.
Unlike cloud environments, end-user devices must juggle multiple concurrent applications, each powered by distinct AI models with different service-level objectives (SLOs).
Existing benchmarks evaluate models in isolation, measuring throughput and efficiency on dedicated hardware but overlooking what happens when models compete.
ConsumerBench models real multi-application execution on resource-constrained hardware, measuring both application-level SLOs (latency, throughput) and system-level metrics (GPU utilization, memory bandwidth, power) to reveal critical inefficiencies and guide the design of efficient on-device AI systems.
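For example, application-level SLO attainment can be summarized as the fraction of requests that meet their latency target; a minimal sketch (not ConsumerBench's actual reporting code) is:

```python
# Minimal sketch (not ConsumerBench's actual code): compute application-level
# SLO attainment from per-request latencies against a latency target.
def slo_attainment(latencies_s: list[float], slo_s: float) -> float:
    """Fraction of requests whose end-to-end latency meets the SLO."""
    if not latencies_s:
        return 0.0
    met = sum(1 for latency in latencies_s if latency <= slo_s)
    return met / len(latencies_s)


# Example: five image-generation requests measured against a 1-second SLO.
print(slo_attainment([0.8, 0.9, 1.4, 0.7, 2.1], slo_s=1.0))  # -> 0.6
```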
ConsumerBench bridges a major gap between cloud-centric AI benchmarking and real-world on-device deployment, enabling researchers and developers to:

- Pinpoint inefficiencies in model kernels and optimize them for concurrent execution
- Explore new scheduling and memory-management strategies for on-device AI
- Understand how different workloads coexist on shared hardware
By treating the end-user device as a first-class platform for AI, ConsumerBench paves the way toward responsive, energy-efficient, and fair on-device intelligence.
ConsumerBench orchestrates realistic GenAI workloads through a flexible, graph-based execution model.

Figure: ConsumerBench architecture, from YAML configuration to DAG execution with comprehensive monitoring and reporting.
```yaml
Analysis (DeepResearch):
  model: Llama-3.2-3B
  num_requests: 1
  device: cpu
Creating Cover Art (ImageGen):
  model: SD-3.5-Medium-Turbo
  num_requests: 5
  device: gpu
  slo: 1s
Generating Captions (LiveCaptions):
  model: Whisper-Large-V3-Turbo
  num_requests: 1
  device: gpu
...
```
```yaml
analysis_1:
  uses: Analysis
cover_art:
  uses: Creating Cover Art
  depend_on: ["analysis_1"]
analysis_2:
  uses: Analysis
  depend_on: ["analysis_1"]
generate_captions:
  uses: Generating Captions
  depend_on: ["cover_art", "analysis_2"]
...
```
Example YAML configuration showing (a) task definition with models and SLOs, and (b) workflow definition with dependencies
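The workflow above forms a directed acyclic graph (DAG). As a rough illustration of how such a dependency graph can be executed (this sketch uses Python's standard graphlib and PyYAML; it is not ConsumerBench's actual scheduler):

```python
# Illustrative sketch of DAG execution over a workflow config like the one above
# (not ConsumerBench's actual scheduler). Tasks run once all dependencies finish.
import yaml  # pip install pyyaml
from graphlib import TopologicalSorter  # Python 3.9+

workflow_yaml = """
analysis_1:
  uses: Analysis
cover_art:
  uses: Creating Cover Art
  depend_on: ["analysis_1"]
analysis_2:
  uses: Analysis
  depend_on: ["analysis_1"]
generate_captions:
  uses: Generating Captions
  depend_on: ["cover_art", "analysis_2"]
"""

workflow = yaml.safe_load(workflow_yaml)
# Map each task to its predecessors so we can order execution by dependencies.
graph = {name: spec.get("depend_on", []) for name, spec in workflow.items()}

for task in TopologicalSorter(graph).static_order():
    print(f"run {task} ({workflow[task]['uses']})")
```

Tasks with the same satisfied dependencies, such as cover_art and analysis_2 here, become ready at the same time and can run concurrently, which is exactly the contention scenario ConsumerBench is designed to measure.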
Each application plugs into ConsumerBench through a simple setup(), execute(), cleanup() interface (see the sketch after the table below).

| Application | Modality | Model |
|---|---|---|
| Chatbot | text → text | Llama-3.2-3B | 
| DeepResearch | agent | Llama-3.2-3B | 
| ImageGen | text → image | SD-3.5-Turbo | 
| LiveCaptions | audio → text | Whisper-v3-Turbo | 
These applications span interactive and background workloads, reflecting modern consumer AI diversity.
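A minimal sketch of what the per-application interface could look like (class names and method bodies are illustrative assumptions, not ConsumerBench's actual API):

```python
# Minimal sketch of the per-application interface (hypothetical names, not
# ConsumerBench's actual API): each application implements setup/execute/cleanup.
import time
from abc import ABC, abstractmethod


class GenAIApp(ABC):
    """Base class a benchmark application would implement (illustrative only)."""

    @abstractmethod
    def setup(self) -> None:
        """Load the model and allocate device resources."""

    @abstractmethod
    def execute(self) -> dict:
        """Run the workload and return per-request metrics."""

    @abstractmethod
    def cleanup(self) -> None:
        """Release memory and stop any background processes."""


class DummyImageGen(GenAIApp):
    """Stand-in for a text-to-image app; a real app would call the model here."""

    def setup(self) -> None:
        self.ready = True  # e.g., load SD-3.5-Turbo onto the GPU

    def execute(self) -> dict:
        start = time.perf_counter()
        time.sleep(0.1)  # placeholder for an image-generation request
        return {"latency_s": time.perf_counter() - start}

    def cleanup(self) -> None:
        self.ready = False
```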
We benchmarked ConsumerBench on a consumer-grade RTX 6000 GPU and a MacBook M1 Pro, revealing critical inefficiencies in current systems.

Figure: Comparison of greedy allocation vs. GPU partitioning (MPS), showing normalized latency and GPU utilization. Greedy scheduling causes a 12.4× latency increase for LiveCaptions due to starvation, while partitioning improves fairness but reduces overall GPU utilization.
Under greedy GPU allocation, large-kernel workloads like image generation monopolize GPU resources, starving lighter tasks like live captioning—causing up to 12.4× latency increase and near-zero SLO attainment.
Evenly dividing GPU resources (e.g., via NVIDIA MPS) improves fairness but leads to under-utilization: idle partitions cannot borrow unused compute, lowering throughput despite available capacity.
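For context, MPS partitioning is typically expressed by capping each client's SM share through the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. The sketch below assumes the MPS control daemon is already running and uses hypothetical launcher script names:

```python
# Sketch of launching two workloads under an even MPS partition (assumes the
# MPS control daemon is already running; the launcher script names are hypothetical).
import os
import subprocess


def launch_under_mps(cmd: list[str], sm_percentage: int) -> subprocess.Popen:
    env = os.environ.copy()
    # Cap the share of SMs this MPS client may use.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(cmd, env=env)


procs = [
    launch_under_mps(["python", "run_imagegen.py"], sm_percentage=50),
    launch_under_mps(["python", "run_livecaptions.py"], sm_percentage=50),
]
for proc in procs:
    proc.wait()
```

The fixed 50/50 split illustrates the under-utilization problem: when one workload is idle, the other cannot borrow its reserved compute.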
In a realistic content-creation workflow (brainstorm → analysis → cover art → captioning):

- Greedy scheduling cuts total runtime by 45% but risks starvation for interactive apps (a 9.5× SLO violation for live captioning)
- GPU partitioning improves fairness and SLO attainment but increases total completion time by 2.2×
No single policy fits all scenarios—systems must adapt dynamically based on workload characteristics and SLO requirements.
When applications share a single model through an inference server (e.g., Chatbot + DeepResearch on llama.cpp), fixed configurations like large KV caches can harm interactive latency, revealing the need for SLO-aware adaptive servers.
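A sketch of this sharing pattern, assuming a llama.cpp server already running locally with its OpenAI-compatible endpoint (the port, prompts, and token budgets below are illustrative):

```python
# Sketch: two clients (an interactive chatbot and a long DeepResearch-style
# request) sharing one llama.cpp server; measures the chatbot's time-to-first-token.
# Assumes a server is already running at localhost:8080 with an OpenAI-compatible
# /v1/chat/completions endpoint; all values here are illustrative.
import threading
import time

import requests  # pip install requests

URL = "http://localhost:8080/v1/chat/completions"


def chat(prompt: str, max_tokens: int) -> float:
    """Send a streaming request and return time-to-first-token in seconds."""
    start = time.perf_counter()
    with requests.post(
        URL,
        json={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "stream": True,
        },
        stream=True,
    ) as resp:
        for line in resp.iter_lines():
            if line:  # first streamed chunk
                return time.perf_counter() - start
    return float("inf")


# Long background request competes for the shared server's compute and KV cache.
background = threading.Thread(
    target=chat, args=("Summarize this 20-page report ...", 2048)
)
background.start()
print("chatbot TTFT (s):", chat("What is the capital of France?", 64))
background.join()
```

Measuring time-to-first-token for the interactive request while the long request is in flight makes the impact of server-side configuration choices, such as KV-cache size, directly visible.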
There is a critical need for dynamic GPU scheduling policies that are SLO-aware and transparently share GPU resources across multiple GenAI applications. Existing dynamic scheduling systems (e.g., Orion, REEF) cannot run applications off-the-shelf: they require code modifications or manual CUDA kernel inspection.
- Design model kernels to maximize SM occupancy and minimize register pressure for better concurrent execution.
- Anticipate multi-app co-execution when implementing GPU kernels to avoid resource monopolization.
- Move beyond static GPU partitions toward dynamic, SLO-aware scheduling that balances fairness and efficiency.
Check out the code, and feel free to post feature requests or bug reports on our GitHub Issues.