ConsumerBench: Benchmarking GenAI on End-User Devices

The first comprehensive benchmarking framework for generative AI applications running directly on end-user devices. Unlike traditional benchmarks that assume exclusive GPU access in cloud datacenters, ConsumerBench captures the realistic, concurrent execution of multiple GenAI apps such as chatbots, image generation, live captioning, and research agents competing for limited local resources.

The Challenge of On-Device AI

Generative AI has shifted from cloud servers to local devices such as laptops and smartphones. This migration is motivated by privacy, low latency, and offline availability—but it introduces new systems challenges.

🔄 Multi-Application Reality

Unlike cloud environments, end-user devices must juggle multiple concurrent applications, each powered by distinct AI models with different service-level objectives (SLOs).

📊 Traditional Benchmarks Fall Short

Existing benchmarks evaluate models in isolation, measuring throughput and efficiency on dedicated hardware but overlooking what happens when models compete.

ConsumerBench Fills This Gap

ConsumerBench models real multi-application execution on resource-constrained hardware, measuring both application-level SLOs (latency, throughput) and system-level metrics (GPU utilization, memory bandwidth, power) to reveal critical inefficiencies and guide the design of efficient on-device AI systems.
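
For example, the SLO attainment reported for each application can be read as the fraction of requests that finish within their latency target. A minimal sketch of that calculation (the function and the sample latencies are illustrative, not part of ConsumerBench):

def slo_attainment(latencies_s, slo_s):
    """Fraction of requests whose end-to-end latency meets the SLO."""
    return sum(1 for lat in latencies_s if lat <= slo_s) / len(latencies_s)

# e.g. five image-generation requests measured against a 1-second SLO
print(slo_attainment([0.8, 0.9, 1.4, 0.7, 2.1], slo_s=1.0))  # -> 0.6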

Why It Matters

ConsumerBench bridges a major gap between cloud-centric AI benchmarking and real-world on-device deployment.

Model Developers

Pinpoint inefficiencies in their kernels and optimize for concurrent execution

System Designers

Explore new scheduling and memory-management strategies for on-device AI

Device Manufacturers

Understand how different workloads coexist on shared hardware

By treating the end-user device as a first-class platform for AI, ConsumerBench paves the way toward responsive, energy-efficient, and fair on-device intelligence.

Framework Architecture

ConsumerBench orchestrates realistic GenAI workloads through a flexible, graph-based execution model.

ConsumerBench architecture: from YAML configuration to DAG execution with comprehensive monitoring and reporting

Simple YAML Configuration

(a) Task Definition

Analysis (DeepResearch):
  model: Llama-3.2-3B
  num_requests: 1
  device: cpu
Creating Cover Art (ImageGen):
  model: SD-3.5-Medium-Turbo
  num_requests: 5
  device: gpu
  slo: 1s
Generating Captions (LiveCaptions):
  model: Whisper-Large-V3-Turbo
  num_requests: 1
  device: gpu
  ...

(b) Workflow Definition

analysis_1:
  uses: Analysis
cover_art:
  uses: Creating Cover Art
  depend_on: ["analysis_1"]
analysis_2:
  uses: Analysis
  depend_on: ["analysis_1"]
generate_captions:
  uses: Generating Captions
  depend_on: ["cover_art", "analysis_2"]
  ...

Example YAML configuration showing (a) task definition with models and SLOs, and (b) workflow definition with dependencies
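
A minimal sketch of how such a workflow can be executed as a directed acyclic graph, where each task starts only after everything in its depend_on list has finished; the loader and task runner below are illustrative, not ConsumerBench's actual implementation:

import yaml  # pip install pyyaml

def run_workflow(workflow, run_task):
    """Run workflow entries in dependency order (a simple topological pass)."""
    done, pending = set(), dict(workflow)
    while pending:
        ready = [name for name, spec in pending.items()
                 if set(spec.get("depend_on", [])) <= done]
        if not ready:
            raise ValueError("cycle or unsatisfiable dependency in workflow")
        for name in ready:
            run_task(name, pending.pop(name)["uses"])  # dispatch to the named task
            done.add(name)

with open("workflow.yaml") as f:  # the file from example (b)
    workflow = yaml.safe_load(f)
run_workflow(workflow, lambda name, task: print(f"running {name} ({task})"))

ConsumerBench itself runs independent tasks (such as cover_art and analysis_2 above) concurrently; the sequential loop here only illustrates the dependency ordering.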

Key Features

  • Multi-application workflows: Define tasks and dependencies using simple YAML
  • SLO tracking: Measure per-app latency and SLO attainment
  • System-level monitoring: GPU/CPU utilization, memory bandwidth, power via NVIDIA DCGM and Intel PCM
  • Flexible orchestration: Greedy allocation, static partitioning, or shared inference servers
  • Extensibility: Plug in custom apps via a setup(), execute(), cleanup() interface (see the sketch below)
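
A minimal sketch of what a custom application could look like under this interface; the class name, constructor arguments, and reported latencies are illustrative assumptions, not ConsumerBench's exact API:

class EchoChatApp:
    """Hypothetical custom application following the setup()/execute()/cleanup() contract."""

    def __init__(self, model, num_requests, device):
        self.model, self.num_requests, self.device = model, num_requests, device

    def setup(self):
        # Load the model, warm up caches, start any helper processes.
        print(f"loading {self.model} on {self.device}")

    def execute(self):
        # Issue the benchmark requests and record per-request latencies.
        return [{"request": i, "latency_s": 0.1} for i in range(self.num_requests)]

    def cleanup(self):
        # Release GPU memory and stop helper processes.
        print("done")

app = EchoChatApp("Llama-3.2-3B", num_requests=1, device="cpu")
app.setup(); results = app.execute(); app.cleanup()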

Supported Applications

Application    Modality        Model
Chatbot        text → text     Llama-3.2-3B
DeepResearch   agent           Llama-3.2-3B
ImageGen       text → image    SD-3.5-Turbo
LiveCaptions   audio → text    Whisper-v3-Turbo

These applications span interactive and background workloads, reflecting the diversity of modern consumer AI.

Key Evaluation Results

We benchmarked ConsumerBench on a consumer-grade NVIDIA RTX 6000 GPU and a MacBook M1 Pro, revealing critical inefficiencies in current on-device systems.

Concurrent Application Execution: Greedy vs. GPU Partitioning

Comparison of greedy allocation vs. GPU partitioning (MPS), showing normalized latency and GPU utilization patterns. Greedy scheduling causes a 12.4× latency increase for LiveCaptions due to starvation, while partitioning improves fairness but reduces overall GPU utilization.

⚠️ Resource Contention Hurts Fairness

Under greedy GPU allocation, large-kernel workloads like image generation monopolize GPU resources, starving lighter tasks like live captioning and causing up to a 12.4× latency increase with near-zero SLO attainment.

📉 Static Partitioning Wastes Performance

Evenly dividing GPU resources (e.g., via NVIDIA MPS) improves fairness but leads to under-utilization: idle partitions cannot borrow unused compute, lowering throughput despite available capacity.
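
For reference, this kind of static split can be approximated by capping each application's share of streaming multiprocessors through MPS. A minimal sketch, assuming the MPS control daemon is already running and using placeholder entry-point scripts:

import os
import subprocess

def launch_with_sm_share(cmd, percent):
    """Start a workload whose CUDA kernels may use at most `percent` of the GPU's SMs."""
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(percent))
    return subprocess.Popen(cmd, env=env)

# An even 50/50 split: neither application can borrow the other's idle share.
imagegen = launch_with_sm_share(["python", "run_imagegen.py"], 50)      # placeholder script
captions = launch_with_sm_share(["python", "run_livecaptions.py"], 50)  # placeholder script
imagegen.wait()
captions.wait()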

🔄 Real-World Workflows Expose Trade-offs

In a realistic content-creation workflow (brainstorm → analysis → cover art → captioning):

Greedy Scheduling

Cuts total runtime by 45% but risks starvation for interactive apps (9.5× SLO violation for live captioning)

GPU Partitioning (MPS)

Improves fairness and SLO attainment but increases total completion time by 2.2×

Key Insight

No single policy fits all scenarios—systems must adapt dynamically based on workload characteristics and SLO requirements.

🖥️ Inference Servers Aren't One-Size-Fits-All

When applications share a single model through an inference server (e.g., Chatbot + DeepResearch on llama.cpp), fixed configurations like large KV caches can harm interactive latency, revealing the need for SLO-aware adaptive servers.
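
As a rough illustration of why fixed server settings matter, a shared llama.cpp server is typically launched once with its context size (and thus KV-cache reservation) chosen up front, and every client inherits that choice regardless of its own SLO. A sketch, assuming a local llama-server binary and a placeholder model path; the specific values are illustrative:

import subprocess

# One server instance is shared by the Chatbot and DeepResearch clients.
# The context size is fixed at startup, so a value tuned for long DeepResearch
# prompts also shapes KV-cache usage for latency-sensitive chat requests.
server = subprocess.Popen([
    "llama-server",
    "-m", "Llama-3.2-3B.gguf",  # placeholder model path
    "-c", "8192",               # context size; larger values reserve more KV cache
    "--parallel", "2",          # one slot per application
    "--port", "8080",
])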

⚡ Need for Dynamic GPU Scheduling

There is a critical need for dynamic GPU scheduling policies that are SLO-aware and transparently share GPU resources across multiple GenAI applications. Existing dynamic scheduling systems (e.g., Orion and REEF) cannot run applications off-the-shelf; they require code modifications or manual CUDA kernel inspection.

Insights for Future Design

🏗️ Architectural Efficiency

Design model kernels to maximize SM occupancy and minimize register pressure for better concurrent execution.

🔄 Concurrency-Aware Kernels

Anticipate multi-app co-execution when implementing GPU kernels to avoid resource monopolization.

⚖️ Flexible Resource Management

Move beyond static GPU partitions toward dynamic, SLO-aware scheduling that balances fairness and efficiency.

Get Involved

We Welcome Contributions

Check out the code, and feel free to post feature requests or bug reports on our GitHub Issues.

  • ➕ New Applications
  • 🔄 Realistic Workflows
  • 🖥️ Different GPUs

Citation

BibTeX

@article{consumerbench2025,
  title={ConsumerBench: Benchmarking GenAI on End-User Devices},
  author={Gu, Yile and Kadekodi, Rohan and Nguyen, Hoang and Kamahori, Keisuke and Liu, Yiyu and Kasikci, Baris},
  journal={arXiv preprint arXiv:2506.17538},
  year={2025}
}