ConsumerBench is the first comprehensive benchmarking framework for generative AI (GenAI) applications running directly on end-user devices. Unlike traditional benchmarks that assume exclusive GPU access in cloud datacenters, ConsumerBench captures the realistic, concurrent execution of multiple GenAI apps, such as chatbots, image generation, live captioning, and research agents, competing for limited local resources.
Generative AI has shifted from cloud servers to local devices such as laptops and smartphones. This migration is motivated by privacy, low latency, and offline availability—but it introduces new systems challenges.
Unlike cloud environments, end-user devices must juggle multiple concurrent applications, each powered by distinct AI models with different service-level objectives (SLOs).
Existing benchmarks evaluate models in isolation, measuring throughput and efficiency on dedicated hardware but overlooking what happens when models compete.
ConsumerBench models real multi-application execution on resource-constrained hardware, measuring both application-level SLOs (latency, throughput) and system-level metrics (GPU utilization, memory bandwidth, power) to reveal critical inefficiencies and guide the design of efficient on-device AI systems.
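For example, application-level SLO attainment can be summarized as the fraction of requests that meet their latency target; a minimal sketch (not ConsumerBench's actual reporting code) is:

```python
# Minimal sketch (not ConsumerBench's actual code): compute application-level
# SLO attainment from per-request latencies against a latency target.
def slo_attainment(latencies_s: list[float], slo_s: float) -> float:
    """Fraction of requests whose end-to-end latency meets the SLO."""
    if not latencies_s:
        return 0.0
    met = sum(1 for latency in latencies_s if latency <= slo_s)
    return met / len(latencies_s)


# Example: five image-generation requests measured against a 1-second SLO.
print(slo_attainment([0.8, 0.9, 1.4, 0.7, 2.1], slo_s=1.0))  # -> 0.6
```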
ConsumerBench bridges a major gap between cloud-centric AI benchmarking and real-world on-device deployment, enabling researchers and developers to:

- Pinpoint inefficiencies in model kernels and optimize them for concurrent execution
- Explore new scheduling and memory-management strategies for on-device AI
- Understand how different workloads coexist on shared hardware
By treating the end-user device as a first-class platform for AI, ConsumerBench paves the way toward responsive, energy-efficient, and fair on-device intelligence.
ConsumerBench orchestrates realistic GenAI workloads through a flexible, graph-based execution model.

Figure: ConsumerBench architecture, from YAML configuration to DAG execution with comprehensive monitoring and reporting.
```yaml
Analysis (DeepResearch):
  model: Llama-3.2-3B
  num_requests: 1
  device: cpu
Creating Cover Art (ImageGen):
  model: SD-3.5-Medium-Turbo
  num_requests: 5
  device: gpu
  slo: 1s
Generating Captions (LiveCaptions):
  model: Whisper-Large-V3-Turbo
  num_requests: 1
  device: gpu
...
```
```yaml
analysis_1:
  uses: Analysis
cover_art:
  uses: Creating Cover Art
  depend_on: ["analysis_1"]
analysis_2:
  uses: Analysis
  depend_on: ["analysis_1"]
generate_captions:
  uses: Generating Captions
  depend_on: ["cover_art", "analysis_2"]
...
```
Example YAML configuration showing (a) task definition with models and SLOs, and (b) workflow definition with dependencies
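The workflow above forms a directed acyclic graph (DAG). As a rough illustration of how such a dependency graph can be executed (this sketch uses Python's standard graphlib and PyYAML; it is not ConsumerBench's actual scheduler):

```python
# Illustrative sketch of DAG execution over a workflow config like the one above
# (not ConsumerBench's actual scheduler). Tasks run once all dependencies finish.
import yaml  # pip install pyyaml
from graphlib import TopologicalSorter  # Python 3.9+

workflow_yaml = """
analysis_1:
  uses: Analysis
cover_art:
  uses: Creating Cover Art
  depend_on: ["analysis_1"]
analysis_2:
  uses: Analysis
  depend_on: ["analysis_1"]
generate_captions:
  uses: Generating Captions
  depend_on: ["cover_art", "analysis_2"]
"""

workflow = yaml.safe_load(workflow_yaml)
# Map each task to its predecessors so we can order execution by dependencies.
graph = {name: spec.get("depend_on", []) for name, spec in workflow.items()}

for task in TopologicalSorter(graph).static_order():
    print(f"run {task} ({workflow[task]['uses']})")
```

Tasks with the same satisfied dependencies, such as cover_art and analysis_2 here, become ready at the same time and can run concurrently, which is exactly the contention scenario ConsumerBench is designed to measure.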
Each application plugs into ConsumerBench through a simple setup(), execute(), cleanup() interface (see the sketch after the table below).

| Application | Modality | Model |
|---|---|---|
| Chatbot | text → text | Llama-3.2-3B | 
| DeepResearch | agent | Llama-3.2-3B | 
| ImageGen | text → image | SD-3.5-Turbo | 
| LiveCaptions | audio → text | Whisper-v3-Turbo | 
These applications span interactive and background workloads, reflecting modern consumer AI diversity.
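A minimal sketch of what the per-application interface could look like (class names and method bodies are illustrative assumptions, not ConsumerBench's actual API):

```python
# Minimal sketch of the per-application interface (hypothetical names, not
# ConsumerBench's actual API): each application implements setup/execute/cleanup.
import time
from abc import ABC, abstractmethod


class GenAIApp(ABC):
    """Base class a benchmark application would implement (illustrative only)."""

    @abstractmethod
    def setup(self) -> None:
        """Load the model and allocate device resources."""

    @abstractmethod
    def execute(self) -> dict:
        """Run the workload and return per-request metrics."""

    @abstractmethod
    def cleanup(self) -> None:
        """Release memory and stop any background processes."""


class DummyImageGen(GenAIApp):
    """Stand-in for a text-to-image app; a real app would call the model here."""

    def setup(self) -> None:
        self.ready = True  # e.g., load SD-3.5-Turbo onto the GPU

    def execute(self) -> dict:
        start = time.perf_counter()
        time.sleep(0.1)  # placeholder for an image-generation request
        return {"latency_s": time.perf_counter() - start}

    def cleanup(self) -> None:
        self.ready = False
```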
We benchmarked ConsumerBench on a consumer-grade RTX 6000 GPU and a MacBook M1 Pro, revealing critical inefficiencies in current systems.

Figure: Comparison of greedy allocation vs. GPU partitioning (MPS), showing normalized latency and GPU utilization. Greedy scheduling causes a 12.4× latency increase for LiveCaptions due to starvation, while partitioning improves fairness but reduces overall GPU utilization.
Under greedy GPU allocation, large-kernel workloads like image generation monopolize GPU resources, starving lighter tasks like live captioning—causing up to 12.4× latency increase and near-zero SLO attainment.
Evenly dividing GPU resources (e.g., via NVIDIA MPS) improves fairness but leads to under-utilization: idle partitions cannot borrow unused compute, lowering throughput despite available capacity.
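For context, MPS partitioning is typically expressed by capping each client's SM share through the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. The sketch below assumes the MPS control daemon is already running and uses hypothetical launcher script names:

```python
# Sketch of launching two workloads under an even MPS partition (assumes the
# MPS control daemon is already running; the launcher script names are hypothetical).
import os
import subprocess


def launch_under_mps(cmd: list[str], sm_percentage: int) -> subprocess.Popen:
    env = os.environ.copy()
    # Cap the share of SMs this MPS client may use.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(cmd, env=env)


procs = [
    launch_under_mps(["python", "run_imagegen.py"], sm_percentage=50),
    launch_under_mps(["python", "run_livecaptions.py"], sm_percentage=50),
]
for proc in procs:
    proc.wait()
```

The fixed 50/50 split illustrates the under-utilization problem: when one workload is idle, the other cannot borrow its reserved compute.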
In a realistic content-creation workflow (brainstorm → analysis → cover art → captioning):

- Greedy scheduling cuts total runtime by 45% but risks starvation for interactive apps (a 9.5× SLO violation for live captioning)
- GPU partitioning improves fairness and SLO attainment but increases total completion time by 2.2×
No single policy fits all scenarios—systems must adapt dynamically based on workload characteristics and SLO requirements.
When applications share a single model through an inference server (e.g., Chatbot + DeepResearch on llama.cpp), fixed configurations like large KV caches can harm interactive latency, revealing the need for SLO-aware adaptive servers.
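A sketch of this sharing pattern, assuming a llama.cpp server already running locally with its OpenAI-compatible endpoint (the port, prompts, and token budgets below are illustrative):

```python
# Sketch: two clients (an interactive chatbot and a long DeepResearch-style
# request) sharing one llama.cpp server; measures the chatbot's time-to-first-token.
# Assumes a server is already running at localhost:8080 with an OpenAI-compatible
# /v1/chat/completions endpoint; all values here are illustrative.
import threading
import time

import requests  # pip install requests

URL = "http://localhost:8080/v1/chat/completions"


def chat(prompt: str, max_tokens: int) -> float:
    """Send a streaming request and return time-to-first-token in seconds."""
    start = time.perf_counter()
    with requests.post(
        URL,
        json={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "stream": True,
        },
        stream=True,
    ) as resp:
        for line in resp.iter_lines():
            if line:  # first streamed chunk
                return time.perf_counter() - start
    return float("inf")


# Long background request competes for the shared server's compute and KV cache.
background = threading.Thread(
    target=chat, args=("Summarize this 20-page report ...", 2048)
)
background.start()
print("chatbot TTFT (s):", chat("What is the capital of France?", 64))
background.join()
```

Measuring time-to-first-token for the interactive request while the long request is in flight makes the impact of server-side configuration choices, such as KV-cache size, directly visible.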
There is a critical need for dynamic GPU scheduling policies that are SLO-aware and transparently share GPU resources across multiple GenAI applications. Existing dynamic scheduling systems (e.g., Orion, REEF) cannot run applications off-the-shelf: they require code modifications or manual CUDA kernel inspection.
- Design model kernels to maximize SM occupancy and minimize register pressure for better concurrent execution.
- Anticipate multi-app co-execution when implementing GPU kernels to avoid resource monopolization.
- Move beyond static GPU partitions toward dynamic, SLO-aware scheduling that balances fairness and efficiency.
Check out the code, and feel free to post feature requests or bug reports on our GitHub Issues.