Benchmarking

Benchmarking is an admin-only drawer for comparing eligible local models against the built-in CE prompt set.

The drawer shows status, progress, prompt counts, the best run per model, detailed per-run results, and an editor for the benchmark questions.

Eligibility

Admin-only drawer

The Benchmarks drawer and benchmark actions require an admin session.

Eligible models

A model is included when it is present in the scan directory, enabled in the registry, and marked as benchmark-enabled by an admin.
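The three eligibility conditions can be sketched as a simple filter. This is an illustrative sketch, not the app's actual code; the class and field names (`ModelEntry`, `present_in_scan`, and so on) are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ModelEntry:
    """Registry entry for a locally scanned model (illustrative fields)."""
    name: str
    present_in_scan: bool       # found in the scan directory
    enabled: bool               # enabled in the registry
    benchmark_enabled: bool     # an admin has opted the model into benchmarks

def eligible_for_benchmark(models):
    """A model is included only when all three conditions hold."""
    return [m for m in models
            if m.present_in_scan and m.enabled and m.benchmark_enabled]
```

A model missing any one flag, for example one that is enabled but not benchmark-enabled, is simply skipped by the run.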

Disabled models

Disabled models are not selected for benchmark runs. The registry UI shows a non-clickable benchmark state until an admin enables the model.

Prompt Set

Five CE prompts

The CE benchmark uses the built-in Core Suite prompt set. The editor expects exactly five non-empty questions.
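The "exactly five non-empty questions" rule amounts to a small validation check before saving. A minimal sketch, assuming the editor submits the questions as a list of strings (the function name and error wording are illustrative):

```python
REQUIRED_PROMPT_COUNT = 5  # the CE Core Suite expects exactly five questions

def validate_prompt_set(questions):
    """Return a list of validation errors; an empty list means the set is valid."""
    errors = []
    if len(questions) != REQUIRED_PROMPT_COUNT:
        errors.append(f"expected {REQUIRED_PROMPT_COUNT} questions, "
                      f"got {len(questions)}")
    for i, q in enumerate(questions, start=1):
        if not q.strip():  # whitespace-only counts as empty
            errors.append(f"question {i} is empty")
    return errors
```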

Locked while running

While a benchmark is running, the CE Benchmark Questions panel is locked. The admin must stop the benchmark before editing questions.

Saved prompt edits

Saving benchmark question changes does not delete old results. Existing CE results for that prompt set are marked stale so admins know they were produced with earlier questions.
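The mark-stale-instead-of-delete behavior could look like the following. This is a hypothetical sketch over an in-memory store; the real persistence layer and field names (`stale`, `results`) are assumptions:

```python
def save_questions(store, new_questions):
    """Persist edited CE questions; flag prior results stale instead of deleting."""
    for result in store["results"]:
        result["stale"] = True          # produced with earlier questions
    store["questions"] = list(new_questions)
```

Keeping the old results lets admins still compare historical runs while the stale flag warns that those numbers came from a different prompt set.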

Run Flow

A benchmark run proceeds through these steps:

Start: The admin starts a run from the Benchmarks drawer. A force option reruns models that already have results.
Prepare: The benchmark worker stops existing model processes before launching benchmark models.
Launch and retry: Benchmark models launch through the same runtime helpers used by the main model controls. Retry logic can fall back to CPU loading for retryable load failures.
Run prompts: Each eligible model runs through the five-prompt set. Warmup and per-prompt measurements are recorded.
Progress: The UI receives status, current model, current prompt, counts, and progress updates over Socket.IO.
Stop: The Stop control requests benchmark cancellation and stops running model processes.
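The progress updates the UI receives carry the fields listed above. A sketch of what one such payload might look like; the exact field names and the percent formula are assumptions, not the app's actual wire format:

```python
def progress_event(status, model, prompt_index, total_prompts,
                   models_done, total_models):
    """Build a benchmark progress update to push over Socket.IO
    (field names illustrative)."""
    prompts_finished = models_done * total_prompts + prompt_index
    return {
        "status": status,               # e.g. "running", "stopped", "done"
        "currentModel": model,
        "currentPrompt": prompt_index,  # prompts finished for the current model
        "promptCount": total_prompts,
        "modelsCompleted": models_done,
        "modelTotal": total_models,
        "percent": round(100 * prompts_finished
                         / (total_models * total_prompts), 1),
    }
```

With two eligible models and the five-prompt set, finishing the first model's last prompt would report 50 percent.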

Results

Best run table

The drawer lists the best run per model with presence, status, model, size, average TPS, min/max TPS, token count, total, end time, and actions.
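Selecting the best run per model reduces to a grouped maximum over completed runs. The ranking metric here (highest average TPS) and the record fields are assumptions for illustration:

```python
def best_run_per_model(runs):
    """Map each model to its best completed run, here defined as the run
    with the highest average tokens-per-second (metric is an assumption)."""
    best = {}
    for run in runs:
        if run["status"] != "completed":
            continue                     # ignore failed or cancelled runs
        current = best.get(run["model"])
        if current is None or run["avg_tps"] > current["avg_tps"]:
            best[run["model"]] = run
    return best
```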

Details

Admins can open a run to see summary metrics and per-prompt response details.

Reset history

The reset action clears benchmark runs and results for a selected model identity. It is disabled while a benchmark is running.
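The reset action's two properties, clearing one model identity and refusing to act mid-benchmark, can be sketched as a guarded delete. This is a hypothetical in-memory version; the real store and guard mechanism are assumptions:

```python
def reset_history(store, model_identity, benchmark_running):
    """Delete benchmark runs for one model identity.
    Refuses while a benchmark is in progress (mirrors the disabled UI control)."""
    if benchmark_running:
        raise RuntimeError("stop the benchmark before resetting history")
    store["runs"] = [r for r in store["runs"] if r["model"] != model_identity]
```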