Web Bench
v0.0.6
Run WebBench benchmarks on real-world browser tasks and report timing and token usage.
$ drisp workflow install web-bench
WebBench Benchmark Runner
You execute real-world browser tasks from the Halluminate/WebBench dataset and score them, exactly one task per session.
Skills
Load the relevant skill before each activity.
| Activity | Skill |
|---|---|
| Download and prepare dataset | load-dataset |
| Execute a browser task | execute-task |
| Judge task completion | evaluate-task |
| Aggregate results into report | generate-report |
| Full interactive run (not used in workflow mode) | run-benchmark |
Persistent State
Two files persist across sessions:
| File | Purpose | Format |
|---|---|---|
| web-bench-tasks.jsonl | Task list prepared during setup, one JSON line per task | JSONL |
| web-bench-results.jsonl | Append-only results log, one JSON line per completed task | JSONL |
The results file records what was done.
Results Line Schema
Each line in web-bench-results.jsonl must contain:
{
  "id": 42,
  "url": "https://acehardware.com",
  "category": "READ",
  "task": "Navigate to the news section and...",
  "score": 1.0,
  "verdict": "PASS",
  "reasoning": "Successfully navigated to news section and extracted all requested data",
  "error": null,
  "duration_ms": 34200,
  "tokens_used": {"input": 12450, "output": 3200},
  "timestamp": "2026-03-19T14:30:00Z"
}
Required fields:
- id (int): Task ID from the dataset
- url (string): Starting URL
- category (string): READ, CREATE, UPDATE, DELETE, or FILE_MANIPULATION
- task (string): Original task description
- score (float): 1.0 (PASS), 0.5 (PARTIAL), or 0.0 (FAIL)
- verdict (string): PASS, PARTIAL, or FAIL
- reasoning (string): Explanation of why this verdict was given
- error (string|null): Blocker type if applicable (e.g., "auth_required", "captcha", "site_down")
- duration_ms (int): Wall-clock time from task start to evaluation end
- tokens_used (object): {"input": N, "output": N}, the token counts for this task's execution and evaluation
- timestamp (string): ISO 8601 timestamp of completion
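The schema above can be checked mechanically before a line is appended. The following is a minimal sketch, not part of the workflow itself: the function name validate_result_line and the wording of the problem messages are illustrative assumptions, while the field names, types, and allowed values come from the list above.

```python
import json

# Allowed values and types, taken from the required-fields list.
CATEGORIES = {"READ", "CREATE", "UPDATE", "DELETE", "FILE_MANIPULATION"}
SCORE_FOR_VERDICT = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}
FIELD_TYPES = {
    "id": int, "url": str, "category": str, "task": str,
    "score": float, "verdict": str, "reasoning": str,
    "error": (str, type(None)), "duration_ms": int,
    "tokens_used": dict, "timestamp": str,
}


def validate_result_line(line: str) -> list[str]:
    """Return a list of problems in one JSONL line (empty when the line is valid)."""
    record = json.loads(line)
    problems = []
    for field, expected in FIELD_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for field: {field}")
    if problems:
        return problems  # the value checks below assume a well-typed record
    if record["category"] not in CATEGORIES:
        problems.append(f"unknown category: {record['category']}")
    if SCORE_FOR_VERDICT.get(record["verdict"]) != record["score"]:
        problems.append(
            f"score {record['score']} does not match verdict {record['verdict']}"
        )
    tokens = record["tokens_used"]
    if not all(isinstance(tokens.get(k), int) for k in ("input", "output")):
        problems.append("tokens_used needs integer 'input' and 'output'")
    return problems
```

Running a check like this just before the append step keeps malformed lines out of the append-only log, where they could not be corrected later without rewriting the file.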
Orientation Steps
Prepare the dataset
First check if web-bench-tasks.jsonl already exists in the working directory. If it exists and is non-empty, skip the download — reuse the existing file.
If the file does not exist, load the load-dataset skill to:
- Download the WebBench CSV from HuggingFace
- Convert to web-bench-tasks.jsonl
- Apply filters (category, sample size) per the user's configuration
- Verify the output
Session 1 ends after setup. Do not execute any tasks in the setup session.
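The "exists and is non-empty" check that decides whether to skip the download could look like the sketch below. The function name tasks_file_ready is hypothetical; the download and conversion themselves belong to the load-dataset skill and are not shown.

```python
from pathlib import Path


def tasks_file_ready(path: str = "web-bench-tasks.jsonl") -> bool:
    """True when the tasks file exists and is non-empty, so the download can be skipped."""
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0
```

When this returns False, the setup session would load the load-dataset skill; when it returns True, the existing file is reused as-is.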
Workflow Sequence
Execute one task
- Determine the next task to execute: a task in web-bench-tasks.jsonl that has no result line in web-bench-results.jsonl. If all tasks are complete, skip to report generation.
- Fetch the task line from web-bench-tasks.jsonl.
- Record start time.
- Load the execute-task skill. Navigate to the starting URL, perform browser actions, capture final state.
- Action limit: If the task requires more than 25 browser actions, consider it stuck. Record what you have and move to evaluation.
- Load the evaluate-task skill. Check requirements against the execution trace. Determine verdict: PASS (1.0), PARTIAL (0.5), or FAIL (0.0).
- Record end time and compute duration_ms.
- Record token usage: use exact session token counts from the runtime if available, otherwise estimate from content received (input) and generated (output).
- Append a single JSON line to web-bench-results.jsonl. Always append, never overwrite.
One task per session. After executing and evaluating a single task, stop. This ensures clean browser state, bounded context usage, and granular progress tracking.
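The per-session sequence can be sketched in a few lines. This is an illustration under stated assumptions rather than the workflow's actual implementation: the execute and evaluate callables stand in for the execute-task and evaluate-task skills, the trace is assumed to be a dict carrying a tokens_used entry, and each task line is assumed to carry id, url, category, and task fields.

```python
import json
import time
from datetime import datetime, timezone

MAX_ACTIONS = 25  # action limit from the steps above


def next_pending_task(tasks_path, results_path):
    """Return the first task with no result line yet, or None when all are done."""
    done = set()
    try:
        with open(results_path, encoding="utf-8") as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}
    except FileNotFoundError:
        pass  # no results yet, so every task is pending
    with open(tasks_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                task = json.loads(line)
                if task["id"] not in done:
                    return task
    return None


def run_one_task(task, execute, evaluate, results_path):
    """Execute and evaluate one task, then append exactly one result line."""
    start = time.monotonic()
    # Hypothetical callables: execute is expected to stop at the action limit.
    trace = execute(task, MAX_ACTIONS)
    verdict, score, reasoning, error = evaluate(task, trace)
    duration_ms = int((time.monotonic() - start) * 1000)
    result = {
        "id": task["id"],
        "url": task["url"],
        "category": task["category"],
        "task": task["task"],
        "score": score,
        "verdict": verdict,
        "reasoning": reasoning,
        "error": error,
        "duration_ms": duration_ms,
        "tokens_used": trace["tokens_used"],
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    with open(results_path, "a", encoding="utf-8") as f:  # append, never overwrite
        f.write(json.dumps(result) + "\n")
    return result
```

A session would call next_pending_task once, pass the result to run_one_task, and then stop. Using a monotonic clock for duration_ms avoids errors from wall-clock adjustments during a long task, while the timestamp still uses UTC wall-clock time.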
Generate report (final session)
When all tasks are complete:
- Load the generate-report skill
- Aggregate web-bench-results.jsonl
- Write web-bench-report.md
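As a rough illustration of the aggregation step, the sketch below groups result lines by category and renders a markdown table. The chosen statistics, the column layout, and the function names aggregate_results and render_report are assumptions; the actual report format is defined by the generate-report skill.

```python
import json
from collections import defaultdict


def aggregate_results(results_path):
    """Group result lines by category and compute simple per-category totals."""
    by_category = defaultdict(list)
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                by_category[row["category"]].append(row)
    summary = {}
    for category in sorted(by_category):
        rows = by_category[category]
        n = len(rows)
        summary[category] = {
            "tasks": n,
            "mean_score": sum(r["score"] for r in rows) / n,
            "mean_duration_ms": sum(r["duration_ms"] for r in rows) / n,
            "input_tokens": sum(r["tokens_used"]["input"] for r in rows),
            "output_tokens": sum(r["tokens_used"]["output"] for r in rows),
        }
    return summary


def render_report(summary):
    """Render the summary as a markdown table (the layout is illustrative)."""
    lines = [
        "| Category | Tasks | Mean score | Mean duration (ms) | Input tokens | Output tokens |",
        "|---|---|---|---|---|---|",
    ]
    for category, s in summary.items():
        lines.append(
            f"| {category} | {s['tasks']} | {s['mean_score']:.2f} | "
            f"{s['mean_duration_ms']:.0f} | {s['input_tokens']} | {s['output_tokens']} |"
        )
    return "\n".join(lines) + "\n"
```

Writing the returned string to web-bench-report.md would complete the final session.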
Guardrails
- One task per session. Execute one task, record the result, stop.
- Always append results; never overwrite web-bench-results.jsonl.
- Close the browser after each task.
- Do not skip tasks. If a task fails or is blocked, record FAIL and move on. Every task must have a result line.
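The "every task must have a result line" guardrail can be verified before report generation. The sketch below is one possible check, with a hypothetical function name; flagging duplicated ids is an extra assumption that follows from the one-line-per-completed-task rule rather than something the guardrails state explicitly.

```python
from collections import Counter


def check_coverage(task_ids, result_ids):
    """Compare task ids with result ids and report gaps and repeats."""
    counts = Counter(result_ids)
    missing = sorted(set(task_ids) - set(counts))
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    return {"missing": missing, "duplicated": duplicated}
```

For example, check_coverage([1, 2, 3], [1, 3, 3]) reports task 2 as missing and task 3 as duplicated.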
Error Recovery
If previously recorded output and the current session state disagree about what has already been completed, trust web-bench-results.jsonl as the source of truth for completed tasks because it is append-only and harder to corrupt.
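The recovery rule above amounts to discarding the session's own view whenever it differs from the file. A minimal sketch, assuming the session tracks completed task ids in a set (the name session_completed and the function reconcile_completed are hypothetical):

```python
import json


def reconcile_completed(results_path, session_completed):
    """Return the completed ids per the results file, plus the ids in dispute."""
    try:
        with open(results_path, encoding="utf-8") as f:
            recorded = {json.loads(line)["id"] for line in f if line.strip()}
    except FileNotFoundError:
        recorded = set()  # no results file means nothing has been completed
    # Ids that appear in exactly one of the two views.
    disputed = recorded ^ set(session_completed)
    return recorded, disputed
```

The first return value is what the session should treat as completed; the second is only useful for logging the disagreement.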