Playwright Automation
v0.0.24Build and run Playwright E2E tests — maps broad features, explores scoped areas, plans coverage, writes tests, and fixes flakes
$ drisp workflow install playwright-automationPlaywright Automation Workflow
You add comprehensive, high-quality Playwright tests to codebases.
Plugins
This workflow is authored against the plugins pinned in workflow.json. The markdown lists the
plugin surfaces, while workflow.json remains the source of truth for versions.
| Plugin | Skills Used Here |
|---|---|
agent-web-interface | agent-web-interface-guide |
app-exploration | map-feature-scope, capture-feature-evidence |
test-analysis | plan-test-coverage, generate-test-cases, review-test-cases |
playwright-automation | add-playwright-tests, analyze-test-codebase, write-test-code, review-test-code, fix-flaky-tests |
Skills
Load the relevant skill before each activity, and err on the side of loading a skill whenever the task even partially overlaps its scope. Skills carry the implementation knowledge — browser interaction discipline, locator strategies, anti-patterns, code templates, review gates, and repair workflow — so do not rely on memory or ad hoc reasoning when a listed skill applies.
| Activity | Skill |
|---|---|
| Full workflow entry point and orchestration | add-playwright-tests |
| Browser MCP interaction and live app observation | agent-web-interface-guide |
| Analyze test setup, config, conventions | analyze-test-codebase |
| Quickly map a broad feature into bounded exploration units | map-feature-scope |
| Explore the product and capture evidence | capture-feature-evidence |
| Decide what to test, coverage gaps, priorities | plan-test-coverage |
| Create TC-ID specs from shared evidence | generate-test-cases |
| Review TC-ID specs before implementation | review-test-cases |
| Write, edit, or refactor test code | write-test-code |
| Review test code before execution signoff | review-test-code |
| Debug test failures, check stability | fix-flaky-tests |
Task Tracker Discipline
This workflow spans long sessions, harness boundaries (Athena, sub-Claude, Codex), and --continue
resumes. Progress must be recoverable from the runtime tracker when one is provided, or from explicit
execution notes when no tracker tool exists. A stale tracker or missing notes can hide incomplete
gates and cause repeated work.
On every session start (fresh, --continue, harness sub-session, or delegated subagent that owns its own tracker):
- Read the existing tracker or execution notes before continuing work.
- Reconcile observed state with recorded state: completed work, in-flight work, blocked work, and the next gate.
- If the harness exposes a task tool, update it. If not, write the same status into execution notes.
- Surface mismatches before picking the next task.
On receipt of a large task:
- Decompose into granular sub-tasks immediately, before starting real work. Granular means a sub-task is a single coherent step that finishes in a small number of tool calls (analyze one file, write one suite, run one gate). If a sub-task plausibly spans more than that, split it again.
- Record the full breakdown in the tracker or execution notes in one batch so the plan is visible up front.
- Then work each sub-task independently — start it, mark
in_progress, finish it, markcompletedimmediately. Do not batch status updates to end-of-session.
Update cadence:
- Flip a task to
in_progressthe moment you start it, not partway in. - Flip it to
completedthe moment it's done, not at the next checkpoint. - The instant new work surfaces — a new sub-task, a Gate reset, a re-exploration trigger, a deferral, a Gate 3 failure-triage outcome — add it to the tracker or execution notes before acting on it.
- Granular decomposition exists specifically so updates stay frequent. If the tracker is going long stretches between updates, the breakdown is too coarse; split current items further.
Every Gate reset, brownfield rerun, and cross-session handoff depends on being able to recover "what's done, what's in flight, what's next" from the recorded state.
Orientation Steps
Session 1 begins by loading add-playwright-tests as the top-level workflow skill. The other
skills are focused sub-steps within that workflow, not competing entry points.
Understand the codebase
Load analyze-test-codebase and follow its methodology. Key questions: does Playwright exist?
What conventions are in use? What helpers or fixtures are already available? If Playwright is
missing, follow the scaffolding guidance from add-playwright-tests.
Understand the product
Choose the lightest safe exploration path for the requested problem. Browser-backed evidence is required both for new coverage and for fixing existing tests when the failure may involve product state, DOM shape, labels, roles, locators, navigation, timing, or data.
- If the request is narrow and obviously single-surface, go straight to
capture-feature-evidence. - If the request may span multiple routes, tabs, overlays, roles, or primary interactive surfaces,
load
map-feature-scopefirst. It writese2e-plan/feature-map.md, which tells the orchestrator whether one exploration run is enough or whether multiple scopedcapture-feature-evidenceruns should follow. - If the request is to fix, stabilize, update, or repair existing tests, first identify the failing user flow from the test and recent output, then dispatch a browser subagent to reproduce that flow against the current app before changing locators or waits.
Load capture-feature-evidence after scoping and capture either:
e2e-plan/exploration-report.mdfor genuinely single-surface featurese2e-plan/exploration/<subfeature>.mdfor mapped multi-surface features
The orchestrator then synthesizes e2e-plan/exploration-report.md as the canonical rollup handoff
into the Playwright execution layer.
Plan coverage when evidence is sufficient
Load plan-test-coverage after exploration. It consumes the exploration report and produces
e2e-plan/coverage-plan.md; it is not the exploration step.
Observations
After orientation, preserve the concrete product evidence in e2e-plan/exploration-report.md and,
for mapped features, the underlying e2e-plan/exploration/*.md files. When downstream work depends
on real product behavior, these artifacts gate the next phases. If the required exploration cannot
be completed, stop rather than guessing.
For existing-test repair, stale or missing exploration is not a reason to skip the browser. Refresh the relevant evidence first, even when the apparent fix is "just update the selector."
Workflow Sequence
The common progression is:
analyze codebase → map feature scope → deep-explore the scoped areas → synthesize rollup exploration report → plan coverage → generate specs → review specs → write tests → review code → run tests
Treat that as a dependency graph, not a rigid script. The right path depends on the problem:
- Narrow single-surface feature:
analyze codebase → capture-feature-evidence → plan → specs → review → write → review → run - Broad multi-surface feature:
analyze codebase → map-feature-scope → parallel/serial scoped capture-feature-evidence runs → rollup → plan → specs → review → write → review → run - Existing mature codebase with current artifacts: reuse valid
feature-map.md, exploration, or spec artifacts instead of regenerating them - Direct debugging request: run or inspect the failing test output, dispatch browser-backed current
app triage for the failing flow, then load
fix-flaky-testswith that evidence. Do not treat code inspection alone as proof that a locator, wait, or assertion should change.
Each step has prerequisites. The important gating relationships:
| Before you can... | You must have... |
|---|---|
Generate specs (generate-test-cases) | e2e-plan/exploration-report.md, plus mapped e2e-plan/exploration/*.md files when the feature was decomposed |
Write test code (write-test-code) | Specs that passed review (Gate 1) |
| Run tests | Code that passed review (Gate 2) |
Use analyze-test-codebase before write-test-code if conventions are still unclear. If tests fail or are unstable, load fix-flaky-tests before retrying.
Shared exploration is mandatory whenever the agent needs real product evidence to understand the flow, locators, validation, navigation, or error behavior. Broad features must be decomposed first so the main agent can dispatch bounded scoped explorations instead of one oversized browser session. Simple features can skip decomposition. If the browser is unavailable or the target cannot be explored and that evidence is required to proceed safely, do not continue with planning, spec generation, or test writing from assumptions.
Existing tests do not waive this requirement. Any change to a locator, role query, text query, visibility assertion, navigation assertion, or wait condition must be grounded in current browser evidence or in a freshly recorded failure-triage verdict.
Depth Targets
Concrete floors downstream gates enforce. These exist because prior runs delivered 12-test specs that were 70% visibility checks — the suite passed, and the feature was still uncovered.
A feature is non-trivial when exploration touched more than two routes, or the UI has more than
one primary interactive surface (e.g., a timeline + playback controls + an overlay menu all count
as one feature). Such features should normally pass through map-feature-scope before deep
exploration.
- Exploration inventory — non-trivial features require ≥20 distinct interactive elements recorded with locator candidates (semantic first:
getByRole,getByLabel,getByText; fall back todata-testidonly when semantic queries are ambiguous); ≥3 meaningful error, validation, or empty states deliberately triggered; and a labeled "elements not yet reached" list so downstream skills know the scope limits. - TC-ID count — non-trivial features require ≥15 TCs. If the author cannot reach 15, the spec has out-run the exploration; return to
capture-feature-evidencerather than padding. - Functional-to-visibility ratio — ≥60% of TCs in a spec must assert a state change (URL transition, data mutation, observable side effect, element value change after action). Render-existence assertions count toward the remaining ≤40%. A test that only checks
toBeVisible()on an element the test never interacted with is visibility, not functional coverage.
These targets are not soft suggestions. Gate 1 rejects specs that violate them with feedback pointing back at the gap.
Deferred Governance
"Deferred" is a scope boundary, not an escape valve. Every deferred TC must carry three fields in the spec:
- blocker — the concrete thing preventing automation (e.g., "requires seeded data", "third-party iframe with no test hook", "production-only API")
- un-defer plan — what changes to make this testable (e.g., "add fixture once backend ticket APM-412 lands")
- scope —
this-sprintorout-of-scope
Cap: at most 20% of TCs in a spec may be deferred. Over that, the spec fails Gate 1 with "scope too narrow — revisit exploration"; the fix is more evidence, not more deferrals.
Quality Gates
Four gates are mandatory. The first two are review-only — they produce findings but do not modify files. Gate 4 runs after tests pass and audits coverage against exploration. These gates exist because prior experience showed that skipping review leads to cascading rework: bad specs produce bad tests, bad tests produce confusing failures, and "12 tests passed" can still mean the feature is uncovered.
Gate 1: Review specs
After generate-test-cases, before write-test-code:
- Load
review-test-casesand run againsttest-cases/<feature>.md - If NEEDS REVISION — revise the spec, then rerun
review-test-casesbefore writing code
Gate 2: Review code
After write-test-code, before running tests:
- Load
review-test-codeand run against the test files - If NEEDS REVISION — revise the code, then rerun
review-test-codebefore running tests - Gate 2 applies to every implementation or repair edit, including locator changes, assertion
rewrites, fixture/auth/helper updates, and wait changes made after browser triage. Do not run or
rerun tests after code changes until
review-test-codehas passed for the changed files.
Gate 3: Test execution
Run npx playwright test <file> --reporter=list 2>&1 directly in the main agent. To run only the new or changed cases in isolation (the brownfield order below), use Playwright's filters: a path/grep filter (e.g. npx playwright test path/to/spec.ts -g "TC-LOGIN-003") or --project <name> for a specific project. Record the exact filter in the run ledger. If tests fail, load fix-flaky-tests.
Signoff requires three consecutive green runs of the new or changed test set on the same code, with no intervening file changes. Any change to test code, fixtures, helpers, or the test set itself resets the counter (see the Gate reset rule below).
Retry policy: no fixed numeric cap. The agent should continue fix-and-rerun cycles only while each cycle produces meaningful progress: a new root-cause hypothesis, a concrete code/config change, or materially new failure evidence. Stop when the issue is no longer moving, the next rerun would repeat the same experiment, or the blocker is external to the test code (missing environment, access, product bug, unclear requirements).
Execution order for brownfield suites:
- Run the newly added or changed tests in isolation until they are stable.
- Then run the relevant feature file or suite.
- Only run broader regression after the new coverage is green in isolation.
- If unrelated pre-existing tests fail, classify them separately as baseline instability. If they must be fixed because of shared-state leakage or broken shared infrastructure, report that work separately from the new TC implementation set.
Gate reset rule: if the planned TC set, spec, coverage plan, or executed test file changes after Gate 2 in a way that adds, removes, defers, or materially rewrites covered behavior, reset to the earliest affected gate:
- rerun Gate 2 for changed test code
- rerun Gate 1 if the spec or deferral set changed
- restart the Gate 3 consecutive-green counter
Any code change made during Gate 3 failure repair also returns to Gate 2 before the next counted execution. Browser triage explains what to change; it does not replace code review.
Do not count pre-change runs toward post-change signoff.
Execution-time deferral policy: deferral discovered during Gate 3 is exceptional. Only defer when the blocker is concrete and external to the current test implementation, the spec and coverage plan are updated with blocker, un-defer plan, and scope, and the run returns to the required earlier gate(s) before signoff. Do not defer to avoid re-exploration, locator verification, or code review. If the blocker is locator uncertainty, DOM drift from exploration, viewport/layout mismatch, or "needs browser confirmation", stop execution and refresh exploration evidence first.
Mandatory re-exploration triggers include:
- execution viewport or layout drift for coordinate- or layout-sensitive interactions
- lost locator uniqueness compared with exploration
- labels or control text absent, duplicated, or rendered outside the explored container
- workaround-heavy fixes such as force-clicks, JavaScript-dispatched clicks, coordinate clicks, or page-wide text/count oracles becoming the only apparent path forward
- any proposed selector/locator-only fix for an existing failing test when no current browser evidence has been captured for the failing flow
Failure triage via browser (mandatory before fixing). When a test fails, do not patch locators, add waits, or label the failure as flake until the cause is classified against the live product. Dispatch a fresh subagent with mcp__plugin_agent-web-interface_browser__* access and the relevant e2e-plan/exploration-report.md (or e2e-plan/exploration/<subfeature>.md) section. The subagent must:
- Navigate to the failing flow and reproduce the user action by hand.
- Compare the current DOM against the recorded evidence — labels, roles, structure, locator candidates.
- Return one verdict:
- product regression — the feature itself is broken. Do not adjust the test to make it pass; escalate and capture the bug.
- locator / DOM drift — the product still works but the controls moved or relabeled. Refresh the exploration artifact first, then update the test against the new evidence.
- test defect — the product behaves correctly; the bug is in test code, assertions, fixtures, or wait logic.
- environment / data — the flow is blocked by missing seed data, auth, or environment state; treat as a Gate 3 deferral candidate per the policy above.
Record the verdict and the subagent's evidence pointer in the run ledger before any code change. Patching locators without triage is the failure mode this rule exists to prevent.
Gate 3 run ledger: maintain a ledger in the session tracker or execution notes. For each counted run record:
- exact command
- exact test set or filter
- whether the run is eligible for signoff
- files changed since the prior run
- pass/fail counts
- whether the consecutive-green counter reset
Gate 4: Coverage gap audit
After Gate 3 passes, dispatch a fresh subagent to run a coverage audit. It receives the exploration report's element inventory and the executed test suite as inputs, and classifies each observed element as one of:
covered-functional— at least one test triggers the action and asserts a state changecovered-visibility-only— a test references the element but does not exercise it functionallyuncovered— no test touches it
The audit writes e2e-plan/coverage-audit.md and returns a verdict:
- GREEN — every promoted P0/P1 inventory-backed behavior is covered functionally, or any uncovered item is explicitly out of current scope and was never promoted into the accepted spec or coverage plan
- YELLOW — one or more promoted items remain deferred, partially covered, or visibility-only, but the operator has explicitly accepted those gaps with written reasoning
- RED — uncovered, visibility-only, or deferred promoted items remain without explicit acceptance, or the exploration/spec/execution chain is inconsistent
A promoted TC deferred during or after execution cannot end in GREEN unless the spec and coverage plan were revised, re-reviewed, and explicitly re-baselined before Gate 4.
"Tests passed" does not mean done. Gate 4 decides done.
Delegation Constraints
Four delegation rules are mandatory. Earlier runs treated delegation as optional and produced shallow exploration (no element inventory), self-reviewed specs (gaps the author anchored on), and completion signals the authoring agent could not honestly grade.
1. Feature mapping runs before broad exploration, not before every exploration. If the
requested feature may span multiple routes, tabs, overlays, roles, or primary interactive surfaces,
dispatch map-feature-scope first. For obviously narrow single-surface requests, skip it and go
straight to capture-feature-evidence. The output of mapping is e2e-plan/feature-map.md, not a deep evidence
report.
2. Deep exploration runs in scoped subagents. The main agent does not browse. Dispatch
capture-feature-evidence work via the Task tool and invoke each subagent with access to the
mcp__plugin_agent-web-interface_browser__* tools. If the feature map says MULTI-SURFACE,
dispatch one fresh subagent per parallel-safe = yes row and run parallel-safe = no rows
serially. Each subagent writes a structured scoped artifact, not a narrative summary.
Existing-test fixes use the same browser discipline. A repair subagent may be narrower than a full feature exploration, but it must still navigate the current app, reproduce the relevant user action, and report current labels, roles, DOM structure, and locator candidates before implementation edits.
3. Each review gate runs in a fresh subagent. The agent that wrote the spec does not run
review-test-cases. The agent that wrote the code does not run review-test-code. Dispatch each
review via a new Task call that receives only the artifact path and the review skill to load — not
the authoring transcript. Self-review consistently misses the gaps the author anchored on; a fresh
context is the cheapest way to get an independent read.
4. Test execution stays in the main agent. Test output is the proof artifact — the main agent
runs npx playwright test directly so it can verify the output is real and interpret failures in
context. A subagent's summary of test results isn't trustworthy enough to gate completion.
Gate 4 (coverage audit) follows rule 2: dispatch a fresh subagent, feed it the exploration report and test suite, let it produce coverage-audit.md.
Pass file paths, conventions, and concrete output expectations into every Task call. Instruct each subagent to load the appropriate skill before acting.
Guardrails
Quick-reference checklist — in addition to the session protocol's guardrails:
- Browse the product before writing specs or tests
- Browse the current product before fixing existing tests — especially before changing locators, selectors, text assertions, navigation assertions, or waits
- Decompose broad features before deep exploration — use
map-feature-scope - Delegate scoped exploration to subagents — the main agent does not call browser tools directly
- Record observed behavior before turning exploration into specs or code
- Each review gate runs in a fresh subagent, not the artifact's author
- Run the code review gate after every test-code repair — no selector, locator, wait, fixture, or assertion edit goes straight to execution
- Spec meets depth targets: ≥15 TCs for non-trivial features, ≥60% functional assertions, ≤20% deferred
- Stabilize new or changed tests in isolation before broader regression in brownfield suites
- Reset the affected gates if scope, spec, coverage plan, or executed test set changes after Gate 2
- Keep a Gate 3 run ledger so signoff runs, resets, and file changes are auditable
- Run tests before marking test work as done
- Run the Gate 4 coverage audit after tests pass — tests passing is necessary but not sufficient