Parallel Scaling: Worker Count vs Pass Rate

Date: 2026-02-24
Specs: 118 gold-standard (1 skipped due to an import error)
Browser: Firefox (headless)
Runs per config: 2

Context

The original A/B experiment compared a codegen baseline (15.3% pass rate) against gold-standard tests run sequentially with 1 worker. After completing that comparison, we tested whether increasing Playwright’s worker count changed pass rates or introduced contention-related flakiness.

Results

Figure: Combined chart showing the pass rate plateau and the duration decrease as worker count increases.

Figure: Pass rate vs worker count, showing a plateau at 98.5–99%.

Figure: Test suite duration vs worker count, showing a sub-linear speedup (about 8.6x at 32 workers).

Summary by Worker Count

Workers	Avg Pass Rate	Avg Failures	Avg Duration	Speedup vs 1 worker
1	84.5%	18.5 files	~50 min	—
8	98.6%	9.5 tests	14.4 min	~3.5x
16	98.7%	9.0 tests	8.7 min	~5.7x
32	98.8%	8.5 tests	5.8 min	~8.6x
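The Speedup column can be reproduced from the average durations, and dividing each speedup by its worker count gives a rough parallel efficiency. The snippet below is a minimal sketch: the function name `scaling` is illustrative, and the 1-worker baseline is taken as exactly 50 minutes even though the table only reports it as approximate.

```python
# Average durations in minutes, copied from the summary table.
# Assumption: the 1-worker baseline is treated as exactly 50 min,
# although the report gives it only as "~50 min".
avg_duration_min = {1: 50.0, 8: 14.4, 16: 8.7, 32: 5.8}


def scaling(durations):
    """Return {workers: (speedup, efficiency)} relative to the 1-worker run."""
    base = durations[1]
    rows = {}
    for workers, minutes in sorted(durations.items()):
        speedup = base / minutes        # how many times faster than 1 worker
        efficiency = speedup / workers  # 1.0 would be perfectly linear scaling
        rows[workers] = (speedup, efficiency)
    return rows


for workers, (speedup, efficiency) in scaling(avg_duration_min).items():
    print(f"{workers:>2} workers: {speedup:.1f}x speedup, {efficiency:.0%} efficiency")
```

Under these assumptions, efficiency falls as workers are added, which is what a speedup of ~8.6x on 32 workers implies.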

Detailed Run Data

Sequential Baseline (1 worker), per spec file

Workers	Run	Passed	Failed	Skipped	Total	Pass Rate	Duration
1	Run 1	98	21	—	119	82.4%	~50+ min
1	Run 2	103	16	—	119	86.6%	~50+ min

Parallel Results, per individual test

Workers	Run	Passed	Failed	Skipped	Total (non-skipped)	Pass Rate	Duration
8	Run 1	699	10	34	709	98.6%	876s (14.6 min)
8	Run 2	701	9	33	710	98.7%	850s (14.2 min)
16	Run 1	709	7	27	716	99.0%	512s (8.5 min)
16	Run 2	698	11	34	709	98.4%	534s (8.9 min)
32	Run 1	698	9	36	707	98.7%	352s (5.9 min)
32	Run 2	705	8	30	713	98.9%	343s (5.7 min)
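The per-configuration averages can be derived from the rows above. The following is a minimal sketch of that arithmetic, not the script used for the report; the names `runs` and `summarize` are illustrative, and pass rate is computed as passed / (passed + failed), matching the non-skipped total.

```python
from statistics import mean

# (workers, passed, failed, duration_s) per run, copied from the parallel results table.
runs = [
    (8, 699, 10, 876), (8, 701, 9, 850),
    (16, 709, 7, 512), (16, 698, 11, 534),
    (32, 698, 9, 352), (32, 705, 8, 343),
]


def summarize(runs):
    """Average pass rate, failure count, and duration per worker count."""
    by_workers = {}
    for workers, passed, failed, duration_s in runs:
        by_workers.setdefault(workers, []).append((passed, failed, duration_s))

    summary = {}
    for workers, rows in by_workers.items():
        summary[workers] = {
            # Skipped tests are excluded from the denominator.
            "pass_rate_pct": mean(100 * p / (p + f) for p, f, _ in rows),
            "failures": mean(f for _, f, _ in rows),
            "duration_min": mean(d for _, _, d in rows) / 60,
        }
    return summary


for workers, stats in summarize(runs).items():
    print(workers, {name: round(value, 2) for name, value in stats.items()})
```

Averaging the unrounded per-run rates can differ from the summary table by up to 0.1 percentage point, depending on when the rounding is applied.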

Key Findings

Pass rates plateau at ~98.5–99% regardless of worker count (8, 16, 32). Across the two runs per configuration, raising parallelism from 8 to 32 workers did not introduce measurable flakiness.
Sequential runs are the outlier. The 1-worker pass rate (84.5%) is roughly 14 percentage points lower than every parallel configuration. The likely cause, not verified in this experiment, is that the longer total runtime increases exposure to network timeouts, session expiry, and server-side state drift.
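Duration drops steeply but sub-linearly with worker count: ~50 min → 14.4 min → 8.7 min → 5.8 min, for speedups of ~3.5x, ~5.7x, and ~8.6x at 8, 16, and 32 workers. Each doubling of workers beyond 8 cuts duration by about 1.5x to 1.7x rather than 2x, so returns are already diminishing by 32 workers for 118 specs.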
Failures are stable across the parallel configurations. The same ~9 tests fail at 8, 16, and 32 workers (7 to 11 failures per run). These are known issues (missing demo data, non-existent routes, environment-specific problems), not resource contention.
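Measurement difference: Sequential runs are counted at the spec-file level (1 file = 1 pass/fail), while parallel runs are counted at the individual-test level. This makes direct numerical comparison imprecise. The direction of the difference is still clear: ~9 failing tests can account for at most ~9 failing files, well below the 16 to 21 files that failed in the sequential runs.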