Testing and Results

Exploratory Testing with Randomized Inputs

Beyond the codegen-driven test generation, the pipeline also supports exploratory testing with randomized inputs. By using a seed-based random input generator, tests can cover a wide range of edge cases while remaining fully reproducible. The same seed always produces the same inputs, making failures easy to investigate.

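import random

def generate_test_inputs(seed, field_configs):
    """Generate randomized but reproducible test inputs.

    The same seed and the same field_configs (in the same order)
    always yield the same inputs.
    """
    # A private Random instance keeps the sequence independent of global state.
    rng = random.Random(seed)
    inputs = {}
    for field, config in field_configs.items():
        if config["type"] == "text":
            # Lowercase letters and spaces, between 1 and "max" characters long.
            length = rng.randint(1, config.get("max", 255))
            inputs[field] = "".join(
                rng.choices("abcdefghijklmnopqrstuvwxyz ", k=length)
            )
        elif config["type"] == "select":
            # Every option is equally likely, including rarely used ones.
            inputs[field] = rng.choice(config["options"])
        elif config["type"] == "number":
            # randint is inclusive, so "min" and "max" themselves can be drawn.
            inputs[field] = rng.randint(
                config.get("min", 0), config.get("max", 9999)
            )
        # Fields with any other type are skipped.
    return inputs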

This approach can catch issues that hand-written tests miss (unexpected field lengths, uncommon dropdown values, and boundary conditions) without sacrificing determinism.
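One way to make a failure easy to replay is to give every test its own seed derived from a single base seed. The sketch below is an assumption about how that could look, not part of the pipeline described here: the names `derive_seed` and `make_rng` and the test identifier are hypothetical. It uses `hashlib` rather than the built-in `hash()`, because string hashes are salted per process unless `PYTHONHASHSEED` is fixed.

```python
import hashlib
import random


def derive_seed(base_seed: int, test_id: str) -> int:
    """Derive a stable per-test seed from a base seed and a test identifier."""
    digest = hashlib.sha256(f"{base_seed}:{test_id}".encode("utf-8")).digest()
    # The first 8 bytes are plenty for seeding and stay stable across processes.
    return int.from_bytes(digest[:8], "big")


def make_rng(base_seed: int, test_id: str) -> random.Random:
    """Return a private random generator for one test (hypothetical helper)."""
    return random.Random(derive_seed(base_seed, test_id))


# Two generators built from the same base seed and test id produce the same draws.
first = make_rng(1234, "test_create_user_form")
second = make_rng(1234, "test_create_user_form")
draws_first = [first.randint(0, 9999) for _ in range(5)]
draws_second = [second.randint(0, 9999) for _ in range(5)]
print(draws_first == draws_second)
```

Recording the base seed in the test report is then enough to regenerate the exact inputs of any failing test.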

Baseline Failure Rate Analysis

Before relying on the generated tests as a quality signal, I established a baseline failure rate through systematic stability analysis:

Step	Approach
Stability runs	Execute the full suite N times under identical conditions
Baseline pass rate	Calculate the mean pass rate across all stability runs
Failure classification	Categorize each failure as flaky, genuine bug, or environment-related
Regression detection	Flag any test whose pass rate drops below its established baseline between builds

This baseline makes it possible to distinguish real regressions from noise, so that pipeline failures correspond to actual issues in the documentation or the software being tested.
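The stability analysis above can be sketched roughly as follows. This is a minimal illustration under assumptions: each run is represented as a dictionary mapping a test name to pass or fail, and the function names, test names, and `tolerance` parameter are hypothetical rather than taken from the pipeline.

```python
from collections import defaultdict


def pass_rates(runs):
    """Compute the per-test pass rate across runs.

    runs: list of dicts mapping test name -> True (passed) / False (failed).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for run in runs:
        for name, ok in run.items():
            total[name] += 1
            passed[name] += int(ok)
    return {name: passed[name] / total[name] for name in total}


def intermittent_tests(rates):
    """Tests that both passed and failed under identical conditions (flaky candidates)."""
    return sorted(name for name, rate in rates.items() if 0.0 < rate < 1.0)


def flag_regressions(baseline, current, tolerance=0.0):
    """Tests whose current pass rate fell below their baseline minus a tolerance."""
    return sorted(
        name
        for name, rate in current.items()
        if name in baseline and rate < baseline[name] - tolerance
    )


# Illustrative data only, not measured results.
stability_runs = [
    {"test_login": True, "test_export": True},
    {"test_login": True, "test_export": False},
    {"test_login": True, "test_export": True},
    {"test_login": True, "test_export": True},
]
baseline = pass_rates(stability_runs)
new_build = pass_rates([{"test_login": False, "test_export": True}])

print(intermittent_tests(baseline))
print(flag_regressions(baseline, new_build))
```

Classifying a failure as a genuine bug or an environment problem still needs a human or additional signals; the sketch only separates intermittent tests from consistent ones.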

Output Metrics

Metric	Value
User manuals produced	4
HTML pages generated	455
PDF pages generated	848
Test files created	119
Total tests written	~765

Efficiency Comparison

	First Manual	Remaining 3 Manuals
Time	~1.5 months	~4 hours total
Hours worked	~80–120 hours	~4 hours
Per manual	~80–120 hours	~1.3 hours each
Efficiency gain	n/a (baseline)	~20–30x faster (first manual vs. all three combined; ~60–90x per manual)

The first manual required establishing all patterns, templates, configurations, and the pipeline itself. Once those were in place, subsequent manuals reused everything: the agent only needed section names to generate tests and documentation for each additional manual.
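The arithmetic behind the efficiency figures can be checked with a few lines. The inputs are the approximate values reported in the table; the variable names are illustrative.

```python
first_manual_hours = (80, 120)   # reported range for the first manual
remaining_total_hours = 4        # reported total for the remaining three manuals
remaining_manuals = 3

per_manual_hours = remaining_total_hours / remaining_manuals

# First manual compared with the remaining three manuals combined.
total_speedup = tuple(h / remaining_total_hours for h in first_manual_hours)
# First manual compared with a single remaining manual.
per_manual_speedup = tuple(h / per_manual_hours for h in first_manual_hours)

print(f"Hours per remaining manual: ~{per_manual_hours:.1f}")
print(f"Versus all three combined: {total_speedup[0]:.0f}x to {total_speedup[1]:.0f}x")
print(f"Versus one remaining manual: {per_manual_speedup[0]:.0f}x to {per_manual_speedup[1]:.0f}x")
```

The ~20–30x figure corresponds to comparing the first manual with the combined four hours; on a per-manual basis the ratio is roughly 60–90x.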

Future Testing: Measuring Pipeline Efficiency

Note

This section is planned. It outlines the experimental framework I intend to use for evaluating how much each pipeline stage contributes to output quality.

Stage Ablation

Systematically remove individual pipeline stages and measure the effect on final output quality. Each configuration runs the full pipeline minus one stage:

Configuration	Stage Removed	Question Answered
Baseline	None (full pipeline)	What does the complete pipeline produce?
A	Programmatic Normalization	How much does pre-cleaning matter?
B	Readability Audit	Does the readability pass improve downstream stages?
C	Style Guide Enforcement	Does style enforcement affect final accuracy?
D	Anti-Pattern Check	How much do accumulated anti-patterns catch?
E	Completeness Audit	What gets missed without side-by-side verification?
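The configurations in the table can be generated mechanically. The sketch below assumes the five stages run in the order listed, which the table does not state, and the function name `ablation_configs` is hypothetical.

```python
from string import ascii_uppercase

# Stage names taken from the table; the ordering is an assumption.
STAGES = [
    "Programmatic Normalization",
    "Readability Audit",
    "Style Guide Enforcement",
    "Anti-Pattern Check",
    "Completeness Audit",
]


def ablation_configs(stages):
    """Map each configuration label to the list of stages that remain active."""
    configs = {"Baseline": list(stages)}
    for label, removed in zip(ascii_uppercase, stages):
        # Each lettered configuration is the full pipeline minus exactly one stage.
        configs[label] = [stage for stage in stages if stage != removed]
    return configs


configs = ablation_configs(STAGES)
for label, active in configs.items():
    removed = sorted(set(STAGES) - set(active))
    print(label, "removes", removed[0] if removed else "nothing")
```

Each resulting stage list would then be passed to whatever function runs the pipeline, and the outputs scored with the same rubric.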

Output Accuracy Scoring

Score final documentation against a ground-truth rubric (e.g., manual human-reviewed output)
Possible dimensions: content completeness, structural correctness, image placement, style conformance
Each dimension scored independently to isolate where quality degrades
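A per-dimension rubric might be sketched as below. The scoring rule, the fraction of ground-truth items that appear in the produced output, is a deliberately simple stand-in chosen for illustration; the real rubric is not specified here, and the item names in the example are invented.

```python
DIMENSIONS = (
    "content_completeness",
    "structural_correctness",
    "image_placement",
    "style_conformance",
)


def score_dimension(expected_items, produced_items):
    """Fraction of ground-truth items present in the produced output (0.0 to 1.0)."""
    expected = set(expected_items)
    if not expected:
        return 1.0  # Nothing was required, so nothing can be missing.
    return len(expected & set(produced_items)) / len(expected)


def score_document(ground_truth, produced):
    """Score every dimension independently so a drop can be traced to one of them."""
    return {
        dim: score_dimension(ground_truth.get(dim, []), produced.get(dim, []))
        for dim in DIMENSIONS
    }


# Hypothetical example: one expected section is missing from the output.
ground_truth = {
    "content_completeness": ["intro", "setup", "usage", "faq"],
    "image_placement": ["fig-login", "fig-export"],
}
produced = {
    "content_completeness": ["intro", "setup", "usage"],
    "image_placement": ["fig-login", "fig-export"],
}
scores = score_document(ground_truth, produced)
print(scores)
```

Keeping the scores separate, rather than averaging them into one number, is what makes it possible to see which dimension degrades when a stage is removed.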

Progressive Complexity

Start with simple, well-structured pages and increase complexity (nested forms, multi-step workflows, edge-case UI patterns)
Track where accuracy drops off to identify the pipeline’s complexity ceiling
Measure whether additional stages recover accuracy at higher complexity levels
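Locating the complexity ceiling could look like the following sketch. It assumes complexity is expressed as ordered integer levels and that "drops off" means falling below a fixed accuracy threshold; both the 0.9 threshold and the sample accuracies are placeholders, not measurements.

```python
def complexity_ceiling(accuracy_by_level, threshold=0.9):
    """Return the highest level reached before accuracy first falls below threshold.

    Returns None if even the simplest level is below the threshold.
    """
    ceiling = None
    for level in sorted(accuracy_by_level):
        if accuracy_by_level[level] < threshold:
            break  # Stop at the first drop; later recoveries do not count.
        ceiling = level
    return ceiling


# Placeholder accuracies per complexity level (1 = simple, 5 = most complex).
sample = {1: 0.98, 2: 0.95, 3: 0.91, 4: 0.78, 5: 0.80}
print(complexity_ceiling(sample))
```

Running the same function on results from pipelines with extra stages would show whether those stages move the ceiling to a higher level.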

Statistical Validation

Compare ablation configurations against the baseline using paired tests (e.g., paired t-test or Wilcoxon signed-rank, depending on distribution)
Null hypothesis: removing a stage has no effect on output quality
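Reject the null at a significance level chosen in advance as evidence of a stage’s contribution; because several configurations are compared against the same baseline, correct for multiple comparisons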
Report effect sizes alongside p-values to distinguish statistical significance from practical significance
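The paired comparison can be illustrated with the standard library alone. The sketch below computes an exact two-sided Wilcoxon signed-rank p-value by enumerating every sign assignment, which is only practical for small samples, plus a paired Cohen's d as the effect size. The scores are made up for illustration, and a real analysis would more likely rely on an established statistics library.

```python
from itertools import product
from statistics import mean, stdev


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # Average of the 1-based positions i+1 .. j+1.
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(baseline, ablated):
    """Exact two-sided p-value; zero differences are dropped before ranking."""
    diffs = [b - a for b, a in zip(baseline, ablated) if b != a]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2  # Expected rank sum under the null hypothesis.
    extreme = total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        if abs(w_plus - centre) >= abs(observed - centre) - 1e-12:
            extreme += 1
    return extreme / total


def paired_effect_size(baseline, ablated):
    """Cohen's d for paired samples: mean difference / std dev of differences."""
    diffs = [b - a for b, a in zip(baseline, ablated)]
    return mean(diffs) / stdev(diffs)


# Made-up rubric scores (0-100) for the same eight pages under two configurations.
baseline_scores = [92, 88, 95, 90, 85, 91, 89, 93]
ablated_scores = [90, 84, 94, 85, 82, 84, 81, 87]
p_value = wilcoxon_signed_rank_exact(baseline_scores, ablated_scores)
effect = paired_effect_size(baseline_scores, ablated_scores)
print(f"p = {p_value:.4f}, d = {effect:.2f}")
```

With eight pairs that all favor the baseline, the smallest achievable two-sided p-value is 2/256, which shows how sample size bounds what the test can detect and why the effect size is worth reporting alongside it.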