Testing and Results

Exploratory Testing with Randomized Inputs

Beyond the codegen-driven test generation, the pipeline also supports exploratory testing with randomized inputs. By using a seed-based random input generator, tests can cover a wide range of edge cases while remaining fully reproducible. The same seed always produces the same inputs, making failures easy to investigate.

```python
import random

def generate_test_inputs(seed, field_configs):
    """Generate randomized but reproducible test inputs."""
    # A dedicated Random instance keeps the sequence independent of global state.
    rng = random.Random(seed)
    inputs = {}
    for field, config in field_configs.items():
        if config["type"] == "text":
            # Random-length string of lowercase letters and spaces.
            length = rng.randint(1, config.get("max", 255))
            inputs[field] = "".join(
                rng.choices("abcdefghijklmnopqrstuvwxyz ", k=length)
            )
        elif config["type"] == "select":
            # Any of the dropdown options, including rarely used ones.
            inputs[field] = rng.choice(config["options"])
        elif config["type"] == "number":
            # Integer in the inclusive range [min, max].
            inputs[field] = rng.randint(
                config.get("min", 0), config.get("max", 9999)
            )
    return inputs
```

This approach catches issues that hand-written tests miss (unexpected field lengths, uncommon dropdown values, and boundary conditions) without sacrificing determinism.
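As a quick check of the reproducibility claim, the sketch below repeats the generator so it runs on its own, then calls it twice with the same seed. The `field_configs` dictionary and its field names are hypothetical, invented only for illustration.

```python
import random


def generate_test_inputs(seed, field_configs):
    """Copy of the generator above, repeated so this sketch runs alone."""
    rng = random.Random(seed)
    inputs = {}
    for field, config in field_configs.items():
        if config["type"] == "text":
            length = rng.randint(1, config.get("max", 255))
            inputs[field] = "".join(
                rng.choices("abcdefghijklmnopqrstuvwxyz ", k=length)
            )
        elif config["type"] == "select":
            inputs[field] = rng.choice(config["options"])
        elif config["type"] == "number":
            inputs[field] = rng.randint(
                config.get("min", 0), config.get("max", 9999)
            )
    return inputs


# Hypothetical form definition; the field names are illustrative only.
field_configs = {
    "title": {"type": "text", "max": 40},
    "priority": {"type": "select", "options": ["low", "medium", "high"]},
    "quantity": {"type": "number", "min": 1, "max": 500},
}

first = generate_test_inputs(seed=1234, field_configs=field_configs)
second = generate_test_inputs(seed=1234, field_configs=field_configs)

# Same seed, same inputs: a failing run can be replayed exactly.
assert first == second
print(first)
```

Logging the seed alongside each test run is what makes the replay possible; without it, the failing inputs cannot be regenerated.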

Baseline Failure Rate Analysis

Before relying on the generated tests as a quality signal, I established a baseline failure rate through systematic stability analysis:

| Metric | Approach |
| --- | --- |
| Stability runs | Execute the full suite N times under identical conditions |
| Baseline pass rate | Calculate the mean pass rate across all stability runs |
| Failure classification | Categorize each failure as flaky, genuine bug, or environment-related |
| Regression detection | Flag any test whose pass rate drops below its established baseline between builds |

This baseline makes it possible to distinguish real regressions from noise, so that pipeline failures correspond to actual issues in the documentation or the software being tested.
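The baseline and regression steps could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the `tolerance` parameter, the test names, and all pass/fail values are assumptions made up for the example.

```python
from collections import defaultdict


def pass_rates(runs):
    """Return each test's pass rate across runs.

    `runs` is a list of dicts mapping test name -> True (pass) / False (fail).
    """
    passed = defaultdict(int)
    seen = defaultdict(int)
    for run in runs:
        for test, ok in run.items():
            seen[test] += 1
            passed[test] += int(ok)
    return {test: passed[test] / seen[test] for test in seen}


def flag_regressions(baseline, current, tolerance=0.0):
    """List tests whose current pass rate is below baseline minus tolerance."""
    return sorted(
        test
        for test, rate in current.items()
        if test in baseline and rate < baseline[test] - tolerance
    )


# Made-up results: 4 stability runs, then 4 runs against a new build.
stability_runs = [
    {"test_login": True, "test_export": True},
    {"test_login": True, "test_export": False},
    {"test_login": True, "test_export": True},
    {"test_login": True, "test_export": True},
]
new_build_runs = [
    {"test_login": False, "test_export": True},
    {"test_login": True, "test_export": True},
    {"test_login": False, "test_export": True},
    {"test_login": True, "test_export": False},
]

baseline = pass_rates(stability_runs)
current = pass_rates(new_build_runs)
regressed = flag_regressions(baseline, current)
print(regressed)
```

In this invented data, `test_export` fails once in both sets of runs, so it stays at its baseline and is not flagged, while `test_login` drops from always passing to passing half the time. A nonzero `tolerance` would make the check less sensitive to small run-to-run variation.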

Output Metrics

| Metric | Value |
| --- | --- |
| User manuals produced | 4 |
| HTML pages generated | 455 |
| PDF pages generated | 848 |
| Test files created | 119 |
| Total tests written | ~765 |

Efficiency Comparison

| | First Manual | Remaining 3 Manuals |
| --- | --- | --- |
| Time | ~1.5 months | ~4 hours total |
| Hours worked | ~80–120 hours | ~4 hours |
| Per manual | ~80–120 hours | ~1.3 hours each |
| Efficiency gain | n/a | ~20–30x faster |

The first manual required establishing all patterns, templates, configurations, and the pipeline itself. Once those were in place, subsequent manuals reused everything; the agent only needed section names to generate tests and documentation for each additional manual.
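The arithmetic behind the comparison can be checked directly. The ~20–30x figure matches dividing the first manual's hours by the combined hours for the remaining three; that reading is an inference from the table rather than something stated explicitly. On a per-manual basis the same numbers give a larger ratio.

```python
# Figures taken from the comparison table above (all approximate).
first_manual_hours = (80, 120)  # low and high estimate for the first manual
remaining_total_hours = 4       # total for the remaining three manuals
remaining_manuals = 3

per_manual_hours = remaining_total_hours / remaining_manuals

# First manual vs. all three remaining manuals combined.
combined_speedup = tuple(h / remaining_total_hours for h in first_manual_hours)

# First manual vs. a single later manual.
per_manual_speedup = tuple(
    h * remaining_manuals / remaining_total_hours for h in first_manual_hours
)

print(round(per_manual_hours, 1))  # 1.3
print(combined_speedup)            # (20.0, 30.0)
print(per_manual_speedup)          # (60.0, 90.0)
```

So the table's figure is the conservative one: comparing one manual to one manual gives roughly 60–90x.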

Future Testing: Measuring Pipeline Efficiency

Note: This section is planned. It outlines the experimental framework I intend to use for evaluating how much each pipeline stage contributes to output quality.

Stage Ablation

Systematically remove individual pipeline stages and measure the effect on final output quality. Each configuration runs the full pipeline minus one stage:

| Configuration | Stage Removed | Question Answered |
| --- | --- | --- |
| Baseline | None (full pipeline) | What does the complete pipeline produce? |
| A | Programmatic Normalization | How much does pre-cleaning matter? |
| B | Readability Audit | Does the readability pass improve downstream stages? |
| C | Style Guide Enforcement | Does style enforcement affect final accuracy? |
| D | Anti-Pattern Check | How much do accumulated anti-patterns catch? |
| E | Completeness Audit | What gets missed without side-by-side verification? |
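One way to generate these configurations is sketched below. The stage names come from the table; treating the pipeline as an ordered list of names, the order itself, and the `ablation_configs` helper are assumptions for illustration, since the experiment is still planned.

```python
# Stage names come from the table above; the order is an assumption.
PIPELINE_STAGES = [
    "Programmatic Normalization",
    "Readability Audit",
    "Style Guide Enforcement",
    "Anti-Pattern Check",
    "Completeness Audit",
]


def ablation_configs(stages):
    """Return {label: stages to run}: the baseline plus one config per removed stage."""
    configs = {"Baseline": list(stages)}
    for index, removed in enumerate(stages):
        label = chr(ord("A") + index)  # A, B, C, ...
        configs[label] = [s for s in stages if s != removed]
    return configs


configs = ablation_configs(PIPELINE_STAGES)
for label, stages in configs.items():
    removed = set(PIPELINE_STAGES) - set(stages)
    print(label, "removes", removed or "nothing")
```

Removing exactly one stage per configuration keeps each comparison against the baseline attributable to a single stage, at the cost of not measuring interactions between stages.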

Output Accuracy Scoring

  • Score final documentation against a ground-truth rubric (e.g., manually reviewed reference output)

  • Possible dimensions: content completeness, structural correctness, image placement, style conformance

  • Each dimension scored independently to isolate where quality degrades
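Independent per-dimension scoring might look like the sketch below. It assumes the rubric is a set of pass/fail checks grouped by dimension, which is one possible design and not a decided one; the dimension names follow the list above, and the check results are invented.

```python
def score_by_dimension(rubric_results):
    """Return the fraction of rubric checks passed in each dimension.

    `rubric_results` maps dimension -> list of booleans, one per rubric check.
    Dimensions are scored separately and are not averaged into one number here.
    """
    return {
        dimension: sum(checks) / len(checks)
        for dimension, checks in rubric_results.items()
        if checks  # skip dimensions with no checks to avoid dividing by zero
    }


# Placeholder results for one page; the True/False values are invented.
page_results = {
    "content completeness": [True, True, False, True],
    "structural correctness": [True, True],
    "image placement": [True, False],
    "style conformance": [True, True, True, True],
}
scores = score_by_dimension(page_results)
print(scores)
```

Keeping the scores separate means a drop in, say, image placement is visible even when the other dimensions are unaffected.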

Progressive Complexity

  • Start with simple, well-structured pages and increase complexity (nested forms, multi-step workflows, edge-case UI patterns)

  • Track where accuracy drops off to identify the pipeline’s complexity ceiling

  • Measure whether additional stages recover accuracy at higher complexity levels
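A possible way to locate the complexity ceiling, assuming complexity is expressed as ordered integer levels and that "drops off" means falling below a chosen accuracy threshold. Both the level scheme and the threshold are assumptions, and the accuracy values are invented.

```python
def complexity_ceiling(accuracy_by_level, threshold):
    """Return the highest level reached before accuracy first drops below threshold.

    `accuracy_by_level` maps an integer complexity level to an accuracy in [0, 1].
    Returns None if even the lowest level is below the threshold.
    """
    ceiling = None
    for level in sorted(accuracy_by_level):
        if accuracy_by_level[level] < threshold:
            break
        ceiling = level
    return ceiling


# Invented numbers: levels 1-4 might stand for simple pages through edge-case UI.
accuracy = {1: 0.98, 2: 0.95, 3: 0.81, 4: 0.62}
ceiling = complexity_ceiling(accuracy, threshold=0.90)
print(ceiling)  # 2
```

Running the same function on results from a pipeline with an extra stage would show whether that stage moves the ceiling to a higher level.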

Statistical Validation

  • Compare ablation configurations against the baseline using paired tests (e.g., paired t-test or Wilcoxon signed-rank, depending on distribution)

  • Null hypothesis: removing a stage has no effect on output quality

  • Reject the null at a chosen significance level as evidence that the stage contributes to output quality

  • Report effect sizes alongside p-values to distinguish statistical significance from practical significance
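The paired comparison can be sketched with the standard library alone. The example computes the paired t statistic and a paired effect size (Cohen's d_z, the mean difference divided by the standard deviation of the differences). The scores are invented placeholders, and the choice of d_z as the effect size is an assumption rather than part of the plan.

```python
import math
import statistics


def paired_summary(baseline_scores, ablated_scores):
    """Return mean difference, paired t statistic, and Cohen's d_z for paired scores."""
    if len(baseline_scores) != len(ablated_scores):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for b, a in zip(baseline_scores, ablated_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    d_z = mean_diff / sd_diff  # effect size for paired designs
    return {"n": n, "mean_diff": mean_diff, "t": t_stat, "d_z": d_z}


# Invented accuracy scores for the same six pages under two configurations.
baseline = [0.92, 0.88, 0.95, 0.90, 0.85, 0.91]
ablated = [0.85, 0.80, 0.93, 0.84, 0.80, 0.86]
summary = paired_summary(baseline, ablated)
print(summary)
```

The p-value follows from a t distribution with n - 1 degrees of freedom; in practice `scipy.stats.ttest_rel` returns both the statistic and the p-value, and `scipy.stats.wilcoxon` covers the non-parametric case. The function fails with a division by zero if all differences are identical, which would need handling in real use.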