Testing and Results#
Exploratory Testing with Randomized Inputs#
Beyond the codegen-driven test generation, the pipeline also supports exploratory testing with randomized inputs. By using a seed-based random input generator, tests can cover a wide range of edge cases while remaining fully reproducible. The same seed always produces the same inputs, making failures easy to investigate.
```python
import random


def generate_test_inputs(seed, field_configs):
    """Generate randomized but reproducible test inputs."""
    # A dedicated Random instance keeps results independent of global state.
    rng = random.Random(seed)
    inputs = {}
    for field, config in field_configs.items():
        if config["type"] == "text":
            # Random-length string of lowercase letters and spaces.
            length = rng.randint(1, config.get("max", 255))
            inputs[field] = "".join(
                rng.choices("abcdefghijklmnopqrstuvwxyz ", k=length)
            )
        elif config["type"] == "select":
            # Pick one of the allowed dropdown options.
            inputs[field] = rng.choice(config["options"])
        elif config["type"] == "number":
            # Integer within the configured (inclusive) bounds.
            inputs[field] = rng.randint(
                config.get("min", 0), config.get("max", 9999)
            )
    return inputs
```
This approach catches issues that hand-written tests tend to miss, such as unexpected field lengths, uncommon dropdown values, and boundary conditions, without sacrificing determinism.
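One way to keep randomized runs traceable is to derive a separate seed for each test from a single base seed and log it with the result. The following is a minimal sketch under stated assumptions: `derive_seed`, `sample_number`, and the test names are hypothetical and not part of the pipeline described here.

```python
import hashlib
import random


def derive_seed(base_seed: int, test_name: str) -> int:
    """Derive a stable per-test seed (hypothetical helper).

    hashlib is used instead of the built-in hash(), whose string hashes
    vary between interpreter runs unless PYTHONHASHSEED is fixed.
    """
    digest = hashlib.sha256(f"{base_seed}:{test_name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sample_number(seed: int, low: int = 0, high: int = 9999) -> int:
    """Stand-in for an input generator: one seeded random integer."""
    return random.Random(seed).randint(low, high)


# Hypothetical test names, used only for illustration.
seed_a = derive_seed(42, "test_create_invoice")
seed_b = derive_seed(42, "test_edit_invoice")

# Same base seed and test name give the same seed and the same input.
assert seed_a == derive_seed(42, "test_create_invoice")
assert sample_number(seed_a) == sample_number(seed_a)
# Different tests draw from different seeds.
assert seed_a != seed_b

# Logging the seed is what makes a failing run replayable later.
print(f"test_create_invoice seed={seed_a}")
```

Recording the derived seed next to each failure means a single test can be replayed without rerunning the whole suite.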
Baseline Failure Rate Analysis#
Before relying on the generated tests as a quality signal, I established a baseline failure rate through systematic stability analysis:
| Metric | Approach |
|---|---|
| Stability runs | Execute the full suite N times under identical conditions |
| Baseline pass rate | Calculate the mean pass rate across all stability runs |
| Failure classification | Categorize each failure as flaky, genuine bug, or environment-related |
| Regression detection | Flag any test whose pass rate drops below its established baseline between builds |
This baseline makes it possible to distinguish real regressions from noise, so that pipeline failures correspond to actual issues in the documentation or the software being tested.
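The baseline and regression check can be sketched in a few lines. This is an illustration only: the data layout (one dict of test name to pass/fail per run), the function names, the `tolerance` parameter, and the sample results are assumptions rather than the pipeline's actual format.

```python
from collections import defaultdict


def pass_rates(runs):
    """Return each test's pass rate across stability runs.

    `runs` is a list of dicts mapping test name -> True (pass) / False (fail).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for run in runs:
        for name, ok in run.items():
            total[name] += 1
            passed[name] += int(ok)
    return {name: passed[name] / total[name] for name in total}


def find_regressions(baseline, current, tolerance=0.0):
    """List tests whose current pass rate fell below baseline - tolerance."""
    return sorted(
        name
        for name, rate in current.items()
        if name in baseline and rate < baseline[name] - tolerance
    )


# Hypothetical data: three stability runs per build.
baseline_runs = [
    {"test_login": True, "test_export": True},
    {"test_login": True, "test_export": False},
    {"test_login": True, "test_export": True},
]
new_build_runs = [
    {"test_login": False, "test_export": True},
    {"test_login": True, "test_export": False},
    {"test_login": False, "test_export": True},
]

baseline = pass_rates(baseline_runs)
current = pass_rates(new_build_runs)
print(find_regressions(baseline, current))  # → ['test_login']
```

In this sample `test_export` is flaky in both builds, but its pass rate has not dropped, so only `test_login` is flagged.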
Output Metrics#
| Metric | Value |
|---|---|
| User manuals produced | 4 |
| HTML pages generated | 455 |
| PDF pages generated | 848 |
| Test files created | 119 |
| Total tests written | ~765 |
Efficiency Comparison#
| | First Manual | Remaining 3 Manuals |
|---|---|---|
| Time | ~1.5 months | ~4 hours total |
| Hours worked | ~80–120 hours | ~4 hours |
| Per manual | ~80–120 hours | ~1.3 hours each |
| Efficiency gain | n/a | ~20–30x faster |
The first manual required establishing all patterns, templates, configurations, and the pipeline itself. Once those were in place, subsequent manuals reused everything: the agent only needed section names to generate tests and documentation for each additional manual.
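The arithmetic behind the table can be checked directly. The sketch below uses only the approximate figures from the table; the variable names are illustrative. It suggests that the ~20–30x figure corresponds to comparing the first manual's hours with the total for the remaining three, while a per-manual comparison gives a larger ratio of roughly 60x to 90x.

```python
# Figures taken from the table above (approximate ranges).
first_manual_hours = (80, 120)  # low and high estimate
remaining_total_hours = 4
remaining_manuals = 3

per_manual_hours = remaining_total_hours / remaining_manuals

# Ratio of first-manual hours to the total for the remaining three.
total_ratio = tuple(h / remaining_total_hours for h in first_manual_hours)
# Ratio of first-manual hours to the hours for one later manual.
per_manual_ratio = tuple(h / per_manual_hours for h in first_manual_hours)

print(f"hours per later manual: {per_manual_hours:.1f}")
print(f"vs. remaining total: {total_ratio[0]:.0f}x to {total_ratio[1]:.0f}x")
print(f"vs. one later manual: {per_manual_ratio[0]:.0f}x to {per_manual_ratio[1]:.0f}x")
```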
Future Testing: Measuring Pipeline Efficiency#
Note
This section is planned. It outlines the experimental framework I intend to use for evaluating how much each pipeline stage contributes to output quality.
Stage Ablation#
Systematically remove individual pipeline stages and measure the effect on final output quality. Each configuration runs the full pipeline minus one stage:
| Configuration | Stage Removed | Question Answered |
|---|---|---|
| Baseline | None (full pipeline) | What does the complete pipeline produce? |
| A | Programmatic Normalization | How much does pre-cleaning matter? |
| B | Readability Audit | Does the readability pass improve downstream stages? |
| C | Style Guide Enforcement | Does style enforcement affect final accuracy? |
| D | Anti-Pattern Check | How much do accumulated anti-patterns catch? |
| E | Completeness Audit | What gets missed without side-by-side verification? |
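The leave-one-out configurations in the table could be generated rather than maintained by hand. A minimal sketch, assuming the stage names from the table and treating their execution order as an assumption; `ablation_configs` is a hypothetical helper.

```python
# Stage names taken from the ablation table; their order here is an assumption.
STAGES = [
    "Programmatic Normalization",
    "Readability Audit",
    "Style Guide Enforcement",
    "Anti-Pattern Check",
    "Completeness Audit",
]


def ablation_configs(stages):
    """Map a configuration label to the stages that stay enabled.

    "Baseline" keeps every stage; "A", "B", ... each drop exactly one.
    """
    configs = {"Baseline": list(stages)}
    for index, removed in enumerate(stages):
        label = chr(ord("A") + index)
        configs[label] = [s for s in stages if s != removed]
    return configs


for label, enabled in ablation_configs(STAGES).items():
    removed = [s for s in STAGES if s not in enabled] or ["None (full pipeline)"]
    print(f"{label}: removed {removed[0]}; {len(enabled)} stages run")
```

Generating the configurations from one stage list keeps the labels aligned with the table if a stage is later added or renamed.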
Output Accuracy Scoring#
- Score the final documentation against a ground-truth rubric (e.g., manually reviewed output)
- Possible dimensions: content completeness, structural correctness, image placement, style conformance
- Score each dimension independently to isolate where quality degrades
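Keeping the dimensions separate could look like the sketch below. The dimension names follow the list above; the 0.0 to 1.0 scale, the `RubricScore` structure, and the sample scores are assumptions made for illustration.

```python
from dataclasses import dataclass

# Dimension names follow the list above; the 0.0-1.0 scale is an assumption.
DIMENSIONS = (
    "content_completeness",
    "structural_correctness",
    "image_placement",
    "style_conformance",
)


@dataclass(frozen=True)
class RubricScore:
    """Per-dimension scores for one generated page (hypothetical structure)."""

    page: str
    scores: dict

    def weakest_dimension(self) -> str:
        """Return the dimension with the lowest score."""
        return min(DIMENSIONS, key=lambda d: self.scores[d])


def mean_by_dimension(results):
    """Average each dimension separately instead of collapsing to one number."""
    return {
        d: sum(r.scores[d] for r in results) / len(results) for d in DIMENSIONS
    }


# Hypothetical scores for two pages.
results = [
    RubricScore("page_a", dict(zip(DIMENSIONS, (1.0, 0.9, 0.5, 0.8)))),
    RubricScore("page_b", dict(zip(DIMENSIONS, (0.8, 0.9, 0.7, 1.0)))),
]

print(results[0].weakest_dimension())  # → image_placement
print(mean_by_dimension(results))
```

Reporting one mean per dimension shows which aspect of quality degrades in a given ablation, which a single aggregate score would hide.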
Progressive Complexity#
- Start with simple, well-structured pages and increase complexity (nested forms, multi-step workflows, edge-case UI patterns)
- Track where accuracy drops off; this identifies the pipeline’s complexity ceiling
- Measure whether additional stages recover accuracy at higher complexity levels
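Locating the complexity ceiling amounts to finding the last level at which accuracy is still acceptable. A minimal sketch, assuming integer complexity levels, a 0.0 to 1.0 accuracy scale, and an arbitrary 0.9 threshold; the function name and measurements are hypothetical.

```python
def complexity_ceiling(accuracy_by_level, threshold=0.9):
    """Return the highest level reached before accuracy falls below threshold.

    `accuracy_by_level` maps an integer complexity level to mean accuracy.
    Returns None if even the simplest level is below the threshold.
    """
    ceiling = None
    for level in sorted(accuracy_by_level):
        if accuracy_by_level[level] < threshold:
            break
        ceiling = level
    return ceiling


# Hypothetical measurements: level 1 = simple page, level 4 = edge-case UI.
measured = {1: 0.98, 2: 0.95, 3: 0.91, 4: 0.72}
print(complexity_ceiling(measured))  # → 3
```

Running the same function on results from different ablation configurations would show whether a given stage raises the ceiling.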
Statistical Validation#
- Compare each ablation configuration against the baseline using paired tests (e.g., paired t-test or Wilcoxon signed-rank, depending on distribution)
- Null hypothesis: removing a stage has no effect on output quality
- Reject the null at a significance level chosen in advance; a rejection is evidence that the stage contributes to output quality
- Report effect sizes alongside p-values to distinguish statistical significance from practical significance
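The comparison can be illustrated without external dependencies by using an exact sign-flip permutation test. This is a stand-in for the paired t-test or Wilcoxon signed-rank test named above, not the planned method; in practice a statistics library (for example SciPy's `scipy.stats.ttest_rel` or `scipy.stats.wilcoxon`) would likely be used instead. The function name and the scores are hypothetical.

```python
from itertools import product
from statistics import mean, stdev


def paired_permutation_test(baseline, ablated):
    """Exact two-sided sign-flip permutation test on paired scores.

    Under the null hypothesis each paired difference is equally likely to
    have either sign, so every sign assignment is enumerated (2**n cases).
    Needs at least two pairs whose differences are not all identical.
    Returns (mean difference, p-value, effect size d_z).
    """
    diffs = [b - a for b, a in zip(baseline, ablated)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = mean([s * d for s, d in zip(signs, diffs)])
        # Small tolerance guards against floating-point ties.
        if abs(flipped) >= observed - 1e-12:
            extreme += 1
    effect_size = mean(diffs) / stdev(diffs)  # Cohen's d for paired samples
    return mean(diffs), extreme / total, effect_size


# Hypothetical per-page accuracy scores, paired by page.
baseline_scores = [0.92, 0.88, 0.95, 0.90, 0.93, 0.89, 0.94, 0.91]
ablated_scores = [0.85, 0.84, 0.90, 0.88, 0.86, 0.85, 0.91, 0.83]

diff, p_value, d_z = paired_permutation_test(baseline_scores, ablated_scores)
print(f"mean difference={diff:.3f}, p={p_value:.4f}, d_z={d_z:.2f}")
```

Because every page scores higher under the baseline in this sample, only the all-positive and all-negative sign assignments are as extreme as the observed result, giving p = 2/256. The effect size is reported alongside it so that a small but consistent difference is not mistaken for a large one.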