
By Kotaro Andy · dev.to

# I Built a Claude Code Plugin for Formal Specifications and Ran a 30-Trial Evaluation. Here's What I Found.

## The Problem: LLMs Write Plausible Code That Misses Edge Cases

If you've used Claude Code (or any LLM-powered coding assistant) for non-trivial tasks, you've probably noticed a pattern: the generated code looks correct, passes the obvious test cases, and then breaks on edge cases that were never specified in the prompt.

This isn't a model quality issue; it's a specification problem. When you tell an LLM "build a bank transfer system," there are dozens of implicit rules the model has to guess at: What happens if the source and destination accounts are the same? Is a transfer of zero valid? Is the transfer atomic?

I built a Claude Code plugin called Formal Agent Contracts to address this. It guides the LLM through defining business rules in VDM-SL (a formal specification language) before writing any implementation code. The idea is that the act of formalizing constraints forces the model to confront ambiguities upfront rather than silently making assumptions.

The real question: does this actually improve output quality in a measurable way?

## Experiment Design

I designed a controlled evaluation comparing two conditions across 3 benchmark tasks × 5 trials each = 30 total runs.

### Conditions

| Group | Setup | Prompt |
|---|---|---|
| Control (A) | Claude Code only, no formal methods | "{task spec}. Implement in TypeScript with tests." |
| Treatment (B) | Claude Code + Formal Agent Contracts plugin | "Run the integrated workflow for {task spec}. Generate TypeScript." |

Both groups receive identical task specifications; the treatment group gets no extra information, just a different workflow that starts with spec formalization.

### Benchmark Tasks

Each task was designed with increasing complexity and a set of deliberately hidden "traps": edge cases not explicitly stated in the prompt that a thorough implementation should handle.

| Task | Complexity | Trap Count | Description |
|---|---|---|---|
| Bank Account | Low | 10 | Single agent. CRUD + transfers. Traps: transfer atomicity, zero-balance withdrawal, negative initial balance |
| Library System | Medium | 12 | 3 agents (Catalog, Member, Loan). Traps: inventory count consistency across agents, immediate re-borrow after overdue return, double extension prevention |
| Online Auction | High | 14 | State machine + time constraints + concurrency. Traps: cascading bid extensions, simultaneous bid priority, payment timeout boundaries, re-listing after cancellation |

### Metrics

Four metrics, each scored per trial:

- **M1 (Contract Violation Detection Rate):** For each hidden trap, score 0–3 based on whether the code handles it and whether a test covers it. CVDR = Σ(scores) / (traps × 3) × 100%
- **M2 (Specification Coverage):** What fraction of business rules have explicit validation in code (runtime checks, assertions, type constraints)? SC = Σ(rule_scores) / (rules × 3) × 100%
- **M4 (Specification Explicitness):** Are business rules captured in a machine-verifiable form (formal spec = 3, code assertion = 2, comment = 1, implicit = 0)?
- **M6 (Test Effectiveness):** Heuristic analysis of test code quality: edge case coverage, boundary value testing, negative test cases, test density.

Scoring was automated via keyword-matching heuristics against gold-standard trap definitions and business rule extractions. All raw scores are in the repo.

## Results

### Aggregate

| Metric | Control | Treatment | Δ | Cliff's δ | Effect Size |
|---|---|---|---|---|---|
| M1: Violation Detection | 52.0% ± 33.6 | 63.9% ± 22.5 | +11.8pp | 0.18 | small |
| M2: Spec Coverage | 39.1% ± 21.2 | 81.9% ± 26.9 | +42.8pp | 0.74 | large |
| M4: Spec Explicitness | 8.9% ± 15.3 | 100.0% ± 0.0 | +91.1pp | 1.00 | large |
| M6: Test Effectiveness | 66.7% ± 23.4 | 87.0% ± 8.5 | +20.3pp | 0.56 | large |

Three out of four metrics show large effect sizes (Cliff's δ ≥ 0.474). For context, a Cliff's δ of 0.74 on M2 means that in 87% of pairwise comparisons between control and treatment runs, the treatment run had higher spec coverage.

### The Complexity Scaling Effect

This was the most interesting finding.
The plugin's impact scales with task complexity:

| Complexity | M1 Δ | M2 Δ | M6 Δ |
|---|---|---|---|
| Low (Bank) | -3.3pp | +40.0pp | +13.5pp |
| Medium (Library) | +15.0pp | +62.2pp | +17.5pp |
| High (Auction) | +23.8pp | +26.2pp | +30.0pp |

For the low-complexity task, the control group already does well on M1: simple business rules are within the LLM's "intuition." But once you introduce multi-agent coordination (medium) or state machines with temporal constraints (high), the gap widens sharply.

The auction task is where it gets stark. The control group averaged 11.4% on violation detection vs. 35.2% for treatment. Edge cases like cascading bid extension limits, simultaneous bid resolution, and payment timeout boundaries were almost universally missed without formal spec guidance.

## What the Treatment Group Actually Does Differently

Looking at the generated artifacts, the mechanism is clear. The treatment group produces three files per run (`.vdmsl` + `.ts` + `.test.ts`) vs. two for control (`.ts` + `.test.ts`). The VDM-SL spec acts as an intermediate artifact that:

- Forces explicit enumeration of invariants, pre-conditions, and post-conditions before code generation
- Surfaces implicit requirements through the formalization process: when you try to write `inv balance >= 0` you immediately ask "what about the initial balance?"
- Generates more targeted tests, because the test cases are derived from spec violations rather than happy-path scenarios

For example, in the auction task, a typical control run might test `bid > currentPrice` but not `bid >= currentPrice + minIncrement(currentPrice)`. The treatment run, having defined `minBidIncrement` as a VDM-SL function with explicit boundary conditions, naturally produces tests for the 1% minimum and the ¥100 floor.

## Threats to Validity

I want to be upfront about the limitations.

**Internal validity:** The same Claude instance generated both conditions. Each run used intentional style variation, but the fundamental issue remains: this isn't a human-subjects experiment.
The treatment group's improvement could partly reflect the structured workflow rather than the formal specification per se.

**Scoring methodology:** M1 and M2 use heuristic keyword matching, not actual code execution. A function named `validateBalance` scores the same whether it's correctly implemented or not. M6 was originally designed to run gold-standard tests via Jest, but filesystem permission constraints forced a fallback to heuristic analysis of the test code.

**M4 is structurally biased:** The treatment group produces VDM-SL files by definition, which automatically earns the top explicitness score (3, formal spec). The +91.1pp delta on M4 is real but partially tautological. The more meaningful signal comes from M2 and M6, which measure actual code behavior.

**Scale:** Three tasks, five trials each. The effect sizes are large enough to be meaningful at n = 15 per condition, but replication with more tasks and different LLMs would strengthen the conclusions.

All raw data (30 directories of generated code, `scores.json`, the evaluation script, and gold-standard test suites) are in the repository for independent verification.

## How It Works in Practice

The plugin bundles 10 skills for Claude Code. The typical workflow:

You: "Run the integrated workflow for this auction system spec."

Claude (via plugin):

1. `define-contract` → generates VDM-SL from your natural-language description
2. `verify-spec` → runs VDMJ syntax/type checking and proof obligation generation
3. `smt-verify` → converts proof obligations to SMT-LIB and proves them with Z3
4. `generate-code` → scaffolds TypeScript/Python with runtime contract checks
5. (runs tests) → validates the generated code against the spec

For existing codebases, there's a reverse workflow (`extract-spec` → `refine-spec` → `reconcile-code`) that extracts provisional specs from code and refines them through dialogue.

You don't need to know VDM-SL. The conversation is in natural language; the plugin handles the translation.
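To make step 4 (`generate-code`) concrete, here's a hypothetical sketch of what scaffolded runtime contract checks could look like for the auction's minimum-bid rule. The name `minBidIncrement` and the 1% / ¥100 boundaries come from the evaluation discussion above; everything else is illustrative, not the plugin's actual output.

```typescript
// Hypothetical sketch: the auction's minimum-increment rule (1% of the
// current price with a ¥100 floor) enforced as a runtime pre-condition,
// mirroring a VDM-SL pre-condition rather than a bare `bid > currentPrice`.
function minBidIncrement(currentPrice: number): number {
  // Boundary condition: 1% of the current price, but never below ¥100.
  return Math.max(Math.ceil(currentPrice / 100), 100);
}

function placeBid(currentPrice: number, bid: number): number {
  const minimum = currentPrice + minBidIncrement(currentPrice);
  // Pre-condition check derived from the spec; violations fail loudly
  // instead of silently accepting an under-increment bid.
  if (bid < minimum) {
    throw new RangeError(`bid ${bid} is below the minimum ${minimum}`);
  }
  return bid; // the new current price
}
```

The point of the example is the shape of the check: the boundary (`>= currentPrice + increment`, not `> currentPrice`) is exactly the kind of condition the control runs tended to miss.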
## Try It

```
/plugin marketplace add kotaroyamame/formal-agent-contracts
```

Or clone and install locally:

```
git clone https://github.com/kotaroyamame/formal-agent-contracts.git
cd formal-agent-contracts
/plugin install .
```

Prerequisites: Java 11+ (for VDMJ) and, optionally, Z3 (`pip install z3-solver`) for automated proving.

The full evaluation data, protocol, and scoring scripts are under `eval/` in the repo. If you run your own evaluation or replicate these results, I'd love to hear about it.

- Repository: github.com/kotaroyamame/formal-agent-contracts
- Detailed evaluation report: iid.systems/formal-agent-contracts/evaluation
