Why this is hard to get right
The Legacy Code Testing Problem Nobody Warns You About
Marcus is a senior backend engineer at a fintech startup. His team inherited a 6-year-old payments module: 1,200 lines of Python with zero tests, and the one developer who understood it has since left the company. Leadership wants the module refactored before a major platform migration. The rule: don't break payments.
Marcus knows he needs tests before touching anything. But the code is tangled. Functions call external APIs, touch the database directly, perform timezone conversions without comments, and retry failed transactions using logic that only makes sense if you read a Slack thread from 2019.
His first attempt with an AI assistant was a disaster. He pasted the function and typed: "Write unit tests for this." The AI produced something that looked plausible — but it imported a real database connection, called a live Stripe endpoint, and tested the happy path only. The tests passed locally and failed immediately in CI. Worse, they gave false confidence: the riskiest branches (retry overflow, timezone edge cases) had zero coverage.
Marcus spent three hours fixing AI output that should have taken thirty minutes to generate correctly.
The problem wasn't the AI. It was the prompt.
He had given the model no framework, no constraints, no coverage target, and no signal about where the real risks lived. The AI made reasonable assumptions — and every assumption was wrong for his context.
On his second attempt, Marcus built a structured prompt. He specified pytest and unittest.mock, named the file and its known risk areas, set an 85% branch coverage target, explicitly forbade real network and database calls, and asked the AI to produce a test plan before writing any code.
The results were dramatically different. The AI proposed a test plan with 14 named cases, organized by risk tier. It used fakes for the HTTP client and patched the clock for timezone tests. The AAA structure made each test readable. The coverage checklist flagged two untested branches Marcus hadn't noticed.
The whole suite ran hermetically in CI in under four seconds.
What changed was specificity. A well-crafted prompt doesn't just save time — it forces you to articulate what "good" means before the AI starts. That act of articulation — naming your risks, constraints, and success criteria — is itself valuable engineering work. The prompt becomes a requirements document for your test suite, not just a query.
Common mistakes to avoid
Pasting Code Without Naming Risk Areas
When you paste a function without flagging known risks — timezone math, retry logic, rounding — the AI tests what's structurally obvious, not what's actually dangerous. You get high line coverage but miss the branches that cause production incidents. Always list 3-5 specific risk areas explicitly in your prompt so the AI prioritizes edge cases that matter.
Omitting the Mocking Policy
Without explicit constraints, AI-generated tests will often make real HTTP calls, open database connections, or read from the filesystem. These tests pass locally and fail in CI, or worse, they hit production systems. Specify your mocking policy directly: forbid real I/O, name the libraries to use (unittest.mock, pytest-mock), and require fakes for external dependencies.
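A minimal sketch of what a test written under such a policy can look like. `fetch_exchange_rate` and its `http_client` parameter are hypothetical stand-ins for legacy code that receives its HTTP dependency as an argument; the fake is built with `unittest.mock.Mock`, so no socket is opened.

```python
from unittest.mock import Mock


def fetch_exchange_rate(http_client, currency: str) -> float:
    """Hypothetical legacy function: asks an HTTP client for a rate."""
    response = http_client.get(f"/rates/{currency}")
    return response.json()["rate"]


def test_fetch_exchange_rate_reads_rate_from_fake_client():
    # Arrange: a fake client that returns a canned response, no network involved.
    fake_response = Mock()
    fake_response.json.return_value = {"rate": 1.25}
    fake_client = Mock()
    fake_client.get.return_value = fake_response

    # Act
    rate = fetch_exchange_rate(fake_client, "EUR")

    # Assert: the value came from the fake, and exactly one request was made.
    assert rate == 1.25
    fake_client.get.assert_called_once_with("/rates/EUR")
```

Because the fake records every call, the test also documents which endpoint the code is expected to hit.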
Skipping the Test Plan Request
Asking for code immediately produces tests without a reasoning layer. The AI skips boundary analysis and jumps to implementation, missing whole categories of cases. Ask for a test plan first (named cases, boundaries, fixtures, and mocking strategy), then generate code. This separates thinking from writing and produces suites that are auditable and more complete.
Not Specifying the Coverage Target
Saying 'good coverage' is meaningless. Without a numeric target like 85% branch coverage, the AI optimizes for volume of tests rather than meaningful coverage. You end up with 20 tests that cover the same happy path. State a specific percentage and clarify whether you mean line coverage, branch coverage, or mutation score.
Forgetting to Restrict Source Modification
AI assistants frequently suggest refactoring the source code to make it more testable — adding seams, extracting functions, or changing signatures. This is dangerous in legacy systems where side effects are unknown. Explicitly state that the source file must not be modified and that all testability must come through mocking and dependency injection at the call site.
Using Generic Role Instructions
A prompt that says 'act as a developer' produces generic output. Specifying 'Senior Software Test Engineer' signals that the AI should apply professional testing standards: AAA structure, meaningful assertion messages, fixture reuse, and coverage analysis. Role specificity shifts the output from beginner-level examples to production-quality test suites.
The transformation
Before prompt:
Write unit tests for this old module. Aim for good coverage.
After prompt:
Role: **Senior Software Test Engineer**
Goal: Generate a maintainable unit test suite for a legacy module.
Context:
1) Language/Framework: Python 3.11, pytest, unittest.mock
2) Target file: payments/processor.py (paste code below)
3) Coverage: minimum 85% lines/branches
4) Constraints: don’t modify source; avoid network/DB calls; use fakes
5) Risks: timezone math, retry logic, rounding errors
Instructions:
- Propose a test plan first (cases, boundaries, mocks, fixtures).
- Write pytest tests with clear names and AAA structure.
- Include examples for success, failure, and edge inputs.
- Show how to mock HTTP and clock.
- Provide a coverage checklist and next steps.
Code: [paste processor.py here]
Why this works
Role Primes Quality Standards
The After Prompt opens with 'Senior Software Test Engineer' as the role. This single instruction shifts the AI's output register from tutorial-style examples to professional testing practices — AAA structure, fixture reuse, meaningful assertion messages. Without this, the AI defaults to the simplest interpretation of 'write tests,' which is rarely production-ready.
Explicit Constraints Eliminate Guessing
The After Prompt's constraints section ('don't modify source; avoid network/DB calls; use fakes') removes the AI's most common failure modes for legacy code. AI models default to the path of least resistance; without these constraints, they'll suggest source changes or real I/O. Explicit rules make hermetic, CI-safe tests the expected outcome rather than a matter of luck.
Named Risks Focus Edge Case Generation
The After Prompt lists 'timezone math, retry logic, rounding errors' under risks. This is the highest-leverage instruction in the entire prompt. AI models respond to specificity here — named risks produce named test cases targeting those exact failure modes, rather than generic boundary tests that miss the actual defects.
Sequential Instructions Separate Thinking from Writing
The instructions section starts with 'Propose a test plan first' before requesting code. This forces a reasoning step — the AI identifies cases, boundaries, and mocking needs before writing a single line of code. The result is a structured, auditable plan that catches omissions before implementation begins.
Measurable Success Criteria Anchor Output
'Minimum 85% lines/branches' gives the AI a concrete target to optimize toward. Vague goals like 'good coverage' produce inconsistent results. A numeric threshold causes the model to include a coverage checklist, flag untested paths, and suggest next steps — outputs that are directly actionable for an engineering team.
The framework behind the prompt
The Theory Behind Effective Test Generation Prompts
Unit testing legacy code sits at the intersection of two disciplines: software testing theory and prompt engineering. Understanding both helps you write prompts that produce genuinely useful output.
Why Legacy Code Is Different
Legacy systems resist standard test-driven development (TDD) because TDD assumes you write tests before code. Michael Feathers' foundational work in Working Effectively with Legacy Code defines legacy code precisely as code without tests — and notes that making it testable often requires identifying seams: points where behavior can be substituted without changing production logic. A good test prompt explicitly names these seams (the HTTP client, the clock, the database layer) and instructs the AI to mock at those exact points.
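As an illustration of mocking at a seam, here is a sketch that treats the clock as the substitution point. `settlement_date` is a hypothetical legacy function that reads the wall clock through `time.time()`; the test replaces that one call with `unittest.mock.patch` and leaves the function body untouched.

```python
import time
from datetime import datetime, timezone
from unittest import mock


def settlement_date() -> str:
    """Hypothetical legacy function: reads the wall clock directly."""
    now = datetime.fromtimestamp(time.time(), tz=timezone.utc)
    return now.date().isoformat()


def test_settlement_date_one_second_before_midnight_utc():
    # Arrange: freeze the clock seam at 23:59:59 UTC.
    frozen = datetime(2024, 3, 9, 23, 59, 59, tzinfo=timezone.utc).timestamp()
    # Act
    with mock.patch("time.time", return_value=frozen):
        result = settlement_date()
    # Assert
    assert result == "2024-03-09"


def test_settlement_date_at_midnight_utc_rolls_over():
    # Arrange: one second later, the UTC date changes.
    frozen = datetime(2024, 3, 10, 0, 0, 0, tzinfo=timezone.utc).timestamp()
    # Act
    with mock.patch("time.time", return_value=frozen):
        result = settlement_date()
    # Assert
    assert result == "2024-03-10"
```

Naming the seam in the prompt ("patch `time.time`") is what steers the AI toward this shape instead of tests that depend on when they happen to run.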
Coverage as a Specification Language
Coverage metrics — line, branch, and mutation — function as a specification language in prompts. Branch coverage is more meaningful than line coverage for legacy code because it requires every conditional path to be exercised. Mutation testing, pioneered by DeMillo, Lipton, and Sayward in 1978, goes further: it measures whether your assertions are strong enough to detect artificial bugs. Including mutation score in a prompt shifts AI output from volume-optimized tests to assertion-quality tests.
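The gap between line and branch coverage shows up even in a very small example. In this sketch, `apply_fee` is a hypothetical function with a single `if` and no `else`; the first test alone executes every line, yet one outcome of the condition is never exercised until the second test is added.

```python
def apply_fee(amount_cents: int, is_international: bool) -> int:
    """Hypothetical function with one conditional and no else branch."""
    fee = 0
    if is_international:
        fee = 150
    return amount_cents + fee


def test_apply_fee_international_adds_surcharge():
    # Runs every line of apply_fee, so line coverage is already 100%.
    assert apply_fee(1000, True) == 1150


def test_apply_fee_domestic_adds_no_surcharge():
    # Exercises the path where the `if` body is skipped. Only with this
    # test are both outcomes of the condition covered.
    assert apply_fee(1000, False) == 1000
```

A prompt that asks for branch coverage gives the AI a reason to write the second test; a prompt that asks only for line coverage does not.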
The AAA Pattern as a Cognitive Framework
The Arrange-Act-Assert (AAA) pattern, popularized by Bill Wake, does more than organize test code — it forces clarity about what a function is supposed to do. When you require AAA structure in a prompt, you're asking the AI to separate setup (Arrange), execution (Act), and verification (Assert) explicitly. This prevents a common AI failure mode: tests that act and assert but never properly isolate the system under test.
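A short sketch of the three phases, with isolation handled in Arrange. `InvoiceRepository` and `outstanding_balance` are hypothetical names invented for illustration; `create_autospec` builds a stand-in that accepts only the calls the real class would accept.

```python
from unittest.mock import create_autospec


class InvoiceRepository:
    """Hypothetical database-backed dependency."""

    def unpaid_amounts(self, customer_id: int) -> list[int]:
        raise RuntimeError("real database access is not allowed in unit tests")


def outstanding_balance(repo: InvoiceRepository, customer_id: int) -> int:
    """Hypothetical system under test: sums unpaid invoice amounts in cents."""
    return sum(repo.unpaid_amounts(customer_id))


def test_outstanding_balance_sums_unpaid_invoices():
    # Arrange: isolate the function from the database with a specced fake.
    repo = create_autospec(InvoiceRepository, instance=True)
    repo.unpaid_amounts.return_value = [1200, 350]

    # Act: one call to the system under test.
    balance = outstanding_balance(repo, customer_id=7)

    # Assert: check the result and the interaction with the dependency.
    assert balance == 1550
    repo.unpaid_amounts.assert_called_once_with(7)
```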
Prompt Engineering Principles at Work
Few-Shot Prompting — providing one working test as a reference — dramatically reduces structural errors. Chain-of-Thought — requesting a test plan before code — improves coverage completeness by forcing a reasoning step before implementation. Role Prompting elevates output register by activating domain-specific knowledge associated with a professional identity. All three apply directly to test generation prompts.
Prompt variations
Role: Senior JavaScript Test Engineer
Goal: Generate a complete Jest test suite for a legacy Node.js service module with no existing tests.
Context:
- Language/Framework: Node.js 18, Jest 29, jest.mock for dependencies
- Target file: src/billing/invoiceGenerator.js (code pasted below)
- Coverage target: 80% branch coverage measured by Jest's --coverage flag
- Constraints: do not modify source; no real filesystem or network calls; mock all external modules
- Known risks: date arithmetic using moment.js, PDF generation timeout handling, currency rounding to two decimal places
Instructions:
- Start with a test plan listing all cases, mocked modules, and fixture data needed.
- Write Jest tests using describe/it blocks with clear, readable names.
- Use beforeEach for setup, afterEach for teardown, and jest.spyOn where appropriate.
- Cover success paths, error branches, and the three named risk areas explicitly.
- End with a coverage gap analysis and suggested next test targets.
Code: [paste invoiceGenerator.js here]
Role: Senior Java Test Engineer specializing in Spring Boot applications
Goal: Write a JUnit 5 unit test suite for a legacy Spring service class, isolating it from real database and HTTP dependencies.
Context:
- Language/Framework: Java 17, JUnit 5, Mockito 5, Spring Boot 3
- Target class: OrderFulfillmentService.java in the order-service microservice
- Coverage target: 85% line coverage, verified by JaCoCo
- Constraints: no actual database calls; use @Mock and @InjectMocks; do not modify production source
- Known risks: optimistic locking exceptions, event publishing failures, null-safety gaps in legacy DTOs
Instructions:
- Produce a test plan first: list test method names, the scenario each covers, and which dependencies to mock.
- Write @Test methods using the Arrange/Act/Assert pattern with descriptive display names via @DisplayName.
- Mock all repository and event publisher dependencies with Mockito.
- Include tests for the three named risk areas with explicit assertions on exception types and messages.
- Provide a JaCoCo configuration snippet and a checklist of uncovered branches.
Class file: [paste OrderFulfillmentService.java here]
Role: Senior Frontend Test Engineer
Goal: Generate a Vitest unit test suite for a legacy TypeScript utility library used across a React application.
Context:
- Language/Framework: TypeScript 5, Vitest 1.x, vi.mock for module mocking
- Target file: src/utils/formatters.ts — currency, date, and string formatting functions
- Coverage target: 90% line and branch coverage using Vitest's built-in v8 coverage provider
- Constraints: pure function tests only; no DOM interaction; no external API calls; do not alter source types
- Known risks: locale-sensitive number formatting, daylight saving time edge cases in date formatters, handling of null and undefined inputs
Instructions:
- Begin with a test plan organized by function, listing normal cases, boundary values, and known risk scenarios.
- Write tests using describe/it blocks; group by exported function name.
- Use vi.setSystemTime to control dates deterministically.
- Include parametrized tests using test.each for locale and null/undefined variants.
- Close with a coverage summary and a list of any functions that may need source-level fixes to reach the coverage target.
File: [paste formatters.ts here]
Role: Senior Software Test Engineer conducting a legacy code audit
Goal: Analyze an unfamiliar legacy module and produce both a risk assessment and a starter test suite.
Context:
- Language/Framework: Python 3.10, pytest, unittest.mock
- Target file: utils/data_transformer.py — purpose unknown, no documentation
- Coverage target: achieve 70% branch coverage as a first pass; identify gaps for a second pass
- Constraints: read source only; do not modify; no real I/O; infer behavior from code structure
Instructions:
- First, produce a code audit summary: what does this module appear to do, what are its dependencies, and where are the highest-risk sections?
- Second, list 10 to 15 test cases ranked by risk, with a one-sentence rationale for each.
- Third, write pytest tests for the top 8 cases using AAA structure and unittest.mock.
- Flag any functions where behavior is ambiguous and recommend clarification from the original team.
- Provide a coverage estimate and a list of branches that require domain knowledge to test safely.
Code: [paste data_transformer.py here]
When to use this prompt
Engineering Teams Refactoring Legacy Services
Increase safety before refactors by generating tests around fragile modules. Catch regressions without changing production code.
Product Managers Requiring Release Confidence
Set measurable coverage targets on critical paths and get auditable test plans that support go/no-go decisions.
Customer Success Reproducing Edge Bugs
Create focused tests for reported issues (timezones, retries) to validate fixes and prevent recurrences.
Researchers Benchmarking Code Quality
Generate consistent test suites to compare module reliability and mutation score improvements across versions.
DevOps Owners Hardening CI Pipelines
Add deterministic, mocked tests that run fast in CI, improving feedback loops and deployment confidence.
Pro tips
1. Specify mutation testing or branch coverage if quality matters more than line count.
2. List concrete risk areas (e.g., time math, retries, floating-point) to focus edge cases.
3. Provide interface contracts or sample inputs/outputs to clarify expected behavior.
4. State mocking policies (e.g., forbid real I/O, use fakes for HTTP/clock) to keep tests reliable.
Once you've mastered basic coverage-driven prompting, two advanced techniques sharpen output quality significantly.
Mutation Testing Instructions
Line and branch coverage tell you how much code runs during tests, not whether the tests actually catch bugs. Mutation testing tools like mutmut (Python), Stryker (JavaScript/TypeScript), and PIT (Java) introduce small code changes and check that at least one test fails for each change. Add this to your prompt:
'After generating tests, identify which assertions are most likely to catch mutations in the three named risk areas. Flag any tests that assert on return values only without asserting on side effects or state changes.'
This produces tests with stronger assertions rather than just wider coverage.
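To make "catching a mutation" concrete, here is a hand-rolled sketch that does not use any mutation tool. `can_retry` is a hypothetical function, and `can_retry_mutant` is a manually written copy with `<` changed to `<=`, the kind of small change such a tool might generate.

```python
def can_retry(attempt: int, max_attempts: int = 3) -> bool:
    """Hypothetical original: retry while attempts remain."""
    return attempt < max_attempts


def can_retry_mutant(attempt: int, max_attempts: int = 3) -> bool:
    """Hand-written mutant: `<` changed to `<=`."""
    return attempt <= max_attempts


def weak_test(fn) -> None:
    # Far from the boundary: original and mutant behave the same here.
    assert fn(1) is True


def strong_test(fn) -> None:
    # Asserts on both sides of the boundary, where the mutant differs.
    assert fn(2) is True
    assert fn(3) is False


def passes(test, fn) -> bool:
    """Return True if the test passes against the given implementation."""
    try:
        test(fn)
    except AssertionError:
        return False
    return True
```

Here `passes(weak_test, can_retry_mutant)` is true, so the mutant survives the weak test, while `passes(strong_test, can_retry_mutant)` is false, so the strong test kills it. Both tests pass against the original.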
Behavioral Contracts as Prompt Input
If you have any documentation — even a comment, a Slack message, or an old ticket — paste it as a behavioral contract. Example:
'Known behavior from ticket PLAT-412: retry logic should attempt exactly 3 times with exponential backoff, and on final failure should publish a FailedPaymentEvent. Treat this as the specification for tests in the retry section.'
This anchors the AI to intended behavior, not just structural inference from code. It produces tests that document intent, which is the most durable value of a unit test suite in a legacy system.
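A sketch of a test derived from a contract like the one quoted above. Everything here is hypothetical: `charge_with_retry`, its injected `gateway`, `publisher`, and `sleep` parameters, and the backoff delays of 1 and 2 seconds, which the ticket text does not specify.

```python
from unittest.mock import Mock, call


def charge_with_retry(gateway, publisher, sleep,
                      max_attempts: int = 3, base_delay: float = 1.0) -> bool:
    """Hypothetical stand-in for the legacy retry logic."""
    for attempt in range(max_attempts):
        if gateway.charge():
            return True
        if attempt < max_attempts - 1:
            # Exponential backoff: 1s, 2s, ... (assumed values).
            sleep(base_delay * 2 ** attempt)
    publisher.publish("FailedPaymentEvent")
    return False


def test_charge_with_retry_gives_up_after_three_attempts_and_publishes_event():
    # Arrange: a gateway that always fails, plus fakes for events and sleeping.
    gateway = Mock()
    gateway.charge.return_value = False
    publisher = Mock()
    sleep = Mock()

    # Act
    succeeded = charge_with_retry(gateway, publisher, sleep)

    # Assert: exactly 3 attempts, doubling delays, one failure event.
    assert succeeded is False
    assert gateway.charge.call_count == 3
    assert sleep.call_args_list == [call(1.0), call(2.0)]
    publisher.publish.assert_called_once_with("FailedPaymentEvent")
```

Each assertion maps to one clause of the ticket, so a failure points directly at the part of the contract that was broken.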
The core prompt structure works across industries, but regulated environments need specific additions.
Financial Services
Add rounding mode to your risk section explicitly: in Python's decimal module, ROUND_HALF_UP and ROUND_HALF_EVEN produce different results whenever a value falls exactly on a rounding midpoint, and those differences accumulate at scale. Include a constraint requiring assertion precision: 'Assert currency values to exactly two decimal places using Decimal comparison, not float equality.' This prevents a common class of rounding-related false positives in tests.
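The difference between the two rounding modes, and the reason to avoid float equality, can be sketched with the standard `decimal` module. `to_cents_precision` is a hypothetical helper introduced only for this example.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_cents_precision(amount: Decimal, rounding: str) -> Decimal:
    """Hypothetical helper: round an amount to two decimal places."""
    return amount.quantize(CENT, rounding=rounding)


def test_rounding_mode_changes_result_on_exact_midpoint():
    amount = Decimal("0.125")  # exactly halfway between 0.12 and 0.13
    assert to_cents_precision(amount, ROUND_HALF_UP) == Decimal("0.13")
    assert to_cents_precision(amount, ROUND_HALF_EVEN) == Decimal("0.12")


def test_result_has_exactly_two_decimal_places():
    result = to_cents_precision(Decimal("19.999"), ROUND_HALF_UP)
    assert result == Decimal("20.00")
    # Decimal("20") == Decimal("20.00"), so check the exponent as well.
    assert result.as_tuple().exponent == -2


def test_float_equality_is_unreliable_for_currency():
    assert 0.1 + 0.2 != 0.3  # binary floats cannot represent these exactly
    assert Decimal("0.10") + Decimal("0.20") == Decimal("0.30")
```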
Healthcare (HIPAA-Adjacent Systems)
For systems handling patient identifiers or PHI, add: 'Use synthetic test data only — no real names, dates of birth, or identifiers in fixtures. Use a deterministic fake data generator.' This keeps test files safe for code review, CI logs, and version control.
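One way to read "deterministic fake data generator" is a seeded generator from the standard library. The sketch below assumes a hypothetical record shape (`patient_id`, `name`, `date_of_birth`); a real fixture would mirror whatever fields the module under test expects.

```python
import random
from datetime import date, timedelta


def make_synthetic_patient(seed: int) -> dict:
    """Hypothetical fixture factory: same seed, same synthetic record."""
    rng = random.Random(seed)  # local generator, independent of global state
    return {
        # The TEST- prefix makes it obvious in logs that this is not real data.
        "patient_id": f"TEST-{rng.randrange(1_000_000):06d}",
        "name": f"Synthetic Patient {rng.randrange(1000):03d}",
        "date_of_birth": date(1950, 1, 1) + timedelta(days=rng.randrange(20_000)),
    }


def test_synthetic_patient_is_deterministic_and_marked_as_test_data():
    # Arrange / Act: build the same record twice from the same seed.
    first = make_synthetic_patient(seed=42)
    second = make_synthetic_patient(seed=42)

    # Assert: reproducible, and clearly labeled as synthetic.
    assert first == second
    assert first["patient_id"].startswith("TEST-")
    assert first["name"].startswith("Synthetic Patient")
```

Seeding per record keeps failures reproducible: the seed in a failing test is enough to rebuild the exact fixture.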
DevOps / CI Pipeline Ownership
If you own the CI configuration, add a performance constraint: 'Each test must complete in under 200ms. Flag any test requiring setup that likely exceeds this threshold.' Slow test suites get disabled by developers under deadline pressure. Speed constraints prompt the AI to prefer lightweight fakes over heavy fixture setup, keeping your CI feedback loop fast.
Use this checklist before sending your prompt to any AI assistant. Missing items are the most common cause of unusable output.
Code Context
- Pasted the target file or the specific functions to test
- Named any interfaces, base classes, or type definitions the target depends on
- Included one passing test file as a style reference (optional but high-value)
Technical Specification
- Language and exact version specified (e.g., Python 3.11, not just 'Python')
- Test framework and version named (e.g., pytest 7.4, not just 'pytest')
- Mocking library named explicitly
Constraints
- Source modification explicitly forbidden
- Real I/O prohibition stated
- Mocking policy defined
Quality Targets
- Numeric coverage target specified (line, branch, or mutation)
- Test naming convention or structure stated (AAA, describe/it, etc.)
- Risk areas listed (at least 3 specific ones)
Output Structure
- Test plan requested before code
- Coverage checklist or gap analysis requested
- Next steps or flagged unknowns requested
If you can check every item, your prompt is ready. If you're missing more than three items, expect to iterate at least twice before the output is usable.
When not to use this prompt
When This Prompt Pattern Is Not the Right Tool
This prompt works well for isolated, deterministic modules with clear inputs and outputs. It is less appropriate in these situations:
When behavior is genuinely unknown. If you cannot infer what the function is supposed to do from code alone, tests will assert on current behavior rather than correct behavior. Use the Exploratory Audit variation (the legacy code audit prompt under Prompt variations) first to surface ambiguities, then get human clarification before generating a full suite.
When the module has no testable seams. Some legacy code makes global state mutations, spawns threads, or writes to files in ways that cannot be intercepted without source changes. In this case, start with an integration or smoke test approach rather than forcing unit tests onto untestable code.
When you need property-based or fuzzing tests. This prompt generates example-based tests. For functions requiring exhaustive input space exploration — parsers, serializers, cryptographic utilities — a separate prompt targeting tools like Hypothesis (Python) or fast-check (JavaScript) produces better results.
When you're testing UI components or end-to-end flows. This pattern targets logic that can be isolated from rendering, such as server-side modules and pure utility functions. For frontend component testing or E2E test generation, use a prompt that addresses the specific rendering lifecycle and interaction model of your framework.
Troubleshooting
Generated tests import modules that don't exist in my project
Paste your project's directory structure (a two-level tree is enough) and one existing passing test file in the prompt. Say explicitly: 'Match the import paths and conftest.py patterns shown in this reference test.' The AI infers import structure from context — without a reference, it invents plausible-looking but incorrect paths.
All generated tests cover only the happy path — no failure or edge cases
Add this instruction explicitly: 'For each test case, generate at minimum one success variant, one failure variant, and one boundary or edge case variant.' Also list your specific risk areas (e.g., null inputs, overflow values, timezone offsets) by name. Without explicit edge case instructions, AI models default to the most obvious positive flow.
Tests use real network calls despite my mocking instruction
Move your mocking constraint to the first line of the instructions section, not the context section. Also name the specific library method to patch: 'Patch requests.Session.send using unittest.mock.patch, not the module-level requests.get.' Specificity in the patch target prevents the AI from choosing an ineffective or incorrect mock point.
Test names are generic (test_1, test_function_works) and don't describe behavior
Include a naming convention example in your prompt: 'Name all tests using the pattern test_[function]_[condition]_[expected_outcome], for example: test_calculate_total_with_zero_quantity_returns_zero.' One concrete example overrides generic naming habits more reliably than a description alone.
The AI modifies the source code instead of writing tests around it
Add this sentence to your constraints in bold or as the final constraint: 'Under no circumstances modify, refactor, or suggest changes to the source file.' If the AI still suggests changes, add a follow-up: 'If a function cannot be tested without modification, flag it in a separate section called UNTESTABLE FUNCTIONS and explain why — do not change the source.'
How to measure success
How to Evaluate AI-Generated Test Output
Don't accept the first output uncritically. Use this checklist to measure quality before adding tests to your codebase.
Structural Quality
- Every test follows AAA — you can identify Arrange, Act, and Assert sections clearly
- Test names describe behavior, not function names (e.g., test_retry_fails_after_three_attempts, not test_retry)
- No test method exceeds 25 lines — long tests usually assert on too many things at once
Coverage and Risk
- All named risk areas appear as explicit test cases in the output
- Both success and failure paths exist for each major function
- Boundary values are tested — off-by-ones, empty inputs, zero values, maximum values
Technical Correctness
- No real I/O occurs — verify by reading mock setup, not just the test body
- Tests are independent — no test depends on another test's side effects
- The coverage checklist identifies untested branches, not just reports passing tests
Maintainability
- Fixtures are reused across related tests rather than duplicated inline
- Assertion messages explain failures — you can read a failing test and know what went wrong without opening the source
Frequently asked questions
Paste the full target file if it's under 300 lines. For larger files, paste the specific class or functions you want tested, plus any interfaces or type definitions they depend on. Include just enough context for the AI to understand inputs, outputs, and dependencies — but not so much that it dilutes focus. For files over 500 lines, consider splitting into separate prompts per logical section.
Yes, but adjust two things. First, change the constraints section — remove the 'no real I/O' rule and specify which real dependencies are allowed (e.g., a test database, a local mock server). Second, change the role to reflect integration testing. The rest of the structure — test plan first, named risks, coverage target — applies equally well to integration test generation.
This usually means the AI made assumptions about your project structure or import paths. Fix it by adding one working test file as a reference example in your prompt. Say: 'Here is an existing passing test for reference — match its import style, fixture setup, and conftest.py usage.' One concrete example eliminates most structural errors in generated tests.
Replace '85% lines/branches' with your team's actual standard. If you track mutation score, say so explicitly: 'Achieve a mutation score of 70% measured by mutmut.' If you're doing a first pass on completely untested code, a lower initial target like 60% with a gap analysis is more realistic. The model follows whatever metric you specify — be precise about the measurement method.
Add this exact sentence to your constraints: 'Do not suggest changes to the source file under any circumstances.' Legacy code often has poor seams, and AI models instinctively recommend improvements. You need to suppress this. If the source truly cannot be tested without modification, ask the AI to flag those specific functions separately rather than interspersing refactoring suggestions throughout the test code.
No. Name the framework and version — that's sufficient. AI models have strong knowledge of pytest, JUnit, Jest, and Vitest up to their training cutoff. If you're using an unusual plugin or a custom test utility, paste a short example of its usage. Overcrowding the prompt with docs reduces the model's focus on your actual test requirements.
Add a specific instruction to your prompt: 'This module uses global state — include teardown fixtures that reset state between tests.' Name the specific globals or class variables if you know them. Ask the AI to wrap each test in appropriate setup/teardown and to flag any globals it cannot safely reset without source modification. This surfaces hidden coupling early.