Unit Test Suite Generation for Legacy Code AI Prompt

Writing unit tests for legacy code is hard. You’re missing docs, function behavior is unclear, and the risk of breaking production is high. You need targeted tests that boost coverage without rewriting half the codebase. A strong prompt turns your scattered notes into a precise testing plan: what to test, how to mock, and which edge cases matter.

AskSmarter.ai guides you with clarifying questions—language, framework, coverage goals, constraints, risk areas—then builds a structured prompt that yields maintainable tests on the first pass. You’ll spend less time re-running AI guesses and more time shipping safe improvements.

Use this prompt to generate a focused test suite that increases confidence, documents intent, and prevents regressions—especially when you can’t change the original code yet.

Intermediate · 9 min read

Why this is hard to get right

The Legacy Code Testing Problem Nobody Warns You About

Marcus is a senior backend engineer at a fintech startup. His team inherited a 6-year-old payments module: 1,200 lines of Python and zero tests, and the one developer who understood it has since left the company. Leadership wants the module refactored before a major platform migration. The rule: don't break payments.

Marcus knows he needs tests before touching anything. But the code is tangled. Functions call external APIs, touch the database directly, perform timezone conversions without comments, and retry failed transactions using logic that only makes sense if you read a Slack thread from 2019.

His first attempt with an AI assistant was a disaster. He pasted the function and typed: "Write unit tests for this." The AI produced something that looked plausible — but it imported a real database connection, called a live Stripe endpoint, and tested the happy path only. The tests passed locally and failed immediately in CI. Worse, they gave false confidence: the riskiest branches (retry overflow, timezone edge cases) had zero coverage.

Marcus spent three hours fixing AI output that should have taken thirty minutes to generate correctly.

The problem wasn't the AI. It was the prompt.

He had given the model no framework, no constraints, no coverage target, and no signal about where the real risks lived. The AI made reasonable assumptions — and every assumption was wrong for his context.

On his second attempt, Marcus built a structured prompt. He specified pytest and unittest.mock, named the file and its known risk areas, set an 85% branch coverage target, explicitly forbade real network and database calls, and asked the AI to produce a test plan before writing any code.

The results were dramatically different. The AI proposed a test plan with 14 named cases, organized by risk tier. It used fakes for the HTTP client and patched the clock for timezone tests. The AAA structure made each test readable. The coverage checklist flagged two untested branches Marcus hadn't noticed.

The whole suite ran hermetically in CI in under four seconds.

What changed was specificity. A well-crafted prompt doesn't just save time — it forces you to articulate what "good" means before the AI starts. That act of articulation — naming your risks, constraints, and success criteria — is itself valuable engineering work. The prompt becomes a requirements document for your test suite, not just a query.

Common mistakes to avoid

  • Pasting Code Without Naming Risk Areas

    When you paste a function without flagging known risks — timezone math, retry logic, rounding — the AI tests what's structurally obvious, not what's actually dangerous. You get high line coverage but miss the branches that cause production incidents. Always list 3-5 specific risk areas explicitly in your prompt so the AI prioritizes edge cases that matter.

  • Omitting the Mocking Policy

    Without explicit constraints, AI-generated tests will often make real HTTP calls, open database connections, or read from the filesystem. These tests pass locally and fail in CI, or worse, they hit production systems. Specify your mocking policy directly: forbid real I/O, name the libraries to use (unittest.mock, pytest-mock), and require fakes for external dependencies.

  • Skipping the Test Plan Request

    Asking for code immediately produces tests without a reasoning layer. The AI skips boundary analysis and jumps to implementation, missing whole categories of cases. Ask for a test plan first — named cases, boundaries, fixtures, and mocking strategy — then generate code. This separates thinking from writing and produces auditable, complete suites.

  • Not Specifying the Coverage Target

    Saying 'good coverage' is meaningless. Without a numeric target like 85% branch coverage, the AI optimizes for volume of tests rather than meaningful coverage. You end up with 20 tests that cover the same happy path. State a specific percentage and clarify whether you mean line coverage, branch coverage, or mutation score.

  • Forgetting to Restrict Source Modification

    AI assistants frequently suggest refactoring the source code to make it more testable — adding seams, extracting functions, or changing signatures. This is dangerous in legacy systems where side effects are unknown. Explicitly state that the source file must not be modified and that all testability must come through mocking and dependency injection at the call site.

  • Using Generic Role Instructions

    A prompt that says 'act as a developer' produces generic output. Specifying 'Senior Software Test Engineer' signals that the AI should apply professional testing standards: AAA structure, meaningful assertion messages, fixture reuse, and coverage analysis. Role specificity shifts the output from beginner-level examples to production-quality test suites.

The transformation

Before
Write unit tests for this old module. Aim for good coverage.
After
Role: **Senior Software Test Engineer**

Goal: Generate a maintainable unit test suite for a legacy module.

Context:
1) Language/Framework: Python 3.11, pytest, unittest.mock
2) Target file: payments/processor.py (paste code below)
3) Coverage: minimum 85% lines/branches
4) Constraints: don’t modify source; avoid network/DB calls; use fakes
5) Risks: timezone math, retry logic, rounding errors

Instructions:
- Propose a test plan first (cases, boundaries, mocks, fixtures).
- Write pytest tests with clear names and AAA structure.
- Include examples for success, failure, and edge inputs.
- Show how to mock HTTP and clock.
- Provide a coverage checklist and next steps.

Code:
[paste processor.py here]

Why this works

  • Role Primes Quality Standards

    The After Prompt opens with 'Senior Software Test Engineer' as the role. This single instruction shifts the AI's output register from tutorial-style examples to professional testing practices — AAA structure, fixture reuse, meaningful assertion messages. Without this, the AI defaults to the simplest interpretation of 'write tests,' which is rarely production-ready.

  • Explicit Constraints Eliminate Guessing

    The After Prompt's constraints section ('don't modify source; avoid network/DB calls; use fakes') removes the AI's most common failure modes for legacy code. AI models default to the path of least resistance; without these constraints, they'll suggest source changes or real I/O. Explicit rules make hermetic, CI-safe tests far more likely, though the mock setup still deserves a check in review.

  • Named Risks Focus Edge Case Generation

    The After Prompt lists 'timezone math, retry logic, rounding errors' under risks. This is the highest-leverage instruction in the entire prompt. AI models respond to specificity here — named risks produce named test cases targeting those exact failure modes, rather than generic boundary tests that miss the actual defects.

  • Sequential Instructions Separate Thinking from Writing

    The instructions section starts with 'Propose a test plan first' before requesting code. This forces a reasoning step — the AI identifies cases, boundaries, and mocking needs before writing a single line of code. The result is a structured, auditable plan that catches omissions before implementation begins.

  • Measurable Success Criteria Anchor Output

    'Minimum 85% lines/branches' gives the AI a concrete target to optimize toward. Vague goals like 'good coverage' produce inconsistent results. A numeric threshold causes the model to include a coverage checklist, flag untested paths, and suggest next steps — outputs that are directly actionable for an engineering team.

The framework behind the prompt

The Theory Behind Effective Test Generation Prompts

Unit testing legacy code sits at the intersection of two disciplines: software testing theory and prompt engineering. Understanding both helps you write prompts that produce genuinely useful output.

Why Legacy Code Is Different

Legacy systems resist standard test-driven development (TDD) because TDD assumes you write tests before code. Michael Feathers' foundational work in Working Effectively with Legacy Code defines legacy code precisely as code without tests — and notes that making it testable often requires identifying seams: points where behavior can be substituted without changing production logic. A good test prompt explicitly names these seams (the HTTP client, the clock, the database layer) and instructs the AI to mock at those exact points.
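
A minimal sketch of substituting behavior at a seam, under stated assumptions: `LegacyProcessor`, its `_now` and `_post` methods, and the settlement rule are hypothetical stand-ins, not code from any module discussed here. `unittest.mock.patch.object` swaps the clock and HTTP seams from the test side, so the class itself is never edited.

```python
from datetime import datetime, timedelta, timezone
from unittest.mock import patch


class LegacyProcessor:
    """Hypothetical stand-in for untested legacy code; the test never edits it."""

    def _now(self):
        # Clock seam: the real implementation reads the system time.
        return datetime.now(timezone.utc)

    def _post(self, payload):
        # HTTP seam: the real implementation would call a payment gateway.
        raise RuntimeError("real network call attempted")

    def settle(self, amount_cents):
        payload = {
            "amount": amount_cents,
            "settled_on": self._now().date().isoformat(),
        }
        return self._post(payload)


def test_settle_uses_utc_date_when_local_day_differs():
    # Arrange: 23:30 on 31 Dec in UTC-5 is already 1 Jan in UTC.
    local = datetime(2023, 12, 31, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
    frozen = local.astimezone(timezone.utc)
    with patch.object(LegacyProcessor, "_now", return_value=frozen), \
         patch.object(LegacyProcessor, "_post", return_value={"ok": True}) as post:
        # Act
        result = LegacyProcessor().settle(1999)

    # Assert
    assert result == {"ok": True}
    post.assert_called_once_with({"amount": 1999, "settled_on": "2024-01-01"})
```

Naming the seams in the prompt (here, the clock and the HTTP call) tells the AI where to patch, instead of leaving it to guess which attribute to replace.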

Coverage as a Specification Language

Coverage metrics — line, branch, and mutation — function as a specification language in prompts. Branch coverage is more meaningful than line coverage for legacy code because it requires every conditional path to be exercised. Mutation testing, pioneered by DeMillo, Lipton, and Sayward in 1978, goes further: it measures whether your assertions are strong enough to detect artificial bugs. Including mutation score in a prompt shifts AI output from volume-optimized tests to assertion-quality tests.
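
To make the line-versus-branch distinction concrete, here is a small sketch with a hypothetical `shipping_fee` function. The first test alone executes every line, yet it never takes the path where the `if` is skipped; only the second test covers that branch.

```python
def shipping_fee(subtotal_cents, is_member):
    """Hypothetical legacy function: members ship free, others pay a flat fee."""
    fee = 500
    if is_member:
        fee = 0
    return fee


def test_shipping_fee_for_member_is_zero():
    # Executes every line of shipping_fee, so line coverage is already complete.
    assert shipping_fee(2_000, is_member=True) == 0


def test_shipping_fee_for_non_member_is_flat_fee():
    # Takes the path where the `if` body is skipped. Branch coverage
    # requires this test; line coverage does not.
    assert shipping_fee(2_000, is_member=False) == 500
```

With coverage.py, branch measurement is opt-in (`coverage run --branch`, or `--cov-branch` with pytest-cov), which is one reason to state the metric explicitly in the prompt.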

The AAA Pattern as a Cognitive Framework

The Arrange-Act-Assert (AAA) pattern, popularized by Bill Wake, does more than organize test code — it forces clarity about what a function is supposed to do. When you require AAA structure in a prompt, you're asking the AI to separate setup (Arrange), execution (Act), and verification (Assert) explicitly. This prevents a common AI failure mode: tests that act and assert but never properly isolate the system under test.
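
As an illustration of the AAA layout, the sketch below tests a hypothetical `process_payment` function against a hand-written fake. `FakeGateway`, `process_payment`, and the response shape are assumptions made for the example. The tests are plain functions that pytest would discover; in a real pytest suite the exception check would normally use `pytest.raises`.

```python
class FakeGateway:
    """Hand-written fake standing in for an HTTP client (hypothetical)."""

    def __init__(self, response):
        self.response = response
        self.requests = []

    def charge(self, amount_cents, currency):
        self.requests.append((amount_cents, currency))
        return self.response


def process_payment(gateway, amount_cents, currency="USD"):
    """Hypothetical system under test."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    reply = gateway.charge(amount_cents, currency)
    return reply["status"] == "approved"


def test_process_payment_with_approved_charge_returns_true():
    # Arrange
    gateway = FakeGateway({"status": "approved"})

    # Act
    approved = process_payment(gateway, 1250)

    # Assert
    assert approved is True, "an approved gateway reply should yield True"
    assert gateway.requests == [(1250, "USD")], "exactly one charge expected"


def test_process_payment_with_zero_amount_raises_and_sends_nothing():
    # Arrange
    gateway = FakeGateway({"status": "approved"})

    # Act
    try:
        process_payment(gateway, 0)
        raised = False
    except ValueError:
        raised = True

    # Assert
    assert raised, "a zero amount should be rejected"
    assert gateway.requests == [], "no charge should reach the gateway"
```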

Prompt Engineering Principles at Work

Few-Shot Prompting — providing one working test as a reference — dramatically reduces structural errors. Chain-of-Thought — requesting a test plan before code — improves coverage completeness by forcing a reasoning step before implementation. Role Prompting elevates output register by activating domain-specific knowledge associated with a professional identity. All three apply directly to test generation prompts.

Chain-of-Thought Prompting · Few-Shot Prompting · Role Prompting · RISEN

Prompt variations

JavaScript / Jest — Node.js Service

Role: Senior JavaScript Test Engineer

Goal: Generate a complete Jest test suite for a legacy Node.js service module with no existing tests.

Context:

  1. Language/Framework: Node.js 18, Jest 29, jest.mock for dependencies
  2. Target file: src/billing/invoiceGenerator.js (code pasted below)
  3. Coverage target: 80% branch coverage measured by Jest's --coverage flag
  4. Constraints: do not modify source; no real filesystem or network calls; mock all external modules
  5. Known risks: date arithmetic using moment.js, PDF generation timeout handling, currency rounding to two decimal places

Instructions:

  • Start with a test plan listing all cases, mocked modules, and fixture data needed.
  • Write Jest tests using describe/it blocks with clear, readable names.
  • Use beforeEach for setup, afterEach for teardown, and jest.spyOn where appropriate.
  • Cover success paths, error branches, and the three named risk areas explicitly.
  • End with a coverage gap analysis and suggested next test targets.

Code: [paste invoiceGenerator.js here]

Java / JUnit 5 — Microservice with Database

Role: Senior Java Test Engineer specializing in Spring Boot applications

Goal: Write a JUnit 5 unit test suite for a legacy Spring service class, isolating it from real database and HTTP dependencies.

Context:

  1. Language/Framework: Java 17, JUnit 5, Mockito 5, Spring Boot 3
  2. Target class: OrderFulfillmentService.java in the order-service microservice
  3. Coverage target: 85% line coverage, verified by JaCoCo
  4. Constraints: no actual database calls; use @Mock and @InjectMocks; do not modify production source
  5. Known risks: optimistic locking exceptions, event publishing failures, null-safety gaps in legacy DTOs

Instructions:

  • Produce a test plan first: list test method names, the scenario each covers, and which dependencies to mock.
  • Write @Test methods using the Arrange/Act/Assert pattern with descriptive display names via @DisplayName.
  • Mock all repository and event publisher dependencies with Mockito.
  • Include tests for the three named risk areas with explicit assertions on exception types and messages.
  • Provide a JaCoCo configuration snippet and a checklist of uncovered branches.

Class file: [paste OrderFulfillmentService.java here]

TypeScript / Vitest — Frontend Utility Library

Role: Senior Frontend Test Engineer

Goal: Generate a Vitest unit test suite for a legacy TypeScript utility library used across a React application.

Context:

  1. Language/Framework: TypeScript 5, Vitest 1.x, vi.mock for module mocking
  2. Target file: src/utils/formatters.ts — currency, date, and string formatting functions
  3. Coverage target: 90% line and branch coverage using Vitest's built-in v8 coverage provider
  4. Constraints: pure function tests only; no DOM interaction; no external API calls; do not alter source types
  5. Known risks: locale-sensitive number formatting, daylight saving time edge cases in date formatters, handling of null and undefined inputs

Instructions:

  • Begin with a test plan organized by function, listing normal cases, boundary values, and known risk scenarios.
  • Write tests using describe/it blocks; group by exported function name.
  • Use vi.setSystemTime to control dates deterministically.
  • Include parametrized tests using test.each for locale and null/undefined variants.
  • Close with a coverage summary and a list of any functions that may need source-level fixes to reach the coverage target.

File: [paste formatters.ts here]

Exploratory Audit — Unknown Legacy Module

Role: Senior Software Test Engineer conducting a legacy code audit

Goal: Analyze an unfamiliar legacy module and produce both a risk assessment and a starter test suite.

Context:

  1. Language/Framework: Python 3.10, pytest, unittest.mock
  2. Target file: utils/data_transformer.py — purpose unknown, no documentation
  3. Coverage target: achieve 70% branch coverage as a first pass; identify gaps for a second pass
  4. Constraints: read source only; do not modify; no real I/O; infer behavior from code structure

Instructions:

  • First, produce a code audit summary: what does this module appear to do, what are its dependencies, and where are the highest-risk sections?
  • Second, list 10 to 15 test cases ranked by risk, with a one-sentence rationale for each.
  • Third, write pytest tests for the top 8 cases using AAA structure and unittest.mock.
  • Flag any functions where behavior is ambiguous and recommend clarification from the original team.
  • Provide a coverage estimate and a list of branches that require domain knowledge to test safely.

Code: [paste data_transformer.py here]

When to use this prompt

  • Engineering Teams Refactoring Legacy Services

    Increase safety before refactors by generating tests around fragile modules. Catch regressions without changing production code.

  • Product Managers Requiring Release Confidence

    Set measurable coverage targets on critical paths and get auditable test plans that support go/no-go decisions.

  • Customer Success Reproducing Edge Bugs

    Create focused tests for reported issues (timezones, retries) to validate fixes and prevent recurrences.

  • Researchers Benchmarking Code Quality

    Generate consistent test suites to compare module reliability and mutation score improvements across versions.

  • DevOps Owners Hardening CI Pipelines

    Add deterministic, mocked tests that run fast in CI, improving feedback loops and deployment confidence.

Pro tips

  1. Specify mutation testing or branch coverage if quality matters more than line count.
  2. List concrete risk areas (e.g., time math, retries, floating-point) to focus edge cases.
  3. Provide interface contracts or sample inputs/outputs to clarify expected behavior.
  4. State mocking policies (e.g., forbid real I/O, use fakes for HTTP/clock) to keep tests reliable.

Once you've mastered basic coverage-driven prompting, two advanced techniques sharpen output quality significantly.

Mutation Testing Instructions

Line and branch coverage tell you how much code runs during tests, not whether the tests actually catch bugs. Mutation testing tools like mutmut (Python), Stryker (JavaScript/TypeScript), and PIT (Java) introduce small code changes and verify that your tests fail. Add this to your prompt:

'After generating tests, identify which assertions are most likely to catch mutations in the three named risk areas. Flag any tests that assert on return values only without asserting on side effects or state changes.'

This produces tests with stronger assertions rather than just wider coverage.
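
A rough sketch of why return-value-only assertions are weak against mutations. The function, the hand-written mutant, and both checks are hypothetical; a real tool such as mutmut generates its own mutants, and this one is written by hand purely for illustration.

```python
def record_refund(ledger, amount_cents):
    """Hypothetical legacy function: appends to the ledger, returns the balance."""
    ledger.append(-amount_cents)
    return sum(ledger)


def record_refund_mutant(ledger, amount_cents):
    """Hand-made mutant: returns the right number but skips the ledger write."""
    return sum(ledger) - amount_cents


def weak_check(fn):
    # Asserts on the return value only.
    return fn([1000], 300) == 700


def strong_check(fn):
    # Also asserts on the state change that callers depend on.
    ledger = [1000]
    balance = fn(ledger, 300)
    return balance == 700 and ledger == [1000, -300]


# The weak check passes for both versions, so the mutant survives.
assert weak_check(record_refund) and weak_check(record_refund_mutant)
# The strong check passes for the original and fails for the mutant.
assert strong_check(record_refund) and not strong_check(record_refund_mutant)
```

Asking the AI to flag return-only assertions targets exactly the gap that `weak_check` leaves open.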

Behavioral Contracts as Prompt Input

If you have any documentation — even a comment, a Slack message, or an old ticket — paste it as a behavioral contract. Example:

'Known behavior from ticket PLAT-412: retry logic should attempt exactly 3 times with exponential backoff, and on final failure should publish a FailedPaymentEvent. Treat this as the specification for tests in the retry section.'

This anchors the AI to intended behavior, not just structural inference from code. It produces tests that document intent, which is the most durable value of a unit test suite in a legacy system.
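
The example contract above could translate into a test along these lines. This is a sketch under assumptions: `charge_with_retry`, its parameters, the one-second base delay, and the shape of `FailedPaymentEvent` are invented here so the test has something to run; the ticket text only specifies three attempts, exponential backoff, and the event on final failure.

```python
from unittest.mock import Mock, call


class FailedPaymentEvent:
    """Hypothetical event type; equality is by payment id."""

    def __init__(self, payment_id):
        self.payment_id = payment_id

    def __eq__(self, other):
        return (isinstance(other, FailedPaymentEvent)
                and other.payment_id == self.payment_id)


def charge_with_retry(payment_id, send, sleep, publish,
                      max_attempts=3, base_delay=1.0):
    """Hypothetical implementation, written only to exercise the test."""
    for attempt in range(max_attempts):
        try:
            return send(payment_id)
        except ConnectionError:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # assumed delays: 1s, then 2s
    publish(FailedPaymentEvent(payment_id))
    return None


def test_charge_with_persistent_failure_retries_three_times_then_publishes():
    # Arrange: every attempt fails; sleep and publish are recorded, not run.
    send = Mock(side_effect=ConnectionError("gateway down"))
    sleep = Mock()
    publish = Mock()

    # Act
    result = charge_with_retry("pay_123", send, sleep, publish)

    # Assert: exactly 3 attempts, doubling delays, one failure event.
    assert result is None
    assert send.call_count == 3
    assert sleep.call_args_list == [call(1.0), call(2.0)]
    publish.assert_called_once_with(FailedPaymentEvent("pay_123"))
```

Each assertion maps to one clause of the contract, which is what makes the test readable as documentation of intent.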

The core prompt structure works across industries, but regulated environments need specific additions.

Financial Services

Add rounding mode to your risk section explicitly: in Python's decimal module, ROUND_HALF_UP and ROUND_HALF_EVEN give different results on exact half-cent values, and the difference adds up at scale. Include a constraint requiring assertion precision: 'Assert currency values to exactly two decimal places using Decimal comparison, not float equality.' This prevents a common class of rounding-related false positives in tests.
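
A short sketch of the two rounding modes and of Decimal-based assertions, using only Python's standard `decimal` module. The helper `to_cents` and the sample amounts are illustrative assumptions.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_cents(amount, rounding):
    """Quantize a Decimal amount to two places with an explicit rounding mode."""
    return amount.quantize(CENT, rounding=rounding)


def test_rounding_mode_changes_result_on_exact_half_cent():
    amount = Decimal("2.665")
    assert to_cents(amount, ROUND_HALF_UP) == Decimal("2.67")
    assert to_cents(amount, ROUND_HALF_EVEN) == Decimal("2.66")


def test_quantized_value_has_exactly_two_decimal_places():
    # Decimal("2.5") == Decimal("2.50") compares equal by value,
    # so check the exponent to assert the number of places.
    assert to_cents(Decimal("2.5"), ROUND_HALF_UP).as_tuple().exponent == -2


def test_float_equality_is_not_a_safe_currency_assertion():
    assert 0.1 + 0.2 != 0.3
    assert Decimal("0.10") + Decimal("0.20") == Decimal("0.30")
```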

Healthcare (HIPAA-Adjacent Systems)

For systems handling patient identifiers or PHI, add: 'Use synthetic test data only — no real names, dates of birth, or identifiers in fixtures. Use a deterministic fake data generator.' This keeps test files safe for code review, CI logs, and version control.
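
One way to read "deterministic fake data generator" is a seeded factory like the sketch below. The record schema and the `TEST-` prefix are hypothetical; the point is that the same seed always yields the same obviously synthetic record within a given Python version.

```python
import random
from datetime import date, timedelta


def make_synthetic_patient(seed):
    """Build an obviously fake patient record from a seed (hypothetical schema)."""
    rng = random.Random(seed)  # local generator: no shared global random state
    birth = date(1950, 1, 1) + timedelta(days=rng.randrange(20_000))
    return {
        "patient_id": f"TEST-{rng.randrange(10**6):06d}",
        "name": f"Synthetic Patient {seed}",
        "date_of_birth": birth.isoformat(),
    }


def test_same_seed_yields_same_record():
    assert make_synthetic_patient(7) == make_synthetic_patient(7)


def test_record_is_visibly_synthetic():
    record = make_synthetic_patient(7)
    assert record["patient_id"].startswith("TEST-")
    assert record["name"].startswith("Synthetic Patient")
```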

DevOps / CI Pipeline Ownership

If you own the CI configuration, add a performance constraint: 'Each test must complete in under 200ms. Flag any test requiring setup that likely exceeds this threshold.' Slow test suites get disabled by developers under deadline pressure. Speed constraints prompt the AI to prefer lightweight fakes over heavy fixture setup, keeping your CI feedback loop fast.

Use this checklist before sending your prompt to any AI assistant. Missing items are the most common cause of unusable output.

Code Context

  • Pasted the target file or the specific functions to test
  • Named any interfaces, base classes, or type definitions the target depends on
  • Included one passing test file as a style reference (optional but high-value)

Technical Specification

  • Language and exact version specified (e.g., Python 3.11, not just 'Python')
  • Test framework and version named (e.g., pytest 7.4, not just 'pytest')
  • Mocking library named explicitly

Constraints

  • Source modification explicitly forbidden
  • Real I/O prohibition stated
  • Mocking policy defined

Quality Targets

  • Numeric coverage target specified (line, branch, or mutation)
  • Test naming convention or structure stated (AAA, describe/it, etc.)
  • Risk areas listed (at least 3 specific ones)

Output Structure

  • Test plan requested before code
  • Coverage checklist or gap analysis requested
  • Next steps or flagged unknowns requested

If you can check every item, your prompt is ready. If you're missing more than three items, expect to iterate at least twice before the output is usable.

When not to use this prompt

When This Prompt Pattern Is Not the Right Tool

This prompt works well for isolated, deterministic modules with clear inputs and outputs. It is less appropriate in these situations:

When behavior is genuinely unknown. If you cannot infer what the function is supposed to do from code alone, tests will assert on current behavior rather than correct behavior. Use the Exploratory Audit variation first to surface ambiguities, then get human clarification before generating a full suite.

When the module has no testable seams. Some legacy code makes global state mutations, spawns threads, or writes to files in ways that cannot be intercepted without source changes. In this case, start with an integration or smoke test approach rather than forcing unit tests onto untestable code.

When you need property-based or fuzzing tests. This prompt generates example-based tests. For functions requiring exhaustive input space exploration — parsers, serializers, cryptographic utilities — a separate prompt targeting tools like Hypothesis (Python) or fast-check (JavaScript) produces better results.

When you're testing UI components or end-to-end flows. This pattern targets logic that runs without a rendering environment, such as server-side code and pure utility functions. For frontend component testing or E2E test generation, use a prompt that addresses the specific rendering lifecycle and interaction model of your framework.

Troubleshooting

Generated tests import modules that don't exist in my project

Paste your project's directory structure (a two-level tree is enough) and one existing passing test file in the prompt. Say explicitly: 'Match the import paths and conftest.py patterns shown in this reference test.' The AI infers import structure from context — without a reference, it invents plausible-looking but incorrect paths.

All generated tests cover only the happy path — no failure or edge cases

Add this instruction explicitly: 'For each test case, generate at minimum one success variant, one failure variant, and one boundary or edge case variant.' Also list your specific risk areas (e.g., null inputs, overflow values, timezone offsets) by name. Without explicit edge case instructions, AI models default to the most obvious positive flow.

Tests use real network calls despite my mocking instruction

Move your mocking constraint to the first line of the instructions section, not the context section. Also name the specific library method to patch: 'Patch requests.Session.send using unittest.mock.patch, not the module-level requests.get.' Specificity in the patch target prevents the AI from choosing an ineffective or incorrect mock point.

Test names are generic (test_1, test_function_works) and don't describe behavior

Include a naming convention example in your prompt: 'Name all tests using the pattern test_[function]_[condition]_[expected_outcome], for example: test_calculate_total_with_zero_quantity_returns_zero.' One concrete example overrides generic naming habits more reliably than a description alone.

The AI modifies the source code instead of writing tests around it

Add this sentence to your constraints in bold or as the final constraint: 'Under no circumstances modify, refactor, or suggest changes to the source file.' If the AI still suggests changes, add a follow-up: 'If a function cannot be tested without modification, flag it in a separate section called UNTESTABLE FUNCTIONS and explain why — do not change the source.'

How to measure success

How to Evaluate AI-Generated Test Output

Don't accept the first output uncritically. Use this checklist to measure quality before adding tests to your codebase.

Structural Quality

  • Every test follows AAA — you can identify Arrange, Act, and Assert sections clearly
  • Test names describe behavior, not function names (e.g., test_retry_fails_after_three_attempts, not test_retry)
  • No test method exceeds 25 lines — long tests usually assert on too many things at once

Coverage and Risk

  • All named risk areas appear as explicit test cases in the output
  • Both success and failure paths exist for each major function
  • Boundary values are tested — off-by-ones, empty inputs, zero values, maximum values

Technical Correctness

  • No real I/O occurs — verify by reading mock setup, not just the test body
  • Tests are independent — no test depends on another test's side effects
  • The coverage checklist identifies untested branches rather than just reporting passing tests

Maintainability

  • Fixtures are reused across related tests rather than duplicated inline
  • Assertion messages explain failures — you can read a failing test and know what went wrong without opening the source

Now try it on something of your own

Reading about the framework is one thing. Watching it sharpen your own prompt is another — takes 90 seconds, no signup.

Get a production-ready pytest suite for your legacy module — with mocking, edge cases, and a coverage checklist built in.

Frequently asked questions

How much code should I paste into the prompt?

Paste the full target file if it's under 300 lines. For larger files, paste the specific class or functions you want tested, plus any interfaces or type definitions they depend on. Include just enough context for the AI to understand inputs, outputs, and dependencies, but not so much that it dilutes focus. For files over 500 lines, consider splitting into separate prompts per logical section.

Can I use this prompt for integration tests?

Yes, but adjust two things. First, change the constraints section: remove the 'no real I/O' rule and specify which real dependencies are allowed (e.g., a test database, a local mock server). Second, change the role to reflect integration testing. The rest of the structure (test plan first, named risks, coverage target) applies equally well to integration test generation.

Why do the generated tests fail when I run them in my project?

This usually means the AI made assumptions about your project structure or import paths. Fix it by adding one working test file as a reference example in your prompt. Say: 'Here is an existing passing test for reference. Match its import style, fixture setup, and conftest.py usage.' One concrete example eliminates most structural errors in generated tests.

What if my team uses a different coverage target or metric?

Replace '85% lines/branches' with your team's actual standard. If you track mutation score, say so explicitly: 'Achieve a mutation score of 70% measured by mutmut.' If you're doing a first pass on completely untested code, a lower initial target like 60% with a gap analysis is more realistic. The model follows whatever metric you specify, so be precise about the measurement method.

How do I stop the AI from suggesting changes to the source code?

Add this exact sentence to your constraints: 'Do not suggest changes to the source file under any circumstances.' Legacy code often has poor seams, and AI models instinctively recommend improvements. You need to suppress this. If the source truly cannot be tested without modification, ask the AI to flag those specific functions separately rather than interspersing refactoring suggestions throughout the test code.

Do I need to paste the test framework's documentation into the prompt?

No. Name the framework and version; that's sufficient. AI models have strong knowledge of pytest, JUnit, Jest, and Vitest up to their training cutoff. If you're using an unusual plugin or a custom test utility, paste a short example of its usage. Overcrowding the prompt with docs reduces the model's focus on your actual test requirements.

How do I handle a legacy module that relies on global state?

Add a specific instruction to your prompt: 'This module uses global state — include teardown fixtures that reset state between tests.' Name the specific globals or class variables if you know them. Ask the AI to wrap each test in appropriate setup/teardown and to flag any globals it cannot safely reset without source modification. This surfaces hidden coupling early.

Your turn

Build a prompt for your situation

This example shows the pattern. AskSmarter.ai guides you to create prompts tailored to your specific context, audience, and goals.