
Vibe coding vs. vibe testing: what's the difference, and why it matters

In March 2025, Y Combinator partner Jared Friedman said a quarter of the startups in YC's Winter 2025 batch had codebases that were roughly 95% AI-generated. Writing software stopped being the bottleneck. Knowing whether that software works became the new one.

Two terms grew out of that shift. One is settled and famous. The other is still being argued over. This post pins down both.

What is vibe coding?

Andrej Karpathy coined "vibe coding" in a post on X on February 2, 2025:

There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.

The defining move is not "AI helped me write code." It is that you stop reading the code. You prompt, you run, you describe what's still wrong, and you let the model patch it. The artifact becomes a black box you steer by feel.

The term went mainstream fast. Collins Dictionary named vibe coding its Word of the Year for 2025, defining it as using AI prompted by natural language to help write code.

Simon Willison drew the line that matters. In a March 2025 piece he argued vibe coding is fine for a low-stakes prototype and "a terrible idea" for a production codebase. His test is sharp: if an LLM wrote every line but you reviewed, tested, and understood it, that isn't vibe coding. That's using a model as a typing assistant.

What is vibe testing?

Here the ground is softer. No one is credited with coining "vibe testing," and two camps use it for different things.

One camp means testing by feel: does the app feel right to a real user. TestGrid describes it as focusing "not just on whether an app works as expected, but on how the app feels."

The other camp, the one most automation vendors use, means authoring tests from natural-language intent. Testkube defines it as "a conversational approach to software testing where testers write product requirements and user scenarios in plain English, and AI converts those descriptions into executable tests." Joe Colantonio's TestGuild podcast has run episodes framing it around self-healing Playwright tests.

The honest summary: vibe testing is open territory. The word is being claimed faster than it's being defined, which is exactly why it's worth getting the definition right.

Vibe coding vs. vibe testing: the difference

| | Vibe coding | Vibe testing (the useful sense) |
| --- | --- | --- |
| Who coined it | Andrej Karpathy, Feb 2025 | No one yet; defined by vendors |
| What you describe | The feature you want | The behavior you want verified |
| What the AI produces | App code | A test of that code |
| The risk it creates | Unread code that may not work | The answer to "does it actually work?" |
| The good version | Prototype fast, then review | Author from intent, keep a real test |

Vibe coding writes the thing. Vibe testing checks the thing. The first one is why the second one had to exist.

Why vibe testing showed up now

When you generate code faster than you can read it, the bugs don't disappear. They move downstream and wait.

The data is blunt. Veracode's 2025 GenAI Code Security Report tested over 100 models on 80 coding tasks and found AI-generated code introduced a security vulnerability in 45% of cases, with no improvement from newer or larger models. RedHunt Labs scanned about 130,000 vibe-coded sites and found roughly one in five leaked at least one secret, including thousands of live Firebase and Supabase credentials. In July 2025, Replit's coding agent deleted a production database during a code freeze and called its own action "a catastrophic failure."

Teams feel it. SmartBear's AI Software Quality Gap Report (January 2026, 273 quality decision-makers) found 93% had adopted AI coding tools, 68% worried that faster AI development would create testing bottlenecks, and 60% had already hit quality problems.

So the gap is real: more code, shipped faster, by people who read less of it. Manual clicking can't keep pace, and you won't hand-write a test suite for code you didn't hand-write. Vibe testing is the attempt to close that gap with the same move that opened it. You describe what should be true, and the AI does the work of checking.

The fault line that actually matters

Strip away the word and one engineering question decides whether vibe testing is trustworthy: does a model run when the test runs?

Some tools keep an AI in the loop at execution time. Stagehand calls a model to interpret each step by default, and re-calls it when the page changes. That adapts to drift, but it also means every run costs tokens, varies between runs, and can fail in ways you didn't write.

Other tools spend the model once, while authoring, then hand you a deterministic artifact. The shape of that artifact is familiar: Playwright's own codegen, which uses no model at all, records your clicks into a plain .spec.ts. Vendors like QA Wolf make the case for determinism directly: "automated tests can't improvise or hallucinate, so there's no variance between runs." The AI authors; the saved test is ordinary code.
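To make "the AI authors; the saved test is ordinary code" concrete, here is a minimal TypeScript sketch of a crystallizing step. It is not how Playwright codegen, QA Wolf, or any other tool actually works: the Step type, the crystallize function, and the example.com sign-in flow are all assumptions invented for illustration. The only real API surface is in the generated text, which uses standard @playwright/test calls (page.goto, getByLabel, getByRole, getByText, toBeVisible).

```typescript
// Hypothetical step format: these names are illustrative, not taken from any tool.
type Step =
  | { kind: "goto"; url: string }
  | { kind: "fill"; label: string; value: string }
  | { kind: "click"; role: "button" | "link"; name: string }
  | { kind: "expectVisible"; text: string };

// JSON.stringify yields a safely quoted string literal for the generated source.
const q = (s: string): string => JSON.stringify(s);

// Map one recorded step to one line of Playwright test code.
function stepToLine(step: Step): string {
  switch (step.kind) {
    case "goto":
      return `await page.goto(${q(step.url)});`;
    case "fill":
      return `await page.getByLabel(${q(step.label)}).fill(${q(step.value)});`;
    case "click":
      return `await page.getByRole(${q(step.role)}, { name: ${q(step.name)} }).click();`;
    case "expectVisible":
      return `await expect(page.getByText(${q(step.text)})).toBeVisible();`;
  }
}

// Turn a recorded run into the source text of a plain Playwright spec.
// No model is involved here: the same steps always produce the same text.
function crystallize(title: string, steps: Step[]): string {
  const body = steps.map((s) => "  " + stepToLine(s)).join("\n");
  return [
    `import { test, expect } from "@playwright/test";`,
    ``,
    `test(${q(title)}, async ({ page }) => {`,
    body,
    `});`,
    ``,
  ].join("\n");
}

// Hypothetical recorded run; the URL, labels, and text are placeholders.
const recorded: Step[] = [
  { kind: "goto", url: "https://example.com/login" },
  { kind: "fill", label: "Email", value: "user@example.com" },
  { kind: "click", role: "button", name: "Sign in" },
  { kind: "expectVisible", text: "Welcome back" },
];

const specSource = crystallize("user can sign in", recorded);
console.log(specSource);
```

The printed text is an ordinary spec file: one import, one test block, four awaited calls, and nothing that contacts a model. The same recorded steps always produce the same text, which is the property the durable-artifact camp is selling.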

This is the choice that outlasts the buzzword. A test that calls a model on every run is convenient, but you never finish paying for it. A test that crystallized into plain code runs the same way next year, at no token cost, in any CI, with no model to drift.
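The arithmetic behind that trade-off is simple enough to sketch. The figures below are invented placeholders, not measured prices: assume authoring a test with a model costs 50 cents once, and interpreting a test with a model costs 5 cents on every run.

```typescript
// All costs are in US cents and are hypothetical placeholders, not real pricing.
interface CostProfile {
  authoringCents: number; // one-time model spend while writing the test
  perRunCents: number; // model spend every time the test executes
}

// Total spend after a given number of runs.
function totalCents(profile: CostProfile, runs: number): number {
  return profile.authoringCents + profile.perRunCents * runs;
}

// Smallest run count at which the crystallized test is no more expensive
// than the model-in-runtime test. Returns Infinity if it never catches up.
function breakEvenRuns(runtime: CostProfile, crystallized: CostProfile): number {
  const savedPerRun = runtime.perRunCents - crystallized.perRunCents;
  if (savedPerRun <= 0) return Infinity;
  const extraUpfront = crystallized.authoringCents - runtime.authoringCents;
  return Math.max(0, Math.ceil(extraUpfront / savedPerRun));
}

const modelInRuntime: CostProfile = { authoringCents: 0, perRunCents: 5 };
const crystallizedSpec: CostProfile = { authoringCents: 50, perRunCents: 0 };

console.log(breakEvenRuns(modelInRuntime, crystallizedSpec)); // 10
console.log(totalCents(modelInRuntime, 1000)); // 5000
console.log(totalCents(crystallizedSpec, 1000)); // 50
```

Under these made-up numbers the two approaches cost the same at 10 runs. At 1,000 runs the model-in-runtime test has cost 5,000 cents and the crystallized one is still at 50. Real prices will differ, but the shape holds: one total grows with every run and the other stays flat.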

Where vibe testing is going

Two things are likely. The "describe it in English" front end will become the default way most tests get authored, because the speed is real and the alternative is writing selectors by hand. And the back end will split along the fault line above: products that keep a model in the runtime, and products that use the model to author a durable artifact and then get out of the way.

The teams shipping AI-generated code at volume will lean toward durable artifacts, for the same reason they run unit tests in CI instead of asking a human to re-check every build. You want the test to be cheaper and more stable than the thing it guards, not another model call that can have a bad day.

How Hover does vibe testing

Hover is a free, open-source VS Code extension built on the durable-artifact side of that line. You describe a flow in plain English. The agent drives your real Chrome over the DevTools Protocol, using the claude or codex CLI you already run, and works out the steps. When the run is clean, Hover crystallizes it into a standard @playwright/test spec.

That spec is the whole point. It runs in CI with no AI, no tokens, and no variance, the same as any Playwright test you wrote by hand. We've written about why AI-authored tests should leave no AI in CI, and the same chat flips into security and pentest modes for the access-control holes that vibe-coded apps ship by default.

Vibe-code the feature if you want. Then vibe-test it, and keep a real Playwright spec when you're done.

Try Hover on your own app.

Install the VS Code extension. Author tests with AI, ship plain Playwright.
