Generate tests from GitHub pull requests

2026-03-13 23:01

I’ve been experimenting with something interesting.

AI coding tools generate code very quickly, but they almost never generate full end-to-end test coverage. They create a ton of tests, mostly unit and integration, but real user scenarios are missing. In many repos we looked at, once teams started using Copilot-style tools, the volume of new code grew while only a small number of high-quality e2e tests were written, or e2e testing was left to testers as a separate job.

So I tried a different approach.

The system reads a pull request and:

• Analyzes changed files
• Identifies uncovered logic paths, using a dependency graph (one repo or multi-repo)
• Understands the context via a user story or requirements (given as a comment in the PR)
• Generates test scenarios
• Produces e2e automated tests tied to the PR
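The first step above, analyzing changed files, can be sketched with a minimal unified-diff parser (stdlib only; the diff text and file name here are illustrative, not from the actual system):

```python
import re

def changed_lines(diff_text):
    """Parse a unified diff and return {file: [changed line numbers in the new file]}."""
    changes = {}
    current = None
    new_line = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            current = line[6:]
            changes[current] = []
        elif line.startswith("@@"):
            # Hunk header, e.g. "@@ -40,8 +45,10 @@": capture the new-file start line.
            m = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", line)
            new_line = int(m.group(1))
        elif current is not None:
            if line.startswith("+") and not line.startswith("+++"):
                changes[current].append(new_line)
                new_line += 1
            elif not line.startswith("-"):
                new_line += 1  # context line advances the new-file counter
    return changes

diff = """\
--- a/src/api/auth.js
+++ b/src/api/auth.js
@@ -45,3 +45,4 @@
 function verify(token) {
+  if (!token) return error(400);
   return check(token);
 }
"""
print(changed_lines(diff))  # {'src/api/auth.js': [46]}
```

The changed line numbers are what you would then feed into the dependency graph to find the uncovered logic paths.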

In addition, if a user connects their CMS or TMS, that data can be pulled in as well. (Internally I use GraphRAG, but that is for another post.)

Example workflow:

1. Push a PR
2. System reads diff + linked Jira ticket
3. Generates missing tests and coverage report
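Steps 2 and 3 can be sketched as two small functions; the name-matching heuristic and the ticket fields below are assumptions for illustration (the real system uses a dependency graph, not file names):

```python
import os

def missing_test_report(changed_files, test_files):
    """Flag changed source files whose base name appears in no test file name.
    (Naive heuristic; a real system would walk a dependency graph instead.)"""
    uncovered = []
    for path in changed_files:
        base = os.path.splitext(os.path.basename(path))[0]
        if not any(base in os.path.basename(t) for t in test_files):
            uncovered.append(path)
    return uncovered

def scenarios_from_ticket(ticket):
    """Turn a Jira-style ticket dict into one test scenario per acceptance criterion."""
    return [f"{ticket['key']}: verify that {c}" for c in ticket["criteria"]]

changed = ["src/api/auth.js", "src/ui/button.js"]
tests = ["e2e/button.spec.js"]
ticket = {"key": "JIRA-API-102",
          "criteria": ["API returns 400 for an invalid token"]}

print(missing_test_report(changed, tests))  # ['src/api/auth.js']
print(scenarios_from_ticket(ticket))
```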

In early experiments the system consistently found edge cases that developers missed.

Example output:

| Code Reference | Requirement ID | Requirement / Acceptance Criteria | Test Type | Test ID | Test Description | Status |
| --- | --- | --- | --- | --- | --- | --- |
| src/api/auth.js:45-78 | GITHUB-234 / JIRA-API-102 | API should return 400 for invalid token | Integration | IT-01 | Validate response for invalid token | Pass |
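A traceability row like the one in the example output is straightforward to emit programmatically; here is a minimal sketch where the field names simply mirror the table columns:

```python
from dataclasses import dataclass, astuple

@dataclass
class TraceRow:
    code_ref: str      # file and line range the test traces back to
    req_id: str        # GitHub issue / Jira ticket
    criteria: str      # requirement / acceptance criteria
    test_type: str
    test_id: str
    description: str
    status: str

def to_markdown(rows):
    """Render traceability rows as a markdown table."""
    header = ("| Code Reference | Requirement ID | Requirement / Acceptance Criteria "
              "| Test Type | Test ID | Test Description | Status |")
    sep = "|" + " --- |" * 7
    body = ["| " + " | ".join(astuple(r)) + " |" for r in rows]
    return "\n".join([header, sep] + body)

row = TraceRow("src/api/auth.js:45-78", "GITHUB-234 / JIRA-API-102",
               "API should return 400 for invalid token", "Integration",
               "IT-01", "Validate response for invalid token", "Pass")
print(to_markdown([row]))
```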

Curious how others are thinking about this kind of traceability. I am a developer too, so I'm sensitive to showing this only to the developer; only the developer should be able to make it visible to other folks, since otherwise they could just take the corrective action themselves first.


Comments

  • By marcospolanco 2026-03-16 20:07

    I always translate specs into Gherkin BDD scenarios and drive the tests off that; without this linkage test coverage diverges from user flows.

  • By Aamir21 2026-03-15 19:19

    And the other thing is tacit or tribal knowledge. An AI system is good when data is structured and available, not so much when data is scattered and the connect-the-dots information lives largely in a dev's or tester's head. My recipe: memory + context, combined with a seamless UI to capture the dev/tester mindset, will make any AI system customizable. It doesn't have to be an LLM system; it can be 90% RAG or some kind of graph tagging and 10% LLM usage. That creates a moat that is easily defensible; otherwise a new LLM upgrade wipes out any moat you might have.

  • By jmathai 2026-03-14 1:23 (1 reply)

    I think Claude Code can write very good end to end tests given the right constructs.

    I have been building a desktop app (electron-based) which interacts with Anthropic’s AgentSDK and the local file system.

    It’s 100% spec driven and Claude Code has written every line. I do large features instead of small ones (spec in issue around 300 lines of markdown).

    I have had it generate Playwright tests from the start. It was doing okay, but one thing made it amazing: I created a spec-driven pull request to use data-testid attributes for selectors.
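    The data-testid convention boils down to a one-line selector helper; a minimal sketch (the selector name is illustrative, and Playwright's own `page.get_by_test_id` covers the same ground):

```python
def tid(name):
    """Build a CSS selector targeting a data-testid attribute.
    Framework-agnostic; Playwright also offers page.get_by_test_id for this."""
    return f'[data-testid="{name}"]'

# In a Playwright test you would write e.g. page.click(tid("login-button")).
print(tid("login-button"))  # [data-testid="login-button"]
```

    The win is stability: data-testid selectors survive markup and styling refactors that break text- or CSS-class-based selectors.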

    Every new feature adds tests, and verifies it hasn’t broken existing features.

    I don’t even bother with unit tests. It’s working amazingly.

    • By Aamir21 2026-03-15 10:00

      I tried Claude Code, and it did write some good-quality e2e tests, but my biggest worry was full coverage. It's really difficult to quantify e2e test coverage the way developers do unit test coverage; arguably impossible. Specs are just one artifact, just like code is one of the many artifacts that full system-wide e2e coverage needs. Adding production logs + production incidents, which I also tried, gives me some sense of full e2e coverage.

      If you are using Claude Code for both dev and testing, it's like having your cake and eating it too: if Claude for whatever reason misrepresents or misinterprets a requirement, that will percolate into the code and the testing as well. Having a third-party testing tool is appropriate, with all the data flowing into it (specs, legacy tests, prod incidents, code); then perhaps we can expect full unbiased test coverage.

      I am not talking about wannabe enterprise apps or hobby apps. I am talking about >v0 enterprise apps that have real customers and real downside if they go down, with rich data sets of past incidents and not-so-perfect code, and that are now increasingly using agentic AI to produce more non-human code. They need a third-party tool that ingests their data, creates a KG understanding of it, and prevents critical bugs from leaking into production by generating a small number of high-quality, high-coverage tests.
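One crude way to put a number on the e2e coverage the commenter is after: treat requirements and past production incidents as coverage targets and count how many have at least one test referencing them. All identifiers below are hypothetical:

```python
def e2e_coverage(requirement_ids, incident_ids, tests):
    """Fraction of requirements and past incidents referenced by at least one test.
    `tests` maps test id -> set of requirement/incident ids it exercises."""
    targets = set(requirement_ids) | set(incident_ids)
    covered = {ref for refs in tests.values() for ref in refs if ref in targets}
    return len(covered) / len(targets) if targets else 1.0

tests = {"IT-01": {"JIRA-API-102"}, "E2E-07": {"INC-2031"}}
score = e2e_coverage(["JIRA-API-102", "JIRA-API-103"], ["INC-2031"], tests)
print(score)  # 2 of 3 targets covered -> 0.666...
```

This is deliberately simplistic: it measures traceability, not behavioral depth, but it gives the "small number of high-quality, high-coverage tests" goal a trackable metric.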

HackerNews