Weave – A language-aware merge algorithm based on entities

2026-03-04 · github.com

Entity-level semantic merge driver for Git. Resolves conflicts that git can't by understanding code structure via tree-sitter. 31/31 clean merges vs git's 15/31. - Ataraxy-Labs/weave

weave

Resolves merge conflicts that Git can't by understanding code structure via tree-sitter.


Git merges by comparing lines. When two branches both add code to the same file — even to completely different functions — Git sees overlapping line ranges and declares a conflict:

<<<<<<< HEAD
export function validateToken(token: string): boolean {
    return token.length > 0 && token.startsWith("sk-");
}
=======
export function formatDate(date: Date): string {
    return date.toISOString().split('T')[0];
}
>>>>>>> feature-branch

These are completely independent changes. There's no real conflict. But someone has to manually resolve it anyway.

This happens constantly when multiple AI agents work on the same codebase. Agent A adds a function, Agent B adds a different function to the same file, and Git halts everything for a human to intervene.

Weave replaces Git's line-based merge with entity-level merge. Instead of diffing lines, it:

  1. Parses all three versions (base, ours, theirs) into semantic entities — functions, classes, JSON keys, etc. — using tree-sitter
  2. Matches entities across versions by identity (name + type + scope)
  3. Merges at the entity level:
    • Different entities changed → auto-resolved, no conflict
    • Same entity changed by both → attempts intra-entity merge, conflicts only if truly incompatible
    • One side modifies, other deletes → flags a meaningful conflict

The same scenario above? Weave merges it cleanly with zero conflicts — both functions end up in the output.
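The per-entity resolution rule above is a classic 3-way merge applied at entity granularity. Here is a minimal sketch in Python (illustrative only; the function name and the `{entity_id: source}` dict representation are assumptions for exposition, not weave's actual API):

```python
# Illustrative per-entity 3-way resolution. The {entity_id: source_text}
# representation is an assumption; weave works on parsed tree-sitter entities.

def merge_entities(base, ours, theirs):
    """3-way merge of {entity_id: source} maps. Returns (merged, conflicts)."""
    merged, conflicts = {}, []
    for eid in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(eid), ours.get(eid), theirs.get(eid)
        if o == t:                 # both sides agree (incl. identical additions)
            if o is not None:
                merged[eid] = o
        elif o == b:               # only theirs changed: take theirs (or its deletion)
            if t is not None:
                merged[eid] = t
        elif t == b:               # only ours changed: take ours (or its deletion)
            if o is not None:
                merged[eid] = o
        else:                      # both changed incompatibly, or modify vs delete
            conflicts.append(eid)
    return merged, conflicts

# Two agents add different functions to the same file: no conflict.
merged, conflicts = merge_entities(
    base={},
    ours={"fn:validateToken": "export function validateToken(...) { ... }"},
    theirs={"fn:formatDate": "export function formatDate(...) { ... }"},
)
assert conflicts == [] and len(merged) == 2
```

The modify-vs-delete case falls out naturally: one side differs from base while the other side is gone, so neither take-ours nor take-theirs applies and the entity is flagged as a meaningful conflict.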

| Scenario | Git (line-based) | Weave (entity-level) |
|---|---|---|
| Two agents add different functions to same file | CONFLICT | Auto-resolved |
| Agent A modifies foo(), Agent B adds bar() | CONFLICT (adjacent lines) | Auto-resolved |
| Both agents modify the same function differently | CONFLICT | CONFLICT (with entity-level context) |
| One agent modifies, other deletes same function | CONFLICT (cryptic diff) | CONFLICT: function 'validateToken' (modified in ours, deleted in theirs) |
| Both agents add identical function | CONFLICT | Auto-resolved (identical content detected) |
| Different JSON keys modified | CONFLICT | Auto-resolved |

The key difference: Git produces false conflicts on independent changes simply because they happen to land in the same file. Weave only conflicts on actual semantic collisions, where two branches change the same entity incompatibly.

Tested on real merge commits from major open-source repositories. For each merge commit, we replay the merge with both Git and Weave, then compare against the human-authored result.

  • Wins: Merge commits where Git conflicted but Weave resolved cleanly
  • Regressions: Cases where Weave introduced errors (0 across all repos)
  • Human Match: How often Weave's output exactly matches what the human wrote
  • Resolution Rate: percentage of all attempted merge commits that Weave resolved cleanly

| Repository | Language | Merge Commits | Wins | Regressions | Human Match | Resolution Rate |
|---|---|---|---|---|---|---|
| git/git | C | 1319 | 39 | 0 | 64% | 13% |
| Flask | Python | 56 | 14 | 0 | 57% | 54% |
| CPython | C/Python | 256 | 7 | 0 | 29% | 13% |
| Go | Go | 1247 | 19 | 0 | 58% | 28% |
| TypeScript | TypeScript | 2000 | 65 | 0 | 6% | 23% |

Zero regressions across all repositories. Every "win" is a place where a developer had to manually resolve a false conflict that Weave handles automatically.

When a real conflict occurs, weave gives you context that Git doesn't:

<<<<<<< ours — function `process` (both modified)
export function process(data: any) {
    return JSON.stringify(data);
}
=======
export function process(data: any) {
    return data.toUpperCase();
}
>>>>>>> theirs — function `process` (both modified)

You immediately know: what entity conflicted, what type it is, and why it conflicted.

Supported file types: TypeScript, JavaScript, Python, Go, Rust, JSON, YAML, TOML, and Markdown. Weave falls back to standard line-level merge for unsupported file types.

# Build
cargo build --release

# In your repo:
./target/release/weave-cli setup

# Or manually:
git config merge.weave.name "Entity-level semantic merge"
git config merge.weave.driver "/path/to/weave-driver %O %A %B %L %P"
# .gitattributes requires one pattern per line:
printf '%s merge=weave\n' '*.ts' '*.tsx' '*.js' '*.py' '*.go' '*.rs' '*.json' '*.yaml' '*.toml' '*.md' >> .gitattributes

Then use Git normally. git merge will use weave automatically for configured file types.

Dry-run a merge to see what weave would do:

weave-cli preview feature-branch
  src/utils.ts — auto-resolved
    unchanged: 2, added-ours: 1, added-theirs: 1
  src/api.ts — 1 conflict(s)
    ✗ function `process`: both modified

✓ Merge would be clean (1 file(s) auto-resolved by weave)
weave-core       # Library: entity extraction, 3-way merge algorithm, reconstruction
weave-driver     # Git merge driver binary (called by git via %O %A %B %L %P)
weave-cli        # CLI: `weave setup` and `weave preview`

Uses sem-core for entity extraction via tree-sitter grammars.

         base
        /    \
     ours    theirs
        \    /
       weave merge
  1. Parse all three versions into semantic entities via tree-sitter
  2. Extract regions — alternating entity and interstitial (imports, whitespace) segments
  3. Match entities across versions by ID (file:type:name:parent)
  4. Resolve each entity: one-side-only changes win, both-changed attempts intra-entity 3-way merge
  5. Reconstruct file from merged regions, preserving ours-side ordering
  6. Fallback to line-level merge for files >1MB, binary files, or unsupported types
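Step 5's ordering rule can be sketched as follows (illustrative; the `reconstruct` name and list-of-entity-IDs inputs are assumptions, and the real pipeline also carries interstitial regions such as imports and whitespace along with the entities):

```python
# Illustrative sketch of step 5: rebuild the file from merged entities,
# preserving ours-side ordering. Interstitial regions are omitted here;
# appending theirs-only entities at the end is the simplest placement choice.

def reconstruct(ours_order, theirs_order, merged):
    """ours_order/theirs_order: entity IDs in file order; merged: id -> text."""
    out, emitted = [], set()
    for eid in ours_order:                  # ours-side ordering wins
        if eid in merged:
            out.append(merged[eid])
            emitted.add(eid)
    for eid in theirs_order:                # entities only theirs added go last
        if eid in merged and eid not in emitted:
            out.append(merged[eid])
    return "\n\n".join(out) + "\n"

# ours has [a, b]; theirs added c; output keeps ours order, then appends c.
print(reconstruct(["a", "b"], ["a", "c"], {"a": "A", "b": "B", "c": "C"}))
```

Entities deleted during the merge simply never appear in `merged`, so they drop out of the reconstruction without any special casing.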


Comments

  • By rs545837 2026-03-04 3:14 (4 replies)

    Some context on the validation so far: Elijah Newren, who wrote git's merge-ort (the default merge strategy), reviewed weave and said language-aware content merging is the right approach, that he's been asked about it enough times to be certain there's demand, and that our fallback-to-line-level strategy for unsupported languages is "a very reasonable way to tackle the problem." Taylor Blau from the Git team said he's "really impressed" and connected us with Elijah. The creator of libgit2 starred the repo. Martin von Zweigbergk (creator of jj) has also been excited about the direction. We are also working with the GitButler team to integrate it as a research feature.

    The part that's been keeping me up at night: this becomes critical infrastructure for multi-agent coding. When multiple agents write code in parallel (Cursor, Claude Code, Codex all ship this now), they create worktrees for isolation. But when those branches merge back, git's line-level merge breaks on cases where two agents added different functions to the same file. weave resolves these cleanly because it knows they're separate entities. 31/31 vs git's 15/31 on our benchmark.

    Weave also ships as an MCP server with 14 tools, so agents can claim entities before editing, check who's touching what, and detect conflicts before they happen.

    • By tveita 2026-03-04 16:13 (1 reply)

      > Elijah Newren, who wrote git's merge-ort (the default merge strategy), reviewed weave and said language-aware content merging is the right approach, that he's been asked about it enough times to be certain there's demand, and that our fallback-to-line-level strategy for unsupported languages is "a very reasonable way to tackle the problem." Taylor Blau from the Git team said he's "really impressed" and connected us with Elijah. The creator of libgit2 starred the repo. Martin von Zweigbergk (creator of jj) has also been excited about the direction.

      Are any of these statements public, or is this all private communication?

      > We are also working with GitButler team to integrate it as a research feature.

      Referring to this discussion, I assume: https://github.com/gitbutlerapp/gitbutler/discussions/12274

      • By rs545837 2026-03-04 16:16

        Email conversations with Elijah and Taylor are private. Martin commented on our X post that went viral, and suggested a new benchmark design.

    • By deckar01 2026-03-04 4:06 (2 replies)

      Does this actually matter for multi-agent use cases? Surely people that are using swarms of AI agents to write code are just letting them resolve merge conflicts.

      • By rs545837 2026-03-04 4:13 (1 reply)

        So that you don't feel I'm just biased about my own thing, here's more context that it's not only me: people on Twitter regularly describe how merging breaks when you're running production-level code and frequently merging different branches.

        https://x.com/agent_wrapper/status/2026937132649247118 https://x.com/omega_memory/status/2028844143867228241 https://x.com/vincentmvdm/status/2027027874134343717

        • By deckar01 2026-03-04 4:34 (1 reply)

          Those users all work for companies that sell AI tools. And the first one literally says they let AI fix merge conflicts. The second one is in a thread advocating for 0 code review (which this can’t guarantee) (and also ew). The third is also saying to just have another bot handle merging.

          • By rs545837 2026-03-04 4:38 (1 reply)

            Thanks a lot for the fair criticism, appreciate it! You're right that those links aren't the strongest evidence. The real argument isn't "people are complaining on twitter." It's much simpler: when two agents add different functions to the same file, git creates a conflict that doesn't need to exist. Weave just knows they're separate entities and merges cleanly. Whether you let AI resolve the false conflict or avoid it entirely is a design choice; we think avoiding it is better.

            • By deckar01 2026-03-04 4:54 (1 reply)

              Dear god, it’s bots all the way down.

              • By rs545837 2026-03-04 5:04 (1 reply)

                What do you mean?

                • By deckar01 2026-03-04 5:14 (2 replies)

                  It’s your GitHub profile. It looks suspiciously just like the other 10 GitHub users that have been spamming AI generated issues and PRs for the last 2 weeks. They always go quiet eventually. I suspect because they are violating GitHub’s ToS, but maybe they just run out of free tokens.

                  • By rs545837 2026-03-04 5:37 (1 reply)

                    Thanks again for the criticism; tackling each of your points:

                    On GitHub's ToS, since you suspect a violation, let me help clarify what actually violates it.

                    > What violates it:

                            1. Automated bulk issues/PRs on repos we don't own
                            2. Fake stars or engagement farming
                            3. Using bot accounts

                    We own the repo, there's not even a single fake star, and I don't even know how to create a bot account lol.

                    > The scenario where we run out of free tokens:

                    OpenAI and Anthropic have been sponsoring my company with credits, because I am trying to architect new software for a post-AGI world, so if I run out I will ask them for more tokens.

                    • By deckar01 2026-03-04 5:41 (1 reply)

                      And you are opening issues on projects trying to get them to adopt your product. Seems like spam to me. How much are you willing to spend maintaining this project if those free tokens go away?

                      • By rs545837 2026-03-04 6:12

                        When you're just a normal guy genuinely trying to build something great and there's nobody who believes in you yet, the only thing you can do is go to projects you admire and ask "would this help you?" Patrick Collison did the same thing early on, literally taking people's laptops to install Stripe.

                  • By Palanikannan 2026-03-04 5:49 (1 reply)

                    https://github.com/Ataraxy-Labs/weave/pull/11

                    Dude, did you just call me AI-generated haha, I've been actively using weave for a GUI I've been building for blazingly fast diffs

                    https://x.com/Palanikannan_M/status/2022190215021126004

                    So whenever I run into bugs I patched locally in my clone, I try to let the clanker raise a pr upstream, insane how easy things are now.

                    • By deckar01 2026-03-04 6:07 (1 reply)

                      [flagged]

                      • By rs545837 2026-03-04 6:48

                        Nope, that's another user; he has been working with me on weave. Check the PRs you're calling AI-generated.

      • By vidarh 2026-03-04 13:09 (1 reply)

        I'm running agents doing merges right now, and yes and no. They can resolve merges, but it often takes multiple extra rounds. If you can avoid that more often it will definitely save both time and money.

        • By rs545837 2026-03-04 16:08

          Thanks again for the great explanation.

    • By kubb 2026-03-04 8:03 (1 reply)

      Congrats on getting acknowledged by people with credibility.

      I also think that this approach has a lot of potential. Keep up the good work sir.

      • By rs545837 2026-03-04 8:55

        Thanks a lot! Appreciate it.

  • By gritzko 2026-03-04 4:01 (8 replies)

    At this point, the question is: why keep files as blobs in the first place? If a revision control system stores AST trees instead, all the work is AST-level. One can then run SQL-level queries to see what is changing where. Like

      - do any concurrent branches touch this function?
      - what new uses did this function accrete recently?
      - did we create any actual merge conflicts?
    
    Almost LSP-level querying, involving versions and branches. Beagle is a revision control system like that [1]

    It is quite early stage, but the surprising finding is: instead of being a repository of source code blobs, an SCM can be the hub of all activities. Beagle's architecture is extremely open, in the assumption that a lot of things can be built on top of it. Essentially, it is a key-value db: keys are URIs and values are BASON (binary mergeable JSON) [2]. Can't be more open than that.

    [1]: https://github.com/gritzko/librdx/tree/master/be

    [2]: https://github.com/gritzko/librdx/blob/master/be/STORE.md

    • By rs545837 2026-03-04 4:04 (3 replies)

      This is the right question. Storing ASTs directly would make all of this native instead of layered on top.

      The pragmatic reason weave works at the git layer: adoption. Getting people to switch merge drivers is hard enough, getting them to switch VCS is nearly impossible. So weave parses the three file versions on the fly during merge, extracts entities, resolves per-entity, and writes back a normal file that git stores as a blob. You get entity-level merging without anyone changing their workflow.

      But you're pointing at the ceiling of that approach. A VCS that stores ASTs natively could answer "did any concurrent branches touch this function?" as a query, not as a computation. That's a fundamentally different capability. Beagle looks interesting, will dig into the BASON format.

      We built something adjacent with sem (https://github.com/ataraxy-labs/sem) which extracts the entity dependency graph from git history. It can answer "what new uses did this function accrete" and "what's the blast radius of this change" but it's still a layer on top of git, not native storage.

    • By zokier 2026-03-04 9:52 (1 reply)

      > At this point, the question is: why keep files as blobs in the first place. If a revision control system stores AST trees instead, all the work is AST-level.

      The problem is that disks (and storage in general) store only bytes so you inherently need to deal with bytes at some point. You could view source code files as the serialization of the AST (or other parse tree).

      This is especially apparent with LISPs and their sexprs, but equally applies to other languages too.

      • By rs545837 2026-03-04 23:55

        Source code is already a serialization of an AST, we just forgot that and started treating it as text. The practical problem is adoption: every tool in the ecosystem reads bytes.

    • By pfdietz 2026-03-04 5:36 (1 reply)

      Well, if you're programming in C or C++, there may not be a parse tree. Tree-sitter makes a best-effort attempt to parse, but it can't in general due to the preprocessor.

      • By rs545837 2026-03-04 5:47 (1 reply)

        Great point. C/C++ with macros and preprocessor directives is where tree-sitter's error recovery gets stretched. We support both C and C++ in sem-core (https://github.com/Ataraxy-Labs/sem), but entity extraction is best-effort for heavily macro'd code. For most application-level C++ it works well, but something like the Linux kernel would be rough. Honestly, that's an argument for gritzko's AST-native storage approach, where the parser can be more tightly integrated.

        • By pfdietz 2026-03-04 10:28 (1 reply)

          It's an argument against preprocessors for programming languages.

          Tree-sitter's error handling is constrained by its intended use in editors, so incrementality and efficiency are important. For diffing/merging, a more elaborate parsing algorithm might be better, for example one that uses an Earley/CYK-like algorithm but attempts to minimize some error term (which a dynamic programming algorithm could be naturally extended to.)

          • By rs545837 2026-03-04 16:00

            Interesting idea. Tree-sitter's trade-off (speed + incrementality over completeness) makes sense for editors but you're right that for merge/diff a more thorough parser could be worth the cost since it's a cold path, not real-time. We only parse three file versions at merge time so spending an extra 50ms on a better parse would be fine. Worth exploring, thanks for the pointer.

    • By samuelstros 2026-03-04 7:03 (2 replies)

      How do you get blob file writes fast?

      I built lix [0], which stores ASTs instead of blobs.

      Direct AST writing works for apps that are "AST-aware". And I can confirm, it works great.

      But all software just writes bytes atm.

      The binary -> parse -> diff pipeline is too slow.

      The parse and diff steps need to get out of the hot path. That semi-defeats the idea of a VCS that stores ASTs, though.

      [0] https://github.com/opral/lix

      • By gritzko 2026-03-04 7:42 (1 reply)

        I only diff the changed files. Producing a blob out of a BASON AST is trivial (one scan). Things may get slow for larger files, e.g. the tree-sitter C++ parser is a 25MB C file, 750KLoC. It takes a couple of seconds to import. But it never changes, so no biggie.

        There is room for improvement, but that is not a show-stopper so far. I plan to round-trip the Linux kernel with full history; that should surface all the bottlenecks.

        P.S. I checked lix. It uses a SQL database. That solves some things, but also creates an impedance mismatch. Must be a 10x slowdown at least. I use key-value and a custom binary format, so it works nicely. Can go one level deeper still and use a custom storage engine; it will be even faster. Git is all custom.

        • By rs545837 2026-03-04 17:17

          Good framing. Source code is already a serialization of an AST, we just forgot that and started treating it as text. The practical problem is adoption: every tool in the ecosystem reads bytes.

      • By rs545837 2026-03-04 7:11

        This is exactly why weave stays on top of git instead of replacing storage. Parsing three file versions at merge time is fine (around 5-67ms). Parsing on every read/write would be a different story. I know about Lix, but will check it out again.

    • By orbisvicis 2026-03-04 9:26 (1 reply)

      That's a really good point. I'm not familiar with Unison, but I think that's the idea behind the language?

      https://www.unison-lang.org/

      • By rs545837 2026-03-04 9:45

        This is actually cool, gonna check it out.

    • By philipallstar 2026-03-04 11:37 (1 reply)

      You might need a bit more than ASTs, as you need code to be human-readable as well as machine-readable. Maybe CSTs?

      • By rs545837 2026-03-04 16:26

        CSTs are the right call for round-tripping, but isn't that essentially what tree-sitter gives you? A concrete syntax tree that preserves whitespace, comments, and formatting.

    • By jerf 2026-03-04 14:36 (1 reply)

      Everything on a disk ends up as a linear sequence of bytes. This is the source of the term "serialization", which I think is easy to hear as a magic word without realizing that it is actually telling you something important in its etymology: It is the process of taking an arbitrary data structure and turning it into something that can be sent or stored serially, that is, in an order, one bit at a time if you really get down to it. To turn something into a file, to send something over a socket, to read something off a sheet of paper to someone else, it has to be serialized.

      The process of taking such a linear stream and reconstructing the arbitrary data structure used to generate it (or, in more sophisticated cases, something related to it if not identical), is deserialization. You can't send anyone a cyclic graph directly but you can send them something they can deserialize into a cyclic graph if you arrange the serialization/deserialization protocol correctly. They may deserialize it into a raw string in some programming language so they can run regexes over it. They may deserialize it into a stream of tokens. This all happens from the same source of serialized data.

      So let's say we have an AST in memory. As complicated as your language likes, however recursive, however cross-"module", however bizarre it may be. But you want to store it on a disk or send it somewhere else. In that case it must be serialized and then deserialized.

      What determines what the final user ends up with is not the serialization protocol. What determines what the final user ends up with is the deserialization procedure they use. They may, for instance, drop everything except some declaration of what a "package" is if they're just doing some initial scan. They may deserialize it into a compiler's AST. They may deserialize it into tree sitter's AST. They may deserialize it into some other proprietary AST used by a proprietary static code analyzer with objects designed to not just represent the code but also be immediately useful in complicated flow analyses that no other user of the data is interested in using.

      The point of this seemingly rambling description of what serialization is is that

      "why keep files as blobs in the first place. If a revision control system stores AST trees instead"

      doesn't correspond to anything actionable or real. Structured text files are already your programming language's code stored as ASTs. The corresponding deserialization format involves "parsing" them, which is a perfectly sensible and very, very common deserialization method. For example, the HTML you are reading was deserialized into the browser's data structures, which are substantially richer than "just" an AST of HTML due to all the stuff a browser does with the HTML, with a very complicated parsing algorithm defined by the HTML standard. The textual representation may be slightly suboptimal for some purposes but they're pretty good at others (e.g., lots of regexes run against code over the years). If you want some other data structure in the consumer, the change has to happen in the code that consumes the serialized stream. There is no way to change the code as it is stored on disk to make it "more" or "less" AST-ish than it already is, and always has been.

      You can see that in the article under discussion. You don't have to change the source code, which is to say, the serialized representation of code on the disk, to get this new feature. You just have to change the deserializer, in this case, to use tree sitter to parse instead of deserializing into "an array of lines which are themselves just strings except maybe we ignore whitespace for some purposes".

      Once you see the source code as already being an AST, it is easy to see that there are multiple ways you could store it that could conceivably be optimized for other uses... but nothing you do to the serialization format is going to change what is possible at all, only adjust the speed at which it can be done. There is no "more AST-ish" representation that will make this tree sitter code any easier to write. What is on the disk is already maximally "AST-ish" as it is today. There isn't any "AST-ish"-ness being left on the table. The problem was always the consumers, not the representation.

      And as far as I can tell, it isn't generally the raw deserialization speed that is the problem with source code nowadays. Optimizing the format for any other purpose would break the simple ability to read it as source code, which is valuable in its own right. But then, nothing stops you from representing source code in some other way right now if you want... but that doesn't open up possibilities that were previously impossible, it just tweaks how quickly some things will run.

      • By rs545837 2026-03-04 21:56

        Interesting read; will comment more once I go through everything in detail. Thanks.

    • By handfuloflight 2026-03-04 5:39 (1 reply)

      Well, I'll be diving in. Thank you for sharing. Same for Weave.

      • By rs545837 2026-03-04 5:51

        Awesome, let me know how it goes. Happy to help if you hit any rough edges.

  • By _flux 2026-03-04 7:04 (2 replies)

    How does it compare to https://mergiraf.org/ ? I've had good experience with it so far, although I rarely even need it.

    It's also based on treesitter, but probably otherwise a more baseline algorithm. I wonder if that "entity-awareness" actually then brings something to the table in addition to the AST.

    edit: man, I tried searching this thread for mentions of the tool a few times, but apparently its name is not mergigraf

HackerNews