Sem – Semantic version control. Entity-level diffs on top of Git

2026-03-08 6:01 · 9524 · github.com

Semantic version control CLI. Entity-level diff, blame, graph, and impact analysis for code. 16 languages via tree-sitter. - Ataraxy-Labs/sem

Semantic version control. Entity-level diffs on top of Git.

Instead of "line 43 changed", sem tells you "function validateToken was added in src/auth.ts".

sem diff

┌─ src/auth/login.ts ──────────────────────────────────
│
│  ⊕ function  validateToken          [added]
│  ∆ function  authenticateUser       [modified]
│  ⊖ function  legacyAuth             [deleted]
│
└──────────────────────────────────────────────────────

┌─ config/database.yml ─────────────────────────────────
│
│  ∆ property  production.pool_size   [modified]
│    - 5
│    + 20
│
└──────────────────────────────────────────────────────

Summary: 1 added, 2 modified, 1 deleted across 2 files

Build from source (requires Rust):

git clone https://github.com/Ataraxy-Labs/sem
cd sem/crates
cargo install --path sem-cli

Or grab a binary from GitHub Releases.

Works in any Git repo. No setup required.

# Semantic diff of working changes
sem diff

# Staged changes only
sem diff --staged

# Specific commit
sem diff --commit abc1234

# Commit range
sem diff --from HEAD~5 --to HEAD

# JSON output (for AI agents, CI pipelines)
sem diff --format json

# Read file changes from stdin (no git repo needed)
echo '[{"filePath":"src/main.rs","status":"modified","beforeContent":"...","afterContent":"..."}]' \
  | sem diff --stdin --format json

# Only specific file types
sem diff --file-exts .py .rs

# Entity dependency graph
sem graph

# Impact analysis (what breaks if this entity changes?)
sem impact validateToken

# Entity-level blame
sem blame src/auth.ts
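
The --stdin mode above takes a JSON array of file changes. A minimal Python sketch for building that payload, assuming only the four fields shown in the example (filePath, status, beforeContent, afterContent). The helper name build_payload is hypothetical, and status values other than "modified" are not documented in this text.

```python
import json

def build_payload(changes):
    """Serialize (path, status, before, after) tuples into the JSON array
    shape shown in the --stdin example. Field names come from that example;
    nothing else about the schema is assumed."""
    return json.dumps([
        {
            "filePath": path,
            "status": status,
            "beforeContent": before,
            "afterContent": after,
        }
        for path, status, before, after in changes
    ])

# Made-up file contents, for illustration only.
payload = build_payload([
    ("src/main.rs", "modified", "fn main() {}\n", "fn main() { run(); }\n"),
])
```

If sem is on the PATH, the string could then be passed along with something like subprocess.run(["sem", "diff", "--stdin", "--format", "json"], input=payload, text=True), mirroring the shell pipeline above.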

13 programming languages with full entity extraction via tree-sitter:

Language    Extensions          Entities
TypeScript  .ts .tsx            functions, classes, interfaces, types, enums, exports
JavaScript  .js .jsx .mjs .cjs  functions, classes, variables, exports
Python      .py                 functions, classes, decorated definitions
Go          .go                 functions, methods, types, vars, consts
Rust        .rs                 functions, structs, enums, impls, traits, mods, consts
Java        .java               classes, methods, interfaces, enums, fields, constructors
C           .c .h               functions, structs, enums, unions, typedefs
C++         .cpp .cc .hpp       functions, classes, structs, enums, namespaces, templates
C#          .cs                 classes, methods, interfaces, enums, structs, properties
Ruby        .rb                 methods, classes, modules
PHP         .php                functions, classes, methods, interfaces, traits, enums
Fortran     .f90 .f95 .f        functions, subroutines, modules, programs

Plus structured data formats:

Format    Extensions  Entities
JSON      .json       properties, objects (RFC 6901 paths)
YAML      .yml .yaml  sections, properties (dot paths)
TOML      .toml       sections, properties
CSV       .csv .tsv   rows (first column as identity)
Markdown  .md .mdx    heading-based sections

Everything else falls back to chunk-based diffing.
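
The path-based identities listed for JSON and YAML can be illustrated with a short Python sketch. It flattens nested data into RFC 6901 pointers, diffs the leaf properties, and renders a dot path for display. This is only an illustration of the idea, not sem's implementation; the function names and the sample configuration values are made up.

```python
def flatten(value, pointer=""):
    """Yield (json_pointer, leaf_value) pairs for nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            # RFC 6901 escaping: "~" becomes "~0" and "/" becomes "~1".
            token = str(key).replace("~", "~0").replace("/", "~1")
            yield from flatten(child, f"{pointer}/{token}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{pointer}/{index}")
    else:
        yield pointer, value

def diff_properties(before, after):
    """Return {pointer: (change_type, old, new)} for leaf properties that differ."""
    old, new = dict(flatten(before)), dict(flatten(after))
    changes = {}
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            changes[path] = ("added", None, new[path])
        elif path not in new:
            changes[path] = ("deleted", old[path], None)
        elif old[path] != new[path]:
            changes[path] = ("modified", old[path], new[path])
    return changes

def dot_path(pointer):
    """Render a pointer as a dot path (does not handle keys containing dots)."""
    return pointer.lstrip("/").replace("/", ".")

# Hypothetical before/after versions of a parsed config file.
before = {"production": {"pool_size": 5, "adapter": "postgresql"}}
after = {"production": {"pool_size": 20, "adapter": "postgresql"}}
changes = diff_properties(before, after)
```

Here changes holds a single "modified" entry keyed by /production/pool_size, which dot_path renders as production.pool_size, the same identity shown in the YAML diff earlier.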

Three-phase entity matching:

  1. Exact ID match — same entity in before/after = modified or unchanged
  2. Structural hash match — same AST structure, different name = renamed or moved (ignores whitespace/comments)
  3. Fuzzy similarity — >80% token overlap = probable rename

This means sem detects renames and moves, not just additions and deletions. Structural hashing also distinguishes cosmetic changes (whitespace, formatting) from real logic changes.
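
The three phases can be sketched in Python as follows. This is a toy model under stated assumptions, not sem's code: entities are plain dicts with hypothetical id, name, and body keys, a regex tokenizer stands in for a tree-sitter AST, sha256 stands in for xxhash, and "token overlap" is interpreted as Jaccard similarity of token sets.

```python
import hashlib
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokens(entity):
    """Tokenize the body, dropping // comments and all whitespace, and
    replacing the entity's own name with a placeholder."""
    body = re.sub(r"//[^\n]*", "", entity["body"])
    return ["<name>" if t == entity["name"] else t for t in TOKEN.findall(body)]

def structural_hash(entity):
    # sha256 stands in for xxhash to stay within the standard library.
    return hashlib.sha256(" ".join(tokens(entity)).encode()).hexdigest()

def overlap(a, b):
    """Jaccard overlap of the two token sets, between 0.0 and 1.0."""
    ta, tb = set(tokens(a)), set(tokens(b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def match(before, after, threshold=0.8):
    """Pair entities in three phases; return (pairs, deleted, added)."""
    pairs = []
    old = {e["id"]: e for e in before}
    new = {e["id"]: e for e in after}
    # Phase 1: exact ID match (modified or unchanged).
    for eid in sorted(old.keys() & new.keys()):
        pairs.append((old.pop(eid), new.pop(eid), "exact"))
    # Phase 2: identical structural hash under a different ID (rename or move).
    by_hash = {structural_hash(e): eid for eid, e in new.items()}
    for eid in list(old):
        target = by_hash.pop(structural_hash(old[eid]), None)
        if target is not None:
            pairs.append((old.pop(eid), new.pop(target), "structural"))
    # Phase 3: best fuzzy candidate above the threshold (probable rename).
    for eid in list(old):
        best = max(new.values(), key=lambda e: overlap(old[eid], e), default=None)
        if best is not None and overlap(old[eid], best) > threshold:
            pairs.append((old.pop(eid), new.pop(best["id"]), "fuzzy"))
    return pairs, list(old.values()), list(new.values())

# Made-up entities: login is edited in place, legacyAuth is renamed.
before = [
    {"id": "a.ts::function::legacyAuth", "name": "legacyAuth",
     "body": "function legacyAuth(u) { return check(u); }"},
    {"id": "a.ts::function::login", "name": "login",
     "body": "function login(u, p) { return verify(u, p); }"},
]
after = [
    {"id": "a.ts::function::login", "name": "login",
     "body": "function login(u, p) { audit(u); return verify(u, p); }"},
    {"id": "a.ts::function::authenticate", "name": "authenticate",
     "body": "function authenticate(u) {\n  // renamed\n  return check(u);\n}"},
]
pairs, deleted, added = match(before, after)
```

In this sample, login pairs up in phase 1 and the renamed function pairs up in phase 2 despite its new comment and line breaks, so nothing is left over as added or deleted. Whatever remains unmatched after phase 3 would be reported as a genuine deletion or addition.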

{
  "summary": {
    "fileCount": 2,
    "added": 1,
    "modified": 1,
    "deleted": 1,
    "total": 3
  },
  "changes": [
    {
      "entityId": "src/auth.ts::function::validateToken",
      "changeType": "added",
      "entityType": "function",
      "entityName": "validateToken",
      "filePath": "src/auth.ts"
    }
  ]
}
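
For CI pipelines or agents, output of this shape is straightforward to consume. A minimal Python sketch, assuming only the fields visible in the sample above (the full schema is not shown here, and the function name summarize is hypothetical):

```python
import json
from collections import Counter

# The sample output from above, reproduced as-is (its change list is abbreviated).
SAMPLE = """
{
  "summary": {"fileCount": 2, "added": 1, "modified": 1, "deleted": 1, "total": 3},
  "changes": [
    {
      "entityId": "src/auth.ts::function::validateToken",
      "changeType": "added",
      "entityType": "function",
      "entityName": "validateToken",
      "filePath": "src/auth.ts"
    }
  ]
}
"""

def summarize(raw):
    """Group change descriptions by file and count changes by type."""
    report = json.loads(raw)
    by_file = {}
    for change in report["changes"]:
        by_file.setdefault(change["filePath"], []).append(
            f'{change["changeType"]} {change["entityType"]} {change["entityName"]}'
        )
    counts = Counter(change["changeType"] for change in report["changes"])
    return by_file, counts

by_file, counts = summarize(SAMPLE)
```

A CI step could, for example, fail the build when counts["deleted"] is non-zero for a protected path, or hand by_file to a reviewer bot.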

sem-core can be used as a Rust library dependency:

[dependencies]
sem-core = { git = "https://github.com/Ataraxy-Labs/sem", version = "0.3" }

Used by weave (semantic merge driver) and inspect (entity-level code review).

  • tree-sitter for code parsing (native Rust, not WASM)
  • git2 for Git operations
  • rayon for parallel file processing
  • xxhash for structural hashing
  • Plugin system for adding new languages and formats

MIT OR Apache-2.0

Comments

  • By perching_aix 2026-03-08 10:30 (2 replies)

    Anyone here using semantic diffing tools in their daily work? How good are they?

    I use some for e.g. YAML [0] and JSON [1], and they're nice [2], but these are comparatively simple languages.

    I'm particularly curious because just plain diffing ASTs is more on the "syntax-aware diffing" side rather than the "semantic diffing" side, yet most semantic tooling descriptions stop at saying they use ASTs.

    ASTs are not necessarily in a minimal / optimized form by construction I believe, so I'm pretty sure you'll have situations where a "semantic" differ will report a difference, whereas a compiler would still compile the given translation unit to the same machine bytecode after all the optimization passes during later levels. Not even necessarily for target platform dependent reasons.

    But maybe this doesn't matter much or would be more confusing than helpful?

    [0] dyff: https://github.com/homeport/dyff

    [1] jd: https://github.com/josephburnett/jd

    [2] they allow me to ignore ordering differences within arrays (arrays are ordered in YAML and JSON as per the standard), which I found to be a surprisingly rare and useful capability; the programs that consume the YAMLs and JSONs I use these on are not sensitive to these ordering differences

    • By rs545837 2026-03-09 7:08 (1 reply)

      Fair point on AST vs semantic. sem sits somewhere in between. It doesn't go as far as checking compiled output equivalence, but it does normalize the AST before hashing (we call it structural_hash), so purely cosmetic changes like reformatting or renaming a local variable won't show as a diff. The goal isn't "would the compiler produce the same binary" but "did the developer change the behavior of this entity." For most practical cases that's the useful boundary. The YAML/JSON ordering point is interesting, we handle JSON keys as entities so reordering doesn't conflict during merges.

      By the way creator here.

      • By perching_aix 2026-03-09 11:03 (1 reply)

        Hey there, thanks for checking in.

        Regarding the custom normalization step, that makes sense, and I don't really have much more to add either. Looked into it a bit further since, it seems that specifically with programming languages the topic gets pretty gnarly pretty quick for various language theory reasons. So the solution you settled on is understandable. I might spend some time comparing how various semantic toolings compare, I'd imagine they probably aim for something similar.

        > The YAML/JSON ordering point is interesting, we handle JSON keys as entities so reordering doesn't conflict during merges.

        Just to clarify, I specifically meant the ordering of elements within arrays, not the ordering of keys within an object. The order of keys in an object is relaxed as per the spec, so normalizing across that is correct behavior. What I'm doing with these other tools is technically a spec violation, but since I know that downstream tooling is explicitly order invariant, it all still works out and helps a ton. It's pretty ironic too, I usually hammer on about not liking there being options, but in this case an option is exactly the right way to go about this; you would not want this as a default.

    • By henrebotha 2026-03-08 13:47 (1 reply)

      I guess I don't understand the difference between semantic and syntax-aware, but I've been trying out difftastic which is a bit of an odd beast but does a great job at narrowing down diffs to the actual meaningful parts.

      • By rs545837 2026-03-09 23:37

        difftastic is solid. The difference is roughly: syntax-aware (difftastic) knows what changed in the tree, sem knows which entity changed and whether it actually matters. difftastic will show you that a node in the AST moved. sem will tell you "the function processOrder was modified, and 3 other functions across 2 files depend on it." difftastic is a better diff. sem is trying to be a different layer on top of git entirely.

  • By gritzko 2026-03-08 11:37 (1 reply)

    Was "sem" named "graft" last week and "got" a week before that? Everyone is vibing so hard it is difficult to keep track of things. Also, idea theft gets to entirely new levels. Bot swarms promote 100% vibed stolen projects... what a moment in time we all enjoy.

    Still, my two cents: Beagle the AST-level version control system, experimental

    https://github.com/gritzko/librdx/tree/master/be#readme

    It genuinely stores AST trees in (virtually any) key-value database (RocksDB at the moment). In fact, it is a versioned database for the code with very open format and complete freedom to build on top of it.

    be is in fact, more of a format/protocol, like in the good old days (HTTP, SMTP, XML, JSON - remember those?)

    • By rs545837 2026-03-09 7:10

      Different project, same author. sem has been sem since the first commit. Beagle looks interesting, storing ASTs directly in a KV store is a different approach. sem stays on top of git so there's zero migration cost, you keep your existing repo and workflows.

  • By mellosouls 2026-03-08 16:09 (2 replies)

    I like this but would think "Semantic" is pushing it a bit as it relies on function names and changes therein having a direct mapping to meaning with standard text processing.

    In fact, I fully expected a use of LLM to derive those meaningful descriptions before I checked the repo.

    Anyway I definitely see this as a useful thing to try out as a potential addition to the armoury, but as we go further along the route to AI-coding I expect the diffs to be abstracted even further (derived using AI), for use by both agentic and human contributors.

    • By rs545837 2026-03-09 7:12

      You're right that "semantic" is doing some heavy lifting in the name. We use it to mean "understands code structure" rather than "understands code meaning." sem knows that something like "validateToken" is a function and tracks it as an entity across versions, but it doesn't know what validation means. For the merge use case (weave - https://ataraxy-labs.github.io/weave/), that level of understanding is enough to resolve 100% of false conflicts on our benchmarks. LLM-powered semantic understanding is the next layer, and that's what our review tool (inspect) does, it uses sem's entity extraction to triage what an LLM should look at.

    • By RaftPeople 2026-03-08 17:27 (1 reply)

      > I like this but would think "Semantic" is pushing it a bit

      It would be nice to get to the feature level, meaning across files/classes/functions etc.

      • By rs545837 2026-03-09 7:13

        sem already does this. sem graph builds a cross-file entity dependency graph and sem impact tells you "if this function changes, these other entities across these many files are affected." It's transitive too, follows the full call chain.
