So you wanna build a local RAG?



When we launched Skald, we wanted it to not only be self-hostable, but also for one to be able to run it without sending any data to third-parties.

With LLMs getting better and better, privacy-sensitive organizations shouldn't have to choose between falling behind by forgoing frontier models and abandoning their commitment (or legal requirement) to data privacy.

So here's what we did to support this use case, along with some benchmarks comparing performance when using proprietary APIs versus self-hosted open-source tech.

RAG components and their OSS alternatives

A basic RAG usually has the following core components:

  • A vector database
  • A vector embeddings model
  • An LLM

Most setups also include these:

  • A reranker
  • Document parsing (for PDFs, PowerPoints, etc)

What that means is that when you're looking to build a fully local RAG setup, you'll need to substitute a local option for each of the SaaS providers you're using for those components.
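To make the component list concrete, here's a toy sketch of how the pieces fit together in a RAG query path. Every function is a hypothetical stand-in for whichever local service you pick; the bag-of-characters "embedding" and keyword-overlap "reranker" exist only to keep the sketch runnable:

```python
# Minimal RAG query path: each step maps to one swappable component.
# All functions are toy stand-ins, not any real provider's API.

def embed(text):
    # Embeddings model (e.g. a local Sentence Transformers service).
    # Toy stand-in: a bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def vector_search(query_vec, index, top_k=100):
    # Vector database (e.g. pgvector): nearest neighbours by cosine similarity.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0
    scored = [(cosine(query_vec, vec), doc) for doc, vec in index]
    return [doc for _, doc in sorted(scored, reverse=True)[:top_k]]

def rerank(query, docs, top_k=50):
    # Reranker (e.g. a cross-encoder): here a toy keyword-overlap score.
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)[:top_k]

def answer(query, context):
    # LLM (e.g. a llama.cpp server): stubbed out here.
    return f"Answer to {query!r} using {len(context)} chunks"

docs = ["postgres stores vectors", "docling parses pdf files", "llama runs locally"]
index = [(d, embed(d)) for d in docs]

q = "how are pdf files parsed"
candidates = vector_search(embed(q), index, top_k=3)
context = rerank(q, candidates, top_k=2)
print(answer(q, context))
```

Swapping a proprietary provider for a local one means replacing the body of one of these functions while keeping the overall flow unchanged.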

Here's a table with some examples of what we might use in a scenario where we can use third-party Cloud services and one where we can't:

| Component | Proprietary Options | Open-Source Options |
| --- | --- | --- |
| Vector Database | Pinecone, Turbopuffer, Weaviate Cloud, Qdrant Cloud | Qdrant, Weaviate, Postgres with pgvector |
| Vector Embeddings Provider | OpenAI, Cohere, Voyage | Sentence Transformers, BGE, E5 |
| LLM | GPT, Claude, Gemini | Llama, Mistral, GPT-OSS |
| Reranker | Cohere, Voyage | BGE Reranker, Sentence Transformers Cross-Encoder |
| Document Parsing | Reducto, Datalab | Docling |

Do note that running something locally does not mean it needs to be open-source, as one could pay for a license to self-host proprietary software. But at Skald our goal was to use fully open-source tech, which is what I'll be covering here.

The table above is far from exhaustive in either column, but it gives you an indication of what to research in order to pick a tool that works for you.

As with anything, what works for you will greatly depend on your use case. And you need to be prepared to run a few more services than you're used to if you've just been calling APIs.

For our local stack, we went with the easiest setup that got things working (and it does! see the writeup lower down), but we will be benchmarking all the other options to determine the best possible setup.

This is what we have today:

Vector DB: Postgres + pgvector. We already use Postgres and didn't want to bundle another service into our stack. This choice is controversial, and we will be running benchmarks to make a better-informed decision here. Note, though, that pgvector serves a lot of use cases well, all the way up to hundreds of thousands of documents.
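For reference, a minimal pgvector setup looks something like the sketch below. Table, column, and index names are illustrative (this is not Skald's actual schema), `vector(384)` matches all-MiniLM-L6-v2's output dimension, and the query vector literal is elided:

```sql
-- Enable the extension and store 384-dim embeddings.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    document  text,
    content   text,
    embedding vector(384)
);

-- An HNSW index on cosine distance keeps queries fast as the table grows.
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- Top-100 nearest chunks by cosine distance (<=> is pgvector's cosine operator).
SELECT content, embedding <=> '[...]' AS distance
FROM chunks
ORDER BY distance
LIMIT 100;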

Vector embeddings: Users can configure this in Skald and we use Sentence Transformers (all-MiniLM-L6-v2) as our default (solid all-around performer for speed and retrieval, English-only). I also ran Skald with bge-m3 (larger, multi-language) and share the results later in this post.

LLM: We don't even bundle a default with Skald and it's up to the users to run and manage this. I tested our setup with GPT-OSS 20B on EC2 (results shown below).

Reranker: Users can also configure this in Skald, and the default is the Sentence Transformers cross encoder (solid, English-only). I've also used bge-reranker-v2-m3 and mmarco-mMiniLMv2-L12-H384-v1 which offer multi-lingual support.

Document parsing: There isn't much of a question on this one. We're using Docling. It's great. We run it via docling-serve.

Does it perform though?

So the main goal here was first to get something working, then to ensure it worked well with our platform and could be easily deployed. From here we'll be running extensive benchmarks and working with our clients to provide a setup that both performs well and is not a nightmare to deploy and manage.

From that perspective, this was a great success.

Deploying a production instance of Skald with this whole stack took me 8 minutes, and that comes bundled with the vector database (well, Postgres), a reranking and embedding service, and Docling.

The only thing I needed to run separately was the LLM, which I did via llama.cpp.

Having gotten this sorted, I imported all the content from the PostHog website [1] and set up a tiny dataset [2] of questions and expected answers inside of Skald, then used our Experiments feature to run the RAG over this dataset.

I explicitly kept the topK values really high (100 for the vector search and 50 for post-reranking), as I was mostly testing for accuracy and wanted to see the performance when questions required e.g. aggregating context over 15+ documents.
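As a sketch, those retrieval settings correspond to a flow like the one below. The function names and the toy corpus are illustrative, not Skald's actual implementation:

```python
# Retrieval flow used for the experiment, sketched with stub components:
# wide vector search (topK=100), a distance cutoff (0.8), then reranking to 50.

def search_pipeline(query, vector_search, reranker,
                    search_top_k=100, distance_threshold=0.8, rerank_top_k=50):
    # 1. Cast a wide net with vector search.
    hits = vector_search(query, top_k=search_top_k)   # [(distance, chunk), ...]
    # 2. Drop anything beyond the distance threshold.
    hits = [(d, c) for d, c in hits if d <= distance_threshold]
    # 3. Let the reranker pick the best rerank_top_k chunks.
    scored = sorted(((reranker(query, c), c) for _, c in hits), reverse=True)
    return [c for _, c in scored[:rerank_top_k]]

# Toy stand-ins to make the sketch runnable:
def toy_vector_search(query, top_k):
    corpus = [(0.2, "funding round doc"), (0.5, "session replay doc"),
              (0.9, "unrelated doc")]
    return corpus[:top_k]

def toy_reranker(query, chunk):
    # Keyword overlap in place of a real cross-encoder score.
    return len(set(query.split()) & set(chunk.split()))

chunks = search_pipeline("funding round history", toy_vector_search, toy_reranker,
                         search_top_k=3, rerank_top_k=2)
print(chunks)
```

A high topK at the vector-search stage matters for aggregation-style questions: if the answer is spread over 15+ documents, a narrow first pass can drop relevant chunks before the reranker ever sees them.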

Full config

Here are the params configured in the Skald UI for the experiment.

| Config option | Selection |
| --- | --- |
| Extra system prompt | Be really concise in your answers |
| Query rewriting | Off |
| Vector search topK | 100 |
| Vector search distance threshold | 0.8 |
| Reranking | On |
| Reranking topK | 50 |
| References | Off |

So without any more delay, here are the results of my not-very-scientific-at-all benchmark, run using the experimentation platform inside of Skald.

Voyage + Claude

This is our default Cloud setup. We use voyage-3-large and rerank-2.5 from Voyage AI as our embedding and reranking models respectively, and we default to Claude 3.7 Sonnet for responses (users can configure the model, though).

It passed with flying colors.

Our LLM-as-a-Judge gave an average score of 9.45 to the responses, and I basically agree with the assessment. All answers were correct, with one missing a few extra bits of context.

Voyage + GPT-OSS 20B

With the control experiment done, I then moved on to a setup where I kept Voyage as the embeddings provider and reranker, and then used GPT-OSS 20B running on a llama.cpp server on a g5.2xlarge EC2 instance as the LLM.

The goal here was to see how well the open-source model itself stacked up against a frontier model accessed via API.

And it did great!

We don't yet support LLM-as-a-Judge on fully local deployments, so the only score we have here is mine. I scored the answers an average of 9.18 and they were all correct, with two of them just missing a few bits of information or highlighting less relevant information from the context.

Fully local + GPT-OSS 20B

Lastly, it was time for the moment of truth: running a fully local setup.

For this I ran two tests:

1. Default sentence transformers embedding and reranking models

The most popular open-source models are all-MiniLM-L6-v2 for embeddings and ms-marco-MiniLM-L6-v2 as the reranker, so I used those for my first benchmark.

Here the average score was 7.10. Not bad, but definitely not great. However, when we dig into the results, we can get a better understanding of how this setup fails.

Basically, it got all point queries right: questions where the answer is somewhere in the mess of documents but can be found in one specific place.

Where it failed was:

  • Non-English query: the embeddings model and the reranker are English-only, so my question in Portuguese predictably got no answer
  • An ambiguous question with very little context ("what's ch")
  • Aggregating information from multiple documents/chunks e.g. it only found 5 out of PostHog's 7 funding rounds, and only a subset of the PostHog competitors that offer session replay (as mentioned in the source data)

In my view, this is good news. It means the default options will go a long way and should give you very good performance if your use case only involves point queries in English. The other great thing is that these models are also fast.

Now, if you need to handle ambiguity better, or handle questions in other languages, then this setup is simply not for you.

2. Multi-lingual models

The next test I did used bge-m3 as the embeddings model and mmarco-mMiniLMv2-L12-H384-v1 as the reranker. The embeddings model is supposedly much better than the one used in the previous test and is also multi-lingual. The reranker on the other hand uses the same cross-encoder from the previous test as the base model but also adds multi-lingual support. The more standard option here would have been the much more popular bge-reranker-v2-m3 model, but I found it to be much slower. I intend to tweak my setup and test it again, however.

Anyway, onto the results! I scored it 8.63 on average, which is very good. There were no complete failures, and it handled the question in Portuguese well.

The mistakes it made were:

  • This new setup also did not do the best job at aggregating information, missing 2 of PostHog's funding rounds, and a couple of its session replay competitors
  • It also answered a question correctly, but added incorrect additional context after it

So overall it performed quite well. Again, what we saw is that the main problem arises when the context needed for the response is scattered across multiple documents. There are various techniques to help with this, and we'll be trialing some soon! They haven't been needed on the Cloud version, because better models save you from having to add complexity for minimal performance gains, but as we're focused on building a really solid setup for local deploys, we'll be looking into this more and more.
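One such technique (to be clear, a common approach in the literature, not necessarily what Skald will ship) is multi-query retrieval with reciprocal rank fusion: decompose the question into sub-queries, retrieve for each, and merge the ranked lists so documents scattered across sub-queries still surface. A toy sketch with stubbed retrieval results:

```python
# Multi-query retrieval with reciprocal rank fusion (RRF):
# run several sub-queries, then merge their ranked result lists.

def rrf_merge(ranked_lists, k=60):
    # Standard RRF: score(doc) = sum over lists of 1 / (k + rank).
    scores = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Stubbed per-sub-query results for "how many raises did PostHog do?"
# (document names are hypothetical).
ranked_lists = [
    ["seed-round.md", "series-a.md", "yc-post.md"],   # sub-query: early funding
    ["series-d.md", "series-e.md", "series-a.md"],    # sub-query: late funding
]

merged = rrf_merge(ranked_lists)
print(merged)
```

A document that appears in several sub-query result lists ("series-a.md" here) gets boosted to the top, which is exactly the behavior you want when the answer spans many documents.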

Now what?

I hope this writeup has given you at least some insight and context into building a local RAG: it does work, it can serve a lot of use cases, and it should only get better as a) models improve and b) more open-source models appear across the board, both of which are trends we're already seeing.

As for us at Skald, we intend to polish this setup further so it serves even more use cases really well, and to soon publish more rigorous benchmarks for models in the open-source space, from LLMs to rerankers.

If you're a company that needs to run AI tooling in air-gapped infrastructure, let's chat -- feel free to email me at yakko [at] useskald [dot] com.

Lastly, if you want to get involved, feel free to chat to us over on our GitHub repo (MIT-licensed) or catch us on Slack.

[1] I used the PostHog website here because its content is MIT-licensed (yes, wild) and readily available as markdown on GitHub, and because, having worked there, I know a lot of the answers off the top of my head, making it a great dataset of ~2000 documents that I know well.

[2] The questions and answers dataset I used for the experiments was the following:

Dataset
| Question | Expected answer | Comments |
| --- | --- | --- |
| How many raises did PostHog do? | PostHog has raised money 7 times: it raised $150k from YCombinator, then did a seed round ($3.025M), a series A ($12M), a series B ($15M), a series C ($10M), a series D ($70M), and a series E ($75M). | Requires aggregating context from at least 7 documents |
| When did group analytics launch? | December 16, 2021. | Point query, multiple mentions of "group analytics" in the source docs |
| Why was the sessions page removed? | The sessions page was removed because it was confusing and limited in functionality. It was replaced by the 'Recordings' tab. | Point query, multiple mentions of "sessions" in the source docs |
| What's the difference between a product engineer and other roles? | Compared to product managers, product engineers focus more on building rather than deep research and planning. When it comes to software engineers, both product and software engineers write code, but software engineers focus on building great software, whereas product engineers focus on building great products. | Requires aggregating context from multiple docs + there are a ton of mentions of "product engineer" in the source docs |
| What were the main benefits of YC? | The main benefits of YC were: Network Access, Investor Reviews, Office Hours, Funding Opportunities, Hiring Resources, Angel Investing Opportunities, Accelerated Growth and Experience, Shift in Self-Perception, Customer Acquisition, Product Market Fit, Ambitious Goal Setting, Access to Thought Leaders, Community Support | Point query |
| quem foi o primeiro investidor da posthogg? | O primeiro investidor da PostHog foi o YCombinator. | Question in Portuguese, with PostHog misspelled |
| what posthog competitors also offer session replays | LogRocket, Smartlook, FullStory, Microsoft Clarity, Contentsquare, Mouseflow, Heap, Pendo, Hotjar, Glassbox, and Amplitude. | Requires aggregating content from at least 11 docs (more, actually, because I missed some in my expected answer) |
| top tips find client | 1. Leverage your inner circle 2. Join relevant communities 3. Be laser-focused 4. Set achievable goals 5. Frame conversations properly | Point query, worded weirdly |
| what's ch | CH most likely refers to ClickHouse, a column-oriented OLAP database. | Really ambiguous. I meant ClickHouse with my question. |
| what is mixedpanel | Mixpanel is a popular product analytics tool that was founded in 2009 | Mixpanel misspelled as Mixedpanel |
| how was prpoerty filtering made faster? | Using materialized columns allowed ClickHouse to skip JSON parsing during queries and made queries with property filtering 25x faster. | Point query with a typo |


Comments

  • By simonw 2025-11-28 17:50

    My advice for building something like this: don't get hung up on a need for vector databases and embedding.

    Full text search or even grep/rg are a lot faster and cheaper to work with - no need to maintain a vector database index - and turn out to work really well if you put them in some kind of agentic tool loop.

    The big benefit of semantic search was that it could handle fuzzy searching - returning results that mention dogs if someone searches for canines, for example.

    Give a good LLM a search tool and it can come up with searches like "dog OR canine" on its own - and refine those queries over multiple rounds of searches.

    Plus it means you don't have to solve the chunking problem!

    • By navar 2025-11-29 7:13

      I created a small app that shows the difference between embedding-based ("semantic") and bm25 search:

      http://search-sensei.s3-website-us-east-1.amazonaws.com/

      (warning! It will download ~50MB of data for the model weights and onnx runtime on first load, but should otherwise run smoothly even on a phone)

      It runs a small embedding model in the browser and returns search results in "real time".

      It has a few illustrative examples where semantic search returns the intended results. For example bm25 does not understand that "j lo" or "jlo" refer to Jennifer Lopez. Similarly embedding based methods can better deal with things like typos.

      EDIT: search is performed over 1000 news articles randomly sampled from 2016 to 2024

    • By andai 2025-11-29 1:55

      https://www.anthropic.com/engineering/contextual-retrieval

      Anthropic found embeddings + BM25 (keyword search) gave the best results. (Well, after contextual summarization, and fusion, and reranking, and shoving the whole thing into an LLM...)

      But sadly they didn't say how BM25 did on its own, which is the really interesting part to me.

      In my own (small scale) tests with embeddings, I found that I'd be looking right at the page that contained the literal words in my query and embeddings would fail to find it... Ctrl+F wins again!

      • By bredren 2025-11-29 6:59

        FWIW, the org decided against vector embeddings for Claude Code due in part to maintenance. See 41:05 here: https://youtu.be/IDSAMqip6ms

        • By mips_avatar 2025-11-29 8:06

          It would also blow up the price/latency of Claude code if every chunk of every file had to be read into haiku->summarized->sent to an embedding model ->reindexed into a project index and that index stored somewhere. Since there’s a lot of context inherent in things like the file structure, storing the central context in Claude.md is a lot simpler. I don’t think them not using vector embeddings in the project space is anything other than an indication that it’s hard to manage embeddings in Claude code.

          • By andai 2025-11-29 19:04

            Some agents integrate with code intelligence tools which do use embeddings, right? (As well as "mechanical" solutions like LSPs, I imagine.)

            I think it's just a case of "this isn't something we need to solve, other companies solve it already and then our thing can integrate with that."

            Or maybe it's really just marginal gains compared with iterative grepping. I don't know. (Still amazed how well that works!)

            • By mips_avatar 2025-11-30 1:07

              I think your last point captures it, for various reasons (RL, inherent structure of code) iterative grepping is unreasonably effective. Interestingly Cursor does use embedding vectors for codebase indexing:

              https://cursor.com/docs/context/codebase-indexing

              Seems like sometimes Cursor has a better understanding of the vibe of my codebase than Claude code, maybe this is part of it. Or maybe it’s just really marginally important in codebase indexing. Vector dbs still have a huge benefit in less verifiable domains.

        • By skrebbel 2025-11-29 14:42

          who's "the org"?

      • By noobcoder 2025-11-29 6:25

        No cross encoders?

    • By whakim 2025-11-29 2:01

      In my experience the semantic/lexical search problem is better understood as a precision/recall tradeoff. Lexical search (along with boolean operators, exact phrase matching, etc.) has very high precision at the expense of lower recall, whereas semantic search sits at a higher recall/lower precision point on the curve.

      • By simonw 2025-11-29 2:20

        Yeah, that sounds about right to me. The most effective approach does appear to be a hybrid of embeddings and BM25, which is worth exploring if you have the capacity to do so.

        For most cases though sticking with BM25 is likely to be "good enough" and a whole lot cheaper to build and run.

        • By mips_avatar 2025-11-29 8:08

          Depends on the app and how often you need to change your embeddings, but I run my own hybrid semantic/bm25 search on my MacBook Pro across millions of documents without too much trouble.

          • By echion 2025-12-122:17

            Can you elaborate a bit on your setup if you have time?

    • By cwmoore 2025-11-28 21:27

      I recently came across a “prefer the most common synonym” problem, in Google Maps, while searching for a poolhall—even literally ‘billiards’ returned results for swimming pools and chlorine. I wonder if some more NOTs aren’t necessary…interested in learning about RAGs though I’m a little behind the curve.

    • By mips_avatar 2025-11-28 19:29

      In my app the best lexical search approaches completely broke my agent. For my rag system the llm would on average take 2.1 lexical searches to get the results it needed. Which wasn’t terrible but it meant sometimes it needed up to 5 searches to find it which blew up user latency. Now that I have a hybrid semantic search + lexical search it only requires 1.1 searches per result.

      • By nostrebored 2025-11-28 23:09

        The problem is not using parallel tool calling or not returning a search array. We do this across large data sets and don’t see much of a problem. It also means you can swap algorithms on the fly. Building a BM25 index over a few thousand documents is not very expensive locally. Rg and grep are freeish. If you have information on folder contents you can let your agent decide at execution time based on information need.

        Embeddings just aren’t the most interesting thing here if you’re running a frontier fm.

        • By mips_avatar 2025-11-29 7:52

          Search arrays help, but parallel tool calling assumes you’ve solved two hard problems: generating diverse query variations, and verifying which result is correct. Most retrieval doesn’t have clean verification. The better approach is making search good enough that you sidestep verification as much as possible (hopefully you are only requiring the model to make a judgment call within its search array). In my case (OpenStreetMap data), lexical recall is unstable, but embeddings usually get it right if you narrow the search space enough—and a missed query is a stronger signal to the model that it’s done something wrong.

          Besides, if you could reliably verify results, you’ve essentially built an RL harness—which is a lot harder to do than building an effective search system and probably worth more.

    • By froobius 2025-11-28 18:52

      Hmm it can capture more than just single words though, e.g. meaningful phrases or paragraphs that could be written in many ways.

    • By leetrout 2025-11-28 18:15

      Simon have you ever given a talk or written about this sort of pragmatism? A spin on how to achieve this with Datasette is an easy thing to imagine IMO.

    • By scosman 2025-11-29 16:55

      Alternative advice: just test and see what works best for your use case. Totally agreed embeddings are often overkill. However, sometimes they really help. The flow is something like:

      - Iterate over your docs to build eval data: hundreds of pairs of [synthetic query, correct answer]. Focus on content from the docs not general LLM knowledge.

      - Kick off a few parallel evaluations of different RAG configurations to see what works best for your use case: BM25, Vector, Hybrid. You can do a second pass to tune parameters: embedding model, top k, re-ranking, etc.

      I build a free system that does all this (synthetic data from docs, evals, test various RAG configs without coding each version). https://docs.kiln.tech/docs/evaluations/evaluate-rag-accurac...

      • By simonw 2025-11-29 17:52

        That's excellent advice, the only downside being that collecting that eval data remains difficult and time-consuming.

        But if you want to build truly great search that's the approach to take.

        • By scosman 2025-11-30 0:58

          Agree totally. I’m spending half my time focused that problem (mostly synthetic data gen with guidance), and the other half on how to optimize once it works.

      • By sbene970 2025-11-29 18:53

        At this point you could also optimize your agentic flow directly in DSPy using a colbert model / Ratatouille for retrieval.

        • By scosman 2025-11-30 1:00

          Not there yet. The biggest vectors for optimizing aren’t in the agents yet (RAG method, embedding model, etc)

    • By victorbuilds 2025-11-29 10:13

      This matches what I found building an AI app for kids. Started with embeddings because everyone said to, then ripped it out and went with simple keyword matching. The extra complexity wasn't worth it for my use case. Most of the magic comes from the LLM anyway, not the retrieval layer.

    • By dmezzetti 2025-11-29 10:50

      Are multiple LLM queries faster than vector search? Even with the example "dog OR canine" that leads to two LLM inference calls vs one. LLM inference is also more expensive than vector search.

      In general RAG != Vector Search though. If a SQL query, grep, full text search or other does the job then by all means. But for relevance-based search, vector search shines.

    • By 7734128 2025-11-29 21:36

      No reason to try to avoid semantic search. Dead easy to implement, works across languages to some extent and the fuzziness is worth quite alot.

      You're realistically going to need chunks of some kind anyway to feed the LLM, and once you got those it's just a few lines of code to get a basic persistant ChromaDB going.

    • By tra3 2025-11-28 18:26

      I built a simple emacs package based on this idea [0]. It works surprisingly well, but I dont know how far it scales. It's likely not as frugal from a token usage perspective.

      0: https://github.com/dmitrym0/dm-gptel-simple-org-memory

    • By drittich 2025-11-29 13:48

      Do you have a standard prompt you use for this? I have definitely seen agentic tools doing this for me, e.g., when searching the local file system, but I'm not sure if it native behaviour for tool-using LLMs or if it is coerced via prompts.

      • By simonw 2025-11-29 15:35

        No, I've not got a good one for this yet. I've found the modern models (or the Claude Code etc harness) know how to do this already by default - you can ask them a question and give them a search tool and they'll start running and iterating on searches by themselves.

    • By enraged_camel 2025-11-28 18:34

      Yes, exactly. We have our AI feature configured to use our pre-existing TypeSense integration and it's stunningly competent at figuring out exactly what search queries to use across which collections in order to find relevant results.

      • By busssard 2025-11-28 19:05

        if this is coupled with powerful search engines beyond elastic then we are getting somewhere. other nonmonotonic engines that can find structural information are out there.

    • By petercooper 2025-11-29 21:48

      So kinda GAR - Generation-Augmented Retrieval :-)

    • By pstuart 2025-11-28 21:09

      Perhaps SQLite with FTS5? Or even better, getting DuckDB into the party as it's ecosystem seems ripe for this type of work.

    • By paulyy_y 2025-11-29 16:53

      Burying the lede here - your solution for avoiding using vector search is either offloading to 1) user, expecting them to remember the right terms or 2) using LLM to craft the search query? And having it iterate multiple times? Holy mother of inefficiency, this agentic focus is making us all brain dead.

      Vector DB's and embeddings are dead simple to figure out, implement, and maintain. Especially for a local RAG, which is the primary context here. If I want to find my latest tabular notes on some obscure game dealing with medical concepts, I should be able to just literally type that. It shouldn't require me remembering the medical terms, or having some local (or god forbid, remote) LLM iterate through a dozen combos.

      FWIW I also think this is a matter of how well one structures their personal KB. If you follow strict metadata/structure and have coherent/logical writing, you'll have better chance of getting results with text matching. For someone optimizing for vector space search, and minimizing the need for upfront logical structuring, it will not work out well.

      • By simonw 2025-11-29 17:54

        My opinion on this really isn't very extreme.

        Claude Code is widely regarded to be the best coding agent tool right now and it uses search, not embeddings.

        I use it to answer questions about files on my computer all the time.

  • By mips_avatar 2025-11-28 17:42

    One thing I didn’t see here that might be hurting your performance is a lack of semantic chunking. It sounds like you’re embedding entire docs, which kind of breaks down if the docs contain multiple concepts. A better approach for recall is using some kind of chunking program to get semantic chunks (I like spacy though you have to configure it a bit). Then once you have your chunks you need to append context to how this chunk relates to the rest of your doc before you do your embedding. I have found anthropics approach to contextual retrieval to be very performant in my RAG systems (https://www.anthropic.com/engineering/contextual-retrieval) you can just use gpt oss 20b as the model for generation of context.

    Unless I’ve misunderstood your post and you are doing some form of this in your pipeline you should see a dramatic improvement in performance once you implement this.

    • By yakkomajuri 2025-11-28 18:47

      hey, author (not op) here. we do do semantic chunking! I think maybe I gave the impression that we don't because of the mention of aggregating context but I tested this with questions that would require aggregating context from 15+ documents (meaning 2x that in chunks), hence the comment in the post!

      • By NebulaStorm456 2025-11-29 10:28

        Is there a way to convert documents into a hierarchical connected graph data structure which references each other similar to how we use personal knowledge tools like Obsidian and ability to traverse this graph? Is GraphRag technique trying to do this exactly?

      • By mips_avatar 2025-11-28 19:15

        Ah so you’re generating context from multiple docs for your chunks? How do you decide which docs get aggregated?

        • By nostrebored 2025-11-28 23:11

          Haven’t seen an answer better than “vibes” here. Especially with data across multiple domains.

          • By mips_avatar 2025-11-29 3:21

            I mean as long as they're not too long I suppose you could use just about any heuristic for grouping sources. Just seems like it would be hard to generate succinct context if you mess it up.

  • By abhashanand1501 2025-11-29 17:13

    My advice - use same rigor as other software development for a RAG application. Have a test suite (of say 100 cases) which says for this question correct response is this. Use an LLM judge to score each of the outputs of the RAG system. Now iterate till you get a score of 85 or so. And every change of prompts and strategy triggers this check, and ensures that output of 85 is always maintained.

HackerNews