
Learn about Late Chunking and how it may be the right fit for balancing cost and performance in your long context retrieval applications
Large-scale RAG applications that require long context retrieval deal with a unique set of challenges. The volume of data is often huge, while the precision of the retrieval system is critical. However, ensuring high-quality retrieval in such systems is often at odds with cost and performance. For users, this can present a difficult balancing act. But there may be a new solution that can even the scales.
Two weeks ago, JinaAI announced a new methodology to aid in long-context retrieval called late chunking. This article explores why late chunking may just be the happy medium between naive (but inexpensive) solutions and more sophisticated (but costly) solutions like ColBERT for users looking to build high-quality retrieval systems on long documents.

It appears we have a Goldilocks problem: the naive approach may be more cost-effective but can reduce precision, while late interaction and ColBERT offer increased precision at extreme cost. Surely there must be something in the middle that's just right? Well, late chunking may be exactly that.
As mentioned earlier, late chunking's origins are closely linked with late interaction, in that both utilise the token-level vector representations that are produced during the forward pass of an embedding model.
Unlike late interaction, late chunking includes a pooling step that occurs after the initial inference. This pooling differs from that of traditional embedding models, which pool the representations of every token into a single representation. In late chunking, pooling is instead done over segments of the text according to a predetermined chunking strategy, which can be aligned to token spans or boundary cues. Because the chunking is applied after inference rather than before it, the technique is called late chunking.
The result is that a long document is still represented by numerous embeddings, but critically, those embeddings are primed with contextual information from their neighboring chunks.
Late chunking requires a relatively simple alteration to the pooling step of the embedding model, which can be implemented in under 30 lines of code. Its vectors can be ingested as individual chunks into a vector database without any modification to the retrieval pipeline.
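To make the altered pooling step concrete, here is a minimal sketch in plain Python. It assumes the model's forward pass has already produced one vector per token for the whole document; the function name `late_chunk_pool`, the toy 2-dimensional vectors, and the `(start, end)` span convention (end exclusive) are illustrative choices, not JinaAI's implementation.

```python
from typing import List, Sequence, Tuple


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Mean-pool token vectors inside each (start, end) span; end is exclusive."""
    chunk_vectors = []
    for start, end in spans:
        if not 0 <= start < end <= len(token_embeddings):
            raise ValueError(f"invalid span ({start}, {end})")
        window = token_embeddings[start:end]
        dim = len(window[0])
        # Average each dimension over the tokens that belong to this chunk.
        chunk_vectors.append(
            [sum(tok[d] for tok in window) / len(window) for d in range(dim)]
        )
    return chunk_vectors


# Toy stand-in for the per-token output of ONE forward pass over the full document.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
spans = [(0, 2), (2, 4)]  # two chunks of two tokens each

chunk_vectors = late_chunk_pool(token_embeddings, spans)
print(chunk_vectors)  # [[2.0, 1.0], [1.0, 3.0]]
```

In a real pipeline the token vectors would come from a long context embedding model and the pooling would typically be done with tensor operations, but the logic is the same: one inference call, then one mean per chunk span.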
There are however some requirements needed ahead of performing late chunking:
Long context models are a requirement, as we need token representations for the entirety of the long document to make them contextually aware. Notably, JinaAI tested using their model jina-embeddings-v2-small-en, which has the highest performance-to-parameter ratio on MTEB's LongEmbed retrieval benchmark. This model supports up to 8192 tokens, which is roughly equivalent to 10 standard pages of text. It also uses mean pooling by default, which is a requirement for any model looking to take advantage of late chunking.
Chunking logic: being able to chunk text ahead of inference, and to associate each chunk with its corresponding token spans, is also critical to making late chunking work. Luckily, there are many ways to create chunks in this manner, and given late chunking's ability to condition each chunk on previous ones, simple approaches like fixed-size chunking without any overlap may be all that is needed.
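As an illustration of the simplest chunking logic mentioned above, the sketch below produces fixed-size, non-overlapping token spans. The helper name `fixed_size_spans` and the example document length of 300 tokens are assumptions made for the example; any strategy that yields token spans would work in its place.

```python
from typing import List, Tuple


def fixed_size_spans(num_tokens: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split token positions [0, num_tokens) into consecutive spans; end is exclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_tokens))
        for start in range(0, num_tokens, chunk_size)
    ]


# A hypothetical 300-token document chunked at 128 tokens with no overlap.
example_spans = fixed_size_spans(300, 128)
print(example_spans)  # [(0, 128), (128, 256), (256, 300)]
```

The final span is simply shorter when the document length is not a multiple of the chunk size.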
In the head-to-head comparison below, we used our recent blog post as a sample document to test late chunking against naive chunking. Using a fixed-size token chunking strategy (num tokens = 128) resulted in the following sentence being split across two different chunks:
Weaviate's native, multi-tenant architecture shines for customers who need to prioritize data privacy while maintaining fast retrieval and accuracy.
The two chunks that sentence was split between are:
| Chunk 1 | Chunk 2 |
|---|---|
| ...tech stacks to evolve. This optionality, combined with ease of use, helps teams scale AI prototypes into production faster. Flexibility is also vital when it comes to architecture. Different use cases have different requirements. For example, we work with many software companies and those operating in regulated industries. They often require multi-tenancy to isolate data and maintain compliance. When building a Retrieval Augmented Generation (RAG) application, using account or user-specific data to contextualize results, data must remain within a dedicated tenant for its user group. Weaviate’s native, multi-tenant architecture shines for customers who need to prioritize... | ...data privacy while maintaining fast retrieval and accuracy. On the other hand, we support some very large scale single-tenant use cases that orient toward real-time data access. Many of these are in e-commerce and industries that compete on speed and customer experience. |
To answer the query:
what do customers need to prioritise?
We need to return both of the above chunks for a gold standard answer. However, with the naive approach, the top two results are two separate chunks that are not even neighbors of one another.
But when we apply late chunking, we end up returning the exact two chunks to which the query is most relevant.
| Naive Approach (Top 2) | Late Chunking (Top 2) |
|---|---|
| 1. Chunk 8 (Similarity: 0.756): "product updates, join our upcoming webinar." | 1. Chunk 2 (Similarity: 0.701): "data privacy while maintaining fast retrieval and accuracy. On the other hand, we support some very large scale single-tenant use cases that orient toward real-time data access. Many of these are in e-commerce and industries that compete on speed and customer experience..." |
| 2. Chunk 3 (Similarity: 0.748): "diverse use cases and the evolving needs of developers. Introducing hot, warm, and cold storage tiers. It's amazing to see our customers' products gain popularity, attracting more users, and in many cases, tenants. However, as multi-tenant use cases scale, infrastructure costs can quickly become prohibitive..." | 2. Chunk 1 (Similarity: 0.689): "tech stacks to evolve. This optionality, combined with ease of use, helps teams scale AI prototypes into production faster. Flexibility is also vital when it comes to architecture. Different use cases have different requirements. For example, we work with many software companies and those operating in regulated industries. They often require multi-tenancy to isolate data and maintain compliance. When building a Retrieval Augmented Generation (RAG) application, using account or user-specific data to contextualize results, data must remain within a dedicated tenant for its user group. Weaviate’s native, multi-tenant architecture shines for customers who need to prioritize" |
Intuitively, we as readers understand the link between the two correlated chunks. With the naive approach, however, there is no way to condition the two separate embeddings with information about their neighboring chunks.
When we apply late chunking, this contextual conditioning is preserved, and we are able to return the exact two chunks needed to answer the query in a RAG application.
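The retrieval step in a comparison like this one can be sketched as a simple top-k ranking over chunk vectors. The sketch assumes cosine similarity as the scoring function (the comparison above only reports "Similarity"), and the function names and toy vectors are purely illustrative. The same ranking code works for naive and late chunking alike; only the chunk vectors being ranked differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (chunk_index, similarity) for the k most similar chunks, best first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for a query embedding and three chunk embeddings.
query_vec = [1.0, 0.0]
chunk_vecs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

results = top_k(query_vec, chunk_vecs, k=2)
print([index for index, _ in results])  # [1, 2]
```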
Let's revisit our theoretical storage comparison from earlier:
| Approach | Total embeddings required per document | Number of Documents | Total Vectors Stored | Storage Required |
|---|---|---|---|---|
| Late Interaction (no pooling) | 8,000 | 100,000 | 800 million | ~2.46 TB |
| Naive Approach (chunking before inference) | 16 ( 8,000 / 512 ) | 100,000 | 1.6 million | ~4.9 GB |
| Late Chunking (chunking after inference) | 16 ( 8,000 / 512 ) | 100,000 | 1.6 million | ~4.9 GB |
As we can see, late chunking offers the same reduction in storage requirements as the naive approach, while preserving more of the contextual information that late interaction offers.
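The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes 768-dimensional float32 vectors (4 bytes per value) and decimal units (1 GB = 10^9 bytes); these assumptions are not stated in the table itself but are consistent with its numbers. The function name `storage_estimate` is illustrative.

```python
import math
from typing import Optional, Tuple


def storage_estimate(
    tokens_per_doc: int,
    num_docs: int,
    chunk_size: Optional[int] = None,  # None means one vector per token (late interaction)
    dims: int = 768,                   # assumed embedding dimensionality
    bytes_per_value: int = 4,          # assumed float32 storage
) -> Tuple[int, int]:
    """Return (total_vectors, total_bytes) for a collection of documents."""
    if chunk_size is None:
        vectors_per_doc = tokens_per_doc
    else:
        vectors_per_doc = math.ceil(tokens_per_doc / chunk_size)
    total_vectors = vectors_per_doc * num_docs
    return total_vectors, total_vectors * dims * bytes_per_value


late_interaction = storage_estimate(8_000, 100_000)
chunked = storage_estimate(8_000, 100_000, chunk_size=512)  # naive and late chunking alike

print(late_interaction[0], late_interaction[1] / 1e12)  # 800000000 vectors, ~2.46 TB
print(chunked[0], chunked[1] / 1e9)                     # 1600000 vectors, ~4.9 GB
```

Naive and late chunking share a row here because they store the same number of vectors; they differ only in how each vector is computed.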
If you are interested in more examples, here is another great notebook from Danny Williams exploring late chunking with questions about Berlin!
We believe that late chunking is extremely promising for a number of reasons:
In retrieval, there is no one-size-fits-all solution and the best approach will always be that which solves the user's problem given their specific constraints. However, if you want to avoid the pitfalls of naive chunking and the high potential costs of ColBERT, late chunking may be a great alternative for you to explore when you need to strike a balance between cost and performance.
Late chunking is a new approach, and as such there is limited data available on its performance in benchmarks, which are already scarce for long context retrieval. The initial quantitative benchmarks from JinaAI are promising, showing improved results across the board against naive chunking. In particular, the relative uplift in performance from late chunking was shown to grow as the document length in characters increased, which makes sense: the longer the document, the more cross-chunk context there is for late chunking to preserve.
We are keen to test late chunking out in further detail, particularly in benchmarks designed for assessing performance in long embedding retrieval. So stay tuned for more on this topic as we continue to explore the benefits of late chunking and integrate it into Weaviate.