Learn how MyClone migrated from OpenAI text-embedding-3-small (1536d) to Voyage-3.5-lite (512d) to achieve 3× storage savings, 2× faster retrieval, and a 15-20% reduction in voice latency—without sacrificing retrieval quality.
At MyClone.is, our mission is to build truly personal digital personas. We achieve this by creating a rich, interactive clone of a user’s knowledge base, powered by Retrieval-Augmented Generation (RAG). We build a knowledge base for each user by encoding their uploaded documents and notes into a vector database that powers their chat and voice assistants.
Every time a user interacts with their persona via voice or chat, the system runs RAG over those embeddings to pinpoint the most relevant pieces of knowledge in their unique knowledge base—often in milliseconds—and deliver a response that sounds just like them. In this architecture, the embedding model is central: it determines how well the system understands user content, how much vector storage is required, and how quickly relevant information can be retrieved and ranked. In the end, latency is the enemy of natural conversation.
Previously, MyClone used OpenAI’s text-embedding-3-small, which produces 1536‑dimensional float vectors optimized for general-purpose semantic similarity. This model is known for strong quality across common retrieval benchmarks at a relatively low price point, but its default 1536‑dim size implies higher storage and bandwidth than lower‑dim alternatives.
In high‑throughput RAG systems, 1536‑dim vectors increase memory footprint, disk usage, and I/O per query, which can become a bottleneck for both latency and cost as the number of users and knowledge items grows.
We recently identified this bottleneck in our RAG pipeline and took a bold step: we replaced OpenAI’s text-embedding-3-small (1536 dims) with Voyage-3.5-lite (512 dims). The switch cuts storage and latency substantially while maintaining, and often improving, retrieval quality for each user’s persona. This kind of infrastructure change translates directly into faster, cheaper, and more natural-feeling AI assistants for our users.
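In code, the migration is essentially a one-call swap. The sketch below assumes the public openai and voyageai Python clients with API keys already in the environment; parameter names reflect those client libraries as we understand them, not MyClone’s internal pipeline.

```python
from openai import OpenAI
import voyageai

texts = ["A chunk of the user's uploaded knowledge base."]

# Before: 1536-dim float vectors from text-embedding-3-small
openai_client = OpenAI()
resp = openai_client.embeddings.create(model="text-embedding-3-small", input=texts)
old_vec = resp.data[0].embedding            # len(old_vec) == 1536

# After: 512-dim float vectors from voyage-3.5-lite
vo = voyageai.Client()
result = vo.embed(
    texts,
    model="voyage-3.5-lite",
    input_type="document",                  # use "query" when embedding user questions
    output_dimension=512,
)
new_vec = result.embeddings[0]              # len(new_vec) == 512
```

The main operational caveat is that old and new vectors live in different embedding spaces, so the knowledge base has to be re-embedded and re-indexed before queries can switch over.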
Let’s dive deeper.
On the surface, going from 1536 dimensions down to 512 seems like a compromise. Fewer dimensions should mean less information and poorer retrieval quality. However, the landscape of embedding models is evolving rapidly, driven by innovations like Matryoshka Representation Learning (MRL), which Voyage AI utilizes.
Voyage‑3.5‑lite leverages Matryoshka training and quantization‑aware techniques so that the first 256 or 512 dimensions capture the majority of the semantic signal instead of being a naive truncation of a larger vector. Public benchmarks and vendor claims indicate that Voyage‑3.5‑lite at reduced dimensions maintains retrieval performance very close to full‑dimension variants and competitive with leading commercial models.
By contrast, OpenAI’s text-embedding-3-small defaults to 1536-dimensional output. The API does allow shortening embeddings via a dimensions parameter, but we had standardized on the full 1536 dims, and further post-hoc reduction (e.g., PCA or truncation) can lose information unless it is carefully tuned for each domain. This makes Voyage-3.5-lite more attractive for applications where vector cost and latency are critical but quality cannot be sacrificed.
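To make the Matryoshka idea concrete: an MRL-trained model orders information so that a prefix of the vector is itself a usable embedding. The snippet below is only an illustration of that property, not Voyage’s internal method; in practice you simply request output_dimension=512 and get pre-shortened, normalized vectors back.

```python
import numpy as np

def shorten(vec, k: int = 512) -> np.ndarray:
    """Keep the first k dimensions and re-normalize so cosine similarity still works.

    This only preserves quality when the model was trained with a
    Matryoshka-style objective, so the leading dimensions carry most of
    the semantic signal; truncating an arbitrary embedding this way
    carries no such guarantee.
    """
    head = np.asarray(vec, dtype=np.float32)[:k]
    return head / np.linalg.norm(head)
```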
The most immediate gain was in our storage layer. By reducing the dimensionality from 1536 to 512, we achieved a ~66% reduction in the storage footprint required for our entire user knowledge base in the Vector DB.
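The back-of-envelope math is straightforward for raw float32 vectors (real savings also depend on index structures and metadata, which this ignores):

```python
BYTES_PER_FLOAT32 = 4
per_vec_1536 = 1536 * BYTES_PER_FLOAT32        # 6,144 bytes per vector
per_vec_512 = 512 * BYTES_PER_FLOAT32          # 2,048 bytes per vector

print(1 - per_vec_512 / per_vec_1536)          # 0.666... -> ~66% smaller
print(per_vec_1536 * 1_000_000 / 1e9)          # ~6.1 GB per million chunks before
print(per_vec_512 * 1_000_000 / 1e9)           # ~2.0 GB per million chunks after
```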
Vector databases rely on calculating the similarity (usually cosine similarity) between the query vector and millions of stored document vectors. The computational cost of this search is heavily dependent on the vector size.
This optimization cut our retrieval latency roughly in half (2× faster).
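The intuition: a brute-force similarity scan costs O(N·d) multiply-adds per query, so dropping d from 1536 to 512 removes roughly two thirds of the arithmetic. ANN indexes complicate the picture, but their per-distance computations shrink the same way. Here is a toy microbenchmark over a synthetic corpus, not our production vector DB:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
N = 100_000                                               # hypothetical number of stored chunks

def topk_query_time(dim: int, k: int = 5) -> float:
    docs = rng.standard_normal((N, dim)).astype(np.float32)
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit vectors: cosine == dot product
    q = rng.standard_normal(dim).astype(np.float32)
    q /= np.linalg.norm(q)

    start = time.perf_counter()
    scores = docs @ q                                     # similarity against every stored chunk
    top_k = np.argpartition(scores, -k)[-k:]              # indices of the best matches
    return time.perf_counter() - start

print(f"1536d: {topk_query_time(1536) * 1e3:.1f} ms")
print(f" 512d: {topk_query_time(512) * 1e3:.1f} ms")
```

The 2× figure we report is measured end-to-end against our production vector DB, where index traversal, I/O, and caching all contribute; the microbenchmark only shows why smaller vectors help at the distance-computation level.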
For a Digital Persona designed for voice interaction, every millisecond counts. A long pause after a user asks a question breaks the illusion of a real conversation.
The massive reduction in retrieval latency directly fed into our overall system speed:
| Feature | OpenAI text-embedding-3-small | Voyage-3.5-lite (512d float) |
|---|---|---|
| Default dimensions | 1536 | 1024 (supports 256/512/1024/2048) |
| Dimensions used at MyClone | 1536 | 512 |
| Per-vector size (float32) | 6,144 bytes (baseline) | 2,048 bytes (3× smaller) |
| Retrieval quality | Strong general-purpose | Competitive / improved on retrieval |
| Storage cost | High (per vector) | ~3× lower at same precision |
| Vector DB latency | Baseline | 2–2.5× faster at MyClone |
| E2E voice latency impact | Baseline | 15–20% reduction at MyClone |
| First-token latency | Baseline | ~15% faster at MyClone |
| Dimensional flexibility | 1536 default (API shortening available) | Native Matryoshka (256–2048) |
For a digital persona platform, user satisfaction is tightly linked to how responsive and on‑point the assistant feels in both chat and voice. Lower vector dimensions reduce tail latency for retrieval, which directly shortens the time to first token and makes voice conversations feel more natural and less “robotic pause” heavy.
At the same time, users expect the persona to recall their uploaded knowledge accurately, which means any optimization that saves cost must not degrade retrieval quality or introduce hallucinations. Voyage‑3.5‑lite’s retrieval‑focused design allows MyClone to hit this balance: high‑fidelity grounding with a much lighter retrieval stack.
From a product and business perspective, the embedding migration unlocks several advantages:
Better UX at scale: Faster responses improve perceived intelligence and trust, especially in voice interactions where humans are highly sensitive to delay.
Lower infra cost per persona: 3× storage savings and faster queries mean cheaper vector DB and compute, allowing MyClone to host more user knowledge for the same budget.
Headroom for richer features: Freed-up latency and cost can be reinvested into deeper RAG pipelines, more reranking, or multi‑step reasoning without exceeding user latency budgets.
Future flexibility: Voyage‑3.5‑lite supports multiple dimensions and quantization schemes (e.g., int8, binary), opening the door to further optimizations like ultra‑cheap archival memory or hybrid binary‑plus‑float retrieval strategies.
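As one example of that flexibility, the Voyage embeddings API exposes an output_dtype option alongside output_dimension (to the best of our reading of the current API; we have not shipped this yet). A hypothetical archival tier could look like:

```python
import voyageai

vo = voyageai.Client()
archived_chunks = ["Older notes the persona rarely needs, kept for completeness."]

# 512-dim int8 vectors: ~4x smaller than float32 at the same dimension count
archival = vo.embed(
    archived_chunks,
    model="voyage-3.5-lite",
    input_type="document",
    output_dimension=512,
    output_dtype="int8",      # "binary" shrinks vectors further for coarse first-pass retrieval
)
```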
For MyClone, these gains compound: each user’s digital persona can reference more documents, answer faster, and operate more cheaply—while staying faithful to the user’s own voice, style, and knowledge.
The shift from OpenAI’s 1536‑dim embeddings to Voyage‑3.5‑lite 512‑dim embeddings shows how embedding choice is a product decision, not just an infra detail. By aligning the embedding model with the needs of high‑scale RAG—fast, cost‑efficient retrieval with strong semantic quality—MyClone improved both user experience and unit economics in one move.
As RAG systems mature, embedding models like Voyage‑3.5‑lite that are explicitly optimized for flexible dimensions, quantization, and retrieval quality will increasingly become the default for latency‑sensitive, knowledge‑heavy products like digital personas.
Great article! I always feel that the choice of embedding model is quite important, but it's seldom mentioned. Most tutorials about RAG just tell you to use a common model like OpenAI's text embedding, making it seem as though it's okay to use anything else. But even though I'm somewhat aware of this, I lack the knowledge and methods to determine which model is best suited for my scenario. Can you give some suggestions on how to evaluate that? Besides, I'm wondering what you think about some open-source embedding models like embeddinggemma-300m or e5-large.
The biggest latency improvement I saw was switching off OpenAI's API that would have a latency anywhere between 0.3 - 6 seconds(!) for the same two word search embedding...
Cool article, but nothing groundbreaking? Obviously if you reduce your dimensionality the storage and latency decreases.. it’s less data