Show HN: Learning a Language Using Only Words You Know

2025-12-15 13:32 · simedw.com

Can you bootstrap language learning from a small set of initial words?

December 15, 2025

TL;DR: LangSeed is a proof-of-concept language-learning app that defines new words using only the vocabulary you already know, with emojis bridging semantic gaps. The code is available on GitHub, and you can try it here: LangSeed.

I recently bought a simplified version of Journey to the West in Mandarin. It turns out I had overestimated my reading skills: I barely recognised 20% of the words on the first page.

While I could have googled each word (or better yet, signed up for Chinese lessons) and gotten the English definition, doing so felt like it broke the immersion. Instead, I wanted to look up the words I didn't know and see their definitions in my target language. The problem is that there is no dictionary that is 100% tailored to my current level.1

This is where LLMs come into the picture.

Generative Dictionary

The idea is simple: you have a set of words you have already mastered, M, and when you encounter a new word, you ask a model to define it using only words from M. There are two ways to do this: guided decoding and post-generation validation.

In guided decoding, you block the model from generating tokens that don't match words in M. This is slightly complicated, since a token can correspond to a full word, multiple words, or a single character. It is easier to set up with local models; frontier models and their APIs offer some support for it, but that support is often centred on CFGs and JSON schemas.
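As a rough illustration of the masking idea (not the app's actual implementation), the sketch below marks a token as allowed if its text occurs inside some known word, then pushes every other logit to negative infinity so sampling can never pick it. The `id_to_text` dict stands in for a real tokenizer's vocabulary; a production version would need a trie or automaton that tracks word boundaries, precisely because tokens can span a whole word, several words, or one character.

```python
import math

def allowed_ids(id_to_text, known_words):
    # Naive rule: a token is allowed if its text occurs inside a known word.
    # This over-approximates: it ignores word boundaries entirely.
    return {i for i, t in id_to_text.items() if any(t in w for w in known_words)}

def mask_logits(logits, allowed):
    # Disallowed tokens get -inf, so softmax assigns them zero probability.
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]
```

For example, with the toy vocabulary `{0: "我", 1: "吃", 2: "糖"}` and known words `{"我", "吃饭"}`, tokens 0 and 1 survive and token 2 is masked out.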

Post-generation validation is much simpler. After the model has produced an answer, you segment it into words (using Jieba), find all words not in M, and call the LLM again with the list of disallowed words it used. You repeat this loop until the model succeeds or you give up; after all, it's impossible to explain "melancholy" if the only words you know are "I", "like", and "food". In practice it averages 1.5 rounds for me, with three being the maximum before I give up.2
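The loop can be sketched in a few lines of Python (the real app is Elixir and segments with Jieba; here `generate` and `segment` are injected callables, with character-level segmentation standing in for Jieba in the example):

```python
def constrained_define(word, known, generate, segment, max_rounds=3):
    """Ask `generate` (the LLM call) for a definition, segment the result,
    and retry with the list of offending words until every word is in
    `known` or we run out of rounds."""
    feedback = []
    for _ in range(max_rounds):
        definition = generate(word, known, feedback)
        # Keep only real words: emojis and punctuation fail isalpha() and pass.
        bad = [w for w in segment(definition) if w.isalpha() and w not in known]
        if not bad:
            return definition
        feedback = bad
    return None  # give up: park the word until the vocabulary grows
```

Returning `None` is the "park it for later" case from above: the word goes back on the shelf until the known set M has grown.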

It's worth noting that I feed the list of words I already know into the context, together with a sentence containing the word, since many words have multiple meanings depending on context.
You can "train" yourself to get stronger, or you can ride a "train".
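A minimal sketch of how such a request could be assembled (the prompt wording here is hypothetical, not the app's actual prompt): the target word, the sentence it appeared in to pin down which sense is meant, the allowed vocabulary, and any disallowed words a previous attempt used.

```python
def build_prompt(word, sentence, known, disallowed=()):
    # The context sentence disambiguates polysemous words ("train" the verb
    # vs. "train" the noun); the known list constrains the output vocabulary.
    lines = [
        f"Define 「{word}」 as it is used in: {sentence}",
        "Use ONLY these words, plus emojis: " + " ".join(sorted(known)),
    ]
    if disallowed:
        lines.append("Your last answer used disallowed words: " + " ".join(disallowed))
    return "\n".join(lines)
```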

One thing I noticed was that if the seed vocabulary was small enough, it was essentially impossible to break out and explain more complicated concepts. At first, I considered generating images to explain the concepts since, if you squint hard enough, that's similar to how we learn words alongside our visual senses. But then it struck me that I could use emojis, the universal language we all "speak", the Rosetta Stone of our time. Now the model could substitute words and concepts with flags, animals and sleeping emojis.

月 - 🌙 ~三十 天 Moon/Month, ~30 days


I also discovered that a single definition was often not enough to grasp the idea, so I started having the model generate three definitions when possible, which drastically increased my chances of understanding the word.

淘气 mischievous

I also had the model output words it wished I already knew in order to create better definitions. For example, 学习 (study) recommends that I learn 学校 (school) and 知识 (knowledge), and initially relied on the emoji sequence 📚✏️🧠💡 to convey some of that meaning (see image below). The model also self-rates its definitions from 0 to 5.

累 tired
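One way to carry the recommended words and the self-rating alongside each definition is a small structured payload. The field names below are hypothetical, purely to illustrate the shape; the clamp guards against the model rating outside 0-5.

```python
import json
from dataclasses import dataclass

@dataclass
class Definition:
    text: str          # the definition, built from known words + emojis
    wish_known: list   # words the model wishes the learner already knew
    self_rating: int   # the model's own 0-5 score for this definition

def parse_definition(raw: str) -> Definition:
    d = json.loads(raw)
    rating = max(0, min(5, int(d.get("self_rating", 0))))  # clamp to 0..5
    return Definition(d["text"], d.get("wish_known", []), rating)
```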

But it's still far from perfect. My Chinese friends pointed out multiple grammatical issues, as well as the use of somewhat unusual words in the example sentences. There are some words I still don't get, so I've paused them until I have a larger vocabulary, at which point I'll come back and try again (for example 关于).

Drills

The next step was to create a basic training process. I’ve enjoyed spaced repetition in the past (for example, Anki), but for this proof of concept I wanted something simpler. In the end, I landed on two types of questions: a sentence with a gap where I need to pick from four options, and a sentence with a yes-or-no answer. All of these use only words you already know (or emojis), relying on the same post-processing verification as before.

读书是学习吗? Does reading count as studying?
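The two question types and their vocabulary check can be modelled roughly like this (a Python sketch, not the app's Elixir code; character-level segmentation again stands in for Jieba):

```python
from dataclasses import dataclass

@dataclass
class GapQuestion:
    sentence: str   # contains a gap marker, e.g. "我____饭"
    options: list   # exactly four choices
    answer: str

@dataclass
class YesNoQuestion:
    sentence: str
    answer: bool

def valid(q, known, segment=list):
    """Usable only if every word is already known — the same
    post-generation check used for definitions."""
    text = q.sentence.replace("____", "")
    if isinstance(q, GapQuestion):
        text += "".join(q.options)  # the four options must be known too
    return all(w in known or not w.isalpha() for w in segment(text))
```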

I also made a version that could handle Swedish (my native language) and English, which made it easier to understand how well the LLMs were doing, how sensible the questions were, and how many grammatical problems they had. One big advantage of Chinese for this task is that words don’t conjugate; there’s no tense, no -ing forms, and so on. In Swedish and English, by contrast, words often have multiple stems or inflected forms.

This helped me debug conceptual errors and avoid trusting the models blindly.

Implementation

For this project, I decided to use Phoenix LiveView (Elixir), and it was a real joy to work with. I also tried the new req_llm library for unifying provider requests. I used Oban to generate questions in the background, ensuring that each word always had at least four pending questions, since they take some time to generate.
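The top-up logic behind that invariant is simple; a language-neutral sketch (the real app schedules the work as Oban jobs in Elixir):

```python
def topup_plan(pending_counts, minimum=4):
    # For each word below the threshold, compute how many new questions
    # to enqueue so it gets back to `minimum` pending.
    return {w: minimum - n for w, n in pending_counts.items() if n < minimum}
```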

I deployed it on Fly.io, but their hosted Postgres was a bit too expensive for a one-off project, so I gave Neon a try for the first time.

I tried a few models but ended up using Gemini 2.5 Pro as my default model. Gemini Flash wasn't "creative" enough to bridge the gap using emojis. GPT-5 (and 5.1) also did a fairly good job.

Conclusion

I have been using this app for a week now, mostly on my phone while commuting to the office, and I can now read the first page! Next up: figuring out pronunciation; maybe a Christmas break project.

Seeing the definition of a new word feels a bit like solving a puzzle. You have a bunch of clues, you start figuring out what it means, and as you learn more, you can come back and build a clearer and clearer picture.

Sometimes it will mislead you, and reading a clear definition would definitely speed things up, but I wonder whether the process of figuring out a word doesn’t reinforce it much more.



Comments

  • By dylanzhangdev · 2025-12-19 0:42 · 2 replies

    Even for Chinese people, Journey to the West is a somewhat difficult text because it belongs to classical literature. Using some children's books published in recent years, and progressing gradually, might be a better approach?

  • By englishcat · 2025-12-19 13:51 · 1 reply

    This is a great idea. As a native Chinese speaker, I want to say this is very similar to how we learned Chinese when we were kids.

    On the other hand, the Chinese writing system is logographic (or ideographic), unlike the English system, which is phonetic. The most basic characters, such as 日 (sun), 月 (moon), and 山 (mountain), are essentially pictures of the objects themselves, which makes them very suitable for being represented by images. The emojis you are using work well for this too.

    I believe this method should be very effective for beginners in Chinese. However, once you have mastered the basic Chinese characters, you can learn about the structure of Chinese characters and then continue reading more materials to expand your vocabulary.

    The real challenge is expanding your vocabulary through extensive reading. I'm actually working on a tool to solve this specific problem (https://lingoku.ai/learn-chinese): if you are reading English, it will insert Chinese text for you; if you are reading Chinese, it will translate the text from Chinese to English and then inject Chinese words into the translated text, improving your vocabulary while you read.

    • By bisonbear · 2025-12-19 18:44 · 3 replies

      Checked out the tool and think it's a cool idea! One piece of feedback though: I actually feel like the inverse product would be more helpful for me. What I mean is replacing ~95% of English text with words I can understand (Chinese in my case), and leaving the remaining ~5% (words I definitely don't know) in English.

      At least for me, there's a lot of value in consuming bigger volumes of Chinese to get used to pattern-matching on the characters, as opposed to reading a smaller amount of harder characters that I'm less likely to actually encounter.

      • By englishcat · 2025-12-21 1:45

        That makes a lot of sense; it really highlights the differences in learning stages. My current tool is primarily designed for intermediate learners who have already picked up some basic words but are still in the 'accumulation phase': their main bottleneck is vocabulary size, so they need to see new words frequently.

        It sounds like you are at a more advanced stage of learning Chinese: you have moved past simple vocab building and are focusing on flow and reading fluency. For your use case, that 'inverse' approach (Chinese with English safety nets) is definitely superior for pattern-matching. It's a different problem set, but a very valid one.

        Appreciate your feedback.

      • By simedw · 2025-12-19 22:37

        That's a really cool concept. Naively replacing words might work, but sometimes the context is needed. Maybe a model like Gemini 2.5 Flash Lite would be fast enough while still maintaining better context awareness?

  • By jtokoph · 2025-12-15 15:55 · 3 replies

    This is a really smart idea.

    I’m trying to learn to speak Chinese and not read it yet. The issue is most of the language learning apps have a focus on characters. I feel like I just want to see the pinyin. Maybe I don’t know what I need, but I haven’t found the right tool.

    • By andai · 2025-12-19 0:00 · 2 replies

      There's a language learning method where you just listen to audio, until you develop a basic familiarity with the language. (Then learn reading and writing later.)

      You listen to audio you don't understand yet, and over time your brain begins to pick up the patterns. It takes a lot of time but you can do it in the background, because that processing happens subconsciously. So you can get that time "for free".

      I learned it from this guy https://alljapanesealltheti.me/index.html

      But he got it from linguist Stephen Krashen and his Input Hypothesis of language acquisition. (i.e. that the way babies and kids learn languages, thru osmosis, works for adults too.)

      I think the ideal solution is somewhere in the middle, starting with something like Pimsleur which is the same idea (audio and repetition) but more structured and focused, to give you that "seed" of vocabulary and grammar, before you flesh it out with the "long tail" of the language.

      • By cblum · 2025-12-19 3:49

        To add a bit more to this: AJATT (all Japanese all the time) later evolved into MIA (mass input approach), which then became Refold.

        The gist of those methods is mass input + create SRS cards for sentences where only one word or grammar pattern is unfamiliar to you.

        A similar but more relaxed approach is ALG (automatic language growth), where you start from very basic input with lots of visual aids and let the language “wash over you”: no taking notes, no creating flashcards, no dictionary lookups. Sounds crazy, but it works for a lot of people. It’s the method behind Dreaming Spanish, which was inspired by the teaching method at the AUA language school in Bangkok, where Dr. J Marvin Brown used Stephen Krashen’s ideas to create a Natural Approach course to teach foreigners Thai from zero to fluency.

      • By armenarmen · 2025-12-19 1:00 · 2 replies

        Pimsleur is also a great place to start if you're going spoken-first.

        • By bpev · 2025-12-19 8:52

          As someone who did most of Pimsleur Spanish and Mandarin (and did a single unit in various other languages), and has since continued learning these languages (I'm currently taking 4-5 hours of Spanish class a day in Spain), my two cents is that Pimsleur is fine for gaining confidence in the basic phrases of a language, but is a pretty poor tool if you want to actually learn a language. imo it focuses too much on set phrases without practicing further application.

          For adults learning a language, I think you need 3 things to be most efficient. You need to learn the grammar rules/structure, you need vocabulary, and you need lots and lots of content. The specificity of Pimsleur I think is a major blocker. It lacks both vocabulary and content, and there is often a better resource for explaining grammar. I guess maybe the first unit of each Pimsleur course is pretty ok for getting used to the mouthfeel of a language, though.

          For Spanish, I got far more out of languagetransfer.org, which helped me understand the concepts of the language much better, and dreaming.com, which gave me lots of content. For Chinese, I haven't found a course I like, but I still think I got more from drilling characters (I made my own app, but something like hanzihero or just an HSK/TOCFL Anki deck is probably good) and using graded readers. I think spoken-first in Chinese is a bit of a trap, because it's easier to remember things with the written characters, when the relationships between words are a bit clearer.

          edit: oh also sidenote, it's been a long time since I used it, but iirc, the Mandarin one is particularly outdated (eg talks about using a phone book) and uses a Beijing dialect, so everyone in Taipei made fun of me the first time I went there.

        • By cblum · 2025-12-19 3:42

          Pimsleur is awful for Mandarin. I wish I hadn’t wasted my time on it.

    • By SuperNinKenDo · 2025-12-19 2:21

      I recently changed all my language flashcards to be like this. Anki is probably the best option. I have a field with the Hanzi, but I configure my cards not to show it for now, so I break the habit of translating everything to characters in my head when I'm trying to listen. It's worked well, and the characters will be there when I decide to do something with them again.

    • By simedw · 2025-12-15 16:10

      Thanks! I think getting comfortable with characters fairly early is important, as it helps shift your mindset into the right place. That said, I don’t think this project really works until you’re comfortable with at least ~60 characters.

HackerNews