The path to ubiquitous AI (17k tokens/sec)

2026-02-20 10:32 · taalas.com


By Ljubisa Bajic

Many believe AI is the real deal. In narrow domains, it already surpasses human performance. Used well, it is an unprecedented amplifier of human ingenuity and productivity. Its widespread adoption is hindered by two key barriers: high latency and astronomical cost. Interactions with language models lag far behind the pace of human cognition. Coding assistants can ponder for minutes, disrupting the programmer’s state of flow, and limiting effective human-AI collaboration. Meanwhile, automated agentic AI applications demand millisecond latencies, not leisurely human-paced responses.

On the cost front, deploying modern models demands massive engineering and capital: room-sized supercomputers consuming hundreds of kilowatts, with liquid cooling, advanced packaging, stacked memory, complex I/O, and miles of cables. This scales to city-sized data center campuses and satellite networks, driving extreme operational expenses.

Though society seems poised to build a dystopian future defined by data centers and adjacent power plants, history hints at a different direction. Past technological revolutions often started with grotesque prototypes, only to be eclipsed by breakthroughs yielding more practical outcomes.

Consider ENIAC, a room-filling beast of vacuum tubes and cables. ENIAC introduced humanity to the magic of computing, but was slow, costly, and unscalable. The transistor sparked swift evolution, through workstations and PCs, to smartphones and ubiquitous computing, sparing the world from ENIAC sprawl.

General-purpose computing entered the mainstream by becoming easy to build, fast, and cheap.

AI needs to do the same.

About Taalas

Founded 2.5 years ago, Taalas developed a platform for transforming any AI model into custom silicon. From the moment a previously unseen model is received, it can be realized in hardware in only two months.

The resulting Hardcore Models are an order of magnitude faster, cheaper, and lower power than software-based implementations.

Taalas’ work is guided by the following core principles:

1. Total specialization

Throughout the history of computation, deep specialization has been the surest path to extreme efficiency in critical workloads.

AI inference is the most critical computational workload that humanity has ever faced, and the one that stands to gain the most from specialization.

Its computational demands motivate total specialization: the production of optimal silicon for each individual model.

2. Merging storage and computation

Modern inference hardware is constrained by an artificial divide: memory on one side, compute on the other, operating at fundamentally different speeds.

This separation arises from a longstanding paradox. DRAM is far denser, and therefore cheaper, than the types of memory compatible with standard chip processes. However, accessing off-chip DRAM is thousands of times slower than on-chip memory. Conversely, compute chips cannot be built using DRAM processes.

This divide underpins much of the complexity in modern inference hardware, creating the need for advanced packaging, HBM stacks, massive I/O bandwidth, soaring per-chip power consumption, and liquid cooling.
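To see why, consider a back-of-envelope calculation (ours, with rough illustrative figures): during single-user decoding, every generated token must stream the full set of weights past the compute units, so off-chip memory bandwidth caps per-user speed regardless of how much compute sits beside it.

    # Napkin sketch (approximate figures, not vendor-measured):
    # per-user decode speed of a memory-bound LLM is bandwidth / model size.
    weights_gb = 8.0            # a Llama-3.1-8B-class model at 8-bit precision
    hbm_gb_per_s = 4800.0       # roughly H200-class HBM bandwidth
    print(f"~{hbm_gb_per_s / weights_gb:.0f} tokens/s/user")  # ~600

Batching recovers aggregate throughput, but not any single user's latency; only moving the weights on-die removes the ceiling.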

Taalas eliminates this boundary. By unifying storage and compute on a single chip, at DRAM-level density, our architecture far surpasses what was previously possible.

3. Radical simplification

By removing the memory-compute boundary and tailoring silicon to each model, we were able to redesign the entire hardware stack from first principles.

The result is a system that does not depend on difficult or exotic technologies: no HBM, no advanced packaging, no 3D stacking, no liquid cooling, no high-speed I/O.

Engineering simplicity enables an order-of-magnitude reduction in total system cost.

Early Products

Guided by this technical philosophy, Taalas has created the world’s fastest, lowest cost/power inference platform.

Figure 1: Taalas HC1 hard-wired with Llama 3.1 8B model

Today, we are unveiling our first product: a hard-wired Llama 3.1 8B, available as both a chatbot demo and an inference API service.

Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.

Chart: tokens/sec per user, Taalas vs. competitors. Performance data for Llama 3.1 8B, input sequence length 1k/1k. Sources: Nvidia baseline (H200) and B200 measured by Taalas; Groq, SambaNova, Cerebras performance from Artificial Analysis; Taalas performance run by Taalas labs.

Figure 2: Taalas HC1 delivers leadership tokens/sec/user on Llama 3.1 8B

We selected Llama 3.1 8B as the basis for our first product due to its practicality. Its small size and open-source availability allowed us to harden the model with minimal logistical effort.

While largely hard-wired for speed, the Llama retains flexibility through configurable context window size and support for fine-tuning via low-rank adapters (LoRAs).
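For intuition, a LoRA leaves the frozen base weights untouched and adds a small trainable low-rank correction, which is why it can coexist with weights etched into silicon. A minimal numpy sketch (shapes illustrative; not our silicon implementation):

    import numpy as np

    d, r = 4096, 16                   # hidden size, adapter rank (r << d)
    W = np.random.randn(d, d)         # frozen base weight (the hard-wired part)
    A = np.random.randn(r, d) * 0.01  # trainable low-rank factor
    B = np.zeros((d, r))              # zero-initialized: adapter starts as a no-op

    def forward(x):
        # y = x W^T + x (B A)^T: the base path plus the low-rank update
        return x @ W.T + x @ (B @ A).T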

At the time we began work on our first generation design, low-precision parameter formats were not standardized. Our first silicon platform therefore used a custom 3-bit base data type. The Silicon Llama is aggressively quantized, combining 3-bit and 6-bit parameters, which introduces some quality degradations relative to GPU benchmarks.
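As a generic illustration of what aggressive low-bit quantization involves, here is a standard symmetric per-group 3-bit scheme (a sketch only; not necessarily the format used in our silicon):

    import numpy as np

    def quantize_3bit(w, group=128):
        # symmetric per-group quantization: 3 bits -> 8 integer levels in [-4, 3]
        w = w.reshape(-1, group)
        scale = np.abs(w).max(axis=1, keepdims=True) / 3.0
        q = np.clip(np.round(w / scale), -4, 3).astype(np.int8)
        return q, scale               # dequantize as q * scale

    q, s = quantize_3bit(np.random.randn(4096, 128).astype(np.float32))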

Our second-generation silicon adopts standard 4-bit floating-point formats, addressing these limitations while maintaining high speed and efficiency.

Upcoming models

Our second model, still based on Taalas’ first-generation silicon platform (HC1), will be a mid-sized reasoning LLM. It is expected in our labs this spring and will be integrated into our inference service shortly thereafter.

Following this, a frontier LLM will be fabricated using our second-generation silicon platform (HC2). HC2 offers considerably higher density and even faster execution. Deployment is planned for winter.

Instantaneous AI, in your hands today

Our debut model is clearly not on the leading edge, but we decided to release it as a beta service anyway – to let developers explore what becomes possible when LLM inference runs at sub-millisecond speed and near-zero cost.

We believe that our service enables many classes of applications that were previously impractical, and want to encourage developers to experiment, and discover how these capabilities can be applied.

Apply for access here, and engage with a system that removes traditional AI latency and cost constraints.

On substance, team and craft

At its core, Taalas is a small group of long-time collaborators, many of whom have been together for over twenty years. To remain lean and focused, we rely on external partners who bring equal skill and decades of shared experience. The team grows slowly, with new team members joining through demonstrated excellence, alignment with our mission and respect for our established practices. Here, substance outweighs spectacle, craft outweighs scale, and rigor outweighs redundancy.

Taalas is a precision strike, in a world where deep-tech startups approach their chosen problems like medieval armies besieging a walled city, with swarming numbers, overflowing coffers of venture capital, and a clamor of hype that drowns out clear thought.

Our first product was brought to the world by a team of 24 people, spending just $30M of the more than $200M raised. This achievement demonstrates that precisely defined goals and disciplined focus achieve what brute force cannot.

Going forward, we will advance in the open. Our Llama inference platform is already in your hands. Future systems will follow as they mature. We will expose them early, iterate swiftly, and accept the rough edges.

Conclusion

Innovation begins by questioning assumptions and venturing into the neglected corners of any solution space. That is the path we chose at Taalas.

Our technology delivers step-function gains in performance, power efficiency, and cost.

It reflects a fundamentally different architectural philosophy from the mainstream, one that redefines how AI systems are built and deployed.

Disruptive advances rarely look familiar at first, and we are committed to helping the industry understand and adopt this new operating paradigm.

Our first products, beginning with our hard-wired Llama and rapidly expanding to more capable models, eliminate high latency and cost, the core barriers to ubiquitous AI.

We have placed instantaneous, ultra-low-cost intelligence in developers’ hands, and are eagerly looking forward to seeing what they build with it.



Comments

  • By dust42 2026-02-2011:2720 reply

    This is not a general purpose chip but specialized for high speed, low latency inference with small context. But it is potentially a lot cheaper than Nvidia for those purposes.

    Tech summary:

      - 15k tok/sec on 8B dense 3bit quant (llama 3.1) 
      - limited KV cache
      - 880mm^2 die, TSMC 6nm, 53B transistors
      - presumably 200W per chip
      - 20x cheaper to produce
      - 10x less energy per token for inference
      - max context size: flexible
      - mid-sized thinking model upcoming this spring on same hardware
      - next hardware supposed to be FP4 
      - a frontier LLM planned within twelve months
    
    This is all from their website, I am not affiliated. The founders have 25 years of career across AMD, Nvidia and others, $200M VC so far.

    Certainly interesting for very low latency applications which need < 10k tokens context. If they deliver in spring, they will likely be flooded with VC money.

    Not exactly a competitor for Nvidia but probably for 5-10% of the market.

    Back of napkin, the cost for 1mm^2 of 6nm wafer is ~$0.20. So 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.
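    Spelling out that napkin math (numbers as above, arithmetic only):

      cost_per_mm2 = 0.20                # ~$ per mm^2 of TSMC 6nm wafer area
      die_mm2, params_b = 880, 8         # reported HC1 die size, hard-wired params
      die_cost = cost_per_mm2 * die_mm2  # ~$176 of silicon per die, pre-yield
      print(die_cost / params_b)         # ~$22 per 1B parameters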

    Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

    • By vessenes 2026-02-2012:067 reply

      This math is useful. Lots of folks scoffing in the comments below. I have a couple reactions, after chatting with it:

      1) 16k tokens / second is really stunningly fast. There’s an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, .. etc.

      2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html has the H200 doing 12k tokens/second on Llama 2 13B fp8. Knowing these architectures that’s likely a 100+ ish batched run, meaning time to first token is almost certainly slower than Taalas. Probably much slower, since Taalas is like milliseconds.

      3) Jensen has these pareto curve graphs — for a certain amount of energy and a certain chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these probably do not shift the curve. The 6nm process vs 4nm process is likely 30-40% bigger, draws that much more power, etc; if we look at the numbers they give and extrapolate to an fp8 model (slower), smaller geometry (30% faster and lower power) and compare 16k tokens/second for taalas to 12k tokens/s for an h200, these chips are in the same ballpark curve.

      However, I don’t think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact even if you had a full datacenter of H200s already running your model, you’d probably buy a bunch of these to do speculative decoding - it’s an amazing use case for them; speculative decoding relies on smaller distillations or quants to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model.
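      For anyone unfamiliar, the speculative decoding loop looks roughly like this (toy sketch with greedy acceptance; draft and verify are placeholder functions, and real implementations verify against the target's probability distribution):

        def speculative_decode(prompt, draft, verify, k=8, max_len=256):
            # draft(ctx) -> one cheap next token from the small model.
            # verify(ctx, guesses) -> longest prefix of `guesses` the big model
            # agrees with, checked in ONE parallel forward pass, plus the big
            # model's own next token at the first divergence (so >= 1 token).
            out = list(prompt)
            while len(out) < max_len:
                ctx, guesses = list(out), []
                for _ in range(k):            # fast model races ahead k tokens
                    t = draft(ctx)
                    guesses.append(t)
                    ctx.append(t)
                out += verify(out, guesses)   # never returns an empty list
            return out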

      Upshot - I think these will sell, even on 6nm process, and the first thing I’d sell them to do is speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

      I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.

      • By rbanffy 2026-02-2014:042 reply

        > any factor of 10 being a new science / new product category,

        I often remind people two orders of quantitative change is a qualitative change.

        > The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

        The real product they have is automation. They figured out a way to compile a large model into a circuit. That's, in itself, pretty impressive. If they can do this, they can also compile models to an HDL and deploy them to large FPGA simulators for quick validation. If we see models maturing at a "good enough" state, even a longer turnaround between model release and silicon makes sense.

        While I also see lots of these systems running standalone, I think they'll really shine combined with more flexible inference engines, running the unchanging parts of the model while the coupled inference engine deals with whatever is too new to have been baked into silicon.

        I'm concerned with the environmental impact. Chip manufacture is not very clean and these chips will need to be swapped out and replaced at a cadence higher than we currently do with GPUs.

        • By ttul 2026-02-2015:341 reply

          Having dabbled in VLSI in the early-2010s, half the battle is getting a manufacturing slot with TSMC. It’s a dark art with secret handshakes. This demonstrator chip is an enormous accomplishment.

          • By vessenes 2026-02-2020:511 reply

            Yeah and a team I’m not familiar with — I didn’t check bios but they don’t lead with ‘our team made this or that gpu for this or that bigco’.

            The design IP at 6nm is still tough; I feel like this team must have at least one real genius and some incredibly good support at TSMC. Or they’ve been waiting a year for a slot :)

            • By dust42 2026-02-2021:23

              From the article:

              "Ljubisa Bajic desiged video encoders for Teralogic and Oak Technology before moving over to AMD and rising through the engineering ranks to be the architect and senior manager of the company’s hybrid CPU-GPU chip designs for PCs and servers. Bajic did a one-year stint at Nvidia as s senior architect, bounced back to AMD as a director of integrated circuit design for two years, and then started Tenstorrent."

              His wife (COO) worked at Altera, ATI, AMD and Tenstorrent.

              "Drago Ignjatovic, who was a senior design engineer working on AMD APUs and GPUs and took over for Ljubisa Bajic as director of ASIC design when the latter left to start Tenstorrent. Nine months later, Ignjatovic joined Tenstorrent as its vice president of hardware engineering, and he started Taalas with the Bajices as the startup’s chief technology officer."

              Not a youngster gang...

        • By VagabundoP 2026-02-2014:481 reply

          There might be a foodchain of lower order uses when they become "obsolete".

          • By rbanffy 2026-02-2016:43

            I think there will be a lot of space for sensorial models in robotics, as the laws of physics don't change much, and a light switch or automobile controls have remained stable and consistent over the last decades.

      • By Gareth321 2026-02-2012:544 reply

        I think the next major innovation is going to be intelligent model routing. I've been exploring OpenClaw and OpenRouter, and there is a real lack of options to select the best model for the job and execute. The providers are trying to do that with their own models, but none of them offer everything to everyone at all times. I see a future with increasingly niche models being offered for all kinds of novel use cases. We need a way to fluidly apply the right model for the job.
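        A toy sketch of the shape I mean (model names and the classifier are made up):

          ROUTES = {"code": "fast-code-model", "roleplay": "creative-model",
                    "tool_use": "agentic-model", "general": "generalist-model"}

          def route(prompt, classify):
              # classify: any cheap labeler, even a hard-wired small LLM
              label = classify(prompt, labels=list(ROUTES))
              return ROUTES.get(label, ROUTES["general"])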

        • By nylonstrung 2026-02-2013:12

          Agree that routing is becoming the critical layer here. Vllm iris is really promising for this https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html

          There's already some good work on router benchmarking which is pretty interesting

        • By condiment 2026-02-2015:311 reply

          At 16k tokens/s why bother routing? We're talking about multiple orders of magnitude faster and cheaper execution.

          Abundance supports different strategies. One approach: Set a deadline for a response, send the turn to every AI that could possibly answer, and when the deadline arrives, cancel any request that hasn't yet completed. You know a priori which models have the highest quality in aggregate. Pick that one.
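          Something like this (hypothetical async clients; assumes at least one model beats the deadline):

            import asyncio

            async def race(prompt, models, rank, deadline_s=0.25):
                # models: name -> async completion callable; rank: name -> quality
                tasks = {n: asyncio.create_task(f(prompt)) for n, f in models.items()}
                done, pending = await asyncio.wait(tasks.values(), timeout=deadline_s)
                for t in pending:
                    t.cancel()                     # deadline hit: drop stragglers
                ok = [n for n, t in tasks.items() if t in done and not t.exception()]
                best = max(ok, key=lambda n: rank[n])  # best-ranked finisher
                return tasks[best].result()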

          • By IanCal 2026-02-2017:131 reply

            The best coding model won’t be the best roleplay one which won’t be the best at tool use. It depends what you want to do in order to pick the best model.

            • By PhunkyPhil 2026-02-2018:162 reply

              I'm not saying you're wrong, but why is this the case?

                I'm out of the loop on training LLMs, but to me it's just pure data input. Are they choosing to include more code rather than, say, fiction books?

              • By refulgentis 2026-02-2018:203 reply

                I’ll go ahead and say they’re wrong (source: building and maintaining llm client with llama.cpp integrated & 40+ 3p models via http)

                I desperately want there to be differentiation. Reality has shown over and over again it doesn’t matter. Even if you do the same query across X models and then some form of consensus, the improvements on benchmarks are marginal and the UX is worse (more time, more expensive, final answer is muddied and bound by the quality of the best model)

              • By jmalicki 2026-02-2020:06

                There is the pre-training, where you passively read stuff from the web.

                From there you go to RL training, where humans are grading model responses, or the AI is writing code to try to pass tests and learning how to get the tests to pass, etc. The RL phase is pretty important because it's not passive, and it can focus on the weaker areas of the model too, so you can actually train on a larger dataset than the sum of recorded human knowledge.

        • By monooso 2026-02-2014:08

          I came across this yesterday. Haven't tried it, but it looks interesting:

          https://agent-relay.com/

        • By eshaham78 2026-02-2013:03

          [dead]

      • By ssivark 2026-02-215:08

        > speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious

        Can we use older (previous generation, smaller) models as a speculative decoder for the current model? I don't know whether the randomness in training (weight init, data ordering, etc) will affect this kind of use. To the extent that these models are learning the "true underlying token distribution" this should be possible, in principle. If that's the case, speculative decoding is an elegant vector to introduce this kind of tech, and the turnaround time is even less of a problem.

      • By btown 2026-02-2012:392 reply

        For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?

        • By Zetaphor 2026-02-2013:212 reply

          My understanding as well is that speculative decoding only works with a smaller quant of the same model. You're using the faster sampling of the smaller model's representation of the larger model's weights in order to attempt to accurately predict its token output. This wouldn't work cross-model as the token probabilities are completely different.

          • By jasonjmcghee 2026-02-2014:561 reply

            This is not correct.

            Families of model sizes work great for speculative decoding. Use the 1B with the 32B or whatever.

            It's a balance as you want it to be guessing correctly as much as possible but also be as fast as possible. Validation takes time and every guess needs to be validated etc

            The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.

            • By Zetaphor 2026-02-2117:46

              > The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.

              So what I said is correct then lol. If you're saying I can use a model that isn't just a smaller quant of the larger model I'm trying to speculatively decode, except that model would never get an accurate prediction, then how is that in any way useful or desirable?

          • By ashirviskas 2026-02-2013:451 reply

            Smaller quant or smaller model?

            Afaik it can work with anything, but sharing vocab solves a lot of headaches and the better token probs match, the more efficient it gets.

            Which is why it is usually done with same family models and most often NOT just different quantizations of the same model.

            • By Zetaphor 2026-02-2117:47

              Smaller quant of the same model. A smaller quant of a different family of model would be practically useless and there wouldn't be any point in even setting it up.

        • By vessenes 2026-02-2013:35

          I think they’d commission a quant directly. Benefits go down a lot when you leave model families.

      • By jasonwatkinspdx 2026-02-2119:04

        > The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

        They may be using Rapidus, which is a Japanese government backed foundry built around all single wafer processing vs traditional batching. They advertise ~2 month turnaround time as standard, and as short as 2 weeks for priority.

      • By empath75 2026-02-2013:57

        Think about this for solving questions in math where you need to explore a search space. You can run 100 of these for the same cost and time as one API call to OpenAI.

      • By joha4270 2026-02-2012:195 reply

        The guts of a LLM isn't something I'm well versed in, but

        > to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model

        suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway and then what's the point? I assume I'm missing some detail here, but what?

    • By soleveloper 2026-02-2012:385 reply

      At $20 a die, they could sell Game Boy-style cartridges for different models.

      • By twalla 2026-02-212:27

        Okay, now _this_ is the cyberpunk future I asked for.

      • By noveltyaccount 2026-02-2015:131 reply

        That would be very cool, get an upgraded model every couple of months. Maybe PCIe form factor.

        • By soleveloper 2026-02-2020:28

          Yes, and even holding a couple of cartridges for different scenarios, e.g. image generation, coding, TTS/STT, etc.

      • By pennomi 2026-02-2015:58

        Make them shaped like floppy disks to confuse the younger generations.

      • By fennecbutt 2026-02-221:58

        Microsoft

      • By merlindru 2026-02-210:34

        dude that would be so incredibly cool

    • By alexjplant 2026-02-2115:21

      Most importantly this opens up an amazing future where we get the real version of the classic science fiction MacGuffin of a physical AI chip. Pair this with several TB of flash storage and you have persistent artificial consciousness that can be carried around with you. Bonus points if it's quirky, custom-trained and the chip is one of a kind that you stole from an evil corporation. Additional bonus points if the packaging is such that it's small enough to plug into the USB-C port on your smart glasses and has an eBPF module it can leverage to see what you're doing and talk to you in real time about your actions.

      I enjoy envisioning futures more whimsical than "the bargain-basement LLM provider that my insurance company uses denied my claim because I chose badly-vectored words".

    • By jameslk 2026-02-2022:33

      > Certainly interesting for very low latency applications which need < 10k tokens context.

      I’m really curious if context will really matter if using methods like Recursive Language Models[0]. That method is suited to break down a huge amount of context into smaller subagents recursively, each working on a symbolic subset of the prompt.

      The challenge with RLM seemed like it burned through a ton of tokens to trade for more accuracy. If tokens are cheap, RLM seems like it could be beneficial here to provide much more accuracy over large contexts despite what the underlying model can handle

      0. https://arxiv.org/abs/2512.24601

    • By aurareturn 2026-02-2011:301 reply

      Don’t forget that the 8B model requires 10 of said chips to run.

      And it’s a 3bit quant. So 3GB ram requirement.

      If they run the 8B at native 16-bit, it will use 60 H100-sized chips.

      • By dust42 2026-02-2011:381 reply

        > Don’t forget that the 8B model requires 10 of said chips to run.

        Are you sure about that? If true it would definitely make it look a lot less interesting.

        • By aurareturn 2026-02-2011:461 reply

          Their 2.4 kW is for 10 chips it seems based on the next platform article.

          I assume they need all 10 chips for their 8B q3 model. Otherwise, they would have said so or they would have put a more impressive model as the demo.

          https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

          • By audunw 2026-02-2012:092 reply

            It doesn’t make any sense to think you need the whole server to run one model. It’s much more likely that each server runs 10 instances of the model

            1. It doesn’t make sense in terms of architecture. It’s one chip. You can’t split one model over 10 identical hard-wired chips

            2. It doesn’t add up with their claims of better power efficiency. 2.4kW for one model would be really bad.

            • By aurareturn 2026-02-2012:451 reply

              We are both wrong.

              First, it is likely one chip for llama 8B q3 with 1k context size. This could fit into around 3GB of SRAM which is about the theoretical maximum for TSMC N6 reticle limit.

              Second, their plan is to etch larger models across multiple connected chips. It’s physically impossible to run bigger models otherwise since 3GB SRAM is about the max you can have on an 850mm2 chip.

                followed by a frontier-class large language model running inference across a collection of HC cards by year-end under its HC2 architecture
              
              https://mlq.ai/news/taalas-secures-169m-funding-to-develop-a...

              • By pigpop 2026-02-2022:15

                Aren't they only using the SRAM for the KV cache? They mention that the hardwired weights have a very high density. They say about the ROM part:

                > We have got this scheme for the mask ROM recall fabric – the hard-wired part – where we can store four bits away and do the multiply related to it – everything – with a single transistor. So the density is basically insane.

                I'm not a hardware guy but they seem to be making a strong distinction between the techniques they're using for the weights vs KV cache

                > In the current generation, our density is 8 billion parameters on the hard-wired part of the chip, plus the SRAM to allow us to do KV caches, adaptations like fine tuning, etc.

            • By moralestapia 2026-02-2012:131 reply

              Thanks for having a brain.

              Not sure who started that "split into 10 chips" claim, it's just dumb.

              This is Llama 8B hardcoded (literally) on one chip. That's what the startup is about; they emphasize this multiple times.

              • By aurareturn 2026-02-2012:51

                It’s just dumb to think that one chip per model is their plan. They stated that their plan is to chain multiple chips together.

                I was indeed wrong about 10 chips. I thought they would use llama 8B 16bit and a few thousand context size. It turns out, they used llama 8B 3bit with around 1k context size. That made me assume they must have chained multiple chips together since the max SRAM on TSMC n6 for reticle sized chip is only around 3GB.

    • By elternal_love 2026-02-2011:471 reply

      This is where we go towards really smart robots. It is interesting what kinds of different model chips they can produce.

      • By varispeed 2026-02-2011:506 reply

        There is nothing smart about current LLMs. They just regurgitate text compressed in their memory based on probability. None of the LLMs currently have actual understanding of what you ask them to do and what they respond with.

        • By adamtaylor_13 2026-02-2013:534 reply

          If LLMs just regurgitate compressed text, they'd fail on any novel problem not in their training data. Yet, they routinely solve them, which means whatever's happening between input and output is more than retrieval, and calling it "not understanding" requires you to define understanding in a way that conveniently excludes everything except biological brains.

          • By fennecbutt 2026-02-222:02

            I somewhat agree with you but I also realise that there are very few "novel" problems in the world. I think it's really just more complex problem spaces is all.

            Same relative logic, just more of it/more steps or trials.

          • By sfn42 2026-02-2015:432 reply

            Yes there are some fascinating emergent properties at play, but when they fail it's blatantly obvious that there's no actual intelligence nor understanding. They are very cool and very useful tools, I use them on a daily basis now and the way I can just paste a vague screenshot with some vague text and they get it and give a useful response blows my mind every time. But it's very clear that it's all just smoke and mirrors, they're not intelligent and you can't trust them with anything.

            • By pennomi 2026-02-2016:012 reply

              When humans fail a task, it’s obvious there is no actual intelligence nor understanding.

              Intelligence is not as cool as you think it is.

              • By swftarrow 2026-02-2112:45

                It can still be cool - but maybe it's just not as rare.

              • By sfn42 2026-02-2016:13

                I assure you, intelligence is very cool.

            • By atomicthumbs 2026-02-212:20

              you'd think with how often Opus builds two separate code paths without feature parity when you try to vibe code something complex, people wouldn't regard this whole thing so highly

          • By otabdeveloper4 2026-02-2019:471 reply

            > they'd fail on any novel problem not in their training data

            Yes, and that's exactly what they do.

            No, none of the problems you gave to the LLM while toying around with them are in any way novel.

            • By adamtaylor_13 2026-02-2021:011 reply

              None of my codebases are in their training data, yet they routinely contribute to them in meaningful ways. They write code that I'm happy with that improves the codebases I work in.

              Do you not consider that novel problem solving?

              • By otabdeveloper4 2026-02-229:56

                Correct, you are not doing any novel problem solving.

          • By varispeed 2026-02-2014:031 reply

            They don't solve novel problems. But if you have such a strong belief, please give us examples.

        • By bsenftner 2026-02-2012:43

          We know that, but that does not make them useless. The opposite, in fact: they are extremely useful in the hands of non-idiots. We just happen to have an oversupply of idiots at the moment, which AI is here to eradicate. /Sort of satire.

        • By visarga 2026-02-2013:47

          So you are saying they are like copy, that LLMs will copy some training data back to you? Why do we spend so much money training and running them if they "just regurgitate text compressed in their memory based on probability"? Billions of dollars to build a lossy grep.

          I think you are confused about LLMs - they take in context, and that context makes them generate new things, for existing things we have cp. By your logic pianos can't be creative instruments because they just produce the same 88 notes.

        • By flamedoge 2026-02-217:02

          I have a gut feeling that a huge portion of the deficiencies we note with AI is just a reflection of the training data. For instance, the wiki/reddit/etc internet is just a soup of human descriptions of the world model, not the actual world model itself. There are gaps or holes in the knowledge because a codified summary of the world is what is remarkable to us humans, not a 100% faithful, comprehensive description of the world. What is obvious to us humans with lived real-world experience often does not make it into the training data. A simple, demonstrable example is whether one should walk or drive to a car wash.

        • By small_model 2026-02-2012:042 reply

          That's not how they work. Pro tip: maybe don't comment until you have a good understanding?

          • By fyltr 2026-02-2012:232 reply

            Would you mind rectifying the wrong parts then?

            • By retsibsi 2026-02-2012:47

              Phrases like "actual understanding", "true intelligence" etc. are not conducive to productive discussion unless you take the trouble to define what you mean by them (which ~nobody ever does). They're highly ambiguous and it's never clear what specific claims they do or don't imply when used by any given person.

              But I think this specific claim is clearly wrong, if taken at face value:

              > They just regurgitate text compressed in their memory

              They're clearly capable of producing novel utterances, so they can't just be doing that. (Unless we're dealing with a very loose definition of "regurgitate", in which case it's probably best to use a different word if we want to understand each other.)

            • By mhl47 2026-02-2012:48

              The fact that the outputs are probabilities is not important. What is important is how that output is computed.

              You could imagine that it is possible to learn certain algorithms/ heuristics that "intelligence" is comprised of. No matter what you output. Training for optimal compression of tasks /taking actions -> could lead to intelligence being the best solution.

              This is far from a formal argument, but so is the stubborn reiteration of "it's just probabilities" or "it's just compression". Because this "just" thing is getting more and more capable of solving tasks that are surely not in the training data exactly like this.

          • By 100721 2026-02-2012:311 reply

            Huh? Their words are an accurate, if simplified, description of how they work.

            • By fragmede 2026-02-2116:08

              The simplification is where it loses granularity. I could describe every human's life as they were born and then they died. That's 100% accurate, but there's just a little something lost by simplifying that much.

        • By beyondCritics 2026-02-2012:40

          Just HI slop. Ask any decent model; it can explain what's wrong with this description.

    • By Aissen 2026-02-2013:121 reply

      > 880mm^2 die

      That's a lot of surface, isn't it? As big as an M1 Ultra (2x M1 Max at 432mm² on TSMC N5P), a bit bigger than an A100 (820mm² on TSMC N7) or H100 (814mm² on TSMC N5).

      > The larger the die size, the lower the yield.

      I wonder if that applies? What's the big deal if a few parameters have a few bit flips?

      • By rbanffy 2026-02-2013:172 reply

        > I wonder if that applies? What's the big deal if a few parameter have a few bit flips?

        We get into the sci-fi territory where a machine achieves sentience because it has all the right manufacturing defects.

        Reminds me of this https://en.wikipedia.org/wiki/A_Logic_Named_Joe

        • By sowbug 2026-02-2014:321 reply

          Also see Adrian Thompson's Xilinx 6200 FPGA, programmed by a genetic algorithm that worked but exploited nuances unique to that specific physical chip, meaning the software couldn't be copied to another chip. https://news.ycombinator.com/item?id=43152877

          • By rbanffy 2026-02-2016:44

            I love that story.

        • By philipwhiuk 2026-02-2013:21

          2000s movie line territory:

          > There have always been ghosts in the machine. Random segments of code, that have grouped together to form unexpected protocols.

    • By empath75 2026-02-2013:45

      An on-device reasoning model with that kind of speed and cost would completely change the way people use their computers. It would be closer to Star Trek than anything else we've ever had. You'd never have to type anything or use a mouse again.

    • By xnx 2026-02-2022:05

      Hardware decoders make sense for fixed codecs like MPEG, but I can't see it making sense for small models that improve every 6 months.

    • By WhitneyLand 2026-02-2013:44

      There’s a bit of a hidden cost here… GPU hardware is going to have greater longevity, since its useful life is extended every time there’s an algorithmic improvement. Whereas any efficiency gains in software that are not compatible with this hardware will tend to accelerate its depreciation.

    • By gwern 2026-02-215:29

      K-V caches are large, but hidden states aren't necessarily that large. And if you can run a model once ridiculously fast, then you can loop it repeatedly and still be fast. So I wonder about the 'modern RNNs' like RWKV here...

    • By make3 2026-02-224:22

      It's weird to me to train such huge models and then destroy them with 3-bit quantization of what are presumably 16-bit (bfloat16) weights. Why not just train smaller models then?

    • By pankajdoharey 2026-02-217:301 reply

      There is nothing new here. This has been demonstrated several times by previous researchers:

      https://arxiv.org/abs/2511.06174

      https://arxiv.org/abs/2401.03868

      For a real-world use case, you would need an FPGA with terabytes of RAM, or perhaps off-chip HBM. But for large models, even that won't be enough. Then you would need to figure out an NVLink-like interconnect for these FPGAs. And we are back to square one.

      • By smokel 2026-02-2114:371 reply

        This is new. You are citing FPGA prototypes. Those papers do not demonstrate the same class of scaling or hardware integration that Taalas is advocating. For one, the FPGA solutions typically use fixed multipliers (or lookup tables), the ASIC solution has more freedom to optimize routing for 4 bit multiplication.

        • By pankajdoharey 2026-02-254:50

          I understand what Taalas is claiming. I was trying to point out that putting a model in hardware is not something new or unthought of; the natural progression of an FPGA is an ASIC. Taalas' process is more expensive and not really worth it, because once you burn a model onto the silicon, the silicon can only serve that model. The speed improvement alone is not enough for the cost you will incur in the long run. GPUs are still general purpose, and FPGAs are at least reusable, though without the same speed. This alone cannot be a long-term business. Turning a model into hardware in two months is too long; models already take quite a long time to train. Anyone going down this strategy would leave a wide-open field to their competitors. Deployment planning for existing models is already complicated enough.

    • By bsenftner 2026-02-2012:42

      Do not overlook traditional irrational investor exuberance; we've got an abundance of that right now. With the right PR maneuvers these guys could be a tulip craze.

    • By oliwary 2026-02-2011:29

      This is insane if true - could be super useful for data extraction tasks. Sounds like we could be talking in the cents per millions of tokens range.

    • By pulse7 2026-02-2118:54

      Maybe they can stack LLM parameters in 200 layers like 3D NAND flash and make the chip very small ...

    • By mikhail-ramirez 2026-02-2014:37

      Yeah, it's fast af, but in my own tests with large chunks of text it very quickly loses context/hallucinates.

    • By Tepix 2026-02-2013:11

      Doesn't the blog state that it's now 4bit (the first gen was 3bit + 6bit)?

    • By robotnikman 2026-02-2018:15

      Sounds perfect for use in consumer devices.

    • By zozbot234 2026-02-2011:572 reply

      Low-latency inference is a huge waste of power; if you're going to the trouble of making an ASIC, it should be for dog-slow but very high throughput inference. Undervolt the devices as much as possible and use sub-threshold modes, multiple Vt and body biasing extensively to save further power and minimize leakage losses, but also keep working in fine-grained nodes to reduce areas and distances. The sensible goal is to expend the least possible energy per operation, even at increased latency.

      • By dust42 2026-02-2012:091 reply

        Low latency inference is very useful in voice-to-voice applications. You say it is a waste of power but at least their claim is that it is 10x more efficient. We'll see but if it works out it will definitely find its applications.

        • By zozbot234 2026-02-2012:181 reply

          This is not voice-to-voice though, end-to-end voice chat models (the Her UX) are completely different.

          • By dust42 2026-02-2012:29

            I haven't found any end-to-end voice chat models useful. I had much better results with separate STT-LLM-TTS. One big problem is turn detection, and having inference with 150-200ms latency would allow for a whole new level of quality. I would just use it with a prompt like "Do you think the user is finished talking?" and then push the turn to a larger model. The AI should reply within the ballpark of 600ms-1000ms. Faster is often irritating; slower will make the user start talking again.
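            The loop I have in mind, roughly (all functions hypothetical):

              def on_partial_transcript(text):
                  # cheap hard-wired model as turn detector, big model for the reply
                  verdict = fast_llm(f'Utterance so far: "{text}". '
                                     "Is the user finished talking? Answer yes or no.")
                  if verdict.strip().lower().startswith("yes"):
                      speak(big_llm(text))  # aim for ~600-1000 ms total reply time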

      • By PhunkyPhil 2026-02-2018:19

        I think it's really useful for agent to agent communication, as long as context loading doesn't become a bottleneck. Right now there can be noticeable delays under the hood, but at these speeds we'll never have to worry about latency when chain calling hundreds or thousands of agents in a network (I'm presuming this is going to take off in the future). Correct me if I'm wrong though.

  • By Alifatisk 2026-02-2017:175 reply

    What's happening in the comment section? How come so many cannot understand that this is running Llama 3.1 8B? Why are people judging its accuracy? It's an almost two-year-old 8B-param model; why are people expecting to see Opus-level responses!?

    The focus here should be on the custom hardware they are producing and its performance, that is whats impressive. Imagine putting GLM-5 on this, that'd be insane.

    This reminds me a lot of when I tried the Mercury coder model by Inceptionlabs; they are creating something called a dLLM, which is a diffusion-based LLM. The speed is still impressive when playing around with it sometimes. But this, this is something else, it's almost unbelievable. As soon as I hit the enter key, the response appears. It feels instant.

    I am also curious about Taalas pricing.

    > Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.

    Do we have an idea of how much a unit / inference / api will cost?

    Also, considering how fast people switch models to keep up with the pace: is there really a potential market for hardware designed for one model only? What will they do when they want to upgrade to a better version? Throw away the current hardware and buy another one? Shouldn't there be a more flexible way? Maybe only having to switch the chip on top, like how people upgrade CPUs. I don't know, just thinking out loud.

    • By mike_hearn 2026-02-2018:071 reply

      They don't give cost figures in their blog post but they do here:

      https://www.nextplatform.com/wp-content/uploads/2026/02/taal...

      Probably they don't know what the market will bear and want to do some exploratory pricing, hence the "contact us" API access form. That's fair enough. But they're claiming orders of magnitude cost reduction.

      > Is there really a potential market for hardware designed for one model only?

      I'm sure there is. Models are largely interchangeable, especially at the low end. There are lots of use cases where you don't need super smart models but cheapness and fastness can matter a lot.

      Think about a simple use case: a company has a list of one million customer names but no information about gender or age. They'd like to get a rough understanding of this. Mapping name -> guessed gender, rough guess of age is a simple problem for even dumb LLMs. I just tried it on ChatJimmy and it worked fine. For this kind of exploratory data problem you really benefit from mass parallelism, low cost and low latency.
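      The shape of that workload, roughly (the llm client and prompt are made up; the point is that each row is a tiny, independent call that parallelizes trivially):

        def guess_demographics(names, llm):
            # one cheap call per name; fan out across as many chips as you like
            out = {}
            for name in names:
                out[name] = llm(f"Name: {name}. Reply as JSON with keys "
                                '"gender" ("m"/"f"/"unknown") and "age_range".')
            return out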

      > Shouldn't there be a more flexible way?

      The whole point of their design is to sacrifice flexibility for speed, although they claim they support fine tunes via LoRAs. LLMs are already supremely flexible so it probably doesn't matter.

      • By pigpop 2026-02-2022:24

        Yes, there are all kinds of fuzzy NLP tasks that this would be great for. Jobs where you can chunk the text into small units and add instructions and only need a short response. You could burn through huge data sets very quickly using these chips.

    • By himata4113 2026-02-2017:561 reply

      I personally don't buy it; Cerebras is way more advanced than this, and comparing this tok/s to Cerebras is disingenuous.

      • By alfalfasprout 2026-02-214:11

        Cerebras is a totally different product though. They can (theoretically) run any frontier model provided it gets compiled a certain way. Like a wafer scale TPU.

        This is using hardwired weights with on-die SRAM used for K/V for example. It's WAY more power efficient and faster. The tradeoff being it's hardwired.

        Still, most frontier models are "good enough" where an obscenely fast version would be a major seller.

    • By test001only 2026-02-2114:071 reply

      That is my concern too. A chip optimised for a model or specific model architecture will not be useful for long.

      • By ahofmann 2026-02-2116:03

        I just tried the demo and I think this is huge! If they manage to build a chip in 2 or 3 years that can run something like Opus 4.6, or even Sonnet, at that speed, the disruption in the world of software development will be more than we saw in the last 3-5 years. LLMs today are somewhat useful, but they are still too slow and expensive for a meaningful ralph loop. Being able to run those loops (or if you want to call it "thinking") much faster will enable a lot of stuff that is not feasible today. Writing things like openclaw will not take weeks, but hours. Maybe even rewriting entire tools, kernels or OSes will be feasible because the LLM can run through almost endless tries.

        Speed and cost win over quality, and this will also be true for LLMs.

    • By Herring 2026-02-214:251 reply

      If it's so easy to do custom silicon for any model (they say only 2 months), why didn't they demo one of the newer DeepSeek models instead? Using a 2-year-old model is so bad. I'm not buying it.

      • By robotpepi 2026-02-217:571 reply

        they explain it in the article: this is the first iteration, so they wanted to start with something simple, i.e., this is a tech demo.

        • By Herring 2026-02-218:061 reply

          Ok then I look forward to seeing DeepSeek running instantly at the end of April.

          • By akie 2026-02-2111:572 reply

            Why so negative lol. The speed and very reduced power use of this thing are nothing to be sneezed at. I mean, hardware accelerated LLMs are a huge step forward. But yeah, this is a proof of concept, basically. I wouldn't be surprised if the size factor and the power use go down even more, and that we'll start seeing stuff like this in all kinds of hardware. It's an enabler.

            • By Herring 2026-02-2117:061 reply

              You don't know. You just have marketing materials, not independent analysis. Maybe it actually takes 2 years to design and manufacture the hardware, so anything that comes out will be badly out of date. Wouldn't be the first time someone lied. A good demo backed by millions of dollars should not allow such doubts.

              • By akie 2026-02-2120:05

                Did you not see the chatbot they posted online (https://chatjimmy.ai/)? That thing is near instantaneous, it's all the proof you need that this is real.

                And if the hardware is real and functional, as you can independently verify by chatting with that thing, how much more effort would it be to etch more recent models?

                The real question is of course: what about LARGER models? I'm assuming you can apply some of the existing LLM inference parallelization techniques and split the workload over multiple cards. Some of the 32B models are plenty powerful.

                It's a proof of concept, and a convincing one.

    • By real-hacker 2026-02-224:31

      They support LoRA; that's something.

  • By freakynit 2026-02-2012:2413 reply

    Holy cow, their chat app demo!!! For the first time I thought I had mistakenly pasted the answer. It was literally in the blink of an eye!!

    https://chatjimmy.ai/

    • By qingcharles 2026-02-2015:131 reply

      I asked it to design a submarine for my cat and literally the instant my finger touched return the answer was there. And that is factoring in the round-trip time for the data too. Crazy.

      The answer wasn't dumb like others are getting. It was pretty comprehensive and useful.

        While the idea of a feline submarine is adorable, please be aware that building a real submarine requires significant expertise, specialized equipment, and resources.

      • By robotpepi 2026-02-2017:45

        It's incredible how many people are commenting here without having read the article. They completely missed the point.

    • By smusamashah 2026-02-2013:565 reply

      With this speed, you can keep looping and generating code until it passes all tests. If you have tests.

      Generate lots of solutions and mix and match. This allows a new way to look at LLMs.
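      Something like this becomes cheap enough to run hundreds of times (hypothetical llm callable; the test suite is the oracle):

        import pathlib, subprocess, tempfile

        def generate_until_green(task, llm, test_cmd, max_iters=100):
            feedback = ""
            for _ in range(max_iters):
                code = llm(f"{task}\n\nLast failure:\n{feedback}")
                path = pathlib.Path(tempfile.mkdtemp()) / "solution.py"
                path.write_text(code)
                run = subprocess.run(test_cmd + [str(path)],
                                     capture_output=True, text=True)
                if run.returncode == 0:
                    return code                     # all tests pass
                feedback = run.stdout + run.stderr  # feed failures back in
            return None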

      • By Retr0id 2026-02-2014:172 reply

        Not just looping, you could do a parallel graph search of the solution-space until you hit one that works.

        • By xi_studio 2026-02-2014:49

          The Infinite Monkey Theorem just reached its peak

        • By dave1010uk 2026-02-2019:06

          You could also parse prompts into an AST, run inference, run evals, then optimise the prompts with something like a genetic algorithm.

      • By turnsout 2026-02-2016:31

        Agreed, this is exciting, and has me thinking about completely different orchestrator patterns. You could begin to approach the solution space much more like a traditional optimization strategy such as CMA-ES. Rather than expect the first answer to be correct, you diverge wildly before converging.

      • By Epskampie 2026-02-2014:251 reply

        And then it's slow again to finally find a correct answer...

        • By 34679 2026-02-2016:481 reply

          It won't find the correct answer. Garbage in, garbage out.

          • By akie 2026-02-2111:58

            How about if you run this loop (one year from now) on this kind of hardware but with something like Claude/Kimi K2. How about that? Because that's where it'll go.

      • By MattRix 2026-02-2014:00

        This is what people already do with “ralph” loops using the top coding models. It’s slow relative to this, but still very fast compared to hand-coding.

      • By otabdeveloper4 2026-02-2019:571 reply

        This doesn't work. The model outputs the most probable tokens. Running it again and asking for less probable tokens just results in the same thing but with more errors.

        • By therealdrag0 2026-02-2022:11

          Do you not have experience with agents solving problems? They already successfully do this. They try different things until they get a solution.

    • By amelius 2026-02-2012:402 reply

      OK investors, time to pull out of OpenAI and move all your money to ChatJimmy.

      • By freakynit 2026-02-2012:435 reply

        A related argument I raised a few days back on HN:

        What's the moat with these giant data centers that are being built with hundreds of billions of dollars of Nvidia chips?

        If such chips can be built so easily, and offer this insane level of performance at 10x efficiency, then one thing is 100% sure: more such startups are coming... and with that, an entire new ecosystem.

        • By codebje 2026-02-2013:001 reply

          RAM hoarding is, AFAICT, the moat.

          • By freakynit 2026-02-2013:101 reply

            lol... true that for now though

            • By Windchaser 2026-02-2016:441 reply

              Yeah, just cause Cisco had a huge market lead on telecom in the late '90s, it doesn't mean they kept it.

              (And people nowadays: "Who's Cisco?")

              • By wmf 2026-02-211:201 reply

                They did mostly keep it though.

                • By Windchaser 2026-02-2418:39

                  Sure, but it's taken their stock price about 20 years to recover.

        • By bee_rider 2026-02-2013:581 reply

          I think their hope is that they’ll have the “brand name” and expertise to have a good head start when real inference hardware comes out. It does seem very strange, though, to have all this massive infrastructure investment in what is ultimately going to be useless prototyping hardware.

          • By elictronic 2026-02-2015:07

            Tools like openclaw start making the models a commodity.

            I need some smarts to route my question to the correct model. I won't care which that is. Selling commodities is notorious for slow and steady growth.

        • By jzymbaluk 2026-02-2016:482 reply

          You'd still need those giant data centers for training new frontier models. These Taalas chips, if they work, seem to do the job of inference well, but training will still require general purpose GPU compute

          • By amelius 2026-02-2111:59

            Yeah but you need even bigger factories to fabricate those inference chips, so what is the point?

          • By bonoboTP 2026-02-2021:01

            Next up: wire up a specialized chip to run the training loop of a specific architecture.

        • By mlboss 2026-02-2017:50

          If I am not mistaken, this chip was built specifically for the Llama 8B model. Nvidia chips are general purpose.

        • By wmf 2026-02-2019:46

          Nvidia bought all the capacity so their competitors can't be manufactured at scale.

      • By raincole 2026-02-2013:12

        You mean Nvidia?

    • By rstuart4133 2026-02-215:27

      > It was literally in a blink of an eye.!!

      It's not even close. It takes the eye 100ms to 400ms to blink. This thing takes under 30ms to process a small query, about 10 times faster.

    • By zwaps 2026-02-2012:34

      I got 16,000 tokens per second ahaha

    • By gwd 2026-02-2012:372 reply

      I dunno, it pretty quickly got stuck; the "attach file" didn't seem to work, and when I asked "can you see the attachment" it replied to my first message rather than my question.

      • By scosman 2026-02-20 12:54 (1 reply)

        It’s Llama 3.1 8B. No vision, not smart. It’s just a technical demo.

        • By anthonypasq 2026-02-20 15:23

          Why is everyone seemingly incapable of understanding this? What is going on here? It's like AI doomers consistently have the foresight of a rat. Yeah, no shit it sucks, it's running Llama 3 8B, but they're completely incapable of extrapolation.

      • By freakynit 2026-02-20 12:41

        Hmm.. I had tried a simple chat conversation without file attachments.

    • By PlatoIsADisease 2026-02-20 14:25 (1 reply)

      Well, it got all 10 incorrect when I asked for the top 10 catchphrases from a character in Plato's books. It mistook the baddie for Socrates.

      • By Rudybega 2026-02-21 2:32

        Well yeah, they're running a small, outdated model. That's not really the point. This approach can be used for better, larger, newer models.

    • By bsenftner 2026-02-20 12:36 (1 reply)

      I get nothing, no replies to anything.

      • By freakynit 2026-02-20 12:40

        Maybe the HN and Reddit crowds have overloaded them lol

    • By elliotbnvl 2026-02-20 12:27

      That… what…

    • By b0ner_t0ner 2026-02-20 13:28 (2 replies)

      I asked, “What are the newest restaurants in New York City?”

      Jimmy replied with, “2022 and 2023 openings:”

      0_0

      • By freakynit 2026-02-20 13:38

        Well, technically its answer is correct when you consider its knowledge cutoff date... it just gave you a generic, always-right answer :)

      • By xi_studio 2026-02-20 14:52

        ChatJimmy's trained on Llama 3.1.

    • By jvidalv 2026-02-20 13:36 (2 replies)

      It's super fast but also super inaccurate; I would say not even GPT-3 level.

      • By roywiggins 2026-02-20 18:05

        That's because it's Llama 3 8B.

      • By empath75 2026-02-20 13:59 (1 reply)

        A lot of people here are completely missing the point. What is it called when you look at a single point in time and judge an idea without being able to imagine five seconds into the future?

    • By Etheryte 2026-02-20 12:44 (3 replies)

      It is incredibly fast, on that I agree, but even the simple queries I tried got very inaccurate answers. Which makes sense; it's essentially a trade-off of how much time you give it to "think." But if it's fast to the point where it has no accuracy, I'm not sure I see the appeal.

      • By andrewdea 2026-02-20 13:27 (1 reply)

        The hardwired model is Llama 3.1 8B, a lightweight model from two years ago. Unlike other models, it doesn't use "reasoning": the time between question and answer is spent predicting the next tokens. It doesn't run faster because it uses less time to "think"; it runs faster because its weights are hardwired into the chip rather than loaded from memory. A larger model running on a larger hardwired chip would be about as fast and give far more accurate results. That's what this proof of concept shows.
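
        A rough back-of-envelope of why DRAM-resident weights cap token rate (all numbers are illustrative assumptions, not vendor specs):

           # Each generated token must stream every weight from DRAM once,
           # so token rate is capped at bandwidth / model size.
           weights_gb = 8.0         # Llama 3.1 8B at ~8-bit quantization (assumed)
           dram_bw_gb_s = 3000.0    # high-end HBM bandwidth (assumed)
           print(dram_bw_gb_s / weights_gb)   # ~375 tok/s ceiling per stream
           # Hardwiring the weights on-chip removes that streaming cost,
           # which is how the demo reaches ~17,000 tok/s.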

        • By Etheryte 2026-02-20 13:52 (1 reply)

          I see, that's very cool, that's the context I was missing, thanks a lot for explaining.

          • By Sabinus 2026-02-21 2:46 (1 reply)

            I don't mean to be rude, but did you read the article before commenting?

            • By Etheryte 2026-02-22 10:25

              I'm commenting on the link to their demo, not on the article.

      • By kaashif 2026-02-20 12:46 (2 replies)

        If it's incredibly fast at a 2022 state-of-the-art level of accuracy, then surely it's only a matter of time until it's incredibly fast at a 2026 level of accuracy.

        • By PrimaryExplorer 2026-02-20 12:51 (1 reply)

          Yeah, this is mind-blowing speed. Imagine this with Opus 4.6 or GPT 5.2. Probably coming soon.

          • By scotty79 2026-02-20 13:16

            I'd be happy if they can run GLM 5 like that. It's amazing at coding.

        • By Gud 2026-02-20 12:53 (2 replies)

          Why do you assume this?

          I can produce total gibberish even faster; that doesn't mean I'd produce Einstein-level thought if I slowed down.

          • By Closi 2026-02-20 17:54

            Better models already exist, this is just proving you can dramatically increase inference speeds / reduce inference costs.

            It isn't about model capability - it's about inference hardware. Same smarts, faster.

          • By andy12_ 2026-02-20 13:33

            Not what he said.

      • By scotty79 2026-02-20 13:15

        I think it might be pretty good for translation, especially when fed small chunks of the content at a time so it doesn't lose track on longer texts.
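
        A minimal sketch of that chunked approach; `chat` here stands in for whatever client the endpoint exposes (hypothetical, not a real API):

           # Translate paragraph by paragraph so each request stays small;
           # at ~17k tok/s the extra round trips cost almost nothing.
           def translate_long_text(text: str, chat) -> str:
               paragraphs = text.split("\n\n")
               translated = [chat(f"Translate to English:\n\n{p}") for p in paragraphs]
               return "\n\n".join(translated)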

    • By rvz 2026-02-20 14:00 (4 replies)

      Fast, but stupid.

         Me: "How many r's in strawberry?"
      
         Jimmy: There are 2 r's in "strawberry".
      
         Generated in 0.001s • 17,825 tok/s
      
      The question is not how fast it is. The real questions are:

         1. How is this worth it over diffusion LLMs? (There is no mention of diffusion LLMs at all in this thread; this also assumes diffusion LLMs will get faster.)

         2. Will Taalas also work with reasoning models, especially those beyond 100B parameters, with the output still being correct?

         3. How long will it take to turn newer models into silicon? (This industry moves faster than Taalas.)

         4. How does this work when one needs to fine-tune the model but still benefit from the speed advantages?

      • By mike_hearn 2026-02-20 17:59

        The blog answers all those questions. It says they're working on fabbing a reasoning model this summer. It also says how long they think they need to fab new models, and that the chips support LoRAs and tweaking the context window size.

        I don't get these posts about ChatJimmy's intelligence. It's a heavily quantized Llama 3 (using a custom quantization scheme) because that was state of the art when they started. They claim they can update quickly (so I wonder why they didn't wait a few more months, tbh, and fab a newer model). Llama 3 wasn't very smart, but so what; a lot of LLM use cases don't need smart, they need fast and cheap.

        Apparently they can also run DeepSeek R1, and they have benchmarks for that. New models only require a couple of new masks, so they're flexible.
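
        For intuition on how LoRA squares with weights fixed in silicon: the base matrix stays frozen and only a small low-rank update is added, so the adapter can live in ordinary memory (an illustrative sketch, not Taalas's actual scheme):

           import numpy as np

           # LoRA: y = (W + B @ A) x, with W frozen (here: hardwired) and only
           # the small matrices A (r x d) and B (d x r) swapped per fine-tune.
           d, r = 1024, 8
           W = np.random.randn(d, d)           # stands in for the fixed, on-die weights
           A = np.random.randn(r, d) * 0.01    # small adapter, cheap to store off-die
           B = np.random.randn(d, r) * 0.01
           x = np.random.randn(d)
           y = W @ x + B @ (A @ x)             # adapter adds a low-rank correction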

      • By fennecbutt 2026-02-22 2:11

        The counting-r's-in-strawberry problem was an example of people not understanding how the models work, but I guess it's good for showing the limitations of the current architectures.

        But the thing is, those architectures haven't improved a whole lot. Now, when a model answers that correctly, it's either in the training data or by virtue of "count letters" or code-sandbox tools.

      • By simlevesque 2026-02-20 17:14 (1 reply)

        LLMs can't count. They need tool use to answer these questions accurately.
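
        The "tool" needed here is trivial; the model just has to call it instead of guessing across sub-word tokens (illustrative sketch):

           # The kind of helper an LLM would invoke via tool use, since its
           # tokenizer never shows it individual letters.
           def count_letter(word: str, letter: str) -> int:
               return word.lower().count(letter.lower())

           print(count_letter("strawberry", "r"))  # 3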

        • By CamperBob2 2026-02-21 6:17

          That particular one can't count without using external tools. Others can, and do.
