I make https://synthetic.new
Magistral Small seems wayyy too heavy-handed with its RL to me:
\boxed{Hey! How can I help you today?}
They clearly rewarded the \boxed{...} formatting during RL training, since it makes it easy to naively extract answers to math problems and verify them. But Magistral uses it for pretty much everything, even when it's inappropriate; I've seen the same in my own testing.
It also forgets to <think> unless you use their special system prompt reminding it to.
Honestly a little disappointing. It obviously benchmarks well, but it seems a little overcooked on non-benchmark usage.
Clickbait headline. "Fine-tuning LLMs for knowledge injection is a waste of time" is true, but IDK who's trying to do that. Fine-tuning is great for changing model behavior (e.g. the zillions of uncensored models on Hugging Face are far more willing to respond to... dodgy... prompts than any amount of RAG is gonna get you), and RAG is great for knowledge injection.
Also... "LoRA" as a replacement for fine-tuning??? LoRA is a kind of fine-tuning! In the research community it's literally classed as "parameter-efficient fine-tuning." You're updating a smaller number of weights, but you're still updating them.
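The point is easy to see in code: LoRA still changes the effective weight matrix, it just parameterizes the change as a low-rank delta. A toy numpy sketch (illustrative dimensions, not any particular library's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 1024, 8                     # hidden size, LoRA rank (r << d)
W = rng.standard_normal((d, d))    # frozen pretrained weight

# LoRA trains only the two small factors A and B.
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))               # B starts at zero, so the delta starts at zero

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Effective weight is W + B @ A: the model's behavior really does
    # change as A and B train -- it's fine-tuning, just done cheaply.
    return x @ (W + B @ A).T

# Trainable parameters: 2*d*r instead of d*d.
print(2 * d * r, "trainable vs", d * d, "full")
```

With d=1024 and r=8 that's 16,384 trainable parameters instead of ~1M for that one matrix, which is the "parameter-efficient" part; "not fine-tuning" it is not.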
It's not better than full R1; Mistral's comparison is misleading. The latest version of R1, R1-0528, is much better: 91.4% on AIME 2024 pass@1. Mistral uses the original R1 release from January in their comparisons, presumably because it makes their numbers look more competitive.
That being said, it's still very impressive for a 24B.
I'm really wondering why the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.
Sidenote, but I'm pretty sure DeepSeek is focused on V4, and after that will train an R2 on top. The V3-0324 and R1-0528 releases weren't retrained from scratch, they just continued training from the previous V3/R1 checkpoints. They're nice bumps, but V4/R2 will be more significant.
Of course, OpenAI, Google, and Anthropic will have released new models by then too...
This project is an enhanced reader for Y Combinator's Hacker News: https://news.ycombinator.com/.
The interface also allows you to comment, post, and interact with the original HN platform. Credentials are stored locally and are never sent to any server; you can check the source code here: https://github.com/GabrielePicco/hacker-news-rich.
For suggestions and feature requests, you can reach me here: gabrielepicco.github.io