
zhangchen

Karma: 3

Created: 2022-01-29

Recent Activity

  • Yeah this matches what we've seen too. The biggest gains we got weren't from switching models; they came from investing in better context: giving the agent a well-structured spec, relevant code samples from the repo, and explicit constraints up front. Without that, even the best models will happily produce working but unmaintainable code. Feels like the whole SWE-bench framing misses this: passing tests is the easy part, fitting into an existing codebase's patterns and conventions is where it actually gets hard.

  • certainty scoring sounds useful but fwiw the harder problem is temporal - a fact that was true yesterday might be wrong today, and your agent has no way to know which version to trust without some kind of causal ordering on the writes.
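    The causal-ordering point above can be sketched with a Lamport-style logical clock: each write bumps a counter past any timestamp it has observed, so a read can trust the causally latest version of a fact. A minimal stdlib sketch — `FactStore` and its methods are hypothetical names, not from any real agent-memory library:

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class FactStore:
        """Hypothetical fact store tagging each write with a logical timestamp."""
        clock: int = 0
        facts: dict = field(default_factory=dict)  # key -> (timestamp, value)

        def write(self, key, value, observed_ts=0):
            # Lamport rule: advance strictly past any timestamp seen elsewhere,
            # so concurrent writers still get a consistent causal order.
            self.clock = max(self.clock, observed_ts) + 1
            self.facts[key] = (self.clock, value)
            return self.clock

        def read(self, key):
            # Return the stored value plus its timestamp, so the caller
            # knows *which* version of the fact it is trusting.
            ts, value = self.facts[key]
            return value, ts

    store = FactStore()
    store.write("hq_city", "Austin")   # true yesterday
    store.write("hq_city", "Denver")   # true today, causally later write wins
    value, ts = store.read("hq_city")  # -> ("Denver", 2)
    ```

    The point of the timestamp is that the agent can compare it against the timestamp attached to whatever stale copy it already holds, rather than guessing which version to trust.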

  • that's already happening tbh. the real issue isn't hypocrisy though, it's that maintainers reviewing their own LLM output have full context on what they asked for and can verify it against their mental model of the codebase. a random contributor's LLM output is basically unverifiable, you don't know what prompt produced it or whether the person even understood the code they're submitting.

  • Langfuse + custom OTEL spans has been the most practical combo for us. The key insight was treating each agent step as a trace span with token counts and latency, then setting alerts on cost-per-task rather than raw token volume.
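    The "alert on cost-per-task, not raw token volume" idea reduces to summing per-step token counts against a price table. A stdlib sketch with made-up prices and thresholds — in the real setup these numbers ride on OTEL span attributes and Langfuse does the aggregation:

    ```python
    # Assumed per-1K-token rates in USD; purely illustrative, not any
    # vendor's actual pricing.
    PRICE_PER_1K = {"input": 0.003, "output": 0.015}

    def step_cost(input_tokens, output_tokens):
        # Cost of a single agent step (one trace span).
        return (input_tokens / 1000) * PRICE_PER_1K["input"] \
             + (output_tokens / 1000) * PRICE_PER_1K["output"]

    def task_cost(steps):
        # steps: list of (input_tokens, output_tokens), one per span.
        return sum(step_cost(i, o) for i, o in steps)

    def over_budget(steps, budget_usd=0.50):
        # Alert condition: the whole task, not any single step,
        # exceeds its dollar budget.
        return task_cost(steps) > budget_usd

    steps = [(4000, 800), (2500, 1200), (6000, 400)]
    cost = task_cost(steps)          # -> 0.0735
    alert = over_budget(steps)       # -> False
    ```

    Alerting on the task total catches the pathological case where every individual step looks cheap but a retry loop makes the task as a whole expensive.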

  • The mmap layer streaming approach is smart for working around memory limits. In practice though, 1.58-bit ternary quantization tends to degrade quality noticeably on reasoning-heavy tasks compared to 4-bit — curious if you've measured perplexity deltas at the 140B scale.
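    For reference, 1.58-bit ternary quantization maps each weight to {-1, 0, +1} times a per-tensor scale, which is where the quality loss on reasoning tasks comes from. A stdlib sketch using absmean scaling with round-and-clip; the exact recipe varies by implementation, so treat the scale/threshold choices as illustrative:

    ```python
    def quantize_ternary(weights):
        # Per-tensor scale = mean absolute value of the weights (absmean).
        scale = sum(abs(w) for w in weights) / len(weights)
        ternary = []
        for w in weights:
            q = round(w / scale)                # snap to nearest integer...
            ternary.append(max(-1, min(1, q)))  # ...then clip into {-1, 0, +1}
        return scale, ternary

    def dequantize(scale, ternary):
        # Reconstruction: every surviving weight is +/-scale or exactly 0.
        return [scale * t for t in ternary]

    weights = [0.9, -0.05, -1.1, 0.4]
    scale, tern = quantize_ternary(weights)  # tern -> [1, 0, -1, 1]
    ```

    Small weights collapse to exactly 0 and everything else to a single magnitude, which is far coarser than the 16 levels a 4-bit scheme keeps per group — consistent with the perplexity-delta question above.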

HackerNews