Depends on your use case. Post-processing can save headaches when soft constraints are fine or you want max flexibility, but you risk subtle errors slipping by. For API responses or anything that gets parsed downstream, I still trust grammar-constrained generation more—it just surfaces problems earlier.
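For a sense of what the post-processing path looks like, here's a rough Python sketch; `extract_json_block` is a made-up helper, and real repair logic will vary:

```python
import json
import re


def extract_json_block(text: str) -> str:
    """Pull the first {...} span out of a chatty reply (made-up helper)."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return match.group(0)


def parse_or_fail(raw_reply: str) -> dict:
    """Soft-constraint path: validate after generation instead of during it."""
    candidate = extract_json_block(raw_reply)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError as err:
        # This is where the subtle errors surface: late, at parse time.
        raise ValueError(f"model produced invalid JSON: {err}") from err
```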
In most cases, yes: when the grammar admits only one valid token, forcing it is a common fast path, since there's nothing left to sample over. Things get trickier when multiple tokens could satisfy the same grammar position, especially with unusual tokenizations or BPE merges that split a terminal across token boundaries. Those edge cases can trip up token selection, but for single-character terminals like brackets and commas, forced emission is usually reliable.
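Roughly, the fast path looks like this; the grammar engine and its `allowed` set here are assumptions, not any particular library's API:

```python
import torch


def next_token(logits: torch.Tensor, allowed: list[int]) -> int:
    """Pick the next token under a grammar mask, forcing it when unambiguous."""
    if len(allowed) == 1:
        # The grammar dictates a single option (e.g. a closing bracket):
        # skip sampling entirely and emit it.
        return allowed[0]
    # Otherwise mask out illegal tokens and sample among the rest.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```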
Good question. Some frameworks apply the mask immediately at every step, while others defer it for performance or implementation simplicity. Precomputing masks gets tricky with large vocabularies, especially when grammar elements span multiple tokens. Immediate masking is usually the default; deferred or cached masks only start to pay off when the grammar is complex or throughput is the bottleneck.
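As a sketch of the immediate strategy, here's what it looks like wired up as a Hugging Face `LogitsProcessor`; the `grammar.allowed_tokens` call is a stand-in for whatever grammar engine you're using:

```python
import torch
from transformers import LogitsProcessor


class GrammarMaskProcessor(LogitsProcessor):
    """Immediate masking: recompute and apply the legal-token mask every step."""

    def __init__(self, grammar):
        self.grammar = grammar

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        mask = torch.full_like(scores, float("-inf"))
        for batch_idx, prefix_ids in enumerate(input_ids.tolist()):
            # The "immediate" strategy pays this cost at every decoding step;
            # deferred or lazy variants cache or precompute these sets instead.
            allowed = self.grammar.allowed_tokens(prefix_ids)
            mask[batch_idx, allowed] = 0.0
        return scores + mask
```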
You're spot on about the "perfect" JSON bar being unreachable for now. The only consistently reliable method I've seen in the wild is some form of constrained decoding or grammar enforcement: a bit brittle, but practical. Sampling will always be fuzzy unless the architecture fundamentally shifts. Anyone claiming zero validity issues is probably glossing over a ton of downstream QA work.
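That QA work usually boils down to a validate-and-retry loop along these lines; `call_model` is a hypothetical wrapper around whatever inference API you're on:

```python
import json


def generate_valid_json(prompt: str, call_model, max_attempts: int = 3) -> dict:
    """Unconstrained sampling plus retries: better odds, still no guarantee."""
    last_error = None
    for _ in range(max_attempts):
        reply = call_model(prompt)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            last_error = err
            # Feed the failure back so the next attempt can self-correct.
            prompt = f"{prompt}\n\nYour previous reply was not valid JSON ({err}). Reply with valid JSON only."
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {last_error}")
```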
Guidance is genuinely impressive for anyone wrangling LLM output. Enforcing grammar constraints that efficiently at inference time solves a lot of subtle issues, tokenization headaches being just one. Curious whether you've tracked adoption of JSON versus custom grammars among production teams? Anecdotally, JSON has become the baseline, but custom grammars unlock much more nuanced applications.