There's at least some discussion in https://www.lesswrong.com/posts/pLnLSgWphqDbdorgi/on-the-imp...
>Instead of generating tokens one at a time, a dLLM produces the full answer at once. The initial answer is iteratively refined through a diffusion process, where a transformer suggests improvements for the entire answer at once at every step. In contrast to autoregressive transformers, the later tokens don’t causally depend on the earlier ones (leaving aside the requirement that the text should look coherent). For an intuition of why this matters, suppose that a transformer model has 50 layers and generates a 500-token reasoning trace, the final token of this trace being the answer to the question. Since information can only move vertically and diagonally inside this transformer and there are fewer layers than tokens, any computations made before the 450th token must be summarized in text to be able to influence the final answer at the last token. Unless the model can perform effective steganography, it had better output tokens that are genuinely relevant for producing the final answer if it wants the performed reasoning to improve the answer quality. For a dLLM generating the same 500-token output, the earlier tokens have no such causal role, since the final answer isn’t autoregressively conditioned on the earlier tokens. Thus, I’d expect it to be easier for a dLLM to fill those tokens with post-hoc rationalizations.
>Despite this, I don’t expect dLLMs to be a similarly negative development as Huginn or COCONUT would be. The reason is that in dLLMs, there’s another kind of causal dependence that could prove to be useful for interpreting those models: the later refinements of the output causally depend on the earlier ones. Since dLLMs produce human-readable text at every diffusion iteration, the chains of uninterpretable serial reasoning aren’t that deep. I’m worried about the text looking like gibberish at early iterations and the reasons behind the iterative changes the diffusion module makes to this text being hard to explain, but the intermediate outputs nevertheless have the form of human-readable text, which is more interpretable than long series of complex matrix multiplications.
Based solely on the above, my armchair analysis is that it's not strictly diffusion in the Langevin/denoising sense (since there are discrete iteration rounds), but instead borrows the idea of "iterative refinement". You drop the causal masking and token-by-token autoregressive generation, and instead start with a bunch of text and propose a series of edits at each step? On one hand, dropping the causal masking over the token sequence means you no longer have an objective that forces the LLM to learn a representation sufficient to "predict" in the usual sense, but on the flip side there is now a sort of causal masking over _time_, since each iteration depends on the previous one. It's a neat tradeoff.
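To make the "iterative refinement" framing concrete, here's a caricature of the loop in Python. Everything in it (the masking schedule, the stand-in "model" that just converges toward a fixed target) is invented purely for illustration of the control flow, not how any real dLLM works:

```python
# Toy sketch of iterative refinement (NOT a real dLLM): start from a fully
# masked sequence and, each round, re-propose every position in parallel,
# conditioning on the previous round's full draft. The causal structure
# runs across rounds, not across token positions.
import random

random.seed(1)
TARGET = list("hello world")          # stand-in for "the answer"
VOCAB = sorted(set(TARGET))
MASK = "_"

def denoise_step(draft, step, total):
    # Stand-in for the transformer: each position is refined independently,
    # committing to the "right" token with growing probability, so early
    # drafts look like gibberish and late drafts look coherent.
    out = []
    for i, tok in enumerate(draft):
        if random.random() < (step + 1) / total:
            out.append(TARGET[i])                 # pretend the model nailed it
        elif tok == MASK:
            out.append(random.choice(VOCAB))      # noisy early guess
        else:
            out.append(tok)                       # keep previous draft's token
    return out

draft = [MASK] * len(TARGET)
for step in range(8):
    draft = denoise_step(draft, step, 8)
    print(step, "".join(draft))
```

Note how every position is rewritten at every round, so no token causally depends on its left-neighbors within a round, only on the previous round's draft — which is the tradeoff described above.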
Subthread https://news.ycombinator.com/item?id=43851429 also has some discussion
Isn't there some cache of code-signing info? https://wiki.lazarus.freepascal.org/Code_Signing_for_macOS
>Specifically, the code signing information (code directory hash) is hung off the vnode within the kernel, and modifying the file behind that cache will cause problems. You need a new vnode, which means a new file, that is, a new inode. Documented in WWDC 2019 Session 703 All About Notarization - see slide 65 (PDF).
This seems to be described in https://eclecticlight.co/2024/04/29/apfs-beyond-to-vfs-and-v... but I'm just a layman here. I don't quite understand the benefit of this caching if you have to recompute the hashes to detect a mismatch anyway. [1]
And I realize now the initial Gatekeeper scan is probably just controlled by the presence of the quarantine bit; the results themselves are probably not cached.
Edit: Now I'm not so sure — spctl has a --ignore-cache option, so the result of the Gatekeeper assessment is indeed cached somehow. And presumably, as you noted, a cache miss here is what causes the long application-launch delay.
[1] https://github.com/golang/go/issues/42684 has a bit more info on this, I'm happy to see that even seasoned experts are confused about these things.
> Macs have a cache of SHA-256 hashes of all bundled files of all apps that have been launched. But where exactly is this cache
I always assumed this had to be the case? When you first launch an application, Gatekeeper takes a long time verifying it, but subsequent launches are fast. So _some_ bit seems to be stored somewhere indicating whether this is a "first launch" and whether full verification needs to be performed (maybe it's the Launch Services cache?)
As for whether the entire image is verified before _each_ launch, I'm not 100% familiar with the flow, but I don't think that's correct; verification can be done lazily on a page-by-page basis. https://developer.apple.com/documentation/endpointsecurity/e...
>In the specific case of process execution, this is after the exec completes in the kernel, but before any code in the process starts executing. At that point, XNU has validated the signature itself and has verified that the cdhash is correct. This second validation means that the hash of all individual page hashes in the Code Directory match the signed cdhash, essentially verifying the signature wasn’t tampered with. However, XNU doesn’t verify individual page hashes until the binary executes and pages in the corresponding pages. XNU doesn’t determine a binary shows signs of tampering until the individual pages page in, at which point XNU updates the code signing flags.
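The scheme the quote describes — a signed hash over per-page hashes, with individual pages only checked as they page in — can be modeled in a few lines. This is a toy analog for intuition only (real Code Directories have more structure; the page size, data, and names here are made up):

```python
# Toy model of the Code Directory idea: hash each page of the binary, then
# pin a hash over the concatenated page hashes (the "cdhash" analog).
# The up-front check only covers the stored page hashes against the cdhash;
# checking an individual page can be deferred until it actually pages in.
import hashlib

PAGE = 4096  # illustrative page size

def page_hashes(data):
    return [hashlib.sha256(data[i:i + PAGE]).digest()
            for i in range(0, len(data), PAGE)]

def cdhash(hashes):
    return hashlib.sha256(b"".join(hashes)).hexdigest()

binary = bytes(range(256)) * 64              # fake 16 KiB binary = 4 pages
signed_hashes = page_hashes(binary)          # stored in the Code Directory
signed_cdhash = cdhash(signed_hashes)        # what the signature pins

# Tamper with one byte inside page 2 (on disk, behind the cached hashes)
tampered = bytearray(binary)
tampered[2 * PAGE + 10] ^= 0xFF

# The cheap up-front check (cdhash over the *stored* page hashes) still passes:
assert cdhash(signed_hashes) == signed_cdhash
# The tampering only surfaces when page 2 is actually hashed at page-in time:
lazy = page_hashes(bytes(tampered))
bad_pages = [i for i, (a, b) in enumerate(zip(lazy, signed_hashes)) if a != b]
print("tampered pages:", bad_pages)  # [2]
```

This matches the quoted behavior: the signature/cdhash validation at exec time doesn't catch the modified page; the per-page check at page-in does.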
If you can replicate this on an Intel Mac, where code signatures are optional, you could try more rigorous comparisons of an unsigned binary vs. a signed one. In both cases I'd assume the YARA signature checks would apply.
What's the intuition behind the fourth power? It looks like it was mainly derived from experimental testing, but there should be some physical explanation for why we'd expect a 4th-power law, right? (E.g., you can build intuition for square and inverse-square laws from surface-area arguments.) Is it really a 4th power, or is it an artifact of curve fitting?
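On the curve-fitting question: one way to see whether a 4th power is real rather than an artifact is that fitting the exponent directly in log-log space would recover whatever the true exponent is. A toy sketch with entirely synthetic data (assuming a true 4th-power relationship plus multiplicative noise — not real measurements):

```python
# If y really scales as x^4, a log-log least-squares fit recovers an
# exponent near 4; a genuinely different exponent would show up here too.
import math
import random

random.seed(0)
xs_raw = [1.0, 2.0, 4.0, 8.0, 16.0]
# synthetic data: true exponent 4, with small multiplicative noise
ys_raw = [x ** 4 * math.exp(random.gauss(0, 0.05)) for x in xs_raw]

xs = [math.log(x) for x in xs_raw]
ys = [math.log(y) for y in ys_raw]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# closed-form least-squares slope = cov(x, y) / var(x)
slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
         / sum((a - mx) ** 2 for a in xs))
print(f"fitted exponent ~ {slope:.2f}")  # lands close to 4
```

The flip side is that over a narrow range of x, exponents 3, 4, and 5 fit noisy data almost equally well, which is why a power derived "mainly from experimental testing" warrants the skepticism.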