...

rhdunn

Karma: 690

Created: 2021-05-28

Recent Activity

  • Newer versions of XPath and XSLT allow

        /*:bookstore/*:book/*:title
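
    This matches elements by local name in any namespace. For contrast, here is a minimal sketch (assuming Python with lxml, which only implements XPath 1.0, and a hypothetical sample document) of the older local-name() workaround that this wildcard syntax replaces:

        from lxml import etree

        # A document using a default namespace, so a plain /bookstore/book/title
        # query would not match anything.
        xml = b"""
        <bookstore xmlns="urn:example:books">
          <book><title>XSLT 2.0 Programmer's Reference</title></book>
        </bookstore>
        """

        root = etree.fromstring(xml)

        # XPath 2.0+:  /*:bookstore/*:book/*:title
        # XPath 1.0 equivalent:
        titles = root.xpath(
            "/*[local-name()='bookstore']/*[local-name()='book']/*[local-name()='title']"
        )
        print([t.text for t in titles])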

  • It depends on 1) what model you are running; and 2) how many models you are running.

    You can just about run a 32B model (at Q4/Q5 quantization) on 24GB. Running anything higher (such as the increasingly common 70B models, or larger still if you want to run something like Llama 4 or DeepSeek) means splitting the model between VRAM and system RAM. -- But yes, anything 24B or lower you can run comfortably, including enough capacity for the context. (Rough numbers are sketched below.)

    If you have other models -- such as text-to-speech, speech recognition, etc. -- then those also take up VRAM, both for their weights and during processing/generation. That affects the size of LLM you can run.
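
    A rough back-of-envelope sketch of the memory involved (the bits-per-weight figures and the helper below are approximations I'm assuming, not exact numbers; the context/KV cache adds several GB on top of the weights):

        # Approximate bits per weight for common quantizations/precisions.
        BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "F8": 8, "F16": 16}

        def weight_gb(params_billions: float, quant: str) -> float:
            """Approximate GB needed just to hold the model weights."""
            return params_billions * BITS_PER_WEIGHT[quant] / 8

        for params, quant in [(32, "Q4"), (32, "Q5"), (70, "Q4"), (70, "F16")]:
            print(f"{params}B @ {quant}: ~{weight_gb(params, quant):.0f} GB of weights")

    A 32B model at Q4/Q5 comes to roughly 18-22GB of weights, which is why 24GB is "just about" enough once the context is added; a 70B model at Q4 is already ~39GB and has to spill into system RAM.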

  • Google have their own ASIC via their TPU. The other major players have leveraged NVIDIA and -- to a lesser extent -- AMD. This is partly because investing in TPUs/ASICs is complex (it needs specialist knowledge and fabrication capacity) and partly because GPU performance is hard to compete with.

    Training is the thing that costs the most in terms of power/memory/energy, often requiring months of running multiple (likely 4-8) A100/H100 GPUs on the training data.

    Performing inference is cheaper as you can 1) keep the model loaded in VRAM, and 2) serve it from just one or two GPUs. With 80GB of capacity per H100 you would need two to run a 70B model at F16, or one at F8; 32B models and lower fit on a single H100. So you only need 1 or 2 GPUs to handle a request. (The arithmetic is sketched at the end of this comment.)

    ASICs could optimize operations like ReLU, but modern GPUs already have dedicated logic and instructions for matrix multiplication and other core operations.

    I think the sweet spot will be when CPUs support high-throughput matrix operations in the same way they support SIMD operations. That way the system benefits from being able to use system memory [1] without another chip/board consuming power. -- IIUC, things are already moving in that direction for consumer devices.

    [1] This would allow access to large amounts of memory without having to chain multiple GPUs, making it possible to run the larger models at higher precisions and to process the large amounts of training data more efficiently.
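
    A sketch of that H100 arithmetic (weights only, assuming 80GB per card and 2 bytes per weight at F16, 1 at F8; the helper name is mine):

        import math

        def cards_needed(params_billions: float, bytes_per_weight: float,
                         card_gb: int = 80) -> int:
            """How many 80GB cards are needed just to hold the weights."""
            return math.ceil(params_billions * bytes_per_weight / card_gb)

        print(cards_needed(70, 2))  # 70B at F16 (140GB) -> 2 cards
        print(cards_needed(70, 1))  # 70B at F8   (70GB) -> 1 card
        print(cards_needed(32, 2))  # 32B at F16  (64GB) -> 1 card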

  • The ARIA Authoring Practices Guide (APG) [1] is my usual go-to for what ARIA markup I need for custom elements, as it is clear about what the keyboard navigation, ARIA roles, ARIA states, and ARIA properties should be for a given element in a given state. That makes implementing and testing easier than working from the core specs.

    [1] https://www.w3.org/WAI/ARIA/apg/

  • There's also https://www.govinfo.gov/bulkdata/FR/resources for Federal Register XML documents.

    Note: it could be worth checking the issues at https://github.com/usgpo/bulk-data/issues as some of those contain fixes and formatting improvements.

HackerNews