WeatherNext 2: Our most advanced weather forecasting model

2025-11-17 15:04 · blog.google

The new AI model delivers more efficient, more accurate and higher-resolution global weather predictions.

Weather predictions need to capture the full range of possibilities — including worst case scenarios, which are the most important to plan for.

WeatherNext 2 can predict hundreds of possible weather outcomes from a single starting point. Each prediction takes less than a minute on a single TPU; it would take hours on a supercomputer using physics-based models.

Our model is also highly skillful and capable of higher-resolution predictions, down to the hour. Overall, WeatherNext 2 surpasses our previous state-of-the-art WeatherNext model on 99.9% of variables (e.g. temperature, wind, humidity) and lead times (0-15 days), enabling more useful and accurate forecasts.

This improved performance is enabled by a new AI modelling approach called a Functional Generative Network (FGN), which injects ‘noise’ directly into the model architecture so the forecasts it generates remain physically realistic and interconnected.

This approach is particularly useful for predicting what meteorologists refer to as “marginals” and “joints.” Marginals are individual, standalone weather elements: the precise temperature at a specific location, the wind speed at a certain altitude or the humidity. What's novel about our approach is that the model is only trained on these marginals. Yet, from that training, it learns to skillfully forecast 'joints' — large, complex, interconnected systems that depend on how all those individual pieces fit together. This 'joint' forecasting is required for our most useful predictions, such as identifying entire regions affected by high heat, or expected power output across a wind farm.
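
To make the marginals-versus-joints point concrete, here is a purely illustrative sketch (not the actual FGN architecture; all shapes, names and weights are invented): a forecast function that takes the current state plus one noise vector that is shared across the whole grid. Because every grid point sees the same noise draw, repeated sampling yields spatially coherent "joint" scenarios, even if a training loss only ever scores individual grid points.

```python
# Illustrative only: a stand-in "model" that maps (state, shared noise vector)
# to a forecast over the whole grid. The weights here are random, not learned.
import numpy as np

rng = np.random.default_rng(0)
n_grid, n_noise = 1_000, 32                      # grid points, noise dimensionality

W_state = rng.normal(size=(n_grid, n_grid)) / np.sqrt(n_grid)
W_noise = rng.normal(size=(n_grid, n_noise)) / np.sqrt(n_noise)

def forecast(state, z):
    """One ensemble member: a deterministic function of the state and one noise draw z."""
    return W_state @ state + W_noise @ z

state = rng.normal(size=n_grid)                  # today's (toy) atmospheric state
members = np.stack([forecast(state, rng.normal(size=n_noise)) for _ in range(100)])
print(members.shape)                             # (100, 1000): 100 coherent joint scenarios
```

In a setup like this, marginal skill at any single grid point is judged from the spread of members at that point, while joint quantities (a regional heatwave, total output across a wind farm) fall out of aggregating within each coherent member.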



Comments

  • By lysecret 2025-11-17 19:22 (5 replies)

    I'm pretty deep into this topic, and what might be interesting to an outsider is that the leading models (NeuralGCM/WeatherNext 1 before, as well as this model now) are all trained with a "CRPS" objective, which I haven't seen at all outside of ML weather prediction.

    Essentially you add random noise to the inputs and train by minimizing the regular loss (like L1) while at the same time maximizing the difference between two members with different random noise initialisations. I wonder if this will be applied to more traditional GenAI at some point.
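
    For concreteness, a minimal sketch of the two-member objective described above, i.e. a "fair" CRPS estimator (array shapes and noise scales are illustrative, not taken from any of these models' code):

```python
# Two-member fair CRPS: pull both members toward the target (skill term) while
# rewarding spread between the members, so the ensemble doesn't collapse.
import numpy as np

def crps_two_member(a, b, y):
    skill = 0.5 * (np.abs(a - y) + np.abs(b - y))
    spread = 0.5 * np.abs(a - b)
    return np.mean(skill - spread)               # minimize this during training

rng = np.random.default_rng(0)
y = rng.normal(size=1_000)                       # toy target field
a = y + rng.normal(scale=0.3, size=1_000)        # member 1, its own noise draw
b = y + rng.normal(scale=0.3, size=1_000)        # member 2, its own noise draw
print(crps_two_member(a, b, y))
```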

    • By nerdponx 2025-11-17 21:39 (1 reply)

      > Essentially you add random noise to the inputs and train by minimizing the regular loss (like L1) while at the same time maximizing the difference between two members with different random noise initialisations. I wonder if this will be applied to more traditional GenAI at some point.

      We recently had a situation where we specifically wanted to generate 2 "different" outputs from an optimization task and struggled to come up with a good heuristic for doing so. Not at all a GenAI task, but this technique probably would have helped us.

    • By albertzeyer 2025-11-18 8:49 (1 reply)

      The random noise is added to the model parameters, not the inputs, isn't it?

      This reminds me of variational noise (https://www.cs.toronto.edu/~graves/nips_2011.pdf).

      If it is random noise on the input, it would be like many of the SSL methods, e.g. DINO (https://arxiv.org/abs/2104.14294), right?

      • By lysecret 2025-11-18 13:27

        Yes, you are right, it's applied to the parameters, but other models (like NeuralGCM) applied it to the inputs. IMO it shouldn't make a huge difference; the main point is that you maximize the differences between members.

    • By cleak 2025-11-17 20:41

      That’s pretty neat. It reminds me of how VAEs work: https://en.wikipedia.org/wiki/Variational_autoencoder

    • By rytill 2025-11-17 19:28 (3 replies)

      What is the goal of doing that vs using L2 loss?

      • By counters 2025-11-17 23:10 (1 reply)

        To add to the existing answers - L2 losses induce a "blurring" effect when you autoregressively roll out these models. That means you not only lose important spatial features, you also truncate the extrema of the predictions; in other words, you can't forecast high-impact extreme weather with these models at moderate lead times.
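
        A toy illustration of that blurring (my own numbers, not from the writeup): when the outcome is genuinely uncertain, the squared-error-optimal point forecast is the mean, which truncates the extremes, while the two-member fair-CRPS estimator from upthread rewards members that land on the real outcomes.

```python
# Toy illustration: with a genuinely uncertain outcome (+1 or -1, equally likely),
# the squared-error-optimal point forecast is the mean (0), a value that never
# occurs, so the extremes get truncated. A two-member ensemble scored with the
# fair-CRPS estimator from upthread is rewarded for landing on the real outcomes.
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=100_000)              # equally likely extremes

print("MSE of predicting the mean (0):", np.mean((0.0 - y) ** 2))   # 1.0
print("MSE of committing to +1:      ", np.mean((1.0 - y) ** 2))    # ~2.0 -> L2 prefers the blur

def crps_two_member(a, b, y):
    return np.mean(0.5 * (np.abs(a - y) + np.abs(b - y)) - 0.5 * np.abs(a - b))

print("CRPS, both members at 0 (blurry):  ", crps_two_member(0.0, 0.0, y))   # 1.0
print("CRPS, members at -1 and +1 (sharp):", crps_two_member(-1.0, 1.0, y))  # 0.0
```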

        • By lysecret 2025-11-18 13:29 (1 reply)

          Yes, very good point. To me this is one of the most magical elements of this loss: how it suddenly makes the model "collapse" onto one output, and the predictions become sharp.

          • By counters 2025-11-18 17:20

            Yeah, it's underplayed in the writeup here, but the context is important. The "sharpness" issue was a major impediment to improving the skill and utility of these models. When GDM published GenCast two years ago, there was a lot of excitement because the generative approach seemed to completely eliminate this issue. But there was a trade-off: GenCast was significantly more expensive to train and run inference with, and there wasn't an obvious way to make improvements there. Still faster than an NWP model, but the edge starts to dull.

            FGN (and NVIDIA's FourCastNet-v3) show a new path forward that balances inference/training cost without sacrificing the sharpness of the outputs. And you get well-calibrated ensembles if you run them with different random seeds for their noise vectors, too!

            This is a much bigger deal than people realize.

      • By lysecret 2025-11-17 19:33

        To encourage diversity between the different members in an ensemble. I think people are doing very similar things for MoE networks, but I'm not that deep into that topic.

      • By sunshinesnacks 2025-11-17 21:41

        The goal of using CRPS is to produce an ensemble that is a good probabilistic forecast without needing calibration/post processing.

        [edit: "without", not "with"]

    • By jasonmarks_ 2025-11-18 1:02

      > I'm pretty deep into this topic, and what might be interesting to an outsider is that the leading models (NeuralGCM/WeatherNext 1 before, as well as this model now) are all trained with a "CRPS" objective, which I haven't seen at all outside of ML weather prediction.

      You are being a bit misleading here. The model is trained on historical data, but each run initialized from new instrument readings is generated several times to form an ensemble.

  • By binsquare 2025-11-17 17:24 (9 replies)

    I find it interesting that they quantify the improvement in speed and the number of forecast scenarios but lack details on how it results in improved accuracy of the forecast, per:

    ``` WeatherNext 2 can generate forecasts 8x faster and with resolution up to 1-hour. This breakthrough is enabled by a new model that can provide hundreds of possible scenarios. ```

    As an end user, all I care about is that there's one accurate forecasted scenario.

    • By meandthewallaby 2025-11-17 19:10 (2 replies)

      This is really important: You're not the end user of this product. These types of models are not built for laypeople to access them. You're an end user of a product that may use and process this data, but the CRPS scorecard, for example, should mean nothing to you. This is specifically addressing an under-dispersion problem in traditional ensemble models, due to a limited number (~50) and limited set of perturbed initial conditions (and the fact that those perturbations do very poorly at capturing true uncertainty).

      Again, you, as an end user, don't need to know any of that. The CRPS scorecard is a very specific measure of error. I don't expect them to reveal the technical details of the model, but an industry expert instantly knows what WeatherBench[1] is, the code it runs, the data it uses, and how that CRPS scorecard was generated.

      By having better dispersed ensemble forecasts, we can more quickly address observation gaps that may be needed to better solidify certain patterns or outcomes, which will lead to more accurate deterministic forecasts (aka the ones you get on your phone). These are a piece of the puzzle, though, and not one that you will ever actually encounter as a layperson.

      [1]: https://sites.research.google/gr/weatherbench/

      • By DoctorOetker 2025-11-17 21:57 (1 reply)

        Sorry to hijack you: I have some questions regarding current weather models:

        I am personally not interested in predicting the weather as end users expect it; rather, I am interested in representative evolutions of wind patterns. I.e., specify some location (say, somewhere in the North Sea, or perhaps on mainland Western Europe) and a date (say, Nov 12) without specifying a year, and get the wind patterns at different heights for that location over, say, half an hour. Basically, running with different seeds, I want representative evolutions of the wind vector field (without specifying starting conditions other than location and date, i.e. NO prior weather).

        Are there any ML models capable of delivering realistic and representative wind gust models?

        (The context is structural stability analysis of hypothetical megastructures)

        • By counters 2025-11-17 23:08 (1 reply)

          I mean - you don't need any ML for that. Just go grab random samples from a ~30 day window centered on your day of interest over the region of interest from a reanalysis product like ERA5. If the duration of ERA5 isn't sufficient (e.g. you wouldn't expect on average to see events with a >100 year return period given the limited temporal extent of the dataset) then you could take one step further and pull from an equilibrium climate model simulation - some of these are published as part of the CMIP inter-comparison, or you could go to special-built ensembles like the CESM LENS [1]. You could also use a generative climate downscaling model like NVIDIA's Climate-in-a-bottle, but that's almost certainly overkill for your application.

          [1]: https://www.cesm.ucar.edu/community-projects/lens
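
          A hedged sketch of that sampling, assuming an ERA5 wind subset has already been downloaded (e.g. from the Copernicus CDS) to a local file; the filename is a placeholder, and u10/v10 are the usual names of the 10 m wind components in ERA5 downloads:

```python
# Draw 100 random hourly wind states from a ~30-day window centred on Nov 12
# (day of year 316), pooled across however many years the local subset covers.
import numpy as np
import xarray as xr

ds = xr.open_dataset("era5_north_sea_winds.nc")              # hypothetical local subset
mask = (np.abs(ds["time"].dt.dayofyear - 316) <= 15).values  # ~30-day seasonal window
window = ds.isel(time=mask)

rng = np.random.default_rng(0)
idx = rng.choice(window.sizes["time"], size=100, replace=False)
samples = window.isel(time=idx)[["u10", "v10"]]              # 100 representative wind states
print(samples)
```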

          • By DoctorOetker 2025-11-20 21:05

            ERA5 seems to give hourly data; the Nyquist limit would thus only give decent oscillation amplitudes for waves with periods of about 5 hours or more, whereas I am more interested in faster timescales (seconds, minutes), i.e. wind gusts.

            Calculating the stability and structural requirements for a super-chimney reaching to the tropopause would require representative, higher temporal frequency wind fields.

            Do you know if I can extract such a high time resolution from LENS since a cursory look at ERA5 showed a time resolution of just 1 hour?

            The advantage of an ML model is that it's usually possible to calculate the joint probability for a wind field, or to selectively generate a dataset with N-th percentile wind fields, etc.

            If it's differentiable, and the structural stress assumptions are known, then one can "optimize" towards wind profiles that are simultaneously more dangerous and more probable, to identify what needs addressing. That's why an ML model of local wind patterns would be desirable.

            ML is more than just LLMs. The typical complaint about LLMs in this context, that there are no error bars on the output, is not entirely correct: just like differentiable ML models of physical and other phenomena, they allow one to calculate the joint probability of sentences, except that instead of modeling natural phenomena they model what humans uttered in the corpus (or the implicit corpus after RLHF etc.). A base-model LLM can quite accurately predict the likelihood of a human expressing a certain phrase, but that is modeling human expressions, not their validity. An ML model trained on actual weather data, or fine-grained simulated weather data, yields comparatively more accurate probability distributions, because physics isn't much of an opinion.
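
            As a toy version of that "more dangerous and more probable" search (everything here is a hypothetical stand-in: a Gaussian log-density and a quadratic stress measure in place of a real differentiable wind model and structural model):

```python
# Gradient ascent on log_prob(w) + lam * stress(w): nudge a wind profile toward
# higher structural stress while keeping it plausible under the (toy) density.
import numpy as np

rng = np.random.default_rng(0)
mean_wind = rng.normal(size=50)                       # toy climatological mean profile

def log_prob_grad(w):                                 # gradient of log N(w; mean_wind, I)
    return -(w - mean_wind)

load_dir = rng.normal(size=50)
load_dir /= np.linalg.norm(load_dir)

def stress(w):                                        # toy stress: 0.5 * (projection onto load_dir)^2
    return 0.5 * (load_dir @ w) ** 2

def stress_grad(w):
    return (load_dir @ w) * load_dir

w, lam, step = mean_wind.copy(), 0.5, 0.05
for _ in range(500):
    w += step * (log_prob_grad(w) + lam * stress_grad(w))

print("stress of mean profile:     ", stress(mean_wind))
print("stress of optimized profile:", stress(w))      # higher, while w stays near mean_wind
```

            The same gradient-ascent pattern would apply with a learned, differentiable wind-field log-density and a real structural stress functional in place of the toys.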

      • By counters 2025-11-17 19:29

        > By having better dispersed ensemble forecasts, we can more quickly address observation gaps that may be needed to better solidify certain patterns or outcomes, which will lead to more accurate deterministic forecasts.

        Sorry - not sure this is a reasonable take-away. The models here are all still initialized from analysis performed by ECMWF; Google is not running an in-house data assimilation product for this. So there's no feedback mechanism between ensemble spread/uncertainty and the observation itself in this stack. The output of this system could be interrogated using something like Ensemble Sensitivity Analysis, but there's nothing novel about that and we can do that with existing ensemble forecast systems.

    • By tylervigen 2025-11-17 18:27 (1 reply)

      For lay users, they could have explained that better, though I don't think they have completely uninformed users in mind for this page.

      Developing an ensemble of possible scenarios has been the central insight of weather forecasting since the 1960s, when Edward Lorenz discovered that tiny differences in initial conditions can grow exponentially (the "butterfly effect"). Since ensemble runs became computationally practical in the 1990s, all competitive forecasts have been based on these ensemble models.

      When you hear "a 70% chance of rain," it more or less means "there was rain in 70 of the 100 scenarios we ran."[0] There is no "single accurate forecast scenario."

      [0] Acknowledging this dramatically oversimplifies the models and the location where the rain could occur.

      • By sweettea 2025-11-17 22:17 (1 reply)

        My understanding is that it's an expected value based on coverage in each of the ensemble scenarios, not quite as simple as "in how many scenarios was there rain in this forecast cell".

        At least for the US NWS: if 30 of 100 scenarios result in 50% shower coverage, and 70 out of 100 result in 0%, this is reported as 15% chance of rain. Which is exactly the same as 15 with 100% coverage and 85 with 0% coverage, or 100 with 15% coverage.

        Understanding this, and digging further into the forecast, gives a better sense of whether you're likely to encounter widespread rainfall or spotty rainfall in your local area.
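
        A tiny worked version of that coverage-weighted calculation (my reading of the convention described above):

```python
# 30 scenarios with 50% shower coverage + 70 scenarios with 0% coverage -> 15% PoP.
import numpy as np

coverage = np.concatenate([np.full(30, 0.50), np.full(70, 0.00)])
pop = coverage.mean()
print(f"probability of precipitation: {pop:.0%}")    # 15%
```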

        • By tylervigen 2025-11-19 12:04

          Yes, that complexity is what my acknowledgment was meant to refer to.

    • By NoiseBert69 2025-11-17 17:28 (1 reply)

      As an end user, I also want to see the variance, to get a feeling for the uncertainty.

      Quite a lot of weather sites offer this data in an easily digestible visual format.

    • By Sanzig 2025-11-17 18:06

      Indeed. The most important benchmark is accuracy and how well it stacks up against existing physics-based models like GFS or ECMWF.

      Sure, those big physics-based models are very computationally intensive (national weather bureaus run them on sizeable HPC clusters), but you only need to run them every few hours in a central location and then distribute the outputs online. It's not like every forecaster in a country needs to run a model, they just need online access to the outputs. Even if they could run the models themselves, they would still need the mountains of raw observation data that feeds the models (weather stations, satellite imagery, radars, wind profilers...). And these are usually distributed by... the national weather bureau of that country. So the weather bureau might as well do the number crunching as well and distribute that.

    • By jasonmarks_ 2025-11-18 1:16

      > I find it interesting that they quantify the improvement in speed and the number of forecast scenarios but lack details on how it results in improved accuracy of the forecast, per:

      Definitely. Training on historical data creates compelling forecasts, but it comes off as a magic box. Where are the missing physics for the high-performance cluster?

    • By sails 2025-11-17 19:56

      As others have explained, ensembles are useful.

      As a layperson, what _is_ useful is to look at the difference between models. My long-range favourite is to compare ECMWF and GFS27, and if the deviation is high (the Windy app shows this) then you can bet that at least one of them is likely wrong.

    • By agildehaus 2025-11-17 18:19

      They integrated "MetNet-3" into Google products, and my personal perception was that accuracy decreased.

  • By bigtones 2025-11-17 22:02 (1 reply)

    Google's weather prediction engine is already very good, and the new hurricane model was breathtakingly good this season when tested against actual hurricane paths. Meanwhile, the US government's Global Forecasting System continues to get worse.

    https://arstechnica.com/science/2025/11/googles-new-weather-...

    • By jasonmarks_ 2025-11-18 0:22 (1 reply)

      > Global Forecasting System continues to get worse

      What do you mean?

      • By tylervigen 2025-11-18 1:35 (1 reply)

        It's the subtitle of the article they linked to.

        But to expand: the US flagship forecast model just had its worst year predicting hurricanes since 2005. The trend of errors over the last few years hasn't been great.

        • By jasonmarks_ 2025-11-18 1:47 (1 reply)

          More objectively, it reads as if none of the models performed well outside of 24 hours, with a significant uptick in inaccuracy after 72 hours.

          • By tylervigen 2025-11-18 20:59

            I disagree with your logic. Increased mean error 72 hours out (vs 24 hours out) is not an indication that GFS is getting worse over time; larger errors at longer lead times are expected, since 72 hours out is further in the future than 24 hours out.

            However, an increase in the mean error at the same time out year over year (or between 2005 and 2025) is an indication of an issue, and that’s what we see.

HackerNews