Nvidia-smi hangs indefinitely after ~66 days

2026-01-25 3:33 | 200 | 51 | github.com

@zheng199512

NVIDIA Open GPU Kernel Modules Version

[root@A11-R42-I61-42-5504045 ~]# cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
RmNvlinkBandwidthLinkCount: 0
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
ImexChannelCount: 2048
CreateImexChannel0: 0
GrdmaPciTopoCheckOverride: 0
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

[root@A11-R42-I61-42-5504045 ~]# cat /etc/openeuler-release
openeuler release 2.0 (LTS-SP2)

Kernel Release

[root@A11-R42-I61-42-5504045 ~]# uname -a
Linux A11-R42-I61-42-5504045. 6.6.0-100. SMP Fri Aug 22 10:50:04 CST 2025 x86_64 x86_64 x86_64 GNU/Linux
[root@A11-R42-I61-42-5504045 ~]# uname -r
6.6.0-100

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

B200

Describe the bug

nvidia-smi hangs indefinitely after ~66 days 12 hours uptime with driver 570.133.20 OpenRM on B200

[root@A11-R42-I61-42-5504045 ~]# dmesg -T | grep -i nvrm | head -n 10
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer0's postRxDetLinkMask failed!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!

[root@A11-R42-I61-42-5504045 ~]# uptime
 22:50:02 up 67 days, 6:11, 2 users, load average: 17.40, 16.73, 18.67
[root@A11-R42-I61-42-5504045 ~]# last reboot
reboot system boot 6.6.0-100. Tue Sep 16 16:38 still running
reboot system boot 6.6.0-100 Tue Sep 9 17:02 - 16:34 (6+23:32)
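
As a rough cross-check of the "~66 days 12 hours" figure, the gap between the boot time reported by `last reboot` and the first NVRM message shown in the `dmesg -T` output can be computed directly. This is only an illustrative sketch: the helper names are hypothetical, both timestamps are assumed to be in the same time zone with no clock change in between, and `dmesg -T` timestamps are themselves approximate.

```c
#include <assert.h>

/* Hypothetical helper: seconds since 2025-01-01 00:00:00 for a date in 2025
 * (not a leap year). Time zones and DST are deliberately ignored. */
static long seconds_into_2025(int month, int day, int hour, int min, int sec)
{
    static const int days_before_month[12] = {
        0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334
    };
    assert(month >= 1 && month <= 12);
    assert(day >= 1 && day <= 31);

    long days = days_before_month[month - 1] + (day - 1);
    return days * 86400L + hour * 3600L + min * 60L + sec;
}

/* Seconds between the boot record (Tue Sep 16 16:38, minute resolution)
 * and the first NVRM message shown in the log (Sat Nov 22 05:08:50). */
static long seconds_until_first_nvrm_error(void)
{
    long boot  = seconds_into_2025(9, 16, 16, 38, 0);
    long first = seconds_into_2025(11, 22, 5, 8, 50);
    return first - boot;
}
```

With these inputs the result is 5,747,450 seconds, or 66 days, 12 hours, 30 minutes and 50 seconds, which is consistent with the uptime reported at the time of failure.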

To Reproduce

nvidia-smi hangs indefinitely after ~66 days 12 hours uptime with driver 570.133.20 OpenRM on B200 and kernel 6.6.0

Bug Incidence

Once

nvidia-bug-report.log.gz

no

More Info

No response


Comments

  • By wvenable 2026-01-25 8:38, 1 reply

    A few years ago, at my company, we would get random TPM crashes every few months on all our machines. You'd be working and the TPM would just disappear and then any apps that rely on it for key retrieval would error out. Even worse, since the TPM chip is always running, neither a reboot nor a shutdown would fix it -- you literally had to pull the plug.

    This went on for months. Then one day we had a power outage. Two months later, every single machine failed at the same time. I checked the logs and it was 49 days and a few hours since that outage. It didn't take me too long to figure out what the underlying programming error inside the TPM was. At least we could then describe exactly what the problem was to our PC vendor.
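
The commenter does not name the bug, but a failure roughly 49 days after power-on is the classic signature of a 32-bit counter of milliseconds wrapping around. Assuming that is what happened here (an assumption, not something stated in the comment), the arithmetic can be sketched as follows; `struct duration` and `wrap_time` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

struct duration {
    uint64_t days;
    unsigned hours, minutes, seconds;
};

/* Time until a free-running unsigned counter that is `bits` wide and
 * increments `ticks_per_second` times per second wraps back to zero.
 * Fractional seconds are truncated. */
static struct duration wrap_time(unsigned bits, uint64_t ticks_per_second)
{
    assert(bits >= 1 && bits <= 63);
    assert(ticks_per_second > 0);

    uint64_t total_seconds = (UINT64_C(1) << bits) / ticks_per_second;

    struct duration d;
    d.days    = total_seconds / 86400;
    d.hours   = (unsigned)(total_seconds % 86400 / 3600);
    d.minutes = (unsigned)(total_seconds % 3600 / 60);
    d.seconds = (unsigned)(total_seconds % 60);
    return d;
}
```

`wrap_time(32, 1000)` gives 49 days, 17 hours, 2 minutes and 47 seconds, which fits the interval observed after the power outage.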

  • By foota 2026-01-25 6:51, 1 reply

    Wow, someone in the github comments[1] noticed that one of the bug numbers assigned internally for the issue matches to the day the number of days the driver would stay up.

    1: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/971...

    • By userbinator 2026-01-25 8:12

      You mean number of seconds, but yes, I think everyone looking at this would be converting units to see if there was a particular boundary being met.

  • By pajko 2026-01-25 5:53, 2 replies

    Timestamps should NOT be compared like this. This is exactly why time_before() and time_after() exist.

    https://elixir.bootlin.com/linux/v6.15.7/source/include/linu...
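
A minimal userspace sketch of the idea behind those helpers, assuming a 32-bit tick counter. The kernel macros operate on `unsigned long` jiffies and add type checking; `naive_after` and `wrap_safe_after` are illustrative names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The signed cast below relies on two's complement conversion,
 * which every mainstream compiler provides. */
static_assert((int32_t)UINT32_C(0xFFFFFFFF) == -1,
              "two's complement conversion expected");

/* Naive comparison: gives the wrong answer once either value has wrapped. */
static bool naive_after(uint32_t a, uint32_t b)
{
    return a > b;
}

/* Wrap-safe comparison: subtract modulo 2^32, then look only at the sign
 * of the result. True when a is later than b. */
static bool wrap_safe_after(uint32_t a, uint32_t b)
{
    return (int32_t)(uint32_t)(b - a) < 0;
}
```

Consider a timeout armed at tick `0xFFFFFFF0` with a length of `0x20` ticks, so the deadline wraps to `0x10`. At tick `0xFFFFFFF8`, `naive_after(now, deadline)` already reports the deadline as passed, while `wrap_safe_after(now, deadline)` correctly reports that it has not. At tick `0x11`, the wrap-safe version reports it as passed.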

    • By AceJohnny2 2026-01-25 8:22, 2 replies

      Offtopic...

          * Do this with "<0" and ">=0" to only test the sign of the result. A
          * good compiler would generate better code (and a really good compiler
          * wouldn't care). Gcc is currently neither.
      
      It's funny the love-hate relationship the Linux kernel has with GCC. It's the only supported compiler[1], and yet...

      [1] can Clang fully compile Linux yet? I haven't followed the updates in a while.

      • By rwmj 2026-01-25 10:15

        To be fair this comment predates git history (before 2005) when GCC wasn't a very good compiler. The kernel developers at one point were sticking with a specific version of GCC because later versions would miscompile the kernel. Clang didn't exist then.

        GCC is a different beast and far better nowadays.

    • By Joker_vD 2026-01-25 7:30, 1 reply

      Do I understand it correctly that the logic is that if timestamp B is above timestamp A, but the difference is more than half of the unsigned range, B is considered to happen before A?

      • By rcxdude 2026-01-25 9:40

        Yes. When the timestamps wrap it's fundamentally ambiguous, but this will be correct unless the timestamps are very far apart (and the failure mode is more benign: a really long time difference being considered shorter is better than all time differences being considered zero after the timestamp wraps).
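
The half-range rule described in this exchange can be illustrated with 32-bit timestamps. The following is a sketch under that assumption; `ts_delta` and `ts_before` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The signed casts below rely on two's complement conversion,
 * which every mainstream compiler provides. */
static_assert((int32_t)UINT32_C(0xFFFFFFFF) == -1,
              "two's complement conversion expected");

/* Signed distance from a to b, interpreted modulo 2^32.
 * Positive: b is treated as later than a. Negative: b is treated as earlier. */
static int32_t ts_delta(uint32_t a, uint32_t b)
{
    return (int32_t)(uint32_t)(b - a);
}

/* True when a is treated as earlier than b (sign test on a - b). */
static bool ts_before(uint32_t a, uint32_t b)
{
    return (int32_t)(uint32_t)(a - b) < 0;
}
```

With `A = 1000`, a timestamp `B = A + 0x7FFFFFFF` is still treated as later than A. Once the gap exceeds half the range, for example `B = A + 0x80000001`, `ts_before(B, A)` becomes true and `ts_delta(A, B)` is -2147483647, so B is read as slightly less than 2^31 ticks before A, even though it is numerically larger. A genuine wrap, such as going from `0xFFFFFFF0` to `0x10`, still yields the correct small positive distance of 32 ticks.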

HackerNews