NVMe and SATA health data on ESXi: some links to investigate
Posted by jpluimers on 2021/08/25
(Edit 20221202: added one more link on “REALLOCATED SECTOR CT below threshold”)
Somehow, the health data of my NVMe and SATA drives does not show up as health information in the web UI of my ESXi playground rig.
So far, I have noticed that ESXi runs a smartd but does not ship with smartctl, and the health data does not end up in the web user interface either. So you cannot easily see the state of NVMe and SATA devices.
Still, these devices deteriorate over time and eventually die, so below are some links to investigate later.
The goal is to use my own thresholds to set warning and error levels.
Some log entries:
syslog.log:2021-04-16T18:28:26Z jumpstart[65941]: UnresolvedVmfsVolume: deviceName=eui.0000000001000000e4d25c0e8dc74e01:1,lvmName=5ad4aeea-630efcbc-c307-0cc47aaa9742,label=IntelNVMe1TB-BTPY7425047S1P0H(VMFS),fsUuid=5ad4aeea-6954841c-470e-0cc47aaa9742
syslog.log:2021-04-16T18:30:57Z smartd: [warn] eui.0000000001000000e4d25c0e8dc74e01: REALLOCATED SECTOR CT below threshold (7 < 90)
syslog.log:2021-04-16T18:53:25Z jumpstart[65944]: UnresolvedVmfsVolume: deviceName=naa.600605b00aa054a0ff0000210221eaf8:1,lvmName=552f5788-ee485725-ce41-001f29022aed,label=850EVO1TBR1B(VMFS),fsUuid=552f5788-33e30274-8dba-001f29022aed
vmkernel.log:2021-04-17T16:58:58.665Z cpu8:66219)ScsiDeviceIO: 3001: Cmd(0x4395014c7140) 0x1a, CmdSN 0xf60 from world 67512 to dev "naa.600605b00aa054a0ff0000210221eaf8" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
vmkernel.log:2021-04-17T17:29:02.656Z cpu0:67578)ScsiDeviceIO: 3001: Cmd(0x4395015c34c0) 0x85, CmdSN 0xfbb from world 67512 to dev "naa.600605b00aa054a0ff0000210221eaf8" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
vmkernel.log:2021-04-17T17:59:06.658Z cpu0:68128)ScsiDeviceIO: 3001: Cmd(0x43950d7af780) 0x4d, CmdSN 0x1011 from world 67512 to dev "naa.600605b00aa054a0ff0000210221eaf8" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
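To read the ScsiDeviceIO lines above without a web decoder, here is a minimal Python sketch that pulls the opcode, status and sense fields out of such a line. The function name decode_scsi_failure and the lookup tables are my own illustration, not anything ESXi ships; the tables only cover the values that occur in these log entries (names taken from the SCSI standards), so anything else comes back as "unknown".

```python
import re

# Matches the "ScsiDeviceIO ... failed H:.. D:.. P:.. Valid sense data: .." lines.
SCSI_FAIL = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<opcode>0x[0-9a-f]+), .*? to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<host>0x[0-9a-f]+) D:(?P<device>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+) '
    r'Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)'
)

# Partial tables: only the codes seen in the log entries above.
OPCODES = {0x1A: "MODE SENSE(6)", 0x4D: "LOG SENSE", 0x85: "ATA PASS-THROUGH(16)"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {
    (0x20, 0x00): "INVALID COMMAND OPERATION CODE",
    (0x24, 0x00): "INVALID FIELD IN CDB",
}


def decode_scsi_failure(line):
    """Return a dict describing a failed SCSI command, or None if the line does not match."""
    match = SCSI_FAIL.search(line)
    if match is None:
        return None
    num = {k: int(v, 16) for k, v in match.groupdict().items() if k != "dev"}
    return {
        "device": match.group("dev"),
        "opcode": OPCODES.get(num["opcode"], "unknown"),
        "device_status": DEVICE_STATUS.get(num["device"], "unknown"),
        "sense_key": SENSE_KEYS.get(num["key"], "unknown"),
        "additional_sense": ASC_ASCQ.get((num["asc"], num["ascq"]), "unknown"),
    }


SAMPLE = (
    '2021-04-17T16:58:58.665Z cpu8:66219)ScsiDeviceIO: 3001: Cmd(0x4395014c7140) 0x1a, '
    'CmdSN 0xf60 from world 67512 to dev "naa.600605b00aa054a0ff0000210221eaf8" failed '
    'H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.'
)
print(decode_scsi_failure(SAMPLE))
```

All three vmkernel.log entries decode to an ILLEGAL REQUEST sense key, which suggests the device (or the controller in front of it) simply rejects those commands rather than reporting a media problem.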
Some links
Most links below on “REALLOCATED SECTOR CT below threshold” seem to indicate the warning is benign.
Smartmontools
- [Wayback] smartmontools
- [Wayback] FAQ – smartmontools
Is smartctl available for VMware ESXi?
No. See the ESXi related tickets and this thread on smartmontools-support mailing list.
- [Wayback] monitoring – Why there is no smartctl tool in ESXi 5.x? – Server Fault
- [Wayback/Archive.is/Archive.is-of-cache/Google-cached] Identifying Power On Hours for SSD Drives – Cisco
- [Wayback] ipsecguy/esxi_smartmon_exporter: Smartmontools on ESXi Exporter for Prometheus
- [Wayback] How to check NVMe Drives TBW in ESXi with PowerCLI | virten.net
Google searches
- [Archive.is] failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. – Google Search
- [Archive.is] sata sense codes 0x85 – Google Search
- [Wayback] VMware and SCSI: Why do my Pure Storage datastores report SCSI 0x85 errors every 30 minutes? – Transmitting on the wire
- [Wayback] VMware ESXi SCSI Sense Code Decoder | virten.net
- [Wayback] VMware ESXi SCSI Sense Code Decoder: example for H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. | virten.net
- [Wayback] VMware ESXi SCSI Sense Code Decoder V2 | virten.net
- [Wayback] VMware ESXi SCSI Sense Code Decoder V2: example for H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. | virten.net
- [Archive.is] “REALLOCATED SECTOR CT below threshold” – Google Search
- [Archive.is] ESXi not saving settings / VMs : homelab
- [Archive.is] About VMware ESXi: The SMART value of NVMe SSD is incorrect on ESXi 6.7u3
- [Archive.is] ESXi 6.7u3においてNVMe SSDのS.M.A.R.Tの値がおかしい – VMware Technology Network VMTN, which mentions using a command like
esxcli storage core device smart get -d t10.NVMe____INTEL_SSDPE2KX020T8_XXXX
to verify the SMART status.
- [Wayback/Archive] smartd Warnings for both NVMe drives ESXi 7.0.0u2 – VMware Technology Network VMTN
- [Wayback/Archive] VMware Esxi 7.0 – Administrator
- [Wayback/Archive] ESXi on 8th Gen Intel NUC (Coffee Lake – Bean Canyon) | virten.net
- [Wayback/Archive] ESX / ESXi – Hilfethread | Seite 276 | Hardwareluxx
- [Wayback] Determine TBW from SSDs with S.M.A.R.T Values in ESXi (smartctl) | virten.net
- [Wayback] “ssdpekkw01” esxi not mounting – Google Search
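Since the goal is to apply my own warning and error levels, here is a rough Python sketch of how the table printed by the esxcli storage core device smart get command mentioned above could be checked against custom limits. Everything in it is an assumption for illustration: the sample text mimics the column layout (Parameter, Value, Threshold, Worst) with made-up numbers, newer ESXi versions may print extra columns, and the names parse_smart_table, classify and MY_LIMITS as well as the limit values are hypothetical.

```python
import re

# Made-up sample in the assumed layout of `esxcli storage core device smart get`.
SAMPLE_OUTPUT = """\
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Reallocated Sector Count      7      90         7
Drive Temperature             35     0          45
Power-on Hours                90     0          90
"""

# Hypothetical personal limits: parameter -> (warn_at, error_at, higher_is_worse).
MY_LIMITS = {
    "Drive Temperature": (50, 60, True),
    "Reallocated Sector Count": (50, 10, False),
}


def parse_smart_table(text):
    """Parse the table into {parameter: {column: cell}}; columns are split on 2+ spaces."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = {}
    for line in lines[2:]:  # skip the header and the dashed separator
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) != len(header):
            continue  # ignore lines that do not fit the assumed layout
        rows[cells[0]] = dict(zip(header[1:], cells[1:]))
    return rows


def classify(rows, limits):
    """Return {parameter: 'ok' | 'warning' | 'error' | 'unknown'} for each limit."""
    result = {}
    for name, (warn_at, error_at, higher_is_worse) in limits.items():
        raw = rows.get(name, {}).get("Value", "N/A")
        if not raw.isdigit():
            result[name] = "unknown"  # N/A or a missing parameter
            continue
        value = int(raw)
        if higher_is_worse:
            level = "error" if value >= error_at else "warning" if value >= warn_at else "ok"
        else:
            level = "error" if value <= error_at else "warning" if value <= warn_at else "ok"
        result[name] = level
    return result


print(classify(parse_smart_table(SAMPLE_OUTPUT), MY_LIMITS))
```

On a real host the sample text would be replaced by the captured output of the esxcli command for each device; whether the Value column is a raw or a normalized number differs per parameter and drive, so the limits need checking against each drive before trusting the result.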
Other ways of getting SMART data
- [Wayback/Archive.is] ESXi S.M.A.R.T. health monitoring for hard drives (2040405) and [Wayback] VMware ESXi S.M.A.R.T Health Monitoring | ESX Virtualization, which talk about the smartinfo.sh script. By now that script has become a binary, /usr/lib/vmware/vm-support/bin/smartinfo, which shows similar results. Note that the Power-on Hours are unreliable: for most drives they are non-persistent and are actually Power-on Hours since last reboot.
- There are a ton more goodies in the /usr/lib/vmware/vm-support/bin directory which I want to look into:
altlocaltgz.sh
cat-newest-vmkernel-core.sh
censor-shell-log.sh
debug-hung-vm
dump-upit-info.py
dump-vmdk-rdm-info.sh
dump-vmfs-traces.sh
dump-vvol-traces.sh
dvsData.sh
encryption-epilog.sh
encryption-prolog.sh
extract_hp_docs.py
hostd.sh
localtgz.sh
monitorCoreDump.sh
nicinfo.sh
nvmeinfo.sh
partedUtil.sh
rdmainfo.sh
smartinfo
storageHostProfiles.sh
swfw.sh
vFlash.sh
vsanIscsiTarget.sh
vsanIscsiTargetVitConf.sh
vsanIscsiTargetVitStatus.py
zdumps.sh
—jeroen