[Bug]: StorCLI collector falsely reports ~7k errors/day by treating static "Media Error Count" as an incremental event rate #1126

@AGI-chandler

Bug description

System Information

  • OS: Linux 6.1.0-34-amd64
  • StorCLI Version: 007.1907.0000.0000
  • Controller: Broadcom/LSI MegaRAID (SAS3)
  • Netdata Version: [Enter your Netdata version here]

Description

The StorCLI collector appears to be misinterpreting the Media Error Count field returned by storcli /cX/eY/sZ show all.

My physical drive has a static "Media Error Count" of 1.  This error occurred months ago (August 2025) and has not incremented since.  However, Netdata interprets this single static value as a new error occurring at every collection interval.

This results in:

  1. A reported error rate of ~0.08 errors/s (approx 1 error per 12-second collection cycle).
  2. A critical alert claiming ~7,000 errors per day.
  3. A total accumulated error count of ~97,000+ over 14 days, despite the hardware counter remaining exactly at 1.
  4. An AI "Alert Investigation Report" (attached) that incorrectly predicts immediate hardware failure with "95% Confidence" based on this math error.
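
These figures are all consistent with a constant +1 being added once per collection cycle. A quick check of the arithmetic, assuming a 12-second interval (inferred from the observed rate, not read from the collector configuration):

```python
# Sanity check: what a constant +1 per poll looks like over time.
# The 12-second interval is an assumption inferred from the ~0.08 errors/s rate.
POLL_INTERVAL_S = 12
SECONDS_PER_DAY = 24 * 60 * 60

rate_per_second = 1 / POLL_INTERVAL_S
errors_per_day = SECONDS_PER_DAY // POLL_INTERVAL_S
errors_in_14_days = errors_per_day * 14

print(f"{rate_per_second:.3f} errors/s")
print(f"{errors_per_day} errors/day")
print(f"{errors_in_14_days} errors in 14 days")
```

That works out to about 0.083 errors/s, 7,200 per day, and 100,800 over 14 days, which is in line with the ~7,000/day alert and the ~97,000 accumulated total.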

Evidence & Reproduction

1. The Hardware Reality (Static Count)

storcli reports a single Media Error.  This number is persistent and does not change.

root@server:~# storcli /c0/e4/s11 show all
...
Drive /c0/e4/s11 - Detailed Information :
=======================================
Shield Counter = 0
Media Error Count = 1   <--- This value is static (history from August)
Other Error Count = 0
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No
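
For anyone checking this by script, here is a minimal sketch of pulling the field out of the text output above. The helper name parse_drive_state is hypothetical, and how the collector actually reads the field is not shown in this report. The point is only that the value is an absolute count, not a per-interval figure.

```python
# Excerpt of the "show all" output quoted above.
SAMPLE = """\
Shield Counter = 0
Media Error Count = 1
Other Error Count = 0
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No
"""

def parse_drive_state(text: str) -> dict:
    """Collect 'Key = Value' lines from storcli text output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip lines that are not 'Key = Value'
            fields[key.strip()] = value.strip()
    return fields

state = parse_drive_state(SAMPLE)
media_errors = int(state["Media Error Count"])
print(media_errors)
```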

2. The Controller Logs (Silent)

The controller event log confirms that no new errors are occurring: it has been silent for this slot since the drive was installed. (Full log at the end of this report.)

root@server:~# storcli /c0 show events file=events.log
# (The log contains no entries for slot 11 in the last 3 months)
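
The "no entries for slot 11" check can also be scripted. The sketch below assumes the record layout seen in the event log at the end of this report (a "seqNum:" line starts each record, followed by "Key: Value" lines). The sample is abbreviated from that log, the helper names are hypothetical, and the date parsing assumes English day and month names.

```python
from datetime import datetime

# Two records abbreviated from the event log in this report.
SAMPLE_LOG = """\
seqNum: 0x000132ee
Time: Fri Aug  1 19:30:07 2025

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0e(e0x04/s11) Path 50015b2038046311
Event Data:
===========
Device ID: 14
Enclosure Index: 4
Slot Number: 11


seqNum: 0x000132f4
Time: Fri Aug  8 02:10:06 2025

Code: 0x00000152
Class: 0
Locale: 0x20
Event Description: Controller requests a host bus rescan
Event Data:
===========
None
"""

def parse_events(text):
    """Start a record at each 'seqNum:' line and keep its 'Key: Value' fields."""
    events, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # separators, hex dumps, blank lines
        if key == "seqNum":
            current = {}
            events.append(current)
        if current is not None:
            current.setdefault(key.strip(), value.strip())
    return events

def events_for_slot(events, slot, since):
    """Return (time, description) for events on `slot` at or after `since`."""
    hits = []
    for ev in events:
        if ev.get("Slot Number") != str(slot) or "Time" not in ev:
            continue
        when = datetime.strptime(ev["Time"], "%a %b %d %H:%M:%S %Y")
        if when >= since:
            hits.append((when, ev.get("Event Description", "")))
    return hits

events = parse_events(SAMPLE_LOG)
recent = events_for_slot(events, 11, datetime(2025, 9, 1))
print(recent)
```

An empty list for a cutoff date after the last known event is what "silent" means here.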

3. The Netdata Interpretation (Accumulation)

Netdata reports a continuous error rate of 0.08 errors/second.

  • The error graph shows a perfectly linear accumulation (constant slope), characteristic of summing a constant value rather than recording sporadic hardware failures.

Hypothesis

The collector logic seems to treat the raw value returned by storcli as a count of events in the last interval rather than a lifetime counter.

  • Current Logic: Total_Errors += Current_Value (e.g., 1 + 1 + 1...)
  • Expected Logic: Current_Rate = Current_Value - Previous_Value (e.g., 1 - 1 = 0)
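
To make the difference concrete, here is a small sketch of both interpretations applied to the same made-up series of readings. This is not the collector's code, only an illustration of the suspected and the expected arithmetic, and the function names are hypothetical.

```python
# Hypothetical readings of "Media Error Count" at successive polls:
# static at 1, then one genuine new error on the last poll.
readings = [1, 1, 1, 1, 2]

def summed_as_events(samples):
    """Suspected behavior: treat each reading as new events and accumulate."""
    total, out = 0, []
    for value in samples:
        total += value
        out.append(total)
    return out

def delta_of_counter(samples):
    """Expected behavior: treat readings as a lifetime counter, report increments."""
    out, previous = [], None
    for value in samples:
        # The first reading only sets the baseline; it is not a new error.
        out.append(0 if previous is None else max(value - previous, 0))
        previous = value
    return out

print(summed_as_events(readings))  # [1, 2, 3, 4, 6]
print(delta_of_counter(readings))  # [0, 0, 0, 0, 1]
```

The first series grows on every poll even though nothing happened; the second stays at zero until the hardware counter actually moves.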

Logs & Raw Data

Below, please find:

  1. smartctl output: shows that the drive's own firmware reports zero reallocated sectors.  This directly contradicts the Netdata AI report that claims "progressive disk surface defects."
  2. storcli output: shows specific model numbers, part numbers, firmware versions, etc., and only 1 media error.
  3. storcli event log: shows events from August to the present.  Slot 11 logged an unexpected-sense event on Aug 1 and a removal and re-insertion on Aug 8, and then went silent.  That silence, set against the claimed "0.08 errors/s", is the strongest evidence of a bug.
  4. Netdata AI Investigation Report (Netdata Insights...pdf) generated from this issue.  It highlights the severity of the false positive, as the AI analysis generated 17 pages of 💩, claiming things such as:
  • Root Cause Confidence: HIGH (95%)
  • Alert Validity: CONFIRMED VALID
  • Legitimate Hardware Issue
  • No👏 False👏 Positive👏

In the end, it concluded "Proactive Drive Replacement" was required based on the erroneous "97,459 accumulated errors."

Full smartctl output (Proves drive health)
root@server:~# smartctl -x /dev/disk/by-id/wwn-0x5000cca244c2b5e1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-34-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Gold
Device Model:     WDC WD4002FYYZ-01B7CB0
Serial Number:    N8G5YV3Y
LU WWN Device Id: 5 000cca 244c2b5e1
Firmware Version: 01.01M02
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Dec  3 23:39:46 2025 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (  113) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 571) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   136   136   054    -    108
  3 Spin_Up_Time            POS---   100   100   024    -    0
  4 Start_Stop_Count        -O--C-   100   100   000    -    1
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   100   100   000    -    2836
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    1
192 Power-Off_Retract_Count -O--CK   100   100   000    -    119
193 Load_Cycle_Count        -O--C-   100   100   000    -    119
194 Temperature_Celsius     -O----   214   214   000    -    28 (Min/Max 23/30)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ Non-Data log
0x15       GPL,SL  R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    256  Current Device Internal Status Data log
0x25       GPL     R/O    256  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        Active (0)
Current Temperature:                    28 Celsius
Power Cycle Min/Max Temperature:     23/30 Celsius
Lifetime    Min/Max Temperature:     23/30 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (61)

Index    Estimated Time   Temperature Celsius
  62    2025-12-03 21:32    28  *********
 ...    ..(  5 skipped).    ..  *********
  68    2025-12-03 21:38    28  *********
  69    2025-12-03 21:39    27  ********
  70    2025-12-03 21:40    27  ********
  71    2025-12-03 21:41    28  *********
  72    2025-12-03 21:42    27  ********
  73    2025-12-03 21:43    27  ********
  74    2025-12-03 21:44    27  ********
  75    2025-12-03 21:45    28  *********
  76    2025-12-03 21:46    27  ********
 ...    ..( 53 skipped).    ..  ********
   2    2025-12-03 22:40    27  ********
   3    2025-12-03 22:41    28  *********
   4    2025-12-03 22:42    27  ********
   5    2025-12-03 22:43    28  *********
 ...    ..( 55 skipped).    ..  *********
  61    2025-12-03 23:39    28  *********

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4               1  ---  Lifetime Power-On Resets
0x01  0x018  6            2155  ---  Logical Sectors Written
0x01  0x020  6              23  ---  Number of Write Commands
0x01  0x028  6           15826  ---  Logical Sectors Read
0x01  0x030  6             305  ---  Number of Read Commands
0x01  0x038  6     10210698900  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            2835  ---  Spindle Motor Power-on Hours
0x03  0x010  4            2835  ---  Head Flying Hours
0x03  0x018  4             119  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              28  ---  Current Temperature
0x05  0x010  1              27  N--  Average Short Term Temperature
0x05  0x018  1              26  N--  Average Long Term Temperature
0x05  0x020  1              30  ---  Highest Temperature
0x05  0x028  1              23  ---  Lowest Temperature
0x05  0x030  1              29  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              27  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4               1  ---  Number of Hardware Resets
0x06  0x010  4               0  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

Full storcli output (Shows static Media Error Count = 1)
root@server:~# storcli /c0/e4/s11 show all
CLI Version = 007.1907.0000.0000 Sep 13, 2021
Operating system = Linux 6.1.0-34-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive /c0/e4/s11 :
================

----------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                  Sp Type 
----------------------------------------------------------------------------------
4:11     14 JBOD  -  3.638 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -    
----------------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild


Drive /c0/e4/s11 - Detailed Information :
=======================================

Drive /c0/e4/s11 State :
======================
Shield Counter = 0
Media Error Count = 1
Other Error Count = 0
Drive Temperature =  28C (82.40 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No


Drive /c0/e4/s11 Device attributes :
==================================
SN = N8G5YV3Y            
Manufacturer Id = ATA     
Model Number = WDC WD4002FYYZ-01B7CB0
NAND Vendor = NA
WWN = 5000CCA244C2B5E1
Firmware Revision = 01.01M02
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
NCQ setting = Enabled
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = Port 0 - 3 


Drive /c0/e4/s11 Policies/Settings :
==================================
Enclosure position = 1
Connected Port Number = 0(path0) 
Sequence Number = 6
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = 0
Successful diagnostics completion on = N/A
FDE Type = None
SED Capable = No
SED Enabled = No
Secured = No
Cryptographic Erase Capable = No
Sanitize Support = Not supported
Locked = No
Needs EKM Attention = No
PI Eligible = No
Certified = No
Wide Port Capable = No
Multipath = No

Port Information :
================

-----------------------------------------
Port Status Linkspeed SAS address        
-----------------------------------------
   0 Active 6.0Gb/s   0x50015b2038046311 
-----------------------------------------


Inquiry Data = 
5a 04 ff 3f 37 c8 10 00 00 00 00 00 3f 00 00 00 
00 00 00 00 38 4e 35 47 56 59 59 33 20 20 20 20 
20 20 20 20 20 20 20 20 03 00 00 00 38 00 31 30 
30 2e 4d 31 32 30 44 57 20 43 44 57 30 34 32 30 
59 46 5a 59 30 2d 42 31 43 37 30 42 20 20 20 20 
20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 80 
00 40 00 2f 00 40 00 02 00 02 07 00 ff 3f 10 00 
3f 00 10 fc fb 00 00 51 ff ff ff 0f 00 00 07 00 

storcli event log from Aug-Present (Proves no recent errors)
root@server:~# storcli /c0 show events file=events.log
root@server:~# tail ... events.log
...
seqNum: 0x000132ed
Time: Fri May  9 13:30:00 2025

Code: 0x00000124
Class: 0
Locale: 0x20
Event Description: Patrol Read can't be started, as PDs are either not ONLINE, or are in a VD with an active process, or are in an excluded VD
Event Data:
===========
None


seqNum: 0x000132ee
Time: Fri Aug  1 19:30:07 2025

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0e(e0x04/s11) Path 50015b2038046311, CDB: 88 00 00 00 00 00 00 5a d6 00 00 00 02 00 00 00, Sense: 3/11/00
Event Data:
===========
Device ID: 14
Enclosure Index: 4
Slot Number: 11
CDB Length: 16
CDB Data:
0088 0000 0000 0000 0000 0000 0000 005a 00d6 0000 0000 0000 0002 0000 0000 0000 
Sense Length: 18
Sense Data:
00f0 0000 0003 0000 005a 00d7 0039 000a 0000 0000 0000 0000 0011 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 


seqNum: 0x000132ef
Time: Fri Aug  8 01:02:55 2025

Code: 0x0000021c
Class: 0
Locale: 0x02
Event Description: Locate LED started on PD 0e(e0x04/s11)
Event Data:
===========
Device ID: 14
Enclosure Index: 4
Slot Number: 11


seqNum: 0x000132f0
Time: Fri Aug  8 02:10:06 2025

Code: 0x0000010c
Class: 1
Locale: 0x02
Event Description: PD 0e(e0x04/s11) Path 50015b2038046311  reset (Type 03)
Event Data:
===========
Device ID: 14
Enclosure Index: 4
Slot Number: 11
Error: 3


seqNum: 0x000132f1
Time: Fri Aug  8 02:10:06 2025

Code: 0x00000070
Class: 1
Locale: 0x02
Event Description: Removed: PD 0e(e0x04/s11)
Event Data:
===========
Device ID: 14
Enclosure Index: 4
Slot Number: 11


seqNum: 0x000132f2
Time: Fri Aug  8 02:10:06 2025

Code: 0x000000f8
Class: 0
Locale: 0x02
Event Description: Removed: PD 0e(e0x04/s11) Info: enclPd=04, scsiType=0, portMap=00, sasAddr=50015b2038046311,0000000000000000
Event Data:
===========
Device ID: 14
Enclosure Device ID: 4
Enclosure Index: 1
Slot Number: 11
SAS Address 1: 50015b2038046311
SAS Address 2: 0


seqNum: 0x000132f3
Time: Fri Aug  8 02:10:06 2025

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0e(e0x04/s11) from JBOD(40) to UNCONFIGURED_BAD(1)
Event Data:
===========
Device ID: 14
Enclosure Index: 4
Slot Number: 11
Previous state: 64
New state: 1


seqNum: 0x000132f4
Time: Fri Aug  8 02:10:06 2025

Code: 0x00000152
Class: 0
Locale: 0x20
Event Description: Controller requests a host bus rescan
Event Data:
===========
None


seqNum: 0x000132f5
Time: Fri Aug  8 02:18:31 2025

Code: 0x0000005b
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0e(e0x04/s11)
Event Data:
===========
Device ID: 14
Enclosure Index: 4
Slot Number: 11


seqNum: 0x000132f6
Time: Fri Aug  8 02:18:31 2025

Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0e(e0x04/s11) Info: enclPd=04, scsiType=0, portMap=00, sasAddr=50015b2038046311,0000000000000000
Event Data:
===========
Device ID: 14
Enclosure Device ID: 4
Enclosure Index: 1
Slot Number: 11
SAS Address 1: 50015b2038046311
SAS Address 2: 0

Netdata Insights - 2025-12-04 06_18Z storcli_phys_drive_errors.pdf

Expected behavior

The collector should treat the Media Error Count value from storcli as a cumulative lifetime counter, not an instantaneous event count.

  • Current (Incorrect) Behavior: The collector sums the raw values from each poll. (Poll 1: "1" + Poll 2: "1" = 2 Total Errors).
  • Expected Behavior: The collector should calculate the delta between polls.
    • If the counter remains at 1 between polls, the calculated error rate should be 0.
    • The alert should only trigger if the physical counter actually increments (e.g., goes from 1 to 2).
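
A minimal sketch of that behavior, tracked per drive. The class name and the choice to re-baseline when the counter goes backwards (for example after a drive swap) are assumptions for illustration, not a description of how Netdata implements its collectors.

```python
class MediaErrorTracker:
    """Remembers the last lifetime counter per drive and reports only new errors."""

    def __init__(self):
        self._last = {}

    def update(self, drive: str, count: int) -> int:
        previous = self._last.get(drive)
        self._last[drive] = count
        if previous is None or count < previous:
            # First sighting, or the counter went backwards (assumed to mean
            # the drive was replaced): set a new baseline, report nothing.
            return 0
        return count - previous

tracker = MediaErrorTracker()
results = []
for count in (1, 1, 1, 2):
    new_errors = tracker.update("/c0/e4/s11", count)
    results.append((count, new_errors, new_errors > 0))
    print(count, new_errors, "ALERT" if new_errors > 0 else "ok")
```

With a counter that sits at 1, every poll reports 0 new errors and no alert; only the step from 1 to 2 reports 1 and alerts.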

Steps to reproduce

  1. Identify a drive with a static error: Find a physical drive that has a non-zero, static Media Error Count in storcli (e.g., a count of 1 from a historical event).
    • Command: storcli /c0/eX/sY show all
    • Verify the count is stable and not incrementing by running the command multiple times over a few minutes.
  2. Verify the event log is silent: Confirm via storcli /c0 show events that no new events are being generated for this drive slot (proving the error is historical, not active).
  3. Enable Netdata StorCLI collector: Allow Netdata to collect metrics from this controller.
  4. Observe the false positive: Check the Netdata dashboard for this drive.
    • Notice that Netdata reports a continuous error rate (e.g., 0.08 errors/s if polling every ~12s).
    • Notice that the "Total Media Errors" metric increases linearly over time, despite the hardware counter in Step 1 remaining unchanged.
