NVMe SSD reports a SMART error on every Linux boot - pyopyopyo - Linuxとかプログラミングの覚え書き

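After I replaced my NVMe SSD with a new one (Micron 3400 2TB), a SMART error started appearing on every Linux boot.

I investigated the cause and resolved it myself, so here is a summary of what happened.

Update, January 30, 2024: this problem is fixed in smartmontools 7.4.
It was handled on the smartmontools side, and the code for it has been added: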
https://www.smartmontools.org/changeset/5472

The error

After I configured smartmontools to monitor SMART errors, the following error email was sent to root on every reboot.

Subject: SMART error (ErrorCount) detected on host: [hostname]

This message was generated by the smartd daemon running on:

  host name:  [HOSTNAME]
  DNS domain: [DOMAINNAME]

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 43 to 44

Device info:
Micron_3400_**********, S/N:*******, FW:P7MU000, 2.04 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at [TIMESTAMP]
Another message will be sent in 24 hours if the problem persists.

A SMART error would normally mean that something is wrong with the SSD.

The key line of the message is:

Device: /dev/nvme0, number of Error Log entries increased from 43 to 44

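Checking the situation

The system itself booted without problems, and no data was corrupted.

I checked the SMART state from the console.

The device file for the NVMe SSD is /dev/nvme0, so I checked the error log with smartctl.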

$ sudo smartctl -l error /dev/nvme0

=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID VS
  0         44     0  0x3011  0x8004  0x000

ErrCount is now 44.

Further investigation showed that ErrCount increases by 1 every time the system is rebooted.
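
Because the interesting number is the ErrCount column, it can help to pull it out of the report and compare it across reboots. The following is a minimal sketch, assuming the table layout shown above (a header row beginning with `Num`, then one row per entry). `latest_error_entry` is a hypothetical helper name, not part of smartmontools, and other smartctl versions may print the table differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): read the text of
# `smartctl -l error` from standard input and print "ErrCount Status"
# for the row numbered 0, the first row of the table shown above.
latest_error_entry() {
    awk '
        # The table header starts with "Num"; data rows follow it.
        $1 == "Num" { in_table = 1; next }
        # Print columns 2 (ErrCount) and 5 (Status) of row 0, then stop.
        in_table && $1 == "0" { print $2, $5; exit }
    '
}

# Usage on a real system (needs root and smartmontools installed):
#   sudo smartctl -l error /dev/nvme0 | latest_error_entry
```

Running this before and after a reboot and comparing the first number would show the increment of one described above.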

To dig deeper, I checked the error log on the NVMe SSD itself.

NVMe devices store logs inside the device. These logs can be read with the nvme command.

$ sudo nvme error-log /dev/nvme0

.................
 Entry[63]
.................
error_count    : 0
sqid        : 0
cmdid        : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba        : 0
nsid        : 0
vs        : 0
trtype        : The transport type is not indicated or the error is not
transport related.
cs        : 0
trtype_spec_info: 0
.................

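An error log entry is indeed recorded, but its status (status_field) is SUCCESS.

So apparently no error actually occurred: a log entry saying that some operation succeeded is generated on every boot, and smartmontools misdetects it as a fault.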
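
Since the log holds many entries, checking the status of every entry is less error-prone than reading a single one by eye. As a rough sketch, assuming the text layout of `nvme error-log` shown above (one `status_field` line per entry, with the numeric value before the parenthesis), the hypothetical helper `count_real_errors` prints the total number of entries followed by the number whose status is non-zero. The output format of nvme-cli may differ between versions, so treat the parsing as an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of nvme-cli): read the text of
# `nvme error-log` from standard input and print "<entries> <non-zero>".
count_real_errors() {
    awk -F: '
        /^status_field/ {
            total++
            value = $2
            sub(/^[ \t]+/, "", value)   # drop leading blanks
            sub(/\(.*/, "", value)      # drop the "(description" part
            # Assumption: a zero status is printed as "0" (or "0x0").
            if (value != "0" && value != "0x0") failed++
        }
        END { printf "%d %d\n", total + 0, failed + 0 }
    '
}

# Usage on a real system (needs root and nvme-cli installed):
#   sudo nvme error-log /dev/nvme0 | count_real_errors
```

If the second number is 0, every entry in the log reports success, which would be consistent with the reading above.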

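Searching Google

From these findings, I inferred that the cause is probably a bug in either the NVMe firmware or smartmontools.

Using the error message as a clue, I searched Google for bug reports.

Bug report (1)

I quickly found a bug report in the Debian BTS.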

#900244 - NVM error information log entry count increase not an error - Debian Bug report logs

Skimming the report, I found that in message #50 someone using exactly the same device reports being troubled by the same error.

The symptoms are exactly the same.