Hi all,


UPDATE: I closed the post (the timebox I gave myself to understand the issue is now over). Thank you all for the help ^^


DISCLAIMER: The objective of this post is to understand how people would debug issues like these when real data is involved and get to the bottom of the problem. The objective is NOT to “restore service” but to understand what failed. The tone of the post is voluntarily not serious to keep it light.


I am playing a little with TrueNas Scale and ZFS. I was trying to use a second NVME disk via USB to do a replication once a day of the main pool, however I had issues with this secondary pool being SUSPENDED for “too many errors”. This pool is not directly write/read by users/apps, but it is just there to be “replicated on” once a day.

Now, please, I know that using disks via USB is not advised. Also I am not interested in recovering the data, since there is nothing real on it. What I am doing is testing to see if the system is brittle, and if it is, how to debug if there is a real issue.

Now to the point. The pool is SUSPENDED. Good. Why? I mean, the real reason why. To see if the system can be used in real life it needs to be debuggable.

Let’s start. The pool is SUSPENDED:

pool: tank-02
state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
config:

    	NAME                                	STATE 	READ WRITE CKSUM
    	tank-02                             	UNAVAIL  	0 	0 	0  insufficient replicas
      	xxx-xxx-xxx-xxx-xxx                FAULTED  	3 	0 	0  too many errors

errors: 4 data errors, use '-v' for a list

To which you may ask: why? Too many errors (the -v says nothing more). Well that doesn’t help, does it. When you run zpool clear:

# zpool clear tank-02   	 
cannot clear errors for tank-02: I/O error

Incredibly useful as you can see. dmesg to the rescue?

WARNING: Pool 'tank-02' has encountered an uncorrectable I/O failure and has been suspended.

Thanks? I guess. I know it it trying to safeguard data but again… why?

Before you ask:

  • SMART checks are good
  • Yes, I restarted the device. As soon as you try to use/mount/import you get to the same issues.
  • Nothing else peculiar in dmesg. I mean the USB was usb 2-4: USB disconnect, device number 12 whatever the reason why. I mean, kick me if I know why TrueNas scale decided that having /sys/module/usbcore/parameters/autosuspend to 2 is a good idea but again, that is not the point. I need ZFS to reply to me what is the issue for its point of view.

I have read a lot online. Maybe it is the temperarure (usb enclosure heating up), maybe it is the cable, power, “it is the usb controller”, or the chipset doing the usb -> nvme… However, therey are not saying what to check. People is guessing. I saw more tech behind reading tea leaves.

My question for you all is this: ZFS SUSPENDED one of my pools. It (seems to me) is refusing to fix it. Refusing to do anything with it and to tell me why. So, in a real world case, how to debug it? If I have to trust my data to it, I don’t want the only option to be “use many disks and just replace one and the cable when ZFS poo-poo”.

How to know the cause?

Thank you for the help.

PS: I am sure I am missing some very basic ZFS knoweldge on the topic, so please let me know what else can I do to make ZFS talk to me.

  • justpassingby@sh.itjust.worksOP
    link
    fedilink
    English
    arrow-up
    1
    ·
    7 months ago

    Hi! Thanks for the pointers. Unfortunately dmesg and system logs where the first places I looked at, but I found nothing at the time. I tried it again now to give you the output of a zpool clear, you can obviously ignore the failed email attempt. journalctl:

    Jun 07 08:06:24 truenas kernel: WARNING: Pool 'tank-02' has encountered an uncorrectable I/O failure and has been suspended.
    Jun 07 08:06:24 truenas zed[799040]: eid=309 class=statechange pool='tank-02' vdev=xxx-xxx-xxx-xxx-xxx vdev_state=ONLINE
    Jun 07 08:06:24 truenas zed[799049]: eid=310 class=statechange pool='tank-02' vdev=xxx-xxx-xxx-xxx-xxx vdev_state=FAULTED
    Jun 07 08:06:24 truenas zed[799057]: eid=313 class=data pool='tank-02' priority=3 err=28 flags=0x20004000 bookmark=0:0:0:1
    Jun 07 08:06:24 truenas zed[799058]: eid=311 class=vdev_clear pool='tank-02' vdev=xxx-xxx-xxx-xxx-xxx vdev_state=FAULTED
    Jun 07 08:06:24 truenas zed[799067]: eid=312 class=data pool='tank-02' priority=3 err=28 flags=0x20004000 bookmark=0:62:0:0
    Jun 07 08:06:24 truenas zed[799081]: eid=316 class=io_failure pool='tank-02'
    Jun 07 08:06:24 truenas zed[799082]: eid=315 class=data pool='tank-02' priority=3 err=28 flags=0x20004000 bookmark=0:0:-1:0
    Jun 07 08:06:24 truenas zed[799090]: eid=314 class=data pool='tank-02' priority=3 err=28 flags=0x20004000 bookmark=0:0:1:0
    Jun 07 08:06:24 truenas find_alias_for_smtplib.py[799114]: sending mail to 
                                                               To: root
                                                               Subject: ZFS device fault for pool tank-02 on truenas
                                                               MIME-Version: 1.0
                                                               Content-Type: text/plain; charset="ANSI_X3.4-1968"
                                                               Content-
    Jun 07 08:06:24 truenas find_alias_for_smtplib.py[799114]: No aliases found to send email to root
    Jun 07 08:06:24 truenas zed[799144]: error: statechange-notify.sh: eid=310: mail exit=1
    

    dmesg says even less.

    I also tried to reboot the machine with the drive detached and then attach it at runtime while tailing dmesg and journalctl. Now, they are pretty verbose, so will only add here any interesting part (I didn’t notice anything new however):

    [...]
    [  221.952569] usb 2-4: Enable of device-initiated U1 failed.
    [  221.954164] usb 2-4: Enable of device-initiated U2 failed.
    [  221.965756] usbcore: registered new interface driver usb-storage
    [  221.983528] usb 2-4: Enable of device-initiated U1 failed.
    [  221.983997] usb 2-4: Enable of device-initiated U2 failed.
    [  221.987603] scsi host2: uas
    [  221.987831] usbcore: registered new interface driver uas
    [...]
    [  222.040564] sd 2:0:0:0: Attached scsi generic sg1 type 0
    [  222.049860] sd 2:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
    [  222.051867] sd 2:0:0:0: [sdb] Write Protect is off
    [  222.051879] sd 2:0:0:0: [sdb] Mode Sense: 37 00 00 08
    [  222.056719] sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [  222.058407] sd 2:0:0:0: [sdb] Preferred minimum I/O size 512 bytes
    [  222.058413] sd 2:0:0:0: [sdb] Optimal transfer size 33553920 bytes
    [  222.252607]  sdb: sdb1
    [  222.253015] sd 2:0:0:0: [sdb] Attached SCSI disk
    [  234.935926] usb 2-4: USB disconnect, device number 2
    [  234.983962] sd 2:0:0:0: [sdb] Synchronizing SCSI cache
    [  235.227936] sd 2:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
    [...]
    

    Thanks for the advices, it was worth another try. Anything more that comes to mind?

      • justpassingby@sh.itjust.worksOP
        link
        fedilink
        English
        arrow-up
        1
        ·
        7 months ago

        Hi.

        There is one usb drive in an nvme enclosure without their own power supply. I know the brand and I can find the chipset however what I need to understand is the issue from the logs.

        The error usb 2-4: Enable of device-initiated U1 failed. seems common for USB devices not working.

        What does it point to and what to look for to understand it?

        Thanks.

        PS: Just for curiosity I did swap the enclosure days ago and the cable but had the same issue, so the error message is not specific to it. Also I was using this enclosure as the main disk for one of my PI with no issue, so power via USB or cable should not be the problem. Not that I want to use that as metric, I need data/logs from the OS.