DJ - Update

Allan, Benedict, and BSD Now Crew,

I finally have an update from last Halloween's Ep. 269 Zombie ZFS question--success! Following Allan's prescient guidance and sage advice, combined with hardware voodoo, I managed on Easter Sunday to resurrect my zombie ZFS pool (pure coincidence). Apologies in advance if the debrief is lengthy--this may be an interesting deep dive into how the solution worked.

Confirming Allan's point, when ZFS processes hung waiting for a missing (removed/unavailable) provider to respond, the only way to kill them was a hard shutdown/power cycle. More commands that hung:

zpool status -v (no problem without the -v)
zpool clear -nF (hangs even with -n flag!)
zpool clear -F
zpool clear
zpool import -F (no problem exporting)
zpool import -o readonly=on (this also prevents onlining providers)

Failing these, I tried reconnecting the disk drives in the mirrored pair. I started getting the bus noise again, with lots of "(noperiph:ahcich0:0:-1:ffffffff): rescan already queued" messages, sometimes hanging the system entirely. Web-searching, I found some bug tracker posts with Jordan Hubbard and Kris Moore, basically attributing this to fatal flaws in cables/connectors, with no recourse in software to fix or work around.

I just had to open the case and keep plugging away at different permutations of where to connect these disks and which cables to use, how to position the drives and cables, and various other magic dances like that. Eventually, I stumbled across a configuration that made those bus errors go away (as it happens, one side of the pair connects to the main board, and the other is on a cheap HBA). In retrospect, all the errors may well have been caused in the first place by this hardware subtly jostling loose and gradually becoming evil.

Now the disks show up in the system and don't crash--just zpool online and resilver? Not so fast. Although ZFS could see the previously removed drive with the correct ID, I could not bring it online right away. Rearranging the cables changed the device names (/dev/ada#), and maybe that threw something off?

My attempt to online the device by its ID resulted in a message that the device was onlined but in a faulted state. That's generally bad news, and I started considering detach rather than online. But I rebooted first, and then everything suddenly worked: the device automatically came online, started resilvering, and eventually corrected all errors!
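
For anyone following along, here is a minimal sketch of that online-by-ID step. It assumes "ID" means the vdev GUID that zpool status -g prints; the pool name "tank" and the GUID are hypothetical placeholders, since the real ones are not given here. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: POOL and GUID are hypothetical placeholders, not values
# from the original pool. DRY_RUN=1 (the default) prints each command
# instead of running it.
POOL="${POOL:-tank}"
GUID="${GUID:-1234567890123456789}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

online_by_guid() {
    # Show vdev GUIDs instead of device names, which change when
    # cables move (the /dev/ada# renumbering mentioned above).
    run zpool status -g "$POOL"
    # Ask ZFS to bring the vdev back, addressing it by GUID.
    run zpool online "$POOL" "$GUID"
    # Check state and resilver progress; note that -v was one of the
    # invocations that hung while the provider was still missing.
    run zpool status -v "$POOL"
}

online_by_guid
```

Run it with DRY_RUN=0 as root to actually issue the commands. If the device reports as faulted afterwards, this sketch does nothing further; in the case described here, a reboot is what cleared it.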

If I had been unable to reconnect the missing device, I would have been tempted to detach the missing device from the mirror and take my chances to read the remaining disk as a stripe, despite the dozen or so checksum errors.

Out of curiosity, if I had detached, then somehow later reattached the missing device, would it again have worked as a mirror and successfully resilvered? I feel like recreating similar scenarios with a safe, file-backed ZFS test lab for my edification.
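
One way such a test lab might look, as a rough sketch: two sparse files stand in for the mirrored disks, so nothing real is at risk. The pool name "labpool", the directory /tmp/zfs-lab, and the 128M size are all assumptions chosen for illustration. The script defaults to a dry run that only prints the commands; it compares offline/online against detach/attach without presuming the outcome.

```shell
#!/bin/sh
# Sketch only: LAB_DIR, POOL, and the image size are illustrative
# assumptions. DRY_RUN=1 (the default) prints each command instead of
# running it; set DRY_RUN=0 and run as root to execute for real.
LAB_DIR="${LAB_DIR:-/tmp/zfs-lab}"
POOL="${POOL:-labpool}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

zfs_lab() {
    d0="$LAB_DIR/disk0.img"
    d1="$LAB_DIR/disk1.img"

    # File-backed vdevs must be given as absolute paths.
    run mkdir -p "$LAB_DIR"
    run truncate -s 128M "$d0"
    run truncate -s 128M "$d1"
    run zpool create "$POOL" mirror "$d0" "$d1"

    # Scenario A: take one side offline, then bring it back.
    run zpool offline "$POOL" "$d1"
    run zpool online "$POOL" "$d1"
    run zpool status "$POOL"

    # Scenario B: detach one side, leaving a single-disk pool,
    # then attach the same file again next to the survivor.
    run zpool detach "$POOL" "$d1"
    run zpool status "$POOL"
    # Assumption to verify: attach may need -f if ZFS still sees an
    # old label on the detached file.
    run zpool attach "$POOL" "$d0" "$d1"
    run zpool status "$POOL"

    # Clean up the lab.
    run zpool destroy "$POOL"
    run rm -f "$d0" "$d1"
    run rmdir "$LAB_DIR"
}

zfs_lab
```

Comparing the zpool status output after scenario A and scenario B should show whether the reattached file resilvers back into a working mirror. Writing some data to the pool between the steps would make the difference easier to see.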

Anyway, this little ordeal taught me a lot about ZFS--thanks to Allan for pointing me in the right direction and lighting the way. Many thanks to the rest of the team (Benedict, JT, Angela, Chris) for helping to air this and document the progress.

Cheers!
URL: http://dpaste.com/3KTQ45G