SATA Hard Disk Errors
About a week ago one of my backup drives[1] appeared to give up the ghost. It suffered repeated failures that kept my daily backup from completing. Besides the failed backup, the most obvious symptom was hundreds of error messages in the kernel log like these:
[Sat Apr 20 08:16:08 2019] sdd: detected capacity change from 8001563222016 to 0
[Sat Apr 20 08:23:56 2019] sd 9:0:0:0: [sdd] tag#25 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sat Apr 20 08:23:56 2019] sd 9:0:0:0: [sdd] tag#25 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[Sat Apr 20 08:53:56 2019] sd 9:0:0:0: [sdd] tag#27 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sat Apr 20 08:53:56 2019] sd 9:0:0:0: [sdd] tag#27 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[Sat Apr 20 09:23:56 2019] sd 9:0:0:0: [sdd] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sat Apr 20 09:23:56 2019] sd 9:0:0:0: [sdd] tag#29 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[Sat Apr 20 09:53:56 2019] sd 9:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sat Apr 20 09:53:56 2019] sd 9:0:0:0: [sdd] tag#0 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
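A side note on what those CDB lines actually contain: the 16 bytes can be decoded by hand. The first byte, 0x85, is the SCSI opcode for ATA PASS-THROUGH (16), and byte 14 carries the ATA command being passed through. Here is a minimal sketch in Python, assuming the field layout from the SCSI/ATA Translation (SAT) spec. The function name and the tiny lookup tables are made up for illustration and only cover a handful of values.

```python
# Minimal decoder for an ATA PASS-THROUGH (16) CDB as printed in the kernel log.
# Offsets follow the SAT layout: byte 0 = SCSI opcode, bits 4..1 of byte 1 =
# protocol, byte 14 = ATA command. The tables are illustrative, not exhaustive.

ATA_PROTOCOLS = {3: "Non-data", 4: "PIO Data-In", 5: "PIO Data-Out", 6: "DMA"}
ATA_COMMANDS = {
    0xB0: "SMART",
    0xE5: "CHECK POWER MODE",
    0xEC: "IDENTIFY DEVICE",
}


def decode_ata_passthrough16(cdb_hex: str) -> dict:
    """Decode a space-separated hex CDB string (hypothetical helper)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    protocol = (cdb[1] >> 1) & 0x0F
    command = cdb[14]
    return {
        "protocol": ATA_PROTOCOLS.get(protocol, f"unknown ({protocol})"),
        "command": command,
        "command_name": ATA_COMMANDS.get(command, f"unknown (0x{command:02X})"),
    }


if __name__ == "__main__":
    # The CDB copied from the log lines above.
    print(decode_ata_passthrough16(
        "85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00"
    ))
```

For the CDB in my log this gives a non-data command with opcode 0xE5, which is CHECK POWER MODE. Monitoring tools such as smartd commonly send that query, and the regular 30-minute spacing of the failures would fit a polling daemon, though that part is a guess. The more telling line is the first one: the capacity dropping to 0 suggests the drive had vanished from the bus, so every later command had no target to reach.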
Once these errors started, the disk became all but unusable until I power cycled it. I started troubleshooting by looking at the SMART data for the disk drive. Surprisingly, this showed no anomalies and reported the drive as healthy. Next I ran the badblocks utility, and it found thousands of bad blocks.
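For anyone retracing those two checks, here is a rough sketch of how they could be scripted. The exact invocations aren't recorded in this post, so the flags below are an assumption (common choices from the smartmontools and e2fsprogs man pages). The helper names are hypothetical, and the device path is taken from the log above. Both tools generally need root.

```python
# Sketch: run a SMART health query and a read-only badblocks scan on one device.
# Assumes smartctl (smartmontools) and badblocks (e2fsprogs) are installed.
import shutil
import subprocess
from typing import List


def health_check_commands(device: str) -> List[List[str]]:
    """Build the two command lines (flags are assumed, not from the post)."""
    return [
        # -H: overall health verdict, -A: vendor-specific SMART attributes
        ["smartctl", "-H", "-A", device],
        # Default mode is a read-only scan; -s shows progress, -v is verbose
        ["badblocks", "-sv", device],
    ]


def run_health_checks(device: str) -> None:
    for cmd in health_check_commands(device):
        if shutil.which(cmd[0]) is None:
            print(f"skipping {cmd[0]}: not installed")
            continue
        # check=False because smartctl reports findings through a non-zero
        # exit status, which should not abort the remaining checks.
        subprocess.run(cmd, check=False)


# Example (needs root; device name taken from the kernel log):
# run_health_checks("/dev/sdd")
```

Keep in mind what happened in my case: SMART reported a healthy drive while badblocks reported thousands of failures. A disagreement like that is a hint that the read errors may be coming from somewhere other than the platters.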
My backup drive is an external drive that I attach via eSATA. My first troubleshooting steps involved replacing all the cables in the data path between the drive and the motherboard. I then tried to attach the drive to a known good SATA port. Still the problems persisted through all these troubleshooting steps. At this point I assumed I had a bad drive and ordered a replacement; not a big deal (computer part shopping is perhaps the only type of shopping that I find enjoyable). My new disk shows up and I start my backup process and … the same exact errors. On a BRAND NEW DISK.
I can’t believe my luck! As a last shot in the dark, I put my original disk inside my case and attach it directly to the motherboard. No errors. Now I’m really confused. At this point the only suspect remaining is my external drive enclosure, so I try swapping that out. But with the second enclosure I still see the same errors.
Finally I dive down the Google rabbit hole. The problem with Googling these errors is that they are apparently very common and are most often caused by bad SATA cables or a bad disk drive. But I know from my troubleshooting that neither of those is my problem. Finally, on the third or fourth answer to the 10th Stack Overflow question I read, I get a hint: a user suggested that a bad power supply could cause these errors as well. Coincidentally, the power supply for my external disk enclosures is the one thing I have not yet tried to replace!
Fortunately I do have a couple of extra 12V DC power supplies sitting around. So I put back the original cables, the original drive, and put them in the original enclosure using the replacement power supply. Surprise! No errors!
I’m sure there’s an important moral to this story… but I don’t know what it is. Mostly, I’m just happy I got my backup drive working again and I am a bit disappointed that I dropped $200 on a replacement drive that it turns out I didn’t need. I’ve written this up mostly in the hope that I save the next person to run into this problem a bit of time.
[1] Everyone should have backups. I use a home-grown 3-2-1 Backup Strategy.