I wasn't quite sure whether this belongs in "Linux" or rather in the "Hardware corner".
Yesterday I did some data recovery on a friend's server. The ext4 file system had become badly corrupted, and quite a lot of (fortunately unimportant) files were no longer readable. Since the SMART values of the HDD looked bad (Raw_Read_Error_Rate ~2000, Current_Pending_Sector ~40, UDMA_CRC_Error_Count ~90), we blamed the drive.
So we put in a new (used) HDD whose SMART values are OK, copied all the data over with rsync, repaired GRUB2 and adjusted the mounts in fstab, and then the system booted again.
However: under load, especially when several processes generate I/O on the disk at the same time, the system becomes extremely sluggish and the load climbs to 8-15, until some of the processes abort because sync() failed.
The syslog shows the same errors again as before the HDD swap.
Here are a few examples from the log:
Code:
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764065] ata7.00: exception Emask 0x0 SAct 0x1c00 SErr 0x0 action 0x6 frozen
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764200] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764284] ata7.00: cmd 61/48:50:98:c1:8e/00:00:15:00:00/40 tag 10 ncq 36864 out
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764284] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764502] ata7.00: status: { DRDY }
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764551] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764632] ata7.00: cmd 61/18:58:60:d6:04/01:00:1d:00:00/40 tag 11 ncq 143360 out
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764632] res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764852] ata7.00: status: { DRDY }
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764901] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764982] ata7.00: cmd 61/08:60:e0:db:45/00:00:05:00:00/40 tag 12 ncq 4096 out
Jun 30 05:21:07 Bumblebee kernel: [ 4699.764982] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:21:07 Bumblebee kernel: [ 4699.765196] ata7.00: status: { DRDY }
Jun 30 05:21:07 Bumblebee kernel: [ 4699.765248] ata7: hard resetting link
Jun 30 05:21:09 Bumblebee kernel: [ 4702.216050] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 30 05:21:09 Bumblebee kernel: [ 4702.253312] ata7.00: configured for UDMA/100
Jun 30 05:21:09 Bumblebee kernel: [ 4702.268044] ata7.00: device reported invalid CHS sector 0
Jun 30 05:21:09 Bumblebee kernel: [ 4702.268056] ata7: EH complete
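For reading the excerpt above, here is a minimal Python sketch that decodes the hex fields of a `cmd` line and the `SAct` bitmask. It assumes the usual libata field order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and 512-byte logical sectors. The function names `decode_ncq_cmd` and `sact_tags` are made up for illustration and are not part of any tool.

```python
def decode_ncq_cmd(taskfile: str, sector_size: int = 512) -> dict:
    """Decode a libata 'cmd' taskfile string of an NCQ command.

    Assumed field order (not stated in the post):
    command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    command, low, high, _device = taskfile.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # For NCQ commands the sector count lives in the feature fields,
    # and the queue tag sits in the upper five bits of nsect.
    sectors = (hob_feature << 8) | feature
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "command": int(command, 16),
        "tag": nsect >> 3,
        "sectors": sectors,
        "bytes": sectors * sector_size,
        "lba": lba,
    }


def sact_tags(sact: int) -> list:
    """Return the NCQ tags whose bits are set in the SAct register value."""
    return [bit for bit in range(32) if (sact >> bit) & 1]


if __name__ == "__main__":
    print(decode_ncq_cmd("61/48:50:98:c1:8e/00:00:15:00:00/40"))
    print(sact_tags(0x1C00))
```

Under these assumptions the first `cmd` line decodes to tag 10 and 72 sectors (36864 bytes), which agrees with the `tag 10 ncq 36864 out` the kernel prints next to it, and `SAct 0x1c00` corresponds to tags 10, 11 and 12, the three commands reported as failed. In other words, every command that was outstanding at that moment timed out together.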
Code:
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361341] sd 6:0:0:0: [sda] Unhandled error code
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361349] sd 6:0:0:0: [sda]
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361351] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361353] sd 6:0:0:0: [sda] CDB:
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361356] Write(10): 2a 00 26 a8 8f f8 00 00 08 00
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361364] end_request: I/O error, dev sda, sector 648581112
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361457] EXT4-fs warning (device sda1): ext4_end_bio:317: I/O error -5 writing to inode 12845860 (offset 42987520 size 4096 starting block 81072640)
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361461] Buffer I/O error on device sda1, logical block 81072383
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361584] sd 6:0:0:0: [sda] Unhandled error code
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361587] sd 6:0:0:0: [sda]
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361588] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361590] sd 6:0:0:0: [sda] CDB:
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361591] Write(10): 2a 00 26 a9 9c 00 00 00 d8 00
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361596] end_request: I/O error, dev sda, sector 648649728
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361676] EXT4-fs warning (device sda1): ext4_end_bio:317: I/O error -5 writing to inode 12857718 (offset 0 size 110592 starting block 81081243)
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361679] Buffer I/O error on device sda1, logical block 81080960
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361779] Buffer I/O error on device sda1, logical block 81080961
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361879] Buffer I/O error on device sda1, logical block 81080962
Jun 30 05:22:44 Bumblebee kernel: [ 4796.361995] Buffer I/O error on device sda1, logical block 81080963
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362095] Buffer I/O error on device sda1, logical block 81080964
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362194] Buffer I/O error on device sda1, logical block 81080965
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362293] Buffer I/O error on device sda1, logical block 81080966
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362393] Buffer I/O error on device sda1, logical block 81080967
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362492] Buffer I/O error on device sda1, logical block 81080968
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362660] sd 6:0:0:0: [sda] Unhandled error code
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362663] sd 6:0:0:0: [sda]
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362664] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362666] sd 6:0:0:0: [sda] CDB:
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362667] Write(10): 2a 00 18 87 96 80 00 00 60 00
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362673] end_request: I/O error, dev sda, sector 411539072
Jun 30 05:22:44 Bumblebee kernel: [ 4796.362755] EXT4-fs warning (device sda1): ext4_end_bio:317: I/O error -5 writing to inode 12846636 (offset 0 size 49152 starting block 51442396)
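The `Write(10)` CDBs in the excerpt above can be cross-checked against the reported sector numbers. The following is a small sketch, assuming the standard WRITE(10) layout (opcode 0x2a, big-endian LBA in bytes 2-5, transfer length in bytes 7-8). The partition start of 2048 sectors and the 4 KiB block size in `sector_to_fs_block` are not stated in the post; they are inferred because the logged numbers line up with them. Both function names are hypothetical.

```python
def decode_write10(cdb_hex: str) -> tuple:
    """Return (lba, sector_count) from a WRITE(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")    # bytes 7-8: transfer length
    return lba, count


def sector_to_fs_block(sector: int, part_start: int = 2048,
                       sectors_per_block: int = 8) -> int:
    """Map a disk sector to a block number inside the partition.

    part_start=2048 and 8 sectors per block (4 KiB blocks on 512-byte
    sectors) are assumptions inferred from the log, not given in the post.
    """
    return (sector - part_start) // sectors_per_block


if __name__ == "__main__":
    lba, count = decode_write10("2a 00 26 a8 8f f8 00 00 08 00")
    print(lba, count, sector_to_fs_block(lba))
```

With these assumptions the first CDB yields LBA 648581112 and 8 sectors, matching `sector 648581112` and `size 4096` in the log, and the mapped block 81072383 matches the `Buffer I/O error ... logical block 81072383` line. The failing writes are therefore spread over ordinary data blocks at unrelated offsets rather than clustered on one spot of the disk.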
Code:
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732073] ata7.00: exception Emask 0x0 SAct 0x7fe SErr 0x0 action 0x6 frozen
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732193] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732279] ata7.00: cmd 61/00:08:00:7c:aa/04:00:26:00:00/40 tag 1 ncq 524288 out
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732279] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732496] ata7.00: status: { DRDY }
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732545] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732625] ata7.00: cmd 61/00:10:00:80:aa/04:00:26:00:00/40 tag 2 ncq 524288 out
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732625] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732843] ata7.00: status: { DRDY }
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732891] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732972] ata7.00: cmd 61/00:18:00:84:aa/04:00:26:00:00/40 tag 3 ncq 524288 out
Jun 30 05:34:57 Bumblebee kernel: [ 5529.732972] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:34:57 Bumblebee kernel: [ 5529.733194] ata7.00: status: { DRDY }
Jun 30 05:34:57 Bumblebee kernel: [ 5529.733247] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:34:57 Bumblebee kernel: [ 5529.733328] ata7.00: cmd 61/00:20:00:88:aa/04:00:26:00:00/40 tag 4 ncq 524288 out
Jun 30 05:34:57 Bumblebee kernel: [ 5529.733328] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:34:57 Bumblebee kernel: [ 5529.733544] ata7.00: status: { DRDY }
Jun 30 05:34:57 Bumblebee kernel: [ 5529.733593] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:34:57 Bumblebee kernel: [ 5529.733674] ata7.00: cmd 61/00:28:00:8c:aa/04:00:26:00:00/40 tag 5 ncq 524288 out
Jun 30 05:34:57 Bumblebee kernel: [ 5529.733674] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:34:57 Bumblebee kernel: [ 5529.743264] ata7.00: status: { DRDY }
Jun 30 05:34:57 Bumblebee kernel: [ 5529.748076] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:34:57 Bumblebee kernel: [ 5529.752770] ata7.00: cmd 61/00:30:00:90:aa/04:00:26:00:00/40 tag 6 ncq 524288 out
Jun 30 05:34:57 Bumblebee kernel: [ 5529.752770] res 40/00:ff:ff:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:34:57 Bumblebee kernel: [ 5529.762108] ata7.00: status: { DRDY }
Jun 30 05:34:57 Bumblebee kernel: [ 5529.766771] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:34:57 Bumblebee kernel: [ 5529.771393] ata7.00: cmd 61/00:38:00:94:aa/04:00:26:00:00/40 tag 7 ncq 524288 out
Jun 30 05:34:57 Bumblebee kernel: [ 5529.771393] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:34:57 Bumblebee kernel: [ 5529.780790] ata7.00: status: { DRDY }
Jun 30 05:34:57 Bumblebee kernel: [ 5529.785471] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:34:57 Bumblebee kernel: [ 5529.790119] ata7.00: cmd 61/00:40:00:98:aa/04:00:26:00:00/40 tag 8 ncq 524288 out
Jun 30 05:34:57 Bumblebee kernel: [ 5529.790119] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:34:57 Bumblebee kernel: [ 5529.799582] ata7.00: status: { DRDY }
Jun 30 05:34:57 Bumblebee kernel: [ 5529.804314] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:34:57 Bumblebee kernel: [ 5529.809007] ata7.00: cmd 61/00:48:00:9c:aa/04:00:26:00:00/40 tag 9 ncq 524288 out
Jun 30 05:34:57 Bumblebee kernel: [ 5529.809007] res 40/00:ff:ff:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:34:57 Bumblebee kernel: [ 5529.818478] ata7.00: status: { DRDY }
Jun 30 05:34:57 Bumblebee kernel: [ 5529.823214] ata7.00: failed command: WRITE FPDMA QUEUED
Jun 30 05:34:57 Bumblebee kernel: [ 5529.827958] ata7.00: cmd 61/00:50:00:a0:aa/04:00:26:00:00/40 tag 10 ncq 524288 out
Jun 30 05:34:57 Bumblebee kernel: [ 5529.827958] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 30 05:34:57 Bumblebee kernel: [ 5529.837808] ata7.00: status: { DRDY }
Jun 30 05:34:57 Bumblebee kernel: [ 5529.842859] ata7: hard resetting link
Jun 30 05:34:59 Bumblebee kernel: [ 5532.296057] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 30 05:35:00 Bumblebee kernel: [ 5532.335105] ata7.00: configured for UDMA/100
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348046] ata7.00: device reported invalid CHS sector 0
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348051] ata7.00: device reported invalid CHS sector 0
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348057] ata7.00: device reported invalid CHS sector 0
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348061] ata7.00: device reported invalid CHS sector 0
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348065] ata7.00: device reported invalid CHS sector 0
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348068] ata7.00: device reported invalid CHS sector 0
Code:
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348086] sd 6:0:0:0: [sda] Unhandled error code
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348088] sd 6:0:0:0: [sda]
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348091] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348093] sd 6:0:0:0: [sda] CDB:
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348095] Write(10): 2a 00 26 aa 28 00 00 04 00 00
Jun 30 05:35:00 Bumblebee kernel: [ 5532.348105] end_request: I/O error, dev sda, sector 648685568
Jun 30 05:35:00 Bumblebee kernel: [ 5532.353317] EXT4-fs warning (device sda1): ext4_end_bio:317: I/O error -5 writing to inode 12846129 (offset 16777216 size 8388608 starting block 81085824)
Jun 30 05:35:00 Bumblebee kernel: [ 5532.353324] buffer_io_error: 30 callbacks suppressed
Jun 30 05:35:00 Bumblebee kernel: [ 5532.353329] Buffer I/O error on device sda1, logical block 81085440
Jun 30 05:35:00 Bumblebee kernel: [ 5532.358623] Buffer I/O error on device sda1, logical block 81085441
Jun 30 05:35:00 Bumblebee kernel: [ 5532.363917] Buffer I/O error on device sda1, logical block 81085442
Jun 30 05:35:00 Bumblebee kernel: [ 5532.369236] Buffer I/O error on device sda1, logical block 81085443
Jun 30 05:35:00 Bumblebee kernel: [ 5532.374554] Buffer I/O error on device sda1, logical block 81085444
Jun 30 05:35:00 Bumblebee kernel: [ 5532.379778] Buffer I/O error on device sda1, logical block 81085445
Jun 30 05:35:00 Bumblebee kernel: [ 5532.384893] Buffer I/O error on device sda1, logical block 81085446
Jun 30 05:35:00 Bumblebee kernel: [ 5532.389910] Buffer I/O error on device sda1, logical block 81085447
Jun 30 05:35:00 Bumblebee kernel: [ 5532.394823] Buffer I/O error on device sda1, logical block 81085448
Jun 30 05:35:00 Bumblebee kernel: [ 5532.399641] Buffer I/O error on device sda1, logical block 81085449
Jun 30 05:35:00 Bumblebee kernel: [ 5532.404421] sd 6:0:0:0: [sda] Unhandled error code
Jun 30 05:35:00 Bumblebee kernel: [ 5532.404423] sd 6:0:0:0: [sda]
Jun 30 05:35:00 Bumblebee kernel: [ 5532.404424] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jun 30 05:35:00 Bumblebee kernel: [ 5532.404426] sd 6:0:0:0: [sda] CDB:
Jun 30 05:35:00 Bumblebee kernel: [ 5532.404427] Write(10): 2a 00 26 aa 44 00 00 04 00 00
Jun 30 05:35:00 Bumblebee kernel: [ 5532.404433] end_request: I/O error, dev sda, sector 648692736
Jun 30 05:35:00 Bumblebee kernel: [ 5532.409146] EXT4-fs warning (device sda1): ext4_end_bio:317: I/O error -5 writing to inode 12846129 (offset 16777216 size 8388608 starting block 81086720)
Jun 30 05:35:00 Bumblebee kernel: [ 5532.409243] sd 6:0:0:0: [sda] Unhandled error code
Jun 30 05:35:00 Bumblebee kernel: [ 5532.409245] sd 6:0:0:0: [sda]
Jun 30 05:35:00 Bumblebee kernel: [ 5532.409246] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jun 30 05:35:00 Bumblebee kernel: [ 5532.409248] sd 6:0:0:0: [sda] CDB:
Jun 30 05:35:00 Bumblebee kernel: [ 5532.409249] Write(10): 2a 00 26 aa 48 00 00 04 00 00
Jun 30 05:35:00 Bumblebee kernel: [ 5532.409255] end_request: I/O error, dev sda, sector 648693760
Jun 30 05:35:00 Bumblebee kernel: [ 5532.413879] EXT4-fs warning (device sda1): ext4_end_bio:317: I/O error -5 writing to inode 12846129 (offset 25165824 size 8388608 starting block 81086848)
And many more... If I hadn't just replaced the HDD, I would have said: clearly, the HDD is dead. But now I'm wondering: what is the actual problem? Could the SATA controller on the mainboard be faulty? Previously the HDDs were attached via a backplane; the current one is connected directly to the board, so it can't be the backplane.
Hardware:
Server from Rackable Systems (should be roughly this one: http://www.ebay.com/itm/RACKABLE-SYSTEMS-2U-2x-2-33GHz-Dual-Core-4GB-RAM-4x-250GB-HDD-/171248157660)
2x Xeon X5355
8x 2 GB ECC RAM
currently 1x 2.5" 500 GB HDD.