Context
A friend asked for help. A Windows box at a small medical practice had become unable to boot. Unfortunately, this was the central box, serving files, storing many years' worth of email and medical data, plus running a couple of medical applications. A single point of failure, without a reasonably fresh backup: a typical real-life problem of small, non-IT businesses.
The goal was to get the environment up and running again. A handful of medical workstations depend on the services of this central computer, and the staff needs the medical data it stores. I took the box home for a weekend to see what could be done.
This series of blog posts will give the reader a systematic approach to diagnosing and fixing this issue with open source Linux tools.
The plan
While driving home, I decided to develop an action plan rather than rushing in and making mistakes with irreversible impact. Instead of a clear sequence of steps, an ever-growing binary decision tree crystallized from hypothetical observations, possible causes and outcomes: a plethora of unknown variables, and a time window of a bit more than a weekend.
Pragmatic, time-boxed problem solving can be based on the OODA loop, which promotes a calm, systematic approach: iterations of absorbing information, filtering and contextualizing it, making the decision that seems most sound, and observing the effects of that decision, which then serve as input for the next iteration. This is how one adapts to changes during the problem solving process in a disciplined manner.
Symptoms
The single-socket, quad-core machine had Windows 8.1 installed with the Norton Security Suite, several common and a few specialized medical applications. One of the two hard drives held the operating system and some data, while a separate 1TB disk was dedicated to the large and growing amount of business data.
A few seconds after power-on, the machine got stuck at a state where the boot logo of the motherboard was displayed with a spinner; after a while the screen went black and nothing progressed for hours (until the machine was forcefully powered off). The HDD LED on the case indicated intense disk activity. These are the symptoms I collected over the phone.
First steps
The first decision was not to boot the operating system, as that might make the situation even worse, whether we were dealing with malware or "just" a software or hardware failure.
The second decision was to source a 2TB disk. Any data recovery or forensic investigation should only be performed on disk images or clones, never on the original disks, both to limit further wear on mechanically damaged drives and to retain a pristine copy of the starting state. The new disk would also serve as a storage upgrade for the machine once returned.
I entered the UEFI Firmware Setup Utility, and checked for indications of hardware failure, but did not see anything obvious.
Having booted SystemRescueCD, my preferred live Linux distribution for general system rescue, from a USB stick, I launched an extended self test on all 3 disks in parallel, along with two iterations of memtester, a userspace memory stress utility that works reliably under UEFI and does not require a reboot. Running these 4 tests in parallel saved many hours of unproductive time, with the acceptable risk of not being able to test the first 200MB of RAM.
# Review the current SMART health and attributes of each disk
smartctl -a /dev/sda | less -S
smartctl -a /dev/sdb | less -S
smartctl -a /dev/sdc | less -S
# Start the extended (long) self test on all three disks; it runs inside the drive firmware
smartctl -t long /dev/sda
smartctl -t long /dev/sdb
smartctl -t long /dev/sdc
# Stress test roughly 7.8GB of RAM from userspace, two iterations
memtester 7956M 2
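Since the long self tests run inside the drive firmware, the shell stays free and their progress can be polled from time to time. A minimal sketch, using the same device names as above:
# Check every 10 minutes whether the extended self tests have finished
watch -n 600 'for d in /dev/sda /dev/sdb /dev/sdc; do echo "== $d =="; smartctl -a $d | grep -A 1 "Self-test execution status"; done'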
Findings, assumptions and decisions
The disk /dev/sdb, where Windows was installed, failed the self test. The other two disks and also the memory passed, which allowed narrowing down the investigation, under the assumption that there is no hardware issue other than the disk that failed the self test. Making assumptions is fine as long as one takes note of them, so they can be revisited and challenged later if needed.
The good news was that the drive /dev/sda, the one holding the medical data on a single NTFS partition, was fine hardware-wise as per the self test. It could be mounted read-only and the filesystem seemed to be all right. The decision was to clone this drive to the new 2TB drive, and then repurpose /dev/sda later during the process of recovering the system drive.
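The read-only sanity check can be as simple as the sketch below (the partition device name and mount point are assumptions):
mkdir -p /mnt/data
mount -t ntfs-3g -o ro /dev/sda1 /mnt/data   # mount the NTFS data partition read-only
ls /mnt/data                                 # spot-check that the directory tree is readable
umount /mnt/data
The clone itself: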
ddrescue --ask --verbose --binary-prefixes --idirect --force /dev/sda /dev/sdc disk.map
One could certainly use a shorter command to clone one drive to another; however, during a recovery task I recommend using ddrescue with a map file and the arguments shown above. Copying will not be aborted on the first error, as it would be with dd. Instead, read errors are gracefully skipped, allowing one to maximize the amount of data copied before bad areas are retried. The map file captures progress and state, and allows one to continue the recovery by re-running ddrescue with the same or different arguments.
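For example, once a first pass has completed, one might retry just the remaining bad areas a few more times with direct device access; a hypothetical follow-up run reusing the same map file:
ddrescue --verbose --binary-prefixes --idirect --force --retry-passes=3 /dev/sda /dev/sdc disk.map
Thanks to the map file, ddrescue skips everything already copied and concentrates on the areas that failed before.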
The problematic disk
After re-running the self test a couple of times, I took a deeper look at the output of smartctl.
[root@sysresccd ~]# smartctl -a /dev/sdb
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.32-1-lts] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue
Device Model: WDC WD10EZRZ-00HTKB0
Serial Number: WD-WCC4J4APRJA4
LU WWN Device Id: 5 0014ee 261b8c4c0
Firmware Version: 01.01A01
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun May 17 11:13:28 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (13440) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 153) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 199 183 051 Pre-fail Always - 115158
3 Spin_Up_Time 0x0027 161 125 021 Pre-fail Always - 2908
4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 1423
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 086 086 000 Old_age Always - 10667
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 1423
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 83
193 Load_Cycle_Count 0x0032 177 177 000 Old_age Always - 71057
194 Temperature_Celsius 0x0022 110 107 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 199 197 000 Old_age Always - 292
198 Offline_Uncorrectable 0x0030 199 198 000 Old_age Offline - 285
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 198 000 Old_age Offline - 289
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 10605 10323936
# 2 Extended offline Completed: read failure 90% 10605 10323936
# 3 Extended offline Completed: read failure 90% 10605 10323936
# 4 Extended offline Completed: read failure 90% 10605 10323936
# 5 Extended offline Completed: read failure 90% 10604 10323936
# 6 Extended offline Completed without error 00% 773 -
# 7 Extended offline Completed: read failure 90% 770 977627520
# 8 Extended offline Completed: read failure 90% 760 18849666
# 9 Extended offline Completed: read failure 90% 759 18849665
#10 Extended offline Completed: read failure 90% 757 844589
#11 Extended offline Completed: read failure 90% 756 844584
#12 Extended offline Completed: read failure 90% 746 844584
#13 Extended offline Completed: read failure 90% 745 844584
#14 Extended offline Completed: read failure 90% 744 844584
#15 Extended offline Completed: read failure 90% 744 844588
#16 Extended offline Completed: read failure 90% 744 844588
#17 Extended offline Completed: read failure 90% 744 844584
11 of 16 failed self-tests are outdated by newer successful extended offline self-test # 6
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The SMART attributes 5 and 197 indicate that there are 292 bad sectors pending reallocation, and that so far this disk has not reallocated any sectors during its lifetime. That in itself would be fine, since reallocations usually happen when the bad areas are written to. Such a write would cause the number of pending sectors to decrease and the number of reallocated sectors to increase.
However, looking at the log of previously run self tests unveils some contradiction. It seems that this disk already had multiple bad sectors years ago, which appear to have been fixed by rewriting the bad areas and forcing a reallocation. Test number 6 shows that a few hours after the supposed rewrite, the self test did not find any bad sectors. So reallocations must have happened; however, the SMART attribute showing the number of reallocated sectors is still 0. This seems to be either a firmware bug, or a fishy "feature" used by Western Digital.
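These counters are worth keeping an eye on during the rest of the work; the attribute table alone can be dumped quickly, for example:
smartctl -A /dev/sdb | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'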
The next steps
As far as the current self test results are concerned, there is a consistently reproducible read error at logical block address 10323936. One could just fix that single sector by using dd to write directly to it and force a reallocation, then re-run the self test to check for further problematic sectors (a sketch follows after the caveats). I have successfully employed this practice in the past on Linux machines, but there are 3 caveats.
- Overwriting the bad 512 bytes permanently destroys the data of that sector. One should check which partition this sector falls under, and whether or not the filesystem on that partition uses the given sector. It might be an unused area (which can be safely overwritten), a 512-byte section of a file, or part of the directory structure, journal, superblock, or another area of the filesystem which may or may not tolerate being overwritten. The situation has to be analysed so the impact of overwriting becomes clear.
- There is a realistic chance that repeatedly re-reading a failing sector will sooner or later succeed, allowing us to recover that 512 bytes of data. However, the extra wear and tear this causes may kill the drive before the data can be read, making recovery of other areas impossible, so it should only be attempted on unhealthy drives after evaluating the risks.
- If you manually fix bad sectors on a failing drive and succeed, you end up with a drive that seems healthy but is known to be prone to errors. Whether or not that disk should be used in production is your choice and your liability. One might be better off just replacing it with a new disk.
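For completeness, here is a rough sketch of how the single-sector fix could be approached. The sector number is the one from the self-test log, while the partition layout, the ntfscluster invocation and the exact dd parameters are assumptions that must be verified against the actual disk before running anything:
# 1. Find out which partition LBA 10323936 falls into (partition table in sectors)
fdisk -l /dev/sdb
# 2. Assuming it falls into a partition starting at sector 2048, ask the NTFS tools
#    whether any file owns that sector (the sector number is relative to the
#    partition start here)
ntfscluster --sector $((10323936 - 2048)) /dev/sdb1
# 3. Only if the sector is unused, or losing its content is acceptable: overwrite
#    exactly one 512-byte sector in place to force a reallocation
dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=10323936 oflag=direct conv=fsync
# 4. Re-run the extended self test to look for further problematic sectors
smartctl -t long /dev/sdb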
Read on, proceed to the next part of the series!