Monday, 8 June 2020

Non-bootable Windows recovery (Part 1)

Context

A friend asked for help. A Windows box at a small medical point had become unable to boot. Unfortunately, this was the central box, serving files, storing many years' worth of email and medical data, plus running a couple of medical applications. A single point of failure, without a reasonably fresh backup: a typical real-life problem of small, non-IT businesses.

The goal was to get the environment up and running. A handful of medical stations depend on the services of this central computer, and staff needs the medical data it stores. I took the box home for a weekend, to see what could be done.

This series of blog posts will give the reader a systematic approach to diagnosing and fixing this issue with open source Linux tools.

The plan

While driving home, I decided to develop an action plan rather than rush and make mistakes with irrevocable impact. Instead of a clear sequence of steps, an ever-growing binary decision tree crystallized from hypothetical observations, possible causes and outcomes. There were plenty of unknown variables and a time window of little more than a weekend.

Pragmatic, time-boxed problem solving can be based on the OODA loop (observe, orient, decide, act), which promotes a calm, systematic approach made up of iterations: absorbing information, filtering and contextualizing it, making a reasonably sound decision (whatever seems best at that point), acting on it, and observing the effects, which serve as input for the next iteration. This is how one adapts to changes during the problem solving process in a disciplined manner.

Symptoms

The single-socket, quad-core machine had Windows 8.1 installed with the Norton Security Suite, several common applications and a few special medical ones. One of the two hard drives held the operating system and some data, while a separate 1TB disk was dedicated to the large and growing amount of business data.

A few seconds after power on, the machine got stuck in a state where the boot logo of the motherboard was displayed with a spinner, then after a while the screen went black and the boot did not progress for hours (until the machine was forcefully powered off). The HDD LED on the case indicated intense disk activity. These are the symptoms I collected over the phone.

First steps

The first decision was not to boot the operating system, as that might make the situation even worse, whether we were dealing with malware or "just" a software or hardware failure.

The second decision was to source a 2TB disk. Any data recovery or forensic investigation should only be performed on disk images or clones and not on the disks themselves, to guard against the effects of mechanically damaged disks and to retain a backup of the state in which the work started. The disk would serve as a storage upgrade for the machine once returned.

I entered the UEFI Firmware Setup Utility, and checked for indications of hardware failure, but did not see anything obvious.

Having booted SystemRescueCD, my preferred live Linux distribution for general system rescue, from a USB stick, I launched an extended self test on all 3 disks in parallel, along with two iterations of memtester, a userspace memory stress utility that works reliably on UEFI and does not need a reboot. Running these 4 tests in parallel saved many hours of unproductive time, at the acceptable risk of not being able to test the first 200MB of RAM.


smartctl -a /dev/sda | less -S
smartctl -a /dev/sdb | less -S
smartctl -a /dev/sdc | less -S
smartctl -t long /dev/sda
smartctl -t long /dev/sdb
smartctl -t long /dev/sdc
memtester 7956M 2
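
The post does not say how the 7956M figure was chosen. One possible way to pick a size, sketched here under the assumption that the kernel exposes MemAvailable in /proc/meminfo, is to take the available memory and leave a safety margin for the live system. The function name memtester_size_mb and the margin passed to it are illustrative and not taken from the original session.

```shell
# Print a memtester size in MB: available memory minus a reserve.
# $1: reserve in MB (an assumed safety margin for the running live system)
# $2: meminfo-style file to read (defaults to /proc/meminfo)
memtester_size_mb() {
    awk -v reserve_mb="$1" '
        /^MemAvailable:/ { avail_kb = $2 }   # value is reported in kB
        END {
            mb = int(avail_kb / 1024) - reserve_mb
            if (mb < 1) exit 1               # nothing sensible left to test
            print mb
        }' "${2:-/proc/meminfo}"
}
```

It could be combined with the command above as memtester "$(memtester_size_mb 200)M" 2, where 200 is only an example margin.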

Findings, assumptions and decisions

The disk /dev/sdb, where Windows was installed, failed the self test. The other two disks and the memory passed their tests, which allowed for narrowing down the investigation, with the assumption that there is no hardware issue except the disk which failed the self test. Making assumptions is fine as long as one takes note of them so they can be revisited and challenged later if needed.

The good news was that the drive /dev/sda, the one holding medical data in a single NTFS partition, was fine hardware-wise as per the self test. It could be mounted read-only and the filesystem seemed to be all right. The decision was to clone this drive to the new 2TB drive, and then repurpose /dev/sda later during the process of recovering the system drive.


ddrescue --ask --verbose --binary-prefixes --idirect --force /dev/sda /dev/sdc disk.map

One could certainly use a shorter command to clone one drive to another; however, during a recovery task, I recommend using ddrescue with a map file and arguments like the ones shown above. Copying will not be aborted on the first error like it would be when using dd. Instead, read errors are gracefully skipped, allowing one to maximize the amount of data copied before bad areas are retried. The map file captures progress and state and allows one to continue with recovery by re-running ddrescue with the same or different arguments.
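
To illustrate what the map file captures, here is a minimal sketch that sums up the bytes per state in a map file such as disk.map. It assumes the documented plain-text layout: comment lines starting with #, one status line, then one "position size status" line per block with hexadecimal values, where + marks rescued data and - marks bad sectors. The function name mapfile_summary is made up for this example, and every other status is simply counted as pending.

```shell
# Summarize a ddrescue map file: bytes rescued, bad, and still pending.
# $1: path to the map file
mapfile_summary() {
    rescued=0 bad=0 pending=0 seen_status=0
    while read -r pos size status rest; do
        case $pos in
            ''|'#'*) continue ;;               # blank line or comment
        esac
        if [ "$seen_status" -eq 0 ]; then      # first data line is the status line
            seen_status=1
            continue
        fi
        # $size is expanded textually so the shell parses it as a hex constant
        case $status in
            '+') rescued=$((rescued + $size)) ;;
            '-') bad=$((bad + $size)) ;;
            *)   pending=$((pending + $size)) ;;  # non-tried, non-trimmed, non-scraped
        esac
    done < "$1"
    echo "rescued=$rescued bad=$bad pending=$pending"
}
```

Running mapfile_summary disk.map between ddrescue runs gives a quick view of how much is left to retry.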

The problematic disk

After re-running the self test a couple of times, I took a deeper look at the output of smartctl.


[root@sysresccd ~]# smartctl -a /dev/sdb
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.32-1-lts] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD10EZRZ-00HTKB0
Serial Number:    WD-WCC4J4APRJA4
LU WWN Device Id: 5 0014ee 261b8c4c0
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun May 17 11:13:28 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
     was completed without error.
     Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
     the read element of the test failed.
Total time to complete Offline 
data collection:   (13440) seconds.
Offline data collection
capabilities:     (0x7b) SMART execute Offline immediate.
     Auto Offline data collection on/off support.
     Suspend Offline collection upon new
     command.
     Offline surface scan supported.
     Self-test supported.
     Conveyance Self-test supported.
     Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
     power-saving mode.
     Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
     General Purpose Logging supported.
Short self-test routine 
recommended polling time:   (   2) minutes.
Extended self-test routine
recommended polling time:   ( 153) minutes.
Conveyance self-test routine
recommended polling time:   (   5) minutes.
SCT capabilities:         (0x3035) SCT Status supported.
     SCT Feature Control supported.
     SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   183   051    Pre-fail  Always       -       115158
  3 Spin_Up_Time            0x0027   161   125   021    Pre-fail  Always       -       2908
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1423
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10667
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1423
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       83
193 Load_Cycle_Count        0x0032   177   177   000    Old_age   Always       -       71057
194 Temperature_Celsius     0x0022   110   107   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   199   197   000    Old_age   Always       -       292
198 Offline_Uncorrectable   0x0030   199   198   000    Old_age   Offline      -       285
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   198   000    Old_age   Offline      -       289

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     10605         10323936
# 2  Extended offline    Completed: read failure       90%     10605         10323936
# 3  Extended offline    Completed: read failure       90%     10605         10323936
# 4  Extended offline    Completed: read failure       90%     10605         10323936
# 5  Extended offline    Completed: read failure       90%     10604         10323936
# 6  Extended offline    Completed without error       00%       773         -
# 7  Extended offline    Completed: read failure       90%       770         977627520
# 8  Extended offline    Completed: read failure       90%       760         18849666
# 9  Extended offline    Completed: read failure       90%       759         18849665
#10  Extended offline    Completed: read failure       90%       757         844589
#11  Extended offline    Completed: read failure       90%       756         844584
#12  Extended offline    Completed: read failure       90%       746         844584
#13  Extended offline    Completed: read failure       90%       745         844584
#14  Extended offline    Completed: read failure       90%       744         844584
#15  Extended offline    Completed: read failure       90%       744         844588
#16  Extended offline    Completed: read failure       90%       744         844588
#17  Extended offline    Completed: read failure       90%       744         844584
11 of 16 failed self-tests are outdated by newer successful extended offline self-test # 6

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The SMART attributes 5 and 197 indicate that there are 292 bad sectors pending reallocation, and that so far this disk has not reallocated any sectors during its lifetime. That in itself would be fine, since reallocations usually happen when the bad areas are written to. Such a write would cause the number of pending sectors to decrease, and the number of reallocations to increase.
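
When the full report is this long, it can help to pull out only the sector-related counters. The following is a small sketch, not part of the original session: it reads smartctl output on standard input and prints attributes 5, 196, 197 and 198 with their raw values. The function name sector_health is invented for this example.

```shell
# Print ID, name and raw value of the sector-related SMART attributes.
# Reads `smartctl -A` or `smartctl -a` output on stdin.
sector_health() {
    # NF >= 10 skips unrelated lines that also start with 5,
    # such as span 5 of the selective self-test table.
    awk '$1 ~ /^(5|196|197|198)$/ && NF >= 10 { printf "%s %s=%s\n", $1, $2, $NF }'
}
```

A typical call would be smartctl -A /dev/sdb | sector_health.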

However, looking at the log of previously run self tests unveils a contradiction. It seems that this disk already had multiple bad sectors years ago, which seem to have been fixed by rewriting the bad areas and forcing a reallocation. Test number 6 shows that a few hours after the supposed rewrite, the self test did not find any bad sectors. So reallocations must have happened, yet the SMART attribute showing the number of reallocated sectors is still 0. This seems to be either a firmware bug, or a fishy "feature" used by Western Digital.
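
The history in the self test log is easier to read when grouped by failing address. As a rough sketch (the function name failed_lbas is invented here), the following reads the log on standard input and prints one line per LBA: the address, how many tests failed there, and the lowest and highest power-on hours at which it failed.

```shell
# Group failed self-tests by LBA_of_first_error.
# Reads `smartctl -l selftest` or `smartctl -a` output on stdin.
# Output columns: LBA, number of failures, first and last power-on hours.
failed_lbas() {
    awk '/^# *[0-9]+ / && /read failure/ {
             lba = $NF
             hours = $(NF - 1) + 0            # force numeric comparison
             count[lba]++
             if (!(lba in first) || hours < first[lba]) first[lba] = hours
             if (hours > last[lba]) last[lba] = hours
         }
         END { for (lba in count) print lba, count[lba], first[lba], last[lba] }' |
        sort -n
}
```

Fed with the log above, this separates the old failures recorded between roughly 744 and 770 power-on hours from the current ones at 10604 to 10605 hours.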

The next steps

As far as the current self test results are concerned, there is a consistently reproducible read error at logical block address 10323936. One could just fix that single sector by using dd to write directly to it and force a reallocation, then re-run the self test to check for further problematic sectors. I have successfully employed this practice in the past on Linux machines, but there are 3 caveats.

  1. Overwriting the bad 512 bytes would permanently destroy the data of that sector. One should check which partition this sector falls under, and whether or not the filesystem on that partition uses the given sector. It might be an unused area (which can be safely overwritten), a 512-byte section of a file, or part of the directory structure, journal, superblock, or another area of the filesystem which may or may not tolerate being overwritten. The situation has to be analysed so the impact of overwriting becomes clear.
  2. There is a realistic chance that repeatedly re-reading a failing sector would sooner or later succeed, allowing us to recover those 512 bytes of data. However, the extra wear and tear this causes may kill the drive before the data can be recovered, making recovery of other areas impossible, so it should only be applied on unhealthy drives after evaluating the risks.
  3. If you manually fix bad sectors on a failing drive and succeed, you end up with a drive that seems healthy but is known to be prone to errors. Whether or not that disk should be used in production is your choice and your liability. One might be just better off replacing it with a new disk.
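
Before any write is considered, the first two caveats can be approached without modifying the disk. The sketch below shows the arithmetic for locating a sector inside a partition and a read-only probe of a single logical sector. The function names are invented, and the partition start (2048) and cluster size (8 sectors) used in the usage note are placeholders, since the real values are not given in this post and would have to come from the partition table and the filesystem. Note also that this drive reports 4096-byte physical sectors, so a bad area may in practice cover the 8 logical sectors that share one physical sector.

```shell
# Offset of a disk LBA inside its partition, and the filesystem cluster holding it.
# $1: LBA on the disk, $2: partition start sector, $3: sectors per cluster
locate_sector() {
    offset=$(( $1 - $2 ))
    if [ "$offset" -lt 0 ]; then
        echo "LBA $1 lies before the partition start" >&2
        return 1
    fi
    echo "offset=$offset cluster=$(( offset / $3 ))"
}

# Read one 512-byte logical sector to stdout without writing to the disk.
# $1: device or image file, $2: LBA; any further arguments are passed to dd.
read_sector() {
    dev=$1 lba=$2
    shift 2
    dd if="$dev" bs=512 skip="$lba" count=1 status=none "$@"
}
```

With the placeholder values, locate_sector 10323936 2048 8 prints the offset and cluster to look up in the filesystem. A probe such as read_sector /dev/sdb 10323936 iflag=direct | hexdump -C either dumps the sector or makes dd report the read error, and per the second caveat it should not be repeated carelessly on a failing drive.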

Read on, proceed to the next part of the series!
