Context
This is the final post in a series dedicated to documenting a systematic approach of recovering a non-bootable Windows machine with Linux and open source tools. Make sure you read the first, second, third and fourth post to familiarize yourself with the context and previous activities.
A S.M.A.R.T. extended self test indicated one of the disks was failing, other hardware components seemed to be healthy.
The first 48 hours of data recovery were spent running ddrescue
to read out as much disk area as possible, and ended up with 1081 unreadable kilobytes spread across 16 ranges of consecutive bad sectors. The filesystem was analyzed to identify the usage of the bad sector ranges, that is, the list of files and parts of filesystem metadata that could not be read. As the Master File Table of the NTFS partition was also damaged, information about file allocation on the bad blocks might have contained false negatives.
Priorities were set, calculated risks taken and four sector ranges, which did not seem to be used, were the largest and located towards the end of the partition were nuked with dd
to force reallocation of the bad sectors. All calculations and steps involved were covered in detail. A bit more than three hours later the unreadable area reported by ddrescue
decreased to around 68 kilobytes spread across 15 bad sector ranges. There was an additional finding: it was confirmed and demonstrated that the firmware of this Western Digital disk misreports Reallocated Sector Count
and Reallocated_Event_Count
, which raised serious concerns...
Next, a more sophisticated approach was taken to progress data recovery even further without overwriting any further sector of the failing disk. The goal was to focus only on those bad sector ranges, which are known to hold either files or MFT entries, which was done by manipulating the mapfile ddrescue
created to track the state of the rescue process. After multiple runs of ddrescue
totaling to around 38 hours and up to 700 retries of each bad sector, 5 bad areas remained, consisting of 8 bad sectors each.
Assessing the results achieved
Enough time was spent trying hard to resurrect data from bad sectors, and odds for further progress within the next days were negligible compared to the impact of service outage. The raw disk data read so far was backed up from the healthy drive to an image stored on the third drive, to have a snapshot to return to if subsequent action went wrong. No more attempt was taken to read from the failing disk.
[root@sysresccd ~]# ddrescue --ask --verbose --binary-prefixes /dev/sda4 /mnt/REPO/images/sda4-dd.img sda4-dd.map
Next came the well known steps of listing affected files and metadata based on the location of the remaining unrecoverable 5 instances of 8 sector ranges.
[root@sysresccd /mnt/REPO]# ddrescuelog --list-blocks=- disk-sdb-to-sda-priority.map | while read lba; do echo $(($lba - 1083392)); done >> bad-sectors-sdb4.txt
[root@sysresccd /mnt/REPO]# cat bad-sector-ranges-sdb4.txt
3733376-3733383
9240544-9240551
9324120-9324127
10061520-10061527
16845920-16845927
[root@sysresccd /mnt/REPO]# cat affected-files-sdb4.txt
Searching for sector range 3733376-3733383
Inode 112588 /ProgramData/Norton/{0C55C096-0F1D-4F28-AAA2-85EF591126E7}/NS_22.6.0.142/NCW/hlinks/ncwperfm.db.data/$DATA
* one inode found
Searching for sector range 9240544-9240551
Inode 248863 /Windows/ServiceProfiles/LocalService/AppData/Roaming/PeerNetworking/91e9854e5d4d7ac47cd0bf3ad8003817/922c235eb3d1d91e502cb02599cc24c7/grouping/edb02861.log/$DATA
* one inode found
Searching for sector range 9324120-9324127
Inode 96681 /$Extend/$UsnJrnl/$DATA($J)
* one inode found
Searching for sector range 10061520-10061527
Inode 0 /$MFT/$DATA
* one inode found
Searching for sector range 16845920-16845927
Inode 112588 /ProgramData/Norton/{0C55C096-0F1D-4F28-AAA2-85EF591126E7}/NS_22.6.0.142/NCW/hlinks/ncwperfm.db.data/$DATA
* one inode found
The results were interpreted as follows, along with potential next steps:
- A database file most probably related to norton community watch, is damaged. Once Windows boots, norton community watch could be removed and reinstalled in worst case.
- Another damaged file is probably a binary log file related to peer networking. Might need to delete folder and re-join home group or recreate the whole homegroup in case this was the machine that created the homegroup. This has to be done on-site.
- Unfortunately, the bad sectors at both the user journal and the master file table remained. The next step was to work with the NTFS filesystem offline.
Looking into damaged NTFS
Knowing that the MFT is redundantly stored on NTFS, a dry run of ntfsfix
was executed. The industry best practice had long been to leave checking NTFS volumes to Microsoft tools or commencial recovery software. A dry run was meant as a quick sanity check and it did not report any issue.
[root@sysresccd /mnt/REPO]# ntfsfix --no-action /dev/sda4
Mounting volume... OK
Processing of $MFT and $MFTMirr completed successfully.
Checking the alternate boot sector... OK
NTFS volume version is 3.1.
NTFS partition /dev/sda4 was processed successfully.
Next, testdisk
was used to extract the two damaged files just in case they would be needed later. During this process, there was no sign of filesystem inconsistency.
[root@sysresccd /mnt/REPO]# testdisk /log /dev/sda
TestDisk 7.1, Data Recovery Utility, July 2019
Christophe GRENIER <grenier@cgsecurity.org>
https://www.cgsecurity.org
### backup corrupted files via testdisk
[root@sysresccd /mnt/REPO/files]# find . -type f
./Windows/ServiceProfiles/LocalService/AppData/Roaming/PeerNetworking/91e9854e5d4d7ac47cd0bf3ad8003817/922c235eb3d1d91e502cb02599cc24c7/grouping/edb02861.log
./ProgramData/Norton/{0C55C096-0F1D-4F28-AAA2-85EF591126E7}/NS_22.6.0.142/NCW/hlinks/ncwperfm.db.data
The SystemRescueCD session was running for 4 days without a reboot, and now the time has come to save all logs and intermediate files from the recovery session, shut down, remove the faulty disk and boot into Windows. The system greeted me with a message that it would continue installing fixes. Some time and a few reboots later I could log in, and performed a quick filesystem check. The system reported no errors, however I forced an offline filesystem check, which did find issues to repair. Once rebooted and logged in again, another filesystem check was made, and no errors were reported.
Next, SFC
and DISM
were run, but covering these built-in Windows tools is outside of the scope of this blog. fortunately, both tools reported there was nothing to fix.
Outside of business hours, the machine was transported back to the medical point and a quick sanity check was run to confirm the whole environment including networking, homegroup, file sharing and medical software were working. No further steps were required on this machine, but an extended self-test was scheduled for all disks of all the other machines.
Wrapping up
The points below capture my recommendation on solving problems of this kind.
- Understand circumstances such as time and resource constraints and the consequences of full or partial failure.
- Think, read up and learn. Develop a deep understanding of your possibilities before you execute actions.
- Do not panic, it does not help. Assess the stituation, evaluate options then take the best available alternative. Observe the results, adapt and start over again.
- Take as few assumption with as low risk as possible, and document them!
- Optimize for efficiency: prioritize targets and harvest as much result with low risk action as you can, then proceed to lower value targets or higher risk actions.
- Optimize for time: plan long running non-interactive tasks and set up sleep schedule accordingly. Understand the risk of shortcuts.
Happy end. No medical or business critical data was lost, computer fully recovered and business restored after a few days. As a bonus, an unexpected hard drive firmware issue was found and documented.
Remember that each real-life situation is somewhat different, so your milage may vary. The steps taken and shared here have worked in the particular case documented, but there is not prescriptive recipe to successful data recovery. There are merely tools and guidelines, but you are responsible for developing a plan and taking the right decision during the process, based on your findings as you progress.
If any reader spots a logical error, or recommends a superior alternative for a particular task, leave a comment. One lucky commenter - willing to pick up in person - will be awarded with a used Western Digital BLUE 1TB disk in reasonably fair condition, guaranteed to show zero Current_Pending_Sector
, Reallocated Sector Count
and Reallocated_Event_Count
under S.M.A.R.T. Attributes. 😉 Just kidding...
I hope this systematic approach will serve as a good example and that particual steps can be adapted by the reader to overcome their respective challenges.