Technical Overview of Popular Software Data Recovery Procedures

How is ddrescue different?

In terms of data recovery, ddrescue aggressively jumps over any sections of the drive that are difficult to read during the early phases. As soon as the drive starts to read more slowly, or runs into bad sectors, ddrescue jumps over a range of sectors and continues on. It returns to these skipped areas on later passes to try to get more data from them, after the easily accessible sectors have been backed up to a healthy drive.

It is critical to understand that ddrescue cannot do anything to reduce the overall stress a software recovery attempt places on the drive. It only postpones the most harmful recovery procedures until the end of the process, which is the best that can be expected from a software data recovery imaging tool.
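
To ground the discussion that follows, here is a minimal sketch of a typical ddrescue imaging run; the device and file names (/dev/sdb, recovered.img, recovery.map) are placeholders for the patient drive and the destination files on a healthy drive:

```bash
# Minimal sketch of imaging an unstable drive with GNU ddrescue.
# /dev/sdb, recovered.img, and recovery.map are placeholder names.
# -d (--idirect) reads the patient drive with direct disc access,
# bypassing the kernel cache where the platform supports it.
sudo ddrescue -d /dev/sdb recovered.img recovery.map
```

The map file is what allows ddrescue to remember skipped areas, return to them on later passes, and resume after an interruption.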

What is drive stress? What causes the most damage?

A recovery procedure is called stressful when it has a relatively high likelihood of causing further harm to the drive. By far the most destructive process involved in recovering a drive with software tools is the drive processing its own bad sectors. Every time a read command falls on a bad sector, the drive will spend 3-7 seconds making hundreds of failed read attempts, then conclude the sector is bad, write to its firmware area (located on the platters) to update various logs, and finally respond with an error message, at which point the software can send the next read command.

There is a lot that can go wrong during this process, depending on the initial cause of this read instability:

  • If the heads are degraded, numerous failed read attempts will quickly cause further degradation, risking complete mechanical failure.
  • If the platters are degraded, numerous failed read attempts focused on the worst areas of the platter will increase the level of media damage and potentially make the drive unrecoverable even by professionals.
  • Various logs being updated in the firmware area can become full, causing a firmware failure.
  • Intense usage of the drive’s processor can exacerbate electronic issues, causing a failure of the PCB.
  • Degraded read/write heads can introduce corruption to the drive’s firmware while updating various logs, causing a firmware failure.

Needless to say, we should attempt to read bad sectors as rarely as possible. Relatively speaking, successful reads cause very little harm to the drive.

When will ddrescue try to read bad sectors?

It will happen more and more with each pass. During the first pass of the Copying phase, ddrescue will be skipping some of the bad sectors based on the two data points it has access to:

  • success or failure of its read commands,
  • and the length of time successful read commands take to process.

Bad areas will often have numerous slow and bad sectors grouped together. Slow sectors will still be successfully read by the drive, but they will take many times longer to process in comparison with healthy sectors. If a bad area starts with slow sectors, ddrescue won’t necessarily have to hit a single bad sector to skip it on this pass, since it will look at the long read times and skip ahead based on that. If a bad area starts with bad sectors, or if the skipped range of sectors was too small, ddrescue will have to hit a bad block and go through the harmful 3-7 second processing time at least once to skip ahead.

The second pass of the Copying phase is a lot like the first, but it will be going backwards and only targeting previously skipped areas. It will still be skipping ahead when it runs into slow or bad sectors.

The third pass of the Copying phase is when the recovery process usually becomes harsher on the drive. At this point ddrescue will not do any more skipping; instead, it will try to read every non-tried block in all previously skipped areas. The fact that these areas were skipped on the first two passes means that they are surrounded by either slow or bad sectors. Ddrescue will attempt to read every block within those bad areas, forcing the drive to process every single remaining bad block.

During the last two phases – Trimming and Scraping – only known bad blocks will be left and ddrescue will be trying to read them sector by sector to recover every good sector within every bad block. These blocks previously failed to read, so we know there is a minimum of 1 bad sector present in each block. During these phases ddrescue will be hitting bad sectors so frequently that the majority of the recovery time will be spent on fruitless processing of bad sectors. Allowing these passes to run has a high risk of causing further damage to the drive, so we should only do this if we are certain that the customer’s data is not valuable enough to warrant a recovery by someone with specialized data recovery equipment.
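
All of this phase-by-phase state lives in ddrescue's map file. As a rough illustration (names match the earlier sketch), the status characters in that file map directly onto the passes described above:

```bash
# ddrescue's map file (recovery.map in the earlier sketch) records the state
# of every block, which is how skipped areas are revisited and how an
# interrupted run resumes. The status characters it uses are:
#   ?  non-tried    (still to be attempted during Copying)
#   *  non-trimmed  (failed block, waiting for the Trimming phase)
#   /  non-scraped  (failed block, waiting for the Scraping phase)
#   -  bad sector   (confirmed unreadable)
#   +  finished     (successfully recovered)
# The map file is plain text; skipping its comment lines shows the block list:
grep -v '^#' recovery.map | less
```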

For some numerical examples, let’s say we have two hypothetical drives with light and medium levels of read instability. Here is how many times ddrescue will hit bad sectors in each phase and pass using its default settings:

 

Drive Information                           Light Instability                   Medium Instability
Number of Bad Areas                         100                                 300
Bad Blocks per Bad Area                     1 to 9 (500 total bad blocks)       1 to 19 (3,000 total bad blocks)
Bad Sectors per Bad Block                   1 to 5 (1,500 total bad sectors)    1 to 9 (15,000 total bad sectors)
% of Bad Areas Surrounded by Slow Sectors   25%                                 50%

Ddrescue Phase (Pass)                       Number of Times Doing Harmful Bad Sector Processing
Copying (1)                                 214                                 900
Copying (2)                                 64                                  339
Copying (3)                                 222                                 1,761
Trimming                                    900                                 5,667
Scraping                                    600                                 9,333
Total                                       2,000                               18,000

 

This is a highly simplified example. The main idea behind it is to show that the majority of the harmful processing happens during the Trimming and Scraping phases of ddrescue’s imaging process. We are assuming a perfectly even distribution of all read instabilities to simplify the calculations. Here is how we calculated the numbers for light instability:

Copying (1) – in 75 of the 100 bad areas ddrescue will hit the first bad block. A skip of a single block will usually not be sufficient to skip past the whole bad area, as this example has anywhere from 1 to 9 bad blocks per bad area. Ddrescue doubles the skip range with every repeated skip, so it follows that in this pass ddrescue will hit a bad block once if the bad area has 1-2 bad blocks (2/9 chance), twice if it has 3-5 bad blocks (3/9 chance), and thrice if it has 6 to 9 bad blocks (4/9 chance). This adds up to: 75*1*2/9 + 75*2*3/9 + 75*3*4/9 = 166.7, or 167 failed read attempts.

In 25 of the 100 bad areas ddrescue will skip the first bad block based on surrounding slow sectors. It follows that it will not hit a bad block if the bad area has 1 bad block in it, it will hit a bad block once if the bad area has 2-3 bad blocks, twice if it has 4-6 bad blocks, and thrice if it has 7-9 bad blocks. This adds up to: 25*0*1/9 + 25*1*2/9 + 25*2*3/9 + 25*3*3/9 = 47.2, or 47 failed read commands. The total for Copying (1) is 47 + 167 = 214 failed read attempts.

Copying (2) – This pass will go backwards and only target the previously skipped areas that were not already marked bad in Copying (1). In 75 of the 100 bad areas when ddrescue hit the first bad block during Copying (1), it will not hit a bad block again going backwards if the bad area only had 1, 3, or 6 bad blocks in it because the last bad block of the area would already be marked from the Copying (1) pass. It will hit a bad block once if the bad area had 2, 4, 5, 7, or 8 bad blocks, and it will hit a bad block twice if the bad area had 9 bad blocks. This adds up to: 75*0*3/9 + 75*1*5/9 + 75*2*1/9 = 58.3, or 58 failed read attempts.

In 25 of the 100 bad areas when ddrescue skipped the first bad block during Copying (1), it will not hit a bad block again going backwards if the bad area had 1, 2, 3, 4, 5, 7, or 8 bad blocks in it. It will hit a bad block once if the bad area had 6 or 9 bad blocks. This adds up to 25*0*7/9 + 25*1*2/9 = 5.6, or 6 failed read commands. The total for Copying (2) is 58 + 6 = 64 failed read attempts.

Copying (3) – Out of 500 bad blocks, 214 + 64 = 278 were already marked in Copying (1) and Copying (2), so 222 bad blocks are left to be marked in this pass, leading to 222 failed read attempts.

Trimming – There are 500 bad blocks marked at this point and ddrescue will be trying them sector by sector from each end until it hits the first bad sector. In cases of 1 bad sector per bad block, it will hit the bad sector once. In cases of 2, 3, 4, or 5 bad sectors per block it will hit one bad sector from each end. This adds up to 500*1/5*1+ 500*4/5*2 = 900 failed read attempts.

Scraping – out of the total of 1,500 bad sectors, 900 have already been tried, so 600 are left to try on this phase, leading to 600 failed read attempts.
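
For anyone who wants to double-check the arithmetic, here is a small awk sketch that reproduces the light-instability totals from the example's assumptions:

```bash
# Reproduces the light-instability totals from the worked example above.
# All inputs are the example's assumptions, not measurements from a real drive.
awk 'BEGIN {
  c1 = int(75*(1*2 + 2*3 + 3*4)/9 + 25*(0*1 + 1*2 + 2*3 + 3*3)/9 + 0.5)  # Copying (1): 214
  c2 = int(75*(0*3 + 1*5 + 2*1)/9 + 25*(0*7 + 1*2)/9 + 0.5)              # Copying (2): 64
  c3 = 500 - c1 - c2                                                      # Copying (3): 222
  tr = 500*1/5*1 + 500*4/5*2                                              # Trimming: 900
  sc = 1500 - tr                                                          # Scraping: 600
  printf "Copying: %d + %d + %d, Trimming: %d, Scraping: %d, Total: %d\n",
         c1, c2, c3, tr, sc, c1 + c2 + c3 + tr + sc
}'
```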

Disadvantages of ddrescue

The biggest disadvantage of ddrescue is the lack of targeted recovery options. Ddrescue has no information about what it is imaging, so it treats every sector in exactly the same way instead of reading the file system metadata, parsing it, and recovering only specific files/folders. This means it will spend a lot of time imaging countless sectors that are not necessary for recovering the customer’s files.

More importantly, it will not be able to focus its recovery efforts on file system metadata, which is the most important part of the drive to recover for data recovery purposes. Most drives fail slowly over time, and when the customer’s drive starts showing the first signs of read instability while it is still in their computer, their Windows or OS X operating system will try to fix the corruption by doing a lot of writing, attempting to replace corrupted entries with intact ones. Unfortunately, these writes, combined with the general degradation of the drive, cause more bad sectors to appear within the file system metadata.

The skipping algorithm used by ddrescue will actually skip the file system metadata of the drive more frequently than average on early phases because bad sectors are more commonly found within it. If the drive is too unstable for the recovery process to ever be completed, this will pose a large problem, leading to fewer files being recovered.

In this sense, logical recovery software has a more intelligent approach, since it will read file system metadata and allow us to focus on specific files. However, in most cases this benefit is not worth the aforementioned drawbacks of using logical recovery software directly on the customer’s drive.

The worst recovery method: Using software designed to “repair” bad sectors and restore drives to working condition.

 

Pros: None.

Cons:
  • Applies an extreme level of stress on the drive with numerous read/write operations that focus on the worst areas of the drive.
  • Recovers no data, because everything is written back to the same unstable drive.
  • Erases customer data whenever faced with bad sectors.
 

Tools like Spinrite or HDD Regenerator claim to repair hard drives, so we’ll explain exactly how they work.

Almost all hard drives will develop bad sectors at some point in their life. To deal with this, drives have internal processes designed to reallocate bad sectors. Whenever a drive identifies a bad sector, it will add its address to the grown defect list within its firmware. All subsequent read/write commands that fall on that sector will be forwarded to the drive’s reserve, so the drive will effectively be using a healthy sector from the reserve instead of the bad sector. Modern drives actually have a built-in self-scan process that allows them to seek and reallocate bad sectors in this manner. This self-scan process is started automatically after only 20-30 seconds of idling.
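
As a side note, the drive's own reallocation history can be inspected (read-only) through its SMART attributes. A hedged example using smartmontools, with /dev/sdb again as a placeholder:

```bash
# Attribute 5 (Reallocated_Sector_Ct) counts sectors already swapped for
# reserve sectors; 197 (Current_Pending_Sector) counts sectors the drive has
# flagged but not yet reallocated. Querying SMART is a light, read-only
# interaction, but it still requires the patient drive to be powered up.
sudo smartctl -A /dev/sdb | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'
```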

How do bad sector repair tools work?

Bad sector repair tools try to force reallocations to hide bad sectors, similarly to the standard self-scan process. They accomplish this by reading sectors from the drive and writing them back to the same drive many times, which gives the drive’s firmware a greater opportunity to determine that a sector is bad and add it to the defect list. Whenever a sector cannot be read, such tools will simply overwrite it with meaningless data, thereby permanently erasing some of the customer’s data on this drive.

This process is the absolute fastest way to cause further harm to a degraded drive. The mechanics and/or electronics will degrade quickly, and the firmware could suffer a failure if the number of reallocations gets high enough to overfill the defect list. Worst of all, no data is actually recovered, because everything is written back to the same unstable drive.

But it worked for me before!

Bad sector repair tools can help if the data recovery software tool used on the drive after the ‘repair’ process uses only large block sizes to read the drive. Many logical recovery software tools use a block size of 4,096 sectors. If there is a single bad sector within the 4,096 sectors that the software tool tries to access at a time, the entire block will be considered “bad” and we would effectively lose 4,096 sectors of data instead of just 1.

If we forced a reallocation of that bad sector using a bad sector repair tool prior to running data recovery software, we’d be able to recover an additional 4,095 sectors of data. Naturally, this advantage only applies when the data recovery software used on the ‘repaired’ drive is limited to large read block sizes.
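
The arithmetic behind that example is worth spelling out, assuming standard 512-byte sectors:

```bash
# One unreadable sector inside a 4,096-sector read block costs the whole block:
awk 'BEGIN { printf "4096 sectors x 512 bytes = %.1f MiB lost per bad sector\n",
             4096*512/1024/1024 }'
```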

Properly designed data recovery imaging software (like GNU ddrescue) will try reading the previously failed blocks sector by sector on future passes to get the data from all remaining good sectors. In this situation, forcing bad sector reallocations beforehand will not lead to more data being recovered and will only cause a lot of additional harm to the drive.

Customer Communication

Ideally, the customer should be notified of possible risks before any work is done on their drive. Software tools cannot do anything to mitigate further degradation of the drive, so there is a possibility that the drive will suffer a complete failure before we have a chance to recover enough data.

Read instabilities can be properly handled by someone with specialized hardware data recovery tools from companies like DeepSpar, Atola, Ace Laboratory, or DFL. Such tools generally cost somewhere between $2,000US and $8,000US, and they have a host of features that reduce drive stress many times over compared with any software recovery method.

If the drive only has read instability problems, it could be easily recovered by someone who has one of these specialized tools. The current market rate for this work in Canada/USA is around $300US/drive. In most cases this work does not require much data recovery expertise from the user, so such a service is sometimes offered by general IT service providers as well as by professional data recovery companies.

If the drive suffers a complete failure, the recovery process becomes much more complicated. At that point it will have to be recovered by a professional data recovery company with a full range of dedicated tools and years of experience in the field. The current market rate for this work in Canada/USA is usually between $800US/drive and $2,000US/drive, depending on the exact problem. Roughly 10% of these cases will be unrecoverable even by professionals due to heavy platter damage, or corruption of critical unique sections of drive firmware.

If we start a recovery with ddrescue and the drive suffers a complete failure partway through the process, we will have done a major disservice to the customer. They would now be looking at paying a much larger fee to get their data recovered, would likely have to send their drive further away to the closest suitable professional, and would risk permanent data loss if the drive turns out to be unrecoverable. Needless to say, this is a terrible situation to put the customer in.

Only the owner of the data can accept such a risk, so as a first step we should educate them that the drive could suffer a complete failure due to this recovery attempt, after which it will be much more expensive or impossible to recover the data. However, if the attempt succeeds, the cost and/or lead time will be lower than sending it out right away.

What we should do from that point forward will depend on the customer’s response. If the data is critical to them and $300 is within their price range then they will likely ask for the drive to be outsourced. In that case, no software recovery attempts should be made and the drive should be forwarded to someone with specialized tools.

On the other end of the spectrum, if the data is not worth $300, the customer will likely ask to proceed with the software recovery attempt. At this point we can confidently execute the full ddrescue process because we are certain that this will be the one and only recovery attempt for this drive.

Some customers will say that they want to give software recovery a try in an attempt to save money and/or time, but that they will outsource to a professional if it doesn’t work out. This situation is more of a gray area. They did agree to the risks, but if their drive crashes partway through the recovery, they will likely still be quite unhappy about it.

In such cases it can be an option to run ddrescue without the Trimming and Scraping phases by adding '-n -N' to the ddrescue command (make sure to use a map file!). This instructs ddrescue to only run the Copying phase, which carries a much lower level of risk for the drive. If the drive does not have any read instabilities, nothing will have to be skipped and we will recover 100% of the data. If it does have read instabilities, bad blocks will be left behind and not attempted sector by sector.
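
A sketch of that Copying-only run, reusing the placeholder names from earlier:

```bash
# Copying-only run: -N (--no-trim) and -n (--no-scrape) tell ddrescue to skip
# the Trimming and Scraping phases. Keeping the map file means those phases
# can still be run later by re-issuing the command without -N -n.
sudo ddrescue -d -N -n /dev/sdb recovered.img recovery.map
```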

After this process completes, we can scan the image with logical recovery software to get an idea of what was recovered. If we find that there is already enough data there to satisfy the customer then we can always go back and run Trimming and Scraping phases later to fill in as many sectors as possible. If we find that lots of data is missing after the Copying phase, it would mean that the drive is highly unstable, which is an indicator to consider outsourcing the case to someone with specialized data recovery tools.
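
Two low-impact ways to check the result without touching the patient drive again (ddrescuelog ships with GNU ddrescue; losetup is part of util-linux; file names are the placeholders used above):

```bash
# Summarize how much was rescued, straight from the map file:
ddrescuelog -t recovery.map
# Attach the image read-only as a loop device (with partition scanning) so its
# file systems can be examined or scanned with logical recovery software:
sudo losetup --find --show --read-only -P recovered.img
```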

Upgrading from Software

If you ever find yourself in the market for specialized data recovery tools, have a look at our RapidSpar. This hardware & software solution was specifically designed for IT service providers who want to take their data recovery capabilities to the next level to generate more revenue and increase customer satisfaction.

It includes many features that:

  1. improve detection and recoverability of unstable drives.
  2. drastically reduce drive stress during recovery, making the process safer.
  3. speed up the recovery process, improving turnaround times.
  4. provide more diagnostic information, including the integrity of every recovered file.
  5. fix common firmware issues.

Taken together, these features allow RapidSpar to recover about 50% of the cases that software recovery methods struggle with, all while making the recovery process much faster and safer for the drive. Any provider of IT services in the US, Canada, UK, or Australia is welcome to sign up for our free demo program (http://rapidspar.com/demoterms.html) to test out a RapidSpar for a week. Feedback from RapidSpar’s first users can be seen on the Technibble forums: https://www.technibble.com/forums/threads/rapidspar-demo-program.68924/

Summary

Here are the most important points to remember from this guide:

  1. A hard drive is unstable when it is no longer capable of processing standard commands in a predictable manner.
  2. The most common symptoms of read instabilities include bad sectors, slow response times, and intermittent exceptions that take the drive offline.
  3. Read instabilities are most commonly caused by physical degradation of the read/write heads and/or platters. Other causes include firmware corruption and electronic instabilities.
  4. There is no way to determine the underlying cause of read instabilities using software tools. Bad sectors will be processed the exact same way regardless of whether they are caused by physical, electronic, or firmware problems.
  5. The main goal in working with unstable drives is to quickly save the data to another healthy drive, since there is no way to predict when drives will suffer a complete failure.
  6. Mounting unstable drives in Windows or OS X causes additional degradation and data loss. It should not be done directly on patient drives.
  7. Using logical recovery software tools directly on the patient drive is not recommended. There isn’t any way to determine whether the drive has read instabilities prior to reading the entire surface. If the patient drive turns out to be unstable and we try to scan it with logical recovery software, we will leave some good data behind because such tools are not designed to work with unstable drives. If the recovery is not an immediate success, the drive will suffer unnecessary degradation and nothing will be recovered because the scan results are only stored in RAM.
  8. Logical recovery software can do file system repair, quick scans, or full scans. It is not recommended to run file system repair functions directly on the patient drive because if it doesn’t succeed, it will erase some data and make the situation worse.
  9. GNU ddrescue is the best approach because it: saves everything it recovers directly to a healthy drive; does not require the patient drive to go through the harmful mounting processes of Windows or OS X; and leaves the most damaging recovery processes until the end. However, it cannot do anything to reduce the overall stress applied on the drive by the recovery attempt.
  10. By far the most stressful part of the software recovery process is when the drive is processing its own bad sectors. It takes 3-7 seconds to process each read command that falls on a bad sector, during which the drive makes hundreds of failed read attempts and writes to the firmware area on its platters to update various logs.
  11. The majority of bad sector processing will happen on Trimming and Scraping phases of ddrescue’s imaging process.
  12. Bad sector repair tools should not be used for data recovery purposes because they apply an extreme level of stress on the drive, do not recover any data, and cause data loss by overwriting unreadable sectors with random data.
  13. All software recovery methods carry a certain level of risk when they are executed on an unstable drive. The drive could suffer a complete failure, which would be much more costly or impossible to recover from afterward. The customer should be notified of possible risks to their drive and data before any work is started.
