No more keeping fingers crossed
Drive failures can go undetected and may occur anytime, and they will either lead to volume degradation or volume crashes. It’s less of a problem if a hard drive failure results in a degraded volume since you only have to rebuild your RAID array by finding out the damaged drive and replacing it with a new one. However, it’s a bigger threat when it comes to volume crashes. If you don’t have a backup plan or a DR solution in place, it’s very likely that you’re going to experience catastrophic data loss.
So, is there anything we can do to forestall drive failure?
Yes, and there are two precautionary measures that we can take to minimize the possibility of data loss caused by drive failure: running regular S.M.A.R.T. tests and setting up event-triggered notifications.
First, perform a S.M.A.R.T. test on a regular basis to keep tabs on your drive health status and take immediate action when necessary. S.M.A.R.T. is the acronym for Self-Monitoring, Analysis and Reporting Technology, which is a monitoring system used to gauge drive reliability and provide information on the current status of the drives. S.M.A.R.T. attributes are examined by utilizing several parameters to see if the drive is starting to develop problems. The results can serve as an indicator of the remaining lifespan of a drive.
Pay extra attention to the following three S.M.A.R.T. attributes1 that are related to bad sectors: Reallocated Sector Count (ID 5), Reallocated Event Count (ID 196), and Current Pending Sector Count (ID 197). A bad sector is a cluster of unreadable data caused by wear and tear, over-heating, collision, file system error, etc. Upon detecting an impaired sector, it will be redirected to a reserved space – a spare sector. This reallocation process is called “remapping.” Note, though, that increasing remapping operations will slow down drive access and may spell the end of your drive.
It is ideal to have a low value of the above attributes, as these values can be used as a benchmark to detect looming drive failures. Both Google’s and our statistics show that these attributes are highly correlated to a higher chance of drive failure. Drives that have developed bad sectors are 10X more likely to result in failed drive access than those which don’t have any bad sectors.
An extra layer of prevention
Other than running S.M.A.R.T. tests regularly, the other thing you can do on your Synology NAS is configuring notification event settings in “Internal Storage” under the Advanced tab in Control Panel. Select the seven events2 in particular and take necessary action upon receiving a notification message triggered by them.
Let’s get started with three common error terms: ICRC, IDNF, and UNC errors. An ICRC error is a communication problem occurring when data is transferred between the host and the hard drive, while an IDNF error occurs when the drive is unable to read data that is located at a corrupt sector. A UNC error implies that the data the hard drive attempts to read is damaged and cannot be corrected using ECC (Error correction code). The following are events3 related to these errors:
1. Drive reconnection (ICRC error) alert
2. Drive re-identification (IDNF error) alert
3. Drive reconnection alert when Synology NAS boots up
4. Alert of drive with read abnormality (UNC error)
When you receive a notification regarding any of these errors above, it could be an early warning sign of a failing drive. If the issue continues, it may suggest that the drive is not working properly. We strongly recommend that you back up your data and replace the current drive. Other than the above-mentioned alerts, there are three other events that you should pay attention to as well.
5. Bad sectors on drive increased
6. Drive I/O error
7. SSD lifespan warning4
Since accumulated bad sectors will gradually lead to data loss in the long term, you’ll receive a warning when the detected bad sectors are increasing. Bad sectors may also lead to drive I/O errors. However, your drive may be still working properly after several retries. If this error keeps occurring, please back up your data and examine the hard drive status by conducting a S.M.A.R.T. test. By the way, you can refer to the Synology Products Compatibility List to check the expected lifespan of your SDD. Consider replacing your drive with a healthy one when you receive a warning, as it could be a sign of impending drive failure.
Better safe than sorry
In general, it’s only a matter of time before a drive fails, but we can take simple yet important precautions against drive failures before they ultimately lead to data loss. Take preemptive action upon receiving hard drive alerts, for ignoring these warning signs may cost you big when disaster strikes. You can take a more proactive approach by performing diagnostic S.M.A.R.T. tests on a regular basis to gain insights into the current status of your drive.
In addition to these preventive measures, we also need to prepare for the worst by regularly scheduling backup tasks in case of unexpected drive failures. Be well-prepared and you can minimize the possibility of data loss.
Tell us how you prevent drive failures on Synology Community.
1 The table below lists the three bad sector-related S.M.A.R.T. attributes.
2 In DSM versions before soon-to-be-released 6.2.2, the names of these events are disk with reconnection (ICRC error) alert, disk with re-identification (IDNF error) alert, disk with reconnection alert when booting up, disk with read abnormality (UNC error) alert, disk with bad sectors exceeding the limit, disk I/O error, and internal disk lifespan warning, respectively.
3 These four events are not included in the default notification settings. It’s recommended to tick the checkboxes and select notification mediums for them.
4 Drive lifespan warning is supported on SSDs only.