Replace an ECS Appliance disk
This article applies to EMC ViPR 2.0.
To replace the node's main (non-storage) drives, contact EMC Customer Support for assistance.
When the system labels a disk as BAD, the data protection logic rebuilds the data from that drive onto other drives in the system. The bad drive no longer participates in the system in any way, and replacing it does not involve restoring any data to the replacement disk. A bad drive therefore represents only a loss of raw capacity. This characteristic of the built-in data protection lessens the need for an immediate service action when a drive goes bad.
Drives that are suspect because of physical errors, as opposed to connectivity issues, should also be replaced. When a drive is labeled SUSPECT, the system no longer writes new data to the drive, but it continues to read from it. A suspect drive with physical errors is eventually labeled BAD when the physical errors exceed a certain threshold. While the drive remains suspect, it does not participate fully in the storage system. Therefore, suspect disks should be manually labeled as BAD using the supplied command line tool. This step gives the system the opportunity to finish pending network interactions with the drive. Data on the drive is reconstructed elsewhere in the system using copies of the data from other drives, so you can begin the replacement process as soon as the system indicates that the drive was successfully relabeled as BAD.
The ViPR UI lists a node as DEGRADED when the node has storage drives with the SUSPECT or BAD status.
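The status rules described above can be summarized in a short sketch. This is an illustrative model only, not ViPR code: the `DriveStatus` enum and the three helper functions are hypothetical names introduced here to restate the behavior described in the text.

```python
from enum import Enum


class DriveStatus(Enum):
    """Storage drive states named in this article (hypothetical model)."""
    GOOD = "GOOD"
    SUSPECT = "SUSPECT"
    BAD = "BAD"


def accepts_new_writes(status):
    # Per the text: a SUSPECT drive receives no new data, and a BAD drive
    # no longer participates in the system at all.
    return status is DriveStatus.GOOD


def is_readable(status):
    # Per the text: the system continues to read from a SUSPECT drive.
    return status in (DriveStatus.GOOD, DriveStatus.SUSPECT)


def node_is_degraded(drive_statuses):
    # Per the text: a node is listed as DEGRADED when any of its storage
    # drives is SUSPECT or BAD.
    return any(
        s in (DriveStatus.SUSPECT, DriveStatus.BAD) for s in drive_statuses
    )
```

For example, `node_is_degraded([DriveStatus.GOOD, DriveStatus.SUSPECT])` evaluates to `True`, while a node whose drives are all `GOOD` is not degraded.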
An efficient way to handle bad drives is to replace them periodically. The replacement period should be calculated using the manufacturer's mean failure rate of the drives such that there is no danger that a critical number of drives can fail by the next planned service date. This process is called "grooming." Grooming involves:
- Ordering a sufficient number of drives to cover the mean failure rate expected by the next service date.
- Identifying the bad drives.
- Identifying drives that are suspect because of physical errors and manually forcing them to be labeled as BAD.
- Replacing the bad drives.
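One way to size a grooming order is sketched below. It is a rough planning aid under stated assumptions, not a method prescribed by EMC: it interprets the manufacturer's mean failure rate as an annualized failure rate (AFR), assumes drive failures are independent and approximately Poisson distributed, and uses hypothetical function names and a hypothetical default confidence level.

```python
import math


def expected_failures(drive_count, annual_failure_rate, days_until_service):
    """Mean number of drive failures expected before the next service date."""
    return drive_count * annual_failure_rate * days_until_service / 365.0


def poisson_cdf(k, mean):
    """Probability of seeing at most k failures under a Poisson model."""
    return sum(
        math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k + 1)
    )


def spares_to_order(drive_count, annual_failure_rate, days_until_service,
                    confidence=0.99):
    """Smallest spare count that covers the failures with the given confidence.

    The Poisson model and the 0.99 default are assumptions made for this
    sketch; substitute the figures from your own drive data sheet.
    """
    mean = expected_failures(drive_count, annual_failure_rate,
                             days_until_service)
    spares = 0
    while poisson_cdf(spares, mean) < confidence:
        spares += 1
    return spares


if __name__ == "__main__":
    # Hypothetical inputs: 60 storage drives, 2% AFR, service every 90 days.
    print(spares_to_order(60, 0.02, 90))
```

Shortening the interval between service dates lowers both the expected number of failures and the number of spares needed per visit, which is the trade-off that the replacement period is meant to balance.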
In the abbreviated cs_hal list disks output below, notice the different types of rows:
- The first three rows represent the RAID structure of the node's two internal disk drives.
- The next row shows a good storage drive. Device names, slot numbers, and serial numbers can be used in cs_hal commands as long as they are unique. The enclosure name can also be used in cs_hal commands.
- The next row represents a bad storage drive. The Partition Name indicates this drive is assigned to the Object service, the drive is formatted, and the drive status is GOOD.
- The next row represents a suspect storage drive.
- The last row represents either an empty slot or an undetectable drive.
- The DiskSet column is reserved for future use.
Do not place the node in maintenance mode to replace disk drives.
- Start an SSH session with the node that owns the bad or suspect drives.
- Use the cs_hal list disks command:
Here the report shows the 2 internal drives of the node, 30 storage drives, 1 suspect drive, and 30 empty slots labeled as unavailable.
- Mark down the Slot ID and Serial Number of the bad or suspect drive.
- If the drive is labeled as SUSPECT and diagnosis indicates that it needs to be replaced, use the cs_hwmgr command to force its status to BAD. In the example above, the suspect drive has a reallocated sector count that indicates the drive has physical failures and is a candidate for replacement.
- Use the cs_hwmgr RemoveDrive command to remove (unallocate) the drive from the default diskset. Note that this command does not remove the drive from the output of the cs_hwmgr ListDrives command.
- Use the cs_hal led command and the Serial Number or Slot ID to turn on the LED for the drive (to find the proper drive in the drawer). The two commands below are equivalent. The first uses the Serial Number and the second uses the Slot ID.
- Replace the drive.
- Use the cs_hal list disks command to confirm that the system recognizes the new drive. The new drive has the same Slot ID as the replaced drive. Take note of the new Serial Number. The system automatically integrates the replaced drive into use by the appropriate services.
- Use the cs_hal led command and the new Serial Number or Slot ID to turn off the LED for the drive.
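The confirmation step in the procedure above (same Slot ID, new Serial Number) can be expressed as a small check. This is a minimal sketch that assumes you have recorded the Slot ID and Serial Number pairs by hand before and after the swap; the `confirm_replacement` function and the dictionary layout are hypothetical and do not parse cs_hal output.

```python
def confirm_replacement(before, after, slot_id):
    """Return the new serial number if the slot holds a different drive.

    `before` and `after` are hypothetical dictionaries that map a Slot ID
    to the Serial Number noted from the disk listing before and after the
    physical replacement.
    """
    old_serial = before.get(slot_id)
    new_serial = after.get(slot_id)

    if new_serial is None:
        # The slot is empty or the new drive was not detected.
        raise ValueError(f"No drive recorded in slot {slot_id} after replacement")
    if new_serial == old_serial:
        # Same serial number means the original drive is still in the slot.
        raise ValueError(f"Slot {slot_id} still reports serial {old_serial}")
    return new_serial


if __name__ == "__main__":
    # Hypothetical slot and serial values, for illustration only.
    noted_before = {"A06": "OLD123"}
    noted_after = {"A06": "NEW456"}
    print(confirm_replacement(noted_before, noted_after, "A06"))
```

The returned serial number is the one to use when turning off the drive LED in the final step.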