Replace an ECS Appliance disk


Disk drive replacement planning

Describes an efficient strategy for periodic grooming of an ECS Appliance or Commodity system to replace bad and suspect storage disk drives.

This article applies to EMC ViPR 2.0.

Note: To replace the node's main (non-storage) drives, contact EMC Customer Support for assistance.

When the system labels a disk as BAD, the data protection logic rebuilds that drive's data on other drives in the system. The bad drive no longer participates in the system in any way. Replacing a drive does not involve restoring any data to the replacement disk. Therefore, a bad drive represents only a loss of raw capacity to the system. This characteristic of the built-in data protection lessens the need for an immediate service action when a drive goes bad.

Drives that are suspect because of physical errors, as opposed to connectivity issues, should also be replaced. When a drive is labeled SUSPECT, the system no longer writes new data to it, but the system continues to read from it. A suspect drive with physical errors is eventually labeled BAD when those errors exceed a certain threshold. While it remains suspect, the drive does not participate fully in the storage system. Therefore, suspect disks should be manually labeled as BAD using the supplied command line tool. This step gives the system the opportunity to finish pending network interactions with the drive. Data on the drive is reconstructed elsewhere in the system using copies of the data from other drives, so you can begin the replacement process as soon as the system indicates that the drive was successfully relabeled as BAD.
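For example, a quick way to see whether any drives are currently flagged is to filter the cs_hal list disks output (shown later in this article) for SMART statuses other than GOOD. This is a plain grep over the command's output, not a dedicated option of the tool:

    # List storage drives whose SMART status is SUSPECT or FAILED
    cs_hal list disks | grep -E 'SUSPECT|FAILED'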

Note: The ViPR UI lists a node as DEGRADED when the node has storage drives with the SUSPECT or BAD status.

An efficient way to handle bad drives is to replace them periodically. Calculate the replacement period from the manufacturer's mean failure rate for the drives so that there is no danger of a critical number of drives failing before the next planned service date. This process is called "grooming." Grooming involves identifying suspect drives with physical errors, relabeling them as BAD, and replacing all BAD drives using the procedure in this article.
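As a rough, illustrative calculation (the drive count, annualized failure rate, and interval below are assumed values, not specifications for any particular appliance), you can estimate how many drives are likely to fail between grooming visits and confirm that the number stays well below a critical threshold:

    # Assumed values for illustration only
    DRIVES=30      # storage drives in the system
    AFR=0.02       # manufacturer's annualized failure rate (2%)
    DAYS=90        # planned grooming interval in days
    awk -v n="$DRIVES" -v afr="$AFR" -v d="$DAYS" \
        'BEGIN { printf "Expected drive failures per interval: %.2f\n", n * afr * (d / 365) }'

With these assumed numbers, roughly 0.15 drive failures are expected per 90-day interval, so a quarterly grooming visit leaves a comfortable margin.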


Output for the cs_hal list disks command

Describes the types of output rows from the cs_hal list disks command.

In the abbreviated cs_hal list disks output below, notice the different types of rows:

[root@layton-cyan ~]# cs_hal list disks
Disks(s):
SCSI Device Block Device Enclosure   Partition Name                      Slot Serial Number       SMART Status   DiskSet
----------- ------------ ----------- ----------------------------------- ---- ------------------- -------------- ------------
n/a         /dev/md0     RAID vol    n/a                                      n/a                 not supported  n/a
/dev/sg4    /dev/sdb     internal                                        0    KLH6DHXJ            GOOD
/dev/sg5    /dev/sdc     internal                                        1    KLH6DM1J            GOOD
/dev/sg8    /dev/sdf     /dev/sg0    Object:Formatted:Good               A08  WCAW32601327        GOOD
/dev/sg9    /dev/sdg     /dev/sg0    Object:Formatted:Bad                A09  WCAW32568324        FAILED: self-test fail; read element;
/dev/sg10   /dev/sdh     /dev/sg0    Object:Formatted:Suspect            A10  WCAW32547329        SUSPECT: Reallocated_Sector_Count(5)=11
...
unavailable              /dev/sg0                                        E05

internal: 2
external: 30

total disks: 32
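In this output, the /dev/md0 row is a RAID volume rather than a physical drive, which is why its SMART status is reported as not supported. The rows whose Enclosure column reads internal correspond to the node's main (non-storage) drives. The rows whose Enclosure column shows the enclosure device (/dev/sg0) are the storage drives; their Partition Name and SMART Status columns indicate whether each drive is Good, Suspect, or Bad. The unavailable rows are empty enclosure slots, and the counts at the bottom summarize the internal, external, and total disks.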


Replace disk drives

Replace bad and suspect storage disk drives using commands on the node.

Note Image
Do not place the node in maintenance mode to replace disk drives.

Procedure

  1. Start an SSH session with the node that owns the bad or suspect drives.
  2. Use the cs_hal list disks command:

    cs_hal list disks
    Disks(s):
    SCSI Device Block Device Enclosure   Partition Name                      Slot Serial Number       SMART Status   DiskSet
    ----------- ------------ ----------- ----------------------------------- ---- ------------------- -------------- ------------
    n/a         /dev/md0     RAID vol    n/a                                      n/a                 not supported  n/a
    /dev/sg4    /dev/sdb     internal                                        0    KLH6DHXJ            GOOD
    /dev/sg5    /dev/sdc     internal                                        1    KLH6DM1J            GOOD
    /dev/sg8    /dev/sdf     /dev/sg0    Object:Formatted:Good               A08  WCAW32601327        GOOD
    /dev/sg9    /dev/sdg     /dev/sg0    Object:Formatted:Good               A09  WCAW32568324        GOOD
    /dev/sg10   /dev/sdh     /dev/sg0    Object:Formatted:Good               A10  WCAW32547329        GOOD
    /dev/sg11   /dev/sdi     /dev/sg0    Object:Formatted:Good               B08  WCAW32543442        GOOD
    /dev/sg12   /dev/sdj     /dev/sg0    Object:Formatted:Good               B07  WCAW32499200        GOOD
    /dev/sg13   /dev/sdk     /dev/sg0    Object:Formatted:Good               B06  WCAW32540070        GOOD
    /dev/sg14   /dev/sdl     /dev/sg0    Object:Formatted:Good               A11  WCAW32493928        GOOD
    /dev/sg15   /dev/sdm     /dev/sg0    Object:Formatted:Good               B10  WCAW32612977        GOOD
    /dev/sg16   /dev/sdn     /dev/sg0    Object:Formatted:Suspect            B11  WCAW32599710        SUSPECT: Reallocated_Sector_Count(5)=11
    /dev/sg17   /dev/sdo     /dev/sg0    Object:Formatted:Good               C11  WCAW32566491        GOOD
    /dev/sg18   /dev/sdp     /dev/sg0    Object:Formatted:Good               D11  WCAW32528145        GOOD
    /dev/sg19   /dev/sdq     /dev/sg0    Object:Formatted:Good               C10  WCAW32509214        GOOD
    /dev/sg20   /dev/sdr     /dev/sg0    Object:Formatted:Good               D10  WCAW32499242        GOOD
    /dev/sg21   /dev/sds     /dev/sg0    Object:Formatted:Good               C09  WCAW32613242        GOOD
    /dev/sg22   /dev/sdt     /dev/sg0    Object:Formatted:Good               D09  WCAW32503593        GOOD
    /dev/sg23   /dev/sdu     /dev/sg0    Object:Formatted:Good               E11  WCAW32607576        GOOD
    /dev/sg24   /dev/sdv     /dev/sg0    Object:Formatted:Good               E10  WCAW32507898        GOOD
    /dev/sg25   /dev/sdw     /dev/sg0    Object:Formatted:Good               E09  WCAW32613032        GOOD
    /dev/sg26   /dev/sdx     /dev/sg0    Object:Formatted:Good               C08  WCAW32613016        GOOD
    /dev/sg27   /dev/sdy     /dev/sg0    Object:Formatted:Good               D08  WCAW32543718        GOOD
    /dev/sg28   /dev/sdz     /dev/sg0    Object:Formatted:Good               C07  WCAW32599747        GOOD
    /dev/sg29   /dev/sdaa    /dev/sg0    Object:Formatted:Good               E06  WCAW32612688        GOOD
    /dev/sg30   /dev/sdab    /dev/sg0    Object:Formatted:Good               E08  WCAW32567229        GOOD
    /dev/sg31   /dev/sdac    /dev/sg0    Object:Formatted:Good               D06  WCAW32609899        GOOD
    /dev/sg32   /dev/sdad    /dev/sg0    Object:Formatted:Good               C06  WCAW32546152        GOOD
    /dev/sg33   /dev/sdae    /dev/sg0    Object:Formatted:Good               D07  WCAW32613312        GOOD
    /dev/sg34   /dev/sdaf    /dev/sg0    Object:Formatted:Good               E07  WCAW32507641        GOOD
    /dev/sg3    /dev/sda     /dev/sg0    Object:Formatted:Good               A06  WCAW32547444        GOOD
    /dev/sg6    /dev/sdd     /dev/sg0    Object:Formatted:Good               A07  WCAW32525128        GOOD
    /dev/sg7    /dev/sde     /dev/sg0    Object:Formatted:Good               B09  WCAW32496935        GOOD
    …
    …
    unavailable              /dev/sg0                                        E02
    unavailable              /dev/sg0                                        E03
    unavailable              /dev/sg0                                        E04
    unavailable              /dev/sg0                                        E05

    internal: 2
    external: 30

    total disks: 32

    Here, the report shows the node's 2 internal drives, 30 external storage drives (1 of which is suspect), and 30 empty slots labeled as unavailable.
  3. Make a note of the Slot ID and Serial Number of the bad or suspect drive.
  4. If the drive is labeled as SUSPECT and diagnosis indicates that it needs to be replaced, use the cs_hwmgr command to force its status to BAD. In the example above, the suspect drive has a reallocated sector count that indicates the drive has physical failures and is a candidate for replacement.

    cs_hwmgr ListDrives | grep WCAW32599710
    ATA HUS726060ALA640    WCAW32599710    6001    Suspect    B11

    cs_hwmgr SetDriveHealthToBad ATA HUS726060ALA640 WCAW32599710

  5. Use the cs_hwmgr RemoveDrive command to remove (unallocate) the drive from the default diskset. Note that this command does not remove the drive from the output of the cs_hwmgr ListDrives command.

    cs_hwmgr RemoveDrive ATA HUS726060ALA640 WCAW32599710

  6. Use the cs_hal led command and the Serial Number or Slot ID to turn on the LED for the drive (to find the proper drive in the drawer). The two commands below are equivalent. The first uses the Serial Number and the second uses the Slot ID.

    cs_hal led WCAW32599710 on
    cs_hal led B11 on

  7. Replace the drive.
  8. Use the cs_hal list disks command to confirm that the system recognizes the new drive. The new drive has the same Slot ID as the replaced drive. Take note of the new Serial Number. The system automatically integrates the replaced drive into use by the appropriate services.
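    For example, assuming the replaced drive is in slot B11 as in the steps above, you can filter the output for that slot to check the new drive's Serial Number and status (a plain grep over the command's output):

    cs_hal list disks | grep B11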
  9. Use the cs_hal led command and the new Serial Number or Slot ID to turn off the LED for the drive.

    cs_hal led <serial_number_of_new_drive> off
    cs_hal led B11 off

Results

The node and system automatically detect the drive and initialize it for use.