ECS 2.0 – Replace an ECS storage disk in third-party hardware

Table of Contents

Drive replacement planning
Output for the cs_hal list disks command
Replace drives

Drive replacement planning

Describes an efficient strategy for periodic grooming of an ECS Appliance to replace FAILED and SUSPECT storage drives.

Note:
To replace the node's main (non-storage) drives, contact EMC Customer Support for assistance.

This document makes a distinction between the terms "drive" and "disk." A drive is a physical spindle. A disk is a logical entity (that is, a partition on a drive).

When the system labels a drive as FAILED, the data protection logic rebuilds the data from that drive on other drives in the system. The FAILED drive no longer participates in the system in any way. Replacing a drive does not involve restoring any data to the replacement drive. Therefore, a FAILED drive represents only a loss of raw capacity to the system. This characteristic of the built-in data protection lessens the need for an immediate service action when a drive fails.

Drives that are SUSPECT because of physical errors, as opposed to connectivity issues, should also be replaced. When a drive is labeled SUSPECT, the system no longer writes new data to it, but it continues to read from it. A SUSPECT drive with physical errors is eventually labeled FAILED when the errors exceed a certain threshold. While the drive remains SUSPECT, it does not participate fully in the storage system. Therefore, the disk on the corresponding drive should be manually set to Bad using the supplied cs_hwmgr command line tool. This step gives the system the opportunity to finish pending network interactions with the drive. Data on the drive is reconstructed elsewhere in the system using copies of the data from other drives. You can therefore begin the replacement process as soon as the system indicates that the disk on the corresponding drive was successfully set to Bad.
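
Where it helps to confirm that a SUSPECT status is caused by physical errors, the underlying SMART attribute can be read directly from the drive. The following is an illustrative sketch only, not part of the supported procedure: it assumes the smartmontools package (smartctl) is available on the node and uses the Block Device name that cs_hal list disks reports for the SUSPECT drive (/dev/sdh in the sample output in the following section).

# smartctl -A /dev/sdh | grep -i reallocated

A non-zero raw value for the Reallocated_Sector_Ct attribute (attribute 5) should correspond to the Reallocated_Sector_Count data that cs_hal reports for SUSPECT drives.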

Note:
The ViPR UI lists a node as DEGRADED when the node has storage disks with the Suspect or Bad status.

An efficient way to handle FAILED drives is to replace them periodically. Calculate the replacement period from the manufacturer's mean failure rate of the drives so that there is no danger that a critical number of drives can fail before the next planned service date. This process is called "grooming." Grooming involves periodically identifying FAILED and SUSPECT drives, setting the corresponding disks to Bad where necessary, and replacing the drives using the procedure in this document.
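
As a worked illustration of sizing the grooming interval (the drive count and failure rate below are assumptions for the example, not values from this document), the expected number of failures per interval can be estimated with simple arithmetic. For 120 storage drives, an annualized failure rate of 2%, and a quarterly (3-month) interval:

# echo "120 * 0.02 * (3 / 12)" | bc -l

The result, roughly 0.6 expected drive failures per quarter, should be compared against the number of failures the system can tolerate; choose an interval that keeps the expected count comfortably below that number before the next planned service date.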


Output for the cs_hal list disks command

Describes the types of output rows from the cs_hal list disks command.

In the abbreviated cs_hal list disks output below, notice the different types of rows:

[root@layton-cyan ~]# cs_hal list disks
Disks(s):
SCSI Device Block Device Enclosure   Partition Name                      Slot Serial Number       SMART Status   DiskSet
----------- ------------ ----------- ----------------------------------- ---- ------------------- -------------- ------------
n/a         /dev/md0     RAID vol    n/a                                 n/a  not supported       n/a
/dev/sg4    /dev/sdb     internal                                        0    KLH6DHXJ            GOOD
/dev/sg5    /dev/sdc     internal                                        1    KLH6DM1J            GOOD
/dev/sg8    /dev/sdf     /dev/sg0    Object:Formatted:Good               A08  WCAW32601327        GOOD
/dev/sg9    /dev/sdg     /dev/sg0    Object:Formatted:Bad                A09  WCAW32568324        FAILED: self-test fail; read element;
/dev/sg10   /dev/sdh     /dev/sg0    Object:Formatted:Suspect            A10  WCAW32547329        SUSPECT: Reallocated_Sector_Count(5)=11
...
unavailable              /dev/sg0                                        E05    

   internal: 2
   external: 30

total disks: 32

This abbreviated output contains four types of rows: the node's internal RAID volume (/dev/md0), the node's two internal drives, the external storage drives in the /dev/sg0 enclosure (with their partition name, slot, serial number, and SMART health status), and unavailable rows for empty enclosure slots.
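
When grooming a rack, a quick way to list only the drives that need attention is to filter the cs_hal output for non-GOOD health states. This is a convenience sketch using standard grep; the FAILED and SUSPECT rows shown are the same ones as in the sample output above.

[root@layton-cyan ~]# cs_hal list disks | grep -E 'FAILED|SUSPECT'
/dev/sg9    /dev/sdg     /dev/sg0    Object:Formatted:Bad                A09  WCAW32568324        FAILED: self-test fail; read element;
/dev/sg10   /dev/sdh     /dev/sg0    Object:Formatted:Suspect            A10  WCAW32547329        SUSPECT: Reallocated_Sector_Count(5)=11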


Replace drives

Replace FAILED and SUSPECT storage drives using commands on the node.

Note:
Do not place the node in maintenance mode to replace drives.

Procedure

  1. To access the ECS rack (cabinet) using the private (192.168.219.xxx) network from a laptop:
    1. Locate the 1 GbE private switch network ports.
    2. On the 1 GbE (turtle) switch, attach a network cable from your laptop to port 24 on the switch.
    3. Set the network interface on the laptop to the static address 192.168.219.99, subnet mask 255.255.255.0, with no gateway required.
    4. Verify that the temporary network between the laptop and rack's private management network is functioning by using the ping command.
      C:\>ping 192.168.219.1 
      Pinging 192.168.219.1 with 32 bytes of data:
      Reply from 192.168.219.1: bytes=32 time<1ms TTL=64
      Reply from 192.168.219.1: bytes=32 time<1ms TTL=64
      Reply from 192.168.219.1: bytes=32 time<1ms TTL=64
      Reply from 192.168.219.1: bytes=32 time<1ms TTL=64
      
      Ping statistics for 192.168.219.1:
         Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
      Approximate round trip times in milli-seconds:
         Minimum = 0ms, Maximum = 0ms, Average = 0ms
      
      Note:
      If 192.168.219.1 does not answer, try 192.168.218.2. If neither responds, verify the laptop IP/subnet mask, network connection, and switch port connection.

    5. From the laptop, SSH into Node 1 (provo) via 192.168.219.1, using the root login credentials (with default or customer-modified password).
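      For reference, the laptop-side commands for setting the static address and opening the SSH session might look like the following. This is an illustrative sketch: the Windows interface name ("Ethernet") is an assumption that depends on the laptop, and any SSH client can be used in place of the OpenSSH client shown.
      C:\>netsh interface ip set address "Ethernet" static 192.168.219.99 255.255.255.0
      C:\>ssh root@192.168.219.1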
  2. Use the cs_hal list disks command and observe the drive health status of each drive in the SMART Status column:
    # cs_hal list disks
    Disks(s):
    SCSI Device Block Device Enclosure   Partition Name                      Slot Serial Number       SMART Status   DiskSet
    ----------- ------------ ----------- ----------------------------------- ---- ------------------- -------------- ------------
    n/a         /dev/md0     RAID vol    n/a                                 n/a  not supported       n/a
    /dev/sg4    /dev/sdb     internal                                        0    KLH6DHXJ            GOOD
    /dev/sg5    /dev/sdc     internal                                        1    KLH6DM1J            GOOD
    /dev/sg8    /dev/sdf     /dev/sg0    Object:Formatted:Good               A08  WCAW32601327        GOOD
    /dev/sg9    /dev/sdg     /dev/sg0    Object:Formatted:Good               A09  WCAW32568324        GOOD
    /dev/sg10   /dev/sdh     /dev/sg0    Object:Formatted:Good               A10  WCAW32547329        GOOD
    /dev/sg11   /dev/sdi     /dev/sg0    Object:Formatted:Good               B08  WCAW32543442        GOOD
    /dev/sg12   /dev/sdj     /dev/sg0    Object:Formatted:Good               B07  WCAW32499200        GOOD
    /dev/sg13   /dev/sdk     /dev/sg0    Object:Formatted:Good               B06  WCAW32540070        GOOD
    /dev/sg14   /dev/sdl     /dev/sg0    Object:Formatted:Good               A11  WCAW32493928        GOOD
    /dev/sg15   /dev/sdm     /dev/sg0    Object:Formatted:Good               B10  WCAW32612977        GOOD
    /dev/sg16   /dev/sdn     /dev/sg0    Object:Formatted:Suspect            B11  WCAW32599710        SUSPECT: Reallocated_Sector_Count(5)=11
    /dev/sg17   /dev/sdo     /dev/sg0    Object:Formatted:Good               C11  WCAW32566491        GOOD
    /dev/sg18   /dev/sdp     /dev/sg0    Object:Formatted:Good               D11  WCAW32528145        GOOD
    /dev/sg19   /dev/sdq     /dev/sg0    Object:Formatted:Good               C10  WCAW32509214        GOOD
    /dev/sg20   /dev/sdr     /dev/sg0    Object:Formatted:Good               D10  WCAW32499242        GOOD
    /dev/sg21   /dev/sds     /dev/sg0    Object:Formatted:Good               C09  WCAW32613242        GOOD
    /dev/sg22   /dev/sdt     /dev/sg0    Object:Formatted:Good               D09  WCAW32503593        GOOD
    /dev/sg23   /dev/sdu     /dev/sg0    Object:Formatted:Good               E11  WCAW32607576        GOOD
    /dev/sg24   /dev/sdv     /dev/sg0    Object:Formatted:Good               E10  WCAW32507898        GOOD
    /dev/sg25   /dev/sdw     /dev/sg0    Object:Formatted:Good               E09  WCAW32613032        GOOD
    /dev/sg26   /dev/sdx     /dev/sg0    Object:Formatted:Good               C08  WCAW32613016        GOOD
    /dev/sg27   /dev/sdy     /dev/sg0    Object:Formatted:Good               D08  WCAW32543718        GOOD
    /dev/sg28   /dev/sdz     /dev/sg0    Object:Formatted:Good               C07  WCAW32599747        GOOD
    /dev/sg29   /dev/sdaa    /dev/sg0    Object:Formatted:Good               E06  WCAW32612688        GOOD
    /dev/sg30   /dev/sdab    /dev/sg0    Object:Formatted:Good               E08  WCAW32567229        GOOD
    /dev/sg31   /dev/sdac    /dev/sg0    Object:Formatted:Good               D06  WCAW32609899        GOOD
    /dev/sg32   /dev/sdad    /dev/sg0    Object:Formatted:Good               C06  WCAW32546152        GOOD
    /dev/sg33   /dev/sdae    /dev/sg0    Object:Formatted:Good               D07  WCAW32613312        GOOD
    /dev/sg34   /dev/sdaf    /dev/sg0    Object:Formatted:Good               E07  WCAW32507641        GOOD
    /dev/sg3    /dev/sda     /dev/sg0    Object:Formatted:Good               A06  WCAW32547444        GOOD
    /dev/sg6    /dev/sdd     /dev/sg0    Object:Formatted:Good               A07  WCAW32525128        GOOD
    /dev/sg7    /dev/sde     /dev/sg0    Object:Formatted:Good               B09  WCAW32496935        GOOD
    …
    …
    unavailable              /dev/sg0                                        E02
    unavailable              /dev/sg0                                        E03
    unavailable              /dev/sg0                                        E04
    unavailable              /dev/sg0                                        E05
    
       internal: 2
       external: 30
    
    total disks: 32
    Here the report shows the 2 internal drives of the node, 30 storage drives (one of them in a SUSPECT health state), and 30 empty slots labeled as unavailable.
    Note:
    If the output of the above command indicates that the Partition Name of a drive is "Object:Formatted:Bad" or "Object:Formatted:Suspect" and its SMART status is "GOOD", stop here. This condition is the result of a bug in the hardware manager that causes it to catch an exception when determining the health of a drive. Engage ECS Customer Support for assistance and wait until they run a workaround procedure that fixes the issue. Otherwise, proceed with the next step.
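    A quick way to scan for this condition is to filter the cs_hal output for rows whose Partition Name reports Bad or Suspect while the SMART Status still reports GOOD. This is a convenience sketch built from the command and column values shown above; any rows it prints match the condition described in this note.
    # cs_hal list disks | grep -E 'Object:Formatted:(Bad|Suspect)' | grep 'GOOD'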

    Note:
    If you have a disk ID from the ECS Portal, use the following command with grep to get the serial number of the disk in the sixth column. In this example, the Serial Number is AR31021EG62L9C:
    # cs_hwmgr disk --list-by-service Object  | grep a1d7c15f-9995-4205-99b9-2267794d0154
    a1d7c15f-9995-4205-99b9-2267794d0154    ...     AR31021EG62L9C  ...
    

    Now use the cs_hwmgr drive --list command with grep and the Serial Number to get the slot ID in the last (sixth) column. In this example, the slot ID is A06:

    # cs_hwmgr drive --list | grep AR31021EG62L9C
    ATA       HGST HUS726060AL  AR31021EG62L9C      6001      Good         A06
    You now have the information needed to use this procedure.
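    If you prefer a single command line, the two lookups above can be chained with standard awk. This is a convenience sketch only: it reuses the example disk ID and assumes the column positions described in this note (Serial Number in the sixth column of the first listing, slot ID in the last column of the second), so verify them against your own output.

    # SN=$(cs_hwmgr disk --list-by-service Object | awk '/a1d7c15f-9995-4205-99b9-2267794d0154/ {print $6}')
    # cs_hwmgr drive --list | awk -v sn="$SN" '$0 ~ sn {print "Serial:", sn, "Slot:", $NF}'
    Serial: AR31021EG62L9C Slot: A06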

  3. Record the Serial Number of the FAILED or SUSPECT drive.
  4. If the target drive is SUSPECT:
    Determine if the drive is in the SUSPECT state because of hardware issues:
    • Continue the replace drive procedure if the SMART Status column in the cs_hal command output shows "Reallocated_Sector_Count" data. This data indicates hardware problems.
    • Stop the replace drive procedure if you do not see "Reallocated_Sector_Count" data. Drives can be in the SUSPECT state for non-hardware issues, such as connectivity problems.
    1. Use the cs_hwmgr drive --list command to get the VendorID and ProductID for the drive, which are found in the first two columns.
      # cs_hwmgr drive --list | grep -i WCAW32599710 
      ATA       HUS726060ALA640      WCAW32599710         6001      Suspect      B11       
      
      Note:
      If the cs_hwmgr drive --list | grep <disk SN#> command does not return any output for a "Suspect" drive, the drive is not being managed by the hardware manager. Engage ECS Customer Support for assistance. Refer to Knowledge Base entry 197235.

    2. Set the drive health status to "Bad" with the cs_hwmgr drive --set-health-bad command, using the <VendorID>, <ProductID>, and <Disk SN#> values.
      # cs_hwmgr drive --set-health-bad ATA HUS726060ALA640 WCAW32599710
      Setting drive health of ATA:HUS726060ALA640:WCAW32599710 to bad...Suceeded
    3. Use the cs_hwmgr drive --remove command to remove the target drive (or more specifically the disk associated with that drive) from the configuration. Note that this command does not remove the drive from the output of the cs_hwmgr drive --list command.
      # cs_hwmgr drive --remove ATA HUS726060ALA640 WCAW32599710
      Removing drive ATA:HUS726060ALA640:WCAW32599710...Suceeded
      Note:
      If the cs_hwmgr drive --remove command fails, use the cs_hwmgr drive --force-remove command with the <VendorID>, <ProductID>, and <Disk SN#> values.
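      Because the --set-health-bad and --remove commands both take the <VendorID>, <ProductID>, and <Disk SN#> values, it can be convenient to pull all three from the listing in one step. This is a convenience sketch only: it assumes the first three whitespace-separated fields are VendorID, ProductID, and Serial Number, as in the sample row above; product names that contain spaces shift the fields, so verify against your own output before relying on it.
      # cs_hwmgr drive --list | awk '/WCAW32599710/ {print "VendorID:", $1, "ProductID:", $2, "Serial:", $3}'
      VendorID: ATA ProductID: HUS726060ALA640 Serial: WCAW32599710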

  5. If the target drive is FAILED:
    • If the drive's Serial Number did not appear in the cs_hwmgr drive --list command output, skip this step.
    • If the drive health reported by the cs_hwmgr drive --list command is "Bad", continue with the substep below.
    1. Use the cs_hwmgr drive --remove command to remove the target drive (or more specifically the disk associated with that drive) from the configuration. Note that this command does not remove the drive from the output of the cs_hwmgr drive --list command.
      # cs_hwmgr drive --remove ATA HUS726060ALA640 WCAW32599710
      Removing drive ATA:HUS726060ALA640:WCAW32599710...Suceeded
      Note:
      If the cs_hwmgr drive --remove command fails, use the cs_hwmgr drive --force-remove command with the <VendorID>, <ProductID>, and <Disk SN#> values.

  6. Use the cs_hal led command and the serial number to blink the LED for the drive (to find the proper drive in the drawer).
    # cs_hal led WCAW32599710 blink
  7. Following the instructions from your hardware manufacturer, locate the drive and replace it in the same slot.
  8. Use the cs_hal list disks command to confirm that the system recognizes the new drive. The new drive has the same Slot ID as the replaced drive. Take note of the new Serial Number. The system automatically integrates the new disk into use by the appropriate services. (A convenience filter for this check is sketched after this procedure.)
  9. Use the cs_hal led command and the new Serial Number to turn off the LED for the drive.
    # cs_hal led <serial_number_of_new_drive> off
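
The check in step 8 can be narrowed with a simple filter: because the replacement drive occupies the same slot as the drive it replaces, searching the cs_hal output for the slot ID (B11 in the running example) shows the new drive and its new Serial Number. This is a convenience sketch using standard grep.

# cs_hal list disks | grep B11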

Results

The node and system automatically detect the drive and initialize it for use.
