ECS 2.1 – Understanding data protection

Overview

Learn how ECS protects unstructured data against node, disk, and site failures through replication and erasure coding.

ECS ensures durability, reliability, and availability of objects by creating and distributing three copies of objects and their metadata across the set of nodes in the local site. After the three copies are successfully written, ECS erasure-codes the object copies to reduce storage overhead. It handles failure and recovery operations automatically with no additional backup software or devices required.

Storage service

The storage service layer handles data availability and protection against data corruption, hardware failures, and data center disasters.

The unstructured storage engine (USE) is part of the storage services layer. It is a distributed shared service that runs on each node, and it manages transactions and persists data to nodes. The USE enables global namespace management across geographically dispersed data centers through geo-replication.

The USE writes all object-related data (such as user data, metadata, and object location data) to logical containers of contiguous disk space known as chunks. Chunks are either open and accepting writes, or closed and not accepting writes. After chunks are closed, the storage engine erasure-codes them. The storage engine writes to chunks in an append-only pattern so that existing data is never overwritten or modified. This strategy improves performance because locking and cache validation are not required for I/O operations. All nodes can process write requests for the same object simultaneously while writing to different chunks.

The storage engine tracks object location through an index that records the object name, chunk ID, and offset. Chunk location is tracked separately through an index that records the chunk ID and a set of disk locations. The chunk location index contains three disk location pointers before erasure coding and multiple location pointers after erasure coding. The storage engine performs all storage operations (such as erasure coding and object recovery) on chunks.
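
As a rough mental model, the two indexes can be pictured as simple mappings. The following Python sketch is purely illustrative; the names and structures are hypothetical stand-ins, not ECS internals:

```python
# Hypothetical stand-ins for the two indexes; ECS's internal structures
# are not public, so these names and shapes are illustrative only.

# Object location index: object name -> (chunk ID, offset within the chunk).
object_index = {
    "photos/cat.jpg": ("chunk-0042", 1_048_576),
}

# Chunk location index: chunk ID -> disk locations holding that chunk.
# Three locations before erasure coding (triple mirroring) ...
chunk_index = {
    "chunk-0042": ["node1:disk3", "node4:disk1", "node7:disk2"],
}

# ... and multiple fragment locations after erasure coding.
chunk_index["chunk-0042"] = [f"node{n % 8 + 1}:fragment{n}" for n in range(16)]

def locate(name):
    """Resolve an object name to the disk locations that hold its chunk."""
    chunk_id, offset = object_index[name]
    return chunk_id, offset, chunk_index[chunk_id]

print(locate("photos/cat.jpg"))
```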

Object creates

Object creates: one VDC

The following figure shows how the storage engine writes object data when there is a single VDC. In this example, there is a single appliance deployed at the site, but the same principles apply when more appliances are deployed. The eight nodes are in a single storage pool within a single replication group.

Single site: object creates

  1. An application creates an object in a bucket.
  2. The storage engine writes the object to one chunk. The chunk's disk locations are on three different disks/nodes, so the three writes proceed in parallel. The storage engine can write the object to any of the nodes that belong to the bucket's replication group. The VDC where the object is created is the object's owner.
  3. The storage engine records the disk locations of the chunk in the chunk location index, and the chunk id and offset in the object location index.
  4. The storage engine writes the object location index to one chunk. That chunk's disk locations are also on three different disks/nodes, so those writes also proceed in parallel. The index locations are chosen independently from the object chunk locations.
  5. After all of the disk locations are written successfully, the storage engine acknowledges the write to the application.

When object chunks are full, the storage engine erasure-codes them. It does not erasure-code the object location index chunks.
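
A minimal sketch of this create path, assuming in-memory dictionaries stand in for the nodes, disks, and indexes (all names here are hypothetical):

```python
import random

NODES = [f"node{i}" for i in range(1, 9)]   # the eight-node storage pool
disks = {n: {} for n in NODES}              # node -> {chunk ID: data}
object_index, chunk_index = {}, {}

def create_object(name, data):
    """Write three mirrored copies, record both indexes, then acknowledge."""
    chunk_id = f"chunk-{random.randrange(10**6):06d}"
    targets = random.sample(NODES, 3)       # three different nodes
    for node in targets:                    # in ECS these writes run in parallel
        disks[node][chunk_id] = data
    chunk_index[chunk_id] = targets         # chunk ID -> disk locations
    object_index[name] = (chunk_id, 0)      # object name -> chunk ID + offset
    return "ack"                            # acknowledged only after all copies land

print(create_object("logs/2024-01-01", b"example payload"))
```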

Object creates: federated VDCs (2 sites)

In a federated deployment of two VDCs, the storage engine writes object chunks to the local VDC and also to the remote VDC.

Two site: object creates

  1. An application creates an object in a bucket.
  2. The storage engine writes the object to one chunk at the site where it is ingested. The chunk's disk locations are on three different disks/nodes, so the three writes proceed in parallel. The storage engine can write the object to any of the nodes that belong to the bucket's replication group. The storage engine records the disk locations of the chunk in the chunk location index, and the chunk ID and offset in the object location index. The site where the object is originally ingested is the object's owner.
  3. After all of the disk locations are written successfully, the storage engine acknowledges the write to the application.
  4. The storage engine replicates the chunk to three nodes at the federated site. It records the chunk locations in the object location index (not shown in this diagram), which is likewise written to three different nodes at the federated site.

When the chunks are full, the storage engine erasure-codes the object chunks. It does not erasure-code the object location index chunks.

When two VDCs are in a replication group, both VDCs have a readable copy of the object.
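
A minimal sketch of the replication step, with two in-memory dictionaries standing in for the sites (the triple mirroring within each site is omitted for brevity):

```python
# Two in-memory stand-ins for the local and federated sites; within each
# site the chunk would itself be triple-mirrored as described above.
local_site = {"chunk-0042": b"object data"}
remote_site = {}

def replicate(chunk_id):
    """Copy a chunk to the federated site so both sites hold a readable copy."""
    remote_site[chunk_id] = local_site[chunk_id]

replicate("chunk-0042")
assert remote_site["chunk-0042"] == local_site["chunk-0042"]
```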

Object creates: federated VDCs (3 or more sites)

In a federated deployment of three or more VDCs, the storage engine writes object chunks to the local VDC and replicates them to nodes in another VDC within the replication group.

Three sites: object creates

  1. An application creates an object in a bucket.
  2. The storage engine writes the object to one chunk at the site where it is ingested. The chunk's disk locations are on three different disks/nodes, so the three writes proceed in parallel. The storage engine can write the object to any of the nodes that belong to the bucket's replication group. It records the disk locations of the chunk in the chunk location index, and the chunk ID and offset in the object location index (not shown in this diagram). The VDC where the write is received is the object's owner, and it contains a readable copy of the object.
  3. After all of the disk locations are written successfully, the storage engine acknowledges the write to the application.
  4. The storage engine replicates the chunks to nodes in another VDC within the replication group. To improve storage efficiency, it XORs the chunks with chunks from other objects stored on the node.

When the chunks are full, the storage engine erasure-codes the XORed chunks. When possible, it writes XOR chunks directly in erasure-coded format without going through the replication phase. It does not erasure-code the object location index chunks.
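
The space saving from XOR can be seen in a small sketch: two chunks replicated from different sites are combined into one XOR chunk, and either original can be recovered from the XOR chunk plus the other original. This is a simplified two-chunk model of the multi-site scheme:

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

chunk_from_site_a = b"chunk replicated from site A"
chunk_from_site_b = b"chunk replicated from site B"

# Store one XOR chunk instead of two full replicas.
xor_chunk = xor(chunk_from_site_a, chunk_from_site_b)

# If site A fails, its chunk is recovered from the XOR chunk and site B's chunk.
recovered = xor(xor_chunk, chunk_from_site_b)
assert recovered == chunk_from_site_a
```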

Object updates

When an application fully updates an object, the storage engine writes a new object (following the principles described earlier). The storage engine then updates the object location index to point to the new location. Because the old location is no longer referenced by an index, the original object is available for garbage collection.
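
A minimal sketch of the update-then-collect cycle, assuming in-memory stand-ins for the chunks and the object location index:

```python
# Append-only update: write new data, repoint the index, and let the
# now-unreferenced chunk become garbage-collectible.
chunks = {"chunk-001": b"version 1"}
object_index = {"report.doc": ("chunk-001", 0)}

def update_object(name, data):
    new_id = f"chunk-{len(chunks) + 1:03d}"
    chunks[new_id] = data                      # append-only: never overwrite
    object_index[name] = (new_id, 0)           # repoint the object index

def garbage_collect():
    live = {cid for cid, _ in object_index.values()}
    for cid in list(chunks):
        if cid not in live:                    # unreferenced -> reclaimable
            del chunks[cid]

update_object("report.doc", b"version 2")
garbage_collect()
assert list(chunks) == ["chunk-002"]
```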

Object reads

Object reads: single VDC

In a single-site deployment, when a client submits a read request, the storage engine uses the object location index to find the chunks that store the object, retrieves the chunks or erasure-coded fragments, reconstructs the object, and returns it to the client.
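
A minimal sketch of this read path, reusing the in-memory stand-ins from the earlier sketches (reconstruction from erasure-coded fragments is omitted):

```python
# In-memory stand-ins for the indexes and disks described above.
object_index = {"img.png": ("chunk-7", 0)}
chunk_index = {"chunk-7": ["node2:disk1", "node5:disk4", "node8:disk2"]}
disks = {loc: b"...png bytes..." for loc in chunk_index["chunk-7"]}

def read_object(name):
    chunk_id, offset = object_index[name]      # 1. look up the object
    location = chunk_index[chunk_id][0]        # 2. pick a chunk location
    return disks[location][offset:]            # 3. return the data

print(read_object("img.png"))
```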

Object reads: federated VDCs (2 sites)

In a two-site federation, object chunks exist at both sites, so the storage engine reads the object chunk or erasure-coded fragments from the nodes of the VDC where the application is connected.

Object reads: federated VDCs (3 or more sites)

If the requesting application is connected to the VDC that owns the object, the storage engine reads the object chunk or erasure-coded fragments from the nodes of that VDC. If the requesting application is not connected to the owning VDC, the storage engine retrieves the object chunk or erasure-coded fragments from the owning VDC, copies them to the VDC the application is connected to, and returns the object to the application. The storage engine keeps a copy of the object in its cache in case the object is requested again. On a subsequent request, the storage engine compares the timestamp of the cached object with the timestamp of the object at the owning VDC. If they match, it returns the cached object to the application; if the timestamps differ, it retrieves and caches the object again.
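
A minimal sketch of the non-owner read path with timestamp-validated caching; the field names and the cheap timestamp check are assumptions, since ECS's cache internals are not public:

```python
# In-memory stand-ins for the owning VDC and the local cache.
owner_site = {"obj": {"data": b"v1", "ts": 100}}
cache = {}

def read_from_non_owner(name):
    remote_ts = owner_site[name]["ts"]          # compare timestamps first
    cached = cache.get(name)
    if cached and cached["ts"] == remote_ts:    # match: serve the cached copy
        return cached["data"]
    cache[name] = dict(owner_site[name])        # differ or absent: fetch, re-cache
    return cache[name]["data"]

assert read_from_non_owner("obj") == b"v1"      # first read: fetch and cache
owner_site["obj"] = {"data": b"v2", "ts": 200}  # object updated at the owner
assert read_from_non_owner("obj") == b"v2"      # stale cache entry is refreshed
```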

Erasure coding

ECS uses erasure coding to provide better storage efficiency without compromising data protection.

The storage engine implements the Reed-Solomon 12/4 erasure coding scheme, in which an object is broken into 12 data fragments and 4 coding fragments. The resulting 16 fragments are dispersed across the nodes at the local site. The storage engine can reconstruct an object from any 12 of the 16 fragments.

ECS requires a minimum of four nodes running the object service in a single site. The number of failures ECS can tolerate depends on the number of nodes in the site.

When an object is erasure-coded, the original chunk data is present as a single copy that consists of 16 fragments dispersed across the cluster. ECS can read erasure-coded objects directly, without any decoding or reconstruction; it uses the coding fragments only to reconstruct data after a hardware failure.
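
The storage-efficiency claim follows from simple arithmetic on the fragment counts:

```python
# Worked numbers for the Reed-Solomon 12 + 4 scheme versus triple mirroring.
data_fragments, coding_fragments = 12, 4
total_fragments = data_fragments + coding_fragments     # 16

ec_overhead = total_fragments / data_fragments          # 16/12 ~ 1.33x raw storage
mirror_overhead = 3.0                                   # three full copies = 3x

# Any 12 of the 16 fragments suffice, so up to 4 lost fragments are tolerated.
print(f"erasure coding: {ec_overhead:.2f}x, mirroring: {mirror_overhead:.1f}x")
```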

Recovery on disk and node failures

ECS continuously monitors the health of the nodes, their disks, and the objects stored in the cluster. Because ECS disperses data protection responsibilities across the cluster, it can automatically re-protect at-risk objects when nodes or disks fail.

Disk health

ECS reports disk health as Good, Suspect, or Bad.

ECS writes only to disks in good health; it does not write to disks in suspect or bad health. ECS reads from good disks and from suspect disks. When two of a chunk's copies are located on suspect disks, ECS writes two new copies of the chunk to other nodes.

Node health

ECS reports node health as Good, Suspect, Degraded, or Bad.

ECS writes only to nodes in good health; it does not write to nodes in suspect, degraded, or bad health. ECS reads from good and suspect nodes. When two of a chunk's copies are located on suspect nodes, ECS writes two new copies of the chunk to other nodes. When a node is reported as suspect or bad, all of the disks it manages are also considered suspect or bad.
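
The health-gated placement rules reduce to a simple filter, sketched here with hypothetical node names:

```python
# Minimal sketch of health-gated placement: writes go only to Good
# nodes/disks, while reads are allowed from Good and Suspect ones.
WRITABLE = {"Good"}
READABLE = {"Good", "Suspect"}

health = {"node1": "Good", "node2": "Suspect", "node3": "Bad", "node4": "Good"}

write_targets = [n for n, h in health.items() if h in WRITABLE]
read_sources = [n for n, h in health.items() if h in READABLE]

print(write_targets)   # ['node1', 'node4']
print(read_sources)    # ['node1', 'node2', 'node4']
```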

Data recovery

When a node or drive in the site fails, the storage engine:

  1. Identifies the chunks or erasure-coded (EC) fragments affected by the failure.
  2. Writes copies of the affected chunks or EC fragments to good nodes and disks that do not currently have copies of them.
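
A minimal sketch of this re-protection loop for mirrored chunks, with in-memory stand-ins for the chunk location index and node health (erasure-coded fragments would be handled analogously):

```python
# For each chunk touched by the failure, copy it from a surviving location
# to a healthy node that does not already hold a copy.
chunk_index = {"chunk-9": ["node1", "node3", "node5"]}
healthy = {"node2", "node3", "node4", "node5", "node6"}   # node1 has failed

def reprotect(chunk_id):
    locations = chunk_index[chunk_id]
    survivors = [n for n in locations if n in healthy]
    missing = len(locations) - len(survivors)
    spares = sorted(healthy - set(survivors))[:missing]   # nodes without a copy
    chunk_index[chunk_id] = survivors + spares            # copy the data there

reprotect("chunk-9")
assert len(chunk_index["chunk-9"]) == 3                   # fully protected again
```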

Site failover and recovery

ECS provides protection against a site failure due to a disaster or other problem that causes a site to go offline or to be disconnected from the other sites in a geo-federated deployment.

Temporary site failure

Temporary site failures occur when network connectivity is interrupted between federated VDCs or when a VDC goes down temporarily. When a VDC goes down, the Replication Group page displays the status Temporarily unavailable for the VDC that is unreachable.

When buckets are configured with the Access During Outage property set to On, applications can read objects while connected to any site. An application connected to a site that is not the bucket's owner must explicitly access the bucket to write to it or to view its contents. If an application modifies an object or bucket while connected to a VDC that is not the owner, the storage engine transfers ownership to the site where the change is initiated.

Regardless of the Access During Outage setting, some operations cannot be completed at any site in the geo-federation until the temporary failure is resolved.

After the sites are reconnected, the storage engine starts a resync operation in the background. Use the Monitor > Recovery Status page in the portal to monitor the progress of the resync operation.

Permanent site failover

If a disaster occurs at a site and the VDC cannot be brought back online, you must permanently fail over the site by deleting the VDC from the federation.
