ViPR 2.1 - What is ViPR HDFS?

What is ViPR HDFS?

ViPR HDFS is a Hadoop Compatible File System (HCFS) that enables you to run Hadoop 2.0 applications on top of your ViPR storage infrastructure.

You can configure your Hadoop distribution to run against the built-in Hadoop file system, against ViPR HDFS, or any combination of HDFS, ViPR HDFS, or other Hadoop Compatible File Systems available in your environment. The following figure illustrates how ViPR HDFS integrates with an existing Hadoop cluster.

ViPR HDFS integration in a Hadoop cluster

In a Hadoop environment configured to use ViPR HDFS, each ViPR HDFS data node (VM or commodity) functions as a traditional Hadoop NameNode, which means that every ViPR HDFS data node can accept and service HDFS requests.

When you set up the Hadoop client to use ViPR HDFS instead of traditional HDFS, the configuration points to ViPR HDFS to do all the HDFS activity. On each ViPR HDFS client node, any traditional Hadoop component uses the ViPR HDFS client (JAR) to perform the HDFS activity.

To integrate ViPR HDFS with an existing Hadoop environment, you must have the following:
  • A Hadoop cluster already installed and configured. See the EMC ViPR Support Matrix for the list of supported distributions.
  • ViPR installed and configured to support ViPR HDFS, which requires:
    • A data services license for HDFS or for Object + HDFS.
    • A data services virtual pool that supports HDFS or Object + HDFS.
    • One or more buckets that support HDFS or Object + HDFS.

Configuring Hadoop to use ViPR HDFS

Hadoop stores system configuration information in a file called core-site.xml. Editing core-site.xml is a required part of the ViPR HDFS configuration.

There are several types of properties to add or modify in core-site.xml including:
  • ViPR HDFS Java classes: This set of properties defines the ViPR HDFS implementation classes that are contained in the ViPR HDFS client JAR. They are required.
  • File system location properties: These properties define the file system URI (scheme and authority) to use when running Hadoop jobs, and the IP addresses of the ViPR data VMs for a specific ViPR file system.
  • Identity translation properties: These properties allow you to map anonymously owned objects to users, as well as specify user realms.
  • Kerberos realm and service principal properties: These properties are required only when you are running in a Hadoop environment where Kerberos is present. These properties map Hadoop and ViPR HDFS users.

core-site.xml resides on each node in the Hadoop cluster. You must add the same properties to each instance of core-site.xml.

ViPR HDFS URI for file system access

After you configure Hadoop to use the ViPR file system, you can access it by specifying the ViPR HDFS URI with viprfs:// as the scheme and a combination of ViPR bucket, tenant namespace, and user-defined installation name for the authority.

The ViPR HDFS URI looks like this:
viprfs://bucket_name.namespace.installation_name/path

The bucket_name corresponds to a bucket residing in an HDFS or Object + HDFS enabled virtual pool. It contains the data you want to analyze with Hadoop. The namespace corresponds to a tenant namespace, and the installation_name is a name you assign to a specific set of ViPR nodes or a load balancer. ViPR HDFS resolves the installation_name to a set of ViPR data VMs or to a load balancer by using the fs.vipr.installation.[installation_name].hosts property, which lists the IP addresses of the data service VMs or load balancer.

If the installation_name maps to a set of ViPR data VMs, you can have ViPR HDFS refresh the list of active nodes periodically. Set the fs.vipr.installation.[installation_name].resolution property to dynamic, and set the fs.vipr.installation.[installation_name].resolution.dynamic.time_to_live_ms property to the interval, in milliseconds, at which to query ViPR for the list of active nodes.
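
As a minimal sketch, the installation properties described above might appear in core-site.xml as follows. The installation name Site1, the three IP addresses, the comma-separated format of the hosts value, and the 900000 ms interval are illustrative assumptions rather than values from a real deployment; check the property reference for your ViPR release before relying on them.

```xml
<configuration>
  <!-- "Site1" is a hypothetical installation_name; substitute your own. -->
  <property>
    <name>fs.vipr.installation.Site1.hosts</name>
    <!-- Assumed format: comma-separated IP addresses of the ViPR data VMs
         (or a single load balancer address). Addresses are placeholders. -->
    <value>203.0.113.10,203.0.113.11,203.0.113.12</value>
  </property>
  <!-- Ask ViPR HDFS to refresh the list of active nodes periodically. -->
  <property>
    <name>fs.vipr.installation.Site1.resolution</name>
    <value>dynamic</value>
  </property>
  <!-- Refresh interval in milliseconds; 900000 (15 minutes) is an example value. -->
  <property>
    <name>fs.vipr.installation.Site1.resolution.dynamic.time_to_live_ms</name>
    <value>900000</value>
  </property>
</configuration>
```

With these entries in place, a URI whose authority ends in .Site1 would resolve to the listed hosts.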

You can specify the ViPR HDFS URI as the default file system in core-site.xml by setting it as the value of the fs.defaultFS property, but this is not a requirement. Whether to set this value to ViPR HDFS requires careful consideration as part of your overall Hadoop on ViPR HDFS integration planning. If you do not specify ViPR HDFS as the default file system, you must use the full URI, including the path, each time you access ViPR data. If you do specify it as the default and have existing applications that rely on a different default file system, you need to update those applications.
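
If you do decide to make ViPR HDFS the default file system, the entry might look like the sketch below. The bucket name mybucket, the namespace mynamespace, and the installation name Site1 are hypothetical placeholders chosen for illustration.

```xml
<!-- Sketch only: mybucket, mynamespace, and Site1 are placeholder names. -->
<property>
  <name>fs.defaultFS</name>
  <value>viprfs://mybucket.mynamespace.Site1/</value>
</property>
```

With this setting, a command such as hadoop fs -ls / would list the root of the ViPR bucket. Without it, the same listing would need the full URI, for example hadoop fs -ls viprfs://mybucket.mynamespace.Site1/.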

Hadoop authentication modes

ViPR HDFS integrates with Hadoop clusters configured to use either simple or Kerberos authentication modes. To handle each mode, you must set different properties in core-site.xml to ensure that users and services are able to access the data they need.

Hadoop applications access data stored in ViPR buckets, so the Hadoop users must have permissions to read the objects they are trying to read, and permissions to write to the buckets they are trying to write to. Hadoop services (such as mapred, hive, and hbase) must have permissions to write system files.

The following table lists the default ACLs applied to files and directories in both simple and Kerberos authentication mode.

Hadoop simple authentication mode

In a Hadoop cluster running in simple mode, the files, directories, and objects that users create have the empty string as their owner. This can cause problems for applications that do their own ACL checking.

To resolve this problem, set the fs.viprfs.auth.anonymous_translation property in core-site.xml to CURRENT_USER. This setting allows anonymously owned files to be displayed as if they were owned by the current Unix user. The following example shows the listing when fs.viprfs.auth.anonymous_translation is set to NONE:

# hadoop fs -ls /
14/01/30 11:36:23 INFO vipr.ViPRFileSystem: Initialized ViPRFS (atom 1.0.1.0-811) for viprfs://hdfs.s3.docsite
d------rwx   -          0 2014-01-30 10:42 /bar
d------rwx   -          0 2014-01-30 10:34 /fooDir
d------rwx   -          0 2014-01-30 06:15 /tmp
If you change the setting to CURRENT_USER, and the logged-in user is root, you see that root becomes the owner:
# hadoop fs -ls /
14/01/30 11:30:37 INFO vipr.ViPRFileSystem: Initialized ViPRFS (atom 1.0.1.0-811) for viprfs://hdfs.s3.docsite
Found 3 items
drwx------   - root          0 2014-01-30 10:42 /bar
drwx------   - root          0 2014-01-30 10:34 /fooDir
drwx------   - root          0 2014-01-30 06:15 /tmp
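
The CURRENT_USER setting used for the second listing could be expressed in core-site.xml roughly as follows. Both the property name and the value come from the text above; only the comment is added.

```xml
<!-- Display anonymously owned files as owned by the current Unix user. -->
<property>
  <name>fs.viprfs.auth.anonymous_translation</name>
  <value>CURRENT_USER</value>
</property>
```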

In anonymous mode, set up your ViPR buckets to allow access from Everyone to ensure that Hadoop processes can access ViPR buckets.

ViPR HDFS provides the fs.viprfs.auth.identity_translation property as a way to map users to a realm when Kerberos is not present. If you must chown a file, you can specify the realm to use. For example:
hdfs dfs -chown sally@MYREALM.COM /sallys/new/file
When fs.viprfs.auth.identity_translation is set to NONE, users must type the realm each time they chown a file. As a convenience, you can instead set it to FIXED_REALM and specify the actual realm to use in the fs.viprfs.auth.realm property.
<property>
  <name>fs.viprfs.auth.identity_translation</name>
  <value>FIXED_REALM</value>
</property>
<property>
  <name>fs.viprfs.auth.realm</name>
  <value>MYREALM.COM</value>
</property>

Using chown in anonymous mode is not recommended. When you chown files or directories, you change the owner from the empty string to an actual owner. Once the files or directories have an owner, anonymous users no longer have access to them.

Hadoop Kerberos authentication mode

When Kerberos and the ViPR Active Directory server are integrated, the Kerberos realm provides a single namespace of users so that the Hadoop users authenticated with kinit are recognized as credentialed ViPR users.

In a Hadoop cluster running in Kerberos mode, there must be a one-way cross-realm trust from the Kerberos realm to the Active Directory realm used to authenticate your ViPR users.

The following identity translation properties in core-site.xml are used to ensure the proper Hadoop-to-ViPR user translation:
  • fs.permissions.umask-mode: Set the value to 027.
  • fs.viprfs.auth.anonymous_translation: Set the value to CURRENT_USER.
  • fs.viprfs.auth.identity_translation: Set the value to CURRENT_USER_REALM so the realm of users is auto-detected; alternatively set to FIXED_REALM if you want to hard-code the user's realm by using the fs.viprfs.auth.realm property.
In addition, you must set the following properties in core-site.xml to define service principals and to map users to realms:
  • viprfs.security.principal
  • fs.vipr.auth.services.users
  • fs.vipr.auth.services.[user].principal
  • fs.vipr.auth.service.[user].keytab
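
As a sketch of the identity translation settings listed above for Kerberos mode, the three properties might be written as follows. The values come from the list; the service principal and keytab properties are left out because their value formats are not given here.

```xml
<!-- Identity translation settings for Kerberos mode, as listed above. -->
<property>
  <name>fs.permissions.umask-mode</name>
  <value>027</value>
</property>
<property>
  <name>fs.viprfs.auth.anonymous_translation</name>
  <value>CURRENT_USER</value>
</property>
<!-- CURRENT_USER_REALM auto-detects the realm of each user. To hard-code
     the realm, use FIXED_REALM together with fs.viprfs.auth.realm. -->
<property>
  <name>fs.viprfs.auth.identity_translation</name>
  <value>CURRENT_USER_REALM</value>
</property>
```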

PIG and ACLs

Because PIG performs its own ACL checking to determine if a user has proper permissions to an object, you must do one of the following when running PIG on top of ViPR HDFS.

  • Option 1: Set the following properties in core-site.xml:
    • fs.<scheme>.auth.identity_translation = CURRENT_USER_REALM (in Kerberos authentication mode) or FIXED_REALM (in simple authentication mode)
    • fs.<scheme>.auth.anonymous_translation = CURRENT_USER
  • Option 2: Set the following environment variable:
    • $VIPR_LOCAL_USER_MODE = "true"
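
For a cluster in simple authentication mode, Option 1 might look like the sketch below. It assumes that <scheme> is viprfs, which matches the property names used elsewhere in this document, and it reuses the example realm MYREALM.COM as a placeholder.

```xml
<!-- Assumption: <scheme> resolves to viprfs. -->
<property>
  <name>fs.viprfs.auth.identity_translation</name>
  <value>FIXED_REALM</value>
</property>
<!-- FIXED_REALM takes its realm from this property; MYREALM.COM is a placeholder. -->
<property>
  <name>fs.viprfs.auth.realm</name>
  <value>MYREALM.COM</value>
</property>
<property>
  <name>fs.viprfs.auth.anonymous_translation</name>
  <value>CURRENT_USER</value>
</property>
```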

SymLink support

In standard HDFS, a symbolic link that does not specify the full URI to a file points to a path in the same HDFS instance.

The same rule is true in ViPR HDFS. When you do not specify the full URI in a symlink, ViPR HDFS uses the current namespace and bucket as the root. To provide a symlink to a file outside of the current namespace and bucket, you must provide the full URI that includes both the scheme and the authority.
Note: Hadoop 2.2 does not support SymLinks.

File system interaction

When you are interacting directly with ViPR HDFS, you might notice the following differences from interaction with the standard HDFS file system:

Unsupported Hadoop applications

ViPR HDFS does not support a small subset of Hadoop applications.

Obtaining the ViPR HDFS installation and support package

The ViPR HDFS JAR and HDFS support tools are provided in a ZIP file, vipr-hdfs-<version>.zip, that you can download from the ViPR support pages on support.EMC.com.

The ZIP file contains \client and \tools\bin directories. Before you unzip the file, create a directory to hold the ZIP contents (your unzip tool might do this for you), then extract the contents to that directory. After you extract the files, the directories will contain the following: