ECS 2.1 – Configure HDFS

What is ECS HDFS?

ECS HDFS is a Hadoop Compatible File System (HCFS) that enables you to run Hadoop 2.0 applications on top of your ECS storage infrastructure.

You can configure your Hadoop distribution to run against the built-in Hadoop file system, against ECS HDFS, or any combination of HDFS, ECS HDFS, or other Hadoop Compatible File Systems available in your environment. The following figure illustrates how ECS HDFS integrates with an existing Hadoop cluster.

ECS HDFS integration in a Hadoop cluster

In a Hadoop environment configured to use ECS HDFS, each of the ECS HDFS data nodes functions as a traditional Hadoop NameNode, which means that all of the ECS HDFS data nodes are capable of accepting HDFS requests and servicing them.

When you set up the Hadoop client to use ECS HDFS instead of traditional HDFS, the configuration points to ECS HDFS to do all the HDFS activity. On each ECS HDFS client node, any traditional Hadoop component would use the ECS HDFS client (JAR) to perform the HDFS activity.

To integrate ECS HDFS with an existing Hadoop environment, you must have the following:
  • A Hadoop cluster already installed and configured. The following table lists the supported distributions:
  • Hadoop installed and configured to support ECS HDFS, which requires:
    • An ECS unstructured license with Object and HDFS access methods.
    • An ECS replication group.
    • One or more buckets that support HDFS access.
Back to Top

Configuring Hadoop to use ECS HDFS

Hadoop stores system configuration information in a file called core-site.xml. Editing core-site.xml is a required part of the ECS HDFS configuration.

There are several types of properties to add or modify in core-site.xml including:
  • ECS HDFS Java classes: This set of properties defines the ECS HDFS implementation classes that are contained in the ECS HDFS client JAR. They are required.
  • File system location properties: These properties define the file system URI (scheme and authority) to use when running Hadoop jobs, and the IP addresses of the ECS data nodes for a specific ECS file system.
  • Identity translation properties: These properties allow you to map anonymously owned objects to users, as well as specify user realms.
  • Kerberos realm and service principal properties: These properties are required only when you are running in a Hadoop environment where Kerberos is present. These properties map Hadoop and ECS HDFS users.

core-site.xml resides on each node in the Hadoop cluster. You must add the same properties to each instance of core-site.xml.

Note:

With Cloudera distributions, it is better to make these changes by using the Cloudera Manager Safety Valve, and with Hortonworks by using Ambari, so that the changes are persistent across the cluster.


Back to Top

ECS HDFS URI for file system access

After you configure Hadoop to use the ECS file system, you can access it by specifying the ECS HDFS URI with viprfs:// as the scheme and a combination of ECS bucket, tenant namespace, and user-defined installation name for the authority.

The ECS HDFS URI looks like this:
viprfs://bucket_name.namespace.installation/path

The bucket_name corresponds to an HDFS-enabled bucket. It contains the data you want to analyze with Hadoop. The namespace corresponds to a tenant namespace, and the installation_name is a name you assign to a specific set of ECS nodes or a load balancer. ECS HDFS resolves the installation_name to a set of ECS nodes or to a load balancer by using the fs.vipr.installation.[installation_name].hosts property, which includes the IP addresses of the ECS nodes or load balancer.

If the installation_name maps to a set of ECS nodes, you can set fs.vipr.installation.[installation_name].resolution to dynamic, and set fs.vipr.installation.[installation_name].resolution.dynamic.time_to_live_ms to specify how often to query ECS for the list of active nodes.
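
For example, for an installation named Site1 (the installation name used in the examples later in this article), these properties might look like this:
<property>
  <name>fs.vipr.installation.Site1.resolution</name>
  <value>dynamic</value>
</property>
<property>
  <name>fs.vipr.installation.Site1.resolution.dynamic.time_to_live_ms</name>
  <value>900000</value>
</property>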

With a Hadoop environment that uses simple security, you can specify the ECS HDFS URI as the default file system in core-site.xml by setting it as the value of the fs.defaultFS property, but this is not a requirement. With a Hadoop environment secured with Kerberos, setting ECS HDFS as the default file system is not supported. Where ECS HDFS is not the default file system, you must use the full URI, including the path, each time you access ECS data. If you have existing applications that already use a different default file system, you must update those applications.
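
For example, when ECS HDFS is not the default file system, a directory listing uses the full URI (the bucket, namespace, and installation names here are the placeholder values used later in this article):
# hdfs dfs -ls viprfs://mybucket.mynamespace.Site1/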

Back to Top

Hadoop authentication modes

ECS HDFS integrates with Hadoop clusters configured to use either simple or Kerberos authentication modes. To handle each mode, you must set different properties in core-site.xml to ensure that users and services are able to access the data they need.

Hadoop applications access data stored in ECS buckets, so Hadoop users must have permission to read the objects they access and permission to write to the buckets they write to. Hadoop services (such as mapred, hive, and hbase) must have permission to write system files.

The following table lists the default ACLs applied to files and directories in both simple and Kerberos authentication mode.

Back to Top

Hadoop simple authentication mode

In a Hadoop cluster running in simple mode, when users create files, directories, and objects, those files and directories are created with the empty string as the owner and are referred to as anonymous or anonymously-owned objects. This can cause problems for applications that do their own ACL checking.

To resolve this problem, set the fs.viprfs.auth.anonymous_translation property in core-site.xml to CURRENT_USER. This setting allows anonymously owned files to be displayed as if they were owned by the current Unix user. This example shows what happens when the setting for fs.viprfs.auth.anonymous_translation is set to NONE:

# hadoop fs -ls /
14/01/30 11:36:23 INFO vipr.ViPRFileSystem: Initialized ViPRFS (atom 1.0.1.0-811) for viprfs://hdfs.s3.docsite
d------rwx   -          0 2014-01-30 10:42 /bar
d------rwx   -          0 2014-01-30 10:34 /fooDir
d------rwx   -          0 2014-01-30 06:15 /tmp
If you change the setting to CURRENT_USER, and the logged in user is root, you see that root becomes the owner:
# hadoop fs -ls /
14/01/30 11:30:37 INFO vipr.ViPRFileSystem: Initialized ViPRFS (atom 1.0.1.0-811) for viprfs://hdfs.s3.docsite
Found 3 items
drwx------   - root          0 2014-01-30 10:42 /bar
drwx------   - root          0 2014-01-30 10:34 /fooDir
drwx------   - root          0 2014-01-30 06:15 /tmp

In anonymous (simple) mode, set up your ECS buckets to allow access by Everyone so that Hadoop processes can access them.

ECS HDFS provides the fs.viprfs.auth.identity_translation property as a way to map users to a realm when Kerberos is not present. If you need to chown a file, you can specify the realm to use. For example:
hdfs dfs -chown sally@MYREALM.COM /sallys/new/file
When you specify NONE, users must type the realm each time they chown a file. Alternatively, you can specify FIXED_REALM as a convenience, and then specify the actual realm to use in the fs.viprfs.auth.realm property.
<property>
  <name>fs.viprfs.auth.identity_translation</name>
  <value>FIXED_REALM</value>
</property>
<property>
  <name>fs.viprfs.auth.realm</name>
  <value>MYREALM.COM</value>
</property>

Using chown in anonymous mode is not recommended. When you chown files or directories, you change the owner from the empty string to an actual owner. Once the files or directories have an owner, anonymous users no longer have access to them.

Back to Top

Hadoop Kerberos authentication mode

When Kerberos and the ECS Active Directory server are integrated, the Kerberos realm provides a single namespace of users so that the Hadoop users authenticated with kinit are recognized as credentialed ECS users.

In a Hadoop cluster running in Kerberos mode, there must be a one-way cross-realm trust from the Kerberos realm to the Active Directory realm used to authenticate your ECS users.

The following identity translation properties in core-site.xml are used to ensure the proper Hadoop-to-ECS user translation:
  • fs.permissions.umask-mode: Set the value to 027.
  • fs.viprfs.auth.anonymous_translation: Set the value to CURRENT_USER.
  • fs.viprfs.auth.identity_translation: Set the value to CURRENT_USER_REALM so the realm of users is auto-detected; alternatively set to FIXED_REALM if you want to hard-code the user's realm by using the fs.viprfs.auth.realm property.
In addition, you must set the following properties in core-site.xml to define a service principal:
  • viprfs.security.principal
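
Taken together, these entries might look like the following sketch (MYREALM.COM and the service principal vipr/_HOST@MYREALM.COM are placeholder values; use the realm and principal configured for your environment):
<property>
  <name>fs.permissions.umask-mode</name>
  <value>027</value>
</property>
<property>
  <name>fs.viprfs.auth.anonymous_translation</name>
  <value>CURRENT_USER</value>
</property>
<property>
  <name>fs.viprfs.auth.identity_translation</name>
  <value>CURRENT_USER_REALM</value>
</property>
<property>
  <name>viprfs.security.principal</name>
  <value>vipr/_HOST@MYREALM.COM</value>
</property>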
Back to Top

PIG and ACLs

Because PIG performs its own ACL checking to determine if a user has proper permissions to an object, you must do one of the following when running PIG on top of ECS HDFS.

  • Option 1: Set the following properties in core-site.xml (see the sketch after this list):
    • fs.<scheme>.auth.identity_translation = CURRENT_USER_REALM (in Kerberos authentication mode) or FIXED_REALM (in simple authentication mode)
    • fs.<scheme>.auth.anonymous_translation = CURRENT_USER
  • Option 2: Set the following environment variable:
    • $VIPR_LOCAL_USER_MODE ="true"
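
For Option 1, with the viprfs scheme on a Kerberos-secured cluster, the core-site.xml entries might look like the following sketch (in simple authentication mode, use FIXED_REALM together with the fs.viprfs.auth.realm property instead):
<property>
  <name>fs.viprfs.auth.identity_translation</name>
  <value>CURRENT_USER_REALM</value>
</property>
<property>
  <name>fs.viprfs.auth.anonymous_translation</name>
  <value>CURRENT_USER</value>
</property>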
Back to Top

SymLink support

In standard HDFS, a symbolic link that does not specify the full URI to a file points to a path in the same HDFS instance.

The same rule is true in ECS HDFS. When you do not specify the full URI in a symlink, ECS HDFS uses the current namespace and bucket as the root. To provide a symlink to a file outside of the current namespace and bucket, you must provide the full URI that includes both the scheme and the authority.
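
For example, a symlink whose target resides in a different bucket needs a fully qualified target such as the following (an illustrative URI):
viprfs://otherbucket.mynamespace.Site1/path/to/target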
Note:
Hadoop 2.2 does not support SymLinks.

Back to Top

File system interaction

When you are interacting directly with ECS HDFS, you might notice the following differences from interaction with the standard HDFS file system:

  • Applications that expect the file system to be an instance of DistributedFileSystem do not work. Applications hardcoded to work against the built-in HDFS implementation require changes to use ECS HDFS.
  • ECS HDFS does not support checksums of the data.
  • When you use the listCorruptFileBlocks function, all blocks are reported as OK because ECS HDFS has no notion of corrupted blocks.
  • The replication factor is always reported as a constant N, where N=1. The data is protected by the ECS SLA, not by Hadoop replication.

Back to Top

Unsupported Hadoop applications

ECS HDFS does not support the following Hadoop applications:

  • HttpFS
  • Hue
  • Cloudera Impala
  • Apache Oozie
The following Hadoop applications are not supported with a secure (Kerberos) Hadoop cluster.
  • HBase
  • Hive
Back to Top

Configure ECS HDFS

This article describes how to configure your existing Hadoop distribution to use the data in your ECS storage infrastructure with ECS HDFS. Use this step-by-step procedure if your Hadoop distribution is configured to use simple authentication and not Kerberos authentication.

If your Hadoop distribution is configured for Kerberos authentication, follow the steps described in Configure secure Hadoop cluster to use ECS HDFS.

To perform this integration procedure, you must have:

Back to Top

Plan the ECS HDFS and Hadoop integration

Use this list to verify that you have the information necessary to ensure a successful integration.

To integrate ECS HDFS with your Hadoop cluster, perform the following tasks:

  1. Obtain the ECS HDFS installation and support package
  2. Create bucket for HDFS
  3. Deploy the ECS HDFS JAR or Use a Cloudera Parcel to install Hadoop on a cluster
  4. Edit Hadoop core-site.xml file
  5. Restart the following services:
    • HDFS
    • MapReduce
    • YARN (when using Hortonworks)
    Note:

    If you use the Cloudera Manager Safety Valve to change core-site.xml properties, Cloudera Manager restarts all services by default.


  6. Confirm the services restart correctly.
    Note:

    Some services, such as the NameNode and DataNode, will not start because fs.defaultFS is set to ViPRFS and ECS takes over these tasks in a non-Kerberos setup.


  7. Verify that you have file system access.

When using HBase, perform these additional tasks:

  1. Edit HBASE hbase-site.xml.
  2. Restart the HBase services.
Back to Top

Obtain the ECS HDFS installation and support package

The ECS HDFS JAR and HDFS support tools are provided in a ZIP file, hdfsclient-<ECS version>-<version>.zip, that you can download from the ECS support pages on support.emc.com.

The ZIP file contains \tools\bin, \playbooks, \client, and \parcels directories. Before you unzip the file, create a directory to hold the zip contents (your unzip tool might do this for you), then extract the contents to that directory. After you extract the files, the directories will contain the following:

  • \tools\bin: Contains ViPRAdminTool.sh, which enables the creation of buckets that support HDFS access without needing to use the ECS object protocols or the ECS Portal.
  • \playbooks: Contains Ansible playbooks for configuring a secure Hadoop environment to talk to ECS HDFS.
  • \client: Contains the following files:
    • ECS (ViPRFS) JAR files (viprfs-client-<ECS version>-hadoop-<Hadoop version>.jar): Used to configure different Hadoop distributions.
  • \parcels
    • Cloudera distributions in "Parcel" format. Parcels include the appropriate ViPRFS JAR file.
Back to Top

Create bucket for HDFS

Hadoop HDFS support in ECS uses the ECS object store. Buckets for use by HDFS can be created in a number of ways, but must be marked for HDFS access.

Where you are creating buckets to support a Hadoop cluster that uses simple security (non-Kerberos), you can use the ECS Portal or the object APIs.

Refer to Set bucket permissions for HDFS for information on setting permissions.

Note:

You should not use underscores in bucket names as they are not supported by the URI Java class. For example, viprfs://my_bucket.ns.site/ will not work as this is an invalid URI and is thus not understood by Hadoop.


Back to Top

Set bucket permissions for HDFS

Before you can use a bucket for HDFS access, you need to ensure that permissions are appropriately set.

Buckets created using the ECS Data Access protocols have Full Control for the user who created the bucket. However, in order for a bucket to be used for HDFS, it must be set so that All Users have full control.

You can set bucket permissions using a graphical client, such as the S3 Browser, or you can use a command line tool, such as s3curl.

Back to Top

Deploy the ECS HDFS JAR

Use this procedure to put the ECS HDFS JAR on the classpath of each client node in the Hadoop cluster.

Before you begin

Obtain the ECS HDFS JAR for your Hadoop distribution from the EMC Support site for ECS as described in Obtain the ECS HDFS installation and support package.

Note:
  • After upgrading to ECS SP1 or later, it is recommended that the client JARs are updated with the versions appropriate to the release.
  • If you are using a Cloudera Hadoop distribution, you can use the supplied Cloudera Parcels to install a Hadoop distribution that is pre-configured with the ViPRFS JAR. See Use a Cloudera Parcel to install Hadoop on a cluster.


Procedure

  1. Log in to an ECS client node.
  2. Run the classpath command to get the list of directories in the classpath:
    # hadoop classpath
  3. Copy the ECS HDFS JAR to one of the folders listed by the classpath command that occurs after the /conf folder (see the example after this procedure).
    Hadoop distribution Classpath location (suggested)
    Pivotal HD /usr/lib/gphd/hadoop/lib
    Cloudera /usr/lib/hadoop/lib
    Hortonworks /usr/lib/hadoop/lib
  4. Repeat this procedure on each ECS client node.
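
For example, on a Cloudera or Hortonworks node, the copy in step 3 might look like the following (the JAR file name varies by ECS and Hadoop release):
# cp viprfs-client-<ECS version>-hadoop-<Hadoop version>.jar /usr/lib/hadoop/lib/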
Back to Top

Use a Cloudera Parcel to install Hadoop on a cluster

Cloudera uses Parcels as a mechanism to distribute software to CDH Clusters from the Cloudera Manager Admin console. ECS provides Cloudera Parcels, referred to as ViPRFS Parcels, that are pre-configured to use ECS HDFS.

Before you begin

Cloudera Parcels that include the ECS (ViPRFS) Client are provided in the ECS HDFS installation and support package (see Obtain the ECS HDFS installation and support package).

Procedure

  1. Create a parcel repository
    To create a Cloudera Parcel repository, put the ViPRFS Parcel files in a directory published by a web server. This is typically hosted inside your network.
  2. Add the ECS parcel repository from the Cloudera Manager UI.
    1. Navigate to Administration > Settings > Parcels.
    2. Set the Parcel Update Frequency to 1 minute to speed discovery.
    3. Click Save Changes
  3. Download the parcel.
    Navigate to Hosts > Parcels > Downloadable and download the latest ViPRFS Parcel.
  4. Distribute the parcel to your cluster.
    It is possible to distribute multiple versions of ViPRFS Parcels but only one version can be activated at any time.
    To delete a parcel, you need to de-activate the parcel first, and then remove the parcel from your hosts. All the above operations can be performed in the Parcel UI.
  5. Activate the Parcel
    A distributed parcel can be activated by pushing the Activate button.
    Cloudera Manager will prompt you to restart the cluster to enable running processes to use the new parcel.
  6. Click Restart to restart the cluster.
Back to Top

Edit Hadoop core-site.xml file

Use this procedure to update core-site.xml with the properties needed to integrate ECS HDFS with a Hadoop cluster that uses simple authentication mode.

Before you begin

You must have a set of user credentials that enable you to log in to Hadoop nodes and modify core-site.xml.

The location of core-site.xml depends on the distribution you are using.

core-site.xml resides on each node in the Hadoop cluster. You must modify the same properties in each instance. You can make the change in one node, and then use secure copy command (scp) to copy the file to the other nodes in the cluster.
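
For example, assuming core-site.xml is in /etc/hadoop/conf (the location varies by distribution) and a second node named hadoop-node2 (an illustrative host name), the copy might look like this:
scp /etc/hadoop/conf/core-site.xml root@hadoop-node2:/etc/hadoop/conf/core-site.xml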

See core_site.xml property reference for more information about each property you need to set.

Procedure

  1. Log in to one of the HDFS nodes where core-site.xml is located.
  2. Make a backup copy of core-site.xml.
    cp core-site.xml core-site.backup
  3. Using the text editor of your choice, open core-site.xml for editing.
    Note:

    With Cloudera distributions, it is better to make these changes by using the Cloudera Manager Safety Valve, and with Hortonworks by using Ambari, so that the changes are persistent across the cluster.


  4. Add the following properties and values to define the Java classes that implement the ECS HDFS file system:
    <property>
      <name>fs.viprfs.impl</name>
      <value>com.emc.hadoop.fs.vipr.ViPRFileSystem</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.viprfs.impl</name>
      <value>com.emc.hadoop.fs.vipr.ViPRAbstractFileSystem</value>
    </property>
  5. Add the fs.vipr.installations property. In the following example, the value is set to Site1.
    <property>
      <name>fs.vipr.installations</name>
      <value>Site1</value>
    </property>
  6. Add the fs.vipr.installation.[installation_name].hosts property as a comma-separated list of ECS data nodes or load balancer IP addresses. In the following example, the installation_name is set to Site1.
    Note:

    The use of a load balancer adds no value in an HDFS scenario because the client has the logic to connect to the nodes directly. For this reason, it is recommended that you provide a list of data nodes in this property, not the address of a load balancer. In addition, you should set fs.vipr.installation.[installation_name].resolution to "dynamic".


    <property>
      <name>fs.vipr.installation.Site1.hosts</name>
      <value>203.0.113.10,203.0.113.11,203.0.113.12</value>
    </property>
  7. Add the fs.vipr.installation.[installation_name].resolution property, and set it to one of the following values:
    Option Description
    dynamic Use when accessing ECS data nodes directly without a load balancer.
    fixed Use when accessing ECS data nodes through a load balancer.
    In the following example, installation_name is set to Site1.
    <property>
      <name>fs.vipr.installation.Site1.resolution</name>
      <value>dynamic</value>
    </property>
    1. If you set fs.vipr.installation.[installation_name].resolution to dynamic, add the fs.vipr.installation.[installation_name].resolution.dynamic.time_to_live_ms property to specify how often to query ECS for the list of active nodes.
      In the following example, installation_name is set to Site1.
      <property>
      <name>fs.vipr.installation.Site1.resolution.dynamic.time_to_live_ms</name>
      <value>900000</value>
      </property>
  8. Locate the fs.defaultFS property and modify the value to specify the ECS file system URI.
    This setting is optional; if you do not set ECS HDFS as the default file system, you must specify the full file system URI each time you access ECS data.
    Use the following format: viprfs://<bucket_name>.<namespace>.<installation_name>/, where:
    • bucket_name: The name of the bucket that contains the data you want to use when you run Hadoop jobs. If running in simple authentication mode, the owner of the bucket must grant permission to Everybody. In the following example, the bucket_name is set to mybucket.
    • namespace: The tenant namespace where bucket_name resides. In the following example, the namespace is set to mynamespace.
    • installation_name: The value specified by the fs.vipr.installations property. In the following example, installation_name is set to Site1.
    <property>
      <name>fs.defaultFS</name>
      <value>viprfs://mybucket.mynamespace.Site1/</value>
    </property>
  9. Locate fs.permissions.umask-mode, and set the value to 022.
    In some configurations, this property might not already exist. If it does not, then add it.
    <property>
      <name>fs.permissions.umask-mode</name>
      <value>022</value>
    </property>
  10. Add the fs.viprfs.auth.anonymous_translation property; use it to specify whether to map anonymously owned objects to the current user so that the current user has permission to modify them.
    Option Description
    NONE (default) Do not map anonymously owned objects to the current user.
    CURRENT_USER Map anonymously owned objects to the current Unix user.
    <property>
      <name>fs.viprfs.auth.anonymous_translation</name>
      <value>CURRENT_USER</value>
    </property>
  11. Add the fs.viprfs.auth.identity_translation property. It provides a way to assign users to a realm when Kerberos is not present.
    Option Description
    FIXED_REALM When specified, ECS HDFS gets the realm name from the value of the fs.viprfs.auth.realm property.
    NONE (default) ECS HDFS does no realm translation.
    <property>
      <name>fs.viprfs.auth.identity_translation</name>
      <value>NONE</value>
    </property>
  12. If you set the fs.viprfs.auth.identity_translation property to FIXED_REALM, add the fs.viprfs.auth.realm property.
  13. Save core-site.xml.
  14. Update the core-site.xml on the required nodes in your Hadoop cluster.
  15. If you are using a Cloudera distribution, use Cloudera Manager to update the core-site.xml safety valve with the same set of properties and values.
  16. Restart the Hadoop services.
    Hadoop Distribution Commands
    Pivotal HD ComputeMaster:

    # service hadoop-yarn-resourcemanager restart

    Data Nodes:

    # service hadoop-hdfs-datanode restart

    # service hadoop-yarn-nodemanager restart

    NameNode:

    # service hadoop-hdfs-namenode restart

    When you configure the Pivotal Hadoop cluster to use ECS HDFS as the default file system (specified by fs.defaultFS in core-site.xml), you cannot use the icm_client's cluster start/stop functionality; instead, you must start all cluster services (except HDFS) individually. For example:

    icm_client start -s yarn
    icm_client start -s zookeeper 
    and so on.
    Cloudera Use Cloudera Manager to restart the HDFS and MapReduce services
    Apache # stop-all.sh

    # start-all.sh

  17. Test the configuration by running the following command to get a directory listing:
    # hdfs dfs -ls viprfs://mybucket.mynamespace.Site1/
    13/12/13 22:20:37 INFO vipr.ViPRFileSystem: Initialized ViPRFS for viprfs://mybucket.mynamespace.Site1/
    
    If you have set fs.defaultFS, you can use:

    # hdfs dfs -ls /

Back to Top

Edit HBASE hbase-site.xml

When you use HBase with ECS HDFS, you must set the hbase.rootdir property in hbase-site.xml to the same value as the core-site.xml fs.defaultFS property.

hbase-site.xml is located in one of the following locations:

Procedure

  1. Open hbase-site.xml.
  2. Set the hbase.rootdir property to the same value as fs.defaultFS, adding /hbase as the suffix.
  3. Save your changes.
    1. On Cloudera, add the hbase.rootdir property to the HBase Service Configuration Safety Valve for hbase-site.xml.
  4. Restart the services for your distribution.
    Hadoop Distribution Description
    Pivotal HD Run this command on the hbase master node:
    # service hbase-master restart

    Run this command on the hbase region server:

    # service hbase-regionserver restart
    Cloudera Use Cloudera Manager to restart the HBase service.
    Hortonworks
    # bin/start-hbase.sh

hbase.rootdir entry

<property>
  <name>hbase.rootdir</name>
  <value>viprfs://testbucket.s3.testsite/hbase</value>
</property>

Back to Top

Configure secure Hadoop cluster to use ECS HDFS

This article describes how to configure your existing Hadoop distribution to use the data in your ECS storage infrastructure with ECS HDFS. Use this step-by-step procedure if your Hadoop cluster is configured to use Kerberos authentication.

If your Hadoop cluster is configured for simple authentication, follow the steps described in Configure ECS HDFS.

To perform this integration procedure, you must have:

Back to Top

Plan the ECS HDFS and secure Hadoop cluster integration

Use this list to verify that you have the information necessary to ensure a successful integration. It is a best practice to get your Hadoop cluster working with Kerberos before you configure ECS HDFS.

In addition, verify that a Kerberos KDC is installed and configured to handle authentication of the Hadoop service principals. If you are using Active Directory to authenticate ECS users, you must set up a cross-realm trust between the Kerberos realm and the ECS user realm. Help with setting up the Kerberos KDC and configuring trust is provided in Guidance on Kerberos configuration.

To integrate ECS HDFS with your secure Hadoop cluster, complete the following tasks:

  1. Obtain the ECS HDFS installation and support package
  2. Create bucket for secure HDFS
  3. Deploy the ECS HDFS JAR in a secure cluster or Use a Cloudera Parcel to install Hadoop on a secure cluster
  4. Configure ECS nodes with the ECS Service Principal
  5. Edit core-site.xml
  6. Restart the HDFS and MapReduce services
  7. Confirm the services restart correctly
  8. Verify that you have file system access
Back to Top

Obtain the ECS HDFS installation and support package

The ECS HDFS JAR and HDFS support tools are provided in a ZIP file, hdfsclient-<ECS version>-<version>.zip, that you can download from the ECS support pages on support.emc.com.

The ZIP file contains \tools\bin, \playbooks, \client, and \parcels directories. Before you unzip the file, create a directory to hold the zip contents (your unzip tool might do this for you), then extract the contents to that directory. After you extract the files, the directories will contain the following:

  • \tools\bin: Contains ViPRAdminTool.sh, which enables the creation of buckets that support HDFS access without needing to use the ECS object protocols or the ECS Portal.
  • \playbooks: Contains Ansible playbooks for configuring a secure Hadoop environment to talk to ECS HDFS.
  • \client: Contains the following files:
    • ECS (ViPRFS) JAR files (viprfs-client-<ECS version>-hadoop-<Hadoop version>.jar): Used to configure different Hadoop distributions.
  • \parcels
    • Cloudera distributions in "Parcel" format. Parcels include the appropriate ViPRFS JAR file.
Back to Top

Create bucket for secure HDFS

Where the Hadoop cluster is secured using Kerberos, and you want users authenticated against a Kerberos domain to be able to create buckets, you can create an S3 bucket and enable it for HDFS access or you can use the ECS Admin tool.

The Admin tool provides a convenient way of creating a bucket without having access to the ECS portal or without having to use the ECS Management API or the S3 Object Service API.

Note:

You should not use underscores in bucket names as they are not supported by the URI Java class. For example, viprfs://my_bucket.ns.site/ will not work as this is an invalid URI and is thus not understood by Hadoop.


Back to Top

Create a bucket for HDFS using the ViPRAdminTool

From a Hadoop cluster secured using Kerberos, you can use the ViPRAdminTool.sh script to create a bucket for use by Object and HDFS protocols, without requiring knowledge of the ECS REST API or Object Data Access APIs. The tool is a wrapper around a Java class within the ECS HDFS JAR, so it can be run once the Hadoop cluster has been configured to use ECS HDFS.

Before you begin

  • Obtain the ViPRAdminTool.sh as described in Obtain the ECS HDFS installation and support package
  • Hadoop must be installed and the machine on which you are running the tool must have the ECS HDFS JAR installed and the Hadoop cluster configured to access the ECS HDFS.
  • Kerberos security must be configured. If you do not have Kerberos security configured, you will need to create a bucket using the S3 API or the ECS REST API.

Reference information for the ViPRAdminTool tool is provided in ECS administration tool reference.

Procedure

  1. Use the ViPRAdminTool.sh script and specify the createbucket command, the path to the ECS node and namespace, the name of the new bucket, and the permissions to set for the bucket.
    The following command creates a bucket called "newbucket" and sets its permissions as 0755 (rwx/r-x/r-x). The object virtual pool in which the bucket is created is the default pool and the bucket is assigned to the default project.
    • ViPRAdminTool.sh createbucket viprfs://<ECS Node Address>/myNamespace newbucket 0755
      				
    If you want to specify a project and a virtual pool, you can use the ECS REST API or CLI to obtain these values.
Back to Top

Deploy the ECS HDFS JAR in a secure cluster

Use this procedure to put the ECS HDFS JAR on the classpath of each client node in the Hadoop cluster.

Before you begin

Obtain the ECS HDFS JAR for your Hadoop distribution from the EMC Support site for ECS as described in Obtain the ECS HDFS installation and support package.

Note:
  • After upgrading to ECS SP1 or later, it is recommended that the client JARs are updated with the versions appropriate to the release.
  • If you are using a Cloudera Hadoop distribution, you can use the supplied Cloudera Parcels to install a Hadoop distribution that is pre-configured with the ViPRFS JAR. See Use a Cloudera Parcel to install Hadoop on a cluster.


Procedure

  1. Log in to an ECS client node.
  2. Run the classpath command to get the list of directories in the classpath:
    # hadoop classpath
  3. Copy the ECS HDFS JAR to one of the folders listed by the classpath command that occurs after the /conf folder.
    Hadoop distribution Classpath location (suggested)
    Pivotal HD /usr/lib/gphd/hadoop/lib
    Cloudera /usr/lib/hadoop/lib
    Hortonworks /usr/lib/hadoop/lib
  4. Repeat this procedure on each ECS client node.
Back to Top

Use a Cloudera Parcel to install Hadoop on a secure cluster

Cloudera uses Parcels as a mechanism to distribute software to CDH Clusters from the Cloudera Manager Admin console. ECS provides Cloudera Parcels, referred to as ViPRFS Parcels, that are pre-configured to use ECS HDFS.

Before you begin

Cloudera Parcels that include the ECS (ViPRFS) Client are provided in the ECS HDFS installation and support package (see Obtain the ECS HDFS installation and support package).

Procedure

  1. Create a parcel repository
    To create a Cloudera Parcel repository, put the ViPRFS Parcel files in a directory published by a web server. This is typically hosted inside your network.
  2. Add the ECS parcel repository from the Cloudera Manager UI.
    1. Navigate to Administration > Settings > Parcels.
    2. Set the Parcel Update Frequency to 1 minute to speed discovery.
    3. Click Save Changes
  3. Download the parcel.
    Navigate to Hosts > Parcels > Downloadable and download the latest ViPRFS Parcel.
  4. Distribute the parcel to your cluster.
    It is possible to distribute multiple versions of ViPRFS Parcels but only one version can be activated at any time.
    To delete a parcel, you need to de-activate the parcel first, and then remove the parcel from your hosts. All the above operations can be performed in the Parcel UI.
  5. Activate the Parcel
    A distributed parcel can be activated by pushing the Activate button.
    Cloudera Manager will prompt you to restart the cluster to enable running processes to use the new parcel.
  6. Click Restart to restart the cluster.
Back to Top

Configure ECS nodes with the ECS Service Principal

The ECS service principal and its corresponding keytab file must reside on each ECS data node. Use the Ansible playbooks provided to automate these steps.

Before you begin

You must have the following items before you can complete this procedure:
  • Access to the Ansible playbooks. Obtain the Ansible playbooks from the ECS HDFS software package as described in Obtain the ECS HDFS installation and support package, and copy them to the node where you intend to install Ansible.
  • The list of ECS node IP addresses.
  • IP address of the KDC.
  • The DNS resolution where you run this script must be the same as the DNS resolution for the Hadoop host; otherwise, the vipr/_HOST@REALM principal will not work.

ECS provides reusable Ansible content called 'roles', which consist of python scripts, YAML-based task lists, and template files.

  • vipr_kerberos_config: Configures an ECS node for Kerberos.
  • vipr_jce_config: Configures an ECS data node for unlimited-strength encryption by installing JCE policy files.
  • vipr_kerberos_principal: Acquires a service principal for an ECS node.

Procedure

  1. Install Ansible.
    yum install epel-release && yum install ansible
  2. Decompress the hdfsclient-1.2.0.0-<version>.zip file
    The steps in this procedure use the playbooks contained in the viprfs-client-1.2.0.0-<version>/playbooks/samples directory and the steps are also contained in viprfs-client-1.2.0.0-<version>/playbooks/samples/README.md.
  3. Install the supplied Ansible roles.
    ansible-galaxy install -r requirements.txt -f
  4. Copy the contents of the viprfs-client-1.2.0.0-<version>/playbooks/samples directory to a working directory.
  5. Edit inventory.txt to refer to the ECS data nodes and KDC server.
    The default entries are shown below.
    [data_nodes]
    192.168.2.[100:200] 
    
    [kdc]
    192.168.2.10
  6. Download the "unlimited" JCE policy archive from oracle.com, and extract it to the UnlimitedJCEPolicy directory.
    Kerberos may be configured to use a strong encryption type, such as AES-256. In that situation, the JRE within the ECS nodes must be reconfigured to use the 'unlimited' policy.
    Note:

    Perform this step only if you are using a strong encryption type.


  7. Copy the krb5.conf file from the KDC to the working directory.
  8. Edit the generate-vipr-keytabs.yml as necessary and set the domain name.
    For example.
    [root@nile3-vm22 samples]# cat generate-vipr-keytabs.yml
    ---
    ###
    # Generates keytabs for ViPR/ECS data nodes.
    ###
      
    - hosts: data_nodes
      serial: 1
      
      roles:
        - role: vipr_kerberos_principal
          kdc: "{{ groups.kdc | first }}"
          principals:
            - name: vipr/_HOST@MA.EMC.COM
              keytab: keytabs/_HOST@MA.EMC.COM.keytab

    In this example, the default value (vipr/_HOST@EXAMPLE.COM) has been replaced with (vipr/_HOST@MA.EMC.COM) and the domain is MA.EMC.COM.

  9. Run
    export ANSIBLE_HOST_KEY_CHECKING=False
  10. Run the Ansible playbook to generate keytabs.
    ansible-playbook -v -k -i inventory.txt generate-vipr-keytabs.yml
    		  
  11. Edit the setup-vipr-kerberos.yml file as necessary.
    The default file contents are shown below.
    # cat setup-vipr-kerberos.yml
    
    ---
    ### 
    # Configures ViPR/ECS for Kerberos authentication.
    # - Configures krb5 client 
    # - Installs keytabs
    # - Installs JCE policy
    ###
     
     - hosts: data_nodes
     
       roles:
         - role: vipr_kerberos_config
           krb5:
             config_file: krb5.conf
           service_principal:
             name: vipr/_HOST@EXAMPLE.COM
             keytab: keytabs/_HOST@EXAMPLE.COM.keytab
    
         - role: vipr_jce_config
           jce_policy: 
             name: unlimited
             src: UnlimitedJCEPolicy/
    

    In this example, the default value (vipr/_HOST@EXAMPLE.COM) has been replaced with (vipr/_HOST@MA.EMC.COM) and the domain is MA.EMC.COM.

    Note:
    Remove the "vipr_jce_config" role if you are not using a strong encryption type.

  12. Run the Ansible playbook to configure the data nodes with the ECS service principal.
    Make sure the ./viprfs-client-1.2.0.0-<version>/playbooks/samples/keytab directory exists and that the krb5.conf file is in the working directory, viprfs-client-1.2.0.0-<version>/playbooks/samples.
    ansible-playbook -v -k -i inventory.txt setup-vipr-kerberos.yml
    Verify that the correct ECS service principal, one per data node, has been created (from the KDC):
    # kadmin.local -q "list_principals" | grep vipr
    vipr/nile3-vm42.centera.lab.emc.com@MA.EMC.COM
    vipr/nile3-vm43.centera.lab.emc.com@MA.EMC.COM
    Verify that the correct keytab is generated and stored in /data/hdfs/krb5.keytab on all ECS data nodes. You can use the "strings" command on the keytab to extract the human-readable text, and verify that it contains the correct principal. For example:
    dataservice-10-247-199-69:~ # strings /data/hdfs/krb5.keytab
    MA.EMC.COM
    vipr
    nile3-vm42.centera.lab.emc.com

    In this case the principal is vipr/nile3-vm42.centera.lab.emc.com.

Back to Top

Edit core-site.xml

Use this procedure to update core-site.xml with the properties that are required when using ECS HDFS with a Hadoop cluster that uses Kerberos authentication mode.

Before you begin

Obtain the credentials that enable you to log in to Hadoop nodes and modify core-site.xml.

See core_site.xml property reference for more information about each property you need to set.

core-site.xml resides on each node in the Hadoop cluster, and you must modify the same properties in each instance. You can make the change in one node, and then use secure copy command (scp) to copy the file to the other nodes in the cluster. As a best practice, back up core-site.xml before you start the configuration procedure.

The location of core-site.xml depends on the distribution you are using.

Procedure

  1. Log in to one of the HDFS nodes where core-site.xml is located.
  2. Make a backup copy of core-site.xml.
    cp core-site.xml core-site.backup
  3. Using the text editor of your choice, open core-site.xml for editing.
    Note:

    With Cloudera distributions, it is better to make these changes by using the Cloudera Manager Safety Valve, and with Hortonworks by using Ambari, so that the changes are persistent across the cluster.


  4. Add the following properties and values to define the Java classes that implement the ECS HDFS file system:
    <property>
      <name>fs.viprfs.impl</name>
      <value>com.emc.hadoop.fs.vipr.ViPRFileSystem</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.viprfs.impl</name>
      <value>com.emc.hadoop.fs.vipr.ViPRAbstractFileSystem</value>
    </property>
  5. Add the fs.vipr.installations property. In the following example, the value is set to Site1.
    <property>
      <name>fs.vipr.installations</name>
      <value>Site1</value>
    </property>
  6. Add the fs.vipr.installation.[installation_name].hosts property as a comma-separated list of ECS data nodes or load balancer IP addresses. In the following example, the installation_name is set to Site1.
    Note:

    The use of a load balancer adds no value in an HDFS scenario because the client has the logic to connect to the nodes directly. For this reason, it is recommended that you provide a list of data nodes in this property, not the address of a load balancer. In addition, you should set fs.vipr.installation.[installation_name].resolution to "dynamic".


    <property>
      <name>fs.vipr.installation.Site1.hosts</name>
      <value>203.0.113.10,203.0.113.11,203.0.113.12</value>
    </property>
  7. Add the fs.vipr.installation.[installation_name].resolution property, and set it to one of the following values:
    Option Description
    dynamic Use when accessing ECS data nodes directly without a load balancer.
    fixed Use when accessing ECS data nodes through a load balancer.
    In the following example, installation_name is set to Site1.
    <property>
      <name>fs.vipr.installation.Site1.resolution</name>
      <value>dynamic</value>
    </property>
    1. If you set fs.vipr.installation.[installation_name].resolution to dynamic, add the fs.vipr.installation.[installation_name].resolution.dynamic.time_to_live_ms property to specify how often to query ECS for the list of active nodes.
      In the following example, installation_name is set to Site1.
      <property>
      <name>fs.vipr.installation.Site1.resolution.dynamic.time_to_live_ms</name>
      <value>900000</value>
      </property>
  8. Locate the fs.defaultFS property and modify the value to specify the ECS file system URI.
    This setting is optional; if you do not set ECS HDFS as the default file system, you must specify the full file system URI each time you access ECS data.
    Use the following format: viprfs://<bucket_name>.<namespace>.<installation_name>/, where:
    • bucket_name: The name of the bucket that contains the data you want to use when you run Hadoop jobs. If running in simple authentication mode, the owner of the bucket must grant permission to Everybody. In the following example, the bucket_name is set to mybucket.
    • namespace: The tenant namespace where bucket_name resides. In the following example, the namespace is set to mynamespace.
    • installation_name: The value specified by the fs.vipr.installations property. In the following example, installation_name is set to Site1.
    <property>
      <name>fs.defaultFS</name>
      <value>viprfs://mybucket.mynamespace.Site1/</value>
    </property>
  9. Locate fs.permissions.umask-mode, and set the value to 027.
    In some configurations, this property might not already exist. If it does not, then add it.
    <property>
    		<name>fs.permissions.umask-mode</name>
    		<value>027</value>
    </property>
  10. Add the fs.viprfs.auth.anonymous_translation property; use it to specify whether to map anonymously owned objects to the current user so that the current user has permission to modify them.
    Option Description
    NONE (default) Do not map anonymously owned objects to the current user.
    CURRENT_USER Map anonymously owned objects to the current Unix user.
    <property>
      <name>fs.viprfs.auth.anonymous_translation</name>
      <value>CURRENT_USER</value>
    </property>
  11. Add the fs.viprfs.auth.identity_translation property, and set it to CURRENT_USER_REALM, which maps to the realm of the user signed in via kinit.
    <property>
    		<name>fs.viprfs.auth.identity_translation</name>
    		<value>CURRENT_USER_REALM</value>
    </property>
  12. Add the viprfs.security.principal property. This property tells the KDC who the ECS user is.

    The principal name can include "_HOST" which is automatically replaced by the actual data node FQDN at run time.

    <property>
    		<name>viprfs.security.principal</name>
    		<value>vipr/_HOST@example.com</value>
    </property>
  13. Restart the HDFS and MapReduce services.
  14. Test the configuration by running the following command to get a directory listing:
    # kinit <service principal>
    # hdfs dfs -ls viprfs://mybucket.mynamespace.Site1/
    13/12/13 22:20:37 INFO vipr.ViPRFileSystem: Initialized ViPRFS for viprfs://mybucket.mynamespace.Site1/
    
Back to Top

Guidance on Kerberos configuration

Provides guidance on configuring Kerberos in the Hadoop cluster.

Back to Top

Set up the Kerberos KDC

Set up the Kerberos KDC by following these steps.

Procedure

  1. Install krb5-workstation.
    Use the command:
    yum install -y krb5-libs krb5-server krb5-workstation
  2. Modify /etc/krb5.conf and change the realm name and extensions.
  3. Modify /var/kerberos/krb5kdc/kdc.conf and change the realm name to match your own.
  4. If your KDC is a VM, recreate /dev/random (otherwise your next step of creating the KDC database will take a very long time).
    1. Remove using:
      # rm -rf /dev/random
    2. Recreate using:
       # mknod /dev/random c 1 9
      				  
  5. Create the KDC database.
     # kdb5_util create -s
    Note:
    If you made a mistake with the initial principals (for example, you ran "kdb5_util create -s" incorrectly), you might need to delete these principals explicitly in the /var/kerberos/krb5kdc/ directory.

  6. Modify kadm5.acl to specify users that have admin permission.
    */admin@DET.EMC.COM *
  7. Modify /var/kerberos/krb5kdc/kdc.conf and take out any encryption type except des-cbc-crc:normal. Also modify the realm name.
  8. Ensure iptables and selinux are off on all nodes (KDC server as well as Hadoop nodes).
  9. Start KDC services and create a local admin principal.
    # service krb5kdc start

    # service kadmin start

    # /usr/kerberos/sbin/kadmin.local -q "addprinc root/admin"
    
    # kinit root/admin
  10. Copy the krb5.conf file to all Hadoop nodes.
    Any time you modify any of the configuration files, restart the services below and copy the krb5.conf file to the relevant Hadoop hosts and ECS nodes (an example scp command is shown after this procedure).
  11. Restart the services.
    service krb5kdc restart
    
    service kadmin restart
  12. For additional guidance on setting up a Kerberos KDC, see http://www.centos.org/docs/4/html/rhel-rg-en-4/s1-kerberos-server.html.
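
For example, the copy of krb5.conf to a Hadoop node in step 10 might look like this (hadoop-node1 is an illustrative host name):
scp /etc/krb5.conf root@hadoop-node1:/etc/krb5.conf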
Back to Top

Configure AD user authentication for Kerberos

Where you have a Hadoop environment configured with Kerberos security, you can configure it to authenticate against the ECS AD domain.

Make sure you have an AD user for your ADREALM. The user "detscr" for ADREALM CAMBRIDGE.EMC.COM is used in the example below. Create a one-way trust between the KDCREALM and the ADREALM as shown in the example. Do not try to validate this realm using "netdom trust".

On Active Directory

You must set up a one-way cross-realm trust from the KDC realm to the AD realm. To do so, run the following commands at a command prompt.
ksetup /addkdc KDC-REALM <KDC hostname>
netdom trust KDC-REALM /Domain:AD-REALM /add /realm /passwordt:<TrustPassword>
ksetup /SetEncTypeAttr KDC-REALM <enc_type>
For example:
ksetup /addkdc LSS.EMC.COM lcigb101.lss.emc.com
netdom trust LSS.EMC.COM /Domain:CAMBRIDGE.EMC.COM /add /realm /passwordt:ChangeMe
ksetup /SetEncTypeAttr LSS.EMC.COM DES-CBC-CRC

For this example, encryption des-cbc-crc was used. However, this is a weak encryption that was only chosen for demonstration purposes. Whatever encryption you choose, the AD, KDC, and clients must support it.

On your KDC (as root)

To set up a one-way trust, you need to create a "krbtgt" service principal named krbtgt/KDC-REALM@AD-REALM. Give this principal the password ChangeMe, or whatever you specified for the /passwordt argument above.
  1. On KDC (as root)
    # kadmin
    kadmin: addprinc -e "des-cbc-crc:normal" krbtgt/LSS.EMC.COM@CAMBRIDGE.EMC.COM
    Note:
    When deploying, it is best to limit the encryption types to the one you chose. Once this is working, additional encryption types can be added.

  2. Add the following rules to your core-site.xml hadoop.security.auth_to_local property:
    RULE:[1:$1@$0](^.*@CAMBRIDGE\.EMC\.COM$)s/^(.*)@CAMBRIDGE\.EMC\.COM$/$1/g
    RULE:[2:$1@$0](^.*@CAMBRIDGE\.EMC\.COM$)s/^(.*)@CAMBRIDGE\.EMC\.COM$/$1/g
  3. Verify that AD or LDAP is correctly set up with the Kerberos (KDC) server. Users should be able to "kinit" as an AD user and list the local HDFS directory.
    Note:
    If you are configuring your Hadoop cluster and ECS to authenticate through an AD, create local Linux user accounts on all Hadoop nodes for the AD user you will be kinit'ed as, and also make sure that all Hadoop hosts are kinit'ed using that AD user. For example, if you kinit as userX@ADREALM, create userX as a local user on all Hadoop hosts, and kinit using 'kinit userX@ADREALM' on all hosts for that user.

In the example below, we authenticate with "kinit detscr@CAMBRIDGE.EMC.COM", so we create a user called "detscr" and kinit as this user on the Hadoop host, as shown below:
[root@lviprb159 ~]# su detscr
    [detscr@lviprb159 root]$ whoami
    detscr
    [detscr@lviprb159 root]$ kinit detscr@CAMBRIDGE.EMC.COM
    Password for detscr@CAMBRIDGE.EMC.COM:
    [detscr@lviprb159 root]$ klist
    Ticket cache: FILE:/tmp/krb5cc_1010
    Default principal: detscr@CAMBRIDGE.EMC.COM
    Valid starting     Expires            Service principal
    12/22/14 14:28:27  03/02/15 01:28:30  krbtgt/CAMBRIDGE.EMC.COM@CAMBRIDGE.EMC.COM
        renew until 09/17/17 15:28:27
  
    [detscr@lviprb159 root]$ hdfs dfs -ls /
Found 4 items
drwx---rwx   - yarn   hadoop          0 2014-12-23 14:11 /app-logs
drwx---rwt   - hdfs                   0 2014-12-23 13:48 /apps
drwx---r-x   - mapred                 0 2014-12-23 14:11 /mapred
drwx---r-x   - hdfs                   0 2014-12-23 14:11 /mr-history

Back to Top

ECS administration tool reference

The ECS administration tool (ViPRAdminTool.sh) is provided to enable buckets that support HDFS to be created from a Hadoop cluster without needing to interact with the ECS Portal or API. The tool can also be used to list, delete, and obtain the status of buckets.

Obtaining the tool

The ViPRAdminTool.sh tool is a wrapper around a Java class within the ECS HDFS JAR, so it can be run once the Hadoop cluster has been configured to use ECS HDFS. It can be obtained as described in Obtain the ECS HDFS installation and support package.

Usage

  • To run the tool, use:
    ViPRAdminTool.sh <COMMAND> [ARGUMENTS]
  • You can also run the tool without the script by calling the class directly:
    hadoop com/emc/hadoop/fs/vipr/ViPRAdminClient <COMMAND> [ARGUMENTS]

A command is always required and is denoted by <>; arguments are optional and are denoted by [].

Note:
You must supply all arguments up to the last one that you want to specify.

Commands

createbucket <uri> <name> [permission] [vpoolId] [projectId] [objectType]
Creates a bucket.
statbucket <uri> <name>
Gets the status for the specified bucket.
deletebucket <uri> <name>
Deletes the specified bucket.
listbucket <uri>
Lists all buckets the current user has permission to read.

Arguments

name
Bucket name to create/stat/delete.
uri
A pointer to the ECS deployment, in <scheme>://<DataNode Address>/<namespace> format.
permission
Valid for createbucket only. Specify POSIX permissions in octal format, such as '0755' . Default value: 0777.
vpoolId
Valid for createbucket only. Identity of the replication group to be used. Default replication group is used if not specified.
objectType
Valid for createbucket only. Specifies the objectType allowed for this bucket; only S3 is allowed.
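
For example, to check the status of the "newbucket" bucket created earlier in this article (substitute your own ECS node address and namespace):
ViPRAdminTool.sh statbucket viprfs://<ECS Node Address>/myNamespace newbucket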

Back to Top

Troubleshooting

This area provides workarounds for issues that may be encountered when configuring ECS HDFS.

Back to Top

Verify AD/LDAP is correctly configured with secure Hadoop cluster

You should verify that AD or LDAP is correctly set up with Kerberos (KDC) and the Hadoop cluster.

When your configuration is correct, you should be able to "kinit" against an AD/LDAP user. In addition, if the Hadoop cluster is configured for local HDFS, you should check that you can list the local HDFS directory before ECS gets added to the cluster.

Workaround

If you cannot successfully authenticate as an AD/LDAP user with the KDC on the Hadoop cluster, you should address this before proceeding to ECS Hadoop configuration.

An example of a successful login is shown below:
[kcluser@lvipri054 root]$  kinit kcluser@QE.COM
Password for kcluser@QE.COM:


[kcluser@lvipri054 root]$ klist
Ticket cache: FILE:/tmp/krb5cc_1025
Default principal: kcluser@QE.COM

Valid starting     Expires            Service principal
04/28/15 06:20:57  04/28/15 16:21:08  krbtgt/QE.COM@QE.COM
        renew until 05/05/15 06:20:57

If the above is not successful, you can investigate using the following checklist:
  • Check /etc/krb5.conf on the KDC server for correctness and syntax. Realms can be case sensitive in the config files as well as when used with the kinit command.
  • Check that /etc/krb5.conf from the KDC server is copied to all the Hadoop nodes.
  • Check that one-way trust between AD/LDAP and the KDC server was successfully made. Refer to appropriate documentation on how to do this.
  • Make sure that the encryption type on the AD/LDAP server matches that on the KDC server.
  • Check that /var/kerberos/krb5kdc/kadm5.acl and /var/kerberos/krb5kdc/kdc.conf are correct.
  • Try logging in as a service principal on the KDC server to confirm that the KDC server itself is working correctly.
  • Try logging in as the same AD/LDAP user on the KDC server directly. If that does not work, the issue is likely to be on the KDC server directly.

Back to Top

Permission denied for AD user

When running an application as an AD user, a "Permission denied" error is raised.

Workaround

Set the permissions for the /user directory as:
hdfs dfs -chmod 1777 /user

Back to Top

Restart services after hbase configuration

After editing the hbase.rootdir property in hbase-site.xml, the hbase service does not restart correctly.

Workaround

When this issue arises on Cloudera or Hortonworks, perform the following steps to get hbase-master running.
  1. Connect to the zookeeper cli.
    hbase zkcli
  2. Remove the hbase directory.
    rmr /hbase
  3. Restart the hbase service.

    On Cloudera restart all services.

Back to Top

Pig test fails: unable to obtain Kerberos principal

Pig test fails with the following error: "Info:Error: java.io.IOException: Unable to obtain the Kerberos principal" even after kinit as AD user, or with "Unable to open iterator for alias firstten".

This issue occurs because Pig (releases earlier than 0.13) does not generate a delegation token for ViPRFS when it is used as secondary storage.

Workaround

Append viprfs://bucket.ns.installation/ to the mapreduce.job.hdfs-servers configuration setting. For example:
set mapreduce.job.hdfs-servers viprfs://KcdhbuckTM2.s3.site1

Back to Top

Enable Kerberos client-side logging

To troubleshoot authentication issues, you can enable verbose logging on the Hadoop cluster node that you are using.

Verbose logging is enabled using an environment variable that applies only to your current SSH session.
 export HADOOP_OPTS="-Dsun.security.krb5.debug=true"
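
Then run the failing operation in the same session so that the Kerberos debug output is displayed, for example (using the sample bucket URI from earlier in this article):
 # hdfs dfs -ls viprfs://mybucket.mynamespace.Site1/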

Back to Top

Debug Kerberos on the KDC

Tail the KDC's /var/log/krb5kdc.log file when you do an HDFS operation to make it easier to debug.

tail -f /var/log/krb5kdc.log

Back to Top

Eliminate clock skew

It is important to ensure that time is synchronized between client and server as Kerberos relies on time being accurate.

If your AD has a clock skew with your data nodes/KDC, you will have to configure its NTP server. You can do this as follows:
  1. Use Remote Desktop to connect to your AD server.
  2. Run the following commands:
    1. w32tm /config /syncfromflags:manual /manualpeerlist:<ntp-server1>,<ntp-server2>
    2. net stop w32time
    3. net start w32time

Back to Top

Hadoop core-site.xml properties for ECS HDFS

When configuring the Hadoop core-site.xml file, use this table as a reference for the properties and their related values.

Using YARN

When Kerberos is enabled with ECS HDFS, YARN's Resource Manager (RM) and Node Manager (NM) must run as the same Kerberos principal. Here is an example of how you would set it up in the core-site.xml properties:

fs.vipr.auth.service.yarn.principal = rm/_HOST@ACME.COM
fs.vipr.auth.service.yarn.keytab = /etc/security/keytab/rm.keytab
yarn.nodemanager.keytab = /etc/security/keytab/rm.keytab
yarn.nodemanager.principal = rm/_HOST@ACME.COM
yarn.resourcemanager.keytab = /etc/security/keytab/rm.keytab
yarn.resourcemanager.principal = rm/_HOST@ACME.COM

Back to Top

Sample core-site.xml for simple authentication mode

This core-site.xml is an example of ECS HDFS properties for simple authentication mode.

core-site.xml

<property>
  <name>fs.viprfs.impl</name>
  <value>com.emc.hadoop.fs.vipr.ViPRFileSystem</value>
</property>

<property>
  <name>fs.AbstractFileSystem.viprfs.impl</name>
  <value>com.emc.hadoop.fs.vipr.ViPRAbstractFileSystem</value>
</property>

<property>
  <name>fs.vipr.installations</name>
  <value>Site1</value>
</property>

<property>
  <name>fs.vipr.installation.Site1.hosts</name>
  <value>203.0.113.10,203.0.113.11,203.0.113.12</value>
</property>

<property>
  <name>fs.vipr.installation.Site1.resolution</name>
  <value>dynamic</value>
</property>

<property>
  <name>fs.vipr.installation.Site1.resolution.dynamic.time_to_live_ms</name>
  <value>900000</value>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>viprfs://mybucket.mynamespace.Site1/</value>
</property>

<property>
  <name>fs.viprfs.auth.anonymous_translation</name>
  <value>CURRENT_USER</value>
</property>

<property>
  <name>fs.viprfs.auth.identity_translation</name>
  <value>FIXED_REALM</value>
</property>

<property>
  <name>fs.viprfs.auth.realm</name>
  <value>MY.TEST.REALM</value>
</property>

Back to Top