ViPR 2.1 - Configure ViPR HDFS
Configure ViPR HDFS
This article describes how to configure your existing Hadoop distribution to use the data in your ViPR storage infrastructure with ViPR HDFS. Use this step-by-step procedure if your Hadoop distribution is configured to use simple authentication and not Kerberos authentication.
If your Hadoop distribution is configured for Kerberos authentication, follow the steps described here.
To perform this integration procedure, you must have:
- A working knowledge of your Hadoop distribution and its associated tools.
- The Hadoop credentials that allow you to log in to Hadoop nodes, to modify Hadoop system files, and to start and stop Hadoop services.
Plan the ViPR HDFS and Hadoop integration
Use this list to verify that you have the information necessary to ensure a successful integration.
To integrate ViPR HDFS with your Hadoop cluster, perform the following tasks:
- Obtain the ViPR HDFS installation and support package
- Deploy the ViPR HDFS JAR
- If using Pivotal HAWQ, replace the Pivotal HDFS lib with the ViPR HDFS lib
- Edit core-site.xml
- Restart the following services:
- HDFS
- MapReduce
- Pivotal HAWQ (only if using this service)
- Confirm the services restart correctly.
- Verify that you have file system access.
When using HBase, perform these additional tasks:
- Edit HBase hbase-site.xml.
- Restart the HBase services.
Obtain the ViPR HDFS installation and support package
The ViPR HDFS JAR and HDFS support tools are provided in a ZIP file, vipr-hdfs-<version>.zip, that you can download from the ViPR support pages on support.EMC.com.
The ZIP file contains \client and \tools\bin directories. Before you unzip the file, create a directory to hold the ZIP contents (your unzip tool might do this for you), then extract the contents to that directory. After you extract the files, the directories contain the following:
- \tools\bin: Contains the following tools.
- setupViPRKerberosConfiguration.sh: Configures the ViPR data nodes with a Kerberos service key to enable Hadoop access to the ViPR HDFS service. Run this script from the machine hosting the KDC.
- ViPRAdminTools.sh: Enables the creation of buckets that support HDFS access without needing to use ViPR object protocols or the ViPR UI.
- \client: Contains the following files:
- ViPR JAR files: Used to configure different Hadoop distributions.
- libvipr-<version>.so: Used to configure Pivotal HAWQ for use with ViPR HDFS.
Deploy the ViPR HDFS JAR
Use this procedure to put the ViPR HDFS JAR on the classpath of each client node in the Hadoop cluster.
Before you begin
Obtain the ViPR HDFS JAR for your ViPR distribution from the EMC Support site for ViPR, as described in Obtain the ViPR HDFS installation and support package.
Procedure
- Log in to a ViPR client node.
- Run the classpath command to get the list of directories in the classpath:
# hadoop classpath
- Copy the ViPR HDFS JAR to one of the folders listed by the classpath command that occurs after the /conf folder. Suggested classpath locations by distribution:
  - Pivotal HD: /usr/lib/gphd/hadoop/lib
  - Cloudera: /usr/lib/hadoop/lib
  - Apache: /opt/hadoop/hadoop/lib/native
- Repeat this procedure on each ViPR client node.
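The procedure above can be sketched as a short shell session. The CP value below is an illustrative sample; on a real client node you would use the output of hadoop classpath instead, and the JAR destination would be the lib directory suggested for your distribution:

```shell
# Sketch only: on a real node, replace this assignment with: CP=$(hadoop classpath)
CP="/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*"

# List classpath entries one per line, then pick a lib directory after /conf.
echo "$CP" | tr ':' '\n'

# Copy the ViPR HDFS JAR into the chosen directory (path and version illustrative):
# cp vipr-hdfs-<version>.jar /usr/lib/hadoop/lib/
```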
Configure Pivotal HAWQ
To use the Pivotal HAWQ service with ViPR HDFS, you must replace Pivotal's HDFS lib with the ViPR HDFS lib.
Every time you reconfigure, deploy, or upgrade using icm_client, verify that the libhdfs3 symlinks still point to libvipr-<version>.so. The HAWQ configuration file hdfs-client.xml is not used by ViPR HDFS. When the system is configured to use ViPR HDFS, the HDFS NameNode fails to start because fs.defaultFS points to viprfs://vipr-endpoint/.
Procedure
- Copy the libvipr-<version>.so file you extracted from the ViPR HDFS ZIP file to a local directory on each HAWQ Master and Segment node in the Pivotal cluster.
For example: /usr/local/vipr/libvipr-<version>.so
- Update the libhdfs3 symlinks in the HAWQ installation directory (<HAWQ_INSTALL_DIR>) on each Pivotal master and segment node so that they point to the ViPR library.
For example, to repoint <HAWQ_INSTALL_DIR>/lib/libhdfs3.so and <HAWQ_INSTALL_DIR>/lib/libhdfs3.so.1 to /usr/local/vipr/libvipr-<version>.so:
# unlink <HAWQ_INSTALL_DIR>/lib/libhdfs3.so
# ln -s /usr/local/vipr/libvipr-<version>.so <HAWQ_INSTALL_DIR>/lib/libhdfs3.so
# unlink <HAWQ_INSTALL_DIR>/lib/libhdfs3.so.1
# ln -s /usr/local/vipr/libvipr-<version>.so <HAWQ_INSTALL_DIR>/lib/libhdfs3.so.1
- Update the symlink on each node in the Pivotal cluster.
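The relinking steps above can be sketched as a small script. The install directory and version number below are hypothetical stand-ins (a temporary directory is used so the sketch runs anywhere); substitute your real <HAWQ_INSTALL_DIR> and libvipr version:

```shell
# Stand-ins for the real paths; replace with your actual values.
HAWQ_INSTALL_DIR=$(mktemp -d)/hawq               # hypothetical install dir
VIPR_LIB=/usr/local/vipr/libvipr-2.1.0.0.so      # hypothetical version
mkdir -p "$HAWQ_INSTALL_DIR/lib"

# Remove any existing links, then point both libhdfs3 names at the ViPR library.
for link in libhdfs3.so libhdfs3.so.1; do
    unlink "$HAWQ_INSTALL_DIR/lib/$link" 2>/dev/null
    ln -s "$VIPR_LIB" "$HAWQ_INSTALL_DIR/lib/$link"
done

# Verify where the symlink now points.
readlink "$HAWQ_INSTALL_DIR/lib/libhdfs3.so"
```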
Edit Hadoop core-site.xml file
Use this procedure to update core-site.xml with the properties needed to integrate ViPR HDFS with a Hadoop cluster that uses simple authentication mode.
Before you begin
You must have a set of user credentials that enable you to log in to Hadoop nodes and modify core-site.xml.
The location of core-site.xml depends on the distribution you are using.
core-site.xml resides on each node in the Hadoop cluster, and you must modify the same properties in each instance. You can make the change in one node and then use the secure copy command (scp) to copy the file to the other nodes in the cluster.
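A minimal sketch of that scp distribution step, using hypothetical node names and a hypothetical config path (the leading echo makes this a dry run; drop it once you have verified the node list):

```shell
# Hypothetical node names and destination path; adjust for your cluster.
NODES="hadoop-node2 hadoop-node3 hadoop-node4"
DEST=/etc/hadoop/conf/core-site.xml

for node in $NODES; do
    # Dry run: prints the command. Remove 'echo' to actually copy.
    echo scp "$DEST" "$node:$DEST"
done
```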
See core_site.xml property reference for more information about each property you need to set.
Procedure
- Log in to one of the HDFS nodes where core-site.xml is located.
- Make a backup copy of core-site.xml:
# cp core-site.xml core-site.backup
- Using the text editor of your choice, open core-site.xml for editing.
- Add the following properties and values to define the Java classes that implement the
ViPR HDFS file system:
<property>
  <name>fs.viprfs.impl</name>
  <value>com.emc.hadoop.fs.vipr.ViPRFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.viprfs.impl</name>
  <value>com.emc.hadoop.fs.vipr.ViPRAbstractFileSystem</value>
</property>
- Add the fs.vipr.installations property.
In the following example, the value is set to Site1.
<property>
  <name>fs.vipr.installations</name>
  <value>Site1</value>
</property>
- Add the fs.vipr.installation.[installation_name].hosts property as a comma-separated list of
ViPR data nodes or load balancer IP addresses.
In the following example, the installation_name is set to Site1.
<property>
  <name>fs.vipr.installation.Site1.hosts</name>
  <value>203.0.113.10,203.0.113.11,203.0.113.12</value>
</property>
- Add the fs.vipr.installation.[installation_name].hosts.resolution property, and set it to one of the following values:
  - dynamic: Use when accessing ViPR data nodes directly without a load balancer.
  - fixed: Use when accessing ViPR data nodes through a load balancer.
In the following example, installation_name is set to Site1.
<property>
  <name>fs.vipr.installation.Site1.hosts.resolution</name>
  <value>dynamic</value>
</property>
- If you set fs.vipr.installation.[installation_name].hosts.resolution to dynamic, add the fs.vipr.installation.[installation_name].resolution.dynamic.time_to_live_ms property to specify how often to query ViPR for the list of active nodes.
In the following example, installation_name is set to Site1.
<property>
  <name>fs.vipr.installation.Site1.resolution.dynamic.time_to_live_ms</name>
  <value>900000</value>
</property>
- Locate the fs.defaultFS property and modify the value to specify the ViPR file system URI, using the following format: viprfs://bucket_name.namespace.installation_name/
Where
- bucket_name: The name of the bucket that contains the data you want to use when you run Hadoop jobs. If running in simple authentication mode, the owner of the bucket must grant permission to Everybody. In the following example, the bucket_name is set to mybucket.
- namespace: The tenant namespace where bucket_name resides. In the following example, the namespace is set to mynamespace.
- installation_name: The value specified by the fs.vipr.installations property. In the following example, installation_name is set to Site1.
<property>
  <name>fs.defaultFS</name>
  <value>viprfs://mybucket.mynamespace.Site1/</value>
</property>
- Locate fs.permissions.umask-mode and set the value to 022. In some configurations, this property might not already exist; if it does not, add it.
<property>
  <name>fs.permissions.umask-mode</name>
  <value>022</value>
</property>
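As a quick check of what the 022 umask implies, the resulting default modes can be computed with shell arithmetic (files start from 0666 and directories from 0777; the umask bits are cleared):

```shell
# File mode: 0666 with the 022 bits cleared -> 644 (rw-r--r--)
printf 'file mode: %o\n' $(( 0666 & ~0022 ))
# Directory mode: 0777 with the 022 bits cleared -> 755 (rwxr-xr-x)
printf 'dir mode: %o\n' $(( 0777 & ~0022 ))
```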
- Add the fs.viprfs.auth.anonymous_translation property; use it to specify whether to map anonymously owned objects to the current user so that the current user has permission to modify them.
  - NONE (default): Do not map anonymously owned objects to the current user.
  - CURRENT_USER: Map anonymously owned objects to the current Unix user.
<property>
  <name>fs.viprfs.auth.anonymous_translation</name>
  <value>CURRENT_USER</value>
</property>
- Add the fs.viprfs.auth.identity_translation property. It provides a way to assign users to a realm when Kerberos is not present.
  - FIXED_REALM: ViPR HDFS gets the realm name from the value of the fs.vipr.auth.realm property.
  - NONE (default): ViPR HDFS does no realm translation.
<property>
  <name>fs.viprfs.auth.identity_translation</name>
  <value>NONE</value>
</property>
- If you set the fs.viprfs.auth.identity_translation property to FIXED_REALM, add the fs.viprfs.auth.realm property.
- If you want to use the Pivotal HAWQ service, add the hawq.vipr.endpoint property. Specify the value using the following format:
bucket_name.namespace.installation_name.
Where:
- bucket_name: The name of the bucket that contains the data you want to use when you run Hadoop jobs. If running in simple authentication mode, the owner of the bucket must grant permission to Everybody. In the following example, bucket_name is set to mybucket.
- namespace: The tenant namespace where bucket_name resides. In the following example, the namespace is set to mynamespace.
- installation_name: The value specified by the fs.vipr.installations property. In the following example, the installation_name is set to Site1.
You must be running a version of ViPR that supports Pivotal HAWQ. For more information, see the ViPR Support Matrix.
<property>
  <name>hawq.vipr.endpoint</name>
  <value>mybucket.mynamespace.Site1</value>
</property>
- Save core-site.xml.
- Update the core-site.xml on the required nodes in your Hadoop cluster.
- If you are using a Cloudera distribution, use Cloudera Manager to update the core-site.xml safety valve with the same set of properties and values.
- Restart the Hadoop services.
Pivotal HD:
  ComputeMaster:
  # service hadoop-yarn-resourcemanager restart
  Data Nodes:
  # service hadoop-hdfs-datanode restart
  # service hadoop-yarn-nodemanager restart
  NameNode:
  # service hadoop-hdfs-namenode restart
  If you are using the Pivotal HAWQ service, restart it by running the following commands:
  # service hawq stop
  # service hawq start
  When you configure the Pivotal Hadoop cluster to use ViPR HDFS as the default file system (specified by fs.defaultFS in core-site.xml), you cannot use icm_client's cluster start/stop functionality; instead, you must start all cluster services (except HDFS) individually. For example:
  icm_client start -s yarn
  icm_client start -s zookeeper
  and so on.
Cloudera:
  Use Cloudera Manager to restart the HDFS and MapReduce services.
Apache:
  # stop-all.sh
  # start-all.sh
- Test the configuration by running the following command to get a directory listing:
# hdfs dfs -ls viprfs://mybucket.mynamespace.Site1/
13/12/13 22:20:37 INFO vipr.ViPRFileSystem: Initialized ViPRFS for viprfs://mybucket.mynamespace.Site1/
If you have set fs.defaultFS, you can use:
# hdfs dfs -ls /
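Collecting the properties added in this procedure into one place, the additions to core-site.xml for the example values used above (Site1, mybucket, mynamespace, dynamic resolution) would look like the following fragment; every value here is illustrative and must be adjusted for your installation:

```xml
<property>
  <name>fs.viprfs.impl</name>
  <value>com.emc.hadoop.fs.vipr.ViPRFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.viprfs.impl</name>
  <value>com.emc.hadoop.fs.vipr.ViPRAbstractFileSystem</value>
</property>
<property>
  <name>fs.vipr.installations</name>
  <value>Site1</value>
</property>
<property>
  <name>fs.vipr.installation.Site1.hosts</name>
  <value>203.0.113.10,203.0.113.11,203.0.113.12</value>
</property>
<property>
  <name>fs.vipr.installation.Site1.hosts.resolution</name>
  <value>dynamic</value>
</property>
<property>
  <name>fs.vipr.installation.Site1.resolution.dynamic.time_to_live_ms</name>
  <value>900000</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>viprfs://mybucket.mynamespace.Site1/</value>
</property>
<property>
  <name>fs.permissions.umask-mode</name>
  <value>022</value>
</property>
<property>
  <name>fs.viprfs.auth.anonymous_translation</name>
  <value>CURRENT_USER</value>
</property>
<property>
  <name>fs.viprfs.auth.identity_translation</name>
  <value>NONE</value>
</property>
```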
Edit HBase hbase-site.xml
When you use HBase with ViPR HDFS, you must set the hbase.rootdir property in hbase-site.xml to the same value as the core-site.xml fs.defaultFS property.
The location of hbase-site.xml depends on the distribution you are using.
Procedure
- Open hbase-site.xml.
- Set the hbase.rootdir property to the same value as fs.defaultFS, adding /hbase as the suffix.
- Save your changes.
- On Cloudera, add the hbase.rootdir property to the HBase Service Configuration Safety Valve for hbase-site.xml.
- Restart the services for your distribution.
Pivotal HD:
  Run this command on the HBase master node:
  # service hbase-master restart
  Run this command on the HBase region server:
  # service hbase-regionserver restart
Cloudera:
  Use Cloudera Manager to restart the HBase service.
Apache:
  # bin/start-hbase.sh
Example hbase.rootdir entry:
<property>
  <name>hbase.rootdir</name>
  <value>viprfs://testbucket.s3.testsite/hbase</value>
</property>