What is ViPR HDFS?
This article applies to EMC ViPR 2.0.
You can configure your Hadoop distribution to run against the built-in Hadoop file system, against ViPR HDFS, or any combination of HDFS, ViPR HDFS, or other Hadoop Compatible File Systems available in your environment. The following figure illustrates how ViPR HDFS integrates with an existing Hadoop cluster.
In a Hadoop environment configured to use ViPR HDFS, each of the ViPR data nodes (VM or commodity) functions as a traditional Hadoop NameNode, which means that all of the ViPR data nodes can accept HDFS requests and service them.
When you set up the Hadoop client to use ViPR HDFS instead of the built-in HDFS, the configuration points to ViPR HDFS for all HDFS activity. On each ViPR HDFS client node, any traditional Hadoop component uses the ViPR HDFS client (JAR) to perform the HDFS activity.
- A Hadoop cluster already installed and configured. See the EMC ViPR Support Matrix for the list of supported distributions.
- ViPR installed and configured to support ViPR HDFS, which requires:
- A data services license for HDFS or for Object + HDFS.
- A data services virtual pool that supports HDFS or Object + HDFS.
- One or more buckets that support HDFS or Object + HDFS.
- ViPR HDFS Java classes: This set of properties defines the ViPR HDFS implementation classes that are contained in the ViPR HDFS client JAR. They are required.
- File system location properties: These properties define the file system URI (scheme and authority) to use when running Hadoop jobs, and the IP addresses to the ViPR data VMs for a specific ViPR file system.
- Identity translation properties: These properties allow you to map anonymously owned objects to users, as well as specify user realms.
- Kerberos realm and service principal properties: These properties are required only when you are running in a Hadoop environment where Kerberos is present. These properties map Hadoop and ViPR HDFS users.
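As a sketch of the first group of properties, the ViPR HDFS implementation classes are registered in core-site.xml roughly as follows. The class names shown here are those commonly listed in ViPR documentation; verify them against the client JAR shipped with your release:

```xml
<!-- Register the ViPR HDFS file system implementation (class names
     are assumptions; confirm against your ViPR HDFS client JAR). -->
<property>
  <name>fs.viprfs.impl</name>
  <value>com.emc.hadoop.fs.vipr.ViPRFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.viprfs.impl</name>
  <value>com.emc.hadoop.fs.vipr.ViPRAbstractFileSystem</value>
</property>
```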
core-site.xml resides on each node in the Hadoop cluster. You must add the same properties to each instance of core-site.xml.
The bucket_name corresponds to a bucket residing in an HDFS or HDFS + Object enabled virtual pool. It contains the data you want to analyze with Hadoop. The namespace corresponds to a tenant namespace, and the installation_name is a name you assign to a specific set of ViPR nodes or a load balancer. ViPR HDFS resolves the installation_name to a set of ViPR data VMs or to a load balancer by using the fs.vipr.installation.[installation_name].hosts property, which includes the IP addresses of the data service VMs or load balancer.
If the installation_name maps to a set of ViPR data VMs, you can have ViPR HDFS query ViPR periodically for the list of active nodes by setting fs.vipr.installation.[installation_name].resolution to dynamic, and you can control the query interval by setting fs.vipr.installation.[installation_name].resolution.dynamic.time_to_live_ms to a value in milliseconds.
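For example, the host-resolution properties described above might look like this in core-site.xml, assuming an installation named Site1 and placeholder IP addresses (substitute your own installation name, data VM addresses, and interval):

```xml
<!-- Map the installation name Site1 (a placeholder) to the ViPR data VMs. -->
<property>
  <name>fs.vipr.installation.Site1.hosts</name>
  <value>203.0.113.10,203.0.113.11,203.0.113.12</value>
</property>
<!-- Query ViPR dynamically for the list of active nodes... -->
<property>
  <name>fs.vipr.installation.Site1.resolution</name>
  <value>dynamic</value>
</property>
<!-- ...refreshing the list every 15 minutes (900000 ms). -->
<property>
  <name>fs.vipr.installation.Site1.resolution.dynamic.time_to_live_ms</name>
  <value>900000</value>
</property>
```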
You can specify the ViPR HDFS URI as the default file system in core-site.xml by setting it as the value of the fs.defaultFS property, but this is not a requirement. Whether to make ViPR HDFS the default file system requires careful consideration as part of your overall Hadoop on ViPR HDFS integration planning. If you do not specify ViPR HDFS as the default file system, you must use the full URI, including the path, each time you access ViPR data. If you have existing applications that already use a different default file system, you must update those applications.
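As a sketch, a fs.defaultFS entry follows the bucket_name.namespace.installation_name URI form described above. The bucket, namespace, and installation names here are placeholders:

```xml
<!-- Make ViPR HDFS the default file system (optional).
     mybucket, mynamespace, and Site1 are placeholder names. -->
<property>
  <name>fs.defaultFS</name>
  <value>viprfs://mybucket.mynamespace.Site1/</value>
</property>
```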
Hadoop applications access data stored in ViPR buckets, so the Hadoop users must have permissions to read the objects they are trying to read, and permissions to write to the buckets they are trying to write to. Hadoop services (such as mapred, hive, and hbase) must have permissions to write system files.
The following table lists the default ACLs applied to files and directories in both simple and Kerberos authentication mode.
To resolve this problem, set the fs.viprfs.auth.anonymous_translation property in core-site.xml to CURRENT_USER. This setting allows anonymously owned files to be displayed as if they were owned by the current Unix user. This example shows what happens when the setting for fs.viprfs.auth.anonymous_translation is set to NONE:
```
# hadoop fs -ls /
14/01/30 11:36:23 INFO vipr.ViPRFileSystem: Initialized ViPRFS (atom 22.214.171.124-811) for viprfs://hdfs.s3.docsite
d------rwx   -               0 2014-01-30 10:42 /bar
d------rwx   -               0 2014-01-30 10:34 /fooDir
d------rwx   -               0 2014-01-30 06:15 /tmp
```

With fs.viprfs.auth.anonymous_translation set to CURRENT_USER, the same listing shows the files as owned by the current user (root in this example):

```
# hadoop fs -ls /
14/01/30 11:30:37 INFO vipr.ViPRFileSystem: Initialized ViPRFS (atom 126.96.36.199-811) for viprfs://hdfs.s3.docsite
Found 3 items
drwx------   - root          0 2014-01-30 10:42 /bar
drwx------   - root          0 2014-01-30 10:34 /fooDir
drwx------   - root          0 2014-01-30 06:15 /tmp
```
In anonymous mode, set up your ViPR buckets to allow access from Everyone to ensure that Hadoop processes can access ViPR buckets.
For example, with identity translation set to FIXED_REALM, you can assign ownership with a realm-qualified user name:

```
hdfs dfs -chown sally@MYREALM.COM /sallys/new/file
```

The corresponding core-site.xml properties are:

```xml
<property>
  <name>fs.viprfs.auth.identity_translation</name>
  <value>FIXED_REALM</value>
</property>
<property>
  <name>fs.viprfs.auth.realm</name>
  <value>MYREALM.COM</value>
</property>
```
Using chown in anonymous mode is not recommended. When you chown files or directories, you change the owner from the empty string to an actual owner. Once the files or directories have an owner, anonymous users no longer have access to them.
In a Hadoop cluster running in Kerberos mode, there must be a one-way cross-realm trust from the Kerberos realm to the Active Directory realm used to authenticate your ViPR users.
- fs.permissions.umask-mode: Set the value to 027.
- fs.viprfs.auth.anonymous_translation: Set the value to CURRENT_USER.
- fs.viprfs.auth.identity_translation: Set the value to CURRENT_USER_REALM so the realm of users is auto-detected; alternatively set to FIXED_REALM if you want to hard-code the user's realm by using the fs.viprfs.auth.realm property.
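Taken together, the recommended settings above might appear in core-site.xml as in the following sketch. It uses CURRENT_USER_REALM; substitute FIXED_REALM plus fs.viprfs.auth.realm if you prefer to hard-code the realm:

```xml
<!-- Recommended umask for Hadoop on ViPR HDFS. -->
<property>
  <name>fs.permissions.umask-mode</name>
  <value>027</value>
</property>
<!-- Display anonymously owned files as owned by the current user. -->
<property>
  <name>fs.viprfs.auth.anonymous_translation</name>
  <value>CURRENT_USER</value>
</property>
<!-- Auto-detect each user's Kerberos realm. -->
<property>
  <name>fs.viprfs.auth.identity_translation</name>
  <value>CURRENT_USER_REALM</value>
</property>
```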
- Option 1: Set the following properties in core-site.xml:
- fs.<scheme>.auth.identity_translation = CURRENT_USER_REALM (in Kerberos authentication mode) or FIXED_REALM (in simple authentication mode)
- fs.<scheme>.auth.anonymous_translation = CURRENT_USER
- Option 2: Set the following environment variable:
- VIPR_LOCAL_USER_MODE="true"
When you are interacting directly with ViPR HDFS, you might notice the following differences from interaction with the standard HDFS file system:
- Applications that expect the file system to be an instance of DistributedFileSystem do not work. Applications hardcoded to work against the built-in HDFS implementation require changes to use ViPR HDFS.
- ViPR HDFS does not support checksums of the data.
- When you use the listCorruptFileBlocks function, all blocks are reported as OK because ViPR HDFS has no notion of corrupted blocks.
- The replication factor is always reported as 1. The data is protected by the ViPR SLA, not by Hadoop replication.
- Cloudera Impala
- Apache Oozie
The ZIP file contains \client and \tools\bin directories. Before you unzip the file, create a directory to hold the zip contents (your unzip tool might do this for you), then extract the contents to that directory. After you extract the files, the directories contain the following:
- \tools\bin: Contains the following tools.
- setupViPRKerberosConfiguration.sh: Configures the ViPR data nodes with a Kerberos service key to enable Hadoop access to the ViPR HDFS service. Run this script from the machine hosting the KDC.
- ViPRAdminTools.sh: Enables the creation of buckets that support HDFS access without needing to use ViPR object protocols or the ViPR UI.
- \client: Contains the following files:
- ViPR JAR files: Used to configure different Hadoop distributions.
- libvipr-<version>.so: Used to configure Pivotal HAWQ for use with ViPR HDFS.