Migration from HDP non-secure cluster to CDP secure cluster

We are running a migration of HDFS data from an HDP non-secure cluster to a CDP secure cluster. The Cloudera documentation mentions "distcp" as the tool to handle the migration, but it only covers migration from an HDP secure cluster to a CDP secure/non-secure cluster, which is not my case.
I have a few questions:
Should I secure the existing cluster first and then use distcp?
Or is it okay if I use distcp without security checks?
From your experience, how can I handle such a situation?
Thanks in advance

From my experience, you will have to run distcp from the CDP secure cluster, with a valid Kerberos ticket, and with the following parameter:
ipc.client.fallback-to-simple-auth-allowed=true
Full example:
hadoop distcp \
-D ipc.client.fallback-to-simple-auth-allowed=true \
hdfs://<hdp_namenode>:8020/<dir> \
hdfs://<cdp_namenode>:8020/<dir>
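To get the valid Kerberos ticket mentioned above, a typical first step is kinit on the CDP side; this is only a sketch, and the principal and keytab path are placeholders, not values from the original post:
# obtain and verify a Kerberos ticket before running distcp (placeholder principal/keytab)
kinit -kt /path/to/user.keytab user@CDP.REALM.COM
klist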

Related

Can you connect Amazon ElastiCache Redis to Amazon EMR PySpark?

I have been trying several solutions with custom JARs from Redis Labs and also with --packages in spark-submit on EMR, and still no success. Is there any simple way in EMR to connect to ElastiCache?
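One approach worth trying (a sketch only, not verified on EMR here) is pulling the spark-redis connector through the --packages option the question already mentions; the connector version, endpoint, and script name below are assumptions, and the EMR security group must allow access to the ElastiCache port:
# submit a PySpark job with the spark-redis connector (placeholder package version and endpoint)
spark-submit \
  --packages com.redislabs:spark-redis_2.12:3.1.0 \
  --conf spark.redis.host=<elasticache-endpoint> \
  --conf spark.redis.port=6379 \
  my_redis_job.py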

How to run aerospike AQL from remote servers to query aerospike cluster data

I am trying to run AQL queries on my Aerospike cluster from remote servers.
Please let me know if there is any AQL web/CLI client or any other way to achieve this.
You can simply use the -h or --host= options to point at one of the hosts in your cluster. Refer to the AQL docs.
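For example (host and port are placeholders for one node of your cluster):
# run AQL from a remote server against a cluster node
aql -h 10.0.0.11 -p 3000
# long-option form
aql --host=10.0.0.11 --port=3000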

Hive policy in ranger is not working

I am using a Kerberized cluster. The Ranger version is 0.7 and the Hive version is 1.2.
After configuring a Hive policy in Ranger, it does not take effect, although the HBase and HDFS policies work fine. I am using Beeline to connect to HiveServer2.
After careful examination I found a parameter configuration error:
Enable Authorization changed to true
hive.security.authorization.manager changed to:
org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory
This parameter was previously:
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory
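A quick way to confirm the effective setting is to query it from Beeline once connected to HiveServer2; the JDBC URL and principal below are placeholders:
# print the authorizer HiveServer2 is actually using (placeholder host/principal)
beeline -u "jdbc:hive2://<hiveserver2_host>:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
  -e "set hive.security.authorization.manager;"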

Hadoop Cluster deployment using Apache Ambari

I have listed a few queries related to Ambari as follows:
Can I configure more than one Hadoop cluster via the Ambari UI?
(Note: I am using Ambari 1.6.1 for my Hadoop cluster deployment, and I am aware that this can be done via the Ambari API, but I am not able to find it in the Ambari portal.)
We can check the status of services on each node with the "jps" command if we have configured the Hadoop cluster without Ambari.
Is there any way, similar to "jps", to check from the back end whether the Hadoop cluster setup was successful?
(Note: I can see that the services are showing as UP in the Ambari portal.)
Help appreciated!
Please let me know if any additional information is required.
Thanks
The Ambari UI is served up by the Ambari server for a specific cluster. To configure another cluster you need to point your browser to the URL of that other cluster's Ambari server. So you can't see the configuration for multiple clusters on the same web page, but you can set browser bookmarks to jump from configuration to configuration.
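For the second question, a back-end check can also go through the Ambari REST API that the post mentions; the host, credentials, and cluster name here are placeholders:
# list the clusters this Ambari server manages
curl -u admin:admin http://<ambari_host>:8080/api/v1/clusters
# check the state of one service, e.g. HDFS
curl -u admin:admin "http://<ambari_host>:8080/api/v1/clusters/<cluster_name>/services/HDFS?fields=ServiceInfo/state"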

Hive Server 1 vs Hive Server 2

We have Hive version 0.10 and we were wondering whether we should be using Hive Server 1 or Hive Server 2. Another question: to connect to a Hive Server running on port 10000 using third-party tools, do we need anything else?
Thanks,
I had the Hive Server 1 vs 2 question and found the basics at:
http://www.slideshare.net/cwsteinbach/hiveserver2-for-apache-hive
HiveServer2 Thrift API spec
JDBC/ODBC HiveServer2 drivers
Concurrent Thrift clients with memory leak fixes and session/config info
Kerberos authentication
Authorization improvements for GRANT/ROLE and against code-injection vectors
I'm sure there's more given intervening development.
HiveServer2 exposes a Thrift-based JDBC/ODBC interface (with an optional HTTP transport mode). Tools like Beeline can be used to connect from any client outside of your cluster to the Hive database. In a secured environment, Beeline (a JDBC client) can connect through the Knox gateway. Multiple concurrent Beeline connections to HiveServer2 are possible. So go with HiveServer2 for better security and multi-client concurrency.
HiveServer2 is an improved version of HiveServer that supports a Thrift API tailored for JDBC and ODBC clients, Kerberos authentication, and multi-client concurrency. The CLI for HiveServer2 is Beeline.
Src: Cloudera
Kerberos (authentication) and Sentry (authorization).
Sentry authorization works through HiveServer2; it does not apply to HiveServer1, which is what the Hive CLI uses.
The CLI for HiveServer1 is the Hive CLI.
The CLI for HiveServer2 is Beeline.
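For the connection part of the original question, a minimal sketch with Beeline against port 10000 (host and user are placeholders; third-party tools generally just need the Hive JDBC/ODBC driver plus the same connection details):
# connect to HiveServer2 on its default port from outside the cluster
beeline -u "jdbc:hive2://<hiveserver2_host>:10000/default" -n <username>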