How do I read from Hive using Apache Beam? - hive

How can I read from Hive using Apache Beam, i.e., how do I use Hive as a source in an Apache Beam pipeline?

HadoopInputFormatIO can be used to read from Hive as below (note that HCatInputFormat.setInput is called on the same Configuration object that is later passed to the read transform):
Configuration conf = new Configuration();
conf.setClass("mapreduce.job.inputformat.class", HCatInputFormat.class, InputFormat.class);
conf.setClass("key.class", LongWritable.class, WritableComparable.class);
conf.setClass("value.class", DefaultHCatRecord.class, Writable.class);
conf.set("hive.metastore.uris", "...");
HCatInputFormat.setInput(conf, "myDatabase", "myTable", "myFilter");

PCollection<KV<LongWritable, DefaultHCatRecord>> data =
    p.apply(HadoopInputFormatIO.<LongWritable, DefaultHCatRecord>read().withConfiguration(conf));

A pull request merged in July 2017 added Hive support to Beam 2.1.0 via HCatalog (HCatalogIO); see https://issues.apache.org/jira/browse/BEAM-2357 .
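Since Beam 2.1.0 you can therefore also read through HCatalogIO directly. A minimal sketch, assuming the beam-sdks-java-io-hcatalog module is on the classpath; the metastore URI, database, table, and filter values below are placeholders:
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.hcatalog.HCatalogIO;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hive.hcatalog.data.HCatRecord;

// Point the read at the Hive metastore; each element comes back as an HCatRecord.
Map<String, String> configProperties = new HashMap<>();
configProperties.put("hive.metastore.uris", "thrift://metastore-host:9083");

PCollection<HCatRecord> records = p.apply(
    HCatalogIO.read()
        .withConfigProperties(configProperties)
        .withDatabase("myDatabase")   // database to read from
        .withTable("myTable")
        .withFilter("myFilter"));     // optional partition filter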

Related

Web Hue installation and setup is done, but the dashboard is not working

I have recently set up Hue on my Hadoop cluster and everything seems fine. I was able to open the Hue web UI, i.e., localhost:8888, and I can see HDFS, HBase and MySQL. But I am still facing some issues. Could anyone please help me out in this regard?
The problems I am facing are:
Hive connection: I am using beeline and I was able to connect to Hive databases using beeline on the shell, but in web Hue it shows "error loading databases". The configuration I have used in the hue.ini file is
hive_server_host=localhost
# Port where HiveServer2 Thrift server runs on.
hive_server_port=10000
The second issue: even though I was able to connect to the MySQL database, the dashboard tab does not work. I can see all the widgets and charting options like pie, bar, etc., but when I drag and drop them onto the page, it loads forever. I am not able to see any chart of the table data.
Please help me out, as I have been trying for 10 days and have not been able to find any pointers yet.
@Ruthikajawar I have a working hue.ini here:
https://github.com/steven-dfheinz/HDP3-Hue-Service/blob/Hue.4.6.0/configuration/live.hue.ini
The specifics for a working Hive connection are:
[beeswax]
# Host where HiveServer2 is running.
# If Kerberos security is enabled, use fully-qualified domain name (FQDN).
hive_server_host=hdp.cloudera.com
# Binary thrift port for HiveServer2.
#hive_server_port=10000
# Http thrift port for HiveServer2.
#hive_server_http_port=10001
# Host where LLAP is running
## llap_server_host = localhost
# LLAP binary thrift port
## llap_server_port = 10500
# LLAP HTTP Thrift port
## llap_server_thrift_port = 10501
# Alternatively, use Service Discovery for LLAP (Hive Server Interactive) and/or Hiveserver2, this will override server and thrift port
# Whether to use Service Discovery for LLAP
## hive_discovery_llap = true
# is llap (hive server interactive) running in an HA configuration (more than 1)
# important as the zookeeper structure is different
## hive_discovery_llap_ha = false
# Shortcuts to finding LLAP znode Key
# Non-HA - hiveserver-interactive-site - hive.server2.zookeeper.namespace ex hive2 = /hive2
# HA-NonKerberized - (llap_app_name)_llap ex app name llap0 = /llap0_llap
# HA-Kerberized - (llap_app_name)_llap-sasl ex app name llap0 = /llap0_llap-sasl
## hive_discovery_llap_znode = /hiveserver2-hive2
# Whether to use Service Discovery for HiveServer2
hive_discovery_hs2 = true
# Hiveserver2 is hive-site hive.server2.zookeeper.namespace ex hiveserver2 = /hiverserver2
hive_discovery_hiveserver2_znode = /hiveserver2
# Applicable only for LLAP HA
# To keep the load on zookeeper to a minimum
# ---- we cache the LLAP activeEndpoint for the cache_timeout period
# ---- we cache the hiveserver2 endpoint for the length of session
# configurations to set the time between zookeeper checks
## cache_timeout = 60
# Host where Hive Metastore Server (HMS) is running.
# If Kerberos security is enabled, the fully-qualified domain name (FQDN) is required.
#hive_metastore_host=hdp.cloudera.com
# Configure the port the Hive Metastore Server runs on.
#hive_metastore_port=9083
# Hive configuration directory, where hive-site.xml is located
hive_conf_dir=/etc/hive/conf
# Timeout in seconds for thrift calls to Hive service
## server_conn_timeout=120
# Choose whether to use the old GetLog() thrift call from before Hive 0.14 to retrieve the logs.
# If false, use the FetchResults() thrift call from Hive 1.0 or more instead.
## use_get_log_api=false
# Limit the number of partitions that can be listed.
## list_partitions_limit=10000
# The maximum number of partitions that will be included in the SELECT * LIMIT sample query for partitioned tables.
## query_partitions_limit=10
# A limit to the number of rows that can be downloaded from a query before it is truncated.
# A value of -1 means there will be no limit.
## download_row_limit=100000
# A limit to the number of bytes that can be downloaded from a query before it is truncated.
# A value of -1 means there will be no limit.
## download_bytes_limit=-1
# Hue will try to close the Hive query when the user leaves the editor page.
# This will free all the query resources in HiveServer2, but also make its results inaccessible.
## close_queries=false
# Hue will use at most this many HiveServer2 sessions per user at a time.
# For Tez, increase the number to more if you need more than one query at the time, e.g. 2 or 3 (Tez has a maximum of 1 query by session).
## max_number_of_sessions=1
# Thrift version to use when communicating with HiveServer2.
# Version 11 comes with Hive 3.0. If issues, try 7.
thrift_version=11
# A comma-separated list of white-listed Hive configuration properties that users are authorized to set.
## config_whitelist=hive.map.aggr,hive.exec.compress.output,hive.exec.parallel,hive.execution.engine,mapreduce.job.queuename
# Override the default desktop username and password of the hue user used for authentications with other services.
# e.g. Used for LDAP/PAM pass-through authentication.
## auth_username=hive
## auth_password=hive
# Use SASL framework to establish connection to host.
use_sasl=true
For the second part of your question: monitor /var/log/hue/error.log while using the UI to capture and resolve any errors.

Are SSL and Kerberos compatible with each other on Hive Server?

My Hive server is both SSL and Kerberos enabled, but when I try to connect to hiveserver2 via beeline using the following command:
!connect jdbc:hive2://<hostnameOfServer>:10000/hive;ssl=true;sslTrustStore=<keystorePath>;trustStorePassword=<password for keystore>;principal=<Kerberos hive principal> <database username> <database password> org.apache.hive.jdbc.HiveDriver
I get the following error:
Error: Could not open client transport with JDBC Uri: jdbc:hive2://hostnameOfServer:10000/hive;ssl=true;sslTrustStore=keystorePath;trustStorePassword=passwordfor
keystore;principal=Kerberos hive principal database username
database password org.apache.hive.jdbc.HiveDriver: Invalid status 21 (state=08S01,code=0)
I also tried the following command in beeline:
jdbc:hive2://<hostnameOfServer>:10000/hive;principal=<Kerberos hive principal>?transportMode=https;httpPath=cliservice;auth=kerberos;sasl.qop=auth
But I got the same error.
Are SSL and Kerberos compatible with each other?
Yes, they are compatible as of Hive 2.0.0. Check the JIRA task below for more information:
https://issues.apache.org/jira/browse/HIVE-14019
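For reference, here is a rough JDBC sketch of a URL that combines SSL and Kerberos. The hostname, truststore path/password, and principal are placeholders, and it assumes a valid Kerberos ticket already exists (e.g. from kinit) and that the Hive JDBC driver is on the classpath:
import java.sql.Connection;
import java.sql.DriverManager;

public class HiveSslKerberosCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder values: adjust host, truststore, and principal to your cluster.
    String url = "jdbc:hive2://hostnameOfServer:10000/hive"
        + ";ssl=true"
        + ";sslTrustStore=/path/to/truststore.jks"
        + ";trustStorePassword=changeit"
        + ";principal=hive/_HOST@EXAMPLE.COM";
    // With Kerberos, no username/password is passed; the ticket cache is used instead.
    try (Connection conn = DriverManager.getConnection(url)) {
      System.out.println("Connected: " + !conn.isClosed());
    }
  }
}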

Not able to connect to metastore using Thrift URI - Hive

Can someone please help me with the issue below? I have added the Thrift URI value in hive-site.xml. Also, how can I verify that the URI value is correct?
I am running this command
grunt> battingdata = LOAD 'default.batting' USING org.apache.hive.hcatalog.pig.HCatLoader();
and getting the error below
2015-09-29 02:42:36,937 [main] WARN hive.metastore - Failed to connect to the MetaStore Server...
Thanks for your help in advance.
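In case it helps, one way to verify a hive.metastore.uris value is to point a metastore client at it from a small Java program. A rough sketch, assuming the hive-metastore jars are on the classpath; the Thrift host and port are placeholders:
import java.util.List;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

public class MetastoreUriCheck {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    conf.set("hive.metastore.uris", "thrift://metastore-host:9083"); // placeholder URI
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    List<String> databases = client.getAllDatabases(); // fails quickly if the URI is wrong
    System.out.println(databases);
    client.close();
  }
}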

Cloudera CDH 4.6.0 - Hive metastore service not starting

I installed Cloudera CDH 4.6.0 on my CentOS 6.2 Linux server machine (Cloudera Manager 4.8). I am able to start a few services, but I am not able to start the Hive metastore service.
Cloudera is using PostgreSQL as the remote metastore DB. My hostname is delvmpll2, but when starting the Hive service it gives java.net.UnknownHostException: localhost.localdomain.
I edited the hostname in hive-site.xml and restarted all the services, but the same exception still occurs. I could not find where Cloudera is picking up this hostname.
Could someone please let me know what might have gone wrong?
Here is the exception
Caused by: java.net.UnknownHostException: localhost.localdomain
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:195)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at java.net.Socket.connect(Socket.java:478)
at java.net.Socket.<init>(Socket.java:375)
at java.net.Socket.<init>(Socket.java:189)
at org.postgresql.core.PGStream.<init>(PGStream.java:62)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:76)
... 58 more
2014-07-04 07:16:06,354 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: Shutting down hive metastore.
Thanks in advance
Finally I solved it.
I changed the server_host value in the config.ini file in /etc/cloudera-scm-agent to my host, and after restarting the services, all of them are running well.

I can't get Hadoop to start using Amazon EC2/S3

I have created an AMI image and installed Hadoop from the Cloudera CDH2 build. I configured my core-site.xml as follows:
<property>
  <name>fs.default.name</name>
  <value>s3://<BUCKET NAME>/</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value><ACCESS ID></value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value><SECRET KEY></value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/lib/hadoop-0.20/cache/${user.name}</value>
</property>
But when I start up the Hadoop daemons, I get the following error message in the namenode log:
2010-11-03 23:45:21,680 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3://<BUCKET NAME>/ is not of scheme 'hdfs'.
at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:177)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:198)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:306)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1006)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1015)
2010-11-03 23:45:21,691 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
However, I am able to execute hadoop commands from the command line like so:
hadoop fs -put sun-javadb-common-10.5.3-0.2.i386.rpm s3://<BUCKET NAME>/
hadoop fs -ls s3://poc-jwt-ci/
Found 3 items
drwxrwxrwx - 0 1970-01-01 00:00 /
-rwxrwxrwx 1 16307 1970-01-01 00:00 /sun-javadb-common-10.5.3-0.2.i386.rpm
drwxrwxrwx - 0 1970-01-01 00:00 /var
You will notice there are / and /var folders in the bucket. I ran hadoop namenode -format when I first saw this error, then restarted all services, but I still receive the weird Invalid URI for NameNode address (check fs.default.name): s3://<BUCKET NAME>/ is not of scheme 'hdfs' error.
I also notice that the file system created looks like this:
hadoop fs -ls s3://<BUCKET NAME>/var/lib/hadoop-0.20/cache/hadoop/mapred/system
Found 1 items
-rwxrwxrwx 1 4 1970-01-01 00:00 /var/lib/hadoop0.20/cache/hadoop/mapred/system/jobtracker.info
Any ideas of what's going on?
First, I suggest you just use Amazon Elastic MapReduce. There is zero configuration required on your end, and EMR also has a few internal optimizations and monitoring that work to your benefit.
Second, do not use s3: as your default FS. For one thing, S3 is too slow to be used to store intermediate data between jobs (a typical unit of work in Hadoop is a dozen to dozens of MR jobs). It also stores the data in a 'proprietary' format (blocks, etc.), so external apps can't effectively touch the data in S3.
Note that s3: in EMR is not the same as s3: in the standard Hadoop distro. The Amazon guys actually alias s3: as s3n: (s3n: is just raw/native S3 access).
You could also use Apache Whirr for this workflow like this:
Start by downloading the latest release (0.7.0 at this time) from http://www.apache.org/dyn/closer.cgi/whirr/
Extract the archive and try to run ./bin/whirr version. You need to have Java installed for this to work.
Make your Amazon AWS credentials available as environment variables:
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
Update the Hadoop EC2 config to match your needs by editing recipes/hadoop-ec2.properties. Check the Configuration Guide for more info.
Start a Hadoop cluster by running:
./bin/whirr launch-cluster --config recipes/hadoop-ec2.properties
You can see verbose logging output by doing tail -f whirr.log
Now you can login to your cluster and do your work.
./bin/whirr list-cluster --config recipes/hadoop-ec2.properties
ssh namenode-ip
start jobs as needed or copy data from / to S3 using distcp
For more explanations you should read the Quick Start Guide and the 5-minute guide.
Disclaimer: I'm one of the committers.
I think you should not execute bin/hadoop namenode -format, because it is used to format HDFS. In later versions, Hadoop has moved these functions into a separate script called bin/hdfs. After you set the configuration parameters in core-site.xml and the other configuration files, you can use S3 as the underlying file system directly.
Use
fs.defaultFS = s3n://awsAccessKeyId:awsSecretAccessKey@BucketName in your /etc/hadoop/conf/core-site.xml
Then do not start your datanode or namenode. If you have services that need your datanode and namenode, this will not work.
I did this and can access my bucket using commands like
sudo hdfs dfs -ls /
Note that if your awsSecretAccessKey contains a "/" character, you will have to URL encode it.
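As a sketch of that encoding step (the key below is a made-up example, not a real secret):
import java.net.URLEncoder;

public class EncodeSecretKey {
  public static void main(String[] args) throws Exception {
    String secret = "abc/def+ghi";                           // made-up example value
    System.out.println(URLEncoder.encode(secret, "UTF-8"));  // prints abc%2Fdef%2Bghi
  }
}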
Use s3n instead of s3.
hadoop fs -ls s3n://<BUCKET NAME>/etc