Hortonworks Hive and SpagoBI - hive

I want to connect Hortonworks Hive with SpagoBI Studio. I am using the JDBC driver to make the connection, but it is not working. Can anyone help me solve this problem?
Thank you.

First of all, you should create an environment file for SpagoBI. In that file you need to provide the paths to the jars in the Hive lib directory and to hadoop-core.jar (for Hadoop version 1).
Then you need to source the env file and after that start SpagoBI.
It should then run properly.
Basically, this env file gives SpagoBI access to the jars in the Hive lib directory (including your hive-jdbc-*.jar).
The environment file is
export HADOOP_HOME=/usr/lib/hadoop
export HIVE_HOME=/usr/lib/hive
# Start from the current dir, hadoop-core.jar and the Hive conf dir,
# then append every jar in the Hive lib directory (including hive-jdbc-*.jar)
CLASSPATH=.:$HADOOP_HOME/hadoop-core.jar:$HIVE_HOME/conf
for i in ${HIVE_HOME}/lib/*.jar ; do
  CLASSPATH=$CLASSPATH:$i
done
export CLASSPATH
Just save the code in a file called
spagobi-env.env
and then source it with . spagobi-env.env

If you are using a newer version of Hive, then download the jar files given below:
1. hadoop-common-2.6.0.2.2.0.0-2041.jar
2. z-hive-jdbc-0.14.0.2.2.0.0-2041-standalone.jar
Then, in SpagoBI Studio -> go to the data source connection -> select the Hive driver -> add the new driver file "z-hive-jdbc-0.14.0.2.2.0.0-2041-standalone.jar" -> OK.
Then, provide the credentials to connect to Hive, given below:
URL : jdbc:hive2://localhost:10000/xyz
Driver: org.apache.hive.jdbc.HiveDriver
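Before wiring this into SpagoBI, it can help to verify the URL and driver with a minimal standalone JDBC test. This is only a sketch, assuming the standalone hive-jdbc jar is on the classpath, HiveServer2 is reachable at localhost:10000 with a database named xyz, and the user/password placeholders are replaced with your own:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveConnectionTest {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver named above
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Same URL and driver class as in the SpagoBI data source
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/xyz", "hive_user", "hive_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
If this program lists your tables, the same URL and driver class should work from SpagoBI Studio.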
Definitely, this will work.
Thanks
Aman

Edit server.xml
Path: All-In-One-SpagoBI-X.X.X\conf\server.xml
Add:
<!-- Hive Configuration-->
<Resource name="jdbc/hive" auth="Container" type="javax.sql.DataSource" driverClassName="org.apache.hive.jdbc.HiveDriver"
url="jdbc:hive2://data_node_server.com:10000/wsms" username=" " password=" "
maxActive="20" maxIdle="10" maxWait="-1"/>
Edit context.xml
Path: All-In-One-SpagoBI-X.X.X\webapps\spagobi\META-INF\context.xml
Add:
<ResourceLink global="jdbc/hive" name="jdbc/hive" type="javax.sql.DataSource"/>
Do the same with the context.xml of each engine:
All-In-One-SpagoBI-XXX\webapps\XXXEngine\META-INF
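Once the Resource and ResourceLink entries are in place, code running inside the SpagoBI webapps obtains the connection pool through JNDI. The following is only a sketch of that lookup; the class itself is hypothetical, and only the jdbc/hive name comes from the configuration above:
import java.sql.Connection;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.sql.DataSource;

public class HiveJndiLookup {
    public static Connection openHiveConnection() throws Exception {
        // Resolve the container-managed DataSource declared in server.xml
        // and linked in context.xml under the name jdbc/hive
        Context initCtx = new InitialContext();
        Context envCtx = (Context) initCtx.lookup("java:comp/env");
        DataSource ds = (DataSource) envCtx.lookup("jdbc/hive");
        return ds.getConnection();
    }
}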
Add Following Jars in All-In-One-SpagoBI-XXX\lib:
httpcore-4.3.jar
httpclient-4.3-beta2.jar
httpclient-4.2.jar
hadoop-common-xxx.jar
hive-exec.jar
hive-jdbc-xxx.jar
hive-metastore-xxx.jar
hive-service-xxx.jar
slf4j-api-xxx.jar (present in webapps/spagobi/WEB-INF/lib)
hadoop-auth-xxx.jar (optional but recommended - may be required for kerberos or other authentication)
Data source
Label : hive2_conn (hive-jdbc-1.2.1000.2.4.0.0-169)
Description : Connecting to hive
Dialect : Hive QL
url : jdbc:hive2://server_name.com:10000/wsms (note: use the data node server address; the name node server won't work)
user : hive_user_name
pwd : hive_user_pwd
Driver : org.apache.hive.jdbc.HiveDriver
Environment Variables - Hive Server (Data Node) (set these variables for the user used in the data source)
vi ~/.bash_profile
HADOOP_HOME=/usr/hdp/2.4.0.0-169/hadoop
HIVE_HOME=/usr/hdp/2.4.0.0-169/hive
CLASSPATH=.:$HADOOP_HOME/*.jar:$HADOOP_HOME/lib/*.jar:$HIVE_HOME/lib/*.jar
export HADOOP_HOME HIVE_HOME CLASSPATH
Hadoop Server Configuration (hive-site.xml):
<!-- hive Multi user Support -->
<property>
<name>hive.support.concurrency</name>
<description>Enable Hive's Table Lock Manager Service</description>
<value>true</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<description>Zookeeper quorum used by Hive's Table Lock Manager</description>
<value>zk1.myco.com,zk2.myco.com,zk3.myco.com</value>
</property>
<property>
<name>atlas.hook.hive.maxThreads</name>
<value>50</value>
</property>
<property>
<name>atlas.hook.hive.minThreads</name>
<value>5</value>
</property>
<!-- Configure to support the HTTP protocol; default value is binary (set it to http) -->
<property>
<name>hive.server2.transport.mode</name>
<value>http</value><!--default is binary-->
</property>
<!-- Query Optimization -->
<!-- Enable cost-based optimization to optimize the query execution plan; default value is false (set it to true) -->
<property>
<name>hive.cbo.enable</name>
<value>true</value>
</property>
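Note that once hive.server2.transport.mode is set to http, the JDBC URL used by the client also has to request HTTP transport. Below is a minimal sketch of what the connection then looks like; the port 10001 and the cliservice http path are the HiveServer2 defaults and are assumptions here, not values taken from the configuration above:
import java.sql.Connection;
import java.sql.DriverManager;

public class HiveHttpModeConnection {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // transportMode and httpPath are required once the server runs in HTTP mode;
        // 10001 is the default HTTP port (hive.server2.thrift.http.port)
        String url = "jdbc:hive2://data_node_server.com:10001/wsms"
                + ";transportMode=http;httpPath=cliservice";
        try (Connection conn = DriverManager.getConnection(url, "hive_user_name", "hive_user_pwd")) {
            System.out.println("Connected over HTTP transport: " + !conn.isClosed());
        }
    }
}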

Related

Missing hive execution jar 3.1.2

Hi, I am trying to run the following command while installing Hive 3.1.2:
bin/schematool -dbType derby -initSchema
When I run this it tells me "Missing Hive Execution Jar: home/<user>/hive/lib/hive-exec-*.jar". I looked into my /hive/lib directory and it contains hive-exec-3.1.2.jar. I'm running a 32-bit Ubuntu VM that has Hadoop already installed and working. Java is up to date, if that helps; thanks for the help. First, I unpacked my Apache Hive tar file, moved it to my home directory, and renamed it to just hive. Then I set
export HIVE_HOME= “home/<user>/hive”
export PATH=$PATH:$HIVE_HOME/bin
Next, I made the following change to core-site.xml in Hadoop:
<property>
<name>hadoop.proxyuser.firepower.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.firepower.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.server.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.server.groups</name>
<value>*</value>
</property>
</configuration>
Then I made a tmp directory and a user directory in HDFS with write privileges. However, when I did this it gave me an error about a disabled stack guard, saying that I should run execstack -c, and WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes. But I was still able to create the dirs. Then after that I tried to initialize the Derby database with the schematool.

Zeppelin configuration properties file: Can't load BigQuery interpreter configuration

I am attempting to set my zeppelin.bigquery.project_id (or any bigquery configuration property) via my zeppelin-site.xml, but my changes are not loaded when I start Zeppelin. The project ID always defaults to ' '. I am able to change other configuration properties (ex. zeppelin.notebook.storage). I am using Zeppelin 0.7.3 from https://hub.docker.com/r/apache/zeppelin/.
zeppelin-site.xml (created before starting Zeppelin, before an interpreter.json file exists):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.S3NotebookRepo</value>
<description>notebook persistence layer implementation</description>
</property>
... etc ...
<property>
<name>zeppelin.bigquery.project_id</name>
<value>my-project-id</value>
<description>Google Project Id</description>
</property>
</configuration>
Am I configuring the interpreter incorrectly? Could this parameter be overridden elsewhere?
I am not really familiar with Apache Zeppelin, but I have found some documentation pages that make me think that you should actually store the BigQuery configuration parameters in your Interpreter configuration file:
This entry in the GCP blog explains how to use the BigQuery Interpreter for Apache Zeppelin. It includes some examples on how to use it with Dataproc, Apache Spark and the Interpreter.
The BigQuery Interpreter documentation for Zeppelin 0.7.3 mentions that zeppelin.bigquery.project_id is the right parameter to configure, so that is not the issue here. Here there is some information on how to configure the Zeppelin Interpreters.
The GitHub page of the BigQuery Interpreter states that you have to configure the properties during Interpreter creation, and that you should then enable it by using %bigquery.sql.
Finally, make sure that you are specifying the BigQuery interpreter in the appropriate field in the zeppelin-site.xml (like done in the template) or instead enable it by clicking on the "Gear" icon and selecting "bigquery".
Edit /usr/lib/zeppelin/conf/interpreter.json, change zeppelin.bigquery.project_id to be the value of your project and run sudo systemctl restart zeppelin.service.

Running Sqoop job on YARN using Oozie

I've got a problem with running a Sqoop job on YARN in Oozie using Hue. I want to download a table from an Oracle database and upload that table to HDFS. I've got a multinode cluster consisting of 4 nodes.
I want to run a simple Sqoop statement:
import --options-file /tmp/oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1
The options file is located on the local file system of node number 1. The other nodes have no options file in the /tmp/ dir. I created an Oozie workflow with a Sqoop job and tried to run it, but I got an error:
3432 [main] ERROR org.apache.sqoop.Sqoop - Error while expanding arguments
java.lang.Exception: Unable to read options file: /tmp/oracle_dos.txt
The weirdest thing is that the job is sometimes OK, but sometimes fails. The log file gave me the answer why: Oozie runs Sqoop jobs on YARN.
The Resource Manager (a component of YARN) decides which node will execute the Sqoop job. When the Resource Manager decides that Node 1 (which has the options file on its local file system) should execute the job, everything is OK. But when the RM decides that one of the other 3 nodes should execute the Sqoop job, it fails.
This is a big problem for me, because I don't want to upload the options file to every node (what if I have 1000 nodes?). So my question is: is there any way to tell the Resource Manager which node it should use?
You can make a custom file available to your Oozie action on a node by using the <file> tag in your Sqoop action; look at this syntax:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
...
<action name="[NODE-NAME]">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>[JOB-TRACKER]</job-tracker>
<name-node>[NAME-NODE]</name-node>
<prepare>
<delete path="[PATH]"/>
...
<mkdir path="[PATH]"/>
...
</prepare>
<configuration>
<property>
<name>[PROPERTY-NAME]</name>
<value>[PROPERTY-VALUE]</value>
</property>
...
</configuration>
<command>[SQOOP-COMMAND]</command>
<arg>[SQOOP-ARGUMENT]</arg>
...
<file>[FILE-PATH]</file>
...
<archive>[FILE-PATH]</archive>
...
</sqoop>
<ok to="[NODE-NAME]"/>
<error to="[NODE-NAME]"/>
</action>
...
</workflow-app>
Also read this:
The file , archive elements make available, to map-reduce jobs, files
and archives. If the specified path is relative, it is assumed the
file or archiver are within the application directory, in the
corresponding sub-path. If the path is absolute, the file or archive
it is expected in the given absolute path.
Files specified with the file element, will be symbolic links in the
home directory of the task.
...
So in the simplest case you put your file oracle_dos.txt in your workflow directory, add the element <file>oracle_dos.txt</file> to workflow.xml, and change your command to something like this:
import --options-file ./oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1
In this case, even though your Sqoop action runs on some randomly picked node in the cluster, Oozie will copy oracle_dos.txt to that node and you can refer to it as a local file.
Perhaps this is about file permissions. Try to put this file in /home/{user}.

Logs for Hive query executed via beeline

I am running the below Hive command from beeline. Can someone please tell me where I can see the MapReduce logs for this?
0: jdbc:hive2://<servername>:10003/> select a.offr_id offerID , a.offr_nm offerNm , b.disp_strt_ts dispStartDt , b.disp_end_ts dispEndDt , vld_strt_ts validStartDt, vld_end_ts validEndDt from gcor_offr a, gcor_offr_dur b where a.offr_id = b.offr_id and b.disp_end_ts > '2016-09-13 00:00:00';
When using beeline, MapReduce logs are part of HiveServer2 log4j logs.
If your Hive install was configured by Cloudera Manager (CM), then it will typically be in /var/log/hive/hadoop-cmf-HIVE-1-HIVESERVER2-*.out on the node where HiveServer2 is running (may or may not be the same as where you are running beeline from)
Few other scenarios:
Your Hive install was not configured by CM? You will need to manually create the log4j config file:
Create a hive-log4j.properties config file in the directory specified by the HIVE_CONF_DIR environment variable. (This makes it accessible to the HiveServer2 JVM classpath.)
In this file, the log location is specified by log.dir and log.file. See conf/hive-log4j.properties.template in your distribution for an example template for this file.
You run beeline in "embedded HS2 mode" (i.e. beeline -u jdbc:hive2:// user password)?
Then you will customize the beeline log4j (as opposed to the HiveServer2 log4j).
Beeline log4j properties file is strictly called beeline-log4j2.properties (in versions prior to Hive 2.0, it is called beeline-log4j.properties). Needs to be created and made accessible to beeline JVM classpath via HIVE_CONF_DIR. See HIVE-10502 and HIVE-12020 for further discussion on this.
You want to customize which HiveServer2 logs get printed on beeline stdout?
This can be configured at the HiveServer2 level using the hive.server2.logging.operation.enabled and hive.server2.logging.operation.level configs.
Hive uses log4j for logging. These logs are not emitted to the standard output by default but are instead captured to a log file specified by Hive's log4j properties file. By default, Hive will use hive-log4j.default in the conf/ directory of the Hive installation which writes out logs to /tmp/<userid>/hive.log and uses the WARN level.
It is often desirable to emit the logs to the standard output and/or change the logging level for debugging purposes. These can be done from the command line as follows:
$HIVE_HOME/bin/hive --hiveconf hive.root.logger=INFO,console
set hive.async.log.enabled=false

Why 'mapred-site.xml' is not included in the latest Hadoop 2.2.0?

The latest build of Hadoop provides mapred-site.xml.template.
Do we need to create a new mapred-site.xml file using this?
Any link on documentation or explanation related to Hadoop 2.2.0 will be much appreciated.
I believe it's still required. For our basic Hadoop 2.2.0 two-node cluster setup that we have working, I did the following from the setup documentation.
"
From the base of the Hadoop installation, edit the etc/hadoop/mapred-site.xml file. A new
configuration option for Hadoop 2 is the capability to specify a framework name for
MapReduce, setting the mapreduce.framework.name property. In this install we will use the
value of "yarn" to tell MapReduce that it will run as a YARN application.
First, copy the template file to the mapred-site.xml.
cp mapred-site.xml.template mapred-site.xml
Next, copy the following into Hadoop etc/hadoop/mapred-site.xml file and remove the original empty tags.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
"
Regarding documentation, I found this the most useful. Also, the etc/hosts configs for cluster setup and other cluster-related configs were a bit hard to figure out.