How to access custom UDFs through Spark Thrift Server? - hive

I am running Spark Thrift Server on EMR. I start up the Spark Thrift Server by:
sudo -u spark /usr/lib/spark/sbin/start-thriftserver.sh --queue interactive.thrift --jars /opt/lib/custom-udfs.jar
Notice that I have a custom UDF jar that I want to add to the Thrift Server classpath, which is why I added --jars /opt/lib/custom-udfs.jar to the command above.
Once on the EMR cluster, I issue the following to connect to the Spark Thrift Server:
beeline -u jdbc:hive2://localhost:10000/default
Then I am able to issue commands like show databases. But how do I access the custom UDF? I thought adding the --jars option to the Thrift Server startup script would also add the jar as a Hive resource.
Right now, the only way I can access the custom UDF is by adding the jar as a Hive resource:
add jar /opt/lib/custom-udfs.jar
and then creating a function for the UDF.
Question:
Is there a way to configure the custom UDF jar automatically, so it doesn't have to be added to every Spark session?
Thanks!

The easiest way is to edit the file start-thriftserver.sh so that, at the end, it:
waits until the server is ready, then
executes your setup SQL (add jar, create function).
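For example, something like this could be appended at the end of start-thriftserver.sh (a sketch only; the function name my_udf and the class com.example.MyUdf are assumptions, while the jar path is the one from the question):
# wait until the Thrift Server accepts JDBC connections on port 10000
until beeline -u jdbc:hive2://localhost:10000/default -e "show databases" >/dev/null 2>&1; do
  echo "waiting for Spark Thrift Server..."
  sleep 5
done
# register a permanent function backed by the custom UDF jar; a permanent
# function lives in the metastore and is visible to every session, so on later
# start-ups this statement just fails with "function already exists", which is harmless
beeline -u jdbc:hive2://localhost:10000/default \
  -e "CREATE FUNCTION my_udf AS 'com.example.MyUdf' USING JAR '/opt/lib/custom-udfs.jar'"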
You could also post a proposal on the Spark JIRA; "execute setup code at start-up" would be a very good feature.

The problem here seems to be that --jars has to be positioned correctly: it should be the first argument. I too had trouble getting the jars picked up properly. This worked for me:
# if your spark installation is in /usr/lib/
# --properties-file is only needed if you want to customize the Spark configuration;
# the file looks similar to spark-defaults.conf
sudo -u spark /usr/lib/spark/sbin/start-thriftserver.sh \
--jars /path/to/jars/jar1.jar,/path/to/jars/jar2.jar \
--properties-file ./spark-thrift-sparkconf.conf \
--class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
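Once the server is up with the jars on its classpath, a quick sanity check from beeline looks like this (the function name my_udf and the class com.example.MyUdf are again assumptions):
beeline -u jdbc:hive2://localhost:10000/default \
-e "CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf'" \
-e "SELECT my_udf('test')"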


nifi pyspark - "no module named boto3"

I'm trying to run a PySpark job I created that downloads and uploads data from S3 using the boto3 library. The job runs fine in PyCharm, but when I try to run it in NiFi using this template https://github.com/Teradata/kylo/blob/master/samples/templates/nifi-1.0/template-starter-pyspark.xml the ExecutePySpark processor errors with "No module named boto3".
I made sure boto3 is installed in the conda environment that is active.
Any ideas? I'm sure I'm missing something obvious.
Here is a picture of the NiFi Spark processor.
Thanks,
tim
The Python environment that PySpark runs in is configured via the PYSPARK_PYTHON environment variable:
Go to Spark installation directory
Go to conf
Edit spark-env.sh
Add this line: export PYSPARK_PYTHON=PATH_TO_YOUR_CONDA_ENV
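For example (a minimal sketch; the conda environment path /opt/conda/envs/nifi is an assumption, and the variable should point at the python executable inside the environment, not just the environment directory):
# append to $SPARK_HOME/conf/spark-env.sh
export PYSPARK_PYTHON=/opt/conda/envs/nifi/bin/python
# optionally pin the driver to the same interpreter
export PYSPARK_DRIVER_PYTHON=/opt/conda/envs/nifi/bin/python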

Not able to start hiveserver2 for Apache Hive

Could anyone help resolve the problem below? I'm trying to start hiveserver2. I have configured hive-site.xml and the configuration file for the Hadoop directory path as well, and the jar file hive-service-rpc-2.1.1.jar is also available in the lib directory. I am able to start hive, but not hiveserver2.
$ hive --service hiveserver2
Exception in thread "main" java.lang.ClassNotFoundException: /home/directory/Hadoop/Hive/apache-hive-2/1/1-bin/lib/hive-service-rpc-2/1/1/jar
I am glad to say I solved the problem. In my case the issue was that I had two different Hive versions installed, each exported as HIVE_HOME at some point:
export HIVE_HOME=/usr/local/hive-1.2.1/
export HIVE_HOME=/usr/local/hive-2.1.1
My hive command came from 1.2.1, but it was looking for its jars under 2.1.1. You can use the command which hiveserver2 to find out where your command comes from.
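A quick way to check for this kind of mix-up (a sketch, using the paths from above):
# confirm that hive and hiveserver2 resolve to the same installation
which hive
which hiveserver2
echo $HIVE_HOME
# then keep a single, consistent HIVE_HOME, e.g. for the 2.1.1 install:
export HIVE_HOME=/usr/local/hive-2.1.1
export PATH=$HIVE_HOME/bin:$PATH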

Logs for a Hive query executed via beeline

I am running the Hive command below from beeline. Can someone please tell me where I can see the MapReduce logs for it?
0: jdbc:hive2://<servername>:10003/> select a.offr_id offerID , a.offr_nm offerNm , b.disp_strt_ts dispStartDt , b.disp_end_ts dispEndDt , vld_strt_ts validStartDt, vld_end_ts validEndDt from gcor_offr a, gcor_offr_dur b where a.offr_id = b.offr_id and b.disp_end_ts > '2016-09-13 00:00:00';
When using beeline, the MapReduce logs are part of the HiveServer2 log4j logs.
If your Hive install was configured by Cloudera Manager (CM), they will typically be in /var/log/hive/hadoop-cmf-HIVE-1-HIVESERVER2-*.out on the node where HiveServer2 is running (which may or may not be the node you are running beeline from).
A few other scenarios:
Your Hive install was not configured by CM? Then you will need to create the log4j config file manually:
Create a hive-log4j.properties config file in the directory specified by the HIVE_CONF_DIR environment variable (this makes it accessible on the HiveServer2 JVM classpath); see the sketch after this list.
In this file, the log location is specified by log.dir and log.file. See conf/hive-log4j.properties.template in your distribution for an example template for this file.
You run beeline in "embedded HS2 mode" (i.e. beeline -u jdbc:hive2:// user password)?
Then you will customize the beeline log4j (as opposed to the HiveServer2 log4j).
The beeline log4j properties file is strictly called beeline-log4j2.properties (in versions prior to Hive 2.0, it is called beeline-log4j.properties). It needs to be created and made accessible to the beeline JVM classpath via HIVE_CONF_DIR. See HIVE-10502 and HIVE-12020 for further discussion.
You want to customize which HiveServer2 logs get printed on beeline stdout?
This can be configured at the HiveServer2 level using the hive.server2.logging.operation.enabled and hive.server2.logging.operation.level configs.
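As mentioned in the first scenario, here is a sketch of the manual log4j setup (the paths and property names follow the Hive 1.x hive-log4j.properties.template and are assumptions, so check the template shipped with your distribution):
# assuming HIVE_CONF_DIR=/etc/hive/conf and a Hive 1.x (log4j 1.x) install
cd /etc/hive/conf
cp hive-log4j.properties.template hive-log4j.properties
# the template defaults to hive.log.dir=${java.io.tmpdir}/${user.name} and hive.log.file=hive.log;
# point them somewhere more convenient:
sed -i 's|^hive.log.dir=.*|hive.log.dir=/var/log/hive|' hive-log4j.properties
sed -i 's|^hive.log.file=.*|hive.log.file=hiveserver2.log|' hive-log4j.properties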
Hive uses log4j for logging. These logs are not emitted to the standard output by default but are instead captured in a log file specified by Hive's log4j properties file. By default, Hive uses hive-log4j.default in the conf/ directory of the Hive installation, which writes logs to /tmp/<userid>/hive.log at the WARN level.
It is often desirable to emit the logs to the standard output and/or change the logging level for debugging purposes. These can be done from the command line as follows:
$HIVE_HOME/bin/hive --hiveconf hive.root.logger=INFO,console
To see log messages immediately rather than buffered asynchronously, you can also disable asynchronous logging (available in recent Hive versions):
set hive.async.log.enabled=false
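For example, to debug a single CLI session with INFO logs on the console and asynchronous logging switched off (a sketch; whether hive.async.log.enabled can be overridden via --hiveconf depends on the Hive version, since logging is initialized very early):
$HIVE_HOME/bin/hive --hiveconf hive.root.logger=INFO,console --hiveconf hive.async.log.enabled=false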

How to add JAR for Hive custom UDF so it is available permanently on the HDInsight cluster?

I have created a custom UDF in Hive; it's tested on the Hive command line and works fine. So now I have the jar file for the UDF. What do I need to do so that users will be able to create a temporary function pointing to it? Ideally, from the Hive command prompt I would do this:
hive> add jar myudf.jar;
Added [myudf.jar] to class path
Added resources: [myudf.jar]
hive> create temporary function foo as 'mypackage.CustomUDF';
After this I am able to use the function properly.
But I don't want to add the jar each and every time I want to execute the function. I should be able to run this function when:
executing a Hive query against the HDInsight cluster from Visual Studio
executing a Hive query from the command line through SSH (Linux) or RDP/cmd (Windows)
executing a Hive query from the Ambari (Linux) Hive view
executing a Hive query from the HDInsight Query Console Hive Editor (Windows cluster)
So, no matter how I execute the query, the JAR should already be available and added to the path. What's the process to ensure this for Linux as well as Windows clusters?
Maybe you could add the jar to the .hiverc file in the Hive conf directory. This file is loaded every time Hive starts, so from then on you won't need to add the jar separately for each session.
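A sketch of that approach, reusing the jar, function name, and class from the question (the conf path /etc/hive/conf and the jar location are assumptions):
# append the setup statements to .hiverc in the Hive conf directory;
# the Hive CLI runs this file at the start of every session
cat >> /etc/hive/conf/.hiverc <<'EOF'
ADD JAR /usr/lib/customudfs/myudf.jar;
CREATE TEMPORARY FUNCTION foo AS 'mypackage.CustomUDF';
EOF
Note that .hiverc is picked up by the Hive CLI; clients that go through HiveServer2 (such as the Ambari Hive view or Visual Studio) may instead need a server-side init file configured via hive.server2.global.init.file.location.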

Vectorwise to Hive using Sqoop

I've been trying to import a table from Vectorwise to Hive using Sqoop. I downloaded the Vectorwise JDBC driver and all. It just ain't working.
This is the command I'm using:
sudo -u hdfs sqoop import --driver com.ingres.jdbc.IngresDriver --connect jdbc:ingres://172.16.63.157:VW7/amit --username ingres -password ingres --table vector_table --hive-table=vector_table --hive-import --create-hive-table -m 1
And I'm getting the error:
12/06/07 22:08:27 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.ingres.jdbc.IngresDriver
java.lang.RuntimeException: Could not load db driver class: com.ingres.jdbc.IngresDriver
at com.cloudera.sqoop.manager.SqlManager.makeConnection(SqlManager.java:635)
at com.cloudera.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:53)
at com.cloudera.sqoop.manager.SqlManager.execute(SqlManager.java:524)
at com.cloudera.sqoop.manager.SqlManager.execute(SqlManager.java:547)
at com.cloudera.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:191)
at com.cloudera.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:175)
at com.cloudera.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:263)
at com.cloudera.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1226)
at com.cloudera.sqoop.orm.ClassWriter.generate(ClassWriter.java:1051)
at com.cloudera.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:84)
at com.cloudera.sqoop.tool.ImportTool.importTable(ImportTool.java:370)
at com.cloudera.sqoop.tool.ImportTool.run(ImportTool.java:456)
at com.cloudera.sqoop.Sqoop.run(Sqoop.java:146)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.cloudera.sqoop.Sqoop.runSqoop(Sqoop.java:182)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:221)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:230)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:239)
I'd really appreciate it if someone can help me out here.
Thanks in advance! :)
I can't comment yet, so posting this as an answer.
This is a quote from the documentation:
You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class which defines the entry-point to the driver. For example, MySQL's Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.
Do you have the proper jar file in a directory that's accessible by Sqoop?
For the future, it is also always useful to give a bit more information about your environment, such as which version of Sqoop you are using.
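Concretely, installing the Ingres/Vectorwise driver for Sqoop looks something like this (a sketch; the source path of iijdbc.jar is an assumption):
# copy the JDBC driver into Sqoop's lib directory
sudo cp /path/to/iijdbc.jar /usr/lib/sqoop/lib/
# make sure the user running sqoop (hdfs in the command above) can read it
sudo chown hdfs /usr/lib/sqoop/lib/iijdbc.jar
sudo chmod 644 /usr/lib/sqoop/lib/iijdbc.jar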
Okay, I got it working. It was a simple permission issue. I changed the owner of iijdbc.jar to hdfs.
sudo chown hdfs /usr/lib/sqoop/lib/iijdbc.jar
Now it's working! :)
I can now import my Vectorwise tables to Hive using Sqoop. Great!