show tables like '*' fails in Spark SQL 1.3.0+ - apache-spark-sql

We have an instance of Spark 1.2.0 on which we can run the command show tables like 'tmp*'; without issue, using beeline connected to the Thrift server port. We are testing things out against Spark 1.4.0 on the same machine, but when we run the same command on Spark 1.4.0, we get the following error:
0: jdbc:hive2://localhost:10001> show tables like 'tmp*';
Error: java.lang.RuntimeException: [1.13] failure: ``in'' expected but identifier like found
show tables like 'tmp*'
^ (state=,code=0)
0: jdbc:hive2://localhost:10001>
I pulled down Spark 1.3.0 on this machine and it gives the same error as above when running show tables like 'tmp*'.
Does anyone know if there is a similar command in Spark SQL 1.3.0+ that will allow the use of wildcards to return tables matching a given pattern?
This was done on a machine running CDH 5.3.0. The Hive version is Hive 0.13.1-cdh5.3.0 if that matters.

You can use the command below from the Spark SQL shell:
sqlContext.tables().filter("tableName LIKE '%tmp%'").collect()
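If you just want to print the matching names rather than collect them to the driver, the same filter also works with show() (a minimal variant, assuming the same sqlContext):
sqlContext.tables().filter("tableName LIKE 'tmp%'").show()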

$ spark-shell
scala> sql("show tables like 'tmp*'").show()

Related

How can you connect to a postgresql database on heroku from google-colaboratory?

I just started using Colaboratory and would like to run SQL queries against a PostgreSQL database. Is this possible in Colaboratory? I've used the SQL magic in a Jupyter notebook to do this.
It was a matter of using %% instead of a single % that was giving me an error. What I'm running now is this:
! pip install ipython-sql
! pip install psycopg2
%sql postgres://<connect string>
%sql select tablename from pg_tables;
I saw an example where the SQL was run with %%sql, and when I tried that I got an error.
Everything is running well now.
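For anyone following along, the full working sequence in a notebook cell is roughly the following (the connection string is a placeholder, and in some environments the sql extension has to be loaded explicitly before the %sql line magic is available):
%load_ext sql
%sql postgres://<connect string>
%sql select tablename from pg_tables;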
Rebecca

How to access custom UDFs through Spark Thrift Server?

I am running Spark Thrift Server on EMR. I start up the Spark Thrift Server by:
sudo -u spark /usr/lib/spark/sbin/start-thriftserver.sh --queue interactive.thrift --jars /opt/lib/custom-udfs.jar
Notice that I have a custom UDF jar and I want to add it to the Thrift Server classpath, so I added --jars /opt/lib/custom-udfs.jar to the above command.
Once I am on the EMR cluster, I issue the following to connect to the Spark Thrift Server:
beeline -u jdbc:hive2://localhost:10000/default
Then I was able to issue commands like show databases. But how do I access the custom UDF? I thought that adding the --jars option to the Thrift Server startup script would also add the jar as a Hive resource.
The only way I can access the custom UDF now is by adding the custom UDF jar as a Hive resource:
add jar /opt/lib/custom-udfs.jar
and then creating a function for the UDF.
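For reference, the manual steps in beeline look roughly like this (the function and class names below are placeholders, not from the actual jar):
add jar /opt/lib/custom-udfs.jar;
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.udf.MyUdf';
SELECT my_udf(some_column) FROM some_table;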
Question:
Is there a way to configure the custom UDF jar automatically, without having to add the jar to every Spark session?
Thanks!
The easiest way is to edit the start-thriftserver.sh script so that, at the end, it:
1. Waits until the server is ready.
2. Executes the setup SQL query.
You could also post a proposal on JIRA; "execute setup code at start up" would be a very good feature.
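A rough sketch of what such a wrapper could look like (the port, paths, and setup.sql file are assumptions, not taken from the original setup):
#!/bin/bash
# start the Thrift Server as before
sudo -u spark /usr/lib/spark/sbin/start-thriftserver.sh --queue interactive.thrift --jars /opt/lib/custom-udfs.jar
# wait until the JDBC endpoint answers
until beeline -u jdbc:hive2://localhost:10000/default -e 'select 1;' >/dev/null 2>&1; do
  sleep 5
done
# run the setup SQL (e.g. the CREATE FUNCTION statements) once the server is up
beeline -u jdbc:hive2://localhost:10000/default -f /opt/lib/setup.sql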
The problem here seems to be that --jars has to be positioned correctly; it should be the first argument. I too had trouble getting the jars to work properly. This worked for me:
# if your spark installation is in /usr/lib/
# --properties-file is only needed if you want to customize the Spark configuration;
# it looks similar to spark-defaults.conf
sudo -u spark /usr/lib/spark/sbin/start-thriftserver.sh \
--jars /path/to/jars/jar1.jar,/path/to/jars/jar2.jar \
--properties-file ./spark-thrift-sparkconf.conf \
--class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

HIVE on Spark Issue

I am trying to configure Hive on Spark, but even after trying for 5 days I have not found a solution.
Steps followed:
1. After installing Spark, go into the Hive console and set the properties below:
set hive.execution.engine=spark;
set spark.master=spark://INBBRDSSVM294:7077;
set spark.executor.memory=2g;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
2. Added the spark-assembly jar to Hive's lib directory.
3. When running select count(*) from table_name, I get the error below:
2016-08-08 15:17:30,207 ERROR [main]: spark.SparkTask (SparkTask.java:execute(131))
- Failed to execute spark task, with exception
'org.apache.hadoop.hive.ql.metadata.HiveException (Failed to create spark client.)'
Hive version: 1.2.1
Spark version: tried with 1.6.1,1.3.1 and 2.0.0
Would appreciate it if anyone could suggest something.
You can download the spark-1.3.1 source from the Spark download site and build it without Hive support using:
./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4" -Dhadoop.version=2.7.1 -Dyarn.version=2.7.1 -DskipTests
Then copy spark-assembly-1.3.1-hadoop2.7.1.jar to the hive/lib folder.
And follow https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-SparkInstallation to set necessary properties.
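For reference, the copy step might look something like this (the dist path depends on how make-distribution.sh was invoked, and HIVE_HOME is assumed to point at your Hive installation):
cp dist/lib/spark-assembly-1.3.1-hadoop2.7.1.jar $HIVE_HOME/lib/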
First of all, you need to pay attention to which versions are compatible. If you choose Hive 1.2.1, I advise you to use Spark 1.3.1. You can see the version compatibility list here.
The error you are getting is a generic one. You need to start Spark and see what errors the Spark workers report. Also, have you already copied hive-site.xml to spark/conf?
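For completeness, copying the Hive configuration into Spark's conf directory is usually just the following (paths are assumptions for a typical install):
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/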

HBase - Error: java.io.IOException: The connection has to be unmanaged - Client Squirrel SQL

I am trying to query HBase data through a Hive external table. The query comes from a client, which at this time is SQuirreL SQL. If I query through the plain Hive command-line interface, I am able to query the Hive external table (stored in HBase).
However, when I query through SQuirreL SQL I get the error:
Error: java.io.IOException: The connection has to be unmanaged.
The following is my environment
HBase - 1.1.5
Hive - 1.2.1
Hadoop - 2.6.0
Zookeeper - 3.4.6 (running on 3 nodes)
Please help.
Regards
Bala
I sorted this out as well. It was due to a jar mismatch. Once I got all the right jars lined up and started the Thrift server with --jars, this error went away.
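For anyone hitting the same error, the start command ended up looking roughly like this (the exact jar list is an assumption based on the versions above; match the jars to your own cluster):
$SPARK_HOME/sbin/start-thriftserver.sh \
--jars /path/to/hive-hbase-handler-1.2.1.jar,/path/to/hbase-client-1.1.5.jar,/path/to/hbase-common-1.1.5.jar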
Thanks
Bala

Cannot Load Hive Table into Pig via HCatalog

I am currently configuring a Cloudera HDP dev image using this tutorial on CentOS 6.5, installing the base and then adding the different components as I need them. Currently, I am installing / testing HCatalog using this section of the tutorial linked above.
I have successfully installed the package and am now testing HCatalog integration with Pig with the following script:
A = LOAD 'groups' USING org.apache.hcatalog.pig.HCatLoader();
DESCRIBE A;
I have previously created and populated a 'groups' table in Hive before running the command. When I run the script with the command pig -useHCatalog test.pig I get an exception rather than the expected output. Below is the initial part of the stacktrace:
Pig Stack Trace
---------------
ERROR 2245: Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1608)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1547)
at org.apache.pig.PigServer.registerQuery(PigServer.java:518)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
...
Has anyone encountered this error before? Any help would be much appreciated. I would be happy to provide more information if you need it.
The error was caused by HBase's Thrift server not being properly configured. I installed/configured Thrift and added the following to my hive-site.xml, with the proper server information filled in:
<property>
<name>hive.metastore.uris</name>
<value>thrift://<!--URL of Your Server-->:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
I thought the snippet above was not required since I am running Cloudera HDP in pseudo-distributed mode. It turns out that it, along with HBase Thrift, is required to use HCatalog with Pig.
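If it helps, the metastore service that hive.metastore.uris points at can be started with something like the following (9083 is the default metastore port):
hive --service metastore &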