Can't access external Hive metastore with PySpark

I am trying to run some simple code that just shows the databases I created previously on my hive2 server. (Note: this example includes both Python and Scala versions, with the same results.)
If I log into a hive shell and list my databases, I see a total of 3 databases.
When I start the Spark (2.3) shell with pyspark, I do the usual setup and add the following property to my SparkSession:
sqlContext.setConf("hive.metastore.uris","thrift://*****:9083")
Then I restart the SparkContext within my session.
If I run either of the following lines to see all the configs:
pyspark.conf.SparkConf().getAll()
spark.sparkContext._conf.getAll()
I can indeed see that the parameter has been added. I then start a new HiveContext:
hiveContext = pyspark.sql.HiveContext(sc)
But if I list my databases:
hiveContext.sql("SHOW DATABASES").show()
it does not show the same results as the hive shell.
I'm a bit lost. For some reason it looks like the config parameter is being ignored, even though I am sure the URI I'm using is my metastore, since the address I get from running:
hive -e "SET" | grep metastore.uris
is the same address I see if I run:
ses2 = spark.builder.master("local").appName("Hive_Test").config('hive.metastore.uris','thrift://******:9083').getOrCreate()
ses2.sql("SET").show()
Could it be a permissions issue? For example, some tables might not be set to be visible outside the hive shell/user.
Thanks

Managed to solve the issue: due to a communication mix-up, Hive was not actually hosted on that machine. I corrected the address in the code and everything works fine.
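For reference, here is a minimal PySpark sketch of pointing a session at a remote metastore (the thrift host is a placeholder). The key point is that hive.metastore.uris has to be set before the SparkSession is created, together with enableHiveSupport(); setting it on an already running context is typically not picked up.

from pyspark.sql import SparkSession

# Build the session with the Hive catalog enabled and the metastore address set up front
spark = (SparkSession.builder
         .appName("Hive_Test")
         .config("hive.metastore.uris", "thrift://<metastore-host>:9083")  # placeholder host
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()  # should now match the databases listed in the hive shell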

Related

apache spark sql table overwrite issue

I am using the code below to create a table from a DataFrame in Databricks and run into an error.
df.write.saveAsTable("newtable")
This works fine the very first time, but for re-usability, if I rewrite it as below:
df.write.mode(SaveMode.Overwrite).saveAsTable("newtable")
I get the following error.
Error Message:
org.apache.spark.sql.AnalysisException: Can not create the managed table newtable. The associated location dbfs:/user/hive/warehouse/newtable already exists
The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in the version 3.0.0. It was removed to prevent loosing of users data for non-default value.
What are the differences between saveAsTable and insertInto in different SaveMode(s)?
Run the following command to fix the issue:
dbutils.fs.rm("dbfs:/user/hive/warehouse/newtable/", true)
Or
Set the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true:
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")

ExecuteSQL processor returns corrupted data

I have a flow in NiFi in which I use the ExecuteSQL processor to get a merge of all sub-partitions named dt from a Hive table. For example, my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1 and dt=1000 under sikid=2.
What I did is select * from my_table where dt=1000.
Unfortunately, what I get back from the ExecuteSQL processor is corrupted data, including rows with dt=NULL, while the original table does not have a single row with dt=NULL.
The DBCPConnectionPool is configured to use the HiveJDBC4 jar.
Later I tried using the jar matching the CDH release, but that didn't fix it either.
The ExecuteSQL processor is configured as such:
Normalize Table/Column Names: true
Use Avro Logical Types: false
Hive version: 1.1.0
CDH: 5.7.1
Any ideas what's happening? Thanks!
EDIT:
Apparently the returned data includes extra rows... a few thousand of them... which is quite weird.
Does HiveJDBC4 (I assume the Simba Hive driver) parse the table name off the column names? This was one place where there was an incompatibility with the Apache Hive JDBC driver: it didn't support getTableName(), so it doesn't work with ExecuteSQL, and even if it did, when the column names are retrieved from the ResultSetMetaData, they have the table names prepended with a period (.) separator. This is some of the custom code that is in HiveJdbcCommon (used by SelectHiveQL) vs JdbcCommon (used by ExecuteSQL).
If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify auth information on the JDBC URL rather than in a hive-site.xml file for example? If you ask your auth question (using SelectHiveQL) as a separate SO question and link to it here, I will do my best to help out on that front and get you past this.
Eventually it was solved by setting the Hive property hive.query.result.fileformat=SequenceFile.

failed to run hive queries in parallel using hue query editor

I have a CDH 5 cluster with Hive, Impala and Hue installed.
When 2 users try to use the Hue Query Editor in parallel, with either Impala or Hive, we never get the results back.
When a single user fires a query, we get results without a problem.
When we tried to use the Hive command-line interface, we could run queries in parallel.
We also tried creating different Hue users, but even when different Hue users tried to run queries in parallel we still got no results.
It looks like a Hue configuration issue.
Any ideas?
Yosi

Hive 13.0 The UDF implementation class '...' is not present in the class path

I encounter weird behaviour when using Hive 0.13.1 on Amazon EMR.
This happens when I try to both use a UDF and run an external shell script that runs hive -e "..." commands.
We have been using shell scripts to add partitions dynamically to a table and never encountered any problems in Hive 0.11.
However in Hive 0.13.1 the following simplified example breaks:
add jar myjar;
create temporary function myfunc as '...';
create external table mytable...
!hive -e "";
select myfunc(someCol) from mytable;
This results in: The UDF implementation class '...' is not present in the class path
If I remove the shell command (!hive -e ""), the error disappears.
If I add the jar and function again after the shell command, the error also disappears (adding just the function without the jar does not get rid of the error).
Is this known behavior or a bug? Can I do anything besides reloading the jar and function before every use?
AFAIK, this is the way it's always been. One Hive shell cannot pass the additional jars added to its classpath on to a child shell, and definitely not the function definitions.
We provide Hive/Hadoop etc. as a service at Qubole and have the notion of a Hive bootstrap that is used, for cases like this, to capture common statements required by all queries. This is used extensively by most users. (Caveat: I am one of Qubole's and Hive's founders, but I would recommend using Qubole over EMR for Hive.)

Error In Query Operation: Cannot start a job without a project id

I keep getting an error using the bq command-line tool. For example, I can easily run this query and it returns the table that I want:
head -n 10 xxxx-bq:name_name.Report2
Note that xxxx-bq is the project id and name_name is the dataset id. When I try to run a query against this table, say the following:
query "SELECT count(*) FROM xxxx-bq:name_name.Report2"
I get an error saying that I cannot start a job without a project id. What am I doing wrong here? How can I specify the project ID in the query? I know people have asked similar questions; that said, I have been following along and my approach is not working.
Do you have a project id? If not, this page can help you set one up: https://developers.google.com/bigquery/bq-command-line-tool-quickstart
All BigQuery jobs (which include queries) require a project id, which is the project that gets billed for any damage done by the job. (by damage, I mean work)
You should either set your default project id (you can do this by running bq init), or set the project id that you're running the job under via the --project_id= flag.
So if you're running bq shell, you would use bq shell --project_id=myprojectid instead.
Strange... I just started working with bq and got the same error, but it didn't like me passing --project_id=[myprojectid]. Although I was already authenticated with gcloud auth login, I had to run bq init (which seemingly didn't do anything); after that, my queries worked just fine.