I have written a Hive UDF on Cloudera and we're migrating it to Hortonworks. When I try to apply the same UDF on the Hortonworks cluster, it throws the error below.
Use the right dependencies with the correct versions; sit with the admin team to confirm the versions and then try running it again. LIMIT always scans only a few records and applies the operation to that data instead of the whole dataset, so it worked for me when I applied the UDF with LIMIT. Any version you use, even the CDH version, will work if you use LIMIT. The problem only shows up when you apply it to the whole dataset: since my sample data is around 5 million records, it has to run a MapReduce job.
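For example, a minimal sketch of that approach (the jar path, function class, UDF name, table, and column below are all placeholders, not from the original post):

ADD JAR /path/to/my-udf.jar;                      -- placeholder path to the UDF jar
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF';

-- Validate on a small slice first; LIMIT keeps the scan to a few records
SELECT my_udf(col1) FROM source_table LIMIT 100;

-- Only once that works, run it over the full ~5M-row dataset (this launches MapReduce)
SELECT my_udf(col1) FROM source_table;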
I have a flow in NiFi in which I use the ExecuteSQL processor to fetch a merge of all sub-partitions named dt from a Hive table. For example, my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1 and dt=1000 under sikid=2.
What I did is select * from my_table where dt=1000.
Unfortunately, what I've got in return from the ExecuteSQL processor is corrupted data, including rows that have dt=NULL while the original table does not have even one row with dt=NULL.
The DBCPConnectionPool is configured to use HiveJDBC4 jar.
Later I tried using the jar matching my CDH release, but that didn't fix it either.
The ExecuteSQL processor is configured as such:
Normalize Table/Column Names: true
Use Avro Logical Types: false
Hive version: 1.1.0
CDH: 5.7.1
Any ideas what's happening? Thanks!
EDIT:
Apparently the returned data includes extra rows... a few thousand of them, which is quite weird.
Does HiveJDBC4 (I assume the Simba Hive driver) parse the table name off the column names? This was one place there was an incompatibility with the Apache Hive JDBC driver: it didn't support getTableName(), so it doesn't work with ExecuteSQL, and even if it did, when the column names are retrieved from the ResultSetMetaData they have the table names prepended with a period (.) separator. This is some of the custom code in HiveJdbcCommon (used by SelectHiveQL) versus JdbcCommon (used by ExecuteSQL).
If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify auth information on the JDBC URL rather than in a hive-site.xml file for example? If you ask your auth question (using SelectHiveQL) as a separate SO question and link to it here, I will do my best to help out on that front and get you past this.
Eventually it was solved by using the Hive property hive.query.result.fileformat=SequenceFile.
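For reference, a sketch of that fix applied per session (whether you set it on the session or in hive-site.xml depends on your setup; the query is the one from the question):

SET hive.query.result.fileformat=SequenceFile;
SELECT * FROM my_table WHERE dt=1000;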
When I use Impala to transfer a massive amount of data (about 100 GB) in one go and run select count(1) immediately afterwards, I get the wrong total count. When I execute the same SQL again, the total count is correct.
I want to know: besides a leader change, are there any other internal operations that can cause this scan inconsistency? If I change the Impala configuration from kudu_read_mode: READ_LATEST to kudu_read_mode: READ_AT_SNAPSHOT, what timestamp will Impala transmit? And can READ_AT_SNAPSHOT resolve the issue?
I am using Impala 2.10.0 + Kudu 1.5.0.
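For illustration, the sequence that shows the symptom looks roughly like this (table names are made up):

-- Bulk load ~100 GB into a Kudu-backed table through Impala
INSERT INTO kudu_table SELECT * FROM staging_table;

-- Counting immediately afterwards returns a wrong total under READ_LATEST
SELECT COUNT(1) FROM kudu_table;

-- Re-running the same count a little later returns the correct total
SELECT COUNT(1) FROM kudu_table;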
I am using apache-hive-1.2.2 on Hadoop 2.6.0. When I run a Hive query with a WHERE clause, it returns results immediately without launching any MapReduce job. I'm not sure what is happening; the table has over 100k records.
I am quoting this from the Hive documentation:
hive.fetch.task.conversion
Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views and joins.
Any sort of aggregation, like max, min, or count, is going to require a MapReduce job. So it depends on the dataset you have and the query you run against it.
select * from tablename;
This just reads the raw data from the files in HDFS, so it doesn't need MapReduce and is much faster.
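To illustrate the difference (the table and column names below are made up):

-- Served by a single FETCH task: one table, simple filter, no aggregation
SELECT * FROM tablename WHERE country = 'US' LIMIT 10;

-- Needs a MapReduce job: COUNT introduces a ReduceSinkOperator
SELECT country, COUNT(*) FROM tablename GROUP BY country;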
This is due to the property "hive.fetch.task.conversion". The default value is set to "more" (Hive 2.1.0), which results in Hive trying to go straight to the data by launching a single fetch task instead of a MapReduce job wherever possible.
This behaviour, however, might not be desirable if you have a huge table (say 500 GB+), as it causes a single thread to be launched instead of the multiple parallel tasks you would get with a MapReduce job.
You can set this property to "minimal" or "none" in hive-site.xml to bypass the behaviour.
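As a sketch, the same property can also be changed per session instead of editing hive-site.xml:

SET hive.fetch.task.conversion=none;     -- always compile a full MapReduce job
-- or
SET hive.fetch.task.conversion=minimal;  -- fetch task only for SELECT *, partition-column filters and LIMIT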
I'm running Hive 1.0, trying to compute column statistics using the built-in analyze command. HQL script looks like:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
use db;
analyze table tbl compute statistics for columns;
This kicks off a map-only MR task as expected. The job runs to 100% for both map and reduce, then reports:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
But the job is registered as a SUCCESS.
Googling led me to this JIRA ticket, but the resolution says the problem is resolved in Hive 0.14. Is there something simple I'm missing in the query?
EDIT: Five and a half years later, I've changed jobs and industries twice, picked up Spark and then abandoned Hadoop altogether in all my workflows, and the world aligned around efficient cloud data lakes that don't require a new query language. Hive is a distant memory for me, but I hope the other answer seekers found sufficient workarounds. I don't think I ever did.
I am trying to create a table in the Hive metastore using Shark by executing the following command:
CREATE TABLE src(key int, value string);
but I always get:
FAILED: Hive Internal Error: java.util.NoSuchElementException(null)
I read about the same thing in the shark-users Google group, but alas, no luck there.
My spark version is 0.8.1
My shark version is 0.8.1
Hive binary version is 0.9.0
I have hive-0.10.0 pre-installed from CDH 4.5.0, but I can't use it since Shark 0.8.1 is not compatible with hive-0.10.0 yet.
I can run various queries like select * from table_name; but not the create table query.
Even trying to create a cached table fails.
If I try to do an sbt build using my HADOOP_VERSION=2.0.0cdh4.5.0, I get a DistributedFileSystem error and I am not able to run any query.
I am in dire need of a solution. I'll be glad if somebody can point me in the right direction. I have a MySQL database, not Derby.
I encountered a similar problem, and it seems that this occurs only in 0.8.1 of Shark. I solved it by reverting to Spark and Shark 0.8.0, and it works fine.
0.8.0 and 0.8.1 are very similar in functionality and unless you are using Spark for the added functionality between the two releases, you would be better off staying with 0.8.0.
By the way, it's SPARK_HADOOP_VERSION and SHARK_HADOOP_VERSION if you intend to build those two from the source code. It's not just HADOOP_VERSION.