SMB join not working over Hive Tables - hive

While performing an SMB join over two ORC tables, bucketed and sorted on subscription_id, the join fails with the error below:
Error: java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:210)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.joinFinalLeftData(SMBMapJoinOperator.java:345)
at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.closeOp(SMBMapJoinOperator.java:610)
at org.apache.hadoop.hive.ql.exec.vector.VectorSMBMapJoinOperator.closeOp(VectorSMBMapJoinOperator.java:275)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:617)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:631)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:631)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:631)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:192)
... 8 more
The task tracker URL also doesn't give much detail.
The query is:
SELECT * FROM
user_plays_buck
INNER JOIN small_user_subscription_buck
ON user_plays_buck.subscription_id = small_user_subscription_buck.subscription_id
LIMIT 1;

I got exactly the same issue in Hive 1.1. The same query works in Hive 2.1, so upgrade your Hive.
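For reference, an SMB join requires both tables to be bucketed and sorted on the join key into compatible bucket counts, with the sort-merge conversion enabled in the session. A minimal sketch of the setup (the column list and bucket count below are assumptions, not taken from the question):

-- Session settings typically needed for a sort-merge-bucket map join
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Both tables bucketed and sorted on the join key, same bucket count
CREATE TABLE user_plays_buck (
  subscription_id BIGINT,
  play_ts STRING           -- hypothetical column, for illustration only
)
CLUSTERED BY (subscription_id) SORTED BY (subscription_id ASC) INTO 32 BUCKETS
STORED AS ORC;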

Related

Flink hive api query exception: Distinct without an aggregation

I get the following exception when querying Hive with Flink's Hive table API connector: Distinct without an aggregation.
However, the SQL query executes correctly when querying Hive through the Hue interface.
I'm wondering whether this problem is due to poor compatibility with Flink.
Flink version: 1.14.2
Hive version: 2.1.1
SQL statement:
select devid as pdevid,
count(distinct vtype) as vip_type_trans
from events
where dt = '20220702'
and utype > -1
group by devid
having count(distinct vtype) > 1
Exception:
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: org.apache.hadoop.hive.ql.parse.SemanticException: Distinct without an aggregation.
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:812)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:246)
at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1054)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.parse.SemanticException: Distinct without an aggregation.
at org.apache.flink.table.planner.delegation.hive.HiveParserCalcitePlanner.logicalPlan(HiveParserCalcitePlanner.java:304)
at org.apache.flink.table.planner.delegation.hive.HiveParserCalcitePlanner.genLogicalPlan(HiveParserCalcitePlanner.java:272)
at org.apache.flink.table.planner.delegation.hive.HiveParser.analyzeSql(HiveParser.java:290)
at org.apache.flink.table.planner.delegation.hive.HiveParser.processCmd(HiveParser.java:238)
at org.apache.flink.table.planner.delegation.hive.HiveParser.parse(HiveParser.java:208)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.sqlQuery(TableEnvironmentImpl.java:716)
at com.zhhainiao.wp.stat.PaidConversationRate$.main(PaidConversationRate.scala:158)
at com.zhhainiao.wp.stat.PaidConversationRate.main(PaidConversationRate.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355)
... 11 more
Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: Distinct without an aggregation.
at org.apache.flink.table.planner.delegation.hive.HiveParserCalcitePlanner.genSelectLogicalPlan(HiveParserCalcitePlanner.java:2275)
at org.apache.flink.table.planner.delegation.hive.HiveParserCalcitePlanner.genLogicalPlan(HiveParserCalcitePlanner.java:2749)
at org.apache.flink.table.planner.delegation.hive.HiveParserCalcitePlanner.genLogicalPlan(HiveParserCalcitePlanner.java:2647)
at org.apache.flink.table.planner.delegation.hive.HiveParserCalcitePlanner.genLogicalPlan(HiveParserCalcitePlanner.java:2688)
at org.apache.flink.table.planner.delegation.hive.HiveParserCalcitePlanner.genLogicalPlan(HiveParserCalcitePlanner.java:2647)
at org.apache.flink.table.planner.delegation.hive.HiveParserCalcitePlanner.genLogicalPlan(HiveParserCalcitePlanner.java:2688)
at org.apache.flink.table.planner.delegation.hive.HiveParserCalcitePlanner.logicalPlan(HiveParserCalcitePlanner.java:284)
Does that mean Flink is not fully compatible with Hive SQL syntax?
Flink is not fully compatible with the Hive SQL syntax. There is an open Flink ticket regarding Hive and DISTINCT, see https://issues.apache.org/jira/browse/FLINK-19004. If that matches your problem, you can track that item; otherwise I would recommend opening a new Flink Jira ticket for this bug.
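As an immediate workaround (a sketch, not verified against Flink 1.14.2), you could move the distinct count into a subquery so the HAVING clause no longer repeats the DISTINCT expression:

SELECT pdevid, vip_type_trans
FROM (
    SELECT devid AS pdevid,
           COUNT(DISTINCT vtype) AS vip_type_trans
    FROM events
    WHERE dt = '20220702'
      AND utype > -1
    GROUP BY devid
) t
WHERE vip_type_trans > 1;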

java.lang.ArrayIndexOutOfBoundsException: -1 - Hive update statement

When I try to run the Hive update statement, I get the following error.
2021-02-25 15:38:54,934 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1592334694783_33388_r_000007_3: Error: java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"rowid":3}},"value":{"_col0":"T","_col1":1111111,"......."_col44":""}}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:256)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:790)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:841)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:235)
The update query is simple.
All the columns in the target table are string or decimal.
I identified another issue pointing to a Cloudera link, but the problem is that this query runs most of the time and only fails for certain kinds of data.
Update Statement
UPDATE Table1 a
SET
email = MaskData(email,1)
WHERE d_Date >= '2017-01-01' and
email IN (select distinct email from Table2);
Any path forward or assistance would be helpful. Thanks in advance.
It looks like the data was not bucketed properly because we were inserting it from Spark.
I had to rebuild the tables completely, after which the update worked properly.
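For context, a Hive table that supports UPDATE must be a bucketed, transactional ORC table, and writers that do not honour the bucketing scheme (such as Spark here) can place rows into the wrong bucket files, which can produce this kind of FileSinkOperator failure. A minimal sketch of rebuilding the table and reloading it through Hive (table name, columns, and bucket count are assumptions):

CREATE TABLE table1_rebuilt (
  email  STRING,
  d_date STRING
)
CLUSTERED BY (email) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Reload through Hive so each row lands in its correct bucket file
INSERT INTO table1_rebuilt SELECT email, d_date FROM Table1;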

Druid : No column with timestamp with local time-zone type on query result; one column should be of timestamp with local time-zone type

I am trying to create a Druid table based on an existing Druid table using the query below, and I'm getting an error.
query:
CREATE TABLE IF NOT EXISTS database.druid_table2 STORED BY
'org.apache.hadoop.hive.druid.DruidStorageHandler' AS SELECT '__time' as `__time`,column1, column2, column3 FROM database.druid_table1;
error:
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED:
SemanticException No column with timestamp with local time-zone type on query result; one column
should be of timestamp with local time-zone type
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:300)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:286)
at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:324)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:265)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:718)
at org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:801)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:103)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:633)
at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
at org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
NOTE: I'm using Hive interactive mode since this is a Druid-related query, and the Hadoop version is 3.2.x.
I am not familiar with this method or syntax, but judging from the error it works in other cases? If so, perhaps you have to cast __time to a type that carries a time zone; I'm not sure how it's being passed.
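Along those lines, note that the query as written selects the string literal '__time' (single quotes) rather than the __time column, so the result has no timestamp column at all. A sketch of the column reference plus cast (untested; assumes the source column can be cast directly):

CREATE TABLE IF NOT EXISTS database.druid_table2
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT CAST(`__time` AS timestamp with local time zone) AS `__time`,
       column1, column2, column3
FROM database.druid_table1;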

CTE based sequence generation with HSQLDB

I am using a recursive common table expression to fetch a batch of sequence numbers. The following query works with Postgres, SQL Server, and H2 (minus the VALUES part).
WITH RECURSIVE t(n, level_num) AS (
SELECT next value for seq_parent_id as n,
1 as level_num
FROM (VALUES(0))
UNION ALL
SELECT next value for seq_parent_id as n,
level_num + 1 as level_num
FROM t
WHERE level_num < ?)
SELECT n FROM t
However, with HSQLDB 2.4.0 I get the following exception:
java.sql.SQLSyntaxErrorException: user lacks privilege or object not found: T
at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
at org.hsqldb.jdbc.JDBCStatement.fetchResult(Unknown Source)
at org.hsqldb.jdbc.JDBCStatement.executeQuery(Unknown Source)
...
Caused by: org.hsqldb.HsqlException: user lacks privilege or object not found: T
at org.hsqldb.error.Error.error(Unknown Source)
at org.hsqldb.error.Error.error(Unknown Source)
at org.hsqldb.ParserDQL.readTableName(Unknown Source)
at org.hsqldb.ParserDQL.readTableOrSubquery(Unknown Source)
at org.hsqldb.ParserDQL.XreadTableReference(Unknown Source)
at org.hsqldb.ParserDQL.XreadFromClause(Unknown Source)
at org.hsqldb.ParserDQL.XreadTableExpression(Unknown Source)
at org.hsqldb.ParserDQL.XreadQuerySpecification(Unknown Source)
at org.hsqldb.ParserDQL.XreadSimpleTable(Unknown Source)
at org.hsqldb.ParserDQL.XreadQueryPrimary(Unknown Source)
at org.hsqldb.ParserDQL.XreadQueryTerm(Unknown Source)
at org.hsqldb.ParserDQL.XreadSetOperation(Unknown Source)
at org.hsqldb.ParserDQL.XreadQueryExpressionBody(Unknown Source)
at org.hsqldb.ParserDQL.XreadQueryExpression(Unknown Source)
at org.hsqldb.ParserDQL.XreadSubqueryTableBody(Unknown Source)
at org.hsqldb.ParserDQL.XreadTableNamedSubqueryBody(Unknown Source)
at org.hsqldb.ParserDQL.XreadQueryExpression(Unknown Source)
at org.hsqldb.ParserDQL.compileCursorSpecification(Unknown Source)
at org.hsqldb.ParserCommand.compilePart(Unknown Source)
at org.hsqldb.ParserCommand.compileStatements(Unknown Source)
at org.hsqldb.Session.executeDirectStatement(Unknown Source)
at org.hsqldb.Session.execute(Unknown Source)
... 37 more
This specific use case could also be solved with a combination of UNNEST and SEQUENCE_ARRAY, but I'd like to avoid introducing an HSQLDB-specific code path.
I'd start with the simplest form of the recursive query, without the use of sequences and with a hard-coded limit, and then gradually add extra bits to it.
Based on the example in the docs With Clause and Recursive Queries the syntax should look like this:
WITH RECURSIVE
t(level_num)
AS
(
VALUES(1)
UNION ALL
SELECT
level_num + 1
FROM t
WHERE level_num < 10
)
SELECT level_num
FROM t
;
By the way, the docs say:
HyperSQL limits recursion to 265 rounds. If this is exceeded, an error
is raised.
I'd try the simplest query, like the one above, make sure that it works, then try it with, say, 1000 instead of 10 and see what error it returns. If it is the same error that you had originally, then you found the reason.
A side note: I'd use a permanent table of numbers instead of generating them on the fly recursively for this kind of task. We have a table with 100K numbers in our system. It is simple and would work in any DBMS: populate it once and use it as needed. I know that in SQL Server a recursive query is significantly slower for this kind of task; I don't know about HyperSQL. Also, limiting recursion to just 265 rounds is rather harsh. Most likely, with such a low limit on recursion depth it would be impossible to detect any difference in performance. But, again, are 265 numbers enough for your purposes?
HSQLDB has an issue with UNION ALL. In the example referred to above, "With Clause and Recursive Queries", there is no UNION ALL, only UNION (note that the documentation may have changed since).
There is some discussion about it in this thread, but at the moment I cannot get a working recursive statement in HSQLDB that uses UNION ALL.
So use UNION for HSQLDB 2.3.2+.
Fred Toussi commented on the thread above: in version 2.5.1 and upwards, UNION ALL should behave as expected.
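Concretely, the counting example from the first answer rewritten with UNION instead of UNION ALL looks like this (a sketch; because the level_num values are all distinct, the deduplication performed by UNION does not change the result):

WITH RECURSIVE t(level_num) AS (
    VALUES(1)
    UNION
    SELECT level_num + 1
    FROM t
    WHERE level_num < 10
)
SELECT level_num
FROM t;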

Insufficient data written when inserting rows

I am facing this error when running my unit test to insert some rows into my BigQuery table today:
Caused by: java.io.IOException: insufficient data written
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.close(HttpURLConnection.java:3213)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:81)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:960)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:482)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithBackOffAndGZip(MediaHttpUploader.java:504)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeUploadInitiation(MediaHttpUploader.java:456)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:348)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:418)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
I thought it was due to the new version of google-http-client (1.16.0-rc) because I updated it just before running the test, but rolling back to 1.15.0-rc had no effect.
Any idea?
Me too. Also, it seems like a sign that BigQuery just stops receiving any data, because if you query your table with count(*) after this exception, the result won't change anymore. If I keep my program running for a while, it gives me errors such as:
javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)
waiting for answers...
These errors usually happen when there is a communication failure, especially for large files.
The way to avoid them is to use a resumable upload.