Is there a way to eliminate a subquery from this Hive query? - hive

Edit: I am using Apache Hive (version 3.1.0.3.1.5.0-152)
When I run the following query:
insert into delta_table (select * from batch_table where loaddate=(select max(loaddate) from batch_table));
I get this error:
Unsupported SubQuery Expression 'loaddate': Only SubQuery expressions
that are top level conjuncts are allowed
We have a table that is written to in daily batches, with a loaddate column that is unique for each batch. The purpose of the query is to get all the records from the most recent batch without knowing what its load date is.
I suspect the issue is because I am using a subquery inside a subquery. Is there a way to change this query to do the same thing, but without the last subquery?

It depends on which version of Hive you have, but you can use a WITH clause to avoid the second subquery. The CTE has to be joined in the FROM clause before its column can be referenced:
with max_load as (select max(loaddate) as loaddate from batch_table)
insert into delta_table
select a.* from batch_table a join max_load m on a.loaddate = m.loaddate;

It looks like the error occurred because the table was created incorrectly, and for some reason this caused the query to fail. I recreated the table and it now works.

An analytic function plus a filter will be more efficient than a self-join or a subquery, both of which need one more table scan to find the max date:
insert into delta_table
select col1, col2, ... coln --list columns here
from
(
select t.*, rank() over(order by loaddate desc) rnk
from batch_table t
)s
where rnk=1;
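As a rough check that the rank-based filter and a CTE joined back to the table select the same rows, here is a minimal sketch. It uses SQLite through Python's sqlite3 module as a stand-in for Hive (window functions need SQLite 3.25 or newer), and the id column and sample rows are made up for illustration. It says nothing about relative performance in Hive.

```python
import sqlite3

# SQLite stands in for Hive here; the sample rows and the id column are
# hypothetical, only batch_table and loaddate come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batch_table (id INTEGER, loaddate TEXT);
    INSERT INTO batch_table VALUES
        (1, '2020-06-27'), (2, '2020-06-28'),
        (3, '2020-06-29'), (4, '2020-06-29');
""")

# Variant A: window function + filter. rank() gives every row of the
# newest batch the value 1, so all of them are kept.
ranked = conn.execute("""
    SELECT id, loaddate
    FROM (
        SELECT t.*, rank() OVER (ORDER BY loaddate DESC) AS rnk
        FROM batch_table t
    ) s
    WHERE rnk = 1
    ORDER BY id
""").fetchall()

# Variant B: CTE holding the max date, joined back to the table.
joined = conn.execute("""
    WITH max_load AS (SELECT max(loaddate) AS loaddate FROM batch_table)
    SELECT a.id, a.loaddate
    FROM batch_table a
    JOIN max_load m ON a.loaddate = m.loaddate
    ORDER BY a.id
""").fetchall()

print(ranked)
print(joined)
```

Both variants return the two rows of the newest batch in this toy data set.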

Related

Hive count elements of the max partition column

I'm struggling with a query that may look simple but which is causing me a lot of trouble.
SELECT COUNT(*) FROM mytable where partition_column IN (SELECT MAX(partition_column) FROM mytable )
mytable is a 2 TB Hive external table, partitioned by the column partition_column. This query takes 10 minutes to run.
When I run two separate queries:
SELECT MAX(partition_column) FROM mytable
> 2020-06-29
SELECT COUNT(*) FROM mytable where partition_column = '2020-06-29'
It works super fine and super quickly.
Am I missing something?
Thank you
I'm on Hive 1.2.1 and Hadoop 2.7.3
It looks like the subquery is taking a long time to process. Because you are filtering on the same column and table that the subquery reads, the reducer step takes a long time, which slows down the whole query.
You could improve your query by introducing a CTE, which creates a temporary result set. Something like the following:
WITH MY_CTE_SUBQUERY AS (
SELECT MAX(partition_column) as max_pc FROM mytable
)
SELECT COUNT(*)
FROM mytable
where partition_column IN (Select max_pc from MY_CTE_SUBQUERY);
More on Hive CTEs in the official documentation.
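The two-step approach from the question (look up the max value first, then filter on it as a known value) can also be scripted from a client. The sketch below only illustrates the control flow: it uses SQLite through Python's sqlite3 module, which has no partitions, and the payload column and sample rows are hypothetical. Whether a particular Hive client supports bound parameters in this way is an assumption to verify against that driver.

```python
import sqlite3

# SQLite stands in for Hive; it has no partitions, so this shows only the
# two-step control flow. The payload column and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (partition_column TEXT, payload INTEGER);
    INSERT INTO mytable VALUES
        ('2020-06-28', 1), ('2020-06-29', 2), ('2020-06-29', 3);
""")


def count_latest_partition(conn):
    """Return (newest partition value, number of rows in it)."""
    # Step 1: look up the newest partition value on its own.
    (max_pc,) = conn.execute(
        "SELECT MAX(partition_column) FROM mytable"
    ).fetchone()
    # Step 2: filter on that value, now known before the query runs.
    (row_count,) = conn.execute(
        "SELECT COUNT(*) FROM mytable WHERE partition_column = ?",
        (max_pc,),
    ).fetchone()
    return max_pc, row_count


print(count_latest_partition(conn))
```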

Select all columns grouping by version - Postgres

I need to query all columns in a table of all customers, the main factor being the latest version for each customer.
My table:
My Query:
SELECT DISTINCT ON(code)
code,
namefile,
versioncol,
status
FROM table_A
ORDER BY versioncol desc
Error:
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: SELECT DISTINCT ON(code)
Postgres' error message is trying to tell you what to do:
DISTINCT ON expressions must match initial ORDER BY expressions
Actually, that's quite clear: to make your code a valid DISTINCT ON query, you just need to add code (the DISTINCT ON expression) as the first sort criterion of the query (i.e. as the initial ORDER BY expression).
SELECT DISTINCT ON(code) a.*
FROM table_A a
ORDER BY code, versioncol DESC

Hive: less than operator error in subquery

I want the latest records from a Hive table, using the following query:
WITH lot as (select *
from to_burn_in as a where a.rel_lot='${Rel_Lot}')
select a.* from lot AS a
where not exists (select 1 from lot as b
where a.Rel_Lot=b.Rel_Lot and a.SerialNum=b.SerialNum and a.Test_Stage=b.Test_Stage
and cast(a.test_datetime as TIMESTAMP) < cast(b.Test_Datetime as TIMESTAMP))
order by a.SerialNum
This query throws the following error:
Error while compiling statement: FAILED: SemanticException line 0:undefined:-1 Unsupported SubQuery Expression 'Test_Datetime': SubQuery expression refers to both Parent and SubQuery expressions and is not a valid join condition.
I tried running it with the equality operator in place of the less-than operator in the subquery, and it runs fine. I read the Hive documentation at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
and couldn't figure out why it throws an error, since subqueries in the WHERE clause are supported.
What might be the problem here?
EXISTS actually works the same way as a join. Non-equality join conditions are not supported in Hive prior to Hive 2.2.0 (see HIVE-15211, HIVE-15251).
It seems you are trying to get the records with the latest timestamp per Rel_Lot, SerialNum, Test_Stage. Your query can be rewritten using the dense_rank() or rank() function:
WITH lot as (select *
from to_burn_in as a where a.rel_lot='${Rel_Lot}'
)
select * from
(
select a.*,
dense_rank() over(partition by Rel_Lot,SerialNum,Test_Stage order by cast(a.test_datetime as TIMESTAMP) desc) as rnk
from lot AS a
)s
where rnk=1
order by s.SerialNum
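To see what the dense_rank() rewrite keeps, here is a small sketch run against SQLite through Python's sqlite3 module instead of Hive (window functions need SQLite 3.25 or newer). The sample rows are invented, the lot value is passed as a bound parameter rather than through ${Rel_Lot}, and the timestamps are stored as ISO strings so they sort chronologically without a cast.

```python
import sqlite3

# SQLite stands in for Hive; column names follow the question, the rows
# are made up. ISO-formatted strings sort chronologically as text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE to_burn_in (
        Rel_Lot TEXT, SerialNum TEXT, Test_Stage TEXT, test_datetime TEXT
    );
    INSERT INTO to_burn_in VALUES
        ('L1', 'S1', 'A', '2020-01-01 10:00:00'),
        ('L1', 'S1', 'A', '2020-01-02 10:00:00'),
        ('L1', 'S2', 'A', '2020-01-01 09:00:00'),
        ('L1', 'S2', 'B', '2020-01-03 09:00:00'),
        ('L2', 'S9', 'A', '2020-01-05 09:00:00');
""")

# Rank rows within each (Rel_Lot, SerialNum, Test_Stage) group, newest
# first, and keep only the rows ranked 1.
latest = conn.execute("""
    WITH lot AS (SELECT * FROM to_burn_in WHERE Rel_Lot = ?)
    SELECT SerialNum, Test_Stage, test_datetime
    FROM (
        SELECT a.*,
               dense_rank() OVER (
                   PARTITION BY Rel_Lot, SerialNum, Test_Stage
                   ORDER BY test_datetime DESC
               ) AS rnk
        FROM lot a
    ) s
    WHERE rnk = 1
    ORDER BY SerialNum, Test_Stage
""", ("L1",)).fetchall()

for row in latest:
    print(row)
```

Only the newest row per serial number and test stage survives; the older S1/A row and the row from lot L2 are dropped.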

Why do partitions require nested selects?

I have a page to show 10 messages by each user (don't ask me why)
I have the following code:
SELECT *, row_number() over(partition by user_id) as row_num
FROM "posts"
WHERE row_num <= 10
It doesn't work.
When I do this:
SELECT *
FROM (
SELECT *, row_number() over(partition by user_id) as row_num FROM "posts") as T
WHERE row_num <= 10
It does work.
Why do I need a nested query to see the row_num column? By the way, in the first query I actually see it in the results, but I can't use it in the WHERE clause.
It seems to be the same "rule" as for any query: column aliases aren't visible to the WHERE clause.
This will also fail:
SELECT id AS newid
FROM test
WHERE newid=1; -- must use "id" in WHERE clause
A SQL query like:
SELECT *
FROM table
WHERE <condition>
is executed in the following order:
3. SELECT *
1. FROM table
2. WHERE <condition>
So, as Joachim Isaksson says, columns defined in the SELECT clause are not visible in the WHERE clause because of the processing order.
In your second query, the row_num column is produced by the derived table in the FROM clause first, so it is visible in the WHERE clause.
Here is a simple list of the steps in the order they execute.
There is a good reason for this rule in standard SQL.
Consider the statement:
SELECT *, row_number() over (partition by user_id) as row_num
FROM "posts" p
WHERE row_num <= 10 and p.type = 'xxx';
When does the p.type = 'xxx' get evaluated relative to the row number? In other words, would this return the first ten rows of "xxx"? Or would it return the "xxx"s in the first ten rows?
The designers of the SQL language recognized that this is a hard problem to resolve. Allowing window functions only in the SELECT clause resolves the issue.
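The two readings really do give different results once each is spelled out with an explicit nested query. The sketch below uses SQLite through Python's sqlite3 module (3.25 or newer) with made-up rows. A limit of 2 instead of 10 and an added ORDER BY id inside the window are assumptions that keep the example small and deterministic.

```python
import sqlite3

# Hypothetical sample data: one user, four posts, the first of type 'yyy'.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER, user_id INTEGER, type TEXT);
    INSERT INTO posts VALUES
        (1, 1, 'yyy'), (2, 1, 'xxx'), (3, 1, 'xxx'), (4, 1, 'xxx');
""")

# Reading 1: filter to 'xxx' first, then number the remaining rows.
# This yields the first two 'xxx' posts per user.
filter_then_number = conn.execute("""
    SELECT id FROM (
        SELECT p.*,
               row_number() OVER (PARTITION BY user_id ORDER BY id) AS row_num
        FROM posts p
        WHERE p.type = 'xxx'
    ) t
    WHERE row_num <= 2
    ORDER BY id
""").fetchall()

# Reading 2: number all rows first, then keep the 'xxx' posts that fall
# within the first two rows per user.
number_then_filter = conn.execute("""
    SELECT id FROM (
        SELECT p.*,
               row_number() OVER (PARTITION BY user_id ORDER BY id) AS row_num
        FROM posts p
    ) t
    WHERE row_num <= 2 AND type = 'xxx'
    ORDER BY id
""").fetchall()

print(filter_then_number)
print(number_then_filter)
```

With the nesting written out, the position of each predicate decides which reading applies, so there is nothing left to guess.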
You can check this topic and this one on dba.stackexchange.com about the order in which SQL executes the SELECT clause. I think it applies not only to PostgreSQL but to all RDBMSs.

How to retrieve the last 2 records from table?

I have a table with n records.
How can I retrieve the nth record and the (n-1)th record from my table in SQL without using a derived table?
I have tried using ROWID:
select * from table where rowid in (select max(rowid) from table);
It gives the nth record, but I want the (n-1)th record as well.
Is there any method other than using max, a derived table, or pseudocolumns?
Thanks
You cannot depend on rowid to get you to the last row in the table. You need an auto-incrementing id or creation time to have the proper ordering.
You can use, for instance:
select *
from (select t.*, row_number() over (order by <id> desc) as seqnum
from t
) t
where seqnum <= 2
Although allowed in the syntax, the order by clause in a subquery is ignored (for instance http://docs.oracle.com/javadb/10.8.2.2/ref/rrefsqlj13658.html).
Just to be clear, rowids have nothing to do with the ordering of rows in a table. The Oracle documentation is quite clear that they specify a physical access path for the data (http://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#i6732). It is true that in an empty database, inserting records into a new table will probably create a monotonically increasing sequence of row ids. But you cannot depend on this. The only guarantees with rowids are that they are unique within a table and are the fastest way to access a particular row.
I have to admit that I cannot find good documentation on Oracle handling or not handling order by's in subqueries in its most recent versions. ANSI SQL does not require compliant databases to support order by in subqueries. Oracle syntax allows it, and it seems to work in some cases, at least. My best guess is that it would probably work on a single processor, single threaded instance of Oracle, or if the data access is through an index. Once parallelism is introduced, the results would probably not be ordered. Since I started using Oracle (in the mid-1990s), I have been under the impression that order bys in subqueries are generally ignored. My advice would be to not depend on the functionality, until Oracle clearly states that it is supported.
select * from (select * from my_table order by rowid desc) where rownum <= 2
and for rows between N and M:
select * from (
select t.*, rownum as rn from (
select * from my_table order by rowid
) t where rownum <= M
) where rn >= N -- filter on the alias: an outer "rownum >= N" with N > 1 matches no rows
Try this (SQL Server syntax):
select top 2 * from table order by rowid desc
Assuming rowid is a column in your table:
SELECT * FROM table ORDER BY rowid DESC LIMIT 2