Subquery in Hive Where Clause


I need help with this query. How can I write the following in Hive?
SELECT *
FROM tableA
WHERE colA = (SELECT MAX(date_column) FROM tableA)
I need to query only the latest records from the table. I store dates in Hive as strings in "yyyy-mm-dd" format.

Avoid a JOIN and use the analytics and windowing features instead:
select * from (select *, rank() over (order by date_column desc) as rnk
from tableA) S where S.rnk = 1;

Something like this might work:
SELECT a.*
FROM tableA a
JOIN (SELECT MAX(date_column) AS max_date_column
FROM tableA) b
ON a.colA = b.max_date_column
Hope it helps
EDIT: I have no idea how I got to this old question, you probably solved it long ago :)

Note that Hive 0.13+ supports subqueries in the WHERE clause, but only as part of IN, NOT IN, EXISTS, and NOT EXISTS predicates.
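To compare the approaches in this thread side by side, here is a minimal sketch that runs the scalar-subquery, JOIN, and rank() variants against the same data. It uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example is runnable; window functions need SQLite 3.25 or later), the sample rows are made up, and the filter is applied to date_column directly. Because "yyyy-mm-dd" strings sort the same way as the dates they represent, MAX() on the string column returns the latest date.

```python
import sqlite3

# SQLite stands in for Hive here; the table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (id INTEGER, date_column TEXT)")
conn.executemany(
    "INSERT INTO tableA VALUES (?, ?)",
    [(1, "2014-01-31"), (2, "2014-02-15"), (3, "2014-02-15"), (4, "2013-12-01")],
)

queries = {
    # The form from the question: compare against a scalar subquery.
    "scalar_subquery": """
        SELECT id FROM tableA
        WHERE date_column = (SELECT MAX(date_column) FROM tableA)
        ORDER BY id
    """,
    # The JOIN workaround: join against a one-row derived table.
    "join_on_max": """
        SELECT a.id
        FROM tableA a
        JOIN (SELECT MAX(date_column) AS max_date FROM tableA) b
          ON a.date_column = b.max_date
        ORDER BY a.id
    """,
    # The windowing form: rank() gives every row on the latest date rank 1.
    "rank_window": """
        SELECT id FROM (
            SELECT id, RANK() OVER (ORDER BY date_column DESC) AS rnk
            FROM tableA
        ) s
        WHERE s.rnk = 1
        ORDER BY id
    """,
}

results = {name: [row[0] for row in conn.execute(sql)] for name, sql in queries.items()}
print(results)
```

All three variants return both rows that share the latest date. Swapping rank() for row_number() would keep only one of the tied rows.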

Related

Alternate for max() function in db2

I am executing a complex query in DB2, and its response time is quite high. After a lot of investigation, I found that the repeated use of the max() function is slowing down execution.
So I would like to know whether there is an alternative to the max() function. I read a bit about rank() and wondered whether it could be used, but I wasn't able to get the result I wanted.
A part of my query is:
Select DISTINCT
(Select MAX(DATE(last_timestamp)) from Table_1 where ID = E.EID)
From Table_2 E
I have been stuck on this for too long now, so any help would be appreciated.

row_number() is better than rank() here since you don't want duplicates. I'm thankful for this site as a reference: https://www.ibm.com/developerworks/community/blogs/SQLTips4DB2LUW/entry/finding_the_maximum_row_and_more56?lang=en
SELECT DISTINCT last_timestamp
FROM (SELECT ROW_NUMBER() OVER(PARTITION BY A.ID ORDER BY A.last_timestamp DESC) AS rn, DATE(A.last_timestamp) AS last_timestamp
FROM Table_1 A, Table_2 E WHERE A.ID = E.EID) AS t
WHERE rn = 1;
The syntax might not be exact since I did not test it and I don't have DB2 installed.

The issue may be the conversion of the timestamp to a date with DATE() before applying MAX(). I'm not a DB2 expert, but I would guess that this may end up creating a temp table of dates that is then MAXed, so that indexes are not used.
I suggest trying the following:
Select DISTINCT
DATE((
Select MAX(last_timestamp)
from Table_1
where ID = E.EID
))
From Table_2 E
Of course, the query optimizer may be smart enough to do this already.

Perhaps you can join on the following CTE:
WITH t1(i,t) AS
( SELECT id, MAX(last_timestamp)
FROM Table_1
GROUP BY id )
SELECT
...
DATE(t1.t),
...
FROM Table_2
LEFT JOIN t1 ON Table_2.EID = t1.i
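As a rough check that the grouped CTE gives the same answer as the original correlated MAX() subquery, here is a small sketch. It runs on SQLite through Python's sqlite3 module rather than on DB2 (an assumption made so the example is runnable), timestamps are stored as text, and the sample rows are invented. It says nothing about DB2 performance; it only shows that the two formulations agree, including for an EID with no matching rows.

```python
import sqlite3

# SQLite stands in for DB2; the rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table_1 (ID INTEGER, last_timestamp TEXT);
    CREATE TABLE Table_2 (EID INTEGER);
    INSERT INTO Table_1 VALUES
        (1, '2015-03-01 08:00:00'),
        (1, '2015-03-04 17:30:00'),
        (2, '2015-02-10 09:15:00');
    INSERT INTO Table_2 VALUES (1), (2), (3);
""")

# Original shape: one correlated MAX() lookup per row of Table_2.
correlated = """
    SELECT E.EID,
           (SELECT MAX(DATE(last_timestamp)) FROM Table_1 WHERE ID = E.EID) AS last_date
    FROM Table_2 E
    ORDER BY E.EID
"""

# CTE shape: aggregate Table_1 once, then join the result.
grouped_cte = """
    WITH t1(i, t) AS (
        SELECT ID, MAX(last_timestamp) FROM Table_1 GROUP BY ID
    )
    SELECT E.EID, DATE(t1.t) AS last_date
    FROM Table_2 E
    LEFT JOIN t1 ON E.EID = t1.i
    ORDER BY E.EID
"""

correlated_rows = conn.execute(correlated).fetchall()
grouped_rows = conn.execute(grouped_cte).fetchall()
print(correlated_rows == grouped_rows, grouped_rows)
```

The LEFT JOIN matters: an inner join would drop EID 3, which the correlated subquery reports with a NULL date.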

SQL combine two query results

I can't use a UNION because it doesn't give the result I want, and I can't use a JOIN because the two queries have no common column. I have tried many different SQL query structures, and nothing works the way I want.
I need help with what I believe is a really simple SQL query. What I am doing now is:
select a, b
from (select top 4 a from element_type order by c) as Y,
(SELECT * FROM (VALUES (NULL), (1), (2), (3)) AS X(b)) as Z
The first is part of a table and the second is a hand-written select. They give results like this:
select a; --Give--> a,b,c,d (1 column)
select b; --Give--> 1,2,3,4 (1 column)
I need a query based on those two that gives me two columns:
a,1
b,2
c,3
d,4
How can I do this? With UNION, JOIN, or something else? Or maybe it can't be done.
All I can get for now is this:
a,1
a,2
a,3
a,4
b,1
b,2
...

If you want to join two tables together purely on the order the rows appear, then I hope your database supports analytic (window) functions:
SELECT * FROM
(SELECT t.*, ROW_NUMBER() OVER(ORDER BY x) as rown FROM table1 t) t1
INNER JOIN
(SELECT t.*, ROW_NUMBER() OVER(ORDER BY x) as rown FROM table2 t) t2
ON t1.rown = t2.rown
Essentially we invent something to join them on by numbering the rows. If one of your tables already contains incrementing integers from 1, you don't need ROW_NUMBER() OVER() on that table, because it already has suitable data to join to; you just invent a fake column of incrementing numbers in the other table and then join them together.
Actually, even if it doesn't support analytics, there are ugly ways of doing row numbering, such as joining the table back to itself using id < id and COUNT(*) .. GROUP BY id to number the rows. I hate doing it, but if your DB doesn't support ROW_NUMBER I'll post an example.. :/
Bear in mind, of course, that RDBMS have R in the name for a reason - related data is.. well.. related. They don't do so well when data is unrelated, so if your hope is to join the "chalks" table to the "cheese" table even though the two are completely unrelated, you're finding out now why it's hard work! :)
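The row-numbering join can be tried out with a short sketch like the one below. It uses Python's sqlite3 module (an assumption made for runnability; window functions need SQLite 3.25 or later), and the tables letters and numbers are hypothetical stand-ins for the two result sets in the question.

```python
import sqlite3

# Hypothetical tables standing in for the two single-column result sets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE letters (a TEXT);
    CREATE TABLE numbers (b INTEGER);
    INSERT INTO letters VALUES ('a'), ('b'), ('c'), ('d');
    INSERT INTO numbers VALUES (1), (2), (3), (4);
""")

# Listing both sources with a comma is a cross join: every letter meets every number.
cross_count = conn.execute("SELECT COUNT(*) FROM letters, numbers").fetchone()[0]

# Number the rows on each side, then join on that invented key.
pairs = conn.execute("""
    SELECT t1.a, t2.b
    FROM (SELECT a, ROW_NUMBER() OVER (ORDER BY a) AS rown FROM letters) t1
    INNER JOIN
         (SELECT b, ROW_NUMBER() OVER (ORDER BY b) AS rown FROM numbers) t2
      ON t1.rown = t2.rown
    ORDER BY t1.rown
""").fetchall()

print(cross_count)  # 16
print(pairs)        # [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
```

The pairing is only as meaningful as the ORDER BY inside each ROW_NUMBER(), since that ordering decides which row lines up with which.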

Try using row_number. I've created something that might help you. See below:
declare @tableChar table(letter varchar)
insert into @tableChar(letter)
select 'a';
insert into @tableChar(letter)
select 'b';
insert into @tableChar(letter)
select 'c';
insert into @tableChar(letter)
select 'd';
select letter, ROW_NUMBER() over(order by letter) from @tableChar

You can use row_number() to achieve this:
select a, row_number() over(order by a) as b from element_type;
Since the second column does not come from another table, you do not need a join. But if the two columns come from different tables, you can use row_number() to create a key for both tables and join on those keys.
Hope this helps.

how do you retrieve the latest value from an oracle table

I am trying to get the latest value from an Oracle table based on server name. I have the following SQL:
SELECT T."Node",T."Timestamp",T."MAX_User_CPU_Pct", T."MAX_System_CPU_Pct"
FROM DW.KPX_CPU_DETAIL_HV T where T."Node"='serverA%' and T."Timestamp"=
(select max(P."Timestamp") from DW.KPX_CPU_DETAIL_HV P where P."Node"='serverA%')
It does not seem to be working. Any ideas what I might be doing wrong here?

Try this, might actually be faster than the sub-select (even if it was correct):
SELECT T."Node",
T."Timestamp",
T."MAX_User_CPU_Pct",
T."MAX_System_CPU_Pct"
FROM (
SELECT p.*,
row_Number() over (partition by p."Node" order by p."Timestamp" desc) as rn
FROM DW.KPX_CPU_DETAIL_HV p
) t
where rn = 1;

SELECT T."Node",T."Timestamp",T."MAX_User_CPU_Pct", T."MAX_System_CPU_Pct"
FROM (SELECT * FROM DW.KPX_CPU_DETAIL_HV T where T."Node" like 'serverA%' ORDER BY T."Timestamp" DESC) T
WHERE ROWNUM = 1
This did it for me. I'm not sure it is the best solution, but it is working for now.
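The question's query compares "Node" to 'serverA%' with =, which treats % as an ordinary character, while the working version uses LIKE. A minimal sketch of the difference follows. It runs on SQLite through Python's sqlite3 module instead of Oracle (an assumption), and the table name cpu_detail, the node names, and the readings are all made up.

```python
import sqlite3

# SQLite stands in for Oracle; cpu_detail and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cpu_detail ("Node" TEXT, "Timestamp" TEXT, "MAX_User_CPU_Pct" REAL);
    INSERT INTO cpu_detail VALUES
        ('serverA:KUX', '2013-05-01 10:00:00', 12.5),
        ('serverA:KUX', '2013-05-01 11:00:00', 40.0),
        ('serverB:KUX', '2013-05-01 12:00:00', 99.0);
""")

# With =, the pattern is compared literally, so no node matches.
equals_rows = conn.execute(
    """SELECT "Node" FROM cpu_detail WHERE "Node" = 'serverA%'"""
).fetchall()

# With LIKE, % is a wildcard; the subquery must use the same filter.
like_rows = conn.execute("""
    SELECT T."Node", T."Timestamp", T."MAX_User_CPU_Pct"
    FROM cpu_detail T
    WHERE T."Node" LIKE 'serverA%'
      AND T."Timestamp" = (SELECT MAX(P."Timestamp")
                           FROM cpu_detail P
                           WHERE P."Node" LIKE 'serverA%')
""").fetchall()

print(equals_rows)  # []
print(like_rows)    # [('serverA:KUX', '2013-05-01 11:00:00', 40.0)]
```

If the subquery were left unfiltered, it would pick up serverB's later timestamp and the outer query would return nothing.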

T-SQL, how to do this group by query?

I have a view with this information:
TableA (IDTableA, IDTableB, IDTableC, Active, date, ...)
For each record in TableA and each record in TableC, I want the record of TableB that has the max date and is active.
select IDTableA, IDtableC, IDTableB, Date, Active
from myView
where Active = 1
group by IDTableA, IDTableC
Having Max(Date)
order by IDTableA;
This query works in SQLite, but if I try it in SQL Server I get an error saying that IDTableB in the select list is not contained in the GROUP BY clause.
I know that in theory the query shouldn't work in SQLite either, but it does.
How can I write this query in SQL Server?
Thanks.

According to SQL 92, if you use GROUP BY clause, then in SELECT output expression list you can only use columns mentioned in GROUP BY list, or any other columns but they must be wrapped in aggregate functions like count(), sum(), avg(), max(), min() and so on.
Some servers, like MSSQL and Postgres, are strict about this rule, while others, like MySQL and SQLite, are very relaxed and forgiving, which is why it surprised you.
Long story short - if you want this to work in MSSQL, adhere to SQL92 requirement.

This query works in SQL Server:
select IDTableA, IDtableC, IDTableB, Date, Active
from myView v1
where Active = 1
AND EXISTS (
SELECT 1
FROM myView v2
WHERE v2.Active = 1
AND v2.IDTableA = v1.IDTableA
AND v2.IDTableC = v1.IDTableC
group by v2.IDTableA, v2.IDTableC
Having Max(v2.Date) = v1.Date
)
order by v1.IDTableA;
Alternatively, in SQL Server 2005+ you can use a CTE with ROW_NUMBER:
;WITH cte AS
(
select IDTableA, IDtableC, IDTableB, [Date], Active,
ROW_NUMBER() OVER(PARTITION BY IDTableA, IDTableC ORDER BY [Date] DESC) AS rn
from myView v1
where Active = 1
)
SELECT *
FROM cte
WHERE rn = 1
ORDER BY IDTableA
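The two "latest active row per group" formulations, a correlated MAX() subquery and a ROW_NUMBER() CTE, can be compared with a small sketch. It runs on SQLite through Python's sqlite3 module rather than SQL Server (an assumption; window functions need SQLite 3.25 or later), myView is created as a plain table, and the rows are invented.

```python
import sqlite3

# SQLite stands in for SQL Server; myView is a plain table with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myView (IDTableA INTEGER, IDTableB INTEGER, IDTableC INTEGER,
                         Active INTEGER, "Date" TEXT);
    INSERT INTO myView VALUES
        (1, 10, 100, 1, '2013-01-01'),
        (1, 11, 100, 1, '2013-02-01'),
        (1, 12, 100, 0, '2013-03-01'),  -- newest row in its group, but inactive
        (2, 20, 200, 1, '2013-01-15');
""")

# Correlated subquery: keep rows whose date is the max active date of their group.
correlated = """
    SELECT v1.IDTableA, v1.IDTableC, v1.IDTableB, v1."Date"
    FROM myView v1
    WHERE v1.Active = 1
      AND v1."Date" = (SELECT MAX(v2."Date")
                       FROM myView v2
                       WHERE v2.IDTableA = v1.IDTableA
                         AND v2.IDTableC = v1.IDTableC
                         AND v2.Active = 1)
    ORDER BY v1.IDTableA
"""

# CTE: number active rows per group from newest to oldest and keep the first.
row_number_cte = """
    WITH cte AS (
        SELECT IDTableA, IDTableC, IDTableB, "Date",
               ROW_NUMBER() OVER (PARTITION BY IDTableA, IDTableC
                                  ORDER BY "Date" DESC) AS rn
        FROM myView
        WHERE Active = 1
    )
    SELECT IDTableA, IDTableC, IDTableB, "Date"
    FROM cte
    WHERE rn = 1
    ORDER BY IDTableA
"""

correlated_rows = conn.execute(correlated).fetchall()
cte_rows = conn.execute(row_number_cte).fetchall()
print(correlated_rows == cte_rows, cte_rows)
```

The two forms differ only on ties: the subquery returns every row that shares the max date, while ROW_NUMBER() keeps exactly one per group.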

Try this,
select * from table1 b
where active = 1
and date = (select max(date) from table1
where idtablea = b.idtablea
and idtablec = b.idtablec
and active = 1);

group by with where not working

SELECT A.ID, A.COLUMN_B, A.COLUMN_C FROM A
WHERE A.COLUMN_A IN
(
SELECT A.COLUMN_A
FROM B
INNER JOIN A ON B."COLUMN_A" = A."COLUMN_A"
WHERE B."COLUMN_B" = 'something'
UNION
SELECT A."COLUMN_A"
FROM A
WHERE A."COLUMN_D" IN (X,Y,Z) OR A."COLUMN_D" = 'something'
)
Now I want to add a GROUP BY (A.ID) and an ORDER BY (A.COLUMN_B) DESC, and then select the first row. But the DB won't allow it. Any suggestions? I could use LINQ to solve it once the inner UNION part is returned, but I do not want to go that way.

There's a couple of things here.
First off - in DB2, when using GROUP BY, you can only select those columns listed in the grouping statement - everything else must be part of an aggregation function. So, grouping by a.Id and ordering by a.Column_B won't work - you'll need to order by SUM(a.Column_B) or something applicable.
Second... your query could use a bit of work in the general sense - specifically, you're self-joining twice, which you don't need to do at all. Try this instead:
SELECT a.Id, SUM(a.Column_B) as total, SUM(a.Column_C)
FROM a
WHERE a.Column_D in (X, Y, Z, 'Something')
OR EXISTS (SELECT '1'
FROM b
WHERE b.Column_A = a.Column_A
AND b.Column_B = 'Something')
GROUP BY a.Id
ORDER BY total DESC
FETCH FIRST 1 ROW ONLY
Swap out the SUM function for whatever is appropriate.
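Here is a minimal runnable sketch of that rewrite. It uses SQLite through Python's sqlite3 module instead of DB2 (an assumption), so LIMIT 1 takes the place of FETCH FIRST 1 ROW ONLY, and the X, Y, Z placeholders become string literals. The rows are hypothetical.

```python
import sqlite3

# SQLite stands in for DB2; the table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (Id INTEGER, Column_A TEXT, Column_B INTEGER,
                    Column_C INTEGER, Column_D TEXT);
    CREATE TABLE b (Column_A TEXT, Column_B TEXT);
    INSERT INTO a VALUES
        (1, 'k1', 5, 1, 'X'),
        (1, 'k1', 7, 2, 'other'),
        (2, 'k2', 20, 3, 'other'),
        (3, 'k3', 50, 4, 'other');   -- matches neither condition
    INSERT INTO b VALUES ('k1', 'Something'), ('k2', 'Something');
""")

top_group = conn.execute("""
    SELECT a.Id, SUM(a.Column_B) AS total, SUM(a.Column_C) AS total_c
    FROM a
    WHERE a.Column_D IN ('X', 'Y', 'Z', 'Something')
       OR EXISTS (SELECT 1
                  FROM b
                  WHERE b.Column_A = a.Column_A
                    AND b.Column_B = 'Something')
    GROUP BY a.Id
    ORDER BY total DESC
    LIMIT 1  -- stand-in for DB2's FETCH FIRST 1 ROW ONLY
""").fetchall()

print(top_group)  # [(2, 20, 3)]
```

Id 3 has the largest Column_B but is filtered out before grouping, so the top group is Id 2.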

You can't use a column in the ORDER BY or SELECT that you haven't included in the GROUP BY, unless it's being aggregated (in a function like MAX(), COUNT(), or SUM()).
So, you could GROUP BY A.ID,A.COLUMN_B, and then ORDER BY COLUMN_B. Using a TOP 1 should work, too.
I just noticed that you're on DB2. I know that it will work this way on SQLServer. DB2 should be similar.

Worked the other way around. I just used ORDER BY on A.ID and selected the row with the max identity column.
