Get row with max value in Hive/SQL? - sql

I'm new to Hive/SQL, and I'm stuck on a fairly simple problem. My data looks like:
+------------+--------------------+-----------------------+
| carrier_iD | meandelay | meancanceled |
+------------+--------------------+-----------------------+
| EV | 13.795802119653473 | 0.028584251044292006 |
| VX | 0.450591016548463 | 2.364066193853424E-4 |
| F9 | 10.898001378359766 | 0.00206753962784287 |
| AS | 0.5071547420965062 | 0.0057404326123128135 |
| HA | 1.2031093279839498 | 5.015045135406214E-4 |
| 9E | 8.147899230704216 | 0.03876067292247866 |
| B6 | 9.45383857757506 | 0.003162096314343487 |
| UA | 8.101511665305816 | 0.005467725574605967 |
| FL | 0.7265068895709532 | 0.0041141513746490044 |
| WN | 7.156119279121648 | 0.0057419058192869415 |
| DL | 4.206288692245839 | 0.005123990066804269 |
| YV | 6.316802855264404 | 0.029304029304029346 |
| US | 3.2221527095063736 | 0.007984031936127766 |
| OO | 6.954715814690328 | 0.02596499362466706 |
| MQ | 9.74568222216328 | 0.025628100708354324 |
| AA | 8.720522654298968 | 0.019242775597574157 |
+------------+--------------------+-----------------------+
I want Hive to return the row with the maximum meandelay value. I have:
SELECT CAST(MAX(meandelay) as FLOAT) FROM flightinfo;
which indeed returns the max (I use cast because my values are saved as STRING). So then:
SELECT * FROM flightinfo WHERE meandelay = (SELECT CAST(MAX(meandelay) AS FLOAT) FROM flightinfo);
I get the following error:
FAILED: ParseException line 1:44 cannot recognize input near 'select' 'cast' '(' in expression specification

Use the windowing and analytics functions:
SELECT carrier_id, meandelay, meancanceled
FROM
(SELECT carrier_id, meandelay, meancanceled,
rank() over (order by cast(meandelay as float) desc) as r
FROM flightinfo) S
WHERE S.r = 1;
This also handles ties: if more than one row has the same max value, you get all of those rows back. If you want just a single row, change rank() to row_number() or add another term to the order by.
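The tie behaviour is easy to check on a small sample. The sketch below is an illustration, not Hive itself: it runs the same window-function query in SQLite (version 3.25 or later is assumed, since older versions lack window functions) through Python's sqlite3 module. The carrier XX and the rounded delay values are hypothetical, added only to create a tie for the maximum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flightinfo (carrier_id TEXT, meandelay TEXT)")
# Hypothetical data: EV and XX are tied for the largest delay.
conn.executemany(
    "INSERT INTO flightinfo VALUES (?, ?)",
    [("EV", "13.8"), ("XX", "13.8"), ("MQ", "9.7"), ("VX", "0.45")],
)

# Same shape as the query above; only the ranking function changes.
QUERY = """
SELECT carrier_id
FROM (SELECT carrier_id,
             {fn}() OVER (ORDER BY CAST(meandelay AS REAL) DESC) AS r
      FROM flightinfo) s
WHERE s.r = 1
"""

rank_rows = [row[0] for row in conn.execute(QUERY.format(fn="rank"))]
row_number_rows = [row[0] for row in conn.execute(QUERY.format(fn="row_number"))]

print(sorted(rank_rows))      # both tied carriers
print(len(row_number_rows))   # exactly one row, chosen arbitrarily among the ties
```

Which of the tied rows row_number() keeps is not defined unless the order by includes a tie-breaking column such as carrier_id.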

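Use a join instead. Note that the cast has to go inside MAX(): since meandelay is a STRING, MAX(meandelay) compares the values as text.
SELECT a.* FROM flightinfo a LEFT SEMI JOIN
(SELECT MAX(CAST(meandelay AS DOUBLE)) AS maxdelay
FROM flightinfo) b ON (CAST(a.meandelay AS DOUBLE) = b.maxdelay);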
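Where the cast sits relative to MAX() matters when the column is stored as a string. The following sketch uses SQLite through Python's sqlite3 module as a stand-in (an assumption: Hive is expected to compare STRING values lexicographically in the same way, but that is not verified here). It uses three rows from the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flightinfo (carrier_id TEXT, meandelay TEXT)")
conn.executemany(
    "INSERT INTO flightinfo VALUES (?, ?)",
    [
        ("EV", "13.795802119653473"),
        ("MQ", "9.74568222216328"),
        ("VX", "0.450591016548463"),
    ],
)

# MAX over the text column compares character by character, so "9..." beats "13...".
text_max = conn.execute("SELECT MAX(meandelay) FROM flightinfo").fetchone()[0]

# Casting each value before aggregating gives the numeric maximum.
num_max = conn.execute(
    "SELECT MAX(CAST(meandelay AS REAL)) FROM flightinfo"
).fetchone()[0]

# Row lookup with the cast applied on both sides of the comparison.
max_rows = conn.execute(
    """
    SELECT carrier_id FROM flightinfo
    WHERE CAST(meandelay AS REAL) =
          (SELECT MAX(CAST(meandelay AS REAL)) FROM flightinfo)
    """
).fetchall()

print(text_max, num_max, max_rows)
```

So a query of the form CAST(MAX(meandelay) AS FLOAT) can silently pick the wrong row whenever the values differ in their number of integer digits.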

You can use the collect_max UDF from Brickhouse ( http://github.com/klout/brickhouse ) to solve this problem, passing in a value of 1, meaning that you only want the single max value.
select array_index( map_keys( collect_max( carrier_id, meandelay, 1) ), 0 ) from flightinfo;
Also, I've read somewhere that the Hive max UDF does allow you to access other fields on the row, but I think it's easier just to use collect_max.

I don't think your sub-query is allowed ...
A quick look here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
states:
As of Hive 0.13 some types of subqueries are supported in the WHERE
clause. Those are queries where the result of the query can be treated
as a constant for IN and NOT IN statements (called uncorrelated
subqueries because the subquery does not reference columns from the
parent query):

Related

SQL Check if the User has IN and Out

I need help finding the users who have only an 'IN' in the IsIN column. If a user has both an IN and an OUT, they should not be selected. Thanks in advance.
This is the table:
| Users | IsIN |
|:------------------:|:-----:|
| MHYHDC61TMJ907867 | IN |
| MHYHDC61TMJ907867 | OUT |
| MHYHDC61TMJ907922 | IN |
| MHYHDC61TMJ907922 | OUT |
| MHYHDC61TMJ907923 | IN |
| MHYHDC61TMJ907923 | OUT |
| MHYHDC61TMJ907924 | IN | - I need to get only this row
| MHYHDC61TMJ907925 | IN |
| MHYHDC61TMJ907925 | OUT |
| MHYHDC61TMJ908054 | IN | - I need to get only this row
| MHYHDC61TMJ908096 | IN | - I need to get only this row
| MHYHDC61TMJ908109 | IN | - I need to get only this row
Need to get the result like
| Users | IsIN |
|:------------------:|:-----:|
| MHYHDC61TMJ907924 | IN |
| MHYHDC61TMJ908054 | IN |
| MHYHDC61TMJ908096 | IN |
| MHYHDC61TMJ908109 | IN |
I tried using this query and sample query below but it doesn't work.
select s.[Users], s.[isIn] from [dbo].[tblIO] s
where not exists (
select 1
from [dbWBS].[dbo].[tblIO] s2
where s2.[Users] = s.[Users] and s2.isIn = 'IN'
);
You can use not exists:
select s.*
from sample s
where not exists (select 1
from sample s2
where s2.Users = s.Users and s2.IsIN = 'OUT'
);
If you want only users that meet the condition (and not the full rows):
select Users
from sample s
group by Users
having min(IsIN) = max(IsIN) and min(IsIN) = 'IN';
Bearing in mind that an 'OUT' record must always be preceded by an 'IN' record for the same user, you could use a query like this:
select s.Users, 'IN' as IsIn
from sample s
group by s.Users
having count(distinct s.IsIn) = 1
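To compare the approaches, here is a minimal sketch that runs the NOT EXISTS and the GROUP BY / HAVING variants against a subset of the question's rows. It uses SQLite through Python's sqlite3 module instead of SQL Server, and assumes a table named sample with the question's Users and IsIN columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (Users TEXT, IsIN TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [
        ("MHYHDC61TMJ907867", "IN"), ("MHYHDC61TMJ907867", "OUT"),
        ("MHYHDC61TMJ907924", "IN"),
        ("MHYHDC61TMJ907925", "IN"), ("MHYHDC61TMJ907925", "OUT"),
        ("MHYHDC61TMJ908054", "IN"),
    ],
)

# Anti-join: keep rows for which no OUT row exists for the same user.
not_exists = conn.execute(
    """
    SELECT s.Users FROM sample s
    WHERE NOT EXISTS (SELECT 1 FROM sample s2
                      WHERE s2.Users = s.Users AND s2.IsIN = 'OUT')
    ORDER BY s.Users
    """
).fetchall()

# Aggregate: keep users whose only IsIN value is 'IN'.
group_by = conn.execute(
    """
    SELECT Users FROM sample
    GROUP BY Users
    HAVING MIN(IsIN) = MAX(IsIN) AND MIN(IsIN) = 'IN'
    ORDER BY Users
    """
).fetchall()

print(not_exists)
print(group_by)
```

Both queries return the same two users here. The NOT EXISTS form returns one row per matching source row, while the grouped form returns one row per user, which matters only if a user can have several IN rows.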

SQL - Given sequence of data, how do I query the origin?

Let's assume we have the following data.
| UUID | SEENTIME | LAST_SEENTIME |
------------------------------------------------------
| UUID1 | 2020-11-10T05:00:00 | |
| UUID2 | 2020-11-10T05:01:00 | 2020-11-10T05:00:00 |
| UUID3 | 2020-11-10T05:03:00 | 2020-11-10T05:01:00 |
| UUID4 | 2020-11-10T05:04:00 | 2020-11-10T05:03:00 |
| UUID5 | 2020-11-10T05:07:00 | 2020-11-10T05:04:00 |
| UUID6 | 2020-11-10T05:08:00 | 2020-11-10T05:07:00 |
Each row is connected to the previous one via LAST_SEENTIME.
In such case, is there a way to use SQL to identify these connected events as one? I want to be able to calculate start and end to calculate the duration of this event.
You can use a recursive CTE. The exact syntax varies by database, but something like this:
with recursive cte as (
select uuid as orig_uuid, uuid, seentime
from t
where last_seentime is null
union all
select cte.orig_uuid, t.uuid, t.seentime
from cte join
t
on cte.seentime = t.last_seentime
)
select orig_uuid,
max(seentime) - min(seentime) -- or whatever your database uses
from cte
group by orig_uuid;
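As a concrete check, the sketch below runs that recursive CTE in SQLite through Python's sqlite3 module on the question's rows. The table name t follows the answer; the second chain (UUID7, UUID8) is hypothetical and is there only to show that each chain gets its own group. It also assumes SEENTIME values are unique, since the join links rows purely by timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (uuid TEXT, seentime TEXT, last_seentime TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("UUID1", "2020-11-10T05:00:00", None),
        ("UUID2", "2020-11-10T05:01:00", "2020-11-10T05:00:00"),
        ("UUID3", "2020-11-10T05:03:00", "2020-11-10T05:01:00"),
        ("UUID4", "2020-11-10T05:04:00", "2020-11-10T05:03:00"),
        ("UUID5", "2020-11-10T05:07:00", "2020-11-10T05:04:00"),
        ("UUID6", "2020-11-10T05:08:00", "2020-11-10T05:07:00"),
        # Hypothetical second event, not part of the question's data.
        ("UUID7", "2020-11-10T06:00:00", None),
        ("UUID8", "2020-11-10T06:02:00", "2020-11-10T06:00:00"),
    ],
)

events = conn.execute(
    """
    WITH RECURSIVE cte(orig_uuid, uuid, seentime) AS (
        -- Anchor: rows with no predecessor start a chain.
        SELECT uuid, uuid, seentime FROM t WHERE last_seentime IS NULL
        UNION ALL
        -- Step: follow the link from a row to its successor.
        SELECT cte.orig_uuid, t.uuid, t.seentime
        FROM cte JOIN t ON cte.seentime = t.last_seentime
    )
    SELECT orig_uuid,
           MIN(seentime) AS started,
           MAX(seentime) AS ended,
           CAST(strftime('%s', MAX(seentime)) AS INTEGER)
             - CAST(strftime('%s', MIN(seentime)) AS INTEGER) AS seconds
    FROM cte
    GROUP BY orig_uuid
    ORDER BY orig_uuid
    """
).fetchall()

for event in events:
    print(event)
```

The duration arithmetic via strftime('%s', ...) is SQLite-specific; other databases would use their own interval or timestamp difference functions.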

Query Optimization - subselect in Left Join

I'm working on optimizing a SQL query, and I found a particular fragment that appears to be killing my query's performance:
LEFT JOIN anothertable lastweek
AND lastweek.date>= (SELECT MAX(table.date)-7 max_date_lweek
FROM table table
WHERE table.id= lastweek.id)
AND lastweek.date< (SELECT MAX(table.date) max_date_lweek
FROM table table
WHERE table.id= lastweek.id)
I'm working on a way of optimizing these lines, but I'm stumped. If anyone has any ideas, please let me know!
-----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1908654 | 145057704 | 720461 | 00:00:29 |
| * 1 | HASH JOIN RIGHT OUTER | | 1908654 | 145057704 | 720461 | 00:00:29 |
| 2 | VIEW | VW_DCL_880D8DA3 | 427487 | 7694766 | 716616 | 00:00:28 |
| * 3 | HASH JOIN | | 427487 | 39328804 | 716616 | 00:00:28 |
| 4 | VIEW | VW_SQ_2 | 7174144 | 193701888 | 278845 | 00:00:11 |
| 5 | HASH GROUP BY | | 7174144 | 294139904 | 278845 | 00:00:11 |
| 6 | TABLE ACCESS STORAGE FULL | TASK | 170994691 | 7010782331 | 65987 | 00:00:03 |
| * 7 | HASH JOIN | | 8549735 | 555732775 | 429294 | 00:00:17 |
| 8 | VIEW | VW_SQ_1 | 7174144 | 172179456 | 278845 | 00:00:11 |
| 9 | HASH GROUP BY | | 7174144 | 294139904 | 278845 | 00:00:11 |
| 10 | TABLE ACCESS STORAGE FULL | TASK | 170994691 | 7010782331 | 65987 | 00:00:03 |
| 11 | TABLE ACCESS STORAGE FULL | TASK | 170994691 | 7010782331 | 65987 | 00:00:03 |
| * 12 | TABLE ACCESS STORAGE FULL | TASK | 1908654 | 110701932 | 2520 | 00:00:01 |
-----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 1 - access("SYS_ID"(+)="TASK"."PARENT")
* 3 - access("ITEM_2"="TASK_LWEEK"."SYS_ID")
* 3 - filter("TASK_LWEEK"."SNAPSHOT_DATE"<"MAX_DATE_LWEEK")
* 7 - access("ITEM_1"="TASK_LWEEK"."SYS_ID")
* 7 - filter("TASK_LWEEK"."SNAPSHOT_DATE">=INTERNAL_FUNCTION("MAX_DATE_LWEEK"))
* 12 - storage("TASK"."CLOSED_AT" IS NULL OR "TASK"."CLOSED_AT">=TRUNC(SYSDATE@!)-15)
* 12 - filter("TASK"."CLOSED_AT" IS NULL OR "TASK"."CLOSED_AT">=TRUNC(SYSDATE@!)-15)
Well, you are not even showing the select. I can see that the query runs on Exadata (Table Access Storage Full), so perhaps you should ask yourself why you need to access the same table four times.
The plan accesses the main table TASK four times (lines 6, 10, 11 and 12), with 170994691 rows according to the CBO's estimate. I don't know whether the statistics are up to date or whether sampling kicked in due to a lack of good statistics.
A solution could be to use WITH to generate the intermediate results that you need several times in your main query:
with my_set as
(SELECT id,
MAX(table.date) - 7 AS max_date_lweek,
MAX(table.date) AS max_date
FROM table
GROUP BY id )
select
.......................
from ...
left join anothertable lastweek on ( ........ )
left join my_set on ( lastweek.id = my_set.id )
where
lastweek.date >= my_set.max_date_lweek
and
lastweek.date < my_set.max_date
Please take into account that you did not provide the query, so I am guessing a lot of things.
Since complete information is not available, I will suggest the following.
You are using the same subquery twice, so why not use a CTE such as:
with CTE_example as (SELECT MAX(table.date) AS max_date_lweek, ID
FROM table table
GROUP BY ID)
Looking at your explain plan, the only table being accessed is TASK. From that, I infer that the tables in your example: ANOTHERTABLE and TABLE are actually the same table and that, therefore, you are trying to get the last week of data that exists in that table for each id value.
If all that is true, it should be much faster to use an analytic function to get the max date value for each id and then limit based on that.
Here is an example of what I mean. Note I use "dte" instead of "date", to remove confusion with the reserved word "date".
LEFT JOIN ( SELECT lastweek.*,
max(dte) OVER ( PARTITION BY id ) max_date
FROM anothertable lastweek ) lastweek
ON 1=1 -- whatever other join conditions you have, seemingly omitted from your post
AND lastweek.dte >= lastweek.max_date - 7;
Again, this only works if I am correct in thinking that table and anothertable are actually the same table.
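To illustrate the equivalence, the sketch below compares the correlated-subquery form with the analytic-function form on a small hypothetical table. It runs in SQLite (3.25 or later assumed, for window functions) through Python's sqlite3 module, not in Oracle; the table name task, the column dte and all rows are made up, and date(x, '-7 days') stands in for Oracle's date arithmetic. Unlike the fragment above, it also applies the question's upper bound (dte < max_date).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (id INTEGER, dte TEXT)")
# Hypothetical snapshot dates per id.
conn.executemany(
    "INSERT INTO task VALUES (?, ?)",
    [
        (1, "2024-01-01"), (1, "2024-01-10"), (1, "2024-01-14"), (1, "2024-01-15"),
        (2, "2024-02-01"), (2, "2024-02-13"), (2, "2024-02-20"),
    ],
)

# Original shape: two correlated subqueries, each re-reading the table.
correlated = conn.execute(
    """
    SELECT lw.id, lw.dte FROM task lw
    WHERE lw.dte >= (SELECT date(MAX(t.dte), '-7 days') FROM task t WHERE t.id = lw.id)
      AND lw.dte <  (SELECT MAX(t.dte) FROM task t WHERE t.id = lw.id)
    ORDER BY lw.id, lw.dte
    """
).fetchall()

# Analytic shape: the per-id maximum is computed in the same pass over the table.
analytic = conn.execute(
    """
    SELECT id, dte
    FROM (SELECT id, dte, MAX(dte) OVER (PARTITION BY id) AS max_date FROM task)
    WHERE dte >= date(max_date, '-7 days') AND dte < max_date
    ORDER BY id, dte
    """
).fetchall()

print(correlated)
print(analytic)
```

Both forms select the same rows; whether the analytic form is actually faster on the real TASK table would need to be confirmed from its own execution plan.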

Padding to the result of a DISTINCT Sqlite query

I searched and figured out that I could use either substr with || or a printf statement with format specifiers in order to add padding to the results, but that doesn't seem to work if I had DISTINCT in the sqlite query.
I've a table called timeLapse that looks like so:
+----+-------+-----------+
| ID | Time | Status |
+----+-------+-----------+
| 1 | 0.001 | Initiated |
| 1 | 0.002 | Cranked |
| 3 | 0.002 | Initiated |
| 2 | 0.002 | Initiated |
| 2 | 0.003 | Cranked |
+----+-------+-----------+
I could query the distinct IDs with something like SELECT distinct(ID) FROM timeLapse as IDs, which returns this:
+-----+
| IDs |
+-----+
| 1 |
| 2 |
| 3 |
+-----+
However, I would like to pad the resultant distinct rows like so:
+----------+
| IDs |
+----------+
| Object-1 |
| Object-2 |
| Object-3 |
+----------+
My query SELECT substr('Object-' || DISTINCT(ID), 10, 10) as IDs FROM timeLapse results in an error:
"[17:22:47] Error while executing SQL query on database 'machining': near "distinct": syntax error"
Could someone please help me understand what I am doing wrong here? I am enormously thankful for your time and help.
Apply DISTINCT in a subquery first, then use substr() on the result.
select substr('Object-' || t1.ID, 1, 10) as IDs
from (SELECT DISTINCT(ID) ID FROM timeLapse) t1
All credits to the user named ϻᴇᴛᴀʟ, as I only understood from their answer that I should have a sub-query within this query where the DISTINCT should go into.
This resolves my problem:
select printf('Object-%s', t1.ID) as IDs
FROM (SELECT DISTINCT(id) ID FROM timeLapse) t1
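The underlying reason for the syntax error is that DISTINCT is a keyword that applies to the whole select list, not a function that can be called inside an expression. The sketch below, using Python's sqlite3 module and the question's timeLapse rows, checks the subquery form and also a shorter variant that puts DISTINCT directly after SELECT; the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timeLapse (ID INTEGER, Time REAL, Status TEXT)")
conn.executemany(
    "INSERT INTO timeLapse VALUES (?, ?, ?)",
    [
        (1, 0.001, "Initiated"),
        (1, 0.002, "Cranked"),
        (3, 0.002, "Initiated"),
        (2, 0.002, "Initiated"),
        (2, 0.003, "Cranked"),
    ],
)

# DISTINCT in a subquery, formatting applied to the de-duplicated IDs.
subquery = conn.execute(
    """
    SELECT printf('Object-%s', t1.ID) AS IDs
    FROM (SELECT DISTINCT ID FROM timeLapse) t1
    ORDER BY IDs
    """
).fetchall()

# DISTINCT applied to the formatted expression itself, no subquery needed.
direct = conn.execute(
    "SELECT DISTINCT 'Object-' || ID AS IDs FROM timeLapse ORDER BY IDs"
).fetchall()

print(subquery)
print(direct)
```

Both queries produce Object-1, Object-2 and Object-3, one row each.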

Pick a record based on a given value in postgres

I have a table in postgres like below,
alg_campaignid | alg_score | cp | sum
----------------+-----------+---------+----------
9829 | 30.44056 | 12.4000 | 12.4000
9880 | 29.59280 | 12.0600 | 24.4600
9882 | 29.59280 | 12.0600 | 36.5200
9827 | 29.27504 | 11.9300 | 48.4500
9821 | 29.14840 | 11.8800 | 60.3300
9881 | 29.14840 | 11.8800 | 72.2100
9883 | 29.14840 | 11.8800 | 84.0900
10026 | 28.79280 | 11.7300 | 95.8200
10680 | 10.31504 | 4.1800 | 100.0000
From this table I have to select a record based on a randomly generated number from 0 to 100, i.e. the first record should be returned if the random number picked is between 0 and 12.4000, the second if it is between 12.4000 and 24.4600, and likewise the last if it is between 95.8200 and 100.0000.
For example:
if the random number picked is 8, then the first record should be returned
or
if the random number picked is 48, then the fourth record should be returned
Is it possible to do this in Postgres? If so, kindly recommend a solution.
Yes, you can do this in Postgres. If you want to generate the number in the database:
with r as (
select random() * 100 as r
)
select t.*
from table t cross join r
where t.sum >= r.r
order by t.sum
limit 1;
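The lookup rule can be checked against the question's two examples with a fixed number in place of random(). This is a minimal sketch in SQLite through Python's sqlite3 module, not Postgres; the table name campaigns and the column name cum (standing in for the question's sum column) are hypothetical. It picks the first row whose running total is at least the drawn number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campaigns (alg_campaignid INTEGER, cum REAL)")
conn.executemany(
    "INSERT INTO campaigns VALUES (?, ?)",
    [
        (9829, 12.40), (9880, 24.46), (9882, 36.52),
        (9827, 48.45), (9821, 60.33), (9881, 72.21),
        (9883, 84.09), (10026, 95.82), (10680, 100.00),
    ],
)


def pick(r):
    """Return the campaign id whose cumulative range contains r (0 to 100)."""
    row = conn.execute(
        "SELECT alg_campaignid FROM campaigns "
        "WHERE cum >= ? ORDER BY cum LIMIT 1",
        (r,),
    ).fetchone()
    return row[0] if row else None


print(pick(8))    # first record
print(pick(48))   # fourth record
```

With 8 the first record (9829) is returned and with 48 the fourth (9827), matching the examples in the question.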