SQL / Hive Select first rows with certain column value - hive

Consider the following three-column table structure:
id, b_time, b_type
id is a string, there will be multiple rows with the same id in the table.
b_time is timestamp and b_type can have any one of 2 possible values - 'A' or 'B'.
I want to select all the rows that fulfill one of the 2 conditions, priority wise:
For all ids, select the row with highest timestamp, where b_type='A'.
If for an id, there are no rows where b_type='A', select the row with highest timestamp, irrespective of the b_type value.
Please suggest an SQL query that solves this problem (even if it requires creating temporary intermediate tables).

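Figured out a simple and intuitive way to do this. Within each id, sort by b_type ascending so that 'A' rows come before 'B' rows, then by b_time descending, and keep the first row: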
SELECT * FROM
(SELECT id
, b_time
, b_type
, ROW_NUMBER() OVER (PARTITION BY id ORDER BY b_type ASC,b_time DESC) AS RN
FROM your_table
) t -- Hive requires an alias for a subquery in FROM
WHERE RN = 1
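
To check the ordering trick on a small data set, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module instead of Hive. The sample rows are made up, timestamps are stored as ISO text so that they sort correctly, and it assumes the bundled SQLite is 3.25 or newer (window function support).

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
rows = [
    ("x", "2020-01-01 10:00:00", "A"),
    ("x", "2020-01-02 10:00:00", "A"),
    ("x", "2020-01-03 10:00:00", "B"),  # newest row for x, but type B loses to type A
    ("y", "2020-01-01 10:00:00", "B"),
    ("y", "2020-01-05 10:00:00", "B"),  # y has no type A row, so its latest row wins
]


def pick_rows(data):
    """Return one row per id: latest 'A' row if any, else the latest row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE your_table (id TEXT, b_time TEXT, b_type TEXT)")
    conn.executemany("INSERT INTO your_table VALUES (?, ?, ?)", data)
    query = """
        SELECT id, b_time, b_type
        FROM (
            SELECT id, b_time, b_type,
                   ROW_NUMBER() OVER (
                       PARTITION BY id
                       ORDER BY b_type ASC, b_time DESC
                   ) AS rn
            FROM your_table
        ) t
        WHERE rn = 1
        ORDER BY id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


picked = pick_rows(rows)
print(picked)
```

Note that this relies on 'A' sorting before every other b_type value; with more than two types, a CASE expression in the ORDER BY would be the safer choice.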

with nottypea as (select id, max(b_time) as mxtime
from tablename
group by id
having sum(case when b_type = 'A' then 1 else 0 end) = 0)
, typea as (select id, max(b_time) as mxtime
from tablename
group by id
having sum(case when b_type = 'A' then 1 else 0 end) >= 1)
select id,mxtime,'B' as typ from nottypea
union all
select id,mxtime,'A' as typ from typea

Related

how to find all column records are same or not in group by column in SQL

How to find all column values are same in Group by of rows in table
CREATE TABLE #Temp (ID int,Value char(1))
insert into #Temp (ID ,Value ) ( Select 1 ,'A' union all Select 1 ,'W' union all Select 1 ,'I' union all Select 2 ,'I' union all Select 2 ,'I' union all Select 3 ,'A' union all Select 3 ,'B' union all Select 3 ,'1' )
select * from #Temp
Sample Table:
How to find all column value of 'Value' column are same or not if group by 'ID' Column.
Ex: select ID from #Temp group by ID
For ID 1 - Value column records are A, W, I - Not Same
For ID 2 - Value column records are I, I - Same
For ID 3 - Value column records are A, B, 1 - Not Same
I want the query to get a result like below
When all items in the group are the same, COUNT(DISTINCT Value) would be 1:
SELECT Id
, CASE WHEN COUNT(DISTINCT Value)=1 THEN 'Same' ELSE 'Not Same' END AS Result
FROM MyTable
GROUP BY Id
If you're using T-SQL, perhaps this will work for you:
SELECT t.ID,
CASE WHEN MAX(t.RN) = COUNT(*) THEN 'Same' ELSE 'Not Same' END AS GroupResults -- all values match only if one value accounts for every row of the ID
FROM(
SELECT *, ROW_NUMBER() OVER(PARTITION BY ID, VALUE ORDER BY ID) RN
FROM #Temp
) t
GROUP BY t.ID
Usually that's rather easy: Aggregate per ID and count distinct values or compare minimum and maximum value.
However, neither COUNT(DISTINCT value) nor MIN(value) nor MAX(value) take nulls into consideration. So for an ID having the values 'A' and null, these would report the values as the same. Maybe this is what you want or nulls don't even occur in your data.
But if you want nulls to count as a value, then select distinct values first (where null gets a row too) and count then:
select id, case when count(*) = 1 then 'same' else 'not same' end as result
from (select distinct id, value from #temp) dist
group by id
order by id;
Rextester demo: http://rextester.com/KCZD88697
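
The difference between the two approaches only shows up when nulls are present. A small sketch using Python's sqlite3 module rather than SQL Server; the table name temp_values and the extra ID 4 (one 'A' plus one NULL) are additions for illustration and are not part of the question's data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_values (id INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO temp_values VALUES (?, ?)",
    [
        (1, "A"), (1, "W"), (1, "I"),
        (2, "I"), (2, "I"),
        (3, "A"), (3, "B"), (3, "1"),
        (4, "A"), (4, None),  # hypothetical extra group: one value plus a NULL
    ],
)

# COUNT(DISTINCT value) skips NULLs, so ID 4 is reported as 'same'.
ignoring_nulls = conn.execute("""
    SELECT id,
           CASE WHEN COUNT(DISTINCT value) = 1 THEN 'same' ELSE 'not same' END
    FROM temp_values
    GROUP BY id
    ORDER BY id
""").fetchall()

# SELECT DISTINCT keeps a row for NULL, so COUNT(*) sees two entries for ID 4.
counting_nulls = conn.execute("""
    SELECT id,
           CASE WHEN COUNT(*) = 1 THEN 'same' ELSE 'not same' END
    FROM (SELECT DISTINCT id, value FROM temp_values) dist
    GROUP BY id
    ORDER BY id
""").fetchall()

print(ignoring_nulls)
print(counting_nulls)
```

Only the row for ID 4 differs between the two result lists.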

Is it possible to put a row with specific data at row number one in SQL?

Assume i have a table:
I want the row with UserId = 'ee' to always be displayed as the first row every time I select from this table in a SQL query.
Is it possible?
select *
from your_table
order by UserId <> 'ee' -- "<>", because false < true
or (arguably clearer):
select *
from your_table
order by case when UserId = 'ee' then 0 else 1 end
Use a view:
CREATE VIEW myView AS
SELECT * FROM myTable ORDER BY (case when UserId = 'ee' then 0 else 1 end) ASC
You can use a CASE expression or the IIF function to change the order.
Example:
SELECT * FROM
Table
ORDER BY (CASE WHEN UserID = 'ee' THEN 1 ELSE 2 END) ASC, UserID ASC
Or
SELECT * FROM
Table
ORDER BY IIF(UserID = 'ee', 1, 2) ASC, UserID ASC
This way you give the 'ee' value the number 1 and all the others the number 2, and order by those numbers. Once those are ordered, the rows that share the number 2 are ordered by the UserID itself.
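
A quick way to see the two-level ordering in action is a sketch with Python's sqlite3 module. The table name users and its contents are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (UserId TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("bb",), ("zz",), ("ee",), ("aa",)],
)

# 'ee' gets sort key 1, everything else gets 2; ties are broken by UserId.
ordered = [
    row[0]
    for row in conn.execute("""
        SELECT UserId
        FROM users
        ORDER BY CASE WHEN UserId = 'ee' THEN 1 ELSE 2 END ASC, UserId ASC
    """)
]
print(ordered)  # ['ee', 'aa', 'bb', 'zz']
```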

How to get the first not null value from a column of values in Big Query?

I am trying to extract the first non-null value from a column of values, ordered by timestamp. Can somebody share their thoughts on this? Thank you.
What have I tried so far?
FIRST_VALUE( column ) OVER ( PARTITION BY id ORDER BY timestamp)
Input :-
id,column,timestamp
1,NULL,10:30 am
1,NULL,10:31 am
1,'xyz',10:32 am
1,'def',10:33 am
2,NULL,11:30 am
2,'abc',11:31 am
Output(expected) :-
1,'xyz',10:30 am
1,'xyz',10:31 am
1,'xyz',10:32 am
1,'xyz',10:33 am
2,'abc',11:30 am
2,'abc',11:31 am
You can modify your sql like this to get the data you want.
FIRST_VALUE( column )
OVER (
PARTITION BY id
ORDER BY
CASE WHEN column IS NULL then 0 ELSE 1 END DESC,
timestamp
)
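
As a sanity check of the idea (push NULL rows to the end of the window ordering so that FIRST_VALUE lands on the earliest non-null row), here is a sketch run against SQLite through Python's sqlite3 module, not against BigQuery. The table name readings and the column names val and ts are stand-ins for the question's column and timestamp, and SQLite 3.25 or newer is assumed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER, val TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        (1, None, "10:30"), (1, None, "10:31"),
        (1, "xyz", "10:32"), (1, "def", "10:33"),
        (2, None, "11:30"), (2, "abc", "11:31"),
    ],
)

# Non-null rows sort first (key 1, descending), then by time, so the first
# row of every partition is the earliest non-null value.
filled = conn.execute("""
    SELECT id,
           FIRST_VALUE(val) OVER (
               PARTITION BY id
               ORDER BY CASE WHEN val IS NULL THEN 0 ELSE 1 END DESC, ts
           ) AS first_val,
           ts
    FROM readings
    ORDER BY id, ts
""").fetchall()

for row in filled:
    print(row)
```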
Try this old trick of string manipulation:
Select
ID,
Column,
ttimestamp,
LTRIM(Right(CColumn,20)) as CColumn,
FROM
(SELECT
ID,
Column,
ttimestamp,
MIN(Concat(RPAD(IF(Column is null, '9999999999999999',STRING(ttimestamp)),20,'0'),LPAD(Column,20,' '))) OVER (Partition by ID) CColumn
FROM (
SELECT
*
FROM (Select 1 as ID, STRING(NULL) as Column, 0.4375 as ttimestamp),
(Select 1 as ID, STRING(NULL) as Column, 0.438194444444444 as ttimestamp),
(Select 1 as ID, 'xyz' as Column, 0.438888888888889 as ttimestamp),
(Select 1 as ID, 'def' as Column, 0.439583333333333 as ttimestamp),
(Select 2 as ID, STRING(NULL) as Column, 0.479166666666667 as ttimestamp),
(Select 2 as ID, 'abc' as Column, 0.479861111111111 as ttimestamp)
))
As far as I know, Big Query has no options like 'IGNORE NULLS' or 'NULLS LAST'. Given that, this is the simplest solution I could come up with. I would like to see even simpler solutions.
Assuming the input data is in table "original_data",
select w2.id, w1.column, w2.timestamp
from
(select id,column,timestamp
from
(select id,column,timestamp, row_number()
over (partition BY id ORDER BY timestamp) position
FROM original_data
where column is not null
)
where position=1
) w1
right outer join
original_data as w2
on w1.id = w2.id
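
The join approach above can be tried out with Python's sqlite3 module. This is only a sketch under a few assumptions: the columns are renamed to val and ts, the right outer join is written as the equivalent LEFT JOIN starting from original_data (older SQLite versions have no RIGHT JOIN), and id 3 is an invented group with no non-null value at all:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE original_data (id INTEGER, val TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO original_data VALUES (?, ?, ?)",
    [
        (1, None, "10:30"), (1, "xyz", "10:32"), (1, "def", "10:33"),
        (2, None, "11:30"), (2, "abc", "11:31"),
        (3, None, "12:00"),  # hypothetical id without any non-null value
    ],
)

# w1 holds the earliest non-null value per id; every original row (w2)
# is kept and picks up that value, or NULL when the id has none.
joined = conn.execute("""
    SELECT w2.id, w1.val, w2.ts
    FROM original_data AS w2
    LEFT JOIN (
        SELECT id, val
        FROM (
            SELECT id, val,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts) AS position
            FROM original_data
            WHERE val IS NOT NULL
        ) ranked
        WHERE position = 1
    ) w1 ON w1.id = w2.id
    ORDER BY w2.id, w2.ts
""").fetchall()

for row in joined:
    print(row)
```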
SELECT id,
(SELECT top(1) column FROM test1 where id=1 and column is not null order by autoID desc) as name
,timestamp
FROM yourTable
Output :-
1,'xyz',10:30 am
1,'xyz',10:31 am
1,'xyz',10:32 am
1,'xyz',10:33 am
2,'abc',11:30 am
2,'abc',11:31 am

Combine the most recent entries from a number of tables

I have a master table with a number of IDs in it:
ID ...
0 ...
1 ...
And multiple tables (say vtbl1, vtbl2, vtbl3) with a foreign key to master, a timestamp and a value:
ID Timestamp Value
0 01/01/01.. 2
1 01/01/02.. 7
0 01/01/03.. 5
I would like to get one or more entries for each ID in master with an entry (or null if no entries exist) containing the most recent entry in each v... table (grouped by timestamps):
ID Timestamp vtbl1.Value vtbl2.Value vtbl3.value
0 01/01/03.. 5 2
0 01/01/01.. 4
1 01/01/02.. 7 4 9
I'm sure this is fairly simple but my SQL is rusty and I've been going in circles. Any help would be appreciated.
Clarification
These values come from one or more sensors able to read one or more of the values. So the latest value in each value table for the ID is to be considered the current system state for that ID. If the timestamps match they are considered one update.
I need the minimal set of updates required for each ID to give a full data set for the current state.
Also the values can be of different types.
If I understand your question correctly, one option is to use conditional aggregation and union all:
select id, timestamp,
max(case when tbl = 'tbl1' then value end) t1value,
max(case when tbl = 'tbl2' then value end) t2value,
max(case when tbl = 'tbl3' then value end) t3value
from (
select id, timestamp, value, 'tbl1' tbl
from tbl1
union all
select id, timestamp, value, 'tbl2' tbl
from tbl2
union all
select id, timestamp, value, 'tbl3' tbl
from tbl3
) t
group by id, timestamp
Or if you have multiple records per id and you only want the latest record by timestamp from each table, you can include row_number() in your subquery:
select id, timestamp,
max(case when tbl = 'tbl1' then value end) t1value,
max(case when tbl = 'tbl2' then value end) t2value,
max(case when tbl = 'tbl3' then value end) t3value
from (
select id, timestamp, value, 'tbl1' tbl,
row_number() over (partition by id order by timestamp desc) rn
from tbl1
union all
select id, timestamp, value, 'tbl2' tbl,
row_number() over (partition by id order by timestamp desc) rn
from tbl2
union all
select id, timestamp, value, 'tbl3' tbl,
row_number() over (partition by id order by timestamp desc) rn
from tbl3
) t
where rn = 1
group by id, timestamp
This can get difficult though if max(timestamp) values aren't the same in each of the child tables. Which do you join on at that point?
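
To make the grouping behaviour concrete, here is a sketch of the row_number() variant run through Python's sqlite3 module. The sample rows are invented (loosely modeled on the question's example), the timestamp column is called ts, and SQLite 3.25 or newer is assumed:

```python
import sqlite3

# Hypothetical sample rows: (id, ts, value) per child table.
samples = {
    "tbl1": [(0, "2001-01-01", 2), (0, "2001-01-03", 5), (1, "2001-01-02", 7)],
    "tbl2": [(0, "2001-01-03", 2), (1, "2001-01-02", 4)],
    "tbl3": [(0, "2001-01-01", 4), (1, "2001-01-02", 9)],
}

conn = sqlite3.connect(":memory:")
for name, data in samples.items():
    conn.execute(f"CREATE TABLE {name} (id INTEGER, ts TEXT, value INTEGER)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", data)

# One branch per child table: tag rows with the table name and rank them
# newest-first within each id.
branch = """
    SELECT id, ts, value, '{t}' AS tbl,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
    FROM {t}
"""
union_sql = " UNION ALL ".join(branch.format(t=name) for name in samples)

# Keep only the newest row per table and id, then pivot with conditional
# aggregation; rows sharing (id, ts) collapse into one update.
latest = conn.execute(f"""
    SELECT id, ts,
           MAX(CASE WHEN tbl = 'tbl1' THEN value END) AS t1value,
           MAX(CASE WHEN tbl = 'tbl2' THEN value END) AS t2value,
           MAX(CASE WHEN tbl = 'tbl3' THEN value END) AS t3value
    FROM ({union_sql}) t
    WHERE rn = 1
    GROUP BY id, ts
    ORDER BY id, ts DESC
""").fetchall()

for row in latest:
    print(row)
```

With this data, id 0 produces two rows because the newest tbl3 reading has a different timestamp than the newest tbl1 and tbl2 readings, while id 1 collapses into a single row.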
select m.*, v1.value as t1_val, v2.value as t2_val, v3.value as t3_val
from master m
left join (select x.*
from vtbl1 x
join (select id, max(timestamp) as last_ts
from vtbl1
group by id) y
on x.id = y.id
and x.timestamp = y.last_ts) v1
on m.id = v1.id
left join (select x.*
from vtbl2 x
join (select id, max(timestamp) as last_ts
from vtbl2
group by id) y
on x.id = y.id
and x.timestamp = y.last_ts) v2
on m.id = v2.id
left join (select x.*
from vtbl3 x
join (select id, max(timestamp) as last_ts
from vtbl3
group by id) y
on x.id = y.id
and x.timestamp = y.last_ts) v3
on m.id = v3.id
The fastest query technique depends on the distribution of values. DISTINCT ON would be a simple solution in Postgres, ideal for just a few values per id in each child table. But guessing from your description I expect many rows per id, so I suggest a solution with LATERAL joins. Requires Postgres 9.3+:
Optimize GROUP BY query to retrieve latest record per user
One more complication for your already-not-so-simple case:
Also the values can be of different types
Alternative 1
Cast all values to text. Every data type can be cast to text.
Base query
SELECT m.id, v.timestamp, 1 AS tbl, v.value -- simple int as table id
FROM master m
, LATERAL (
SELECT timestamp, value::text -- cast to text
FROM vtbl1
WHERE id = m.id -- lateral reference
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp, 2 AS tbl, v.value -- ascending without gaps
FROM master m
, LATERAL (
SELECT timestamp, value::text
FROM vtbl2
WHERE id = m.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp, 3 AS tbl, value
FROM ...
;
All you need for this to be fast is an index on (id, timestamp) for each child table. Best in this form (adding value is only useful if you get index-only scans out of it):
CREATE INDEX vtbl1_combo_idx ON vtbl1 (id, timestamp DESC NULLS LAST, value)
1a. Aggregate (pseudo-crosstab)
To format as desired, use aggregate functions on CASE expressions in Postgres 9.3 or older (as demonstrated by @sgeddes) or (better) the new aggregate FILTER clause in Postgres 9.4+:
How can I simplify this game statistics query?
SELECT id, timestamp
, max(value) FILTER (WHERE tbl = 1) AS val1
, max(value) FILTER (WHERE tbl = 2) AS val2
, ...
FROM ( <query from above> ) t
GROUP BY 1, 2;
1b. Crosstab
Actual cross tabulation (also called "pivot" in other RDBMS) should be considerably faster. You need the additional module tablefunc installed, instructions below.
The special difficulty here: we have a composite "row name" (id, timestamp), but the function expects a single column as row name. So we substitute with row_number(), but do not display that surrogate key in the result:
SELECT id, timestamp, val1, val2, val3, ...
-- normally SELECT * is enough; explicit list to filter rn
FROM crosstab(
$$
SELECT row_number() OVER (ORDER BY id, timestamp DESC NULLS LAST) AS rn
, id, timestamp, tbl, value
FROM ( <query from above> ) t
ORDER BY 1
$$
, 'SELECT generate_series(1,3)' -- replace 3 with highest table nr.
) AS ct (
rn int, id int, timestamp date
, val1 text, val2 text, val3 text, ...);
Closely related:
Postgres - Transpose Rows to Columns
Relevant basics:
PostgreSQL Crosstab Query
Pivot on Multiple Columns using Tablefunc
Alternative 2
Simple, but may be just as fast and preserves original data types:
SELECT id, timestamp
, max(val1) AS val1, max(val2) AS val2, max(val3) AS val3, ...
FROM (
SELECT m.id, v.timestamp
, v.value AS val1, NULL::int AS val2, NULL::numeric AS val3, ...
-- list all values with actual data type
FROM master m
, LATERAL (
SELECT timestamp, value
FROM vtbl1
WHERE id = m.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp
, NULL, v.value, NULL, ... -- column names & data types defined in first SELECT
FROM master m
, LATERAL (
SELECT timestamp, value
FROM vtbl2
WHERE id = m.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp
, NULL, NULL, v.value, ...
FROM ...
) t
GROUP BY 1, 2
ORDER BY 1, 2;
Aside: Never use basic type names or reserved words (in standard SQL) like timestamp as identifier.

Duplicate Counts - TSQL

I want to get all records that have duplicate values for SOME of the fields (i.e. the key columns).
My code:
CREATE TABLE #TEMP (ID int, Descp varchar(5), Extra varchar(6))
INSERT INTO #Temp
SELECT 1,'One','Extra1'
UNION ALL
SELECT 2,'Two','Extra2'
UNION ALL
SELECT 3,'Three','Extra3'
UNION ALL
SELECT 1,'One','Extra4'
SELECT ID, Descp, Extra FROM #TEMP
;WITH Temp_CTE AS
(SELECT *
, ROW_NUMBER() OVER (PARTITION BY ID, Descp ORDER BY (SELECT 0))
AS DuplicateRowNumber
FROM #TEMP
)
SELECT * FROM Temp_cte
DROP TABLE #TEMP
The last column tells me how many times each row has appeared, based on the ID and Descp values.
I want that column, but I ALSO need another column* that indicates that the rows with ID = 1 and Descp = 'One' have shown up more than once.
So an extra column* (i.e. MultipleOccurances (bool)) that is 1 for the two rows with ID = 1 and Descp = 'One', and 0 for the other rows, as they only show up once.
How can I achieve that? (I want to avoid using Count(1)>1 or something similar, if possible.)
Edit:
Desired output:
ID Descp Extra DuplicateRowNumber IsMultiple
1 One Extra1 1 1
1 One Extra4 2 1
2 Two Extra2 1 0
3 Three Extra3 1 0
You say "I want to avoid using Count", but it is probably the best way. It reuses the partitioning you already have for the ROW_NUMBER:
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ID, Descp
ORDER BY (SELECT 0)) AS DuplicateRowNumber,
CASE
WHEN COUNT(*) OVER (PARTITION BY ID, Descp) > 1 THEN 1
ELSE 0
END AS IsMultiple
FROM #Temp
And the execution plan just shows a single sort.
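
For anyone who wants to try the windowed COUNT without a SQL Server instance, here is a sketch using Python's sqlite3 module (SQLite 3.25 or newer assumed). The table is named temp_rows instead of #Temp, and ROW_NUMBER is ordered by Extra rather than (SELECT 0) purely so that the output is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_rows (ID INTEGER, Descp TEXT, Extra TEXT)")
conn.executemany(
    "INSERT INTO temp_rows VALUES (?, ?, ?)",
    [
        (1, "One", "Extra1"),
        (2, "Two", "Extra2"),
        (3, "Three", "Extra3"),
        (1, "One", "Extra4"),
    ],
)

# Both window functions share the same PARTITION BY, so the duplicate flag
# is derived from the group size without a separate GROUP BY or join.
flagged = conn.execute("""
    SELECT ID, Descp, Extra,
           ROW_NUMBER() OVER (PARTITION BY ID, Descp ORDER BY Extra)
               AS DuplicateRowNumber,
           CASE WHEN COUNT(*) OVER (PARTITION BY ID, Descp) > 1 THEN 1
                ELSE 0
           END AS IsMultiple
    FROM temp_rows
    ORDER BY ID, Extra
""").fetchall()

for row in flagged:
    print(row)
```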
Well, I have this solution, but using a Count...
SELECT T1.*,
ROW_NUMBER() OVER (PARTITION BY T1.ID, T1.Descp ORDER BY (SELECT 0)) AS DuplicateRowNumber,
CASE WHEN T2.C = 1 THEN 0 ELSE 1 END MultipleOcurrences FROM #temp T1
INNER JOIN
(SELECT ID, Descp, COUNT(1) C FROM #TEMP GROUP BY ID, Descp) T2
ON T1.ID = T2.ID AND T1.Descp = T2.Descp