Identify the second max value in hive based on condition


I have a table whose rows look like the ones below. The Rank column ranks all rows, partitioned by ticket ID and ordered by timestamp descending.
Each row has exactly one flag equal to 1.
| ticketID | flag 1 | flag 2 | flag 3 | flag 4 | Timestamp | Rank | stringvalue |
|----------|--------|--------|--------|--------|-----------|------|-------------|
| 1        | 0      | 0      | 1      | 0      | xxxxxx    | 2    | aaaaaa      |
| 1        | 0      | 0      | 0      | 1      | xxxxxx    | 1    | bbbbbb      |
| 1        | 0      | 1      | 0      | 0      | xxxxxx    | 3    | aaaaaa      |
| 2        | 1      | 0      | 0      | 0      | xxxxxx    | 2    | bbbbbb      |
| 2        | 0      | 0      | 0      | 1      | xxxxxx    | 1    | xxxxxx      |
| 3        | 0      | 0      | 1      | 0      | xxxxxx    | 4    | aaaaaa      |
| 3        | 0      | 1      | 0      | 0      | xxxxxx    | 3    | bbbbbb      |
| 3        | 1      | 0      | 0      | 0      | xxxxxx    | 1    | ssssss      |
| 3        | 0      | 0      | 0      | 1      | xxxxxx    | 2    | nnnnnn      |
| 4        | 0      | 1      | 0      | 0      | xxxxxx    | 2    | gggggg      |
| 4        | 0      | 0      | 0      | 1      | xxxxxx    | 1    | iiiiii      |
For each ticketID I need to get the first row based on the rank, with an exception for one specific flag:
When the rank 1 row of a ticket has flag 4 = 1, I need to take the rank 2 row as the first one instead.
If that rank 2 row has flag 3 = 1, I need to concatenate the stringvalue of the rank 1 row (flag 4 = 1) with that of the rank 2 row (flag 3 = 1).
If the rank 2 row has flag 1 = 1 or flag 2 = 1, just ignore the rank 1 row and return the rank 2 row as the first.
I hope that my question is clear.
Thanks
Edit
Sample output:
| ticketID | flag 1 | flag 2 | flag 3 | Timestamp | Rank | stringvalue     |
|----------|--------|--------|--------|-----------|------|-----------------|
| 1        | 0      | 0      | 1      | xxxxxx    | 1    | aaaaaa / bbbbbb |
| 2        | 1      | 0      | 0      | xxxxxx    | 1    | bbbbbb          |
| 3        | 1      | 0      | 0      | xxxxxx    | 1    | ssssss          |
| 4        | 0      | 1      | 0      | xxxxxx    | 1    | gggggg          |

I'm going to use some CTEs with a struct and a GROUP BY. This lets us ask questions about multiple rows without using a window function. It will likely perform faster, as we don't have to maintain window state.
create table theRanks (ticketID int, flag_1 int, flag_2 int, flag_3 int, flag_4 int, Timestamp string, Rank int, stringvalue string);
-- create some dummy data
insert into theRanks values (1, 0, 0, 1, 0, 'xxxxxx', 2, 'aaaaaa');
insert into theRanks values (1, 0, 0, 0, 1, 'xxxxxx', 1, 'bbbbbb');
insert into theRanks values (1, 0, 1, 0, 0, 'xxxxxx', 3, 'aaaaaa');
with struct_table as -- CTE: one struct per source row
(
    select
        ticketID,
        struct( -- a struct lets us carry a whole row through the group by
            Rank as rawRank, -- must be the first field, because the sort compares it first
            flag_1,
            flag_2,
            flag_3,
            flag_4,
            Timestamp,
            stringvalue
        ) as myRow
    from
        theRanks
    where
        Rank in (1, 2) -- only look at the first two ranks
),
constants as -- CTE
(
    select 0 as rank1, 1 as rank2 -- array positions; not strictly needed, just more readable
),
grouped_rows as -- CTE
(
    select
        ticketID,
        array_sort(collect_list(myRow)) as row_list -- collects the structs into a list sorted by rank
    from struct_table
    group by ticketID
),
raw_rows as -- CTE
(
    select
        ticketID,
        case
            when -- rank 1 is flag 4 and rank 2 is not flag 3: promote rank 2
                row_list[constants.rank1].flag_4 = 1 and row_list[constants.rank2].flag_3 = 0
            then
                row_list[constants.rank2]
            when -- rank 1 is flag 4 and rank 2 is flag 3: concatenate the strings
                row_list[constants.rank1].flag_4 = 1 and row_list[constants.rank2].flag_3 = 1
            then
                struct( -- this struct must match the original one we created
                    row_list[constants.rank2].rawRank as rawRank,
                    row_list[constants.rank2].flag_1 as flag_1,
                    row_list[constants.rank2].flag_2 as flag_2,
                    row_list[constants.rank2].flag_3 as flag_3,
                    row_list[constants.rank2].flag_4 as flag_4,
                    row_list[constants.rank2].Timestamp as Timestamp,
                    concat(
                        row_list[constants.rank1].stringvalue,
                        ' / ',
                        row_list[constants.rank2].stringvalue) as stringvalue
                )
            else -- rank 1 is not flag 4 (or there is no rank 2): keep rank 1
                row_list[constants.rank1]
        end as rankedRow
    from grouped_rows
    cross join constants -- optional: you can replace constants.rank1 with 0 and constants.rank2 with 1 everywhere; I only use it to make the intent clearer
)
select ticketID, rankedRow.*, 1 as Rank from raw_rows; -- expands the struct fields into table columns
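As a cross-check of the selection rule, here is a minimal sketch that applies it to the question's sample data with Python's built-in sqlite3 module. It swaps in a different technique (a self-join of the rank 1 row to the rank 2 row) instead of the struct approach, and the column names `ts` and `rnk` are renamed stand-ins for Timestamp and Rank, so treat it as an illustration of the logic and not as Hive code. The concatenation order follows the query above (rank 1 first); the question's sample output shows the opposite order, so swap the operands if that is what you need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE theRanks (ticketID INT, flag_1 INT, flag_2 INT, flag_3 INT,
                                       flag_4 INT, ts TEXT, rnk INT, stringvalue TEXT)""")
conn.executemany("INSERT INTO theRanks VALUES (?,?,?,?,?,?,?,?)", [
    (1, 0, 0, 1, 0, "xxxxxx", 2, "aaaaaa"), (1, 0, 0, 0, 1, "xxxxxx", 1, "bbbbbb"),
    (1, 0, 1, 0, 0, "xxxxxx", 3, "aaaaaa"), (2, 1, 0, 0, 0, "xxxxxx", 2, "bbbbbb"),
    (2, 0, 0, 0, 1, "xxxxxx", 1, "xxxxxx"), (3, 0, 0, 1, 0, "xxxxxx", 4, "aaaaaa"),
    (3, 0, 1, 0, 0, "xxxxxx", 3, "bbbbbb"), (3, 1, 0, 0, 0, "xxxxxx", 1, "ssssss"),
    (3, 0, 0, 0, 1, "xxxxxx", 2, "nnnnnn"), (4, 0, 1, 0, 0, "xxxxxx", 2, "gggggg"),
    (4, 0, 0, 0, 1, "xxxxxx", 1, "iiiiii"),
])

PICK_SQL = """
SELECT r1.ticketID,
       CASE WHEN r1.flag_4 = 1 AND r2.flag_3 = 1
                THEN r1.stringvalue || ' / ' || r2.stringvalue  -- concatenate both rows
            WHEN r1.flag_4 = 1 AND r2.ticketID IS NOT NULL
                THEN r2.stringvalue                             -- promote rank 2
            ELSE r1.stringvalue                                 -- keep rank 1
       END AS stringvalue
FROM theRanks AS r1
LEFT JOIN theRanks AS r2
       ON r2.ticketID = r1.ticketID AND r2.rnk = 2
WHERE r1.rnk = 1
ORDER BY r1.ticketID
"""
picked = conn.execute(PICK_SQL).fetchall()
for row in picked:
    print(row)
```

The LEFT JOIN keeps tickets that have only a rank 1 row; for those the CASE falls through to the ELSE branch.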

Related

Generate multiple record from existing records based on interval columns [from and to]

I have two types of score, M and B, in column 3 (S_TYPE). If the type is M, the score in column 6 (SCORE) is either S (scored) or SB (bonus scored). Every interval [FROM_HRS, TO_HRS] of a type B row must have a corresponding SB row of type M, so a type M row cannot carry a score of S over an interval covered by a type B row. Unfortunately, several records were captured incorrectly, as shown in the table below.
CREATE TABLE SCORE_TBL
(
    ID int IDENTITY(1,1) PRIMARY KEY,
    PERSONID_FK int NOT NULL,
    S_TYPE varchar(50) NULL,
    FROM_HRS int NULL,
    TO_HRS int NULL,
    SCORE varchar(50) NULL
);
INSERT INTO SCORE_TBL (PERSONID_FK, S_TYPE, FROM_HRS, TO_HRS, SCORE)
VALUES
    (1, 'M', 0, 20, 'S'),
    (1, 'B', 6, 8, 'B'),
    (2, 'B', 0, 2, 'B'),
    (2, 'M', 0, 20, 'S'),
    (2, 'B', 10, 13, 'B'),
    (2, 'B', 18, 20, 'B'),
    (2, 'M', 13, 18, 'S');
| ID | PERSONID_FK |S_TYPE| FROM_HRS | TO_HRS | SCORE |
|----|-------------|------|----------|--------|-------|
| 1 | 1 | M | 0 | 20 | S |
| 2 | 1 | B | 6 | 8 | B |
| 3 | 2 | B | 0 | 2 | B |
| 4 | 2 | M | 0 | 20 | S |
| 5 | 2 | B | 10 | 13 | B |
| 6 | 2 | B | 18 | 20 | B |
| 7 | 2 | M | 13 | 18 | S |
I want the data to look like this
| ID | PERSONID_FK |S_TYPE| FROM_HRS | TO_HRS | SCORE |
|----|-------------|------|----------|--------|-------|
| 1 | 1 | M | 0 | 6 | S |
| 2 | 1 | M | 6 | 8 | SB |
| 3 | 1 | B | 6 | 8 | B |
| 4 | 1 | M | 8 | 20 | S |
| 5 | 2 | B | 0 | 2 | B |
| 6 | 2 | M | 0 | 2 | SB |
| 7 | 2 | M | 2 | 10 | S |
| 8 | 2 | B | 10 | 13 | B |
| 9 | 2 | M | 10 | 13 | SB |
| 10 | 2 | M | 13 | 18 | S |
| 11 | 2 | B | 18 | 20 | B |
| 12 | 2 | M | 18 | 20 | SB |
Any ideas on how to generate this data with a SQL Server SELECT statement?

The tricky part is that an interval might need to be split into several pieces, like 0..20 for person 2.
Window functions to the rescue. This query illustrates what you need to do:
WITH
deltas AS (
    SELECT personid_fk, hrs, sum(delta_s) as delta_s, sum(delta_b) as delta_b
    FROM (SELECT personid_fk, from_hrs as hrs,
                 case when score = 'S' then 1 else 0 end as delta_s,
                 case when score = 'B' then 1 else 0 end as delta_b
          FROM score_tbl
          UNION ALL
          SELECT personid_fk, to_hrs as hrs,
                 case when score = 'S' then -1 else 0 end as delta_s,
                 case when score = 'B' then -1 else 0 end as delta_b
          FROM score_tbl) _
    GROUP BY personid_fk, hrs
),
running AS (
    SELECT personid_fk, hrs as from_hrs,
           lead(hrs) over (partition by personid_fk order by hrs) as to_hrs,
           sum(delta_s) over (partition by personid_fk order by hrs) running_s,
           sum(delta_b) over (partition by personid_fk order by hrs) running_b
    FROM deltas
)
SELECT personid_fk, 'M' as s_type, from_hrs, to_hrs,
       case when running_b > 0 then 'SB' else 'S' end as score
FROM running
WHERE running_s > 0
UNION ALL
SELECT personid_fk, s_type, from_hrs, to_hrs, score
FROM score_tbl
WHERE s_type = 'B'
ORDER BY personid_fk, from_hrs;
Step by step:
1. deltas is the union of two passes over score_tbl, one for the start and one for the end of each score/bonus interval, which creates a timeline of +1/-1 events.
2. running calculates the running total of those deltas over time, which yields the split intervals and tells us whether a score and/or a bonus is active in each.
3. The final query converts the running totals into score codes and unions in the bonus intervals, which are passed through unchanged.
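To make the intermediate steps visible, the following sketch runs the deltas and running stages for person 1 only, using Python's built-in sqlite3 module instead of SQL Server (an assumption made purely so the example is self-contained; it needs SQLite 3.25 or later for window functions). The table is trimmed to the columns the query uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE score_tbl (
    id INTEGER PRIMARY KEY,
    personid_fk INTEGER NOT NULL,
    s_type TEXT, from_hrs INTEGER, to_hrs INTEGER, score TEXT
);
INSERT INTO score_tbl (personid_fk, s_type, from_hrs, to_hrs, score) VALUES
    (1, 'M', 0, 20, 'S'),
    (1, 'B', 6, 8, 'B');
""")

RUNNING_SQL = """
WITH deltas AS (
    -- one +1 event where an interval starts, one -1 event where it ends
    SELECT personid_fk, hrs, SUM(delta_s) AS delta_s, SUM(delta_b) AS delta_b
    FROM (SELECT personid_fk, from_hrs AS hrs,
                 CASE WHEN score = 'S' THEN 1 ELSE 0 END AS delta_s,
                 CASE WHEN score = 'B' THEN 1 ELSE 0 END AS delta_b
          FROM score_tbl
          UNION ALL
          SELECT personid_fk, to_hrs,
                 CASE WHEN score = 'S' THEN -1 ELSE 0 END,
                 CASE WHEN score = 'B' THEN -1 ELSE 0 END
          FROM score_tbl)
    GROUP BY personid_fk, hrs
)
SELECT hrs AS from_hrs,
       LEAD(hrs) OVER (PARTITION BY personid_fk ORDER BY hrs) AS to_hrs,
       SUM(delta_s) OVER (PARTITION BY personid_fk ORDER BY hrs) AS running_s,
       SUM(delta_b) OVER (PARTITION BY personid_fk ORDER BY hrs) AS running_b
FROM deltas
ORDER BY personid_fk, hrs
"""
running_rows = conn.execute(RUNNING_SQL).fetchall()
for row in running_rows:
    print(row)  # (from_hrs, to_hrs, running_s, running_b)
```

For person 1 this yields the pieces 0..6, 6..8 and 8..20 with running_s = 1, and only 6..8 has running_b = 1, which is the piece that becomes SB. The trailing row starting at 20 has running_s = 0 and is what the `WHERE running_s > 0` filter drops.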

Replace null values with most recent non-null values SQL

I have a table where each row consists of an ID, date, variable values (eg. var1).
When var1 is null in a row, I would like to replace the null with the most recent non-null value before that date for the same ID. How can I do this quickly for a very large table?
Suppose I start with this table:
+----+--------------+------+
| id | date         | var1 |
+----+--------------+------+
| 1  | '01-01-2022' | 55   |
| 2  | '01-01-2022' | 12   |
| 3  | '01-01-2022' | 45   |
| 1  | '01-02-2022' | Null |
| 2  | '01-02-2022' | Null |
| 3  | '01-02-2022' | 20   |
| 1  | '01-03-2022' | 15   |
| 2  | '01-03-2022' | Null |
| 3  | '01-03-2022' | Null |
| 1  | '01-04-2022' | Null |
| 2  | '01-04-2022' | 77   |
+----+--------------+------+
Then I want this
+----+--------------+------+
| id | date         | var1 |
+----+--------------+------+
| 1  | '01-01-2022' | 55   |
| 2  | '01-01-2022' | 12   |
| 3  | '01-01-2022' | 45   |
| 1  | '01-02-2022' | 55   |
| 2  | '01-02-2022' | 12   |
| 3  | '01-02-2022' | 20   |
| 1  | '01-03-2022' | 15   |
| 2  | '01-03-2022' | 12   |
| 3  | '01-03-2022' | 20   |
| 1  | '01-04-2022' | 15   |
| 2  | '01-04-2022' | 77   |
+----+--------------+------+

A CTE suits this well.
This snippet returns every row with the nulls filled in; turning it into an UPDATE is a small further step.
WITH selectcte AS
(
    SELECT * FROM testnulls WHERE var1 IS NOT NULL
)
SELECT t1A.id, t1A.date, ISNULL(t1A.var1, t1B.var1) varvalue
FROM testnulls t1A -- read all rows, including those where var1 is null
OUTER APPLY (SELECT TOP 1 *
             FROM selectcte -- only non-null rows are candidates
             WHERE id = t1A.id AND date < t1A.date
             ORDER BY date DESC) t1B
You can read more about CTEs here:
https://learn.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql?view=sql-server-ver16
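The same fill-forward idea can be tried on the question's sample data with the sketch below, which uses Python's built-in sqlite3 module in place of SQL Server. Since SQLite has no OUTER APPLY, it swaps in a correlated scalar subquery with LIMIT 1, and it stores the dates in ISO format (an assumption, so that string comparison matches date order). The index on (id, date) is a suggestion for large tables, not something taken from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testnulls (id INTEGER, date TEXT, var1 INTEGER)")
conn.executemany("INSERT INTO testnulls VALUES (?, ?, ?)", [
    (1, "2022-01-01", 55), (2, "2022-01-01", 12), (3, "2022-01-01", 45),
    (1, "2022-01-02", None), (2, "2022-01-02", None), (3, "2022-01-02", 20),
    (1, "2022-01-03", 15), (2, "2022-01-03", None), (3, "2022-01-03", None),
    (1, "2022-01-04", None), (2, "2022-01-04", 77),
])
# Lets the per-row lookup seek directly to the previous rows of the same id.
conn.execute("CREATE INDEX ix_testnulls_id_date ON testnulls (id, date)")

FILL_SQL = """
SELECT t.id, t.date,
       COALESCE(t.var1,
                (SELECT p.var1
                 FROM testnulls AS p
                 WHERE p.id = t.id
                   AND p.date < t.date
                   AND p.var1 IS NOT NULL
                 ORDER BY p.date DESC
                 LIMIT 1)) AS var1
FROM testnulls AS t
ORDER BY t.date, t.id
"""
filled = conn.execute(FILL_SQL).fetchall()
for row in filled:
    print(row)
```

Rows that have no earlier non-null value for their id would keep NULL, because the subquery returns nothing for them.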

SQL select with preference on column values

I am new to SQL and I would like to ask about how to select entries based on preferences and grouping.
+----------+----------+------+
| ENTRY_ID | ROUTE_ID | TYPE |
+----------+----------+------+
| 1 | 15 | 0 |
| 1 | 26 | 1 |
| 1 | 39 | 1 |
| 2 | 22 | 1 |
| 2 | 15 | 1 |
| 3 | 30 | 1 |
| 3 | 35 | 0 |
| 3 | 40 | 1 |
+----------+----------+------+
With the table above, I would like to select one entry for each ENTRY_ID, with the following preference for the returned ROUTE_ID:
- If TYPE = 0 is available for any of the entries with the same ENTRY_ID, return the minimum ROUTE_ID among the entries with TYPE = 0.
- If only TYPE = 1 is available for the same ENTRY_ID, return the minimum ROUTE_ID.
The expected outcome for the query will be the following:
+----------+----------+------+
| ENTRY_ID | ROUTE_ID | TYPE |
+----------+----------+------+
| 1 | 15 | 0 |
| 2 | 15 | 1 |
| 3 | 35 | 0 |
+----------+----------+------+
Thank you for your help!

You can group by both TYPE and ENTRY_ID, and then use the HAVING clause to filter out those where TYPE is not the minimal value for that record.
SELECT ENTRY_ID, MIN(ROUTE_ID), TYPE
FROM MyTable
GROUP BY ENTRY_ID, TYPE
HAVING TYPE = (SELECT MIN(s.TYPE) FROM MyTable s WHERE s.ENTRY_ID = MyTable.ENTRY_ID)
This relies on TYPE only ever being 0 or 1. If there are more possible values, the query returns only the rows with the lowest TYPE present for each ENTRY_ID.
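A quick way to check the GROUP BY / HAVING query against the question's sample data is the sketch below, run through Python's built-in sqlite3 module (an assumption made for convenience; the question does not name a database). The table alias `m` is added only to make the correlated reference explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ENTRY_ID INTEGER, ROUTE_ID INTEGER, TYPE INTEGER)")
conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?)", [
    (1, 15, 0), (1, 26, 1), (1, 39, 1),
    (2, 22, 1), (2, 15, 1),
    (3, 30, 1), (3, 35, 0), (3, 40, 1),
])

PREFERENCE_SQL = """
SELECT m.ENTRY_ID, MIN(m.ROUTE_ID) AS ROUTE_ID, m.TYPE
FROM MyTable AS m
GROUP BY m.ENTRY_ID, m.TYPE
-- keep only the group whose TYPE is the smallest one present for this ENTRY_ID
HAVING m.TYPE = (SELECT MIN(s.TYPE) FROM MyTable AS s WHERE s.ENTRY_ID = m.ENTRY_ID)
ORDER BY m.ENTRY_ID
"""
preferred = conn.execute(PREFERENCE_SQL).fetchall()
for row in preferred:
    print(row)
```

Each ENTRY_ID produces one group per TYPE, and the HAVING clause discards every group except the one with the preferred (lowest) TYPE.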

If you want complete rows, use a correlated subquery:
select t.*
from t
where t.route_id = (select top 1 t2.route_id
                    from t as t2
                    where t2.entry_id = t.entry_id
                    order by iif(t2.type = 0, 1, 2), -- put type 0 first
                             t2.route_id asc         -- then the lowest route_id
                   );
This has the advantage that it can return more than just the three columns you show in the question.

No rowid or key need most recent row

I am trying my hardest to get a list of the most recent rows by date in a DB2 file. The file has no unique id, so I am trying to get the entries by matching a set of columns. I need DESCGA most importantly as that changes often. When it does they keep another row for historical reasons.
SELECT B.COGA, B.COMSUBGA, B.ACCTGA, B.PRFXGA, B.DESCGA
FROM mylib.myfile B
WHERE
(
SELECT COUNT(*)
FROM
(
SELECT A.COGA,A.COMSUBGA,A.ACCTGA,A.PRFXGA,MAX(A.DATEGA) AS EDATE
FROM mylib.myfile A
GROUP BY A.COGA, A.COMSUBGA, A.ACCTGA, A.PRFXGA
) T
WHERE
(B.ACCTGA = T.ACCTGA AND
B.COGA = T.COGA AND
B.COMSUBGA = T.COMSUBGA AND
B.PRFXGA = T.PRFXGA AND
B.DATEGA = T.EDATE)
) > 1
This is what I am trying, and so far I get 0 results.
If I remove
B.ACCTGA = T.ACCTGA AND
it returns results (wrong ones, of course).
I am using ODBC in VS 2013 to build this query.
I have a table with the following data:
| a | b | descri | date     |
|---|---|--------|----------|
| 1 | 0 | string | 20140102 |
| 2 | 1 | string | 20140103 |
| 1 | 1 | string | 20140101 |
| 1 | 1 | string | 20150101 |
| 1 | 0 | string | 20150102 |
| 2 | 1 | string | 20150103 |
| 1 | 1 | string | 20150103 |
and I need
| a | b | descri | date     |
|---|---|--------|----------|
| 1 | 0 | string | 20150102 |
| 2 | 1 | string | 20150103 |
| 1 | 1 | string | 20150103 |

You can use row_number():
select t.*
from (select t.*,
             row_number() over (partition by a, b order by date desc) as seqnum
      from mylib.myfile t
     ) t
where seqnum = 1;
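The row_number() approach can be tried on the simplified table from the question with the sketch below. It uses Python's built-in sqlite3 module as a stand-in for DB2 (an assumption so the example is self-contained; SQLite 3.25 or later is needed for window functions), and the table name `myfile` without a library prefix is likewise a simplification. As a side note on the original query: the derived table T has one row per key combination, so the correlated COUNT(*) can never exceed 1, which would explain why the `> 1` test returns nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myfile (a INTEGER, b INTEGER, descri TEXT, date INTEGER)")
conn.executemany("INSERT INTO myfile VALUES (?, ?, ?, ?)", [
    (1, 0, "string", 20140102),
    (2, 1, "string", 20140103),
    (1, 1, "string", 20140101),
    (1, 1, "string", 20150101),
    (1, 0, "string", 20150102),
    (2, 1, "string", 20150103),
    (1, 1, "string", 20150103),
])

LATEST_SQL = """
SELECT a, b, descri, date
FROM (SELECT t.*,
             -- number the rows of each (a, b) group, newest date first
             ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY date DESC) AS seqnum
      FROM myfile AS t)
WHERE seqnum = 1
ORDER BY a, b
"""
latest = conn.execute(LATEST_SQL).fetchall()
for row in latest:
    print(row)
```

For the real file, the PARTITION BY list would be the four key columns (COGA, COMSUBGA, ACCTGA, PRFXGA) and the ORDER BY column would be DATEGA.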

SQL moving aggregate SUM without partial results

Assume I have this schema (tested on PostgreSQL), where the 'Scorelines' relation contains the results of sports matches. (kickoff is a TIMESTAMP, but is replaced by INT here for readability.)
SQLFiddle here: http://sqlfiddle.com/#!12/52475/3
CREATE TABLE Scorelines (
team TEXT,
kickoff INT,
scored INT,
conceded INT
);
Now I want to produce another column, 'three_matches_scored', that contains the sum of the points scored over the 3 preceding games (determined by kickoff) of the same team. I have this:
SELECT team, kickoff, scored, conceded, SUM(scored) OVER three_matches AS three_matches_scored
FROM Scorelines
WINDOW three_matches AS
(PARTITION BY team ORDER BY kickoff
ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
ORDER BY kickoff;
This works beautifully so far, except that I get values starting from the second game. Example:
| TEAM | KICKOFF | SCORED | CONCEDED | THREE_MATCHES_SCORED |
|------|---------|--------|----------|----------------------|
| A | 1 | 1 | 0 | (null) |
| B | 2 | 1 | 1 | (null) |
| A | 3 | 1 | 1 | 1 |
| A | 4 | 3 | 0 | 2 |
| B | 4 | 1 | 4 | 1 |
| A | 6 | 0 | 2 | 5 |
| B | 6 | 4 | 2 | 2 |
| B | 8 | 1 | 2 | 6 |
| B | 10 | 1 | 1 | 6 |
| A | 11 | 2 | 1 | 4 |
I want the column 'three_matches_scored' to be (null) for the first 3 games of each team, because there are not yet 3 results to sum up. How can I achieve this?
I'd prefer a simple, understandable solution; performance is not critical in this particular case.
My only idea right now is to define a stored function SUM3 that returns (null) when there are fewer than 3 values to add up. But I have never defined a function in SQL and can't seem to figure it out.

You can use a CASE expression to null out the rows where there are fewer than 3 preceding games:
SELECT team, kickoff, scored, conceded,
CASE WHEN COUNT(scored) OVER three_matches = 3
THEN SUM(scored) OVER three_matches
ELSE NULL
END AS three_matches_scored
FROM Scorelines
WINDOW three_matches AS
(PARTITION BY team ORDER BY kickoff
ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
ORDER BY kickoff;
Output:
team | kickoff | scored | conceded | three_matches_scored
------+---------+--------+----------+----------------------
A | 1 | 1 | 0 |
B | 2 | 1 | 1 |
A | 3 | 1 | 1 |
A | 4 | 3 | 0 |
B | 4 | 1 | 4 |
A | 6 | 0 | 2 | 5
B | 6 | 4 | 2 |
B | 8 | 1 | 2 | 6
B | 10 | 1 | 1 | 6
A | 11 | 2 | 1 | 4
(10 rows)
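The output above can be reproduced with the sketch below, which runs the same CASE / COUNT / SUM pattern through Python's built-in sqlite3 module instead of PostgreSQL (an assumption for self-containment; SQLite 3.25 or later supports the named WINDOW clause). The rows are taken from the result table in the question, and `team` is added to the final ORDER BY only to make the order of tied kickoffs deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Scorelines (team TEXT, kickoff INT, scored INT, conceded INT)")
conn.executemany("INSERT INTO Scorelines VALUES (?, ?, ?, ?)", [
    ("A", 1, 1, 0), ("B", 2, 1, 1), ("A", 3, 1, 1), ("A", 4, 3, 0), ("B", 4, 1, 4),
    ("A", 6, 0, 2), ("B", 6, 4, 2), ("B", 8, 1, 2), ("B", 10, 1, 1), ("A", 11, 2, 1),
])

MOVING_SUM_SQL = """
SELECT team, kickoff,
       -- only report a sum when the frame really holds 3 earlier games
       CASE WHEN COUNT(scored) OVER three_matches = 3
            THEN SUM(scored) OVER three_matches
       END AS three_matches_scored
FROM Scorelines
WINDOW three_matches AS
    (PARTITION BY team ORDER BY kickoff
     ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
ORDER BY kickoff, team
"""
moving = conn.execute(MOVING_SUM_SQL).fetchall()
for row in moving:
    print(row)
```

Because the frame ends at 1 PRECEDING, the current game is never part of its own sum, and the COUNT over the same window tells us how many earlier games the frame actually contains.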

See harmic's answer above.
(My first solution follows, just for reference.)
Solution with a user-defined aggregate:
CREATE TYPE intermediate_sum AS (
    sum INT,
    count INT
);
-- state transition: add the value and count down from the initial 3
CREATE FUNCTION sum_sfunc(intermediate_sum, INTEGER) RETURNS intermediate_sum AS
$$ SELECT $2 + $1.sum AS sum, $1.count - 1 AS count $$ LANGUAGE SQL;
-- final function: return the sum only if the countdown reached exactly 0
CREATE FUNCTION sum_ffunc(intermediate_sum) RETURNS INTEGER AS
$$ SELECT (CASE WHEN $1.count = 0 THEN $1.sum
                ELSE NULL
           END)
$$ LANGUAGE SQL;
CREATE AGGREGATE sum3(INTEGER) (
    sfunc = sum_sfunc,
    finalfunc = sum_ffunc,
    stype = intermediate_sum,
    initcond = '(0,3)'
);
The aggregate SUM3 returns a sum only when it has been fed exactly 3 values; otherwise it returns (null). You can define other aggregates such as SUM4 by changing the initcond, for example to '(0,4)'.
