Find unique dataset with max. value from 3 columns

Find unique dataset with max. value from 3 columns - sql

Imaging following table
ID:PrimaryKey (Sequence generated Number)
ColA:ForeignKey(Number)
ColB:ForeignKey(Number)
ColC:ForeignKey(Number)
State:Enumeration(Number) 10,20,30,... 90
ValidFrom:TimeStamp(6)
LastUpdate:(6)
I know created a query to fetch any combination in the highest states (70 and above) The combination ColA,ColB and ColC should be unqiue. If there is a validfrom available the highest would win. If there are 2 in state 90 the newest would win:
So for some table like this
|------|------|------|-------|-------------|------------|
| ColA | ColB | ColC | State |ValidFrom |LastUpdate |
|------|------|------|-------|-------------|------------|
| 1 | 1 | 1 | 10 | null | 10.10.2018 | //Excluded
|------|------|------|-------|-------------|------------|
| 1 | 1 | 1 | 70 | null | 09.10.2018 | // lower State
|------|------|------|-------|-------------|------------|
| 1 | 1 | 1 | 90 | null | 05.05.2018 | // older LastUpdate
|------|------|------|-------|-------------|------------|
| 1 | 1 | 1 | 90 | null | 12.07.2018 | //Should Win
|------|------|------|-------|-------------|------------|
| 1 | 2 | 1 | 90 | 18.10.2018 | 12.07.2018 | //Should Win
|------|------|------|-------|-------------|------------|
| 1 | 2 | 1 | 90 | null | 18.11.2018 | //loose against ValidFrom
|------|------|------|-------|-------------|------------|
| 3 | 2 | 1 | 90 | 02.12.2018 | 04.08.2018 | //lower ValidFrom
|------|------|------|-------|-------------|------------|
| 3 | 2 | 1 | 70 | 19.10.2018 | 17.11.2018 | //lower state
|------|------|------|-------|-------------|------------|
| 3 | 2 | 1 | 90 | 18.10.2018 | 14.08.2018 | //Should win
|------|------|------|-------|-------------|------------|
So as you can see the combination of ColA,ColB and ColC should be unqiue at the end.
So I started writing a script gives me all the data with the highest states per combination:
SELECT MAINSELECT.*
FROM
FOO MAINSELECT
WHERE
MAINSELECT.STATE >= 70
AND NOT EXISTS
( SELECT SUBSELECT.ID
FROM
FOO SUBSELECT
WHERE SUBSELECT.ID <> MAINSELECT.ID
AND SUBSELECT.COLA = MAINSELECT.COLA
AND SUBSELECT.COLB = MAINSELECT.COLB
AND SUBSELECT.COLC = MAINSELECT.COLC
AND SUBSELECT.STATE > MAINSELECT.STATE);
This now gives me all in the highest state. As I do not want to use an OR statement I tried to solve the problem to query either NULL as Validfrom or the MAX in 2 different queries (and use union). So I tried to extend this base SELECT like this to get all with a ValidFrom != null && Max(ValidFrom):
SELECT MAINSELECT.*
FROM
FOO MAINSELECT
WHERE
MAINSELECT.STATE >= 70
MAINSELECT.VALIDFROM IS NOT NULL
AND NOT EXISTS
( SELECT SUBSELECT.ID
FROM
FOO SUBSELECT
WHERE SUBSELECT.ID <> MAINSELECT.ID
AND SUBSELECT.COLA = MAINSELECT.COLA
AND SUBSELECT.COLB = MAINSELECT.COLB
AND SUBSELECT.COLC = MAINSELECT.COLC
AND SUBSELECT.STATE > MAINSELECT.STATE)
AND NOT EXISTS
( SELECT SUBSELECT.ID
FROM
FOO SUBSELECT
WHERE SUBSELECT.ID <> MAINSELECT.ID -- Should not be the same
AND SUBSELECT.COLA = MAINSELECT.COLA -- Same combination!
AND SUBSELECT.COLB = MAINSELECT.COLB
AND SUBSELECT.COLC = MAINSELECT.COLC
AND SUBSELECT.STATE = MAINSELECT.STATE --Filter on same state!
AND SUBSELECT.VALIDFROM > MAINSELECT.VALIDFROM);
But this doesn't seem to work because now nothing ist printed.
I am expecting just row: 5 and 9! [Starting at 1 ;-)]
And I currently get row: 5, 7 and 9!
So the combination [3,2,1] is duplicate.
I do not get why the 2nd NOT EXISTS does not work. It's like there are 0F*** given!

Use row_number():
dbfiddle demo
select *
from (
select row_number() over (
partition by cola, colb, colc
order by state desc, validfrom desc nulls last, lastupdate desc) rn,
foo.*
from foo)
where rn = 1
7 wins against 9 because 2018-12-02 is newer than 2018-10-18.
Explanation:
partition by cola, colb, colc causes that for each combination of these columns numbering is done separately,
next are criteria of ordering, so higher state wins, then newer, not nullable validfrom wins, and at the end newer lastupdate wins.
For each combinantion of a, b, c we get separate set of numbered rows. Outer query filters only rows numbered as 1.

I found the answer. Instead of using NOT EXISTS I am trying to use the max, rpad and coalesce to create a string which I compare:
SELECT
MAINSELECT.*
FROM
FOO MAINSELECT
WHERE (1 = 1)
AND MAINSELECT.STATE >= 70
AND coalesce(to_char(MAINSELECT.state), rpad('0', 3, '0') ) || coalesce(to_char(MAINSELECT.validfrom,'YYMMDDhh24missFF'), rpad('0', 18, '0') ) || coalesce(to_char(MAINSELECT.lastupdate,'YYMMDDhh24missFF'), rpad('0', 18, '0') )
= (select max(coalesce(to_char(SUBSELECT.state), rpad('0', 3, '0') ) || coalesce(to_char(SUBSELECT.validfrom,'YYMMDDhh24missFF'), rpad('0', 18, '0') )|| coalesce(to_char(SUBSELECT.lastupdate,'YYMMDDhh24missFF'), rpad('0', 18, '0')))
FROM
FOO SUBSELECT
WHERE (1 = 1)
AND SUBSELECT.STATE >= 70
AND SUBSELECT.COLA = MAINSELECT.COLA
AND SUBSELECT.COLB = MAINSELECT.COLB
AND SUBSELECT.COLC = MAINSELECT.COLC
);
This creates a simple string with the values from the columns STATE,VALIDFROM and LASTUPDATE and is then trying to find the max of these! stating with the State which has the highest number and comes in the front!

Related

query SQL table for the same data in column for 3 times in a row

I have a table
Id, Response
1, Yes
2, Yes
3, No
4, No
5, Yes
6, No
7, No
8, No
I would like to be able to query the table and check for the response of No and if it occurs 3 times in a row return a value.
So I am trying
select count(response) where response = no
order by id
Basically, the theory goes, if there are 3 responses of No, I want to trigger something else to happen. So I need to query the table each time an entry is made, and if the last 3 entries are no then return value.
I only want to know if the latest values are 3 no. for example if the last 4 entries were no, no, no, yes - I don't care as there is a yes value
so the last 3 values have to be no

I don't know which RDBMS you use, but you can try something like that:
select count(*)
from
(select id,
response
from your_table
order by id desc
limit 3) t
where t.response = 'No';

Here is a solution in Bigquery. You may need to tweak the syntax for you SQL base:
SELECT
* ,
SUM( CASE WHEN response ="No" THEN 1 ELSE 0 END )
OVER (ORDER BY id RANGE BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM dataset
It returns output like this:
Which I think is what you want.
The key part is the window functions using RANGE BETWEEN 2 PRECEDING AND CURRENT ROW. The case statement is checking if the current row and the 2 before are "No". If they are return a 1. So when three in a row occur this will SUM to 3.

I would use two lag()s:
select t.*
from (select t.*,
lag(id, 2) over (order by id) as prev2_id,
lag(id, 2) over (order by id) as prev2_id_response
from t
) t
where response = 'no' and prev2_id = prev2_id_response;
The first lag() determines the id "2 back". The second determines the id "2 back" for the same response. If the response is the same for those three rows, then these are the same.
This returns each occurrence of "no" where this occurs. You can use exists if you just want to know if this ever occurs.

This can be done with window functions and a derived table or CTE term. The following takes you through how it can be done, step by step:
Full Example with data
WITH cte1 AS (
SELECT x.*
, CASE WHEN COALESCE(LAG(response) OVER (ORDER BY id), 'NA') <> response THEN 1 ELSE 0 END AS edge
FROM xlogs AS x
)
, cte2 AS (
SELECT x.*
, SUM(edge) OVER (ORDER BY id) AS xgroup
FROM cte1 AS x
)
, cte3 AS (
SELECT x.*
, ROW_NUMBER() OVER (PARTITION BY xgroup ORDER BY id) AS xposition
FROM cte2 AS x
)
, cte4 AS (
SELECT x.*
, CASE WHEN xposition >= 3 AND response = 'No' THEN 1 END AS xtrigger
FROM cte3 AS x
)
, cte5 AS (
SELECT x.*
FROM cte4 AS x
ORDER BY id DESC
LIMIT 1
)
SELECT *
FROM cte5
WHERE response = 'No'
;
The result of cte4 provides useful detail about the logic:
+----+----------+------+--------+-----------+----------+
| id | response | edge | xgroup | xposition | xtrigger |
+----+----------+------+--------+-----------+----------+
| 1 | Yes | 1 | 1 | 1 | NULL |
| 2 | Yes | 0 | 1 | 2 | NULL |
| 3 | No | 1 | 2 | 1 | NULL |
| 4 | No | 0 | 2 | 2 | NULL |
| 5 | Yes | 1 | 3 | 1 | NULL |
| 6 | No | 1 | 4 | 1 | NULL |
| 7 | No | 0 | 4 | 2 | NULL |
| 8 | No | 0 | 4 | 3 | 1 |
+----+----------+------+--------+-----------+----------+

How to make sure the sql result is continued range?

I have table like:
id | low_number | high_number
-------------------------------
1 | 12 | 32
-------------------------------
2 | 13 | 33
-------------------------------
3 | 15 | 36
-------------------------------
4 | 33 | 50
-------------------------------
5 | 35 | 52
...
-------------------------------
17 | 52 | 80
I want to get result like:
id | low_number | high_number
-------------------------------
1 | 12 | 32
-------------------------------
4 | 33 | 50
-------------------------------
17 | 52 | 80
that is because the low_number bigger than the pervious row high_number.
How to write sql to get these result? I use postgresql

This seems like a recursive CTE problem. You want to choose the first row (by id) and then choose the next row based on that.
The idea is to cycle through the rows, one at a time. Then when the condition is met, transition to that row. And so on.
As a query, this looks like:
with recursive tt as (
select id, low_number, high_number, row_number() over (order by id) as seqnum
from t
),
cte as (
select id, low_number, high_number, seqnum, true as is_change, id as grouping_id
from tt
where seqnum = 1
union all
select tt.id, tt.low_number, tt.high_number, tt.seqnum, tt.low_number > t.high_number,
(case when tt.low_number > t.high_number then tt.id else cte.grouping_id end)
from cte join
t
on cte.grouping_id = t.id join
tt
on tt.seqnum = cte.seqnum + 1
)
select *
from cte
where is_change;
Here is a db<>fiddle.

Use the window function LAG() to get a value of a previous row, e.g.
WITH j AS (
SELECT
id,low_number,high_number,
LAG(high_number) OVER (ORDER BY id) AS prev_high_number
FROM t)
SELECT id,low_number,high_number FROM j
WHERE low_number > prev_high_number OR prev_high_number IS NULL;
Demo: db<>fiddle

query for column that are within a variable + or 1 of another column

I have a table that has 2 columns, and I am trying to determine a way to select the records where the two columns are CLOSE to one another. Maybe based on standard deviation if i can think about how to do that. But for now, this is what my table looks like:
ID| PCT | RETURN
1 | 20 | 1.20
2 | 15 | 0.90
3 | 0 | 3.00
The values in the pct field is a percent number (for example 20%). The value in the return field is a not fully calculated % number (so its supposed to be 20% above what the initial value was). The query I am working with so far is this:
select * from TABLE1 where ((pct = ((return - 1)* 100)));
What I'd like to end up with are the rows where both are within a set value of each other. For example If they are within 5 points of each other, then the row would be returned and the output would be:
ID| PCT | RETURN
1 | 20 | 1.20
2 | 15 | 0.90
In the above, ID 1 should work out to be PCT = 20 and Return = 20, and ID 2, is PCT = 15 and RETURN = 10. Because it was within 5 points of each other, it was returned.
ID 3 was not returned because 0 and 200 are way above the 5 point threshold.
Is there any way to set a variable that would return a +- 5 when comparing the two values from the above attributes? Thanks.

RexTester Example:
Use Lead() over (Order by PCT) to look ahead and LAG() to look back to the next row do the math and evaluate results...
WITH CTE (ID, PCT , RETURN) as (
SELECT 1 , 20 , 1.20 FROM DUAL UNION ALL
SELECT 2 , 15 , 0.90 FROM DUAL UNION ALL
SELECT 3 , 0 , 3.00 FROM DUAL),
CTE2 as (SELECT A.*, LEAD(PCT) Over (ORDER BY PCT) LEADPCT, LAG(PCT) Over (order by PCT) LAGPCT
FROM CTE A)
SELECT * FROM CTE2
WHERE LEADPCT-PCT <=5 OR PCT-LAGPCT <=5
Order by ID
Giving us:
+----+----+-----+--------+---------+--------+
| | ID | PCT | RETURN | LEADPCT | LAGPCT |
+----+----+-----+--------+---------+--------+
| 1 | 1 | 20 | 1,20 | NULL | 15 |
| 2 | 2 | 15 | 0,90 | 20 | 0 |
+----+----+-----+--------+---------+--------+
or use the return value instead of PCT... just depends on what you're after. But maybe I don't fully understand the question..

Possible to use a column name in a UDF in SQL?

I have a query in which a series of steps is repeated constantly over different columns, for example:
SELECT DISTINCT
MAX (
CASE
WHEN table_2."GRP1_MINIMUM_DATE" <= cohort."ANCHOR_DATE" THEN 1
ELSE 0
END)
OVER (PARTITION BY cohort."USER_ID")
AS "GRP1_MINIMUM_DATE",
MAX (
CASE
WHEN table_2."GRP2_MINIMUM_DATE" <= cohort."ANCHOR_DATE" THEN 1
ELSE 0
END)
OVER (PARTITION BY cohort."USER_ID")
AS "GRP2_MINIMUM_DATE"
FROM INPUT_COHORT cohort
LEFT JOIN INVOLVE_EVER table_2 ON cohort."USER_ID" = table_2."USER_ID"
I was considering writing a function to accomplish this as doing so would save on space in my query. I have been reading a bit about UDF in SQL but don't yet understand if it is possible to pass a column name in as a parameter (i.e. simply switch out "GRP1_MINIMUM_DATE" for "GRP2_MINIMUM_DATE" etc.). What I would like is a query which looks like this
SELECT DISTINCT
FUNCTION(table_2."GRP1_MINIMUM_DATE") AS "GRP1_MINIMUM_DATE",
FUNCTION(table_2."GRP2_MINIMUM_DATE") AS "GRP2_MINIMUM_DATE",
FUNCTION(table_2."GRP3_MINIMUM_DATE") AS "GRP3_MINIMUM_DATE",
FUNCTION(table_2."GRP4_MINIMUM_DATE") AS "GRP4_MINIMUM_DATE"
FROM INPUT_COHORT cohort
LEFT JOIN INVOLVE_EVER table_2 ON cohort."USER_ID" = table_2."USER_ID"
Can anyone tell me if this is possible/point me to some resource that might help me out here?
Thanks!

There is no such direct as #Tejash already stated, but the thing looks like your database model is not ideal - it would be better to have a table that has USER_ID and GRP_ID as keys and then MINIMUM_DATE as seperate field.
Without changing the table structure, you can use UNPIVOT query to mimic this design:
WITH INVOLVE_EVER(USER_ID, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE)
AS (SELECT 1, SYSDATE, SYSDATE, SYSDATE, SYSDATE FROM dual UNION ALL
SELECT 2, SYSDATE-1, SYSDATE-2, SYSDATE-3, SYSDATE-4 FROM dual)
SELECT *
FROM INVOLVE_EVER
unpivot ( minimum_date FOR grp_id IN ( GRP1_MINIMUM_DATE AS 1, GRP2_MINIMUM_DATE AS 2, GRP3_MINIMUM_DATE AS 3, GRP4_MINIMUM_DATE AS 4))
Result:
| USER_ID | GRP_ID | MINIMUM_DATE |
|---------|--------|--------------|
| 1 | 1 | 09/09/19 |
| 1 | 2 | 09/09/19 |
| 1 | 3 | 09/09/19 |
| 1 | 4 | 09/09/19 |
| 2 | 1 | 09/08/19 |
| 2 | 2 | 09/07/19 |
| 2 | 3 | 09/06/19 |
| 2 | 4 | 09/05/19 |
With this you can write your query without further code duplication and if you need use PIVOT-syntax to get one line per USER_ID.
The final query could then look like this:
WITH INVOLVE_EVER(USER_ID, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE)
AS (SELECT 1, SYSDATE, SYSDATE, SYSDATE, SYSDATE FROM dual UNION ALL
SELECT 2, SYSDATE-1, SYSDATE-2, SYSDATE-3, SYSDATE-4 FROM dual)
, INPUT_COHORT(USER_ID, ANCHOR_DATE)
AS (SELECT 1, SYSDATE-1 FROM dual UNION ALL
SELECT 2, SYSDATE-2 FROM dual UNION ALL
SELECT 3, SYSDATE-3 FROM dual)
-- Above is sampledata query starts from here:
, unpiv AS (SELECT *
FROM INVOLVE_EVER
unpivot ( minimum_date FOR grp_id IN ( GRP1_MINIMUM_DATE AS 1, GRP2_MINIMUM_DATE AS 2, GRP3_MINIMUM_DATE AS 3, GRP4_MINIMUM_DATE AS 4)))
SELECT qcsj_c000000001000000 user_id, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE
FROM INPUT_COHORT cohort
LEFT JOIN unpiv table_2
ON cohort.USER_ID = table_2.USER_ID
pivot (MAX(CASE WHEN minimum_date <= cohort."ANCHOR_DATE" THEN 1 ELSE 0 END) AS MINIMUM_DATE
FOR grp_id IN (1 AS GRP1,2 AS GRP2,3 AS GRP3,4 AS GRP4))
Result:
| USER_ID | GRP1_MINIMUM_DATE | GRP2_MINIMUM_DATE | GRP3_MINIMUM_DATE | GRP4_MINIMUM_DATE |
|---------|-------------------|-------------------|-------------------|-------------------|
| 3 | | | | |
| 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 1 |
This way you only have to write your calculation logic once (see line starting with pivot).

SQL - min() gets the lowest value, max() the highest, what if I want the 2nd (or 5th or nth) lowest value?

The problem I'm trying to solve is that I have a table like this:
a and b refer to point on a different table. distance is the distance between the points.
| id | a_id | b_id | distance | delete |
| 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 2 | 0.2345 | 0 |
| 3 | 1 | 3 | 100 | 0 |
| 4 | 2 | 1 | 1343.2 | 0 |
| 5 | 2 | 2 | 0.45 | 0 |
| 6 | 2 | 3 | 110 | 0 |
....
The important column I'm looking is a_id. If I wanted to keep the closet b for each a, I could do something like this:
update mytable set delete = 1 from (select a_id, min(distance) as dist from table group by a_id) as x where a_gid = a_gid and distance > dist;
delete from mytable where delete = 1;
Which would give me a result table like this:
| id | a_id | b_id | distance | delete |
| 1 | 1 | 1 | 1 | 0 |
| 5 | 2 | 2 | 0.45 | 0 |
....
i.e. I need one row for each value of a_id, and that row should have the lowest value of distance for each a_id.
However I want to keep the 10 closest points for each a_gid. I could do this with a plpgsql function but I'm curious if there is a more SQL-y way.
min() and max() return the smallest and largest, if there was an aggregate function like nth(), which'd return the nth largest/smallest value then I could do this in similar manner to the above.
I'm using PostgeSQL.

Try this:
SELECT *
FROM (
SELECT a_id, (
SELECT b_id
FROM mytable mib
WHERE mib.a_id = ma.a_id
ORDER BY
dist DESC
LIMIT 1 OFFSET s
) AS b_id
FROM (
SELECT DISTINCT a_id
FROM mytable mia
) ma, generate_series (1, 10) s
) ab
WHERE b_id IS NOT NULL
Checked on PostgreSQL 8.3

I love postgres, so it took it as a challenge the second I saw this question.
So, for the table:
Table "pg_temp_29.foo"
Column | Type | Modifiers
--------+---------+-----------
value | integer |
With the values:
SELECT value FROM foo ORDER BY value;
value
-------
0
1
2
3
4
5
6
7
8
9
14
20
32
(13 rows)
You can do a:
SELECT value FROM foo ORDER BY value DESC LIMIT 1 OFFSET X
Where X = 0 for the highest value, 1 for the second highest, 2... And so forth.
This can be further embedded in a subquery to retrieve the value needed. So, to use the dataset provided in the original question we can get the a_ids with the top ten lowest distances by doing:
SELECT a_id, distance FROM mytable
WHERE id IN
(SELECT id FROM mytable WHERE t1.a_id = t2.a_id
ORDER BY distance LIMIT 10);
ORDER BY a_id, distance;
a_id | distance
------+----------
1 | 0.2345
1 | 1
1 | 100
2 | 0.45
2 | 110
2 | 1342.2

Does PostgreSQL have the analytic function rank()? If so try:
select a_id, b_id, distance
from
( select a_id, b_id, distance, rank() over (partition by a_id order by distance) rnk
from mytable
) where rnk <= 10;

This SQL should find you the Nth lowest salary should work in SQL Server, MySQL, DB2, Oracle, Teradata, and almost any other RDBMS: (note: low performance because of subquery)
SELECT * /*This is the outer query part */
FROM mytable tbl1
WHERE (N-1) = ( /* Subquery starts here */
SELECT COUNT(DISTINCT(tbl2.distance))
FROM mytable tbl2
WHERE tbl2.distance < tbl1.distance)
The most important thing to understand in the query above is that the subquery is evaluated each and every time a row is processed by the outer query. In other words, the inner query can not be processed independently of the outer query since the inner query uses the tbl1 value as well.
In order to find the Nth lowest value, we just find the value that has exactly N-1 values lower than itself.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Find unique dataset with max. value from 3 columns - sql

Related

query SQL table for the same data in column for 3 times in a row

How to make sure the sql result is continued range?

query for column that are within a variable + or 1 of another column

Possible to use a column name in a UDF in SQL?

SQL - min() gets the lowest value, max() the highest, what if I want the 2nd (or 5th or nth) lowest value?

Categories

Resources