AWS Athena SQL to group and find minimum in distinct rows - sql

I have a query against AWS Athena, and the core of it works great. My company's code is AA (field ACD) and our competitors' codes are BB, CC and DD (field OCD). For each distinct trip my company makes, I get a series of similar trips from competitors. I end up with a table like this:
main =
AID ATRIPDT ACD ACAR CY1 CY2 OID OTRIPDT OCD BCAR DELMN
0 10/30/2018 AA XX22 LAS LAX 300 10/30/2018 BB ZZ1 21
0 10/30/2018 AA XX22 LAS LAX 544 10/30/2018 CC T09 36
0 10/30/2018 AA XX22 LAS LAX 755 10/30/2018 BB KLQ 57
0 10/30/2018 AA XX22 LAS LAX 912 10/30/2018 DD 75Q 5
1 10/30/2018 AA P700 LAS LAX 390 10/30/2018 BB MNZ 13
1 10/30/2018 AA P700 LAS LAX 603 10/30/2018 BB JJ1 30
However, the last step is to group by AID and keep only one record per OCD: the one with the minimum value of DELMN.
In this case I am looking for this as a result:
AID ATRIPDT ACD ACAR CY1 CY2 OID OTRIPDT OCD BCAR DELMN
0 10/30/2018 AA XX22 LAS LAX 300 10/30/2018 BB ZZ1 21
0 10/30/2018 AA XX22 LAS LAX 544 10/30/2018 CC T09 36
0 10/30/2018 AA XX22 LAS LAX 912 10/30/2018 DD 75Q 5
1 10/30/2018 AA P700 LAS LAX 390 10/30/2018 BB MNZ 13
I tried this
with main as
(
<complex query that returns main table>
)
select * from main
where DELMN = (select min(DELMN) from main as b where b.OCD=main.OCD)
which returns a total of three records, so I am not setting up the grouping correctly. I'm brain-drained and not sure what else to try.

You want one row per AID+OCD value, so you'll want something like:
WITH main AS
(
<complex query that returns main table>
)
SELECT *
FROM main
WHERE DELMN = (SELECT MIN(DELMN)
FROM main AS b
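WHERE b.OCD = main.OCD AND b.AID = main.AID)
-- No GROUP BY here: SELECT * combined with GROUP BY AID, OCD is rejected by Athena,
-- and the correlated subquery already keeps one row per AID+OCD (more if DELMN ties).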
It won't be a very efficient query, but it should work. It can be made more efficient by JOINing to a subquery that pulls the minimum DELMN grouped by AID and OCD, rather than using a correlated sub-select that runs for every row. That way, it only needs to scan those tables once. Don't worry about that unless you have enough rows for the query to slow down noticeably.
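As a rough sketch of that JOIN variant, the snippet below loads a cut-down copy of the sample rows into an in-memory SQLite database through Python's sqlite3 module. SQLite only stands in for Athena so the query can be run locally, and the trimmed column list is an assumption made for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the columns needed for the illustration (assumed subset of "main").
conn.execute(
    "CREATE TABLE main (AID INTEGER, ACAR TEXT, OID INTEGER, "
    "OCD TEXT, BCAR TEXT, DELMN INTEGER)"
)
conn.executemany(
    "INSERT INTO main VALUES (?, ?, ?, ?, ?, ?)",
    [
        (0, "XX22", 300, "BB", "ZZ1", 21),
        (0, "XX22", 544, "CC", "T09", 36),
        (0, "XX22", 755, "BB", "KLQ", 57),
        (0, "XX22", 912, "DD", "75Q", 5),
        (1, "P700", 390, "BB", "MNZ", 13),
        (1, "P700", 603, "BB", "JJ1", 30),
    ],
)

# Compute the minimum DELMN per (AID, OCD) once, then join back to pick the row.
JOIN_QUERY = """
SELECT m.AID, m.OID, m.OCD, m.DELMN
FROM main AS m
JOIN (SELECT AID, OCD, MIN(DELMN) AS min_delmn
      FROM main
      GROUP BY AID, OCD) AS g
  ON g.AID = m.AID AND g.OCD = m.OCD AND g.min_delmn = m.DELMN
ORDER BY m.AID, m.OID
"""
result = conn.execute(JOIN_QUERY).fetchall()
for row in result:
    print(row)
```

As with the correlated version, two rows that tie on the minimum DELMN within the same AID and OCD would both be returned.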

Related

SQL query for Bill of Materials to get levels of components

I have got data that contains
parent_id child_id parent_desc child_desc
123 24 AA BB
123 81 AA ZZ
24 32 BB EE
32 45 EE DD
45 57 DD FF
57 62 FF GG
62 7 GG FA
81 9 ZZ GA
What I want to achieve is a BOM explosion of it.
What I have come up with is the code below:
create table test
( SurrogateKey bigint, ForeignKey bigint,mat_desc varchar,comp_desc varchar );
insert into test
values
( 123, 24,'AA','BB'),
(123,81,'AA','ZZ'),
( 24, 32,'BB','EE'),
( 32, 45,'EE','DD'),
( 45, 57,'DD', 'FF'),
( 57, 62,'FF','GG'),
( 62, 7,'GG','FA'),
(81,9,'ZZ','GA');
With traversal as
( SELECT test.SurrogateKey OriginKey,
ForeignKey,mat_desc,comp_desc
FROM test
WHERE SurrogateKey = 123 -- this first portion of the query generates the beginning set of records.
UNION ALL
SELECT traversal.OriginKey,
test.ForeignKey,traversal.mat_desc,test.comp_desc
FROM test
INNER JOIN traversal
ON test.SurrogateKey = traversal.ForeignKey -- we join back to the result set generated in the previous iteration of the recursion until no more nodes to travel to
)
select * from traversal
That is giving me a result:
parent_id child_id parent_desc child_desc
123 24 AA BB
123 81 AA ZZ
123 32 AA EE
123 9 AA GA
123 45 AA DD
123 57 AA FF
123 62 AA GG
123 7 AA FA
But what I would like to achieve is to add a column showing the level of each child, so the first three rows would look like this, and so on:
parent_id child_id parent_desc child_desc level
123 24 AA BB 1
123 81 AA ZZ 1
123 32 AA EE 2
123 9 AA GA
123 45 AA DD
123 57 AA FF
123 62 AA GG
123 7 AA FA
Maybe someone has an idea how to solve this, perhaps with a window function, or with some kind of counter inside the recursive CTE.
Thank you all in advance

If I understand correctly, you want to enumerate the rows based on the tree depth. You correctly recognize that you can do this with a recursive CTE.
The syntax for recursive CTEs varies between databases, but the idea is:
with recursive cte as (
select t.SurrogateKey, t.ForeignKey, t.mat_desc, t.comp_desc, t.mat_desc as orig_mat_desc, 1 as lev
from test t
where not exists (select 1 from test t2 where t2.comp_desc = t.mat_desc)
union all
select t.SurrogateKey, t.ForeignKey, t.mat_desc, t.comp_desc, cte.orig_mat_desc, 1 + lev
from cte join
test t
on cte.comp_desc = t.mat_desc
)
select *
from cte;
Here is a db<>fiddle.
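For comparison, here is a minimal sketch of the same counter idea applied to the asker's own traversal query, so that the origin key 123 is kept on every row. It is run through Python's sqlite3 module purely for illustration. SQLite's WITH RECURSIVE syntax and the alias names tr and lev are assumptions, not part of the original post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (SurrogateKey INTEGER, ForeignKey INTEGER, "
    "mat_desc TEXT, comp_desc TEXT)"
)
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [
        (123, 24, "AA", "BB"), (123, 81, "AA", "ZZ"),
        (24, 32, "BB", "EE"), (32, 45, "EE", "DD"),
        (45, 57, "DD", "FF"), (57, 62, "FF", "GG"),
        (62, 7, "GG", "FA"), (81, 9, "ZZ", "GA"),
    ],
)

# The anchor rows start at level 1; each recursive step adds 1 to the
# level carried by the row it was reached from.
LEVEL_QUERY = """
WITH RECURSIVE traversal (OriginKey, ForeignKey, mat_desc, comp_desc, lev) AS (
    SELECT SurrogateKey, ForeignKey, mat_desc, comp_desc, 1
    FROM test
    WHERE SurrogateKey = 123
    UNION ALL
    SELECT tr.OriginKey, t.ForeignKey, tr.mat_desc, t.comp_desc, tr.lev + 1
    FROM test AS t
    JOIN traversal AS tr ON t.SurrogateKey = tr.ForeignKey
)
SELECT OriginKey, ForeignKey, mat_desc, comp_desc, lev
FROM traversal
ORDER BY lev, ForeignKey
"""
levels = conn.execute(LEVEL_QUERY).fetchall()
for row in levels:
    print(row)
```

Children 24 and 81 come out at level 1, their children 32 and 9 at level 2, and the chain below 32 continues down to level 6.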

Provide a list of subtypes from Main group with range numbers

I have a table named Items containing a list of items at both group level (G) and sub level (L). However, I want to see only the sub-level (L) rows, with the respective group name attached to each of them. Each group item has a start and end number range. The numbers of the sub-level rows do not increase by 1; they increase in no particular pattern. Also, for each sub-level row the start and end numbers are the same.
I am using Microsoft SQL-Server Management Studio-2018
Main table: Items
Code     Start_No  End_No  Type
Group 1  1001      1035    G
AA       1001      1001    L
BB       1005      1005    L
CC       1009      1009    L
DD       1020      1020    L
EE       1035      1035    L
Group 2  1051      1090    G
FF       1051      1051    L
GG       1060      1060    L
HH       1075      1075    L
JJ       1090      1090    L
Group 3  1095      1200    G
LL       1095      1095    L
OO       1120      1120    L
PP       1200      1200    L
Group 4  1300      1800    G
QQ       1300      1300    L
TU       1500      1500    L
WC       1600      1600    L
ZA       1800      1800    L
I would like for the final output to be:
Desired outcome
Code Group  Code  Start_No  End_No
Group 1     AA    1001      1001
Group 1     BB    1005      1005
Group 1     CC    1009      1009
Group 1     DD    1020      1020
Group 1     EE    1035      1035
Group 2     FF    1051      1051
Group 2     GG    1060      1060
Group 2     HH    1075      1075
Group 2     JJ    1090      1090
Group 3     LL    1095      1095
Group 3     OO    1120      1120
Group 3     PP    1200      1200
Group 4     QQ    1300      1300
Group 4     TU    1500      1500
Group 4     WC    1600      1600
Group 4     ZA    1800      1800
This is the code I have written, but it does not give the desired result.
Select i.Code, i.Start_No, i.End_No
into #temp
FROM items i
Where i.Type = 'L'
Select i2.Code, i2.Start_No, i2.End_No
FROM GLM_CHART i2
WHERE
EXISTS (SELECT * FROM #temp t where t.Start_No BETWEEN i2.Start_No AND i2.End_No)
Thanks

You can use a join:
select i.*, ig.code
from items i join
items ig
on i.start_no >= ig.start_no and
i.end_no <= ig.end_no and
ig.type = 'G'
where i.type = 'L';
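To see the range join in action, here is a small sketch using Python's sqlite3 module with a subset of the rows from the question. SQLite is only a stand-in for SQL Server here, and the code_group alias is a name chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (code TEXT, start_no INTEGER, end_no INTEGER, type TEXT)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [
        ("Group 1", 1001, 1035, "G"),
        ("AA", 1001, 1001, "L"),
        ("BB", 1005, 1005, "L"),
        ("EE", 1035, 1035, "L"),
        ("Group 2", 1051, 1090, "G"),
        ("FF", 1051, 1051, "L"),
        ("JJ", 1090, 1090, "L"),
    ],
)

# Each L row is matched to the G row whose range contains it.
RANGE_JOIN = """
SELECT ig.code AS code_group, i.code, i.start_no, i.end_no
FROM items AS i
JOIN items AS ig
  ON ig.type = 'G'
 AND i.start_no >= ig.start_no
 AND i.end_no <= ig.end_no
WHERE i.type = 'L'
ORDER BY i.start_no
"""
grouped = conn.execute(RANGE_JOIN).fetchall()
for row in grouped:
    print(row)
```

This relies on the group ranges not overlapping; if they did, a sub-level row would be returned once per matching group.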

Selecting rows where values in one column are different, other column(s) values are similar and values of Date columns are correct

Suppose I have the following columns:
ID,Code,DST,Short_text,Long_text,Date_from,Date_until
Here is the dataset:
ID Code DST Short_text Long_text Date_From Date_Until
1 B 01 B 1 Bez1 Bezirk1 29.10.1999 13.01.2020
1 B 01 B 1 Bez1 Bezirk1 14.01.2020 31.12.9999
2 B 02 B 2 Bez2 Bezirk2 29.10.1999 13.01.2020
3 B 03 B 3 Bez3 Bezirk3 14.01.2020 31.12.9999
4 B 04 B 4 Bez4 Bezirk4 29.10.1999 13.01.2020
4 B 04 B 4 Bez4 Bezirk4 14.01.2020 31.12.9999
97 M 51 M 52 MA 51 Sport 29.10.1999 13.01.2020
96 M 51 M 51 MA 51 Sport 14.01.2020 31.12.9999
98 M 55 M 53 MA 53 Dance 29.10.1999 13.01.2020
99 M 55 M 54 MA 54 Skating 14.01.2020 31.12.9999
100 M 56 M 59 MA 57 Football 29.10.1999 13.01.2020
101 M 56 M 56 MA 56 Tennis 29.10.1999 31.12.9999
I want to select rows that have different IDs AND (similar Code OR similar Short_text OR similar Long_text) AND a correct Date_From - Date_Until range.
Definition of a correct Date_From - Date_Until range:
1. Date_From < Date_Until
2. Both fields are not NULL
3. WHEN PREV_DATE_UNTIL = DATE_FROM - 1 OR PREV_DATE_UNTIL IS NULL THEN 'OK' (PREV_DATE_UNTIL obtained with the LAG function)
4. WHEN NEXT_DATE_FROM = DATE_UNTIL + 1 OR NEXT_DATE_FROM IS NULL THEN 'OK' (NEXT_DATE_FROM obtained with the LEAD function)
Not correct:
WHEN NEXT_DATE_FROM > DATE_UNTIL + 1 THEN 'Gaps in Dates'
WHEN NEXT_DATE_FROM < DATE_UNTIL + 1 THEN 'Overlapping dates'
Basically, what I mean is that the historization of the data must be correct (no overlapping).
At the end I want to select the following rows:
97 M 51 M 52 MA 51 Sport 29.10.1999 13.01.2020
96 M 51 M 51 MA 51 Sport 14.01.2020 31.12.9999
Because they have different ID and similar Code or short_text or long_text and dates are correct according to the definition
And
98 M 55 M 53 MA 53 Dance 29.10.1999 13.01.2020
99 M 55 M 54 MA 54 Skating 14.01.2020 31.12.9999
Because they have different ID and similar Code and dates are correct according to the definition
Rows:
100 M 56 M 59 MA 57 Football 29.10.1999 13.01.2020
101 M 56 M 56 MA 56 Tennis 29.10.1999 31.12.9999
Should NOT be selected, because they have different IDs and similar Code BUT incorrect dates (they overlap).

It would be something like this:
with t as (
select row_number() over (order by date_from, date_until) rn,
id, code, dst, short_text, long_text, date_from,
nullif(date_until, date '9999-12-31') date_until
from data)
select rna, rnb, description, t.*
from t
join (
select a.rn rna, b.rn rnb,
case when b.date_from = a.date_until + 1 then 'OK'
when b.date_from > a.date_until + 1 then 'gaps'
when b.date_from < a.date_until + 1 then 'overlapping'
end description
from t a
join t b on a.id <> b.id and a.rn < b.rn
and (a.code = b.code or a.short_text = b.short_text
or a.long_text = b.long_text)) pairs
on rn in (rna, rnb)
result is:
RNA RNB DESCRIPTION RN ID CODE DST SHORT_TEXT LONG_TEXT DATE_FROM DATE_UNTIL
------ ------ ----------- ----- ---------- ---- ---- ---------- --------- ----------- -----------
1 7 overlapping 1 100 M 56 M 59 MA 57 Football 1999-10-29 2020-01-13
1 7 overlapping 7 101 M 56 M 56 MA 56 Tennis 1999-10-29
3 8 OK 3 98 M 55 M 53 MA 53 Dance 1999-10-29 2020-01-13
3 8 OK 8 99 M 55 M 54 MA 54 Skating 2020-01-14
6 9 OK 6 97 M 51 M 52 MA 51 Sport 1999-10-29 2020-01-13
6 9 OK 9 96 M 51 M 51 MA 51 Sport 2020-01-14
dbfiddle
I numbered the rows, self-joined the data, and expressed your logic as a CASE WHEN expression. I tested it on your examples; if you find any mistakes, please provide a dbfiddle if possible.
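The comparison inside the CASE expression can also be sketched outside SQL. The Python function below is only an illustration of that rule; the name classify and the treatment of a missing next Date_From as 'OK' (taken from rule 4 of the question) are assumptions of this sketch.

```python
from datetime import date, timedelta


def classify(date_until, next_date_from):
    """Compare one row's Date_Until with the following row's Date_From."""
    if next_date_from is None:
        return "OK"  # no following period, per rule 4 in the question
    expected = date_until + timedelta(days=1)
    if next_date_from == expected:
        return "OK"
    if next_date_from > expected:
        return "gaps"
    return "overlapping"


# IDs 97/96: the second period starts the day after the first one ends.
print(classify(date(2020, 1, 13), date(2020, 1, 14)))
# IDs 100/101: both periods start on 29.10.1999.
print(classify(date(2020, 1, 13), date(1999, 10, 29)))
```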

How can I merge two dataframes outside the intersection of the data?

I have a dataframe of presidential candidates, the donation amounts they received, and the states the donations came from (contbr_st).
However, contbr_st includes non-state abbreviations such as AA, FF and FM, as shown below. I also have a single-column dataframe of 50 state abbreviations.
The dataframe below is "total":
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 56405.00 135.00
AB 2048.00 NaN
AE 42973.75 5680.00
AK 281840.15 86204.24
AL 543123.48 527303.51
AP 37130.50 1655.00
AR 359247.28 105556.00
AS 2955.00 NaN
AZ 1506476.98 1888436.23
CA 23824984.24 11237636.60
CO 2132429.49 1506714.12
CT 2068291.26 3499475.45
DC 4373538.80 1025137.50
DE 336669.14 82712.00
FF NaN 99030.00
FL 7318178.58 8338458.81
FM 600.00 NaN
The dataframe below holds the state abbreviations; it is "state":
state
0 AL
1 AK
2 AZ
3 AR
4 CA
5 CO
6 CT
7 DC
8 DE
9 FL
10 GA
11 HI
12 ID
13 IL
14 IN
15 IA
16 KS
17 KY
18 LA
19 ME
20 MD
21 MA
22 MI
23 MN
24 MS
25 MO
26 MT
27 NE
28 NV
29 NH
30 NJ
31 NM
32 NY
33 NC
34 ND
35 OH
36 OK
37 OR
38 PA
39 RI
40 SC
41 SD
42 TN
43 TX
44 UT
45 VT
46 VA
47 WA
48 WV
49 WI
50 WY
Is there a simple way in Pandas to merge these two dataframes to discard the intersecting states, and keep the non state data from the original dataframe ('total')?
My expected output would contain only the non-state abbreviation rows, as below:
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 56405.00 135.00
AP 37130.50 1655.00
FF NaN 99030.00
FM 600.00 NaN
.
.
The only way I can think of is to take the state list from each dataframe, convert both to sets, and use the difference() method, then convert the result to a dataframe and merge it with the "total" dataframe.
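A minimal sketch of that set-difference idea, assuming pandas is installed and using an abbreviated copy of the two dataframes. It shows two variants: a boolean mask built with Index.isin, and Index.difference, which is the index-level counterpart of the set difference() mentioned above. No merge is needed in either case.

```python
import pandas as pd

nan = float("nan")
# Abbreviated stand-in for the "total" dataframe from the question.
total = pd.DataFrame(
    {
        "Obama, Barack": [56405.00, 281840.15, 543123.48, 37130.50, nan, 600.00],
        "Romney, Mitt": [135.00, 86204.24, 527303.51, 1655.00, 99030.00, nan],
    },
    index=pd.Index(["AA", "AK", "AL", "AP", "FF", "FM"], name="contbr_st"),
)
total.columns.name = "cand_nm"

# Abbreviated stand-in for the single-column "state" dataframe.
state = pd.DataFrame({"state": ["AL", "AK", "AZ", "AR", "CA"]})

# Variant 1: keep rows whose index label is NOT in the state list.
non_state = total[~total.index.isin(state["state"])]

# Variant 2: set difference on the index, then select those labels.
non_state_labels = total.index.difference(state["state"])
same_rows = total.loc[non_state_labels]

print(non_state)
```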

SQL Query pivot approach assistance

I am really struggling with this pivot and hoped that reaching out for help and enlightenment might help.
Say I have the following table:
Table A
type actId date rowSort order value value_char colName
------------------------------------------------------------------------------------
checking 1003 2011-12-31 2 1 44 44 Amount
checking 1003 2011-12-31 2 2 55 55 Interest
checking 1003 2011-12-31 2 3 66 66 Change
checking 1003 2011-12-31 2 4 77 77 Target
checking 1003 2011-12-31 2 5 88 88 Spread
savings 23456 2011-12-31 1 1 999 999 Amount
savings 23456 2011-12-31 1 2 888 888 Interest
savings 23456 2011-12-31 1 3 777 777 Change
savings 23456 2011-12-31 1 4 666 666 Target
savings 23456 2011-12-31 1 5 555 555 Spread
And I want to transpose it to table B:
checking chkId date rowSort order chkvalue chkValchar colName savings savId savVal savValChar
-------------------------------------------------------------------------------------------------------------------
checking 1003 2011-12-31 2 1 44 44 Amount savings 23456 999 999
checking 1003 2011-12-31 2 2 55 55 Interest savings 23456 888 888
checking 1003 2011-12-31 2 3 66 66 Change savings 23456 777 777
checking 1003 2011-12-31 2 4 77 77 Target savings 23456 666 666
checking 1003 2011-12-31 2 5 88 88 Spread savings 23456 555 555
I admit this is beyond my skills at the moment.
I believe I need to do a pivot on this table, using rowSort (to identify savings vs. checking) along with ordering by the order column. This may be wrong, and that is why I am here.
Is a pivot the right way to go? Am I right to assume my pivot should use the aggregate max(rowSort)?

Assuming rowSort from checking equals rowSort+1 from savings and the rows link through the value field, this should do it:
SELECT DISTINCT
a.type as checking,
a.actId as chkId,
a.date,
a.rowSort,
a.[order], -- ORDER is a reserved word, so the column name is bracketed
a.value as chkvalue,
a.value_char as chkValchar,
a.colName,
b.type as savings,
b.actId as savId, -- the savings id comes from the savings side (b)
b.value as savVal,
b.value_char as savValChar
FROM tablea a
INNER JOIN tablea b ON a.rowSort = b.rowSort+1 and b.value = a.value
Based on the requirements you presented, you do not need a PIVOT for this query; instead, JOIN the table to itself. The query below should give you the records you want without having to use DISTINCT:
select c.type as checking
, c.actId as chkid
, c.date
, c.rowsort
, c.[order]
, c.value as chkvalue
, c.value_char as chkValchar
, c.colName
, s.type as savings
, s.actId as savId
, s.value as savVal
, s.value_char as savValchar
from t1 c
inner join t1 s
on c.rowsort = s.rowsort + 1
and c.[order] = s.[order]
See SQL Fiddle with Demo
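As a quick way to try the self-join, the sketch below runs it against an in-memory SQLite database from Python. SQLite stands in for SQL Server, so the order column is quoted with double quotes instead of brackets, and the table is reduced to a few columns for brevity; both are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t1 (type TEXT, actId INTEGER, rowSort INTEGER, "
    '"order" INTEGER, value INTEGER, colName TEXT)'
)
names = ["Amount", "Interest", "Change", "Target", "Spread"]
checking_values = [44, 55, 66, 77, 88]
savings_values = [999, 888, 777, 666, 555]
for pos, (name, chk, sav) in enumerate(
    zip(names, checking_values, savings_values), start=1
):
    conn.execute(
        "INSERT INTO t1 VALUES ('checking', 1003, 2, ?, ?, ?)", (pos, chk, name)
    )
    conn.execute(
        "INSERT INTO t1 VALUES ('savings', 23456, 1, ?, ?, ?)", (pos, sav, name)
    )

# Pair each checking row (rowSort 2) with the savings row (rowSort 1)
# that has the same position in the "order" column.
SELF_JOIN = """
SELECT c.actId, c."order", c.colName, c.value, s.actId, s.value
FROM t1 AS c
JOIN t1 AS s
  ON c.rowSort = s.rowSort + 1
 AND c."order" = s."order"
ORDER BY c."order"
"""
pivoted = conn.execute(SELF_JOIN).fetchall()
for row in pivoted:
    print(row)
```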