How can I remove duplicates by using MAX and SUM per group identifier? - sql

I'm creating an open order report using SQL to query data from AWS Redshift.
My current table has duplicates (same order, ln, and subln numbers):
Order Ln  SubLn Qty ShpDt
4166  010 00    3   2021-01-06
4166  010 00    3   2021-01-09
4167  011 00    9   2021-02-01
4167  011 00    9   2021-01-28
4167  011 01    8   2020-12-29
I need to remove duplicates using the order, ln, and subln columns as group identifiers. I want to calculate the SUM of qty and keep the most recent ship date for the order to achieve this result:
Order Ln  SubLn TotQty Shipped
4166  010 00    6      2021-01-09
4167  011 00    18     2021-02-01
4167  011 01    8      2020-12-29
After reading (How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?) I tried the code below, which only aggregated the fields and did not remove duplicates. What am I missing?
FROM table1 AS t1
JOIN (SELECT t1.order, t1.ln, t1.subln, SUM(qty) AS totqty, MAX(shpdt) AS shipped
FROM table1 AS t1
GROUP BY order, ln, subln) as t2
ON t1.order = t2.order AND t1.ln = t2.ln AND t1.subln = t2.subln

You said you need to remove duplicates using order, ln, and subln as group identifiers, summing qty and keeping the most recent ship date.
The result of the grouped subquery is already unique on those 3 columns. The duplicates reappear only because that result is joined back to table1, which returns one row per original row.
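The point that the grouped query alone is already unique can be checked with a small sketch. It assumes an in-memory SQLite database as a stand-in for Redshift, uses the sample rows from the question, and quotes "order" because it is a reserved word; the variable names are illustrative only.

```python
import sqlite3

# In-memory SQLite database standing in for the Redshift table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE table1 ("order" INTEGER, ln TEXT, subln TEXT, qty INTEGER, shpdt TEXT)'
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?)",
    [
        (4166, "010", "00", 3, "2021-01-06"),
        (4166, "010", "00", 3, "2021-01-09"),
        (4167, "011", "00", 9, "2021-02-01"),
        (4167, "011", "00", 9, "2021-01-28"),
        (4167, "011", "01", 8, "2020-12-29"),
    ],
)

# One row per (order, ln, subln); there is no join back to table1.
dedup_rows = conn.execute(
    """
    SELECT "order", ln, subln, SUM(qty) AS totqty, MAX(shpdt) AS shipped
    FROM table1
    GROUP BY "order", ln, subln
    ORDER BY "order", ln, subln
    """
).fetchall()

for row in dedup_rows:
    print(row)
```

MAX works on the ship dates here because they are stored as ISO-formatted text, which sorts in date order.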

Related

hive sql get min and max values of multiple records

I have a query whose results are:
fruit street inventory need to_buy
banana 123 15 99 22
apple 4 32 68 44
banana 789 01 32 11
apple 9832 0 99 94
apple 85 839 12 48
banana 832 77 05 55
I want to get the minimum values for inventory and need, and the maximum to_buy value, with only one record for each 'fruit'. The 'street' column is irrelevant and is not needed in the final result. The final result should look like:
fruit inventory(min) need(min) to_buy(max)
banana 01 05 55
apple 0 12 94
Also, the initial records may not be ordered, and more 'fruits' may be inserted at random. How can I achieve the desired result above?
Try this:
SELECT MIN(inventory), MIN(need), MAX(to_buy)
FROM tableName
GROUP BY fruit
This one should work:
SELECT fruit, MIN(inventory), MIN(need), MAX(to_buy)
FROM <table_name>
GROUP BY fruit
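As a quick check of the grouped query against the sample rows, here is a minimal sketch using an in-memory SQLite database in place of Hive. The table name stock is hypothetical, and the numeric columns are assumed to be integers, so a value written as 01 comes back as 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "stock" is a hypothetical table name; the question does not give one.
conn.execute(
    "CREATE TABLE stock (fruit TEXT, street INTEGER, inventory INTEGER, need INTEGER, to_buy INTEGER)"
)
conn.executemany(
    "INSERT INTO stock VALUES (?, ?, ?, ?, ?)",
    [
        ("banana", 123, 15, 99, 22),
        ("apple", 4, 32, 68, 44),
        ("banana", 789, 1, 32, 11),
        ("apple", 9832, 0, 99, 94),
        ("apple", 85, 839, 12, 48),
        ("banana", 832, 77, 5, 55),
    ],
)

# street is simply left out of the select list; insertion order does not matter
# because the aggregates are computed over every row of each group.
fruit_summary = conn.execute(
    """
    SELECT fruit, MIN(inventory), MIN(need), MAX(to_buy)
    FROM stock
    GROUP BY fruit
    ORDER BY fruit
    """
).fetchall()

print(fruit_summary)
```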

Using WHERE clause and DISTINCT ON

I have the following two postgresql tables:
table: daily
id date close symbol_id
1 2016-05-01 80 65
2 2016-05-01 75 67
3 2016-05-01 95 45
4 2016-05-02 11 65
5 2016-05-02 48 67
6 2016-05-02 135 45
7 2016-05-03 18 65
8 2016-05-03 82 67
9 2016-05-03 107 45
10 2016-05-04 29 65
table: symbol
id symbol
65 abc
67 xyz
45 jkl
I need to select all symbols where the close value is less than 100 for the latest date for each symbol. As per the example, not all symbols will have the same latest date.
The following query gives me correct data when I do not use the WHERE clause:
SELECT DISTINCT ON (daily.symbol_id) symbol.symbol, daily.close, daily.date
FROM daily JOIN symbol ON daily.symbol_id = symbol.id
--WHERE daily.close < 100
ORDER BY daily.symbol_id, daily.date DESC
Result:
symbol close date
abc 29 2016-05-04
xyz 82 2016-05-03
jkl 107 2016-05-03
The problem comes when I uncomment the WHERE clause. The desired result is for the symbol jkl to be removed from the list because the value for close for that symbol on its latest date is not < 100. However this is what happens:
symbol close date
abc 29 2016-05-04
xyz 82 2016-05-03
jkl 95 2016-05-01
You can move your existing query to a subquery and then filter with where criteria.
select *
from (
select distinct on (d.symbol_id) s.symbol, d.close, d.date
from daily d
join symbol s on d.symbol_id = s.id
order by d.symbol_id, d.date desc
) t
where close < 100
Here's another similar option using a window function such as row_number:
select *
from (
select d.symbol_id, s.symbol, d.close, d.date,
row_number() over (partition by d.symbol_id order by d.date desc) rn
from daily d
join symbol s on d.symbol_id = s.id
) t
where rn = 1 and close < 100
Code not tested, just to demonstrate the idea.
First, write a query that gets the latest date for every symbol. Then join it back to daily so that only each symbol's latest row remains; at that point you can safely apply the close < 100 condition.
SELECT s.symbol, d2.close, d2.date
FROM (SELECT symbol_id, MAX(date) AS latest
FROM daily
GROUP BY symbol_id) d1
INNER JOIN daily d2 ON d2.symbol_id = d1.symbol_id AND d2.date = d1.latest
INNER JOIN symbol s ON s.id = d2.symbol_id
WHERE d2.close < 100
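The difference between filtering before and after picking the latest row can be seen in a small sketch. It assumes an in-memory SQLite database (3.25 or later, for window functions) as a stand-in for PostgreSQL and uses row_number rather than DISTINCT ON, which SQLite does not support; the variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE symbol (id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE daily (id INTEGER PRIMARY KEY, date TEXT, close INTEGER, symbol_id INTEGER);
    INSERT INTO symbol VALUES (65, 'abc'), (67, 'xyz'), (45, 'jkl');
    INSERT INTO daily VALUES
        (1, '2016-05-01', 80, 65), (2, '2016-05-01', 75, 67), (3, '2016-05-01', 95, 45),
        (4, '2016-05-02', 11, 65), (5, '2016-05-02', 48, 67), (6, '2016-05-02', 135, 45),
        (7, '2016-05-03', 18, 65), (8, '2016-05-03', 82, 67), (9, '2016-05-03', 107, 45),
        (10, '2016-05-04', 29, 65);
    """
)

# {inner_filter} is applied before ranking, {outer_filter} after it.
RANKED = """
    SELECT symbol, close, date FROM (
        SELECT s.symbol, d.close, d.date,
               ROW_NUMBER() OVER (PARTITION BY d.symbol_id ORDER BY d.date DESC) AS rn
        FROM daily d
        JOIN symbol s ON d.symbol_id = s.id
        {inner_filter}
    ) t
    WHERE rn = 1 {outer_filter}
    ORDER BY symbol
"""

# Filter first: jkl falls back to its latest row that happens to be under 100.
filtered_first = conn.execute(
    RANKED.format(inner_filter="WHERE d.close < 100", outer_filter="")
).fetchall()

# Rank first, then filter: jkl is dropped because its latest close is 107.
ranked_first = conn.execute(
    RANKED.format(inner_filter="", outer_filter="AND close < 100")
).fetchall()

print(filtered_first)
print(ranked_first)
```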

How do I add specific values from one different table to columns in a row based off of values in another table?

In SQL Server I have 2 tables that look like this:
TEST SCRIPT 'a collection of test scripts'
(PK)
ID Description Count
------------------------
A12 Proj/Num/Dev 12
B34 Gone/Tri/Tel 43
C56 Geff/Ben/Dan 03
SCRIPT HISTORY 'the history of the aforementioned scripts'
(FK) (PK)
ScriptID ID Machine Date Time Passes
----------------------------------
A12 01 DEV012 6/26/15 16:54 4
A12 02 DEV596 6/28/15 13:12 9
A12 03 COM199 3/12/14 14:22 10
B34 04 COM199 6/30/13 15:45 12
B34 05 DEV012 6/30/15 13:13 14
B34 06 DEV444 6/12/15 11:14 14
C56 07 COM321 6/29/14 02:19 12
C56 08 ANS042 6/24/14 20:10 18
C56 09 COM432 6/30/15 12:24 4
C56 10 DEV444 4/20/12 23:55 2
In a single query, how would I write a select statement that takes just one entry for each DISTINCT script in TEST SCRIPT and pairs it with the values in only the TOP 1 most recent run time in SCRIPT HISTORY?
For example, the solution to the example tables above would be:
OUTPUT
ScriptID ID Machine Date Time Passes
---------------------------------------------------
A12 02 DEV596 6/28/15 13:12 9
B34 05 DEV012 6/30/15 13:13 14
C56 09 COM432 6/30/15 12:24 4
The way you describe the problem maps almost directly onto cross apply:
select h.*
from testscript ts cross apply
(select top 1 h.*
from history h
where h.scriptid = ts.id
order by h.date desc, h.time desc
) h;
Please try something like this:
select *
from SCRIPT SCR
left join (select MAX(SCRIPT_HISTORY.Date) as Date, SCRIPT_HISTORY.ScriptID
from SCRIPT_HISTORY
group by SCRIPT_HISTORY.ScriptID
) SH on SCR.ID = SH.ScriptID
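For engines without CROSS APPLY, a correlated subquery with LIMIT 1 is a close analogue of the "top 1 per script" idea. The sketch below assumes an in-memory SQLite database rather than SQL Server, and it stores the dates in ISO format (an assumption, so that text ordering matches date ordering); table and column names follow the cross apply answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE testscript (id TEXT PRIMARY KEY, description TEXT, "count" INTEGER);
    CREATE TABLE history (scriptid TEXT, id TEXT PRIMARY KEY, machine TEXT,
                          date TEXT, time TEXT, passes INTEGER);
    INSERT INTO testscript VALUES
        ('A12', 'Proj/Num/Dev', 12), ('B34', 'Gone/Tri/Tel', 43), ('C56', 'Geff/Ben/Dan', 3);
    INSERT INTO history VALUES
        ('A12', '01', 'DEV012', '2015-06-26', '16:54', 4),
        ('A12', '02', 'DEV596', '2015-06-28', '13:12', 9),
        ('A12', '03', 'COM199', '2014-03-12', '14:22', 10),
        ('B34', '04', 'COM199', '2013-06-30', '15:45', 12),
        ('B34', '05', 'DEV012', '2015-06-30', '13:13', 14),
        ('B34', '06', 'DEV444', '2015-06-12', '11:14', 14),
        ('C56', '07', 'COM321', '2014-06-29', '02:19', 12),
        ('C56', '08', 'ANS042', '2014-06-24', '20:10', 18),
        ('C56', '09', 'COM432', '2015-06-30', '12:24', 4),
        ('C56', '10', 'DEV444', '2012-04-20', '23:55', 2);
    """
)

# For each script, join only the history row whose id is that script's most recent run.
latest_runs = conn.execute(
    """
    SELECT h.scriptid, h.id, h.machine, h.date, h.time, h.passes
    FROM testscript ts
    JOIN history h ON h.id = (
        SELECT h2.id
        FROM history h2
        WHERE h2.scriptid = ts.id
        ORDER BY h2.date DESC, h2.time DESC
        LIMIT 1
    )
    ORDER BY h.scriptid
    """
).fetchall()

for run in latest_runs:
    print(run)
```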

How to order rows in a table within partitions?

I am using DB2 to take a table, split it into partitions and then order rows within each partition. The table I have is like:
ID DATE EVENT
-- ---- -----
01 1999-06-01 a
01 1999-06-01 b
01 2006-01-01 a
01 2011-12-31 c
02 1999-01-01 a
02 2003-01-01 a
02 2003-01-01 b
02 2009-11-12 b
where I want to order it to get the following...
ID DATE EVENT SEQUENCE
-- ---- ----- --------
01 1999-06-01 a 1
01 1999-06-01 b 1
01 2006-01-01 a 2
01 2011-12-31 c 3
02 1999-01-01 a 1
02 2003-01-01 a 2
02 2003-01-01 b 2
02 2009-11-12 b 3
so I am trying:
select a.*, row_number() over(partition by ID,order by DATE) from mytable a
which gives me:
ID DATE EVENT SEQUENCE
-- ---- ----- --------
01 1999-06-01 a 1
01 1999-06-01 b 2
01 2006-01-01 a 3
01 2011-12-31 c 4
02 1999-01-01 a 1
02 2003-01-01 a 2
02 2003-01-01 b 3
02 2009-11-12 b 4
where as you can see, even though a consecutive row may have the same date as the previous row, this is ignored and the SEQUENCE column is iterated.
How do I ensure that if the next row has the same date that the sequence is preserved until a row with a later date appears?
Thanks very much.
Clearly, the row_number() function would not return the same number for different rows within the window. You need to use the dense_rank() function.
By the way, your query has a syntax error, and it is not a good idea to use reserved words ('DATE' in this case) for column names.
You could use the DENSE_RANK function instead, which assigns the same rank to rows that share the same ordering value, as below:
select a.*, DENSE_RANK() OVER(PARTITION BY ID ORDER BY DATE) from mytable a;
References:
Using OLAP specifications
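The two functions can be compared side by side on the sample data. This is a minimal sketch that assumes an in-memory SQLite database (3.25 or later) in place of DB2; in the row_number window, event is added as a tie-breaker only so that the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id TEXT, date TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        ("01", "1999-06-01", "a"),
        ("01", "1999-06-01", "b"),
        ("01", "2006-01-01", "a"),
        ("01", "2011-12-31", "c"),
        ("02", "1999-01-01", "a"),
        ("02", "2003-01-01", "a"),
        ("02", "2003-01-01", "b"),
        ("02", "2009-11-12", "b"),
    ],
)

ranked = conn.execute(
    """
    SELECT id, date, event,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date, event) AS rn,
           DENSE_RANK() OVER (PARTITION BY id ORDER BY date) AS seq
    FROM mytable
    ORDER BY id, date, event
    """
).fetchall()

# row_number always increments; dense_rank repeats for equal dates and leaves no gaps.
row_numbers = [r[3] for r in ranked]
sequences = [r[4] for r in ranked]
print(row_numbers)
print(sequences)
```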

select query with priority based on column values

This is a continuation of the question asked in: select query with priority based on column values
I have an identical issue wherein my result set (which is a flight movement report of an airport) is like this :
sl_no term arr   org sta   ata   arr_pax asrc dep    dep_pax dsrc std   atd
----- ---- ----- --- ----- ----- ------- ---- ------ ------- ---- ----- -----
01    D    TY123 TTY 00:00 00:05 123     USR  II 877 26      LDM  00:45 00:50
02    D    TY123 TTY 00:00 00:05 55      LDM  II 877 26      LDM  00:45 00:50
03    D    FY598 TTY 00:00 00:05 123     LDM  II 877 32      USR  00:45 00:50
04    D    ZX555 TTY 00:00 00:05 223     LDM  II 877 55      LDM  00:45 00:50
05    D    XX645 TTY 00:00 00:05 16      LDM  II 877 55      LDM  00:45 00:50
06    D    XX645 TTY 00:00 00:05 16      LDM  II 877 65      USR  00:45 00:50
Now, you can observe that, the first two rows are identical but for the values under the column 'asrc' ('USR' and 'LDM'). Similarly, rows 5 and 6 are identical except for the values ('USR' and 'LDM') under the column 'dsrc'.
My target result should contain only one of the identical rows in the order that, if 'USR' is present, the row containing 'LDM' will be discarded. If 'USR' is absent and 'LDM' is the only one present, then the row will be selected.
The second answer in the linked question suggested the use of analytical functions. I tried to do that and I ended up with the below query :
An excerpt :
SELECT term,arr,org,sta,ata,arr_pax,asrc,dep,dep_pax,dsrc,std,atd,Max(dep_pax) KEEP (Dense_Rank first ORDER BY apriority) FROM
(
Select .....<joins to build the query>
, CASE dpriority
WHEN 'USR' THEN 1
WHEN 'LDM' THEN 2
END AS dpriority
, CASE dpriority
WHEN 'USR' THEN 1
WHEN 'LDM' THEN 2
END AS dpriority FROM
...... ;
I still end up with the same result set, with both the 'USR' and 'LDM' rows present. Can anyone point out how to construct the analytical function here?
Any other working approaches are also welcome.
Thanks in advance.
The following logic should do this:
SELECT term, arr, org, sta, ata, arr_pax, asrc, dep, dep_pax, dsrc, std, atd,
Max(dep_pax) KEEP (Dense_Rank first ORDER BY apriority)
FROM (select t.*,
sum(case when dpriority = 'USR' then 1 else 0 end) over (partition by . . . ) as NumUSR,
sum(case when dpriority = 'LDM' then 1 else 0 end) over (partition by . . . ) as NumLDM
from t
) t
WHERE dpriority = 'USR' or NumUSR = 0;
It counts the number of "USR" values and the number of "LDM" values (strictly speaking, the latter is not necessary). The logic is to take the "USR" value if available or any value if there are no USR values.
I'm not sure what the right partitioning key is. You can put in all the columns, although a smaller subset might be sufficient.
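The "take USR if present, otherwise keep whatever is there" rule can be sketched on a cut-down version of the data. Everything about the shape here is an assumption: an in-memory SQLite database (3.25 or later) instead of Oracle, a hypothetical flights table holding only the arrival-side columns of rows 1 to 4, and arr as the partitioning key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical reduced table: only the arrival-side columns of rows 1-4.
conn.execute("CREATE TABLE flights (sl_no INTEGER, arr TEXT, arr_pax INTEGER, asrc TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        (1, "TY123", 123, "USR"),
        (2, "TY123", 55, "LDM"),
        (3, "FY598", 123, "LDM"),
        (4, "ZX555", 223, "LDM"),
    ],
)

# num_usr counts the USR rows per flight (partitioning by arr is an assumption).
# A row survives if it is a USR row, or if its flight has no USR row at all.
preferred = conn.execute(
    """
    SELECT sl_no, arr, arr_pax, asrc
    FROM (
        SELECT f.*,
               SUM(CASE WHEN asrc = 'USR' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY arr) AS num_usr
        FROM flights f
    ) t
    WHERE asrc = 'USR' OR num_usr = 0
    ORDER BY sl_no
    """
).fetchall()

for row in preferred:
    print(row)
```

With this data the LDM row for TY123 is dropped, while FY598 and ZX555 keep their LDM rows because no USR row exists for them.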