Struggling to find the right WHERE clause - sql

I'm struggling with a SQL query and I need your help. To be honest, I'm starting to wonder whether what I want can be done the way I've approached it so far, but maybe your collective brains can come up with a better solution than mine, and show me either that I was on the right track from the beginning or that I was totally wrong and should start from scratch.
The Dataset
A row has 4 important fields: ItemID, Item, Priority and Group. Those fields contain the only valuable information, which is what will be displayed in the end.
As I'm using SQL Server 2008, I don't have access to the LAG and LEAD functions, so I needed to simulate them (or at least, I did it because I thought it would be useful, but I'm not so sure anymore). To obtain this result, I used the code from this article from SQLscope, which provides a LAG and LEAD equivalent that I restrict to the set of rows that have the same ItemID. This adds 7 new functional columns to my dataset: Rn, RnDiv2, RnPlus1Div2, PreviousPriority, NextPriority, PreviousGroup and NextGroup.
ItemID | Item | Priority | Group | Rn | RnDiv2 | RnPlus1Div2 | PreviousPriority | NextPriority | PreviousGroup | NextGroup
-------- | ------- | -------- | ------- | ----- | ------ | ----------- | ---------------- | ------------ | ------------- | ---------
16777397 | Item 1 | 5 | Group 1 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
16777403 | Item 2 | 5 | Group 2 | 1 | 0 | 1 | NULL | 5 | NULL | Group 2
16777403 | Item 2 | 10 | Group 2 | 2 | 1 | 1 | 5 | NULL | Group 2 | NULL
16777429 | Item 3 | 1000 | Group 3 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
16777430 | Item 4 | 5 | Group 1 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
16777454 | Item 5 | 5 | Group 4 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
16777455 | Item 6 | 5 | Group 5 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
16777459 | Item 6 | 5 | Group 6 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
16777468 | Item 8 | 5 | Group 7 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
16777479 | Item 9 | 5 | Group 4 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
16777481 | Item 10 | 5 | Group 4 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
16777496 | Item 11 | 5 | Group 6 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
16777514 | Item 12 | 5 | Group 4 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
16777518 | Item 13 | 5 | Group 8 | 1 | 0 | 1 | NULL | 10 | NULL | Group 8
16777518 | Item 13 | 10 | Group 8 | 2 | 1 | 1 | 5 | 100 | Group 8 | Group 1
16777518 | Item 13 | 100 | Group 1 | 3 | 1 | 2 | 10 | NULL | Group 8 | NULL
16777520 | Item 14 | 5 | Group 9 | 1 | 0 | 1 | NULL | NULL | NULL | NULL
The problem
The problem in my SQL query is the WHERE clause. I will always filter the rows based on their Group column, but there are some subtleties. However many Groups an Item is a member of, I want it to appear in one and only one Group, based on these criteria:
1. If the Item appears in the same Group more than once, only the line with the lowest Priority should be returned. If an Item appears more than once in the same Group with the same Priority, only the first occurrence should be kept. Example: for Item 2, only the line with a Priority value of 5 should be returned.
2. If the Item appears in the Group but is also present in another Group with a lower Priority, it shouldn't be displayed. Example: Group 1 is selected as a filter. Item 1 should be displayed but Item 13 shouldn't, because it is also present in Group 8 with a lower Priority (Item 13 would appear only in Group 8).
Note that this is just a sample. My real dataset has more than 3000 rows and some other cases are probably possible that I haven't listed in my sample.
Unsuccessful Attempts
Like I said, there is one constant in the WHERE clause and that is the Group filtering.
Because of criterion #2, I can't simply start my clause with WHERE Group = 'Group 1'; I need something a bit more complex.
I have tried the following clause without success: WHERE Group = 'Group 1' AND (Group = NextGroup AND Priority < NextPriority). That works well for an Item that is in no more than 2 groups. But for Item 13, it would return the first two rows. And if I add something like AND NOT (CorrectedPriority >= PreviousPriority) to the WHERE clause, I get no results at all.
Last attempt so far: (SiteName <> PreviousSiteName AND CorrectedPriority >= PreviousPriority). The problem is that it will never return a line where Rn = 1, because PreviousSiteName will be NULL. Adding a check on NULL doesn't work either. I must have been tired when trying this particular clause because it's complete garbage.
I will continue to try to find the right WHERE clause, but I have the feeling that my whole approach is wrong. I don't see how I could solve the problem when there are more than two entries for the same Item. It is worth noting that this query is used in an SSRS report, so I could maybe use custom code to parse the dataset and filter the rows (working with tables might help solve the issue of Items with more than two entries). But if there's a SQL genius around here with a working solution, that would be great.
PS : if someone knows how to fix this table and can explain it to me, extra cookies for him. :D
Edit :
This is the modified query that I'm using at the moment. I will consider using @Yellowbedwetter's latest query as it seems more robust.
SELECT *
FROM (SELECT ItemID,
             Item,
             Priority,
             Group_,
             MIN(Priority) OVER
                 ( PARTITION BY item
                 ) AS interItem_MinPriority
      FROM (SELECT ItemID,
                   Item,
                   Priority,
                   Group_,
                   ROW_NUMBER() OVER
                       ( PARTITION BY Item
                             ORDER BY Priority ASC
                       ) AS interGrp_Rank
            FROM Test_Table
           ) AS TMP
      WHERE interGrp_Rank = 1 -- Exclude all records with the same item/group, but higher priority.
     ) AS TMP2
WHERE Priority = interItem_MinPriority; -- Exclude rows which aren't the lowest priority across groups.

If I understand the question correctly, this should work:
SELECT *
FROM (SELECT ItemID,
             Item,
             Priority,
             Group_,
             MIN(Priority) OVER
                 ( PARTITION BY item
                 ) AS interItem_MinPriority
      FROM (SELECT ItemID,
                   Item,
                   Priority,
                   Group_,
                   ROW_NUMBER() OVER
                       ( PARTITION BY Item,
                                      Group_
                             ORDER BY Priority ASC
                       ) AS interGrp_Rank
            FROM Test_Table
           ) AS TMP
      WHERE interGrp_Rank = 1 -- Exclude all records with the same item/group, but higher priority.
     ) AS TMP2
WHERE Priority = interItem_MinPriority; -- Exclude rows which aren't the lowest priority across groups.
I don't know if your version of SQL Server supports MIN() OVER()..., but if not you should be able to work around that easily enough.
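The question also wants to filter by Group after the de-duplication. As a rough sketch of how that might look, the snippet below runs the same two-stage query against a subset of the sample rows and adds the Group filter in the outermost WHERE, so it is applied only after the window functions have been evaluated. It uses Python's sqlite3 module purely as a convenient test bed (an assumption: the bundled SQLite must be 3.25 or later for window functions); the variable names group1 and group8 are illustrative only.

```python
import sqlite3

# Subset of the sample rows from the question: (ItemID, Item, Priority, Group_)
rows = [
    (16777397, "Item 1", 5, "Group 1"),
    (16777403, "Item 2", 5, "Group 2"),
    (16777403, "Item 2", 10, "Group 2"),
    (16777430, "Item 4", 5, "Group 1"),
    (16777518, "Item 13", 5, "Group 8"),
    (16777518, "Item 13", 10, "Group 8"),
    (16777518, "Item 13", 100, "Group 1"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Test_Table (ItemID INTEGER, Item TEXT, Priority INTEGER, Group_ TEXT)"
)
conn.executemany("INSERT INTO Test_Table VALUES (?, ?, ?, ?)", rows)

# Same structure as the answer's query; the Group filter sits in the
# outermost WHERE so it cannot hide rows from the window functions.
query = """
SELECT Item, Priority, Group_
FROM (SELECT Item, Priority, Group_,
             MIN(Priority) OVER (PARTITION BY Item) AS interItem_MinPriority
      FROM (SELECT Item, Priority, Group_,
                   ROW_NUMBER() OVER (PARTITION BY Item, Group_
                                      ORDER BY Priority ASC) AS interGrp_Rank
            FROM Test_Table) AS TMP
      WHERE interGrp_Rank = 1) AS TMP2
WHERE Priority = interItem_MinPriority
  AND Group_ = ?
ORDER BY Item
"""

group1 = conn.execute(query, ("Group 1",)).fetchall()
group8 = conn.execute(query, ("Group 8",)).fetchall()
print(group1)  # Item 1 and Item 4 only; Item 13 belongs to Group 8
print(group8)
```

Filtering on Group_ inside the innermost query instead would change the result, because Item 13's Group 8 rows would no longer be visible when the minimum Priority is computed.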
Edit: To handle tie breaks.
WITH TEST_TABLE (ItemID, Item, Priority, Group_) AS
(
-- ItemID and Priority are numeric literals so that MIN() and ORDER BY compare numbers, not strings.
SELECT 16777397, 'Item 1', 5, 'Group 1' UNION
SELECT 16777403, 'Item 2', 5, 'Group 2' UNION
SELECT 16777403, 'Item 2', 10, 'Group 2' UNION
SELECT 16777429, 'Item 3', 1000, 'Group 3' UNION
SELECT 16777430, 'Item 4', 5, 'Group 1' UNION
SELECT 16777454, 'Item 5', 5, 'Group 4' UNION
SELECT 16777455, 'Item 6', 5, 'Group 5' UNION
SELECT 16777459, 'Item 6', 5, 'Group 6' UNION
SELECT 16777468, 'Item 8', 5, 'Group 7' UNION
SELECT 16777479, 'Item 9', 5, 'Group 4' UNION
SELECT 16777481, 'Item 10', 5, 'Group 4' UNION
SELECT 16777496, 'Item 11', 5, 'Group 6' UNION
SELECT 16777514, 'Item 12', 5, 'Group 4' UNION
SELECT 16777518, 'Item 13', 5, 'Group 8' UNION
SELECT 16777518, 'Item 13', 10, 'Group 8' UNION
SELECT 16777518, 'Item 13', 100, 'Group 1' UNION
SELECT 16777520, 'Item 14', 5, 'Group 9'
)
SELECT ItemID,
       Item,
       Priority,
       Group_
FROM (SELECT ItemID,
             Item,
             Priority,
             Group_,
             ROW_NUMBER() OVER
                 ( PARTITION BY item
                       ORDER BY Group_ ASC -- or however you want to break the tie
                 ) AS grp_minPriority_TieBreak
      FROM (SELECT ItemID,
                   Item,
                   Priority,
                   Group_,
                   MIN(Priority) OVER
                       ( PARTITION BY item
                       ) AS interItem_MinPriority
            FROM (SELECT ItemID,
                         Item,
                         Priority,
                         Group_,
                         ROW_NUMBER() OVER
                             ( PARTITION BY Item,
                                            Group_
                                   ORDER BY Priority ASC
                             ) AS interGrp_Rank
                  FROM TEST_TABLE
                 ) AS TMP
            WHERE interGrp_Rank = 1 -- Exclude all records with the same item/group, but higher priority.
           ) AS TMP2
      WHERE Priority = interItem_MinPriority -- Exclude rows which aren't the lowest priority across groups.
     ) AS TMP3
WHERE grp_minPriority_TieBreak = 1;

If I understand your problem correctly, these are the criteria:
If the Item appears in the same Group more than one time, only the
line with the lowest priority should be returned. Example: for Item
2, only the line with a Priority value of 5 should be returned;
If the Item appears in the Group but is also present in another
Group with a lowest Priority, it shouldn't be displayed. Example:
Group 1 is selected as a filter. Item 1 should be displayed but Item
13 shouldn't because it is also present in Group 8 with a lower
Priority (Item 13 would appear only in Group 8).
I think we can get the right result by using the minimum priority per item without considering the item's group, because in both cases above we take the minimum priority of the item.
So the following query might be helpful (I tested it with your sample data):
with minPriority as
(
    select ItemID, Item, Priority, Group_,
           ROW_NUMBER() over (partition by ItemId order by priority) rn
    from Test_table
)
select * from minPriority where rn = 1
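One thing the single ROW_NUMBER() approach leaves open is which row wins when an item has the same lowest Priority in two Groups. A minimal sketch, assuming ties should be broken by Group_ in ascending order (that ordering is an arbitrary choice, and the ItemID 99 rows are hypothetical data added to force a tie), again using sqlite3 as a test bed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Test_Table (ItemID INTEGER, Item TEXT, Priority INTEGER, Group_ TEXT)"
)
conn.executemany(
    "INSERT INTO Test_Table VALUES (?, ?, ?, ?)",
    [
        (16777403, "Item 2", 5, "Group 2"),
        (16777403, "Item 2", 10, "Group 2"),
        (16777518, "Item 13", 100, "Group 1"),
        (16777518, "Item 13", 5, "Group 8"),
        (99, "Item X", 5, "Group 6"),  # hypothetical: same Priority in two Groups
        (99, "Item X", 5, "Group 5"),
    ],
)

# Adding Group_ as a second sort key makes the choice deterministic on ties.
picked = conn.execute(
    """
    WITH ranked AS (
        SELECT ItemID, Item, Priority, Group_,
               ROW_NUMBER() OVER (PARTITION BY ItemID
                                  ORDER BY Priority, Group_) AS rn
        FROM Test_Table
    )
    SELECT ItemID, Priority, Group_
    FROM ranked
    WHERE rn = 1
    ORDER BY ItemID
    """
).fetchall()
print(picked)
```

Without the second sort key the database is free to pick either of the tied rows, so the report could show the item under a different Group from one run to the next.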

Haven't tried it, but something like: select max(priority) as mp ..... From ... Where group = 'group1' and mp not in (select max(priority).... from ... Where group <> 'group1')
Apologies for the typing, on my phone no glasses :)

Related

Get some values from the table by selecting

I have a table:
| id | Number |Address
| -----| ------------|-----------
| 1 | 0 | NULL
| 1 | 1 | NULL
| 1 | 2 | 50
| 1 | 3 | NULL
| 2 | 0 | 10
| 3 | 1 | 30
| 3 | 2 | 20
| 3 | 3 | 20
| 4 | 0 | 75
| 4 | 1 | 22
| 4 | 2 | 30
| 5 | 0 | NULL
I need to get: the NUMBER of the last ADDRESS change for each ID.
I wrote this select:
select dh.id, dh.number from table dh where dh =
(select max(min(t.history)) from table t where t.id = dh.id group by t.address)
But this select does not correctly handle the case where the address first changed and then changed back to a previous value. For example, for id=1 the group by returns:
| Number |
| -------- |
| NULL |
| 50 |
I have been thinking about this select for several days, and I will be happy to receive any help.
You can do this using row_number() -- twice:
select t.id, min(number)
from (select t.*,
             row_number() over (partition by id order by number desc) as seqnum1,
             row_number() over (partition by id, address order by number desc) as seqnum2
      from t
     ) t
where seqnum1 = seqnum2
group by id;
What this does is enumerate the rows by number in descending order:
Once per id.
Once per id and address.
These values are the same only for the rows in the most recent run of a single address, where both counters start at 1 and advance together. Aggregation with min() then pulls back the earliest row of that run, which is where the last change happened.
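To see the two counters at work, here is a small sketch run through Python's sqlite3 (assuming SQLite 3.25 or later for window functions). Ids 2, 3 and 4 come from the question's table; id 9 is a hypothetical addition whose address changes and then reverts, which is the case the original attempt got wrong. The variable name last_change is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, number INTEGER, address INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (2, 0, 10),
        (3, 1, 30), (3, 2, 20), (3, 3, 20),
        (4, 0, 75), (4, 1, 22), (4, 2, 30),
        (9, 0, 10), (9, 1, 20), (9, 2, 10),  # hypothetical: reverts to 10
    ],
)

# seqnum1 counts rows per id, seqnum2 counts rows per (id, address),
# both newest first. They agree only while the address has not changed.
last_change = conn.execute(
    """
    SELECT id, MIN(number)
    FROM (SELECT id, number,
                 ROW_NUMBER() OVER (PARTITION BY id
                                    ORDER BY number DESC) AS seqnum1,
                 ROW_NUMBER() OVER (PARTITION BY id, address
                                    ORDER BY number DESC) AS seqnum2
          FROM t) AS s
    WHERE seqnum1 = seqnum2
    GROUP BY id
    ORDER BY id
    """
).fetchall()
print(last_change)
```

For id 9 the row with number 0 shares an address with the newest row, but its counters are 3 and 2, so it is filtered out and the reported change is number 2.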
I answered my question myself, if anyone needs it, my solution:
select * from table dh1 where dh1.number = (
    select max(x.number)
    from (
        select
            dh2.id, dh2.number, dh2.address,
            lag(dh2.address) over (order by dh2.number asc) as prev
        from table dh2 where dh1.id = dh2.id
    ) x
    where NVL(x.address, 0) <> NVL(x.prev, 0)
);

Selecting the first row of group with additional group by columns

Say I have a table with the following results:
How can I select only distinct parent_ids, each with the row that has its minimum object0_behaviour?
Expected output:
parent_id | id | object0_behaviour | type
------------------------------------------
1 | 1 | 5 | IP
2 | 3 | 5 | IP
3 | 5 | 7 | ID
4 | 6 | 7 | ID
5 | 8 | 5 | IP
6 | 18 | 7 | ID
7 | 10 | 7 | ID
8 | 9 | 5 | IP
I have tried:
SELECT parent_id, min(object0_behaviour) FROM table GROUP BY parent_id
It works; however, if I want the other 2 columns, I am required to add them to the GROUP BY clause and things go back to square one.
I saw examples with R : Select the first row by group
Similar output from what I need, but I can't seem to convert it into SQL
You can try using the row_number() window function:
select * from
(
    select *, row_number() over (partition by parent_id order by object0_behaviour) as rn
    from tablename
) A where rn = 1

select * from table
join (
    SELECT parent_id, min(object0_behaviour) object0_behaviour
    FROM table GROUP BY parent_id
) grouped
on grouped.parent_id = table.parent_id
and grouped.object0_behaviour = table.object0_behaviour
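The two approaches above differ when a parent_id has several rows sharing the minimum value: the join returns all of them, while row_number() returns exactly one. A minimal sketch of that difference, using Python's sqlite3 and a hypothetical table named objects with made-up rows (the original data was not included in the question); the extra id sort key is my own assumption to make the pick deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (parent_id INTEGER, id INTEGER, "
    "object0_behaviour INTEGER, type TEXT)"
)
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [
        (1, 1, 5, "IP"),
        (1, 2, 7, "ID"),
        (2, 3, 5, "IP"),
        (2, 4, 5, "IP"),  # ties with id 3 on the minimum value
    ],
)

# Window-function version: one row per parent_id, ties broken by id.
via_row_number = conn.execute(
    """
    SELECT parent_id, id, object0_behaviour, type
    FROM (SELECT o.*,
                 ROW_NUMBER() OVER (PARTITION BY parent_id
                                    ORDER BY object0_behaviour, id) AS rn
          FROM objects o) AS a
    WHERE rn = 1
    ORDER BY parent_id
    """
).fetchall()

# Join version: every row matching the minimum is kept.
via_join = conn.execute(
    """
    SELECT o.parent_id, o.id, o.object0_behaviour, o.type
    FROM objects o
    JOIN (SELECT parent_id, MIN(object0_behaviour) AS object0_behaviour
          FROM objects
          GROUP BY parent_id) AS g
      ON g.parent_id = o.parent_id
     AND g.object0_behaviour = o.object0_behaviour
    ORDER BY o.parent_id, o.id
    """
).fetchall()
print(via_row_number)
print(via_join)
```

If distinct parent_ids are a hard requirement, the row_number() form is the safer of the two.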

PSQL select all rows with a non-unique column

The query is supposed to query the item table and:
filter out active=0 items
select id and groupId where there's at least one more item with that groupId
Example:
| id | groupId | active |
| --- | ------- | ------ |
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 2 | 0 |
| 4 | 3 | 1 |
| 5 | 3 | 1 |
| 6 | 4 | 1 |
Desired Output:
| id | groupId |
| --- | ------- |
| 4 | 3 |
| 5 | 3 |
Explanation
groupID 1: invalid because has only 1 member
groupID 2: invalid because has two members, but one is inactive
groupID 3: valid
groupID 4: invalid because has only 1 member
What I tried
SELECT id, groupId
FROM items
WHERE id IN (
SELECT id
FROM items
WHERE active=1
GROUP BY groupId
HAVING COUNT(*) > 1
);
But I get the "id must appear in the GROUP BY clause or be used in an aggregate function" error.
I understand I can mess around with the sql_mode to get rid of that error, but I would rather avoid that.
Go for window functions:
select i.*
from (select i.*, count(*) over (partition by groupid) as cnt
      from items i
      where active = 1
     ) i
where cnt > 1
Window functions are the way to go.
But if you want to fix your query, then this should do it:
select a.id, a.groupId from items a
where active = 1 and groupid in (
    select groupId from items
    where active = 1
    group by groupId
    having count(distinct id) > 1
)
This works because the subquery groups by groupId alone and keeps only the groups that have more than one active id.
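Both answers can be checked against the sample data from the question. The sketch below uses Python's sqlite3 as a stand-in for PostgreSQL (an assumption; both queries use only standard SQL) and confirms that the window-function form and the IN-subquery form return the same rows. The names windowed and subquery are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, groupId INTEGER, active INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 2, 1), (3, 2, 0), (4, 3, 1), (5, 3, 1), (6, 4, 1)],
)

# Count active rows per group with a window function, then filter.
windowed = conn.execute(
    """
    SELECT id, groupId
    FROM (SELECT id, groupId,
                 COUNT(*) OVER (PARTITION BY groupId) AS cnt
          FROM items
          WHERE active = 1) AS i
    WHERE cnt > 1
    ORDER BY id
    """
).fetchall()

# Same result with a grouped subquery that returns only groupId.
subquery = conn.execute(
    """
    SELECT id, groupId
    FROM items
    WHERE active = 1
      AND groupId IN (SELECT groupId
                      FROM items
                      WHERE active = 1
                      GROUP BY groupId
                      HAVING COUNT(*) > 1)
    ORDER BY id
    """
).fetchall()
print(windowed)
print(subquery)
```

The key to both is that inactive rows are removed before counting, so group 2 ends up with a single active member and is dropped.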

SQL group by under some conditions

I have a big table with tons of duplicated rows (among those columns that I care about). Let me start with the following example:
|field1 | field2| field3| field4| field5|
| aa | 1 | NULL | 1 | 0 |
| aaa | 1 | NULL | 1 | 1 |
| aaa | 1 | NULL | 1 | 2 |
| a | 2 | 0 | 1 | 3 |
| a | 2 | 0 | NULL | 4 |
| a | 2 | NULL | 2 | 5 |
| b | 3 | NULL | 2 | 6 |
| b2 | 3 | NULL | NULL | 7 |
| c | 4 | NULL | NULL | 8 |
I am interested in an efficient query to get the following table:
|field1 | field2| field3| field4|
| aaa | 1 | NULL | 1 |
| a | 2 | 0 | 1 |
| b | 3 | NULL | 2 |
| c | 4 | NULL | NULL |
Basically, it follows these rules:
for each value of field2, there should be one and exactly one row present
among all the rows with the same value of field2, select the row that satisfies the following, in order:
select the row where field4 is not NULL (if possible)
among those that have a non-NULL value for field4, select the row that has a non-NULL value for field3
among those that have a non-NULL value for field4 and field3, select the row that has the longest string value for field1
among those that satisfy all of the above, select only one row (the value of field5 does not matter).
I could do it with bunch of joins, but it becomes very slow. Any better suggestions?
EDIT
The field2 values may not be in any specific order. I just put 1, 2, 3, 4 in the example, but this is not generally true in my case. I did not change it directly in the table since one of the suggested solutions actually assumes sequential values for field2, so I kept it for future readers who may be interested in that.
This type of prioritization is challenging. I think the simplest method in MySQL uses variables:
select t.*
from (select t.*,
             (@rn := if(@f2 = field2, @rn + 1,
                        if(@f2 := field2, 1, 1)
                       )
             ) as seqnum
      from t cross join
           (select @rn := 0, @f2 := '') params
      order by field2,
               (field4 is not null) desc,
               (field3 is not null) desc,
               length(field1) desc
     ) t
where seqnum = 1;
I'm not 100% sure I have the conditions right (the third seems to conflict with the first two). But whatever the prioritization, the idea is the same: use order by to get the rows in the right order and use variables to get the first one.
EDIT:
In SQL Server -- or any other reasonable database -- you do this with row_number():
select t.*
from (select t.*,
             row_number() over (partition by field2
                                order by (case when field4 is not null then 0 else 1 end),
                                         (case when field3 is not null then 0 else 1 end),
                                         len(field1) desc -- longest field1 first
                               ) as seqnum
      from t
     ) t
where seqnum = 1;
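As a sanity check, the row_number() prioritization can be run against the sample table from the question. This is only a sketch using Python's sqlite3 (assuming SQLite 3.25 or later), so it uses SQLite's length() where SQL Server would use len(), and it sorts by descending length because the rules ask for the longest field1. The name best_rows is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (field1 TEXT, field2 INTEGER, field3 INTEGER, "
    "field4 INTEGER, field5 INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        ("aa", 1, None, 1, 0),
        ("aaa", 1, None, 1, 1),
        ("aaa", 1, None, 1, 2),
        ("a", 2, 0, 1, 3),
        ("a", 2, 0, None, 4),
        ("a", 2, None, 2, 5),
        ("b", 3, None, 2, 6),
        ("b2", 3, None, None, 7),
        ("c", 4, None, None, 8),
    ],
)

# Each CASE maps "has a value" to 0 so those rows sort first;
# the last key prefers the longest field1.
best_rows = conn.execute(
    """
    SELECT field1, field2, field3, field4
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (
                     PARTITION BY field2
                     ORDER BY CASE WHEN field4 IS NOT NULL THEN 0 ELSE 1 END,
                              CASE WHEN field3 IS NOT NULL THEN 0 ELSE 1 END,
                              LENGTH(field1) DESC
                 ) AS seqnum
          FROM t) AS ranked
    WHERE seqnum = 1
    ORDER BY field2
    """
).fetchall()
print(best_rows)
```

The four selected columns match the desired output in the question; which of the two tied 'aaa' rows is picked is left to the database, which the question says is acceptable.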

An SQL query that combines aggregate and non-aggregate values in one row

The following query gives me the information that I need but I want it to take it just a step further. In the table at the bottom (only showing a subset of the fields), I want to group by cust_line in an unusual way (at least to me it's unusual).
Let's look at the items with a cust_line of 2 as an example. I would like these to be represented by one line, not 5. For this line, I would like to select all the fields, except the price field, from the row where cust_part = "GROUPINV". For the total field I would like 'sum(total) as new_total', and for the price I would like new_total / qty_invoiced, where qty_invoiced is the value on the line where cust_part = "GROUPINV".
Is what I am asking for completely ridiculous? Is it even possible? I'm not advanced at SQL so it may also be easy and I just don't know how to approach it. I thought of using 'partition by' but I couldn't imagine how I would get it to work as I figured it would still return 5 rows where I only want 1.
I've also looked at these questions with similar titles but not really what I am looking for:
SQL query that returns aggregate AND non aggregate results
Combined aggregated and non-aggregate query in SQL
SELECT L.CUST_LINE, I.LINE_NO, I.ORDER_NO, I.STAGE, I.ORDER_LINE_POS, I.CUST_PART,
I.LINE_ITEM_NO, I.QTY_INVOICED, I.CUST_DESC, I.DESCRIPTION, I.SALE_UNIT_PRICE, I.PRICE_TOTAL,
I.INVOICE_NO, I.CUSTOMER_PO_NO, I.ORDER_NO, I.CUSTOMER_NO, I.CATALOG_DESC, I.ORDER_LINE_NOTES
FROM
(SELECT CUST_LINE, ORDER_NO, LINE_NO
FROM CUSTOMER_ORDER_LINE
GROUP BY CUST_LINE, ORDER_NO, LINE_NO
) L
INNER JOIN CUSTOMER_ORDER_IVC_REP I
ON I.ORDER_NO = L.ORDER_NO
WHERE RESULT_KEY = 999999
AND I.LINE_NO = L.LINE_NO
ORDER BY L.CUST_LINE;
| cust_line | line_no | cust_part | qty_invoiced | cust_desc | price | total |
| 1 | 4 | ... | 1 | ... | 55 | 55 |
| 2 | 1 | GROUPINV | 1 | some part | 0 | 0 |
| 2 | 6 | ... | 3 | ... | 0 | 0 |
| 2 | 2 | ... | 1 | ... | 0 | 0 |
| 2 | 3 | ... | 1 | ... | 0 | 0 |
| 2 | 7 | ... | 2 | ... | 10 | 20 |
| 3 | 7 | ... | 1 | ... | 67 | 67 |
You can use an analytic function to calculate a total over multiple rows of a result set, then filter out the rows you don't want.
Leaving out all the extra columns for sake of brevity:
SELECT cust_line, qty_invoiced, order_total/qty_invoiced AS price
FROM (
    SELECT l.cust_line, i.cust_part, qty_invoiced, -- cust_part is needed by the outer WHERE
           SUM(total) OVER (PARTITION BY l.cust_line) AS order_total,
           COUNT(cust_line) OVER (PARTITION BY l.cust_line) AS group_count
    FROM
        (SELECT CUST_LINE, ORDER_NO, LINE_NO
         FROM CUSTOMER_ORDER_LINE
         GROUP BY CUST_LINE, ORDER_NO, LINE_NO
        ) L
    INNER JOIN CUSTOMER_ORDER_IVC_REP I
        ON I.ORDER_NO = L.ORDER_NO
    WHERE RESULT_KEY = 999999
      AND I.LINE_NO = L.LINE_NO
)
WHERE ( cust_part = 'GROUPINV' OR group_count = 1 )
ORDER BY cust_line
I am guessing on what you want in the PARTITION BY clause; this is essentially a GROUP BY that applies only to the SUM function. Not sure if you might also want order_no in the partition.
The trick is to select all the rows in the inner query, applying SUM across them all; then filter out the rows you are not interested in in the outermost query.
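A small sketch of that trick, run through Python's sqlite3 against the rows shown in the question. To keep it short, the join is replaced by a single hypothetical table named invoice_lines, the elided cust_part values are filled with a placeholder 'PART', and the total is multiplied by 1.0 to avoid integer division in SQLite; all of these are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice_lines (cust_line INTEGER, line_no INTEGER, "
    "cust_part TEXT, qty_invoiced INTEGER, price INTEGER, total INTEGER)"
)
conn.executemany(
    "INSERT INTO invoice_lines VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 4, "PART", 1, 55, 55),
        (2, 1, "GROUPINV", 1, 0, 0),
        (2, 6, "PART", 3, 0, 0),
        (2, 2, "PART", 1, 0, 0),
        (2, 3, "PART", 1, 0, 0),
        (2, 7, "PART", 2, 10, 20),
        (3, 7, "PART", 1, 67, 67),
    ],
)

# Inner query: every row carries its group's total and row count.
# Outer query: keep the GROUPINV row, or the only row of a single-row group.
summary = conn.execute(
    """
    SELECT cust_line, qty_invoiced,
           order_total * 1.0 / qty_invoiced AS price
    FROM (SELECT cust_line, cust_part, qty_invoiced,
                 SUM(total) OVER (PARTITION BY cust_line) AS order_total,
                 COUNT(*)   OVER (PARTITION BY cust_line) AS group_count
          FROM invoice_lines) AS x
    WHERE cust_part = 'GROUPINV' OR group_count = 1
    ORDER BY cust_line
    """
).fetchall()
print(summary)
```

The five rows of cust_line 2 collapse to the GROUPINV row, carrying the summed total of 20 divided by that row's qty_invoiced of 1.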