Select rows with missing value for each item - sql

I'm trying to build a report that finds rows in a table which have a mistake: a missing item order. For example:
ID  Item  Order
---------------
1   A     1
2   A     2
3   A     3
4   B     1
5   B     2
6   C     2
7   C     3
8   D     1
Note that item "C" is missing the row with Order index "1". I need to find all items that are missing index "1" and instead start with "2" or higher.
One way I figured is this:
SELECT DISTINCT(Item) FROM ITEMS as I
WHERE I.Item NOT IN (SELECT Item FROM Items WHERE Order = 1)
But surprisingly (to me), it does not give me any results, even though I know I have such items. I guess it first selects the items which are not in the sub-select and then applies DISTINCT, but what I wanted was to select the distinct items and find which of them have no row with "Order = 1".
Also, this code is to be executed over some 70 thousand rows, so it has to be feasible (another way I can think of is a CURSOR, but that would be very slow and possibly unstable?).
Regards,
Oak

The idea is sound, but there is one tiny detail of NOT IN that may be problematic: if the subquery after NOT IN returns any NULLs, the NOT IN comparison evaluates to UNKNOWN, which is treated as false, so no rows come back. This may be the reason why you get no results. You can try NOT EXISTS, as in the other answer, or just
SELECT DISTINCT Item FROM ITEMS as I
WHERE I.Item NOT IN (SELECT Item FROM Items WHERE Order = 1 AND Item IS NOT NULL)
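To see the NULL trap concretely, here is a small sketch using Python's sqlite3 (an illustration only; the table mirrors the question, plus one hypothetical row with a NULL Item that the question's data does not actually contain):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE Items (ID INTEGER, Item TEXT, "Order" INTEGER)')
con.executemany("INSERT INTO Items VALUES (?, ?, ?)", [
    (1, "A", 1), (2, "A", 2), (3, "A", 3),
    (4, "B", 1), (5, "B", 2),
    (6, "C", 2), (7, "C", 3),
    (8, "D", 1),
    (9, None, 1),   # hypothetical bad row: a NULL Item
])

# NOT IN against a subquery that yields a NULL returns no rows,
# because x NOT IN (..., NULL) can never evaluate to TRUE.
broken = con.execute('''
    SELECT DISTINCT Item FROM Items AS I
    WHERE I.Item NOT IN (SELECT Item FROM Items WHERE "Order" = 1)
''').fetchall()
print(broken)   # []

# Excluding NULLs in the subquery restores the expected answer.
fixed = con.execute('''
    SELECT DISTINCT Item FROM Items AS I
    WHERE I.Item NOT IN (SELECT Item FROM Items WHERE "Order" = 1
                         AND Item IS NOT NULL)
''').fetchall()
print(fixed)    # [('C',)]
```

Note that "Order" is quoted because ORDER is a reserved word in SQLite.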

You can find the missing orders using a HAVING clause. HAVING allows you to filter on aggregated records; in this case we are filtering for items whose minimum Order exceeds 1.
The benefit of this approach over a subquery in the WHERE clause is that SQL Server doesn't have to re-evaluate the subquery repeatedly, so it should run faster on large datasets.
Example
/* HAVING allows us to filter on aggregated records. */
WITH SampleData AS
(
    /* This CTE creates some sample records
     * to experiment with.
     */
    SELECT
        r.*
    FROM
        (
            VALUES
                (1, 'A', 1),
                (2, 'A', 2),
                (3, 'A', 3),
                (4, 'B', 1),
                (5, 'B', 2),
                (6, 'C', 2),
                (7, 'C', 3),
                (8, 'D', 1)
        ) AS r(ID, Item, [Order])
)
SELECT
    Item,
    COUNT([Order]) AS Count_Order,
    MIN([Order]) AS Min_Order
FROM
    SampleData
GROUP BY
    Item
HAVING
    MIN([Order]) > 1
;

Your query should work. The problem is probably that Item could be NULL. So try this:
SELECT Distinct(Item)
FROM ITEMS as I
WHERE I.Item NOT IN (SELECT Item FROM Items WHERE Order = 1 AND Item IS NOT NULL);
This is why NOT EXISTS is preferable to NOT IN.
I would do this, though, with an aggregation query:
select item
from items
group by item
having sum(case when [order] = 1 then 1 else 0 end) = 0;

You can use NOT EXISTS:
SELECT DISTINCT(i1.Item)
FROM ITEMS i1
WHERE NOT EXISTS
(
    SELECT 1 FROM Items i2
    WHERE i1.Item = i2.Item AND i2.[Order] = 1
)
NOT IN has its issues; this article is worth reading:
http://sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join
The main problem is that the results can be surprising if the target column is NULLable (SQL Server processes this as a left anti semi join, but can't reliably tell you if a NULL on the right side is equal to – or not equal to – the reference on the left side). Also, optimization can behave differently if the column is NULLable, even if it doesn't actually contain any NULL values.
Because of this, the article recommends:
Instead of NOT IN, use a correlated NOT EXISTS for this query pattern. Always. Other methods may rival it in terms of performance, when all other variables are the same, but all of the other methods introduce either performance problems or other challenges.
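As a runnable sketch of the NOT EXISTS pattern, here is the same dataset in Python's sqlite3, again with a hypothetical NULL Item row added (one extra i1.Item IS NOT NULL filter appears here, which the answer above does not have, because without it NOT EXISTS would also report the NULL row itself):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE Items (ID INTEGER, Item TEXT, "Order" INTEGER)')
con.executemany("INSERT INTO Items VALUES (?, ?, ?)", [
    (1, "A", 1), (2, "A", 2), (3, "A", 3),
    (4, "B", 1), (5, "B", 2),
    (6, "C", 2), (7, "C", 3),
    (8, "D", 1),
    (9, None, 1),   # hypothetical NULL row: cannot poison NOT EXISTS
])

# NOT EXISTS correlates row by row, so a NULL elsewhere in the table
# cannot wipe out the whole result the way NOT IN can.
rows = con.execute('''
    SELECT DISTINCT i1.Item FROM Items i1
    WHERE i1.Item IS NOT NULL
      AND NOT EXISTS (
        SELECT 1 FROM Items i2
        WHERE i1.Item = i2.Item AND i2."Order" = 1
      )
''').fetchall()
print(rows)   # [('C',)]
```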

Related

More than one row returned by a subquery used as an expression when UPDATE on multiple rows

I'm trying to update rows in a single table by splitting them into two "sets" of rows.
The top part of the set should have a status set to X and the bottom one should have a status set to status Y.
I've tried putting together a query that looks like this
WITH x_status AS (
    SELECT id
    FROM people
    WHERE surname = 'foo'
    ORDER BY date_registered DESC
    LIMIT 5
), y_status AS (
    SELECT id
    FROM people
    WHERE surname = 'foo'
    ORDER BY date_registered DESC
    OFFSET 5
)
UPDATE people
SET status = folks.status
FROM (values
    ((SELECT id from x_status), 'X'),
    ((SELECT id from y_status), 'Y')
) as folks (ids, status)
WHERE id IN (folks.ids);
When I run this query I get the following error:
pq: more than one row returned by a subquery used as an expression
This makes sense: folks.ids is expected to return a list of IDs (hence the IN clause in the UPDATE statement), but I suspect the problem is that I cannot return a list from the VALUES expression in the FROM clause, as it turns into something like this:
(1, 2, 3, 4, 5, 5)
(6, 7, 8, 9, 1)
Is there a way this UPDATE can be done using a CTE query at all? I could split it into two separate UPDATE queries, but a single CTE query would be cleaner and in theory faster.
I think I understand now... if I get your problem, you want to set the status to 'X' for the five most recently registered records and 'Y' for everything else?
In that case I think the row_number() analytic would work -- and it should do it in a single pass, two scans, and eliminating one order by. Let me know if something like this does what you seek.
with ranked as (
    select id,
           row_number() over (order by date_registered desc) as rn
    from people
)
update people p
set status = case when r.rn <= 5 then 'X' else 'Y' end
from ranked r
where p.id = r.id
Any time you do an update from another data set, it's helpful to have a where clause that defines the relationship between the two datasets (the non-ANSI join syntax). This makes it iron-clad what you are updating.
Also I believe this code is pretty readable so it will be easier to build on if you need to make tweaks.
Let me know if I missed the boat.
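The same single-pass pattern can be sketched with Python's sqlite3 against hypothetical data (SQLite has supported window functions since 3.25 and UPDATE ... FROM since 3.33, so a reasonably recent Python build is assumed; the surname filter from the question is omitted here, as in the answer above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE people (
    id INTEGER PRIMARY KEY, surname TEXT,
    date_registered TEXT, status TEXT)""")
# Seven hypothetical rows, one registration per day.
con.executemany(
    "INSERT INTO people (surname, date_registered) VALUES (?, ?)",
    [("foo", f"2020-01-{d:02d}") for d in range(1, 8)])

# Rank everyone once, newest first, then update in a single pass.
con.execute("""
    WITH ranked AS (
        SELECT id,
               row_number() OVER (ORDER BY date_registered DESC) AS rn
        FROM people
    )
    UPDATE people
    SET status = CASE WHEN ranked.rn <= 5 THEN 'X' ELSE 'Y' END
    FROM ranked
    WHERE people.id = ranked.id
""")

statuses = dict(con.execute("SELECT date_registered, status FROM people"))
print(statuses)   # five newest get 'X', two oldest get 'Y'
```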
So after more tinkering, I've come up with a solution.
The reason the previous query fails is that we are not grouping the IDs in the subqueries into arrays, so the result expands into a huge list, as I suspected.
The solution is grouping the IDs in each subquery into an ARRAY; that way they are returned as a single value in the ids column.
This is the query that does the job. Note that we must unnest the IDs in the WHERE clause:
WITH x_status AS (
    SELECT id
    FROM people
    WHERE surname = 'foo'
    ORDER BY date_registered DESC
    LIMIT 5
), y_status AS (
    SELECT id
    FROM people
    WHERE surname = 'foo'
    ORDER BY date_registered DESC
    OFFSET 5
)
UPDATE people
SET status = folks.status
FROM (values
    (ARRAY(SELECT id from x_status), 'X'),
    (ARRAY(SELECT id from y_status), 'Y')
) as folks (ids, status)
WHERE id IN (SELECT * from unnest(folks.ids));

Proper way to count how many times an item from 2 lists of items are ordered together

I am currently self-learning SQL so for most of you this would probably seem like a simple question (if I expressed it correctly).
I have a table 'orders' that look like this:
Orddid(Uid)  Ordmid  Odate       Itmsid
-----------  ------  ----------  ------
100101       100101  01.12.2018  12
100102       100101  01.12.2018  88
100103       100101  01.12.2018  57
100104       100102  01.12.2018  12
What I want to do is count the number of Ordmids in which at least one item from each of two lists of items (as in IN (itmsid1, itmsid2)) coexists.
For example, if I query for itmsid in (12, 99) and also itmsid in (22, 57), I would get a count of 1 at the end.
How do I do that?
EDIT: I have to say that this community is amazing! Lightning fast responses and very supportive even. Thank you very much people. I owe you!
I interpret your question as:
How many times does an Ordmid group feature itmsid 12 or 99 in combination with itmsid 22 or 57?
Meaning, an Ordmid group should have either a 12 and a 22, or 12 and 57, or 99 and 22, or 99 and 57 (at least; 12, 22, 57 etc. would also be permitted). In plain English this might be expressed as "How many times did someone buy (a keyboard or a mouse) in combination with (a memory stick or a printer cartridge)?" To qualify for a special offer, someone has to buy at least one item from group 1 and one item from group 2.
Many ways to do, here's one:
SELECT COUNT(DISTINCT t_or_nn.ormid)
FROM
    (SELECT ormid FROM orders WHERE itemid IN (12, 99)) t_or_nn
INNER JOIN
    (SELECT ormid FROM orders WHERE itemid IN (22, 57)) tt_or_fs
    ON t_or_nn.ormid = tt_or_fs.ormid
How it works:
Two subqueries; one pulls a list of all the ormids that have a 12 or 99. The other pulls a list of all the ormids that have a 22 or 57.
When these lists are joined, only the ormids that appear in both lists survive into the result of the join.
We thus end up with a list of only those ormids that have a 12 or 99, in combination with a 22 or 57. Counting this (distinctly, to prevent an ormid with 12,99,22 being counted as 2, or an ormid of 12,22,57,99 items being counted as 4) provides our answer.
If you need more detail on why having an ormid with itemids 12,99,22,57 results in a count of 4, let me know. I won't launch into talking about cartesian products right away as you might already know..
There are a few ways to solve things like this, I've picked on this way as it's fairly easy to explain because the query logic is fairly well aligned with the way a human might think about it
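As a sketch, the join approach can be run with Python's sqlite3 against the question's sample rows (column names follow the question's table rather than the answer's shorthand; this is an illustration, not the asker's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders "
            "(Orddid INTEGER, Ordmid INTEGER, Odate TEXT, Itmsid INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (100101, 100101, "01.12.2018", 12),
    (100102, 100101, "01.12.2018", 88),
    (100103, 100101, "01.12.2018", 57),
    (100104, 100102, "01.12.2018", 12),
])

# One subquery per list; the inner join keeps only Ordmids in both.
(count,) = con.execute("""
    SELECT COUNT(DISTINCT t_or_nn.Ordmid)
    FROM (SELECT Ordmid FROM orders WHERE Itmsid IN (12, 99)) t_or_nn
    INNER JOIN (SELECT Ordmid FROM orders WHERE Itmsid IN (22, 57)) tt_or_fs
        ON t_or_nn.Ordmid = tt_or_fs.Ordmid
""").fetchone()
print(count)   # 1  (only Ordmid 100101 has an item from each list)
```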
You can use group by and having:
select o.ordmid
from orders o
where o.itmid in (12, 99)
group by o.ordmid
having count(distinct o.itmid) = 2; -- number of items in the list
If items cannot repeat within an order, then use count(o.itmid) in the having rather than count(distinct).
If you want the number of ordmids where this occurs, then just use this as a subquery and use count():
select count(*)
from (select o.ordmid
      from orders o
      where o.itmid in (12, 99)
      group by o.ordmid
      having count(distinct o.itmid) = 2 -- number of items in the list
     ) o;
EDIT:
If you have two separate lists and you want orders that have at least one item from each list you can do:
select o.ordmid
from orders o
group by o.ordmid
having sum(case when o.itmid in (<list 1>) then 1 else 0 end) > 0
   and sum(case when o.itmid in (<list 2>) then 1 else 0 end) > 0;
You want only orders that contain at least one item from set 1 and at least one item from set 2. This means you need to select all records where both conditions are true and then count the distinct orders.
For example, you can use exists to check for each order if they have a record in either set:
select count(distinct ordmid)
from orders o
where exists (
          select *
          from orders
          where ordmid = o.ordmid and itmsid in (12, 99)
      )
  and exists (
          select *
          from orders
          where ordmid = o.ordmid and itmsid in (22, 57)
      )
Alternatively, you could use the quantified comparison predicate (ANY or SOME) (see also Quantified Subquery Predicates in the Firebird 2.5 Language Reference). Contrary to the exists solution, this removes the need for a correlated subquery.
select count(distinct ordmid)
from orders o
where ordmid = any (
          select ordmid
          from orders
          where itmsid in (12, 99)
      )
  and ordmid = any (
          select ordmid
          from orders
          where itmsid in (22, 57)
      )
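The EXISTS variant can be checked with Python's sqlite3 against the question's sample rows (SQLite does not support quantified comparisons like = ANY, so only the first query is sketched here; column names follow the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders "
            "(Orddid INTEGER, Ordmid INTEGER, Odate TEXT, Itmsid INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (100101, 100101, "01.12.2018", 12),
    (100102, 100101, "01.12.2018", 88),
    (100103, 100101, "01.12.2018", 57),
    (100104, 100102, "01.12.2018", 12),
])

# Count orders having at least one item from each list.
(count,) = con.execute("""
    SELECT COUNT(DISTINCT Ordmid)
    FROM orders o
    WHERE EXISTS (SELECT * FROM orders
                  WHERE Ordmid = o.Ordmid AND Itmsid IN (12, 99))
      AND EXISTS (SELECT * FROM orders
                  WHERE Ordmid = o.Ordmid AND Itmsid IN (22, 57))
""").fetchone()
print(count)   # 1
```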

Select distinct rows "modulo null"

Suppose I have a table mytable:
a     b     c     d
-----------------------
1     2     3     4
1     1     1     null
1     2     3     4
1     null  null  null
1     2     null  null
1     null  1     null
null  null  null  null
Now the first and third rows of this table are exact duplicates. However, we can also think of the fifth row as duplicating the information contained in the first row, in the sense that 1 2 null null is just a copy of 1 2 3 4 but with some data missing. Let's say that 1 2 null null is covered by 1 2 3 4.
"Covering by" is a relationship like <=, while "exact duplication" is a relationship like ==. In the table above, we also have that the sixth row is covered by the second row, the fourth row is covered by all other rows except for the last, the last row is covered by all other rows, and the first and third rows are covered by each other.
Now I want to deduplicate mytable using this notion of covering. Said differently, I want the "minimal cover." That means that whenever row1 <= row2, row1 should be removed from the result. In this case, the outcome is
a  b  c  d
----------------
1  2  3  4
1  1  1  null
This is like SELECT DISTINCT, but with enhanced null-handling behavior.
More formally, we can define deduplicate(table) as the subset of rows of table such that:
for every row r of table, there exists a row c of deduplicate(table) such that r <= c, and
if c1 and c2 are any two separate rows in deduplicate(table), then c1 <= c2 does not hold.
Or algorithmically:
def deduplicate(table):
    outcome = set()
    for nextRow in table:
        if any(nextRow <= o for o in outcome):
            continue
        else:
            for possiblyNowADuplicate in list(outcome):  # copy, since we mutate the set
                if possiblyNowADuplicate <= nextRow:
                    # it is now a duplicate
                    outcome.remove(possiblyNowADuplicate)
            outcome.add(nextRow)
    return outcome
How can I do this in SQL?
(I'm working in Presto, which allegedly implements modern ANSI SQL. Moreover, the table I'm working with has many more columns and tons more rows than mytable, so the solution has to scale reasonably well, both in code complexity (ideally the query should not be O(n^2) long in the number of columns!) and in execution time.)
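For reference, the covering relation and the deduplication can be pinned down in runnable Python (a direct transcription of the pseudocode above, with None standing in for NULL and a named covers() helper replacing the <= operator):

```python
def covers(big, small):
    """True if `small` is covered by `big`: each field of `small`
    is either None or equal to the corresponding field of `big`."""
    return all(s is None or s == b for s, b in zip(small, big))

def deduplicate(table):
    outcome = set()
    for row in table:
        if any(covers(o, row) for o in outcome):
            continue                      # row is covered; drop it
        for old in list(outcome):         # copy, since we mutate the set
            if covers(row, old):          # row makes an older survivor redundant
                outcome.remove(old)
        outcome.add(row)
    return outcome

mytable = [
    (1, 2, 3, 4),
    (1, 1, 1, None),
    (1, 2, 3, 4),
    (1, None, None, None),
    (1, 2, None, None),
    (1, None, 1, None),
    (None, None, None, None),
]
result = deduplicate(mytable)
print(result)   # two rows survive: (1, 2, 3, 4) and (1, 1, 1, None)
```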
Edit: Based on #toonice's response, I have the following refinements:
On further reflection, it'd be nice if the query code length were O(1) in the number of columns (possibly excluding a single explicit naming of the columns to be operated on in a subtable select, for maintainability). Having a complex boolean condition for each column in both a group by and an order by is a bit much. I'd have to write a python script to generate my sql query. It may be that this is unavoidable, however.
I am operating on at least millions of rows. I cannot do this in O(n^2) time. So:
Is it possible to do this faster?
If not, I should mention that in my real dataset, I have a nonnull column "userid" such that each userid has at most say 100 rows associated with it. Can we take advantage of this segmentation to do the quadratic stuff only over each userid, and then recombine the data all back together? (And there are 60k users, so I definitely cannot name them explicitly in the query.)
Please try the following...
SELECT DISTINCT leftTable.a,
leftTable.b,
leftTable.c,
leftTable.d
FROM tblTable AS leftTable
JOIN tblTable AS rightTable ON ( ( leftTable.a = rightTable.a OR
rightTable.a IS NULL ) AND
( leftTable.b = rightTable.b OR
rightTable.b IS NULL ) AND
( leftTable.c = rightTable.c OR
rightTable.c IS NULL ) AND
( leftTable.d = rightTable.d OR
rightTable.d IS NULL ) )
GROUP BY rightTable.a,
rightTable.b,
rightTable.c,
rightTable.d
ORDER BY ISNULL( leftTable.a ),
leftTable.a DESC,
ISNULL( leftTable.b ),
leftTable.b DESC,
ISNULL( leftTable.c ),
leftTable.c DESC,
ISNULL( leftTable.d ),
leftTable.d DESC;
This statement starts by performing an INNER JOIN on two copies of tblTable, which I have given the aliases of leftTable and rightTable. This join will append a copy of each record from rightTable to every record in leftTable where the record from leftTable covers that from rightTable
The resulting dataset is then grouped to eliminate any duplicate entries in the fields from leftTable.
The grouped dataset is then ordered into descending order, with surviving NULL values being placed after non-NULL values.
Extension
You can use SELECT DISTINCT leftTable.* on the first line if you are happy with selecting all fields from leftTable - I've just gotten into the habit of listing the fields. Either will work just fine in this case; leftTable.* may prove more wieldy if you are dealing with a large number of fields. I'm not sure if there is a difference in execution time between the two methods.
I have not been able to find a way to say where all fields equal in a WHERE clause, either by saying leftTable.* = rightTable.* or something equivalent. Our situation is further complicated by the fact that we are not testing for equivalence, but for covering. Whilst I'd love it if there is a way to test for covering en masse, I'm afraid that you will just have to do a lot of copying, pasting and carefully changing letters so that the test used for each field in my Answer is applied to each of your fields.
Also, I have not been able to find a way to GROUP BY all fields, either in the order that they occur in the table or in any order, short of specifying every field to be grouped on. This too would be nice to know, but for now I think you will have to specify each field from rightTable. Seek out the glories and beware the dangers of copy, paste and edit!
If you do not care about if a row is ordered first or last when the value it is being ordered on is NULL, then you can speed up the statement slightly by removing the ISNULL() conditions from the ORDER BY clause.
If you do not care about ordering at all you can further speed up the statement by removing the ORDER BY clause entirely. Depending on the quirks of your language, you will want to replace it with either nothing or with ORDER BY NULL. Some languages, such as MySQL, automatically sort by the fields specified in a GROUP BY clause unless an ORDER BY clause is specified. ORDER BY NULL is effectively a way of telling it not to do any sorting.
If we are only deduplicating covered records for each user (i.e. each user's records have no bearing on the records of other users), then the following statement should be used...
SELECT DISTINCT leftTable.userid,
leftTable.a,
leftTable.b,
leftTable.c,
leftTable.d
FROM tblTable AS leftTable
JOIN tblTable AS rightTable ON ( leftTable.userid = rightTable.userid AND
( leftTable.a = rightTable.a OR
rightTable.a IS NULL ) AND
( leftTable.b = rightTable.b OR
rightTable.b IS NULL ) AND
( leftTable.c = rightTable.c OR
rightTable.c IS NULL ) AND
( leftTable.d = rightTable.d OR
rightTable.d IS NULL ) )
GROUP BY rightTable.userid,
rightTable.a,
rightTable.b,
rightTable.c,
rightTable.d
ORDER BY leftTable.userid,
ISNULL( leftTable.a ),
leftTable.a DESC,
ISNULL( leftTable.b ),
leftTable.b DESC,
ISNULL( leftTable.c ),
leftTable.c DESC,
ISNULL( leftTable.d ),
leftTable.d DESC;
In a dataset that large, eliminating the need to join other users' records to each user's records removes a lot of processing overhead: more than is created by needing to output another field, test another pair of fields in the join, add another layer of grouping, and ORDER BY another field.
I'm afraid that I can not think of any other way to make this statement more efficient. If anyone does know of a way, then I would like to hear about it.
If you have any questions or comments, then please feel free to post a Comment accordingly.
Appendix
This code was tested in MySQL using a dataset created using the following script...
CREATE TABLE tblTable
(
a INT,
b INT,
c INT,
d INT
);
INSERT INTO tblTable ( a,
b,
c,
d )
VALUES ( 1, 2, 3, 4 ),
( 1, 1, 1, NULL ),
( 1, 2, 3, 4 ),
( 1, NULL, NULL, NULL ),
( 1, 2, NULL, NULL ),
( 1, NULL, 1, NULL ),
( NULL, NULL, NULL, NULL );

SQL WHERE IN (...) sort by order of the list?

Let's say I query a database with a where clause
WHERE _id IN (5,6,424,2)
Is there any way for the returned cursor to be sorted in the order the _ids were listed in the list, i.e. for the _id attribute to run 5, 6, 424, 2 from the first to the last row of the Cursor?
This happens to be on Android through a ContentProvider, but that's probably not relevant.
Select ID list using subquery and join with it:
select t1.*
from t1
inner join
(
    select 1 as id, 1 as num
    union all select 5, 2
    union all select 3, 3
) ids on t1.id = ids.id
order by ids.num
UPD: Code fixed
One approach might be to do separate SQL queries with a UNION between each. You would obviously issue each query in the order you would like it returned to you.
...
order by
    case when _id = 5 then 1
         when _id = 6 then 2
    end
etc.
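Filled in for the ids from the question, the CASE approach can be sketched with Python's sqlite3 (a hypothetical single-column table, for illustration only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (_id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO t VALUES (?)", [(2,), (5,), (6,), (424,)])

# The CASE maps each id to its position in the desired ordering.
rows = con.execute("""
    SELECT _id FROM t
    WHERE _id IN (5, 6, 424, 2)
    ORDER BY CASE _id
                 WHEN 5   THEN 1
                 WHEN 6   THEN 2
                 WHEN 424 THEN 3
                 WHEN 2   THEN 4
             END
""").fetchall()
ids = [r[0] for r in rows]
print(ids)   # [5, 6, 424, 2]
```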
You can join it to a virtual table that contains the list required in sort order
select tbl.*
from tbl
inner join (
select 1 as sorter, 5 as value union all
select 2, 6 union all
select 3, 424 union all
select 4, 2) X on tbl._id = X.value
ORDER BY X.sorter
List? You don't have a list! ;)
This:
WHERE _id IN (5,6,424,2)
is mere syntactic sugar for this:
WHERE (
_id = 5
OR _id = 6
OR _id = 424
OR _id = 2
)
SQL has but one data structure, being the table. Your (5,6,424,2) isn't a table either! :)
You could create a table of values but your next problem is that tables do not have any logical ordering. Therefore, as per #cyberkiwi's answer, you'd have to create a column explicitly to model the sort order. And in order to make it explicit to the calling application, ensure you expose this column in the SELECT clause of your query.

Ordering, grouping and filtering SQL result sets

I've got a number of 'containers' in a database, each of which contains zero or more items. Each item has a name, score, timestamp representing it was added to the container, and a foreign key on the container ID.
I want to fetch all the containers where the top item has a score of 5 or greater (which implies not returning empty containers). As containers act like stacks in this instance, the item with the highest 'added time' is considered the 'top' item.
At present, I'm using the following SQL:
SELECT * FROM (
    SELECT name, container_id, score
    FROM items
    ORDER BY added_time DESC
) AS temptbl
GROUP BY container_id
HAVING score >= 5
This appears to give me the desired results, but it is incredibly slow when the number of items starts to increase - running the query on 8000 containers and 10000 items takes nearly 6 seconds on the MySQL console, which is too slow. Am I doing something obviously inefficient?
Maybe this is what you want:
SELECT name, container_id, score
FROM items AS tb1
RIGHT JOIN (SELECT container_id, MAX(added_time) AS added_time
            FROM items
            GROUP BY container_id) AS tb2
    ON tb1.container_id = tb2.container_id AND tb1.added_time = tb2.added_time
WHERE score >= 5
Try any of the following. It relies on (container_id, added_time) being unique.
select *
from (select container_id, max(added_time) as added_time
from items
group by container_id
) as topitems
join items on(topitems.container_id = items.container_id and
topitems.added_time = items.added_time)
where items.score >= 5;
select *
from items a
where score >= 5
and (added_time) = (select max(b.added_time)
from items b
where a.container_id = b.container_id);
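The first query (join each container's latest added_time back to its row) can be sketched with Python's sqlite3 against hypothetical data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE items (
    name TEXT, score INTEGER, added_time INTEGER, container_id INTEGER)""")
con.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", [
    ("old low",  1, 1, 1), ("top high", 7, 2, 1),   # container 1: top item scores 7
    ("old high", 9, 1, 2), ("top low",  2, 2, 2),   # container 2: top item scores 2
    ("only",     5, 1, 3),                          # container 3: top item scores 5
])

# Find each container's latest added_time, join back, then filter on score.
rows = con.execute("""
    SELECT items.container_id, items.name, items.score
    FROM (SELECT container_id, MAX(added_time) AS added_time
          FROM items
          GROUP BY container_id) AS topitems
    JOIN items ON topitems.container_id = items.container_id
              AND topitems.added_time = items.added_time
    WHERE items.score >= 5
    ORDER BY items.container_id
""").fetchall()
print(rows)   # [(1, 'top high', 7), (3, 'only', 5)]
```

Container 2 is excluded because its top (latest) item scores only 2, even though an older item scores 9.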
Turns out that the inner select had a LEFT JOIN which was causing the slowdown; removing that reduced the query time to 0.01s. It means losing the information brought in by the join, but that can be filled in afterwards (the final number of rows returned is small, so it doesn't matter if I have to run a query for each one to replicate the effect of the LEFT JOIN).