How to find whether an unordered itemset exists - sql

I am representing itemsets in SQL (SQLite, if relevant). My tables look like this:
ITEMS table:
| ItemId | Name |
| 1 | Ginseng |
| 2 | Honey |
| 3 | Garlic |
ITEMSETS:
| ItemSetId | Name |
| ... | ... |
| 7 | GinsengHoney |
| 8 | HoneyGarlicGinseng |
| 9 | Garlic |
ITEMSETS2ITEMS
| ItemsetId | ItemId |
| ... | .... |
| 7 | 1 |
| 7 | 2 |
| 8 | 2 |
| 8 | 1 |
| 8 | 3 |
As you can see, an Itemset may contain several Items, and this relationship is detailed in the Itemset2Items table.
How can I check whether a new itemset is already in the table, and if so, find its ID?
For instance, I want to check whether "Ginseng, Garlic, Honey" is an existing itemset. The desired answer would be "Yes", because there exists a single ItemsetId which contains exactly these three IDs. Note that the set is unordered: a query for "Honey, Garlic, Ginseng" should behave identically.
How can I do this?

I would recommend that you start by placing the item sets that you want to check into a table, with one row per item.
The question is now about the overlap of this "proposed" item set to other itemsets. The following query provides the answer:
select itemsetid
from (select coalesce(ps.itemid, is2i.itemid) as itemid, is2i.itemsetid,
max(case when ps.itemid is not null then 1 else 0 end) as inProposed,
max(case when is2i.itemid is not null then 1 else 0 end) as inItemset
from ProposedSet ps full outer join
ItemSets2items is2i
on ps.itemid = is2i.itemid
group by coalesce(ps.itemid, is2i.itemid), is2i.itemsetid
) t
group by itemsetid
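having min(inProposed) = 1 and min(inItemSet) = 1
and count(*) = (select count(distinct itemid) from ProposedSet)
This joins all the proposed items with all the itemsets. It then groups by the items in each item set, giving a flag as to whether the item is in each set. Finally, it checks that every item of an item set is also in the proposed set, and that the item set has as many items as the proposed set. Without the count check, a strict subset such as {Ginseng, Honey} would be reported as a match for a three-item proposal. Note that full outer join is only available in SQLite 3.39 and later.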

Sounds like you need to find an ItemSet that:
contains all the Items in your wanted list
doesn't contain any other Items
This example will return the ID of such an itemset if it exists.
Note: this solution is for MySQL, but it should work in SQLite once you change the @variables into something SQLite understands, e.g. bind variables.
-- these are the IDs of the items in the new itemset
-- if you add/remove some, make sure to change the IN clauses below
set @id1 = 1;
set @id2 = 2;
-- this is the count of items listed above
set @cnt = 2;
SELECT S.ItemSetId FROM ItemSets S
INNER JOIN
(SELECT ItemsetId, COUNT(*) as C FROM ItemSets2Items
WHERE ItemId IN (@id1, @id2)
GROUP BY ItemsetId
HAVING COUNT(*) = @cnt
) I -- included ingredients
ON I.ItemsetId = S.ItemSetId
LEFT JOIN
(SELECT ItemsetId, COUNT(*) as C FROM ItemSets2Items
WHERE ItemId NOT IN (@id1, @id2)
GROUP BY ItemsetId
) A -- additional ingredients
ON A.ItemsetId = S.ItemSetId
WHERE A.C IS NULL
See fiddle for MySQL.
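For SQLite specifically, the two conditions (contains all wanted items, contains nothing else) can also be checked in a single GROUP BY. The following is a minimal sketch using Python's built-in sqlite3 module, not the answerer's code. The helper names make_demo_db and find_itemset are hypothetical, and it assumes each (ItemsetId, ItemId) pair appears at most once in ItemSets2Items.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database holding the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ItemSets2Items (ItemsetId INTEGER, ItemId INTEGER)")
    conn.executemany(
        "INSERT INTO ItemSets2Items VALUES (?, ?)",
        [(7, 1), (7, 2), (8, 2), (8, 1), (8, 3)],
    )
    return conn


def find_itemset(conn, item_ids):
    """Return the ItemsetId whose items are exactly item_ids, or None."""
    wanted = sorted(set(item_ids))  # order and duplicates do not matter
    if not wanted:
        return None
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT ItemsetId
        FROM ItemSets2Items
        GROUP BY ItemsetId
        HAVING COUNT(*) = ?                          -- same size as the wanted set
           AND SUM(ItemId IN ({placeholders})) = ?   -- and every row is a wanted item
    """
    # Bind values follow the order in which the placeholders appear in the SQL.
    params = [len(wanted), *wanted, len(wanted)]
    row = conn.execute(sql, params).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = make_demo_db()
    print(find_itemset(conn, [1, 3, 2]))  # Ginseng, Garlic, Honey
    print(find_itemset(conn, [2, 1]))     # Honey, Ginseng
```

Because the wanted IDs are passed as bind values and deduplicated first, the order in which the caller lists them has no effect on the result.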

Related

Include zero counts when grouping by multiple columns and setting filters

I have a table (tbl) containing category (2 categories), impact (3 impacts), company name and date for example:
category | impact | company | date | number
---------+----------+---------+-----------|
Animal | Critical | A | 12/31/1999|1
Book | Critical | B | 12/31/2000|2
Animal | Minor | C | 12/31/2001|3
Book | Minor | D | 12/31/2002|4
Animal | Medium | E | 1/1/2003 |5
I want to get the count of records for each category and impact and be able to add rows with zero count and also be able to filter by company and date.
In the example result set below, the count result is 1 for category = Animal and company = A. The rest is 0 records and only the Critical and Medium impacts appear
category | impact | count
---------+----------+-------
Animal | Critical | 1
Animal | Medium | 0
I've looked at the responses to similar questions that use joins; however, once I add a WHERE clause the zero-count rows are no longer included.
I also tried doing outer joins but it doesn't produce desired output. For example
select a.impact, b.category, ISNULL(count(b.impact), 0) from tbl a
left outer join tbl b
on b.number = a.number
and (a.category = 'Animal' and a.company in ('A'))
group by a.impact, b.category
produces
impact | category | count
---------+------------+--------
Medium | NULL | 0
Medium | Animal | 1
Critical | NULL | 0
Minor | NULL | 0
but the desired output should be
category | impact | count
---------+----------+-------
Animal | Critical | 1
Animal | Medium | 0
Animal | Minor | 0
Any help will be appreciated. Answers to associated questions don't have filtering so I will appreciate if someone can help with a query to produce desired output.
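You need a master table with all the possible combinations of categories and impacts for this. Then left join your table to the master and do the aggregation, keeping the company filter in the join condition so that unmatched combinations survive with a count of zero. Something like the following: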
;WITH CAT
AS
(
SELECT
category
FROM Tbl
GROUP BY category
),
IMP
AS
(
SELECT
Impact
FROM Tbl
GROUP BY Impact
),MST
AS
(
SELECT
*
FROM CAT
CROSS JOIN IMP
)
SELECT
MST.category,
MST.Impact,
COUNT(T.Number)
FROM MST
LEFT JOIN Tbl T
ON MST.category = T.category
AND MST.Impact = T.Impact
AND T.Company = 'A'
WHERE MST.Category = 'Animal'
GROUP BY MST.category,
MST.Impact
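To see why the company filter belongs in the ON clause and not in WHERE, here is a small runnable sketch of the same cross join plus left join idea. It runs against SQLite through Python's sqlite3 module, which is an assumption made for testability (the answer above targets SQL Server). The function names are hypothetical and the rows are the sample data from the question.

```python
import sqlite3


def make_demo_db():
    """In-memory copy of the question's sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl (category TEXT, impact TEXT, company TEXT, date TEXT, number INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
        [
            ("Animal", "Critical", "A", "12/31/1999", 1),
            ("Book", "Critical", "B", "12/31/2000", 2),
            ("Animal", "Minor", "C", "12/31/2001", 3),
            ("Book", "Minor", "D", "12/31/2002", 4),
            ("Animal", "Medium", "E", "1/1/2003", 5),
        ],
    )
    return conn


def count_with_zeros(conn, category, company):
    """Count rows per (category, impact), keeping combinations with no match."""
    sql = """
        WITH cat AS (SELECT DISTINCT category FROM tbl),
             imp AS (SELECT DISTINCT impact FROM tbl),
             mst AS (SELECT category, impact FROM cat CROSS JOIN imp)
        SELECT mst.category, mst.impact, COUNT(t.number) AS cnt
        FROM mst
        LEFT JOIN tbl t
               ON t.category = mst.category
              AND t.impact = mst.impact
              AND t.company = ?      -- filter on the detail rows goes in ON
        WHERE mst.category = ?       -- filter on the master rows goes in WHERE
        GROUP BY mst.category, mst.impact
        ORDER BY mst.impact
    """
    return conn.execute(sql, (company, category)).fetchall()


if __name__ == "__main__":
    for row in count_with_zeros(make_demo_db(), "Animal", "A"):
        print(row)
```

Moving the company condition into WHERE would discard the rows where the left join found nothing, which is exactly the zero-count rows the question wants to keep.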

How do you flip rows into new columns?

I've got a table that looks like this:
player_id | violation
---------------------
1 | A
1 | A
1 | B
2 | C
3 | D
3 | A
And I want to turn it into this, with a bunch of new columns that refer to the types of violations, and then the sum of the number of each individual type of violation that each player got (not that concerned with what the columns are called; a/b/c/d would work great as well):
player_id | violation_a | violation_b | violation_c | violation_d
-----------------------------------------------------------------
1 | 2 | 1 | 0 | 0
2 | 0 | 0 | 1 | 0
3 | 1 | 0 | 0 | 1
I know how I could do this, but it would take a ton of lines of code, since there are in reality 100+ types of violations. Is there any way (perhaps with a tablefunc()?) that I could do this more concisely than spelling out each of the new 100+ columns that I want and the logic for them each individually?
In pure SQL I don't see how you could avoid declaring the columns yourself. You either have to create subselects or filters in every column ..
SELECT DISTINCT ON (t.player_id)
t.player_id,
count(*) FILTER (WHERE violation = 'A') AS violation_a,
count(*) FILTER (WHERE violation = 'B') AS violation_b,
count(*) FILTER (WHERE violation = 'C') AS violation_c,
count(*) FILTER (WHERE violation = 'D') AS violation_d
FROM t
GROUP BY t.player_id;
.. or create a pivot table:
SELECT *
FROM crosstab(
'SELECT player_id, t2.violation, count(*) FILTER (WHERE t.violation = t2.violation)::INT
FROM t,(SELECT DISTINCT violation FROM t) t2
GROUP BY player_id, t2.violation
ORDER BY 1, 2'
) AS ct(player_id INT,violation_a int,violation_b int,violation_c int,violation_d int);
Demo: db<>fiddle
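With 100+ violation types, one common workaround is to let client code generate the column list instead of typing it by hand. The sketch below illustrates that idea under stated assumptions: it uses Python's sqlite3 module against SQLite, not PostgreSQL, it replaces FILTER with the portable SUM(CASE ...) form, and the function names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """In-memory copy of the question's sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (player_id INTEGER, violation TEXT)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)",
        [(1, "A"), (1, "A"), (1, "B"), (2, "C"), (3, "D"), (3, "A")],
    )
    return conn


def pivot_violations(conn):
    """Return (column_names, rows) with one count column per violation type."""
    kinds = [row[0] for row in conn.execute(
        "SELECT DISTINCT violation FROM t ORDER BY violation")]
    if not kinds:
        return [], []
    # One conditional sum per violation type. The value is bound as a parameter,
    # and the alias is double-quoted with any embedded quotes doubled.
    columns = ", ".join(
        'SUM(CASE WHEN violation = ? THEN 1 ELSE 0 END) AS "violation_{}"'.format(
            kind.lower().replace('"', '""'))
        for kind in kinds
    )
    sql = f"SELECT player_id, {columns} FROM t GROUP BY player_id ORDER BY player_id"
    cursor = conn.execute(sql, kinds)
    names = [col[0] for col in cursor.description]
    return names, cursor.fetchall()


if __name__ == "__main__":
    names, rows = pivot_violations(make_demo_db())
    print(names)
    for row in rows:
        print(row)
```

The trade-off is that the query text now depends on the data, so the result shape can change between runs as new violation types appear.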

Find SQL table rows where there are multiple different values

I want to be able to filter out groups where the values aren't the same. When doing the query:
SELECT
category.id as category_id,
object.id as object_id,
object.value as value
FROM
category,
object
WHERE
category.id = object.category
We get the following results:
category_id | object_id | value
-------------+-----------+-------
1 | 1 | 1
1 | 2 | 2
1 | 3 | 2
2 | 4 | 3
2 | 5 | 2
3 | 6 | 1
3 | 7 | 1
The goal: Update the query so that it yields:
category_id
-------------
1
2
In other words, find the categories where the values are different from the others in that same category.
I have tried many different methods of joining, grouping and so on, to no avail.
I know it can be done with multiple queries and then filter with a little bit of logic, but this is not the goal.
You can use aggregation:
SELECT o.category as category_id
FROM object o
GROUP BY o.category
HAVING MIN(o.value) <> MAX(o.value);
You have left the FROM clause out of your query. But as written, you don't need a JOIN at all. The object table is sufficient -- because you are only fetching the category id.
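The MIN/MAX comparison can be checked against the question's sample rows with a short script. This is only an illustrative sketch using Python's sqlite3 module; the function names are hypothetical and the table holds just the columns the query needs.

```python
import sqlite3


def make_demo_db():
    """In-memory object table with the (id, category, value) rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE object (id INTEGER, category INTEGER, value INTEGER)")
    conn.executemany(
        "INSERT INTO object VALUES (?, ?, ?)",
        [(1, 1, 1), (2, 1, 2), (3, 1, 2), (4, 2, 3), (5, 2, 2), (6, 3, 1), (7, 3, 1)],
    )
    return conn


def categories_with_mixed_values(conn):
    """Return category ids whose objects do not all share the same value."""
    sql = """
        SELECT o.category AS category_id
        FROM object o
        GROUP BY o.category
        HAVING MIN(o.value) <> MAX(o.value)  -- min equals max only if all values match
        ORDER BY o.category
    """
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    print(categories_with_mixed_values(make_demo_db()))
```

Note that MIN and MAX ignore NULLs, so a category whose only difference is a NULL value would not be reported by this query.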

Best Way to Join One Column on Columns From Two Other Tables

I have a schema like the following in Oracle
Section:
+--------+----------+
| sec_ID | group_ID |
+--------+----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
+--------+----------+
Section_to_Item:
+--------+---------+
| sec_ID | item_ID |
+--------+---------+
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
+--------+---------+
Item:
+---------+------+
| item_ID | data |
+---------+------+
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
+---------+------+
Item_Version:
+---------+----------+--------+
| item_ID | start_ID | end_ID |
+---------+----------+--------+
| 1 | 1 | |
| 2 | 1 | 3 |
| 3 | 2 | |
| 4 | 1 | 2 |
+---------+----------+--------+
Section_to_Item has FK into Section and Item on the *_ID columns.
Item_version is indexed on item_ID but has no FK to Item.item_ID (ran out of space in the snapshot group).
I have code that receives a list of version IDs and I want to get all items in sections in a given group that are valid for at least one of the versions passed in. If an item has no end_ID, it's valid for anything starting with start_ID. If it has an end_id, it's valid for anything up until (not including) end_ID.
What I currently have is:
SELECT Item.data
FROM Section, Section_to_Item, Item, Item_Version
WHERE Section.group_ID = 1
AND Section_to_Item.sec_ID = Section.sec_ID
AND Item.item_ID = Section_to_Item.item_ID
AND Item.item_ID = Item_Version.item_ID
AND exists (
SELECT *
FROM (
SELECT 2 AS version FROM DUAL
UNION ALL SELECT 3 AS version FROM DUAL
) passed_versions
WHERE Item_Version.start_ID <= passed_versions.version
AND (Item_Version.end_ID IS NULL or Item_Version.end_ID > passed_versions.version)
)
Note that the UNION ALL statement is dynamically generated from the list of passed in versions.
This query currently does a cartesian join and is very slow.
For some reason, if I change the query to join
AND Item_Version.item_ID = Section_to_Item.item_ID
which is not a FK, the query does not do the cartesian join and is much faster.
A) Can anyone explain why this is?
B) Is this the right way to be joining this sequence of tables (I feel weird about joining Item.item_ID to two different tables)
C) Is this the right way to get versions between start_ID and end_ID?
Edit
Same query with inner join syntax:
SELECT Item.data
FROM Item
INNER JOIN Section_to_Item ON Section_to_Item.item_ID = Item.item_ID
INNER JOIN Section ON Section.sec_ID = Section_to_Item.sec_ID
INNER JOIN Item_Version ON Item_Version.item_ID = Item.item_ID
WHERE Section.group_ID = 1
AND exists (
SELECT *
FROM (
SELECT 2 AS version FROM DUAL
UNION ALL SELECT 3 AS version FROM DUAL
) passed_versions
WHERE Item_Version.start_ID <= passed_versions.version
AND (Item_Version.end_ID IS NULL or Item_Version.end_ID > passed_versions.version)
)
Note that in this case the performance difference comes from joining on Item_Version first and then joining Section_to_Item on Item_Version.item_ID.
In terms of table size, Section_to_Item, Item, and Item_Version should be similar (1000s) while Section should be small.
Edit
I just found out that apparently, the schema has no FKs. The FKs specified in the schema configuration files are ignored. They're just there for documentation. So there's no difference between joining on a FK column or not. That being said, by changing the joins into a cascade of SELECT INs, I'm able to avoid joining the entire Item table twice. I don't love the resulting query, and I don't really understand the difference, but the stats indicate it's much less work (changes the A-Rows returned from the inner most scan on Section from 656,000 to 488 (it used to be 656k starts returning 1 row, now it's 488 starts returning 1 row)).
Edit
It turned out to be stale statistics - the two queries were equivalent the whole time but with the incomplete statistics, the DB happened to notice the correct plan only in the second instance. After updating statistics, both queries generated the same plan.
I'm not sure if this is the best idea but this seems to avoid the cartesian join:
select data
from Item
where item_ID in (
select item_ID
from Item_Version
where item_ID in (
select item_ID
from Section_to_Item
where sec_ID in (
select sec_ID
from Section
where group_ID = 1
)
)
and exists (
select 1
from (
select 2 as version
from dual
union all
select 3 as version
from dual
) versions
where versions.version >= start_ID
and (end_ID is null or versions.version < end_ID)
)
)
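The validity rule from the question (valid from start_ID onward, up to but not including end_ID when one is set) can be exercised on the sample rows. The sketch below is an illustration, not the poster's Oracle code: it runs on SQLite through Python's sqlite3 module, replaces the SELECT ... FROM DUAL list with a parameterised VALUES list, and uses hypothetical function names.

```python
import sqlite3


def make_demo_db():
    """In-memory copy of the four sample tables from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Section (sec_ID INTEGER, group_ID INTEGER);
        CREATE TABLE Section_to_Item (sec_ID INTEGER, item_ID INTEGER);
        CREATE TABLE Item (item_ID INTEGER, data TEXT);
        CREATE TABLE Item_Version (item_ID INTEGER, start_ID INTEGER, end_ID INTEGER);
        INSERT INTO Section VALUES (1, 1), (2, 1), (3, 2), (4, 2);
        INSERT INTO Section_to_Item VALUES (1, 1), (1, 2), (2, 3), (2, 4);
        INSERT INTO Item VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
        INSERT INTO Item_Version VALUES (1, 1, NULL), (2, 1, 3), (3, 2, NULL), (4, 1, 2);
    """)
    return conn


def items_valid_for(conn, group_id, versions):
    """Return data of items in the group that are valid for at least one version."""
    if not versions:
        return []
    values = ", ".join("(?)" for _ in versions)  # one bound row per passed version
    sql = f"""
        WITH passed_versions(version) AS (VALUES {values})
        SELECT i.data
        FROM Section s
        JOIN Section_to_Item si ON si.sec_ID = s.sec_ID
        JOIN Item i ON i.item_ID = si.item_ID
        JOIN Item_Version iv ON iv.item_ID = i.item_ID
        WHERE s.group_ID = ?
          AND EXISTS (
              SELECT 1
              FROM passed_versions pv
              WHERE iv.start_ID <= pv.version
                AND (iv.end_ID IS NULL OR iv.end_ID > pv.version)  -- end is exclusive
          )
        ORDER BY i.data
    """
    return [row[0] for row in conn.execute(sql, [*versions, group_id])]


if __name__ == "__main__":
    print(items_valid_for(make_demo_db(), 1, [2, 3]))
```

For versions 2 and 3, item 4 drops out because its range ends at 2 and the end is exclusive, while item 2 stays because version 2 still falls inside its range.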

How do you update a sql table based on distinct matching counts in another table? [closed]

I have 2 tables.
Contacts
+----+------+
| ID | Tier |
+----+------+
| 1 | Low |
| 2 | High |
| 3 | Max |
+----+------+
Events
+----+-----------+-----------+
| ID | EventType | GroupType |
+----+-----------+-----------+
| 1 | Open | A |
| 2 | Open | A |
| 3 | Open | A |
| 1 | Delete | B |
| 2 | Open | B |
| 3 | Open | B |
| 1 | Open | A |
| 3 | Open | C |
+----+-----------+-----------+
If Events contains 2 unique GroupTypes where EventType = 'Open' then the associated Contact record needs to be updated to a Tier of 'High', else if there are more than 2 I need to update to 'Max', else if there are fewer I need to update to 'Low'. (The above table shows correct tiers)
When attempting the below, I get "Error near Group". Can I group while updating? Is there a better way to get these results?
Update c
SET c.Tier = (CASE WHEN count(DISTINCT(e.GroupType)) > 2 THEN 'Max'
WHEN count(DISTINCT(e.GroupType)) = 2 THEN 'High'
ELSE 'Low'
END)
FROM Contacts c JOIN Events e on c.ID = e.ID
WHERE e.EventType = 'Open'
GROUP BY c.ID
You can't use GROUP BY in your UPDATE statement. You just need to form a separate query that gives you the rows you need to update, and join on that. You are updating based on ID, and the value you are setting depends on the number of distinct GroupTypes among the open events for that ID, so form a query finding that count by ID:
-- query distinct open group types by contact id
SELECT ID, COUNT(DISTINCT GroupType) AS OpenGroupCount
FROM Events
WHERE EventType = 'Open'
GROUP BY ID
Now it's simple to link that to contacts and update:
UPDATE c
SET c.Tier = CASE
WHEN COALESCE(ec.OpenGroupCount, 0) > 2 then 'Max'
WHEN COALESCE(ec.OpenGroupCount, 0) = 2 then 'High'
ELSE 'Low'
END
FROM Contacts c
LEFT OUTER JOIN ( -- left join to update contacts with no open events
SELECT ID, COUNT(DISTINCT GroupType) AS OpenGroupCount
FROM Events
WHERE EventType = 'Open'
GROUP BY ID
) ec ON ec.ID = c.ID
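Since the question asks about unique GroupTypes, the tier rule can be checked against the sample data with a distinct count. The sketch below is a hedged illustration on SQLite through Python's sqlite3 module, not the T-SQL above: it uses a correlated subquery in place of UPDATE ... FROM, and the function names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """In-memory copy of the sample tables, with tiers left unset."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Contacts (ID INTEGER, Tier TEXT)")
    conn.execute("CREATE TABLE Events (ID INTEGER, EventType TEXT, GroupType TEXT)")
    conn.executemany("INSERT INTO Contacts VALUES (?, NULL)", [(1,), (2,), (3,)])
    conn.executemany(
        "INSERT INTO Events VALUES (?, ?, ?)",
        [
            (1, "Open", "A"), (2, "Open", "A"), (3, "Open", "A"),
            (1, "Delete", "B"), (2, "Open", "B"), (3, "Open", "B"),
            (1, "Open", "A"), (3, "Open", "C"),
        ],
    )
    return conn


def update_tiers(conn):
    """Set each contact's tier from its number of distinct open GroupTypes."""
    # Correlated subquery: evaluated per contact, and yields 0 when nothing matches.
    open_groups = """(
        SELECT COUNT(DISTINCT e.GroupType)
        FROM Events e
        WHERE e.ID = Contacts.ID AND e.EventType = 'Open'
    )"""
    conn.execute(f"""
        UPDATE Contacts
        SET Tier = CASE
            WHEN {open_groups} > 2 THEN 'Max'
            WHEN {open_groups} = 2 THEN 'High'
            ELSE 'Low'
        END
    """)
    return conn.execute("SELECT ID, Tier FROM Contacts ORDER BY ID").fetchall()


if __name__ == "__main__":
    print(update_tiers(make_demo_db()))
```

Contact 1 has two open events, but both are in group A, so a plain COUNT(*) would put it in 'High' while the distinct count keeps it at 'Low', matching the tiers shown in the question.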
This is exactly what a VIEW is for in SQL.
You can make it an indexed view, if necessary.
As a general principle, try to minimize data dependency between tables; foreign keys are OK, but if you need data in this table to match data from a different table, use views (or calculated columns) rather than hard-coded UPDATEs.