Select parents with children in multiple places - sql

I have two tables, boxes and things that partially model a warehouse.
A box may
contain a single thing
contain one or more boxes
be empty
There is only one level of nesting: a box may be a parent or a child, but not a grandparent.
I want to identify parent boxes that satisfy these criteria:
have children in more than one place
only child boxes associated with a quantity > 0 are to be considered
Using the example data, the box with id 2 should be selected, because it has children with positive quantities in two places. Box 1 should be rejected because all its children are in a single place, and box 3 should be rejected because, while it has children in two places, only one place has a positive quantity.
The query should work on all supported versions of PostgreSQL. Both tables contain around two million records.
Setup:
DROP TABLE IF EXISTS things;
DROP TABLE IF EXISTS boxes;
CREATE TABLE boxes (
id serial primary key,
box_id integer references boxes(id)
);
CREATE TABLE things (
id serial primary key,
box_id integer references boxes(id),
place_id integer,
quantity integer
);
INSERT INTO boxes (box_id)
VALUES (NULL), (NULL), (NULL), (1), (1), (2), (2), (3), (3);
INSERT INTO things (box_id, place_id, quantity)
VALUES (4, 1, 1), (5, 1, 1), (6, 2, 1), (7, 3, 1), (8, 4, 1), (9, 5, 0);
I have come up with this solution:
WITH parent_places AS (
SELECT DISTINCT ON (b.box_id, t.place_id) b.box_id, t.place_id
FROM boxes b
JOIN things t ON b.id = t.box_id
WHERE t.quantity > 0
)
SELECT box_id, COUNT(box_id)
FROM parent_places
GROUP BY box_id
HAVING COUNT(box_id) > 1;
but I'm wondering if I've missed a more obvious solution (or if my solution has any errors that I've overlooked).
DB Fiddle

The only way a box can have things in different places is when it contains several boxes with things in them.
SELECT
b2.box_id, COUNT(DISTINCT place_id)
FROM
boxes b2
JOIN things t ON b2.id = t.box_id AND quantity > 0
WHERE
b2.box_id IS NOT NULL
GROUP BY
b2.box_id
HAVING
COUNT(DISTINCT place_id) > 1;
I see no reason to use a CTE as in your example. I think you should use the simplest query that does the job.

Related

How to search an entry in a table and return the column name or index in PostgreSQL

I have a table representing a card deck with 4 cards that each have a unique ID. Now I want to look for a specific card ID in the table and find out which card in the deck it is.
card1   | card2   | card3   | card4
cardID1 | cardID2 | cardID3 | cardID4
If my table looked like this, for example, I would like to do something like:
SELECT column_name WHERE cardID3 IN (card1, card2, card3, card4)
looking for an answer i found this: SQL Server : return column names based on a record's value
but this doesn't seem to work for PostgreSQL.
SQL Server's cross apply is the SQL standard cross join lateral.
SELECT Cname
FROM decks
CROSS join lateral (VALUES('card1',card1),
('card2',card2),
('card3',card3),
('card4',card4)) ca (cname, data)
WHERE data = 3
Demonstration.
However, the real problem is the design of your table. In general, if you have col1, col2, col3... you should instead be using a join table.
create table cards (
id serial primary key,
value text
);
create table decks (
id serial primary key
);
create table deck_cards (
deck_id integer not null references decks,
card_id integer not null references cards,
position integer not null check(position > 0),
-- Can't have the same card in a deck twice.
unique(deck_id, card_id),
-- Can't have two cards in the same position twice.
unique(deck_id, position)
);
insert into cards(id, value) values (1, 'KH'), (2, 'AH'), (3, '9H'), (4, 'QH');
insert into decks values (1), (2);
insert into deck_cards(deck_id, card_id, position) values
(1, 1, 1), (1, 3, 2),
(2, 1, 1), (2, 4, 2), (2, 2, 3);
We've made sure a deck can't contain the same card twice, nor two cards in the same position.
-- Can't insert the same card.
insert into deck_cards(deck_id, card_id, position) values (1, 1, 3);
-- Can't insert the same position
insert into deck_cards(deck_id, card_id, position) values (2, 3, 3);
You can query a card's position directly.
select deck_id, position from deck_cards where card_id = 3
And there is no arbitrary limit on the number of cards in a deck; you can apply one with a trigger if needed.
Demonstration.
This is a rather bad idea. Column names belong to the database structure, not to the data. So you can select IDs and names stored as data, but you should not have to select column names. And actually a user using your app should not be interested in column names; they can be rather technical.
It would probably be a good idea if you changed the data model and stored card names along with the IDs, but of course I don't know exactly how you want to work with your data.
Anyway, if you want to stick with your current database design, you can still select those names, by including them in your query:
select
case when card1 = 123 then 'card1'
when card2 = 123 then 'card2'
when card3 = 123 then 'card3'
when card4 = 123 then 'card4'
end as card_column
from cardtable
where 123 in (card1, card2, card3, card4);
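For anyone who wants to try the CASE approach, here is a minimal Python/sqlite3 sketch; the table name cardtable matches the answer, but the sample row is an assumption for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cardtable (card1 INTEGER, card2 INTEGER, card3 INTEGER, card4 INTEGER);
INSERT INTO cardtable VALUES (7, 123, 42, 9);
""")

# The CASE expression maps the matching column back to its name.
row = conn.execute("""
SELECT CASE WHEN card1 = 123 THEN 'card1'
            WHEN card2 = 123 THEN 'card2'
            WHEN card3 = 123 THEN 'card3'
            WHEN card4 = 123 THEN 'card4'
       END AS card_column
FROM cardtable
WHERE 123 IN (card1, card2, card3, card4);
""").fetchone()
print(row[0])  # the value 123 sits in card2
```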

How to speed up a slow MariaDB SQL query that has a flat BNL join?

I'm having problems with a slow SQL query running on the following system:
Operating system: Debian 11 (bullseye)
Database: MariaDB 10.5.15 (the version packaged for bullseye)
The table schemas and some sample data (no DB Fiddle as it doesn't support MariaDB):
DROP TABLE IF EXISTS item_prices;
DROP TABLE IF EXISTS prices;
DROP TABLE IF EXISTS item_orders;
CREATE TABLE item_orders
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
ordered_date DATE NOT NULL
) Engine=InnoDB;
CREATE TABLE prices
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
selected_flag TINYINT UNSIGNED NOT NULL
) Engine=InnoDB;
CREATE TABLE item_prices
(
item_order_id INT UNSIGNED NOT NULL,
price_id INT UNSIGNED NOT NULL,
PRIMARY KEY (item_order_id, price_id),
FOREIGN KEY (item_order_id) REFERENCES item_orders(id),
FOREIGN KEY (price_id) REFERENCES prices(id)
) Engine=InnoDB;
INSERT INTO item_orders VALUES (1, '2022-01-01');
INSERT INTO item_orders VALUES (2, '2022-02-01');
INSERT INTO item_orders VALUES (3, '2022-03-01');
INSERT INTO prices VALUES (1, 0);
INSERT INTO prices VALUES (2, 0);
INSERT INTO prices VALUES (3, 1);
INSERT INTO prices VALUES (4, 0);
INSERT INTO prices VALUES (5, 0);
INSERT INTO prices VALUES (6, 1);
INSERT INTO item_prices VALUES (1, 1);
INSERT INTO item_prices VALUES (1, 2);
INSERT INTO item_prices VALUES (1, 3);
INSERT INTO item_prices VALUES (2, 4);
INSERT INTO item_prices VALUES (2, 5);
INSERT INTO item_prices VALUES (3, 6);
A high-level overview of the table usage is:
For any given month, there will be thousands of rows in item_orders.
A row in item_orders will link to zero or more rows in item_prices (item_orders.id = item_prices.item_order_id).
A row in item_prices will have exactly one linked row in prices (item_prices.price_id = prices.id).
For any given row in item_orders, there will be zero or one row in prices where the selected_flag is 1 (item_orders.id = item_prices.item_order_id AND item_prices.price_id = prices.id AND prices.selected_flag = 1). This is enforced by the application rather than the database (i.e. it's not defined as a CONSTRAINT).
What I want to get, in a single query, are:
The number of rows in item_orders.
The number of rows in item_orders where the related selected_flag is 1.
At the moment I have the following query:
SELECT
COUNT(item_orders.id) AS item_order_count,
SUM(CASE WHEN prices.id IS NOT NULL THEN 1 ELSE 0 END) AS item_order_selected_count
FROM
item_orders
LEFT JOIN prices ON prices.id IN (
SELECT price_id
FROM item_prices
WHERE
item_prices.item_order_id = item_orders.id)
AND prices.selected_flag = 1
This query returns the correct data (item_order_count = 3, item_order_selected_count = 2), however it takes a long time (over 10 seconds) to run on a live dataset, which is too slow for users (it is a heavily-used report, refreshed repeatedly through the day). I think the problem is the subquery in the LEFT JOIN, as removing the LEFT JOIN and the associated SUM reduces the query time to around 0.1 seconds. Also, the EXPLAIN output for the join has this in the Extra column:
Using where; Using join buffer (flat, BNL join)
Searching for 'flat BNL join' reveals a lot of information, of which the summary seems to be: 'BNL joins are slow, avoid them if you can'.
Is it possible to rewrite this query to return the same information, but avoiding the BNL join?
Things I've considered already:
All the ID columns are indexed (item_orders.id, prices.id, item_prices.item_order_id, item_prices.price_id).
Splitting the query in two - one for item_order_count (no JOIN), the other for item_order_selected_count (INNER JOIN, as I only need rows which match). This works but isn't ideal as I want to build up this query to return more data (I've stripped it back to the minimum for this question). Also, I'm trying to keep the query output as close as possible to what the user will see, as that makes debugging easier and makes the database (which is optimised for that workload) do the work, rather than the application.
Changing the MariaDB configuration: Some of the search results for BNL joins suggest changing configuration options, however I'm wary of doing this as there are hundreds of other queries in the application and I don't want to cause a regression (e.g. speed up this query but accidentally slow down all the others).
Upgrading MariaDB: This would be a last resort as it would involve using a version different to that packaged with Debian, might break other parts of the application, and the system has just been through a major upgrade.
Not sure whether this will be any faster but worth a try (table joins on indexed foreign keys are fast and sometimes simplicity is king...)
SELECT
(SELECT COUNT(*) FROM item_orders) AS item_order_count,
(SELECT COUNT(*)
FROM item_orders io
JOIN item_prices ip
ON io.id = ip.item_order_id
JOIN prices p
ON ip.price_id = p.id
WHERE p.selected_flag = 1) AS item_order_selected_count;
I came back to this question this week as the performance got even worse as the number of rows increased, to the point where it was taking over 2 minutes to run the query (with around 100,000 rows in the item_orders table, so hardly 'big data').
I remembered that it was possible to list multiple tables in the FROM clause and wondered if the same was true of a LEFT JOIN. It turns out this is the case and the query can be rewritten as:
SELECT
COUNT(item_orders.id) AS item_order_count,
SUM(CASE WHEN prices.id IS NOT NULL THEN 1 ELSE 0 END) AS item_order_selected_count
FROM
item_orders
LEFT JOIN (item_prices, prices) ON
item_prices.item_order_id = item_orders.id
AND prices.id = item_prices.price_id
AND prices.selected_flag = 1
This returns the same results but takes less than a second to execute. Unfortunately I don't know any relational algebra to prove this, but effectively what I am saying is 'only LEFT JOIN where everything matches on both item_prices and prices'.
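The `LEFT JOIN (item_prices, prices)` syntax is MariaDB-specific; a portable way to express the same idea is a LEFT JOIN against a derived table of "orders that have a selected price". A Python/sqlite3 sketch of that equivalent form on the sample data (my translation, not the original MariaDB query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_orders (id INTEGER PRIMARY KEY, ordered_date TEXT NOT NULL);
CREATE TABLE prices (id INTEGER PRIMARY KEY, selected_flag INTEGER NOT NULL);
CREATE TABLE item_prices (item_order_id INTEGER NOT NULL, price_id INTEGER NOT NULL,
                          PRIMARY KEY (item_order_id, price_id));
INSERT INTO item_orders VALUES (1,'2022-01-01'),(2,'2022-02-01'),(3,'2022-03-01');
INSERT INTO prices VALUES (1,0),(2,0),(3,1),(4,0),(5,0),(6,1);
INSERT INTO item_prices VALUES (1,1),(1,2),(1,3),(2,4),(2,5),(3,6);
""")

# LEFT JOIN a derived table that resolves item_prices -> prices first;
# since the application guarantees at most one selected price per order,
# the join cannot multiply item_orders rows.
row = conn.execute("""
SELECT COUNT(io.id) AS item_order_count,
       SUM(CASE WHEN sel.item_order_id IS NOT NULL THEN 1 ELSE 0 END)
           AS item_order_selected_count
FROM item_orders io
LEFT JOIN (SELECT ip.item_order_id
           FROM item_prices ip
           JOIN prices p ON p.id = ip.price_id
           WHERE p.selected_flag = 1) sel
  ON sel.item_order_id = io.id;
""").fetchone()
print(row)  # 3 orders, of which 2 have a selected price
```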

SQL query for key value table with 1:n relation

I have a table in which I want to store images. Each image has arbitrary properties that I want to store in a key-value table.
The table structure looks like this:

id | fk_picture_id | key     | value
1  | 1             | camera  | iphone
2  | 1             | year    | 2001
3  | 1             | country | Germany
4  | 2             | camera  | iphone
5  | 2             | year    | 2020
6  | 2             | country | United States
Now if I want a query to find all pictures taken with an iPhone, I could do something like this:
select
fk_picture_id
from
my_table
where
key = 'camera'
and
value = 'iphone';
This works without any problems. But as soon as I want to add another key to my query, I get stuck. Let's say I want all pictures taken with an iPhone in the year 2020; I cannot do something like
select
distinct(fk_picture_id)
from
my_table
where
(
key = 'camera'
and
value = 'iphone'
)
or
(
key = 'year'
and
value = '2020'
)
...because this selects the ids 1, 4 and 5.
In the end I might have 20 - 30 different criteria to look for, so I don't think chained sub-selects would work.
I'm still in the design phase, which means I can still adjust the data model as well. But I can't think of any way to do this in a reasonable way - except to include the individual properties as columns in my main table.
A pattern you can consider here is to build a table of search parameters, then simply join this to your target table.
You would first create a temporary table with key and value columns then insert into it the search criteria values, any number of values you wish.
Using a CTE in place of a temporary table, and requiring every criterion to match (not just any one of them), might look like:
with s as (
select 'camera' as key, 'iphone' as value
union all
select 'year', '2020'
)
select t.fk_picture_id
from s
join my_table t on t.key = s.key and t.value = s.value
group by t.fk_picture_id
having count(*) = (select count(*) from s)
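A quick Python/sqlite3 sketch of this pattern against the question's sample rows; note the GROUP BY/HAVING, which makes the join require every criterion rather than any one of them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_table (id INTEGER, fk_picture_id INTEGER, "key" TEXT, value TEXT);
INSERT INTO my_table VALUES
  (1, 1, 'camera', 'iphone'), (2, 1, 'year', '2001'), (3, 1, 'country', 'Germany'),
  (4, 2, 'camera', 'iphone'), (5, 2, 'year', '2020'), (6, 2, 'country', 'United States');
""")

# The CTE holds the search criteria; a picture qualifies only if it
# matches as many rows of the CTE as the CTE contains.
rows = conn.execute("""
WITH s("key", value) AS (
  VALUES ('camera', 'iphone'), ('year', '2020')
)
SELECT t.fk_picture_id
FROM s
JOIN my_table t ON t."key" = s."key" AND t.value = s.value
GROUP BY t.fk_picture_id
HAVING COUNT(*) = (SELECT COUNT(*) FROM s);
""").fetchall()
print(rows)  # only picture 2 matches both criteria
```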
The solution I found - thanks to this article
How to query data based on multiple 'tags' in SQL?
is that I made some changes to the database model
picture

id | name
1  | Picture 1
2  | Picture 2

And then I created a table for the tags

tag

id  | tag
100 | Germany
101 | iPhone
102 | United States

And the cross table

picture_tag

fk_picture_id | fk_tag_id
1             | 100
1             | 101
2             | 101
2             | 102

For a better understanding of the datasets

Picture   | Tagname
Picture 1 | Germany & iPhone
Picture 2 | United States & iPhone
Now I can use the following statement
SELECT *
FROM picture
INNER JOIN (
SELECT fk_picture_id
FROM picture_tag
WHERE fk_tag_id IN (100, 101)
GROUP BY fk_picture_id
HAVING COUNT(fk_tag_id) = 2
) AS picture_tag
ON picture.id = picture_tag.fk_picture_id;
The only thing I need to do before the query is collect the IDs of the tags I want to search for and put the number of tags into the HAVING COUNT clause.
If someone needs the example data, here are the SQL statements for the tables and data:
create table picture (
id integer,
name char(100)
);
create table tag (
id integer,
tag char(100)
);
create table picture_tag (
fk_picture_id integer,
fk_tag_id integer
);
insert into picture values (1, 'Picture 1');
insert into picture values (2, 'Picture 2');
insert into tag values (100, 'Germany');
insert into tag values (101, 'iphone');
insert into tag values (102, 'United States');
insert into picture_tag values (1, 100);
insert into picture_tag values (1, 101);
insert into picture_tag values (2, 101);
insert into picture_tag values (2, 102);
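Running the final query against this sample data can be sketched with Python's sqlite3 module; only Picture 1 carries both tag 100 (Germany) and tag 101 (iPhone):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE picture (id INTEGER, name TEXT);
CREATE TABLE tag (id INTEGER, tag TEXT);
CREATE TABLE picture_tag (fk_picture_id INTEGER, fk_tag_id INTEGER);
INSERT INTO picture VALUES (1, 'Picture 1'), (2, 'Picture 2');
INSERT INTO tag VALUES (100, 'Germany'), (101, 'iphone'), (102, 'United States');
INSERT INTO picture_tag VALUES (1, 100), (1, 101), (2, 101), (2, 102);
""")

# Keep pictures that have ALL of the requested tags: the inner query
# counts matching tags per picture and demands the full tag count.
rows = conn.execute("""
SELECT picture.id, picture.name
FROM picture
INNER JOIN (
  SELECT fk_picture_id
  FROM picture_tag
  WHERE fk_tag_id IN (100, 101)
  GROUP BY fk_picture_id
  HAVING COUNT(fk_tag_id) = 2
) AS pt ON picture.id = pt.fk_picture_id;
""").fetchall()
print(rows)  # Picture 1 is the only picture tagged with both 100 and 101
```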

Need help linking an ID column with another ID column for an insert

I'm generating a dataset from a stored procedure. There's an ID column from a different database that I need to link to my dataset. Going from one database to the other isn't the issue. The problem is I'm having trouble linking an ID column that's in one database and not the other.
There's a common ID between the two columns: CustomerEventID. I know that'll be my link, but here's the issue. The CustomerEventID is a unique ID for whenever a customer purchases something in a single encounter.
The column I need to pull in is more granular: it creates a unique ID (CustomerPurchaseID) for each item purchased in the encounter. However, I only need the CustomerPurchaseID for the last item purchased in that encounter. Unfortunately there's no timestamp associated with the CustomerPurchaseID.
It's basically like the CustomerEventID is a unique ID for a customer's receipt, whereas the CustomerPurchaseID is a unique ID for each item purchased on that receipt.
What would be the best way to pull the last CustomerPurchaseID from the CustomerEventID (I only need the last item on the receipt in my dataset)? I'll be taking my stored procedure with the dataset (from database A), using an SSIS package to put the dataset into a table on database B, then inserting the CustomerPurchaseID into that table.
I'm not sure if it helps, but here's the query from the stored procedure that will be sent to the other database (the process will run every 2 weeks to send it to Database B):
SELECT
ce.CustomerEventID,
ce.CustomerName,
ce.CustomerPhone,
ce.CustomerEventDate
FROM
CustomerEvent ce
WHERE
DATEDIFF(d, ce.CustomerEventDate, GETDATE()) < 14
Thanks for taking the time to read this wall of text. :)
If the CustomerPurchaseID field is increasing (as you mentioned), then you can ORDER BY that field descending while picking up the row. This can be done with a sub-query in the parent query, or with an OUTER APPLY or CROSS APPLY if you need all the fields from the CustomerPurchase table. See the example below.
declare @customerEvent table(CustomerEventID int not null primary key identity
, EventDate datetime)
declare @customerPurchase table(CustomerPurchaseID int not null primary key identity
, CustomerEventID int, ItemID varchar(100))
insert into @customerEvent(EventDate)
values ('2018-01-01'), ('2018-01-02'), ('2018-01-03'), ('2018-01-04')
insert into @customerPurchase(CustomerEventID, ItemID)
values (1, 1), (1, 2), (1, 3)
, (2, 3), (2, 4), (2, 10)
, (3, 1), (3, 2)
, (4, 1)
-- if you want all the fields from the CustomerPurchase table
select e.CustomerEventID
, op.CustomerPurchaseID
from @customerEvent as e
outer apply (select top 1 p.* from @customerPurchase as p where p.CustomerEventID = e.CustomerEventID
order by CustomerPurchaseID desc) as op
-- if you want only the last CustomerPurchaseID from the CustomerPurchase table
select e.CustomerEventID
, (select top 1 CustomerPurchaseID from @customerPurchase as p where p.CustomerEventID = e.CustomerEventID
order by CustomerPurchaseID desc)
as LastCustomerPurchaseID
from @customerEvent as e
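The same idea translates to a portable correlated subquery, with LIMIT 1 standing in for TOP 1. A Python/sqlite3 sketch (table names shortened from the answer's table variables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customerEvent (CustomerEventID INTEGER PRIMARY KEY, EventDate TEXT);
CREATE TABLE customerPurchase (CustomerPurchaseID INTEGER PRIMARY KEY,
                               CustomerEventID INTEGER, ItemID TEXT);
INSERT INTO customerEvent VALUES
  (1,'2018-01-01'), (2,'2018-01-02'), (3,'2018-01-03'), (4,'2018-01-04');
INSERT INTO customerPurchase (CustomerEventID, ItemID) VALUES
  (1,'1'),(1,'2'),(1,'3'), (2,'3'),(2,'4'),(2,'10'), (3,'1'),(3,'2'), (4,'1');
""")

# For each event, the correlated subquery picks the highest (i.e. last)
# CustomerPurchaseID belonging to that event.
rows = conn.execute("""
SELECT e.CustomerEventID,
       (SELECT p.CustomerPurchaseID
        FROM customerPurchase p
        WHERE p.CustomerEventID = e.CustomerEventID
        ORDER BY p.CustomerPurchaseID DESC
        LIMIT 1) AS LastCustomerPurchaseID
FROM customerEvent e
ORDER BY e.CustomerEventID;
""").fetchall()
print(rows)  # last purchase id per event
```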

Find all rows with the same exact relations as provided in another table

Given these tables:
Table: Test
Columns:
testID int PK
name nvarchar(128) UNIQUE NOT NULL
Table: [Test-Inputs]
Columns
inputsTableName nvarchar(128) UNIQUE PK
testID int PK FK
Temporary Table: ##TestSearchParams
Columns:
inputsTableName nvarchar(128) UNIQUE NOT NULL
I need to find Tests that have entries in Test-Inputs with inputsTableNames matching EXACTLY ALL of the entries in ##TestSearchParams; the resulting test's relationships must be exactly the ones listed in ##TestSearchParams.
Essentially I am finding tests with ONLY the given relationships, no more, no less. I am matching names with LIKE and wildcards, but that is a sidenote that I believe I can solve after the core logic is there for exact matching.
This is my current query:
Select *
From Tests As B
Where B.testID In (
Select ti
From (
Select (
Select Count(inputsTableName)
From [Test-Inputs]
Where [Test-Inputs].testID = B.testID
) - Count(Distinct i1) As delta,
ti
From (
Select [Test-Inputs].inputsTableName As i1,
[Test-Inputs].testID As ti
From ##TestSearchParams
Join [Test-Inputs]
On [Test-Inputs].inputsTableName Like ##TestSearchParams.inputsTableName
And B.testID = [Test-Inputs].testID
) As A
Group By ti
) As D
Where D.delta = 0
);
The current problem is that this seems to retrieve Tests with a match to ANY of the entries in ##TestSearchParams. I have tried several other queries before this, with varying levels of success. I have working queries for finding tests that match any of the parameters, all of the parameters, and none of the parameters -- I just can't get this query working.
Here are some sample table values:
Tests
1, Test1
2, Test2
3, Test3
[Test-Inputs]
Table1, 1
Table2, 2
Table1, 3
Table2, 3
TestSearchParams
Table1
Table2
The given values should only return (3, Test3)
Here's a possible solution that works by getting the complete set of TestInputs for each record in Tests, left-joining to the set of search parameters, and then aggregating the results by test and making two observations:
First, if a record from Tests includes a TestInput that is not among the search parameters, then that record must be excluded from the result set. We can check this by seeing if there is any case in which the left-join described above did not produce a match in the search parameters table.
Second, if a record from Tests satisfies the first condition, then we know that it doesn't have any superfluous TestInput records, so the only problem it could have is if there exists a search parameter that is not among its TestInputs. If that is so, then the number of records we've aggregated for that Test will be less than the total number of search parameters.
I have made the assumption here that you don't have Tests records with duplicate TestInputs, and that you likewise don't use duplicate search parameters. If those assumptions are not valid then this becomes more complicated. But if they are, then this ought to work:
declare @Tests table (testID int, [name] nvarchar(128));
declare @TestInputs table (testID int, inputsTableName nvarchar(128));
declare @TestSearchParams table (inputsTableName nvarchar(128));
-- Sample data.
--
-- testID 1 has only a subset of the search parameters.
-- testID 2 matches the search parameters exactly.
-- testID 3 has a superset of the search parameters.
--
-- Therefore the result set should include testID 2 only.
insert @Tests values
(1, 'Table A'),
(2, 'Table B'),
(3, 'Table C');
insert @TestInputs values
(1, 'X'),
(2, 'X'),
(2, 'Y'),
(3, 'X'),
(3, 'Y'),
(3, 'Z');
insert @TestSearchParams values
('X'),
('Y');
declare @ParamCount int;
select @ParamCount = count(1) from @TestSearchParams;
select
Tests.testID,
Tests.[name]
from
@Tests Tests
inner join @TestInputs Inputs on Tests.testID = Inputs.testID
left join @TestSearchParams Search on Inputs.inputsTableName = Search.inputsTableName
group by
Tests.testID,
Tests.[name]
having
-- If a group includes any record where Search.inputsTableName is null, it means that
-- the record in Tests has a TestInput that is not among the search parameters.
sum(case when Search.inputsTableName is null then 1 else 0 end) = 0 and
-- If a group includes fewer records than there are search parameters, it means that
-- there exists some parameter that was not found among the Tests record's TestInputs.
count(1) = @ParamCount;
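The grouping logic ports directly to other engines. Here is a Python/sqlite3 sketch of the same HAVING conditions, with plain tables standing in for the table variables and the parameter count inlined as a subquery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tests (testID INTEGER, name TEXT);
CREATE TABLE TestInputs (testID INTEGER, inputsTableName TEXT);
CREATE TABLE TestSearchParams (inputsTableName TEXT);
-- testID 1 has a subset, 2 an exact match, 3 a superset of the parameters.
INSERT INTO Tests VALUES (1,'Table A'), (2,'Table B'), (3,'Table C');
INSERT INTO TestInputs VALUES (1,'X'), (2,'X'), (2,'Y'), (3,'X'), (3,'Y'), (3,'Z');
INSERT INTO TestSearchParams VALUES ('X'), ('Y');
""")

# First HAVING term rejects tests with an input outside the parameters
# (unmatched left join); second term rejects tests missing a parameter.
rows = conn.execute("""
SELECT t.testID, t.name
FROM Tests t
JOIN TestInputs i ON t.testID = i.testID
LEFT JOIN TestSearchParams s ON i.inputsTableName = s.inputsTableName
GROUP BY t.testID, t.name
HAVING SUM(CASE WHEN s.inputsTableName IS NULL THEN 1 ELSE 0 END) = 0
   AND COUNT(*) = (SELECT COUNT(*) FROM TestSearchParams);
""").fetchall()
print(rows)  # only the exact match survives
```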