Testing equality of referencing rows - sql

I have a number of tables that follow this rather common pattern: A <-->> B. I would like to find the pairs of matching rows in table A where certain columns have equal values and also have referencing rows in B where certain columns have equal values. In other words, a pair of rows (R, S) in A matches, iff for given sets of columns {a1, a2, …, an} in A and {b1, b2, …, bn} in B:
We have R.a1 = S.a1, R.a2 = S.a2, …, R.an = S.an.
For every row T in B that references R, there exists a row U in B that references S s.t. T.b1 = U.b1, T.b2 = U.b2, …, T.bn = U.bn.
(R, S) matches iff (S, R) matches.
(I'm not very familiar with relational algebra, so my definition above might not follow any convention.)
The approach that I came up with was:
Find pairs (R, S) that have matching columns.
See if there's an equal number of (any) R's and S's referencing rows in B.
For each row in B find a matching row, group by the referencing row in A and count. Check that there are as many matching rows as referencing rows.
However, the query that I wrote (below) for steps 2 and 3, to find matching rows in B, is quite complex. Is there a better solution?
-- Tables similar to those that I have.
CREATE TABLE a (
id INTEGER PRIMARY KEY,
data TEXT
);
CREATE TABLE b (
id INTEGER PRIMARY KEY,
a_id INTEGER REFERENCES a (id),
data TEXT
);
SELECT DISTINCT dup.lhs_parent_id, dup.rhs_parent_id
FROM (
SELECT DISTINCT
MIN(lhs.a_id, rhs.a_id) AS lhs_parent_id, -- Normalize.
MAX(lhs.a_id, rhs.a_id) AS rhs_parent_id,
COUNT(*) AS count
FROM b lhs
INNER JOIN b rhs USING (data)
WHERE NOT (lhs.id = rhs.id OR lhs.a_id = rhs.a_id) -- Remove self-matching rows and duplicate values with the same parent.
GROUP BY lhs.a_id, rhs.a_id
) dup
INNER JOIN ( -- Check that lhs has the same number of rows.
SELECT
a_id AS parent_id,
COUNT(*) AS count
FROM b
GROUP BY a_id
) lhs_ct ON (
dup.lhs_parent_id = lhs_ct.parent_id AND
dup.count = lhs_ct.count
)
INNER JOIN ( -- Check that rhs has the same number of rows.
SELECT
a_id AS parent_id,
COUNT(*) AS count
FROM b
GROUP BY a_id
) rhs_ct ON (
dup.rhs_parent_id = rhs_ct.parent_id AND
dup.count = rhs_ct.count
);
-- Test data.
-- Expected query result is three rows with values (1, 2), (1, 3) and (2, 3) for a_id,
-- since the first three rows (with values 'row 1', 'row 2' and 'row 3')
-- have referencing rows, each of which has a matching pair. The fourth row
-- ('row 4') only has one referencing row with the value 'foo', so it doesn't have a
-- pair for the referenced rows with the value 'bar'.
INSERT INTO a (id, data) VALUES
(1, 'row 1'),
(2, 'row 2'),
(3, 'row 3'),
(4, 'row 4');
INSERT INTO b (id, a_id, data) VALUES
(1, 1, 'foo'),
(2, 1, 'bar'),
(3, 2, 'foo'),
(4, 2, 'bar'),
(5, 3, 'foo'),
(6, 3, 'bar'),
(7, 4, 'foo');
I'm using SQLite.

To find matching and differing rows, it is easier to use INTERSECT and MINUS operations than joins (SQLite spells MINUS as EXCEPT).
But when only one field is actually used in the comparison, a JOIN solution looks better:
Select B1.A_Id, B2.A_Id
From (
Select Data, A_Id, Count(Id) A_Count
From B
Group By Data, A_Id
) b1
inner join (
Select Data, A_Id, Count(Id) a_count
From B Group By Data, A_Id
) b2 on b1.data = b2.data and b1.a_count = b2.a_count and b1.a_id <> b2.a_id
As I understand it, you need to find the pairs of different a_id values that have the same data values and the same count for each data value.
My script returns each possible pair in both directions, which leaves room for optimization with SQLite-specific syntax.
Result example:
{1,2}, {1,3}, {2,1}, {2,3}, {3,2}, {3,1}
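The definition in the question ("for every referencing row T of R there exists a referencing row U of S with equal columns", in both directions) can also be written directly with nested NOT EXISTS. The following is only a sketch, run through Python's sqlite3 module against the question's test data. It assumes that `data` is the only compared column in B and it leaves out the comparison of columns in A, since the expected result in the question pairs rows whose `a.data` values differ. Note that rows of A with no referencing rows at all would match each other vacuously under this definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, data TEXT);
CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER REFERENCES a (id), data TEXT);
INSERT INTO a (id, data) VALUES (1, 'row 1'), (2, 'row 2'), (3, 'row 3'), (4, 'row 4');
INSERT INTO b (id, a_id, data) VALUES
  (1, 1, 'foo'), (2, 1, 'bar'), (3, 2, 'foo'), (4, 2, 'bar'),
  (5, 3, 'foo'), (6, 3, 'bar'), (7, 4, 'foo');
""")

MATCHING_PAIRS = """
SELECT r.id, s.id
FROM a AS r
JOIN a AS s ON r.id < s.id           -- report each unordered pair once
WHERE NOT EXISTS (                    -- no row of r lacks a partner in s
        SELECT 1 FROM b AS t
        WHERE t.a_id = r.id
          AND NOT EXISTS (SELECT 1 FROM b AS u
                          WHERE u.a_id = s.id AND u.data = t.data))
  AND NOT EXISTS (                    -- and no row of s lacks a partner in r
        SELECT 1 FROM b AS t
        WHERE t.a_id = s.id
          AND NOT EXISTS (SELECT 1 FROM b AS u
                          WHERE u.a_id = r.id AND u.data = t.data))
ORDER BY r.id, s.id
"""

pairs = conn.execute(MATCHING_PAIRS).fetchall()
print(pairs)  # [(1, 2), (1, 3), (2, 3)]
```

Row 4 drops out because rows 1, 2 and 3 each have a 'bar' row that has no partner among row 4's referencing rows. This compares sets of values, so it ignores how many times a value repeats under one parent.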

Related

Selecting X amount of rows from one table depending on value of column from another joined table

I am trying to join several tables. To simplify the situation, there is a table called Boxes which has a foreign key column for another table, Requests. This means that with a simple join I can get all the boxes that can be used to fulfill a request. But the Requests table also has a column called BoxCount which limits the number of boxes that is needed.
Is there a way to structure the query in such a way that when I join the two tables, I will only get the number of rows from Boxes that is specified in the BoxCount column of the given Request, rather than all of the rows from Boxes that have a matching foreign key?
Script to initialize sample data:
CREATE TABLE Requests (
Id int NOT NULL PRIMARY KEY,
BoxCount Int NOT NULL);
CREATE TABLE Boxes (
Id int NOT NULL PRIMARY KEY,
Label varchar,
RequestId INT FOREIGN KEY REFERENCES Requests(Id));
INSERT INTO Requests (Id, BoxCount)
VALUES
(1, 2),
(2, 3);
INSERT INTO Boxes (Id, Label, RequestId)
VALUES
(1, 'A', 1),
(2, 'B', 1),
(3, 'C', 1),
(4, 'D', 2),
(5, 'E', 2),
(6, 'F', 2),
(7, 'G', 2);
So, for example, when the hypothetical query is run, it should return boxes A and B (because the first Request only needs 2 boxes), but not C. Similarly, it should include boxes D, E and F, but not box G, because the second request only requires 3 boxes.
Here is another approach using ROW_NUMBER, a common and useful technique that every SQL writer should master. The idea is to create a sequential number for all boxes within a request and compare it to the box count for filtering.
with boxord as (select *,
ROW_NUMBER() OVER (PARTITION BY RequestId ORDER BY Id) as rno
from dbo.Boxes
)
select req.*, boxord.Label, boxord.rno
from dbo.Requests as req inner join boxord on req.Id = boxord.RequestId
where req.BoxCount >= boxord.rno
order by req.Id, boxord.rno
;
The INNER JOIN keyword selects records that have matching values in both tables:
SELECT (cols) FROM Boxes
INNER JOIN Requests ON Boxes.RequestId = Requests.Id
WHERE Requests.BoxCount = (value)
select r.id,
r.boxcount,
b.id,
b.label
from requests r
cross apply (
select top (r.BoxCount)
id, label
from boxes
where requestid = r.id
order by id
) b;
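For engines that have neither window functions nor CROSS APPLY, the same per-request ranking can be approximated with a correlated COUNT. The sketch below is not taken from the answers above; it runs the question's sample data through Python's sqlite3 module, and the alias `earlier` is introduced here purely for illustration. It assumes boxes are ranked by `Id` within each request, as in the ROW_NUMBER answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Requests (Id INTEGER NOT NULL PRIMARY KEY, BoxCount INTEGER NOT NULL);
CREATE TABLE Boxes (
    Id INTEGER NOT NULL PRIMARY KEY,
    Label TEXT,
    RequestId INTEGER REFERENCES Requests (Id));
INSERT INTO Requests (Id, BoxCount) VALUES (1, 2), (2, 3);
INSERT INTO Boxes (Id, Label, RequestId) VALUES
  (1, 'A', 1), (2, 'B', 1), (3, 'C', 1),
  (4, 'D', 2), (5, 'E', 2), (6, 'F', 2), (7, 'G', 2);
""")

LIMITED_BOXES = """
SELECT r.Id, b.Label
FROM Requests AS r
JOIN Boxes AS b ON b.RequestId = r.Id
WHERE (SELECT COUNT(*)                 -- position of b within its request
       FROM Boxes AS earlier
       WHERE earlier.RequestId = b.RequestId
         AND earlier.Id <= b.Id) <= r.BoxCount
ORDER BY r.Id, b.Id
"""

rows = conn.execute(LIMITED_BOXES).fetchall()
print(rows)  # [(1, 'A'), (1, 'B'), (2, 'D'), (2, 'E'), (2, 'F')]
```

The correlated subquery is evaluated once per box, so on large tables the window-function or CROSS APPLY versions are likely to be the better choice where they are available.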

Query to get count of distinct items in groupings

I have a table that stores created grouping for items from another table like this:
table1
table2
So giving the above, I want to write a query that returns the count of items from table1 that a grouping has been created for.
It may sound like the query below would do, but that is not what I'm looking for: the groups have to be created manually for them to appear in table2, so there may be an item in table1 that doesn't exist in table2 because its grouping hasn't been created (i.e. id: 555).
SELECT count(id)
FROM table1
WHERE group IS NOT NULL
The above will return 4 but I need something that looks at table2 and returns 3 which is count of items from table1 whose group exists in the category column of table2.
My real table for this can be pretty large, up to 100k+ rows, so I don't think it is efficient to check one by one whether each group string from table1 exists in table2, as that would probably take forever to run. Or is that the only viable solution?
PS: tried to use table markdown but I must have screwed up somehow
PPS: the categories column is not of JSON type, it's just a string
Not sure that this will be faster, but you can prepare an aggregate of the existing categories. Something like this (you can also try set_union instead of array_agg with flatten and array_distinct):
SELECT array_distinct(flatten(array_agg(CAST(JSON_EXTRACT(categories, '$.x') as ARRAY(VARCHAR)))))
FROM table2
And check that group is in the result.
Assuming that table2 would not contain any groups in the array that are not there in table1, you can try the following:
WITH table1(id, "group", qty) AS (
SELECT *
FROM (VALUES (111, 'cups', 1),
(222, 'plates', 2),
(333, 'spoons', 5),
(444, null, 2),
(555, 'knives', 2))
),
table2(group_id, categories, count_inventory) as (
SELECT *
FROM (VALUES ('A1', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['cups', 'plates']]) AS JSON), 3),
('B1', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['cups']]) AS JSON), 1),
('C1', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['cups', 'spoons']]) AS JSON), 6),
('C4', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['spoons']]) AS JSON), 5)
))
SELECT reduce(
array_agg(CAST(json_extract(categories, '$.x') AS ARRAY(VARCHAR))),
array[],
(s, x) -> array_union(s, x),
x -> cardinality(x)
)
FROM table2 WHERE categories is not null;
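The idea shared by both answers is to build the set of distinct categories once and then test each item's group for membership, instead of scanning table2 per item. A minimal Python sketch of that logic follows; it reuses the sample rows from the query above and assumes that `categories` is a JSON string with an `x` array. The function name `grouped_item_count` is made up for this illustration.

```python
import json

# Sample rows copied from the VALUES lists in the query above.
table1 = [
    (111, "cups", 1),
    (222, "plates", 2),
    (333, "spoons", 5),
    (444, None, 2),
    (555, "knives", 2),
]
table2 = [
    ("A1", '{"x": ["cups", "plates"]}', 3),
    ("B1", '{"x": ["cups"]}', 1),
    ("C1", '{"x": ["cups", "spoons"]}', 6),
    ("C4", '{"x": ["spoons"]}', 5),
]


def grouped_item_count(items, groupings):
    """Count items whose group appears in at least one grouping."""
    existing = set()
    for _group_id, categories, _count in groupings:
        if categories is not None:
            # One pass over table2 collects every distinct category.
            existing.update(json.loads(categories)["x"])
    # Membership tests against a set are constant time on average.
    return sum(1 for _id, group, _qty in items if group in existing)


count = grouped_item_count(table1, table2)
print(count)  # 3 (cups, plates, spoons; 'knives' and the NULL group are skipped)
```

Unlike the cardinality-only query, this does not rely on table2 containing only groups that also exist in table1.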

select the row that's "before" a given row in some ordering

What's the idiomatic way to select a row that's identified as the one that's coming up before a row we are given?
An example to make this clear:
CREATE TABLE entry (x VARCHAR, i INTEGER);
ALTER TABLE entry ADD PRIMARY KEY (x, i);
INSERT INTO entry (x,i) VALUES ('a', 1);
INSERT INTO entry (x,i) VALUES ('a', 2);
INSERT INTO entry (x,i) VALUES ('b', 1);
Table 'entry' has a clear lexicographical ordering according to the natural ordering:
SELECT * FROM entry ORDER BY x, i
If I am given the b and 1 (i.e. the ('b', 1) row) how do I write a query that selects the row that's coming up before that? (i.e. the ('a', 2) row). The query should return an empty row set if given the "first" row (in the case above, the ('a', 1) row).
You can do this with order by and limit and a where clause:
select e.*
from entry e
where x < 'b' or (x = 'b' and i < 1)
order by x desc, i desc
limit 1;
Use a LAG Window function. (also LEAD when appropriate)
http://www.postgresql.org/docs/8.4/static/functions-window.html
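The ORDER BY ... LIMIT 1 approach can be tried out quickly with SQLite through Python's sqlite3 module. This is only a sketch: the helper name `previous_entry` is invented here, and the primary key is declared inline because SQLite does not support ALTER TABLE ... ADD PRIMARY KEY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (x VARCHAR, i INTEGER, PRIMARY KEY (x, i));
INSERT INTO entry (x, i) VALUES ('a', 1), ('a', 2), ('b', 1);
""")


def previous_entry(conn, x, i):
    """Return the row just before (x, i) in (x, i) order, or None."""
    return conn.execute(
        """
        SELECT x, i
        FROM entry
        WHERE x < :x OR (x = :x AND i < :i)   -- strictly before the given row
        ORDER BY x DESC, i DESC               -- closest predecessor first
        LIMIT 1
        """,
        {"x": x, "i": i},
    ).fetchone()


print(previous_entry(conn, "b", 1))  # ('a', 2)
print(previous_entry(conn, "a", 1))  # None
```

For the "first" row the query returns no rows, which `fetchone()` reports as `None`.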

Find rows with same ID and have a particular set of names

EDIT:
I have a table with 3 columns like so.
ID NAME REV
1 A 0
1 B 0
1 C 0
2 A 1
2 B 0
2 C 0
3 A 1
3 B 1
I want to find the ID which has a particular set of Names and for which the REV is the same
example:
Edit2: GBN's solution would have worked perfectly, but I do not have access to create new tables, so the added constraint is that no new tables can be created.
if input = A,B then output is 3
if input = A ,B,C then output is 1 and not 1,2 since the rev level differs in 2.
The simplest way is to compare a COUNT per ID with the number of elements in your list:
SELECT
ID
FROM
MyTable
WHERE
NAME IN ('A', 'B', 'C')
GROUP BY
ID
HAVING
COUNT(*) = 3;
Note: ORDER BY isn't needed and goes after the HAVING if needed
Edit, with question update. In MySQL, it's easier to use a separate table for search terms
DROP TABLE IF EXISTS gbn;
CREATE TABLE gbn (ID INT, `name` VARCHAR(100), REV INT);
INSERT gbn VALUES (1, 'A', 0);
INSERT gbn VALUES (1, 'B', 0);
INSERT gbn VALUES (1, 'C', 0);
INSERT gbn VALUES (2, 'A', 1);
INSERT gbn VALUES (2, 'B', 0);
INSERT gbn VALUES (2, 'C', 0);
INSERT gbn VALUES (3, 'A', 0);
INSERT gbn VALUES (3, 'B', 0);
DROP TABLE IF EXISTS gbn1;
CREATE TABLE gbn1 ( `name` VARCHAR(100));
INSERT gbn1 VALUES ('A');
INSERT gbn1 VALUES ('B');
SELECT
gbn.ID
FROM
gbn
LEFT JOIN
gbn1 ON gbn.`name` = gbn1.`name`
GROUP BY
gbn.ID
HAVING
COUNT(*) = (SELECT COUNT(*) FROM gbn1)
AND MIN(gbn.REV) = MAX(gbn.REV);
INSERT gbn1 VALUES ('C');
SELECT
gbn.ID
FROM
gbn
LEFT JOIN
gbn1 ON gbn.`name` = gbn1.`name`
GROUP BY
gbn.ID
HAVING
COUNT(*) = (SELECT COUNT(*) FROM gbn1)
AND MIN(gbn.REV) = MAX(gbn.REV);
Edit 2, without extra table, use a derived (inline) table:
SELECT
gbn.ID
FROM
gbn
LEFT JOIN
(SELECT 'A' AS `name`
UNION ALL SELECT 'B'
UNION ALL SELECT 'C'
) gbn1 ON gbn.`name` = gbn1.`name`
GROUP BY
gbn.ID
HAVING
COUNT(*) = 3 -- matches number of elements in gbn1 derived table
AND MIN(gbn.REV) = MAX(gbn.REV);
Similar to gbn, but allowing for the possibility of duplicate ID/Name combinations:
SELECT ID
FROM MyTable
WHERE NAME IN ('A', 'B', 'C')
GROUP BY ID
HAVING COUNT(DISTINCT NAME) = 3;
OKAY! I solved my problem! I modified GBN's logic to do it without a search table, using the IN clause.
One flaw with checking MAX(REV) = MIN(REV) is this: if I have data like so:
ID NAME REV
1 A 0
1 B 1
1 A 1
then when I use a query like
SELECT ID FROM MyTable
WHERE NAME IN ('A', 'B')
GROUP BY ID
HAVING COUNT(*) = 2
AND MIN(REV) = MAX(REV)
it will not show me ID 1, as the min and max are different and the count is 3.
So I simply add another column to the GROUP BY,
so the final query is
SELECT ID FROM MyTable
WHERE NAME IN ('A', 'B')
GROUP BY ID, REV
HAVING COUNT(*) = 2
AND MIN(REV) = MAX(REV)
Thanks,to all that helped. !
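The expected outputs in the question (input A,B gives 3, input A,B,C gives 1) suggest an exact-set match: an ID qualifies when it has exactly the given names and nothing else, all at one REV. A sketch of that reading, run with SQLite through Python's sqlite3 module, is shown below. The helper name `ids_with_exact_names` is invented here, and the sketch assumes the input names are distinct and that each (ID, NAME) combination appears at most once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MyTable (ID INTEGER, NAME TEXT, REV INTEGER);
INSERT INTO MyTable (ID, NAME, REV) VALUES
  (1, 'A', 0), (1, 'B', 0), (1, 'C', 0),
  (2, 'A', 1), (2, 'B', 0), (2, 'C', 0),
  (3, 'A', 1), (3, 'B', 1);
""")


def ids_with_exact_names(conn, names):
    """IDs whose rows are exactly `names`, all sharing a single REV."""
    marks = ", ".join("?" for _ in names)  # one placeholder per name
    sql = f"""
        SELECT ID
        FROM MyTable
        GROUP BY ID
        HAVING COUNT(*) = ?                                   -- no extra names
           AND SUM(CASE WHEN NAME IN ({marks}) THEN 1 ELSE 0 END) = ?  -- all wanted names
           AND MIN(REV) = MAX(REV)                            -- single REV
        ORDER BY ID
    """
    n = len(names)
    return [row[0] for row in conn.execute(sql, [n, *names, n])]


print(ids_with_exact_names(conn, ["A", "B"]))       # [3]
print(ids_with_exact_names(conn, ["A", "B", "C"]))  # [1]
```

If an ID should also qualify when it has additional names beyond the input, drop the `COUNT(*) = ?` condition and filter the REV check to the wanted names instead.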

How to select only one full row per group in a "group by" query?

In SQL Server, I have a table where a column A stores some data. This data can contain duplicates (ie. two or more rows will have the same value for the column A).
I can easily find the duplicates by doing:
select A, count(A) as CountDuplicates
from TableName
group by A having (count(A) > 1)
Now, I want to retrieve the values of other columns, let's say B and C. Of course, those B and C values can be different even for the rows sharing the same A value, but it doesn't matter for me. I just want any B value and any C one, the first, the last or the random one.
If I had a small table and one or two columns to retrieve, I would do something like:
select A, count(A) as CountDuplicates, (
select top 1 child.B from TableName as child where child.A = base.A
) as B
from TableName as base group by A having (count(A) > 1)
The problem is that I have much more rows to get, and the table is quite big, so having several children selects will have a high performance cost.
So, is there a less ugly pure SQL solution to do this?
Not sure if my question is clear enough, so I give an example based on AdventureWorks database. Let's say I want to list available States, and for each State, get its code, a city (any city) and an address (any address). The easiest, and the most inefficient way to do it would be:
var q = from c in data.StateProvinces select new { c.StateProvinceCode, c.Addresses.First().City, c.Addresses.First().AddressLine1 };
in LINQ-to-SQL, and it will run two selects for each of the 181 States, so 363 selects in total. In my case, I am looking for a way to use at most 182 selects.
The ROW_NUMBER function in a CTE is the way to do this. For example:
DECLARE @mytab TABLE (A INT, B INT, C INT)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 1, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 1, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 2, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 3, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (2, 2, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 3)
;WITH numbered AS
(
SELECT *, rn=ROW_NUMBER() OVER (PARTITION BY A ORDER BY B, C)
FROM @mytab AS m
)
SELECT *
FROM numbered
WHERE rn=1
As I mentioned in my comment to HLGEM and Philip Kelley, their simple use of an aggregate function does not necessarily return one "solid" record for each A group; instead, it may return column values from many separate rows, all stitched together as if they were a single record. For example, if this were a PERSON table, with the PersonID being the "A" column, and distinct contact records (say, Home and Work), you might wind up returning the person's home city, but their office ZIP code -- and that's clearly asking for trouble.
The use of the ROW_NUMBER, in conjunction with a CTE here, is a little difficult to get used to at first because the syntax is awkward. But it's becoming a pretty common pattern, so it's good to get to know it.
In my sample I've defined a CTE that tacks an extra column rn (standing for "row number") onto the table, numbering the rows within each partition of the A column. A SELECT on that result, filtering to only those having a row number of 1 (i.e., the first record found for that value of A), returns a "solid" record for each A group -- in my example above, you'd be certain to get either the Work or Home address, but not elements of both mixed together.
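The difference between a "stitched" and a "solid" record can be seen on a tiny data set. The sketch below uses SQLite through Python's sqlite3 module rather than SQL Server, and it assumes SQLite 3.25 or later for window function support. The table `t` and its three rows are made up for illustration; they are not from the answer's sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (A INTEGER, B INTEGER, C INTEGER);
-- Hypothetical data: A = 1 is duplicated, A = 2 is not.
INSERT INTO t (A, B, C) VALUES (1, 1, 2), (1, 2, 1), (2, 5, 5);
""")

# Aggregates pick B and C independently, possibly from different rows.
stitched = conn.execute("""
    SELECT A, COUNT(*), MIN(B), MIN(C)
    FROM t
    GROUP BY A
    HAVING COUNT(*) > 1
""").fetchall()

# ROW_NUMBER keeps one whole row per A; COUNT(*) OVER supplies the group size.
solid = conn.execute("""
    WITH numbered AS (
        SELECT A, B, C,
               ROW_NUMBER() OVER (PARTITION BY A ORDER BY B, C) AS rn,
               COUNT(*)     OVER (PARTITION BY A)               AS cnt
        FROM t
    )
    SELECT A, cnt, B, C
    FROM numbered
    WHERE rn = 1 AND cnt > 1
""").fetchall()

print(stitched)  # [(1, 2, 1, 1)]  B = 1, C = 1 is not a row in t
print(solid)     # [(1, 2, 1, 2)]  an actual row of t
```

The aggregate result combines B from the first row with C from the second, while the windowed version returns values that really occur together.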
It concerns me that you want any old value for fields b and c. If they are to be meaningless why are you returning them?
If it truly doesn't matter (and I honestly can't imagine a case where I would ever want this, but it's what you said) and the values for b and c don't even have to be from the same record, group by with the use of min or max is the way to go. It's more complicated if you want the values for a particular record for all fields.
select A, count(A) as CountDuplicates, min(B) as B , min(C) as C
from TableName as base
group by A
having (count(A) > 1)
You can do something like this if you have id as the primary key in your table:
select t.id, t.A, t.B, t.C, d.CountDuplicates
from TableName as t
inner join
(
select A, min(id) as id, count(A) as CountDuplicates
from TableName
group by A having (count(A) > 1)
) d on t.id = d.id