Classify records based on matching table - sql

I have two tables: ITEMS and MATCHING_ITEMS, as below:
ITEMS:
|---------------------|------------------|
| ID | Name |
|---------------------|------------------|
| 1 | A |
| 2 | B |
| 3 | C |
| 4 | D |
| 5 | E |
| 6 | F |
| 7 | G |
|---------------------|------------------|
MATCHING_ITEMS:
|---------------------|------------------|
| ID_1 | ID_2 |
|---------------------|------------------|
| 1 | 2 |
| 1 | 3 |
| 2 | 3 |
| 4 | 5 |
| 4 | 6 |
| 5 | 6 |
|---------------------|------------------|
The MATCHING_ITEMS table defines items that match each other and thus belong to the same group, i.e. items 1, 2, and 3 match each other and form one group, and the same goes for items 4, 5, and 6. Item 7 does not have a match, so it does not belong to any group.
I now need to add a 'Group' column on the ITEMS table which contains a unique integer for each group, so it would look as follows:
ITEMS:
|---------------------|------------------|------------------|
| ID | Name | Group |
|---------------------|------------------|------------------|
| 1 | A | 1 |
| 2 | B | 1 |
| 3 | C | 1 |
| 4 | D | 2 |
| 5 | E | 2 |
| 6 | F | 2 |
| 7 | G | NULL |
|---------------------|------------------|------------------|
So far I have been using a stored procedure to do this, looping over each line in the MATCHING_ITEMS table and updating the ITEMS table with a group value. The problem is that I eventually need to do this for a table containing millions of records, and the looping method is far too slow.
Is there a way that I can achieve this without using a loop?

If you have all pairs of matches in the matching table, then you can just use the minimum id to assign the group. For this:
select i.*,
(case when grp_id is not null
then dense_rank() over (order by grp_id)
end) as grouping
from items i left join
(select mi.id_1, least(mi.id_1, min(mi.id_2)) as grp_id
from matching_items mi
group by mi.id_1
) mi
on i.id = mi.id_1;
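Note: This works only if all pairs are in the matching items table, in both directions, e.g. (2, 1) as well as (1, 2). With each pair stored only once, as in the sample data, item 2 would get grp_id 2 and item 3 would get no group at all. Otherwise, you will need a recursive/hierarchical query to get all the pairs.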
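To illustrate the recursive-query route mentioned in the note, here is a minimal sketch that runs a recursive CTE through Python's built-in sqlite3 module. It is not the answerer's query: the helper name assign_groups, the CTE names (edges, reach, roots), and the reduced CHAIN sample with (1, 3) and (4, 6) left out are all assumptions made for illustration. The group number is computed with a correlated count rather than dense_rank(), because SQLite sorts NULLs first and dense_rank() would then number the ungrouped item.

```python
import sqlite3

# Hypothetical query: label each item with its connected component.
GROUP_QUERY = """
WITH RECURSIVE
edges(a, b) AS (                 -- treat every stored pair as undirected
    SELECT id_1, id_2 FROM matching_items
    UNION
    SELECT id_2, id_1 FROM matching_items
),
reach(id, member) AS (           -- every item reachable from each item
    SELECT a, a FROM edges
    UNION                        -- UNION drops repeated rows, so cycles terminate
    SELECT r.id, e.b FROM reach r JOIN edges e ON e.a = r.member
),
roots(id, root) AS (             -- the smallest id in a component labels it
    SELECT id, MIN(member) FROM reach GROUP BY id
)
SELECT i.id, i.name,
       CASE WHEN r.root IS NOT NULL
            THEN (SELECT COUNT(DISTINCT r2.root) FROM roots r2
                  WHERE r2.root <= r.root)
       END AS grp
FROM items i LEFT JOIN roots r ON r.id = i.id
ORDER BY i.id
"""


def assign_groups(items, matches):
    """Load the two tables into an in-memory database and return (id, name, grp) rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("CREATE TABLE matching_items (id_1 INTEGER, id_2 INTEGER)")
        conn.executemany("INSERT INTO items VALUES (?, ?)", items)
        conn.executemany("INSERT INTO matching_items VALUES (?, ?)", matches)
        return conn.execute(GROUP_QUERY).fetchall()
    finally:
        conn.close()


ITEMS = [(i, name) for i, name in enumerate("ABCDEFG", start=1)]
# Assumed sample: the pairs (1, 3) and (4, 6) are deliberately missing.
CHAIN = [(1, 2), (2, 3), (4, 5), (5, 6)]

if __name__ == "__main__":
    for row in assign_groups(ITEMS, CHAIN):
        print(row)
```

Note that reach holds one row per (item, reachable item) pair, so it grows quadratically with the size of a group. That is probably fine for many small groups, but for millions of rows with large groups a different connected-components approach may be needed.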

You could use min and max at first, then dense_rank to assign group numbers:
select id, name, dense_rank() over (order by mn, mx) grp
from (
select distinct id, name,
min(id_1) over (partition by name) mn,
max(id_2) over (partition by name) mx
from items left join matching_items on id in (id_1, id_2))
order by id

The pairs 2,3 and 5,6 in the MATCHING_ITEMS table seem redundant, as they could be derived (if I am reading your question right).
Here is how I did it. I just reused id_1 from your example as the group number (the column is named ID in my test table):
create table
items (
ID number,
name varchar2 (2)
);
insert into items values (1, 'A');
insert into items values (2, 'B');
insert into items values (3, 'C');
insert into items values (4, 'D');
insert into items values (5, 'E');
insert into items values (6, 'F');
insert into items values (7, 'G');
create table
matching_items (
ID number,
ID_2 number
);
insert into matching_items values (1, 2);
insert into matching_items values (1, 3);
insert into matching_items values (2, 3);
insert into matching_items values (4, 5);
insert into matching_items values (4, 6);
insert into matching_items values (5, 6);
with new_grp as
(
select id, id_2, id as group_no
from matching_items
where id in (select id from items)
and id not in (select id_2 from matching_items)),
assign_grp as
(
select id, group_no
from new_grp
union
select id_2, group_no
from new_grp)
select items.id, name, group_no
from items left outer join assign_grp
on items.id = assign_grp.id;

Related

SQL group by and sum based on distinct value in other column (sum once if value in other column is duplicated)

I need help with a group-by query. My table looks like this:
CREATE MULTISET TABLE MY_TABLE (PERSON CHAR(1), ITEM CHAR(1), COST INT);
INSERT INTO MY_TABLE VALUES ('A', '1', 5);
INSERT INTO MY_TABLE VALUES ('A', '1', 5);
INSERT INTO MY_TABLE VALUES ('A', '2', 1);
INSERT INTO MY_TABLE VALUES ('B', '3', 0);
INSERT INTO MY_TABLE VALUES ('B', '4', 10);
INSERT INTO MY_TABLE VALUES ('B', '4', 10);
INSERT INTO MY_TABLE VALUES ('C', '5', 1);
INSERT INTO MY_TABLE VALUES ('C', '5', 1);
INSERT INTO MY_TABLE VALUES ('C', '5', 1);
+--------+------+------+
| PERSON | ITEM | COST |
+--------+------+------+
| A | 1 | 5 |
| A | 1 | 5 |
| A | 2 | 1 |
| B | 3 | 0 |
| B | 4 | 10 |
| B | 4 | 10 |
| C | 5 | 1 |
| C | 5 | 1 |
| C | 5 | 1 |
+--------+------+------+
I need to group items and costs by person, but in different ways. For each person, I need the number of unique items they have. Ex: Person A has two distinct items, item 1 and item 2. I can get this with COUNT(DISTINCT ITEM).
Then for each person, I need to sum the cost but only once per distinct item (for duplicate items, the cost is always the same). Ex: Person A has item 1 for $5, item 1 for $5, and item 2 for $1. Since this person has item 1 twice, I count the $5 once, and then add the $1 from item 2 for a total of $6. The output should look like this:
+--------+---------------------+------------------------+
| PERSON | ITEM_DISTINCT_COUNT | COST_DISTINCT_ITEM_SUM |
+--------+---------------------+------------------------+
| A | 2 | 6 |
| B | 2 | 10 |
| C | 1 | 1 |
+--------+---------------------+------------------------+
Is there an easy way to do this that performs well on a lot of rows?
SELECT PERSON
,COUNT(DISTINCT ITEM) ITEM_DISTINCT_COUNT
-- help with COST_DISTINCT_ITEM_SUM
FROM MY_TABLE
GROUP BY PERSON
You can make a subquery which gets the distinct values of item and cost for each person, and then aggregate over that:
SELECT PERSON,
COUNT(ITEM) AS ITEM_DISTINCT_COUNT,
SUM(COST) AS COST_DISTINCT_ITEM_SUM
FROM (
SELECT DISTINCT PERSON, ITEM, COST
FROM MY_TABLE
) M
GROUP BY PERSON
Output:
PERSON ITEM_DISTINCT_COUNT COST_DISTINCT_ITEM_SUM
A 2 6
B 2 10
C 1 1
I recommend two levels of aggregation:
select person, count(*) as num_items, sum(cost)
from (select person, item, avg(cost) as cost
from my_table t
group by person, item
) t
group by person;
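The two answers above agree on the sample data but diverge if the same item ever appears with two different costs. The sketch below checks both through Python's sqlite3 module. The helper summarize, the constant names, and the extra person D rows are assumptions added for illustration. MAX(cost) stands in for the answer's avg(cost) only to keep the sums as integers in SQLite.

```python
import sqlite3

# Sample rows from the question.
ROWS = [
    ("A", "1", 5), ("A", "1", 5), ("A", "2", 1),
    ("B", "3", 0), ("B", "4", 10), ("B", "4", 10),
    ("C", "5", 1), ("C", "5", 1), ("C", "5", 1),
]
# Hypothetical extra rows: one item recorded with two different costs.
INCONSISTENT = ROWS + [("D", "6", 2), ("D", "6", 3)]

DISTINCT_QUERY = """
SELECT person, COUNT(item), SUM(cost)
FROM (SELECT DISTINCT person, item, cost FROM my_table) m
GROUP BY person
ORDER BY person
"""

TWO_LEVEL_QUERY = """
SELECT person, COUNT(*), SUM(cost)
FROM (SELECT person, item, MAX(cost) AS cost   -- MAX swapped in for AVG
      FROM my_table
      GROUP BY person, item) t
GROUP BY person
ORDER BY person
"""


def summarize(rows, query):
    """Load rows into an in-memory my_table and run the given summary query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE my_table (person TEXT, item TEXT, cost INTEGER)")
        conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(summarize(ROWS, DISTINCT_QUERY))
    print(summarize(ROWS, TWO_LEVEL_QUERY))
    print(summarize(INCONSISTENT, DISTINCT_QUERY)[-1])
    print(summarize(INCONSISTENT, TWO_LEVEL_QUERY)[-1])
```

If the cost really is always the same for duplicate items, as the question states, either form should do. The two-level form still reports one row per item when that assumption is violated, while the DISTINCT form counts the item once per distinct cost.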

Find unique column value from records where another column has ALL of the values from a set

Let's say I have a table like this:
+----+-------+
| ID | Word |
+----+-------+
| 1 | a |
| 1 | dog |
| 1 | has |
| 2 | two |
| 2 | three |
| 2 | four |
| 2 | five |
| 3 | black |
| 3 | red |
+----+-------+
I want to find the unique ID value where there are records with that ID that have all of the Word values in a provided set. E.g. WHERE Word IN ('a', 'dog', 'has') would return ID value 1 but WHERE Word IN ('a', 'dog', 'has', 'black') would return NULL.
Is this possible?
Use group by and having:
select id
from t
where word in ('a', 'dog', 'has')
group by id
having count(*) = 3; -- the number of words in the list
If you're using SQL Server 2016 or higher, this seems like a good solution:
--Load test data
DECLARE @tbl as TABLE (id int, word varchar(40))
INSERT INTO @tbl
(id, word)
VALUES
(1, 'a'),(1,'dog'),(1,'cat'),(1,'has'),(1,'a'),(2,'two'),(2,'three'),(2,'four'),(2,'five'),(3,'black'),(3,'red')
--ACTUAL SOLUTION STARTS HERE
DECLARE @srch as VARCHAR(200)
SET @srch = 'a,dog,has'
SELECT id
FROM @tbl
WHERE word IN (SELECT value FROM STRING_SPLIT(@srch,','))
GROUP BY id
HAVING COUNT(DISTINCT word) = (SELECT COUNT(DISTINCT value) FROM STRING_SPLIT(@srch,','));
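The difference between count(*) and COUNT(DISTINCT word) in the two answers matters once an ID can repeat a word. Here is a small sketch using Python's sqlite3 module. The helper ids_with_all_words and the extra rows for id 4 (which has 'a' twice and 'dog', but not 'has') are assumptions added to show the false positive.

```python
import sqlite3

ROWS = [
    (1, "a"), (1, "dog"), (1, "has"),
    (2, "two"), (2, "three"), (2, "four"), (2, "five"),
    (3, "black"), (3, "red"),
    # Hypothetical extra rows: a duplicate word and a missing one.
    (4, "a"), (4, "a"), (4, "dog"),
]


def ids_with_all_words(rows, words, distinct=True):
    """Return the ids that have every word in `words`.

    distinct=False mimics HAVING COUNT(*) = n, which duplicate rows can fool.
    """
    wanted = sorted(set(words))
    placeholders = ", ".join("?" for _ in wanted)
    counter = "COUNT(DISTINCT word)" if distinct else "COUNT(*)"
    query = (
        f"SELECT id FROM t WHERE word IN ({placeholders}) "
        f"GROUP BY id HAVING {counter} = ? ORDER BY id"
    )
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (id INTEGER, word TEXT)")
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
        return [row[0] for row in conn.execute(query, wanted + [len(wanted)])]
    finally:
        conn.close()


if __name__ == "__main__":
    print(ids_with_all_words(ROWS, ["a", "dog", "has"]))
    print(ids_with_all_words(ROWS, ["a", "dog", "has", "black"]))
    print(ids_with_all_words(ROWS, ["a", "dog", "has"], distinct=False))
```

When no ID qualifies, the query returns no rows rather than a NULL row, so the caller has to treat an empty result as "not found".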

Postgresql - Preserve relative position when using distinct

I ran a query which returned a table like this.
d | e | f
---+-----+----
2 | 103 | C
6 | 201 | AB
1 | 102 | B
1 | 102 | B
1 | 102 | B
1 | 102 | B
1 | 102 | B
3 | 105 | E
3 | 105 | E
3 | 105 | E
What I want is to get distinct rows but in order. Basically I want this:
2 | 103 | C
6 | 201 | AB
1 | 102 | B
3 | 105 | E
I tried distinct and group by, but they do not always preserve the position (they did preserve it in some other cases that I had). Any idea how this can be done easily, or would one need to use other functionality like rank?
SQL tables represent unordered sets. There is no ordering, unless you have an explicit order by with a column or expression.
If you have such an ordering, you can do what you want using group by:
select d, e, f
from t
group by d, e, f
order by min(a); -- assuming a is the column that specifies the ordering
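As a quick check of the group by / order by min(a) idea, the sketch below loads the question's rows into SQLite through Python's sqlite3 module. The column a does not exist in the question. Here it is assumed to be an explicit position number assigned at insert time, and the helper name distinct_in_first_seen_order is likewise made up for illustration.

```python
import sqlite3

# Rows in the order shown in the question.
ROWS = [(2, 103, "C"), (6, 201, "AB")] + [(1, 102, "B")] * 5 + [(3, 105, "E")] * 3


def distinct_in_first_seen_order(rows):
    """Return distinct (d, e, f) rows ordered by the first position they occupy."""
    conn = sqlite3.connect(":memory:")
    try:
        # `a` is a hypothetical ordering column holding each row's position.
        conn.execute(
            "CREATE TABLE t (a INTEGER PRIMARY KEY, d INTEGER, e INTEGER, f TEXT)"
        )
        conn.executemany(
            "INSERT INTO t (a, d, e, f) VALUES (?, ?, ?, ?)",
            [(pos, *row) for pos, row in enumerate(rows, start=1)],
        )
        return conn.execute(
            "SELECT d, e, f FROM t GROUP BY d, e, f ORDER BY MIN(a)"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in distinct_in_first_seen_order(ROWS):
        print(row)
```

The ordering is only as reliable as the column behind it. If the rows come from an earlier query, that query needs to expose something to order by, such as a timestamp or sequence number.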
Use case when:
order by case when f='C' then 1 when f='AB' then 2
when f='B' then 3 when f='E' then 5 else null end
You can try to order by ctid column, which describes the physical location of a row, to identify a row.
The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even better a user-defined serial number, should be used to identify logical rows.
Use row_number as a window function to number the rows within each (d, e, f) group by ctid.
Then keep the rows with rn = 1 and order them by ctid.
CREATE TABLE T(
d int,
e int,
f varchar(5)
);
insert into t values (2,103, 'C');
insert into t values (6,201, 'AB');
insert into t values (1,102, 'B');
insert into t values (1,102, 'B');
insert into t values (1,102, 'B');
insert into t values (1,102, 'B');
insert into t values (1,102, 'B');
insert into t values (3,105, 'E');
insert into t values (3,105, 'E');
insert into t values (3,105, 'E');
Query 1:
select d,e,f
from (
select d,e,f,ctid,row_number() over(partition by d,e,f order by ctid) rn
FROM T
)t1
where rn = 1
order by ctid
Results:
| d | e | f |
|---|-----|----|
| 2 | 103 | C |
| 6 | 201 | AB |
| 1 | 102 | B |
| 3 | 105 | E |

Select distinct one field other first non empty or null

I have table
| Id | val |
| --- | ---- |
| 1 | null |
| 1 | qwe1 |
| 1 | qwe2 |
| 2 | null |
| 2 | qwe4 |
| 3 | qwe5 |
| 4 | qew6 |
| 4 | qwe7 |
| 5 | null |
| 5 | null |
Is there an easy way to select the distinct 'id' values, each with its first non-null 'val' value, or null if none exists? For example,
the result should be
| Id | val |
| --- | ---- |
| 1 | qwe1 |
| 2 | qwe4 |
| 3 | qwe5 |
| 4 | qew6 |
| 5 | null |
In your case a simple GROUP BY should be the solution:
SELECT Id
,MIN(val)
FROM dbo.mytable
GROUP BY Id
Whenever you use a GROUP BY, you have to apply an aggregate function to every selected column that is not listed in the GROUP BY.
If an Id has a value (val) other than NULL, this value will be returned.
If there are just NULLs for the Id, NULL will be returned.
As far as I understood (regarding your comment), this is exactly what you're trying to achieve.
If you always want to have "the first" value <> NULL, you'll need another sort criterion (like a timestamp column) and might be able to solve it with a window function.
If you want the first non-NULL value (where "first" is based on id), then MIN() doesn't quite do it. Window functions do:
select t.*
from (select t.*,
row_number() over (partition by id
order by (case when val is not null then 1 else 2 end),
id
) as seqnum
from t
) t
where seqnum = 1;
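To see where MIN() and a "first non-NULL" rule part ways, the sketch below runs both through Python's sqlite3 module (window functions need SQLite 3.25 or later). It assumes an explicit ordering column pid, as in the tab1 test table used further down this thread. The rows for id 6 and the helper run are hypothetical additions, chosen so that the first non-NULL value is not the alphabetically smallest one.

```python
import sqlite3

ROWS = [
    (1, 1, None), (2, 1, "qwe1"), (3, 1, "qwe2"),
    (4, 2, None), (5, 2, "qwe4"),
    (6, 3, "qwe5"),
    (7, 4, "qew6"), (8, 4, "qwe7"),
    (9, 5, None), (10, 5, None),
    # Hypothetical rows: the first non-NULL value sorts after the second.
    (11, 6, "zzz"), (12, 6, "aaa"),
]

FIRST_NON_NULL_QUERY = """
SELECT id, val
FROM (SELECT id, val,
             ROW_NUMBER() OVER (
                 PARTITION BY id
                 ORDER BY CASE WHEN val IS NOT NULL THEN 1 ELSE 2 END,
                          pid              -- assumed ordering column
             ) AS seqnum
      FROM tab1) ranked
WHERE seqnum = 1
ORDER BY id
"""

MIN_QUERY = "SELECT id, MIN(val) FROM tab1 GROUP BY id ORDER BY id"


def run(query, rows=ROWS):
    """Load rows into an in-memory tab1 and return the query result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE tab1 (pid INTEGER, id INTEGER, val TEXT)")
        conn.executemany("INSERT INTO tab1 VALUES (?, ?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(run(FIRST_NON_NULL_QUERY))
    print(run(MIN_QUERY))
```

On the question's own rows the two queries agree. They only differ for id 6, which is why a real ordering column matters if "first" is meant literally.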
Create the table (from SQL Fiddle):
CREATE TABLE tab1(pid integer, id integer, val varchar(25));
Insert dummy records:
insert into tab1
values (1, 1 , null),
(2, 1 , 'qwe1' ),
(3, 1 , 'qwe2'),
(4, 2 , null ),
(5, 2 , 'qwe4' ),
(6, 3 , 'qwe5' ),
(7, 4 , 'qew6' ),
(8, 4 , 'qwe7' ),
(9, 5 , null ),
(10, 5 , null );
Run the query below:
SELECT Id ,MIN(val) as val FROM tab1 GROUP BY Id;

Query with subqueries returning more than I want

I have the following tables (those tables contain many records but for sake of this example I reduced the contents only to the records I want to operate on).
Products
product_id | product_name
------------+--------------
1 | PRODUCT
Contracts
contract_id | version | status | product_id
-------------+---------+--------+------------
2 | 1 | 30 | 1
2 | 2 | 30 | 1
2 | 3 | 30 | 1
2 | 4 | 30 | 1
2 | 5 | 30 | 1
2 | 6 | 30 | 1
People
id | guid
----+------
3 | 123
9 | 456
Limits
id | type
----+------
4 | 12
5 | 14
Link_table
link_id | version | contract_id | object_type | function | obj_id
---------+---------+-------------+-------------+----------+--------
6 | 1 | 2 | XADL | ADLTYP | 4
7 | 2 | 2 | XADL | ADLTYP | 5
8 | 2 | 2 | BCBP | BCA010 | 123
10 | 3 | 2 | BCBP | BCA010 | 456
Here is the DDL for the aforementioned tables...
CREATE TABLE products (
product_id integer PRIMARY KEY,
product_name varchar(10) NOT NULL
);
CREATE TABLE contracts (
contract_id integer,
version integer,
status varchar(2) NOT NULL,
product_id integer NOT NULL REFERENCES products(product_id),
PRIMARY KEY (contract_id, version)
);
CREATE TABLE link_table (
link_id integer,
version integer,
contract_id integer NOT NULL,
object_type varchar(4) NOT NULL,
function varchar(6) NOT NULL,
obj_id integer NOT NULL,
PRIMARY KEY(link_id, version)
);
CREATE TABLE people (
id integer PRIMARY KEY,
guid integer,
CONSTRAINT person_guid UNIQUE(guid)
);
CREATE TABLE limits (
id integer PRIMARY KEY,
type varchar(2) NOT NULL
);
Now... My task is to select the latest version of the value for field type in limits table for the latest version of the value id in the people table. The table link_table decides what the latest version is. This data needs to be provided with fields contract_id, status, product_name.
I tried with the following query, unfortunately I receive two rows when I am supposed to receive only one with the latest value.
SELECT c.contract_id, status, product_name, type
FROM
contracts AS c
INNER JOIN
products AS p
ON c.product_id = p.product_id
INNER JOIN
link_table AS per
ON c.contract_id = per.contract_id
INNER JOIN
link_table AS ll
ON per.contract_id = ll.contract_id
INNER JOIN
people AS peop
ON per.obj_id = peop.guid
INNER JOIN
limits AS lim
ON ll.obj_id = lim.id
WHERE
peop.id = 3
AND per.object_type = 'BCBP'
AND per.function = 'BCA010'
AND ll.object_type = 'XADL'
AND ll.function = 'ADLTYP'
AND ll.version IN ( SELECT max(version) FROM link_table WHERE link_id = ll.link_id)
AND per.version IN ( SELECT max(version) FROM link_table WHERE link_id = per.link_id)
AND c.version IN ( SELECT max(version) FROM contracts WHERE contract_id = c.contract_id );
The result I expect is
contract_id | status | product_name | type
-------------+--------+--------------+------
2 | 30 | PRODUCT | 12
However the actual outcome is
contract_id | status | product_name | type
-------------+--------+--------------+------
2 | 30 | PRODUCT | 12
2 | 30 | PRODUCT | 14
I have been struggling with this for over a day now. Could anyone tell me what I am doing wrong? This example is done with PostgreSQL but the real problem needs to be solved with ABAP's OpenSQL so I cannot use UNION.
Here is some SQL to populate the tables.
INSERT INTO products VALUES (1, 'PRODUCT');
INSERT INTO contracts VALUES (2, 1, '30', 1);
INSERT INTO contracts VALUES (2, 2, '30', 1);
INSERT INTO contracts VALUES (2, 3, '30', 1);
INSERT INTO contracts VALUES (2, 4, '30', 1);
INSERT INTO contracts VALUES (2, 5, '30', 1);
INSERT INTO contracts VALUES (2, 6, '30', 1);
INSERT INTO people VALUES (3, 123);
INSERT INTO people VALUES (9, 456);
INSERT INTO limits VALUES (4, '12');
INSERT INTO limits VALUES (5, '14');
INSERT INTO link_table VALUES (6, 1, 2, 'XADL', 'ADLTYP', 4);
INSERT INTO link_table VALUES (7, 2, 2, 'XADL', 'ADLTYP', 5);
INSERT INTO link_table VALUES (8, 2, 2, 'BCBP', 'BCA010', 123);
INSERT INTO link_table VALUES (10, 3, 2, 'BCBP', 'BCA010', 456);
EDIT
It looks like if the following records in link_table
link_id | version | contract_id | object_type | function | obj_id
---------+---------+-------------+-------------+----------+--------
6 | 1 | 2 | XADL | ADLTYP | 4
7 | 2 | 2 | XADL | ADLTYP | 5
were defined with the same link_id then my query would return exactly what I want.
link_id | version | contract_id | object_type | function | obj_id
---------+---------+-------------+-------------+----------+--------
7 | 1 | 2 | XADL | ADLTYP | 4
7 | 2 | 2 | XADL | ADLTYP | 5
Unfortunately, a new link_id is generated each time in production, even though there is a version in the composite key... It looks like I have to find another way or look for other fields in the link table that would help me.