This is a hard problem to explain, but I'm trying to create a SQL query that generates a list of parent groups, where a parent group contains every group that shares at least one product with another group in it. The groups don't ALL have to share products with each other; as long as a group shares a product with at least one other group in the chain, it is included in the same parent group.
So for example: because group 1 has {101,102,103} and group 5 has {101,104,105}, they would both be considered part of the same parent group, because they share product 101. So would group 4 {104}, because it has product 104 in common with group 5 (even though it has no product id in common with group 1).
Example Data:
group_id  product_id
1         101
1         102
1         103
2         101
3         103
4         104
5         101
5         104
5         105
6         105
6         106
6         107
7         110
7         111
Results:
parent_group_id  group_id
1                1
1                2
1                3
1                4
1                5
1                6
2                7
There is no real limit to the number of products that could be listed under a group.
I'm not really sure how to go about tackling this. Perhaps recursion using a CTE?
Ideally I'd like to be able to do this on the fly so that I can find all linked products and query them together as a large set.
Edit:
I based the following solution on Raul's answer below. The change was to the bottomLevel CTE. In their solution, the value of the group_id matters and groupings could be "missed". For example, in the dataset below, group 2 would not be seen to have a parent group id of 1, because the groups that link 2 to 1 (5, 6 and 8) have group ids larger than 2. My solution is to just use a straightforward self join on product id. This solves that problem, but the performance is brutal (I stopped it after 30 minutes) on my testing dataset of 150K rows. In production I could expect millions.
I tried tossing the bottomLevel CTE into a temp table and putting an index on it and that helps a bit with smaller datasets, but still way too slow on the full set.
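Roughly, the temp-table variant looked like this (a sketch against the repro script below; the recursive rc CTE stays the same and just reads from the indexed temp table instead of the bottomLevel CTE):
-- Materialize the self-join once and index it, then point the rc CTE at it.
SELECT DISTINCT
    sp.group_id AS parent_group_id,
    p.group_id
INTO #bottomLevel
FROM #products p
INNER JOIN #products sp
    ON sp.product_id = p.product_id;
CREATE CLUSTERED INDEX cx_bottomLevel ON #bottomLevel (group_id, parent_group_id);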
Am I out of luck here?
CREATE TABLE #products
(
group_id int not null,
product_id int not null
)
INSERT INTO #products
VALUES(1, 101)
,(1, 102)
,(1, 103)
,(2, 110)
,(2, 111)
,(3, 103)
,(4, 104)
,(5, 101)
,(5, 104)
,(5, 105)
,(6, 105)
,(6, 106)
,(6, 107)
,(8,106)
,(8,111)
,(9,201)
,(10,300)
,(11,300)
,(11,301)
CREATE CLUSTERED INDEX cx_prods ON #products (group_id,product_id);
----------------------------------------------------------------
;WITH bottomLevel AS (
SELECT DISTINCT
sp.group_id as parent_group_id
,p.group_id
FROM
#products p
inner JOIN
#products sp
ON
sp.product_id = p.product_id
),
rc AS (
SELECT parent_group_id
, group_id
FROM bottomLevel
UNION ALL
SELECT b.parent_group_id
, r.group_id
FROM rc r
INNER JOIN bottomLevel b
ON r.parent_group_id = b.group_id
AND b.parent_group_id < r.parent_group_id
)
SELECT MIN(parent_group_id) as parent_group_id
, group_id
FROM rc
GROUP BY group_id
ORDER BY group_id
OPTION (MAXRECURSION 32767)
DROP TABLE #products
Marking Raul's answer as accepted because it helped me find the right direction.
But for those who may find this later, here is what I did.
The CTE method I based on Raul's answer worked, but was much too slow for my needs. I explored using the new graph features in SQL Server 2017, but they don't support transitive closure yet, so no luck there. But it did provide me with a term to search for: transitive closure clustering. I found the following two articles on doing it in SQL Server.
This one from Davide Mauri:
http://sqlblog.com/blogs/davide_mauri/archive/2017/11/12/lateral-thinking-transitive-closure-clustering-with-sql-server-uda-and-json.aspx
And this one from Itzik Ben-Gan:
http://www.itprotoday.com/microsoft-sql-server/t-sql-puzzle-challenge-grouping-connected-items
Both very helpful in understanding the problem, but I used Ben-Gan's solution 4.
It uses a while loop to unfold the connected nodes and removes processed edges from the temp input table as it runs.
It runs very fast on small to medium sets, and scales well. My test data of 1.2m rows runs in 2 minutes.
Here is my version of it:
First create a table to store the test data:
CREATE TABLE [dbo].[GroupsToProducts](
[group_id] [INT] NOT NULL,
[product_id] [INT] NOT NULL,
CONSTRAINT [PK_GroupsToProducts] PRIMARY KEY CLUSTERED
(
[group_id] ASC,
[product_id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT INTO GroupsToProducts
VALUES(1, 101)
,(1, 102)
,(1, 103)
,(2, 110)
,(2, 111)
,(3, 103)
,(4, 104)
,(5, 101)
,(5, 104)
,(5, 105)
,(6, 105)
,(6, 106)
,(6, 107)
,(8,106)
,(8,111)
,(9,201)
,(10,300)
,(11,300)
,(11,301)
Then run the script to generate the clusters.
CREATE TABLE #group_rels
(
from_group_id int not null,
to_group_id int not null
)
INSERT INTO #group_rels
SELECT
p.group_id AS from_group_id,
sp.group_id AS to_group_id
FROM
GroupsToProducts p
inner JOIN
GroupsToProducts sp
ON
sp.product_id = p.product_id
AND p.group_id < sp.group_id
GROUP BY
p.group_id,
sp.group_id
CREATE UNIQUE CLUSTERED INDEX idx_from_group_id_to_group_id ON #group_rels(from_group_id, to_group_id);
CREATE UNIQUE NONCLUSTERED INDEX idx_to_group_id_from_group_id ON #group_rels(to_group_id, from_group_id);
-------------------------------------------------
CREATE TABLE #G
(
group_id INT NOT NULL,
parent_group_id INT NOT NULL,
lvl INT NOT NULL,
PRIMARY KEY NONCLUSTERED (group_id),
UNIQUE CLUSTERED(lvl, group_id)
);
DECLARE @lvl AS INT = 1, @added AS INT, @from_group_id AS INT, @to_group_id AS INT;
DECLARE @CurIds AS TABLE(id INT NOT NULL);
-- gets the first relationship pair
-- will use the from_group_id as a 'root' group
SELECT TOP (1)
@from_group_id = from_group_id,
@to_group_id = to_group_id
FROM
#group_rels
ORDER BY
from_group_id,
to_group_id;
SET @added = @@ROWCOUNT;
WHILE @added > 0
BEGIN
-- inserts two rows into the output table:
-- a self pairing using from_group_id
-- AND the actual relationship pair
INSERT INTO #G
(group_id, parent_group_id, lvl)
VALUES
(@from_group_id, @from_group_id, @lvl),
(@to_group_id, @from_group_id, @lvl);
-- removes the pair from input table
DELETE FROM #group_rels
WHERE
from_group_id = @from_group_id
AND to_group_id = @to_group_id;
WHILE @added > 0
BEGIN
-- increment the lvl variable so we only look at the most recently inserted data
SET @lvl += 1;
----------------------------------------------------------------------------
-- the same basic chunk of code is done twice
-- once for group_ids in the output table that join against from_group_id and
-- once for group_ids in the output table that join against to_group_id
-- 1 - join the output table against the input table, looking for any groups that join
-- against groups that have already been found to (directly or indirectly) connect to the root group.
-- 2 - store the found group_ids in the #CurIds table variable and delete the relationship from the input table.
-- 3 - insert the group_ids in the output table using #from_group_id (the current root node id) as the parent group id
-- if any rows are added to the output table in either chunk, loop and look for any groups that may connect to them.
------------------------------------------------------------------------------
DELETE FROM @CurIds;
DELETE FROM group_rels
OUTPUT deleted.to_group_id AS id INTO @CurIds(id)
FROM
#G AS G
INNER JOIN #group_rels AS group_rels
ON G.group_id = group_rels.from_group_id
WHERE
lvl = @lvl - 1;
INSERT INTO #G
(group_id, parent_group_id, lvl)
SELECT DISTINCT
id,
@from_group_id AS parent_group_id,
@lvl AS lvl
FROM
@CurIds AS C
WHERE
NOT EXISTS
(
SELECT
*
FROM
#G AS G
WHERE
G.group_id = C.id
);
SET @added = @@ROWCOUNT;
-----------------------------------------------------------------------------------
DELETE FROM @CurIds;
DELETE FROM group_rels
OUTPUT deleted.from_group_id AS id INTO @CurIds(id)
FROM
#G AS G
INNER JOIN #group_rels AS group_rels
ON G.group_id = group_rels.to_group_id
WHERE
lvl = @lvl - 1;
INSERT INTO #G
(group_id, parent_group_id, lvl)
SELECT DISTINCT
id,
@from_group_id AS parent_group_id,
@lvl AS lvl
FROM
@CurIds AS C
WHERE
NOT EXISTS
(
SELECT
*
FROM
#G AS G
WHERE
G.group_id = C.id
);
SET @added += @@ROWCOUNT;
END;
------------------------------------------------------------------------------
-- At this point, no new rows were added, so the cluster should be complete.
-- Look for another row in the input table to use as a root group
SELECT TOP (1)
@from_group_id = from_group_id,
@to_group_id = to_group_id
FROM
#group_rels
ORDER BY
from_group_id,
to_group_id;
SET @added = @@ROWCOUNT;
END;
SELECT
parent_group_id,
group_id,
lvl
FROM #G
--ORDER BY
--parent_group_id,
--group_id,
--lvl
-------------------------------------------------
DROP TABLE #G
DROP TABLE #group_rels
Take the following statement as a head start:
CREATE TABLE products
(
group_id int not null,
product_id int not null
)
INSERT INTO products
VALUES(1, 101)
,(1, 102)
,(1, 103)
,(2, 101)
,(3, 103)
,(4, 104)
,(5, 101)
,(5, 104)
,(5, 105)
,(6, 105)
,(6, 106)
,(6, 107)
,(7, 110)
,(7, 111)
;WITH bottomLevel AS (
SELECT ISNULL(MIN(matchedGroup),group_id) as parent_group_id
, group_id
FROM products p
OUTER APPLY (
SELECT MIN(group_id) AS matchedGroup
FROM products sp
WHERE sp.group_id != p.group_id
AND sp.product_id = p.product_id
) oa
GROUP BY p.group_id
),
rc AS (
SELECT parent_group_id
, group_id
FROM bottomLevel
UNION ALL
SELECT b.parent_group_id
, r.group_id
FROM rc r
INNER JOIN bottomLevel b
ON r.parent_group_id = b.group_id
AND b.parent_group_id < r.parent_group_id
)
SELECT MIN(parent_group_id) as parent_group_id
, group_id
FROM rc
GROUP BY group_id
ORDER BY group_id
OPTION (MAXRECURSION 32767)
I first grouped by group_id, getting the smallest group_id having matching products, and then recursively joined the parents that have a smaller parent.
Now this solution will probably not cover all the exceptions you might encounter in production, but it should help you start somewhere.
Also, if you have a large product table this might run really slowly, so consider doing this data matching using C#, Spark, SSIS or any other data manipulation engine.
Related
I have a lengthy query written in SQL that uses CTEs and multiple variables to produce a report of about 1500 customer records with many columns based on a particular date, @ToDate. Some of the tables are ordered CTEs so I only get the latest value based on the @ToDate.
I've omitted specifics but the structure is as follows:
Declare @ToDate date .....
Declare @Category varchar ....;
with cte1 as (select * from table1 where table1.start_date <= @ToDate and (table1.end_date > @ToDate or table1.end_date is null))
,cte2 as (select * from table2 where table2.start_date <= @ToDate and (table2.end_date > @ToDate or table2.end_date is null))
select * from cte1
left join cte2 on cte2.id = cte1.id
where .....
which gives me the following results
|RunDate |CustomerID|DOB |Category|Col5 |Col6 |
|----------|----------|----------|--------|------|------|
|2021-08-30|11111 |2000-01-01|Cat1 | | |
|2021-08-30|22222 |2000-02-02|Cat2 | | |
I'd like to run the same script multiple times but with a different date. So run with @ToDate = '2021-08-30' which gives me one set of results and then every past Monday n number of times which would give me results like this...
|RunDate |CustomerID|DOB |Category|Col5 |Col6
|----------|----------|----------|--------|------|------|
|2021-08-30|11111 |2000-01-01|Cat1 | | |
|2021-08-30|22222 |2000-02-02|Cat2 | | |
|2021-08-23|11111 |2000-01-01|Cat1 | | |
|2021-08-23|22222 |2000-02-02|Cat2 | | |
|2021-08-23|33333 |2000-03-03|Cat9 | | |
I do have a calendar table available so I can easily identify the past n Mondays (or any other day I like); see the sketch below.
The only variable to change is the @ToDate, as this is the Run Date, or As At Date if you will. Essentially I want to run it multiple times for the past few Mondays so I can get what the results were like at 30-08, 23-08, 16-08 etc...
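A minimal sketch of pulling those Mondays (assuming a hypothetical dbo.Calendar table with a [Date] column; adjust names to your schema):
-- Last 4 Mondays on or before @ToDate (note DATENAME is language-dependent).
DECLARE @ToDate date = '2021-08-30';
SELECT TOP (4) c.[Date]
FROM dbo.Calendar AS c
WHERE c.[Date] <= @ToDate
  AND DATENAME(WEEKDAY, c.[Date]) = 'Monday'
ORDER BY c.[Date] DESC;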
I've never used loops and research suggests I should maybe avoid them or use them as a last resort. I'm not sure on the best approach and if I do use loops, how I wrap it around my query.
Thanks in advance
The question really needs a bit more elaboration, but I have taken a guess at what you are trying to do with this example.
I have created a Customers and an Orders table and then display the results for the date range.
I don't think you need to loop with cursors and such, as you can get the loop effect by just using the #DateRanges table and joining on that, whether it's a CTE or not.
Please let me know if this is not what you meant and I will remove the answer.
-- Setup a temp table to hold the dates I want to look for
IF EXISTS (SELECT * FROM tempdb.dbo.sysobjects O WHERE O.xtype in ('U') AND O.id = object_id(N'tempdb..#DateRanges'))
BEGIN
PRINT 'Removing temp table #DateRanges'
DROP TABLE #DateRanges;
END
CREATE TABLE #DateRanges (
[Date] DATE
)
-- Add some dates
INSERT INTO #DateRanges ([Date])
VALUES ('2021-08-30'),
('2021-08-23'),
('2021-08-16')
-- Setup some customers
IF EXISTS (SELECT * FROM tempdb.dbo.sysobjects O WHERE O.xtype in ('U') AND O.id = object_id(N'tempdb..#Customers'))
BEGIN
PRINT 'Removing temp table #Customers'
DROP TABLE #Customers;
END
CREATE TABLE #Customers (
CustomerId BIGINT IDENTITY(1,1) NOT NULL,
[Name] NVARCHAR(50),
DOB DATE NOT NULL,
CONSTRAINT PK_CustomerId PRIMARY KEY (CustomerId)
)
INSERT INTO #Customers ([Name], DOB)
VALUES('Bob', '1989-01-01'),
('Robert', '1994-01-01'),
('Andrew', '1992-01-01');
-- Setup some orders
IF EXISTS (SELECT * FROM tempdb.dbo.sysobjects O WHERE O.xtype in ('U') AND O.id = object_id(N'tempdb..#Order'))
BEGIN
PRINT 'Removing temp table #Order'
DROP TABLE #Order;
END
CREATE TABLE #Order (
OrderId BIGINT IDENTITY(1,1) NOT NULL,
CustomerId BIGINT NOT NULL,
CreatedDate DATE NOT NULL,
Category NVARCHAR(50) NOT NULL,
CONSTRAINT PK_OrderId PRIMARY KEY (OrderId)
)
INSERT INTO #Order(CustomerId, CreatedDate, Category)
VALUES
(1, '2021-08-30', 'Cat1'),
(1, '2021-08-23', 'Cat2'),
(2, '2021-08-30', 'Cat1'),
(2, '2021-08-23', 'Cat2'),
(2, '2021-08-16', 'Cat3'),
(3, '2021-08-30', 'Cat1'),
(3, '2021-08-16', 'Cat2')
-- Using the #DateRanges temp table we can then use this to get the data we need, so no need for a loop
SELECT *
FROM #DateRanges AS DR
LEFT JOIN #Order AS O ON O.CreatedDate <= DR.[Date] AND O.CreatedDate >= DATEADD(D, -6, DR.[Date])
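Applied back to the original query, the same idea would look roughly like this (a sketch against the question's cte1/cte2 shape; table1/table2 and their columns are as described in the question):
-- Run the original CTE logic once per date in #DateRanges by joining on the
-- date column instead of the scalar @ToDate.
;WITH cte1 AS (
    SELECT d.[Date] AS RunDate, t1.*
    FROM #DateRanges d
    JOIN table1 t1
        ON t1.start_date <= d.[Date]
       AND (t1.end_date > d.[Date] OR t1.end_date IS NULL)
),
cte2 AS (
    SELECT d.[Date] AS RunDate, t2.*
    FROM #DateRanges d
    JOIN table2 t2
        ON t2.start_date <= d.[Date]
       AND (t2.end_date > d.[Date] OR t2.end_date IS NULL)
)
SELECT *
FROM cte1
LEFT JOIN cte2
    ON cte2.id = cte1.id
   AND cte2.RunDate = cte1.RunDate; -- keep each run's rows together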
I'm trying to find a way to have the DBMS populate merge fields within a long text.
Create the structure :
CREATE TABLE [dbo].[store]
(
[id] [int] NOT NULL,
[text] [nvarchar](MAX) NOT NULL
)
CREATE TABLE [dbo].[statement]
(
[id] [int] NOT NULL,
[store_id] [int] NOT NULL
)
CREATE TABLE [dbo].[statement_merges]
(
[statement_id] [int] NOT NULL,
[merge_field] [nvarchar](30) NOT NULL,
[user_data] [nvarchar](MAX) NOT NULL
)
Now, create test values
INSERT INTO [store] (id, text)
VALUES (1, 'Waw, stackoverflow is an amazing library of lost people in the IT hell, and i have the feeling that $$PERC_SAT$$ of the users found a solution, personally I asked $$ASKED$$ questions.')
INSERT INTO [statement] (id, store_id)
VALUES (1, 1)
INSERT INTO [statement_merges] (statement_id, merge_field, user_data)
VALUES (1, '$$PERC_SAT$$', '85%')
INSERT INTO [statement_merges] (statement_id, merge_field, user_data)
VALUES (1, '$$ASKED$$', '12')
At the moment my app delivers the final statement by looping through the merges and replacing them in the stored text, producing this output:
Waw, stackoverflow is an amazing library of lost people in the IT
hell, and i have the feeling that 85% of the users found a solution,
personally I asked 12 questions.
I'm trying to find a way to be code-independent and serve the output in a single query; as you understood, select a statement in which the stored text has been populated with the user data. I hope I'm clear.
I looked at the TRANSLATE function, but it does character-by-character replacement, so I have two choices:
I try a recursive function, replacing merge fields one by one until no merge_field is found in the computed text, but I have doubts about the performance of this approach;
or there is some magic to do this, but I need your knowledge...
Consider that I want this because the real texts are very long, and I don't want to store them more than once in my database. You can imagine a 3-page contract with only 12 parameters, like start date, invoiced amount, etc... Everything else can't be changed for compliance.
Thank you for your time!
EDIT :
Thanks to Randy's help, this looks to do the trick:
WITH cte_replace_tokens AS (
SELECT replace(r.text, m.merge_field, m.user_data) as [final], m.merge_field, s.id, 1 AS i
FROM store r
INNER JOIN statement s ON s.store_id = r.id
INNER JOIN statement_merges m ON m.statement_id = s.id
WHERE m.statement_id = 1
UNION ALL
SELECT replace(r.final, m.merge_field, m.user_data) as [final], m.merge_field, r.id, r.i + 1 AS i
FROM cte_replace_tokens r
INNER JOIN statement_merges m ON m.statement_id = r.id
WHERE m.merge_field > r.merge_field
)
select TOP 1 final from cte_replace_tokens ORDER BY i DESC
I will check with a bigger database if the performance is good...
At least, I can "populate" one statement; I still need to figure out how to extract a list as well.
Thanks again!
If a record is updated more than once by the same update, the last wins. None of the updates are affected by the others - no cumulative effect. It is possible to trick SQL using a local variable to get cumulative effects in some cases, but it's tricky and not recommended. (Order becomes important and is not reliable in an update.)
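To see the non-cumulative behavior concretely, here is a minimal sketch against the question's tables; a single multi-row UPDATE applies only one of the matching REPLACEs:
-- Both statement_merges rows match store.id = 1, but the row is updated once
-- and only one REPLACE "wins" -- the other token is left in the text.
UPDATE r
SET r.text = REPLACE(r.text, m.merge_field, m.user_data)
FROM store r
INNER JOIN statement s ON s.store_id = r.id
INNER JOIN statement_merges m ON m.statement_id = s.id;
-- The text still contains one of $$PERC_SAT$$ / $$ASKED$$.
SELECT text FROM store WHERE id = 1;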
One alternate is recursion in a CTE. Generate a new record from the prior as each token is replaced until there are no tokens. Here is a working example that replaces 1 with A, 2 with B, etc. (I wonder if there is some tricky xml that can do this as well.)
if not object_id('tempdb..#Raw') is null drop table #Raw
CREATE TABLE #Raw(
[test] [varchar](100) NOT NULL PRIMARY KEY CLUSTERED,
)
if not object_id('tempdb..#Token') is null drop table #Token
CREATE TABLE #Token(
[id] [int] NOT NULL PRIMARY KEY CLUSTERED,
[token] [char](1) NOT NULL,
[value] [char](1) NOT NULL,
)
insert into #Raw values('123456'), ('1122334456')
insert into #Token values(1, '1', 'A'), (2, '2', 'B'), (3, '3', 'C'), (4, '4', 'D'), (5, '5', 'E'), (6, '6', 'F');
WITH cte_replace_tokens AS (
SELECT r.test, replace(r.test, l.token, l.value) as [final], l.id
FROM #Raw r
CROSS JOIN #Token l
WHERE l.id = 1
UNION ALL
SELECT r.test, replace(r.final, l.token, l.value) as [final], l.id
FROM cte_replace_tokens r
CROSS JOIN #Token l
WHERE l.id = r.id + 1
)
select * from cte_replace_tokens where id = 6
It's not recommended to do such tasks inside the SQL engine, but if you want to do that, you need to do it in a loop using a cursor in a function or stored procedure, like so:
DECLARE @merge_field nvarchar(30)
, @user_data nvarchar(MAX)
, @statementid INT = 1
, @text varchar(MAX) = 'Waw, stackoverflow is an amazing library of lost people in the IT hell, and i have the feeling that $$PERC_SAT$$ of the users found a solution, personally I asked $$ASKED$$ questions.'
DECLARE merge_statements CURSOR FAST_FORWARD
FOR SELECT
sm.merge_field
, sm.user_data
FROM dbo.statement_merges AS sm
WHERE sm.statement_id = @statementid
OPEN merge_statements
FETCH NEXT FROM merge_statements
INTO @merge_field , @user_data
WHILE @@FETCH_STATUS = 0
BEGIN
set @text = REPLACE(@text , @merge_field, @user_data )
FETCH NEXT FROM merge_statements
INTO @merge_field , @user_data
END
CLOSE merge_statements
DEALLOCATE merge_statements
SELECT @text
Here is a recursive solution.
SQL Fiddle
MS SQL Server 2017 Schema Setup:
CREATE TABLE [dbo].[store]
(
[id] [int] NOT NULL,
[text] [nvarchar](MAX) NOT NULL
)
CREATE TABLE [dbo].[statement]
(
[id] [int] NOT NULL,
[store_id] [int] NOT NULL
)
CREATE TABLE [dbo].[statement_merges]
(
[statement_id] [int] NOT NULL,
[merge_field] [nvarchar](30) NOT NULL,
[user_data] [nvarchar](MAX) NOT NULL
)
INSERT INTO store (id, text)
VALUES (1, '$$(*)$$, stackoverflow...$$PERC_SAT$$...$$ASKED$$ questions.')
INSERT INTO store (id, text)
VALUES (2, 'Use The #_#')
INSERT INTO statement (id, store_id) VALUES (1, 1)
INSERT INTO statement (id, store_id) VALUES (2, 2)
INSERT INTO statement_merges (statement_id, merge_field, user_data) VALUES (1, '$$PERC_SAT$$', '85%')
INSERT INTO statement_merges (statement_id, merge_field, user_data) VALUES (1, '$$ASKED$$', '12')
INSERT INTO statement_merges (statement_id, merge_field, user_data) VALUES (1, '$$(*)$$', 'Wow')
INSERT INTO statement_merges (statement_id, merge_field, user_data) VALUES (2, ' #_#', 'Flux!')
Query 1:
;WITH Normalized AS
(
SELECT
store_id=store.id,
store.text,
sm.merge_field,
sm.user_data,
RowNumber = ROW_NUMBER() OVER(PARTITION BY store.id,sm.statement_id ORDER BY merge_field),
statement_id = st.id
FROM
store store
INNER JOIN statement st ON st.store_id = store.id
INNER JOIN statement_merges sm ON sm.statement_id = st.id
)
, Recurse AS
(
SELECT
store_id, statement_id, old_text = text, merge_field,user_data, RowNumber,
Iteration=1,
new_text = REPLACE(text, merge_field, user_data)
FROM
Normalized
WHERE
RowNumber=1
UNION ALL
SELECT
n.store_id, n.statement_id, r.old_text, n.merge_field, n.user_data,
RowNumber=r.RowNumber+1,
Iteration=Iteration+1,
new_text = REPLACE(r.new_text, n.merge_field, n.user_data)
FROM
Normalized n
INNER JOIN Recurse r ON r.RowNumber = n.RowNumber AND r.statement_id = n.statement_id
)
,ReverseOnIteration AS
(
SELECT *,
ReverseIteration = ROW_NUMBER() OVER(PARTITION BY statement_id ORDER BY Iteration DESC)
FROM
Recurse
)
SELECT
store_id, statement_id, new_text, old_text
FROM
ReverseOnIteration
WHERE
ReverseIteration=1
Results:
| store_id | statement_id | new_text | old_text |
|----------|--------------|------------------------------------------|--------------------------------------------------------------|
| 1 | 1 | Wow, stackoverflow...85%...12 questions. | $$(*)$$, stackoverflow...$$PERC_SAT$$...$$ASKED$$ questions. |
| 2 | 2 | Use TheFlux! | Use The #_# |
With the help of Randy, I think I've achieved what I wanted to do!
Given that my real case is a contract, in which there are several statements that may be:
free text
stored text without any merges
stored text with one or several merges
this CTE does the job!
WITH cte_replace_tokens AS (
-- The initial query doesn't join on merges or on store, because the statement can be free text
SELECT COALESCE(r.text, s.part_text) AS [final], CAST('' AS NVARCHAR) AS merge_field, s.id, 1 AS i, s.contract_id
FROM statement s
LEFT JOIN store r ON s.store_id = r.id
UNION ALL
-- We loop till the last merge field, output contains iteration to be able to keep the last record ( all fields updated )
SELECT replace(r.final, m.merge_field, m.user_data) as [final], m.merge_field, r.id, r.i + 1 AS i, r.contract_id
FROM cte_replace_tokens r
INNER JOIN statement_merges m ON m.statement_id = r.id
WHERE m.merge_field > r.merge_field AND r.final LIKE '%' + m.merge_field + '%'
-- avoid lost replacements by forcing only one merge_field per loop
AND NOT EXISTS( SELECT mm.statement_id FROM statement_merges mm WHERE mm.statement_id = m.statement_id AND mm.merge_field > r.merge_field AND mm.merge_field < m.merge_field)
)
select s.id,
(select top 1 final from cte_replace_tokens t WHERE t.contract_id = s.contract_id AND t.id = s.id ORDER BY i DESC) as res
FROM statement s
where contract_id = 1
If the CTE solution with a cross join is too slow, an alternate solution would be to dynamically build a scalar function that contains every REPLACE required from the token table. One scalar function call per record is then O(N). I get the same result as before.
The function is simple and likely not too long, depending upon how big the token table becomes... there is a 256 MB batch limit. I've seen attempts to dynamically create queries to improve performance backfire by moving the problem to compile time; that should not be a problem here.
if not object_id('tempdb..#Raw') is null drop table #Raw
CREATE TABLE #Raw(
[test] [varchar](100) NOT NULL PRIMARY KEY CLUSTERED,
)
if not object_id('tempdb..#Token') is null drop table #Token
CREATE TABLE #Token(
[id] [int] NOT NULL PRIMARY KEY CLUSTERED,
[token] [char](1) NOT NULL,
[value] [char](1) NOT NULL,
)
insert into #Raw values('123456'), ('1122334456')
insert into #Token values(1, '1', 'A'), (2, '2', 'B'), (3, '3', 'C'), (4, '4', 'D'), (5, '5', 'E'), (6, '6', 'F');
DECLARE @sql varchar(max) = 'CREATE FUNCTION dbo.fn_ReplaceTokens(@raw varchar(8000)) RETURNS varchar(8000) AS BEGIN RETURN ';
WITH cte_replace_statement AS (
SELECT a.id, CAST('replace(@raw,''' + a.token + ''',''' + a.value + ''')' as varchar(max)) as [statement]
FROM #Token a
WHERE a.id = 1
UNION ALL
SELECT n.id, CAST(replace(l.[statement], '@raw', 'replace(@raw,''' + n.token + ''',''' + n.value + ''')') as varchar(max)) as [statement]
FROM #Token n
INNER JOIN cte_replace_statement l
ON n.id = l.id + 1
)
select @sql += [statement] + ' END' from cte_replace_statement where id = 6
print @sql
if not object_id('dbo.fn_ReplaceTokens') is null drop function dbo.fn_ReplaceTokens
execute (@sql)
SELECT r.test, dbo.fn_ReplaceTokens(r.test) as [final] FROM #Raw r
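For the sample tokens, the print @sql output (traced from the CTE above) is the fully nested REPLACE chain:
CREATE FUNCTION dbo.fn_ReplaceTokens(@raw varchar(8000)) RETURNS varchar(8000) AS BEGIN RETURN replace(replace(replace(replace(replace(replace(@raw,'6','F'),'5','E'),'4','D'),'3','C'),'2','B'),'1','A') END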
I've inherited some fun SQL and am trying to figure out how to eliminate rows with duplicate IDs. Our indexes are stored in a somewhat columnar format and then we pivot all the rows into one with the values as different columns.
The below sample returns three rows of unique data, but the IDs are duplicated. I need just two rows with unique IDs (and the other columns that go along with it). I know I'll be losing some data, but I just need one matching row per ID to the query (first, top, oldest, newest, whatever).
I've tried using DISTINCT, GROUP BY, and ROW_NUMBER, but I keep getting the syntax wrong, or using them in the wrong place.
I'm also open to rewriting the query completely in a way that is reusable as I currently have to generate this on the fly (cardtypes and cardindexes are user defined) and would love to be able to create a stored procedure. Thanks in advance!
declare @cardtypes table ([ID] int, [Name] nvarchar(50))
declare @cards table ([ID] int, [CardTypeID] int, [Name] nvarchar(50))
declare @cardindexes table ([ID] int, [CardID] int, [IndexType] int, [StringVal] nvarchar(255), [DateVal] datetime)
INSERT INTO @cardtypes VALUES (1, 'Funny Cards')
INSERT INTO @cardtypes VALUES (2, 'Sad Cards')
INSERT INTO @cards VALUES (1, 1, 'Bunnies')
INSERT INTO @cards VALUES (2, 1, 'Dogs')
INSERT INTO @cards VALUES (3, 1, 'Cat')
INSERT INTO @cards VALUES (4, 1, 'Cat2')
INSERT INTO @cardindexes VALUES (1, 1, 1, 'Bunnies', null)
INSERT INTO @cardindexes VALUES (2, 1, 1, 'playing', null)
INSERT INTO @cardindexes VALUES (3, 1, 2, null, '2014-09-21')
INSERT INTO @cardindexes VALUES (4, 2, 1, 'Dogs', null)
INSERT INTO @cardindexes VALUES (5, 2, 1, 'playing', null)
INSERT INTO @cardindexes VALUES (6, 2, 1, 'poker', null)
INSERT INTO @cardindexes VALUES (7, 2, 2, null, '2014-09-22')
SELECT TOP(100)
[ID] = c.[ID],
[Name] = c.[Name],
[Keyword] = [colKeyword].[StringVal],
[DateAdded] = [colDateAdded].[DateVal]
FROM @cards AS c
LEFT JOIN @cardindexes AS [colKeyword] ON [colKeyword].[CardID] = c.ID AND [colKeyword].[IndexType] = 1
LEFT JOIN @cardindexes AS [colDateAdded] ON [colDateAdded].[CardID] = c.ID AND [colDateAdded].[IndexType] = 2
WHERE [colKeyword].[StringVal] LIKE 'p%' AND c.[CardTypeID] = 1
ORDER BY [DateAdded]
Edit:
While both solutions are valid, I ended up using the MAX() solution from @popovitsj as it was easier to implement. The issue of data coming from multiple rows doesn't really factor in for me, as all rows are essentially part of the same record. I will most likely use both solutions depending on my needs.
Here's my updated query (as it didn't quite match the answer):
SELECT TOP(100)
[ID] = c.[ID],
[Name] = MAX(c.[Name]),
[Keyword] = MAX([colKeyword].[StringVal]),
[DateAdded] = MAX([colDateAdded].[DateVal])
FROM @cards AS c
LEFT JOIN @cardindexes AS [colKeyword] ON [colKeyword].[CardID] = c.ID AND [colKeyword].[IndexType] = 1
LEFT JOIN @cardindexes AS [colDateAdded] ON [colDateAdded].[CardID] = c.ID AND [colDateAdded].[IndexType] = 2
WHERE [colKeyword].[StringVal] LIKE 'p%' AND c.[CardTypeID] = 1
GROUP BY c.ID
ORDER BY [DateAdded]
You could use MAX or MIN to 'decide' on what to display for the other columns in the rows that are duplicate.
SELECT ID, MAX(Name), MAX(Keyword), MAX(DateAdded)
(...)
GROUP BY ID;
Using the ROW_NUMBER windowed function along with a CTE will do this pretty well. For example:
;With preResult AS (
SELECT TOP(100)
[ID] = c.[ID],
[Name] = c.[Name],
[Keyword] = [colKeyword].[StringVal],
[DateAdded] = [colDateAdded].[DateVal],
ROW_NUMBER()OVER(PARTITION BY c.ID ORDER BY [colDateAdded].[DateVal]) rn
FROM @cards AS c
LEFT JOIN @cardindexes AS [colKeyword] ON [colKeyword].[CardID] = c.ID AND [colKeyword].[IndexType] = 1
LEFT JOIN @cardindexes AS [colDateAdded] ON [colDateAdded].[CardID] = c.ID AND [colDateAdded].[IndexType] = 2
WHERE [colKeyword].[StringVal] LIKE 'p%' AND c.[CardTypeID] = 1
ORDER BY [DateAdded]
)
SELECT * from preResult WHERE rn = 1
I have a requirement to find sets of rows where one or more fields are matching.
E.g:
Vendor Master
VendorId | VendorName | Phone | Address | Fax
---------|------------|-------|---------|----
1        | AAAA       | 10101 | Street1 | 111
2        | BBBB       | 20202 | Street2 | 222
3        | CCCC       | 30303 | Street3 | 333
4        | DDDD       | 40404 | Street2 | 444
5        | FFFF       | 50505 | Street5 | 555
6        | GGGG       | 60606 | Street6 | 444
7        | HHHH       | 10101 | Street6 | 777
SELECT VendorId FROM VendorMaster vm
WHERE EXISTS
( Select 1 FROM VendorMaster vm1
WHERE vm1.VendorId <> vm.VendorId
AND (vm1.Phone = vm.Phone OR vm1.Address = vm.Address OR vm1.Fax = vm.Fax) )
With the above query I am getting records, but my requirement is to assign a set-id for each set of matching records.
Like below:
SetId | VendorId
------|---------
1000  | 1
1000  | 7    // 1 and 7 - phone numbers match
1001  | 2
1001  | 4    // 2 and 4 - address matches
1001  | 6    // 4 and 6 - fax matches
Please advise me on how to write a query to assign set ids for matching sets. The performance of the query is also key here as the number of records will be around 100,000.
Thanks
I believe this will give you your desired result. A little explanation is in the comments, let me know if more is needed.
with relations
--Get all single relationships between vendors.
as (
select t1.vendorId firstId,
t2.vendorId secondId
from VendorMaster t1
inner join VendorMaster t2 on t1.vendorId < t2.vendorId and(
t1.Phone = t2.Phone
or t1.address = t2.address
or t1.Fax = t2.Fax
)
),
recurseLinks
--Recurse the relationships
as (
select r.*, CAST(',' + CAST(r.firstId AS VARCHAR) + ',' AS VARCHAR) tree
from relations r
union all
select r.firstId,
l.secondId,
cast(r.Tree + CAST(l.secondId AS varchar) + ',' as varchar)
from relations l
inner join recurseLinks r on r.secondId = l.firstId and r.tree not like '%' + cast(l.secondId as varchar) + ',%'
union all
select r.firstId,
l.firstId,
cast(r.Tree + CAST(l.firstId AS varchar) + ',' as varchar)
from relations l
inner join recurseLinks r on r.secondId = l.secondId and r.tree not like '%' + cast(l.firstId as varchar) + ',%'
),
removeInvalid
--Removed invalid relationships.
as (
select l1.firstId, l1.secondId
from recurseLinks l1
where l1.firstId < l1.secondId
),
removeIntermediate
--Removed intermediate relationships.
as (
select distinct l1.*
from removeInvalid l1
left join removeInvalid l2 on l2.secondId = l1.firstId
where l2.firstId is null
)
select result.secondId,
dense_rank() over(order by result.firstId) SetId
from (
select firstId,
secondId
from removeIntermediate
union all
select distinct firstId,
firstId
from removeIntermediate
) result;
The 'relations' named result set returns all VendorMaster relationships where the vendors share a common Phone, Address or Fax. It also only returns [A,B]; it won't return the reverse relationship [B,A].
The 'recurseLinks' named result set is a little more complex. It recursively joins all rows that are related to each other. The tree column keeps track of lineage so it won't get stuck in an endless loop. The first query of this union selects all the relations from the 'relations' named result set. The second query of this union selects all the forward recursive relationships, so given [A,B], [B,C] and [C,D], then [A,C], [A,D] and [B,D] are added to the result set. The third query of the union selects all the non-forward recursive relationships, so given [A,D], [C,D], [B,C], then [A,C], [A,B] and [B,D] are added to the result set.
The 'removeInvalid' named result set removes any invalid intermediate relationships added by the recursive query. For example, [B,A], because we will already have [A,B]. Note this could have been prevented in the 'recurseLinks' result set with some effort.
The 'removeIntermediate' named result set removes any intermediate relationships. So given [A,B], [B,C], [C,D], [A,C], [A,D], it will remove [B,C] and [C,D].
The final result set selects the current results and adds in a self relationship. So given [A,B], [A,C], [A,D], add in [A,A]. This produces our final result set.
You can use the built in Ranking functions to accomplish this. For example, for unique Address values:
DECLARE @VendorMaster TABLE ( VendorID INT, Vendorname VARCHAR(20), Phone VARCHAR(20), Address VARCHAR(20), Fax VARCHAR(20) )
INSERT INTO @VendorMaster
(VendorID, Vendorname, Phone, Address, Fax )
VALUES
(1, 'AAAA', '10101', 'Street1', '111'),
(2, 'BBBB', '20202', 'Street2', '222'),
(3, 'CCCC', '30303', 'Street3', '333'),
(4, 'DDDD', '40404', 'Street2', '444'),
(5, 'FFFF', '50505', 'Street5', '555'),
(6, 'GGGG', '60606', 'Street6', '444'),
(7, 'HHHH', '10101', 'Street6', '777')
SELECT
DenseRank = DENSE_RANK() OVER ( ORDER BY Address )
,RowNumber = ROW_NUMBER() OVER ( ORDER BY VendorID )
,* FROM @VendorMaster
Results
DenseRank RowNumber VendorID Vendorname Phone Address Fax
1 1 1 AAAA 10101 Street1 111
2 2 2 BBBB 20202 Street2 222
3 3 3 CCCC 30303 Street3 333
2 4 4 DDDD 40404 Street2 444
4 5 5 FFFF 50505 Street5 555
5 6 6 GGGG 60606 Street6 444
5 7 7 HHHH 10101 Street6 777
If these SetId values need to persist, you could create a separate table with an identity column to track the values associated with each SetId for each set. It sounds like you may simply want to normalize your database and break out the data elements being duplicated into their own tables, linked by an identity column relationship.
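A minimal sketch of what that persistence could look like (table and column names here are hypothetical):
-- One row per set, plus a mapping of vendors to sets.
CREATE TABLE dbo.VendorSet
(
    SetId       INT IDENTITY(1000,1) PRIMARY KEY,
    CreatedDate DATETIME NOT NULL DEFAULT GETDATE()
);
CREATE TABLE dbo.VendorSetMember
(
    SetId    INT NOT NULL REFERENCES dbo.VendorSet(SetId),
    VendorId INT NOT NULL,
    CONSTRAINT PK_VendorSetMember PRIMARY KEY (SetId, VendorId)
);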
Although Will's answer is pretty ingenious, I've never really liked recursive CTEs very much because they always work great on small sets but quickly become very slow on larger ones, and sometimes hit the MAXRECURSION limit.
Personally I'd try to solve this by first putting every VendorID in its own SetID and then merge the upper SetIDs into lower SetIDs that have a matching Vendor.
It would then look something like this:
-- create test-code
IF OBJECT_ID('VendorMaster') IS NOT NULL DROP TABLE VendorMaster
GO
CREATE TABLE VendorMaster
([VendorID] int IDENTITY(1,1) PRIMARY KEY, [Vendorname] nvarchar(100), [Phone] nvarchar(100) , [Address] nvarchar(100), [Fax] nvarchar(100))
;
INSERT INTO VendorMaster
([Vendorname], [Phone], [Address], [Fax])
VALUES
('AAAA', '10101', 'Street1', '111'),
('BBBB', '20202', 'Street20', '222'),
('CCCC', '30303', 'Street3', '333'),
('DDDD', '40404', 'Street2', '444'),
('FFFF', '50505', 'Street5', '555'),
('GGGG', '60606', 'Street6', '444'),
('HHHH', '10101', 'Street6', '777'),
('IIII', '80808', 'Street20', '888'),
('JJJJ', '90909', 'Street9', '888');
GO
-- create sets and start shifting & merging
DECLARE @rowcount int
SELECT SetID = 1000 + ROW_NUMBER() OVER (ORDER BY VendorID),
VendorID
INTO #result
FROM VendorMaster
SELECT @rowcount = @@ROWCOUNT
CREATE UNIQUE CLUSTERED INDEX uq0 ON #result (VendorID)
WHILE @rowcount > 0
BEGIN
-- find lowest SetID that has a match with current record
;WITH shifting
AS (SELECT newSetID = Min(n.SetID),
oldSetID = o.SetID
FROM #result o
JOIN #result n
ON n.SetID < o.SetID
JOIN VendorMaster vo
ON vo.VendorID = o.VendorID
JOIN VendorMaster vn
ON vn.VendorID = n.VendorID
WHERE vn.Vendorname = vo.Vendorname
OR vn.Phone = vo.Phone
OR vn.Address = vo.Address
OR vn.Fax = vo.Fax
GROUP BY o.SetID)
UPDATE #result
SET SetID = s.newSetID
FROM #result upd
JOIN shifting s
ON s.oldSetID = upd.SetID
AND s.newSetID < upd.SetID
SELECT @rowcount = @@ROWCOUNT
END
-- delete 'single-member-sets' for consistency when comparing with Will's CTE
DELETE #result
FROM #result del
WHERE NOT EXISTS ( SELECT *
FROM #result xx
WHERE xx.SetID = del.SetID
AND xx.VendorID <> del.VendorID)
-- fix 'holes'
UPDATE #result
SET SetID = 1 + (SELECT COUNT(DISTINCT SetID)
FROM #result xx
WHERE xx.SetID < upd.SetID)
FROM #result upd
-- show result
SELECT * FROM #result ORDER BY SetID, VendorID
When running this on the test-case provided, I get the same results as the CTE, although it takes a bit longer.
When I add some extra test-data, things become interesting though.
DECLARE @counter int = 7
WHILE @counter > 0
BEGIN
INSERT VendorMaster ([Vendorname], [Phone], [Address], [Fax])
SELECT [Vendorname] = NewID(),
[Phone] = ABS(BINARY_CHECKSUM(NewID())) % 1500,
[Address] = NewID(),
[Fax] = NewID()
FROM VendorMaster
SELECT @counter = @counter - 1
END
SELECT COUNT(*) FROM VendorMaster
This gives me 1152 test-records with the matches we already had before, but now also with some matches on Phone (the NewID()'s won't ever match) to make things easier to verify.
When I run the query above on this, I get 604 sets in just shy of 2 seconds. However, when I run the CTE on it, it
After going through all the hard work of writing a recursive CTE query to meet my needs, I realize I can't use it because it doesn't work in an indexed view. So I need something else to replace the CTE below. (Yes you can use a CTE in a non-indexed view, but that's too slow for me).
The requirements:
My ultimate goal is to have a self updating indexed view (it doesn't have to be a view, but something similar)... that is, if data changes in any of the tables the view joins on, then the view needs to update itself.
The view needs to be indexed because it has to be very fast, and the data doesn't change very frequently. Unfortunately, the non-indexed view using a CTE takes 3-5 seconds to run which is way too long for my needs. I need the query to run in milliseconds. The recursive table has a few hundred thousand records in it.
As far as my research has taken me, the best solution to meet all these requirements is an indexed view, but I'm open to any solution.
The CTE can be found in the answer to my other post.
Or here it is again:
DECLARE @tbl TABLE (
Id INT
,[Name] VARCHAR(20)
,ParentId INT
)
INSERT INTO @tbl( Id, Name, ParentId )
VALUES
(1, 'Europe', NULL)
,(2, 'Asia', NULL)
,(3, 'Germany', 1)
,(4, 'UK', 1)
,(5, 'China', 2)
,(6, 'India', 2)
,(7, 'Scotland', 4)
,(8, 'Edinburgh', 7)
,(9, 'Leith', 8)
;
DECLARE @tbl2 table (id int, abbreviation varchar(10), tbl_id int)
INSERT INTO @tbl2( Id, Abbreviation, tbl_id )
VALUES
(100, 'EU', 1)
,(101, 'AS', 2)
,(102, 'DE', 3)
,(103, 'CN', 5)
;WITH abbr AS (
SELECT a.*, isnull(b.abbreviation,'') abbreviation
FROM @tbl a
left join @tbl2 b on a.Id = b.tbl_id
), abcd AS (
-- anchor
SELECT id, [Name], ParentID,
CAST(([Name]) AS VARCHAR(1000)) [Path],
cast(abbreviation as varchar(max)) abbreviation
FROM abbr
WHERE ParentId IS NULL
UNION ALL
--recursive member
SELECT t.id, t.[Name], t.ParentID,
CAST((a.path + '/' + t.Name) AS VARCHAR(1000)) [Path],
isnull(nullif(t.abbreviation,'')+',', '') + a.abbreviation
FROM abbr AS t
JOIN abcd AS a
ON t.ParentId = a.id
)
SELECT *, [Path] + ':' + abbreviation
FROM abcd
After hitting all the roadblocks with indexed views (self join, CTE, UDF accessing data, etc.), I propose the below as a solution for you.
Create support function
It is based on a maximum depth of 4 from the root (5 levels total). Or use a CTE.
CREATE FUNCTION dbo.GetHierPath(@hier_id int) returns varchar(max)
WITH SCHEMABINDING
as
begin
return (
select FullPath =
isnull(H5.Name+'/','') +
isnull(H4.Name+'/','') +
isnull(H3.Name+'/','') +
isnull(H2.Name+'/','') +
H1.Name
+
':'
+
isnull(STUFF(
isnull(','+A1.abbreviation,'') +
isnull(','+A2.abbreviation,'') +
isnull(','+A3.abbreviation,'') +
isnull(','+A4.abbreviation,'') +
isnull(','+A5.abbreviation,''),1,1,''),'')
from dbo.HIER H1
left join dbo.ABBR A1 on A1.hier_id = H1.Id
left join dbo.HIER H2 on H1.ParentId = H2.Id
left join dbo.ABBR A2 on A2.hier_id = H2.Id
left join dbo.HIER H3 on H2.ParentId = H3.Id
left join dbo.ABBR A3 on A3.hier_id = H3.Id
left join dbo.HIER H4 on H3.ParentId = H4.Id
left join dbo.ABBR A4 on A4.hier_id = H4.Id
left join dbo.HIER H5 on H4.ParentId = H5.Id
left join dbo.ABBR A5 on A5.hier_id = H5.Id
where H1.id = @hier_id)
end
GO
Add columns to the table itself
For example the FullPath column; if you need the other 2 columns, add them by splitting the result of dbo.GetHierPath on ':' (left => path, right => abbreviations), as sketched after the ALTER TABLE below.
-- index maximum key length is 900, based on your data, 400 is enough
ALTER TABLE HIER ADD FullPath VARCHAR(400)
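If you want the two halves as separate columns, a sketch of the split (this assumes FullPath always contains exactly one ':' as built by the function):
-- Computed columns keep the two halves in sync with FullPath.
ALTER TABLE HIER ADD PathOnly AS LEFT(FullPath, CHARINDEX(':', FullPath) - 1);
ALTER TABLE HIER ADD Abbreviations AS SUBSTRING(FullPath, CHARINDEX(':', FullPath) + 1, 400);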
Maintain the columns
Because of the hierarchical nature, a record X could be deleted that affects a descendant Y and an ancestor Z, which is quite hard to identify in either an INSTEAD OF or an AFTER trigger. So the alternative approach is based on these conditions:
if data changes in any of the tables the view joins on, then the view needs to update itself.
the non-indexed view using a CTE takes 3-5 seconds to run which is way too long for my needs
We maintain the data simply by running through the entire table again, taking 3-5 seconds per update (or faster if the 5-join query works out better).
CREATE TRIGGER TG_HIER
ON HIER
AFTER INSERT, UPDATE, DELETE
AS
UPDATE HIER
SET FullPath = dbo.GetHierPath(HIER.Id)
Finally, index the new column(s) on the table itself
create index ix_hier_fullpath on HIER(FullPath)
If you intended to access the path data via the id, then it is already in the table itself without adding an additional index.
The above TSQL references these objects
Modify the table and column names to suit your schema.
CREATE TABLE dbo.HIER (Id INT Primary Key Clustered, [Name] VARCHAR(20) ,ParentId INT)
;
INSERT dbo.HIER( Id, Name, ParentId ) VALUES
(1, 'Europe', NULL)
,(2, 'Asia', NULL)
,(3, 'Germany', 1)
,(4, 'UK', 1)
,(5, 'China', 2)
,(6, 'India', 2)
,(7, 'Scotland', 4)
,(8, 'Edinburgh', 7)
,(9, 'Leith', 8)
,(10, 'Antartica', NULL)
;
CREATE TABLE dbo.ABBR (id int primary key clustered, abbreviation varchar(10), hier_id int)
;
INSERT dbo.ABBR( Id, Abbreviation, hier_id ) VALUES
(100, 'EU', 1)
,(101, 'AS', 2)
,(102, 'DE', 3)
,(103, 'CN', 5)
GO
EDIT - Possibly faster alternative
Given that all records are recalculated each time, there is no real need for a function that returns the FullPath for a single HIER.Id. The query in the support function can be used without the where H1.id = @hier_id filter at the end. Furthermore, the expression for FullPath can easily be broken into PathOnly and Abbreviation down the middle. Or just use the original CTE, whichever is faster.
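In other words, the trigger body could run the function's query as one set-based UPDATE over the whole table (the same joins as dbo.GetHierPath, minus the single-id filter):
-- Sketch: recompute FullPath for every row in one statement.
UPDATE H1
SET FullPath =
    isnull(H5.Name+'/','') +
    isnull(H4.Name+'/','') +
    isnull(H3.Name+'/','') +
    isnull(H2.Name+'/','') +
    H1.Name
    + ':' +
    isnull(STUFF(
        isnull(','+A1.abbreviation,'') +
        isnull(','+A2.abbreviation,'') +
        isnull(','+A3.abbreviation,'') +
        isnull(','+A4.abbreviation,'') +
        isnull(','+A5.abbreviation,''),1,1,''),'')
FROM dbo.HIER H1
left join dbo.ABBR A1 on A1.hier_id = H1.Id
left join dbo.HIER H2 on H1.ParentId = H2.Id
left join dbo.ABBR A2 on A2.hier_id = H2.Id
left join dbo.HIER H3 on H2.ParentId = H3.Id
left join dbo.ABBR A3 on A3.hier_id = H3.Id
left join dbo.HIER H4 on H3.ParentId = H4.Id
left join dbo.ABBR A4 on A4.hier_id = H4.Id
left join dbo.HIER H5 on H4.ParentId = H5.Id
left join dbo.ABBR A5 on A5.hier_id = H5.Id;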
Given that all records are recalculated each time, there is no real need for a function that returns the FullPath for a single HIER.ID. The query in the support function can be used without the where H1.id = #hier_id filter at the end. Furthermore, the expression for FullPath can be broken into PathOnly and Abbreviation easily down the middle. Or just use the original CTE, whichever is faster.