Count Similar Substrings SQL query - sql

I've tried a few scenarios and googled a lot, but still can't find a solution.
I have a table of user names with entries something like the below:
UserName
Cakes420
18Jack01
18Jack04
16Jack22
22Jack16
Mapple7609
Chrom44
chrom22
chrom77
013Cake
016Cake
122Cake
123Cake87
So I need a query that checks for all records that share 4 or more (in sequence) characters in the table.
So I need to return something like :
Characters
Times Used
Names Sharing
Cake
5
Cakes420, 013Cake, 016Cake, 122Cake, 123Cake87
Chro
3
Chrom44, chrom22, chrom77
or anything similar as I'd prefer not to repeat patterns, but hey, at this stage if it returns the values properly, I don't mind.
The shared characters can naturally appear in any place in the string, which is what makes this so difficult.

Should you do this in T-SQL? Probably not.
Can you do this in T-SQL? Yes.
Sample data
create table Names
(
Name nvarchar(20)
);
insert into Names (Name) values
('Cakes420'),
('18Jack01'),
('18Jack04'),
('16Jack22'),
('22Jack16'),
('Mapple7609'),
('Chrom44'),
('chrom22'),
('chrom77'),
('013Cake'),
('016Cake'),
('122Cake'),
('123Cake87');
Solution
Using STRING_AGG() for easy concatenation. Available from SQL Server 2017. Alternatives available for older SQL versions (use the search box on this site, there are many examples).
with rcte as
(
select n.Name,
convert(nvarchar(4), substring(n.Name, 1, 4)) as Part,
1 as PartFrom
from Names n
where len(n.Name) >= 4
union all
select r.Name,
convert(nvarchar(4), substring(r.Name, r.PartFrom+1, r.PartFrom+4)),
r.PartFrom+1
from rcte r
where len(r.Name) >= r.PartFrom+4
),
cte_count as
(
select r.Part,
count(1) as PartCount
from rcte r
where r.Part not like '%[0-9]%' -- exclude parts with numbers in them
group by r.Part
having count(1) > 1
)
select c.Part,
c.PartCount,
string_agg(r.Name, ', ') as Names
from cte_count c
join rcte r
on r.Part = c.Part
group by c.Part,
c.PartCount
order by c.Part;
Result
Part PartCount Names
---- --------- ----------------------------------------------
Cake 5 Cakes420, 123Cake87, 122Cake, 016Cake, 013Cake
Chro 3 Chrom44, chrom22, chrom77
hrom 3 chrom77, chrom22, Chrom44
Jack 4 22Jack16, 16Jack22, 18Jack04, 18Jack01
Fiddle to see it in action with the intermediate CTE results.

Let's use Itzik Ben-Gan's Tally Function to break out a list of substrings, then group them. This is called N-Gram, after the more common Trigram which is 3-character substrings.
I've removed one extra cross-join from the function to speed it up slightly, it's now good for up to varchar(65536):
CREATE OR ALTER FUNCTION dbo.GetNums(#num AS BIGINT)
RETURNS TABLE
AS
RETURN
WITH
L0 AS ( SELECT 1 AS c
FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),
(1),(1),(1),(1),(1),(1),(1),(1)) AS D(c) ),
L1 AS ( SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B ),
L2 AS ( SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B ),
Nums AS ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM L2 )
SELECT TOP(#num)
rownum AS rn
FROM Nums
ORDER BY rownum;
GO
DECLARE #substringLen int = 4;
SELECT
Characters,
[Times Used] = COUNT(*),
[Names Sharing] = STRING_AGG(Username, ', ')
FROM (
SELECT DISTINCT
-- remove DISTINCT if you want to know about multiple in a single username
t.Username,
Characters = SUBSTRING(t.Username, n.rn, #substringLen)
FROM myTable t
CROSS APPLY dbo.GetNums (LEN(t.UserName) - #substringLen + 1) n
) t
GROUP BY t.Characters
HAVING COUNT(*) > 1

Related

SQL Server [PATSTAT] query | Multiple charindex values &

Hello Stack Overflow Community.
I am retrieving data with SQL from PATSTAT (patent data base from the European Patent Office). I have two issues (see below). For your info the PATSAT sql commands are quite limited.
I. Charindex with multiple values
I am looking for specific two specific patent groups ["Y02E" and "Y02C"] and want to retrieve data on these. I have found that using the charindex function works if I insert one group;
and charindex ('Y02E', cpc_class_symbol) > 0
But if I want to use another charindex function the query just times out;
and charindex ('Y02E', cpc_class_symbol) > 0 or charindex ('Y02C', cpc_class_symbol) >0
I am an absolute SQL rookie but would really appreciate your help!
II. List values from column in one cell with comma separation
Essentially I want to apply what I found as the "string_agg"-command, however, it does not work for this database. I have entries with a unique ID, which have multiple patent categories. For example:
appln_nr_epodoc | cpc_class_symbol
EP20110185794 | Y02E 10/125
EP20110185794 | Y02E 10/127
I would like to have it like this, however:
appln_nr_epodoc | cpc_class_symbol
EP20110185794 | Y02E 10/125, Y02E 10/127
Again, I am very new to sql, so any help is appreciated! Thank you!
I will also attach the full code here for transparency
SELECT a.appln_nr_epodoc, a.appln_nr_original, psn_name, person_ctry_code, person_name, person_address, appln_auth+appln_nr,
appln_filing_date, cpc_class_symbol
FROM
tls201_appln a
join tls207_pers_appln b on a.appln_id = b.appln_id
join tls206_person c on b.person_id = c.person_id
join tls801_country on c.person_ctry_code= tls801_country.ctry_code
join tls224_appln_cpc on a.appln_id = tls224_appln_cpc.appln_id
WHERE appln_auth = 'EP'
and appln_filing_year between 2005 and 2012
and eu_member = 'Y'
and granted = 'Y'
and psn_sector = 'company'
and charindex ('Y02E', cpc_class_symbol) > 0
For your part 2 here is a sample data i created
And here is the code. It gives me YOUR requested output.
create table #test_1 (
appln_nr_epodoc varchar(20) null
,cpc_class_symbol varchar(20) null
)
insert into #test_1 values
('EP20110185794','Y02E 10/125')
,('EP20110185794','Y02E 10/127')
,('EP20110185795','Y02E 10/130')
,('EP20110185796','Y02E 20/140')
,('EP20110185796','Y02E 21/142')
with CTE_1 as (select *
from (
select *
,R1_1 = Rank() over(partition by appln_nr_epodoc order by cpc_class_symbol )
from #test_1
) as a
where R1_1 = 1
)
,CTE_2 as (select *
from (
select *
,R1_1 = Rank() over(partition by appln_nr_epodoc order by cpc_class_symbol )
from #test_1
) as a
where R1_1 = 2 )
select a.appln_nr_epodoc
,a.cpc_class_symbol+','+c.cpc_class_symbol
from CTE_1 a
join CTE_2 c on c.appln_nr_epodoc = a.appln_nr_epodoc
Out put

SQL Server 2008: duplicate a row n-times, where n is a value in a field

In SQL Server 2018 I have three tables:
T1 (idService, dateStart, dateStop)
T2 (idService, totalCostOfService)
T3 (idService, companyName)
Using joins, I created a view:
V1 (idService, dateStart, dateStop, totalCostOfService, companyName)
And we are fine. I can do my selects on the view and obtain the list of services done.
What I would like to do now is to duplicate every row of the view n times, where n=dateStart-dateStop; every row should have a "new" totalCostOfService = totalCostOfService/n.
I can do that using a temporary table, declaring variables, insert in temp using some while etc. etc. Let's call it "the procedure"
But what I would like to understand is:
is it possibile to do that directly with a select on V1? If not, is it possible to save "the procedure" as a view so that I can have it as a easy select?
Sorry if my question looks somewhat stupid, but I'm totally new with SQL. I tried searching here and on google but I couldn't find what an answer to my questions.
Thank you!
Rather than an rCTE (which is RBAR), you could use a Tally Table:
WITH N AS (
SELECT N
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) N(N)),
Tally AS(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -1 AS I
FROM N N1
CROSS JOIN N N2 --100
CROSS JOIN N N3 --1000
CROSS JOIN N N4) --10000
SELECT *
FROM YourTable
JOIN Tally T ON T.I <= dateStart-dateStop --Assumes dateStart and DateStop are integer values, even though their name implies otherwise
--If they are dates, then use DATEDIFF(DAY, dateStart, dateEnd)
That tally will generate numbers up to 10000 (which over 27 years worth of days. That should be far more than enough).
I will assume the existence of a numbers table which has the column val for the individual value numbers. If you don't, you will find plenty by searching around.
Add this in the end of the FROM clause of your view:
cross apply (select datediff(day,T1.dateStart,T1.dateStop)+1 as n_days)q1 -- number of days INCLUDING start
cross apply (select dateadd(day,T1.dateStart,n.val) as day_of_charge)q2 from numbers n where n.val between 0 and n_days-1)
Then you will be able to have the following field on your SELECT:
T2.totalCostOfService/n_days as totalCostOfService
I'll add a numbers table solution shortly.
You can use a recursive CTE:
with cte as (
select idService, dateStart, dateStop,
totalCostOfService / (datediff(day, datestop, datestart) + 1) as dailyCostOfService,
companyName
from v1
union all
select idService,
dateadd(day, 1, dateStart),
dateStop,
dailyCostOfService
companyName
from cte
)
select idservice, dateStart as dateOfService,
dailyCostOfService, companyName
from cte;
Note that if there are more than 100 days in any row, then you will need to add OPTION (MAXRECURSION 0).

Sql How to Join a table where data in Colums are alike

I am managing a large databse. I am trying to Join a table. but the data in the Colums dont actualy match. One has dashes and the other has space.
Such GPD 142 pol (Partnumber)in the Company table and GPD-142-pol (PartNumber)in the Customer table.
My query is written like this:
SELECT *
FROM CompanyPartsList
JOIN SalesReport
On FordPartsList.[Company Part Number] = SalesReport.[Customer Part #]
I tries something like this
SELECT *
FROM CompanyPartsList
JOIN SalesReport
On FordPartsList.[Company Part Number] Like SalesReport.[Customer Part #]
Any help would be appreciated.
again doing this will be very slow , a solution would be a trigger to create the correct formatted column on either side
SELECT *
FROM CompanyPartsList
JOIN SalesReport
On FordPartsList.[Company Part Number] = Replace(SalesReport.[Customer Part #],'-',' ')
Try replacing characters that can cause the values to be different.
SELECT
*
FROM
CompanyPartsList cpl,
SalesReport sr
WHERE
REPLACE(REPLACE(cpl.[Company Part Number],'-',''),' ','') = REPLACE(REPLACE(sr.[Customer Part #],'-',''),' ','')
At the end of the day, for the JOIN to work, you've got to have matching values. So your only option is to find a way to make them equal.
Given your examples, I'd experiment to see if there's a way that you could normalize the values to a standard. For example, you could try to remove all the spaces and hyphens on both sides by using REPLACE, and see if that does it for you.
If you're able to get matches that way, you've got two choices. You could always do that normalization on-the-fly when you do the JOIN, but that would probably be performance prohibitive. Or you could add another column to each table, which you update at the same time you update the real part#, setting it to that value with the spaces, hyphens, etc, removed.
Aside from that headache, your potential problem is the chance of clashes. What happens if you've got part#s "123-4B" and "12-34B"? If you use my proposal, those two products would appear the same.
Create a function as mentioned in question 1007697 and modify it gently to strip off anything but alphabets and numbers
Create Function [dbo].[RemoveNonAlphaNumericCharacters](#Temp VarChar(1000))
Returns VarChar(1000)
AS
Begin
Declare #KeepValues as varchar(50)
Set #KeepValues = '%[^a-z0-9]%'
While PatIndex(#KeepValues, #Temp) > 0
Set #Temp = Stuff(#Temp, PatIndex(#KeepValues, #Temp), 1, '')
Return #Temp
End
Then you could compare data, but it may be slow on a large table:
SELECT *
FROM CompanyPartsList
JOIN SalesReport
On RemoveNonAlphaNumericCharacters(FordPartsList.[Company Part Number])
= RemoveNonAlphaNumericCharacters(SalesReport.[Customer Part #])
If you end up wanting to use a function I would recommend using an inline table valued function instead of a scalar function that has a while loop inside. The performance of that scalar function is going to degrade quickly as the table gets larger. Here is an example of using an inline table valued function and a tally table so the replacement is set based.
If this was my code I would prefer to use the REPLACE option if that is a possibility.
CREATE FUNCTION [dbo].[StripNonAlphaNumeric_itvf]
(
#OriginalText VARCHAR(8000)
) RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH
E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
Tally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select STUFF(
(
SELECT SUBSTRING(#OriginalText, t.N, 1)
FROM tally t
WHERE
(
ASCII(SUBSTRING(#OriginalText, t.N, 1)) BETWEEN 48 AND 57 --numbers 0 - 9
OR
ASCII(SUBSTRING(#OriginalText, t.N, 1)) BETWEEN 65 AND 90 --UPPERCASE letters
OR
ASCII(SUBSTRING(#OriginalText, t.N, 1)) BETWEEN 97 AND 122 --LOWERCASE letters
)
AND n <= len(#OriginalText)
FOR XML PATH('')
), 1 ,0 , '') AS CleanedText

How to split and display distinct letters from a word in SQL?

Yesterday in a job interview session I was asked this question and I had no clue about it. Suppose I have a word "Manhattan " I want to display only the letters 'M','A','N','H','T'
in SQL. How to do it?
Any help is appreciated.
Well, here is my solution (sqlfiddle) - it aims to use a "Relational SQL" operations, which may have been what the interviewer was going for conceptually.
Most of the work done is simply to turn the string into a set of (pos, letter) records as the relevant final applied DQL is a mere SELECT with a grouping and ordering applied.
select letter
from (
-- All of this just to get a set of (pos, letter)
select ns.n as pos, substring(ss.s, ns.n, 1) as letter
from (select 'MANHATTAN' as s) as ss
cross join (
-- Or use another form to create a "numbers table"
select n from (values (1),(2),(3),(4),(5),(6),(7),(8),(9)) as X(n)
) as ns
) as pairs
group by letter -- guarantees distinctness
order by min(pos) -- ensure output is ordered MANHT
The above query works in SQL Server 2008, but the "Numbers Table" may have to be altered for other vendors. Otherwise, there is nothing used that is vendor specific - no CTE, or cross application of a function, or procedural language code ..
That being said, the above is to show a conceptual approach - SQL is designed for use with sets and relations and multiplicity across records; the above example is, in some sense, merely a perversion of such.
Examining the intermediate relation,
select ns.n as pos, substring(ss.s, ns.n, 1) as letter
from (select 'MANHATTAN' as s) as ss
cross join (
select n from (values (1),(2),(3),(4),(5),(6),(7),(8),(9)) as X(n)
) as ns
uses a cross join to generate the Cartesian product of the string (1 row) with the numbers (9 rows); the substring function is then applied with the string and each number to obtain each character in accordance with its position. The resulting set contains the records-
POS LETTER
1 M
2 A
3 N
..
9 N
Then the outer select groups each record according to the letter and the resulting records are ordered by the minimum (first) occurrence position of the letter that establishing the grouping. (Without the order by the letters would have been distinct but the final order would not be guaranteed.)
One way (if using SQL Server) is with a recursive CTE (Commom Table Expression).
DECLARE #source nvarchar(100) = 'MANHATTAN'
;
WITH cte AS (
SELECT SUBSTRING(#source, 1, 1) AS c1, 1 as Pos
WHERE LEN(#source) > 0
UNION ALL
SELECT SUBSTRING(#source, Pos + 1, 1) AS c1, Pos + 1 as Pos
FROM cte
WHERE Pos < LEN(#source)
)
SELECT DISTINCT c1 from cte
SqlFiddle for this is here. I had to inline the #source for SqlFiddle, but the code above works fine in Sql Server.
The first SELECT generates the initial row(in this case 'M', 1). The second SELECT is the recursive part that generates the subsequent rows, with the Pos column getting incremented each time until the termination condition WHERE Pos < LEN(#source) is finally met. The final select removes the duplicates. Internally, SELECT DISTINCT sorts the rows in order to facilitate the removal of duplicates, which is why the final output happens to be in alphabetic order. Since you didn't specify order as a requirement, I left it as-is. But you could modify it to use a GROUP instead, that ordered on MIN(Pos) if you needed the output in the characters' original order.
This same technique can be used for things like generating all the Bigrams for a string, with just a small change to the general structure above.
declare #charr varchar(99)
declare #lp int
set #charr='Manhattan'
set #lp=1
DECLARE #T1 TABLE (
FLD VARCHAR(max)
)
while(#lp<=LEN(#charr))
begin
if(not exists(select * from #T1 where FLD=(select SUBSTRING(#charr,#lp,1))))
begin
insert into #T1
select SUBSTRING(#charr,#lp,1)
end
set #lp=#lp+1
end
select * from #T1
check this it may help u
Here's an Oracle version of #user2864740's answer. The only difference is how you construct the "numbers table" (plus slight differences in aliasing)
select letter
from (
select ns.n as pos, substr(ss.s, ns.n, 1) as letter
from (select 'MANHATTAN' as s from dual) ss
cross join (
SELECT LEVEL as n
FROM DUAL
CONNECT BY LEVEL <= 9
ORDER BY LEVEL) ns
) pairs
group by letter
order by min(pos)

View to identify grouped values or object

As an example I have 5 objects. An object is the red dots bound together or adjacent to each other. In other words X+1 or X-1 or Y+1 or Y-1.
I need to create a MS SQL VIEW with will contain the first XY coordinate of each object like:
X,Y
=======
1. 1,1
2. 1,8
3. 4,3
4. 5,7
5. 6,5
I can't figure out how to group it in a VIEW (NOT using stored procedure). Anybody have any idea would be of great help.
Thanks
The other answer is already pretty long, so I'm leaving it as-is. This answer is much better, simpler and also correct whereas the other one has some edge-cases that will produce a wrong answer - I shall leave that exercise to the reader.
Note: Line breaks are added for clarity. The entire block is a single query
;with Walker(StartX,StartY,X,Y,Visited) as (
select X,Y,X,Y,CAST('('+right(X,3)+','+right(Y,3)+')' as Varchar(Max))
from puzzle
union all
select W.StartX,W.StartY,P.X,P.Y,W.Visited+'('+right(P.X,3)+','+right(P.Y,3)+')'
from Walker W
join Puzzle P on
(W.X=P.X and W.Y=P.Y+1 OR -- these four lines "collect" a cell next to
W.X=P.X and W.Y=P.Y-1 OR -- the current one in any direction
W.X=P.X+1 and W.Y=P.Y OR
W.X=P.X-1 and W.Y=P.Y)
AND W.Visited NOT LIKE '%('+right(P.X,3)+','+right(P.Y,3)+')%'
)
select X, Y, Visited
from
(
select W.X, W.Y, W.Visited, rn=row_number() over (
partition by W.X,W.Y
order by len(W.Visited) desc)
from Walker W
left join Walker Other
on Other.StartX=W.StartX and Other.StartY=W.StartY
and (Other.Y<W.Y or (Other.Y=W.Y and Other.X<W.X))
where Other.X is null
) Z
where rn=1
The first step is to set up a "walker" recursive table expression that will start at every
cell and travel as far as it can without retracing any step. Making sure that cells are not revisited is done by using the visited column, which stores each cell that has been visited from every starting point. In particular, this condition AND W.Visited NOT LIKE '%('+right(P.X,3)+','+right(P.Y,3)+')%' rejects cells that it has already visited.
To understand how the rest works, you need to look at the result generated by the "Walker" CTE by running "Select * from Walker order by StartX, StartY" after the CTE. A "piece" with 5 cells appears in at least 5 groups, each with a different (StartX,StartY), but each group has all the 5 (X,Y) pieces with different "Visited" paths.
The subquery (Z) uses a LEFT JOIN + IS NULL to weed the groups down to the single row in each group that contains the "first XY coordinate", defined by the condition
Other.StartX=W.StartX and Other.StartY=W.StartY
and (Other.Y<W.Y or (Other.Y=W.Y and Other.X<W.X))
The intention is for each cell that can be visited starting from (StartX, StartY), to compare against each other cell in the same group, and to find the cell where NO OTHER cell is on a higher row, or if they are on the same row, is to the left of this cell. This still leaves us with too many results, however. Consider just a 2-cell piece at (3,4) and (4,4):
StartX StartY X Y Visited
3 4 3 4 (3,4) ******
3 4 4 4 (3,4)(4,4)
4 4 4 4 (4,4)
4 4 3 4 (4,4)(3,4) ******
2 rows remain with the "first XY coordinate" of (3,4), marked with ******. We only need one row, so we use Row_Number and since we're numbering, we might as well go for the longest Visited path, which would give us as many of the cells within the piece as we can get.
The final outer query simply takes the first rows (RN=1) from each similar (X,Y) group.
To show ALL the cells of each piece, change the line
select X, Y, Visited
in the middle to
select X, Y, (
select distinct '('+right(StartX,3)+','+right(StartY,3)+')'
from Walker
where X=Z.X and Y=Z.Y
for xml path('')
) PieceCells
Which give this output
X Y PieceCells
1 1 (1,1)(2,1)(2,2)(3,2)
3 4 (3,4)(4,4)
5 6 (5,6)
7 5 (7,5)(8,5)(9,5)
8 1 (10,1)(8,1)(8,2)(9,1)(9,2)(9,3)
Ok. Its little bit hard. But in any case, I'm sure that in a simpler way this problem can not be solved.
So we have table:
CREATE Table Tbl1(Id int, X int, Y int)
INSERT INTO Tbl1
SELECT 1,1,1 UNION ALL
SELECT 2,1,2 UNION ALL
SELECT 3,1,8 UNION ALL
SELECT 4,1,9 UNION ALL
SELECT 5,1,10 UNION ALL
SELECT 6,2,2 UNION ALL
SELECT 7,2,3 UNION ALL
SELECT 8,2,8 UNION ALL
SELECT 9,2,9 UNION ALL
SELECT 10,3,9 UNION ALL
SELECT 11,4,3 UNION ALL
SELECT 12,4,4 UNION ALL
SELECT 13,5,7 UNION ALL
SELECT 14,5,8 UNION ALL
SELECT 15,5,9 UNION ALL
SELECT 16,6,5
And here is select query
with cte1 as
/*at first we make recursion to define groups of filled adjacent cells*/
/*as output of cte we have a lot of strings like <X>cell(1)X</X><Y>cell(1)Y</Y>...<X>cell(n)X</X><Y>cell(n)Y</Y>*/
(
SELECT id,X,Y,CAST('<X>'+CAST(X as varchar(10))+'</X><Y>'+CAST(Y as varchar(10))+'</Y>' as varchar(MAX)) info
FROM Tbl1
UNION ALL
SELECT b.id,a.X,a.Y,CAST(b.info + '<X>'+CAST(a.X as varchar(10))+'</X><Y>'+CAST(a.Y as varchar(10))+'</Y>' as varchar(MAX))
FROM Tbl1 a JOIN cte1 b
ON ((((a.X=b.X+1) OR (a.X=b.X-1)) AND a.Y=b.Y) OR (((a.Y=b.Y+1) OR (a.Y=b.Y-1)) AND a.X=b.X))
AND a.id<>b.id
AND
b.info NOT LIKE
('%'+('<X>'+CAST(a.X as varchar(10))+'</X><Y>'+CAST(a.Y as varchar(10))+'</Y>')+'%')
),
cte2 as
/*In this query, we select only the longest sequence of cell connections (first filter)*/
/*And we convert the string to a new standard (x,y | x,y | x,y |...| x,y) (for further separation)*/
(
SELECT *, ROW_NUMBER()OVER(ORDER BY info) cellGroupId
FROM(
SELECT REPLACE(REPLACE(REPLACE(REPLACE(info,'</Y><X>','|'),'</X><Y>',','),'<X>',''),'</Y>','') info
FROM(
SELECT info, MAX(LEN(info))OVER(PARTITION BY id)maxlen FROM cte1
) AS tmpTbl
WHERE maxlen=LEN(info)
)AS tmpTbl
),
cte3 as
/*In this query, we separated strings like (x,y | x,y | x,y |...| x,y) to many (x,y)*/
(
SELECT cellGroupId, CAST(LEFT(XYInfo,CHARINDEX(',',XYInfo)-1) as int) X, CAST(RIGHT(XYInfo,LEN(XYInfo)-CHARINDEX(',',XYInfo)) as int) Y
FROM(
SELECT cellGroupId, tmpTbl2.n.value('.','varchar(MAX)') XYinfo
FROM
(SELECT CAST('<r><c>' + REPLACE(info,'|','</c><c>')+'</c></r>' as XML) n, cellGroupId FROM cte2) AS tmpTbl1
CROSS APPLY n.nodes('/r/c') tmpTbl2(n)
) AS tmpTbl
),
cte4 as
/*In this query, we finally determined group of individual objects*/
(
SELECT cellGroupId,X,Y
FROM(
SELECT cellGroupId,X,Y,ROW_NUMBER()OVER(PARTITION BY X,Y ORDER BY cellGroupId ASC)rn
FROM(
SELECT *,
MAX(SumOfAdjacentCellsByGroup)OVER(PARTITION BY X,Y) Max_SumOfAdjacentCellsByGroup_ByXY /*calculated max value of <the sum of the cells in the group> by each cell*/
FROM(
SELECT *, SUM(1)OVER(PARTITION BY cellGroupId) SumOfAdjacentCellsByGroup /*calculated the sum of the cells in the group*/
FROM cte3
)AS TmpTbl
)AS TmpTbl
/*We got rid of the subgroups (i.e. [(1,2)(2,2)(2,3)] its subgroup of [(1,2)(1,1)(2,2)(2,3)])*/
/*it was second filter*/
WHERE SumOfAdjacentCellsByGroup=Max_SumOfAdjacentCellsByGroup_ByXY
)AS TmpTbl
/*We got rid of the same groups (i.e. [(1,1)(1,2)(2,2)(2,3)] its same as [(1,2)(1,1)(2,2)(2,3)])*/
/*it was third filter*/
WHERE rn=1
)
SELECT X,Y /*result*/
FROM(SELECT a.X,a.Y, ROW_NUMBER()OVER(PARTITION BY cellGroupId ORDER BY id)rn
FROM cte4 a JOIN Tbl1 b ON a.X=b.X AND a.Y=b.Y)a /*connect back*/
WHERE rn=1 /*first XY coordinate*/
Let's assume your coordinates are stored in X,Y form, something like this:
CREATE Table Puzzle(
id int identity, Y int, X int)
INSERT INTO Puzzle VALUES
(1,1),(1,2),(1,8),(1,9),(1,10),
(2,2),(2,3),(2,8),(2,9),
(3,9),
(4,3),(4,4),
(5,7),(5,8),(5,9),
(6,5)
This query then shows your Puzzle in board form (run in TEXT mode in SQL Management Studio)
SELECT (
SELECT (
SELECT CASE WHEN EXISTS (SELECT *
FROM Puzzle T
WHERE T.X=X.X and T.Y=Y.Y)
THEN 'X' ELSE '.' END
FROM (values(0),(1),(2),(3),(4),(5),
(6),(7),(8),(9),(10),(11)) X(X)
ORDER BY X.X
FOR XML PATH('')) + Char(13) + Char(10)
FROM (values(0),(1),(2),(3),(4),(5),(6),(7)) Y(Y)
ORDER BY Y.Y
FOR XML PATH(''), ROOT('a'), TYPE
).value('(/a)[1]','varchar(max)')
It gives you this
............
.XX.....XXX.
..XX....XX..
.........X..
...XX.......
.......XXX..
.....X......
............
This query done in 4 stages will give you the result of the TopLeft cell, if you define it as the Leftmost cell of the TopMost row.
-- the first table expression joins cells together on the Y-axis
;WITH FlattenOnY(Y,XLeft,XRight) AS (
-- start with all pieces
select Y,X,X
from puzzle
UNION ALL
-- keep connecting rightwards from each cell as far as possible
select B.Y,A.XLeft,B.X
from FlattenOnY A
join puzzle B on A.Y=B.Y and A.XRight+1=B.X
)
-- the second table expression flattens the results from the first, so that
-- it represents ALL the start-end blocks on each row of the Y-axis
,YPieces(Y,XLeft,XRight) as (
--
select Y,XLeft,Max(XRight)
from(
select Y,Min(XLeft)XLeft,XRight
from FlattenOnY
group by XRight,Y)Z
group by XLeft,Y
)
-- here, select * from YPieces will return the "blocks" such as
-- Row 1: 1-2 & 8-10
-- Row 2: 2-3 (equals Y,XLeft,XRight of 2,2,3)
-- etc
-- the third expression repeats the first, except it now combines on the X-axis
,FlattenOnX(Y,XLeft,CurPieceXLeft,CurPieceXRight,CurPieceY) AS (
-- start with all pieces
select Y,XLeft,XLeft,XRight,Y
from YPieces
UNION ALL
-- keep connecting rightwards from each cell as far as possible
select A.Y,A.XLeft,B.XLeft,B.XRight,B.Y
from FlattenOnX A
join YPieces B on A.CurPieceY+1=B.Y and A.CurPieceXRight>=B.XLeft and B.XRight>=A.CurPieceXLeft
)
-- and again we repeat the 2nd expression as the 4th, for the final pieces
select Y,XLeft X
from (
select *, rn2=row_number() over (
partition by Y,XLeft
order by CurPieceY desc)
from (
select *, rn=row_number() over (
partition by CurPieceXLeft, CurPieceXRight, CurPieceY
order by Y)
from flattenOnX
) Z1
where rn=1) Z2
where rn2=1
The result being
Y X
----------- -----------
1 1
1 8
4 3
5 7
6 5
Or is your representation in flat form something like this? If it is, give us a shout and I'll redo the solution
create table Puzzle (
row int,
[0] bit, [1] bit, [2] bit, [3] bit, [4] bit, [5] bit,
[6] bit, [7] bit, [8] bit, [9] bit, [10] bit, [11] bit
)
insert Puzzle values
(0,0,0,0,0,0,0,0,0,0,0,0,0),
(1,0,1,1,0,0,0,0,0,1,1,1,0),
(2,0,0,1,1,0,0,0,0,1,1,0,0),
(3,0,0,0,0,0,0,0,0,0,1,0,0),
(4,0,0,0,1,1,0,0,0,0,0,0,0),
(5,0,0,0,0,0,0,0,1,1,1,0,0),
(6,0,0,0,0,0,1,0,0,0,0,0,0),
(7,0,0,0,0,0,0,0,0,0,0,0,0)