How to modify duplicate strings in SQL Server?

I'm dealing with truncating a column of string values in a table from 15 characters to 10 characters (the new maximum length I want to permit for the column).
There is a unique key on a pair of columns in the table, and this column is one of them.
Because of the truncation, there is a possibility that the unique key could be violated.
For example:
| ID | C1 | C2 |
| -- | --------------- | -- |
| 1 | 123456789012345 | 1 |
| 2 | 123456789012346 | 1 |
| 3 | 123456789012345 | 2 |
| 4 | 123456789012346 | 2 |
Let's say I have a unique key on C1 and C2. C1 is currently varchar(15), but for reasons that are beyond my control, it's being changed to varchar(10).
I have to truncate the values in C1 to strings of length 10. But if I just do so mindlessly, I'll obviously end up (in the example above) violating the unique key constraint.
So, I know how to find all the duplicates using something like:
select
    t1.ID,
    LEFT(t1.C1, 10) as C1,
    t1.C2
INTO #ColumnDuplicates
FROM t t1
join t t2
    on t1.ID <> t2.ID
   AND LEFT(t1.C1, 10) = LEFT(t2.C1, 10)
WHERE t1.C2 = t2.C2

SELECT * FROM #ColumnDuplicates
Referring to the table above, this query would get me:
| ID | C1 | C2 |
| -- | ---------- | -- |
| 1 | 1234567890 | 1 |
| 2 | 1234567890 | 1 |
| 3 | 1234567890 | 2 |
| 4 | 1234567890 | 2 |
Now here's where I'm not sure how to do the next step. What I need to do is somehow get to this:
| ID | C1 | C2 |
| -- | ---------- | -- |
| 1 | 123456_001 | 1 |
| 2 | 123456_002 | 1 |
| 3 | 123456_001 | 2 |
| 4 | 123456_002 | 2 |
Effectively, I want to find all the duplicate C1 values for each C2 value, change the last 4 characters to a _[0-9][0-9][0-9] pattern, and progressively number those duplicates from 000 (or 001, I don't really care which is used as the starting point) up to a maximum of 999. That gives me room for roughly 999 duplicates per C2 value, which, based on my familiarity with the data I'm working with, I'm quite sure will not be an issue.
And then I can easily just use this temporary table to update the C1 values in the main table I am modifying.
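Something along these lines, I imagine (an untested sketch; it assumes #ColumnDuplicates has by then been updated to hold the renumbered C1 values, keyed by ID):
UPDATE tgt
SET tgt.C1 = d.C1
FROM t AS tgt
JOIN #ColumnDuplicates AS d
    ON d.ID = tgt.ID;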
My knowledge of SQL at the moment is quite basic, so I don't really know how to accomplish the renumbering step.

If you are lucky, you can look at duplicates in the first six characters. I say lucky, because this assumes you never have more than 1000 such duplicates:
with toupdate as (
    select t.*,
           row_number() over (partition by left(c1, 6), c2 order by c2) as seqnum,
           count(*) over (partition by left(c1, 6), c2) as cnt
    from t
)
update toupdate
    set c1 = (case when cnt > 1
                   then concat(left(c1, 6), '_', format(seqnum, '000'))
                   else left(c1, 10)
              end);
The above is a little pessimistic with respect to duplicates. It probably makes sense to separate out values that are already unique at 10 characters before applying row_number():
with toupdate as (
    select t.*,
           row_number() over (partition by left(c1, 6), c2,
                                           (case when cnt10 > 1 then 1 else 2 end)
                              order by c2
                             ) as seqnum,
           count(*) over (partition by left(c1, 6), c2,
                                       (case when cnt10 > 1 then 1 else 2 end)
                         ) as cnt6
    from (select t.*,
                 count(*) over (partition by left(c1, 10), c2) as cnt10
          from t
         ) t
)
update toupdate
    set c1 = (case when cnt10 > 1
                   then concat(left(c1, 6), '_', format(seqnum, '000'))
                   else left(c1, 10)
              end);

You can use an updatable CTE to achieve this:
CREATE TABLE dbo.YourTable (ID int NOT NULL,
                            C1 varchar(15) NOT NULL,
                            C2 int NOT NULL);
CREATE UNIQUE INDEX YourIndex ON dbo.YourTable (C1,C2);
GO

INSERT INTO dbo.YourTable (ID, C1, C2)
VALUES (1,'123456789012345',1),
       (2,'123456789012346',1),
       (3,'123456789012345',2),
       (4,'123456789012346',2);
GO

WITH CTE AS(
    SELECT C1,
           LEFT(YT.C1,6) + '_' + RIGHT(CONCAT('000',ROW_NUMBER() OVER (ORDER BY YT.C1, YT.C2 ASC)),3) AS NewC1
    FROM dbo.YourTable YT
    WHERE LEN(YT.C1) > 10) --Unsure if that WHERE is needed
UPDATE CTE
SET C1 = NewC1;
GO

DROP INDEX YourIndex ON dbo.YourTable; --Has to be dropped to alter
ALTER TABLE dbo.YourTable ALTER COLUMN C1 varchar(10) NOT NULL;
GO
CREATE UNIQUE INDEX YourIndex ON dbo.YourTable (C1,C2); --Recreate
GO

SELECT *
FROM dbo.YourTable;
GO

DROP TABLE dbo.YourTable;

Related

Pulling multiple entries based on ROW_NUMBER

I got the row_num column from a partition. I want each Type to match with at least one Sent and one Resent. For example, Jon's row is removed below because there is no Resent. Kim's Sheet row is also removed because again, there is no Resent. I tried using a CTE to take all columns for a Code if row_num = 2 but Kim's Sheet row obviously shows up because they're all under one Code. If anyone could help, that'd be great!
Edit: I'm using SSMS 2018. There are multiple Statuses other than Sent and Resent.
What my table looks like:
+-------+--------+--------+---------+---------+
| Code  | Name   | Type   | Status  | row_num |
+-------+--------+--------+---------+---------+
| 123   | Jon    | Sheet  | Sent    | 1       |
| 221   | Kim    | Sheet  | Sent    | 1       |
| 221   | Kim    | Book   | Resent  | 1       |
| 221   | Kim    | Book   | Sent    | 2       |
| 221   | Kim    | Book   | Sent    | 3       |
+-------+--------+--------+---------+---------+
What I want it to look like:
+-------+--------+--------+---------+---------+
| Code  | Name   | Type   | Status  | row_num |
+-------+--------+--------+---------+---------+
| 221   | Kim    | Book   | Resent  | 1       |
| 221   | Kim    | Book   | Sent    | 2       |
| 221   | Kim    | Book   | Sent    | 3       |
+-------+--------+--------+---------+---------+
Here is my CTE code:
WITH CTE AS
(
    SELECT *
    FROM #MyTable
)
SELECT *
FROM #MyTable
WHERE Code IN (SELECT Code FROM CTE WHERE row_num = 2)
If sent and resent are the only values for status, then you can use:
select t.*
from t
where exists (select 1
              from t t2
              where t2.name = t.name and
                    t2.type = t.type and
                    t2.status <> t.status
             );
You can also phrase this with window functions:
select t.*
from (select t.*,
             min(status) over (partition by name, type) as min_status,
             max(status) over (partition by name, type) as max_status
      from t
     ) t
where min_status <> max_status;
Both of these can be tweaked if other status values are possible. However, based on your question and sample data, that does not seem necessary.
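For example, if statuses other than Sent and Resent can appear (as the edit to the question mentions), one way to adapt the window-function version is to count the two statuses explicitly; a rough sketch, not tested against your data:
select t.*
from (select t.*,
             sum(case when status = 'Sent'   then 1 else 0 end) over (partition by name, type) as cnt_sent,
             sum(case when status = 'Resent' then 1 else 0 end) over (partition by name, type) as cnt_resent
      from t
     ) t
where cnt_sent > 0 and cnt_resent > 0;
This keeps every row for a name/type combination that has at least one Sent and at least one Resent, regardless of what other statuses are present.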
FIDDLE
CREATE TABLE Table1(ID integer,Name VARCHAR(10),Type VARCHAR(10),Status VARCHAR(10),row_num integer);
INSERT INTO Table1 VALUES
('123','Jon','Sheet','Sent','1'),
('221','Kim','Sheet','Sent','1'),
('221','Kim','Book','Resent','1'),
('221','Kim','Book','Sent','2'),
('221','Kim','Book','Sent','3');
SELECT t1.*
FROM Table1 t1
WHERE EXISTS (
          select 1
          from Table1 t2
          where t2.Name = t1.Name
            and t2.Type = t1.Type
            and t2.Status = case when t1.Status = 'Sent'
                                 then 'Resent'
                                 else 'Sent' end)
It would be easier if you provided scripts to create the table and insert the test data, but try something like:
with a1 as (
    select
        name, type,
        row_number() over (partition by code, Name, type, status
                           order by (select null)) as rn
    from #MyTable
), a2 as (
    select * from a1 where rn > 1
)
select t.*
from #MyTable as t
inner join a2 on t.name = a2.name and t.type = a2.type;
Here you calculate another row number, partitioned by code, name, type and status; then you fetch the rows where this new row number is greater than 1; and finally you join those back to the original table to get the rows you are interested in.
Syntax may vary on MSSQL, but you should give it a try. And please use better names than mine ;-)
This solution is quite generic because it doesn't rely on the particular statuses used; nothing is hardcoded, and you can easily control what matters by changing the partitions.
Fiddle

Partitioning function for continuous sequences

There is a table of the following structure:
CREATE TABLE history
(
pk serial NOT NULL,
"from" integer NOT NULL,
"to" integer NOT NULL,
entity_key text NOT NULL,
data text NOT NULL,
CONSTRAINT history_pkey PRIMARY KEY (pk)
);
The pk is a primary key; from and to define a position in the sequence, and the sequence itself, for a given entity identified by entity_key. For example, an entity has one sequence of 2 rows if the first row has from = 1; to = 2 and the second one has from = 2; to = 3. The point here is that the to of the previous row matches the from of the next one.
The order that determines the "next"/"previous" row is defined by pk, which grows monotonically (since it's a SERIAL).
The sequence does not have to start with 1, and to - from is not necessarily always 1. It can be, say, from = 1; to = 10. What matters is that the "next" row in the sequence matches the to exactly.
Sample dataset:
 pk | from | to | entity_key | data
----+------+----+------------+-------------------
  1 |    1 |  2 | 42         | foo
  2 |    2 |  3 | 42         | bar
  3 |    3 |  4 | 42         | baz
  4 |   10 | 11 | 42         | another foo
  5 |   11 | 12 | 42         | another baz
  6 |    1 |  2 | 111        | one one one
  7 |    2 |  3 | 111        | one one one two
  8 |    3 |  4 | 111        | one one one three
What I cannot figure out is how to partition by "sequence" here, so that I can apply window functions to the group that represents a single "sequence".
Let's say I want to use the row_number() function and would like to get the following result:
 pk | row_number | entity_key
----+------------+------------
  1 |          1 | 42
  2 |          2 | 42
  3 |          3 | 42
  4 |          1 | 42
  5 |          2 | 42
  6 |          1 | 111
  7 |          2 | 111
  8 |          3 | 111
For convenience I created an SQLFiddle with initial seed: http://sqlfiddle.com/#!15/e7c1c
PS: This is not a "give me the codez" question; I did my own research and I'm just out of ideas on how to partition.
It's obvious that I need to LEFT JOIN with the next.from = curr.to, but then it's still not clear how to reset the partition on next.from IS NULL.
PS: It will be a 100 points bounty for the most elegant query that provides the requested result
PPS: the desired solution should be a plain SQL query, not PL/pgSQL, due to some other limitations that are out of scope of this question.
I don’t know if it counts as “elegant,” but I think this will do what you want:
with Lagged as (
    select
        pk,
        case when lag("to", 1) over (order by pk) is distinct from "from" then 1 else 0 end as starts,
        entity_key
    from history
), LaggedGroups as (
    select
        pk,
        sum(starts) over (order by pk) as groups,
        entity_key
    from Lagged
)
select
    pk,
    row_number() over (
        partition by groups
        order by pk
    ) as "row_number",
    entity_key
from LaggedGroups
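For what it's worth, the LEFT JOIN idea mentioned in the question can be phrased the same way: flag a row as a sequence start when no previous row feeds into it, then turn a running sum of those flags into the grouping column. A rough sketch, assuming each (entity_key, "to") value links to at most one following row, as in the sample data:
select pk,
       row_number() over (partition by entity_key, grp order by pk) as row_number,
       entity_key
from (select curr.pk,
             curr.entity_key,
             sum(case when prev.pk is null then 1 else 0 end)
                 over (partition by curr.entity_key order by curr.pk) as grp
      from history curr
      left join history prev
             on prev.entity_key = curr.entity_key
            and prev."to" = curr."from"
     ) x
order by entity_key, pk;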
Just for fun & completeness: a recursive solution to reconstruct the (doubly) linked lists of records. [ this will not be the fastest solution ]
NOTE: I commented out the ascending pk condition(s) since they are not needed for the connection logic.
WITH RECURSIVE zzz AS (
    SELECT h0.pk
         , h0."to" AS next
         , h0.entity_key AS ek
         , 1::integer AS rnk
    FROM history h0
    WHERE NOT EXISTS (
        SELECT * FROM history nx
        WHERE nx.entity_key = h0.entity_key
          AND nx."to" = h0."from"
          -- AND nx.pk > h0.pk
        )
    UNION ALL
    SELECT h1.pk
         , h1."to" AS next
         , h1.entity_key AS ek
         , 1+zzz.rnk AS rnk
    FROM zzz
    JOIN history h1
      ON h1.entity_key = zzz.ek
     AND h1."from" = zzz.next
     -- AND h1.pk > zzz.pk
    )
SELECT * FROM zzz
ORDER BY ek, pk
;
You can use generate_series() to generate all the rows between the two values. Then you can use the difference of row numbers on that:
select pk, "from", "to",
       row_number() over (partition by entity_key, min(grp) order by pk) as row_number
from (select h.*,
             (row_number() over (partition by entity_key order by ind) -
              ind) as grp
      from (select h.*, generate_series("from", "to" - 1) as ind
            from history h
           ) h
     ) h
group by pk, "from", "to", entity_key
Because you specify that the difference is between 1 and 10, this might actually not have such bad performance.
Unfortunately, your SQL Fiddle isn't working right now, so I can't test it.
Well, this is not exactly one SQL query, but:
select a.pk as PK, a.entity_key as ENTITY_KEY, b.pk as BPK, 0 as Seq
into #tmp
from history a
left join history b on a."to" = b."from" and a.pk = b.pk - 1

declare @seq int
select @seq = 1

update #tmp
set Seq = case when (BPK is null) then @seq - 1 else @seq end,
    @seq = case when (BPK is null) then @seq + 1 else @seq end

select pk, entity_key, ROW_NUMBER() over (PARTITION by entity_key, seq order by pk asc)
from #tmp
order by pk
This is in SQL Server 2008

Selecting row with highest ID based on another column

In SQL Server 2008 R2, suppose I have a table layout like this...
+----------+---------+-------------+
| UniqueID | GroupID | Title       |
+----------+---------+-------------+
| 1        | 1       | TEST 1      |
| 2        | 1       | TEST 2      |
| 3        | 3       | TEST 3      |
| 4        | 3       | TEST 4      |
| 5        | 5       | TEST 5      |
| 6        | 6       | TEST 6      |
| 7        | 6       | TEST 7      |
| 8        | 6       | TEST 8      |
+----------+---------+-------------+
Is it possible to select, for each GroupID, the row with the highest UniqueID number? So according to the table above, if I ran the query, I would expect this...
+----------+---------+-------------+
| UniqueID | GroupID | Title       |
+----------+---------+-------------+
| 2        | 1       | TEST 2      |
| 4        | 3       | TEST 4      |
| 5        | 5       | TEST 5      |
| 8        | 6       | TEST 8      |
+----------+---------+-------------+
Been chomping on this for a while, but can't seem to crack it.
Many thanks,
SELECT *
FROM (SELECT uniqueid, groupid, title,
             ROW_NUMBER() OVER (PARTITION BY groupid ORDER BY uniqueid DESC) AS rn
      FROM table) a
WHERE a.rn = 1
With SQL Server as the RDBMS, you can use a ranking function like ROW_NUMBER:
WITH CTE AS
(
   SELECT UniqueID, GroupID, Title,
          RN = ROW_NUMBER() OVER (PARTITION BY GroupID
                                  ORDER BY UniqueID DESC)
   FROM dbo.TableName
)
SELECT UniqueID, GroupID, Title
FROM CTE
WHERE RN = 1
This returns exactly one record for each GroupID even if there are multiple rows sharing the highest UniqueID (which the column name suggests should not happen). If you want to return all such rows, use DENSE_RANK instead of ROW_NUMBER, as sketched below.
Here you can see all functions and how they work: http://technet.microsoft.com/en-us/library/ms189798.aspx
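For reference, the DENSE_RANK variant would look something like this (a sketch of the swap, not separately tested):
WITH CTE AS
(
   SELECT UniqueID, GroupID, Title,
          RN = DENSE_RANK() OVER (PARTITION BY GroupID
                                  ORDER BY UniqueID DESC)
   FROM dbo.TableName
)
SELECT UniqueID, GroupID, Title
FROM CTE
WHERE RN = 1
With DENSE_RANK, every row that ties for the highest UniqueID within a GroupID gets RN = 1 and is returned.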
Since you have not mentioned any RDBMS, the statement below will work on almost any RDBMS. The purpose of the subquery is to get the greatest uniqueID for every GroupID. To get the other columns, the result of the subquery is joined back to the original table.
SELECT a.*
FROM tableName a
INNER JOIN
(
    SELECT GroupID, MAX(uniqueID) uniqueID
    FROM tableName
    GROUP BY GroupID
) b ON a.GroupID = b.GroupID
   AND a.uniqueID = b.uniqueID
In the case that your RDBMS supports analytic (window) functions, you can use ROW_NUMBER():
SELECT uniqueid, groupid, title
FROM
(
    SELECT uniqueid, groupid, title,
           ROW_NUMBER() OVER (PARTITION BY groupid
                              ORDER BY uniqueid DESC) rn
    FROM tableName
) x
WHERE x.rn = 1
TSQL Ranking Functions
ROW_NUMBER() generates a sequential number which you can then filter on. In this case the sequential number is generated per groupid, sorted by uniqueid in descending order, so the greatest uniqueid in each group ends up with rn = 1.
SELECT *
FROM the_table tt
WHERE NOT EXISTS (
    SELECT *
    FROM the_table nx
    WHERE nx.GroupID = tt.GroupID
      AND nx.UniqueID > tt.UniqueID
)
;
This should work in any DBMS (no window functions or CTEs are needed), and it is probably faster than a subquery with an aggregate.
Keeping it simple:
select * from test2
where UniqueID in (select max(UniqueID) from test2 group by GroupID)
Considering:
create table test2
(
    UniqueID numeric,
    GroupID numeric,
    Title varchar(100)
)
insert into test2 values(1,1,'TEST 1')
insert into test2 values(2,1,'TEST 2')
insert into test2 values(3,3,'TEST 3')
insert into test2 values(4,3,'TEST 4')
insert into test2 values(5,5,'TEST 5')
insert into test2 values(6,6,'TEST 6')
insert into test2 values(7,6,'TEST 7')
insert into test2 values(8,6,'TEST 8')

Grouping SQL Results based on order

I have table with data something like this:
ID | RowNumber | Data
------------------------------
1  | 1         | Data
2  | 2         | Data
3  | 3         | Data
4  | 1         | Data
5  | 2         | Data
6  | 1         | Data
7  | 2         | Data
8  | 3         | Data
9  | 4         | Data
I want to group each set of RowNumbers so that my result is something like this:
ID | RowNumber | Group | Data
--------------------------------------
1  | 1         | a     | Data
2  | 2         | a     | Data
3  | 3         | a     | Data
4  | 1         | b     | Data
5  | 2         | b     | Data
6  | 1         | c     | Data
7  | 2         | c     | Data
8  | 3         | c     | Data
9  | 4         | c     | Data
The only way I know where each group starts and stops is that the RowNumber starts over. How can I accomplish this? It also needs to be fairly efficient, since the table I need to do this on has 52 million rows.
Additional Info
ID is truly sequential, but RowNumber may not be. I think RowNumber will always begin with 1 but for example the RowNumbers for group1 could be "1,1,2,2,3,4" and for group2 they could be "1,2,4,6", etc.
For the clarified requirements in the comments:
The rownumbers for group1 could be "1,1,2,2,3,4" and for group2 they could be "1,2,4,6" ... a higher number followed by a lower one would be a new group.
A SQL Server 2012 solution could be as follows.
Use LAG to access the previous row and set a flag to 1 if the current row is the start of a new group, or 0 otherwise.
Calculate a running sum of these flags to use as the grouping value.
Code
WITH T1 AS
(
    SELECT *,
           LAG(RowNumber) OVER (ORDER BY ID) AS PrevRowNumber
    FROM YourTable
), T2 AS
(
    SELECT *,
           IIF(PrevRowNumber IS NULL OR PrevRowNumber > RowNumber, 1, 0) AS NewGroup
    FROM T1
)
SELECT ID,
       RowNumber,
       Data,
       SUM(NewGroup) OVER (ORDER BY ID
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM T2
SQL Fiddle
Assuming ID is the clustered index, the plan for this has one scan against YourTable and avoids any sort operations.
If the ids are truly sequential, you can do:
select t.*,
       (id - rowNumber) as grp
from t
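If you want the groups numbered consecutively rather than by that raw difference, you could wrap it in dense_rank(); a sketch, under the assumption that both id and RowNumber increase by 1 within each group:
select t.*,
       dense_rank() over (order by id - rowNumber) as grp
from t;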
You can also use a recursive CTE:
;WITH cte AS
(
    SELECT ID, RowNumber, Data, 1 AS [Group]
    FROM dbo.test1
    WHERE ID = 1

    UNION ALL

    SELECT t.ID, t.RowNumber, t.Data,
           CASE WHEN t.RowNumber != 1 THEN c.[Group] ELSE c.[Group] + 1 END
    FROM dbo.test1 t JOIN cte c ON t.ID = c.ID + 1
)
SELECT *
FROM cte
Demo on SQLFiddle
How about:
select ID, RowNumber, Data, dense_rank() over (order by grp) as Grp
from (
    select *, (select min(ID) from [Your Table] where ID > t.ID and RowNumber = 1) as grp
    from [Your Table] t
) t
order by ID
This should work on SQL 2005. You could also use rank() instead if you don't care about consecutive numbers.

How do I get LIKE and COUNT to return the number of rows less than a value not in the row?

For example:
SELECT COUNT(ID) FROM My_Table
WHERE ID <
(SELECT ID FROM My_Table
WHERE ID LIKE '%4'
ORDER BY ID LIMIT 1)
My_Table:
  X    ID    Y
------------------------
|    | A1 |    |
------------------------
|    | B2 |    |
------------------------
|    | C3 |    |
------------------------   -----Page 1
|    | D3 |    |
------------------------
|    | E3 |    |
------------------------
|    | F5 |    |
------------------------   -----Page 2
|    | G5 |    |
------------------------
|    | F6 |    |
------------------------
|    | G7 |    |            -----Page 3
There is no ID ending in 4, but there are still 5 rows whose IDs end in something less than 4.
However, in this case there is no match, so SQLite just returns 0.
I understand it is not there, but how do I change this behavior so it still returns the number of rows before it, as if it were there?
Any suggestions?
Thank You.
SELECT COUNT(ID) FROM My_Table
WHERE ID < (SELECT ID FROM My_Table
            WHERE CAST(substr(ID, 2) AS INTEGER) >= 4
            ORDER BY ID LIMIT 1)
Assuming there is always one letter before the number part of the id field, you may want to try the following:
SELECT COUNT(*) FROM my_table WHERE CAST(substr(id, 2) as int) <= 4;
Test case:
CREATE TABLE my_table (id char(2));
INSERT INTO my_table VALUES ('A1');
INSERT INTO my_table VALUES ('B2');
INSERT INTO my_table VALUES ('C3');
INSERT INTO my_table VALUES ('D3');
INSERT INTO my_table VALUES ('E3');
INSERT INTO my_table VALUES ('F5');
INSERT INTO my_table VALUES ('G5');
INSERT INTO my_table VALUES ('F6');
INSERT INTO my_table VALUES ('G7');
Result:
5
UPDATE: Further to the comment below, you may want to consider using the ltrim() function:
The ltrim(X,Y) function returns a string formed by removing any and all characters that appear in Y from the left side of X.
Example:
SELECT COUNT(*)
FROM my_table
WHERE CAST(ltrim(id, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') as int) <= 4;
Test case (adding to the above):
INSERT INTO my_table VALUES ('ABC1');
INSERT INTO my_table VALUES ('ZWY2');
New Result:
7
In MySQL that would be:
SELECT COUNT(ID)
FROM My_Table
WHERE ID <
      (
        SELECT id
        FROM (
              SELECT ID
              FROM My_Table
              WHERE ID LIKE '%4'
              ORDER BY ID
              LIMIT 1
             ) q
        UNION ALL
        SELECT MAX(id)
        FROM My_Table
        LIMIT 1
      )