Unique combination of multiple columns, order doesn't matter

Suppose a table has 3 columns. Each row represents a unique combination of values:
a a a
a a b
a b a
b b a
b b c
c c a
...
However, what I want is for permutations of the same values to be treated as equal:
aab = baa = aba
cca = cac = acc
...
Finally, I want to get these values in a CSV format as a combination for each value like the image that I attached.
Thanks for your help!
Below is the query to generate my problem, please take a look!
--=======================================
--populate test data
--=======================================
drop table if exists #t0
;
with
cte_tally as
(
select row_number() over (order by (select 1)) as n
from sys.all_columns
)
select
char(n) as alpha
into #t0
from
cte_tally
where
(n > 64 and n < 91) or
(n > 96 and n < 123);
drop table if exists #t1
select distinct upper(alpha) alpha into #t1 from #t0
drop table if exists #t2
select
a.alpha c1
, b.alpha c2
, c.alpha c3
, row_number()over(order by (select 1)) row_num
into #t2
from #t1 a
join #t1 b on 1=1
join #t1 c on 1=1
drop table if exists #t3
select *
into #t3
from (
select *
from #t2
) p
unpivot
(cvalue for c in (c1,c2,c3)
) unpvt
select
row_num
, c
, cvalue
from #t3
order by 1,2
--=======================================
--these three rows should be treated equally
--=======================================
select *
from #t2
where concat(c1,c2,c3) in ('ABA','AAB', 'BAA')
--=======================================
--what i've tried...
--row count is actually correct, but the problem is that it omits letters wherever a row contains a duplicate letter.
--=======================================
select
distinct
stuff((
select
distinct
'.' + cvalue
from #t3 a
where a.row_num = h.row_num
for xml path('')
),1,1,'') as comb
from #t3 h

As pointed out in the comments, you can unpivot the values, sort them in the right order and reaggregate them into a single row. Then you can group the original rows by those new values.
SELECT v.a, v.b, v.c
FROM #t2
CROSS APPLY (
SELECT a = MIN(val), b = MIN(CASE WHEN rn = 2 THEN val END), c = MAX(val)
FROM (
SELECT *, rn = ROW_NUMBER() OVER (ORDER BY val)
FROM (VALUES (c1),(c2),(c3) ) v3(val)
) v2
) v
GROUP BY v.a, v.b, v.c;
Really, what you should perhaps do, is ensure that the values are in the correct order in the first place:
ALTER TABLE #t2
ADD CONSTRAINT t2_ValuesOrder
CHECK (c1 <= c2 AND c2 <= c3);
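
The idea behind this answer (sort the values within each row, then deduplicate on the sorted form) can be sketched outside SQL. The Python below is only an illustration: the three-letter alphabet and the helper name canonical_csv are assumptions made for the example, not part of the original query.

```python
from itertools import product

def canonical_csv(*values):
    """Sort the column values so every permutation maps to the same string."""
    return ",".join(sorted(values))

# Stand-in for #t2: every ordered triple over a small, assumed alphabet.
rows = list(product("ABC", repeat=3))

# DISTINCT over the canonical form leaves one entry per unordered combination.
unique_combos = sorted({canonical_csv(*row) for row in rows})

print(len(rows), len(unique_combos))
print(canonical_csv("B", "A", "A"))
```

Because the sorted form is the grouping key, AAB, ABA and BAA all collapse to the same entry, and the key is already in the CSV shape the question asks for.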

I'd be curious why you need this, though I'm sure you have a reason. I might suggest a lookup table (a "Mapping Table") that holds all the associated keys; you can optimize some of this as you implement it. First create one table for holding the "Next/New Key" (this is where the 1, 2, 3... come from). You get a new "New Key" after each batch of records you bulk insert into your "Mapping Table". The "Mapping Table" holds one row for each ordering of the key values, along with your "New Key". You should get a table looking something like:
A, B, C, 1
A, C, B, 1
B, A, C, 1
...
X, Y, Z, 2
X, Z, Y, 2
If you can update your source table to hold a column for your "Mapping Key" (the 1, 2, 3...), then you just look it up from the mapping table (for example where c1=a, c2=a, c3=b); the order of the values shouldn't matter for this lookup. I'd suggest creating a composite unique key on c1, c2, c3 in your mapping table. Then, to get your records, look up the mapping key value in the mapping table and query for the records matching it. Or, if you don't do a pre-lookup to get the mapping key, you should be able to do a self-join using the mapping key value...

If you want them in a CSV format (this uses STRING_AGG, which is available from SQL Server 2017):
select distinct v.cs
from #t2 t2 cross apply
(select string_agg(c, ',') within group (order by c desc) as cs
from (values (t2.c1), (t2.c2), (t2.c3)
) v(c)
) v;

It seems to me that what you need is some form of masking*. Take this fiddle:
http://sqlfiddle.com/#!18/fc67f/8
where I have created a mapping table that contains all of the possible values and paired each one with an increasing power of 10. Doing a cross join on that map table, concatenating the values, adding the masks and grouping on the total will yield all the unique combinations.
Here is the code from the fiddle:
CREATE TABLE maps (
val varchar(1),
num int
);
INSERT INTO maps (val, num) VALUES ('a', 1), ('b', 10), ('c', 100);
SELECT mask, max(vals) as val
FROM (
SELECT concat(m1.val, m2.val, m3.val) as vals,
m1.num + m2.num + m3.num as mask
FROM maps m1
CROSS JOIN maps m2
CROSS JOIN maps m3
) q GROUP BY mask
Using these powers of 10 ensures that the mask holds a count for each value, one per decimal digit of the resulting number, so you can group on it to get the unique(ish) strings.
I don't know what your data looks like. The base only has to be larger than the number of columns, so that a count never carries into the next digit, but with many distinct values the mask grows quickly (26 letters in base 10 would not fit in an int or a bigint), so you may have to use a smaller base. The theory should still apply. I didn't write code to extract the columns from the value table into the mapping table, but I'm sure you can do that.
*actually, I think the term I was looking for was flag.
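
To make the mask idea concrete, here is a small Python sketch of the same arithmetic. It is not the fiddle's code: the base of columns + 1 is an assumption chosen for the illustration (the fiddle uses 10), and the names mask, weights and groups are hypothetical.

```python
from itertools import product

def mask(row, weights):
    """Sum the per-value weights; each base-N digit counts one value's occurrences."""
    return sum(weights[v] for v in row)

values = ["a", "b", "c"]
columns = 3
base = columns + 1  # assumption: any base larger than the column count avoids carries
weights = {v: base ** i for i, v in enumerate(values)}  # a -> 1, b -> 4, c -> 16

# Group every ordered triple by its mask, like GROUP BY mask in the query above.
groups = {}
for row in product(values, repeat=columns):
    groups.setdefault(mask(row, weights), []).append("".join(row))

for m in sorted(groups):
    print(m, groups[m])
```

Each group holds exactly the permutations of one combination, because a value can appear at most three times and so its digit never overflows into the next one.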

Related

I need to be able to generate non-repetitive 8 character random alphanumeric for 2.5 million records

I need to be able to apply unique 8 character strings per row on a table that has almost 2.5 million records.
I have tried this:
UPDATE MyTable
SET [UniqueID]=SUBSTRING(CONVERT(varchar(255), NEWID()), 1, 8)
Which works, but when I check the uniqueness of the ID's, I receive duplicates
SELECT [UniqueID], COUNT([UniqueID])
FROM NicoleW_CQ_2019_Audi_CR_Always_On_2019_T1_EM
GROUP BY [UniqueID]
HAVING COUNT([UniqueID]) > 1
I really would just like to update the table, as above, with just a simple line of code, if possible.

Here's a way that uses a temporary table to ensure uniqueness.
First create and fill a #temporary table with unique random 8-character codes.
The SQL below uses a FOR XML trick to generate the codes in BASE62: [A-Za-z0-9]
Examples: 8Phs7ZYl, ugCKtPqT, U9soG39q
A GUID only uses the characters [0-9A-F].
For 8 characters, that gives 16^8 = 4,294,967,296 combinations.
With BASE62 there are 62^8 = 218,340,105,584,896 (about 2.18e14) combinations.
So the odds that a duplicate is generated are significantly lower with BASE62.
The temp table should have an equal or larger number of records than the destination table.
This example only generates 100000 codes, but you get the idea.
IF OBJECT_ID('tempdb..#tmpRandoms') IS NOT NULL DROP TABLE #tmpRandoms;
CREATE TABLE #tmpRandoms (
ID INT PRIMARY KEY IDENTITY(1,1),
[UniqueID] varchar(8),
CONSTRAINT UC_tmpRandoms_UniqueID UNIQUE ([UniqueID])
);
WITH DIGITS AS
(
select n
from (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)
),
NUMS AS
(
select (d5.n*10000 + d4.n*1000 + d3.n*100 + d2.n * 10 + d1.n) as n
from DIGITS d1
cross join DIGITS d2
cross join DIGITS d3
cross join DIGITS d4
cross join DIGITS d5
)
INSERT INTO #tmpRandoms ([UniqueID])
SELECT DISTINCT LEFT(REPLACE(REPLACE((select CAST(NEWID() as varbinary(16)), n FOR XML PATH(''), BINARY BASE64),'+',''),'/',''), 8) AS [UniqueID]
FROM NUMS;
Then update your table with it
WITH CTE AS
(
SELECT ROW_NUMBER() OVER (ORDER BY ID) AS RN, [UniqueID]
FROM YourTable
)
UPDATE t
SET t.[UniqueID] = tmp.[UniqueID]
FROM CTE t
JOIN #tmpRandoms tmp ON tmp.ID = t.RN;
A test on rextester here
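
To put rough numbers on the GUID versus BASE62 comparison in this answer, the sketch below uses the standard birthday-problem approximation. It assumes the codes are uniformly random over the whole keyspace, which is only an approximation of what NEWID() prefixes give you, and the function name is made up for the example.

```python
def expected_colliding_pairs(n_rows: int, keyspace: int) -> float:
    """Birthday-problem approximation: about n*(n-1)/(2*N) pairs share a value."""
    return n_rows * (n_rows - 1) / (2 * keyspace)

rows = 2_500_000  # table size from the question

hex_pairs = expected_colliding_pairs(rows, 16 ** 8)      # 8 hex characters of a GUID
base62_pairs = expected_colliding_pairs(rows, 62 ** 8)   # 8 BASE62 characters

print(f"hex: {hex_pairs:.0f} colliding pairs expected")
print(f"base62: {base62_pairs:.4f} colliding pairs expected")
```

Under that assumption, several hundred duplicate pairs are expected with 8 hex characters at 2.5 million rows, which matches the duplicates seen in the question, while BASE62 makes a collision unlikely but still possible. That is why the unique constraint on the temp table is still worth having.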

Can you just use numbers and assign a randomish value?
with toupdate as (
select t.*,
row_number() over (order by newid()) as random_enough
from mytable t
)
update toupdate
set UniqueID = right(concat('00000000', random_enough), 8);

See: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/a289ed64-2038-415e-9f5d-ae84e50fe702/generate-random-string-of-length-5-az09?forum=transactsql
Alter DECLARE @s char(5) and SELECT TOP (5) c1 to the length you want.

Remove duplicated subsets from very large table

The data I'm working with is fairly complicated, so I'm just going to provide a simpler example so I can hopefully expand that out to what I'm working on.
Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.
I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.
Example:
GroupID Subset Value
------- ---- ----
1 a 1
1 a 2
1 a 3
1 b 1
1 b 3
1 b 5
1 c 1
1 c 3
1 c 5
2 a 1
2 a 2
2 a 3
2 b 4
2 b 5
2 b 6
2 c 1
2 c 3
2 c 6
So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c'; it doesn't matter which, since both contain Values 1, 3, 5. For GroupID 2, none of the sets are duplicated, so none are removed.
Here's the code I used to solve this on a small scale. It works great, but when applied to 10+ Million records...you can imagine it would be very slow (I was later informed of the number of records, the sample data I was given was much smaller)...:
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6)
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value]
SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
SELECT t1.GroupID, t1.SubSet
, NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value]) FROM @values t2 WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet ORDER BY t2.[Value] FOR XML PATH(''))
FROM @values t1
GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues
All I'm doing here is grouping by GroupID and Subset and concatenating all of the values into a comma delimited string...and then taking that and grouping on GroupID and Value list, and taking the MIN subset.

I'd go with something like this:
;with cte as
(
select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
from @values v
group by v.GroupID, v.SubSet
)
delete v
from @values v
join
(
select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
from cte c1
join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
)x on v.GroupID = x.GroupID and v.SubSet = x.SubSet
select *
from @values

From the documentation for CHECKSUM_AGG:
The CHECKSUM_AGG result does not depend on the order of the rows in
the table.
The flip side of an order-independent aggregate is that different sets of values can produce the same result. With a plain sum, for example, 1 + 2 + 3 = 3 + 2 + 1 = 3 + 3 = 6.
HashBytes, by contrast, is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as for other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)
The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.
-- Thanks for the sample data!
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6);
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value];
with
DistinctGroups as (
select distinct GroupId, Subset
from @values ),
GroupConcatenatedValues as (
select GroupId, Subset, Convert( VarBinary(256), (
select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
from @values as V
where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
order by Value
for XML Path('') ), 2 ) as GroupedBinary
from DistinctGroups as DG )
-- To see the intermediate results from the CTE you can use one of the
-- following two queries instead of the last select :
-- select * from DistinctGroups;
-- select * from GroupConcatenatedValues;
select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;
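
The contrast drawn in this answer, an order-independent aggregate versus a hash over the sorted, concatenated values, can be illustrated with a short Python sketch. This is not SQL Server's implementation of anything: the sum and XOR functions are just two examples of commutative aggregates, MD5 is swapped in for the MD4 used above, and the 4-byte big-endian packing is meant to mirror Cast( Value as Binary(4) ).

```python
import hashlib
import struct
from functools import reduce
from operator import xor

def order_free_sum(values):
    """A commutative aggregate: cheap, but different sets can give the same result."""
    return sum(values)

def order_free_xor(values):
    """Another commutative aggregate with the same weakness."""
    return reduce(xor, values, 0)

def sorted_hash(values):
    """Hash the values after sorting, each packed as a 4-byte big-endian integer."""
    payload = b"".join(struct.pack(">i", v) for v in sorted(values))
    return hashlib.md5(payload).hexdigest()

print(order_free_sum([1, 2, 3]), order_free_sum([3, 3]))  # same result, different sets
print(order_free_xor([1, 2, 3]), order_free_xor([5, 5]))  # same result, different sets
print(sorted_hash([3, 1, 5]) == sorted_hash([1, 3, 5]))   # input order does not matter
print(sorted_hash([1, 2, 3]) == sorted_hash([3, 3]))      # different sets, different hash
```

Sorting before hashing is what makes the hash usable as a subset fingerprint: row order is normalized away, while the hash itself still separates sets that a simple aggregate would confuse.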

You can use checksum_agg() over a set of rows. If the checksums are the same, this is strong evidence that the 'values' columns are equal within the grouped fields.
In the 'getChecksums' cte below, I group by the group and subset, with a checksum based on your 'value' column.
In the 'maybeBadSubsets' cte, I put a row_number over each aggregation just to identify the 2nd+ row in the event the checksums match.
Finally, I delete any subgroups so identified.
with
getChecksums as (
select groupId,
subset,
cs = checksum_agg(value)
from @values v
group by groupId,
subset
),
maybeBadSubsets as (
select groupId,
subset,
cs,
deleteSubset =
case
when row_number() over (
partition by groupId, cs
order by subset
) > 1
then 1
end
from getChecksums
)
delete v
from @values v
where exists (
select 0
from maybeBadSubsets mbs
where v.groupId = mbs.groupId
and v.SubSet = mbs.subset
and mbs.deleteSubset = 1
);
I don't know what the exact likelihood is for checksums to match. If you're not comfortable with the false positive rate, you can still use it to eliminate some branches in a more algorithmic approach in order to vastly improve performance.
Note: CTEs can have a quirk performance-wise. If you find that the query engine is running 'maybeBadSubsets' for each row of @values, you may need to put its results into a temp table or table variable before using it. But I believe with 'exists' you're okay as far as that goes.
EDIT:
I didn't catch it, but as the OP noticed, checksum_agg seems to perform very poorly in terms of false hits/misses. I suspect it might be due to the simplicity of the input. I changed
cs = checksum_agg(value)
above to
cs = checksum_agg(convert(int,hashbytes('md5', convert(char(1),value))))
and got better results. But I don't know how it would perform on larger datasets.

Guarantee random inserting

I am trying to pregenerate some alphanumeric strings and insert the result into a table. The length of string will be 5. Example: a5r67. Basically I want to generate some readable strings for customers so they can access their orders like
www.example.com/order/a5r67. Now I have a select statement:
;WITH
cte1 AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t)),
cte2 AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t)),
cte3 AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t)),
cte4 AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t)),
cte5 AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t))
INSERT INTO ProductHandles(ID, Used)
SELECT cte1.t + cte2.t + cte3.t + cte4.t + cte5.t, 0
FROM cte1
CROSS JOIN cte2
CROSS JOIN cte3
CROSS JOIN cte4
CROSS JOIN cte5
Now the problem is I need to write something like this to get a value from the table:
SELECT TOP 1 ID
FROM ProductHandles
WHERE Used = 0
I will have an index on the Used column so it will be fast. The problem with this is that the values come out in order:
00000
00001
00002
...
I know that I can order by NEWID(), but that will be much slower. I know that there is no guarantee of ordering unless we specify an ORDER BY clause. What I need is the opposite: guaranteed chaos, but without ordering by NEWID() each time a customer creates an order.
I am going to use it like:
WITH cte as (
SELECT TOP 1 * FROM ProductHandles WHERE Used = 0
--I don't want to order by newid() here as it will be slow
)
UPDATE cte
SET Used = 1
OUTPUT INSERTED.ID

If you add an identity column to the table, and use order by newid() when inserting the records (that will be slow, but it's a one-time thing that's being done offline, from what I understand), then you can use order by on the identity column to select the records in the order they were inserted into the table.
From the Limitations and Restrictions part of the INSERT page in Microsoft Docs:
INSERT queries that use SELECT with ORDER BY to populate rows guarantees how identity values are computed but not the order in which the rows are inserted.
This means that by doing this you are effectively making the identity column follow the same random order in which the rows were selected in the insert...select statement.
Also, there is no need to repeat the same cte 5 times; you can reference a single cte five times in the cross joins:
CREATE TABLE ProductHandles(sort int identity(1,1), ID char(5), used bit)
;WITH
cte AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t))
INSERT INTO ProductHandles(ID, Used)
SELECT a.t + b.t + c.t + d.t + e.t, 0
FROM cte a
CROSS JOIN cte b
CROSS JOIN cte c
CROSS JOIN cte d
CROSS JOIN cte e
ORDER BY NEWID()
Then the cte can have an order by clause that guarantees the same random order as the rows returned from the select statement populating this table:
WITH cte as (
SELECT TOP 1 *
FROM ProductHandles
WHERE Used = 0
ORDER BY sort
)
UPDATE cte
SET Used = 1
OUTPUT INSERTED.ID
You can see a live demo on rextester. (with only digits since it's taking too long otherwise)
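
The pattern in this answer (pay for the shuffle once at insert time, then hand values out in identity order) can be sketched in Python. Everything here is a stand-in chosen for illustration: the list of dicts plays the role of ProductHandles, the three-digit alphabet keeps the demo small, and next_handle is a hypothetical helper, not part of the answer's SQL.

```python
import random
from itertools import product

alphabet = "0123456789"  # digits only, to keep the demo small
handles = ["".join(p) for p in product(alphabet, repeat=3)]

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
rng.shuffle(handles)     # the one-time ORDER BY NEWID() at insert time

# "sort" plays the role of the identity column; "used" starts at 0 for every row.
table = [{"sort": i + 1, "id": h, "used": 0} for i, h in enumerate(handles)]

def next_handle(table):
    """Take the unused row with the lowest sort value and mark it used.

    Raises ValueError once every handle has been used.
    """
    row = min((r for r in table if r["used"] == 0), key=lambda r: r["sort"])
    row["used"] = 1
    return row["id"]

first = next_handle(table)
second = next_handle(table)
print(first, second)
```

Handing out rows is now a cheap ordered lookup, while the sequence customers see is still the shuffled one.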

Here's a slightly different option...
Rather than trying to generate all possible values in a single sitting, you could simply generate a million or two at a time and generate more as they get used up.
Using this approach, you drastically reduce the initial creation time and eliminate the need to maintain a massive table of values, the majority of which will never be used.
CREATE TABLE dbo.ProductHandles (
rid INT NOT NULL
CONSTRAINT pk_ProductHandles
PRIMARY KEY CLUSTERED,
ID_Value CHAR(5) NOT NULL
CONSTRAINT uq_ProductHandles_IDValue
UNIQUE WITH (IGNORE_DUP_KEY = ON), -- prevents the insertion of duplicate values w/o generating any errors.
Used BIT NOT NULL
CONSTRAINT df_ProductHandles_Used
DEFAULT (0)
);
-- Create a filtered index to help facilitate fast searches
-- of unused values.
CREATE NONCLUSTERED INDEX ixf_ProductHandles_Used_rid
ON dbo.ProductHandles (Used, rid)
INCLUDE(ID_Value)
WHERE Used = 0;
--==========================================================
WHILE 1 = 1 -- The while loop will attempt to insert new rows, in 1M blocks, until required minimum of unused values are available.
BEGIN
IF (SELECT COUNT(*) FROM dbo.ProductHandles ph WHERE ph.Used = 0) > 1000000 -- the minimum num of unused ID's you want to keep on hand.
BEGIN
BREAK;
END;
ELSE
BEGIN
WITH
cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),
cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
cte_n3 (n) AS (SELECT 1 FROM cte_n2 a CROSS JOIN cte_n2 b),
cte_Tally (n) AS (
SELECT TOP (1000000) -- Sets the "block size" of each insert attempt.
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM
cte_n3 a CROSS JOIN cte_n3 b
)
INSERT dbo.ProductHandles (rid, ID_Value, Used)
SELECT
t.n + ISNULL((SELECT MAX(ph.rid) FROM dbo.ProductHandles ph), 0),
CONCAT(ISNULL(c1.char_1, n1.num_1), ISNULL(c2.char_2, n2.num_2), ISNULL(c3.char_3, n3.num_3), ISNULL(c4.char_4, n4.num_4), ISNULL(c5.char_5, n5.num_5)),
0
FROM
cte_Tally t
-- for each of the 5 positions, randomly generate a number between 0 and 35.
-- 0-9 are left as numbers.
-- 10-35 are converted to lowercase letters (CHAR(97) = 'a' through CHAR(122) = 'z').
CROSS APPLY ( VALUES (ABS(CHECKSUM(NEWID())) % 36) ) n1 (num_1)
CROSS APPLY ( VALUES (CHAR(CASE WHEN n1.num_1 > 9 THEN n1.num_1 + 87 END)) ) c1 (char_1)
CROSS APPLY ( VALUES (ABS(CHECKSUM(NEWID())) % 36) ) n2 (num_2)
CROSS APPLY ( VALUES (CHAR(CASE WHEN n2.num_2 > 9 THEN n2.num_2 + 87 END)) ) c2 (char_2)
CROSS APPLY ( VALUES (ABS(CHECKSUM(NEWID())) % 36) ) n3 (num_3)
CROSS APPLY ( VALUES (CHAR(CASE WHEN n3.num_3 > 9 THEN n3.num_3 + 87 END)) ) c3 (char_3)
CROSS APPLY ( VALUES (ABS(CHECKSUM(NEWID())) % 36) ) n4 (num_4)
CROSS APPLY ( VALUES (CHAR(CASE WHEN n4.num_4 > 9 THEN n4.num_4 + 87 END)) ) c4 (char_4)
CROSS APPLY ( VALUES (ABS(CHECKSUM(NEWID())) % 36) ) n5 (num_5)
CROSS APPLY ( VALUES (CHAR(CASE WHEN n5.num_5 > 9 THEN n5.num_5 + 87 END)) ) c5 (char_5);
END;
END;
After the initial creation, move the code in the WHILE loop to a stored procedure and schedule it to automatically run on a periodic basis.

If I'm understanding this right, it looks like you're attempting to separate the URL/visible data from the DB record ID, as most apps do, and provide something the user will see that is not directly related to an ID field. NEWID() itself takes no length argument, but you can convert it to a string and keep just a portion of it, which gives you a smaller field with a smaller index:
SELECT CONVERT(varchar(255), NEWID())
SELECT SUBSTRING(CONVERT(varchar(40), NEWID()),1,5)
You might also want to look at a checksum field; I don't know if it's faster on indexing, though. You could get crazier by combining a random NEWID() with a checksum across 2 or 3 fields.
SELECT BINARY_CHECKSUM(5 ,'EP30461105',1)

Unpivoting multiple columns

I have a table in SQL Server 2014 called anotes with the following data
and I want to add this data into another table named final as
ID Notes NoteDate
With text1, text2, text3, text4 going into the Notes column in the final table and Notedate1,notedate2,notedate3,notedate4 going into Notedate column.
I tried unpivoting the data with notes first as:
select createdid, temp
from (select createdid,text1,text2,text3,text4 from anotes) p
unpivot
(temp for note in(text1,text2,text3,text4)) as unpvt
order by createdid
Which gave me proper results:
and then for the dates part I used another unpivot query:
select createdid,temp2
from (select createdid,notedate1,notedate2,notedate3,notedate4 from anotes) p
unpivot (temp2 for notedate in(notedate1,notedate2,notedate3,notedate4)) as unpvt2
which also gives me proper results:
Now I want to add this data into my final table.
and I tried the following query and it results into a cross join :(
select a.createdid, a.temp, b.temp2
from (select createdid, temp
from (select createdid,text1,text2,text3,text4 from anotes) p
unpivot
(temp for note in(text1,text2,text3,text4)) as unpvt) a inner join (select createdid,temp2
from (select createdid,notedate1,notedate2,notedate3,notedate4 from anotes) p
unpivot (temp2 for notedate in(notedate1,notedate2,notedate3,notedate4)) as unpvt) b on a.createdid=b.createdid
The output is as follows:
Is there any way where I can unpivot both the columns at the same time?
Or use two select queries to add that data into my final table?
Thanks in advance!

I would say the most concise, and probably most efficient way to unpivot multiple columns is to use CROSS APPLY along with a table valued constructor:
SELECT t.CreatedID, upvt.Text, upvt.NoteDate
FROM anotes t
CROSS APPLY
(VALUES
(Text1, NoteDate1),
(Text2, NoteDate2),
(Text3, NoteDate3),
(Text4, NoteDate4),
(Text5, NoteDate5),
(Text6, NoteDate6),
(Text7, NoteDate7)
) upvt (Text, NoteDate);
Simplified Example on SQL Fiddle
ADDENDUM
I find the concept quite a hard one to explain, but I'll try. A table valued constructor is simply a way of defining a table on the fly, so
SELECT *
FROM (VALUES (1, 1), (2, 2)) t (a, b);
will create a table with alias t and the data:
a b
------
1 1
2 2
So when you use it inside the APPLY you have access to all the outer columns, so it is just a matter of defining your constructed tables with the correct pairs of values (i.e. text1 with date1).
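
As a language-neutral picture of what the CROSS APPLY (VALUES ...) above does, the Python sketch below emits one output row per text/date pair for each input row. The sample rows and the helper name unpivot_pairs are made up for the illustration; only the column naming (createdid, textN, notedateN) comes from the question.

```python
def unpivot_pairs(row, n_pairs):
    """Yield one (id, text, date) tuple per textN/notedateN pair, like the VALUES list."""
    for i in range(1, n_pairs + 1):
        yield (row["createdid"], row[f"text{i}"], row[f"notedate{i}"])

# Hypothetical stand-in for the anotes table, with two pairs per row.
anotes = [
    {"createdid": 1, "text1": "a", "notedate1": "2015-01-01",
     "text2": "b", "notedate2": "2015-01-02"},
    {"createdid": 2, "text1": "c", "notedate1": "2015-02-01",
     "text2": None, "notedate2": None},
]

final = [t for row in anotes for t in unpivot_pairs(row, 2)]
for t in final:
    print(t)
```

Because each pair is listed explicitly, text and date stay matched without any join. One difference to keep in mind: the VALUES approach also returns pairs whose values are NULL, whereas UNPIVOT drops NULLs, so add a WHERE filter if those rows are not wanted.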

Used the link mentioned above by @AHiggins.
Following is my final query!
select createdid,temp,temp2
from (select createdid,text1,text2,text3,text4,text5,text6,text7,notedate1,notedate2,notedate3,notedate4,notedate5,notedate6,notedate7 from anotes) main
unpivot
(temp for notes in(text1,text2,text3,text4,text5,text6,text7)) notes
unpivot (temp2 for notedate in(notedate1,notedate2,notedate3,notedate4,notedate5,notedate6,notedate7)) Dates
where RIGHT(notes,1)=RIGHT(notedate,1)

Treat each query as a table and join them together based on the createdid and the fieldid (the numeric part of the field name).
select x.createdid, x.textValue, y.dateValue
from
(
select createdid, substring(note, 5, len(note)) fieldId, textValue
from (select createdid,text1,text2,text3,text4 from anotes) p
unpivot
(textValue for note in(text1,text2,text3,text4)) as unpvt
)x
join
(
select createdid, substring(notedate, 9, len(notedate)) fieldId, dateValue
from (select createdid,notedate1,notedate2,notedate3,notedate4 from anotes) p
unpivot (dateValue for notedate in(notedate1,notedate2,notedate3,notedate4)) as unpvt2
) y on x.fieldId = y.fieldId and x.createdid = y.createdid
order by x.createdid, x.fieldId
The other answer given won't work if you have too many columns and the rightmost number of the field name is duplicated (e.g. text1 and text11).

How to select only one full row per group in a "group by" query?

In SQL Server, I have a table where a column A stores some data. This data can contain duplicates (ie. two or more rows will have the same value for the column A).
I can easily find the duplicates by doing:
select A, count(A) as CountDuplicates
from TableName
group by A having (count(A) > 1)
Now, I want to retrieve the values of other columns, let's say B and C. Of course, those B and C values can be different even for the rows sharing the same A value, but it doesn't matter for me. I just want any B value and any C one, the first, the last or the random one.
If I had a small table and one or two columns to retrieve, I would do something like:
select A, count(A) as CountDuplicates, (
select top 1 child.B from TableName as child where child.A = base.A) as B
from TableName as base group by A having (count(A) > 1)
The problem is that I have many more rows to get, and the table is quite big, so having several child selects would have a high performance cost.
So, is there a less ugly pure SQL solution to do this?
Not sure if my question is clear enough, so I give an example based on AdventureWorks database. Let's say I want to list available States, and for each State, get its code, a city (any city) and an address (any address). The easiest, and the most inefficient way to do it would be:
var q = from c in data.StateProvinces select new { c.StateProvinceCode, c.Addresses.First().City, c.Addresses.First().AddressLine1 };
in LINQ-to-SQL, and it will do two selects for each of the 181 States, so 363 selects. In my case, I am searching for a way to have a maximum of 182 selects.

The ROW_NUMBER function in a CTE is the way to do this. For example:
DECLARE @mytab TABLE (A INT, B INT, C INT)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 1, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 1, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 2, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 3, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (2, 2, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 3)
;WITH numbered AS
(
SELECT *, rn=ROW_NUMBER() OVER (PARTITION BY A ORDER BY B, C)
FROM @mytab AS m
)
SELECT *
FROM numbered
WHERE rn=1
As I mentioned in my comment to HLGEM and Philip Kelley, their simple use of an aggregate function does not necessarily return one "solid" record for each A group; instead, it may return column values from many separate rows, all stitched together as if they were a single record. For example, if this were a PERSON table, with the PersonID being the "A" column, and distinct contact records (say, Home and Work), you might wind up returning the person's home city, but their office ZIP code -- and that's clearly asking for trouble.
The use of the ROW_NUMBER, in conjunction with a CTE here, is a little difficult to get used to at first because the syntax is awkward. But it's becoming a pretty common pattern, so it's good to get to know it.
In my sample I've defined a CTE that tacks an extra column rn (standing for "row number") onto the table, numbering the rows within each partition of the A column. A SELECT on that result, filtering to only those having a row number of 1 (i.e., the first record found for that value of A), returns a "solid" record for each A group -- in my example above, you'd be certain to get either the Work or Home address, but not elements of both mixed together.
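
The difference between taking the first whole row per group and aggregating each column separately is easy to see on a tiny data set. The Python below is only a sketch with hypothetical (A, B, C) rows, chosen so that the minimum B and the minimum C for A = 1 sit in different rows.

```python
from itertools import groupby
from operator import itemgetter

rows = [(1, 1, 9), (1, 5, 2), (2, 4, 4)]  # hypothetical (A, B, C) rows

# ROW_NUMBER() OVER (PARTITION BY A ORDER BY B, C) ... WHERE rn = 1:
# sort, then keep the first whole row of each A group.
ordered = sorted(rows)
first_rows = [next(group) for _, group in groupby(ordered, key=itemgetter(0))]

# GROUP BY A with MIN(B), MIN(C): each column is aggregated on its own.
by_a = {}
for a, b, c in rows:
    by_a.setdefault(a, []).append((b, c))
stitched = [
    (a, min(b for b, _ in bc), min(c for _, c in bc))
    for a, bc in sorted(by_a.items())
]

print(first_rows)  # every tuple here is a real row from the table
print(stitched)    # (1, 1, 2) mixes values taken from two different rows
```

The first approach only ever returns rows that exist in the table; the second can produce a combination that no single row contains.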

It concerns me that you want any old value for fields b and c. If they are to be meaningless why are you returning them?
If it truly doesn't matter (and I honestly can't imagine a case where I would ever want this, but it's what you said) and the values for b and c don't even have to be from the same record, group by with the use of min or max is the way to go. It's more complicated if you want the values for a particular record for all fields.
select A, count(A) as CountDuplicates, min(B) as B , min(C) as C
from TableName as base
group by A
having (count(A) > 1)

You can do something like this if you have id as the primary key in your table:
select t.id, t.B, t.C
from TableName as t
inner join
(
select A, min(id) as id, count(A) as CountDuplicates
from TableName
group by A having (count(A) > 1)
)d on t.id = d.id
