I receive raw data files from external sources and need to provide analysis on them. I load the files into a table & set the fields as varchars, then run a complex SQL script that does some automated analysis. One issue I've been trying to resolve is: How to tell if a column of data is duplicated with 1 or more other columns in that same table?
My goal is to have, for every column, a hash, checksum, or something similar that looks at a column's values in every row in the order they come in. I have dynamic SQL that loops through every field (different tables will have a variable number of columns) based on the fields listed in INFORMATION_SCHEMA.COLUMNS, so no concerns on how to accomplish that part.
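For context, a minimal sketch of that kind of per-column loop (the table name RawImport and the per-column statement below are just placeholders, not my real script):

DECLARE @col sysname, @sql nvarchar(max);
DECLARE col_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'RawImport'
    ORDER BY ORDINAL_POSITION;
OPEN col_cur;
FETCH NEXT FROM col_cur INTO @col;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- whatever per-column analysis is needed goes here
    SET @sql = N'SELECT COUNT(*) AS non_null_rows FROM dbo.RawImport WHERE ' + QUOTENAME(@col) + N' IS NOT NULL;';
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM col_cur INTO @col;
END
CLOSE col_cur;
DEALLOCATE col_cur;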
I've been researching this all day but can't seem to find any sensible way to hash every row of a field. Google & StackOverflow searches return how to do various things to rows of data, but I couldn't find much on how to do the same thing vertically on a field.
So, I considered 2 possibilities & hit 2 roadblocks:
HASHBYTES - Use 'FOR XML PATH' (or similar) to grab every row & use a delimiter between each row, then use HASHBYTES to hash the long string. Unfortunately, this won't work for me since I'm running SQL Server 2014, where HASHBYTES is limited to an input of 8,000 bytes. (I can also imagine performance would be abysmal on tables with millions of rows, looped for 200+ columns.) A rough sketch of this idea appears after the next item.
CHECKSUM + CHECKSUM_AGG - Get the CHECKSUM of each value, turning it into an integer, then use CHECKSUM_AGG on the results (since CHECKSUM_AGG needs integers). This looks promising, but the order of the data is not considered, returning the same value on different rows. Plus the risk of collisions is higher.
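For illustration, here is a rough sketch of the first idea, with hypothetical table/column names; on SQL Server 2014 the HASHBYTES call fails once the concatenated input exceeds 8,000 bytes:

DECLARE @t table (rn int identity(1,1), col_1 varchar(5));
INSERT INTO @t (col_1) VALUES ('ABC'), ('ABC'), ('BCD'), (NULL);
DECLARE @concat varchar(max) =
(
    SELECT ISNULL(col_1, '') + '|' -- '|' delimits rows; NULL treated as empty string
    FROM @t
    ORDER BY rn -- preserve row order
    FOR XML PATH('')
);
SELECT HASHBYTES('SHA1', @concat) AS col_1_hash;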
The second looked promising but doesn't work as I had hoped...
declare @t1 table
(col_1 varchar(5)
, col_2 varchar(5)
, col_3 varchar(5));
insert into @t1
values ('ABC', 'ABC', 'ABC')
, ('ABC', 'ABC', 'BCD')
, ('BCD', 'BCD', NULL)
, (NULL, NULL, 'ABC');
select * from @t1;
select cs_1 = CHECKSUM(col_1)
, cs_2 = CHECKSUM(col_2)
, cs_3 = CHECKSUM(col_3)
from @t1;
select csa_1 = CHECKSUM_AGG(CHECKSUM([col_1]))
, csa_2 = CHECKSUM_AGG(CHECKSUM([col_2]))
, csa_3 = CHECKSUM_AGG(CHECKSUM([col_3]))
from @t1;
In the last result set, all 3 columns bring back the same value: 2147449198.
Desired results: My goal is to have some code where csa_1 and csa_2 bring back the same value, while csa_3 brings back a different value, indicating that it's its own unique set.
You could compare every column combo in this way, rather than using hashes:
select case when count(case when column1 = column2 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn2
, case when count(case when column1 = column3 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn3
, case when count(case when column1 = column4 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn4
, case when count(case when column1 = column5 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn5
, case when count(case when column2 = column3 then 1 else null end) = count(1) then 1 else 0 end Column2EqualsColumn3
, case when count(case when column2 = column4 then 1 else null end) = count(1) then 1 else 0 end Column2EqualsColumn4
, case when count(case when column2 = column5 then 1 else null end) = count(1) then 1 else 0 end Column2EqualsColumn5
, case when count(case when column3 = column4 then 1 else null end) = count(1) then 1 else 0 end Column3EqualsColumn4
, case when count(case when column3 = column5 then 1 else null end) = count(1) then 1 else 0 end Column3EqualsColumn5
, case when count(case when column4 = column5 then 1 else null end) = count(1) then 1 else 0 end Column4EqualsColumn5
from myData a
Here's the setup code:
create table myData
(
id integer not null identity(1,1)
, column1 nvarchar (32)
, column2 nvarchar (32)
, column3 nvarchar (32)
, column4 nvarchar (32)
, column5 nvarchar (32)
)
insert myData (column1, column2, column3, column4, column5)
values ('hello', 'hello', 'no', 'match', 'match')
,('world', 'world', 'world', 'world', 'world')
,('repeat', 'repeat', 'repeat', 'repeat', 'repeat')
,('me', 'me', 'me', 'me', 'me')
Also, to save you having to write this out, here's some code to generate the above. This version also includes logic to handle the scenario where both columns' values are null:
declare @tableName sysname = 'myData'
, @sql nvarchar(max)
;with cte as (
select name, row_number() over (order by column_id) r
from sys.columns
where object_id = object_id(@tableName, 'U') --filter on our table
and name not in ('id') --only process for the columns we're interested in
)
select @sql = coalesce(@sql + char(10) + ', ', 'select') + ' case when count(case when ' + quotename(a.name) + ' = ' + quotename(b.name) + ' or (' + quotename(a.name) + ' is null and ' + quotename(b.name) + ' is null) then 1 else null end) = count(1) then 1 else 0 end ' + quotename(a.name + '_' + b.name)
from cte a
inner join cte b
on b.r > a.r
order by a.r, b.r
set @sql = @sql + char(10) + 'from ' + quotename(@tableName)
print @sql
NB: That's not to say you should run it as dynamic SQL; rather you can use this to generate your code (unless you need to support the scenario where the number or name of columns may vary at runtime, in which case you'd obviously want the dynamic option).
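For reference, for the sample myData table the print produces roughly the following (abridged here; there is one expression per column pair):

select case when count(case when [column1] = [column2] or ([column1] is null and [column2] is null) then 1 else null end) = count(1) then 1 else 0 end [column1_column2]
, case when count(case when [column1] = [column3] or ([column1] is null and [column3] is null) then 1 else null end) = count(1) then 1 else 0 end [column1_column3]
-- ...one line per remaining column pair...
from [myData]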
NEW SOLUTION
EDIT: Based on some new information, namely that there may be more than 200 columns, my suggestion is to compute hashes for each column, but perform it in the ETL tool.
Essentially, feed your data buffer through a transformation that computes a cryptographic hash of the previously-computed hash concatenated with the current column value. When you reach the end of the stream, you will have a serially-generated hash value for each column that acts as a proxy for the content and order of each set.
Then, you can compare each to all of the others almost instantly, as opposed to running 20,000 table scans.
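As a conceptual illustration only (the real implementation would live in the ETL tool, for example in a script transformation), here is a T-SQL sketch of that running hash with hypothetical names; equal final hashes indicate columns with identical content in identical order:

DECLARE @t table (rn int identity(1,1), col_1 varchar(5), col_2 varchar(5));
INSERT INTO @t (col_1, col_2) VALUES ('ABC', 'ABC'), ('BCD', 'BCD'), (NULL, 'ABC');
DECLARE @h1 varbinary(32) = 0x, @h2 varbinary(32) = 0x, @v1 varchar(5), @v2 varchar(5), @rn int = 1, @max int;
SELECT @max = MAX(rn) FROM @t;
WHILE @rn <= @max
BEGIN
    SELECT @v1 = col_1, @v2 = col_2 FROM @t WHERE rn = @rn;
    -- chain: hash(previous hash + current value), so row order matters
    SET @h1 = HASHBYTES('SHA2_256', @h1 + CAST(ISNULL(@v1, '') AS varbinary(8000))); -- NULL treated as empty string
    SET @h2 = HASHBYTES('SHA2_256', @h2 + CAST(ISNULL(@v2, '') AS varbinary(8000)));
    SET @rn += 1;
END;
SELECT @h1 AS col_1_hash, @h2 AS col_2_hash;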
OLD SOLUTION
Try this. Basically, you'll need a query like this to analyze each column against the others. There is not really a feasible hash-based solution. Just compare each set by its insertion order (some sort of row sequence number). Either generate this number during ingestion, or project it during retrieval, if you have a computationally-feasible means of doing so.
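For example, a minimal sketch of projecting such a sequence number at retrieval time (hypothetical table name; note that without a genuine ordering column the row order is not guaranteed):

SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rownum, src.*
FROM dbo.YourStagingTable AS src;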
NOTE: I took liberties with the NULL here, comparing it as an empty string.
declare @t1 table
(
rownum int identity(1,1)
, col_1 varchar(5)
, col_2 varchar(5)
, col_3 varchar(5));
insert into @t1
values ('ABC', 'ABC', 'ABC')
, ('ABC', 'ABC', 'BCD')
, ('BCD', 'BCD', NULL)
, (NULL, NULL, 'ABC');
with col_1_sets as
(
select
t1.rownum as col_1_rownum
, CASE WHEN t2.rownum IS NULL THEN 1 ELSE 0 END AS col_2_miss
, CASE WHEN t3.rownum IS NULL THEN 1 ELSE 0 END AS col_3_miss
from
@t1 as t1
left join @t1 as t2 on
t1.rownum = t2.rownum
AND isnull(t1.col_1, '') = isnull(t2.col_2, '')
left join @t1 as t3 on
t1.rownum = t3.rownum
AND isnull(t1.col_1, '') = isnull(t3.col_3, '')
),
col_1_misses as
(
select
SUM(col_2_miss) as col_2_misses
, SUM(col_3_miss) as col_3_misses
from
col_1_sets
)
select
'col_1' as column_name
, CASE WHEN col_2_misses = 0 THEN 1 ELSE 0 END AS is_col_2_match
, CASE WHEN col_3_misses = 0 THEN 1 ELSE 0 END AS is_col_3_match
from
col_1_misses
Results:
+-------------+----------------+----------------+
| column_name | is_col_2_match | is_col_3_match |
+-------------+----------------+----------------+
| col_1 | 1 | 0 |
+-------------+----------------+----------------+
Related
Problem background
I am trying to pin down which condition(s) most often cause no records/rows to be returned, so that I can find the root cause of which data in the database might need scrubbing.
So, for example, from the following query I would like to know whether it was the first condition that fails most of the time, or whether the second condition is the most offending one, and so on.
SELECT TOP 1 *
FROM table
WHERE column1 = #param1 -- (condition 1) This condition works without ANDing with other conditions
AND column2 = #param2
AND column3 = #param3 -- (condition 3) This together with condition 1 works 10% of the time
AND column4 = #param4
One idea I had was to break the procedure up to use one condition at a time.
DECLARE @retVal int
SELECT @retVal = COUNT(*)
FROM table
WHERE column1 = @param1
IF (@retVal > 0)
--Do something like the above but using @param2, @param3 and so on
Issues
If the first check itself fails, I wouldn't have a way forward to investigate the other combinations.
This doesn't seem very efficient either, as this stored procedure is called hundreds of times.
Other SO post: I also found this great post (Find which one of the WHERE clauses succeeded), but it isn't very helpful when no records are returned.
If this is just for debugging, what about detecting when @@ROWCOUNT = 0 and storing those parameters in a separate debugging table?
SELECT TOP 1 *
FROM SomeTable
WHERE column1 = @param1
AND column2 = @param2
AND column3 = @param3
AND column4 = @param4
-- order by ....
;
-- one or more parameters "failed"
IF @@ROWCOUNT = 0
BEGIN
INSERT INTO SomeTable_Debug( createdDate, column1, column2, column3, column4, column5)
VALUES (getDate(), @param1, @param2, @param3, @param4, @param5)
END
You can then use the debugging table later on, in a separate query script, without having to worry about its impact on a frequently invoked procedure. For example, this query returns 1 when a condition "fails", otherwise it returns null. It's not optimized for efficiency, but should be fine for occasional debugging use:
SELECT *
, (SELECT 1 WHERE NOT EXISTS (SELECT NULL FROM SomeTable st WHERE st.column1 = d.column1)) AS Matches_Column1
, (SELECT 1 WHERE NOT EXISTS (SELECT NULL FROM SomeTable st WHERE st.column2 = d.column2)) AS Matches_Column2
, (SELECT 1 WHERE NOT EXISTS (SELECT NULL FROM SomeTable st WHERE st.column3 = d.column3)) AS Matches_Column3
, (SELECT 1 WHERE NOT EXISTS (SELECT NULL FROM SomeTable st WHERE st.column4 = d.column4)) AS Matches_Column4
, (SELECT 1 WHERE NOT EXISTS (SELECT NULL FROM SomeTable st WHERE st.column5 = d.column5)) AS Matches_Column5
FROM SomeTable_Debug d
Sample Results:
| id | createdDate             | column1 | column2 | column3 | column4 | column5 | Matches_Column1 | Matches_Column2 | Matches_Column3 | Matches_Column4 | Matches_Column5 |
|----|-------------------------|---------|---------|---------|---------|---------|-----------------|-----------------|-----------------|-----------------|-----------------|
| 1  | 2022-04-18 16:51:11.487 | 1       | 22      | 3       | 4       | 5       | null            | 1               | null            | null            | null            |
| 2  | 2022-04-18 16:51:11.500 | 1       | 22      | 3       | 4       | 56      | null            | 1               | null            | null            | 1               |
Executing the following code triggers an error:
System.Data.SqlClient.SqlException: 'Cannot convert a char value to money. The char value has incorrect syntax.'
All was fine until I added the second parameter, which is used for sorting purposes.
The code is simplified for clarity.
Dim query As String = "
SELECT a, b, c
FROM DataTable
WHERE c = @PARAM
ORDER BY
CASE @SORTCOLUMN
WHEN 1 THEN a
WHEN 2 THEN b
WHEN 3 THEN c
END"
Dim param As String = "myparameter"
Dim sortcolumn As Integer = 1
result = connection.Query(Of MyType)(query, New With {Key .PARAM = param, Key .SORTCOLUMN = sortcolumn})
UPDATE:
After hours of testing I have narrowed it down to a purely SQL issue, not Dapper or the .NET Framework.
Here are my findings:
Everything works fine as long as all the columns used within the CASE WHEN are of the same type (or of types, and even values, which can easily be converted into each other). Once the column types are different and the values cannot be converted, it returns a type conversion error. It seems it is trying to convert the column selected by the CASE WHEN to the type of the first WHEN column.
Here are two examples. I have removed the variables for simplicity.
This works:
CREATE TABLE #TestTable1
(col1 money, col2 money)
INSERT INTO #TestTable1
(col1, col2)
VALUES
(1, 30),
(2, 20),
(3, 10)
SELECT col1, col2
FROM #TestTable1
ORDER BY
CASE 2
WHEN 1 THEN col1
WHEN 2 THEN col2
END
DROP TABLE #TestTable1
But this does NOT work. Returns:
Cannot convert a char value to money. The char value has incorrect syntax.
CREATE TABLE #TestTable2
(col1 money, col2 varchar(20))
INSERT INTO #TestTable2
(col1, col2)
VALUES
(1, 'cc'),
(2, 'bb'),
(3, 'aa')
SELECT col1, col2
FROM #TestTable2
ORDER BY
CASE 2
WHEN 1 THEN col1
WHEN 2 THEN col2
END
DROP TABLE #TestTable2
I am using Azure SQL with compatibility level 150.
I have updated Title and Tags accordingly.
UPDATE 2:
I am trying to add a complication to @forpas's solution in the form of a second parameter which will tell the sort order. I have used a CASE within a CASE, but this returns a number of syntax errors.
This is the ORDER BY part only; the rest has not changed.
ORDER BY
CASE @SORTORDER
WHEN 'a' THEN
(CASE @SORTCOLUMN
WHEN 1 THEN col1
WHEN 2 THEN col2
END,
CASE @SORTCOLUMN
WHEN 3 THEN col3
WHEN 4 THEN col4
END) ASC
WHEN 'd' THEN
(CASE @SORTCOLUMN
WHEN 1 THEN col1
WHEN 2 THEN col2
END,
CASE @SORTCOLUMN
WHEN 3 THEN col3
WHEN 4 THEN col4
END) DESC
END
One solution is to split the CASE expression into as many CASE expressions as needed, so that each one contains only columns with the same or similar, convertible data types:
CREATE TABLE #TestTable2
(col1 money, col2 decimal(18, 2), col3 varchar(20))
INSERT INTO #TestTable2
(col1, col2, col3)
VALUES
(1, 5.5, 'cc'),
(2, 1.8, 'bb'),
(3, 3.3, 'aa');
DECLARE @SORTCOLUMN INT = 1; -- 1 or 2 or 3
SELECT col1, col2, col3
FROM #TestTable2
ORDER BY
CASE @SORTCOLUMN
WHEN 1 THEN col1
WHEN 2 THEN col2
END,
CASE @SORTCOLUMN
WHEN 3 THEN col3
END
DROP TABLE #TestTable2
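Not part of the original answer, but for the UPDATE 2 part of the question a commonly used pattern is one CASE pair per sort direction. The sketch below assumes the #TestTable2 example above (run it before the table is dropped):

DECLARE @SORTCOLUMN INT = 1, @SORTORDER CHAR(1) = 'd'; -- 'a' = ascending, 'd' = descending
SELECT col1, col2, col3
FROM #TestTable2
ORDER BY
CASE WHEN @SORTORDER = 'a' THEN
CASE @SORTCOLUMN WHEN 1 THEN col1 WHEN 2 THEN col2 END
END ASC,
CASE WHEN @SORTORDER = 'd' THEN
CASE @SORTCOLUMN WHEN 1 THEN col1 WHEN 2 THEN col2 END
END DESC,
CASE WHEN @SORTORDER = 'a' AND @SORTCOLUMN = 3 THEN col3 END ASC,
CASE WHEN @SORTORDER = 'd' AND @SORTCOLUMN = 3 THEN col3 END DESC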
Simply create the SQL as nvarchar and execute it as follows (note that @PARAM must be passed into the inner sp_executesql call, since the dynamic batch cannot see the outer parameter):
DECLARE @SQL NVARCHAR(MAX) = N'SELECT a, b, c FROM DataTable
WHERE c = @PARAM
ORDER BY ' + (CASE @SORTCOLUMN WHEN 1 THEN 'a' WHEN 2 THEN 'b' WHEN 3 THEN 'c' ELSE 'a' END);
EXEC sp_executesql @SQL, N'@PARAM varchar(100)', @PARAM; -- use the actual type of column c here
in VB
Dim query As String = "DECLARE @SQL NVARCHAR(MAX) = N'SELECT a, b, c FROM DataTable
WHERE c = @PARAM
ORDER BY ' + (CASE @SORTCOLUMN WHEN 1 THEN 'a' WHEN 2 THEN 'b' WHEN 3 THEN 'c' ELSE 'a' END);
EXEC sp_executesql @SQL, N'@PARAM varchar(100)', @PARAM;"
Try this:
Dim query As String = "
SELECT a, b, c
FROM DataTable
WHERE c = @PARAM
ORDER BY
CASE @SORTCOLUMN
WHEN '1' THEN a
WHEN '2' THEN b
WHEN '3' THEN c
END"
I am trying to count the number of times a record field has a value so that I can report that number later in the application.
I am finding several answers with various approaches using COUNT and GROUP BY but the results are all sums of the total occurrences for the entire table.
I am trying to limit the count to each record.
Table Example:
COL-1 COL-2 COL-3 COL-4
VALUE VALUE
VALUE VALUE
VALUE VALUE VALUE
VALUE
I need to count the fields of each record for the number of times a value appears.
Something similar to:
Result Concept:
COL-1 COL-2 COL-3 COL-4 Occurrences
VALUE VALUE 2
VALUE VALUE 2
VALUE VALUE VALUE 3
VALUE 1
Clarification:
I do not actually need to list the columns and values in the result. I only need the accurate count for each record.
I just wanted to illustrate the relationship between the "occurrences-value" and the record values in my question.
Thank you for all suggestions and input.
Just use case:
select t.*,
( (case when col1 is not null then 1 else 0 end) +
(case when col2 is not null then 1 else 0 end) +
(case when col3 is not null then 1 else 0 end) +
(case when col4 is not null then 1 else 0 end)
) as occurrences
from t;
Or decode (Oracle) ;)
select t.*, DECODE(col1,null,0,1)+DECODE(col2,null,0,1)+
DECODE(col3,null,0,1)+DECODE(col4,null,0,1) cnt
from my_table t
You may use dynamic SQL without listing all the column names:
DECLARE @sql VARCHAR(MAX)
DECLARE @tbl VARCHAR(100)
SET @tbl = 'sampletable' -- put your table name here
SET @sql = 'SELECT *, '
SELECT @sql = @sql + '(CASE WHEN ' + cols.name + ' IS NOT NULL THEN 1 ELSE 0 END) ' + '+'
FROM sys.columns cols
WHERE cols.object_id = object_id(@tbl);
SET @sql = LEFT(@sql, LEN(@sql) - 1)
SET @sql = @sql + ' AS occurrences FROM ' + @tbl
EXEC(@sql)
Given 2 or more rows that are selected to merge, one of them is identified as being the template row. The other rows should merge their data into any null value columns that the template has.
Example data:
Id Name Address City State Active Email Date
1 Acme1 NULL NULL NULL NULL blah@yada.com 3/1/2011
2 Acme1 1234 Abc Rd Springfield OR 0 blah@gmail.com 1/12/2012
3 Acme2 NULL NULL NULL 1 blah@yahoo.com 4/19/2012
Say that a user has chosen row with Id 1 as the template row, and rows with Ids 2 and 3 are to be merged into row 1 and then deleted. Any null value columns in row Id 1 should be filled with (if one exists) the most recent (see Date column) non-null value, and non-null values already present in row Id 1 are to be left as is. The result of this query on the above data should be exactly this:
Id Name Address City State Active Email Date
1 Acme1 1234 Abc Rd Springfield OR 1 blah@yada.com 3/1/2011
Notice that the Active value is 1, and not 0 because row Id 3 had the most recent date.
P.S. Also, is there any way possible to do this without explicitly defining/knowing beforehand what all the column names are? The actual table I'm working with has a ton of columns, with new ones being added all the time. Is there a way to look up all the column names in the table, and then use that subquery or temptable to do the job?
You might do it by ordering the rows first by a template flag, then by date descending. The template row should always be the last one. Each row is assigned a number in that order. Using max() we find the first occupied cell per column (counting in descending order of those numbers). Then we select the columns from the rows matching those maximums.
; with rows as (
select test.*,
-- Template row must be last - how do you decide which one is template row?
-- In this case template row is the one with id = 1
row_number() over (order by case when id = 1 then 1 else 0 end,
date) rn
from test
-- Your list of rows to merge goes here
-- where id in ( ... )
),
-- Finding first occupied row per column
positions as (
select
max (case when Name is not null then rn else 0 end) NamePosition,
max (case when Address is not null then rn else 0 end) AddressPosition,
max (case when City is not null then rn else 0 end) CityPosition,
max (case when State is not null then rn else 0 end) StatePosition,
max (case when Active is not null then rn else 0 end) ActivePosition,
max (case when Email is not null then rn else 0 end) EmailPosition,
max (case when Date is not null then rn else 0 end) DatePosition
from rows
)
-- Finally join this columns in one row
select
(select Name from rows cross join Positions where rn = NamePosition) name,
(select Address from rows cross join Positions where rn = AddressPosition) Address,
(select City from rows cross join Positions where rn = CityPosition) City,
(select State from rows cross join Positions where rn = StatePosition) State,
(select Active from rows cross join Positions where rn = ActivePosition) Active,
(select Email from rows cross join Positions where rn = EmailPosition) Email,
(select Date from rows cross join Positions where rn = DatePosition) Date
from test
-- Any id will suffice, or even DISTINCT
where id = 1
EDIT:
Cross joins in the last section might actually be inner joins on rows.rn = xxxPosition. It works this way, but changing to an inner join would be an improvement, as sketched below.
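For instance, the final SELECT could be rewritten like this, reusing the rows and positions CTEs from the query above (a sketch of the improvement, not the original answer):

select
(select r.Name from rows r inner join positions p on r.rn = p.NamePosition) Name,
(select r.Address from rows r inner join positions p on r.rn = p.AddressPosition) Address,
(select r.City from rows r inner join positions p on r.rn = p.CityPosition) City,
(select r.State from rows r inner join positions p on r.rn = p.StatePosition) State,
(select r.Active from rows r inner join positions p on r.rn = p.ActivePosition) Active,
(select r.Email from rows r inner join positions p on r.rn = p.EmailPosition) Email,
(select r.Date from rows r inner join positions p on r.rn = p.DatePosition) Date
from test
where id = 1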
It's not so complicated.
First..
DECLARE @templateID INT = 1
..so you can remember which row is treated as the template..
Now find latest NOT NULL values (exclude template row). The easiest way is to use TOP 1 subqueries for each column:
SELECT
(SELECT TOP 1 Name FROM DataTab WHERE Name IS NOT NULL AND NOT ID = @templateID ORDER BY Date DESC) AS LatestName,
(SELECT TOP 1 Address FROM DataTab WHERE Address IS NOT NULL AND NOT ID = @templateID ORDER BY Date DESC) AS LatestAddress
-- add more columns here
Wrap the above into a CTE (Common Table Expression) so you have nice input for your UPDATE..
WITH Latest_CTE (CTE_LatestName, CTE_LatestAddress) -- add more columns here; I like the CTE prefix to distinguish source columns from target columns..
AS
-- Define the CTE query.
(
SELECT
(SELECT TOP 1 Name FROM DataTab WHERE Name IS NOT NULL AND NOT ID = @templateID ORDER BY Date DESC) AS LatestName,
(SELECT TOP 1 Address FROM DataTab WHERE Address IS NOT NULL AND NOT ID = @templateID ORDER BY Date DESC) AS LatestAddress
-- add more columns here
)
UPDATE
<update statement here (below)>
Now, do a smart UPDATE of your template row using ISNULL - it acts as a conditional update, updating a column only if the target column is null
WITH
<common expression statement here (above)>
UPDATE DataTab
SET
Name = ISNULL(Name, CTE_LatestName), -- if Name is null then set Name to CTE_LatestName else keep Name as Name
Address = ISNULL(Address, CTE_LatestAddress)
-- add more columns here..
WHERE ID = @templateID
And the last task is to delete the rows other than the template row..
DELETE FROM DataTab WHERE NOT ID = @templateID
Clear?
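Putting the pieces together, a hedged, assembled sketch (column names taken from the sample data; note the CTE has to be joined into the UPDATE for its columns to be referenced):

DECLARE @templateID int = 1;
WITH Latest_CTE AS
(
SELECT
(SELECT TOP 1 Name FROM DataTab WHERE Name IS NOT NULL AND NOT ID = @templateID ORDER BY Date DESC) AS CTE_LatestName,
(SELECT TOP 1 Address FROM DataTab WHERE Address IS NOT NULL AND NOT ID = @templateID ORDER BY Date DESC) AS CTE_LatestAddress
-- add more columns here
)
UPDATE d
SET Name = ISNULL(d.Name, c.CTE_LatestName)
, Address = ISNULL(d.Address, c.CTE_LatestAddress)
-- add more columns here
FROM DataTab AS d
CROSS JOIN Latest_CTE AS c
WHERE d.ID = @templateID;
DELETE FROM DataTab WHERE NOT ID = @templateID;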
For dynamic columns, you need to write a solution using dynamic SQL.
You can query sys.columns and sys.tables to get the list of columns you need, then you want to loop backwards once for each null column finding the first non-null row for that column and updating your output row for that column. Once you get to 0 in the loop you have a complete row which you can then display to the user.
I should pay attention to posting dates. In any case, here's a solution using dynamic SQL to build out an update statement. It should give you something to build from, anyway.
There's some extra code in there to validate the results along the way, but I tried to comment in a way that made that non-vital code apparent.
CREATE TABLE
dbo.Dummy
(
[ID] int ,
[Name] varchar(30),
[Address] varchar(40) null,
[City] varchar(30) NULL,
[State] varchar(2) NULL,
[Active] tinyint NULL,
[Email] varchar(30) NULL,
[Date] date NULL
);
--
INSERT dbo.Dummy
VALUES
(
1, 'Acme1', NULL, NULL, NULL, NULL, 'blah@yada.com', '3/1/2011'
)
,
(
2, 'Acme1', '1234 Abc Rd', 'Springfield', 'OR', 0, 'blah@gmail.com', '1/12/2012'
)
,
(
3, 'Acme2', NULL, NULL, NULL, 1, 'blah@yahoo.com', '4/19/2012'
);
DECLARE
@TableName nvarchar(128) = 'Dummy',
@TemplateID int = 1,
@SetStmtList nvarchar(max) = '',
@LoopCounter int = 0,
@ColumnCount int = 0,
@SQL nvarchar(max) = ''
;
--
--Create a table to hold the column names
DECLARE
@ColumnList table
(
ColumnID tinyint IDENTITY,
ColumnName nvarchar(128)
);
--
--Get the column names
INSERT @ColumnList
(
ColumnName
)
SELECT
c.name
FROM
sys.columns AS c
JOIN
sys.tables AS t
ON
t.object_id = c.object_id
WHERE
t.name = @TableName;
--
--Create loop boundaries to build out the SQL statement
SELECT
@ColumnCount = MAX( l.ColumnID ),
@LoopCounter = MIN (l.ColumnID )
FROM
@ColumnList AS l;
--
--Loop over the column names
WHILE @LoopCounter <= @ColumnCount
BEGIN
--Dynamically construct SET statements for each column except ID (See the WHERE clause)
SELECT
@SetStmtList = @SetStmtList + ',' + l.ColumnName + ' =COALESCE(' + l.ColumnName + ', (SELECT TOP 1 ' + l.ColumnName + ' FROM ' + @TableName + ' WHERE ' + l.ColumnName + ' IS NOT NULL AND ID <> ' + CAST(@TemplateID AS NVARCHAR(MAX)) + ' ORDER BY Date DESC)) '
FROM
@ColumnList AS l
WHERE
l.ColumnID = @LoopCounter
AND
l.ColumnName <> 'ID';
--
SELECT
@LoopCounter = @LoopCounter + 1;
--
END;
--TESTING - Validate the initial table values
SELECT * FROM dbo.Dummy ;
--
--Get rid of the leading comma in the SetStmtList
SET @SetStmtList = SUBSTRING( @SetStmtList, 2, LEN( @SetStmtList ) - 1 );
--Build out the rest of the UPDATE statement
SET @SQL = 'UPDATE ' + @TableName + ' SET ' + @SetStmtList + ' WHERE ID = ' + CAST(@TemplateID AS NVARCHAR(MAX))
--Then execute the update
EXEC sys.sp_executesql
@SQL;
--
--TESTING - Validate the updated table values
SELECT * FROM dbo.Dummy ;
--
--Build out the DELETE statement
SET @SQL = 'DELETE FROM ' + @TableName + ' WHERE ID <> ' + CAST(@TemplateID AS NVARCHAR(MAX))
--Execute the DELETE
EXEC sys.sp_executesql
@SQL;
--
--TESTING - Validate the final table values
SELECT * FROM dbo.Dummy;
--
DROP TABLE dbo.Dummy;
I have the following table layout. Each line value will always be unique. There will never be more than one instance of the same Id, Name, and Line.
Id Name Line
1 A Z
2 B Y
3 C X
3 C W
4 D W
I would like to query the data so that the Line field becomes a column. If the value exists, a 1 is applied in the field data, otherwise a 0. e.g.
Id Name Z Y X W
1 A 1 0 0 0
2 B 0 1 0 0
3 C 0 0 1 1
4 D 0 0 0 1
The field names W, X, Y, Z are just examples of field values, so I can't apply an operator to explicitly check, for example, 'X', 'Y', or 'Z'. These could change at any time and are not restricted to a finite set of values. The column names in the result set should reflect the unique field values as columns.
Any idea how I can accomplish this?
It's a standard pivot query.
If 1 represents a boolean indicator - use:
SELECT t.id,
t.name,
MAX(CASE WHEN t.line = 'Z' THEN 1 ELSE 0 END) AS Z,
MAX(CASE WHEN t.line = 'Y' THEN 1 ELSE 0 END) AS Y,
MAX(CASE WHEN t.line = 'X' THEN 1 ELSE 0 END) AS X,
MAX(CASE WHEN t.line = 'W' THEN 1 ELSE 0 END) AS W
FROM TABLE t
GROUP BY t.id, t.name
If 1 represents the number of records with that value for the group, use:
SELECT t.id,
t.name,
SUM(CASE WHEN t.line = 'Z' THEN 1 ELSE 0 END) AS Z,
SUM(CASE WHEN t.line = 'Y' THEN 1 ELSE 0 END) AS Y,
SUM(CASE WHEN t.line = 'X' THEN 1 ELSE 0 END) AS X,
SUM(CASE WHEN t.line = 'W' THEN 1 ELSE 0 END) AS W
FROM TABLE t
GROUP BY t.id, t.name
Edited following update in question
SQL Server does not support dynamic pivoting.
To do this you could either use dynamic SQL to generate a query along the following lines.
SELECT
Id ,Name,
ISNULL(MAX(CASE WHEN Line='Z' THEN 1 END),0) AS Z,
ISNULL(MAX(CASE WHEN Line='Y' THEN 1 END),0) AS Y,
ISNULL(MAX(CASE WHEN Line='X' THEN 1 END),0) AS X,
ISNULL(MAX(CASE WHEN Line='W' THEN 1 END),0) AS W
FROM T
GROUP BY Id ,Name
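Not shown in the original answer, but here is a sketch of how that statement could be generated dynamically for whatever Line values exist (T is the table name used above; note the values are embedded directly in the SQL string, so escape or whitelist them for untrusted data):

DECLARE @cols nvarchar(max), @sql nvarchar(max);
SELECT @cols = COALESCE(@cols + ',' + CHAR(10), '')
+ 'ISNULL(MAX(CASE WHEN Line=''' + Line + ''' THEN 1 END),0) AS ' + QUOTENAME(Line)
FROM (SELECT DISTINCT Line FROM T) AS d;
SET @sql = 'SELECT Id, Name,' + CHAR(10) + @cols + CHAR(10) + 'FROM T GROUP BY Id, Name';
EXEC (@sql);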
Or an alternative which I have read about but not actually tried is to leverage the Access Transform function by setting up an Access database with a linked table pointing at the SQL Server table then query the Access database from SQL Server!
Here is the dynamic version
Test table
create table #test(id int,name char(1),line char(1))
insert #test values(1 , 'A','Z')
insert #test values(2 , 'B','Y')
insert #test values(3 , 'C','X')
insert #test values(4 , 'C','W')
insert #test values(5 , 'D','W')
insert #test values(5 , 'D','W')
insert #test values(5 , 'D','P')
Now run this
declare @names nvarchar(4000)
SELECT @names =''
SELECT @names = @names + line +', '
FROM (SELECT distinct line from #test) x
SELECT @names = LEFT(@names,(LEN(@names) -1))
exec('
SELECT *
FROM(
SELECT DISTINCT Id, Name,Line
FROM #test
) AS pivTemp
PIVOT
( COUNT(Line)
FOR Line IN (' + @names +' )
) AS pivTable ')
Now add one row to the table and run the query above again, and you will see the new B column appear:
insert #test values(5 , 'D','B')
Caution: of course all the usual problems with dynamic SQL apply. You could use sp_executesql, but since parameters are not used that way in this query there is really no point.
Assuming you have a finite number of values for Line that you could enumerate:
declare @MyTable table (
Id int,
Name char(1),
Line char(1)
)
insert into @MyTable
(Id, Name, Line)
select 1,'A','Z'
union all
select 2,'B','Y'
union all
select 3,'C','X'
union all
select 3,'C','W'
union all
select 4,'D','W'
SELECT Id, Name, Z, Y, X, W
FROM (SELECT Id, Name, Line
FROM @MyTable) up
PIVOT (count(Line) FOR Line IN (Z, Y, X, W)) AS pvt
ORDER BY Id
As you are using SQL Server, you could possibly use the PIVOT operator intended for this purpose.
If you're doing this for a SQL Server Reporting Services (SSRS) report, or could possibly switch to using one, then stop now and go throw a Matrix control onto your report. Poof! You're done! Happy as a clam with your data pivoted.
Here's a rather exotic approach (using sample data from the old Northwind database). It's adapted from the version here, which no longer worked due to the deprecation of DBCC RENAMECOLUMN and the addition of PIVOT as a keyword.
set nocount on
create table Sales (
AccountCode char(5),
Category varchar(10),
Amount decimal(8,2)
)
--Populate table with sample data
insert into Sales
select customerID, 'Emp'+CAST(EmployeeID as char), sum(Freight)
from Northwind.dbo.orders
group by customerID, EmployeeID
create unique clustered index Sales_AC_C
on Sales(AccountCode,Category)
--Create table to hold data column names and positions
select A.Category,
count(distinct B.Category) AS Position
into #columns
from Sales A join Sales B
on A.Category >= B.Category
group by A.Category
create unique clustered index #columns_P on #columns(Position)
create unique index #columns_C on #columns(Category)
--Generate first column of Pivot table
select distinct AccountCode into Pivoted from Sales
--Find number of data columns to be added to Pivoted table
declare @datacols int
select @datacols = max(Position) from #columns
--Add data columns one by one in the correct order
declare @i int
set @i = 0
while @i < @datacols begin
set @i = @i + 1
--Add next data column to Pivoted table
select P.*, isnull((
select Amount
from Sales S join #columns C
on C.Position = @i
and C.Category = S.Category
where P.AccountCode = S.AccountCode),0) AS X
into PivotedAugmented
from Pivoted P
--Name new data column correctly
declare @c sysname
select @c = Category
from #columns
where Position = @i
exec sp_rename '[dbo].[PivotedAugmented].[X]', @c, 'COLUMN'
--Replace Pivoted table with new table
drop table Pivoted
select * into Pivoted from PivotedAugmented
drop table PivotedAugmented
end
select * from Pivoted
go
drop table Pivoted
drop table #columns
drop table Sales