I have a table of changes to an entity. I am trying to rebuild the data from the changes.
This is my Changes table:
CREATE TABLE Changes
(
Id INT IDENTITY,
RecordId INT,
Field INT,
Val VARCHAR(MAX),
DateOfChange DATETIME
);
The Field column is a reference to the field that changed, Val is the new value, and RecordId is the Id of the record that changed. Ideally the Records table would contain the latest values, but I am not that lucky. Changes are tracked for 10 different fields, mostly dates, but some other types are thrown in as well.
This is my Records table:
CREATE TABLE Records
(
Id INT IDENTITY,
AUserGeneratedIdentifer VARCHAR(12)
)
I'd like to have a view to query by the rolled up values.
SELECT
AUserGeneratedIdentifer, DateOpened, DateClosed, etc
FROM
RecordView
WHERE
AUserGeneratedIdentifer = 'something'
I am trying to implement it with CTEs, but I am wondering if this is the correct way. I am using one CTE per field I am trying to get to.
WITH DateOpened AS
(
SELECT
RecordId, Val,
ROW_NUMBER() OVER (PARTITION BY RecordId ORDER BY DateOfChange DESC) Rank
FROM
Changes
WHERE
Field = @DateOpenedId
) --- ... Repeat for every field
SELECT (my fields)
FROM Records
INNER JOIN <all ctes> on Record Id
But this method feels wrong to me, possibly due to my lack of SQL experience. Is there a better way that I am missing here? What are the performance implications of having multiple CTEs on the same table and joining with them?
Please excuse the hastily thrown-together pseudocode; I hope it illustrates my problem accurately.
I think you should be able to use a single ROW_NUMBER and pivot it as below.
DECLARE @Records TABLE (Id INT IDENTITY, AUserGeneratedIdentifer VARCHAR(12))
DECLARE @Changes TABLE (Id INT IDENTITY, RecordId INT, Field INT, Val VARCHAR(MAX), DateOfChange DATETIME);
INSERT INTO @Records (AUserGeneratedIdentifer)
VALUES ('qwer'), ('asdf')
INSERT INTO @Changes (RecordId, Field, Val, DateOfChange)
VALUES (1, 1, 'foo', '2021-01-01' ),
(2, 1, 'fooz', '2021-01-01' ),
(2, 2, 'barz', '2021-01-01' ),
(1, 2, 'bar', '2021-01-01' ),
(1, 1, 'foo2', '2021-01-02' ),
(2, 2, 'barz2', '2021-01-02' )
SELECT piv.RecordId, piv.[1], piv.[2]
FROM (
SELECT C.RecordId, C.Field, C.Val, ROW_NUMBER() OVER (PARTITION BY C.RecordId, C.Field ORDER BY C.DateOfChange DESC) RowNum
FROM @Changes C
) sub
PIVOT (
MAX(Val)
FOR Field IN ([1], [2])
) piv
WHERE piv.RowNum = 1
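Since the question asks for a view keyed on the user-generated identifier, the same single ROW_NUMBER idea could be wrapped up roughly as follows. This is only a sketch under assumptions: it targets permanent tables named dbo.Records and dbo.Changes (the dbo schema is assumed), it guesses that Field = 1 means DateOpened and Field = 2 means DateClosed (a hypothetical mapping, the real field ids are not given), and it swaps PIVOT for conditional aggregation so that the join to Records stays simple.

```sql
-- Sketch only: field ids 1 and 2 are a hypothetical mapping, adjust to the real ids.
CREATE VIEW dbo.RecordView
AS
WITH Latest AS
(
    -- Number the changes of each (record, field) pair, newest first.
    SELECT C.RecordId, C.Field, C.Val,
           ROW_NUMBER() OVER (PARTITION BY C.RecordId, C.Field
                              ORDER BY C.DateOfChange DESC) AS RowNum
    FROM dbo.Changes AS C
)
SELECT r.AUserGeneratedIdentifer,
       MAX(CASE WHEN l.Field = 1 THEN l.Val END) AS DateOpened,
       MAX(CASE WHEN l.Field = 2 THEN l.Val END) AS DateClosed
FROM dbo.Records AS r
LEFT JOIN Latest AS l
       ON l.RecordId = r.Id
      AND l.RowNum = 1          -- keep only the latest change per field
GROUP BY r.Id, r.AUserGeneratedIdentifer;
```

Because of the LEFT JOIN, records without any tracked changes still appear, with NULL in every rolled-up column. The values come back as VARCHAR(MAX), so date fields would still need a CAST or TRY_CONVERT on top.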
How can I insert a row in SQL and add a value that represents an internal counter grouped by a certain column value?
For example
CREATE TABLE Product (
Id int IDENTITY(1,1) PRIMARY KEY NOT NULL,
StoreId int,
StoreProductId int,
ProductName varchar(255)
)
When I insert a row such as this
INSERT INTO Product (StoreId, ProductName)
SELECT 1, 'MyProduct'
I want to have values (1, 1, 1, 'MyProduct')
If I add a new product for that same store
I want to have values (2, 1, 2, 'MyProduct2')
For a different store
I want to have values (3, 2, 1, 'MyProduct3')
How do I do it safely, i.e. without duplicate StoreProductId values? I tried a computed column, but I was unable to use COUNT in it. I also tried a trigger on insert, but I am not sure that is the right way to avoid duplicates.
You can use a function with a computed column:
CREATE FUNCTION dbo.fnGetStoreProductId(@id INT)
RETURNS int
AS
BEGIN
DECLARE @StoreProductId INT
;WITH cte AS
(
SELECT Id, ROW_NUMBER() OVER (PARTITION BY StoreId ORDER BY Id) rn
FROM dbo.Product
)
SELECT @StoreProductId = rn
FROM cte
WHERE cte.Id = @id
RETURN @StoreProductId
END
GO
ALTER TABLE dbo.Product Add StoreProductId AS dbo.fnGetStoreProductId(Id)
This is not something you should be storing at all.
There are numerous problems with such a design, such as:
Impossible to ensure integrity, without resorting to triggers.
You cannot guarantee sequential data if updates and/or deletes are allowed.
Insert performance is massively impacted.
Instead, just calculate it at query time, whenever you need it, using ROW_NUMBER:
SELECT
p.Id,
p.StoreId,
StoreProductId = ROW_NUMBER() OVER (PARTITION BY p.StoreId ORDER BY p.Id),
p.ProductName
FROM dbo.Product p;
I have a merge statement that builds my SCD type 2 table each night. This table must house all historical changes made in the source system and create a new row with the date from/date to columns populated along with the "islatest" flag. I have come across an issue today that I am not really sure how to handle.
There appear to have been multiple changes to the source table within a 24-hour period.
ID Code PAN EnterDate Cost Created
16155 1012401593331 ENRD 2015-11-05 7706.3 2021-08-17 14:34
16155 1012401593331 ENRD 2015-11-05 8584.4 2021-08-17 16:33
I use a basic merge statement to identify my changes; however, what would be the best approach to ensure all changes get picked up correctly? The data above gives me an error because the statement tries to insert/update multiple rows with the same value.
DECLARE @DateNow DATETIME = Getdate()
IF Object_id('tempdb..#meteridinsert') IS NOT NULL
DROP TABLE #meteridinsert;
CREATE TABLE #meteridinsert
(
meterid INT,
change VARCHAR(10)
);
MERGE
INTO [DIM].[Meters] AS target
using stg_meters AS source
ON target.[ID] = source.[ID]
AND target.latest=1
WHEN matched THEN
UPDATE
SET target.islatest = 0,
target.todate = @DateNow
WHEN NOT matched BY target THEN
INSERT
(
id,
code,
pan,
enterdate,
cost,
created,
[FromDate] ,
[ToDate] ,
[IsLatest]
)
VALUES
(
source.id,
source.code ,
source.pan ,
source.enterdate ,
source.cost ,
source.created ,
@DateNow ,
NULL ,
1
)
output source.id,
$action
INTO #meteridinsert;
INSERT INTO [DIM].[Meters]
(
[id] ,
[code] ,
[pan] ,
[enterdate] ,
[cost] ,
[created] ,
[FromDate] ,
[ToDate] ,
[IsLatest]
)
SELECT [id] ,[code] ,[pan] ,[enterdate] ,[cost] ,[created] , @DateNow ,NULL ,1
FROM stg_meters a
INNER JOIN #meteridinsert cid
ON a.id = cid.meterid
AND cid.change = 'UPDATE'
Maybe you can do it with a MERGE statement, but I would prefer the typical UPDATE-and-INSERT approach because it is easier to understand (I am also not sure that MERGE allows you to use the same source record for both an update and an insert...).
First of all I create the table dimscd2 to represent your dimension table
create table dimscd2
(naturalkey int, descr varchar(100), startdate datetime, enddate datetime)
And then I insert some records...
insert into dimscd2 values
(1,'A','2019-01-12 00:00:00.000', '2020-01-01 00:00:00.000'),
(1,'B','2020-01-01 00:00:00.000', NULL)
As you can see, the "current" record is the one with descr='B', because its enddate is NULL. (I recommend using surrogate keys: a surrogate key is just an incremental key for each record of your dimension, and the fact table must be linked to it in order to reflect the status of the fact at the moment it happened.)
Then, I have created some dummy data to represent the source data with the changes for the same natural key
-- new data (src_data)
select 1 as naturalkey,'C' as descr, cast('2020-01-02 00:00:00.000' as datetime) as dt into src_data
union all
select 1 as naturalkey,'D' as descr, cast('2020-01-03 00:00:00.000' as datetime) as dt
After that, I have created a temp table (##tmp) with this query to set the enddate for each record:
-- tmp table
select naturalkey, descr, dt,
lead(dt,1,0) over (partition by naturalkey order by dt) enddate,
row_number() over (partition by naturalkey order by dt) rn
into ##tmp
from src_data
The LEAD function takes the next start date for the same natural key, ordered by date (dt).
The ROW_NUMBER marks with 1 the oldest record in the source data for the natural key in the dimension.
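To make the two window functions concrete, here is a small standalone sketch that runs them over the same two sample rows, supplied inline through a VALUES list instead of src_data. One deliberate difference, which is my assumption rather than part of the answer: LEAD is called without a default, so the last row per natural key gets NULL directly instead of a sentinel date.

```sql
-- Standalone illustration: the two sample source rows are supplied inline.
SELECT s.naturalkey,
       s.descr,
       s.dt,
       -- Next start date for the same natural key; NULL for the newest row.
       LEAD(s.dt) OVER (PARTITION BY s.naturalkey ORDER BY s.dt) AS enddate,
       -- 1 marks the oldest source row per natural key.
       ROW_NUMBER() OVER (PARTITION BY s.naturalkey ORDER BY s.dt) AS rn
FROM (VALUES (1, 'C', CAST('2020-01-02' AS datetime)),
             (1, 'D', CAST('2020-01-03' AS datetime))
     ) AS s (naturalkey, descr, dt);
-- Row 'C': enddate = 2020-01-03, rn = 1
-- Row 'D': enddate = NULL,       rn = 2
```

With the NULL variant, the later insert would not need a CASE expression to translate a sentinel date back to NULL.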
Then, I proceed to close the "current" record using update
update d
set enddate = t.dt
from dimscd2 d
join ##tmp t
on d.naturalkey = t.naturalkey
and d.enddate is null
and t.rn = 1
And finally I add the new source data to the dimension with insert
insert into dimscd2
select naturalkey, descr, dt,
case enddate when '1900-01-01' then null else enddate end
from ##tmp
Final result is obtained with the query:
select * from dimscd2
You can test on this db<>fiddle
I've inherited a SQL Server database that has duplicate data in it. I need to find and remove the duplicate rows. But without an id field, I'm not sure how to find the rows.
Normally, I'd compare the table with itself using a LEFT JOIN and check that all fields are the same except the ID field, where table1.id <> table2.id. Without that, I don't know how to find duplicate rows without each row also matching itself.
TABLE:
productId int not null,
categoryId int not null,
state varchar(255) not null,
dateDone DATETIME not null
SAMPLE DATA
1, 3, "started", "2016-06-15 04:23:12.000"
2, 3, "started", "2016-06-15 04:21:12.000"
1, 3, "started", "2016-06-15 04:23:12.000"
1, 3, "done", "2016-06-15 04:23:12.000"
In that sample, only rows 1 and 3 are duplicates.
How do I find duplicates?
Use having (and group by)
select
productId
, categoryId
, state
, dateDone
, count(*)
from your_table
group by productId ,categoryId ,state, dateDone
having count(*) >1
You can do this with windowing functions. For instance
create table #tmp
(
Id INT
)
insert into #tmp
VALUES (1), (1), (2); --so now we have duplicated rows
WITH CTE AS
(
SELECT
ROW_NUMBER() OVER(PARTITION BY Id ORDER BY Id) AS [DuplicateCounter],
Id
FROM #tmp
)
DELETE FROM CTE
WHERE DuplicateCounter > 1 --duplicated rows have DuplicateCounter > 1
For some reason I thought you wanted to delete them; I guess I read that wrong. Just switch DELETE in my statement to SELECT and you get all of the duplicates but not the originals. Using DELETE, however, removes all the duplicates and still leaves you one record of each, which I suspect is what you want.
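As a sketch of the SELECT variant described here, assuming the #tmp table from this answer still holds its three rows (1), (1), (2):

```sql
-- Same CTE as the DELETE version, but only listing the surplus copies.
;WITH CTE AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY Id ORDER BY Id) AS DuplicateCounter,
           Id
    FROM #tmp
)
SELECT Id, DuplicateCounter
FROM CTE
WHERE DuplicateCounter > 1;  -- rows numbered 2 and up are the extra copies
-- Returns a single row: Id = 1, DuplicateCounter = 2
```

The first copy of each value always gets DuplicateCounter = 1, so it is never listed and would never be deleted.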
IF OBJECT_ID('tempdb..#TT') IS NOT NULL
BEGIN
DROP TABLE #TT
END
CREATE TABLE #TT (
productId int not null,
categoryId int not null,
state varchar(255) not null,
dateDone DATETIME not null
)
INSERT INTO #TT (productId, categoryId, state, dateDone)
VALUES (1, 3, 'started', '2016-06-15 04:23:12.000')
,(2, 3, 'started', '2016-06-15 04:21:12.000')
,(1, 3, 'started', '2016-06-15 04:23:12.000')
,(1, 3, 'done', '2016-06-15 04:23:12.000')
SELECT *
FROM
#TT
;WITH cte AS (
SELECT
*
,RowNum = ROW_NUMBER() OVER (PARTITION BY productId, categoryId, state, dateDone ORDER BY productId) --note what you order by doesn't matter
FROM
#TT
)
--if you want to delete them just do this otherwise change DELETE TO SELECT
DELETE
FROM
cte
WHERE
RowNum > 1
SELECT *
FROM
#TT
If you want to, and are able to change the schema, you can also add an identity column after the fact, and it will be populated for the existing records:
ALTER TABLE #TT
ADD Id INTEGER IDENTITY(1,1) NOT NULL
You can try a CTE and then limit the actual selection from the CTE to rows where RN = 1. Here is the query:
;WITH ACTE
AS
(
SELECT ProductID, categoryID, State, DateDone,
RN = ROW_NUMBER() OVER(PARTITION BY ProductID, CategoryID, State, DateDone
ORDER BY ProductID, CategoryID, State, DateDone)
FROM [Table]
)
SELECT * FROM ACTE WHERE RN = 1
I'm trying to adopt this solution to remove duplicate rows from a database table. However, in my case, another table must be checked to decide whether two rows are considered "duplicates". A full repro of my scenario looks like this:
-- Foreign key between these tables as well as "Group" table omitted for simplicity...
DECLARE @ItemType TABLE(Id INT, Title NVARCHAR(50), GroupId INT);
DECLARE @Item TABLE(Id INT IDENTITY(1,1), ItemTypeId INT, Created DATETIME2);
INSERT INTO @ItemType (Id, Title, GroupId)
VALUES (1, 'apple', 1), (2, 'banana', 1), (3, 'beans', 2);
INSERT INTO @Item (ItemTypeId, Created)
VALUES (1, '20141201'), (2, '20140615'), (3, '20140614');
-- Note: Id's are generated automatically
WITH cte AS (
SELECT ROW_NUMBER() OVER (PARTITION BY GroupId ORDER BY Created) AS Rnk
FROM @Item AS i
JOIN @ItemType AS it ON i.ItemTypeId = it.Id
)
DELETE FROM cte
WHERE Rnk > 1;
This fails, obviously, with the following message:
View or function 'cte' is not updatable because the modification affects multiple base tables.
Can this be solved while sticking with the elegant cte-solution? Or does this require a move over to a version based on DELETE or even MERGE INTO?
You can stick with the CTE version, but the DELETE has to be more explicit about which rows it is going to remove. Return the Id of the item from the CTE and filter the rows to delete on it:
WITH cte AS (
SELECT i.Id,
ROW_NUMBER() OVER (PARTITION BY GroupId ORDER BY Created) AS Rnk
FROM @Item AS i
JOIN @ItemType AS it ON i.ItemTypeId = it.Id
)
DELETE FROM @Item
WHERE Id IN (SELECT Id FROM cte WHERE Rnk > 1);
The following works, I'm just wondering if this is the correct approach to finding the latest value for each audit field.
USE tempdb
CREATE Table Tbl(
TblID Int,
AuditFieldID Int,
AuditValue Int,
AuditDate Date
)
GO
INSERT INTO Tbl(TblID,AuditFieldID,AuditValue,AuditDate) VALUES(1,10,101,'1/1/2001')
INSERT INTO Tbl(TblID,AuditFieldID,AuditValue,AuditDate) VALUES(2,10,102,'1/1/2002')
INSERT INTO Tbl(TblID,AuditFieldID,AuditValue,AuditDate) VALUES(3,20,201,'1/1/2001')
INSERT INTO Tbl(TblID,AuditFieldID,AuditValue,AuditDate) VALUES(4,20,202,'1/1/2009')
SELECT AuditFieldID,AuditValue,AuditDate
FROM Tbl A
WHERE TblID=
(SELECT TOP 1 TblID
FROM Tbl
WHERE AuditFieldID=A.AuditFieldID
ORDER BY AuditDate DESC
)
Aggregate/ranking to get key and latest date, join back to get value.
This assumes SQL Server 2005+
DECLARE @tbl Table (
TblID Int,
AuditFieldID Int,
AuditValue Int,
AuditDate Date
)
INSERT INTO @tbl(TblID,AuditFieldID,AuditValue,AuditDate) VALUES(1,10,101,'1/1/2001')
INSERT INTO @tbl(TblID,AuditFieldID,AuditValue,AuditDate) VALUES(2,10,102,'1/1/2002')
INSERT INTO @tbl(TblID,AuditFieldID,AuditValue,AuditDate) VALUES(3,20,201,'1/1/2001')
INSERT INTO @tbl(TblID,AuditFieldID,AuditValue,AuditDate) VALUES(4,20,202,'1/1/2009')
;WITH cLatest AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY AuditFieldID ORDER BY AuditDate DESC) AS Ranking,
AuditFieldID, AuditDate
FROM
@tbl
)
SELECT
A.AuditFieldID, A.AuditValue, A.AuditDate
FROM
@tbl A
JOIN
cLatest C ON A.AuditFieldID = C.AuditFieldID AND A.AuditDate = C.AuditDate
WHERE
C.Ranking = 1
Simpler:
SELECT top 1 AuditFieldID,AuditValue,AuditDate
FROM Tbl
order by AuditDate DESC
There are various methods for doing this. Different methods perform differently. I encourage you to look at this blog which explains the various methods.
Including an Aggregated Column's Related Values
You don't need the WHERE clause, as you are already selecting from Tbl A and selecting on the same field.