Type II dimension joins - SQL

I have the following lookup table in my OLTP database:
CREATE TABLE TransactionState
(
TransactionStateId INT IDENTITY (1, 1) NOT NULL,
TransactionStateName VarChar (100)
)
When this comes into my OLAP, I change the structure as follows:
CREATE TABLE TransactionState
(
TransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
TransactionStateName VarChar (100) NOT NULL,
StartDateTime DateTime NOT NULL,
EndDateTime DateTime NULL
)
My question is regarding the TransactionStateId column. Over time, I may have duplicate TransactionStateId values in my OLAP, but with the combination of StartDateTime and EndDateTime, they would be unique.
I have seen samples of Type-2 Dimensions where an OriginalTransactionStateId is added and the incoming TransactionStateId is mapped to it, plus a new TransactionStateId IDENTITY field becomes the PK and is used for the joins.
CREATE TABLE TransactionState
(
TransactionStateId INT IDENTITY (1, 1) NOT NULL,
OriginalTransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
TransactionStateName VarChar (100) NOT NULL,
StartDateTime DateTime NOT NULL,
EndDateTime DateTime NULL
)
Should I go with bachelorette #2 or bachelorette #3?

By this phrase:
With the combination of StartDateTime and EndDateTime, they would be unique.
do you mean that the ranges never overlap, or that they satisfy a database UNIQUE constraint?
If the former, then you can use the StartDateTime in joins, but note that it may be inefficient, since it will use a "<=" condition instead of "=".
If the latter, then just use a fake identity.
Databases in general do not allow an efficient algorithm for this query:
SELECT *
FROM TransactionState
WHERE @value BETWEEN StartDateTime AND EndDateTime
, unless you do arcane tricks with SPATIAL data.
That's why you'll have to use this condition in a JOIN:
SELECT *
FROM factTable
CROSS APPLY
(
SELECT TOP 1 *
FROM TransactionState
WHERE StartDateTime <= factDateTime
ORDER BY
StartDateTime DESC
) ts
, which deprives the optimizer of the possibility of using a HASH JOIN, often the most efficient join type for such queries.
See this article for more details on this approach:
Converting currencies
Rewriting the query so that it could use a HASH JOIN resulted in a 600% performance gain, though this is only possible if your datetimes have an accuracy of a day or coarser (otherwise the hash table would grow very large).
Since the time component is stripped from your StartDateTime and EndDateTime, you can create a CTE like this:
WITH cal AS
(
SELECT CAST('2009-01-01' AS DATE) AS cdate
UNION ALL
SELECT DATEADD(day, 1, cdate)
FROM cal
WHERE cdate <= '2009-03-01'
),
state AS
(
SELECT cdate, ts.*
FROM cal
CROSS APPLY
(
SELECT TOP 1 *
FROM TransactionState
WHERE StartDateTime <= cdate
ORDER BY
StartDateTime DESC
) ts
WHERE ts.EndDateTime >= cdate OR ts.EndDateTime IS NULL -- NULL EndDateTime marks the current row
)
SELECT *
FROM factTable
JOIN state
ON cdate = CAST(factDate AS DATE)
If your date ranges span more than 100 days, adjust the MAXRECURSION option on the query that uses the CTE.
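The option goes on the statement that consumes the CTE, not on the CTE definition itself. A minimal sketch based on the query above:
SELECT *
FROM factTable
JOIN state
ON cdate = CAST(factDate AS DATE)
OPTION (MAXRECURSION 366) -- allow up to a year of generated dates; 0 removes the cap entirely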

Please be aware that IDENTITY(1,1) is a declaration for auto-generating values in that column. This is different from PRIMARY KEY, which is a declaration that makes the column the table's primary key (in SQL Server, backed by a clustered index by default). The two declarations mean different things, and there are performance implications if you don't say PRIMARY KEY.
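For instance, a surrogate-key column typically carries both declarations. A minimal sketch (table and column names are made up for illustration):
CREATE TABLE DimExample
(
ExampleKey INT IDENTITY (1, 1) NOT NULL, -- IDENTITY only auto-generates values
ExampleName VARCHAR (100) NOT NULL,
CONSTRAINT PK_DimExample PRIMARY KEY CLUSTERED (ExampleKey) -- PRIMARY KEY is what creates the clustered index
)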

You could also use SSIS to load the DW. In the slowly changing dimension (SCD) transformation, you can set how to treat each attribute. If a historical attribute is selected, the type 2 SCD is applied to the whole row, and the transformation takes care of details. You also get to configure if you prefer start_date, end_date or a current/expired column.
The thing to differentiate here is the difference between the primary key and the business (natural) key. The primary key uniquely identifies a row in the table. The business key uniquely identifies a business object/entity and can be repeated in a dimension table. Each time SCD 2 is applied, a new row is inserted with a new primary key but the same business key; the old row is then marked as expired while the new one is marked as current, or the start date and end date fields are populated appropriately.
The DW should not expose primary keys, so incoming data from OLTP contains business keys, while assignment of primary keys is under control of the DW; IDENTITY int is good for PKs in dimension tables.
The cool thing is that SCD transformation in SSIS takes care of this.
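For comparison, outside of SSIS the same type 2 handling boils down to an expire-and-insert pair. A minimal T-SQL sketch against the StartDateTime/EndDateTime variant of the dimension from the question (@BusinessKey and @NewName are hypothetical parameters):
-- expire the current row for this business key
UPDATE TransactionState
SET EndDateTime = GETDATE()
WHERE OriginalTransactionStateId = @BusinessKey
AND EndDateTime IS NULL;

-- insert the new current row; IDENTITY generates the new surrogate key
INSERT INTO TransactionState
(OriginalTransactionStateId, TransactionStateName, StartDateTime, EndDateTime)
VALUES (@BusinessKey, @NewName, GETDATE(), NULL);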

SQL Server Indexing and Composite Keys

Given the following:
-- This table will have roughly 14 million records
CREATE TABLE IdMappings
(
Id int IDENTITY(1,1) NOT NULL,
OldId int NOT NULL,
NewId int NOT NULL,
RecordType varchar(80) NOT NULL, -- 15 distinct values, will never increase
Processed bit NOT NULL DEFAULT 0,
CONSTRAINT pk_IdMappings
PRIMARY KEY CLUSTERED (Id ASC)
)
CREATE UNIQUE INDEX ux_IdMappings_OldId ON IdMappings (OldId);
CREATE UNIQUE INDEX ux_IdMappings_NewId ON IdMappings (NewId);
and this is the most common query run against the table:
WHILE @firstBatchId <= @maxBatchId
BEGIN
-- the result of this is used to insert into another table:
SELECT
map.NewId -- and lots of non-indexed columns from SOME_TABLE
FROM
IdMappings map
INNER JOIN
SOME_TABLE foo ON foo.Id = map.OldId
WHERE
map.Id BETWEEN @firstBatchId AND @lastBatchId
AND map.RecordType = @someRecordType
AND map.Processed = 0
-- We only really need this in case the user kills the binary or SQL Server service:
UPDATE IdMappings
SET Processed = 1
WHERE Id BETWEEN @firstBatchId AND @lastBatchId
AND RecordType = @someRecordType
SET @firstBatchId += 4999
SET @lastBatchId += 4999
END
What are the best indices to add? I figure Processed isn't worth indexing since it only has 2 values. Is it worth indexing RecordType since there are only about 15 distinct values? How many distinct values does a column need to have before it is worth indexing?
Is there any advantage in a composite key if some of the fields are in the WHERE and some are in a JOIN's ON condition? For example:
CREATE INDEX ix_IdMappings_RecordType_OldId
ON IdMappings (RecordType, OldId)
... if I wanted both these fields indexed (I'm not saying I do), does this composite index gain any advantage given that the two columns don't appear together in the same WHERE or the same ON?
Insert time into IdMappings isn't really an issue. After we insert all records into the table, we don't need to do so again for months if ever.
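For what it's worth, the usual shape for this kind of query is a composite key on the equality column followed by the range column, with the join and output columns carried as INCLUDEs so the batch SELECT is fully covered. This is only a sketch under those assumptions, not a definitive answer; whether it beats a scan of the clustered PK depends on the actual plan:
CREATE INDEX ix_IdMappings_RecordType_Id
ON IdMappings (RecordType, Id) -- equality predicate first, then the range predicate
INCLUDE (OldId, NewId, Processed); -- covers the join column, output column, and residual filter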

Populate surrogate Datekey based on Date in a column

I have created a table
CREATE TABLE myTable(
DateID INT PRIMARY KEY NOT NULL,
myDate DATE)
Then I want to populate the table with data from a staging table using SSIS. The problem is that I don't understand how to generate DateID based on the incoming value from a staging table. For example:
after the staging table inserted 2020-12-12, I want my DateID to become 20201212 and so on.
I tried to Google this issue but didn't find anything related to my case (though I may have searched badly). How can I do it?
You can use a persisted computed column:
CREATE TABLE myTable (
DateID AS (YEAR(myDate) * 10000 + MONTH(myDate) * 100 + DAY(myDate)) PERSISTED PRIMARY KEY,
myDate DATE
);
Here is a db<>fiddle.
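A quick usage sketch against the table above: only myDate is supplied, and DateID is derived automatically.
INSERT INTO myTable (myDate) VALUES ('2020-12-12');

SELECT DateID, myDate FROM myTable;
-- returns DateID = 20201212, myDate = 2020-12-12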

How to create date format yyyy-mm

How can I create a date with format yyyy-mm in PostgreSQL 11? My table is:
CREATE TABLE public."ASSOL"
(
id integer NOT NULL,
"ind" character(50) ,
"s_R" character(50) ,
"R" character(50) ,
"th" character(50),
"C_O" character(50) ,
"ASSOL" numeric(11,3),
date date,
CONSTRAINT "ASSOL_pkey" PRIMARY KEY (id)
This is a variation of Kaushik's answer.
You should just use the date data type. There is no need to create another type for this. However, I would enforce this with a check constraint:
CREATE TABLE public.ASSOL (
id serial primary key,
ind varchar(50) ,
s_R varchar(50) ,
R varchar(50) ,
th varchar(50),
C_O varchar(50) ,
ASSOL numeric(11,3),
yyyymm date,
constraint chk_assol_date check (yyyymm = date_trunc('month', yyyymm))
);
This only allows you to insert values that are the first day of the month. Other inserts will fail.
Additional notes:
Don't use double quotes when creating tables. You then have to refer to the columns/tables using double quotes, which just clutters queries. Your identifiers should be case-insensitive.
An integer primary key would normally be a serial column.
NOT NULL is redundant for a PRIMARY KEY column.
Use reasonable names for columns. If you want a column to represent a month, then yyyymm is more informative than date.
Postgres stores varchar() and char() values in essentially the same way, but in most databases varchar() is preferred, because char() pads values with trailing spaces that actually occupy bytes on the data pages.
For the year and month you can try the following:
SELECT to_char(now(),'YYYY-MM') as year_month
year_month
2019-05
You cannot create a date datatype that stores only the year and month component. There's no such option available at the data type level.
If you want to truncate the day component and default it to the start of the month, you can do that. This is as good as having only the month and year component, as all the dates will have day = 1 and only the month and year will change with the time of the insert.
For Eg:
create table t ( id int, col1 text,
"date" date default date_trunc('month',current_date) );
insert into t(id,col1) values ( 1, 'TEXT1');
select * from t
id col1 date
1 TEXT1 2019-05-01
If you do not want to store a default date, simply use the date_trunc('month', date) expression wherever needed; it could be either in a GROUP BY or in a SELECT.
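For example, reusing the table t from above, you can group on the truncated month directly (a minimal sketch):
SELECT date_trunc('month', "date")::date AS year_month,
count(*) AS rows_in_month
FROM t
GROUP BY 1;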

Creating a table specifically for tracking change information to remove duplicated columns from tables

When creating tables, I have generally created them with a couple extra columns that track change times and the corresponding user:
CREATE TABLE dbo.Object
(
ObjectId int NOT NULL IDENTITY (1, 1),
ObjectName varchar(50) NULL ,
CreateTime datetime NOT NULL,
CreateUserId int NOT NULL,
ModifyTime datetime NULL ,
ModifyUserId int NULL
) ON [PRIMARY]
GO
I have a new project now where if I continued with this structure I would have 6 additional columns on each table with this type of change tracking. A time column, user id column and a geography column. I'm now thinking that adding 6 columns to every table I want to do this on doesn't make sense. What I'm wondering is if the following structure would make more sense:
CREATE TABLE dbo.Object
(
ObjectId int NOT NULL IDENTITY (1, 1),
ObjectName varchar(50) NULL ,
CreateChangeId int NOT NULL,
ModifyChangeId int NULL
) ON [PRIMARY]
GO
-- foreign key relationships on CreateChangeId & ModifyChangeId
CREATE TABLE dbo.Change
(
ChangeId int NOT NULL IDENTITY (1, 1),
ChangeTime datetime NOT NULL,
ChangeUserId int NOT NULL,
ChangeCoordinates geography NULL
) ON [PRIMARY]
GO
Can anyone offer some insight into this minor database design problem, such as common practices and functional designs?
Where I work, we use the same construct as yours: every table has the following fields:
CreatedBy (int, not null, FK users table - user id)
CreationDate (datetime, not null)
ChangedBy (int, null, FK users table - user id)
ChangeDate (datetime, null)
Pro: easy to track and maintain; only one I/O operation (I'll come to that later)
Con: I can't think of any at the moment (well, OK, sometimes we don't use the change fields ;-)
IMO the approach with the extra table has the problem that you somehow have to reference the owning table for every record (unless you only need the one direction, Object to Tracking table). The approach also leads to more database I/O operations; for every insert or modify you will need to:
add entry to Table Object
add entry to Tracking Table and get the new Id
update Object Table entry with the Tracking Table Id
It would certainly make the application code that communicates with the DB a bit more complicated and error-prone.
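For the insert case, a minimal T-SQL sketch of the round trips described above (since Object.CreateChangeId is NOT NULL, the Change row has to be written first; @UserId and @Coords are hypothetical parameters):
-- add entry to the tracking table and capture the generated key
INSERT INTO dbo.Change (ChangeTime, ChangeUserId, ChangeCoordinates)
VALUES (GETDATE(), @UserId, @Coords);

DECLARE @ChangeId int = SCOPE_IDENTITY();

-- add entry to the Object table referencing the tracking row
INSERT INTO dbo.Object (ObjectName, CreateChangeId)
VALUES ('New object', @ChangeId);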

Indexing for tables used in temporal query

I have tables with the following structures:
CREATE TABLE [dbo].[rx](
[clmid] [int] IDENTITY(1,1) NOT NULL, -- omitted from the original listing; the primary key below references it
[pat_id] [int] NOT NULL,
[fill_Date] [date] NOT NULL,
[script_End_Date] AS (dateadd(day,[days_Sup],[fill_Date])),
[drug_Name] [varchar](50) NULL,
[days_Sup] [int] NOT NULL,
[quantity] [float] NOT NULL,
[drug_Class] [char](3) NOT NULL,
[ofInterest] bit,
CHECK (fill_Date <= script_End_Date),
PRIMARY KEY CLUSTERED
(
[clmid] ASC
)
)
CREATE TABLE [dbo].[Calendar](
[cal_date] [date] PRIMARY KEY,
[Year] AS YEAR(cal_date) PERSISTED,
[Month] AS MONTH(cal_date) PERSISTED,
[Day] AS DAY(cal_date) PERSISTED,
[julian_seq] AS 1+DATEDIFF(DD, CONVERT(DATE, CONVERT(varchar,YEAR(cal_date))+'0101'),cal_date),
id int identity);
I used these tables with this query:
;WITH x
AS (SELECT rx.pat_id,
c.cal_date,
Count(DISTINCT rx.drug_name) AS distinctDrugs
FROM rx,
calendar AS c
WHERE c.cal_date BETWEEN rx.fill_date AND rx.script_end_date
AND rx.ofinterest = 1
GROUP BY rx.pat_id,
c.cal_date
--the query example I used having count(1) =2, but to illustrate the non-contiguous intervals, in practice I need the below having statement
HAVING Count(*) > 1),
y
AS (SELECT x.pat_id,
x.cal_date
--c2.id is the row number in the calendar table.
,
c2.id - Row_number()
OVER(
partition BY x.pat_id
ORDER BY x.cal_date) AS grp_nbr,
distinctdrugs
FROM x,
calendar AS c2
WHERE c2.cal_date = x.cal_date)
SELECT *,
Rank()
OVER(
partition BY pat_id, grp_nbr
ORDER BY distinctdrugs) AS [ranking]
FROM y
The calendar table runs for three years and the rx table has about 800k rows in it. After the preceding query ran for a few minutes I decided to add an index to it to speed things up. The index that I added was
create index ix_rx
on rx (clmid)
include (pat_id,fill_date,script_end_date,ofinterest)
This index had zero effect on the run time of the query. Can anyone help explain why the aforementioned index is not being used? This is a retrospective database and no more data will be added to it. I can add the execution plan if needed.
The clmid field is not used at all in the query. As such, I would be surprised if the optimizer would consider it, just for the include columns.
If you want to speed up the query with indexes, start with the query where the table is used. The fields used are pat_id, drug_name, ofinterest, fill_date, and script_end_date. The last two are challenging because of the BETWEEN. You might try this index: rx(pat_id, drug_name, ofinterest, fill_date, script_end_date).
Having all the referenced fields in the index will make it possible to access the data without loading data pages.
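Spelled out as DDL, the suggestion above would look like this (a sketch; script_End_Date is a deterministic computed column, so it can be used as an index key):
CREATE INDEX ix_rx_covering
ON rx (pat_id, drug_Name, ofInterest, fill_Date, script_End_Date);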
Because it is not an appropriate index. Create two indexes, one on [pat_id] and the other on [drug_Name].