Update facts surrogate keys by new dimension with SCD2 - sql

I have the following problem. I am changing a dimension table in my database, and the new dimension table will have new key values. Because of that, I need to update all historical data in my fact table and replace the old surrogate keys with the new ones. Here is an example:
Old dimension
id | name | value | start_date | end_date
200 | Tom | 1 | 02.02.2023 | 31-12-9999
100 | Tom | 2 | 01.01.2023 | 01.02.2023
300 | Kate | 1 | 01.01.2023 | 31-12-9999
Fact table
Date | name_id
05.02.2023 | 200
05.01.2023 | 100
10.02.2023 | 300
New dimension
id | name | value | start_date | end_date
2 | Tom | 1 | 02.02.2023 | 31-12-9999
1 | Tom | 2 | 01.01.2023 | 01.02.2023
3 | Kate | 1 | 01.01.2023 | 31-12-9999
To simplify, we can assume that name is unique across all active records. So if Tom has id 100 on 05.01.2023, then with the new dimension table this surrogate key needs to be changed from 100 to 1. (Of course, the real case is more complex and there is no pattern like 100 to 1, 200 to 2, ...)
New fact
Date | name_id
05.02.2023 | 2
05.01.2023 | 1
10.02.2023 | 3
How can we do this in SQL? Without SCD2 it seems easy, but I am struggling with how to do it when I also need to look at the date ranges.
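One possible approach, sketched here with Python's built-in sqlite3 so it can be run as-is: use the old dimension to translate each fact key into its natural key (name), then pick the new dimension row for that name whose validity range contains the fact date. The table names old_dim, new_dim and fact and the column fact_date are placeholders, and the dates are rewritten as ISO strings (an assumption) so that BETWEEN compares them correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE old_dim (id INTEGER, name TEXT, value INTEGER, start_date TEXT, end_date TEXT);
CREATE TABLE new_dim (id INTEGER, name TEXT, value INTEGER, start_date TEXT, end_date TEXT);
CREATE TABLE fact (fact_date TEXT, name_id INTEGER);

-- Sample rows from the question, with dates converted to ISO format (assumption).
INSERT INTO old_dim VALUES
    (200, 'Tom', 1, '2023-02-02', '9999-12-31'),
    (100, 'Tom', 2, '2023-01-01', '2023-02-01'),
    (300, 'Kate', 1, '2023-01-01', '9999-12-31');
INSERT INTO new_dim VALUES
    (2, 'Tom', 1, '2023-02-02', '9999-12-31'),
    (1, 'Tom', 2, '2023-01-01', '2023-02-01'),
    (3, 'Kate', 1, '2023-01-01', '9999-12-31');
INSERT INTO fact VALUES
    ('2023-02-05', 200),
    ('2023-01-05', 100),
    ('2023-02-10', 300);
""")

# Old key -> name (via old_dim) -> new key valid on the fact date (via new_dim).
conn.execute("""
UPDATE fact
SET name_id = (
    SELECT n.id
    FROM old_dim AS o
    JOIN new_dim AS n ON n.name = o.name
    WHERE o.id = fact.name_id
      AND fact.fact_date BETWEEN n.start_date AND n.end_date
)
""")

new_fact = conn.execute("SELECT fact_date, name_id FROM fact ORDER BY rowid").fetchall()
print(new_fact)  # [('2023-02-05', 2), ('2023-01-05', 1), ('2023-02-10', 3)]
```

A fact row with no matching new dimension row would end up with a NULL key in this sketch, so on real data it may be safer to build the old-to-new mapping as a separate table first and check it for gaps and duplicates before updating.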

Related

Create a new column for group based on condition

I want to create a new column (Group ID) based on the following condition:
If the DOB and the first three letters of Name are the same, the rows must fall into the same Group ID.
Name | DOB | Group ID
Anny | 18-01-1922 | 0
Anny Scott | 01-01-1950 | 1
Annie | 01-01-1950 | 1
David | 14-02-1950 | 2
David Kern | 15-02-1951 | 3
William Perry | 15-02-1953 | 4
Kenneth Field | 15-02-1953 | 5
This is how I want to create the groups.
I have used the following code to create the group ID for the name (if the first three letters match):
df['Group ID Name']=df.groupby(df['name'].str[:3]).ngroup()
The following code is used to create the group ID for DOB (if two records have the same DOB):
df['Group ID DOB']=df.groupby('Date of Birth').ngroup()
I want to use both conditions to create the Group ID. Please help me with this.
Pass both keys to groupby in a list, and use sort=False so the groups are numbered in order of first appearance:
df['Group ID Name'] = df.groupby(['DOB',df['Name'].str[:3]], sort=False).ngroup()
print(df)
            Name         DOB  Group ID  Group ID Name
0           Anny  18-01-1922         0              0
1     Anny Scott  01-01-1950         1              1
2          Annie  01-01-1950         1              1
3          David  14-02-1950         2              2
4     David Kern  15-02-1951         3              3
5  William Perry  15-02-1953         4              4
6  Kenneth Field  15-02-1953         5              5

Spark: mean of values as a column

I am just starting to work with Spark, and I have to create a column whose values are based on another data frame. My first data frame has ID and start date columns, while the other one has a yield value, an acquired date and an ID. I have to add a new column to the first data frame containing the mean of the yield values from the other data frame that were acquired in the 30 days before the start date. The output should look something like this:
Table 1
ID start_date
1 01/12/2018
2 01/11/2019
Table 2
ID yield acquired_date
1 120 05/11/2019
1 100 05/11/2018
1 200 07/11/2018
1 200 08/11/2018
2 350 04/10/2020
2 300 04/10/2019
2 100 05/10/2019
output
ID start_date yield_mean
1 01/12/2018 250
2 01/11/2019 200
Note: the mean only accounts for values whose acquired date falls within the 30 days before the start date, so rows 0 and 4 of Table 2 are not used.
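The windowed mean described above comes down to a join on ID with a date-range condition, followed by a grouped average. The sketch below shows that logic in SQL, run through Python's built-in sqlite3 rather than Spark so that it is self-contained. The table names starts and yields, the column yield_value, the ISO date format and the inclusive 0 to 30 day window are all assumptions. With the sample rows, ID 1 averages the three 2018 readings (100, 200, 200), which is about 166.67 rather than the 250 shown in the expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE starts (id INTEGER, start_date TEXT);
CREATE TABLE yields (id INTEGER, yield_value REAL, acquired_date TEXT);

-- Sample rows from the question, dates rewritten from dd/mm/yyyy to ISO (assumption).
INSERT INTO starts VALUES (1, '2018-12-01'), (2, '2019-11-01');
INSERT INTO yields VALUES
    (1, 120, '2019-11-05'),
    (1, 100, '2018-11-05'),
    (1, 200, '2018-11-07'),
    (1, 200, '2018-11-08'),
    (2, 350, '2020-10-04'),
    (2, 300, '2019-10-04'),
    (2, 100, '2019-10-05');
""")

# Keep only readings acquired 0..30 days before the start date, then average per ID.
result = conn.execute("""
SELECT s.id, s.start_date, AVG(y.yield_value) AS yield_mean
FROM starts AS s
LEFT JOIN yields AS y
       ON y.id = s.id
      AND (julianday(s.start_date) - julianday(y.acquired_date)) BETWEEN 0 AND 30
GROUP BY s.id, s.start_date
ORDER BY s.id
""").fetchall()

for row in result:
    print(row)
```

The same join condition and grouped average should carry over to Spark SQL or to a DataFrame join followed by a group-by. The date arithmetic functions differ between engines, so the julianday call here is specific to SQLite.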

Select maximum value where another column is used for the grouping

I'm trying to join several tables, where one of the tables is acting as a
key-value store, and then after the joins find the maximum value in a
column less than another column. As a simplified example, I have the following three tables:
Documents:
DocumentID | Filename | LatestRevision
1 | D1001.SLDDRW | 18
2 | P5002.SLDPRT | 10
Variables:
VariableID | VariableName
1 | DateReleased
2 | Change
3 | Description
VariableValues:
DocumentID | VariableID | Revision | Value
1 | 2 | 1 | Created
1 | 3 | 1 | Drawing
1 | 2 | 3 | Changed Dimension
1 | 1 | 4 | 2021-02-01
1 | 2 | 11 | Corrected typos
1 | 1 | 16 | 2021-02-25
2 | 3 | 1 | Generic part
2 | 3 | 5 | Screw
2 | 2 | 4 | 2021-02-24
I can use the LEFT JOIN/IS NULL thing to get the latest version of
variables relatively easily (see http://sqlfiddle.com/#!7/5982d/3/0).
What I want is the latest version of variables that are less than or equal
to a revision which has a DateReleased, for example:
DocumentID | Filename | Variable | Value | VariableRev | DateReleased | ReleasedRev
1 | D1001.SLDDRW | Change | Changed Dimension | 3 | 2021-02-01 | 4
1 | D1001.SLDDRW | Description | Drawing | 1 | 2021-02-01 | 4
1 | D1001.SLDDRW | Description | Drawing | 1 | 2021-02-25 | 16
1 | D1001.SLDDRW | Change | Corrected Typos | 11 | 2021-02-25 | 16
2 | P5002.SLDPRT | Description | Generic Part | 1 | 2021-02-24 | 4
How do I do this?
I figured this out. Add another JOIN at the start to add in another version of the VariableValues table selecting only the DateReleased variables, then make sure that all the VariableValues Revisions selected are less than this date released. I think the LEFT JOIN has to be added after this table.
The example at http://sqlfiddle.com/#!9/bd6068/3/0 shows this better.
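For reference, here is a minimal sketch of the approach described above, run through Python's built-in sqlite3. It is a reconstruction from the description, not a copy of the linked fiddle. The aliases rel, vv and newer are made up for illustration, and the last VariableValues row is entered as DateReleased (VariableID 1), which the expected output implies, although the question's table lists it with VariableID 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (DocumentID INTEGER, Filename TEXT, LatestRevision INTEGER);
CREATE TABLE Variables (VariableID INTEGER, VariableName TEXT);
CREATE TABLE VariableValues (DocumentID INTEGER, VariableID INTEGER, Revision INTEGER, Value TEXT);

INSERT INTO Documents VALUES (1, 'D1001.SLDDRW', 18), (2, 'P5002.SLDPRT', 10);
INSERT INTO Variables VALUES (1, 'DateReleased'), (2, 'Change'), (3, 'Description');
INSERT INTO VariableValues VALUES
    (1, 2, 1, 'Created'),
    (1, 3, 1, 'Drawing'),
    (1, 2, 3, 'Changed Dimension'),
    (1, 1, 4, '2021-02-01'),
    (1, 2, 11, 'Corrected typos'),
    (1, 1, 16, '2021-02-25'),
    (2, 3, 1, 'Generic part'),
    (2, 3, 5, 'Screw'),
    -- Assumption: VariableID 1 (DateReleased); the question's table lists 2.
    (2, 1, 4, '2021-02-24');
""")

released = conn.execute("""
SELECT d.DocumentID, d.Filename, v.VariableName, vv.Value,
       vv.Revision AS VariableRev, rel.Value AS DateReleased, rel.Revision AS ReleasedRev
FROM Documents AS d
-- rel: one row per released revision of the document
JOIN VariableValues AS rel ON rel.DocumentID = d.DocumentID
JOIN Variables AS relvar ON relvar.VariableID = rel.VariableID
                        AND relvar.VariableName = 'DateReleased'
-- vv: other variables set at or before that released revision
JOIN VariableValues AS vv ON vv.DocumentID = d.DocumentID
                         AND vv.VariableID <> rel.VariableID
                         AND vv.Revision <= rel.Revision
JOIN Variables AS v ON v.VariableID = vv.VariableID
-- newer: a later value of the same variable, still within the release
LEFT JOIN VariableValues AS newer ON newer.DocumentID = vv.DocumentID
                                 AND newer.VariableID = vv.VariableID
                                 AND newer.Revision > vv.Revision
                                 AND newer.Revision <= rel.Revision
WHERE newer.DocumentID IS NULL
ORDER BY d.DocumentID, rel.Revision, v.VariableName
""").fetchall()

for row in released:
    print(row)
```

The LEFT JOIN/IS NULL pair keeps a value only when no later revision of the same variable exists at or below the released revision, which is why the release join has to come before it.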

SQL - Referencing 3 tables

This is in relation to my survey application for our team. I have 3 tables in my database related to this problem.
I apologize if the database is not fully normalized.
TBL_CHURCH columns:
1 FAM_CHURCH_SACRMNT_NUM (Primary Key) Int(15)
2 RSPONDNT_NUM
3 SURVYR_NUM
4 QN_NUMBER
5 CHRCHFAMLY_NAME
6 CHRCHFAMLY_ISBAPTIZED
Sample row based on order of columns above:
1 2 3 4 5 6
6422164 76826499 5712 362 Serio Tecson Jr. Yes
TBL_INTRVW columns:
1 QN_NUMBR (Primary Key)
2 SURVYR_NUM
3 ZONE_NUM
4 RSPONDNT_NUM
Sample row based on order of columns above:
1 2 3 4
362 5712 11 76826499
TBL_AREA columns:
1 BRGY_ZONE_NUM (Primary Key)
2 BRGY_CODE
Sample row based on order of columns above:
1 2
11 2A
21 2A
31 2A
The field CHRCHFAMLY_ISBAPTIZED has only two values, "Yes" or "No". Each row has a QN_NUMBER value that references TBL_INTRVW, each QN_NUMBR in TBL_INTRVW has a single ZONE_NUM that references TBL_AREA, and that ZONE_NUM has a corresponding BRGY_CODE. Each BRGY_CODE has at least 2 ZONE_NUM values.
My problem is that I want to count the number of people baptized in a given area.
The output more or less should look like this:
(The output is collected from the 3 different ZONE_NUM)
Zone Name Num of People Baptized
2A 20
I'm having trouble working out what to use in my SQL statement. Should I use a WHERE within an INNER JOIN? And how should I write the SELECT?
-- Count baptized people per zone; the WHERE clause keeps only 'Yes' rows
SELECT c.BRGY_ZONE_NUM, COUNT(a.CHRCHFAMLY_ISBAPTIZED) AS [Num of People Baptized]
FROM TBL_CHURCH a
LEFT JOIN TBL_INTRVW b
    ON a.QN_NUMBER = b.QN_NUMBR
LEFT JOIN TBL_AREA c
    ON b.ZONE_NUM = c.BRGY_ZONE_NUM
WHERE a.CHRCHFAMLY_ISBAPTIZED = 'Yes'
GROUP BY c.BRGY_ZONE_NUM
I don't see a Zone Name column in the three tables, so I used BRGY_ZONE_NUM.
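Since the requested output lists 2A, which is a BRGY_CODE value, grouping by c.BRGY_CODE instead would collapse all zones of one area into a single row. The sketch below uses Python's built-in sqlite3. Only the TBL_AREA rows and the first row of TBL_INTRVW and TBL_CHURCH come from the question; the remaining rows are made-up test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TBL_AREA (BRGY_ZONE_NUM INTEGER PRIMARY KEY, BRGY_CODE TEXT);
CREATE TABLE TBL_INTRVW (QN_NUMBR INTEGER PRIMARY KEY, SURVYR_NUM INTEGER,
                         ZONE_NUM INTEGER, RSPONDNT_NUM INTEGER);
CREATE TABLE TBL_CHURCH (FAM_CHURCH_SACRMNT_NUM INTEGER PRIMARY KEY, RSPONDNT_NUM INTEGER,
                         SURVYR_NUM INTEGER, QN_NUMBER INTEGER,
                         CHRCHFAMLY_NAME TEXT, CHRCHFAMLY_ISBAPTIZED TEXT);

INSERT INTO TBL_AREA VALUES (11, '2A'), (21, '2A'), (31, '2A');
-- First row is the sample from the question; the second is hypothetical.
INSERT INTO TBL_INTRVW VALUES
    (362, 5712, 11, 76826499),
    (363, 5712, 21, 76826500);
-- First row is the sample from the question; the other two are hypothetical.
INSERT INTO TBL_CHURCH VALUES
    (6422164, 76826499, 5712, 362, 'Serio Tecson Jr.', 'Yes'),
    (6422165, 76826499, 5712, 362, 'Hypothetical Person A', 'No'),
    (6422166, 76826500, 5712, 363, 'Hypothetical Person B', 'Yes');
""")

# Count 'Yes' rows per area code, summing over every zone in that area.
rows = conn.execute("""
SELECT c.BRGY_CODE, COUNT(*) AS baptized
FROM TBL_CHURCH AS a
JOIN TBL_INTRVW AS b ON a.QN_NUMBER = b.QN_NUMBR
JOIN TBL_AREA AS c ON b.ZONE_NUM = c.BRGY_ZONE_NUM
WHERE a.CHRCHFAMLY_ISBAPTIZED = 'Yes'
GROUP BY c.BRGY_CODE
""").fetchall()

print(rows)  # [('2A', 2)]
```

Inner joins are used here because a church row without a matching interview or area cannot be assigned to any area code anyway.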

MDX query to get employees under given supervisor with parent child relationship

I have an employee dimension in my cube where each employee has a supervisor who is also an employee. The sample data set is:
Employee ID | Supervisor ID | Name
1 0 ABC
2 1 AAA
3 1 BBB
4 2 CCC
5 2 DDD
6 4 EEE
7 3 FFF
I want to get all the employees under a given supervisor. E.g. if the supervisor is 2, then the result should be
CCC
DDD
EEE
Using the query below I can get all the employees:
SELECT {AddCalculatedMembers({[Employee].[EmployeeName].Children})} ON COLUMNS FROM [MY_CUBE]
I am new to MDX; please tell me how to write an MDX query for the above requirement.
@mmarie
I already have a cube, but I am not sure whether I implemented it correctly. My schema is as follows:
The dimension "dimEmployee" has columns "EmployeeID, EmployeeName, Dept".
I have also used a bridge table "BridgeEmployee", which has columns "ParentEmployeeID, ChildEmployeeID, Distance".
Sample data in the bridge table:
ParentEmployeeID | ChildEmployeeID | Distance
1 1 0
2 2 0
1 2 1
3 3 0
1 3 1
4 4 0
2 4 1
1 4 2
I am using SSAS and I have implemented the bridge table as Measure Group.
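This is not MDX, but as a reference for the result the query should return, the sketch below derives the bridge rows (parent, child, distance) from the sample employee data with a recursive query and then lists everyone under a given supervisor. It uses Python's built-in sqlite3. The table name employee, the lower-case column names and the helper subordinates are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employee_id INTEGER, supervisor_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, 0, "ABC"), (2, 1, "AAA"), (3, 1, "BBB"), (4, 2, "CCC"),
     (5, 2, "DDD"), (6, 4, "EEE"), (7, 3, "FFF")],
)

# Recursive CTE producing rows shaped like BridgeEmployee:
# every employee paired with itself (distance 0) and with each descendant.
BRIDGE_CTE = """
WITH RECURSIVE bridge(parent_id, child_id, distance) AS (
    SELECT employee_id, employee_id, 0 FROM employee
    UNION ALL
    SELECT b.parent_id, e.employee_id, b.distance + 1
    FROM bridge AS b
    JOIN employee AS e ON e.supervisor_id = b.child_id
)
"""


def subordinates(supervisor_id):
    """Return the names of all direct and indirect reports of a supervisor."""
    sql = BRIDGE_CTE + """
    SELECT e.name
    FROM bridge AS b
    JOIN employee AS e ON e.employee_id = b.child_id
    WHERE b.parent_id = ? AND b.distance > 0
    ORDER BY e.employee_id
    """
    return [name for (name,) in conn.execute(sql, (supervisor_id,))]


bridge_rows = set(conn.execute(BRIDGE_CTE + "SELECT parent_id, child_id, distance FROM bridge"))
print(subordinates(2))  # ['CCC', 'DDD', 'EEE']
```

The derived rows include the sample bridge rows listed above, for example (1, 4, 2), so filtering the bridge on ParentEmployeeID = 2 with Distance > 0 gives the same three employees.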