SQL find MIN and MAX range value for a subset of columns in SCD [duplicate]

This question already has answers here:
Find min and max for subsets of consecutive rows - gaps and islands (2 answers)
Group similar objects in different date ranges to get min and max dates in SQL Server (1 answer)
Closed 1 year ago.
I have implemented CDC - SCD Type 2 on the customer data set.
I have implemented CDC on a large number of columns, but the ask is to track behavior for only a subset of those columns.
In the input table below, ID identifies the customer, RATE_CODE is one of the CDC fields, and START_DATE and END_DATE are the CDC change dates.
I want to know how the customer data (RATE_CODE) changes over a period of time.
For example, rows 1-3 have the same RATE_CODE, so I need MIN(START_DATE) from row 1 and MAX(END_DATE) from row 3.
I tried applying GROUP BY on (ID, RATE_CODE) with MIN and MAX on the dates, but it gives the wrong value: the max END_DATE would be picked from row 9, for which I want a separate entry covering rows 6-9.
INPUT TABLE
ROW NUMBER | ID | RATE_CODE | START_DATE | END_DATE
1          | 1  | A1        | 01-01-2021 | 18-01-2021
2          | 1  | A1        | 18-01-2021 | 25-02-2021
3          | 1  | A1        | 25-02-2021 | 15-03-2021
4          | 1  | A2        | 15-03-2021 | 28-03-2021
5          | 1  | A2        | 28-03-2021 | 28-05-2021
6          | 1  | A1        | 28-05-2021 | 28-06-2021
7          | 1  | A1        | 28-06-2021 | 12-07-2021
8          | 1  | A1        | 20-07-2021 | 28-07-2021
9          | 1  | A1        | 28-08-2021 | 13-09-2021
10         | 1  | A2        | 13-09-2021 | 13-10-2021
EXPECTED OUTPUT
ID | RATE_CODE | START_DATE | END_DATE
1  | A1        | 01-01-2021 | 15-03-2021
1  | A2        | 15-03-2021 | 28-05-2021
1  | A1        | 28-05-2021 | 13-09-2021
1  | A2        | 13-09-2021 | 13-10-2021
There may already be articles or answers on the net, but because of how the question is framed I couldn't find them.
I want the solution in SQL, but for the community's benefit PySpark and other languages are welcome too.
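For reference, this is the gaps-and-islands pattern named in the duplicate link above. A minimal sketch in standard SQL, assuming the data sits in a hypothetical table scd(ID, RATE_CODE, START_DATE, END_DATE): flag every row whose RATE_CODE differs from the previous row, turn the running count of flags into an island id, then aggregate per island.
WITH flagged AS (
    SELECT ID, RATE_CODE, START_DATE, END_DATE,
           CASE WHEN RATE_CODE = LAG(RATE_CODE)
                     OVER (PARTITION BY ID ORDER BY START_DATE)
                THEN 0 ELSE 1 END AS chg            -- 1 whenever the code changes
    FROM scd                                        -- hypothetical table name
),
grouped AS (
    SELECT flagged.*,
           SUM(chg) OVER (PARTITION BY ID
                          ORDER BY START_DATE
                          ROWS UNBOUNDED PRECEDING) AS grp   -- island id
    FROM flagged
)
SELECT ID, RATE_CODE,
       MIN(START_DATE) AS START_DATE,
       MAX(END_DATE)   AS END_DATE
FROM grouped
GROUP BY ID, RATE_CODE, grp
ORDER BY ID, MIN(START_DATE);
On the sample data this yields exactly the four rows in EXPECTED OUTPUT, including the separate entry for rows 6-9.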

Related

Add a column with fixed values for each value of another column in Redshift

I have the following table:
[table screenshot]
I want to add a date range for each user.
How can I achieve this?
If this is possible with a query in Redshift, that would be useful.
If not, what is an efficient way to create this in Python pandas, given the data has about 8 lakh (800k) records?
Given this dataframe df:
userid username
0 1 a
1 2 b
2 3 c
you can use numpy repeat and tile:
import numpy as np
import pandas as pd

dr = pd.date_range('2020-01-01', '2020-01-03')
df = pd.DataFrame(
    np.repeat(df.to_numpy(), len(dr), 0), columns=df.columns
).assign(date=np.tile(dr.to_numpy(), len(df)))
Result:
userid username date
0 1 a 2020-01-01
1 1 a 2020-01-02
2 1 a 2020-01-03
3 2 b 2020-01-01
4 2 b 2020-01-02
5 2 b 2020-01-03
6 3 c 2020-01-01
7 3 c 2020-01-02
8 3 c 2020-01-03
In SQL this is simple too - just cross join with the list of dates you want to add to each row (replicating the rows). You can see in your example that 3 rows and 3 dates result in 9 rows. (Untested explanatory code:)
select userid, username, d."date"
from <table>
cross join (
    select '2020-01-01'::date as "date"
    union all select '2020-01-02'::date
    union all select '2020-01-03'::date
) d;
Now the problem with this simple approach is that if you are dealing with large tables and long lists of dates, the multiplication will kill you. 10 billion rows by 5,000 dates is 50 trillion resulting rows - making this will take a long time and storing it will take lots of disk space. For small tables and short lists of dates this works fine.
If you are on the "big" side of things you will likely need to rethink what you are trying to do. Since you are using Redshift, there is a real chance you are working at that scale and may need to do that rethinking.

Creating min and max values and comparing them to timestamp values

I have a PostgreSQL database and I have a table that I am looking to query to determine which presses have been updated between the first cycle created_timestamp and the most recent cycle created_timestamp. Here is an example of the table, which is called event_log_summary.
press_id cycle_number created_timestamp
1 1 2020-02-07 16:07:52
1 2 2020-02-07 16:07:53
1 3 2020-02-07 16:07:54
1 4 2020-04-01 13:23:10
2 1 2020-01-13 8:33:23
2 2 2020-01-13 8:33:24
2 3 2020-01-13 8:33:25
3 1 2020-02-21 18:45:44
3 2 2020-02-21 18:45:45
3 3 2020-02-26 14:22:12
This is the query I used to get a three-column output of press_id, minCycle, and maxCycle. I then want to compare the created_timestamp of the max cycle with the created_timestamp of the min cycle and check whether there is at least a certain amount of time between the two, say at least 1 day; I am unsure how to implement that.
SELECT press_id,
       MIN(cycle_number) AS minCycle,
       MAX(cycle_number) AS maxCycle
FROM event_log_detail
GROUP BY press_id
I have tried different things like using WHERE (MAX(cycle_number) - MIN(cycle_number)) > 1, but I am pretty new to SQL and don't quite know how to implement this. The output I am looking for, with a difference of at least one day, would be the following:
press_id
1
3
Presses 1 and 3 have their maximum cycle created_timestamp at least one day later than their minimum cycle created_timestamp. I am just looking for the press_ids whose first and last cycles differ by at least one day; I don't need any other information in the output, just one column with the press_ids. Any help would be appreciated. Thanks.
You can use a HAVING clause:
select press_id,
       max(created_timestamp) - min(created_timestamp) as diff
from event_log_detail
group by press_id
having max(created_timestamp) > min(created_timestamp) + interval '1 day';
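Since only the press_id column is wanted in the output, a minimal variation of the same idea (diff dropped, "at least one day" expressed with >=) would be:
select press_id
from event_log_detail
group by press_id
having max(created_timestamp) >= min(created_timestamp) + interval '1 day';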

Percentage of variable corresponding to percentage of other variable

I have two numerical variables, and would like to calculate the percentage of one variable corresponding to at least 50% of the other variable's sum.
For example:
A | B
__________
2 | 8
1 | 20
3 | 12
5 | 4
2 | 7
1 | 11
4 | 5
Here, the sum of column B is 67, so I'm looking for the rows (in B's descending order) whose cumulative sum reaches at least half of that (33.5).
In this case, those are rows 2, 3 & 6 (cumulative sum of 43). The sum of those rows' column A is 5, which I want to compare to the total sum of column A (18).
Therefore, the result I'm looking for is 5 / 18 * 100 ≈ 27.78%.
I'm looking for a way to implement this in QlikSense, or in SQL.
Here's one way you can do it - there is probably some optimisation to be done, but this gives what you want.
Source:
LOAD
*,
RowNo() as RowNo_Source
Inline [
A , B
2 , 8
1 , 20
3 , 12
5 , 4
2 , 7
1 , 11
4 , 5
];
SourceSorted:
NoConcatenate LOAD *,
RowNo() as RowNo_SourceSorted
Resident Source
Order by B asc;
drop table Source;
BTotal:
LOAD sum(B) as BTotal
Resident SourceSorted;
let BTotal=peek('BTotal',0);
SourceWithCumu:
NoConcatenate LOAD
    *,
    rangesum(peek('BCumu'), B) as BCumu,                             // running total of B
    $(BTotal) as BTotal,
    rangesum(peek('BCumu'), B) / $(BTotal) as BCumuPct,              // cumulative share of B
    if(rangesum(peek('BCumu'), B) / $(BTotal) >= 0.5, A, 0) as AFiltered  // keep A only past the 50% mark
Resident SourceSorted;
Drop Table SourceSorted;
I worked with a few debug fields that might be useful, but you could of course remove these.
Then in the front end you do your calculation of sum(AFiltered)/sum(A) to get the stat you want and format it as a percentage.
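Since the question also asks for SQL, here is a hedged sketch of the same calculation using window functions; the table and column names t(a, b) are assumptions, not from the original post:
WITH cumu AS (
    SELECT a, b,
           SUM(b) OVER (ORDER BY b DESC
                        ROWS UNBOUNDED PRECEDING) AS b_cumu,  -- running total, largest B first
           SUM(b) OVER () AS b_total
    FROM t
)
SELECT 100.0 * SUM(CASE WHEN b_cumu - b < b_total / 2.0 THEN a ELSE 0 END)
             / SUM(a) AS pct_of_a
FROM cumu;
A row is kept while the running total before it is still below half of the grand total, which reproduces the rows 2, 3 & 6 selection and yields 5 / 18 * 100 ≈ 27.78% on the sample data.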

Tableau: How to get moving average with respect to day of week in last 4 weeks?

e.g.: If I have the data as below:
   Week 1           Week 2           Week 3
S M T W T F S    S M T W T F S    S M T W T F S
2 5 6 7 5 5 3    4 5 7 2 4 3 2    4 5 2 1 2 7 8
If today is Monday, my average will be (5+5+5)/3, which is 5. Tomorrow it will be (6+7+2)/3, which is 5 again, and the day after it will be (7+2+1)/3, which is 3.33.
How to get this in Tableau?
First, you can use "Weekday" as a column or row (by right-clicking on the date).
Then you can simply add a Table Calculation "Moving Average" with a specific computing dimension "Week of [Date]".
[screenshot: Table Calculation settings]
[screenshot: Result]
Data source used: Tableau Sample Superstore.
You can do the following:
Columns: Week(Order Date)
Rows: Weekday(Order Date)
Put Sales on Text.
Right-click Sales > Quick Table Calculation > Moving Average.
Right-click Sales > Edit Quick Table Calculation and set the following:
Moving along: Table (across)
Previous values: 4
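For reference outside Tableau, the same day-of-week moving average can be sketched with a SQL window function; the table sales(sale_date, amount) and Postgres-style EXTRACT are assumptions:
SELECT sale_date,
       AVG(amount) OVER (
           PARTITION BY EXTRACT(DOW FROM sale_date)   -- compare only the same weekday
           ORDER BY sale_date
           ROWS BETWEEN 3 PRECEDING AND CURRENT ROW   -- this week plus the previous 3
       ) AS moving_avg
FROM sales;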

SQL - Referencing 3 tables

This is in relation to my survey application for our team. I have 3 tables in my database related to this problem.
I apologize if the database is not fully normalized.
TBL_CHURCH columns:
1 FAM_CHURCH_SACRMNT_NUM (Primary Key) Int(15)
2 RSPONDNT_NUM
3 SURVYR_NUM
4 QN_NUMBER
5 CHRCHFAMLY_NAME
6 CHRCHFAMLY_ISBAPTIZED
Sample row based on order of columns above:
1 2 3 4 5 6
6422164 76826499 5712 362 Serio Tecson Jr. Yes
TBL_INTRVW columns:
1 QN_NUMBR (Primary Key)
2 SURVYR_NUM
3 ZONE_NUM
4 RSPONDNT_NUM
Sample row based on order of columns above:
1 2 3 4
362 5712 11 76826499
TBL_AREA columns:
1 BRGY_ZONE_NUM (Primary Key)
2 BRGY_CODE
Sample rows based on order of columns above:
1 2
11 2A
21 2A
31 2A
The field CHRCHFAMLY_ISBAPTIZED has only two values, "Yes" or "No". Each row has a QN_NUMBR value that references TBL_INTRVW, each QN_NUMBR in TBL_INTRVW has a unique ZONE_NUM that references TBL_AREA, and that ZONE_NUM has a corresponding BRGY_CODE. Each BRGY_CODE has at least 2 ZONE_NUM values.
My problem is that I want to count the number of people baptized in a given area.
The output more or less should look like this:
(The output is collected from the 3 different ZONE_NUM)
Zone Name Num of People Baptized
2A 20
I'm having trouble deciding what to use in my SQL statements. Should I use a WHERE within an INNER JOIN? And how do I go about my SELECT statement?
SELECT c.BRGY_ZONE_NUM, count(a.CHRCHFAMLY_ISBAPTIZED) as [Num of People Baptized]
from TBL_CHURCH a
left join TBL_INTRVW b
    on a.QN_NUMBER = b.QN_NUMBR
left join TBL_AREA c
    on b.ZONE_NUM = c.BRGY_ZONE_NUM
where a.CHRCHFAMLY_ISBAPTIZED = 'Yes'
group by c.BRGY_ZONE_NUM
I don't see a Zone Name column in the three tables, so I used BRGY_ZONE_NUM.
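Since the expected output rolls the three zones up to their shared BRGY_CODE (2A), a hedged variant that groups by the code instead, assuming the same schema, would be:
SELECT c.BRGY_CODE AS zone_name,
       COUNT(*) AS num_of_people_baptized   -- rows already filtered to baptized = 'Yes'
FROM TBL_CHURCH a
JOIN TBL_INTRVW b ON a.QN_NUMBER = b.QN_NUMBR
JOIN TBL_AREA c ON b.ZONE_NUM = c.BRGY_ZONE_NUM
WHERE a.CHRCHFAMLY_ISBAPTIZED = 'Yes'
GROUP BY c.BRGY_CODE;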