SQL ETL for creating a table - job interview question

I had a SQL test for a job I wanted, but unfortunately I didn't get the job.
I was hoping someone could help me with the right answer for a question in the test.
Here is the question:
ETL Part
Our “Events” table (the data source) is created in real time. The table has no updates, only appends.
event_id  event_type  time              user_id  OS       Country
1         A           01/12/2018 15:39  1111     iOS      ES
2         B           01/12/2018 10:43  2222     iOS      Ge
3         C           02/12/2018 16:05  3333     Android  IN
4         A           02/12/2018 16:39  3333     Android  IN
Presented below is the Fact_Events table, which is part of our DWH. This table aggregates the number of events at an hourly level. The ETL process runs every 30 minutes.
date        hour   event_type_A  event_type_B  event_type_C
01/12/2018  15:00  1             0             0
01/12/2018  10:00  0             1             0
02/12/2018  16:00  1             0             1
Please answer the following questions:
Define the steps to create the Fact_Events table
For each step, provide the output table.
Write the query for each step.
What loading method would you use?
I really appreciate any help, as I wish to learn for future job interviews.
Thanks in advance,
Ido.
Here is my answer; please tell me if I am correct or if there is a better solution.
ETL Part
1.
Create a table for each event type; in this case we will need 3 tables.
Use UNION ALL to concatenate all the tables into one table.
2.
First step:
date        hour   event_type_A  event_type_B  event_type_C
01/12/2018  15:00  1             0             0
02/12/2018  16:00  1             0             0
Second step:
date        hour   event_type_A  event_type_B  event_type_C
01/12/2018  15:00  1             0             0
02/12/2018  16:00  1             0             0
01/12/2018  10:00  0             1             0
Third step:
date        hour   event_type_A  event_type_B  event_type_C
01/12/2018  15:00  1             0             0
02/12/2018  16:00  1             0             0
01/12/2018  10:00  0             1             0
02/12/2018  16:00  0             0             1
SELECT Date(time) as date,
Hour(time) as hour,
Count(event_type) as event_type_A,
0 as event_type_B,
0 as event_type_C
FROM Events
WHERE event_type = 'A'
Union All
SELECT Date(time) as date,
Hour(time) as hour,
0 as event_type_A,
Count(event_type) as event_type_B,
0 as event_type_C
FROM Events
WHERE event_type = 'B'
Union All
SELECT Date(time) as date,
Hour(time) as hour,
0 as event_type_A,
0 as event_type_B,
Count(event_type) as event_type_C
FROM Events
WHERE event_type = 'C'
I would use an incremental load:
the first time, we run the script over all the data and save the table;
from then on, we append only the new events that don't already exist in the saved table.

The loading query is way off. You need to group and pivot.
Should be something like this:
select cast([time] as date) as [date],
       datepart(hour, [time]) as [hour],
       sum(case when event_type = 'A' then 1 else 0 end) as event_type_A,
       sum(case when event_type = 'B' then 1 else 0 end) as event_type_B,
       sum(case when event_type = 'C' then 1 else 0 end) as event_type_C
from Events
group by cast([time] as date), datepart(hour, [time])
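As for the loading method: since the job runs every 30 minutes over an append-only source, one option is to re-aggregate only the recent window and upsert it into Fact_Events. A hedged sketch in SQL Server syntax (the lookback window and the (date, hour) key are assumptions, not from the question):

MERGE Fact_Events AS tgt
USING (
    select cast([time] as date) as [date],
           datepart(hour, [time]) as [hour],
           sum(case when event_type = 'A' then 1 else 0 end) as event_type_A,
           sum(case when event_type = 'B' then 1 else 0 end) as event_type_B,
           sum(case when event_type = 'C' then 1 else 0 end) as event_type_C
    from Events
    -- assumed lookback: from the start of the previous hour, so every
    -- re-aggregated hour is complete rather than cut off mid-hour
    where [time] >= DATEADD(hour, DATEDIFF(hour, 0, SYSDATETIME()) - 1, 0)
    group by cast([time] as date), datepart(hour, [time])
) AS src
    ON tgt.[date] = src.[date] AND tgt.[hour] = src.[hour]
WHEN MATCHED THEN
    UPDATE SET event_type_A = src.event_type_A,
               event_type_B = src.event_type_B,
               event_type_C = src.event_type_C
WHEN NOT MATCHED THEN
    INSERT ([date], [hour], event_type_A, event_type_B, event_type_C)
    VALUES (src.[date], src.[hour], src.event_type_A, src.event_type_B, src.event_type_C);

Because the source is append-only, recomputing the full counts for the affected hours and overwriting them is simpler and safer than trying to add deltas.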

In my shop, I'm not heavily involved in hiring, but I do take a large part in determining if anyone is any good - and hence, you could say I play a big part in firing :(
I don't want people stuffing up my data - or giving inaccurate data to clients - so understanding the overall data flow (start to finish) and the final output are key.
As such, you'd need a much bigger answer to question 1, potentially including questions back.
We group data hourly, but load data every half hour. The answer has to work for these.
What do we want if there are no transactions in a given hour? No line, or a line with all 0's? My gut feel is the latter (as it's useful to know that no transactions occurred), but that's not guaranteed.
Then we'd like to see options, and an evaluation of them e.g.,
The first run each hour creates new data for the hour, and the second updates the data, or
Create the rows as a separate process, then the half-hour process simply updates those.
However, note that the options depend heavily on answers to the previous questions e.g.,
If all-zero rows are allowed, then you can create the rows then update them
But if all-zero rows are not allowed, then you need to do insert on the first round, then updates and inserts in the second round (as the row may not have been created in the first round).
After you have the strategy and have evaluated it, then yes - write the SQL. First make it accurate/correct and understandable/maintainable. Then do efficiency.
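As a hedged sketch of the second option above (pre-create the hourly row, then let each half-hourly run just update it; SQL Server syntax, with table and column names assumed from the question):

-- Run once per hour: seed an all-zero row for the current hour if it doesn't exist yet
INSERT INTO Fact_Events ([date], [hour], event_type_A, event_type_B, event_type_C)
SELECT CAST(SYSDATETIME() AS date), DATEPART(hour, SYSDATETIME()), 0, 0, 0
WHERE NOT EXISTS (
    SELECT 1 FROM Fact_Events
    WHERE [date] = CAST(SYSDATETIME() AS date)
      AND [hour] = DATEPART(hour, SYSDATETIME())
);

-- Run every 30 minutes: overwrite the current hour's counts with a fresh aggregate
UPDATE f
SET event_type_A = src.event_type_A,
    event_type_B = src.event_type_B,
    event_type_C = src.event_type_C
FROM Fact_Events AS f
JOIN (
    SELECT CAST([time] AS date) AS [date],
           DATEPART(hour, [time]) AS [hour],
           SUM(CASE WHEN event_type = 'A' THEN 1 ELSE 0 END) AS event_type_A,
           SUM(CASE WHEN event_type = 'B' THEN 1 ELSE 0 END) AS event_type_B,
           SUM(CASE WHEN event_type = 'C' THEN 1 ELSE 0 END) AS event_type_C
    FROM Events
    WHERE CAST([time] AS date) = CAST(SYSDATETIME() AS date)
      AND DATEPART(hour, [time]) = DATEPART(hour, SYSDATETIME())
    GROUP BY CAST([time] AS date), DATEPART(hour, [time])
) AS src
  ON src.[date] = f.[date] AND src.[hour] = f.[hour];

Note the trade-off discussed above: this only works if all-zero rows are acceptable in Fact_Events.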

Related

Using a history table, how can I identify each period of time in which any task is active, and reduce the data set to non-overlapping date periods

I have a history table in Teradata that contains information about tasks that occurred on an account. In typical history table fashion, the data present has a Load (Valid From) and Replace (Valid To) date.
In my environment, current/unterminated records have a 1/1/3000 date and all data loads occur shortly after midnight the day after the changes were completed in the system - so Load Dates always trail the Start Date by one day.
Some sample data might look like this (I have added Col "R" to reference row numbers, it does not exist in the table):
R  ACCT_NUM  TASK_NM  TASK_STAT  START_DT    END_DT      LD_DT       RPLC_DT     LGLC_DEL  CURR_ROW
1  0123456   TASK_01  O          2018-05-01  NULL        2018-05-02  2018-05-16  N         N
2  0123456   TASK_01  C          2018-05-01  2018-05-15  2018-05-16  2018-08-16  N         N
3  0123456   TASK_01  C          2018-05-01  2018-05-15  2018-08-16  3000-01-01  Y         Y
4  0123456   TASK_02  O          2018-05-05  NULL        2018-05-06  2018-05-19  N         N
5  0123456   TASK_02  C          2018-05-05  2018-05-18  2018-05-19  2018-08-19  N         N
6  0123456   TASK_03  O          2020-02-01  NULL        2020-02-02  2020-05-16  N         N
7  0123456   TASK_03  C          2020-02-01  2020-02-15  2020-02-16  2020-04-16  N         N
8  0123456   TASK_03  C          2020-02-01  2020-02-15  2020-04-16  3000-01-01  Y         Y
9  0123456   TASK_04  C          2022-03-01  NULL        2022-03-02  3000-01-01  N         Y
The place where I am struggling is that I need to identify each unique time period in which any task is active on a given account. If multiple tasks were active during overlapping time periods, then I need to identify the start date for the first task opened and the end date for the last overlapping task. (It is also possible that TASK_02 could both start and finish before TASK_01 finishes, so I would need the start/end dates related to task_01 in that type of scenario).
Using the above sample data, I would want to get an output as follows:
Account #  Active Start Dt  Active End Dt
0123456    2018-05-01       2018-05-19
0123456    2020-02-01       2020-02-15
0123456    2022-03-01       3000-01-01
Task 1 started, and then task 2 opened. Task 1 completed but task 2 was still open. So I need the Start Date of Task 1 (from Row 1) and the End Date of Task 2 (from Row 4 or 5). Later, Task 3 is opened during its own timeframe which creates a new record, and finally Task 4 is currently open.
I have tried quite a few things unsuccessfully, including but unfortunately not limited to:
Taking only 'Task Status = Open' records since the Replace Date would be updated with the ending date, but from here I'm unsure of how to best make the comparison to address the overlapping timeframes.
Utilizing Lead/Lag functions to identify the next record's Load Date, but again, since tasks can occur in any order, this created incorrect timeframes since I couldn't dynamically identify the next replace date I needed
Attempting to identify unique Load Dates of open tasks in a subquery, and then self-joining back to the task table, but this just created duplicates where multiple things were valid on a given day
I cannot provide the literal code I've written due to privacy restrictions; however, here is mock code for anyone willing to provide guidance that could help get me going in the right direction:
SELECT
    ACCT_NUM
    ,TASK_NM
    ,TASK_STAT
    ,START_DT
    ,END_DT
    --,[other 'unimportant' fields that can change and create change records]
    ,LD_DT - INTERVAL '1' DAY AS LD_DT      --SIMULATE START_DT
    ,RPLC_DT - INTERVAL '1' DAY AS RPLC_DT  --SIMULATE END_DT ON "TASK_STAT = 'O'" RECORDS
    ,LGLC_DEL
    ,CURR_ROW
FROM
    TBL_TASK_HIST
WHERE
    TASK_STAT = 'O'
Sorry about the long summary, was trying to provide solid/clear detail.
Use Teradata's NORMALIZE to combine periods that overlap or meet. Based on your sample data, it appears this would work; you may need to adjust slightly for the real data.
SELECT ACCT_NUM,
BEGIN(PD) AS ACTIVE_START_DT,
PRIOR(END(PD)) AS ACTIVE_END_DT
FROM
(SELECT NORMALIZE ACCT_NUM,
PERIOD(START_DT,
NEXT(COALESCE(END_DT,
CASE WHEN TASK_STAT='O' and CURR_ROW='N'
THEN START_DT
ELSE DATE'3000-01-01'
END )
)
) AS PD
FROM TBL_TASK_HIST
) N
ORDER BY ACCT_NUM, ACTIVE_START_DT;
NORMALIZE requires a PERIOD data type so we use the PERIOD() constructor in the inner query and the BEGIN() and END() functions to convert back to two columns in the outer query. Since a PERIOD does not include the ending date/time bound (it's a "closed/open interval") we adjust the ending value with NEXT() and PRIOR() functions.
Maybe you could base the logic on LD_DT, RPLC_DT to avoid having to handle NULL END_DT, but it seems better to use the "business" columns if possible versus the ETL metadata.
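If NORMALIZE is unfamiliar, here is a tiny, untested illustration (hypothetical data, not from the question) of how it combines periods that overlap or meet:

SELECT NORMALIZE id, pd
FROM (
    SELECT 1 AS id, PERIOD(DATE '2018-05-01', DATE '2018-05-17') AS pd
    UNION ALL
    SELECT 1, PERIOD(DATE '2018-05-05', DATE '2018-05-20')
) t;
-- Returns a single row: 1, PERIOD(2018-05-01, 2018-05-20)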

Select unique IDs and divide result into X minute intervals based on given timespan

I'm trying to knock some dust off my good old SQL queries, but I'm afraid I need a push in the right direction to turn those dusty skills into something useful when it comes to BigQuery statements.
I'm currently working with a single table schema looking like this:
In the query I would like to be able to supply the following in my where clause:
The date of which I would like the results to stem from.
A time range - in the above result example this range would be from 20:00 to 21:00. If 1. and 2. in this list should be merged together that's also fine.
The eventId I would like to find records for.
Optionally, to be able to determine the interval frequency - should it be divided into e.g. 5, 10 or 15 minute intervals.
Also I would like to count the unique userIds for each interval. If one user is present during the entire session he/she should be taken into the count in every interval.
So think of it as the following:
How many unique users did we have every 5 minutes at X event, between 20:00 and 21:00 at Y day?
How should my query look if I want a result looking (something) like the following pseudo result:
time_interval number_of_unique_userIds
1 2022-03-16 20:00:00 10
2 2022-03-16 20:05:00 12
3 2022-03-16 20:10:00 15
4 2022-03-16 20:15:00 20
5 2022-03-16 20:20:00 30
6 ... etc.
If the time of the query is before the provided end time in the timespan, it should fill out the rest of the interval rows with 0 unique userIds.
In the following result we've executed mentioned query earlier than the provided end date - let's say that it's executed at 20:49:
time_interval number_of_unique_userIds
X 2022-03-16 20:50:00 0
X 2022-03-16 20:55:00 0
X 2022-03-16 21:00:00 0
Here's what I have so far, but it gives me several of the same interval records with what looks like each userId:
SELECT
TIMESTAMP_SECONDS(5*60 * DIV(UNIX_SECONDS(creationTime), 5*60)) time_interval,
COUNT(DISTINCT(userId)) number_of_unique_userIds
FROM `bigquery.table`
WHERE eventId = 'xyz'
AND creationTime > '2022-03-16 20:00:00' AND creationTime < '2022-03-16 21:00:00'
GROUP BY time_interval
ORDER BY time_interval DESC
This gives me somewhat what I expect - but I think the number_of_unique_userIds seems too low, so I'm a little worried that I'm not getting unique userIds for each interval. What I'm thinking is that userIds counted in the first 5 minute interval are not counted in the next. So I'm not sure this query is sufficient for my needs. Also, it's not filling in the blanks with 0 number_of_unique_userIds.
I hope you can help me out here.
Thanks!
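One way to get both distinct counts per interval and zero-filled gaps (an untested sketch, assuming creationTime is a TIMESTAMP column) is to generate the interval boundaries first and LEFT JOIN the events onto them, so empty intervals survive with a 0 count:

SELECT ts AS time_interval,
       COUNT(DISTINCT t.userId) AS number_of_unique_userIds
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(TIMESTAMP '2022-03-16 20:00:00',
                                     TIMESTAMP '2022-03-16 21:00:00',
                                     INTERVAL 5 MINUTE)) AS ts
LEFT JOIN `bigquery.table` t
       ON t.eventId = 'xyz'
      AND t.creationTime >= ts
      AND t.creationTime < TIMESTAMP_ADD(ts, INTERVAL 5 MINUTE)
GROUP BY ts
ORDER BY ts

COUNT(DISTINCT ...) ignores NULLs, so intervals with no matching rows report 0. Note that with this approach a user active across several intervals is counted once in each of them.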

Time between dates (more advanced than just DATEDIFF)

I have a table that contains Guest_ID and Trip_Date. I have been tasked with finding out, for each Guest_ID, how many times they have had over 365 days between trips. I know I can use the DATEDIFF function for the time between dates, but I am unsure how to get the dates plugged in properly. I think if I can get help with this part I can do the rest.
For each time this happened I need to report back Guest_ID, Prior_Last_Trip, New_Trip, days between. This data goes back for over a decade so it is possible for a Guest to have multiple periods of over a year between visits.
I was thinking of just loading a table with that data that can be queried later. That way, once I figure out how to make this work the first time, I can set up a stored procedure or trigger to check for new occurrences and populate the table.
I was not sure where to begin on this code. I was thinking recursion might be the answer, but I do not know recursion, just that it exists.
This table is quite large: around 1.5 million unique Guest_IDs with over 30 million trips.
I am using SQL Server 2012. If there is anything else I can add to help this let me know. I will edit and update this as I have ideas on how to make this work myself.
Edit 1: Sample Data and Desired Results
Guest_ID Trip_Date
1 1/1/2013
1 2/5/2013
1 12/5/2013
1 1/1/2015
1 6/5/2015
1 8/1/2017
1 10/2/2017
1 1/6/2018
1 6/7/2018
1 7/1/2018
1 7/5/2018
2 1/1/2018
2 2/6/2018
2 4/2/2018
2 7/3/2018
3 1/1/2014
3 6/5/2014
3 9/4/2014
Guest_ID Prior_Last_Trip New_Trip DaysBetween
1 12/5/2013 1/1/2015 392
1 6/5/2015 8/1/2017 788
So you can see that Guest 1 had 2 different periods where they did not have a trip for over a year, and those two instances are recorded in the results. Guest 2 never had a gap of over a year and therefore has no records in the results. Guest 3 has not had a trip in over a year, but without having made the return trip, currently does not qualify for the result set. Should Guest 3 ever make another trip, they would then be added to the result set.
Edit 2: Working Query
Thanks to @Code4ml I got this working. Here is the complete query.
Select Guest_ID, CurrentTrip, DaysBetween, LastTrip
From (
    Select
        Guest_ID
        ,Lag(Trip_Date, 1) Over (Partition by Guest_ID Order by Trip_Date) as LastTrip
        ,Trip_Date as CurrentTrip
        ,DATEDIFF(d, Lag(Trip_Date, 1) Over (Partition by Guest_ID Order by Trip_Date), Trip_Date) as DaysBetween
    From UCS
) as A
Where DaysBetween > 365
You may try the SQL LAG function to access the previous trip date, like below.
SELECT guest_id, trip_date,
       LAG(trip_date, 1) OVER (PARTITION BY guest_id ORDER BY trip_date) AS prev_trip_date
FROM tripsTable
Now you can use this as a subquery to calculate the number of days between trips and filter the data as required.

SQL - multiple date time aggregations within single query

I have a SQL database of customer actions. A customer is identified by a UniqueID, and each action is given a date and time in an action timestamp. A user can have more than one action on any one day, as so:
UniqueID  actionDate  actionTime
1         17-01-18    13:01
1         17-01-18    13:15
2         17-01-18    13:15
1         18-01-18    12:56
I want to understand multiple things from the database ideally in a single query.
The first is how many times each UniqueID has performed an action over a given time period (day, week, month). So for the example above there would be a count of 2 for id 1 for 17-01-18, a count of 1 for 18-01-18, and, assuming those are the only actions that week, a count of 3 for id 1 for that week.
On days that have more than one action (17-01-18 in the above example) I would want to understand the distribution of actions across the day and, more importantly, the number of actions that occurred within a time frame of an hour. In this case I'd want to know that 2 actions occurred between 13:00 and 14:00 for id 1 but that the other 23 hours had 0 actions.
The end goal would be to have a time series that looks back over three months and be able to view monthly, weekly and importantly daily / intra-daily counts of actions for each unique ID.
Desired result may look something like this:
ID | M1W1D1H1|M1W1D1H2|->|M1W1D1H13|->|M1W1D2H12|
1 0 0 2 1
2 0 0 1 0
M=Month, W=Week, D=Day, H=Hour. AC = ActionCount
So the above shows that in month 1, week 1, day 1, hour 1, id 1 had no actions. The first action was in M1W1D1H13, in which hour they had two actions. Their next action was on D2 of W1, M1. I could then aggregate up to get the respective weekly, daily, and monthly action counts. The result will be fairly sparse, with many 0 actions.
Any help and guidance appreciated.
If I am understanding your question, you have an id with date and time details in a normalized data structure. You, however, want to denormalize this data so that you have only one line per id, aggregated at the conditions you desire.
To do this you could use a simple GROUP BY and nest your aggregations into CASE statements qualifying them for the column range you desire. If you cannot hard-code your time slices and need that to be dynamic, that may be possible, but I would need more information about your requirements. You can also nest CASE statements into CASE statements and use derived tables to enable more complex rules.
So, using your example...
select
    UniqueID
    , sum(
        case when actionDate between <someDate> and <someDate> then 1
        end) as evnt_cnt_in_range01
    , count(distinct
        case when actionDate between <someDate> and <someDate> then actionDate
        end) as uniq_dates_in_range01
    , min(
        case when actionDate between <someDate> and <someDate> then actionTime
        end) as earliest_action_in_range01
    , max(
        case when actionDate between <someDate> and <someDate> then actionTime
        end) as latest_action_in_range01
    , sum(
        case when actionDate between <someDate> and <someDate> then
            CASE WHEN actionTime > '12:00' THEN 1 ELSE 0 END -- I flip caps to keep nests straight
        end) as cnt_after_noon_action_range1
from <sometable>
group by 1
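If the hard-coded ranges become unwieldy, a long-format alternative is to aggregate per id, date, and hour and pivot or roll up later. A hedged sketch (assuming actionTime is a TIME column; <sometable> as above):

select UniqueID
     , actionDate
     , extract(hour from actionTime) as action_hour
     , count(*) as action_count
from <sometable>
group by 1, 2, 3
order by 1, 2, 3

Each row is one id-hour with its action count; hours with zero actions simply don't appear, which keeps the sparse result small.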

Date snapshot table transformation in SQL

Data transformation issue on Postgres. I have a process where a submission is made, and some stages/events take place to process the submission. A row is created per submission. When a stage is complete, a timestamp populates the column for that stage.
The raw data is in the format
submission_id  stage1.time  stage2.time  stage3.time
XYZ            2016/01/01   2016/01/04   2016/01/05
I want to make a "snapshot" table for this (perhaps there's a better name for it?) which looks as follows for the above example:
snapshot_date submission_id stage_number days_in_stage
2016/01/01 XYZ 1 0
2016/01/02 XYZ 1 1
2016/01/03 XYZ 1 2
2016/01/04 XYZ 2 0
2016/01/05 XYZ 3 0
So basically, on a given date in the past, what submissions are in what stages and how long had they been in those stages.
So far I've managed to generate a date table using
SELECT ts::date
FROM (
    SELECT min("stage1.time") AS first_date  -- dotted column names need quoting in Postgres
         , max("stage3.time") AS last_date   -- last stage, so the series spans the full range
    FROM schema.submissions
) h
, generate_series(h.first_date, h.last_date, interval '1 day') g(ts)
but I'm stuck on where I should be joining next, so any pointers would be appreciated.
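For the join step, one approach (an untested sketch, assuming the quoted "stageN.time" column names above are date columns) is to generate the date series per submission with a lateral join, then derive the current stage and its age with a CASE:

SELECT g.ts::date AS snapshot_date,
       s.submission_id,
       CASE WHEN g.ts::date >= s."stage3.time" THEN 3
            WHEN g.ts::date >= s."stage2.time" THEN 2
            ELSE 1
       END AS stage_number,
       g.ts::date - CASE WHEN g.ts::date >= s."stage3.time" THEN s."stage3.time"
                         WHEN g.ts::date >= s."stage2.time" THEN s."stage2.time"
                         ELSE s."stage1.time"
                    END AS days_in_stage
FROM schema.submissions s
CROSS JOIN LATERAL generate_series(s."stage1.time", s."stage3.time", interval '1 day') AS g(ts)
ORDER BY s.submission_id, snapshot_date;

Each submission gets one row per day between its first and last stage timestamps; the CASE picks the most recent stage reached on that day, and subtracting that stage's date gives days_in_stage.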