I currently have a table that contains a PersonID, a Date, and a set of columns for the day's 15-minute intervals. It looks like this:
Table
PersonID | Date | [09:00 - 09:15] | [09:15 - 09:30] | .... | [17:45 - 18:00]
Each time column contains an integer (0 as default).
I'm updating the table to include information provided from another table that contains event information. E.g. a person may be in an event from 09:00 - 17:45 and I would want to increment the integer value stored in the respective time columns by 1. Rather than write a LOT of statements to cover the various permutations of possible events throughout the day, it seems that I should be able to update the columns between the start and end time; I'm just unsure how to do this.
I would want to do something like the following:
UPDATE Table1
SET
(SELECT Column_names FROM Table1 WHERE ColumnNameStartTime >=
Table2.StartTime AND ColumnNameEndTime <= Table2.EndTime)
=ColumnName + 1
WHERE Table1.PersonID = Table2.PersonID and Table1.Date = Table2.Date
Is this even possible?
A more practical table design might be:
PersonID Date StartTime End Time Value
1 2017-11-07 09:00:01 09:15:00 0
1 2017-11-07 09:15:01 09:30:00 0
1 2017-11-07 09:30:01 09:45:00 0
2 2017-11-07 09:00:01 09:15:00 0
2 2017-11-07 09:15:01 09:30:00 0
But choose your data types carefully and be wary of boundary issues when matching your times:
Column Data type
Date date
StartTime time
EndTime time
Now if you have a source event table you can update like this:
UPDATE Target
SET Value = 1
FROM Source
WHERE Source.EventTime BETWEEN Target.StartTime AND Target.EndTime
  AND Source.PersonID = Target.PersonID
  AND Source.Date = Target.Date
If you need to count events, it's a bit more complicated - you need a calendar table defining the time windows. Then you can join to it, put events into buckets, and update the table with those counts.
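A hedged sketch of that counting approach (T-SQL-style UPDATE...FROM, matching the statement above). Here the Target rows themselves act as the time-window calendar, and Source is assumed to hold one row per event with an EventTime column; the names are illustrative:

-- Sketch only: count the events that fall inside each slot and write the count back
UPDATE t
SET Value = x.EventCount
FROM Target t
JOIN (SELECT t2.PersonID, t2.Date, t2.StartTime, COUNT(*) AS EventCount
      FROM Target t2
      JOIN Source s
        ON s.PersonID = t2.PersonID
       AND s.Date = t2.Date
       AND s.EventTime BETWEEN t2.StartTime AND t2.EndTime
      GROUP BY t2.PersonID, t2.Date, t2.StartTime) x
  ON x.PersonID = t.PersonID
 AND x.Date = t.Date
 AND x.StartTime = t.StartTime;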
Related
I have a history table in Teradata that contains information about tasks that occurred on an account. In typical history table fashion, the data present has a Load (Valid From) and Replace (Valid To) date.
In my environment, current/unterminated records have a 1/1/3000 date and all data loads occur shortly after midnight the day after the changes were completed in the system - so Load Dates always trail the Start Date by one day.
Some sample data might look like this (I have added Col "R" to reference row numbers, it does not exist in the table):
R  ACCT_NUM  TASK_NM  TASK_STAT  START_DT    END_DT      LD_DT       RPLC_DT     LGLC_DEL  CURR_ROW
1  0123456   TASK_01  O          2018-05-01  NULL        2018-05-02  2018-05-16  N         N
2  0123456   TASK_01  C          2018-05-01  2018-05-15  2018-05-16  2018-08-16  N         N
3  0123456   TASK_01  C          2018-05-01  2018-05-15  2018-08-16  3000-01-01  Y         Y
4  0123456   TASK_02  O          2018-05-05  NULL        2018-05-06  2018-05-19  N         N
5  0123456   TASK_02  C          2018-05-05  2018-05-18  2018-05-19  2018-08-19  N         N
6  0123456   TASK_03  O          2020-02-01  NULL        2020-02-02  2020-05-16  N         N
7  0123456   TASK_03  C          2020-02-01  2020-02-15  2020-02-16  2020-04-16  N         N
8  0123456   TASK_03  C          2020-02-01  2020-02-15  2020-04-16  3000-01-01  Y         Y
9  0123456   TASK_04  C          2022-03-01  NULL        2022-03-02  3000-01-01  N         Y
The place where I am struggling is that I need to identify each unique time period in which any task is active on a given account. If multiple tasks were active during overlapping time periods, then I need to identify the start date for the first task opened and the end date for the last overlapping task. (It is also possible that TASK_02 could both start and finish before TASK_01 finishes, so I would need the start/end dates related to task_01 in that type of scenario).
Using the above sample data, I would want to get an output as follows:
Account #  Active Start Dt  Active End Dt
0123456    2018-05-01       2018-05-19
0123456    2020-02-01       2020-02-15
0123456    2022-03-01       3000-01-01
Task 1 started, and then task 2 opened. Task 1 completed but task 2 was still open. So I need the Start Date of Task 1 (from Row 1) and the End Date of Task 2 (from Row 4 or 5). Later, Task 3 is opened during its own timeframe which creates a new record, and finally Task 4 is currently open.
I have tried quite a few things unsuccessfully, including but unfortunately not limited to:
Taking only 'Task Status = Open' records since the Replace Date would be updated with the ending date, but from here I'm unsure of how to best make the comparison to address the overlapping timeframes.
Utilizing Lead/Lag functions to identify the next record's Load Date, but again, since tasks can occur in any order, this created incorrect timeframes since I couldn't dynamically identify the next replace date that I needed.
Attempting to identify unique Load Dates of open tasks in a subquery, and then self-joining back to the task table, but this just created duplicates where multiple things were valid on a given day.
I cannot provide the literal code I've written due to privacy restrictions; however, here is mock code for anyone willing to provide guidance that could help get me going in the right direction:
SELECT
ACCT_NUM
,TASK_NM
,TASK_STAT
,START_DT
,END_DT
--,[other 'unimportant' fields that can change and create change records]
,LD_DT - INTERVAL '1' DAY AS LD_DT --SIMULATE START_DT
,RPLC_DT - INTERVAL '1' DAY AS RPLC_DT --SIMULATE END_DT ON "TASK_STAT = 'O'" RECORDS
,LGLC_DEL
,CURR_ROW
FROM
TBL_TASK_HIST
WHERE
TASK_STAT = 'O'
Sorry about the long summary, was trying to provide solid/clear detail.
Use Teradata's NORMALIZE to combine periods that overlap or meet. Based on your sample data, it appears this would work; you may need to adjust slightly for the real data.
SELECT ACCT_NUM,
BEGIN(PD) AS ACTIVE_START_DT,
PRIOR(END(PD)) AS ACTIVE_END_DT
FROM
(SELECT NORMALIZE ACCT_NUM,
PERIOD(START_DT,
NEXT(COALESCE(END_DT,
CASE WHEN TASK_STAT='O' and CURR_ROW='N'
THEN START_DT
ELSE DATE'3000-01-01'
END )
)
) AS PD
FROM TBL_TASK_HIST
) N
ORDER BY ACCT_NUM, ACTIVE_START_DT;
NORMALIZE requires a PERIOD data type so we use the PERIOD() constructor in the inner query and the BEGIN() and END() functions to convert back to two columns in the outer query. Since a PERIOD does not include the ending date/time bound (it's a "closed/open interval") we adjust the ending value with NEXT() and PRIOR() functions.
Maybe you could base the logic on LD_DT, RPLC_DT to avoid having to handle NULL END_DT, but it seems better to use the "business" columns if possible versus the ETL metadata.
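For intuition, here is a small standalone illustration (the values are made up) of how NORMALIZE merges periods that overlap or meet:

-- Two overlapping periods for the same account collapse into one row
SELECT NORMALIZE acct, pd
FROM (SELECT 1 AS acct, PERIOD(DATE '2018-05-01', DATE '2018-05-16') AS pd
      UNION ALL
      SELECT 1, PERIOD(DATE '2018-05-05', DATE '2018-05-20')) t;
-- Result: one row with PERIOD('2018-05-01', '2018-05-20')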
I had a SQL test for a job I wanted, but unfortunately I didn't get the job.
I was hoping someone could help me with the right answer for a question in the test.
Here is the question:
ETL Part
Our “Events” table (data source) is created in real time. The table has no updates, only appends.
event_id event_type time user_id OS Country
1 A 01/12/2018 15:39 1111 iOS ES
2 B 01/12/2018 10:43 2222 iOS Ge
3 C 02/12/2018 16:05 3333 Android IN
4 A 02/12/2018 16:39 3333 Android IN
Presented below is the Fact_Events table, which is part of our DWH. This table aggregates the number of events at an hourly level. The ETL process runs every 30 min.
date hour event_type_A event_type_B event_type_C
01/12/2018 15:00 1 0 0
01/12/2018 10:00 0 1 0
02/12/2018 16:00 1 0 1
Please answer the following questions:
1. Define the steps to create the Fact_Events table.
2. For each step provide the output table.
3. Write the query for each step.
4. What loading method would you use?
I really appreciate any help as I wish to learn for future job interviews.
Thanks in advance,
Ido.
Here is my answer; please tell me if I am correct or if there is a better solution.
ETL Part
1.
Create a table for each event type; in this case we will need 3 tables.
Use UNION ALL to concatenate all the tables into one table.
2.
First step:
date hour event_type_A event_type_B event_type_C
01/12/2018 15:00 1 0 0
02/12/2018 16:00 1 0 0
Second step:
date hour event_type_A event_type_B event_type_C
01/12/2018 15:00 1 0 0
02/12/2018 16:00 1 0 0
01/12/2018 10:00 0 1 0
Third step:
date hour event_type_A event_type_B event_type_C
01/12/2018 15:00 1 0 0
02/12/2018 16:00 1 0 0
01/12/2018 10:00 0 1 0
02/12/2018 16:00 0 0 1
SELECT Date(time) as date,
Hour(time) as hour,
Count(event_type) as event_type_A,
0 as event_type_B,
0 as event_type_C
FROM Events
WHERE event_type = 'A'
Union All
SELECT Date(time) as date,
Hour(time) as hour,
0 as event_type_A,
Count(event_type) as event_type_B,
0 as event_type_C
FROM Events
WHERE event_type = 'B'
Union All
SELECT Date(time) as date,
Hour(time) as hour,
0 as event_type_A,
0 as event_type_B,
Count(event_type) as event_type_C
FROM Events
WHERE event_type = 'C'
I would use an incremental load.
The first time, we will run the script over all the data and save the table.
From then on, we will append only the new events that don't exist in the saved table.
The loading query is way off. You need to group and pivot.
Should be something like this:
select cast(time as date) as date,
       datepart(hour,time) as hour,
       Sum(case when event_type='A' then 1 else 0 end) as event_type_A,
       Sum(case when event_type='B' then 1 else 0 end) as event_type_B,
       Sum(case when event_type='C' then 1 else 0 end) as event_type_C
from Events
group by cast(time as date), datepart(hour,time)
In my shop, I'm not heavily involved in hiring, but I do take a large part in determining if anyone is any good - and hence, you could say I play a big part in firing :(
I don't want people stuffing up my data - or giving inaccurate data to clients - so understanding the overall data flow (start to finish) and the final output are key.
As such, you'd need a much bigger answer to question 1, potentially including questions back.
We group data hourly, but load data every half hour. The answer has to work for these.
What do we want if there are no transactions in a given hour? No line, or a line with all 0's? My gut feel is the latter (as it's useful to know that no transactions occurred), but that is not guaranteed.
Then we'd like to see options, and an evaluation of them e.g.,
The first run each hour creates new data for the hour, and the second updates the data, or
Create the rows as a separate process, then the half-hour process simply updates those.
However, note that the options depend heavily on answers to the previous questions e.g.,
If all-zero rows are allowed, then you can create the rows then update them
But if all-zero rows are not allowed, then you need to do an insert on the first round, then updates and inserts in the second round (as the row may not have been created in the first round).
After you have the strategy and have evaluated it, then yes - write the SQL. First make it accurate/correct and understandable/maintainable. Then do efficiency.
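For instance, here is a hedged sketch of the create-then-update option, assuming a Fact_Events target shaped like the table above (with hour stored as an integer hour) and the Events table as the source; the names simply follow the examples in the question:

-- Round 1 (or a separate process): create the hour's row if it does not exist yet
insert into Fact_Events (date, hour, event_type_A, event_type_B, event_type_C)
select distinct cast(time as date), datepart(hour, time), 0, 0, 0
from Events e
where not exists (select 1
                  from Fact_Events f
                  where f.date = cast(e.time as date)
                    and f.hour = datepart(hour, e.time));

-- Every 30 minutes: recompute the affected hours from the source and update
update f
set event_type_A = s.a,
    event_type_B = s.b,
    event_type_C = s.c
from Fact_Events f
join (select cast(time as date) as d,
             datepart(hour, time) as h,
             sum(case when event_type='A' then 1 else 0 end) as a,
             sum(case when event_type='B' then 1 else 0 end) as b,
             sum(case when event_type='C' then 1 else 0 end) as c
      from Events
      group by cast(time as date), datepart(hour, time)) s
  on s.d = f.date
 and s.h = f.hour;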
Convert a row into multiple rows in BigQuery SQL.
The number of rows depends on a particular column value (in this case, the value of delta_unit/60):
Source table:
ID time delta_unit
101 2019-06-18 01:00:00 60
102 2019-06-18 01:01:00 60
103 2019-06-18 01:03:00 120
ID 102 recorded a time at 01:01:00 and the next record was at 01:03:00.
So we are missing a record that should have been at 01:02:00 with delta_unit = 60.
Expected table:
ID time delta_unit
101 2019-06-18 01:00:00 60
102 2019-06-18 01:01:00 60
104 2019-06-18 01:02:00 60
103 2019-06-18 01:03:00 60
A new row is created based on the delta_unit. The number of rows that need to be created depends on the value of delta_unit/60 (in this case, 120/60 = 2).
I have found a solution to your problem. I did the following: first run
SELECT max(delta/60) as max_a FROM `<projectid>.<dataset>.<table>`
to compute the maximum number of steps. Then run the following loop
DECLARE a INT64 DEFAULT 1;
WHILE a <= 2 DO --2=max_a (change accordingly)
INSERT INTO `<projectid>.<dataset>.<table>` (id,time,delta)
SELECT id+1,TIMESTAMP_ADD(time, INTERVAL a MINUTE),delta-60*a
FROM
(SELECT id,time,delta
FROM `<projectid>.<dataset>.<table>`
)
WHERE delta > 60*a;
SET a = a + 1;
END WHILE;
Of course this is not efficient, but it gets the job done. The IDs and deltas do not end up at the right values, but they should not be needed: the deltas would all end up at 60 (the column can be deleted) and the IDs can be recreated from the timestamp to keep them ordered.
You might try using a conditional expression here to avoid the loop and only go through the table once.
I have tried
INSERT INTO `<projectid>.<dataset>.<table>` (id,time,delta)
SELECT id+1, CASE
WHEN delta>80 THEN TIMESTAMP_ADD(time, INTERVAL 1 MINUTE)
WHEN delta>150 THEN TIMESTAMP_ADD(time, INTERVAL 2 MINUTE)
END
,60
FROM
(SELECT id,time,delta
FROM `<projectid>.<dataset>.<table>`
)
WHERE delta > 60;
but it fails because CASE only returns the first condition whose WHEN is true. So I am not sure if it is possible to do it all at once. If you have small tables I would stick to the first approach, which works fine.
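If you want to avoid the loop entirely, one possible set-based sketch (BigQuery Standard SQL, same assumed id/time/delta columns as above) expands each row into delta/60 rows in a single pass; as noted above, the IDs would still need to be regenerated from the timestamps afterwards:

-- Sketch only: one output row per minute covered by each source row
SELECT
  id,
  TIMESTAMP_ADD(time, INTERVAL step MINUTE) AS time,
  60 AS delta
FROM `<projectid>.<dataset>.<table>`,
     UNNEST(GENERATE_ARRAY(0, DIV(delta, 60) - 1)) AS step;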
Data transformation issue on Postgres. I have a process where a submission is made, and some stages/events take place to process the submission. A row is created per submission. When a stage is complete, a timestamp populates the column for that stage.
The raw data is in the format
submission_id stage1.time stage2.time stage3.time
XYZ 2016/01/01 2016/01/04 2016/01/05
I want to make a "snapshot" table for this (perhaps there's a better name for it?) which looks as follows, for the above example:
snapshot_date submission_id stage_number days_in_stage
2016/01/01 XYZ 1 0
2016/01/02 XYZ 1 1
2016/01/03 XYZ 1 2
2016/01/04 XYZ 2 0
2016/01/05 XYZ 3 0
So basically, on a given date in the past, what submissions are in what stages and how long had they been in those stages.
So far I've managed to generate a date table using
SELECT ts::date
FROM (
SELECT min(stage1.time) as first_date
, max(stage1.time) as last_date
FROM schema.submissions
) h
, generate_series(h.first_date, h.last_date, interval '1 day') g(ts)
but I'm stuck on where I should be joining next, so any pointers would be appreciated.
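In case it helps as a starting point, here is a hedged sketch of the join step. It assumes the stage columns really are named "stage1.time" etc. (hence the quoting) and hold dates, and it generates the date series per submission rather than once globally; adjust names, casts and the open-ended upper bound to match the real schema:

-- Sketch only: one row per submission per day, with the current stage and days in it
SELECT g.ts::date AS snapshot_date,
       s.submission_id,
       CASE WHEN s."stage3.time" IS NOT NULL AND g.ts::date >= s."stage3.time" THEN 3
            WHEN s."stage2.time" IS NOT NULL AND g.ts::date >= s."stage2.time" THEN 2
            ELSE 1
       END AS stage_number,
       g.ts::date - CASE WHEN s."stage3.time" IS NOT NULL AND g.ts::date >= s."stage3.time" THEN s."stage3.time"
                         WHEN s."stage2.time" IS NOT NULL AND g.ts::date >= s."stage2.time" THEN s."stage2.time"
                         ELSE s."stage1.time"
                    END AS days_in_stage
FROM schema.submissions s
     CROSS JOIN LATERAL generate_series(s."stage1.time",
                                        COALESCE(s."stage3.time", CURRENT_DATE),
                                        interval '1 day') AS g(ts)
ORDER BY s.submission_id, snapshot_date;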
My table, called TimeList, with 2 columns SlotID (int identity) and SlotTime (varchar), looks like this in the database:
SlotID SlotTime
1 8:00AM-8:15AM
2 8:15AM-8:30AM
3 8:30AM-8:45AM
4 8:45AM-9AM
5 9AM-9:30AM
likewise up to 6:45PM-7:00PM.
If I pass 2 parameters, starttime as 8:00AM and endtime as 9AM, I want to retrieve the first 4 rows in the table above. Can anybody help me write such a stored procedure?
Would it be possible to refactor the table to look like this:
SlotID SlotStart SlotEnd
----------------------------
1 8:00am 8:15am
2 8:15am 8:30am
...
If you split the times into separate columns, it will be easier to query the date ranges. The query would look something like this:
declare @StartTime time = '8:00am'
declare @EndTime time = '9:00am'

select SlotID, SlotStart, SlotEnd
from Slots
where SlotStart >= @StartTime
  and SlotEnd <= @EndTime
Your data is not properly normalized, so it will be hard to query. A field should only contain a single value, so you should have the starting and ending time for the slot in separate fields:
SlotID StartTime EndTime
1 8:00AM 8:15AM
2 8:15AM 8:30AM
3 8:30AM 8:45AM
4 8:45AM 9:00AM
5 9:00AM 9:30AM
This also allows you to use a datetime type for the fields instead of a textual data type, so that you can easily query the table:
select SlotId, StartTime, EndTime
from TimeList
where StartTime >= '8:00AM' and EndTime <= '9:00AM'
With your original table design, you would have to use string operations to split the values in the field, and convert the values to make it comparable. If you get a lot of data in the table, this will be a killer for performance, as the query can't make use of indexes.
The problem is that your table is not normalized. Please read up on that at http://en.wikipedia.org/wiki/Database_normalization; it can greatly improve the quality of the systems you design.
In your current case, please follow Andy's advice and separate SlotStart and SlotEnd. Your time format is not good either. Use a DateTime format (or whatever your database offers you as its time type) or a numerical type like INT to store your values (e.g. 1800 instead of 6:00PM).
Then you can easily use
SELECT * FROM TimeList WHERE SlotStart >= ... AND SlotEnd <= ...
and select whatever you like from your table.