Date snapshot table transformation in SQL

Data transformation issue on Postgres. I have a process where a submission is made, and some stages/events take place to process the submission. A row is created per submission. When a stage is complete, a timestamp populates the column for that stage.
The raw data is in the format
submission_id  stage1.time  stage2.time  stage3.time
XYZ            2016/01/01   2016/01/04   2016/01/05
I want to make a "snapshot" table for this (perhaps there's a better name for it?) which, for the above example, looks as follows:
snapshot_date  submission_id  stage_number  days_in_stage
2016/01/01     XYZ            1             0
2016/01/02     XYZ            1             1
2016/01/03     XYZ            1             2
2016/01/04     XYZ            2             0
2016/01/05     XYZ            3             0
So basically: on a given date in the past, which submissions were in which stages, and how long had they been in those stages?
So far I've managed to generate a date table using
SELECT ts::date
FROM (
    SELECT min(stage1.time) AS first_date
         , max(stage1.time) AS last_date
    FROM schema.submissions
) h
, generate_series(h.first_date, h.last_date, interval '1 day') g(ts)
but I'm stuck on where I should be joining next, so any pointers would be appreciated.
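One way to finish it: cross join the calendar to the submissions table and derive the stage with a CASE over the stage timestamps. A sketch, assuming the stage columns are DATE-typed and named stage1_time, stage2_time, stage3_time (adjust to the real names), and using max(stage3_time) so the calendar ends at the last completion:

WITH dates AS (
    SELECT g.ts::date AS snapshot_date
    FROM (SELECT min(stage1_time) AS first_date,
                 max(stage3_time) AS last_date
          FROM schema.submissions) h,
         generate_series(h.first_date, h.last_date, interval '1 day') g(ts)
)
SELECT d.snapshot_date,
       s.submission_id,
       -- the highest stage whose timestamp is on or before the snapshot date
       CASE WHEN d.snapshot_date >= s.stage3_time THEN 3
            WHEN d.snapshot_date >= s.stage2_time THEN 2
            ELSE 1
       END AS stage_number,
       -- days since that stage's timestamp (date minus date is an integer)
       d.snapshot_date
           - CASE WHEN d.snapshot_date >= s.stage3_time THEN s.stage3_time
                  WHEN d.snapshot_date >= s.stage2_time THEN s.stage2_time
                  ELSE s.stage1_time
             END AS days_in_stage
FROM dates d
JOIN schema.submissions s
  ON d.snapshot_date >= s.stage1_time  -- a submission appears once stage 1 has happened
ORDER BY s.submission_id, d.snapshot_date;

A NULL stage timestamp compares as unknown, so an unfinished stage simply falls through to the earlier CASE branch.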

Related

SELECT-SQL-Statement - Transformation of a single data record with a date period into several single data records per day

I have the following example records in a table that contains records with time periods (originally imported data):
ID  DateFrom    DateTo      Value
1   01.01.2021  03.01.2021  A
2   02.03.2021  06.03.2021  B
...
The data is imported as individual records into a separate table.
I would like to put the data records into the following form with a SELECT query, in order to be able to check in a second step whether all data were imported as individual records:
ID  DateFrom    DateTo      Value
1   01.01.2021  01.01.2021  A
1   02.01.2021  02.01.2021  A
1   03.01.2021  03.01.2021  A
2   02.03.2021  02.03.2021  B
2   03.03.2021  03.03.2021  B
2   04.03.2021  04.03.2021  B
2   05.03.2021  05.03.2021  B
2   06.03.2021  06.03.2021  B
...
Unfortunately, I have a knot in my head and cannot find a query approach.
I am sure a hierarchical query suits here; the problem is that I still can't fit it in without using DISTINCT.
This query will work under the assumption that the "datefrom" and "dateto" columns are of DATE type.
Replace "test_data" with the name of the table you store the dates in.
select td.id,
       qq.day_date,
       value
from test_data td
join (select distinct id,
             datefrom + level - 1 as day_date
      from test_data
      connect by level <= (dateto - datefrom + 1)) qq
  on qq.id = td.id
order by td.id, qq.day_date;
If dateto and datefrom are just VARCHARs, you may convert them to dates using the to_date function.
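As an aside: in Postgres (the DBMS used elsewhere on this page) the same per-day expansion needs no hierarchical query. A sketch with generate_series, assuming the same test_data(id, datefrom, dateto, value) layout with DATE-typed range columns:

-- one output row per day in each record's range
select td.id,
       g.day_date::date as day_date,
       td.value
from test_data td
cross join lateral
     generate_series(td.datefrom, td.dateto, interval '1 day') as g(day_date)
order by td.id, day_date;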

PostgreSQL - Generate series using subqueries

Using PostgreSQL, I need to accomplish the following scenario. I have a table called routine, where I store start_date and end_date columns. I have another table called exercises, where I store all the data related to each exercise, and finally I have a table called routine_exercise where I create the relationship between the routine and the exercise. Each routine can have seven days of exercises (the day indicates the day of the week, e.g. 1 means Monday), and each day can have one or more exercises. For example:
Exercise Table
Exercise ID  Name
1            Exercise 1
2            Exercise 2
3            Exercise 3
Routine Table
Routine ID  Name
1           Routine 1
2           Routine 2
3           Routine 3
Routine_Exercise Table
Exercise ID  Routine ID  Day
1            1           1
2            1           1
3            1           1
1            1           2
2            1           3
3            1           4
The thing that I'm trying to do is generate a series from start_date to end_date (e.g. 03-25-2020 to 05-25-2020, two months) and assign to each date the day number it is supposed to work.
For example, using the data in the Routine_Exercise Table, the user should only work out on days 1, 2, 3, and 4, so I would like to attach that number to each date, something like this:
Expected Result
Date        Number
03-25-2020  1
03-26-2020  2
03-27-2020  3
03-28-2020  4
03-29-2020  null
03-30-2020  null
03-31-2020  null
04-01-2020  1
04-02-2020  2
04-03-2020  3
04-04-2020  4
04-05-2020  null
Any suggestions or different ideas on how to implement this? Another solution that doesn't require series?
Thanks in advance!
You can generate the dates between the start and end input dates using generate_series and then do a left join with your routine_exercise table as follows:
SELECT t.d, re.day
FROM generate_series(timestamp '2020-03-25', timestamp '2020-05-25',
                     interval '1 day') AS t(d)
LEFT JOIN (SELECT DISTINCT day
           FROM Routine_Exercise
           WHERE routine_id = 1) re
  ON mod(extract(day from (t.d - timestamp '2020-03-25')), 7) + 1 = re.day;
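To avoid hard-coding the range, the dates could come from the routine row itself. A sketch, where the column names are assumptions (routine(id, start_date, end_date) and routine_exercise(routine_id, day)):

SELECT t.d::date AS date,
       re.day    AS number
FROM routine r
CROSS JOIN LATERAL
     -- one calendar row per day of this routine's date range
     generate_series(r.start_date, r.end_date, interval '1 day') AS t(d)
LEFT JOIN (SELECT DISTINCT routine_id, day FROM routine_exercise) re
  ON re.routine_id = r.id
 AND mod(extract(day from (t.d - r.start_date))::int, 7) + 1 = re.day
WHERE r.id = 1;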

SQL ETL for creating a table - job interview question

I had a SQL test for a job I wanted, but unfortunately I didn't get the job.
I was hoping someone could help me with the right answer for a question in the test.
So here is the question:
ETL Part
Our “Events” table (the data source) is created in real time. The table has no updates, only appends.
event_id  event_type  time              user_id  OS       Country
1         A           01/12/2018 15:39  1111     iOS      ES
2         B           01/12/2018 10:43  2222     iOS      Ge
3         C           02/12/2018 16:05  3333     Android  IN
4         A           02/12/2018 16:39  3333     Android  IN
Presented below is the Fact_Events table that is part of our DWH. This table aggregates the number of events at an hourly level. The ETL process runs every 30 minutes.
date        hour   event_type_A  event_type_B  event_type_C
01/12/2018  15:00  1             0             0
01/12/2018  10:00  0             1             0
02/12/2018  16:00  1             0             1
Please answer the following questions:
1. Define the steps to create the Fact_Events table.
2. For each step, provide the output table.
3. Write the query for each step.
4. What loading method would you use?
I really appreciate any help, as I wish to learn for future job interviews.
Thanks in advance,
Ido.
Here is my answer; please tell me if I am correct or if there is a better solution.
ETL Part
1.
Create a table for each event type; in this case we will need 3 tables.
Use UNION ALL to concatenate all the tables into one table.
2.
First step:
date        hour   event_type_A  event_type_B  event_type_C
01/12/2018  15:00  1             0             0
02/12/2018  16:00  1             0             0
Second step:
date        hour   event_type_A  event_type_B  event_type_C
01/12/2018  15:00  1             0             0
02/12/2018  16:00  1             0             0
01/12/2018  10:00  0             1             0
Third step:
date        hour   event_type_A  event_type_B  event_type_C
01/12/2018  15:00  1             0             0
02/12/2018  16:00  1             0             0
01/12/2018  10:00  0             1             0
02/12/2018  16:00  0             0             1
SELECT Date(time) AS date,
       Hour(time) AS hour,
       Count(event_type) AS event_type_A,
       0 AS event_type_B,
       0 AS event_type_C
FROM Events
WHERE event_type = 'A'
UNION ALL
SELECT Date(time) AS date,
       Hour(time) AS hour,
       0 AS event_type_A,
       Count(event_type) AS event_type_B,
       0 AS event_type_C
FROM Events
WHERE event_type = 'B'
UNION ALL
SELECT Date(time) AS date,
       Hour(time) AS hour,
       0 AS event_type_A,
       0 AS event_type_B,
       Count(event_type) AS event_type_C
FROM Events
WHERE event_type = 'C'
I would use an incremental load:
the first time, we run the script over all the data and save the table;
from then on, we append only the new events that don't already exist in the saved table.
The loading query is way off. You need to group and pivot.
Should be something like this:
select Date(time) as date,
       datepart(hour, time) as hour,
       Sum(case when event_type = 'A' then 1 else 0 end) as event_type_A,
       Sum(case when event_type = 'B' then 1 else 0 end) as event_type_B,
       Sum(case when event_type = 'C' then 1 else 0 end) as event_type_C
from Events
group by Date(time), datepart(hour, time)
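If the target platform happened to be Postgres, the same group-and-pivot could be written with aggregate FILTER clauses. A sketch (it assumes the timestamp column is literally named "time", hence the quoting):

SELECT "time"::date                               AS date,
       date_trunc('hour', "time")::time           AS hour,
       count(*) FILTER (WHERE event_type = 'A')   AS event_type_A,
       count(*) FILTER (WHERE event_type = 'B')   AS event_type_B,
       count(*) FILTER (WHERE event_type = 'C')   AS event_type_C
FROM Events
GROUP BY 1, 2;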
In my shop, I'm not heavily involved in hiring, but I do take a large part in determining if anyone is any good - and hence, you could say I play a big part in firing :(
I don't want people stuffing up my data - or giving inaccurate data to clients - so understanding the overall data flow (start to finish) and the final output are key.
As such, you'd need a much bigger answer to question 1, potentially including questions back.
We group data hourly, but load data every half hour. The answer has to work for these.
What do we want if there are no transactions in a given hour? No line, or a line with all 0's? My gut feeling is the latter (as it's useful to know that no transactions occurred), but that is not guaranteed.
Then we'd like to see options, and an evaluation of them e.g.,
The first run each hour creates new data for the hour, and the second updates the data, or
Create the rows as a separate process, then the half-hour process simply updates those.
However, note that the options depend heavily on the answers to the previous questions, e.g.,
if all-zero rows are allowed, then you can create the rows and then update them;
but if all-zero rows are not allowed, then you need to do inserts in the first round, then updates and inserts in the second round (as a row may not have been created in the first round).
After you have the strategy and have evaluated it, then yes - write the SQL. First make it accurate/correct and understandable/maintainable. Then do efficiency.
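For the insert-then-update strategy above, an upsert keyed on the hour keeps the half-hourly runs idempotent. A Postgres-flavoured sketch (assumptions: Fact_Events has a unique constraint on (date, hour), and only recent hours are reloaded):

INSERT INTO Fact_Events (date, hour, event_type_A, event_type_B, event_type_C)
SELECT "time"::date,
       date_trunc('hour', "time")::time,
       count(*) FILTER (WHERE event_type = 'A'),
       count(*) FILTER (WHERE event_type = 'B'),
       count(*) FILTER (WHERE event_type = 'C')
FROM Events
WHERE "time" >= date_trunc('hour', now()) - interval '1 hour'  -- current and previous hour only
GROUP BY 1, 2
ON CONFLICT (date, hour) DO UPDATE
SET event_type_A = excluded.event_type_A,
    event_type_B = excluded.event_type_B,
    event_type_C = excluded.event_type_C;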

Using crosstab, dynamically loading column names of resulting pivot table in one query?

The gem we have installed (Blazer) on our site limits us to one query.
We are trying to write a query to show how many hours each employee has for the past 10 days. The first column would have employee names and the rest would have hours with the column header being each date. I'm having trouble figuring out how to make the column headers dynamic based on the day. The following is an example of what we have working without dynamic column headers and only using 3 days.
SELECT
    pivot_table.*
FROM
    crosstab(
        E'SELECT
            "User",
            "Date",
            "Hours"
        FROM
            (SELECT
                "q"."qdb_users"."name" AS "User",
                to_char("qdb_works"."date", \'YYYY-MM-DD\') AS "Date",
                sum("qdb_works"."hours") AS "Hours"
            FROM
                "q"."qdb_works"
            LEFT OUTER JOIN
                "q"."qdb_users" ON
                    "q"."qdb_users"."id" = "q"."qdb_works"."qdb_user_id"
            WHERE
                "qdb_works"."date" > current_date - 20
            GROUP BY
                "User",
                "Date"
            ORDER BY
                "Date" DESC,
                "User" DESC) "x"
        ORDER BY 1, 2')
    AS
    pivot_table (
        "User" VARCHAR,
        "2017-10-06" FLOAT,
        "2017-10-05" FLOAT,
        "2017-10-04" FLOAT
    );
This results in
| User | 2017-10-05 | 2017-10-04 | 2017-10-03 |
|------|------------|------------|------------|
| John | 1.5 | 3.25 | 2.25 |
| Jill | 6.25 | 6.25 | 6 |
| Bill | 2.75 | 3 | 4 |
This is correct, but tomorrow, the column headers will be off unless we update the query every day. I know we could pivot this table with date on the left and names on the top, but that will still need updating with each new employee – and we get new ones often.
We have tried using functions and queries in the "AS" section with no luck. For example:
AS
pivot_table (
"User" VARCHAR,
current_date - 0 FLOAT,
current_date - 1 FLOAT,
current_date - 2 FLOAT
);
Is there any way to pull this off with one query?
You could select a row for each user, and then per column sum the hours for one day:
with user_work as
(
  select u.name as "user"  -- "user" is a reserved word in Postgres, so quote the alias
       , to_char(w.date, 'YYYY-MM-DD') as dt_str
       , w.hours
  from qdb_works w
  join qdb_users u
    on u.id = w.qdb_user_id
  where w.date >= current_date - interval '2 days'
)
select "user"
     , sum(case when dt_str = to_char(current_date,
           'YYYY-MM-DD') then hours end) as Today
     , sum(case when dt_str = to_char(current_date - interval '1 day',
           'YYYY-MM-DD') then hours end) as Yesterday
     , sum(case when dt_str = to_char(current_date - interval '2 days',
           'YYYY-MM-DD') then hours end) as DayBeforeYesterday
from user_work
group by  -- group by the user only, so the CASE sums pivot the days into columns
  "user"
It's often easier to return a list and pivot it client side. That also allows you to generate column names with a date.
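A sketch of that list-shaped query against the same tables (one row per user per day, left for the client to pivot):

SELECT u.name       AS "user",
       w.date::date AS day,
       sum(w.hours) AS hours
FROM q.qdb_works w
JOIN q.qdb_users u ON u.id = w.qdb_user_id
WHERE w.date > current_date - 10
GROUP BY u.name, w.date::date
ORDER BY day DESC, "user";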
Is there any way to pull this off with one query?
No, because a fixed SQL query cannot have any variability in its output columns. The SQL engine determines the number, types and names of every column of a query before executing it, without reading any data except in the catalog (for the structure of tables and other objects), execution being just the last of 5 stages.
A single-query dynamic pivot, if such a thing existed, couldn't be prepared, since a prepared query always has the same result structure, whereas by definition a dynamic pivot doesn't, as the rows that pivot into columns can change between executions. That would again be at odds with the Prepare-Bind-Execute model.
You may find some limited workarounds and additional explanations in other questions, for example: Execute a dynamic crosstab query. But since you mentioned specifically that "The gem we have installed (Blazer) on our site limits us to one query", I'm afraid you're out of luck. Whatever the workaround, it always needs at least one step with a query to figure out the columns and generate a dynamic query from them, and a second step executing the query generated in the previous step.
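For illustration, the column-discovery step could look like this sketch, which builds the crosstab output-column list for the last 10 days as text, for the application to splice into a dynamically generated query:

-- produces e.g.: "User" VARCHAR, "2017-10-06" FLOAT, "2017-10-05" FLOAT, ...
SELECT '"User" VARCHAR, '
       || string_agg(format('%I FLOAT', d::date), ', ' ORDER BY d DESC)
FROM generate_series(current_date - 9, current_date, interval '1 day') AS g(d);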

SQL Find latest record only if COMPLETE field is 0

I have a table with multiple records submitted by a user. In each record is a field called COMPLETE to indicate if a record is fully completed or not.
I need a way to get the latest record per user where COMPLETE is 0, LOCATION and DATE are the same, and no additional record exists where COMPLETE is 1. Each record has additional fields such as TYPE, AMOUNT, TOTAL, etc.; these can differ even though the USER, LOCATION, and DATE are the same.
There is a SUB_DATE field, denoting the day the submission was made, and an auto-incremented ID field. Here is the table:
ID  NAME   LOCATION  DATE        COMPLETE  SUB_DATE    TYPE1   AMOUNT1  TYPE2  AMOUNT2  TOTAL
1   user1  loc1      2017-09-15  1         2017-09-10  Food    12.25    Hotel  65.54    77.79
2   user1  loc1      2017-09-15  0         2017-09-11  Food    12.25    NULL   0        12.25
3   user1  loc2      2017-08-13  0         2017-09-05  Flight  140      Food   5        145.00
4   user1  loc2      2017-08-13  0         2017-09-10  Flight  140      NULL   0        140
5   user1  loc3      2017-07-14  0         2017-07-15  Taxi    25       NULL   0        25
6   user1  loc3      2017-08-25  1         2017-08-26  Food    45       NULL   0        45
The result I would like to retrieve is ID 4, because its SUB_DATE is later than ID 3's; it has the same NAME, LOCATION, and DATE information, and there is no record with COMPLETE = 1 for that group.
I would also like to retrieve ID 5, since it is the latest record for that USER, LOCATION, and DATE, and COMPLETE is 0.
I would also appreciate it if you could explain your answer to help me understand what is happening in the solution.
Not sure if I fully understood, but try this:
SELECT *
FROM (
    SELECT *,
           MAX(CONVERT(INT, COMPLETE)) OVER (PARTITION BY NAME, LOCATION, DATE) AS CompleteForNameLocationAndDate,
           MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE) AS LastSubDate
    FROM your_table t
) a
WHERE CompleteForNameLocationAndDate = 0 AND
      SUB_DATE = LastSubDate
So what we have done here:
First, if you run just the inner query in Management Studio, you will see what that does:
The first max function will partition the data in the table by each unique Name,Location,Date set.
In the case of your data, ID 1 & 2 are the first partition, 3&4 are the second partition, 5 is the 3rd partition and 6 is the 4th partition.
So for each of these partitions it will get the max value in the COMPLETE column. Therefore any partition whose max value is 1 has been completed.
Note also, the convert function. This is because COMPLETE is of datatype BIT (1 or 0) and the max function does not work with that datatype. We therefore convert to INT. If your COMPLETE column is type INT, you can take the convert out.
The second max function partitions by unique Name, Location and Date again, but this time we are getting the max SUB_DATE, which gives us the date of the latest record for that Name, Location, Date.
So we take that query and put it in a derived table, which for simplicity we call a. We need to do this because SQL Server doesn't allow windowed functions in the WHERE clause of queries. A windowed function is one that makes use of the OVER keyword, as we have done. In an ideal world, SQL would let us do:
SELECT *,
       MAX(CONVERT(INT, COMPLETE)) OVER (PARTITION BY NAME, LOCATION, DATE) AS CompleteForNameLocationAndDate,
       MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE) AS LastSubDate
FROM your_table t
WHERE MAX(CONVERT(INT, COMPLETE)) OVER (PARTITION BY NAME, LOCATION, DATE) = 0 AND
      SUB_DATE = MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE)
But it doesn't allow it so we have to use the derived table.
So then we basically SELECT everything from our derived table Where
CompleteForNameLocationAndDate = 0
Which are Name,Location, Date partitions which do not have a record marked as complete.
Then we filter further asking for only the latest record for each partition
SUB_DATE = LastSubDate
Hope that makes sense; I'm not sure what level of detail you need.
As an aside, I would look at restructuring your tables (unless of course you have simplified them to better explain this problem) as follows
(assuming the table in your examples is called Booking):
tblBooking
    BookingID
    PersonID
    LocationID
    Date
    Complete
    SubDate
tblPerson
    PersonID
    PersonName
tblLocation
    LocationID
    LocationName
tblType
    TypeID
    TypeName
tblBookingType
    BookingTypeID
    BookingID
    TypeID
    Amount
This way, if you ever want to add Type3 or Type4 to your booking information, you don't need to alter your table layout.
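A hedged DDL sketch of the key table, tblBookingType (the column types and IDENTITY seed are assumptions); each extra type then becomes a row rather than a column:

-- sketch only: types and constraints are assumptions, adjust to your schema
CREATE TABLE tblBookingType (
    BookingTypeID int IDENTITY(1,1) PRIMARY KEY,
    BookingID     int NOT NULL REFERENCES tblBooking (BookingID),
    TypeID        int NOT NULL REFERENCES tblType (TypeID),
    Amount        decimal(10, 2) NOT NULL
);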