Creating a SQL query to find recent field changes - sql

I am having trouble writing a SQL query that selects the most recent change in hours, along with the difference from the previously recorded value.
The table is shown below; the database keeps every historical version of an item, one row per revision:
Item ID Title RevisedDate ChangedDate Rev WorkHours
Task 187061 Development 10/9/12 11:14 10/5/12 15:54 1 4
Task 187061 Development 10/9/12 14:29 10/9/12 11:14 2 8
Task 187061 Development 10/10/12 15:07 10/9/12 14:29 3 16
Task 187061 Development 10/11/12 9:59 10/10/12 15:07 4 16
Task 187061 Development 10/12/12 10:51 10/11/12 9:59 5 16
Task 187061 Development 12/6/12 15:25 10/12/12 10:51 6 16
Task 187061 Development 12/11/12 10:27 12/6/12 15:25 7 16
Task 187061 Development 1/1/99 0:00 12/11/12 10:27 8 16
So the task's worked hours were most recently updated on 10/10/12 15:07, from 8 hours to 16 hours. I am having trouble writing a query that tells me this.
At the end of the day I need a result like this:
Item ID Title RevisedDate ChangedDate Rev WorkHours ChangeHours
Task 187061 Development 10/10/12 15:07 10/9/12 14:29 3 16 8
(P.S. I took one task as an example; the actual table has hundreds of tasks, each with several historical versions.)

As I understand your question, you want the item with the most recent revised date for each ID.
You get that like this:
SELECT *
FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY RevisedDate DESC) as ord
FROM TABLE
) T
WHERE ord = 1
If you want the most recent revision whose hours actually changed, that is harder:
-- First find the ones that changed
With FlagChange AS
(
SELECT T1.ID, T1.REV, T1.RevisedDate,
CASE WHEN T2.ID IS NULL THEN FALSE
WHEN T1.WorkHours != T2.WorkHours THEN TRUE
ELSE FALSE END AS Changed
FROM TABLE T1
LEFT JOIN TABLE T2 ON T1.ID = T2.ID AND T2.REV = T1.REV-1
), NumberChange AS -- now use row number
(
SELECT ID, REV,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY RevisedDate DESC) as ord
FROM FlagChange
WHERE Changed = True
), SelectRecent AS -- take the newest ones
(
SELECT ID, REV
FROM NumberChange
WHERE ord = 1
) -- add in all the data and the ones with one revision
SELECT T1.*
FROM TABLE T1
JOIN SelectRecent SR ON T1.ID = SR.ID AND T1.REV = SR.REV
UNION ALL
SELECT *
FROM TABLE
WHERE ID NOT IN (SELECT ID FROM SelectRecent)
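Note that the query above does not produce the ChangeHours column asked for in the question. Here is a minimal sketch of one way to get it, using LAG() in place of the self-join (a swapped-in technique, not what the answer above does). It runs against an in-memory SQLite database from Python so it can be tried directly. The table name TaskHistory, the reduced column set and the ISO date strings are assumptions made for this example, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name and reduced column set, for illustration only.
conn.execute(
    "CREATE TABLE TaskHistory (ID INTEGER, Rev INTEGER, ChangedDate TEXT, WorkHours INTEGER)"
)
conn.executemany(
    "INSERT INTO TaskHistory VALUES (?, ?, ?, ?)",
    [
        (187061, 1, "2012-10-05 15:54", 4),
        (187061, 2, "2012-10-09 11:14", 8),
        (187061, 3, "2012-10-09 14:29", 16),
        (187061, 4, "2012-10-10 15:07", 16),
        (187061, 5, "2012-10-11 09:59", 16),
        (187061, 6, "2012-10-12 10:51", 16),
        (187061, 7, "2012-12-06 15:25", 16),
        (187061, 8, "2012-12-11 10:27", 16),
    ],
)

query = """
WITH WithPrev AS (
    -- pair every revision with the hours of the revision before it
    SELECT ID, Rev, ChangedDate, WorkHours,
           LAG(WorkHours) OVER (PARTITION BY ID ORDER BY Rev) AS PrevHours
    FROM TaskHistory
),
Changes AS (
    -- keep only revisions where the hours changed; number them newest first
    SELECT ID, Rev, ChangedDate, WorkHours,
           WorkHours - PrevHours AS ChangeHours,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Rev DESC) AS ord
    FROM WithPrev
    WHERE PrevHours IS NOT NULL AND WorkHours <> PrevHours
)
SELECT ID, Rev, ChangedDate, WorkHours, ChangeHours
FROM Changes
WHERE ord = 1
"""
latest_change = conn.execute(query).fetchall()
print(latest_change)  # [(187061, 3, '2012-10-09 14:29', 16, 8)]
```

Items whose hours never changed produce no row here; whether they should appear at all is left open by the question.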

Related

BigQuery filter WHERE by date for last 5 rows for each value of categorical column

Apologies if the title is a bit wordy - I will create an example below to highlight what I'm referring to. I have the following table of information:
t1
date team num_val
2017-10-04 ab 7
2017-10-03 ab 6
2017-10-02 ab 8
2017-10-05 ab 3
2017-10-07 ab 12
2017-10-06 ab 3
2017-10-01 ab 5
2017-09-08 cd 4
2017-09-09 cd 8
2017-09-10 cd 2
2017-09-14 cd 1
2017-09-13 cd 5
2017-09-11 cd 6
2017-09-12 cd 13
With this table, I would simply like to:
Filter, for each team, the most recent 5 dates
Group by team and sum the num_val column
Simple enough. However, there is no rhyme or reason to the dates for each team (I cannot simply filter on the most recent 5 dates overall, since they may be different for each team). I currently have the following framework for the query:
SELECT
team,
sum(num_val)
FROM t1
GROUP BY team
... any help getting this to the finish line would be greatly appreciated, thanks!!
Here are a few more options for BigQuery Standard SQL, so you can see different approaches.
Option #1
#standardSQL
SELECT team, SUM(num_val) sum_num FROM (
SELECT team, num_val, ROW_NUMBER() OVER(PARTITION BY team ORDER BY DATE DESC) pos
FROM `project.dataset.table`
)
WHERE pos <= 5
GROUP BY team
Option #2
#standardSQL
SELECT team, sum_num FROM (
SELECT team,
SUM(num_val) OVER(PARTITION BY team ORDER BY DATE DESC ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING) AS sum_num,
ROW_NUMBER() OVER(PARTITION BY team ORDER BY DATE DESC) pos
FROM `project.dataset.table`
)
WHERE pos = 1
Applied to the sample data from your question, both produce the result below:
Row team sum_num
1 ab 31
2 cd 27
While the options above can be useful in more complicated cases, in your particular case I would go with the option (similar to the one) presented in Filipe's answer:
#standardSQL
SELECT team, (SELECT SUM(num_val) FROM UNNEST(num_values)) sum_num
FROM (
SELECT team, ARRAY_AGG(STRUCT(num_val) ORDER BY DATE DESC LIMIT 5) num_values
FROM `project.dataset.table`
GROUP BY team
)
To get the latest 5 for each:
SELECT team, ARRAY_AGG(num_val ORDER BY date DESC LIMIT 5) arr
FROM x
GROUP BY team
Then UNNEST(arr) and add those num_vals.
SELECT team, (SELECT SUM(num_val) FROM UNNEST(arr) num_val) the_sum
FROM (previous)
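To check the expected totals without a BigQuery project, here is a small sketch that runs the ROW_NUMBER() variant (Option #1 above) against the sample data in an in-memory SQLite database from Python. SQLite has no ARRAY_AGG, so only the window-function approach is shown; the table name t1 comes from the question, and SQLite 3.25 or newer is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (date TEXT, team TEXT, num_val INTEGER)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [
        ("2017-10-04", "ab", 7), ("2017-10-03", "ab", 6), ("2017-10-02", "ab", 8),
        ("2017-10-05", "ab", 3), ("2017-10-07", "ab", 12), ("2017-10-06", "ab", 3),
        ("2017-10-01", "ab", 5),
        ("2017-09-08", "cd", 4), ("2017-09-09", "cd", 8), ("2017-09-10", "cd", 2),
        ("2017-09-14", "cd", 1), ("2017-09-13", "cd", 5), ("2017-09-11", "cd", 6),
        ("2017-09-12", "cd", 13),
    ],
)

query = """
SELECT team, SUM(num_val) AS sum_num
FROM (
    -- number each team's rows from newest to oldest
    SELECT team, num_val,
           ROW_NUMBER() OVER (PARTITION BY team ORDER BY date DESC) AS pos
    FROM t1
)
WHERE pos <= 5          -- keep the 5 most recent dates per team
GROUP BY team
ORDER BY team
"""
sums = conn.execute(query).fetchall()
print(sums)  # [('ab', 31), ('cd', 27)]
```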

Include only transition states in SQL query

I have a table with customers and their purchase behaviour that looks as follows:
customer shop time
----------------------------
1 5 13.30
1 5 14.33
1 10 22.17
2 3 12.15
2 1 13.30
2 1 15.55
2 3 17.29
Since I only want the rows where the shop changes, I need the following output:
customer shop time
----------------------------
1 5 13.30
1 10 22.17
2 3 12.15
2 1 13.30
2 3 17.29
I have tried using
ROW_NUMBER() OVER (PARTITION BY customer, shop ORDER BY time ASC) AS counter
and then keeping only the rows with counter = 1. However, this breaks when the customer visits the same shop again later on, as with customer=2 and shop=3 in my example.
I came up with this:
WITH a AS
(
SELECT
customer, shop, time,
ROW_NUMBER() OVER (PARTITION BY customer ORDER BY time ASC) AS counter
FROM
db
)
SELECT a1.*
FROM a a1
JOIN a AS a2 ON (a1.customer = a2.customer AND a2.counter + 1 = a1.counter AND a2.shop <> a1.shop)
UNION
SELECT a.*
FROM a
WHERE counter = 1
However, this is very inefficient, and running it in AWS, where my data is located, results in an error telling me:
Query exhausted resources at this scale factor
Is there any way to make this query more efficient?
This is a gaps-and-islands problem. But the simplest solution uses lag():
select customer, shop, time
from (select t.*, lag(shop) over (partition by customer order by time) as prev_shop
from t
) t
where prev_shop is null or prev_shop <> shop;
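As a quick check of the lag() approach, here is a sketch that runs it on the sample rows in an in-memory SQLite database from Python. The table name visits and storing time as text are assumptions made for the example; SQLite 3.25 or newer is needed for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; time is kept as text, which sorts correctly for HH.MM values.
conn.execute("CREATE TABLE visits (customer INTEGER, shop INTEGER, time TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, 5, "13.30"), (1, 5, "14.33"), (1, 10, "22.17"),
        (2, 3, "12.15"), (2, 1, "13.30"), (2, 1, "15.55"), (2, 3, "17.29"),
    ],
)

query = """
SELECT customer, shop, time
FROM (
    -- attach the shop of the customer's previous visit to every row
    SELECT v.*,
           LAG(shop) OVER (PARTITION BY customer ORDER BY time) AS prev_shop
    FROM visits v
) AS t
WHERE prev_shop IS NULL OR prev_shop <> shop   -- first visit, or a change of shop
ORDER BY customer, time
"""
transitions = conn.execute(query).fetchall()
for row in transitions:
    print(row)
```

This keeps the second visit of customer 2 to shop 3, which the counter-per-shop attempt dropped, and it needs a single pass over the data instead of a self-join.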

SQL JOIN - retrieve MAX DateTime from second table and the first DateTime after previous MAX for other value

I have an issue creating a proper SQL expression.
I have a table TICKET with a column TICKETID:
TICKETID
1000
1001
I also have a table STATUSHISTORY, from which I need to retrieve the last (maximum) time the ticket entered VENDOR status and the time it exited that status. By exiting VENDOR status I mean the first INPROG status after it; the status following VENDOR is always INPROG. It is also possible that no VENDOR status exists for an ID in STATUSHISTORY at all (then nulls should be returned). INPROG always exists, and it can occur before VENDOR as well as after it, once the ID is no longer in VENDOR status.
Here is an example of STATUSHISTORY:
ID TICKETID STATUS DATETIME
1 1000 INPROG 01.01.2017 10:00
2 1000 VENDOR 02.01.2017 10:00
3 1000 INPROG 03.01.2017 10:00
4 1000 VENDOR 04.01.2017 10:00
5 1000 INPROG 05.01.2017 10:00
6 1000 HOLD 06.01.2017 10:00
7 1000 INPROG 07.01.2017 10:00
8 1001 INPROG 02.02.2017 10:00
9 1001 VENDOR 03.02.2017 10:00
10 1001 INPROG 04.02.2017 10:00
11 1001 VENDOR 05.02.2017 10:00
So the result when doing the query from TICKET table and doing the JOIN with table STATUSHISTORY should be:
ID VENDOR_ENTERED VENDOR_EXITED
1000 04.01.2017 10:00 05.01.2017 10:00
1001 05.02.2017 10:00 null
Because for ID 1000 last VENDOR status was at 04.01.2017 and the first INPROG status after the VENDOR status for that ID was at 05.01.2017 while for ID 1001 the last VENDOR status was at 05.02.2017 and after that INPROG status did not happen yet.
If VENDOR did not exist then both columns should be null in result.
I am really stuck with this, trying different JOINs but without any progress.
Thank you in advance if you can help me.
You can do this with window functions. First, assign a "vendor" group to the tickets. You can do this using a cumulative sum counting the number of "vendor" records on or before each record.
Then, aggregate the records to get one record per "vendor" group. And use row numbers to get the most recent records. So:
with vg as (
select ticketid,
min(datetime) as vendor_entered,
min(case when status = 'INPROG' then datetime end) as vendor_exited
from (select sh.*,
sum(case when status = 'VENDOR' then 1 else 0 end) over (partition by ticketid order by datetime) as grp
from statushistory sh
) sh
where grp > 0 -- skip rows before the first VENDOR record
group by ticketid, grp
)
select vg.ticketid, vg.vendor_entered, vg.vendor_exited
from (select vg.*,
row_number() over (partition by ticketid order by vendor_entered desc) as seqnum
from vg
) vg
where seqnum = 1;
You can aggregate to get max time, then join onto all of the date values higher than that time, and then re-aggregate:
select a.TicketID,
a.VENDOR_ENTERED,
min( EXIT_TIME ) as VENDOR_EXITED
from (
select TicketID,
max( DATETIME ) as VENDOR_ENTERED
from StatusHistory
where Status = 'VENDOR'
group by TicketID
) as a
left join
(
select TicketID,
DATETIME as EXIT_TIME
from StatusHistory
where Status = 'INPROG'
) as b
on a.TicketID = b.TicketID
and EXIT_TIME >= a.VENDOR_ENTERED
group by a.TicketID,
a.VENDOR_ENTERED
DB2 is not supported in SQLfiddle, but a standard SQL example can be found here.
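For anyone who wants to try the aggregate-and-join idea locally, here is a sketch against an in-memory SQLite database driven from Python. It differs from the query above in two ways that are my own assumptions: it starts from the TICKET table with a LEFT JOIN so that tickets without any VENDOR row still come back with nulls, and it stores the timestamps as ISO strings so that MAX/MIN compare chronologically. Ticket 1002 is a hypothetical extra row added only to show the no-VENDOR case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Ticket (TicketID INTEGER)")
conn.execute(
    "CREATE TABLE StatusHistory (ID INTEGER, TicketID INTEGER, Status TEXT, DATETIME TEXT)"
)
conn.executemany("INSERT INTO Ticket VALUES (?)", [(1000,), (1001,), (1002,)])
conn.executemany(
    "INSERT INTO StatusHistory VALUES (?, ?, ?, ?)",
    [
        (1, 1000, "INPROG", "2017-01-01 10:00"), (2, 1000, "VENDOR", "2017-01-02 10:00"),
        (3, 1000, "INPROG", "2017-01-03 10:00"), (4, 1000, "VENDOR", "2017-01-04 10:00"),
        (5, 1000, "INPROG", "2017-01-05 10:00"), (6, 1000, "HOLD", "2017-01-06 10:00"),
        (7, 1000, "INPROG", "2017-01-07 10:00"), (8, 1001, "INPROG", "2017-02-02 10:00"),
        (9, 1001, "VENDOR", "2017-02-03 10:00"), (10, 1001, "INPROG", "2017-02-04 10:00"),
        (11, 1001, "VENDOR", "2017-02-05 10:00"),
        (12, 1002, "INPROG", "2017-03-01 10:00"),  # hypothetical ticket with no VENDOR row
    ],
)

query = """
SELECT t.TicketID, v.VENDOR_ENTERED, MIN(s.DATETIME) AS VENDOR_EXITED
FROM Ticket t
LEFT JOIN (
    -- last time each ticket entered VENDOR
    SELECT TicketID, MAX(DATETIME) AS VENDOR_ENTERED
    FROM StatusHistory
    WHERE Status = 'VENDOR'
    GROUP BY TicketID
) v ON v.TicketID = t.TicketID
LEFT JOIN StatusHistory s
       ON s.TicketID = v.TicketID
      AND s.Status = 'INPROG'
      AND s.DATETIME > v.VENDOR_ENTERED   -- only INPROG rows after that VENDOR
GROUP BY t.TicketID, v.VENDOR_ENTERED
ORDER BY t.TicketID
"""
vendor_windows = conn.execute(query).fetchall()
for row in vendor_windows:
    print(row)
```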

SQL oracle with joining tables and Max functions

Some help please? Just a noob here starting to learn how to write SQL and ran into this problem. I know how to use the MAX function but I can't figure out how to join all these requirements together. I have two tables, Accounts and Books (below is an example of the data)
Accounts
ID Series YesorNot Date Filed Plan Year
1 123 Yes 06/12/2015 2015
2 123 No 06/12/2015 2015
3 145 Yes 06/06/2015 2015
4 145 No 02/02/2015 2014
5 198 Yes 02/03/2015 2015
6 187 Yes 02/14/2013 2013
7 153 Yes 01/02/2011 2011
Books
Primary Key Date Created ID
1 06/13/2015 123
2 06/12/2015 123
3 06/07/2015 145
4 02/02/2015 145
5 02/03/2015 198
Two tables: Accounts and Books
Looking for:
1. Data that exists in both tables by the Project ID = Primary Key
2. I only want one unique Series (Series also = ID)
3. I want the MAX (most recent) value of Plan Year, and then if there are duplicates for Plan Year, I need the MAX (most recent) value of Date Created.
4. I just need the columns Project ID, Series, YesorNot, Date Filed, Plan Year so my output should be like this:
Project ID Series YesorNot Date Filed Plan Year
1 123 Yes 06/12/2015 2015
3 145 Yes 06/06/2015 2015
4 145 No 02/02/2015 2014
5 198 Yes 02/03/2015 2015
First join the tables:
SELECT B.Primary_Key as Project_ID, A.Series, A.YesorNot, A.Date_Filed, A.Plan_Year
FROM Books B
JOIN Accounts A ON B.ID = A.Series
You should have been able to get this far on your own (and you should have posted it as part of the question) -- if you can't I'd say find a different career. Assuming you could, now for the slightly harder part.
Now we add a row number based on your criteria:
ROW_NUMBER() OVER (PARTITION BY B.Primary_Key, A.Series, A.YesorNot, A.Date_Filed ORDER BY A.Plan_Year DESC, B.Date_Created DESC) AS RN
Now just keep the rows where the row number is 1.
SELECT Project_ID, Series, YesorNot, Date_Filed, Plan_Year
FROM (
SELECT B.Primary_Key as Project_ID, A.Series, A.YesorNot, A.Date_Filed, A.Plan_Year,
ROW_NUMBER() OVER (PARTITION BY B.Primary_Key, A.Series, A.YesorNot, A.Date_Filed ORDER BY A.Plan_Year DESC, B.Date_Created DESC) AS RN
FROM Books B
JOIN Accounts A ON B.ID = A.Series
) X
WHERE RN = 1

cross reference nearest date data

I have three tables: ElecUser, ElecUsage and ElecEmissionFactor.
ElecUser:
UserID UserName
1 Main Building
2 Staff Quarter
ElecUsage:
UserID Time Amount
1 1/7/2010 23230
1 8/10/2011 34340
1 8/1/2011 34300
1 2/3/2012 43430
1 4/2/2013 43560
1 3/2/2014 44540
2 3/6/2014 44000
ElecEmissionFactor:
Time CO2Emission
1/1/2010 0.5
1/1/2011 0.55
1/1/2012 0.56
1/1/2013 0.57
And intended outcome:
UserName Time CO2
1 2010 11615
1 2011 37752 (34340*0.55 + 34300*0.55)
1 2012 24320.8
1 2013 24829.2
1 2014 25387.8
2 2014 25080
The logic is ElecUsage.Amount * ElecEmissionFactor.CO2Emission.
Rows for the same user and the same year are added up into one record for that year.
My query is:
SELECT ElecUser.UserName, Year([ElecUsage].[Time]), SUM((ElecEmissionFactor.CO2Emission*ElecUsage.Amount)) As CO2
FROM ElecEmissionFactor, ElecUser INNER JOIN ElecUsage ON ElecUser.UserID = ElecUsage.UserID
WHERE (((Year([ElecUsage].[Time]))>=Year([ElecEmissionFactor].[Time])))
GROUP BY ElecUser.UserName, Year([ElecUsage].[Time])
HAVING Year([ElecUsage].[Time]) = Max(Year(ElecEmissionFactor.Time));
However, this only shows the years that have an emission factor.
The challenge is to map a year without an emission factor to the latest earlier year that has one.
A subquery may be one of the solutions, but I have failed to write it.
I have been stuck on this for a while. I hope to see your reply.
Thanks
Try something like this..
-- not tested
select T1.UserID, year(T1.time) as Time, sum(T1.amount*T2.co2emission) as CO2
from ElecUsage T1
left outer join ElecEmissionFactor T2 on (year(T1.time) = year(T2.time))
Group by year(T1.time), T1.UserID
Use a subquery to get the corresponding factor, like this:
select T1.UserID,
year(T1.time) as Time,
sum(T1.amount*
(
select top 1 CO2Emission from ElecEmissionFactor T2
where year(T2.time) <= year(T1.time) order by T2.time desc
)
) as CO2
from ElecUsage T1
Group by year(T1.time), T1.UserID
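To see that the "latest factor on or before the usage year" lookup reproduces the intended outcome, here is a sketch using an in-memory SQLite database from Python. SQLite has no TOP or Year(), so LIMIT 1 and strftime('%Y', ...) stand in for them; the ISO date strings assume the sample dates are month/day/year, which is a guess, although only the year matters for the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ElecUsage (UserID INTEGER, Time TEXT, Amount REAL)")
conn.execute("CREATE TABLE ElecEmissionFactor (Time TEXT, CO2Emission REAL)")
conn.executemany(
    "INSERT INTO ElecUsage VALUES (?, ?, ?)",
    [
        # Dates converted assuming M/D/YYYY; only the year is used below.
        (1, "2010-01-07", 23230), (1, "2011-08-10", 34340), (1, "2011-08-01", 34300),
        (1, "2012-02-03", 43430), (1, "2013-04-02", 43560), (1, "2014-03-02", 44540),
        (2, "2014-03-06", 44000),
    ],
)
conn.executemany(
    "INSERT INTO ElecEmissionFactor VALUES (?, ?)",
    [("2010-01-01", 0.5), ("2011-01-01", 0.55), ("2012-01-01", 0.56), ("2013-01-01", 0.57)],
)

query = """
SELECT UserID, Year, SUM(Amount * Factor) AS CO2
FROM (
    SELECT u.UserID,
           CAST(strftime('%Y', u.Time) AS INTEGER) AS Year,
           u.Amount,
           -- most recent factor whose year is not later than the usage year
           (SELECT f.CO2Emission
            FROM ElecEmissionFactor f
            WHERE strftime('%Y', f.Time) <= strftime('%Y', u.Time)
            ORDER BY f.Time DESC
            LIMIT 1) AS Factor
    FROM ElecUsage u
)
GROUP BY UserID, Year
ORDER BY UserID, Year
"""
# Round to one decimal to hide floating-point noise in the products.
co2 = [(uid, year, round(total, 1)) for uid, year, total in conn.execute(query)]
for row in co2:
    print(row)
```

The 2014 rows pick up the 2013 factor (0.57) because no later factor exists, which matches the intended outcome in the question.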