Get Value Difference and Time Stamp Difference from SQL Table that is not Ideal - sql

This problem is way over my head. Can a report be created from the table below that searches the common date stamps and returns the Tank1Level difference? A few issues can occur: the day on the time stamp can change, and there can be 3 to 5 entries in the database per tank-filling process.
The report would show how much the tank was filled, along with the last t_stamp and last T1_Lot.
Here is the Data Tree
index T1_Lot Tank1Level Tank1Temp t_stamp quality_code
30 70517 - 1 43781.1875 120 7/10/2017 6:43 192
29 70517 - 1 242.6184692 119 7/10/2017 0:54 192
26 70617 - 2 242.6184692 119 7/10/2017 0:51 192
23 70617 - 2 44921.03516 134 7/8/2017 14:22 192
22 70617 - 2 892.652771 107 7/8/2017 8:29 192
21 62917 - 3 892.652771 107 7/8/2017 8:28 192
20 62917 - 3 42352.94141 124 7/6/2017 13:15 192
19 62917 - 3 5291.829102 121 7/6/2017 8:06 192
18 62917 - 2 5273.518066 121 7/6/2017 8:05 192
17 60817 - 2 444.0375366 97 7/6/2017 7:23 192
16 60817 - 2 476.0814819 97 7/5/2017 18:09 192
11 62817 - 3 45374.23047 113 6/30/2017 11:38 192
Here is what the report should look like.
At 7/10/2017 6:43 T1_Lot = 70517 - 1, Tank1Level difference = 43,629, and took 5:52.
At 7/8/2017 14:22 T1_Lot = 70517 - 1, Tank1Level difference = 44,028, and took 5:54.
At 7/6/2017 13:15 T1_Lot = 62917 - 3, Tank1Level difference = 41877, and took 5:10.
Here is how that was calculated.
Find the top time stamp with a value > 40,000 in Tank1Level.
Then find the next value > 40,000 in Tank1Level.
Go one index up.
Or it could be done with less than 8 hours accumulated.
As you can see from the second report line, there is data that should be ignored.
Report the last t_stamp of the series with its T1_Lot.
Calculate the difference in Tank1Level and report it.
Then calculate the t_stamp difference in hh:mm and report it.

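Based on the data you provided, a self join might work.
select afterFill.t_stamp, afterFill.t1_lot,
afterFill.Tank1Level - beforeFill.Tank1Level as level_difference
from yourTable beforeFill join yourTable afterFill on beforeFill.t1_lot = afterFill.t1_lot
and beforeFill.index = afterFill.index - 1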
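The rules in the question are ambiguous, so the following is only a sketch of one possible reading, written in Python rather than SQL. It assumes a fill ends at any reading above 40,000 and starts at the earliest lower reading within the preceding 8 hours. The names `fill_report`, `full_level` and `window` are hypothetical, and the output is not guaranteed to match every line of the desired report.

```python
from datetime import datetime, timedelta

# Rows copied from the question: (index, T1_Lot, Tank1Level, t_stamp)
ROWS = [
    (30, "70517 - 1", 43781.1875, "7/10/2017 6:43"),
    (29, "70517 - 1", 242.6184692, "7/10/2017 0:54"),
    (26, "70617 - 2", 242.6184692, "7/10/2017 0:51"),
    (23, "70617 - 2", 44921.03516, "7/8/2017 14:22"),
    (22, "70617 - 2", 892.652771, "7/8/2017 8:29"),
    (21, "62917 - 3", 892.652771, "7/8/2017 8:28"),
    (20, "62917 - 3", 42352.94141, "7/6/2017 13:15"),
    (19, "62917 - 3", 5291.829102, "7/6/2017 8:06"),
    (18, "62917 - 2", 5273.518066, "7/6/2017 8:05"),
    (17, "60817 - 2", 444.0375366, "7/6/2017 7:23"),
    (16, "60817 - 2", 476.0814819, "7/5/2017 18:09"),
    (11, "62817 - 3", 45374.23047, "6/30/2017 11:38"),
]


def parse(ts):
    return datetime.strptime(ts, "%m/%d/%Y %H:%M")


def fill_report(rows, full_level=40000, window=timedelta(hours=8)):
    """Return (end time, lot, level difference, 'h:mm') per detected fill."""
    # Sort chronologically so "earliest reading in the window" is well defined.
    rows = sorted(((lot, lvl, parse(ts)) for _, lot, lvl, ts in rows),
                  key=lambda r: r[2])
    report = []
    for lot, level, end in rows:
        if level <= full_level:
            continue  # not the end of a fill
        # Assumption: the fill started at the earliest low reading in the window.
        earlier = [r for r in rows
                   if end - window <= r[2] < end and r[1] <= full_level]
        if not earlier:
            continue  # no starting reading available, skip this peak
        start_level, start_time = earlier[0][1], earlier[0][2]
        minutes = int((end - start_time).total_seconds() // 60)
        report.append((end, lot, round(level - start_level),
                       f"{minutes // 60}:{minutes % 60:02d}"))
    return report


report = fill_report(ROWS)
```

Under these assumptions the fill ending at 7/8/2017 14:22 comes out as a difference of 44,028 over 5:54, in line with the second report line; the other lines depend on which reading is treated as the start of the fill.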

Related

What is the best way to aggregate data for last 7,30,60.. days in SQL

Hi, I have a table with a date and the number of views that we had in our channel on that day
date views
03/06/2020 5
08/06/2020 49
09/06/2020 50
10/06/2020 1
13/06/2020 1
16/06/2020 1
17/06/2020 102
23/06/2020 97
29/06/2020 98
07/07/2020 2
08/07/2020 198
12/07/2020 1
14/07/2020 168
23/07/2020 292
Now we want to see, for each calendar date, the sum of the past 7 and 30 days,
so the result will be
date sum_of_7d sum_of_30d
01/06/2020 0 0
02/06/2020 0 0
03/06/2020 5 5
04/06/2020 5 5
05/06/2020 5 5
06/06/2020 5 5
07/06/2020 5 5
08/06/2020 54 54
09/06/2020 104 104
10/06/2020 100 105
11/06/2020 100 105
12/06/2020 100 105
13/06/2020 101 106
14/06/2020 101 106
15/06/2020 52 106
16/06/2020 53 107
17/06/2020 105 209
18/06/2020 105 209
So I was wondering what the best SQL is that I can write in order to get it.
I'm working on Redshift, and the actual table (not this example) includes over 40B rows.
I used to do something like this:
select dates_helper.date
, tbl1.cnt
, sum(tbl1.cnt) over (order by date rows between 7 preceding and current row ) as sum_7d
, sum(tbl1.cnt) over (order by date rows between 30 preceding and current row ) as sum_30d
from bi_db.dates_helper
left join tbl1
on tbl1.invite_date = dates_helper.date
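To pin down what a "sum of the past N days" means before tuning the query, here is a small Python sketch over the first rows of the sample data. It assumes the window covers the current day plus the N-1 days before it, which with one row per calendar date corresponds to a frame of `rows between 6 preceding and current row` for 7 days. The function name `trailing_sums` is hypothetical, and this only illustrates the window definition; it is not a suggestion for processing 40B rows.

```python
from datetime import date, timedelta

# Sample rows from the question (dd/mm/yyyy): date -> views
VIEWS = {
    date(2020, 6, 3): 5,
    date(2020, 6, 8): 49,
    date(2020, 6, 9): 50,
    date(2020, 6, 10): 1,
    date(2020, 6, 13): 1,
    date(2020, 6, 16): 1,
    date(2020, 6, 17): 102,
}


def trailing_sums(views, start, end, windows=(7, 30)):
    """For each calendar day, sum views over the last n days (inclusive)."""
    result = {}
    day = start
    while day <= end:
        # Days without a row count as 0, like the left join on a dates helper.
        result[day] = tuple(
            sum(views.get(day - timedelta(days=k), 0) for k in range(n))
            for n in windows
        )
        day += timedelta(days=1)
    return result


sums = trailing_sums(VIEWS, date(2020, 6, 1), date(2020, 6, 18))
```

With this definition, 10/06/2020 gives 100 for 7 days and 105 for 30 days, as in the expected table. Some other rows of that table (for example 16/06/2020) only match an 8-day window, so the intended boundary is worth confirming.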

How do I count a range in SQL?

I have data that looks like this:
$ Time : int 0 1 5 8 10 11 15 17 18 20 ...
$ NumOfFlights: int 1 6 144 91 504 15 1256 1 1 578 ...
The Time column is just 24-hour time, from 0 all the way up to 2400.
What I hope to get is:
hour | number of flight
-------------------------------------
1st | 240
2nd | 223
... | ...
24th | 122
Where 1st hour is from midnight to 1am, and 2nd is 1am to 2am, and so on until finally 24th which is from 11pm to midnight. And number of flights is just the total of the NumOfFlights within the range.
I've tried:
dbGetQuery(conn,"
SELECT
flights.CRSDepTime AS Time,
COUNT(flights.CRSDepTime) AS NumOnTimeFlights
FROM flights
GROUP BY CRSDepTime/60
")
But I realise it can't be done this way. The results that I get will have 40 values for time.
> head
Time NumOnTimeFlights
1 50 6055
2 105 2383
3 133 674
4 200 446
5 245 266
6 310 34
> tail
Time NumOnTimeFlights
35 2045 48136
36 2120 103229
37 2215 15737
38 2245 36416
39 2300 15322
40 2355 8018
If your CRSDepTime column is an integer encoded time like HHmm then CRSDepTime/100 will extract the hour.
SELECT
CRSDepTime/100 AS hh,
COUNT(flights.CRSDepTime) AS NumOnTimeFlights
FROM flights
GROUP BY CRSDepTime/100
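The integer division can be checked quickly against an in-memory SQLite database. This is a minimal sketch assuming `CRSDepTime` is stored as an integer in HHmm form; the sample values are made up for illustration, and other databases may need an explicit floor or cast to get integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (CRSDepTime INTEGER)")

# Hypothetical departure times in HHmm form (one row per flight).
sample = [5, 50, 105, 133, 200, 245, 2300, 2355]
conn.executemany("INSERT INTO flights VALUES (?)", [(t,) for t in sample])

# Integer division by 100 drops the minutes and leaves the hour (0-23).
rows = conn.execute(
    """
    SELECT CRSDepTime / 100 AS hh, COUNT(*) AS NumFlights
    FROM flights
    GROUP BY CRSDepTime / 100
    ORDER BY hh
    """
).fetchall()
print(rows)
```

Hour 0 here corresponds to the "1st" hour in the question (midnight to 1am), so add 1 to `hh` if the 1-based labels are wanted.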

In Azure Databricks I want to get start dates of every week with week numbers from datetime column

This is a sample Data Frame
Date Items_Sold
12/29/2019 10
12/30/2019 20
12/31/2019 30
1/1/2020 40
1/2/2020 50
1/3/2020 60
1/4/2020 35
1/5/2020 56
1/6/2020 34
1/7/2020 564
1/8/2020 6
1/9/2020 45
1/10/2020 56
1/11/2020 45
1/12/2020 37
1/13/2020 36
1/14/2020 479
1/15/2020 47
1/16/2020 47
1/17/2020 578
1/18/2020 478
1/19/2020 3578
1/20/2020 67
1/21/2020 578
1/22/2020 478
1/23/2020 4567
1/24/2020 7889
1/25/2020 8999
1/26/2020 99
1/27/2020 66
1/28/2020 678
1/29/2020 889
1/30/2020 990
1/31/2020 58585
2/1/2020 585
2/2/2020 555
2/3/2020 56
2/4/2020 66
2/5/2020 66
2/6/2020 6634
2/7/2020 588
2/8/2020 2588
2/9/2020 255
I am running this query
%sql
use my_items_table;
select weekofyear(Date), count(items_sold) as Sum
from my_items_table
where year(Date)=2020
group by weekofyear(Date)
order by weekofyear(Date)
I am getting this output. (IMP: I have added random values in Sum)
Week Sum
1 | 300091
2 | 312756
3 | 309363
4 | 307312
5 | 310985
6 | 296889
7 | 315611
But I also want a column that holds the start date of each week alongside the week number, like this:
Start_Date Week Sum
12/29/2019 1 300091
1/5/2020 2 312756
1/12/2020 3 309363
1/19/2020 4 307312
1/26/2020 5 310985
2/2/2020 6 296889
2/9/2020 7 315611
I am running the query on Azure Databricks.
If you have data for all days, then just use min():
select min(date), weekofyear(Date), count(items_sold) as Sum
from my_items_table
where year(Date) = 2020
group by weekofyear(Date)
order by weekofyear(Date);
Note: The year() is the calendar year starting on Jan 1. You are not going to get dates from other years using this query. If that is an issue, I would suggest that you ask a new question asking how to get the first day for the first week of the year.
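If some days are missing, the week start can be computed from each date instead of taken from min(). Below is a minimal Python sketch of that arithmetic, assuming weeks start on Sunday, which is inferred from the expected output (12/29/2019 and 1/5/2020 are both Sundays). The function name `week_start` is hypothetical, and the sample uses only the first eight rows of the question's data.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d):
    """Return the Sunday on or before d (date.weekday(): Monday=0 ... Sunday=6)."""
    return d - timedelta(days=(d.weekday() + 1) % 7)


# First rows of the sample data frame: (Date, Items_Sold)
SALES = [
    (date(2019, 12, 29), 10), (date(2019, 12, 30), 20), (date(2019, 12, 31), 30),
    (date(2020, 1, 1), 40), (date(2020, 1, 2), 50), (date(2020, 1, 3), 60),
    (date(2020, 1, 4), 35), (date(2020, 1, 5), 56),
]

# Group by the computed week start, so weeks spanning two years stay whole.
totals = defaultdict(int)
for day, items in SALES:
    totals[week_start(day)] += items
```

Grouping on the computed start date also avoids the year() filter cutting the first week in two.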

SQL: Counting occurrence of certain value from its first appearance till next five minutes and repeat the same for the next occurrence again

I need to find the number of times a value, say 34, occurred from its first occurrence until 5 minutes later.
Then do the same thing again after 5 minutes: fetch the record with value 20 and see how many times it occurred in the next 5 minutes, for each device.
Suppose say I have following table:
DevID value DateTime
--------------------------------------------------
99 20 18-12-2016 18:10
99 34 18-12-2016 18:11
99 34 18-12-2016 18:12
99 20 18-12-2016 18:15
23 15 18-12-2016 18:16
28 34 18-12-2016 18:17
23 15 18-12-2016 18:18
23 12 18-12-2016 18:19
99 20 18-12-2016 18:20
99 34 18-12-2016 18:21
99 34 18-12-2016 18:22
99 34 18-12-2016 18:23
99 34 18-12-2016 18:24
99 34 18-12-2016 18:25
I'm interested in number 34. I want to check the first appearance of number 34, get its time, and then count how many times this number (34) occurred in the next 5 minutes. Basically, fetch the records from the time of the first occurrence until that time + 5 minutes, count how many of them have 34, and if it is more than 3, list that device name.
Repeat the same for the next record with 34, again for the following 5 minutes. In the case above, device 99 first had 34 at 18-12-2016 18:11, but we did not get more than 3 records of 34 in the next 5 minutes. However, we again got 34 at 18-12-2016 18:21 and this time got more than 3 entries of 34 in the next 5 minutes.
So the expected output for the above table would be device id 99.
Edited
I am interested in finding only value 34. So the extra complexity of finding all such repeated values in a 5-minute gap is not required.
I just want to know for which device 34 is repeated more than 3 times (this should be changeable; I can hardcode this to 10 times as well) within a time interval of 5 minutes.
The most efficient method is to use lag()/lead():
select t.*
from (select t.*,
lead(datetime, 2) over (partition by devid order by datetime) as next2_dt
from t
where value = 34
) t
where next2_dt <= dateadd(minute, 5, datetime);
This peeks ahead to the 2nd following row with value 34 and just compares the datetime of that row with the datetime on the current row.
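The same look-ahead idea can be sketched outside SQL. The Python below is an illustration under stated assumptions, not the answer's code: `devices_with_burst` and its parameters are hypothetical names, and `min_count=4` encodes "more than 3" from the question, whereas `lead(datetime, 2)` in the query above corresponds to `min_count=3`.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Rows from the question: (DevID, value, DateTime)
READINGS = [
    (99, 20, "18-12-2016 18:10"), (99, 34, "18-12-2016 18:11"),
    (99, 34, "18-12-2016 18:12"), (99, 20, "18-12-2016 18:15"),
    (23, 15, "18-12-2016 18:16"), (28, 34, "18-12-2016 18:17"),
    (23, 15, "18-12-2016 18:18"), (23, 12, "18-12-2016 18:19"),
    (99, 20, "18-12-2016 18:20"), (99, 34, "18-12-2016 18:21"),
    (99, 34, "18-12-2016 18:22"), (99, 34, "18-12-2016 18:23"),
    (99, 34, "18-12-2016 18:24"), (99, 34, "18-12-2016 18:25"),
]


def devices_with_burst(readings, value=34, min_count=4,
                       window=timedelta(minutes=5)):
    """Devices where `value` occurs at least min_count times within `window`."""
    times = defaultdict(list)
    for dev, val, ts in readings:
        if val == value:  # same role as "where value = 34"
            times[dev].append(datetime.strptime(ts, "%d-%m-%Y %H:%M"))
    hits = []
    step = min_count - 1  # how far to look ahead, like lead(datetime, step)
    for dev, stamps in times.items():
        stamps.sort()
        # A burst exists if the row `step` positions ahead is still in the window.
        if any(stamps[i + step] - stamps[i] <= window
               for i in range(len(stamps) - step)):
            hits.append(dev)
    return sorted(hits)


result = devices_with_burst(READINGS)
```

Looking ahead a fixed number of rows works because, after filtering to the value of interest and sorting by time, every row between the current one and the looked-ahead one must also fall inside the window.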
This could be done as follows:
SELECT DevID
FROM t
WHERE Value = 34
AND 2 <= (
SELECT COUNT(*)
FROM t AS x
WHERE x.DevID = t.DevID
AND x.Value = t.Value
AND x.DateTime > t.DateTime
AND x.DateTime < DATEADD(MINUTE, 5, t.DateTime)
)
GROUP BY DevID
You might want to replace < with <= depending on how you count 5 minutes.
Please adjust to your RDBMS, but it should look something like this:
select b.*
from (
select value, min(DateTime) as md
from the_table
group by value
) as a
join the_table as b
on a.value = b.value
and b.DateTime between a.md and a.md + interval '5' minute

Stata: Generate sum / total by specific date ranges and save them as a new variable

I work with panel data which contain several companies (id) and cover the period from 1.1.2008 to 1.1.2013 (date, year). I want to generate new variable (sum1) which contains a sum of daily observation for var1 for each company and specific time interval. If the interval was equal to each year I would use the function total():
bysort id year: egen sum1=total(var1)
In my case however, the time interval is determined as an interval between two events. I have a special variable called event, which takes value of 1 if the event has occurred on special date and missing otherwise. There are 5 to 10 events for each company. The intervals between events are not equal; hence the first interval can contain 60 observations, the next interval 360 observations. The intervals are also not equal for different companies. The starting date for the first interval for each company is 1.1.2008. The starting date for the second interval is the date of the first event + 1 day. Besides I would like to account for missing values, so if all values of var1 for the company x are missing variable, sum1 for company x and specific interval must contain missing values and not 0.
My panel looks like this:
id date year var1 event sum1(to gen) event_id(to gen)
1 1.1.2008 2008 25 . 95 (25+30+40) 1
1 2.1.2008 2008 30 . 95 (25+30+40) 1
...........................................................1
1 31.4.2008 2008 40 1 95 (25+30+40) 1
1 1.5.2008 2008 50 . 160 (50+50+60) 2
1 2.5.2008 2008 50 . 160 (50+50+60) 2
......................................... ................2
1 31.4.2009 2009 60 1 160 (50+50+60) 2
2 1.1.2008 2008 26 . 96 (26+30+40) 1
2 2.1.2008 2008 30 . 96 (26+30+40) 1
...........................................................1
2 31.6.2008 2008 40 1 96 (26+30+40) 1
2 1.5.2008 2008 51 . 161 (51+50+60) 2
2 2.5.2008 2008 50 . 161 (51+50+60) 2
...........................................................2
2 31.6.2009 2009 60 1 161 (51+50+60) 2
I tried to write different loops (while, if), but I failed to do it correctly. I cannot use rolling as my intervals are not the same.
My other idea was to create the group identifier first (called event_id), which contains the event_id for each interval and each company. Then I could use bysort id event_id: egen sum1=total(var1), but unfortunately I do not have any idea how to do that. So, the variables event_id and sum1 in my panel do not exist and serve as an example for output I want to achieve.
I can make sense of the example with the following changes:
Dates 31 April and 31 June are typos for 1 day earlier.
Date 31.6.2008 should however be 30.4.2008.
That said, one trick of reversing time makes subdivision into spells easy. Given markers 1 for the ends of each spell, we can then cumulate backwards using sum(). The crucial small detail here is that sum() ignores missing values, or more precisely treats them as zero. Here that is entirely a feature, although not quite what the OP wants when applying egen, total().
Then reverse spell numbering, reverse time to the normal direction and apply egen as in other answers. Reversing and reversing back are both just negation using -. Sorting on date within panel is just cosmetic once we have a division into spells, but still the right thing to do.
For more on spells in Stata, see here
For hints from Statalist on how to provide data examples using dataex (SSC), which apply here too with minor modification, see here
clear *
input id str10 date year var1 event DesiredSum
1 1.1.2008 2008 25 . 95
1 2.1.2008 2008 30 . 95
1 30.4.2008 2008 40 1 95
1 1.5.2008 2008 50 . 160
1 2.5.2008 2008 50 . 160
1 30.4.2009 2009 60 1 160
2 1.1.2008 2008 26 . 96
2 2.1.2008 2008 30 . 96
2 30.4.2008 2008 40 1 96
2 1.5.2008 2008 51 . 161
2 2.5.2008 2008 50 . 161
2 30.6.2009 2009 60 1 161
end
gen ddate = -daily(date, "DMY")
bysort id (ddate): gen EVENT = sum(event)
replace ddate = -ddate
by id: replace EVENT = EVENT[_N] - EVENT + 1
bysort id EVENT (ddate): egen Sum = total(var1), missing
assert Sum == DesiredSum
list, sepby(id EVENT)
+-----------------------------------------------------------------------+
| id date year var1 event Desire~m ddate EVENT Sum |
|-----------------------------------------------------------------------|
1. | 1 1.1.2008 2008 25 . 95 17532 1 95 |
2. | 1 2.1.2008 2008 30 . 95 17533 1 95 |
3. | 1 30.4.2008 2008 40 1 95 17652 1 95 |
|-----------------------------------------------------------------------|
4. | 1 1.5.2008 2008 50 . 160 17653 2 160 |
5. | 1 2.5.2008 2008 50 . 160 17654 2 160 |
6. | 1 30.4.2009 2009 60 1 160 18017 2 160 |
|-----------------------------------------------------------------------|
7. | 2 1.1.2008 2008 26 . 96 17532 1 96 |
8. | 2 2.1.2008 2008 30 . 96 17533 1 96 |
9. | 2 30.4.2008 2008 40 1 96 17652 1 96 |
|-----------------------------------------------------------------------|
10. | 2 1.5.2008 2008 51 . 161 17653 2 161 |
11. | 2 2.5.2008 2008 50 . 161 17654 2 161 |
12. | 2 30.6.2009 2009 60 1 161 18078 2 161 |
+-----------------------------------------------------------------------+
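For readers less familiar with Stata, the reverse-and-cumulate trick above can be sketched in Python for a single panel. This is only an illustration under the same assumption that `event == 1` marks the last observation of each spell; the function names `spell_ids` and `spell_totals` are hypothetical, and `None` stands in for Stata's missing value.

```python
def spell_ids(events):
    """Number spells 1, 2, ... given end-of-spell markers (1) in time order."""
    ends_seen = 0
    reversed_ids = []
    # Walk backwards in time, cumulating the end markers (missing counts as 0).
    for e in reversed(events):
        if e == 1:
            ends_seen += 1
        reversed_ids.append(ends_seen)
    ids = reversed_ids[::-1]
    # Flip the numbering so the earliest spell is 1, like EVENT[_N] - EVENT + 1.
    top = max(ids) if ids else 0
    return [top - i + 1 for i in ids]


def spell_totals(values, ids):
    """Total per spell; None if every value in the spell is missing."""
    totals = {}
    for v, s in zip(values, ids):
        if v is not None:
            totals[s] = totals.get(s, 0) + v
    return [totals.get(s) for s in ids]


# Panel id 1 from the example data.
events = [None, None, 1, None, None, 1]
var1 = [25, 30, 40, 50, 50, 60]
ids = spell_ids(events)
sums = spell_totals(var1, ids)
```

Returning `None` for an all-missing spell plays the role of the `missing` option of `egen, total()`.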
If you are not opposed to re-coding event into something a bit easier to work with, the following should suffice. I'm also assuming here that event is used to flag the end of the time interval for which the event occurred (I make this assumption based on your sample data and my comment on the question).
clear *
input id str10 date year var1 event DesiredSum
1 1.1.2008 2008 25 . 95
1 2.1.2008 2008 30 . 95
1 31.4.2008 2008 40 1 95
1 1.5.2008 2008 50 . 160
1 2.5.2008 2008 50 . 160
1 31.4.2009 2009 60 1 160
2 1.1.2008 2008 26 . 96
2 2.1.2008 2008 30 . 96
2 31.6.2008 2008 40 1 96
2 1.5.2008 2008 51 . 161
2 2.5.2008 2008 50 . 161
2 31.6.2009 2009 60 1 161
end
bysort id : gen i = _n // to maintain sort order
/* This section of code changes event so that 1 indicates the start of the
interval. This data structure makes more sense to me */
replace event = 0 if mi(event)
replace event = 2 if event[_n-1] == 1 & _n != 1
replace event = event - 1 if event > 0
replace event = 1 in 1
gen event_id = event
replace event_id = event_id+event_id[_n-1] if i != 1
bysort id event_id : egen Sum = total(var1), missing
li id date event_id DesiredSum Sum, sepby(event_id)
Naturally, if you didn't want to change event, you could generate event2 = event to use in place of event.
It looks like you are essentially trying to create totals for unique combinations of id and eventid, not id and year. Based on your example, the event date and "special date" flag (event) don't seem to matter in calculating the desired sum. Therefore
bysort id eventid: egen _sum = total(var1)
or more simply
egen _sum = total(var1) , by(id eventid)
should both give you the total you want. Regarding
Besides I would like to account for missing values, so if all values of var1 for the company x are missing variable, sum1 for company x and specific interval must contain missing values and not 0.
The missing option on egen total() should help take care of this condition.
Update
Not necessarily an improvement on the other answers, but yet another method (relying on the events being in the proper order in the raw data):
clear *
input id str10 date year var1 event DesiredSum
1 1.1.2008 2008 25 . 95
1 2.1.2008 2008 30 . 95
1 30.4.2008 2008 40 1 95
1 1.5.2008 2008 50 . 160
1 2.5.2008 2008 50 . 160
1 30.4.2009 2009 60 1 160
2 1.1.2008 2008 26 . 96
2 2.1.2008 2008 30 . 96
2 30.4.2008 2008 40 1 96
2 1.5.2008 2008 51 . 161
2 2.5.2008 2008 50 . 161
2 30.6.2009 2009 60 1 161
end
gen _obs = _n
gen date2 = daily(date, "DMY")
format date2 %td
bys id (_obs): gen eventid = sum(date2 == td(01jan2008)) + sum(event[_n-1] == 1)
egen sum = total(var1) , by(id eventid) missing
li , sepby(id eventid)