Stata: Generate sum / total by specific date ranges and save as a new variable

I work with panel data covering several companies (id) over the period 1.1.2008 to 1.1.2013 (date, year). I want to generate a new variable (sum1) that contains the sum of the daily observations of var1 for each company over a specific time interval. If the intervals were simply calendar years, I would use egen with the total() function:
bysort id year: egen sum1=total(var1)
In my case, however, each time interval runs between two events. I have a variable called event, which takes the value 1 if an event occurred on that date and is missing otherwise. There are 5 to 10 events per company. The intervals between events are not equal: the first interval may contain 60 observations and the next 360, and the intervals also differ across companies. The starting date of the first interval for each company is 1.1.2008; the starting date of the second interval is the date of the first event plus one day. I would also like to account for missing values: if all values of var1 for company x in a given interval are missing, then sum1 for that company and interval must be missing, not 0.
My panel looks like this:
id date year var1 event sum1(to gen) event_id(to gen)
1 1.1.2008 2008 25 . 95 (25+30+40) 1
1 2.1.2008 2008 30 . 95 (25+30+40) 1
...........................................................1
1 31.4.2008 2008 40 1 95 (25+30+40) 1
1 1.5.2008 2008 50 . 160 (50+50+60) 2
1 2.5.2008 2008 50 . 160 (50+50+60) 2
...........................................................2
1 31.4.2009 2009 60 1 160 (50+50+60) 2
2 1.1.2008 2008 26 . 96 (26+30+40) 1
2 2.1.2008 2008 30 . 96 (26+30+40) 1
...........................................................1
2 31.6.2008 2008 40 1 96 (26+30+40) 1
2 1.5.2008 2008 51 . 161 (51+50+60) 2
2 2.5.2008 2008 50 . 161 (51+50+60) 2
...........................................................2
2 31.6.2009 2009 60 1 161 (51+50+60) 2
I tried to write various loops (while, if) but failed to get them right. I cannot use rolling because my intervals are not of equal length.
My other idea was to first create a group identifier (event_id) that labels each interval within each company. Then I could use bysort id event_id: egen sum1=total(var1), but unfortunately I have no idea how to create it. The variables event_id and sum1 in the panel above do not exist yet; they illustrate the output I want to achieve.

I can make sense of the example with the following changes: the impossible dates 31 April and 31 June are typos for one day earlier (30.4 and 30.6), except that 31.6.2008 should evidently be 30.4.2008.
That said, the trick of reversing time makes subdivision into spells easy. Given markers of 1 at the end of each spell, we can cumulate backwards using sum(). The crucial small detail is that sum() ignores missing values, or more precisely treats them as zero. Here that is entirely a feature, although not quite what the OP wants when applying egen, total().
Then reverse the spell numbering, reverse time back to the normal direction, and apply egen as in the other answers. Reversing and reversing back are both just negation using -. Sorting on date within panel is just cosmetic once we have a division into spells, but still the right thing to do.
For more on spells in Stata, see here
For hints from Statalist on how to provide data examples using dataex (SSC), which apply here too with minor modification, see here
clear *
input id str10 date year var1 event DesiredSum
1 1.1.2008 2008 25 . 95
1 2.1.2008 2008 30 . 95
1 30.4.2008 2008 40 1 95
1 1.5.2008 2008 50 . 160
1 2.5.2008 2008 50 . 160
1 30.4.2009 2009 60 1 160
2 1.1.2008 2008 26 . 96
2 2.1.2008 2008 30 . 96
2 30.4.2008 2008 40 1 96
2 1.5.2008 2008 51 . 161
2 2.5.2008 2008 50 . 161
2 30.6.2009 2009 60 1 161
end
gen ddate = -daily(date, "DMY")                // reversed time
bysort id (ddate): gen EVENT = sum(event)      // cumulate end markers backwards
replace ddate = -ddate                         // back to normal time
by id: replace EVENT = EVENT[_N] - EVENT + 1   // renumber spells forwards: 1, 2, ...
bysort id EVENT (ddate): egen Sum = total(var1), missing
assert Sum == DesiredSum
list, sepby(id EVENT)
+-----------------------------------------------------------------------+
| id date year var1 event Desire~m ddate EVENT Sum |
|-----------------------------------------------------------------------|
1. | 1 1.1.2008 2008 25 . 95 17532 1 95 |
2. | 1 2.1.2008 2008 30 . 95 17533 1 95 |
3. | 1 30.4.2008 2008 40 1 95 17652 1 95 |
|-----------------------------------------------------------------------|
4. | 1 1.5.2008 2008 50 . 160 17653 2 160 |
5. | 1 2.5.2008 2008 50 . 160 17654 2 160 |
6. | 1 30.4.2009 2009 60 1 160 18017 2 160 |
|-----------------------------------------------------------------------|
7. | 2 1.1.2008 2008 26 . 96 17532 1 96 |
8. | 2 2.1.2008 2008 30 . 96 17533 1 96 |
9. | 2 30.4.2008 2008 40 1 96 17652 1 96 |
|-----------------------------------------------------------------------|
10. | 2 1.5.2008 2008 51 . 161 17653 2 161 |
11. | 2 2.5.2008 2008 50 . 161 17654 2 161 |
12. | 2 30.6.2009 2009 60 1 161 18078 2 161 |
+-----------------------------------------------------------------------+
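For comparison, here is the same reversed-cumulation trick sketched in pandas, with made-up rows mirroring id 1 from the example (0 stands in for Stata's missing event, and min_count=1 plays the role of egen total()'s missing option):

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 1, 1],
    "var1":  [25, 30, 40, 50, 50, 60],
    "event": [0, 0, 1, 0, 0, 1],   # 1 marks the end of a spell
})

# Reverse time, cumulate the end markers, reverse back: every row up to and
# including an end marker gets the same spell number.
rev = df.iloc[::-1]
df["spell"] = rev.groupby("id")["event"].cumsum().iloc[::-1]

# The numbering runs backwards, so flip it to 1, 2, ... in date order.
last = df.groupby("id")["spell"].transform("max")
df["spell"] = last - df["spell"] + 1

# min_count=1 makes an all-missing group sum to NaN instead of 0,
# mirroring egen total(), missing.
df["sum1"] = df.groupby(["id", "spell"])["var1"].transform(lambda s: s.sum(min_count=1))
```

This reproduces the 95 and 160 totals from the listing above.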

If you are not opposed to re-coding event into something a bit easier to work with, the following should suffice. I'm also assuming here that event flags the end of the time interval in which the event occurred (I make this assumption based on your sample data and my comment on the question).
clear *
input id str10 date year var1 event DesiredSum
1 1.1.2008 2008 25 . 95
1 2.1.2008 2008 30 . 95
1 31.4.2008 2008 40 1 95
1 1.5.2008 2008 50 . 160
1 2.5.2008 2008 50 . 160
1 31.4.2009 2009 60 1 160
2 1.1.2008 2008 26 . 96
2 2.1.2008 2008 30 . 96
2 31.6.2008 2008 40 1 96
2 1.5.2008 2008 51 . 161
2 2.5.2008 2008 50 . 161
2 31.6.2009 2009 60 1 161
end
bysort id : gen i = _n // within-id row counter, to maintain sort order
/* This section of code changes event so that 1 indicates the start of the
   interval rather than the end. This data structure makes more sense to me. */
replace event = 0 if mi(event)
replace event = 2 if event[_n-1] == 1 & _n != 1 // the row after an end marker starts a new interval
replace event = event - 1 if event > 0
replace event = 1 in 1
gen event_id = event
replace event_id = event_id + event_id[_n-1] if i != 1 // running count of interval starts (replace works top-down)
bysort id event_id : egen Sum = total(var1), missing
li id date event_id DesiredSum Sum, sepby(event_id)
Naturally, if you didn't want to change event, you could generate event2 = event to use in place of event.

It looks like you are essentially trying to create totals for unique combinations of id and eventid, not id and year. Based on your example, the event date and "special date" flag (event) don't seem to matter in calculating the desired sum. Therefore
bysort id eventid: egen _sum = total(var1)
or more simply
egen _sum = total(var1) , by(id eventid)
should both give you the total you want. Regarding
Besides I would like to account for missing values, so if all values of var1 for the company x are missing variable, sum1 for company x and specific interval must contain missing values and not 0.
The missing option on egen total() should help take care of this condition.
Update
Not necessarily an improvement on the other answers, but yet another method (relying on the events being in the proper order in the raw data):
clear *
input id str10 date year var1 event DesiredSum
1 1.1.2008 2008 25 . 95
1 2.1.2008 2008 30 . 95
1 30.4.2008 2008 40 1 95
1 1.5.2008 2008 50 . 160
1 2.5.2008 2008 50 . 160
1 30.4.2009 2009 60 1 160
2 1.1.2008 2008 26 . 96
2 2.1.2008 2008 30 . 96
2 30.4.2008 2008 40 1 96
2 1.5.2008 2008 51 . 161
2 2.5.2008 2008 50 . 161
2 30.6.2009 2009 60 1 161
end
gen _obs = _n
gen date2 = daily(date, "DMY")
format date2 %td
bys id (_obs): gen eventid = sum(date2 == td(01jan2008)) + sum(event[_n-1] == 1)
egen sum = total(var1) , by(id eventid) missing
li , sepby(id eventid)

Related

How do I count a range in SQL?

I have data that looks like this:
$ Time : int 0 1 5 8 10 11 15 17 18 20 ...
$ NumOfFlights: int 1 6 144 91 504 15 1256 1 1 578 ...
The Time column is just 24-hour time, from 0 all the way up to 2400.
What I hope to get is:
hour | number of flight
-------------------------------------
1st | 240
2nd | 223
... | ...
24th | 122
Where the 1st hour is from midnight to 1am, the 2nd from 1am to 2am, and so on until the 24th, which is from 11pm to midnight. The number of flights is just the total of NumOfFlights within the range.
I've tried:
dbGetQuery(conn,"
SELECT
flights.CRSDepTime AS Time,
COUNT(flights.CRSDepTime) AS NumOnTimeFlights
FROM flights
GROUP BY CRSDepTime/60
")
But I realise it can't be done this way: the results I get have 40 distinct values for Time.
> head
Time NumOnTimeFlights
1 50 6055
2 105 2383
3 133 674
4 200 446
5 245 266
6 310 34
> tail
Time NumOnTimeFlights
35 2045 48136
36 2120 103229
37 2215 15737
38 2245 36416
39 2300 15322
40 2355 8018
If your CRSDepTime column is an integer-encoded time like HHmm, then CRSDepTime/100 will extract the hour, since integer division truncates the minutes.
SELECT
CRSDepTime/100 AS hh,
COUNT(flights.CRSDepTime) AS NumOnTimeFlights
FROM flights
GROUP BY CRSDepTime/100
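The same arithmetic can be checked in a few lines of Python (the counts are taken from the question's head output; this is a sketch, not the real flights table):

```python
from collections import defaultdict

# CRSDepTime is HHmm-encoded, so integer division by 100 drops the minutes:
# 50 -> hour 0 (00:50), 105 -> hour 1 (01:05), 2355 -> hour 23 (23:55).
flights = {50: 6055, 105: 2383, 2300: 15322, 2355: 8018}  # CRSDepTime -> count

by_hour = defaultdict(int)
for t, n in flights.items():
    by_hour[t // 100] += n   # same grouping as GROUP BY CRSDepTime/100
```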

How to add a constant value (1) to an empty column in Snowflake/Matillion

My table looks like this:
id total avg test_no
1 445 89
2 434 85
3 378 75
4 421 84
I'm working in Matillion for Snowflake.
I need my result to look like
id total avg test_no
1 445 89 1
2 434 85 1
3 378 75 1
4 421 84 1
Just use a Calculator component and set the value of the calculated column to 1.
In Snowflake, you would modify the table using:
update t
set test_no = 1;
I assume that Matillion supports this as well.

Create a sequence for dates with repeats

I have a list of days (numbered 195-720) and each day has multiple observations. I would ultimately like to determine which of these days are weekdays and which are weekend days. I would be able to do this if I could just assign the digits 1-7 to each of the days. Currently, the data looks like this:
Day Household ID Hour of Day
195 1 1
195 1 2
195 1 3
195 1 4
196 1 1
196 1 2
196 1 3
197 1 1
197 1 2
It is perhaps important to note that there is not a consistent number of observations for each day (e.g. 4 observations for day 195, 3 observations for day 196, 2 observations for day 197).
I know that Day 195 is a Tuesday, which for simplicity's sake I would like to code as equal to "2" (Wednesday=3, Thursday=4, etc).
Thus, I would like to get the following output:
Day Household ID Hour of Day DAY OF WEEK
195 1 1 2
195 1 2 2
195 1 3 2
195 1 4 2
196 1 1 3
196 1 2 3
196 1 3 3
197 1 1 4
197 1 2 4
After looking through Stata documentation, I considered using DYM/DMY. However, this does not work because I do not have an original "date" variable to work from. Instead, I just have a number "195" which corresponds to Tuesday, July 12.
I wanted to use something like:
bysort day: egen Hour_of_Day = seq(2, 3, 4, 5, 6, 7, 1)
However, Stata tells me that this is a syntax error. Note: I start with 2 because my first day (195) is a Tuesday. I also considered commands like carryforward and functions like mod(x,y) or fill.
Does anyone know how I can set the sequence to fill the same for each day? How can I fix this code to achieve the desired output?
If you know that 195 was Tuesday then the reverse engineering is straightforward. 193 must have been Sunday and 199 Saturday.
Let's look at a sandbox with that week, 193 to 199. Our first guess at a day of week function of our own will use the mod() function (not a command). This paper is a short riff on its applications in Stata.
. clear
. set obs 7
number of observations (_N) was 0, now 7
. gen day = 192 + _n
. gen dow = mod(day, 7)
. list, sep(0)
+-----------+
| day dow |
|-----------|
1. | 193 4 |
2. | 194 5 |
3. | 195 6 |
4. | 196 0 |
5. | 197 1 |
6. | 198 2 |
7. | 199 3 |
+-----------+
Stata's convention for day of week is that 0 is Sunday and 6 is Saturday. That is just a rotation away.
. gen DOW = mod(day + 3, 7)
. list, sep(0)
+-----------------+
| day dow DOW |
|-----------------|
1. | 193 4 0 |
2. | 194 5 1 |
3. | 195 6 2 |
4. | 196 0 3 |
5. | 197 1 4 |
6. | 198 2 5 |
7. | 199 3 6 |
+-----------------+
You can check with Stata's own dow() function that another way to get DOW above is
gen StataDOW = dow(day - 2)
So an indicator for weekday is (for example)
gen weekday = !inlist(DOW, 0, 6)
or
gen weekday = inrange(DOW, 1, 5)
or
gen weekday = !inlist(dow, 4, 3)
using the first variable created.
As it happens, I originally wrote egen, seq(). Your syntax is indeed not legal: seq() is the syntax, and nothing is ever placed within the parentheses. I wouldn't use egen here anyway, if only because the right answers are essentially impossible to get with multiple occurrences (which you have) and unlikely if you have gaps in the data. The reasoning here is, or should be, robust to repetitions and gaps.
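The mod() arithmetic is easy to check in any language; here is a Python rendering, where the +3 offset is pinned down by day 195 being a Tuesday:

```python
# Map the day numbers to Stata's convention: 0 = Sunday, ..., 6 = Saturday.
def dow(day):
    return (day + 3) % 7

# The sandbox week 193..199 runs Sunday (0) through Saturday (6).
week = {d: dow(d) for d in range(193, 200)}

# Weekday indicator: Monday (1) through Friday (5).
def is_weekday(day):
    return dow(day) not in (0, 6)
```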

Get Value Difference and Time Stamp Difference from SQL Table that is not Ideal

This problem is way over my head. Can a report be created from the table below that will search the common date stamps and return the Tank1Level difference? A few issues can occur: the day on the time stamp can change, and there can be 3 to 5 entries in the database per tank-filling process.
The report would show how much the tank was filled with the last t_stamp and last T1_Lot.
Here is the Data Tree
index T1_Lot Tank1Level Tank1Temp t_stamp quality_code
30 70517 - 1 43781.1875 120 7/10/2017 6:43 192
29 70517 - 1 242.6184692 119 7/10/2017 0:54 192
26 70617 - 2 242.6184692 119 7/10/2017 0:51 192
23 70617 - 2 44921.03516 134 7/8/2017 14:22 192
22 70617 - 2 892.652771 107 7/8/2017 8:29 192
21 62917 - 3 892.652771 107 7/8/2017 8:28 192
20 62917 - 3 42352.94141 124 7/6/2017 13:15 192
19 62917 - 3 5291.829102 121 7/6/2017 8:06 192
18 62917 - 2 5273.518066 121 7/6/2017 8:05 192
17 60817 - 2 444.0375366 97 7/6/2017 7:23 192
16 60817 - 2 476.0814819 97 7/5/2017 18:09 192
11 62817 - 3 45374.23047 113 6/30/2017 11:38 192
Here is what the report should look like.
At 7/10/2017 6:43 T1_Lot = 70517 - 1, Tank1Level difference = 43,629, and took 5:52.
At 7/8/2017 14:22 T1_Lot = 70517 - 1, Tank1Level difference = 44,028, and took 5:54.
At 7/6/2017 13:15 T1_Lot = 62917 - 3, Tank1Level difference = 41877, and took 5:10.
Here is how that was calculated:
Find the top time stamp with a value > 40,000 in Tank1Level.
Then find the next > 40,000 in Tank1Level and go one index up.
(Alternatively, it could be done with less than 8 hours accumulated; as you can see from the second report line, there is data that should be ignored.)
Report the last t_stamp of the series with the T1_Lot.
Calculate the difference in Tank1Level and report it.
Then calculate the t_stamp difference in hh:mm and report it.
Based on the data you provided, a self join might work: pair each row with the row one index later for the same lot, so the before and after levels end up in one row. Something along these lines (the SELECT list is only a sketch):
select afterFill.t_stamp,
       afterFill.T1_Lot,
       afterFill.Tank1Level - beforeFill.Tank1Level as level_diff
from yourTable beforeFill
join yourTable afterFill
  on beforeFill.T1_Lot = afterFill.T1_Lot
  and beforeFill.index = afterFill.index - 1
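A pandas sketch of the same consecutive-index self join, using the two rows of the sample data that bracket one fill (column names as in the table; the level_diff and fill_time names are my own):

```python
import pandas as pd

# Rows 22 and 23 from the question: the start and end of one fill.
df = pd.DataFrame({
    "index":      [22, 23],
    "T1_Lot":     ["70617 - 2", "70617 - 2"],
    "Tank1Level": [892.652771, 44921.03516],
    "t_stamp":    pd.to_datetime(["2017-07-08 08:29", "2017-07-08 14:22"]),
})

before = df.add_prefix("before_")
after = df.add_prefix("after_")
after["join_index"] = after["after_index"] - 1  # shift so a row pairs with index - 1

# Pair each row with the next row (index + 1) of the same lot.
report = before.merge(
    after,
    left_on=["before_T1_Lot", "before_index"],
    right_on=["after_T1_Lot", "join_index"],
)
report["level_diff"] = report["after_Tank1Level"] - report["before_Tank1Level"]
report["fill_time"] = report["after_t_stamp"] - report["before_t_stamp"]
```

For this pair, level_diff comes out at about 44,028 and fill_time at 5 hours 53 minutes.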

Pandas: Group by two columns to get sum of another column

I looked at most of the previously asked questions but was not able to find an answer to my question.
I have the following data frame:
id year month score num_attempts
0 483625 2010 01 50 1
1 967799 2009 03 50 1
2 213473 2005 09 100 1
3 498110 2010 12 60 1
5 187243 2010 01 100 1
6 508311 2005 10 15 1
7 486688 2005 10 50 1
8 212550 2005 10 500 1
10 136701 2005 09 25 1
11 471651 2010 01 50 1
I want to get following data frame
year month sum_score sum_num_attempts
2009 03 50 1
2005 09 125 2
2010 12 60 1
2010 01 200 2
2005 10 565 3
Here is what I tried:
sum_df = df.groupby(by=['year','month'])['score'].sum()
But this doesn't look efficient or correct. If I have more than one column that needs to be aggregated, this seems like a very expensive call; for example, if I have another column num_attempts and just want to sum it by year and month, like score.
This should be an efficient way:
sum_df = df.groupby(['year','month']).agg({'score': 'sum', 'num_attempts': 'sum'})
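One detail worth adding: by default the groupby keys become the index of the result. Passing as_index=False (or chaining .reset_index()) keeps year and month as ordinary columns, matching the desired output. A quick check on a cut of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "year":         [2010, 2009, 2005, 2010, 2005],
    "month":        ["01", "03", "09", "01", "09"],
    "score":        [50, 50, 100, 100, 25],
    "num_attempts": [1, 1, 1, 1, 1],
})

# as_index=False keeps year and month as columns rather than a MultiIndex.
sum_df = (df.groupby(["year", "month"], as_index=False)
            .agg({"score": "sum", "num_attempts": "sum"}))
```

Here the two 2005/09 rows collapse to score 125 and num_attempts 2.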