Tallying events within a specific time period prior to the current event - dataframe

I am trying to tally, for each event in a data frame, the number of events that happened in specific periods of time (day/week/month) before that event.
I have a data frame with 50 individuals, each of whom has events scattered across different days/weeks/months. Every row in the data frame is an event, and I'm trying to understand how the number of events in the previous day/week/month affected the way the individual responded to the current event. Every event is marked with an individual ID (ID.2) and has a date and time associated with it (Datetime). I have already created columns for day (epd), week (epw) and month (epm) and want to populate them, for each event, with the number of events for that specific individual in the previous day, week and month respectively.
My data looks like this:
> head(ACss)
Date Datetime ID.2 month day year epd epw epm
1 2019-05-25 2019-05-25 11:57 139 5 25 2019 NA NA NA
2 2019-06-09 2019-06-09 19:42 43 6 9 2019 NA NA NA
3 2019-07-05 2019-07-05 20:12 139 7 5 2019 NA NA NA
4 2019-07-27 2019-07-27 17:27 152 7 27 2019 NA NA NA
5 2019-08-04 2019-08-04 9:13 152 8 4 2019 NA NA NA
6 2019-08-04 2019-08-04 16:18 139 8 4 2019 NA NA NA
I have no idea how to go about doing this, so I haven't tried anything yet! Any and all suggestions are greatly appreciated!
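A minimal sketch of one way to populate epd/epw/epm, assuming the columns shown above (Datetime stored as text like "2019-05-25 11:57", individuals identified by ID.2) and treating "previous month" as the previous 30 days; all of these are assumptions you may want to adjust:
library(dplyr)
library(lubridate)

ACss <- ACss %>%
  mutate(Datetime = ymd_hm(Datetime)) %>%        # parse the date-time text (skip if already POSIXct)
  group_by(ID.2) %>%                             # count only this individual's events
  mutate(
    epd = sapply(Datetime, function(t) sum(Datetime < t & Datetime >= t - days(1))),
    epw = sapply(Datetime, function(t) sum(Datetime < t & Datetime >= t - weeks(1))),
    epm = sapply(Datetime, function(t) sum(Datetime < t & Datetime >= t - days(30)))   # "month" approximated as 30 days
  ) %>%
  ungroup()
Each sapply() call compares every event against all other events for the same individual, so it is quadratic per individual; with 50 individuals and modest event counts that should be fine, but a non-equi join (e.g. with data.table) would scale better.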

Related

Aggregate status change logs to calculate counts in various statuses at the end of every month

I've been tasked with updating some SQL code that my former colleague had written.
The goal is to calculate the number of cards in specific statuses at the end of every month, and then to further group these into a handful of customer types. So far, I've recreated his code to produce an intermediary view that looks like the following:
card_log_id  card_id  customer_type  starting_at               ending_at            previous_status  status           next_status
5241         5329     25             2015-09-30 02:39:17       2015-10-16 15:19:23  null             ISSUED_INACTIVE  CLOSED
6413         5329     25             2015-10-16 15:19:23       null                 ISSUED_INACTIVE  CLOSED           null
-79042       79042    27             2018-10-02 02:08:37       2018-10-29 18:26:04  null             ISSUED_INACTIVE  OPEN
160543       79042    27             2018-10-29 18:26:04       2019-03-12 16:26:34  ISSUED_INACTIVE  OPEN             CLOSED
191884       79042    27             2019-03-12 16:26:34       null                 OPEN             CLOSED           null
-241361      241361   26             2021-08-27 01:14:22.0184  2021-10-04 12:39:12  null             ISSUED_INACTIVE  OPEN
568406       241361   26             2021-10-04 12:39:12       2022-02-09 14:12:21  ISSUED_INACTIVE  OPEN             LOST_CARD
646423       241361   26             2022-02-09 14:12:21       2022-02-09 21:23:13  OPEN             LOST_CARD        CLOSED
646935       241361   26             2022-02-09 21:23:13       null                 LOST_CARD        CLOSED           null
-194813      194813   27             2020-12-15 23:21:31       2021-01-06 16:52:26  null             ISSUED_INACTIVE  OPEN
423720       194813   27             2021-01-06 16:52:26       2021-06-04 12:11:20  ISSUED_INACTIVE  OPEN             FRAUD_BLOCK
497772       194813   27             2021-06-04 12:11:20       2021-06-04 16:03:43  OPEN             FRAUD_BLOCK      OPEN
497913       194813   27             2021-06-04 16:03:43       null                 FRAUD_BLOCK      OPEN             null
The above captures all the status changes for those 4 card_ids. As you can see, card_id 194813 remains Open, whereas all those other cards are closed after going through a series of statuses through the lifetime of the card.
Where the card_log_id is negative, those rows are "artificial" in the sense that they are not found in our card_logs table; they are created by using the date the card was created as the starting_at date and by negating the card_id and assigning that value to card_log_id. The query results are then ordered by card_log_id and then card_id. This correctly orders the logs for each card: the first log spans the time between when the card was created and the first status change, and all subsequent records for that card_id are pulled from our card_logs table and span the time the card was in the respective status.
For context:
ISSUED_INACTIVE is the default status for a new card that has not yet been activated.
OPEN is the status when the card is active and available to use.
INACTIVE is the status when someone sets their card temporarily inactive.
LOST_CARD is the status of the card when someone reports their card lost.
FRAUD_BLOCK is the status when there is suspected fraud and the card is temporarily blocked.
CLOSED is the status when the card is closed and no longer in use.
Here are the requirements for counting and bucketing each of the statuses:
For the purposes of this report, any card that has the status ISSUED_INACTIVE on the last day of the month should be counted in eom_total_not_yet_activated for that month and (most importantly) for the next 90 days only. If, after 90 days from the starting_at date of the first log for a card_id, no status change has occurred and the card remains ISSUED_INACTIVE, exclude it from the count for the month and subsequent months until the status changes.
For the purposes of this report, any card that is not CLOSED or ISSUED_INACTIVE on the last day of the month (so cards that are OPEN, INACTIVE, FRAUD_BLOCK and LOST_CARD) should be included in the count of eom_total_active for that month and all subsequent months unless a status change occurs that would cause it to be counted in a different bucket.
For the purposes of this report, any card that is CLOSED by the last day of the month should be included in the count for eom_total_closed for that month and all subsequent months unless a status change occurs (most of the time CLOSED cards are not re-opened but it can happen).
For the purposes of this report, any card that is not CLOSED, or is ISSUED_INACTIVE and has been in that status for less than 90 days at the end of the month, should be counted in eom_total_non_closed for that month. If the card is ISSUED_INACTIVE and has been in that status for 90 days or more, then exclude them from the count of eom_total_non_closed for that month.
We also have a view called months that has a row for each month dating back to 2013. I believe the outputs are created by joining the card_logs above with the months table and performing some complex bucketing and aggregation to produce the two output tables below. The months table contains the following columns/data:
month    year_only  month_only  idx    bom_date    bom_next_month_date  bom                  eom                  eom_date
2022-08  2022       8           24272  2022-08-01  2022-09-01           2022-08-01 07:00:00  2022-09-01 07:00:00  2022-08-31
2022-07  2022       7           24271  2022-07-01  2022-08-01           2022-07-01 07:00:00  2022-08-01 07:00:00  2022-07-31
2022-06  2022       6           24270  2022-06-01  2022-07-01           2022-06-01 07:00:00  2022-07-01 07:00:00  2022-06-30
...      ...        ...         ...    ...         ...                  ...                  ...                  ...
Beyond this point, I am unable to really understand the code my colleague has written. However, I am hoping you all could help me write the SQL necessary to join these tables and generate two output tables: (1) one that aggregates the card_status counts by month and (2) one that aggregates the card_status counts by month and customer_type.
Below is what the output should look like for the first table that aggregates the card status counts by month.
month    net_new_not_yet_activated  eom_total_not_yet_activated  net_new_active  eom_total_active  net_new_closed  eom_total_closed  net_new_non_closed  eom_total_non_closed
2022-08  1                          783                          38              2364              24              1034              39                  3147
2022-07  22                         782                          75              2326              50              1010              97                  3108
2022-06  37                         760                          52              2251              40              960               89                  3011
2022-05  2                          723                          61              2199              83              920               63                  2922
And below is what the output should look like for the second table that aggregates the card status counts both by month and customer_type.
month    customer_type  net_new_not_yet_activated  eom_total_not_yet_activated  net_new_active  eom_total_active  net_new_closed  eom_total_closed  net_new_non_closed  eom_total_non_closed
2022-08  23             1                          34                           0               89                1               60                1                   123
2022-08  24             -9                         183                          18              714               3               293               9                   897
2022-08  25             2                          137                          -2              424               7               183               0                   561
2022-08  26             7                          175                          8               513               2               186               15                  688
2022-08  27             -6                         247                          14              622               11              310               8                   869
2022-07  23             2                          33                           0               89                2               59                2                   122
2022-07  24             11                         192                          16              696               14              290               27                  888
2022-07  25             -9                         135                          18              426               10              176               9                   561
2022-07  26             6                          168                          11              505               10              184               17                  673
2022-07  27             11                         253                          30              608               14              299               41                  861
This query is run in our BI tool (Sisense/Periscope Data) and uses Amazon Redshift syntax for the SQL code.
I was thinking I need some kind of window function over a partition of the data, but I am really not sure whether that's the right approach, and if so, how to write it.
I'd be happy to provide any additional information/context that could assist in solving this problem.
Thanks in advance for your help!
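In case it helps as a starting point, below is a rough sketch of the join-and-bucket step for the first output table. It assumes the intermediary view is called card_logs, uses months.eom as the end-of-month cut-off, applies the 90-day rule with DATEDIFF, and takes net_new_* to be the month-over-month change in the corresponding eom_total_* column; all of those are assumptions to check against your colleague's code, not a tested implementation.
-- Sketch only: card_logs, the eom cut-off, the 90-day rule and the net_new_*
-- definition are assumptions; verify against the original code.
WITH first_log AS (
    SELECT card_id, MIN(starting_at) AS first_log_at
    FROM card_logs
    GROUP BY card_id
),
eom_status AS (
    -- one row per (month, card): the log row that spans the end of that month
    SELECT m.month, m.eom, cl.card_id, cl.customer_type, cl.status, fl.first_log_at
    FROM months m
    JOIN card_logs cl
      ON cl.starting_at <= m.eom
     AND (cl.ending_at IS NULL OR cl.ending_at > m.eom)
    JOIN first_log fl
      ON fl.card_id = cl.card_id
),
monthly AS (
    SELECT
        month,
        COUNT(CASE WHEN status = 'ISSUED_INACTIVE'
                    AND DATEDIFF(day, first_log_at, eom) < 90
                   THEN 1 END) AS eom_total_not_yet_activated,
        COUNT(CASE WHEN status NOT IN ('CLOSED', 'ISSUED_INACTIVE')
                   THEN 1 END) AS eom_total_active,
        COUNT(CASE WHEN status = 'CLOSED'
                   THEN 1 END) AS eom_total_closed,
        COUNT(CASE WHEN status <> 'CLOSED'
                    AND NOT (status = 'ISSUED_INACTIVE'
                             AND DATEDIFF(day, first_log_at, eom) >= 90)
                   THEN 1 END) AS eom_total_non_closed
    FROM eom_status
    GROUP BY month
)
SELECT
    month,
    eom_total_not_yet_activated
      - LAG(eom_total_not_yet_activated) OVER (ORDER BY month) AS net_new_not_yet_activated,
    eom_total_not_yet_activated,
    eom_total_active - LAG(eom_total_active) OVER (ORDER BY month) AS net_new_active,
    eom_total_active,
    eom_total_closed - LAG(eom_total_closed) OVER (ORDER BY month) AS net_new_closed,
    eom_total_closed,
    eom_total_non_closed - LAG(eom_total_non_closed) OVER (ORDER BY month) AS net_new_non_closed,
    eom_total_non_closed
FROM monthly
ORDER BY month DESC;
For the second output table, the same shape should work with customer_type added to the SELECT lists, to the GROUP BY in monthly, and to a PARTITION BY customer_type in each LAG.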

Trouble converting "Excel time" to "R time"

I have an Excel column that consists of numbers and times that were all supposed to be entered as time values only. Some are in number form (915) and some are in time form (9:15, which appear as decimals in R). It seems like I managed to get them all into the same format in Excel (year-month-day hh:mm:ss), although the dates are incorrect - which doesn't really matter, I just need the time. However, I can't seem to convert this new column (time - new) back to the correct time value in R (in character or time format).
I'm sure this answer already exists somewhere, I just can't find one that works...
# Returns incorrect time
x$new_time <- times(strftime(x$`time - new`,"%H:%M:%S"))
# Returns all NA
x$new_time2 <- as.POSIXct(as.character(x$`time - new`),
                          format = '%H:%M:%S', origin = '2011-07-15 13:00:00')
> head(x)
# A tibble: 6 x 8
Year Month Day `Zone - matched with coordinate tab` Time `time - new` new_time new_time2
<dbl> <dbl> <dbl> <chr> <dbl> <dttm> <times> <dttm>
1 2017 7 17 Crocodile 103 1899-12-31 01:03:00 20:03:00 NA
2 2017 7 17 Crocodile 113 1899-12-31 01:13:00 20:13:00 NA
3 2017 7 16 Crocodile 118 1899-12-31 01:18:00 20:18:00 NA
4 2017 7 17 Crocodile 123 1899-12-31 01:23:00 20:23:00 NA
5 2017 7 17 Crocodile 125 1899-12-31 01:25:00 20:25:00 NA
6 2017 7 16 West 135 1899-12-31 01:35:00 20:35:00 NA
Found this answer here:
Extract time from timestamp?
library(lubridate)
# Adding new column to verify times are correct
x$works <- format(ymd_hms(x$`time - new`), "%H:%M:%S")
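For what it's worth, the shifted values from strftime() look like a time-zone conversion (01:03 becoming 20:03); a minimal sketch that keeps the clock time as displayed, assuming `time - new` is the POSIXct/dttm column shown above, is simply to format it in UTC:
# format in UTC so the clock time is not shifted to the local time zone
x$new_time3 <- format(x$`time - new`, "%H:%M:%S", tz = "UTC")
This should match the ymd_hms() approach above, since lubridate parses to UTC by default.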

pandas groupby and filling in missing frequencies

I have a dataset of events, each of which occurred on a specific day. Using pandas I have been able to aggregate these into a count of events per month using the groupby function, and then plot a graph with Matplotlib. However, in the original dataset some months do not have any events, so there is no count for those months. Such months therefore do not appear on the graph, but I would like to include them somehow with their zero count.
bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count()
which produces
month_year month
2016-01 January 9
2016-02 February 7
2016-04 April 1
2016-06 June 4
2016-07 July 1
2016-08 August 3
2016-09 September 2
2016-10 October 5
2016-11 November 17
2016-12 December 3
I have been trying to find a way of filling the missing months in the dataframe generated by the groupby function with a 'count' value of 0 - in this example, March and May.
Can anyone offer some advice on how this might be achieved? I have been trying to ffill the month column, but with little success, and I can't work out how to add a corresponding zero value for the missing months.
First of all, if bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count() is your code, then it is a series. So, let's change it to a dataframe with bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count().reset_index(). Now, into the problem.
Convert month_year to datetime, use pd.Grouper to fill in the missing months, then convert back to string format. Also add back the month column and cast the event_no column back to integer:
bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count().reset_index()
bpm2['month_year'] = bpm2['month_year'].astype(str)
bpm2['month_year'] = pd.to_datetime(bpm2['month_year'])
bpm2 = bpm2.groupby([pd.Grouper(key='month_year', freq='1M')])['event_no'].first().fillna(0).astype(int).reset_index()
bpm2['month'] = bpm2['month_year'].dt.strftime('%B')
bpm2['month_year'] = bpm2['month_year'].dt.strftime('%Y-%m')
bpm2
output:
month_year event_no month
0 2016-01 9 January
1 2016-02 7 February
2 2016-03 0 March
3 2016-04 1 April
4 2016-05 0 May
5 2016-06 4 June
6 2016-07 1 July
7 2016-08 3 August
8 2016-09 2 September
9 2016-10 5 October
10 2016-11 17 November
11 2016-12 3 December
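An alternative sketch that avoids the second groupby: convert the counts to a monthly PeriodIndex and reindex over the full month range, filling the gaps with 0 (this assumes month_year holds strings like '2016-01'):
import pandas as pd

counts = df2_yr1.groupby('month_year')['event_no'].count()
counts.index = pd.PeriodIndex(counts.index, freq='M')
full = pd.period_range(counts.index.min(), counts.index.max(), freq='M')   # every month in range
counts = counts.reindex(full, fill_value=0).rename_axis('month_year').reset_index()
counts['month'] = counts['month_year'].dt.strftime('%B')                   # recreate the month-name column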

Build the difference between values in the same column

Let's say I have the following data table, which has one column containing the first of each month from 2000 until 2005 and a second column containing some values, which can be positive or negative.
What I want to do is build the difference between two observations from the same month but from different years.
So for example:
I want to calculate the difference between 2001-01-01 and 2000-01-01 and write the value in a new column in the same row as my 2001-01-01 date.
I want to do this for all my observations, and for the ones that do not have a value in the previous year to compare to, just return NA.
Thank you for your time and help :)
If there are no gaps in your data, you could use the lag function:
library(dplyr)
df <- data.frame(Date = as.Date(sapply(2000:2005, function(x) paste(x, 1:12, 1, sep = "-"))),
                 Value = runif(72, 0, 1))
df$Difference <- df$Value - lag(df$Value, 12)
> df[1:24,]
Date Value Difference
1 2000-01-01 0.83038968 NA
2 2000-02-01 0.85557483 NA
3 2000-03-01 0.41463862 NA
4 2000-04-01 0.16500688 NA
5 2000-05-01 0.89260904 NA
6 2000-06-01 0.21735933 NA
7 2000-07-01 0.96691686 NA
8 2000-08-01 0.99877057 NA
9 2000-09-01 0.96518311 NA
10 2000-10-01 0.68122410 NA
11 2000-11-01 0.85688662 NA
12 2000-12-01 0.97282720 NA
13 2001-01-01 0.83614146 0.005751778
14 2001-02-01 0.07967273 -0.775902097
15 2001-03-01 0.44373647 0.029097852
16 2001-04-01 0.35088593 0.185879052
17 2001-05-01 0.46240321 -0.430205836
18 2001-06-01 0.73177425 0.514414912
19 2001-07-01 0.52017554 -0.446741315
20 2001-08-01 0.52986486 -0.468905713
21 2001-09-01 0.14921003 -0.815973080
22 2001-10-01 0.25427134 -0.426952761
23 2001-11-01 0.36032777 -0.496558857
24 2001-12-01 0.20862578 -0.764201423
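If the series can have missing months, a fixed lag of 12 will pair up the wrong rows; a safer sketch is a self-join on the date shifted by one year (using dplyr and lubridate, with the same df as above):
library(dplyr)
library(lubridate)

prev_year <- df %>%
  transmute(Date = Date + years(1),                 # shift each observation forward one year
            PrevValue = Value)

df <- df %>%
  left_join(prev_year, by = "Date") %>%             # match the same month in the previous year
  mutate(Difference = Value - PrevValue)            # NA where no previous-year value exists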
I think you should try the lubridate package; it is very useful for working with dates.
https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html

Normalize time variable for recurrent LSTM Neural Network using Keras

I am using Keras to create an LSTM neural network that can predict the concentration of a certain drug in the blood. I have a dataset with time stamps at which a drug dose was administered and at which the concentration in the blood was measured. These dosage and measurement time stamps are disjoint. Furthermore, several other variables are measured at all time steps (both dosages and measurements). These variables are the input for my model, along with the dosages (0 when no dosage was given at time t). The observed concentration in the blood is the response variable.
I have normalized all input features using the MinMaxScaler().
Q1:
Now I am wondering, do I need to normalize the time variable that corresponds with all rows as well and give it as input to the model? Or can I leave this variable out since the time steps are equally spaced?
The data looks like:
PatientID Time Dosage DosageRate DrugConcentration
1 0 100 12 NA
1 309 100 12 NA
1 650 100 12 NA
1 1030 100 12 NA
1 1320 NA NA 12
1 1405 100 12 NA
1 1812 90 8 NA
1 2078 90 8 NA
1 2400 NA NA 8
2 0 120 13.5 NA
2 800 120 13.5 NA
2 920 NA NA 16
2 1515 120 13.5 NA
2 1832 120 13.5 NA
2 2378 120 13.5 NA
2 2600 120 13.5 NA
2 3000 120 13.5 NA
2 4400 NA NA 2
As you can see, the time between two consecutive dosages and measurements differs for a patient and between patients, which makes the problem difficult.
Q2:
One approach I can think of is aggregating on measurement intervals and taking the average dosage and SD between two measurements. Then we only predict on time stamps for which we know the observed drug concentration. Would this work, or would we lose too much information?
Q3:
A second approach I could think of is to create new data points so that all intervals between dosages are the same, and to set the dosage and dosage rate at those time points to zero. The disadvantage is that we can then only calculate the error on the time stamps at which we know the observed drug concentration. How should we tackle this?
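As a starting point for the resampling idea in Q3, here is a rough pandas sketch; the 100-unit grid step and snapping events to the nearest grid point are assumptions, and the column names are taken from the example above. It puts each patient on a regular time grid, fills Dosage and DosageRate with 0 where nothing was given, and leaves DrugConcentration as NaN except at measurement times.
import numpy as np
import pandas as pd

step = 100  # assumed grid spacing, in the same units as the Time column

def regularize(g):
    # snap events to the nearest grid point, then reindex onto a full regular grid
    g = g.drop(columns='PatientID').copy()
    g['Time'] = (g['Time'] / step).round().astype(int) * step
    g = g.groupby('Time').last()                              # one row per grid point
    g = g.reindex(np.arange(0, g.index.max() + step, step))
    g.index.name = 'Time'
    g[['Dosage', 'DosageRate']] = g[['Dosage', 'DosageRate']].fillna(0)
    return g                                                  # DrugConcentration stays NaN between measurements

regular = df.groupby('PatientID').apply(regularize).reset_index()
For the error question, one option is to compute the loss only at measurement steps, e.g. by passing per-timestep sample weights of 0/1 to Keras so that the non-measurement steps contribute nothing to the loss.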