Pandas: Days since last event per id - pandas

I want to build a column for my dataframe df['days_since_last'] that shows the days since the last match for each player_id for each event_id and nan if the row is the first match for the player in the dataset.
Example of my data:
event_id player_id match_date
0 1470993 227485 2015-11-29
1 1492031 227485 2016-07-23
2 1489240 227485 2016-06-19
3 1495581 227485 2016-09-02
4 1490222 227485 2016-07-03
5 1469624 227485 2015-11-14
6 1493822 227485 2016-08-13
7 1428946 313444 2014-08-10
8 1483245 313444 2016-05-21
9 1472260 313444 2015-12-13
I tried the code in Find days since last event pandas dataframe but got nonsensical results.

It seems you need sort first:
df['days_since_last_event'] = (df.sort_values(['player_id','match_date'])
.groupby('player_id')['match_date'].diff()
.dt.days)
print (df)
event_id player_id match_date days_since_last_event
0 1470993 227485 2015-11-29 15.0
1 1492031 227485 2016-07-23 20.0
2 1489240 227485 2016-06-19 203.0
3 1495581 227485 2016-09-02 20.0
4 1490222 227485 2016-07-03 14.0
5 1469624 227485 2015-11-14 NaN
6 1493822 227485 2016-08-13 21.0
7 1428946 313444 2014-08-10 NaN
8 1483245 313444 2016-05-21 160.0
9 1472260 313444 2015-12-13 490.0

Demo:
In [174]: df['days_since_last'] = (df.groupby('player_id')['match_date']
.transform(lambda x: (x.max()-x).dt.days))
In [175]: df
Out[175]:
event_id player_id match_date days_since_last
0 1470993 227485 2015-11-29 278
1 1492031 227485 2016-07-23 41
2 1489240 227485 2016-06-19 75
3 1495581 227485 2016-09-02 0
4 1490222 227485 2016-07-03 61
5 1469624 227485 2015-11-14 293
6 1493822 227485 2016-08-13 20
7 1428946 313444 2014-08-10 650
8 1483245 313444 2016-05-21 0
9 1472260 313444 2015-12-13 160

Related

How to count the number of campaigns per day based on the start and end dates of the campaigns in SQL

I need to count the number of campaigns per day based on the start and end dates of the campaigns
Input Table:
Campaign name
Start date
End date
Campaign A
2022-07-10
2022-09-25
Campaign B
2022-08-06
2022-10-07
Campaign C
2022-07-30
2022-09-10
Campaign D
2022-08-26
2022-10-24
Campaign E
2022-07-17
2022-09-29
Campaign F
2022-08-24
2022-09-12
Campaign G
2022-08-11
2022-10-24
Campaign H
2022-08-26
2022-11-22
Campaign I
2022-08-29
2022-09-25
Campaign J
2022-08-21
2022-11-15
Campaign K
2022-07-20
2022-09-18
Campaign L
2022-07-31
2022-11-20
Campaign M
2022-08-17
2022-10-10
Campaign N
2022-07-27
2022-09-07
Campaign O
2022-07-29
2022-09-26
Campaign P
2022-07-06
2022-09-15
Campaign Q
2022-07-16
2022-09-22
Out needed (result):
Date
Count unique campaigns
2022-07-02
17
2022-07-03
47
2022-07-04
5
2022-07-05
5
2022-07-06
25
2022-07-07
27
2022-07-08
17
2022-07-09
58
2022-07-10
23
2022-07-11
53
2022-07-12
18
2022-07-13
29
2022-07-14
52
2022-07-15
7
2022-07-16
17
2022-07-17
37
2022-07-18
33
How do I need to write the SQL command to get the above result? thanks all
In the following solutions we leverage string_split with combination with replicate to generate new records.
select dt as date
,count(*) as Count_unique_campaigns
from
(
select *
,dateadd(day, row_number() over(partition by Campaign_name order by (select null))-1, Start_date) as dt
from (
select *
from t
outer apply string_split(replicate(',',datediff(day, Start_date, End_date)),',')
) t
) t
group by dt
order by dt
date
Count_unique_campaigns
2022-07-06
1
2022-07-07
1
2022-07-08
1
2022-07-09
1
2022-07-10
2
2022-07-11
2
2022-07-12
2
2022-07-13
2
2022-07-14
2
2022-07-15
2
2022-07-16
3
2022-07-17
4
2022-07-18
4
2022-07-19
4
2022-07-20
5
2022-07-21
5
2022-07-22
5
2022-07-23
5
2022-07-24
5
2022-07-25
5
2022-07-26
5
2022-07-27
6
2022-07-28
6
2022-07-29
7
2022-07-30
8
2022-07-31
9
2022-08-01
9
2022-08-02
9
2022-08-03
9
2022-08-04
9
2022-08-05
9
2022-08-06
10
2022-08-07
10
2022-08-08
10
2022-08-09
10
2022-08-10
10
2022-08-11
11
2022-08-12
11
2022-08-13
11
2022-08-14
11
2022-08-15
11
2022-08-16
11
2022-08-17
12
2022-08-18
12
2022-08-19
12
2022-08-20
12
2022-08-21
13
2022-08-22
13
2022-08-23
13
2022-08-24
14
2022-08-25
14
2022-08-26
16
2022-08-27
16
2022-08-28
16
2022-08-29
17
2022-08-30
17
2022-08-31
17
2022-09-01
17
2022-09-02
17
2022-09-03
17
2022-09-04
17
2022-09-05
17
2022-09-06
17
2022-09-07
17
2022-09-08
16
2022-09-09
16
2022-09-10
16
2022-09-11
15
2022-09-12
15
2022-09-13
14
2022-09-14
14
2022-09-15
14
2022-09-16
13
2022-09-17
13
2022-09-18
13
2022-09-19
12
2022-09-20
12
2022-09-21
12
2022-09-22
12
2022-09-23
11
2022-09-24
11
2022-09-25
11
2022-09-26
9
2022-09-27
8
2022-09-28
8
2022-09-29
8
2022-09-30
7
2022-10-01
7
2022-10-02
7
2022-10-03
7
2022-10-04
7
2022-10-05
7
2022-10-06
7
2022-10-07
7
2022-10-08
6
2022-10-09
6
2022-10-10
6
2022-10-11
5
2022-10-12
5
2022-10-13
5
2022-10-14
5
2022-10-15
5
2022-10-16
5
2022-10-17
5
2022-10-18
5
2022-10-19
5
2022-10-20
5
2022-10-21
5
2022-10-22
5
2022-10-23
5
2022-10-24
5
2022-10-25
3
2022-10-26
3
2022-10-27
3
2022-10-28
3
2022-10-29
3
2022-10-30
3
2022-10-31
3
2022-11-01
3
2022-11-02
3
2022-11-03
3
2022-11-04
3
2022-11-05
3
2022-11-06
3
2022-11-07
3
2022-11-08
3
2022-11-09
3
2022-11-10
3
2022-11-11
3
2022-11-12
3
2022-11-13
3
2022-11-14
3
2022-11-15
3
2022-11-16
2
2022-11-17
2
2022-11-18
2
2022-11-19
2
2022-11-20
2
2022-11-21
1
2022-11-22
1
For SQL in Azure and SQL Server 2022 we have a cleaner solution based on [ordinal][4].
"The enable_ordinal argument and ordinal output column are currently
supported in Azure SQL Database, Azure SQL Managed Instance, and Azure
Synapse Analytics (serverless SQL pool only). Beginning with SQL
Server 2022 (16.x) Preview, the argument and output column are
available in SQL Server."
select dt as date
,count(*) as Count_unique_campaigns
from
(
select *
,dateadd(day, ordinal-1, Start_date) as dt
from (
select *
from t
outer apply string_split(replicate(',',datediff(day, Start_date, End_date)),',', 1)
) t
) t
group by dt
order by dt
Fiddle
Your sample data doesn't seem to match your desired results, but I think what you're after is this:
DECLARE #Start date, #End date;
-- first, find the earliest and last date:
SELECT #Start = MIN([Start date]), #End = MAX([End date])
FROM dbo.Campaigns;
-- now use a recursive CTE to build a date range,
-- and count the number of campaigns that have a row
-- where the campaign was active on that date:
WITH d(d) AS
(
SELECT #Start
UNION ALL
SELECT DATEADD(DAY, 1, d) FROM d WHERE d < #End
)
SELECT
[Date] = d,
[Count unique campaigns] = COUNT(*)
FROM d
INNER JOIN dbo.Campaigns AS c
ON d.d >= c.[Start date] AND d.d <= c.[End date]
GROUP BY d.d OPTION (MAXRECURSION 32767);
Working example in this fiddle.

How to calculate monthly normals?

I have this df:
CODE TMAX TMIN PP
DATE
1991-01-01 000130 32.6 23.4 0.0
1991-01-02 000130 31.2 22.4 0.0
1991-01-03 000130 32.0 NaN 0.0
1991-01-04 000130 32.2 23.0 0.0
1991-01-05 000130 30.5 22.0 0.0
... ... ... ...
2020-12-27 158328 NaN NaN NaN
2020-12-28 158328 NaN NaN NaN
2020-12-29 158328 NaN NaN NaN
2020-12-30 158328 NaN NaN NaN
2020-12-31 158328 NaN NaN NaN
I have data of 30 years (1991-2020) for each CODE, and i want to calculate monthly normals of TMAX, TMIN and PP. So for TMAX and TMIN i should calculate the average for every month, so if January have 31 days i should get the mean of those 31 values and get a value for January 1991, January 1992, etc. So i will have 30 Januarys (January 1991, January 1992, ... ,January 2020), 30 Februarys, etc. After this i should calculate the average of every group of months (Januarys with Januarys, Februarys with Februarys, etc). So i will have 12 values (one value for every month). Example:
(January1991 + January1992 + ..... + January 2020) /30
(February1991 + February1992 + ..... + February 2020) /30
.... same for every group of months.
So i'm using this code but i don't know if it's ok.
from datetime import date
normalstemp=df[['CODE','TMAX','TMIN']].groupby([df.CODE, df.index.month]).mean().round(1)
For PP (precipitation) i should sum the values of every PP value of the month, so if January have 31 days i should sum all of their values and get a value for January 1991, January 1992, etc. So i will have 30 Januarys (January 1991, January 1992, ... ,January 2020) , 30 Februarys (February 1991, February 1992, ... ,February 2020), etc. After this i should calculate the average of every group of months (Januarys with Januarys, Februarys with Februarys, etc). So i will have 12 values (one value for every month, the same as TMAX and TMIN).
Example:
(January1991 + January1992 + ..... + January 2020) /30
(February1991 + February1992 + ..... + February 2020) /30
.... same for every group of months.
So im using this code but i know this code isn't correct because i'm not getting the mean of the januarys, februarys, etc.
normalspp=df[['CODE','PP']].groupby([df.CODE, df.index.month]).sum().round(1)
I only have basic knowledge of python so i will appreciate if you can help me.
Thanks in advance.
Ver 2: Average by Year-Month and by Month
import pandas as pd
import numpy as np
x = pd.date_range(start='1/1/1991', end='12/31/2020',freq='D')
df = pd.DataFrame({'Date':x.tolist()*2,
'Code':['000130']*10958 + ['158328']*10958,
'TMAX': np.random.randint(6,10, size=21916),
'TMIN': np.random.randint(1,5, size=21916)
})
# Create a Month column to get Average by Month for all years
df['Month'] = df.Date.dt.month
# Create a Year-Month column to get Average of each Month within the Year
df['Year_Mon'] = df.Date.dt.strftime('%Y-%m')
# Print the Average of each Month within each Year for each code
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean())
print (df.groupby(['Code','Year_Mon'])['TMIN'].mean())
# Print the Average of each Month irrespective of the year (for each code)
print (df.groupby(['Code','Month'])['TMAX'].mean())
print (df.groupby(['Code','Month'])['TMAX'].mean())
If you want to give a name for the TMAX Average value, you can add the reset_index and rename column. Here's code to do that.
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean().reset_index().rename(columns={'TMAX':'TMAX_Avg'}))
The output of this will be:
Average of TMAX for each Year-Month for each Code
Code Year_Mon
000130 1991-01 7.225806
1991-02 7.678571
1991-03 7.354839
1991-04 7.500000
1991-05 7.516129
...
158328 2020-08 7.387097
2020-09 7.300000
2020-10 7.516129
2020-11 7.500000
2020-12 7.451613
Name: TMAX, Length: 720, dtype: float64
Average of TMIN for each Year-Month for each Code
Code Year_Mon
000130 1991-01 2.419355
1991-02 2.571429
1991-03 2.193548
1991-04 2.366667
1991-05 2.451613
...
158328 2020-08 2.451613
2020-09 2.566667
2020-10 2.612903
2020-11 2.666667
2020-12 2.580645
Name: TMIN, Length: 720, dtype: float64
Average of TMAX for each Month for each Code (all years combined)
Code Month
000130 1 7.540860
2 7.536557
3 7.482796
4 7.486667
5 7.444086
6 7.570000
7 7.507527
8 7.529032
9 7.501111
10 7.401075
11 7.482222
12 7.517204
158328 1 7.532258
2 7.563679
3 7.490323
4 7.555556
5 7.500000
6 7.497778
7 7.545161
8 7.483871
9 7.526667
10 7.529032
11 7.547778
12 7.524731
Name: TMAX, dtype: float64
Average of TMIN for each Month for each Code (all years combined)
Code Month
000130 1 7.540860
2 7.536557
3 7.482796
4 7.486667
5 7.444086
6 7.570000
7 7.507527
8 7.529032
9 7.501111
10 7.401075
11 7.482222
12 7.517204
158328 1 7.532258
2 7.563679
3 7.490323
4 7.555556
5 7.500000
6 7.497778
7 7.545161
8 7.483871
9 7.526667
10 7.529032
11 7.547778
12 7.524731
Name: TMAX, dtype: float64
Ver 1: Average by Year and Month for each Code
Here is one way to do this.
You can create two columns - Year and Month. Then get the average of TMAX, TMIN, and PP for each month within the year by doing a groupby ('Code','Year_Mon')
See code for more details.
import pandas as pd
import numpy as np
# create a range of dates from 1/1/2018 thru 12/31/2020 for each day
x = pd.date_range(start='1/1/2018', end='12/31/2020',freq='D')
# create a dataframe with the date ranges x 2 for two codes
# TMIN is a random value from 1 thru 5 - you can put your actual data here
# TMAX is a random value from 6 thru 10 - you can put your actual data here
df = pd.DataFrame({'Date':x.tolist()*2,
'Code':['000130']*1096 + ['158328']*1096,
'TMAX': np.random.randint(6,10, size=2192),
'TMIN': np.random.randint(1,5, size=2192)
})
# Create a Year-Month column using df.Date.dt.strftime
df['Year_Mon'] = df.Date.dt.strftime('%Y-%m')
# Calculate the Average of TMAX and TMIN using groupby Code and Year_Mon
df['TMAX_Avg'] = df.groupby(['Code','Year_Mon'])['TMAX'].transform('mean')
df['TMIN_Avg'] = df.groupby(['Code','Year_Mon'])['TMIN'].transform('mean')
The output of this will be:
Date Code TMAX TMIN Year_Mon TMAX_Avg TMIN_Avg
0 2018-01-01 000130 8 2 2018-01 7.451613 2.129032
1 2018-01-02 000130 7 4 2018-01 7.451613 2.129032
2 2018-01-03 000130 9 2 2018-01 7.451613 2.129032
3 2018-01-04 000130 6 1 2018-01 7.451613 2.129032
4 2018-01-05 000130 9 4 2018-01 7.451613 2.129032
5 2018-01-06 000130 6 1 2018-01 7.451613 2.129032
6 2018-01-07 000130 9 2 2018-01 7.451613 2.129032
7 2018-01-08 000130 9 2 2018-01 7.451613 2.129032
8 2018-01-09 000130 7 2 2018-01 7.451613 2.129032
9 2018-01-10 000130 8 2 2018-01 7.451613 2.129032
10 2018-01-11 000130 8 3 2018-01 7.451613 2.129032
11 2018-01-12 000130 7 2 2018-01 7.451613 2.129032
12 2018-01-13 000130 7 1 2018-01 7.451613 2.129032
13 2018-01-14 000130 8 1 2018-01 7.451613 2.129032
14 2018-01-15 000130 7 3 2018-01 7.451613 2.129032
15 2018-01-16 000130 6 1 2018-01 7.451613 2.129032
16 2018-01-17 000130 6 3 2018-01 7.451613 2.129032
17 2018-01-18 000130 9 3 2018-01 7.451613 2.129032
18 2018-01-19 000130 7 2 2018-01 7.451613 2.129032
19 2018-01-20 000130 8 1 2018-01 7.451613 2.129032
20 2018-01-21 000130 9 4 2018-01 7.451613 2.129032
21 2018-01-22 000130 6 2 2018-01 7.451613 2.129032
22 2018-01-23 000130 9 4 2018-01 7.451613 2.129032
23 2018-01-24 000130 6 2 2018-01 7.451613 2.129032
24 2018-01-25 000130 8 3 2018-01 7.451613 2.129032
25 2018-01-26 000130 6 2 2018-01 7.451613 2.129032
26 2018-01-27 000130 8 1 2018-01 7.451613 2.129032
27 2018-01-28 000130 8 3 2018-01 7.451613 2.129032
28 2018-01-29 000130 6 1 2018-01 7.451613 2.129032
29 2018-01-30 000130 6 1 2018-01 7.451613 2.129032
30 2018-01-31 000130 8 1 2018-01 7.451613 2.129032
31 2018-02-01 000130 7 1 2018-02 7.250000 2.428571
32 2018-02-02 000130 6 2 2018-02 7.250000 2.428571
33 2018-02-03 000130 6 4 2018-02 7.250000 2.428571
34 2018-02-04 000130 8 3 2018-02 7.250000 2.428571
35 2018-02-05 000130 8 2 2018-02 7.250000 2.428571
36 2018-02-06 000130 6 3 2018-02 7.250000 2.428571
37 2018-02-07 000130 6 3 2018-02 7.250000 2.428571
38 2018-02-08 000130 7 1 2018-02 7.250000 2.428571
39 2018-02-09 000130 9 4 2018-02 7.250000 2.428571
40 2018-02-10 000130 8 2 2018-02 7.250000 2.428571
41 2018-02-11 000130 7 4 2018-02 7.250000 2.428571
42 2018-02-12 000130 8 1 2018-02 7.250000 2.428571
43 2018-02-13 000130 6 4 2018-02 7.250000 2.428571
44 2018-02-14 000130 6 1 2018-02 7.250000 2.428571
45 2018-02-15 000130 6 4 2018-02 7.250000 2.428571
46 2018-02-16 000130 8 2 2018-02 7.250000 2.428571
47 2018-02-17 000130 7 3 2018-02 7.250000 2.428571
48 2018-02-18 000130 9 3 2018-02 7.250000 2.428571
49 2018-02-19 000130 8 2 2018-02 7.250000 2.428571
If you want only the Code, Year-Month, and TMIN and TMAX values, you can do:
TMAX average for each month within the year:
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean())
Output will be:
Code Year_Mon
000130 2018-01 7.451613
2018-02 7.250000
2018-03 7.774194
2018-04 7.366667
2018-05 7.451613
...
158328 2020-08 7.935484
2020-09 7.666667
2020-10 7.548387
2020-11 7.333333
2020-12 7.580645
TMIN average for each month within the year:
print (df.groupby(['Code','Year_Mon'])['TMIN'].mean())
Output will be:
Code Year_Mon
000130 2018-01 2.129032
2018-02 2.428571
2018-03 2.451613
2018-04 2.500000
2018-05 2.677419
...
158328 2020-08 2.709677
2020-09 2.166667
2020-10 2.161290
2020-11 2.366667
2020-12 2.548387

what is wrong with pandas date_time?

Following is the code for an example of using pandas datetime module. As shown in the output, it is not consitent, It is mixing date and month. Am i doing something wrong?
​
dates = ['20/11/17', '12/02/18', '02/05/18', '10/09/18',
'22/06/17', '12/02/15','19/11/17', '04/09/16',
'12/05/18', '11/04/15', '10/04/17', '13/06/16']
data = pd.DataFrame(data=dates, columns=['date'])
data['date_format'] = pd.to_datetime(dates)
data
Output:
date date_format
0 20/11/17 2017-11-20
1 12/02/18 2018-12-02
2 02/05/18 2018-02-05
3 10/09/18 2018-10-09
4 22/06/17 2017-06-22
5 12/02/15 2015-12-02
6 19/11/17 2017-11-19
7 04/09/16 2016-04-09
8 12/05/18 2018-12-05
9 11/04/15 2015-11-04
10 10/04/17 2017-10-04
11 13/06/16 2016-06-13

Grouping into series based on days since

I need to create a new grouping every time I have a period of more than 60 days since my previous record.
Basically, I need too take the data I have here:
RowNo StartDate StopDate DaysBetween
1 3/21/2017 3/21/2017 14
2 4/4/2017 4/4/2017 14
3 4/18/2017 4/18/2017 14
4 6/23/2017 6/23/2017 66
5 7/5/2017 7/5/2017 12
6 7/19/2017 7/19/2017 14
7 9/27/2017 9/27/2017 70
8 10/24/2017 10/24/2017 27
9 10/31/2017 10/31/2017 7
10 11/14/2017 11/14/2017 14
And turn it into this:
RowNo StartDate StopDate DaysBetween Series
1 3/21/2017 3/21/2017 14 1
2 4/4/2017 4/4/2017 14 1
3 4/18/2017 4/18/2017 14 1
4 6/23/2017 6/23/2017 66 2
5 7/5/2017 7/5/2017 12 2
6 7/19/2017 7/19/2017 14 2
7 9/27/2017 9/27/2017 70 3
8 10/24/2017 10/24/2017 27 3
9 10/31/2017 10/31/2017 7 3
10 11/14/2017 11/14/2017 14 3
Once I have that I'll group by Series and get the min(StartDate) and max(StopDate) for individual durations.
I could do this using a cursor but I'm sure someone much smarter than me has figured out a more elegant solution. Thanks in advance!
You can use the window function sum() over with a conditional FLAG
Example
Select *
,Series= 1+sum(case when [DaysBetween]>60 then 1 else 0 end) over (Order by RowNo)
From YourTable
Returns
RowNo StartDate StopDate DaysBetween Series
1 2017-03-21 2017-03-21 14 1
2 2017-04-04 2017-04-04 14 1
3 2017-04-18 2017-04-18 14 1
4 2017-06-23 2017-06-23 66 2
5 2017-07-05 2017-07-05 12 2
6 2017-07-19 2017-07-19 14 2
7 2017-09-27 2017-09-27 70 3
8 2017-10-24 2017-10-24 27 3
9 2017-10-31 2017-10-31 7 3
10 2017-11-14 2017-11-14 14 3
EDIT - 2008 Version
Select A.*
,B.*
From YourTable A
Cross Apply (
Select Series=1+sum( case when [DaysBetween]>60 then 1 else 0 end)
From YourTable
Where RowNo <= A.RowNo
) B

pandas group By select columns

I work with Cloudera VM 5.2.0 pandas 0.18.0.
I have the following data
adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv',
parse_dates=['timestamp'],
skipinitialspace=True).assign(adCount=1)
adclicksDF.head(n=5)
Out[65]:
timestamp txId userSessionId teamId userId adId adCategory \
0 2016-05-26 15:13:22 5974 5809 27 611 2 electronics
1 2016-05-26 15:17:24 5976 5705 18 1874 21 movies
2 2016-05-26 15:22:52 5978 5791 53 2139 25 computers
3 2016-05-26 15:22:57 5973 5756 63 212 10 fashion
4 2016-05-26 15:22:58 5980 5920 9 1027 20 clothing
adCount
0 1
1 1
2 1
3 1
4 1
I want to do a group by for the field timestamp
adCategoryclicks = adclicksDF[['timestamp','adId','adCategory','userId','adCount']]
agrupadoDF = adCategoryclicks.groupby(pd.Grouper(key='timestamp', freq='1H'))['adCount'].agg(['count','sum'])
agrupadoDF.head(n=5)
Out[68]:
count sum
timestamp
2016-05-26 15:00:00 14 14
2016-05-26 16:00:00 24 24
2016-05-26 17:00:00 13 13
2016-05-26 18:00:00 16 16
2016-05-26 19:00:00 16 16
I want to add to agrupado more columns adCategory, idUser .
How can I do this?
There is multiple values in userId and adCategory for each group, so aggreagate by join:
In this sample last two datetime are changed for better output
print (adclicksDF)
timestamp txId userSessionId teamId userId adId adCategory \
0 2016-05-26 15:13:22 5974 5809 27 611 2 electronics
1 2016-05-26 15:17:24 5976 5705 18 1874 21 movies
2 2016-05-26 15:22:52 5978 5791 53 2139 25 computers
3 2016-05-26 16:22:57 5973 5756 63 212 10 fashion
4 2016-05-26 16:22:58 5980 5920 9 1027 20 clothing
adCount
0 1
1 1
2 1
3 1
4 1
#cast int to string
adclicksDF['userId'] = adclicksDF['userId'].astype(str)
adCategoryclicks = adclicksDF[['timestamp','adId','adCategory','userId','adCount']]
agrupadoDF = adCategoryclicks.groupby(pd.Grouper(key='timestamp', freq='1H'))
.agg({'adCount': ['count','sum'],
'userId': ', '.join,
'adCategory': ', '.join})
agrupadoDF.columns = ['adCategory','count','sum','userId']
print (agrupadoDF)
adCategory count sum \
timestamp
2016-05-26 15:00:00 electronics, movies, computers 3 3
2016-05-26 16:00:00 fashion, clothing 2 2
userId
timestamp
2016-05-26 15:00:00 611, 1874, 2139
2016-05-26 16:00:00 212, 1027