Group by different columns - pandas

I have a dataframe df as follows:
StationID DateTime Channel Count
0 1 2017-10-01 00:00:00 1 1
1 1 2017-10-01 00:00:00 1 201
2 1 2017-10-01 00:00:00 1 8
3 1 2017-10-01 00:00:00 1 2
4 1 2017-10-01 00:00:00 1 0
5 1 2017-10-01 00:00:00 1 0
6 1 2017-10-01 00:00:00 1 0
7 1 2017-10-01 00:00:00 1 0
.......... and so on
I want to group the Count values by each hour, for each Channel and StationID.
Required output:
StationID DateTime Channel Count
1 2017-10-01 00:00:00 1 232
1 2017-10-01 01:00:00 1 23
2 2017-10-01 00:00:00 1 244...
...... and so on

I think you need groupby with aggregate sum; to floor the datetimes to hours use Series.dt.floor - it sets minutes and seconds to 0:
print (df)
StationID DateTime Channel Count
0 1 2017-12-01 00:00:00 1 1
1 1 2017-12-01 00:00:00 1 201
2 1 2017-12-01 00:10:00 1 8
3 1 2017-12-01 10:00:00 1 2
4 1 2017-10-01 10:50:00 1 0
5 1 2017-10-01 10:20:00 1 5
6 1 2017-10-01 08:10:00 1 4
7 1 2017-10-01 08:00:00 1 1
df['DateTime'] = pd.to_datetime(df['DateTime'])
df1 = (df.groupby(['StationID', df['DateTime'].dt.floor('H'), 'Channel'])['Count']
         .sum()
         .reset_index())
print (df1)
StationID DateTime Channel Count
0 1 2017-10-01 08:00:00 1 5
1 1 2017-10-01 10:00:00 1 5
2 1 2017-12-01 00:00:00 1 210
3 1 2017-12-01 10:00:00 1 2
print (df['DateTime'].dt.floor('H'))
0 2017-12-01 00:00:00
1 2017-12-01 00:00:00
2 2017-12-01 00:00:00
3 2017-12-01 10:00:00
4 2017-10-01 10:00:00
5 2017-10-01 10:00:00
6 2017-10-01 08:00:00
7 2017-10-01 08:00:00
Name: DateTime, dtype: datetime64[ns]
But if the dates are not important and only the hours matter, use Series.dt.hour:
df2 = (df.groupby(['StationID', df['DateTime'].dt.hour, 'Channel'])['Count']
         .sum()
         .reset_index())
print (df2)
StationID DateTime Channel Count
0 1 0 1 210
1 1 8 1 5
2 1 10 1 7

Or you can use Grouper:
df.groupby([pd.Grouper(key='DateTime', freq='H'), 'Channel', 'StationID'])['Count'].sum()
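For a runnable sketch of the Grouper variant (the sample rows here are made up, in the question's column layout):
import pandas as pd

df = pd.DataFrame({
    'StationID': [1, 1, 1, 2],
    'DateTime': pd.to_datetime(['2017-10-01 00:00:00', '2017-10-01 00:10:00',
                                '2017-10-01 01:00:00', '2017-10-01 00:20:00']),
    'Channel': [1, 1, 1, 1],
    'Count': [1, 201, 8, 244],
})

# pd.Grouper(freq='H') buckets the timestamps by hour,
# matching the dt.floor('H') result here
out = (df.groupby([pd.Grouper(key='DateTime', freq='H'), 'Channel', 'StationID'])['Count']
         .sum()
         .reset_index())
print(out)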

Related

7 days hourly mean with pandas

I need some help calculating a 7-day mean for every hour.
The timeseries has an hourly resolution and I need the 7-day mean for each hour, e.g. for 13 o'clock:
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried it with pandas and a rolling mean, but a plain rolling window includes all hours of the last 7 days, not just the same hour.
Thanks for any hints!
Add a new hour column, group by it, and then take a rolling mean of 7 within each group, so the average is calculated over 7 days for the same hour. This is consistent with the intent of the question:
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429
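For reference, a self-contained sketch of the same approach; the hourly index and random values are illustrative assumptions, not the asker's data:
import numpy as np
import pandas as pd

# hypothetical hourly series covering ten days
idx = pd.date_range('2020-07-01', periods=10 * 24, freq='H')
df = pd.DataFrame({'x': np.random.randint(0, 100, len(idx))}, index=idx)

df['hour'] = df.index.hour
# within each hour-of-day group, consecutive rows are one day apart,
# so a rolling window of 7 rows is a 7-day mean for that hour
out = df.groupby('hour')['x'].rolling(7).mean().reset_index()
print(out.head(10))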

Pandas group by datetime within column level

I have a dataframe created by:
df = pd.DataFrame({})
df['Date'] = pd.to_datetime(np.arange(0,12), unit='h', origin='2018-08-01 06:00:00')
df['ship'] = [1,1,2,2,2,3,3,3,3,3,3,3] # ship ID number
dt_trip = 4 # maximum duration of each trip to be classified as the same trip
Date ship
0 2018-08-01 06:00:00 1
1 2018-08-01 07:00:00 1
2 2018-08-01 08:00:00 2
3 2018-08-01 09:00:00 2
4 2018-08-01 10:00:00 2
5 2018-08-01 11:00:00 3
6 2018-08-01 12:00:00 3
7 2018-08-01 13:00:00 3
8 2018-08-01 14:00:00 3
9 2018-08-01 15:00:00 3
10 2018-08-01 16:00:00 3
11 2018-08-01 17:00:00 3
I am trying to get a new column which shows the trips of each ship. Each trip is defined by an interval of 4 hours with respect to the start of the trip. When a new ship number appears on the next row, a new trip should start automatically (irrespective of the previous datetime). From a previous post I got a solution for the trips.
origin = df["Date"][0].hour
df["Trip"] = df.apply(lambda x: ((x["Date"].hour - origin) // dt_trip) + 1, axis=1)
df["Trip"] = df.groupby(['Trip','ship']).ngroup() +1 # trip starts at: 1
This solution starts a new trip when the ship column changes. The only change I want is for the origin to be the datetime at which a new trip starts. So index 4 should have Trip = 2, because the ship is the same and the time difference from the start of that ship's trip (index 2) is under 4 hours. As it stands, the code measures everything from the first given datetime.
Desired solution looks like:
Date ship Trip Trip_desired
0 2018-08-01 06:00:00 1 1 1
1 2018-08-01 07:00:00 1 1 1
2 2018-08-01 08:00:00 2 2 2
3 2018-08-01 09:00:00 2 2 2
4 2018-08-01 10:00:00 2 3 2
5 2018-08-01 11:00:00 3 4 3
6 2018-08-01 12:00:00 3 4 3
7 2018-08-01 13:00:00 3 4 3
8 2018-08-01 14:00:00 3 5 3
9 2018-08-01 15:00:00 3 5 4
10 2018-08-01 16:00:00 3 5 4
11 2018-08-01 17:00:00 3 5 4
I would do:
total_time = df['Date'] - df.groupby('ship')['Date'].transform('min')  # elapsed time since the ship's first record
trips = total_time.dt.total_seconds().fillna(0) // (dt_trip * 3600)    # 4-hour bin number within each ship
df['trip'] = df.groupby(['ship', trips]).ngroup() + 1                  # label each (ship, bin) group, starting at 1
Output:
Date ship trip
0 2018-08-01 06:00:00 1 1
1 2018-08-01 07:00:00 1 1
2 2018-08-01 08:00:00 2 2
3 2018-08-01 09:00:00 2 2
4 2018-08-01 10:00:00 2 2
5 2018-08-01 11:00:00 3 3
6 2018-08-01 12:00:00 3 3
7 2018-08-01 13:00:00 3 3
8 2018-08-01 14:00:00 3 3
9 2018-08-01 15:00:00 3 4
10 2018-08-01 16:00:00 3 4
11 2018-08-01 17:00:00 3 4
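A self-contained version of the same idea, reusing the exact setup from the question, for anyone who wants to verify against the Trip_desired column:
import numpy as np
import pandas as pd

df = pd.DataFrame({})
df['Date'] = pd.to_datetime(np.arange(0, 12), unit='h', origin='2018-08-01 06:00:00')
df['ship'] = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
dt_trip = 4  # maximum trip duration in hours

# bin each ship's timestamps into 4-hour windows measured from
# that ship's first record, then number the (ship, window) pairs
total_time = df['Date'] - df.groupby('ship')['Date'].transform('min')
trips = total_time.dt.total_seconds() // (dt_trip * 3600)
df['trip'] = df.groupby(['ship', trips]).ngroup() + 1
print(df)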

Creating nested dataframes with multiple dataframes

I have multiple dataframes, the following are only 2 of them:
print(df1)
Date A B C
2019-10-01 00:00:00 2 3 1
2019-10-01 01:00:00 5 1 6
2019-10-01 02:00:00 8 2 4
2019-10-01 03:00:00 3 6 5
print(df2)
Date A B C
2019-10-01 00:00:00 9 4 2
2019-10-01 01:00:00 3 2 4
2019-10-01 02:00:00 6 5 2
2019-10-01 03:00:00 3 6 5
All of them have the same index and columns. I want to create a dataframe like this:
Date df1 df2
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
I have to apply this process to 30 dataframes (their indexes and columns are the same), so I want to write a for loop in order to build this dataframe. How can I do that?
Reshape each DataFrame in the list of DataFrames with DataFrame.set_index and DataFrame.unstack, then concat them; last, change the column names with a lambda function:
dfs = [df1,df2]
df = (pd.concat([x.set_index('Date').unstack() for x in dfs], axis=1)
        .rename(columns=lambda x: f'df{x+1}'))
print (df)
df1 df2
Date
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
If you want custom column names in the final DataFrame, create a list of the same length as dfs and pass it to the keys parameter:
dfs = [df1,df2]
names = ['col1','col2']
df = pd.concat([x.set_index('Date').unstack() for x in dfs], keys=names, axis=1)
print (df)
col1 col2
Date
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
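To scale this to all 30 frames, collect them into a list first; a sketch, where df3 and the generated names are hypothetical placeholders:
import pandas as pd

dfs = [df1, df2, df3]  # extend with all 30 dataframes
names = [f'df{i+1}' for i in range(len(dfs))]
df = pd.concat([x.set_index('Date').unstack() for x in dfs], keys=names, axis=1)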

Splitting value dataframe over multiple timeslots

I would like to spread the values of the 15-minute intervals evenly over the 5-minute intervals, but cannot get it to work. The data is:
Datetime a
2018-01-01 00:00:00 6
2018-01-01 00:15:00 3
2018-01-01 00:30:00 9
Desired output would be:
Datetime a
2018-01-01 00:00:00 2
2018-01-01 00:05:00 2
2018-01-01 00:10:00 2
2018-01-01 00:15:00 1
2018-01-01 00:20:00 1
2018-01-01 00:25:00 1
2018-01-01 00:30:00 3
2018-01-01 00:35:00 3
2018-01-01 00:40:00 3
Perhaps unnecessary to note: the value 6 at 00:00:00 in the data is spread over the intervals 00:00:00-00:10:00.
Slightly different approach:
# convert to datetime
df.Datetime = pd.to_datetime(df.Datetime)
# set Datetime as index
df.set_index('Datetime', inplace=True)
# add one extra row
df.loc[df.index.max()+pd.to_timedelta('10min')] = 0
# upsample to a 5-minute frequency, filling the new slots with 0
s = df.asfreq('5T', fill_value=0)
# each nonzero value starts a group with the 0's that follow it;
# replacing every group with its mean spreads the value evenly
(s.groupby(s['a'].ne(0).cumsum())
   .transform('mean')
   .reset_index())
Output:
Datetime a
0 2018-01-01 00:00:00 2
1 2018-01-01 00:05:00 2
2 2018-01-01 00:10:00 2
3 2018-01-01 00:15:00 1
4 2018-01-01 00:20:00 1
5 2018-01-01 00:25:00 1
6 2018-01-01 00:30:00 3
7 2018-01-01 00:35:00 3
8 2018-01-01 00:40:00 3
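An alternative sketch, assuming every original row always covers exactly three 5-minute slots: forward-fill onto the 5-minute grid and divide by 3 (an addition for illustration, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'a': [6, 3, 9]},
                  index=pd.to_datetime(['2018-01-01 00:00:00',
                                        '2018-01-01 00:15:00',
                                        '2018-01-01 00:30:00']))

# a 5-minute grid extending two slots past the last timestamp,
# so the last 15-minute value also gets its three slots
grid = pd.date_range(df.index.min(), df.index.max() + pd.Timedelta('10min'), freq='5T')
out = df.reindex(grid, method='ffill') / 3
print(out)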

Wrong results with group by for distinct count

I have these two queries for calculating a distinct count from a table for a particular date range. In my first query I group by location, aRId (which is a rule) and date. In my second query I don't group by date.
I am expecting the same distinct count in both results, but I get a total count of 6147 in the first result and 6359 in the second. What is wrong here? The only difference is the group by.
select
r.loc
,cast(r.date as DATE) as dateCol
,count(distinct r.dC) as dC_count
from table r
where r.date between '01-01-2018' and '06-02-2018'
and r.loc = 1
group by r.loc, r.aRId, cast(r.date as DATE)
select
r.loc
,count(distinct r.dC) as dC_count
from table r
where r.date between '01-01-2018' and '06-02-2018'
and r.loc = 1
group by r.loc, r.aRId
loc dateCol dC_count
1 2018-01-22 1
1 2018-03-09 2
1 2018-01-28 3
1 2018-01-05 1
1 2018-05-28 143
1 2018-02-17 1
1 2018-05-08 187
1 2018-05-31 146
1 2018-01-02 3
1 2018-02-14 1
1 2018-05-11 273
1 2018-01-14 1
1 2018-03-18 2
1 2018-02-03 1
1 2018-05-20 200
1 2018-05-14 230
1 2018-01-11 5
1 2018-01-31 1
1 2018-05-17 209
1 2018-01-20 2
1 2018-03-01 1
1 2018-01-03 3
1 2018-05-06 253
1 2018-05-26 187
1 2018-03-24 1
1 2018-02-09 1
1 2018-03-04 1
1 2018-05-03 269
1 2018-05-23 187
1 2018-05-29 133
1 2018-03-21 1
1 2018-03-27 1
1 2018-05-15 202
1 2018-03-07 1
1 2018-06-01 155
1 2018-02-21 1
1 2018-01-26 2
1 2018-02-15 2
1 2018-05-12 331
1 2018-03-10 1
1 2018-01-09 3
1 2018-02-18 1
1 2018-03-13 2
1 2018-05-09 184
1 2018-01-12 2
1 2018-03-16 1
1 2018-05-18 198
1 2018-02-07 1
1 2018-02-01 1
1 2018-01-15 3
1 2018-02-24 4
1 2018-03-19 1
1 2018-05-21 161
1 2018-02-10 1
1 2018-05-04 250
1 2018-05-30 148
1 2018-05-24 153
1 2018-01-24 1
1 2018-05-10 199
1 2018-03-08 1
1 2018-01-21 1
1 2018-05-27 151
1 2018-01-04 3
1 2018-05-07 236
1 2018-03-25 1
1 2018-03-11 2
1 2018-01-10 1
1 2018-01-30 1
1 2018-03-14 1
1 2018-02-19 1
1 2018-05-16 192
1 2018-01-13 5
1 2018-01-07 1
1 2018-03-17 3
1 2018-01-27 2
1 2018-02-22 1
1 2018-05-13 200
1 2018-02-08 2
1 2018-01-16 2
1 2018-03-03 1
1 2018-05-02 217
1 2018-05-22 163
1 2018-03-20 1
1 2018-02-05 2
1 2018-02-11 1
1 2018-01-19 2
1 2018-02-28 1
1 2018-05-05 332
1 2018-05-25 211
1 2018-03-23 1
1 2018-05-19 219
loc dC_count
1 6359
From "COUNT (Transact-SQL)"
COUNT(DISTINCT expression) evaluates expression for each row in a group, and returns the number of unique, nonnull values.
The distinct is relative to the group, not to the whole table (or selected subset). I think this might be your misconception here.
To better understand what this means, take the following simplified example:
CREATE TABLE group_test
(a varchar(1),
b varchar(1),
c varchar(1));
INSERT INTO group_test
(a,
b,
c)
VALUES ('a',
'r',
'x'),
('a',
's',
'x'),
('b',
'r',
'x'),
('b',
's',
'y');
If we GROUP BY a and select count(DISTINCT c)
SELECT a,
count(DISTINCT c) #
FROM group_test
GROUP BY a;
we get
a | #
----|----
a | 1
b | 2
As there is only c='x' for a='a', there is only a distinct count of 1 for this group, but 2 for the other group as it has 'x' and 'y' in c. The sum of the counts is 3 here.
Now if we GROUP BY a, b
SELECT a,
b,
count(DISTINCT c) #
FROM group_test
GROUP BY a,
b;
we get
a | b | #
----|----|----
a | r | 1
a | s | 1
b | r | 1
b | s | 1
We get a count of 1 for every group here, as each value of c is the only one in its group. And all of a sudden the sum of the counts is 4.
And if we get the distinct count of c for the whole table
SELECT count(DISTINCT c) #
FROM group_test;
we get
#
----
2
which sums up to 2.
The sum of the counts is different in each case, but correct nonetheless.
The more groups there are, the higher the chance for a value to be unique within that group. So your results seem totally plausible.
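For readers coming from the pandas sections above, the same effect can be reproduced with groupby and nunique (an illustration, not part of the original answer):
import pandas as pd

t = pd.DataFrame({'a': ['a', 'a', 'b', 'b'],
                  'b': ['r', 's', 'r', 's'],
                  'c': ['x', 'x', 'x', 'y']})
print(t.groupby('a')['c'].nunique().sum())         # 3
print(t.groupby(['a', 'b'])['c'].nunique().sum())  # 4
print(t['c'].nunique())                            # 2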