Transform data by time & Class - pandas

I have a dataframe df as follows:
    DateTime             Class  Count
0   2017-10-01 00:00:00      1      0
1   2017-10-01 00:00:00      2    240
2   2017-10-01 00:00:00      3     17
3   2017-10-01 00:00:00      4      0
4   2017-10-01 00:00:00      5      1
5   2017-10-01 00:00:00      6      0
6   2017-10-01 00:00:00      7      0
7   2017-10-01 00:00:00      8      0
8   2017-10-01 00:00:00      9      0
9   2017-10-01 00:00:00     10      0
10  2017-10-01 00:00:00     11      0
11  2017-10-01 00:00:00     12      0
12  2017-10-01 00:00:00     13      0
13  2017-10-01 00:00:00     14      0
14  2017-10-01 00:00:00     15      0
..  ...                    ...    ...
30  2017-10-01 01:00:00      1      0
31  2017-10-01 01:00:00      2    209
32  2017-10-01 01:00:00      3     14
33  2017-10-01 01:00:00      4      0
34  2017-10-01 01:00:00      5      4
35  2017-10-01 01:00:00      6      0
36  2017-10-01 01:00:00      7      0
37  2017-10-01 01:00:00      8      0
38  2017-10-01 01:00:00      9      0
39  2017-10-01 01:00:00     10      0
40  2017-10-01 01:00:00     11      0
41  2017-10-01 01:00:00     12      0
42  2017-10-01 01:00:00     13      0
43  2017-10-01 01:00:00     14      0
44  2017-10-01 01:00:00     15      0
..  ...                    ...    ...
and so on.
There are 15 classes in total, with a count for each class in each hour.
I want to pivot the data so that each class becomes a column and each row holds the counts for one hour, as follows:
Desired output:
DateTime             Class1  Class2  Class3  Class4  ...  Class15
2017-10-01 00:00:00       0     240      17       0  ...        0
2017-10-01 01:00:00       0     209      14       0  ...        0
...
and so on.

You can use pandas to read the data into a pd.DataFrame, select the counts for each class by slicing the dataframe with conditions, and then combine the slices, using the DateTime as the common key:
import pandas as pd

# create dataframe from file
df = pd.read_csv('fname')
# or from a numpy array
df = pd.DataFrame(data=np_array, columns=['DateTime', 'Class', 'Count'])

# select the counts for each class
df_c1 = df[df.Class == 1]
df_c2 = df[df.Class == 2]
df_c3 = df[df.Class == 3]
df_c4 = df[df.Class == 4]

# build the wide frame; use .values so that the differing row indexes
# of the slices do not cause misaligned (NaN) assignments
df_new = pd.DataFrame()
df_new['DateTime'] = df_c1['DateTime'].values
df_new['Class1'] = df_c1['Count'].values
df_new['Class2'] = df_c2['Count'].values
df_new['Class3'] = df_c3['Count'].values
df_new['Class4'] = df_c4['Count'].values
The code example is really rough and I'm probably missing a lot, but maybe it gives you some inspiration. I would also recommend checking the pandas documentation for concat() and DataFrame().
I'm going to review and refactor my example code tomorrow, in case the problem is not solved by then. Meanwhile, you could fix the layout of the data in your question; it's hard to read.
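To cover all 15 classes without the copy-paste, the same idea can also be written as a loop. This is only a rough sketch of the approach above, assuming (as in the sample data) that every class has exactly one row per hour and the rows are sorted the same way:

df_new = pd.DataFrame({'DateTime': df[df.Class == 1]['DateTime'].values})
for c in range(1, 16):
    # .values again sidesteps index alignment between the slices
    df_new[f'Class{c}'] = df[df.Class == c]['Count'].values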

Try pivot_table:
(df.pivot_table(index='DateTime', columns='Class',
                values='Count', aggfunc='sum')
   .add_prefix('Class'))
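A minimal self-contained check of this, with values taken from the sample data above:

import pandas as pd

df = pd.DataFrame({
    'DateTime': ['2017-10-01 00:00:00'] * 3 + ['2017-10-01 01:00:00'] * 3,
    'Class': [1, 2, 3, 1, 2, 3],
    'Count': [0, 240, 17, 0, 209, 14],
})

wide = (df.pivot_table(index='DateTime', columns='Class',
                       values='Count', aggfunc='sum')
          .add_prefix('Class'))
print(wide)
# Class                Class1  Class2  Class3
# DateTime
# 2017-10-01 00:00:00       0     240      17
# 2017-10-01 01:00:00       0     209      14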

Related

Calculate the rolling average every two weeks for the same day and hour in a DataFrame

I have a DataFrame like the following:
import pandas as pd

df = pd.DataFrame()
df['datetime'] = pd.date_range(start='2023-1-2', end='2023-1-29', freq='15min')
df['week'] = df['datetime'].apply(lambda x: int(x.isocalendar()[1]))
df['day_of_week'] = df['datetime'].dt.weekday
df['hour'] = df['datetime'].dt.hour
df['minutes'] = pd.DatetimeIndex(df['datetime']).minute
df['value'] = range(len(df))
df.set_index('datetime', inplace=True)
df =
                     week  day_of_week  hour  minutes  value
datetime
2023-01-02 00:00:00     1            0     0        0      0
2023-01-02 00:15:00     1            0     0       15      1
2023-01-02 00:30:00     1            0     0       30      2
2023-01-02 00:45:00     1            0     0       45      3
2023-01-02 01:00:00     1            0     1        0      4
...                   ...          ...   ...      ...    ...
2023-01-08 23:00:00     1            6    23        0    668
2023-01-08 23:15:00     1            6    23       15    669
2023-01-08 23:30:00     1            6    23       30    670
2023-01-08 23:45:00     1            6    23       45    671
2023-01-09 00:00:00     2            0     0        0    672
And I want to calculate, for each day-of-week/hour/minute combination, the average of the "value" column over the two preceding weeks.
What I would like to get is the following:
df =
                                              value
day_of_week hour minutes datetime
0           0    0       2023-01-02 00:00:00    NaN
                         2023-01-09 00:00:00    NaN
                         2023-01-16 00:00:00    336
                         2023-01-23 00:00:00   1008
                 15      2023-01-02 00:15:00    NaN
                         2023-01-09 00:15:00    NaN
                         2023-01-16 00:15:00    337
                         2023-01-23 00:15:00   1009
So the first two weeks should have NaN values; week 3 should be the average of week 1 and week 2, then week 4 the average of week 2 and week 3, and so on.
I tried the following code, but it does not seem to do what I expect:
df = pd.DataFrame(df.groupby(['day_of_week','hour','minutes'])['value'].rolling(window='14D', min_periods=1).mean())
What I am getting instead is:
                                              value
day_of_week hour minutes datetime
0           0    0       2023-01-02 00:00:00      0
                         2023-01-09 00:00:00    336
                         2023-01-16 00:00:00   1008
                         2023-01-23 00:00:00   1680
                 15      2023-01-02 00:15:00      1
                         2023-01-09 00:15:00    337
                         2023-01-16 00:15:00   1009
                         2023-01-23 00:15:00   1681
I think you want to shift within each group. Then you need another groupby:
(df.groupby(['day_of_week','hour','minutes'])['value']
   .rolling(window='14D', min_periods=2).mean()        # `min_periods` is different
   .groupby(['day_of_week','hour','minutes']).shift()  # shift within each group
   .to_frame()
)
Output:
                                              value
day_of_week hour minutes datetime
0           0    0       2023-01-02 00:00:00     NaN
                         2023-01-09 00:00:00     NaN
                         2023-01-16 00:00:00   336.0
                         2023-01-23 00:00:00  1008.0
                 15      2023-01-02 00:15:00     NaN
...                                               ...
6           23   30      2023-01-15 23:30:00     NaN
                         2023-01-22 23:30:00  1006.0
                 45      2023-01-08 23:45:00     NaN
                         2023-01-15 23:45:00     NaN
                         2023-01-22 23:45:00  1007.0
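Since each (day_of_week, hour, minutes) group holds exactly one observation per week here, the same result can be obtained with positional shifts; this equivalent sketch is my addition, not part of the original answer:

# average of the two previous weekly observations within each group
g = df.groupby(['day_of_week', 'hour', 'minutes'])['value']
df['avg_prev_2_weeks'] = (g.shift(1) + g.shift(2)) / 2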

Derive pandas datetime from mixed integer format

I want to derive a time column from an integer column of varying digit lengths in a pandas DataFrame. The input column is shown below. I want to get:
180000 = 18:00:00
 60000 = 06:00:00
     0 = 00:00:00
id  value
13  180000
14       0
15   60000
16  100000
17       0
18   60000
Thanks,
Pedram.
Use to_datetime:
# 0 has no HHMMSS digits to parse, so replace it with the string '000000' first
df['Time'] = pd.to_datetime(df['value'].replace(0, '0'*6),
                            format='%H%M%S', errors='coerce').dt.time
Result:
   id   value      Time
0  13  180000  18:00:00
1  14       0  00:00:00
2  15   60000  06:00:00
3  16  100000  10:00:00
4  17       0  00:00:00
5  18   60000  06:00:00
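A more explicit, self-contained variant (my addition, not part of the original answer) zero-pads every value to six digits first, so the parser never sees short strings:

import pandas as pd

df = pd.DataFrame({'id': [13, 14, 15, 16, 17, 18],
                   'value': [180000, 0, 60000, 100000, 0, 60000]})
# pad to a fixed HHMMSS width, then parse
df['Time'] = pd.to_datetime(df['value'].astype(str).str.zfill(6),
                            format='%H%M%S').dt.time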

Is there a way to group by month in Pandas starting at a specific day number?

I'm trying to group some data by month in Python, but I need each month to start on the 25th. Is there a way to do that in pandas?
For weeks there is a way of starting on Monday, Tuesday, ... but for months it is always the full calendar month:
pd.Grouper(key='date', freq='M')
You could offset the dates by 24 days and groupby:
import numpy as np
import pandas as pd

np.random.seed(1)
dates = pd.date_range('2019-01-01', '2019-04-30', freq='D')
df = pd.DataFrame({'date': dates,
                   'val': np.random.uniform(0, 1, len(dates))})

# shift every date back 24 days, so that day 25 becomes day 1
s = df['date'].sub(pd.DateOffset(24))
(df.groupby([s.dt.year, s.dt.month], as_index=False)
   .agg({'date': 'min', 'val': 'sum'})
)
gives
        date        val
0 2019-01-01  10.120368
1 2019-01-25  14.895363
2 2019-02-25  14.544506
3 2019-03-25  17.228734
4 2019-04-25   3.334160
Another example:
np.random.seed(1)
dates = pd.date_range('2019-01-20', '2019-01-30', freq='D')
df = pd.DataFrame({'date': dates,
                   'val': np.random.uniform(0, 1, len(dates))})

s = df['date'].sub(pd.DateOffset(24))
df['groups'] = df.groupby([s.dt.year, s.dt.month]).cumcount()
gives
         date       val  groups
0  2019-01-20  0.417022       0
1  2019-01-21  0.720324       1
2  2019-01-22  0.000114       2
3  2019-01-23  0.302333       3
4  2019-01-24  0.146756       4
5  2019-01-25  0.092339       0
6  2019-01-26  0.186260       1
7  2019-01-27  0.345561       2
8  2019-01-28  0.396767       3
9  2019-01-29  0.538817       4
10 2019-01-30  0.419195       5
And you can see how the cumcount restarts on day 25.
I prepared the following test DataFrame:
         Dat  Val
0 2017-03-24    0
1 2017-03-25    0
2 2017-03-26    1
3 2017-03-27    0
4 2017-04-24    0
5 2017-04-25    0
6 2017-05-24    0
7 2017-05-25    2
8 2017-05-26    0
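For reproducibility, here is a sketch that builds this test frame (the construction code itself was not part of the original answer):

import pandas as pd

df = pd.DataFrame({
    'Dat': pd.to_datetime(['2017-03-24', '2017-03-25', '2017-03-26',
                           '2017-03-27', '2017-04-24', '2017-04-25',
                           '2017-05-24', '2017-05-25', '2017-05-26']),
    'Val': [0, 0, 1, 0, 0, 0, 0, 2, 0],
})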
The first step is to compute a "shifted date" column:
df['Dat2'] = df.Dat + pd.DateOffset(days=-24)
The result is:
         Dat  Val       Dat2
0 2017-03-24    0 2017-02-28
1 2017-03-25    0 2017-03-01
2 2017-03-26    1 2017-03-02
3 2017-03-27    0 2017-03-03
4 2017-04-24    0 2017-03-31
5 2017-04-25    0 2017-04-01
6 2017-05-24    0 2017-04-30
7 2017-05-25    2 2017-05-01
8 2017-05-26    0 2017-05-02
As you can see, March dates in Dat2 begin only at the original date 2017-03-25, and so on.
The value of 1 is in March (Dat2) and the value of 2 is in May (also Dat2).
Then, to compute e.g. a sum by month, we can run:
df.groupby(pd.Grouper(key='Dat2', freq='MS')).sum(numeric_only=True)
getting:
            Val
Dat2
2017-02-01    0
2017-03-01    1
2017-04-01    0
2017-05-01    2
So we have the correct grouping: 1 falls in March and 2 falls in May.
The advantage over the other answer is that all dates land on the first day of a month, bearing in mind that e.g. 2017-03-01 in the result means the period from 2017-03-25 to 2017-04-24 (inclusive).

pandas grouper int by frequency

I would like to group a Pandas dataframe by hour, disregarding the date.
My data:
              opened_at  count     sum
id
154 2016-07-01 07:02:05      1   46.14
    2016-07-01 07:34:02      1  479.00
    2016-07-01 10:10:01      1  127.14
    2016-07-02 12:01:04      1    8.14
    2016-07-02 12:00:50      1   18.14
I am able to group by hour with the date taken into account, using the following:
groupByLocationDay = df.groupby([df.id,
                                 pd.Grouper(key='opened_at', freq='3h')])
I get the following:
                         count        sum
id  opened_at
154 2016-07-01 06:00:00      2    4296.14
    2016-07-01 09:00:00     46   43716.79
    2016-07-01 12:00:00    169  150827.14
    2016-07-02 12:00:00     17    1508.14
    2016-07-02 09:00:00     10     108.14
How can I group by hour only, so that it would look like the following?
               count        sum
id  opened_at
154 06:00:00       2    4296.14
    09:00:00      56   43824.93
    12:00:00     203  152335.28
The original data is on an hourly basis, thus I need the 3h frequency.
Thanks!
You can do it this way:
In [134]: df
Out[134]:
    id           opened_at  count     sum
0  154 2016-07-01 07:02:05      1   46.14
1  154 2016-07-01 07:34:02      1  479.00
2  154 2016-07-01 10:10:01      1  127.14
3  154 2016-07-02 12:01:04      1    8.14
4  154 2016-07-02 12:00:50      1   18.14
5  154 2016-07-02 08:34:02      1  479.00

In [135]: df.groupby(['id', df.opened_at.dt.hour // 3 * 3]).sum(numeric_only=True)
Out[135]:
               count      sum
id  opened_at
154 6              3  1004.14
    9              1   127.14
    12             2    26.28
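If you want the buckets labelled like the desired output (06:00:00 instead of the bare hour 6), a small variation works; the label formatting here is my addition, not from the answer:

# render each 3-hour bucket start as an HH:00:00 label
bucket = (df.opened_at.dt.hour // 3 * 3).map('{:02d}:00:00'.format)
df.groupby(['id', bucket]).sum(numeric_only=True)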

Convert hourly time periods into 15-minute time periods

I have a dataframe like this:
df = pd.read_csv("fileA.csv", dtype=str, delimiter=";", skiprows=None, parse_dates=['Date'])

               Date Buy Sell
0  01.08.2009 01:00  15   25
1  01.08.2009 02:00   0   30
2  01.08.2009 03:00  10   18
But I need this (in 15-minute periods):
               Date Buy Sell
0  01.08.2009 01:00  15   25
1  01.08.2009 01:15  15   25
2  01.08.2009 01:30  15   25
3  01.08.2009 01:45  15   25
4  01.08.2009 02:00   0   30
5  01.08.2009 02:15   0   30
6  01.08.2009 02:30   0   30
7  01.08.2009 02:45   0   30
8  01.08.2009 03:00  10   18
... and so on.
I have tried df.resample(), but it did not work. Does someone know a suitable pandas method?
If fileA.csv looks like this:
Date;Buy;Sell
01.08.2009 01:00;15;25
01.08.2009 02:00;0;30
01.08.2009 03:00;10;18
then you could parse the data with
df = pd.read_csv("fileA.csv", delimiter=";", parse_dates=['Date'])
(if dates like 01.08.2009 mean 1 August, you will also want to pass dayfirst=True)
so that df will look like this:
In [41]: df
Out[41]:
                 Date  Buy  Sell
0 2009-01-08 01:00:00   15    25
1 2009-01-08 02:00:00    0    30
2 2009-01-08 03:00:00   10    18
You might want to check df.info() to make sure you successfully parsed your data into a DataFrame with three columns, and that the Date column has dtype datetime64[ns]. Since the repr(df) you posted prints the date in a different format and the column headers do not align with the data, there is a good chance that the data has not been parsed properly yet. If that's the case and you post some sample lines from the csv, we should be able to help you parse the data into a DataFrame.
In [51]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 3 columns):
Date 3 non-null datetime64[ns]
Buy 3 non-null int64
Sell 3 non-null int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 96.0 bytes
Once you have the DataFrame correctly parsed, resampling to 15-minute periods can be done with asfreq, forward-filling the missing values:
In [50]: df.set_index('Date').asfreq('15T', method='ffill')
Out[50]:
                     Buy  Sell
Date
2009-01-08 01:00:00   15    25
2009-01-08 01:15:00   15    25
2009-01-08 01:30:00   15    25
2009-01-08 01:45:00   15    25
2009-01-08 02:00:00    0    30
2009-01-08 02:15:00    0    30
2009-01-08 02:30:00    0    30
2009-01-08 02:45:00    0    30
2009-01-08 03:00:00   10    18
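The resample() that the question mentions also works once the index is a proper DatetimeIndex; this equivalent one-liner is my addition:

# forward-fill each hourly row across its four 15-minute slots
df.set_index('Date').resample('15min').ffill()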