Get value at the same hour 1 and 2 days before, 1 week before, and 1 month before - pandas

I have time-series data with other fields.
Now I want to create more columns like
valueonsamehour1daybefore, valueonsamehour2daybefore,
valueonsamehour3daybefore, valueonsamehour1weekbefore,
and valueonsamehour1monthbefore.
If no value is present at that hour, the new value should be set to zero.
The DataFrame can be loaded from here:
url = 'https://drive.google.com/file/d/1BXvJqKGLwG4hqWJvh9gPAHqCbCcCKkUT/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id=' + url.split('/')[-2]
df = pd.read_csv(path,index_col=0,delimiter=",")
The DataFrame looks like the following:
time                StartCity  District  Id    stype  EndCity  Count
2021-09-15 09:00:00 1 104 2713 21 9 2
2021-05-16 11:00:00 1 107 1044 11 6 1
2021-05-16 12:00:00 1 107 1044 11 6 0
2021-05-16 13:00:00 1 107 1044 11 6 0
2021-05-16 14:00:00 1 107 1044 11 6 0
2021-05-16 15:00:00 1 107 1044 11 6 0
2021-05-16 16:00:00 1 107 1044 11 6 0
2021-05-16 17:00:00 1 107 1044 11 6 0
2021-05-16 18:00:00 1 107 1044 11 6 0
2021-05-16 19:00:00 1 107 1044 11 6 0
2021-05-16 20:00:00 1 107 1044 11 6 0
2021-05-16 21:00:00 1 107 1044 11 6 0
2021-05-16 22:00:00 1 107 1044 11 6 0
2021-05-16 23:00:00 1 107 1044 11 6 0
2021-05-17 00:00:00 1 107 1044 11 6 0
2021-05-17 01:00:00 1 107 1044 11 6 0
2021-05-17 02:00:00 1 107 1044 11 6 0
2021-05-17 03:00:00 1 107 1044 11 6 0
2021-05-17 04:00:00 1 107 1044 11 6 0
2021-05-17 05:00:00 1 107 1044 11 6 0
2021-05-17 06:00:00 1 107 1044 11 6 0
2021-05-17 07:00:00 1 107 1044 11 6 0
2021-05-17 08:00:00 1 107 1044 11 6 0
2021-05-17 09:00:00 1 107 1044 11 6 0
2021-05-17 10:00:00 1 107 1044 11 6 0
2021-05-17 11:00:00 1 107 1044 11 6 0
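One way to build these columns is to index the Count series by time and look up shifted timestamps, filling missing hours with zero. A minimal sketch with hypothetical sample data (the column names are assumed from the table above, and it assumes one row per timestamp; with several Id series you would apply it per group):

```python
import pandas as pd

# Hypothetical sample standing in for the linked CSV.
df = pd.DataFrame({
    'time': pd.to_datetime(['2021-05-16 11:00:00', '2021-05-17 11:00:00',
                            '2021-05-18 11:00:00', '2021-05-23 11:00:00']),
    'Count': [5, 7, 9, 11],
})

# Map each timestamp to its Count, then look up the shifted timestamps;
# hours with no record become 0.
lookup = df.set_index('time')['Count']
for name, offset in [('1daybefore', pd.Timedelta(days=1)),
                     ('2daybefore', pd.Timedelta(days=2)),
                     ('1weekbefore', pd.Timedelta(weeks=1)),
                     ('1monthbefore', pd.DateOffset(months=1))]:
    shifted = df['time'] - offset
    df['valueonsamehour' + name] = lookup.reindex(shifted).fillna(0).to_numpy()
```

pd.DateOffset(months=1) handles calendar months of unequal length, which a fixed Timedelta cannot.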

Related

Pandas: create a period based on date column

I have a dataframe
ID datetime
11 01-09-2021 10:00:00
11 01-09-2021 10:15:15
11 01-09-2021 15:00:00
12 01-09-2021 15:10:00
11 01-09-2021 18:00:00
I need to add a period column based only on datetime, incrementing whenever the gap exceeds 2 hours
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 2
11 01-09-2021 18:00:00 3
And the same thing but based on ID and datetime
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 1
11 01-09-2021 18:00:00 3
How can I do that?
You can get the difference with Series.diff, convert it to hours via Series.dt.total_seconds, compare against 2, and add a cumulative sum:
df['period'] = df['datetime'].diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 2
4 11 2021-01-09 18:00:00 3
Similar idea per groups:
f = lambda x: x.diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
df['period'] = df.groupby('ID')['datetime'].transform(f)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 1
4 11 2021-01-09 18:00:00 3

7 days hourly mean with pandas

I need some help calculating a 7-day mean for every hour.
The time series has hourly resolution and I need the 7-day mean for each hour, e.g. for 13 o'clock:
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried it with pandas and a rolling mean, but a plain rolling window covers the last 7 days regardless of hour.
Thanks for any hints!
Add a new hour column, group by it, and take a rolling mean of 7 rows within each group.
Each value is then the average of the same hour over 7 days, which matches the intent of the question.
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429
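The groupby output above loses the original datetime index and returns a reshaped frame. If you instead want the 7-day same-hour mean as a column on the original frame, transform keeps the index. A sketch with made-up data (column name x as in the question):

```python
import pandas as pd
import numpy as np

# Hypothetical hourly series: 10 days of data.
idx = pd.date_range('2020-07-01', periods=24 * 10, freq='h')
df = pd.DataFrame({'x': np.arange(len(idx))}, index=idx)

# Per-hour rolling 7-day mean, aligned back to the original index.
df['mean7d'] = (df.groupby(df.index.hour)['x']
                  .transform(lambda s: s.rolling(7).mean()))
```

The first six days of each hour are NaN, as with the rolling(7) answer above; pass min_periods=1 to rolling if partial windows are acceptable.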

Check duplication based on time series (pandas)

I am working on a dataset that has visible duplicates, but df.duplicated returns False because the time column is unique.
How can I find the duplicates in A, B, C based on the time difference between them? For example, if the time difference is less than 200 ms, delete the duplicates.
IIUC, you could do something like this:
np.random.seed(123)
df = pd.DataFrame({'A': np.random.randint(1, 3, 48),
                   'B': np.random.randint(11, 13, 48),
                   'C': np.random.randint(101, 113, 48),
                   'time': pd.date_range('2014-09-10', periods=48, freq='10T')})
df.join(df.groupby(pd.Grouper(key='time', freq='30T'),
                   group_keys=False, as_index=False)
          .apply(lambda x: x.duplicated(['A', 'B', 'C'], keep=False))
          .rename('dups'))
Output:
A B C time dups
0 1 11 110 2014-09-10 00:00:00 False
1 2 11 103 2014-09-10 00:10:00 False
2 1 12 105 2014-09-10 00:20:00 False
3 1 12 109 2014-09-10 00:30:00 False
4 1 11 102 2014-09-10 00:40:00 False
5 1 11 103 2014-09-10 00:50:00 False
6 1 12 102 2014-09-10 01:00:00 False
7 2 11 102 2014-09-10 01:10:00 False
8 2 12 104 2014-09-10 01:20:00 False
9 1 11 106 2014-09-10 01:30:00 False
10 2 11 110 2014-09-10 01:40:00 False
11 2 12 101 2014-09-10 01:50:00 False
12 1 11 109 2014-09-10 02:00:00 False
13 2 12 112 2014-09-10 02:10:00 False
14 1 11 102 2014-09-10 02:20:00 False
15 2 12 107 2014-09-10 02:30:00 False
16 1 11 104 2014-09-10 02:40:00 False
17 2 11 104 2014-09-10 02:50:00 False
18 2 11 112 2014-09-10 03:00:00 False
19 1 11 106 2014-09-10 03:10:00 False
20 1 12 110 2014-09-10 03:20:00 False
21 1 11 108 2014-09-10 03:30:00 False
22 2 11 110 2014-09-10 03:40:00 False
23 2 12 103 2014-09-10 03:50:00 False
24 2 12 104 2014-09-10 04:00:00 True
25 1 12 112 2014-09-10 04:10:00 False
26 2 12 104 2014-09-10 04:20:00 True
27 1 11 104 2014-09-10 04:30:00 False
28 1 11 109 2014-09-10 04:40:00 False
29 1 11 107 2014-09-10 04:50:00 False
30 1 11 110 2014-09-10 05:00:00 False
31 2 12 108 2014-09-10 05:10:00 False
32 2 12 107 2014-09-10 05:20:00 False
33 2 11 104 2014-09-10 05:30:00 False
34 1 11 110 2014-09-10 05:40:00 False
35 1 11 107 2014-09-10 05:50:00 False
36 2 11 107 2014-09-10 06:00:00 False
37 1 12 112 2014-09-10 06:10:00 False
38 1 11 107 2014-09-10 06:20:00 False
39 2 12 102 2014-09-10 06:30:00 False
40 1 12 111 2014-09-10 06:40:00 False
41 2 11 104 2014-09-10 06:50:00 False
42 1 12 105 2014-09-10 07:00:00 False
43 2 12 104 2014-09-10 07:10:00 False
44 2 12 102 2014-09-10 07:20:00 False
45 2 11 101 2014-09-10 07:30:00 False
46 1 12 106 2014-09-10 07:40:00 False
47 1 12 109 2014-09-10 07:50:00 False
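The 30-minute Grouper above only flags duplicates inside fixed windows. To apply the asker's 200 ms criterion directly, you could compare each row's time with the previous row in the same (A, B, C) group and drop rows that follow within 200 ms. A sketch with hypothetical millisecond-resolution data (note it drops every row within 200 ms of its immediate predecessor in the group, so a chain of closely spaced rows collapses to its first row):

```python
import pandas as pd

# Hypothetical data; the real set would have millisecond timestamps.
df = pd.DataFrame({
    'A': [1, 1, 1, 2],
    'B': [11, 11, 11, 12],
    'C': [101, 101, 105, 101],
    'time': pd.to_datetime(['2014-09-10 00:00:00.000',
                            '2014-09-10 00:00:00.150',
                            '2014-09-10 00:00:00.300',
                            '2014-09-10 00:00:01.000']),
})

# Gap to the previous row with the same A, B, C (NaT for the first row
# of each group, which compares as False and is therefore kept).
gap = df.sort_values('time').groupby(['A', 'B', 'C'])['time'].diff()
dedup = df[~(gap < pd.Timedelta(milliseconds=200))]
```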

Split DateTimeIndex data based on hour/minute/second

I have time-series data that I would like to split based on hour, or minute, or second. This is generally user-defined. I would like to know how it can be done.
For example, consider the following:
test = pd.DataFrame({'TIME': pd.date_range(start='2016-09-30',
freq='600s', periods=20)})
test['X'] = np.arange(20)
The output is:
TIME X
0 2016-09-30 00:00:00 0
1 2016-09-30 00:10:00 1
2 2016-09-30 00:20:00 2
3 2016-09-30 00:30:00 3
4 2016-09-30 00:40:00 4
5 2016-09-30 00:50:00 5
6 2016-09-30 01:00:00 6
7 2016-09-30 01:10:00 7
8 2016-09-30 01:20:00 8
9 2016-09-30 01:30:00 9
10 2016-09-30 01:40:00 10
11 2016-09-30 01:50:00 11
12 2016-09-30 02:00:00 12
13 2016-09-30 02:10:00 13
14 2016-09-30 02:20:00 14
15 2016-09-30 02:30:00 15
16 2016-09-30 02:40:00 16
17 2016-09-30 02:50:00 17
18 2016-09-30 03:00:00 18
19 2016-09-30 03:10:00 19
Suppose I want to split it by hour. I would like the following as one chunk which I can then save to a file.
TIME X
0 2016-09-30 00:00:00 0
1 2016-09-30 00:10:00 1
2 2016-09-30 00:20:00 2
3 2016-09-30 00:30:00 3
4 2016-09-30 00:40:00 4
5 2016-09-30 00:50:00 5
The second chunk would be:
TIME X
0 2016-09-30 01:00:00 6
1 2016-09-30 01:10:00 7
2 2016-09-30 01:20:00 8
3 2016-09-30 01:30:00 9
4 2016-09-30 01:40:00 10
5 2016-09-30 01:50:00 11
and so on...
Note that I can do it purely with logical conditions such as
df[(df['TIME'] >= '2016-09-30 00:00:00') &
(df['TIME'] <= '2016-09-30 00:50:00')]
and repeat....
but what if my sampling changes? Is there a way to create a mask or something that takes less code and is efficient? I have 10 GB of data.
Option 1
You can group by Series that are not columns of the object you're grouping:
test.groupby([test.TIME.dt.date,
              test.TIME.dt.hour,
              test.TIME.dt.minute,
              test.TIME.dt.second])
Option 2
Use pd.Grouper (pd.TimeGrouper was deprecated and later removed from pandas):
test.set_index('TIME').groupby(pd.Grouper(freq='s'))    # group by second
test.set_index('TIME').groupby(pd.Grouper(freq='min'))  # group by minute
test.set_index('TIME').groupby(pd.Grouper(freq='h'))    # group by hour
You need to use groupby for this, and the grouping should be based on date and hour:
test['DATE'] = test['TIME'].dt.date
test['HOUR'] = test['TIME'].dt.hour
grp = test.groupby(['DATE', 'HOUR'])
You can then loop over the groups and do the operation you want.
Example:
for key, df in grp:
print(key, df)
((datetime.date(2016, 9, 30), 0), TIME X DATE HOUR
0 2016-09-30 00:00:00 0 2016-09-30 0
1 2016-09-30 00:10:00 1 2016-09-30 0
2 2016-09-30 00:20:00 2 2016-09-30 0
3 2016-09-30 00:30:00 3 2016-09-30 0
4 2016-09-30 00:40:00 4 2016-09-30 0
5 2016-09-30 00:50:00 5 2016-09-30 0)
((datetime.date(2016, 9, 30), 1), TIME X DATE HOUR
6 2016-09-30 01:00:00 6 2016-09-30 1
7 2016-09-30 01:10:00 7 2016-09-30 1
8 2016-09-30 01:20:00 8 2016-09-30 1
9 2016-09-30 01:30:00 9 2016-09-30 1
10 2016-09-30 01:40:00 10 2016-09-30 1
11 2016-09-30 01:50:00 11 2016-09-30 1)
((datetime.date(2016, 9, 30), 2), TIME X DATE HOUR
12 2016-09-30 02:00:00 12 2016-09-30 2
13 2016-09-30 02:10:00 13 2016-09-30 2
14 2016-09-30 02:20:00 14 2016-09-30 2
15 2016-09-30 02:30:00 15 2016-09-30 2
16 2016-09-30 02:40:00 16 2016-09-30 2
17 2016-09-30 02:50:00 17 2016-09-30 2)
((datetime.date(2016, 9, 30), 3), TIME X DATE HOUR
18 2016-09-30 03:00:00 18 2016-09-30 3
19 2016-09-30 03:10:00 19 2016-09-30 3)
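To actually save each hourly chunk to its own file, as the question asks, you can iterate over the groups; the file-name pattern below is just an assumption:

```python
import pandas as pd
import numpy as np

test = pd.DataFrame({'TIME': pd.date_range(start='2016-09-30',
                                           freq='600s', periods=20)})
test['X'] = np.arange(20)

# One CSV per hour; change freq to 'min' or 's' for other resolutions.
for stamp, chunk in test.set_index('TIME').groupby(pd.Grouper(freq='h')):
    chunk.to_csv(f"chunk_{stamp:%Y%m%d_%H}.csv")
```

Because the grouping key is the bucket timestamp, the file names adapt automatically when the sampling or bucket size changes.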

Calculate average values for rows with different ids in MS Excel

The file contains information about products per day, and I need to calculate the monthly average values for each product.
Source data looks like this:
A B C D
id date rating price
1 1 2014/01/01 2 20
2 1 2014/01/02 2 20
3 1 2014/01/03 2 20
4 1 2014/01/04 1 20
5 1 2014/01/05 1 20
6 1 2014/01/06 1 20
7 1 2014/01/07 1 20
8 3 2014/01/01 5 99
9 3 2014/01/02 5 99
10 3 2014/01/03 5 99
11 3 2014/01/04 5 99
12 3 2014/01/05 5 120
13 3 2014/01/06 5 120
14 3 2014/01/07 5 120
Need to get:
A B C D
id date rating price
1 1 1.42 20
2 3 5 108
How can I do that? I need an advanced formula or a VBA script.
Update: I have data for a long period - about 2 years. I need to calculate average values for each product for each week, and then for each month.
Source data example:
id date rating
4 2013-09-01 445
4 2013-09-02 446
4 2013-09-03 447
4 2013-09-04 448
4 2013-09-05 449
4 2013-09-06 450
4 2013-09-07 451
4 2013-09-08 452
4 2013-09-09 453
4 2013-09-10 454
4 2013-09-11 455
4 2013-09-12 456
4 2013-09-13 457
4 2013-09-14 458
4 2013-09-15 459
4 2013-09-16 460
4 2013-09-17 461
4 2013-09-18 462
4 2013-09-19 463
4 2013-09-20 464
4 2013-09-21 465
4 2013-09-22 466
4 2013-09-23 467
4 2013-09-24 468
4 2013-09-25 469
4 2013-09-26 470
4 2013-09-27 471
4 2013-09-28 472
4 2013-09-29 473
4 2013-09-30 474
4 2013-10-01 475
4 2013-10-02 476
4 2013-10-03 477
4 2013-10-04 478
4 2013-10-05 479
4 2013-10-06 480
4 2013-10-07 481
4 2013-10-08 482
4 2013-10-09 483
4 2013-10-10 484
4 2013-10-11 485
4 2013-10-12 486
4 2013-10-13 487
4 2013-10-14 488
4 2013-10-15 489
4 2013-10-16 490
4 2013-10-17 491
4 2013-10-18 492
4 2013-10-19 493
4 2013-10-20 494
4 2013-10-21 495
4 2013-10-22 496
4 2013-10-23 497
4 2013-10-24 498
4 2013-10-25 499
4 2013-10-26 500
4 2013-10-27 501
4 2013-10-28 502
4 2013-10-29 503
4 2013-10-30 504
4 2013-10-31 505
7 2013-09-01 1445
7 2013-09-02 1446
7 2013-09-03 1447
7 2013-09-04 1448
7 2013-09-05 1449
7 2013-09-06 1450
7 2013-09-07 1451
7 2013-09-08 1452
7 2013-09-09 1453
7 2013-09-10 1454
7 2013-09-11 1455
7 2013-09-12 1456
7 2013-09-13 1457
7 2013-09-14 1458
7 2013-09-15 1459
7 2013-09-16 1460
7 2013-09-17 1461
7 2013-09-18 1462
7 2013-09-19 1463
7 2013-09-20 1464
7 2013-09-21 1465
7 2013-09-22 1466
7 2013-09-23 1467
7 2013-09-24 1468
7 2013-09-25 1469
7 2013-09-26 1470
7 2013-09-27 1471
7 2013-09-28 1472
7 2013-09-29 1473
7 2013-09-30 1474
7 2013-10-01 1475
7 2013-10-02 1476
7 2013-10-03 1477
7 2013-10-04 1478
7 2013-10-05 1479
7 2013-10-06 1480
7 2013-10-07 1481
7 2013-10-08 1482
7 2013-10-09 1483
7 2013-10-10 1484
7 2013-10-11 1485
7 2013-10-12 1486
7 2013-10-13 1487
7 2013-10-14 1488
7 2013-10-15 1489
7 2013-10-16 1490
7 2013-10-17 1491
7 2013-10-18 1492
7 2013-10-19 1493
7 2013-10-20 1494
7 2013-10-21 1495
7 2013-10-22 1496
7 2013-10-23 1497
7 2013-10-24 1498
7 2013-10-25 1499
7 2013-10-26 1500
7 2013-10-27 1501
7 2013-10-28 1502
7 2013-10-29 1503
7 2013-10-30 1504
7 2013-10-31 1505
This is a job for a pivot table, and it takes about 30 seconds to do.
Update:
as per your update, put the date into the Report Filter and modify to suit