How to resample dates into 1 minute bars? - pandas

Say I have a pandas.DataFrame like:
val
2020-01-01 12
2020-04-15 38
2020-05-03 19
How can I create a pandas.DataFrame like:
val
2020-01-01 00:00:00 12
2020-01-01 00:01:00 12
...
2020-01-01 23:58:00 12
2020-01-01 23:59:00 12
2020-04-15 00:00:00 38
2020-04-15 00:01:00 38
...
2020-04-15 23:58:00 38
2020-04-15 23:59:00 38
2020-05-03 00:00:00 19
2020-05-03 00:01:00 19
...
2020-05-03 23:58:00 19
2020-05-03 23:59:00 19
I have tried df.resample('1 min').asfreq() but that gives me all the minutes from the first row to the last row, including all the days that aren't in the original index.

Recreating your sample df:
dates = pd.to_datetime(['2020-01-01', '2020-04-15', '2020-05-03'])
val = [12, 38, 19]
df = pd.DataFrame({'date': dates, 'val': val}).set_index('date')
I don't generally recommend loops, but this feels like a case where one is the more natural fit; it really depends on how much data you're dealing with. Collecting the pieces in a list and concatenating once at the end avoids re-copying the growing frame on every iteration. It works, anyway. :) (Note: pandas 1.4 renamed date_range's closed= parameter to inclusive=, and 'min' is the non-deprecated spelling of the 'T' frequency alias.)
pieces = []
for row in df.itertuples():
    # one bar per minute: from midnight up to, but not including, the next midnight
    bars = pd.date_range(row.Index, row.Index + pd.Timedelta(days=1),
                         freq='min', inclusive='left')
    pieces.append(pd.DataFrame({'val': row.val}, index=bars))
out = pd.concat(pieces)
print(out)
val
2020-01-01 00:00:00 12
2020-01-01 00:01:00 12
2020-01-01 00:02:00 12
2020-01-01 00:03:00 12
2020-01-01 00:04:00 12
... ...
2020-05-03 23:55:00 19
2020-05-03 23:56:00 19
2020-05-03 23:57:00 19
2020-05-03 23:58:00 19
2020-05-03 23:59:00 19
[4320 rows x 1 columns]
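If the loop ever becomes a bottleneck, the same result can be built without one by repeating each row 1440 times and adding the minute offsets in bulk. A sketch (names `minutes`, `idx`, and `out` are my own):

```python
import numpy as np
import pandas as pd

# recreate the sample frame from the question
dates = pd.to_datetime(['2020-01-01', '2020-04-15', '2020-05-03'])
df = pd.DataFrame({'val': [12, 38, 19]}, index=dates)

# 1440 minutes per day: repeat each date 1440 times, then add
# the offsets 00:00 .. 23:59 to each repeated day
minutes = pd.timedelta_range('0min', periods=1440, freq='min')
idx = df.index.repeat(1440) + np.tile(minutes, len(df))
out = pd.DataFrame({'val': np.repeat(df['val'].to_numpy(), 1440)}, index=idx)
```

This produces the same 4320-row frame in one shot, at the cost of being a little less obvious to read.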

Related

Pandas: Find weekly max from timeseries (calendar week, not 7 days)

I want my dataframe to be grouped by calendar weekly, like Monday to Sunday.
timestamp value
# before time
...
# this is a Friday
2021-10-01 13:00:00 2204.0
2021-10-01 13:30:00 3262.0
...
# this is next Monday
2021-10-04 16:00:00 254.0
2021-10-04 16:30:00 990.0
2021-10-04 17:00:00 1044.0
2021-10-04 17:30:00 26.0
...
# time continues
The result I'm expecting, hope this is clear enough.
timestamp value weekly_max
# this is a Friday
2021-10-01 13:00:00 2204.0 3262.0 # assume 3262.0 is the maximum value during 2021-09-27 to 2021-10-03
2021-10-01 13:30:00 3262.0 3262.0
...
# this is next Monday
2021-10-04 16:00:00 254.0 1044.0
2021-10-04 16:30:00 990.0 1044.0
2021-10-04 17:00:00 1044.0 1044.0
2021-10-04 17:30:00 26.0 1044.0
...
Get the ISO week number:
df['week'] = df.datetime.dt.isocalendar().week
Get the max for each week:
df_weeklymax = df.groupby('week').agg(max=('value', 'max')).reset_index()
Merge the two tables:
df = df.merge(df_weeklymax, on='week', how='left')
(If the data spans more than one year, group by year and week together, since ISO week numbers repeat each year.)
example output:
   datetime             value  week  max
0  2021-01-01 00:00:00     20    53   69
1  2021-01-01 13:36:00     69    53   69
2  2021-01-02 03:12:00     69    53   69
3  2021-01-02 16:48:00     57    53   69
4  2021-01-03 06:24:00     39    53   69
5  2021-01-03 20:00:00     56    53   69
6  2021-01-04 09:36:00     73     1   92
7  2021-01-04 23:12:00     76     1   92
8  2021-01-05 12:48:00     92     1   92
9  2021-01-06 02:24:00      4     1   92
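The three steps can also be collapsed into a single `transform` call, which broadcasts each week's max back onto the original rows and avoids the merge entirely. A sketch on a few rows shaped like the question's data (the `weekly_max` column name is taken from the expected output):

```python
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime(['2021-10-01 13:00', '2021-10-01 13:30',
                                '2021-10-04 16:00', '2021-10-04 17:00']),
    'value': [2204.0, 3262.0, 254.0, 1044.0],
})

# group by ISO week number; transform('max') returns a result aligned
# with the original index, so it can be assigned directly
df['weekly_max'] = (df.groupby(df['datetime'].dt.isocalendar().week)['value']
                      .transform('max'))
```

The first two rows fall in the week of 2021-09-27 and get 3262.0; the Monday rows fall in the next week and get 1044.0, matching the expected output above.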

Pandas Group/Merge Dataframe by Non-Periodic Series

How do I group one DataFrame by another possibly-non-periodic Series? Mock-up below:
This is the DataFrame to be split:
i = pd.date_range(end="today", periods=20, freq="d").normalize()
v = np.random.randint(0,100,size=len(i))
d = pd.DataFrame({"value": v}, index=i)
>>> d
value
2021-02-06 48
2021-02-07 1
2021-02-08 86
2021-02-09 82
2021-02-10 40
2021-02-11 22
2021-02-12 63
2021-02-13 37
2021-02-14 41
2021-02-15 57
2021-02-16 30
2021-02-17 69
2021-02-18 63
2021-02-19 27
2021-02-20 23
2021-02-21 46
2021-02-22 66
2021-02-23 10
2021-02-24 91
2021-02-25 43
This is the splitting criterion: group by the Series dates, where a group holds every dataframe row whose date falls in [s[k], s[k+1]) - though, as with resampling, it would be nice to control the inclusion parameters.
s = pd.date_range(start="2019-10-14", freq="2W", periods=52).to_series()
s = s.drop(np.random.choice(s.index, 10, replace=False))
s = s.reset_index(drop=True)
>>> s[25:29]
25 2021-01-24
26 2021-02-07
27 2021-02-21
28 2021-03-07
dtype: datetime64[ns]
And this is the example output... or something like it. Index is taken from the series rather than the dataframe.
>>> ???.sum()
value
...
2021-01-24 47
2021-02-07 768
2021-02-21 334
...
Internally the groups would have this structure:
...
2021-01-10
sum: 0
2021-01-24
2021-02-06 47
sum: 47
2021-02-07
2021-02-07 52
2021-02-08 56
2021-02-09 21
2021-02-10 39
2021-02-11 86
2021-02-12 30
2021-02-13 20
2021-02-14 76
2021-02-15 91
2021-02-16 70
2021-02-17 34
2021-02-18 73
2021-02-19 41
2021-02-20 79
sum: 768
2021-02-21
2021-02-21 90
2021-02-22 75
2021-02-23 12
2021-02-24 70
2021-02-25 87
sum: 334
2021-03-07
sum: 0
...
Looks like you can do (note the parameter is labels, not label):
bucket = pd.cut(d.index, bins=s, labels=s[:-1], right=False)
d.groupby(bucket).sum()
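A reproducible run of the same idea, with deterministic stand-ins for the question's random `d` and `s` so the bin sums can be checked by hand:

```python
import numpy as np
import pandas as pd

# 20 daily rows with values 0..19, and four non-periodic bin edges
d = pd.DataFrame({'value': np.arange(20)},
                 index=pd.date_range('2021-02-06', periods=20, freq='D'))
s = pd.Series(pd.to_datetime(['2021-01-24', '2021-02-07',
                              '2021-02-21', '2021-03-07']))

# right=False gives [s[k], s[k+1]) bins; each bin is labeled by its left edge
bucket = pd.cut(d.index, bins=s, labels=s[:-1], right=False)
out = d.groupby(bucket, observed=False)['value'].sum()
```

Here the first bin catches only 2021-02-06 (value 0), the second catches 2021-02-07 through 2021-02-20 (values 1..14, sum 105), and the third catches the rest (values 15..19, sum 85). `observed=False` keeps empty bins in the result with a sum of 0, matching the "sum: 0" groups in the question.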

7 days hourly mean with pandas

I need some help calculating a 7-day mean for every hour.
The timeseries has an hourly resolution and I need the 7-day mean for each hour, e.g. for 13 o'clock:
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried it with pandas and a rolling mean, but rolling over the last 7 days includes every hour of those days, not just the matching hour.
Thanks for any hints!
Add a new hour column, group by it, and take a rolling mean of 7 observations within each group. Because each group contains exactly one observation per day, a 7-row window spans 7 days, which matches the intent of the question.
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429
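To see why the 7-row window equals a 7-day window per hour, here is the same technique on synthetic data where `x` is simply the day of the month, so the expected means are easy to verify (the data and names are mine, not from the question):

```python
import pandas as pd

# hourly index over 10 days; x equals the day of month
idx = pd.date_range('2020-07-01', periods=24 * 10, freq='h')
df = pd.DataFrame({'x': idx.day}, index=idx)

df['hour'] = df.index.hour
# within each hour-of-day group, rolling(7) sees one row per day
out = df.groupby('hour')['x'].rolling(7).mean()
```

For hour 0, the first full window ends on 2020-07-07 and covers days 1 through 7, so the mean is 4.0; the first six values in each group are NaN, just as in the output above.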

Creating nested dataframes with multiple dataframes

I have multiple dataframes, the following are only 2 of them:
print(df1)
Date A B C
2019-10-01 00:00:00 2 3 1
2019-10-01 01:00:00 5 1 6
2019-10-01 02:00:00 8 2 4
2019-10-01 03:00:00 3 6 5
print(df2)
Date A B C
2019-10-01 00:00:00 9 4 2
2019-10-01 01:00:00 3 2 4
2019-10-01 02:00:00 6 5 2
2019-10-01 03:00:00 3 6 5
All of them have same index and columns. I want to create dataframe like this:
Date df1 df2
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
I have to apply this process to 30 dataframes (their indexes and columns are all the same), so I want to write a for loop to build this dataframe. How can I do that?
Reshape each DataFrame in the list with DataFrame.set_index plus DataFrame.unstack, concat them, and finally rename the columns with a lambda:
dfs = [df1,df2]
df = (pd.concat([x.set_index('Date').unstack() for x in dfs], axis=1)
.rename(columns=lambda x: f'df{x+1}'))
print (df)
df1 df2
Date
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
If you want custom column names in the final DataFrame, create a list of names with the same length as dfs and pass it as the keys parameter:
dfs = [df1,df2]
names = ['col1','col2']
df = pd.concat([x.set_index('Date').unstack() for x in dfs], keys=names, axis=1)
print (df)
col1 col2
Date
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
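For the 30-DataFrame case, the list and the keys can be built programmatically; no per-frame loop body is needed beyond the comprehension. A small sketch with two tiny frames standing in for the full set (`all_dfs`, `keys`, and `combined` are my names):

```python
import pandas as pd

df1 = pd.DataFrame({'Date': pd.date_range('2019-10-01', periods=2, freq='h'),
                    'A': [2, 5], 'B': [3, 1]})
df2 = pd.DataFrame({'Date': df1['Date'], 'A': [9, 3], 'B': [4, 2]})

all_dfs = [df1, df2]  # in practice: the full list of 30 frames
keys = [f'df{i}' for i in range(1, len(all_dfs) + 1)]

# set_index + unstack turns each frame into a (column, Date) Series;
# concat along axis=1 lines them up under one column per source frame
combined = pd.concat([x.set_index('Date').unstack() for x in all_dfs],
                     keys=keys, axis=1)
```

The result has a (column letter, Date) MultiIndex on the rows and one column per input frame, exactly as in the outputs above.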

Sorting csv file with Python 3

I'm having trouble sorting a csv file which has in its second column the UTC time as: 2010-01-01 00:00:00
I have a file that is like this:
name utc_time longitude latitude
A 2010-01-01 00:00:34 23 41
B 2011-01-01 10:00:00 26 44
C 2009-01-01 03:00:00 34 46
D 2012-01-01 00:00:00 31 47
E 2010-01-01 04:00:00 44 48
F 2013-01-01 14:00:00 24 41
Which I want it to be outputted in a csv file keeping the same structure but sorted by date:
Output:
name utc_time longitude latitude
C 2009-01-01 03:00:00 34 46
A 2010-01-01 00:00:34 23 41
E 2010-01-01 04:00:00 44 48
B 2011-01-01 10:00:00 26 44
D 2012-01-01 00:00:00 31 47
F 2013-01-01 14:00:00 24 41
I'm actually trying this:
fileEru = pd.read_csv("input.csv")
fileEru = sorted(fileEru, key = lambda row: datetime.strptime(row[1],'%Y-%m-%d %H:%M:%S'), reverse=True)
fileEru.to_csv("output.csv")
But it doesn't work.
sorted() iterates over a DataFrame's column names, not its rows, and it returns a plain list with no .to_csv method, so that approach can't work. Let pandas parse the dates and sort the rows instead:
(pd.read_csv("input.csv", parse_dates=['utc_time'])
.sort_values('utc_time')
.to_csv("output.csv", index=False))
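A quick check of the same pipeline on in-memory data, using io.StringIO in place of the input file:

```python
import io
import pandas as pd

# stand-in for input.csv
raw = io.StringIO(
    'name,utc_time,longitude,latitude\n'
    'A,2010-01-01 00:00:34,23,41\n'
    'C,2009-01-01 03:00:00,34,46\n'
    'B,2011-01-01 10:00:00,26,44\n'
)

# parse_dates makes utc_time a real datetime column, so sort_values
# orders chronologically rather than lexically
out = (pd.read_csv(raw, parse_dates=['utc_time'])
         .sort_values('utc_time'))
```

From here, `out.to_csv('output.csv', index=False)` writes the sorted file with the original column structure intact.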