Take the mean of several yearly dataframes, hour by hour - pandas

I have several dataframes of values recorded every hour over several years, like this:
df1
Out[6]:
time P G(i) H_sun T2m WS10m Int
0 2005-01-01 00:10:00 0.0 0.0 0.0 0.68 2.11 0.0
1 2005-01-01 01:10:00 0.0 0.0 0.0 0.38 2.11 0.0
2 2005-01-01 02:10:00 0.0 0.0 0.0 0.08 2.11 0.0
3 2005-01-01 03:10:00 0.0 0.0 0.0 -0.22 2.11 0.0
4 2005-01-01 04:10:00 0.0 0.0 0.0 0.06 2.21 0.0
... ... ... ... ... ... ...
8755 2005-12-31 19:10:00 0.0 0.0 0.0 1.75 1.71 0.0
8756 2005-12-31 20:10:00 0.0 0.0 0.0 1.49 1.71 0.0
8757 2005-12-31 21:10:00 0.0 0.0 0.0 1.23 1.70 0.0
8758 2005-12-31 22:10:00 0.0 0.0 0.0 0.95 1.65 0.0
8759 2005-12-31 23:10:00 0.0 0.0 0.0 0.67 1.60 0.0
[8760 rows x 7 columns]
df2
Out[7]:
time P G(i) H_sun T2m WS10m Int
8760 2006-01-01 00:10:00 0.0 0.0 0.0 0.39 1.56 0.0
8761 2006-01-01 01:10:00 0.0 0.0 0.0 0.26 1.52 0.0
8762 2006-01-01 02:10:00 0.0 0.0 0.0 0.13 1.49 0.0
8763 2006-01-01 03:10:00 0.0 0.0 0.0 0.01 1.45 0.0
8764 2006-01-01 04:10:00 0.0 0.0 0.0 -0.45 1.65 0.0
... ... ... ... ... ... ...
17515 2006-12-31 19:10:00 0.0 0.0 0.0 4.24 1.32 0.0
17516 2006-12-31 20:10:00 0.0 0.0 0.0 4.00 1.32 0.0
17517 2006-12-31 21:10:00 0.0 0.0 0.0 3.75 1.32 0.0
17518 2006-12-31 22:10:00 0.0 0.0 0.0 4.34 1.54 0.0
17519 2006-12-31 23:10:00 0.0 0.0 0.0 4.92 1.76 0.0
[8760 rows x 7 columns]
and this for 10 years.
I'm trying to take the mean of the values for "20XX-01-01 00:10:00" across the years, i.e. obtain something like "the mean of all values for 01 January at 00:10". Ideally the time column would be merged so it reads just "01-01 00:10:00".
Is it possible?
So far I only know df.mean(), which collapses a whole column into a single result, and that's not what I want.

Join all the DataFrames together with concat:
df = pd.concat([df1, df2, df3, ..., df10])
Then aggregate the mean after mapping every timestamp onto the same year - e.g. 2005:
df['time'] = pd.to_datetime(df['time'])
# to drop 29 Feb, uncomment:
# df = df[(df['time'].dt.month != 2) | (df['time'].dt.day != 29)]
df1 = df.groupby(pd.to_datetime(df['time'].dt.strftime('2005-%m-%d %H:%M:%S'))).mean()
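If you would rather have the index read just "01-01 00:10:00" instead of carrying a dummy year, a minimal variation (continuing from the concatenated df above, with df['time'] already converted to datetime) is to group on the month-day-time string directly:
key = df['time'].dt.strftime('%m-%d %H:%M:%S')      # e.g. "01-01 00:10:00"
hourly_mean = df.groupby(key).mean(numeric_only=True)
This keeps one row per calendar hour of the year, averaged over all the years.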

Grouping columns to form time series data (Python)

I have a dataframe df which looks like this:

restaurant  opentime  closetime  group
ABX         10:00:00  21:00:00   Gold
BWZ         13:00:00  14:00:00   Silver
GTW         10:00:00  11:00:00   Gold
I want to create a time series dataframe df2, based on a start and end date of my choice, which shows how many restaurants in each group are open, indexed by every hour. In this case I have taken a start date of May 17th 2021 and an end date of May 18th 2021. The final dataframe should look like this:
Date                 Gold  Silver
2021-05-17 9:00:00   0     0
2021-05-17 10:00:00  2     0
2021-05-17 11:00:00  1     0
2021-05-17 12:00:00  1     0
2021-05-17 13:00:00  1     1
2021-05-17 14:00:00  1     1
2021-05-17 15:00:00  1     0
...                  ...   ...
2021-05-18 23:00:00  0     0
If the Date part is too difficult to recreate, then just the time would also help, so that it looks like this:
Time      Gold  Silver
9:00:00   0     0
10:00:00  2     0
11:00:00  1     0
12:00:00  1     0
13:00:00  1     1
14:00:00  1     1
15:00:00  1     0
...       ...   ...
23:00:00  0     0
Any help will be appreciated.
First part: Time
Create a list that contains all hours between opentime and closetime, then explode the list into rows, group by (time, group), and count the values for each group.
Second part: Date
Create a datetime index that contains all hours between start_date and end_date. Transform it into a series and set the time as the index.
Last part: Merge Date and Time
Merge dfd (the date series) and dft (the time dataframe) together to get the group counts for each datetime.
start_date = "2021-05-17"
end_date = "2021-05-18"
# compute hours between opentime and closetime
df["time"] = df.apply(lambda x: pd.date_range(x["opentime"],
x["closetime"],
freq="1H").time, axis="columns")
# value count by time and group
dft = df.explode("time").value_counts(["time", "group"]).unstack("group")
# create datetime index between start_date and end_date
dti = pd.date_range(start=pd.to_datetime(start_date),
end=pd.to_datetime(end_date) + pd.DateOffset(days=1),
closed="left", freq="1H", name="datetime")
dfd = dti.to_series(index=dti.time)
# merge date and time dataframes
out = pd.merge(dfd, dft, left_index=True, right_index=True, how="left") \
.set_index("datetime").sort_index().fillna(0)
>>> out
Gold Silver
datetime
2021-05-17 00:00:00 0.0 0.0
2021-05-17 01:00:00 0.0 0.0
2021-05-17 02:00:00 0.0 0.0
2021-05-17 03:00:00 0.0 0.0
2021-05-17 04:00:00 0.0 0.0
2021-05-17 05:00:00 0.0 0.0
2021-05-17 06:00:00 0.0 0.0
2021-05-17 07:00:00 0.0 0.0
2021-05-17 08:00:00 0.0 0.0
2021-05-17 09:00:00 0.0 0.0
2021-05-17 10:00:00 2.0 0.0
2021-05-17 11:00:00 2.0 0.0
2021-05-17 12:00:00 1.0 0.0
2021-05-17 13:00:00 1.0 1.0
2021-05-17 14:00:00 1.0 1.0
2021-05-17 15:00:00 1.0 0.0
2021-05-17 16:00:00 1.0 0.0
2021-05-17 17:00:00 1.0 0.0
2021-05-17 18:00:00 1.0 0.0
2021-05-17 19:00:00 1.0 0.0
2021-05-17 20:00:00 1.0 0.0
2021-05-17 21:00:00 1.0 0.0
2021-05-17 22:00:00 0.0 0.0
2021-05-17 23:00:00 0.0 0.0
2021-05-18 00:00:00 0.0 0.0
2021-05-18 01:00:00 0.0 0.0
2021-05-18 02:00:00 0.0 0.0
2021-05-18 03:00:00 0.0 0.0
2021-05-18 04:00:00 0.0 0.0
2021-05-18 05:00:00 0.0 0.0
2021-05-18 06:00:00 0.0 0.0
2021-05-18 07:00:00 0.0 0.0
2021-05-18 08:00:00 0.0 0.0
2021-05-18 09:00:00 0.0 0.0
2021-05-18 10:00:00 2.0 0.0
2021-05-18 11:00:00 2.0 0.0
2021-05-18 12:00:00 1.0 0.0
2021-05-18 13:00:00 1.0 1.0
2021-05-18 14:00:00 1.0 1.0
2021-05-18 15:00:00 1.0 0.0
2021-05-18 16:00:00 1.0 0.0
2021-05-18 17:00:00 1.0 0.0
2021-05-18 18:00:00 1.0 0.0
2021-05-18 19:00:00 1.0 0.0
2021-05-18 20:00:00 1.0 0.0
2021-05-18 21:00:00 1.0 0.0
2021-05-18 22:00:00 0.0 0.0
>>> dfd.index
Index([00:00:00, 01:00:00, 02:00:00, 03:00:00, 04:00:00, 05:00:00, 06:00:00,
07:00:00, 08:00:00, 09:00:00, 10:00:00, 11:00:00, 12:00:00, 13:00:00,
14:00:00, 15:00:00, 16:00:00, 17:00:00, 18:00:00, 19:00:00, 20:00:00,
21:00:00, 22:00:00, 23:00:00, 00:00:00, 01:00:00, 02:00:00, 03:00:00,
04:00:00, 05:00:00, 06:00:00, 07:00:00, 08:00:00, 09:00:00, 10:00:00,
11:00:00, 12:00:00, 13:00:00, 14:00:00, 15:00:00, 16:00:00, 17:00:00,
18:00:00, 19:00:00, 20:00:00, 21:00:00, 22:00:00, 23:00:00],
dtype='object')
>>> dft.index
Index([10:00:00, 11:00:00, 12:00:00, 13:00:00, 14:00:00, 15:00:00, 16:00:00,
17:00:00, 18:00:00, 19:00:00, 20:00:00, 21:00:00],
dtype='object', name='time')
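Note: on recent pandas versions (1.4 and later) the closed argument of pd.date_range has been replaced by inclusive, so if you hit a deprecation warning the index construction would read roughly:
dti = pd.date_range(start=pd.to_datetime(start_date),
                    end=pd.to_datetime(end_date) + pd.DateOffset(days=1),
                    inclusive="left", freq="1H", name="datetime")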

Group by Index of Row in Pandas

I want to group and sum every 7 rows together (hence getting a total for each week). There are currently two columns: one for the date and the other for a float.
1/22/2020 NaN
1/23/2020 0.0
1/24/2020 1.0
1/25/2020 0.0
1/26/2020 3.0
1/27/2020 0.0
1/28/2020 0.0
1/29/2020 0.0
1/30/2020 0.0
1/31/2020 2.0
2/1/2020 1.0
2/2/2020 0.0
2/3/2020 3.0
2/4/2020 0.0
2/5/2020 0.0
2/6/2020 0.0
2/7/2020 0.0
2/8/2020 0.0
2/9/2020 0.0
2/10/2020 0.0
2/11/2020 1.0
2/12/2020 0.0
2/13/2020 1.0
2/14/2020 0.0
2/15/2020 0.0
2/16/2020 0.0
2/17/2020 0.0
2/18/2020 0.0
2/19/2020 0.0
2/20/2020 0.0
... ...
2/28/2020 0.0
2/29/2020 8.0
3/1/2020 6.0
3/2/2020 23.0
3/3/2020 20.0
3/4/2020 31.0
3/5/2020 68.0
3/6/2020 45.0
3/7/2020 119.0
3/8/2020 114.0
3/9/2020 64.0
3/10/2020 194.0
3/11/2020 397.0
3/12/2020 452.0
3/13/2020 590.0
3/14/2020 710.0
3/15/2020 61.0
3/16/2020 1389.0
3/17/2020 1789.0
3/18/2020 906.0
3/19/2020 3068.0
3/20/2020 4009.0
3/21/2020 4017.0
3/23/2020 25568.0
3/24/2020 10074.0
3/25/2020 12043.0
3/26/2020 18058.0
3/27/2020 17822.0
3/28/2020 19825.0
3/29/2020 19408.0
Assuming your date column is called dt and your value column is val:
import numpy as np
# in case it's not already in datetime format:
df["dt"] = pd.to_datetime(df["dt"])
# your data looks sorted, but if it's not, sorting is a prerequisite here:
df = df.sort_values("dt")
df = df.groupby(np.arange(len(df)) // 7).agg({"dt": (min, max), "val": sum})
The aggregation for dt is done only so you can see the aggregated interval explicitly - it might be enough to just take min, for instance, or to drop it altogether.
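If the two-level column index produced by the dict of aggregations is awkward downstream, one way to flatten it (a quick optional sketch) is:
df.columns = ["_".join(col) for col in df.columns]   # ('dt', 'min') -> 'dt_min', etc.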
Set the date column as the index and use resample
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.resample('1W').sum()
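One caveat: '1W' bins by calendar week (ending on Sunday by default), which is not quite the same as chunks of 7 rows counted from the first date. If you want fixed 7-day windows anchored on the first timestamp instead, a rough sketch is:
df.resample('7D').sum()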

Pandas df.shift(axis=1) adds extra entries, why?

Here is a sample of the original table.
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83 11
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41 11
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91 11
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68 11
I need to shift the entire table over by 1 column to produce this:
z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68
Here is the code that ingests the data and tries to shift it over by one column
wind_rass_table_df=pd.read_csv(file_path, header=j+3, engine='python', nrows=77,sep=r'\s{2,}',skip_blank_lines=False,index_col=False)
wind_rass_table_df=wind_rass_table_df.shift(periods=1,axis=1)
Supposedly df.shift(axis=1) should shift the dataframe over by one column, but it does more than that - it does this:
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC]
0 NaN NaN 2.83 181.0 0.05 2.83 40.0 -0.20 -0.20 2.24
1 NaN NaN 2.41 184.8 0.20 2.40 50.0 -0.01 -0.01 2.47
2 NaN NaN 1.92 192.4 0.41 1.88 60.0 0.25 0.25 2.46
3 NaN NaN 1.75 201.7 0.65 1.63 70.0 0.50 0.50 2.47
The shift function has taken the first column, inserted it into the 7th column, shifted the 7th into the 8th, repeated the 8th, shifted the 9th over, and so on.
What is the correct way of shifting a dataframe over by one column?
Many thanks!
You can use iloc to drop the last column and rebuild the dataframe with the column labels shifted by one (note the .values - without it the constructor realigns on the column labels and fills NaN):
df = pd.DataFrame(data=df.iloc[:, :-1].values, columns=df.columns[1:], index=df.index)
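An equivalent variant that stays in pandas (a sketch; set_axis returns a relabelled copy on modern pandas) is:
df = df.iloc[:, :-1].set_axis(df.columns[1:], axis=1)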

Pandas dataframe column math when a row condition is met

I have a dataframe containing the following data. I would like to query the age column of each dataframe (1-4) for values between 295.0 and 305.0. For each dataframe there will be a single age value in this range and a corresponding subsidence value. I would like to take the subsidence value and add it to the remaining values in the dataframe.
For instance, in the first dataframe, at age 300.0, subsidence = 274.057861. In this case 274.057861 would be added to the rest of the subsidence values in dataframe 1.
In the second dataframe, at age 299.0, subsidence = 77.773720, so 77.773720 would be added to the rest of the subsidence values in dataframe 2, and so on. Is it possible to do this easily in pandas, or am I better off working towards an alternative solution?
Thanks :)
1 2 3 4 \
age subsidence age subsidence age subsidence age
0 0.0 -201.538712 0.0 -235.865433 0.0 134.728821 0.0
1 10.0 -77.446548 8.0 -102.183365 10.0 88.796074 10.0
2 20.0 44.901043 18.0 35.316868 20.0 35.871178 20.0
3 31.0 103.172806 28.0 98.238434 30.0 -17.901653 30.0
4 41.0 124.625687 38.0 124.719254 40.0 -13.381897 40.0
5 51.0 122.877541 48.0 130.725235 50.0 -25.396996 50.0
6 61.0 138.810898 58.0 140.301117 60.0 -37.057205 60.0
7 71.0 119.818176 68.0 137.433670 70.0 -11.587639 70.0
8 81.0 77.867607 78.0 96.285652 80.0 21.854662 80.0
9 91.0 33.612885 88.0 32.740803 90.0 67.754501 90.0
10 101.0 15.885051 98.0 8.626043 100.0 150.172699 100.0
11 111.0 118.089211 109.0 88.812439 100.0 150.172699 100.0
12 121.0 247.301956 119.0 212.000061 110.0 124.367874 110.0
13 131.0 268.748627 129.0 253.204819 120.0 157.066010 120.0
14 141.0 231.799255 139.0 292.828461 130.0 145.811783 130.0
15 151.0 259.626343 149.0 260.067993 140.0 175.388763 140.0
16 161.0 288.704651 159.0 240.051605 150.0 265.435791 150.0
17 171.0 249.121857 169.0 203.727097 160.0 336.471924 160.0
18 181.0 339.038055 179.0 245.738480 170.0 283.483582 170.0
19 191.0 395.920410 189.0 318.751160 180.0 381.575500 180.0
20 201.0 404.843445 199.0 338.245209 190.0 491.534424 190.0
21 211.0 461.865784 209.0 418.997559 200.0 495.025604 200.0
22 221.0 518.710632 219.0 446.496216 200.0 495.025604 200.0
23 231.0 483.963867 224.0 479.213287 210.0 571.982361 210.0
24 239.0 445.292389 229.0 492.352905 220.0 611.698608 220.0
25 249.0 396.609497 239.0 445.322144 230.0 645.545776 230.0
26 259.0 321.553558 249.0 429.429932 240.0 596.046265 240.0
27 269.0 306.150177 259.0 297.355103 250.0 547.157654 250.0
28 279.0 259.717468 269.0 174.210785 260.0 457.071472 260.0
29 289.0 301.114410 279.0 114.175957 270.0 438.705170 270.0
30 300.0 274.057861 289.0 91.768898 280.0 397.985535 280.0
31 310.0 216.760361 299.0 77.773720 290.0 426.858276 290.0
32 320.0 192.317093 309.0 73.767090 300.0 410.508331 300.0
33 330.0 179.511917 319.0 63.295345 300.0 410.508331 300.0
34 340.0 231.126053 329.0 -4.296405 310.0 355.303558 310.0
35 350.0 142.894958 339.0 -62.745190 320.0 284.932892 320.0
36 360.0 51.547047 350.0 -60.224789 330.0 251.817078 330.0
37 370.0 -39.064964 360.0 -85.826874 340.0 302.303925 340.0
38 380.0 -54.111374 370.0 -81.139206 350.0 207.799942 350.0
39 390.0 -68.999535 380.0 -40.080212 360.0 77.729439 360.0
40 400.0 -47.595322 390.0 -29.945852 370.0 -127.037209 370.0
41 410.0 13.159509 400.0 -26.656607 380.0 -109.327545 380.0
42 NaN NaN 410.0 -13.723764 390.0 -127.160942 390.0
43 NaN NaN NaN NaN 400.0 -61.404510 400.0
44 NaN NaN NaN NaN 410.0 13.058900 410.0
For the first dataframe, pull out the reference subsidence value in the 295-305 age window and add it to the whole column:
ref = df1.loc[(df1.age > 295) & (df1.age < 305), 'subsidence'].values[0]
df1['subsidence'] = df1['subsidence'] + ref
You need to update each dataframe accordingly.
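To apply the same step to all four frames in one go, a short sketch (assuming they are collected in a hypothetical list called frames) could look like this:
# frames is a hypothetical list holding dataframes 1-4
for df in frames:
    ref = df.loc[(df['age'] > 295) & (df['age'] < 305), 'subsidence'].values[0]
    df['subsidence'] = df['subsidence'] + ref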

Pandas - ufunc 'subtract' did not contain a loop with signature matching type

I have this piece of code:
self.value = 0.8
for col in df.ix[:, 'value1':'value3']:
    df = df.iloc[abs(df[col] - self.value).argsort()]
which works perfectly as part of the main() function. At return, it prints:
artist track pos neg neu
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
6 Radiohead Codex 0.00 1.00 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
However, when I import this function as part of a module, even though I'm passing and printing the same self.value of 0.8, I get the following error:
df = df.iloc[(df[col] - self.flavor).argsort()]
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 721, in wrapper
result = wrap_results(safe_na_op(lvalues, rvalues))
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 682, in safe_na_op
return na_op(lvalues, rvalues)
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 668, in na_op
result[mask] = op(x[mask], y)
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
Why is that? What is going on?
pd.DataFrame.ix has been deprecated; you should stop using it.
Your use of the label slice 'value1':'value3' is dangerous, as it can include columns you didn't expect if your columns aren't positioned in the order you thought.
df = pd.DataFrame(
    [['a', 'b', 1, 2, 3]],
    columns='artist track v1 v2 v3'.split()
)

list(df.loc[:, 'v1':'v3'])
['v1', 'v2', 'v3']
But rearrange the columns and
list(df.loc[:, ['v1', 'v2', 'artist', 'v3', 'track']].loc[:, 'v1':'v3'])
['v1', 'v2', 'artist', 'v3']
You got 'artist' in the list. And the 'artist' column is of type string, which can't be subtracted from, or by, an integer or float.
df['artist'] - df['v1']
> TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U21') dtype('<U21') dtype('<U21')
Setup
Shuffle df
df = df.sample(frac=1)
df
artist track pos neg neu
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
6 Radiohead Codex 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
Solution
Use np.lexsort:
import numpy as np

value = 0.8
v = df[['pos', 'neg', 'neu']].values
df.iloc[np.lexsort(np.abs(v - value).T)]
artist track pos neg neu
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
6 Radiohead Codex 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0