I have a df like this:
df = pd.DataFrame({'a': ['2019-09-01 17:00:00', '2019-09-01 17:15:00','2019-09-01 17:30:00','2019-09-01 17:45:00','2019-09-01 18:00:00', '2019-09-01 18:15:00','2019-09-01 18:30:00','2019-09-01 18:45:00'],
'b': [432.6, 427.56, 424.2, 433.44,450.24,447.72,452.76,453.6]})
And I want to create a loop to calculate the mean of the values for every 4 items like this:
When i = 0 (first position)
mean0 = df.loc[0:3,'b'].mean()
When i = 1:
mean1 = df.loc[4:7,'b'].mean()
And so on. I've tried to create something like this:
for i in df['b]:
mean[i] = (df[i,'b'] + df.loc[(i+1),'b'] + df.loc[(i+2),'b'])+df.loc[(i+3),'b'])).mean()
But I always get an error message (KeyError: 655.7077670000001) or NaN values.
Thanks for the help.
Try this:
>>> df.groupby(df.index // 4).mean(numeric_only=True)
b
0 429.45
1 451.08
Or maybe
>>> df['mean'] = df.groupby(df.index // 4)['b'].transform('mean')
>>> df
a b mean
0 2019-09-01 17:00:00 432.60 429.45
1 2019-09-01 17:15:00 427.56 429.45
2 2019-09-01 17:30:00 424.20 429.45
3 2019-09-01 17:45:00 433.44 429.45
4 2019-09-01 18:00:00 450.24 451.08
5 2019-09-01 18:15:00 447.72 451.08
6 2019-09-01 18:30:00 452.76 451.08
7 2019-09-01 18:45:00 453.60 451.08
This solution is efficient because it is vectorized:
mean=list(df.groupby(df.index//4)['b'].mean())
And if you want to keep exploring with your own loop-based method, here is the code:
n = df.shape[0] // 4
mean = [0] * n
for i in range(n):
    mean[i] = (df.loc[i*4, 'b'] + df.loc[i*4 + 1, 'b'] + df.loc[i*4 + 2, 'b'] + df.loc[i*4 + 3, 'b']) / 4
Output:
[429.45000000000005, 451.08000000000004]
Your code was giving you an error because .loc was missing here: df[i,'b'] => df.loc[i*4,'b'].
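To make the failure concrete, here is a minimal sketch showing that iterating over a Series yields its values, not its row labels — which is why the KeyError names a float value rather than a position:

```python
import pandas as pd

df = pd.DataFrame({'b': [432.6, 427.56, 424.2, 433.44]})

# iterating a Series walks over its *values*...
iterated = list(df['b'])

# ...whereas the row labels you would need for .loc live on the index
labels = list(df.index)
```

So `for i in df['b']` hands you numbers like 432.6, and `df.loc[432.6, 'b']` has no such label, hence the KeyError.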
Maybe you want to group every 4 values because you actually want to resample your dataframe by hour.
Try:
out = df.groupby(pd.to_datetime(df['a']).dt.floor('H'))['b'].mean().reset_index()
print(out)
# Output
a b
0 2019-09-01 17:00:00 429.45
1 2019-09-01 18:00:00 451.08
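Assuming column a is a regular 15-minute grid (as in the sample), the same hourly means can also be computed with resample; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': ['2019-09-01 17:00:00', '2019-09-01 17:15:00',
                         '2019-09-01 17:30:00', '2019-09-01 17:45:00',
                         '2019-09-01 18:00:00', '2019-09-01 18:15:00',
                         '2019-09-01 18:30:00', '2019-09-01 18:45:00'],
                   'b': [432.6, 427.56, 424.2, 433.44,
                         450.24, 447.72, 452.76, 453.6]})

# use the parsed timestamps as the index, then average each clock hour
out = (df.set_index(pd.to_datetime(df['a']))
         .resample('h')['b']
         .mean())
```

Unlike `df.index // 4`, this keeps working if a 15-minute row is missing or duplicated, since grouping is by time rather than by row position.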
Related
I am looking for an efficient way to summarize rows (in groupby-style) that fall in a certain time period, using Pandas in Python. Specifically:
The time period is given in dataframe A: there is a column for "start_timestamp" and a column for "end_timestamp", specifying the start and end time of the time period that is to be summarized. Hence, every row represents one time period that is meant to be summarized.
The rows to be summarized are given in dataframe B: there is a column for "timestamp" and a column "metric" with the values to be aggregated (with mean, max, min etc.). In reality, there might be more than just 1 "metric" column.
For every row's time period from dataframe A, I want to summarize the values of the "metric" column in dataframe B that fall in the given time period. Hence, the number of rows of the output dataframe will be exactly the same as the number of rows of dataframe A.
Any hints would be much appreciated.
Additional Requirements
The number of rows in dataframe A and dataframe B may be large (several thousand rows).
There may be many metrics to summarize in dataframe B (~100).
I want to avoid solving this problem with a for loop (as in the reproducible example below).
Reproducible Example
Input Dataframe A
# Input dataframe A
df_a = pd.DataFrame({
    "start_timestamp": ["2022-08-09 00:30", "2022-08-09 01:00", "2022-08-09 01:15"],
    "end_timestamp": ["2022-08-09 03:30", "2022-08-09 04:00", "2022-08-09 08:15"]
})
df_a.loc[:, "start_timestamp"] = pd.to_datetime(df_a["start_timestamp"])
df_a.loc[:, "end_timestamp"] = pd.to_datetime(df_a["end_timestamp"])
print(df_a)
      start_timestamp       end_timestamp
0 2022-08-09 00:30:00 2022-08-09 03:30:00
1 2022-08-09 01:00:00 2022-08-09 04:00:00
2 2022-08-09 01:15:00 2022-08-09 08:15:00
Input Dataframe B
# Input dataframe B
df_b = pd.DataFrame({
    "timestamp": [
        "2022-08-09 01:00",
        "2022-08-09 02:00",
        "2022-08-09 03:00",
        "2022-08-09 04:00",
        "2022-08-09 05:00",
        "2022-08-09 06:00",
        "2022-08-09 07:00",
        "2022-08-09 08:00",
    ],
    "metric": [1, 2, 3, 4, 5, 6, 7, 8],
})
df_b.loc[:, "timestamp"] = pd.to_datetime(df_b["timestamp"])
print(df_b)
            timestamp  metric
0 2022-08-09 01:00:00       1
1 2022-08-09 02:00:00       2
2 2022-08-09 03:00:00       3
3 2022-08-09 04:00:00       4
4 2022-08-09 05:00:00       5
5 2022-08-09 06:00:00       6
6 2022-08-09 07:00:00       7
7 2022-08-09 08:00:00       8
Expected Output Dataframe
# Expected output dataframe
df_target = df_a.copy()
for i, row in df_target.iterrows():
    condition = (df_b["timestamp"] >= row["start_timestamp"]) & (df_b["timestamp"] <= row["end_timestamp"])
    df_b_sub = df_b.loc[condition, :]
    df_target.loc[i, "metric_mean"] = df_b_sub["metric"].mean()
    df_target.loc[i, "metric_max"] = df_b_sub["metric"].max()
    df_target.loc[i, "metric_min"] = df_b_sub["metric"].min()
print(df_target)
      start_timestamp       end_timestamp  metric_mean  metric_max  metric_min
0 2022-08-09 00:30:00 2022-08-09 03:30:00          2.0         3.0         1.0
1 2022-08-09 01:00:00 2022-08-09 04:00:00          2.5         4.0         1.0
2 2022-08-09 01:15:00 2022-08-09 08:15:00          5.0         8.0         2.0
You can use pd.IntervalIndex and contains to create a dataframe of selected metric values, then compute the mean, max, and min:
ai = pd.IntervalIndex.from_arrays(
    df_a["start_timestamp"], df_a["end_timestamp"], closed="both"
)
t = df_b.apply(
    lambda x: pd.Series(ai.contains(x["timestamp"]) * x["metric"]), axis=1
)
df_a[["metric_mean", "metric_max", "metric_min"]] = t[t.ne(0)].agg(
    ["mean", "max", "min"]
).T.values
print(df_a)
start_timestamp end_timestamp metric_mean metric_max metric_min
0 2022-08-09 00:30:00 2022-08-09 03:30:00 2.0 3.0 1.0
1 2022-08-09 01:00:00 2022-08-09 04:00:00 2.5 4.0 1.0
2 2022-08-09 01:15:00 2022-08-09 08:15:00 5.0 8.0 2.0
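A loop-free variant, sketched below, broadcasts the interval bounds against the timestamps with NumPy and masks with NaN rather than multiplying by the boolean mask — multiplying would silently drop a genuine metric value of 0. It builds a (len(df_a), len(df_b)) matrix, so it suits the stated sizes (a few thousand rows) but not huge frames:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({
    "start_timestamp": pd.to_datetime(["2022-08-09 00:30", "2022-08-09 01:00", "2022-08-09 01:15"]),
    "end_timestamp":   pd.to_datetime(["2022-08-09 03:30", "2022-08-09 04:00", "2022-08-09 08:15"]),
})
df_b = pd.DataFrame({
    "timestamp": pd.to_datetime([f"2022-08-09 0{h}:00" for h in range(1, 9)]),
    "metric": [1, 2, 3, 4, 5, 6, 7, 8],
})

# boolean matrix of shape (len(df_a), len(df_b)):
# row i marks which df_b rows fall inside interval i (bounds inclusive)
ts = df_b["timestamp"].to_numpy()
inside = ((ts >= df_a["start_timestamp"].to_numpy()[:, None])
          & (ts <= df_a["end_timestamp"].to_numpy()[:, None]))

# replace out-of-interval values with NaN so 0 stays a valid metric value
vals = np.where(inside, df_b["metric"].to_numpy(), np.nan)
df_a["metric_mean"] = np.nanmean(vals, axis=1)
df_a["metric_max"] = np.nanmax(vals, axis=1)
df_a["metric_min"] = np.nanmin(vals, axis=1)
```

With many metric columns, the `inside` matrix is built once and reused per metric.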
Check the below code using SQLite3:
import sqlite3
conn = sqlite3.connect(':memory:')
df_a.to_sql('df_a',con=conn, index=False)
df_b.to_sql('df_b',con=conn, index=False)
pd.read_sql("""SELECT df_a.start_timestamp, df_a.end_timestamp
, AVG(df_b.metric) as metric_mean
, MAX(df_b.metric) as metric_max
, MIN(df_b.metric) as metric_min
FROM
df_a INNER JOIN df_b
ON df_b.timestamp BETWEEN df_a.start_timestamp AND df_a.end_timestamp
GROUP BY df_a.start_timestamp, df_a.end_timestamp""", con=conn)
Output:
       start_timestamp        end_timestamp  metric_mean  metric_max  metric_min
0  2022-08-09 00:30:00  2022-08-09 03:30:00          2.0           3           1
1  2022-08-09 01:00:00  2022-08-09 04:00:00          2.5           4           1
2  2022-08-09 01:15:00  2022-08-09 08:15:00          5.0           8           2
I have a dataframe with two date columns (format: YYYY-MM-DD). I want to create one row for each year between those two dates. The rows would be identical with a new column which specifies the year. For example, if the dates are 2018-01-01 and 2020-01-01 then there would be three rows with same data and a new column with values 2018, 2019, and 2020.
You can use a custom function to compute the range then explode the column:
# Ensure to have datetime
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
# Create the new column
date_range = lambda x: range(x['date1'].year, x['date2'].year+1)
df = df.assign(year=df.apply(date_range, axis=1)).explode('year', ignore_index=True)
Output:
>>> df
date1 date2 year
0 2018-01-01 2020-01-01 2018
1 2018-01-01 2020-01-01 2019
2 2018-01-01 2020-01-01 2020
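One caveat worth knowing: explode leaves the new year column with object dtype, so an explicit astype may be wanted afterwards. A self-contained sketch with a hypothetical one-row frame:

```python
import pandas as pd

df = pd.DataFrame({'date1': ['2018-01-01'], 'date2': ['2020-01-01']})
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# one range of years per row, exploded into one row per year
year_span = lambda r: range(r['date1'].year, r['date2'].year + 1)
df = df.assign(year=df.apply(year_span, axis=1)).explode('year', ignore_index=True)

df['year'] = df['year'].astype(int)  # explode leaves 'year' as object dtype
```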
This should work for you:
import pandas
# some sample data
df = pandas.DataFrame(data={
    'foo': ['bar', 'baz'],
    'date1': ['2018-01-01', '2022-01-01'],
    'date2': ['2020-01-01', '2017-01-01']
})
# cast date columns to datetime
for col in ['date1', 'date2']:
    df[col] = pandas.to_datetime(df[col])
# reset index to ensure that selection by length of index works
df = df.reset_index(drop=True)
# compute the range of years between the two dates, then iterate through the
# resulting series to unpack each range and append a new row with the original
# data and the year
for i, years in df.apply(
    lambda x: range(
        min(x.date1, x.date2).year,
        max(x.date1, x.date2).year + 1
    ),
    axis='columns'
).items():
    for year in years:
        new_index = len(df.index)
        df.loc[new_index] = df.loc[i].values
        df.loc[new_index, 'year'] = int(year)
output:
>>> df
foo date1 date2 year
0 bar 2018-01-01 2020-01-01 NaN
1 baz 2022-01-01 2017-01-01 NaN
2 bar 2018-01-01 2020-01-01 2018.0
3 bar 2018-01-01 2020-01-01 2019.0
4 bar 2018-01-01 2020-01-01 2020.0
5 baz 2022-01-01 2017-01-01 2017.0
6 baz 2022-01-01 2017-01-01 2018.0
7 baz 2022-01-01 2017-01-01 2019.0
8 baz 2022-01-01 2017-01-01 2020.0
9 baz 2022-01-01 2017-01-01 2021.0
10 baz 2022-01-01 2017-01-01 2022.0
I am not quite sure what is returned when using a condition inside DataFrame.loc. I have the following line of code:
df1.loc[(df1['Date'] >= df2['StartDate']) & (df1['Date'] <= df2['EndDate'])]
From what I've seen, this line returns all the rows that meet the condition above. Is that correct?
Here is the output you can expect:
df1 = pd.read_csv("file1.csv")
print(df1)
Date
0 2019-07-19
1 2019-07-21
2 2019-07-31
df2 = pd.read_csv("file2.csv")
print(df2)
StartDate EndDate
0 2019-07-01 2019-07-10
1 2019-07-30 2019-07-20
2 2019-07-31 2019-07-31
df = df1.loc[(df1['Date'] >= df2['StartDate']) & (df1['Date'] <= df2['EndDate'])]
print(df)
Date
2 2019-07-31
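Yes, with one caveat worth spelling out: the comparison is index-aligned, so row i of df1 is compared only against row i of df2, never against the other rows. A sketch reproducing the frames above:

```python
import pandas as pd

df1 = pd.DataFrame({'Date': pd.to_datetime(['2019-07-19', '2019-07-21', '2019-07-31'])})
df2 = pd.DataFrame({'StartDate': pd.to_datetime(['2019-07-01', '2019-07-30', '2019-07-31']),
                    'EndDate':   pd.to_datetime(['2019-07-10', '2019-07-20', '2019-07-31'])})

# row-by-row (index-aligned) comparison: only row 2 satisfies both conditions
mask = (df1['Date'] >= df2['StartDate']) & (df1['Date'] <= df2['EndDate'])
out = df1.loc[mask]
```

If you instead want every df1 row checked against every df2 interval, an interval-join approach (like the IntervalIndex or SQL answers above) is needed.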
I hope with this additional information someone can find time to help me with this new issue.
Sample data here --> file
'Date' as index (datetime.date)
As I said, I'm trying to select ranges in the dataframe every time x falls below the interval [-190, 0], and to create a new dataframe with a new column which is the sum of the selected rows, keeping the last encountered date as index.
EDIT: The "loop" starts at the first date (the beginning of the df); when a value below -190 is found, it is summed together with the following ones, and so on.
BUT I still get values which are inside the interval (-190, 0).
Example and code below.
Thanks
import pandas as pd
df = pd.read_csv('http://www.sharecsv.com/s/0525f76a07fca54717f7962d58cac692/sample_file.csv', sep = ';')
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3
##### output #####
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 11:28:00 -154.35
3 2019-01-02 12:08:00 -4706.87
4 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-29 16:58:00 -0.38
833 2019-09-30 17:08:00 -129365.71
834 2019-09-30 17:13:00 -157.05
835 2019-10-01 08:58:00 -111911.98
########## expected output #############
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 12:08:00 -4706.87
3 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-30 17:08:00 -129365.71
833 2019-10-01 08:58:00 -111911.98
...
...
Use Series.where with Series.between to replace Date values with NaN wherever x falls outside the range, back-fill the missing dates, and aggregate with sum. Next, filter out rows whose sum falls inside the range with boolean indexing, and finally use DataFrame.resample, casting the Series to a one-column DataFrame with Series.to_frame:
#range -190, 0
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3 = df3[~df3['x'].between(-190, 0)]
df3 = df3.resample('D', on='Date')['x'].sum().to_frame()
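The where/bfill labelling step can be seen on a tiny self-contained frame (values made up for illustration): out-of-range rows lose their date and inherit, via back-fill, the date of the next in-range row, so the groupby sums each run under that date:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01 10:00', '2019-01-01 11:00', '2019-01-01 13:48']),
    'x': [-1000.0, -2000.0, -100.0],
})

# keep Date only where x is inside [-190, 0]; back-fill labels the
# preceding out-of-range rows with the next in-range row's date
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
out = df.groupby('Date', as_index=False)['x'].sum()
```

Here all three rows collapse into one group labelled 2019-01-01 13:48 with sum -3100.0.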
I have a data frame with two date columns, a start and an end date. How do I find the number of weekends between the start and end dates using pandas or Python datetimes?
I know that pandas has DatetimeIndex.weekday, which returns values 0 to 6 for each day of the week, starting Monday.
# create a data-frame
import pandas as pd
df = pd.DataFrame({'start_date':['4/5/19','4/5/19','1/5/19','28/4/19'],
'end_date': ['4/5/19','5/5/19','4/5/19','5/5/19']})
# convert objects to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
# Trying to get the date index between dates as a prelim step but fails
pd.DatetimeIndex(df['end_date'] - df['start_date']).weekday
I'm expecting the result to be this: (weekend_count includes both start and end dates)
start_date end_date weekend_count
4/5/2019 4/5/2019 1
4/5/2019 5/5/2019 2
1/5/2019 4/5/2019 1
28/4/2019 5/5/2019 3
IIUC
df['New']=[pd.date_range(x,y).weekday.isin([5,6]).sum() for x , y in zip(df.start_date,df.end_date)]
df
start_date end_date New
0 2019-05-04 2019-05-04 1
1 2019-05-04 2019-05-05 2
2 2019-05-01 2019-05-04 1
3 2019-04-28 2019-05-05 3
Try with:
import numpy as np

df['weekend_count'] = ((df.end_date - df.start_date).dt.days + 1) - np.busday_count(
    df.start_date.dt.date, df.end_date.dt.date)
print(df)
start_date end_date weekend_count
0 2019-05-04 2019-05-04 1
1 2019-05-04 2019-05-05 2
2 2019-05-01 2019-05-04 1
3 2019-04-28 2019-05-05 3
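A third option, sketched here on the same sample data, counts the weekend days directly with np.busday_count's weekmask (adding one day so the end date is counted inclusively, since busday_count excludes the end):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start_date': ['4/5/19', '4/5/19', '1/5/19', '28/4/19'],
                   'end_date':   ['4/5/19', '5/5/19', '4/5/19', '5/5/19']})
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)

# treat only Saturday/Sunday as "business" days and count them
df['weekend_count'] = np.busday_count(
    df['start_date'].values.astype('datetime64[D]'),
    (df['end_date'] + pd.Timedelta(days=1)).values.astype('datetime64[D]'),
    weekmask='Sat Sun')
```

This stays vectorized like the busday_count answer above while avoiding the per-row date_range of the list-comprehension answer.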