Pandas resample by integration over time with non-equidistant data

I have a DataFrame with a DatetimeIndex whose timestamps are not equidistant. I want to get the mean for each hour, but resample().mean() does not take the time distance between the timestamps into account.
How can I resample a DataFrame with a DatetimeIndex so that the values in a column are integrated over time?
Given the following data:
time   data
00:15  5
00:55  1
00:56  1
00:57  1
resample().mean() would give 2 (the plain average of 5, 1, 1 and 1), but the value 1 was only set for 3 of the 60 minutes.
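One possible approach is a time-weighted mean: weight every sample by how long it stays valid before averaging per hour. The sketch below is only an illustration under two assumptions that are not stated in the question: each value holds until the next timestamp (a step signal), and the last value holds until the end of its hour. The function name time_weighted_hourly_mean and the date 2021-01-01 are made up for the example.

```python
import pandas as pd

def time_weighted_hourly_mean(s: pd.Series) -> pd.Series:
    """Hourly mean of a step signal, weighted by how long each value is held.

    Assumes a sorted, unique DatetimeIndex and that every sample stays valid
    until the next timestamp (hypothetical helper, not a pandas function).
    """
    # Insert the hour boundaries so that no holding period spans two hours.
    hours = pd.date_range(s.index.min().floor("h"), s.index.max().ceil("h"), freq="h")
    idx = s.index.union(hours)
    held = s.reindex(idx).ffill()

    # Seconds from each timestamp to the following one.
    seconds = idx.to_series().diff().shift(-1).dt.total_seconds()

    # Drop the period before the first sample and the final boundary.
    valid = held.notna() & seconds.notna()
    weighted_sum = (held * seconds)[valid].resample("h").sum()
    covered = seconds[valid].resample("h").sum()
    return weighted_sum / covered

# The sample data from the question, placed on an arbitrary date.
s = pd.Series(
    [5, 1, 1, 1],
    index=pd.to_datetime(
        ["2021-01-01 00:15", "2021-01-01 00:55", "2021-01-01 00:56", "2021-01-01 00:57"]
    ),
)

plain = s.resample("h").mean()           # unweighted: (5 + 1 + 1 + 1) / 4
weighted = time_weighted_hourly_mean(s)  # 5 held for 40 min, 1 held for 5 min
print(plain)
print(weighted)
```

The time before the first sample (00:00 to 00:15) is treated as unknown and excluded from the average. If the signal should instead be interpolated linearly between samples, the weights would need a trapezoidal rule rather than this step-wise one.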

Related

How to convert duration strings to seconds?

I have such a column in a pandas dataframe:
duration
1 day 22:12:15.778543
2 days 10:09:07.118723
00:18:23.985112
I would like to convert this duration to seconds.
How can I do this? I am not sure if this is possible because of the special string format I got (1 day, 2 days etc.)?
Use to_timedelta with Series.dt.total_seconds:
df['s'] = pd.to_timedelta(df['duration']).dt.total_seconds()
print(df)
                 duration              s
0   1 day 22:12:15.778543  166335.778543
1  2 days 10:09:07.118723  209347.118723
2         00:18:23.985112    1103.985112

pandas: how to group by time intervals of varying length?

I know that it is possible to group your data by time intervals of the same length by using the function resample. But how can I group by time intervals of custom length (i.e. irregular time intervals)?
Here is an example:
Say we have a dataframe with time values, like this:
import numpy as np
import pandas as pd
rng = pd.date_range(start='2015-02-11', periods=7, freq='M')
df = pd.DataFrame({ 'Date': rng, 'Val': np.random.randn(len(rng)) })
And we have the following time intervals:
2015-02-12 ----- 2015-05-10
2015-05-10 ----- 2015-08-20
2015-08-20 ----- 2016-01-01
It is clear that the rows with index 0, 1, 2 belong to the first time interval, the rows with index 3, 4, 5 belong to the second time interval, and the row with index 6 belongs to the last time interval.
My question is: how do I group these rows according to those specific time intervals, in order to perform aggregate functions (e.g. mean) on them?
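One way this could be done is with pd.cut, which bins the dates by explicit interval edges so that the result can be passed to groupby. This is only a sketch: the fixed values 1 to 7 replace the random numbers so the result is reproducible, and pd.offsets.MonthEnd() stands in for freq='M' (both produce month-end dates). The column name Interval is an assumption made for the example.

```python
import pandas as pd

# Month-end dates 2015-02-28 ... 2015-08-31, as in the question.
rng = pd.date_range(start="2015-02-11", periods=7, freq=pd.offsets.MonthEnd())
df = pd.DataFrame({"Date": rng, "Val": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})

# Edges of the irregular intervals from the question.
edges = pd.to_datetime(["2015-02-12", "2015-05-10", "2015-08-20", "2016-01-01"])

# pd.cut assigns every date to the interval (left, right] that contains it.
df["Interval"] = pd.cut(df["Date"], bins=edges)

# Aggregate per interval; observed=True skips intervals without rows.
result = df.groupby("Interval", observed=True)["Val"].mean()
print(result)
```

By default the intervals are closed on the right, so a date that falls exactly on an edge lands in the earlier interval; passing right=False to pd.cut would flip that.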

Generate dataframe with timeseries index starting today and fixed interval

I'm trying to generate a pandas DataFrame with a time series index at a fixed interval. As input parameters I need to provide a start and an end date. The challenge is that the generated index starts either at the month start with freq='3MS' or at the month end with freq='3M'. The interval cannot be defined as a number of days, because the whole year needs to have exactly 4 periods and the index needs to start on the given start date.
The expected output should be in this case:
2020-10-05
2021-01-05
2021-04-05
2021-10-05
Any ideas appreciated.
interpolated = pd.DataFrame(index=pd.date_range('2020-10-05', '2045-10-05', freq='3M'), columns=['dummy'])
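A possible way to keep the day of the start date is to pass a pd.DateOffset(months=3) as freq instead of a month-anchored alias such as '3M' or '3MS'. The sketch below uses a shorter end date than the attempt above purely to keep the output small. Note that it produces strictly quarterly dates, so it also yields 2021-07-05, which does not appear in the expected output listed in the question.

```python
import pandas as pd

# A calendar-aware offset of 3 months keeps the day of month of the start date.
idx = pd.date_range("2020-10-05", "2021-10-05", freq=pd.DateOffset(months=3))

# Empty frame on that index; 'dummy' is the column name from the question.
interpolated = pd.DataFrame(index=idx, columns=["dummy"])
print(interpolated.index)
```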

Is there a way to fix or bypass weird time formats in a specific column in a dataframe?

I am working with a SLURM dataset in Pandas that has time formats like so in the 'Elapsed' column:
00:00:00
00:26:51
However, sometimes there are sections that are greater than 24 hours, and it displays it like so:
1-00:02:00
3-01:25:02
I want to find the mean of the entire column, but to_timedelta mishandles the entries that are longer than 24 hours, as shown above. One example is this:
Before to_timedelta: 3-01:25:02
after to_timedelta: -13 days +10:34:58
I cannot simply convert the column with a single fixed format, because entries that are not greater than 24 hours have no leading day count (they are not written as, for example, 0-20:00:00).
If there is a way to do that, I believe it would be the easiest method.
Is there a way to fix this conversion, or are there any other ideas for approaching this?
One way around this is to replace - with days:
pd.to_timedelta(df['time'].str.replace('-','days'))
Output (for the 4 lines above):
0   0 days 00:00:00
1   0 days 00:26:51
2   1 days 00:02:00
3   3 days 01:25:02
Name: time, dtype: timedelta64[ns]
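To get the mean the question asks for, the converted column can be averaged directly. This is a minimal sketch that uses the four sample values from the question in a Series named Elapsed. Replacing the dash with ' days ' (with surrounding spaces) and passing regex=False are choices made here for readability, not part of the answer above.

```python
import pandas as pd

# The four sample entries from the question.
elapsed = pd.Series(
    ["00:00:00", "00:26:51", "1-00:02:00", "3-01:25:02"], name="Elapsed"
)

# Turn the SLURM 'D-HH:MM:SS' form into 'D days HH:MM:SS' before parsing.
td = pd.to_timedelta(elapsed.str.replace("-", " days ", regex=False))

# Timedelta columns support mean() directly.
mean_elapsed = td.mean()
print(td)
print(mean_elapsed)
```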

Timeseries resample error - none of Dateindex in column pandas

Please excuse obvious errors - still in the learning process.
I am trying to do a simple time series plot of my data, which has a frequency of 15 minutes. The idea is to plot monthly means, starting by resampling the data to hourly means and keeping only those hourly means that have at least 1 observation in the interval. There are subsequent conditions for the daily and monthly means.
This would be relatively simple if the following error did not crop up: "None of [DatetimeIndex(['2016-01-01 05:00:00', '2016-01-01 05:15:00',\n....2016-12-31 16:15:00'],\n dtype='datetime64[ns]', length=103458, freq=None)] are in the [columns]"
This is my code:
#Original dataframe
Date value
0 1/1/2016 0:00 405.22
1 1/1/2016 0:15 418.56
Date object
value object
dtype: object
#Conversion of 'Date' to datetime and of the 'value' column to numeric/float values.
df.Date = pd.to_datetime(df.Date,errors='coerce')
year=df.Date.dt.year
df['Year'] = df['Date'].map(lambda x: x.year )
df.value = pd.to_numeric(df.value,errors='coerce' )
Date datetime64[ns]
value float64
Year int64
dtype: object
Date value Year
0 2016-01-01 00:00:00 405.22 2016
1 2016-01-01 00:15:00 418.56 2016
df=df.set_index(Date)
diurnal1 = df[df['Date']].resample('h').mean().count()>=2
**(line of error)**
diurnal_mean_1 = diurnal1.mean()[diurnal1.count() >= 1]
(the code follows)
Any help in solving the error will be appreciated.
I think you want df=df.set_index('Date') (Date is a string). Also I would move the conversions over into the constructor if possible after you get it working.
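For reference, here is a hedged sketch of how the hourly step could look once the index is set. The error message itself appears to come from df[df['Date']], which treats the datetime values as column labels. After set_index('Date') the dates live in the index, so resample can be called on the frame directly. The three sample rows are made up for illustration (the first two mirror the question), and the names hourly and hourly_mean are assumptions.

```python
import pandas as pd

# Made-up sample: two readings in hour 0, none in hour 1, one in hour 2.
df = pd.DataFrame(
    {
        "Date": ["1/1/2016 0:00", "1/1/2016 0:15", "1/1/2016 2:00"],
        "value": ["405.22", "418.56", "410.00"],
    }
)

# Convert the text columns, then move the dates into the index.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
df["value"] = pd.to_numeric(df["value"], errors="coerce")
df = df.set_index("Date")  # note the quotes: 'Date' is a column label

# Hourly mean plus the number of valid observations per hour.
hourly = df["value"].resample("h").agg(["mean", "count"])

# Keep only the hours that contain at least 1 observation.
hourly_mean = hourly.loc[hourly["count"] >= 1, "mean"]
print(hourly_mean)
```

The same pattern (aggregate mean and count together, then filter on count) could be repeated with 'D' and a monthly frequency for the daily and monthly conditions mentioned in the question.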