To find the maximum over the last 300 seconds:
import pandas as pd
# 16 to 17 minutes of time-series data.
df = pd.DataFrame(range(10000))
df.index = pd.date_range(1, 1000000000000, 10000)
# maximum over last 300 seconds. (outputs 9999)
df[0].rolling('300s').max().tail(1)
How can I exclude the most recent 30s from the rolling calculation? I want the max between -300s and -30s.
So, instead of 9999 being output by the above, I want something around 9700 to be displayed.
You can compute the rolling max for the last 271 seconds (271 rather than 270 if you need that 300th second included), shift the results by 30 seconds, and merge them with the original dataframe. Since in your example the index is at sub-second resolution, you will need merge_asof to find the desired matches (use its direction parameter to control how non-exact matches are selected).
import pandas as pd
# 16 minutes and 40 seconds of time-series data.
df = pd.DataFrame(range(10_000))
df.index = pd.date_range(1, 1_000_000_000_000, 10_000)
roll_max = df[0].rolling('271s').max().shift(30, freq='s').rename('roll_max')
res = pd.merge_asof(df, roll_max, left_index=True, right_index=True)
print(res.tail(1))
#                         0  roll_max
# 1970-01-01 00:16:40  9999    9699.0
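As a quick sanity check (my addition, not part of the original answer), the same quantity can be computed directly for the final timestamp by slicing the label range [t-300s, t-30s] and taking its max; continuing from the snippet above, it should agree with the merged roll_max value:

t = df.index[-1]
direct = df.loc[t - pd.Timedelta('300s') : t - pd.Timedelta('30s'), 0].max()
print(direct)  # 9699, matching roll_max above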
At some point I am using pd.melt to reshape my dataframe. After inspection, this command takes around 7 minutes to run, which is too long for my use case (I am using it in an interactive dashboard).
Are there any methods to improve the running time of the melt function in pandas?
If not, is it possible, and good practice, to use a big-data package just for this line of code?
pd.melt(change_t, id_vars=['id', 'date'], value_vars=factors, value_name='value')
where factors is a list of 20 column names
I've timed melting a test table with 2 id_vars, 20 factors, and 1M rows, and it took 22 seconds on my laptop. Is your table similarly sized, or much larger? If it is a huge table, would it be OK to return only part of the melted output to your interactive dashboard? I've put some code for that approach below; it took 1.3 seconds to return the first 1000 rows of the melted table.
Timing melting a large test table
import pandas as pd
import numpy as np
import time
id_cols = ['id','date']
n_ids = 1000
n_dates = 100
n_cols = 20
n_rows = 1000000
#Create the test table
df = pd.DataFrame({
    'id': np.random.randint(1, n_ids+1, n_rows),
    'date': np.random.randint(1, n_dates+1, n_rows),
})
factors = []
for c in range(n_cols):
    c_name = 'C{}'.format(c)
    factors.append(c_name)
    df[c_name] = np.random.random(n_rows)
#Melt and time how long it takes
start = time.time()
pd.melt(df, id_vars=['id', 'date'], value_vars=factors, value_name='value')
print('Melting took',time.time()-start,'seconds for',n_rows,'rows')
#Melting took 21.744 seconds for 1000000 rows
Here's a way you can get just the first 1000 melted rows
ret_rows = 1000
start = time.time()
partial_melt_df = pd.DataFrame()
for ks, g in df.groupby(['id','date']):
    g_melt = pd.melt(g, id_vars=['id', 'date'], value_vars=factors, value_name='value')
    partial_melt_df = pd.concat((partial_melt_df, g_melt), ignore_index=True)
    if len(partial_melt_df) >= ret_rows:
        partial_melt_df = partial_melt_df.head(ret_rows)
        break
print('Partial melting took',time.time()-start,'seconds to give back',ret_rows,'rows')
#Partial melting took 1.298 seconds to give back 1000 rows
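If you do need the full melted table and melt itself remains the bottleneck, a plain numpy reshape is sometimes faster when the value columns are all numeric. This is a hedged sketch of my own, not part of the original answer; fast_melt is a hypothetical helper name, and you should verify its output matches pd.melt on your data:

import numpy as np
import pandas as pd

def fast_melt(df, id_vars, value_vars, value_name='value'):
    # Repeat the id block once per value column, then stack the value columns.
    n = len(df)
    out = pd.DataFrame({col: np.tile(df[col].to_numpy(), len(value_vars))
                        for col in id_vars})
    out['variable'] = np.repeat(value_vars, n)
    out[value_name] = df[value_vars].to_numpy().ravel(order='F')
    return out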
I'm working with historical data and have some very old dates that are outside the timestamp bounds for pandas. I've consulted the pandas Time series/date functionality documentation, which has some information on out-of-bounds spans, but from this it still wasn't clear to me what, if anything, I could do to convert my data into a datetime type.
I've also seen a few threads on Stack Overflow about this, but they either just point out the problem (i.e. nanosecond resolution, a maximum range of roughly 584 years) or suggest setting errors='coerce', which turns 80% of my data into NaTs.
Is it possible to turn dates lower than the default Pandas lower bound into dates? Here's a sample of my data:
import pandas as pd
df = pd.DataFrame({'id': ['836', '655', '508', '793', '970', '1075', '1119', '969', '1166', '893'],
'date': ['1671-11-25', '1669-11-22', '1666-05-15','1673-01-18','1675-05-07','1677-02-08','1678-02-08', '1675-02-15', '1678-11-28', '1673-12-23']})
You can create day periods with a lambda function:
df['date'] = df['date'].apply(lambda x: pd.Period(x, freq='D'))
Or, as @Erfan mentioned in a comment (thank you):
df['date'] = df['date'].apply(pd.Period)
print (df)
id date
0 836 1671-11-25
1 655 1669-11-22
2 508 1666-05-15
3 793 1673-01-18
4 970 1675-05-07
5 1075 1677-02-08
6 1119 1678-02-08
7 969 1675-02-15
8 1166 1678-11-28
9 893 1673-12-23
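Once the column has period[D] dtype, ordinary comparisons and the .dt accessor keep working even for dates below the Timestamp lower bound. A small hedged illustration of my own (if your pandas version leaves the column as object dtype, convert it explicitly with astype('period[D]') first):

# filter rows before a cut-off date and extract the year
early = df[df['date'] < pd.Period('1672-01-01', freq='D')]
years = df['date'].dt.year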
Given a timestamped df with a timedelta column showing the time covered, such as:
df = pd.DataFrame(pd.to_timedelta(['00:45:00','01:00:00','00:30:00']).rename('span'),
index=pd.to_datetime(['2019-09-19 18:00','2019-09-19 19:00','2019-09-19 21:00']).rename('ts'))
# span
# ts
# 2019-09-19 18:00:00 00:45:00
# 2019-09-19 19:00:00 01:00:00
# 2019-09-19 21:00:00 00:30:00
How can I plot a bar graph showing dropouts every 15 minutes? What I want is a bar graph that shows 0 or 1 on the Y axis, with a 1 for each 15-minute segment covered by the time periods above and a 0 for every 15-minute segment not covered.
Per this answer I tried:
df['span'].astype('timedelta64[m]').plot.bar()
However, this plots each timespan vertically and does not show that the whole hour of 2019-09-19 20:00 is missing.
I tried
df['span'].astype('timedelta64[m]').plot()
It produces a line plot of the spans, which is not very useful.
I also tried this answer to no avail.
Update
Based on lostCode's answer I was able to further modify the DataFrame as follows:
def isvalid(period):
    for ndx, row in df.iterrows():
        if (period.start_time >= ndx) and (period.start_time < row.end):
            return 1
    return 0

df['end'] = df.index + df.span
ds = pd.period_range(df.index.min(), df.end.max(), freq='15T')
df_valid = pd.DataFrame(ds.map(isvalid).rename('valid'), index=ds.rename('period'))
Is there a better, more efficient way to do it?
You can use DataFrame.resample to create a new DataFrame and check for gaps in the time series. To test which resampled timestamps exist in the original index, use isin:
import numpy as np

# Resample to hourly bins; hours with no data still get a row
check = df.resample('H')['span'].sum().reset_index()
d = df.reset_index('ts').sort_values('ts')
# Mark an hour as valid (1) if its timestamp exists in the original index
check['valid'] = np.where(check['ts'].isin(d['ts']), 1, 0)
check.set_index('ts')['valid'].plot(kind='bar', figsize=(10, 10))
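As a further option (my own sketch, not from either answer), the 15-minute coverage asked about in the update can be computed without the row-by-row isvalid() loop by broadcasting the slot timestamps against the interval starts and ends:

import numpy as np
import pandas as pd

# one 15-minute slot per tick between the first start and the last end
slots = pd.date_range(df.index.min(), (df.index + df['span']).max(), freq='15T')
starts = df.index.to_numpy()
ends = (df.index + df['span']).to_numpy()
# a slot is covered if it starts inside any [start, start + span) interval
covered = ((slots.to_numpy()[:, None] >= starts)
           & (slots.to_numpy()[:, None] < ends)).any(axis=1)
coverage = pd.Series(covered.astype(int), index=slots, name='valid')
coverage.plot(kind='bar', figsize=(10, 5))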
I am not sure I understand the min_periods parameter of the pandas rolling functions: why does it have to be smaller than the window parameter?
I would like to compute (for instance) the rolling max minus rolling min with a window of ten values BUT I want to wait maybe 20 values before starting computations:
In[1]: import pandas as pd
In[2]: import numpy as np
In[3]: df = pd.DataFrame(columns=['A','B'], data=np.random.randint(low=0,high=100,size=(100,2)))
In[4]: roll = df['A'].rolling(window=10, min_periods=20)
In[5]: df['C'] = roll.max() - roll.min()
In[6]: roll
Out[6]: Rolling [window=10,min_periods=20,center=False,axis=0]
In[7]: df['C'] = roll.max()-roll.min()
I get the following error:
ValueError: Invalid min_periods size 20 greater than window 10
I thought that min_periods was there to tell how many values the function had to wait before starting computations. The documentation says:
min_periods : int, default None
Minimum number of observations in window required to have a value
(otherwise result is NA)
I had not paid attention to the "in window" detail here...
Then what would be the most efficient way to achieve this? Should I do something like:
roll = df.loc[20:,'A'].rolling(window=10)
df['C'] = roll.max() - roll.min()
Is there a more efficient way?
The min_periods=n option simply means that you require at least n valid observations in the window to compute your rolling statistics.
For example, suppose min_periods=5 and you compute a rolling mean over the last 10 observations. What happens if 6 of those 10 observations are missing values? Since only 4 non-missing values remain and you require at least 5, the rolling mean will be missing as well.
It's a very, very important option.
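A minimal illustration of that behaviour (my own example series, not the answerer's):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 9, 10])
# min_periods=5: the last window of 10 holds only 4 non-missing values -> NaN
print(s.rolling(window=10, min_periods=5).mean().iloc[-1])   # nan
# min_periods=4: 4 non-missing values are enough -> a number
print(s.rolling(window=10, min_periods=4).mean().iloc[-1])   # 5.5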
From the documentation
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
The min_periods argument is just a way to apply the function to a smaller sample than the rolling window. Say you want the rolling minimum with a window of 10: passing min_periods=5 lets pandas calculate the min of the first 5 data points, then the first 6, then 7, 8, 9, and finally 10. From that point on, more than 10 data points are available, so pandas keeps using the full window of 10.
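In other words, with window=10 and min_periods=5 the first four results are NaN, the next five come from growing windows of 5 to 9 observations, and from the 10th observation onward the full window is used. A small sketch of my own:

import pandas as pd

s = pd.Series(range(1, 21))
print(s.rolling(window=10, min_periods=5).min())
# rows 0-3: NaN; rows 4-9 use 5 to 10 observations; afterwards a full window of 10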
I have some event data that is measured in time, so the data format looks like
Time(s)  Pressure  Humidity
0        10        5
0        9.9       5.1
0        10.1      5
1        10        4.9
2        11        6
Here the first column is the time elapsed since the start of the experiment, in seconds. The other two columns are observations. A row is created when certain conditions are true; those conditions are beyond the scope of the discussion here. Each set of three numbers is one row of data. Since the lowest time resolution here is one second, you can have two rows with the same timestamp but different observations: they are two distinct events that the timestamp cannot distinguish.
Now my problem is to roll up the data series by subsampling it, say every 10, 100, or 1000 seconds, so that I get a skimmed series from the original higher-granularity one. There are a few ways to decide which row to use: for instance, when subsampling every 10 seconds, several rows may fall into the same 10-second bucket, and you could take
1) first row
2) mean of all rows with the same timestamp of 10
3) some other technique
I am looking to do this in pandas; any ideas or a way to start would be much appreciated. Thanks.
Here is a simple example that shows how to perform the requested operations with pandas, using data binning (pd.cut plus groupby) to group the samples and resample the data.
import pandas as pd

# Creation of the dataframe
df = pd.DataFrame({
    'Time(s)': [0, 0, 0, 1, 2],
    'Pressure': [10, 9.9, 10.1, 10, 11],
    'Humidity': [5, 5.1, 5, 4.9, 6]})

# Select time increment
delta_t = 1
timeCol = 'Time(s)'

# Creation of the time sampling (bin edges)
v = list(range(df[timeCol].min() - delta_t, df[timeCol].max() + delta_t, delta_t))

# Pandas magic instructions with cut and groupby
df_binned = df.groupby(pd.cut(df[timeCol], v))

# Take the first row of each group
dfFirst = df_binned.head(1)
# Evaluate the mean of each group
dfMean = df_binned.mean()
# Evaluate the median of each group
dfMedian = df_binned.median()
# Find the max of each group
dfMax = df_binned.max()
# Find the min of each group
dfMin = df_binned.min()
Result will look like this for dfFirst
            Humidity  Pressure  Time(s)
Time(s)
(-1, 0]  0       5.0        10        0
(0, 1]   3       4.9        10        1
(1, 2]   4       6.0        11        2
Result will look like this for dfMean
          Humidity  Pressure  Time(s)
Time(s)
(-1, 0]   5.033333        10        0
(0, 1]    4.900000        10        1
(1, 2]    6.000000        11        2
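If the Time(s) column is converted to a TimedeltaIndex, resample() does the same kind of binning directly. This is an additional hedged sketch on top of the answer above (with only the five sample rows, everything lands in a single 10-second bin):

import pandas as pd

df2 = df.set_index(pd.to_timedelta(df['Time(s)'], unit='s'))
print(df2.resample('10s').first())   # option 1: first row per 10-second bin
print(df2.resample('10s').mean())    # option 2: mean of all rows per bin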