Regularizing stochastic data with conditional forward fill - pandas

I'm making my first foray into pandas 0.24's revamped Resampler objects. I have a dataframe of quasi-stochastically sampled vehicle speeds. The logger has a sample period of roughly 1s +/- 100ms when moving, and about 30s when stopped.
The data might look like this:
0 1 2 0 2 5 7 3 0 3 3 3 1 0
I'd like to regularize the data to 1s intervals, but without losing my zero-speed intervals. This is a lot harder than I thought it was going to be, largely because I want to ffill the zero periods, and interpolate the non-zero periods onto the regularized index.
Questions:
Generally speaking, how would you address this two-part up-fill/interpolate process?
Is there a modern analog to the old resample(how=None) logic, to let me add regularized timestamps to the index without adding spurious data?
Am I stuck looping to fill the zero periods, or is there some kind of apply() magic that will let me do a conditional ffill()?
Example data:
import pandas as pd

orig = [0.0, 0.0, 1.5, 2.0, 1.5, 2.0, 1.0, 0.0, 0.0, 3.5]
idx = pd.DatetimeIndex(['2018-12-19 16:50:51+00:00',
                        '2018-12-19 16:50:51.400000+00:00',
                        '2018-12-19 16:50:57.500000+00:00',
                        '2018-12-19 16:50:57.600000+00:00',
                        '2018-12-19 16:51:12.500000+00:00',
                        '2018-12-19 16:51:16.400000+00:00',
                        '2018-12-19 16:51:18.400000+00:00',
                        '2018-12-19 16:51:20.400000+00:00',
                        '2018-12-19 16:51:22.500000+00:00',
                        '2018-12-19 16:51:24.500000+00:00'])
df = pd.DataFrame(orig, index=idx)
df.plot(figsize=(18, 4))
NB how the plot shows an incorrect speed ramp-up ending at second 57.5. The speed should be zero until second 57, and ramp up to 1.5 at second 58.

Naturally, after working on it for three days, I figured out a reasonable answer about ten minutes after posting.
# First create a dummy with the correct regular index, but containing only the zero periods.
ff = df.asfreq('1s', method='ffill')
dummy1 = ff[ff == 0.0]
# Then use 'time' interpolation to fill the non-zero periods.
dummy2 = dummy1.combine_first(df).interpolate('time')
# combine_first adds the missing rows from the second dataframe, so resample again.
solution = dummy2.asfreq('1s')
The dropped sample at the end is inelegant, but won't be an issue for my purposes; my logs always end with zero. I'm curious though. Please post if you have an elegant way to make the initial ffill() include the final sample.
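One possible tweak (a sketch only; it assumes Timestamp.floor/.ceil and DatetimeIndex.union behave as in pandas 0.24): build the 1s grid explicitly so it extends one step past the last raw timestamp, then forward-fill onto the union of the raw and regular timestamps before selecting the grid points. The trailing grid point then holds the last reading instead of being dropped.
full_idx = pd.date_range(df.index[0].floor('1s'),
                         df.index[-1].ceil('1s'), freq='1s')
# Forward-fill onto the union of raw and regular timestamps, then keep only the grid.
ff = df.reindex(df.index.union(full_idx)).ffill().reindex(full_idx)
# ...then continue exactly as above: dummy1 = ff[ff == 0.0], etc.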

Related

Using Pandas and Numpy to search for conditions within binned data in 2 data frames

Python newbie here. Here's a simplified example of my problem. I have 2 pandas dataframes.
One dataframe lightbulb_df has data on whether a light is on or off and looks something like this:
Light_Time    Light On?
5790.76       0
5790.76       0
5790.771      1
5790.779      1
5790.779      1
5790.782      0
5790.783      1
5790.783      1
5790.784      0
Where the time is in seconds since the start of the day, 1 means the lightbulb is on, and 0 means it is off.
The second dataframe sensor_df shows whether or not a sensor detected the lightbulb and has different time values and rates.
Sensor_Time    Sensor Detect?
5790.8         0
5790.9         0
5791.0         1
5791.1         1
5791.2         1
5791.3         0
Both dataframes are very large with 100,000s of rows. The lightbulb will turn on for a few minutes and then turn off, then back on, etc.
Using the .diff() function, I was able to compare each row to its predecessor and, depending on whether the result was 1 or -1, create a truth table with simplified on and off times and append it to lightbulb_df.
# use .diff() to compare each row to the previous row
lightbulb_df['light_diff'] = lightbulb_df['Light On?'].diff()
# the light-on start times are where .diff is greater than 0 (1 - 0 = 1)
light_start = lightbulb_df.loc[lightbulb_df['light_diff'] > 0]
# the light-off times (first times when the light turns off)
# are where .diff is less than 0 (0 - 1 = -1)
light_off = lightbulb_df.loc[lightbulb_df['light_diff'] < 0]
# concatenate them into a single change-state df that only
# captures the rows where the lightbulb changes state
lightbulb_changes = pd.concat((light_start, light_off)).sort_values(by=['Light_Time'])
So I end up with a dataframe of on start times, a dataframe of off start times, and a change state dataframe that looks like this.
Light_Time    Light On?    light_diff
5790.771      1             1
5790.782      0            -1
5790.783      1             1
5790.784      0            -1
Now my goal is to search sensor_df during each of the change-state windows (above, 5790.771 to 5790.782 and 5790.783 to 5790.784) in 1-second intervals to see whether or not the sensor detected the lightbulb. I want to end up with, for each of the many light-on periods in the change-state dataframe, the number of seconds the lightbulb was on and the number of seconds the sensor detected it, so I can compute the % correctly detected.
Whenever I try to plan this out, I end up with lots of nested for or while loops, which I know will be really slow with 100,000s of rows of data. I thought about using pd.cut to divide the dataframe into 1-second intervals. I wrote a for loop to cycle through each of the times in the change-state dataframe, with a while loop nested inside to step through 1-second intervals, but that seems like it would be really slow.
I know python has a lot of built in functions that could help but I'm having trouble knowing what to google to find the right one.
Any advice would be appreciated.
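One loop-free direction to consider (a sketch only, using the column names from the tables above and assuming a roughly constant sensor sample rate): label each sensor reading with the most recent known light state via pd.merge_asof, number the light-on periods with a cumulative counter, and aggregate per period.
import pandas as pd

# Attach the most recent known light state to every sensor reading.
merged = pd.merge_asof(
    sensor_df.sort_values('Sensor_Time'),
    lightbulb_changes[['Light_Time', 'Light On?']].sort_values('Light_Time'),
    left_on='Sensor_Time', right_on='Light_Time',
    direction='backward')

# Number the on-periods: the counter increments every time the light switches on.
merged['period'] = (merged['Light On?'].diff() == 1).cumsum()

# Per on-period: samples detected vs. total samples while the light was on.
on_samples = merged[merged['Light On?'] == 1]
summary = on_samples.groupby('period')['Sensor Detect?'].agg(['sum', 'count'])
summary['pct_detected'] = 100 * summary['sum'] / summary['count']
With a fixed sensor rate, the sample counts are proportional to seconds, so the ratio is the percentage of on-time detected; multiply the counts by the sensor period if you need absolute seconds.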

Pandas run function only on subset of whole Dataframe

Let's say I have a DataFrame with 200 values (prices for products). I want to run some operation on this DataFrame, like calculating the average of the last 10 prices.
The way I understand it, right now pandas will go through every single row and calculate an average for each row, i.e. the first 9 rows will be NaN, and then from rows 10-200 it would calculate the average for each row.
My issue is that I need to do a lot of these calculations and performance is an issue. For that reason, I would want to run the average only on, say, the last 10 values (I don't need more), while keeping all the values in the DataFrame, i.e. I don't want to get rid of those values or create a new DataFrame.
I essentially just want to do the calculation on less data, so it is faster.
Is something like that possible? Hopefully the question is clear.
Building off Chicodelarose's answer, you can achieve this in a more "pandas-like" syntax.
Defining your df as follows, we get 200 prices in the range [0, 1000):
import numpy as np
import pandas as pd

df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would be the following:
def add10(n: float) -> float:
    """An exceptionally simple function to demonstrate you can set
    values, too.
    """
    return n + 10

df["price"].iloc[-12:] = df["price"].iloc[-12:].apply(add10)
Of course, you can also use these selections to return something else without setting values, too.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach lies in the use of pandas tooling. Say you want to operate over a subset of your data with multiple columns: you simply need to pass an axis parameter to your .apply(...), as follows: .apply(fn, axis=1) (see the sketch below).
This becomes much more readable the longer you spend in pandas. 🙂
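For instance, a minimal sketch of that row-wise variant (the second column and the row_total helper here are invented purely for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(200) * 1000.,
                   "qty": np.random.randint(1, 10, size=200)})

def row_total(row) -> float:
    # Sees one row at a time because of axis=1, so it can mix columns.
    return row["price"] * row["qty"]

# Apply only to the last 12 rows, row by row.
last12_totals = df.iloc[-12:].apply(row_total, axis=1)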
Given a dataframe like the following:
Price
0 197.45
1 59.30
2 131.63
3 127.22
4 35.22
.. ...
195 73.05
196 47.73
197 107.58
198 162.31
199 195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
    return df.iloc[-n:][colname].mean().round(decimals=2)

print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67

How to vectorize to speed up Dataframe apply pandas

I have a t x n (5000 x 100) dataframe wts_df,
wts_df.tail().iloc[:, 0:6]
Out[71]:
B C H L R T
2020-09-25 0.038746 0.033689 -0.047835 -0.002641 0.009501 -0.030689
2020-09-28 0.038483 0.033189 -0.061742 0.001199 0.009490 -0.028370
2020-09-29 0.038620 0.034957 -0.031341 0.006179 0.007815 -0.027317
2020-09-30 0.038610 0.034902 -0.014271 0.004512 0.007836 -0.024672
2020-10-01 0.038790 0.029937 -0.044198 -0.008415 0.008347 -0.030980
and two similar t x n dataframes, vol_df and rx_df (same index and columns). For now we can use:
rx_df = wts_df.applymap(lambda x: np.random.rand())
vol_df = wts_df.applymap(lambda x: np.random.rand())
I need to do this (simplified):
rate = {}
for date in wts_df.index:
    wts = wts_df.loc[date]             # now a 1 x n vector
    # multiply all entries of rx_df and vol_df up to this date by these wts,
    # and sum across columns
    rx = rx_df.truncate(after=date)    # still a dataframe, truncated at date, k x n
    vol = vol_df.truncate(after=date)
    wtd_rx = (wts * rx).sum(1)         # a k x 1 vector
    wtd_vol = (wts * vol).sum(1)
    # take the ratio
    rx_vol = wtd_rx / wtd_vol
    rate[date] = rx_vol.tail(20).std()
So rate looks like this
pd.Series(rate).tail()
Out[71]:
rate
2020-09-25 0.0546
2020-09-28 0.0383
2020-09-29 0.0920
2020-09-30 0.0510
2020-10-01 0.0890
The above loop is slow, so I tried this:
def rate_calc(wts, date, rx_df=rx_df, vol_df=vol_df):
    wtd_rx = (rx_df * wts).sum(1)
    wtd_vol = (vol_df * wts).sum(1)
    rx_vol = wtd_rx / wtd_vol
    rate = rx_vol.truncate(after=date).tail(20).std()
    return rate

rates = wts_df.apply(lambda x: rate_calc(x, x.name), axis=1)
This is still very slow. Moreover, I need to do this for multiple wts_df contained in a dict, so the whole thing takes a long time.
rates = {key: val.apply(lambda x: rate_calc(x, x.name), axis=1) for key, val in wts_df_dict.items()}
Any ideas on how to speed up such operations?
Your question falls under the category of 'optimization', so allow me to share a few pointers to solve your problem.
First, when it comes to speed, always use %timeit to make sure a new strategy actually gets you better results.
Second, there are a few ways to iterate over data:
With iterrows() -- use it only when the data sample is small (or better yet, try not to use it at all, as it is too slow).
With apply() -- a better alternative to iterrows() and much more efficient, but when the data set is large (as in your example) it can still be painfully slow.
Vectorizing -- simply put, you execute the operation on the entire column/array at once, and it is significantly faster. Winner!
So, in order to solve your speed problem, your strategy should be vectorization. Here's how it should work (mind the .values):
df['new_column'] = my_function(df['column_1'].values, df['column_2'].values, ...) and you will see a super fast result.
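To make that concrete, here is a tiny self-contained illustration of the pattern (the DataFrame and my_function below are invented just for the demo):
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_1': np.random.rand(5000),
                   'column_2': np.random.rand(5000)})

def my_function(a, b):
    # Works on whole NumPy arrays at once -- no per-row Python overhead.
    return a / b

# One vectorized call over the full columns instead of 5000 apply() calls.
df['new_column'] = my_function(df['column_1'].values, df['column_2'].values)
Timing this against the row-wise equivalent, df.apply(lambda r: r['column_1'] / r['column_2'], axis=1), with %timeit typically shows a difference of a couple of orders of magnitude.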

Finding the index for a value in a Pandas Dataframe

I've got a problem that shouldn't be that difficult but it's stumping me. There has to be an easy way to do it. I have a series from a dataframe that looks like this:
value
2001-01-04    0.134
2001-01-05    NaN
2001-01-06    NaN
2001-01-07    0.032
2001-01-08    NaN
2001-01-09    0.113
2001-01-10    NaN
2001-01-11    NaN
2001-01-12    0.112
2001-01-13    NaN
2001-01-14    NaN
2001-01-15    0.136
2001-01-16    NaN
2001-01-17    NaN
Iterating from bottom to top, I need to find the index of the value that is greater than 0.100 at the earliest date where the next earliest date would be less than 0.100.
So in the series above, I want to find the index of the value 0.113, which is 2001-01-09. The next earlier value is below 0.100 (0.032 on 2001-01-07). The two later values are also greater than 0.100, but I want the index of the earliest value > 0.100 that follows a value less than the threshold, iterating bottom to top.
The only way I can think of doing this is reversing the series, iterating to the first (last) value, checking if it is > 0.100, then iterating again to the next earlier value and checking whether it is less than 0.100. If it is, I'm done. If it is > 0.100 I have to iterate again and test the earlier number.
Surely there is a non-messy way to do this that I'm not seeing, one that avoids all this stepwise iteration.
Thanks in advance for your help.
You're essentially looking for two conditions. For the first condition, you want the given value to be greater than 0.1:
df['value'].gt(0.1)
For the second condition, you want the previous non-null value to be less than 0.1:
df['value'].ffill().shift().lt(0.1)
Now, combine the two conditions with the & operator, reverse the resulting Boolean indexer, and use idxmax to find the first (i.e. the last in original order) instance where your condition holds:
(df['value'].gt(0.1) & df['value'].ffill().shift().lt(0.1))[::-1].idxmax()
Which gives the expected index value.
The above method assumes that at least one value satisfies the situation you've described. If it's possible that your data may not satisfy your situation you may want to use any to verify that a solution exists:
# Build the condition.
cond = (df['value'].gt(0.1) & df['value'].ffill().shift().lt(0.1))[::-1]

# Check if the condition is met anywhere.
if cond.any():
    idx = cond.idxmax()
else:
    idx = ???
In your question, you've specified both inequalities as strict. What happens for a value exactly equal to 0.1? You may want to change one of the gt/lt to ge/le to account for this.
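For reference, a minimal end-to-end sketch of the expression above on the sample data (the daily date index and NaN placement are reconstructed from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'value': [0.134, np.nan, np.nan, 0.032, np.nan, 0.113, np.nan,
               np.nan, 0.112, np.nan, np.nan, 0.136, np.nan, np.nan]},
    index=pd.date_range('2001-01-04', periods=14))

cond = (df['value'].gt(0.1) & df['value'].ffill().shift().lt(0.1))[::-1]
print(cond.idxmax() if cond.any() else pd.NaT)  # 2001-01-09 00:00:00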
Bookkeeping
# making sure `nan` are actually `nan`
df.value = pd.to_numeric(df.value, 'coerce')
# making sure strings are actually dates
df.index = pd.to_datetime(df.index)
plan
dropna
sort_index
boolean series of less than 0.1
convert to integers to use in diff
diff - Your scenario happens when we go from < .1 to > .1. In this case, diff will be -1
idxmax - find the first -1
df.value.dropna().sort_index().lt(.1).astype(int).diff().eq(-1).idxmax()
2001-01-09 00:00:00
Correction to account for the flaw pointed out by @root.
diffs = df.value.dropna().sort_index().lt(.1).astype(int).diff().eq(-1)
diffs.idxmax() if diffs.any() else pd.NaT
editorial
This question highlights an important SO dynamic. Those of us who answer questions often do so by editing our answers until they are in a satisfactory state. I have observed that those of us who answer pandas questions are generally very helpful to each other as well as to those who ask questions.
In this post, I was well informed by @root and subsequently changed my post to reflect the added information. That alone makes @root's post very useful, in addition to the other great information they provided.
Please recognize both posts and up vote as many useful posts as you can.
Thx

Pandas shifting uneven timeseries data

I have some irregularly stamped time series data in pandas, with timestamps and an observation at every timestamp. Irregular basically means that the timestamps are uneven; for instance, the gap between two successive timestamps is not constant.
For instance the data may look like
Timestamp Property
0 100
1 200
4 300
6 400
6 401
7 500
14 506
24 550
.....
59 700
61 750
64 800
Here the timestamp is, say, seconds elapsed since a chosen origin time. As you can see, we could have data at the same timestamp, 6 secs in this case. Basically the timestamps are strictly different, just that second resolution cannot measure the change.
Now I need to shift the timeseries data ahead, say I want to shift the entire data by 60 secs, or a minute. So the target output is
Timestamp Property
0 750
1 800
So the 0 point got matched to the 61 point and the 1 point got matched to the 64 point.
Now I can do this by writing something dirty, but I am looking to use as much as possible any inbuilt pandas feature. If the timeseries were regular, or evenly gapped, I could've just used the shift() function. But the fact that the series is uneven makes it a bit tricky. Any ideas from Pandas experts would be welcome. I feel that this would be a commonly encountered problem. Many thanks!
Edit: added a second, more elegant, way to do it. I don't know what will happen if you had a timestamp at 1 and two timestamps of 61. I think it will choose the first 61 timestamp but not sure.
# Build a regular grid of integer timestamps covering the whole range.
new_stamps = pd.Series(range(df['Timestamp'].max() + 1))
shifted = pd.DataFrame(new_stamps)
shifted.columns = ['Timestamp']
# Outer-merge so every grid timestamp appears as a row (Property is NaN there).
merged = pd.merge(df, shifted, on='Timestamp', how='outer')
# Shift every timestamp back by 60 s, then back-fill so each row picks up
# the next available Property.
merged['Timestamp'] = merged['Timestamp'] - 60
merged = merged.sort_values(by='Timestamp').bfill()
# Keep only the original timestamps, now carrying the shifted Property.
results = pd.merge(df, merged, on='Timestamp')
[Original Post]
I can't think of an inbuilt or elegant way to do this. Posting this in case it's more elegant than your "something dirty", which is I guess unlikely. How about:
lookup_dict = {}

def assigner(row):
    lookup_dict[row['Timestamp']] = row['Property']

df.apply(assigner, axis=1)
sorted_keys = sorted(lookup_dict.keys())

df['Property_Shifted'] = None

def get_shifted_property(row, shift_amt):
    # Find the first timestamp at or after this row's timestamp + shift_amt.
    for i in sorted_keys:
        if i >= row['Timestamp'] + shift_amt:
            row['Property_Shifted'] = lookup_dict[i]
            return row
    return row  # no later timestamp available

df = df.apply(get_shifted_property, shift_amt=60, axis=1)
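For what it's worth, on later pandas versions this "first observation at or after t + 60" lookup can also be expressed with pd.merge_asof. A sketch only (and, as with the Edit above, the behaviour with duplicate timestamps such as the two rows at 6 would need checking):
import pandas as pd

# Shift each row's lookup key forward by 60 s, then ask merge_asof for the
# first observation at or after that key.
lookup = df[['Timestamp']].copy()
lookup['target'] = lookup['Timestamp'] + 60

shifted = pd.merge_asof(
    lookup.sort_values('target'),
    df.rename(columns={'Property': 'Property_Shifted'}).sort_values('Timestamp'),
    left_on='target', right_on='Timestamp',
    direction='forward',              # first match at or after the target
    suffixes=('', '_matched'))
# Use allow_exact_matches=False instead if you want strictly after the target.

result = shifted[['Timestamp', 'Property_Shifted']]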