Summing time series with slight variance in timestamps - pandas

I imagine that I have several time series like following, from different "sources":
   time   events
0  1000  1080000
1  2003  2122386
2  3007  3043985
3  4007  3872544
4  5007  4853763
Here, a monotonically increasing count, events, is sampled every 1000 ms. The sampling is not exact, so most of the timestamps vary from their ideal values by a few ms; e.g., the second point is at 2003 instead of 2000.
I want to sum several of these time series: they will all be sampled at ~1000 ms but may not agree to the exact millisecond. E.g., another time series could be:
   time   events
0  1000  1070000
1  2002  2122486
2  3006  3063985
3  4007  3872544
4  5009  4853763
I'd like something reasonable in terms of the final result: for example, the same number of rows as each of the input dataframes, with a timestamp column equal to the first input's timestamps (or the average of the inputs' timestamps). As long as the inputs are smooth, the output should be too.

I'd suggest DataFrame.reindex() with the 'nearest' method. Example:
import pandas as pd

def combine_datasources(reference_df, extra_dfs, tolerance_ms=100):
    # Snap each extra dataframe onto the reference index, matching each row to the
    # nearest reference timestamp within the given tolerance
    reindexed_df_list = [df.reindex(reference_df.index, method='nearest', tolerance=tolerance_ms)
                         for df in extra_dfs]
    combined = pd.concat([reference_df, *reindexed_df_list])
    # The indexes now match exactly, so a groupby-sum lines the rows up
    return combined.groupby(combined.index).sum()

combine_datasources(df_a, [df_b])
This code reindexes each dataframe in the extra_dfs list to match the index of the reference dataframe, then concatenates all of the dataframes together and uses groupby to do the sum; this requires the indexes to match exactly. The timestamps in the result are the same as those of the reference dataframe.
Note that if you have data from a time period not covered by the reference dataframe, that data will be dropped.
Here's the output for the dataset in your question:
       events
time
1000  2150000
2003  4244872
3007  6107970
4007  7745088
5007  9707526
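
For completeness, here's a minimal sketch of how the two example frames might be constructed; this assumes, as the function above does, that time is set as the index, and the construction itself is just illustrative:
import pandas as pd

df_a = pd.DataFrame({'time': [1000, 2003, 3007, 4007, 5007],
                     'events': [1080000, 2122386, 3043985, 3872544, 4853763]}).set_index('time')
df_b = pd.DataFrame({'time': [1000, 2002, 3006, 4007, 5009],
                     'events': [1070000, 2122486, 3063985, 3872544, 4853763]}).set_index('time')

print(combine_datasources(df_a, [df_b]))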

Related

Pandas run function only on subset of whole Dataframe

Let's say I have a dataframe which has 200 values, prices for products. I want to run some operation on this dataframe, like calculating the average price over the last 10 prices.
The way I understand it, right now pandas will go through every single row and calculate the average for each row, i.e., the first 9 rows will be NaN, and then for rows 10-200 it will calculate the average for each row.
My issue is that I need to do a lot of these calculations and performance is an issue. For that reason, I want to run the average only on, say, the last 10 values (I don't need more) out of all the values, while keeping those values in the dataframe, i.e., I don't want to get rid of them or create a new dataframe.
I essentially just want to do the calculation on less data, so it is faster.
Is something like that possible? Hopefully the question is clear.
Building off Chicodelarose's answer, you can achieve this with a more "pandas-like" syntax.
Defining your df as follows, we get 200 prices in the range [0, 1000):
import numpy as np
import pandas as pd

df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would be the following:
def add10(n: float) -> float:
    """An exceptionally simple function to demonstrate you can set
    values, too.
    """
    return n + 10

df["price"].iloc[-12:] = df["price"].iloc[-12:].apply(add10)
Of course, you can also use these selections to return something else without setting values, too.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach lies in the use of pandas tooling. If you want to operate over a subset of your data with multiple columns, you simply need to adjust your .apply(...) call to include an axis parameter, as follows: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂
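To illustrate that axis=1 pattern, here is a minimal sketch; the qty and total columns are hypothetical additions, purely to show a row-wise function applied to only the last few rows:
import numpy as np
import pandas as pd

# Hypothetical frame with a second column so the row-wise apply has something to do
df = pd.DataFrame({
    "price": (np.random.rand(200) * 1000.).round(decimals=2),
    "qty": np.random.randint(1, 10, size=200),
})

# Apply a row-wise function to the last 12 rows only; earlier rows stay NaN
df.loc[df.index[-12:], "total"] = df.iloc[-12:].apply(
    lambda row: row["price"] * row["qty"], axis=1
)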
Given a dataframe like the following:
Price
0 197.45
1 59.30
2 131.63
3 127.22
4 35.22
.. ...
195 73.05
196 47.73
197 107.58
198 162.31
199 195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
    return df.iloc[-n:][colname].mean().round(decimals=2)

print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67

Plotting data from two sets with different shapes in the same plot

I am using data collected from two different instruments which have different resolutions because of each instrument's sampling rate. For a given time span, one of the sets has >10k entries while the other has ~2.5k. They capture data over the same time interval, though, and I want to plot them on top of each other even though they have different data resolutions. The minimum and maximum x of both sets are the same; however, one of them has more entries.
Simplified it could look like this:
1st set from instrument with higher sampling rate:
time(s) value
0.0 10
0.2 11
0.4 12
0.6 13
0.8 14
... ..
100 50
2nd set from instrument with lower sampling rate:
time(s) value
0 100
1 120
2 125
3 128
4 130
. ...
100 430
They are measuring different things, but I would like to display them in the same plot. How can I accomplish this?
I found the mistake: I was trying to plot both datasets against the time data from the first instrument. Of course they need to be plotted against their respective time data; I had put the first instrument's time data in the second plot by mistake.
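For reference, a minimal sketch of the corrected plot; df1 and df2 (each with time and value columns) are hypothetical stand-ins for the two instruments, and a second y-axis is used only because they measure different quantities:
import matplotlib.pyplot as plt

# Each series is plotted against its own time column
fig, ax1 = plt.subplots()
ax1.plot(df1["time"], df1["value"], color="tab:blue", label="instrument 1")
ax1.set_xlabel("time (s)")

ax2 = ax1.twinx()  # second y-axis, shared x-axis
ax2.plot(df2["time"], df2["value"], color="tab:orange", label="instrument 2")

plt.show()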

Pandas groupby for k-fold cross-validation with aggregation

Say I have a dataframe, df, with columns: id | site | time | clicks | impressions.
I want to use the machine learning technique of k-fold cross-validation (split the data randomly into k=10 equal-sized partitions, based on e.g. the id column). I think of this as a mapping from id to {0, 1, ..., 9} (so a new column 'fold' going from 0-9),
then iteratively take 9/10 of the partitions as training data and the remaining 1/10 partition as validation data
(so first fold==0 is validation and the rest is training, then fold==1 is validation and the rest is training, and so on)
[so I am thinking of this as a generator based on grouping by the fold column].
Finally, I want to group all the training data by site and time (and similarly for the validation data); in other words, sum over the fold index while keeping the site and time indices.
What is the right way of doing this in pandas?
The way I thought of doing it at the moment is
df_sum = df.groupby(['fold', 'site', 'time']).sum()
# so df_sum has indices fold, site, time

# create a new Series object, dat, name='cross', by mapping fold indices
# to 'training'/'validation'

df_train_val = df_sum.groupby([dat, 'site', 'time']).sum()
df_train_val.xs('validation', level='cross')
Now the direct problem I run into is that groupby on columns will happily accept an extra Series object, but groupby on a MultiIndex doesn't [the df_train_val assignment above doesn't work]. Obviously I could use reset_index, but given that I want to group over site and time [to aggregate over folds 1 to 9, say] this seems wrong. (I assume grouping is much faster on indices than on 'raw' columns.)
So, Question 1: is this the right way to do cross-validation followed by aggregation in pandas? More generally, how should one group and then regroup based on multi-index values?
Question 2: is there a way of mixing arbitrary mappings with multilevel indices?
This generator seems to do what I want. You pass in the grouped data (with one index level corresponding to the fold [0 to n_folds]).
def split_fold2(fold_data, n_folds, new_fold_col='fold'):
    i_fold = 0
    indices = list(fold_data.index.names)
    slicers = [slice(None)] * len(fold_data.index.names)
    fold_index = fold_data.index.names.index(new_fold_col)
    indices.remove(new_fold_col)
    while i_fold < n_folds:
        # select every fold except the current one for training, then sum over folds
        slicers[fold_index] = [i for i in range(n_folds) if i != i_fold]
        slicers_tuple = tuple(slicers)
        train_data = fold_data.loc[slicers_tuple, :].groupby(level=indices).sum()
        val_data = fold_data.xs(i_fold, level=new_fold_col)
        yield train_data, val_data
        i_fold += 1
On my data set this takes:
CPU times: user 812 ms, sys: 180 ms, total: 992 ms  Wall time: 991 ms
(to retrieve one fold)
Replacing the train_data assignment with
train_data = fold_data.select(lambda x: x[fold_index] != i_fold).groupby(level=indices).sum()
takes
CPU times: user 2.59 s, sys: 263 ms, total: 2.85 s  Wall time: 2.83 s
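For context, a minimal usage sketch; the toy frame and its values below are hypothetical, the only requirement being that the data passed in is already summed and indexed by fold/site/time:
import numpy as np
import pandas as pd

# Hypothetical toy frame with the columns described in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "fold": rng.integers(0, 3, size=30),
    "site": rng.integers(0, 2, size=30),
    "time": rng.integers(0, 5, size=30),
    "clicks": rng.integers(0, 100, size=30),
    "impressions": rng.integers(100, 1000, size=30),
})

# The generator expects data pre-aggregated and indexed by fold/site/time
fold_data = df.groupby(["fold", "site", "time"]).sum()

for train, val in split_fold2(fold_data, n_folds=3):
    print(train.shape, val.shape)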

Pandas shifting uneven timeseries data

I have some irregularly stamped time series data in pandas, with timestamps and an observation at every timestamp. Irregular basically means that the timestamps are uneven; for instance, the gap between two successive timestamps is not constant.
For instance the data may look like
Timestamp Property
0 100
1 200
4 300
6 400
6 401
7 500
14 506
24 550
.....
59 700
61 750
64 800
Here the timestamp is, say, seconds elapsed since a chosen origin time. As you can see, we could have data at the same timestamp, 6 secs in this case. Basically the underlying timestamps are strictly different; it is just that second-level resolution cannot capture the change.
Now I need to shift the timeseries data ahead, say I want to shift the entire data by 60 secs, or a minute. So the target output is
Timestamp Property
0 750
1 800
So the 0 point got matched to the 61 point and the 1 point got matched to the 64 point.
Now I can do this by writing something dirty, but I am looking to use inbuilt pandas features as much as possible. If the time series were regular, or evenly gapped, I could've just used the shift() function. But the fact that the series is uneven makes it a bit tricky. Any ideas from pandas experts would be welcome. I feel that this would be a commonly encountered problem. Many thanks!
Edit: added a second, more elegant way to do it. I don't know what will happen if you have a timestamp at 1 and two timestamps of 61; I think it will choose the first 61 timestamp, but I'm not sure.
new_stamps = pd.Series(range(df['Timestamp'].max() + 1))
shifted = pd.DataFrame(new_stamps)
shifted.columns = ['Timestamp']
merged = pd.merge(df, shifted, on='Timestamp', how='outer')
merged['Timestamp'] = merged['Timestamp'] - 60
merged = merged.sort_values('Timestamp').bfill()  # DataFrame.sort() has since been removed; sort_values keeps the same behaviour
results = pd.merge(df, merged, on='Timestamp')
[Original Post]
I can't think of an inbuilt or elegant way to do this. Posting this in case it's more elegant than your "something dirty", which I guess is unlikely. How about:
lookup_dict = {}

def assigner(row):
    lookup_dict[row['Timestamp']] = row['Property']

df.apply(assigner, axis=1)

sorted_keys = sorted(lookup_dict.keys())

df['Property_Shifted'] = None

def get_shifted_property(row, shift_amt):
    # find the first timestamp at least shift_amt ahead of this row's timestamp
    for i in sorted_keys:
        if i >= row['Timestamp'] + shift_amt:
            row['Property_Shifted'] = lookup_dict[i]
            return row
    return row  # no later timestamp found; Property_Shifted stays None

df = df.apply(get_shifted_property, shift_amt=60, axis=1)
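As a side note, more recent pandas can do the same forward lookup in a single call with pd.merge_asof; this is not the approach above, just a sketch using the question's sample data and the same >= matching rule as the loop:
import pandas as pd

# The question's sample data
df = pd.DataFrame({
    "Timestamp": [0, 1, 4, 6, 6, 7, 14, 24, 59, 61, 64],
    "Property": [100, 200, 300, 400, 401, 500, 506, 550, 700, 750, 800],
})

# For each row, find the first observation whose timestamp is >= Timestamp + 60
shifted = pd.merge_asof(
    df.assign(lookup=df["Timestamp"] + 60),
    df.rename(columns={"Timestamp": "lookup", "Property": "Property_Shifted"}),
    on="lookup",
    direction="forward",
).drop(columns="lookup")
Because of the >= rule, the row at Timestamp 1 is matched to 61 (750) rather than 64, exactly as the loop above would do.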

SPSS Compute Variable

Below is some data:
Test  Day1  Day2  Score
A     1     2     100
B     1     3      62
C     3     4      90
D     2     4      20
E     4     5      80
I am trying to take the values from the columns 'Day1' and 'Day2' and use them to select row numbers in the column Score. For example, for Test A I would like to find the sum of 100 and 62, because those are the values in the first and second rows of Score. For Test B I would like to find the sum of 100, 62 and 90.
Is there any way to do this in the Compute Variable window (found in the menu Transform > Compute Variable)?
I tried the following:
Score(MEAN(VALUE(Day1), VALUE(DAY2)))
This is not the proper way to reference the cell locations of Score, and I received an error.
Can anyone help?
Thank you!
You really have two different datasets here. One is a dataset of scores numbered 1 through 5.
The other is a dataset that includes indexes into the score dataset. So the steps would be something like this.
First, take the scores dataset and transpose it so that it has one row and 5 columns (Data > Transpose).
Then match that dataset to each case in the main dataset (Data > Merge Files > Add Variables).
Next you have to resort to using syntax directly.
You would declare a vector over the scores (VECTOR).
Finally, you use COMPUTE to index into the scores.
For your real problem, I suppose that you might have batches of scores and maybe there are some gaps. The Restructure Data Wizard (convert cases into variables) can help you generalize this, but let's not go there yet.
HTH,
Jon Peck