python/pandas time series: fast attack/slow decay; peak detection with decay

I would like to implement a "fast attack / slow decay" filter (peak detection with exponential decay) on a time series ts (a column in a pandas DataFrame), described as follows:
fasd[t] = max(ts[t], 0.9 * fasd[t-1])
The "basic" code below works, but is there a pythonic and efficient way to do it, using rolling() or vectorized methods? Thanks.
import pandas as pd

ts = [1,0,0,0,0,1,0,0,0,1,0.95,1,1,1,1,1,0,0,1,1,1,1,1,1]
df = pd.DataFrame({'ts': ts})
df['fasd'] = 0
df.loc[0, 'fasd'] = df.iloc[0]['ts']
for i in range(1, len(df)):
    df.loc[i, 'fasd'] = max(df.loc[i, 'ts'], 0.9 * df.loc[i-1, 'fasd'])

Using numpy is more efficient:
from time import time
import pandas as pd

ts = [1,0,0,0,0,1,0,0,0,1,0.95,1,1,1,1,1,0,0,1,1,1,1,1,1] * 1000  # artificially increasing the input size
df = pd.DataFrame({'ts': ts})
df['fasd'] = 0
df.loc[0, 'fasd'] = df.iloc[0]['ts']
df2 = df.copy()

t0 = time()
for i in range(1, len(df)):
    df.loc[i, 'fasd'] = max(df.loc[i, 'ts'], 0.9 * df.loc[i-1, 'fasd'])
t1 = time()
print(f'Pandas version executed in {t1-t0} sec.')

def fasd(array):
    for i in range(1, len(array)):
        array[i, 1] = max(array[i, 0], 0.9 * array[i-1, 1])
    return array

t0 = time()
df2 = pd.DataFrame(fasd(df2.to_numpy()))
t1 = time()
print(f'Numpy version executed in {t1-t0} sec.')
Output:
Pandas version executed in 3.0636708736419678 sec.
Numpy version executed in 0.011569976806640625 sec.
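If an explicit loop is unavoidable, one further option (my own suggestion, not part of the question or the answer above) is to JIT-compile the recursion with numba; a minimal sketch:
import numba
import numpy as np

@numba.njit
def fasd_numba(ts, decay=0.9):
    # same recursion as above, compiled to machine code on first call
    out = np.empty_like(ts)
    out[0] = ts[0]
    for i in range(1, len(ts)):
        out[i] = max(ts[i], decay * out[i - 1])
    return out

df['fasd'] = fasd_numba(df['ts'].to_numpy())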

Related

Slice dataframe and put slices into new columns

I have a big dataframe with 1 million rows of time series data. I want to slice it into smaller chunks of 1000 rows each. This gives me 1000 chunks, and I need every chunk to be copied into a column of a new dataframe.
I am now doing this, which does the job but might be inefficient. I would still be happy if people could help:
for df_split in np.array_split(df, len(df) // chunk_size):
    # print(df_split['random_nos'].mean())
    i = i + 1
    df_split = df_split.reset_index()
    df_split = df_split.rename({'random_nos': 'String' + str(i)}, axis=1)
    df_all = pd.concat([df_all, df_split], axis=1)
Will do. Now that I can slice my dataframe, I run into the next problem. In my time series, I want to slice around 16000 events with a duration of 2144.14 samples. If I slice at 2144 or 2145, the slices become displaced more and more. I tried the following, but it didn't work:
def slice_df_into_chunks(df_size, chunk_size):
    df_list = []
    chunk_size2 = chunk_size
    for i, df_split in enumerate(np.array_split(df3, chunk_size2)):
        df_split = df_split.rename(columns={'ChannelA01': f'Slice{i}'})
        df_split.reset_index(drop=True, inplace=True)
        df_list.append(df_split)
        if i % 6 == 0:
            chunk_size2 = chunk_size - 1
            print(i)
        else:
            chunk_size2 = chunk_size
    return pd.concat(df_list, axis=1)

df4 = slice_df_into_chunks(len(df3), np.floor(EventsPerLog))
I thought about solving this issue with something like the following (it takes ages), so that every now and then the chunk size is smaller. After I have defined the groups, I can cast them into dataframe columns.
for i in range(40):
    df.loc[i*SamplesEvent:(i+1)*SamplesEvent2, 'Group'] = i
    if i % 6 == 0:
        SamplesEvent2 = SamplesEvent2 - 1
        print(i)
    else:
        SamplesEvent2 = 5
You could use numpy.array_split to achieve this:
import pandas as pd
import numpy as np

def create_df_with_slices_in_cols(df, no_of_cols):
    # df = pd.DataFrame(np.random.rand(df_size), columns=['random_nos'])
    df_list = []
    for i, df_split in enumerate(np.array_split(df, no_of_cols)):
        df_split = df_split.rename(columns={'random_nos': f'Slice{i}'})
        df_split.reset_index(drop=True, inplace=True)
        df_list.append(df_split)
    return pd.concat(df_list, axis=1)

create_df_with_slices_in_cols(pd.DataFrame(np.random.rand(10**6), columns=['random_nos']),
                              no_of_cols=10**3)
Note that if len(df) is not exactly divisible by no_of_cols (e.g. 10 and 3), then one column will have extra numbers:
create_df_with_slices_in_cols(pd.DataFrame(np.random.rand(10), columns=['random_nos']), no_of_cols=3)
Slice0 Slice1 Slice2
0 0.955620 0.543234 0.509360
1 0.755157 0.174576 0.267600
2 0.816509 0.776549 0.455464
3 0.990282 NaN NaN
Update
To minimize the displacement of data when using a column size (say n) that doesn't divide len(df) exactly, you can consider only the first n-1 columns, then create one final column with the remaining rows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(16000), columns=['random_nos'])
df_size = len(df)
slice_size = 2144.14
no_of_cols = int(df_size//slice_size + 1)
def create_df_with_slices_in_cols(df, no_of_cols):
    df_list = []
    for i, df_split in enumerate(np.array_split(df, no_of_cols)):
        df_split = df_split.rename(columns={'random_nos': f'Slice{i}'})
        df_split.reset_index(drop=True, inplace=True)
        df_list.append(df_split)
    return pd.concat(df_list, axis=1)
# fill out the first 7 (n-1) columns first
rows_first_pass = int((no_of_cols-1) * np.floor(slice_size))
df_combined = create_df_with_slices_in_cols(df[:rows_first_pass], no_of_cols-1)
# fill out the 8th (nth) column using the remaining rows
rem_rows = df_size - rows_first_pass
last_df = df[-rem_rows:].rename(columns={'random_nos':f'Slice{no_of_cols-1}'})
last_df.reset_index(drop=True, inplace=True)
df_combined = pd.concat([df_combined, last_df], axis=1)
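An alternative way to spread the 0.14-sample remainder evenly (my own sketch, not part of the original answer) is to compute each slice boundary by rounding the cumulative fractional positions, so the drift never accumulates beyond one sample:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(16000), columns=['random_nos'])
slice_size = 2144.14
n_slices = int(np.ceil(len(df) / slice_size))
# boundary i sits at round(i * slice_size), clipped to the end of the frame
bounds = np.minimum(np.round(np.arange(n_slices + 1) * slice_size).astype(int), len(df))
cols = [df['random_nos'].iloc[bounds[i]:bounds[i + 1]]
            .reset_index(drop=True)
            .rename(f'Slice{i}')
        for i in range(n_slices)]
df_combined = pd.concat(cols, axis=1)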

Create fft2 result from rfft2 array

I am trying to recreate the result of a full fft2 by manipulating the result of an rfft2. The documentation states that rfft2 only computes the positive coefficients since the negative coefficients have a symmetry with the positive ones when the input is real. This would be extremely useful for large arrays since computing the rfft2 is much faster than the full fft2.
So the below code is me trying to recreate the fft2 from the rfft2 output. I have tried all kinds of manipulations of the "left" array and can't quite get "same" to be true everywhere. Any ideas?
import numpy as np
import matplotlib.pyplot as plt
from skimage.data import camera
frame = camera()
full_fft = np.fft.fft2(frame)
real_fft = np.fft.rfft2(frame)
left = real_fft[:, :-1].copy()
right = np.flipud(left[:, ::-1])
sim_fft2 = np.hstack((left, right))
same = np.isclose(full_fft, sim_fft2)
plt.figure()
plt.imshow(same)
plt.figure()
plt.imshow(np.log(np.abs(full_fft)))
plt.figure()
plt.imshow(np.log(np.abs(sim_fft2)))
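One way to see why a simple flip-and-hstack cannot line up (a small illustration of my own, not from the original post) is to compare the output shapes: rfft2 keeps only N//2 + 1 columns, so the mirrored half overlaps the DC and Nyquist columns and row 0 needs separate handling:
import numpy as np

a = np.random.rand(6, 6)
print(np.fft.fft2(a).shape)   # (6, 6)
print(np.fft.rfft2(a).shape)  # (6, 4): only 6//2 + 1 = 4 columns are stored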
I figured out the symmetry by doing the fft2 on a 6x6 array which then just required programming up a function to convert the output of a rfft2 to be the same as a fft2. Below is that function and an image of the symmetry.
import numpy

def _rfft2_to_fft2(im_shape, rfft):
    fcols = im_shape[-1]
    fft_cols = rfft.shape[-1]
    result = numpy.zeros(im_shape, dtype=rfft.dtype)
    result[:, :fft_cols] = rfft
    top = rfft[0, 1:]
    if fcols % 2 == 0:
        result[0, fft_cols-1:] = top[::-1].conj()
        mid = rfft[1:, 1:]
        mid = numpy.hstack((mid, mid[::-1, ::-1][:, 1:].conj()))
    else:
        result[0, fft_cols:] = top[::-1].conj()
        mid = rfft[1:, 1:]
        mid = numpy.hstack((mid, mid[::-1, ::-1].conj()))
    result[1:, 1:] = mid
    return result
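As a quick check (my own usage sketch, not part of the original answer), the reconstruction can be compared against the full FFT of the camera image:
import numpy as np
from skimage.data import camera

frame = camera()
rebuilt = _rfft2_to_fft2(frame.shape, np.fft.rfft2(frame))
print(np.allclose(np.fft.fft2(frame), rebuilt))  # expected: True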

Gaussian rolling weights pandas

Suppose that I have a pandas series of data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n = 1000
srs = pd.Series(np.random.random(n))
I wish to now roll a Gaussian filter through this data such that the weights look like:
window = 100
x = np.arange(window)
mu = 60
sigma = 0.2
y = np.exp(-(x-mu)**2 / 2*sigma**2) / np.sqrt(2*np.pi*sigma**2)
plt.plot(x,y)
That is to say, for each window of length 100 the 60th entry has the maximum weight and the other entries decay as per the Gaussian formulation.
Is this possible with .rolling()?
You can use numpy.average to take a weighted mean:
import numpy as np
import pandas as pd
n = 1000
window_size = 100
srs = pd.Series(np.random.random(n))
mu = 60
sigma = 0.2
x = np.arange(window_size)
weights = np.exp(-(x-mu)**2 / 2*sigma**2) / np.sqrt(2*np.pi*sigma**2)
srs.rolling(window_size).apply(lambda wndw: np.average(wndw, weights=weights))
This is the same as:
srs.rolling(window_size).apply(lambda wndw: (wndw*weights).sum()/weights.sum())
Note that this will break if you pass a min_periods smaller than window_size, since np.average requires a and weights to have the same length.
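For completeness (my own addition, not part of the answer above): pandas also offers a built-in Gaussian window via win_type='gaussian' (it requires scipy), but that window is always centred in the middle of the window, so the explicit-weights approach above is still needed when the peak should sit at an arbitrary position such as 60:
srs.rolling(window_size, win_type='gaussian').mean(std=10)  # symmetric, centred Gaussian weights; std=10 is illustrative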

Dask Dataframe: Defining meta for date diff in groupby

I'm trying to find inter-purchase times (i.e., days between orders) for customers. Although my code is working correctly without defining meta, I would like to get it working properly and no longer see the warning asking me to provide meta.
Also, I would appreciate any suggestions on how to use map or map_partitions instead of apply.
So far I've tried:
meta={'days_since_last_order': 'datetime64[ns]'}
meta={'days_since_last_order': 'f8'}
meta={'ORDER_DATE_DT':'datetime64[ns]','days_since_last_order': 'datetime64[ns]'}
meta={'ORDER_DATE_DT':'f8','days_since_last_order': 'f8'}
meta=('days_since_last_order', 'f8')
meta=('days_since_last_order', 'datetime64[ns]')
Here is my code:
import numpy as np
import pandas as pd
import datetime as dt
import dask.dataframe as dd
from dask.distributed import wait, Client
client = Client(processes=True)
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
d = (end - start).days + 1
np.random.seed(0)
df = pd.DataFrame()
df['CUSTOMER_ID'] = np.random.randint(1, 4, 10)
df['ORDER_DATE_DT'] = start + pd.to_timedelta(np.random.randint(1, d, 10), unit='d')
print(df.sort_values(['CUSTOMER_ID','ORDER_DATE_DT']))
print(df)
ddf = dd.from_pandas(df, npartitions=2)
# setting ORDER_DATE_DT as index to sort by date
ddf = ddf.set_index('ORDER_DATE_DT')
ddf = client.persist(ddf)
wait(ddf)
ddf = ddf.reset_index()
grp = ddf.groupby('CUSTOMER_ID')[['ORDER_DATE_DT']].apply(
    lambda df: df.assign(days_since_last_order=df.ORDER_DATE_DT.diff(1))
    # meta=????
)
# for some reason, I'm unable to print grp unless I reset_index()
grp = grp.reset_index()
print(grp.compute())
Here is the printout of df.sort_values(['CUSTOMER_ID','ORDER_DATE_DT'])
Here is the printout of grp.compute()
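One observation that may help (my own note, not part of the original post): ORDER_DATE_DT.diff() returns a timedelta64[ns], not a datetime64[ns], and the apply returns a DataFrame with both columns, so the meta would need to describe both; a sketch under that assumption:
meta = {'ORDER_DATE_DT': 'datetime64[ns]',
        'days_since_last_order': 'timedelta64[ns]'}
grp = ddf.groupby('CUSTOMER_ID')[['ORDER_DATE_DT']].apply(
    lambda df: df.assign(days_since_last_order=df.ORDER_DATE_DT.diff(1)),
    meta=meta,
)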

Slice pandas' MultiIndex DataFrame

To keep track of all simulation results in a parametric run, I create a MultiIndex DataFrame named dfParRun in pandas as follows:
import pandas as pd
import numpy as np
import itertools
limOpt = [0.1,1,10]
reimbOpt = ['Cash','Time']
xOpt = [0.1, .02, .03, .04, .05, .06, .07, .08]
zOpt = [1, 5, 10]
arrays = [limOpt, reimbOpt, xOpt, zOpt]
parameters = list(itertools.product(*arrays))
nPar = len(parameters)
variables = ['X', 'Y', 'Z']
nVar = len(variables)
index = pd.MultiIndex.from_tuples(parameters, names=['lim', 'reimb', 'xMax', 'zMax'])
dfParRun = pd.DataFrame(np.random.rand(nPar, nVar), index=index, columns=variables)
To analyse my parametric run, I want to slice this dataframe, but this seems a burden. For example, I want all results for xMax above 0.05 and lim equal to 10. At the moment, the only working method I have found is:
df = dfParRun.reset_index()
df.loc[(df.xMax>0.5) & (df.lim==10)]
and I wonder if there is a method that does not require resetting the index of the DataFrame?
option 1
use pd.IndexSlice
caveat: requires sort_index
dfParRun.sort_index().loc[pd.IndexSlice[10, :, .0500001:, :]]
option 2
use your df after having reset_index
df.query('xMax > 0.05 & lim == 10')
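A third possibility (my own addition, not in the original answer): query can also reference named MultiIndex levels directly, so the reset_index step can be skipped altogether:
dfParRun.query('xMax > 0.05 & lim == 10')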
setup
import pandas as pd
import numpy as np
import itertools
limOpt = [0.1,1,10]
reimbOpt = ['Cash','Time']
xOpt = [0.1, .02, .03, .04, .05, .06, .07, .08]
zOpt = [1, 5, 10]
arrays = [limOpt, reimbOpt, xOpt, zOpt]
parameters = list(itertools.product(*arrays))
nPar = len(parameters)
variables = ['X', 'Y', 'Z']
nVar = len(variables)
index = pd.MultiIndex.from_tuples(parameters, names=['lim', 'reimb', 'xMax', 'zMax'])
dfParRun = pd.DataFrame(np.random.rand(*(nPar, nVar)), index=index, columns=variables)
df = dfParRun.reset_index()