What is the fastest way to compute quantiles over a grouped dataframe? - pandas

I am creating monthly diurnal plots from a pandas dataframe. I need to plot the mean, the median, or an arbitrary quantile. I am able to achieve this correctly, but with large data the quantile computation is far slower than the mean or median computation. Is there any faster way to achieve this?
import pandas as pd
import numpy as np
import datetime as dt

date_range = pd.date_range(start=dt.datetime(2018, 1, 1, 0, 0), end=dt.datetime(2018, 12, 31, 23, 59), freq='1min')
N = len(date_range)
df = pd.DataFrame({'Test': np.random.rand(N)}, index=date_range)
df['Time'] = df.index.time
df['Month'] = df.index.month

time_mean_median = dt.datetime(2019, 1, 1, 0, 0, 0)
time_quantiles = dt.datetime(2019, 1, 1, 0, 0, 0)
quantiles = [0.23, 0.72]
for i in range(12):
    df_month = df[['Test', 'Time']].loc[df['Month'] == i + 1]

    # Mean/median per time of day
    start_time = dt.datetime.now()
    df1_group = df_month.groupby('Time').agg([np.mean, np.median])
    time_mean_median += dt.datetime.now() - start_time

    # Quantiles per time of day
    start_time = dt.datetime.now()
    df2_group = df_month.groupby('Time').quantile(q=quantiles).unstack()
    time_quantiles += dt.datetime.now() - start_time

print('Mean/median computation time {}'.format(time_mean_median.time()))
print('Quantile computation time {}'.format(time_quantiles.time()))
In this example I get a total mean/median computation time of around 0.7 seconds, compared to almost 12 seconds for the quantile computation.
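One approach that may be faster (a sketch, not from the original post, and assuming the one-minute index stays complete as in the example above): because every day contributes exactly 1440 samples, each month can be reshaped into a (days, 1440) array and the quantiles computed with a single NumPy call, avoiding the per-group overhead of groupby().quantile().
import numpy as np
import pandas as pd

quantiles = [0.23, 0.72]
for month, df_month in df.groupby(df.index.month):
    # One row per day, one column per minute of the day
    values = df_month['Test'].to_numpy().reshape(-1, 1440)
    # Shape (len(quantiles), 1440): quantiles for every time of day
    q = np.quantile(values, quantiles, axis=0)
    # Optional: wrap back into a DataFrame indexed by time of day
    times = pd.date_range('2019-01-01', periods=1440, freq='1min').time
    q_df = pd.DataFrame(q.T, index=times, columns=quantiles)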

Related

Resampling the decomposed trend, residual and seasonality series of the data from monthly to daily

I am performing a time-series decomposition on a dataset to evaluate its trend, residual, and seasonality. I want to resample the trend, seasonality, and residual from monthly to daily frequency, but I am getting an error related to my DatetimeIndex.
Here is my code:
# Import
import numpy as np
import pandas as pd
import tsfresh
import matplotlib.pyplot as plt
# monthly time series data set
df = pd.read_csv("data.csv")
df = pd.DataFrame(df)
# perform linear interpolation on the dataframe
df.interpolate(method='linear', inplace=True)
X = df['X']
date_index = pd.date_range(start='2002-01-01', periods=12, freq='M')
monthly_X = pd.Series(X, index=date_index)
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(df['X'], period = 1)
# Plot the decomposed time series to interpret.
plt.figure(figsize=(100, 20))
decomp.plot();
trend = decomp.trend
seasonal = decomp.seasonal
residual = decomp.resid
# Resample the DataFrame to daily frequency
daily_trend = pd.Series(trend).resample('D').interpolate()
daily_seasonal = pd.Series(seasonal).resample('D').interpolate()
daily_residual = pd.Series(residual).resample('D').interpolate()
(I get the error here, at pd.Series(trend).resample(...): "Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'".)
How can I resample these series?
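A minimal sketch of one possible fix (not from the original post): seasonal_decompose returns components that carry the index of its input, so attaching the monthly DatetimeIndex before decomposing makes .resample('D') valid afterwards. Note that pd.Series(X, index=date_index) reindexes an existing Series by label, so the raw values are passed instead; the assumption here is that the 12 monthly rows of df['X'] line up with date_index.
from statsmodels.tsa.seasonal import seasonal_decompose

# Take the values positionally instead of reindexing by label
monthly_X = pd.Series(df['X'].to_numpy(), index=date_index)
decomp = seasonal_decompose(monthly_X, period=1)

# The components now carry a DatetimeIndex, so daily resampling works
daily_trend = decomp.trend.resample('D').interpolate()
daily_seasonal = decomp.seasonal.resample('D').interpolate()
daily_residual = decomp.resid.resample('D').interpolate()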

How do you speed up a score calculation based on two rows in a Pandas Dataframe?

TL;DR: How can the for-loop below be adjusted for a faster execution time?
import numpy as np
import pandas as pd
import time

np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5

start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1] + target_row  # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3)  # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time() - start)

# Goal: Calculate the list result as efficiently as possible
# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)

start = time.time()
q = df.apply(lambda row: add(row, target_row), axis=1)
print(time.time() - start)
So I have a dataframe with 30,000 rows and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often, I would like to optimize it for performance.
Any help is very much appreciated.
I have already read Optimization when using Pandas; are there further resources you can recommend? Thanks.
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations: broadcast-add the target row onto every row, then score each row
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.

Calculating the covariance matrix fast in Python with some minor customizing

I have a pandas dataframe and I'm trying to find the covariance of the percentage change of each column. For each pair of columns, I want rows with missing values to be dropped and the percentage change to be calculated afterwards. That is, I want something like this:
import pandas as pd
import numpy as np
# create dataframe example
N_ROWS, N_COLS = 249, 3535
df = pd.DataFrame(np.random.random((N_ROWS, N_COLS)))
df.iloc[np.random.choice(N_ROWS, N_COLS), np.random.choice(10, 50)] = np.nan
cov_df = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        cov = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().cov()
        cov_df.loc[col_i, col_j] = cov.iloc[0, 1]
The thing is, this is super slow. The code below gives me results that are similar to (but not exactly) what I want, and it runs quite fast:
df.dropna(how='any', axis=0).pct_change().cov()
I am not sure why the second one runs so much faster. I want to speed up my code in the first version, but I can't figure out how.
I have tried using combinations from itertools to avoid repeating the calculation for (col_i, col_j) and (col_j, col_i), and using map from multiprocessing to do the computations in parallel, but it still hasn't finished running after 90+ minutes.
Somehow this works fast enough, although I am not sure why:
from scipy.stats import pearsonr

corr = np.zeros((x.shape[1], x.shape[1]))
for i in range(x.shape[1]):
    for j in range(i + 1, x.shape[1]):
        y = x[:, [i, j]]
        y = y[~np.isnan(y).any(axis=1)]
        y = np.diff(y, axis=0) / y[:-1, :]
        if len(y) < 2:
            corr[i, j] = np.nan
            continue
        y = pearsonr(y[:, 0], y[:, 1])[0]
        corr[i, j] = y
corr = corr + corr.T
np.fill_diagonal(corr, 1)
This finishes within 8 minutes, which is fast enough for my use case.
On the other hand, this has been running for 30 minutes and still isn't done:
corr = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        corr_ij = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().corr().iloc[0, 1]
        corr.loc[col_i, col_j] = corr_ij
t1 = time.time()
I don't know why this is, but anyway the first approach is a good enough solution for me for now.
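For reference, here is a sketch of the same pairwise loop that keeps the covariance the question originally asked for, instead of the Pearson correlation (the assumption being that x is the underlying NumPy array, e.g. x = df.to_numpy()):
import numpy as np

x = df.to_numpy()
n = x.shape[1]
cov = np.full((n, n), np.nan)
for i in range(n):
    for j in range(i, n):
        y = x[:, [i, j]]
        # Drop rows with a NaN in either column, then take the percentage change
        y = y[~np.isnan(y).any(axis=1)]
        y = np.diff(y, axis=0) / y[:-1, :]
        if len(y) >= 2:
            # np.cov uses ddof=1 by default, matching pandas .cov()
            cov[i, j] = cov[j, i] = np.cov(y[:, 0], y[:, 1])[0, 1]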

Pandas get max delta in a timeseries for a specified period

Given a dataframe with a non-regular time series as its index, I'd like to find the max delta between the values within any period of 10 seconds. Here is some code that does what I want:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
xs = np.cumsum(np.random.rand(200))
# This is to create a general situation where the max is not always at the end or beginning
ys = xs**1.2 + 10 * np.sin(xs)
plt.plot(xs, ys, '+-')

threshold = 10
xs_thresh_ind = np.zeros_like(xs, dtype=int)
deltas = np.zeros_like(ys)
for i, x in enumerate(xs):
    # Find indices that lie within the time threshold
    period_end_ind = np.argmax(xs > x + threshold)
    # Only operate when the window is wide enough (this can be treated differently)
    if period_end_ind > 0:
        xs_thresh_ind[i] = period_end_ind
        # Find extrema in the period
        period_min = np.min(ys[i:period_end_ind + 1])
        period_max = np.max(ys[i:period_end_ind + 1])
        deltas[i] = period_max - period_min

max_ind_low = np.argmax(deltas)
max_ind_high = xs_thresh_ind[max_ind_low]
max_delta = deltas[max_ind_low]
print('Max delta {:.2f} is in period x[{}]={:.2f},{:.2f} and x[{}]={:.2f},{:.2f}'
      .format(max_delta, max_ind_low, xs[max_ind_low], ys[max_ind_low],
              max_ind_high, xs[max_ind_high], ys[max_ind_high]))
df = pd.DataFrame(ys, index=xs)
OUTPUT:
Max delta 48.76 is in period x[167]=86.10,200.32 and x[189]=96.14,249.09
Is there an efficient, pandas-native way to achieve something similar?
Create a Series from ys values, indexed by xs - but convert xs to be actual timedelta elements, rather than the float equivalent.
ts = pd.Series(ys, index=pd.to_timedelta(xs, unit="s"))
We want to apply a leading, 10 second window in which we calculate the difference between max and min. Because we want it to be leading, we'll sort the Series in descending order and apply a trailing window.
deltas = ts.sort_index(ascending=False).rolling("10s").agg(lambda s: s.max() - s.min())
Find the maximum delta with deltas[deltas == deltas.max()], which gives
0 days 00:01:26.104797298 48.354851
meaning a delta of 48.35 was found in the interval [86.1, 96.1)
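As a small follow-up sketch (not part of the original answer), the window start can be mapped back to the original float x units if needed:
# idxmax gives the Timedelta at which the winning 10 s window starts;
# total_seconds() converts it back to the float units used for xs.
t_start = deltas.idxmax()
print(t_start.total_seconds(), deltas.max())   # roughly 86.10 and 48.35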

Calculating time-series as a percentage of total

I'm looking at county-level procurement data (millions of bills) and plotting time-series with matplotlib and pandas using groupby:
dataframe_slice.groupby(pd.Grouper(freq='1M')).bill_amount.sum().plot()
where bill_amount is a column of floats that shows how much was billed. How can I change the graph to show dataframe_slice as a percentage of the total dataframe's bill_amount?
I am not aware of an out-of-the-box pandas function for that (but I am hoping to be proven wrong). In my opinion, you have to calculate the overall total_sum first, then determine the percentage for each group aggregation. Stand-alone code:
import pandas as pd
from matplotlib import pyplot as plt
#fake data generation
import numpy as np
np.random.seed(123)
n = 200
start = pd.to_datetime("2017-07-17")
end = pd.to_datetime("2018-04-03")
ndays = (end - start).days + 1
date_range = pd.to_timedelta(np.random.rand(n) * ndays, unit="D") + start
df = pd.DataFrame({"ind": date_range,
"bill_amount": np.random.randint(10, 30, n),
"cat": np.random.choice(["X", "Y", "Z"], n)})
df.set_index("ind", inplace=True)
#df.sort_index(inplace=True)
#this assumes that your dataframe has a datetime index
#here starts the actual calculation
total_sum = df.bill_amount.sum()
dataframe_slice = df.groupby(pd.Grouper(freq='1M')).bill_amount.sum().div(total_sum)*100
dataframe_slice.plot()
#and we beautify the plot
plt.xlabel("Month of expenditure")
plt.ylabel("Percentage of expenditure")
plt.tight_layout()
plt.show()
Sample output: a line plot of the monthly bill_amount as a percentage of the total.