How to vectorize to speed up DataFrame apply in pandas

I have a tXn (5000 X 100) dataframe wts_df,
wts_df.tail().iloc[:, 0:6]
Out[71]:
B C H L R T
2020-09-25 0.038746 0.033689 -0.047835 -0.002641 0.009501 -0.030689
2020-09-28 0.038483 0.033189 -0.061742 0.001199 0.009490 -0.028370
2020-09-29 0.038620 0.034957 -0.031341 0.006179 0.007815 -0.027317
2020-09-30 0.038610 0.034902 -0.014271 0.004512 0.007836 -0.024672
2020-10-01 0.038790 0.029937 -0.044198 -0.008415 0.008347 -0.030980
and two similar txn dataframes, vol_df and rx_df (same index and columns). For now we can use,
rx_df = wts_df.applymap(lambda x: np.random.rand())
vol_df = wts_df.applymap(lambda x: np.random.rand())
I need to do this (simplified):
rate = {}
for date in wts_df.index:
    wts = wts_df.loc[date]  # weights on this date, a 1 x n vector
    # multiply all entries of rx_df and vol_df up to this date by these wts, and sum across columns
    rx = rx_df.truncate(after=date)  # still a dataframe but truncated at the given date, k x n
    vol = vol_df.truncate(after=date)
    wtd_rx = (wts * rx).sum(1)  # so a k x 1 vector
    wtd_vol = (wts * vol).sum(1)
    # take the ratio of the weighted sums
    rx_vol = wtd_rx / wtd_vol
    rate[date] = rx_vol.tail(20).std()
So rate looks like this
pd.Series(rate).tail()
Out[71]:
rate
2020-09-25 0.0546
2020-09-28 0.0383
2020-09-29 0.0920
2020-09-30 0.0510
2020-10-01 0.0890
The above loop is slow, so I tried this:
def rate_calc(wts, date, rx_df=rx_df, vol_df=vol_df):
    wtd_rx = (rx_df * wts).sum(1)
    wtd_vol = (vol_df * wts).sum(1)
    rx_vol = wtd_rx / wtd_vol
    rate = rx_vol.truncate(after=date).tail(20).std()
    return rate
rates = wts_df.apply(lambda x: rate_calc(x, x.name), axis=1)
This is still very slow. Moreover, I need to do this for multiple wts_df contained in a dict, so the total operation takes a lot of time.
rates = {key: val.apply(lambda x: rate_calc(x, x.name), axis=1) for key, val in wts_df_dict.items()}
Any ideas how to speed such operations?

Your question is about optimization, so here are a few pointers for solving your problem.
First, when it comes to speed, always measure with %timeit to confirm that a new strategy is actually faster.
Second, there are a few ways to iterate over data:
iterrows(): use it only when the data sample is small (or better yet, avoid it, as it is very slow).
apply: a better alternative to iterrows() and usually faster, but when the data set is large (like in your example) it can still be slow.
Vectorizing: simply put, you execute the operation on the entire column/array at once, and it is significantly faster. Winner!
So, to solve your speed problem, your strategy should be vectorization. Here is how it should work (mind the .values):
df['new_column'] = my_function(df['column_1'].values, df['column_2'].values...) and you will see a much faster result.
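For the loop in the question specifically, one possible vectorized sketch is shown below. It is not part of the original answer. It assumes that the intended ratio is wtd_rx / wtd_vol, that the three frames share the same sorted, unique index and the same columns, and that the weighted volumes are never zero. The names rate_loop and rate_vectorized are made up for illustration. The idea is that the last 20 rows of the weighted ratio for a date only need that date's weights applied to the 20 most recent rows, so each full-history product can be replaced by 20 lagged row-wise dot products.

```python
import warnings

import numpy as np
import pandas as pd


def rate_loop(wts_df, rx_df, vol_df, window=20):
    """Reference version: the loop from the question (assumed intent)."""
    rate = {}
    for date in wts_df.index:
        wts = wts_df.loc[date]
        wtd_rx = (rx_df.loc[:date] * wts).sum(axis=1)
        wtd_vol = (vol_df.loc[:date] * wts).sum(axis=1)
        rate[date] = (wtd_rx / wtd_vol).tail(window).std()
    return pd.Series(rate)


def rate_vectorized(wts_df, rx_df, vol_df, window=20):
    """Hypothetical vectorized version: one pass per lag instead of per date."""
    w, rx, vol = wts_df.to_numpy(), rx_df.to_numpy(), vol_df.to_numpy()
    t = len(w)
    # ratios[i, k] = weighted ratio of the row k steps before date i,
    # computed with the weights of date i; NaN where that row does not exist
    ratios = np.full((t, window), np.nan)
    for k in range(min(window, t)):
        num = np.einsum("ij,ij->i", w[k:], rx[: t - k])
        den = np.einsum("ij,ij->i", w[k:], vol[: t - k])
        ratios[k:, k] = num / den
    with warnings.catch_warnings():
        # rows with a single value have no sample std; keep NaN, like pandas
        warnings.simplefilter("ignore", RuntimeWarning)
        std = np.nanstd(ratios, axis=1, ddof=1)
    return pd.Series(std, index=wts_df.index)


# Small random frames, only to compare the two versions
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=60, freq="D")
cols = list("ABCDEF")
wts_df = pd.DataFrame(rng.random((60, 6)), index=idx, columns=cols)
rx_df = pd.DataFrame(rng.random((60, 6)), index=idx, columns=cols)
vol_df = pd.DataFrame(rng.random((60, 6)), index=idx, columns=cols)

slow = rate_loop(wts_df, rx_df, vol_df)
fast = rate_vectorized(wts_df, rx_df, vol_df)
print(np.allclose(slow.to_numpy(), fast.to_numpy(), equal_nan=True))
```

As suggested above, timing both functions with %timeit on the real 5000 x 100 frames is the way to confirm whether this helps in practice.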

Related

Pandas rolling window on an offset between 4 and 2 weeks in the past

I have a datafile with quality scores from different suppliers over a time range of 3 years. The end goal is to use machine learning to predict the quality label (good or bad) of a shipment based on supplier information.
I want to use the mean historic quality data over a specific period of time as an input feature in this model by using a pandas rolling window. The problem with this method is that pandas only allows you to create a window from t=0-x until t=0 for your rolling window, as presented below:
df['average_score t-2w'] = df['score'].rolling(window='14d',closed='left').mean()
And this is where the problem comes in. For my feature I want to use quality data from a period of 2 weeks, but not the 2 weeks immediately before the corresponding shipment: the 2 weeks starting at t=-4 weeks and ending at t=-2 weeks.
You would imagine that this could be solved by using the same string of code but changing the window as presented below:
df['average_score t-2w'] = df['score'].rolling(window='28d' - '14d',closed='left').mean()
This, or any other notation for this specific window, does not seem to work.
It seems like pandas does not offer a solution to this problem, so we worked around it with the following solution:
def time_shift_week(df):
    def _avg_score_interval_func(series):
        current_time = series.index[-1]
        result = series[(series.index > (current_time - pd.Timedelta(value=4, unit='w')))
                        & (series.index < (current_time - pd.Timedelta(value=2, unit='w')))]
        return result.mean() if len(result) > 0 else 0.0
    temp_df = df.groupby(by=["supplier", "timestamp"], as_index=False).aggregate({"score": np.mean}).set_index('timestamp')
    temp_df["w-42"] = (
        temp_df
        .groupby(["supplier"])
        .score
        .apply(lambda x:
               x
               .rolling(window='30D', closed='both')
               .apply(_avg_score_interval_func)
        ))
    return temp_df.reset_index()
This results in a new df containing the average score per supplier per timestamp, which we can subsequently merge with the original data frame to obtain the new feature.
Doing it this way seems really cumbersome and overly complicated for the task I am trying to perform. Even though we have found a workaround, I am wondering if there is an easier method of doing this.
Is anyone aware of a less complicated way of performing this rolling window feature extraction?
While pandas does not have the custom date offset you need, calculating the mean is pretty simple: it's just sum divided by count. You can subtract the 14-day rolling window from the 28-day rolling window:
# Some sample data. All scores are sequential for easy verification
idx = pd.MultiIndex.from_product(
[list("ABC"), pd.date_range("2020-01-01", "2022-12-31")],
names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx))}, index=idx).reset_index()
# Now compute the rolling average of score over the custom window.
# closed="left" means the current row is excluded from the window.
score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())
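To check that the subtraction really selects the window from 28 days back up to (but not including) 14 days back, the result can be compared with an explicit date filter for a single row. The sketch below is an illustration added here, not part of the original answer. It uses a shorter 60-day sample and picks supplier "A" on 2020-02-10 arbitrarily.

```python
import numpy as np
import pandas as pd

# Sequential scores, as in the sample data above, but only 60 days per supplier
idx = pd.MultiIndex.from_product(
    [list("ABC"), pd.date_range("2020-01-01", periods=60)],
    names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx), dtype=float)}, index=idx).reset_index()

score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())

# Explicit filter for one (supplier, timestamp) pair:
# rows with t - 28 days <= timestamp < t - 14 days
t = pd.Timestamp("2020-02-10")
a = df[df["supplier"] == "A"].set_index("timestamp")["score"]
in_window = (a.index >= t - pd.Timedelta(days=28)) & (a.index < t - pd.Timedelta(days=14))
expected = a[in_window].mean()

print(avg_score.loc[("A", t)], expected)
```

Rows that have no data in the 28-to-14-day window (for example the first two weeks of each supplier) have a zero or missing count in the denominator, so they may need separate handling, such as filling with a default value.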

Row-wise cumulative product on large data.table benchmarking

Suppose I have a large data.table with about 1000 columns and 100,000 rows like this:
dt <- data.table(col1 = runif(10^4))
for (i in 2:10^3) set(dt, j = paste('col', as.character(i), sep = ''), value = dt[[i-1]] * 0.95^(i-1))
Think of these as representing 'daily mortality rates'. I want to calculate monthly survival rates, so I have the following chunk of code:
dt[, paste0('surv_rate_', 1:10^3) := Reduce('*', (1-dt[, paste0('col', 1:10^3)])^30, accumulate = T)]
I was not able to find any benchmarking on row-wise cumulative products like is shown above.
Can you think of any better/cleaner/faster methods for doing this in a data.table way?
(I thought of as.data.table(t(cumprod(t(dt)))) as well, but this appears to be taking forever with a table this size)
As mentioned in the comments, another possible solution is to use apply along rows:
res1 <-
copy(dt)[, paste0('surv_rate_', 1:10^3) := transpose(apply((1-.SD)^30,1,cumprod,simplify=F))]
res2 <- copy(dt)[, paste0('surv_rate_', 1:10^3) := Reduce('*', (1-dt[, paste0('col', 1:10^3)])^30, accumulate = T)]
all.equal(res1,res2)
[1] TRUE
However, your solution remains 30% faster:
Unit: milliseconds
expr (1): copy(dt)[, `:=`(paste0("surv_rate_", 1:10^3), transpose(apply((1-.SD)^30, 1, cumprod, simplify = F)))]
expr (2): copy(dt)[, `:=`(paste0("surv_rate_", 1:10^3), Reduce("*", (1-dt[, paste0("col", 1:10^3)])^30, accumulate = T))]
expr        min       lq     mean   median       uq      max neval
(1)   1011.7095 1246.993 1743.854 1546.797 2119.166 2772.646    10
(2)    793.5415 1046.194 1314.569 1249.636 1405.414 2496.858    10

Pandas manipulation: matching data from other columns to one column, applied uniquely to all rows

I have a model that predicts 10 words for a particular course in order of likelihood, and I'd like the first 5 words of those words that appear in the course's description.
This is the format of the data:
course_name course_title course_description predicted_word_10 predicted_word_9 predicted_word_8 predicted_word_7 predicted_word_6 predicted_word_5 predicted_word_4 predicted_word_3 predicted_word_2 predicted_word_1
Xmath 32 Precalculus Polynomial and rational functions, exponential... directed scholars approach build african different visual cultures placed global
Xphilos 2 Morality Introduction to ethical and political philosop... make presentation weekly european ways general range questions liberal speakers
My idea is for each row to start iterating from predicted_word_1 until I get the first 5 that are in the description. I'd like to save those words in the order they appear into additional columns description_word_1 ... description_word_5. (If there are <5 predicted words in the description I plan to return NAN in the corresponding columns).
To clarify with an example: if the course_description of a course is 'Polynomial and rational functions, exponential and logarithmic functions, trigonometry and trigonometric functions. Complex numbers, fundamental theorem of algebra, mathematical induction, binomial theorem, series, and sequences. ' and its first few predicted words are irrelevantword1, induction, exponential, logarithmic, irrelevantword2, polynomial, algebra...
I would want to return induction, exponential, logarithmic, polynomial, algebra, in that order, and do the same for the rest of the courses.
My attempt was to define an apply function that will take in a row and iterate from the first predicted word until it finds the first 5 that are in the description, but the part I am unable to figure out is how to create these additional columns that have the correct words for each course. This code will currently only keep the words for one course for all the rows.
def find_top_description_words(row):
    print(row['course_title'])
    description_words_index = 1
    for i in range(num_words_per_course):
        description = row.loc['course_description']
        word_i = row.loc['predicted_word_' + str(i+1)]
        if (word_i in description) & (description_words_index <= 5):
            print(description_words_index)
            row['description_word_' + str(description_words_index)] = word_i
            description_words_index += 1
df.apply(find_top_description_words, axis=1)
The end goal of this data manipulation is to keep the top 10 predicted words from the model and the top 5 predicted words in the description so the dataframe would look like:
course_name course_title course_description top_description_word_1 ... top_description_word_5 predicted_word_1 ... predicted_word_10
Any pointers would be appreciated. Thank you!
If I understand correctly:
First, collect just the 10 predicted words for each row:
pred_words_lists = df.apply(lambda x: list(x[3:].dropna())[::-1], axis = 1)
Note that each row now holds a list of predicted words. The order is convenient: the first non-empty predicted word comes first, the second comes second, and so on.
Now let's create a new DataFrame:
pred_words_df = pd.DataFrame(pred_words_lists.tolist())
pred_words_df.columns = df.columns[:2:-1]
And the final DataFrame:
final_df = df[['course_name', 'course_title', 'course_description']].join(pred_words_df.iloc[:,0:11])
Hope this works.
EDIT
def common_elements(xx, yy):
    # position of each description word, indexed by the word itself
    temp = pd.Series(range(0, len(xx)), index=xx)
    return list(temp.reindex(yy).sort_values()[0:10].dropna().index)
pred_words_lists = df.apply(lambda x: common_elements(x.iloc[2].replace(',','').split(), list(x.iloc[3:].dropna())), axis = 1)
Does it satisfy your requirements?
Adapted solution (OP):
def get_sorted_descriptions_words(course_description, predicted_words, k):
    description_words = course_description.replace(',','').split()
    predicted_words_list = list(predicted_words)
    predicted_words = pd.Series(range(0, len(predicted_words_list)), index=predicted_words_list)
    predicted_words = predicted_words[~predicted_words.index.duplicated()]
    ordered_description = predicted_words.reindex(description_words).dropna().sort_values()
    ordered_description_list = pd.Series(ordered_description.index).unique()[:k]
    return ordered_description_list
k = 5  # number of description words to keep per course
df.apply(lambda x: get_sorted_descriptions_words(x['course_description'], x.filter(regex=r'predicted_word_.*'), k), axis=1)
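For comparison, here is a minimal sketch of the same idea that walks the predicted words directly, from predicted_word_1 upward, and returns the new columns as a Series so that apply builds a DataFrame that can be joined back. It is not taken from the answers above. The function name top_description_words and its n_pred parameter are hypothetical, matching is assumed to be case-insensitive on whole words, and the sample row is built from the example in the question.

```python
import re

import numpy as np
import pandas as pd


def top_description_words(row, n_pred=10, k=5):
    """Return the first k predicted words (most likely first) found in the description."""
    # Lower-cased whole words of the description, punctuation stripped
    tokens = set(re.findall(r"[a-z0-9']+", str(row["course_description"]).lower()))
    hits = []
    for i in range(1, n_pred + 1):
        word = row[f"predicted_word_{i}"]
        if isinstance(word, str) and word.lower() in tokens:
            hits.append(word)
        if len(hits) == k:
            break
    # Pad with NaN when fewer than k predicted words appear in the description
    hits += [np.nan] * (k - len(hits))
    return pd.Series(hits, index=[f"description_word_{j}" for j in range(1, k + 1)])


description = (
    "Polynomial and rational functions, exponential and logarithmic functions, "
    "trigonometry and trigonometric functions. Complex numbers, fundamental theorem "
    "of algebra, mathematical induction, binomial theorem, series, and sequences."
)
predicted = ["irrelevantword1", "induction", "exponential", "logarithmic",
             "irrelevantword2", "polynomial", "algebra"]
row = {"course_name": "Xmath 32", "course_title": "Precalculus",
       "course_description": description}
row.update({f"predicted_word_{i}": w for i, w in enumerate(predicted, start=1)})
df = pd.DataFrame([row])

# Extra keyword arguments to DataFrame.apply are passed on to the function
top_words = df.apply(top_description_words, axis=1, n_pred=len(predicted))
result = df.join(top_words)
print(result.filter(like="description_word_").iloc[0].tolist())
```

Because the function returns a Series with fixed labels, each row contributes its own description_word_1 ... description_word_5 values, which addresses the problem of one course's words being kept for all rows.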

What algorithm can I use to compute number of say positive or negative postings seen until a certain timepoint?

I wish to check if my understanding and proposed algorithm below would be correct.
To calculate the number of positive postings I have seen until time point ti, I am proposing a loop as below:
sumofPi = 0
for x = 0 until x = ti
    sumofPi = sumofPi + Pi-1
I am not sure if this will work, but the idea is to be able to sum up the positive postings that come in up to a certain time point in a data stream.
Thanks
The sequence seems fine as long as the events are indexed in order and you are comfortable losing events that happened at the same time but were indexed differently, as a result of that limitation. You may also want to address filtering by posting type.
Your algorithm in Python:
# Sample data
postingevents=[1,0,1,1,0,1]
# Algorithm:
sumofPi = 0
ti=4
for i in range(0,ti):
    sumofPi += postingevents[i]
print(sumofPi)
3
Looks like you are dealing with time series.
For time series, I would suggest a rolling sum or rolling weighted averages.
Below are some Python code samples using loops and recursion with a data sample (event indicator and epoch timestamp):
# Data sample:
postingevents=[1,0,1,1,0,1]
postingti=[1497634668,1497634669,1497634697,1497634697,1497634714,1497634718]
postings=([postingevents,postingti])
# All events at or before timestamp t. Events do not need to be ordered by time.
def sumpi_notordered(X,t):
    return sum([xv if yv<=t else 0 for (xv,yv) in zip(X[0],X[1])])
# Sum ordered events up to index t, using recursion.
def sumpi_ordered(X,t):
    if t>=1:
        return X[t]+sumpi_ordered(X,t-1)
    else:
        return(X[t])
print(sumpi_notordered(postings,1497634697))
3
print(sumpi_ordered(postingevents,3))
3
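If the count is needed for many time points, a running total can be computed once and then looked up. The sketch below is an illustration added here, reusing the sample data above. It assumes the timestamps are already sorted in ascending order, and the helper name positives_until is made up.

```python
from bisect import bisect_right
from itertools import accumulate

postingevents = [1, 0, 1, 1, 0, 1]
postingti = [1497634668, 1497634669, 1497634697, 1497634697, 1497634714, 1497634718]

# running_total[i] = number of positive postings among events 0..i
running_total = list(accumulate(postingevents))


def positives_until(t):
    """Count positive postings with a timestamp <= t (timestamps must be sorted)."""
    n = bisect_right(postingti, t)  # number of events at or before t
    return running_total[n - 1] if n > 0 else 0


print(positives_until(1497634697))
```

Each query is then a binary search instead of a pass over all events, and events that share a timestamp are all included.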

Pandas shifting uneven timeseries data

I have some irregularly stamped time series data, with timestamps and the observations at every timestamp, in pandas. Irregular basically means that the timestamps are uneven, for instance the gap between two successive timestamps is not even.
For instance the data may look like
Timestamp Property
0 100
1 200
4 300
6 400
6 401
7 500
14 506
24 550
.....
59 700
61 750
64 800
Here the timestamp is, say, seconds elapsed since a chosen origin time. As you can see, we could have data at the same timestamp, 6 secs in this case. Basically the timestamps are strictly different; it is just that second resolution cannot measure the change.
Now I need to shift the timeseries data ahead, say I want to shift the entire data by 60 secs, or a minute. So the target output is
Timestamp Property
0 750
1 800
So the 0 point got matched to the 61 point and the 1 point got matched to the 64 point.
Now I can do this by writing something dirty, but I am looking to use as much as possible any inbuilt pandas feature. If the timeseries were regular, or evenly gapped, I could've just used the shift() function. But the fact that the series is uneven makes it a bit tricky. Any ideas from Pandas experts would be welcome. I feel that this would be a commonly encountered problem. Many thanks!
Edit: added a second, more elegant, way to do it. I don't know what will happen if you had a timestamp at 1 and two timestamps of 61. I think it will choose the first 61 timestamp but not sure.
new_stamps = pd.Series(range(df['Timestamp'].max()+1))
shifted = pd.DataFrame(new_stamps)
shifted.columns = ['Timestamp']
merged = pd.merge(df,shifted,on='Timestamp',how='outer')
merged['Timestamp'] = merged['Timestamp'] - 60
merged = merged.sort_values(by='Timestamp').bfill()
results = pd.merge(df,merged, on = 'Timestamp')
[Original Post]
I can't think of an inbuilt or elegant way to do this. Posting this in case it's more elegant than your "something dirty", which is I guess unlikely. How about:
lookup_dict = {}
def assigner(row):
    lookup_dict[row['Timestamp']] = row['Property']
df.apply(assigner, axis=1)
sorted_keys = sorted(lookup_dict.keys())
df['Property_Shifted'] = None
def get_shifted_property(row,shift_amt):
    for i in sorted_keys:
        if i >= row['Timestamp'] + shift_amt:
            row['Property_Shifted'] = lookup_dict[i]
            break  # stop at the first timestamp at or beyond the shifted time
    return row
df = df.apply(get_shifted_property, shift_amt=60, axis=1)
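Another option, not mentioned in the answer above, is pandas' merge_asof, which matches each row to the nearest key in a given direction. The sketch below is a hedged illustration. It uses only the rows actually listed in the question (the elided rows are left out) and assumes the match must be strictly later than Timestamp + 60, since the target output pairs timestamp 1 with 64 rather than 61. The column names target and matched_ts are made up for the example.

```python
import pandas as pd

# Only the rows listed in the question; the elided middle rows are omitted
df = pd.DataFrame({
    "Timestamp": [0, 1, 4, 6, 6, 7, 14, 24, 59, 61, 64],
    "Property": [100, 200, 300, 400, 401, 500, 506, 550, 700, 750, 800],
})

shift_amt = 60
# Left side: each original timestamp and the time it should be matched against
left = df[["Timestamp"]].assign(target=df["Timestamp"] + shift_amt)
# Right side: the same data, renamed so the columns do not clash
right = df.rename(columns={"Timestamp": "matched_ts", "Property": "Property_Shifted"})

# direction="forward" takes the first row at or after the target;
# allow_exact_matches=False makes that strictly after (an assumption, see above).
# Both frames must be sorted by their key columns, which they are here.
shifted = pd.merge_asof(
    left, right,
    left_on="target", right_on="matched_ts",
    direction="forward", allow_exact_matches=False,
)

# Rows with no later observation get NaN; drop them and restore integer dtype
shifted = shifted.dropna(subset=["Property_Shifted"])
shifted = shifted.astype({"matched_ts": int, "Property_Shifted": int})
print(shifted[["Timestamp", "Property_Shifted"]])
```

If an exact hit at Timestamp + 60 should count as a match, dropping allow_exact_matches=False restores the default behaviour.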