How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table? - pandas

I think I have a problem with computation time.
I want to run this code on a DataFrame of 320,000 rows and 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine, for every clubid, the number of wins and the number of losses as of a given match, taking the previous matches into account.
The problem is that I don't get an error, but I never get any results either.
When I tried with a smaller DataFrame (data[0:1000]), I got a result in 13 seconds. This is why I think the problem is computation time.
I also tried to first use groupby("clubid") and then run my for loop within every group, but I got lost.
Something else bothers me: I have at least 2 rows with the exact same date/hour, because every match appears with at least two identical dates. Because of this I can't use the date as the index.
Could you help me with these issues, please?

As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd
dat = pd.DataFrame({
"win_bool":[0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
"clubid": [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
"date" : [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
"othercol":["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
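If what is needed is a running count per match (as the nested loop in the question seems to compute) rather than one total per club, a cumulative sum within each group is one possible approach. This is a minimal sketch on made-up data, assuming dates are unique within each club once duplicates are dropped; the column names wins_so_far and losses_so_far are hypothetical.

```python
import pandas as pd

# Toy data: two clubs, one row per match (assumed already de-duplicated).
dat = pd.DataFrame({
    "clubid":   [1, 1, 1, 2, 2],
    "date":     [1, 2, 3, 1, 2],
    "win_bool": [1, 0, 1, 0, 0],
})

# Sort so that the cumulative sum follows chronological order within each club.
dat = dat.sort_values(["clubid", "date"])

# Running number of wins up to and including each match.
dat["wins_so_far"] = dat.groupby("clubid")["win_bool"].cumsum()

# Running number of losses: a loss is a row where win_bool == 0.
dat["losses_so_far"] = (1 - dat["win_bool"]).groupby(dat["clubid"]).cumsum()
```

This runs in a single pass per group instead of comparing every pair of rows, so it should scale to a few hundred thousand rows far better than the double loop.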

Related

Calculate all possible sums of all summands in 4 columns in R

I have 4 columns of different sizes (e.g. column 1: 96 rows; column 2: 36 rows; column 3: 12 rows; column 4: 401 rows).
I am now looking for a function that allows me to calculate all possible sums of these 4 summands.
So at the end I need to have 96 * 36 * 12 * 401 = 16630272 sums as a result in a data frame, vector, or array to make a histogram with ggplot.
I tried to solve it with nested for loops, but that did not work:
r = 1
for (i in 1:(length(df$column1))) {
  for (j in 1:(length(df$column2))) {
    for (h in 1:(length(df$column3))) {
      for (k in 1:(length(df$column4))) {
        (i+j) -> a
        r = r + 1
      }
    }
  }
}
Does someone have an idea how to solve this problem? Or does anyone have an idea why my code does not work? It has problems storing my results in the variable a, and it also somehow produces far too many sums (more than 16630272).
Many thanks!
Your real column1/2/3/4 are probably not columns in a single data frame, since those must all have the same length. Anyway, if these are the true sizes, the naive solution shouldn't hog an impossible amount of memory (~126M):
d <- expand.grid(column1, column2, column3, column4)
all.sums <- mapply(sum, d[[1]], d[[2]], d[[3]], d[[4]])

Pandas dataframe: grouping by unique identifier, checking conditions, and applying 1/0 to new column if condition is met/not met

I have a large dataset pertaining to customer churn, where every customer has a unique identifier (encoded key). The dataset is a time series, where every customer has one row for every month they have been a customer, so both the date and customer-identifier columns naturally contain duplicates. What I am trying to do is add a new column (called 'churn') and set it to 0 or 1 based on whether it is that specific customer's last month as a customer or not.
I have tried numerous methods to do this, but each and every one fails, either due to tracebacks or because they just don't work as intended. It should be noted that I am very new to both python and pandas, so please explain things like I'm five (lol).
I have tried using pandas groupby to group rows by the unique customer keys, and then checking conditions:
df2 = df2.groupby('customerid').assign(churn = [1 if date==max(date) else 0 for date in df2['date']])
which gives a traceback because a DataFrameGroupBy object has no attribute assign.
I have also tried the following:
df2.sort_values(['date']).groupby('customerid').loc[df['date'] == max('date'), 'churn'] = 1
df2.sort_values(['date']).groupby('customerid').loc[df['date'] != max('date'), 'churn'] = 0
which gives a similar traceback, this time due to the attribute loc.
I have also tried using numpy methods, like the following:
df2['churn'] = df2.groupby(['customerid']).np.where(df2['date'] == max('date'), 1, 0)
which again gives a traceback due to the DataFrameGroupBy object,
and:
df2['churn'] = np.where((df2['date']==df2['date'].max()), 1, df2['churn'])
which does not give tracebacks, but does not work as intended, i.e. it applies 1 to the churn column for the max date for all rows, instead of the max date for the specific customerid - which in retrospect is completely understandable since customerid is not specified anywhere.
Any help/tips would be appreciated!
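IIUC, use GroupBy.transform with max to get the maximal date per group, aligned to the original rows, compare it with the date column, and finally set the 1/0 values from the resulting mask:
# True on each customer's row that holds their latest date
mask = df2['date'].eq(df2.groupby('customerid')['date'].transform('max'))
# set 1/0 from the mask
df2['churn'] = np.where(mask, 1, 0)
# equivalent alternative: cast the boolean mask to integers
df2['churn'] = mask.astype(int)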
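To see the transform approach on something concrete, here is a small self-contained sketch. The customer keys and dates are made up for illustration; only the column names customerid, date and churn come from the question.

```python
import pandas as pd

# Toy time series: customer "a" has three months, customer "b" has two.
df2 = pd.DataFrame({
    "customerid": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-02-01", "2021-03-01", "2021-01-01", "2021-02-01"]
    ),
})

# transform("max") returns one value per original row: that customer's latest date.
last_month = df2.groupby("customerid")["date"].transform("max")

# A row is the churn month when its date equals the customer's latest date.
df2["churn"] = df2["date"].eq(last_month).astype(int)
```

Because transform keeps the original index and length, the comparison lines up row by row, which is what the groupby(...).assign and groupby(...).loc attempts were missing.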

Comparing timedelta fields

I am looking at file delivery times and can't work out how to compare two timedelta fields using a for loop if statement.
time_diff is the difference between cob_date and last_update_time
average_diff is based on the average for a particular file
I want to find the delay for each row.
I have been able to produce a column delay using average_diff - time_diff
However, when the average_diff - time_diff < 0 I just want to return delay = 0 as this is not a delay.
I have made a for loop but this isn't working and I don't know why. I'm sure the answer is very simple but I can't get there.
test_pv_import_v2['delay2'] = pd.to_timedelta('0')
for index, row in test_pv_import_v2.iterrows():
    if test_pv_import_v2['time_diff'] > test_pv_import_v2['average_diff'] :
        test_pv_import_v2['delay2'] = test_pv_import_v2['time_diff'] - test_pv_import_v2['average_diff']
Use Series.where to keep the difference where the condition holds and set a zero Timedelta everywhere else:
mask = test_pv_import_v2['time_diff'] > test_pv_import_v2['average_diff']
s = (test_pv_import_v2['time_diff'] - test_pv_import_v2['average_diff'])
test_pv_import_v2['delay2'] = s.where(mask, pd.to_timedelta('0'))
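A minimal runnable sketch of the same idea, assuming both columns already have timedelta dtype. The frame df and its values are invented for illustration, and pd.Timedelta(0) is used here as the zero value in place of pd.to_timedelta('0').

```python
import pandas as pd

# Toy data: three deliveries compared against a 2-hour average.
df = pd.DataFrame({
    "time_diff": pd.to_timedelta(["3h", "1h", "2h"]),
    "average_diff": pd.to_timedelta(["2h", "2h", "2h"]),
})

zero = pd.Timedelta(0)

# Raw difference; negative or zero values mean the file was not late.
diff = df["time_diff"] - df["average_diff"]

# Keep the difference only where it is positive, otherwise use a zero delay.
df["delay2"] = diff.where(diff > zero, zero)
```

The comparison is done on whole columns at once, so no loop is needed; the original loop fails because it compares two entire Series inside an if statement instead of the values of the current row.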

Why even though I sliced my original DataFrame and assigned it to another variable, my original DataFrame still changed values?

I am trying to calculate a portfolio's daily total price, by multiplying weights of each asset with the daily price of the assets.
Currently I have a DataFrame tw which is all zeros except for the dates that I want to re-balance, which holds my assets weights. What I would like to do is for each month, populate the zeros with the weights I am trying to re-balance with, till the next re-balancing date, and so on and so forth.
My code:
df_of_weights = tw.loc[dates_to_rebalance[13]:]
temp_date = dates_to_rebalance[13]
counter = 0
for date in df_of_weights.index:
if date.year == temp_date.year and date.month == temp_date.month:
if date.day == temp_date.day:
pass
else:
df_of_weights.loc[date] = df_of_weights.loc[temp_date].values
counter += 1
temp_date = dates_to_rebalance[13+counter]
I understand that if you slice your DataFrame and assign it to a variable (df_of_weights), changing the values of said variable should not affect the original DataFrame. However, the values in tw changed. I have been searching for an answer online for a while now and am really confused.
You should take an explicit copy to fix the problem:
df_of_weights = tw.loc[dates_to_rebalance[13]:].copy()
The problem is that slicing can give you a view of the original data instead of a copy, so writing to the slice also writes to tw. The issue is still open.
https://github.com/pandas-dev/pandas/issues/15631
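A small sketch of the .copy() fix on made-up data. The asset names and dates are hypothetical; only tw and df_of_weights come from the question.

```python
import pandas as pd

# Toy weights table: all zeros except the rebalancing date 2020-01-02.
tw = pd.DataFrame(
    {"asset_a": [0.0, 0.5, 0.0], "asset_b": [0.0, 0.5, 0.0]},
    index=pd.date_range("2020-01-01", periods=3),
)

rebalance_date = pd.Timestamp("2020-01-02")
next_date = pd.Timestamp("2020-01-03")

# Explicit copy: df_of_weights owns its data, independent of tw.
df_of_weights = tw.loc[rebalance_date:].copy()

# Forward the rebalancing weights to the following date in the copy only.
df_of_weights.loc[next_date] = df_of_weights.loc[rebalance_date].values
```

After this, tw still holds zeros on 2020-01-03 while df_of_weights holds the forwarded weights. Without .copy(), whether tw changes depends on the pandas version and its copy/view behaviour, which is why the explicit copy is the safer choice.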

Dataframe non-null values differ from value_counts() values

There is an inconsistency with dataframes that I can't explain. In the following, I'm not looking for a workaround (I already found one) but an explanation of what is going on under the hood and how it explains the output.
One of my colleagues, whom I talked into using python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. She wants to divide the dataframe into groups by length range: 0 to 9 in group 1, 10 to 14 in group 2, 15 and more in group 3. Her solution was to add another column, "group", and fill it with the appropriate values. She wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10;
data['group'][mask] = 1;
mask2 = (data['length'] > 9) & (data['phraseLength'] < 15);
data['group'][mask2] = 2;
mask3 = data['length'] > 14;
data['group'][mask3] = 3;
This code is not good, of course. The reason is that you don't know at run time whether data['group'][mask3], for example, will be a view and thus actually change the dataframe, or a copy, in which case the dataframe would remain unchanged. It took me quite some time to explain this to her, since she argued correctly that she is doing an assignment, not a selection, so the operation should always return a view.
But that was not the strange part. The part that even I couldn't understand is this:
After performing this set of operations, we verified whether the assignment took place in two different ways:
1. By typing data in the console and examining the dataframe summary. It told us we had a few thousand null values. The number of null values was the same as the size of mask3, so we assumed the last assignment was made on a copy and not on a view.
2. By typing data.group.value_counts(). That returned 3 values: 1, 2 and 3 (surprise). We then typed data.group.value_counts().sum() and it summed up to 12,000!
So by method 2, the group column contained no null values and all the values we wanted it to have. But by method 1, it didn't!
Can anyone explain this?
see docs here.
You don't want to set values this way for exactly the reason you pointed out: since you don't know if it's a view, you don't know that you are actually changing the data. pandas 0.13 will raise/warn that you are attempting to do this, but it is easiest/best to just access it like:
data.loc[mask3,'group'] = 3
which guarantees an in-place setitem.
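Applied to all three masks from the question, the .loc form could look like the sketch below. The six sample lengths are made up, and the second mask uses the length column throughout, on the assumption that phraseLength in the question refers to the same column.

```python
import numpy as np
import pandas as pd

# Toy data covering the boundaries of each length range.
data = pd.DataFrame({"length": [0, 9, 10, 14, 15, 20]})
data["group"] = np.nan

# One indexing call per assignment: row mask and column label together,
# so pandas always writes into the original dataframe.
data.loc[data["length"] < 10, "group"] = 1
data.loc[(data["length"] > 9) & (data["length"] < 15), "group"] = 2
data.loc[data["length"] > 14, "group"] = 3
```

With a single .loc call there is no intermediate object that might be a copy, so the summary view and value_counts() should agree on the number of non-null values.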