Why even though I sliced my original DataFrame and assigned it to another variable, my original DataFrame still changed values? - pandas

I am trying to calculate a portfolio's daily total price, by multiplying weights of each asset with the daily price of the assets.
Currently I have a DataFrame tw which is all zeros except for the dates that I want to re-balance, which holds my assets weights. What I would like to do is for each month, populate the zeros with the weights I am trying to re-balance with, till the next re-balancing date, and so on and so forth.
My code:
df_of_weights = tw.loc[dates_to_rebalance[13]:]
temp_date = dates_to_rebalance[13]
counter = 0
for date in df_of_weights.index:
if date.year == temp_date.year and date.month == temp_date.month:
if date.day == temp_date.day:
pass
else:
df_of_weights.loc[date] = df_of_weights.loc[temp_date].values
counter += 1
temp_date = dates_to_rebalance[13+counter]
I understand that if you slice your DataFrame and assign it to a variable (df_of_weights), changing the values of said variable would not affect the original DataFrame. However, the values in tw changed. Have been searching for an answer online for a while now and am really confused.

You should use copy in order to fix the problem such that:
df_of_weights = tw.loc[dates_to_rebalance[13]:].copy()
The problem is slicing provides view instead of copy. The issue is still open.
https://github.com/pandas-dev/pandas/issues/15631

Related

Pandas rolling window on an offset between 4 and 2 weeks in the past

I have a datafile with quality scores from different suppliers over a time range of 3 years. The end goal is to use machine learning to predict the quality label (good or bad) of a shipment based on supplier information.
I want to use the mean historic quality data over a specific period of time as an input feature in this model by using pandas rolling window. the problem with this method is that pandas only allows you to create a window from t=0-x until t=0 for you rolling window as presented below:
df['average_score t-2w'] = df['score'].rolling(window='14d',closed='left').mean()
And this is were the problem comes in. For my feature I want to use quality data from a period of 2 weeks, but these 2 weeks are not the 2 weeks before the corresponding shipment, but of 2 weeks, starting from t=-4weeks , and ending on t=-2weeks.
You would imagine that this could be solved by using the same string of code but changing the window as presented below:
df['average_score t-2w'] = df['score'].rolling(window='28d' - '14d',closed='left').mean()
This, or any other type of denotation of this specific window does not seem to work.
It seems like pandas does not offer a solution to this problem, so we made a work around it with the following solution:
def time_shift_week(df):
def _avg_score_interval_func(series):
current_time = series.index[-1]
result = series[(series.index > ( current_time- pd.Timedelta(value=4, unit='w')))
& (series.index < (current_time - pd.Timedelta(value=2, unit='w')))]
return result.mean() if len(result)>0 else 0.0
temp_df = df.groupby(by=["supplier", "timestamp"], as_index=False).aggregate({"score": np.mean}).set_index('timestamp')
temp_df["w-42"] = (
temp_df
.groupby(["supplier"])
.ag_score
.apply(lambda x:
x
.rolling(window='30D', closed='both')
.apply(_avg_score_interval_func)
))
return temp_df.reset_index()
This results in a new df in which we find the average score score per supplier per timestamp, which we can subsequently merge with the original data frame to obtain the new feature.
Doing it this way seems really cumbersome and overly complicated for the task I am trying to perform. Eventhough we have found a workaround, I am wondering if there is an easier method of doing this.
Is anyone aware of a less complicated way of performing this rolling window feature extraction?
While pandas does not have the custom date offset you need, calculating the mean is pretty simple: it's just sum divided by count. You can subtract the 14-day rolling window from the 28-day rolling window:
# Some sample data. All scores are sequential for easy verification
idx = pd.MultiIndex.from_product(
[list("ABC"), pd.date_range("2020-01-01", "2022-12-31")],
names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx))}, index=idx).reset_index()
# Now we gonna do rolling avg on score with the custom window.
# closed=left mean the current row will be excluded from the window.
score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())

How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table?

I think I have a problem with time calculation.
I want to run this code on a DataFrame of 320 000 lines, 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
for j in index_data:
if data["clubid"][i] == data["clubid"][j]:
if data["win_bool"][i] == 1:
if (data["startdate"][i] >= data["startdate"][j]) & (
data["win_bool"][j] == 1
):
NW_tot[i] += 1
else:
if (data["startdate"][i] >= data["startdate"][j]) & (
data["win_bool"][j] == 0
):
NL_tot[i] += 1
The objective is to determine the number of wins and the number of losses from a given match taking into account the previous match, this for every clubid.
The problem is, I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame ( data[0:1000] ) I got a result in 13 seconds. This is why I think it's a time calculation problem.
I also tried to first use a groupby("clubid"), then do my for loop into every group but I drowned myself.
Something else that bothers me, I have at least 2 lines with the exact same date/hour, because I have at least two identical dates for 1 match. Because of this I can't put the date in index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd
dat = pd.DataFrame({
"win_bool":[0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
"clubid": [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
"date" : [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
"othercol":["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")

Pandas dataframe: grouping by unique identifier, checking conditions, and applying 1/0 to new column if condition is met/not met

I have a large dataset pertaining customer churn, where every customer has an unique identifier (encoded key). The dataset is a timeseries, where every customer has one row for every month they have been a customer, so both the date and customer-identifier column naturally contains duplicates. What I am trying to do is to add a new column (called 'churn') and set the column to 0 or 1 based on if it is that specific customer's last month as a customer or not.
I have tried numerous methods to do this, but each and every one fails, either do to tracebacks or they just don't work as intended. It should be noted that I am very new to both python and pandas, so please explain things like I'm five (lol).
I have tried using pandas groupby to group rows by the unique customer keys, and then checking conditions:
df2 = df2.groupby('customerid').assign(churn = [1 if date==max(date) else 0 for date in df2['date']])
which gives tracebacks because dataframegroupby object has no attribute assign.
I have also tried the following:
df2.sort_values(['date']).groupby('customerid').loc[df['date'] == max('date'), 'churn'] = 1
df2.sort_values(['date']).groupby('customerid').loc[df['date'] != max('date'), 'churn'] = 0
which gives a similar traceback, but due to the attribute loc
I have also tried using numpy methods, like the following:
df2['churn'] = df2.groupby(['customerid']).np.where(df2['date'] == max('date'), 1, 0)
which again gives tracebacks due to the dataframegroupby
and:
df2['churn'] = np.where((df2['date']==df2['date'].max()), 1, df2['churn'])
which does not give tracebacks, but does not work as intended, i.e. it applies 1 to the churn column for the max date for all rows, instead of the max date for the specific customerid - which in retrospect is completely understandable since customerid is not specified anywhere.
Any help/tips would be appreciated!
IIUC use GroupBy.transform with max for return maximal values per groups and compare with date column, last set 1,0 values by mask:
mask = df2['date'].eq(df2.groupby('customerid')['date'].transform('max'))
df2['churn'] = np.where(mask, 1, 0)
df2['churn'] = mask.astype(int)

Numpy maximum(arrays)--how to determine the array each max value came from

I have numpy arrays representing July temperature for each year since 1950.
I can use the numpy.maximum(temp1950,temp1951,temp1952,..temp2014)
to determine the maximum July temperature at each cell.
I need the maximum for each cell..the numpy.maximum() works for only 2 arrays
How do I determine the year that each max value came from?
Also the numpy.maximum(array1,array2) works comparing only two arrays.
Thanks to Praveen, the following works fine:
array1 = numpy.array( ([1,2],[3,4]) )
array2 = numpy.array( ([3,4],[1,2]) )
array3 = numpy.array( ([9,1],[1,9]) )
all_arrays = numpy.dstack((array1,array2,array3))
#maxvalues = numpy.maximum(all_arrays)#will not work
all_arrays.max(axis=2) #this returns the max from each cell location
max_indexes = numpy.argmax(all_arrays,axis=2)#this returns correct indexes
The answer is argmax, except that you need to do this along the required axis. If you have 65 years' worth of temperatures, it doesn't make sense to keep them in separate arrays.
Instead, put them all into a single 2D dimensional array using something like np.vstack and then take the argmax over rows.
alltemps = np.vstack((temp1950, temp1951, ..., temp2014))
maxindexes = np.argmax(alltemps, axis=0)
If your temperature arrays are already 2D for some reason, then you can use np.dstack to stack in depth instead. Then you'll have to take argmax over axis=2.
For the specific example in your question, you're looking for something like:
t = np.dstack((array1, array2)) # Note the double parantheses. You need to pass
# a tuple to the function
maxindexes = np.argmax(t, axis=2)
PS: If you are getting the data out of a file, I suggest putting them in a single array to start with. It gets hard to handle 65 variable names.
You need to use Numpy's argmax
It would give you the index of the largest element in the array, which you can map to the year.

Dynamically creating variables, while doing map/apply on a dataframe in pandas to get key names for the values in Series object returned

I am writing code for a Naive Bayes model(I know there's a standard implementation in Sklearn, but I want to code it anyway) - For this I have say upwards of 30 features, against all of which I have the corresponding click & impression counts (Treat them as True/False flags)
What I need then, is to calculate
P(Click/F1, F2.. F30) = (P(Click)*P(F1/Click)*P(F2|click) ..*P(F30|Click))/(P(F1, F2...F30), and
P(NoClick/F1, F2.. F30) = (P(NoClick)*P(F1/NoClick)*P(F2|Noclick) ..*P(F30|NOClick))/(P(F1, F2...F30)
Where I will disregard the denominator as it will affect both Click & Non click behaviour similarly.
Example, for two features, day_custom & is_tablet_phone, I have
is_tablet_phone click impression
FALSE 375417 28291280
TRUE 17743 4220980
day_custom click impression
Fri 77592 7029703
Mon 43576 3773571
Sat 65950 5447976
Sun 66460 5031271
Thu 74329 6971541
Tue 55282 4575114
Wed 51555 4737712
My approach to the Problem : Assuming I read the individual files in data frame, one after another, I want the abilty to calculate & store the corresponding Probablities back in a file, that I will then use for real time prediction of Probabilty to click vs no click.
One possible structure of "processed file" thus would be -:
Here's my entire code -:
In the full blown example, I am traversing the entire directory structure(of 30 txt files, one at a time, from the base path) - which is why I need the ability to create "names" at runtime.
for base_path in base_paths:
for root, dirs, files in os.walk(base_path):
for file in files:
file_paths.append(os.path.join(root, file))
For reasons of tractability, follow from here, by taking the 2 txt files as sample input
file_paths=['/home/ekta/Desktop/NB/day_custom.txt','/home/ekta/Desktop/NB/is_tablet_phone.txt']
flag=0
for filehandle in file_paths:
feature_name=filehandle.split("/")[-1].split(".")[0]
df= pd.read_csv(filehandle,skiprows=0, encoding='utf-8',sep='\t',index_col=False,dtype={feature_name: object,'click': int,'impression': int})
df2=df[(df.impression-df.click>0) & (df.click >0)]
if flag ==0:
MySumC,MySumNC,Mydict=0,0,collections.defaultdict(dict)
MySumC=sum(df2['click'])
MySumNC=sum(df2['impression'])
P_C=float(MySumC)/float(MySumC+MySumNC)
P_NC=1-P_C
for feature_value in df2[feature_name]:
Mydict[feature_name+'_'+feature_value]={'P_'+feature_name+'_'+feature_value+'_C':(df2[df2[feature_name]==feature_value]['click']*float(P_C))/MySumC, \
'P_'+feature_name+'_'+feature_value+'_NC':(df2[df2[feature_name]==feature_value]['impression']*float(P_NC))/MySumNC}
flag=1 %Set the flag as "1" because we don't need to compute the MySumC,MySumNC, P_C & P_NC again
Question :
It looks like THIS loop is the killer here.Also, intutively, looping on a dataframe is a BAD practice. How can I rewrite this, perhaps using Map/Apply ?
for feature_value in df2[feature_name]:
Mydict[feature_name+'_'+feature_value]={'P_'+feature_name+'_'+feature_value+'_C':(df2[df2[feature_name]==feature_value]['click']*float(P_C))/MySumC, \
'P_'+feature_name+'_'+feature_value+'_NC':(df2[df2[feature_name]==feature_value]['impression']*float(P_NC))/MySumNC}
What I need in Mydict , which is a hash to store each feature name and each feature value in it
{'day_custom_Mon':{'P_day_custom_Mon_C':.787,'P_day_custom_Mon_NC': 0.556},
'day_custom_Tue':{'P_day_custom_Tue_C':0.887,'P_day_custom_Tue_NC': 0.156},
'day_custom_Wed':{'P_day_custom_Tue_C':0.087,'P_day_custom_Tue_NC': 0.167}
'day_custom_Thu':{'P_day_custom_Tue_C':0.947,'P_day_custom_Tue_NC': 0.196},
'is_tablet_phone_True':{'P_is_tablet_phone_True_C':.787,'P_is_tablet_phone_True_NC': 0.066},
'is_tablet_phone_False':{'P_is_tablet_phone_False_C':.787,'P_is_tablet_phone_False_NC': 0.077},
.. and so on..
%PPS: I just made up those float numbers, but you get the point
Also because I will later serialize this file & pass to Redis directly, for other systems to feed on it, in an cron-job manner, so I need to preserve some sort of Dynamic naming .
What I tried -:
Since I am reading feature_name as
feature_name=filehandle.split("/")[-1].split(".")[0]` # thereby abstracting & creating variables dynamically
def funct1(row):
return row[feature_name]
def funct2(row):
return row['click']
def funct3(row):
return row['impression']
then..
df2.apply(funct2,axis=1)df2.apply(funct,axis=1)*float(P_C))/MySumC, df2.apply(funct3,axis=1)*float(P_NC))/MySumNC Gives me both the values I need for a feature_value(say Mon, Tue, Wed, and so on..) for a feature_name (say,day_custom)
I also know that df2.apply(funct1, axis=1) contains part of mycustom "names"(ie feature values), how would I then build these names using map/apply ?
Ie. I will have the values, but how would I create the "key" 'P_'+feature_name+'_'+feature_value+'_C' , since feature value post apply is returned as a series object.
check out the following recipe which does exactly what you want, only using data frame manipulations. I also simplified the actual frequency calculation a bit ;)
#set the feature name values as the index of
df2.set_index(feature_name, inplace=True)
#This is what df2.set_index() looks like:
# click impression
#day_custom
#Fri 9917 3163
#Mon 2566 3818
#Sat 8725 7753
#Sun 6938 8642
#Thu 6136 2556
#Tue 5234 2356
#Wed 9463 9433
#rename the index of your data frame
df2.rename(index=lambda x:"%s_%s"%('day_custom', x), inplace=True)
#compute the total sum of your data frame entries
totsum = float(df2.values.sum())
#use apply to multiply every data frame element by the total sum
df2 = df2.applymap(lambda x:x/totsum)
#transpose the data frame to have the following shape
#day_custom day_custom_Fri day_custom_Mon ...
#click 0.102019 0.037468 ...
#impression 0.087661 0.045886 ...
#
#
dftranspose = df2.T
# template kw for formatting
templatekw = {'click':"P_%s_C", 'impression':"P_%s_NC"}
# build a list of small data frames with correct index names P_%s_NC etc
dflist = [dftranspose[[col]].rename(lambda x:templatekw[x]%col) for col in dftranspose]
#use the concatenate function to produce a sparse dictionary
MyDict= pd.concat(dflist).to_dict()
Instead of assigning to MyDict at the end, you can use the update-method during the loop.
For understanding the comments below, see here my
Original answer:
Try to use a pivot_table:
def clickfunc(x):
return np.sum(x) * P_C / MySumC
def impressionfunc(x):
return np.sum(x) * P_NC / MySumNC
newtable = df2.pivot_table(['click', 'impression'], 'feature_name', \
aggfunc=[clickfunc, impressionfunc])
#transpose the table for the dictionary to have the right form
newtable = newtable.T
#to_dict functionality already gives the correct result
MyDict = newtable.to_dict()
#rename by copying
for feature_value, subdict in MyDict.items():
word = feature_name +"_"+ feature_value
copydict[word] = {'P_' + word + '_C':subdict['click'],\
'P_' + word + '_NC':subdict['impression'] }
This gives you the result you want in copydict
itertuples() is what worked for me(worked at lightspeed) - though It is still not using the map/apply approach that I so much wanted to see. Itertuples on a pandas dataframe returns the whole row, so I no longer have to do df2[df2[feature_name]==feature_value]['click'] - be aware that this matching by value is not only expensive, but also undesired, since it may return a series, if there were duplicate rows. itertuples solves that problem were elegantly, though I need to then access the individual objects/columns by integer indexes , which means less re-usable code. I could abstract this, but It wont be like accessing by column names, the status-quo.
for row in df2.itertuples():
Mydict[feature_name+'_'+str(row[1])]={'P_'+feature_name+'_'+str(row[1])+'_C':(row[2]*float(P_C))/MySumC, \
'P_'+feature_name+'_'+str(row[1])+'_NC':(row[3]*float(P_NC))/MySumNC}
Note that I am accesing each column in the row by row[1] , row[2] and like. For example, row has (0, u'Fri', 77592, 7029703)
Post this I get
dict(Mydict)
{'day_custom_Thu': {'P_day_custom_Thu_NC': 0.18345372640838162, 'P_day_custom_Thu_C': 0.0019559423132143377}, 'day_custom_Mon': {'P_day_custom_Mon_C': 0.0011466875948906617, 'P_day_custom_Mon_NC': 0.099300235316209587}, 'day_custom_Sat': {'P_day_custom_Sat_NC': 0.14336163246883712, 'P_day_custom_Sat_C': 0.0017354517827023852}, 'day_custom_Tue': {'P_day_custom_Tue_C': 0.001454726996987919, 'P_day_custom_Tue_NC': 0.1203925662982053}, 'day_custom_Sun': {'P_day_custom_Sun_NC': 0.13239618235343156, 'P_day_custom_Sun_C': 0.0017488722589598259}, 'is_tablet_phone_TRUE': {'P_is_tablet_phone_TRUE_NC': 0.11107365073163174, 'P_is_tablet_phone_TRUE_C': 0.00046690100046229593}, 'day_custom_Wed': {'P_day_custom_Wed_NC': 0.12467127727567069, 'P_day_custom_Wed_C': 0.0013566522616712882}, 'day_custom_Fri': {'P_day_custom_Fri_NC': 0.1849842396242351, 'P_day_custom_Fri_C': 0.0020418070466026303}, 'is_tablet_phone_FALSE': {'P_is_tablet_phone_FALSE_NC': 0.74447539516197614, 'P_is_tablet_phone_FALSE_C': 0.0098789704610580936}}