slice dataframe inplace and dynamically rename in a loop - pandas

I am aware that it may not be good practice, but I am curious whether it is possible to take two DataFrames (in this case, srm and srae), take a slice of each, and then name the sliced DataFrames srm1 and srae1.
The logic is below.
for x in (srm, srae):
    x1 = x[x['years_in_role']>5]
    print(x.shape, x1.shape)

You can unpack the list of two filtered DataFrames into two variables:
srm1, srae1 = [x[x['years_in_role']>5] for x in (srm, srae)]
Your loop can instead be used to build a list, from which the new variables are then assigned:
L = []
for x in (srm, srae):
    x1 = x[x['years_in_role']>5]
    L.append(x1)
srm1 = L[0]
srae1 = L[1]
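When more than two frames are involved, a dict keyed by name is one way to avoid creating variables dynamically. This is only a minimal sketch: the two small frames are made-up stand-ins for srm and srae, and the names frames and sliced are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the srm and srae frames from the question
srm = pd.DataFrame({'years_in_role': [2, 6, 9]})
srae = pd.DataFrame({'years_in_role': [7, 1, 3]})

frames = {'srm': srm, 'srae': srae}

# One filtered frame per original, stored under a derived key ("srm1", "srae1")
sliced = {f'{name}1': df[df['years_in_role'] > 5] for name, df in frames.items()}

for name, df in frames.items():
    print(name, df.shape, sliced[f'{name}1'].shape)
```

The filtered frames are then available as sliced['srm1'] and sliced['srae1'], and adding a third input only means adding one entry to frames.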


replacing df.append with pd.concat when building a new dataframe from file read

...
header = pd.DataFrame()
for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}:
    header = header.append({'col1':data1[x].split(':')[0],
                            'col2':data1[x].split(':')[1][:-1],
                            'col3':data2[x].split(':')[1][:-1],
                            'col4':data2[x]==data1[x],
                            'col5':'---'},
                           ignore_index=True)
...
I have some Jupyter Notebook code which reads two text files into data1 and data2. Using a list of line indices, I pick out specific matching lines from both files and put them in a dataframe for easy display and comparison in the notebook.
Since df.append is now being dropped in favour of pd.concat, what's the tidiest way to do this?
Is it basically to replace the inner loop code with
...
header = pd.concat(header, {all the column code from above })
...
Additional input in response to the comment below:
Yes, sorry. For example, the next block of code does this:
for x in {4,2,5}:
    header = header.append({'col1':'SOMENEWROWNAME',
                            'col2':data1[x].split(':')[1][:-1],
                            'col3':data2[x].split(':')[1][:-1],
                            'col4':data2[x]==data1[x],
                            'col5':float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])},
                           ignore_index=True)
This is repeated 5 times, each time with different data indices in the loop and a different SOMENEWROWNAME.
I inherited this notebook, and I see now that it was written this way because they only wanted to compute a numerical float difference for the rows that contain numbers.
There are several such blocks, each covering different lines in the data, where the first parameter SOMENEWROWNAME is a different text field for the respective lines in the data.
So I was primarily just trying to fix these append-to-concat warnings, but of course if the code can be written better, all the better!
Use a list comprehension and the DataFrame constructor:
data = [{'col1':data1[x].split(':')[0],
         'col2':data1[x].split(':')[1][:-1],
         'col3':data2[x].split(':')[1][:-1],
         'col4':data2[x]==data1[x],
         'col5':'---'} for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}]
df = pd.DataFrame(data)
EDIT:
out = []
#sample
for x in {1,7,30}:
    out.append({'col1':'SOMENEWROWNAME',
                'col2':data1[x].split(':')[1][:-1],
                'col3':data2[x].split(':')[1][:-1],
                'col4':data2[x]==data1[x],
                'col5':float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
df1 = pd.DataFrame(out)
out1 = []
#sample
for x in {1,7,30}:
    out1.append(another_dict)  # placeholder: another dict built the same way
df2 = pd.DataFrame(out1)
df = pd.concat([df1, df2])
Or:
final = []
for x in {4,2,5}:
    final.append({'col1':'SOMENEWROWNAME',
                  'col2':data1[x].split(':')[1][:-1],
                  'col3':data2[x].split(':')[1][:-1],
                  'col4':data2[x]==data1[x],
                  'col5':float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
for x in {4,2,5}:
    final.append(another_dict)  # placeholder: another dict built the same way
df = pd.DataFrame(final)
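Since the repeated blocks differ only in the line indices, the row label, and whether col5 is a numeric difference, they could be folded into one helper. The sketch below is an illustration under assumptions: make_row is a hypothetical helper, and data1 and data2 are made-up lines of the form "label:value" followed by a newline.

```python
import pandas as pd

# Made-up file contents: each line looks like "label:value\n"
data1 = ['alpha:1.5\n', 'beta:2.0\n', 'gamma:abc\n']
data2 = ['alpha:1.5\n', 'beta:3.5\n', 'gamma:abd\n']

def make_row(x, label=None, numeric=False):
    """Build one comparison row for line x (hypothetical helper)."""
    v1 = data1[x].split(':')[1][:-1]   # value with the trailing newline removed
    v2 = data2[x].split(':')[1][:-1]
    return {
        'col1': label if label is not None else data1[x].split(':')[0],
        'col2': v1,
        'col3': v2,
        'col4': data1[x] == data2[x],
        'col5': float(v2) - float(v1) if numeric else '---',
    }

# Collect every row in a plain list, then build the DataFrame once
rows = [make_row(x) for x in [2]]                     # text-only lines
rows += [make_row(x, numeric=True) for x in [0, 1]]   # numeric lines
header = pd.DataFrame(rows)
print(header)
```

Iterating over a list rather than a set keeps the row order explicit, because a set does not promise any particular iteration order.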

Create and Merge Pandas Dataframes in loop

I need to read in a bunch of input dataframes based on some conditions, merge them, and finally create dataframes named 'merge_m0', 'merge_m1', 'merge_m2' and so on.
In the actual code, I need to read about 20 dataframes. For simplicity and ease of understanding, I'm creating 3 dataframes here and using a for loop to read and merge them.
#INPUT: Sample input dataframes df0, df1 & df2
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
To do this, I'm using globals() to create the dataframes in a loop and to merge them, but it's not working and throws a "'DataFrame' object has no attribute 'globals'" error.
#Code:
def comb_mths(x,y):
    globals()[f"m{x}"] = globals()[f'df{x}'][globals()[f'df{x}'].globals()[f'm{x}_val_mthd'].isin([1,25])]
    globals()[f"m{y}"] = globals()[f'df{y}'][(globals()[f'df{y}'].globals()[f'm{y}_val_mthd'].isin([8,10,11,12])) & (globals()[f'df{y}'].globals()[f'm{y}_orig_val_mthd'].isin([2,3,4,5]))]
    globals()[f"merge_m{x}"]=pd.merge(globals()[f"m{x}"],globals()[f"m{y}"], how='inner',on=['id'])
for i in range(0,3):
    comb_mths(i, i+1)
I've also tried the following in place of the first line of the function above:
#globals()[f"m{x}"] = globals()[f'df{x}'][globals()[f'df{x}'].m{x}_val_mthd.isin([1,25])]
#globals()[f"m{x}"] = globals()[f'df{x}']["[f'm{x}_val_mthd']"].isin([1,25])
I think there must be a better and easier way to do this, and I'd appreciate it if anyone can help. Thanks!
Edit#
my updated post:
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
df_list=[]
for i in range(0,3):
    df_list.append(globals()[f'df{i}']) #I'm appending all the i/p dataframes which are created already by other step in the code and hope this works
def comb_mths(i):
    dfa = df_list[i]
    dfb = df_list[i+1]
    dfma = dfa[dfa.iloc[:, 1].isin([1,25])]
    dfmb = dfb[(dfb.iloc[:, 1].isin([8,10,11,12])) & (dfb.iloc[:, 3].isin([2,3,4,5]))]
    print(dfma)
    print(dfmb)
    print('\n'*3)
    globals()[f"merge_m{i}"] = dfma.merge(dfmb, how='inner', on=['id'])
    return globals()[f"merge_m{i}"]
for i in range(0,2):
    comb_mths(i)
print(merge_m0)
print(merge_m1)
In the above function, after creating the "merge_m{i}" dataframe, I need to check one more if-else condition and calculate a variable, say 'mths'.
The logic goes like this: when i=0, I need to check "m1_orig_val_mthd"; when i=1, "m2_orig_val_mthd"; when i=2, "m3_orig_val_mthd"; and so on.
The pseudocode for that if-else condition is below. Can you please show me how to add this condition to the above function?
when i=0 (1st iteration):
    if m1_orig_val_mthd isin (2,4,6):
        diff = (mydate - m1_appr_rcvd_dt)//(np.timedelta64(1,'M'))
        mths = diff - (i-1)
    elif m1_orig_val_mthd isin (1,3,5):
        diff = (mydate - m1_bpo_rcvd_dt)//(np.timedelta64(1,'M'))
        mths = diff - (i-1)
when i=1 (2nd iteration):
    if m2_orig_val_mthd isin (2,4,6):
        diff = (mydate - m2_appr_rcvd_dt)//(np.timedelta64(1,'M'))
        mths = diff - (i-1)
    elif m2_orig_val_mthd isin (1,3,5):
        diff = (mydate - m2_bpo_rcvd_dt)//(np.timedelta64(1,'M'))
        mths = diff - (i-1)
and so on...
I took a different approach assuming you can create all the input dataframes first. If you can create your dataframes and put them in a list, it makes handling them easier and code easier to read.
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
# add your inputs to the list
df_list = [df0, df1, df2]
# only pass in i, then call dfa, dfb by position in the list
def comb_mths(i):
    dfa = df_list[i]
    dfb = df_list[i+1]
    # print(dfa)
    # print(dfb)
    # print('\n'*3)
    # I wasn't exactly sure what you wanted here, but I think the original issue was that you were calling your new dataframe before it was created.
    dfma = dfa[dfa.iloc[:, 1].isin([1,25])] # as long as the columns are in the same position, you don't need to call them by name, just by position
    dfmb = dfb[(dfb.iloc[:, 1].isin([8,10,11,12])) & (dfb.iloc[:, 3].isin([2,3,4,5]))]
    print(dfma)
    print(dfmb)
    print('\n'*3)
    # creating the new merged dataframes; cleaned this up too
    globals()[f"merge_m{i}"] = dfma.merge(dfmb, how='inner', on=['id'])
    return globals()[f"merge_m{i}"] # added return statement
for i in range(0,2): # watch the range end or you'll get an error
    comb_mths(i)
print(merge_m0)
print(merge_m1)
Additional code:
# to populate the df_list, do this
# you aren't actually naming them, I only did that in example above due to your Example
# when you call them, you are calling the position in the list
df_list = []
for i in range(0,20):
    df = 'do your code here'
    df_list.append(df)
# print the df to verify they are created
for df in df_list:
    print(df)
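The same idea can be taken one step further by keeping the merged results in a dict instead of writing them into globals(). This is a sketch, not the answerer's code: it reuses the sample frames from the question, looks the columns up by name with an f-string rather than by position, and the dict name merged is an assumption.

```python
import pandas as pd

df0 = pd.DataFrame({'id': [100, 101, 102, 103], 'm0_val_mthd': [1, 8, 25, 41],
                    'name': ['AAA', 'BBB', 'CCC', 'DDD'], 'm0_orig_val_mthd': [2, 3, 4, 5]})
df1 = pd.DataFrame({'id': [100, 104, 102, 103], 'm1_val_mthd': [1, 8, 10, 25],
                    'name': ['EEE', 'FFF', 'GGG', 'HHH'], 'm1_orig_val_mthd': [2, 3, 4, 5]})
df2 = pd.DataFrame({'id': [100, 104, 102, 103], 'm2_val_mthd': [1, 8, 10, 25],
                    'name': ['III', 'JJJ', 'KKK', 'LLL'], 'm2_orig_val_mthd': [2, 3, 4, 5]})
df_list = [df0, df1, df2]

def comb_mths(i):
    """Filter frame i and frame i+1, then inner-join them on 'id'."""
    dfa, dfb = df_list[i], df_list[i + 1]
    dfma = dfa[dfa[f'm{i}_val_mthd'].isin([1, 25])]
    dfmb = dfb[dfb[f'm{i + 1}_val_mthd'].isin([8, 10, 11, 12])
               & dfb[f'm{i + 1}_orig_val_mthd'].isin([2, 3, 4, 5])]
    return dfma.merge(dfmb, how='inner', on='id')

# One entry per adjacent pair; stopping at len - 1 keeps i + 1 in range
merged = {f'merge_m{i}': comb_mths(i) for i in range(len(df_list) - 1)}
print(merged['merge_m0'])
```

With the sample data, only id 102 survives both filters for the first pair, and the second pair has no ids in common, so merged['merge_m1'] is empty.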

How to obtain a matrix by adding one more new row vector within an iteration?

I am generating arrays (technically they are row vectors) with a for-loop. a, b, c ... are the outputs.
Can I add the new array to the old ones together to form a matrix?
import numpy as np
# just for example:
a = np.asarray([2,5,8,10])
b = np.asarray([1,2,3,4])
c = np.asarray([2,3,4,5])
... ... ... ... ...
I have tried ab = np.stack((a,b)), and this works. But my idea is to add a new row to the existing matrix on each pass through the loop, and abc = np.stack((ab,c)) raises ValueError: all input arrays must have the same shape.
Can anyone tell me how I could add another vector to an already existing matrix? I couldn't find a good answer in this forum.
np.stack won't work here: it can only stack arrays that have the same shape.
One possible solution is to use np.concatenate((original, to_append), axis=0) each time, with to_append reshaped to a single 2-D row, since concatenate needs arrays with the same number of dimensions. Check the docs.
You can also try np.append.
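As an illustration of the row-by-row approach, the sketch below reuses the arrays from the question. np.vstack is swapped in here as a convenience because it promotes a 1-D array to a single row; the np.concatenate line shows the equivalent call with an explicit row axis.

```python
import numpy as np

a = np.asarray([2, 5, 8, 10])
b = np.asarray([1, 2, 3, 4])
c = np.asarray([2, 3, 4, 5])

ab = np.stack((a, b))       # two 1-D arrays of equal shape -> shape (2, 4)

# vstack treats the 1-D array c as one row, so the shapes line up
abc = np.vstack((ab, c))
print(abc.shape)

# Same result with concatenate, after giving c an explicit row axis
abc2 = np.concatenate((ab, c[np.newaxis, :]), axis=0)
```

Each call allocates a new array and copies the old rows, so for many iterations this gets slower than collecting the rows first and converting once.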
Thanks to everybody for the ideas. The best solution to this problem is to append each array or list to a Python list during the iteration and convert the list to a matrix with np.asarray at the end.
a = np.asarray([2,5,8,10]) # or a = [2,5,8,10]
b = np.asarray([1,2,3,4]) # b = [1,2,3,4]
c = np.asarray([2,3,4,5]) # c = [2,3,4,5]
... ...
l1 = []
l1.append(a)
l1.append(b)
l1.append(c)
... ...
l1 doesn't have to start empty; however, any elements it already contains must have the same form as a, b, c.
For example, the difference between l1 = [1,1,1,1] and l1 = [[1,1,1,1]] is huge in this case.
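A short sketch of the list-then-convert pattern follows. The loop and the generated rows are made up for illustration, and the names rows, nested and flat are hypothetical.

```python
import numpy as np

rows = []
for start in range(3):                    # stand-in for the real loop
    row = np.arange(start, start + 4)     # made-up row vector of length 4
    rows.append(row)
matrix = np.asarray(rows)                 # convert once, after the loop
print(matrix.shape)

# Seeding the list: a nested list holds one existing row ...
nested = [[1, 1, 1, 1]]
nested.append([2, 5, 8, 10])
seeded = np.asarray(nested)               # two rows of four values

# ... while a flat list holds four scalars, so appending a row makes it ragged
flat = [1, 1, 1, 1]
flat.append([2, 5, 8, 10])
print(len(flat))
```

The flat list ends up with four numbers followed by one list, which no longer describes a rectangular matrix; that is the difference the last sentence above refers to.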

Store regression coefficients, merge back into data-frame

I'm trying to estimate a random effects model, and store those coefficients. I then want to merge them to the data-frame to predict the dependent variable.
There is a random effect coefficient for each group. In the data-frame, if an observation belongs to group 1, I want the group 1 coefficient listed there. For observations in group 2, the group 2 coefficient and so on.
I am able to access and store the coefficients. But I'm not able to merge them back into the data-frame. I'm not sure how to think of it. Here is the code I have so far:
md = smf.mixedlm('y ~ x', data=df, groups=train['GroupID'])
mdf = md.fit()
I tried storing the coefficients in three ways:
re_coeffs = pd.Series(mdf.random_effects.values) #creates a series with shape (1,)
re_coeffs = [(k) for k in mdf.random_effects.values()] #creates a list with the coeffs
re_coeffs = np.array(mdf.random_effects.values) #creates array with shape ()
All of them work, but none of them let me merge them back into the original data-frame. I'm not sure about using a dictionary or a list, or generally how to think about merging these coefficients back into the original data-frame.
I'll appreciate any suggestions for this.
This seems to work:
md = smf.mixedlm('y ~ x', data=train, groups=train['GroupID'])
mdf = md.fit()
# random_effects is a dict keyed by group label, so build one row per group
df = pd.DataFrame(mdf.random_effects).T
df['GroupID'] = df.index  # the key column must have the same name as the merge key
merged = pd.merge(train, df, on=['GroupID'])
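An alternative to merging is to map each observation's group label to its coefficient. The sketch below does not fit a model: random_effects is a hand-made stand-in that assumes mdf.random_effects is a dict from group label to a Series of effects, with a single random intercept per group. The frame, the values, and the column name re_coeff are all made up.

```python
import pandas as pd

# Made-up training data with a group label per observation
train = pd.DataFrame({'GroupID': ['a', 'a', 'b', 'c'],
                      'x': [1.0, 2.0, 3.0, 4.0]})

# Hypothetical stand-in for mdf.random_effects (assumed shape: label -> Series)
random_effects = {
    'a': pd.Series({'Group': 0.5}),
    'b': pd.Series({'Group': -0.2}),
    'c': pd.Series({'Group': 0.1}),
}

# Assumes one effect per group (random intercept only): keep the first value
intercepts = {group: effects.iloc[0] for group, effects in random_effects.items()}

# Look up each row's group label in the dict
train['re_coeff'] = train['GroupID'].map(intercepts)
print(train)
```

Rows whose group label is missing from the dict would get NaN, which makes unmatched groups easy to spot.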

When does pandas do pass-by-reference Vs pass-by-value when passing dataframe to a function?

def dropdf_copy(df):
    df = df.drop('y',axis=1)
def dropdf_inplace(df):
    df.drop('y',axis=1,inplace=True)
def changecell(df):
    df['y'][0] = 99
x = pd.DataFrame({'x': [1,2],'y': [20,31]})
x
Out[204]:
   x   y
0  1  20
1  2  31
dropdf_copy(x)
x
Out[206]:
   x   y
0  1  20
1  2  31
changecell(x)
x
Out[208]:
   x   y
0  1  99
1  2  31
In the above example dropdf_copy() doesn't modify the original dataframe x, while changecell() does. I know that if I make this minor change to changecell(), it won't change x:
def changecell(df):
    df = df.copy()
    df['y'][0] = 99
I don't think it's very elegant to include df = df.copy() in every function I write.
Questions
1) Under what circumstances does pandas change the original dataframe, and when does it not? Can someone give me a clear, generalizable rule? I know it may have something to do with mutability vs. immutability, but it's not clearly explained on Stack Overflow.
2) Does numpy behave similarly, or is it different? What about other Python objects?
PS: I have searched Stack Overflow but couldn't find a clear, generalizable rule for this problem.
Python passes a reference to the same object into the function; no copy is made. The original object stays unchanged only if the function ends up working on a new object, either because an assignment rebinds the local name to a new object or because copy() is called.
Example with an explicit copy:
#1. Assignment
def dropdf_copy1(df):
    df = df.drop('y',axis=1)
#2. copy()
def dropdf_copy2(df):
    df = df.copy()
    df.drop('y',axis=1,inplace = True)
If no new object is created, the original object that was passed in is changed.
def dropdf_inplace(df):
    df.drop('y',axis=1,inplace = True)
This has nothing to do with pandas. It's a matter of local versus global names referring to mutable values. In dropdf_copy, the assignment makes df a local variable bound to a new object.
The same happens with lists:
def global_(l):
    l[0]=1
def local_(l):
    l=l+[0]
The second function behaves the same as if you had written:
def local_(l):
    l2=l+[0]
so l is not affected.
Here is the Python Tutor example which shows what happens.
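The same distinction can be checked directly on a DataFrame. A minimal sketch, with hypothetical function names, contrasting rebinding the local name with mutating the object that was passed in:

```python
import pandas as pd

def drop_rebind(df):
    # drop() returns a new DataFrame; only the local name df points to it
    df = df.drop('y', axis=1)
    return df

def drop_inplace(df):
    # inplace=True changes the very object the caller passed in
    df.drop('y', axis=1, inplace=True)

x = pd.DataFrame({'x': [1, 2], 'y': [20, 31]})

result = drop_rebind(x)
cols_after_rebind = list(x.columns)     # the caller's frame still has both columns

drop_inplace(x)
cols_after_inplace = list(x.columns)    # the caller's frame has lost 'y'

print(cols_after_rebind, cols_after_inplace, list(result.columns))
```

Returning the new frame, as drop_rebind does, lets the caller decide whether to keep the original, without a df.copy() at the top of every function.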