Create and Merge Pandas Dataframes in loop - pandas
I need to read in a bunch of input dataframes based on some conditions, merge them, and finally create dataframes named 'merge_m0', 'merge_m1', 'merge_m2' and so on.
In the actual code I need to read about 20 dataframes. For simplicity and ease of understanding, I'm creating 3 dataframes here and using a for loop to read and merge them.
#INPUT: Sample input dataframes df0, df1 & df2
import pandas as pd
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
To do this, I'm using globals() to create the dataframes in a loop and to merge them, but it's not working and throws the error "'DataFrame' object has no attribute 'globals'".
#Code:
def comb_mths(x,y):
    globals()[f"m{x}"] = globals()[f'df{x}'][globals()[f'df{x}'].globals()[f'm{x}_val_mthd'].isin([1,25])]
    globals()[f"m{y}"] = globals()[f'df{y}'][(globals()[f'df{y}'].globals()[f'm{y}_val_mthd'].isin([8,10,11,12])) & (globals()[f'df{y}'].globals()[f'm{y}_orig_val_mthd'].isin([2,3,4,5]))]
    globals()[f"merge_m{x}"]=pd.merge(globals()[f"m{x}"],globals()[f"m{y}"], how='inner',on=['id'])
for i in range(0,3):
    comb_mths(i, i+1)
I've also tried the following in place of the first line of the function above:
#globals()[f"m{x}"] = globals()[f'df{x}'][globals()[f'df{x}'].m{x}_val_mthd.isin([1,25])]
#globals()[f"m{x}"] = globals()[f'df{x}']["[f'm{x}_val_mthd']"].isin([1,25])
I think there must be a better and easier way to do this, and I'd appreciate it if anyone can help. Thanks!
Edit: my updated post
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
df_list=[]
for i in range(0,3):
    df_list.append(globals()[f'df{i}']) # I'm appending all the input dataframes, which are already created by another step in the code, and hope this works
def comb_mths(i):
    dfa = df_list[i]
    dfb = df_list[i+1]
    dfma = dfa[dfa.iloc[:, 1].isin([1,25])]
    dfmb = dfb[(dfb.iloc[:, 1].isin([8,10,11,12])) & (dfb.iloc[:, 3].isin([2,3,4,5]))]
    print(dfma)
    print(dfmb)
    print('\n'*3)
    globals()[f"merge_m{i}"] = dfma.merge(dfmb, how='inner', on=['id'])
    return globals()[f"merge_m{i}"]
for i in range(0,2):
    comb_mths(i)
print(merge_m0)
print(merge_m1)
In the above function, after creating the "merge_m{i}" dataframe, I need to check one more if-else condition and calculate a variable, say 'mths'.
**The logic goes like this:
when i=0, I need to check "m1_orig_val_mthd"; when i=1, I need to check "m2_orig_val_mthd"; when i=2, I need to check "m3_orig_val_mthd"; and so on.**
The pseudocode for that if-else condition is below. Can you please show me how to add this condition to the above function?
when i=0 (1st iteration):
if m1_orig_val_mthd isin (2,4,6):
    diff = (mydate - m1_appr_rcvd_dt)//(np.timedelta64(1,'M'))
    mths = diff - (i-1)
elif m1_orig_val_mthd isin (1,3,5):
    diff = (mydate - m1_bpo_rcvd_dt)//(np.timedelta64(1,'M'))
    mths = diff - (i-1)
when i=1 (2nd iteration):
if m2_orig_val_mthd isin (2,4,6):
    diff = (mydate - m2_appr_rcvd_dt)//(np.timedelta64(1,'M'))
    mths = diff - (i-1)
elif m2_orig_val_mthd isin (1,3,5):
    diff = (mydate - m2_bpo_rcvd_dt)//(np.timedelta64(1,'M'))
    mths = diff - (i-1)
and so on...
I took a different approach, assuming you can create all the input dataframes first. If you can create your dataframes and put them in a list, they are easier to handle and the code is easier to read.
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
# add your inputs to the list
df_list = [df0, df1, df2]
# only pass in i, then get dfa and dfb by position in the list
def comb_mths(i):
    dfa = df_list[i]
    dfb = df_list[i+1]
    # print(dfa)
    # print(dfb)
    # print('\n'*3)
    # I wasn't exactly sure what you wanted here, but I think the original issue was you were calling your new dataframe before it was created.
    dfma = dfa[dfa.iloc[:, 1].isin([1,25])] # as long as the columns are in the same position, you don't need to refer to them by name, just by position
    dfmb = dfb[(dfb.iloc[:, 1].isin([8,10,11,12])) & (dfb.iloc[:, 3].isin([2,3,4,5]))]
    print(dfma)
    print(dfmb)
    print('\n'*3)
    # creating the new merged dataframes; cleaned this up too
    globals()[f"merge_m{i}"] = dfma.merge(dfmb, how='inner', on=['id'])
    return globals()[f"merge_m{i}"] # added return statement
for i in range(0,2): # stop at len(df_list) - 1, because comb_mths reads df_list[i+1]
    comb_mths(i)
print(merge_m0)
print(merge_m1)
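For what it's worth, the error in the question comes from `.globals()` being looked up as an attribute of the DataFrame; a column whose name is built at run time can be selected with `df[f'm{x}_val_mthd']`. Below is a minimal sketch of the same filter-and-merge step that selects columns by name and keeps the results in a dict instead of in globals(). The `frames` parameter and the dict name `merged` are my own choices for illustration, not anything from the original post.

```python
import pandas as pd

df0 = pd.DataFrame({'id': [100, 101, 102, 103], 'm0_val_mthd': [1, 8, 25, 41],
                    'name': ['AAA', 'BBB', 'CCC', 'DDD'], 'm0_orig_val_mthd': [2, 3, 4, 5]})
df1 = pd.DataFrame({'id': [100, 104, 102, 103], 'm1_val_mthd': [1, 8, 10, 25],
                    'name': ['EEE', 'FFF', 'GGG', 'HHH'], 'm1_orig_val_mthd': [2, 3, 4, 5]})
df2 = pd.DataFrame({'id': [100, 104, 102, 103], 'm2_val_mthd': [1, 8, 10, 25],
                    'name': ['III', 'JJJ', 'KKK', 'LLL'], 'm2_orig_val_mthd': [2, 3, 4, 5]})
df_list = [df0, df1, df2]


def comb_mths(i, frames):
    """Filter frames[i] and frames[i + 1] by their m{i} / m{i+1} columns and inner-merge on id."""
    dfa, dfb = frames[i], frames[i + 1]
    # Build the column names with f-strings instead of df.globals()[...]
    dfma = dfa[dfa[f'm{i}_val_mthd'].isin([1, 25])]
    dfmb = dfb[dfb[f'm{i + 1}_val_mthd'].isin([8, 10, 11, 12])
               & dfb[f'm{i + 1}_orig_val_mthd'].isin([2, 3, 4, 5])]
    return dfma.merge(dfmb, how='inner', on='id')


# One entry per adjacent pair; the keys mirror the merge_m0, merge_m1, ... names.
merged = {f'merge_m{i}': comb_mths(i, df_list) for i in range(len(df_list) - 1)}
print(merged['merge_m0'])
```

With a dict, the results can be looped over (`merged.items()`) without knowing in advance how many there are, which is awkward with dynamically created global names.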
Additional code:
# to populate df_list, do this
# you aren't actually naming the dataframes; I only did that in the example above to match your example
# when you use them, you refer to their position in the list
df_list = []
for i in range(0,20):
    df = 'do your code here'
    df_list.append(df)
# print the dataframes to verify they were created
for df in df_list:
    print(df)
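On the follow-up about 'mths' in the question's edit, here is a rough sketch of how the if/elif could be applied to a whole merged dataframe at once. Several things are assumptions on my part: the function name `add_mths`, that the `m{n}_appr_rcvd_dt` and `m{n}_bpo_rcvd_dt` columns exist in the merged frame and hold dates (they are not in the sample data), and that a difference in calendar months is acceptable. I swapped the `// np.timedelta64(1,'M')` division for year/month arithmetic because recent pandas versions may reject month-sized timedeltas, so the numbers can differ slightly from the original formula.

```python
import numpy as np
import pandas as pd


def add_mths(merged, i, mydate):
    """Return a copy of `merged` with an 'mths' column (hypothetical helper).

    For iteration i the columns of month n = i + 1 are checked, as described
    in the question's pseudocode.
    """
    n = i + 1
    orig = merged[f'm{n}_orig_val_mthd']

    def months_since(col):
        # Whole calendar months between each date in `col` and mydate (assumption).
        dates = pd.to_datetime(merged[col])
        return ((mydate.year - dates.dt.year) * 12
                + (mydate.month - dates.dt.month)).to_numpy()

    diff = np.select(
        [orig.isin([2, 4, 6]).to_numpy(), orig.isin([1, 3, 5]).to_numpy()],
        [months_since(f'm{n}_appr_rcvd_dt'), months_since(f'm{n}_bpo_rcvd_dt')],
        default=np.nan,  # rows matching neither branch get NaN
    )
    out = merged.copy()
    out['mths'] = diff - (i - 1)
    return out


# Made-up demo data, only to show the call.
demo = pd.DataFrame({
    'id': [100, 102, 103],
    'm1_orig_val_mthd': [2, 3, 7],
    'm1_appr_rcvd_dt': pd.to_datetime(['2023-01-10', '2023-05-01', '2023-02-01']),
    'm1_bpo_rcvd_dt': pd.to_datetime(['2022-01-01', '2022-06-01', '2022-03-01']),
})
result = add_mths(demo, 0, pd.Timestamp('2023-06-15'))
print(result[['id', 'mths']])
```

Inside comb_mths you would call something like this on the merged frame before returning it.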
Related
replacing df.append with pd.concat when building a new dataframe from file read
...
header = pd.DataFrame()
for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}:
    header = header.append({'col1':data1[x].split(':')[0], 'col2':data1[x].split(':')[1][:-1], 'col3':data2[x].split(':')[1][:-1], 'col4':data2[x]==data1[x], 'col5':'---'}, ignore_index=True)
...
I have some Jupyter Notebook code which reads 2 text files into data1 and data2 and, using a list, picks out specific matching lines from both files into a dataframe for easy display and comparison in the notebook.
Since df.append is now being dropped in favour of pd.concat, what's the tidiest way to do this? Is it basically to replace the inner loop code with
...
header = pd.concat(header, {all the column code from above})
...
Additional input in response to the comment below: yes, sorry, for example the next block of code does this:
for x in {4,2,5}:
    header = header.append({'col1':'SOMENEWROWNAME', 'col2':data1[x].split(':')[1][:-1], 'col3':data2[x].split(':')[1][:-1], 'col4':data2[x]==data1[x], 'col5':float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])}, ignore_index=True)
repeated 5 times with different data indices in the loop, and then a different SOMENEWROWNAME.
I inherited this notebook, and I see now that it was done this way because they only wanted to do a numerical float difference on the columns that contain numbers. There are several such blocks, with different lines in the data, where that first parameter SOMENEWROWNAME is a different text field from the respective lines in the data. So I was primarily just trying to fix these append-to-concat warnings, but of course if the code can be better written then all good!
Use a list comprehension and the DataFrame constructor:
data = [{'col1':data1[x].split(':')[0], 'col2':data1[x].split(':')[1][:-1], 'col3':data2[x].split(':')[1][:-1], 'col4':data2[x]==data1[x], 'col5':'---'} for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}]
df = pd.DataFrame(data)
EDIT:
out = []
#sample
for x in {1,7,30}:
    out.append({'col1':'SOMENEWROWNAME', 'col2':data1[x].split(':')[1][:-1], 'col3':data2[x].split(':')[1][:-1], 'col4':data2[x]==data1[x], 'col5':float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
df1 = pd.DataFrame(out)
out1 = []
#sample
for x in {1,7,30}:
    out1.append({another dict})
df2 = pd.DataFrame(out1)
df = pd.concat([df1, df2])
Or:
final = []
for x in {4,2,5}:
    final.append({'col1':'SOMENEWROWNAME', 'col2':data1[x].split(':')[1][:-1], 'col3':data2[x].split(':')[1][:-1], 'col4':data2[x]==data1[x], 'col5':float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
for x in {4,2,5}:
    final.append({another dict})
df = pd.DataFrame(final)
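To make the list-of-dicts idea concrete, here is a small self-contained sketch. The contents of data1 and data2 and the helper `value` are invented for illustration; the real notebook reads these lines from two text files. I also iterate over lists rather than sets so the row order is predictable.

```python
import pandas as pd

# Hypothetical stand-ins for the lines read from the two text files.
data1 = ["speed:10.5\n", "mode:fast\n", "gain:2.0\n"]
data2 = ["speed:12.0\n", "mode:fast\n", "gain:2.0\n"]


def value(line):
    """Text after the first colon, minus the trailing newline (hypothetical helper)."""
    return line.split(":")[1][:-1]


rows = []
# Text fields: no numeric difference, just a placeholder in col5.
for x in [1]:
    rows.append({"col1": data1[x].split(":")[0], "col2": value(data1[x]),
                 "col3": value(data2[x]), "col4": data2[x] == data1[x],
                 "col5": "---"})
# Numeric fields: col5 holds the float difference between the two files.
for x in [0, 2]:
    rows.append({"col1": data1[x].split(":")[0], "col2": value(data1[x]),
                 "col3": value(data2[x]), "col4": data2[x] == data1[x],
                 "col5": float(value(data2[x])) - float(value(data1[x]))})

# Build the dataframe once, after all rows have been collected.
header = pd.DataFrame(rows)
print(header)
```

Collecting plain dicts and constructing the frame once avoids both the append warnings and the repeated copying that comes with growing a dataframe row by row.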
Working on multiple data frames with data for NBA players during the season, how can I modify all the dataframes at the same time?
I have a list of 16 dataframes that contain stats for each player in the NBA during the respective season. My end goal is to run unsupervised learning algorithms on the data frames. For example, I want to see if I can determine a player's position from their stats, or if I can determine their total points during the season based on their stats.
What I would like to do is modify the list (df_list) of these dataframes, unless there's a better solution, instead of modifying each dataframe, to:
1. Change the datatype of the MP (minutes played) column from str to int.
2. Modify the dataframe so that it only contains players with 1000 or more MP and there are no duplicate players (Rk). For instance, a player (Rk) can play for three teams in a season and have 200 MP, 300 MP, and 400 MP with each team. He'll have a row for each team and a row called TOT which gives his MP as 900 (200+300+400), for a total of four rows in the dataframe. I only need the TOT row.
3. Use simple algebra with various individual columns, for example: totalling the MP column and the PTS column and then dividing the sum of the PTS column by the sum of the MP column. Or dividing the total of the PTS column by the len of the PTS column.
What I've done so far is this: import my libraries and create 16 dataframes using pd.read_html(url). The first dataframes were created using two lines of code:
url = "https://www.basketball-reference.com/leagues/NBA_1997_totals.html"
ninetysix = pd.read_html(url)[0]
HOWEVER, the next four data frames had to be created using a few additional lines of code (I received an error that said "html5lib not found, please install it", so I downloaded both html5lib and requests). I mention this because the difference in how the dataframes were created may have to be considered in a solution.
The code I used:
import requests
import uuid
url = 'https://www.basketball-reference.com/leagues/NBA_1998_totals.html'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
ninetyseven = pd.read_html(html)[0]
These four data frames look like this: [image not included]
I tried this but it didn't do anything:
df_list = [ eightyfour, eightyfive, eightysix, eightyseven, eightyeight, eightynine, ninety, ninetyone, ninetytwo, ninetyfour, ninetyfive, ninetysix, ninetyseven, ninetyeight, owe_one, owe_two ]
for df in df_list:
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
owe_two
============================UPDATE===================================
This code solves a portion of problem #2:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
dd = pd.read_html(url)[0]
dd = dd[dd['Rk'].ne('Rk')]
dd['MP'] = dd['MP'].astype(int)
players_1000_rk_list = list(dd[dd['MP'] >= 1000]['Rk'])
players_dd = dd[dd['Rk'].isin(players_1000_rk_list)]
But it doesn't remove the duplicates.
==================== UPDATE 10/11/22 ================================
Let's say I take the rows with the value "TOT" in the "Tm" column and create a new DF with them. Could I then compare the new DF with the original data frame and remove the names from the original data frame IF they match the names in the new data frame?
The problem is that the df you are working on in the loop is not the same df that is in df_list: assigning to df inside the loop only rebinds the loop variable. You could solve this by saving the new df back to the list, overwriting the old df:
for i,df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
    df_list[i] = df
These 2 lines are probably wrong as well, because the first one replaces df with a plain list:
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
Perhaps you want this:
for i,df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    #df = list(df[df['MP'] >= 1000]['Rk'])
    #df = df[df['Rk'].isin(df)]
    # just the rows where MP >= 1000
    df_list[i] = df[df['MP'] >= 1000]
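A possible variation, sketched under my own assumptions: put the cleaning steps in a function and rebuild the list with a comprehension. The function name `clean_season` and the tiny demo frame are made up. Unlike the loop above, which keeps only the TOT rows, this sketch keeps a player's TOT row when he has several rows and his single row otherwise, which is how I read the "no duplicate players" requirement. It also drops the repeated header rows (Rk == 'Rk') mentioned in the question's update.

```python
import pandas as pd


def clean_season(df, min_minutes=1000):
    """Return one row per player with at least `min_minutes` MP (hypothetical helper)."""
    # Drop header rows that repeat inside the scraped table.
    df = df[df['Rk'].astype(str) != 'Rk'].copy()
    df['MP'] = df['MP'].astype(int)
    # Players with several rows (traded mid-season): keep only their TOT row.
    has_several_rows = df.duplicated('Rk', keep=False)
    df = df[~has_several_rows | (df['Tm'] == 'TOT')]
    return df[df['MP'] >= min_minutes].reset_index(drop=True)


# Made-up miniature season table in the same shape as the scraped data.
season = pd.DataFrame({
    'Rk': ['1', '2', '2', '2', 'Rk', '3'],
    'Tm': ['BOS', 'TOT', 'LAL', 'NYK', 'Tm', 'CHI'],
    'MP': ['1500', '1200', '700', '500', 'MP', '900'],
})
df_list = [season]

# Rebuild the list so it actually holds the cleaned frames.
df_list = [clean_season(df) for df in df_list]
print(df_list[0])
```

Because the comprehension creates a new list, there is no need to track indices with enumerate.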
Pandas: concatenating data frames in iterations
I want to concatenate data frames in a loop with pandas.concat. They have the same columns but different indexes and values, and they are generated within the loop. That way the output dataframe will 'grow' over the iterations, starting from an empty data frame. For a list it would look like this:
a = []
for i in range(10):
    a.append(i**2)
However, I found it is not advisable to make an empty data frame. Is the only solution to get the first data frame before the loop and then concat the 2nd, 3rd, ... data frames in the loop?
Jarek
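You could use append (note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0), where df is the data frame produced in each iteration:
a = pd.DataFrame()
for i in range(10):
    df = <your code here>
    a = a.append(df)
Or concat:
a = pd.DataFrame()
for i in range(10):
    df = <your code here>
    a = pd.concat([a, df])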
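Another common pattern, which mirrors the list example in the question, is to collect the frames in a plain Python list and call pd.concat once after the loop. This is only a sketch; the frame built inside the loop is made up to stand in for whatever the real loop produces.

```python
import pandas as pd

frames = []
for i in range(3):
    # Stand-in for the data frame generated in each iteration (same columns, different index).
    df = pd.DataFrame({'i': [i, i], 'square': [i ** 2, i ** 2]},
                      index=[f'r{i}a', f'r{i}b'])
    frames.append(df)

# One concat at the end instead of growing a data frame inside the loop.
result = pd.concat(frames)
print(result)
```

This sidesteps the empty starting data frame entirely, and it avoids copying the accumulated data on every iteration.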
slice dataframe inplace and dynamically rename in a loop
I am aware that it may not be good practice, but I am curious to know if it is possible to take two dfs (in this case, srm and srae), take a slice of each, and then name the sliced dataframes srm1 and srae1. The logic is below.
for x in (srm, srae):
    x1 = x[x['years_in_role']>5]
    print(x.shape, x1.shape)
You can unpack the 2 results into 2 variables:
srm1, srae1 = [x[x['years_in_role']>5] for x in (srm, srae)]
Your loop can be used to build a list, from which you then create the new variables:
L = []
for x in (srm, srae):
    x1 = x[x['years_in_role']>5]
    L.append(x1)
srm1 = L[0]
srae1 = L[1]
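If the number of dataframes grows, a dict keyed by name may be easier to manage than separate variables. A minimal sketch, with invented sample data for srm and srae and a dict name (`experienced`) of my own choosing:

```python
import pandas as pd

# Made-up stand-ins for the two dataframes in the question.
srm = pd.DataFrame({'name': ['a', 'b', 'c'], 'years_in_role': [3, 6, 9]})
srae = pd.DataFrame({'name': ['d', 'e'], 'years_in_role': [7, 2]})

frames = {'srm': srm, 'srae': srae}
# Keys follow the srm1 / srae1 naming from the question.
experienced = {f'{name}1': df[df['years_in_role'] > 5] for name, df in frames.items()}

for name, df in experienced.items():
    print(name, df.shape)
```

The sliced frames are then available as experienced['srm1'] and experienced['srae1'].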
I have a dataframe and I want to find the standard deviation for some specific cells
I'm trying to use pandas to find the standard deviation of the entries in some specific cells. I have tried using NumPy's std like so:
numpy.std(df[columnName][j:i])
I have also tried using this:
df.std(axis=0)[columnName][j:i]
Just pseudocode, because my actual code is more confusing than necessary for this question:
df = loadIris()
for feat in df.columns:
    i = 0
    j = 0
    flower = df['flower'][i]
    while i < df.index.max():
        if df['flower'][i] == flower:
            i+=1
        else:
            j = i
            stand = df.std(axis=0)[feat][j:i]
            flower = df['flower'][i]
I ended up just appending all of the values to a list and then calculating the standard deviation using statistics.stdev which you can get by importing statistics.
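For anyone landing here later: if the goal is the standard deviation of a feature within each flower type, pandas can do the grouping directly. This is a sketch on made-up numbers, and the column name 'petal_length' is an assumption. Note that pandas' std and statistics.stdev both compute the sample standard deviation (ddof=1), while numpy.std defaults to the population version (ddof=0), which can explain differing results.

```python
import statistics

import pandas as pd

# Tiny made-up table in the shape of the iris data.
df = pd.DataFrame({
    'flower': ['setosa', 'setosa', 'setosa', 'virginica', 'virginica', 'virginica'],
    'petal_length': [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
})

# Sample standard deviation of the feature within each flower type.
per_flower_std = df.groupby('flower')['petal_length'].std()
print(per_flower_std)

# Same value computed the way the answer above describes, for one group.
setosa_values = df.loc[df['flower'] == 'setosa', 'petal_length'].tolist()
print(statistics.stdev(setosa_values))
```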