Pandas: concatenating data frames in iterations - pandas

I want to concatenate data frames in a loop with pandas.concat. They have the same columns but different indexes and values and they are generated within the loop. In such way the output dataframe will 'grow' over iterations starting from empty data frame. For a list it will look like this:
a = []
for i in range(10):
a.append(i**2)
However, I found it is not advisable to make empty data frame. Is the only solution to get the first data frame before the loop and in the loop concat 2nd, 3rd, ... data frames?
Jarek

You could use append:
a = pd.DataFrame()
for i in range(10):
<your code here>
a = a.append(i)
Or concat:
a = pd.DataFrame()
for i in range(10):
<your code here>
a = pd.concat([a, i])

Related

replacing df.append with pd.concat when building a new dataframe from file read

...
header = pd.DataFrame()
for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}:
header = header.append({'col1':data1[x].split(':')[0],
'col2':data1[x].split(':')[1][:-1],
'col3':data2[x].split(':')[1][:-1],
'col4':data2[x]==data1[x],
'col5':'---'},
ignore_index=True)`
...
I have some Jupyter Notebook code which reads in 2 text files to data1 and data2 and using a list I am picking out specific matching lines in both files to a dataframe for easy display and comparison in the notebook
Since df.append is now being bumped for pd.concat what's the tidiest way to do this
is it basically to replace the inner loop code with
...
header = pd.concat(header, {all the column code from above })
...
addtional input to comment below
Yes, sorry for example the next block of code does this:
for x in {4,2 5}:
header = header.append({'col1':SOMENEWROWNAME'',
'col2':data1[x].split(':')[1][:-1],
'col3':data2[x].split(':')[1][:-1],
'col4':data2[x]==data1[x],
'col5':float(data2[x].split(':'},[1]([-1]) -float(data1[x].split(':'},[1]([-1])
ignore_index=True)`
repeated 5 times with different data indices in the loop, and then a different SOMENEWROWNAME
I inherited this notebook and I see now that this way of doing it was because they only wanted to do a numerical float difference on the columns where numbers come
but there are several such blocks, with different lines in the data and where that first parameter SOMENEWROWNAME is the different text fields from the respective lines in the data.
so I was primarily just trying to fix these append to concat warnings, but of course if the code can be better written then all good!
Use list comprehension and DataFrame constructor:
data = [{'col1':data1[x].split(':')[0],
'col2':data1[x].split(':')[1][:-1],
'col3':data2[x].split(':')[1][:-1],
'col4':data2[x]==data1[x],
'col5':'---'} for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}]
df = pd.DataFrame(data)
EDIT:
out = []
#sample
for x in {1,7,30}:
out.append({'col1':SOMENEWROWNAME'',
'col2':data1[x].split(':')[1][:-1],
'col3':data2[x].split(':')[1][:-1],
'col4':data2[x]==data1[x],
'col5':float(data2[x].split(':'},[1]([-1]) -float(data1[x].split(':'},[1]([-1]))))))
df1 = pd.DataFrame(out)
out1 = []
#sample
for x in {1,7,30}:
out1.append({another dict})))
df2 = pd.DataFrame(out1)
df = pd.concat([df1, df2])
Or:
final = []
for x in {4,2,5}:
final.append({'col1':SOMENEWROWNAME'',
'col2':data1[x].split(':')[1][:-1],
'col3':data2[x].split(':')[1][:-1],
'col4':data2[x]==data1[x],
'col5':float(data2[x].split(':'},[1]([-1]) -float(data1[x].split(':'},[1]([-1]))))))
for x in {4,2, 5}:
final.append({another dict})))
df = pd.DataFrame(final)

How can I iterate/ transpose /append a data frame to another one?

for row in range(1, len(df)):
try:
df_out, orthogroup, len_group = HOG_get_group_stats(df.loc[row, "HOG"])
temp_df = pd.DataFrame()
for id in range(len(df_out)):
print(" ")
temp_df = pd.concat([df, pd.DataFrame(df_out.iloc[id, :]).T], axis=1)
temp_df["HOG"] = orthogroup
temp_df["len_group"] = len_group
print(temp_df)
except:
print(row, "no")
Here I have a script that does the following:
Iterate over df and apply the HOG_get_group_stats function to the HOG column in df and then, get 3 variables as outputs. (Basically, the function creates some stats as a data frame called df_out, and extracts some information as two more columns called orthogroup, len_group)
Create an empty template called temp_df
Transpose the df_out data frame and make it one single row and then, concatenate with the df we used in the beginning as columns.
Add orthogroup, len_group columns to the end of the temp_df
Problem:
It prints out the data however, when I try to see the temp_df as a data frame it shows only a single row ( probably the last one) means that my concatenation of several data frames doesn't work.
Questions:
How can I iterate and then append a data frame as columns?
Is there an easier way to iterate over a data frame? (e.g. iterrow)
Is there a better way to transpose rows to columns in a data frame? ( e.g. pivot, melt)
Any help would be appreciated!!
You can find the sample files to df, df_out,temp_df and expected output_sample table here :
Sample_files

Create and Merge Pandas Dataframes in loop

I need to read in bunch of i/p dataframes based on some conditions and then merge them and finally create dataframes as 'merge_m0', 'merge_m1', 'merge_m2' and so on.
In the actual code, I need to read about 20 dataframes. But, for simplicity and ease of understanding, I'm creating 3 dataframes and using a for loop to read them and merge.
#INPUT: Sample input dataframes df0, df1 &df2
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
To do this, I'm using globals() to create dataframes in loop and to merge them but it's not working and throwing " 'DataFrame' object has no attribute 'globals'" error.
#Code:
def comb_mths(x,y):
globals()[f"m{x}"] = globals()[f'df{x}'][globals()[f'df{x}'].globals()[f'm{x}_val_mthd'].isin([1,25])]
globals()[f"m{y}"] = globals()[f'df{y}'][(globals()[f'df{y}'].globals()[f'm{y}_val_mthd'].isin([8,10,11,12])) & (globals()[f'df{y}'].globals()[f'm{y}_orig_val_mthd'].isin([2,3,4,5]))]
globals()[f"merge_m{x}"]=pd.merge(globals()[f"m{x}"],globals()[f"m{y}"], how='inner',on=['id'])
for i in range(0,3):
comb_mths(i, i+1)
I've tried as below as well in place of the 1st line in the above function
#globals()[f"m{x}"] = globals()[f'df{x}'][globals()[f'df{x}'].m{x}_val_mthd.isin([1,25])]
#globals()[f"m{x}"] = globals()[f'df{x}']["[f'm{x}_val_mthd']"].isin([1,25])
I think there must be some better and easy alternative to do this and appreciate if anyone can help. Thanks!
Edit#
my updated post:
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
df_list=[]
for i in range(0,3):
df_list.append(globals()[f'df{i}']) #I'm appending all the i/p dataframes which are created already by other step in the code and hope this works
def comb_mths(i):
dfa = df_list[i]
dfb = df_list[i+1]
dfma = dfa[dfa.iloc[:, 1].isin([1,25])]
dfmb = dfb[(dfb.iloc[:, 1].isin([8,10,11,12])) & (dfb.iloc[:, 3].isin([2,3,4,5]))]
print(dfma)
print(dfmb)
print('\n'*3)
globals()[f"merge_m{i}"] = dfma.merge(dfmb, how='inner', on=['id'])
return globals()[f"merge_m{i}"]
for i in range(0,2):
comb_mths(i)
print(merge_m0)
print(merge_m1)
in the above function after creating "merge_m{i}" dataframe, I need to check one more 'if-else' condition and calculate a variable say 'mths'.
**The logic goes like this:
when i=0, I need to check for "m1_orig_val_mthd", when i=1, I need to check for "m2_orig_val_mthd", when i=2, I need to check for "m3_orig_val_mthd" and so on**
and that if-else condition pseudo code is like below. Can you please show me how do I add this below condition also in the above function?
when i=0 1st iteration
if m1_orig_val_mthd isin (2,4,6):
diff = (mydate - m1_appr_rcvd_dt)//(np.timedelta64(1,'M'))
mths = diff - (i-1)
elif m1_orig_val_mthd isin (1,3,5):
diff = (mydate - m1_bpo_rcvd_dt)//(np.timedelta64(1,'M'))
mths = diff - (i-1)
when i=1 2nd iteration
if m2_orig_val_mthd isin (2,4,6):
diff = (mydate - m2_appr_rcvd_dt)//(np.timedelta64(1,'M'))
mths = diff - (i-1)
elif m2_orig_val_mthd isin (1,3,5):
diff = (mydate - m2_bpo_rcvd_dt)//(np.timedelta64(1,'M'))
mths = diff - (i-1)
and so on...
I took a different approach assuming you can create all the input dataframes first. If you can create your dataframes and put them in a list, it makes handling them easier and code easier to read.
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
# add your inputs to the list
df_list = [df0, df1, df2]
# only pass in i, then call dfa, dfb by position in the list
def comb_mths(i):
dfa = df_list[i]
dfb = df_list[i+1]
# print(dfa)
# print(dfb)
# print('\n'*3)
# I wasn't exactly sure what you wanted here, but I think the original issue was you were calling your new dataframe before it was created.
dfma = dfa[dfa.iloc[:, 1].isin([1,25])] # as long as columns are in the same position, you don't need to call them by name, just position
dfmb = dfb[(dfb.iloc[:, 1].isin([8,10,11,12])) & (dfb.iloc[:, 3].isin([2,3,4,5]))]
print(dfma)
print(dfmb)
print('\n'*3)
#creating new merged datframes. cleaned this up too
globals()[f"merge_m{i}"] = dfma.merge(dfmb, how='inner', on=['id'])
return globals()[f"merge_m{i}"] #added return statement
for i in range(0,2): # watch range end or you'll get an error
comb_mths(i)
print(merge_m0)
print(merge_m1)
Additional code:
# to populate the df_list, do this
# you aren't actually naming them, I only did that in example above due to your Example
# when you call them, you are calling the position in the list
df_list = []
for i in range(0,20):
df = 'do your code here'
df_list.append(df)
# print the df to verify they are created
for df in df_list:
print(df)

Append values to pandas dataframe incrementally inside for loop

I am trying to add rows to pandas dataframe incrementally inside the for loop.
My for loop is like below:
def print_values(cc):
data = []
for x in values[cc]:
data.append(labels[x])
# cc is a constant and data is a list. I need these values to be appended to a row in pandas dataframe.
# Pandas dataframe structure is like follows: df=pd.DataFrame(columns = ['Index','Names'])
print cc
print data
# This does not work - Not sure about the problem !!
#df_clustercontents.loc['Cluster_Index'] = cc
#df_clustercontents.loc['DatabaseNames'] = data
for x in range(0,10):
print_values(x)
I need the values "cc" and "data" to be appended to the dataframe incrementally.
Any help would be really appreciated !!
You can use ,
...
print(cc)
print(data)
df_clustercontents.loc[len(df_clustercontents)]=[cc,data]
...

Python3.4 Pandas DataFrame from function

I wrote a function that outputs selected data from a parsing function. I am trying to put this information into a DataFrame using pandas.DataFrame but I am having trouble.
The headers are listed below as well as the function.head() data output
QUESTION
How will I be able to place the function output within the pandas DataFrame so the headers are linked to the output
HEADERS
--TICK---------NI----------CAPEXP----------GW---------------OE---------------RE-------
OUTPUT
['MMM', ['4,956,000'], ['(1,493,000)'], ['7,050,000'], ['13,109,000'], ['34,317,000']]
['ABT', ['2,284,000'], ['(1,077,000)'], ['10,067,000'], ['21,526,000'], ['22,874,000']]
['ABBV', ['1,774,000'], ['(612,000)'], ['5,862,000'], ['1,742,000'], ['535,000']]
-Loop through each item (I'm assuming data is a list with each element being one of the lists shown above)
-Take the first element as the ticker and convert the rest into numbers using translate to undo the string formatting
-Make a DataFrame per row and then concat all at the end, then transpose
-Set the columns by parsing the header string (I've called it headers)
dflist = list()
for x in data:
h = x[0]
rest = [float(z[0].translate(str.maketrans('(','-','),'))) for z in x[1:]]
dflist.append(pd.DataFrame([h]+rest))
df = pd.concat(dflist, 1).T
df.columns = [x for x in headers.split('-') if len(x) > 0]
But this might be a bit slow - would be easier if you could get your input into a more consistent format.