Performance issue pandas 6 mil rows - pandas

I need some help.
I am trying to combine two data frames. The first has 58k rows, the other has 100. I want to combine them so that each of the 58k rows is paired with all 100 rows from the other df, giving 5.8 million rows in total.
Performance is very poor: it takes 1 hour to do 10 percent. Any suggestions for improvement?
Here is the code snippet.
def myfunc(vendors3, cust_loc):
    cust_loc_vend = pd.DataFrame()
    cust_loc_vend.empty
    for i, row in cust_loc.iterrows():
        clear_output(wait=True)
        a = row.to_frame().T
        df = pd.concat([vendors3, a], axis=1, ignore_index=False)
        #cust_loc_vend = pd.concat([cust_loc_vend, df],axis=1, ignore_index=False)
        cust_loc_vend = cust_loc_vend.append(df)
        print('Current progress:', np.round(i/len(cust_loc)*100, 2), '%')
    return cust_loc_vend
For example, if the first DF has 5 rows and the second has 100 rows:
DF1 (sample 2 columns)
I want a merged DF such that each row in DF2 is paired with all rows from DF1.

Well, all you are looking for is a join. But since there is no common column, you can create a column that is identical in both dataframes, merge on it, and then drop it afterwards.
df['common'] = 1
df1['common'] = 1
df2 = pd.merge(df, df1, on=['common'],how='outer')
df2 = df2.drop('common', axis=1)
where df and df1 are dataframes.
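The loop in the question is likely slow because every append copies the whole accumulated frame, so the work grows with each iteration. As a minimal sketch of the same idea as the helper-column merge, pandas 1.2 and newer also accept how="cross", which builds the full Cartesian product in one call. The column names and values below are hypothetical stand-ins for the asker's data, not taken from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames in the question.
cust_loc = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["A", "B", "C"]})
vendors3 = pd.DataFrame({"vendor": ["x", "y"], "rating": [4.5, 3.9]})

# how="cross" pairs every row of the left frame with every row of the
# right frame (requires pandas 1.2 or newer); no helper column is needed.
cust_loc_vend = cust_loc.merge(vendors3, how="cross")

print(cust_loc_vend.shape)  # (6, 4): 3 customers x 2 vendors, 4 columns
```

With 58k and 100 rows the result has 5.8 million rows, so memory use is worth checking before running it on wide frames.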

Related

How to Union 3 dataframes by Pandas?

I need to union 3 dataframes df1, df2, df3. How can I keep all columns from all 3 dataframes without overlap?
3 dataframes are for 3 different kind of products, one dataframe has less columns than the other two.
step1 = pd.merge_ordered(df1, df2)
all_lob = pd.merge_ordered(step1, df3)
The result seems to have eliminated some columns. How can I just stack the 3 dataframes all together?
Thank you.
I'm not sure what you want to do, but if you want to put the dataframes together at the column level (side by side), you can use the pandas concat function and specify the axis accordingly:
import pandas as pd
df_1 = pd.DataFrame({'Col_1':[1,2,3], 'Col_2':[4,5,6]})
df_2 = pd.DataFrame({'Col_1':[1,2,3,4], 'Col_2':[4,5,6,7]})
df_3 = pd.DataFrame({'Col_1':[1,2,3,4,5], 'Col_2':[4,5,6,7,8]})
df = pd.concat([df_1,df_2,df_3],axis=1)
print(df)
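If the goal is instead to stack the frames row-wise, as the question's wording suggests, a sketch along these lines may be closer to what is wanted. The frames and column names here are hypothetical; the point is only that concat along axis=0 keeps the union of columns and fills the gaps with NaN.

```python
import pandas as pd

# Hypothetical product frames; df3 has fewer columns than the other two.
df1 = pd.DataFrame({"id": [1, 2], "price": [10.0, 20.0], "color": ["red", "blue"]})
df2 = pd.DataFrame({"id": [3], "price": [30.0], "color": ["green"]})
df3 = pd.DataFrame({"id": [4, 5], "price": [40.0, 50.0]})

# axis=0 stacks rows; the result has the union of all columns, and cells
# for columns a frame does not have are filled with NaN.
all_lob = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
print(all_lob)
```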

Concatenate single row dataframe with multiple row dataframe

I have a dataframe with large number of columns but single row as df1:
Col1 Col2 Price Qty
A B 16 5
I have another dataframe as follows, df2:
Price Qty
8 2.5
16 5
6 1.5
I want to achieve the following:
Col1 Col2 Price Qty
A B 8 2.5
A B 16 5
A B 6 1.5
Essentially, I want to repeat the single row of df1 for every row of df2, while taking the Price and Qty columns from df2 in place of the ones originally in df1.
I am not sure how to proceed.
I believe the following approach will work:
import numpy as np
import pandas as pd
# first, let's repeat the single row of df1 as many times as there are rows in df2
df1 = pd.DataFrame(np.repeat(df1.values, len(df2.index), axis=0), columns=df1.columns)
# lets reset the indexes of both DataFrames just to be safe
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
# now, lets merge the two DataFrames based on the index
# after dropping the Price and Qty columns from df1
df3 = pd.merge(df1.drop(['Price', 'Qty'], axis=1), df2, left_index=True, right_index=True)
# finally, lets drop the index columns
df3.drop(['index_x', 'index_y'], inplace=True, axis=1)
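A shorter alternative, sketched here under the assumption that pandas 1.2 or newer is available, is to drop the overlapping columns from df1 and cross-merge with df2. This is a different technique from the repeat-and-merge answer above, using the sample data from the question.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({"Col1": ["A"], "Col2": ["B"], "Price": [16], "Qty": [5]})
df2 = pd.DataFrame({"Price": [8, 16, 6], "Qty": [2.5, 5, 1.5]})

# Drop the columns that df2 should supply, then pair the single remaining
# row of df1 with every row of df2.
df3 = df1.drop(columns=["Price", "Qty"]).merge(df2, how="cross")
print(df3)
```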

Are there any methods to merge multiple dataframes with different templates?

There are a total of 4 dataframes (df1 / df2 / df3 / df4).
Each dataframe has a different template, but they all share some of the same columns.
I want to merge the rows of each dataframe based on those shared columns, but what function should I use? A 'merge' or 'join' function doesn't seem to work, and deleting the rest of the columns after grouping them into a list seems too messy.
The desired result is shown in the attached image.
This is an option: you can concatenate the dataframes and then drop the unneeded columns from the combined dataframe.
df_total = pd.concat([df1, df2, df3, df4], axis=0)
df_total = df_total.drop(['Value2', 'Value3'], axis=1)
You can use reduce to get it done too.
from functools import reduce
reduce(lambda left,right: pd.merge(left, right, on=['ID','value1'], how='outer'), [df1,df2,df3,df4])[['ID','value1']]
ID value1
0 a 1
1 b 4
2 c 5
3 f 1
4 g 5
5 h 6
6 i 1
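If only the shared key columns are needed in the end, one possible variant is to select them first, stack the frames, and drop duplicate rows. The sketch below uses two small hypothetical frames (the asker's real data is only in the missing image), with ID and value1 assumed to be the shared columns.

```python
import pandas as pd

# Hypothetical frames with different extra columns but shared keys.
df1 = pd.DataFrame({"ID": ["a", "b"], "value1": [1, 4], "Value2": [0.1, 0.2]})
df2 = pd.DataFrame({"ID": ["b", "c"], "value1": [4, 5], "Value3": ["x", "y"]})

keys = ["ID", "value1"]  # assumed shared columns

# Keep only the shared columns, stack the rows, and remove repeated pairs.
df_total = (
    pd.concat([df1[keys], df2[keys]], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(df_total)
```

Unlike the outer merge, this never creates the extra columns in the first place, so there is nothing to drop afterwards.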

How do I offset a dataframe with values in another dataframe?

I have two dataframes. One holds the base values (df) and the other holds offsets (df2).
How do I create a third dataframe that is the first dataframe offset by the matching values (matched on ID) in the second dataframe?
This post doesn't seem to do the offset: "Update only some values in a dataframe using another dataframe"
import pandas as pd
# initialize list of lists
data = [['1092', 10.02], ['18723754', 15.76], ['28635', 147.87]]
df = pd.DataFrame(data, columns = ['ID', 'Price'])
offsets = [['1092', 100.00], ['28635', 1000.00], ['88273', 10.]]
df2 = pd.DataFrame(offsets, columns = ['ID', 'Offset'])
print (df)
print (df2)
>>> print (df)
ID Price
0 1092 10.02
1 18723754 15.76 # no offset to affect it
2 28635 147.87
>>> print (df2)
ID Offset
0 1092 100.00
1 28635 1000.00
2 88273 10.00 # < no match
This is what I want to produce: the price has been offset where the IDs match.
ID Price
0 1092 110.02
1 18723754 15.76
2 28635 1147.87
I've also looked at Pandas Merging 101
I don't want to add columns to the dataframe, and I don't want to just replace column values with values from another dataframe.
What I want is to add (sum) column values from the other dataframe to this dataframe, where the IDs match.
The closest I come is df_add=df.reindex_like(df2) + df2 but the problem is that it sums all columns - even the ID column.
Try this :
df['Price'] = pd.merge(df, df2, on=["ID"], how="left")[['Price','Offset']].sum(axis=1)
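An alternative that avoids the intermediate merge is to look each ID up in a Series of offsets and add the result. This is a sketch using the question's sample data, and it assumes the IDs in df2 are unique, since mapping through a Series with duplicate index labels raises an error.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"ID": ["1092", "18723754", "28635"],
                   "Price": [10.02, 15.76, 147.87]})
df2 = pd.DataFrame({"ID": ["1092", "28635", "88273"],
                    "Offset": [100.0, 1000.0, 10.0]})

# Series of offsets keyed by ID (assumes IDs in df2 are unique).
offset_by_id = df2.set_index("ID")["Offset"]

# IDs without a matching offset map to NaN, which is treated as 0 here.
result = df.copy()
result["Price"] = df["Price"] + df["ID"].map(offset_by_id).fillna(0)
print(result)
```

Because the lookup is aligned on the rows of df itself, this does not depend on df having a default integer index.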

How to vectorize looking up the row index of one dataframe based on conditions from rows in another dataframe

I have two pandas dataframes with the same columns, eg
df1 = pd.DataFrame({'A':[0,0,1,1], 'B':[0,1,0,1]})
df2 = pd.DataFrame({'A':[0,1], 'B':[1,1]})
I want to return the row indices from df1 where the values match the rows in df2, e.g. yielding [1, 3]. I could do this by looping over df2, but in practice this is really slow. What is the correct way to vectorize this operation in Pandas?
Try with merge first
out = df1.reset_index().merge(df2,how='right')['index']
Out[63]:
0 1
1 3
Name: index, dtype: int64
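With how='right', a row of df2 that has no match in df1 comes back with a NaN index, which also turns the column into floats. If unmatched rows should simply be skipped, an inner merge may be a better fit. The sketch below adds a hypothetical non-matching row to df2 to show the difference.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 0, 1, 1], "B": [0, 1, 0, 1]})
# The last row (5, 5) is a hypothetical addition with no match in df1.
df2 = pd.DataFrame({"A": [0, 1, 5], "B": [1, 1, 5]})

# reset_index() exposes df1's row labels as an "index" column; the inner
# merge keeps only rows whose (A, B) pair appears in both frames.
matches = (
    df1.reset_index()
    .merge(df2, on=["A", "B"], how="inner")["index"]
    .tolist()
)
print(matches)  # [1, 3]
```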