How to vectorize looking up the row index of one dataframe based on conditions from rows in another dataframe - pandas

I have two pandas dataframes with the same columns, e.g.
df1 = pd.DataFrame({'A':[0,0,1,1], 'B':[0,1,0,1]})
df2 = pd.DataFrame({'A':[0,1], 'B':[1,1]})
And I want to return the row indices from df1 where the values match the rows in df2, e.g. yielding [1, 3]. I could do this by looping over df2, but in practice that is really slow. What is the correct way to vectorize this operation in pandas?

Try with merge first:
out = df1.reset_index().merge(df2,how='right')['index']
Out[63]:
0 1
1 3
Name: index, dtype: int64
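For reference, here is a self-contained version of the whole flow (a minimal sketch; the data is the question's, the .tolist() call is my own addition):
import pandas as pd

df1 = pd.DataFrame({'A': [0, 0, 1, 1], 'B': [0, 1, 0, 1]})
df2 = pd.DataFrame({'A': [0, 1], 'B': [1, 1]})

# Expose df1's index as a regular column, then right-merge on the
# shared columns A and B so only rows matching df2 survive.
out = df1.reset_index().merge(df2, how='right')['index']
print(out.tolist())  # [1, 3]
Note that a df2 row with no match in df1 would surface as NaN in out (making the dtype float), and duplicate matches in df1 would produce more than one index per df2 row.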

new_df = df1[df2['pin'].isin(df1['vpin'])] UserWarning: Boolean Series key will be reindexed to match DataFrame index

I'm getting the following warning while executing this line
new_df = df1[df2['pin'].isin(df1['vpin'])]
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df1 and df2 have only one column in common, and they do not have the same number of rows.
I want to filter df1 based on the column in df2: if df2.pin is in df1.vpin, I want those rows.
There are multiple rows in df1 for the same df2.pin and I want to retrieve them all.
df2:
pin    count
1      10
2      20

df1:
vpin   Column B
1      Cell 2
1      Cell 4
The command works; I'm just trying to get rid of the warning.
It doesn't really make sense to use df2['pin'].isin(df1['vpin']) as a boolean mask to index df1, as this mask carries the indices of df2, hence the reindexing performed by pandas.
Use instead:
new_df = df1[df1['vpin'].isin(df2['pin'])]
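For completeness, a minimal sketch reproducing the fix (toy data invented to match the tables above):
import pandas as pd

df1 = pd.DataFrame({'vpin': [1, 1, 2], 'Column B': ['Cell 2', 'Cell 4', 'Cell 6']})
df2 = pd.DataFrame({'pin': [1, 3], 'count': [10, 30]})

# The mask is built from df1's own column, so it is aligned with
# df1's index and no reindexing warning is raised.
new_df = df1[df1['vpin'].isin(df2['pin'])]
print(new_df)  # keeps both rows with vpin == 1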

Performance issue pandas 6 mil rows

I need some help.
I am trying to combine two dataframes. The first has 58k rows, the other 100. I want to combine them so that each of the 58k rows is paired with all 100 rows from the other df, i.e. 5.8 million rows in total.
Performance is very poor: it takes an hour to do 10%. Any suggestions for improvement?
Here is the code snippet:
def myfunc(vendors3, cust_loc):
    cust_loc_vend = pd.DataFrame()
    for i, row in cust_loc.iterrows():
        clear_output(wait=True)
        # turn the single row back into a one-row frame
        a = row.to_frame().T
        df = pd.concat([vendors3, a], axis=1, ignore_index=False)
        #cust_loc_vend = pd.concat([cust_loc_vend, df], axis=1, ignore_index=False)
        cust_loc_vend = cust_loc_vend.append(df)
        print('Current progress:', np.round(i / len(cust_loc) * 100, 2), '%')
    return cust_loc_vend
For example, if the first DF has 5 rows (sample with 2 columns) and the second has 100 rows, I want a merged DF such that each row in DF2 is paired with all rows from DF1.
Well, all you are looking for is a join. But since there is no common column, you can create a column with the same constant value in both dataframes and then drop it eventually:
df['common'] = 1
df1['common'] = 1
df2 = pd.merge(df, df1, on=['common'], how='outer')
df2 = df2.drop('common', axis=1)
where df and df1 are the dataframes.
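On pandas 1.2 or newer the helper column is unnecessary, since merge supports an explicit cross join. A minimal sketch with invented toy frames standing in for the 58k-row and 100-row frames:
import pandas as pd

big = pd.DataFrame({'a': range(3)})    # stands in for the 58k-row frame
small = pd.DataFrame({'b': range(2)})  # stands in for the 100-row frame

# Cartesian product in one vectorized call; no row-by-row append loop.
out = big.merge(small, how='cross')
print(len(out))  # 6 == 3 * 2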

How to iterate through rows of a DataFrame and add those rows to a blank DataFrame?

I have two populated DataFrames, df1 and df2. I also have an empty DataFrame (test):
df1 = pd.read_excel(xlpath1, sheetname='Sheet1')
df2 = pd.read_excel(xlpath2, sheetname='Sheet1')
test = pd.DataFrame()
I'd like to iterate through the rows of df1 and add those rows to the empty test DataFrame. When I try the following, I don't get any error, but nothing is added to the test DataFrame:
for i, j in df1.iterrows():
    test.append(j)
Any ideas? Do I need to add the proper columns to the test DataFrame first? My total end-goal is to iterate through multiple DataFrames and add only unique items to the empty DataFrame (ex, adding items that appear in one of the many DataFrames).
If you are trying to append dataframe df1 to the empty dataframe test, you can use pandas' concat function. Note that append and concat return a new DataFrame rather than modifying test in place, which is why your loop appeared to do nothing:
test = pd.concat([df1, test], axis=0)
axis=0 appends the two dataframes row-wise.
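For the stated end goal (collecting rows from multiple DataFrames while keeping each distinct row only once), no row loop is needed; a minimal sketch with invented toy frames:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [2, 3], 'B': ['y', 'z']})

# Stack all frames, then drop exact duplicate rows in one pass.
test = pd.concat([df1, df2], ignore_index=True).drop_duplicates()
print(test)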

Remove rows from multiple dataframe that contain bad data

Say I have n dataframes, df1, df2...dfn.
Finding rows that contain "bad" values in a given dataframe is done with, e.g.,
index1 = df1[df1.isin([np.nan, np.inf, -np.inf])]
index2 = df2[df2.isin([np.nan, np.inf, -np.inf])]
Now, dropping these bad rows from the offending dataframe is done with:
df1 = df1.replace([np.inf, -np.inf], np.nan).dropna()
df2 = df2.replace([np.inf, -np.inf], np.nan).dropna()
The problem is that any function that expects the two (n) dataframes' columns to be of the same length may give an error if there is bad data in one df but not the other.
How do I drop not just the bad row from the offending dataframe, but the same row from a list of dataframes?
So in the two dataframe case, if in df1 date index 2009-10-09 contains a "bad" value, that same row in df2 will be dropped.
[Possible "ugly"? solution?]
I suspect that one way to do it is to merge the two (n) dataframes on date, then apply the cleanup function to drop "bad" values are automatic since the entire row gets dropped? But what happens if a date is missing from one dataframe and not the other? [and they still happen to be the same length?]
First, do your replace:
df1 = df1.replace([np.inf, -np.inf], np.nan)
df2 = df2.replace([np.inf, -np.inf], np.nan)
Then, concatenate the two frames with join='inner' and drop the bad rows:
newdf=pd.concat([df1,df2],axis=1,keys=[1,2], join='inner').dropna()
And split it back into two dfs, using combine_first with dropna of the original dfs:
df1, df2 = [s[1].loc[:, s[0]].combine_first(x.dropna()) for x, s in zip([df1, df2], newdf.groupby(level=0, axis=1))]
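An alternative sketch (my own variant, not part of the answer above): when all frames share the same index, build one boolean mask of rows that are clean in every frame and apply it to each:
import numpy as np
import pandas as pd

idx = pd.date_range('2009-10-05', periods=4, freq='D')
df1 = pd.DataFrame({'a': [1.0, np.inf, 3.0, 4.0]}, index=idx)
df2 = pd.DataFrame({'b': [1.0, 2.0, np.nan, 4.0]}, index=idx)

frames = [df.replace([np.inf, -np.inf], np.nan) for df in (df1, df2)]
# A row survives only if it is NaN-free in every frame.
good = np.logical_and.reduce([f.notna().all(axis=1) for f in frames])
df1, df2 = (f[good] for f in frames)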

dataframe multiply some columns with a series

I have a dataframe df1 where the index is a DatetimeIndex and there are 5 columns, col1, col2, col3, col4, col5.
I have another df2 which has an almost identical DatetimeIndex (some days of df1 may be missing from df2), and a single 'Value' column.
I would like to multiply df1 in place by the Value from df2 where the dates match, but only for columns col1...col4, not col5.
I can see it is possible to multiply col1*Value, then col2*Value and so on, and build a new dataframe to replace df1.
Is there a more efficient way?
You can achieve this by reindexing the second dataframe so the two share the same index, and then using the DataFrame method mul:
Create two data frames with datetime indices, the second one using only business days to make sure we have gaps between the two. Set the dates as indices.
import pandas as pd
# first frame
rng1 = pd.date_range('1/1/2011', periods=90, freq='D')
df1 = pd.DataFrame({'value':range(1,91),'date':rng1})
df1.set_index('date', inplace =True)
# second frame with a business day date index
rng2 = pd.date_range('1/1/2011', periods=90, freq='B')
df2 = pd.DataFrame({'date':rng2})
df2['value_to_multiply_by'] = range(1, 91)
df2.set_index('date', inplace =True)
Reindex the second frame with the index from the first. df2 will now have rows for non-business days, filled with the last previous valid observation.
# reindex the second dataframe to match the first
df2 = df2.reindex(index=df1.index, method='ffill')
Multiply df1 by df2['value_to_multiply_by']:
# multiply, filling NaNs with 1 to avoid propagating NaNs
# NaNs can still exist if there is no valid previous observation, such as at the beginning of the dataframe
df1.mul(df2['value_to_multiply_by'].fillna(1), axis=0)
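Since the question asks to multiply only col1...col4 and leave col5 alone, one sketch is to assign the product back to just those columns (column names taken from the question, toy data invented):
import pandas as pd

idx = pd.date_range('2011-01-01', periods=3, freq='D')
df1 = pd.DataFrame(1.0, index=idx, columns=['col1', 'col2', 'col3', 'col4', 'col5'])
df2 = pd.DataFrame({'Value': [2.0, 3.0, 4.0]}, index=idx)

cols = ['col1', 'col2', 'col3', 'col4']
# Align df2's Value on df1's dates, fill gaps with 1, and multiply
# only the selected columns; col5 is left untouched.
df1[cols] = df1[cols].mul(df2['Value'].reindex(df1.index).fillna(1), axis=0)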