Say I have n dataframes, df1, df2...dfn.
Finding the rows that contain "bad" values in a given dataframe is done with, e.g.,
index1 = df1[df1.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
index2 = df2[df2.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
Now, dropping these bad rows in a given dataframe is done with:
df1 = df1.replace([np.inf, -np.inf], np.nan).dropna()
df2 = df2.replace([np.inf, -np.inf], np.nan).dropna()
The problem is that any function that expects the two (or n) dataframes' columns to be of the same length may fail if there is bad data in one dataframe but not the other.
How do I drop not just the bad row from the offending dataframe, but the same row from a list of dataframes?
So in the two-dataframe case, if date index 2009-10-09 in df1 contains a "bad" value, that same row should be dropped from df2 as well.
[A possibly "ugly" solution?]
I suspect one way to do it is to merge the two (or n) dataframes on date; then dropping the "bad" values is automatic, since the entire merged row gets dropped. But what happens if a date is missing from one dataframe and not the other [and they still happen to be the same length]?
First, do your replace:
df1 = df1.replace([np.inf, -np.inf], np.nan)
df2 = df2.replace([np.inf, -np.inf], np.nan)
Then concatenate the frames with join='inner', so only the shared dates survive, and drop the bad rows:
newdf = pd.concat([df1, df2], axis=1, keys=[1, 2], join='inner').dropna()
Then split it back into two dataframes; here we use combine_first with dropna on the original dataframes:
df1, df2 = [s[1].loc[:, s[0]].combine_first(x.dropna()) for x, s in zip([df1, df2], newdf.groupby(level=0, axis=1))]
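For n dataframes, a more general sketch (not from the answer above, just one hedged alternative) is to clean each frame and then keep only the dates that survive in every frame, via an index intersection:
import numpy as np

dfs = [df1, df2]  # extend this list to all n frames
cleaned = [d.replace([np.inf, -np.inf], np.nan).dropna() for d in dfs]
common = cleaned[0].index
for c in cleaned[1:]:
    common = common.intersection(c.index)  # dates that are good in every frame
df1, df2 = [c.loc[common] for c in cleaned]  # unpack to match the frame count
Because the intersection only keeps shared dates, a date missing from one frame is dropped everywhere, which also addresses the missing-date worry above.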
I'm getting the following warning while executing this line:
new_df = df1[df2['pin'].isin(df1['vpin'])]
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df1 and df2 share only one similar column, and they do not have the same number of rows.
I want to filter df1 based on the column in df2: if df2.pin is in df1.vpin, I want those rows.
There are multiple rows in df1 for the same df2.pin, and I want to retrieve them all.
df2:

pin    count
1      10
2      20

df1:

vpin    Column B
1       Cell 2
1       Cell 4
The command is working; I'm just trying to overcome the warning.
It doesn't really make sense to use df2['pin'].isin(df1['vpin']) as a boolean mask to index df1, as this mask carries the indices of df2, hence the reindexing performed by pandas.
Use instead:
new_df = df1[df1['vpin'].isin(df2['pin'])]
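For reference, a minimal reproduction built from the sample tables above (values assumed from the question):
import pandas as pd

df2 = pd.DataFrame({'pin': [1, 2], 'count': [10, 20]})
df1 = pd.DataFrame({'vpin': [1, 1], 'Column B': ['Cell 2', 'Cell 4']})

# the mask is now indexed like df1, so no reindexing warning is raised
new_df = df1[df1['vpin'].isin(df2['pin'])]
print(new_df)  # both rows are kept, since vpin == 1 appears in df2['pin']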
I have two dataframes.
df1 has an index made of strings like (row1, row2, ..., rown) and columns made of strings like (col1, col2, ..., colm), while df2 has k rows and 3 columns (char_1, char_2, value). char_1 contains strings like the df1 indexes, and char_2 contains strings like the df1 columns. I only want to assign the df2 value to df1 in the right position. For example, if the first row of df2 reads ['row3', 'col1', 'value2'], I want to assign value2 to df1 at position [2, 0] (third row, first column).
I tried to use two functions to slide over the rows and columns of df1:
from functools import reduce

def func1(val):
    # first convert the row (a Series) to a dataframe
    val = val.to_frame()
    val = val.reset_index()
    val = val.set_index('index')  # set the index so that it's the right column

    def func2(val2):
        try:  # maybe the combination doesn't exist
            idx1 = list(df2.index[df2['char_2'] == val2.name])             # val2.name reads the col name of df1
            idx2 = list(df2.index[df2['char_1'] == val2.index.values[0]])  # val2.index.values[0] reads the index name of df1
            idx = list(reduce(set.intersection, map(set, [idx1, idx2])))
            idx = int(idx[0])  # final index of df2 where I need to take the value to assign to df1
            check = 1
        except IndexError:
            check = 0
        if check == 1:  # if the index exists
            val2[0] = df2['value'][idx]  # assign the value to df1
        return val2

    val = val.apply(func2, axis=1)  # apply the function over columns
    val = val.squeeze()  # convert back to a Series
    return val

df1 = df1.apply(func1, axis=1)  # apply the function over rows
I did the conversion inside func1 because without that step I wasn't able to work with a Series while keeping index and column names, so I couldn't find the index idx in func2.
The problem is that it takes forever. df1's size is (3,600 x 20,000) and df2's is (500 x 3), so it's not that much data; I really don't understand the problem. I ran the code for the first row and column to check the result: it's fine and takes about a second, but the full run has now been going for hours and still isn't finished.
Is there a way to optimize it? As I wrote in the title, I only need to run a function that keeps column and index names and slides over the entire dataframe. Thanks in advance!
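A vectorized sketch that avoids the nested apply entirely, assuming the df2 columns are literally named 'char_1', 'char_2', and 'value', and that each (char_1, char_2) pair appears at most once in df2:
# pivot df2 into the same row/column layout as df1, then overwrite the
# matching cells in place; DataFrame.update aligns on index and column
# labels and only copies non-NaN values, so unmatched cells are untouched
mapping = df2.pivot(index='char_1', columns='char_2', values='value')
df1.update(mapping)
This replaces the per-cell lookups with a single alignment, which should cut the runtime from hours to well under a second at these sizes.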
I have multiple dataframes that I want to combine and only want to use the indexing system of the first dataframe. The problem is the indices I want to use are repeating and I want to keep it that way.
df = pd.concat([df1, df2, df3], axis=1, join='inner')
This gives me InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Just so it's clear, df1 has repeating indices (0-9, then repeating again multiple times), whereas df2 and df3 are single-column dataframes with non-repeating indices. The numbers of rows do match, though.
From what I understand, your index repeats itself on df1. That is what causes the given error, InvalidIndexError: Reindexing only valid with uniquely valued Index objects: since your index loops over the values 0-9, pandas can never identify which row to join with which row, because the indexes are repeated and therefore not unique. My approach would be to just use join, but hey, if you want to use concat for your own reasons, both ways are below.
The simplest is the join function:
df1.join([df2,df3])
But if you insist on using concat, I would do:
x = df1.index
df1 = df1.reset_index(drop=True)  # reset_index returns a new frame; assign it back
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.index = x
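A quick sanity check of that concat approach, with shapes assumed from the question:
import pandas as pd

df1 = pd.DataFrame({'a': range(20)}, index=list(range(10)) * 2)  # 0-9 repeated twice
df2 = pd.DataFrame({'b': range(20)})  # unique default index
df3 = pd.DataFrame({'c': range(20)})

x = df1.index
df1 = df1.reset_index(drop=True)
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.index = x  # restore the repeating index of the first frame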
I have two populated DataFrames, df1 and df2. I also have an empty Dataframe (test):
df1 = pd.read_excel(xlpath1, sheet_name='Sheet1')
df2 = pd.read_excel(xlpath2, sheet_name='Sheet1')
test = pd.DataFrame()
I'd like to iterate through the rows of df1 and add those rows to the empty test Dataframe. When I try the following, I don't get any sort of error, but nothing is added to the test DataFrame:
for i, j in df1.iterrows():
    test.append(j)
Any ideas? Do I need to add the proper columns to the test DataFrame first? My overall end goal is to iterate through multiple DataFrames and add only unique items to the empty DataFrame (e.g., adding items that appear in one of the many DataFrames).
If you are trying to append dataframe df1 to the empty dataframe test, you can use the concat function of pandas. Note that DataFrame.append returns a new dataframe rather than modifying test in place, which is why your loop leaves test empty.
test = pd.concat([df1, test], axis=0)
axis=0 appends the two dataframes row-wise.
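For the stated end goal of collecting only unique rows across several dataframes, one hedged sketch is to concat everything and drop the duplicates:
import pandas as pd

# stack the frames row-wise, then keep one copy of each distinct row;
# use drop_duplicates(keep=False) instead to keep only rows that
# appear exactly once across all frames
test = pd.concat([df1, df2], axis=0).drop_duplicates().reset_index(drop=True)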
I have a dataframe df1 where the index is a DatetimeIndex and there are 5 columns, col1, col2, col3, col4, col5.
I have another df2 which has an almost equal DatetimeIndex (some days of df1 may be missing from df2), and a single 'Value' column.
I would like to multiply df1 in place by the Value from df2 when the dates are the same, but not for all columns col1...col5, only col1...col4.
I can see it is possible to multiply col1*Value, then col2*Value, and so on, and build a new dataframe to replace df1.
Is there a more efficient way?
You can achieve this by reindexing the second dataframe so the two frames are aligned, and then using the DataFrame method mul:
Create two dataframes with datetime indexes. The second one uses only business days, to make sure there are gaps between the two. Set the dates as indexes.
import pandas as pd

# first frame
rng1 = pd.date_range('1/1/2011', periods=90, freq='D')
df1 = pd.DataFrame({'value': range(1, 91), 'date': rng1})
df1.set_index('date', inplace=True)

# second frame with a business-day date index
rng2 = pd.date_range('1/1/2011', periods=90, freq='B')
df2 = pd.DataFrame({'date': rng2})
df2['value_to_multiply_by'] = range(1, 91)
df2.set_index('date', inplace=True)
Reindex the second frame with the index from the first. df2 will now have rows for the non-business days, filled with the most recent valid observation (forward fill).
# reindex the second dataframe to match the first
df2 = df2.reindex(index=df1.index, method='ffill')
Multiply df1 by df2['value_to_multiply_by']:
# multiply, filling NaNs with 1 to avoid propagating them
# NaNs can still exist if there is no valid previous observation, such as at the beginning of the frame
df1.mul(df2['value_to_multiply_by'].fillna(1), axis=0)
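Since the question asked to scale only col1...col4 and leave col5 untouched, a sketch restricted to that subset (column names assumed from the question):
# multiply just the chosen columns, aligned on the date index
cols = ['col1', 'col2', 'col3', 'col4']
df1[cols] = df1[cols].mul(df2['value_to_multiply_by'].fillna(1), axis=0)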