Comparing and removing rows in pandas

I am trying to create a new object by comparing two lists. If a row matches, it should be removed from splitted_row_list; otherwise it should be appended to a new list containing only the differences between the two lists.
results = []
for row in splitted_row_list:
    print(row)
    for row1 in all_rows:
        if row1 == row:
            splitted_row_list.remove(row)
        else:
            results.append(row)
print(results)
However, this code just returns all the rows. Does anyone have a suggestion?
Sample data
all_rows[0]:'1390', '139080', '13980', '1380', '139080', '13080'
splitted_row_list[0]:'35335','53527','353529','242424','5222','444'

As I understand it, you want to compare the two lists by index and keep the differences, and you want to do it with pandas (because of the tag):
So here are two lists for example:
ls1=[0,10,20,30,40,50,60,70,80,90]
ls2=[0,15,20,35,40,55,60,75,80,95]
I make a pandas DataFrame from these lists and build a mask to filter out the matching values:
import pandas as pd

df = pd.DataFrame(data={'ls1': ls1, 'ls2': ls2})
mask = df['ls1'] != df['ls2']
I can then select the differing values for each list using the mask:
# list 1
df[mask]['ls1'].values
out: array([10, 30, 50, 70, 90])
and
# list 2
df[mask]['ls2'].values
out: array([15, 35, 55, 75, 95])
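As a side note on the loop in the question: calling splitted_row_list.remove(row) while iterating over that same list skips elements, which is one reason the code returns all the rows. A minimal sketch of the same idea without mutating the list, reusing the names from the question:
# keep only the rows of splitted_row_list that never appear in all_rows
results = [row for row in splitted_row_list if row not in all_rows]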

Related

How to apply function to each column and row of dataframe pandas

I have two dataframes.
df1 has an index made of strings like (row1, row2, .., rown) and columns made of strings like (col1, col2, .., colm), while df2 has k rows and 3 columns (char_1, char_2, value). char_1 contains strings like df1's indexes and char_2 contains strings like df1's columns. I want to assign each df2 value to df1 in the right position. For example, if the first row of df2 reads ['row3','col1','value2'], I want to assign value2 to df1 at position [2, 0] (third row, first column).
I tried to use two functions to slide over the rows and columns of df1:
from functools import reduce

def func1(val):
    # first I convert the series to a dataframe
    val = val.to_frame()
    val = val.reset_index()
    val = val.set_index('index')  # I set the index so that it's the right column

    def func2(val2):
        try:  # maybe the combination doesn't exist
            idx1 = list(df2.index[df2['char_2'] == val2.name])             # val2.name reads the column name of df1
            idx2 = list(df2.index[df2['char_1'] == val2.index.values[0]])  # val2.index.values[0] reads the index name of df1
            idx = list(reduce(set.intersection, map(set, [idx1, idx2])))
            idx = int(idx[0])  # final index of df2 where I take the value to assign to df1
            check = 1
        except:
            check = 0
        if check == 1:  # if the index exists
            val2[0] = df2['value'][idx]  # assign the value to df1
        return val2

    val = val.apply(func2, axis=1)  # apply the function over the columns
    val = val.squeeze()  # convert back to a series
    return val

df1 = df1.apply(func1, axis=1)  # apply the function over the rows
I made the conversion inside func1 because without that step I wasn't able to work with a Series while keeping index and column names, so I wasn't able to find the index idx in func2.
The problem is that it takes forever. df1's size is (3,600 x 20,000) and df2's is (500 x 3), so it's not that much data. I really don't understand the problem. I ran the code on the first row and column to check the result; it's fine and takes 1 second, but the full run has now been going for hours and still isn't finished.
Is there a way to optimize it? As I wrote in the title, I only need to run a function that keeps column and index names and works over the entire dataframe. Thanks in advance!
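A much faster approach is usually to drop the per-cell apply entirely and let pandas align on labels. A minimal sketch, assuming df2's columns are literally named char_1, char_2 and value: pivot df2 into a df1-shaped table and write it into df1 in one step:
# reshape df2 so each (char_1, char_2, value) row becomes one cell assignment
updates = df2.pivot(index='char_1', columns='char_2', values='value')
# overwrite only the matching cells of df1, aligned by index and column names
df1.update(updates)
DataFrame.update aligns on index and column labels, so only the positions listed in df2 are touched, and the ~500 assignments happen in one vectorized pass instead of 3,600 x 20,000 function calls.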

Pandas splitting a column with new line separator

I am extracting tables from a PDF using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True), and that splits it into two columns; however, I want the new column names to be A and B, not 0 and 1. I also need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew to the split function, it throws an AttributeError:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: 'DataFrame' object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd

# recreate the column above
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# first: make sure the column is of string type
# second: split the column on the separator '\n'
# third: pass expand=True so the split yields two new columns
test = df['A\nB'].astype('str').str.split('\n', expand=True)

# rename the new columns
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error from my side... I guess the issue is that df[colNew] is still a DataFrame, because colNew is an Index of labels rather than a single label.
But .str.split() only works on a Series. So, taking your code as an example, I would convert the DataFrame to a Series using iloc[:,0].
Then another line to split the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')
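To make this fully generic, here is a small sketch (assuming each merged header contains exactly one newline) that splits every such column and takes the new names from the header itself:
import pandas as pd

df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# split every column whose header contains a newline
for col in df.columns[df.columns.str.contains('\n')]:
    parts = df[col].str.split('\n', expand=True)
    parts.columns = col.split('\n')  # 'A\nB' -> ['A', 'B']
    df = df.drop(columns=col).join(parts)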

Pandas append function adds new columns

I want to append one row to my dataframe.
Here's the code
import pandas as pd

citiesDataFrame = pd.read_csv('cities.csv')
citiesDataFrame = citiesDataFrame.append({
    'LatD': 50,
    '"LatM"': 70,
    '"LatS"': 40,
    '"NS"': '"S"',
    '"LonD"': 200,
    '"LonM"': 15,
    '"LonS"': 40,
    '"EW"': "E",
    '"City"': '"Kentucky"',
    '"State"': "KY"}, ignore_index=True)
citiesDataFrame
But when I run it, append doesn't work properly. My dataframe has 10 columns and 128 rows; when I run the code, it appends 9 new columns and 1 row (here is the modified dataframe).
Notice it works for LatD. The reason is that your column names aren't identical to the existing names; it looks like a quoting issue. It's not clear why you have the double quotes inside the single quotes. Make the column names match and then the append will work.
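A minimal sketch of the fix, assuming cities.csv actually has plain, unquoted headers:
import pandas as pd

citiesDataFrame = pd.read_csv('cities.csv')
# the keys now exactly match the existing column names
citiesDataFrame = citiesDataFrame.append({
    'LatD': 50, 'LatM': 70, 'LatS': 40, 'NS': 'S',
    'LonD': 200, 'LonM': 15, 'LonS': 40, 'EW': 'E',
    'City': 'Kentucky', 'State': 'KY'}, ignore_index=True)
Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions the equivalent is pd.concat([citiesDataFrame, pd.DataFrame([new_row])], ignore_index=True).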

Advanced condition lookup in pandas(numpy)

given:
a list of elements 'ls' and a big dataframe 'df'; all the elements of 'ls' appear in 'df'.
ls = ['a0','a1','a2','b0','b2','c0',...,'c_k']
df = [['a0','b0','c0'],
['a0','b0','c1'],
['a0','b0','c2'],
...
['a_i','b_j','c_k']]
goal:
I want to collect the rows of 'df' that contain the most elements of 'ls'; for example, ['a0','b0','c0'] would be the best one. But a row may contain at most 2 of the elements.
tried:
I tried enumerating 2 or 3 elements of 'ls', but it was too expensive, and it can return nothing since some rows contain only 2 of the elements.
I tried to use a dictionary to count, but that didn't work either.
I've been puzzling over this problem all day; any help would be greatly appreciated.
I would go like this:
counts = df.apply(lambda x: x.isin(ls).sum(), axis=1)
row_id = counts.idxmax()
counts holds, for each row, how many entries of the list it contains, and idxmax gives the index of the row with the most matches. The desired row can then be obtained so:
df.loc[row_id]
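A vectorized sketch of the same idea, avoiding the per-row apply (the sample values here are made up for illustration):
import pandas as pd

ls = ['a0', 'b0', 'c0']
df = pd.DataFrame([['a0', 'b0', 'c0'],
                   ['a0', 'b0', 'c1'],
                   ['x0', 'y0', 'z0']])

counts = df.isin(ls).sum(axis=1)  # matches per row, computed in one pass
best = df.loc[counts.idxmax()]    # the row with the most elements of ls
df.isin(ls) builds the whole boolean frame in a single call, which scales much better on a big df than applying isin row by row.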

Numpy where perform multiple actions

I have two dataframe columns where I want to check if the elements of one are inside the other. I do this using the pandas isin method.
However, if an element is present in the second dataframe, I also want to subtract it from both:
import numpy as np

attivo['S'] = np.where(attivo['SKU'].isin(stampate['SKU-S']), attivo['S'] - 1, attivo['S'])
In this example, if an item in the SKU column of the attivo dataframe is present in the SKU-S column of the stampate dataframe, the S column decreases by one unit; however, I also want the same S column to decrease in the stampate dataframe.
How is it possible to achieve this?
EDIT with sample data:
df1 = pd.DataFrame({'SKU': 'productSKU', 'S': 5}, index=[0])
df2 = pd.DataFrame({'SKU-S': 'productSKU', 'S': 5}, index=[0])
Currently, I am achieving this:
df1['S'] = np.where(df1['SKU'].isin(df2['SKU-S']), df1['S'] - 1, df1['S'])
However, I would like both dataframes to be updated; in this case, both would display 4 in the S column.
IIUC:
s = df1['SKU'].isin(df2['SKU-S'])

# modify df1: subtract 1 where the SKU was found (True counts as 1)
df1['S'] -= s

# count, per SKU value, how many matched rows of df1 belong to df2
counts = df1['SKU'].where(s).value_counts()

# modify df2: subtract those counts from the matching SKU-S rows
df2['S'] -= df2['SKU-S'].map(counts).fillna(0)
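Applied to the sample data above, both frames end up with 4 in the S column:
import pandas as pd

df1 = pd.DataFrame({'SKU': 'productSKU', 'S': 5}, index=[0])
df2 = pd.DataFrame({'SKU-S': 'productSKU', 'S': 5}, index=[0])

s = df1['SKU'].isin(df2['SKU-S'])
df1['S'] -= s
counts = df1['SKU'].where(s).value_counts()
df2['S'] -= df2['SKU-S'].map(counts).fillna(0)

print(df1['S'].iloc[0], df2['S'].iloc[0])  # 4 4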