why does python not drop all duplicates? - pandas

This is my original data frame. I want to remove the duplicates for the columns 'head_x' and 'head_y' and for the columns 'cost_x' and 'cost_y'.
This is my code:
df=df.astype(str)
df.drop_duplicates(subset={'head_x','head_y'}, keep=False, inplace=True)
df.drop_duplicates(subset={'cost_x','cost_y'}, keep=False, inplace=True)
print(df)
This is the output dataframe; as you can see, the first row is a duplicate on both subsets. So why is this row still there?
I do not just want to remove the first row but all duplicates. This is another output where there is also a duplicate for Index/Node 6.

Take a look at the first 2 rows:
head_x cost_x head_y cost_y
Node
1 2 6 2 3
1 2 6 3 4
Start with head_x and head_y:
the values from the first row are 2 and 2,
the values from the second row are 2 and 3,
so these two pairs are different.
Then look at cost_x and cost_y:
the values from the first row are 6 and 3,
the values from the second row are 6 and 4,
so these two pairs are also different.
Conclusion: These 2 rows are not duplicates, taking into account both column
subsets.
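A minimal sketch rebuilding the two rows above (values taken from the question) confirms that keep=False leaves both rows, since neither subset-pair repeats:

```python
import pandas as pd

# Rebuild the two example rows from the question (index = Node).
df = pd.DataFrame(
    {"head_x": [2, 2], "cost_x": [6, 6], "head_y": [2, 3], "cost_y": [3, 4]},
    index=pd.Index([1, 1], name="Node"),
)
df = df.astype(str)  # mirrors the cast in the question

# keep=False drops only rows whose subset-pair occurs more than once;
# (2, 2) vs (2, 3) and (6, 3) vs (6, 4) are all distinct pairs, so nothing goes.
df = df.drop_duplicates(subset=["head_x", "head_y"], keep=False)
df = df.drop_duplicates(subset=["cost_x", "cost_y"], keep=False)

print(len(df))  # → 2: both rows survive
```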

df = df.astype(str)
df = df.drop_duplicates(subset=['head_x','head_y'], keep=False)
df = df.drop_duplicates(subset=['cost_x','cost_y'], keep=False)
Note: do not combine assignment with inplace=True as in the question's code; drop_duplicates returns None when inplace=True, so df = df.drop_duplicates(..., inplace=True) would set df to None. Either assign the result (as above) or use inplace=True without assigning.
I assume that cost_x should be swapped with head_y; otherwise there are no duplicates.


Joining two data frames on column name and comparing result side by side

I have two data frames which look like df1 and df2 below and I want to create df3 as shown.
I could do this using a left join to get all the rows into one dataframe, and then used numpy.where to see whether they match or not.
I could get what I want, but I feel there should be a more elegant way that eliminates renaming columns, reshuffling columns in the dataframe, and then using np.where.
Is there a better way to do this?
Code to reproduce the dataframes:
import pandas as pd
df1 = pd.DataFrame({'product': ['apples', 'bananas', 'oranges', 'pineapples'],
                    'price': [1, 2, 3, 7], 'quantity': [5, 7, 11, 4]})
df2 = pd.DataFrame({'product': ['apples', 'bananas', 'oranges'],
                    'price': [2, 2, 4], 'quantity': [5, 7, 13]})
df3 = pd.DataFrame({'product': ['apples', 'bananas', 'oranges'],
                    'price_df1': [1, 2, 3], 'price_df2': [2, 2, 4],
                    'price_match': ['No', 'Yes', 'No'],
                    'quantity_df1': [5, 7, 11], 'quantity_df2': [5, 7, 13],
                    'quantity_match': ['Yes', 'Yes', 'No']})
An elegant way to do your task is to:
generate "partial" DataFrames from each source column,
and then concatenate them.
The first step is to define a function that joins 2 source columns and appends a "match" column (it uses np.where, so import numpy as np is needed):
def myJoin(s1, s2):
    rv = s1.to_frame().join(s2.to_frame(), how='inner',
                            lsuffix='_df1', rsuffix='_df2')
    rv[s1.name + '_match'] = np.where(rv.iloc[:, 0] == rv.iloc[:, 1], 'Yes', 'No')
    return rv
Then, from df1 and df2, generate 2 auxiliary DataFrames setting product as the index:
wrk1 = df1.set_index('product')
wrk2 = df2.set_index('product')
And the final step is:
result = pd.concat([ myJoin(wrk1[col], wrk2[col]) for col in wrk1.columns ], axis=1)\
.reset_index()
Details:
for col in wrk1.columns - generates names of columns to join.
myJoin(wrk1[col], wrk2[col]) - generates the partial result for this column from
both source DataFrames.
[…] - a list comprehension, collecting the above partial results in a list.
pd.concat(…) - concatenates these partial results into the final result.
reset_index() - converts the index (product names) into a regular column.
For your source data, the result is:
product price_df1 price_df2 price_match quantity_df1 quantity_df2 quantity_match
0 apples 1 2 No 5 5 Yes
1 bananas 2 2 Yes 7 7 Yes
2 oranges 3 4 No 11 13 No
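The steps above, consolidated into one self-contained sketch (imports included; data taken from the question):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'product': ['apples', 'bananas', 'oranges', 'pineapples'],
                    'price': [1, 2, 3, 7], 'quantity': [5, 7, 11, 4]})
df2 = pd.DataFrame({'product': ['apples', 'bananas', 'oranges'],
                    'price': [2, 2, 4], 'quantity': [5, 7, 13]})

def myJoin(s1, s2):
    # Join the two same-named columns side by side and append a Yes/No match column.
    rv = s1.to_frame().join(s2.to_frame(), how='inner',
                            lsuffix='_df1', rsuffix='_df2')
    rv[s1.name + '_match'] = np.where(rv.iloc[:, 0] == rv.iloc[:, 1], 'Yes', 'No')
    return rv

# Index both frames by product, build one partial result per column, concatenate.
wrk1 = df1.set_index('product')
wrk2 = df2.set_index('product')
result = pd.concat([myJoin(wrk1[col], wrk2[col]) for col in wrk1.columns],
                   axis=1).reset_index()
print(result)
```

The inner join drops 'pineapples', which exists only in df1.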

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but does not return multiple values of column names in case there is more than one instance of minimum values. How can I change this to return multiple column names in case there is more than one instance of the minimum value?
Please note, I want row wise results, i.e. minimum column names for each row.
Thanks!
Try the code below and see if it's in the output format you anticipated; it produces the intended result at least.
The result will be stored in mins.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)
Get column name where value is something in pandas dataframe might be helpful also.
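A vectorized alternative to the iterrows loop is to compare each row against its own minimum inside a single apply; a sketch on a small sample dataframe:

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 0, 6, 5, 4],
                   'B': [8, 0, 9, 2, 7],
                   'C': [9, 1, 2, 4, 7],
                   'D': [5, 7, 4, 2, 9]})

# For each row, keep the labels of every column equal to that row's minimum.
mins = df.apply(lambda r: list(r.index[r == r.min()]), axis=1)
print(mins[0])  # → ['A', 'D']: row 0 has its minimum 5 in both A and D
```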
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying numpy array to get the min value, then compare the values to the min and get the columns that have a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']

Delete all rows with an empty cell anywhere in the table at once in pandas

I have googled it and found lots of questions on Stack Overflow. So suppose I have a dataframe like this:
A B
-----
1
2
4 4
The first 3 rows should be deleted. And suppose I have not 2 but 200 columns. How can I do that?
As per your request, first replace empty strings with NaN, then drop those rows (this needs import numpy as np):
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna()
If you want to drop rows based only on specific columns, pass the column names to dropna, e.g. df = df.dropna(subset=['A']).
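A self-contained sketch of the two steps above (the regex also catches whitespace-only cells):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '', '4'],
                   'B': ['', ' ', '', '4']})

# Turn empty or whitespace-only strings into NaN, then drop any row containing NaN.
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna()

print(df)  # only the last row ('4', '4') survives
```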

Copy a column value from another dataframe based on a condition

Let us say I have two dataframes: df1 and df2. Assume the following initial values.
df1=pd.DataFrame({'ID':['ASX-112','YTR-789','ASX-124','UYT-908','TYE=456','ERW-234','UUI-675','GHV-805','NMB-653','WSX-123'],
'Costperlb':[4515,5856,3313,9909,8980,9088,6765,3456,9012,1237]})
df2=df1[df1['Costperlb']>4560]
As you can see, df2 is a proper subset of df1 (it was created from df1 by imposing a condition on selection of rows).
I added a column to df2, which contains certain values based on a calculation. Let us call this df2['grade'].
df2['grade']=[1,4,3,5,1,1]
df1 and df2 contain one column named 'ID' which is guaranteed to be unique in each dataframe.
I want to:
Create a new column in df1 and initialize it to 0. Easy. df1['grade']=0.
Copy df2['grade'] values over to df1['grade'], ensuring that df1['ID']=df2['ID'] for each such copy.
The result should be the grade values for the corresponding IDs copied over.
Step 2 is what is perplexing me a bit. A naive df1['grade']=df2['grade'].values obviously does not work, as the lengths of the two dataframes are different.
Now, if I think hard enough, I could possibly come up with a monstrosity like:
df1['grade'].loc[(df1['ID'].isin(df2)) & ...] but I am uncomfortable with doing that.
I am a newbie with Python, and furthermore, the indices of df1 are used elsewhere after this assignment, so I do not want to drop or reset indices, as some of the solutions in the search results I found suggest.
I just want to find the rows in df1 whose 'ID' value matches the 'ID' value in df2, and then copy over the 'grade' value in each such row. How do I do this?
Your code:
df1=pd.DataFrame({'ID':['ASX-112','YTR-789','ASX-124','UYT-908','TYE=456','ERW-234','UUI-675','GHV-805','NMB-653','WSX-123'],
'Costperlb':[4515,5856,3313,9909,8980,9088,6765,3456,9012,1237]})
df2=df1[df1['Costperlb']>4560]
df2['grade']=[1,4,3,5,1,1]
You can use merge with "left". In this way the indexing of df1 is preserved:
new_df = df1.merge(df2[["ID","grade"]], on="ID", how="left")
new_df["grade"] = new_df["grade"].fillna(0)
new_df
Output:
ID Costperlb grade
0 ASX-112 4515 0.0
1 YTR-789 5856 1.0
2 ASX-124 3313 0.0
3 UYT-908 9909 4.0
4 TYE=456 8980 3.0
5 ERW-234 9088 5.0
6 UUI-675 6765 1.0
7 GHV-805 3456 0.0
8 NMB-653 9012 1.0
9 WSX-123 1237 0.0
Here I called the merged dataframe new_df, but you can simply change it to df1.
EDIT
If instead of 0 you want to replace the NaN with a string, try this:
new_df = df1.merge(df2[["ID","grade"]], on="ID", how="left")
new_df["grade"] = new_df["grade"].fillna("No transaction possible")
new_df
Output:
ID Costperlb grade
0 ASX-112 4515 No transaction possible
1 YTR-789 5856 1
2 ASX-124 3313 No transaction possible
3 UYT-908 9909 4
4 TYE=456 8980 3
5 ERW-234 9088 5
6 UUI-675 6765 1
7 GHV-805 3456 No transaction possible
8 NMB-653 9012 1
9 WSX-123 1237 No transaction possible
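Since the asker wants to keep df1's existing index untouched, a merge-free alternative is to map grades by ID; a sketch using the question's data, where fillna(0) supplies the default grade:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['ASX-112', 'YTR-789', 'ASX-124', 'UYT-908', 'TYE=456',
                           'ERW-234', 'UUI-675', 'GHV-805', 'NMB-653', 'WSX-123'],
                    'Costperlb': [4515, 5856, 3313, 9909, 8980,
                                  9088, 6765, 3456, 9012, 1237]})
df2 = df1[df1['Costperlb'] > 4560].copy()  # .copy() avoids SettingWithCopyWarning
df2['grade'] = [1, 4, 3, 5, 1, 1]

# Look up each ID's grade in df2; IDs absent from df2 become NaN, then 0.
df1['grade'] = df1['ID'].map(df2.set_index('ID')['grade']).fillna(0).astype(int)

print(df1['grade'].tolist())  # → [0, 1, 0, 4, 3, 5, 1, 0, 1, 0]
```

This modifies df1 in place without touching its index or creating a new dataframe.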

Adding lists stored in dataframe

I have two dataframes as:
df1.ix[1:3]
DateTime
2018-01-02 [-0.0031537018416199097, 0.006451397621428631,...
2018-01-03 [-0.0028882814454597745, -0.005829869983964528...
df2.ix[1:3]
DateTime
2018-01-02 [-0.03285881500135208, -0.027806145786217932, ...
2018-01-03 [-0.0001314381449719178, -0.006278235444742629...
len(df1.ix['2018-01-02'][0])
500
len(df2.ix['2018-01-02'][0])
500
When I do df1 + df2 I get:
len((df1 + df2).ix['2018-01-02'][0])
1000
So instead of being summed element-wise, the lists are being concatenated.
How do I add the lists in df1 and df2 element-wise?
When an operation is applied between two dataframes, it is broadcast at the element level. An element in your case is a list, and when the '+' operator is applied between two lists, it concatenates them. That's why the resulting dataframe contains concatenated lists.
There are multiple approaches to actually summing up the elements of the lists instead of concatenating them.
One approach is to convert the list elements into columns, add the dataframes, and then merge the columns back into a single list (this has been suggested in the first answer, but in a wrong way).
Step 1: Converting list elements to columns
df1=df1.apply(lambda row:pd.Series(row[0]), axis=1)
df2=df2.apply(lambda row:pd.Series(row[0]), axis=1)
We need to pass row[0] instead of row to get rid of the column index associated with the Series.
Step 2: Add dataframes
df=df1+df2 #this dataframe will have 500 columns
Step 3: Merge columns back to lists
df=df.apply(lambda row:pd.Series({0:list(row)}),axis=1)
This is an interesting part. Why do we return a Series here? Why doesn't returning just list(row) work, retaining the 500 columns?
The reason is that if the length of the returned list equals the number of columns, the list is spread back across those columns and it looks as if nothing happened, whereas if the lengths differ, it is returned as a single list.
Let's look at an example.
Suppose I've a dataframe, having columns 0 ,1 and 2.
df=pd.DataFrame({0:[1,2,3],1:[4,5,6],2:[7,8,9]})
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
The number of columns in the original dataframe is 3. If I try to return a list with two elements, it works and a Series of lists is returned:
df1=df.apply(lambda row:[row[0],row[1]],axis=1)
0 [1, 4]
1 [2, 5]
2 [3, 6]
dtype: object
Instead, if we try to return a list of three numbers, it gets spread back into the columns:
df1=df.apply(list,axis=1)
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
So if we want to return a list of the same size as the number of columns, we have to return it wrapped in a Series whose single value is the list.
Another approach is to bring one dataframe's column into the other and then add the columns using apply:
df1[1]=df2[0]
df=df1.apply(lambda r: list(np.array(r[0])+np.array(r[1])),axis=1)
We can take advantage of numpy arrays here: the '+' operator on numpy arrays sums corresponding values and returns a single numpy array.
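The numpy route can be sketched end to end; 3-element lists stand in here for the question's 500-element ones:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2018-01-02', '2018-01-03'])
df1 = pd.DataFrame({0: [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]}, index=idx)
df2 = pd.DataFrame({0: [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]}, index=idx)

# Put both list columns side by side, then sum each pair as numpy arrays.
df1[1] = df2[0]
out = df1.apply(lambda r: list(np.array(r[0]) + np.array(r[1])), axis=1)
print(out.iloc[0])  # → [11.0, 22.0, 33.0]
```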
Cast them to series so that they become columns, then add your dfs:
df1 = df1.apply(pd.Series, axis=1)
df2 = df2.apply(pd.Series, axis=1)
df1 + df2