Pandas: Filter rows by comparing a column's value to another value for the same column in a different row

I have searched the heck out of this one, and I don't think I've found anything applicable. But I'm new to Pandas, so I may have missed something; if so, apologies.
Suppose I have a dataframe, df, with the following contents:
Column1 Column2
A Apple
B Apple
A Pear
A Orange
B Orange
A Pear
I want to filter the dataframe to show ONLY rows where:
- Column2's value matches at least one other row's Column2 value, and
- among those matching rows, at least one of the Column1 values differs.
Expected results of the above df:
Column1 Column2
A Apple
B Apple
A Orange
B Orange
I have tried using the .loc indexer for this, but I cannot find an appropriate filter or set of filters. (I also tried a 'for i in df' loop, but that just raised an error.)
I would usually post some sample code in these situations, but I don't think any of my approaches so far have made much sense.
Any help would be much appreciated; thanks.

Use GroupBy.transform with nunique:
df_filtered = df[df.groupby('Column2')['Column1'].transform('nunique').gt(1)]
print(df_filtered)
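To see why this works, it can help to inspect the intermediate Series that transform produces (a minimal sketch using the sample data above):
import pandas as pd

df = pd.DataFrame({'Column1': ['A', 'B', 'A', 'A', 'B', 'A'],
                   'Column2': ['Apple', 'Apple', 'Pear', 'Orange', 'Orange', 'Pear']})

# transform('nunique') broadcasts each group's count of distinct
# Column1 values back onto every row of that group.
counts = df.groupby('Column2')['Column1'].transform('nunique')
print(counts.tolist())  # [2, 2, 1, 2, 2, 1] -> the Pear rows only ever have 'A'

df_filtered = df[counts.gt(1)]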
We could also use pd.crosstab:
df[df['Column2'].map(pd.crosstab(df['Column1'],df['Column2']).gt(0).sum().gt(1))]
#df[df['Column2'].map(pd.crosstab(df['Column1'],df['Column2']).where(lambda x: x>0).count().gt(1))]
We could also use groupby.filter; in general this is slower:
df.groupby('Column2').filter(lambda x: x.Column1.nunique()>1)
Output
Column1 Column2
0 A Apple
1 B Apple
3 A Orange
4 B Orange
The best solution is the first one, using groupby.transform.
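If you want to check the relative speed on data shaped like yours, here is a rough benchmark sketch (the sizes and random data are made up for illustration):
import timeit

setup = '''
import numpy as np
import pandas as pd
df = pd.DataFrame({'Column1': np.random.choice(list('ABCD'), 100_000),
                   'Column2': np.random.choice(['Apple', 'Pear', 'Orange'], 100_000)})
'''

stmts = {
    'transform': "df[df.groupby('Column2')['Column1'].transform('nunique').gt(1)]",
    'filter': "df.groupby('Column2').filter(lambda x: x.Column1.nunique() > 1)",
}
for name, stmt in stmts.items():
    secs = timeit.timeit(stmt, setup=setup, number=10) / 10
    print('{}: {:.4f}s per run'.format(name, secs))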

You can use a groupby and filter:
(
df.groupby('Column2')
.filter(lambda x: len(x.drop_duplicates(subset='Column1'))>1)
)
Column1 Column2
0 A Apple
1 B Apple
3 A Orange
4 B Orange

Related

I want to combine multiple rows that have a missing Column_id into one row, using pandas

Column_id  statements
a1         My dog
NaN        ate
NaN        a bone
a2         an apple,
NaN        banana &
NaN        orange
The original dataframe is called df. I would like the statement rows where Column_id is missing to be merged into the row above them. The resulting dataframe would look like this:
Column_id  statements
a1         My dog ate a bone
a2         an apple, banana & orange
I figured out a way to do it and am posting the solution to help anyone stuck on the same problem:
agg_funcs = {'Column_id': 'first',
             'statements': lambda x: ' '.join(x.dropna())}
df_new = df.groupby(df.Column_id.notnull().cumsum().rename(None)).agg(agg_funcs)
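The trick is the grouping key; here is a quick sketch of what notnull().cumsum() produces on the sample data (assuming the missing cells are real NaN values):
import pandas as pd
import numpy as np

df = pd.DataFrame({'Column_id': ['a1', np.nan, np.nan, 'a2', np.nan, np.nan],
                   'statements': ['My dog', 'ate', 'a bone',
                                  'an apple,', 'banana &', 'orange']})

# Every non-missing Column_id starts a new group.
print(df.Column_id.notnull().cumsum().tolist())  # [1, 1, 1, 2, 2, 2]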

How to drop a dataset from a 2d DataFrame

I would like to delete a row/column from a 2d DataFrame.
Let's assume the DataFrame looks like this:
animal cat dog hedgehog
time
0 1 1 0
1 2 0 1
How do I get rid of, say, the whole dog column, to get something like this:
animal cat hedgehog
time
0 1 0
1 2 1
I tried e.g. df.drop() with a lot of variations but haven't fully understood pandas yet.
df.drop('dog', axis=1)
will drop it. You need to pass the axis.
If you want the drop operation to affect the current df, use the inplace keyword:
df.drop('dog', axis=1, inplace=True)
If you want to drop more than one column, pass a list:
df.drop(['dog', 'cat'], axis=1, inplace=True)
You can also remove the column using the columns keyword, like this:
df.drop(columns='dog', inplace=True)
and you can remove several columns at once, like this:
df.drop(columns=['dog', 'cat'], inplace=True)
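For completeness, since the question asks about dropping a row as well as a column: the same method works on the index (a minimal sketch, assuming you want to drop the row labelled 0 on the time index):
df.drop(0)        # rows are axis=0, the default
df.drop(index=0)  # equivalent, more explicit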

Convert transactions with several products from columns to row [duplicate]

I'm having a very tough time trying to figure out how to do this with python. I have the following table:
NAMES VALUE
john_1 1
john_2 2
john_3 3
bro_1 4
bro_2 5
bro_3 6
guy_1 7
guy_2 8
guy_3 9
And I would like to go to:
NAMES VALUE1 VALUE2 VALUE3
john 1 2 3
bro 4 5 6
guy 7 8 9
I have tried with pandas: I first split the index (NAMES) and can create the new columns, but I have trouble getting the values into the right column.
Can someone at least point me in the direction of the solution to this problem? I don't expect full code (I know that is not appreciated), but any help is welcome.
After splitting the NAMES column, use .pivot to reshape your DataFrame.
# Split Names and Pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
If you want to be slick, you can do the split in a single line:
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
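Putting it all together on the sample data, a quick end-to-end sketch:
import pandas as pd

df = pd.DataFrame({'NAMES': ['john_1', 'john_2', 'john_3', 'bro_1', 'bro_2',
                             'bro_3', 'guy_1', 'guy_2', 'guy_3'],
                   'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# Split NAMES into the base name and the number suffix, then pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
print(df)
#   NAMES  VALUE1  VALUE2  VALUE3
# 0   bro       4       5       6
# 1   guy       7       8       9
# 2  john       1       2       3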

Replacing partial text in cells in a dataframe

This is an extension of a question asked and solved earlier (Replace specific values inside a cell without chaging other values in a dataframe)
I have a dataframe where different numeric codes are used in place of text strings, and now I would like to replace those codes with text values. In the referenced question (linked above), the regex method worked before, but now it no longer does, and I am wondering whether the .replace method has changed.
Example of my dataframe:
col1
0 1,2,3
1 1,2
2 2-3
3 2, 3
The code I wrote uses a dictionary of the values that need to be changed, with regex set to True:
d = {'1':'a', '2':'b', '3':'c'}
df['col2'] = df['col1'].replace(d, regex=True)
The result I got is:
col1 col2
0 1,2,3 a,2,3
1 1,2 a,2
2 2-3 b-3
3 2, 3 b, 3
Whereas, I was expecting:
col1 col2
0 1,2,3 a,b,c
1 1,2 a,b
2 2-3 b-c
3 2, 3 b, c
Or alternatively:
col1
0 a,b,c
1 a,b
2 b-c
3 b, c
Have there been any changes to the .replace method in the last year, or am I doing something wrong here? The same code worked earlier but no longer does.
OK, after some experimenting, I found that I need a separate regex replacement statement for each code (number) in my cells:
df.replace({'col1': r'1'}, {'col1': 'a'}, regex=True, inplace=True)
df.replace({'col1': r'2'}, {'col1': 'b'}, regex=True, inplace=True)
df.replace({'col1': r'3'}, {'col1': 'c'}, regex=True, inplace=True)
Which results in:
col1
0 a,b,c
1 a,b
2 b-c
3 b, c
This is just a workaround, as it overwrites the existing column, but it works in my case since my main objective was to replace the codes with values.
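If you'd rather keep it to a single statement, one alternative (a sketch, not the original approach) is Series.str.replace with a callable replacement, which looks up each matched code in the dictionary:
import pandas as pd

df = pd.DataFrame({'col1': ['1,2,3', '1,2', '2-3', '2, 3']})
d = {'1': 'a', '2': 'b', '3': 'c'}

# Replace every digit that appears as a key in d with its mapped letter.
df['col2'] = df['col1'].str.replace(r'[123]', lambda m: d[m.group()], regex=True)
print(df)
#     col1   col2
# 0  1,2,3  a,b,c
# 1    1,2    a,b
# 2    2-3    b-c
# 3   2, 3   b, c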

Pandas merge: how to return the column on which you have done the merge?

I have a dataframe with some categorical data. I want to create a new column which shows only some of those values, and converts the others to 'other'. E.g. to show only the top 10 cities, or, in the example below, show only two colours and convert the others to 'other'.
I want to do it via a pandas.merge, like a SQL outer join: on one hand my table, on the other side a table with only the values I want to keep (ie not convert to 'others').
The problem is, and it took me a bit of debugging and swearing to find that out, that pandas.merge does not return both columns on which you have done the merge, even if one of the columns has nulls.
The solution I have found is to create another column with the same values - which I think would make anyone familiar with SQL cringe. Is there a more elegant way?
This is the code to show what I mean:
import pandas as pd
df=pd.DataFrame()
df['colour']=['yellow','yellow','green','red']
mycols=pd.DataFrame()
mycols['colour']=['yellow','red']
# after this merge, I have no way of knowing which colour in df has no match in mycols
newdf=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'))
# so I need to create another column in mycols
mycols['colour copied']=mycols['colour']
newdf2=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'))
newdf2['colour copied']=newdf2['colour copied'].fillna('other')
newdf2.rename(columns={'colour copied': 'colour - reclassified'})
You can add the parameter indicator=True to get back whether each row matched both, left_only or right_only:
newdf=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'), indicator=True)
print (newdf)
colour _merge
0 yellow both
1 yellow both
2 green left_only
3 red both
And then set values by condition: numpy.where keeps the colour where _merge is 'both' and uses 'other' elsewhere; DataFrame.pop extracts the _merge column and removes it from the frame:
import numpy as np

newdf['colour copied'] = np.where(newdf.pop('_merge') == 'both', newdf['colour'], 'other')
print (newdf)
colour colour copied
0 yellow yellow
1 yellow yellow
2 green other
3 red red
But if you are working with only one column, a simpler alternative is possible: use Series.isin to test membership:
df['colour copied'] = np.where(df['colour'].isin(mycols['colour']), df['colour'], 'other')
print (df)
colour colour copied
0 yellow yellow
1 yellow yellow
2 green other
3 red red