Pandas merge: how to return the column on which you have done the merge?

I have a dataframe with some categorical data. I want to create a new column which shows only some of those values, and converts the others to 'other'. E.g. to show only the top 10 cities, or, in the example below, show only two colours and convert the others to 'other'.
I want to do it via a pandas.merge, like a SQL outer join: on one side my table, on the other side a table with only the values I want to keep (i.e. not convert to 'other').
The problem is, and it took me a bit of debugging and swearing to find that out, that pandas.merge does not return both columns on which you have done the merge, even if one of the columns has nulls.
The solution I have found is to create another column with the same values - which I think would make anyone familiar with SQL cringe. Is there a more elegant way?
This is the code to show what I mean:
import pandas as pd
df=pd.DataFrame()
df['colour']=['yellow','yellow','green','red']
mycols=pd.DataFrame()
mycols['colour']=['yellow','red']
# after this merge, I have no way of knowing which colour in df has no match in mycols
newdf=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'))
# so I need to create another column in mycols
mycols['colour copied']=mycols['colour']
newdf2=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'))
newdf2['colour copied']=newdf2['colour copied'].fillna('other')
newdf2 = newdf2.rename(columns={'colour copied': 'colour - reclassified'})

You can add the parameter indicator=True to get a _merge column telling you whether each row matched both, left_only, or right_only:
newdf=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'), indicator=True)
print (newdf)
   colour     _merge
0  yellow       both
1  yellow       both
2   green  left_only
3     red       both
Then set the values by condition: where _merge is 'both', keep the value from colour, otherwise use 'other'. numpy.where does the conditional selection, and DataFrame.pop extracts the _merge column while removing it from the frame:
import numpy as np

newdf['colour copied'] = np.where(newdf.pop('_merge') == 'both', newdf['colour'], 'other')
print (newdf)
   colour colour copied
0  yellow        yellow
1  yellow        yellow
2   green         other
3     red           red
But if you are working with only one column, a simpler alternative is possible: test membership with Series.isin:
df['colour copied'] = np.where(df['colour'].isin(mycols['colour']), df['colour'], 'other')
print (df)
   colour colour copied
0  yellow        yellow
1  yellow        yellow
2   green         other
3     red           red

Related

How to drop a dataset from a 2d DataFrame

I would like to delete a row/column from a 2d DataFrame.
Let's assume the DataFrame looks like this:
animal  cat  dog  hedgehog
time
0         1    1         0
1         2    0         1
How to get rid of let's say the whole dog thingy to get something like that:
animal  cat  hedgehog
time
0         1         0
1         2         1
I tried e.g. df.drop() with a lot of variations but haven't fully understood pandas yet.
df.drop('dog', axis=1)
will drop it. You need to pass an axis.
If you want this drop operation to affect the current df, use the inplace keyword:
df.drop('dog', axis=1, inplace=True)
If you want to drop more than one column, pass a list:
df.drop(['dog', 'cat'], axis=1, inplace=True)
You can remove the column like this:
df.drop(columns='dog', inplace=True)
and you can also remove several columns at once, like this:
df.drop(columns=['dog', 'cat'], inplace=True)
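For reference, here is a minimal, self-contained sketch reproducing the example above (rebuilding the frame from the table in the question is an assumption):
import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame({'cat': [1, 2], 'dog': [1, 0], 'hedgehog': [0, 1]})
df.index.name = 'time'
df.columns.name = 'animal'

print(df.drop('dog', axis=1))   # positional axis
print(df.drop(columns='dog'))   # keyword form, same result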

How to compute the similarity between two text columns in dataframes with pyspark?

I have 2 data frames with different numbers of rows. Both have a text column. My aim is to compare them, compute a similarity ratio, and add this score to a final data set. The comparison is between title from df1 and headline from df2. The positions of these text rows differ between the two frames.
df1
duration  title                 publish_start_date
129.33    Smuggler's Run fr...  2021-10-29T10:21:...
49.342    anchises. Founded...  2021-10-29T06:00:...
69.939    by Diego Angel in...  2021-10-29T00:33:...
102.60    Orange County sch...  2021-10-28T10:24:...
df2
DataSource  Post Id   headline
Linkedin    L1904055  in English versi...
Linkedin    F6955268  in other language...
Facebook    F1948698  Its combined edit...
Twitter     T7954991  Emma Raducanu: 10...
Basically, I am trying to find similarities between the 2 data sets row by row (on text). Is there any way to do this?
number of rows in the final data set = number of rows in the first data set x number of rows in the second data set
What you are looking for is a cross join. This way each row in df1 will get joined with all rows in df2, after which you can apply a function that scores the similarity between them, as in the sketch below.
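A minimal PySpark sketch of that approach; df1 and df2 are assumed to be Spark DataFrames with the columns shown above, and the similarity metric (difflib's SequenceMatcher wrapped in a UDF) is only an illustrative assumption:
from difflib import SequenceMatcher

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def similarity(a, b):
    # ratio in [0, 1]; swap in any metric you prefer
    return float(SequenceMatcher(None, a, b).ratio())

# every row of df1 paired with every row of df2
pairs = df1.crossJoin(df2)
scored = pairs.withColumn('score', similarity(F.col('title'), F.col('headline')))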

Pandas: Filter rows by comparing a column's value to another value for the same column in a different row

I have searched the heck out of this one, and I don't think I've found anything applicable. But I'm new to Pandas, so I may have missed something; apologies in that case.
Suppose I have a dataframe, df, with the following contents:
Column1 Column2
A Apple
B Apple
A Pear
A Orange
B Orange
A Pear
I want to filter the dataframe to show ONLY rows where:
- Column2's value matches at least 1 other Column2 value
- For these 2 matching rows, at least 1 of Column1's values is different.
Expected results of the above df:
Column1 Column2
A Apple
B Apple
A Orange
B Orange
I have tried using the .loc() method for this, but I cannot find an appropriate filter/set of filters. (I also tried to use a 'for i in df' loop, but this just gave an error).
I would usually post some sample code in these situations, but I don't think any of my approaches so far have made much sense.
Any help would be much appreciated. Thanks.
Use GroupBy.transform with nunique:
df_filtered = df[df.groupby('Column2')['Column1'].transform('nunique').gt(1)]
print(df_filtered)
We could also use pd.crosstab:
df[df['Column2'].map(pd.crosstab(df['Column1'],df['Column2']).gt(0).sum().gt(1))]
#df[df['Column2'].map(pd.crosstab(df['Column1'],df['Column2']).where(lambda x: x>0).count().gt(1))]
We could also use groupby.filter; in general this is slower:
df.groupby('Column2').filter(lambda x: x.Column1.nunique()>1)
Output
Column1 Column2
0 A Apple
1 B Apple
3 A Orange
4 B Orange
The best solution is the first one, with groupby.transform.
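As a self-contained check, rebuilding the sample frame from the question (the construction itself is an assumption based on the table shown):
import pandas as pd

df = pd.DataFrame({'Column1': ['A', 'B', 'A', 'A', 'B', 'A'],
                   'Column2': ['Apple', 'Apple', 'Pear', 'Orange', 'Orange', 'Pear']})

# keep only the Column2 groups that contain more than one distinct Column1 value
print(df[df.groupby('Column2')['Column1'].transform('nunique').gt(1)])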
You can use a groupby and filter:
(
    df.groupby('Column2')
      .filter(lambda x: len(x.drop_duplicates(subset='Column1')) > 1)
)
Column1 Column2
0 A Apple
1 B Apple
3 A Orange
4 B Orange

Iterate two dataframes, compare and change a value in pandas or pyspark

I am trying to do an exercise in pandas.
I have two dataframes. I need to compare few columns between both dataframes and change the value of one column in the first dataframe if the comparison is successful.
Dataframe 1:
Article Country Colour Buy
Pants Germany Red 0
Pull Poland Blue 0
Initially all my articles have the flag 'Buy' set to zero.
I have dataframe 2 that looks as:
Article Origin Colour
Pull Poland Blue
Dress Italy Red
I want to check if the article, country/origin and colour columns match (so check whether I can find the each article from dataframe 1 in dataframe two) and, if so, I want to put the flag 'Buy' to 1.
I tried to iterate through both dataframes with pyspark, but pyspark dataframes are not iterable.
I thought about doing it in pandas, but apparently it is bad practice to change values during iteration.
Which code in pyspark or pandas would work to do what I need to do?
Thanks!
Merge with an indicator, then map the values. Make sure to drop_duplicates on the merge keys in the right frame so the merge result is always the same length as the original, and rename so we don't repeat the same information after the merge. There is no need for a pre-defined column of 0s.
df1 = df1.drop(columns='Buy')
df1 = df1.merge(df2.drop_duplicates().rename(columns={'Origin': 'Country'}),
                indicator='Buy', how='left')
df1['Buy'] = df1['Buy'].map({'left_only': 0, 'both': 1}).astype(int)
Article Country Colour Buy
0 Pants Germany Red 0
1 Pull Poland Blue 1
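Since the question also asks about PySpark, here is an equivalent sketch under the assumption that df1 and df2 are Spark DataFrames with the columns shown:
from pyspark.sql import functions as F

# align the key names, then left-join with a constant flag on the right side
df2r = df2.dropDuplicates().withColumnRenamed('Origin', 'Country')
result = (df1.drop('Buy')
             .join(df2r.withColumn('Buy', F.lit(1)),
                   on=['Article', 'Country', 'Colour'], how='left')
             .fillna(0, subset=['Buy']))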

pandas groupby error for collection repeated values? [duplicate]

I have a df that looks like the following:
id item color
01 truck red
02 truck red
03 car black
04 truck blue
05 car black
I am trying to create a df that looks like this:
item color count
truck red 2
truck blue 1
car black 2
I have tried
df["count"] = df.groupby("item")["color"].transform('count')
But it is not quite what I am searching for.
Any guidance is appreciated
That's not a new column, that's a new DataFrame:
In [11]: df.groupby(["item", "color"]).count()
Out[11]:
             id
item  color
car   black   2
truck blue    1
      red     2
To get the result you want, use reset_index:
In [12]: df.groupby(["item", "color"])["id"].count().reset_index(name="count")
Out[12]:
item color count
0 car black 2
1 truck blue 1
2 truck red 2
To get a "new column" you could use transform:
In [13]: df.groupby(["item", "color"])["id"].transform("count")
Out[13]:
0 2
1 2
2 2
3 1
4 2
dtype: int64
I recommend reading the split-apply-combine section of the docs.
Another possible way to achieve the desired output would be to use Named Aggregation, which allows you to specify the name and the respective aggregation function for each of the desired output columns.
Named aggregation
(New in version 0.25.0.)
To support column-specific aggregation with control over the output
column names, pandas accepts the special syntax in GroupBy.agg(),
known as “named aggregation”, where:
The keywords are the output column names
The values are tuples whose first element is the column to select and
the second element is the aggregation to apply to that column. Pandas
provides the pandas.NamedAgg named tuple with the fields ['column','aggfunc'] to make it clearer what the arguments are. As usual, the
aggregation can be a callable or a string alias.
So to get the desired output, you could try something like...
import pandas as pd
# Setup
df = pd.DataFrame([
    {"item": "truck", "color": "red"},
    {"item": "truck", "color": "red"},
    {"item": "car", "color": "black"},
    {"item": "truck", "color": "blue"},
    {"item": "car", "color": "black"},
])
df_grouped = df.groupby(["item", "color"]).agg(
    count_col=pd.NamedAgg(column="color", aggfunc="count")
)
print(df_grouped)
Which produces the following output:
             count_col
item  color
car   black          2
truck blue           1
      red            2
You can use value_counts and name the column with reset_index:
In [3]: df[['item', 'color']].value_counts().reset_index(names='counts')
Out[3]:
item color counts
0 car black 2
1 truck red 2
2 truck blue 1
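(Note: the names= argument of reset_index was added in pandas 1.5; on older versions you can rename the generated count column afterwards.)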
Here is another option:
import numpy as np

# add a dummy column so count() has a column to tally
df['Counts'] = np.zeros(len(df))
grp_df = df.groupby(['item', 'color']).count()
which results in
             Counts
item  color
car   black       2
truck blue        1
      red         2
An option that is more literal than the accepted answer.
df.groupby(["item", "color"], as_index=False).agg(count=("item", "count"))
Any column name can be used in place of "item" in the aggregation.
"as_index=False" prevents the grouped column from becoming the index.