pandas groupby error for collection repeated values? [duplicate]

I have a df that looks like the following:
id  item   color
01  truck  red
02  truck  red
03  car    black
04  truck  blue
05  car    black
I am trying to create a df that looks like this:
item   color  count
truck  red    2
truck  blue   1
car    black  2
I have tried:
df["count"] = df.groupby("item")["color"].transform('count')
But it is not quite what I am looking for.
Any guidance is appreciated.

That's not a new column, that's a new DataFrame:
In [11]: df.groupby(["item", "color"]).count()
Out[11]:
             id
item  color
car   black   2
truck blue    1
      red     2
To get the result you want, use reset_index:
In [12]: df.groupby(["item", "color"])["id"].count().reset_index(name="count")
Out[12]:
    item  color  count
0    car  black      2
1  truck   blue      1
2  truck    red      2
To get a "new column" you could use transform:
In [13]: df.groupby(["item", "color"])["id"].transform("count")
Out[13]:
0    2
1    2
2    2
3    1
4    2
dtype: int64
I recommend reading the split-apply-combine section of the docs.
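A small aside: groupby.size gives the same counts without having to pick a column, since size counts rows per group rather than non-null values in a particular column:
df.groupby(["item", "color"]).size().reset_index(name="count")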

Another possible way to achieve the desired output is named aggregation, which lets you specify the name and the aggregation function for each of the output columns.
Named aggregation
(New in version 0.25.0.)
To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as "named aggregation", where:
- The keywords are the output column names
- The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg named tuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
So to get the desired output - you could try something like...
import pandas as pd

# Setup
df = pd.DataFrame([
    {"item": "truck", "color": "red"},
    {"item": "truck", "color": "red"},
    {"item": "car", "color": "black"},
    {"item": "truck", "color": "blue"},
    {"item": "car", "color": "black"},
])

df_grouped = df.groupby(["item", "color"]).agg(
    count_col=pd.NamedAgg(column="color", aggfunc="count")
)
print(df_grouped)
Which produces the following output:
             count_col
item  color
car   black          2
truck blue           1
      red            2
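If you want item and color back as regular columns (matching the desired output) rather than as the index, you could reset the index afterwards:
print(df_grouped.reset_index())
which should give:
    item  color  count_col
0    car  black          2
1  truck   blue          1
2  truck    red          2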

You can use value_counts and name the column with reset_index:
In [3]: df[['item', 'color']].value_counts().reset_index(name='counts')
Out[3]:
    item  color  counts
0    car  black       2
1  truck    red       2
2  truck   blue       1
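As a side note, on pandas 2.0 and later, DataFrame.value_counts already names the resulting column count, so a plain reset_index() should be enough if that default name works for you:
df[['item', 'color']].value_counts().reset_index()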

Here is another option:
import numpy as np

# Create a placeholder column so that .count() has a column to count
df['Counts'] = np.zeros(len(df))
grp_df = df.groupby(['item', 'color']).count()
which results in:
             Counts
item  color
car   black       2
truck blue        1
      red         2

An option that is more literal than the accepted answer:
df.groupby(["item", "color"], as_index=False).agg(count=("item", "count"))
Any column name can be used in place of "item" in the aggregation.
Passing as_index=False prevents the grouped columns from becoming the index.

Related

Iterate two dataframes, compare and change a value in pandas or pyspark

I am trying to do an exercise in pandas.
I have two dataframes. I need to compare a few columns between them and change the value of one column in the first dataframe if the comparison succeeds.
Dataframe 1:
Article  Country  Colour  Buy
Pants    Germany  Red     0
Pull     Poland   Blue    0
Initially all my articles have the flag 'Buy' set to zero.
I have dataframe 2 that looks as:
Article  Origin  Colour
Pull     Poland  Blue
Dress    Italy   Red
I want to check whether the article, country/origin, and colour columns match (i.e. whether I can find each article from dataframe 1 in dataframe 2) and, if so, set the flag 'Buy' to 1.
I tried iterating through both dataframes with pyspark, but pyspark dataframes are not iterable.
I thought about doing it in pandas, but apparently changing values during iteration is bad practice.
What code in pyspark or pandas would do what I need?
Thanks!
Merge with an indicator, then map the values. Make sure to drop_duplicates on the merge keys in the right frame so the merge result is always the same length as the original, and rename so we don't repeat the same information after the merge. There is no need for a pre-defined column of 0s.
df1 = df1.drop(columns='Buy')
df1 = df1.merge(df2.drop_duplicates().rename(columns={'Origin': 'Country'}),
                indicator='Buy', how='left')
df1['Buy'] = df1['Buy'].map({'left_only': 0, 'both': 1}).astype(int)
  Article  Country Colour  Buy
0   Pants  Germany    Red    0
1    Pull   Poland   Blue    1
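The question also mentions PySpark; there the usual pattern is likewise a join rather than iteration. A minimal sketch, assuming df1 and df2 are Spark DataFrames with the columns shown above:
from pyspark.sql import functions as F

# Build the match frame: dedupe, align the column names, and attach the flag
flags = (df2.dropDuplicates()
            .withColumnRenamed('Origin', 'Country')
            .withColumn('Buy', F.lit(1)))

# Left join: matched rows get Buy=1, unmatched rows get null, then fill with 0
result = (df1.drop('Buy')
             .join(flags, on=['Article', 'Country', 'Colour'], how='left')
             .fillna(0, subset=['Buy']))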

Collapsing pandas groupby object without aggregation function [duplicate]

Let's say this is my dataframe:
df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})
It looks like this:
  bio center outcome
0   1    one       f
1   1    one       t
2   1    two       f
3   4  three       f
I want to drop row 1 because it has the same bio & center as row 0.
I want to keep row 2 because it has the same bio but a different center than row 0.
Something like this won't work given drop_duplicates' expected input, but it's what I am trying to do:
df.drop_duplicates(subset = 'bio' & subset = 'center' )
Any suggestions?
Edit: changed df a bit to fit the example in the accepted answer.
Your syntax is wrong. Here's the correct way:
df.drop_duplicates(subset=['bio', 'center'])
This returns the following:
  bio center outcome
0   1    one       f
2   1    two       f
3   4  three       f
Note that plain df.drop_duplicates() would not work here: it considers all columns, and row 1 differs from row 0 in outcome, so it would be kept.
Take a look at the df.drop_duplicates documentation for syntax details. subset should be a sequence of column labels.
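If you need control over which duplicate survives, drop_duplicates also takes a keep parameter ('first' by default, 'last', or False to drop every duplicated row). A quick sketch on the same frame:
# Keep the last occurrence of each (bio, center) pair instead of the first
df.drop_duplicates(subset=['bio', 'center'], keep='last')

# Drop all rows whose (bio, center) pair occurs more than once
df.drop_duplicates(subset=['bio', 'center'], keep=False)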
The previous answer was very helpful, but I also needed to add something to get exactly what I wanted, so I'll add it here.
The dataframe:
  bio center outcome
0   1    one       f
1   1    one       t
2   1    two       f
3   4  three       f
After applying drop_duplicates:
  bio center outcome
0   1    one       f
2   1    two       f
3   4  three       f
Notice the index: it is no longer consecutive (0, 2, 3). To get back a normal index (0, 1, 2), pass ignore_index=True:
df.drop_duplicates(subset=['bio', 'center'], ignore_index=True)
Output:
  bio center outcome
0   1    one       f
1   1    two       f
2   4  three       f
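Note that ignore_index was added to drop_duplicates in pandas 1.0; on older versions you can get the same effect by chaining reset_index:
df.drop_duplicates(subset=['bio', 'center']).reset_index(drop=True)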

Pandas merge: how to return the column on which you have done the merge?

I have a dataframe with some categorical data. I want to create a new column which shows only some of those values, and converts the others to 'other'. E.g. to show only the top 10 cities, or, in the example below, show only two colours and convert the others to 'other'.
I want to do it via a pandas.merge, like a SQL outer join: on one side my table, on the other a table with only the values I want to keep (i.e. not convert to 'other').
The problem, and it took me a bit of debugging and swearing to find this out, is that pandas.merge does not return both columns on which you have done the merge, even when one of the columns has nulls.
The solution I have found is to create another column with the same values, which I think would make anyone familiar with SQL cringe. Is there a more elegant way?
This is the code to show what I mean:
import pandas as pd

df = pd.DataFrame()
df['colour'] = ['yellow', 'yellow', 'green', 'red']
mycols = pd.DataFrame()
mycols['colour'] = ['yellow', 'red']
# after this merge, I have no way of knowing which colour in df has no match in mycols
newdf = pd.merge(df, mycols, on='colour', how='outer', suffixes=('', '_r'))
# so I need to create another column in mycols
mycols['colour copied'] = mycols['colour']
newdf2 = pd.merge(df, mycols, on='colour', how='outer', suffixes=('', '_r'))
newdf2['colour copied'] = newdf2['colour copied'].fillna('other')
newdf2 = newdf2.rename(columns={'colour copied': 'colour - reclassified'})
You can add the parameter indicator=True to record, for each row, whether it matched both, left_only, or right_only:
newdf=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'), indicator=True)
print (newdf)
   colour     _merge
0  yellow       both
1  yellow       both
2   green  left_only
3     red       both
Then set the values by condition: where _merge is 'both', keep the value from colour, otherwise use 'other'. numpy.where does the conditional selection, and DataFrame.pop extracts (and removes) the helper column:
import numpy as np

newdf['colour copied'] = np.where(newdf.pop('_merge') == 'both', newdf['colour'], 'other')
print (newdf)
   colour colour copied
0  yellow        yellow
1  yellow        yellow
2   green         other
3     red           red
But if you are working with only one column, a simpler alternative is possible: compare with Series.isin to test membership:
df['colour copied'] = np.where(df['colour'].isin(mycols['colour']), df['colour'], 'other')
print (df)
   colour colour copied
0  yellow        yellow
1  yellow        yellow
2   green         other
3     red           red
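The same one-column case can also be written with Series.where, which keeps the original value where the condition holds and substitutes 'other' elsewhere:
df['colour copied'] = df['colour'].where(df['colour'].isin(mycols['colour']), 'other')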

pandas: reshape dataframe for stacked bar plot

I have a dataframe like this:
        meaning  label  percentFeatureInCluster  percentFeatureInPop
0  hypertension      0                33.781654            30.618880
1        angina      5                24.916958             3.768201
2        angina      9                 4.663107             3.768201
I am trying to group by meaning, and get a stacked bar plot where there are as many bars per meaning as there are rows in each group + an additional one for percentFeatureInPop.
I.e. this would be the DataFrame I am looking for, which I can easily feed into plot.bar(stacked=True) to get the plot I'm after:
              percentFeatureInCluster0  percentFeatureInCluster5  percentFeatureInCluster9  percentFeatureInPop
meaning
hypertension                 33.781654                         0                         0            30.618880
angina                               0                 24.916958                  4.663107             3.768201
How can this be achieved?
pre = 'percentFeatureInCluster'
# Pivot: one column per cluster label, filling 0 where a meaning has no row for that label
d1 = (df.set_index(['meaning', 'label'])[pre]
        .unstack(fill_value=0)
        .add_prefix(pre))
d1.plot.bar(stacked=True, figsize=[10, 4])
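The snippet above only pivots the cluster percentages. If you also want the percentFeatureInPop column from the desired output, one option (a sketch, assuming the value is constant within each meaning, as in the sample data) is to join it back in:
d2 = d1.join(df.groupby('meaning')['percentFeatureInPop'].first())
d2.plot.bar(stacked=True, figsize=[10, 4])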

Pandas: find most frequent values in columns of lists

   x      animal
0  5  [dog, cat]
1  6       [dog]
2  8  [elephant]
I have a dataframe like this. How can I find the most frequent animals across all the lists in the column? value_counts() treats each list as a single element, so I can't use it.
Something along these lines?
import pandas as pd
from collections import Counter

df = pd.DataFrame({'x': [5, 6, 8],
                   'animal': [['dog', 'cat'], ['dog'], ['elephant']]})

# Concatenate all the lists into one flat list
x = sum(df.animal, [])
# x
# Out[15]: ['dog', 'cat', 'dog', 'elephant']

# Count occurrences and take the single most common
c = Counter(x)
c.most_common(1)
# Out[17]: [('dog', 2)]
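On large frames, sum(df.animal, []) is quadratic, because each addition copies the growing list; itertools.chain.from_iterable avoids the copies:
from collections import Counter
from itertools import chain

c = Counter(chain.from_iterable(df.animal))
c.most_common(1)
# [('dog', 2)]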
Maybe take a step back and redefine your data structure? Pandas is better suited to a "flat" dataframe.
Instead of:
   x      animal
0  5  [dog, cat]
1  6       [dog]
2  8  [elephant]
Do:
   x    animal
0  5       dog
1  5       cat
2  6       dog
3  8  elephant
Now you can count easily with len(df[df['animal'] == 'dog']) as well as many other Pandas things!
To flatten your dataframe, reference this answer:
Flatten a column with value of type list while duplicating the other column's value accordingly in Pandas
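As a sketch of that flattening on pandas 0.25+, DataFrame.explode turns each list element into its own row while repeating the other columns:
import pandas as pd

df = pd.DataFrame({'x': [5, 6, 8],
                   'animal': [['dog', 'cat'], ['dog'], ['elephant']]})

# One row per animal, with x repeated accordingly
flat = df.explode('animal')
print(flat['animal'].value_counts())
# dog         2
# cat         1
# elephant    1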