How to count the number of occurrences for a histogram using dataframes
import pandas as pd
d = {'color': ["blue", "green", "yellow", "red, blue", "green, yellow", "yellow, red, blue"]}
df = pd.DataFrame(data=d)
How do you go from
color
blue
green
yellow
red, blue
green, yellow
yellow, red, blue
to
color   occurance
blue    3
green   2
yellow  3
Let's try splitting on the regex ,\s* (a comma followed by zero or more whitespace characters), then explode into rows and use value_counts to count the values:
s = (
df['color'].str.split(r',\s*')
.explode()
.value_counts()
.rename_axis('color')
.reset_index(name='occurance')
)
Or split with expand=True and then stack:
s = (
df['color'].str.split(r',\s*', expand=True)
.stack()
.value_counts()
.rename_axis('color')
.reset_index(name='occurance')
)
s:
color occurance
0 blue 3
1 yellow 3
2 green 2
3 red 2
Here is another way, using .str.get_dummies() (with sep=', ' this assumes every separator is a comma followed by exactly one space):
df['color'].str.get_dummies(sep=', ').sum()
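As a rough sketch of what the get_dummies approach produces, the snippet below rebuilds the question's dataframe and reshapes the summed indicator columns into the same color/occurance layout as the other answers. The names counts and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'color': ["blue", "green", "yellow", "red, blue",
                             "green, yellow", "yellow, red, blue"]})

# One 0/1 indicator column per distinct color; summing a column counts
# the rows in which that color appears.
counts = df['color'].str.get_dummies(sep=', ').sum()

# Reshape into the two-column layout used by the other answers.
result = (
    counts.sort_values(ascending=False)
          .rename_axis('color')
          .reset_index(name='occurance')
)
print(result)
```

Note that this counts each color at most once per row, which matches the other approaches here only because no row repeats a color.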
Related
I have a dataframe
COL1 COL2 COL3
Red Blue Green
Red Yellow Blue
Blue Red Blue
I want to rename a value in the dataframe if it appears two or more times in the same row.
So the expected output is
COL1 COL2 COL3
Red Blue Green
Red Yellow Blue
Blue Red 2Blue
We can use a custom function that checks whether values are duplicated within a row and, using Series.mask, prefixes each repeated value with an incremental counter:
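def myf(x):
    # 1-based running count of each value within the row, as strings
    counter = x.groupby(x).cumcount().add(1).astype(str)
    # keep first occurrences; prefix repeats with their count (e.g. 2Blue)
    return x.mask(x.duplicated(), x.radd(counter))

print(df.apply(myf, axis=1))
# or df.T.apply(myf).T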
COL1 COL2 COL3
0 Red Blue Green
1 Red Yellow Blue
2 Blue Red 2Blue
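For anyone who wants to run this end to end, here is a minimal self-contained version of the same idea. It rebuilds the sample dataframe from the question; the variable name out is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'COL1': ['Red', 'Red', 'Blue'],
                   'COL2': ['Blue', 'Yellow', 'Red'],
                   'COL3': ['Green', 'Blue', 'Blue']})

def myf(row):
    # Running 1-based count of each value within this row, as strings.
    counter = row.groupby(row).cumcount().add(1).astype(str)
    # Leave first occurrences alone; repeats get their count as a prefix.
    return row.mask(row.duplicated(), row.radd(counter))

out = df.apply(myf, axis=1)
print(out)
```

Only the last row contains a repeated value, so only its second Blue is renamed.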
I have an Excel file that I'm reading into pandas; it looks similar to this:
name size color material size color material size color material
bob m red coton m yellow cotton m green dri-fit
james l green dri-fit l green cotton l red cotton
steve l green dri-fit l green cotton l red cotton
I want to tally all my shirt types into something like this
l green dri-fit 2
l red coton 2
m red coton 1
I am using pandas ExcelFile to read the file into a file object, then parse to load the sheet into a dataframe.
import pandas as pd
file = pd.ExcelFile('myexcelfile.xlsx')
df = file.parse('sheet1')
To get to my desired output, I am trying to use wide_to_long. The problem is that some of my columns have the same names, so pandas renames them when I read the file: the second instance of size, for example, automatically becomes size.1, and the same happens with color and material. If I try to use stubnames with wide_to_long, it complains about the first instance of size: "stubname can't be identical to a column name".
Is there any way to use wide_to_long before pandas renames my columns?
The column numbering is problematic for pd.wide_to_long, so we need to modify the first instance of the column names, adding a .0, so they don't conflict with the stubs.
Sample Data
import pandas as pd
df = pd.read_clipboard()
print(df)
name size color material size.1 color.1 material.1 size.2 color.2 material.2
0 bob m red coton m yellow cotton m green dri-fit
1 james l green dri-fit l green cotton l red cotton
2 steve l green dri-fit l green cotton l red cotton
Code:
stubs = ['size', 'color', 'material']
d = {x: f'{x}.0' for x in stubs}
df.columns = [d.get(k, k) for k in df.columns]
res = pd.wide_to_long(df, i='name', j='num', sep='.', stubnames=stubs)
# size color material
#name num
#bob 0 m red coton
#james 0 l green dri-fit
#steve 0 l green dri-fit
#bob 1 m yellow cotton
#james 1 l green cotton
#steve 1 l green cotton
#bob 2 m green dri-fit
#james 2 l red cotton
#steve 2 l red cotton
res.groupby([*res]).size()
#size color material
#l green cotton 2
# dri-fit 2
# red cotton 2
#m green dri-fit 1
# red coton 1
# yellow cotton 1
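The steps above can be sketched as one runnable script. Since read_clipboard depends on what is on the clipboard, this version assumes the sheet has already been read and builds the frame by hand with the renamed (size.1, size.2, ...) columns; tally is an illustrative name.

```python
import pandas as pd

# Hand-built stand-in for the parsed sheet, with pandas' de-duplicated names.
df = pd.DataFrame(
    [['bob', 'm', 'red', 'coton', 'm', 'yellow', 'cotton', 'm', 'green', 'dri-fit'],
     ['james', 'l', 'green', 'dri-fit', 'l', 'green', 'cotton', 'l', 'red', 'cotton'],
     ['steve', 'l', 'green', 'dri-fit', 'l', 'green', 'cotton', 'l', 'red', 'cotton']],
    columns=['name', 'size', 'color', 'material',
             'size.1', 'color.1', 'material.1',
             'size.2', 'color.2', 'material.2'],
)

stubs = ['size', 'color', 'material']

# Give the first group an explicit .0 suffix so no column equals a stub name.
df = df.rename(columns={s: f'{s}.0' for s in stubs})

# One row per (name, num) pair, with one column per stub.
res = pd.wide_to_long(df, stubnames=stubs, i='name', j='num', sep='.')

# Count each distinct (size, color, material) combination.
tally = res.groupby(stubs).size()
print(tally)
```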
value_counts
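import numpy as np
cols = ['size', 'color', 'material']
# assumes df still has the duplicated column names, so df.get('size') returns every size column
s = pd.value_counts([*zip(*map(np.ravel, map(df.get, cols)))])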
(l, red, cotton) 2
(l, green, cotton) 2
(l, green, dri-fit) 2
(m, green, dri-fit) 1
(m, yellow, cotton) 1
(m, red, coton) 1
dtype: int64
Counter
And more to my liking
from collections import Counter
s = pd.Series(Counter([*zip(*map(np.ravel, map(df.get, cols)))]))
s.rename_axis(['size', 'color', 'material']).reset_index(name='freq')
size color material freq
0 m red coton 1
1 m yellow cotton 1
2 m green dri-fit 1
3 l green dri-fit 2
4 l green cotton 2
5 l red cotton 2
Another option is to reshape the values into rows of three columns and count with groupby:
df = pd.read_excel('C:/Users/me/Desktop/sovrflw_data.xlsx')
df.drop('name', axis=1, inplace=True)
arr = df.values.reshape(-1, 3)
df2 = pd.DataFrame(arr, columns=['size','color','material'])
df2['count']=1
df2.groupby(['size','color','material'],as_index=False).count()
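Here is a small sketch of the reshape idea that runs without the Excel file. It assumes that every shirt occupies exactly three consecutive columns after name, and it uses size() rather than a helper count column; the names wide, shirts and counts are made up for the example.

```python
import pandas as pd

# Hand-built stand-in for the spreadsheet; column labels after 'name'
# do not matter here because only the values are used.
wide = pd.DataFrame(
    [['bob', 'm', 'red', 'coton', 'm', 'yellow', 'cotton', 'm', 'green', 'dri-fit'],
     ['james', 'l', 'green', 'dri-fit', 'l', 'green', 'cotton', 'l', 'red', 'cotton'],
     ['steve', 'l', 'green', 'dri-fit', 'l', 'green', 'cotton', 'l', 'red', 'cotton']],
    columns=['name', 'size', 'color', 'material',
             'size.1', 'color.1', 'material.1',
             'size.2', 'color.2', 'material.2'],
)

cols = ['size', 'color', 'material']

# Row-major reshape: each run of three consecutive values becomes one shirt.
arr = wide.drop(columns='name').to_numpy().reshape(-1, 3)
shirts = pd.DataFrame(arr, columns=cols)

# Tally identical (size, color, material) rows.
counts = shirts.groupby(cols).size().reset_index(name='count')
print(counts)
```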
Here is my original df
import pandas as pd
df_1 = pd.DataFrame({'color': ['blue', 'blue', 'yellow', 'yellow'], 'count': [1,3,4,5]})
color count
blue 1
blue 3
yellow 4
yellow 5
I would like to group by the color column, sum the count column, and then populate the original dataframe with the results. The final result should look like:
df_2 = pd.DataFrame({'color': ['blue', 'blue', 'yellow', 'yellow'], 'count': [1,3,4,5],
'total_per_color': [4,4,9,9]})
color count total_per_color
blue 1 4
blue 3 4
yellow 4 9
yellow 5 9
I can do it with groupby and sum followed by a merge in pandas, but I wonder if there is a quicker way to do it. In SQL one can achieve this with PARTITION BY; in R I can use dplyr and mutate. Is there something similar in pandas?
Using transform with groupby
df_1['total_per_color']=df_1.groupby('color')['count'].transform('sum')
df_1
Out[886]:
color count total_per_color
0 blue 1 4
1 blue 3 4
2 yellow 4 9
3 yellow 5 9
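To make the comparison with the groupby-then-merge route concrete, this sketch computes the column both ways on the question's data. The names via_transform, totals and via_merge are illustrative only.

```python
import pandas as pd

df_1 = pd.DataFrame({'color': ['blue', 'blue', 'yellow', 'yellow'],
                     'count': [1, 3, 4, 5]})

# transform('sum') returns one value per original row, aligned to df_1's index,
# much like a SQL window function with PARTITION BY color.
via_transform = df_1.assign(
    total_per_color=df_1.groupby('color')['count'].transform('sum')
)

# The longer route from the question: aggregate, then merge back.
totals = (
    df_1.groupby('color', as_index=False)['count'].sum()
        .rename(columns={'count': 'total_per_color'})
)
via_merge = df_1.merge(totals, on='color', how='left')

print(via_transform)
```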
I have a dataframe with a categorical column.
I remove all the rows that have one of the categories.
How can I make sure the resulting dataframe has only those categories that exist and does not keep the deleted categories in its index?
df = pd.DataFrame({'color':np.random.choice(['Blue','Green','Brown','Red'], 50)})
df.color = df.color.astype('category')
df.color.head()
Output:
0 Blue
1 Green
2 Blue
3 Green
4 Brown
Name: color, dtype: category
Categories (4, object): [Blue, Brown, Green, Red]
Remove Brown from both the dataframe and the categories:
df = df.query('color != "Brown"')
df.color = df.color.cat.remove_categories('Brown')
df.color.head()
Output:
0 Blue
1 Green
2 Blue
3 Green
7 Red
Name: color, dtype: category
Categories (3, object): [Blue, Green, Red]
There's (now?) a pandas function doing exactly that: remove_unused_categories
This function has only one parameter, inplace, which has been deprecated since pandas 1.2.0. Hence, the following example (based on Scott's answer) does not use inplace:
>>> df = pd.DataFrame({'color':np.random.choice(['Blue','Green','Brown','Red'], 50)})
>>> df.color = df.color.astype('category')
>>> df.color.head()
0 Green
1 Brown
2 Blue
3 Red
4 Brown
Name: color, dtype: category
Categories (4, object): [Blue, Brown, Green, Red]
>>> df = df[df.color != "Brown"]
>>> df.color = df.color.cat.remove_unused_categories()
>>> df.color.head()
0 Green
2 Blue
3 Red
5 Red
6 Green
Name: color, dtype: category
Categories (3, object): [Blue, Green, Red]
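Because the example above uses random data, its output changes on every run. A deterministic sketch of the same steps, with a small fixed list of colors chosen only for illustration, makes the before/after categories easy to check:

```python
import pandas as pd

df = pd.DataFrame({'color': ['Blue', 'Green', 'Brown', 'Red', 'Brown', 'Green']})
df['color'] = df['color'].astype('category')

# Filtering rows does not shrink the category list on its own.
filtered = df[df['color'] != 'Brown'].copy()
before = list(filtered['color'].cat.categories)

# Drop every category that no longer appears in the data.
filtered['color'] = filtered['color'].cat.remove_unused_categories()
after = list(filtered['color'].cat.categories)

print(before)
print(after)
```

Unlike remove_categories('Brown'), this does not require knowing in advance which categories were filtered out.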
I have the dataframe:
import pandas as pd
id = [0,0,0,0,1,1,1,1]
color = ['red','blue','red','black','blue','red','black','black']
test = pd.DataFrame(zip(id, color), columns = ['id', 'color'])
and would like to create a column of the running count of the unique colors grouped by id so that the final dataframe looks like this:
id color expanding_unique_count
0 0 red 1
1 0 blue 2
2 0 red 2
3 0 black 3
4 1 blue 1
5 1 red 2
6 1 black 3
7 1 black 3
I tried this simple way:
import numpy as np

def len_unique(x):
    return len(np.unique(x))

test['expanding_unique_count'] = test.groupby('id')['color'].apply(lambda x: pd.expanding_apply(x, len_unique))
And got ValueError: could not convert string to float: black
If I change the colors to integers:
color = [1,2,1,3,2,1,3,3]
test = pd.DataFrame(zip(id, color), columns = ['id', 'color'])
Then running the same code above produces the desired result. Is there a way for this to work while maintaining the string type for the column color?
It looks like expanding_apply and rolling_apply mainly work on numeric values. One option is to create a numeric column that encodes the color strings (for example, by making the color column categorical) and then run expanding_apply on it.
# processing
# ===================================
# create numeric label
test['numeric_label'] = pd.Categorical(test['color']).codes
# output: array([2, 1, 2, 0, 1, 2, 0, 0], dtype=int8)
# your expanding function
test['expanding_unique_count'] = test.groupby('id')['numeric_label'].apply(lambda x: pd.expanding_apply(x, len_unique))
# drop the auxiliary column
test.drop('numeric_label', axis=1)
id color expanding_unique_count
0 0 red 1
1 0 blue 2
2 0 red 2
3 0 black 3
4 1 blue 1
5 1 red 2
6 1 black 3
7 1 black 3
Edit:
def func(group):
    # 1 at the first occurrence of each color in the group, 0 elsewhere, then a running total
    return pd.Series(1, index=group.groupby('color').head(1).index).reindex(group.index).fillna(0).cumsum()

test['expanding_unique_count'] = test.groupby('id', group_keys=False).apply(func)
print(test)
id color expanding_unique_count
0 0 red 1
1 0 blue 2
2 0 red 2
3 0 black 3
4 1 blue 1
5 1 red 2
6 1 black 3
7 1 black 3
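The top-level pd.expanding_apply is not available in recent pandas versions, so here is a sketch of an alternative that is not taken from the answers above. It flags the first occurrence of each (id, color) pair and takes a cumulative sum per id, which works directly on the string column. The name is_first is illustrative.

```python
import pandas as pd

test = pd.DataFrame({
    'id': [0, 0, 0, 0, 1, 1, 1, 1],
    'color': ['red', 'blue', 'red', 'black', 'blue', 'red', 'black', 'black'],
})

# True where a color is seen for the first time within its id.
is_first = ~test.duplicated(subset=['id', 'color'])

# Running total of first occurrences per id = running count of unique colors.
test['expanding_unique_count'] = is_first.astype(int).groupby(test['id']).cumsum()

print(test)
```

This relies on duplicated() marking every occurrence after the first, so the running total only increases when a new color appears.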