Reset the categories of categorical index in Pandas

I have a dataframe with a column being categorical.
I remove all the rows having one of the categories.
How can I make sure the resulting dataframe has only those categories that exist and does not keep the deleted categories in its index?

df = pd.DataFrame({'color':np.random.choice(['Blue','Green','Brown','Red'], 50)})
df.color = df.color.astype('category')
df.color.head()
Output:
0 Blue
1 Green
2 Blue
3 Green
4 Brown
Name: color, dtype: category
Categories (4, object): [Blue, Brown, Green, Red]
Remove Brown from the dataframe and from the categories.
df = df.query('color != "Brown"')
df.color = df.color.cat.remove_categories('Brown')
df.color.head()
Output:
0 Blue
1 Green
2 Blue
3 Green
7 Red
Name: color, dtype: category
Categories (3, object): [Blue, Green, Red]

How can I make sure the resulting dataframe has only those categories that exist and does not keep the deleted categories in its index?
There's (now?) a pandas function doing exactly that: remove_unused_categories
This function's only parameter, inplace, has been deprecated since pandas 1.2.0. Hence, the following example (based on Scott's answer) does not use inplace:
>>> df = pd.DataFrame({'color':np.random.choice(['Blue','Green','Brown','Red'], 50)})
... df.color = df.color.astype('category')
... df.color.head()
0 Green
1 Brown
2 Blue
3 Red
4 Brown
Name: color, dtype: category
Categories (4, object): [Blue, Brown, Green, Red]
>>> df = df[df.color != "Brown"]
... df.color = df.color.cat.remove_unused_categories()
... df.color.head()
0 Green
2 Blue
3 Red
5 Red
6 Green
Name: color, dtype: category
Categories (3, object): [Blue, Green, Red]
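As a compact, runnable sketch of the same idea (using a one-column Series instead of a dataframe):

```python
import pandas as pd

# filter out one category, then drop it from the dtype
s = pd.Series(['Blue', 'Green', 'Brown', 'Red'], dtype='category')
s = s[s != 'Brown'].cat.remove_unused_categories()
print(list(s.cat.categories))  # ['Blue', 'Green', 'Red']
```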

Related

how to count the occurrences of a value

How to count the number of occurrences for a histogram using dataframes
d = {'color': ["blue", "green", "yellow", "red, blue", "green, yellow", "yellow, red, blue"],}
df = pd.DataFrame(data=d)
How do you go from
color
blue
green
yellow
red, blue
green, yellow
yellow, red, blue
to
color   occurance
blue    3
green   2
yellow  3
Let's try splitting by the regex ,\s* (a comma followed by zero or more whitespace characters), then explode into rows and value_counts to get the count of values:
s = (
    df['color'].str.split(r',\s*')
      .explode()
      .value_counts()
      .rename_axis('color')
      .reset_index(name='occurance')
)
Or can split and expand then stack:
s = (
    df['color'].str.split(r',\s*', expand=True)
      .stack()
      .value_counts()
      .rename_axis('color')
      .reset_index(name='occurance')
)
s:
color occurance
0 blue 3
1 yellow 3
2 green 2
3 red 2
Here is another way using .str.get_dummies()
df['color'].str.get_dummies(sep=', ').sum()
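For reference, a self-contained sketch of the get_dummies approach on the question's data (the trailing rename to match the desired 'occurance' column layout is an assumption):

```python
import pandas as pd

d = {'color': ["blue", "green", "yellow", "red, blue",
               "green, yellow", "yellow, red, blue"]}
df = pd.DataFrame(data=d)

# one-hot encode each comma-separated value, then sum each column
counts = df['color'].str.get_dummies(sep=', ').sum()
out = counts.rename_axis('color').reset_index(name='occurance')
print(out)
```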

Re-define dataframe index with map function

I have a dataframe like this. I wanted to know how I can apply a map function to its index and rename it into an easier format.
df = pd.DataFrame({'d': [1, 2, 3, 4]}, index=['apple_017', 'orange_054', 'orange_061', 'orange_053'])
df
d
apple_017 1
orange_054 2
orange_061 3
orange_053 4
There are only two labels in the indices of the dataframe, so it's either apple or orange in this case, and this is how I tried:
data.index = data.index.map(i = "apple" if "apple" in i else "orange")
(Apparently it's not how it works)
Desired output:
d
apple 1
orange 2
orange 3
orange 4
Appreciate anyone's help and suggestion!
Try via split():
df.index=df.index.str.split('_').str[0]
OR
via map():
df.index=df.index.map(lambda x:'apple' if 'apple' in x else 'orange')
output of df:
d
apple 1
orange 2
orange 3
orange 4
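Both approaches yield the same labels; a quick self-contained check on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'d': [1, 2, 3, 4]},
                  index=['apple_017', 'orange_054', 'orange_061', 'orange_053'])

split_idx = df.index.str.split('_').str[0]  # take the prefix before '_'
map_idx = df.index.map(lambda x: 'apple' if 'apple' in x else 'orange')
print(list(split_idx))  # ['apple', 'orange', 'orange', 'orange']
```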

How do you groupby and aggregate using conditional statements in Pandas?

Expanding on the question here, I'm wondering how to add aggregation to the following based on conditions:
Index Name Item Quantity
0 John Apple Red 10
1 John Apple Green 5
2 John Orange Cali 12
3 Jane Apple Red 10
4 Jane Apple Green 5
5 Jane Orange Cali 18
6 Jane Orange Spain 2
7 John Banana 3
8 Jane Coconut 5
9 John Lime 10
... And so forth
What I need to do is convert this data into a dataframe like the following. Note: I am only interested in the total quantity of the apples and the oranges, each in a separate column; any other items that appear in a group are not included in the aggregation on column "Quantity", but they still appear as strings in the "All Items" column:
Index Name All Items Apples Total Oranges Total
0 John Apple Red, Apple Green, Orange Cali, Banana, Lime 15 12
1 Jane Apple Red, Apple Green, Orange Cali, Coconut 15 20
How would I achieve that? Many thanks in advance!
You can use groupby and pivot_table after extracting the Apple and Orange substrings as below:
import re
s = df['Item'].str.extract("(Apple|Orange)",expand=False,flags=re.I)
# re.I used above is optional and is used for case insensitive matching
a = df.assign(Item_1=s).dropna(subset=['Item_1'])
out = (a.groupby("Name")['Item'].agg(",".join).to_frame()
        .join(a.pivot_table("Quantity", "Name", "Item_1", aggfunc=sum).add_suffix("_Total"))
        .reset_index())
print(out)
Name Item Apple_Total \
0 Jane Apple Red,Apple Green,Orange Cali,Orange Spain 15
1 John Apple Red,Apple Green,Orange Cali 15
Orange_Total
0 20
1 12
EDIT:
For the edited question, you can use the same code, except groupby on the original dataframe df instead of the subset a, and then join:
out = (df.groupby("Name")['Item'].agg(",".join).to_frame()
        .join(a.pivot_table("Quantity", "Name", "Item_1", aggfunc=sum).add_suffix("_Total"))
        .reset_index())
print(out)
Name Item Apple_Total \
0 Jane Apple Red,Apple Green,Orange Cali,Orange Spain... 15
1 John Apple Red,Apple Green,Orange Cali,Banana,Lime 15
Orange_Total
0 20
1 12
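For convenience, a self-contained version of the approach above, with the question's table inlined as a dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'John', 'Jane', 'Jane',
             'Jane', 'Jane', 'John', 'Jane', 'John'],
    'Item': ['Apple Red', 'Apple Green', 'Orange Cali', 'Apple Red', 'Apple Green',
             'Orange Cali', 'Orange Spain', 'Banana', 'Coconut', 'Lime'],
    'Quantity': [10, 5, 12, 10, 5, 18, 2, 3, 5, 10],
})

# tag each row as Apple/Orange (NaN otherwise); keep only tagged rows for the totals
s = df['Item'].str.extract('(Apple|Orange)', expand=False)
a = df.assign(Item_1=s).dropna(subset=['Item_1'])

# join the full item list per name with the per-fruit quantity totals
out = (df.groupby('Name')['Item'].agg(', '.join).to_frame()
         .join(a.pivot_table('Quantity', 'Name', 'Item_1', aggfunc='sum')
                .add_suffix('_Total'))
         .reset_index())
print(out)
```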
First, filter only the required rows using str.contains on the Item column:
from io import StringIO
import pandas as pd
s = StringIO("""Name;Item;Quantity
John;Apple Red;10
John;Apple Green;5
John;Orange Cali;12
Jane;Apple Red;10
Jane;Apple Green;5
Jane;Orange Cali;18
Jane;Orange Spain;2
John;Banana;3
Jane;Coconut;5
John;Lime;10
""")
df = pd.read_csv(s,sep=';')
req_items_idx = df[df.Item.str.contains('Apple|Orange')].index
df_filtered = df.loc[req_items_idx,:]
Once you have them you can further pivot the data to get the required values based on Name
pivot_df = pd.pivot_table(df_filtered,index=['Name'],columns=['Item'],aggfunc='sum')
pivot_df.columns = pivot_df.columns.droplevel()
pivot_df.columns.name = None
pivot_df = pivot_df.reset_index()
Generate the Totals for Apples and Oranges
orange_columns = pivot_df.columns[pivot_df.columns.str.contains('Orange')].tolist()
apple_columns = pivot_df.columns[pivot_df.columns.str.contains('Apple')].tolist()
pivot_df['Apples Total'] = pivot_df.loc[:,apple_columns].sum(axis=1)
pivot_df['Orange Total'] = pivot_df.loc[:,orange_columns].sum(axis=1)
A wrapper function to combine the Items together
def combine_items(inp, columns):
    res = []
    for val, col in zip(inp.values, columns):
        if not pd.isnull(val):
            res += [col]
    return ','.join(res)
req_columns = apple_columns+orange_columns
pivot_df['Items'] = pivot_df[apple_columns+orange_columns].apply(combine_items,args=([req_columns]),axis=1)
Finally you can get the required columns in a single place and print the values
total_columns = pivot_df.columns[pivot_df.columns.str.contains('Total')].tolist()
name_item_columns = pivot_df.columns[pivot_df.columns.str.contains('Name|Items')].tolist()
>>> pivot_df[name_item_columns+total_columns]
Name Items Apples Total Orange Total
0 Jane Apple Green,Apple Red,Orange Cali,Orange Spain 15.0 20.0
1 John Apple Green,Apple Red,Orange Cali 15.0 12.0
This answer is intended to outline the individual steps and the approach one can take to solve something similar to this.
Edits: fixed a bug.
To do this, before doing your groupby you can create your Total columns. These will contain the number of apples or oranges in that row, depending on whether that row's Item is an apple or an orange.
df['Apples Total'] = df.apply(lambda x: x.Quantity if ('Apple' in x.Item) else 0, axis=1)
df['Oranges Total'] = df.apply(lambda x: x.Quantity if ('Orange' in x.Item) else 0, axis=1)
When this is in place, groupby name and aggregate on each column. Sum on the total columns, and aggregate to list on the item column.
df.groupby('Name').agg({'Apples Total': 'sum',
                        'Oranges Total': 'sum',
                        'Item': lambda x: list(x)})
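A runnable sketch of this conditional-column approach on a small subset of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'Jane', 'Jane'],
    'Item': ['Apple Red', 'Orange Cali', 'Apple Green', 'Coconut'],
    'Quantity': [10, 12, 5, 5],
})

# per-row totals: the quantity if the item matches, else 0
df['Apples Total'] = df.apply(lambda x: x.Quantity if 'Apple' in x.Item else 0, axis=1)
df['Oranges Total'] = df.apply(lambda x: x.Quantity if 'Orange' in x.Item else 0, axis=1)

out = df.groupby('Name').agg({'Apples Total': 'sum',
                              'Oranges Total': 'sum',
                              'Item': lambda x: list(x)})
print(out)
```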
df = pd.read_csv(StringIO("""
Index,Name,Item,Quantity
0,John,Apple Red,10
1,John,Apple Green,5
2,John,Orange Cali,12
3,Jane,Apple Red,10
4,Jane,Apple Green,5
5,Jane,Orange Cali,18
6,Jane,Orange Spain,2
7,John,Banana,3
8,Jane,Coconut,5
9,John,Lime,10
"""))
Getting list of items
grouping by name to get the list of items
items_list = pd.DataFrame(df.groupby(["Name"])["Item"].apply(list)).rename(columns={"Item": "All Items"})
items_list
All Items
Name
Jane [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut]
John [Apple Red, Apple Green, Orange Cali, Banana, Lime]
getting count of name item groups
renaming the temp df items column such that all the apples/oranges are treated similarly
temp2 = df.groupby(["Name", "Item"])['Quantity'].apply(sum)
temp2 = pd.DataFrame(temp2).reset_index().set_index("Name")
temp2['Item'] = temp2['Item'].str.replace(r'(?:.*)(apple|orange)(?:.*)', r'\1', case=False,regex=True)
temp2
Item Quantity
Name
Jane Apple 5
Jane Apple 10
Jane Coconut 5
Jane Orange 18
Jane Orange 2
John Apple 5
John Apple 10
John Banana 3
John Lime 10
John Orange 12
getting the required pivot table
pivot table for getting items count as separate column and retaining just apple orange count
pivot_df = pd.pivot_table(temp2, values='Quantity', columns='Item', index=["Name"], aggfunc=np.sum)
pivot_df = pivot_df[['Apple', 'Orange']]
pivot_df
Item Apple Orange
Name
Jane 15.0 20.0
John 15.0 12.0
merging the items list df and the pivot_df
output = items_list.merge(pivot_df, on="Name").rename(
    columns={'Apple': 'Apples Total', 'Orange': 'Oranges Total'})
output
All Items Apples Total Oranges Total
Name
Jane [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut] 15.0 20.0
John [Apple Red, Apple Green, Orange Cali, Banana, Lime] 15.0 12.0

Pandas groupby, sum and populate original dataframe

Here is my original df
import pandas as pd
df_1 = pd.DataFrame({'color': ['blue', 'blue', 'yellow', 'yellow'], 'count': [1,3,4,5]})
color count
blue 1
blue 3
yellow 4
yellow 5
I would like to group by color column and sum count column and then populate original dataframe with results. So final result should look like:
df_2 = pd.DataFrame({'color': ['blue', 'blue', 'yellow', 'yellow'], 'count': [1,3,4,5],
'total_per_color': [4,4,9,9]})
color count total_per_color
blue 1 4
blue 3 4
yellow 4 9
yellow 5 9
I can do it with groupby and sum and then merge using pandas, but I wonder if there is a quicker way. In SQL one can achieve this with a window function (partition); in R I can use dplyr and mutate. Is there something similar in pandas?
Using transform with groupby
df_1['total_per_color']=df_1.groupby('color')['count'].transform('sum')
df_1
Out[886]:
color count total_per_color
0 blue 1 4
1 blue 3 4
2 yellow 4 9
3 yellow 5 9
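To make the SQL/dplyr analogy concrete: transform plays the role of SUM(count) OVER (PARTITION BY color), broadcasting each group's aggregate back onto its rows. A self-contained sketch:

```python
import pandas as pd

df_1 = pd.DataFrame({'color': ['blue', 'blue', 'yellow', 'yellow'],
                     'count': [1, 3, 4, 5]})

# broadcast each group's sum back onto every row of that group
df_1['total_per_color'] = df_1.groupby('color')['count'].transform('sum')
print(df_1['total_per_color'].tolist())  # [4, 4, 9, 9]
```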

pandas dataframe generic column values

Is there a consistent way of getting pandas column values via DF['ColName'], including index columns? If 'ColName' is an index column, you get a KeyError.
It is very inconvenient to have to determine, every time, whether the column being passed in is an index column or not, and then handle it differently.
Thank you.
consider the dataframe df
df = pd.DataFrame(
dict(
A=[1, 2, 3],
B=[4, 5, 6],
C=['x', 'y', 'z'],
),
pd.MultiIndex.from_tuples(
[
('cat', 'red'),
('dog', 'blue'),
('bird', 'yellow')
],
names=['species', 'color']
)
)
print(df)
A B C
species color
cat red 1 4 x
dog blue 2 5 y
bird yellow 3 6 z
you can always refer to levels of the index in the same way you'd refer to columns if you reset_index() first.
Grab column 'A'
df.reset_index()['A']
0 1
1 2
2 3
Name: A, dtype: int64
Grab 'color' without reset_index()
df['color']
> KeyError
With reset_index()
0 red
1 blue
2 yellow
Name: color, dtype: object
This doesn't come without its downside. That index was potentially useful to have for column 'A':
df['A']
species color
cat red 1
dog blue 2
bird yellow 3
Name: A, dtype: int64
The index is automatically aligned with the values of column 'A', which was the whole point of making it the index.
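A related option, not used above: Index.get_level_values reads one index level directly, without resetting the index and without losing the alignment for the regular columns. A sketch on a similar MultiIndex:

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [('cat', 'red'), ('dog', 'blue'), ('bird', 'yellow')],
        names=['species', 'color'])
)

# read the 'color' level without touching the columns
colors = df.index.get_level_values('color')
print(list(colors))  # ['red', 'blue', 'yellow']
```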