stop pandas from renaming columns with same name so i can use wide to long - pandas

I have an Excel file that I'm reading into pandas that looks similar to this
name size color material size color material size color material
bob m red coton m yellow cotton m green dri-fit
james l green dri-fit l green cotton l red cotton
steve l green dri-fit l green cotton l red cotton
I want to tally all my shirt types into something like this
l green dri-fit 2
l red coton 2
m red coton 1
I am using pandas' ExcelFile to read the file into a file object, then using parse to parse the sheet into a dataframe.
import pandas as pd
file = pd.ExcelFile('myexcelfile.xlsx')
df = file.parse('sheet1')
To try and get to my desired output, I am trying to use wide_to_long. The problem is that because some of my columns have the same names, pandas renames them when it reads the file. The second instance of size, for example, automatically becomes size.1, and the same happens with color and material. If I try to use stubnames with wide_to_long, it complains about the first instance of size: "stubname can't be identical to a column name".
Is there any way to use wide to long prior to pandas renaming my columns?
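The renaming can be reproduced without an Excel file (a small sketch; the CSV here just stands in for the sheet, since pd.read_excel deduplicates repeated headers the same way pd.read_csv does):

```python
import io
import pandas as pd

# CSV stand-in for the Excel sheet with repeated header names
data = """name,size,color,material,size,color,material,size,color,material
bob,m,red,coton,m,yellow,cotton,m,green,dri-fit"""

df = pd.read_csv(io.StringIO(data))
print(list(df.columns))
# ['name', 'size', 'color', 'material', 'size.1', 'color.1', 'material.1',
#  'size.2', 'color.2', 'material.2']
```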

The column numbering is problematic for pd.wide_to_long, so we need to modify the first instance of each column name, adding a .0 suffix so they don't conflict with the stubs.
Sample Data
import pandas as pd
df = pd.read_clipboard()
print(df)
name size color material size.1 color.1 material.1 size.2 color.2 material.2
0 bob m red coton m yellow cotton m green dri-fit
1 james l green dri-fit l green cotton l red cotton
2 steve l green dri-fit l green cotton l red cotton
Code:
stubs = ['size', 'color', 'material']
d = {x: f'{x}.0' for x in stubs}
df.columns = [d.get(k, k) for k in df.columns]
res = pd.wide_to_long(df, i='name', j='num', sep='.', stubnames=stubs)
# size color material
#name num
#bob 0 m red coton
#james 0 l green dri-fit
#steve 0 l green dri-fit
#bob 1 m yellow cotton
#james 1 l green cotton
#steve 1 l green cotton
#bob 2 m green dri-fit
#james 2 l red cotton
#steve 2 l red cotton
res.groupby([*res]).size()
#size color material
#l green cotton 2
# dri-fit 2
# red cotton 2
#m green dri-fit 1
# red coton 1
# yellow cotton 1

value_counts
This relies on the columns keeping their duplicate names (so that df.get('size') returns all three size columns at once; np.ravel then flattens each group before the triples are zipped together):
import numpy as np

cols = ['size', 'color', 'material']
s = pd.value_counts([*zip(*map(np.ravel, map(df.get, cols)))])
(l, red, cotton) 2
(l, green, cotton) 2
(l, green, dri-fit) 2
(m, green, dri-fit) 1
(m, yellow, cotton) 1
(m, red, coton) 1
dtype: int64
Counter
And more to my liking
from collections import Counter
s = pd.Series(Counter([*zip(*map(np.ravel, map(df.get, cols)))]))
s.rename_axis(['size', 'color', 'material']).reset_index(name='freq')
size color material freq
0 m red coton 1
1 m yellow cotton 1
2 m green dri-fit 1
3 l green dri-fit 2
4 l green cotton 2
5 l red cotton 2

CODE BELOW:
df = pd.read_excel('C:/Users/me/Desktop/sovrflw_data.xlsx')
df.drop('name', axis=1, inplace=True)
# every row holds three (size, color, material) triples; stack them into one triple per row
arr = df.values.reshape(-1, 3)
df2 = pd.DataFrame(arr, columns=['size', 'color', 'material'])
df2['count'] = 1
df2.groupby(['size', 'color', 'material'], as_index=False).count()
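Since this approach depends only on the repeating (size, color, material) layout, it can be sketched without the Excel file, and groupby(...).size() avoids the dummy count column (the array below is the question's sample data, standing in for the sheet):

```python
import numpy as np
import pandas as pd

# the question's sample rows, one row per person, three triples per row
arr = np.array([['m', 'red', 'coton',   'm', 'yellow', 'cotton', 'm', 'green', 'dri-fit'],
                ['l', 'green', 'dri-fit', 'l', 'green', 'cotton', 'l', 'red', 'cotton'],
                ['l', 'green', 'dri-fit', 'l', 'green', 'cotton', 'l', 'red', 'cotton']])

# stack every triple into its own row, then tally
df2 = pd.DataFrame(arr.reshape(-1, 3), columns=['size', 'color', 'material'])
counts = df2.groupby(['size', 'color', 'material']).size().reset_index(name='count')
print(counts)
```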

Related

Create feature matrix from Dataframe

I would like to transform a dataframe into a feature matrix (actually, I'm not sure it is called a feature matrix).
df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'],
'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})
Car Color
0 Audi red
1 Toyota red
2 Chrysler blue
3 Toyota silver
4 Chrysler blue
5 Chrysler silver
I would like to create a matrix with cars and colors as index and columns where a True, or 1 shows a possible combination like follows:
Color Audi Chrysler Toyota
0 blue 0 1 0
1 red 1 0 1
2 silver 0 1 1
I can create a matrix and then iterate over the rows and enter the values, but this takes quite long. Is there a better way to create this matrix?
Kind regards,
Stephan
pivot_table would seem to apply here:
df.pivot_table(index="Car", columns="Color", aggfunc=len)
Which gives:
Color blue red silver
Car
Audi NaN 1.0 NaN
Chrysler 2.0 NaN 1.0
Toyota NaN 1.0 1.0
You specify the vertical component as the index column (Car), and the horizontal one as the columns component (Color), then provide a function to fill the cells (len).
Then, to nuance it a little, you could use fillna() to "paint" the empty cells with zeros, and apply a logical test to show which combinations are possible.
e.g.
df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0)>0
Which gives:
Color blue red silver
Car
Audi False True False
Chrysler True False True
Toyota False True True
And as a final bit of polish, you could run an applymap to get your 0/1 output:
(df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0)>0).applymap(lambda x : 1 if x==True else 0)
Giving:
Color blue red silver
Car
Audi 0 1 0
Chrysler 1 0 1
Toyota 0 1 1
Finally, this process is sometimes referred to in the literature as One Hot Encoding and there are some cool implementations such as this one from sklearn in case your investigations lead you in that direction.
Extending Thomas's answer, the code below should give exactly the output you desire:
import pandas as pd
df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'],
'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})
output = (df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0).T > 0).astype(int)
print(output)
Car Audi Chrysler Toyota
Color
blue 0 1 0
red 1 0 1
silver 0 1 1
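As a side note (not from the answers above), pd.crosstab builds the same matrix in one step; clipping the counts at 1 gives the 0/1 output directly:

```python
import pandas as pd

df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'],
                   'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})

# crosstab counts co-occurrences; clip(upper=1) turns counts into 0/1 flags
out = pd.crosstab(df['Color'], df['Car']).clip(upper=1)
print(out)
```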

How to count the occurrences of a value

How to count the number of occurrences for a histogram using dataframes:
d = {'color': ["blue", "green", "yellow", "red, blue", "green, yellow", "yellow, red, blue"],}
df = pd.DataFrame(data=d)
How do you go from
color
blue
green
yellow
red, blue
green, yellow
yellow, red, blue
to
color    occurance
blue     3
green    2
yellow   3
Let's try split by the regex ,\s* (a comma followed by zero or more whitespace characters), then explode into rows and value_counts to get the count of values:
s = (
df['color'].str.split(r',\s*')
.explode()
.value_counts()
.rename_axis('color')
.reset_index(name='occurance')
)
Or can split and expand then stack:
s = (
df['color'].str.split(r',\s*', expand=True)
.stack()
.value_counts()
.rename_axis('color')
.reset_index(name='occurance')
)
s:
color occurance
0 blue 3
1 yellow 3
2 green 2
3 red 2
Here is another way, using .str.get_dummies():
df['color'].str.get_dummies(sep=', ').sum()
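For completeness, a runnable version of the get_dummies approach on the sample data (get_dummies creates one indicator column per color, so the column sums are the counts):

```python
import pandas as pd

df = pd.DataFrame({'color': ["blue", "green", "yellow", "red, blue",
                             "green, yellow", "yellow, red, blue"]})

# one indicator column per color, 1 where that color appears in the row
counts = df['color'].str.get_dummies(sep=', ').sum()
print(counts)
# blue      3
# green     2
# red       2
# yellow    3
```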

Scatter Plot With Multi Column Data in Plotly Express

I have a pandas dataframe like below
x s y
Date
2021-06-25 1 red 2
2021-06-28 2 red 3
2021-06-29 3 red 4
2021-06-25 1 blue 2
2021-06-28 2 blue 3
2021-06-29 3 blue 4
How can I create a scatter plot in plotly[express]/cufflinks like
such that plot axes are x and y and red and blue are separate traces in the plot.
Any hint ?
Your sample data looks problematic: red and blue have exactly the same values, so the traces would sit on top of each other. I have added .5 to blue to demonstrate.
First use simple pandas to restructure the data so the colors become columns,
then use plotly express scatter().
import pandas as pd
import io
import plotly.express as px
df = pd.read_csv(io.StringIO("""Date x s y
2021-06-25 1 red 2
2021-06-28 2 red 3
2021-06-29 3 red 4
2021-06-25 1 blue 2.5
2021-06-28 2 blue 3.5
2021-06-29 3 blue 4.5"""), sep=r"\s+").set_index("Date")
df2 = df.set_index(["x","s"]).unstack("s").droplevel(0,1)
px.scatter(df2, x=df2.index, y=df2.columns)
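Worth noting (not from the answer above): plotly express can also consume the long-form frame directly, using the color argument to split traces, which makes the unstack step optional. A sketch with the plotting call left commented so it runs headless:

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""Date x s y
2021-06-25 1 red 2
2021-06-28 2 red 3
2021-06-29 3 red 4
2021-06-25 1 blue 2.5
2021-06-28 2 blue 3.5
2021-06-29 3 blue 4.5"""), sep=r"\s+")

# each unique value in the column passed to color= becomes its own trace:
# import plotly.express as px
# fig = px.scatter(df, x="x", y="y", color="s")
print(sorted(df["s"].unique()))
```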

Reset the categories of categorical index in Pandas

I have a dataframe with a column being categorical.
I remove all the rows belonging to one of the categories.
How can I make sure the resulting dataframe has only those categories that exist and does not keep the deleted categories in its index?
df = pd.DataFrame({'color':np.random.choice(['Blue','Green','Brown','Red'], 50)})
df.color = df.color.astype('category')
df.color.head()
Output:
0 Blue
1 Green
2 Blue
3 Green
4 Brown
Name: color, dtype: category
Categories (4, object): [Blue, Brown, Green, Red]
Remove Brown from dataframe and category.
df = df.query('color != "Brown"')
df.color = df.color.cat.remove_categories('Brown')
df.color.head()
Output:
0 Blue
1 Green
2 Blue
3 Green
7 Red
Name: color, dtype: category
Categories (3, object): [Blue, Green, Red]
How can I make sure the resulting dataframe has only those categories that exist and does not keep the deleted categories in its index?
There's (now?) a pandas function doing exactly that: remove_unused_categories
This function's only parameter, inplace, has been deprecated since pandas 1.2.0. Hence, the following example (based on Scott's answer) does not use inplace:
>>> df = pd.DataFrame({'color':np.random.choice(['Blue','Green','Brown','Red'], 50)})
... df.color = df.color.astype('category')
... df.color.head()
0 Green
1 Brown
2 Blue
3 Red
4 Brown
Name: color, dtype: category
Categories (4, object): [Blue, Brown, Green, Red]
>>> df = df[df.color != "Brown"]
... df.color = df.color.cat.remove_unused_categories()
... df.color.head()
0 Green
2 Blue
3 Red
5 Red
6 Green
Name: color, dtype: category
Categories (3, object): [Blue, Green, Red]
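A minimal round-trip check of remove_unused_categories, using fixed values instead of the random sample above:

```python
import pandas as pd

s = pd.Series(['Blue', 'Green', 'Brown', 'Red'], dtype='category')
s = s[s != 'Brown']                # the rows are gone...
print(list(s.cat.categories))      # ...but 'Brown' is still a category
# ['Blue', 'Brown', 'Green', 'Red']

s = s.cat.remove_unused_categories()
print(list(s.cat.categories))
# ['Blue', 'Green', 'Red']
```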

(pandas) Create new column based on first element in groupby object

Say I have the following dataframe:
>>> df = pd.DataFrame({'Person': ['bob', 'jim', 'joe', 'bob', 'jim', 'joe'], 'Color':['blue', 'green', 'orange', 'yellow', 'pink', 'purple']})
>>> df
Color Person
0 blue bob
1 green jim
2 orange joe
3 yellow bob
4 pink jim
5 purple joe
And I want to create a new column that represents the first color seen for each person:
Color Person First Color
0 blue bob blue
1 green jim green
2 orange joe orange
3 yellow bob blue
4 pink jim green
5 purple joe orange
I have come to a solution but it seems really inefficient:
>>> df['First Color'] = 0
>>> groups = df.groupby(['Person'])['Color']
>>> for g in groups:
... first_color = g[1].iloc[0]
... df['First Color'].loc[df['Person']==g[0]] = first_color
Is there a faster way to do this all at once where it doesn't have to iterate through the groupby object?
You need transform with first:
print (df.groupby('Person')['Color'].transform('first'))
0 blue
1 green
2 orange
3 blue
4 green
5 orange
Name: Color, dtype: object
df['First_Col'] = df.groupby('Person')['Color'].transform('first')
print (df)
Color Person First_Col
0 blue bob blue
1 green jim green
2 orange joe orange
3 yellow bob blue
4 pink jim green
5 purple joe orange
Use the transform() method:
In [177]: df['First_Col'] = df.groupby('Person')['Color'].transform('first')
In [178]: df
Out[178]:
Color Person First_Col
0 blue bob blue
1 green jim green
2 orange joe orange
3 yellow bob blue
4 pink jim green
5 purple joe orange
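An equivalent alternative (my addition, not from the answers): build a Person-to-first-Color lookup with groupby(...).first() and broadcast it back with map:

```python
import pandas as pd

df = pd.DataFrame({'Person': ['bob', 'jim', 'joe', 'bob', 'jim', 'joe'],
                   'Color': ['blue', 'green', 'orange', 'yellow', 'pink', 'purple']})

# groupby(...).first() yields one first-Color per Person; map broadcasts it back per row
first = df.groupby('Person')['Color'].first()
df['First_Col'] = df['Person'].map(first)
print(df)
```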