import numpy as np
df = df.dropna(subset=['genres']).reset_index(drop=True)
splitted = df['genres'].str.split('|')
l = splitted.str.len()
x = df['gross'] / df['budget']
df = pd.DataFrame({x: np.repeat(df[x], l), 'genres':np.concatenate(splitted)})
d = {'mean':'Average Income'}
df1 = df.groupby('genres')[x].agg(['mean']).rename(columns=d)
df1.plot.bar()
plt.yscale("log")
plt.xlabel("Genre")
I want to plot the average of 'x' for each genre, however many genres there are (since a single movie can have multiple genres, I split them into single ones), but I'm not sure what is wrong with my code. It's not doing what I want. I need some assistance.
Here's the error message
I think if you need to aggregate with only one function, it is more common to use groupby + mean:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'genres':['Comedy|Crime|Drama|Thriller','Comedy|Crime|Drama',
'Comedy|Crime','Drama|Thriller','Drama','Comedy|Crime'],
'gross':[10,20,30,40,50,60],
'budget':[3,4,5,3,2,5]})
df = df.dropna(subset=['genres']).reset_index(drop=True)
splitted = df['genres'].str.split('|')
l = splitted.str.len()
x = df['gross'] / df['budget']
#it is necessary to define a new column name (divided) and to change `df[x]` to `x`
df = pd.DataFrame({'divided': np.repeat(x, l), 'genres':np.concatenate(splitted)})
print (df)
divided genres
0 3.333333 Comedy
1 3.333333 Crime
2 3.333333 Drama
3 3.333333 Thriller
4 5.000000 Comedy
5 5.000000 Crime
6 5.000000 Drama
7 6.000000 Comedy
8 6.000000 Crime
9 13.333333 Drama
10 13.333333 Thriller
11 25.000000 Drama
12 12.000000 Comedy
13 12.000000 Crime
#aggregate the new column (divided), not x, because we are processing the new df created by repeat
#a Series has no columns to rename, so name the result column in reset_index instead
df1 = df.groupby('genres')['divided'].mean().reset_index(name='return')
df1.plot.bar(x='genres', y='return')
plt.yscale("log")
plt.xlabel("Genre")
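As an alternative to the repeat + concatenate approach above, here is a minimal sketch using `DataFrame.explode` (available in pandas 0.25 and later). It swaps in a different technique from the one in the answer, and the names `long` and `avg` are only illustrative.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    'genres': ['Comedy|Crime|Drama|Thriller', 'Comedy|Crime|Drama',
               'Comedy|Crime', 'Drama|Thriller', 'Drama', 'Comedy|Crime'],
    'gross': [10, 20, 30, 40, 50, 60],
    'budget': [3, 4, 5, 3, 2, 5],
})

# Gross-to-budget ratio per movie, stored under a real column name.
df['divided'] = df['gross'] / df['budget']

# One row per (movie, genre): split the string into a list, then explode it.
long = (df.dropna(subset=['genres'])
          .assign(genres=lambda d: d['genres'].str.split('|'))
          .explode('genres'))

# Average ratio per genre, as a Series indexed by genre.
avg = long.groupby('genres')['divided'].mean()
print(avg)
```

From there, `avg.plot.bar()` should give one bar per genre, since the genre is already the index.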
I have a dataframe which I want to plot with matplotlib, but the index column is the time and I cannot plot it.
This is the dataframe (df3):
but when I try the following:
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
I'm getting an error obviously:
KeyError: 'YYYY-MO-DD HH-MI-SS_SSS'
So what I want to do is add a new column to my dataframe (named 'Time') which is just a copy of the index column.
How can I do it?
This is the entire code:
#Importing the csv file into df
df = pd.read_csv('university2.csv', sep=";", skiprows=1)
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
#Add Magnetic Magnitude Column
df['magnetic_mag'] = np.sqrt(df['MAGNETIC FIELD X (μT)']**2 + df['MAGNETIC FIELD Y (μT)']**2 + df['MAGNETIC FIELD Z (μT)']**2)
#Subtract Earth's Average Magnetic Field from 'magnetic_mag'
df['magnetic_mag'] = df['magnetic_mag'] - 30
#Copy interesting values
df2 = df[[ 'ATMOSPHERIC PRESSURE (hPa)',
'TEMPERATURE (C)', 'magnetic_mag']].copy()
#Hourly Average and Standard Deviation for interesting values
df3 = df2.resample('H').agg(['mean','std'])
df3.columns = [' '.join(col) for col in df3.columns]
df3.reset_index()
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
Thank you !!
I think you need reset_index:
df3 = df3.reset_index()
Another possible solution, although I think inplace is not good practice:
df3.reset_index(inplace=True)
But if you need a new column, use:
df3['new'] = df3.index
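To make the difference concrete, here is a small sketch on a hypothetical stand-in for df3 (the index name `time` and the values are made up). It shows that calling `reset_index()` without assigning the result leaves the frame unchanged, which is why the KeyError in the question persists.

```python
import pandas as pd

# Hypothetical stand-in for df3: a few hourly values on a DatetimeIndex.
idx = pd.to_datetime(['2020-01-01 00:00', '2020-01-01 01:00',
                      '2020-01-01 02:00']).rename('time')
df3 = pd.DataFrame({'magnetic_mag mean': [1.0, 2.0, 3.0]}, index=idx)

# The result is discarded here, so df3 still has no 'time' column.
df3.reset_index()
print('time' in df3.columns)

# Assigning the result gives a frame where the index is a regular column.
with_col = df3.reset_index()
print(with_col.columns.tolist())

# Alternatively, keep the index and add a copy of it as a new column.
df3['Time'] = df3.index
```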
I think you can improve the read_csv call:
df = pd.read_csv('university2.csv',
                 sep=";",
                 skiprows=1,
                 index_col='YYYY-MO-DD HH-MI-SS_SSS',
                 parse_dates=['YYYY-MO-DD HH-MI-SS_SSS']) #if it doesn't parse, use pd.to_datetime
And then omit:
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
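Because the timestamps use a colon before the fractional seconds, automatic date parsing may not recognise them. A hedged sketch of the fallback: read the column as the index and convert it with an explicit format. The in-memory CSV below is a made-up stand-in for university2.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for university2.csv: one junk line, then the header.
raw = io.StringIO(
    "exported by logger\n"
    "YYYY-MO-DD HH-MI-SS_SSS;TEMPERATURE (C)\n"
    "2017-03-01 10:00:00:123;21.5\n"
    "2017-03-01 10:00:01:456;21.7\n"
)
col = 'YYYY-MO-DD HH-MI-SS_SSS'

# Use the timestamp column as the index right away.
df = pd.read_csv(raw, sep=';', skiprows=1, index_col=col)

# Convert the index with the same explicit format as in the question.
df.index = pd.to_datetime(df.index, format='%Y-%m-%d %H:%M:%S:%f')
print(df.index)
```

This keeps the single `read_csv` call from the answer and replaces only the two-step `to_datetime` + `set_index` from the question.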
EDIT: If the MultiIndex or Index comes from a groupby operation, possible solutions are:
df = pd.DataFrame({'A':list('aaaabbbb'),
'B':list('ccddeeff'),
'C':range(8),
'D':range(4,12)})
print (df)
A B C D
0 a c 0 4
1 a c 1 5
2 a d 2 6
3 a d 3 7
4 b e 4 8
5 b e 5 9
6 b f 6 10
7 b f 7 11
df1 = df.groupby(['A','B']).sum()
print (df1)
C D
A B
a c 1 9
d 5 13
b e 9 17
f 13 21
Add parameter as_index=False:
df2 = df.groupby(['A','B'], as_index=False).sum()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
Or add reset_index:
df2 = df.groupby(['A','B']).sum().reset_index()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
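As a quick check, the two variants above should produce the same flat frame for this sample data. A small sketch (the names `flat_a` and `flat_b` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': list('aaaabbbb'),
                   'B': list('ccddeeff'),
                   'C': range(8),
                   'D': range(4, 12)})

# Variant 1: keep the group keys as columns from the start.
flat_a = df.groupby(['A', 'B'], as_index=False).sum()

# Variant 2: build the MultiIndex first, then move it back into columns.
flat_b = df.groupby(['A', 'B']).sum().reset_index()

print(flat_a.equals(flat_b))
```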
You can access the index directly and plot it. Here is an example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
#Get index in horizontal axis
plt.plot(df.index, df[0])
plt.show()
#Get index in vertical axis
plt.plot(df[0], df.index)
plt.show()
You can also use eval to achieve this:
In [2]: df = pd.DataFrame({'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')}, index=list('ABCDE'))
In [3]: df
Out[3]:
num date
A 0 2022-06-30
B 1 2022-07-01
C 2 2022-07-02
D 3 2022-07-03
E 4 2022-07-04
In [4]: df.eval('index_copy = index')
Out[4]:
num date index_copy
A 0 2022-06-30 A
B 1 2022-07-01 B
C 2 2022-07-02 C
D 3 2022-07-03 D
E 4 2022-07-04 E
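Note that `eval` returns a new DataFrame by default, so the result has to be assigned to keep the column. A minimal sketch of the same idea with `assign`, which also returns a new frame (the name `with_copy` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'num': range(5),
                   'date': pd.date_range('2022-06-30', '2022-07-04')},
                  index=list('ABCDE'))

# assign returns a new frame; df itself is left untouched.
with_copy = df.assign(index_copy=df.index)

print(with_copy)
print('index_copy' in df.columns)
```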
I have a list of t-shirt orders along with the corresponding sizes, and I would like to plot a pie chart for each design showing the percentage sold in each size, which size sells the most, etc.
Design Total
0 Boba L 9
1 Boba M 4
2 Boba S 2
3 Boba XL 5
4 Burger L 6
5 Burger M 2
6 Burger S 3
7 Burger XL 1
8 Donut L 5
9 Donut M 9
10 Donut S 2
11 Donut XL 5
It is not completely clear what you are asking, but here is my interpretation:
df[['Design', 'Size']] = df['Design'].str.rsplit(n=1, expand=True)
fig, ax = plt.subplots(1, 3, figsize=(10,8))
ax = iter(ax)
for t, g in df.groupby('Design'):
    g.set_index('Size')['Total'].plot.pie(ax=next(ax), autopct='%.2f', title=f'{t}')
Maybe you want:
df = pd.read_clipboard() #create data from above text no modification
dfplot = df.loc[df.groupby(df['Design'].str.rsplit(n=1).str[0])['Total'].idxmax(), :]
ax = dfplot.set_index('Design')['Total'].plot.pie(autopct='%.2f')
ax.set_ylabel('');
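For the first interpretation above, the percentages shown on the pie charts can also be checked numerically. A rough sketch, assuming the data from the question is typed in by hand; the `Percent` column and the name `best` are my own additions:

```python
import pandas as pd

df = pd.DataFrame({
    'Design': ['Boba L', 'Boba M', 'Boba S', 'Boba XL',
               'Burger L', 'Burger M', 'Burger S', 'Burger XL',
               'Donut L', 'Donut M', 'Donut S', 'Donut XL'],
    'Total': [9, 4, 2, 5, 6, 2, 3, 1, 5, 9, 2, 5],
})

# Split e.g. "Boba XL" into design "Boba" and size "XL" at the last space.
df[['Design', 'Size']] = df['Design'].str.rsplit(n=1, expand=True)

# Share of each size within its design, in percent.
design_total = df.groupby('Design')['Total'].transform('sum')
df['Percent'] = 100 * df['Total'] / design_total

# Best-selling size per design.
best = df.loc[df.groupby('Design')['Total'].idxmax(),
              ['Design', 'Size', 'Percent']]
print(best)
```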
Let's use groupby.plot.pie:
s = (df.Design.str.split(expand=True)
       .assign(Total=df['Total'])
       .groupby(0)
       .plot.pie(x=1, y='Total', autopct='%.1f%%')
    )
# format the plots (Series.iteritems was removed in pandas 2.0, so use items)
for design, ax in s.items():
    ax.set_title(design)
One of the outputs:
I have the following DataFrame:
df = pd.DataFrame({'Idenitiy': ['Haus1', 'Haus2', 'Haus1','Haus2'],
'kind': ['Gas', 'Gas', 'Strom','Strom'],
'2005':[2,3,5,6],
'2006':[2,3.5,5.5,7]})
Now I would like to have the following dataframe as output, containing the product over the entities:
Year Product(Gas) Product(Strom)
2005 6 30
2006 6 38,5
2007 7 38,5
Thank you.
Here's a way to do it:
# multiply column values
from functools import reduce
def mult(f):
    v = [reduce(lambda a,b : a*b, f['2005']), reduce(lambda a,b : a*b, f['2006'])]
    return pd.Series(v, index=['2005','2006'])
# groupby and multiply column values
df1 = df.groupby('kind')[['2005','2006']].apply(mult).unstack().reset_index()
df1.columns = ['Year','Kind','vals']
print(df1)
Year Kind vals
0 2005 Gas 6.0
1 2005 Strom 30.0
2 2006 Gas 7.0
3 2006 Strom 38.5
# reshape the table
df1 = (df1
.pivot_table(index='Year', columns=['Kind'], values='vals'))
# fix column names
df1 = df1.add_prefix('Product_')
df1.columns.name = None
df1 = df1.reset_index()
Year Product_Gas Product_Strom
0 2005 6.0 30.0
1 2006 7.0 38.5
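The same table can probably be reached more directly with the built-in `prod` aggregation followed by a transpose. This is a sketch of an alternative, not the method used in the answer above; the name `out` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Idenitiy': ['Haus1', 'Haus2', 'Haus1', 'Haus2'],
                   'kind': ['Gas', 'Gas', 'Strom', 'Strom'],
                   '2005': [2, 3, 5, 6],
                   '2006': [2, 3.5, 5.5, 7]})

# Product per kind and year, then transpose so that the years become rows.
out = df.groupby('kind')[['2005', '2006']].prod().T.add_prefix('Product_')

# Drop the leftover 'kind' label and turn the year index into a column.
out.columns.name = None
out = out.rename_axis('Year').reset_index()
print(out)
```

This avoids hard-coding the year columns inside a helper function, so adding a further year column only means extending the column list.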
total_income = df.groupby('genres')['gross'].sum()
average_income = df.groupby('genres')['gross'].mean()
total_income.plot.bar(label="Total Income", color = 'r')
average_income.plot.bar(label="Average Income")
plt.xlabel("Genres")
plt.ylabel("Dollars (Gross)")
plt.yscale("log")
Here's my code that plots the sum and average of gross by movie genre. The problem is that when I plot the graph, it comes out completely black. I believe this is because the genre strings are long, since each one contains multiple genres.
How can I fix this so that it shows the graph and its genres? I need assistance.
You can use str.split to get lists, then str.len to get their lengths.
Finally, create a new DataFrame with the constructor, using numpy.repeat and numpy.concatenate:
df = pd.DataFrame({'genres':['Comedy|Crime|Drama|Thriller','Comedy|Crime|Drama','Comedy|Crime','Drama|Thriller','Drama','Comedy|Crime'],
'gross':[10,20,30,40,50,60]})
print (df)
genres gross
0 Comedy|Crime|Drama|Thriller 10
1 Comedy|Crime|Drama 20
2 Comedy|Crime 30
3 Drama|Thriller 40
4 Drama 50
5 Comedy|Crime 60
splitted = df['genres'].str.split('|')
l = splitted.str.len()
df = pd.DataFrame({'gross': np.repeat(df['gross'], l), 'genres':np.concatenate(splitted)})
print (df)
genres gross
0 Comedy 10
0 Crime 10
0 Drama 10
0 Thriller 10
1 Comedy 20
1 Crime 20
1 Drama 20
2 Comedy 30
2 Crime 30
3 Drama 40
3 Thriller 40
4 Drama 50
5 Comedy 60
5 Crime 60
d = {'mean':'Average','sum':'Total'}
df1 = df.groupby('genres')['gross'].agg(['sum','mean']).rename(columns=d)
print (df1)
Total Average
genres
Comedy 120 30
Crime 120 30
Drama 120 30
Thriller 50 25
df1.plot.bar()
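If you are on pandas 0.25 or later, named aggregation can give the same two columns without the separate rename step. A small sketch, assuming the already-split data from above is typed in by hand (the names `long` and `stats` are only illustrative):

```python
import pandas as pd

# The split data from above: one row per (movie, genre).
long = pd.DataFrame({
    'genres': ['Comedy', 'Crime', 'Drama', 'Thriller',
               'Comedy', 'Crime', 'Drama',
               'Comedy', 'Crime',
               'Drama', 'Thriller',
               'Drama',
               'Comedy', 'Crime'],
    'gross': [10, 10, 10, 10, 20, 20, 20, 30, 30, 40, 40, 50, 60, 60],
})

# Keyword names become the output column names directly.
stats = long.groupby('genres')['gross'].agg(Total='sum', Average='mean')
print(stats)
```

Calling `stats.plot.bar()` should then draw the two bars per genre, as with df1 above.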