Graphing the sum and average; pandas - pandas

total_income = df.groupby('title_year')['gross'].sum()
average_income = df.groupby('title_year')['gross'].mean()
print(plt.semilogy(total_income,average_income))
I want to plot the total and the average income on the same graph as two lines, with the x-axis showing the years 1916 to 2016 and the y-axis in dollars. My code isn't doing that. How should I change it to get this plot?
Here is the output of my code:

This is my data file named data.csv:
year,gross
2015,45
2015,47
2015,49
2016,76
2016,78
2016,87
2017,103
2017,115
2017,133
1.) This is the complete code for a plot with a logarithmic y-axis:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("data.csv")
total_income = df.groupby('year')['gross'].sum()
average_income = df.groupby('year')['gross'].mean()
total_income.plot(label="Total Income")
average_income.plot(label="Average Income")
plt.xlabel("Year")
plt.ylabel("log$_{10}$(Gross)")
plt.yscale("log")
plt.legend()
plt.tight_layout()
plt.savefig("plot.png")
2.) This is how you use plt.semilogy():
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("data.csv")
total_income = pd.DataFrame(df.groupby('year')['gross'].sum())
average_income = pd.DataFrame(df.groupby('year')['gross'].mean())
plt.semilogy(total_income.index, total_income["gross"],
label="Total Income")
plt.semilogy(average_income.index, average_income["gross"],
label="Average Income")
plt.xlabel("Year")
plt.ylabel("log$_{10}$(Gross)")
plt.legend()
plt.tight_layout()
plt.savefig("plot.png")
Methods 1.) and 2.) produce the same plot.
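The question also asks for dollar amounts on the y-axis, which neither method covers. The following is a minimal sketch of one way to do that with matplotlib's FuncFormatter. The helper name dollars and the summary frame are illustrative assumptions, and the CSV text is the sample data from above read from a string so the example is self-contained.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter

# Sample data from above, read from a string instead of data.csv
csv_text = """year,gross
2015,45
2015,47
2015,49
2016,76
2016,78
2016,87
2017,103
2017,115
2017,133
"""
df = pd.read_csv(io.StringIO(csv_text))

# One groupby that yields both the total and the average per year
summary = df.groupby("year")["gross"].agg(total="sum", average="mean")


def dollars(value, pos=None):
    """Hypothetical helper: format a tick value as whole dollars."""
    return f"${value:,.0f}"


fig, ax = plt.subplots()
ax.semilogy(summary.index, summary["total"], label="Total Income")
ax.semilogy(summary.index, summary["average"], label="Average Income")
ax.set_xticks(summary.index)  # one tick per year, no fractional years
ax.yaxis.set_major_formatter(FuncFormatter(dollars))
ax.set_xlabel("Year")
ax.set_ylabel("Gross (log scale)")
ax.legend()
fig.tight_layout()
```

With the full data set, a call such as ax.set_xlim(1916, 2016) should restrict the x-axis to the requested range, and fig.savefig("plot.png") writes the figure to disk.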

Related

How make scatterplot in pandas readable

I've been playing with Titanic dataset and working through some visualisations in Pandas using this tutorial. https://www.kdnuggets.com/2023/02/5-pandas-plotting-functions-might-know.html
I produced a scatter plot matrix with this code.
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('train.csv')
I was confused by bootstrap plot result so went on to scatterplot.
pd.plotting.scatter_matrix(df, figsize=(10,10), )
plt.show()
I can sort of interpret it but I'd like to put the various variables at top and bottom of every column. Is that doable?
You can use seaborn and draw each pair of variables on its own axes (dataset stands for your DataFrame, and the column names are examples):
fig, ax = plt.subplots(4, 3, figsize=(20, 15))
sns.scatterplot(x='bedrooms', y='price', data=dataset, ax=ax[0, 0])
sns.scatterplot(x='bathrooms', y='price', data=dataset, ax=ax[0, 1])
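To label every column of the scatter matrix at the top as well as the bottom, one option is to use the array of axes that pd.plotting.scatter_matrix returns and set a title on each axes in the first row. This is a rough sketch under the assumption that only numeric columns are plotted. The random data and the column names are stand-ins, not the Titanic data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data; replace with df = pd.read_csv('train.csv')
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["age", "fare", "sibsp"])

# scatter_matrix only draws numeric columns, so select them explicitly
numeric = df.select_dtypes("number")
axes = pd.plotting.scatter_matrix(numeric, figsize=(10, 10))

# The bottom row already carries x-labels; add matching titles to the top row
for j, name in enumerate(numeric.columns):
    axes[0, j].set_title(name)

# plt.show() displays the figure in an interactive session
```

The returned axes array has one row and one column per numeric variable, so axes[0, j] is always the top panel of column j.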

Barplot per each ax in matplotlib

I have the following dataset, ratings in stars for two fictitious places:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
Since the rating is categorical (not continuous data), I convert it to a category:
df['rating_cat'] = pd.Categorical(df['rating'])
What I want is one bar plot for each fictitious place ('A' or 'B'), showing the count of each rating. This is the intended plot:
I guess a for loop over the values in id could work, but I am having trouble working out the bar heights:
fig, ax = plt.subplots(1,2,figsize=(6,6))
axs = ax.flatten()
cats = df['rating_cat'].cat.categories.tolist()
ids_uniques = df.id.unique()
for i in range(len(ids_uniques)):
    ax[i].bar(df[df['id']==ids_uniques[i]], df['rating'].size())
But it returns me an error TypeError: 'int' object is not callable
Perhaps it's something complicated what I am doing, please, could you guide me with this code
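On the error itself: Series.size is an attribute that holds the number of elements, so writing df['rating'].size() tries to call an integer. The per-group counts come from the size() method of a groupby object instead. A small sketch using the data from the question (the variable names n_rows, error_message and counts are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'rating': [1, 2, 4, 5, 5, 5, 3, 1, 3, 3, 3, 5, 2]})

# Series.size is a plain integer attribute, not a method
n_rows = df['rating'].size

# Calling it reproduces the TypeError from the question
try:
    df['rating'].size()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# GroupBy.size() is a method and returns the count per (id, rating) pair
counts = df.groupby(['id', 'rating']).size()

print(n_rows)
print(error_message)
print(counts)
```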
The pure matplotlib way:
from math import ceil
# Prepare the data for plotting
df_plot = df.groupby(["id", "rating"]).size()
unique_ids = df_plot.index.get_level_values("id").unique()
# Calculate the grid spec. This will be a n x 2 grid
# to fit one chart by id
ncols = 2
nrows = ceil(len(unique_ids) / ncols)
fig = plt.figure(figsize=(6,6))
for i, id_ in enumerate(unique_ids):
    # In a figure grid spanning nrows x ncols, plot into the
    # axes at position i + 1
    ax = fig.add_subplot(nrows, ncols, i+1)
    # pandas expects the target axes in the ax keyword
    df_plot.xs(id_).plot(ax=ax, kind="bar")
You can simplify things a lot with Seaborn:
import seaborn as sns
sns.catplot(data=df, x="rating", col="id", col_wrap=2, kind="count")
If you're ok with installing a new library, seaborn has a very helpful countplot. Seaborn uses matplotlib under the hood and makes certain plots easier.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
sns.countplot(
data = df,
x = 'rating',
hue = 'id',
)
plt.show()
plt.close()

Add a category without data in it to a plot in seaborn

I am plotting some data as a catplot like this:
ax = sns.catplot(x='Kind', y='VAF', hue='Sample', jitter=True, data=df, legend=False)
The trouble is that some of the categories of 'VAF' contain no data, and the corresponding label is not added to the plot. Is there a way to retain the label but just not plot any points for it?
Here is a reproducible example to help explain:
x=pd.DataFrame({'Data':[1,3,4,6,3,2],'Number':['One','One','One','One','Three','Three']})
plt.figure()
ax = sns.catplot(x='Number', y='Data', jitter=True, data=x)
In this plot you can see that on the x-axis, samples One and Three are displayed. But imagine that there is also a sample Two that just had no data points in it. How can I display One, Two, and Three on the x-axis?
Order parameter
Of course one would need to know which categories are expected. Given a list of expected categories, one can use the order parameter to supply the expected categories.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'Data':[1,3,4,6,3,2],
'Number':['One','One','One','One','Three','Three']})
exp_cats = ["One", "Two", "Three"]
ax = sns.stripplot(x='Number', y='Data', jitter=True, data=df, order=exp_cats)
plt.show()
Alternatives
The above works with matplotlib 2.2.3, but not with 3.0. It works again with the current development version (hence 3.1). For the moment, there are the following alternatives:
A. Looping over categories
Given a list of expected categories, one can just loop over them and plot a scatter of each category.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Data':[1,3,4,6,3,2],
'Number':['One','One','One','One','Three','Three']})
exp_cats = ["One", "Two", "Three"]
for i, cat in enumerate(exp_cats):
    cdf = df[df["Number"] == cat]
    x = np.zeros(len(cdf))+i+.2*(np.random.rand(len(cdf))-0.5)
    plt.scatter(x, cdf["Data"].values)
plt.xticks(range(len(exp_cats)), exp_cats)
plt.show()
B. Map categories to numbers.
You can map the expected categories to numbers and plot numbers instead of categories.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Data':[1,3,4,6,3,2],
'Number':['One','One','One','One','Three','Three']})
exp_cats = ["One", "Two", "Three"]
df["IntNumber"] = df["Number"].map(dict(zip(exp_cats, range(len(exp_cats)))))
plt.scatter(df["IntNumber"] + .2*(np.random.rand(len(df))-0.5), df["Data"].values,
c = df["IntNumber"].values.astype(int))
plt.xticks(range(len(exp_cats)), exp_cats)
plt.show()
C. Appending missing categories to the dataframe
Finally you may append nan values to the dataframe to make sure each expected category appears in it.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'Data':[1,3,4,6,3,2],
'Number':['One','One','One','One','Three','Three']})
exp_cats = ["One", "Two", "Three"]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
dfa = pd.concat([df, pd.DataFrame({'Data':[np.nan]*len(exp_cats), 'Number':exp_cats})],
                ignore_index=True)
ax = sns.stripplot(x='Number', y='Data', jitter=True, data=dfa, order=exp_cats)
plt.show()
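A further option, sketched here on the assumption that seaborn takes the category order from a pandas categorical column: declare the expected categories on the column itself, so that the empty category is part of the data type and no order argument is needed.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'Data': [1, 3, 4, 6, 3, 2],
                   'Number': ['One', 'One', 'One', 'One', 'Three', 'Three']})
exp_cats = ["One", "Two", "Three"]

# Store the expected categories, including the empty one, in the dtype
df["Number"] = pd.Categorical(df["Number"], categories=exp_cats, ordered=True)

# "Two" has no rows but is still a known category
print(df["Number"].value_counts(sort=False))

ax = sns.stripplot(x="Number", y="Data", data=df, jitter=True)
# plt.show() displays the figure in an interactive session
```

This keeps the list of expected categories in one place, which helps when the same frame feeds several plots.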

Python: How to get the number range in order

Trying to plot a number graph but the number range on x-axis is not ordered.
Code snippet:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def plot_sam_img(self):
    fig = plt.figure()
    plt.xlabel('Range')
    plt.ylabel('Total')
    df1 = pd.DataFrame({'Range':self.value_counts().index, 'Total':self.value_counts().values})
    try:
        df1_cut = df1.groupby(pd.cut(df1["Range"], [0,1,2,3,4,5,6,7,8,9,10,11,12,13,df1['Range'].max()])).sum()
    except Exception:
        # range replaces Python 2's xrange
        df1_cut = df1.groupby(pd.cut(df1["Range"], list(range(0, df1['Range'].max()+1)))).sum()
    objects = df1_cut['Total'].index
    y_pos = np.arange(len(df1_cut['Total'].index))
    plt.bar(df1_cut['Total'].index.values, df1_cut['Total'].values)
Actual O/P:
On x-axis: [0,1] [1,2] [10,11] [11,12] [2,3] [3,4] [4,5]......
Expected O/P:
On x-axis: [00] [01] [02] [03] [04]......[010] [011]
This way I think the order issue could be sorted out. Please correct if my understanding is wrong here. Thanks
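Zero-padding the labels is one way to make a text sort match the numeric order. An alternative that avoids relabelling is to keep the bins in the order pd.cut defines them and draw the bars at integer positions, using the interval labels only as tick text. The sketch below assumes that approach. The sample values and the names values, bins, counts and labels are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical sample values; replace with the real series
values = pd.Series([1, 2, 2, 3, 5, 8, 11, 12, 12, 13])
bins = list(range(0, 15))

# pd.cut returns an ordered categorical; sort_index keeps the bin order
counts = pd.cut(values, bins).value_counts().sort_index()

positions = np.arange(len(counts))          # 0, 1, 2, ... fixes the bar order
labels = [str(interval) for interval in counts.index]

fig, ax = plt.subplots()
ax.bar(positions, counts.values)
ax.set_xticks(positions)
ax.set_xticklabels(labels, rotation=90)
ax.set_xlabel('Range')
ax.set_ylabel('Total')
fig.tight_layout()
```

Because the bars are placed by number rather than by label, a bin such as (10, 11] can no longer end up next to (1, 2].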

pandas dataframe bar plot put space between bars

I want my plot to look like this:
But at the moment it looks like this:
How do I reduce the space between the bars without setting the bar width to 1?
Here is my code:
plot=repeat.loc['mean'].plot(kind='bar',rot=0,alpha=1,cmap='Reds',
    yerr=repeat.loc['std'],error_kw=dict(elinewidth=0.02,ecolor='grey'),
    align='center',width=0.2,grid=None)
plt.ylabel('')
plt.grid(False)
plt.title(cell,ha='center')
plt.xticks([])
plt.yticks([])
plt.ylim(0,120)
plt.tight_layout()
Make the plot from scratch if the top-level functions from pandas or seaborn do not give you the desired result! :)
import numpy as np
import seaborn as sns  # seaborn.apionly no longer exists in current seaborn releases
import matplotlib.pyplot as plt
# some fake data (NumPy functions; SciPy no longer re-exports them)
data = np.random.randn(10,10) + 1
# sort the rows by descending mean
data = data[np.argsort(np.average(data,axis=1))[::-1],:]
avg = np.average(data,axis=1)
std = np.std(data,axis=1)
# a practical helper from seaborn to quickly generate the colors
colors = sns.color_palette('Reds',n_colors = data.shape[0])
fig, ax = plt.subplots()
pos = range(10)
ax.bar(pos,avg,width=1)
# color each bar individually
for col,patch in zip(colors,ax.patches):
    patch.set_facecolor(col)
    patch.set_edgecolor('k')
# draw the upper error bars behind the bars
for i,p in enumerate(pos):
    ax.plot([p,p],[avg[i],avg[i]+std[i]],color='k',lw=2, zorder=-1)
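If the bars should stay narrow, the gap can also be controlled through the bar positions. A pandas bar plot puts the bars at 0, 1, 2, ..., so a width of 0.2 leaves 0.8 of empty space between neighbours. Plain matplotlib accepts any positions, as in this sketch. The means, stds, width and gap values are made-up placeholders for repeat.loc['mean'] and repeat.loc['std'].

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values standing in for repeat.loc['mean'] and repeat.loc['std']
means = np.array([80.0, 95.0, 60.0])
stds = np.array([5.0, 8.0, 4.0])

width = 0.2   # keep the narrow bars
gap = 0.05    # free space between neighbouring bars

# Centre the bars (width + gap) apart instead of 1 apart
positions = np.arange(len(means)) * (width + gap)

fig, ax = plt.subplots()
bars = ax.bar(positions, means, width=width, yerr=stds,
              error_kw=dict(elinewidth=1, ecolor='grey'),
              color=plt.cm.Reds(np.linspace(0.4, 0.9, len(means))))
ax.set_xticks([])
ax.set_yticks([])
ax.set_ylim(0, 120)
fig.tight_layout()
```

Changing gap alone then moves the bars closer together or further apart without touching their width.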