Cumulative histogram plot from dataframe - pandas

The goal is to create a plot like this
Dummy df:
import pandas as pd

columns = ['number_of_words', 'occurrences']
data = [[1, 2312252],
        [2, 1000000],
        [3, 800000],
        [4, 400000],
        [5, 100000],
        [6, 70000],
        [7, 40000],
        [8, 10000],
        [9, 4000],
        [10, 50]]
dummy_df = pd.DataFrame(columns=columns, data=data)
The y axis represents the occurrences and the x axis the number_of_words column from dummy_df.
The y axis should be cumulative, so that each bar stacks on top of the values before it.
Example: With number_of_words = 1 we have around 2.3m occurrences. With number_of_words = 2 we have around 1m occurrences, so the plot should show 2.3m + 1m at number_of_words = 2.
At the final entry of number_of_words the histogram should reach sum(occurrences).
I do NOT want to normalize it.

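Since you have already worked out the frequencies, just add them up cumulatively:
dummy_df['acc'] = dummy_df['occurrences'].cumsum()
# Blue bars: the cumulative total at each number_of_words
ax = dummy_df['acc'].plot(kind='bar', width=1, color='b')
# Red overlay: the previous cumulative total, drawn on the same axes
dummy_df['acc'].shift().plot(kind='bar', alpha=0.7, width=1, color='r', ax=ax)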
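As a quick sanity check on the cumulative column (a minimal sketch that reuses the question's dummy data and does no plotting), the last value of cumsum() should equal the plain sum, as the question requires:

```python
import pandas as pd

columns = ['number_of_words', 'occurrences']
data = [[1, 2312252], [2, 1000000], [3, 800000], [4, 400000], [5, 100000],
        [6, 70000], [7, 40000], [8, 10000], [9, 4000], [10, 50]]
dummy_df = pd.DataFrame(columns=columns, data=data)

# Running total of occurrences; nothing is normalized
dummy_df['acc'] = dummy_df['occurrences'].cumsum()

print(dummy_df['acc'].iloc[1])    # 3312252, i.e. 2312252 + 1000000
print(dummy_df['acc'].iloc[-1])   # 4736302, equal to dummy_df['occurrences'].sum()
```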

To split each bar into parts, plot twice. The first plot is the normal cumsum; the second is just the values, with the shifted cumsum setting the bottom (this overlaps the top of the previously plotted cumsum).
Slicing the Series with .iloc[1:] just before plotting removes the first bar, which you want to exclude.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Full cumulative bars
dummy_df['occurrences'].cumsum().iloc[1:].plot(kind='bar', width=1, ec='k', ax=ax)
# The newly added part of each bar, starting at the previous cumulative total
dummy_df['occurrences'].iloc[1:].plot(kind='bar', width=1, ec='k',
    bottom=dummy_df['occurrences'].cumsum().shift().fillna(0).iloc[1:], ax=ax, color='red')
plt.show()
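To see why the shifted cumsum works as the bottom of each red segment, here is a small sketch (using a short, made-up list of values rather than the full data) showing that bottom plus height reproduces the cumsum:

```python
import pandas as pd

# Hypothetical short series standing in for the 'occurrences' column
occ = pd.Series([5, 3, 2])

top = occ.cumsum()               # 5, 8, 10: top edge of each stacked bar
bottom = top.shift().fillna(0)   # 0, 5, 8: where each new segment starts

# Each segment sits exactly between the previous total and the new total
print((bottom + occ).tolist())   # [5.0, 8.0, 10.0]
```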

Related

PyPlot line plot changing color by column value

I have a dataframe with a structure similar to the following example.
df = pd.DataFrame({'x': ['2008-01-01', '2008-01-02', '2008-01-03', '2008-01-04'], 'y': [1, 2, 3, 6],
'group_id': ['OBSERVED', 'IMPUTED', 'OBSERVED', 'IMPUTED'], 'color': ['blue', 'red', 'blue', 'red']})
df['x'] = pd.to_datetime(df['x'])
I.e. a dataframe where some of the values (y) are observed and others are imputed.
x y group_id color
0 2008-01-01 1 OBSERVED blue
1 2008-01-02 2 IMPUTED red
2 2008-01-03 3 OBSERVED blue
3 2008-01-04 6 IMPUTED red
How do I create a single line which changes color based on the group_id (the column color is uniquely determined by group_id, as in this example)?
I have tried the following two solutions (one of them is commented out):
df_grp = df.groupby('group_id')
fig, ax = plt.subplots(1)
for gid, data in df_grp:
    #ax.plot(data['x'], data['y'], label=gid, color=data['color'].unique().tolist()[0])
    data.plot('x', 'y', label=gid, ax=ax)
plt.legend()
plt.show()
However, the plot is neither a single line nor colored correctly by each segment.
You can use the code below to color each segment by the group of its starting point (forward-looking colors). The key is to get the data into the right shape in the dataframe, so that the plotting becomes easy. You can print(df) after the manipulation to see what was done. Essentially, I added the x and y of the following row as extra columns in the current row, for all rows except the last. I also included a marker in the resulting color so that you can tell whether the color is red or blue. Note that the dates in the x column must be in ascending order.
# Add x_end column to df from the subsequent row of column x
end_date = df.iloc[1:, :]['x'].reset_index(drop=True).rename('x_end')
df = pd.concat([df, end_date], axis=1)
# Add y_end column to df from the subsequent row of column y
end_y = df.iloc[1:, :]['y'].reset_index(drop=True).astype(int).rename('y_end')
df = pd.concat([df, end_y], axis=1)
# Fill the last row with its own x and y so its marker gets the right color
df.iat[len(df)-1, 4] = df.iat[len(df)-1, 0]
df.iat[len(df)-1, 5] = df.iat[len(df)-1, 1]
# Draw one segment per row: (x, y) -> (x_end, y_end), in that row's color
for i in range(len(df)):
    plt.plot([df.iloc[i, 0], df.iloc[i, 4]], [df.iloc[i, 1], df.iloc[i, 5]], marker='o', c=df.iat[i, 3])
Output plot
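The same "next row as extra columns" idea can also be sketched with shift(-1) instead of concat and positional indexing. This is only a variant under the same assumption that x is sorted in ascending order; the x_end and y_end column names follow the answer above, and the Agg backend line is just there so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'x': ['2008-01-01', '2008-01-02', '2008-01-03', '2008-01-04'],
                   'y': [1, 2, 3, 6],
                   'group_id': ['OBSERVED', 'IMPUTED', 'OBSERVED', 'IMPUTED'],
                   'color': ['blue', 'red', 'blue', 'red']})
df['x'] = pd.to_datetime(df['x'])

# End point of each segment is the next row's point;
# the last row has no successor, so it points at itself
df['x_end'] = df['x'].shift(-1).fillna(df['x'])
df['y_end'] = df['y'].shift(-1).fillna(df['y'])

fig, ax = plt.subplots()
for row in df.itertuples(index=False):
    # One two-point line per row, drawn in that row's color
    ax.plot([row.x, row.x_end], [row.y, row.y_end], marker='o', c=row.color)
```

Using named columns avoids the hard-coded positions 4 and 5, which would silently break if a column were added to the dataframe.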

How to put 3 seaborn scatter plots under one another?

I want to combine all 3 seaborn scatter plots under one "frame".
plt.figure(figsize=(7,15))
plt.subplots(3,1)
sns.scatterplot(x=train['Garage Area'], y=train['SalePrice'])
plt.show()
sns.scatterplot(x=train['Gr Liv Area'], y=train['SalePrice'])
plt.show()
sns.scatterplot(x=train['Overall Cond'], y=train['SalePrice'])
plt.show()
But it creates 5, the first 3 are small according to (7,15) size but the last 2 are different.
I suspect it should be
plt.figure(figsize=(7,15))
fig,ax = plt.subplots(3,1)
ax[0] = fig.add_subplot(sns.scatterplot(x=train['Garage Area'], y=train['SalePrice']))
#plt.show()
ax[1] = fig.add_subplot(sns.scatterplot(x=train['Gr Liv Area'], y=train['SalePrice']))
#plt.show()
ax[2] =fig.add_subplot(sns.scatterplot(x=train['Overall Cond'], y=train['SalePrice']))
plt.show()
but all 3 plots end up in the third (last) chart!
The following is one way to do it:
Create a figure with 3 subplots (3 rows, 1 column).
Pass the respective subplot, ax[0], ax[1] and ax[2], to the three separate sns.scatterplot calls using the keyword ax.
import matplotlib.pyplot as plt
import seaborn as sns

# One figure, three stacked axes; each scatterplot is sent to its own axes
fig, ax = plt.subplots(3, 1, figsize=(7,15))
sns.scatterplot(x=train['Garage Area'], y=train['SalePrice'], ax=ax[0])
sns.scatterplot(x=train['Gr Liv Area'], y=train['SalePrice'], ax=ax[1])
sns.scatterplot(x=train['Overall Cond'], y=train['SalePrice'], ax=ax[2])
plt.show()

Matplotlib: how to divide a histogram by a constant number

I would like to apply a custom normalization to histograms in matplotlib. In particular, I have two histograms and I would like to divide each of them by a given number (the number of generated events).
I don't want to normalize it in the usual way, because the usual normalization makes the area equal to 1. What I want is to divide the value of each bin by a given number N, so that if my histogram has 2 bins, one with 5 entries and one with 3, the resulting "normalized" (or "divided") histogram would have the first bin with 5/N entries and the second one with 3/N.
I searched far and wide and found nothing really helpful. Do you have any handy solution? This is my code, working with pandas:
num_bins = 128
list_1 = dataframe_1['E']
list_2 = dataframe_2['E']
fig, ax = plt.subplots()
ax.set_xlabel('Proton energy [MeV]')
ax.set_ylabel('Normalized frequency')
ax.set_title('Proton energy distribution')
n, bins, patches = ax.hist(list_1, num_bins, density=1, alpha=0.5, color='red', ec='red', label='label_1')
n, bins, patches = ax.hist(list_2, num_bins, density=1, alpha=0.5, color='blue', ec='blue', label='label_2')
plt.legend(loc='upper center', fontsize='x-large')
fig.savefig('NiceTitle.pdf')
plt.close('all')
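One way to get this kind of division, sketched here under assumptions rather than taken from an answer in the thread, is the weights argument of ax.hist: if every entry carries the weight 1/N, each bin height becomes its count divided by N. The value N = 100 and the eight sample values are made up to mirror the 5-entry and 3-entry bins described above; density is left at its default so no area normalization is applied.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

N = 100  # hypothetical number of generated events
values = np.array([1.0] * 5 + [2.0] * 3)  # 5 entries in one bin, 3 in the other

fig, ax = plt.subplots()
# Each entry contributes 1/N instead of 1, so every bin height is count / N
weights = np.full(len(values), 1.0 / N)
n, bins, patches = ax.hist(values, bins=2, weights=weights)

print(n)  # bin heights, approximately 5/N and 3/N
```

With two datasets, each call to ax.hist would get its own weights array built from its own N.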

Customizing subplots in matplotlib

I want to place 3 plots using subplots. Two plots on the top row and one plot that will occupy the entire second row.
My code creates a gap between the top two plots and the lower plot. How can I correct that?
df_CI
Country China India
1980 5123 8880
1981 6682 8670
1982 3308 8147
1983 1863 7338
1984 1527 5704
fig = plt.figure() # create figure
ax0 = fig.add_subplot(221) # add subplot 1 (2 row, 2 columns, first plot)
ax1 = fig.add_subplot(222) # add subplot 2 (2 row, 2 columns, second plot).
ax2 = fig.add_subplot(313) # a 3 digit number where the hundreds represent nrows, the tens represent ncols
# and the units represent plot_number.
# Subplot 1: Box plot
df_CI.plot(kind='box', color='blue', vert=False, figsize=(20, 20), ax=ax0) # add to subplot 1
ax0.set_title('Box Plots of Immigrants from China and India (1980 - 2013)')
ax0.set_xlabel('Number of Immigrants')
ax0.set_ylabel('Countries')
# Subplot 2: Line plot
df_CI.plot(kind='line', figsize=(20, 20), ax=ax1) # add to subplot 2
ax1.set_title ('Line Plots of Immigrants from China and India (1980 - 2013)')
ax1.set_ylabel('Number of Immigrants')
ax1.set_xlabel('Years')
# Subplot 3: Bar plot
df_CI.plot(kind='bar', figsize=(20, 20), ax=ax2) # add to subplot 3
ax2.set_title('Bar Plots of Immigrants from China and India (1980 - 2013)')
ax2.set_xlabel('Years')
ax2.set_ylabel('Number of Immigrants')
plt.show()
I've always found subplots syntax a little difficult.
With these calls
ax0 = fig.add_subplot(221)
ax1 = fig.add_subplot(222)
you're dividing your figure in a 2x2 grid and filling the first row.
ax2 = fig.add_subplot(313)
Now you're dividing it in three rows and filling the last one.
You're basically creating two independent subplot grids, and there is no easy way to control how the subplots of one grid are spaced relative to those of the other.
A much easier and more pythonic way is to use gridspec to create a single finer grid and address it with Python slicing.
import matplotlib as mpl
import matplotlib.gridspec  # makes mpl.gridspec available
import matplotlib.pyplot as plt

fig = plt.figure()
gs = mpl.gridspec.GridSpec(2, 2, wspace=0.25, hspace=0.25) # 2x2 grid
ax0 = fig.add_subplot(gs[0, 0]) # first row, first col
ax1 = fig.add_subplot(gs[0, 1]) # first row, second col
ax2 = fig.add_subplot(gs[1, :]) # full second row
And now you can also easily tune spacing with wspace and hspace.
More complex layouts are also a lot easier; it's just the familiar slicing syntax.
fig = plt.figure()
gs = mpl.gridspec.GridSpec(10, 10, wspace=0.25, hspace=0.25)
fig.add_subplot(gs[2:8, 2:8])
fig.add_subplot(gs[0, :])
for i in range(5):
    fig.add_subplot(gs[1, (i*2):(i*2+2)])
fig.add_subplot(gs[2:, :2])
fig.add_subplot(gs[8:, 2:4])
fig.add_subplot(gs[8:, 4:9])
fig.add_subplot(gs[2:8, 8])
fig.add_subplot(gs[2:, 9])
fig.add_subplot(gs[3:6, 3:6])
# fancy colors
cmap = plt.get_cmap("viridis")  # mpl.cm.get_cmap was removed in newer matplotlib
naxes = len(fig.axes)
for i, ax in enumerate(fig.axes):
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_facecolor(cmap(float(i)/(naxes-1)))
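Applied to the question's data, the 2x2 gridspec layout could look roughly like the sketch below. It uses only the five df_CI rows shown in the question; the figure= argument and the Agg backend line are choices made for this sketch, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import pandas as pd

# The five rows shown in the question
df_CI = pd.DataFrame({'China': [5123, 6682, 3308, 1863, 1527],
                      'India': [8880, 8670, 8147, 7338, 5704]},
                     index=[1980, 1981, 1982, 1983, 1984])

fig = plt.figure(figsize=(20, 20))
gs = GridSpec(2, 2, figure=fig, wspace=0.25, hspace=0.25)  # one shared 2x2 grid
ax0 = fig.add_subplot(gs[0, 0])  # top left
ax1 = fig.add_subplot(gs[0, 1])  # top right
ax2 = fig.add_subplot(gs[1, :])  # whole bottom row

df_CI.plot(kind='box', vert=False, ax=ax0)
df_CI.plot(kind='line', ax=ax1)
df_CI.plot(kind='bar', ax=ax2)
```

Because all three axes come from the same grid, the bottom plot lines up with the outer edges of the two top plots and the vertical gap is controlled by hspace alone.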

How should arrays for plot_surface be built?

I'm trying to understand how to build arrays for use in plot_surface (in Axes3d).
I tried to build a simple surface manipulating data of those arrays:
In [106]: x
Out[106]:
array([[0, 0],
       [0, 1],
       [0, 0]])
In [107]: y
Out[107]:
array([[0, 0],
       [1, 1],
       [0, 0]])
In [108]: z
Out[108]:
array([[0, 0],
       [1, 1],
       [2, 2]])
But I can't figure out how they are interpreted. For example, there is nothing at z=2 on my plot.
Could anybody please explain exactly which values are used to make a point, which a line, and finally a surface?
For example, I would like to build a surface bounded by the lines connecting these points:
[0,0,0]->[1,1,1]->[0,0,2]
[0,0,0]->[1,-1,1]->[0,0,2]
and a surface between those lines.
What should arrays for plot_surface look like to get something like this?
Understanding how the grids in plot_surface work is not easy. So first I'll give a general explanation, and then I'll explain how to convert the data in your case.
If you have an array of N x values and an array of M y values, you need to create two grids of x and y values of dimension (M,N) each. Fortunately numpy.meshgrid will help. Confused? See an example:
import numpy as np

x = np.arange(3)
y = np.arange(1, 5)
X, Y = np.meshgrid(x, y)
The element (x[i], y[j]) is accessed as (X[j,i], Y[j,i]). And its Z value is, of course, Z[j,i], which you also need to define.
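A small sketch of that indexing rule, with a made-up Z (here simply X + Y) added only to show that it lives on the same grid:

```python
import numpy as np

x = np.arange(3)       # N = 3 x values: 0, 1, 2
y = np.arange(1, 5)    # M = 4 y values: 1, 2, 3, 4
X, Y = np.meshgrid(x, y)

print(X.shape, Y.shape)   # (4, 3) (4, 3), i.e. (M, N)

# The grid point built from x[i] and y[j] lives at row j, column i
i, j = 2, 1
print(X[j, i], Y[j, i])   # 2 2, i.e. x[2] and y[1]

# A hypothetical Z, defined on the same grid, has the same shape
Z = X + Y
print(Z[j, i])            # 4
```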
Having said that, your data does produce a point of the surface at (0,0,2), as expected. In fact, there are two points at that position, coming from array indices [2,0] and [2,1].
I attach the result of plotting your arrays (called X, Y and Z here) with:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')
surf = ax.plot_surface(X, Y, Z)
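For the surface asked about in the question, one possible layout of the arrays (a sketch, not the only valid arrangement) is to let each column trace one of the two lines and each row be one step along them. plot_surface then fills the patches between neighbouring rows and columns. The shade=False argument and the Agg backend line are choices made for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # only needed on older matplotlib

# Column 0 traces [0,0,0] -> [1,1,1]  -> [0,0,2]
# Column 1 traces [0,0,0] -> [1,-1,1] -> [0,0,2]
x = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([[0.0, 0.0], [1.0, -1.0], [0.0, 0.0]])
z = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')
# Neighbouring entries of the 2D arrays are joined into surface patches
ax.plot_surface(x, y, z, shade=False)
```

Since both lines share their start and end points, the two patches here collapse to triangles meeting along the segment between (1,1,1) and (1,-1,1).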
If I understand you correctly, you are trying to interpolate a surface through a set of points. I don't think plot_surface is the right function for this, but correct me if I'm wrong. I think you should look at interpolation tools, probably those in scipy.interpolate. The result of the interpolation can then be plotted using plot_surface.
plot_surface plots a grid of z values in 3D space based on x, y coordinates. The x and y arrays are those created by numpy.meshgrid.
Example of plot_surface:
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib

plt.ion()
x = np.arange(0, np.pi, 0.1)
y = x.copy()
n = len(x)  # 32 samples
z = np.sin(x).repeat(n).reshape(n, n)  # row i holds sin(x[i]) in every column
X, Y = np.meshgrid(x, y)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, z, cmap=plt.cm.jet, cstride=1, rstride=1)
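The repeat/reshape construction of z can be hard to read. As a rough check (a sketch with no plotting), it gives the same array as evaluating the sine directly on the meshgrid's Y, which also shows that the surface in this example varies along y rather than x:

```python
import numpy as np

x = np.arange(0, np.pi, 0.1)
y = x.copy()
n = len(x)
X, Y = np.meshgrid(x, y)

z_repeat = np.sin(x).repeat(n).reshape(n, n)  # construction used in the example
z_grid = np.sin(Y)                            # same values, computed on the grid

print(np.allclose(z_repeat, z_grid))          # True
```

Computing z from X and Y directly tends to be less error-prone, since the shape always matches the grid.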