matplotlib - plot merged dataframe with group bar - matplotlib

I am trying to plot a grouped bar chart from a merged dataframe. With the code below, the bars are drawn on top of each other. How can I put them side by side, as in a grouped bar chart?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
df1 = pd.DataFrame({
'key': ['A', 'B', 'C', 'D'],
'value':[ 10 ,6, 6, 8]})
df2 = pd.DataFrame({
'key': ['B', 'D', 'A', 'F'],
'value':[ 3, 5, 5, 7]})
df3 = pd.merge(df1, df2, how='inner', on=['key'])
print(df1)
print(df2)
print(df3)
fig, ax = plt.subplots(figsize=(12, 8))
b1 = ax.bar(df3['key'],df3['value_x'])
b2 = ax.bar(df3['key'],df3['value_y'])
pngname = "demo.png"
fig.savefig(pngname, dpi=fig.dpi)
print("[[./%s]]"%(pngname))
Current output:

The problem is that both calls use the same x values. In your case they are not numbers but the keys ("A", "B", ...), so matplotlib draws the second set of bars at the same positions as the first, on top of it.
There is a simple way around it, as some online tutorials show, e.g. https://www.geeksforgeeks.org/create-a-grouped-bar-plot-in-matplotlib/.
Basically, you enumerate the keys, i.e. A=0, B=1, D=2. Then choose the bar width you want (I chose 0.4, for example). Finally, shift one group of bars to the left by bar_width/2 and the other to the right by bar_width/2.
Perhaps the code explains it better than I did:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
df1 = pd.DataFrame({
'key': ['A', 'B', 'C', 'D'],
'value':[ 10 ,6, 6, 8]})
df2 = pd.DataFrame({
'key': ['B', 'D', 'A', 'F'],
'value':[ 3, 5, 5, 7]})
df3 = pd.merge(df1, df2, how='inner', on=['key'])
fig, ax = plt.subplots(figsize=(12, 8))
# modifications
x = np.arange(len(df3['key'])) # enumerate the keys
bar_width = 0.4 # choose bar width
b1 = ax.bar(x - bar_width/2,df3['value_x'], width=bar_width, label='value_x') # shift x values left
b2 = ax.bar(x + bar_width/2,df3['value_y'], width=bar_width, label='value_y') # shift x values right
plt.xticks(x, df3['key']) # replace x axis ticks with keys from df3.
plt.legend(['value_x', 'value_y'])
plt.show()
Result:
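As an alternative to shifting the bars by hand, here is a minimal sketch that lets pandas do the grouping. It assumes the same df1, df2 and merged df3 as above and that a pandas-drawn chart is acceptable; the width and rot values are only illustrative choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [10, 6, 6, 8]})
df2 = pd.DataFrame({'key': ['B', 'D', 'A', 'F'], 'value': [3, 5, 5, 7]})
df3 = pd.merge(df1, df2, how='inner', on='key')

fig, ax = plt.subplots(figsize=(12, 8))
# With 'key' as the index, pandas draws one bar per column, side by side.
df3.set_index('key')[['value_x', 'value_y']].plot.bar(ax=ax, width=0.8, rot=0)
ax.set_ylabel('value')
```

Call plt.show() or fig.savefig(...) afterwards to display or save the figure. The legend entries come from the column names, so no separate legend call is needed.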

Related

matplotlib stacked bar chart with zero centered

I have a dataset like the one below.
T/F   Value   category
T     1       A
F     3       B
T     5       C
F     7       A
T     8       B
...   ...     ...
So, I want to draw a bar chart like the one below: each category has a fixed position, the bars are centered on zero, the F bar goes below the horizontal line, and the T bar goes above it.
How can I make this chart with matplotlib.pyplot, or with another library?
I need an example.
One approach involves making the False values negative, and then creating a Seaborn barplot with T/F as hue. You might want to make a copy of the data if you can't change the original.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
data = pd.DataFrame({'T/F': ['T', 'F', 'T', 'F', 'T'],
'Value': [1, 3, 5, 7, 8],
'category': ['A', 'B', 'C', 'A', 'B']})
data['Value'] = np.where(data['T/F'] == 'T', data['Value'], -data['Value'])
ax = sns.barplot(data=data, x='category', y='Value', hue='T/F', dodge=False, palette='turbo')
ax.axhline(0, lw=2, color='black')
plt.tight_layout()
plt.show()
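For comparison, the same idea can be sketched with matplotlib and pandas only. This is an illustration under assumptions, not part of the answer above: it sums the values per category and flag (seaborn's barplot averages by default), and the name totals is hypothetical. With the sample data each combination occurs at most once, so sum and mean agree.

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame({'T/F': ['T', 'F', 'T', 'F', 'T'],
                     'Value': [1, 3, 5, 7, 8],
                     'category': ['A', 'B', 'C', 'A', 'B']})

# Sum the values per category and flag; missing combinations become 0.
totals = data.pivot_table(index='category', columns='T/F', values='Value',
                          aggfunc='sum', fill_value=0)

fig, ax = plt.subplots()
ax.bar(totals.index, totals['T'], label='T')    # bars above zero
ax.bar(totals.index, -totals['F'], label='F')   # bars below zero
ax.axhline(0, lw=2, color='black')
ax.legend()
```

Because both bar calls use the same category positions, the T and F bars of a category line up vertically. Call plt.show() to display the figure.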

Not able to create a 3x3 grid of subplots to visualize 9 Series individually

I want to have a 3x3 grid of subplots to visualize each Series individually.
I first created some toy data:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='whitegrid', rc={"figure.figsize":(14,6)})
rs = np.random.RandomState(444)
dates = pd.date_range(start="2009-01-01", end='2019-12-31', freq='1D')
values = rs.randn(4017,12).cumsum(axis=0)
data = pd.DataFrame(values, dates, columns =['a','b','c','d','e','f','h','i','j','k','l','m'])
Here is the first code I wrote:
fig, ax = plt.subplots(3, 3, sharex=True, sharey=True)
for col in n_cols:
    ax = data[col].plot()
With these lines of code, the problem is that I get the 3x3 grid, but all the columns are plotted on the same subplot Axes, in the bottom right corner.
[Image: bottom right subplot containing all the lines]
Here is the second thing I tried:
n_cols = ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i', 'j']
fig, ax = plt.subplots(3, 3, sharex=True, sharey=True)
for col in n_cols:
    for i in range(3):
        for j in range(3):
            ax[i,j].plot(data[col])
But now I get all the columns plotted on every single subplot Axes.
[Image: every subplot contains the same lines]
And if I try something like this:
fig, ax = plt.subplots(sharex=True, sharey=True)
for col in n_cols:
    for i in range(3):
        for j in range(3):
            ax[i,j].add_subplot(data[col])
But I get:
TypeError: 'AxesSubplot' object is not subscriptable
I am sorry but can't figure out what to do.
Currently you're plotting each series in each of the subplots:
for col in n_cols:
    for i in range(3):
        for j in range(3):
            ax[i,j].plot(data[col])
Following your example code, here is a way to only plot a single series per subplot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rs = np.random.RandomState(444)
dates = pd.date_range(start="2009-01-01", end='2019-12-31', freq='1D')
values = rs.randn(4017,12).cumsum(axis=0)
data = pd.DataFrame(values, dates, columns =['a','b','c','d','e','f','h','i','j','k','l','m'])
n_cols = ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i', 'j']
fig, ax = plt.subplots(3, 3, sharex=True, sharey=True)
for i in range(3):
    for j in range(3):
        col_name = n_cols[i*3+j]
        ax[i,j].plot(data[col_name])
plt.show()
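The index arithmetic above can also be avoided. The following is a minimal sketch of the same one-series-per-subplot idea, assuming the toy data from the question; it pairs each column with one Axes by iterating over axes.flat. The titles are an optional addition.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rs = np.random.RandomState(444)
dates = pd.date_range(start="2009-01-01", end="2019-12-31", freq="D")
values = rs.randn(len(dates), 12).cumsum(axis=0)
data = pd.DataFrame(values, index=dates, columns=list("abcdefhijklm"))
n_cols = list("abcdefhij")

fig, axes = plt.subplots(3, 3, sharex=True, sharey=True)
# axes.flat walks the 3x3 array row by row, so zip pairs one column with one Axes.
for ax, col in zip(axes.flat, n_cols):
    ax.plot(data[col])
    ax.set_title(col)
```

zip stops at the shorter of the two sequences, so the loop also behaves sensibly if there are fewer columns than subplots. Call plt.show() to display the figure.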

pandas scatter plot and groupby does not work

I am trying to do a scatter plot with pandas. Unfortunately kind='scatter' doesn't work. If I change this to kind='line' it works as expected. What can I do to fix this?
for label, d in df.groupby('m'):
    d[['te','n']].sort_values(by='n', ascending=False).plot(kind="scatter", x='n', y='te', ax=ax, label='m = '+str(label))
Use plot.scatter instead:
df = pd.DataFrame({'x': [0, 5, 7,3, 2, 4, 6], 'y': [0, 5, 7,3, 2, 4, 6]})
df.plot.scatter('x', 'y')
Use this snippet if you want individual labels and colours:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
'm': np.random.randint(0, 5, size=100),
'x': np.random.uniform(size=100),
'y': np.random.uniform(size=100),
})
fig, ax = plt.subplots()
for label, d in df.groupby('m'):
    # generate a random color:
    color = list(np.random.uniform(size=3))
    d.plot.scatter('x', 'y', label=f'group {label}', ax=ax, c=[color])
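If random colours are not required, a variant is to call ax.scatter directly and let matplotlib pick the colours. This is only a sketch under assumptions: the column names m, x and y follow the snippet above, and the seeded generator rng is a hypothetical addition to make the data reproducible.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'm': rng.randint(0, 5, size=100),
    'x': rng.uniform(size=100),
    'y': rng.uniform(size=100),
})

fig, ax = plt.subplots()
for label, d in df.groupby('m'):
    # Each ax.scatter call without an explicit colour takes the next
    # colour from the default property cycle.
    ax.scatter(d['x'], d['y'], label=f'm = {label}')
ax.legend()
```

Each group becomes its own collection on the Axes, so the legend shows one entry per value of m. Call plt.show() to display the figure.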

Obtaining the exact data coordinates of seaborn boxplot boxes

I have a seaborn boxplot (sns.boxplot) on which I would like to add some points. For example, say I have this pandas DataFrame:
[In] import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'Property 1':['a']*100+['b']*100,
'Property 2': ['w', 'x', 'y', 'z']*50,
'Value': np.random.normal(size=200)})
df.head(3)
[Out] Property 1 Property 2 Value
0 a w 1.421380
1 a x -1.034465
2 a y 0.212911
[In] df.shape
[Out] (200, 3)
I can easily generate a boxplot with seaborn:
[In] sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
[Out]
Now say I want to add markers for a specific case in my sample. I can get close with this:
[In] specific_case = pd.DataFrame([['a', 'w', 0.5],
['a', 'x', 0.2],
['a', 'y', 0.1],
['a', 'z', 0.3],
['b', 'w', -0.5],
['b', 'x', -0.2],
['b', 'y', 0.3],
['b', 'z', 0.5]
],
columns = df.columns
)
[In] sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
plt.plot(np.arange(-0.25, 3.75, 0.5),
specific_case['Value'].values, 'ro')
[Out]
That is unsatisfactory, of course.
I then used this answer, which talks about getting the bbox, and this tutorial about converting display coordinates into data coordinates, to write this function:
[In] import matplotlib as mpl
def get_x_coordinates_of_seaborn_boxplot(ax, x_or_y):
    display_coordinates = []
    inv = ax.transData.inverted()
    for c in ax.get_children():
        # the boxes are drawn as PathPatch artists
        if isinstance(c, mpl.patches.PathPatch):
            if x_or_y == 'x':
                display_coordinates.append(
                    (c.get_extents().xmin+c.get_extents().xmax)/2)
            if x_or_y == 'y':
                display_coordinates.append(
                    (c.get_extents().ymin+c.get_extents().ymax)/2)
    return inv.transform(tuple(display_coordinates))
That works great for my first hue, but not at all for my second:
[In] ax = sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
coords = get_x_coordinates_of_seaborn_boxplot(ax, 'x')
plt.plot(coords, specific_case['Value'].values, 'ro')
[Out]
How can I get the data coordinates of all my boxes?
I'm unsure about the purpose of those transformations. But it seems the real problem is just to plot the points from specific_case at the correct positions. The x-coordinate of every box is shifted by 0.2 from the whole number. (That is because each group of boxes is 0.8 wide by default; with 2 boxes, each is 0.4 wide, and half of that is 0.2.)
You then need to arrange the x values so that their order matches the rows of the specific_case dataframe.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'Property 1':['a']*100+['b']*100,
'Property 2': ['w', 'x', 'y', 'z']*50,
'Value': np.random.normal(size=200)})
# use numeric values; strings would be treated as categories on the y axis
specific_case = pd.DataFrame([['a', 'w', 0.5],
['a', 'x', 0.2],
['a', 'y', 0.1],
['a', 'z', 0.3],
['b', 'w', -0.5],
['b', 'x', -0.2],
['b', 'y', 0.3],
['b', 'z', 0.5]
], columns = df.columns )
ax = sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
X = np.repeat(np.atleast_2d(np.arange(4)),2, axis=0)+ np.array([[-.2],[.2]])
ax.plot(X.flatten(), specific_case['Value'].values, 'ro', zorder=4)
plt.show()
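The offset arithmetic above generalises to any number of hue levels. Here is a small sketch, assuming the boxes of one category share a total width of 0.8 as described above (this may differ between seaborn versions or if a width argument is passed); the helper names hue_offsets and box_centers are hypothetical.

```python
import numpy as np

def hue_offsets(n_hues, group_width=0.8):
    """Centre offsets of n_hues equally wide boxes sharing group_width."""
    box_width = group_width / n_hues
    return (np.arange(n_hues) - (n_hues - 1) / 2) * box_width

def box_centers(n_categories, n_hues, group_width=0.8):
    """x centres, ordered hue by hue (all of hue 0, then all of hue 1, ...)."""
    offsets = hue_offsets(n_hues, group_width)
    return (np.arange(n_categories) + offsets[:, None]).ravel()
```

For 4 categories and 2 hues, box_centers(4, 2) gives the same positions as X.flatten() in the code above: -0.2, 0.8, 1.8, 2.8 for the first hue, followed by 0.2, 1.2, 2.2, 3.2 for the second.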
I got it figured out:
In your code do this to extract the x-coordinate based on hue. I did not do it for the y, but the logic should be the same:
Create two lists holding your x coordinate:
display_coordinates_1=[]
display_coordinates_2=[]
Inside your for loop that starts with:
for c in ax.get_children():
Use the following:
display_coordinates_1.append(c.get_extents().x0)
You need x0 for the x-coordinate of boxplots under first hue.
The following gives you the x-coordinates for the subplots in the second hue. Note the use of x1 here:
display_coordinates_2.append(c.get_extents().x1)
Lastly, after you inv.transform() the two lists, make sure you select every other value, since for x-coordinates each list has 6 outputs and you want the ones at indices 0,2,4 or [::2].
Hope this helps.

Pandas bar plot -- specify bar color by column

Is there a simple way to specify bar colors by column name using the Pandas DataFrame.plot(kind='bar') method?
I have a script that generates multiple DataFrames from several different data files in a directory. For example it does something like this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pds
data_files = ['a', 'b', 'c', 'd']
df1 = pds.DataFrame(np.random.rand(4,3), columns=data_files[:-1])
df2 = pds.DataFrame(np.random.rand(4,3), columns=data_files[1:])
df1.plot(kind='bar', ax=plt.subplot(121))
df2.plot(kind='bar', ax=plt.subplot(122))
plt.show()
With the following output:
Unfortunately, the column colors aren't consistent for each label across the different plots. Is it possible to pass in a dictionary of (filenames:colors), so that any particular column always has the same color? For example, I could imagine creating this by zipping up the filenames with the Matplotlib color cycle:
data_files = ['a', 'b', 'c', 'd']
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']  # 'axes.color_cycle' is no longer available
print(list(zip(data_files, colors)))
[('a', '#1f77b4'), ('b', '#ff7f0e'), ('c', '#2ca02c'), ('d', '#d62728')]
I could figure out how to do this directly with Matplotlib: I just thought there might be a simpler, built-in solution.
Edit:
Below is a partial solution that works in pure Matplotlib. However, I'm using this in an IPython notebook that will be distributed to non-programmer colleagues, and I'd like to minimize the amount of excessive plotting code.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pds
data_files = ['a', 'b', 'c', 'd']
mpl_colors = plt.rcParams['axes.prop_cycle'].by_key()['color']  # replaces the removed 'axes.color_cycle'
colors = dict(zip(data_files, mpl_colors))
def bar_plotter(df, colors, sub):
    ncols = df.shape[1]
    width = 1./(ncols+2.)
    starts = df.index.values - width*ncols/2.
    plt.subplot(120+sub)
    for n, col in enumerate(df):
        # align='edge' keeps each bar starting at its computed left edge
        plt.bar(starts + width*n, df[col].values, color=colors[col],
                width=width, label=col, align='edge')
    plt.xticks(df.index.values)
    plt.grid()
    plt.legend()
df1 = pds.DataFrame(np.random.rand(4,3), columns=data_files[:-1])
df2 = pds.DataFrame(np.random.rand(4,3), columns=data_files[1:])
bar_plotter(df1, colors, 1)
bar_plotter(df2, colors, 2)
plt.show()
You can pass a list as the colors. This will require a little bit of manual work to get it to line up, unlike if you could pass a dictionary, but may be a less cluttered way to accomplish your goal.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pds
data_files = ['a', 'b', 'c', 'd']
df1 = pds.DataFrame(np.random.rand(4,3), columns=data_files[:-1])
df2 = pds.DataFrame(np.random.rand(4,3), columns=data_files[1:])
color_list = ['b', 'g', 'r', 'c']
df1.plot(kind='bar', ax=plt.subplot(121), color=color_list)
df2.plot(kind='bar', ax=plt.subplot(122), color=color_list[1:])
plt.show()
EDIT
Ajean came up with a simple way to return a list of the correct colors from a dictionary:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pds
data_files = ['a', 'b', 'c', 'd']
color_list = ['b', 'g', 'r', 'c']
d2c = dict(zip(data_files, color_list))
df1 = pds.DataFrame(np.random.rand(4,3), columns=data_files[:-1])
df2 = pds.DataFrame(np.random.rand(4,3), columns=data_files[1:])
# in Python 3, map returns an iterator, so convert it to a list
df1.plot(kind='bar', ax=plt.subplot(121), color=list(map(d2c.get, df1.columns)))
df2.plot(kind='bar', ax=plt.subplot(122), color=list(map(d2c.get, df2.columns)))
plt.show()
Pandas version 1.1.0 makes this easier. You can pass a dictionary to specify different color for each column in the pandas.DataFrame.plot.bar() function:
Here is an example:
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'a': [1.2, .8, .9], 'b': [.2, .9, .7]})
df2 = pd.DataFrame({'b': [0.2, .5, .4], 'c': [.5, .6, .7], 'd': [1.1, .6, .7]})
color_dict = {'a':'green', 'b': 'red', 'c':'blue', 'd': 'cyan'}
df1.plot.bar(color = color_dict)
df2.plot.bar(color = color_dict)
plt.show()
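To connect this with the idea from the question, the dictionary can also be built from Matplotlib's colour cycle instead of being typed by hand. The following is a sketch under assumptions: it needs pandas 1.1.0 or later for the dictionary argument, reads the cycle from the axes.prop_cycle rcParam, and reuses the column names a to d from the example above.

```python
import pandas as pd
import matplotlib.pyplot as plt

data_files = ['a', 'b', 'c', 'd']
# Pair each column name with one colour from the current colour cycle.
cycle_colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_dict = dict(zip(data_files, cycle_colors))

df1 = pd.DataFrame({'a': [1.2, .8, .9], 'b': [.2, .9, .7]})
df2 = pd.DataFrame({'b': [0.2, .5, .4], 'c': [.5, .6, .7], 'd': [1.1, .6, .7]})

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
df1.plot.bar(ax=ax1, color=color_dict)
df2.plot.bar(ax=ax2, color=color_dict)
```

Column b then gets the same colour in both subplots, even though it is the second column of df1 and the first column of df2. Call plt.show() to display the figure.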