Color problems with pandas matplotlib - graph colors are inconsistent

I have graphs with just 3 colors (green, red, grey) for the values A, B, C. The application uses groupby and value_counts to get the cumulative count of A, B, and C across months and shows a donut chart, a barh chart, and a bar chart. The colors shift from graph to graph: on one, A is green, and on another graph with the same data, A is red.
Simple fix, right?
def color_for_label(label):
    xlate = {'A': 'green',
             'B': 'red',
             'C': 'grey',
             }
    return [xlate[x] for x in label]

# parentheses let the chained call continue onto the next line
chart = (gb.unstack(level=-1)
           .plot.barh(color=color_for_label(gb.index[0:2].names),
                      width=.50,
                      stacked=True,
                      legend=None))
The data returns an index sometimes and a multiindex other times. It chokes on and but works on
The colors are constant Red/Green/Grey that always go with the values A/B/C.
I've tried checking datatypes and try/except structures, but both got too complex quickly. Anyone got a simple solution to share?
Let's use the data from this example: "pandas pivot table to stacked bar chart".
df.assign(count =1 ).groupby(['battle_type']).count().plot.barh(stacked=True)
and (latter preferred - I'm not loving the groupby inconsistencies)
df.pivot_table(index='battle_type', columns='attacker_outcome', aggfunc='size').plot.barh(stacked=True)
both get me
I have a 3rd value, "Tie", in my example of A, B, C above, but let's ignore that for the moment.
I want to make sure that win is always green, lose is red, Tie is grey.
so I have my simple function
def color_for_label(label):
    xlate = {'win': 'green',
             'lose': 'red',
             'Tie': 'grey',
             }
    return xlate[label]
so I add
....plot.barh(stacked=True, color=color_for_label(**label**))
And here I'm stuck - what do I set label to so that win is always green, lose is red and tie is grey?

Got it!
First, translate colors for the new example
def color_for_label(label):
    xlate = {'win': 'green',
             'loss': 'red',
             'tie': 'grey',
             }
    return [xlate[x] for x in label]
Then break it into two lines.
# create a dataframe
gb = df.pivot_table(index='battle_type', columns='attacker_outcome', aggfunc='size')
# pass the dataframe column values
gb.plot.barh(stacked=True, color=color_for_label(gb.columns.values))
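For anyone landing here later, below is a minimal, self-contained sketch of the same idea. The sample data, the COLOR_FOR_LABEL name, and the fallback color for unknown labels are my own assumptions, not part of the original post. The point is that the color list is built from gb.columns, so it always follows whatever column order the pivot produces.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import to_rgba

# Hypothetical data standing in for the battles dataset used above.
df = pd.DataFrame({
    "battle_type": ["siege", "siege", "ambush", "ambush", "pitched", "pitched"],
    "attacker_outcome": ["win", "loss", "win", "win", "loss", "win"],
})

# One fixed color per outcome, independent of column order.
COLOR_FOR_LABEL = {"win": "green", "loss": "red", "tie": "grey"}


def colors_for_labels(labels, default="black"):
    """Return one color per label, in the order given; unknown labels get `default`."""
    return [COLOR_FOR_LABEL.get(label, default) for label in labels]


# Count rows per (battle_type, attacker_outcome); missing combinations become 0.
gb = df.pivot_table(index="battle_type", columns="attacker_outcome",
                    aggfunc="size", fill_value=0)

# The color list is derived from the actual columns, so 'win' is green
# whether it ends up as the first or the last column.
ax = gb.plot.barh(stacked=True, color=colors_for_labels(gb.columns))
```

The .get() fallback is a design choice: an unexpected label shows up as a black bar instead of raising a KeyError halfway through plotting.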

Related

Histogram as stacked bar chart based on categories

I have data with a numeric and categorical variable. I want to show a histogram of the numeric column, where each bar is stacked by the categorical variable. I tried to do this with ax.hist(data, histtype='bar', stacked=True), but couldn't quite get it to work.
If my data is
df = pd.DataFrame({'age': np.random.normal(45, 5, 100),
                   'job': np.random.choice(['engineer', 'barista', 'quantity surveyor'], size=100)})
I've organised it like this:
df['binned_age'] = pd.qcut(df.age, 5)
df.groupby('binned_age')['job'].value_counts().plot(kind='bar')
Which gives me a bar chart divided the way I want, but side by side, not stacked, and without different colours for each category.
Is there a way to stack this plot? Or just do it a regular histogram, but stacked by category?
IIUC, you will need to reshape your dataset first. I will do that using pivot_table, with len as the aggregator, as that will give you the frequency.
Then you can use code similar to what you provided above.
df.drop('age',axis=1,inplace=True)
df_reshaped = df.pivot_table(index=['binned_age'], columns=['job'], aggfunc=len)
df_reshaped.plot(kind='bar', stacked=True, ylabel='Frequency', xlabel='Age binned',
                 title='Age group frequency by Job', rot=45)
prints:
You can use the documentation to tailor the chart to your needs
df['age_bin'] = pd.cut(df['age'], [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70])
df.groupby('age_bin')['job'].value_counts().unstack().plot(kind='bar', stacked=True)
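On the second half of the question ("just do it as a regular histogram, but stacked by category"), here is a rough sketch that uses matplotlib's own hist with one array per category. The seed, the bin count of 10, and the variable names are my assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same shape of data as in the question, with a fixed seed for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(45, 5, 100),
    "job": rng.choice(["engineer", "barista", "quantity surveyor"], size=100),
})

# One array of ages per job; hist stacks the arrays in the order given.
jobs = sorted(df["job"].unique())
ages_by_job = [df.loc[df["job"] == job, "age"].to_numpy() for job in jobs]

fig, ax = plt.subplots()
# With stacked=True, each returned row holds the running total up to that job.
counts, bin_edges, _ = ax.hist(ages_by_job, bins=10, stacked=True, label=jobs)
ax.set_xlabel("age")
ax.set_ylabel("frequency")
ax.legend()
```

Unlike the qcut version, the bins here have equal width, so this reads as an ordinary histogram.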

Create a bar chart with bars colored according to a category and line on the same chart

I trained a model to predict a value and I want to make a bar chart that plots target - prediction for each sample, and then color these bars according to a category. I then want to add two horizontal lines for plus or minus sigma around the central axis, so it's clear which predictions are very far off. Imagine we know sigma == 0.3 and we have a dataframe
error  sample_id  category
 .1    1          'A'
 .4    2          'A'
 .1    3          'B'
-.2    4          'B'
-.1    5          'C'
How could I do this? I've managed to do just the errors and the plus or minus sigma lines just using matplotlib, here it is to communicate what I mean.
You'll find the pd.Series.transform() and/or pd.DataFrame.apply() methods quite useful. Essentially, you can map each value of your input columns (in this case errors) into some valid color value, returning a pd.Series of colors that's the same shape as errors.
The phrasing of the question is unclear, but it sounds like you want a single pair of lines for each category? In that case, you will first need to do a pd.Series.groupby() operation to get the shape that you want before the transform operation. Probably just a series of length 3, for your A, B, C categories.
Then, this Series (whether it is of length len(df) or df.category.nunique()) can be passed into your plt.bar method as the color argument.
This is actually very easy; I just didn't understand the 'color' option of plt.bar. If it is a list of length equal to the number of bars, then each bar is colored with the corresponding color. It's as simple as
plt.bar(x, y, color=z)
# len(x) == len(y) == len(z), and z is an array of colors
As krukah mentions, you just need to translate categories to colors. I picked a color map, made a dictionary that picked a color for each unique category, and then turned the cats array (a 2d np array, each row encodes a category) into an array of colors.
unique_cats = np.unique(cats, axis=0)
n_unique = unique_cats.shape[0]
for_picking = np.arange(0, 1, 1/n_unique)
# plt.get_cmap is used because cm.get_cmap is no longer available in recent matplotlib
cmap = plt.get_cmap('plasma')
color_dict = {}
# this for loop fills in the dictionary by picking colors from the cmap
for i in range(n_unique):
    color_dict[str(unique_cats[i])] = cmap(for_picking[i])
color_cats = [color_dict[str(cat)] for cat in cats]
Hopefully that helps someone some day.
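Putting the pieces together for the dataframe in the question, a minimal sketch might look like this. The palette dict and its color names are hypothetical; sigma == 0.3 and the data come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import to_rgba

sigma = 0.3
df = pd.DataFrame({
    "error": [0.1, 0.4, 0.1, -0.2, -0.1],
    "sample_id": [1, 2, 3, 4, 5],
    "category": ["A", "A", "B", "B", "C"],
})

# Hypothetical palette: one fixed color per category.
palette = {"A": "tab:blue", "B": "tab:orange", "C": "tab:green"}

# One color per bar, in the same order as the rows of df.
bar_colors = df["category"].map(palette).tolist()

fig, ax = plt.subplots()
bars = ax.bar(df["sample_id"], df["error"], color=bar_colors)

# Horizontal guide lines at plus and minus sigma around the zero axis.
upper = ax.axhline(sigma, color="black", linestyle="--")
lower = ax.axhline(-sigma, color="black", linestyle="--")
ax.axhline(0, color="black", linewidth=0.8)
```

Sample 2 (error 0.4) is the only bar that crosses the sigma line, which is the kind of outlier the question wants to make visible.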

How to create a discrete colormap that maps integers to colors, invariant to range of input data

Let's say I have a vector containing integers from the set [1,2,3]. I would like to create a colormap in which 1 always appears as blue, 2 always appears as red, and 3 always appears as purple, regardless of the range of the input data--e.g., even if the input vector only contains 1s and 2s, I would still like those to appear as blue and red, respectively (and purple is not used in this case).
I've tried the code below:
This works as expected (data contains 1, 2 and 3):
cmap = colors.ListedColormap(["blue", "red", "purple"])
bounds = [0.5,1.5,2.5,3.5]
norm = colors.BoundaryNorm(bounds, cmap.N)
data = np.array([1,2,1,2,3])
sns.heatmap(data.reshape(-1,1), cmap=cmap, norm=norm, annot=True)
Does not work as expected (data contains only 1 and 2):
cmap = colors.ListedColormap(["blue", "red", "purple"])
bounds = [0.5,1.5,2.5,3.5]
norm = colors.BoundaryNorm(bounds, cmap.N)
data = np.array([1,2,1,2,2])
sns.heatmap(data.reshape(-1,1), cmap=cmap, norm=norm, annot=True)
In the first example, 1 appears as blue, 2 appears as red and 3 appears as purple, as desired.
In the second example, 1 appears as blue and 2 appears as purple, while red is not used.
Not completely sure, but I think this minimal example solves your problem. Here, I've taken an actual colormap and edited it to produce a smaller version of it. Hope it helps!
#0. Import libraries
#==============================
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sns
import numpy as np
#==============================
#1. Create own colormap
#======================================
#1.1. Choose the colormap you want to
#pick up colors from
source_cmap=plt.get_cmap('Set2')
#1.2. Choose number of colors and set a step
cols=4;step=1/float(cols - 1)
#1.3. Declare a vector to store given colors
cmap_vec=[]
#1.4. Run from 0 to 1 (limits of colormap)
#stepwise and pick up equidistant colors
#---------------------------------------
for color in np.arange(0,1.1,step):
    #store color in vector
    cmap_vec.append( source_cmap(color) )
#---------------------------------------
#1.5. Create colormap with chosen colors
custom_cmap=\
    colors.ListedColormap([ color for color in cmap_vec ])
#====================================
#2. Basic example to plot in
#======================================
A = np.matrix('0 3; 1 2')
B=np.asarray(A)
ax=sns.heatmap(B,annot=True,cmap=custom_cmap)
plt.show()
#======================================
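As a cross-check on the question's own cmap and norm, here is a small sketch that sidesteps seaborn and applies the ListedColormap and BoundaryNorm directly. The helper name to_rgba_fixed is my own. Done this way, the integer-to-color mapping cannot depend on which values happen to be present in the data.

```python
import numpy as np
from matplotlib import colors

# Same colormap and boundaries as in the question: 1 -> blue, 2 -> red, 3 -> purple.
cmap = colors.ListedColormap(["blue", "red", "purple"])
norm = colors.BoundaryNorm([0.5, 1.5, 2.5, 3.5], cmap.N)


def to_rgba_fixed(values):
    """Map integers to RGBA rows using the fixed boundaries, ignoring the data range."""
    return cmap(norm(np.asarray(values)))


full = to_rgba_fixed([1, 2, 1, 2, 3])     # data contains 1, 2 and 3
partial = to_rgba_fixed([1, 2, 1, 2, 2])  # data contains only 1 and 2
```

In both cases 2 comes out as red. The same cmap and norm objects can also be passed to plain matplotlib calls such as imshow or pcolormesh, which accept both arguments.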

How can I make this plot awesome (colours by group plus alpha value by second group)

I do have following dataframe:
I plotted it the following way:
Right now the plot looks ugly. Aside from changing the font size, marker edge width, marker face color, etc., I would like to have one color for each of the two proteins (hum1 and hum2), and within each group the different pH values should have different intensities. What makes it more difficult is that my groups do not have the same size.
Any ideas?
P.S. Such a built-in feature would be really cool, e.g. colourby = level_one thenby level_two
fig = plt.figure(figsize=(9,9))
ax = fig.add_subplot(1,1,1)
c1 = plt.cm.Greens(np.linspace(0.5, 1, 4))
c2 = plt.cm.Blues(np.linspace(0.5, 1, 4))
colors = np.vstack((c1,c2))
gr.unstack(level=(0,1))['conc_dil'].plot(marker='o',linestyle='-',color=colors,ax=ax)
plt.legend(loc=1,bbox_to_anchor = (0,0,1.5,1),numpoints=1)
gives:
P.S This post helped me:
stacked bar plot and colours
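Since the question mentions groups of unequal size, here is a hedged sketch of how the color array could be built per group instead of hard-coding 4 shades each. The group sizes and the group-to-colormap assignment are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical group sizes: number of pH values measured for each protein.
group_sizes = {"hum1": 4, "hum2": 3}

# Hypothetical assignment of one sequential colormap per protein.
group_cmaps = {"hum1": plt.cm.Greens, "hum2": plt.cm.Blues}

# For each group, sample as many shades as it has members, from mid to full intensity.
line_colors = np.vstack([
    group_cmaps[name](np.linspace(0.5, 1, size))
    for name, size in group_sizes.items()
])
```

The resulting array has one RGBA row per line and can be passed as color=line_colors in the plot call above, as long as the rows are in the same order as the unstacked columns.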

Overlaying actual data on a boxplot from a pandas dataframe

I am using Seaborn to make boxplots from pandas dataframes. Seaborn boxplots seem to essentially read the dataframes the same way as the pandas boxplot functionality (so I hope the solution is the same for both -- but I can just use the dataframe.boxplot function as well). My dataframe has 12 columns and the following code generates a single plot with one boxplot for each column (just like the dataframe.boxplot() function would).
fig, ax = plt.subplots()
sns.set_style("darkgrid", {"axes.facecolor":"darkgrey"})
pal = sns.color_palette("husl",12)
sns.boxplot(data=dataframe, palette=pal)
Can anyone suggest a simple way of overlaying all the values (by columns) while making a boxplot from dataframes?
I will appreciate any help with this.
This hasn't been added to the seaborn.boxplot function yet, but there's something similar in the seaborn.violinplot function, which has other advantages:
x = np.random.randn(30, 6)
sns.violinplot(data=x, inner="point")
sns.despine(trim=True)
A general solution for a boxplot of the entire dataframe, which should work for both seaborn and pandas since they are both matplotlib-based under the hood. I will use the pandas plot as the example, assuming import matplotlib.pyplot as plt is already in place. As you already have the ax, it would make more sense to use ax.text(...) instead of plt.text(...).
In [35]:
print(df)
V1 V2 V3 V4 V5
0 0.895739 0.850580 0.307908 0.917853 0.047017
1 0.931968 0.284934 0.335696 0.153758 0.898149
2 0.405657 0.472525 0.958116 0.859716 0.067340
3 0.843003 0.224331 0.301219 0.000170 0.229840
4 0.634489 0.905062 0.857495 0.246697 0.983037
5 0.573692 0.951600 0.023633 0.292816 0.243963
[6 rows x 5 columns]
In [34]:
df.boxplot()
# np.repeat yields x = 1,1,...,2,2,..., so the values must be flattened column by column (order='F')
for x, y in zip(np.repeat(np.arange(df.shape[1])+1, df.shape[0]),
                df.values.ravel(order='F')):
    plt.text(x, y, '%.3f' % y, ha='center', va='center')
For a single series in the dataframe, a few small changes are necessary:
In [35]:
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
for x, y in zip(np.repeat(1, df.shape[0]), sub_df.values):
    plt.text(x, y, '%.3f' % y, ha='center', va='center')
Making scatter plots is similar:
# for the whole thing (order='F' flattens column by column to match the repeated x positions)
df.boxplot()
plt.scatter(np.repeat(np.arange(df.shape[1])+1, df.shape[0]), df.values.ravel(order='F'), marker='+', alpha=0.5)
# for just one column
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
plt.scatter(np.repeat(1, df.shape[0]), sub_df.values, marker='+', alpha=0.5)
To overlay anything on a boxplot, we first need to work out where each box is placed along the x axis. They appear to be at 1, 2, 3, 4, and so on. Therefore, we want the values in the first column to be plotted at x=1, the 2nd column at x=2, and so on.
An efficient way of doing this is to use np.repeat to repeat 1, 2, 3, 4, ..., each n times, where n is the number of observations. Then we can make a plot using those numbers as x coordinates. Since that array is one-dimensional, for the y coordinates we need a flattened view of the data in the same column-by-column order, provided by df.values.ravel(order='F').
For overlaying the text strings, we need another step (a loop), as we can only plot one x value, one y value and one text string at a time.
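Because the x positions and the flattened values have to line up, here is a small self-contained sketch of the overlay. The random data and seed are assumptions. The values are flattened column by column, so every point lands on the box of its own column.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data with the same shape as the example above (6 rows, 5 columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((6, 5)), columns=["V1", "V2", "V3", "V4", "V5"])

n_rows, n_cols = df.shape

# Boxes are drawn at x = 1..n_cols; repeat each position once per observation.
x = np.repeat(np.arange(1, n_cols + 1), n_rows)

# Flatten column by column (Fortran order) so y[k] belongs to the column at x[k].
y = df.to_numpy().ravel(order="F")

fig, ax = plt.subplots()
df.boxplot(ax=ax)
ax.scatter(x, y, marker="+", alpha=0.5)
```

A quick sanity check is that y[x == 2] equals the V2 column. With a plain row-major ravel() that would not hold.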
I have the following trick:
data = np.random.randn(6,5)
df = pd.DataFrame(data,columns = list('ABCDE'))
Now assign a dummy column to df:
df['Group'] = 'A'
print(df)
A B C D E Group
0 0.590600 0.226287 1.552091 -1.722084 0.459262 A
1 0.369391 -0.037151 0.136172 -0.772484 1.143328 A
2 1.147314 -0.883715 -0.444182 -1.294227 1.503786 A
3 -0.721351 0.358747 0.323395 0.165267 -1.412939 A
4 -1.757362 -0.271141 0.881554 1.229962 2.526487 A
5 -0.006882 1.503691 0.587047 0.142334 0.516781 A
Then use df.groupby(...).boxplot() and you are done:
df.groupby('Group').boxplot()