Create a bar chart with bars colored according to a category and line on the same chart - pandas

I trained a model to predict a value and I want to make a bar chart that plots target - prediction for each sample, and then color these bars according to a category. I then want to add two horizontal lines for plus or minus sigma around the central axis, so it's clear which predictions are very far off. Imagine we know sigma == 0.3 and we have a dataframe
error  sample_id  category
 0.1   1          'A'
 0.4   2          'A'
 0.1   3          'B'
-0.2   4          'B'
-0.1   5          'C'
How could I do this? So far I've only managed to plot the errors and the plus or minus sigma lines using plain matplotlib; here it is, to communicate what I mean.

You'll find the pd.Series.transform() and/or pd.DataFrame.apply() methods quite useful. Essentially, you can map each value of your input columns (in this case errors) into some valid color value, returning a pd.Series of colors that's the same shape as errors.
The phrasing of the question is unclear, but it sounds like you want a single pair of lines for each category? In that case, you will first need to do a pd.Series.groupby() operation to get the shape that you want before the transform operation. The result is probably just a series of length 3, for your A, B and C categories.
Then, this Series (whether it is of length len(df) or df.category.nunique()) can be passed into your plt.bar method as the color argument.
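As a minimal sketch of the mapping described above, using the dataframe and sigma from the question: the palette dictionary and the line styles are assumptions chosen only for illustration, and any valid matplotlib colors would work in their place.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.patches import Patch

sigma = 0.3  # value given in the question
df = pd.DataFrame(
    {
        "error": [0.1, 0.4, 0.1, -0.2, -0.1],
        "sample_id": [1, 2, 3, 4, 5],
        "category": ["A", "A", "B", "B", "C"],
    }
)

# Hypothetical palette: one color per category.
palette = {"A": "tab:blue", "B": "tab:orange", "C": "tab:green"}

# Map each row's category to a color; the result has one entry per bar.
bar_colors = df["category"].map(palette).tolist()

fig, ax = plt.subplots()
ax.bar(df["sample_id"], df["error"], color=bar_colors)

# Central axis plus the two horizontal lines at +sigma and -sigma.
ax.axhline(0, color="black", linewidth=0.8)
ax.axhline(sigma, color="gray", linestyle="--")
ax.axhline(-sigma, color="gray", linestyle="--")

# One legend entry per category rather than one per bar.
handles = [Patch(color=color, label=cat) for cat, color in palette.items()]
ax.legend(handles=handles, title="category")
ax.set_xlabel("sample_id")
ax.set_ylabel("target - prediction")
```

Call plt.show() or fig.savefig(...) afterwards to see the result. Bars that extend past either dashed line are the predictions that are off by more than sigma.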

This is actually very easy, I just didn't understand the 'color' option of plt.bar. If it is a list of length equal to the number of bars, then it will color each bar with the corresponding color. It's as simple as
plt.bar(x, y, color=z)
# len(x) == len(y) == len(z), and z is an array of colors
As krukah mentions, you just need to translate categories to colors. I picked a color map, made a dictionary that picked a color for each unique category, and then turned the cats array (a 2d np array, each row encodes a category) into an array of colors.
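import numpy as np
import matplotlib.pyplot as plt

# cats is the 2d np array of categories described above (one row per bar)
unique_cats = np.unique(cats, axis=0)
n_unique = unique_cats.shape[0]
# evenly spaced positions in [0, 1) at which to sample the colormap
for_picking = np.arange(0, 1, 1/n_unique)
cmap = plt.get_cmap('plasma')
color_dict = {}
# this for loop fills in the dictionary by picking colors from the cmap
for i in range(n_unique):
    color_dict[str(unique_cats[i])] = cmap(for_picking[i])
# one color per bar, looked up by the string form of its category row
color_cats = [color_dict[str(cat)] for cat in cats]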
Hopefully that helps someone some day.
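For 1-D category labels, the same idea can be written without the dictionary and loop. This is only a sketch under that assumption: the cats array below is made up for illustration, whereas the answer above works on a 2d array of category rows.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 1-D category labels, one per bar.
cats = np.array(["A", "A", "B", "B", "C"])

# unique_cats holds the sorted distinct labels;
# inverse[i] is the position of cats[i] within unique_cats.
unique_cats, inverse = np.unique(cats, return_inverse=True)

# Sample one RGBA color per distinct category from the colormap.
cmap = plt.get_cmap("plasma")
palette = cmap(np.linspace(0, 1, len(unique_cats), endpoint=False))

# Integer-array indexing expands the palette to one color per bar.
color_cats = palette[inverse]
```

color_cats can then be passed directly as the color argument of plt.bar.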

Related

Adding error_y from two columns in a stacked bar graph, plotly express

I have created a stacked bar plot using plotly.express. Each X-axis category has two correspondent Y-values that are stacked to give the total value of the two combined.
How can I add an individual error bar for each Y-value?
I have tried several options that all yield the same: The same value is added to both stacked bars. The error_y values are found in two separate columns in the dataframe: "st_dev_PHB_%" and "st_dev_PHV_%" , respectively, which correspond to 6 categorical values (x="C").
My intuition tells me it's best to merge them into a new column in the dataframe, since I load the dataframe in the bar plot. However, each solution I try either gives an error or adds the same value to each pair of Y-values.
It would be nice if it were possible to pass X error_y values corresponding to the X variables loaded in y=[...,...]. But that would of course be too easy...
data_MM = read_csv(....)
#data_MM["error_bar"] = data_MM[['st_dev_PHB_%', 'st_dev_PHV_%']].apply(tuple, axis=1).tolist()
#This one adds the values together instead of adding them to same list.
#data_MM["error_bar"] = data_MM['st_dev_PHB_%'] + data_MM['st_dev_PHV_%']
#data_MM["error_bar"] = data_MM[["st_dev_PHB_%", "st_dev_PHV_%"]].values.tolist()
#data_MM["error_bar"] = list(zip(data_MM['st_dev_PHB_%'],data_MM['st_dev_PHV_%']))
bar_plot = px.bar(data_MM, x="C", y=["PHB_wt%", "PHV_wt%"], hover_data =["PHA_total_wt%"], error_y="error_bar")
bar_plot.show()
The most commonly endured error message:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
I see your problem with the same error bar being used in both bars in the stack. However, I got a working solution with Plotly.graph_objs. The only downside was the second bar is plotting at the front, and therefore the upper half of the lower error bar is covered. At least you can still read off the error value from the hover data.
Here is the full code:
import numpy as np
import plotly.graph_objs as go

n = 20
x = list(range(1, n + 1))
y1 = np.random.random(n)
y2 = y1 + np.random.random(n)
e1 = y1 * 0.2
e2 = y2 * 0.05
trace1 = go.Bar(x=x, y=y1, error_y=dict(type='data', array=e1), name="Trace 1")
trace2 = go.Bar(x=x, y=y2, error_y=dict(type='data', array=e2), name="Trace 2")
fig = go.Figure(data=[trace1, trace2])
fig.update_layout(title="Test Plot", xaxis_title="X axis", yaxis_title="Y axis", barmode="stack")
fig.show()
[Figure: the resulting plot. The top panel shows one error value, the bottom panel a different error value for the same bar stack.]

Leave axis ticks for blank treatments in ggplot r

I have a bunch of dfs with 19 treatments that I'm plotting subsets of. Trying to figure out how to leave columns for all 19 treatments in the plots, even if they have no values that are being plotted in that specific plot. Smaller reproducible example below.
set.seed(3)
df <- data.frame(matrix(ncol=2,nrow=200))
df$X1 <- rep(c("A","B","C","D","E"),each = 40)
df$X2 <- runif(200,20,50)
ggplot(df,aes(x=X1,y=X2,color=X1))+
  geom_dotplot(binaxis="y",data= df[df$X2>48,])+
  geom_boxplot(data=df[df$X2>48,],varwidth = T)
See how there are only columns for A,B,C,E? How do I make sure it leaves a column for D? Also, I would need it to skip a color as well so in all the different plots, A, B, C, D, and E are always consistent colors.
(It would also be preferable if I could just put the subset code in the ggplot() call, so I wouldn't have to write the subsets over and over again.)
I tried adding
scale_x_date(drop=F)+
as a line, but it didn't change.
Thanks.
You can subset your data in the main ggplot() call. The limits of the x-axis and colours can be set manually:
fullvar <- unique(df$X1)
ggplot(df[df$X2 > 48,],aes(x=X1,y=X2,color=X1))+
  geom_dotplot(binaxis="y")+
  geom_boxplot(varwidth = T) +
  scale_x_discrete(limits = fullvar) +
  scale_color_discrete(limits = fullvar)
EDIT: Overlooked the colours question.

How can I make this plot awesome (colours by group plus alpha value by second group)

I have the following dataframe:
I plotted it the following way:
Right now the plot looks ugly. Aside from using a different font size, marker edge width, marker face color, etc., I would like to have a separate color for each of the two proteins (hum1 and hum2), and within each group the different pH values should have different intensities. What makes it more difficult is the fact that my groups do not have the same size.
Any ideas?
P.S. Such a built-in feature would be really cool, e.g. colourby = level_one thenby level_two
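import numpy as np
import matplotlib.pyplot as plt

# gr is the grouped dataframe shown above
fig = plt.figure(figsize=(9,9))
ax = fig.add_subplot(1,1,1)
# four shades from each colormap, running from mid to full intensity
c1 = plt.cm.Greens(np.linspace(0.5, 1, 4))
c2 = plt.cm.Blues(np.linspace(0.5, 1, 4))
# stack the shades into one color per plotted line
colors = np.vstack((c1,c2))
gr.unstack(level=(0,1))['conc_dil'].plot(marker='o',linestyle='-',color=colors,ax=ax)
plt.legend(loc=1,bbox_to_anchor = (0,0,1.5,1),numpoints=1)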
gives:
P.S. This post helped me: stacked bar plot and colours

How can I replace the summing in numpy matrix multiplication with concatenation in a new dimension?

For each location in the result matrix, instead of storing the dot product of the corresponding row and column in the argument matrices, I would like to store the element-wise product, which will be a vector extending into a third dimension.
One idea would be to convert the argument matrices to vectors with vector entries, and then take their outer product, but I'm not sure how to do this either.
EDIT:
I figured it out before I saw there was a reply. Here is my solution:
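def newdot(A, B):
    # A has shape (n, k), B has shape (k, m)
    A = A.reshape((1,) + A.shape)   # (1, n, k)
    B = B.reshape((1,) + B.shape)   # (1, k, m)
    A = A.transpose(2, 1, 0)        # (k, n, 1): each column of A
    B = B.transpose(1, 0, 2)        # (k, 1, m): each row of B
    return A * B                    # broadcasts to (k, n, m)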
What I am doing is taking apart each row and column pair that will have their outer product taken, and forming two lists of them, which then get their contents matrix multiplied together in parallel.
It's a little convoluted (and difficult to explain) but this function should get you what you're looking for:
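def f(m1, m2):
    # m1 and m2 are np.matrix objects; .A gives the underlying ndarray
    return (m2.A.T * m1.A.reshape(m1.shape[0],1,m1.shape[1]))

m3 = m1 * m2
m3_el = f(m1, m2)
# For every i, j:  m3[i,j] == sum(m3_el[i,j,:])
# or, for the whole matrix at once:  m3 == m3_el.sum(2)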
The basic idea is to turn the matrices into arrays and do element-by-element multiplication. One of the arrays gets reshaped to have a size of one in its middle dimension, and array broadcasting rules expand this dimension out to match the height of the other array.
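The broadcasting idea can also be checked on plain ndarrays. The following is a small sketch with made-up shapes and random inputs; the np.einsum line is an alternative spelling of the same product and is not taken from either answer.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((2, 3))  # shape (n, k)
B = rng.random((3, 4))  # shape (k, m)

# Broadcasting: A[:, None, :] is (n, 1, k) and B.T[None, :, :] is (1, m, k),
# so the product has shape (n, m, k) with products[i, j, l] = A[i, l] * B[l, j].
products = A[:, None, :] * B.T[None, :, :]

# The same result as an einsum that keeps the k axis instead of summing over it.
products_einsum = np.einsum("ik,kj->ijk", A, B)

# Summing over the new last axis recovers the ordinary matrix product.
recovered = products.sum(axis=2)
```

Here recovered matches A @ B up to floating-point rounding, which is a quick sanity check that the extra axis holds exactly the terms of each dot product.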

How can i plot a barplot with multiple bars grouped up?

I would like to plot my data in such a fashion that the bars are grouped. Something like this:
My code looks like this so far:
data = NP.genfromtxt('newfile',unpack=True,names=True,dtype=None)
for i in sample:
    mask = data['name'==sample]
    ax2.bar(pos-0.5,data['data']*100,label="samples", color="lightblue")
This created several graphs instead of a combined one, though. How do I convert this into a graphic looking like the one I presented above?
Without having all the code listed, I had to make some assumptions about the data and about what you wanted. Thus, I haven't tested this code with actual data. I have tried to document my assumptions carefully in the code below. If you have problems with what I've posted, perhaps you could post the text file mentioned in the first line of your code.
import numpy as NP
import matplotlib.pyplot as plt

# Assume data is a record array (no unpack=True, so columns stay addressable by name)
data = NP.genfromtxt('newfile',names=True,dtype=None)
# Assume 'name' is a column in the data; sample holds its unique values
sample = NP.unique(data['name'])
num_items = len(sample)
# The margin can be increased to add more space between the groups of data
margin = 0.05
width = (1.-2.*margin)/num_items
# This list will make each sample data set a different color
# It must be AT LEAST as long as the number of samples
# If not, the extra data won't be plotted
colorList = ['red', 'blue', 'black']
if len(colorList) < num_items:
    print('Warning: the number of samples exceeds the length of the color list.')
f = plt.figure()
# Assumed the color was supposed to vary with each data set, so I added a list
for s, color in zip(enumerate(sample), colorList):
    num, i = s
    print("plotting: ", i)
    # Assumed you want to plot a separate set of bars for each sample, so I changed the mask to data['name'] == i
    mask = data['name'] == i
    # The position of the xdata must be calculated for each of the sample data series
    # (one group position per row that belongs to this sample)
    ind = NP.arange(mask.sum())
    xdata = ind+margin+(num*width)
    # Assumed you wanted to plot the 'data' column where mask is true and then multiply by 100
    # Also assumed the label and color were supposed to vary with each data set
    plt.bar(xdata, data['data'][mask]*100, width, label=i, color=color)