Add a line at z=0 to ggplot2 heatmap - ggplot2

I have plotted a heatmap in ggplot2. I want to add a curved line to the plot to show where z=0 (i.e. where the value of the data used for the fill is zero), how can I do this?
Thanks

Since no example data or code is provided, I'll illustrate with the volcano dataset, representing heights of a volcano in a matrix. Since the data doesn't contain a zero point, we'll draw the line at the arbitrarily chosen 125 mark.
library(ggplot2)
# Convert matrix to data.frame
df <- data.frame(
row = as.vector(row(volcano)),
col = as.vector(col(volcano)),
value = as.vector(volcano)
)
# Set contour breaks at desired level
ggplot(df, aes(col, row, fill = value)) +
geom_raster() +
geom_contour(aes(z = value),
breaks = 125, col = 'red')
Created on 2020-04-06 by the reprex package (v0.3.0)
If this isn't a good approximation of your problem, I'd suggest to include example data and code in your question.

Related

How can I define and add a lagend to this ggplot 2 script?

I came up with the following script to bin my data on X values, and plot the means of those bins in overlapping bar graphs. It works fine, but I can't seem to get a legend to generate, probably due to poor understanding of aesthetic mapping.
Here is the script, note that "MOI" and "T_cell_contacts" are two data columns in each DF.
ggplot(mapping=aes(MOI, T_cell_contacts)) + stat_summary_bin(data = Cleaned24hr4, fun = "mean", geom="bar", bins= 100, fill = "#FF6666", alpha = 0.3) + stat_summary_bin(data = cleaned24hr8, fun = "mean", geom="bar", bins= 100, fill = "#3733FF", alpha = 0.3) + ylab("mean")
I also added the graph that it plots.
Full disclosure: I was in the middle of writing this when #schumacher posted their response :). Decided to finish anyway.
There are two ways to approach this. One way (more complicated) is to keep the dataframes separate and ask ggplot2 to create a legend via mapping, and the second (simpler) way is to combine into one dataset similar to what #schumacher posted and map the fill color to the extra id column created.
I'll show you both, but first, here's a sample dataset:
library(ggplot2)
set.seed(8675309)
df1 <- data.frame(my_x=rep(1:100, 3), my_y=rnorm(300, 40, 4))
df2 <- data.frame(my_x=rep(11:110, 3), my_y=rnorm(300, 110, 10))
# and the plot code similar to OP's question
ggplot(mapping=aes(x = my_x, y = my_y)) +
stat_summary_bin(data=df1, fun="mean", geom="bar", bins=40, fill="blue", alpha=0.3) +
stat_summary_bin(data=df2, fun="mean", geom="bar", bins=40, fill="red", alpha=0.3)
Method 1 : Combine Dataframes
This is the preferred method for a variety of reasons I can't list completely here. There are a lot of options you can use for combining datasets. One is using union() or rbind() after adding some sort of ID column to your data, but you can do all in one shot using bind_rows() from dplyr:
df <- dplyr::bind_rows(list(dataset1 = df1, dataset2 = df2), .id="id")
The result will bind the rows together and by specifying the .id argument, it will create a new column in the dataset called "id" that uses the names for each of the datasets in the list as the value. In this case, the value in thd df$id column is either "dataset1" if it originated from df1 or "dataset2" if it originated from df2.
Then you use aes(fill=...) to map the fill color to the column "id" in the combined dataset.
p <- ggplot(df, aes(x=my_x, y=my_y)) +
stat_summary_bin(aes(fill=id), fun="mean", geom="bar", bins=40, alpha=0.3)
p
This creates a plot with the default colors for fill, so if you want to supply your own, just use scale_fill_manual(values=...) to specify the particular colors. Using a named vector for values= ensures that each color is applied the way you want it to be, but you can just supply an unnamed vector of color names.
p + scale_fill_manual(values = c("dataset1" = "blue", "dataset2" = "red"))
Method 2 : Use mapping to add the legend
While Method 1 is preferred, there is another way that does not force you to combine your dataframes. This is also useful to illustrate a bit about how ggplot2 decides to create and draw legends. The legend is created automaticaly via the mapping= argument, specifically via aes(). If you put any aesthetic inside of aes() that would normally impart a different appearance and not location (with some exceptions like x, y, and label), then this initiates the creation of a legend. You can map either a column in your dataset (like above), or you can just supply a single value and that will be applied to the entire dataset used for the geom. In this case, see what happens when you change the fill= argument for each geom call to be within aes() and assign it to a character value:
p1 <- ggplot(mapping = aes(x=my_x, y=my_y)) +
stat_summary_bin(aes(fill="first"), data=df1, fun="mean", geom="bar", bins=40, alpha=0.3) +
stat_summary_bin(aes(fill="second"), data=df2, fun="mean", geom="bar", bins=40, alpha=0.3) +
scale_fill_manual(values = c("first" = "blue", "second" = "red"))
p1
It works! When you provide a character value for the fill= aesthetic inside aes(), it's basically labeling every observation in that data to have the value "first" or "second" and using that to make the legend. Cool, right?
You notice a problem though, which is that the alpha value for the legend is not correct. This is because you get overplotting. It's just one of the reasons why you shouldn't really do it this way, but... sort of works. It is only noticeable if you ahve an alpha value. You can get that to look normal, but you need to use guide_legend() to override the aesthetics. Since the code effectively causes the legend to be drawn completely for each geom... you have to cut the alpha value in half for it to display correctly.
p1 + guides(fill=guide_legend(override.aes = list(alpha=0.15)))
Oh, and the real reason why not to use Method 2 is.... just think about doing that again for 5 datasets... how about 10?... how about 20?.....
I think the difficulty has to do with building a single legend out of two different geoms. My approach was to combine your data into a single data frame. The records from each to be set apart by a new category column, I'll call "cat" for short.
With the popular dplyr package:
Cleaned24hr4 <- mutate(Cleaned24hr4, cat = "hr4")
Cleaned24hr8 <- mutate(Cleaned24hr8, cat = "hr8")
Then put them together:
Cleaned <- union(Cleaned24hr4,Cleaned24hr8)
Define your colors:
colorcode <- c("hr4" = "#FF6666", "hr8" = "#3733FF")
Here's my ggplot statement:
ggplot(Cleaned, mapping=aes(MOI, T_cell_contacts)) +
stat_summary_bin(fun = "mean", geom="bar", bins= 100, aes(fill = cat), alpha = 0.3) +
scale_fill_manual(values = colorcode) +
ylab("mean")
Output using some dummy data.

Adding error_y from two columns in a stacked bar graph, plotly express

I have created a stacked bar plot using plotly.express. Each X-axis category has two correspondent Y-values that are stacked to give the total value of the two combined.
How can I add an individual error bar for each Y-value?
I have tried several options that all yield the same: The same value is added to both stacked bars. The error_y values are found in two separate columns in the dataframe: "st_dev_PHB_%" and "st_dev_PHV_%" , respectively, which correspond to 6 categorical values (x="C").
My intuition tells me its best to merge them into a new column in the dataframe, since I load the dataframe in the bar plot. However, each solution I try give an error or that the same value is added to each pair of Y-values.
What would be nice, is if it's possible to have X error_y values corresponding to the X number of variables loaded in the y=[...,...] . But that would off course be too easy .........................
data_MM = read_csv(....)
#data_MM["error_bar"] = data_MM[['st_dev_PHB_%', 'st_dev_PHV_%']].apply(tuple, axis=1).tolist()
#This one adds the values together instead of adding them to same list.
#data_MM["error_bar"] = data_MM['st_dev_PHB_%'] + data_MM['st_dev_PHV_%']
#data_MM["error_bar"] = data_MM[["st_dev_PHB_%", "st_dev_PHV_%"]].values.tolist()
#data_MM["error_bar"] = list(zip(data_MM['st_dev_PHB_%'],data_MM['st_dev_PHV_%']))
bar_plot = px.bar(data_MM, x="C", y=["PHB_wt%", "PHV_wt%"], hover_data =["PHA_total_wt%"], error_y="error_bar")
bar_plot.show()
The most commonly endured error message:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
I see your problem with the same error bar being used in both bars in the stack. However, I got a working solution with Plotly.graph_objs. The only downside was the second bar is plotting at the front, and therefore the upper half of the lower error bar is covered. At least you can still read off the error value from the hover data.
Here is the full code:
n = 20
x = list(range(1, n + 1))
y1 = np.random.random(n)
y2 = y1 + np.random.random(n)
e1 = y1 * 0.2
e2 = y2 * 0.05
trace1 = go.Bar(x=x, y=y1, error_y=dict(type='data', array=e1), name="Trace 1")
trace2 = go.Bar(x=x, y=y2, error_y=dict(type='data', array=e2), name="Trace 2")
fig = go.Figure(data=[trace1, trace2])
fig.update_layout(title="Test Plot", xaxis_title="X axis", yaxis_title="Y axis", barmode="stack")
fig.show()
Here is a resulting plot (top plot showing one error value, bottom plot showing different error value for the same bar stack):

Grouping the factors in ggplot

I am trying to create a graph based on matrix similar to one below... I am trying to group the Erosion values based on "Slope"...
library(ggplot2)
new_mat<-matrix(,nrow = 135, ncol = 7)
colnames(new_mat)<-c("Scenario","Runoff (mm)","Erosion (t/ac)","Slope","Soil","Tillage","Rotation")
for ( i in 1:nrow(new_mat)){
new_mat[i,2]<-sample(10:50, 1)
new_mat[i,3]<-sample(0.1:20, 1)
new_mat[i,4]<-sample(c("S2","S3","S4","S5","S1"),1)
new_mat[i,5]<-sample(c("Deep","Moderate","Shallow"),1)
new_mat[i,7]<-sample(c("WBP","WBF","WF"),1)
new_mat[i,6]<-sample(c("Intense","Reduced","Notill"),1)
new_mat[i,1]<-paste0(new_mat[i,4],"_",new_mat[i,5],"_",new_mat[i,6],"_",new_mat[i,7],"_")
}
#### Graph part ########
grphs_mat<-as.data.frame(new_mat)
grphs_mat$`Runoff (mm)`<-as.numeric(as.character(grphs_mat$`Runoff (mm)`))
grphs_mat$`Erosion (t/ac)`<-as.numeric(as.character(grphs_mat$`Erosion (t/ac)`))
ggplot(grphs_mat, aes(Scenario, `Erosion (t/ac)`,group=Slope, colour = Slope))+
scale_y_continuous(limits=c(0,max(as.numeric((grphs_mat$`Erosion (t/ac)`)))))+
geom_point()+geom_line()
But when i run this code.. The values are distributed in x-axis for all 135 scenarios. But what i want is grouping to be done in terms of slope but it also picks up the other common factors such as Soil+Rotation+Tillage and place it in x-axis. For example:
For these five scenarios:
S1_Deep_Intense_WBF_
S2_Deep_Intense_WBF_
S3_Deep_Intense_WBF_
S4_Deep_Intense_WBF_
S5_Deep_Intense_WBF_
It separates the S1, S2, S3,S4,S5 but also be able to know that other factors are same and put them in x-axis such that the slope lines are stacked on top of each other in 135/5 = 27 x-axis points. The final figure should look like this (Refer image). Apologies for not being able to explain it better.
I think i am making a mistake in grouping or assigning the x-axis values.
I will appreciate your suggestions.
In the example you give, I didn't get every possible factor combination represented so the plots looked a bit weird. What I did instead was start with the following:
set.seed(42)
new_mat <- matrix(,nrow = 1000, ncol = 7)
And then deduplicated this by summarising the values. A possible relevant step here for you analysis is that I made new variable with the interaction() function that is the combination of three other factors.
library(tidyverse)
df <- grphs_mat
df$x <- with(df, interaction(Rotation, Soil, Tillage))
# The simulation did not yield unique combinations
df <- df %>% group_by(x, Slope) %>%
summarise(n = sum(`Erosion (t/ac)`))
Next, I plotted this new x variable on the x-axis and used "stack" positions for the lines and points.
g <- ggplot(df, aes(x, y = n, colour = Slope, group = Slope)) +
geom_line(position = "stack") +
geom_point(position = "stack")
To make the x-axis slightly more readable, you can replace the . that the interaction() function placed by newlines.
g + scale_x_discrete(labels = function(x){gsub("\\.", "\n", x)})
Another option is to simply rotate the x axis labels:
g + theme(axis.text.x.bottom = element_text(angle = 90))
There are a few additional options for the x-axis if you go into ggplot2 extension packages.

Infer Series Labels and Data from pandas dataframe column for plotting

Consider a simple 2x2 dataset with with Series labels prepended as the first column ("Repo")
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLib 140.0 47.0
Here are the DataFrame columns:
p(df.columns)
([u'Repo', u'AllTests', u'Restricted']
So we have the first column is the string/label and the second and third columns are data values. We want one series per row corresponding to the Galactian and the Forecast-MLlib repos.
It would seem this would be a common task and there would be a straightforward way to simply plot the DataFrame . However the following related question does not provide any simple way: it essentially throws away the DataFrame structural knowledge and plots manually:
Set matplotlib plot axis to be the dataframe column name
So is there a more natural way to plot these Series - that does not involve deconstructing the already-useful DataFrame but instead infers the first column as labels and the remaining as series data points?
Update Here is a self contained snippet
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['Galactian','Forecast-MLlib'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','AllTests','Restricted']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.show()
And here is output
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
And with piRSquared help it looks like this
So the data is showing now .. but the Series and Labels are swapped. Will look further to try to line them up properly.
Another update
By flipping the columns/labels the series are coming out as desired.
The change was to :
labels = npa(['AllTests','Restricted'])
..
colnames = ['Repo','Galactian','Forecast-MLlib']
So the updated code is
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['AllTests','Restricted'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','Galactian','Forecast-MLlib']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.title("Restricting Long-Running Tests\nin Galactus and Forecast-ML")
plt.show()
p('df columns', df.columns)
ps(df)
Pandas assumes your label information is in the index and columns. Set the index first:
df.set_index('Repo').astype(float).plot()
Or
df.set_index('Repo').T.astype(float).plot()

How can i plot a barplot with multiple bars grouped up?

i would like to plot my data in such a fashion, that bars are grouped up?Something like this:
my code looks like this so far:
data = NP.genfromtxt('newfile',unpack=True,names=True,dtype=None)
for i in sample:
mask = data['name'==sample]
ax2.bar(pos-0.5,(data['data']*100,label="samples", color="lightblue")
This created several graphs instead of a combined one, though. How do i convert this into a grafic looking like the one i presented above?
Without having all the code listed, I had to make some assumptions about the data and about what you wanted. Thus, I haven't tested this code with actual data. I have tried to document my assumptions carefully in the code below. If you have problems with what I've posted, perhaps you could post the text file mentioned in the first line of your code.
# Assume data is a record array
data = NP.genfromtxt('newfile',unpack=True,names=True,dtype=None)
# Assume 'sample' is a column in the data
sample = NP.unique(data['name'])
num_items = len(sample)
ind = NP.arange(sample)
# The margin can be increased to add more space between the groups of data
margin = 0.05
width = (1.-2.*margin)/num_items
# This list will make each sample data set a different color
# It must be AT LEAST as long as the number of samples
# If not, the extra data won't be plotted
colorList = ['red', 'blue', 'black']
if len(colorList) < num_items:
print 'Warning: the number of samples exceeds the length of the color list.'
f = plt.figure()
# Assumed the color was supposed to vary with each data set, so I added a list
for s, color in zip(enumerate(sample), colorList):
num, i = s
print "plotting: ", i
# Assumed you want to plot a separate set of bars for each sample, so I changed the mask to 'name'==i
mask = data['name'==i]
# The position of the xdata must be calculated for each of the sample data series
xdata = ind+margin+(num*width)
# Assumed you wanted to plot the 'data' column where mask is true and then multiply by 100
# Also assumed the label and color were supposed to vary with each data set
plt.bar(xdata, data['data'][mask]*100, width, label=i, color=color)