Arrange correlation plots generated by ggcorrplot using ggarrange - ggplot2

I am trying to generate a figure with several correlation plots side by side. I generate the correlation plots with ggcorrplot and would like to use ggarrange to build the final figure, where only the first plot shows the y-axis text. To do so I used rremove() from ggpubr. I obtain a result, but the first plot (the only one with y-axis text) is out of scale compared to the others.
I could use mfrow, but I am not sure whether I can get a common legend that way.
Any advice is welcome.
I provide an example below.
library(dplyr)
library(ggcorrplot)
library(ggpubr)  # provides ggarrange()
corr_setosa<-iris%>%
filter(Species=="setosa")%>%
select(Sepal.Length:Petal.Width)%>%
cor(., use="complete.obs")
p.mat_setosa<-iris%>%
filter(Species=="setosa")%>%
select(Sepal.Length:Petal.Width)%>%
cor_pmat()
corr.set<-ggcorrplot(corr_setosa[1:3,],
p.mat = p.mat_setosa[1:3, ],
insig="blank")
corr_virginica<-iris%>%
filter(Species=="virginica")%>%
select(Sepal.Length:Petal.Width)%>%
cor(., use="complete.obs")
p.mat_virginica<-iris%>%
filter(Species=="virginica")%>%
select(Sepal.Length:Petal.Width)%>%
cor_pmat()
corr.vir<-ggcorrplot(corr_virginica[1:3,],
p.mat = p.mat_virginica[1:3, ],
insig="blank")
ggarrange(corr.set,
corr.vir+theme(axis.text.y=element_blank())
)

How to display centroids for categorical variables instead of arrows using function ggord?

I really can’t figure out how to display just the centroids for my categorical variables using the function ggord. If anybody could help me, that would be great.
Here is an example of what I’m trying to achieve using the dune data set:
library(vegan)
library (ggord)
library(ggplot2)
ord <- rda(dune~Moisture+ Management+A1,dune.env)
#first plot
plot(ord)
# second plot
ggord(ord)
#I tried to add the centroids, but somehow the whole plot seems to be differently scaled?
centroids<-ord$CCA$centroids
ggord(ord)+geom_point(aes(centroids[,1],centroids[,2]),pch=4,cex=5,col="black",data=as.data.frame(centroids))
In the first plot only the centroids (instead of arrows) for moisture and management are displayed. In the ggord plot every variable is displayed with an arrow.
And why do these plots look so different? The scales of the axes are totally different.
Something like this could work - you can use the var_sub argument to retain specific predictors (e.g., continuous), then just plot others on top of the ggord object.
library(vegan)
library(ggord)
library(ggplot2)
data(dune)
data(dune.env)
ord <- rda(dune~Moisture+ Management+A1,dune.env)
# get centroids for factors
centroids <- data.frame(ord$CCA$centroids)
centroids$labs <- row.names(centroids)
# retain only continuous predictors, then add factor centroids
ggord(ord, var_sub = 'A1') +
geom_text(data = centroids, aes(x = RDA1, y = RDA2, label = labs))

Matplotlib/Seaborn: Boxplot collapses on x axis

I am creating a series of boxplots in order to compare different cancer types with each other (based on 5 categories). For plotting I use seaborn/matplotlib. It works fine for most of the cancer types (see image, right); however, in some the x axis collapses slightly (see image, left) or strongly (see image, middle):
https://i.imgur.com/dxLR4B4.png
Looking at how seaborn plots a box/violin plot in https://github.com/mwaskom/seaborn/blob/36964d7ffba3683de2117d25f224f8ebef015298/seaborn/categorical.py (line 961):
violin_data = remove_na(group_data[hue_mask])
I realized that this happens when there are too many NaNs.
Is there any way to prevent this collapsing by code only?
I do not want to modify my dataframe (replace the NaNs by zero).
Below you find my code:
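import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# pf_in and pf_out are file paths defined elsewhere
boxp_df = pd.read_csv(pf_in, sep="\t", skip_blank_lines=False)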
fig, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(data=boxp_df, ax=ax)
plt.xticks(rotation=-45)
plt.ylabel("label")
plt.tight_layout()
plt.savefig(pf_out)
The output is a differently sized plot per cancer type (depending on whether any category is completely NaN).
I expect every plot to have the same width.
Update
Trying to use the order parameter as suggested leads to the following output:
https://i.imgur.com/uSm13Qw.png
Maybe this toy example helps?
| Cat1 | Cat2 | Cat3 | Cat4 | Cat5  |
| 3.93 |      | 0.52 |      | 6.01  |
| 3.34 |      | 0.89 |      | 2.89  |
| 3.39 |      | 1.96 |      | 4.63  |
| 1.59 |      | 3.66 |      | 3.75  |
| 2.73 |      | 0.39 |      | 2.87  |
| 0.08 |      | 1.25 |      | -0.27 |
Update
Apparently, the problem is not the data but the length of the title:
https://github.com/matplotlib/matplotlib/issues/4413
Therefore I would close the question.
@Diziet should I delete it, or might my issue help others?
Sorry for not including the line below in the code example:
ax.set_title("VERY LONG TITLE", fontsize=20)
It's hard to be sure without data to test it with, but I think you can pass the names of your categories/cancers to the order= parameter. This forces seaborn to use/display those, even if they are empty.
For instance:
import seaborn as sns
tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips, order=['Thur','Fri','Sat','Freedom Day','Sun','Durin\'s Day'])
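To make the effect of order= easier to check, here is a minimal sketch on made-up long-form data (the column names "category" and "value" and the numbers are assumptions, loosely based on the toy table in the question; boxplot is used instead of violinplot only to keep it small). It simply counts how many x-axis slots seaborn reserves when two of the five categories have no data at all:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical long-form data: "Cat2" and "Cat4" have no observations
df = pd.DataFrame({
    "category": ["Cat1"] * 3 + ["Cat3"] * 3 + ["Cat5"] * 3,
    "value": [3.93, 3.34, 3.39, 0.52, 0.89, 1.96, 6.01, 2.89, 4.63],
})
order = ["Cat1", "Cat2", "Cat3", "Cat4", "Cat5"]

fig, ax = plt.subplots(figsize=(6, 4))
# order= lists every category that should get a slot, including the empty ones
sns.boxplot(data=df, x="category", y="value", order=order, ax=ax)

# One tick per entry in `order`, whether or not a box was drawn there
n_slots = len(ax.get_xticks())
print(n_slots)
```

If the count matches len(order), the empty categories keep their place on the axis and plots for different cancer types stay comparable.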

Generating subplots of heatmaps in Julia-lang

I am trying to produce a figure/plot with more than a single heatmap (a matrix with color shading according to the cell value). At the moment, using Plots; pyplot() followed by heatmap(mat) is enough to produce a heatmap.
It is not clear to me how to produce a single figure with more though. After looking at this page example subplots for how to use the layout, and then the example histogram, I cannot seem to produce working examples for the two together.
The question is how to produce a figure with two different matrices displayed via heatmap or some other function to do the same?
(as an extra side, could you also explain the context of the 'using' statement and how it relates to the 'backend'?)
The easiest way is to make a Vector of heatmaps, then plot those
using Plots
hms = [heatmap(randn(10,10)) for i in 1:16];
plot(hms..., layout = (4,4), colorbar = false)
The using statement calls the Plots library. The "backend" is another package, loaded by Plots, that does the actual plotting. Plots itself has no plotting capabilities - it translates the plot call to a plot call for the backend package.
Explanation of the code above:
Plotting with Plots is a two-step process. 1: plot generates a Plot object with all the information for the plot; 2: when a Plot object is returned to the console, it automatically calls Julia's display function, which then generates the plot. But you can do other things with the Plot object first, like put it in an array.
The heatmap call is a short form of plot(randn(10,10), seriestype = :heatmap), so it just creates a Plot object. 16 Plot objects are stored in the vector.
Passing a number of Plot objects to plot creates a new, larger Plot, with each of the incoming Plot objects as subplots. The splat operator ... simply passes each element of the Array{Plot} to plot as an individual argument.

Why does DataFrameGroupBy.boxplot method throw error when given argument "subplots=True/False"?

I can use DataFrameGroupBy.boxplot(...) to create a boxplot in the following way:
In [15]: df = pd.DataFrame({"gene_length":[100,100,100,200,200,200,300,300,300],
...: "gene_id":[1,1,1,2,2,2,3,3,3],
...: "density":[0.4,1.1,1.2,1.9,2.0,2.5,2.2,3.0,3.3],
...: "cohort":["USA","EUR","FIJ","USA","EUR","FIJ","USA","EUR","FIJ"]})
In [17]: df.groupby("cohort").boxplot(column="density",by="gene_id")
In [18]: plt.show()
This produces the following image:
This is exactly what I want, except instead of making three subplots, I want all the plots to be in one plot (with different colors for USA, EUR, and FIJ). I've tried
In [17]: df.groupby("cohort").boxplot(column="density",subplots=False,by="gene_id")
but it produces the error
KeyError: 'gene_id'
I think the problem has something to do with the fact that by="gene_id" is a keyword sent to the matplotlib boxplot method. If someone has a better way of producing the plot I am after, perhaps by using DataFrame.boxplot(?) instead, please respond here. Thanks so much!
To use the pure pandas functions, I think you should not GroupBy before calling boxplot, but instead, request to group by certain columns in the call to boxplot on the DataFrame itself:
df.boxplot(column='density',by=['gene_id','cohort'])
To get a better-looking result, you might want to consider using the Seaborn library. It is designed to help with precisely this sort of task:
sns.boxplot(data=df,x='gene_id',y='density',hue='cohort')
EDIT to take into account comment below
If you want to have each of your cohort boxplots stacked/superimposed for each gene_id, it's a bit more complicated (plus you might end up with quite an ugly output). You cannot do this using Seaborn, AFAIK, but you could with pandas directly, by using the positions= parameter to boxplot (see doc). The catch is to generate the correct sequence of positions to place the boxplots where you want them, and you'll have to fix the tick labels and the legend yourself.
pos = [i for i in range(len(df.gene_id.unique())) for _ in range(len(df.cohort.unique()))]
df.boxplot(column='density',by=['gene_id','cohort'],positions=pos)
An alternative would be to use seaborn.swarmplot instead of boxplot. A swarmplot plots every point instead of the summary representation of a boxplot, and you can use the parameter split=False (renamed to dodge=False in newer seaborn releases) to get the points colored by cohort but stacked on top of each other for each gene_id.
sns.swarmplot(data=df, x='gene_id', y='density', hue='cohort', dodge=False)
Without knowing the actual content of your dataframe (number of points per gene and per cohort, and how separate they are in each cohort), it's hard to say which solution would be the most appropriate.
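If you would rather control the layout yourself, here is a rough sketch that goes through matplotlib's Axes.boxplot directly, using positions= to place one box per cohort next to each other within each gene_id and patch_artist=True to color them. The offsets, widths and colors are my own choices for illustration, not anything pandas does for you:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Same columns as the question's DataFrame
df = pd.DataFrame({
    "gene_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "density": [0.4, 1.1, 1.2, 1.9, 2.0, 2.5, 2.2, 3.0, 3.3],
    "cohort": ["USA", "EUR", "FIJ", "USA", "EUR", "FIJ", "USA", "EUR", "FIJ"],
})

genes = sorted(df["gene_id"].unique())
cohorts = sorted(df["cohort"].unique())
colors = dict(zip(cohorts, ["tab:blue", "tab:orange", "tab:green"]))
width = 0.8 / len(cohorts)  # each gene_id gets a 0.8-wide slot shared by all cohorts

fig, ax = plt.subplots()
handles = []
for j, cohort in enumerate(cohorts):
    sub = df[df["cohort"] == cohort]
    data = [sub.loc[sub["gene_id"] == g, "density"].to_numpy() for g in genes]
    # shift this cohort's boxes left or right of the gene's center position
    offset = (j - (len(cohorts) - 1) / 2) * width
    positions = [i + offset for i in range(len(genes))]
    bp = ax.boxplot(data, positions=positions, widths=width * 0.9,
                    patch_artist=True, manage_ticks=False)
    for box in bp["boxes"]:
        box.set_facecolor(colors[cohort])
    handles.append(bp["boxes"][0])  # one box per cohort serves as legend handle

# Ticks and legend have to be set by hand, as noted above
ax.set_xticks(range(len(genes)))
ax.set_xticklabels(genes)
ax.set_xlim(-0.5, len(genes) - 0.5)
ax.set_xlabel("gene_id")
ax.set_ylabel("density")
ax.legend(handles, cohorts, title="cohort")
```

With only one point per gene and cohort, as in the example DataFrame, each box degenerates to a line; with real data you get proper boxes.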

Multiplot with matplotlib without knowing the number of plots before running

I have a problem with Matplotlib's subplots. I do not know the number of subplots I want to plot beforehand, but I know that I want them in two rows. So I cannot use
plt.subplot(212)
because I don't know the number that I should provide.
It should look like this:
Right now, I save all the plots into a folder and put them together with Illustrator, but there has to be a better way with Matplotlib. I can provide my code if I was unclear somewhere.
My understanding is that you only know the number of plots at runtime and hence are struggling with the shorthand syntax, e.g.:
plt.subplot(121)
Thankfully, to save you some awkward math to work out this number programmatically, there is another interface that takes the form:
plt.subplot(n_rows, n_cols, plot_num)
where plot_num counts from 1. So in your case, given you want n plots in two rows, you can do:
import matplotlib.pyplot as plt
n_plots = 5  # (or however many you programmatically figure out you need)
n_rows = 2
n_cols = (n_plots + 1) // n_rows  # round up so every plot gets a cell
for plot_num in range(1, n_plots + 1):  # subplot indices start at 1
    ax = plt.subplot(n_rows, n_cols, plot_num)
    # ... do some plotting
Alternatively, there is also a slightly more Pythonic interface which you may wish to be aware of:
fig, subplots = plt.subplots(n_rows, n_cols)
for ax in subplots.flat:  # .flat visits every Axes in the 2-D array
    pass  # ... do some plotting
(Notice that this is subplots(), not the plain subplot().) Although I must admit, I have never used this latter interface.
HTH
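For completeness, a small sketch of the plt.subplots() route for a plot count that is only known at runtime. The helper name grid_shape and the placeholder data are made up for illustration. Since a two-row grid can have one cell too many, the leftover Axes is hidden:

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def grid_shape(n_plots, n_rows=2):
    """Return (n_rows, n_cols) with enough cells for n_plots panels."""
    n_cols = math.ceil(n_plots / n_rows)
    return n_rows, n_cols


n_plots = 5  # hypothetical count, only known at runtime
n_rows, n_cols = grid_shape(n_plots)

# squeeze=False keeps `axes` two-dimensional even when there is a single column
fig, axes = plt.subplots(n_rows, n_cols, squeeze=False,
                         figsize=(3 * n_cols, 3 * n_rows))

for i, ax in enumerate(axes.flat):
    if i < n_plots:
        ax.plot([0, 1, 2], [0, i, 2 * i])  # placeholder data
        ax.set_title(f"plot {i + 1}")
    else:
        ax.set_visible(False)  # hide the unused cell of the grid

fig.tight_layout()
```

After this, fig.savefig(...) writes everything into one file, so the detour through Illustrator should not be needed.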