How to customize the order for labels in ggcorrplot - ggcorrplot

I have created this correlation matrix using ggcorrplot, but the labels are ordered alphabetically and I want to put them in a specific order.
Are there some solutions to create a specific order of labels?
Thank you
graphy2 <- ggcorrplot(dff1, type = 'upper', show.diag = TRUE, lab = TRUE)

Related

Plotting a graph of the top 15 highest values

I am working on a dataset that shows the budget spent on movies. I want to make a plot that contains the top 15 highest-budget movies.
#sort the 'budget' column in descending order and store it in the new dataframe.
info = pd.DataFrame(dp['budget'].sort_values(ascending = False))
info['original_title'] = dp['original_title']
data = list(map(str,(info['original_title'])))
#extract the top 10 budget movies data from the list and dataframe.
x = list(data[:10])
y = list(info['budget'][:10])
This was the output I got:
C:\Users\Phillip\AppData\Local\Temp\ipykernel_7692\1681814737.py:2: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
y = list(info['budget'][:5])
I'm new to data analysis, so I'm not sure how else to approach the problem.
A simple example using a movie dataset I found online:
import pandas as pd
url = "https://raw.githubusercontent.com/erajabi/Python_examples/master/movie_sample_dataset.csv"
df = pd.read_csv(url)
# Bar plot of 15 highest budgets:
df.nlargest(n=15, columns="budget").plot.bar(x="movie_title", y="budget")
You can customize your plot in various ways by adding arguments to the .bar(...) call.
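The FutureWarning in the question comes from slicing a Series that has an integer index with `series[i:j]`. Here is a minimal sketch of two ways to avoid it, using a small made-up DataFrame. The column names mirror the question, but the titles and budget values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the movie data; titles and budgets are made up.
dp = pd.DataFrame({
    "original_title": [f"Movie {i}" for i in range(20)],
    "budget": [i * 1_000_000 for i in range(20)],
})

# Option 1: let pandas pick the 15 rows with the largest budgets.
top15 = dp.nlargest(15, "budget")

# Option 2: sort first, then slice explicitly by position with .iloc,
# which avoids the ambiguous series[i:j] form that triggers the warning.
sorted_dp = dp.sort_values("budget", ascending=False)
x = sorted_dp["original_title"].iloc[:15].tolist()
y = sorted_dp["budget"].iloc[:15].tolist()
```

Either result can then be passed to a plotting call; sorting the whole DataFrame also keeps titles and budgets aligned row by row.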

trying to add p values from another data frame and values but values are matching the position in the frame not the variable name ggplot

Hi everyone, this is my first time posting a question. I am new to using R to generate figures. I am following a tutorial from datanovia to add p values to a bar plot. I can successfully compute adjusted p values for several comparisons, and now I am trying to plot them on a grouped bar chart. However, the values are plotted in the position in which they appear in the dataframe rather than being matched to the variable name across the data.
For example if the fourth line of the data frame containing the p values shows a significant value then the fourth group in the bar plot will display that value, even though the x axis variable name doesn't match between the dataframes at the fourth position.
How do I correct this and ensure that the p values are displayed with their corresponding comparison?
This is the code to establish the p values.
library(ggpubr)
library(rstatix)
stat.test <- gg_means %>%
  group_by(lipid) %>%
  t_test(cor ~ Genotype) %>%
  adjust_pvalue(method = "BH") %>%
  add_significance("p.adj")
This is the code to create the bar plot
bp <- ggbarplot(gg_means, x = "lipid", y = "cor", add = "mean_sd", color = "Genotype",
                palette = c("#00AFBB", "#E7B800"),
                position = position_dodge(0.8),
                ylab = "nmol/mg protein") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
and finally to add the pvalues to the bar plot
stat.test <- stat.test %>%
  add_xy_position(fun = "mean_sd", x = "lipid", dodge = 0.8)
bp + stat_pvalue_manual(stat.test, label = "p.adj.signif", tip.length = 0.01)
The plot produced is shown in the linked image; the arrows indicate where the values should be mapped.

Select cells in a pandas DataFrame by a Series of its column labels

Say we have a DataFrame and a Series of its column labels, both (almost) sharing a common index:
df = pd.DataFrame(...)
s = df.idxmax(axis=1).shift(1)
How can I obtain cells given a series of columns, getting value from every row using a corresponding column label from the joined series? I'd imagine it would be:
values = df[s] # either
values = df.loc[s] # or
In my example I'd like to have values that are under biggest-in-their-row values (I'm doing a poor man's ML :) )
However I cannot find any interface selecting cells by series of columns. Any ideas folks?
Meanwhile I use this monstrous snippet:
def get_by_idxs(df: pd.DataFrame, idxs: pd.Series) -> pd.Series:
    ts_v_pairs = [
        (ts, row[row['idx']])
        for ts, row in df.join(idxs.rename('idx'), how='inner').iterrows()
        if isinstance(row['idx'], str)
    ]
    return pd.Series([v for ts, v in ts_v_pairs], index=[ts for ts, v in ts_v_pairs])
I think you need dataframe lookup
v = s.dropna()
v[:] = df.to_numpy()[range(len(v)), df.columns.get_indexer_for(v)]
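Because `shift(1)` leaves a missing value in the first row, the non-missing labels no longer start at the first row of the frame. Here is a sketch of the same array-lookup idea that finds the row positions from the index labels instead of counting from zero. The small DataFrame and its index labels are hypothetical.

```python
import pandas as pd

# Hypothetical data; the index labels stand in for timestamps.
df = pd.DataFrame(
    {"a": [1.0, 5.0, 2.0], "b": [3.0, 4.0, 9.0], "c": [2.0, 6.0, 1.0]},
    index=["t0", "t1", "t2"],
)
s = df.idxmax(axis=1).shift(1)  # column label of the previous row's maximum

labels = s.dropna()                         # drop the row with no previous label
rows = df.index.get_indexer(labels.index)   # positional row numbers for those labels
cols = df.columns.get_indexer(labels)       # positional column numbers
values = pd.Series(df.to_numpy()[rows, cols], index=labels.index)
```

Here `values` holds, for each row after the first, the cell under the column that held the previous row's maximum. `get_indexer` assumes the index and the column labels are unique.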

Store regression coefficients, merge back into data-frame

I'm trying to estimate a random effects model, and store those coefficients. I then want to merge them to the data-frame to predict the dependent variable.
There is a random effect coefficient for each group. In the data-frame, if an observation belongs to group 1, I want the group 1 coefficient listed there. For observations in group 2, the group 2 coefficient and so on.
I am able to access and store the coefficients. But I'm not able to merge them back into the data-frame. I'm not sure how to think of it. Here is the code I have so far:
md = smf.mixedlm('y ~ x', data=df, groups=train['GroupID'])
mdf = md.fit()
I tried storing the coefficients in three ways:
re_coeffs = pd.Series(mdf.random_effects.values) #creates a series with shape (1,)
re_coeffs = [(k) for k in mdf.random_effects.values()] #creates a list with the coeffs
re_coeffs = np.array(mdf.random_effects.values) #creates array with shape ()
All of them work, but none of them let me merge them back into the original data-frame. I'm not sure about using a dictionary or a list, or generally how to think about merging these coefficients back into the original data-frame.
I'll appreciate any suggestions for this.
This seems to work:
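import pandas as pd
import statsmodels.formula.api as smf

md = smf.mixedlm('y ~ x', data=train, groups=train['GroupID'])
mdf = md.fit()
# random_effects is a dict keyed by group label; keep the keys so there is a column to merge on
re_coeffs = [(k) for k in mdf.random_effects.values()]
df = pd.DataFrame(re_coeffs)
df['GroupID'] = list(mdf.random_effects.keys())
merged = pd.merge(train, df, on=['GroupID'])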
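If each group's random effect can be reduced to a single number (for example, a random intercept), an alternative to merging is to map the group label onto a dictionary of coefficients. Here is a minimal sketch with made-up values; `group_coefs` and the column name `re_coef` are hypothetical names, not part of statsmodels.

```python
import pandas as pd

# Hypothetical training data with a group identifier per observation.
train = pd.DataFrame({"GroupID": [1, 1, 2, 3], "x": [0.1, 0.2, 0.3, 0.4]})

# Hypothetical per-group coefficients, shaped like {group label: value}.
group_coefs = {1: 0.5, 2: -0.25, 3: 1.0}

# Each observation receives the coefficient of the group it belongs to.
train["re_coef"] = train["GroupID"].map(group_coefs)
```

Rows whose group label is missing from the dictionary would get NaN, which makes unmatched groups easy to spot.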

Is there a way to set the order in pandas group boxplots?

Is there a way to sort the x-axis for a grouped box plot in pandas? It seems to be sorted in ascending order, and I would like it to be ordered based on some other column value.
If you're grouping by a category, set it as an ordered categorical in the desired order.
See example below:
Here a dataset is created with three categories A, B and C where the mean value of each category is of the order C, B, A. The goal is to plot the categories in order of their mean value.
The key is converting the category to an ordered categorical data type with the desired order.
import numpy as np
import pandas as pd

# create some data
n = 50
a = pd.concat([pd.Series(['A']*n, name='cat'),
               pd.Series(np.random.normal(1, 1, n), name='val')],
              axis=1)
b = pd.concat([pd.Series(['B']*n, name='cat'),
               pd.Series(np.random.normal(.5, 1, n), name='val')],
              axis=1)
c = pd.concat([pd.Series(['C']*n, name='cat'),
               pd.Series(np.random.normal(0, 1, n), name='val')],
              axis=1)
df = pd.concat([a, b, c]).reset_index(drop=True)
# unordered boxplot
df.boxplot(column='val', by='cat')
# get order by mean
means = df.groupby('cat')['val'].mean().sort_values()
ordered_cats = means.index.values
# create categorical data type and set categorical column as new data type
cat_dtype = pd.CategoricalDtype(ordered_cats, ordered=True)
df['cat'] = df['cat'].astype(cat_dtype)
# ordered boxplot
df.boxplot(column='val', by='cat')
Using the solution posted by krieger, the short answer is to convert the category column to a CategoricalDtype like so:
ordered_list = ['dog', 'cat', 'mouse']
df['category'] = df['category'].astype(pd.CategoricalDtype(ordered_list, ordered=True))
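To check the ordering without drawing anything, you can look at the group order that a groupby produces after the conversion; the assumption here is that the grouped boxplot follows the same group order. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical data; the values are made up for illustration.
df = pd.DataFrame({
    "category": ["cat", "dog", "mouse", "dog", "cat", "mouse"],
    "val": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

ordered_list = ["dog", "cat", "mouse"]
df["category"] = df["category"].astype(
    pd.CategoricalDtype(ordered_list, ordered=True)
)

# Groups come out in the categorical order rather than alphabetically.
group_order = df.groupby("category", observed=True)["val"].mean().index.tolist()
```

Without the conversion, the same groupby would return the groups alphabetically: cat, dog, mouse.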