Aesthetics : fill - Warning ocours when ploting gg Boxplot - ggplot2

Ive written a R-chunk which should provide me a coloured ggplot boxplot. All needed templates are loaded, so is the Data.
The Data for „Healthy“ & „BodyTemperature“ is based inside the Data „Hospital“.
For Healthy there can be only 0 oder 1.
It should plott two Boxplots next to each other on the x-axis, one showing Healthy (0) the other one Unhealthy (1) compared to the BodyTemperature of the patients on y-axis.
The Boxplot should be coloured with the Template „Brewer“.
Everytime i try to run this chunk, a warning occours. Whats the solution?
colour:
colour <- brewer.pal(n = 2, name = "Set1")
colour
Warnung: minimal value for n is 3, returning requested palette with 3 different levels
[1] "#E41A1C" "#377EB8" "#4DAF4A"
R-Chunk
colour = brewer.pal(n = 2, name = "Set1")
ggplot(Hospital, aes(x = Healthy, y = BodyTemperature)) +
geom_boxplot(fill=c(colour)) +
ylab("Temperature") +
xlab("Healthy") +
ggtitle("Health compared to Temperature")
Warning ocours:
Error in `check_aesthetics()`:
! Aesthetics must be either length 1 or the same as the data (1): fill
Backtrace:
1. base (local) `<fn>`(x)
2. ggplot2:::print.ggplot(x)
4. ggplot2:::ggplot_build.ggplot(x)
5. ggplot2 (local) by_layer(function(l, d) l$compute_geom_2(d))
6. ggplot2 (local) f(l = layers[[i]], d = data[[i]])
7. l$compute_geom_2(d)
8. ggplot2 (local) f(..., self = self)
9. self$geom$use_defaults(data, self$aes_params, modifiers)
10. ggplot2 (local) f(..., self = self)
11. ggplot2:::check_aesthetics(params[aes_params], nrow(data))
Error in check_aesthetics(params[aes_params], nrow(data)) :

As you want to color your boxplots by the value of Healthy you could do so by mapping Healthy on the fill aesthetic. Also to use one of the Brewer palettes ggplot2 already offers some convenience functions which in case of the fill aes is called scale_fill_brewer. Not sure whether you want a legend but IMHO it does not make sense so I removed it via guides. Finally as you provided no data it's not clear whether your Healthy column is a numeric or a categorical variable. For this reason I wrapped in factor to make it categorical.
Using some fake random example data:
set.seed(123)
library(ggplot2)
Hospital <- data.frame(
Healthy = rep(c(0, 1), 50),
BodyTemperature = runif(100)
)
ggplot(Hospital, aes(x = factor(Healthy), y = BodyTemperature)) +
geom_boxplot(aes(fill = factor(Healthy))) +
scale_fill_brewer(palette = "Set1") +
ylab("Temperature") +
xlab("Healthy") +
ggtitle("Health compared to Temperature") +
guides(fill = "none")

Related

geom_nodelabel_repel() position for circular ggraph plot

I have a network diagram that looks like this:
I made it using ggraph and added the labels using geom_nodelabel_repel() from ggnetwork:
( ggraph_plot <- ggraph(layout) +
geom_edge_fan(aes(color = as.factor(responses), edge_width = as.factor(responses))) +
geom_node_point(aes(color = as.factor(group)), size = 10) +
geom_nodelabel_repel(aes(label = name, x=x, y=y), segment.size = 1, segment.color = "black", size = 5) +
scale_color_manual("Group", values = c("#2b83ba", "#d7191c", "#fdae61")) +
scale_edge_color_manual("Frequency of Communication", values = c("Once a week or more" = "#444444","Monthly" = "#777777",
"Once every 3 months" = "#888888", "Once a year" = "#999999"),
limits = c("Once a week or more", "Monthly", "Once every 3 months", "Once a year")) +
scale_edge_width_manual("Frequency of Communication", values = c("Once a week or more" = 3,"Monthly" = 2,
"Once every 3 months" = 1, "Once a year" = 0.25),
limits = c("Once a week or more", "Monthly", "Once every 3 months", "Once a year")) +
theme_void() +
theme(legend.text = element_text(size=16, face="bold"),
legend.title = element_text(size=16, face="bold")) )
I want to have the labels on the left side of the plot be off to the left, and the labels on the right side of the plot to be off to the right. I want to do this because the actual labels are quite long (organization names) and they get in the way of the lines in the actual plot.
How can I do this using geom_nodelabel_repel()? i've tried different combinations of box_padding and point_padding, as well as h_just and v_just but these apply to all labels and it doesn't seem like there is a way to subset or position specific points.
Apologies for not providing a reproducible example but I wasn't sure how to do this without compromising the identities of respondents from my survey.
Well, there is always the manually-intensive, yet effective method of separately adding the geom_node_label_repel function for the nodes on the "left" vs. the "right" of the plot. It's not at all elegant and probably bad coding practice, but I've done similar things myself when I can't figure out an elegant solution. It works really well when you don't have a very large dataset to begin with and if you are not planning to make the same plot over and over again. Basically, it would entail:
Identifying if there exists a property in your dataset that places points on the "left" vs. the "right". In this case, it doesn't look like it, so you would just have to create a list manually of those entries on the "left" vs. "right" of your plot.
Using separate calls to geom_node_label_repel with different nudge_x values. Use any reasonable method to subset the "left" and "right datapoints. You can create a new column in the dataset, or use formatting in-line like data = subset(your.data.frame, property %in% left.list)
For example, if you created a column called subset.side, being either "left" or "right" in your data.frame (here: your.data.frame), your calls to geom_node_label_repel might look something like:
geom_node_label_repel(
data=subset(your.data.frame, subset.side=='left'),
aes(label=name, x=x, y=y), segment.size=1, segment.color='black', size=5,
nudge_x=-10
) +
geom_node_label_repel(
data=subset(your.data.frame, subset.side=='right'),
aes(label=name, x=x, y=y), segment.size=1, segment.color='black', size=5,
nudge_x=10
) +
Alternatively, you can create a list based on the label name itself--let's say you called those lists names.left and names.right, where you can subset accordingly by swapping in as represented in the pseudo code below:
geom_node_label_repel(
data=subset(your.data.frame, name %in% names.left),...
nudge_x = -10, ...
) +
geom_node_label_repel(
data=subset(your.data.frame, name %in% names.right),...
nudge_x = 10, ...
)
To be fair, I have not worked with the node geoms before, so I am assuming here that the positioning of the labels will not affect the mapping (as it would not with other geoms).

Highlighting a point in geom_jitter

I have a data set with several species that shall be visualised via geom_jitter. In every species, there is one observation that I want to highlight as red point (i.e. this study). However, I couldn't find a useful explanation and hope anybody here can help to fix this.
The data have 7 variables and 65 observations.
The 7 variables are
Genus
Species
Energy
Organics
CN
CP
Reference
The variable "Reference" contains many different things, among others "this study".
The code looks like this:
``
ggplot(Genus,aes(x=Genus,y=Organics,colour=Genus))+
geom_jitter(position=position_jitter(0.15),alpha=1,size=2)+
labs(x="\nGenus",y="AFDW % DW\n")+
theme(axis.title.x=element_text(size=18),
axis.text.x=element_blank(),
axis.title.y=element_text(size=18),
legend.title=element_text(colour="black",size=14),
legend.text=element_text(face="italic",colour="black",size=14),
axis.text.y=element_text(size=14,colour="black"),
axis.ticks=element_blank())
p23
``
Does anybody has an idea how to highlight the points belonging to "this study" in every species scatter plot?
Without any relevent example data I can't be more specific as to how to achieve this. Here is how you could achieve it on some dummy data, as mentioned in a comment:
library(ggplot2)
df <- data.frame(
Genus = rep(LETTERS[1:2], each = 50),
Organics = rnorm(100),
Reference = sample(LETTERS, 100, replace = T),
stringsAsFactors = FALSE
)
# Pick out 2 points in across groups to be highlighted
df$Reference[rpois(1, 20) + c(0, 50)] <- "This Study"
ggplot(df, aes(Genus, Organics)) +
geom_jitter(position = position_jitter(0.15),
aes(colour = ifelse(Reference == "This Study",
"Highlight", "Don't Highlight"))) +
labs(colour = "Highlight?")
You can choose the highlight colours by adding + scale_colour_manual(values = c("my_colour_1", "my_colour_2"))

Data Selection - Finding relations between dataframe attributes

let's say i have a dataframe of 80 columns and 1 target column,
for example a bank account table with 80 attributes for each record (account) and 1 target column which decides if the client stays or leaves.
what steps and algorithms should i follow to select the most effective columns with the higher impact on the target column ?
There are a number of steps you can take, I'll give some examples to get you started:
A correlation coefficient, such as Pearson's Rho (for parametric data) or Spearman's R (for ordinate data).
Feature importances. I like XGBoost for this, as it includes the handy xgb.ggplot.importance / xgb.plot_importance methods.
One of the many feature selection options, such as python's sklearn.feature_selection methods.
This one way to do it using the Pearson correlation coefficient in Rstudio, I used it once when exploring the red_wine dataset my targeted variable or column was the quality and I wanted to know the effect of the rest of the columns on it.
see below figure shows the output of the code as you can see the blue color represents positive relation and red represents negative relations and the closer the value to 1 or -1 the darker the color
c <- cor(
red_wine %>%
# first we remove unwanted columns
dplyr::select(-X) %>%
dplyr::select(-rating) %>%
mutate(
# now we translate quality to a number
quality = as.numeric(quality)
)
)
corrplot(c, method = "color", type = "lower", addCoef.col = "gray", title = "Red Wine Variables Correlations", mar=c(0,0,1,0), tl.cex = 0.7, tl.col = "black", number.cex = 0.9)

Tukey-Test Grouping and plotting in SciPy

I'm trying to plot results from a Tukey test, but I am struggling with putting data into groups based on a P-Value. This is the equivalent in R which I am trying to replicate. I have been using the SciPy one-way ANOVA tests and the Tukey test statsmodel but can't get these groups done in the same way.
Any help is greatly appreciated
I've also just found this another example in R of what I want to do in python
I have been struggling to do the same thing. I found a paper that tells you how to code the letters.
Hans-Peter Piepho (2004) An Algorithm for a Letter-Based Representation of All-Pairwise Comparisons, Journal of Computational and Graphical Statistics, 13:2, 456-466, DOI: 10.1198/1061860043515
Doing the coding was a little tricky as you need to check and replicate columns and then combine columns. I tried to add some comments to the colde. I figured out a method where you can run tukeyhsd and then from the results compute the letters. It should be possible to turn this into a function. Or hopefully part of tukeyhsd. My data is not posted but it is a column of data and then a column describing the groups. The groups for me are the five boroughs of NYC. You can also just change the comments and use random data the first time.
# Read data. Comment out the next ones to use random data.
df=pd.read_excel('anova_test.xlsx')
#n=1000
#df = pd.DataFrame(columns=['Groups','Data'],index=np.arange(n))
#df['Groups']=np.random.randint(1, 4,size=n)
#df['Data']=df['Groups']*np.random.random_sample(size=n)
# define columns for data and then grouping
col_to_group='Groups'
col_for_data='Data'
#Now take teh data and regroup for anova
samples = [cols[1] for cols in df.groupby(col_to_group)[col_for_data]] #I am not sure how this works but it makes an numpy array for each group
f_val, p_val = stats.f_oneway(*samples) # I am not sure what this star does but this passes all the numpy arrays correctly
#print('F value: {:.3f}, p value: {:.3f}\n'.format(f_val, p_val))
# this if statement can be uncommmented if you don't won't to go furhter with out p<0.05
#if p_val<0.05: #If the p value is less than 0.05 it then does the tukey
mod = MultiComparison(df[col_for_data], df[col_to_group])
thsd=mod.tukeyhsd()
#print(mod.tukeyhsd())
#this is a function to do Piepho method. AN Alogrithm for a letter based representation of al-pairwise comparisons.
tot=len(thsd.groupsunique)
#make an empty dataframe that is a square matrix of size of the groups. #set first column to 1
df_ltr=pd.DataFrame(np.nan, index=np.arange(tot),columns=np.arange(tot))
df_ltr.iloc[:,0]=1
count=0
df_nms = pd.DataFrame('', index=np.arange(tot), columns=['names']) # I make a dummy dataframe to put axis labels into. sd stands for signifcant difference
for i in np.arange(tot): #I loop through and make all pairwise comparisons.
for j in np.arange(i+1,tot):
#print('i=',i,'j=',j,thsd.reject[count])
if thsd.reject[count]==True:
for cn in np.arange(tot):
if df_ltr.iloc[i,cn]==1 and df_ltr.iloc[j,cn]==1: #If the column contains both i and j shift and duplicat
df_ltr=pd.concat([df_ltr.iloc[:,:cn+1],df_ltr.iloc[:,cn+1:].T.shift().T],axis=1)
df_ltr.iloc[:,cn+1]=df_ltr.iloc[:,cn]
df_ltr.iloc[i,cn]=0
df_ltr.iloc[j,cn+1]=0
#Now we need to check all columns for abosortpion.
for cleft in np.arange(len(df_ltr.columns)-1):
for cright in np.arange(cleft+1,len(df_ltr.columns)):
if (df_ltr.iloc[:,cleft].isna()).all()==False and (df_ltr.iloc[:,cright].isna()).all()==False:
if (df_ltr.iloc[:,cleft]>=df_ltr.iloc[:,cright]).all()==True:
df_ltr.iloc[:,cright]=0
df_ltr=pd.concat([df_ltr.iloc[:,:cright],df_ltr.iloc[:,cright:].T.shift(-1).T],axis=1)
if (df_ltr.iloc[:,cleft]<=df_ltr.iloc[:,cright]).all()==True:
df_ltr.iloc[:,cleft]=0
df_ltr=pd.concat([df_ltr.iloc[:,:cleft],df_ltr.iloc[:,cleft:].T.shift(-1).T],axis=1)
count+=1
#I sort so that the first column becomes A
df_ltr=df_ltr.sort_values(by=list(df_ltr.columns),axis=1,ascending=False)
# I assign letters to each column
for cn in np.arange(len(df_ltr.columns)):
df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(1,chr(97+cn))
df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(0,'')
df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(np.nan,'')
#I put all the letters into one string
df_ltr=df_ltr.astype(str)
df_ltr.sum(axis=1)
#print(df_ltr)
#print('\n')
#print(df_ltr.sum(axis=1))
#Now to plot like R with a violing plot
fig,ax=plt.subplots()
df.boxplot(column=col_for_data, by=col_to_group,ax=ax,fontsize=16,showmeans=True
,boxprops=dict(linewidth=2.0),whiskerprops=dict(linewidth=2.0)) #This makes the boxplot
ax.set_ylim([-10,20])
grps=pd.unique(df[col_to_group].values) #Finds the group names
grps.sort() # This is critical! Puts the groups in alphabeical order to make it match the plotting
props=dict(facecolor='white',alpha=1)
for i,grp in enumerate(grps): #I loop through the groups to make the scatters and figure out the axis labels.
x = np.random.normal(i+1, 0.15, size=len(df[df[col_to_group]==grp][col_for_data]))
ax.scatter(x,df[df[col_to_group]==grp][col_for_data],alpha=0.5,s=2)
name="{}\navg={:0.2f}\n(n={})".format(grp
,df[df[col_to_group]==grp][col_for_data].mean()
,df[df[col_to_group]==grp][col_for_data].count())
df_nms['names'][i]=name
ax.text(i+1,ax.get_ylim()[1]*1.1,df_ltr.sum(axis=1)[i],fontsize=10,verticalalignment='top',horizontalalignment='center',bbox=props)
ax.set_xticklabels(df_nms['names'],rotation=0,fontsize=10)
ax.set_title('')
fig.suptitle('')
fig.savefig('anovatest.jpg',dpi=600,bbox_inches='tight')
Results showing the letters above plots using the tukeyhsd
Here is a function that returns letter labels if you have a symmetric matrix of p-values from a Tukey test:
import numpy as np
def tukeyLetters(pp, means=None, alpha=0.05):
'''TUKEYLETTERS - Produce list of group labels for TukeyHSD
letters = TUKEYLETTERS(pp), where PP is a symmetric matrix of
probabilities from a Tukey test, returns alphabetic labels
for each group to indicate clustering. PP may also be a vector
from PAIRWISE_TUKEYHSD.
Optional argument MEANS specifies group means, which is used for
ordering the letters. ("a" gets assigned to the group with lowest
mean.) Without this argument, ordering is arbitrary.
Optional argument ALPHA specifies cutoff for treating groups as
part of the same cluster.'''
if len(pp.shape)==1:
# vector
G = int(3 + np.sqrt(9 - 4*(2-len(pp))))//2
ppp = .5*np.eye(G)
ppp[np.triu_indices(G,1)] = pp
pp = ppp + ppp.T
conn = pp>alpha
G = len(conn)
if np.all(conn):
return ['a' for g in range(G)]
conns = []
for g1 in range(G):
for g2 in range(g1+1,G):
if conn[g1,g2]:
conns.append((g1,g2))
letters = [ [] for g in range(G) ]
nextletter = 0
for g in range(G):
if np.sum(conn[g,:])==1:
letters[g].append(nextletter)
nextletter += 1
while len(conns):
grp = set(conns.pop(0))
for g in range(G):
if all(conn[g, np.sort(list(grp))]):
grp.add(g)
for g in grp:
letters[g].append(nextletter)
for g in grp:
for h in grp:
if (g,h) in conns:
conns.remove((g,h))
nextletter += 1
if means is None:
means = np.arange(G)
means = np.array(means)
groupmeans = []
for k in range(nextletter):
ingroup = [g for g in range(G) if k in letters[g]]
groupmeans.append(means[np.array(ingroup)].mean())
ordr = np.empty(nextletter, int)
ordr[np.argsort(groupmeans)] = np.arange(nextletter)
result = []
for ltr in letters:
lst = [chr(97 + ordr[x]) for x in ltr]
lst.sort()
result.append(''.join(lst))
return result
To make that concrete, here is a full example:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
data = [ 1,2,2,1,4,5,4,5,7,8,7,8,1,3,4,5 ]
group = [ 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 ]
tuk = pairwise_tukeyhsd(data, group)
letters = tukeyLetters(tuk.pvalues)
This will result in letters containing ['a', 'c', 'b', 'ac']

Customizing new trading strategy in R using quantmod

I want to create a new custom TA-indicator to the stock symbol in R. But I have no idea about how to convert my SQL conditional strategy into R self-defined function and add it up to the ChartSeries in R.
The question are listed in the following code as the explanation.
library("quantmod")
library("FinancialInstrument")
library("PerformanceAnalytics")
library("TTR")
stock <- getSymbols("002457.SZ",auto.assign=FALSE,from="2012-11-26",to="2014-01-30")
head(stock)
chartSeries(stock, theme = "white", subset = "2013-07-01/2014-01-30",TA = "addSMA(n=5,col=\"gray\");addSMA(n=10,col=\"yellow\");
addSMA(n=20,col=\"pink\");addSMA(n=30,col=\"green\");addSMA(n=60,col=\"blue\");addVo()")
Question: How can I rewrite the code below to make it available as a function in R?
#Signal Design
#Today's volume is the lowset during the last 20 trading days
lowvolume <- VOL<=LLV(VOL,20);
#seveal moving average lines stick together
X1:=ABS(MA(C,10)/MA(C,20)-1)<0.01;
X2:=ABS(MA(C,5)/MA(C,10)-1)<0.01;
X3:=ABS(MA(C,5)/MA(C,20)-1)<0.01;
#If the follwing condition is satisfied, then the signal appears
MA(C,5)>REF(MA(C,5),1) AND X1 AND X2 AND X3 AND lowvolume;
#Convert the above SQL code into the following R custom function
VOLINE <- function(x) {
}
#Create a new TA function for the chartseries and then add it up.
addVoline <- newTA(FUN=VOLINE,
+ preFUN=Cl,
+ col=c(rep(3,6),
+ rep(”#333333”,6)),
+ legend=”VOLINE”)
I dont think you need sql in this case
Try this
require(quantmod)
# fetch the data
s <- get(getSymbols('yhoo'))
# add the indicators
s$ma5 <- SMA(Cl(s) ,5)
s$ma10 <- SMA(Cl(s) ,10)
s$ma20 <- SMA(Cl(s) ,20)
s$llv <- rollapply(Vo(s), 20, min)
# generate the signal
s$signal <- (s$ma10 / s$ma20 - 1 < 0.01 & s$ma5 / s$ma10 - 1 < 0.01 & s$ma5 / s$ma20 - 1 < 0.01 & Vo(s) == s$llv)
# draw
chart_Series(s)
add_TA(s$signal == 1, on = 1, col='red')
I'm not sure what REF means but i'm sure you can do that by your self.
This is the output (i cant seem to upload the photo but you see a chart with horizontal lines where signal eq 1)
Use the function as a wrapper for sqldf() in the sqldf package. The argument to sqldf() will be a select statement on the data frame that has the data.
A good tutorial for this can be found at Burns Statistics.