How to recycle two colours in a geom_jitter with more than 2 categories - ggplot2

I would like to use only two colours that alternate in a geom_jitter plot.
ggplot(data = dat1, aes(name, value, colour = factor(location))) +
  geom_jitter()
location has more than 2 categories, but I want to colour them with two alternating colours.
So far I've tried adding + scale_color_manual(values=rep(c("red","green"))), but that did not work.
My data is something like this
ID <- c('P1', 'P2', 'P3', 'P4')
location <- c('harm', 'breast', 'colon', 'liver')
value <- c(7.2, 5.4, 0.3, 2.1)
ID location value
p1 harm 7.2
p2 breast 5.4
p3 colon 0.3
p4 liver 2.1
What I want is for the plot to show only two colours, e.g. harm = red, breast = green, colon = red, and so on. I can't do it manually because I have a lot of categories, and I want to create a function in which the locations may differ.
Any advice?
Thanks

You can add the number of repetitions as the second argument of rep(). E.g. if you have 32 categories:
scale_color_manual(values=rep(c("blue","red"), 32 ))
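A sketch that avoids hard-coding the count: rep()'s length.out argument recycles the two colours to exactly the number of location levels, so the same code works for any data set. The dat1 data frame below is made up for illustration, based on the question's example:

```r
library(ggplot2)

# Hypothetical data with five locations
dat1 <- data.frame(
  name = paste0("P", 1:5),
  value = c(7.2, 5.4, 0.3, 2.1, 1.8),
  location = c("harm", "breast", "colon", "liver", "lung")
)

n_levels <- nlevels(factor(dat1$location))

ggplot(dat1, aes(name, value, colour = factor(location))) +
  geom_jitter() +
  # recycle the two colours to exactly n_levels values
  scale_color_manual(values = rep(c("red", "green"), length.out = n_levels))
```

Because length.out trims or extends the vector as needed, this works unchanged whether there are 3 categories or 32.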

Related

Taking mean of N largest values of group by absolute value

I have some DataFrame:
d = {'fruit': ['apple', 'pear', 'peach'] * 6, 'values': np.random.uniform(-5,5,18), 'values2': np.random.uniform(-5,5,18)}
df = pd.DataFrame(data=d)
I can take the mean of each fruit group as such:
df.groupby('fruit').mean()
However, for each group of fruit, I'd like to take the mean of the N number of largest values as
ranked by absolute value.
So for example, if my values were as follows and N=3:
[ 0.7578507 , 3.81178045, -4.04810913, 3.08887538, 2.87999752, 4.65670954]
The desired outcome would be (4.65670954 + -4.04810913 + 3.81178045) / 3 = ~1.47
Edit - to clarify that sign is preserved in outcome:
(4.65670954 + -20.04810913 + 3.81178045) / 3 = -3.859
Updating with a new approach that I think is simpler. I was avoiding apply like the plague, but maybe this is one of its more acceptable uses. Plus, it fixes the fact that you want to take the mean of the original values as ranked by their absolute values:
def foo(d):
    return d[d.abs().nlargest(3).index].mean()

out = df.groupby('fruit')['values'].apply(foo)
So you index each group by the 3 largest absolute values, then mean.
And for the record my original, incorrect, and slower code was:
df['values'].abs().groupby(df['fruit']).nlargest(3).groupby("fruit").mean()
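A minimal, self-contained sketch of the same idea (the numbers here are made up for illustration, chosen so the result is easy to check by hand):

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple"] * 4,
    "values": [0.5, 3.0, -4.0, 1.0],
})

def mean_of_n_largest_abs(s, n=3):
    # Rank by absolute value, but average the original signed values.
    return s[s.abs().nlargest(n).index].mean()

out = df.groupby("fruit")["values"].apply(mean_of_n_largest_abs)
print(out["apple"])  # (3.0 + -4.0 + 1.0) / 3 = 0.0
```

The sign of each value is preserved: only the ranking uses the absolute value.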

using 'loop' or 'for' with table data to pull each row data and use the pulled data for two parameters in gams

I am new to GAMS, and I have table data with 3 rows and 6 columns. I want to pull each row and use its data for two parameters (each pulled row has 6 elements; the first three elements go to one parameter and the other three to a second parameter) using a loop or for statement. I tried both, but with loop I received a zero value for my parameter, which is incorrect, and with for I received some errors.
This is my code for the first row; both loop and for appear in it (I used them separately each time, but to show what my code was I wrote them together).
Please help me.
Thanks
scalars j;
sets
o /red,green,blue/
p /b1,b2,b3,p1,p2,p3/
k /1*3/;
Table sup(*,*)
b1 b2 b3 p1 p2 p3
red 12 15 20 200 50 50
green 16 17 0 150 50 0
blue 13 18 0 100 50 0 ;
parameters Bid_Red(k),Pmax_Red(k),t;
*for statement***************
for(j= 1 to 3,
t=card(o)+j;
Bid_Red(k)$( ord(k) = j )=sup('red',j);
Pmax_Red(k)$( ord(k) = j )=sup('red',t);
);
*loop statement***************
t=card(o);
loop(k,
Bid_Red(k)=sup('red',k);
Pmax_Red(k)=sup('red',k+t);
);
display Bid_red, Pmax_Red
One of the core features of GAMS is how it deals with set structures and indexing. I'd recommend looking at the excellent documentation, for example on set definition https://www.gams.com/latest/docs/UG_SetDefinition.html, to really get a feel for how to get the best out of it.
In your case, you can proceed as follows. p is a set; create two subsets of it, p_ and b_, using the syntax subset_name(set_name):
sets p_(p) / p1, p2, p3 /,
b_(p) / b1, b2, b3 /;
Create parameters over appropriate dimensions (i.e. the full set), and define them over the subset you are interested in:
parameters bid_red(o,p),pmax_red(o,p);
bid_red(o,b_) = sup(o,b_);
pmax_red(o,p_) = sup(o,p_);
Then display bid_red, pmax_red; gives:
---- 21 PARAMETER bid_red
b1 b2 b3
red 12.000 15.000 20.000
green 16.000 17.000
blue 13.000 18.000
---- 21 PARAMETER pmax_red
p1 p2 p3
red 200.000 50.000 50.000
green 150.000 50.000
blue 100.000 50.000
If you do want to select individual rows, you can use e.g. pmax_red('red',p_) in your code. This is essentially just a special case of subsetting in which the subset is of size 1.
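For example, to pull just the red row into a one-dimensional parameter, a sketch using the sets and table defined above:

```
parameter pmax_red_row(p);
pmax_red_row(p_) = sup('red',p_);
display pmax_red_row;
```

This fills pmax_red_row only over the p_ subset (p1, p2, p3), leaving the b_ entries at zero.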

Data Selection - Finding relations between dataframe attributes

Let's say I have a dataframe of 80 columns and 1 target column; for example, a bank account table with 80 attributes for each record (account) and 1 target column which decides whether the client stays or leaves.
What steps and algorithms should I follow to select the most effective columns, i.e. those with the highest impact on the target column?
There are a number of steps you can take, I'll give some examples to get you started:
A correlation coefficient, such as Pearson's rho (for parametric data) or Spearman's rho (for ordinal data).
Feature importances. I like XGBoost for this, as it includes the handy xgb.ggplot.importance / xgb.plot_importance methods.
One of the many feature selection options, such as python's sklearn.feature_selection methods.
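As a minimal sketch of the correlation approach (with made-up data standing in for the 80-column bank table), you can rank features by the absolute value of their Pearson correlation with the target:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "f1": rng.normal(size=200),
    "f2": rng.normal(size=200),
})
# Make the target depend strongly on f1 only.
df["target"] = 2 * df["f1"] + rng.normal(scale=0.1, size=200)

# Pearson correlation of each feature with the target, ranked by |r|.
ranked = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(ranked.index[0])  # f1 comes out on top
```

Features at the top of the ranking are the strongest linear predictors of the target; nonlinear relationships need one of the other methods listed above.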
This is one way to do it using the Pearson correlation coefficient in RStudio. I used it once when exploring the red wine dataset; my target variable was quality, and I wanted to know the effect of the rest of the columns on it.
The figure below shows the output of the code: blue represents positive correlations and red negative ones, and the closer the value is to 1 or -1, the darker the color.
library(dplyr)
library(corrplot)

c <- cor(
  red_wine %>%
    # first we remove unwanted columns
    dplyr::select(-X) %>%
    dplyr::select(-rating) %>%
    mutate(
      # now we translate quality to a number
      quality = as.numeric(quality)
    )
)
corrplot(c, method = "color", type = "lower", addCoef.col = "gray",
         title = "Red Wine Variables Correlations", mar = c(0, 0, 1, 0),
         tl.cex = 0.7, tl.col = "black", number.cex = 0.9)

One ggplot from two data frames (1 bar each)

I was looking for an answer everywhere, but I just couldn't find one to this problem (maybe I was just too stupid to use other answers, because I'm new to R).
I have two data frames with different numbers of rows. I want to create a plot containing a single bar per data frame. Both bars should have the same length, with the counts of the different values stacked on top of each other. For example: I want to compare the proportions of gender in those two data sets.
t1<-data.frame(cbind(c(1:6), factor(c(1,2,2,1,2,2))))
t2<-data.frame(cbind(c(1:4), factor(c(1,2,2,1))))
1 represents male, 2 represents female
I want to create two barplots next to each other that show that the proportion of genders in the first data frame is 2:4 and in the second one 2:2.
My attempt looked like this:
ggplot() + geom_bar(aes(1, t1$X2, position = "fill")) + geom_bar(aes(1, t2$X2, position = "fill"))
That leads to the error: "Error: stat_count() must not be used with a y aesthetic."
First, you should merge the two data frames. You need a variable that identifies the origin of the data, so add a column with an ID (like t1 and t2) to both data frames. Keep in mind that your column names must be the same in both frames so that you can use rbind.
t1$data <- "t1"
t2$data <- "t2"
t <- (rbind(t1,t2))
Now you can make the plot:
ggplot(t[order(t$X2),], aes(data, X2, fill=factor(X2))) +
geom_bar(stat="identity", position="stack")
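Putting the pieces together with the example data from the question, a sketch; since the question asks for bars of equal length showing proportions, position = "fill" normalises each bar to the same height:

```r
library(ggplot2)

t1 <- data.frame(X1 = 1:6, X2 = factor(c(1, 2, 2, 1, 2, 2)))
t2 <- data.frame(X1 = 1:4, X2 = factor(c(1, 2, 2, 1)))

# tag each frame with its origin, then stack them
t1$data <- "t1"
t2$data <- "t2"
t <- rbind(t1, t2)

# one bar per data frame; "fill" shows proportions rather than raw counts
ggplot(t, aes(data, fill = X2)) +
  geom_bar(position = "fill")
```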

How do I design a DB Table when my sources are VENN DIAGRAMS?

Imagine 3 circles. Each circle has some numbers
Circle 1 has the following numbers
1, 4, 7, 9
Circle 2 has the following numbers
2, 5, 8, 9
Circle 3 has the following numbers
3, 6, 7, 8, 9
Circle 1 and Circle 2 share the following numbers
10, 9
Circle 1 and Circle 3 share the following numbers
7, 9
Circle 2 and Circle 3 share the following numbers
8, 9
All three circles share
9
Each number represents symptoms so in my case
Circle 1's numbers could be symptoms for a short circuit
Circle 2 could be numbers for a component failure
Circle 3 could be numbers for external issues
each of the three issues share certain symptoms
If given #9, we wouldn't be able to deduce the problem but could display a list of all issues involving #9
If given more #'s, we can attempt to show relevant issues.
My problem is how do I put this into a table so my code can look things up.
My Database of choice is SQLite3
@Vincent, the only issue I have is that there are several variables. I have variables called t1, t2, t3, a1, a2, a3, each of which is a symptom. The user interface for my application allows the user to input a value for each variable, and then I want to check the DB. The value of each symptom can be any value in the three circles (mentioned in the original problem).
Create 3 tables as:
symptom = (symptom_id, descr)
problem = (problem_id, descr)
problem_symptom = (problem_id, symptom_id)
e.g.
symptom
symptom_id descr
1 doda
2 dado
3 dada
problem
problem_id descr
1 Short Circuit
2 Component Failure
problem_symptom
problem_id symptom_id
1 1 --- doda is a symptom of Short Circuit
1 2 --- dado is a symptom of Short Circuit
etc.
You can then query and join to determine the problems based on the symptoms.
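A minimal sketch of such a query in SQLite, here driven through Python's built-in sqlite3 module and using the example rows above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE symptom (symptom_id INTEGER PRIMARY KEY, descr TEXT);
    CREATE TABLE problem (problem_id INTEGER PRIMARY KEY, descr TEXT);
    CREATE TABLE problem_symptom (problem_id INTEGER, symptom_id INTEGER);
    INSERT INTO symptom VALUES (1, 'doda'), (2, 'dado'), (3, 'dada');
    INSERT INTO problem VALUES (1, 'Short Circuit'), (2, 'Component Failure');
    INSERT INTO problem_symptom VALUES (1, 1), (1, 2), (2, 3);
""")

# All problems involving a given symptom (here, symptom 1):
rows = cur.execute("""
    SELECT p.descr
    FROM problem p
    JOIN problem_symptom ps ON ps.problem_id = p.problem_id
    WHERE ps.symptom_id = 1
""").fetchall()
print(rows)  # [('Short Circuit',)]
```

With several observed symptoms you would filter on `symptom_id IN (...)` and, for example, rank problems by how many of the given symptoms they match.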
I would suggest 3 tables: symptom, problem, and problem_symptoms. The latter would be a join table between the first two.