Adding secondary y axis in ggplot2 without scaling factor - ggplot2

I want to plot two y axes: one with continuous data and the other with values ranging from 0 to 7.
Example:
ID IHC FISH1 FISH2
1 3 11.5 9.5
2 1 2.9 3.9
3 2 1.5 6.5
4 1 3.3 1.3
5 2 5.5 8.5
6 2 6.6 9.6
How can I plot a secondary y axis if it is not related to the primary y axis?
I want to code this in R.
I want a plot like this as output (image not included). This is what I have tried:
data %>%
  select(ID, IHC, FISH1, FISH2) %>%
  gather(key = "FISH_IHC", value = "FISH_val", FISH1, FISH2, IHC, -ID) %>%
  mutate(as_factor = as.factor(FISH_IHC)) %>%
  ggplot(aes(x = reorder(ID, FISH_val), y = FISH_val, group = as_factor), na.rm = TRUE) +
  geom_point(aes(shape = as_factor, color = as_factor)) +
  scale_y_continuous(limits = c(0, 10),
                     oob = function(x, ...){x},
                     expand = c(0, -1),
                     breaks = number_ticks(10),
                     sec.axis = sec_axis(scales::rescale(pretty(range(IHC)),
                                                         name = "IHC")))

Related

pandas-groupby: apply custom function which needs 2 columns as input to get one column as output

I have a dataframe with dates and a value per day. I want to see the gradient of the value, i.e. whether it is growing, declining, and so on. The best way is to apply a linear regression with day as x and value as y:
import pandas as pd
df = pd.DataFrame({'customer':['a','a','a','b','b','b'],
'day':[1,2,4,2,3,4],
'value':[1.5,2.4,3.6,1.5,1.3,1.1]})
df:
customer day value
0 a 1 1.5
1 a 2 2.4
2 a 4 3.6
3 b 2 1.5
4 b 3 1.3
5 b 4 1.1
By hand I can do a linear regression:
from sklearn.linear_model import LinearRegression
def gradient(x,y):
    return LinearRegression().fit(x,y).coef_[0]
xa = df[df.customer =='a'].day.values.reshape(-1, 1)
ya = df[df.customer =='a'].value.values.reshape(-1, 1)
xb = df[df.customer =='b'].day.values.reshape(-1, 1)
yb = df[df.customer =='b'].value.values.reshape(-1, 1)
print(gradient(xa,ya),gradient(xb,yb))
result: [0.68571429] [-0.2]
But I would like to use a groupby as in
df.groupby('customer').agg({'value':['mean','sum','gradient']})
with an output like:
value
mean sum gradient
customer
a 2.5 7.5 0.685
b 1.3 3.9 -0.2
The issue is that the gradient needs two columns as input.
You can do:
# calculate gradient
v = (df
     .groupby('customer')
     .apply(lambda x: gradient(x['day'].to_numpy().reshape(-1, 1),
                               x['value'].to_numpy())))  # 1-D y, so coef_[0] is a scalar
v.name = 'gradient'
# calculate mean, sum
d1 = df.groupby('customer').agg({'value': ['mean', 'sum']})
# join the results
d1 = d1.join(v)
# fix columns
d1.columns = d1.columns.str.join('')
print(d1)
valuemean valuesum gradient
customer
a 2.5 7.5 0.685714
b 1.3 3.9 -0.200000
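As an alternative sketch of the same idea, the slope can be computed per group and assigned as a new column, which avoids joining a frame with two column levels to a flat Series (some pandas versions may reject that join). This version swaps scikit-learn's LinearRegression for numpy.polyfit with degree 1, which fits the same least-squares line; the helper name slope and the result name stats are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b", "b"],
    "day": [1, 2, 4, 2, 3, 4],
    "value": [1.5, 2.4, 3.6, 1.5, 1.3, 1.1],
})


def slope(group: pd.DataFrame) -> float:
    """Least-squares slope of value against day for one customer (hypothetical helper)."""
    # np.polyfit with deg=1 returns [slope, intercept]; keep only the slope.
    return float(np.polyfit(group["day"], group["value"], deg=1)[0])


grouped = df.groupby("customer")

# Single-column aggregations produce flat column names: mean, sum.
stats = grouped["value"].agg(["mean", "sum"])

# The two-column function goes through apply; selecting the columns first
# keeps the grouping key out of the frame passed to slope().
stats["gradient"] = grouped[["day", "value"]].apply(slope)

print(stats)
```

The assignment aligns on the customer index, so no explicit join or column flattening is needed.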

Subplots in Python with multiple line charts using pandas and seaborn

I have a data frame as shown below
product bought_date Monthly_profit Average_discout
A 2016 85000000 5
A 2017 55000000 5.6
A 2018 45000000 10
A 2019 35000000 9.8
B 2016 75000000 5
B 2017 55000000 4.6
B 2018 75000000 11
B 2019 45000000 9.8
C 2016 95000000 5.3
C 2017 55000000 5.1
C 2018 50000000 10.2
C 2019 45000000 9.8
From the above I would like to plot 3 subplots,
one each for products A, B and C.
In each subplot there should be 3 line plot, where
X axis = bought_date
Y axis1 = Monthly_profit
Y axis2 = Average_discout
I tried the code below.
sns.set(style = 'darkgrid')
sns.lineplot(x = 'bought_date', y = 'Monthly_profit', style = 'product',
data = df1, markers = True, ci = 68, err_style='bars')
Variant 1: using subplots and separating the data manually:
import matplotlib.pyplot as plt
import seaborn as sns

products = df['product'].unique()
fig, ax = plt.subplots(1, len(products), figsize=(20, 10))
for i, p in enumerate(products):
    # x and y are passed by keyword; recent seaborn versions no longer accept them positionally
    sns.lineplot(x='bought_date', y='Monthly_profit', data=df[df['product']==p], ax=ax[i])
    sns.lineplot(x='bought_date', y='Average_discout', data=df[df['product']==p], ax=ax[i].twinx(), color='orange')
    ax[i].legend([f'Product {p}'])
Variant 2: using FacetGrid:
def lineplot2(x, y, y2, **kwargs):
    # draw the first series, then the second one on a twin y axis of the same facet
    ax = sns.lineplot(x=x, y=y, **kwargs)
    ax2 = ax.twinx()
    sns.lineplot(x=x, y=y2, ax=ax2, **kwargs)

g = sns.FacetGrid(df, col='product')
g.map(lineplot2, 'bought_date', 'Monthly_profit', 'Average_discout', marker='o')
These are just general rough examples, you'll have to tidy up axis labels etc. as needed.
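As one possible way to do that tidying, here is a minimal sketch using plain matplotlib instead of seaborn (a deliberate simplification, not the answer's method). It rebuilds the question's data frame, draws one subplot per product, and labels each y axis in the colour of its line. The Agg backend line and the names axes and twin_axes are assumptions made so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "product": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "bought_date": [2016, 2017, 2018, 2019] * 3,
    "Monthly_profit": [85000000, 55000000, 45000000, 35000000,
                       75000000, 55000000, 75000000, 45000000,
                       95000000, 55000000, 50000000, 45000000],
    "Average_discout": [5, 5.6, 10, 9.8, 5, 4.6, 11, 9.8, 5.3, 5.1, 10.2, 9.8],
})

products = df["product"].unique()
# squeeze=False keeps axes two-dimensional even if there is only one product.
fig, axes = plt.subplots(1, len(products), figsize=(15, 4), squeeze=False)

twin_axes = []
for ax, p in zip(axes[0], products):
    sub = df[df["product"] == p]

    # Left y axis: profit.
    ax.plot(sub["bought_date"], sub["Monthly_profit"], marker="o", color="tab:blue")
    ax.set_xlabel("bought_date")
    ax.set_ylabel("Monthly_profit", color="tab:blue")
    ax.set_xticks(sub["bought_date"])  # show whole years only
    ax.set_title(f"Product {p}")

    # Right y axis: discount, on a twin axis sharing the same x axis.
    ax2 = ax.twinx()
    ax2.plot(sub["bought_date"], sub["Average_discout"], marker="o", color="tab:orange")
    ax2.set_ylabel("Average_discout", color="tab:orange")
    twin_axes.append(ax2)

fig.tight_layout()
fig.savefig("profit_discount.png")  # hypothetical output file name
```

Colouring each axis label like its line is one way to stand in for a legend when two scales share a panel.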

When using coord_cartesian, the y axis disappears

I have the following table:
x var y
a group1 0.5
b group1 -0.65
c group1 -1.3
d group1 0.2
a group2 1.2
b group2 -1.6
c group2 -0.7
d group2 -3
I want to plot x against y, in two different plots by var (group1 or 2), using ggplot.
However, I also want to "zoom in" into the y-axis, thus showing the whole x axis but, in the y-axis, only values from -0.5 to -3:
ggplot(table,
aes(x = x,
y = y)) +
geom_point() +
facet_wrap(vars(var)) +
scale_y_continuous() +
coord_cartesian(ylim = c(-0.5,
-3))
However, this removes the values and ticks from the y axis, and I do not know how to make them appear.

Joining without matching all rows of 'y' using dplyr

The problems with the base function merge are well documented online yet still cause havoc. plyr::join solved many of these issues and works fantastically. The new kid on the block is dplyr. I'd like to know how to perform option 2 in the example below using dplyr. Does anyone know if that's possible, and should it be a feature request?
Reproducible example
df1 <- data.frame(nm = c("y", "x", "z"), v2 = 10:12)
df2 <- data.frame(nm = c("x", "x", "y", "z", "x"), v1 = c(1, 1, 2, 3, 1))
Option 1: merge
merge(df1, df2, by = "nm", all.x = T, all.y = F)
This doesn't provide what I want and messes with the order:
## nm v2 v1
## 1 x 11 1
## 2 x 11 1
## 3 x 11 1
## 4 y 10 2
## 5 z 12 3
Option 2: plyr - this is what I want, but it's a little slow
library(plyr)
join(df1, df2, match = "first")
Note: only rows from x are kept:
## nm v2 v1
## 1 y 10 2
## 2 x 11 1
## 3 z 12 3
Option 3: dplyr:
library(dplyr)
inner_join(df1, df2)
This changes the order and keeps rows from y.
## nm v2 v1
## 1 x 11 1
## 2 x 11 1
## 3 y 10 2
## 4 z 12 3
## 5 x 11 1
left_join(df1, df2)
The only difference here is the order:
## nm v2 v1
## 1 y 10 2
## 2 x 11 1
## 3 x 11 1
## 4 x 11 1
## 5 z 12 3
This is a really useful feature, so I'm surprised option 2 is not even possible with dplyr, unless I've missed something.
I don't think what you are looking for is possible using dplyr. However, in this case you can get the desired output using the code below.
library(dplyr)
unique(inner_join(df1, df2))
Output:
nm v2 v1
1 x 11 1
3 y 10 2
4 z 12 3

Is cut() style binning available in dplyr?

Is there a way to do something like a cut() function for binning numeric values in a dplyr table? I'm working on a large postgres table and can currently either write a case statement in the sql at the outset, or output unaggregated data and apply cut(). Both have pretty obvious downsides... case statements are not particularly elegant and pulling a large number of records via collect() not at all efficient.
Just so there's an immediate answer for others arriving here via search engine: the n-breaks form of cut is now implemented as the ntile function in dplyr:
data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = ntile(x, 2))
x bin
1 5 2
2 1 1
3 3 2
4 2 1
5 2 1
6 3 2
I see this question was never updated with the tidyverse solution so I'll add it for posterity.
The function to use is cut_interval from the ggplot2 package. It works similarly to base::cut, but in my experience it does a better job of marking start and end points, because cut increases the range by 0.1% at each end.
data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = cut_interval(x, n = 2))
x bin
1 5 (3,5]
2 1 [1,3]
3 3 [1,3]
4 2 [1,3]
5 2 [1,3]
6 3 [1,3]
You can also specify the bin width with cut_width.
data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = cut_width(x, width = 2, center = 1))
x bin
1 5 (4,6]
2 1 [0,2]
3 3 (2,4]
4 2 [0,2]
5 2 [0,2]
6 3 (2,4]
The following works with dplyr, assuming x is the variable we wish to bin:
# Make n bins
df %>% mutate( x_bins = cut( x, breaks = n ) )
# Or make specific bins
df %>% mutate( x_bins = cut( x, breaks = c(0,2,6,10) ) )