geom_violin using the weight aesthetic unexpectedly drops levels - ggplot2

library(tidyverse)
set.seed(12345)
dat <- data.frame(year = c(rep(1990, 100), rep(1991, 100), rep(1992, 100)),
fish_length = sample(x = seq(from = 10, 131, by = 0.1), 300, replace = F),
nb_caught = sample(x = seq(from = 1, 200, by = 0.1), 300, replace = T),
stringsAsFactors = F) %>%
mutate(age = ifelse(fish_length < 20, 1,
ifelse(fish_length >= 20 & fish_length < 100, 2,
ifelse(fish_length >= 100 & fish_length < 130, 3, 4)))) %>%
arrange(year, fish_length)
head(dat)
year fish_length nb_caught age
1 1990 10.1 45.2 1
2 1990 10.7 170.0 1
3 1990 10.9 62.0 1
4 1990 12.1 136.0 1
5 1990 14.1 80.8 1
6 1990 15.0 188.9 1
dat %>% group_by(year) %>% summarise(ages = n_distinct(age)) # Only 1992 has age 4 fish
# A tibble: 3 x 2
year ages
<dbl> <int>
1 1990 3
2 1991 3
3 1992 4
dat %>% filter(age == 4) # only 1 row for age 4
year fish_length nb_caught age
1 1992 130.8 89.2 4
Here:
year = year of sampling
fish_length = length of the fish in cm
nb_caught = number of fish caught following the use of an age-length key, hence explaining the presence of decimals
age = age of the fish
graph1: geom_violin not using the weight aesthetic.
Here, I have to replicate each row of dat according to the value found in nb_caught.
dim(dat) # 300 rows
dat_graph1 <- dat[rep(1:nrow(dat), floor(dat$nb_caught)), ]
dim(dat_graph1) # 30932 rows
dat_graph1$nb_caught <- NULL # useless now
sum(dat$nb_caught) - nrow(dat_graph1) # 128.2 rows lost here
Since I have decimal values of nb_caught, I took the integer value to create dat_graph1. I lost 128.2 "rows" in the process.
Now for the graph:
dat_tile <- data.frame(year = sort(unique(dat$year))[sort(unique(dat$year)) %% 2 == 0])
# for the figure's background
graph1 <- ggplot(data = dat_graph1,
aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
color = as.factor(age), .drop = F)) +
geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
fill = "grey80", inherit.aes = F) +
geom_violin(draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
scale = "width", position = "dodge") +
scale_x_discrete(expand = c(0,0)) +
labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph1") +
scale_fill_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_color_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_y_continuous(expand = expand_scale(mult = 0.01)) +
theme_bw()
graph1
Note here that I have a flat bar for age 4 in year 1992.
dat_graph1 %>% filter(year == 1992, age == 4) %>% pull(fish_length) %>% unique
[1] 130.8
That is because I only have one length for that particular year-age combination.
graph2: geom_violin using the weight aesthetic.
Now, instead of copying each row of dat by the value of nb_caught, let's use the weight aesthetic.
Let's calculate the weight wt that each row of dat will have in the calculation of the density curve of each year-age combination.
dat_graph2 <- dat %>%
group_by(year, age) %>%
mutate(wt = nb_caught / sum(nb_caught)) %>%
as.data.frame()
head(dat_graph2)
year fish_length nb_caught age wt
1 1990 10.1 45.2 1 0.03573123
2 1990 10.7 170.0 1 0.13438735
3 1990 10.9 62.0 1 0.04901186
4 1990 12.1 136.0 1 0.10750988
5 1990 14.1 80.8 1 0.06387352
6 1990 15.0 188.9 1 0.14932806
graph2 <- ggplot(data = dat_graph2,
aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
color = as.factor(age), .drop = F)) +
geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
fill = "grey80", inherit.aes = F) +
geom_violin(aes(weight = wt), draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
scale = "width", position = "dodge") +
scale_x_discrete(expand = c(0,0)) +
labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph2") +
scale_fill_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_color_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_y_continuous(expand = expand_scale(mult = 0.01)) +
theme_bw()
graph2
dat_graph2 %>% filter(year == 1992, age == 4)
year fish_length nb_caught age wt
1 1992 130.8 89.2 4 1
Note here that the flat bar for age 4 in year 1992 seen on graph1 has been dropped here even though the line exists in dat_graph2.
My questions
Why is the age 4 in 1992 level dropped when using the weight aesthetic? How can I overcome this?
Why are the two graphs not visually alike even though they used the same data?
Thanks in advance for your help!

1.
Problem 1 is not related to using the weight aesthetic. You can check this by dropping the weight aesthetic from the code for your second graph. The problem is that the algorithm for computing the density fails when there are too few observations.
That is the reason why age 4 shows up in graph 1, which uses the expanded dataset: there you increase the number of observations by replicating each row.
Unfortunately, geom_violin gives no warning in your specific case. However, if you filter dat_graph2 for age == 4, geom_violin gives you the warning:
Warning message:
Computation failed in `stat_ydensity()`:
replacement has 1 row, data has 0
geom_density is much clearer on this issue: it gives a warning that groups with fewer than two observations have been dropped.
Unfortunately, I have no solution to overcome this, besides working with the expanded dataset.
2.
Concerning problem 2, I have no convincing answer. My guess is that it is related to the details of the kernel density estimator used by geom_violin, geom_density, etc., and perhaps also to the number of data points.
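One plausible contributor to both points can be sketched outside ggplot2 (in Python here, so that it runs on its own). The default bandwidth of `stat_ydensity()` is `bw = "nrd0"`, i.e. R's `bw.nrd0()` rule of thumb, which requires at least two data points and shrinks as the number of observations grows. The function `nrd0` below is a hypothetical re-implementation for illustration only, not ggplot2's code, and the sample is made up. It also assumes that the bandwidth is chosen from the number of rows rather than from the weights.

```python
import numpy as np

def nrd0(x):
    """Rule-of-thumb bandwidth modelled on R's bw.nrd0 (illustrative re-implementation)."""
    x = np.asarray(x, dtype=float)
    if x.size < 2:
        # A single observation has no spread, so no bandwidth can be chosen
        raise ValueError("need at least 2 data points")
    sd = x.std(ddof=1)
    q25, q75 = np.percentile(x, [25, 75])
    spread = min(sd, (q75 - q25) / 1.34)
    if spread == 0:
        # Fallbacks for degenerate samples, in the same order bw.nrd0 uses
        spread = sd or abs(x[0]) or 1.0
    # 0.9 * spread * n^(-1/5): more observations give a narrower kernel
    return 0.9 * spread * x.size ** -0.2

rng = np.random.default_rng(12345)
lengths = rng.uniform(10, 131, size=100)   # made-up sample with 100 rows
expanded = np.repeat(lengths, 100)         # every row copied 100 times

bw_rows = nrd0(lengths)
bw_expanded = nrd0(expanded)
print(bw_rows, bw_expanded)
```

Under these assumptions the expanded sample gets a noticeably smaller bandwidth than the original rows, so its density curve is less smooth. A single row such as `nrd0([130.8])` raises an error, whereas 89 copies of the same value pass the size check, which would be consistent with the flat bar appearing only in graph 1.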

Related

ggplot2 Expand the plot limits giving error

I have a df:
test<- data.frame (Metrics = c("PCT_PF_READS (%)" , "PCT_Q30_R1 (%)" , "PCT_Q30_R2 (%)"),
LowerLimit = c(80,80,80),
Percent = c(93.1,95.1,92.4)
)
> test
Metrics LowerLimit Percent
1 PCT_PF_READS (%) 80 93.1
2 PCT_Q30_R1 (%) 80 95.1
3 PCT_Q30_R2 (%) 80 92.4
I am trying to plot in ggplot2 but I want to specify the yaxis.
If I do:
ggplot(data = test, aes(x = Metrics, y = Percent)) +
geom_bar(stat = "identity")
If I try to set the y-axis to start at 75, I get a blank plot:
ggplot(data = test, aes(x = Metrics, y = Percent)) +
geom_bar(stat = "identity") + scale_y_continuous(limits = c(75, 100))
with the message
Warning message:
Removed 3 rows containing missing values (geom_bar)
But the values are in range????
Does this answer your question?
library(tidyverse)
test<- data.frame (Metrics = c("PCT_PF_READS (%)" , "PCT_Q30_R1 (%)" , "PCT_Q30_R2 (%)"),
LowerLimit = c(80,80,80),
Percent = c(93.1,95.1,92.4)
)
# Starting plot:
ggplot(data = test, aes(x = Metrics, y = Percent)) +
geom_bar(stat = "identity")
# If "limits" cuts off any part of a bar, the whole bar is removed.
# With limits = c(75, 100) every bar is removed, because each bar starts at 0.
# E.g. this removes the middle bar (Percent = 95.1)
ggplot(data = test, aes(x = Metrics, y = Percent)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0,95))
#> Warning: Removed 1 rows containing missing values (position_stack).
# A better solution is to use "coord_cartesian()"
ggplot(data = test, aes(x = Metrics, y = Percent)) +
geom_bar(stat = "identity") +
coord_cartesian(ylim = c(75, 100))
# Although it's generally advised to keep the whole axis,
# as 'chopping off' the bottom can be misleading
# Another alternative is to write the percentages on the plot:
ggplot(data = test, aes(x = Metrics, y = Percent)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(Percent, "%")),
nudge_y = 2)
Created on 2022-10-19 by the reprex package (v2.0.1)

Controlling bar length in ggplot

I want to use the plot below in an rmarkdown Word document, but the bars are too long; the axis break is good. I want to compress the size of the ggplot bars while maintaining readability.
library(tidyverse)
library(scales)
library(patchwork)
options(scipen = 999)
df <- tribble(
~category, ~ numbers, ~value,
"category 1",8, 9901020,
"category 2",8 ,18629623,
"category 3",9 ,16471680,
"category 4",7 ,13661732,
"category 5",5 ,7173011,
"category 6",10, 18395574)
dfmod <- df%>%
dplyr::mutate(numbers_lable= numbers
,numbers = numbers * 1e6)
pl <- ggplot(data = dfmod,aes(x= fct_reorder(category,desc(value))
, y = value))
pl <- pl + geom_col(fill = "grey")
pl <- pl + geom_line(aes(y = numbers, group = 1), size = .8, color = "blue")
pl <- pl + geom_text(aes(label = paste0('Sales'
, scales::comma(value)
, '\n(Orders '
, numbers_lable,')'))
, vjust = -0.3
, size = 3)
pl <- pl + scale_y_continuous(name= "Sales revenue"
,labels = scales::comma_format(scale = 1e-6
,suffix = "M")
, sec.axis = sec_axis( trans= ~./1e6
, name = "Number of sales"
, breaks = c(0,2,4,6,8,10)))
pl <- pl + theme_bw()
pl <- pl + theme(axis.title.y = element_text(color = "grey")
,axis.title.y.right = element_text(color = "blue")
)
pl
I tried fig.width/fig.height and out.width/out.height, but whenever I change them they just chop the labels.
In Excel I would size the chart as, for example, 12 cm * 5 cm,
and everything would still be visible, just in a smaller space.
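Extend the upper limit of your y-axis, e.g.,
pl <- pl + scale_y_continuous(
  name = "Sales revenue",
  labels = scales::comma_format(scale = 1e-6, suffix = "M"),
  sec.axis = sec_axis(
    trans = ~ . / 1e6,
    name = "Number of sales",
    breaks = c(0, 2, 4, 6, 8, 10)),
  # 20% headroom so the text labels above the bars are not clipped
  limits = c(0, max(df$value) * 1.2)
)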
Use fig.width/fig.height for output-dependent adjustment. The following looks fine for PDF output
```{r, echo=F, fig.width=8, fig.height=4}
<your figure code>
```

I want to use values from dataframeA as upper and lower bounds to filter dataframeB

I have two dataframes A and B.
Dataframe A has 4 columns with 2 sets of maximum and minimums that I want to use as upper and lower bounds for 2 columns in dataframe B.
latitude = data['y']
longitude = data['x']
upper_lat = coords['lat_max']
lower_lat = coords['lat_min']
upper_lon = coords['long_max']
lower_lon = coords['long_min']
def filter_data_2(filter, upper_lat, lower_lat, upper_lon, lower_lon, lat, lon):
    v = filter[(lower_lat <= lat <= upper_lat ) & (lower_lon <= lon <= upper_lon)]
    return v
newdata = filter_data_2(data, upper_lat, lower_lat, upper_lon, lower_lon, latitude, longitude)
ValueError: Can only compare identically-labeled Series objects
MWE:
import pandas as pd
a = {'lower_lon': [2,4,6], 'upper_lon': [4,6,10], 'lower_lat': [1,3,5], 'upper_lat': [3,5,7]}
constraints = pd.DataFrame(data=a)
constraints
lower_lon upper_lon lower_lat upper_lat
0 2 4 1 3
1 4 6 3 5
2 6 10 5 7
b = {'lon' : [3, 5, 7, 9, 11, 13, 15], 'lat': [2, 4, 6, 8, 10, 12, 14]}
to_filter = pd.DataFrame(data=b)
to_filter
lon lat
0 3 2
1 5 4
2 7 6
3 9 8
4 11 10
5 13 12
6 15 14
lat = to_filter['lat']
lon = to_filter['lon']
lower_lon = constraints['lower_lon']
upper_lon = constraints['upper_lon']
lower_lat = constraints['lower_lat']
upper_lat = constraints['upper_lat']
v = to_filter[(lower_lat <= lat) & (lat <= upper_lat) & (lower_lon <= lon) & (lon <= upper_lon)]
Expected Results
v
lon lat
0 3 2
1 5 4
2 7 6
The global filter is the union of the rows selected by each constraint. In pandas you could do:
v = pd.DataFrame()
for i in constraints.index:
    # Current constraints (column order: lower_lon, upper_lon, lower_lat, upper_lat)
    min_lon, max_lon, min_lat, max_lat = constraints.loc[i, :]
    # Apply filter; each comparison needs its own parentheses because & binds tighter than >= and <=
    df = to_filter[(to_filter.lon >= min_lon) & (to_filter.lon <= max_lon) & (to_filter.lat >= min_lat) & (to_filter.lat <= max_lat)]
    # Join previous and current filter outcomes in a single df
    v = pd.concat([v, df])
# Remove duplicates, if any
v = v.drop_duplicates()
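The same union of constraints can also be computed without a Python loop. The following is a minimal sketch, assuming the `constraints` and `to_filter` frames from the MWE in the question; the variable name `inside` is made up for illustration. NumPy broadcasting compares every point with every constraint row at once, and a point is kept if it falls inside at least one box.

```python
import numpy as np
import pandas as pd

constraints = pd.DataFrame({
    "lower_lon": [2, 4, 6], "upper_lon": [4, 6, 10],
    "lower_lat": [1, 3, 5], "upper_lat": [3, 5, 7],
})
to_filter = pd.DataFrame({
    "lon": [3, 5, 7, 9, 11, 13, 15],
    "lat": [2, 4, 6, 8, 10, 12, 14],
})

# Column vectors of shape (n_points, 1) broadcast against
# constraint vectors of shape (n_constraints,)
lon = to_filter["lon"].to_numpy()[:, None]
lat = to_filter["lat"].to_numpy()[:, None]

# inside[i, j] is True when point i lies within constraint box j
inside = (
    (lon >= constraints["lower_lon"].to_numpy())
    & (lon <= constraints["upper_lon"].to_numpy())
    & (lat >= constraints["lower_lat"].to_numpy())
    & (lat <= constraints["upper_lat"].to_numpy())
)

# Keep the points that fall inside at least one box
v = to_filter[inside.any(axis=1)]
print(v)
```

Because each point is kept at most once, no `drop_duplicates()` step is needed. The intermediate array has one cell per point-constraint pair, so for very large inputs the loop may use less memory.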

geom_vline() dateRangeInput()

I have set up a line graph in shiny. The x axis has dates covering 2014 to current date.
I have set up various vertical lines using geom_vline() to highlight points in the data.
I'm using dateRangeInput() so the user can choose the start/end date range to look at on the graph.
One of my vertical lines is in Feb 2014. If the user uses the dateRangeInput() to look at dates from say Jan 2016 the vertical line for Feb 2014 is still showing on the graph. This is also causing the x axis to go from 2014 even though the data line goes from Jan 2016 to current date.
Is there a way to stop this vertical line showing on the graph when it's outside of the range chosen in dateRangeInput()? Maybe there's an argument in geom_vline() to deal with this?
library(shiny)
library(tidyr)
library(dplyr)
library(ggplot2)
d <- seq(as.Date("2014-01-01"),Sys.Date(),by="day")
df <- data.frame(date = d , number = seq(1,length(d),by=1))
lines <- data.frame(x = as.Date(c("2014-02-07","2017-10-31", "2017-08-01")),
y = c(2500,5000,7500),
lbl = c("label 1", "label 2", "label 3"))
#UI
ui <- fluidPage(
#date range select:
dateRangeInput(inputId = "date", label = "choose date range",
start = min(df$date), end = max(df$date),
min = min(df$date), max = max(df$date)),
#graph:
plotOutput("line")
)
#SERVER:
server <- function(input, output) {
data <- reactive({ subset(df, date >= input$date[1] & date <= input$date[2])
})
#graph:
output$line <- renderPlot({
my_graph <- ggplot(data(), aes(date, number )) + geom_line() +
geom_vline(data = lines, aes(xintercept = x, color = factor(x) )) +
geom_label(data = lines, aes(x = x, y = y,
label = lbl, colour = factor(x),
fontface = "bold" )) +
scale_color_manual(values=c("#CC0000", "#6699FF", "#99FF66")) +
guides(colour = "none", size = "none")
return(my_graph)
})
}
shinyApp(ui = ui, server = server)
As mentioned by Aimée in a different thread:
In a nutshell, ggplot2 will always plot all of the data that you provide and the axis limits are based on that unless you specify otherwise. So because you are telling it to plot the line & label, they will appear on the plot even though the rest of the data doesn't extend that far.
You can resolve this by telling ggplot2 what you want the limits of your x axis to be, using the coord_cartesian function.
# Set the upper and lower limit for the x axis
dateRange <- c(input$date[1], input$date[2])
my_graph <- ggplot(df, aes(date, number)) + geom_line() +
geom_vline(data = lines, aes(xintercept = x, color = factor(x) )) +
geom_label(data = lines, aes(x = x, y = y,
label = lbl, colour = factor(x),
fontface = "bold" )) +
scale_color_manual(values=c("#CC0000", "#6699FF", "#99FF66")) +
guides(colour = "none", size = "none") +
coord_cartesian(xlim = dateRange)

How to use r kmeans cluster vector to repaint plot?

km = kmeans(FourA,3)
km$cluster
[1] 1 1 1 2 1 1 2 2 2 2 3 2 ...
How do I use the km$cluster vector to create 3 new arrays so that I can plot the graph with the three clusters using a different character/color?
For your reference
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 3, nstart = 25)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:3, pch = 8)