Not seeing count labels on bar chart in ggplot2

I'd like to add count labels to a simple bar chart. I've tried the following code with and without y=count; it runs without error but doesn't display the labels:
Class <- ggplot(asteroid, aes(x=class, y=count, fill = class, labels = TRUE))
Class +
geom_bar() +
ggtitle('Frequency of Hazardous and Non-Hazardous Asteroids by Class') +
xlab('Class Based on Solar System Location') +
labs(caption = '(*) indicates a potentially hazardous object') +
scale_fill_discrete(labels=c('Amor*','Apollo','Apollo*','Aten','Aten*','Interior Earth Object*'))
I have also tried this:
Class <- ggplot(asteroid, aes(x=class, y=count, fill = class))
Class +
geom_bar() +
ggtitle('Frequency of Hazardous and Non-Hazardous Asteroids by Class') +
xlab('Class Based on Solar System Location') +
labs(caption = '(*) indicates a potentially hazardous object') +
scale_fill_discrete(labels=c('Amor*','Apollo','Apollo*','Aten','Aten*','Interior Earth Object*'))+
geom_bar(stat = "identity") +
geom_text(aes(label = count), vjust = 0)
but I'm met with the following error:
"Error in `f()`:
! Aesthetics must be valid data columns. Problematic aesthetic(s): y = count.
Did you mistype the name of a data column or forget to add after_stat()?"
R seems to be interpreting "count" as count(), but I can't find a way to work around this. I have tried using freq, Count, etc. instead. Those were long shots, but I'm a bit stumped.
Edit: Here's a snapshot of the character variable I'm working with:
> head(dput(asteroid$class))
> [1] "APO*" "APO*" "APO*" "APO*" "APO*" "APO*"
And the dataset:
head(dput(asteroid))
# A tibble: 6 × 17
Object...1 `Epoch (TDB)` `a (AU)` e `i (deg)` `w (deg)` `Node (deg)` `M (deg)` `q (AU)` `Q (AU)` `P (yr)`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1566 Icarus 57800 1.08 0.827 22.8 31.4 88.0 216. 0.187 1.97 1.12
2 1620 Geographos 57800 1.25 0.335 13.3 277. 337. 104. 0.828 1.66 1.39
3 1862 Apollo 55249 1.47 0.560 6.35 286. 35.7 175. 0.647 2.29 1.78
4 1981 Midas 57800 1.78 0.650 39.8 268. 357. 173. 0.621 2.93 2.37
5 2101 Adonis 57800 1.87 0.765 1.33 43.4 350. 235. 0.441 3.31 2.57
6 2102 Tantalus 57800 1.29 0.299 64.0 61.6 94.4 355. 0.904 1.68 1.47
# … with 6 more variables: `H (mag)` <dbl>, `MOID (AU)` <dbl>, ref <dbl>, class <chr>, Object...16 <chr>,
# Hazardous <dbl>

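Updated with your data sample.
asteroid is uncounted (one row per object, with no count column), so in the first version I've added count(class) to create the counts. The second version leaves the data uncounted and lets geom_bar() do the counting.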
library(tidyverse)
asteroid <- data.frame(class = c("apo",'ieo','amor','aten', 'amor'))
# Counted
asteroid |>
count(class) |>
ggplot(aes(class, n, fill = class)) +
geom_col() +
ggtitle('Frequency of Hazardous and Non-Hazardous Asteroids by Class') +
xlab('Class Based on Solar System Location') +
labs(caption = '(*) indicates a potentially hazardous object') +
geom_text(aes(label = n), vjust = 0)
# Uncounted
asteroid |>
ggplot(aes(class, fill = class)) +
geom_bar() +
ggtitle('Frequency of Hazardous and Non-Hazardous Asteroids by Class') +
xlab('Class Based on Solar System Location') +
labs(caption = '(*) indicates a potentially hazardous object') +
geom_text(aes(label = after_stat(count)), stat = 'count', vjust = 0)
Created on 2022-06-06 by the reprex package (v2.0.1)
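To make the counted/uncounted distinction concrete, here is a minimal sketch in Python (not R) of what the counting step produces. It uses the same toy class values as the answer; the variable names are only illustrative.

```python
from collections import Counter

# Uncounted data: one entry per asteroid, as in the toy data frame above.
classes = ["apo", "ieo", "amor", "aten", "amor"]

# Counted data: one row per class with its frequency, ordered by class name.
counted_rows = sorted(Counter(classes).items())

for cls, n in counted_rows:
    # Each bar gets height n, and n is also the text of its label.
    print(cls, n)
```

The uncounted version of the plot performs this same tally internally, which is why the label has to come from the computed count rather than from a column of the data.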


Dirichlet regression coefficients

I am starting from this example of Dirichlet regression.
My response y is a vector of 3 components, and the Dirichlet regression model estimates coefficients for only 2 of them (muy2 and muy3 in the summary below).
Suppose I am interested in the coefficients for all 3 components: how can I get them?
Thanks!
library(brms)
library(rstan)
library(dplyr)
bind <- function(...) cbind(...)
N <- 20
df <- data.frame(
y1 = rbinom(N, 10, 0.5), y2 = rbinom(N, 10, 0.7),
y3 = rbinom(N, 10, 0.9), x = rnorm(N)
) %>%
mutate(
size = y1 + y2 + y3,
y1 = y1 / size,
y2 = y2 / size,
y3 = y3 / size
)
df$y <- with(df, cbind(y1, y2, y3))
make_stancode(bind(y1, y2, y3) ~ x, df, dirichlet())
make_standata(bind(y1, y2, y3) ~ x, df, dirichlet())
fit <- brm(bind(y1, y2, y3) ~ x, df, dirichlet())
summary(fit)
Family: dirichlet
Links: muy2 = logit; muy3 = logit; phi = identity
Formula: bind(y1, y2, y3) ~ x
Data: df (Number of observations: 20)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
muy2_Intercept 0.29 0.10 0.10 0.47 1.00 2830 2514
muy3_Intercept 0.56 0.09 0.38 0.73 1.00 2833 2623
muy2_x 0.04 0.11 -0.17 0.24 1.00 3265 2890
muy3_x -0.00 0.10 -0.20 0.19 1.00 3229 2973
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi 39.85 9.13 23.83 59.78 1.00 3358 2652
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
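One way to read the summary above, sketched in Python. This assumes the model treats the first component (y1) as a reference category whose intercept and slope are fixed at 0, and maps the linear predictors to mean proportions with a softmax. Under that assumption there is no separately estimated third set of coefficients, but the expected proportion of every component can still be recovered. The numbers are the posterior means copied from the summary; the function name is illustrative.

```python
import math

# Posterior means from the summary above.
# Assumption: y1 is the reference category, so its coefficients are fixed at 0.
intercepts = {"y1": 0.0, "y2": 0.29, "y3": 0.56}
slopes = {"y1": 0.0, "y2": 0.04, "y3": 0.0}  # muy3_x is reported as -0.00

def expected_proportions(x):
    """Softmax of the linear predictors: expected share of each component at x."""
    eta = {k: intercepts[k] + slopes[k] * x for k in intercepts}
    total = sum(math.exp(v) for v in eta.values())
    return {k: math.exp(v) / total for k, v in eta.items()}

props_at_zero = expected_proportions(0.0)
print({k: round(v, 3) for k, v in props_at_zero.items()})
```

The three proportions always sum to 1, which is why only two sets of coefficients are free.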

How can I plot my own data on a grid in an sf map? The result comes out empty

I am trying to summarize some statistics in the grid that I made, however something fails when I try to do it.
This is my data
head(catk)
Simple feature collection with 6 features and 40 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 303.22 ymin: -61.43 xmax: 303.95 ymax: -60.78
Geodetic CRS: WGS 84
# A tibble: 6 × 41
X1 day month year c1_id greenweight_caught_kg obs_haul_id
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 1 4 12 1997 26529 7260 NA
2 2 4 12 1997 26530 7920 NA
3 3 4 12 1997 26531 4692 NA
4 4 4 12 1997 26532 5896 NA
5 5 4 12 1997 26533 88 NA
6 6 5 12 1997 26534 7040 NA
# … with 34 more variables: obs_logbook_id <lgl>, obs_haul_number <lgl>,
# haul_number <dbl>, vessel_name <chr>, vessel_nationality_code <chr>,
# fishing_purpose_code <chr>, season_ccamlr <dbl>,
# target_species <chr>, asd_code <dbl>, trawl_technique <lgl>,
# catchperiod_code <chr>, date_catchperiod_start <date>,
# datetime_set_start <dttm>, datetime_set_end <dttm>,
# datetime_haul_start <dttm>, datetime_haul_end <dttm>, …
and this is how I built the map and the grid:
an <- getData("GADM", country = "ATA", level = 0)
an@data$NAME_0
e <- extent(-70,-40,-68,-60)
rc <- crop(an, e)
proj4string(rc) <- CRS("+init=epsg:4326")
rc3 <- st_as_sf(rc)
catk <- st_as_sf(catk, coords = c("Longitude", "Latitude"), crs = 4326) %>%
st_shift_longitude()
Grid <- rc3 %>%
st_make_grid(cellsize = c(1,0.4)) %>% # so that the cells come out square
st_cast("MULTIPOLYGON") %>%
st_sf() %>%
mutate(cellid = row_number())
result <- Grid %>%
st_join(catk) %>%
group_by(cellid) %>%
summarise(sum_cat = sum(Catcht))
but I cannot display the data on the grid:
ggplot() +
geom_sf(data = Grid, color="#d9d9d9", fill=NA) +
geom_sf(data = rc3) +
theme_bw() +
coord_sf() +
scale_alpha(guide="none")+
xlab(expression(paste(Longitude^o,~'O'))) +
ylab(expression(paste(Latitude^o,~'S')))+
guides( colour = guide_legend()) +
theme(panel.background = element_rect(fill = "#f7fbff"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
theme(legend.position = "right")+
xlim(-69,-45)
[Image: the failed plot]
Please help me find a solution!
So I just saw that you shifted the coordinates with st_shift_longitude() and therefore your bounding box is:
Bounding box: xmin: 303.22 ymin: -61.43 xmax: 303.95 ymax: -60.78
Do you really need it? That doesn't match your defined extent:
e <- extent(-70,-40,-68,-60)
And a bounding box in WGS 84 is supposed to be at most c(-180, -90, 180, 90).
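The mismatch can be illustrated with a small Python sketch (not R). It assumes the shift maps longitudes from the -180..180 convention to 0..360, and it checks a hypothetical point against the extent used in the question (xmin = -70, xmax = -40, ymin = -68, ymax = -60). The point is chosen so that its shifted longitude equals the xmin of the reported bounding box.

```python
def shift_longitude(lon):
    """Map a longitude from the -180..180 convention to 0..360."""
    return lon % 360

def in_extent(lon, lat, xmin=-70, xmax=-40, ymin=-68, ymax=-60):
    """True if the point lies inside the extent defined in the question."""
    return xmin <= lon <= xmax and ymin <= lat <= ymax

# Hypothetical point: 303.22 - 360 = -56.78
lon, lat = -56.78, -61.43
shifted = shift_longitude(lon)

print(round(shifted, 2))        # 303.22
print(in_extent(lon, lat))      # True: the unshifted point is inside the grid
print(in_extent(shifted, lat))  # False: the shifted point misses every cell
```

A point that misses every cell contributes nothing to the spatial join, so every per-cell sum stays NA.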
Also, in your plot you are not instructing ggplot2 to do anything with the values of catk. Grid and rc3 contain nothing from catk; only the result object does.
Anyway, I gave your problem a try even though I don't have access to your dataset. I plot sum_cat from result on each cell:
library(raster)
library(sf)
library(dplyr)
library(ggplot2)
# Mock your data
catk <- structure(list(Longitude = c(-59.0860203764828, -50.1352159580643,
-53.7671292009259, -67.9105254106185, -67.5753491797446, -51.7045571975837,
-45.6203830411619, -61.2695183762776, -51.6287384188695, -52.244074640978,
-45.4625770258213, -51.0935832496694, -45.6375681312716, -44.744215508174,
-66.3625310507564), Latitude = c(-62.0038884948778, -65.307178606448,
-65.8980199769778, -60.4475595973147, -67.7543165093134, -60.4616894158005,
-67.9720957068844, -62.2184680275876, -66.2345680342004, -64.1523442367459,
-62.5435163857161, -65.9127866479611, -66.8874734854608, -62.0859917484373,
-66.8762861503705), Catcht = c(18L, 95L, 32L, 40L, 16L, 49L,
22L, 14L, 86L, 45L, 3L, 51L, 45L, 41L, 19L)), row.names = c(NA,
-15L), class = "data.frame")
# Start the analysis
an <- getData("GADM", country = "ATA", level = 0)
e <- extent(-70,-40,-68,-60)
rc <- crop(an, e)
proj4string(rc) <- CRS("+init=epsg:4326")
rc3 <- st_as_sf(rc)
# Don't think you need st_shift_longitude, removed
catk <- st_as_sf(catk, coords = c("Longitude", "Latitude"), crs = 4326)
Grid <- rc3 %>%
st_make_grid(cellsize = c(1,0.4)) %>% # so that the cells come out square
st_cast("MULTIPOLYGON") %>%
st_sf() %>%
mutate(cellid = row_number())
result <- Grid %>%
st_join(catk) %>%
group_by(cellid) %>%
summarise(sum_cat = sum(Catcht))
ggplot() +
geom_sf(data = Grid, color="#d9d9d9", fill=NA) +
# Add the per-cell results computed from my mock data
geom_sf(data = result %>% filter(!is.na(sum_cat)), aes(fill=sum_cat)) +
geom_sf(data = rc3) +
theme_bw() +
coord_sf() +
scale_alpha(guide="none")+
xlab(expression(paste(Longitude^o,~'O'))) +
ylab(expression(paste(Latitude^o,~'S')))+
guides( colour = guide_legend()) +
theme(panel.background = element_rect(fill = "#f7fbff"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
theme(legend.position = "right")+
xlim(-69,-45)
Created on 2022-03-23 by the reprex package (v2.0.1)

Using subplot with seaborn

I use the following code to show several Seaborn plots in one window:
fig, axs = plt.subplots(2,2, figsize=(6,6))
x1 = sns.jointplot(x="MO2: Vit. vent 1 10min [m/s] ", y="MO2: Puissance active 10min [kW] ", data=df_MO1_MO2.loc[df_MO1_MO2["Year"] == 2011], kind='kde', ax=axs[0][0])
x2 = sns.jointplot(x="MO2: Vit. vent 1 10min [m/s] ", y="MO2: Puissance active 10min [kW] ", data=df_MO1_MO2.loc[df_MO1_MO2["Year"] == 2012], kind='kde', ax=axs[0][1])
x2 = sns.jointplot(x="MO2: Vit. vent 1 10min [m/s] ", y="MO2: Puissance active 10min [kW] ", data=df_MO1_MO2.loc[df_MO1_MO2["Year"] == 2016], kind='kde', ax=axs[1][0])
x3 = sns.jointplot(x="MO2: Vit. vent 1 10min [m/s] ", y="MO2: Puissance active 10min [kW] ", data=df_MO1_MO2.loc[df_MO1_MO2["Year"] == 2017], kind='kde', ax=axs[1][1])
fig.suptitle('Active power evolution', position=(.5,1.1), fontsize=20)
fig.tight_layout()
This code correctly returns a 2-by-2 grid of subplots, but it also shows the four plots separately on four different lines. Can you help me find the error in my code?

Limitation of Keras/Tensorflow for solving Linear Regression tasks

I was trying to implement linear regression in Keras/TensorFlow and was very surprised by how difficult it is. The standard examples work great on random data. However, if we change the input data a little bit, all the examples stop working correctly.
I am trying to find the coefficients for y = 0.5 * x1 + 0.5 * x2.
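# Assumed imports; the original post omits them
import numpy as np
from sklearn import preprocessing
from tensorflow import keras
np.random.seed(1443)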
n = 100000
x = np.zeros((n, 2))
y = np.zeros((n, 1))
x[:,0] = sorted(preprocessing.scale( np.random.poisson(1000000, (n)) ))
x[:,1] = sorted(preprocessing.scale( np.random.poisson(1000000, (n)) ) )
y = (x[:,0] + x[:,1]) /2
model = keras.Sequential()
model.add( keras.layers.Dense(1, input_shape =(2,), dtype="float32" ))
model.compile(loss='mean_squared_error', optimizer='sgd')
model.fit(x,y, epochs=1000, batch_size=64)
print(model.get_weights())
The results:
| epochs| batch_size | bias | x1 | x2
| ------+------------+------------+------------+-----------
| 1000 | 64 | -5.83E-05 | 0.90410435 | 0.09594361
| 1000 | 1024 | -5.71E-06 | 0.98739249 | 0.01258729
| 1000 | 10000 | -3.07E-07 | -0.2441376 | 1.2441349
My first thought was that it is a bug in Keras. So I tried the R TensorFlow library:
floatType <- "float32"
p <- 2L
X <- tf$placeholder(floatType, shape = shape(NULL, p), name = "x-data")
Y <- tf$placeholder(floatType, name = "y-data")
W <- tf$Variable(tf$zeros(list(p, 1L), dtype=floatType))
b <- tf$Variable(tf$zeros(list(1L), dtype=floatType))
Y_hat <- tf$add(tf$matmul(X, W), b)
cost <- tf$reduce_mean(tf$square(Y_hat - Y))
generator <- tf$train$GradientDescentOptimizer(learning_rate=0.01)
optimizer <- generator$minimize(cost)
session <- tf$Session()
session$run(tf$global_variables_initializer())
set.seed(1443)
n <- 10^5
x <- matrix( replicate(p, sort(scale((rpois(n, 10^6))))) , nrow = n )
y <- matrix((x[,1]+x[,2])/2)
i <- 1
batch_size <- 10000
epoch_number <- 1000
iterationNumber <- n*epoch_number / batch_size
while (iterationNumber > 0) {
feed_dict <- dict(X = x[i:(i+batch_size-1),, drop = F], Y = y[i:(i+batch_size-1),, drop = F])
session$run(optimizer, feed_dict = feed_dict)
i <- i+batch_size
if( i > n-batch_size)
i <- i %% batch_size
iterationNumber <- iterationNumber - 1
}
r_model <- lm(y ~ x)
tf_coef <- c(session$run(b), session$run(W))
r_coef <- r_model$coefficients
print(rbind(tf_coef, r_coef))
The results:
| epochs| batch_size | bias | x1 | x2
| ------+------------+------------+------------+-----------
|2000 | 64 | -1.33E-06 | 0.500307 | 0.4996932
|1000 | 1000 | 2.79E-08 | 0.5000809 | 0.499919
|1000 | 10000 | -4.33E-07 | 0.5004921 | 0.499507
|1000 | 100000 | 2.96E-18 | 0.5 | 0.5
TensorFlow finds the correct result only when the batch size equals the number of samples and the optimization algorithm is SGD. With "adam" or "adagrad" as the optimizer, the errors were much larger.
For obvious reasons, I cannot set the hyperparameter batch_size = n. Could you recommend any approach that solves this problem to a precision of 1E-07 in Keras or TensorFlow?
Why does TensorFlow find better solutions than Keras?
Comment 1.
Based on the answer by "today" below:
Shuffling the training dataset significantly improves the performance of the TensorFlow version:
shuffledIndex<-sample(1:(nrow(x)))
x <- x[shuffledIndex,]
y <- y[shuffledIndex,,drop=FALSE]
For batch size = 2000:
|(Intercept) | x1 | x2
|----------------+-----------+----------
|-1.130693e-09 | 0.5000004 | 0.4999989
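A rough illustration of why the shuffle matters when batches are taken in sequence, sketched in Python. It uses sorted standard-normal draws as a stand-in for the sorted, scaled Poisson values (an assumption made to keep the example dependency-free) and compares how much the batch means vary before and after shuffling.

```python
import random
import statistics

random.seed(1443)
n, batch_size = 10_000, 2_000

# Stand-in for one sorted, standardized feature column.
x_sorted = sorted(random.gauss(0.0, 1.0) for _ in range(n))
x_shuffled = random.sample(x_sorted, k=n)  # same values, random order

def batch_means(values, size):
    """Mean of each consecutive batch of the given size."""
    return [statistics.fmean(values[i:i + size]) for i in range(0, len(values), size)]

sorted_means = batch_means(x_sorted, batch_size)
shuffled_means = batch_means(x_shuffled, batch_size)

spread_sorted = max(sorted_means) - min(sorted_means)
spread_shuffled = max(shuffled_means) - min(shuffled_means)
print(spread_sorted, spread_shuffled)
```

With sorted data, each sequential batch covers only a narrow slice of the value range, so consecutive gradient steps are computed on very different data. After shuffling, every batch looks like the whole dataset.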
The problem is that you are sorting the generated random numbers for each feature value. So they end up very close to each other:
>>> np.mean(np.abs(x[:,0]-x[:,1]))
0.004125721684553685
As a result we would have:
y = (x1 + x2) / 2
~= (x1 + x1) / 2
= x1
= 0.5 * x1 + 0.5 * x1
= 0.3 * x1 + 0.7 * x1
= -0.3 * x1 + 1.3 * x1
= 10.1 * x1 - 9.1 * x1
= thousands of other possible combinations
In this case the solution that Keras would converge to would really depend on the initial value of the weights and bias of Dense layer. With different initial values you would get different results (and possibly for some of them, it may not converge at all):
# set the initial weight of Dense layer
model.layers[0].set_weights([np.array([[0], [1]]), np.array([0])])
# fit the model ...
# the final weights
model.get_weights()
[array([[0.00203656],
[0.9981099 ]], dtype=float32),
array([4.5520876e-05], dtype=float32)] # because: y = 0 * x1 + 1 * x1 = x1 ~= (x1 + x2) / 2
# again set the weights to something different
model.layers[0].set_weights([np.array([[0], [0]]), np.array([1])])
# fit the model...
# the final weights
model.get_weights()
[array([[0.49986306],
[0.50013727]], dtype=float32),
array([1.4176634e-08], dtype=float32)] # the one you were looking for!
However, if you don't sort the features (i.e. just remove sorted), it is very likely that the converged weights will be very close to [0.5, 0.5].
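The argument can be checked numerically with a short sketch in plain Python. It uses sorted standard-normal draws as a stand-in for the sorted, scaled Poisson features (an assumption made to avoid extra dependencies) and evaluates the mean squared error for several weight pairs.

```python
import random

random.seed(1443)
n = 5_000

# Two independently drawn but sorted features end up almost identical.
x1 = sorted(random.gauss(0.0, 1.0) for _ in range(n))
x2 = sorted(random.gauss(0.0, 1.0) for _ in range(n))
y = [(a + b) / 2 for a, b in zip(x1, x2)]

def mse(w1, w2):
    """Mean squared error of the bias-free model w1 * x1 + w2 * x2."""
    return sum((w1 * a + w2 * b - t) ** 2 for a, b, t in zip(x1, x2, y)) / n

candidates = [(0.5, 0.5), (0.9, 0.1), (-0.3, 1.3), (1.0, 1.0)]
losses = {w: mse(*w) for w in candidates}
for w, loss in losses.items():
    print(w, loss)
```

Every pair whose weights sum to 1 gives a loss close to zero, while a pair such as (1.0, 1.0) does not. The loss surface is therefore nearly flat along the line w1 + w2 = 1, and gradient descent has almost no signal to move along it toward (0.5, 0.5).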

Add sample size to a panel figure of boxplots

I am trying to add sample sizes to boxplots (preferably at the top or bottom of them) that are grouped by two levels. I used the facet_grid() function to produce a panel plot. I then tried to use the annotate() function to add the sample sizes, but this didn't work because it repeated the same values in the second panel. Is there a simple way to do this?
head(FeatherData, n=10)
Location Status FeatherD Species ID
## 1 TX Resident -27.41495 Carolina wren CARW (32)
## 2 TX Resident -29.17626 Carolina wren CARW (32)
## 3 TX Resident -31.08070 Carolina wren CARW (32)
## 4 TX Migrant -169.19579 Yellow-rumped warbler YRWA (28)
## 5 TX Migrant -170.42079 Yellow-rumped warbler YRWA (28)
## 6 TX Migrant -158.66925 Yellow-rumped warbler YRWA (28)
## 7 TX Migrant -165.55278 Yellow-rumped warbler YRWA (28)
## 8 TX Migrant -170.43374 Yellow-rumped warbler YRWA (28)
## 9 TX Migrant -170.21801 Yellow-rumped warbler YRWA (28)
## 10 TX Migrant -184.45871 Yellow-rumped warbler YRWA (28)
ggplot(FeatherData, aes(x = Location, y = FeatherD)) +
geom_boxplot(alpha = 0.7, fill='#A4A4A4') +
scale_y_continuous() +
scale_x_discrete(name = "Location") +
theme_bw() +
theme(plot.title = element_text(size = 20, family = "Times", face =
"bold"),
text = element_text(size = 20, family = "Times"),
axis.title = element_text(face="bold"),
axis.text.x=element_text(size = 15)) +
ylab(expression(Feather~delta^2~H["f"]~"‰")) +
facet_grid(. ~ Status)
There are multiple ways to do this sort of task. The most flexible way is to compute your statistic outside the plotting call as a separate dataframe and use it as its own layer:
library(dplyr)
library(ggplot2)
cw_summary <- ChickWeight %>%
group_by(Diet) %>%
tally()
cw_summary
# A tibble: 4 x 2
Diet n
<fctr> <int>
1 1 220
2 2 120
3 3 120
4 4 118
ggplot(ChickWeight, aes(Diet, weight)) +
geom_boxplot() +
facet_grid(~Diet) +
geom_text(data = cw_summary,
aes(Diet, Inf, label = n), vjust = 1)
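For data grouped by two levels, as in the question, the same idea means tallying by both grouping variables so that each panel receives its own label. A minimal sketch in Python (not R), using only the ten FeatherData rows shown in the question; the label format is an illustrative choice.

```python
from collections import Counter

# (Location, Status) pairs for the ten rows printed in the question.
rows = [("TX", "Resident")] * 3 + [("TX", "Migrant")] * 7

# One count per Location x Status combination, i.e. per box in each panel.
n_per_group = Counter(rows)

# A separate summary table: one label row per group.
labels = [
    {"Location": loc, "Status": status, "label": f"n = {n}"}
    for (loc, status), n in sorted(n_per_group.items())
]
for row in labels:
    print(row)
```

Because the summary table carries the faceting variable (Status) as a column, a text layer built from it is split across panels in the same way as the boxplots, instead of being repeated in each one.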
The other method is to use the summary functions built in, but that can be fiddly. Here's an example:
ggplot(ChickWeight, aes(Diet, weight)) +
geom_boxplot() +
stat_summary(fun = median, fun.max = length,
geom = "text", aes(label = after_stat(ymax)), vjust = -1) +
facet_grid(~Diet)
Here I used fun to position the summary at the median of the y values, and used fun.max to compute an internal variable called ymax with the function length (which just counts the number of observations). In ggplot2 versions before 3.3.0 these arguments were named fun.y and fun.ymax, and the computed variable was written ..ymax.. rather than after_stat(ymax).