How Do I Fix This Data Frame Error for this dataset on Covid? - syntax-error

read.csv('covid_deaths.csv', header = TRUE)
df <-covid_deaths
covid_deaths$`Age Group` <- factor(covid_deaths$`Age Group`, levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17), labels = c('All Ages', 'Under One Year', '0-17 years', '1-4 years', '5-14 years', '15-24 years', '18-29 years', '25-34 years', '30-39 years', '35-44 years', '40-49 years', '45-54 years', '50-64 years', '55-64 years', '65-74 years', '75-84 years', '85 years and older'))
#increasing the max.print limit to see more on console
options(max.print = 60)
#pulling up the age group and COVID_19 deaths
covid_deaths[, c('Age Group','COVID_19 Deaths')]
#making a vector out of the relevant columns
df2[,c( 9,10)]
vec1 <- covid_deaths$`Age Group`
vec2 <- covid_deaths$`COVID_19 Deaths`
summary(covid_deaths)
m <- covid_deaths(matrix(sample(100, 20, replace = TRUE), ncol = 9))
I am trying to get the descriptive statistics: mean, standard deviation, and variance. I keep getting this error:
Error in `$<-.data.frame`(`*tmp*`, `Age Group`, value = integer(0)) :
replacement has 0 rows, data has 107406

Related

Chi square function: how to?

I don't understand if what I'm doing is correct.
I wish to perform a chi-squared test on a dataset, but I'm not sure about the result.
This is my dataset:
> dput(chi)
structure(list(`Age.(days)` = c("< 7 days", "7-10 days", "10-12 days",
"12-15 days", "15-20 days", "20-25 days"), Broods = c(6, 9, 10,
6, 14, 5), N.Carnus = c(92, 74, 48, 17, 37, 10)), row.names = c(NA,
6L), class = "data.frame")
And this is the test:
chi$"Age.(days)" <- as.character(chi$"Age.(days)")
chisq.test(table(chi$"Age.(days)", chi$"N.Carnus"))
What I wish to find is if there is a significant connection between age of the host and number of parasites (coming from a certain number of broods).
Thank you :)
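One thing that may be worth checking, sketched here in Python only as an illustration: R's table() counts how often each pair of values occurs, it does not use the numbers in N.Carnus as counts. With six distinct age labels and six distinct parasite counts, the cross-tabulation is a 6 x 6 grid of zeros and ones, so the test may not be looking at the parasite numbers at all. The variable names below are assumptions that mirror the columns of the posted data frame.

```python
from collections import Counter

# Columns copied from the dput() output in the question.
age = ["< 7 days", "7-10 days", "10-12 days",
       "12-15 days", "15-20 days", "20-25 days"]
n_carnus = [92, 74, 48, 17, 37, 10]

# A two-way table counts how many rows share each (age, count) pair.
cells = Counter(zip(age, n_carnus))
table = [[cells[(a, n)] for n in sorted(set(n_carnus))] for a in age]

# Every row holds a single 1: each age class pairs with one distinct count.
for label, row in zip(age, table):
    print(label, row)
```

If the parasite counts themselves are what should be compared across age classes, they would need to enter the test as the observed frequencies rather than as category labels.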

Using .loc to categorize continuous data for range of values

0, 10.65
1, 15.27
2, 15.96
3, 13.49
4, 12.69
5, 7.90
6, 15.96
7, 18.64
8, 21.28
9, 12.69
10, 14.65
11, 12.69
12, 13.49
13, 9.91
14, 10.65
15, 16.29
The code I wrote is:
data2.loc[data2['int_rate'] <= 8.00, 'int_rate'] = "low"
data2.loc[8.00 < data2['int_rate'] <= 30.00, 'int_rate'] = "medium"
data2.loc[15.00 < data2['int_rate'] < 30.00, 'int_rate'] = "high"
As a result, every value at or below 8.0 becomes "low", but the other values are not changed.
The answer to my problem is:
data2['int_rate'] = (pd.cut(data2.int_rate, bins=[0, 8.00, 15.00, 30.00], labels=['low', 'medium', 'high']))
The code above labels values up to 8.00 as low, values above 8.00 and up to 15.00 as medium, and values above 15.00 and up to 30.00 as high.
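A minimal sketch of the binning on the sample rates from the question, assuming pandas is installed. The column name rate_band is a hypothetical choice that keeps the original numbers next to the labels. It also shows the mask form of the "medium" condition: a chained comparison such as 8.00 < s <= 15.00 does not work on a pandas Series, so the two halves are combined with & and parentheses.

```python
import pandas as pd

# Sample rates copied from the question.
rates = [10.65, 15.27, 15.96, 13.49, 12.69, 7.90, 15.96, 18.64,
         21.28, 12.69, 14.65, 12.69, 13.49, 9.91, 10.65, 16.29]
data2 = pd.DataFrame({"int_rate": rates})

# pd.cut uses right-closed bins by default: (0, 8], (8, 15], (15, 30].
# "rate_band" is a hypothetical column name, not from the original post.
data2["rate_band"] = pd.cut(
    data2["int_rate"],
    bins=[0, 8.00, 15.00, 30.00],
    labels=["low", "medium", "high"],
)

# Mask equivalent of the "medium" bin, built from two element-wise tests.
is_medium = (data2["int_rate"] > 8.00) & (data2["int_rate"] <= 15.00)

print(data2)
```

Writing the labels to a separate column also avoids comparing numbers against a column that already contains strings such as "low".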

Labeling xarray plot with month names

I have an xarray dataset with three dimensions: lat, lon and time. The time dimension holds monthly values, numbered from 1 to 12. I want to plot a variable of this dataset using the names of the months (e.g. 'Jan', 'Feb', 'Mar',...).
How can I change number of months to name of months in plotting?
<xarray.Dataset>
Dimensions: (month: 12, latitude: 501, longitude: 721)
Coordinates:
* longitude (longitude) float64 49.8 49.81 49.82 ... 56.99 57.0
* latitude (latitude) float64 27.0 27.01 27.02 ... 31.99 32.0
* month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Data variables:
Sum_monthly_Rain_mm (month, latitude, longitude) float32 dask.array<chunksize=(1, 501, 721), meta=np.ndarray>
Tair_C (month, latitude, longitude) float32 dask.array<chunksize=(1, 501, 721), meta=np.ndarray>
Plots:
temp_rain_mean_months.Tair_C.plot(x='longitude', y='latitude', col='month', col_wrap=4,
levels=[-10, -5, 0, 5, 10, 15, 20, 25, 30, 35, 40]);
There are two ways to do this.
You can iterate through the axes on the plot object returned by da.plot and set the title manually:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
p = da.groupby('time.month').mean(dim='time').plot(col='month', col_wrap=4)
for i, ax in enumerate(p.axes.flat):
    current_title = ax.get_title()
    assert current_title[:len('month = ')] == 'month = '
    month_ind = int(current_title[len('month = '):]) - 1
    ax.set_title(months[month_ind])
Or, you could modify the dim on the array prior to plotting:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
da['month_name'] = ('month', months)
da.swap_dims({'month': 'month_name'}).Tair_C.plot(
x='longitude',
y='latitude',
col='month_name',
col_wrap=4,
levels=[-10, -5, 0, 5, 10, 15, 20, 25, 30, 35, 40],
)
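As a small aside, the month list does not have to be typed out by hand. The sketch below builds it from the standard library's calendar module and wraps the title parsing from the first approach in a helper. month_title is a hypothetical name, and calendar.month_abbr follows the current locale, so the abbreviations are English only under an English or C locale.

```python
import calendar

# calendar.month_abbr[0] is an empty string, so skip it.
months = list(calendar.month_abbr)[1:]

def month_title(title, prefix="month = "):
    """Turn a facet title such as 'month = 3' into a month abbreviation.

    Hypothetical helper: titles that do not start with the prefix are
    returned unchanged instead of raising.
    """
    if not title.startswith(prefix):
        return title
    return months[int(title[len(prefix):]) - 1]

print([month_title(f"month = {m}") for m in (1, 6, 12)])
```

Inside the loop over p.axes.flat this would be used as ax.set_title(month_title(ax.get_title())).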

Using fb-prophet Package to Predict By Group with Additional Regressors in R

prophet users of the world, hope all is well. I'm having some difficulties with a particular use case that I'll try to illustrate using some sample data and code below. First let's generate some sample data so that it will be a little bit easier to know what I am talking about.
library(data.table)
library(prophet)
library(dplyr)
# one year of months to be used for generating predictions
ds = c('2016-01-01', '2016-02-01','2016-03-01','2016-04-01','2016-05-01','2016-06-01','2016-07-01','2016-08-01','2016-09-01','2016-10-01','2016-11-01','2016-12-01' )
# historical customer counts
y = c (78498,12356,93732,5556,410,10296,9779,744,16407,100484,23954,141398,10575,850,16334,17496,1643,28074,93181,
18770,129968,11590,850,16738,17510,1376,27931,94369,18444,134850,13386,919,19075,18050,1565,31296,112094,27995,
167094,13402,1422,22766,20072,2340,37863,87346,16180,119863,7691,725,16931,12163,1241,25872,87455,16322,116390,
6994,620,13524,11059,990,22188,105473,23652,154145,13520,1008,18857,19209,1632,31105,102252,21284,138779,11670,
918,16078,16679,1257,26755,115033,22415,139835,13965,936,18027,18642,1407,28622,155371,40556,174321,25119,1859,
35326,28844,2962,51582,108817,19158,109864,8693,756,14358,13390,1091,21419)
# the segment channels of the customers
segment_channel = c('Existing_Omni', 'Existing_Retail', 'Existing_Direct', 'NTB_Omni', 'NTB_Retail', 'NTB_Direct', 'React_Omni', 'React_Retail', 'React_Direct')
# an external regressor to be added to the model (in my data there are like 40 of these regressor variables that I would like to add)
flash_sale = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3)
fake_data = merge(ds,segment_channel, all.y=TRUE)
setnames(fake_data, 'x', 'ds')
setnames(fake_data, 'y', 'segment_channel')
nrow(fake_data) # should be 108 rows, the 9 customer segments for each of the months in 2016
# next join the known customer counts, let's say we have them for the first 8 months of the year
fake_data = cbind(fake_data, y)
fake_data = cbind(fake_data, flash_sale)
# set some of the y values to NA so we can pretend we are trying to predict them using the ds time series as well as the flash sale values,
# which will be known in advance
fake_data = as.data.table(fake_data)
fake_data$ds = as.Date(fake_data$ds)
fake_data[, y := ifelse(ds >= '2016-08-01', NA, y)]
This code generates a data set fairly similar to the one I am working with, so hopefully you can reproduce what I am doing. There are essentially two things I would like to do with this data. The first is fairly straightforward: I want to add a regressor (like flash_sale in this example) to the prophet model that I create. I can do this fairly easily like so:
christ <- tibble(
holiday = 'christ',
ds = as.Date(c('2016-11-01', '2017-11-01', '2018-11-01',
'2019-11-01')),
lower_window = 0,
upper_window = 1
)
nye <- tibble(
holiday = 'nye',
ds = as.Date(c('2016-11-01', '2017-12-01', '2018-11-01',
'2019-11-01')),
lower_window = 0,
upper_window = 1
)
holidays <- bind_rows(nye, christ)
m <- prophet(holidays = holidays)
m<- add_regressor(m, name = "flash_sale")
m <- fit.prophet(m, fake_data)
forecast <- predict(m, fake_data)
prophet_plot_components(m, forecast)
This should generate a fairly ugly plot but it's pretty easy to see that given the data this should be able to do the trick, and I could add multiple lines to add additional regressors. Ok, so we're all good so far. But the other issue is that I have 9 segment channels that I'm dealing with, and I don't want to build a separate model for each of them. Luckily I found a pretty good link on stack overflow that accomplishes the grouped prophet prediction: Using Prophet Package to Predict By Group in Dataframe in R
fcst = fake_data %>%
group_by(segment_channel) %>%
do(predict(prophet(., seasonality.mode = 'multiplicative', holidays = holidays, seasonality.prior.scale = 10, changepoint.prior.scale = .034), make_future_dataframe(prophet(.), periods = 11, freq='month'))) %>%
dplyr::select(ds, segment_channel, yhat)
fcst
> fcst
# A tibble: 207 x 3
# Groups: segment_channel [9]
ds segment_channel yhat
<dttm> <fct> <dbl>
1 2016-01-01 00:00:00 Existing_Direct 38712.
2 2016-02-01 00:00:00 Existing_Direct 40321.
3 2016-03-01 00:00:00 Existing_Direct 42648.
4 2016-04-01 00:00:00 Existing_Direct 45130.
5 2016-05-01 00:00:00 Existing_Direct 46580.
6 2016-06-01 00:00:00 Existing_Direct 49437.
7 2016-07-01 00:00:00 Existing_Direct 50651.
8 2016-08-01 00:00:00 Existing_Direct 52685.
9 2016-09-01 00:00:00 Existing_Direct 54719.
10 2016-10-01 00:00:00 Existing_Direct 56687.
# ... with 197 more rows
This is more or less exactly what I want! Cool. So now all I have to do is figure out how to get my grouped predictions and my regressors added all in one step. I know I can have multi-line statements inside of do, so this is what I tried in order to get this to work:
> fcst = fake_data %>%
+ group_by(segment_channel) %>%
+ do(
+ predict(prophet(., seasonality.mode = 'multiplicative', holidays = holidays, seasonality.prior.scale = 10, changepoint.prior.scale = .034),
+ add_regressor(prophet(., holidays = holidays), name = 'flash_sale'),
+ fit.prophet(prophet(. , holidays = holidays)),
+ make_future_dataframe(prophet(.), periods = 11, freq='month'))) %>%
+ dplyr::select(ds, segment_channel, yhat)
Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
n.changepoints greater than number of observations. Using 4
Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
n.changepoints greater than number of observations. Using 4
Error in add_regressor(prophet(., holidays = holidays), name = "flash_sale") :
Regressors must be added prior to model fitting.
Darn. Looks like it was running but then something about how I tried to add the regressor wasn't kosher. Next I tried it this way:
> fcst = fake_data %>%
+ group_by(segment_channel) %>%
+ do(
+ prophet(holidays = holidays),
+ add_regressor(prophet(., holidays = holidays), name = 'flash_sale'),
+ fit.prophet(prophet(. , holidays = holidays)),
+ predict(prophet(., seasonality.mode = 'multiplicative', holidays = holidays, seasonality.prior.scale = 10, changepoint.prior.scale = .034),
+ make_future_dataframe(prophet(.), periods = 11, freq='month'))) %>%
+ dplyr::select(ds, segment_channel, yhat)
Error: Can only supply one unnamed argument, not 4
Call `rlang::last_error()` to see a backtrace
> fcst = fake_data %>%
+ group_by(segment_channel) %>%
+ do(
+ add_regressor(prophet(., holidays = holidays), name = 'flash_sale'),
+ fit.prophet(prophet(. , holidays = holidays)),
+ predict(prophet(., seasonality.mode = 'multiplicative', holidays = holidays, seasonality.prior.scale = 10, changepoint.prior.scale = .034),
+ make_future_dataframe(prophet(.), periods = 11, freq='month'))) %>%
+ dplyr::select(ds, segment_channel, yhat)
Error: Can only supply one unnamed argument, not 3
Call `rlang::last_error()` to see a backtrace
I'm super confused at this point, so I'm just hoping someone out on the interwebs might know just the right incantation I need to get where I'm going.

JAGS Bayesian state-space modeling

I'm trying to use a state-space model to estimate population demographics (fecundity, survivorship, population growth, population size). We have 4 different age states.
# J0 = number of individuals aged 0-1
# surv1 = survivorship from 0-1
# J1 = number of individuals aged 1-2
# surv2 = survivorship from 1-2
# J2 = number of individuals aged 2-3
# surv3 = survivorship from 2-3
# J3 = number of individuals aged 3+
# survad = survivorship >3 ("adult")
# Data as vectors (Talek clan from 1988-2013)
# X0 = individuals 0-1 in years
# X1 = individuals 1-2 in years
# X2 = individuals 2-3 in years
# X3 = individuals 3+ in years
# Total = group size
X0 <- c(7, 9, 4, 8, 9, 5, 8, 5, 7, 5, 5, 8, 10, 3, 5, 7, 2, 6, 6, 11, 14, 12, 15, 9, 10)
X1 <- c( 4, 4, 3, 4, 8, 5, 2, 4, 3, 4, 4, 5, 3, 7, 0, 5, 6, 3, 3, 5, 10, 12, 10, 13, 8)
X2 <- c(3, 2, 3, 3, 3, 8, 4, 1, 1, 2, 2, 4, 2, 2, 5, 0, 5, 5, 4, 3, 3, 10, 12, 7, 10)
X3 <- c(18, 16, 13, 16, 29, 29, 26, 22, 21, 18, 16, 15, 16, 15, 11, 14, 9, 12, 16, 18, 21, 23, 33, 32, 31)
Total <- c(32, 31, 23, 31, 49, 47, 40, 32, 32, 29, 27, 32, 31, 27, 21, 26, 22, 26, 29, 37, 48, 57, 70, 61, 59)
Here's the BUGS code:
sink(file = "HyenaIPM_all.txt")
cat("
model {
# Specify the priors for all parameters in the model
N.est[1] ~ dnorm(50, tau.proc)T(0,) # Initial abundance
mean.lambda ~ dunif(0, 5)
sigma.proc ~ dunif(0, 50)
tau.proc <- pow(sigma.proc, -2)
for (t in 1:TT) {
fec[t] ~ dunif(0, 5) # per capita fecundity
surv1[t] ~ dunif(0, 1) # survivorship from 0-1
surv2[t] ~ dunif(0, 1) # survivorship from 1-2
surv3[t] ~ dunif(0, 1) # survivorship from 2-3
survad[t] ~ dunif(0, 1) # adult survivorship
}
# Estimate fecundity and survivorship
for (t in 2:TT) {
# Fecundity
J0[t+1] ~ dpois(survad[t]*fec[t])
J0[t+1] <- J3[t] * fec[t]
# Survivorship
J1[t+1] ~ dbin(surv1[t], J0[t])
J1[t+1] <- J0[t]*surv1[t]
J2[t+1] ~ dbin(surv2[t], J1[t])
J2[t+1] <- J1[t]*surv2[t]
J3[t+1] ~ dbin(surv3[t], J2[t-1])
J3[t+1] <- J2[t]*surv3[t] + J3[t]*survad[t]
A[t+1] ~ dbin(survad[t], A[t])
A[t+1] <- J3[t]*surv3[t] + A[t]*survad[t]
# Lambda
lambda[t+1] ~ dnorm(mean.lambda, tau.proc)
N.est[t+1] <- N.est[t]*lambda[t]
}
# Population size
for (t in 1:TT){
N[t] ~ dpois(N.est[t])
}
}
", fill = T)
sink()
# Parameters monitored
sp.params <- c("fec", "surv1", "surv2", "surv3", "survad", "lambda")
# MCMC settings
ni <- 200
nt <- 10
nb <- 100
nc <- 3
# Initial values
sp.inits <- function()list(mean.lambda = runif(1, 0, 1))
#Load all the data
sp.data <- list(N = Total, TT = length(Total), J0 = X0, J1 = X1, J2 = X2, J3 = X3)
library(R2jags)
hyena_model <- jags(sp.data, sp.inits, sp.params, "HyenaIPM_all.txt", n.chains = nc, n.thin = nt, n.iter = ni, n.burnin = nb)
Unfortunately, I get the following error when I run the code.
Error in jags.model(model.file, data = data, inits = init.values, n.chains = n.chains, :
RUNTIME ERROR:
Index out of range for node J0
Does anyone have any suggestions for why we get this error? Not sure why the distribution would be wrong for J0.
This is a very informative error message. The index for J0 is t+1 which ranges from 2+1 to TT+1, but J0 has length TT. So when the index is TT+1 it is out of range since it is larger than TT.
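To make the index arithmetic concrete, here is a small Python sketch that lists which 1-based positions a loop of the form for (t in a:b) writes to when it assigns to x[t+1]. The helper touched_indices is hypothetical, and shifting the loop to 1:(TT-1) is only one possible adjustment, not necessarily the right one for this model.

```python
TT = 25  # length of the Total vector in the question

def touched_indices(start, stop, offset):
    """1-based indices hit by `for (t in start:stop) x[t + offset]`."""
    return [t + offset for t in range(start, stop + 1)]

original = touched_indices(2, TT, 1)      # loop as posted: t in 2:TT
shifted = touched_indices(1, TT - 1, 1)   # assumed alternative: t in 1:(TT-1)

print(max(original) > TT)   # True: index TT+1 lies outside J0[1..TT]
print(max(shifted) <= TT)   # True: every index stays within 1..TT
```

Note that with a loop starting at t = 1, the J2[t-1] term on the J3 line would point at index 0, so that line would need its own adjustment.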