How to use arcpullr::get_spatial_layer() and arcpullr::get_layer_by_poly() - api

I couldn't figure this out through the package documentation https://cran.r-project.org/web/packages/arcpullr/vignettes/intro_to_arcpullr.html.
My code returns the errors described below.
library(arcpullr)
url <- "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/WBD/MapServer/1"
huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_poly(url,geometry = "esriGeometryPolygon")
huc8_1:
Error in if (layer_info$type == "Group Layer") { :
argument is of length zero
huc8_2:
Error in get_sf_crs(geometry) : "sf" %in% class(sf_obj) is not TRUE
I would appreciate any help explaining these errors and suggesting solutions. Thanks!

I didn't use the arcpullr package. Using leaflet.esri::addEsriFeatureLayer with a where clause works.
See the relevant code below as an example:
leaflet.esri::addEsriFeatureLayer(
url="https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/IR_201820_byParameter/MapServer/2",
options = leaflet.esri::featureLayerOptions(where = IR_where_huc12)
)
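For reference, here is a minimal sketch of how that call might sit inside a full leaflet pipeline; IR_where_huc12 is assumed to be a character WHERE clause defined elsewhere:
# sketch only: IR_where_huc12 is a placeholder for a WHERE clause string
library(leaflet)
leaflet() %>%
  addTiles() %>%
  leaflet.esri::addEsriFeatureLayer(
    url = "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/IR_201820_byParameter/MapServer/2",
    options = leaflet.esri::featureLayerOptions(where = IR_where_huc12)
  )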

You have to pass an sf object as the second argument to any of the get_layer_by_* functions. I altered your example a bit, using a point instead of a polygon for the spatial query (since it's easier to create), but get_layer_by_poly works the same way with an sf polygon instead of a point. Also, the service you used requires a token, so I changed the URL to the USGS HU 6-digit basins instead.
library(arcpullr)
url <- "https://hydro.nationalmap.gov/arcgis/rest/services/wbd/MapServer/3"
query_pt <- sf_point(c(-90, 45))
# this would query everything in the feature layer, which may or may not be huge
# huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_point(url, query_pt)
huc_map <- plot_layer(huc8_2)
huc_map
huc_map + ggplot2::geom_sf(data = query_pt)
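If you do want the polygon version, a sketch along these lines should work, assuming arcpullr's sf_polygon() helper; the corner coordinates below are purely illustrative:
# illustrative corner coordinates only; any sf polygon would do
query_poly <- sf_polygon(c(-91, 44), c(-89, 44), c(-89, 46), c(-91, 46))
huc8_3 <- get_layer_by_poly(url, query_poly)
plot_layer(huc8_3)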

Related

RStudio Error: Unused argument (by = ...) when fitting gam model, and smoothing separately for a factor

I am still a beginner in R. For a project I am trying to fit a GAM model on a simple dataset with a time variable and year. I am doing it in R and I keep getting an error message that claims an argument is unused, even though I specify it in the code.
It concerns a dataset which includes a categorical variable "Year" with only two levels, 2020 and 2022. I want to investigate whether there is a peak in the hourly rate of visitors ("H1") in a nature reserve. For each observation period the average time was taken, which is the predictor variable used here ("T"). I want to use a GAM model for this, and have the smoothing applied differently for the two years.
The following is the line of code that I tried to use
`gam1 <- gam(H1~Year+s(T,by=Year),data = d)`
When I try to run this code, I get the following error message
`Error in s(T, by = Year) : unused argument (by = Year)`
I also tried simply getting rid of the "by" argument
`gam1 <- gam(H1~Year+s(T,Year),data = d)`
This allows me to run the code, but when I try to view the output using summary(gam1), I get
Error in [<-(tmp, snames, 2, value = round(nldf, 1)) : subscript out of bounds
Since I feel like both errors are probably related to the same thing that I'm doing wrong, I decided to combine the question.
Did you load the {mgcv} package or the {gam} package? The latter doesn't have factor by smooths and as such the first error message is what I would expect if you did library("gam") and then tried to fit the model you showed.
To fit the model you showed, you should restart R and try in a clean session:
library("mgcv")
# load your data
# fit model
gam1 <- gam(H1 ~ Year + s(T, by = Year), data = d)
It could well be that you have both {gam} and {mgcv} loaded, in which case whichever you loaded last will be earlier on the function search path. As both packages have functions gam() and s(), R might just be finding the wrong versions (masking), so you might also try
gam1 <- mgcv::gam(H1 ~ Year + mgcv::s(T, by = Year), data = d)
But you would be better off only loading {mgcv} if you want factor by smooths.
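If you cannot restart R right now, a rough sketch of the same idea in the current session (assuming {gam} is currently attached) is to detach it so that mgcv's gam() and s() are found first:
# detach {gam} so mgcv's functions are no longer masked; this errors if {gam} is not attached
detach("package:gam", unload = TRUE)
library(mgcv)
gam1 <- gam(H1 ~ Year + s(T, by = Year), data = d)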
@Gavin Simpson
I did have both loaded, and I tried just using mgcv as you suggested. However, then I get the following error.
Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
I am assuming this is simply because it's not actually trying to use the gam() function, but rather it attempts to name something gam1. So I would assume I actually need the 'gam' package loaded before I could do this.
The second line of code also doesn't work. I get the following error
Error in model.frame.default(formula = H1 ~ Year + mgcv::s(T, by = Year), :
invalid type (list) for variable 'mgcv::s(T, by = Year)'
This happens no matter the order in which I load the two packages. And if I don't load 'gam', I get the error described above.

How can I access a value in a sequence type?

There are the following attributes in client_output
weights_delta = attr.ib()
client_weight = attr.ib()
model_output = attr.ib()
client_loss = attr.ib()
After that, I made client_output into a sequence via a = tff.federated_collect(client_output) and round_model_delta = tff.federated_map(selecting_fn, a), and I declared
@tff.tf_computation()  # append
def selecting_fn(a):
    # TODO
    return round_model_delta
In the process of averaging on the server, I want to average the weights_delta by selecting some of the clients with a small loss value. So I try to access it via a.weights_delta, but it doesn't work.
tff.federated_collect returns a tff.SequenceType placed at tff.SERVER, which you can manipulate in the same way as, for example, a client dataset is usually handled in a method decorated by tff.tf_computation.
Note that you have to use the tff.federated_collect operator in the scope of a tff.federated_computation. What you probably want to do[*] is pass it into a tff.tf_computation, using the tff.federated_map operator. Once inside the tff.tf_computation, you can think of it as a tf.data.Dataset object, and everything in the tf.data module is available.
[*] I am guessing. More detailed explanation of what you would like to achieve would be helpful.

Schema option in spark_read_parquet()

I am pretty new to R and Spark. I want to read a parquet file with the following code. Does anyone know how to specify the schema there?
library(sparklyr)
sc <- spark_connect(master = "yarn",
                    app_name = "test")
df <- spark_read_parquet(sc,
                         "name",
                         "path/to/the/file",
                         repartition = 0,
                         schema = "?")
I looked at the link https://spark.rstudio.com/reference/spark_read_parquet/, there isn't any detail or example regarding how to set schema in the function to optimize it.
If you are only trying to read a parquet file, a schema does not need to be used; it is just an available option. The following code should work.
df <- spark_read_parquet(sc,
                         "name",
                         "path/to/the/file",
                         repartition = 0,
                         schema = NULL)
If you want to use a schema, there are many options, and choosing the right one depends on your data and what you are using it for. Try running your code without a schema option first to see if that works for your data.
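As a rough sketch of what specifying column types could look like: this assumes a sparklyr version whose spark_read_parquet() accepts a columns argument (a named vector of Spark types); the column names and types below are purely illustrative.
df <- spark_read_parquet(sc,
                         name = "name",
                         path = "path/to/the/file",
                         repartition = 0,
                         # illustrative column names and types only
                         columns = c(id = "integer", value = "double", label = "character"))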
Try
tbl_change_db(sc, "dbName")
and if you are using RStudio, click the refresh button at the upper right of the Connections pane.

GET API call for multiple URLs

I have about a year's experience with R but really struggle to wrap my head around loops; I'd really appreciate any explanation you can give along with an answer!
I am trying to use the Spotify API to loop through a list of music categories (this is the API term; they are called Genre/Mood in the Spotify app) and retrieve a list of playlists. To retrieve one category's playlists I can use the following:
I'd imagine this is a fairly simple problem which does not require testing or acquiring keys.
If required, keys can be easily acquired using the Spotify documentation (linked below); I can help with setup for this or any other Spotify project.
# Setup: store keys and authenticate
library(httr)  # oauth_* helpers and GET()
ClientID <- "************"
ClientSecret <- "***********"
# OAuth
spotifyEndpoint <- oauth_endpoint(NULL,
                                  "https://accounts.spotify.com/authorize",
                                  "https://accounts.spotify.com/api/token")
spotifyApp <- oauth_app("spotify", ClientID, ClientSecret)
spotifyToken <- oauth2.0_token(spotifyEndpoint, spotifyApp)
# Create URL to call
CatPlaylist <- paste("https://api.spotify.com/v1/browse/categories/", "funk", "/playlists", sep = "")
# Call the API using GET
CatPlaylist <- httr::GET(CatPlaylist, spotifyToken)
# Transform results from JSON
CatPlaylist <- jsonlite::fromJSON(jsonlite::toJSON(content(CatPlaylist)))
# Transform into a df
CatPlaylist <- t(data.frame(CatPlaylist$playlists$items$name))
How would I loop through this to collect other categories, effectively replacing "funk" with something like "party" or "chill"?
Edit: attempts added below.
I have tried the following, in which Cats holds the full URL for each call:
final <- NULL
for (i in 1:length(Cats)) {
  CatPlaylist <- paste("https://api.spotify.com/v1/browse/categories/", i, "/playlists", sep = "")
  CatPlaylist <- GET(CatPlaylist, spotifyToken)
  CatPlaylist <- jsonlite::fromJSON(toJSON(content(CatPlaylist)))
  CatPlaylist <- t(data.frame(CatPlaylist$playlists$items$name))
  final <- rbind(CatPlaylist, final)
}
API documentation
https://developer.spotify.com/web-api/get-categorys-playlists/
System info: R 3.3.2, RStudio 1.0.143, macOS Sierra 10.12.3
Thanks in advance :)
Found a solution to this. I still struggle with loops, so any suggestions are welcome, but in the meantime the solution is below in case anyone else needs the answer!
# Get category playlists - this works and returns a df with row names as categories.
# If country is not specified, the API will return results which apply to all countries.
my.list <- list()
my.listOwner <- list()
for (i in 1:length(Categories$id)) {
  CatPlaylist <- paste0("https://api.spotify.com/v1/browse/categories/", Categories$id[i], "/playlists?limit=50&country=GB")
  miss.id <- Categories$id[i]
  list.name <- as.character(miss.id)
  a <- GET(CatPlaylist, spotifyToken)
  b <- jsonlite::fromJSON(toJSON(content(a)))
  my.list[[list.name]] <- data.frame(unlist(b$playlists$items$name))
  my.listOwner[[list.name]] <- data.frame(unlist(b$playlists$items$owner$id))
}
final <-do.call(rbind,my.list)
finalOwner <- do.call(rbind,my.listOwner)
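For what it's worth, the same idea can be written without an explicit for loop. This is only a rough sketch, reusing Categories$id, spotifyToken, httr, and jsonlite from above:
get_category_playlists <- function(cat_id) {
  # build the category URL and fetch its playlists
  url <- paste0("https://api.spotify.com/v1/browse/categories/", cat_id, "/playlists?limit=50&country=GB")
  resp <- httr::GET(url, spotifyToken)
  parsed <- jsonlite::fromJSON(jsonlite::toJSON(httr::content(resp)))
  data.frame(playlist = unlist(parsed$playlists$items$name))
}
final <- do.call(rbind, lapply(Categories$id, get_category_playlists))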

Handling paginated SQL query results

For my dissertation data collection, one of the sources is an externally managed system based on a Web form for submitting SQL queries. Using R and RCurl, I have implemented an automated data-collection framework in which I simulate the above-mentioned form. Everything worked well while I was limiting the size of the resulting dataset, but when I tried to go over 100000 records (RQ_SIZE in the code below), the tandem "my code - their system" started becoming unresponsive ("hanging").
So, I decided to use the SQL pagination feature (LIMIT ... OFFSET ...) to submit a series of requests, hoping then to combine the paginated results into a target data frame. However, after changing my code accordingly, the only output I see is one pagination progress character (*) and then nothing more. I'd appreciate it if you could help me identify the probable cause of the unexpected behavior. I cannot provide a reproducible example, as it's very difficult to extract the functionality, not to mention the data, but I hope that the following code snippet is enough to reveal the issue (or, at least, a direction toward the problem).
# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
                DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData())  # retrieve result
data <- get(dataName)
numRequests <- as.numeric(data) %/% RQ_SIZE + 1
# Now, we can request & retrieve data via SQL pagination
for (i in 1:numRequests) {
  # setup SQL pagination
  if (rq$where == '') rq$where <- '1=1'
  rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))
  # Submit data request
  srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                  DATA_SEP, ADD_SQL)
  assign(dataName, srdaGetData())  # retrieve result
  data <- get(dataName)
  # some code
  # add current data frame to the list
  dfList <- c(dfList, data)
  if (DEBUG) message("*", appendLF = FALSE)
}
# merge all the result pages' data frames
data <- do.call("rbind", dfList)
# save current data frame to RDS file
saveRDS(data, rdataFile)
This probably falls into the category where MySQL struggles with large LIMIT ... OFFSET values:
Why does MYSQL higher LIMIT offset slow the query down?
Overall, repeatedly fetching large data sets over HTTP is not very reliable.
Since this is for your dissertation, here is a hand:
## Folder where to save the results to disk.
## Ideally, use a new, empty folder; it is then easier to load from disk.
folder.out <- "~/mydissertation/sql_data_scrape/"
## Create the folder if not exist.
dir.create(folder.out, showWarnings=FALSE, recursive=TRUE)
## The larger this number, the more memory you will require.
## If you are renting a large box on, say, EC2, then you can make this 100, or so
NumberOfOffsetsBetweenSaves <- 10
## The limit size per request
RQ_SIZE <- 1000
# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
DATA_SEP, ADD_SQL)
## Get the total number of rows
TotalRows <- as.numeric(srdaGetData())
TotalNumberOfRequests <- TotalRows %/% RQ_SIZE
TotalNumberOfGroups <- TotalNumberOfRequests %/% NumberOfOffsetsBetweenSaves + 1
## FYI: Total number of rows being requested is
## (NumberOfOffsetsBetweenSaves * RQ_SIZE * TotalNumberOfGroups)
for (g in seq(TotalNumberOfGroups)) {
  ret <-
    lapply(seq(NumberOfOffsetsBetweenSaves), function(i) {
      ## function(i) is the same code you have
      ## inside your for loop, but cleaned up.
      # setup SQL pagination
      if (rq$where == '')
        rq$where <- '1=1'
      ## offset advances page by page across groups
      rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE,
                        'OFFSET', RQ_SIZE * ((g - 1) * NumberOfOffsetsBetweenSaves + (i - 1)))
      # Submit data request
      srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                      DATA_SEP, ADD_SQL)
      # retrieve result
      data <- srdaGetData()
      # some code
      if (DEBUG) message("*", appendLF = FALSE)
      ### DON'T ASSIGN TO dfList, JUST RETURN `data`
      # xxxxxx DON'T DO: xxxxx dfList <- c(dfList, data)
      ### INSTEAD:
      ## return
      data
    })
  ## save each iteration
  file.out <- sprintf("%s/data_scrape_%04i.RDS", folder.out, g)
  saveRDS(do.call(rbind, ret), file = file.out)
  ## OPTIONAL (this will be slower, but will keep your rams and goats in line)
  # rm(ret)
  # gc()
}
Then, once you are done scraping:
library(data.table)
folder.out <- "~/mydissertation/sql_data_scrape/"
files <- dir(folder.out, full=TRUE, pattern="\\.RDS$")
## Create an empty list
myData <- vector("list", length=length(files))
## Option 1, using data.frame
for (i in seq(myData))
myData[[i]] <- readRDS(files[[i]])
DT <- do.call(rbind, myData)
## Option 2, using data.table
for (i in seq(myData))
myData[[i]] <- as.data.table(readRDS(files[[i]]))
DT <- rbindlist(myData)
I'm answering my own question, as I have finally figured out the real source of the problem. My investigation revealed that the unexpected waiting state of the program was due to PostgreSQL becoming confused by malformed SQL queries, which contained multiple LIMIT and OFFSET keywords.
The reason is pretty simple: I used rq$where both outside and inside the for loop, which made paste() concatenate the previous iteration's WHERE clause with the current one. I fixed the code by processing the contents of the WHERE clause and saving it before the loop, and then using the saved value safely in each iteration, as it became independent of the original WHERE clause.
This investigation also helped me fix some other deficiencies in my code and make improvements (such as using sub-selects to properly handle SQL queries returning the number of records for queries with aggregate functions). The moral of the story: you can never be too careful in software development. A big thank you to those nice people who helped with this question.
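For anyone hitting the same thing, here is a minimal sketch of the fix described above, reusing the names from my original snippet: keep the base WHERE clause untouched and build the paginated clause fresh on every iteration.
# save the base WHERE clause once, before the loop
baseWhere <- if (rq$where == '') '1=1' else rq$where
dfList <- list()
for (i in 1:numRequests) {
  # build this page's clause from the saved base value, not from the previous iteration
  pagedWhere <- paste(baseWhere, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE * (i - 1))
  srdaRequestData(queryURL, rq$select, rq$from, pagedWhere, DATA_SEP, ADD_SQL)
  dfList[[i]] <- srdaGetData()
}
data <- do.call(rbind, dfList)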