I have many "problems", but I will try to split them up as well as I can.
First, I will present my code:
# Loading packages
library(RODBC)
library(plyr)
library(lattice)
library(colorRamps)
library(PerformanceAnalytics)  # for charts.PerformanceSummary
# Picking up data
query <- "select convert(varchar(10),fk_dim_date,103) fk_dim_date,fk_dim_portfolio, dtd_portfolio_return_pct, dtd_benchmark_return_pct, * from nbim_dm..v_fact_performance
where fk_dim_date > '20130103' and fk_dim_portfolio in ('6906', '1812964')
"
# Formatting the SQL
query <- strwrap(query, width = nchar(query), simplify = TRUE)
# Querying
ch <- odbcDriverConnect("driver={SQL Server};server=XXXX;Database=XXXX;", rows_at_time = 1024)
result <- sqlQuery(ch, query, as.is=c(TRUE, TRUE, TRUE))
close(ch)
# Do some cleanup
result$v_d <- as.Date(as.POSIXct(result$v_d))
#split
y <- split(qt,qt$fk_dim_portfolio)
#making names
new_names <- c("one","two")
for (i in 1:length(y)) { assign(new_names[i], y[[i]]) }
So far so good:
The table that my SQL runs against has approximately 178 different portfolio ids, some of which are useless and others highly useful. However, I want this code to pull all fk_dim_portfolio values (pulling '6906' and '1812964' was just for example purposes). After pulling the data, I want to separate it into n sets (currently 178) and make them xts objects, which I have run into some trouble doing with:
qt <- xts(t[,-1],order.by=t[,1])
This works perfectly well if I don't split the data using:
y <- split(qt,qt$fk_dim_portfolio)
Assuming this will work, my intention is to create charts.PerformanceSummary(mydata) for each of my previously created data frames.
If you have any tips on how to split the data, make time series objects, and loop the chart generation, I would highly appreciate it.
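For what it's worth, here is a rough sketch of what I am aiming for (not tested code, just to illustrate; it assumes the cleaned query result is in result, with the date in v_d, the portfolio id in fk_dim_portfolio, and the two return columns from my SELECT above):
# Rough sketch of the goal, not tested
library(xts)
library(PerformanceAnalytics)
# (drop the fk_dim_portfolio filter in the SQL above to pull every portfolio)
y <- split(result, result$fk_dim_portfolio)
xts_list <- lapply(y, function(df) {
  # the returns come back as character because of as.is, so convert them first;
  # divide by 100 here if the *_pct columns are whole percentages
  xts(data.frame(portfolio = as.numeric(df$dtd_portfolio_return_pct),
                 benchmark = as.numeric(df$dtd_benchmark_return_pct)),
      order.by = df$v_d)
})
for (nm in names(xts_list)) {
  charts.PerformanceSummary(xts_list[[nm]], main = paste("Portfolio", nm))
}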
I am aware that this post probably doesn't comply with your rules/customs etc., but thanks for helping.
Lars
I am trying to use a list of sample sites in an SQL query to extract data for those sample sites. I am using a loop to draw data from several databases and then bind it together as one data frame.
This works if I list out all the sample sites separately, but I will eventually have over 1000 sample sites and would rather just use a list instead of writing them all out separately in the query.
Here is how I have built my SQL query:
Query_all <- paste0("
SELECT
SAMPLE_POINTS.REF
MEASUREMENTS.RESULT
FROM
SAMPLES INNER JOIN MEASUREMENTS ON SAMPLES.SAMP_ID =
MEASUREMENTS.MEAS_SAMPLE_ID
WHERE
SAMPLE_POINTS.REF in ('SITE0001','SITE0002');",sep="")
data <- sqlQuery(channel, Query_all, stringsAsFactors = FALSE)
I would like the WHERE clause to refer out to a list rather than have to list SITE0001, SITE0002, SITE0003...SITE1000.
Other answers on here don't have the SQL query as a separate variable, so I am struggling to see how I can change my code based on them. I am new to R, SQL, and databases generally, and this is code someone else has given me to use.
Thanks
We have two problems:
Construct your vector of site strings. (Let's use vectors, not lists, as you shouldn't have any need for a list class object here.)
Put that into the string with the rest of the query.
## build the site strings
library(stringr)
## I'll use `1:3` to illustrate, you can change my 3 to 1000
sites = paste0("SITE", str_pad(1:3, width = 4, side = "left", pad = "0"))
# make sure they look right
sites
# [1] "SITE0001" "SITE0002" "SITE0003"
site_string = toString(sQuote(sites, q = FALSE))
# this quotes the sites and sticks them together
# as one comma-separated string:
site_string
# [1] "'SITE0001', 'SITE0002', 'SITE0003'"
That looks ready to go into an SQL query. Here are three ways we could do that; for all three, I'll start with a (slightly different) version of the starting query:
With base R's paste, we can simply stick things together. This works pretty well if you're sticking one piece in a query.
# paste method:
qry = "SELECT
SAMPLE_POINTS.REF
MEASUREMENTS.RESULT
FROM
SAMPLES INNER JOIN MEASUREMENTS ON SAMPLES.SAMP_ID =
MEASUREMENTS.MEAS_SAMPLE_ID
WHERE
SAMPLE_POINTS.REF in ("
qry_all = paste(qry, site_string, ");") ## we also paste on the final );
paste is really common, but for "fill-in-the-blank" sort of work, I like sprintf better. We use %s as a placeholder in the string, and sprintf replaces it with the additional arguments. If you are filling in variables in multiple places in the query, I find this much more readable:
# sprintf method:
qry = "SELECT
SAMPLE_POINTS.REF
MEASUREMENTS.RESULT
FROM
SAMPLES INNER JOIN MEASUREMENTS ON SAMPLES.SAMP_ID =
MEASUREMENTS.MEAS_SAMPLE_ID
WHERE
SAMPLE_POINTS.REF in (%s);"
qry_all = sprintf(qry, site_string)
Improving on that even more, the glue package lets us refer to R objects in the string by name, wrapped in {}. This is really nice if you are putting multiple objects into the string: with sprintf you have to make sure you get their order right, whereas here the names keep the code readable:
# the glue method:
library(glue)
qry = "SELECT
SAMPLE_POINTS.REF
MEASUREMENTS.RESULT
FROM
SAMPLES INNER JOIN MEASUREMENTS ON SAMPLES.SAMP_ID =
MEASUREMENTS.MEAS_SAMPLE_ID
WHERE
SAMPLE_POINTS.REF in ({site_string});"
qry_all = glue(qry)
All of these methods will generate the same qry_all string, which you can pass to sqlQuery as usual: data <- sqlQuery(channel, qry_all, stringsAsFactors = FALSE)
Also note that if you print() these qry_all strings in your console, the linebreaks will be printed as \n. To interpret the linebreaks, use cat(qry_all).
I am working on a case study for a ride-share company. The data is broken up into monthly datasets, and in order to analyze the data over the last year I need to merge them. I uploaded all the data to both BigQuery and RStudio but am unsure of the best way to make one large dataset.
I may not even have to do this, but I believe that to find trends I should have all the data in one data table. If that is not the case, then I will clean the data one month at a time.
Maybe use purrr::map_dfr()? It's like lapply() and rbind() rolled into one.
library(tidyverse)
all_the_tables <-
map_dfr( # union as it loops over the function
.x = list.files(pattern = ".csv"), # input for the function
.f = read_csv # the function
)
If it's more complicated and you need to vary something by source file, you can use something like this:
map_dfr(
.x = list.files(pattern = ".csv"),
.f = # the tilde lets you build a more complex sequence of steps
~read_csv(.x) |>
mutate(source = .x)
)
If it's a lot of files, consider using vroom::vroom().
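For example, something along these lines (a sketch: vroom reads and row-binds all the files in one call, and its id argument records which file each row came from):
library(vroom)
files <- list.files(pattern = ".csv")
all_the_tables <- vroom(files, id = "source")  # one data frame, with a column naming the source file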
I couldn't figure this out from the package documentation: https://cran.r-project.org/web/packages/arcpullr/vignettes/intro_to_arcpullr.html.
My code returns the errors described below.
library(arcpullr)
url <- "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/WBD/MapServer/1"
huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_poly(url,geometry = "esriGeometryPolygon")
huc8_1:
Error in if (layer_info$type == "Group Layer") { :
argument is of length zero
huc8_2:
Error in get_sf_crs(geometry) : "sf" %in% class(sf_obj) is not TRUE
Any help explaining these errors or suggesting solutions would be much appreciated. Thanks!
I didn't use the arcpullr package; using leaflet.esri::addEsriFeatureLayer with a where clause works instead.
See the relevant code below as an example:
# addEsriFeatureLayer() is added onto an existing leaflet map (e.g. one created with leaflet::leaflet());
# IR_where_huc12 is a WHERE clause string defined elsewhere by the user
leaflet.esri::addEsriFeatureLayer(
  url = "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/IR_201820_byParameter/MapServer/2",
  options = leaflet.esri::featureLayerOptions(where = IR_where_huc12)
)
You have to pass an sf object as the second argument to any of the get_layer_by_* functions. I've altered your example a bit to use a point instead of a polygon for the spatial query (since it's easier to create), but get_layer_by_poly works the same way with an sf polygon instead of a point. Also, the service you used requires a token, so I changed the URL to the USGS HU 6-digit basins instead.
library(arcpullr)
url <- "https://hydro.nationalmap.gov/arcgis/rest/services/wbd/MapServer/3"
query_pt <- sf_point(c(-90, 45))
# this would query everything in the feature layer, which may or may not be huge
# huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_point(url, query_pt)
huc_map <- plot_layer(huc8_2)
huc_map
huc_map + ggplot2::geom_sf(data = query_pt)
How do you find the words associated with the eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) of the NRC Word-Emotion Association Lexicon when using get_nrc_sentiment from the syuzhet package?
library(tm)
library(syuzhet)
a <- c("I hate going to work it is dull","I love going to work it is fun")
a_corpus = Corpus(VectorSource(a))
a_tm <- TermDocumentMatrix(a_corpus)
a_tmx <- as.matrix(a_tm)
a_df<-data.frame(text=unlist(sapply(a, `[`)), stringsAsFactors=F)
a_sent<-get_nrc_sentiment(a_df$text)
For example, we can see in a_sent that one term has been classified as anger, but how do we find which term that was? I want to list all the sentiments and their associated terms for my example.
Thanks.
library(tidytext)
library(tm)
library(syuzhet)
a <- c("I hate going to work it is dull","I love going to work it is fun")
a_corpus = Corpus(VectorSource(a))
a_tm <- TermDocumentMatrix(a_corpus)
a_tmx <- as.matrix(a_tm)
a_df<-data.frame(text=unlist(sapply(a, `[`)), stringsAsFactors=F)
a_sent<-get_nrc_sentiment(a_df$text)
lexicon <- get_sentiments("nrc")
v <- sort(rowSums(a_tmx),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# List the words in common between the text provided and NRC lexicon
intersect(lexicon$word, d$word)
# Show the words in common and their sentiments
s <- cbind(lexicon$word[lexicon$word %in% d$word],
           lexicon$sentiment[lexicon$word %in% d$word])
print(s)
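If you also want the corpus frequencies next to each sentiment, a simple merge on the word column does it (a small, optional extension of the objects built above):
# join the NRC sentiments to the term frequencies from the corpus
s_df <- merge(d, lexicon, by = "word")
s_df[order(s_df$sentiment), ]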
For my dissertation data collection, one of the sources is an externally managed system, which is based on a web form for submitting SQL queries. Using R and RCurl, I have implemented an automated data collection framework in which I simulate the above-mentioned form. Everything worked well while I limited the size of the resulting dataset, but when I tried to go over 100000 records (RQ_SIZE in the code below), the tandem "my code - their system" started becoming unresponsive ("hanging").
So, I decided to use the SQL pagination feature (LIMIT ... OFFSET ...) to submit a series of requests, hoping then to combine the paginated results into a target data frame. However, after changing my code accordingly, the only output I see is a single pagination progress character (*) and then nothing more. I'd appreciate it if you could help me identify the probable cause of this unexpected behavior. I cannot provide a reproducible example, as it's very difficult to extract the functionality, not to mention the data, but I hope the following code snippet is enough to reveal the issue (or at least point toward the problem).
# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData()) # retrieve result
data <- get(dataName)
numRequests <- as.numeric(data) %/% RQ_SIZE + 1
# Now, we can request & retrieve data via SQL pagination
for (i in 1:numRequests) {
  # setup SQL pagination
  if (rq$where == '') rq$where <- '1=1'
  rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))
  # Submit data request
  srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                  DATA_SEP, ADD_SQL)
  assign(dataName, srdaGetData()) # retrieve result
  data <- get(dataName)
  # some code
  # add current data frame to the list
  dfList <- c(dfList, data)
  if (DEBUG) message("*", appendLF = FALSE)
}
# merge all the result pages' data frames
data <- do.call("rbind", dfList)
# save current data frame to RDS file
saveRDS(data, rdataFile)
This probably falls into the category where, presumably, MySQL struggles with large LIMIT ... OFFSET values:
Why does MYSQL higher LIMIT offset slow the query down?
Overall, fetching large data sets over HTTP repeatedly is not very reliable.
Since this is for your dissertation, here is a hand:
## Folder were to save the results to disk.
## Ideally, use a new, empty folder. Easier then to load from disk
folder.out <- "~/mydissertation/sql_data_scrape/"
## Create the folder if not exist.
dir.create(folder.out, showWarnings=FALSE, recursive=TRUE)
## The larger this number, the more memory you will require.
## If you are renting a large box on, say, EC2, then you can make this 100, or so
NumberOfOffsetsBetweenSaves <- 10
## The limit size per request
RQ_SIZE <- 1000
# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
DATA_SEP, ADD_SQL)
## Get the total number of rows
TotalRows <- as.numeric(srdaGetData())
TotalNumberOfRequests <- TotalRows %/% RQ_SIZE
TotalNumberOfGroups <- TotalNumberOfRequests %/% NumberOfOffsetsBetweenSaves + 1
## FYI: Total number of rows being requested is
## (NumberOfOffsetsBetweenSaves * RQ_SIZE * TotalNumberOfGroups)
for (g in seq(TotalNumberOfGroups)) {
  ret <-
    lapply(seq(NumberOfOffsetsBetweenSaves), function(i) {
      ## function(i) is the same code you have
      ## inside your for loop, but cleaned up.
      # setup SQL pagination
      if (rq$where == '')
        rq$where <- '1=1'
      ## the offset accounts for both the group and the request within the group
      rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE,
                        'OFFSET', RQ_SIZE * ((g - 1) * NumberOfOffsetsBetweenSaves + (i - 1)))
      # Submit data request
      srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                      DATA_SEP, ADD_SQL)
      # retrieve result
      data <- srdaGetData()
      # some code
      if (DEBUG) message("*", appendLF = FALSE)
      ### DON'T ASSIGN TO dfList, JUST RETURN `data`
      # xxxxxx DON'T DO: xxxxx dfList <- c(dfList, data)
      ### INSTEAD:
      ## return
      data
    })
  ## save each group of results
  file.out <- sprintf("%s/data_scrape_%04i.RDS", folder.out, g)
  saveRDS(do.call(rbind, ret), file=file.out)
  ## OPTIONAL (this will be slower, but will keep your rams and goats in line)
  # rm(ret)
  # gc()
}
Then, once you are done scraping:
library(data.table)
folder.out <- "~/mydissertation/sql_data_scrape/"
files <- dir(folder.out, full=TRUE, pattern="\\.RDS$")
## Create an empty list
myData <- vector("list", length=length(files))
## Option 1, using data.frame
for (i in seq(myData))
myData[[i]] <- readRDS(files[[i]])
DT <- do.call(rbind, myData)
## Option 2, using data.table
for (i in seq(myData))
myData[[i]] <- as.data.table(readRDS(files[[i]]))
DT <- rbindlist(myData)
I'm answering my own question, as I have finally figured out the real source of the problem. My investigation revealed that the program's unexpected waiting state was due to PostgreSQL becoming confused by malformed SQL queries, which contained multiple LIMIT and OFFSET keywords.
The reason for that is pretty simple: I used rq$where both outside and inside the for loop, which made paste() concatenate the previous iteration's WHERE clause with the current one. I fixed the code by saving the contents of the WHERE clause before the loop and then using that saved value in each iteration, so each query is built independently of the previous one.
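For completeness, a minimal sketch of the fix (base_where is just an illustrative name for the saved clause; everything else follows the code in my question):
# save the original WHERE clause once, before the loop
base_where <- if (rq$where == '') '1=1' else rq$where
for (i in 1:numRequests) {
  # rebuild the clause from the saved value each time,
  # so LIMIT/OFFSET from earlier iterations never accumulate
  rq$where <- paste(base_where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))
  srdaRequestData(queryURL, rq$select, rq$from, rq$where, DATA_SEP, ADD_SQL)
  dfList <- c(dfList, list(srdaGetData()))  # keep each page as one list element
}
data <- do.call("rbind", dfList)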
This investigation also helped me fix some other deficiencies in my code and make improvements (such as using sub-selects to properly handle SQL queries that count records when aggregate functions are involved). The moral of the story: you can never be too careful in software development. A big thank you to the nice people who helped with this question.