Comparing different models using LOOCV method - lasso-regression

I have to compare different models (OLS, best subset, ridge, lasso, PCR and PLS) using leave-one-out cross-validation (the criterion of comparison is the test MSE).
Could someone explain to me how to do it (possibly using an example dataset)?
I need the R code. Thank you all!
P.S.: Sorry for my English, I am not a native speaker.
OK, I've tried to use the "caret" package:
library(ISLR)
library(caret)
library(forecast)

myControl <- trainControl(method = 'LOOCV')

LM <- train(Salary ~ ., data = Hitters, method = 'lm',
            trControl = myControl)
Step <- train(Salary ~ ., data = Hitters, method = 'leapSeq',
              trControl = myControl)
Ridge <- train(Salary ~ ., data = Hitters, method = 'ridge',
               trControl = myControl)
Lasso <- train(Salary ~ ., data = Hitters, method = 'lasso',
               trControl = myControl)
PLS <- train(Salary ~ ., data = Hitters, method = 'pls',
             trControl = myControl)
PCR <- train(Salary ~ ., data = Hitters, method = 'pcr',
             trControl = myControl)
How can I set the parameters lambda, ncomp and nvmax?
Thank you all!

I think this is possible with the caret package: install.packages("caret"). The advantage of caret is that you can run many different models at the same time and compare their performance. I am, however, not sure whether all the models you need are available; check the following list to see if they are there: http://topepo.github.io/caret/modelList.html . I would also suggest reading the tutorial: http://www.edii.uclm.es/~useR-2013/Tutorials/kuhn/user_caret_2up.pdf.
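On the follow-up about lambda, ncomp and nvmax: a minimal sketch of how tuning values can be supplied to train() through tuneGrid. The grid values below are illustrative assumptions, not recommendations; for method = 'lasso' the tuning parameter is the elasticnet fraction rather than lambda, Hitters contains missing Salary values, and LOOCV refits every model n times, so this takes a while.
library(ISLR)
library(caret)

Hitters <- na.omit(Hitters)                   # Salary contains NAs
myControl <- trainControl(method = 'LOOCV')

Ridge <- train(Salary ~ ., data = Hitters, method = 'ridge',
               trControl = myControl,
               tuneGrid = expand.grid(lambda = seq(0, 0.1, length.out = 10)))
Lasso <- train(Salary ~ ., data = Hitters, method = 'lasso',
               trControl = myControl,
               tuneGrid = expand.grid(fraction = seq(0.1, 1, by = 0.1)))
PLS <- train(Salary ~ ., data = Hitters, method = 'pls',
             trControl = myControl,
             tuneGrid = expand.grid(ncomp = 1:10))
Step <- train(Salary ~ ., data = Hitters, method = 'leapSeq',
              trControl = myControl,
              tuneGrid = expand.grid(nvmax = 1:19))

# LOOCV RMSE at each model's best tuning value; square it for the test MSE
models <- list(ridge = Ridge, lasso = Lasso, pls = PLS, step = Step)
sapply(models, function(m) min(m$results$RMSE)^2)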

Related

How to run SPARQL queries in R (WBG Topical Taxonomy) without parallelization

I am an R user and I am interested in using the World Bank Group (WBG) Topical Taxonomy through SPARQL queries.
This can be done directly on the API https://vocabulary.worldbank.org/PoolParty/sparql/taxonomy, but it can also be done through R by using load.rdf (to load the taxonomy.rdf RDF/XML file downloaded from https://vocabulary.worldbank.org/ ) and then sparql.rdf to perform the query. These functions are available in the "rrdf" package.
These are the three lines of code:
taxonomy_file <- load.rdf("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE { <http://vocabulary.worldbank.org/taxonomy/435> <http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath }"
result_query_1 <- sparql.rdf(taxonomy_file, query)
What I obtain from result_query_1 is exactly what I get through the API.
However, the load.rdf function uses all the cores available on my computer, not only one. It somehow parallelizes the load over all the available cores, and I do not want that. I haven't found any option in that function to force serial execution.
Therefore, I am trying to find other solutions. For instance, I have tried rdf_parse and rdf_query from the "rdflib" package, but without any encouraging result. These are the code lines I have used:
taxonomy_file <- rdf_parse("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE { <http://vocabulary.worldbank.org/taxonomy/435> <http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath }"
result_query_2 <- rdf_query(taxonomy_file, query = query)
Is there any other function that performs this task? The objective of my work is to run several queries simultaneously using foreach.
Thank you very much for any suggestion you can provide.
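One possible direction (a sketch under assumptions, not a verified answer): dispatch the queries with foreach/doParallel and run rdflib inside each worker. Because parsed RDF objects wrap external pointers, the file is re-parsed inside every worker here, which is wasteful but avoids sharing the object across processes; whether rdf_parse itself stays on a single core is an assumption that would need checking.
library(rdflib)
library(foreach)
library(doParallel)

cl <- makeCluster(2)                  # worker count is an arbitrary choice
registerDoParallel(cl)

queries <- c(
  "SELECT DISTINCT ?nodeOnPath WHERE { <http://vocabulary.worldbank.org/taxonomy/435> <http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath }"
  # add the other queries here
)

results <- foreach(q = queries, .packages = "rdflib") %dopar% {
  rdf <- rdf_parse("taxonomy.rdf")    # parse inside the worker
  rdf_query(rdf, query = q)
}

stopCluster(cl)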

How should I merge monthly datasets into one dataset for cleaning?

I am working on a case study for a ride share. The data is broken up into monthly datasets, and in order to analyze the data over the last year I need to merge them. I uploaded all the data to both BigQuery and RStudio but am unsure of the best way to make one large dataset.
I may not even have to do this, but I believe that to find trends I should have all the data in one data table. If that is not the case, then I will clean the data one month at a time.
Maybe use purrr::map_dfr()? It's like lapply() and rbind() rolled into one.
library(tidyverse)

all_the_tables <-
  map_dfr(                              # row-binds the results as it loops
    .x = list.files(pattern = ".csv"),  # input for the function
    .f = read_csv                       # the function to apply to each file
  )
If it's more complicated and you need to vary something by the source file, you can use something like:
map_dfr(
  .x = list.files(pattern = ".csv"),
  .f =                                  # the tilde lets you write a more complex sequence of steps
    ~ read_csv(file = .x) |>
        mutate(source = .x)
)
If there are a lot of files, consider using vroom::vroom().
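For reference, a minimal sketch of the vroom route, assuming the monthly CSVs sit in the working directory; the "source" column name is an arbitrary choice, and the id argument records which file each row came from:
library(vroom)

all_the_tables <- vroom(
  list.files(pattern = "\\.csv$"),    # all monthly CSV files
  id = "source"                       # adds a column with the source file path
)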

How to use arcpullr::get_spatial_layer() and arcpullr::get_layer_by_poly()

I couldn't figure this out from the package documentation: https://cran.r-project.org/web/packages/arcpullr/vignettes/intro_to_arcpullr.html.
My code returns the errors described below.
library(arcpullr)
url <- "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/WBD/MapServer/1"
huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_poly(url,geometry = "esriGeometryPolygon")
huc8_1:
Error in if (layer_info$type == "Group Layer") { :
argument is of length zero
huc8_2:
Error in get_sf_crs(geometry) : "sf" %in% class(sf_obj) is not TRUE
Any help explaining the errors or suggesting solutions would be much appreciated. Thanks!
I didn't use the arcpullr package. Using leaflet.esri::addEsriFeatureLayer with a where clause works.
See the relevant code below, as an example:
leaflet::leaflet() |>                  # start from a leaflet map, then add the Esri feature layer
  leaflet.esri::addEsriFeatureLayer(
    url = "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/IR_201820_byParameter/MapServer/2",
    options = leaflet.esri::featureLayerOptions(where = IR_where_huc12)  # IR_where_huc12: a where-clause string defined elsewhere
  )
You have to pass an sf object as the second argument to any of the get_layer_by_* functions. I altered your example a bit, using a point instead of a polygon for the spatial query (since it's easier to create), but get_layer_by_poly would work the same way with an sf polygon instead of a point. Also, the service you used requires a token, so I changed the URL to the USGS 6-digit hydrologic unit basins instead.
library(arcpullr)
url <- "https://hydro.nationalmap.gov/arcgis/rest/services/wbd/MapServer/3"
query_pt <- sf_point(c(-90, 45))
# this would query everything in the feature layer, which may or may not be huge
# huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_point(url, query_pt)
huc_map <- plot_layer(huc8_2)
huc_map
huc_map + ggplot2::geom_sf(data = query_pt)
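To mirror the get_layer_by_poly call from the question, a small sketch that builds a plain sf polygon and passes it in the same way; the coordinates are arbitrary placeholders:
library(sf)

coords <- rbind(c(-90.5, 44.5), c(-89.5, 44.5), c(-89.5, 45.5),
                c(-90.5, 45.5), c(-90.5, 44.5))                   # closed ring
query_poly <- st_sf(geometry = st_sfc(st_polygon(list(coords)), crs = 4326))
huc8_3 <- get_layer_by_poly(url, query_poly)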

Sentiment analysis R syuzhet NRC Word-Emotion Association Lexicon

How do you find the words associated with the eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) of the NRC Word-Emotion Association Lexicon when using get_nrc_sentiment from the syuzhet package?
a <- c("I hate going to work it is dull","I love going to work it is fun")
a_corpus = Corpus(VectorSource(a))
a_tm <- TermDocumentMatrix(a_corpus)
a_tmx <- as.matrix(a_tm)
a_df<-data.frame(text=unlist(sapply(a, `[`)), stringsAsFactors=F)
a_sent<-get_nrc_sentiment(a_df$text)
For example, we can see in a_sent that one term has been classified as anger, but how do we find what that term was? I want to list all the sentiments and the terms associated with them in my example.
Thanks.
library(tidytext)
library(tm)
library(syuzhet)

a <- c("I hate going to work it is dull", "I love going to work it is fun")
a_corpus <- Corpus(VectorSource(a))
a_tm <- TermDocumentMatrix(a_corpus)
a_tmx <- as.matrix(a_tm)
a_df <- data.frame(text = unlist(sapply(a, `[`)), stringsAsFactors = FALSE)
a_sent <- get_nrc_sentiment(a_df$text)

lexicon <- get_sentiments("nrc")
v <- sort(rowSums(a_tmx), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# List the words in common between the text provided and the NRC lexicon
intersect(lexicon$word, d$word)

# Show the words in common and their sentiments
s <- cbind(lexicon$word[lexicon$word %in% d$word],
           lexicon$sentiment[lexicon$word %in% d$word])
print(s)
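The same lookup can also be written as a join (a small sketch using the d and lexicon data frames built above), which returns each matching word together with its frequency and every sentiment it maps to:
library(dplyr)

d %>%
  inner_join(lexicon, by = "word") %>%   # keep only words present in the NRC lexicon
  arrange(sentiment, word)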

xts objects and split

I have many "problems", but I will try to split them up as well as I can.
First, I will present my code:
# Loading packages
library(RODBC)
library(plyr)
library(lattice)
library(colorRamps)
library(PerformanceAnalytics)
# Picking up data
query <- "select convert(varchar(10),fk_dim_date,103) fk_dim_date, fk_dim_portfolio, dtd_portfolio_return_pct, dtd_benchmark_return_pct, * from nbim_dm..v_fact_performance
where fk_dim_date > '20130103' and fk_dim_portfolio in ('6906', '1812964')
"
# Formatting the SQL
query <- strwrap(query, width = nchar(query), simplify = TRUE)
# Querying
ch <- odbcDriverConnect("driver={SQL Server};server=XXXX;Database=XXXX;", rows_at_time = 1024)
result <- sqlQuery(ch, query, as.is = c(TRUE, TRUE, TRUE))
close(ch)
# Do some cleanup
result$v_d <- as.Date(as.POSIXct(result$v_d))
# Split
y <- split(qt, qt$fk_dim_portfolio)
# Making names
new_names <- c("one", "two")
for (i in 1:length(y)) { assign(new_names[i], y[[i]]) }
So far, so good.
The table my SQL runs against has approx. 178 different portfolio ids, some of which are useless and others highly useful. However, I want this code to pull all fk_dim_portfolio values (pulling '6906' and '1812964' was just for example purposes). After pulling the data, I want to separate it into n (currently 178) sets and make them xts objects, which I have run into some trouble with using:
qt <- xts(t[,-1],order.by=t[,1])
But this works perfectly well if I don't split the data using:
y <- split(qt,qt$fk_dim_portfolio)
Assuming this will work, my intention is to create charts.PerformanceSummary(mydata) for every one of the data frames created above.
If you have any tips on how to split the data, make the time series objects, and loop the chart generation, I would highly appreciate them.
I am aware that this post probably doesn't comply with your rules/customs etc., but thanks for helping.
Lars
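A minimal sketch of one way to do the split/convert/chart loop, assuming the query result is held in a data frame called result with the columns named in the question (v_d for the date, fk_dim_portfolio, and the two return columns); these names are taken from the question and may need adjusting.
library(xts)
library(PerformanceAnalytics)

result$v_d <- as.Date(result$v_d)

# Split by portfolio, convert each piece to an xts object, and chart it
by_portfolio <- split(result, result$fk_dim_portfolio)

for (id in names(by_portfolio)) {
  piece <- by_portfolio[[id]]
  returns <- xts(piece[, c("dtd_portfolio_return_pct", "dtd_benchmark_return_pct")],
                 order.by = piece$v_d)
  charts.PerformanceSummary(returns, main = paste("Portfolio", id))
}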