Sentiment analysis R syuzhet NRC Word-Emotion Association Lexicon - text-mining

How do you find the associated words to the eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) (NRC Word-Emotion Association Lexicon) when using get_nrc_sentiment of the using the syuzhet package?
a <- c("I hate going to work it is dull","I love going to work it is fun")
a_corpus = Corpus(VectorSource(a))
a_tm <- TermDocumentMatrix(a_corpus)
a_tmx <- as.matrix(a_tm)
a_df<-data.frame(text=unlist(sapply(a, `[`)), stringsAsFactors=F)
a_sent<-get_nrc_sentiment(a_df$text)
e.g. we can see in a_sent that one term has been classified as anger, but how do we find what that term was? So I want to list all the sentiments and the terms associated in my example.
Thanks.

library(tidytext)
library(tm)
a <- c("I hate going to work it is dull","I love going to work it is fun")
a_corpus = Corpus(VectorSource(a))
a_tm <- TermDocumentMatrix(a_corpus)
a_tmx <- as.matrix(a_tm)
a_df<-data.frame(text=unlist(sapply(a, `[`)), stringsAsFactors=F)
a_sent<-get_nrc_sentiment(a_df$text)
lexicon <- get_sentiments("nrc")
v <- sort(rowSums(a_tmx),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# List the words in common between the text provided and NRC lexicon
intersect(lexicon$word, d$word)
# Show the words in common and their sentiments
s <- cbind(lexicon$word[lexicon$word%in%d$word], lexicon$sentiment[lexicon$word%in%d$word])
print(s)

Related

How to use arcpullr::get_spatial_layer() and arcpullr::get_layer_by_poly()

I couldn't figure this out through the package documentation https://cran.r-project.org/web/packages/arcpullr/vignettes/intro_to_arcpullr.html.
My codes return the errors described below.
library(arcpullr)
url <- "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/WBD/MapServer/1"
huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_poly(url,geometry = "esriGeometryPolygon")
huc8_1:
Error in if (layer_info$type == "Group Layer") { :
argument is of length zero
huc8_2:
Error in get_sf_crs(geometry) : "sf" %in% class(sf_obj) is not TRUE
It would be very appreciated if you could provide any help to explain the errors and suggest any solutions. Thanks!
I didn't use the arcpullr package. Using leaflet.esri::addEsriFeatureLayer with a where clause works.
See the relevant codes below, as an example:
leaflet.esri::addEsriFeatureLayer(
url="https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/IR_201820_byParameter/MapServer/2",
options = leaflet.esri::featureLayerOptions(where = IR_where_huc12)
)
You have to pass an sf object as the second argument to any of the get_layer_by_* functions. I alter your example a bit using a point instead of a polygon for spatial querying (since it's easier to create), but get_layer_by_poly would work the same way using an sf polygon instead of a point. Also, the service you use requires a token. I changed the url to USGS HU 6-digit basins instead
library(arcpullr)
url <- "https://hydro.nationalmap.gov/arcgis/rest/services/wbd/MapServer/3"
query_pt <- sf_point(c(-90, 45))
# this would query everything in the feature layer, which may or may not be huge
# huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_point(url, query_pt)
huc_map <- plot_layer(huc8_2)
huc_map
huc_map + ggplot2::geom_sf(data = query_pt)

GET API call for multiple URLS

I have about a years experience with R but really struggle to wrap my head around loops, i'd really appreciate any explanations you guys have with go with an answer!
I am trying to use the Spotify API to loop round a list of Music Categories (- this is the API term, they are termed Genre/Mood in Spotify App) and retrieve a list of playlists. To retrieve one categories playlists I can use the following:
I'd imagine this is a fairly simple problem which does not require testing and acquiring keys.
If required, keys can be easily acquired using Spotify documentation (linked below), I can help with setup if required (for this or any other Spotify project).
#Setup up; Store keys and authenticate
ClientID <- "************"
ClientSecret <- "***********"
#OAuth
spotifyEndpoint <- oauth_endpoint(NULL,
"https://accounts.spotify.com/authorize",
"https://accounts.spotify.com/api/token")
spotifyApp <- oauth_app("spotify", ClientID, ClientSecret)
spotifyToken <- oauth2.0_token(spotifyEndpoint, spotifyApp)
#Create URL to call
CatPlaylist <- paste("https://api.spotify.com/v1/browse/categories/","funk","/playlists",sep="")
#Call api using GET
CatPlaylist <- httr::GET(CatPlaylist, spotifyToken)
#Transform results form JSON
CatPlaylist <- jsonlite::fromJSON(toJSON(content(CatPlaylist)))
#Transform into df
CatPlaylist <- t(data.frame(CatPlaylist$playlists$items$name))
How would I loop through this to collect other categories, effectively replacing "funk" with something like "party" or "chill".
Edit: Attempts added below
I have tried the following, in which
Cats
Holds the full URL for each call.
final = NULL
for(i in 1:length(Cats)){
CatPlaylist <- paste("https://api.spotify.com/v1/browse/categories/",i,"/playlists",sep="")
CatPlaylist <- GET(CatPlaylist, spotifyToken)
CatPlaylist <- jsonlite::fromJSON(toJSON(content(CatPlaylist)))
CatPlaylist <- t(data.frame(CatPlaylist$playlists$items$name))
final <- rbind(CatPlaylist,final)}
API documentation
https://developer.spotify.com/web-api/get-categorys-playlists/
System info: R 3.3.2 R Studio Version 1.0.143 OS Sierra 10_12_3
Thanks in advance :)
Found a solution to this. This struggle with loops so any suggestions are welcome but in the meantime the solution is below in case anyone else needs the answer!
#Get category playlist - this works and returns a DF with rownames as categories. If country is not specified the API will return results which apply to all countries.
my.list <-list()
my.listOwner <-list()
for(i in 1:length(Categories$id)){
CatPlaylist <- paste0("https://api.spotify.com/v1/browse/categories/",Categories$id[i],"/playlists?limit=50&country=GB",sep="")
miss.id <- Categories$id[i]
list.name<- as.character(miss.id)
a <- GET(CatPlaylist, spotifyToken)
b <- jsonlite::fromJSON(toJSON(content(a)))
my.list[[list.name]] <-data.frame(unlist(b$playlists$items$name))
my.listOwner[[list.name]] <-data.frame(unlist(b$playlists$items$owner$id))
}
final <-do.call(rbind,my.list)
finalOwner <- do.call(rbind,my.listOwner)

Why has the cluster of text mined in R gone fuzzy?

Question: Why has the cluster dendrogram of text mined data gone fuzzy /messy (see link to the diagram below)?
Synopsis:I first harvested the original data of approximately 5500 e-scanned articles from a Mongo database, and saved in disk drive as a Json object (ode not shown here, harvested using Cran Mongolite package for R). What is shown here is the standard text processing (using Cran TM package) to clean “the”, “and”, “ing”, “;”, “:” etc.,). That lead to the ensuing hierarchical clustering, which looks fuzzy/MESSY because some of the words in the Json object were very long combinations of letters and not real words that can be separately identified.
Calling two of the libraries
library("tm")
library ("SnowballC")
Creating a path to the data and a corpus of text
cname <- file.path("C:", "texts")
docs <- Corpus(DirSource(cname))
Processing the text
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
tdm <- TermDocumentMatrix(docs)
Thirdly the clustering via dendrogram
d<-dist(tdm,method = "euclidean")
hc<-hclust(d, method="ward.D2")
library("rafalib")
myplclust(hc, labels=hc$labels)
Link to the image:
cluster/dendrogram/text mining
Answer is cut less frequent words, more than just the standard stopwords that I had already done (see how I cut them in the code below)
mystopwords <- findFreqTerms(tdm, 1, 20)
mystpwrds <- paste(mystopwords, collapse = "|")
tdm <- tdm[tdm$dimnames$Terms[!grepl(mystpwrds,tdm$dimnames$Terms)],]
The whole picture and code is published here:
http://rpubs.com/antonyama/180574

Handling paginated SQL query results

For my dissertation data collection, one of the sources is an externally-managed system, which is based on Web form for submitting SQL queries. Using R and RCurl, I have implemented an automated data collection framework, where I simulate the above-mentioned form. Everything worked well while I was limiting the size of the resulting dataset. But, when I tried to go over 100000 records (RQ_SIZE in the code below), the tandem "my code - their system" started being unresponsive ("hanging").
So, I have decided to use SQL pagination feature (LIMIT ... OFFSET ...) to submit a series of requests, hoping then to combine the paginated results into a target data frame. However, after changing my code accordingly, the output that I see is only one pagination progress character (*) and then no more output. I'd appreciate, if you could help me identify the probable cause of the unexpected behavior. I cannot provide reproducible example, as it's very difficult to extract the functionality, not to mention the data, but I hope that the following code snippet would be enough to reveal the issue (or, at least, a direction toward the problem).
# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData()) # retrieve result
data <- get(dataName)
numRequests <- as.numeric(data) %/% RQ_SIZE + 1
# Now, we can request & retrieve data via SQL pagination
for (i in 1:numRequests) {
# setup SQL pagination
if (rq$where == '') rq$where <- '1=1'
rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))
# Submit data request
srdaRequestData(queryURL, rq$select, rq$from, rq$where,
DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData()) # retrieve result
data <- get(dataName)
# some code
# add current data frame to the list
dfList <- c(dfList, data)
if (DEBUG) message("*", appendLF = FALSE)
}
# merge all the result pages' data frames
data <- do.call("rbind", dfList)
# save current data frame to RDS file
saveRDS(data, rdataFile)
It probably falls into the category when presumably MySQL hinders LIMIT OFFSET:
Why does MYSQL higher LIMIT offset slow the query down?
Overall, fetching large data sets over HTTP repeatedly is not very reliable.
Since this is for your dissertation, here is a hand:
## Folder were to save the results to disk.
## Ideally, use a new, empty folder. Easier then to load from disk
folder.out <- "~/mydissertation/sql_data_scrape/"
## Create the folder if not exist.
dir.create(folder.out, showWarnings=FALSE, recursive=TRUE)
## The larger this number, the more memory you will require.
## If you are renting a large box on, say, EC2, then you can make this 100, or so
NumberOfOffsetsBetweenSaves <- 10
## The limit size per request
RQ_SIZE <- 1000
# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
DATA_SEP, ADD_SQL)
## Get the total number of rows
TotalRows <- as.numeric(srdaGetData())
TotalNumberOfRequests <- TotalRows %/% RQ_SIZE
TotalNumberOfGroups <- TotalNumberOfRequests %/% NumberOfOffsetsBetweenSaves + 1
## FYI: Total number of rows being requested is
## (NumberOfOffsetsBetweenSaves * RQ_SIZE * TotalNumberOfGroups)
for (g in seq(TotalNumberOfGroups)) {
ret <-
lapply(seq(NumberOfOffsetsBetweenSaves), function(i) {
## function(i) is the same code you have
## inside your for loop, but cleaned up.
# setup SQL pagination
if (rq$where == '')
rq$where <- '1=1'
rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*g*(i-1))
# Submit data request
srdaRequestData(queryURL, rq$select, rq$from, rq$where,
DATA_SEP, ADD_SQL)
# retrieve result
data <- srdaGetData()
# some code
if (DEBUG) message("*", appendLF = FALSE)
### DONT ASSIGN TO dfList, JUST RETURN `data`
# xxxxxx DONT DO: xxxxx dfList <- c(dfList, data)
### INSTEAD:
## return
data
})
## save each iteration
file.out <- sprintf("%s/data_scrape_%04i.RDS", folder.out, g)
saveRDS(do.call(rbind, ret), file=file.out)
## OPTIONAL (this will be slower, but will keep your rams and goats in line)
# rm(ret)
# gc()
}
Then, once you are done scraping:
library(data.table)
folder.out <- "~/mydissertation/sql_data_scrape/"
files <- dir(folder.out, full=TRUE, pattern="\\.RDS$")
## Create an empty list
myData <- vector("list", length=length(files))
## Option 1, using data.frame
for (i in seq(myData))
myData[[i]] <- readRDS(files[[i]])
DT <- do.call(rbind, myData)
## Option 2, using data.table
for (i in seq(myData))
myData[[i]] <- as.data.table(readRDS(files[[i]]))
DT <- rbindlist(myData)
I'm answering my own question, as, finally, I have figured out what has been the real source of the problem. My investigation revealed that the unexpected waiting state of the program was due to PostgreSQL becoming confused by malformed SQL queries, which contained multiple LIMIT and OFFSET keywords.
The reason of that is pretty simple: I used rq$where both outside and inside the for loop, which made paste() concatenate previous iteration's WHERE clause with the current one. I have fixed the code by processing contents of the WHERE clause and saving it before the loop and then using the saved value in each iteration of the loop safely, as it became independent from the value of the original WHERE clause.
This investigation also helped me to fix some other deficiencies in my code and make improvements (such as using sub-selects to properly handle SQL queries returning number of records for queries with aggregate functions). The moral of the story: you can never be too careful in software development. Big thank you to those nice people who helped with this question.

xts objects and split

I have many "problems" but i will try to split them up as good as i can.
First i will present my code:
# Laster pakker
library(RODBC)
library(plyr)
library(lattice)
library(colorRamps)
library(Perfor)
# Picking up data
query <- "select convert(varchar(10),fk_dim_date,103) fk_dim_date,fk_dim_portfolio, dtd_portfolio_return_pct, dtd_benchmark_return_pct, * from nbim_dm..v_fact_performance
where fk_dim_date > '20130103' and fk_dim_portfolio in ('6906', '1812964')
"
# Formatting SQLen
query <- strwrap(query,width=nchar(query),simplify=TRUE)
# quering
ch <- odbcDriverConnect("driver={SQL Server};server=XXXX;Database=XXXX;", rows_at_time = 1024)
result <- sqlQuery(ch, query, as.is=c(TRUE, TRUE, TRUE))
close(ch)
# Do some cleanup
`enter code here`resultt$v_d <- as.Date(as.POSIXct(t$v_d))
#split
y <- split(qt,qt$fk_dim_portfolio)
#making names
new_names <- c("one","two")
for (i in 1:length(y){assign(new_names[i],y[[i]])})
So far so good:
The table that my SQL is running on has approx 178 diff. port_ids, some of which are useless and others that are highly useful. However i want this code to pull all fk_dim_ports (pulling: '6906', '1812964 was just for example purposes). After pulling the data i want to seperate it into n (now 178 sets) and make them xts objects which i have run into some trouble using:
qt <- xts(t[,-1],order.by=t[,1])
But works perfectly well if i don`t split the data using:
y <- split(qt,qt$fk_dim_portfolio)
Assuming this will work, my intention is to create charts.PerformanceSummary(mydata) for every table of my previous created data frames.
If you have any tips on how to split, make timeseries objects and loop the generation of the charts i would highly appreciate this.
I am aware that this post probably don`t comply to your rules/customs etc, but thanks for helping.
Lars