r flexdashboard issue with arules reactive input - shinydashboard

I'm trying to create a flexdashboard with results from the R arules apriori function, in order to display association rules for specific items selected from the pull-down menu in the markdown dashboard. When I call the function outside the markdown environment, I can successfully feed a new product item to the apriori function without a problem, and the graph changes as expected when I change the item. When I replace that variable with the reactive function name, I get an error saying there are no rules: "Error: x contains 0 rules!"
Using the provided Groceries data, I want the reactive input to feed the selected variable (either "whole milk" or "sugar") into the apriori function, output the rules for the specified variable, and create a graph. I'm new to dashboards, so I don't know whether I need to use a different function than reactive().
Below is the R Markdown code where I'm having trouble feeding a new variable into the apriori function.
---
title: "Grocery_Test"
output:
  flexdashboard::flex_dashboard:
    storyboard: true
    social: menu
    source_code: embed
runtime: shiny
---
```{r global, include=FALSE}
library(plotly)
library(flexdashboard)
library(htmlwidgets)
library(htmltools)
library(knitr)
library(arules)
library(arulesViz)
library(igraph)
library(datasets)
data(Groceries)
```

Inputs {.sidebar}
-----------------------------------------------------------------------
```{r}
selectInput("Product","Product", c("whole milk","sugar"))
```
### Associated Product
```{r}
product <- reactive({ input$product })

renderPlot({
  # product <- "whole milk"
  rules <- apriori(data = Groceries,
                   parameter  = list(supp = 0.001, conf = 0.15, maxlen = 3),
                   appearance = list(default = "rhs", lhs = Product()),
                   control    = list(verbose = FALSE))
  rules_conf <- sort(rules, by = c("confidence"), decreasing = TRUE)    # 'high-confidence' rules
  redundant  <- which(colSums(is.subset(rules_conf, rules_conf)) > 1)   # get redundant rules in vector
  rules_conf <- rules_conf[-redundant]                                  # remove redundant rules
  plot(rules_conf, method = "graph", measure = "confidence", shading = "lift",
       control = list(engine = "htmlwidget"))
})
```
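For reference, one likely cause (not confirmed above) is a pair of name mismatches: the input is registered as "Product", but the reactive reads input$product, and the chunk then calls Product() although the reactive is named product. A minimal sketch of the render chunk with the names aligned, under that assumption:

product <- reactive({ input$Product })   # must match the selectInput id exactly (ids are case-sensitive)

renderPlot({
  rules <- apriori(data = Groceries,
                   parameter  = list(supp = 0.001, conf = 0.15, maxlen = 3),
                   appearance = list(default = "rhs", lhs = product()),  # call the reactive by its own name
                   control    = list(verbose = FALSE))
  rules_conf <- sort(rules, by = "confidence", decreasing = TRUE)
  redundant  <- which(colSums(is.subset(rules_conf, rules_conf)) > 1)
  if (length(redundant) > 0) rules_conf <- rules_conf[-redundant]   # guard: x[-integer(0)] would drop everything
  # the htmlwidget engine is omitted here because renderPlot() expects a base-graphics plot
  plot(rules_conf, method = "graph", measure = "confidence", shading = "lift")
})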


How to use arcpullr::get_spatial_layer() and arcpullr::get_layer_by_poly()

I couldn't figure this out through the package documentation https://cran.r-project.org/web/packages/arcpullr/vignettes/intro_to_arcpullr.html.
My code returns the errors described below.
library(arcpullr)
url <- "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/WBD/MapServer/1"
huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_poly(url,geometry = "esriGeometryPolygon")
huc8_1:
Error in if (layer_info$type == "Group Layer") { :
argument is of length zero
huc8_2:
Error in get_sf_crs(geometry) : "sf" %in% class(sf_obj) is not TRUE
I would really appreciate any help explaining these errors and suggesting solutions. Thanks!
I ended up not using the arcpullr package. Using leaflet.esri::addEsriFeatureLayer with a where clause works.
See the relevant code below as an example:
leaflet.esri::addEsriFeatureLayer(
  url = "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/IR_201820_byParameter/MapServer/2",
  options = leaflet.esri::featureLayerOptions(where = IR_where_huc12)
)
You have to pass an sf object as the second argument to any of the get_layer_by_* functions. I've altered your example a bit to use a point instead of a polygon for the spatial query (since it's easier to create), but get_layer_by_poly works the same way with an sf polygon instead of a point. Also, the service you use requires a token, so I changed the URL to the USGS HU 6-digit basins instead.
library(arcpullr)
url <- "https://hydro.nationalmap.gov/arcgis/rest/services/wbd/MapServer/3"
query_pt <- sf_point(c(-90, 45))
# this would query everything in the feature layer, which may or may not be huge
# huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_point(url, query_pt)
huc_map <- plot_layer(huc8_2)
huc_map
huc_map + ggplot2::geom_sf(data = query_pt)
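Since the question specifically mentioned get_layer_by_poly, here is a hedged sketch of the same query with an sf polygon built with the sf package; the coordinates below are purely illustrative and assume the service uses WGS84 (EPSG:4326):

library(arcpullr)
library(sf)

url <- "https://hydro.nationalmap.gov/arcgis/rest/services/wbd/MapServer/3"

# A small, made-up rectangle around the query point used above (closed ring: last row repeats the first)
coords <- matrix(c(-90.5, 44.5,
                   -89.5, 44.5,
                   -89.5, 45.5,
                   -90.5, 45.5,
                   -90.5, 44.5),
                 ncol = 2, byrow = TRUE)
query_poly <- st_sf(geometry = st_sfc(st_polygon(list(coords)), crs = 4326))

# As with get_layer_by_point, the sf object goes in as the second argument
huc_by_poly <- get_layer_by_poly(url, query_poly)
plot_layer(huc_by_poly)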

How do I find a specific tag's value (which could be anything) with beautifulsoup?

I am trying to get the job IDs from the tags of Indeed listings. So far, I have taken Indeed search results and put each job into its own bs4.element.Tag object, but I don't know how to extract the value of the tag attribute (or is it a class?) "data-jk". Here is what I have so far:
import requests
import bs4
import re

# 1: scrape (5?) pages of search results for listing ID's
results = []
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=10"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=20"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=30"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=40"))
# each search page has a query "q", location "l", and a "start" = 10*int
# the search results are contained in a "td" with ID = "resultsCol"

justjobs = []
for eachResult in results:
    soup_jobs = bs4.BeautifulSoup(eachResult.text, "lxml")  # this is for IDs
    justjobs.extend(soup_jobs.find_all(attrs={"data-jk": True}))  # re.compile("data-jk")

# each "card" is a div object
# each has the class "jobsearch-SerpJobCard unifiedRow row result clickcard"
# as well as a specific tag "data-jk"
# "data-jk" seems to be the actual IDs used in each listing's URL

# Now, each div element has a data-jk. I will try to get data-jk from each one:
jobIDs = []
print(type(justjobs[0]))  # DEBUG
for eachJob in justjobs:
    jobIDs.append(eachJob.find("data-jk"))
print("Length: " + str(len(jobIDs)))  # DEBUG
print("Example JobID: " + str(jobIDs[1]))  # DEBUG
The examples I've seen online generally try to get the information contained between the opening and closing tags, but I am not sure how to get the info from inside the (opening) tag itself. I've tried doing it by parsing the tag as a string instead:
print(justjobs[0])
for eachJob in justjobs:
    jobIDs.append(str(eachJob)[115:131])
print(jobIDs)
but the website is also inconsistent with how the tags operate, and I think that using beautifulsoup would be more flexible than multiple cases and substrings.
Any pointers would be greatly appreciated!
Looks like you can regex them out from a script tag
import requests,re
html = requests.get('https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0').text
p = re.compile(r"jk:'(.*?)'")
ids = p.findall(html)
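Alternatively, to answer the attribute question directly: BeautifulSoup exposes a tag's attributes through dictionary-style access on the Tag object, so the data-jk values can be read straight from the cards you already collected. A small sketch along those lines:

import requests
import bs4

html = requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0").text
soup = bs4.BeautifulSoup(html, "lxml")

# find_all() returns Tag objects; an attribute value is read like a dict entry
cards = soup.find_all(attrs={"data-jk": True})
job_ids = [card["data-jk"] for card in cards]   # card.get("data-jk") avoids a KeyError if the attribute is missing
print(len(job_ids), job_ids[:3])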

How to retain sliderInput ranges when additional variables are added dynamically?

The following code creates sliderInput control widgets for a selected number of variables. For brevity, the code that filters by the sliderInput ranges has been removed, so please ignore what is shown in the data table. A sliderInput widget pops up whenever a new variable is selected or added to the list of existing variables. However, these widgets do not retain the previously selected values: say you select sepal.length, modify its slider range, and then select sepal.width. We'll have two widgets, but the sepal.length widget does not retain the modified values; the slider falls back to its original values. Admittedly, retaining modified values is not coded anywhere, and I'm unable to figure out how it can be done. I understand that I need to read the sliderInput range values as soon as they are modified, save the new values, and then update the sliderInput ranges with the saved values before the list of functions is called in renderUI. The other way would be to read the new values into a global variable and use them while creating the widgets, but I plan to access the control widgets in other tabs, so that does not seem like a neat idea. I'd greatly appreciate any suggestions.
# Create the vectors whose information is used in creating (dynamic) control widgets
varnames <- names(iris[, 1:4])             # names
varinit  <- apply(iris[, 1:4], 2, median)  # initial value used in slider
varmin   <- apply(iris[, 1:4], 2, min)     # min.
varmax   <- apply(iris[, 1:4], 2, max)     # max.
ui <- navbarPage(
  tabPanel("Plot",
           sidebarLayout(
             sidebarPanel(
               checkboxGroupInput("ConditioningVariables",
                                  "Conditioning variables (choose one or more):",
                                  varnames, inline = TRUE),
               uiOutput("ControlWidgetsofConditioningVariables")
             ),
             mainPanel(
               dataTableOutput("data")
             )
           )
  )
)
server <- function(input, output, session) {
  allControls <- lapply(setNames(varnames, varnames), function(x) {
    sliderInput(x, x, varmin[x], varmax[x], c(varmin[x], varinit[x]),
                round = -2)
  })
  output$ControlWidgetsofConditioningVariables <- renderUI({
    if (is.null(input$ConditioningVariables)) {
      return()
    } else {
      allControls[input$ConditioningVariables]
    }
  })
  ## filtering related code is removed, so, data does not change ...
  newdata <- reactive({
    iris
  })
  output$data <- renderDataTable({ newdata() })
}
shinyApp(ui, server)
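One commonly suggested approach for this pattern (a sketch, not tested against the exact app above) is to build the sliders inside renderUI and seed each one from its own current input value, read with isolate() so that rebuilding the UI does not create a reactive dependency on the sliders themselves; when a slider has not been created yet, fall back to the defaults:

server <- function(input, output, session) {
  output$ControlWidgetsofConditioningVariables <- renderUI({
    req(input$ConditioningVariables)
    lapply(input$ConditioningVariables, function(x) {
      # isolate() reads the slider's last value (if the widget already existed)
      # without making this renderUI re-run every time the slider moves
      current <- isolate(input[[x]])
      if (is.null(current)) current <- c(varmin[x], varinit[x])
      sliderInput(x, x, varmin[x], varmax[x], value = current, round = -2)
    })
  })
  newdata <- reactive({ iris })
  output$data <- renderDataTable({ newdata() })
}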

How to use pysam.view to emulate all functions of samtools view

I am trying to use pysam.view() to filter out certain alignments from a BAM file. The problem I am facing is how to include several regions in the filter.
pysam.view() emulates the samtools view command, which allows one to enter several regions separated by spaces, e.g.:
samtools view opts bamfile chr1:2010000-20200000 chr2:2010000-20200000
But the corresponding pysam.view call:
pysam.view(ops, bamfile, '1:2010000-20200000 2:2010000-20200000')
does not work. It does not return any alignments. I'm quite sure the problem lies in how to specify the list of regions, since the following command works fine:
pysam.view(ops, bamfile, '1:2010000-20200000')
and returns alignments.
My question is: does pysam.view support multiple regions and how does one specify this list? I have searched for documentation regarding this but not found anything.
The short answer to your question is that the format you'd use is
pysam.view(ops, bamfile, '1:2010000-20200000','2:2010000-20200000')
(Also note that the number indicating the end of each of your regions is ~10x larger than the beginning - it seems you might have intended 2010000-2020000 instead.)
I have tested it using the following code:
import pysam
my_bam_file = '/path/to/my/bam_file.bam'
alignments1 = pysam.view(my_bam_file, '1:2010000-4000000')
alignments2 = pysam.view(my_bam_file, '1:5000000-6000000')
alignments3 = pysam.view(my_bam_file, '1:2010000-4000000', '1:5000000-6000000')
print(len(alignments1) + len(alignments2) == len(alignments3))
[Output:] True
However, this way of extracting alignments is not very efficient, as the output you get is one large str, instead of individual alignments. To get a list of separate alignments instead, use the following code:
import pysam
my_bam_file = '/path/to/my/bam_file.bam'
imported = pysam.AlignmentFile(my_bam_file, mode = 'rb')
regions = ('1:2010000-20200000','2:2010000-20200000')
alignments = []
for region in regions:
    bam = imported.fetch(region = region, until_eof = True)
    alignments.extend([alignment for alignment in bam])
Each element of alignments then ends up being a pysam.AlignedSegment object, which you can work with further using the functions in the pysam API.
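As a small usage illustration, each pysam.AlignedSegment exposes the usual SAM fields as attributes:

# Inspect a few fields of the alignments collected above
for aln in alignments[:5]:
    print(aln.query_name, aln.reference_name, aln.reference_start, aln.mapping_quality)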

Handling paginated SQL query results

For my dissertation data collection, one of the sources is an externally managed system based on a Web form for submitting SQL queries. Using R and RCurl, I have implemented an automated data collection framework in which I simulate the above-mentioned form. Everything worked well while I was limiting the size of the resulting dataset, but when I tried to go over 100000 records (RQ_SIZE in the code below), the tandem "my code - their system" started becoming unresponsive ("hanging").
So, I decided to use the SQL pagination feature (LIMIT ... OFFSET ...) to submit a series of requests, hoping to then combine the paginated results into the target data frame. However, after changing my code accordingly, the only output I see is one pagination progress character (*) and then nothing more. I'd appreciate it if you could help me identify the probable cause of this unexpected behavior. I cannot provide a reproducible example, as it's very difficult to extract the functionality, not to mention the data, but I hope that the following code snippet is enough to reveal the issue (or at least a direction toward the problem).
# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
                DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData())  # retrieve result
data <- get(dataName)
numRequests <- as.numeric(data) %/% RQ_SIZE + 1

# Now, we can request & retrieve data via SQL pagination
for (i in 1:numRequests) {
  # setup SQL pagination
  if (rq$where == '') rq$where <- '1=1'
  rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))
  # Submit data request
  srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                  DATA_SEP, ADD_SQL)
  assign(dataName, srdaGetData())  # retrieve result
  data <- get(dataName)
  # some code
  # add current data frame to the list
  dfList <- c(dfList, data)
  if (DEBUG) message("*", appendLF = FALSE)
}
# merge all the result pages' data frames
data <- do.call("rbind", dfList)
# save current data frame to RDS file
saveRDS(data, rdataFile)
This probably falls into the category described here, where a high LIMIT ... OFFSET presumably slows MySQL down:
Why does MYSQL higher LIMIT offset slow the query down?
Overall, fetching large data sets over HTTP repeatedly is not very reliable.
Since this is for your dissertation, here is a helping hand:
## Folder where to save the results to disk.
## Ideally, use a new, empty folder. Easier then to load from disk
folder.out <- "~/mydissertation/sql_data_scrape/"

## Create the folder if it does not exist.
dir.create(folder.out, showWarnings = FALSE, recursive = TRUE)

## The larger this number, the more memory you will require.
## If you are renting a large box on, say, EC2, then you can make this 100, or so
NumberOfOffsetsBetweenSaves <- 10

## The limit size per request
RQ_SIZE <- 1000

# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
                DATA_SEP, ADD_SQL)

## Get the total number of rows
TotalRows <- as.numeric(srdaGetData())

TotalNumberOfRequests <- TotalRows %/% RQ_SIZE
TotalNumberOfGroups   <- TotalNumberOfRequests %/% NumberOfOffsetsBetweenSaves + 1

## FYI: Total number of rows being requested is
##   (NumberOfOffsetsBetweenSaves * RQ_SIZE * TotalNumberOfGroups)

for (g in seq(TotalNumberOfGroups)) {
  ret <-
    lapply(seq(NumberOfOffsetsBetweenSaves), function(i) {
      ## function(i) is the same code you have
      ## inside your for loop, but cleaned up.

      # setup SQL pagination; the offset advances across groups as well as within a group
      if (rq$where == '') rq$where <- '1=1'
      rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE,
                        'OFFSET', RQ_SIZE * ((g - 1) * NumberOfOffsetsBetweenSaves + (i - 1)))

      # Submit data request
      srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                      DATA_SEP, ADD_SQL)

      # retrieve result
      data <- srdaGetData()
      # some code

      if (DEBUG) message("*", appendLF = FALSE)

      ### DON'T ASSIGN TO dfList, JUST RETURN `data`
      # xxxxxx DON'T DO: xxxxx dfList <- c(dfList, data)
      ### INSTEAD:
      ## return
      data
    })

  ## save each iteration
  file.out <- sprintf("%s/data_scrape_%04i.RDS", folder.out, g)
  saveRDS(do.call(rbind, ret), file = file.out)

  ## OPTIONAL (this will be slower, but will keep your rams and goats in line)
  # rm(ret)
  # gc()
}
Then, once you are done scraping:
library(data.table)
folder.out <- "~/mydissertation/sql_data_scrape/"
files <- dir(folder.out, full = TRUE, pattern = "\\.RDS$")

## Create an empty list
myData <- vector("list", length = length(files))

## Option 1, using data.frame
for (i in seq(myData))
  myData[[i]] <- readRDS(files[[i]])
DT <- do.call(rbind, myData)

## Option 2, using data.table
for (i in seq(myData))
  myData[[i]] <- as.data.table(readRDS(files[[i]]))
DT <- rbindlist(myData)
I'm answering my own question, as I have finally figured out the real source of the problem. My investigation revealed that the program's unexpected waiting state was due to PostgreSQL getting confused by malformed SQL queries, which contained multiple LIMIT and OFFSET keywords.
The reason is pretty simple: I used rq$where both outside and inside the for loop, which made paste() concatenate the previous iteration's WHERE clause with the current one. I fixed the code by processing the contents of the WHERE clause and saving it before the loop, then safely using the saved value in each iteration of the loop, since it is now independent of what rq$where becomes inside the loop.
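In terms of the code from the question, the fix amounts to something like the following sketch (same srdaRequestData / srdaGetData calls as above):

# Save the base WHERE clause once, before the loop
baseWhere <- if (rq$where == '') '1=1' else rq$where

dfList <- list()
for (i in 1:numRequests) {
  # Build this page's WHERE clause from the saved base,
  # not from the previous iteration's value
  rq$where <- paste(baseWhere, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))
  srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                  DATA_SEP, ADD_SQL)
  dfList[[i]] <- srdaGetData()
  if (DEBUG) message("*", appendLF = FALSE)
}
data <- do.call(rbind, dfList)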
This investigation also helped me fix some other deficiencies in my code and make improvements (such as using sub-selects to properly handle SQL queries that return the number of records for queries with aggregate functions). The moral of the story: you can never be too careful in software development. A big thank you to the nice people who helped with this question.