Using Shiny to plot database values - sql

I have looked through many different posts on SQL and RODBC, but I cannot seem to figure this out. I have created a Shiny app in which a user enters a date range and a plot of the queried data should appear. However, this does not work. Hard-coding the dates in the SQL string WORKS, so I have pinpointed the problem to the dates not properly carrying over into the SQL string. I have tried paste0(), but read that sprintf() works better when substituting multiple values. BTW, I am querying a PI server. Here is my code:
Server
library(shiny)
library(RODBC)

shinyServer(function(input, output) {
  Connection <- odbcConnect(dsn = "PIWHI", believeNRows = FALSE, rows_at_time = 1)
  x <- reactive({input$range[1]})
  y <- reactive({input$range[2]})
  query <- sprintf("SELECT time, value
                    FROM picomp
                    WHERE tag = 'A80100'
                    AND time >= DATE('%s')
                    AND time <= DATE('%s')", x, y)
  LineX <- sqlQuery(WhitingPI, query)
  Gravity <- LineX$VALUE
  Time <- LineX$TIME
  output$den <- renderPlot({plot(Time, Gravity)})
})
UI
library(shiny)
library(reshape)

shinyUI(pageWithSidebar(
  # Application title
  headerPanel("Monitoring Tool"),
  sidebarPanel(
    dateRangeInput('range',
                   label = 'Date input: dd-mm-yyyy. Controls start and end of date range input in main panel.',
                   start = Sys.Date()-1, end = Sys.Date(),
                   format = "dd-M-yyyy", startview = 'year', language = 'en', weekstart = 1),
    sliderInput("sigmacoef", "Confidence Level", min = 0, max = 5, value = 2, step = 0.5),
    submitButton("Update View")
  ),
  mainPanel(
    plotOutput("den")
  )
))
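Note for anyone hitting the same issue: the likely culprit is that x and y are reactive expressions, so sprintf() formats the reactive objects themselves rather than their values. Reactives must be called like functions (x(), y()) and read inside a reactive context. A minimal sketch of the server with that change (keeping the original DSN, tag, and table names, which I cannot verify):

library(shiny)
library(RODBC)

shinyServer(function(input, output) {
  Connection <- odbcConnect(dsn = "PIWHI", believeNRows = FALSE, rows_at_time = 1)
  output$den <- renderPlot({
    # input$range is read inside a reactive context here, so the two
    # dates are substituted as values, not as reactive objects
    query <- sprintf("SELECT time, value
                      FROM picomp
                      WHERE tag = 'A80100'
                      AND time >= DATE('%s')
                      AND time <= DATE('%s')", input$range[1], input$range[2])
    LineX <- sqlQuery(Connection, query)
    plot(LineX$TIME, LineX$VALUE)
  })
})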

Related

Error while finding the number of topics for a Latent Dirichlet Allocation model using the ldatuning library

I get the error shown below. I can tell it happens because at least one document ends up with no terms, but I don't understand why, or how I can solve it.
prep_fun = function(x) {
  x %>%
    str_to_lower %>%                           # make text lower case
    str_replace_all("[^[:alpha:]]", " ") %>%   # remove non-alpha symbols (punctuation, #, etc.)
    str_replace_all("\\s+", " ") %>%           # collapse multiple spaces
    str_replace_all("\\W*\\b\\w\\b\\W*", " ")  # remove single letters
}

tok_fun <- function(x) {
  tokens <- word_tokenizer(x)
  textstem::lemmatize_words(tokens)
}

it_patentes <- itoken(data$Abstract,
                      preprocessor = prep_fun,
                      tokenizer = tok_fun,
                      ids = data$id,
                      progressbar = FALSE)

vocab <- create_vocabulary(it_patentes, ngram = c(ngram_min = 1L, ngram_max = 3L),
                           stopwords = tm::stopwords("english"))
pruned_vocab <- prune_vocabulary(vocab, term_count_min = max(vocab$term_count) * .01,
                                 doc_proportion_min = 0.001)
vectorizer <- vocab_vectorizer(pruned_vocab)
dtm <- create_dtm(it_patentes, vectorizer, type = "dgTMatrix", progressbar = FALSE)
> #Plot the metrics to get number of topics
> t1 <- Sys.time()
> tunes <- FindTopicsNumber(
+ dtm = dtm,
+ topics = c(2:25),
+ metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
+ method = "Gibbs",
+ control = list(seed = 17),
+ mc.cores = 4L,
+ verbose = TRUE
+ )
fit models...Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: Each row of the input matrix needs to contain at least one non-zero entry
> print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 9.155343 secs
> FindTopicsNumber_plot(tunes)
Error in base::subset(values, select = 2:ncol(values)) :
object 'tunes' not found
Even though I know ldatuning was made for topicmodels, I don't think it should make a huge difference just to get a starting number of topics to test, should it?
ldatuning expects the input dtm matrix in a different format (the format used by the topicmodels package). You need to convert your dtm (a sparse matrix from the Matrix package) into a format ldatuning can understand.
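A sketch of one way to do the conversion, assuming dtm is the dgTMatrix created above: build a slam::simple_triplet_matrix (the triplet format topicmodels works with) from its slots, and drop any empty documents, since those rows are what trigger the "at least one non-zero entry" error:

library(slam)

# a dgTMatrix stores 0-based triplet indices in @i/@j and values in @x
dtm_st <- slam::simple_triplet_matrix(
  i = dtm@i + 1L, j = dtm@j + 1L, v = dtm@x,
  nrow = nrow(dtm), ncol = ncol(dtm),
  dimnames = dimnames(dtm)
)

# drop documents that lost all their terms during pruning
dtm_st <- dtm_st[slam::row_sums(dtm_st) > 0, ]

tunes <- FindTopicsNumber(dtm = dtm_st, topics = 2:25,
                          metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
                          method = "Gibbs", control = list(seed = 17),
                          mc.cores = 4L, verbose = TRUE)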

How to extract stat_smooth curve maxima in a ggplot panel (facet_grid)?

I have created a plot with 18 panels using the facet_grid command and two different fitting equations (one for Jan-Apr, one for May-Jun). There are two things I need help with:
(This may sound obvious, but) I haven't been able to find working code on the internet that extracts the maximum of a curve fitted by stat_smooth. I'd appreciate it if someone could show and explain how to do it. This is the closest I could find, but I am not sure what it means:
gb <- ggplot_build(p1)  # pulls out the data ggplot computed for each layer
# diff(sign(diff(y))) == -2 flags points where the smoothed y stops rising
# and starts falling, i.e. local maxima; the +1 realigns the index to the peak
curve_max <- gb$data[[1]]$x[which(diff(sign(diff(gb$data[[1]]$y))) == -2) + 1]
How to add a vertical line to indicate max value on each curve?
Data file (rlc2 <- read_excel)
Plot
plot <- ggplot(rlc2, aes(par, etr, color = month, group = site)) +
  geom_point() +
  stat_smooth(data = subset(rlc2, rlc2$month != "May" & rlc2$month != "Jun"),
              method = "glm",
              formula = y ~ x + log(x),
              se = FALSE,
              method.args = list(family = gaussian(link = "log"), start = c(a = 0, b = 0, c = 0))) +
  stat_smooth(data = subset(rlc2, rlc2$month == "May" | rlc2$month == "Jun"),
              method = "nlsLM",
              formula = y ~ M * (1 - exp(-(a * x))),
              se = FALSE,
              method.args = list(start = c(M = 0, a = 10))) +
  facet_grid(rows = vars(month), cols = vars(site))
plot
Any other advice is also welcome. I wasn't educated as a programmer, so my code is probably a bit messy. Thank you for helping.
Try this:
First, fit the data and extract the maximum of the fit.
my.fit <- function(month, site, data) {
  fit <- glm(formula = etr ~ par + log(par),
             data = data,
             family = gaussian(link = "log"))
  # arrange the desired output in a tibble
  tibble(max = max(fit$fitted.values),
         site = site,
         month = month)
}

# Apply the custom function `my.fit` on each subset of the data
# according to month and site, using the group_by/nest/map method.
# The results are row-bound and returned in a data frame.
my.max <-
  rlc2 %>%
  dplyr::group_by(month, site) %>%
  tidyr::nest() %>%
  purrr::pmap_dfr(my.fit)
Next, join the results back onto your data and plot a geom_line:
rlc2 %>%
  dplyr::left_join(my.max) %>%
  ggplot(aes(x = par, y = etr)) +
  geom_point() +
  stat_smooth(data = subset(rlc2, rlc2$month != "May" & rlc2$month != "Jun"),
              method = "glm",
              formula = y ~ x + log(x),
              se = FALSE,
              method.args = list(family = gaussian(link = "log"), start = c(a = 0, b = 0, c = 0))) +
  stat_smooth(data = subset(rlc2, rlc2$month == "May" | rlc2$month == "Jun"),
              method = "nlsLM",
              formula = y ~ M * (1 - exp(-(a * x))),
              se = FALSE,
              method.args = list(start = c(M = 0, a = 10))) +
  geom_line(aes(y = max), col = "red") +
  facet_grid(rows = vars(month), cols = vars(site))
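For the second point (a vertical line at each curve's maximum), one option is a variant of my.fit that also records the par value at which the fitted curve peaks, then draws a geom_vline from it. A sketch, using the same glm fit; the column name par_at_max is my own:

my.fit <- function(month, site, data) {
  fit <- glm(formula = etr ~ par + log(par),
             data = data,
             family = gaussian(link = "log"))
  tibble(max = max(fit$fitted.values),
         # par value where the fitted curve peaks (hypothetical helper column)
         par_at_max = data$par[which.max(fit$fitted.values)],
         site = site,
         month = month)
}

# build my.max as above, then add to the plot:
# geom_vline(aes(xintercept = par_at_max), data = my.max,
#            col = "red", linetype = "dashed")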

SparkR. SQL. Count records satisfying criteria within rolling time window using timestamps

I have a dataset with a structure similar to the df you get from this:
dates <- base::seq.POSIXt(from = as.POSIXlt(as.Date("2018-01-01"), format = "%Y-%m-%d"),
                          to = as.POSIXlt(as.Date("2018-01-03"), format = "%Y-%m-%d"),
                          by = "hour")
possible_statuses <- c('moving', 'stopped')
statuses4demo <- base::sample(possible_statuses, size = 98, replace = TRUE, prob = c(.75, .25))
hours_back <- 5
hours_back_milliseconds <- hours_back * 3600 * 1000

# Generate dataframe
df <- data.frame(date = rep(dates, 2),
                 user_id = c(rep("user_1", 49), rep("user_2", 49)),
                 status = statuses4demo)
df$row_id <- seq(from = 1, to = nrow(df), by = 1)
df$eventTimestamp <- as.numeric(format(df$date, "%s")) * 1000
df$hours_back_timestamp <- df$eventTimestamp - hours_back_milliseconds
df$num_stops_within_past_5_hours <- 0
I would like to get a dataframe with a rolling count, for each row, of the number of observations with a status of "stopped" in the preceding five hours. To do this in R, I just wrote a couple of nested loops:
for(i in 1:length(unique(df$user_id))){
  the_user <- unique(df$user_id)[i]
  filtered_data <- df[which(df$user_id == the_user), ]
  for(j in 1:nrow(filtered_data)){
    the_row_id <- filtered_data$row_id[j]
    the_time <- filtered_data$eventTimestamp[j]
    the_past_time <- filtered_data$hours_back_timestamp[j]
    num_stops_in_past_interval <- base::nrow(filtered_data[filtered_data$eventTimestamp >= the_past_time &
                                                           filtered_data$eventTimestamp < the_time &
                                                           filtered_data$status == "stopped", ])
    df$num_stops_within_past_5_hours[which(df$row_id == the_row_id)] <- num_stops_in_past_interval
  }
}
View(df)
I am trying to do the same thing, either by using the built-in functions in SparkR or (I think more likely) an SQL statement. Does anyone know how I could reproduce the output of the df, but inside a Spark context? Any help is much appreciated. Thank you in advance. --Nate
Start with this data:
sdf<- SparkR::createDataFrame(df[, c("date", "eventTimestamp", "status", "user_id", "row_id")])
This solution works for the sample data as you have it set up, but isn't a more general solution for observations with any arbitrary timestamp.
ddf <- as.DataFrame(df)
ddf$count <- ifelse(ddf$status == "stopped", 1, 0)
# Create a windowSpec partitioning by user_id and ordered by date
ws <- orderBy(windowPartitionBy("user_id"), "date")
# Get the cumulative sum of the count variable by user id
ddf$count <- over(sum(ddf$count), ws)
# Get the lagged value of the cumulative sum from 5hrs ago
ddf$lag_count <- over(lag(ddf$count, offset = 5, default = 0), ws)
# The count of stops in the last 5hrs is the difference between the two
ddf$num_stops_within_past_5_hours <- ddf$count - ddf$lag_count
Edited to add a more general solution that can handle inconsistent time breaks
# Use a sampled version of the original df to create inconsistent time breaks
ddf <- as.DataFrame(df[base::sample(nrow(df), nrow(df) - 20), ])
ddf$count <- ifelse(ddf$status == "stopped", 1, 0)
to_join <- ddf %>%
  select("count", "eventTimestamp", "user_id") %>%
  rename(eventTimestamp_ = .$eventTimestamp, user_id_ = .$user_id)
ddf$count <- NULL
# join in each row where the event timestamp is within the interval
ddf_new <- join(ddf, to_join,
                ddf$hours_back_timestamp <= to_join$eventTimestamp_ &
                ddf$eventTimestamp >= to_join$eventTimestamp_ &
                ddf$user_id == to_join$user_id_,
                joinType = "left")
ddf_new <- ddf_new %>%
  groupBy('date',
          'eventTimestamp',
          'user_id',
          'status',
          'row_id',
          'hours_back_timestamp') %>%
  agg(num_stops_within_past_5_hours = sum(ddf_new$count))
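Since the question also mentions an SQL statement, here is a sketch of the same rolling count as a window query, assuming the DataFrame is registered as a temp view named events (my name). eventTimestamp is in milliseconds, so 5 hours is 18000000 ms, and the 1 PRECEDING upper bound reproduces the loop's strict "< the_time":

createOrReplaceTempView(ddf, "events")
result <- sql("
  SELECT *,
         COUNT(CASE WHEN status = 'stopped' THEN 1 END) OVER (
           PARTITION BY user_id
           ORDER BY eventTimestamp
           RANGE BETWEEN 18000000 PRECEDING AND 1 PRECEDING
         ) AS num_stops_within_past_5_hours
  FROM events")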

shiny sliderInput range minimum and maximum values

I need to adjust a histogram output in shiny with a range of values:
(where m is an arbitrary matrix)
#ui.R
sliderInput(inputId = "adjust", label = "Choose adjacency threshold",
            value = c(0.001, 0.9), min = 0.0001, max = 1),
plotOutput("hist")

#server.R
df <- reactive({
  idx = m > min(input$adjust) & m < max(input$adjust)
  data.frame(
    id = row(m)[idx],
    value = m[idx])
})
output$hist <- renderPlot({hist(df()$values)})
However, this doesn't seem to affect the histogram: it re-renders when I toggle the slider, but it's the same each time. It takes a very long time and just seems to take all of the values into account.
Does anyone know how to make this work?
When I try to print the slider's min/max value, nothing appears on the page:
#ui.R
verbatimTextOutput("x")
#server
output$x<-renderPrint({min(input$adjust)})
Hence I might be approaching this in completely the wrong way... does anyone know how to do this?
FULL EXAMPLE
library(shiny)
runApp(list(
  ui = fluidPage(
    sliderInput(inputId = "adjust", label = "Choose adjacency threshold",
                value = c(0.001, 0.9), min = 0.0001, max = 1),
    plotOutput("hist")
  ),
  server = function(input, output){
    adjacentmat <- reactive({adjacency(dat)})
    data <- reactive({
      adj_mat <- adjacentmat()
      adj_mat[adj_mat < input$adjust] <- 0
      m <- adj_mat
      idx = m > min(input$adjust) & m < max(input$adjust)
      data.frame(
        source = row(m)[idx],
        target = col(m)[idx],
        corr = m[idx])
    })
    output$hist <- renderPlot({hist(data()$corr)})
  }
))
generate the dat variable with the following code:
library('dplyr')
set.seed(1)
# generate a couple clusters
nodes_per_cluster <- 30
n <- 10
nvals <- nodes_per_cluster * n
# cluster 1 (increasing)
cluster1 <- matrix(rep((1:n)/4, nodes_per_cluster) +
rnorm(nvals, sd=1),
nrow=nodes_per_cluster, byrow=TRUE)
# cluster 2 (decreasing)
cluster2 <- matrix(rep((n:1)/4, nodes_per_cluster) +
rnorm(nvals, sd=1),
nrow=nodes_per_cluster, byrow=TRUE)
# noise cluster
noise <- matrix(sample(1:2, nvals, replace=TRUE) +
rnorm(nvals, sd=1.5),
nrow=nodes_per_cluster, byrow=TRUE)
dat <- rbind(cluster1, cluster2, noise)
colnames(dat) <- paste0('n', 1:n)
rownames(dat) <- c(paste0('cluster1_', 1:nodes_per_cluster),
paste0('cluster2_', 1:nodes_per_cluster),
paste0('noise_', 1:nodes_per_cluster))
This works for me:
library(shiny)
runApp(list(ui = fluidPage(
mainPanel(sliderInput("test", "Select values", value= c(.001,.9), min= 0.0001, max= 1)),
verbatimTextOutput("test2")
),
server = function(input, output, session) {
output$test2 <- renderPrint(min(input$test))
}))
I'm guessing your problem is somewhere in the code you haven't shown us. Can you give the code for the entire running example of your problem?
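For reference, a minimal self-contained sketch of the range-filtered histogram itself, with a random matrix as a stand-in for your adjacency matrix. Note that the data frame column in your first snippet is named value, so hist(df()$values) with an "s" would not plot it:

library(shiny)

set.seed(1)
m <- matrix(runif(400), nrow = 20)  # stand-in for the adjacency matrix

runApp(list(
  ui = fluidPage(
    sliderInput(inputId = "adjust", label = "Choose adjacency threshold",
                value = c(0.001, 0.9), min = 0.0001, max = 1),
    plotOutput("hist")
  ),
  server = function(input, output) {
    df <- reactive({
      # keep only entries strictly inside the selected range
      idx <- m > input$adjust[1] & m < input$adjust[2]
      data.frame(id = row(m)[idx], value = m[idx])
    })
    # the column is `value`, so plot df()$value (not df()$values)
    output$hist <- renderPlot({hist(df()$value)})
  }
))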

dplyr distinct column values sql

I use dplyr in conjunction with a PostgreSQL database, which makes a reproducible example a bit hard for me. Anyway, I want to use the distinct function to filter out messy data, i.e. duplicate timestamps. So far I have:
db <- src_postgres(dbname = "a", host = "b", port = 1234,
                   user = "c")
measurements <- tbl(db, "measurement")

sites <- group_by(measurements, site)
sites_clean <- filter(sites, site < 38)
sites_clean <- distinct(sites_clean, timestamp)

P_stats <- summarise(
  sites_clean,
  count = n(),
  P = mean(p_sum)
)
collect(P_stats)
I get the error:
Error: Can't calculate distinct only on specified columns with SQL
Is there a workaround for this? Will dplyr support this in a future version?
Update
I followed the documentation and created a minimal working example using a SQLite database (also thanks beginneR for the %>% reminder):
library(dplyr)
set.seed(1)
my_db <- src_sqlite("my_db.sqlite3", create = T)

meas <- data.frame(id = 1:30,
                   timestamp = sample(Sys.time() + c(1, 2, 3) * 3600,
                                      size = 30, replace = TRUE),
                   site = sample(c(1, 2, 40), size = 30, replace = TRUE),
                   p_sum = rpois(30, 2))

meas_sqlite <- copy_to(my_db, meas, temporary = FALSE)
meas_tbl <- tbl(my_db, "meas")

P_stats <- group_by(meas_tbl, site, timestamp) %>%
  summarise(P = mean(p_sum)) %>%
  group_by(site) %>%
  filter(site < 38) %>%
  summarise(count = n(), P = mean(P))

collect(P_stats)
While this works, I feel it is not as clean as it could be. I also need to try the suggestion with row_number(timestamp) == 1 on the PostgreSQL instance.
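For reference, a sketch of that row_number() suggestion, assuming a backend with window functions (PostgreSQL has them; dplyr's SQLite backend at the time did not):

P_stats <- measurements %>%
  filter(site < 38) %>%
  group_by(site, timestamp) %>%
  filter(row_number(timestamp) == 1) %>%  # keep one row per site/timestamp pair
  group_by(site) %>%
  summarise(count = n(), P = mean(p_sum))
collect(P_stats)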