Downloading PDFs from the internet using R

I am having trouble getting this code to work. I am trying to download documents from the FAO website at the URL below. Can someone please help me? I use macOS and my Chrome version is 106.0.5249.103 (Official Build) (x86_64).
library(rvest)
library(httr2)
library(RSelenium)
library(stringr)

url <- "https://www.fao.org/faolex/country-profiles/general-profile/en/?iso3=NAM"
base_url <- "https://www.fao.org"

ids <- read_html(url) %>%
  html_elements(".doclink > a") %>%
  html_attr("href") %>%
  paste0(base_url, .)
grab_link <- function(page_url, s_ctl) {
  # load the target url
  s_ctl$navigate(page_url)
  # wait for the page load to complete
  Sys.sleep(4)
  # getPageSource returns a list with the html as the first element
  page <- s_ctl$getPageSource()[[1]]
  # using rvest, pull the document link out of the "url" query parameter
  read_html(page) %>%
    html_elements(".item-title > a") %>%
    html_attr("href") %>%
    url_parse() %>%
    purrr::pluck("query", "url")
}
selenium_driver <- rsDriver(
  browser = "chrome",
  chromever = "106.0.5249.61",
  port = 4444L, # or another free port, e.g. 4545L
  verbose = FALSE,
  check = FALSE
)

# control the client browser
ctl_browser <- selenium_driver[["client"]]

links <- purrr::map_chr(ids, grab_link, ctl_browser)

# stop the Selenium server and quit the browser
ctl_browser$close()
selenium_driver[["server"]]$stop()

Related

Setting overwrite == TRUE using memdb and dbplyr

The following shiny app works the first time you run it, but then errors if you change the species input because the table name already exists in memory. I was wondering how to set overwrite == TRUE given the code below?
library(shiny)
library(tidyverse)
library(dbplyr)

ui <- fluidPage(
  selectInput("species", "Species", choices = unique(iris$Species),
              selected = "setosa"),
  tableOutput("SQL_table"),
  actionButton("code", "View SQL")
)
server <- function(input, output) {
  # render table
  output$SQL_table <- renderTable(
    head(iris %>% filter(Species == input[["species"]]))
  )
  # generate query
  SQLquery <- reactive({
    sql_render(
      show_query(
        tbl_memdb(iris) %>%
          filter(Species == local(input$species))
      )
    )
  })
  # see query
  observeEvent(input$code, {
    showModal(
      modalDialog(
        SQLquery()
      )
    )
  })
}

shinyApp(ui = ui, server = server)
Since memdb_frame() is just a wrapper around copy_to(), we can call copy_to() directly and set overwrite = TRUE:
copy_to(src_memdb(), iris, name = 'iris', overwrite=TRUE)
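In the app itself, one way to apply this (a sketch, assuming the copy should simply be refreshed on each query) is to replace the tbl_memdb(iris) call inside the reactive with an explicit copy_to():

# hedged sketch: inside server(), an alternative to tbl_memdb(iris)
SQLquery <- reactive({
  sql_render(
    show_query(
      copy_to(src_memdb(), iris, name = "iris", overwrite = TRUE) %>%
        filter(Species == local(input$species))
    )
  )
})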

How to update the .RDS file with the files a user uploads and store it for next time in Shiny (server)?

Basically, there are 50 files of about 50 MB each, so I processed them into a 30k-row .RDS file that is used to render some visualizations.
However, the user needs the app to accept uploads of recent files and keep the visualization updated.
Is it possible to update the .RDS file with the files the user uploads?
Can the user access this updated .RDS file next time (in a new session)?
In the following example, there is an upload button and the uploaded file is simply rendered.
Can we store the uploaded files somehow, so that we can use them to update the .RDS file?
Relevant links:
R Shiny Save to Server
https://shiny.rstudio.com/articles/persistent-data-storage.html
library(shiny)

# Define UI for data upload app ----
ui <- fluidPage(
  # App title ----
  titlePanel("Uploading Files"),
  # Sidebar layout with input and output definitions ----
  sidebarLayout(
    # Sidebar panel for inputs ----
    sidebarPanel(
      # Input: Select a file ----
      fileInput("files", "Upload", multiple = TRUE, accept = c(".csv")),
      # In order to update the database if the user clicks on the action button
      actionButton("update_database", "update Database")
    ),
    # Main panel for displaying outputs ----
    mainPanel(
      # Output: Data file ----
      dataTableOutput("tbl_out")
    )
  )
)
# Define server logic to read selected file ----
server <- function(input, output, session) {
  lst1 <- reactive({
    validate(need(input$files != "", "select files..."))
    ##### Here, the .RDS file can be accessed and updated for visualization
    if (is.null(input$files)) {
      return(NULL)
    } else {
      path_list <- as.list(input$files$datapath)
      tbl_list <- lapply(input$files$datapath, read.table, header = TRUE, sep = ";")
      df <- do.call(rbind, tbl_list)
      return(df)
    }
  })
  output$tbl_out <- renderDataTable({
    ##### Can also access this data for visualizations
    lst1()
  })
  session$onSessionEnded({
    observeEvent(input$update_database, {
      s3save(appended_data, object = "object_path_of_currentRDS_file", bucket = "s3_bucket")
    })
  })
}

# Create Shiny app ----
shinyApp(ui, server)
But there was an error:
Warning: Error in private$closedCallbacks$register: callback must be a function
50: stop
49: private$closedCallbacks$register
48: session$onSessionEnded
47: server [/my_folder_file_path.R#157]
Error in private$closedCallbacks$register(sessionEndedCallback) :
callback must be a function
Solution: replace this code

session$onSessionEnded({
  observeEvent(input$update_database, {
    s3save(appended_data, object = "object_path_of_currentRDS_file", bucket = "s3_bucket")
  })
})

## with this

observeEvent(input$update_database, {
  s3saveRDS(appended_data, object = "object_path_of_currentRDS_file", bucket = "s3_bucket")
})
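To make the updated data survive into the next session (the second part of the question), one option is to read the persisted .RDS at the start of the observer, append the new uploads, and save it back. A minimal local-disk sketch (the path data/processed.rds is a hypothetical writable location; the S3 calls above would work the same way):

# hedged sketch: persist uploads across sessions in a local .RDS file
rds_path <- "data/processed.rds" # hypothetical writable path on the server

observeEvent(input$update_database, {
  # load whatever previous sessions saved, if anything
  old <- if (file.exists(rds_path)) readRDS(rds_path) else NULL
  # append the freshly uploaded data from the reactive above
  appended_data <- rbind(old, lst1())
  saveRDS(appended_data, rds_path)
})

Note that on hosts like shinyapps.io the local disk is ephemeral, which is why the persistent-data-storage article linked above recommends remote stores such as S3.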

shinyapps.io does not work when my shiny app uses RODBC to connect to a SQL database

On my local computer, I use shiny to build a web page that shows the analysis results. The data is extracted from the company's SQL database using RODBC to connect the database to R. The code looks like this:
library(shiny)
library(shinydashboard)
library(DT)
library(RODBC)
library(stringr)
library(dplyr)

DNS <- '***'
uid <- '***'
pwd <- '***'

convertMenuItem <- function(mi, tabName) {
  mi$children[[1]]$attribs['data-toggle'] = "tab"
  mi$children[[1]]$attribs['data-value'] = tabName
  mi
}

sidebar <- dashboardSidebar(
  sidebarMenu(
    convertMenuItem(menuItem("Query1", tabName = "Query1", icon = icon("table"),
                             dateRangeInput('Date1', 'Date Range', start = Sys.Date()-1, end = Sys.Date()-1,
                                            separator = " - ", format = "dd/mm/yy"),
                             textInput('Office1', 'Office ID', '1980'),
                             submitButton("Submit")), tabName = "Query1"),
    convertMenuItem(menuItem("Query2", tabName = "Query2", icon = icon("table"),
                             dateRangeInput('Date2', 'Date Range', start = Sys.Date()-1, end = Sys.Date()-1,
                                            separator = " - ", format = "dd/mm/yy"),
                             textInput('Office2', 'Office ID', '1980'),
                             submitButton("Submit")), tabName = "Query2")
  )
)
body <- dashboardBody(
  tabItems(
    tabItem(tabName = "Query1",
            helpText('********************************'),
            fluidRow(
              column(12, DT::dataTableOutput('table1'))
            )
    ),
    tabItem(tabName = "Query2", h2("Widgets tab content"))
  )
)

dashboardheader <- dashboardHeader(
  title = 'LOSS PREVENTION'
)

ui <- dashboardPage(
  skin = 'purple',
  dashboardheader,
  sidebar,
  body
)
server <- function(input, output) {
  output$table1 <- DT::renderDataTable({
    ch <- odbcConnect(DNS, uid = uid, pwd = pwd)
    a <- sqlQuery(ch, paste(' ***'))
    odbcClose(ch)
    DT::datatable(a, options = list(scrollX = TRUE))
  })
}

shinyApp(ui, server)
Then I deployed the program to my shinyapps.io account using rsconnect, and the deployment was successful.
But when I open https://myAccount.shinyapps.io/myshiny/, I get the following error:
2018-05-10T00:57:38.473259+00:00 shinyapps[340325]: Warning in RODBC::odbcDriverConnect("DSN=****;UID=****;PWD=****") :
2018-05-10T00:57:38.473262+00:00 shinyapps[340325]: [RODBC] ERROR: state IM002, code 0, message [unixODBC][Driver Manager]Data source name not found, and no default driver specified
But if no RODBC and SQL database are involved in my code, the app works fine.
So the problem is that shinyapps.io cannot access my company's SQL database. How can I deal with this?
The app works on your computer because the Data Source Name (DSN) has been configured there; it is not configured on shinyapps.io. According to this help article you can use, for example:
odbcDriverConnect('Driver=FreeTDS;TDS_Version=7.0;Server=<server>;Port=<port>;Database=<db>;Uid=<uid>;Pwd=<pw>;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;')
A more complete treatment can be found in the documentation.
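Applied to the server function above, this would mean replacing the DSN lookup with the full connection string (a sketch; the driver name, server, and port are placeholders you fill in with your company's values):

# hedged sketch: DSN-less connection inside the server function
server <- function(input, output) {
  output$table1 <- DT::renderDataTable({
    ch <- odbcDriverConnect(paste0(
      'Driver=FreeTDS;TDS_Version=7.0;',
      'Server=<server>;Port=<port>;Database=<db>;',
      'Uid=<uid>;Pwd=<pw>;Encrypt=yes;TrustServerCertificate=no;',
      'Connection Timeout=30;'
    ))
    a <- sqlQuery(ch, paste(' ***'))
    odbcClose(ch)
    DT::datatable(a, options = list(scrollX = TRUE))
  })
}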

Changing download directory in RSelenium using phantomjs

I'd like to download a WebGL graphic made with plotly in R. Plotly supports this via export(), which accepts an (R)Selenium argument. Everything works as expected when using Chrome as the browser (see example 1 below).
library(plotly)
library(magrittr)
library(RSelenium)

p <- plot_ly(z = ~volcano) %>% add_surface()

### Example 1: Chrome (working as expected) ###
eCaps <- list(
  chromeOptions =
    list(prefs = list(
      "profile.default_content_settings.popups" = 0L,
      "download.prompt_for_download" = FALSE,
      "download.default_directory" = getwd()
    ))
)

chrome <- rsDriver(port = 4590L, browser = "chrome",
                   geckover = NULL, iedrver = NULL, phantomver = NULL,
                   extraCapabilities = eCaps)

export(p, file = "test1.svg", selenium = chrome)

chrome[["server"]]$stop()
But since I'm interested in a "silent" download in the background (without opening a browser window), I tried the same with PhantomJS. This works as well, but downloads the plot to my default download directory (see example 2 below). During my investigation I came across several old posts saying that downloading files with PhantomJS isn't possible. But since it does seem possible (just to the default directory), I thought there might now be a way to set the download directory as well.
library(plotly)
library(magrittr)
library(RSelenium)

p <- plot_ly(z = ~volcano) %>% add_surface()

### Example 2: PhantomJS (download path not settable) ###
phjs <- rsDriver(port = 4595L, browser = "phantomjs",
                 chromever = NULL, geckover = NULL, iedrver = NULL)

export(p, file = "test2.svg", selenium = phjs)

phjs[["server"]]$stop()
In short: how do I change the download folder in PhantomJS (when using it with (R)Selenium)?

Calculating the load time of page elements using RCurl? (R)

I started playing with the idea of testing a webpage's load time using R. I devised a small R function to do so:
page.load.time <- function(theURL, N = 10, wait_time = 0.05)
{
  require(RCurl)
  require(XML)
  TIME <- numeric(N)
  for(i in seq_len(N))
  {
    Sys.sleep(wait_time)
    TIME[i] <- system.time(webpage <- getURL(theURL, header = FALSE,
                                             verbose = TRUE))[3]
  }
  return(TIME)
}
I would welcome your help in several ways:

Is it possible to do the same, but also to know which parts of the page took how long to load? (something like Yahoo's YSlow)

I sometimes run into the following error:

Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  Failure when receiving data from the peer
Timing stopped at: 0.03 0 43.72

Any suggestions on what is causing this, and how to catch such errors and discard them? (see the tryCatch sketch below)

Can you think of ways to improve the above function?
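A hedged sketch of the error-handling part, wrapping the timed request in tryCatch so a failed request yields NA instead of aborting the loop (the structure of page.load.time is kept; only the loop body changes):

# sketch: catch curl errors and record NA for that iteration
page.load.time.safe <- function(theURL, N = 10, wait_time = 0.05)
{
  require(RCurl)
  TIME <- numeric(N)
  for (i in seq_len(N))
  {
    Sys.sleep(wait_time)
    TIME[i] <- tryCatch(
      system.time(getURL(theURL, header = FALSE))[3],
      error = function(e) NA_real_ # e.g. "Failure when receiving data from the peer"
    )
  }
  TIME[!is.na(TIME)] # discard failed attempts
}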
Update: I redid the function. It is now painfully slow...
one.page.load.time <- function(theURL, HTML = T, JavaScript = T, Images = T, CSS = T)
{
  require(RCurl)
  require(XML)
  require(stringr)
  TIME <- NULL
  if(HTML) TIME["HTML"] <- system.time(doc <- htmlParse(theURL))[3]
  if(JavaScript) {
    theJS <- xpathSApply(doc, "//script/@src") # find all JavaScript files
    TIME["JavaScript"] <- system.time(getBinaryURL(theJS))[3]
  } else { TIME["JavaScript"] <- NA }
  if(Images) {
    theIMG <- xpathSApply(doc, "//img/@src") # find all image files
    TIME["Images"] <- system.time(getBinaryURL(theIMG))[3]
  } else { TIME["Images"] <- NA }
  if(CSS) {
    theCSS <- xpathSApply(doc, "//link/@href") # find all "link" types
    ss_CSS <- str_detect(tolower(theCSS), ".css") # keep only the CSS files
    theCSS <- theCSS[ss_CSS]
    TIME["CSS"] <- system.time(getBinaryURL(theCSS))[3]
  } else { TIME["CSS"] <- NA }
  return(TIME)
}
page.load.time <- function(theURL, N = 3, wait_time = 0.05, ...)
{
  require(RCurl)
  require(XML)
  TIME <- vector(length = N, "list")
  for(i in seq_len(N))
  {
    Sys.sleep(wait_time)
    TIME[[i]] <- one.page.load.time(theURL, ...)
  }
  require(plyr)
  TIME <- data.frame(URL = theURL, ldply(TIME, function(x) {x}))
  return(TIME)
}

a <- page.load.time("http://www.r-bloggers.com/", 2)
a
Your getURL call will only make one request and fetch the source HTML of the web page. It won't get the CSS, JavaScript, or other elements. If this is what you mean by 'parts' of the web page, then you'll have to scrape the source HTML for those parts (script tags, CSS references, etc.) and getURL them separately with timing.
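A minimal sketch of that per-resource timing (assuming the resource URLs are absolute; relative URLs would need to be resolved against the page URL first):

# sketch: time each referenced resource individually
require(RCurl)
require(XML)

doc  <- htmlParse(getURL("http://www.r-bloggers.com/"))
srcs <- xpathSApply(doc, "//script/@src") # JavaScript files, as an example

# one timing per resource; failures are recorded as NA
times <- sapply(srcs, function(u) {
  tryCatch(system.time(getBinaryURL(u))[3], error = function(e) NA_real_)
})
times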
Perhaps SpiderMonkey from Omegahat could work: http://www.omegahat.org/SpiderMonkey/