R dynamic SQL query using RODBC and export to .csv

If I run the following code in RStudio it works, but only because I added Sys.sleep. I have a large batch of queries to run and I don't know how long each will take. If I exclude the Sys.sleep call, the exported files are blank because the export runs before the query is complete. Is there a way to make R wait until the query has finished?
# setup
# install.packages("stringr", dependencies = TRUE)
require(stringr)
library(RODBC)

# ODBC connection
db <- odbcDriverConnect("dsn=DW Master;uid=username;pwd=password;")

# SQL to be run
qstr <- "select top 10 * from prod"

# variable
weeknum <- c('201401', '201402', '201403')

for (i in weeknum) {
  data <- sqlQuery(db, qstr, believeNRows = FALSE)
  Sys.sleep(10)
  filename <- paste("data_", str_trim(i), ".csv")
  filename
  write.csv(data, file = filename)
}

As suggested in this SO post, try adding the rows_at_time argument:
data <- sqlQuery(db, qstr, believeNRows = FALSE, rows_at_time = 1)
Alternatively you can break up the two processes:
# QUERIES TO DATA FRAMES
weeknum <- c('201401', '201402', '201403')
for (i in weeknum) {
  data <- sqlQuery(db, qstr, believeNRows = FALSE, rows_at_time = 1)
  assign(paste("data", i, sep = ""), data)
}

# DATA FRAMES TO CSV FILES
dfList <- c('data201401', 'data201402', 'data201403')
for (n in dfList) {
  df <- get(n)
  filename <- paste(n, ".csv", sep = "")
  write.csv(df, file = filename)
}
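The question's title mentions a dynamic query, but qstr never changes with the week. If the intent is one query and one file per week, a minimal sketch along these lines may help; note that the weeknum column in prod is an assumption, not something from the original post:
for (i in weeknum) {
  # build a week-specific query; the column name "weeknum" is hypothetical
  qstr <- sprintf("select top 10 * from prod where weeknum = '%s'", i)
  data <- sqlQuery(db, qstr, believeNRows = FALSE, rows_at_time = 1)
  # sep = "" avoids the spaces the default paste() separator would put in the file name
  write.csv(data, file = paste("data_", str_trim(i), ".csv", sep = ""), row.names = FALSE)
}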

Related

Setting overwrite == TRUE using memdb and dbplyr

The following Shiny app works the first time you run it, but then errors if you change the species input because the table name already exists in memory. I was wondering how to set overwrite = TRUE given the code below?
library(shiny)
library(tidyverse)
library(dbplyr)
ui <- fluidPage(
  selectInput("species", "Species", choices = unique(iris$Species),
              selected = "setosa"),
  tableOutput("SQL_table"),
  actionButton("code", "View SQL")
)

server <- function(input, output) {
  # render table
  output$SQL_table <- renderTable(
    head(iris %>% filter(Species == input[["species"]]))
  )
  # generate query
  SQLquery <- reactive({
    sql_render(
      show_query(
        tbl_memdb(iris) %>%
          filter(Species == local(input$species))
      )
    )
  })
  # see query
  observeEvent(input$code, {
    showModal(
      modalDialog(
        SQLquery()
      )
    )
  })
}

shinyApp(ui = ui, server = server)
Since memdb_frame() is just a wrapper around copy_to(), you can call copy_to() directly and set overwrite = TRUE:
copy_to(src_memdb(), iris, name = 'iris', overwrite=TRUE)
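A minimal sketch of how that can slot into the app above, assuming you copy the table once when the app starts and reuse the returned tbl inside the reactive, so changing the species input no longer tries to re-create it:
# copy once, overwriting any existing in-memory copy of the table
iris_db <- copy_to(src_memdb(), iris, name = "iris", overwrite = TRUE)

# then reuse the returned tbl inside the reactive instead of calling tbl_memdb(iris)
SQLquery <- reactive({
  sql_render(
    show_query(
      iris_db %>% filter(Species == local(input$species))
    )
  )
})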

Error: Input array is longer than number of columns in this table (PowerShell)

I am trying to load a 160 GB CSV file into SQL Server using a PowerShell script I got from GitHub, and I get this error:
Exception calling "Add" with "1" argument(s): "Input array is longer than the number of columns in this table."
At C:\b.ps1:54 char:26
+ [void]$datatable.Rows.Add <<<< ($line.Split($delimiter))
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
+ FullyQualifiedErrorId : DotNetMethodException
So I checked the same code with a small 3-line CSV where all of the columns match, the first row is a header, and there are no extra delimiters, so I'm not sure why I am getting this error.
The code is below
<# 8-faster-runspaces.ps1 #>
# Set CSV attributes
$csv = "M:\d\s.txt"
$delimiter = "`t"
# Set connstring
$connstring = "Data Source=.;Integrated Security=true;Initial Catalog=PresentationOptimized;PACKET SIZE=32767;"
# Set batchsize to 2000
$batchsize = 2000
# Create the datatable
$datatable = New-Object System.Data.DataTable
# Add generic columns
$columns = (Get-Content $csv -First 1).Split($delimiter)
foreach ($column in $columns) {
[void]$datatable.Columns.Add()
}
# Setup runspace pool and the scriptblock that runs inside each runspace
$pool = [RunspaceFactory]::CreateRunspacePool(1,5)
$pool.ApartmentState = "MTA"
$pool.Open()
$runspaces = @()
# Setup scriptblock. This is the workhorse. Think of it as a function.
$scriptblock = {
Param (
[string]$connstring,
[object]$dtbatch,
[int]$batchsize
)
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connstring,"TableLock")
$bulkcopy.DestinationTableName = "abc"
$bulkcopy.BatchSize = $batchsize
$bulkcopy.WriteToServer($dtbatch)
$bulkcopy.Close()
$dtbatch.Clear()
$bulkcopy.Dispose()
$dtbatch.Dispose()
}
# Start timer
$time = [System.Diagnostics.Stopwatch]::StartNew()
# Open the text file from disk and process.
$reader = New-Object System.IO.StreamReader($csv)
Write-Output "Starting insert.."
while ((($line = $reader.ReadLine()) -ne $null))
{
[void]$datatable.Rows.Add($line.Split($delimiter))
if ($datatable.rows.count % $batchsize -eq 0)
{
$runspace = [PowerShell]::Create()
[void]$runspace.AddScript($scriptblock)
[void]$runspace.AddArgument($connstring)
[void]$runspace.AddArgument($datatable) # <-- Send datatable
[void]$runspace.AddArgument($batchsize)
$runspace.RunspacePool = $pool
$runspaces += [PSCustomObject]@{ Pipe = $runspace; Status = $runspace.BeginInvoke() }
# Overwrite object with a shell of itself
$datatable = $datatable.Clone() # <-- Create new datatable object
}
}
# Close the file
$reader.Close()
# Wait for runspaces to complete
while ($runspaces.Status.IsCompleted -notcontains $true) {}
# End timer
$secs = $time.Elapsed.TotalSeconds
# Cleanup runspaces
foreach ($runspace in $runspaces ) {
[void]$runspace.Pipe.EndInvoke($runspace.Status) # EndInvoke method retrieves the results of the asynchronous call
$runspace.Pipe.Dispose()
}
# Cleanup runspace pool
$pool.Close()
$pool.Dispose()
# Cleanup SQL Connections
[System.Data.SqlClient.SqlConnection]::ClearAllPools()
# Done! Format output then display
$totalrows = 1000000
$rs = "{0:N0}" -f [int]($totalrows / $secs)
$rm = "{0:N0}" -f [int]($totalrows / $secs * 60)
$mill = "{0:N0}" -f $totalrows
Write-Output "$mill rows imported in $([math]::round($secs,2)) seconds ($rs rows/sec and $rm rows/min)"
Working with a 160 GB input file is going to be a pain. You can't really load it into any kind of editor, and you can't analyze that much data without some serious automation.
As per the comments, it seems the data has some quality issues. To find the offending data, you could try a binary search; it shrinks the search space quickly. Like so,
1) Split the file into two roughly equal chunks.
2) Try to load the first chunk.
3) If that succeeds, process the second chunk. If not, see 6).
4) Try to load the second chunk.
5) If that succeeds, the files are valid but you have some other data quality issue; start looking into other causes. If not, see 6).
6) If either load failed, start from the beginning, using the failed chunk as the input file.
7) Repeat until you have narrowed down the offending row(s).
Another method would be to use an ETL tool like SSIS. Configure the package to redirect invalid rows into an error log so you can see which data is not working properly.
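If R is at hand (it is used in the other questions on this page), here is a minimal sketch of a complementary diagnostic: stream the file in chunks and report rows whose field count differs from the header's, which is what this error usually means. The path and delimiter come from the question; everything else is an assumption.
# scan a large delimited file and flag rows whose field count differs from the header
path      <- "M:\\d\\s.txt"   # from the question
delimiter <- "\t"
con <- file(path, open = "r")
header_fields <- length(strsplit(readLines(con, n = 1), delimiter, fixed = TRUE)[[1]])
line_no <- 1
repeat {
  chunk <- readLines(con, n = 100000)          # read in manageable chunks
  if (length(chunk) == 0) break
  n_fields <- lengths(strsplit(chunk, delimiter, fixed = TRUE))
  # note: strsplit() drops trailing empty fields, so rows that merely end in empty columns may also be flagged
  bad <- which(n_fields != header_fields)
  if (length(bad) > 0)
    cat("Suspect line(s):", line_no + bad, "fields:", n_fields[bad], "\n")
  line_no <- line_no + length(chunk)
}
close(con)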

How to read multiple .xls files in one go in r

I tried the below code multiple times, but nothing happens when I run it. I think fread does not read the .xls format, so I tried two other approaches, one with the rio package and another with the openxlsx package. Sorry, I am new to this. There are 38 files, each with a name like "Cust+Txn+Details+Customer (36).xls". Thank you.
## First put all file names into a list
library(data.table)
files <- list.files(path = "F:\\MUMuniv\\machine learning class\\price sensitivty\\PS works\\Customer files",
                    pattern = ".xls", full.names = TRUE)
readdata <- function(fn){
  dt_temp <- fread(fn)
  return(dt_temp)
}
# then using
all.files <- lapply(files, readdata)
final.data <- rbindlist(all.files)
The error I got: "Error in fread(fn) : mmap'd region has EOF at the end"
#Example 2
#rio package
require("rio")
xls <- dir(path = ".", all.files = T)
created <- mapply(convert, xls, gsub(".xlsx", ".csv", "xls"))
unlink(xls)
Error in get_ext(file) : 'file' has no extension
#example 3
# using openxlsx package
require("openxlsx")
# Create a vector of Excel files to read
files.to.read = list.files(path = ".", all.files = T)
# Read each file and write it to csv
lapply(files.to.read, function(f) {
  df = read.xlsx(f, sheet = 1)
  write.csv(df, gsub("xlsx", "csv", f), row.names = FALSE)
})
Error in file(con, "r") : invalid 'description' argument In addition: Warning message:
In unzip(xlsxFile, exdir = xmlDir) : error 1 in extracting from zip file
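As a hedged sketch (assuming the files really are binary .xls workbooks and not HTML or CSV exports renamed to .xls, which would also explain the openxlsx and rio errors above), the readxl package can read the legacy format directly:
library(readxl)
library(data.table)

# list only the .xls files in the folder from the first example
files <- list.files(path = "F:\\MUMuniv\\machine learning class\\price sensitivty\\PS works\\Customer files",
                    pattern = "\\.xls$", full.names = TRUE)

# read the first sheet of each workbook and stack the results
all.files  <- lapply(files, function(f) as.data.table(read_excel(f, sheet = 1)))
final.data <- rbindlist(all.files, fill = TRUE)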

Avoid re-loading datasets within a reactive in shiny

I have a shiny app that requires the input from one of several files. A simplified example would be:
library(shiny)
x <- matrix(rnorm(20), ncol=2)
y <- matrix(rnorm(10), ncol=4)
write.csv(x, 'test_x.csv')
write.csv(y, 'test_y.csv')
runApp(list(
  ui = fluidPage(
    titlePanel("Choose dataset"),
    sidebarLayout(
      sidebarPanel(
        selectInput("data", "Dataset", c("x", "y"), selected = "x")
      ),
      mainPanel(
        tableOutput('contents')
      )
    )
  ),
  server = function(input, output, session){
    myData <- reactive({
      inFile <- paste0("test_", input$data, ".csv")
      data <- read.csv(inFile, header = FALSE)
      data
    })
    output$contents <- renderTable({
      myData()
    })
  }
))
In reality, the files I read in are much larger, so I would like to avoid reading them in every time input$data changes if it has already been done once. For example, by making the matrices mat_x and mat_y available within the environment, and then testing within myData:
if (!exists(paste0("mat_", input$data))) {
  inFile <- paste0("test_", input$data, ".csv")
  data <- read.csv(inFile, header = FALSE)
  assign(paste0("mat_", input$data), data)
}
Is there a way to do this, or do I have to create a separate reactive for mat_x and mat_y and use that within myData? I actually have 9 possible input files, but each user may only want to use one or two.
You could do something like
myData <- reactive({
  fetch_data(input$data)
})

fetch_data <- function(input) {
  obj <- paste0("mat_", input)
  if (!exists(obj)) {
    inFile <- paste0("test_", input, ".csv")
    # cache the data frame so later calls skip the read
    assign(obj, read.csv(inFile, header = FALSE), envir = .GlobalEnv)
  }
  get(obj)
}
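A hedged alternative that avoids assigning into the global environment (which can be fragile in Shiny, especially with several sessions) is to keep the cache in a local environment created once when the app starts:
# per-app cache held in its own environment
cache <- new.env(parent = emptyenv())

fetch_data <- function(input) {
  obj <- paste0("mat_", input)
  if (!exists(obj, envir = cache)) {
    # same hypothetical file naming as in the question
    assign(obj, read.csv(paste0("test_", input, ".csv"), header = FALSE), envir = cache)
  }
  get(obj, envir = cache)
}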

Calculating the load time of page elements using Rcurl? (R)

I started playing with the idea of testing a webpage load time using R. I have devised a tiny R code to do so:
page.load.time <- function(theURL, N = 10, wait_time = 0.05)
{
  require(RCurl)
  require(XML)
  TIME <- numeric(N)
  for(i in seq_len(N))
  {
    Sys.sleep(wait_time)
    TIME[i] <- system.time(webpage <- getURL(theURL, header = FALSE,
                                             verbose = TRUE))[3]
  }
  return(TIME)
}
And would welcome your help in several ways:
Is it possible to do the same, but also to know which parts of the page took how long to load? (something like Yahoo's YSlow)
I sometimes run into the following error:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  Failure when receiving data from the peer
Timing stopped at: 0.03 0 43.72
Any suggestions on what is causing this and how to catch such errors and discard them?
Can you think of ways to improve the above function?
Update: I redid the function. It is now painfully slow...
one.page.load.time <- function(theURL, HTML = T, JavaScript = T, Images = T, CSS = T)
{
  require(RCurl)
  require(XML)
  require(stringr)  # for str_detect
  TIME <- NULL
  if(HTML) TIME["HTML"] <- system.time(doc <- htmlParse(theURL))[3]
  if(JavaScript) {
    theJS <- xpathSApply(doc, "//script/@src")  # find all JavaScript files
    TIME["JavaScript"] <- system.time(getBinaryURL(theJS))[3]
  } else ( TIME["JavaScript"] <- NA)
  if(Images) {
    theIMG <- xpathSApply(doc, "//img/@src")  # find all image files
    TIME["Images"] <- system.time(getBinaryURL(theIMG))[3]
  } else ( TIME["Images"] <- NA)
  if(CSS) {
    theCSS <- xpathSApply(doc, "//link/@href")  # find all "link" types
    ss_CSS <- str_detect(tolower(theCSS), ".css")  # find the CSS among them
    theCSS <- theCSS[ss_CSS]
    TIME["CSS"] <- system.time(getBinaryURL(theCSS))[3]
  } else ( TIME["CSS"] <- NA)
  return(TIME)
}
page.load.time <- function(theURL, N = 3, wait_time = 0.05, ...)
{
  require(RCurl)
  require(XML)
  TIME <- vector(length = N, "list")
  for(i in seq_len(N))
  {
    Sys.sleep(wait_time)
    TIME[[i]] <- one.page.load.time(theURL, ...)
  }
  require(plyr)
  TIME <- data.frame(URL = theURL, ldply(TIME, function(x) {x}))
  return(TIME)
}
a <- page.load.time("http://www.r-bloggers.com/", 2)
a
Your getURL call will only make one request and fetch the source HTML for the page. It won't get the CSS, JavaScript or other elements. If this is what you mean by 'parts' of the page, you'll have to scrape the source HTML for those parts (SCRIPT tags, CSS references, etc.) and getURL them separately, timing each one.
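A minimal sketch of that idea, assuming the referenced URLs are absolute (relative ones would first need to be resolved against theURL); the tryCatch also catches and discards the intermittent curl failures mentioned in the question:
require(RCurl)
require(XML)

# hypothetical helper: time each sub-resource of a page separately
time.page.parts <- function(theURL) {
  doc  <- htmlParse(getURL(theURL), asText = TRUE)
  srcs <- c(xpathSApply(doc, "//script/@src"),
            xpathSApply(doc, "//img/@src"),
            xpathSApply(doc, "//link[@rel='stylesheet']/@href"))
  srcs <- unique(as.character(srcs))
  # elapsed seconds per asset; failed requests return NA instead of stopping the run
  sapply(srcs, function(u)
    tryCatch(system.time(getBinaryURL(u))[3], error = function(e) NA))
}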
Perhaps Spidermonkey from Omegahat could work.
http://www.omegahat.org/SpiderMonkey/