Fetching a big SQL table in a web app session

I'm quite new to web apps, so apologies if my question is a bit basic. I'm developing a web app with R Shiny where the inputs are very large tables from an Azure SQL Server. There are 20 tables, each on the order of hundreds of thousands of rows and hundreds of columns, containing numbers, characters, etc. I have no problem calling them; my main issue is that it takes so much time to fetch everything from Azure SQL Server, approximately 20 minutes, so the user of the web app has to wait quite a long time.
I'm using the DBI package as follows:
db_connect <- function(database_config_name){
  dbConfig <- config::get(database_config_name)
  connection <- DBI::dbConnect(odbc::odbc(),
                               Driver = dbConfig$driver,
                               Server = dbConfig$server,
                               UID = dbConfig$uid,
                               PWD = dbConfig$pwd,
                               Database = dbConfig$database,
                               encoding = "latin1")
  return(connection)
}
and then fetching tables with:
connection <- db_connect(db_config_name)
table <- dplyr::tbl(connection,
                    dbplyr::in_schema(fetch_schema_name(db_config_name, table_name, data_source_type),
                                      fetch_table_name(db_config_name, table_name, data_source_type)))
I have searched a lot but didn't come across a good solution; I'd appreciate any solution that can tackle this problem.

I work with R accessing SQL Server (not Azure) daily. For larger data (as in your example), I always revert to using the command-line tool sqlcmd; it is significantly faster. The only pain point for me was learning the arguments and working around the fact that it does not return proper CSV, so there is post-query munging required. You may have an additional pain point of having to adjust my example to connect to your Azure instance (I do not have an account).
In order to use this in a shiny environment and preserve its interactivity, I use the processx package to start the process in the background and then poll its exit status periodically to determine when it has completed.
Up front: this is mostly a "loose guide"; I do not pretend that this is a fully-functional solution for you. There might be some rough edges that you need to work through yourself. For instance, while I say you can do it asynchronously, it is up to you to work the polling process and delayed data availability into your Shiny application. My answer here covers starting the process and reading the file once complete. And finally, if encoding= is an issue for you, I don't know whether sqlcmd handles non-Latin text correctly, and I don't know if or how to fix this with its very limited, even antiquated, arguments.
Steps:
Save the query into a text file. Short queries can be provided on the command-line, but past some point (128 chars? I don't know that it's clearly defined, and have not looked enough recently) it just fails. Using a query-file is simple enough and always works, so I always use it.
I always use temporary files for each query instead of hard-coding the filename; this just makes sense. For convenience (for me), I use the same tempfile base name and append .sql for the query and .csv for the returned data, that way it's much easier to match query-to-data in the temp files. It's a convention I use, nothing more.
tf <- tempfile()
# using the same tempfile base name for both the query and csv-output temp files
querytf <- paste0(tf, ".sql")
writeLines(query, querytf)
csvtf <- paste0(tf, ".csv")
# these may be useful in troubleshooting, but not always [^2]
stdouttf <- paste0(tf, ".stdout")
stderrtf <- paste0(tf, ".stderr")
Make the call. I suggest you see how fast this is in a synchronous way first to see if you need to add an async query and polling in your shiny interface.
exe <- "/path/to/sqlcmd" # or "sqlcmd.exe"
args <- c("-W", "b", "-s", "\037", "-i", querytf, "-o", csvtf,
"-S", dbConfig$server, "-d", dbConfig$database,
"-U", dbConfig$uid, "-P", dbConfig$pwd)
## as to why I use "\037", see [^1]
## note that the user id and password will be visible on the shiny server
## via a `ps -fax` command-line call
proc <- processx::process$new(command = exe, args = args,
stdout = stdouttf, stderr = stderrtf) # other args exist
# this should return immediately, and should be TRUE until
# data retrieval is done (or error)
proc$is_alive()
# this will hang (pause R) until retrieval is complete; if/when you
# shift to asynchronous queries, do not do this
proc$wait()
One can use processx::run instead of process$new and proc$wait(), but I thought I'd start you down this path in case you want/need to go asynchronous.
If you go with an asynchronous operation, then periodically check (perhaps every 3 or 10 seconds) proc$is_alive(). Once that returns FALSE, you can start processing the file. During this time, shiny will continue to operate normally. (If you do not go async and therefore choose to proc$wait(), then shiny will hang until the query is complete.)
If you do not proc$wait() and nonetheless try to continue with reading the file, that's a mistake. The file may not exist yet, in which case reading it will err with No such file or directory. The file may exist but be empty, or exist with incomplete data. So make a firm decision: either stay synchronous and call proc$wait(), or go asynchronous and poll periodically until proc$is_alive() returns FALSE, as in the sketch below.
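For the asynchronous route, a minimal Shiny polling sketch could look like the following. The reactive names, the 5-second interval, and the output binding are illustrative assumptions on my part, and the CSV reading is the two-step read explained in the next step.

library(shiny)

server <- function(input, output, session) {
  fetched <- reactiveVal(NULL)

  observe({
    if (is.null(fetched())) {
      invalidateLater(5000, session)   # re-check every 5 seconds until the data is in
      if (!proc$is_alive()) {          # 'proc' is the processx handle created above
        if (proc$get_exit_status() == 0) {
          tmp1 <- read.csv(csvtf, nrows = 2, sep = "\037", header = FALSE)
          out <- read.csv(csvtf, skip = 2, header = FALSE, sep = "\037",
                          na.strings = c("NA", "NULL"), quote = "")
          colnames(out) <- unlist(tmp1[1,], use.names = FALSE)
          fetched(out)
        } else {
          showNotification("sqlcmd failed; check its stdout/stderr files")
        }
      }
    }
  })

  output$preview <- renderTable(head(fetched()))
}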
Reading in the file. There are three "joys" of using sqlcmd that require special handling of the file.
It does not do embedded quotes consistently, which is why I chose to use "\037" as a separator. (See [^1].)
It adds a line of dashes under the column names, which will corrupt the auto-classing of data when R reads in the data. For this, we do a two-step read of the file.
Nulls in the database are the literal NULL string in the data. For this, we update the na.strings= argument when reading the file.
exitstat <- proc$get_exit_status()
if (exitstat == 0) {
  ## read #1: get the column headers
  tmp1 <- read.csv(csvtf, nrows = 2, sep = "\037", header = FALSE)
  colnms <- unlist(tmp1[1,], use.names = FALSE)
  ## read #2: read the rest of the data
  out <- read.csv(csvtf, skip = 2, header = FALSE, sep = "\037",
                  na.strings = c("NA", "NULL"), quote = "")
  colnames(out) <- colnms
} else {
  # you should check both stdout and stderr files, see [^2]
  stop("'sqlcmd' exit status: ", exitstat)
}
Notes:
[^1]: After a lot of pain with several issues (some in sqlcmd.exe, some in data.table::fread and other readers, all dealing with CSV-format non-compliance), at one point I chose to stop working with comma-delimited returns, instead opting for the "\037" field delimiter. It works fine with all CSV-reading tools and has fixed so many problems (some not mentioned here). If you're not concerned, feel free to change the args to "-s", "," (adjusting the read as well).
[^2]: sqlcmd seems to use stdout or stderr in different ways when there are problems. I'm sure there's rationale somewhere, but the point is that if there is a problem, check both files.
I added the use of both stdout= and stderr= because of a lot of troubleshooting I did, and continue to do if I munge a query. Using them is not strictly required, but you might be throwing caution to the wind if you omit those options.
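For example, on a nonzero exit status you might dump both capture files. This little loop is my own addition, not part of the original workflow; it only uses the stdouttf/stderrtf paths created earlier.

for (f in c(stdouttf, stderrtf)) {
  if (file.exists(f) && file.size(f) > 0) {
    message("---- ", basename(f), " ----")   # label which capture file follows
    writeLines(readLines(f))
  }
}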
By the way, if you choose to use only sqlcmd for all of your queries, there is no need to create a connection object in R; that is, db_connect may not be necessary. In my use, I tend to use "real" R DBI connections for known-small queries and the bulk sqlcmd approach for anything above around 10K rows. There is a tradeoff; I have not measured it sufficiently in my environment to know where the tipping point is, and it is likely different in your case.
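To illustrate that split, a hypothetical dispatcher might look like this; the 10K threshold and the run_sqlcmd() wrapper name are mine (the wrapper would bundle the sqlcmd steps above), not an existing API.

fetch_data <- function(query, expected_rows, connection, dbConfig) {
  if (expected_rows < 10000) {
    DBI::dbGetQuery(connection, query)   # small result: regular DBI round-trip
  } else {
    run_sqlcmd(query, dbConfig)          # bulk result: shell out to sqlcmd (hypothetical wrapper)
  }
}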

Related

Enable Impala Impersonation on Superset

Is there a way to make the logged-in user (on Superset) run the queries on Impala?
I tried to enable the "Impersonate the logged on user" option on Databases, but with no success, because all the queries run on Impala with the superset user.
I'm trying to achieve the same! This will not completely answer the question, since it still does not work, but I want to share my research in order to maybe help another soul trying to use this instrument outside very basic use cases.
I went deep into the code and found out that impersonation is not implemented for Impala, so you cannot achieve this from the UI. I found this PR https://github.com/apache/superset/pull/4699 that for whatever reason was never merged into the codebase, and tried to copy and paste the code into my Superset version (1.1.0), but it didn't work. Adding some logs, I can see that the configuration with the impersonation is updated, but the actual Impala query then runs as the user I used to start the process.
As you can imagine, I am a complete noob at this. However, I found out that the impersonation happens when you create a cursor, and there is a constructor parameter in which you can pass the impersonation configuration.
I managed to correctly (at least to my understanding) implement impersonation for the SQL lab part.
In sql_lab.py, you have to add the following lines in the execute_sql_statements method:
with closing(engine.raw_connection()) as conn:
    # closing the connection closes the cursor as well
    cursor = conn.cursor(**database.cursor_kwargs)
where cursor_kwargs is defined in db_engine_specs/impala.py as the following
@classmethod
def get_configuration_for_impersonation(cls, uri, impersonate_user, username):
    logger.info(
        'Passing Impala execution_options.cursor_configuration for impersonation')
    return {'execution_options': {
        'cursor_configuration': {'impala.doas.user': username}}}

@classmethod
def get_cursor_configuration_for_impersonation(cls, uri, impersonate_user,
                                                username):
    logger.debug('Passing Impala cursor configuration for impersonation')
    return {'configuration': {'impala.doas.user': username}}
Finally, in models/core.py you have to add the following bit in the get_sqla_engine def:
params = extra.get("engine_params", {})  # that was already there, just for you to find the line
self.cursor_kwargs = self.db_engine_spec.get_cursor_configuration_for_impersonation(
    str(url), self.impersonate_user, effective_username)  # this is the line I added
...
params.update(self.get_encrypted_extra())  # already there

# new stuff
configuration = {}
configuration.update(
    self.db_engine_spec.get_configuration_for_impersonation(
        str(url),
        self.impersonate_user,
        effective_username))
if configuration:
    params.update(configuration)
As you can see, I just shamelessly pasted the code from the PR. However, as I already said, this only kind of works for SQL Lab. For the dashboards there is an entirely different way of querying Impala that I have not yet figured out.
This means that queries for the dashboards are handled in a different way, and there isn't something like this:
with closing(engine.raw_connection()) as conn:
    # closing the connection closes the cursor as well
    cursor = conn.cursor(**database.cursor_kwargs)
My gut (and debugging) feeling is that you first need to understand the SQLAlchemy part and extend a new ImpalaEngine class that uses a custom cursor with the impersonation configuration. Or something like that; however, it is not as simple (if we want to call this simple) as the sql_lab part. So, the trick is to find out where the query is executed and create a cursor with the impersonation configuration. Easy, isn't it?
I hope this sheds some light for you and the others who have this issue. Let me know if you find another way to solve it, or if this comment was useful.
Update: something really useful
A colleague of mine successfully implemented impersonation with Impala without touching anything Superset-related, instead working directly with the impyla lib. A PR was opened with the code to change. You can apply the patch directly in the impyla source used by Superset; you have to edit both dbapi.py and hiveserver2.py.
As a reminder: we are still testing this and we do not know if it works with different accounts using the same superset instance.

How to use dbWriteTable on a connection with no default Database

I've seen many posts on SO and the DBI GitHub regarding trouble using DBI::dbWriteTable (e.g. [1], [2]). These mostly have to do with the use of non-default schemas or the like.
That's not my case.
I have a server running SQL Server 2014. This server contains multiple databases.
I'm developing a program which interacts with many of these databases at the same time. I therefore defined my connection using DBI::dbConnect() without a Database= argument.
So far I've only had to do SELECTs on the databases, and this connection works just fine with dbGetQuery(). I just need to name my tables including the database name: DatabaseFoo.dbo.TableBar, which is more than fine since it makes things transparent and intentional. It also stops me from being lazy and omitting the database name in calls to whichever DB I named in the connection.
I now need to add data to a table, and I can't get it to work. A call to
DBI::dbWriteTable(conn, "DatabaseFoo.dbo.TableBar", myData, append = TRUE)
works, but creates a table named DatabaseFoo.dbo.TableBar in the master Database, which isn't what I meant (I didn't even know there was a master Database).
The DBI::dbWriteTable man page states the name should be
A character string specifying the unquoted DBMS table name, or the result of a call to dbQuoteIdentifier().
So I tried dbQuoteIdentifier() (and a few other variations):
DBI::dbWriteTable(conn,
                  DBI::dbQuoteIdentifier(conn, "DatabaseFoo.dbo.TableBar"),
                  myData)
# no error, same problem as above

DBI::dbWriteTable(conn,
                  DBI::dbQuoteIdentifier(conn, DBI::SQL("DatabaseFoo.dbo.TableBar")),
                  myData)
# Error: Can't unquote DatabaseFoo.dbo.TableBar

DBI::dbWriteTable(conn,
                  DBI::SQL("DatabaseFoo.dbo.TableBar"),
                  myData)
# Error: Can't unquote DatabaseFoo.dbo.TableBar

DBI::dbWriteTable(conn,
                  DBI::dbQuoteIdentifier(conn,
                                         DBI::Id(catalog = "DatabaseFoo",
                                                 schema = "dbo",
                                                 table = "TableBar")),
                  myData)
# Error: Can't unquote "DatabaseFoo"."dbo"."TableBar"

DBI::dbWriteTable(conn,
                  DBI::Id(catalog = "DatabaseFoo",
                          schema = "dbo",
                          table = "TableBar"),
                  myData)
# Error: Can't unquote "DatabaseFoo"."dbo"."TableBar"
In the DBI::Id() attempts, I also tried using cluster instead of catalog. No effect, identical error.
However, if I change my dbConnect() call to add a Database="DatabaseFoo" argument, I can simply use dbWriteTable(conn, "TableBar", myData) and it works.
So the question becomes, am I doing something wrong? Is this related to the problems in the other questions?
This is a shortcoming in the DBI package. The dev version (DBI >= 1.0.0.9002) no longer suffers from this problem; it will hit CRAN as DBI 1.1.0 soonish.
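For reference, a sketch of the call that should work once the fixed version is installed; this assumes the Id() form from the question is what the fix targets, and reuses the question's placeholder names.

DBI::dbWriteTable(conn,
                  DBI::Id(catalog = "DatabaseFoo",
                          schema = "dbo",
                          table = "TableBar"),
                  myData,
                  append = TRUE)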

ETL to csv files, split up and then pushed to s3 to be consume by redshift

Just getting started with Kiba, didn't find anything obvious, but I could be just channeling my inner child (who looks for their shoes by staring at the ceiling).
I want to dump a very large table to Amazon Redshift. It seems that the fastest way to do that is to write out a bunch of CSV files to an S3 bucket, then tell Redshift (via the COPY command) to pull them in. Magical scaling gremlins will do the rest.
So, I think that I want Kiba to write a CSV file for every 10k rows of data, push it to S3, then start writing to a new file. At the end, make a post-processing call to COPY.
So, can I "pipeline" the work or should this be a big, nested Destination class?
i.e.
source -> transform -> transform ... -> [ csv -> s3 ]{every 10000}; post-process
Kiba author here. Thanks for trying it out!
Currently, the best way to implement this is to create what I'd call a "buffering destination". (A version of that will likely end up in Kiba Common at some point).
(Please test thoroughly; I just authored this this morning for you and didn't run it at all, although I've used less generic versions in the past. Also keep in mind that this version uses an in-memory buffer for your 10k rows, so growing the number to something much larger will consume memory. A less memory-consuming version could also be created, which would write rows to file as you get them.)
class BufferingDestination
  def initialize(buffer_size:, on_flush:)
    @buffer = []
    @buffer_size = buffer_size
    @on_flush = on_flush
    @batch_index = 0
  end

  def write(row)
    @buffer << row
    flush if @buffer.size >= @buffer_size
  end

  def flush
    @on_flush.call(batch_index: @batch_index, rows: @buffer)
    @batch_index += 1
    @buffer.clear
  end

  def close
    flush
  end
end
This is something you can then use like this, for instance here reusing the Kiba Common CSV destination (although you can write your own too):
require 'kiba-common/destinations/csv'
destination BufferingDestination,
  buffer_size: 10_000,
  on_flush: ->(batch_index:, rows:) {
    filename = File.join("output-#{sprintf("%08d", batch_index)}")
    csv = Kiba::Common::Destinations::CSV.new(
      filename: filename,
      csv_options: { ... },
      headers: %w(my fields here)
    )
    rows.each { |r| csv.write(r) }
    csv.close
  }
You could then trigger your COPY right in the on_flush block after generating the file (if you want the upload to start right away), or in a post_process block (but this would only start after all the CSVs are ready, which can be a feature to ensure some form of transactional global upload if you prefer).
You could go fancy and start a thread queue to actually handle the upload in parallel if you really need this (but then be careful with zombie threads etc).
Another way is to have "multiple steps" ETL processes, with one script to generate the CSV, and another one picking them for upload, running concurrently (this is something I've explained in my talk at RubyKaigi 2018 for instance).
Let me know how things work for you!
I'm not sure what the exact question here is, but I think your solution seems correct overall; a few suggestions though.
You may want to consider having even more than 10K records per CSV file, and gzipping them while sending to S3.
You may want to look at creating a manifest containing the list of the multiple files, and then running the COPY command supplying the manifest file as input.
Thibaut, I did something similar, except that I streamed it out to a Tempfile, I think...
require 'csv'
require 'tempfile'

# @param limit [Integer, 1_000] Number of rows per csv file
# @param callback [Proc] Proc taking one argument [CSV/io], that can be used after
#   each csv file is finished
module PacerPro
  class CSVDestination
    def initialize(limit: 1_000, callback: ->(obj) { })
      @limit = limit
      @callback = callback
      @csv = nil
      @row_count = 0
    end

    # @param row [Hash] returned from transforms
    def write(row)
      csv << row.values
      @row_count += 1
      return if row_count < limit
      self.close
    end

    # Called by Kiba when the transform pipeline is finished
    def close
      csv.close
      callback.call(csv)
      tempfile.unlink
      @csv = nil
      @row_count = 0
    end

    private

    attr_reader :limit, :callback
    attr_reader :row_count, :tempfile

    def csv
      @csv ||= begin
        @tempfile = Tempfile.new('csv')
        CSV.open(@tempfile, 'w')
      end
    end
  end
end

JSR 352 : How do you write to a MVS Dataset from a Java Batch program?

I need to write to a non-VSAM dataset on the mainframe. I know that we need to use the ZFile library to do it, and I found how to do it here.
I am running my Java batch job on WebSphere Liberty on z/OS. How do I specify the dataset? Can I directly give the dataset a name like this?
dsnFile = new ZFile("X.Y.Z", "wb,type=record,noseek");
I am able to write to a text file on the server itself using Java's file writers, but I don't know how to access an MVS dataset.
I am relatively new to the world of z/OS and mainframes.
It sounds like you might be asking more generally how to use the ZFile API on WebSphere Liberty on z/OS.
Have you tried something like:
String pdsName = ZFile.getSlashSlashQuotedDSN("X.Y.Z");
ZFile zfile = new ZFile(pdsName , ...options...)
As far as batch-specific use cases, you might obviously have to differentiate between writing to a new file that's created for the first time on an original execution, as opposed to appending to an already-existing one on a restart.
You also might find some useful snippets in this doctorbatch.io repo, along with the original link you posted.
For reference, I'll copy/paste from the ZFile Javadoc:
ZFile dd = new ZFile("//DD:MYDD", "r");
Opens the DD named MYDD for reading
ZFile dsn = new ZFile("//'SYS1.HELP(ACCOUNT)'", "rt");
Opens the member ACCOUNT from the PDS SYS1.HELP for reading text records
ZFile dsn = new ZFile("//SEQ", "wb,type=record,recfm=fb,lrecl=80,noseek");
Opens the data set {MVS_USER}.SEQ for sequential binary writing. Note that ",noseek" should be specified with "type=record" if access is sequential, since performance is greatly improved.
One final note, another couple useful ZFile helper methods are: bpxwdyn() and getFullyQualifiedDSN().

How to flush current output in RApache?

I'm testing using RApache as an SSE (Server Sent Events) and similar (long poll, comet, etc.) back-end. I seem to be stuck on how to flush my output. Is it possible?
Here is my test R script:
setContentType("text/plain")
repeat{
  cat(format(Sys.time()),"\n")
  #sendBin(paste(format(Sys.time()),"\n"))
  flush(stdout())
  Sys.sleep(1)
}
My Rapache.conf entry is:
<Location /rtest/sse>
Options -MultiViews
SetHandler r-handler
RFileHandler /var/www/local/rtest/sse.r
</Location>
And I test it using either wget or curl:
wget -O - http://localhost/rtest/sse
curl http://localhost/rtest/sse
Both just sit there, meaning nothing is being sent.
Using sendBin() made no change, and neither did using flush().
If I change repeat to for(i in 1:5) then it sits there for 5 seconds and then shows 5 timestamps (spaced one second apart). So, I believe everything else is working fine and this is purely a buffering issue.
UPDATE: Looking at this with fresh eyes after 5 months, I think I could have described the problem more clearly: the problem is that RApache appears to be buffering all the output, and not sending anything until the R script exits. To be useful for streaming it has to send data out of Apache and on to the client each time flush() is called, i.e. while the R script is still running.
So, my question is: is there a way to get RApache to behave like that?
UPDATE 2: I tried adding flush.console() before or after the flush(stdout()), but no difference. I also tried setStatus(status=200L) at the top. And I tried SERVER$no_cache=T; SERVER$no_local_copy=T; at the top of the script. Again it made no difference. (Yes, none of those should have helped, but it never hurts to try!)
Here is a link to how PHP implements flush when it is running as an Apache module:
http://git.php.net/?p=php-src.git;a=blob;f=sapi/apache2handler/sapi_apache2.c#l290
I think the key point is that there is a call to ap_rflush(r). I'm guessing that RApache is not making the ap_rflush() call.
You are passing the wrong MIME type. Try changing it to:
setContentType("text/event-stream")
EDIT 1:
This is the attempt (still unsuccessful) I mentioned in the comment below to implement SSE in Rook.
<%
res$header('Content-Type', 'text/event-stream')
res$header('Cache-Control', 'no-cache')
res$header('Connection', 'keep-alive')

A <- 1
sendMessage <- function(){
  while(A <= 4){
    cat("id: ", Sys.time(), "\n", "data: hello\n\n", sep="")
    A <- A + 1
    flush(stdout())
    Sys.sleep(1)
  }
}
-%>
<% sendMessage() %>
The while loop condition was supposed to be always TRUE, but I'm having your same problem, so I had to use a finite loop...
The good news is I DO have data reaching the browser. I can tell by looking, in developer tools, at the Content-Length in the Response Header section: it says 114 for the above code, and if you change, say, "hello" to "hello!", it'll say 118.
The JS code is (you'll need jQuery as well):
$(document).ready(function(){
  $("button").click(function(){
    var source = new EventSource("../R/sse.Rhtml");
    source.onopen = function(event){
      console.log("readyState: " + source.readyState);
    };
    source.onmessage = function(event){
      $("#div").append(event.data);
    };
    source.onerror = function(event){
      console.log(event);
    };
  });
});
So, in essence
1) The connection is open (readyState 1)
2) Buffering is still there
3) Data (after buffering) reaches the browser, but an error happens in receiving it properly.
EDIT 2:
It's interesting to note that when brew()ing the above .Rhtml file, the output is not buffered. There must be a configuration in the web server (both the internal R one and Apache) that buffers the data flow.
As a side note, flush is not even needed; cat's output defaults to stdout(). So the options are:
Web server configuration
The R equivalent of the PHP ob_flush(), which is always used in any PHP implementation I've seen; here is an example.