ETL to CSV files, split up and then pushed to S3 to be consumed by Redshift - amazon-s3

Just getting started with Kiba, didn't find anything obvious, but I could be just channeling my inner child (who looks for their shoes by staring at the ceiling).
I want to dump a very large table to Amazon Redshift. It seems that the fastest way to do that is to write out a bunch of CSV files to an S3 bucket, then tell Redshift (via the COPY command) to pull them in. Magical scaling gremlins will do the rest.
So, I think that I want Kiba to write a CSV file for every 10k rows of data, then push it to S3, then start writing to a new file. At the end, make a post-processing call to COPY.
So, can I "pipeline" the work or should this be a big, nested Destination class?
i.e.
source -> transform -> transform ... -> [ csv -> s3 ]{every 10000}; post-process

Kiba author here. Thanks for trying it out!
Currently, the best way to implement this is to create what I'd call a "buffering destination". (A version of that will likely end up in Kiba Common at some point).
(Please test thoroughly, I just authored this this morning for you and didn't run it at all, although I've used less generic versions in the past. Also keep in mind that this version uses an in-memory buffer for your 10k rows, so growing the number to something much larger will consume memory. A less memory-consuming version could also be created, though, which would write rows to file as you get them.)
class BufferingDestination
  def initialize(buffer_size:, on_flush:)
    @buffer = []
    @buffer_size = buffer_size
    @on_flush = on_flush
    @batch_index = 0
  end

  def write(row)
    @buffer << row
    flush if @buffer.size >= @buffer_size
  end

  def flush
    @on_flush.call(batch_index: @batch_index, rows: @buffer)
    @batch_index += 1
    @buffer.clear
  end

  def close
    flush
  end
end
This is something you can then use like this, for instance here reusing the Kiba Common CSV destination (although you can write your own too):
require 'kiba-common/destinations/csv'

destination BufferingDestination,
  buffer_size: 10_000,
  on_flush: -> (batch_index:, rows:) {
    filename = File.join("output-#{sprintf("%08d", batch_index)}")
    csv = Kiba::Common::Destinations::CSV.new(
      filename: filename,
      csv_options: { ... },
      headers: %w(my fields here)
    )
    rows.each { |r| csv.write(r) }
    csv.close
  }
You could then trigger your COPY right in the on_flush block after generating the file (if you want the upload to start right away), or in a post_process block (but this would only start after all the CSVs are ready, which can be a feature if you prefer to ensure some form of transactional, global upload).
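For illustration only (my sketch, not part of the answer: it assumes the pg gem, that the per-batch files have already been uploaded under a made-up s3://my-bucket/output- prefix, and placeholder table, role and credential names), a post_process COPY could look roughly like this:
require 'pg'

post_process do
  # Placeholders: adjust table, bucket/prefix and IAM role to your setup
  redshift = PG.connect(host: ENV['REDSHIFT_HOST'], port: 5439, dbname: 'mydb',
                        user: ENV['REDSHIFT_USER'], password: ENV['REDSHIFT_PASSWORD'])
  begin
    # Redshift loads every S3 object matching the prefix in parallel
    redshift.exec(<<-SQL)
      COPY my_table
      FROM 's3://my-bucket/output-'
      IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
      CSV IGNOREHEADER 1;
    SQL
  ensure
    redshift.close
  end
end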
You could go fancy and start a thread queue to actually handle the upload in parallel if you really need this (but then be careful with zombie threads etc).
Another way is to have a "multiple steps" ETL process, with one script generating the CSVs and another one picking them up for upload, running concurrently (this is something I explained in my talk at RubyKaigi 2018, for instance).
Let me know how things work for you!

I'm not sure I understand your exact question, but I think your solution seems correct overall. A few suggestions, though.
You may want to consider having even more than 10K records per CSV file, and gzipping them while sending to S3.
You may want to look at creating a manifest containing the list of the multiple files, and then run the COPY command supplying the manifest file as input.
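For illustration (a sketch with made-up bucket, file, and role names, not from the answer above): the manifest is just a small JSON document listing the files; upload it to S3 alongside them and reference it in COPY with the MANIFEST keyword.
require 'json'

# Hypothetical batch files, already gzipped and uploaded to S3 as suggested above
entries = (0...3).map do |i|
  { url: format('s3://my-bucket/output-%08d.csv.gz', i), mandatory: true }
end
File.write('batches.manifest', JSON.pretty_generate(entries: entries))

# Upload batches.manifest to s3://my-bucket/batches.manifest, then run:
#
#   COPY my_table
#   FROM 's3://my-bucket/batches.manifest'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
#   MANIFEST CSV GZIP;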

Thibaut, I did something similar, except that I streamed it out to a Tempfile, I think...
require 'csv'
require 'tempfile'

# @param limit [Integer, 1_000] Number of rows per csv file
# @param callback [Proc] Proc taking one argument [CSV/io], that can be used after
#   each csv file is finished
module PacerPro
  class CSVDestination
    def initialize(limit: 1_000, callback: ->(obj) { })
      @limit = limit
      @callback = callback

      @csv = nil
      @row_count = 0
    end

    # @param row [Hash] returned from transforms
    def write(row)
      csv << row.values
      @row_count += 1
      return if row_count < limit
      self.close
    end

    # Called by Kiba when the transform pipeline is finished
    def close
      csv.close
      callback.call(csv)
      tempfile.unlink
      @csv = nil
      @row_count = 0
    end

    private

    attr_reader :limit, :callback
    attr_reader :row_count, :tempfile

    def csv
      @csv ||= begin
        @tempfile = Tempfile.new('csv')
        CSV.open(@tempfile, 'w')
      end
    end
  end
end
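A possible way to wire that destination up (my own sketch, not from the original comment; it assumes the aws-sdk-s3 gem, a made-up bucket/key layout, and that the CSV object handed to the callback still knows its tempfile path):
require 'aws-sdk-s3'

s3_bucket = Aws::S3::Resource.new(region: 'us-east-1').bucket('my-bucket')
batch_index = 0

destination PacerPro::CSVDestination,
  limit: 10_000,
  callback: ->(csv) {
    key = format('exports/batch-%08d.csv', batch_index)
    s3_bucket.object(key).upload_file(csv.path) # push each finished tempfile to S3
    batch_index += 1
  }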

Related

Fetching big SQL table in the web app session

I'm quite new to web apps, so apologies if my question is a bit basic. I'm developing a web app with R Shiny where the inputs are very large tables from Azure SQL Server. There are 20 tables, each on the order of hundreds of thousands of rows and hundreds of columns, containing numbers, characters, etc. I have no problem calling them; my main issue is that it takes so much time to fetch everything from Azure SQL Server, approximately 20 minutes, so the user of the web app has to wait quite a long time.
I'm using the DBI package as follows:
db_connect <- function(database_config_name) {
  dbConfig <- config::get(database_config_name)
  connection <- DBI::dbConnect(odbc::odbc(),
                               Driver = dbConfig$driver,
                               Server = dbConfig$server,
                               UID = dbConfig$uid,
                               PWD = dbConfig$pwd,
                               Database = dbConfig$database,
                               encoding = "latin1")
  return(connection)
}
and then fetching tables by:
connection <- db_connect(db_config_name)
table <- dplyr::tbl(connection,
                    dbplyr::in_schema(fetch_schema_name(db_config_name, table_name, data_source_type),
                                      fetch_table_name(db_config_name, table_name, data_source_type)))
I searched a lot but didn't come across a good solution; I'd appreciate any solution that can tackle this problem.
I work with R accessing SQL Server (not Azure) daily. For larger data (as in your example), I always revert to using the command-line tool sqlcmd, it is significantly faster. The only pain point for me was learning the arguments and working around the fact that it does not return proper CSV, there is post-query munging required. You may have an additional pain-point of having to adjust my example to connect to your Azure instance (I do not have an account).
In order to use this in a shiny environment and preserve its interactivity, I use the processx package to start the process in the background and then poll its exit status periodically to determine when it has completed.
Up front: this is mostly a "loose guide", I do not pretend that this is a fully-functional solution for you. There might be some rough-edges that you need to work through yourself. For instance, while I say you can do it asynchronously, it is up to you to work the polling process and delayed-data-availability into your shiny application. My answer here provides starting the process and reading the file once complete. And finally, if encoding= is an issue for you, I don't know if sqlcmd does non-latin correctly, and I don't know if or how to fix this with its very limited, even antiquated arguments.
Steps:
Save the query into a text file. Short queries can be provided on the command-line, but past some point (128 chars? I don't know that it's clearly defined, and have not looked enough recently) it just fails. Using a query-file is simple enough and always works, so I always use it.
I always use temporary files for each query instead of hard-coding the filename; this just makes sense. For convenience (for me), I use the same tempfile base name and append .sql for the query and .csv for the returned data, that way it's much easier to match query-to-data in the temp files. It's a convention I use, nothing more.
tf <- tempfile()
# using the same tempfile base name for both the query and csv-output temp files
querytf <- paste0(tf, ".sql")
writeLines(query, querytf)
csvtf <- paste0(tf, ".csv")
# these may be useful in troubleshooting, but not always [^2]
stdouttf <- paste0(tf, ".stdout")
stderrtf <- paste0(tf, ".stderr")
Make the call. I suggest you see how fast this is in a synchronous way first to see if you need to add an async query and polling in your shiny interface.
exe <- "/path/to/sqlcmd" # or "sqlcmd.exe"
args <- c("-W", "b", "-s", "\037", "-i", querytf, "-o", csvtf,
"-S", dbConfig$server, "-d", dbConfig$database,
"-U", dbConfig$uid, "-P", dbConfig$pwd)
## as to why I use "\037", see [^1]
## note that the user id and password will be visible on the shiny server
## via a `ps -fax` command-line call
proc <- processx::process$new(command = exe, args = args,
stdout = stdouttf, stderr = stderrtf) # other args exist
# this should return immediately, and should be TRUE until
# data retrieval is done (or error)
proc$is_alive()
# this will hang (pause R) until retrieval is complete; if/when you
# shift to asynchronous queries, do not do this
proc$wait()
One can use processx::run instead of process$new and proc$wait(), but I thought I'd start you down this path in case you want/need to go asynchronous.
If you go with an asynchronous operation, then periodically check (perhaps every 3 or 10 seconds) proc$is_alive(). Once that returns FALSE, you can start processing the file. During this time, shiny will continue to operate normally. (If you do not go async and therefore choose to proc$wait(), then shiny will hang until the query is complete.)
If you do not proc$wait() and try to continue with reading the file, that's a mistake. The file may not exist, in which case it will err with No such file or directory. The file may exist, perhaps empty. It may exist and have incomplete data. So really, make a firm decision to stay synchronous and therefore call proc$wait(), or go asynchronous and poll periodically until proc$is_alive() returns FALSE.
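As a loose sketch of that asynchronous route (my own addition, untested; query_done is a name I made up, while proc comes from the code above, and everything is assumed to live inside your shiny server function), you can poll from an observer and flip a reactive flag when the process ends, then do the file-reading step described below:
# poll the background sqlcmd process every 3 seconds without blocking shiny
query_done <- reactiveVal(FALSE)

observe({
  req(!query_done())     # stop polling once the query has finished
  invalidateLater(3000)  # re-run this observer in 3 seconds
  if (!proc$is_alive()) query_done(TRUE)
})

# elsewhere, react to completion and run the reading step shown below
observeEvent(query_done(), {
  req(query_done())
  # ... check proc$get_exit_status() and read csvtf as shown in the next step ...
})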
Reading in the file. There are three "joys" of using sqlcmd that require special handling of the file.
It does not do embedded quotes consistently, which is why I chose to use "\037" as a separator. (See [^1].)
It adds a line of dashes under the column names, which will corrupt the auto-classing of data when R reads in the data. For this, we do a two-step read of the file.
Nulls in the database are the literal NULL string in the data. For this, we update the na.strings= argument when reading the file.
exitstat <- proc$get_exit_status()
if (exitstat == 0) {
  ## read #1: get the column headers
  tmp1 <- read.csv(csvtf, nrows = 2, sep = "\037", header = FALSE)
  colnms <- unlist(tmp1[1,], use.names = FALSE)
  ## read #2: read the rest of the data
  out <- read.csv(csvtf, skip = 2, header = FALSE, sep = "\037",
                  na.strings = c("NA", "NULL"), quote = "")
  colnames(out) <- colnms
} else {
  # you should check both stdout and stderr files, see [^2]
  stop("'sqlcmd' exit status: ", exitstat)
}
Note:
After a lot of pain with several issues (some in sqlcmd.exe, some in data.table::fread and other readers, all dealing with CSV-format non-compliance), at one point I chose to stop working with comma-delimited returns, instead opting for the "\037" field delimiter. It works fine with all CSV-reading tools and has fixed so many problems (some not mentioned here). If you're not concerned, feel free to change the args to "-s", "," (adjusting the read as well).
sqlcmd seems to use stdout or stderr in different ways when there are problems. I'm sure there's rationale somewhere, but the point is that if there is a problem, check both files.
I added the use of both stdout= and stderr= because of a lot of troubleshooting I did, and continue to do if I munge a query. Using them is not strictly required, but you might be throwing caution to the wind if you omit those options.
By the way, if you choose to only use sqlcmd for all of your queries, there is no need to create a connection object in R. That is, db_connect may not be necessary. In my use, I tend to use "real" R DBI connections for known-small queries and the bulk sqlcmd for anything above around 10K rows. There is a tradeoff; I have not measured it sufficiently in my environment to know where the tipping point is, and it is likely different in your case.

Looking for a faster way to batch export PDFs in InDesign

I'm using this script (below) to batch export PDFs from several InDesign files for a task I do every week. The filenames are always the same; I'm using 8-10 different INDD files to create 12-15 different PDFs.
The script is set up like this:
//Sets variables for print and web presets
var myPDFExportPreset = app.pdfExportPresets.item("my-present-for-print-pdf");
var myPDFExportPreset2 = app.pdfExportPresets.item("my-preset-for-web-pdf");
//sample of one pdf exported first with print, then web pdf preset as two different files
var firstFileIntoPdfs = function() {
  var openDocument = app.open(File("MYFILEPATH/firstfile.indd"));
  openDocument.exportFile(
    ExportFormat.pdfType,
    File("MYFILEPATH/print-pdfs/firstfile-print.pdf"),
    false,
    myPDFExportPreset
  );
  openDocument.exportFile(
    ExportFormat.pdfType,
    File("MYFILEPATH/web-pdfs/firstfile-web.pdf"),
    false,
    myPDFExportPreset2
  );
};
I'm defining all exports as named functions like the one above, some using only one of the presets, some using two. I'm calling all these functions at the end of the file:
firstFileIntoPdfs();
secondFileIntoPdfs();
thirdFileIntoPdfs();
fourthFileIntoPdfs();
and so on...
The script is however quite slow: 10 files into 1 or 2 PDFs each, like the function above, can take 10 minutes. I don't think this is a CPU issue; what I noticed is that the script seems to wait for the files in firstFileIntoPdfs() to be created, a process that takes some minutes, before proceeding to execute the next function. Then it waits again...
Selecting File -> Export manually, you can set new files to be exported while the previous ones are still being processed into PDFs, which to me has seemed faster than how this script works. Manual clicking is, however, error-prone and tedious, of course.
Is there a better way to write this batch export script than what I've done above, so that all functions are executed while the PDFs from previous functions are still being processed in the system? I'd like to keep them as separate functions in order to be able to comment some out when only needing certain specific PDFs (unless exporting all becomes nearly as fast as exporting only 1 PDF).
I hope my question makes sense!
There is an async method available; replace exportFile with asynchronousExportFile:
var openDocument = app.open(File("MYFILEPATH/firstfile.indd"));
openDocument.asynchronousExportFile(
  ExportFormat.pdfType,
  File("MYFILEPATH/print-pdfs/firstfile-print.pdf"),
  false,
  myPDFExportPreset
);
which uses a background task.
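For instance (a sketch only, untested; the paths and preset variables are the placeholders from the question, and "secondfile" is made up), you could drive all exports from one data table and fire each export asynchronously, so InDesign queues them as background tasks instead of blocking between files:
// Sketch: queue every export as a background task.
// Paths and presets are the placeholders used in the question.
var jobs = [
  { indd: "MYFILEPATH/firstfile.indd",
    outputs: [
      { pdf: "MYFILEPATH/print-pdfs/firstfile-print.pdf", preset: myPDFExportPreset },
      { pdf: "MYFILEPATH/web-pdfs/firstfile-web.pdf", preset: myPDFExportPreset2 }
    ] },
  { indd: "MYFILEPATH/secondfile.indd",
    outputs: [
      { pdf: "MYFILEPATH/print-pdfs/secondfile-print.pdf", preset: myPDFExportPreset }
    ] }
  // ...add the remaining files, or comment entries out when you only need some PDFs
];

for (var i = 0; i < jobs.length; i++) {
  var doc = app.open(File(jobs[i].indd));
  for (var j = 0; j < jobs[i].outputs.length; j++) {
    doc.asynchronousExportFile(
      ExportFormat.pdfType,
      File(jobs[i].outputs[j].pdf),
      false,
      jobs[i].outputs[j].preset
    );
  }
}
The documents are deliberately left open here so the background exports can finish; close them once the PDFs have appeared.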

Using a text file as a Spark streaming source for testing purposes

I want to write a test for my Spark Streaming application that consumes a Flume source.
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ suggests using ManualClock, but for the moment reading a file and verifying outputs would be enough for me.
So I wish to use:
JavaStreamingContext streamingContext = ...
JavaDStream<String> stream = streamingContext.textFileStream(dataDirectory);
stream.print();
streamingContext.awaitTermination();
streamingContext.start();
Unfortunately it does not print anything.
I tried:
dataDirectory = "hdfs://node:port/absolute/path/on/hdfs/"
dataDirectory = "file://C:\\absolute\\path\\on\\windows\\"
adding the text file in the directory BEFORE the program begins
adding the text file in the directory WHILE the program run
Nothing works.
Any suggestion to read from text file?
Thanks,
Martin
The order of start and await is indeed inverted.
In addition to that, the easiest way to pass data to your Spark Streaming application for testing is a QueueDStream. It's a mutable queue of RDD of arbitrary data. This means that you could create the data programmatically or load it from disk into an RDD and pass that to your Spark Streaming code.
E.g., to avoid the timing issues faced with the fileConsumer, you could try this:
import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

val rdd = sparkContext.textFile(...)
val rddQueue: Queue[RDD[String]] = Queue()
rddQueue += rdd
val dstream = streamingContext.queueStream(rddQueue)

doMyStuffWithDstream(dstream)

streamingContext.start()
streamingContext.awaitTermination()
I am so stupid, I inverted calls to start() and awaitTermination()
If you want to do the same, you should read from HDFS, and add the file WHILE the program runs.
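For reference, a minimal corrected sketch of the snippet from the question (the directory is the placeholder HDFS path from the question; the master and batch interval are assumptions of mine), with start() called before awaitTermination():
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class TextFileStreamTest {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("textFileStream-test");
        JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaDStream<String> stream = streamingContext.textFileStream("hdfs://node:port/absolute/path/on/hdfs/");
        stream.print();

        streamingContext.start();            // start first...
        streamingContext.awaitTermination(); // ...then block; files added to the directory
                                             // while the job runs will be picked up
    }
}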

Copy an image with CarrierWave

I have an ArtworkUploader and I want to create a duplicate of the artwork image in the same directory. Help me to solve this.
My Uploader:
class ArtworkUploader < CarrierWave::Uploader::Base
  def store_dir
    if model
      "uploads/#{model.class.to_s.underscore}/#{model.id}/#{mounted_as}"
    end
  end

  def filename
    "artwork.png"
  end
end
I tried this in the console but it doesn't work. What am I missing here?
Console:
> u = User.find(5)
> u.artwork.create(name: "testing.png", file: u.artwork.path)
> NoMethodError: undefined method `create!' for /uploads/5/artwork/Artwork:ArtworkUploader
There are 2 ways I can think of that you can do this:
a) Via versioning:
Create a version of your file:
version :copy_file do
  process :create_a_copy
end
and now just define a create_a_copy method inside your uploader which will just return the same file.
This way you can have a copy of the file. I didn't understand your custom filename stuff, but the way you have defined the filename method for your uploader, you can do the same for the version as well, something like this:
version :copy_file do
  process :create_a_copy

  def filename
    "testing.png"
  end
end
NOTE:
Not sure of the filename stuff for the version file since I did it long back, but I believe the above setting of a different filename method would work.
Advantages:
The file and its copy are associated with one uploader
No extra column is required in the database (which is required in approach b)
The above approach has some caveats, though:
Slightly more complex
Single-delete issue: deleting the original upload would delete its copy as well
b) Via a separate column:
The other way you can achieve this is by defining a separate column called artwork_copy and mounting the same uploader, with just a slight change in your uploader, like this:
def filename
  if self.mounted_as == :artwork
    "artwork.png"
  else
    "testing.png"
  end
end
And this is the way you attach the file (given that your file is stored locally):
u = User.find(5)
u.artwork_copy = File.open(u.artwork) ## Just cross check
u.save
There is another way you can do the same as above:
u = User.find(5)
u.artwork_copy.store!(File.open(u.artwork))
Now it's pretty obvious what the advantages/disadvantages of approach b mentioned above are.
Hope this makes sense.

Code run from Rspec file behaves differently than when run from the model

I've been spinning around this for a few hours now without any luck finding a reference to the problem...
We're building a simple indexing app for a video library stored on AmazonS3.
When writing my tests, I initially write everything in the test file to establish what results I'd like, and progressively move the real implementation to the model.
Working with Rails 3, the AWS-S3 gem, and RSpec.
So, on my test, I start off with the following code:
spec/models/s3import_spec.rb
...
it "gets the names of all objects" do
im = S3import.new
a = []
im.bucket.objects.each do |obj|
a << obj.key
end
a.should == ["Agility/", "Agility/Stationary_Over Stick/",
"Agility/Stationary_Over Stick/2 foot hops over stick.mp4"]
end
This simple test creates an import object that knows the S3 bucket name and credentials, and goes through the objects in the bucket and captures the object's name. This works as expected.
When I move the code over to the model, I end up with the following model:
app/models/s3import.rb
...
def objNames
  a = []
  bucket.objects.each do |i|
    a << i.key
  end
end
and the test changes to this:
it "gets the names of all objects" do
im = S3import.new
a = im.objNames
a.should == ["Agility/", "Agility/Stationary_Over Stick/",
"Agility/Stationary_Over Stick/2 foot hops over stick.mp4"]
end
My confusion is, when I run the test calling the code on the model side, I don't get the array of strings that I was expecting (as I got in the self-contained test code). I receive the following:
[#<AWS::S3::S3Object:0x2179225400 '/transcode2011/Agility/'>,
+ #<AWS::S3::S3Object:0x2179225380 '/transcode2011/Agility/Stationary_Over Stick/'>,
+ #<AWS::S3::S3Object:0x2179225320 '/transcode2011/Agility/Stationary_Over Stick/2 foot hops over stick.mp4'>]
As you can see, the returned array consists of the original AWS::S3::S3Object instances... as if the loop simply duplicated the original collection rather than getting the 'key' as a string.
I've tested the same in the console and I can't seem to figure out what specifically is different that causes the discrepancy.
Any help would be greatly appreciated.
I think you're returning the result of bucket.objects.each (the collection itself) rather than your array. Try adding a line for a different return value:
def objNames
  a = []
  bucket.objects.each do |i|
    a << i.key
  end
  a
end
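As a side note (same behaviour, just more idiomatic Ruby), the whole method can also be written with map, which builds and returns the new array directly:
def objNames
  bucket.objects.map(&:key)
end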