How to dynamically combine generators? - iterator

for submissions in itertools.zip_longest(submission_stream, submission_stream2):  # want to put all streams here
    for submission in submissions:
        # processing
The above code works for the two streams that I have initialised. My goal is to combine streams based on the usernames in a .csv file: if a username is present, run a stream for it; if it gets removed, or a new username is added, stop or start the corresponding stream respectively.
An example of a stream is:
submission_stream = reddit.redditor("username").stream.submissions(skip_existing=True, pause_after=-1)
I would really appreciate it if someone could guide me.

You would probably have to start streaming over again every time your .csv file is changed, although you could get away with filtering (itertools.filterfalse)
for username removals. Code sketch, assuming functions exist to get a list of streams, determine whether a submission belongs to a deleted username, and determine whether the file was changed by an addition:
while True:
    streams = get_list_of_streams_from_csv()
    for submissions in itertools.zip_longest(*streams):  # note: pause_after=-1 streams yield None while paused
        for submission in itertools.filterfalse(is_deleted, submissions):
            pass  # processing
        if csv_changed_to_add():
            break
Adding in additional streams, capturing deletion with .filterfalse:
streams = get_list_of_streams_from_csv()
zip_iter = itertools.zip_longest(*streams)
while True:
    for submissions in zip_iter:
        for submission in itertools.filterfalse(is_deleted, submissions):
            pass  # processing
        if csv_changed_to_add():
            break
    # wrap the old iterator together with the newly added streams
    # (caveat: groups from the old iterator arrive as nested tuples, so flatten if needed)
    zip_iter = itertools.zip_longest(zip_iter, *get_list_of_new_streams())
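For completeness, a rough sketch of what get_list_of_streams_from_csv could look like with PRAW (the csv layout and the "username" column are assumptions, and reddit is the authenticated praw.Reddit instance from the question):
import csv

def get_list_of_streams_from_csv(path="usernames.csv"):
    # one submission stream per username listed in the csv file
    with open(path, newline="") as f:
        usernames = [row["username"] for row in csv.DictReader(f)]
    return [
        reddit.redditor(name).stream.submissions(skip_existing=True, pause_after=-1)
        for name in usernames
    ]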


Fetching big SQL table in the web app session

I'm quite new to web apps, so apologies if my question is a bit basic. I'm developing a web app with R shiny where the inputs are very large tables from an Azure SQL server. There are 20 tables, each on the order of hundreds of thousands of rows and hundreds of columns, containing numbers, characters, etc. I have no problem calling them; my main issue is that it takes so much time to fetch everything from the Azure SQL server, approximately 20 minutes, so the user of the web app has to wait quite a long time.
I'm using DBI package as follows:
db_connect <- function(database_config_name) {
  dbConfig <- config::get(database_config_name)
  connection <- DBI::dbConnect(odbc::odbc(),
                               Driver = dbConfig$driver,
                               Server = dbConfig$server,
                               UID = dbConfig$uid,
                               PWD = dbConfig$pwd,
                               Database = dbConfig$database,
                               encoding = "latin1"
  )
  return(connection)
}
and then fetching tables by:
connection <- db_connect(db_config_name)
table <- dplyr::tbl(connection,
                    dbplyr::in_schema(fetch_schema_name(db_config_name, table_name, data_source_type),
                                      fetch_table_name(db_config_name, table_name, data_source_type)))
I have searched a lot but didn't come across a good solution; I would appreciate any solution that can tackle this problem.
I work with R accessing SQL Server (not Azure) daily. For larger data (as in your example), I always revert to using the command-line tool sqlcmd; it is significantly faster. The only pain points for me were learning the arguments and working around the fact that it does not return proper CSV, so some post-query munging is required. You may have an additional pain point of having to adjust my example to connect to your Azure instance (I do not have an account).
In order to use this in a shiny environment and preserve its interactivity, I use the processx package to start the process in the background and then poll its exit status periodically to determine when it has completed.
Up front: this is mostly a "loose guide"; I do not pretend that this is a fully-functional solution for you. There might be some rough edges that you need to work through yourself. For instance, while I say you can do it asynchronously, it is up to you to work the polling process and delayed data availability into your shiny application. My answer here covers starting the process and reading the file once it is complete. And finally, if encoding= is an issue for you, I don't know whether sqlcmd handles non-latin text correctly, and I don't know if or how to fix that with its very limited, even antiquated, arguments.
Steps:
Save the query into a text file. Short queries can be provided on the command-line, but past some point (128 chars? I don't know that it's clearly defined, and have not looked enough recently) it just fails. Using a query-file is simple enough and always works, so I always use it.
I always use temporary files for each query instead of hard-coding the filename; this just makes sense. For convenience (for me), I use the same tempfile base name and append .sql for the query and .csv for the returned data, that way it's much easier to match query-to-data in the temp files. It's a convention I use, nothing more.
tf <- tempfile()
# using the same tempfile base name for both the query and csv-output temp files
querytf <- paste0(tf, ".sql")
writeLines(query, querytf)
csvtf <- paste0(tf, ".csv")
# these may be useful for troubleshooting, but not always [^2]
stdouttf <- paste0(tf, ".stdout")
stderrtf <- paste0(tf, ".stderr")
Make the call. I suggest you see how fast this is in a synchronous way first to see if you need to add an async query and polling in your shiny interface.
exe <- "/path/to/sqlcmd"   # or "sqlcmd.exe"
args <- c("-W", "-b", "-s", "\037", "-i", querytf, "-o", csvtf,
          "-S", dbConfig$server, "-d", dbConfig$database,
          "-U", dbConfig$uid, "-P", dbConfig$pwd)
## as to why I use "\037", see [^1]
## note that the user id and password will be visible on the shiny server
## via a `ps -fax` command-line call
proc <- processx::process$new(command = exe, args = args,
                              stdout = stdouttf, stderr = stderrtf) # other args exist
# this should return immediately, and should be TRUE until
# data retrieval is done (or error)
proc$is_alive()
# this will hang (pause R) until retrieval is complete; if/when you
# shift to asynchronous queries, do not do this
proc$wait()
One can use processx::run instead of process$new and proc$wait(), but I thought I'd start you down this path in case you want/need to go asynchronous.
If you go with an asynchronous operation, then periodically check (perhaps every 3 or 10 seconds) proc$is_alive(). Once that returns FALSE, you can start processing the file. During this time, shiny will continue to operate normally. (If you do not go async and therefore choose to proc$wait(), then shiny will hang until the query is complete.)
If you neither proc$wait() nor poll, and just try to read the file right away, that's a mistake. The file may not exist yet, in which case reading it will fail with No such file or directory. It may exist but be empty, or it may exist with incomplete data. So make a firm decision: either stay synchronous and call proc$wait(), or go asynchronous and poll periodically until proc$is_alive() returns FALSE.
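If you do go asynchronous, the polling piece might look something like the following inside your shiny server function (a minimal sketch, assuming proc and csvtf from the steps above; read_sqlcmd_csv is a hypothetical helper wrapping the two-step read shown in the next step):
query_result <- reactiveVal(NULL)

observe({
  req(is.null(query_result()))     # stop polling once the data has been read
  if (proc$is_alive()) {
    invalidateLater(5000)          # still running: check again in 5 seconds
  } else {
    # sqlcmd has exited; read the csv and store it reactively
    query_result(read_sqlcmd_csv(csvtf))
  }
})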
Reading in the file. There are three "joys" of using sqlcmd that require special handling of the file.
It does not do embedded quotes consistently, which is why I chose to use "\037" as a separator. (See [^1].)
It adds a line of dashes under the column names, which will corrupt the auto-classing of data when R reads in the data. For this, we do a two-step read of the file.
Nulls in the database are the literal NULL string in the data. For this, we update the na.strings= argument when reading the file.
exitstat <- proc$get_exit_status()
if (exitstat == 0) {
  ## read #1: get the column headers
  tmp1 <- read.csv(csvtf, nrows = 2, sep = "\037", header = FALSE)
  colnms <- unlist(tmp1[1,], use.names = FALSE)
  ## read #2: read the rest of the data
  out <- read.csv(csvtf, skip = 2, header = FALSE, sep = "\037",
                  na.strings = c("NA", "NULL"), quote = "")
  colnames(out) <- colnms
} else {
  # you should check both stdout and stderr files, see [^2]
  stop("'sqlcmd' exit status: ", exitstat)
}
Notes:
[^1]: After a lot of pain with several issues (some in sqlcmd.exe, some in data.table::fread and other readers, all dealing with CSV-format non-compliance), at one point I chose to stop working with comma-delimited returns, instead opting for the "\037" field delimiter. It works fine with all CSV-reading tools and has fixed so many problems (some not mentioned here). If you're not concerned, feel free to change the args to "-s", "," (adjusting the read as well).
[^2]: sqlcmd seems to use stdout or stderr in different ways when there are problems. I'm sure there's a rationale somewhere, but the point is that if there is a problem, check both files.
I added the use of both stdout= and stderr= because of a lot of troubleshooting I did, and continue to do if I munge a query. Using them is not strictly required, but you might be throwing caution to the wind if you omit those options.
By the way, if you choose to only use sqlcmd for all of your queries, there is no need to create a connection object in R. That is, db_connect may not be necessary. In my use, I tend to use "real" R DBI connections for known-small queries and the bulk sqlcmd for anything above around 10K rows. There is a tradeoff; I have not measured it sufficiently in my environment to know where the tipping point is, and it is likely different in your case.

Issues pulling change log using python

I am trying to query and pull changelog details using python.
The below code returns the list of issues in the project.
issued = jira.search_issues('project= proj_a', maxResults=5)
for issue in issued:
    print(issue)
I am then trying to pass the issue values obtained above into the following:
issues = jira.issue(issue,expand='changelog')
changelog = issues.changelog
projects = jira.project(project)
I get the below error on trying the above:
JIRAError: JiraError HTTP 404 url: https://abc.atlassian.net/rest/api/2/issue/issue?expand=changelog
text: Issue does not exist or you do not have permission to see it.
Could anyone advise where I am going wrong, or what permissions I need?
Please note: if I pass a specific issue_id in the above code it works just fine, but I am trying to pass a list of issue_ids.
You can already receive all the changelog data in the search_issues() method so you don't have to get the changelog by iterating over each issue and making another API call for each issue. Check out the code below for examples on how to work with the changelog.
issues = jira.search_issues('project= proj_a', maxResults=5, expand='changelog')
for issue in issues:
    print(f"Changes from issue: {issue.key} {issue.fields.summary}")
    print(f"Number of Changelog entries found: {issue.changelog.total}")  # number of changelog entries (careful, each entry can have multiple field changes)
    for history in issue.changelog.histories:
        print(f"Author: {history.author}")  # person who did the change
        print(f"Timestamp: {history.created}")  # when did the change happen?
        print("\nListing all items that changed:")
        for item in history.items:
            print(f"Field name: {item.field}")  # field to which the change happened
            print(f"Changed to: {item.toString}")  # new value, item.to might be better in some cases depending on your needs.
            print(f"Changed from: {item.fromString}")  # old value, item.from might be better in some cases depending on your needs.
            print()
        print()
Just to explain what you did wrong before when iterating over each issue: you have to use the issue.key, not the issue-resource itself. When you simply pass the issue, it won't be handled correctly as a parameter in jira.issue(). Instead, pass issue.key:
for issue in issues:
    print(issue.key)
    myIssue = jira.issue(issue.key, expand='changelog')

ETL to csv files, split up and then pushed to s3 to be consumed by redshift

Just getting started with Kiba, didn't find anything obvious, but I could be just channeling my inner child (who looks for their shoes by staring at the ceiling).
I want to dump a very large table to Amazon Redshift. It seems that the fastest way to do that is to write out a bunch of CSV files to an S3 bucket, then tell Redshift (via the COPY command) to pull them in. Magical scaling gremlins will do the rest.
So, I think that I want Kiba to write a CSV file for every 10k rows of data, then push it to s3, then start writing to a new file. At the end, make a post-processing call to COPY
So, can I "pipeline" the work or should this be a big, nested Destination class?
i.e.
source -> transform -> transform ... -> [ csv -> s3 ]{every 10000}; post-process
Kiba author here. Thanks for trying it out!
Currently, the best way to implement this is to create what I'd call a "buffering destination". (A version of that will likely end up in Kiba Common at some point).
(Please test thoroughly; I just wrote this for you this morning and didn't run it at all, although I've used less generic versions in the past. Also keep in mind that this version uses an in-memory buffer for your 10k rows, so growing the number to something much larger will consume memory. A less memory-consuming version could also be created, which would write rows to file as you get them.)
class BufferingDestination
  def initialize(buffer_size:, on_flush:)
    @buffer = []
    @buffer_size = buffer_size
    @on_flush = on_flush
    @batch_index = 0
  end

  def write(row)
    @buffer << row
    flush if @buffer.size >= @buffer_size
  end

  def flush
    @on_flush.call(batch_index: @batch_index, rows: @buffer)
    @batch_index += 1
    @buffer.clear
  end

  def close
    flush
  end
end
You can then use it like this, for instance reusing the Kiba Common CSV destination (although you can write your own too):
require 'kiba-common/destinations/csv'

destination BufferingDestination,
  buffer_size: 10_000,
  on_flush: ->(batch_index:, rows:) {
    filename = File.join("output-#{sprintf("%08d", batch_index)}")
    csv = Kiba::Common::Destinations::CSV.new(
      filename: filename,
      csv_options: { ... },
      headers: %w(my fields here)
    )
    rows.each { |r| csv.write(r) }
    csv.close
  }
You could then trigger your COPY right in the on_flush block after generating the file (if you want the upload to start right away), or in a post_process block (but this would only start after all the CSVs are ready, which can be a feature if you want some form of transactional, global upload).
You could go fancy and start a thread queue to actually handle the uploads in parallel if you really need this (but then be careful with zombie threads, etc.).
Another way is to have a "multiple steps" ETL process, with one script generating the CSVs and another one picking them up for upload, running concurrently (this is something I explained in my talk at RubyKaigi 2018, for instance).
Let me know how things work for you!
I'm not sure I fully understand your exact question here, but your overall approach seems correct. A few suggestions, though:
You may want to consider having even more than 10K records per CSV file, and gzipping them while sending to S3.
You may want to look at creating a manifest containing the list of the multiple files, and then running the COPY command supplying the manifest file as input.
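For reference, a Redshift manifest is just a small JSON file listing the S3 objects to load; something along these lines (bucket, table and IAM role names are placeholders):
{
  "entries": [
    { "url": "s3://my-bucket/output-00000000.csv.gz", "mandatory": true },
    { "url": "s3://my-bucket/output-00000001.csv.gz", "mandatory": true }
  ]
}
The COPY command then references the manifest:
COPY my_table
FROM 's3://my-bucket/batches.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
MANIFEST GZIP CSV;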
Thibaut, I did something similar, except that I streamed it out to a Tempfile, I think...
require 'csv'
require 'tempfile'

# @param limit [Integer, 1_000] Number of rows per csv file
# @param callback [Proc] Proc taking one argument [CSV/io], that can be used after
#   each csv file is finished
module PacerPro
  class CSVDestination
    def initialize(limit: 1_000, callback: ->(obj) { })
      @limit = limit
      @callback = callback
      @csv = nil
      @row_count = 0
    end

    # @param row [Hash] returned from transforms
    def write(row)
      csv << row.values
      @row_count += 1
      return if row_count < limit
      self.close
    end

    # Called by Kiba when the transform pipeline is finished
    def close
      csv.close
      callback.call(csv)
      tempfile.unlink
      @csv = nil
      @row_count = 0
    end

    private

    attr_reader :limit, :callback
    attr_reader :row_count, :tempfile

    def csv
      @csv ||= begin
        @tempfile = Tempfile.new('csv')
        CSV.open(@tempfile, 'w')
      end
    end
  end
end

Duplicate audio files in preloadjs manifest for LoadQueue

I am populating a manifest array for use by preloadjs's LoadQueue class. In the manifest, I am referencing sources to both audio and image files while creating unique ids for each. This all had been working great.
However, as the audio/image files are selected from a CMS database (Wordpress custom post types), it may be the case that the same audio file is selected more than once. In other words, the same audio file may appear in the manifest more than once. When this happens, a very odd bug occurs where the last IMAGE reference in the resultant LoadQueue instance returns "undefined". It doesn't matter where the duplicate audio occurs in the manifest array; it's always the last IMAGE object in the LoadQueue instance that returns undefined.
Duplicate image files do NOT cause a problem.
Is this a bug in preloadjs? (yes, it is of course wasteful to load more than one copy of the same audio file, but in my use case, we are talking about small files and a finite number of posts)
var manifest = [];
for (var i = 0; i < game_pieces_data.length; i++) {
    manifest.push({id: "sound" + i, src: game_pieces_data[i].audio_url});
    manifest.push({id: "image" + i, src: game_pieces_data[i].image_url});
}

preload = new createjs.LoadQueue();
preload.installPlugin(createjs.Sound);
preload.on("complete", handlePreloadComplete, this);
preload.loadManifest(manifest);

function handlePreloadComplete() {
    var bitmap;
    for (var i = 0; i < game_pieces_data.length; i++) {
        bitmap = new createjs.Bitmap(preload.getResult('image' + i));
        // bitmap.image.width <- will return undefined for last item of
        // the loop if one of the audio files is a duplicate?
        ...
    }
}
EDIT: I've determined that LoadQueue's "complete" event is firing before the final "fileload" event fires (the last image). This is the reason the final image is undefined when asked for in my handler for the "complete" event. Again, this only happens when there is a duplicate audio file.
EDIT2: To further narrow down the issue, I've created a manifest that only loads audio files and traced the "fileload" and "complete" events. For every additional duplicate audio file, the number of "fileload" events fired for that file increases by 1 (2 dups = fileload fires 3 times for that file, 3 dups = fileload fires 4 times for that file, etc.). Additionally, an extra copy is added to the LoadQueue instance's array of files (accessed by getResult).
However, the "complete" event fires once the length of the manifest is reached, hence the additional "fileload" events firing after the complete event. The harm comes when you have a manifest with mixed files: in my case, my image files are getting pushed to the end of the queue by the extra duplicate audio entries. And since "complete" fires at the end of the manifest length, it fires before any image files pushed to the end of the queue have loaded, causing errors in code that expects those files to be there after the queue completes.
I am working around this by creating two LoadQueue instances, one for the audio and one for the images. When the audio queue's "complete" fires, I create the image queue and load the images from a separate manifest. This is not ideal, however, as there are now multiple useless copies of the duplicated audio files in memory, and this number increases exponentially with each additional duplicate that may be selected in the CMS.
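For reference, a rough sketch of that two-queue workaround (untested; it reuses game_pieces_data and handlePreloadComplete from the snippet above):
var audioManifest = [];
var imageManifest = [];
for (var i = 0; i < game_pieces_data.length; i++) {
    audioManifest.push({id: "sound" + i, src: game_pieces_data[i].audio_url});
    imageManifest.push({id: "image" + i, src: game_pieces_data[i].image_url});
}

var audioQueue = new createjs.LoadQueue();
audioQueue.installPlugin(createjs.Sound);
audioQueue.on("complete", function () {
    // only start loading images once all audio (including duplicates) has finished
    preload = new createjs.LoadQueue();
    preload.on("complete", handlePreloadComplete, this);
    preload.loadManifest(imageManifest);
}, this);
audioQueue.loadManifest(audioManifest);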

Redirect and parse in real time the stdout of a long-running process in vb.net

This code executes "handbrakecli" (a command line application) and places the output into a string:
Dim p As Process = New Process
p.StartInfo.FileName = "handbrakecli"
p.StartInfo.Arguments = "-i [source] -o [destination]"
p.StartInfo.UseShellExecute = False
p.StartInfo.RedirectStandardOutput = True
p.Start
Dim output As String = p.StandardOutput.ReadToEnd
p.WaitForExit
The problem is that this can take up to 20 minutes to complete, during which nothing is reported back to the user. Once it has completed, they'll see all the output from the application, which includes progress details. Not very useful.
Therefore I'm trying to find a sample that shows the best way to:
Start an external application (hidden)
Monitor its output periodically as it displays information about its progress (so I can extract this and present a nice percentage bar to the user)
Determine when the external application has finished (so I can continue with my own application's execution)
Kill the external application if necessary and detect when this has happened (so that if the user hits "cancel", I can take the appropriate steps)
Does anyone have any recommended code snippets?
The StandardOutput property is of type StreamReader, which has methods other than ReadToEnd.
It would be more code, but if you used the Read method, you could do other things like provide the user with the opportunity to cancel or report some type of progress.
Link to Read Method with code sample:
http://msdn.microsoft.com/en-us/library/ath1fht8(v=VS.90).aspx
Edit:
The Process class also has a BeginOutputReadLine method which is an asynchronous method call with callback.
http://msdn.microsoft.com/en-us/library/system.diagnostics.process.beginoutputreadline(v=VS.90).aspx
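For example, a rough sketch of the asynchronous route (untested; assumes VB 2010 or later for the lambda handlers, and that handbrakecli flushes its progress output line by line):
Dim p As New Process()
p.StartInfo.FileName = "handbrakecli"
p.StartInfo.Arguments = "-i [source] -o [destination]"
p.StartInfo.UseShellExecute = False
p.StartInfo.RedirectStandardOutput = True
p.StartInfo.CreateNoWindow = True
p.EnableRaisingEvents = True

AddHandler p.OutputDataReceived,
    Sub(sender, e)
        If e.Data IsNot Nothing Then
            ' parse e.Data here (e.g. extract the percentage) and update the UI;
            ' remember to marshal back to the UI thread (Control.Invoke) first
        End If
    End Sub

AddHandler p.Exited,
    Sub(sender, e)
        ' the external application has finished (or was killed)
    End Sub

p.Start()
p.BeginOutputReadLine()

' later, if the user hits "cancel":
' If Not p.HasExited Then p.Kill()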