Celery file upload

Is it possible to make celery process file uploads?
I'm using Pootle in my Django project, and some files to be translated take too long to upload. As far as I understand, Celery tries to serialize the arguments of the executing function, and since one of them is a file, execution gets stuck.
I'd rather not modify Pootle's behaviour (even if that's possible), so how can I solve the issue?

Pootle makes use of RQ, so that would be a more natural fit for background jobs.
I assume you are using update_stores for uploading the files, rather than the UI. The UI can potentially time out. While update_stores does not run in the background, it is easier to execute safely and to build into scripts.
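If you do end up writing your own background job for the import, a rough sketch with RQ might look like this (import_file and its module are made-up names, not part of Pootle; the point is to enqueue a path rather than an open file object, so the job arguments serialize cleanly):

# Sketch: pushing a long-running import to RQ. Names below are hypothetical.
from redis import Redis
from rq import Queue

from myapp.imports import import_file  # your own wrapper around the import logic

q = Queue('imports', connection=Redis())

# Enqueue a filesystem path (or a database ID), not an open file handle,
# so the arguments can be serialized without getting stuck.
job = q.enqueue(import_file, '/var/uploads/translations/project.po')
print(job.id)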


How to supply snakemake options within the Snakefile itself?

I would like to specify Singularity bind paths inside the Snakefile (i.e. the Snakemake script) and not via the command line. I believe this could be done somehow via their API, e.g. from snakemake import something, etc. How do I achieve this?
Broadly speaking, how do we supply options/arguments to Snakemake via its API from within a Snakefile?
I made a pipeline that does a couple of things, and one of those things is downloading samples. Starting 30 downloads at the same time is a waste of resources, so I wanted to limit the number of parallel downloads, and I don't want to always pass --resources parallel_downloads=1 on the command line. I noticed that snakemake.workflow exists when a Snakefile is executed, so I added this as a resource:
workflow.global_resources.update({'parallel_downloads': 1})
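For what it's worth, a rule can then draw on that resource so that only one download runs at a time; a minimal sketch (the rule name, output path and URL are invented):

# Sketch: a rule consuming the global resource set above.
rule download_sample:
    output:
        "data/{sample}.fastq.gz"
    resources:
        parallel_downloads=1  # taken from the global pool of 1, so downloads run one at a time
    shell:
        "wget -O {output} https://example.org/{wildcards.sample}.fastq.gz"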
I have no experience with Singularity, so I don't fully understand what you want. But my guess is that this is stored somewhere in the workflow object and you can change it there.
P.S. This is not at all through an API, nor guaranteed to work between versions.

Is there a way to list all the cached scripts in redis?

SCRIPT EXISTS sha1
The above will tell you if a script exists, but is there a way to list all the cached scripts in Redis?
Thanks!
You can't do this, but what would be the purpose of listing loaded scripts, given that scripts have to be loaded from the application layer in the first place? That is, which scripts are loaded is information your code already knows.
I don't know what programming language or framework you're using, but whatever platform it is, you just need some code that intercepts the moment you load a script into Redis and fires an event to be handled somewhere.
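For example, with redis-py you could wrap script loading so that the application keeps its own registry of SHAs; a minimal sketch (the registry is just an ordinary dict):

# Sketch: keep an application-side registry of loaded scripts (Python / redis-py).
import redis

r = redis.Redis()
loaded_scripts = {}  # name -> sha1, maintained entirely by the application

def load_script(name, lua_source):
    sha = r.script_load(lua_source)  # returns the sha1 Redis caches the script under
    loaded_scripts[name] = sha
    return sha

load_script('incr_by', "return redis.call('INCRBY', KEYS[1], ARGV[1])")
print(loaded_scripts)                             # your own list of cached scripts
print(r.script_exists(*loaded_scripts.values()))  # cross-check against the server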

How to implement a system-wide text replacement in Windows programmatically?

I have a small VB.NET application that, among other things, attempts to substitute text typed by the user system-wide (the hotstrings concept). To achieve that, I have deployed 'ahk2exe' and 'AutoHotkeySC.bin' with my application and do the following:
When a user assigns a new 'hotstring':
Kill the hotstring script exe if it is running
Append the new hotstring to the script file (if none exists, create a new one)
Convert the edited/new script file to an exe (using ahk2exe)
Run the newly converted script exe
(Somewhere in there I also check whether the hotstring has already been assigned.)
However, I am not totally satisfied with this method, for two main reasons:
The extra resources deployed with the application.
Lag: killing the process and restarting it takes a minimum of 5 seconds on my fast computer, and longer on other computers. That is much more than the time it takes the user to assign the hotstring, minimize or close the window, and then test the new hotstring. When that first test fails, users will think the whole process failed, so this method is not good for the user experience.
So I am looking for a different method or implementation. Maybe using keyboard hooks? Or maybe adding a .dll library that achieves the same thing? Are there any resources you know of that might help (free or commercial)? What is the best way to achieve my goal?
Many thanks for your help.
Implementing what AutoHotkey does would be a pretty non-trivial task.
But I'm pretty sure that AHK supports an "autoreload" option for scripts.
Googling "autohotkey auto reload" turns up several pages discussing that very concept. If that works, all you'd have to do is update the script file and that's it; AHK should automatically reload the script.
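If you do experiment with the hook-based approach instead, the concept looks roughly like this. The sketch below uses the third-party Python "keyboard" package purely to illustrate the idea; a VB.NET application would need a low-level keyboard hook (e.g. SetWindowsHookEx) or an equivalent library instead:

# Illustration only: hotstring-style replacement via a keyboard hook,
# using the third-party Python "keyboard" package.
import keyboard

# Replace the abbreviation with the full text as soon as it is typed.
keyboard.add_abbreviation('btw', 'by the way')
keyboard.add_abbreviation('@@', 'someone@example.com')

keyboard.wait()  # keep the hook running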

How can I speed up a batch processing job in ColdFusion?

Every once in a while I am fed a large data file that my client uploads and that needs to be processed through CFML. The problem is that if I put the processing on a CF page, it runs into a timeout issue after 120 seconds. I was able to move the processing code to a CFC, where it seems not to have the timeout issue. However, at some point during the processing it causes ColdFusion to crash and it has to be restarted. There are a number of database queries (5 or more, a mixture of updates and selects) required for each line (8,000+) of the file, as well as other logic I provide in the form of CFML.
My question is: what would be the best way to work through this file? One caveat: I am not able to move the file to the database server and process it entirely within the DB. However, would it be more efficient to pass each line to a stored procedure that took care of everything? It would still be a lot of calls to the database, but nothing compared to what I have now. Also, what would be the best way to provide feedback to the user about how much of the file has been processed?
Edit:
I'm running CF 6.1
I just did a similar thing, and I use CF often for data parsing.
1) Maintain a file upload table (parent table). For every file you upload, keep a record of the file and what status it is in (uploaded, processed, unprocessed).
2) Maintain a temp table to store all the rows of the data file (child table). Import the entire data file into this temporary table; attempting to do it all in memory will inevitably lead to errors. Each row in this table links back to a file upload table entry above.
3) Maintain a processing status - for each row of the data file you bring in, set a "processed/unprocessed" flag. This way, if it breaks, you can start from where you left off. As you run through each line, set it to "processed".
4) Transaction - use cftransaction if possible to commit all of it at once, or at least one line at a time (with your 5 queries). That way, if something goes boom, you don't have one row of data that is half computed/processed/updated/tested.
5) Once you're done processing, set the file name entry in the table from step 1 to "processed".
By using the approach above, if something fails you can set it to start where it left off, or at least have a clearer path of where to start investigating, or, worst case, clean up your data. You will also have a clear way of displaying to the user the status of the current upload: how far it has got and where it left off if there was an error.
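To make the flow concrete, here is the same staging-table pattern sketched in Python with sqlite3. This is an illustration only (table and column names are invented); in your case the tables and per-line queries would live in your own database and CFML:

# Sketch of the staging-table pattern: resumable, one transaction per line.
import sqlite3

db = sqlite3.connect('staging.db')
db.executescript("""
CREATE TABLE IF NOT EXISTS uploads (
    id INTEGER PRIMARY KEY, filename TEXT, status TEXT DEFAULT 'uploaded');
CREATE TABLE IF NOT EXISTS upload_rows (
    id INTEGER PRIMARY KEY, upload_id INTEGER, line TEXT, status TEXT DEFAULT 'unprocessed');
""")

def handle_line(line):
    pass  # placeholder for the per-line queries and business logic

def process_pending(upload_id):
    rows = db.execute(
        "SELECT id, line FROM upload_rows WHERE upload_id = ? AND status = 'unprocessed'",
        (upload_id,)).fetchall()
    for row_id, line in rows:
        with db:  # one transaction per line: its updates commit together or not at all
            handle_line(line)
            db.execute("UPDATE upload_rows SET status = 'processed' WHERE id = ?", (row_id,))
    with db:
        db.execute("UPDATE uploads SET status = 'processed' WHERE id = ?", (upload_id,))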
If you have any questions, let me know.
Other thoughts:
You can increase timeouts, give the VM more memory, and run it in 64-bit, but all of those will only increase the capacity of your system so much. It's a good idea to set these per call and to do it in conjunction with the above.
Java has some neat file-processing libraries that are available as CFCs. If you run into a lot of issues with speed, you can use one of those to read the file into a variable and then into the database.
If you are working with XML, do not use ColdFusion's XML parsing. It works well for smaller files but has fits when things get bigger. There are several CFCs out there (check RIAForge, etc.) that wrap some excellent Java libraries for parsing XML data. You can then create a cfquery manually with this data if need be.
It's hard to tell without more info, but from what you have said I'll throw out three ideas.
The first thing: with so many database operations, it's possible that you are generating too much debugging output. Make sure that the following settings are turned off under Debug Output Settings in the administrator:
Enable Robust Exception Information
Enable AJAX Debug Log Window
Request Debugging Output
The second thing I would do is look at those DB queries and make sure they are optimized. Make sure selects are using indices, etc.
The third thing I would suspect is that keeping the whole file in memory is probably suboptimal.
I would try reading the file line by line with cfloop's file looping:
<cfloop file="#VARIABLES.filePath#" index="VARIABLES.line">
<!--- Code to go here --->
</cfloop>
Have you tried an event gateway? I believe those threads are not subject to the same timeout settings as page request threads.
SQL Server Integration Services (SSIS) is the recommended tool for complex ETL (Extract, Transform, and Load) work, which is what this sounds like. (It can be configured to access files on other servers.) The question might be, can you work up an interface between Cold Fusion and SSIS?
If you can, upgrade to CF8 and take advantage of cfloop file="", which gives you greater speed and does not read the whole file into memory (which is probably the cause of the crashing).
Depending on the situation, you could also use cfthread to speed up processing.
Currently, an event gateway is the only way to get around the timeout limits of an HTTP request cycle. CF does not have a way to process CF pages offline; that is, there is no command-line invocation (one of my biggest gripes about CF: very little offline processing).
Your best bet is to use an Event Gateway or rewrite your parsing logic in straight Java.
I had to do the same thing. Ben Nadel has written a bunch of great articles on using Java file I/O to read and write files more quickly.
It really helped improve the performance of our CSV importing application.

How to reliably handle files uploaded periodically by an external agent?

It's a very common scenario: some process wants to drop a file on a server every 30 minutes or so. Simple, right? Well, I can think of a bunch of ways this could go wrong.
For instance, processing a file may take more or less than 30 minutes, so it's possible for a new file to arrive before I'm done with the previous one. I don't want the source system to overwrite a file that I'm still processing.
On the other hand, the files are large, so it takes a few minutes to finish uploading them. I don't want to start processing a partial file. The files are just transferred with FTP or SFTP (my preference), so OS-level locking isn't an option.
Finally, I do need to keep the files around for a while, in case I need to manually inspect one of them (for debugging) or reprocess one.
I've seen a lot of ad-hoc approaches to shuffling upload files around, swapping filenames, using datestamps, touching "indicator" files to assist in synchronization, and so on. What I haven't seen yet is a comprehensive "algorithm" for processing files that addresses concurrency, consistency, and completeness.
So, I'd like to tap into the wisdom of crowds here. Has anyone seen a really bulletproof way to juggle batch data files so they're never processed too early, never overwritten before done, and safely kept after processing?
The key is to do the initial juggling at the sending end. All the sender needs to do is:
Store the file with a unique filename.
As soon as the file has been sent, move it to a subdirectory called e.g. completed.
Assuming there is only a single receiver process, all the receiver needs to do is:
Periodically scan the completed directory for any files.
As soon as a file appears in completed, move it to a subdirectory called e.g. processed, and start working on it from there.
Optionally delete it when finished.
On any sane filesystem, file moves are atomic provided they occur within the same filesystem/volume. So there are no race conditions.
Multiple Receivers
If processing could take longer than the period between files being delivered, you'll build up a backlog unless you have multiple receiver processes. So, how to handle the multiple-receiver case?
Simple: each receiver process operates exactly as before. The key is that we attempt to move a file to processed before working on it: that, and the fact that same-filesystem file moves are atomic, means that even if multiple receivers see the same file in completed and try to move it, only one will succeed. All you need to do is check the return value of rename(), or whatever OS call you use to perform the move, and only proceed with processing if it succeeded. If the move failed, some other receiver got there first, so just go back and scan the completed directory again.
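To make it concrete, a receiver that claims files this way can be sketched in a few lines of Python; the directory paths here are placeholders, and os.rename is the atomic same-filesystem move described above:

# Sketch of a receiver that "claims" files by atomic rename (same filesystem).
import os
import time

COMPLETED = '/data/incoming/completed'   # where the sender moves finished uploads
PROCESSED = '/data/incoming/processed'   # where a receiver claims them

def process_file(path):
    pass  # placeholder for the real processing

while True:
    for name in os.listdir(COMPLETED):
        try:
            # Atomic on the same filesystem: if another receiver already moved
            # the file, this raises OSError and we simply skip it.
            os.rename(os.path.join(COMPLETED, name), os.path.join(PROCESSED, name))
        except OSError:
            continue
        process_file(os.path.join(PROCESSED, name))  # safe: the file is now exclusively ours
    time.sleep(30)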
If the OS supports it, use file system hooks to intercept open and close file operations - something like Dazuko. Other operating systems may let you know about file operations in another way; for example, Novell Open Enterprise Server lets you define epochs and read the list of files modified during an epoch.
Just realized that on Linux you can use the inotify subsystem, or the utilities from the inotify-tools package.
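If you would rather react from code than use the inotify-tools command-line utilities, the third-party Python watchdog library (which uses inotify on Linux) can watch a directory for you; a rough sketch, with the watched path as a placeholder:

# Sketch: react to new files in the "completed" directory using watchdog.
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            print('new file arrived:', event.src_path)  # kick off processing here

observer = Observer()
observer.schedule(NewFileHandler(), '/data/incoming/completed', recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()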
File transfer is one of the classics of system integration. I'd recommend getting the Enterprise Integration Patterns book to build your own answer to these questions -- to some extent, the answer depends on the technologies and platforms you are using for endpoint implementation and for file transfer. It's a quite comprehensive collection of workable patterns, and fairly well written.