downloading huge files - application using grails - sql

I am developing a RESTful web service that allows users to download data in csv and json formats that is dynamically retrieved from the database.
Right now I am using a StringWriter to write out the CSV data. My major concern is that the result set could get very large depending on the user input. In that case, holding it all in memory doesn't seem like a good idea to me.
I am thinking of creating a temp file, but how do I make sure the file gets deleted soon after the download completes?
Is there a better way to do this?
Thanks for the help.

If memory is the issue, you could simply write to the response writer, which writes directly to the output stream. This way you're not storing anything (much) in memory and there is no need to write out temporary files:
// controller action for CSV download
def download = {
    response.setContentType("text/csv")
    response.setHeader("Content-disposition", "attachment;filename=downloadFile.csv")
    def out = response.writer // write straight to the response instead of a StringWriter
    def results = [] // placeholder: replace with your query results
    results.each { result ->
        out << result.col1 << ',' << result.col2 // etc
        out << '\n'
    }
    out.flush()
}
This writes out to the output stream as it is looping round your results.
You can, in theory, make this even more memory efficient by using a scrollable result set (see the "Using Scrollable Results" section of Querying with GORM - Criteria) and looping over that whilst writing to the response writer. This means you're also not loading all your DB results into memory, but in practice it may not work as expected if you're using MySQL (and its Java connector). Manually batching up queries may work too (get DB rows 1-10000, write them out, get 10001-20000, etc.).
This kind of thing might be more difficult with JSON, depending on what library you're using to render your objects.
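To illustrate the manual batching idea, here is a minimal sketch in Python rather than Groovy. It assumes a hypothetical `fetch_batch(offset, limit)` callback standing in for the real database query; the generator yields one batch of CSV text at a time, so only one batch is ever held in memory.

```python
import csv
import io
from typing import Callable, Iterator, Sequence


def stream_csv(fetch_batch: Callable[[int, int], Sequence[Sequence]],
               batch_size: int = 10000) -> Iterator[str]:
    """Yield CSV text one batch at a time."""
    offset = 0
    while True:
        # fetch_batch is a hypothetical data-access callback (an assumption).
        rows = fetch_batch(offset, batch_size)
        if not rows:
            break
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        yield buf.getvalue()
        offset += len(rows)


# Stand-in for the database: 25 rows of (id, name).
DATA = [(i, f"name{i}") for i in range(25)]


def fetch_from_list(offset: int, limit: int):
    return DATA[offset:offset + limit]


chunks = list(stream_csv(fetch_from_list, batch_size=10))
```

In a real web handler each yielded chunk would be written to the response as it is produced instead of being collected into a list.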

Well, the simplest solution to preventing temp files from sticking around too long would be a cron job that simply deletes any file in the temp directory that has a modified time older than, say, 1 hour.
If you want it to all be done within Grails, you could design a Quartz job to clean up files. This job could either do as described above (and simply check modification timestamps to decide what to delete) or you could run the job only "on demand" with a parameter of a file name to be deleted. Once the download action is called you could schedule the cleanup of that specific file for X minutes later (to allow enough time for a successful download). The job would then be in charge of simply deleting the file.
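The timestamp-based cleanup could look roughly like the sketch below (Python, for illustration only). The function name, the directory, and the one-hour cutoff are assumptions; a cron or Quartz job would just call something like this periodically.

```python
import os
import tempfile
import time


def delete_stale_files(directory: str, max_age_seconds: float) -> list:
    """Delete regular files in `directory` whose mtime is older than the cutoff."""
    cutoff = time.time() - max_age_seconds
    with os.scandir(directory) as it:
        entries = list(it)  # snapshot the listing before deleting anything
    removed = []
    for entry in entries:
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


# Demo: one file backdated by two hours, one fresh file.
demo_dir = tempfile.mkdtemp()
old_path = os.path.join(demo_dir, "old.csv")
new_path = os.path.join(demo_dir, "new.csv")
for path in (old_path, new_path):
    open(path, "w").close()
two_hours_ago = time.time() - 7200
os.utime(old_path, (two_hours_ago, two_hours_ago))

removed = delete_stale_files(demo_dir, max_age_seconds=3600)
```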

Depending on the number of files involved you can always use http://download.oracle.com/javase/1,5.0/docs/api/java/io/File.html#deleteOnExit() to ensure the file is blown away when the VM shuts down.

To create a temp file that gets automatically deleted after the session has expired, you can use the Session Temp Files plugin.


Amazon S3: How to safely upload multiple files?

I have two client programs which are using S3 to communicate some information. That information is a list of files.
Let's call the clients the "uploader" and "downloader":
The uploader does something like this:
upload file A
upload file B
upload file C
upload a SUCCESS marker file
The downloader does something like this:
check for SUCCESS marker
if found, download A, B, C.
else, get data from somewhere else
and both of these programs are being run periodically. The uploader will populate a new directory when it is done, and the downloader will try to get the latest versions of A,B,C available.
Hopefully the intent is clear — I don't want the downloader to see a partial view, but rather get all of A,B,C or skip that directory.
However, I don't think that works, as written. Thanks to eventual consistency, the uploader's PUTs could be reordered into:
upload file B
upload a SUCCESS marker file
upload file A
...
And at this moment, the downloader might run, see the SUCCESS marker, and assume the directory is populated (which it is not).
So what's the right approach, here?
One idea is for the uploader to first upload A,B,C, then repeatedly check that the files are stored, and only after it sees all of them, then finally write the SUCCESS marker.
Would that work?
I stumbled upon a similar issue in my project.
If the intention is to guarantee cross-file consistency (between files A, B, C), the only possible solution (purely within S3) is:
1) to put them as NEW objects
2) do not explicitly check for existence using HEAD or GET request prior to the put.
These two constraints above are required for fully consistent read-after-write behavior (https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3-introduces-new-usability-enhancements/)
Each time you update the files, you need to generate a unique prefix (folder) name and put this name into your marker file (the manifest) which you are going to UPDATE.
The manifest will have a stable name but will be eventually consistent. Some clients may get the old version and some may get the new one.
The old manifest will point to the old "folder" and the new one will point to the new "folder". Thus each client will read only old files or only new files but never a mix, so cross-file consistency will be achieved. Different clients may still end up having different versions. If the clients keep pulling the manifest and updating on change, they will eventually become consistent too.
A possible solution for client inconsistency is to move the manifest metadata out of S3 into a consistent database (such as DynamoDB).
A few obvious caveats with pure s3 approach:
1) requires full set of files to be uploaded each time (incremental updates are not possible)
2) needs eventual cleanup of old obsolete folders
3) clients need to keep pulling manifest to get updated
4) clients may be inconsistent between each other
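The manifest scheme above can be sketched as follows. This is not S3 client code: a plain Python dict stands in for the bucket, and the key layout (`data/<uuid>/`, `manifest.json`) is an assumption made for the example.

```python
import json
import uuid


def publish(store: dict, files: dict) -> str:
    """Write every file under a fresh prefix, then update the manifest last."""
    prefix = f"data/{uuid.uuid4().hex}/"
    for name, body in files.items():
        store[prefix + name] = body  # always a brand-new key, never an overwrite
    store["manifest.json"] = json.dumps({"prefix": prefix, "files": sorted(files)})
    return prefix


def read_consistent(store: dict) -> dict:
    """Read the manifest, then only the files it points to."""
    manifest = json.loads(store["manifest.json"])
    return {name: store[manifest["prefix"] + name] for name in manifest["files"]}


store = {}
publish(store, {"A": "a1", "B": "b1", "C": "c1"})
stale_manifest = store["manifest.json"]
publish(store, {"A": "a2", "B": "b2", "C": "c2"})

latest = read_consistent(store)
# A client that still sees the old manifest gets the complete old set, not a mix.
old = read_consistent({**store, "manifest.json": stale_manifest})
```

The old prefix is deliberately left in place, which is why the eventual cleanup of obsolete folders is listed as a caveat.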
It is possible to do this with single copies in S3. Each file (A, B, C) will have a unique hash or version code prepended to it (e.g. an md5sum generated from the concatenation of all three files).
In addition the hash value will be uploaded to the bucket as well into a separate object.
When consuming the files, first read the hash file and compare it to the last hash successfully consumed. If it has changed, read the files and check the hash value within each. If they all match, the data is valid and may be used. If not, the downloaded files should be discarded and downloaded again (after a suitable delay).
This will catch the occasional race condition between write and read across multiple objects.
This works because the hash is repeated in all objects. The hash file is actually optional, serving as a low-cost and fast short cut for determining if the data is updated.
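A rough sketch of the embedded-hash check, assuming the tag is stored as the first line of each object (the answer does not specify a layout, so that part is made up for the example):

```python
import hashlib


def bundle(files: dict) -> dict:
    """Prefix every file body with one shared tag: the md5 of all bodies concatenated."""
    digest = hashlib.md5(b"".join(files[name] for name in sorted(files))).hexdigest()
    return {name: digest.encode() + b"\n" + body for name, body in files.items()}


def unbundle(objects: dict):
    """Return the file bodies if every object carries the same tag, else None."""
    tags, bodies = set(), {}
    for name, blob in objects.items():
        tag, _, body = blob.partition(b"\n")
        tags.add(tag)
        bodies[name] = body
    return bodies if len(tags) == 1 else None


v1_files = {"A": b"a1", "B": b"b1", "C": b"c1"}
v2_files = {"A": b"a2", "B": b"b2", "C": b"c2"}
v1, v2 = bundle(v1_files), bundle(v2_files)

clean_read = unbundle(v1)
# Simulate a reader that caught the uploader halfway through an update.
mixed_read = unbundle({"A": v2["A"], "B": v1["B"], "C": v2["C"]})
```

A `None` result means the reader should wait and download again.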

Cache data in SQL CE database

Background
I have an SQL CE database, that is constantly updated (every second).
I have a (web) application that allows a user to look at the data in real-time. At some point a user can click "take a snapshot" button, and it will open the snapshot in a different window.
And then on that form, there is "print" and "download" buttons that will either generate a page for printing, or will stream the data as CSV file - but same data snapshot has to be used, i.e. I can't go to the DB to get latest data for that.
Details
The SQL CE database is exposed through a WCF web service.
Snapshot consists of up to 500 records, 10 columns each. Expiration time on the snapshot of 2 hours is sufficient.
It is a low-traffic application, so I don't expect more than a few (5) connections at the same time.
Losing a snapshot is not a big deal; the user can simply generate a new one.
The database is accessed by a self-hosted WCF web service using Linq-to-SQL.
Web site is ASP.NET MVC hosted on UltiDev Cassini.
The database and web site will most likely be on the same box when deployed. The entire app is intranet bound.
Problem
I need to cache the snapshot of the data at the moment user pressed "take a snapshot" button, so that I can use same data to generate print page, or generate a file for download.
Solution 1:
Each time there is a need to generate a snapshot, I will create a table in the database. Since there are no temp tables in SQL CE, I will need to clean it up myself.
Solution 2:
Cache the snapshot in-memory on either DB server, or web server.
Question:
Is there anything wrong with proposed solutions? Any different solution suggestions?
A consideration is the typical usage pattern. Do most snapshots eventually result in either being printed or exported or both?
If such is the case, we might as well "get it in memory" (temporarily) in the form of a non blocking (asynchronous) select statement from the device to the server. In this fashion the data will "be there" or well on its way when user decides to use it.
If, on the other hand, many snapshots end up not being used, Solution #1 seems quite OK. The table could perhaps be named after the account/user, which guarantees "self clean-up" based on the number of snapshots a user can maintain at a given time (it seems to be just one, with even the tolerance of losing it sometimes).
500 rows by 10 columns isn't really very large at all. For the sake of simplicity in this case, I might just generate the CSV data at the same time I generate the initial snapshot page, and then place the CSV data in a hidden field in the snapshot page. The "Print" and "Download CSV" buttons would then POST the form that contains the CSV data to a Print page that generates the printable version from the posted CSV data, or a page that streams the CSV directly back to the client's browser, respectively. This way, at least, you wouldn't have any clean-up issues to deal with, and you avoid having to cache something on the server (either in the cache proper or in the database) that might well end up never being used at all.
If you cached the CSV data in a hidden field client-side, you could even handle both the printing and the CSV display completely client-side with javascript, although I don't know if that's worth the trouble or not.
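For comparison, the in-memory option from the question (Solution 2) could be sketched like this. It is written in Python purely for illustration; the class name, the key format, and the injectable clock are assumptions, and the 2-hour expiry comes from the question.

```python
import time
import uuid


class SnapshotCache:
    """Tiny in-memory snapshot store with a fixed time-to-live."""

    def __init__(self, ttl_seconds: float = 2 * 60 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}

    def put(self, rows) -> str:
        self._evict_expired()
        key = uuid.uuid4().hex
        self._items[key] = (self._clock() + self._ttl, list(rows))
        return key

    def get(self, key: str):
        self._evict_expired()
        item = self._items.get(key)
        return item[1] if item else None

    def _evict_expired(self) -> None:
        now = self._clock()
        expired = [k for k, (expires, _) in self._items.items() if expires <= now]
        for k in expired:
            del self._items[k]


# Demo with a fake clock so expiry can be shown without waiting.
now = [0.0]
cache = SnapshotCache(ttl_seconds=7200, clock=lambda: now[0])
key = cache.put([(1, "a"), (2, "b")])
fresh = cache.get(key)
now[0] = 7201.0
expired = cache.get(key)
```

With at most 500 rows per snapshot and a handful of users, the memory cost is small, and a lost snapshot only means the user takes a new one.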

How can I speed up batch processing job in Coldfusion?

Every once in a while I am fed a large data file that my client uploads and that needs to be processed through CFML. The problem is that if I put the processing on a CF page, it runs into a timeout issue after 120 seconds. I was able to move the processing code to a CFC, where it seems not to have the timeout issue. However, sometime during the processing, it causes ColdFusion to crash, and it has to be restarted. There are a number of database queries (5 or more, a mixture of updates and selects) required for each line (8,000+) of the file I go through, as well as other logic provided by me in the form of CFML.
My question is what would be the best way to go through this file. One caveat, I am not able to move the file to the database server and process it entirely with the DB. However, would it be more efficient to pass each line to a stored procedure that took care of everything? It would still be a lot of calls to the database, but nothing compared to what I have now. Also, what would be the best way to provide feedback to the user about how much of the file has been processed?
Edit:
I'm running CF 6.1
I just did a similar thing and use CF often for data parsing.
1) Maintain a file upload table (Parent table). For every file you upload you should be able to keep a list of each file and what status it is in (uploaded, processed, unprocessed)
2) Temp table to store all the rows of the data file. (child table) Import the entire data file into a temporary table. Attempting to do it all in memory will inevitably lead to some errors. Each row in this table will link to a file upload table entry above.
3) Maintain a processing status - For each row of the datafile you bring in, set a "process/unprocessed" tag. This way if it breaks, you can start from where you left off. As you run through each line, set it to be "processed".
4) Transaction - use cftransaction if possible to commit all of it at once, or at least one line at a time (with your 5 queries). That way if something goes boom, you don't have one row of data that is half computed/processed/updated/tested.
5) Once you're done processing, set the file name entry in the table in step 1 to be "processed"
By using the approach above, if something fails, you can set it to start where it left off, or at least have a clearer path of where to start investigating, or worst case clean up in your data. You will have a clear way of displaying to the user the status of the current upload processing, where it's at, and where it left off if there was an error.
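The staging-table approach in steps 1 to 5 can be sketched as below. This uses Python with an in-memory SQLite database instead of CFML, and the table names, column names, and the `handle_line` callback are all made up for the example.

```python
import sqlite3


def load_file(conn, file_name, lines):
    """Steps 1 and 2: register the upload and stage every line as unprocessed."""
    with conn:
        file_id = conn.execute(
            "INSERT INTO upload (name, status) VALUES (?, 'uploaded')", (file_name,)
        ).lastrowid
        conn.executemany(
            "INSERT INTO upload_row (file_id, line, processed) VALUES (?, ?, 0)",
            [(file_id, line) for line in lines],
        )
    return file_id


def process_pending(conn, file_id, handle_line):
    """Steps 3 to 5: one transaction per row, so a crash never leaves a row half done."""
    pending = conn.execute(
        "SELECT id, line FROM upload_row WHERE file_id = ? AND processed = 0 ORDER BY id",
        (file_id,),
    ).fetchall()
    for row_id, line in pending:
        with conn:  # commits on success, rolls back if handle_line raises
            handle_line(conn, line)
            conn.execute("UPDATE upload_row SET processed = 1 WHERE id = ?", (row_id,))
    with conn:
        conn.execute("UPDATE upload SET status = 'processed' WHERE id = ?", (file_id,))


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE upload (id INTEGER PRIMARY KEY, name TEXT, status TEXT);
CREATE TABLE upload_row (id INTEGER PRIMARY KEY, file_id INTEGER, line TEXT, processed INTEGER);
CREATE TABLE result (line TEXT);
""")
file_id = load_file(conn, "data.csv", ["a", "b", "c"])

fail_on = {"b"}  # simulate a crash on the second line during the first run


def handle_line(conn, line):
    if line in fail_on:
        raise RuntimeError("simulated crash")
    conn.execute("INSERT INTO result (line) VALUES (?)", (line,))


try:
    process_pending(conn, file_id, handle_line)
except RuntimeError:
    pass
done_after_crash = conn.execute(
    "SELECT COUNT(*) FROM upload_row WHERE processed = 1").fetchone()[0]

fail_on.clear()
process_pending(conn, file_id, handle_line)  # resumes with the unprocessed rows only
results = [r[0] for r in conn.execute("SELECT line FROM result ORDER BY rowid")]
status = conn.execute("SELECT status FROM upload WHERE id = ?", (file_id,)).fetchone()[0]
```

The processed-row count also gives a cheap progress figure to show the user while the file is being worked through.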
If you have any questions, let me know.
Other thoughts:
You can increase timeouts, give the VM more memory, put it in 64 bit but all of those will only increase the capacity of your system so much. It's a good idea to do these per call and do it in conjunction with the above.
Java has some neat file processing libraries that are available as CFCs. If you run into a lot of issues with speed, you can use one of those to read the file into a variable and then into the database.
If you are playing with XML, do not use coldfusion's xml parsing. It works well for smaller files and has fits when things get bigger. There are several cfc's written out there (check riaforge, etc) that wrap some excellent java libraries for parsing xml data. You can then create a cfquery manually if need be with this data.
It's hard to tell without more info, but from what you have said I'll throw out three ideas.
The first thing, is with so many database operations, it's possible that you are generating too much debugging. Make sure that under Debug Output settings in the administrator that the following settings are turned off.
Enable Robust Exception Information
Enable AJAX Debug Log Window
Request Debugging Output
The second thing I would do is look at those DB queries and make sure they are optimized. Make sure selects are using indexes, etc.
The third thing I would suspect is that the file hanging out in memory is probably suboptimal.
I would try looping through the file using file looping:
<cfloop file="#VARIABLES.filePath#" index="VARIABLES.line">
<!--- Code to go here --->
</cfloop>
Have you tried an event gateway? I believe those threads are not subject to the same timeout settings as page request threads.
SQL Server Integration Services (SSIS) is the recommended tool for complex ETL (Extract, Transform, and Load) work, which is what this sounds like. (It can be configured to access files on other servers.) The question might be, can you work up an interface between Cold Fusion and SSIS?
If you can upgrade to CF8, you can take advantage of cfloop file="", which would give you greater speed, and the file would not be put in memory (which is probably the cause of the crashing).
Depending on the situation you are encountering you could also use cfthread to speed up processing.
Currently, an event gateway is the only way to get around the timeout limits of an HTTP request cycle. CF does not have a way to process CF pages offline; that is, there is no command-line invocation (one of my biggest gripes about CF: very little offline processing).
Your best bet is to use an Event Gateway or rewrite your parsing logic in straight Java.
I had to do the same thing. Ben Nadel has written a bunch of great articles that use Java file I/O to let you read and write files more quickly.
Really helped improve the performance of our csv importing application.

Platform independent file locking?

I'm running a very computationally intensive scientific job that spits out results every now and then. The job is basically to just simulate the same thing a whole bunch of times, so it's divided among several computers, which use different OSes. I'd like to direct the output from all these instances to the same file, since all the computers can see the same filesystem via NFS/Samba. Here are the constraints:
Must allow safe concurrent appends. Must block if some other instance on another computer is currently appending to the file.
Performance does not count. I/O for each instance is only a few bytes per minute.
Simplicity does count. The whole point of this (besides pure curiosity) is so I can stop having every instance write to a different file and manually merging these files together.
Must not depend on the details of the filesystem. Must work with an unknown filesystem on an NFS or Samba mount.
The language I'm using is D, in case that matters. I've looked, there's nothing in the standard lib that seems to do this. Both D-specific and general, language-agnostic answers are fully acceptable and appreciated.
Over NFS you face some problems with client side caching and stale data. I have written an OS independent lock module to work over NFS before. The simple idea of creating a [datafile].lock file does not work well over NFS. The basic idea to work around it is to create a lock file [datafile].lock which if present means file is NOT locked and a process that wants to acquire a lock renames the file to a different name like [datafile].lock.[hostname].[pid]. The rename is an atomic enough operation that works well enough over NFS to guarantee exclusivity of the lock. The rest is basically a bunch of fail safe, loops, error checking and lock retrieval in case the process dies before releasing the lock and renaming the lock file back to [datafile].lock
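The core of the rename trick might look like the sketch below (Python, not the original module). It leaves out the retry loops and the recovery of locks held by dead processes that the answer mentions, and the function names are assumptions.

```python
import os
import socket
import tempfile


def acquire(lock_path: str):
    """Try to take the lock by renaming the 'free' token file.

    Returns the private name we now hold, or None if someone else has the lock.
    """
    held = f"{lock_path}.{socket.gethostname()}.{os.getpid()}"
    try:
        os.rename(lock_path, held)
        return held
    except FileNotFoundError:
        return None  # token file is gone, so another process holds the lock


def release(lock_path: str, held: str) -> None:
    """Give the lock back by renaming the token file to its 'free' name."""
    os.rename(held, lock_path)


workdir = tempfile.mkdtemp()
lock = os.path.join(workdir, "results.txt.lock")
open(lock, "w").close()  # the token file exists, so the data file is NOT locked

first = acquire(lock)
second = acquire(lock)   # fails while the lock is held
release(lock, first)
third = acquire(lock)    # succeeds again after release
```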
The classic solution is to use a lock file, or more accurately a lock directory. On all common OSs creating a directory is an atomic operation so the routine is:
try to create a lock directory with a fixed name in a fixed location
if the create failed, wait a second or so and try again - repeat until success
write your data to the real data file
delete the lock directory
This has been used by applications such as CVS for many years across many platforms. The only problem occurs in the rare cases when your app crashes while writing and before removing the lock.
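The lock-directory routine above can be sketched as a small context manager (Python here, although the question is about D). The retry interval and timeout values are assumptions, and nothing in this sketch handles the crash case described above, where the directory is left behind.

```python
import contextlib
import os
import tempfile
import time


@contextlib.contextmanager
def directory_lock(lock_dir: str, retry_seconds: float = 1.0, timeout_seconds: float = 60.0):
    """Hold the lock for the duration of the with-block by owning `lock_dir`."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            os.mkdir(lock_dir)  # fails if the directory already exists
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_dir}")
            time.sleep(retry_seconds)
    try:
        yield
    finally:
        os.rmdir(lock_dir)


def append_result(data_path: str, lock_dir: str, line: str) -> None:
    with directory_lock(lock_dir):
        with open(data_path, "a") as f:
            f.write(line + "\n")


workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "results.txt")
lock_dir = os.path.join(workdir, "results.lock")
append_result(data_path, lock_dir, "run 1")
append_result(data_path, lock_dir, "run 2")
with open(data_path) as f:
    lines = f.read().splitlines()
```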
Why not just build a simple server which sits between the file and the other computers?
Then if you ever wanted to change the data format, you would only have to modify the server, and not all of the clients.
In my opinion building a server would be much easier than trying to use a Network file system.
Lock File with a twist
Like other answers have mentioned, the easiest method is to create a lock file in the same directory as the datafile.
Since you want to be able to access the same file over multiple PC the best solution I can think of is to just include the identifier of the machine currently writing to the data file.
So the sequence for writing to the data file would be:
Check if there is a lock file present
If there is a lock file, see if I'm the one owning it by checking that its content has my identifier.
If that's the case, just write to the data file then delete the lock file.
If that's not the case, just wait a second or a small random length of time and try the whole cycle again.
If there is no lock file, create one with my identifier and try the whole cycle again to avoid race condition (re-check that the lock file is really mine).
Along with the identifier, I would record a timestamp in the lock file and check whether it's older than a given timeout value.
If the timestamp is too old, then assume that the lock file is stale and just delete it, as it would mean one of the PCs writing to the data file may have crashed or its connection may have been lost.
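A sketch of the owner-plus-timestamp lock file follows. Note that it swaps the check-then-create step for an exclusive create (`O_CREAT | O_EXCL`), which is a different technique from the re-check cycle described above and is not reliably atomic on every network filesystem. The JSON layout and the 300-second timeout are assumptions, and two clients removing a stale lock at the same moment can still race.

```python
import json
import os
import tempfile
import time


def try_acquire(lock_path: str, owner: str, stale_after: float = 300.0, now=time.time) -> bool:
    """Return True if `owner` holds the lock after this call."""
    try:
        with open(lock_path) as f:
            info = json.load(f)
        if now() - info["time"] <= stale_after:
            return info["owner"] == owner  # live lock: ours or somebody else's
        os.remove(lock_path)               # stale lock: its holder probably died
    except (FileNotFoundError, ValueError):
        pass  # no lock file, or one that is still being written
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # somebody else created it first
    with os.fdopen(fd, "w") as f:
        json.dump({"owner": owner, "time": now()}, f)
    return True


# Demo with a fake clock so staleness can be shown without waiting.
clock = [1000.0]
lock = os.path.join(tempfile.mkdtemp(), "data.lock")
got_a = try_acquire(lock, "hostA", now=lambda: clock[0])
got_b = try_acquire(lock, "hostB", now=lambda: clock[0])
clock[0] += 301.0
got_b_later = try_acquire(lock, "hostB", now=lambda: clock[0])
```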
Another solution
If you are in control of the format of the data file, you could reserve a structure at the beginning of the file to record whether it is locked or not.
If you just reserve a byte for this purpose, you could assume, for instance, that 00 would mean the data file isn't locked, and that other values would represent the identifier of the machine currently writing to it.
Issues with NFS
OK, I'm adding a few things because Jiri Klouda correctly pointed out that NFS uses client-side caching that will result in the actual lock file being in an undetermined state.
A few ways to solve this issue:
mount the NFS directory with the noac or sync options. This is easy but doesn't completely guarantee data consistency between client and server though so there may still be issues although in your case it may be OK.
Open the lock file or data file using the O_DIRECT, the O_SYNC or O_DSYNC attributes. This is supposed to disable caching altogether.
This will lower performance but will ensure consistency.
You may be able to use flock() to lock the data file but its implementation is spotty and you will need to check if your particular OS actually uses the NFS locking service. It may do nothing at all otherwise.
If the data file is locked, then another client opening it for writing will fail.
Oh yeah, and it doesn't seem to work on SMB shares, so it's probably best to just forget about it.
Don't use NFS and just use Samba instead: there is a good article on the subject and why NFS is probably not the best answer to your usage scenario.
You will also find in this article various methods for locking files.
Jiri's solution is also a good one.
Basically, if you want to keep things simple, don't use NFS for frequently-updated files that are shared amongst multiple machines.
Something different
Use a small database server to save your data into and bypass the NFS/SMB locking issues altogether or keep your current multiple data files system and just write a small utility to concatenate the results.
It may still be the safest and simplest solution to your problem.
I don't know D, but I think using a mutex file to do the job might work. Here's some pseudo-code you might find useful:
do {
// Try to create a new file to use as mutex.
// If it already exists, creation fails and returns null, so pause and retry.
mutex = create_file_exclusively('lock_file');
if (mutex == null) sleep(1);
} while (mutex == null);
// Open your log file for appending and write results
log_file = open_file_for_appending('the_log_file');
write(log_file, data);
close_file(log_file);
close_file(mutex);
// Free mutex and allow other processes to create the same file.
delete_file('lock_file');
So, all processes will try to create the mutex file but only the one who wins will be able to continue. Once you write your output, close and delete the mutex so other processes can do the same.

How to reliably handle files uploaded periodically by an external agent?

It's a very common scenario: some process wants to drop a file on a server every 30 minutes or so. Simple, right? Well, I can think of a bunch of ways this could go wrong.
For instance, processing a file may take more or less than 30 minutes, so it's possible for a new file to arrive before I'm done with the previous one. I don't want the source system to overwrite a file that I'm still processing.
On the other hand, the files are large, so it takes a few minutes to finish uploading them. I don't want to start processing a partial file. The files are just transferred with FTP or sftp (my preference), so OS-level locking isn't an option.
Finally, I do need to keep the files around for a while, in case I need to manually inspect one of them (for debugging) or reprocess one.
I've seen a lot of ad-hoc approaches to shuffling upload files around, swapping filenames, using datestamps, touching "indicator" files to assist in synchronization, and so on. What I haven't seen yet is a comprehensive "algorithm" for processing files that addresses concurrency, consistency, and completeness.
So, I'd like to tap into the wisdom of crowds here. Has anyone seen a really bulletproof way to juggle batch data files so they're never processed too early, never overwritten before done, and safely kept after processing?
The key is to do the initial juggling at the sending end. All the sender needs to do is:
Store the file with a unique filename.
As soon as the file has been sent, move it to a subdirectory called e.g. completed.
Assuming there is only a single receiver process, all the receiver needs to do is:
Periodically scan the completed directory for any files.
As soon as a file appears in completed, move it to a subdirectory called e.g. processed, and start working on it from there.
Optionally delete it when finished.
On any sane filesystem, file moves are atomic provided they occur within the same filesystem/volume. So there are no race conditions.
Multiple Receivers
If processing could take longer than the period between files being delivered, you'll build up a backlog unless you have multiple receiver processes. So, how to handle the multiple-receiver case?
Simple: Each receiver process operates exactly as before. The key is that we attempt to move a file to processed before working on it: that, and the fact the same-filesystem file moves are atomic, means that even if multiple receivers see the same file in completed and try to move it, only one will succeed. All you need to do is make sure you check the return value of rename(), or whatever OS call you use to perform the move, and only proceed with processing if it succeeded. If the move failed, some other receiver got there first, so just go back and scan the completed directory again.
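The claim-by-rename step for receivers can be sketched as follows (Python for illustration; the function name is an assumption, and the directory names follow the answer). It relies on both directories being on the same filesystem and on senders using unique filenames, as described above.

```python
import os
import tempfile


def claim_next(completed_dir: str, processed_dir: str):
    """Move one file from completed/ to processed/ and return its new path.

    Returns None when there is nothing left to claim.
    """
    for name in sorted(os.listdir(completed_dir)):
        src = os.path.join(completed_dir, name)
        dst = os.path.join(processed_dir, name)
        try:
            os.rename(src, dst)
        except FileNotFoundError:
            continue  # another receiver moved it first; try the next file
        return dst
    return None


root = tempfile.mkdtemp()
completed = os.path.join(root, "completed")
processed = os.path.join(root, "processed")
os.mkdir(completed)
os.mkdir(processed)
for name in ("batch-001.dat", "batch-002.dat"):
    open(os.path.join(completed, name), "w").close()

first = claim_next(completed, processed)
second = claim_next(completed, processed)
third = claim_next(completed, processed)
```

Each receiver only starts processing the path it got back, so two receivers never work on the same file.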
If the OS supports it, use file system hooks to intercept open and close file operations. Something like Dazuko. Other operating systems may let you know about file operations in another way; for example, Novell Open Enterprise Server lets you define epochs and read the list of files modified during an epoch.
Just realized that in Linux, you can use inotify subsystem, or the utilities from inotify-tools package
File transfer is one of the classics of system integration. I'd recommend getting the Enterprise Integration Patterns book to build your own answer to these questions. To some extent, the answer depends on the technologies and platforms you are using for endpoint implementation and for file transfer. It's quite a comprehensive collection of workable patterns, and fairly well written.