Background upload of images to S3 with Paperclip and Delayed Job - ruby-on-rails-3

I'm building an API for mobile apps which supports image uploading, using Paperclip.
Paperclip is set up with S3 storage and it's working fine.
I want to do the upload from the server to S3 in the background using Delayed Job (the app will be hosted on Heroku).
Trying something such as @user.delay.photo = File.open(...), the result is an error recorded by Delayed Job:
UPDATE "delayed_jobs" SET "last_error" = '{uninitialized stream
How can I do the background uploading?

The problem is that IO objects cannot be marshalled and retrieved back easily.
When you use the .delay method, Delayed Job tries to dump the object into a database record and load it back when processing the job. Done this way, the record becomes big and brittle.
It is better to use a custom job instead if you have a lot of work to do in the job, for example:
class UploadJob < Struct.new(:user_id)
  def perform
    user = User.find(user_id)
    user.photo = File.open(.....)   # open the image to attach
    user.save!                      # Paperclip writes the file to S3 when the record is saved
  end
end

Delayed::Job.enqueue UploadJob.new(@user.id)
You could also do it yourself by writing the image to the tmp directory of the project and referencing it from the job, then cleaning up when the job has finished; a sketch of that approach follows.
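A rough sketch of that tmp-file variant, assuming a controller that receives the image as params[:photo] (the job and variable names here are illustrative, not from the original answer):

class TmpFileUploadJob < Struct.new(:user_id, :file_path)
  def perform
    user = User.find(user_id)
    File.open(file_path) do |file|
      user.photo = file
      user.save!                                         # Paperclip sends the file to S3 on save
    end
  ensure
    File.delete(file_path) if File.exist?(file_path)     # clean up the temporary copy
  end
end

# In the controller: copy the upload into tmp/ and enqueue the job with the path.
tmp_path = Rails.root.join('tmp', "upload-#{SecureRandom.uuid}#{File.extname(params[:photo].original_filename)}").to_s
FileUtils.cp(params[:photo].tempfile.path, tmp_path)
Delayed::Job.enqueue TmpFileUploadJob.new(@user.id, tmp_path)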
Or you could try the delayed_paperclip gem, which is handier.
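With delayed_paperclip, the setup (as I recall its README) is roughly a single declaration in the model; check the gem's documentation for the exact version-specific details, such as the *_processing column it expects:

class User < ActiveRecord::Base
  has_attached_file :photo, storage: :s3, s3_credentials: "#{Rails.root}/config/s3.yml"
  process_in_background :photo   # hands the Paperclip processing off to a Delayed Job
end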

Related

Where does scrapyd write crawl results when using an S3 FEED_URI, before uploading to S3?

I'm running a long-running web crawl using scrapyd and scrapy 1.0.3 on an Amazon EC2 instance. I'm exporting jsonlines files to S3 using these parameters in my spider/settings.py file:
FEED_FORMAT: jsonlines
FEED_URI: s3://my-bucket-name
My scrapyd.conf file sets the items_dir property to empty:
items_dir=
The reason the items_dir property is set to empty is so that scrapyd does not override the FEED_URI property in the spider's settings, which points to an s3 bucket (see Saving items from Scrapyd to Amazon S3 using Feed Exporter).
This works as expected in most cases but I'm running into a problem on one particularly large crawl: the local disk (which isn't particularly big) fills up with the in-progress crawl's data before it can fully complete, and thus before the results can be uploaded to S3.
I'm wondering if there is any way to configure where the "intermediate" results of this crawl can be written prior to being uploaded to S3? I'm assuming that however Scrapy internally represents the in-progress crawl data is not held entirely in RAM but put on disk somewhere, and if that's the case, I'd like to set that location to an external mount with enough space to hold the results before shipping the completed .jl file to S3. Specifying a value for "items_dir" prevents scrapyd from automatically uploading the results to s3 on completion.
The S3 feed storage option inherits from BlockingFeedStorage, which itself uses TemporaryFile(prefix='feed-') from the tempfile module.
The default directory is chosen from a platform-dependent list (the TMPDIR, TEMP and TMP environment variables are checked first, then locations such as /tmp), so pointing TMPDIR at a larger mount may already be enough.
Otherwise, you can subclass S3FeedStorage and override its open() method to return a temporary file created somewhere other than the default, for example by passing the dir argument of tempfile.TemporaryFile(mode='w+b', bufsize=-1, suffix='', prefix='tmp', dir=None).

Moving files >5 gig to AWS S3 using a Data Pipeline

We are experiencing problems with files produced by Java code that are written locally and then copied by the Data Pipeline to S3. The error mentions file size.
I would have thought that if a multipart upload is required, the Pipeline would figure that out. I wonder if there is a way of configuring the Pipeline so that it does use multipart uploading. Otherwise the current Java code, which is agnostic about S3, either has to write directly to S3 or has to do what it used to do and then perform the multipart upload itself -- in fact, I would think the code should just write directly to S3 and not worry about uploading at all.
Can anyone tell me whether Pipelines can use multipart uploading and, if not, whether the correct approach is to have the program write directly to S3 or to continue writing to local storage and then have a separate program, invoked within the same Pipeline, do the multipart upload?
The answer, based on AWS support, is that indeed 5 GB files can't be uploaded directly to S3, and there is currently no way for a Data Pipeline to say, "You are trying to upload a large file, so I will do something special to handle this." It simply fails.
This may change in the future.
Data Pipeline CopyActivity does not support files larger than 4 GB. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
This is below the 5 GB limit that S3 imposes on each single PUT (or each part of a multipart upload).
You need to write your own script wrapping the AWS CLI or s3cmd (older). This script can be executed as a shell activity (ShellCommandActivity).
Writing directly to S3 may be an issue as S3 does not support append operations - unless you can somehow write multiple smaller objects to a folder.
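As an alternative to wrapping the CLI or s3cmd, a small script using the Ruby AWS SDK can perform the multipart upload itself. This is only a sketch with placeholder bucket, key, and path, using the current aws-sdk-s3 gem, and is not something Data Pipeline provides out of the box:

require 'aws-sdk-s3'   # gem install aws-sdk-s3

s3  = Aws::S3::Resource.new(region: 'us-east-1')
obj = s3.bucket('my-bucket').object('exports/big-file.dat')

# upload_file switches to a managed multipart upload above the threshold,
# so objects larger than the 5 GB single-PUT limit go through without extra code.
obj.upload_file('/local/path/big-file.dat',
                multipart_threshold: 100 * 1024 * 1024)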

Paperclip with multiple server instances

I am using Paperclip in RoR, and I am having some trouble showing the images. Sometimes the images are shown and sometimes they are not. Has anybody experienced something like this?
If you are using local paths to save the images, Paperclip will save them on the server that processes the request.
Subsequent requests to show the image will work if they land on the same server where it was saved, and fail if the request is processed by another server.
To avoid that, you should use shared storage, for example S3.
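For reference, a minimal Paperclip S3 configuration looks roughly like this (the bucket name and credential handling are placeholders, and the matching aws-sdk gem for your Paperclip version is also required):

class Photo < ActiveRecord::Base
  has_attached_file :image,
    storage: :s3,
    bucket: 'my-app-uploads',
    s3_credentials: {
      access_key_id:     ENV['AWS_ACCESS_KEY_ID'],
      secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
    }
end

With the bucket shared, every app instance reads and writes the same storage, so it no longer matters which server handled the upload.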

Can images be added to the Asset Pipeline at runtime / hot?

Scenario:
Have already deployed a Rails 3.2 app, and it ran the asset pipeline flow…
Now as a result of a user action, we have a new image and we want it to be a part of the asset pipeline for benefits like cache busting (though now I think about it, with user-uploaded images given unique filenames each time, that is a moot point)
Is there any way to be able to / is it a good idea to use the asset pipeline at this point, for that new image?
I have a feeling this is a stupid question.
Compiling assets is just a rake task; you can invoke it at any point. The task is bundle exec rake assets:precompile. See http://guides.rubyonrails.org/asset_pipeline.html for more information.
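For completeness, triggering that task from running code would look something like this sketch, though (as argued next) kicking off a full precompile for every upload is rarely a good idea:

# Equivalent to running `bundle exec rake assets:precompile` from the shell:
require 'rake'
Rails.application.load_tasks                # make the app's rake tasks available
Rake::Task['assets:precompile'].invoke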
However, I would not treat user uploaded images as assets to be compiled. I think doing so would be a bad idea.
Instead, like you touched on, give your user-uploaded images unique names, so that when new images are uploaded or replaced, new names are generated. An example of this being done can be found in the paperclip gem: it writes the images to disk and saves a record/reference in the database. Those images have an id in the database and their URLs include that id, so you end up with /photos/4/nothing.png and /photos/2/yes.png, where 2 and 4 point back to database records with metadata/relations to the images.
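If you are handling uploads by hand rather than through a gem, a unique name can be generated with something as simple as this sketch (upload stands in for the uploaded file object from params):

require 'securerandom'

ext         = File.extname(upload.original_filename)           # keep the original extension
unique_name = "#{SecureRandom.uuid}#{ext}"                      # e.g. "550e8400-e29b-41d4.png"
path        = Rails.root.join('public', 'uploads', unique_name)
File.open(path, 'wb') { |f| f.write(upload.read) }              # a new name on every upload, so caches never go stale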
Also, when you use the image_tag Rails view helper, it automatically adds a cache buster to the image URL as a query string, so image_tag('test.png') becomes /images/test.png?1234567890. Some proxies will not 'bust' query-string caches, but they are a minority.

downloading huge files - application using grails

I am developing a RESTful web service that allows users to download data in csv and json formats that is dynamically retrieved from the database.
Right now I am using a StringWriter to write out the CSV data. My major concern is that the result set could get very large depending on the user input. In that case, holding it all in memory doesn't seem like a good idea to me.
I am thinking of creating a temp file, but how do I make sure the file gets deleted soon after the download completes?
Is there a better way to do this?
Thanks for the help.
If memory is the issue, you could simply write to the response writer, which writes directly to the output stream. This way you're not storing anything (much) in memory and there's no need to write temporary files:
// controller action for CSV download
def download = {
    response.setContentType("text/csv")
    response.setHeader("Content-disposition", "attachment;filename=downloadFile.csv")
    def out = response.writer   // write straight to the response instead of buffering in memory
    def results = // get all your results
    results.each { result ->
        out << result.col1 << ',' << result.col2 // etc
        out << '\n'
    }
    out.flush()
}
This writes to the output stream as it loops over your results.
In theory, you can make this even more memory efficient by using a scrollable result set - see the "Using Scrollable Results" section of Querying with GORM - Criteria - and looping over that while writing to the response writer. In theory this also means you're not loading all your DB results into memory, but in practice it may not work as expected if you're using MySQL (and its Java connector). Manually batching the queries may work too (get DB rows 1-10000, write them out, get 10001-20000, etc.).
This kind of thing might be more difficult with JSON, depending on what library you're using to render your objects.
Well, the simplest way to prevent temp files from sticking around too long would be a cron job that deletes any file in the temp directory whose modification time is older than, say, one hour.
If you want it to all be done within Grails, you could design a Quartz job to clean up files. This job could either do as described above (and simply check modification timestamps to decide what to delete) or you could run the job only "on demand" with a parameter of a file name to be deleted. Once the download action is called you could schedule the cleanup of that specific file for X minutes later (to allow enough time for a successful download). The job would then be in charge of simply deleting the file.
Depending on the number of files involved you can always use http://download.oracle.com/javase/1.5.0/docs/api/java/io/File.html#deleteOnExit() to ensure the file is blown away when the VM shuts down.
To create a temp file that gets automatically deleted after the session has expired, you can use the Session Temp Files plugin.