How to Upload PhantomJS Page Content to S3 - amazon-s3

I am using PhantomJS 1.9.7 to scrape a web page, and I need to send the returned page content to S3. Currently I use the filesystem module included with PhantomJS to save to the local file system, and a PHP script scans the directory and ships the files off to S3. I would like to bypass the local filesystem entirely and send the files directly from PhantomJS to S3, but I could not find a direct way to do this within PhantomJS.
I toyed with the idea of using the child_process module and passing in the content as an argument, like so:
var execFile = require("child_process").execFile;
var page = require('webpage').create();
// ... after page.open() has finished loading the target page ...
var content = page.content;
execFile('php', ['path/to/script.php', content], null, function (err, stdout, stderr) {
    console.log("execFileSTDOUT:", JSON.stringify(stdout));
    console.log("execFileSTDERR:", JSON.stringify(stderr));
});
which would call a PHP script directly to accomplish the upload. However, this requires spawning an additional process to run a CLI command, and I am not comfortable with having another asynchronous process running. What I am looking for is a way to send the content directly to S3 from the PhantomJS script, similar to what the filesystem module does with the local filesystem.
Any ideas as to how to accomplish this would be appreciated. Thanks!

You could just create and open another page and point it at your S3 service. Amazon S3 has a REST API and a SOAP API, and REST seems easier.
For SOAP you will have to build the request manually. The only problem might be sending the wrong content type; it looks as if setting it is implemented, but I cannot find a reference in the documentation.
You could also create a form in the page context and send the file that way.
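To make that concrete, here is a rough, untested sketch (not from the original answer) of the REST route. It assumes a pre-signed PUT URL has already been generated elsewhere (for example by your PHP script), that the bucket name and URL below are placeholders, and that the second page is opened on the S3 endpoint so the request stays same-origin.
var scraperPage = require('webpage').create();
var uploadPage = require('webpage').create();

scraperPage.open('http://example.com/page-to-scrape', function (status) {
    var content = scraperPage.content;

    // Open a second page on the S3 endpoint so the XHR below is same-origin.
    uploadPage.open('https://your-bucket.s3.amazonaws.com/', function () {
        var httpStatus = uploadPage.evaluate(function (url, body) {
            var xhr = new XMLHttpRequest();
            xhr.open('PUT', url, false); // synchronous, to keep the sketch simple
            xhr.setRequestHeader('Content-Type', 'text/html');
            xhr.send(body);
            return xhr.status;
        }, 'https://your-bucket.s3.amazonaws.com/scraped.html?AWSAccessKeyId=...&Expires=...&Signature=...', content);

        console.log('S3 responded with HTTP ' + httpStatus);
        phantom.exit();
    });
});
A 200 from S3 means the object was stored; anything else (most likely a 403) points to a problem with the signature or with the Content-Type the URL was signed for.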

Related

Is an upload (put) object to AWS S3 from web browser possible?

This is a bit of a random question, and no one should ever do it this way, but is it possible to execute a PUT API call to Amazon S3 from the web browser, using only query params?
For instance, ignoring authentication params, I know you can request https://s3.amazonaws.com/~some bucket~
to list the files in the bucket. Is there a way to upload?
Have a look at Browser-Based Uploads Using POST.
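For reference (this is an illustration, not part of the linked documentation), a browser-side POST upload looks roughly like the sketch below. It assumes your server has already produced the base64-encoded policy document and its signature, that the bucket allows CORS for XHR requests, and that policy, signature, and fileInput are placeholders for values you supply.
var form = new FormData();
form.append('key', 'uploads/example.txt');           // object key to create
form.append('AWSAccessKeyId', 'YOUR_ACCESS_KEY_ID'); // placeholder access key
form.append('policy', policy);                       // base64 policy document from your server
form.append('signature', signature);                 // signature computed by your server
form.append('Content-Type', 'text/plain');
form.append('file', fileInput.files[0]);             // 'file' must be the last field

var xhr = new XMLHttpRequest();
xhr.open('POST', 'https://your-bucket.s3.amazonaws.com/', true);
xhr.onload = function () {
    console.log('S3 responded with HTTP ' + xhr.status);
};
xhr.send(form);
A plain HTML form posting the same fields to the bucket URL also works and avoids the CORS requirement.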

Handling file uploads with Restler

What is the best practice to implement file uploads using Restler framework?
I would like to have an API call that gets the file, saves it to a CDN, and returns the CDN file URL to the caller. What is the best way to implement it?
File upload to CDN using our API
This requires two steps. The first is to get the file onto the API server:
Add UploadFormat to the supported formats.
Adjust the static properties of UploadFormat to suit your needs.
From your API method, use $_FILES and move_uploaded_file to move the file into the desired folder. This step is common to any PHP upload process.
Now that you have the file on the server:
Upload it to the CDN. You can use any means provided by your CDN; it can be FTP or some SDK that does the upload.
Construct the CDN URL and return it to the client.

amazon s3 for downloads how to handle security

I'm building a web application and am looking into using Amazon S3 to store user uploads.
My concern is that I don't want user A to see that the download link for a document he uploaded is urltoMyS3/doc1234.pdf, then try urltoMyS3/doc1235.pdf and get another user's document.
The only way I can think of to do this is to only allow the web application to connect to S3, check on the web application whether the user has access to a file, have the web app download the file, and then serve it to the client. The problem with this method is that the application has to download the file first, which would inevitably slow the download down for the user.
How are user files typically handled with Amazon S3? Or is it simply not used in scenarios where the files should not be public? Is there another service for something like this?
Thanks
You can implement Query String Authentication, which will solve your problem.
Query string authentication is useful for giving HTTP or browser access to resources that would normally require authentication. The signature in the query string secures the request. Query string authentication requests require an expiration date. You can specify any future expiration time in epoch or UNIX time (number of seconds since January 1, 1970).
You can do this by generating the appropriate links; see the following:
https://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html#RESTAuthenticationQueryStringAuth
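As an illustration (not part of the original answer), generating such a link with the AWS SDK for JavaScript in a Node.js backend might look like this; the bucket and key names are placeholders, and credentials are assumed to come from the environment:
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

// Produce a link that expires five minutes from now.
var url = s3.getSignedUrl('getObject', {
    Bucket: 'your-bucket',
    Key: 'doc1234.pdf',
    Expires: 300
});

console.log(url); // hand this out only after checking the user's permissions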
If time-bound authentication will not work for you (as suggested in other answers), you could consider implementing something like s3fs to mount your S3 bucket as a drive on your web application server. That way you can simply perform your authentication check and then serve the file directly to the user, without them having any idea that the file resides in S3. Similarly, you can write uploaded files directly to this s3fs mount.
s3fs also allows you to configure a local cache of the S3 directory on your machine for faster access.
This works nicely in a clustered web server environment as well, as you can just have each server mount the s3fs drive and perform reads/writes on it independently.
A link with more info

Will this approach freeze the application?

I am using Heroku and Amazon S3, for storage.
I'm trying to make the download dialogue appear for the audio file, instead of the browser playing it.
In one of my controllers, I have:
response.content_type = 'application/octet-stream'
response.headers['Content-Disposition'] = "attachment; filename=#{@audio.filename}"
response.headers['X-Accel-Redirect'] = @audio.encoded_file_url
render :nothing => true
@audio.encoded_file_url returns http://bucket_name.s3.amazonaws.com/uploads/19/test.mp3.
This seems to work on my local machine. However, I am wondering whether this approach will block an entire HTTP request handler, freezing the app until the download completes.
On Heroku, an HTTP request handler is one dyno, and having several dynos is expensive.
I'm not sure that you can rely on nginx being used (X-Accel-Redirect is an nginx-ism); the Heroku docs imply that it's not always used.
In addition, X-Accel-Redirect is, to my knowledge, only for redirecting to files actually on the server, not to externally hosted files. Why not do a normal redirect to the S3-hosted file (using an authenticated URL if needed)?
If you need to set headers like Content-Disposition, this can be done either at upload time or afterwards. If you use fog to do your S3 business, you could do it like this (assuming storage is a Fog::Storage object):
storage.copy_object("your_bucket", "filename", "your_bucket", "filename", "x-amz-metadata-directive" => 'REPLACE', 'Content-Disposition' => '...')
Note that this overwrites all of the metadata; if you have other fields such as Content-Type, Cache-Control, etc., make sure to set them here too or they will be lost.
I would really recommend against letting users download files from your application via the dynos that you're using to serve your pages. Any static assets should really be served from S3, which you can then direct users to for file downloads.
Whilst the user is downloading, your dyno is effectively just feeding that file to them, and is thus unable to do anything else.

Does Amazon S3 help anything in this case?

I'm thinking about whether to host uploaded media files (video and audio) on S3 instead of locally. I need to check the user's permissions on each download.
So there would be an action like get_file, which first checks the user's permissions, then gets the file from S3 and sends it to the user using send_file.
def get_file
  if @user.can_download(params[:file_id])
    # first, download the file from S3 and then send it to the user using send_file
  end
end
But in this case, the server (unnecessarily) downloads the file first from S3 and then sends it to the user. I thought the use case for S3 was to bypass the Rails/HTTP server stack for reduced load.
Am I thinking this wrong?
PS. I'm using CarrierWave for file uploads. Not sure if that's relevant.
Amazon S3 provides something called RESTful authenticated reads, which are basically timeoutable URLs to otherwise protected content.
CarrierWave provides support for this. Simply set the S3 access policy to authenticated read:
config.s3_access_policy = :authenticated_read
and then model.file.url will automatically generate the RESTful URL.
Typically you'd embed the S3 URL in your page, so that the client's browser fetches the file directly from Amazon. Note however that this exposes the raw unprotected URL. You could name the file with a long hash instead of something predictable, so it's at least not guessable -- but once that URL is exposed, it's essentially open to the Internet. So if you absolutely always need access control on the files, then you'll need to proxy it like you're currently doing. In that case, you may decide it's just better to store the file locally.