Retrieving the size of a file stored in S3 - scrapy

My spiders use FilesPipeline to save files to an S3 bucket. I have overridden FilesPipeline.file_path to give each S3 blob a nice key, and my Item class has a field called binary_size, into which I would like to insert the size of the downloaded/uploaded file.
To achieve this I overrode FilesPipeline.item_completed, in which I look up the object in the S3 bucket (deriving the same key as decided in file_path), get its size, then set it on the item and return the item. It seems very simple!
It appears, however, that during execution of item_completed the S3 bucket does not yet contain the object. My code fails with a traceback ending in "Key does not exist".
Is there a place I can hook in to modify the item at a point when I can be sure that the blob will exist in S3?
Any clues, however faint or spare, will be appreciated! It's all working so nicely that I don't want to have to do the S3 stuff manually.
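For reference, here is a minimal sketch of the approach described above, assuming boto3 and a hypothetical bucket name (the real pipeline would also need FILES_STORE pointed at that bucket); the head_object call is where the "Key does not exist" error surfaces, because the upload has not yet finished when item_completed runs:

import boto3
from botocore.exceptions import ClientError
from scrapy.pipelines.files import FilesPipeline

class SizedFilesPipeline(FilesPipeline):
    # Hypothetical bucket name, for illustration only.
    BUCKET = "my-bucket"

    def file_path(self, request, response=None, info=None, *, item=None):
        # Simplified "nice key" derived from the request URL.
        return "files/" + request.url.split("/")[-1]

    def item_completed(self, results, item, info):
        s3 = boto3.client("s3")
        for ok, file_info in results:
            if not ok:
                continue
            # file_info["path"] is the same key that file_path() produced.
            try:
                head = s3.head_object(Bucket=self.BUCKET, Key=file_info["path"])
                item["binary_size"] = head["ContentLength"]
            except ClientError:
                # This is the failure described above: the object is not
                # yet visible in S3 when item_completed runs.
                raise
        return item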

Related

Compare local files to s3 bucket and determine what files (fullpath) are not in the bucket

I have a local file share that was copied to a Snowball and imported into an S3 bucket (~70 TB; many small files).
Since the import, users have added content to the local share.
I'm trying to get a list of all the files that are not present in the bucket, then transfer them to the bucket.
I've tried a sync and an s3cmd sync, but they have to iterate through every item; my thought is that if I export a list and then copy only the items that need to move, it would save a lot of time.
Looking for some help on the simplest and fastest way to approach this.
Sure, you can go this way if you assume your files weren't changed, i.e. if the file path uniquely identifies the content. Additionally, you could check whether the size remained the same.
To get the list of objects in an S3 bucket, use list-objects:
aws s3api list-objects --bucket text-content --query 'Contents[].{Key: Key, Size: Size}'
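If a scripted comparison is easier to work with than the CLI output, here is a minimal sketch, assuming boto3 and hypothetical bucket/path names, that lists the bucket once and prints the local files whose key is missing or whose size differs:

import os
import boto3

# Hypothetical names, for illustration only.
BUCKET = "text-content"
LOCAL_ROOT = "/mnt/share"

def s3_inventory(bucket):
    # Return {key: size} for every object in the bucket.
    s3 = boto3.client("s3")
    inventory = {}
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            inventory[obj["Key"]] = obj["Size"]
    return inventory

def missing_local_files(local_root, bucket):
    # Yield local paths that are absent from the bucket or differ in size.
    remote = s3_inventory(bucket)
    for dirpath, _dirs, files in os.walk(local_root):
        for name in files:
            full_path = os.path.join(dirpath, name)
            key = os.path.relpath(full_path, local_root).replace(os.sep, "/")
            if remote.get(key) != os.path.getsize(full_path):
                yield full_path

if __name__ == "__main__":
    for path in missing_local_files(LOCAL_ROOT, BUCKET):
        print(path)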

Append file on S3 bucket

I am working on a requirement where I have to constantly append to a file on an S3 bucket. The scenario is similar to a rolling log file. Once the script (or any other method) starts writing data to the file, the file on the S3 bucket should keep being appended to until I stop it. I searched several ways but could not find a solution. Most of the available resources say how to upload a static file to S3, but not a dynamically generated one.
S3 objects can only be overwritten, not appended to. It's not possible.
Once created, objects are durably stored and immutable. Any "change" to an object requires that the object be replaced.
While it is possible to stream a file into S3, this doesn't accomplish the purpose either, because the object you are creating is not accessible until the upload is finalized.
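Since objects are immutable, the closest workaround is read-modify-write: download the current object, concatenate the new data, and replace the object. A minimal sketch, assuming boto3 and hypothetical names, and clearly inefficient for a rolling log, which is the point made above:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def append_to_object(bucket, key, new_data):
    # Emulate an append by replacing the whole object.
    try:
        existing = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    except ClientError:
        existing = b""  # the object does not exist yet
    s3.put_object(Bucket=bucket, Key=key, Body=existing + new_data)

# Hypothetical usage: every call re-uploads the entire object.
append_to_object("my-log-bucket", "logs/app.log", b"new log line\n")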

Where does scrapyd write crawl results when using an S3 FEED_URI, before uploading to S3?

I'm running a long-running web crawl using scrapyd and scrapy 1.0.3 on an Amazon EC2 instance. I'm exporting jsonlines files to S3 using these parameters in my spider/settings.py file:
FEED_FORMAT: jsonlines
FEED_URI: s3://my-bucket-name
My scrapyd.conf file sets the items_dir property to empty:
items_dir=
The reason the items_dir property is set to empty is so that scrapyd does not override the FEED_URI property in the spider's settings, which points to an s3 bucket (see Saving items from Scrapyd to Amazon S3 using Feed Exporter).
This works as expected in most cases but I'm running into a problem on one particularly large crawl: the local disk (which isn't particularly big) fills up with the in-progress crawl's data before it can fully complete, and thus before the results can be uploaded to S3.
I'm wondering if there is any way to configure where the "intermediate" results of this crawl can be written prior to being uploaded to S3? I'm assuming that however Scrapy internally represents the in-progress crawl data is not held entirely in RAM but put on disk somewhere, and if that's the case, I'd like to set that location to an external mount with enough space to hold the results before shipping the completed .jl file to S3. Specifying a value for "items_dir" prevents scrapyd from automatically uploading the results to s3 on completion.
The S3 feed storage option inherits from BlockingFeedStorage, which itself uses TemporaryFile(prefix='feed-') (from the tempfile module).
The default directory is chosen from a platform-dependent list.
You can subclass S3FeedStorage and override the open() method to return a temp file from somewhere other than the default, for example using the dir argument of tempfile.TemporaryFile([mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None]]]]]).
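A minimal sketch of that idea, assuming the Scrapy 1.0.x import path and a hypothetical mount point; the only change is where the temporary file lives before the final upload to S3:

from tempfile import TemporaryFile

from scrapy.extensions.feedexport import S3FeedStorage

class BigDiskS3FeedStorage(S3FeedStorage):
    # Buffer the in-progress feed on a large external mount instead of the
    # platform-default temp directory.
    def open(self, spider):
        # /mnt/big-disk is a hypothetical mount with enough free space.
        return TemporaryFile(prefix='feed-', dir='/mnt/big-disk')

# In settings.py, point the s3:// scheme at the subclass
# (hypothetical module path):
# FEED_STORAGES = {'s3': 'myproject.feedstorage.BigDiskS3FeedStorage'}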

Simple way to load new files only into Redshift from S3?

The documentation for the Redshift COPY command specifies two ways to choose files to load from S3: you either provide a base path and it loads all the files under that path, or you specify a manifest file listing the specific files to load.
However in our case, which I imagine is pretty common, the S3 bucket periodically receives new files with more recent data. We'd like to be able to load only the files that haven't already been loaded.
Given that there is a table stl_file_scan that logs all the files that have been loaded from S3, it would be nice to somehow exclude those that have successfully been loaded. This seems like a fairly obvious feature, but I can't find anything in the docs or online about how to do this.
Even the Redshift S3 loading template in AWS Data Pipeline appears to manage this scenario by loading all the data -- new and old -- to a staging table, and then comparing/upserting to the target table. This seems like an insane amount of overhead when we can tell up front from the filenames that a file has already been loaded.
I know we could probably move the files that have already been loaded out of the bucket; however, we can't do that, because this bucket is the final storage place for another process which is not our own.
The only alternative I can think of is to have some other process running that tracks the files that have been successfully loaded into Redshift, periodically compares that list to the S3 bucket to determine the differences, and then writes the manifest file somewhere before triggering the copy process. But what a pain! We'd need a separate EC2 instance to run the process, which would have its own management and operational overhead.
There must be a better way!
This is how I solved the problem:
S3 -- (Lambda trigger on newly created logs) --> Lambda --> Firehose --> Redshift
It works at any scale. With more load there are more calls to Lambda and more data flowing to Firehose, and everything is taken care of automatically.
If there are issues with the format of a file, you can configure a dead-letter queue; events will be sent there, and you can reprocess them once you fix the Lambda.
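A minimal sketch of the Lambda piece, assuming boto3, a hypothetical Firehose delivery stream name, and newline-delimited records in each S3 object; the delivery stream is what actually COPYs the data into Redshift:

import boto3

firehose = boto3.client("firehose")
s3 = boto3.client("s3")

# Hypothetical delivery stream configured to deliver into Redshift.
STREAM_NAME = "logs-to-redshift"

def handler(event, context):
    # Triggered by S3 on newly created objects; forwards records to Firehose.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Note: real keys may need URL-decoding (urllib.parse.unquote_plus).
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for line in body.splitlines(keepends=True):
            firehose.put_record(
                DeliveryStreamName=STREAM_NAME,
                Record={"Data": line},
            )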
Here are some steps covering how to load data into Redshift:
Export local RDBMS data to flat files (make sure you remove invalid characters and apply escape sequences during export).
Split files into 10-15 MB each to get optimal performance during upload and the final data load.
Compress files to *.gz format so you don't end up with a $1000 surprise bill :) .. in my case text files were compressed 10-20 times.
List all file names in a manifest file so that when you issue the COPY command to Redshift it's treated as one unit of load (see the sketch after these steps).
Upload the manifest file to the Amazon S3 bucket.
Upload the local *.gz files to the Amazon S3 bucket.
Issue the Redshift COPY command with the appropriate options.
Schedule file archiving from on-premises and the S3 staging area on AWS.
Capture errors and set up restartability if something fails.
To do it the easy way, you can follow this link.
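For the manifest/COPY steps above, here is a minimal sketch assuming boto3, psycopg2, and hypothetical bucket, table, and IAM role names; it writes a manifest listing only the files you want loaded and issues a single COPY against it:

import json

import boto3
import psycopg2

# Hypothetical names, for illustration only.
BUCKET = "my-data-bucket"
MANIFEST_KEY = "manifests/load-2024-01-01.manifest"
IAM_ROLE = "arn:aws:iam::123456789012:role/my-redshift-copy-role"

def write_manifest(keys):
    # Upload a COPY manifest listing only the files that still need loading.
    manifest = {
        "entries": [
            {"url": "s3://%s/%s" % (BUCKET, key), "mandatory": True}
            for key in keys
        ]
    }
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=MANIFEST_KEY, Body=json.dumps(manifest)
    )

def copy_from_manifest(conn):
    # Issue one COPY command so the manifest is treated as a single unit of load.
    with conn.cursor() as cur:
        cur.execute(
            "COPY my_target_table "
            "FROM 's3://%s/%s' "
            "IAM_ROLE '%s' "
            "MANIFEST GZIP;" % (BUCKET, MANIFEST_KEY, IAM_ROLE)
        )
    conn.commit()

# Hypothetical usage:
# conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
#                         dbname="analytics", user="loader", password="...", port=5439)
# write_manifest(["data/part-0001.gz", "data/part-0002.gz"])
# copy_from_manifest(conn)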
In general, comparing the loaded files to the files existing on S3 is a bad but possible practice. The common "industrial" practice is to use a message queue between the data producer and the data consumer that actually loads the data. Take a look at RabbitMQ vs Amazon SQS, etc.

S3 — Auto generate folder structure?

I need to store user-uploaded files in Amazon S3. I'm new to S3, but as I understand from the docs, S3 requires me to specify the file's upload path in the PUT method.
I'm wondering if there is a way to send a file to S3 and simply get back a link for http(s) access? I'd like Amazon to handle all the headache related to the file/folder structure itself. For example, I just pipe a file from node.js to S3, and on callback I get an http link with no expiration date. Amazon itself would create something like /2014/12/01/.../$hash.jpg and just return me the final link? Such a use case looks quite common.
Is it possible? If not, could you suggest any options to simplify file storage and the filesystem-tree structure in S3?
Many thanks.
S3 doesn't have folders, actually. In a normal filesystem, 2014/12/01/blah.jpg would mean you've got a 2014 folder with a folder called 12 inside it and so on, but in S3 the entire 2014/12/01/blah.jpg is the key - essentially a single long filename. You don't have to create any folders.
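To illustrate, a minimal sketch assuming boto3 and a hypothetical bucket: the date/hash "path" is just a key string you build yourself, and no folders are created anywhere:

import datetime
import hashlib

import boto3

# Hypothetical bucket name, for illustration only.
BUCKET = "my-uploads-bucket"

def store_upload(data, extension="jpg"):
    # Upload bytes under a date/hash key and return a plain https URL.
    today = datetime.date.today()
    digest = hashlib.sha256(data).hexdigest()
    # The whole string below is a single key; the slashes are just characters.
    key = "%04d/%02d/%02d/%s.%s" % (today.year, today.month, today.day, digest, extension)
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=data)
    # A non-expiring link like this only works if the object is publicly readable.
    return "https://%s.s3.amazonaws.com/%s" % (BUCKET, key)

# print(store_upload(open("photo.jpg", "rb").read()))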