Scrapy crawl appends locally, replaces on S3? - amazon-s3

I implemented a Scrapy project that works fine locally. With the crawl command, each spider appended its JSON lines to the same file if the file already existed. After I switched the feed exporter to S3 using boto, each run overwrites the entire file with the data from the last spider instead of appending to it.
Is there any way to make Scrapy/boto/S3 append the JSON lines to the file, as it does locally?
Thanks

There is no way to append to a file in S3. You could enable versioning on the S3 bucket and then each time the file was written to S3, it would create a new version of the file. Then you could retrieve all versions of the file using the list_versions method of the boto Bucket object.
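As a rough illustration of the versioning idea, here is a minimal sketch that stitches every stored version of one key back together, oldest first. It swaps in the newer boto3 client calls (list_object_versions and get_object) rather than the boto list_versions method mentioned above. The function name read_all_versions is made up for this example, s3_client is assumed to be something like boto3.client("s3"), and the bucket is assumed to have versioning enabled. A single list_object_versions call returns at most 1000 entries, so a key with more versions would need pagination.

```python
def read_all_versions(s3_client, bucket, key):
    """Return the concatenated contents of every stored version of `key`, oldest first.

    `s3_client` is assumed to behave like boto3.client("s3").
    """
    response = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    # Prefix matching can also return other keys that merely start with `key`.
    versions = [v for v in response.get("Versions", []) if v["Key"] == key]
    versions.sort(key=lambda v: v["LastModified"])

    chunks = []
    for version in versions:
        obj = s3_client.get_object(
            Bucket=bucket, Key=key, VersionId=version["VersionId"]
        )
        chunks.append(obj["Body"].read())
    return b"".join(chunks)
```

Because each run writes a whole new version, joining the versions in time order gives roughly what local append mode would have produced.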

From reading the feed exporter code (https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/feedexport.py), the file exporter opens the specified file in append mode whilst the S3 exporter calls set_contents_from_file which presumably overwrites the original file.
The boto S3 documentation (http://boto.readthedocs.org/en/latest/getting_started.html) doesn't mention being able to modify stored files, so the only solution would be to create a custom exporter that stores a local copy of results that can be appended to first before copying that file to S3.
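The "append locally, then copy" idea can be sketched in a few lines. This is not a Scrapy feed exporter, only the core step such a custom exporter would perform. The helper name append_then_upload is hypothetical, s3_client is assumed to behave like boto3.client("s3"), and the local file is assumed to be the authoritative copy that survives between runs.

```python
def append_then_upload(s3_client, bucket, key, local_path, new_lines):
    """Append `new_lines` to a local JSON lines file, then replace the S3 object with it.

    `s3_client` is assumed to behave like boto3.client("s3").
    """
    # Append mode keeps whatever earlier spiders already wrote.
    with open(local_path, "a", encoding="utf-8") as f:
        for line in new_lines:
            f.write(line.rstrip("\n") + "\n")
    # S3 has no append operation, so the whole file is uploaded again.
    s3_client.upload_file(local_path, bucket, key)
```

The upload still overwrites the object each time, but since the local file keeps growing, the object on S3 always contains the output of every run so far.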

Related

How to add the files of s3bucket folder to a zipfile and download the zip file

I have a folder in an S3 bucket. I want to zip the files inside it and then download the zip file. Everything I found was related to Lambda. Is there a way I can do it without using Lambda? If not, what is the proper way to do it?
Thank you in advance.
S3 can't zip it on the fly for you, since it's only a file storage service. You could use Lambda, of course, but the simplest way to download a "folder" on S3 is to use the AWS CLI:
aws s3 sync s3://<bucket_name>/<folder_key> <local_dest_path>
You can then zip it on your local machine if needed.
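For the local zipping step, a minimal sketch using Python's standard library is shown here. The function name zip_folder and both paths are placeholders. It assumes the sync command above has already downloaded the files into local_dir, and that the archive is written somewhere outside that directory so it does not try to include itself.

```python
import shutil


def zip_folder(local_dir, archive_base):
    """Zip everything under `local_dir` and return the path of the archive.

    `archive_base` is the archive path without the ".zip" extension.
    """
    return shutil.make_archive(archive_base, "zip", root_dir=local_dir)
```

For example, zip_folder("downloads/folder", "folder") would produce folder.zip in the current directory.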

get zip files from one s3 bucket unzip them to another s3 bucket

I have zip files in one S3 bucket.
I need to unzip them and copy the unzipped folder to another S3 bucket, keeping the source path.
For example, if the zip file in the source bucket is under
"s3://bucketname/foo/bar/file.zip"
then in the destination bucket it should be "s3://destbucketname/foo/bar/zipname/files.."
How can it be done?
I know that it is somehow possible to do this with Lambda, so that I won't have to download the files locally, but I have no idea how.
Thanks!
If you want to trigger the above process as soon as the Zip file is uploaded into the bucket, you could write an AWS Lambda function.
When the Lambda function is triggered, it will be passed the name of the bucket and object that was uploaded. The function should then:
- Download the Zip file to /tmp
- Unzip the file (beware: /tmp storage is limited; the default is 512 MB, configurable up to 10 GB)
- Loop through the unzipped files and upload them to the destination bucket
- Delete all local files created (to free up space for any future executions of the function)
For a general example, see: Tutorial: Using AWS Lambda with Amazon S3 - AWS Lambda
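The steps above can be sketched roughly as follows. This is an illustration under several assumptions, not a tested Lambda deployment: the names destination_prefix, unzip_to_bucket and the DEST_BUCKET environment variable are made up for the example, and s3_client is assumed to behave like boto3.client("s3"). The destination key follows the layout from the question, so foo/bar/file.zip is unpacked under foo/bar/file/.

```python
import os
import posixpath
import shutil
import tempfile
import urllib.parse
import zipfile


def destination_prefix(zip_key):
    """Map 'foo/bar/file.zip' to 'foo/bar/file/'."""
    return posixpath.splitext(zip_key)[0] + "/"


def unzip_to_bucket(s3_client, src_bucket, zip_key, dest_bucket):
    """Download a Zip object, extract it, and upload each file to dest_bucket."""
    # On Lambda the default temporary directory is under /tmp.
    workdir = tempfile.mkdtemp()
    try:
        zip_path = os.path.join(workdir, "archive.zip")
        s3_client.download_file(src_bucket, zip_key, zip_path)

        extract_dir = os.path.join(workdir, "extracted")
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(extract_dir)

        prefix = destination_prefix(zip_key)
        uploaded = []
        for root, _dirs, files in os.walk(extract_dir):
            for name in files:
                local_file = os.path.join(root, name)
                # S3 keys always use forward slashes.
                rel = os.path.relpath(local_file, extract_dir).replace(os.sep, "/")
                s3_client.upload_file(local_file, dest_bucket, prefix + rel)
                uploaded.append(prefix + rel)
        return sorted(uploaded)
    finally:
        # Free the space for future invocations of the same container.
        shutil.rmtree(workdir, ignore_errors=True)


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above work without boto3

    record = event["Records"][0]["s3"]
    src_bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    zip_key = urllib.parse.unquote_plus(record["object"]["key"])
    dest_bucket = os.environ["DEST_BUCKET"]  # hypothetical configuration
    return unzip_to_bucket(boto3.client("s3"), src_bucket, zip_key, dest_bucket)
```

Keeping the S3 work in a function that takes the client as a parameter makes it possible to try the logic locally with a stand-in client before wiring it to a bucket notification.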
You can use AWS Lambda for this. You can also set an event notification on your S3 bucket so that a Lambda function is triggered every time a new file arrives. You can write Python code that uses boto3 to connect to S3. Then you can read the files into a buffer, unzip them using the libraries below, gzip them, and reupload them to S3 under your desired folder/path:
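import gzip
import io
import zipfile
import boto3

# Assumptions: zip_bytes holds the downloaded Zip object's bytes;
# "destbucketname" and the ".gz" key suffix are placeholders.
destinationbucket = boto3.resource("s3").Bucket("destbucketname")
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zipped:
    for file in zipped.namelist():
        # Read each archive member and gzip it in memory
        with zipped.open(file, "r") as f_in:
            gzipped_content = gzip.compress(f_in.read())
        final_file_path = file + ".gz"
        destinationbucket.upload_fileobj(
            io.BytesIO(gzipped_content),
            final_file_path,
            ExtraArgs={"ContentType": "text/plain"},
        )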
There is also a tutorial here: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
Arguably Python is simpler to use for your Lambda, but if you are considering Java, I've made a library that manages unzipping of data in AWS S3 utilising stream download and multipart upload.
Unzipping is achieved without keeping data in memory or writing to disk. That makes it suitable for large data files - it has been used to unzip files of size 100GB+.
It is available in Maven Central, here is the GitHub link: nejckorasa/s3-stream-unzip

Can make figure out file dependencies in AWS S3 buckets?

The source directory contains numerous large image and video files.
These files need to be uploaded to an AWS S3 bucket with the aws s3 cp command. For example, as part of this build process, I copy my image file my_image.jpg to the S3 bucket like this: aws s3 cp my_image.jpg s3://mybucket.mydomain.com/
I have no problem doing this copy to AWS manually. And I can script it too. But I want to use the makefile to upload my image file my_image.jpg iff the same-named file in my S3 bucket is older than the one in my source directory.
Generally make is very good at this kind of dependency checking based on file dates. However, is there a way I can tell make to get the file dates from files in S3 buckets and use that to determine if dependencies need to be rebuilt or not?
The AWS CLI has an s3 sync command that can take care of a fair amount of this for you. From the documentation:
A s3 object will require copying if:
the sizes of the two s3 objects differ,
the last modified time of the source is newer than the last modified time of the destination,
or the s3 object does not exist under the specified bucket and prefix destination.
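The three conditions quoted above amount to a small decision rule. Here is a sketch of that rule in Python, only to make the logic concrete; it is not how the AWS CLI is implemented. The function name needs_copy and the (size, last_modified) tuple layout are assumptions for this example.

```python
from datetime import datetime, timezone


def needs_copy(src_size, src_mtime, dest):
    """Decide whether a source file should be copied, following the quoted rule.

    `dest` is None when the object does not exist at the destination,
    otherwise a (size, last_modified) tuple.
    """
    if dest is None:
        return True
    dest_size, dest_mtime = dest
    return src_size != dest_size or src_mtime > dest_mtime


# Example: same size, but the local file is newer than the object in S3.
local_time = datetime(2024, 1, 2, tzinfo=timezone.utc)
remote_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
should_upload = needs_copy(1024, local_time, (1024, remote_time))
```

This is the same kind of timestamp comparison make performs, which is why running s3 sync from a makefile rule can stand in for per-file dependency checks.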
I think you'll need to make S3 look like a file system to make this work. On Linux it is common to use FUSE to build adapters like that. Here are some projects to present S3 as a local filesystem. I haven't tried any of those, but it seems like the way to go.

How to import .zip data to neo4j?

I have a CSV file which is zipped and stored on S3, and I'm planning to import the file directly from the URL. I'm not able to find any way of doing that in the official Neo4j docs.
LOAD CSV can do this. neo4j-import has the same underlying file reader and so can read zipped files directly, although it seems to be lacking URL support currently.

Empty files on S3 prevent from downloading using s3cmd and s3sync

I am trying to set up a backup/restore using S3. The upload sync worked well using s3sync. However, next to each folder there is an empty file with a matching name. I read somewhere that this is created to define the folder structure, but I am not sure about that, as it doesn't happen if I create a folder using a different method (s3fox etc.).
These empty files prevent me from restoring the directories/files. When I run s3cmd sync, I get the error message "can not make directory: File exists", because it first creates that empty file and then fails when trying to create the directory. Any ideas how I can solve this problem?