Exporting data over an API from S3 using Lambda

I have some data stored in DynamoDB and some high-res images of each user stored in S3. The requirement is to be able to export a user's data on demand: via an API endpoint, collate all the data and send it as a response. We are using AWS Lambda (Node.js) for business logic, S3 for storing images, and a SQL DB for storing relational data.
I had set up a mechanism where API Gateway receives requests and puts them on an SQS queue. The queue triggers a Lambda, which runs queries to gather all the data and image paths. We copy all the images and data into a new bucket, using the custId as the folder name. Now here's where I'm stuck: how do I stream this data from our new AWS bucket? All the collected data is about 4 GB. I have tried to stream it via AWS Lambda but keep failing. I am able to stream single files, but not all the data as a zip. I've done this in Node before, but would rather not set up an EC2 instance if possible and instead solve it directly with S3 and Lambdas.
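For reference, the copy step described above amounts to roughly the following (bucket names and the custId value are placeholders, not our real setup):

    // Sketch of the per-customer copy step; bucket names and keys are placeholders.
    const { S3Client, CopyObjectCommand } = require("@aws-sdk/client-s3");

    const s3 = new S3Client({});

    async function copyUserImages(custId, imageKeys) {
      for (const key of imageKeys) {
        await s3.send(new CopyObjectCommand({
          CopySource: `source-images-bucket/${key}`, // existing object, as "bucket/key"
          Bucket: "export-bucket",                   // the new bucket used for exports
          Key: `${custId}/${key}`,                   // custId used as the "folder" name
        }));
      }
    }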
I can't seem to find a way to stream an entire folder from AWS to the client as a response to an HTTP request.

Okay, found the answer. Instead of trying to return a zip stream, I'm now just zipping and saving the folder on the bucket itself and returning a signed URL for it. Several Node modules help zip S3 folders without loading entire files into memory. Using one of those, we have zipped our folder and returned a signed URL. How it will behave under actual load remains to be seen; will test that soon.
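A rough sketch of that approach with the v3 AWS SDK plus the archiver module (the answer doesn't name a specific zip module, so archiver is just one option; the bucket name, prefix and expiry below are placeholders, and error handling is omitted):

    // Sketch: zip everything under <custId>/ into <custId>.zip inside the same bucket,
    // then hand back a time-limited signed URL. Names and values are illustrative only.
    const { S3Client, ListObjectsV2Command, GetObjectCommand } = require("@aws-sdk/client-s3");
    const { Upload } = require("@aws-sdk/lib-storage");
    const { getSignedUrl } = require("@aws-sdk/s3-request-presigner");
    const archiver = require("archiver");

    const s3 = new S3Client({});
    const BUCKET = "export-bucket"; // placeholder

    async function zipAndSign(custId) {
      // 1. List every object under the customer's prefix (paginate past 1000 keys).
      const keys = [];
      let ContinuationToken;
      do {
        const page = await s3.send(new ListObjectsV2Command({
          Bucket: BUCKET, Prefix: `${custId}/`, ContinuationToken,
        }));
        (page.Contents || []).forEach((o) => keys.push(o.Key));
        ContinuationToken = page.NextContinuationToken;
      } while (ContinuationToken);

      // 2. Pipe each object's stream into the zip stream; nothing is buffered whole in memory.
      const archive = archiver("zip");
      const upload = new Upload({
        client: s3,
        params: { Bucket: BUCKET, Key: `${custId}.zip`, Body: archive },
      });
      const uploadDone = upload.done(); // start consuming the zip stream right away

      for (const key of keys) {
        const obj = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: key }));
        const entryDone = new Promise((resolve) => archive.once("entry", resolve));
        archive.append(obj.Body, { name: key });
        await entryDone; // don't open the next S3 stream until this one has been consumed
      }
      archive.finalize();
      await uploadDone; // the multipart upload of the zip finishes once the stream ends

      // 3. Return a signed URL so the client downloads the zip directly from S3.
      return getSignedUrl(s3, new GetObjectCommand({ Bucket: BUCKET, Key: `${custId}.zip` }), {
        expiresIn: 3600, // seconds; pick whatever fits your flow
      });
    }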

Related

Stream 'Azure Blob Storage' file URL to Amazon S3 bucket

I am consuming a document-sharing API in my application which, when used, returns a "downloadUrl" for a given file located in Azure Blob Storage.
I want to take that Azure Blob Storage URL and stream the document into an Amazon S3 bucket.
How would I go about doing this? I see similar questions such as Copy from Azure Blob to AWS S3 using C#, but in that example they seem to have access to the stream of the document itself. Is there any way for me to simply provide S3 with the link and have it do the rest? Or do I need to get the file on the server and stream it as in the example above?
Thanks in advance for the help.
There is only one case where S3 can be directed to fetch content "into" an object, and that is when the source is also an existing S3 object. It can be in the same bucket or a different bucket, or even a different AWS region or account, as long as the calling user has permissions on both source and target.
Any other case -- such as what you are contemplating -- requires that you fetch the object from the source, yourself, and then upload it into S3... as in the example.
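In Node.js, for instance, that fetch-then-upload can be done by piping the HTTP response stream straight into a multipart upload, so the whole file never sits in memory. A rough sketch (the URL, bucket and key are placeholders):

    // Sketch: stream a file from a download URL (e.g. the Azure "downloadUrl")
    // straight into S3 without buffering the whole file. Names are placeholders.
    const https = require("https");
    const { S3Client } = require("@aws-sdk/client-s3");
    const { Upload } = require("@aws-sdk/lib-storage");

    const s3 = new S3Client({});

    function streamUrlToS3(downloadUrl, bucket, key) {
      return new Promise((resolve, reject) => {
        https.get(downloadUrl, (res) => {
          if (res.statusCode !== 200) {
            res.resume(); // drain the failed response
            return reject(new Error(`Download failed: ${res.statusCode}`));
          }
          // The response is a readable stream; lib-storage chunks it into a multipart upload.
          new Upload({ client: s3, params: { Bucket: bucket, Key: key, Body: res } })
            .done()
            .then(resolve, reject);
        }).on("error", reject);
      });
    }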

Getting data from S3 (client) to our S3 (company)

We have a requirement to get .csv files from a bucket at a client location (they would provide the S3 bucket info and other information required). Every day we need to pull this data into our S3 bucket so we can process it further. Please suggest the best way/technology we can use to achieve this.
I am planning to do it with Python boto (or Pandas or PySpark) or Spark, the reason being that once we get this data it might be processed further.
You can try the S3 cross-account object copy using the S3 COPY operation. This is more secure and the suggested approach; please go through the link below for more details. It also works for different buckets in the same account. After copying, you can trigger a Lambda function with custom code (Python) to process the .csv files.
How to copy Amazon S3 objects from one AWS account to another by using the S3 COPY operation
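Once the client's bucket policy grants your account read access, the copy itself is one call per object. A rough sketch (shown in Node.js just to illustrate the shape of the calls, even though the asker plans to use Python/boto; bucket names and the prefix are placeholders):

    // Sketch of a daily pull: copy every .csv under the client's prefix into our bucket.
    // Assumes the client's bucket policy grants our account s3:ListBucket / s3:GetObject.
    const { S3Client, ListObjectsV2Command, CopyObjectCommand } = require("@aws-sdk/client-s3");

    const s3 = new S3Client({});

    async function pullDailyCsvs() {
      const listed = await s3.send(new ListObjectsV2Command({
        Bucket: "client-bucket",
        Prefix: "nightly/",
      }));
      for (const obj of listed.Contents || []) {
        if (!obj.Key.endsWith(".csv")) continue;
        await s3.send(new CopyObjectCommand({
          CopySource: `client-bucket/${obj.Key}`, // server-side copy, data never leaves S3
          Bucket: "our-company-bucket",
          Key: obj.Key,
        }));
      }
    }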
If your customer keeps the data in an S3 bucket to which your account has been granted access, then it should be possible to use the .csv files as a direct source of data for a Spark job. Use s3a://theirbucket/nightly/*.csv as the RDD source, and save it to s3a://mybucket/somewhere, ideally in a format other than CSV (Parquet, ORC, ...). This lets you do some basic transformation of the format into one that is easier to work with.
If you just want the raw CSV files, that S3 copy operation is what you need, as it copies the data within S3 itself (6+ MiB/s if in the same S3 location), without needing any of your own VMs involved.

Moving files >5 gig to AWS S3 using a Data Pipeline

We are experiencing problems with files produced by Java code, which are written locally and then copied by the Data Pipeline to S3. The error mentions file size.
I would have thought that if multipart uploading is required, the Pipeline would figure that out. I wonder if there is a way of configuring the Pipeline so that it does use multipart uploading. Otherwise the current Java code, which is agnostic about S3, either has to write directly to S3 or has to keep doing what it does now and then use multipart uploading -- in fact, I would think the code would just write directly to S3 and not worry about the upload step at all.
Can anyone tell me if Pipelines can use multipart uploading, and if not, whether the correct approach is to have the program write directly to S3, or to continue writing to local storage and then have a separate program invoked within the same Pipeline perform the multipart upload?
The answer, based on AWS Support, is that 5 GB files indeed can't be uploaded directly to S3, and there is currently no way for a Data Pipeline to say, "You are trying to upload a large file, so I will do something special to handle this." It simply fails.
This may change in the future.
Data Pipeline CopyActivity does not support files larger than 4GB. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
This is below the 5 GB limit imposed by S3 for each single-part PUT.
You need to write your own script wrapping the AWS CLI or s3cmd (older). This script may be executed as a shell activity.
Writing directly to S3 may be an issue, as S3 does not support append operations - unless you can somehow write multiple smaller objects to a folder.
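For what it's worth, if the upload step can be moved into code rather than a CLI-wrapping script, the SDKs take care of the multipart mechanics for files above the single-PUT limit. A rough sketch using Node.js and @aws-sdk/lib-storage (the original code here is Java, so treat this purely as an illustration of the idea; the path, bucket and part size are placeholders):

    // Sketch: upload a local file of any size to S3; lib-storage handles the multipart
    // part-splitting that a single PUT (5 GB cap) cannot. Paths and names are placeholders.
    const fs = require("fs");
    const { S3Client } = require("@aws-sdk/client-s3");
    const { Upload } = require("@aws-sdk/lib-storage");

    async function uploadLargeFile(localPath, bucket, key) {
      const upload = new Upload({
        client: new S3Client({}),
        params: { Bucket: bucket, Key: key, Body: fs.createReadStream(localPath) },
        partSize: 64 * 1024 * 1024, // 64 MiB parts; must be at least 5 MiB
        queueSize: 4,               // number of parts uploaded in parallel
      });
      upload.on("httpUploadProgress", (p) => console.log(p.loaded, "bytes sent"));
      await upload.done();
    }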

S3 — Auto generate folder structure?

I need to store user-uploaded files in Amazon S3. I'm new to S3, but as I understand from the docs, S3 requires me to specify the file upload path in the PUT method.
I'm wondering if there is a way to send a file to S3 and simply get a link for HTTP(S) access. I'd like Amazon to handle all the headache related to the file/folder structure itself. For example, I just pipe a file from Node.js to S3, and in the callback I get an HTTP link with no expiration date. Amazon itself creates something like /2014/12/01/.../$hash.jpg and just returns me the final link? Such a use case seems quite common.
Is it possible? If not, could you suggest any options to simplify file storage / the filesystem tree structure in S3?
Many thanks.
S3 doesn't actually have folders. In a normal filesystem, 2014/12/01/blah.jpg would mean you've got a 2014 folder with a folder called 12 inside it and so on, but in S3 the entire 2014/12/01/blah.jpg is the key - essentially a single long filename. You don't have to create any folders.
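So the usual pattern is to generate whatever key you like yourself at upload time and build the link from it. A rough sketch in Node.js (the bucket name, region and key scheme are just examples, and a non-expiring public link assumes the bucket allows public reads):

    // Sketch: generate a date + hash key yourself, upload, and build the object URL.
    // Bucket name, region and key scheme are placeholders.
    const crypto = require("crypto");
    const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");

    const s3 = new S3Client({ region: "us-east-1" }); // placeholder region
    const BUCKET = "my-uploads-bucket";               // placeholder bucket

    async function storeUpload(buffer) {
      const d = new Date();
      const hash = crypto.createHash("sha256").update(buffer).digest("hex").slice(0, 16);
      // The whole string is just a key; the "/" characters don't create real folders.
      const key = `${d.getFullYear()}/${String(d.getMonth() + 1).padStart(2, "0")}/` +
                  `${String(d.getDate()).padStart(2, "0")}/${hash}.jpg`;

      await s3.send(new PutObjectCommand({
        Bucket: BUCKET, Key: key, Body: buffer, ContentType: "image/jpeg",
      }));
      // Virtual-hosted-style URL; use the regional endpoint for buckets outside us-east-1.
      return `https://${BUCKET}.s3.amazonaws.com/${key}`;
    }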

Query regarding cloud file storage services - can I append data to an existing file?

I am working to create an application where some files will be stored in Amazon S3/Rackspace Cloud Files/other similar cloud file storage providers.
There are a couple of scenarios where it would be easier for me if I could append data to an existing file... Is this possible? Or do I have to download the file from Amazon S3, append data to it, and finally upload the modified file back to Amazon S3?
There is no way to append anything to existing files in S3.
You will have to download the file, modify it, and upload it again.
If you wish, though, you can always upload the new data with a tag (a timestamp or a counter), e.g. file_201201011344. Then when reading, you get all files matching your pattern and append them on the client side.
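A rough sketch of the read-and-append side in Node.js (bucket and prefix are placeholders):

    // Sketch: read every object whose key starts with a prefix (e.g. "file_") and
    // concatenate them in key order on the client side. Bucket and prefix are placeholders.
    const { S3Client, ListObjectsV2Command, GetObjectCommand } = require("@aws-sdk/client-s3");

    const s3 = new S3Client({});

    async function readConcatenated(bucket, prefix) {
      const listed = await s3.send(new ListObjectsV2Command({ Bucket: bucket, Prefix: prefix }));
      const keys = (listed.Contents || []).map((o) => o.Key).sort(); // timestamp suffix keeps order
      const parts = [];
      for (const key of keys) {
        const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
        parts.push(Buffer.from(await obj.Body.transformToByteArray()));
      }
      return Buffer.concat(parts);
    }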