What architecture is best for creating a serverless AWS service? - amazon-s3

I need to implement an AWS service used to store backup data from devices.
Devices are identified by IDs. The service consists of 3 endpoints:
Save device backup.
Get device backup.
Get latest device backup time.
Backup: binary data, from 10 KB up to 1 MB
Expected load:
100k saved backups per day; 2k restored backups per day.
For peak load, take the first figure and multiply by 100.
I came up with 2 architectures.
Which architecture is better to choose or build a new one?
Can I combine everything into one API Gateway, or do I need a separate API for each request?
Can I merge everything into one Lambda function, or do I need a separate function for each action?

A device backup would consist of two elements:
The backup data: Best stored in Amazon S3
Metadata about the backup (user, timestamp, pointer to backup data): Best stored in some type of database, such as DynamoDB
The processes would then be:
Saving backup: Send backup data via API Gateway to Lambda. The Lambda function would save the data in Amazon S3 and add an entry to the DynamoDB database, returning a reference to the backup entry in the database.
Retrieving backup: Send request via API Gateway to Lambda. The Lambda function uses the metadata in DynamoDB to determine which backup to serve, then creates an Amazon S3 pre-signed URL and returns the URL to the device. The device then retrieves the backup directly from the S3 bucket.
Listing backups: Send request via API Gateway to Lambda. The Lambda function uses the metadata in DynamoDB to retrieve a list of backups (or just the latest backup), then returns the values.
It would be cleaner to use a separate Lambda function for each type of request (save, retrieve, list). These would be triggered via different paths within API Gateway.
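As a rough sketch of the save and retrieve paths (Python with boto3); the bucket name, table name, and API Gateway event shape below are assumptions for illustration, not details from the question:
import base64
import time
import uuid

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("DeviceBackups")  # hypothetical table name
BUCKET = "device-backups-bucket"  # hypothetical bucket name

def save_backup(event, context):
    # Assumes API Gateway passes the device id as a path parameter and the
    # binary payload base64-encoded in the body (binary media types enabled).
    device_id = event["pathParameters"]["deviceId"]
    body = base64.b64decode(event["body"])
    backup_id = str(uuid.uuid4())
    key = f"{device_id}/{backup_id}"

    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    table.put_item(Item={
        "deviceId": device_id,
        "timestamp": int(time.time()),
        "backupId": backup_id,
        "s3Key": key,
    })
    return {"statusCode": 200, "body": backup_id}

def get_backup_url(s3_key):
    # Return a pre-signed URL so the device downloads directly from S3
    # instead of streaming the backup payload through Lambda.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": s3_key},
        ExpiresIn=3600,
    )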

Related

BigQuery data transfer not recognizing wild card storage object names

I have set up a data transfer service in Google Cloud to move data from JSON files in storage buckets to a BigQuery table. The files in the storage buckets have the same name prefix but different date-based suffixes. I have used gs://bucket_name/filename*.json as the Cloud Storage URI in the transfer service configuration. The service seems to run fine the first time it is tried, but subsequent runs with the same set of files complete with the error "No files found matching: "gs://bucket_name/filename*.json"". I don't see any other errors in the logs. I am trying this for the first time and need to schedule multiple such transfer service configurations. Any clue why this is not working?

SQS and AWS Lambda Integration

I am developing an audit trail system that will act as a central location for all the critical events happening around the organization. I am planning to use Amazon SQS as a temporary queue to hold the messages, which in turn will trigger an AWS Lambda function to write the messages into the AWS S3 store. I want to segregate the data at the tenantId level (some identifiable id) and persist the messages as batches in S3, which will reduce the number of calls from Lambda to S3. Moreover, I want to trigger the Lambda every hour. But I have two issues here: first, the max batch size provided by SQS is 10; second, the Lambda trigger polls the SQS service on a regular basis, which is going to increase the number of calls to my S3. I want to create a manual batch of 1000 messages (say) before calling the S3 batch API. I am not sure how to architect my system so that the above requirements can be met. Any help or ideas are much appreciated!
(Simplified architecture diagram omitted.)
Thanks!
I would recommend that you instead use Amazon Kinesis Data Firehose. It basically does what you're wanting to do:
Accepts incoming messages
Buffers them for a period of time
Writes output to S3 or Elasticsearch
This is all done as a managed service, and it can also integrate with AWS Lambda to provide custom processing (e.g. filtering out certain records).
However, you might have to do something special to segregate the data at tenantId. See: Can I customize partitioning in Kinesis Firehose before delivering to S3?
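For illustration only, a delivery stream buffering to S3 could be created roughly like this (Python with boto3); the names and ARNs are placeholders, and note that Firehose caps the buffer interval (900 seconds at the time of writing), so strictly hourly batching is not directly available:
import boto3

firehose = boto3.client("firehose")

# All names and ARNs below are hypothetical; replace with your own resources.
firehose.create_delivery_stream(
    DeliveryStreamName="audit-trail-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::audit-trail-bucket",
        # A static prefix; per-tenant segregation would need dynamic
        # partitioning or a Lambda transform, as noted above.
        "Prefix": "audit-logs/",
        # Buffer until 128 MB or 900 seconds, whichever comes first.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900},
    },
)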

aws s3 sync cli ignoring multipart upload config when syncing between buckets

I'm trying to sync a large number of files from one bucket to another; some of the files are up to 2 GB in size. After using the AWS CLI's s3 sync command like so
aws s3 sync s3://bucket/folder/folder s3://destination-bucket/folder/folder
and verifying the files that had been transferred, it became clear that the large files had lost the metadata that was present on the original files in the original bucket.
This is a "known" issue with larger files, where S3 switches to multipart upload to handle the transfer.
This multipart handling can be configured via the .aws/config file, which has been done like so:
[default]
s3 =
  multipart_threshold = 4500MB
However, when testing the transfer again, the metadata on the larger files is still not present. It is present on all of the smaller files, so it's clear that I'm hitting the multipart upload issue.
Given this is an S3-to-S3 transfer, is the local S3 configuration taken into consideration at all?
As an alternative to this is there a way to just sync the metadata now that all the files have been transferred?
Have also tried doing aws s3 cp with no luck either.
You could use Cross/Same-Region Replication to copy the objects to another Amazon S3 bucket.
However, only newly added objects will copy between the buckets. You can, however, trigger the copy by copying the objects onto themselves. I'd recommend you test this on a separate bucket first, to make sure you don't accidentally lose any of the metadata.
The method suggested seems rather complex: Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena | AWS Big Data Blog
The final option would be to write your own code to copy the objects, and copy the metadata at the same time.
Or, you could write a script that compares the two buckets to see which objects did not get their correct metadata, and have it just update the metadata on the target object. This actually involves copying the object to itself, while specifying the metadata. This is probably easier than copying ALL objects yourself, since it only needs to 'fix' the ones that didn't get their metadata.
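A minimal sketch of that "copy the object onto itself" fix (Python with boto3), assuming you already know which keys lost their metadata and what the values should be:
import boto3

s3 = boto3.client("s3")

def restore_metadata(bucket, key, metadata):
    # Copying an object onto itself with MetadataDirective="REPLACE" rewrites
    # its user metadata without changing the object data.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata=metadata,  # e.g. the metadata read from the source bucket's object
        MetadataDirective="REPLACE",
    )
Note that a single copy_object call only works for objects up to 5 GB; larger objects would need a multipart copy.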
I finally managed to implement a solution for this and took the opportunity to play around with the Serverless Framework and Step Functions.
The general flow I went with was:
The Step Function is triggered using a CloudWatch Events rule targeting S3 events of the type 'CompleteMultipartUpload', as the metadata is only ever missing on S3 objects that had to be transferred using a multipart process.
The initial task in the Step Function checks whether all the required metadata is present on the object that raised the event.
If it is present, then the Step Function is finished.
If it is not present, then a second Lambda task is fired, which copies all the metadata from the source object to the destination object.
This could be achieved without Step Functions; however, it was a good, simple exercise to give them a go. The first 'Check Meta' task is actually redundant, as the metadata is never present if a multipart transfer is used. I was originally also triggering off of PutObject and CopyObject, which is why I had the Check Meta task.

How can I search the changes made on an S3 bucket between two timestamps?

I am using an S3 bucket to store my data, and I keep pushing data to this bucket every single day. I wonder whether there is a feature that lets me compare the differences between the files in my bucket on two dates. If not, is there a way for me to build one via the AWS CLI or SDK?
The reason I want to check this is that I have an S3 bucket and my clients keep pushing data to it. I want to have a look at how much data they have pushed since the last time I loaded it. Is there a pattern in AWS that supports this query? Or do I have to create any rules on the S3 bucket to analyse it?
Listing from Amazon S3
You can activate Amazon S3 Inventory, which can provide a daily file listing the contents of an Amazon S3 bucket. You could then compare differences between two inventory files.
List it yourself and store it
Alternatively, you could list the contents of a bucket and look for objects dated since the last listing. However, if objects are deleted, you will only know this if you keep a list of objects that were previously in the bucket. It's probably easier to use S3 inventory.
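If you do list it yourself, a sketch along these lines (Python with boto3; bucket name and timestamp are placeholders) would collect everything added since a given point in time. It cannot detect deletions unless you also keep the previous listing:
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def objects_added_since(bucket, since):
    # Return keys whose LastModified is after `since` (a timezone-aware datetime).
    new_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > since:
                new_keys.append(obj["Key"])
    return new_keys

# Example: everything uploaded since the start of 2024-01-01 UTC.
print(objects_added_since("my-bucket", datetime(2024, 1, 1, tzinfo=timezone.utc)))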
Process it in real-time
Instead of thinking about files in batches, you could configure Amazon S3 Events to trigger something whenever a new file is uploaded to the Amazon S3 bucket. The event can:
Trigger a notification via Amazon Simple Notification Service (SNS), such as an email
Invoke an AWS Lambda function to run some code you provide. For example, the code could process the file and send it somewhere.
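The Lambda side of that event flow is just a handler that unpacks the S3 event records; this is a generic sketch, with the actual processing step left to you:
def handler(event, context):
    # Each record describes one uploaded object; real code would process
    # or forward it rather than just printing.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object: s3://{bucket}/{key}")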

Push logs in S3 to DynamoDB continuously

We have our application logs pumped to S3 via Kinesis Firehose. We want this data to also flow to DynamoDB so that we can efficiently query the data to be presented in a web UI (Ember app). The need for this is so that users are able to filter and sort the data, and so on; basically, to support querying abilities via the web UI.
I looked into AWS Data Pipeline. It is reliable, but more tuned to one-time or scheduled imports. We want the flow of data from S3 to DynamoDB to be continuous.
What other choices are out there to achieve this? Moving data from S3 to DynamoDB isn't a very unique requirement, so how have you solved this problem?
Is an S3-event-triggered Lambda an option? If yes, how do I make this Lambda fault-tolerant?
For Full Text Querying
You can design your solution as follows, using Amazon Elasticsearch Service as the destination for rich querying.
Set up the Kinesis Firehose destination to Amazon Elasticsearch Service. This will allow you to do full-text querying from your web UI.
You can choose to either back up failed records only or all records. If you choose all records, Kinesis Firehose backs up all incoming source data to your S3 bucket concurrently with data delivery to Amazon Elasticsearch. 
For Basic Querying
If you plan to use DynamoDB to store the metadata of the logs, it's better to configure an S3 trigger to a Lambda function, which will retrieve the file and write its metadata to DynamoDB.
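A rough sketch of that S3-trigger-to-DynamoDB path (Python with boto3); the table name and attribute layout are assumptions for illustration:
import boto3

table = boto3.resource("dynamodb").Table("LogMetadata")  # hypothetical table name

def handler(event, context):
    # For each log file Firehose delivered to S3, record where it lives and
    # when it arrived so the web UI can filter and sort on these attributes.
    for record in event.get("Records", []):
        table.put_item(Item={
            "s3Key": record["s3"]["object"]["key"],
            "bucket": record["s3"]["bucket"]["name"],
            "size": record["s3"]["object"].get("size", 0),
            "eventTime": record["eventTime"],
        })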
Is an S3 event triggered lambda an option?
This is definitely an option. You can create a PutObject event notification on your S3 bucket and have it call your Lambda function; S3 will invoke the function asynchronously.
if yes, then how to make this lambda fault tolerant?
By default, asynchronous invocations will retry twice upon failure. To ensure fault-tolerance beyond the two retries, you can use Dead Letter Queues and send the failed events to an SQS queue or SNS topic to be handled at a later time.
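To make that concrete, attaching a dead letter queue and tuning the retry count can be done roughly like this (Python with boto3); the function name and queue ARN are placeholders:
import boto3

lam = boto3.client("lambda")

# Send events that still fail after the retries to an SQS queue (hypothetical ARN).
lam.update_function_configuration(
    FunctionName="audit-log-writer",
    DeadLetterConfig={"TargetArn": "arn:aws:sqs:us-east-1:123456789012:lambda-dlq"},
)

# Optionally cap how many times asynchronous invocations are retried (0-2).
lam.put_function_event_invoke_config(
    FunctionName="audit-log-writer",
    MaximumRetryAttempts=2,
)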