How to stream data from Amazon SQS to files in Amazon S3

How can I quickly create a mechanism that reads JSON data from Amazon SQS and saves it as Avro files (or possibly another format) in an S3 bucket, partitioned by date and by the value of a given field in the JSON message?

You can write an AWS Lambda function that gets triggered by a message being sent to an Amazon SQS queue. You are responsible for writing that code, so the answer is that it depends on your coding skill.
However, if each message is processed individually, you will end up with one Amazon S3 object per SQS message, which is quite inefficient to process. The fact that the file is in Avro format is irrelevant because each file will be quite small. This will add a lot of overhead when processing the files.
An alternative could be to send the messages to an Amazon Kinesis Data Firehose delivery stream, which can buffer messages together by size (eg every 5MB) or time (eg every 5 minutes) before writing to S3. This will result in fewer, larger objects in S3, but they will not be partitioned, nor in Avro format.
To get the best performance out of a format like Avro, combine the data into larger files that will be more efficient to process. So, for example, you could use Kinesis for collecting the data, then a daily Amazon EMR job to combine those files into partitioned Avro files.
So, the answer is: "It's pretty easy, but you probably don't want to do it."
Your question does not define how the data gets into SQS. If, rather than processing messages as soon as they arrive, you are willing for the data to accumulate in SQS for some period of time (eg 1 hour or 1 day), you could then write a program that reads all of the messages and outputs them into partitioned Avro files. This uses SQS as a temporary holding area, allowing data to accumulate before being processed. However, it would lose any real-time reporting aspect.
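As a rough sketch of that accumulate-and-drain approach, the program below reads whatever is currently in the queue, groups messages by a field, and writes one Avro file per partition to S3 using boto3 and fastavro. The queue URL, bucket name, Avro schema and the partition field 'event_type' are assumptions for illustration, not taken from the question.
import json
from collections import defaultdict
from datetime import datetime, timezone
import boto3
from fastavro import parse_schema, writer

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'  # assumption
BUCKET = 'my-archive-bucket'                                             # assumption

# Assumed Avro schema; adjust to the real message structure.
SCHEMA = parse_schema({
    'name': 'Event', 'type': 'record',
    'fields': [{'name': 'event_type', 'type': 'string'},
               {'name': 'payload', 'type': 'string'}],
})

sqs = boto3.client('sqs')
s3 = boto3.client('s3')

def drain_queue(max_empty_polls=3):
    """Read messages until the queue looks empty, grouping them by partition value."""
    groups, empty_polls = defaultdict(list), 0
    while empty_polls < max_empty_polls:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10, WaitTimeSeconds=2)
        messages = resp.get('Messages', [])
        if not messages:
            empty_polls += 1
            continue
        for msg in messages:
            body = json.loads(msg['Body'])
            groups[body.get('event_type', 'unknown')].append(
                {'event_type': body.get('event_type', 'unknown'),
                 'payload': json.dumps(body)})
        sqs.delete_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[{'Id': str(i), 'ReceiptHandle': m['ReceiptHandle']}
                     for i, m in enumerate(messages)])
    return groups

def write_partitions(groups):
    """Write one Avro file per partition value under a date-based S3 prefix."""
    today = datetime.now(timezone.utc).strftime('%Y-%m-%d')
    for value, records in groups.items():
        local_path = f'/tmp/{value}.avro'
        with open(local_path, 'wb') as out:
            writer(out, SCHEMA, records)
        s3.upload_file(local_path, BUCKET,
                       f'events/date={today}/event_type={value}/{value}.avro')

if __name__ == '__main__':
    write_partitions(drain_queue())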

Related

AWS Kinesis Data Firehose and Lambda

I have different data sources and I need to publish them to S3 in real time. I also need to process and validate the data before delivering it to S3 buckets, so I have to use AWS Lambda for validating the data. The question is: what is the difference between using AWS Kinesis Data Firehose and using AWS Lambda to store data directly into an S3 bucket? Put simply, what are the advantages of using Kinesis Data Firehose, given that we can use AWS Lambda to put records directly into S3?
We might want to clarify what "near real time" means; for me, it is below 1 second.
Kinesis Firehose in this case will batch the items before delivering them into S3. This will result in more items per S3 object.
You can configure how often you want the data to be stored. (You can also connect a Lambda function to Firehose, so you can process the data before delivering it to S3; see the sketch below.) Kinesis Firehose will scale automatically.
Note that each PUT to S3 has a cost associated with it.
If you connect your data source to AWS Lambda, then each event will trigger the Lambda (unless you have a batching mechanism in place, which you didn't mention), and for each event you will make a PUT request to S3. This will result in a lot of small objects in S3 and therefore a lot of S3 PUT API calls.
Also, depending on the number of items received per second, Lambda might not be able to scale and the associated cost will increase.
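To illustrate the "connect a Lambda to Firehose" option, a Firehose data-transformation Lambda receives batched records, can validate or modify each one, and returns them in the required recordId/result/data shape. This is only a minimal sketch; the validation rule (requiring an 'id' field) is an assumption.
import base64
import json

def lambda_handler(event, context):
    """Kinesis Data Firehose transformation: validate each record before delivery."""
    output = []
    for record in event['records']:
        payload = base64.b64decode(record['data'])
        try:
            data = json.loads(payload)
            # Assumed validation rule: every record must carry an 'id' field.
            if 'id' not in data:
                raise ValueError('missing id')
            # Re-encode as newline-delimited JSON so the S3 objects are easy to crawl later.
            out_bytes = (json.dumps(data) + '\n').encode('utf-8')
            result = 'Ok'
        except ValueError:
            out_bytes = payload
            result = 'ProcessingFailed'
        output.append({'recordId': record['recordId'],
                       'result': result,
                       'data': base64.b64encode(out_bytes).decode('utf-8')})
    return {'records': output}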

How to better transfer SQS queue messages into Redshift?

I have an SQS queue that my application constantly sends messages to (about 5-15 messages per second).
I need to take the message data and put it in Redshift.
Right now, I have a background service which gets X messages from the queue every Y minutes, puts them in a file in S3, and transfers the data into Redshift using the COPY command.
This implementation has some problems:
In my service, I get X messages at a time, and because of the SQS limits, Amazon allows receiving at most 10 messages per call (meaning that if I want to get 1000 messages, I need to make 100 network calls).
My service doesn't scale as the application scales: when there are 30 (or 300) messages per second, my service won't be able to handle all the messages.
Using AWS Firehose is a little inconvenient the way I see it, because shards are not scaled automatically (I would need to manually configure additional shards), but maybe I'm wrong here.
...
As a result of those things, I need something that is as scalable and efficient as possible. Any ideas?
For the purpose you have described, I think AWS would say that Kinesis Data Streams plus Kinesis Data Firehose is a more appropriate service than SQS.
Yes, as you said, you do have to configure the shards, but just one shard can handle 1000 incoming records/sec. There are also ways to automate the scaling, for example as AWS have documented here.
One further advantage of using Kinesis Data Firehose is that you can create a delivery stream which pushes the data straight into Redshift if you wish.
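For comparison with the 10-message SQS receive limit, a producer can push up to 500 records per call to a Kinesis data stream; the attached Firehose delivery stream then takes care of the batching and the COPY into Redshift. A minimal sketch, where the stream name and partition-key field are assumptions:
import json
import boto3

kinesis = boto3.client('kinesis')
STREAM_NAME = 'app-events'  # assumption

def send_batch(messages):
    """Send up to 500 JSON messages to a Kinesis data stream in one call."""
    records = [{'Data': (json.dumps(m) + '\n').encode('utf-8'),
                'PartitionKey': str(m.get('user_id', 'default'))}  # assumed key field
               for m in messages]
    resp = kinesis.put_records(StreamName=STREAM_NAME, Records=records)
    if resp['FailedRecordCount']:
        # In production you would retry the failed subset.
        print(f"{resp['FailedRecordCount']} records failed")

send_batch([{'user_id': 1, 'value': 42}, {'user_id': 2, 'value': 7}])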

AWS Glue - how to crawl a Kinesis Firehose output folder from S3

I have what I think should be a relatively simple use case for AWS Glue, yet I'm having a lot of trouble figuring out how to implement it.
I have a Kinesis Firehose job dumping streaming data into a S3 bucket. These files consist of a series of discrete web browsing events represented as JSON documents with varying structures (so, say, one document might have field 'date' but not field 'name', whereas another might have 'name' but not 'date').
I wish to run hourly ETL jobs on these files, the specifics of which are not relevant to the matter at hand.
I'm trying to run an S3 data catalog crawler, and the problem I'm running into is that the Kinesis output format is not, itself, valid JSON, which is just baffling to me. Instead it's a bunch of JSON documents separated by a line break. The crawler can automatically identify and parse JSON files, but it cannot parse this.
I thought of writing a lambda function to 'fix' the Firehose file, triggered by its creation on the bucket, but it sounds like a cheap workaround for two pieces that should fit neatly together.
Another option would be just bypassing the data catalog altogether and doing the necessary transformations in the Glue script itself, but I have no idea how to get started on this.
Am I missing anything? Is there an easier way to parse Firehose output files or, failing that, to bypass the need for a crawler?
cheers and thanks in advance
It sounds like you're describing the behaviour of Kinesis Firehose, which is to concatenate multiple incoming records according to some buffering (time and size) settings, and then write the records to S3 as a single object. Firehose Data Delivery
The batching of multiple records into a single file is important if the workload will contain a large number of records, as performance (and S3 costs) for processing many small files from S3 can be less than optimal.
AWS Glue Crawlers and ETL jobs do support processing of 'JSON line' (newline delimited JSON) format.
If the crawler is failing to run, please include the logs or error details (and, if possible, the crawler run duration and the number of tables created and updated).
I have seen a crawler fail in an instance where differences in the files being crawled forced it into a table-per-file mode, and it hit a limit on the number of tables. AWS Glue Limits
I managed to fix this; basically the problem was that not every JSON document had the same underlying structure.
I wrote a lambda script as part of the Kinesis process that forced every document into the same structure, by adding NULL fields where necessary. The crawlers were then able to correctly parse the resulting files and map them to a single table.
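The normalisation itself can be a few lines; something along the lines of the sketch below could run inside that Lambda for each document (the field list is hypothetical, not taken from the answer):
# Hypothetical set of fields every output document should expose.
EXPECTED_FIELDS = ['date', 'name', 'url', 'user_agent']

def normalise(doc):
    """Force every browsing event into the same shape by filling missing keys with None."""
    return {field: doc.get(field) for field in EXPECTED_FIELDS}

normalise({'name': 'click'})  # -> {'date': None, 'name': 'click', 'url': None, 'user_agent': None}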
Can you please paste a few lines from the JSON file that Firehose is creating? I ran the crawler on a JSON file generated by Kinesis Streams and it was able to parse it successfully.
Did you also try "convert record format" when you create the Firehose job? There you can specify the JSONSerDe or Glue catalog to parse your data.
What solved this for me was to add a newline character '\n' to the end of each payload sent to Firehose.
msg_pkg = (str(json_response) + '\n').encode('utf-8')
record = {'Data': msg_pkg}
put_firehose('agg2-na-firehose', record)  # put_firehose is the poster's own helper (presumably wrapping the Firehose put_record call)
Apparently the Hive JSON SerDe, which is the default used to process JSON data, expects one JSON record per line. After doing this I was able to crawl the JSON data and read it in Athena as well.

Batching and Uploading real-time traffic to S3

I am looking for some suggestions/solutions on implementing an archiving workflow at big data scale.
The source of the data is messages in Kafka, which are written to in real time. The destination is an S3 bucket.
I need to partition the data based on a field in the message. For each partition, I need to batch the data into 100 MB chunks and then upload them.
The data rate is ~5 GB/minute, so a 100 MB batch should get filled within a couple of seconds.
My trouble is around scaling and batching. Since I need to batch and compress the data per "field" in the message, I need to bring that part of the data together by partitioning. Any suggestions on technology/workflow?
You can use Kafka Connect. There's a connector for S3:
http://docs.confluent.io/current/connect/connect-storage-cloud/kafka-connect-s3/docs/s3_connector.html
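As a rough sketch, the S3 sink connector can be registered through the Kafka Connect REST API with a field-based partitioner; the topic, bucket and field names below are assumptions, and the property names should be double-checked against the Confluent documentation linked above.
import requests

# Assumed Connect worker endpoint, topic, bucket and partition field.
connector = {
    'name': 's3-archive-sink',
    'config': {
        'connector.class': 'io.confluent.connect.s3.S3SinkConnector',
        'topics': 'events',
        's3.bucket.name': 'my-archive-bucket',
        's3.region': 'us-east-1',
        'storage.class': 'io.confluent.connect.s3.storage.S3Storage',
        'format.class': 'io.confluent.connect.s3.format.avro.AvroFormat',
        # Partition objects by a field inside the message.
        'partitioner.class': 'io.confluent.connect.storage.partitioner.FieldPartitioner',
        'partition.field.name': 'customer_id',
        # Roughly control object size via the number of records per file.
        'flush.size': '100000',
    },
}
requests.post('http://localhost:8083/connectors', json=connector).raise_for_status()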
You can use Apache Spark to do the scaling and batching for you. So basically the flow can look like this:
Apache Kafka -> Apache Spark -> Amazon S3.
Spark Streaming API enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka and can be processed using complex algorithms such as high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems like Amazon S3.
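A minimal Structured Streaming sketch of that Kafka -> Spark -> Amazon S3 flow is shown below. The broker address, topic, message schema, partition field and output paths are assumptions; Parquet is used for the output, although Avro is also possible with the spark-avro package.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date, current_timestamp
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName('kafka-to-s3-archive').getOrCreate()

# Assumed message schema: a partition field plus an opaque payload.
schema = StructType([StructField('customer_id', StringType()),
                     StructField('payload', StringType())])

events = (spark.readStream
          .format('kafka')
          .option('kafka.bootstrap.servers', 'broker1:9092')  # assumption
          .option('subscribe', 'events')                      # assumption
          .load()
          .select(from_json(col('value').cast('string'), schema).alias('e'))
          .select('e.*')
          .withColumn('dt', to_date(current_timestamp())))

# Each micro-batch is written out partitioned by date and by the message field.
query = (events.writeStream
         .format('parquet')
         .option('path', 's3a://my-archive-bucket/events/')  # assumption
         .option('checkpointLocation', 's3a://my-archive-bucket/checkpoints/events/')
         .partitionBy('dt', 'customer_id')
         .trigger(processingTime='1 minute')
         .start())

query.awaitTermination()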

Concatenate files in S3 using AWS Lambda

Is there a way to use Lambda for S3 file concatenation?
I have Firehose streaming data into S3 with the longest possible interval (15 minutes or 128 MB), and therefore I have 96 data files daily, but I want to aggregate all the data into a single daily data file for the fastest performance when reading the data later in Spark (EMR).
I created a solution where a Lambda function gets invoked when Firehose streams a new file into S3. Then the function reads (s3.GetObject) the new file from the source bucket and the concatenated daily data file (if it already exists with previous daily data, otherwise it creates a new one) from the destination bucket, decodes both response bodies to strings, concatenates them, and writes the result to the destination bucket with s3.PutObject (which overwrites the previous aggregated file).
The problem is that when the aggregated file reaches 150+ MB, the Lambda function reaches its ~1500 MB memory limit when reading the two files and then fails.
Currently I have a minimal amount of data, a few hundred MBs per day, but this amount will be growing exponentially in the future. It seems weird to me that Lambda has such low limits and that they are already reached with such small files.
Or what are the alternatives for concatenating S3 data, ideally invoked by an S3 object-created event or by a scheduled job, for example scheduled daily?
I would reconsider whether you actually want to do this:
The S3 costs will go up.
The pipeline complexity will go up.
The latency from Firehose input to Spark input will go up.
If a single file injection into Spark fails (this will happen in a distributed system) you have to shuffle around a huge file, maybe slice it if injection is not atomic, upload it again, all of which could take very long for lots of data. At this point you may find that the time to recover is so long that you'll have to postpone the next injection…
Instead, unless it's impossible in the situation, if you make the Firehose files as small as possible and send them to Spark immediately:
You can archive S3 objects almost immediately, lowering costs.
Data is available in Spark as soon as possible.
If a single file injection into Spark fails there's less data to shuffle around, and if you have automated recovery this shouldn't even be noticeable unless some system is running full tilt at all times (at which point bulk injections would be even worse).
There's a tiny amount of latency increase from establishing TCP connections and authentication.
I'm not familiar with Spark specifically, but in general such a "piped" solution would involve:
A periodic trigger or (even better) an event listener on the Firehose output bucket to process input ASAP.
An injector/transformer to move data efficiently from S3 to Spark. It sounds like Parquet could help with this.
A live Spark/EMR/underlying data service instance ready to receive the data.
In case of an underlying data service, some way of creating a new Spark cluster to query the data on demand.
Of course, if it is not possible to keep Spark data ready (but not yet queryable) for a reasonable amount of money, this may not be an option. It may also be possible that it's extremely time consuming to inject small chunks of data, but that seems unlikely for a production-ready system.
If you really need to chunk the data into daily dumps you can use multipart uploads. As a comparison, we're doing light processing of several files per minute (many GB per day) from Firehose with no appreciable overhead.
You can create a Lambda function that is invoked only once a day using Scheduled Events, and in your Lambda function you should use Upload Part - Copy, which does not need to download your files to the Lambda function. There is already an example of this in this thread.
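A minimal sketch of that Upload Part - Copy approach with boto3 is shown below; the bucket, prefix and key names are assumptions. Note that every part except the last must be at least 5 MB, which the 15-minute/128 MB Firehose objects described above satisfy.
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-archive-bucket'          # assumption
DAILY_KEY = 'daily/2024-01-01.data'   # assumption

def concatenate(source_keys):
    """Combine existing S3 objects into one object without downloading them."""
    upload = s3.create_multipart_upload(Bucket=BUCKET, Key=DAILY_KEY)
    parts = []
    for number, key in enumerate(source_keys, start=1):
        part = s3.upload_part_copy(
            Bucket=BUCKET, Key=DAILY_KEY, UploadId=upload['UploadId'],
            PartNumber=number, CopySource={'Bucket': BUCKET, 'Key': key})
        parts.append({'PartNumber': number,
                      'ETag': part['CopyPartResult']['ETag']})
    s3.complete_multipart_upload(
        Bucket=BUCKET, Key=DAILY_KEY, UploadId=upload['UploadId'],
        MultipartUpload={'Parts': parts})

# List the day's Firehose objects (assumed prefix) and stitch them together.
keys = [obj['Key'] for obj in
        s3.list_objects_v2(Bucket=BUCKET, Prefix='firehose/2024/01/01/').get('Contents', [])]
concatenate(keys)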