Service on AWS to move data from one service to another - api

I want to build a service running on AWS that would fetch metrics from another service A, also running on AWS, do some processing, and then post it to a different service B for computing overall resource usage running on a different public cloud. The APIs for the existing services A and B are already defined, and are beyond my control.
My principal concern is that the volume of data I will fetch and post may be high, and I may have to do some computation on this data before posting the results. If the service is to run on a fixed periodic schedule, and I need to make it resilient, how should it be deployed?
EC2 VM.
Lambda.
Additionally:
How do I make the service resilient / highly available?
How do I scale it with higher data volumes? One thought is to partition the keyspace of the data based on tenant, etc. and perform the computation in concurrent compute instances that are independent, non-overlapping.
If I store the data in transit for intermediate processing, how can I make the data in transit resilient?
These questions are from an AWS infrastructure perspective because I have very little prior knowledge of AWS.
Sample numbers
Input data from service A: 10000 records per minute, each record about 1 KiB. So 10 MiB per minute.
Processing latency in the service - maximum 0.1 second per record.
Data posted to service B: About 2 MiB per minute. The connections are over WAN.

Related

scalability of azure cloud queue

In current project we currently use 8 worker role machines side by side that actually work a little different than azure may expect it.
Short outline of the system:
each worker start up to 8 processes that actually connect to cloud queue and processes messages
each process accesses three different cloud queues for collecting messages for different purposes (delta recognition, backup, metadata)
each message leads to a WCF call to an ERP system to gather information and finally add retreived response in an ReDis cache
this approach has been chosen over many smaller machines due to costs and performance. While 24 one-core machines would perform by 400 calls/s to the ERP system, 8 four-core machines with 8 processes do over 800 calls/s.
Now to the question: when even increasing the count of machines to increase performance to 1200 calls/s, we experienced outages of Cloud Queue. In same moment of time, 80% of the machines' processes don't process messages anymore.
Here we have two problems:
Remote debugging is not possible for these processes, but it was possible to use dile to get some information out.
We use GetMessages method of Cloud Queue to get up to 4 messages from queue. Cloud Queue always answers with 0 messages. Reconnect the cloud queue does not help.
Restarting workers does help, but shortly lead to same problem.
Are we hitting the natural end of scalability of Cloud Queue and should switch to Service Bus?
Update:
I have not been able to fully understand the problem, I described it in the natual borders of Cloud Queue.
To summarize:
Count of TCP connections have been impressive. Actually too impressive (multiple hundreds)
Going back to original memory size let the system operate normally again
In my experience I have been able to get better raw performance out of Azure Cloud Queues than service bus, but Service Bus has better enterprise features (reliable, topics, etc). Azure Cloud Queue should process up to 2K/second per queue.
https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/
You can also try partitioning to multiple queues if there is some natural partition key.
Make sure that your process don't have some sort of thread deadlock that is the real culprit. You can test this by connecting to the queue when it appears hung and trying to pull messages from the queue. If that works it is your process, not the queue.
Also take a look at this to setup some other monitors:
https://azure.microsoft.com/en-us/documentation/articles/storage-monitor-storage-account/
It took some time to solve this issue:
First a summarization of the usage of the storage account:
We used the blob storage once a day pretty heavily.
The "normal" diagonistics that Azure provides out of the box also used the same storage account.
Some controlling processes used small tables to store and read information once an hour for ca. 20 minutes
There may be up to 800 calls/s that try to increase a number to count calls to an ERP system.
When recognizing that the storage account is put under heavy load we split it up.
Now there are three physical storage accounts heaving 2 queues.
The original one still keeps up to 800/s calls for increasing counters
Diagnositics are still on the original one
Controlling information has been also moved
The system runs now for 2 weeks, working like a charm. There are several things we learned from that:
No, the infrastructure is "not just there" and it doesn't scale endlessly.
Even if we thought we didn't use "that much" summarized we used quite heavily and uncontrolled.
There is no "best practices" anywhere in the net that tells the complete story. Esp. when start working with the storage account a guide from MS would be quite helpful
Exception handling in storage is quite bad. Even if the storage account is overused, I would expect some kind of exception and not just returning zero message without any surrounding information
Read complete story here: natural borders of cloud storage scalability
UPDATE:
The scalability has a lot of influences. You may are interested in Azure Service Bus: Massive count of listeners and senders to be aware of some more pitfalls.

Google Compute Engine Load Balancer limits

I'm thinking of using Google Compute Engine to run a LOT of instances in a target pool behind a network load balancer. Each of those instances will end up real-time processing many large data streams, so at full scale and peak times there might be multiple Terabytes per second go through.
Question:
Is there a quota or limit to the data you can push through those load balancers? Is there a limit of instances you can have in a target pool? (the documentation does not seem to specify this)
It seems like load balancers have a dedicated IP (means it's a single machine?)
There's no limit on the amount of data that you can push through a LB. As for instances, there are default limits on CPUs, persistent or SSD disks, and you can see those quotas in the Developers Console at 'Compute' > 'Compute Engine'> 'Quotas', however you can always request increase quota at this link. You can have as many instances that you need in a target pool. Take a look to the Compute Engine Autoscaler that will help you to spin up machines as your service needs. The single IP provided for your LB is in charge of distributing incoming traffic across your multiple instances.

Is there a way to leverage Hadoop tools to mange parallel REST API calls to external sources?

I am writing software that creates a large graph database. The software needs to access dozens of different REST APIs with millions of total requests. The data will then be processed by the Hadoop cluster. Each of these APIs have rate limits that vary by requests/second, per window, per day and per user (typically via OAuth).
Does anyone have any suggestions on how I might use either a Map function or other Hadoop-ecosystem tool to manage these queries? The goal would to be to leverage the parallel processing in Hadoop.
Because of the varied rate limits, it often makes sense to switch to a different API query while waiting for the first limit to reset. An example would be one API call that creates nodes in the graph and another that enriches the data for that node. I could have the system go out and enrich the data for the new nodes while waiting for the first API limit to reset.
I have tried using SQS queuing on EC2 to manage the various API limits and states (creating a queue for each API call), but have found it to be ridiculously slow.
Any ideas?
It looks like the best option for my scenario will be using Storm, or specifically the Trident abstraction. It gives me the greatest flexibility for both workload management but process management as well

distributed cluster questions about performance

I'm using 6 servers to make a cluster and they are all disk nodes. I use rabbitmq for collecting log file for our website. Now at the peak hour, the publish rate is about 30k message per second. There are 2 main consumers(hdfs and elasticsearch) and each one need to handle all message, so the delivery rate hit about 60k per second.
In my scenario, a single server can hold 10k delivery rate and I use 6 node to load balance the pressure. My solution is that I created 2 queues on each node. Each message is with a random routing-key(something like message.0, message.1, etc) to distribute the pressure to every node.
What confused me is:
All message send to one node. Should I use a HA Proxy to load balance this publish pressure?
Is there any performance difference between Durable Queues and Transient Queues?
Is there any performance difference between Memory Node and Disk Node? What I know is the difference between memory node and disk node is only about the meta data such as queue configuration.
How can I imrove the performance in publish and delivery codes? I've researched and I know several methods:
disable the confirm mechanism(in publish codes?)
enable HiPE(I've done that and it helped a lot)
For example, input is 1w mps(message per second), there are two consumers to consume all message. Then the output is 2w mps. If my server can handle 1w mps, I need two server to handle the 2w-mps-pressure. Now a new consumer need to consume all message, too. As a result, output hits 3w mps, so I need another one more server. For a conclusion, one more consumer for all message, one more server?
"All message send to one node. Should I use a HA Proxy to load balance this publish pressure?"
This article outlines a number of designs aimed at distributing load in RabbitMQ.
"Is there any performance difference between Durable Queues and Transient Queues?"
Yes, Durable Queues are backed up to disk so that they can be reinstated on server-restart, for example. This adds a nominal overhead, though the actual process occurs asynchronously.
"Is there any performance difference between Memory Node and Disk Node?"
Not that I'm aware of, but that would depend on the machine itself.
"How can I imrove the performance in publish and delivery codes?"
Try this out.

Would Amazon S3 work as storage for sensor data streaming in from millions of endpoints?

I am looking for options for reliable (and speedy) storage for small amounts of sensor data that would be coming in from (getting optimistic here) millions of endpoints.
The scale I'm talking is 1M endpoints, each sending 100 bytes every minute. This data needs to be available for analysis shortly after this.
Additionally, this data will be kept for a few years and may exceed 100TB of total storage.
Is S3 the solution to this, or would I be better off hosting my own NoSQL cluster like Cassandra/MongoDB etc?
Please let me know if I have not specified any information.
Yes, you could. But, there are no query mechanisms nor any method of reading multiple objects in one request in S3. You would also not have any mechanism to inspect the data before it's written.
This might be a better idea:
Have clients write sensor data onto a SQS queue
Your application reads messages off the queue and writes the data into SimpleDB or other data store
It would de-couple receipt of the data, with any data load/storage phase.
Note that many Amazon services have a per-request charge. For SQS it's $0.01/10000 requests. If you want to have 1 million clients write one message each minute request charges alone would be over $40,000 a month. Doubling when taking reading the messages into account.
(((1000000 * (60*24*30)) / 10000) * $0.01) * 2 = $86,400
For S3, it's $0.01/1000 for POSTs (client writes), and $0.01/10000 GETs (reads). For 1 million clients your per-request charges alone could easily reach $500,000 per month.
Ultimately, at 1 million clients, you likely need to run your own receiving endpoints simply due to economic factors.