How to set optimum value for storm.config.setMessageTimeoutSecs - apache

I am working on Apache storm topology. I have different bolts who carrying out functionality like cosmosdb insertion , REST api call etc.
I want to set value for storm.config.setMessageTimeoutSecs for my Storm topology.
Now I have set it for 5 min, still I can see failures at spot side due to message time out. Is there any max value for message time out for topology.
and How to set optimum value for storm.config.setMessageTimeoutSecs

I don't believe there is a maximum timeout, no. You probably don't want to set it too high though, since it will mean that if a tuple is actually lost during network transfer, it will take much longer to figure out that it's been lost and to replay it.
Are you setting topology.max.spout.pending to a reasonable value? That can help to avoid too many tuples in flight at a time, which helps keep the complete latency down.
I'd also look at whether 5+ minutes is a reasonable processing time for your tuples. I don't know your use case, so maybe that's reasonable, but it seems like a long time to me.
If you actually need to process some tuples that will take such a long time, you might consider splitting your topology into smaller topologies, so you can set a high timeout for the slow part, and a lower timeout for the rest.

Related

Rapidly changing large data processing advise

My team has the following dilemma that we need some architectural/resources advise:
Note: Our data is semi-structured
Over-all Task:
We have a semi-large data that we process during the day
each day this "process" get executed 1-5 times a day
each "process" takes anywhere from 30 minutes to 5 hours
semi-large data = ~1 million rows
each row gets updated anywhere from 1-10 times during the process
during this update ALL other rows may change, as we aggregate these rows for UI
What we are doing currently:
our current system is functional, yet expensive and inconsistent
we use SQL db to store all the data and we retrieve/update as process requires
Unsolved problems and desired goals:
since this processes are user triggered we never know when to scale up/down, which causes high spikes and Azure doesnt make it easy to do autoscale based on demand without data warehouse which we are wanting to stay away from because of lack of aggregates and other various "buggy" issues
because of constant IO to the db we hit 100% of DTU when 1 process begins (we are using Azure P1 DB) which of course will force us to grow even larger if multiple processes start at the same time (which is very likely)
yet we understand the cost comes with high compute tasks, we think there is better way to go about this (SQL is about 99% optimized, so much left to do there)
We are looking for some tool that can:
Process large amount of transactions QUICKLY
Can handle constant updates of this large amount of data
supports all major aggregations
is "reasonably" priced (i know this is an arguable keyword, just take it lightly..)
Considered:
Apache Spark
we don't have ton of experience with HDP so any pros/cons here will certainly be useful (does the use case fit the tool??)
ArangoDB
seems promising.. Seems fast and has all aggregations we need..
Azure Data Warehouse
too many various issues we ran into, just didn't work for us.
Any GPU-accelerated compute or some other high-end ideas are also welcome.
Its hard to try them all and compare which one fits the best, as we have a fully functional system and are required to make adjustments to whichever way we go.
Hence, any before hand opinions are welcome, before we pull the trigger.

Optaplanner termination strategy

In the opta planner configuration ,there is a provision to specify the termination time out.
Is there a better way to handle the termination time out strategy? For example , my problem size is small and I have set the termination time out as 10 sec.
But I can see from the logs that the best score is obtained well within 2 - 3 seconds. Is there any means to exit once the best score is reached ?
Or should the program always run till the timeout is reached and then output the best score.
Take a look at the Termination chapter in the OptaPlanner documentation.
What you are referring to is called BestScoreTermination but it might not be what you actually want -- do note that OptaPlanner has no way of knowing if the score is "the optimal score"... unless you configure Exhaustive Search (which doesn't scale well).
Therefore, if you misjudge your problem and set the BestScoreTermination to something "better" than the optimal value, OptaPlanner will run until it tries out all combinations (which might take effectively forever on big problems). If you're looking for a compromise, take a look at "termination composition"

boto dynamodb: is there a way to optimize batch writing?

I am indexing large amounts of data into DynamoDB and experimenting with batch writing to increase actual throughput (i.e. make indexing faster). Here's a block of code (this is the original source):
def do_batch_write(items,conn,table):
batch_list = conn.new_batch_write_list()
batch_list.add_batch(table, puts=items)
while True:
response = conn.batch_write_item(batch_list)
unprocessed = response.get('UnprocessedItems', None)
if not unprocessed:
break
# identify unprocessed items and retry batch writing
I am using boto version 2.8.0. I get an exception if items has more than 25 elements. Is there a way to increase this limit? Also, I noticed that sometimes, even if items is shorter, it cannot process all of them in a single try. But there does not seem to be correlation between how often this happens, or how many elements are left unprocessed after a try, and the original length of items. Is there a way to avoid this and write everything in one try? Now, the ultimate goal is to make processing faster, not just avoid repeats, so sleeping for a long period of time between successive tries is not an option.
Thx
From the documentation:
"The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB."
The reason for some not succeeded is probably due to exceeding the provisioned throughput of your table. Do you have other write operations being performed on the table at the same time? Have you tried increasing the write throughput on your table to see if more items are processed.
I'm not aware of any way of increasing the limit of 25 items per request but you could try asking on the AWS Forums or through your support channel.
I think the best way to get maximum throughput is to increase the write capacity units as high as you can and to parallelize the batch write operations across several threads or processes.
From my experience, there is little to be gained in trying to optimize your write throughput using either batch write or multithreading. Batch write saves a little network time, and multithreading saves close to nothing as the item size limitation is quite low and the bottleneck is very often DDB throttling your request.
So (like it or not) increasing your Write Capacity in DynamoDB is the way to go.
Ah, like garnaat said, latency inside the region is often really different (like from 15ms to 250ms) from inter-region or outside AWS.
Not only increasing the Write Capacity will make it faster.
if your HASH KEY diversity is poor, then even if you will increase your write capacity, then you can have throughput errors.
throughput errors are depends on your hit map.
example: if your hash key is a number between 1-10, and you have 10 records with hash value 1-10 but 10k records with value 10, then you will have many throughput errors even while increasing your write capacity.

Number of Redis queries

I have a script that runs 12-16 redis commands;
This might look like a dumb question but, considering that this script is usually called every couple of seconds, are they maybe too much to be executed all at once? I mean is Redis also designed to have such number of commands or should it be kept really as minimal as possible?
I am basically using LISTS for queues, SETS and strings.
Thanks in advance.
With the numbers you posted there are no problems at all, Redis can handle this traffic without any issue unless those queries are intersections of big sets, or SORT, or other commands that have a run time proportional to number of elements in your type.
However it is not just a matter of Redis able to handle the traffic, you should also be concerned about latency. If you use pipelining (http://redis.io/topics/pipelining) you can ask multiple queries at once and avoid paying the round trip time multiple times.

Redis mimic MASTER/MASTER? or something else?

I have been reading a lot of the posts on here and surfing the web, but maybe I am not asking the right question. I know that Redis is currently Master/slave until Cluster becomes available. However, I was wondering if someone can tell me how I would want to configure Redis logistically to meet my needs (or if its not the right tool).
Scenerio:
we have 2 sites on opposite ends of the US. We want clients to be able to write at each site at a high volume. We then want each client to be able to perform reads at their site as well. However we want the data to be available from a write at the sister site in < 50ms. Given that we have plenty of bandwidth. Is there a way to configure redis to meet our needs? our writes maximum size would be on the order of 5k usually much less. The main point is how can i have2 masters that are syncing to one another even if it is not supported by default.
The catch with Tom's answer is that you are not running any sort of cluster, you are just writing to two servers. This is a problem if you want to ensure consistency between them. Consider what happens when your client fails a write to the remote server. Do you undo the write to local? What happens to the application when you can't write to the remote server? What happens when you can't read from the local?
The second catch is the fundamental physics issue Joshua raises. For a round trip you are talking a theoretical minimum of 38ms leaving a theoretical maximum processing time on both ends (of three systems) of 12ms. I'd say that expectation is a bit too much and bandwidth has nothing to do with latency in this case. You could have a 10GB pipe and those timings are still extant. That said, transferring 5k across the continent in 12ms is asking a lot as well. Are you sure you've got the connection capacity to transfer 5k of data in 50ms, let alone 12? I've been on private no-utilization circuits across the continent and seen ping times exceeding 50ms - and ping isn't transferring 5k of data.
How will you keep the two unrelated servers in-sync? If you truly need sub-50ms latency across the continent, the above theoretical best-case means you have 12ms to run synchronization algorithms. Even one query to check the data on the other server means you are outside the 50ms window. If the data is out of sync, how will you fix it? Given the above timings, I don't see how it is possible to synchronize in under 50ms.
I would recommend revisiting the fundamental design requirements. Specifically, why this requirement? Latency requirements of 50ms round trip across the continent are usually the sign of marketing or lack of attention to detail. I'd wager that if you analyze the requirements you'll find that this 50ms window is excessive and unnecessary. If it isn't, and data synchronization is actually important (likely), than someone will need to determine if the significant extra effort to write synchronization code is worth it or even possible to keep within the 50ms window. Cross-continent sub-50ms latency data sync is not a simple issue.
If you have no need for synchronization, why not simply run one server? You could use a slave on the other side of the continent for recovery-only purposes. Of course, that still means that best-case you have 12ms to get the data over there and back. I would not count on
50ms round trip operations+latency+5k/10k data transfer across the continent.
It's about 19ms at the speed of light to cross the US. <50ms is going to be hard to achieve.
http://www.wolframalpha.com/input/?i=new+york+to+los+angeles
This is probably best handled as part of your client - just have the client write to both nodes. Writes generally don't need to be synchronous, so sending the extra command shouldn't affect the performance you get from having a local node.