boto dynamodb: is there a way to optimize batch writing?

I am indexing large amounts of data into DynamoDB and experimenting with batch writing to increase actual throughput (i.e. make indexing faster). Here's a block of code (this is the original source):
def do_batch_write(items, conn, table):
    batch_list = conn.new_batch_write_list()
    batch_list.add_batch(table, puts=items)
    while True:
        response = conn.batch_write_item(batch_list)
        unprocessed = response.get('UnprocessedItems', None)
        if not unprocessed:
            break
        # identify unprocessed items and retry batch writing
I am using boto version 2.8.0. I get an exception if items has more than 25 elements. Is there a way to increase this limit? Also, I noticed that sometimes, even if items is shorter, it cannot process all of them in a single try. But there does not seem to be any correlation between how often this happens, or how many elements are left unprocessed after a try, and the original length of items. Is there a way to avoid this and write everything in one try? The ultimate goal is to make processing faster, not just to avoid repeats, so sleeping for a long period of time between successive tries is not an option.
Thx

From the documentation:
"The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB."
The reason some items are not processed is probably that you are exceeding the provisioned throughput of your table. Do you have other write operations being performed on the table at the same time? Have you tried increasing the write throughput on your table to see if more items are processed?
I'm not aware of any way of increasing the limit of 25 items per request but you could try asking on the AWS Forums or through your support channel.
I think the best way to get maximum throughput is to increase the write capacity units as high as you can and to parallelize the batch write operations across several threads or processes.
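As a rough illustration of that last point, here is a minimal sketch that splits the items into groups of 25 (the BatchWriteItem limit) and writes them from a small pool of worker threads. It assumes the same conn and table objects and the do_batch_write function from the question; the chunk size helper, pool size, and function names are illustrative, not a tested recipe:

from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool from the stdlib

def chunks(items, size=25):
    # BatchWriteItem accepts at most 25 put/delete requests per call
    for i in range(0, len(items), size):
        yield items[i:i + size]

def parallel_batch_write(items, conn, table, workers=8):
    pool = ThreadPool(workers)  # illustrative worker count
    try:
        pool.map(lambda batch: do_batch_write(batch, conn, table), list(chunks(items)))
    finally:
        pool.close()
        pool.join()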

From my experience, there is little to be gained in trying to optimize your write throughput using either batch write or multithreading. Batch write saves a little network time, and multithreading saves close to nothing as the item size limitation is quite low and the bottleneck is very often DDB throttling your request.
So (like it or not) increasing your Write Capacity in DynamoDB is the way to go.
Also, as garnaat said, latency from inside the region is often very different (for example 15 ms versus 250 ms) from inter-region traffic or traffic from outside AWS.

Increasing the Write Capacity alone will not necessarily make it faster.
If your HASH KEY diversity is poor, you can still get throughput errors even after increasing your write capacity.
Throughput errors depend on your access pattern (hit map).
Example: if your hash key is a number between 1 and 10, and you have 10 records spread across the values 1-9 but 10k records with the value 10, you will get many throughput errors even while increasing your write capacity.
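One common mitigation for a hot hash key like that, offered here only as an aside (it is not part of the answer above), is to spread the hot value across several artificial sub-keys ("write sharding"). A rough Python sketch, where N_SHARDS and the key format are made-up names:

import random

N_SHARDS = 10  # made-up shard count

def sharded_hash_key(hot_key):
    # e.g. the hot value "10" becomes "10#0" .. "10#9", spreading writes
    # across several hash key values (and therefore partitions)
    return "%s#%d" % (hot_key, random.randint(0, N_SHARDS - 1))

def keys_to_read(hot_key):
    # readers have to query every suffix and merge the results
    return ["%s#%d" % (hot_key, i) for i in range(N_SHARDS)]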

Related

efficient way to store 100GB dataset

I have a 100GB dataset with the row format seen below.
cookie,iplong1,iplong2..,iplongN
I am currently trying to fit this data into redis as a sorted set data structure. I would also need to set a TTL for each of those IPs. To implement a TTL on each element in the set, I will probably add a score to each member, where the score is the epoch time. And maybe I will write a separate script to parse the scores and remove expired IPs based on their score as applicable. With that said, I am also noticing that it takes almost 100GB of memory to hold this 100GB dataset. I was wondering if there is any other way of efficiently packing this data into redis with a minimal memory footprint.
I am also happy to know if there is any other tech stack out there that can handle this better. This dataset would be updated frequently based on hourly logs, and the expectation is that we should be able to read from it at a fast rate, concurrently.
Thanks in advance.
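One way to realize the scheme described in the question, sketched with the redis-py client (an assumption; the key names and TTL value are illustrative). Here the score stored is the expiry epoch, so a periodic ZREMRANGEBYSCORE sweep can drop expired IPs:

import time
import redis

r = redis.Redis()  # hypothetical local instance

def add_ips(cookie, ips, ttl_seconds=3600):
    expires_at = time.time() + ttl_seconds
    # member -> score mapping (redis-py 3.x zadd signature); score = expiry time
    r.zadd("cookie:%s" % cookie, {ip: expires_at for ip in ips})

def purge_expired(cookie):
    # drop every member whose expiry score is already in the past
    r.zremrangebyscore("cookie:%s" % cookie, "-inf", time.time())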

SCAN vs KEYS performance in Redis

A number of sources, including the official Redis documentation, note that using the KEYS command is a bad idea in production environments due to possible blocking. If the approximate size of the dataset is known, does SCAN have any advantage over KEYS?
For example, consider a database with at most 100 keys of the form data:number:X where X is an integer. If I want to retrieve all of these, I might use the command KEYS data:number:*. Is this going to be significantly slower than using SCAN 0 MATCH data:number:* COUNT 100? Or are the two commands essentially equivalent in this circumstance? Would it be accurate to say that SCAN is preferable to KEYS because it protects against the scenario where an unexpectedly large set would be returned?
You shouldn't care about the current command's execution time but about its impact on all the other commands, since Redis processes commands using a single thread (i.e. while one command is being executed, all the others have to wait until it finishes).
While KEYS or SCAN might give you similar or identical performance when executed alone in your case, a few milliseconds of blocking Redis will significantly decrease overall I/O.
This is the main reason to use KEYS for development purposes and SCAN in production environments.
OP said:
"While KEYS or SCAN might give you similar or identical performance when executed alone in your case, a few milliseconds of blocking Redis will significantly decrease overall I/O." - This sentence seems to indicate that one command blocks Redis and the other doesn't, which can't be the case. If I am guaranteed 100 results from my call to KEYS, in what way is it worse than SCAN? Why do you feel that one command is more prone to blocking?
There should be a real difference when you can paginate the search. It's not the same to be forced to get 100 keys in a single pass as to be able to paginate and get the 100 keys 10 by 10 (or 50 by 50). This very small interruption lets other commands sent by the application layer be processed by Redis. See what the official Redis documentation says about this:
"Since these commands allow for incremental iteration, returning only a small number of elements per call, they can be used in production without the downside of commands like KEYS or SMEMBERS that may block the server for a long time (even several seconds) when called against big collections of keys or elements."
The answer is in the SCAN documentation
These commands allow for incremental iteration, returning only a small number of elements per call, they can be used in production without the downside of commands like KEYS or SMEMBERS that may block the server for a long time (even several seconds) when called against big collections of keys or elements.
So ask for small chunks of data rather than fetching all of it at once.
Also, as Matías Fidemraizer pointed out, Redis is single-threaded and KEYS is a blocking call, so it blocks any incoming requests until the execution of KEYS is done.
Whether your data is small or not, it never hurts to apply best practices.
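A minimal sketch of the incremental approach, using the redis-py client (an assumption; the question is about the raw commands, and the endpoint here is hypothetical):

import redis

r = redis.Redis()  # hypothetical local instance

# scan_iter wraps SCAN/MATCH/COUNT and yields keys in small batches,
# so Redis is never blocked for the whole keyspace at once
for key in r.scan_iter(match="data:number:*", count=100):
    print(key)

# the KEYS equivalent returns everything in one blocking call
all_keys = r.keys("data:number:*")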
There is no performance difference between KEYS and SCAN other than pagination (COUNT), where the amount of bytes transferred (I/O) from Redis to the client is controlled per page.
The COUNT option itself has its own specification: sometimes you will not get any data back, but the scan cursor is still live, so you will get data in the next iterations. So COUNT should be a reasonable amount, say 200 up to some maximum, to avoid multiple round trips. I think this value depends on the total number of keys in your db.
There is no point/difference in using SCAN inside Lua compared to KEYS: though there is no I/O involved, both still block other calls until the entire big collection has been iterated. I haven't tried this; it's my guess.

How to optimize indexation on elasticsearch?

I am trying to understand how indexing can be optimized on elasticsearch. Let me clarify my needs:
I have two indices right now. Let's say indexA and indexB (the two indices are approximately the same size).
I have 6 machines dedicated to elasticsearch (we can say exactly the same hardware).
The most important part of my elasticsearch usage is writing, since I am doing heavy writing in real time.
So my question is, how can I optimize the writing operation using those 6 machines?
Should I separate the machines into two parts, like 3 machines for indexA and 3 machines for indexB?
or
Should I use all 6 machines to index both indexA and indexB?
and
What else should I pay attention to in order to optimize write operations?
Thank you in advance
It depends, but let me suggest a direction based on your problem statement, which leads to the following assumptions:
you want to do more write operations (and are not worried about search performance)
both indices are in the same cluster
in the future more machines may be added
For the best indexing performance, the first thing is that you may want a single shard per index (unless you are using routing). But since you have 6 servers, a single shard would waste resources, so you can assign 3 shards to each of indexA and indexB. This fits the current scenario, although it is often recommended to have ~10 shards (for future scalability, depending on your data size).
Turn off replicas (if possible, since index requests wait for the replicas to respond before returning). In a production environment, though, it is highly recommended to keep at least one replica for high availability.
Set the refresh interval to "-1", or at least to a larger figure, say "30m". (You will lose NRT search if you do so, but as you mentioned, you are concerned with indexing.)
Turn off index warmers if you have any.
Avoid using "doc_values" in your field mappings. (Though they are beneficial for reducing the memory footprint at search time, they increase indexing time because field values are prepared during indexing.)
If possible/not required, disable "norms" in your mappings.
Lastly, read this. A settings sketch follows the word of caution below.
Word of caution: some of the approaches above will impact your search performance.
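A minimal sketch of some of those settings using the elasticsearch-py client (an assumption; the endpoint, index names, and restore values are illustrative, not from the answer above):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # hypothetical endpoint

bulk_settings = {
    "index": {
        "number_of_replicas": 0,    # turn replicas off while bulk indexing
        "refresh_interval": "-1",   # disable near-real-time refresh during heavy writes
    }
}
restore_settings = {
    "index": {
        "number_of_replicas": 1,    # restore at least one replica afterwards
        "refresh_interval": "30s",
    }
}

for index in ("indexA", "indexB"):
    es.indices.put_settings(index=index, body=bulk_settings)

# ... run the heavy indexing here ...

for index in ("indexA", "indexB"):
    es.indices.put_settings(index=index, body=restore_settings)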

ElasticSearch - Determining maximum shard size

Hopefully this question isn't out of date, but I haven't found a clear answer anywhere yet. According to one of the ES presentations from last year (http://www.elasticsearch.org/videos/big-data-search-and-analytics/), there's a "maximum" size for a shard. I'm trying to determine this for my application, but as far as I can tell, I haven't hit it yet. Does anyone know what the behavior of a single-shard index that has reached its maximum is? Do inserts fail, or does the index just become unusable?
To test this myself, I indexed all the English articles in Wikipedia (without any history information) in a single elasticsearch shard. The elasticsearch data folder grew to ~42GB at the end of the test. Lessons learned are:
indexing speed will not be affected by the size of the shard. Mind you, I did not try indexing with more than one thread at a time, but single thread indexing speed was more or less constant for the duration of the test
querying speed on the other hand was drastically affected by shard size. Especially once you try to query with more than one user at a time. The exact numbers will depend heavily on the power of your machine, data structure and how many threads are querying. To give you an idea, with elasticsearch running on my dev machine, querying the Wikipedia shard with 25 concurrent users resulted in an average response time of 3.5 seconds (with peaks towards half a minute).
My conclusion is that a shard that is too large will not make elasticsearch fail just from indexing. Querying the large shard may be too slow for your needs or, in certain situations, may even break elasticsearch with an OutOfMemoryException (for example, a big faceted query).
This answer is based on my own investigation. Full story can be read on my blog:
http://blog.trifork.com/2013/09/26/maximum-shard-size-in-elasticsearch/
http://blog.trifork.com/2013/11/05/maximum-shard-size-in-elasticsearch-revisited/

Scattered-write speed versus scattered-read speed on modern Intel or AMD CPUs?

I'm thinking of optimizing a program by taking a linear array and writing each element to an arbitrary location (random-like from the perspective of the CPU) in another array. I am only doing simple writes and not reading the elements back.
I understand that a scattered read on a classical CPU can be quite slow, as each access will cause a cache miss and thus a processor wait. But I was thinking that a scattered write could technically be fast, because the processor isn't waiting for a result and thus may not have to wait for the transaction to complete.
I am unfortunately unfamiliar with all the details of the classical CPU memory architecture and thus there may be some complications that may cause this also to be quite slow.
Has anyone tried this?
(I should say that I am trying to invert a problem I have. I currently have a linear array from which I am reading arbitrary values -- a scattered read -- and it is incredibly slow because of all the cache misses. My thought is that I can invert this operation into a scattered write for a significant speed benefit.)
In general you pay a high penalty for scattered writes to addresses which are not already in cache, since you have to load and store an entire cache line for each write, hence FSB and DRAM bandwidth requirements will be much higher than for sequential writes. And of course you'll incur a cache miss on every write (a couple of hundred cycles typically on modern CPUs), and there will be no help from any automatic prefetch mechanism.
I must admit, this sounds kind of hardcore. But I take the risk and answer anyway.
Is it possible to divide the input array into pages and read/scan each page multiple times? On every pass through a page, you only process (or output) the data that falls into a limited range of output pages. This way you only get cache misses at the start of each input page loop.
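A minimal sketch of that multi-pass idea in plain Python, purely for illustration (the block size and names are made up, and in practice you would do this in a compiled language to actually see the cache effect):

def blocked_scatter(src, dest, out, block=4096):
    # dest[i] is the scattered destination index of src[i]; out is pre-allocated.
    # Make one pass over the input per block of output indices, so the touched
    # part of `out` stays small (cache-resident) during each pass.
    for start in range(0, len(out), block):
        end = start + block
        for value, d in zip(src, dest):
            if start <= d < end:
                out[d] = value
    return out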