Aerospike: Device Overload Error when size of map is too big - aerospike

We got "device overload" error after the program ran successfully on production for a few months. And we find that some maps' sizes are very big, which may be bigger than 1,000.
After I inspected the source code, I found that the reason of "devcie overload" is that the write queue is beyond limitations, and the length of the write queue is related to the effiency of processing.
So I checked the "particle_map" file, and I suspect that the whole map will be rewritten even if we just want to insert one pair of KV into the map.
But I am not so sure about this. Any advice ?

So I checked the "particle_map" file, and I suspect that the whole map will be rewritten even if we just want to insert one pair of KV into the map.
You are correct. When using persistence, Aerospike does not update records in-place. Each update/insert is buffered into an in-memory write-block which, when full, is queued to be written to disk. This queue allows for short bursts that exceed your disks max IO but if the burst is sustained for too long the server will begin to fail the writes with the 'device overload' error you have mentioned. How far behind the disk is allowed to get is controlled by the max-write-cache namespace storage-engine parameter.
You can find more about our storage layer at https://www.aerospike.com/docs/architecture/index.html.

Related

o(1) complexity python structure for storing data in memory or to disc depends on number of items

At this moment I'm working on this library:
https://pypi.org/project/daffi/
Which is suppose to be kind of multiprocess RPC communication framework with ability to execute sync or async tasks on remotes.
The process of communication includes server which temporary stores some message metadata about receiver/transmitter processes and keep it until message is returned to process that sent message.
Generally speaking it is not the problem to keep 1k or 10k metadata items in memory as they are small and typically communication is fast so I haven't even experienced the state when server keeps so many items.
But I'd like to be protected for such cases and reduce memory consumption on server side when it happened.
So my question. Would you suggest any library or algorithm to store metadata in dict like object with ability to store items to disc under certain conditions?
Criteria is the following:
Lets say number of items in dict is less then 1k. In this case it acts like regular dict and store items in RAM.
If number or items becomes greater then 1k it starts serialize and storing them to disk with ability to take them by key and deserialize.
If, after spike number of items returns to normal (< 1K) it returns back to normal dict like behavior.
The speed is very important so I'd like to keep 0(1) complexity if possible.

how does file reading ( streaming ) really work in mule?

Am trying to understand how streaming works w.r.t mule 4.4.
Am reading a large file and am using 'Repeating file store stream' as streaming strategy
'In memory size' = 128 KB
The file is 24 MB and for sake of argument lets say 1000 records is equivalent to 128 KB
so about 1000 records will be stored in memory and rest all will be written to file store by mule .
Here's the flow:
At stage#1 we are reading a file
At stage#2 we are logging payload - so I am assuming initially 128KB worth of data is logged and internally mule will move rest of the data from file storage to in memory and this data will be written to log.
Question : so does the heap memory increase from 128KB to 24 MB ?
I am assuming no , but needed confirmation ?
At stage#3 we are using transform script to create a json payload
So what happens here :
so now is the json payload all in memory now ? ( say 24 MB ) ?
what has happened to the stream ?
so really I am struggling to understand how stream is beneficial if during transformation the data is stored in memory ?
Thanks
It really depends on how each component works but usually logging means to load the full payload in memory. Having said that, logging 'big' payloads is considered a bad practice and you should avoid doing it in the first place. Even a few KBs logs are really not a good idea. Logs are not intended to be used that way. Using logs, as any computational operation, have a cost in processing and resource usage. I have seen several times people causing out of memory errors or performance issues because of excessive logging.
The case with the Transform component is different. In some cases it is able to benefit from streaming, depending on the format used and the script. Sequential access to records is required for streaming to work. If you try an indexed access to the 24MB payload it will probably load the entire payload in memory (example payload[14203]). Also referencing the payload more than once in a step may fail. Streamed records are consumed after being read so it is not possible to use them twice.
Streaming for Dataweave needs to be enabled (it is not the default) by using the property streaming=true.
You can find more details in the documentation for DataWeave Streaming and Mule Streaming.

Flink batching Sink

I'm trying to use flink in both a streaming and batch way, to add a lot of data into Accumulo (A few million a minute). I want to batch up records before sending them to Accumulo.
I ingest data either from a directory or via kafka, convert the data using a flatmap and then pass to a RichSinkFunction, which adds the data to a collection.
With the streaming data, batching seems ok, in that I can add the records to a collection of fixed size which get sent to accumulo once the batch threshold is reached. But for the batch data which is finite, I'm struggling to find a good approach to batching as it would require a flush time out in case there is no further data within a specified time.
There doesn't seem to be an Accumulo connector unlike for Elastic search or other alternative sinks.
I thought about using a Process Function with a trigger for batch size and time interval, but this requires a keyed window. I didn't want to go down the keyed route as data looks to be very skewed, in that some keys would have a tonne of records and some would have very few. If I don't use a windowed approach, then I understand that the operator won't be parallel. I was hoping to lazily batch, so each sink only cares about numbers or an interval of time.
Has anybody got any pointers on how best to address this?
You can access timers in a sink by implementing ProcessingTimeCallback. For an example, look at the BucketingSink -- its open and onProcessingTime methods should get you started.

Best way to handle timouts on rabbitmq message processing

I am trying to get my head around an issue I have recently encountered and I hope someone will be able to point me in the most reasonable direction of solving it.
I am using Riak KV store and working on CRDT data, where I have some sort of counter inside each CRDT item stored in database.
I have a rabbitmq queue, where each message is a request to increase or decrease a certain amount of aforementioned counters.
Finally, I have a group of service-workers, that listens on the queue, and for each request try to change the amount of counters accordingly.
The issue I have is as follows: While a single worker is processing a request, it may get stuck for a while on a write operation to database – let’s say on a second change of counters out of three. It’s connection with rabbitmq gets lost (timeout) so the message-request gets back on to the queue (I cannot afford to miss one). Then it is picked up by second worker, which begins all processing anew. However, the first worker finishes its work, and as a results I have processed a single message twice.
I can split those increments into single actions, but this still leaves me with dilemma – can still change value of counter twice, if some worker gets stuck on a write operation for a long period.
I do not have possibility of making Riak KV CRDT writes work faster, nor can I accept missing out a message-request. I need to implement some means of checking whether a request was already processed before.
My initial thoughts were to use some alternative, quick KV store for storing rabbitMQ message ID if they are being processed. That way other workers could tell whether they are not starting to process a message that is already parsed elsewhere.
I could use any help and pointers to materials I can read.
You can't have "exactly one delivery" semantic. You can reduce double-sent messages or missed deliveries, so it's up to you to decide which misbehavior is the least inconvenient.
First of all are you sure it's the CRDTs that are too slow ? Are you using simple counters or counters inside maps ? In my experience they are quite fast, although slower than kv. You could try:
- having simple CRDTs (no maps), and more CRDTs objects, to lower their stress( can you split the counters in two ?)
- not using CRDTs but using good old sibling resolution on client side on simple key/values.
- accumulate the count updates orders and apply them in batch, but then you're accepting an increase in latency so it's equivalent to increasing the timeout.
Can you provide some metrics? Like how long the updates take, what numbers you'd expect, if it's as slow when you have few updates or many updates, etc

Couchbase 3.1.0 - Hard out of memory error when performing full backup

We recently migrated to Couchbase 3.1.0. The odd thing is - when performing full backup of a bucket, web UI alerts "Hard Out Of Memory Error. Bucket X on node Y is full. All memory allocated to this bucket is used for metadata". The numbers from RAM usage in the web UI contradict that - about 75% is used, but not 100%. I looked into the logs, but haven't find any similar errors there.
Is that even normal?
This is a known issue in the Couchbase Server 3.x releases.
To understand the problem, we must also first understand Database Change Protocol (DCP), the protocol used to transfer data throughout the system. At a high level the flow-control for DCP is as follows:
The Consumer creates a connection with the Producer and sends an Open Connection message. The Consumer then sends a Control message to indicate per stream flow control. This messages will contain “stream_buffer_size” in the key section and the buffer size the Consumer would like each stream to have in the value section.
The Consumer will then start opening streams so that is can receive data from the server.
The Producer will then continue to send data for the stream that has buffer space available until it reaches the maximum send size.
Steps 1-3 continue until the connection is closed, as the Consumer continues to consume items from the stream.
The cbbackup utility does not implement any flow control (data buffer limits) however, and it will try to stream all vbuckets from all nodes at once, with no cap on the buffer size.
While this does not mean that it will use the same amount of memory as your overall data size (as the streams are being drained slowly by the cbbackup process), it does mean that a large memory overhead is required to be able to store the data streams.
When you are in a heavy DGM (disk greater than memory) scenario, the amount of memory required to store the streams is likely to grow more rapidly than cbbackup can drain them as it is streaming large quantities of data off of disk, leading to very large streams, which take up a lot of memory as previously mentioned.
The slightly misleading message about metadata taking up all of the memory is displayed as there is no memory left for the data, so all of the remaining memory is allocated to the metadata, which when using value eviction cannot be ejected from memory.
The reason that this only affects Couchbase Server versions prior to 4.0 is that in 4.0 a server-side improvement to DCP stream management was made that allows the pausing of DCP streams to keep the memory footprint down, this is tracked as MB-12179.
As a result, you should not experience the same issue on Couchbase Server versions 4.x+, regardless of how DGM your bucket is.
Workaround
If you find yourself in a situation where this issue is occurring, then terminating the backup job should release all of the memory consumed by the streams immediately.
Unfortunately if you have already had most of your data evicted from memory as a result of the backup, then you will have to retrieve a large quantity of data off of disk instead of RAM for a small period of time, which is likely to increase your get latencies.
Over time 'hot' data will be brought into memory when requested, so this will only be a problem for a small period of time, however this is still a fairly undesirable situation to be in.
The workaround to avoid this issue completely is to only stream a small number of vbuckets at once when performing the backup, as opposed to all vbuckets which cbbackup does by default.
This can be achieved using cbbackupwrapper which comes bundled with all Couchbase Server releases 3.1.0 and later, details of using cbbackupwrapper can be found in the Couchbase Server documentation.
In particular the parameter to pay attention to is the -n flag, which specifies the number of vbuckets to be backed up in a batch at once.
As the name suggests, cbbackupwrapper is simply a wrapper script on top of cbbackup which partitions the vbuckets up and automatically handles all of the directory creation and backup generation, while still using cbbackup under the hood.
As an example, with a batch size of 50, cbbackupwrapper would backup vbuckets 0-49 first, followed by 50-99, then 100-149 etc.
It is suggested that you test with cbbackupwrapper in a testing environment which mirrors your production environment to find a suitable value for -n and -P (which controls how many backup processes run at once, the combination of these two controls the amount of memory pressure caused by backup as well as the overall speed).
You should not find that lowering the value of -n from its default 100 decreases the backup speed, in some cases you may find that the backup speed actually increases due to the fact that there is far less memory pressure on the server.
You may however wish to sensibly adjust the -P parameter if you wish to speed up the backup further.
Below is an example command:
cbbackupwrapper http://[host]:8091 [backup_dir] -u [user_name] -p [password] -n 50
It should be noted that if you use cbbackupwrapper to perform your backup then you must also use cbrestorewrapper to restore the data, as cbrestorewrapper is automatically aware of the directory structures used by cbbackupwrapper.
When you run a full backup, by default the backup tool streams data from all nodes over the network. This is not the best way, because it causes a lot of extra load and increased memory usage, especially of you run cbbackup on one of the Couchbase nodes. I would use the data-copy mode of cbbackup, which copies data directly from the files on disk:
> sudo /opt/couchbase/bin/cbbackup couchstore-files:///opt/couchbase/var/lib/couchbase/data/ /tmp/backup
Of course, change the data path to wherever your Couchbase data is actually stored. (In my example it runs as sudo because only root has read access to /opt/couchbase/blabla..) Do this on every node, then collect all the backup folders and put them somewhere. Note that the backups are very compressible, so you might want to zip them before copying over the network.