calling getall in apache geode to getall data for the keys in the list is slow - gemfire

I am using region getall method to get values for all keys, but what i found is that for the key present in apache geode it gets data quickly but for one which is not present in apache geode. it calls the cache loader one by one. Is there any mechanism so that calling cacheloader can be made parallel.

I don't think there's a way of achieving this out of the box, at least not while using a single Region.getAll() operation. If I recall correctly, the servers just iterate through the keys and performs a get on every single one, which ends up triggering the CacheLoader execution.
You could, however, achieve some degree of parallelism by splitting the set of keys into multiple sets, and launching different threads to execute the Region.getAll() operation using these smaller sets. The actual size of each set and the number of threads will depend on the ratio of cache hits / misses you expect, and your SLA requirements, of course.

Related

REDIS search does not return results consistently

I am reading events from Kafka and depositing into REDIS. Then, we read events using Python and in case we don’t find events we drop/re-create the index.
However, I noticed at times even after re-creating the index I still don’t find events.
I have couple of questions -
[Q1] Is re-indexing a good approach where we are continuously getting a huge flow of events?
[Q2] Also, I noticed during REDIS search at times I do get events and then at another instance query does not return results, can this be related to dropping / re-creating index?
[Q3] Is there a better standard approach to ensure JSON are deposited / retrieved consistenly.
[Q4] Is there an explanation as to why at times everything just seems to work fine continuously for several hours and then does not work at all for few hours.
I would appreciate alternate approaches for this simple use case as I am fairly new to REDIS
Here are some points. I hope it would be helpful.
[Q1]
Re-indexing can be a good approach if you are getting a huge flow of events, as it will help keep your data up-to-date. However, you also need to make sure that your Redis instance is able to handle the increased load otherwise Redis can become very slow when it is re-indexing and may not be able to keep up with the flow of event. And in that case, you may need to scale your Redis instance to handle the load or incremental indexing may be a better option.
[Q2]
There could be many reasons why search results vary, but it is possible that the index is being dropped and re-created, which would cause the event to not be found. There are a few other things that could be happening:
There could be an issue with the search algorithm, which would cause search results to not be returned.
The data in the database could be changing, which would cause search results to not be returned.
[Q3]
There is no one standard approach to ensure JSON are deposited or retrieved consistently. However, some best practices you may consider include using a library or tool that supports serialization and deserialization of JSON data, validating input and output data, and using an editor that highlights errors in JSON syntax. You can also use locking mechanism to ensure that only one process can write to the Redis instance at a time, or using a queue to buffer writes to Redis. Also, different developers may have different preferences and opinions on the best way to handle JSON in Redis. Some possible methods include using commands such as JSET or JGET to manage JSON objects, or using a library such as JRedis to simplify the process.
[Q4]
Redis can be temperamental, and its behavior can vary depending on the specific configuration and usage scenarios. There is no specific explanation for this behavior, but it could be due to various factors such as load on the Redis server, network conditions, or other applications using the same Redis instance. In that case, server will not be able to handle requests properly and will stop responding. If everything is working fine for a few hours and then suddenly stops working, you can try restarting Redis or checking the logs for any errors that may have occurred.

Making sure (distributed) caches only store the latest value in a distributed system

Let's say I want to use a built-in solution such as Redis or Memcached to cache database rows (as an example), to avoid recurrent costly trips to the database.
For the sake of the argument, let's assume I have a TABLE(id, x, y) and that I want to cache all rows so I never have to read directly from the database.
Questions:
Consider the following case: NodeA tries to update a given row's field x while NodeB tries to update y, then both simultaneously try to update the cache line. If they try to "manually" update the field they just changed to the row in the cache, if we follow the typical last-write-wins, one of the fields is going to be discarded, which is catastrophic. This makes me think I need to always fill the cache's rows with a full row read from the database.
But this by itself won't necessarily help me. If NodeA writes to x and loads the entire row in memory and then NodeB writes to y and reads the entire row in memory, if NodeB writes to the cache before NodeA then NodeB's changes will be overwritten! This makes me believe I need to always somehow version the rows both in the DB and in the cache. Is this the case? Memcached seems to have a compare and set primitive, but I see no such thing in Redis.
Even if 1. and 2. are not an issue, I still need to guarantee that my write / read has read-after-write consistency, otherwise it may happen that what I'm reading and intending to put in the cache is not necessarily the most up-to-date version. If that's the case, how can I make sure of this? By requiring w + r > n?
This seems to be a very common use-case, I'd guess it's pretty much a solved problem. What can I try to resolve this?
Key value stores as redis support advance data structures, such as HASHs.
If you're doing partial updates to cached entities (only a set of fields is updated as part of the super set), and given your goal is to avoid time-consuming database reads, simply save the table entry as a HASH K/V pairs (using HSET) and the use HGETALL for reading.
Redis OPS are atomic by nature, so that should solve your problems, if I got them right.
On a side note, if you're caching an entire entity yet doing partial updates, you should consider a simpler caching approach, such as read-through (making cache validity a reader-only concern).
As opposed to Database accesses. Redis cache access from different location unless somehow serialized, will always have the potential of being out of order when it comes to distributed systems, as there's always the execution environment (network, threading) to introduce possible delays.
Doing read-through caching will ensure data is always updated after the most recent write without the need to synchronize anything else.
This is how Facebook solved the issue with Memcached: http://nil.csail.mit.edu/6.824/2020/papers/memcache-faq.txt.
The idea is to use the concept of a lease: when a request for a cached value is received and there is no data for such key, a lease token (64 bits id) is returned.
When the webserver fetches the data from the database it can then store the data in the cache with that token. Every time an invalidation request is invoked on a key, a new lease token is created, and as such, if a put is attempted for an old token, the put ends up rejected.
As far as I understand, it's not really possible to (easily) replicate this behavior with Redis without resorting to LUA scripts.

Trident or Storm topology that writes on Redis

I have a problem with a topology. I try to explain the workflow...
I have a source that emits ~500k tuples every 2 minutes, these tuples must be read by a spout and processed exatly once like a single object (i think a batch in trident).
After that, a bolt/function/what else?...must appends a timestamp and save the tuples into Redis.
I tried to implement a Trident topology with a Function that save all the tuples into Redis using a Jedis object (Redis library for Java) into this Function class, but when i deploy i receive a NotSerializable Exception on this object.
My question is.How can i implement a Function that writes on Redis this batch of tuples? Reading on the web i cannot found any example that writes from a function to Redis or any example using State object in Trident (probably i have to use it...)
My simple topology:
TridentTopology topology = new TridentTopology();
topology.newStream("myStream", new mySpout()).each(new Fields("field1", "field2"), new myFunction("redis_ip", "6379"));
Thanks in advance
(replying about state in general since the specific issue related to Redis seems solved in other comments)
The concepts of DB updates in Storm becomes clearer when we keep in mind that Storm reads from distributed (or "partitioned") data sources (through Storm "spouts"), processes streams of data on many nodes in parallel, optionally perform calculations on those streams of data (called "aggregations") and saves the results to distributed data stores (called "states"). Aggregation is a very broad term that just means "computing stuff": for example computing the minimum value over a stream is seen in Storm as an aggregation of the previously known minimum value with the new values currently processed in some node of the cluster.
With the concepts of aggregations and partition in mind, we can have a look at the two main primitives in Storm that allow to save something in a state: partitionPersist and persistentAggregate, the first one runs at the level of each cluster node without coordination with the other partitions and feels a bit like talking to the DB through a DAO, while the second one involves "repartitioning" the tuples (i.e. re-distributing them across the cluster, typically along some groupby logic), doing some calculation (an "aggregate") before reading/saving something to DB and it feels a bit like talking to a HashMap rather than a DB (Storm calls the DB a "MapState" in that case, or a "Snapshot" if there's only one key in the map).
One more thing to have in mind is that the exactly once semantic of Storm is not achieved by processing each tuple exactly once: this would be too brittle since there are potentially several read/write operations per tuple defined in our topology, we want to avoid 2-phase commits for scalability reasons and at large scale, network partitions become more likely. Rather, Storm will typically continue replaying the tuples until he's sure they have been completely successfully processed at least once. The important relationship of this to state updates is that Storm gives us primitive (OpaqueMap) that allows idempotent state update so that those replays do not corrupt previously stored data. For example, if we are summing up the numbers [1,2,3,4,5], the resulting thing saved in DB will always be 15 even if they are replayed and processed in the "sum" operation several times due to some transient failure. OpaqueMap has a slight impact on the format used to save data in DB. Note that those replay and opaque logic are only present if we tell Storm to act like that, but we usually do.
If you're interested in reading more, I posted 2 blog articles here on the subject.
http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/
http://svendvanderveken.wordpress.com/2014/02/05/error-handling-in-storm-trident-topologies/
One last thing: as hinted by the replay stuff above, Storm is a very asynchronous mechanism by nature: we typically have some data producer that post event in a queueing system (e,g. Kafka or 0MQ) and Storm reads from there. As a result, assigning a timestamp from within storm as suggested in the question may or may not have the desired effect: this timestamp will reflect the "latest successful processing time", not the data ingestion time, and of course it will not be identical in case of replayed tuples.
Have you tried trident-state for redis. There is a code on github that does it already:
https://github.com/kstyrc/trident-redis.
Let me know if this answers your question or not.

Locking and Redis

We have 75 (and growing) servers that need to share data via Redis. All 75 servers would ideally want to write to two fields in Redis with INCRBYFLOAT operations. We anticipate eventually having potentially millions of daily write operations and billions of daily reads on these two fields. This data must be persistent.
We're concerned that Redis locking might cause write operations to be repeatedly retried with many simultaneous attempts to increment the same field.
Questions:
Is multiple, simultaneous INCRBYFLOAT on a single field a bad idea under a very heavy load?
Should we have an external process "summarize" separate fields and write the two fields instead? (this introduces another failure point)
Will reads on those two fields block while writing?
Redis does not lock. Also, it is single threaded; so there are no race conditions. Reads or Writes do not block.
You can run millions of INCRBYFLOAT on the same key without any problems. No need for external processes. Reading those fields does not pose any problems.
That said, "Millions of updates to two keys" sounds strange. If you can explain your use case, perhaps there might be a better way to handle it within Redis.
Since Redis is single threaded, you will probably want to use master-slave replication to separate writes from reads, since yes, writes will generally block reads.
Alternatively you can consider using Apache Zookeeper for this, it provides reliable cluster coordination without single points of failure (like single Redis instance).

What happens when redis gets overloaded?

If redis gets overloaded, can I configure it to drop set requests? I have an application where data is updated in real time (10-15 times a second per item) for a large number of items. The values are outdated quickly and I don't need any kind of consistency.
I would also like to compute parallel sum of the values that are written in real time. What's the best option here? LUA executed in redis? Small app located on the same box as redis using UNIX sockets?
When Redis gets overloaded it will just slow down its clients. For most commands, the protocol itself is synchronous.
Redis supports pipelining though, but there is no way a client can cancel the traffic still in the pipeline, but not yet acknowledged by the server. Redis itself does not really queue the incoming traffic, the TCP stack does it.
So it is not possible to configure the Redis server to drop set requests. However, it is possible to implement a last value queue on client side:
the queue is actually a represented by 2 maps indexed by your items (only one value stored per item). The primary map will be used by the application. The secondary map will be used by a specific thread. The 2 maps content can be swapped in an atomic way.
a specific thread is blocking when the primary map is empty. When it is not, it swaps the content of the two maps, sends the content of the secondary map asynchronously to Redis using aggressive pipelining and variadic parameters commands. It also receives ack from Redis.
while the thread is working with the secondary map, the application can still fill the primary map. If Redis is too slow, the application will only accumulate last values in the primary map.
This strategy could be implemented in C with hiredis and the event loop of your choice.
However, it is not trivial to implement, so I would first check if the performance of Redis against all the traffic is not enough for my purpose. It is not uncommon to benchmark Redis at more than 500K op/s these days (using a single core). Nothing prevents you to shard your data on multiple Redis instances if needed.
You will likely saturate the network links before the CPU of the Redis server. That's why it is better to implement the last value queue (if needed) on client side rather than server side.
Regarding the sum computation, I would try to calculate and maintain it in real time. For instance, the GETSET command can be used to set a new value while returning the previous one.
Instead of just setting your values, you could do:
[old value] = GETSET item <new value>
INCRBY mysum [new value] - [old value]
The mysum key will contain the sum of your values for all the items at any time. With Redis 2.6, you can use Lua to encaspulate this calculation to save roundtrips.
Running a big batch to calculate statistics on existing data (this is how I understand your "parallel" sum) is not really suitable for Redis. It is not designed for map/reduce like computation.