Is it possible to force ConsistentHashingPool to create one routee per hash? I want one routee actor to process only messages with the same hash, and if a new hash comes in, a new routee should be created.
I tried looking into the Resizer class, but I was not able to figure out how to achieve this.
I think you're misunderstanding the ConsistentHashRouter (CHR) a bit. It already does what you've stated—consistently routes messages whose keys fall in a given hash range to the same routee.
Routees are added to / removed from the CHR routee table as new nodes/virtual nodes join the cluster. Then, the hash range will be rebalanced to account for the new nodes in the cluster and the CHR will route messages to the node that is now responsible for the part of the hash range the key falls into. This may be the same node that was responsible for it before, or it may shift from one node to another. Essentially you're sharding the hash range across the cluster.
UPDATE: as of writing this (October 2015) this management process must be done manually. There is a module coming called Akka.Cluster.Sharding that will do the rebalancing of shards for you across nodes. It is currently available on the JVM.
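If it helps to visualize what the CHR is doing, here is a toy consistent-hash ring with virtual nodes. This is not Akka.NET's actual implementation - the class name and the hash function are invented for illustration - just the general mechanism the router builds on:

using System.Collections.Generic;
using System.Linq;

// Toy consistent-hash ring. NOT Akka.NET's code; illustration only.
public class HashRing
{
    private readonly SortedDictionary<int, string> _ring = new SortedDictionary<int, string>();
    private const int VirtualNodesPerNode = 100;

    public void AddNode(string node)
    {
        // Each physical node is placed on the ring many times ("virtual
        // nodes") so the hash range is spread evenly across nodes.
        for (var i = 0; i < VirtualNodesPerNode; i++)
            _ring[Hash(node + ":" + i)] = node;
    }

    public void RemoveNode(string node)
    {
        foreach (var point in _ring.Where(kv => kv.Value == node).Select(kv => kv.Key).ToList())
            _ring.Remove(point);
    }

    // A key is owned by the first ring position at or after its hash, so
    // adding or removing a node only shifts the ranges adjacent to it -
    // that is the "rebalancing" described above.
    public string NodeFor(string key)
    {
        var h = Hash(key);
        foreach (var entry in _ring)
            if (entry.Key >= h)
                return entry.Value;
        return _ring.First().Value; // wrap around the ring
    }

    private static int Hash(string s)
    {
        unchecked
        {
            var h = 23;
            foreach (var c in s)
                h = h * 31 + c;
            return h & int.MaxValue; // clamp to non-negative
        }
    }
}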
(From a newbie perspective...)
I agree with Oliver, this is too simple a use-case to require things called clustering and sharding.
Consider an actor holding some state for a user or a session or something - obviously each actor must receive only the messages for that entity-instance-id.
From having read a few docs I'm pretty sure it's trivial to code yourself: You just write a parent actor which checks for the existence of a child for a given id, creates it if it doesn't exist, then routes the message to it.
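For illustration, here is a minimal Akka.NET sketch of that pattern. The names (EntityEnvelope, EntityParent, EntityActor) are made up, and deciding when to terminate idle children is left out:

using Akka.Actor;

// Hypothetical wrapper message carrying the entity id.
public sealed class EntityEnvelope
{
    public EntityEnvelope(string id, object payload)
    {
        Id = id;
        Payload = payload;
    }

    public string Id { get; }
    public object Payload { get; }
}

public class EntityParent : ReceiveActor
{
    public EntityParent()
    {
        Receive<EntityEnvelope>(envelope =>
        {
            // Look up the child for this id; create it on first use.
            var child = Context.Child(envelope.Id);
            if (child.Equals(ActorRefs.Nobody))
                child = Context.ActorOf(Props.Create<EntityActor>(), envelope.Id);

            // Forward preserves the original sender.
            child.Forward(envelope.Payload);
        });
    }
}

public class EntityActor : ReceiveActor
{
    public EntityActor()
    {
        ReceiveAny(msg =>
        {
            // Per-id state and message handling would live here.
        });
    }
}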
I also expected there to be something like a create-unique-actors setting on the ConsistentHashingRouter to do this automatically for you. (Maybe it's not generally useful since you need to consider when and how to terminate the actors to prevent them from living forever?)
Related
I'm using Redis in Cluster mode (6 nodes: 3 masters and 3 slaves) with SE.Redis. As expected, commands with multiple keys in different hash slots are not supported,
so I'm using hash tags ({}) to make sure that a given key belongs to a particular hash slot. For example, I have two keys like cacheItem:{1} and cacheItem:{94770}.
I set those keys using (each key in a separate request):
SEclient.Database.StringSet(key,value)
This works fine, but now I want to query key1 and key2, which belong to multiple hash slots:
SEclient.Database.StringGet(redisKeys);
The above fails and throws an exception because those keys belong to multiple hash slots.
When querying, I can't make sure that my keys will belong to the same hash slot, and this example is just 2 keys; I have hundreds of keys that I want to query.
So I have the following questions:
How can I query multiple keys when they belong to different hash slots?
What's the best practice for doing that?
Should I calculate hash slots on my side and then send individual requests per hash slot?
Can I use TwemProxy for my scenario?
Any help is highly appreciated.
I can’t speak to SE.Redis, but you are on the right track. You either need to:
Make individual requests per key to ensure they go to the right cluster node, or...
Precalculate the shard + server each key belongs to, grouping by the host. Then send MGET requests with those keys to each host that owns them.
Precalculating will require you (or your client) to know the cluster topology (hash slot owners) and the Redis key hashing method (don’t worry, it is simple and well documented) up front.
You can query cluster info from Redis to get owned slots.
The basic hashing algorithm is HASH_SLOT = CRC16(key) mod 16384. Search around and you can find code for it in about any language 🙂 Remember that the use of hash tags makes this more complicated! See also: https://redis.io/commands/cluster-keyslot
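If you go the precalculation route with SE.Redis, a rough C# sketch might look like this. The CRC16 variant (XMODEM, polynomial 0x1021) and the hash-tag rule follow the Redis cluster spec; RedisSlots and GetGrouped are hypothetical names, and topology changes and retries are not handled:

using System.Collections.Generic;
using System.Linq;
using System.Text;
using StackExchange.Redis;

public static class RedisSlots
{
    // CRC16 (XMODEM / CCITT, poly 0x1021), the variant Redis uses for slots.
    private static ushort Crc16(byte[] data)
    {
        ushort crc = 0;
        foreach (var b in data)
        {
            crc ^= (ushort)(b << 8);
            for (var i = 0; i < 8; i++)
                crc = (crc & 0x8000) != 0
                    ? (ushort)((crc << 1) ^ 0x1021)
                    : (ushort)(crc << 1);
        }
        return crc;
    }

    public static int HashSlot(string key)
    {
        // Hash tag rule: if the key contains a non-empty {...} section,
        // only the part between the first '{' and the next '}' is hashed.
        var open = key.IndexOf('{');
        if (open >= 0)
        {
            var close = key.IndexOf('}', open + 1);
            if (close > open + 1)
                key = key.Substring(open + 1, close - open - 1);
        }
        return Crc16(Encoding.UTF8.GetBytes(key)) % 16384;
    }

    // Group keys by slot so each multi-key GET stays within one slot.
    // Note: results come back grouped by slot, not in original key order.
    public static RedisValue[] GetGrouped(IDatabase db, IEnumerable<string> keys)
    {
        var results = new List<RedisValue>();
        foreach (var group in keys.GroupBy(HashSlot))
            results.AddRange(db.StringGet(group.Select(k => (RedisKey)k).ToArray()));
        return results.ToArray();
    }
}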
Some Redis cluster clients will do this for you with internal magic (e.g. Lettuce in Java), but they are not all created equal 🙂
Also be aware that cluster topology can change at basically any time, and the above work is complicated. To be durable you'll want retries if you get cross-slot errors. Or you can just make many single-key requests, as that is much, much simpler to maintain.
I have a fairly simple Akka.NET system that tracks in-memory state, but contains only derived data. So any actor can, on startup, load its up-to-date state from a backend database, start receiving messages, and keep its state current from there. So I can just let actors fail and restart the process whenever I want; it will rebuild itself.
But... I would like to run across multiple nodes (mostly for the memory requirements) and I'd like to increase/decrease the number of nodes according to demand. Also for releasing a new version without downtime.
What would be the most lightweight (in terms of Persistence) setup of clustering to achieve this? Can you run Clustering without Persistence?
This is not a single question, so let me answer them one by one:
So I can just let actors fail and restart the process whenever I want - yes, but keep in mind that a hard reset of the process is a lot more expensive than a graceful shutdown. In distributed systems, if your node is going down, it's better for it to communicate that to the rest of the nodes beforehand than to require them to detect the dead node - this is part of node failure detection and can take some time (even sub-minute).
I'd like to increase/decrease the number of nodes according to demand - this is standard cluster behavior. In the case of Akka.NET, depending on which feature set you're going to use, you may sometimes need to specify an upper bound on the cluster size.
Also for releasing a new version without downtime. - most of the cluster features can be scoped to a set of particular nodes using so-called roles. Each node can have its own set of roles, which can be used to describe what services it provides and to detect whether other nodes have the required capabilities. For that reason you can use roles for things like versioning.
Can you run Clustering without Persistence? - yes, and this is the default configuration (in Akka, cluster nodes don't need to use any form of persistent backend to work).
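As a rough illustration of the last two points, a minimal persistence-free cluster setup with a role and a graceful leave might look like the following. Hostnames, ports, and system/role names are placeholders, and the transport section is called helios.tcp in older Akka.NET releases rather than dot-netty.tcp:

using Akka.Actor;
using Akka.Cluster;
using Akka.Configuration;

class Program
{
    static void Main()
    {
        // No persistence anywhere in this config - clustering alone.
        var config = ConfigurationFactory.ParseString(@"
            akka {
                actor.provider = cluster
                remote.dot-netty.tcp {
                    hostname = ""127.0.0.1""
                    port = 8081
                }
                cluster {
                    seed-nodes = [""akka.tcp://my-system@127.0.0.1:8081""]
                    roles = [""tracker-v2""]   # roles can double as version markers
                }
            }");

        var system = ActorSystem.Create("my-system", config);

        // ... create your actors and serve traffic ...

        // Graceful shutdown: announce that we are leaving instead of making
        // the rest of the cluster detect a dead node via failure detection.
        var cluster = Cluster.Get(system);
        cluster.RegisterOnMemberRemoved(() => system.Terminate());
        cluster.Leave(cluster.SelfAddress);
        system.WhenTerminated.Wait();
    }
}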
Allow me to preface this by saying I'm fairly new to Redis. I have used Redis in the context of Resque.
I have a service that dispatches jobs to multiple other services. Those jobs either succeed or fail. Regardless of this, I'd like to send the results of what happened with a given job to a client that can then store the jobs in some sort of logical way, for example: JobType1Success, JobType1Failure, etc.
I know I can create lists or sets with Redis and easily add some string representation of the data as values to the lists. Additionally, I know that with traditional key/string values in Redis you can set an expiration in seconds. In my ideal world I would create several lists such as the ones mentioned above. Each job would then be prepended to its appropriate list type, and after 7 days in the list a value would expire. Right now it seems fairly trivial to add the string values to a given list. However, I am unable to find anything on whether I am able to expire values of a certain age from a list, and if so, how to do that. I am working with a Node stack and using the Node Redis library. Any help here would be enormously appreciated.
I have a problem with a topology. I'll try to explain the workflow...
I have a source that emits ~500k tuples every 2 minutes, and these tuples must be read by a spout and processed exactly once as a single object (I think a batch in Trident).
After that, a bolt/function/whatever else must append a timestamp and save the tuples into Redis.
I tried to implement a Trident topology with a Function that saves all the tuples into Redis using a Jedis object (a Redis library for Java) inside this Function class, but when I deploy I receive a NotSerializableException on this object.
My question is: how can I implement a Function that writes this batch of tuples to Redis? Reading around on the web, I could not find any example that writes to Redis from a Function, or any example using the State object in Trident (probably I have to use it...).
My simple topology:
TridentTopology topology = new TridentTopology();
topology.newStream("myStream", new mySpout()).each(new Fields("field1", "field2"), new myFunction("redis_ip", "6379"));
Thanks in advance
(replying about state in general since the specific issue related to Redis seems solved in other comments)
The concept of DB updates in Storm becomes clearer when we keep in mind that Storm reads from distributed (or "partitioned") data sources (through Storm "spouts"), processes streams of data on many nodes in parallel, optionally performs calculations on those streams of data (called "aggregations"), and saves the results to distributed data stores (called "states"). Aggregation is a very broad term that just means "computing stuff": for example, computing the minimum value over a stream is seen in Storm as an aggregation of the previously known minimum value with the new values currently processed in some node of the cluster.
With the concepts of aggregations and partitions in mind, we can have a look at the two main primitives in Storm that allow us to save something in a state: partitionPersist and persistentAggregate. The first one runs at the level of each cluster node, without coordination with the other partitions, and feels a bit like talking to the DB through a DAO. The second one involves "repartitioning" the tuples (i.e. re-distributing them across the cluster, typically along some groupBy logic) and doing some calculation (an "aggregate") before reading/saving something to the DB, and it feels a bit like talking to a HashMap rather than a DB (Storm calls the DB a "MapState" in that case, or a "Snapshot" if there's only one key in the map).
One more thing to keep in mind is that the exactly-once semantics of Storm are not achieved by processing each tuple exactly once: this would be too brittle, since there are potentially several read/write operations per tuple defined in our topology, we want to avoid 2-phase commits for scalability reasons, and at large scale network partitions become more likely. Rather, Storm will typically continue replaying the tuples until it's sure they have been completely and successfully processed at least once. The important relationship of this to state updates is that Storm gives us a primitive (OpaqueMap) that allows idempotent state updates, so that those replays do not corrupt previously stored data. For example, if we are summing up the numbers [1,2,3,4,5], the resulting value saved in the DB will always be 15, even if they are replayed and processed in the "sum" operation several times due to some transient failure. OpaqueMap has a slight impact on the format used to save data in the DB. Note that this replay and opaque logic is only present if we tell Storm to act like that, but we usually do.
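To make the opaque update idea concrete, here is a tiny sketch of the record an opaque state keeps per key and how a replayed batch is applied without double-counting. Storm's real OpaqueMap is Java; I'm writing it in C# with made-up names for consistency with the rest of this page, so treat it as pseudocode:

// Per-key record stored by an opaque state: the current value, the value
// before the last batch, and the id of the last batch that touched it.
public sealed class OpaqueValue
{
    public long TxId;      // last batch (transaction) id applied
    public long Current;   // value after that batch
    public long Previous;  // value before that batch

    // Apply a batch's partial sum idempotently: if the same batch is
    // replayed, recompute from Previous instead of adding twice.
    public void ApplyBatch(long txId, long batchSum)
    {
        if (txId == TxId)
        {
            Current = Previous + batchSum;  // replay: overwrite, don't add
        }
        else
        {
            Previous = Current;
            Current = Current + batchSum;
            TxId = txId;
        }
    }
}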
If you're interested in reading more, I posted two blog articles on the subject:
http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/
http://svendvanderveken.wordpress.com/2014/02/05/error-handling-in-storm-trident-topologies/
One last thing: as hinted by the replay stuff above, Storm is a very asynchronous mechanism by nature: we typically have some data producer that posts events in a queueing system (e.g. Kafka or 0MQ) and Storm reads from there. As a result, assigning a timestamp from within Storm as suggested in the question may or may not have the desired effect: this timestamp will reflect the "latest successful processing time", not the data ingestion time, and of course it will not be identical in case of replayed tuples.
Have you tried trident-state for Redis? There is code on GitHub that does it already:
https://github.com/kstyrc/trident-redis
Let me know if this answers your question or not.
When I make an insert into a specified keyspace, I want the data to be stored only on a specified node (or list of nodes). The information contained in the insert may be confidential and should not be distributed across arbitrary nodes.
I first thought about implementing my own AbstractReplicationStrategy, but it seems that the first node chosen depends on the token (selected by the partitioner) and not on the implemented strategy.
How can I make sure that the information contained in a keyspace only goes where I allow it to?
I don't think it is possible to do what you are asking. Cassandra actively tries to maintain a certain number of replicas of each piece of data: even if you managed to force only a single node to store your insert (which is fairly straightforward), you'd have no control over which node that was (as you found, this is controlled by the partitioner), and if the node went down your data would be lost.
The short answer is that controlling replication is not the way to achieve data security: you should use proper security techniques such as encryption, segregated networks, controlled access, etc.
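As a sketch of the encryption option - illustrative only, ValueCrypto is a made-up helper, and key management (generation, rotation, storage) is the hard part deliberately left out - you can encrypt sensitive values client-side before they ever reach the driver, so replica placement no longer matters for confidentiality:

using System;
using System.Security.Cryptography;

public static class ValueCrypto
{
    public static byte[] Encrypt(byte[] plaintext, byte[] key)
    {
        using (var aes = Aes.Create())
        {
            aes.Key = key;        // e.g. a 256-bit key from your key store
            aes.GenerateIV();     // fresh IV per value
            using (var enc = aes.CreateEncryptor())
            {
                var cipher = enc.TransformFinalBlock(plaintext, 0, plaintext.Length);

                // Prepend the IV so the reader can decrypt later; store the
                // result as a blob column in the keyspace.
                var result = new byte[aes.IV.Length + cipher.Length];
                Buffer.BlockCopy(aes.IV, 0, result, 0, aes.IV.Length);
                Buffer.BlockCopy(cipher, 0, result, aes.IV.Length, cipher.Length);
                return result;
            }
        }
    }
}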