What is a 'Partition' in Apache Helix?

I am learning Apache Helix and came across the keyword 'Partitions'.
According to the definition here http://helix.apache.org/Concepts.html, each subtask (of a main task) is referred to as a partition in Helix.
When I went through the recipe - Distributed Lock Manager - partitions seemed to be nothing but instances of a resource (increasing numPartitions increases the number of locks):
final int numPartitions = 12;
admin.addResource(clusterName, lockGroupName, numPartitions, "OnlineOffline",
RebalanceMode.FULL_AUTO.toString());
Can someone explain, with a simple example, what exactly a partition in Apache Helix is?

I think you're right that a partition is essentially an instance of a resource. As is the case in other distributed systems, partitions are used to achieve parallelism. A resource with only one instance can only run on one machine. Partitions simply provide the construct necessary to split a single resource among many machines by, well, partitioning the resource.
This is a pattern found in a large portion of distributed systems. The difference, though, is that while e.g. distributed databases explicitly define a partition as a subset of some larger data set that can fit on a single node, Helix is more generic: partitions don't have one definite meaning or use case, but many potential meanings and potential use cases.
One of these use cases in a system with which I'm very familiar is Apache Kafka's topic partitions. In Kafka, each topic - essentially a distributed log - is broken into a number of partitions. While the topic data can be spread across many nodes in the cluster, each partition is constrained to a single log on a single node. Kafka provides scalability by adding new partitions to new nodes. When messages are produced to a Kafka topic, internally they're hashed to some specific partition on some specific node. When messages are consumed from a topic, the consumer switches between partitions - and thus nodes - as it consumes from the topic.
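To make that hashing step concrete, here is a minimal Java sketch of the idea (not Kafka's actual partitioner, just an illustration of mapping a message key to one of N partitions):
// Illustration only: maps a message key to one of numPartitions partitions,
// which is the basic idea behind keyed-message partitioning in Kafka.
public class KeyPartitioner {
    private final int numPartitions;

    public KeyPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int partitionFor(String key) {
        // Mask the sign bit so the result is always a valid partition index.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        KeyPartitioner partitioner = new KeyPartitioner(12);
        // Messages with the same key always land in the same partition,
        // and therefore on the same node.
        System.out.println(partitioner.partitionFor("user-42"));
        System.out.println(partitioner.partitionFor("user-43"));
    }
}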
This pattern generally applies to many scalability problems and is found in almost any HA distributed database (e.g. DynamoDB, Hazelcast), map/reduce framework (e.g. Hadoop, Spark), and other data- or task-driven systems.
The LinkedIn blog post about Helix actually gives a bunch of useful examples of the relationships between resources and partitions as well.
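If you want to see partitions as Helix itself names them, here is a hedged sketch using the Helix Java admin API (the ZooKeeper address, cluster name and resource name are placeholders matching the lock-manager recipe). For a resource named lock-group with 12 partitions, the partition names are simply lock-group_0 through lock-group_11; depending on the rebalance mode, the actual placement may live in the ExternalView rather than the IdealState:
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class ListPartitions {
    public static void main(String[] args) {
        // Placeholders: adjust the ZooKeeper address, cluster and resource names.
        HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
        IdealState idealState = admin.getResourceIdealState("lock-manager-demo", "lock-group");
        for (String partitionName : idealState.getPartitionSet()) {
            // Each partition (lock-group_0 ... lock-group_11) has a target
            // mapping of instance -> state (e.g. ONLINE or OFFLINE).
            System.out.println(partitionName + " -> " + idealState.getInstanceStateMap(partitionName));
        }
    }
}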

Related

Multiple microservices and Redis - one database vs node per application in cloud

I would like to know what the best practice is for using Redis in the cloud (Google Memorystore in my case, Standard Tier) for multiple microservices/applications. From what I have researched so far, the following options are available:
Use a single cluster and database, scaled horizontally, for all the microservices. This seems most cost-effective, as I will use exactly the number of nodes the whole system needs. Data isolation suffers here, but I can reduce the impact e.g. by prefixing the keys with the microservice name.
Use separate clusters and databases for each microservice. In this case the isolation is better and scaling a cluster impacts a single microservice only, but it doesn't seem cost-effective, as many nodes may be underloaded (e.g. if microservice M1 utilizes 50% of a node's capacity and microservice M2 utilizes 40%, then in case 1 both microservices would be served by a single node).
In theory I could use multiple databases to isolate data within a single cluster, but as far as I have read this is not supported by Redis (and using multiple databases on a single node causes performance issues).
I am leaning towards option 1, but perhaps I am missing something?
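To illustrate the key prefixing from option 1, here is a minimal sketch assuming the Jedis client (the service name and key names are just examples):
import redis.clients.jedis.Jedis;

// Sketch: every key is namespaced with the owning microservice's name,
// so several services can share one Redis instance without key collisions.
public class PrefixedRedis {
    private final Jedis jedis;
    private final String prefix;

    public PrefixedRedis(Jedis jedis, String serviceName) {
        this.jedis = jedis;
        this.prefix = serviceName + ":";
    }

    public void set(String key, String value) {
        jedis.set(prefix + key, value);
    }

    public String get(String key) {
        return jedis.get(prefix + key);
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            PrefixedRedis billing = new PrefixedRedis(jedis, "billing");
            billing.set("invoice:42", "pending");          // stored as billing:invoice:42
            System.out.println(billing.get("invoice:42"));
        }
    }
}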
I'm not sure about best practices, but I can tell you my experience.
In general I would go with option 2.
Each microservice gets its own Redis instance or cluster.
The Redis clusters then follow their own microservice's lifecycle, e.g. they might get respawned when you redeploy or restart a service.
You might pay a bit more, but you gain in resiliency and save maintenance hassle.

RavenDb Sharding Hilo storage pattern

My understanding was that RavenDb was designed so that if one shard goes down, the other shards can operate without problems.
But recently I was implementing ShardingResolutionStrategy and came across the MetadataShardIdFor method. It is the method where, for each document type, we can specify which shard to use for storage.
So if I get it right, if the shard where the HiLo for a specific document type is stored goes down, we cannot create new documents of this type on other shards (at least autogenerated ids will not work). Or maybe I am wrong and the HiLo is replicated between shards in some magical way?
Sharding is designed to be independent, but in order to create consistent ids, we need to be able to create them from a consistent store.
Because of that, we separate the notion of splitting data across multiple nodes from HA.
The typical scenario is that the metadata shard is independent and runs as a replicated database that is shared across all sharded nodes. That way, if you lose the metadata shard, you just switch over.
This takes advantage of the fact that RavenDB sharding and replication are orthogonal.
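To see why the id generation needs a consistent store, here is a rough sketch of the HiLo idea in Java (not RavenDB's actual implementation): a node reserves a 'hi' range from the metadata store and hands out 'lo' values locally, so when the metadata store is unreachable no new ranges, and therefore no new autogenerated ids, can be produced.
import java.util.concurrent.atomic.AtomicLong;

// Rough sketch of HiLo id generation, not RavenDB's implementation.
// The "metadata store" is simulated with an AtomicLong; in RavenDB it is
// the metadata shard, which is why that shard must stay reachable.
public class HiLoGenerator {
    private static final int RANGE_SIZE = 32;
    private final AtomicLong metadataStoreHi = new AtomicLong(0); // stands in for the metadata shard

    private long hi = -1;        // currently reserved range, -1 = none yet
    private int lo = RANGE_SIZE; // position inside the current range

    public synchronized long nextId() {
        if (lo >= RANGE_SIZE) {
            // One round-trip to the (consistent) metadata store per RANGE_SIZE ids.
            hi = metadataStoreHi.incrementAndGet();
            lo = 0;
        }
        return hi * RANGE_SIZE + lo++;
    }

    public static void main(String[] args) {
        HiLoGenerator gen = new HiLoGenerator();
        for (int i = 0; i < 5; i++) {
            System.out.println(gen.nextId()); // 32, 33, 34, ...
        }
    }
}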

Accessing Multiple Redis Shards

Hi, I'm going to be using multiple Redis instances with some sharding between them.
My question is: will performance suffer a noticeable amount if loading a webpage requires accessing multiple shards?
My basic plan is to load balance between multiple Redis shards (see footnote below), possibly using Twemproxy for this, and to keep everything pertaining to a particular user's data on only one shard (things like 'likes', 'user-information', 'save-list', etc.), but also have multiple Redis instances containing shared objects (which many different users will access) and data about those objects, which will also be loaded for users. I will not need Redis operations on multiple keys across different databases, but I will need Redis to return m keys from n instances in real time.
To come completely clean with you, I'm also planning on using something like this https://github.com/mpalmer/redis/blob/nds-2.6/README.nds so that I can use Redis while keeping many keys on disk when not in use.
FOOTNOTE: I am aware of Redis's master-slave replication, but I prefer sharding for the extra storage rather than just more read capacity.
Please, if your only comment is along the lines of "don't bother to shard until you absolutely have to", keep it to yourself. I'm not interested in hearing that sharding is only important for a certain percentage of sites. That may be your opinion, and it may even be fact, but it is not what I am asking here.
IMO, if you're going to perform multiple reads from multiple shards instead of a single instance, you're most likely to get better performance as long as:
1. The sharding layer isn't slowing you down
2. The app can pull the data from the different shards asynchronously
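A rough sketch of point 2, assuming one Jedis pool per shard and plain CompletableFutures (the hash-based key routing is a placeholder, and pooling/error handling are left out):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

// Sketch: fetch keys from several shards in parallel instead of one after
// another, so a page load pays roughly one round trip, not one per shard.
public class ShardedFetch {
    public static List<String> fetchAll(List<JedisPool> shards, List<String> keys) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String key : keys) {
            // Placeholder routing: hash the key to pick its shard.
            JedisPool pool = shards.get((key.hashCode() & 0x7fffffff) % shards.size());
            futures.add(CompletableFuture.supplyAsync(() -> {
                try (Jedis jedis = pool.getResource()) {
                    return jedis.get(key);
                }
            }));
        }
        List<String> results = new ArrayList<>();
        for (CompletableFuture<String> future : futures) {
            results.add(future.join()); // gather all answers for the page
        }
        return results;
    }
}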

Difference between Partial Replication and Sharding?

I was wondering whether sharding is an alternate name for partial replication or not. Here is what I have figured out so far:
Partial Repl. – each data item has only copies at some but not all of the nodes (‘Sharding’?)
Pure Partial Repl. – has only copies of a subset of the data item but no node contains a full copy of the database
Hybrid Partial Repl. – a set of nodes are full replicas and another set of nodes are partial replicas
Partial replication is an interesting approach: you distribute the data with replication from a master to slaves, each of which contains a portion of the data. Eventually you get an array of smaller, read-only DBs, each containing a portion of the data. Reads can be distributed and parallelized very well.
But what about the writes?
Those are still clogged in one big, fat, lazy master database. Tasks such as buffer management, locking, thread locks/semaphores, and recovery are the real bottleneck of OLTP; they make writes impossible to scale... See more in my blog post here: http://database-scalability.blogspot.com/2012/08/scale-up-partitioning-scale-out.html. BTW - your topic right here just gave me a great idea for another post. I'll link to this question and give you the credit! :)
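To make the asymmetry concrete, here is a sketch of partial replication from the application's side, assuming plain JDBC with one writable master and several read-only replicas (all URLs are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Sketch: reads can be spread over any replica, but every write still has
// to go through the single master, which is where the bottleneck stays.
public class ReadWriteRouter {
    private final String masterUrl;
    private final List<String> replicaUrls;

    public ReadWriteRouter(String masterUrl, List<String> replicaUrls) {
        this.masterUrl = masterUrl;
        this.replicaUrls = replicaUrls;
    }

    public Connection forWrite() throws SQLException {
        return DriverManager.getConnection(masterUrl);          // always the master
    }

    public Connection forRead() throws SQLException {
        int i = ThreadLocalRandom.current().nextInt(replicaUrls.size());
        return DriverManager.getConnection(replicaUrls.get(i)); // any replica
    }
}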
Sharding is where data appears only once, within an array of DBs. Each database is the complete owner of its portion of the data: data is read from there and written to there. This way, reads and writes are distributed and parallelized, and real scale-out can be achieved.
Sharding is a mess to handle and maintain; it's hard as hell. ScaleBase (I work there) enables automatic, transparent scale-out: just throw it in the middle and you'll have 10 DBs at the back that look like 1 to your app. Automatic, transparent super-sharding - in a box.
Sharding is a method of horizontal partitioning of a table. It is not related to replication.
Traditionally, an RDBMS server sits at the center of a system with a star-like topology. That's why it becomes:
the single point of failure
the performance bottleneck of the system
To resolve issue #1 you use replication: if the original server dies, you fail over to a replica.
To resolve issue #2 you can:
1. use sharding
1.1 do sharding by yourself
1.2 use your RDBMS "out of the box" clustering mechanism
2. migrate to a NoSQL solution
Sharding allows you to scale the database out to many servers by splitting the data among them. However, sharding is a trade-off: it limits you in data joining/intersecting/etc.
You still have issue #1 if you use sharding. So it's a good practice to replicate sharded nodes.
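A tiny illustrative sketch of that last point: sharding decides which node owns a key, while per-shard replication gives you somewhere to fail over to when that node dies (the node names and hash-based routing are placeholders):
import java.util.List;

// Sketch: sharding decides WHICH node owns a key; replication decides what
// happens WHEN that node dies. Each shard keeps a primary and a replica.
public class ShardedCluster {
    public static class Shard {
        final String primary;
        final String replica;
        boolean primaryUp = true;

        Shard(String primary, String replica) {
            this.primary = primary;
            this.replica = replica;
        }

        String activeNode() {
            return primaryUp ? primary : replica; // per-shard failover
        }
    }

    private final List<Shard> shards;

    public ShardedCluster(List<Shard> shards) {
        this.shards = shards;
    }

    public String nodeFor(String key) {
        // Horizontal partitioning: every key lives on exactly one shard.
        Shard shard = shards.get((key.hashCode() & 0x7fffffff) % shards.size());
        return shard.activeNode();
    }
}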

Does Redis support strong consistency

I am looking at porting a Java application to .NET, the application currently uses EhCache quite heavily and insists that it wants to support strong consistency (http://ehcache.org/documentation/get-started/consistency-options).
I would like to use Redis in place of EhCache, but does Redis support strong consistency or just eventual consistency?
I've seen talk of a Redis Cluster but I guess this is a little way off release yet.
Or am I looking at this wrong? If the Redis instance sat on a different server altogether and served two frontend servers, how big could it get before we'd need to look at a master/slave style setup?
A single instance of Redis is consistent. There are options for consistency across many instances. antirez (the Redis developer) recently wrote a blog post, Redis data model and eventual consistency, and recommended Twemproxy for sharding Redis, which would give you consistency over many instances.
I don't know EhCache, so can't comment on whether Redis is a suitable replacement. One potential problem (porting to .NET) with Twemproxy is it seems to only run on Linux.
How big can a single Redis instance get? Depends on how much RAM you have.
How quickly will it get this big? Depends on how your data looks.
That said, in my experience Redis stores data quite efficiently. One app I have holds info for 200k users, 20k articles, all relationships between objects, weekly leader boards, stats, etc. (330k keys in total) in 400mb of RAM.
Redis is easy to use and fun to work with. Try it out and see if it meets your needs. If you do decide to use it and might one day want to shard, shard your data from the beginning.
Redis is not strongly consistent out of the box. You will probably need to apply third-party solutions to make it consistent. Here is a quote from the docs:
Write safety
Redis Cluster uses asynchronous replication between nodes, and last failover wins implicit merge function. This means that the last elected master dataset eventually replaces all the other replicas. There is always a window of time when it is possible to lose writes during partitions. However these windows are very different in the case of a client that is connected to the majority of masters, and a client that is connected to the minority of masters.
Usually you need synchronous replication to achieve strong consistency in a distributed, partitioned system.
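For illustration, Redis does offer the WAIT command, which blocks until a write has been acknowledged by a given number of replicas. It narrows the window described above but still does not make Redis strongly consistent, since a failover can still discard acknowledged writes. A sketch, assuming the Jedis client exposes WAIT as waitReplicas:
import redis.clients.jedis.Jedis;

// Sketch: WAIT makes a write "more durable" by waiting for replica acks,
// but it is not full synchronous replication and not strong consistency.
public class SaferWrite {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.set("balance:42", "100");
            // Block until at least 1 replica has acknowledged the write,
            // or give up after 1000 ms.
            long acked = jedis.waitReplicas(1, 1000);
            if (acked < 1) {
                System.out.println("Write not confirmed by any replica yet");
            }
        }
    }
}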