OrientDB, distributed load balancing

OrientDB, distributed load balancing - replication

I am using OrientDB and faced a problem during working with replication. I have two servers with OrientDB and I can see that these OrientDB instances are communicating between each other. The problem is when I generate a batch of requests I can see that connections count increases only for one node (I can see this in OrientDB output). It seems that only one node services all requests. I read about load balancing in orientDB documentation and it suggests to set strategy like this
final ODatabaseDocumentTx db = new ODatabaseDocumentTx("remote:localhost/demo");
db.setProperty(OStorageRemote.PARAM_CONNECTION_STRATEGY, OStorageRemote.CONNECTION_STRATEGY.ROUND_ROBIN_CONNECT);
db.open(user, password);
But I can't use this way because I create ODatabaseDocumentTx object through OPartitionedDatabasePool.
Why are all requests serviced by one node? How can I configure load balancing?
Thanks

Related

How can I handle redis-cluster with Redis Stack (RedisJSON & RediSearch)?

I'm currently having problem dealing with redis-cluster.
creating a redis cluster, I'm using the "redis/redis-stack-server:latest" docker image.
I am doing a query test using RediSearch, but in standalone mode, the number of requests per second is not high than other DBs. so I'm trying to optimize the speed using Redis-Cluster.
When I first started redis-cluster, I thought I would be able to import it across nodes. However, the test results didn't. I can only access the data within the same node and cannot ft.search (RediSearch) across all the nodes.
so I looked through the document again to find a solution. The document related to RediSearch says to use RSCoordinator.
When Running RediSearch in a clustered database, you can span the index across shards using RSCoordinator. In this case the above does not apply.
https://redis.io/commands/ft.create/
but looking at github, it seems that RSCoordinator is already built-in.
https://github.com/RediSearch/RediSearch/tree/638de1dbc5e641ca4f8943732c3c468c59c5a47a
Is there a way for redis-cluster to fetch data across all nodes?
Additionally, I wonder if the reason why Redis does not have higher requests per second than other DBs (MongoDB, PostgreSQL, etc..) in large dataset(about 2,400,000) is because Redis is Single Thread? Are there other factors?

What are the options to bulk/batch load data into Apache Geode(Gemfire)?

We need to load millions of key/values into Apache Geode and we'd like to know what are some the options available. Our values happen to be in the 256kb range.

There are several options depending on your application requirements/SLAs or whether you need to perform conversion or other transformations, etc.
Out-of-the-box, Apache Geode provides the Cache & Region Snapshot Service. This is useful when you want to migrate data from 1 existing Apache Geode cluster to another, for instance. Not so useful if your data is coming from an external source, like a RDBMS.
Another option is to lazily load the data based on need. This can be accomplished by implementing the CacheLoader interface and registering the CacheLoader with a Region. Obviously, you could create a CacheLoader implementation that intelligently loads a block of data based on some rules/criteria in addition to loading and returning the single value of interests based on the current requests.
A lot of times, users create an external, custom Conversion process or tool to extract, transform and bulk load (ETL) a bunch of data into Apache Geode. This is typical in complex Use Cases or requirements. However, it is highly advisable to use perhaps a framework/tool like...
Spring XD (now Spring Cloud Data Flow on Pivotal's Cloud Foundry (PCF)) is great ETL tool and pipeline for creating stream-based applications. Spring XD / SCDF provides many different options for "sources" and "sinks" (e.g. GemFire Server). In addition to sources & sinks, you can even "tap" the stream to process the data with "Processors". So whether you are doing real-time stream or batch-oriented data operations (e.g. bulk loads), Spring XD is a great option.
I am sure Google might provide other answers on how to perform ETL with a KeyValue store like Apache Geode.
Hope this helps get you going.
Cheers,
John

We have very limited options to load Gemfire regions .
1) Spring batch:
Create Gemfire writer for load data and remove data
Create batch configuration and lod it
2) Apache Spark
https://www.linkedin.com/pulse/fast-data-access-using-gemfire-apache-spark-part-vaquar-khan-/

Uneven cache hits

I have integrated twemproxy into web layer and I have 6 Elasticache(1 master , 5 read replicas) I am getting issue that the all replicas have same keys everything is same but cache hits on one replica is way more than others and I performed several load testing still on every test I am getting same result. I have separate data engine that writes on the master of this cluster and remaining 5 replicas get sync with it. So I am using twemproxy only for reading data from Elasticache not for sharding purpose. So my simple question is why i am getting 90% of hits on single read replicas of Elasticache it should distribute the hits evenly among all read replicas? right?
Thank you in advance

Twemproxy hashes everything as I recall. This means it will try to split keys among the masters you give it. If you have one master this means it hashes everything to one server. Thus, as far as it is concerned you have one server for acceptable queries. As such, it isn't helping you in this case.
If you want to have a single endpoint to distribute reads across a bank of identical slaves, you will need to put a TCP load balancer in front of the slaves and have your application talk to the IP:port of the load balancer. Common options are Nginx and HAProxy for software based ones, on AWS you could use their load balancer but you could run into various resource limits out of your control there, and pretty much any hardware load balancer would work as well (though this is difficult if not impossible on AWS).
Which load balancer to use is dependent on your (or your personnel's) comfort and knowledge level with each option.

Couchbase node failure

My understanding could be amiss here. As I understand it, Couchbase uses a smart client to automatically select which node to write to or read from in a cluster. What I DON'T understand is, when this data is written/read, is it also immediately written to all other nodes? If so, in the event of a node failure, how does Couchbase know to use a different node from the one that was 'marked as the master' for the current operation/key? Do you lose data in the event that one of your nodes fails?
This sentence from the Couchbase Server Manual gives me the impression that you do lose data (which would make Couchbase unsuitable for high availability requirements):
With fewer larger nodes, in case of a node failure the impact to the
application will be greater
Thank you in advance for your time :)

By default when data is written into couchbase client returns success just after that data is written to one node's memory. After that couchbase save it to disk and does replication.
If you want to ensure that data is persisted to disk in most client libs there is functions that allow you to do that. With help of those functions you can also enshure that data is replicated to another node. This function is called observe.
When one node goes down, it should be failovered. Couchbase server could do that automatically when Auto failover timeout is set in server settings. I.e. if you have 3 nodes cluster and stored data has 2 replicas and one node goes down, you'll not lose data. If the second node fails you'll also not lose all data - it will be available on last node.
If one node that was Master goes down and failover - other alive node becames Master. In your client you point to all servers in cluster, so if it unable to retreive data from one node, it tries to get it from another.
Also if you have 2 nodes in your disposal you can install 2 separate couchbase servers and configure XDCR (cross datacenter replication) and manually check servers availability with HA proxies or something else. In that way you'll get only one ip to connect (proxy's ip) which will automatically get data from alive server.

Hopefully Couchbase is a good system for HA systems.
Let me explain in few sentence how it works, suppose you have a 5 nodes cluster. The applications, using the Client API/SDK, is always aware of the topology of the cluster (and any change in the topology).
When you set/get a document in the cluster the Client API uses the same algorithm than the server, to chose on which node it should be written. So the client select using a CRC32 hash the node, write on this node. Then asynchronously the cluster will copy 1 or more replicas to the other nodes (depending of your configuration).
Couchbase has only 1 active copy of a document at the time. So it is easy to be consistent. So the applications get and set from this active document.
In case of failure, the server has some work to do, once the failure is discovered (automatically or by a monitoring system), a "fail over" occurs. This means that the replicas are promoted as active and it is know possible to work like before. Usually you do a rebalance of the node to balance the cluster properly.
The sentence you are commenting is simply to say that the less number of node you have, the bigger will be the impact in case of failure/rebalance, since you will have to route the same number of request to a smaller number of nodes. Hopefully you do not lose data ;)
You can find some very detailed information about this way of working on Couchbase CTO blog:
http://damienkatz.net/2013/05/dynamo_sure_works_hard.html
Note: I am working as developer evangelist at Couchbase

Cache Regions in Velocity/AppFabric using WCF

I have a service based architecture where a web farm full of asp clients hit application server farm of WCF services. Obviously all the database access is done by the WCF services. Now I would like to cache my frequently used database retrieved objects using Velocity at the service tier level. I am considering to make each physical application server also part of the cache cluster.
According to Velocity documentation, if I use regions, objects are stored only at a single host. I actually wouldn't have any problem if each host kept it's own cache provided that I could somehow synchronize them.
So my questions are
If I create one region on one host is it also created on another one?
When I clear a cache region, is it cleared on one host only?
If I subscribe to a region level notification on all the hosts, can I catch events of one host on another one?
In this scenario should I use regions at all or stay away from them?
I hope my questions are clear. Actually I am more interested in a solution to my problem than answers to my questions

Yes you are right in reading the doc that the region will exists only in one host.
" I actually wouldn't have any problem if each host kept it's own cache provided that I could somehow synchronize them."
When you say synchronize, you mean when HA in enabled ? Velocity would actually take care of that if thats what you meant.
For the questions:
1. No.
2. Yes
3. Notifications will be sent to the client. So i am not sure if there is anyway to send notifications to other host.
4. Regions gives Search capabilities and takes away HA from you. In your case, you could use the advantages of HA.

Having regions not necessarily means that you don't have HA. if your create your own cache (and don't use the 'default' one) you can create it with Secondarys = 1 (HA on)
now let’s say you have 4 cache hosts; when you define a region , it will have both primary and secondary hosts. so each action on the region will result it being applied in both.
Shany

Named caches distribute across participating nodes. Named regions live on a single node. Regions can be HA, but they cannot take full advantage of distributed cache scaling, as their object load does not distribute across participating nodes in the cluster. Also, using named caches with HA requires three nodes minimum, rather than two nodes if you used the "default" cache only.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas