I am new to Ignite and looking at the compute grid; it sounds like something we can use. I have a compute job (the same job, but run on different segments of the data), so I can have multiple jobs running on multiple nodes. My question: there is only one node that calls the job execution (and thus passes it to one of the nodes), so how can I determine/set which node will do this coordination?
Thanks
Shannon
You can use cluster singletons, but you also do not have to care about coordination: the Ignite cluster will handle most cases of compute failover automatically.
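To illustrate, here is a minimal sketch (class name, segment count, and job bodies are hypothetical) of fanning the same job out over data segments with Ignite's Java compute API. Any node can submit the jobs, so no dedicated coordinator has to be chosen:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.lang.IgniteCallable;

public class SegmentedCompute {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // One callable per data segment; Ignite load-balances the jobs
            // across the server nodes and handles failover automatically.
            List<IgniteCallable<Long>> jobs = Arrays.asList(
                () -> processSegment(0),
                () -> processSegment(1),
                () -> processSegment(2));

            // call() can be invoked from any node; there is no need to
            // designate a coordinator yourself.
            long total = ignite.compute().call(jobs).stream()
                .mapToLong(Long::longValue)
                .sum();

            System.out.println("Total across segments: " + total);
        }
    }

    // Hypothetical per-segment work; replace with your real job.
    private static long processSegment(int segment) {
        return segment;
    }
}
```

If you do want exactly one coordinating component in the cluster, `ignite.services().deployClusterSingleton(...)` deploys a service that Ignite guarantees to run on at most one node at a time, redeploying it elsewhere on failure.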
In our project we are using Ignite, and we have multiple Ignite clusters using ZooKeeper discovery. I wanted to know if Ignite supports a join across two different Ignite clusters; if yes, please share the approach.
I found a few inputs on this, but they were not that helpful:
Apache Ignite: caching ClusterGroup
Communication between two Ignite clusters (maybe merging two Ignite clusters in one)
Well, if they are two independent clusters, they are independent. The most common scenario I can think of is having a master and a replica cluster with synchronization in between.
If you want to perform SQL over multiple nodes, they have to be in a single cluster. There is no hard limit on how many nodes you can have; I know that some companies run hundreds of nodes. But it can be tricky in terms of maintenance, like using ZooKeeper discovery and paying a lot of attention to the network.
If you indeed need to join results from two completely independent clusters, you will need to do it manually, i.e., get a result from one, then from the second, and do the aggregation/processing yourself.
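For example, here is a rough sketch of such a manual join over the thin JDBC driver (host names and the Orders/Customers tables are placeholders for your own setup):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class CrossClusterJoin {
    public static void main(String[] args) throws Exception {
        Map<Long, Long> ordersPerCustomer = new HashMap<>();

        // Cluster A: aggregate on the server side to keep the transfer small.
        try (Connection a = DriverManager.getConnection("jdbc:ignite:thin://cluster-a-host:10800");
             Statement st = a.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT customerId, COUNT(*) FROM Orders GROUP BY customerId")) {
            while (rs.next()) {
                ordersPerCustomer.put(rs.getLong(1), rs.getLong(2));
            }
        }

        // Cluster B: fetch the joining side and combine in application code.
        try (Connection b = DriverManager.getConnection("jdbc:ignite:thin://cluster-b-host:10800");
             Statement st = b.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM Customers")) {
            while (rs.next()) {
                long id = rs.getLong(1);
                System.out.println(rs.getString(2) + ": "
                    + ordersPerCustomer.getOrDefault(id, 0L) + " orders");
            }
        }
    }
}
```

Aggregating on each cluster first (as in the GROUP BY above) keeps the amount of data pulled into the application small.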
We're running Matillion (v1.54) on an AWS EC2 instance (CentOS), based on Tomcat 8.5.
We have developed a few ETL jobs by now, and their execution takes quite a lot of time (that is, up to hours). We'd like to speed up the execution of our jobs, and I wonder how to identify the bottleneck.
What confuses me is that both the m5.2xlarge EC2 instance (8 vCPU, 32G RAM) and the database (Snowflake) don't get very busy and seem to be sort of idle most of the time (regarding CPU and RAM usage as shown by top).
Our environment is configured to use up to 16 parallel connections.
We also added JVM options -Xms20g -Xmx30g to /etc/sysconfig/tomcat8 to make sure the JVM gets enough RAM allocated.
Our Matillion jobs do transformations and loads into a lot of tables, most of which can (and should) be done in parallel. Still we see that most of the tasks are processed in sequence.
How can we enhance this?
By default there is only one JDBC connection to Snowflake, so your transformation jobs might be getting forced to run serially for that reason.
You could try bumping up the number of concurrent connections under the Edit Environment dialog.
There is more information here about concurrent connections.
If you do that, a couple of things to avoid are:
- Transactions (begin, commit, etc.) will force transformation jobs to run in serial again.
- If you have a parameterized transformation job, only one instance of it can ever be running at a time. More information on that subject is here.
Because the Matillion server is just generating SQL statements and running them in Snowflake, the Matillion server is not likely to be the bottleneck. You should make sure that your orchestration jobs are submitting everything to Snowflake at the same time and there are no dependencies (unless required) built into your flow.
Steps that are chained one after another in an orchestration job will be done in sequence; steps on separate, unconnected branches will be done in parallel (and will depend on Snowflake warehouse size to scale).
Also, try the Alter Warehouse component with a higher concurrency level.
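For reference, here is a sketch of the kind of statements such a component presumably issues, sent directly over the Snowflake JDBC driver (the account URL, credentials, warehouse name, and chosen values are all placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class ScaleWarehouse {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "ETL_USER");   // placeholder credentials
        props.put("password", "***");

        // Account URL is a placeholder for your Snowflake account.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:snowflake://myaccount.snowflakecomputing.com/", props);
             Statement st = conn.createStatement()) {
            // A bigger warehouse speeds up each individual statement ...
            st.execute("ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'LARGE'");
            // ... and a higher concurrency level lets more statements run at once.
            st.execute("ALTER WAREHOUSE ETL_WH SET MAX_CONCURRENCY_LEVEL = 16");
        }
    }
}
```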
I have a fairly simple Akka.NET system that tracks in-memory state, but contains only derived data. So any actor can, on startup, load its up-to-date state from a backend database and then start receiving messages and keep its state current from there. So I can just let actors fail and restart the process whenever I want; it will rebuild itself.
But... I would like to run across multiple nodes (mostly for the memory requirements) and I'd like to increase/decrease the number of nodes according to demand. Also for releasing a new version without downtime.
What would be the most lightweight (in terms of Persistence) setup of clustering to achieve this? Can you run Clustering without Persistence?
This is not a single question, so let me answer the parts one by one:
So I can just let actors fail and restart the process whenever I want - yes, but keep in mind that a hard reset of the process is a lot more expensive than a graceful shutdown. In distributed systems, if your node is going down, it's better for it to communicate that to the rest of the nodes beforehand than to require them to detect the dead node - this is part of node failure detection and can take some time (even close to a minute).
I'd like to increase/decrease the number of nodes according to demand - this is standard behavior of the cluster. In the case of Akka.NET, depending on which feature set you are going to use, you may sometimes need to specify an upper bound on the cluster size.
Also for releasing a new version without downtime. - most of the cluster features can be scoped to a set of particular nodes using so-called roles. Each node can have its own set of roles, which can be used to describe what services it provides and to detect whether other nodes have the required capabilities. For that reason you can use roles for things like versioning.
Can you run Clustering without Persistence? - yes, and this is the default configuration (in Akka, cluster nodes don't need to use any form of persistent backend to work).
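As an illustration of points 2-4, here is a minimal HOCON sketch of a persistence-free Akka.NET cluster node that advertises a role (the system name, host, port, and role name are all placeholders):

```hocon
akka {
  actor.provider = cluster
  remote.dot-netty.tcp {
    hostname = "node1.example.local"   # placeholder address
    port = 4053
  }
  cluster {
    # Well-known contact points that new nodes use to join.
    seed-nodes = ["akka.tcp://my-system@node1.example.local:4053"]
    # Roles let you scope features (and versions) to subsets of nodes.
    roles = ["projection"]
  }
}
```

Nodes joining and leaving with this setup touch no persistent store; just remember to leave the cluster gracefully (e.g., via CoordinatedShutdown) before stopping a node, as noted in point 1.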
I want to back up an entire Ignite cluster so that the backup cluster can be used if the original (active) cluster goes down. Is there any approach for this?
If you need two separate clusters with replication across data centers, it would be better to look at the GridGain solutions, which support Datacenter Replication.
Unfortunately, Apache Ignite itself does not support data center replication.
With Apache Ignite you can logically divide your cluster into two zones to guarantee that every zone contains a full copy of the data. However, there is no way to choose the primary node for partitions manually. See AffinityFunction and the affinityBackupFilter() method of the standard implementations.
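For illustration, here is a minimal sketch of that zoning approach (the cache name and the ZONE attribute are hypothetical), using the standard RendezvousAffinityFunction with an attribute-based backup filter:

```java
import java.util.Collections;

import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ZonedCacheConfig {
    public static IgniteConfiguration nodeConfig(String zone) {
        // Each node advertises which zone it belongs to.
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setUserAttributes(Collections.singletonMap("ZONE", zone));

        // Place backups on nodes whose ZONE attribute differs from the
        // primary's, so each zone ends up with a full copy of the data.
        RendezvousAffinityFunction aff = new RendezvousAffinityFunction();
        aff.setAffinityBackupFilter(new ClusterNodeAttributeAffinityBackupFilter("ZONE"));

        CacheConfiguration<Object, Object> cache = new CacheConfiguration<>("myCache");
        cache.setCacheMode(CacheMode.PARTITIONED);
        cache.setBackups(1);
        cache.setAffinity(aff);

        cfg.setCacheConfiguration(cache);
        return cfg;
    }
}
```

With one backup and two zone values, every partition should have a copy in each zone, though which node within a zone is primary is still decided by the affinity function.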
As answered above, a ready-made solution is only available in the paid version. Open-source Apache Ignite provides the ability to take a cluster-wide snapshot. You can add a cron job in your Ignite cluster to take this snapshot, and another job to copy the snapshot data to object storage like S3.
On the backup side, you download this data node by node into the work directories of the respective nodes, following the manual restore procedure, and start the cluster. It should activate automatically once all baseline nodes have started successfully, and your cluster is ready to use.
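If you would rather trigger the snapshot from code than from cron, Ignite (2.9 and later) also exposes a snapshot API; here is a minimal sketch (the snapshot name and client configuration file are placeholders, and native persistence must be enabled for snapshots to work):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class SnapshotJob {
    public static void main(String[] args) {
        // Connect as a client node to the running cluster
        // (client-mode configuration omitted for brevity).
        try (Ignite ignite = Ignition.start("client-config.xml")) {
            // Creates a consistent, cluster-wide snapshot; each node
            // writes its own part under its snapshot work directory.
            ignite.snapshot().createSnapshot("nightly_backup").get();
        }
    }
}
```

Copying each node's snapshot directory to S3 and back is still left to external tooling, as described above.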
I started exploring Drill for our requirement to run SQL on semi-structured data. I have set up a 4-node Drill cluster with ZooKeeper. I have a few questions on how it actually works.
When I run Drill in distributed mode using dfs (the local file system), i.e., I have a 1 GB JSON file on one of the nodes (say n1), I am able to run the query by launching sqlline from any of the nodes (n1, n2, n3, n4) in spite of having the data only on n1. My questions are:
a. Is the query being executed on all the nodes? I.e., will Drill parallelize the query execution by distributing the data to the other nodes n2, n3, n4?
b. If not, will copying the same file onto all the nodes n2, n3, n4 help in leveraging the MPP architecture of Drill?
Is the query being executed on all the nodes?
Maybe. The node has to be running Drill, and the data you are querying has to be on a distributed file system such as HDFS; Drill doesn't distribute the files itself.
The nodes that run the Drillbit service (where you installed Drill) participate in the query work. Only columns that appear in the query are loaded from the file. Drill tries to push any filter in your query down to the leaf nodes, to prevent those nodes from sending rows that don't pass the filter. Drill maximizes data locality during query execution without moving data over the network or between nodes, per the docs. The Minor Fragments section of the docs talks about parallelization: Drill parallelizes operations when the number of records in a fragment reaches 100,000.
Drill plans all queries based on their metadata. So if your target data is stored distributed across nodes, Drill's execution is distributed the same way; otherwise, it runs only on the specific node holding the data.
Consider Drill pointing to an RDBMS source: it then executes the query on one node.
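To observe the parallelism yourself, you can submit a query through the Drill JDBC driver against the ZooKeeper quorum (host names and the file path below are placeholders) and then check the query profile in the Drill Web UI, which shows the minor fragments and the Drillbits they ran on:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.drill.jdbc.Driver");

        // ZooKeeper-based URL: any Drillbit in the cluster may become
        // the foreman that plans and coordinates this query.
        String url = "jdbc:drill:zk=n1:2181,n2:2181,n3:2181/drill/drillbits1";

        try (Connection conn = DriverManager.getConnection(url);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT COUNT(*) FROM dfs.`/data/events.json`")) {
            while (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}
```

Because the connection goes through ZooKeeper, any Drillbit may act as the foreman for the query, which matches what you observed when launching sqlline from different nodes.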