We are using the Ignite JDBC thin driver to store 1 million records in a table backed by an Ignite cache.
Inserting 1 million records on a single node takes 60 seconds, whereas on a cluster of 2 nodes it takes 5 minutes, and the time grows exponentially as the number of nodes increases.
Attached are the Ignite log file, which shows where the time was spent on the cluster, and the configuration file for the cluster. The log and configuration file are here
Is there any additional configuration required to bring the insert time down over a cluster?
Please make sure that you always test a networked configuration.
You should avoid testing the "client and server on the same machine" configuration, because it cannot be compared with "two server nodes on different machines". And it certainly should not be compared with "two server nodes on the same machine" :)
I've heard that the thin JDBC driver is not yet optimized for fast INSERTs. Please try the client-node JDBC driver with batching (via PreparedStatement.addBatch()).
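A minimal sketch of what that could look like; the config path, table name, column names, and batch size are placeholders for your setup:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchInsert {
        public static void main(String[] args) throws Exception {
            // Client-node driver: starts a full Ignite client node from the given XML config.
            Class.forName("org.apache.ignite.IgniteJdbcDriver");

            try (Connection conn = DriverManager.getConnection(
                     "jdbc:ignite:cfg://file:///path/to/ignite-client-config.xml");
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO Person (id, name) VALUES (?, ?)")) {

                for (int i = 0; i < 1_000_000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "name-" + i);
                    ps.addBatch();

                    // Flush in chunks so the batch doesn't grow unbounded in memory.
                    if ((i + 1) % 10_000 == 0)
                        ps.executeBatch();
                }
                ps.executeBatch(); // flush the remainder
            }
        }
    }

Batching amortizes the per-statement network round trip, which is usually what dominates multi-node INSERT time.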
Related
We're running Matillion (v1.54) on an AWS EC2 instance (CentOS), based on Tomcat 8.5.
We have developed a few ETL jobs by now, and their execution takes quite a lot of time (up to hours). We'd like to speed up the execution of our jobs, and I wonder how to identify the bottleneck.
What confuses me is that both the m5.2xlarge EC2 instance (8 vCPUs, 32 GB RAM) and the database (Snowflake) don't get very busy and seem to be mostly idle (regarding CPU and RAM usage as shown by top).
Our environment is configured to use up to 16 parallel connections.
We also added JVM options -Xms20g -Xmx30g to /etc/sysconfig/tomcat8 to make sure the JVM gets enough RAM allocated.
Our Matillion jobs do transformations and loads into a lot of tables, most of which can (and should) be done in parallel. Still, we see that most of the tasks are processed in sequence.
How can we improve this?
By default there is only one JDBC connection to Snowflake, so your transformation jobs might be forced to run serially for that reason.
You could try bumping up the number of concurrent connections in the Edit Environment dialog.
There is more information here about concurrent connections.
If you do that, a couple of things to avoid are:
- Transactions (BEGIN, COMMIT, etc.) will force transformation jobs to run in serial again.
- If you have a parameterized transformation job, only one instance of it can ever be running at a time. More information on that subject is here.
Because the Matillion server is just generating SQL statements and running them in Snowflake, it is not likely to be the bottleneck. You should make sure that your orchestration jobs submit everything to Snowflake at the same time and that no dependencies (unless required) are built into your flow.
These steps will be done in sequence:
These steps will be done in parallel (and will depend on Snowflake warehouse size to scale):
Also, try the Alter Warehouse component with a higher concurrency level.
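If you want to do the same thing outside Matillion, the underlying Snowflake statement can be issued directly over JDBC. A rough sketch, where the account URL, credentials, warehouse name (ETL_WH), and concurrency value are all placeholders for your setup:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class RaiseConcurrency {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("user", "MY_USER");         // placeholder credentials
            props.put("password", "MY_PASSWORD");

            try (Connection conn = DriverManager.getConnection(
                     "jdbc:snowflake://myaccount.snowflakecomputing.com", props);
                 Statement st = conn.createStatement()) {
                // MAX_CONCURRENCY_LEVEL caps how many statements the warehouse
                // runs at once; raising it lets more jobs execute in parallel.
                st.execute("ALTER WAREHOUSE ETL_WH SET MAX_CONCURRENCY_LEVEL = 16");
            }
        }
    }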
Related
Ignite Version: 2.5
Ignite Cluster Size: 10 nodes
One of our Spark jobs writes data to an Ignite cache every hour; the total is 530 million records per hour. Another Spark job reads the cache, but when it tries to read it, we get the error "Failed to execute the query (all affinity nodes left the grid)".
Any pointers will be helpful.
If you are using an "embedded mode" deployment, it means that nodes are started when jobs run and stopped when jobs finish. If you do not have enough backups, you will lose data when this happens. Any chance this may be your problem? Be sure to connect to the Ignite cluster with client=true.
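A minimal sketch of joining the cluster as a client node (discovery and the rest of the configuration are elided):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ClientJoin {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();
            // Client mode: this node holds no cache data, so the data stays
            // on the server nodes when the job finishes and this node stops.
            cfg.setClientMode(true);

            try (Ignite ignite = Ignition.start(cfg)) {
                // ... read/write caches from the Spark job here ...
            }
        }
    }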
Related
I have 2 nodes, on which I am trying to run 4 Ignite servers (2 on each node) and 16 Ignite clients (8 on each node). I am using replicated cache mode. I can see that the load on the cluster is not distributed evenly across all servers.
My intention in having 2 servers per node is to split the load of the 8 local clients across the local servers, with each server working in write-behind mode to replicate the data across all servers.
But I notice that only one server is taking the load, running at 200% CPU, while the other 3 servers sit at very low usage of around 20% CPU. How can I set up the cluster to distribute the client load evenly across all servers? Thanks in advance.
I'm generating load by inserting the same value 1 million times and reading it back using the same key.
Here is your problem. The same key is always stored on the same Ignite node, according to the affinity function (see https://apacheignite.readme.io/docs/data-grid), so only one node takes the read and write load.
You should use a wide range of keys instead.
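For example, a load generator along these lines spreads distinct keys across partitions, so every server node takes a share of the work (the cache name and values are placeholders):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class SpreadLoad {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                IgniteCache<Integer, String> cache =
                    ignite.getOrCreateCache("testCache");

                // Distinct keys hash to different partitions, so reads and
                // writes are spread across all server nodes instead of one.
                for (int i = 0; i < 1_000_000; i++)
                    cache.put(i, "value-" + i);
            }
        }
    }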
Related
I am a new user of Apache Hadoop. There is one thing I do not understand. I have a simple cluster (3 nodes). Each node has about 30 GB of free space. When I look at the Overview page of Hadoop, I see DFS Remaining: 90.96 GB. I set the replication factor to 1.
Then I create one 50 GB file and try to upload it to HDFS. But I run out of space. Why? Can I not upload a file that is larger than the free space of a single node in the cluster?
According to Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
This logic makes sense as it decreases the network chatter between the different nodes.
I think it depends on whether the client is itself a Hadoop node or not. If the client is a Hadoop node, then all the splits will be on the same node. This doesn't provide any better read/write throughput in spite of having multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each split, so the splits are spread across the nodes in the cluster, which provides better read/write throughput.
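In this case, with replication factor 1 and the upload running on one of the datanodes, every block of the 50 GB file is targeted at that node's ~30 GB, so it fills up. A minimal sketch of uploading from a host outside the cluster instead (the namenode address and paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadFromOutside {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            // Running on a non-datanode host, so HDFS picks a random
            // (not-too-full) datanode for each block of the file.
            try (FileSystem fs = FileSystem.get(conf)) {
                fs.copyFromLocalFile(new Path("/local/bigfile"),
                                     new Path("/data/bigfile"));
            }
        }
    }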
Related
I am new to Infinispan. We have set up an Infinispan cluster so that we can make use of the distributed cache for our CPU- and memory-intensive task. We are using UDP as the communication medium and Infinispan MapReduce for distributed processing. The problem we are facing is with throughput. When we run the program on a single-node machine, the whole run completes in around 11 minutes, processing a few hundred thousand records and finally emitting around 400,000 records.
However, when we use a cluster for the same dataset, we see a throughput of only 200 records per second being updated between the nodes in the cluster, which hurts the overall processing time. We are not sure which configuration settings are degrading the throughput so badly; I am sure it's something to do with the configuration of either JGroups or Infinispan.
How can this be improved?