I am using the Carte web server to execute transformations remotely. Sometimes, when the web service is called multiple times at the same moment, I get a timeout and then an error with the description "GC overhead limit exceeded".
I want to know why I am getting this issue, and whether creating multiple slave servers would be the solution; if so, what is the procedure?
NOTE: https://xxxx/kettle/getSlaves returns:
<SlaveServerDetections></SlaveServerDetections>
The answer
GC overhead limit exceeded
means your Carte server has run out of memory. Carte is just a Jetty server with PDI functionality; by its nature it is a Java process that runs jobs or transformations. Jobs and transformations are in turn just descriptions of what the Carte server should do: fetch some data, sort strings, whatever has been configured. If you want to run heavy workloads on Carte, you have to tune the Carte startup script to give the Java process more memory and more heap space, define a better GC strategy, or whatever else has to be tuned based on your knowledge of the workload. Try searching for 'GC overhead limit exceeded' and play with the Java process startup arguments.
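For example, with a stock PDI install the heap is usually raised through the PENTAHO_DI_JAVA_OPTIONS environment variable before launching Carte. Treat the following as a sketch: the variable name can differ between PDI versions, and the hostname, port, heap sizes and GC flag are placeholders to adjust for your hardware.
# Give the Carte JVM a bigger heap and an explicit GC strategy before starting it
export PENTAHO_DI_JAVA_OPTIONS="-Xms2g -Xmx8g -XX:+UseG1GC"
./carte.sh master.example.com 8081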
When the server returns
<SlaveServerDetections></SlaveServerDetections>
it just says that it did not find any slaves (most probably your Carte server is a lone master). It is not related to the GC overhead.
Related
We're running Matillion (v1.54) on an AWS EC2 instance (CentOS), based on Tomcat 8.5.
We have developed a few ETL jobs by now, and their execution takes quite a lot of time (that is, up to hours). We'd like to speed up the execution of our jobs, and I wonder how to identify the bottleneck.
What confuses me is that both the m5.2xlarge EC2 instance (8 vCPU, 32G RAM) and the database (Snowflake) don't get very busy and seem to be sort of idle most of the time (regarding CPU and RAM usage as shown by top).
Our environment is configured to use up to 16 parallel connections.
We also added JVM options -Xms20g -Xmx30g to /etc/sysconfig/tomcat8 to make sure the JVM gets enough RAM allocated.
Our Matillion jobs do transformations and loads into a lot of tables, most of which can (and should) be done in parallel. Still, we see that most of the tasks are processed in sequence.
How can we enhance this?
By default there is only one JDBC connection to Snowflake, so your transformation jobs might be getting forced to run serially for that reason.
You could try bumping up the number of concurrent connections under the Edit Environment dialog.
There is more information here about concurrent connections.
If you do that, a couple of things to avoid are:
Transactions (begin, commit etc) will force transformation jobs to run in serial again.
If you have a parameterized transformation job, only one instance of it can ever be running at a time. More information on that subject is here.
Because the Matillion server is just generating SQL statements and running them in Snowflake, the Matillion server is not likely to be the bottleneck. You should make sure that your orchestration jobs are submitting everything to Snowflake at the same time and there are no dependencies (unless required) built into your flow.
Steps that are chained one after another will be done in sequence.
Steps that branch out from the same component will be done in parallel (and will depend on Snowflake warehouse size to scale).
Also, try the Alter Warehouse component with a higher concurrency level.
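That component presumably just issues an ALTER WAREHOUSE statement against Snowflake. A rough equivalent from the SnowSQL CLI would look like the following sketch, where the connection name, warehouse name and the value 16 are only placeholders:
# Raise the number of statements the warehouse is willing to run concurrently
snowsql -c my_connection -q "ALTER WAREHOUSE etl_wh SET MAX_CONCURRENCY_LEVEL = 16;"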
I have ActiveMQ 5.15.* with Jolokia for getting JMX status, plus Python.
With this code I can get all scheduled jobs:
j4p.request(type='read', mbean='*:brokerName=*,name=JMS,service=JobScheduler,type=Broker')
If the number of jobs is too big, the request runs too long and hits an HTTP timeout.
But I don't need all the jobs, only their count. Is there any way to get only the job count?
Because of the architecture of the on-disk storage for the job scheduler, there is no in-memory job count that is kept: the in-memory index only holds a cached subset of the total jobs, and you don't always have an accurate view of what is on disk (especially after a broker restart). So the management interface only exposes access to fetch jobs, not to fetch statistics in general. To load and collect the numbers you'd generally be doing just what the code does now, and then only exposing a fixed numeric result after all the hard work.
You could extend the store interface and carefully add such features if you wanted; the source code is open. You'd need to properly test that it works both during normal operation and after a restart or after some cached data is paged out. The project is always looking for contributors.
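In the meantime, the only way to get the number is still to fetch the jobs and count them client-side, which pays the same cost on the broker. A rough sketch against a default ActiveMQ setup; the Jolokia endpoint and port, the admin:admin credentials, the brokerName=localhost value and the AllJobs attribute are assumptions that may differ in your installation:
# Read the scheduler view's AllJobs attribute over Jolokia and count the entries locally with jq
curl -s -u admin:admin \
  "http://broker-host:8161/api/jolokia/read/org.apache.activemq:type=Broker,brokerName=localhost,service=JobScheduler,name=JMS/AllJobs" \
  | jq '.value | length'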
We want to use pan.sh to execute multiple Kettle transformations. After exploring the script I found that it internally calls the spoon.sh script, which runs in PDI. Now the problem is that every time a new transformation starts, it creates a separate JVM for its execution (invoked via a .bat file); however, I want to group them into a single JVM to overcome the memory pressure that the multiple JVMs are putting on the batch server.
Could somebody guide me on how I can achieve this, or share the documentation/resources with me?
Thanks for the good work.
Use Carte. This is exactly what it is for. You can start up a server (on the local box if you like) and then submit your jobs to it. One JVM, one heap, shared resources.
The benefit of that is scalability: when your box becomes too busy, just add another one, also running Carte, and start sending some of the jobs to that other server.
There's an old but still current blog here:
http://diethardsteiner.blogspot.co.uk/2011/01/pentaho-data-integration-remote.html
There is also documentation on the Pentaho website.
Starting the server is as simple as:
carte.sh <hostname> <port>
There is also a status page, which you can use to query your Carte servers, so if you have a cluster of servers you can pick a quiet one to send your job to.
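Querying the status page is just an HTTP GET. A minimal example, assuming Carte's default cluster/cluster credentials and the host/port you started it with:
# Ask a Carte server what it is currently running; xml=Y returns machine-readable output instead of the HTML page
curl -u cluster:cluster "http://carte-host:8081/kettle/status/?xml=Y"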
I have been using Pentaho for some time. I just have a basic question on ETL infrastructure. I need to run jobs on a remote EC2 instance to extract data from multiple databases, say around 2000. I need a machine in EC2 that is capable of doing this. This ETL EC2 instance will serve only as a processing point, and the storage is on another host. Now I need to know which instance type I should go for in Amazon.
These ETL jobs will just have a select query and put the result into a Table Output step.
No complex transformations and no sorting.
Are the ETL processes CPU intensive or memory intensive?
How do I decide whether the ETL process is CPU intensive, memory intensive, or I/O intensive?
I would say it is all up to you. I am using an m3.medium instance, sized according to the data in my database, and it is perfectly fine. If you have no problem with the amount of time it takes to execute the transformations, choose a small instance size; otherwise go with a larger instance.
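As for deciding whether a run is CPU, memory, or I/O bound: one rough approach (assuming a Linux instance with the sysstat package installed; the tools are standard Linux utilities, nothing Pentaho-specific) is to watch the box while a representative transformation runs and see which resource saturates first.
# CPU bound: user/system CPU stays near 100% in top while the job runs
top
# Memory bound: swap activity (si/so columns) keeps climbing, sampled every 5 seconds
vmstat 5
# I/O bound: disk utilisation (%util) or await stays high, sampled every 5 seconds
iostat -x 5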
Is it possible to change SQL in a z/OS mainframe COBOL application so that it becomes eligible to be directed to the IBM System z Integrated Information Processor (zIIP)?
An important distinction to make is that, according to IBM, zIIP is only available for "eligible database workloads", and those "eligible" workloads are mostly targeted at large BI/ERP/CRM solutions that run on distributed servers and connect through DDF (Distributed Data Facility) over TCP/IP.
IBM has a list of DB2 workloads that can utilize zIIP. These include:
DDF server threads that process SQL requests from applications that access DB2 by TCP/IP (up to 60%).
Parallel child processes. A portion of each child process executes under a dependent enclave SRB if it processes on behalf of an application that originated from an allied address space, or under an independent enclave SRB if the processing is performed on behalf of a remote application that accesses DB2 by TCP/IP. The enclave priority is inherited from the invoking allied address space for a dependent enclave, or from the main DDF server thread enclave classification for an independent enclave. (Versions up to 11 allowed 80% to run on zIIP; v12 upped this to 100% eligible.)
Utility index build and maintenance processes for the LOAD, REORG, and REBUILD INDEX utilities.
And if you're on DB2 v10, you can also use zIIP with:
Remote native SQL procedures.
XML Schema validation and non-validation parsing.
DB2 utility functions for maintaining index structures.
Certain portions of processing for the RUNSTATS utility.
Prefetch and deferred write processing for the DB2 buffer pool.
Version 11 added the following:
Asynchronous enclave SRBs (service request blocks) that execute in the Db2 ssnmMSTR, ssnmDBM1 and ssnmDIST address spaces, with the exception of p-lock negotiation processing. These processes include Db2 buffer pool processing for prefetch, deferred write, page set castout, log read, and log write processing. Additional eligible processes include index pseudo-delete and XML multi version document cleanup processing.
Version 12 allowed parallel child tasks to go 100% to zIIP after a certain threshold of CPU usage.
So, if you're using COBOL programs, it would appear that IBM does not intend for you to use zIIP with those workloads. You can still take advantage of zIIP with utilities (LOAD, REORG) and some steps of the RUNSTATS utility, so it may still be worthwhile to have some zIIP capacity.