How can I get progress info from a Java EE 7 batch job?

I am working with Java EE 7 Batch and, from a different thread of execution, would like to retrieve info on the progress of a job. I know I can always retrieve the job execution, but I would like more detailed info, such as how many records have been processed.
Is there a way to feed this information back into the batch framework during processing so that I can retrieve it from the other thread? As far as I've seen, I cannot update the job properties during processing.

The Batch API does not mandate any progress monitoring, but it is really easy to roll your own. Have a look at batch listeners: they give you access to all phases of batch execution (read, process, write), so you can plug in custom reporting, e.g. update progress values in a database, or send them via JMS to some listener, a WebSocket, or anything you like.
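As a minimal sketch of that idea (the class, bean name, and map are all made up), a JSR-352 ItemProcessListener can count processed items per job execution and publish the count through a static map that any other thread in the same JVM can read:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import javax.batch.api.chunk.listener.AbstractItemProcessListener;
import javax.batch.runtime.context.JobContext;
import javax.inject.Inject;
import javax.inject.Named;

// Counts processed items per job execution and exposes the count to any
// other thread in the same JVM. Names are hypothetical.
@Named("progressListener")
public class ProgressListener extends AbstractItemProcessListener {

    // execution id -> number of items processed so far
    private static final Map<Long, AtomicLong> PROGRESS = new ConcurrentHashMap<>();

    @Inject
    private JobContext jobContext;

    /** Called from the monitoring thread. */
    public static long processedCount(long executionId) {
        AtomicLong n = PROGRESS.get(executionId);
        return n == null ? 0L : n.get();
    }

    @Override
    public void afterProcess(Object item, Object result) {
        PROGRESS.computeIfAbsent(jobContext.getExecutionId(), id -> new AtomicLong())
                .incrementAndGet();
    }
}
```

The listener still has to be registered on the step in the job XML, e.g. `<listeners><listener ref="progressListener"/></listeners>`; for visibility across JVMs you would swap the static map for a database table or a JMS destination, as suggested above.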

Related

Distributed JobRunr using single data source

I want to create a scheduler using JobRunr that will run on two different servers. This scheduler will select data from a SQL database and call an API endpoint. How can I make sure these two schedulers, running on two different servers, will not pick the same data from the database?
My main concern is that they should not call the API with duplicate data from the two servers.
As per the documentation, JobRunr will push the job into the queue first, but I am wondering how one scheduler's queue will know that the same data has not been picked up by the other scheduler on the other server. Is there any extra locking mechanism I need to maintain?
JobRunr will run any given job only once; the locking is already handled inside JobRunr itself.
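In practice, both servers can register the same recurring job at startup; because they share a single StorageProvider (the single data source), JobRunr keeps one job definition and lets only one background job server execute each run. A minimal sketch, with a hypothetical service name:

```java
import org.jobrunr.scheduling.BackgroundJob;

// Both servers run this registration at startup. Because the recurring job
// id ("sync-job") is identical on both, the shared storage holds a single
// definition and JobRunr ensures each run is executed by only one server.
public class SchedulerConfig {

    public void registerJobs(DataSyncService service) {
        BackgroundJob.scheduleRecurrently(
                "sync-job",        // same id on both servers -> one job
                "*/15 * * * *",    // every 15 minutes
                () -> service.selectAndCallApi());
    }

    // hypothetical service that selects the rows and calls the API endpoint
    public interface DataSyncService {
        void selectAndCallApi();
    }
}
```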

ExecuteSQL Processor in Apache NiFi

I am facing a problem while working with Apache NiFi. Is there a way to stop the ExecuteSQL processor once it has finished fetching all the data in the table, instead of it fetching repeatedly until I stop it manually?
Generally, processors are meant to be scheduled on some frequency through their Scheduling tab. Processors in the middle of the graph with incoming relationships usually leave their scheduling at 0 seconds, which means run as fast as possible whenever data is queued. Source processors typically run on some interval using Timer Driven or CRON Driven scheduling.
That being said, ExecuteSQL supports being triggered by incoming flow files, so you could put a ListenHTTP processor in front of ExecuteSQL and, whenever you want to trigger it, invoke the HTTP endpoint for ListenHTTP. This way you can leave the flow running, but it will only be triggered when you want it to be.
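For example, assuming ListenHTTP is configured with Listening Port 8011 and Base Path trigger (both made-up values), kicking off the flow from plain Java is a single HTTP call:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Posts an empty flow file to the ListenHTTP processor so the downstream
// ExecuteSQL runs exactly once. Host, port, and path are assumptions; use
// whatever you configured on the ListenHTTP processor.
public class NifiTrigger {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://nifi-host:8011/trigger"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("NiFi responded with status " + response.statusCode());
    }
}
```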

RabbitMQ job queue completion indicator event

I am trying out RabbitMQ with Spring Boot. I have a main process, and within that process I create many small tasks that can be processed by other workers. From the main process's perspective, I'd like to know when all of these tasks are completed so that it can move to the next step. I did not find an easy way to query RabbitMQ to see whether the tasks are complete.
One solution I can think of is to store these tasks in a database and, when each message is completed, update the database with a COMPLETE status. Once all jobs are in COMPLETE status, the main process knows the jobs are complete and can move to the next step of its process.
Another solution is for the main process to maintain the list of jobs that have been sent to other workers. Once each worker completes its job, it sends a message to the main process indicating the job is complete. The main process then marks the job complete and removes the item from the list. Once the list is empty, the main process knows the jobs are complete and can move to the next step of its work.
I am looking to learn the best practice for how other people have dealt with this kind of situation. I'd appreciate any suggestions.
There is no way to query RabbitMQ for this information.
The best way to approach this is with the use of a process manager.
The basic idea is to have your individual steps send a message back to a central process that keeps track of which steps are done. When that main process receives notice that all of the steps are done, it lets the system move on to the next thing.
The details of this approach are fairly complex, but I do have a blog post that covers the core of a process manager from a JavaScript/NodeJS perspective.
You should be able to find something like a "process manager" or "saga", as they are sometimes called, for your language and RabbitMQ framework of choice. If not, you should be able to write one for your process without too much trouble, as described in my blog post.
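For a concrete feel of the pattern, here is a minimal sketch in Java with Spring AMQP (the queue name, class names, and the in-memory map are all hypothetical; a real system would keep this state durable): the main process records how many tasks it dispatched, workers publish a completion message, and a listener counts them down.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

// The main process calls expect() after dispatching the tasks of a batch;
// each worker publishes its batch id to "task.completed" when done, and the
// listener counts completions down to zero.
@Component
public class BatchCompletionTracker {

    // batch id -> number of tasks still outstanding (in-memory for brevity)
    private final ConcurrentHashMap<String, AtomicInteger> outstanding =
            new ConcurrentHashMap<>();

    /** Called by the main process after it has dispatched all tasks. */
    public void expect(String batchId, int taskCount) {
        outstanding.put(batchId, new AtomicInteger(taskCount));
    }

    @RabbitListener(queues = "task.completed") // assumed queue name
    public void onTaskCompleted(String batchId) {
        AtomicInteger remaining = outstanding.get(batchId);
        if (remaining != null && remaining.decrementAndGet() == 0) {
            outstanding.remove(batchId);
            onBatchCompleted(batchId);
        }
    }

    private void onBatchCompleted(String batchId) {
        // move the main process on, e.g. publish a "batch finished" event
        System.out.println("All tasks done for batch " + batchId);
    }
}
```

Once `onBatchCompleted` fires, the main process can publish its own "batch finished" event or complete a future the caller is waiting on.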

Trapping all batch jobs from MVS

I'm trying to trap all the batch jobs from MVS.
I want to transmit all the batch job information (start, end, error) to an external system in order to conduct further analysis.
Does anyone have any idea how to do this?
Write an IEFACTRT exit (or whatever its modern-day equivalent is) and have the systems programmers install it.
IBM actually provides a facility for this. You can have it write SMF (System Management Facility) records for all jobs. The record layouts are published, and you can write code to analyze them yourself, or you can get third-party products like OMEGAMON that will do the analysis and reporting for you.
In my shop, we print the job info into plain files, FTP them down to some file servers, run extract/format scripts there, and pull the data into a BI platform for later analysis/visualisation.
Currently, we are studying how to utilise a graph DB like Neo4j to understand our batch job relationships more deeply and to present them better to the people who are interested. For now we think a graph DB is a very neat tool for this kind of thing (batch job management).
Hope my answer gives you some inspiration/reminders.
Typically, installations cut SMF type 30 records. Subtype 1 is written when a new transaction is started. Here, "transaction" means a System Resources Manager (SRM) transaction; don't confuse it with transactions in the context of, say, a database system. A batch job that begins execution is such a transaction. Subtype 5 is written when a transaction ends, and it carries a completion section that reports the job termination status.
Now, SMF processing is traditionally done in batch, as you have to prepare the SMF records first, either by extracting them from the log stream or from one of the SYS1.MANx data sets.
But recently, capabilities have been added to z/OS that let you hook into the process when SMF records are written. A product like IBM Common Data Provider for z/OS can be used to transform the data the way you want it and to stream it to a destination of your choice, for instance Logstash. Such a technique allows you to process SMF records almost in real time.
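As a rough illustration (not production code), here is what scanning dumped SMF data for type 30 records can look like. It assumes the records were dumped (e.g. with IFASMFDP) to a flat file and transferred in binary with the record descriptor words intact, and it ignores spanned-record segments; the header offsets are taken from the SMF manuals, so verify them against your z/OS level:

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Scans a binary dump of SMF records and prints type 30 subtypes.
// Assumes: 2-byte record length + 2-byte segment descriptor, record type at
// record offset 5, and (for records with subtypes) subtype at offset 22-23.
public class SmfScanner {

    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            while (in.available() > 4) {
                int length = in.readUnsignedShort(); // includes the 4-byte RDW
                in.readUnsignedShort();              // segment descriptor
                byte[] rest = new byte[length - 4];  // remainder of the record
                in.readFully(rest);

                int recordType = rest[1] & 0xFF;     // record offset 5
                if (recordType == 30) {
                    int subtype = ((rest[18] & 0xFF) << 8) | (rest[19] & 0xFF);
                    System.out.println("SMF 30 subtype " + subtype
                            + (subtype == 5 ? " (job/transaction ended)" : ""));
                }
            }
        }
    }
}
```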

What is the practice for scheduling multiple inter-dependent SQL Server Agent jobs?

The way my team currently schedules jobs is through the SQL Server Job Agent. Many of these jobs have dependencies on other internal servers which in turn have their own SQL Server Jobs that need to be run to keep their data up to date.
This has created dependencies in the start time and length of each of our SQL Server jobs. Job A might depend on Job B finishing, so we schedule Job B some estimated amount of time in advance of Job A. This whole process is very subjective and does not scale as we add more jobs and servers, which create more dependencies.
I would love to get out of the business of subjectively scheduling these jobs and hoping that the dominos fall in the right order. I am wondering what the accepted practices for scheduling SQL Server jobs are. Do people use SSIS to chain jobs together? Is there tooling already built into the SQL Server Job Agent to handle this?
What is the accepted way to handle the scheduling of multiple SQL Server jobs with dependencies on each other?
I have used Control-M before to schedule multiple inter-dependent jobs in different environments. Control-M generally works by using batch files (from what I remember) to execute SSIS packages.
We had a complicated environment hosting two data warehouses side by side (one international and one US local). There were jobs that were dependent on other jobs, and those jobs on others, and so on, but with Control-M we could easily define the dependencies (it has a really nice and intuitive GUI). Another tool that comes to mind is Tidal Scheduler.
There is no set standard for job scheduling, but I think it's safe to say that job schedules depend entirely on what an organization needs. For example, Finance jobs might be dependent on Sales, and Sales on Inventory, and so on. The point is, if you need job interdependency, a third-party tool such as Control-M is a safe bet. It can control jobs across different environments and give you a real sense of company-wide job control.
We too had the requirement to manage dependencies between multiple Agent jobs. After looking at various third-party tools and ruling them out for various reasons (mainly internal constraints relating to the use of third-party software), we decided to create our own solution.
The solution centres around a configuration database that holds details about processes (jobs) that need to run and how they are grouped (batches), along with the dependencies between processes.
Summary of configuration tables used:
Batch - high-level definition of a group of related processes; includes metadata such as max concurrent processes, current batch instance, etc.
Process - metadata relating to a process (job), such as name, max wait time, earliest run time, status (enabled/disabled), batch (which batch the process belongs to), process job name, etc.
Batch Instance - the active instance of a given batch
Process Instance - active instances of processes for a given batch
Process Dependency - dependency matrix
Batch Instance Status - lookup for batch instance status
Process Instance Status - lookup for process instance status
Each batch has two control jobs: START BATCH and UPDATE BATCH. The first deals with starting all processes that belong to the batch; the second is the last to run in any given batch and deals with updating the outcome statuses.
Each process has an Agent job associated with it that gets executed by the START BATCH job. Processes have capped concurrency (defined in the batch configuration), so processes are started up to a maximum of x at a time, and START BATCH then waits until a free slot becomes available before starting the next process.
The process Agent job steps call a templated SSIS package that deals with the actual ETL work and with the decision-making around whether the process needs to run, has to wait for dependencies, etc.
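As a minimal illustration of that decision-making (plain Java, hypothetical names; in our solution the equivalent logic lives inside the templated package and reads the configuration tables above):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// A process may start only when every process it depends on has completed
// within the current batch instance.
public class DependencyGate {

    /** process name -> names of the processes it depends on (the dependency matrix). */
    private final Map<String, Set<String>> dependencies;

    public DependencyGate(Map<String, Set<String>> dependencies) {
        this.dependencies = dependencies;
    }

    /** True when every dependency of the given process has completed. */
    public boolean canStart(String process, Set<String> completed) {
        return completed.containsAll(dependencies.getOrDefault(process, Set.of()));
    }

    /** The pending processes that are eligible to run right now. */
    public List<String> runnable(Set<String> pending, Set<String> completed) {
        return pending.stream()
                .filter(p -> canStart(p, completed))
                .collect(Collectors.toList());
    }
}
```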
We are currently looking to move to a Service Broker solution for greater flexibility and control.
Anyway, this is probably too much detail and not enough example, so the VS2010 project is available on request.
I'm not sure how much this will help, but we ended up creating an email-based solution for scheduling.
We built an email reader that accesses an Exchange mailbox. As jobs finish, they send an email to the mail reader to start another job. The other nice part is that most applications have email notifications built in, so there really isn't much custom programming involved.
We originally built it to handle data files coming in from lots of partners; it was much easier to give them an email address than to set them up with an FTP site, etc.
The mail reader app now has grown to include basic filtering, time of day scheduling, use of semaphores to prevent concurrent jobs, etc. It really works great.
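A stripped-down version of such a mail reader, using JavaMail over IMAP (server name, credentials, and the subject-line convention are all made up), looks roughly like this:

```java
import java.util.Properties;

import javax.mail.Flags;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Store;
import javax.mail.search.FlagTerm;

// Polls an inbox for unread messages and uses the subject line to decide
// which job to kick off next.
public class MailReader {

    public static void main(String[] args) throws Exception {
        Session session = Session.getInstance(new Properties());
        Store store = session.getStore("imaps");
        store.connect("mail.example.com", "scheduler", "secret"); // assumed
        Folder inbox = store.getFolder("INBOX");
        inbox.open(Folder.READ_WRITE);

        // only messages not yet marked SEEN
        Message[] unread = inbox.search(new FlagTerm(new Flags(Flags.Flag.SEEN), false));
        for (Message message : unread) {
            String subject = message.getSubject();
            if (subject != null && subject.startsWith("JOB DONE:")) {
                startNextJob(subject.substring("JOB DONE:".length()).trim());
            }
            message.setFlag(Flags.Flag.SEEN, true); // don't process it twice
        }
        inbox.close(false);
        store.close();
    }

    private static void startNextJob(String finishedJob) {
        // look up which jobs depend on finishedJob and launch them,
        // e.g. via msdb.dbo.sp_start_job on SQL Server Agent
        System.out.println(finishedJob + " finished -> starting successors");
    }
}
```

Filtering, time-of-day rules, and the semaphore logic mentioned above would all hang off this polling loop.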