JEE 7 JSR 352 passing data from batchlet to a chunk-step - java-ee-7

I have read the standard (and the javadoc) but still have some questions.
My use case is simple:
A batchlet fetches data from an external source and acknowledges the data (meaning that the data is deleted from the external source after acknowledgement).
Before acknowledging the data the batchlet produces relevant output (in-menory-object) that is to be passed to the next chunk oriented step.
Questions:
1) What is the best practice for passing data between a batchlet and a chunk step?
It seems that I can do that by calling jobContext#setTransientUserData
in the batchlet and then in my chunk step I can access that data by calling
jobContext#getTransientUserData.
I understand that both jobContext and stepContext are implemented in threadlocal-manner.
What worries me here is the "Transient"-part.
What will happen if the batchlet succeeds but my chunk-step fails?
Will the "TransientUserData"-data still be available or will it be gone if the job/step is restarted?
For my use case it is important that the batchlet is run just once.
So even if the job or the chunk step is restarted it is important that the output data from the successfully-run-batchlet is preserved - otherwise the batchlet have to be once more. (I have already acknowledged the data and it is gone - so running the batchlet once more would not help me.)
2)Follow up question
In stepContext there is a couple of methods: getPersistentUserData and setPersistentUserData.
What is these method's intended usage?
What does the "Persistent"-part refer to?
Are these methods relevant only for partitioning?
Thank you!
/ Daniel

Transient user data is just transient, and will not be available during job restart. A job restart can happen in a different process or machine, so users cannot count on job transient from previous run being available at restart.
Step persistent user data are those application data that the batch job developers deem necessary to save/persist for purpose of restarting, monitoring or auditing. They will be available at restart, but they are typically scoped to the current step (not across steps).
From reading your brief descriptioin, I got the feeling that your 2 steps are too tightly coupled and you can almost consider them one single unit of work. You want them either both succeed or both fail in order to maintain your application state integrity. I think that could be the root of the problem.

Related

Fix inconsistent state right away or lazily when data is requested

Our users go through several steps of workflow - the further they go the more objects we create. We also allow users to go back to Step#1 and change one of the existing objects. Which may cause inconsistencies so we must update/delete some of the objects at Step#2. I see 2 options:
Update/delete objects from Step#2 right away. This leads to:
Operation that's supposed to be a simple PATCH of an entity field becomes complicated. And it's a shared object between multiple workflows - so we'll have to add if-statements and do different things depending on the workflow.
Circular dependencies. Operations on Step#1 have to know about objects/operations on Step#2.
On each request in Step#1 we'd have to load data for Step#2 in order to determine whether Step#2 really needs to be updated. Which slows down operations on Step#1. So to change 1 record in DB we'll have to load hundreds (or even thousands) records for Step#2.
Many actions on Step#1 may need fixing state at Step#2. So we have to ensure we don't forget anything today and in the future.
Fix Step#2 lazily - when user goes there (our current approach). Step#2 will recognize that objects are inconsistent and fix them. Which leads to just 1 place where we need to care, but:
Until user opens Step#2 - DB will contain inconsistent objects. This hasn't resulted in any problems so far. But I can imagine it may complicate future SQL migrations.
We update DB state on GET request. This one doesn't seem like that big of a deal since GET stays idempotent anyway. But still it feels awkward.
Anyone knows better approaches? Or maybe improvements to these two?
Update
I haven't found perfect solution, but eventually we implemented an improved version of #1. When updating state on Step#1 we also set a flag "need to rebuild Step#2", when UI opens Step#2 it first checks this flag and issues a PUT to rebuild the state, and only then it GETs Step#2.
This still means that DB state is inconsistent for some period of time. But at least we'll know this for sure from the flag in DB. And if needed - we could write migrations taking this flag into account. This also allows (if needed in the future) to create an async job to fix the state.
I think it is more flexible to separate the state and the context where the objects are stored. Any creation of a new object at any step is accompanied by the preservation of the invariant and consistency of context.
There are separate rules of states - these are rules for transition from one to another and available objects for creation and separate rules for the context, rules for its consistency, which is ensured every time it changes.
What about dirty data asynchronous cleanup?
Whenever user goes back to Step #1 and changes something, mark all related data as "dirty" (e.g. add links to it in "DirtyData" table) and be done for now.
Have a DataCleanup worker (e.g. separate thread or smth) that constantly looks for data to be cleaned up.
Before editing data for Step #2, check if the data is not dirty.
Depending on your logic, 3) might result in user error (e.g. user would need to repeat Step #2). If DataCleanup worker has enough resources (i.e. it processes DirtyData table almost instantaneously), that should happen only on very rare occasions. If that is not OK, you could opt for checking for dirty data on each fetch, but that could be expensive.
It sounds like you're familiar with the HTTP spec regarding GET requests, but for future readers:
Why shouldn't a GET request change data on the server?
Why is using a HTTP GET to update state on the server in a RESTful call incorrect?
For the other bullet under 2, we probably don't need a specification to agree that persisting valid data is preferable to persisting invalid data.
So what can we do for the bullets under 1 to avoid complex branching logic in a particular step and also circular dependencies? My suggestion is an event-driven design. When step #2 changes it should fire a change event. In this scenario, step #2 has no knowledge of the concrete listener(s) who may receive its events, so it remains decoupled from any complex handling logic.
There's probably no way to guarantee you don't forget anything in the future; but if every step in the workflow is defined as a listener, it forces you to consider change events to some extent every time you implement a new step.
One side note on granularity: if a step has many changes, it can batch up its events rather than fire each one individually. You can adjust the size for efficiency.
In summary, I would strongly consider the Observer design pattern.

Mule is taking a long time for the simple select for the first execution

I am just using a HTTP listener and Select in mule flow. It is a get method, passing ID as an input, and the same ID is passed to select (input). It is taking 3 to 4 minutes of delay when we execute via mule for the first time, but in DB, it took only millisecond.
This delay only happens after adding the parameter in the select.
Someone help me, why there is a delay for the first time and how to resolve it?
Possible cause could be how you create Metadata. For example you use huge CSV file as example for your data structure. Mule reads whole file to have headers. It takes time.
Solution - if you create Metadata by example - use small examples with couple rows of data.
Usually the main points that cause performance issues in first executions are:
JVM and Mule Runtime warmup
Time to establish connections
The first one can not be avoided. For the second one usually a connection pool is used to mitigate it somewhat. Having said that 4 minutes is a very excessive time for either of those. You need to do some performance analysis, adding logs before and after operation in the flow, enabling debug logs for the database connector and even using a Java profiler connected to the Mule JVM to understand what could be happening.
You also have to consider if there is a high number of records that need to be processed, even if the database can answer quickly, it might take some time to format.

best use of NSUrlConnection when getting multiple json objects that depend on the previous

what I am doing is I am querying an API to search for articles in various data bases. There are multiple steps involved, each returns a json object. Each step involves a NSUrlConnection with different query strings to the API
step 1: returns json object indicating status of query & record set ID.
step 2: takes record set id from step 1 and returns list of databases that are valid for querying
step 3: queries each database that was ready from step 2 and gets json data array that has results
I am confused as to the best way of going about this. Is it better to use one nsurlconnection and reopen that connection in connection did finish loading based on which step I am in. Or is it better to open a new connection at the end of each subsequent connection?
A couple of observations
Network latency:
The key phenomenon that we need to be sensitive to here (and it sounds like you are) is network latency. Too often we test our apps in an idea scenario (on simulator with high speed internet access, or on device connected to wifi). But when you use an app in a real-world scenario, network latency can seriously impact performance and you'll want to architect a solution that minimizes this.
Simulating sub-optimal, real-world network situations:
By the way, if you're not doing it already, I'd suggest you install the "Network Link Conditioner" which is part of the "Hardware IO Tools" (available from the "Xcode" menu, choose "Open Developer Tool" - "More Developer Tools"). If you install the "Network Link Conditioner", you can then have your simulator simulate a variety of network experiences (e.g. Good 3G connection, Poor Edge connection, etc.).
Minimize network requests:
Anyway, I'd try to figure out how to minimize separate requests that are dependent upon the previous one. For example, I see step 1 and step 2 and wonder if you could merge those two into a single JSON request. Perhaps that's not possible, but hopefully you get the idea. You want to reduce the number of separate requests that have to happen sequentially.
I'd also look at step 3, and those look like they have to be dependent upon step 2, but perhaps you can run a couple of those step 3 requests concurrently, reducing the latency effect there.
Implementation:
In terms of how this would be implemented, I personally use a concurrent NSOperationQueue with some reasonable maxConcurrentOperationCount setting (e.g. 4 or 5, enough to enjoy concurrency and reduce latency, but not so many as to tax either the device or the server) and submit network operations. In this case, you'll probably submit step 1, with a completion operation that will submit step 2, with a completion operation that will submit a series of step 3 requests and these step 3 requests might run concurrent.
In terms of how to make a good network operation object, I might suggest using something like AFNetworking, which already has a decent network operation object (including one that parses JSON), so maybe you can start there.
In terms of re-using a NSURLConnection, generally its one connection per request. If I have had an app that wanted to have a lengthy exchange of messages with a server (e.g. a chat like service where you want the server to be able to send a message to the client whenever it wants, such as in a chat service), I've done a sockets implementation, but that doesn't seem like the right architecture here.
I would dismiss the first connection and create a new one for each connection.
Just, don't ask me why.
BTW, I would understand the question if this was about reusing vs. creating new objects in some performance sensitive context like scrolling through a table or animations or if it is just about of 10 thousands of iterations where it happens. But you are talking about 3 objects to either create new or reuse the old one. What is the gain of even thinking about it?

Service Broker Design

I’m looking to introduce SS Service Broker,
I have a remote orders database and a local processing database, all activity on the processing database has to happen in sequence, this seems a perfect job for Service Broker!
I’ve set up the infrastructure, I can send and receive messages and now I’m looking at the design of the processing. As I said all processes for one order need to be completed in sequence so I’ll put them in one conversation.
One of these processes is a request for external flat file data, we then wait (could be several days) and then import and process this file when it returns. How can I process half the tasks, then wait for the flat file to return before processing the other half.
I’ve had some ideas but I’m sure I’m missing a trick somewhere
1) Write all queue items to a status table and use status values – seems to remove some of the flexibility of SSSB and add another layer of tasks
2) Keep the transaction open until we get the data back – not ideal
3) Have the flat file import task continually polling for the file to appear – this seems inefficient
What is the most efficient way of managing this workflow?
thanks in advance
In my opinion it is like chain of responsibility. As far as i can understand we have the following workflow.
1.) Process for message.
2.) Wait for external file, now this can be a busy wait or if external data provides you a notification then we can actually do it in non-polling manner.
3.) Once data is received then process the data.
So my suggestion would be to use 3 different Queues one for each part, when one is done it will forward or put a new message in chained queue.
I am assuming, one order processing will not disrupt another order processing.
I am thinking MSMQ with Windows Sequential Work flow, might also be a candidate for this task.

Best practice for inserting and querying data from memory

We have an application that takes real time data and inserts it into database. it is online for 4.5 hours a day. We insert data second by second in 17 tables. The user at any time may query any table for the latest second data and some record in the history...
Handling the feed and insertion is done using a C# console application...
Handling user requests is done through a WCF service...
We figured out that insertion is our bottleneck; most of the time is taken there. We invested a lot of time trying to finetune the tables and indecies yet the results were not satisfactory
Assuming that we have suffecient memory, what is the best practice to insert data into memory instead of having database. Currently we are using datatables that are updated and inserted every second
A colleague of ours suggested another WCF service instead of database between the feed-handler and the WCF user-requests-handler. The WCF mid-layer is supposed to be TCP-based and it keeps the data in its own memory. One may say that the feed handler might deal with user-requests instead of having a middle layer between 2 processes, but we want to seperate things so if the feed-handler crashes we want to still be able to provide the user with the current records
We are limited in time, and we want to move everything to memory in short period. Is having a WCF in the middle of 2 processes a bad thing to do? I know that the requests add some overhead, but all of these 3 process(feed-handler, In memory database (WCF), user-request-handler(WCF) are going to be on the same machine and bandwidth will not be that much of an issue.
Please assist!
I would look into creating a cache of the data (such that you can also reduce database selects), and invalidate data in the cache once it has been written to the database. This way, you can batch up calls to do a larger insert instead of many smaller ones, but keep the data in-memory such that the readers can read it. Actually, if you know when the data goes stale, you can avoid reading the database entirely and use it just as a backing store - this way, database performance will only affect how large your cache gets.
Invalidating data in the cache will either be based on whether its written to the database or its gone stale, which ever comes last, not first.
The cache layer doesn't need to be complicated, however it should be multi-threaded to host the data and also save it in the background. This layer would sit just behind the WCF service, the connection medium, and the WCF service should be improved to contain the logic of the console app + the batching idea. Then the console app can just connect to WCF and throw results at it.
Update: the only other thing to say is invest in a profiler to see if you are introducing any performance issues in code that are being masked. Also, profile your database. You mention you need fast inserts and selects - unfortunately, they usually trade-off against each other...
What kind of database are you using? MySQL has a storage engine MEMORY which would seem to be suited to this sort of thing.
Are you using DataTable with DataAdapter? If so, I would recommend that you drop them completely. Insert your records directly using DBCommand. When users request reports, read data using DataReader, or populate DataTable objects using DataTable.Load (IDataReader).
Storying data in memory has the risk of losing data in case of crashes or power failures.