Pentaho: processing a large source table into a target table in the same schema

I currently have an ETL job that reads a source table with over 1 million records and processes it sequentially into a target table. Both source and target are in the same schema, but in between there is an external REST endpoint call that posts some data from the source table, and this job is performing very badly right now. Can someone please let me know some ways to improve performance, such as how to parallelize this or tune the fetch size, to reduce the job's running time?

Check if your REST endpoint supports batching, and then implement that. Most APIs do these days. (In this case, you send multiple requests in one JSON/XML payload to the endpoint.)
Otherwise you simply need to run multiple copies of the REST Client step. You should be able to get away with 8-10 at least, but check that you're not limited in some way at the other end.
Finally, if none of that helps, try building your own HTTP client in the User Defined Java Class step (not the JavaScript one), and make sure you authenticate with the REST endpoint only once, not on every request, by keeping the session open. I'm not 100% convinced the REST Client step does this, and authentication is often the most expensive part.
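As a rough illustration of the "authenticate once, keep the connection open" idea (the endpoint paths, auth scheme, and JSON shape below are all made up for the example), the core of such a client might look like this; in a User Defined Java Class step you would create it once during initialisation and reuse it for every row or batch of rows:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReusableRestClient {
    // One shared client: connections are pooled and kept alive across requests.
    private final HttpClient client = HttpClient.newHttpClient();
    private final String baseUrl;   // hypothetical endpoint base URL
    private String token;           // cached after a single authentication call

    public ReusableRestClient(String baseUrl) { this.baseUrl = baseUrl; }

    // Authenticate once and cache the token instead of logging in per row.
    public void authenticate(String user, String password) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(baseUrl + "/auth"))
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"user\":\"" + user + "\",\"password\":\"" + password + "\"}"))
                .header("Content-Type", "application/json")
                .build();
        token = client.send(req, HttpResponse.BodyHandlers.ofString()).body().trim();
    }

    // Called once per row (or per batch of rows) using the already-open client.
    public int post(String jsonPayload) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(baseUrl + "/records"))
                .POST(HttpRequest.BodyPublishers.ofString(jsonPayload))
                .header("Authorization", "Bearer " + token)
                .header("Content-Type", "application/json")
                .build();
        return client.send(req, HttpResponse.BodyHandlers.ofString()).statusCode();
    }
}
```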

Related

API which provides data from Elasticsearch and not SQL

I have a system with large datasets over which I want quick searches, and Elasticsearch is suitable for that. So the data resides in SQL and is synced to ES. There is an obvious small delay in this sync.
There are consumers of this data that can work with slightly stale data, for example the API behind the UI that end users use to see the dataset. A delay of 3-4 seconds is acceptable there, so an API handler that reads from ES is perfect.
Then there are consumers of this data (bots) that want to work with real-time data. For almost the same requirements, should I create another API, just like the one for the UI consumer, which gets its data from SQL?
What is the usual best practice here? I'm assuming this is a very common use case.
You should probably stick to creating just a single API and use a query string parameter to decide which of the two data sources to use. This will result in less code to maintain.
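A minimal sketch of that idea, with a hypothetical handler and repository interfaces standing in for whatever framework you actually use:

```java
// Hypothetical repository abstraction shared by both data sources.
interface DatasetRepository {
    java.util.List<String> search(String query);
}

class DatasetController {
    private final DatasetRepository elastic; // near-real-time, slightly stale
    private final DatasetRepository sql;     // real-time source of truth

    DatasetController(DatasetRepository elastic, DatasetRepository sql) {
        this.elastic = elastic;
        this.sql = sql;
    }

    // One endpoint; a "consistency" query-string parameter picks the backend,
    // e.g. GET /dataset?q=foo&consistency=realtime
    java.util.List<String> getDataset(String query, String consistency) {
        DatasetRepository source =
                "realtime".equalsIgnoreCase(consistency) ? sql : elastic;
        return source.search(query);
    }
}
```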

How to efficiently trigger a system command from a SQL query or table change?

I have a data conversion and caching service running as a self-hosted WCF service.
Right now it polls the database at constant short intervals to update its data.
I think that's unnecessary. The data can only change if one of the tables changes, and when that happens depends on users' actions in the system.
There is no problem in setting a trigger for specific tables; however, I would need an action outside SQL Server to update my cache. My WCF service could perform the update when it receives a request to a specific URI over HTTP. So all I need is a command in a table trigger that would send such a request. Is that even possible?
I'm thinking about a hack I used back in the day with HTTP requests: I halted the HTTP response at the server until a data packet from somewhere else arrived. There was no delay between polling requests, and I achieved fully asynchronous, "real-time" updates.
Maybe this approach can be applied with SQL? I'm thinking of a query that blocks until it receives a signal. It would eventually time out, but it's good enough to try. Then, how do you signal and wait in SQL? By locking and unlocking a shared resource, like a cursor or a dummy table?
Any other options?
I need the cache update done at the lowest possible frequency (it's pretty expensive, so once per minute is great), but I need an immediate update when the data changes.
To answer your question, have you looked at xp_cmdshell?
https://msdn.microsoft.com/en-us/library/ms175046.aspx
However, the security/performance implications of such a decision could be non-trivial depending on your use case.
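If you do go down that road, the trigger would use xp_cmdshell to launch a small command-line program that pings your service's refresh URI. A minimal, hypothetical sketch of such a notifier (written in Java here purely for illustration; the URI is made up):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Tiny command-line notifier that a table trigger could launch via xp_cmdshell
// to tell the caching service to refresh itself.
public class CacheRefreshNotifier {
    public static void main(String[] args) throws Exception {
        // Hypothetical refresh endpoint exposed by the WCF service.
        String uri = args.length > 0 ? args[0] : "http://localhost:8080/cache/refresh";
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(uri)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println("Refresh request returned HTTP " + response.statusCode());
    }
}
```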

Handling paging with changing sort orders

I'm creating a RESTful web service (in Golang) which pulls a set of rows from the database and returns it to a client (smartphone app or web application). The service needs to be able to provide paging. The only problem is this data is sorted on a regularly changing "computed" column (for example, the number of "thumbs up" or "thumbs down" a piece of content on a website has), so rows can jump around page numbers in between a client's request.
I've looked at a few PostgreSQL features that I could potentially use to help me solve this problem, but nothing really seems to be a very good solution.
Materialized Views: to hold "stale" data which is only updated every once in a while. This doesn't really solve the problem, as the data would still jump around if the user happens to be paging through the data when the Materialized View is updated.
Cursors: created for each client session and held between requests. This seems like it would be a nightmare if there are a lot of concurrent sessions at once (which there will be).
Does anybody have any suggestions on how to handle this, either on the client side or database side? Is there anything I can really do, or is an issue such as this normally just remedied by the clients consuming the data?
Edit: I should mention that the smartphone app allows users to view more pieces of data through "infinite scrolling", so it keeps track of its own list of data client-side.
This is a problem without a perfectly satisfactory solution because you're trying to combine essentially incompatible requirements:
Send only the required amount of data to the client on-demand, i.e. you can't download the whole dataset then paginate it client-side.
Minimise the amount of per-client state that the server must keep track of, for scalability with large numbers of clients.
Maintain different state for each client
This is a "pick any two" kind of situation. You have to compromise; accept that you can't keep each client's pagination state exactly right, accept that you have to download a big data set to the client, or accept that you have to use a huge amount of server resources to maintain client state.
There are variations within those that mix the various compromises, but that's what it all boils down to.
For example, some people will send the client some extra data, enough to satisfy most client requirements. If the client exceeds that, then it gets broken pagination.
Some systems will cache client state for a short period (with short-lived unlogged tables, tempfiles, or whatever), but expire it quickly, so if the client isn't constantly asking for fresh data it gets broken pagination.
Etc.
See also:
How to provide an API client with 1,000,000 database results?
Using "Cursors" for paging in PostgreSQL
Iterate over large external postgres db, manipulate rows, write output to rails postgres db
offset/limit performance optimization
If PostgreSQL count(*) is always slow how to paginate complex queries?
How to return sample row from database one by one
I'd probably implement a hybrid solution of some form (sketched in code after this list), like:
Using a cursor, read and immediately send the first part of the data to the client.
Immediately fetch enough extra data from the cursor to satisfy 99% of clients' requirements. Store it to a fast, unsafe cache like memcached, Redis, BigMemory, EHCache, whatever under a key that'll let me retrieve it for later requests by the same client. Then close the cursor to free the DB resources.
Expire the cache on a least-recently-used basis, so if the client doesn't keep reading fast enough they have to go get a fresh set of data from the DB, and the pagination changes.
If the client wants more results than the vast majority of its peers, pagination will change at some point as you switch to reading directly from the DB rather than the cache, or generate a new, bigger cached dataset.
That way most clients won't notice pagination issues and you don't have to send vast amounts of data to most clients, but you won't melt your DB server. However, you need a big boofy cache to get away with this. Whether it's practical depends on whether your clients can cope with pagination breaking - if it's simply not acceptable to break pagination, then you're stuck with doing it DB-side with cursors, temp tables, copying the whole result set at first request, etc. It also depends on the data set size and how much data each client usually requires.
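Here is a rough sketch of that hybrid in Java/JDBC, with a simple in-process LRU map standing in for memcached/Redis; the table, columns, page size, and prefetch depth are all assumptions for the example:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HybridPager {
    private static final int PAGE_SIZE = 50;
    private static final int PREFETCH_PAGES = 20; // enough for ~99% of clients

    // LRU cache keyed by client/session id; a real deployment would use Redis etc.
    private final Map<String, List<String>> cache = Collections.synchronizedMap(
            new LinkedHashMap<String, List<String>>(256, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, List<String>> e) {
                    return size() > 10_000;
                }
            });

    // First request: read one snapshot for this client, return page 0, and keep
    // the rest cached so later pages see a consistent ordering.
    public List<String> firstPage(Connection db, String clientId) throws Exception {
        List<String> snapshot = new ArrayList<>();
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT title FROM content ORDER BY thumbs_up DESC, id")) {
            // Stream via a server-side cursor (PostgreSQL needs autocommit off for this).
            ps.setFetchSize(PAGE_SIZE);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next() && snapshot.size() < PAGE_SIZE * PREFETCH_PAGES) {
                    snapshot.add(rs.getString(1));
                }
            }
        }                                          // cursor closed, DB resources freed
        cache.put(clientId, snapshot);
        return page(snapshot, 0);
    }

    // Later requests: serve from the cached snapshot; if it was evicted or the
    // client reads past it, the caller falls back to a fresh firstPage() and
    // accepts that the ordering may have shifted.
    public List<String> nextPage(String clientId, int pageNo) {
        List<String> snapshot = cache.get(clientId);
        return snapshot == null ? null : page(snapshot, pageNo);
    }

    private static List<String> page(List<String> rows, int pageNo) {
        int from = Math.min(pageNo * PAGE_SIZE, rows.size());
        int to = Math.min(from + PAGE_SIZE, rows.size());
        return rows.subList(from, to);
    }
}
```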
I am not aware of a perfect solution to this problem, but if you want the user to have a stale view of the data then a cursor is the way to go. The only tuning you can do is to store only the data for the first two pages in the cursor; beyond that, you fetch it again.

Good ways to decouple GUIs from SOAP/WS-API update/write calls?

Let's assume we have some configuration GUI that in its current form uses direct DB transactions to submit new configurations for more than one configurable component in a consistent manner.
Now let's move the data (DB) stuff behind some SOAP/WS API. The GUI has no direct DB access anymore. The transactional behaviour must remain, but the API should NOT be designed to explicitly accommodate the GUI form submissions. In fact, I don't even know how the new GUI will work or how the user input will be structured. Therefore I need to provide something like WS-AtomicTransaction on the API server side. However, there are (at least) two caveats:
The GUI is written in PHP: I don't think there is any WS-Transaction support in PHP available.
I don't want to keep DB transactions open on the server side while waiting for additional client requests.
Solutions I can think of:
using Camel's aggregation. However, that would make things more complicated in at least two ways:
You cannot use DB row ids of newly inserted rows in the subsequent calls inside the same transaction. You need to use some sort of symbolic back-referencing because there would be no communication between client and server while processing the aggregated messages.
call replies would not be immediate (or the immediate and separate reply to each single call would only be some sort of a stub, i.e. not containing any useful information beyond "your message has been attached to TX xyz" -- if that's at all possible in the Camel aggregation case).
the two disadvantages of the previous solution make me think of request batches where possibly the WS standards provide means for referencing call results in subsequent calls inside the batch transaction. Is there any such thing already available? Maybe even as a PHP client?
trying to eliminate lock contention in the database by carefully using row-level locks etc. However, when inserting new elements, my guess is that usually pages and index pages need to be locked by the DB.
maybe some server-side persistence layer using optimistic locking? But again, that would not return any DB IDs back to the client before the final commit if DB writes would be postponed until the commit (don't know if that's possible at all).
What do YOU think?
Transactions are a powerful tool, and we easily get into a thinking pattern in which we see every problem as a nail we hit with this big hammer. I can relate to your confusion because I've experienced it myself. Unfortunately, I have no better advice for you than to try not to think in terms of transactions but of atomic API calls.
When I think in terms of transactions, my thought pattern usually goes like this:
start transaction
read (repeat as required)
update (repeat as required)
commit/roll back
It takes some time to realize that we overuse this pattern. Actual conflicts are rare and there are many other ways of dealing with them. Here is a commonly used one in APIs (sketched in code after the list):
read and send data to client (atomic API call)
update data (on the client)
send original + updates back to the server (atomic API call)
start transaction (on server)
read
compare with original from client
if not same, return error (client should retry)
if same, update
commit
The last six points are part of the implementation of the API call.
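A minimal sketch of that server-side compare-and-update step in Java/JDBC, with a hypothetical config table and the original value echoed back by the client:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OptimisticConfigUpdate {

    /**
     * Atomic API call: apply the client's update only if the row still matches
     * the original value the client read earlier. Returns false if someone else
     * changed it in the meantime (the client should re-read and retry).
     */
    public boolean update(Connection db, long configId,
                          String originalValue, String newValue) throws Exception {
        boolean oldAutoCommit = db.getAutoCommit();
        db.setAutoCommit(false);
        try (PreparedStatement select = db.prepareStatement(
                 "SELECT value FROM config WHERE id = ? FOR UPDATE");
             PreparedStatement update = db.prepareStatement(
                 "UPDATE config SET value = ? WHERE id = ?")) {

            select.setLong(1, configId);
            try (ResultSet rs = select.executeQuery()) {
                if (!rs.next() || !rs.getString(1).equals(originalValue)) {
                    db.rollback();           // conflict: tell the client to retry
                    return false;
                }
            }
            update.setString(1, newValue);
            update.setLong(2, configId);
            update.executeUpdate();
            db.commit();                     // the transaction lives only inside this call
            return true;
        } catch (Exception e) {
            db.rollback();
            throw e;
        } finally {
            db.setAutoCommit(oldAutoCommit);
        }
    }
}
```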
Ferenc Mihaly
http://theamiableapi.com

Best practice for inserting and querying data from memory

We have an application that takes real-time data and inserts it into a database. It is online for 4.5 hours a day. We insert data second by second into 17 tables. At any time, a user may query any table for the latest second's data and some records from the history...
Handling the feed and insertion is done using a C# console application...
Handling user requests is done through a WCF service...
We figured out that insertion is our bottleneck; most of the time is taken there. We invested a lot of time trying to fine-tune the tables and indices, yet the results were not satisfactory.
Assuming that we have sufficient memory, what is the best practice for keeping data in memory instead of the database? Currently we are using DataTables that are updated and inserted into every second.
A colleague of ours suggested another WCF service, instead of the database, between the feed handler and the WCF user-requests handler. This WCF mid-layer would be TCP-based and would keep the data in its own memory. One might say that the feed handler could deal with user requests directly instead of having a middle layer between the two processes, but we want to separate things, so if the feed handler crashes we can still provide the user with the current records.
We are limited in time and want to move everything to memory in a short period. Is having a WCF service in the middle of two processes a bad thing to do? I know the requests add some overhead, but all three of these processes (feed handler, in-memory database (WCF), user-request handler (WCF)) are going to be on the same machine, and bandwidth will not be much of an issue.
Please assist!
I would look into creating a cache of the data (such that you can also reduce database selects), and invalidate data in the cache once it has been written to the database. This way, you can batch up calls to do a larger insert instead of many smaller ones, but keep the data in-memory such that the readers can read it. Actually, if you know when the data goes stale, you can avoid reading the database entirely and use it just as a backing store - this way, database performance will only affect how large your cache gets.
Invalidating data in the cache will be based on either whether it has been written to the database or whether it has gone stale, whichever comes last, not first.
The cache layer doesn't need to be complicated, but it should be multi-threaded so it can serve the data and also save it in the background. This layer would sit just behind the WCF service (the connection medium), and the WCF service should be extended to contain the logic of the console app plus the batching idea. Then the console app can just connect to WCF and throw results at it.
Update: the only other thing to say is to invest in a profiler to see if you are introducing any performance issues in code that are being masked. Also, profile your database. You mention you need fast inserts and selects - unfortunately, they usually trade off against each other...
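The question is about C#/WCF, but the shape of a batching write-behind cache is the same in any language; here is a minimal Java sketch for illustration only (the table name, schema, and one-second flush interval are assumptions):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WriteBehindCache {
    // Latest values live here so readers never have to touch the database.
    private final ConcurrentLinkedQueue<double[]> pending = new ConcurrentLinkedQueue<>();
    private final List<double[]> recent = Collections.synchronizedList(new ArrayList<>());
    private final Connection db;
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    public WriteBehindCache(Connection db) {
        this.db = db;
        // Flush once per second: one batched insert instead of many small ones.
        flusher.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
    }

    // The feed handler calls this every tick; readers query recent() in memory.
    public void add(double timestamp, double value) {
        double[] row = {timestamp, value};
        recent.add(row);   // a real service would also trim old entries here
        pending.add(row);
    }

    public List<double[]> recent() {
        synchronized (recent) {
            return new ArrayList<>(recent);
        }
    }

    private void flush() {
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO readings (ts, value) VALUES (?, ?)")) {
            double[] row;
            int batched = 0;
            while ((row = pending.poll()) != null) {
                ps.setDouble(1, row[0]);
                ps.setDouble(2, row[1]);
                ps.addBatch();
                batched++;
            }
            if (batched > 0) {
                ps.executeBatch();   // one round trip for the whole second's data
            }
        } catch (Exception e) {
            e.printStackTrace();     // a real service would retry or alert here
        }
    }
}
```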
What kind of database are you using? MySQL has a storage engine MEMORY which would seem to be suited to this sort of thing.
Are you using DataTable with DataAdapter? If so, I would recommend that you drop them completely. Insert your records directly using DBCommand. When users request reports, read data using DataReader, or populate DataTable objects using DataTable.Load (IDataReader).
Storing data in memory carries the risk of losing data in case of crashes or power failures.