New to Pentaho, I am calculating multiple metrics in this job by filtering the data into multiple streams.
I have validated each stream individually and the calculations are working fine.
Now I want to load them into the target database. I tried using a Multiway Merge Join (I wasn't sure if that is the right step), but it isn't yielding any records.
Please suggest appropriate steps to achieve this. I have enclosed the kettle file here.
Thanks!! DimLoad
OK, got the transform. After looking at it for a while, I think the problem is that each stream that flows into the Multiway Merge Join needs to be sorted by the join keys. There is practically no documentation on this step, but it works the same way as the regular Merge Join step, just with more than two streams, and the Merge Join step requires sorted input.
FYI, the Filter Rows step is a performance killer. If you have a large input set, I'd look at pushing that first filter down into the SELECT statement of the Table Input. Then split out the other rows with a Switch/Case step instead of 13 separate Filter Rows steps. Right now you're making 13 copies of every row in the entire table.
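For example (a rough sketch only; the table and column names below are invented), the condition from the first Filter Rows step can go straight into the Table Input query, which can also sort on the join keys so the Multiway Merge Join gets the sorted input it needs:

SELECT customer_id,
       metric_type,
       metric_value
FROM   source_table          -- hypothetical source table
WHERE  status = 'ACTIVE'     -- the first filter, pushed down into the database
ORDER BY customer_id         -- sorted on the join key(s) for the merge join

A single Switch/Case step on metric_type can then split the remaining rows into the individual metric streams instead of 13 separate Filter Rows steps.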
How do I build a general query based on a list of tables and run/export the results?
Here is my conceptual structure:
[image: conceptual structure]
Basically, the FROM clause in the query would be filled in from the table rows.
After that, the result for each schema.table that returns true must be stored in a text file.
Could someone help me?
As Pentaho doesn't support loops directly, it's a bit complicated. Basically, you read the list of tables and copy the rows to the result, and then you have a foreach job that runs once for every row in the result.
I use this setup for all jobs like this: How to process a Kettle transformation once per filename
In the linked example, it is used to process files, but it can be easily adapted to your use.
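As a sketch of how the parameterized query could look (the variable and column names here are just examples; tick "Replace variables in script?" on the Table Input step so the ${...} references get substituted):

SELECT '${SCHEMA_NAME}.${TABLE_NAME}'                       AS source_table,
       CASE WHEN COUNT(*) > 0 THEN 'true' ELSE 'false' END  AS result
FROM   ${SCHEMA_NAME}.${TABLE_NAME}

The foreach job sets ${SCHEMA_NAME} and ${TABLE_NAME} from the current row, and a Text file output step (with the same variables in its filename) can then write one file per schema.table for the rows where result is 'true'.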
In my corporate project, I need to cross join a dataset of over a billion rows with another of about a million rows using Spark SQL. Since a cross join was used, I decided to divide the first dataset into several parts (each having about 250 million rows), cross join each part with the million-row one, and then combine the results with "union all".
Now I need to improve the performance of the join. I've heard this can be done by partitioning the data and distributing the work across Spark workers. My questions are: how can partitioning be used effectively to improve performance, and what other ways are there to do this without partitioning?
Edit: filtering already included.
Well, in all scenarios you will end up with tons of data. Be careful and avoid cartesian joins on big data sets as much as possible, as they usually end in OOM exceptions.
Yes, partitioning can help you, because you need to distribute your workload from one node to more nodes, or even to the whole cluster. The default partitioning mechanism is a hash of the key, or the original partitioning key from the source (Spark takes this from the source directly). First evaluate what your partitioning key is right now; then you may find a better partitioning key/mechanism, repartition the data, and thereby distribute the load. The join still has to be done, but it will be done over more parallel sources.
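As a rough Spark SQL sketch of that idea (all table and column names are invented), repartition the large dataset on the key you join and filter on, then join on that same key so the work is spread over many parallel tasks:

-- hash-partition the big dataset on the candidate key
CREATE OR REPLACE TEMPORARY VIEW big_repartitioned AS
SELECT * FROM big_table DISTRIBUTE BY candidate_key;

-- equi-join on that key instead of a plain cross join
SELECT b.*, s.*
FROM   big_repartitioned AS b
JOIN   small_table       AS s
  ON   b.candidate_key = s.candidate_key;

In the DataFrame API the equivalent repartitioning call is df.repartition() on the same key.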
There should be some filters on your join query. You can use the filter attributes as the key to partition the data and then join based on those partitions.
I transliterated a script from T-SQL to U-SQL and ran into an issue running the job, namely that it seemed to get "stuck" on one of the stages - after 2.5 hours the job graph showed it had read in 200MB and had written over 3TB but wasn't anywhere near finished. (Didn't take a screenshot, sorry.)
I tracked it down to one of the queries joining a table with 34 million rows twice to a table with 1600 rows:
#ProblemQuery =
    SELECT
        gp.[Group],      // 16 groups
        gp.[Percentile], // 1-100
        my_fn(lt1.[Value], lt2.[Value], gp.[Value]) AS CalculatedNumber
    FROM
        #LargeTable AS lt1
        INNER JOIN #GroupPercent AS gp
            ON lt1.[Group] == gp.[Group]
            AND lt1.[Row ID] == gp.[Row ID 1]
        INNER JOIN #LargeTable AS lt2
            ON gp.[Group] == lt2.[Group]
            AND gp.[Row ID 2] == lt2.[Row ID]
    ;
It seems that the full cartesian product (~2e18 rows) is being stored during the processing rather than just the filtered 1600 rows. My first thought was that it might be from using AND rather than &&, but changing that made no difference.
I've managed to work around this by splitting the one query with two joins into two queries with one join each, and the whole job completed in under 15 minutes without a storage blowout.
But it's not clear to me whether this is fully expected behaviour when multiple columns are used in the join or a bug, and whether there's a better approach to this sort of thing. I've got another similar query to split up (with more joins, and more columns in the join condition) and I can't help but feel there's got to be a less messy way of doing this.
U-SQL applies some join reorder heuristics (although I don't know how it deals with the apparent self-join). I doubt it is related to you using multiple columns in the join predicate. I assume that our heuristic may be off. Can you please either file an incident or send me the job link to [usql] at microsoft dot com? That way we can investigate what causes the optimizer to pick the worse plan.
Until then, splitting the joins into two statements and thus forcing the better join order is the best workaround.
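For reference, the split described above looks roughly like this: materialize the first join into an intermediate rowset (carrying through the columns the second join and the final SELECT need), then join that back to #LargeTable:

#FirstJoin =
    SELECT
        gp.[Group],
        gp.[Percentile],
        gp.[Value],
        gp.[Row ID 2],
        lt1.[Value] AS [Value1]
    FROM
        #LargeTable AS lt1
        INNER JOIN #GroupPercent AS gp
            ON lt1.[Group] == gp.[Group]
            AND lt1.[Row ID] == gp.[Row ID 1]
    ;

#ProblemQuery =
    SELECT
        fj.[Group],
        fj.[Percentile],
        my_fn(fj.[Value1], lt2.[Value], fj.[Value]) AS CalculatedNumber
    FROM
        #FirstJoin AS fj
        INNER JOIN #LargeTable AS lt2
            ON fj.[Group] == lt2.[Group]
            AND fj.[Row ID 2] == lt2.[Row ID]
    ;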
I am currently working on a scenario in Informatica PowerCenter Designer where the situation is as follows:
SQ1: I am pulling employee records according to the criterion of having a layer of employees based on their hierarchy (client relation directors). This is the first source qualifier, in which I am doing a SQL override to extract data from 3 tables.
For those selected employees I then have to pull some other information, for example:
SQ2: which client relations they are handling, which is in a separate source qualifier, and
SQ3: some of the personal information from their profile, which is in a third source qualifier.
I have a single mapping with the three source qualifiers described above, and in all of them I am using a SQL override. My question is this: the data that I pull in the first source qualifier is a subset of the total employee records, but in source qualifiers 2 and 3 I have to pull all employee data and then join on employee_id in two Joiners to finally collect data for the layer of employees coming from source qualifier 1. What I want is to somehow save the employee IDs from SQ1 and use them in SQ2 and SQ3 so that I pull data for only that subset of employees. The problem is I can't split the mapping, and I can't repeat the code for selecting the subset from SQ1 because that would duplicate logic and take a long time to run; also, the number of records is about one million. I can't find a way to do this, which is why I am asking for help here.
I am pulling data from db2, and working in powercenter designer 9.5.1.
I will be thankful for any guidance regarding the above issue.
What you can do, if all the tables are in the same database, is pull the source tables into one source qualifier and then override the SQL to create the join.
So the point is that instead of 3 different source qualifiers you can have one source qualifier.
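A hedged sketch of what that single SQL override could look like (all table and column names here are invented; substitute your real DB2 tables and the director-layer criteria from SQ1):

SELECT e.employee_id,
       cr.client_id,
       p.profile_attribute
FROM   employees             e
       JOIN client_relations cr ON cr.employee_id = e.employee_id
       JOIN employee_profile p  ON p.employee_id  = e.employee_id
WHERE  e.hierarchy_role = 'CLIENT_RELATION_DIRECTOR'   -- the layer filter currently in SQ1

That way the director subset is selected once in the database, and only the matching rows flow into the mapping.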
I assume you have three separate source qualifiers because the data is present in different databases. If not, doing an application join from three different source qualifiers (you will have to use 2 Joiners) is very expensive. There are a couple of ways you can do this:
Split the mapping to stage the data first, then use this staging layer as the source to perform the more complex operations.
Identify your driving table. Since the record counts in SQ2 and SQ3 are bigger, I am assuming they can be the driving table. Use a Lookup for SQ1 (since it is the smaller table, the cache build time would not be very long).
I would still suggest you use a staging layer to extract and stage the data, then transform it. Try to perform database joins (or lookups) as much as you can instead of joining at the application layer.
Consider using a pipeline lookup as a query for your SQ1 and use it in the pipeline that joins SQ2 and SQ3.
Usage for pipeline lookup can be found at:
https://marketplace.informatica.com/solutions/performance_tuning_pipeline_lookup
Let me know if it helps.
I have a Data Flow Task that does some script component tasks, sorts, then does a Merge Join. I'd like to have the Merge Join do the join as a 1-many. If I do an Inner Join, I get too few records:
If I do a Left Outer Join, I get WAY too many records:
I'm looking for the Goldilocks version of 'Just Right' (which would be 39240 records).
You can add a Conditional Split after your left outer join version of the Merge Join, with a non-matching condition like
isnull(tmpAddressColumn)
and send only the matching rows (the default output) to your destination.
If you still don't get the correct number, you'll need to check the merge join conditions and check if there are duplicate IDs in each source.
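A quick way to check for duplicate IDs (placeholder table and column names; run it against each of the two inputs feeding the Merge Join):

SELECT join_id, COUNT(*) AS row_count
FROM   source_table
GROUP BY join_id
HAVING COUNT(*) > 1;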
The number of rows shouldn't be what you're using to gauge if you're using the correct options for the Merge Join. The resulting data set should be the driving factor. Do the results look correct in the tmpManAddress table?
For development you might want to push the output of the script components to tables so you can see what data you're starting with. This will allow you to work out which type of join, and which columns to join on, will give you the results you want.