Pentaho: Switch -> Insert row into two tables

I have a Merge diff step, and when rows come out flagged as "changed" I need to insert them into two different tables.
How can I direct the "changed" case to two different tables?

I ended up duplicating the stream into TWO Switch/Case steps, one for each table.
It does not look elegant, and I am not sure whether there is a more maintainable/efficient way.

Yeah, the Switch/Case step is designed to have only one hop out per condition, and once a condition is met it doesn't evaluate the rest of the conditions. So if you need to stream the rows for one of the conditions to two separate steps, a Dummy step would be my choice too: send the "changed" rows to the Dummy step, then add two or more outgoing hops from it with "copy rows to next step" selected.

Related

Delete duplicates excluding columns

I am trying to delete duplicates from an internal table, comparing all columns excluding some of them. Obviously I can list all the columns that I want to compare using COMPARING, but this would not look good in code.
So let's say there are 100 columns and I want to exclude 2 of them from the comparison.
How can I achieve that in a smart way?
You could use the DELETE ADJACENT DUPLICATES statement; with its COMPARING addition you can define which columns are compared. You'll just have to sort the itab by those columns before this operation.

Should I combine two tables into one to improve query performance

I have two result tables used to calculate Monthly Allowance and OT-Allowance.
Should I combine the two tables into one, use a view, or just join the two tables in the SQL statement?
The question becomes: why are there two tables in the first place? If the two tables are always 1 to 1 (every time a row goes into the first table, a row must also always go into the second), then yes, combine them and save the space of the duplicate data and indexes. If it is possible for the two tables to have different entries (one always has a row per "incident" but the other may not, or data sometimes lands in the first table and sometimes in the second), then keep them as two tables to minimize the dead space per row. The data should drive the table design. You don't say how many rows you are talking about; if the rows, tables, and DB are small, does it really matter, other than trying to design well (always a good idea)?
Another related consideration is how much data is already in the current design and how much effort would be needed to convert to a different one. Is it worth the effort? I prefer to normalize the data "mostly" (I am not fanatical about it), but I also tend not to mess with what isn't broken.
Sorry if you were wanting a more clear-cut answer...
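If the tables stay separate but you want the convenience of querying one thing, a view is a cheap middle ground. A minimal sketch, assuming hypothetical tables monthly_allowance and ot_allowance keyed by employee_id and pay_month:

-- Hypothetical table and column names; adjust to your actual schema.
-- The view presents one "table" while the data stays in two.
CREATE VIEW v_allowance AS
SELECT m.employee_id,
       m.pay_month,
       m.monthly_allowance,
       o.ot_allowance
FROM monthly_allowance m
LEFT JOIN ot_allowance o
       ON o.employee_id = m.employee_id
      AND o.pay_month = m.pay_month;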

Merging multiple tables from multiple databases with all rows and columns

I have 30 databases from a survey application, each with a results table of approximately 100 columns. Most of the columns are identical, but each survey seems to have a unique column or two added with no real pattern (these are the added questions and results of that survey). As I work on the statement to join all of the tables into one large master table, the code is getting quite complex. Is there a more efficient way to merge these tables from multiple databases, selecting all rows and columns, so that values go into a column when it already exists and a new column is created when one is encountered?
No, there isn't an automatic way to merge a bunch of similar, but not quite the same, tables into one. At least, not in any database system that I know of.
You could possibly automate something like that with a fairly simple script that relies on your database's information schema (or equivalent).
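For instance, a query against INFORMATION_SCHEMA can tell you which columns exist in each database's results table, which is most of what such a script needs. A sketch, assuming SQL Server and hypothetical database and table names:

-- Sketch only: assumes SQL Server, databases named Survey01..Survey30,
-- and a results table called SurveyResults in each; adjust to your names.
SELECT 'Survey01' AS source_db, COLUMN_NAME, DATA_TYPE
FROM Survey01.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'SurveyResults'
UNION ALL
SELECT 'Survey02', COLUMN_NAME, DATA_TYPE
FROM Survey02.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'SurveyResults';
-- Repeat (or generate) one branch per database, then diff the lists to
-- work out the full column set for the combined master table.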
However, with only 30 tables and only a column or two different in each, I'm not sure it's worth it. A manual approach, with copying and pasting and making minor changes, would probably be faster.
Also, consider whether the "extra" columns that are unique to individual tables need to go into the combined table at all. The point of making one big table is to process/analyze all the data together; if a column only applies to a single source, that kind of combined analysis isn't possible for it.

Find which table is causing duplicate rows in a view

I have a view in sql server which should be returning one row per project. A few of the projects have multiple rows. The view has a lot of table joins so I would not like to have to manually run a script on each table to find out which one is causing duplicates. Is there a quick automated way to find out which table is the problem table (aka the one with duplicate rows)?
The quickest way I've found is:
find an example dupe
copy out the query
comment out all joins
add the joins back one at a time until the duplicate rows come back
Whichever join brings the dupes back is the table where you have multiple matching records.
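Finding an example dupe to start from is a quick aggregate against the view (the view and column names here are hypothetical):

-- List the projects that the view returns more than once.
SELECT project_id, COUNT(*) AS row_count
FROM dbo.ProjectView        -- hypothetical view name
GROUP BY project_id
HAVING COUNT(*) > 1;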
My technique is to make a copy of the view and modify it to return every column from every table in the order of the FROM clause, with extra marker columns in between that use the table names as the column names (see example below). Then select a few rows and slowly scan to the right until you find the table whose row data is NOT duplicated across the dupe rows; that is the one causing the dupes.
SELECT
TableA = '----------', TableA.*,
TableB = '----------', TableB.*
FROM ...
This is usually a very fast way to find out. The problem with commenting out joins is that then you have to comment out the matching columns in the select clause each time, too.
I used a variation of SpectralGhost's technique to get this working, even though neither method really solves the problem of avoiding the manual checking of each table for duplicate rows.
My variation was to use a divide-and-conquer approach to commenting out the joins instead of commenting out each one individually. Due to the sheer number of joins, this was much faster.

How (and where) should I combine one-to-many relationships?

I have a user table, and then a number of dependent tables with a one to many relationship
e.g. an email table, an address table and a groups table. (i.e. one user can have multiple email addresses, physical addresses and can be a member of many groups)
Is it better to:
Join all these tables, and process the heap of data in code,
Use something like GROUP_CONCAT and return one row, and split apart the fields in code,
Or query each table independently?
Thanks.
It really depends on how much data you have in the related tables and on how many users you're querying at a time.
Option 1 tends to be messy to deal with in code.
Option 2 tends to be messy to deal with as well, in addition to the fact that grouping tends to be slow, especially on large datasets.
Option 3 is the easiest to deal with but generates more queries overall. If your data set is small and you're not planning to scale much beyond your current needs, it's probably the best option. It's definitely the best option if you're only trying to display one record.
There is a fourth option, however, that is a middle-of-the-road approach I use in my job, where we deal with a very similar situation. Instead of getting the related records for each row one at a time, use IN() to get all of the related records for your whole result set in a single query. Then loop in your code to match them to the appropriate record for display. If you cache search queries, you can cache that second query as well. It's only two queries and only one loop in the code (no parsing; use hashes to relate things by their key).
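A sketch of that fourth option, assuming hypothetical users and emails tables related by user_id:

-- 1) Fetch the users you are displaying (the search criteria are made up).
SELECT id, name
FROM users
WHERE last_login >= '2015-01-01';

-- 2) Fetch every related email for that result set in one query, then
--    match them to their users in application code by user_id
--    (e.g. with a hash keyed on user_id).
SELECT user_id, email
FROM emails
WHERE user_id IN (1, 2, 3);   -- the ids returned by query 1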
Personally, assuming my table indexes were up to scratch, I'd go with a table join, get all the data out in one go, and then process it into a nested data structure. This way you're playing to each system's strengths.
Generally speaking, do the most efficient query for the situation you're in. So don't create a mega query that you use in all cases. Create case specific queries that return just the information you need.
In terms of processing the results, if you use GROUP_CONCAT you have to split all the resulting values during processing, and if extra delimiter characters appear in your GROUP_CONCAT'd values this can be problematic. My preferred method is to store the GROUPed BY field in a $holder variable during the output loop, compare that field to $holder on each pass, and change the output accordingly.
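If you do use GROUP_CONCAT, picking an explicit separator that cannot appear in the data makes the split step in code safer. A sketch in MySQL syntax, with hypothetical table and column names:

-- One row per user, with all email addresses collapsed into one string.
SELECT u.id,
       u.name,
       GROUP_CONCAT(e.email ORDER BY e.email SEPARATOR '|') AS emails
FROM users u
LEFT JOIN emails e ON e.user_id = u.id
GROUP BY u.id, u.name;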