Automatic column mapping in UPDATE step - pentaho

For inserts, if the source and target columns are the same, no mapping or "Select values" step is required. But for updates, there seems to be a need to specify the list of update fields.
My concern is around manually updating the KTRs each time columns are added to or changed in a source table. Is there a way to enable automatic mapping in the Update step? See the screenshot of the "update fields" section; automatic mapping would mean that the update fields section could be left blank.

There are good reasons NOT to do so.
Believe me, having a robot change your KTRs is not a good idea. And there are good reasons not to change column names often in an OLAP schema, unless you enjoy being in conflict with the report designers and, even worse, with the dashboard and front-end JavaScript people.
So if pressing a button per table is not a solution for you, because maybe you have 1000 tables to update, what you can do is use a Metadata Injection step. You'll find nice examples on Diethard Steiner's blog or Jens Bleuel's blog. In short, you make the Update step's metadata dynamic, but you first have to examine each table to get the column names.
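As a rough sketch of the "examine each table" part (assuming an ANSI INFORMATION_SCHEMA catalog and a placeholder table name, neither of which comes from the question), the query that feeds the injection can be as simple as:

-- One row per column of the target table; these rows can then be injected
-- into the key/update field metadata of the Update step.
SELECT COLUMN_NAME AS stream_field,
       COLUMN_NAME AS table_field
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_NAME = 'my_target_table'   -- placeholder name
ORDER  BY ORDINAL_POSITION;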

Related

Which is better: a trigger, or SELECT then UPDATE, for updating multiple rows in related tables?

Scenario:
Entity 1 can have zero or more Entity 2 records.
What I'm trying to do:
When a field in Entity 1 is updated, a field in the related Entity 2 records is updated accordingly.
What I'm doing:
Updating the field in Entity 1 with an UPDATE statement, then querying the related Entity 2 records (using SELECT ATTR FROM ENTITY2 WHERE ENTITY1.ID = ENTITY2.ENT1_ID) just to get the old value of the ENTITY2 attribute before updating those records. The type of update (e.g. subtract or add) on the ENTITY2 records is based on the value of the update on ENTITY1.
Alternative:
Using triggers to update these related records automatically (a rough sketch follows below).
(I still plan to study how to implement triggers, but I am not sure it is worth it. Any help or links on this would also be appreciated.)
Is it better to use triggers, or should I stick to my current solution (which I think is quite slow due to the number of SQL executions, but easier to track down)?
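For reference, a minimal sketch of the trigger alternative being considered, written with made-up table and column names (entity1.amount, entity2.attr) and Oracle-style syntax; treat it as an illustration, not a recommendation:

-- Whenever entity1.amount changes, apply the same delta to the related entity2 rows.
CREATE OR REPLACE TRIGGER trg_entity1_propagate
AFTER UPDATE OF amount ON entity1
FOR EACH ROW
BEGIN
  UPDATE entity2
  SET    attr = attr + (:NEW.amount - :OLD.amount)
  WHERE  ent1_id = :NEW.id;
END;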
There are people, such as Tom Kyte, who believe triggers should be used as little as possible, if at all.
There are others, such as Toon Koppelaars, who believe they should be used, if their use is considered carefully.
I am of the second camp and believe triggers may be used. However, this use should not be to 'automagically' cause cascade actions such as you are suggesting. Instead, these triggers may be used to enforce integrity constraints that cannot be declared using the standard mechanism of a table constraint clause, i.e. the triggers themselves do no DML other than SELECT from tables.
(Note: there are other mechanisms by which these constraints may be enforced, including materialized views or the introduction of additional columns and the use of specific indexing strategies.) Therefore, I would suggest another alternative. Create triggers, or use these alternative mechanisms, to ensure no data that breaks your integrity constraints can be committed. Then create APIs, using PL/SQL, that encapsulate the multi-table data amendments required to keep the integrity constraints satisfied, and use these as your update path.
In this way you can be assured that no invalid data exists in the database, but also that the actual DML required to achieve this is not hidden across the database in multiple program units and triggers but is stated explicitly in one place.
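A minimal sketch of such an API, again with hypothetical names (entity1/entity2 and an amount/attr pair mirroring the question's scenario); the point is simply that both amendments live in one explicitly named place:

CREATE OR REPLACE PROCEDURE adjust_entity1_amount (
  p_entity1_id IN NUMBER,
  p_delta      IN NUMBER
) AS
BEGIN
  -- Amend the parent...
  UPDATE entity1
  SET    amount = amount + p_delta
  WHERE  id = p_entity1_id;

  -- ...and the dependent rows, in the same transaction and the same API.
  UPDATE entity2
  SET    attr = attr + p_delta
  WHERE  ent1_id = p_entity1_id;
END adjust_entity1_amount;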
Tom Kyte is brilliant. But he is, at heart, still just a DBA. Always keep that in mind when considering his advice on table design.
Can triggers be overused? Of course. But here's the rub: anything can be overused. I lean toward triggers because there is just no way to guarantee that all data manipulation will go through your app or any single channel. Or, if possible, define a foreign key relationship and let "cascade update" take care of everything. Tricky, I admit, and could be problematic, but don't reject any solution out of hand.
Having said that, I don't know if a trigger for this purpose is called for. I don't know why you are duplicating the data to a field in a different table. Without knowing your overall design and what you are trying to accomplish, there is no way to judge. But consider keeping the data in one field in one table and then using a view to expose that field as part of a second "table." Change the data where it resides and, voilà, it is now changed wherever it appears.
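A small sketch of that idea with made-up names: the value lives only in one table, and a view makes it appear as part of the "second" table:

-- status is stored only in orders; order_lines never duplicates it.
CREATE VIEW order_lines_with_status AS
SELECT ol.order_id,
       ol.line_no,
       ol.quantity,
       o.status                -- exposed here, but maintained only in orders
FROM   order_lines ol
JOIN   orders      o ON o.order_id = ol.order_id;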
Performance hit? Yes. But keeping duplicate data in different places and keeping them synchronized is a data integrity hit. Only you know (or are in a position to find out) which way this balance tilts.
Oh, can views be overused? Of course. But there's always that rub I mentioned; and besides, views are so chronically underused in most databases that overuse would be a long way away.

Add/remove columns of a table - code maintenance / optimisation

What is the best way to maintain the code of a big project?
Let's say you have 1000 stored procedures, and you have to add a new column to a table (or remove one).
There might be 1-2, or 30, stored procedures that are affected.
A simple "search" for the table name might not be good enough; let's say you only need to know the places where the table is the target of an insert/update/delete.
Searching for 'insert tablename' might be a good idea, but there might be one space between those two words, or two spaces, or a tab... and maybe the table name is written as '[tablename]'.
The same goes for all three (insert/update/delete).
I am basically looking for some kind of 'restricted dependencies'.
What is the best way to handle this?
Keep a database table with this kind of information, and change that table every time you make changes to stored procedures?
Keep some specific code as a comment next to each insert/update/delete, so that you can search for exactly what you need?
Example: 'insert_tablename', 'update_tablename', 'delete_tablename'
Does anyone have a better idea?
Ideally, changes are backward compatible: not just so that you can change a table without breaking all of the objects that reference it, but also so that you can deploy all of the database changes before you deploy the application code. (This is crucial in a distributed architecture: think of a downloadable desktop app or an iPhone app where folks connect to your database remotely.)
For example, if you add a new column to a table, it should be NULLable or have a default value so that INSERT statements don't need to be updated immediately to reference it. Stored procedures can be updated gradually to accept a new parameter to represent this column, and it should be nullable / optional so that the application(s) don't need to be aware of this column immediately. Etc.
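A sketch of that pattern in T-SQL, with hypothetical table, column, and procedure names:

-- New column is nullable, so existing INSERT statements keep working unchanged.
ALTER TABLE dbo.Customers ADD PreferredName nvarchar(100) NULL;
GO
-- The procedure gains an optional parameter; existing callers are unaffected.
ALTER PROCEDURE dbo.Customer_Insert
    @Name           nvarchar(100),
    @PreferredName  nvarchar(100) = NULL    -- new, optional
AS
BEGIN
    INSERT dbo.Customers (Name, PreferredName)
    VALUES (@Name, @PreferredName);
END
GO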
This also demands that your original INSERT statements include an explicit column list. If you just say:
INSERT dbo.table VALUES(@p1, @p2, ...);
Then that makes it much tougher to make your changes backward compatible.
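With an explicit column list instead (column names below are placeholders), the statement keeps working after a nullable column is appended:

INSERT dbo.table (col1, col2 /* , ... */)
VALUES (@p1, @p2 /* , ... */);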
As for removing a column, well, that's a little tougher. Dependency tracking is not perfect in SQL Server, but you should be able to find a lot of information from these dynamic management objects (an example query follows the list):
sys.dm_sql_referenced_entities
sys.dm_sql_referencing_entities
sys.sql_expression_dependencies
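For instance (object names below are placeholders), you can ask which modules reference a table, and which columns a given module actually touches:

-- Everything that references dbo.MyTable:
SELECT referencing_schema_name, referencing_entity_name
FROM   sys.dm_sql_referencing_entities('dbo.MyTable', 'OBJECT');

-- The columns a particular procedure depends on:
SELECT referenced_entity_name, referenced_minor_name AS column_name
FROM   sys.dm_sql_referenced_entities('dbo.MyProcedure', 'OBJECT');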
You might also find these articles interesting:
Keeping sysdepends up to date
Make your database changes backward compatible when adding a new column
Make your database changes backward compatible when dropping a column
Make your database changes backward compatible when renaming an entity
Make your database changes backward compatible when changing a relationship

netTiers database schema backwards compatibility

I've found that the netTiers generated code relies on an exact database schema and is very unforgiving of variations. For example, adding a column to an existing table: if the column is added somewhere in the middle of the table, you will see a cast error at runtime unless netTiers is recompiled. This is because the columns are accessed by ordinal and not by name. (Looking through the change log, I see that this was done as a performance improvement.)
This hasn't been a problem in the past, but on my current project we are trying to build a system with zero-downtime upgrades. The challenge we have is database upgrades, and it would be great if we could update the database without affecting the code.
Has anyone using netTiers had similar problems or looked into similar requirements?
Would altering the templates to access the columns by name be more tolerant of previous schema versions? If so, I think this would be worth a minor performance hit (about 3% is quoted here: DataReader ordinal-based lookups vs named lookups).
As you have noted, .NetTiers uses ordinal-based lookups in the DataReader. The only way to get .NetTiers to play nicely is to always add new columns to the end of existing tables, and never restructure the table's column order.
By doing this, your v1 code will still work against a table that has a new column appended at the end of the table, whilst your v2 code will work with the new addition.
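In practice that just means relying on ALTER TABLE ... ADD, which appends the column after the existing ones, so the ordinals the v1 code was generated against stay valid (table and column names here are made up):

-- Appended at the end: existing column ordinals are unchanged, so v1 .NetTiers
-- code keeps reading the right columns while v2 code can use the new one.
ALTER TABLE dbo.Customer ADD LoyaltyPoints int NULL;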

Implementing soft delete with minimal impact on performance and code

There are some similar questions on the topic, but they are not really helping me.
I want to implement a soft delete feature like on StackOverflow, where items are not really deleted, but just hidden. I am using a SQL database. Here are 3 options:
Add a is_deleted boolean field.
Advantages: Simple.
Disadvantages: No date record. Forces me to add an is_deleted = 0 condition to every query.
Add a deleted_date date field. This is set to NULL if it's not deleted.
Advantages: Has date.
Disadvantages: Still cluttering my queries.
For both of the above:
They will also impact performance, because all these useless rows still have to be maintained in indexes. Also, an index on the deleted column won't help when fetching the non-deleted rows (the majority); a full table scan is needed.
Another option is to create a separate table to hold deleted items:
Advantages: Improved performance when querying non-deleted rows. No need to add conditions to my queries on non-deleted rows. Easier on index maintenance.
Disadvantages: Complexity: Requires data migration for both deletion and undeletion. Need for new tables. Referential integrity is harder to handle.
Is there a better option?
I personally would base my answer off of how often you anticipate your users wanting to access that deleted data or "restore" that deleted data.
If it's often, then I would go with a "Date_Deleted" field and put a calculated "IsDeleted" in my poco in the code.
If it's never (or almost never) then a history table or deleted table is good for the benefits you explained.
I personally almost never use deleted tables (and opt for isDeleted or date_deleted) because of the potential risk to referential integrity. You have A -> B and you remove the record from B... you now have to manage referential integrity yourself because of that design choice.
If the key is numeric, I handle a "soft delete" by negating the key. (Of course, this won't work for identity keys.) You don't need to change your code at all, and you can easily restore the record by multiplying the key by -1.
Just another approach to give some thought to... If the key is alphanumeric, you can do something similar by prepending a unique "marker" character. Since deleted records will all begin with this marker, they will end up off by themselves in the index.
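A tiny sketch of the negation trick with made-up names; the same statement pattern restores the row:

-- "Soft delete": flip the sign of the key...
UPDATE orders SET order_id = -order_id WHERE order_id = @order_id;

-- ...and "undelete" by flipping it back.
UPDATE orders SET order_id = -order_id WHERE order_id = -@order_id;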
In my opinion, the best way forward, when thinking about scaling and eventual table/database sizes, is your third option: a separate table for deleted items. Such a table can eventually be moved to a different database to support scaling.
I believe you have listed the three most common options. As you have seen, each has advantages and disadvantages. Personally, I like taking the longer view on things.
I think your analysis of the options is good but you missed a few relevant points which I list below. Almost all implementations that I have seen use some sort of deleted or versioning field on the row as you suggest in your first two options.
Using one table with deleted flag:
If your indexes all have the deleted flag field first and your queries mostly contain a WHERE isdeleted = false type of condition, then it DOES solve your performance problems, and the indexes exclude the deleted rows very efficiently. Similar logic could be used for the deleted-date option.
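For example (hypothetical table and columns), an index that leads with the flag lets such queries seek straight past the deleted rows:

CREATE INDEX IX_Orders_IsDeleted_CustomerId
    ON dbo.Orders (is_deleted, customer_id);

-- Matches the index's leading column, so deleted rows are never touched:
SELECT order_id, order_date
FROM   dbo.Orders
WHERE  is_deleted = 0
  AND  customer_id = @customer_id;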
Using two tables:
In general you need to make massive changes to reports, because some reports may refer to deleted data (old sales figures might refer to a deleted sales category, for example). One can overcome this by creating a view that is a union of the two tables and reading from that, while writing only to the active-records table.
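A sketch of that union view with made-up table names; reports read from the view, applications write only to the active table:

CREATE VIEW dbo.Sales_All AS
    SELECT sale_id, category_id, amount, 0 AS is_deleted FROM dbo.Sales
    UNION ALL
    SELECT sale_id, category_id, amount, 1 AS is_deleted FROM dbo.Sales_Deleted;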
Let's suppose we create a field called dead to mark deleted rows. We can then create an index that covers only the rows where dead is false.
That way we only search the non-deleted rows, using a USE INDEX hint if necessary.
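A sketch of that, using SQL Server's filtered-index syntax as one possible implementation (PostgreSQL partial indexes are analogous); the names are made up:

-- The index contains only live rows, so it stays small and selective.
CREATE INDEX IX_Items_Live
    ON dbo.Items (name)
    WHERE dead = 0;

-- The predicate matches the index filter, so the optimizer can use it
-- (or be pushed to it with an index hint if necessary).
SELECT item_id, name
FROM   dbo.Items
WHERE  dead = 0
  AND  name = @name;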

Strategy for identifying unused tables in SQL Server 2000?

I'm working with a SQL Server 2000 database that likely has a few dozen tables that are no longer accessed. I'd like to clear out the data that we no longer need to be maintaining, but I'm not sure how to identify which tables to remove.
The database is shared by several different applications, so I can't be 100% confident that reviewing these will give me a complete list of the objects that are used.
What I'd like to do, if it's possible, is to get a list of tables that haven't been accessed at all for some period of time. No reads, no writes. How should I approach this?
MSSQL 2000 won't give you that kind of information. But one way you can identify which tables ARE used (and then deduce which ones are not) is to use SQL Profiler to capture all the queries that go to a certain database. Configure the profiler to record the results to a new table, and then check the queries saved there to find all the tables (and views, stored procedures, etc.) that are used by your applications.
Another way you might check whether there are any writes is to add a new timestamp column to every table, plus a trigger that updates that column every time there is an insert or an update (see the sketch below). But keep in mind that if your apps run queries of the type
select * from ...
then they will receive a new column and that might cause you some problems.
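A sketch of that approach for one table, assuming a plain datetime column, a key column named Id, and made-up object names:

ALTER TABLE dbo.SomeTable ADD LastWriteAt datetime NULL;
GO
-- Stamp the rows touched by any insert or update.
CREATE TRIGGER trg_SomeTable_TrackWrites
ON dbo.SomeTable
AFTER INSERT, UPDATE
AS
BEGIN
    UPDATE t
    SET    LastWriteAt = GETDATE()
    FROM   dbo.SomeTable AS t
    JOIN   inserted      AS i ON i.Id = t.Id;
END
GO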
Another suggestion for tracking tables that have been written to is to use Red Gate SQL Log Rescue (free). This tool dives into the log of the database and will show you all inserts, updates and deletes. The list is fully searchable, too.
It doesn't meet your criteria for researching reads into the database, but I think the SQL Profiler technique will get you a fair idea as far as that goes.
If you have lastupdate columns you can check for writes, but there is really no easy way to check for reads. You could run Profiler, save the trace to a table, and check in there.
What I usually do is rename the table by prefixing it with an underscore; when people start to scream, I just rename it back.
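For example (hypothetical table name), renaming it out of the way and back again is a one-liner each way:

EXEC sp_rename 'dbo.SuspectTable', '_SuspectTable';    -- hide it and wait for screams
EXEC sp_rename 'dbo._SuspectTable', 'SuspectTable';    -- put it back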
If by "not used" you mean your application has no more references to the tables in question and you are using dynamic SQL, you could do a search for the table names in your app; if they don't appear, blow the tables away.
I've also outputted all sprocs, functions, etc. to a text file and done a search for the table names. If not found, or found in procedures that will need to be deleted too, blow them away.
It looks like using the Profiler is going to work. Once I've let it run for a while, I should have a good list of used tables. Anyone who doesn't use their tables every day can probably wait for them to be restored from backup. Thanks, folks.
Probably too late to help mogrify, but for anybody doing a search: I would search my application code for references to the table, and then search within SQL Server by running this:
-- Note: syscomments stores object definitions in 4000-character chunks,
-- so a very long definition spans multiple rows and a name could straddle a chunk boundary.
select distinct '[' + object_name(id) + ']'
from syscomments
where text like '%MY_TABLE_NAME%'