How should I reliably mark the most recent row in SQL Server table? - sql

The existing design for this program is that all changes are written to a changelog table with a timestamp. In order to obtain the current state of an item's attribute we JOIN onto the changelog table and take the row having the most recent timestamp.
This is a messy way to keep track of current values, but we cannot readily change this changelog setup at this time.
I intend to slightly modify the behavior by adding an "IsMostRecent" bit to the changelog table. This would allow me to simply pull the row having that bit set, as opposed to the MAX() aggregation or recursive seek.
What strategy would you employ to make sure that bit is always appropriately set? Or is there some alternative you suggest which doesn't affect the current use of the logging table?
Currently I am considering a trigger approach, which turns the bit off all other rows, and then turns it on for the most recent row on an INSERT

I've done this before by having a "MostRecentRecorded" table which simply has the most recently inserted record (Id and entity ID) fired off a trigger.
Having an extra column for this isn't right - and can get you into problems with transactions and reading existing entries.
In the first version of this it was a simple case of
BEGIN TRANSACTION
INSERT INTO simlog (entityid, logmessage)
VALUES (11, 'test');
UPDATE simlogmostrecent
SET lastid = ##IDENTITY
WHERE simlogentityid = 11
COMMIT
Ensuring that the MostRecent table had an entry for each record in SimLog can be done in the query but ISTR we did it during the creation of the entity that the SimLog referred to (the above is my recollection of the first version - I don't have the code to hand).
However the simple version caused problems with multiple writers as could cause a deadlock or transaction failure; so it was moved into a trigger.

Edit: Started this answer before Richard Harrison answered, promise :)
I would suggest another table with the structure similar to below:
VersionID TableName UniqueVal LatestPrimaryKey
1 Orders 209 12548
2 Orders 210 12549
3 Orders 211 12605
4 Orders 212 10694
VersionID -- being the tables key
TableName -- just in case you want to roll out to multiple tables
UniqueVal -- is whatever groups multiple rows into a single item with history (eg Order Number or some other value)
LatestPrimaryKey -- is the identity key of the latest row you want to use.
Then you can simply JOIN to this table to return only the latest rows.
If you already have a trigger inserting rows into the changelog table this could be adapted:
INSERT INTO [MyChangelogTable]
(Primary, RowUpdateTime)
VALUES (#PrimaryKey, GETDATE())
-- Add onto it:
UPDATE [LatestRowTable]
SET [LatestPrimaryKey] = #PrimaryKey
WHERE [TableName] = 'Orders'
AND [UniqueVal] = #OrderNo
Alternatively it could be done as a merge to capture inserts as well.

One thing that comes to mind is to create a view to do all the messy MAX() queries, etc. behind the scenes. Then you should be able to query against the view. This way would not have to change your current setup, just move all the messiness to one place.

Related

Limit Rows in ETL Without Date Column for Cue

We have two large tables (Clients and Contacts) which undergo an ETL process every night, being inserted into a single "People" table in the data warehouse. This table is used in many places and cannot be significantly altered without a lot of work.
The source tables are populated by third party software; we used to assume that we could identify the rows that had been updated since last night by using the "UpdateDate" column in each, but more recently identified some rows that were not touched by the ETL, as the "UpdateDate" column was not behaving as we had thought; the software company do not see this as a bug, so we have to live with this fact.
As a result, we now take all source rows, transformed into a temp staging table and then Merge that into the data warehouse, using the Merge to identify any changed values. We have noticed that this process is taking too long on some days and would like to limit the number of rows that the ETL process looks at, as we believe that the reason for the hold-up is the principally the sheer volume of data that is examined and stored on the temp database. We can see no way to look purely at the source data and identify when each row last changed.
Here is a simplified pseudocode of the ETL stored procedure, although what the procedure actually does is not really relevant to the question (included just in case you disagree with me!)
CREATE #TempTable (ClientOrContact BIT NOT NULL, Id INT NOT NULL, [Some_Other_Columns])
INSERT #TempTable
SELECT 1 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ClientsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
INSERT #TempTable
SELECT 0 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ContactsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
ALTER #TempTable ADD PRIMARY KEY (ClientOrContact, Id);
MERGE Target_PeopleTable AS Tgt
USING (SELECT [SomeColumns] FROM #TempTable JOIN [SomeOtherTables]) AS Src
ON Tgt.ClientOrContact = Src.ClientOrContact AND Tgt.Id = Src.Id
WHEN MATCHED AND NOT EXISTS (SELECT Tgt.* INTERSECT SELECT Src.*)
THEN UPDATE SET ([All_NonKeyTargetColumns] = [All_NonKeySourceColumns])
WHEN NOT MATCHED BY Target THEN INSERT [All_TargetColumns] VALUES [All_SourceColumns]
OUTPUT $Action INTO #Changes;
RETURN COUNT(*) FROM #Changes;
GO
The source tables have about 1.5M rows each, but each day only a relatively small number of rows are inserted or updated (never deleted). There are about 50 columns in each table, of those, about 40 columns can have changed values each night. Most columns are VARCHAR and each table contains an independent incremental primary key column. We can add indexes to the source tables, but not alter them in any other way (They have already been indexed by a predecessor) The source tables and target table are on the same server, but different databases. Edit: The Target Table has a composite primary key on the ClientOrContact and Id columns, matching that shown on the temp table in the script above.
So, my question is this - please could you suggest any general possible strategies that might be useful to limit the number of rows we look at or copy across each night? If we only touched the rows that we needed to each night, we would be touching less than 1% of the data we do at the moment...
Before you try the following suggestion, just one thing to check is that the Target_PeopleTable has an index or primary key on the id column. It probably does but without schema information to verify I am making no assumptions and this might speed up the merge stage.
As you've identified if you could somehow limit the records in TempTable to just the changed rows then this could offer a performance win for the actual MERGE statement (depending on how expensive determining just the changed rows is).
As a general strategy I would consider some kind of checksum to try and identify the changed records only. The T-SQL Checksum function could be used to calculate a check sum across the required columns by specifying the columns as a comma separated list to that function or there are actual column types available for this such as Binary_Checksum.
Since you cannot change the source schema you would have to maintain a list of record ids and associated checksums in your target database so that you can readily compare the source checksums to the target checksums from the last run in order to identify a difference.
You can then only insert into the Temp table where there is a checksum difference between the target and source or the id does not exist in the target db.
This might just be moving the performance problem to the temp insert part but I think it's worth a try.
Have you considered triggers? I avoid them like the plague, but they really are the solution to some problems.
Put an INSERT/UPDATE [/DELETE?] trigger on your two source tables. Program it such that when rows are added or updated, the trigger will log the IDs of these rows in a (you'll have to create this) audit table, where that table would contain the ID, the type of change (update or insert – and delete, if you have to worry about those) and when the change was made. When you run ETL, join this list of “to be merged” items with the source tables. When you’re done, delete the table and it’s reset for the next run. (Use the “added on” datetime column to make sure you don’t delete rows that may have been added while you were running ETL.)
There’s lots of details behind proper use and implementation, but overall this idea should do what you need.

Incremental load for Updates into Warehouse

I am planning for an incremental load into warehouse (especially for updates of source tables in RDBMS).
Capturing the updated rows in staging tables from RDBMS based the updates datetime. But how do I determine which column of a particular row needs to be updated in the target warehouse tables?
Or do I just delete a particular row in the warehouse table (based on the primary key of the row in staging table) and insert the new updated row?
Which is the best way to implement the incremental load between the RDBMS and Warehouse using PL/SQL and SQL coding?
In my opinion, the easiest way to accomplish this is as follows:
Create a stage table identical to your host table. When you do your incremental/net-change load, load all changed records into this table (based on whatever your "last updated" field is)
Delete the records from your actual table based on the primary key. For example, if your primary key is customer, part, the query might look like this:
delete from main_table m
where exists (
select null
from stage_table s
where
m.customer = s.customer and
m.part = s.part
);
Insert the records from the stage to the main table.
You could also do an update existing records / insert new records, but either way that's two steps. The advantage of the method I listed is that it will work even if your tables have partitions and the newly updated data violates one of the original partition rules, whereas an update would not accomplish that. Also, the syntax is much simpler as your update would have to list every single field, whereas the delete from / insert into allows you list only the primary key fields.
Oracle also has a merge clause that will update if it exists or insert if it does not. I honestly don't know how that would be impacted if you had partitions.
One major caveat. If your updates include deletes -- records that need to be deleted from the main table, none of these will resolve that and you will need some other way to handle that. It may not be necessary, depending on your circumstances, but it's something to consider.

Oracle PL/SQL query regarding handling data

I have a live production table which has more than 1 million records. Now i don't need to tamper anything on this table and would like to create another table which fetches all records from this live production table. I would schedule a job which can take entries from my main table and inserts them to my new table. But i don't want all the records daily; i just need the records added on a daily basis in the production table to get added in my new table.
Please suggest a faster and efficient approach.
You could do this with an INSERT/UPDATE/DELETE trigger to send the INSERTED/UPDATED/DELETED row to the new table, however this feels like reinventing the wheel on the most basic level.
You could just use asynchronous replication rather than hand-rolling it all yourself, this is probably safer, more sustainable and scalable. You could add as many tables as you like to the replicated source.
Copying one million records from an existing table to a new table should not take very long -- and might even be faster than figuring out what records to copy. You could do something like:
truncate table copytable;
insert into copytable
select *
from productiontable;
Note that you should explicitly list the columns when doing the insert.
You can also readily add new records -- assuming you have some form of id on the production table, such as an id assigned by a sequence. Then you can do:
insert into copytable
select *
from productiontable p
where p.id > (select max(id) from copytable);

How to get last inserted records(row) in all tables in Firebird database?

I have a problem. I need to get last inserted rows in all tables in Firebird db. And one more, these rows must contain specfied column name. I read some articles about rdb$ but have a few experience with that one.
There is no reliable way to get "last row inserted" unless the table has a timestamp field which stores that information (insertion timestamp).
If the table uses integer PK generated by sequense (generator in Firebird lingo) then you could query for the higest PK value but this isn't reliable either.
There is no concept of 'last row inserted'. Visibility and availability to other transactions depends on the time of commit, transaction isolation specified etc. Even use of a generator or timestamp as suggested by ain does not really help because of this visibility issue.
Maybe you are better of specifying the actual problem you are trying to solve.
SELECT GEN_ID(ID_HEDER,0)+1 FROM ANY_TABLE INTO :ID;
INSERT INTO INVOICE_HEADER (No,Date_of,Etc) VALUES ('122','2013-10-20','Any text')
/* ID record of INVOICE_HEADER table gets the ID_number from the generator above. So
now we have to check if the ID =GEN_ID(ID_HEADER,0) */
IF (ID=GEN_ID(ID_HEADER,0)) THEN
BEGIN
INSERT INTO INVOICE_FOOTER (RELACION_ID, TEXT, Etc) Values (ID, 'Text', Etc);
END
ELSE
REVERT TRANSACTION
That is all

plsql- Table enrichment from another's table data

Assume table 'Source' that is filled with data every hour with an in-place procedure. I want to run a procedure that will fill my new table 'NEW' with ONLY the new rows of the target table SOURCE each time the pros is executed,keeping in mind that the new table MUST always keep all the time the already inserted data (i mean that a solution each time the process to insert into...NEW then insert into to a temporary table distinct values, delete NEW, insert into from temp etc is not helpful).
If i got right what you want, and would be, insert only new records from SOURCE to NEW, there are tons of ways of doing this. Here are some of them:
1) Create a trigger on SOURCE that automatically inserts into NEW so you don't have to worry about a thing.
2) Use statement like this to select only new rows from SOURCE. This one assumes that primary keys that are the same on both tables.
INSERT INTO NEW
SELECT * FROM SOURCE s1 WHERE NOT EXISTS (SELECT 1 FROM NEW n1 WHERE n1.key=s1.key)
3) Use materialized view & mv log feature. This one is a bit complex and i suggest looking into oracle documentation or some other resources if you are not familiar with it.
4) Change your procedure that inserts into SOURCE to also insert into NEW.
Of course, you have to figure out what to do if there are possible updates on SOURCE. I can explain these methods in detail if you want.
Voted up dsmoljanovic's solution.
One reason why a procedural solution (with timestamps or sequences) isn't good for this is uncommitted data.
Take an example:
At 02:55:00: 75 rows are added to SOURCE and are timestamped '02:55:00'
At 02:55:30: The 75 row insert is committed
At 02:59:55: 100 rows are added to SOURCE and are timestamped '02:59:55'
At 03:00:00: Your process kicks off and selects from source
At 03:00:20: The 100 row insert is committed
The process won't see those 100 rows (since they are not committed) and the next time it runs, it may miss them if it looks for rows timestamped after 03:00:00.