How to keep track of which rows have been imported in SQL?

How to keep track of which rows have been imported in SQL? - sql

Let's say I want to import all the customers (or all the rows in some other specific table) to some external system. Not all at once but every one after they have been created in database. To do that I have to keep record of all the rows that have already been reported because I want to find only the ones that have not been reported yet. Is it generally better to add a column to do that or to create some kind of a batchlog table?
I'm using MS SQL Server if that is relevant
A Simplified example:
select * from Customer where reportedToExternalSystem is null
or
select * from Customer where cus_id not in (select cus_id from integrationBatchLog)
or is there maybe some more ways to do that that might be even better? This is the first time I do something like this so I don't know the best practise yet.

The simple solution is to add a column that marks the row as imported. A status int (0/1) or if you want to keep track of when it was imported an imported date. This solution does have some limitations:
You can only import the row once. Do you need to import the customer again when the record is updated? Are you going to clear the update field when the customer is updated?
It causes the row to be locked when you update the row status. Are you sure the application that inserts the customer record will be happy with your code locking the records?
On some system it causes the entire row to be written to the log system for recovery. Depending on the size of the row this can be a lot of log writing for just one field.
In a highly parallel import system you can have a lot of contention for resources. If one import program is locking the table, think how bad it would be if many import programs are locking the table at the same time.
If the customer data is updated several times between your import polling interval, you will only see the latest data and will skip over the intermediate updates. This is only an issue if you care about the intermedaite updates. For customers you might not care, for order statuses you might care a lot.
You have to modify the table structure. This might not be allowed by the source application due to data/support/political issues.
Besides putting a status column in the table, one technique that works well is to put a trigger on the table and mirror the import data to a second table. You would then 'consume' the data in the second table. This has several advantages:
It keeps the locking issues contained to the second table.
It allows you to process every update to the main table.
You can add an index to the second table that is used to keep track of the update statuses without the issues of changing the main table.
If you delete the rows from the second table (either immediately as they are consumed or after a short audit period) the size of the table/index will be kep to a minimum.
When I use this technique in Sql Server I put the second table in a seperate schema. Since most apps store their tables in dbo, you can end up with dbo.Customers and Import.Customers. This can help you to keep track of which tables you are importing and keeps you from having to come up with new names for your import tables.

Unless you have to complicate implementation, go with the simplest solution possible. One important thing you should consider, is how hard would it be to refactor this simple to more general one, in case if you need it.
In your case I see only one problem in upgrading from column to table. If you would need history of imports. Solution: make reportedToExternalSystem column of DateTime (or Timestamp) type

I would use a separate table indicating, say, import date cross-referenced to the key of the record in the table you're tracking. In other words, a table with 3 columns: auto-increment key, record-id-from-other-table, import-date. Something like that. This also allows the case if a record is ever re-imported later. You'd have track of all the imports by date.

I Prefer having a column for importing status. Maintaining a separate log leads to time consumable results with growing table size. I do have conceptual idea on SQL Servers but seems that it works. Keep posting!

Related

avoiding write conflicts while re-sorting a table

I have a large table that I need to re-sort periodically. I am partly basing this on a suggestion I was given to stay away from using cluster keys since I am inserting data ordered differently (by time) from how I need it clustered (by ID), and that can cause re-clustering to get a little out of control.
Since I am writing to the table on a hourly I am wary of causing problems with these two processes conflicting: If I CTAS to a newly sorted temp table and then swap the table name it seems like I am opening the door to have a write on the source table not make it to the temp table.
I figure I can trigger a flag when I am re-sorting that causes the ETL to pause writing, but that seems a bit hacky and maybe fragile.
I was considering leveraging locking and transactions, but this doesn't seem to be the right use case for this since I don't think I'd be locking the table I am copying from while I write to a new table. Any advice on how to approach this?

I've asked some clarifying questions in the comments regarding the clustering that you are avoiding, but in regards to your sort, have you considered creating a nice 4XL warehouse and leveraging the INSERT OVERWRITE option back into itself? It'd look something like:
INSERT OVERWRITE INTO table SELECT * FROM table ORDER BY id;
Assuming that your table isn't hundreds of TB in size, this will complete rather quickly (inside an hour, I would guess), and any inserts into the table during that period will queue up and wait for it to finish.

There are some reasons to avoid the automatic reclustering, but they're basically all the same reasons why you shouldn't set up a job to re-cluster frequently. You're making the database do all the same work, but without the built in management of it.
If your table is big enough that you are seeing performance issues with the clustering by time, and you know that the ID column is the main way that this table is filtered (in JOINs and WHERE clauses) then this is probably a good candidate for automatic clustering.
So I would recommend at least testing out a cluster key on the ID and then monitoring/comparing performance.
To give a brief answer to the question about resorting without conflicts as written:
I might recommend using a time column to re-sort records older than a given time (probably in a separate table). While it's sorting, you may get some new records. But you will be able to use that time column to marry up those new records with the, now sorted, older records.

You might consider revoking INSERT, UPDATE, DELETE privileges on the original table within the same script or procedure that performs the CTAS creating the newly sorted copy of the table. After a successful swap you can re-enable the privileges for the roles that are used to perform updates.

Is it viable to have a SQL table with only one row and one column?

I'm currently working on my first application that uses a database so I'm very new to this. The database has multiple tables that are what you would expect to normally see.
However, I created one table which only has one row and one column used to keep a count of the total items processed by the program so it's available to access elsewhere. I can't just use
SELECT COUNT(*) FROM table_name
because these items that I am processing I do not want to actually keep in a table.
It seems like a waste to use a table to store one value so I am wondering if there a better way to keep track of this value.

What is your table storing? it's storing some kind of processing audit. So make it a little more useful - add a column storing the last datetime that the data was processed. Add a column for the time it took to process. Add another column which stores the username (or some identifier) of whoever ran the process. Now add a row for every table that is processed (there's only one now but there might be more in future). Try and envisage how your processing is going to grow in future

DB schema for updating downstream sources?

I want a table to be sync-able by a web API.
For example,
GET /projects?sequence_latest=2113&limit=10
[{"state":"updated", "id":12,"sequence":2116},
{"state":"deleted" "id":511,"sequence":2115}
{"state":"created", "id":601,"sequence":2114}]
What is a good schema to achieve this?
I intend this for Postgresql with Django ORM, which uses surrogate keys. Presence of an ORM may kill answers like unions.
I can come up with only half-solutions.
I could have a modified_time column, but we cannot convey deletions.
I could have a table for storing deleted IDs, when returning 10 new/updated rows, I could return all the deleted rows between them. But this works only when the latest change is an insert/update and there are a moderate number of deleted rows.
I could set a deleted flag on the row and null the rest, but its kinda bad schema design to set all columns nullable.
I could have another table that stores ID, modification sequence number and state(new, updated, deleted), but its another table to maintain and setting sequence numbers cause contentions; imagine n concurrent requests querying for latest ID.

If you're using an ORM you want simple(ish) and if you're serving the data via an API you want quick.
To go through your suggested options:
Correct, so this doesn't help you. You could have a deleted flag in your main table though.
This seems quite a random way of doing it and breaks your insistence that there be no UNION queries.
Not sure why you would need to NULL the rest of the column here? What benefit does this bring?
I would strongly advise against having a table that has a modification sequence number. Either this means that you're performing a lot of analytic queries in order to find out the most recent state or you're updating the same rows multiple times and maintaining a table with the same PK as your normal one. At that point you might as well have a deleted flag in your main table.
Essentially the design of your API gives you one easy option; you should have everything in the same table because all data is being returned through the same method. I would follow your point 2 and Wolph's suggestion, have a deleted_on column in your table; making it look like:
create table my_table (
id ... primary key
, <other_columns>
, created_on date
, modified_on date
, deleted_on date
);
I wouldn't even bother updating all the other columns to be NULL. If you want to ensure that you return no data create a view on top of your table that nulls data where the deleted_on column has data in it. Then, your API only accesses the table through the view.
If you are really, really worried about space and the volume of records and will perform regular database maintenance to ensure that both are controlled then maybe go with option 4. Create a second table that has the state of each ID in your main table and actually delete the data from your main table. You then can do a LEFT OUTER JOIN to the main table to get the data. When there is no data that ID has been deleted. Honestly, this is overkill until you know whether you will definitely require it.
You don't mention why you're using an web API for data-transfers; but, if you're going to be transferring a lot of data or using this for internal systems only it might be worth using a lower-level transfer mechanism.

Join or storing directly

I have a table A which contains entries I am regularly processing and storing the result in table B. Now I want to determine for each entry in A its latest processing date in B.
My current implementation is joining both tables and retrieving the latest date. However an alternative, maybe less flexible, approach would be to simply store the date in table A directly.
I can think of pros and cons for both cases (performance, scalability, ....), but didnt have such a case yet and would like to see whether someone here on stackoverflow had a similar situation and has a recommendation for either one for a specific reason.
Below a quick schema design.
Table A
id, some-data, [possibly-here-last-process-date]
Table B
fk-for-A, data, date
Thanks

Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.
I would leave Table A alone and just introduce an index on id and date. If the historical table is big, introduce an auto-increment PK for table B and have a separate table that maps the B-Pkid to A-pkid.
I'm not a fan of UPDATE on a warehouse table, that's why I didn't recommend a CURRENT_IND, but that's an alternative.

This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).
You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.
On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticable performance impact unless you have HUGE amounts of data.
So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you have on the production environment. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, de-normalize the table.
Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on a moderately complex system, denormalized data schemas are at the heart of a lot of stupid bugs. It makes introducing new developers harder, it means additional complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect lots of data.

We had a similar situation in our project tracking system where the latest state of the project is stored in the projects table (Cols: project_id, description etc.,) and the history of the project is stored in the project_history table (Cols: project_id, update_id, description etc.,). Whenever there is a new update to the project, we need find out the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and get the MAX(update_id), but the cost would be high considering the number of the project updates (in a couple of hundreds of thousands) and the frequency of update. So, we decided to store the value in the projects table itself in max_update_id column and keep updating it whenever there is a new update to a given project. HTH.

If I understand correctly, you have a table whose each row is a parameter and another table that logs each parameter value historically in a time series. If that is correct, I currently have the same situation in one of the products I am building. My parameter table hosts a listing of measures (29K recs) and the historical parameter value table has the value for that parameter every 1 hr - so that table currently has 4M rows. At any given point in time there will be a lot more requests FOR THE LATEST VALUE than for the history so I DO HAVE THE LATEST VALUE STORED IN THE PARAMETER TABLE in addition to it being in the last record in the parameter value table. While this may look like duplication of data, from the performance standpoint it makes perfect sense because
To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join and more importantly
I do not have to get the latest value for each parameter from such a huge table
So yes, I would in your case most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writing new data but a hell of a lot faster for reads.

Database history for client usage

I'm trying to figure out what would be the best way to have a history on a database, to track any Insert/Delete/Update that is done. The history data will need to be coded into the front-end since it will be used by the users. Creating "history tables" (a copy of each table used to store history) is not a good way to do this, since the data is spread across multiple tables.
At this point in time, my best idea is to create a few History tables, which the tables would reflect the output I want to show to the users. Whenever a change is made to specific tables, I would update this history table with the data as well.
I'm trying to figure out what the best way to go about would be. Any suggestions will be appreciated.
I am using Oracle + VB.NET

I have used very successfully a model where every table has an audit copy - the same table with a few additional fields (time stamp, user id, operation type), and 3 triggers on the first table for insert/update/delete.
I think this is a very good way of handling this, because tables and triggers can be generated from a model and there is little overhead from a management perspective.
The application can use the tables to show an audit history to the user (read-only).

We've got that requirement in our systems. We added two tables, one header, one detail called AuditRow and AuditField. The AuditRow contains one row per row changed in any other table, and the AuditField contains one row per column changed with old value and new value.
We have a trigger on every table that writes a header row (AuditRow) and the needed detail rows (one per changed colum) on each insert/update/delete. This system does rely on the fact that we have a guid on every table that can uniquely represent the row. Doesn't have to be the "business" or "primary" key, but it's a unique identifier for that row so we can identify it in the audit tables. Works like a champ. Overkill? Perhaps, but we've never had a problem with auditors. :-)
And yes, the Audit tables are by far the largest tables in the system.

If you are lucky enough to be on Oracle 11g, you could also use the Flashback Data Archive

Personally, I would stay away from triggers. They can be a nightmare when it comes to debugging and not necessarily the best if you are looking to scale out.
If you are using an PL/SQL API to do the INSERT/UPDATE/DELETEs you could manage this in a simple shift in design without the need (up front) for history tables.
All you need are 2 extra columns, DATE_FROM and DATE_THRU. When a record is INSERTed, the DATE_THRU is left NULL. If that record is UPDATEd or DELETEd, just "end date" the record by making DATE_THRU the current date/time (SYSDATE). Showing the history is as simple as selecting from the table, the one record where DATE_THRU is NULL will be your current or active record.
Now if you expect a high volume of changes, writing off the old record to a history table would be preferable, but I still wouldn't manage it with triggers, I'd do it with the API.
Hope that helps.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas