Persisting data from SQL tables day by day - sql

I manage data-tier applications for a small company and my SW is receiving criticism for the fact that information for part-costing can't be retrieved historically. So, for instance, what they would like is to be able to, at any point in time, retrieve the cost of a part as it was 6 months ago.
They used to do this through spreadsheets. They would copy the part table every day into a .xlsx file, and then anytime they wanted to know "hey, what was the cost of that part Jan 20 of last year?", they could just pull it up in excel.
So, we've begun doing the same thing in SQL, and the plan so far is that we will create a new table each time the part costs are updated, name the table with today's date, and persist it in a database for archived information. Then, we're planning to pull in whichever table we need according to it's time-stamp.
I can't help but think this is going to get very messy. Is this a bad approach for archiving data? Are there any industry standards I can adhere to for solving this problem in as few headaches as possible?

So, we've begun doing the same thing in SQL, and the plan so far is
that we will create a new table each time the part costs are updated,
name the table with today's date, and persist it in a database for
archived information. Then, we're planning to pull in whichever table
we need according to it's time-stamp.
I can't help but think this is going to get very messy. Is this a bad
approach for archiving data? Are there any industry standards I can
adhere to for solving this problem in as few headaches as possible?
You are right ... this solution will be messy.
Simplest thing that you can do is to create a History table say Parts_History that will have all the columns as the main Parts table and add additional timestamp column(s) to track updates. Every time there is a new price for a part ( which I hope is done thru a stored procedure) the existing price gets moved into the new table and the main table gets updated ALL Inside one transaction. If you dont have a single SP that handles the update then you can do that inside a trigger.
I will try and see if there are any good examples out there.

As far as I now there is no standard but approach is rather obvious. You have a table say part(partid int primary key, price decimal). Create an audit table part_audit(auditId int identity(1,1) primary key, partId int, price decimal, dateChange datetime default getdate()) and a trigger on part after update, delete. In the trigger check update(price) and if so insert into part_audit from deleted. To find historical price select nearest dateChange after date of interest.

Related

Track database changes or differentiate records with timestamp?

Keeping track of changes to a database must be a big concern for lots of people, but it seems that the big names have software for that.
My question is for a small SQL database with 10 tables, <10 columns each, using joins to create a "master" junction table: is there a downside to updating a few times per year by adding rows (with a lot of duplicate information) and then taking the MAX id (PK) to generate and post on a website the most recent data in tabular form (excerpted from the "master")? This versus updating the records, in which I'll lose information on the values at a particular moment.
A typical row for teacher contact information would have fName, lName, schoolName, [address & phone info]; for repertoire or audition information: year, instrument, piece, composer, publisher/edition.
Others have asked about tracking db changes, but only one recently, and not with a lot of votes/details:
How to track data changes in a database table
Keeping history of data revisions - best practice?
How to track data changes in a database table
This lightweight solution seems promising, but I don't know if it didn't get votes because it's not helpful, or because folks just weren't interested.
How to keep track of changes to data in a table?
more background if needed:
I'm a music teacher (i.e. amateur programmer) maintaining a Joomla website for our organization. I'm using a Joomla plugin called Sourcerer to create dynamic content (PHP/SQL to the Joomla database) to make it easier to communicate changes (dates, personnel, rules, repertoire, etc.) For years, this was done with static pages (and paper handbooks) that took days to update.
I also, however, want to be able to look back and see the database state at a particular time: who taught where, what audition piece was listed, etc., as we could with paper versions. NOTE: I'm not tracking HTML changes, only that information fed from the database.
Thanks for any help! (I've followed SO for years, but this is my first question.)
The code I'm using now to generate the "master junction table." I would modify this to "INSERT into" for my new rows and query from it via Sourcerer to post the information online.
CREATE TABLE 011people_to_schools_junction
AS (
SELECT *
FROM (
SELECT a.peopleID, a.districtID, a.firstName, a.lastName, a.statusID, c.schoolName
FROM 01People a
INNER JOIN (
SELECT districtID, MAX(peopleID) peopleID
FROM 01People
GROUP BY districtID
) b
ON a.districtID = b.districtID
AND a.peopleID = b.peopleID
INNER JOIN (
SELECT schoolID, MAX(peopleID) peopleID
FROM 01people_to_schools_junction ab
GROUP BY schoolID
) z
ON z.peopleID = a.peopleID
LEFT JOIN 01Schools c
ON c.schoolID = z.schoolID
WHERE z.schoolID IS NOT NULL
OR z.peopleID IS NOT NULL
ORDER BY c.schoolName
) t1
);
#Add a primary key as the first column
ALTER TABLE 011people_to_schools_junction
ADD COLUMN 011people_to_schoolsID INT NOT NULL AUTO_INCREMENT FIRST,
ADD PRIMARY KEY (011people_to_schoolsID);
To answer your questions in order:
Is there a downside?
Of course, and it's performance - related. If you add a million records each year, it will hurt performances; and occupy space on disk.
Where the suggestions in the linked question bad or just not popular?
The question and answers are good; but the right answer depends on your specific use case: are you doing it for legal reasons, how fast you want to be able to access the data, how much data and updates you have, how much you want your history functionality to last without changes... only if it met your use case you would vote.
As a rule of thumb, history should go to a different table, this would provide several advantages:
your current tables don't change, so your code needs no change except for storing the current version also in history;
your application doesn't slow down;
if your history tables grow you can move them easily to a different server;
In order to choose whether to have a single history table or several (one per backed up table) depends on how you plan to retrieve the data and what you want to do with it:
if you mirror each of your tables adding a timestamp and the user id, your code would need little modifications; but you'd end up with twice as many tables, and any structure change would then need to be replicated on the history table as well;
if you build a single history table with the timestamp, the user id, the table name and a json representation of the record, you will have an easier life building it, while for retrieving it you should access the data using an Object per row i.e. using Joomla's dbo getObjectList(), then the objects will be the same format you store in the history table and the changes there will be fairly easy. But querying for changes across specific tables/fields will be much harder.
Keep in mind that having data is useless if you can't retrieve it properly.
Since you mention pushing to the website a few times a year, the overhead of the queries should not be an issue (if you update monthly, waiting 5 minutes may not be a problem).
You should seek the best solution based on the other uses of this data: for it to be useful to anyone, you will have to implement a system to retrieve historical data. If phpmyadmin is enough, well look no further.
I hope this scared you. Either way it's a lot of hard work.
If you just want to be able to look up old data, you may instead store a copy of the markup/output you generate from time to time, and save it to different folders on the webserver. This will take minutes to set up, and be extremely reliable.
Sure, it's more fun to code it. But are you really sure you need it? And you can keep the database dumps just in case one day you change your mind.

Postgresql logging daily stats

Evening all,
I am attempting to create a table that stores a series of stats on web usage for my application on a daily basis (trivial things like no. new users, total visits etc.), I am currently querying these on the fly, however I would now like to start storing them partly for performance (reducing a load of aggregate querys to single lookup) and partly to allow for historic analysis.
I have come up with the follow basic schema for the table (there will be more columns than this, just to give an idea)
create table web_stats(
web_stat_id bigserial primary key,
date_created timestamp not null default now(),
user_count integer not null,
new_user_count integer not null
);
comment on table web_stats is 'Table stores statistics on web usage';
Now, I am happy to create the queries to populate the table going forward (I am using Quartz scheduler to run the queries daily)
However I am not so sure the best way to populate the table retrospectively for past dates, should I use an INSERT statement to create a blank row for every day since the application went live (about 2 years ago), then use an UPDATE to populate the blank rows? Or can this be done in one fell swoop? Can someone provide some SQL for creating the rows?
If there is anything wrong with my design assumptions please let me know!
This is how I ended up doing it
INSERT INTO web_stats (date_created)
SELECT DATE('2011-08-20')+x.id
FROM generate_series(0,521) AS x(id);
Where 2011-08-20 is the date the application went live, and 521 is the number of days from now until then
This creates the empty table so that I can use the date_created field to populate the other fields using UPDATE statements
Maybe not the most efficient method but it works

How to structure databse of API usage history

I have a database of users for a web API, but I also want to store usage history for each user, i.e: page request count, data volumes, etc. What is the best way to implement this, in terms of database structure? My initial thought was to retain the main table, but then create a history table for each user. This seems horribly impractical, however. My gut feeling is that I probably need one separate table for usage history, but I am unclear as to how to structure it.
I am using SQLite.
For an event logging model (which is what you want), I can recommend two options
One table, lets call it activity_log.
`activity_log`{
id INTEGER PRIMARY KEY,
user_id MEDIUM INT NOT NULL,
event_type VARCHAR(10),
event_time TIMESTAMP
}
For each event in your system affecting a user, you insert a record into this role (i believe the column names are self-explanatory). I believe SQLite doesn't provide native TIMESTAMP type so you'll have to handle the storage in your application code. What this design will leave you with a table that has the potential to grow very large, but it will give you fine grained statistics. SQLite doesn't support clustered indexes but there are some options here that will help you out with performance tuning.
The same table as above, only instead of inserting a new row for every event, you're going to perform a conditional insert i.e. update existing rows for users already in and update for new users. This option will keep your table several times smaller than what you have above, but you'll only have access to the most recent use of your api.
If you can afford it, I'd say go with number 1.
In one of my programs, I maintain a table of module usage per user. The structure of the table is
table id
user id
prog id
date/time
history flag (0=current, 1=history)
runs (number of time user has run program on date)
About once a week, I aggregate the data in the table: if user 1 has run program 1 twice on a given date, then initially there will be two entries in the table:
1;1;1;04/10/12 08:56;0;1
2;1;1;04/10/12 09:33;0;1
After aggregation, the table becomes
3;1;1;04/10/12 00:00;1;2
Whilst the aggregation loses the time part, no other data is lost and queries against the table will be quicker.

How to handle archiving for seasonal database values on SQL Server

I am on SQL Server 2008 R2 and I am currently developing a database structure which contains seasonal values for some products.
By seasonal I mean that those values won't be useful after a particular date in terms of customer use. But, those values will be used for statistical results by internal stuff.
On the sales web site, we will add a feature for product search and one of my aim is to make this search as optimized as possible. But, more row inside the database table, less fast this search will become. So, I consider archiving the unused values.
I can handle auto archiving with SQL Server Jobs automatically. No problem there. But I am not sure how I should archive those values.
Best way I can come up with is that I create another table inside the same database with same columns and put them there.
Example :
My main table name is ProductPrices and there a primary key has been
defined for this database. Then, I have created another table named
ProdutcPrices_archive. I created a primary key field for this table
as well and the same columns as ProductPrices table except for
ProdutPrices primary key value. I don't think it is useful to
archive that value (do I think correct?).
For the internal use, I consider putting two table values together
with UNION (Is that the correct way?).
This database is meant to use for long time and it should be designed with best structure. I am not sure if I miss something here for the long run.
Any advice would be appreciated.
I'd consider one of two options initially
Use partitioning to separate the single table into current working set and archive data.
No need to use an archive table
Add validForm, ValidTo columns to implement a type 2 SCD
Then add an indexed view for ValidTo IS NULL to get the current set of data
I wouldn't have 2 separate tables if all data has to be "on-line" in one database.
This leads to a 3rd option: an entirely separate database with all data. Only "current" data stays in live. (as #Mike_Walsh's answer explains)
The indexed view option is easiest and works with standard edition (with NOEXPAND hint)
gbn brings up some good approaches. I think the "right" longer term answer for you is the t3rd option, though.
It sounds like you have two business use cases of your data -
1.) Real time Online Transaction Processing (OLTP). This is the POS transactions, inventory management, quick "how did receipts look today, how is inventory, are we having any operational problems?" kind of questions and keeps the business running day to day. Here you want the data necessary to conduct operations and you want a database optimized for updates/inserts/etc.
2.) Analytical type questions/Reporting. This is looking at month over month numbers, year over year numbers, running averages. These are the questions that you ask as that are strategic and look at a complete picture of your history - You'll want to see how last years Christmas seasonal items did against this years, maybe even compare those numbers with the seasonal items from that same period 5 years ago. Here you want a database that contains a lot more data than your OLTP. You want to throw away as little history as possible and you want a database largely optimized for reporting and answering questions. Probably more denormalized. You want the ability to see things as they were at a certain time, so the Type 2 SCDs mentioned by gbn would be useful here.
It sounds to me like you need to create a reporting database. You can call it a data warehouse, but that term scares people these days. Doesn't need to be scary, if you plan it properly it doesn't have to take you 6 years and 6 million dollars to make ;-)
This is definitely a longer term answer but in a couple years you'll be happy you spent the time creating one. A good book to understand the concept of dimensional modeling and thinking about data warehouses and their terminology is The Data Warehouse Toolkit.

Best practice for auditing data in SQL Server and retrieving point in time data

I've been doing history tables for some time now in databases, but never put too much effort or thought into it. I wonder what is the best practice out there.
My main goal is to record any changes to a record for a particular day. If more than one change happens in a day then then only one history record will exist. I need to record the date the record was changed, also when I retrieve data I need to pull the correct record from history as it was at a particular time. So for example I have a customers table and want to pull out what their address was for a particular date. My Sprocs like get Cust details will take in an optional date and if no date is passed in then it returns the most recent record.
So here's what I was looking for advice on:
Do I keep the history table in the same table and use a logical delete flag to hide the historical ones? I normally don't do this as some tables can change a lot and have lots of records. Do I use a separate table that mirrors the main table? I usually do this. Should I only put change records into the history table and not the current one? What is the most efficient way given a date to pull out the right record at a point in time, get every record for a customer <= date passed in, and then sort by most recent date and take the top?
Thanks for all the help... regards M
Suggestion is to use trigger based auditing and create triggers for all tables you need to audit.
With triggers you can accomplish the requirement for not storing more than one record update per day.
I’d suggest you check out ApexSQL Audit that generates triggers for you and try to reverse engineer what triggers they use, how storage tables look like and such.
This will give you a good start and you can work form there.
Disclaimer: not affiliated with ApexSQL but I do use their tools on a daily basis.
I'm no expert in the field but a good sql consultant once told me that a good aproach is generally to use the same table if all data can be changed. Otherwise have the original table contain only core nonchangable data and the historical table contain only stuff that can be changed.
You should defintely read this article on managing bitemporal data. The nice thing about this approach is it enables an auditable way of correcting historical data.
I beleive this will address your concerns about modidying the history data
I've always used a modified version of the audit table described in this article. While it does require you to pivot data so that it resembles your table's native structure, it is resilient against changes to the schema.
You can create a UDF that returns a table and accepts a table name (varchar) and point in time (datetime) as parameters. The UDF should rebuild the table using the audit (historical values) and give you the effective values at that date & time.