Keeping track of changes to a database must be a big concern for lots of people, but it seems that the big names have software for that.
My question is for a small SQL database with 10 tables, <10 columns each, using joins to create a "master" junction table: is there a downside to updating a few times per year by adding rows (with a lot of duplicate information) and then taking the MAX id (PK) to generate and post on a website the most recent data in tabular form (excerpted from the "master")? This versus updating the records, in which I'll lose information on the values at a particular moment.
A typical row for teacher contact information would have fName, lName, schoolName, [address & phone info]; for repertoire or audition information: year, instrument, piece, composer, publisher/edition.
Others have asked about tracking db changes, but only one recently, and not with a lot of votes/details:
How to track data changes in a database table
Keeping history of data revisions - best practice?
This lightweight solution seems promising, but I don't know if it didn't get votes because it's not helpful, or because folks just weren't interested.
How to keep track of changes to data in a table?
more background if needed:
I'm a music teacher (i.e. amateur programmer) maintaining a Joomla website for our organization. I'm using a Joomla plugin called Sourcerer to create dynamic content (PHP/SQL to the Joomla database) to make it easier to communicate changes (dates, personnel, rules, repertoire, etc.) For years, this was done with static pages (and paper handbooks) that took days to update.
I also, however, want to be able to look back and see the database state at a particular time: who taught where, what audition piece was listed, etc., as we could with paper versions. NOTE: I'm not tracking HTML changes, only that information fed from the database.
Thanks for any help! (I've followed SO for years, but this is my first question.)
The code I'm using now to generate the "master junction table." I would modify this to "INSERT into" for my new rows and query from it via Sourcerer to post the information online.
CREATE TABLE 011people_to_schools_junction
AS (
    SELECT *
    FROM (
        SELECT a.peopleID, a.districtID, a.firstName, a.lastName, a.statusID, c.schoolName
        FROM 01People a
        INNER JOIN (
            SELECT districtID, MAX(peopleID) peopleID
            FROM 01People
            GROUP BY districtID
        ) b
            ON a.districtID = b.districtID
            AND a.peopleID = b.peopleID
        INNER JOIN (
            SELECT schoolID, MAX(peopleID) peopleID
            FROM 01people_to_schools_junction ab
            GROUP BY schoolID
        ) z
            ON z.peopleID = a.peopleID
        LEFT JOIN 01Schools c
            ON c.schoolID = z.schoolID
        WHERE z.schoolID IS NOT NULL
            OR z.peopleID IS NOT NULL
        ORDER BY c.schoolName
    ) t1
);
# Add a primary key as the first column
ALTER TABLE 011people_to_schools_junction
    ADD COLUMN 011people_to_schoolsID INT NOT NULL AUTO_INCREMENT FIRST,
    ADD PRIMARY KEY (011people_to_schoolsID);
To answer your questions in order:
Is there a downside?
Yes, and it's performance-related. If you add a million records each year, it will hurt performance and take up disk space.
Were the suggestions in the linked question bad or just not popular?
The question and answers are good, but the right answer depends on your specific use case: are you doing it for legal reasons, how fast you want to be able to access the data, how much data and how many updates you have, how long you want your history functionality to last without changes... People only vote for an answer if it fits their own use case.
As a rule of thumb, history should go to a different table. This provides several advantages:
your current tables don't change, so your code needs no change except for storing the current version in history as well;
your application doesn't slow down;
if your history tables grow you can move them easily to a different server.
Whether to have a single history table or several (one per backed-up table) depends on how you plan to retrieve the data and what you want to do with it:
if you mirror each of your tables adding a timestamp and the user id, your code would need little modifications; but you'd end up with twice as many tables, and any structure change would then need to be replicated on the history table as well;
if you build a single history table with the timestamp, the user id, the table name and a json representation of the record, you will have an easier life building it. For retrieval you should access the data using an Object per row, i.e. using Joomla's dbo getObjectList(); the objects will then be in the same format you store in the history table, and the changes there will be fairly easy. But querying for changes across specific tables/fields will be much harder.
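Here is a minimal sketch of the single-history-table option. It assumes SQLite through Python's sqlite3 module as a stand-in for the Joomla/MySQL setup, and the `people` and `history` tables, their columns, and the helper names are all made up for illustration. It also shows why retrieval is the harder part: finding one person's old record means decoding JSON rows one by one.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE people (
    peopleID INTEGER PRIMARY KEY,
    firstName TEXT, lastName TEXT, schoolName TEXT
);
CREATE TABLE history (
    historyID INTEGER PRIMARY KEY,
    changedAt TEXT NOT NULL,      -- ISO date of the change
    userID INTEGER NOT NULL,      -- who made the change
    tableName TEXT NOT NULL,
    recordJson TEXT NOT NULL      -- JSON copy of the row after the change
);
""")

def save_person(person, user_id, changed_at):
    """Write the current row and append the same version to history."""
    conn.execute(
        "INSERT OR REPLACE INTO people "
        "VALUES (:peopleID, :firstName, :lastName, :schoolName)",
        person,
    )
    conn.execute(
        "INSERT INTO history (changedAt, userID, tableName, recordJson) "
        "VALUES (?, ?, ?, ?)",
        (changed_at, user_id, "people", json.dumps(person)),
    )

def person_as_of(people_id, day):
    """Return the newest stored version of a person on or before `day`."""
    rows = conn.execute(
        "SELECT recordJson FROM history "
        "WHERE tableName = 'people' AND changedAt <= ? "
        "ORDER BY changedAt DESC, historyID DESC",
        (day,),
    )
    for row in rows:
        record = json.loads(row["recordJson"])
        if record["peopleID"] == people_id:   # filtering happens in code, not SQL
            return record
    return None

save_person({"peopleID": 1, "firstName": "Ann", "lastName": "Lee",
             "schoolName": "North HS"}, 7, "2014-09-01")
save_person({"peopleID": 1, "firstName": "Ann", "lastName": "Lee",
             "schoolName": "South HS"}, 7, "2015-09-01")

current = conn.execute("SELECT schoolName FROM people WHERE peopleID = 1").fetchone()
print(current["schoolName"])                          # current table
print(person_as_of(1, "2015-01-15")["schoolName"])    # state in January 2015
```

The current table stays small and unchanged in shape, while the history table only ever receives inserts.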
Keep in mind that having data is useless if you can't retrieve it properly.
Since you mention pushing to the website a few times a year, the overhead of the queries should not be an issue (if you update monthly, waiting 5 minutes may not be a problem).
You should seek the best solution based on the other uses of this data: for it to be useful to anyone, you will have to implement a system to retrieve historical data. If phpmyadmin is enough, well look no further.
I hope this scared you. Either way it's a lot of hard work.
If you just want to be able to look up old data, you may instead store a copy of the markup/output you generate from time to time, and save it to different folders on the webserver. This will take minutes to set up, and be extremely reliable.
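As a rough illustration of the "save a copy of the output" idea, the sketch below writes whatever markup you generated into a folder named after the date. The function name, the folder layout and the sample markup are assumptions of mine, and a temporary directory stands in for a real folder on the webserver.

```python
from datetime import date
from pathlib import Path
import tempfile

def archive_snapshot(html, archive_root, day):
    """Write the generated markup to <archive_root>/<YYYY-MM-DD>/index.html."""
    folder = Path(archive_root) / day.isoformat()
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / "index.html"
    target.write_text(html, encoding="utf-8")
    return target

# A temporary directory is used here only so the example runs anywhere.
archive_root = tempfile.mkdtemp()
saved = archive_snapshot(
    "<table><tr><td>Ann Lee</td><td>North HS</td></tr></table>",
    archive_root,
    date(2015, 9, 1),
)
print(saved.parent.name)
```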
Sure, it's more fun to code it. But are you really sure you need it? And you can keep the database dumps just in case one day you change your mind.
I created a website which contains users, comments, videos, photos, messages and more. All of the data is in one table which contains 100 columns. I thought one table was better than several because the user only needs to connect to one table, but I have heard that some programmers don't like this method. Can someone tell me which is better: one very large table or a lot of little tables?
And why would I need to use a lot of tables? Why is it useful? Which one is faster for the user?
What are the advantages and disadvantages of a large table versus a lot of little tables?
100 columns in a single table is bad design in most situations.
Read this page: http://www.tutorialspoint.com/sql/sql-rdbms-concepts.htm
Break your data up into related chunks and give each of them their own table.
You said you have this information (users,comments,videos,photos,messages) so you should have something like these tables.
Users which contains (User ID, Name, Email etc)
Comments which contains (Comment ID, User ID, Comment Text etc)
Videos which contains (Video ID, User ID, Comment ID, Video Data etc)
Photos which contains (Photo ID, User ID, Comment ID, Photo Data etc)
Messages which contains (Message ID, User ID, Message Text etc)
Then when you're writing your SQL you can write proper SQL to query based on exactly what information you need.
SELECT USR.UserID, MSG.MessageID, MSG.MessageText
FROM Users AS USR
JOIN Messages AS MSG
    ON USR.UserID = MSG.UserID
WHERE USR.UserID = 1234567
With your current query you're having to deal with rows containing data that you don't need or care about.
EDIT
Just to give some further information to the OP as to why this is better design.
Let's take the "Users" as a starting example.
In a proper database design you would have a table called Users which has all the columns that are required for a user to exist: username, email, id number etc.
Now we want to create a new user, so we want to insert username, email and id number. But wait, I still have to populate 97 other columns with information totally unrelated to our process of creating a new user! Even if you store NULL in all columns it's going to use some space in the database.
Also imagine you have hundreds of users all trying to select, update and delete from a single database table. There is a high chance of the table being locked. But if you had one user updating the Users table and another user inserting into the Messages table, then the work is spread out.
And as other users have said, there is pure performance. The database needs to get all the information and filter out what you want. If you have a lot of columns this is unnecessary work.
Performance Example.
Lets say your database has been running for years. You have 5000 users, 2,000,000 comments, 300,000 pictures, 1,000,000 messages. Your single table now contains 3,305,000 records.
Now you want to find a User with the ID of 12345 who has more than 20 pictures. You need to search through all 3,305,000 records to get this result.
If you had a split table design then you would only need to search through 305,000 records.
Obvious performance gain!!
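To make the split-table lookup concrete, here is a small sketch, assuming SQLite via Python's sqlite3 and made-up `Users` and `Photos` tables with a handful of rows. It is only meant to show the shape of the query, not to reproduce any timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users  (UserID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Photos (PhotoID INTEGER PRIMARY KEY,
                     UserID INTEGER NOT NULL REFERENCES Users(UserID));
-- An index on the foreign key lets the lookup skip unrelated photos.
CREATE INDEX idx_photos_user ON Photos(UserID);
""")
conn.execute("INSERT INTO Users VALUES (12345, 'alice')")
conn.execute("INSERT INTO Users VALUES (777, 'bob')")
conn.executemany("INSERT INTO Photos (UserID) VALUES (?)",
                 [(12345,)] * 25 + [(777,)] * 3)

def users_with_many_photos(user_id, minimum):
    """Return the user only if they own more than `minimum` photos."""
    return conn.execute("""
        SELECT u.UserID, u.Name, COUNT(p.PhotoID) AS photo_count
        FROM Users AS u
        JOIN Photos AS p ON p.UserID = u.UserID
        WHERE u.UserID = ?
        GROUP BY u.UserID, u.Name
        HAVING COUNT(p.PhotoID) > ?
    """, (user_id, minimum)).fetchall()

print(users_with_many_photos(12345, 20))
print(users_with_many_photos(777, 20))
```

Only the Users and Photos tables are touched; comments and messages never enter the query.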
EDIT 2
Performance TEST.
I created a dummy table containing 2 million rows and 1 column. I ran the below query which took 120ms on average over 10 executions.
SELECT MyDate1 from dbo.DummyTable where MyDate1 BETWEEN '2015-02-15 16:59:00.000' and '2015-02-15 16:59:59.000'
I then truncated the table and created 6 more columns and populated them with 2 million rows of test data and ran the same query. It took 210ms on average over 10 executions.
So adding more columns decreases performance even though you're not viewing the extra data.
Wide tables can cause performance problems if they are wider than the database can store in one place.
You need to read about normalization as this type of structure is very bad and is not what the database is optimized for. In your case you will have many repeated records that you will have to use distinct (which is a performance killer) to get rid of when you want to only show the user name or the comments.
Additionally, you may have some fields that are repeats like comment1, comment2, etc. Those are very hard to query over time and if you need another one, then you have to change the table structure and potentially change the queries. That is a bad way to do business.
Further when you only have one table, it becomes a hot spot in your database and you will have more locking and blocking.
Now also suppose that one of those pieces of information is updated: now you have to make sure to update all the records, not just one. This can also be a performance killer, and if you don't do it, then you will have data integrity problems which will make the data in your database essentially useless. Denormalizing is almost always a bad idea, and always a bad idea when done by someone who is not an expert in database design. There are many ramifications of denormalization that you probably haven't thought of.
Overall your strategy is a sure loser over time and needs to be fixed ASAP, because the more records you have in a database, the harder it is to refactor.
For your situation it is better to have multiple tables. The reason for this is because if you put all your data into one table then you will have update anomalies. For example, if a user decides to update his username, you will have to update every single row in your big table that has that user's username. But if you split it into multiple tables then you will only need to update one row in your User table and all the rows in your other tables will reference that updated row.
As far as speed, having one table will be faster than multiple tables with SELECT statements because joining tables is slow. INSERT statements will be about the same speed in either situation because you will be inserting one row. However, updating someone's username with an UPDATE statement will be very slow with one table if they have a lot of data about them because it has to go through each row and update every one of them as opposed to only having to update one row in the User table.
So, you should create tables for everything you mentioned in your first sentence (users, comments, videos, photos, and messages) and connect them using Ids like this:
User
-Id
-Username
Video
-Id
-UploaderId references User.Id
-VideoUrl
Photo
-Id
-UploaderId references User.Id
-PhotoUrl
VideoComment
-CommenterId references User.Id
-VideoId references Video.Id
-CommentText
PhotoComment
-CommenterId references User.Id
-PhotoId references Photo.Id
-CommentText
Message
-SenderId references User.Id
-ReceiverId references User.Id
-MessageText
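A cut-down sketch of that layout, assuming SQLite through Python's sqlite3 (the sample rows and the URL are invented). It shows the update-anomaly point from above: renaming a user touches one row, and every comment picks up the new name through the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces references only when asked
conn.executescript("""
CREATE TABLE User  (Id INTEGER PRIMARY KEY, Username TEXT NOT NULL UNIQUE);
CREATE TABLE Video (Id INTEGER PRIMARY KEY,
                    UploaderId INTEGER NOT NULL REFERENCES User(Id),
                    VideoUrl TEXT NOT NULL);
CREATE TABLE VideoComment (
    CommenterId INTEGER NOT NULL REFERENCES User(Id),
    VideoId INTEGER NOT NULL REFERENCES Video(Id),
    CommentText TEXT NOT NULL);
INSERT INTO User VALUES (1, 'old_name');
INSERT INTO Video VALUES (10, 1, 'http://example.com/v/10');
INSERT INTO VideoComment VALUES (1, 10, 'first'), (1, 10, 'second');
""")

# The rename is a single-row update, however many comments the user has written.
changed = conn.execute(
    "UPDATE User SET Username = 'new_name' WHERE Id = 1"
).rowcount

comments = conn.execute("""
    SELECT u.Username, c.CommentText
    FROM VideoComment AS c
    JOIN User AS u ON u.Id = c.CommenterId
    ORDER BY c.rowid
""").fetchall()
print(changed, comments)
```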
I've got a question about automatically deleting particular records from one table of an Oracle database using SQL.
I am working on a small academic project, a database for a private clinic, and I have to design the Oracle database and a client application in Java.
One of my ideas is to have a table "Visits" which stores all patient visits which took place in the past, for history purposes. This table will grow pretty fast, so searching it will be slow.
So the idea is to make a smaller table called "currentVisits" which holds only appointments for future visits, because it will be much faster to search through ~1000 records than a few million after a few years.
My question is how to implement automatic deletion of records in SQL from the temporary table "currentVisits" after the visits have taken place.
Both tables will store fields like dateOfVisit, patientName, doctorID etc.
Is there any way to make this work in a simple way? For example using triggers?
I am quite new to this topic, so thanks for every answer.
Don't worry about the data size. Millions of records is not particularly large for a database on modern computing hardware. You will need an appropriate data structure, however.
In this case, you will want an index on the column that indicates current records. In all likelihood, the current records will be appended onto the end of the table, so they will tend to be congregating on a handful of data pages. This is a good thing.
If you have a heavy deletion load on the table, or you are using a clustered index, then the pages with the current records might be spread throughout the database. In that case, you want to include the "current" column in the clustered index.
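As a rough sketch of the single-table approach: keep every visit in `Visits`, index the date column, and ask for upcoming visits with a range condition instead of maintaining a second table. This assumes SQLite via Python's sqlite3 rather than Oracle, with dates stored as ISO strings; the `visitID` column and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Visits (
    visitID INTEGER PRIMARY KEY,
    dateOfVisit TEXT NOT NULL,     -- ISO date, so string order is date order
    patientName TEXT NOT NULL,
    doctorID INTEGER NOT NULL
);
-- The index is what keeps "future visits only" cheap as history grows.
CREATE INDEX idx_visits_date ON Visits(dateOfVisit);
""")
conn.executemany(
    "INSERT INTO Visits (dateOfVisit, patientName, doctorID) VALUES (?, ?, ?)",
    [("2014-03-01", "Kowalski", 1),
     ("2015-06-10", "Nowak", 2),
     ("2015-07-02", "Kowalski", 1)],
)

def upcoming_visits(today):
    """Visits on or after `today`; nothing needs deleting once a visit has passed."""
    return conn.execute(
        "SELECT dateOfVisit, patientName FROM Visits "
        "WHERE dateOfVisit >= ? ORDER BY dateOfVisit",
        (today,),
    ).fetchall()

print(upcoming_visits("2015-06-01"))
```

Past visits simply stop matching the condition, so no trigger or cleanup job is required.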
I want a table to be sync-able by a web API.
For example,
GET /projects?sequence_latest=2113&limit=10
[{"state":"updated", "id":12, "sequence":2116},
{"state":"deleted", "id":511, "sequence":2115},
{"state":"created", "id":601, "sequence":2114}]
What is a good schema to achieve this?
I intend this for Postgresql with Django ORM, which uses surrogate keys. Presence of an ORM may kill answers like unions.
I can come up with only half-solutions.
1. I could have a modified_time column, but then we cannot convey deletions.
2. I could have a table for storing deleted IDs; when returning 10 new/updated rows, I could return all the deleted rows between them. But this works only when the latest change is an insert/update and there are a moderate number of deleted rows.
3. I could set a deleted flag on the row and null the rest, but it's kind of bad schema design to make all columns nullable.
4. I could have another table that stores ID, modification sequence number and state (new, updated, deleted), but it's another table to maintain, and setting sequence numbers causes contention; imagine n concurrent requests querying for the latest ID.
If you're using an ORM you want simple(ish) and if you're serving the data via an API you want quick.
To go through your suggested options:
1. Correct, so this doesn't help you. You could have a deleted flag in your main table though.
2. This seems quite a random way of doing it and breaks your insistence that there be no UNION queries.
3. Not sure why you would need to NULL the rest of the column here? What benefit does this bring?
4. I would strongly advise against having a table that has a modification sequence number. Either this means that you're performing a lot of analytic queries in order to find out the most recent state or you're updating the same rows multiple times and maintaining a table with the same PK as your normal one. At that point you might as well have a deleted flag in your main table.
Essentially the design of your API gives you one easy option; you should have everything in the same table because all data is being returned through the same method. I would follow your point 2 and Wolph's suggestion, have a deleted_on column in your table; making it look like:
create table my_table (
id ... primary key
, <other_columns>
, created_on date
, modified_on date
, deleted_on date
);
I wouldn't even bother updating all the other columns to be NULL. If you want to ensure that you return no data create a view on top of your table that nulls data where the deleted_on column has data in it. Then, your API only accesses the table through the view.
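Here is one way the deleted_on column plus view could look, as a sketch only. It assumes SQLite through Python's sqlite3 instead of PostgreSQL with Django, and the `project` table, the `project_sync` view, the derived `state` and `changed_on` columns, and the sample rows are my own illustration rather than anything from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (
    id INTEGER PRIMARY KEY,
    name TEXT,
    created_on TEXT NOT NULL,
    modified_on TEXT NOT NULL,
    deleted_on TEXT                 -- NULL while the row is live
);
-- The API reads only from this view, so deleted rows never expose their data.
CREATE VIEW project_sync AS
SELECT id,
       CASE WHEN deleted_on IS NULL THEN name END AS name,
       CASE WHEN deleted_on IS NOT NULL THEN 'deleted'
            WHEN modified_on > created_on THEN 'updated'
            ELSE 'created' END AS state,
       COALESCE(deleted_on, modified_on) AS changed_on
FROM project;
INSERT INTO project VALUES (12,  'alpha', '2015-01-01', '2015-03-01', NULL);
INSERT INTO project VALUES (511, 'beta',  '2015-01-05', '2015-01-05', '2015-02-20');
INSERT INTO project VALUES (601, 'gamma', '2015-02-10', '2015-02-10', NULL);
""")

def changes_since(since, limit=10):
    """Rows created, updated or deleted after `since`, oldest change first."""
    return conn.execute(
        "SELECT state, id, name, changed_on FROM project_sync "
        "WHERE changed_on > ? ORDER BY changed_on, id LIMIT ?",
        (since, limit),
    ).fetchall()

for row in changes_since("2015-02-01"):
    print(row)
```

A client would pass back the last `changed_on` it saw, which plays the role of `sequence_latest` in the question without a shared counter to contend on.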
If you are really, really worried about space and the volume of records and will perform regular database maintenance to ensure that both are controlled then maybe go with option 4. Create a second table that has the state of each ID in your main table and actually delete the data from your main table. You then can do a LEFT OUTER JOIN to the main table to get the data. When there is no data that ID has been deleted. Honestly, this is overkill until you know whether you will definitely require it.
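For comparison, a sketch of that second-table variant under the same assumptions (SQLite via Python's sqlite3, invented `project` and `project_state` tables and rows): the data row is really deleted, and the missing side of the LEFT OUTER JOIN is what tells you so.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE project_state (
    id INTEGER PRIMARY KEY,        -- same id as project, kept after deletion
    state TEXT NOT NULL,
    changed_on TEXT NOT NULL
);
INSERT INTO project VALUES (12, 'alpha'), (601, 'gamma');
INSERT INTO project_state VALUES
    (12,  'updated', '2015-03-01'),
    (511, 'deleted', '2015-02-20'),   -- no matching project row any more
    (601, 'created', '2015-02-10');
""")

rows = conn.execute("""
    SELECT s.id, s.state, p.name
    FROM project_state AS s
    LEFT OUTER JOIN project AS p ON p.id = s.id
    ORDER BY s.id
""").fetchall()
for row in rows:
    print(row)
```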
You don't mention why you're using a web API for data transfers; but if you're going to be transferring a lot of data, or using this for internal systems only, it might be worth using a lower-level transfer mechanism.
Problem to be solved:
I'm new to databases and I'm trying to find out the best way to store changes in a table that is a daily snapshot of some statuses: e.g. a "hotel_room_rentals" table (with 20 columns, any of which can change).
I'd like to be able to generate that table for a selected day (the data changes on production, so I have to store it somewhere else), or do some other transformations on it (e.g. average number of days rented in a period).
My theoretical example - detailed:
Let's say that I'm creating a DB for a hotel.
In the production system I have a table that shows info for all 10 000 rooms in the hotel.
This is a daily snapshot - let's assume that the table is updated once per day.
Some attributes of a room change often: e.g. is_rented; customer_number, rate_usd.
Some attributes don't change too often: e.g. disabled_room, room_color, type_of_furniture.
Room_number obviously does not change (primary key)
Now I want to find the best way to track changes in this table; the best way to create statistics on base of this table (e.g. average number of days rented in a period) and to be able to generate the table for selected date (e.g. 2013-01-01)
MY IDEA:
Since I have no clue about databases, my idea is to copy the whole table every day, with 1 more column, called "DB_dump_date" (with a date). This is a pretty straightforward approach, which will probably require a lot of space; since my 10k rooms table, will have to be copied 365 times in a year.
OTHER SOLUTIONS:
On some other website, I was recommended to create two tables:
"Reservation" table with these columns: Startdate Enddate Room Rate Occupant_name
Then to transform this table into a FactReservations table: Date Room Is_occupied Rate Occupant_name
I do not understand how this helps me... in fact I assume I would have to make 20 intermediary tables and then 20 Fact tables (since I have 20 columns in my database).
QUESTIONS:
What are the recommended ways to deal with such problems?
Is there any DB schema that is prepared to deal with it, without the user making magic ETLs? (e.g. a DB that can optimize the problem by itself)
What are the alternatives?
How would you, smart people, do this? (preferably in MS Access... or some freeware technology)
edit:
One more thing: everything can change in the table, not only room reservations, and I want to be able to track the changes.
Stop - slow down - and take a breath.
Do not - repeat, do not - make copies of tables each day. This approach is way off base.
Your problem is a normalization problem. As you indicate, you have other suggestions on how to normalize; this is the direction you want to go.
Your goal will be to find a structure that accommodates the SQL statements that can answer your questions (and hopefully many more that you haven't thought up yet). This will be one static model where the tables do not change or get copied; the only thing that changes is the data inside the tables. (Ideally, to me, there will also be few to no updates, only inserts.)
You will certainly need a ROOM table and a CUSTOMER table, and then a relation between them, possibly RESERVATION.
These can then fill up, and you can get all the answers to the questions you posed without any copying or materialization or anything... just SQL.
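To show that "just SQL" can answer both questions, here is a sketch assuming SQLite via Python's sqlite3 (not MS Access). The column names, the convention that end_date is the checkout day, and the sample reservations are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE room (room_number INTEGER PRIMARY KEY, room_color TEXT);
CREATE TABLE customer (customer_number INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE reservation (
    reservation_id INTEGER PRIMARY KEY,
    room_number INTEGER NOT NULL REFERENCES room(room_number),
    customer_number INTEGER NOT NULL REFERENCES customer(customer_number),
    start_date TEXT NOT NULL,      -- first night, ISO date
    end_date TEXT NOT NULL,        -- checkout day, not a rented night
    rate_usd REAL NOT NULL
);
INSERT INTO room VALUES (101, 'blue'), (102, 'green');
INSERT INTO customer VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO reservation VALUES
    (1, 101, 1, '2012-12-30', '2013-01-03', 90.0),
    (2, 102, 2, '2013-01-05', '2013-01-07', 80.0);
""")

def snapshot(day):
    """One row per room showing who rented it on `day`, if anyone."""
    return conn.execute("""
        SELECT r.room_number, v.customer_number, v.rate_usd
        FROM room AS r
        LEFT JOIN reservation AS v
               ON v.room_number = r.room_number
              AND v.start_date <= ? AND ? < v.end_date
        ORDER BY r.room_number
    """, (day, day)).fetchall()

def avg_nights_per_room(period_start, period_end):
    """Average rented nights per room within [period_start, period_end)."""
    return conn.execute("""
        SELECT AVG(nights) FROM (
            SELECT r.room_number,
                   COALESCE(SUM(julianday(MIN(v.end_date, ?))
                              - julianday(MAX(v.start_date, ?))), 0) AS nights
            FROM room AS r
            LEFT JOIN reservation AS v
                   ON v.room_number = r.room_number
                  AND v.start_date < ? AND v.end_date > ?
            GROUP BY r.room_number
        )
    """, (period_end, period_start, period_end, period_start)).fetchone()[0]

print(snapshot("2013-01-01"))
print(avg_nights_per_room("2013-01-01", "2013-02-01"))
```

The daily "table for a selected date" is computed from the reservations on demand, so nothing has to be copied each day.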
You need to focus on the requirements and start there. So far the requirements I see are:
-Generate that table for a selected day
-average number of days rented in a period
If we consider two extremes of design, at the more complex end would be a datamart with SCD tables, tracking changes to rooms, and at the simple end would be some kind of log table, along the lines of what you have already mentioned.
Reading between the lines, I don't really see any requirement for knowing the attributes of a room on a given day, but I do see a requirement for analysis of historical transactions.
So my suggestion is have a good hard think about your requirements before you start designing the database.
There is no magic design to cover this automatically. Dimensional design is a standard way of modelling business data to allow for easy analysis, but it might be over the top for your requirement.
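If it turns out you do need room attributes as of a given day, a slowly changing dimension (often called Type 2) is one common shape for it. The sketch below is only an illustration, assuming SQLite via Python's sqlite3; the `room_version` table, its validity columns and the helper functions are hypothetical names of mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE room_version (
    room_number INTEGER NOT NULL,
    valid_from TEXT NOT NULL,          -- ISO date the version took effect
    valid_to TEXT,                     -- NULL marks the current version
    room_color TEXT,
    type_of_furniture TEXT,
    PRIMARY KEY (room_number, valid_from)
);
""")

def change_room(room_number, day, room_color, type_of_furniture):
    """Close the current version (if any) and open a new one from `day`."""
    conn.execute(
        "UPDATE room_version SET valid_to = ? "
        "WHERE room_number = ? AND valid_to IS NULL",
        (day, room_number),
    )
    conn.execute(
        "INSERT INTO room_version VALUES (?, ?, NULL, ?, ?)",
        (room_number, day, room_color, type_of_furniture),
    )

def room_as_of(room_number, day):
    """Attributes of the room as they were on `day`."""
    return conn.execute(
        "SELECT room_color, type_of_furniture FROM room_version "
        "WHERE room_number = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR ? < valid_to)",
        (room_number, day, day),
    ).fetchone()

change_room(101, "2012-06-01", "blue", "oak")
change_room(101, "2013-03-15", "white", "oak")
print(room_as_of(101, "2013-01-01"))
print(room_as_of(101, "2013-04-01"))
```

A row is written only when something actually changes, rather than once per room per day.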
Welcome to the world of databases! With that in mind – take almost everything that you know about Excel and throw it out the window. In Excel it’s difficult to define relationships between two sheets of a workbook and report off both of them, so most of the time it’s easier to simply copy the same data down a single sheet. In Access or any other relational database, relating two tables is trivially easy.
Typically what you’d want to do is create several normalized tables and define a relationship between them. Then, when querying the view, you can easily join between the tables to get the data that you need.
So, working off of the assumption that you’re building this for simple reporting and not to create a property management system (if you are looking at that – I’d recommend that you look at some of the players in the industry, like Micros or Agilysys), based on my experience working in the industry, I’d recommend the following table layout:
Reservations – this holds the reservation information (guest name, arrival date, departure date, check-in date, check-out date, rate if you use a blended rate, etc.)
Rooms – this holds information on your rack (number, wing code, max guests, # beds, smoking/non, view, type, etc.)
Room Status – Only if you need to track if a room is on reserve/hold/OOO/OTM (Status type, date start, date end)
Room Status Types – Types of room status holds and how it affects inventory (type, out of inventory flag)
Rates (if you don’t use a blended rate) – one entry per reservation per night (guest, rate)
Personally, I’m a huge fan of using surrogate keys for the unique identifiers, because all too often I've been burned where something changes in the business process and a natural key that was previously unique all of a sudden can be duplicated. In that vein, each table would have a surrogate key and the joins would be as follows:
Reservations – Rooms (many to one)
Rooms – Room Status (one to many)
Room Status – Room Status Types (many to one)
Reservations – Rates (one to many)
If you define the relationships properly in Access (i.e. foreign key relationships in other DBMS), it should automatically use them to build your joins when creating your queries (called Views in just about every other DBMS) or reports.
For learning about databases I’d recommend that you review:
Wikipedia on Join types
Wikipedia on Slowly Changing Dimension (you could use some of these techniques to record changes in room information over time)
Wikipedia on Relational Databases
Office documentation on Access
Kimball Group Design Tips (great for data warehouse/datamart design)
If you need to use your existing table then the following is not applicable. If the data can be migrated to a new schema then this will readily address the challenge. TRE is an approach which uses the current view paradigm for development but fully supports the time dimensions of data (system time, i.e. when the data goes into the db, and valid time, i.e. the business time which applies to the data). By working in the current view approach of TRE this sort of problem is straightforward. Take a look at: http://youtu.be/V1EcsuJxUno
I have a table A which contains entries I am regularly processing and storing the result in table B. Now I want to determine for each entry in A its latest processing date in B.
My current implementation is joining both tables and retrieving the latest date. However an alternative, maybe less flexible, approach would be to simply store the date in table A directly.
I can think of pros and cons for both cases (performance, scalability, ...), but I haven't had such a case yet and would like to see whether someone here on stackoverflow had a similar situation and has a recommendation for either one for a specific reason.
Below a quick schema design.
Table A
id, some-data, [possibly-here-last-process-date]
Table B
fk-for-A, data, date
Thanks
Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.
I would leave Table A alone and just introduce an index on id and date. If the historical table is big, introduce an auto-increment PK for table B and have a separate table that maps the B-Pkid to A-pkid.
I'm not a fan of UPDATE on a warehouse table, that's why I didn't recommend a CURRENT_IND, but that's an alternative.
This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).
You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.
On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticeable performance impact unless you have HUGE amounts of data.
So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you have on the production environment. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, de-normalize the table.
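For reference, the normalized version being recommended can be as small as the sketch below. It assumes SQLite via Python's sqlite3, lower-case table names `a` and `b`, and a composite index of my own choosing; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, some_data TEXT);
CREATE TABLE b (
    a_id INTEGER NOT NULL REFERENCES a(id),
    data TEXT,
    date TEXT NOT NULL              -- ISO date of the processing run
);
-- Covers both the join column and the value being maximised.
CREATE INDEX idx_b_a_date ON b(a_id, date);
INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
INSERT INTO b VALUES
    (1, 'r1', '2015-01-01'),
    (1, 'r2', '2015-02-01'),
    (2, 'r3', '2015-01-15');
""")

latest = conn.execute("""
    SELECT a.id, MAX(b.date) AS last_processed
    FROM a
    LEFT JOIN b ON b.a_id = a.id    -- LEFT JOIN keeps entries never processed
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()
print(latest)
```

Running this kind of query against realistic volumes, and reading its query plan, is the test to do before deciding to denormalize.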
Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on a moderately complex system, denormalized data schemas are at the heart of a lot of stupid bugs. It makes introducing new developers harder, it means additional complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect lots of data.
We had a similar situation in our project tracking system, where the latest state of the project is stored in the projects table (cols: project_id, description etc.) and the history of the project is stored in the project_history table (cols: project_id, update_id, description etc.). Whenever there is a new update to the project, we need to find out the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and getting the MAX(update_id), but the cost would be high considering the number of project updates (a couple of hundred thousand) and the frequency of updates. So we decided to store the value in the projects table itself, in a max_update_id column, and keep updating it whenever there is a new update to a given project. HTH.
If I understand correctly, you have a table in which each row is a parameter and another table that logs each parameter value historically in a time series. If that is correct, I currently have the same situation in one of the products I am building. My parameter table hosts a listing of measures (29K recs) and the historical parameter value table has the value for that parameter every 1 hr - so that table currently has 4M rows. At any given point in time there will be a lot more requests FOR THE LATEST VALUE than for the history so I DO HAVE THE LATEST VALUE STORED IN THE PARAMETER TABLE in addition to it being in the last record in the parameter value table. While this may look like duplication of data, from the performance standpoint it makes perfect sense because:
To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join and more importantly
I do not have to get the latest value for each parameter from such a huge table
So yes, I would in your case most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writing new data but a hell of a lot faster for reads.
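If you do go this way, one option (my suggestion, not something described above) is to let the database keep the copy in step with a trigger, so the parent row cannot be forgotten by application code. A sketch, assuming SQLite via Python's sqlite3 and invented table, column and trigger names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parameter (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    latest_value REAL,              -- denormalized copy of the newest reading
    latest_at TEXT
);
CREATE TABLE parameter_value (
    parameter_id INTEGER NOT NULL REFERENCES parameter(id),
    recorded_at TEXT NOT NULL,
    value REAL NOT NULL
);
-- Runs inside the same transaction as the insert, so both tables stay in step.
CREATE TRIGGER parameter_value_ai AFTER INSERT ON parameter_value
BEGIN
    UPDATE parameter
    SET latest_value = NEW.value, latest_at = NEW.recorded_at
    WHERE id = NEW.parameter_id
      AND (latest_at IS NULL OR latest_at <= NEW.recorded_at);
END;
INSERT INTO parameter (id, name) VALUES (1, 'temperature');
INSERT INTO parameter_value VALUES (1, '2015-05-01T10:00', 20.5);
INSERT INTO parameter_value VALUES (1, '2015-05-01T11:00', 21.0);
INSERT INTO parameter_value VALUES (1, '2015-05-01T09:00', 19.0);  -- arrives late
""")

current = conn.execute(
    "SELECT latest_value, latest_at FROM parameter WHERE id = 1"
).fetchone()
history_count = conn.execute("SELECT COUNT(*) FROM parameter_value").fetchone()[0]
print(current, history_count)
```

Reads of the current value hit only the small parent table, while the full series stays available in the history table.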