Buffer table in a database, Good or not? - sql

I have a question !
I need to make a university project, and in this project i will have one database table like this :
This table will have a LOT of records !!!!!!
And for manage this i need to create a validation system.
What is the best (and why) between create a buffer table like this :
Or add a column in my table like this :
Thank you !

Your question does not have enough information to provide a real answer. Here is some guidance on how to think about the situation. Which approach depends on the nature of your application and especially on what "validation" means.
One reasonable interpretation is that "validation" is part of a work-flow process, so it happens only once (or 99% of the time only once). And, you never want to see unvalidated advertisements when you look look at advertisements. If this is the case, then there would typically be additional information about the validation process.
This scenario suggests two reasonable approaches:
Do the validation inside a transaction. This would be reasonable if the validation process were entirely in the database and was measured in seconds.
Have a separate table for advertisements being validated. Perhaps even a separate table per "user" or "entity" responsible for them. Depending on the nature of the validation process, this could be a queue that feeds them to people doing the validation.
Putting them in the "advertisements" table doesn't make sense, because there is likely to be additional information involved with the validation process -- who, what, where, when, how.
If an advertisement can be validated and invalidated multiple times, then the best approach may be to put them in the same table. Once again, there are questions about the nature of the process.
Getting access to the two groups without a full table scan is tricky. If 10% of the rows are invalidated and 90% are validated, then a normal index would require a full table scan for reading either group. To get faster access to the smaller group, here are two options:
clustered index on the validation flag.
separate partitions for validated and invalidated rows.
In both cases, changing the validation flag for a record is relatively expensive, because it involves reading and writing the record on different data pages. Unless dozens of changes are made per second, this is probably not a big deal.

Here, there is no need to have a separate "buffer table". You can just properly index the valid field. So the following index would essentially automatically create a buffer table:
create unique index x on y (id)
include (all columns)
where (valid = 0)
This index creates a copy of the yet invalid data. You can do lots of variations such as
create unique index x on y (valid, id)
There's really no need for a separate table. Indexes are very easy compared to partitioning or even manually partitioning. Much less work, more general, more flexible and less potential for human error.

Either approach is valid, and which will perform better will depend more on the type of database you are using rather than the theoretical question of whether it is more correct to use a boolean or partition this into two tables.
I actually prefer the partitioning approach (your buffer table idea), but it will be more complex to code around. This may be a significant point to consider. Most modern databases will handle the boolean criteria very well with an index, but sometimes you can be surprised.
The most important thing from a development perspective right now is to pick one and run with it instead of paralyzing your project while you decide the "right" one.

Related

PostgreSQL - "Ten most frequent entries"

We've got a table with two colums: USER and MESSAGE
An USER can have more than one message.
The table is frequently updated with more USER-MESSAGE pairs.
I want to frequently retrieve the top X users that sent the most messages. What would be the optimal (DX and performnce wise) solution for it?
The solutions I see myself:
I could GROUP BY and COUNT, however it doesn't seem like the most performant nor clean solution.
I could keep an additional table that'd keep count of every user's messages. On every message insertion into the main table, I could also update the relevant row here. Could the update be done automaticaly? Perhaps I could write a procedure for it?
For the main table, I could create a VIEW that'd have an additional "calculated" column - it'd GROUP BY and COUNT, but again, it's probably not the most performant solution. I'd query the view instead.
Please tell me whatever you think might be the best solution.
Some databases have incrementally updated views, where you create a view like in your example 3, and it automatically keeps it updated like in your example 2. PostgreSQL does not have this feature.
For your option 1, it seems pretty darn clean to me. Hard to get much simpler than that. Yes, it could have performance problems, but how fast do you really need it to be? You should make sure you actually have a problem before worrying about solving it.
For your option 2, what you are looking for is a trigger. For each insertion, it would increment a count in the user table. If you ever delete, you would also need to decrease the count. Also, if ever update to change the user of an existing entry, the trigger would need to decrease the count of the old user and increase it of the new user. This will decrease the concurrency, as if two processes try to insert messages from the same user at the same time, one will block until the other finishes. This may not matter much to you. Also, the mere existence of triggers imposes some CPU overhead, plus whatever the trigger itself actually does. But unless our server is already overloaded, this might not matter.
Your option 3 doesn't make much sense to me, at least not in PostgreSQL. There is no performance benefit, and it would act to obscure rather than clarify what is going on. Anyone who can't understand a GROUP BY is probably going to have even more problems understanding a view which exists only to do a GROUP BY.
Another option is a materialized view. But you will see stale data from them between refreshes. For some uses that is acceptable, for some it is not.
The first and third solutions are essentially the same, since a view is nothing but a “crystallized” query.
The second solution would definitely make for faster queries, but at the price of storing redundant data. The disadvantages of such an approach are:
You are running danger of inconsistent data. You can reduce that danger somewhat by using triggers that automatically keep the data synchronized.
The performance of modifications of message will be worse, because the trigger will have to be executed, and each modification will also modify users (that is the natural place to keep such a count).
The decision should be based on the question whether the GROUP BY query will be fast enough for your purposes. If yes, use it and avoid the above disadvantages. If not, consider storing the extra count.

Is CHECK better then invalid data?

I'm writing a code for a database. I have a table with log of machines activity, looking like:
CREATE TABLE Work(
id SERIAL PRIMARY KEY,
machine_ID integer NOT NULL DEFAULT 0,
start_work timestamp,
etc...
);
I know machine_ID can be in 1 to 5.
Here my question comes:
Are there any benefits of using CHECK(machine_ID >= 1 AND machine_ID <=5)?
Wouldn't it be better to accept polluted data in order to repair possible bugs later, or even to use the possibility to clean up the data?
Perhaps this is more appropriate as a comment.
But a column called machine_id should not be validated using a check constraint. Instead, you should have a table -- say machines -- that is a reference table for machines.
Your code should use a foreign key constraint. This is a special type of data validation called relational integrity.
As for your question. In my experience, it is usually better to capture errors when they go into the database. Bad data in the database usually leads to problems further down the road -- problems that could have been avoided.
That will be okey if you want to limit how may id can be in your database, but nothing else.
This is massively subjective, with different approaches and opinions, and is highly dependent on your business case.
One principle is to keep things simple, but even what that means is subjective.
Reject Early
In a pre-processing layer, apply rules to the data being received. As soon as possible reject any row/file/source and request re-submission.
This makes it clear what has been accepted, what hasn't been accepted, how to deal with such scenarios, etc.
Consume Everything
Where failures can be complicated and varied, a database can often be the best place to do analysis to understand the source, cause, magnitude and/or impact of failures. In which case ingesting everything can be beneficial.
But that can have a potentially uncontrolled impact. As such, my general experience of this is to have a separate staging area in the database. Essentially as unstructured as possible, to allow as much data in as possible. But even that's not always helpful. If you use strings for all fields, you can accept data that would normally have to be accepted. But even then, what if you have a file with 9 columns being supplied for a table of 8 columns? You could accept the whole row as a single string, but analyzing that for meaningful results will be close to impossible.
What this means in your case depends on the project you are working on, and that's what you haven't described.
My personal default position is to re-categorise certain source data failures as predicted/business-as-usual inconsistencies. You can then structure your staging area to handle these, for reporting, reconciliation, remedial action, etc. And, more importantly, explicitly build them in to all related business processes. Doing so will introduce a cost, which can be estimated against the benefit of accommodating them, with the intention of only accommodating inconsistencies where it's actually a material benefit. (Rather than just being a hoarder that keeps everything just in case it might help someone someday, maybe.)
Whatever direction that goes, however, the operational database itself (not the staging area) is heavily structured, with integrity constraints in place to prevent bugs, unexpected inputs, etc, from corrupting existing data or causing unexpected, unmonitored consequences.
If you truly believe that your operational database (be that transactional, analytical, or anything else) would benefit from being absent some/many/all of these data integrity tools, then a SQL Relational Database is probably the wrong tool for you. Instead consider the enormous number of unstructured data storage and processing platforms.
There are a bunch of different considerations.
Firstly, I prefer not to use check constraints for business rules that are likely to change - I don't want to roll out a database schema change (and a check constraint is a schema change) in response to a predictable business event. Adding or removing machines feels like a predictable business event; so as #gordonLinoff suggests, I'd recommend using a foreign key to the "machines" table.
Secondly, I don't like "hidden code" - from a development and maintenance point of view, I like to keep all the validation as intelligible as possible. Check constraints and triggers are relatively "hidden", hard to document, hard to debug, hard to test for non-trivial use cases; foreign keys, on the other hand, are explicit and developers expect them.

Is adding a bit mask to all tables in a database useful?

A colleague is adding a bit mask to all our database tables. In theory this is so we can track certain properties of each row across the entire system. For example...
Is the row shipped with the system or added by the client once they've started using the system
Has the row been deleted from the table (soft deletes)
Is the row a default value within a set of rows
Is this a good idea? Are there other uses where this approach would be beneficial?
My preference is these properties are obviously important, and having a dedicated column for each property is justified to make what is happening clearer to fellow developers.
Not really, no.
You can only store bits in it, and only so many. So, seems to me like it's asking for a lot of application-level headaches later on keeping track of what each one means and potential abuse later on because "hey they're everywhere". Is every bitmask on every table going to use the same definition for each bit? Will it be different on each table? What happens when you run out of bits? Add another?
There are lots of potential things you could do with it, but it begs the question "why do it that way instead of identifying what we will use those bits for right now and just make them proper columns?" You don't really circumvent the possibility of schema changes this way anyway, so it seems like it's trying to solve a problem that you can't really "solve" and especially not with bitmasks.
Each of the things you mentioned can be (and should be) solved with real columns on the database, and those are far more self-documenting than "bit 5 of the BitMaskOptions field".
A dedicated column is is better, because it's undoubtedly more obvious and less error-prone. SQL Server already stores BIT columns efficiently, so performance is not an issue.
The only argument I could see for a bitmask is not having to change the DB schema every time you add a new flag, but really, if you're adding new flags that often then something is not right.
No, it is not even remotely a good idea IMO. Each column should represent a single concept and value. Bit masks have all kinds of performance and maintenance problems. How do new developers understand what each of the bits mean? How do you prevent someone from accidentally mixing the meaning of the order of the bits?
It would be better to have a many-to-many relationship or separate columns rather than a bit mask. You will be able to index on it, enable referential integrity (depending on approach), easily add new items and change the order of the results to fit different reports and so on.

BASIC Object-Relation Mapping question asked by a noob

I understand that, in the interest of efficiency, when you query the database you should only return the columns that are needed and no more.
But given that I like to use objects to store the query result, this leaves me in a dilemma:
If I only retrieve the column values that I need in a particular situation, I can only partially populate the object. This, I feel, leaves my object in a non-ideal state where only some of the properties and methods are available. Later, if a situation arises where I would like to the reuse the object but find that the new situation requires a different but overlapping set of columns to be returned, I am faced with a choice.
Should I reuse the existing SQL and add to the list of selected columns the additional fields that are required by the new situation so that the same query and object mapping logic can be reused for both? Or should I create another method that results in the execution of a slightly different SQL which results in the populating of only those object properties that were returned by the 2nd query?
I strongly suspect that there is no magic answer and that the answer really "depends" on the situation but I guess I am looking for general advice. In general, my approach has been to either return all columns from the queried table or to add to the query the additional columns as they are needed but to reuse the same SQL (and mapping code) that is, until performance becomes a concern. In general, I find that unless you are retrieving a large number of row - and I usually am not - that the cost of adding additional columns to the output does not have a noticable effect on performance and that the savings in development time and the simplified API that result are a good trade off.
But how do you deal with this situation when performance does become a factor? Do you create methods like
Employees.GetPersonalInfo
Employees.GetLittleMorePersonlInfoButMinusSalary
etc, etc etc
Or do you somehow end up creating an API where the user of your API has to specify which columns/properties he wants populated/returned, thereby adding complexity and making your API less friendly/easy to use?
Let's say you want to get Employee info. How many objects would typically be involved?
1) an Employee object
2) An Employees collection object containing one Employee object for each Employee row returned
3) An object, such as EmployeeQueries that returns contains methods such as "GetHiredThisWeek" which returns an Employees collection of 0 or more records.
I realize all of this is very subjective, but I am looking for suggestions on what you have found works best for you.
I would say make your application correct first, then worry about performance in this case.
You could be optimizing away your queries only to realize you won't use that query anyway. Create the most generalized queries that your entire app can use, and then as you are confident things are working properly, look for problem areas if needed.
It is likely that you won't have a great need for huge performance up front. Some people say the lazy programmers are the best programmers. Don't over-complicate things up front, make a single Employee object.
If you find a need to optimize, you'll create a method/class, or however your ORM library does it. This should be an exception to the rule; only do it if you have reason to do so.
...the cost of adding additional columns to the output does not have a noticable effect on performance...
Right. I don't quite understand what "new situation" could arise, but either way, it would be a much better idea (IMO) to get all the columns rather than run multiple queries. There isn't much of a performance penalty at all for getting more columns than you need (although the queries will take more RAM, but that shouldn't be a big issue; besides, hardware is cheap). Also, you'd save yourself quite a bit of development time.
As for the second part of your question, it's really up to you. As an example, Rails takes more of a "usability first, performance last" approach, but that may not be what you want. It just depends on your needs. If you're willing to sacrifice a little usability for performance, by all means, go for it. I would.
If you are using your Objects in a "row at a time" CRUD type application, then, by all means copy all the columns into your object, the extra overhead is minimal, and you object becomes truly re-usable for any program wanting row access to the table.
However if your SQL is doing a complex join or returning a large set of rows, then request precisely and only what you want. You get two performance penalties here, one handling each column each time will eat up cpu for no benefit, and, two most DBMS systems have a bag of tricks for optimising queries (such as index only access) which can only be used if you specify precisely which columns you want.
There is no reuse issue in most of these cases as scan/search processes tend to very specific to a particular use case.

DB Design: Does having 2 Tables (One is optimized for Read, one for Write) improve performance?

I am thinking about a DB Design Problem.
For example, I am designing this stackoverflow website where I have a list of Questions.
Each Question contains certain meta data that will probably not change.
Each Question also contains certain data that will be consistently changing (Recently Viewed Date, Total Views...etc)
Would it be better to have a Main Table for reading the constant meta data and doing a join
and also keeping the changing values in a different table?
OR
Would it be better to keep everything all in one table.
I am not sure if this is the case, but when updating, does the ROW lock?
When designing a database structure, it's best to normalize first and change for performance after you've profiled and benchmarked your queries. Normalization aims to prevent data-duplication, increase integrity and define the correct relationships between your data.
Bear in mind that performing the join comes at a cost as well, so it's hard to say if your idea would help any. Proper indexing with a normalized structure would be much more helpful.
And regarding row-level locks, that depends on the storage engine - some use row-level locking and some use table-locks.
Your initial database design should be based on conceptual and relational considerations only, completely indepedent of physical considerations. Database software is designed and intended to support good relational design. You will hardly ever need to relax those considerations to deal with performance. Don't even think about the costs of joins, locking, and activity type at first. Then further along, put off these considerations until all other avenues have been explored.
Your rdbms is your friend, not your adversary.
You should have the two table separated out as you might want to record the history of the question. The main Question table is indexed by question ID then the Status table is indexed by query ID and date/time stamp and contains a row for each time the status changes.
Don't know that the updates are really significant unless you were using pessimistic locking where the row would be locked for a period of time.
I would look at caching your results either locally with Asp.net caching or using MemCached.
This would certainly be a bad idea if you were using Oracle. In Oracle, you can quite happily read records while other sessions are modifying them due to it's multi-version concurency control. You would incur extra performance penalty for the join for no savings.
A design patter that is useful, however, is to pre-join tables, pre-calculate aggregates or pre-apply where clauses using materialized views.
As already said, better start with a clean normalized design. It's just easier to denormalize later, than to go the other way around. The experience teaches that you will never denormalize that one big table! You will just throw more columns in as needed. And you will need more and more indexes and updates will go slower and slower.
You should also take a look at the expected loads: Will be there more new answers or just more querying? What other operations will you have? When it comes to optimization, you can use the features of your dbms system: indexing, views, ...
Eran Galperin already provided most of my answer. In addition, the structure you propose really wouldn't help you in terms of locking. If their are relatively static and dynamic attributes in the same row, breaking the static and dynamic into two tables isn't of much benefit. It doesn't matter if static data is being locked, since no one is trying to change it anyway.
In fact, you may actually do worse with this design. Some database engines use page locking. If a table has fewer/smaller columns, more rows will fit on a page. The more rows there are on a page, the more likely there will be a lock contention. By having the static data mixed in with the dynamic, the rows are bigger, therefore there are fewer rows in a page, and therefore fewer waits on page locks.
If you have two independent sets of dynamic attributes, and they are normally modified by different actors, then you might get some benefit by breaking them into different tables. This is a pretty unusual case, however.
I'd also point out that breaking the table into a static and dynamic portion may not be of benefit in a relatively small environment, but in a large distributed environment it may be useful to cache and replicate the dynamic data at different rates than the static data.