Do you need to fully validate data both in the database and in the application?

For example, if I need to store a valid phone number in a database, should I fully validate the number in SQL, or is it enough to fully validate it in the app before inserting it into the db, and just add some light validation in SQL constraints (like checking for the correct number of digits)?

There is no correct answer to this question.
In general, you want the database to maintain data integrity -- and that includes valid values in columns. You want this for multiple reasons:
Databases are usually more efficient at enforcing such checks, because they run on multi-threaded servers.
Databases can handle concurrent threads (not an issue for a check constraint, but an issue for other types of constraints).
Databases ensure the data integrity regardless of how the data is changed.
A check constraint (presumably what you want here) is part of the data definition and applies to all inserts and updates. Such operations might occur in multiple places in the application.
The third point is important. If you want to ensure that a phone number looks like a phone number, then you don't want someone to change it accidentally with an update.
However, some checks are simpler in the application. Others might apply only when a new row is inserted, but not on later updates. Still others should apply only to data that comes in from the application (as opposed to manual changes). So there are reasons why you might not want to do all checks in the database; a light structural constraint like the one sketched below is a common middle ground.
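As a minimal sketch of that middle ground (T-SQL; the table and column names are hypothetical), a light structural check on a phone number might look like this:

-- Hypothetical table; enforces "exactly 10 digits" as a light structural check.
-- '%[^0-9]%' matches any non-digit character in SQL Server's LIKE syntax.
ALTER TABLE dbo.Customers
ADD CONSTRAINT CK_Customers_Phone
CHECK (LEN(Phone) = 10 AND Phone NOT LIKE '%[^0-9]%');

Fuller semantic validation (area-code rules, formatting, country codes) would stay in the application.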

You definitely have to validate incoming data at your backend before, e.g., doing CRUD operations on your database, since client-side validation could be omitted or even faked. It is considered good practice to validate input data at the client as well, but you should never, ever trust the client.

Related

Extending a set of existing tables into a dynamic client defined structure?

We have an old repair database that has a lot of relational tables, and it works as it should, but I need to update it to be able to handle different clients (areas) - currently it is built for a single client only.
So I need to extend the tables and the SQL statements so that, for example, I can log in as user A and see only his own system, while user B has his own system too.
Is it correctly understood that you wouldn't create new tables for each client, but just add a clientID to every record in every (base) table and then filter by clientID in all SQL statements to achieve multiple clients?
Is this also something that would work (and how is it done) on hosted solutions? I am worried about performance if that's an issue - say I had 500 clients (I won't, but from a theoretical viewpoint).
The normal approach is to add a client key to each table where appropriate. Many tables don't need one -- such as reference tables.
This is preferred for many reasons:
You have the data for all clients in one place, so you can readily answer a question such as "what is the average X for each client".
If you change the data structure, then it affects all clients at the same time.
Your backup and restore strategy is only implemented once.
Your optimization is only implemented once.
This is not always the best solution. You might have requirements that specify that the data must be separated -- in which case each client should be in a separate database. However, the indexes on the additional keys are probably a minor consideration, and you shouldn't worry about them.
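To make the shared-schema approach concrete, here is a minimal sketch (T-SQL; the Repairs table, its columns, and the @ClientID parameter are all hypothetical):

-- Add a client key to each base table and index it.
ALTER TABLE dbo.Repairs ADD ClientID int NOT NULL DEFAULT 0;
CREATE INDEX IX_Repairs_ClientID ON dbo.Repairs (ClientID);

-- Every statement then filters on the logged-in user's client.
SELECT RepairID, Description
FROM dbo.Repairs
WHERE ClientID = @ClientID;   -- supplied from the user's session

Even with 500 clients this is just one indexed integer column per table, so per-client filtering stays cheap.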
This question has been asked before. The problem with adding the key to every table is that, since you say you have a working system, every query needs to be updated.
Probably the easiest option is to create a new database for each client, so that the only thing you need to change is the connection string. This also means you can get automated query tools, for example, to work without worrying about cross-client data leakage.
It also allows you to back up, transfer, or delete a single client easily.
There are of course pros and cons to this approach, but it will simplify the development effort. Also remember that if you plan to spin it up in a cloud environment, then spinning up databases like this is very easy too.

PostgreSQL: Best way to check for new data in a database I don't control

For an application I am writing, I need to be able to identify when new data is inserted into several tables of a database.
The problem is twofold: this data will be inserted many times per minute into sometimes very large databases (and I need to be sensitive to demand / database-polling issues), and I have no control over the application creating this data (so, as far as I know, I can't use the notify/listen functionality available within Postgres for exactly this kind of task*).
Any suggestion regarding a good strategy would be much appreciated.
*I believe the application controlling this data is using the notify/listen functionality itself, but I haven't a clue how to find out (if it is possible at all) which "channel" it uses, or whether an external process could latch on to it.
Generally, you need something in the table that you can use to determine newness, and there are a few approaches.
A timestamp column would let you use the date, but you'd still have the application issue of storing a date outside of your database, and data that isn't in the database means another realm of data to manage. Yuck.
A tracking table that stores last update/insert timestamps on a per-table basis could give you what you want; you'd want to use a trigger to maintain the last-DML timestamp (see the sketch below).
A solution you don't want to use is a serial (integer) id that comes from nextval, for any purpose other than uniqueness. The standard/common mistake is to presume serial keys are contiguous (they're not) or monotonic (they're not).
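A minimal PostgreSQL sketch of the tracking-table approach (the table, function, and trigger names are hypothetical; ON CONFLICT needs PostgreSQL 9.5+, and EXECUTE FUNCTION needs 11+ -- use EXECUTE PROCEDURE on older versions):

-- One row per watched table, recording the time of the last DML statement.
CREATE TABLE table_activity (
    table_name text PRIMARY KEY,
    last_dml   timestamptz NOT NULL
);

CREATE FUNCTION record_dml() RETURNS trigger AS $$
BEGIN
    INSERT INTO table_activity (table_name, last_dml)
    VALUES (TG_TABLE_NAME, now())
    ON CONFLICT (table_name) DO UPDATE SET last_dml = now();
    RETURN NULL;  -- AFTER trigger, so the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER watched_dml
AFTER INSERT OR UPDATE OR DELETE ON watched_table
FOR EACH STATEMENT EXECUTE FUNCTION record_dml();

Your application then polls only the small table_activity table, which stays cheap even when the watched tables are very large.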

How to use database triggers in a real world project?

I've learned a lot about triggers and active databases in the last weeks, but I have some questions about real-world examples of these.
At work we use the Entity Framework with ASP.NET and an MSSQL Server. We just use the auto-generated constraints and no triggers.
When I heard about triggers, I asked myself the following questions:
Which tasks can be performed by triggers?
e.g.: Generation of reporting data: currently the data for the reports is created in VB, but I think a trigger could handle this as well. The creation in VB takes a lot of time, and the user should not need to wait for it, because it's not necessary for his work.
Is this an example for a perfect task for a trigger?
How do OR mappers handle trigger-manipulated data?
e.g.: Does an OR mapper recognize when a trigger has manipulated data? The Entity Framework seems to cache a lot of data, so I'm not sure whether it reads the updated data if a trigger manipulates it after the insert/update/delete from the framework is processed.
How much constraint handling should be within the database?
e.g.: Sometimes constraints in the database seem much easier and faster than in the layer above (VB.NET, ...), but how do you throw exceptions to the upper layer so that the OR mapper can handle them?
Is there a good solution for handling SQL exceptions (from triggers) in any OR mapper?
Thanks in advance
When you hear about a new tool or feature, it doesn't mean you have to use it everywhere. You should think about the design of your application.
Triggers are used a lot when the logic lives in the database, but if you build an ORM layer on top of your database, you want the logic in the business layer, using your ORM. That doesn't mean you should never use triggers; it means you should use them with an ORM the same way as stored procedures or database functions - only when it makes sense or when it improves performance. If you push a lot of logic into the database, you can throw away the ORM and perhaps your whole business layer, and use a two-layered architecture where the UI talks directly to a database that does everything you need - such an architecture is considered "old".
When using an ORM, a trigger can be helpful for some DB-generated data, like audit columns or custom sequences of primary key values.
Current ORMs mostly don't play well with triggers - they can only react to changes to the currently processed record. So, for example, if you save an Order record and your update trigger modifies all of its ordered items, there is no automatic way to let the ORM know about that - you must reload the data manually. In EF, any data modified or generated in the database must be marked with StoreGeneratedPattern.Identity or StoreGeneratedPattern.Computed - EF strictly follows the pattern where logic is either in the database or in the application. Once you define that a value is assigned in the database, you cannot change it in the application (it will not be persisted).
Your application logic should be responsible for data validation and call persistence only if validation passes. You should avoid unnecessary transactions and roundtrips to database when you can know upfront that transaction will fail.
I use triggers for two main purposes: auditing and updating modification/insertion times. When auditing, the triggers push data to related audit tables. This doesn't affect the ORM in any way as those tables are not typically mapped in the main data context (there's a separate auditing data context used when needed to look at audit data).
When recording/modifying insert/modification times, I typically mark those properties in the model as [DatabaseGenerated( DatabaseGenerationOptions.Computed )]. This prevents any values set in the data layer from being persisted back to the DB and allows the trigger to enforce setting the DateTime fields properly.
It's not a hard-and-fast rule that I manage auditing and these dates this way. Sometimes I need more auditing information than is available in the database itself and handle auditing in the data layer instead. Sometimes I want to force the application to update the dates/times (since they may need to be the same across several rows/tables updated at the same time). In those cases I might make the field nullable in the database, but [Required] in the model, to force a date/time to be set before the model can be persisted.
The old Infomodeler/Visiomodeler ORM (not what you think - it was Object Role Modeling) provided an alternative when generating the physical model: it would provide all the referential integrity with triggers, for two reasons:
Some dbmses (notably Sybase/SQL Server) didn't have declarative RI yet, and
It could provide much more finely grained integrity - e.g. "no more than two children" or "sons or daughters but not both" or "mandatory son or daughter but not both".
So trigger logic related to the model in the same way that any RI constraint does. In SQL Server it handled violations with RAISERROR.
A conceptual issue with triggers is that they are essentially context-free - they always fire, regardless of context (at least without great pain), and you might do better to include their logic with the rest of the context-specific logic. So global domain constraints are the only place I find them useful - which I guess is another general way of saying "referential integrity".
Triggers are used to maintain the integrity and consistency of data (complementing constraints), to help the database designer ensure certain actions are completed, and to create database change logs.
For example, given numeric input, if you want the value to be constrained to, say, less than 100, you could write a trigger that fires for every row on update or insert and raises an application error if the value of that column does not meet that constraint (sketched below).
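A hedged T-SQL sketch of that idea (the Readings table and its Value column are hypothetical; for a rule this simple, a plain CHECK constraint would usually be preferable):

CREATE TRIGGER trg_Readings_MaxValue
ON dbo.Readings
AFTER INSERT, UPDATE
AS
BEGIN
    -- Reject the whole statement if any affected row violates the rule.
    IF EXISTS (SELECT 1 FROM inserted WHERE Value >= 100)
    BEGIN
        RAISERROR('Value must be less than 100.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END;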
Suppose you want to log historical changes to a table. You could create a trigger that fires AFTER each INSERT, UPDATE, and DELETE and also inserts the data into a logging table (also sketched below). If you need to execute custom logic, then triggers may appeal to you.
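A minimal audit-trigger sketch in the same hypothetical style (Orders and OrdersLog are illustrative tables, with OrderID as the key):

CREATE TABLE dbo.OrdersLog (
    LogID     int IDENTITY(1,1) PRIMARY KEY,
    OrderID   int,
    Action    char(1),             -- 'I' = insert, 'U' = update, 'D' = delete
    ChangedAt datetime DEFAULT GETDATE()
);

CREATE TRIGGER trg_Orders_Audit
ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Rows only in "inserted" are inserts; rows in both are updates.
    INSERT INTO dbo.OrdersLog (OrderID, Action)
    SELECT i.OrderID, CASE WHEN d.OrderID IS NULL THEN 'I' ELSE 'U' END
    FROM inserted i
    LEFT JOIN deleted d ON i.OrderID = d.OrderID
    UNION ALL
    -- Rows only in "deleted" are deletes.
    SELECT d.OrderID, 'D'
    FROM deleted d
    LEFT JOIN inserted i ON d.OrderID = i.OrderID
    WHERE i.OrderID IS NULL;
END;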

Securing a SQL Server database against clever admins?

I want to secure events stored in one table, which has relations to others.
Events are inserted through a Windows service that connects to hardware and reads from it.
The events table has a PK, a date and time, and 3 different values.
The problem is that any admin can log in and insert/update/delete data in this table, e.g. using SQL Server Management Studio. I created triggers to prevent updates and deletes, so an admin who doesn't know about the triggers will fail to change the data, but one who does can easily disable them and do whatever he wants.
So after long thinking I have one idea: add a new column (field) to the table and store something like a checksum in it. This checksum will be calculated based on the other values and generated in the insert/update statements.
If someone inserts/updates something manually, I will know, because when I check the data against the checksum there will be mismatches.
My question is, if you have similar problem, how do you solve it?
What algorithm should I use for the checksum? And how do I secure against DELETE statements (I know about gaps in the PK numbering, but that is not enough)?
I'm using SQL Server 2005.
As admins have permissions to do everything on your SQL Server, I recommend a tamper-evident auditing solution. In this scenario, everything that happens on a database or SQL Server instance is captured and saved in a tamper-evident repository. If someone with the privileges (like an admin) modifies or deletes audited data from the repository, it will be reported.
ApexSQL Comply is such a solution, and it has a built-in integrity-check option.
There are several anti-tampering measures that provide different integrity checks and detect tampering even when it's done by a trusted party. To ensure data integrity, the solution uses hash values. A hash value is a numeric value created using a specific algorithm that uniquely identifies the row.
Every table in the central repository database has a RowVersion and a RowHash column. RowVersion contains the row timestamp - the last time the row was modified. RowHash contains the unique row identifier, calculated from the values of the other table columns.
When the original record in the auditing repository is modified, ApexSQL Comply automatically updates the RowVersion value to reflect the time of the last change. To verify data integrity, ApexSQL Comply calculates the RowHash value for the row based on the existing row values. Since those values have been updated, the newly calculated RowHash value will differ from the RowHash value stored in the central repository database. This will be reported as suspected tampering.
To hide the tampering, I would have to calculate a new value for RowHash and update it. This is not easy, as the formula used for the calculation is complex and undisclosed. But that's not all: the RowHash value is calculated using the RowHash value from the previous row. So, to cover up tampering, I would have to recalculate and modify the RowHash values in all following rows.
For some tables in the ApexSQL Comply central repository database, the RowHash values are calculated from rows in other tables, so to cover the tracks of tampering in one table, the admin would have to modify records in several central repository database tables.
This solution is not tamper-proof, but it definitely makes covering one's tracks quite difficult.
Disclaimer: I work for ApexSQL as a Support Engineer
Security through obscurity is a bad idea. If there's a formula to calculate a checksum, someone can do it manually.
If you can't trust your DB admins, you have bigger problems.
Anything you do at the server level, the admin can undo. That's the very definition of the role, and there's nothing you can do to prevent it.
In SQL 2008 you can set up auditing of the SQL Server instance with Extended Events; see http://msdn.microsoft.com/en-us/library/cc280386.aspx. This is a Common Criteria-compliant solution that is tamper-evident: the admin can stop the audit and do his mischievous actions, but the stopping of the audit is itself recorded.
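A hedged sketch of that SQL 2008 feature (the audit name, file path, and dbo.Events table are all illustrative; the server audit is created in master, the specification in the target database):

-- Server-level audit writing to a file target.
CREATE SERVER AUDIT EventsAudit
TO FILE (FILEPATH = 'C:\AuditLogs\');
ALTER SERVER AUDIT EventsAudit WITH (STATE = ON);

-- Database-level specification: record DML against the events table.
CREATE DATABASE AUDIT SPECIFICATION EventsTableAudit
FOR SERVER AUDIT EventsAudit
ADD (INSERT, UPDATE, DELETE ON OBJECT::dbo.Events BY public)
WITH (STATE = ON);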
In SQL 2005, the recommended auditing solution uses the Profiler infrastructure. This can be made tamper-evident when correctly deployed: you prevent data changes with triggers and constraints and audit the DDL changes. If the admin changes the triggers, this is visible in the audit; if the admin stops the audit, that is also visible.
Do you plan this as a one-time action against a rogue admin, or as a feature to be added to your product? Using digital signatures to sign all your application data can be very costly in app cycles. You also have to design a secure scheme to show that records were not deleted, including the last records (i.e. not just a gap in an identity column). E.g. you could compute CHECKSUM_AGG over BINARY_CHECKSUM(*), sign the result in the app, and store the signed value for each table after each update. Needless to say, this will slow down your application, as you basically serialize every operation. For individual row checksums/hashes, you would have to compute the entire signature in your app, and that may require values your app does not yet have (i.e. the identity column value to be assigned to your insert). And how far do you want to go? A simple hash can be broken if the admin gets hold of your app and monitors what you hash, and in what order (this is trivial to achieve); he can then recompute the same hash. An HMAC requires you to store a secret in the application, which is basically impossible to protect against a determined hacker. These concerns may seem overkill, but if this is an application you sell, for instance, then all it takes is one hacker breaking your hash sequence or HMAC secret, and Google will make sure everyone else finds out about it eventually.
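For reference, the aggregate checksum mentioned above looks roughly like this (T-SQL; dbo.Events is a hypothetical table):

-- One value summarizing the whole table; the application signs and stores it
-- after each update, then recomputes and compares it later.
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) AS TableChecksum
FROM dbo.Events;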
My point is that you're fighting a losing battle if you're trying to deter the admin via technology. The admin is a person you trust, and if that trust is broken in your case, the problem is the trust, not the technology.
Ultimately, even if admins do not have delete rights, they can grant themselves access, change the rule that denies deletes, delete the row, restore the permission, and then revoke their own access to make permission changes.
If you are auditing that, then when they give themselves access, you fire them.
As far as an effective tamper-resistant checksum goes, it's possible to use public/private key signing. This means that if the signature matches the message, then no one except the party the record says created/modified it could have done so. Anyone can change and sign the record with their own key, but not pose as someone else.
I'll just point to Protect sensitive information from the DBA in SQL Server 2008
The idea of a checksum computed by the application is a good one. I would suggest that you research Message Authentication Codes, or MACs, for a more secure method.
Briefly, some MAC algorithms (HMAC) use a hash function, and include a secret key as part of the hash input. Thus, even if the admin knows the hash function that is used, he can't reproduce the hash, because he doesn't know all of the input.
Also, in your case, a sequential number should be part of the hash input, to prevent deletion of entire entries.
Ideally, you should use a strong cryptographic hash function from the SHA-2 family. MD5 has known vulnerabilities, and similar problems are suspected in SHA-1.
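A simplified keyed-hash sketch in T-SQL (not a full HMAC; HASHBYTES supports SHA2_256 only from SQL Server 2012, so on 2005 the hashing would better live in the application; the key, sequence number, and payload are all illustrative):

-- The secret key is supplied by the application and never stored in the database.
DECLARE @SecretKey varbinary(32) = 0x0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF;
DECLARE @RowSeq int = 42;                      -- sequential number, so a deletion leaves a detectable gap
DECLARE @Payload nvarchar(100) = N'event data';

SELECT HASHBYTES('SHA2_256',
       @SecretKey + CAST(@RowSeq AS varbinary(4)) + CAST(@Payload AS varbinary(200))) AS RowMac;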
It might be more effective to try to lock down permissions on the table. With the checksum, it seems like a malicious user might be able to spoof it, or insert data that appears to be valid.
http://www.databasejournal.com/features/mssql/article.php/2246271/Managing-Users-Permissions-on-SQL-Server.htm
If you are concerned about people modifying the data, you should also be concerned about them modifying the checksum.
Can you not simply password protect certain permissions on that database?

Sorting on the server or on the client?

I had a discussion with a colleague at work about SQL queries and sorting. He is of the opinion that you should let the server do any sorting before returning the rows to the client. I, on the other hand, think that the server is probably busy enough as it is, and that it must be better for performance to let the client handle the sorting after it has fetched the rows.
Does anyone know which strategy is best for the overall performance of a multi-user system?
In general, you should let the database do the sorting; if it doesn't have the resources to handle this effectively, you need to upgrade your database server.
First off, the database may already have indexes on the fields you want, so it may be trivial for it to retrieve the data in sorted order (see the sketch below). Secondly, the client can't sort the results until it has all of them, whereas if the server sorts them, you can process them one row at a time, already sorted. Lastly, the database is probably more powerful than the client machine and can probably perform the sorting more efficiently.
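A minimal illustration of the index point (T-SQL; the table and column names are hypothetical):

-- With an index on the sort column, the server can stream rows back
-- in order without a separate sort step.
CREATE INDEX IX_Customers_LastName ON dbo.Customers (LastName);

SELECT CustomerID, LastName, FirstName
FROM dbo.Customers
ORDER BY LastName;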
It depends... Is there paging involved? What's the maximum size of the dataset? Does the entire dataset need to be sorted the same way all the time, or according to user selection? Or (if paging is involved), do only the records on the single page on the client screen need to be sorted (not normally acceptable), or does the entire dataset need to be sorted and page one of the newly sorted set redisplayed?
What's the distribution of client hardware compared to the processing requirements of this sort operation?
The bottom line is: it's the overall user experience (measured against cost, of course) that should control your decision. In general, client machines are slower than servers and may introduce additional latency. ...
... But how often will clients request additional custom sort operations after the initial page load? (A client-side sort of data already on the client is far faster than a round trip...)
Sorting on the client, though, always requires that the entire dataset be sent to the client on the initial load. That delays the initial page display, which may call for lazy loading, AJAX, or other technical complexities to mitigate...
Sorting on the server, on the other hand, introduces additional scalability issues and may require that you add more boxes to the server farm to deal with the additional load. If you're doing the sorting in the DB and reach that threshold, things can get complicated. (To scale out the DB, you have to implement some read-only replication scheme, or some other solution that allows multiple servers, each doing processing, to share read-only data.)
I am in favor of Robert's answer, but I want to add a bit to it.
I also favor sorting the data in SQL Server. I have worked on many systems that tried to do it on the client side, and in almost every case we had to rewrite the process to do it inside SQL Server. Why is that, you might ask? Well, we have two primary reasons:
The amount of data being sorted
The need to implement proper paging due to #1
We deal with interfaces that show users very large sets of data, and leveraging the power of SQL Server to handle sorting and paging is by far better performing than doing it client side.
To put some numbers to this, we compared a SQL Server-side sort to a client-side sort in our environment, with no paging for either: the client-side sort, using XML for sorting, took 28 seconds; the server-side sort had a total load time of 3 seconds.
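A hedged sketch of server-side sorting plus paging (the OFFSET/FETCH syntax needs SQL Server 2012+; older versions would typically use ROW_NUMBER(); the table and columns are illustrative):

-- Return page 3 (10 rows per page) of orders, newest first.
SELECT OrderID, CustomerName, OrderDate
FROM dbo.Orders
ORDER BY OrderDate DESC
OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;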
Generally I agree with the views expressed above that server-side sorting is usually the way to go. However, there are sometimes reasons to do client-side sorting:
The sort criteria are user-selectable or numerous. In this case, it may not be a good idea to add a shedload of indices to the table - especially if insert performance is a concern. If some sort criteria are rarely used, an index isn't necessarily worth it, since inserts will outnumber selects.
The sort criteria can't be expressed in pure SQL [uncommon], or can't be indexed. It's not necessarily any quicker client-side, but it takes load off the server.
The important thing to remember is that while balancing the load between powerful clients and the server may be a good idea in theory, only the server can maintain an index which is updated on every insert. Whatever the client does, it's starting with a non-indexed unsorted set of data.
As usual, "It Depends" :)
If you have a stored procedure, for instance, that sends results to your presentation layer (whether a report, grid, etc.), it probably doesn't matter which method you go with.
What I typically run across, though, are views which have sorting baked in (because they were used directly by a report, for instance) but are also used by other views or other procedures with their own sorting.
So, as a general rule, I encourage others to do all sorting on the client side, and only on the server when there's reasonable justification for it.
If the sorting is just cosmetic and the client is getting the entire set of data, I would tend to let the client handle it, as it is about presentation.
Also, say in a grid, you may have to implement the sorting on the client anyway, as the user may change the ordering by clicking a column header (you don't want to have to ask the server to retrieve all the information again).
Like any other performance-related question, the universal answer is... "It depends." However, I have developed a preference for sorting on the client. We write browser-based apps, and my definition of the client is split between the web servers and the actual end-user client, the browser. I have two reasons for preferring sorting on the client to sorting in the DB.
First, there's the issue of the "right" place to do it from a design point of view. Most of the time, the order of data isn't a business-rule thing but rather an end-user convenience thing, so I view it as a function of the presentation, and I don't like to push presentation issues into the database. There are exceptions, for example, where the current price for an item is the most recent one on file. If you're getting the price with something like:
-- Most recent price in effect as of today; here the ORDER BY is part of the business rule.
SELECT TOP 1 price
FROM itemprice
WHERE ItemNumber = ?          -- parameter placeholder
  AND effectivedate <= GETDATE()
ORDER BY effectivedate DESC
Then the order of the rows is very much a part of the business rule and obviously belongs in the database. However, if you're sorting on LastName when the user views customers by last name, and then again on FirstName when they click the FirstName column header, and again on State when they click that header, then your sorting is a function of the presentation and belongs in the presentation layer.
The second reason I prefer sorting in the client layer is performance. Web servers scale horizontally: if I overload my web server with users, I can add another, and another, and another. I can have as many frontend servers as I need to handle the load, and everything works just fine. But if I overload the database, I'm screwed. Databases scale vertically; you can throw more hardware at the problem, sure, but at some point that becomes cost-prohibitive. So I like to let the DB do the selection, which it has to do, and let the client do the sorting, which it can do quite simply.
I prefer custom sorting on the client; however, I also suggest that most SQL statements should have some reasonable ORDER BY clause by default. It causes very little impact on the database, but without it you could wind up with problems later. Often, without ever realizing it, a developer or user will begin to rely on some initial default sort order. If an ORDER BY clause wasn't specified, the data is in that order only by chance. At some later date an index could change, or the data might be reorganized, and the users will complain because the initial order of the data changed out from under them.
Situations vary, and measuring performance is important.
Sometimes it's obvious - if you have a big dataset and you're interested in a small range of the sorted list (e.g. paging in a UI app) - sorting on the server saves the data transfer.
But often you have one DB and several clients, and the DB may be overloaded while the clients are idle. Sorting on the client isn't heavy, and in this situation it could help you scale.