Linking Users table to other entities/tables - sql

I have a data-driven ASP application, and like most data-driven applications, I have a Users table and a CreatedBy field in most of the tables.
I am trying to create a DeleteUserFunction in my application. Before deleting any user, I must check each and every table to see if that user has created any records.
Building relationships between the Users table and the rest of the database tables could make the DeleteUserFunction easier to validate.
Therefore, I am trying to figure out whether a Users table must be explicitly linked to other tables (via foreign key constraints) or whether this must be handled in application business logic.

First, your functional requirements need to be clarified. What should happen if a user gets deleted?
1. The user may not be deleted if records linked to her are present.
2. The user may be deleted, and
2.1 all records linked to her stay present, without a link to that user, or
2.2 all records linked to her must be deleted as well.
A foreign key constraint can support 1 (ON DELETE RESTRICT / NO ACTION) and 2.2 (ON DELETE CASCADE) directly. 2.1 is only covered if you declare ON DELETE SET NULL on a nullable user column; a plain constraint won't otherwise change the user foreign key in the referring record.
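As a minimal sketch, assuming made-up table and column names, the three cases map onto the standard referential actions like this:

    CREATE TABLE Users (
        UserId   INT PRIMARY KEY,
        UserName VARCHAR(50) NOT NULL
    );

    CREATE TABLE Orders (
        OrderId   INT PRIMARY KEY,
        CreatedBy INT,
        FOREIGN KEY (CreatedBy) REFERENCES Users (UserId)
            ON DELETE NO ACTION   -- case 1: deleting a referenced user fails
         -- ON DELETE SET NULL    -- case 2.1: orders stay, link is cleared
         -- ON DELETE CASCADE     -- case 2.2: orders are deleted as well
    );

Only one action can be active per constraint; the commented lines show the alternatives.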
However, using a foreign key constraint as the only way to enforce this might lead to strange software structure and user experience. In all cases, the violation of the constraint will be detected by the database and will, by default, lead to some exception in the data access module. But you probably don't want your application to crash with an exception; rather, you want to tell your user that this function is not available (in case 1) or trigger some application function (in cases 2.1 and 2.2). This would mean passing the exception around in the software until the layer that can handle the case is reached.
Therefore, I'd recommend performing the necessary checks to find out whether the deletion is legal, and triggering the logical consequences, as part of the application logic. The foreign key constraints may still be useful as a way to detect application errors during tests.
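For case 1, such an application-side check could be as simple as one existence query per table that carries a CreatedBy field (illustrative names; @UserId is a placeholder parameter):

    -- Refuse the delete up front instead of reacting to a constraint
    -- violation; repeat the check for every table with a CreatedBy field.
    SELECT COUNT(*) FROM Orders   WHERE CreatedBy = @UserId;
    SELECT COUNT(*) FROM Invoices WHERE CreatedBy = @UserId;

    -- Only if every count is zero:
    DELETE FROM Users WHERE UserId = @UserId;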

Related

Do you need to fully validate data both in Database and Application?

For example, if I need to store a valid phone number in a database, should I fully validate the number in SQL, or is it enough if I fully validate it in the app, before inserting it in the db, and just add some light validation in SQL constraints (like having the correct number of digits).
There is no correct answer to this question.
In general, you want the database to maintain data integrity -- and that includes valid values in columns. You want this for multiple reasons:
Databases are usually more efficient at such checks, because they run on the server, close to the data.
Databases can handle concurrent threads (not an issue for a check constraint, but an issue for other types of constraints).
Databases ensure the data integrity regardless of how the data is changed.
A check constraint (presumably what you want) is part of the data definition and applies to all inserts and updates. Such operations might occur in multiple places in the application.
The third piece is important. If you want to ensure that a phone number looks like a phone number, then you don't want someone to change it accidentally using update.
However, there might be checks that are simpler in the application. Or that might only apply when a new row is inserted, but not later updated. Or, that you want only to apply to data that comes in from the application (as opposed to manual changes). So, there are reasons why you might not want to do all checks in the database.
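For the phone-number case from the question, the "light validation" could be a check constraint along these lines (SQL Server syntax; the table name, column name and length bounds are assumptions):

    -- Reject anything that is not 7 to 15 digits; the application still
    -- performs the full validation (formatting, country rules, etc.).
    ALTER TABLE Customers
        ADD CONSTRAINT chk_phone_digits
        CHECK (LEN(Phone) BETWEEN 7 AND 15
               AND Phone NOT LIKE '%[^0-9]%');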
You definitely have to validate incoming data at your backend before e.g. doing CRUD operations on your database, since client-side validation could be omitted or even faked. It is considered good practice to validate input data at the client, but you should never ever trust the client.

What is the optimal relational database design for storing an unknown number of similar but unique entities

The database we are designing allows users to authenticate with multiple 3rd party services, mostly social media (twitter, facebook, etc). There will be an unknown and growing number of these services. Each service requires a unique set of data for authentication that is not standard with the other services.
One user may authenticate many services, but they may only authenticate with one of each type of service.
Possible Solutions:
A) The most direct solution to this issue is to simply add a column for each service to the user table which contains the JSON authentication data for that service. However, this violates normalization by leaving a large number of NULLs in the database. What happens when there are 50 of these integrations, for instance?
B) Each service gets its own table in the database. JSON is no longer needed, as each field can be properly described. Then a lookup table, "user_has_service", is needed for each service: a table which contains only two foreign keys, one for the user and one for the service, linking them together. This option seems the most correct, but it is very inefficient and will take many operations to determine what services a user has, increasing with the number of services. I believe that in this case the ID field for the lookup table would also need to be some kind of hash of the user and service together, so that duplicate inserts are not possible.
Not at all a database expert and I have been grappling with this one for quite a while. Any thoughts?
A) The most direct solution ... JSON
You are right, option A is grossly incorrect. It breaks Codd's First Normal Form, thus it is not Relational. NULL in the database is an indication of incomplete Normalisation, which leads to complex SQL code. To be avoided at all costs.
similar but unique
To be clear, that they are unique to the Service is true. That {LoginName; UserName; Email; UserId; etc} are all similar is true in the implementation sense only, not in the data.
I may need to sketch this out.
That is a great idea. A visual data model is far more effective, because (a) the mind can comprehend it much better than text, and (b) it can therefore work out details; contradictions; missing bits; etc. It is much easier to progress each iteration visually than with text.
Second, we have had visual modelling tools since 1987 (1984 for a closed group), which were made a Standard in 1993. Hopefully you appreciate that a standard-compliant model is better than a home-grown or corporate-supplied one: it displays all technical details rather than a small subset.
Is there a name for this strategy
It is plain old Relational Data Modelling, which includes Normalisation (ensuring compliance with Codd's Normal Forms, as opposed to the insanity of implementing the NFs in fragmented progressive steps).
Obstacle
One problem that needs to be understood and eliminated is this. The "theoreticians" market and propagate 1960's Record Filing Systems under the banner of "relational". That is characterised by Record IDs in every file. That method ensures the database remains physical, not logical, the very thing that Codd overcame with his Relational Model: a database that is logical and therefore extremely easy to navigate, by any querying party, current; planned; or unplanned.
The essential difference between 1960's RFS and post-1970 Relational Databases is:
whereas the RFS maintains references between Files by physical pointer (Record ID), the Relational Database maintains references between Tables by logical Key.
A logical Key is "made up from the data" as per Codd
(A datum that is fabricated by the system is not "made up from the data")
(Use of the SQL command PRIMARY KEY does not magically anoint the datum with the properties and qualities of a Relational Key: if you use PRIMARY KEY RecordID you are in 1960's physical paradigm, not the post-1970 Relational paradigm)
Logical Keys provide Relational Integrity (as distinct from Referential Integrity, which is an ordinary function of SQL), which is far superior to that obtained by 1960's RFS
As well as far superior Speed and Power (far less JOINs, and smaller sets)
Relational Database
Therefore I will give you the answer as a Relational Data Model, as per Codd.
Just one example of Relational Integrity:
the ServiceProperty FK elements in UserServiceProperty are constrained to the PK (the particular combination) in ServiceProperty
a UserServiceProperty row with Facebook.Email is prevented
A Record ID based 1960's RFS that the "theoreticians" promote as "relational" cannot do that, various errors such as that one are allowed.
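Without reproducing the full Data Model, that constraint can be sketched in plain SQL roughly as follows (keys simplified for illustration; the real model carries more columns):

    -- ServiceProperty: which Properties are legal for which Service.
    CREATE TABLE ServiceProperty (
        ServiceCode  CHAR(12) NOT NULL,
        PropertyCode CHAR(12) NOT NULL,
        PRIMARY KEY (ServiceCode, PropertyCode)
    );

    -- UserServiceProperty: the FK references the combination, so a row
    -- pairing a Service with a Property that the Service does not define
    -- (e.g. Facebook.Email) is rejected by the database itself.
    CREATE TABLE UserServiceProperty (
        UserName     CHAR(16)     NOT NULL,
        ServiceCode  CHAR(12)     NOT NULL,
        PropertyCode CHAR(12)     NOT NULL,
        Value        VARCHAR(255) NOT NULL,
        PRIMARY KEY (UserName, ServiceCode, PropertyCode),
        FOREIGN KEY (ServiceCode, PropertyCode)
            REFERENCES ServiceProperty (ServiceCode, PropertyCode)
    );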
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993
My IDEF1X Introduction is essential reading for beginners.
The IDEF1X Anatomy is a refresher for those who have lapsed.
If you have trouble reading the Predicates directly from the Data Model, let me know and I will produce them in text form.
Please feel free to ask questions, the more specific the better.
You could set up:
a referential table called services to list all the available services, with columns like service_id (primary key), service_name, description and so on. Each service is represented as one record in this table.
a table called services_properties to store the properties of the services. This table has 3 columns: service_id (foreign key to the primary key of services), property_name and property_value. A unique constraint can be set up on service_id/property_name tuples to avoid duplicates. Each service has several records in the services_properties table. This flexible structure lets you store as many different properties as needed for each service without creating a new table for each service.
a mapping table called user_services that relates users to services. Columns would be service_id and user_id, as foreign keys to the primary keys of the services table and the users table. You can query this table to easily list the services subscribed to by each user. A sketch of all three tables follows.
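Sketched as SQL (MySQL-flavoured, to match the names above; a users table with a user_id primary key is assumed):

    CREATE TABLE services (
        service_id   INT PRIMARY KEY,
        service_name VARCHAR(100) NOT NULL,
        description  VARCHAR(255)
    );

    CREATE TABLE services_properties (
        service_id     INT          NOT NULL,
        property_name  VARCHAR(100) NOT NULL,
        property_value VARCHAR(255),
        UNIQUE (service_id, property_name),  -- no duplicate properties
        FOREIGN KEY (service_id) REFERENCES services (service_id)
    );

    CREATE TABLE user_services (
        user_id    INT NOT NULL,
        service_id INT NOT NULL,
        PRIMARY KEY (user_id, service_id),   -- at most one row per user/service
        FOREIGN KEY (user_id)    REFERENCES users (user_id),
        FOREIGN KEY (service_id) REFERENCES services (service_id)
    );

The composite primary key on user_services also enforces the rule from the question that a user may authenticate with only one of each type of service.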

How to move records from one DB2 database to another DB2 database?

At regular times we want to clean up (delete) records from our production DB (DB2) and move them to an archive DB (also DB2 database having the same schema).
To complete the story there are plenty of foreign key constraints in our DB.
So if record b in table B has a foreign key to record a in table A, and we are deleting record a in the production DB, then record b must also be deleted in the production DB, and both records must be created in the archive DB.
Of course it is very important that no data gets lost. So it must not be possible that we delete records in the production DB while these records never get inserted into the archive DB.
What is the best approach to do this?
FYI I have checked https://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dm.doc/doc/r0024482.html and the proposed solutions have following short comings.
Load utility, Ingest utility, Import utility: these only address the part of inserting records into the archive DB. They don't cover the full move.
Export utility: this only covers a means of exporting data (which might then be imported by the Import utility).
db2move, restore command, db2relocatedb, ADMIN_COPY_SCHEMA, ADMIN_MOVE_TABLE and split mirror: these are not an option if you only want to move specific records meeting a certain condition to the archive DB.
So based on my research, the current best solution seems to be a kind of in-house developed script that is:
1. exporting the records to move in IXF format,
2. importing those exported records into the archive DB,
3. deleting those records in the production DB.
In order to cause no transaction-log-full errors, this script should do this in batches (e.g. of 50000 records). A sketch of the three steps for a single table is given below.
In order to have no foreign key constraint errors in step 3, we must also assure that in step 1 we are exporting all records having a foreign key constraint to the exported records, and all records having a foreign key constraint to these records ...
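For one table, and hedging on the exact options, the three steps might look like this in DB2 CLP syntax (table names, paths and the WHERE condition are placeholders):

    -- Step 1: export the records to move (IXF format).
    EXPORT TO /tmp/orders.ixf OF IXF
        SELECT * FROM prod.orders WHERE order_date < '2015-01-01';

    -- Step 2: import them while connected to the archive DB, committing
    -- in batches so the transaction log stays small.
    IMPORT FROM /tmp/orders.ixf OF IXF
        COMMITCOUNT 50000
        INSERT INTO archive.orders;

    -- Step 3: delete them from production in batches of 50000, repeating
    -- this statement until no rows remain.
    DELETE FROM (
        SELECT 1 FROM prod.orders
        WHERE order_date < '2015-01-01'
        FETCH FIRST 50000 ROWS ONLY
    );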
Questions that ask the "best" approach have limited use because the assessment criteria are omitted.
Sometimes the assessment criteria differ between technicians and business people.
Sometimes multiple policies of the client company can determine such criteria, so awareness of local policies, procedures and patterns is crucial.
Often the operational requirements, security requirements and licensing requirements will influence the approach, apart from the skill level and experience of the implementation team.
Occasionally corporates have specific standardised tools for archival and deletion, or specific patterns sometimes influenced by the industry-sector or even industry-specific regulatory requirements.
As stackoverflow is a programming-oriented website, questions like yours can be considered off-topic, because you are asking for advice about which design approaches are possible while omitting lots of context that is specific to your company/industry sector and that may well influence the solution pattern.
Some typical requirements or questions that influence the approach are below:
do local security requirements allow the data to leave the Db2 environment (i.e. data stored on disk outside of Db2 tables)? Sometimes this constrains the use of export, or load-from-file/pipe. The data can be at risk of modification, inspection or deletion (whether accidental or deliberate) whilst outside of the RDBMS.
the restartability of the solution in the event of runtime errors. This is often a crucial requirement. When copying data between different physical databases (even if the same RDBMS) there are many possibilities of error (network errors, resource issues, concurrency issues, operational issues etc). Must the solution guarantee that any restarts after failures resume from the point of failure, or must cleanup happen and the entire job be restarted? The answer can determine the design.
if federation exists between the two databases (or if it can be added within the Db2-licence terms), then this is often the easiest practical approach to push or pull content. Local and remote tables appear to be in the same logical database which simplifies the approach. The data never needs to leave the RDBMS. This also simplifies restartability of failed jobs. It also allows the data to remain encrypted if that is a requirement.
if SQL-replication or Q-based-replication is licensed then it can be configured to intelligently sync the source and target tables and respect RI if suitably configured. This approach requires significant configuration skills.
if the production database is highly-available, and/or if the archival database is highly-available then the solution must respect the HA approach. Sometimes this prevents use of LOAD, depending on the operating-system platform of the Db2-server.
timing windows for scheduling are often crucial. If the archival+removal job must be guaranteed to fully complete within specific time intervals, this can influence the design pattern.
if fastest rollout is a key requirement then range-partitioning is usually the best option.
There are tools out there for this (such as Optim Archive) which may better satisfy requirements you didn't realize you had.
In the interim - look into federation and the tool asntdiff.
On the archive database you can define a connection to the live database (CREATE SERVER). Using this definition you can define nicknames for the live tables (CREATE NICKNAME). Using these nicknames you can load the appropriate data into your archive table with your favorite data movement utility - load, import, insert, etc.
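Roughly, and hedging on the exact server options, that setup looks like this (run on the archive database; all names and credentials are placeholders):

    -- One-time federation setup on the archive database.
    CREATE WRAPPER DRDA;
    CREATE SERVER PRODSRV TYPE DB2/UDB VERSION '10.5' WRAPPER DRDA
        AUTHORIZATION "produser" PASSWORD "secret"
        OPTIONS (DBNAME 'PRODDB');
    CREATE USER MAPPING FOR USER SERVER PRODSRV
        OPTIONS (REMOTE_AUTHID 'produser', REMOTE_PASSWORD 'secret');

    -- The nickname makes the live table look local.
    CREATE NICKNAME PROD_ORDERS FOR PRODSRV.PROD.ORDERS;

    -- Pull the rows to archive; the data never leaves Db2.
    INSERT INTO archive.orders
        SELECT * FROM PROD_ORDERS WHERE order_date < '2015-01-01';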
Once loaded you can verify the tables by using the asntdiff tool with appropriate selection criteria. The -f option is great.
Once you are satisfied the data exists in both locations you can delete the rows in the live database.
For your foreign key relationships - use the catalog view SYSCAT.REFERENCES to find such dependencies. You can define your foreign keys as "not enforced" (or not define them at all) in the archive database to avoid errors during the previous process.
Data archiving is a big and common topic regardless of the database. You may also want to look at range partitioned tables for better performance and control.

How to use database triggers in a real world project?

I've learned a lot about triggers and active databases in the last weeks, but I have some questions about real-world examples for these.
At work we use the Entity Framework with ASP.NET and an MSSQL Server. We just use the auto-generated constraints and no triggers.
When I heard about triggers, I asked myself the following questions:
Which tasks can be performed by triggers?
e.g.: Generation of reporting data: currently the data for the reports is created in VB, but I think a trigger could handle this as well. The creation in VB takes a lot of time, and the user should not need to wait for it, because it's not necessary for his work.
Is this an example for a perfect task for a trigger?
How do OR-Mappers handle trigger-manipulated data?
e.g.: Does an OR-Mapper recognize when a trigger has manipulated data? The Entity Framework seems to cache a lot of data, so I'm not sure if it reads the updated data when a trigger manipulates the data after the insert/update/delete from the framework is processed.
How much constraint handling should be within the database?
e.g.: Sometimes constraints in the database seem much easier and faster than in the layer above (vb.net, ...), but how do you throw exceptions to the upper layer so that they can be handled by the OR-Mapper?
Is there a good solution for handling SQL exceptions (from triggers) in any OR-Mapper?
Thanks in advance
When you hear about a new tool or feature, it doesn't mean you have to use it everywhere. You should think about the design of your application.
Triggers are used a lot when the logic is in the database, but if you build an ORM layer on top of your database, you want the logic in the business layer using your ORM. It doesn't mean you should not use triggers. It means you should use them with an ORM in the same way as stored procedures or database functions - only when it makes sense or when it improves performance. If you push a lot of logic into the database, you can throw away the ORM and perhaps your whole business layer and use a two-layered architecture where the UI talks directly to the database, which does everything you need - such an architecture is considered "old".
When using an ORM, a trigger can be helpful for some DB-generated data like audit columns or custom sequences of primary key values.
Current ORMs mostly don't like triggers - they can only react to changes to the currently processed record, so, for example, if you save an Order record and your update trigger modifies all ordered items, there is no automatic way to let the ORM know about that - you must reload the data manually. In EF, all data modified or generated in the database must be marked with StoreGeneratedPattern.Identity or StoreGeneratedPattern.Computed - EF fully follows the pattern where logic is either in the database or in the application. Once you define that a value is assigned in the database, you cannot change it in the application (it will not persist).
Your application logic should be responsible for data validation and should call persistence only if validation passes. You should avoid unnecessary transactions and roundtrips to the database when you can know upfront that the transaction will fail.
I use triggers for two main purposes: auditing and updating modification/insertion times. When auditing, the triggers push data to related audit tables. This doesn't affect the ORM in any way as those tables are not typically mapped in the main data context (there's a separate auditing data context used when needed to look at audit data).
When recording/modifying insert/modification times, I typically mark those properties in the model as [DatabaseGenerated(DatabaseGeneratedOption.Computed)]. This prevents any values set in the data layer from being persisted back to the DB and allows the trigger to enforce setting the DateTime fields properly.
It's not a hard and fast rule that I manage auditing and these dates in this way. Sometimes I need more auditing information than is available in the database itself and handle auditing in the data layer instead. Sometimes I want to force the application to update dates/times (since they may need to be the same over several rows/tables updated at the same time). In those cases I might make the field nullable, but [Required] in the model to force a date/time to be set before the model can be persisted.
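The database side of that pattern might look like the following T-SQL sketch (the table, key and column names are hypothetical; the EF property is marked Computed, so the value always comes from the trigger):

    -- Keep ModifiedAt accurate regardless of what the application sends.
    CREATE TRIGGER trg_Orders_SetModified
    ON dbo.Orders
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE o
        SET    o.ModifiedAt = SYSUTCDATETIME()
        FROM   dbo.Orders AS o
        JOIN   inserted    AS i ON i.OrderId = o.OrderId;
    END;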
The old Infomodeler/Visiomodeler ORM (not what you think - it was Object Role Modeling) provided an alternative when generating the physical model: it would provide all the referential integrity with triggers. For two reasons:
Some dbmses (notably Sybase/SQL Server) didn't have declarative RI yet, and
It could provide much more finely grained integrity - e.g. "no more than two children" or "sons or daughters but not both" or "mandatory son or daughter but not both".
So the trigger logic related to the model in the same way that any RI constraint did. In SQL Server it handled violations with RAISERROR.
A conceptual issue with triggers is that they are essentially context-free - they always fire regardless of context (at least without great pain), and you might do better to include their logic with the rest of the context-specific logic. So global domain constraints are the only place I find them useful - which I guess is another general way to identify "referential integrity".
Triggers are used to maintain the integrity and consistency of data (complementing constraints), to help the database designer ensure certain actions are completed, and to create database change logs.
For example, given numeric input, if you want the value to be constrained to, say, less than 100, you could write a trigger that fires for every row on update or insert and raises an application error if the value of that column does not meet that constraint.
Suppose you want to log historical changes to a table. You could create a trigger that fires AFTER each INSERT, UPDATE, and DELETE and inserts the data into a logging table (see the sketch below). If you need to execute custom logic, then triggers may appeal to you.
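As a sketch of that logging case in T-SQL (the tables and columns are hypothetical), where an UPDATE logs both the before and after image:

    CREATE TRIGGER trg_Prices_Audit
    ON dbo.Prices
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Copy every change into a history table: new images come from
        -- the inserted pseudo-table, old images from deleted.
        INSERT INTO dbo.PricesHistory (PriceId, Amount, ChangedAt, Image)
        SELECT PriceId, Amount, SYSUTCDATETIME(), 'NEW' FROM inserted
        UNION ALL
        SELECT PriceId, Amount, SYSUTCDATETIME(), 'OLD' FROM deleted;
    END;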

archiving strategies and limitations of data in a table

Environment: Jboss, Mysql, JPA, Hibernate
Our web application will be catering to a large number of users (~1,000,000), and there are lots of child tables where user-specific data is stored (e.g. personal, health, forum contributions ...).
What would be the best practice to archive users & user-specific information?
[a] Would it be wise to move the archived users & user-specific information to their respective archive tables within the same database (e.g. user_archive, user_forum_comments_archive ...), OR
[b] Would you just mark the database entries with a flag in the original table(s) and query only non-archived entries?
We have a unique constraint on User.loginid. How do you handle this requirement if the users are archived via 1-[a]? I.e. if a user with loginid 'samuel' gets moved into the archive table and a new user gets added with the same name in the original table, how would you prevent this? What would be the best strategy to address the unique key constraints?
We have a requirement to selectively archive records and bring them back if necessary. Would you rely on database tools, or would you handle this via the persistence APIs exposed by the JPA entity model?
Personally, I'd go for solution "[b]".
Having things split on two table sets (current and archived) would make things a bit hard to manage in terms of common RDBMS concepts (example: forum comment author would be a foreign key pointing to the user's table... but you can't have a field behave as a foreign key to two different tables).
You could go for a compromise (the users table uses solution "b", while all the other tables like profile get archived to a twin table as per solution "a"), but this would make things unnecessarily complicated for your code (in some cases you have to look at the non-archived rows, in some at the archived only, in other cases at the union of both).
Solution B would easily solve requirements #2 and #3, too. Uniqueness of the user name is easy to enforce if everything is in the same table, and resurrecting archived users is just a matter of flipping a bit (Archived=Y/N) on the main user table.
10% is not much, I doubt that the difference in terms of performance would really justify the extra complexity (and risk of bugs).
I would put an archived flag on the table and then create a view to use when you don't want to see archived records (sketched below). That way, I suspect, people will be more consistent in applying the archive flag.
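A minimal sketch, assuming MySQL and an existing users table:

    -- Flag column; existing rows default to "not archived".
    ALTER TABLE users ADD COLUMN archived CHAR(1) NOT NULL DEFAULT 'N';

    -- Most queries go through the view and never see archived rows.
    CREATE VIEW active_users AS
        SELECT *
        FROM   users
        WHERE  archived = 'N';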