archiving strategies and limitations of data in a table - sql

Environment: Jboss, Mysql, JPA, Hibernate
Our web application will cater to a large number of users (~1,000,000), and there are many child tables where user-specific data is stored (e.g. personal, health, forum contributions ...).
What would be the best practice for archiving users and user-specific information?
[a] Would it be wise to move the archived user and user-specific information to their respective tables within the same database (e.g. user_archive, user_forum_comments_archive ...), OR
[b] Would you just mark the database entries with a flag in the original table(s) and query only non-archived entries?
We have a unique constraint on User.loginid. How do you handle this requirement if the users are archived via 1-[a]? That is, if a user with loginid 'samuel' gets moved into the archive table and a new user is added with the same name in the original table, how would you prevent this? What would be the best strategy to address the unique key constraints?
We have a requirement to selectively archive records and bring them back if necessary. Would you rely on database tools, or would you handle this via the persistence APIs exposed by the JPA entity model?

Personally, I'd go for solution "[b]" (the flag).
Having things split across two table sets (current and archived) would make things hard to manage in terms of common RDBMS concepts. For example, the forum comment author would be a foreign key pointing to the users table, but a column can't act as a foreign key to two different tables.
You could go for a compromise (the users table uses the flag as per solution "b", while all the other tables, like profile, get archived to a twin table as per solution "a"), but this would make your code unnecessarily complicated: in some cases you have to look at the non-archived rows, in some at the archived ones only, and in others at the union of both.
Solution "b" would easily solve the #2 and #3 requirements (the unique login id and bringing archived records back), too. Uniqueness of the user name is easy to enforce if everything is in the same table, and resurrecting archived users is just a matter of flipping a bit (Archived=Y/N) on the main user table.
10% is not much, I doubt that the difference in terms of performance would really justify the extra complexity (and risk of bugs).

I would put an archived flag on the table and then create a view to use when you don't want to see archived records. That way people will be more consistent in applying the archive flag I suspect.
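A minimal sketch of the flag-plus-view idea, using Python's sqlite3 module as a stand-in for MySQL so it runs anywhere. The table, column and view names (users, archived, active_users) are assumptions for illustration, not taken from the question's schema.

```python
import sqlite3

# SQLite stands in for MySQL here; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,
        loginid  TEXT NOT NULL UNIQUE,   -- uniqueness spans active and archived rows
        archived INTEGER NOT NULL DEFAULT 0 CHECK (archived IN (0, 1))
    );
    -- Day-to-day queries read the view, so the filter cannot be forgotten.
    CREATE VIEW active_users AS
        SELECT id, loginid FROM users WHERE archived = 0;
""")

conn.execute("INSERT INTO users (loginid) VALUES ('samuel')")
conn.execute("INSERT INTO users (loginid) VALUES ('maria')")

# Archive 'samuel': the row stays in place, only the flag changes.
conn.execute("UPDATE users SET archived = 1 WHERE loginid = 'samuel'")
active = [r[0] for r in conn.execute("SELECT loginid FROM active_users ORDER BY loginid")]

# A new 'samuel' is rejected because the archived row still holds the login id.
try:
    conn.execute("INSERT INTO users (loginid) VALUES ('samuel')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Bringing the user back is the reverse update.
conn.execute("UPDATE users SET archived = 0 WHERE loginid = 'samuel'")
restored = [r[0] for r in conn.execute("SELECT loginid FROM active_users ORDER BY loginid")]
```

Child tables keep their ordinary foreign key to users, so nothing else has to move when a user is archived or restored.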

Related

What is the optimal relational database design for storing an unknown number of similar but unique entities

The database we are designing allows users to authenticate with multiple 3rd party services, mostly social media (twitter, facebook, etc). There will be an unknown and growing number of these services. Each service requires a unique set of data for authentication that is not standard with the other services.
One user may authenticate many services, but they may only authenticate with one of each type of service.
Possible Solutions:
A) The most direct solution to this issue is to simply add a column for each service to the user table which contains the JSON authentication for that service. However, this violates normalization by leaving a large number of nulls in the database. What happens when there are 50 of these integrations for instance?
B) Each service gets its own table in the database. JSON is no longer needed as each field can be properly described. Then a lookup table is needed "user_has_service" for each service. This is a table which contains only two foreign keys, one for the user and one for the service, linking them together. This option seems the most correct but is very inefficient and will take many operations to determine what services a user has, increasing with the number of services. I believe also in this case, the ID field for the lookup table would need to be some kind of hash of the user and service together so that duplicate inserts are not possible.
Not at all a database expert and I have been grappling with this one for quite a while. Any thoughts?
A) The most direct solution ... JSON
You are right, option A is grossly incorrect. It breaks Codd's First Normal Form, thus it is not Relational. NULL in the database is an indication of incomplete Normalisation, which leads to complex SQL code. To be avoided at all costs.
similar but unique
To be clear, that they are unique to the Service is true. That {LoginName; UserName; Email; UserId; etc} are all similar is true in the implementation sense only, not in the data.
I may need to sketch this out.
That is a great idea. A visual data model is far more effective, because (a) the mind can comprehend it much better than text, and (b) therefore work out details; contradictions; missing bits; etc. Much easier to progress each iteration visually, than with text.
Second, we have had visual modelling tools since 1987 (1984 for a closed group), which were made a Standard in 1993. Hopefully you appreciate that a standard-compliant model is better than a home-grown or corporate-supplied one. It displays all technical details rather than a small subset.
Is there a name for this strategy
It is plain old Relational Data Modelling, which includes Normalisation (ensuring compliance with Codd's Normal Forms, as opposed to the insanity of implementing the NFs in fragmented progressive steps).
Obstacle
One problem that needs to be understood and eliminated is this. The "theoreticians" market and propagate 1960's Record Filing Systems under the banner of "relational". That is characterised by a Record ID in every file. That method ensures the database remains physical, not logical, the very thing that Codd overcame with his Relational Model: a database that is logical and therefore extremely easy to navigate, by any querying party, current; planned; or unplanned.
The essential difference between 1960's RFS and post-1970 Relational Databases is:
whereas the RFS maintains references between Files by physical pointer (Record ID), the Relational Database maintains references between Tables by logical Key.
A logical Key is "made up from the data" as per Codd
(A datum that is fabricated by the system is not "made up from the data")
(Use of the SQL command PRIMARY KEY does not magically anoint the datum with the properties and qualities of a Relational Key: if you use PRIMARY KEY RecordID you are in 1960's physical paradigm, not the post-1970 Relational paradigm)
Logical Keys provide Relational Integrity (as distinct from Referential Integrity, which is an ordinary function of SQL), which is far superior to that obtained by 1960's RFS
As well as far superior Speed and Power (far fewer JOINs, and smaller sets)
Relational Database
Therefore I will give you the answer as a Relational Data Model, as per Codd.
Just one example of Relational Integrity:
the ServiceProperty FK elements in UserServiceProperty are constrained to the PK (particular combination) in ServiceProperty
a UserServiceProperty row with Facebook.Email is prevented
A Record ID based 1960's RFS that the "theoreticians" promote as "relational" cannot do that, various errors such as that one are allowed.
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993
My IDEF1X Introduction is essential reading for beginners.
The IDEF1X Anatomy is a refresher for those who have lapsed.
If you have trouble reading the Predicates directly from the Data Model, let me know and I will produce them in text form.
Please feel free to ask questions, the more specific the better.
You could set up:
a referential table called services to list all the available services, with columns like service_id (primary key), service_name and descriptions and so on. Each service is represented as one record in this table.
a table called services_properties to store the properties of the services; this table has 3 columns: service_id (foreign key to the primary key of services), property_name and property_value. A unique constraint can be set up on service_id/property_name tuples to avoid duplicates. Each service has several records in the services_properties table. This flexible structure lets you store as many different properties as needed for each service without creating a new table for each service
a mapping table called user_services, that relates users to services. Columns would be service_id and user_id, as foreign keys to the primary keys of the services table and users table. You can query this table to easily list the services subscribed by each user.
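To make the mapping table concrete, here is a small sketch in Python with sqlite3 (a stand-in for whatever RDBMS you use). The column names follow the description above where given; the users table layout and the sample data are assumptions. Note that the composite primary key on user_services already blocks duplicate inserts, so no hash of user and service is needed for the ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE users    (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE services (service_id INTEGER PRIMARY KEY,
                           service_name TEXT NOT NULL UNIQUE);
    CREATE TABLE user_services (
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        service_id INTEGER NOT NULL REFERENCES services(service_id),
        PRIMARY KEY (user_id, service_id)   -- one row per user per service
    );
""")

conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO services VALUES (?, ?)", [(1, "twitter"), (2, "facebook")])
conn.executemany("INSERT INTO user_services VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])

# Authenticating the same service twice violates the composite key.
try:
    conn.execute("INSERT INTO user_services VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# One query lists a user's services, however many services exist.
alice_services = [r[0] for r in conn.execute("""
    SELECT s.service_name
    FROM user_services us
    JOIN services s ON s.service_id = us.service_id
    WHERE us.user_id = ?
    ORDER BY s.service_name
""", (1,))]
```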

How to move records from one DB2 database to another DB2 database?

At regular times we want to clean up (delete) records from our production DB (DB2) and move them to an archive DB (also DB2 database having the same schema).
To complete the story there are plenty of foreign key constraints in our DB.
So if record b in table B has a foreign key to record a in table A and we are deleting record a in production DB then also record b must be deleted in the production B and both records must be created in the archive DB.
Of course it is very important that no data gets lost. So that it is not possible that we delete records in the production DB while these records will never be inserted in the archive DB.
What is the best approach to do this ?
FYI I have checked https://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dm.doc/doc/r0024482.html and the proposed solutions have the following shortcomings.
Load utility, Ingest utility, Import utility: these only address the part of inserting records into the archive DB. They don't cover the full move.
Export utility: this only covers a means of exporting data (which might then be imported by the Import utility).
db2move, restore command, db2relocatedb, ADMIN_COPY_SCHEMA, ADMIN_MOVE_TABLE and split mirror : are not an option if you only want to move specific records meeting a certain condition to the archive DB.
So based on my research, the current best solution seems to be an in-house script that:
1. exports the records to move in IXF format,
2. imports those exported records into the archive DB,
3. deletes those records from the production DB.
To avoid transaction-log-full errors, this script should work in batches (e.g. of 50000 records).
To avoid foreign key constraint errors in step 3, step 1 must also export all records that have a foreign key to the exported records, all records that have a foreign key to those records, and so on.
Questions that ask the "best" approach have limited use because the assessment criteria are omitted.
Sometimes the assessment criteria differ between technicians and business people.
Sometimes multiple policies of the client company can determine such criteria, so awareness of local policies and procedures or patterns is crucial.
Often the operational-requirements and security-requirements and licensing-requirements will influence the approach, apart from the skill level and experience of the implementation-team.
Occasionally corporates have specific standardised tools for archival and deletion, or specific patterns sometimes influenced by the industry-sector or even industry-specific regulatory requirements.
As stackoverflow is a programming oriented website, questions like yours can be considered off-topic because you are asking for advice about which design approaches are possible while omitting lots of context that is specific to your company/industry-sector and that may well influence the solution pattern.
Some typical requirements or questions that influence the approach are below:
do local security requirements allow the data to leave the Db2 environment (i.e. data stored on disk outside of Db2 tables)? Sometimes this constrains use of export, or load-from-file/pipe. The data can be at risk of modification or inspection or deletion (whether accidental or deliberate) whilst outside of the RDBMS.
the restartability of the solution in the event of runtime errors. This is often a crucial requirement. When copying data between different physical databases (even if the same RDBMS) there are many possibilities of error (network errors, resource issues, concurrency issues, operational issues etc). Must the solution guarantee that any restarts after failures resume from the point of failure, or must cleanup happen and the entire job be restarted? The answer can determine the design.
if federation exists between the two databases (or if it can be added within the Db2-licence terms), then this is often the easiest practical approach to push or pull content. Local and remote tables appear to be in the same logical database which simplifies the approach. The data never needs to leave the RDBMS. This also simplifies restartability of failed jobs. It also allows the data to remain encrypted if that is a requirement.
if SQL-replication or Q-based-replication is licensed then it can be configured to intelligently sync the source and target tables and respect RI if suitably configured. This approach requires significant configuration skills.
if the production database is highly-available, and/or if the archival database is highly-available then the solution must respect the HA approach. Sometimes this prevents use of LOAD, depending on the operating-system platform of the Db2-server.
timing windows for scheduling are often crucial. If the archival+removal job must be guaranteed to complete fully within specific time intervals, this can influence the design pattern.
if fastest rollout is a key requirement then range-partitioning is usually the best option.
There are tools out there for this (such as Optim Archive) which may better satisfy requirements you didn't realize you had.
In the interim - look into federation and the tool asntdiff.
On the archive database you can define a connection to the live database (CREATE SERVER). Using this definition you can define nicknames for the live tables (CREATE NICKNAME). Using these nicknames you can load the appropriate data into your archive tables with your favorite data movement utility: load, import, insert, etc.
Once loaded you can verify the tables by using the asntdiff tool with appropriate selection criteria. The -f option is great.
Once you are satisfied the data exists in both locations you can delete the rows in the live database.
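The copy, verify, then delete sequence can be sketched in miniature as follows. This is not Db2: it uses Python's sqlite3 with an attached second database playing the archive, and the parent/child tables, the created column and the archive_batch helper are all hypothetical names for illustration. The points that carry over are the batching, copying parents together with their children, verifying before deleting, and deleting children before parents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                      # plays the production DB
conn.execute("ATTACH DATABASE ':memory:' AS archive")   # plays the archive DB
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE main.parent (id INTEGER PRIMARY KEY, created TEXT NOT NULL);
    CREATE TABLE main.child  (id INTEGER PRIMARY KEY,
                              parent_id INTEGER NOT NULL REFERENCES parent(id));
    -- Archive copies carry no enforced foreign keys.
    CREATE TABLE archive.parent (id INTEGER PRIMARY KEY, created TEXT NOT NULL);
    CREATE TABLE archive.child  (id INTEGER PRIMARY KEY, parent_id INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO main.parent VALUES (?, ?)",
                 [(1, "2019-01-01"), (2, "2019-06-01"), (3, "2019-12-31"),
                  (4, "2021-01-01"), (5, "2022-01-01")])
conn.executemany("INSERT INTO main.child (parent_id) VALUES (?)",
                 [(p,) for p in range(1, 6) for _ in range(2)])
conn.commit()


def archive_batch(conn, cutoff, batch_size):
    """Move up to batch_size parents older than cutoff, plus their children."""
    ids = [r[0] for r in conn.execute(
        "SELECT id FROM main.parent WHERE created < ? ORDER BY id LIMIT ?",
        (cutoff, batch_size))]
    if not ids:
        return 0
    marks = ",".join("?" * len(ids))  # keep batches below SQLite's parameter limit
    with conn:  # commit on success, roll back if anything below raises
        conn.execute(f"INSERT INTO archive.parent SELECT * FROM main.parent "
                     f"WHERE id IN ({marks})", ids)
        conn.execute(f"INSERT INTO archive.child SELECT * FROM main.child "
                     f"WHERE parent_id IN ({marks})", ids)
        # Verify that every selected row exists in the archive before deleting.
        for table, col in (("parent", "id"), ("child", "parent_id")):
            src = conn.execute(f"SELECT COUNT(*) FROM main.{table} "
                               f"WHERE {col} IN ({marks})", ids).fetchone()[0]
            dst = conn.execute(f"SELECT COUNT(*) FROM archive.{table} "
                               f"WHERE {col} IN ({marks})", ids).fetchone()[0]
            if src != dst:
                raise RuntimeError(f"archive copy of {table} is incomplete")
        conn.execute(f"DELETE FROM main.child WHERE parent_id IN ({marks})", ids)
        conn.execute(f"DELETE FROM main.parent WHERE id IN ({marks})", ids)
    return len(ids)


def count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


moved = 0
while True:
    n = archive_batch(conn, "2020-01-01", 2)
    if n == 0:
        break
    moved += n
```

Here one local transaction covers both databases. Between two real Db2 databases that guarantee depends on how federation is set up, which is why the verification step (asntdiff in the description above) comes before any delete.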
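For your foreign key relationships, use the catalog views to find such dependencies: SYSCAT.REFERENCES lists referential (foreign key) constraints, while SYSCAT.TABDEP covers dependencies of views and materialized query tables. You can define your foreign keys as "not enforced" (or not define them at all) in the archive database to avoid errors during the previous process.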
Data archiving is a big and common topic regardless of the database. You may also want to look at range partitioned tables for better performance and control.

Should I create multiple tables, or even databases for multiple users of a CRM

I'm working on creating an application best described as a CRM. There is a relatively complex table structure, and I'm thinking about allowing users to do a fair bit of customization (adding fields and the like). One concern is that I will be reaching a certain level of scale almost immediately. We have about 50,000 individual users who will be coming online within about nine months of launch. So I want to build to last.
I'm thinking about two and maybe even three options.
One table set with a userID column on everything, plus custom attributes handled by one table that indexes the custom attributes and another table that holds their values, which can then be joined to the user's existing contact records. -- From what I've read, this seems like the right option, but I keep feeling like it's not. It seems like once these tables reach millions of records, searching for just one user's records in every query is going to become a database hog.
For each user account, recreate the table set, prefixed with a unique identifier (the userID, for example). Then rather than using a WHERE userID=? everywhere I can use a FROM ?_contacts. For attributes I could then have a custom attributes table where users could add additional columns for custom attributes. -- This feels like the simplest way to go, though of course when I decide to change the database structure there would be a migration from hell.
The third option, which I'm pretty confident is wrong, but for that reason alone I can not rule out, is that a new database should be created for each user with all the requisite tables.
Am I crazy? Is option one really the best?
The first method is the best. Create individual userIds and then you can assign specific roles to them. Database retrieval time does depend on the number of records, but you can offset that by writing efficient SQL queries to fetch the data. According to this site, you probably won't run out of memory or run into concurrency issues: with a good server, performance ought to be good, provided your queries are efficient.
If you recreate table sets, you will just end up creating lots of tables, which can make indexing slow and is bad practice. Instead, opt for a relational database schema rather than an ordinary one, and normalize the database and its tables to improve efficiency.
Creating a new database for each and every user just compounds the complexity of both of the above, resulting in shabby and disorganized database access. If you decide to run individual database instances for every single user, you will end up consuming your server's physical resources, such as RAM and CPU, which will affect the service quality for all the other users.
Take up option 1. Assign separate userIds and assign them roles and privileges where needed. That is more efficient than the other two methods.
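As a rough illustration of option 1, here is a sketch in Python with sqlite3. The table and column names (contacts, custom_attributes, contact_attribute_values) are assumptions, not anything from the question. The composite index that leads with user_id is what keeps "just one user's records" cheap as the tables grow, and the query plan check at the end shows the lookup going through that index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (
        contact_id INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        name       TEXT NOT NULL
    );
    -- Lookups by owner hit this index instead of scanning every user's rows.
    CREATE INDEX idx_contacts_user ON contacts (user_id, name);

    -- One row per custom field a user has defined...
    CREATE TABLE custom_attributes (
        attribute_id INTEGER PRIMARY KEY,
        user_id      INTEGER NOT NULL,
        label        TEXT NOT NULL,
        UNIQUE (user_id, label)
    );
    -- ...and one row per contact that has a value for that field.
    CREATE TABLE contact_attribute_values (
        contact_id   INTEGER NOT NULL REFERENCES contacts(contact_id),
        attribute_id INTEGER NOT NULL REFERENCES custom_attributes(attribute_id),
        value        TEXT,
        PRIMARY KEY (contact_id, attribute_id)
    );
""")
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)",
                 [(1, 1, "Ann"), (2, 1, "Ben"), (3, 2, "Cy")])
conn.execute("INSERT INTO custom_attributes VALUES (1, 1, 'birthday')")
conn.execute("INSERT INTO contact_attribute_values VALUES (1, 1, '1990-05-01')")

# Contacts of user 1 with any custom values attached.
rows = conn.execute("""
    SELECT c.name, a.label, v.value
    FROM contacts c
    LEFT JOIN contact_attribute_values v ON v.contact_id = c.contact_id
    LEFT JOIN custom_attributes a ON a.attribute_id = v.attribute_id
    WHERE c.user_id = ?
    ORDER BY c.name
""", (1,)).fetchall()

# Ask the planner how it would find one user's contacts.
plan = " ".join(r[-1] for r in conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM contacts WHERE user_id = ?", (1,)))
```

A schema change here is one migration on one table set, instead of one per user as in the other two options.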

How to manage multiple versions of the same record

I am doing short-term contract work for a company that is trying to implement a check-in/check-out type of workflow for their database records.
Here's how it should work...
A user creates a new entity within the application. There are about 20 related tables that will be populated in addition to the main entity table.
Once the entity is created the user will mark it as the master.
Another user can make changes to the master only by "checking out" the entity. Multiple users can checkout the entity at the same time.
Once the user has made all the necessary changes to the entity, they put it in a "needs approval" status.
After an authorized user reviews the entity, they can promote it to master which will put the original record in a tombstoned status.
The way they are currently accomplishing the "check out" is by duplicating the entity records in all the tables. The primary keys include EntityID + EntityDate, so they duplicate the entity records in all related tables with the same EntityID and an updated EntityDate and give it a status of "checked out". When the record is put into the next state (needs approval), the duplication occurs again. Eventually it will be promoted to master at which time the final record is marked as master and the original master is marked as dead.
This design seems hideous to me, but I understand why they've done it. When someone looks up an entity from within the application, they need to see all current versions of that entity. This was a very straightforward way for making that happen. But the fact that they are representing the same entity multiple times within the same table(s) doesn't sit well with me, nor does the fact that they are duplicating EVERY piece of data rather than only storing deltas.
I would be interested in hearing your reaction to the design, whether positive or negative.
I would also be grateful for any resources you can point me to that might be useful for seeing how someone else has implemented such a mechanism.
Thanks!
Darvis
I've worked on a system like this which supported the static data for trading at a very large bank. The static data in this case is things like the details of counterparties, standard settlement instructions, currencies (not FX rates) etc. Every entity in the database was versioned, and changing an entity involved creating a new version, changing that version and getting the version approved. They did not however let multiple people create versions at the same time.
This led to a horribly complex database, with every join having to take version and approval state into account. In fact the software I wrote for them was middleware that abstracted this complex, versioned data into something that end-user applications could actually use.
The only thing that could have made it any worse was to store deltas instead of complete versioned objects. So the point of this answer is - don't try to implement deltas!
This looks like an example of a temporal database schema -- Often, in cases like that, there is a distinction made between an entity's key (EntityID, in your case) and the row primary key in the database (in your case, {EntityID, date}, but often a simple integer). You have to accept that the same entity is represented multiple times in the database, at different points in its history. Every database row still has a unique ID; it's just that your database is tracking versions, rather than entities.
You can manage data like that, and it can be very good at tracking changes to data, and providing accountability, if that is required, but it makes all of your queries quite a bit more complex.
You can read about the rationale behind, and the design of, temporal databases on Wikipedia.
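To show what "tracking versions rather than entities" looks like in a table, here is a small sketch in Python with sqlite3. It models a single versioned table keyed on {EntityID, EntityDate} with the statuses from the question; the table name, the payload column and the promote helper are hypothetical, and the partial unique index is a SQLite feature that not every RDBMS offers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity_versions (
        entity_id   INTEGER NOT NULL,
        entity_date TEXT    NOT NULL,
        status      TEXT    NOT NULL CHECK (status IN
                    ('master', 'checked out', 'needs approval', 'tombstoned')),
        payload     TEXT,
        PRIMARY KEY (entity_id, entity_date)   -- a row is a version, not an entity
    );
    -- At most one master version per entity.
    CREATE UNIQUE INDEX one_master
        ON entity_versions (entity_id) WHERE status = 'master';
""")
conn.executemany("INSERT INTO entity_versions VALUES (?, ?, ?, ?)",
                 [(1, "2024-01-01", "master", "v1"),
                  (1, "2024-02-01", "needs approval", "v2")])
conn.commit()


def promote(conn, entity_id, entity_date):
    """Tombstone the current master and promote the given version in one transaction."""
    with conn:
        conn.execute("UPDATE entity_versions SET status = 'tombstoned' "
                     "WHERE entity_id = ? AND status = 'master'", (entity_id,))
        conn.execute("UPDATE entity_versions SET status = 'master' "
                     "WHERE entity_id = ? AND entity_date = ?",
                     (entity_id, entity_date))


promote(conn, 1, "2024-02-01")

history = conn.execute("SELECT entity_date, status FROM entity_versions "
                       "WHERE entity_id = 1 ORDER BY entity_date").fetchall()
current = conn.execute("SELECT payload FROM entity_versions "
                       "WHERE entity_id = 1 AND status = 'master'").fetchone()[0]
```

Every query that wants "the entity" has to carry the status filter seen in the last statement, which is the added query complexity mentioned above.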
You are describing a homebrew Content Management System which was probably hacked together over time, is - for the reasons you state - redundant and inefficient, and given the nature of such systems in firms is unlikely to be displaced without massive organizational effort.

Database for microblogging startup

I am building a microblogging web service (for school, so don't blast me for the lack of a new idea) and I worry that the DB could often be overloaded. Users can follow other users or even tags, so I suppose the SELECT will be heavy: fetch the 20 latest messages that match all the tags and users someone is following.
My idea is to create another table and store in it only statusID and userID (who should pick up the message). The danger is that if some tag or user has many followers, there will be a lot of records with that status ID. So, is it a good idea? Or would it be better to use an M2M relation (one status -> many receivers)?
I think most databases can easily handle large record sets. The responsibility for making it perform lies in your design and in setting up the indexes properly. If you create the right indexes, the select clauses should perform really well.
I'd go with a users table, a table for the m2m relationship between users, and a messages table.
You can then do one select to find all of the users a user is following and then a second select to get all of the messages of interest (sorting and limiting the results as appropriate). Extending this to tagging should be pretty simple.
This design should be fine for large numbers of users and messages as long as you index the right columns. If you got massive, you could also move the users tables and messages tables to different servers or add read-only replicas. I wouldn't even worry about that for the moment - you'd need to be huge.
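The two selects described above can be sketched like this in Python with sqlite3. The table names, the follows layout and the timeline helper are assumptions for illustration, and message_id is used as a stand-in for posting order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE follows (                       -- m2m relationship between users
        follower_id INTEGER NOT NULL REFERENCES users(user_id),
        followed_id INTEGER NOT NULL REFERENCES users(user_id),
        PRIMARY KEY (follower_id, followed_id)
    );
    CREATE TABLE messages (
        message_id INTEGER PRIMARY KEY,          -- assumed to grow with posting order
        author_id  INTEGER NOT NULL REFERENCES users(user_id),
        body       TEXT NOT NULL
    );
    CREATE INDEX idx_messages_author ON messages (author_id, message_id);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "ann"), (2, "ben"), (3, "cy")])
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany("INSERT INTO messages VALUES (?, ?, ?)",
                 [(1, 2, "hello"), (2, 3, "hi"), (3, 1, "mine"), (4, 2, "again")])


def timeline(conn, user_id, limit=20):
    """Return the newest message bodies from the users that user_id follows."""
    # First select: who does this user follow?
    followed = [r[0] for r in conn.execute(
        "SELECT followed_id FROM follows WHERE follower_id = ?", (user_id,))]
    if not followed:
        return []
    # Second select: their latest messages, newest first.
    marks = ",".join("?" * len(followed))
    return [r[0] for r in conn.execute(
        f"SELECT body FROM messages WHERE author_id IN ({marks}) "
        "ORDER BY message_id DESC LIMIT ?", (*followed, limit))]


latest = timeline(conn, 1)
latest_two = timeline(conn, 1, limit=2)
nobody = timeline(conn, 2)
```

Nothing is stored per receiver here, so a user with many followers costs one message row rather than one row per follower.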
When implementing Collabinate (http://www.collabinate.com), a service-based engine for microblogging and shared activity streams, I used a graph database. The fact that people create posts and follow other people lends itself to a graph structure. With the right relationships and algorithms, this can be a very efficient and performant solution.