Create durable key in Data Warehouse - sql

In the Project dimension, business logic dictates that Project Number and Project Name can change over time.
Unfortunately, I cannot get any durable key directly from the source system. Is there a way to deal with this situation? How can I generate the Project ID column, which in this example acts as the durable key?
See desired dimension structure.
Project ID - a durable key that does not come from the source system; I need to generate it.
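For illustration, a minimal sketch of one way such a durable key can be generated in the warehouse (this assumes SQL Server, a hypothetical staging table stg_Project, and that a project can be matched on its ProjectNumber when it first arrives; later Project Number or Project Name changes are then recorded as new versions under the same ProjectID):

    -- Key map that owns the generated durable key; all names here are hypothetical.
    CREATE TABLE ProjectKeyMap (
        ProjectID     int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- generated durable key
        ProjectNumber varchar(20)       NOT NULL UNIQUE       -- matching attribute from the source
    );

    -- Assign a durable key only to projects not seen before.
    INSERT INTO ProjectKeyMap (ProjectNumber)
    SELECT DISTINCT s.ProjectNumber
    FROM stg_Project AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM ProjectKeyMap AS k
                      WHERE k.ProjectNumber = s.ProjectNumber);

    -- The dimension load then joins staging to the key map to pick up ProjectID,
    -- and writes a new dimension row whenever Project Number or Project Name changes.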

Related

What is the optimal relational database design for storing an unknown number of similar but unique entities

The database we are designing allows users to authenticate with multiple 3rd party services, mostly social media (twitter, facebook, etc). There will be an unknown and growing number of these services. Each service requires a unique set of data for authentication that is not standard with the other services.
One user may authenticate many services, but they may only authenticate with one of each type of service.
Possible Solutions:
A) The most direct solution to this issue is to simply add a column for each service to the user table which contains the JSON authentication for that service. However, this violates normalization by leaving a large number of nulls in the database. What happens when there are 50 of these integrations for instance?
B) Each service gets its own table in the database. JSON is no longer needed, as each field can be properly described. Then a lookup table, "user_has_service", is needed for each service. This is a table which contains only two foreign keys, one for the user and one for the service, linking them together. This option seems the most correct but is very inefficient and will take many operations to determine what services a user has, increasing with the number of services. I also believe that in this case, the ID field for the lookup table would need to be some kind of hash of the user and service together so that duplicate inserts are not possible.
Not at all a database expert and I have been grappling with this one for quite a while. Any thoughts?
A) The most direct solution ... JSON
You are right, option A is grossly incorrect. It breaks Codd's First Normal Form, thus it is not Relational. NULL in the database is an indication of incomplete Normalisation, which leads to complex SQL code. To be avoided at all costs.
similar but unique
To be clear, that they are unique to the Service is true. That {LoginName; UserName; Email; UserId; etc} are all similar is true in the implementation sense only, not in the data.
I may need to sketch this out.
That is a great idea. A visual data model is far more effective, because (a) the mind can comprehend it much better than text, and (b) therefore work out details; contradictions; missing bits; etc. Much easier to progress each iteration visually, than with text.
Second, we have had visual modelling tools since 1987 (1984 for a closed group), which were made a Standard in 1993. Hopefully you appreciate that a standard-compliant model is better than a home-grown or corporate-supplied one. It displays all technical details rather than a small subset.
Is there a name for this strategy
It is plain old Relational Data Modelling, which includes Normalisation (ensuring compliance with Codd's Normal Forms, as opposed to the insanity of implementing the NFs in fragmented, progressive steps).
Obstacle
One problem that needs to be understood and eliminated is this. The "theoreticians" market and propagate 1960's Record Filing Systems under the banner of "relational". That is characterised by Record IDs in every file. That method ensures the database remains physical, not logical, the very thing that Codd overcame with his Relational Model: a database that is logical and therefore extremely easy to navigate, by any querying party, current; planned; or unplanned.
The essential difference between 1960's RFS and post-1970 Relational Databases is:
whereas the RFS maintains references between Files by physical pointer (Record ID), the Relational Database maintains references between Tables by logical Key.
A logical Key is "made up from the data" as per Codd
(A datum that is fabricated by the system is not "made up from the data")
(Use of the SQL command PRIMARY KEY does not magically anoint the datum with the properties and qualities of a Relational Key: if you use PRIMARY KEY RecordID you are in 1960's physical paradigm, not the post-1970 Relational paradigm)
Logical Keys provide Relational Integrity (as distinct from Referential Integrity, which is an ordinary function of SQL), which is far superior to that obtained by 1960's RFS
As well as far superior Speed and Power (far less JOINs, and smaller sets)
Relational Database
Therefore I will give you the answer as a Relational Data Model, as per Codd.
Just one example of Relational Integrity:
the ServiceProperty FK elements in UserServiceProperty are constrained to the PK (the particular combination) in ServiceProperty
a UserServiceProperty row with Facebook.Email is prevented
A Record ID based 1960's RFS that the "theoreticians" promote as "relational" cannot do that; various errors such as that one are allowed.
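To make that concrete, here is a rough SQL sketch of that constraint (the table and column names are illustrative only, not taken from the Data Model): the composite Foreign Key in UserServiceProperty must reference an existing (ServiceId, PropertyName) pair in ServiceProperty, so a property that a Service does not define cannot be recorded for a user of that Service.

    CREATE TABLE Users (
        UserName varchar(30) NOT NULL PRIMARY KEY             -- logical Key, made up from the data
    );

    CREATE TABLE Service (
        ServiceId varchar(30) NOT NULL PRIMARY KEY             -- e.g. 'Facebook', 'Twitter'
    );

    CREATE TABLE ServiceProperty (
        ServiceId    varchar(30) NOT NULL REFERENCES Service (ServiceId),
        PropertyName varchar(30) NOT NULL,                      -- e.g. 'Email', 'ApiToken'
        PRIMARY KEY (ServiceId, PropertyName)
    );

    CREATE TABLE UserService (
        UserName  varchar(30) NOT NULL REFERENCES Users (UserName),
        ServiceId varchar(30) NOT NULL REFERENCES Service (ServiceId),
        PRIMARY KEY (UserName, ServiceId)                       -- one of each Service per user
    );

    CREATE TABLE UserServiceProperty (
        UserName      varchar(30)  NOT NULL,
        ServiceId     varchar(30)  NOT NULL,
        PropertyName  varchar(30)  NOT NULL,
        PropertyValue varchar(255) NOT NULL,
        PRIMARY KEY (UserName, ServiceId, PropertyName),
        FOREIGN KEY (UserName, ServiceId)     REFERENCES UserService (UserName, ServiceId),
        FOREIGN KEY (ServiceId, PropertyName) REFERENCES ServiceProperty (ServiceId, PropertyName)
    );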
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993
My IDEF1X Introduction is essential reading for beginners.
The IDEF1X Anatomy is a refresher for those who have lapsed.
If you have trouble reading the Predicates directly from the Data Model, let me know and I will produce them in text form.
Please feel free to ask questions, the more specific the better.
You could set up:
a referential table called services to list all the available services, with columns like service_id (primary key), service_name, description and so on. Each service is represented as one record in this table.
a table called services_properties to store the properties of the services; this table has 3 columns: service_id (foreign key to the primary key of services), property_name and property_value. A unique constraint can be set up on service_id/property_name tuples to avoid duplicates. Each service has several records in the services_properties table. This flexible structure lets you store as many different properties as needed for each service without creating a new table for each service.
a mapping table called user_services, that relates users to services. Columns would be service_id and user_id, as foreign keys to the primary keys of the services table and users table. You can query this table to easily list the services subscribed to by each user.
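For illustration, a rough DDL sketch of that layout (column datatypes and the users table are assumptions):

    CREATE TABLE users (
        user_id   int          NOT NULL PRIMARY KEY,
        user_name varchar(100) NOT NULL
    );

    CREATE TABLE services (
        service_id   int          NOT NULL PRIMARY KEY,
        service_name varchar(100) NOT NULL,
        description  varchar(255)
    );

    CREATE TABLE services_properties (
        service_id     int          NOT NULL REFERENCES services (service_id),
        property_name  varchar(100) NOT NULL,
        property_value varchar(255) NOT NULL,
        UNIQUE (service_id, property_name)      -- one definition per property and service
    );

    CREATE TABLE user_services (
        user_id    int NOT NULL REFERENCES users (user_id),
        service_id int NOT NULL REFERENCES services (service_id),
        PRIMARY KEY (user_id, service_id)       -- a user subscribes to each service at most once
    );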

Service fabric backup, partition ID

If I have a stateful Service Fabric service split into three partitions and want to back up to an external source, I can incorporate the partition ID of the service into the folder name of the external source containing the backup. When a restore is required, the service instance can pass its partition ID into the backup provider and so pull down the data for that particular partition.
My concern is that if there is a catastrophic failure and the service fabric needs to be rebuilt, the partitions will no longer have the same partition ID (it seems to be a Guid), in which case the restore process will not be able to find the backup for the new partition ID.
What is the recommended way to deal with this?
Instead of the PartitionID, I am currently using the partition key; is this OK?
Yes, that makes sense and you're doing it right. You can take a backup from one partition and restore it on a different partition in another service.
The target service must use the same partition count and type.
This way you can also use backup and restore to make a copy of an existing service, for debugging purposes for example.
Code from this project can help you create and restore SF backups, and help you store the data in an external store.

Handling mapping tables to an external API

Our data model has users. We use an external API (I'll call it Z) to handle payments. We create users in Z and have a mapping table that links our internal IDs to Z IDs. This works fine when there is a 1-to-1 association between "environments."
The problem is that Z provides us one testing environment called "staging." But we have multiple environments, "sandbox", "staging", each dev's local, etc. Ideally we could point the various environments to Z's staging, but then the mapping tables will be wrong in each environment. Each environment has a different user base and the emails could clash and point to the wrong Z IDs. Z provides no delete (or archive) functionality either.
How can we manage those mapping tables in this situation?
This is a common problem. It happens not only when dealing with external systems, but very often internal systems as well.
You really only have two choices: disallow contact with the external system except from your Staging environment, or allow contact with the external system from multiple environments.
Since you want to do the latter, you have to accept that your mapping tables in each of your environments will not match ID for ID with the external Staging environment. This shouldn't be a problem unless you have some requirement that you have the exact same number of IDs in your mapping table as in the external environment. If this is your case, then you are stuck with option 1.
More likely, there is no real need that every ID in the external environment have a corresponding entry in the same mapping table. In this case, you are really only concerned that every ID in a mapping table have a corresponding ID in the external Staging environment.
You can prevent collisions by creating the ID in the external system before creating it in your mapping table. If the ID is already taken, require the user to pick a different one.
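For illustration, a sketch of what one environment's mapping table might look like under that approach (table and column names are hypothetical): the unique constraint on the Z-side ID prevents two internal users in the same environment from mapping to the same Z user, while the "create it in Z first" rule keeps the different environments from colliding with each other.

    CREATE TABLE user_z_mapping (
        internal_user_id int         NOT NULL PRIMARY KEY,  -- our user; one Z account per user
        z_user_id        varchar(64) NOT NULL UNIQUE        -- ID of the account in Z's staging
    );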

System consuming WCF services from another system, when underlying databases have relationships

This is an issue that I have struggled with in a number of systems, but this one is a good example. It arises when one system consumes WCF services from another system, and each system has its own database, but there are relationships between the two databases.
We have a central database that holds a record of all documents in the company. This database includes Document and Folder tables and it mimics a Windows file structure. NHibernate takes care of data access, a domain layer handles logic (validating filenames, ensuring no identical filenames in the same folder, etc.) and a service layer sits on that, with services named 'CreateDocument(bytes[])', 'RenameDocument(id, newName)', 'SearchDocuments(filename, filesize, createdDate)' etc. These services are exposed with WCF.
An HR system consumes these services. The HR system has its own database, which has foreign keys to the Document database: it contains an HRDocument table that has a foreign key DocumentId, and then HR-specific columns such as EmployeeId and ContractId.
Here are the problems, amongst others:
1) In order to save a document, I have to call the WCF service to save it to the central db, return the ID and then save to the HRDocument table (along with the HR specific information). Because of the WCF call and all Document specific data access being done within the Document application, this can't be done all within one transaction, resulting in a possible loss of transaction integrity.
2) In order to search on, say, employeeId and createdDate, I have to call the search service passing in the createdDate (a Document-database-specific field) and then search the HRDocument database on the IDs of the returned records to filter the results. This feels messy, slow and just wrong.
I could duplicate the NHibernate mapping files to the Document database in the DAL of the HR application. This means I could specify the relationship between HRDocument and Document. This means I could join the tables and search like that but would also mean I would have to duplicate domain logic and violate the DRY principle, and all that entails.
I can't help feeling I'm doing something wrong here and have missed something simple.
I recommend applying CQRS and Event-Driven Architecture principles here:
- Use GUIDs as primary keys - then you will be able to generate the primary key for a document and pass it to the WCF method call.
- Use messaging on the other side of the WCF service to prevent data loss (in case of database failure and the like).
- Remove constraints between databases - immediately consistent applications don't scale; use the eventual consistency paradigm instead.
- Introduce a separate data store for reads that contains denormalized data. Then you will be able to do searches very easily. To ensure consistency in your read storage (in case Document creation fails) you could implement some simple workflow (a saga, in CQRS terms).
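For illustration, a sketch of the GUID-key idea under those assumptions (SQL Server syntax, illustrative column names): the HR system generates the GUID up front, passes it to the Document service's CreateDocument call, and stores the same value in its own HRDocument row; no foreign key crosses the two databases.

    -- In the central Document database:
    CREATE TABLE Document (
        DocumentId  uniqueidentifier NOT NULL PRIMARY KEY,  -- value supplied by the caller
        FileName    nvarchar(255)    NOT NULL,
        CreatedDate datetime2        NOT NULL
    );

    -- In the HR database (note: no REFERENCES clause pointing at the other database):
    CREATE TABLE HRDocument (
        DocumentId  uniqueidentifier NOT NULL PRIMARY KEY,  -- same GUID as the central row
        EmployeeId  int              NOT NULL,
        ContractId  int              NULL
    );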
You can create a common codebase which will include base implementation of Document along with all the mappings, base Domain Model etc.
Both the Document Service and the HR System use the same codebase. But in the HR System you extend the base Document class (or classes) with your HRDocument, using whichever inheritance mapping strategy suits your needs best.
public class HRDocument : Document
And from the HR System you don't even have to call the Document Service anymore; you just use NHibernate and enjoy ACID and all that. But the Document Service is still there and there's no code duplication.

archiving strategies and limitations of data in a table

Environment: Jboss, Mysql, JPA, Hibernate
Our web application will be catering to a large number of users (~1,000,000) and there are lots of child tables where user-specific data is stored (e.g. personal, health, forum contributions ...).
What would be the best practice for archiving users & user-specific information?
[a] Would it be wise to move the archived users & user-specific information to their respective tables within the same database (e.g. user_archive, user_forum_comments_archive ...), OR
[b] Would you just mark the database entries with a flag in the original table(s) and query only non-archived entries?
We have a unique constraint on User.loginid; how do you handle this requirement if the users are archived via 1-[a]? (i.e. if a user with loginid 'samuel' gets moved into the archive table and a new user gets added with the same name in the original table, how would you prevent this?) What would be the best strategy to address the unique key constraints?
We have a requirement to selectively archive records and bring them back if necessary; would you rely on database tools or would you handle this via the persistence APIs exposed by the JPA entity model?
Personally, I'd go for solution "[b]".
Having things split on two table sets (current and archived) would make things a bit hard to manage in terms of common RDBMS concepts (example: a forum comment's author would be a foreign key pointing to the users table... but you can't have a field behave as a foreign key to two different tables).
You could go for a compromise (the users table uses solution "b", all the other tables like profile get archived to a twin table as per solution "a") but this would make things unnecessarily complicated for your code (in some cases you have to look at the non-archived rows, in some at the archived only, in some other cases at the union of both).
Solution B would easily solve requirements #2 and #3, too. Uniqueness of user name is easy to enforce if everything is in the same table, and resurrecting archived users is just a matter of flipping a bit (Archived=Y/N) on the main user table.
10% is not much; I doubt that the difference in terms of performance would really justify the extra complexity (and the risk of bugs).
I would put an archived flag on the table and then create a view to use when you don't want to see archived records. That way, I suspect, people will be more consistent in applying the archive flag.
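For illustration, a minimal sketch of that approach in MySQL (column and view names are illustrative):

    -- Add the flag to the existing table; the unique constraint on loginid keeps working
    -- because archived users never leave the table.
    ALTER TABLE users ADD COLUMN archived CHAR(1) NOT NULL DEFAULT 'N';

    -- Everyday queries go through the view, so archived rows are filtered consistently.
    CREATE VIEW active_users AS
        SELECT *
        FROM users
        WHERE archived = 'N';

    -- Archiving (or resurrecting) a user is just flipping the flag:
    -- UPDATE users SET archived = 'Y' WHERE user_id = 1;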