I am wondering what is the best way to organize the following into a relational DB structure (specifically an Oracle DB).
I have to represent many hosts; every host is streaming (multicasting, input is with and without source filtering) one or more streams which contain an unknown number of components.
There are many kinds of machines that will apply different kinds of modifications based on some specific configuration (which has to be stored), and almost every combination is valid. Normally, based on that info, a machine will take some component, possibly change it in some way and restream it; others will just reorganize multiple multicast components (so they have multiple IP addresses).
My user will probably be looking for information about a specific host or component, but many views will have to be provided, which will navigate the data in every "relation direction".
I've come up with a lot of ideas, like creating a standard "host" table that has 2 fields containing the specialized table name and row (obviously all specialized tables are very different from each other). But this clashes with the foreign key perspective (only one parent table can be defined on table creation); so maybe it is a better idea to make the specialized tables point to the host table with a foreign key, but then that relation has to be "reverse navigable" (possible, but I feel like it is a hack).
Also for multicast I've come up with two alternatives, as it is a many-to-many relationship, it will use a third connection table, but the problem is what to put in that table to keep it aligned when one multicast or machine IP changes.
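One way the second idea could look is sketched below: each specialized table points back to the host table, and the connection table stores only generated ids, so an IP change touches a single row. This is an assumption-laden sketch, not a recommendation for your exact schema: SQLite stands in for Oracle, and every table and column name (host, transcoder_host, multicast_group, host_multicast, target_bitrate) is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Common attributes of every machine live in one parent table.
CREATE TABLE host (
    host_id    INTEGER PRIMARY KEY,
    host_type  TEXT NOT NULL,
    ip_address TEXT NOT NULL UNIQUE
);
-- One specialized table per machine kind, sharing the parent's key.
CREATE TABLE transcoder_host (
    host_id        INTEGER PRIMARY KEY REFERENCES host(host_id),
    target_bitrate INTEGER NOT NULL
);
CREATE TABLE multicast_group (
    group_id      INTEGER PRIMARY KEY,
    group_address TEXT NOT NULL UNIQUE
);
-- The connection table holds ids only, never the addresses themselves.
CREATE TABLE host_multicast (
    host_id  INTEGER NOT NULL REFERENCES host(host_id),
    group_id INTEGER NOT NULL REFERENCES multicast_group(group_id),
    PRIMARY KEY (host_id, group_id)
);
""")

conn.execute("INSERT INTO host VALUES (1, 'transcoder', '10.0.0.1')")
conn.execute("INSERT INTO transcoder_host VALUES (1, 4000)")
conn.execute("INSERT INTO multicast_group VALUES (1, '239.1.1.1')")
conn.execute("INSERT INTO host_multicast VALUES (1, 1)")


def host_details(host_id):
    """Navigate from the parent row to its specialized row with a join."""
    return conn.execute(
        "SELECT h.host_type, h.ip_address, t.target_bitrate "
        "FROM host h LEFT JOIN transcoder_host t ON t.host_id = h.host_id "
        "WHERE h.host_id = ?",
        (host_id,),
    ).fetchone()


def groups_of_host(host_id):
    """List the multicast addresses linked to a host."""
    rows = conn.execute(
        "SELECT g.group_address FROM host_multicast hm "
        "JOIN multicast_group g ON g.group_id = hm.group_id "
        "WHERE hm.host_id = ? ORDER BY g.group_address",
        (host_id,),
    )
    return [address for (address,) in rows]


def change_host_ip(host_id, new_ip):
    """An IP change updates one row; the connection table is untouched."""
    conn.execute(
        "UPDATE host SET ip_address = ? WHERE host_id = ?", (new_ip, host_id)
    )
```

Navigating "in reverse" from host to the specialized table is then an ordinary join on the shared key rather than a lookup by stored table name.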
I'm really lost in that as I can't even find some good keyword to point me to some example or discussion about an organization with such complexity.
I am working at a company that merged with another company a while ago.
We have several business units that are basically equivalent, one in Europe and one in China. We already had an in-house MariaDB database, which we want to start sharing.
The problem is that there are different GDPR regulations and contracts that prohibit sharing certain data across sites. So what I can't do is replicate data across sites and then just hide it from the user in the frontend. The private data has to stay at the facility it belongs to.
So my idea was to split each existing table that may contain sensitive information into two tables.
Say, table_contracts_private and table_contracts_public.
This would still seem pretty doable with basic database replication, replicating only the public tables across sites. But how would you go about publishing private data? Also, how would I best combine private and public data? Just using a view?
I just could not find any good mechanisms for this, especially because we would also like to avoid data duplication, so the private entries would need to be removed and replaced by the public ones, which would entail also changing all referencing IDs.
Is this a possible application of sharding?
I'd be really grateful, if someone could point me in the right direction, or if someone has a demo project with similar requirements that I could check out.
Cheers
Is this a possible application of sharding?
I wouldn't think so. Sharding is a performance optimization method. What you need is to support legal constraints. Those are two very different problems.
I think you are on the right track. I call this a "walled garden" approach. You create a database with all non-PII information, using ids only. Nothing that remotely directly identifies people, their addresses, phones, credit cards, and so on. This can be tricky. In some jurisdictions combinations of demographics can be PII.
Some of those ids then refer to another database where you store all the sensitive information; this is the "walled garden". I would recommend that this second database be on a separate server. It has a very restricted access list. And this is where you implement requirements for things like "forgetting" a customer.
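A minimal sketch of that split, assuming two independent connections stand in for the shared server and the restricted one. SQLite is used only for illustration, and the table names, column names, and the may_see_private flag are all hypothetical.

```python
import sqlite3

# Two separate databases: one shareable, one restricted ("walled garden").
public_db = sqlite3.connect(":memory:")
private_db = sqlite3.connect(":memory:")

# The shared side stores ids and non-identifying facts only.
public_db.execute(
    "CREATE TABLE contracts ("
    "contract_id INTEGER PRIMARY KEY, "
    "person_id INTEGER NOT NULL, "
    "amount REAL NOT NULL)"
)
# The restricted side stores everything that identifies a person.
private_db.execute(
    "CREATE TABLE persons ("
    "person_id INTEGER PRIMARY KEY, "
    "full_name TEXT NOT NULL, "
    "email TEXT NOT NULL)"
)

private_db.execute(
    "INSERT INTO persons VALUES (1, 'Alice Example', 'alice@example.com')"
)
public_db.execute("INSERT INTO contracts VALUES (100, 1, 2500.0)")


def contract_with_owner(contract_id, may_see_private):
    """Read the shared row, then enrich it only for permitted callers."""
    row = public_db.execute(
        "SELECT contract_id, person_id, amount FROM contracts "
        "WHERE contract_id = ?",
        (contract_id,),
    ).fetchone()
    if row is None:
        return None
    result = {"contract_id": row[0], "person_id": row[1], "amount": row[2]}
    if may_see_private:
        person = private_db.execute(
            "SELECT full_name, email FROM persons WHERE person_id = ?",
            (row[1],),
        ).fetchone()
        if person is not None:
            result["full_name"], result["email"] = person
    return result


def forget_person(person_id):
    """Forgetting a customer only touches the restricted database."""
    with private_db:
        private_db.execute(
            "DELETE FROM persons WHERE person_id = ?", (person_id,)
        )
```

Because the shared rows carry only the id, deleting the private row leaves the contract history intact but no longer attributable.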
In any case, the point is that sharding is not the right approach. You want an application redesign with privacy and security as the top priorities. Happily, this is not actually that hard to implement, although if the databases are changing, you may need periodic auditing. For instance, in one database I worked with, we discovered that "coupon codes" sometimes contained unencrypted email addresses. Arrgggh!
The database we are designing allows users to authenticate with multiple 3rd party services, mostly social media (twitter, facebook, etc). There will be an unknown and growing number of these services. Each service requires a unique set of data for authentication that is not standard with the other services.
One user may authenticate many services, but they may only authenticate with one of each type of service.
Possible Solutions:
A) The most direct solution to this issue is to simply add a column for each service to the user table which contains the JSON authentication for that service. However, this violates normalization by leaving a large number of nulls in the database. What happens when there are 50 of these integrations for instance?
B) Each service gets its own table in the database. JSON is no longer needed as each field can be properly described. Then a lookup table is needed "user_has_service" for each service. This is a table which contains only two foreign keys, one for the user and one for the service, linking them together. This option seems the most correct but is very inefficient and will take many operations to determine what services a user has, increasing with the number of services. I believe also in this case, the ID field for the lookup table would need to be some kind of hash of the user and service together so that duplicate inserts are not possible.
Not at all a database expert and I have been grappling with this one for quite a while. Any thoughts?
A) The most direct solution ... JSON
You are right, option A is grossly incorrect. It breaks Codd's First Normal Form, thus it is not Relational. NULL in the database is an indication of incomplete Normalisation, which leads to complex SQL code. To be avoided at all costs.
similar but unique
To be clear, that they are unique to the Service is true. That {LoginName; UserName; Email; UserId; etc} are all similar is true in the implementation sense only, not in the data.
I may need to sketch this out.
That is a great idea. A visual data model is far more effective, because (a) the mind can comprehend it much better than text, and (b) therefore work out details; contradictions; missing bits; etc. Much easier to progress each iteration visually, than with text.
Second, we have had visual modelling tools since 1987 (1984 for a closed group), which have been made a Standard in 1993. Hopefully you appreciate that a standard-compliant model is better than a home-grown or corporate-supplied one. It displays all technical details rather than a small subset.
Is there a name for this strategy
It is plain old Relational Data Modelling, which includes Normalisation (ensuring compliance with Codd's Normal Forms, as opposed to the insanity of implementing the NFs in fragmented progressive steps).
Obstacle
One problem that needs to be understood and eliminated is this. The "theoreticians" market and propagate 1960's Record Filing Systems under the banner of "relational". That is characterised by a Record ID in every file. That method ensures the database remains physical, not logical, the very thing that Codd overcame with his Relational Model: a database that is logical and therefore extremely easy to navigate, by any querying party, current; planned; or unplanned.
The essential difference between 1960's RFS and post-1970 Relational Databases is:
whereas the RFS maintains references between Files by physical pointer (Record ID), the Relational Database maintains references between Tables by logical Key.
A logical Key is "made up from the data" as per Codd
(A datum that is fabricated by the system is not "made up from the data")
(Use of the SQL command PRIMARY KEY does not magically anoint the datum with the properties and qualities of a Relational Key: if you use PRIMARY KEY RecordID you are in 1960's physical paradigm, not the post-1970 Relational paradigm)
Logical Keys provide Relational Integrity (as distinct from Referential Integrity, which is an ordinary function of SQL), which is far superior to that obtained by 1960's RFS
As well as far superior Speed and Power (far fewer JOINs, and smaller sets)
Relational Database
Therefore I will give you the answer as a Relational Data Model, as per Codd.
Just one example of Relational Integrity:
the ServiceProperty FK elements in UserServiceProperty are constrained to the PK (particular combination) in ServiceProperty
a UserServiceProperty row with Facebook.Email is prevented
A Record ID based 1960's RFS that the "theoreticians" promote as "relational" cannot do that; various errors such as that one are allowed.
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993
My IDEF1X Introduction is essential reading for beginners.
The IDEF1X Anatomy is a refresher for those who have lapsed.
If you have trouble reading the Predicates directly from the Data Model, let me know and I will produce them in text form.
Please feel free to ask questions, the more specific the better.
You could set up:
a referential table called services to list all the available services, with columns like service_id (primary key), service_name and descriptions and so on. Each service is represented as one record in this table.
a table called services_properties to store the properties of the services; this table has 3 columns: service_id (foreign key to the primary key of services), property_name and property_value. A unique constraint can be set up on service_id/property_name tuples to avoid duplicates. Each service has several records in the services_properties table. This flexible structure lets you store as many different properties as needed for each service without creating a new table for each service
a mapping table called user_services, that relates users to services. Columns would be service_id and user_id, as foreign keys to the primary keys of the services table and users table. You can query this table to easily list the services subscribed by each user.
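The three tables above could be sketched roughly as follows, using SQLite purely for illustration. Assumptions are mine: a minimal users table, placeholder property names and values, a unique constraint on (service_id, property_name), and a composite primary key on user_services. The composite key is what enforces "one of each type of service" per user, with no hash column needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    login   TEXT NOT NULL UNIQUE
);
CREATE TABLE services (
    service_id   INTEGER PRIMARY KEY,
    service_name TEXT NOT NULL UNIQUE
);
CREATE TABLE services_properties (
    service_id     INTEGER NOT NULL REFERENCES services(service_id),
    property_name  TEXT NOT NULL,
    property_value TEXT,
    UNIQUE (service_id, property_name)
);
-- The composite primary key prevents linking a user to a service twice.
CREATE TABLE user_services (
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    service_id INTEGER NOT NULL REFERENCES services(service_id),
    PRIMARY KEY (user_id, service_id)
);
""")

conn.execute("INSERT INTO users VALUES (1, 'samuel')")
conn.executemany(
    "INSERT INTO services VALUES (?, ?)", [(1, "twitter"), (2, "facebook")]
)
# Property names and values are placeholders, not real service settings.
conn.executemany(
    "INSERT INTO services_properties VALUES (?, ?, ?)",
    [(1, "example_property", "placeholder"),
     (2, "example_property", "placeholder")],
)
conn.executemany("INSERT INTO user_services VALUES (?, ?)", [(1, 1), (1, 2)])


def services_of(user_id):
    """One join lists every service a user has linked."""
    rows = conn.execute(
        "SELECT s.service_name FROM user_services us "
        "JOIN services s ON s.service_id = us.service_id "
        "WHERE us.user_id = ? ORDER BY s.service_name",
        (user_id,),
    )
    return [name for (name,) in rows]


def link_is_rejected(user_id, service_id):
    """Return True when the database refuses the link."""
    try:
        conn.execute(
            "INSERT INTO user_services VALUES (?, ?)", (user_id, service_id)
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

Listing a user's services stays a single query regardless of how many services exist, which addresses the efficiency worry raised about option B.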
Background
We have chosen Cassandra as our storage engine since we have an application that must handle async messaging between many users on the website and event storing (some types of analytics, what happens on site and when, etc.). Also we have a voting platform, so we are storing votes per user per day, and Cassandra is good in those use cases.
Recently we got new requirements to build a relational model on top of our existing system (at least we think it is relational). Some types of political candidates with lists of jobs, education, historical voting, endorsements, etc.
Problem
We have relations which can be edited on both ends (i.e. a candidate is supported by companies, but in our admin panel that company can be edited without the candidate). A candidate is one row in our Cassandra DB identified by a UUID. On the front end, we would need full information about candidates (political party, schools, jobs, voting history, supporting companies). We want to place the majority of candidate info in a single row so we can read data with a single read. However, when we place the list of supporting companies in that row as a UDT, we have problems editing it (we need to change it in both the company_by_id and candidate_by_id tables).
Question
How to solve the editing problem and relational model issues in our situation?
We came up with couple of solutions:
1. Track relations in Cassandra with additional index-like tables: candidates_by_supporting_company. When updating a company, we update the candidates who have that company as well.
2. Similar to 1, but using a secondary index if the relation has low cardinality, and updating based on that secondary index (we have 10 political parties, so we can place an index on political party in the candidates table, and when a political party changes we can change the candidates by political party since we have the index)
3. Use a relational database for the relational type of data and leave Cassandra to handle only suitable use cases like time-series data, messaging, event sorting (this adds the maintenance cost of one more database, deployment costs, and the problem of how to replicate data since our system is distributed)
4. Use Spark to do joins (this will not be the sole purpose of adding Spark to the system; we are thinking of adding it for importing huge data sets in CSV and doing transformations, so having Spark will be an added bonus and we can use SparkSQL for places where we need joins)
We are leaning towards option 4 (Spark) since we will add Spark anyway: we would stay with Cassandra as our only database (which avoids the maintenance and deployment of one more database) and we get a form of JOIN and GROUP BY at the application level with it.
What do you think?
If you want to use only Cassandra, the right way to proceed is option 1: denormalization. But if you have a lot of relationships, it will require a lot of effort at the application level.
If adding another DBMS is not a problem in your environment, using the right tool for the right job is the best choice: option 3 for me
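To show what the application-level effort of the denormalization option involves, here is a minimal in-memory sketch. Plain dictionaries stand in for the Cassandra tables; no driver is used, and the helper names add_support and rename_company are hypothetical. A real implementation would issue the equivalent writes against each table, probably batched.

```python
from collections import defaultdict

# Dictionaries standing in for the three Cassandra tables.
company_by_id = {}
candidate_by_id = {}
candidates_by_supporting_company = defaultdict(set)


def add_support(candidate_id, company_id):
    """Copy the company into the candidate row and record the relation."""
    company = company_by_id[company_id]
    supporters = candidate_by_id[candidate_id]["supporting_companies"]
    supporters[company_id] = dict(company)  # denormalized copy
    candidates_by_supporting_company[company_id].add(candidate_id)


def rename_company(company_id, new_name):
    """Update the company, then every candidate row that embeds a copy."""
    company_by_id[company_id]["name"] = new_name
    for candidate_id in candidates_by_supporting_company[company_id]:
        supporters = candidate_by_id[candidate_id]["supporting_companies"]
        supporters[company_id]["name"] = new_name
```

Every editable relation needs its own index-like table and its own fan-out update, which is why the effort grows with the number of relationships.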
Environment: Jboss, Mysql, JPA, Hibernate
Our web application will be catering to a large number of users (~ 1,000,000) and there are a lot of child tables where user-specific data is stored (e.g. personal, health, forum contributions ...).
What would be the best practice to archive user & user specific information.
[a] Would it be wise to move the archived user & user specific information to their respective tables within the same database (e.g. user_archive, user_forum_comments_archive ...) OR
[b] Would you just mark the database entries with a flag in the original table(s) and just query only non archived entries.
We have a unique constraint on User.loginid. How do you handle this requirement if the users are archived via 1-[a] (i.e. if a user with loginid 'samuel' gets moved into the archive table and a new user gets added with the same name in the original table, how would you prevent this)? What would be the best strategy to address the unique key constraints?
We have a requirement to selectively archive records and bring them back if necessary. Would you rely on database tools, or would you handle this via the persistence APIs exposed by the JPA entity model?
Personally, I'd go for solution "[b]", the archive flag.
Having things split across two table sets (current and archived) would make things a bit hard to manage in terms of common RDBMS concepts (example: a forum comment's author would be a foreign key pointing to the users table, but a field cannot behave as a foreign key to two different tables).
You could go for a compromise (the users table uses the flag as in solution "[b]", while all the other tables, like profile, get archived to a twin table as in solution "[a]"), but this would make things unnecessarily complicated for your code (in some cases you have to look at the non-archived rows, in some at the archived ones only, and in others at the union of both).
Solution "[b]" would easily solve requirements #2 and #3, too. Uniqueness of the user name is easy to enforce if everything is in the same table, and resurrecting archived users is just a matter of flipping a bit (Archived=Y/N) on the main user table.
10% is not much, I doubt that the difference in terms of performance would really justify the extra complexity (and risk of bugs).
I would put an archived flag on the table and then create a view to use when you don't want to see archived records. That way people will be more consistent in applying the archive flag I suspect.
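A minimal sketch of the flag-plus-view idea, with SQLite standing in for MySQL. The view name active_users and the helper functions are assumptions of mine; loginid comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id  INTEGER PRIMARY KEY,
    loginid  TEXT NOT NULL UNIQUE,
    archived INTEGER NOT NULL DEFAULT 0 CHECK (archived IN (0, 1))
);
-- Everyday queries read the view, so the filter cannot be forgotten.
CREATE VIEW active_users AS
    SELECT user_id, loginid FROM users WHERE archived = 0;
""")


def register(loginid):
    """Return True if the login was free, False if it is already taken."""
    try:
        conn.execute("INSERT INTO users (loginid) VALUES (?)", (loginid,))
    except sqlite3.IntegrityError:
        return False
    return True


def set_archived(loginid, archived):
    """Archive or restore a user by flipping the flag."""
    conn.execute(
        "UPDATE users SET archived = ? WHERE loginid = ?",
        (1 if archived else 0, loginid),
    )


def active_logins():
    """List the logins visible through the view."""
    rows = conn.execute("SELECT loginid FROM active_users ORDER BY loginid")
    return [loginid for (loginid,) in rows]
```

Since archived rows stay in the same table, the existing unique constraint keeps covering them: a second 'samuel' is refused even while the first one is archived.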
I am going to build a microblogging web service (for school, so don't blast me for the lack of a new idea) and I worry that the DB could often be overloaded (a user can follow other users or even tags, so I suppose that the SELECT will be heavy: fetch the 20 latest messages that match any of the followed tags and users).
My idea is to create another table and store in it only the statusID and the userID (of the user who should receive the message). The danger is that if some tag or user has many followers, there will be a lot of records with that status ID. So, is it a good idea? Or would it be better to use an M2M relation (one status -> many receivers)?
I think most databases can easily handle large record sets. The responsibility for making it perform lies in your design, in properly setting up the indexes. If you create the right indexes, the select queries should perform really well.
I'd go with a users table, a table holding the m2m relationship between users, and a messages table.
You can then do one select to find all of the users a user is following and then a second select to get all of the messages of interest (sorting and limiting the results as appropriate). Extending this to tagging should be pretty simple.
This design should be fine for large numbers of users and messages as long as you index the right columns. If you grew massive, you could also move the users and messages tables to different servers or use read-only replicas. I wouldn't even worry about that for the moment - you'd need to be huge.
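The two-select approach could be sketched like this, using SQLite for illustration. The table names, column names, and the index are assumptions; message_id is treated as increasing with time, which stands in for a timestamp column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
-- The m2m relationship between users: who follows whom.
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(user_id),
    followed_id INTEGER NOT NULL REFERENCES users(user_id),
    PRIMARY KEY (follower_id, followed_id)
);
CREATE TABLE messages (
    message_id INTEGER PRIMARY KEY,
    author_id  INTEGER NOT NULL REFERENCES users(user_id),
    body       TEXT NOT NULL
);
-- Supports "latest messages by these authors" lookups.
CREATE INDEX messages_by_author ON messages (author_id, message_id);
""")


def latest_messages(user_id, limit=20):
    """First select the followed users, then their newest messages."""
    followed = [
        row[0]
        for row in conn.execute(
            "SELECT followed_id FROM follows WHERE follower_id = ?",
            (user_id,),
        )
    ]
    if not followed:
        return []
    placeholders = ", ".join("?" for _ in followed)
    rows = conn.execute(
        "SELECT message_id, author_id, body FROM messages "
        f"WHERE author_id IN ({placeholders}) "
        "ORDER BY message_id DESC LIMIT ?",
        (*followed, limit),
    )
    return rows.fetchall()
```

The two selects can also be merged into one query that joins follows to messages; a tag-following table would extend the same pattern.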
When implementing Collabinate (http://www.collabinate.com), a service-based engine for microblogging and shared activity streams, I used a graph database. The fact that people create posts and follow other people lends itself to a graph structure. With the right relationships and algorithms, this can be a very efficient and performant solution.