Uniqueness in many-to-many - SQL

I couldn't figure out what terms to google, so help tagging this question or just pointing me in the direction of a related question would be helpful.
I believe that I have a typical many-to-many relationship:
CREATE TABLE groups (
    id integer PRIMARY KEY
);
CREATE TABLE elements (
    id integer PRIMARY KEY
);
CREATE TABLE groups_elements (
    groups_id integer REFERENCES groups,
    elements_id integer REFERENCES elements,
    PRIMARY KEY (groups_id, elements_id)
);
I want to have a constraint that there can only be one groups_id for a given set of elements_ids.
For example, the following is valid:
groups_id | elements_id
1 | 1
1 | 2
2 | 2
2 | 3
The following is not valid, because then groups 1 and 2 would be equivalent.
groups_id | elements_id
1 | 1
1 | 2
2 | 2
2 | 1
Not every subset of elements must have a group (this is not the power set), but new subsets may be formed. I suspect that my design is incorrect since I'm really talking about adding a group as a single entity.
How can I create identifiers for subsets of elements without risk of duplicating subsets?

That is an interesting problem.
One solution, albeit a clunky one, would be to store a sorted concatenation of each group's elements_ids in the groups table (e.g. '1-2' for a group containing elements 1 and 2) and put a unique index on it.
Searching for duplicate groups before inserting each new row would be an enormous performance hit.
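A minimal sketch of that idea in PostgreSQL, with a made-up element_list column that the application (or a trigger) must keep sorted and current:
ALTER TABLE groups ADD COLUMN element_list text;
CREATE UNIQUE INDEX groups_element_list_uq ON groups (element_list);

-- Group 1 contains elements 1 and 2; '1-2' can be stored only once, so no
-- second group can claim the same element set.
UPDATE groups SET element_list = '1-2' WHERE id = 1;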

The following query would spit out offending group ids:
with group_elements_arr as (
    select groups_id, array_agg(elements_id order by elements_id) as elements
    from groups_elements
    group by groups_id
)
select elements, count(*), array_agg(groups_id) as offending_groups
from group_elements_arr
group by elements
having count(*) > 1;
Depending on the size of groups_elements and its change rate, you might get away with stuffing something along these lines into a trigger watching groups_elements. If that's not fast enough, you can materialize group_elements_arr into a real table maintained by triggers.
And I think the trigger should be FOR EACH STATEMENT and INITIALLY DEFERRED, to make it easy to build up a new group.
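A hedged sketch of such a trigger in PostgreSQL (the function and trigger names are made up); it simply re-runs the duplicate check after each statement and aborts if anything matches:
CREATE OR REPLACE FUNCTION check_unique_element_sets() RETURNS trigger AS $$
BEGIN
    IF EXISTS (
        SELECT 1
        FROM (SELECT groups_id,
                     array_agg(elements_id ORDER BY elements_id) AS elements
              FROM groups_elements
              GROUP BY groups_id) g
        GROUP BY elements
        HAVING count(*) > 1
    ) THEN
        RAISE EXCEPTION 'two groups contain the same element set';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Statement-level trigger running the check above after every change.
CREATE TRIGGER groups_elements_unique_sets
AFTER INSERT OR UPDATE OR DELETE ON groups_elements
FOR EACH STATEMENT EXECUTE PROCEDURE check_unique_element_sets();
One caveat: PostgreSQL only defers constraint triggers, and those must be FOR EACH ROW, so the FOR EACH STATEMENT and INITIALLY DEFERRED options don't combine directly.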

This link from user ypercube was most helpful: unique constraint on a set. In short, a bit of what everyone is saying is correct.
It's a question of tradeoffs, but here are the best options:
a) Add a hash or some other combination of element values to the groups table and make it unique, then populate the groups_elements table from it using triggers (a sketch follows after these options). Pros of this method are that it preserves querying ability and enforces the constraint so long as you deny naked updates to groups_elements. Cons are that it adds complexity and you've now introduced logic like "how do you uniquely represent a set of elements" into your database.
b) Leave the tables as-is and control the access to groups_elements with your access layer, be it a stored procedure or otherwise. This has the advantage of preserving querying ability and keeps the database itself simple. However, it means that you are moving an analytic constraint into your access layer, which necessarily means that your access layer will need to be more complex. Another point is that it separates what the data should be from the data itself, which has both pros and cons. If you need faster access to whether or not a set already exists, you can attack that problem separately.
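Going back to option (a), here is a minimal sketch in PostgreSQL; the elements_hash column name is made up, and the UPDATE shown is the bookkeeping that the triggers would perform:
ALTER TABLE groups ADD COLUMN elements_hash text UNIQUE;

-- Recompute the hash for group 1 from its current element set; a duplicate
-- of an existing set now violates the unique constraint.
UPDATE groups g
SET elements_hash = (
    SELECT md5(array_agg(ge.elements_id ORDER BY ge.elements_id)::text)
    FROM groups_elements ge
    WHERE ge.groups_id = g.id)
WHERE g.id = 1;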

Related

Is it good practice to have two SQL tables with bijective row correspondence?

I have a table of tasks,
id | name
----+-------------
1 | brush teeth
2 | do laundry
and a table of states.
taskid | state
--------+-------------
1 | completed
2 | uncompleted
There is a bijective correspondence between the tables, i.e.
each row in the task table corresponds to exactly one row in the state table.
Another way of implementing this would be to place a state row in the task table.
id | name | state
----+-------------+-------------
1 | brush teeth | completed
2 | do laundry | uncompleted
The main reason why I have selected to use two tables instead of this one is that updating the state would then cause a change in the task id.
I have other tables referencing the task(id) column, and do not want to have to update all those other tables too when altering a task's state.
I have two questions about this.
Is it good practice to have two tables in bijective row-row correspondence?
Is there a way I can ensure a constraint that there is exactly one row in the state table corresponding to each row in the task table?
The system I am using is postgresql.
You can ensure the 1-1 correspondence by making the id in each table both a primary key and a foreign key that references the id in the other table. This is allowed, and it guarantees 1-1-ness.
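A sketch in PostgreSQL (the constraint name is made up); the circular dependency means one FK has to be added after both tables exist, and making it DEFERRABLE lets you insert the paired rows inside a single transaction:
CREATE TABLE task (
    id   integer PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE state (
    taskid integer PRIMARY KEY REFERENCES task (id),
    state  text NOT NULL
);

ALTER TABLE task
    ADD CONSTRAINT task_state_fk FOREIGN KEY (id)
    REFERENCES state (taskid) DEFERRABLE INITIALLY DEFERRED;

-- Insert both halves of a pair in one transaction; the deferred FK is
-- checked at COMMIT, by which time both rows exist.
BEGIN;
INSERT INTO task  VALUES (1, 'brush teeth');
INSERT INTO state VALUES (1, 'completed');
COMMIT;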
Sometimes, you want such tables, but one table has fewer rows than the other. This occurs when there is a subsetting relationship, and you don't want the additional columns on all rows.
Another purpose is to store separate columns in different places. When I learned about databases, this approach was called vertical partitioning. Nowadays, columnar databases are relatively common; these take the notion to the extreme -- a separate "store" for each column (although the "store" is not exactly a "table").
Why would you do this? Here are some reasons:
You have infrequently used columns that you do not want to load for every query on the more frequent columns.
You have frequently updated columns and you do not want to lock the rest of the columns.
You have too many columns to store in one row.
You have different security requirements on different columns.
Postgres does offer other mechanisms that you might find relevant. In particular, table inheritance might be useful in your situation.
All that said, you would not normally design a database like this. There are good reasons for doing so, but it is more typical to put all columns related to an entity in the same table.
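For completeness, a standalone sketch of the inheritance mechanism mentioned above, applied to these tables (note that PostgreSQL children do not inherit primary key or unique constraints, so this is not a drop-in substitute):
CREATE TABLE task (
    id   integer PRIMARY KEY,
    name text NOT NULL
);

-- Rows inserted here have id, name and state, and also show up in
-- queries against the parent task table.
CREATE TABLE task_with_state (
    state text NOT NULL
) INHERITS (task);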

Programmatically determine required series of JOIN statements in SQL

In the standard database theory on referential integrity and inclusion dependencies, it's alluded to that there is a PSPACE algorithm for taking a known list of e.g. foreign key and primary key relationships plus a candidate relationship and determining if the candidate relationship is logically implied by the known list. The non-deterministic version of the algorithm is given in algorithm 9.1.6 of this book [pdf], but not any explicit deterministic version (and I'm guessing the idea isn't to somehow manually use Savitch's Theorem to find that algorithm).
A related question would be how to programmatically ask a database system a question like this:
Given the result set from Query 1 and the result set from Query 2, is there a known-in-advance chain of primary / foreign key dependencies that would allow me to join the two result sets together (possibly requiring additional columns that would come from the derived chain of key relationships)?
I'm thinking specifically in the context of Postgres, MSSQL, or MySQL, how to actually ask the database for this information programmatically.
I can query the sp_help and sp_fkeys stored procedures, but it's unclear how I can provide a set of columns from a given table and a set of columns from a target table, and just directly ask the database if it can derive, from outputs of things like sp_fkeys, a way to join the tables to each other.
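For what it's worth, I know I can pull the raw FK edges out of INFORMATION_SCHEMA rather than the vendor procedures; a rough sketch (schema qualifiers omitted, and the exact views differ a little between Postgres, MSSQL and MySQL):
-- One row per (referencing column -> referenced column) edge.
-- Matching on ordinal_position assumes composite-key column order lines up.
SELECT kcu.table_name  AS fk_table,
       kcu.column_name AS fk_column,
       pk.table_name   AS pk_table,
       pk.column_name  AS pk_column
FROM information_schema.referential_constraints rc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_name = rc.constraint_name
JOIN information_schema.key_column_usage pk
  ON pk.constraint_name = rc.unique_constraint_name
 AND pk.ordinal_position = kcu.ordinal_position;
But that only gives me the edges; the part I'm asking about is the path search over them.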
I'm more than willing (happy even) to write code to do this myself if there are known algorithms for it (based on the theorem mentioned above, at least the math algorithm exists even if it's not part of the database systems directly).
My application is to ingest an assortment of tables with poorly maintained schemas, and do some metaprogramming during a schema discovery phase, so that there does not need to be a human-in-the-loop to try to manually search for chains of key relationships that would imply the way two given tables can be joined.
Suppose my database is just these three tables:
TableA
| a_key1 | a_key2 | a_value |
+--------+--------+---------+
| 1      | 'foo'  | 300     |
| 2      | 'bar'  | 400     |
TableB
| b_key1 | b_key2     | b_value |
+--------+------------+---------+
| 'foo'  | 2012-12-01 | 'Bob'   |
| 'bar'  | 2012-12-02 | 'Joe'   |
TableC
| c_key1     | c_key2 | c_value |
+------------+--------+---------+
| 2012-12-01 | 100    | 3.4     |
| 2012-12-02 | 200    | 2.7     |
And suppose that b_key1 is a foreign key into a_key2, that c_key1 is a foreign key into b_key2, and that the "key" columns of each table, together, form the primary key. It's not important whether the tables are normalized well (so, for instance, the fact that b_key1 can identify a row in TableA even though TableA ostensibly has a 2-tuple for its key -- this is unimportant and happens often with tables you inherit in the wild).
Then in this case, just by knowing the key relationships in advance, we know the following join is "supported" (in the sense that the data types and value ranges must make sense under the system of foreign keys):
select A.*, C.c_value
from TableA A
join TableB B
  on A.a_key2 = B.b_key1
join TableC C
  on B.b_key2 = C.c_key1
So if a person came to the database with two separate queries:
select *
from TableA
select c_key1, c_value
from TableC
then the database system ought to be able to deduce the joins required for the query that brings these two into alignment (as implied by foreign keys, not other arbitrary ways of making them into alignment).
My question is: does this search functionality to produce a series of required joins exist already in the major SQL-compliant RDBMSs?
One can imagine following the idea discussed in this answer and then sucking that metadata into some other programming language, deconstructing it into a kind of graph of the key relationships, and then using a known graph path-finding algorithm to solve the problem. But any homebrewed implementations of this will be fraught with errors and it's the kind of problem where the algorithms can easily be written so as to be exponentially slow. By asking if this exists in a database system already, I'm just trying to leverage work done to make a smart and database-aware implementation if that work was already done.
Let the set of input queries be Q, and assume that each of these queries selects a subset of columns from a single table. I'll assume that, when joining from a table t in Q to a table outside Q, we want to consider only those FKs that can be formed from the columns actually present in the query for t; but that when joining from a table outside Q to another table outside Q, we can use any FK defined between these two tables.
In this case I would suggest that what you really want to know is if there is a unique set of joins that will generate the combined column set of all input queries. And this can be tested efficiently:
Form a graph G having V = {all tables} and E = {(u,v) | there is a FK relationship from u to v or v to u}, with the added restriction that edges incident on tables in Q can only use columns present in the corresponding input queries.
Build a spanning forest F on G.
If any pair of tables in Q appear in different components of F then we know that there is no query that satisfies our goal, and we can stop.
Otherwise, there is at least 1 such query, involving some or all of the tables in the unique component X containing all tables in Q.
To get rid of pointless joins, delete all edges in X that are not on a path from some table in Q to some other table in Q (e.g. by repeatedly "nibbling away" pendant edges whose leaf endpoint is a table outside of Q, until this can no longer be done).
For each edge e that remains in this new, smaller tree X:
Create a copy X_e of the induced subgraph of V(X) in G, and delete the edge e from X_e.
Create a spanning forest F_e from X_e.
If F_e consists of a single component, this can only mean that there is a second (i.e. distinct) way of joining the tables in Q that avoids the edge e, so we can stop.
If we get to this point, then every edge e in X is necessary, and we can conclude that the set of joins implied by X is the unique way to produce the set of columns from the input query set Q.
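If the FK edges are materialized into a hypothetical fk_edges(src, dst) table, the connectivity test in step 3 can even be run in SQL itself with a recursive CTE (PostgreSQL syntax):
-- Flood-fill outward from TableA across undirected FK edges; TableC is
-- joinable with TableA iff it shows up in the reachable set.
WITH RECURSIVE reachable(tbl) AS (
    SELECT 'TableA'::text
    UNION
    SELECT CASE WHEN e.src = r.tbl THEN e.dst ELSE e.src END
    FROM fk_edges e
    JOIN reachable r ON r.tbl IN (e.src, e.dst)
)
SELECT 'TableC' IN (SELECT tbl FROM reachable) AS same_component;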
Any two tables can be joined. (The predicate of a join is the conjunction of its operands' predicates.)
So a lot of what you wrote doesn't make sense. I suggest you seek a characterization of your problem that doesn't involve whether, or "the way", two tables can be joined. Maybe: whether there are likely constraints on expressions that are not implied by given constraints on operands, and/or what constraints definitely hold on expressions given constraints on operands. (Including likely/inferred FDs, MVDs, JDs and candidate keys of expressions, or inclusion dependencies or FKs between expressions.) (N.b. you don't need to restrict your expressions to joins. E.g. maybe a table subtypes another.)
It seems to me like you're asking about the mechanics of (/theoretical rules underpinning) the inference of inclusion dependency satisfaction between the results of arbitrary queries.
In the years I've been studying everything I could find about relational theory, I've never seen anything matching that description, so if my understanding of the question was correct, then my guess is your chances will be slim.

SQL Server FK same table

I'm thinking of adding a relationship table to a database and I'd like to include a sort of reverse-relation functionality by using a FK pointing to a PK within the same table. For example, say I have table RELATIONSHIP with the following:
ID (PK) | Relation    | ReverseID (FK)
--------+-------------+---------------
1       | Parent      | 2
2       | Child       | 1
3       | Grandparent | 4
4       | Grandchild  | 3
5       | Sibling     | 5
First, is this even possible? Second, is this a good way to go about this? If not, what are your suggestions?
1) It is possible.
2) It may not be as desirable in your case as you might want. You have cycles, as opposed to an acyclic structure, so with the FK in place you cannot insert any of those rows as they stand. One possibility: after allowing NULLs in your ReverseID column in your table DDL, INSERT all the rows with a NULL ReverseID, then run an UPDATE to set the ReverseID columns, which will now have valid rows to reference (see the sketch after this list). Another possibility is to disable the foreign key, or not create it until the data is in a completely valid state, and then apply it.
3) You would have to do an operation like this almost every time, and if EVERY relationship has an inverse, you either wouldn't be able to enforce NOT NULL in the schema or you would regularly be disabling and re-enabling constraints.
4) The sibling situation is the same.
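A sketch of the two-step load from point 2 (SQL Server syntax, assuming ReverseID has been made NULLable):
INSERT INTO RELATIONSHIP (ID, Relation, ReverseID) VALUES
    (1, 'Parent',      NULL),
    (2, 'Child',       NULL),
    (3, 'Grandparent', NULL),
    (4, 'Grandchild',  NULL),
    (5, 'Sibling',     NULL);

-- Every target row now exists, so the self-references are valid.
UPDATE RELATIONSHIP SET ReverseID = 2 WHERE ID = 1;
UPDATE RELATIONSHIP SET ReverseID = 1 WHERE ID = 2;
UPDATE RELATIONSHIP SET ReverseID = 4 WHERE ID = 3;
UPDATE RELATIONSHIP SET ReverseID = 3 WHERE ID = 4;
UPDATE RELATIONSHIP SET ReverseID = 5 WHERE ID = 5;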
I would be fine using the design if this is controlled in some way and you understand the implications.

Bending the rules of a UNIQUE column in SQLite

I am working with an extensive amount of third-party data. Each data set has items with unique identifiers, so it is very easy for me to utilise a UNIQUE column in SQLite to enforce some data integrity.
Out of thousands of records, I have one id from third-party source A matching 2 unique ids from third-party source B.
Is there a way of bending the rules and allowing a duplicate entry in a unique column? If not, how should I reorganise my data to take care of this single edge case?
UPDATE:
CREATE TABLE "trainer" (
    "id" INTEGER PRIMARY KEY AUTOINCREMENT,
    "name" TEXT NOT NULL,
    "betfair_id" INTEGER NOT NULL UNIQUE,
    "racingpost_id" INTEGER NOT NULL UNIQUE
);
Problem data:
Miss Beverley J Thomas http://www.racingpost.com/horses/trainer_home.sd?trainer_id=20514
Miss B J Thomas http://www.racingpost.com/horses/trainer_home.sd?trainer_id=11096
vs. Miss Beverley J. Thomas http://form.horseracing.betfair.com/form/trainer/1/00008861
Both Racingpost entries (my primary data source) match a single Betfair entry. This is the only such case (so far) out of thousands of records.
If Racingpost should have had only 1 match, this is an error condition.
If Racingpost is allowed to have 2 matches per Betfair id, you must either keep two ids, select one, or combine the data.
Since Racingpost is your primary source, having two ids may make sense. However, if you want to improve upon that data set, combining the data or selecting the most useful entry may be more accurate. The real question is how much data overlaps between these two records and, when it does, whether you can detect it reliably. If the overlap is small or you have good detection of an overlap condition, then combining makes more sense. If the overlap is large and you cannot detect it reliably, then selecting the most recently updated entry, or keeping two ids, is more useful.
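A sketch of the "combine the data" option in SQLite (table and column names are illustrative): collapse the duplicates into one trainer row and move the per-source ids into a child table, so each source id stays unique while a single trainer can carry two Racingpost ids:
CREATE TABLE "trainer" (
    "id"   INTEGER PRIMARY KEY AUTOINCREMENT,
    "name" TEXT NOT NULL
);

CREATE TABLE "trainer_source_id" (
    "trainer_id" INTEGER NOT NULL REFERENCES "trainer"("id"),
    "source"     TEXT    NOT NULL,   -- 'betfair' or 'racingpost'
    "source_id"  INTEGER NOT NULL,
    UNIQUE ("source", "source_id")   -- a given source id maps to one trainer
);
-- Remember PRAGMA foreign_keys = ON, or the REFERENCES clause is not enforced.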

Redundant column

I have a database that has two tables; they look like this:
codes
id | code | member_id
 1 | 123  | 2
 2 | 234  | 1
 3 | 345  |
 4 | 456  | 3
members
id | code_id | other info
 1 | 2       | blabla
 2 | 1       | blabla
 3 | 4       | blabla
The basic idea is that if a code is taken, its member_id field is filled in. However, this creates a circular link (members points to codes, codes points to members). Is there a different way of doing this? Is this actually a bad thing?
Update
To answer your questions: there are three different code tables with approx. 3.5 million codes each, and each table is searched depending on different criteria. If the member_id column is empty the code is unclaimed; otherwise it is claimed. This is done so that when we are searching the database we do not need to join another table to tell whether a code is claimed.
The members table contains the claimants for every single code, so all 10.5 million members.
The additional info has things like mobile and flybuys.
The mobile is how we identify the member, but each entry is considered a different member.
It's a bad thing because you can end up with anomalies. For example:
codes
id | code | member_id
 1 | 123  | 2
members
id | code_id | other info
 2 | 4       | blabla
See the anomaly? Code 1 references its corresponding member, but that member doesn't reference the same code in return. The problem with anomalies is you can't tell which one is the correct, intended reference and which one is a mistake.
Eliminating redundant columns reduces the chance for anomalies. This is a simple process that follows a few very well defined rules, called rules of normalization.
In your example, I would drop the codes.member_id column. I infer that a member must reference a code, but a code does not necessarily reference a member. So I would make members.code_id reference codes.id. But it could go the other way; you don't give enough information for the reader to be sure (as @OMG Ponies commented).
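A sketch of that change (assuming an engine that supports these ALTERs); afterwards, an unclaimed code is simply one that no member references:
ALTER TABLE members
    ADD CONSTRAINT members_code_fk FOREIGN KEY (code_id) REFERENCES codes (id);

ALTER TABLE codes DROP COLUMN member_id;

-- Unclaimed codes: those with no referencing member.
SELECT c.*
FROM codes c
LEFT JOIN members m ON m.code_id = c.id
WHERE m.id IS NULL;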
Yeah, this is not good because it presents opportunities for data integrity problems. You've got a one-to-one relationship, so either remove code_id from the members table, or member_id from the codes table. (In this case it seems like it would make more sense to drop code_id from members, since it sounds like you're more frequently going to be querying codes to see which are not assigned than querying members to see which have no code, but you can make that call.)
You could simply drop the member_id column and use a foreign key relationship (or its absence) to signify the relationship or lack thereof. The code_id column would then be used as a foreign key to the code. Personally, I do think it's bad simply because it makes it more work to ensure that you don't have corrupt relationships in the DB -- i.e., you have to check that the two columns are synchronized between the tables -- and it doesn't really add anything in the general case. If you are running into performance problems, then you may need to denormalize, but I'd wait until it was definitely a problem (and you'd likely replicate more than just the id in that case).
It depends on what you're doing. If each member always gets exactly one unique code then just put the actual code in the member table.
If there are a set of codes and several members share a code (but each member still has just one) then remove the member_id from the codes table and only store the unique codes. Access a specific code through a member. (you can still join the code table to search on codes)
If a member can have multiple codes, then remove the code_id from the member table and the member_id from the code table, and create a third table that relates members to codes. Each record in the member table should be a unique record, and each record in the code table should be a unique record.
What is the logic behind having the member code in the code table?
It's unnecessary since you can always just do a join if you need both pieces of information.
By having it there you create the potential for integrity issues since you need to update BOTH tables whenever an update is made.
Yes, this is a bad idea. Never set up a database to have circular references if you can help it. Now any change has to be made in both places, and if one place is missed, you have a severe data integrity problem.
First question: can each code be assigned to more than one member? Or can each member have more than one code? (This includes over time, as well as at any one moment, if you need historical records of who had what code when.) If the answer to either is yes, then your current structure cannot work. If the answer to both is no, why do you need two tables?
If you can have multiple codes and multiple members, you need a bridging table that has member_id and code_id. If you can have multiple members assigned one code, put the code_id in the members table. If it is the other way around, put the member_id in the codes table. Then properly set up the foreign key relationship.
@Bill Karwin correctly identifies this as a probable design flaw which will lead to anomalies.
Assuming code and member are distinct entities, I would create a third table...
What is the relationship between a code and a member called? An oath? If this is a real-life relationship, someone with domain knowledge in the business will be able to give it a name. If not, look for further design flaws:
oaths
code_id | member_id
1 | 2
2 | 1
4 | 3
The data suggest that a unique constraint is required for (code_id, member_id).
Once the data is 'scrubbed', drop the columns codes.member_id and members.code_id.
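A sketch of the scrubbed end state, with that constraint enforced as the primary key:
CREATE TABLE oaths (
    code_id   integer NOT NULL REFERENCES codes (id),
    member_id integer NOT NULL REFERENCES members (id),
    PRIMARY KEY (code_id, member_id)   -- one row per (code, member) pair
);

ALTER TABLE codes   DROP COLUMN member_id;
ALTER TABLE members DROP COLUMN code_id;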