Programmatically determine required series of JOIN statements in SQL - sql

In the standard database theory on referential integrity and inclusion dependencies, it's alluded to that there is a PSPACE algorithm for taking a known list of e.g. foreign key and primary key relationships plus a candidate relationship and determining if the candidate relationship is logically implied by the known list. The non-deterministic version of the algorithm is given in algorithm 9.1.6 of this book [pdf], but not any explicit deterministic version (and I'm guessing the idea isn't to somehow manually use Savitch's Theorem to find that algorithm).
A related question would be how to programmatically ask a database system a question like this:
Given the result set from Query 1 and the result set from Query 2, is there a known-in-advance chain of primary/foreign key dependencies that would allow me to join the two result sets together (possibly requiring additional columns that would come from the derived chain of key relationships)?
I'm thinking specifically in the context of Postgres, MSSQL, or MySQL, how to actually ask the database for this information programmatically.
I can query the sp_help and sp_fkeys stored procedures, but it's unclear how I can provide a set of columns from a given table and a set of columns from a target table, and just directly ask the database if it can derive, from outputs of things like sp_fkeys, a way to join the tables to each other.
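For what it's worth, pulling the raw FK metadata is the easy part; a rough sketch against the standard INFORMATION_SCHEMA views (this form works in Postgres and MSSQL; MySQL lacks CONSTRAINT_COLUMN_USAGE but exposes the referenced side directly on KEY_COLUMN_USAGE) would be something like:
-- One row per foreign-key column pair: referencing (table, column) -> referenced (table, column).
-- For multi-column FKs the column pairing also needs the ordinal positions, which this sketch glosses over.
SELECT
    kcu.table_name  AS fk_table,
    kcu.column_name AS fk_column,
    ccu.table_name  AS referenced_table,
    ccu.column_name AS referenced_column
FROM information_schema.referential_constraints rc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_schema = rc.constraint_schema
 AND kcu.constraint_name   = rc.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON ccu.constraint_schema = rc.unique_constraint_schema
 AND ccu.constraint_name   = rc.unique_constraint_name;
But that only enumerates the edges; the search over chains of such edges is the part I'm asking about.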
I'm more than willing (happy even) to write code to do this myself if there are known algorithms for it (based on the theorem mentioned above, at least the math algorithm exists even if it's not part of the database systems directly).
My application is to ingest an assortment of tables with poorly maintained schemas, and do some metaprogramming during a schema discovery phase, so that there does not need to be a human-in-the-loop to try to manually search for chains of key relationships that would imply the way two given tables can be joined.
Suppose my database is just these three tables:
TableA
| a_key1 | a_key2 | a_value |
+--------+--------+---------+
|      1 | 'foo'  |     300 |
|      2 | 'bar'  |     400 |
TableB
| b_key1 | b_key2     | b_value |
+--------+------------+---------+
| 'foo'  | 2012-12-01 | 'Bob'   |
| 'bar'  | 2012-12-02 | 'Joe'   |
TableC
| c_key1     | c_key2 | c_value |
+------------+--------+---------+
| 2012-12-01 |    100 |     3.4 |
| 2012-12-02 |    200 |     2.7 |
And suppose that b_key1 is a foreign key into a_key2, that c_key1 is a foreign key into b_key2, and that the "key" columns of each table together form its primary key. It's not important whether the tables are well normalized (for instance, b_key1 can identify a row in TableA even though TableA ostensibly has a 2-tuple for its key -- this is unimportant and happens often with tables you inherit in the wild).
Then in this case, just by knowing the key relationships in advance, we know the following join is "supported" (in the sense that the data types and value ranges must make sense under the system of foreign keys):
select A.*, C.c_value
from TableA A
join TableB B
on A.a_key2 = B.b_key1
join TableC C
on B.b_key2 = C.c_key1
So if a person came to the database with two separate queries:
select *
from TableA
select c_key1, c_value
from TableC
then the database system ought to be able to deduce the joins required for the query that brings these two into alignment (as implied by foreign keys, not other arbitrary ways of making them into alignment).
My question is: does this search functionality to produce a series of required joins exist already in the major SQL-compliant RDBMSs?
One can imagine following the idea discussed in this answer and then sucking that metadata into some other programming language, deconstructing it into a kind of graph of the key relationships, and then using a known graph path-finding algorithm to solve the problem. But any homebrewed implementations of this will be fraught with errors and it's the kind of problem where the algorithms can easily be written so as to be exponentially slow. By asking if this exists in a database system already, I'm just trying to leverage work done to make a smart and database-aware implementation if that work was already done.
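For concreteness, here is roughly the kind of homebrewed search I mean, written as a recursive CTE in Postgres. It assumes a hypothetical view fk_edges(from_table, from_column, to_table, to_column), all text columns, built from FK metadata like the query above; it only follows edges in the declared FK direction and ignores composite keys, which is exactly the sort of corner-cutting I'd rather avoid by using something built in:
-- Walk chains of FK hops starting from 'tablea', collecting the join conditions along the way,
-- and report any chain that reaches 'tablec'. The visited array prevents cycles.
WITH RECURSIVE join_path AS (
    SELECT e.from_table,
           e.to_table,
           ARRAY[e.from_table, e.to_table] AS visited,
           e.from_table || '.' || e.from_column || ' = ' || e.to_table || '.' || e.to_column AS join_conditions
    FROM fk_edges e
    WHERE e.from_table = 'tablea'
  UNION ALL
    SELECT p.from_table,
           e.to_table,
           p.visited || e.to_table,
           p.join_conditions || ' AND ' || e.from_table || '.' || e.from_column || ' = ' || e.to_table || '.' || e.to_column
    FROM join_path p
    JOIN fk_edges e ON e.from_table = p.to_table
    WHERE e.to_table <> ALL (p.visited)
)
SELECT to_table, join_conditions
FROM join_path
WHERE to_table = 'tablec';
If that returns exactly one row, the chain of joins is unambiguous; zero rows or several rows are the interesting cases.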

Let the set of input queries be Q, and assume that each of these queries selects a subset of columns from a single table. I'll assume that, when joining from a table t in Q to a table outside Q, we want to consider only those FKs that can be formed from the columns actually present in the query for t; but that when joining from a table outside Q to another table outside Q, we can use any FK defined between these two tables.
In this case I would suggest that what you really want to know is if there is a unique set of joins that will generate the combined column set of all input queries. And this can be tested efficiently:
Form a graph G having V = {all tables} and E = {(u,v) | there is a FK relationship from u to v or v to u}, with the added restriction that edges incident on tables in Q can only use columns present in the corresponding input queries.
Build a spanning forest F on G.
If any pair of tables in Q appear in different components of F then we know that there is no query that satisfies our goal, and we can stop.
Otherwise, there is at least 1 such query, involving some or all of the tables in the unique component X containing all tables in Q.
To get rid of pointless joins, delete all edges in X that are not on a path from some table in Q to some other table in Q (e.g. by repeatedly "nibbling away" pendant edges whose degree-1 endpoint is a table outside of Q, until this can no longer be done).
For each edge e that remains in this new, smaller tree X:
Create a copy X_e of the induced subgraph of V(X) in G, and delete the edge e from X_e.
Create a spanning forest F_e from X_e.
If F_e consists of a single component, this can only mean that there is a second (i.e. distinct) way of joining the tables in Q that avoids the edge e, so we can stop.
If we get to this point, then every edge e in X is necessary, and we can conclude that the set of joins implied by X is the unique way to produce the set of columns from the input query set Q.

Any two tables can be joined. (The predicate of a join is the conjunction of its operands' predicates.)
So a lot of what you wrote doesn't make sense. Suggest you seek a characterization of your problem that doesn't involve whether, or "the way", two tables can be joined. Maybe: whether there are likely constraints on expressions that are not implied by given constraints on operands, and/or what constraints definitely hold on expressions given constraints on operands. (Including likely/inferred FDs, MVDs, JDs and candidate keys of expressions, or inclusion dependencies or FKs between expressions.) (N.B. you don't need to restrict your expressions to joins. E.g. maybe a table subtypes another.)

It seems to me like you're asking about the mechanics of (/theoretical rules underpinning) the inference of inclusion dependency satisfaction between the results of arbitrary queries.
In the years I've been studying everything I could find about relational theory, I've never seen anything matching that description, so if my understanding of the question was correct, then my guess is your chances will be slim.

Related

Is it good practice to have two SQL tables with bijective row correspondence?

I have a table of tasks,
id | name
----+-------------
1 | brush teeth
2 | do laundry
and a table of states.
taskid | state
--------+-------------
1 | completed
2 | uncompleted
There is a bijective correspondence between the tables, i.e.
each row in the task table corresponds to exactly one row in the state table.
Another way of implementing this would be to place a state row in the task table.
id | name | state
----+-------------+-------------
1 | brush teeth | completed
2 | do laundry | uncompleted
The main reason why I have chosen to use two tables instead of this one is that updating the state would then cause a change in the task id.
I have other tables referencing the task(id) column, and do not want to have to update all those other tables too when altering a task's state.
I have two questions about this.
Is it good practice to have two tables in bijective row-row correspondence?
Is there a way I can ensure a constraint that there is exactly one row in the state table corresponding to each row in the task table?
The system I am using is postgresql.
You can ensure the 1-1 correspondence by making the id in each table both a primary key and a foreign key that references the id in the other table. This is allowed, and it guarantees the 1-1 relationship.
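In Postgres that looks roughly like the following (a sketch using the table and column names from the question; the second foreign key has to be added after both tables exist, and making it deferrable lets you insert a task and its state in the same transaction):
CREATE TABLE task (
    id   integer PRIMARY KEY,
    name text NOT NULL
);
CREATE TABLE state (
    taskid integer PRIMARY KEY REFERENCES task (id),
    state  text NOT NULL
);
-- The reverse foreign key closes the loop: every task must also have a state row.
ALTER TABLE task
    ADD CONSTRAINT task_must_have_state FOREIGN KEY (id) REFERENCES state (taskid)
    DEFERRABLE INITIALLY DEFERRED;

-- Usage: insert the pair inside one transaction; the deferred constraint is checked at COMMIT.
BEGIN;
INSERT INTO task  VALUES (1, 'brush teeth');
INSERT INTO state VALUES (1, 'completed');
COMMIT;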
Sometimes, you want such tables, but one table has fewer rows than the other. This occurs when there is a subsetting relationship, and you don't want the additional columns on all rows.
Another purpose is to store separate columns in different places. When I learned about databases, this approach was called vertical partitioning. Nowadays, columnar databases are relatively common; these take the notion to the extreme -- a separate "store" for each column (although the "store" is not exactly a "table").
Why would you do this? Here are some reasons:
You have infrequently used columns that you do not want to load for every query on the more frequent columns.
You have frequently updated columns and you do not want to lock the rest of the columns.
You have too many columns to store in one row.
You have different security requirements on different columns.
Postgres does offer other mechanisms that you might find relevant. In particular, table inheritance might be useful in your situation.
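Syntactically, an inheritance sketch looks like this (a standalone example, separate from the DDL earlier in this thread); note that primary keys and foreign keys are not inherited by child tables, so it is not a drop-in replacement for the two-table design:
-- Rows inserted into completed_task also appear in queries against task.
CREATE TABLE task (
    id   integer PRIMARY KEY,
    name text NOT NULL
);
CREATE TABLE completed_task (
    completed_on date NOT NULL
) INHERITS (task);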
All that said, you would not normally design a database like this. There are good reasons for doing so, but it is more typical to put all columns related to an entity in the same table.

Turn two database tables into one?

I am having a bit of trouble modelling a relational database for an inventory management system. For now, it only has 3 simple tables:
Product
ID | Name | Price
Receivings
ID | Date | Quantity | Product_ID (FK)
Sales
ID | Date | Quantity | Product_ID (FK)
As Receivings and Sales are identical, I was considering a different approach:
Product
ID | Name | Price
Receivings_Sales (the name doesn't matter)
ID | Date | Quantity | Type | Product_ID (FK)
The Type column would identify whether it was a receiving or a sale.
Can anyone help me choose the best option, pointing out the advantages and disadvantages of either approach?
The first one seems reasonable because I am thinking in an ORM way.
Thanks!
Personally I prefer the first option, that is, separate tables for Sales and Receiving.
The two biggest disadvantages of option number 2, merging the two tables into one, are:
1) Inflexibility
2) Unnecessary filtering in use
First, on inflexibility. If your requirements expand (or you simply overlooked something), you will have to break up your schema or end up with unnormalized tables. For example, say your sales now include the Sales Clerk/Person who handled the transaction; that obviously has nothing to do with 'Receiving'. And if you do Retail or Wholesale sales, how would you accommodate that in your merged table? How about discounts or promos? I am only pointing out the obvious here. Now, let's go to Receiving. What if we want to tie our receiving to our Purchase Order? Purchase order details like P.O. Number, P.O. Date, Supplier Name, etc. would not fall under Sales but relate more to Receiving.
Second, on unnecessary filtering in use. If you have merged the tables and you only want to use the Sales (or Receiving) portion, then you have to filter out the Receiving portion either in your back-end or your front-end program. Whereas if they are separate tables, you only have to deal with one table at a time.
Additionally, you mentioned ORM; the first option would best fit that endeavour, because an object or entity should be distinct from other entities/objects.
If the tables really are and always will be identical (and I have my doubts), then name the unified table something more generic, like "InventoryTransaction", and then use negative numbers for one of the transaction types: probably sales, since that would correctly mark your inventory in terms of keeping track of stock on hand.
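A rough sketch of that single-table variant (hypothetical names, with the Product table from the question assumed to be keyed on id); the payoff is that stock on hand falls out of a plain SUM:
CREATE TABLE inventory_transaction (
    id         integer PRIMARY KEY,
    txn_date   date    NOT NULL,
    quantity   integer NOT NULL,              -- positive = receiving, negative = sale
    product_id integer NOT NULL REFERENCES product (id)
);

-- Stock on hand per product.
SELECT product_id, SUM(quantity) AS stock_on_hand
FROM inventory_transaction
GROUP BY product_id;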
The fact that headings are the same is irrelevant. Seeking to use a single table because headings are the same is misconceived.
-- person [source] loves person [target]
LOVES(source,target)
-- person [source] hates person [target]
HATES(source,target)
Every base table has a corresponding predicate aka fill-in-the-[named-]blanks statement describing the application situation. A base table holds the rows that make a true statement.
Every query expression combines base table names via JOIN, UNION, SELECT, EXCEPT, WHERE condition, etc and has a corresponding predicate that combines base table predicates via (respectively) AND, OR, EXISTS, AND NOT, AND condition, etc. A query result holds the rows that make a true statement.
Such a set of predicate-satisfying rows is a relation. There is no other reason to put rows in a table.
(The other answers here address, as they must, proposals for and consequences of the predicate that your one table could have. But if you didn't propose the table because of its predicate, why did you propose it at all? The answer is, since not for the predicate, for no good reason.)

VB.NET Access Database 255 Columns Limit

I'm currently developing an application for a client using Visual Basic .NET. It's a rewrite of an application that accessed an Oracle database, filtered the columns and performed some actions on the data. Now, for reasons beyond my control, the client wants to use an Access (.mdb) database for the new application. The problem with this is that the tables have more than the 255 columns Access supports, so the client suggested splitting the data into multiple databases/tables.
Well, even when the tables are split, at some point I have to query all columns simultaneously (I did an INNER JOIN on both tables), which, of course, yields an error. The limit apparently applies to the number of columns that can be queried at once, not to the total number of columns.
Is there a possibility to circumvent the 255-column limit somehow? I was thinking in the direction of using LINQ to combine queries of both tables, i.e. have an adapter that emulates a single table I can perform queries on. A drawback of this is that .mdb is not a first-class citizen of LINQ-to-SQL (i.e. no insert/update supported, etc.).
As a workaround, I might be able to rewrite my stuff so as to only need all columns at one point (I dynamically create control elements depending on the column names in the table). Therefore I would need to query, say, the first 250 columns and after that the following 150.
Is there an Access SQL query that can achieve something like this? I thought of something like SELECT TOP 255 * FROM dbname or SELECT * FROM dbname LIMIT 1,250, but these are not valid.
Do I have other options?
Thanks a lot for your suggestions.
The ADO.NET DataTable object has no real limitations on the number of columns that it could contain.
So, once you have split the big table into two tables and set the same primary key in both sub-tables (each with fewer columns), you can use the DataTable.Merge method on the VB.NET side.
In their example on MSDN they show two tables with the same schema merged together, but it also works if the two tables have totally different schemas, as long as they share the primary key.
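' LoadFirstTable and LoadSecondTable are placeholders for whatever fills each half of the split table;
' both halves must share the same primary key columns for Merge to line the rows up.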
Dim firstPart As DataTable = LoadFirstTable()
Dim secondPart As DataTable = LoadSecondTable()
firstPart.Merge(secondPart)
I have tested this with just one column of difference, so I am not very sure that this is a viable solution in terms of performance.
As far as I know, there is no way to directly bypass this limit in Access.
If you cannot change the database, the only way I can think of is to write a wrapper that knows where the fields are, automatically splits the query into several queries, and then regroups the results into a custom class containing all the columns for every row.
For example, you can split every table into several tables, duplicating the fields you put conditions on.
TABLEA
Id | ConditionFieldOne | ConditionFieldTwo | Data1 | Data2 | ... | DataN |
in
TABLEA_1
Id | ConditionFieldOne | ConditionFieldTwo | Data1 | Data2 | ... | DataN/2 |
TABLEA_2
Id | ConditionFieldOne | ConditionFieldTwo | Data(N/2)+1 | Data(N/2)+2 | ... | DataN |
and a query such as
SELECT * FROM TABLEA WHERE CONDITION1 = 'condition'
becomes, with the wrapper,
SELECT * FROM TABLEA_1 WHERE ConditionFieldOne = 'condition'
SELECT * FROM TABLEA_2 WHERE ConditionFieldOne = 'condition'
and then join the results.

Uniqueness in many-to-many

I couldn't figure out what terms to google, so help tagging this question or just pointing me toward a related question would be helpful.
I believe that I have a typical many-to-many relationship:
CREATE TABLE groups (
id integer PRIMARY KEY);
CREATE TABLE elements (
id integer PRIMARY KEY);
CREATE TABLE groups_elements (
groups_id integer REFERENCES groups,
elements_id integer REFERENCES elements,
PRIMARY KEY (groups_id, elements_id));
I want to have a constraint that there can only be one groups_id for a given set of elements_ids.
For example, the following is valid:
groups_id | elements_id
1 | 1
1 | 2
2 | 2
2 | 3
The following is not valid, because then groups 1 and 2 would be equivalent.
groups_id | elements_id
1 | 1
1 | 2
2 | 2
2 | 1
Not every subset of elements must have a group (this is not the power set), but new subsets may be formed. I suspect that my design is incorrect since I'm really talking about adding a group as a single entity.
How can I create identifiers for subsets of elements without risk of duplicating subsets?
That is an interesting problem.
One solution, albeit a clunky one, would be to store a concatenation of the group's elements_ids in the groups table (e.g. 1-2 for a group containing elements 1 and 2) and make it a unique index.
Trying to do a search for duplicate groups before inserting a new row would be an enormous performance hit.
The following query would spit out offending group ids:
with group_elements_arr as (
    select groups_id, array_agg(elements_id order by elements_id) elements
    from groups_elements
    group by groups_id )
select elements, count(*), array_agg(groups_id) offending_groups
from group_elements_arr
group by elements
having count(*) > 1;
Depending on the size of groups_elements and its change rate, you might get away with stuffing something along these lines into a trigger watching groups_elements. If that's not fast enough, you can materialize group_elements_arr into a real table managed by triggers.
And I think the check should fire once per statement and be INITIALLY DEFERRED, to make it easy to build up a new group (note, though, that Postgres only allows deferral for row-level constraint triggers, so in practice you may have to pick one property or the other).
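A rough Postgres sketch of the trigger variant (hypothetical names; use EXECUTE PROCEDURE instead of EXECUTE FUNCTION on Postgres before 11). Because it is a plain statement-level trigger rather than a deferred one, a new group has to be built up in a single statement, e.g. one multi-row INSERT:
CREATE OR REPLACE FUNCTION check_unique_element_sets() RETURNS trigger AS $$
BEGIN
    IF EXISTS (
        SELECT 1
        FROM (SELECT array_agg(elements_id ORDER BY elements_id) AS elements
              FROM groups_elements
              GROUP BY groups_id) AS per_group
        GROUP BY elements
        HAVING count(*) > 1
    ) THEN
        RAISE EXCEPTION 'two groups contain exactly the same set of elements';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER groups_elements_unique_sets
AFTER INSERT OR UPDATE OR DELETE ON groups_elements
FOR EACH STATEMENT
EXECUTE FUNCTION check_unique_element_sets();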
This link from user ypercube was most helpful: unique constraint on a set. In short, a bit of what everyone is saying is correct.
It's a question of tradeoffs, but here are the best options:
a) Add a hash or some other combination of element values to the groups table and make it unique, then populate the groups_elements table off of it using triggers. Pros of this method are that it preserves querying ability and enforces the constraint so long as you deny naked updates to groups_elements. Cons are that it adds complexity and you've now introduced logic like "how do you uniquely represent a set of elements" into your database.
b) Leave the tables as-is and control the access to groups_elements with your access layer, be it a stored procedure or otherwise. This has the advantage of preserving querying ability and keeps the database itself simple. However, it means that you are moving an analytic constraint into your access layer, which necessarily means that your access layer will need to be more complex. Another point is that it separates what the data should be from the data itself, which has both pros and cons. If you need faster access to whether or not a set already exists, you can attack that problem separately.
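To make option (a) a bit more concrete, here is a rough Postgres sketch with hypothetical names: the groups table carries a canonical fingerprint of its element set, and a unique index rejects duplicates. Keeping the fingerprint in sync is left to the access layer or a trigger on groups_elements:
ALTER TABLE groups ADD COLUMN element_fingerprint text;
CREATE UNIQUE INDEX groups_element_set_uq ON groups (element_fingerprint);

-- Recompute the fingerprint for every group (e.g. from a trigger or a batch job).
UPDATE groups g
SET element_fingerprint = (
    SELECT string_agg(ge.elements_id::text, '-' ORDER BY ge.elements_id)
    FROM groups_elements ge
    WHERE ge.groups_id = g.id
);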

Architecture of SQL tables

I am wondering whether it is more useful and practical (in terms of DB size) to create multiple tables in SQL with two columns (one column containing the foreign key and one column containing the data) or to merge them and create one table containing multiple columns. I am asking this because in my scenario one product, holding the primary key, could have sufficient/applicable data for only one column while the other columns would be empty.
example a. one table
productID   productname   weight   no_of_pages
1           book          130      500
2           watch         50       null
3           ring          null     null
example b. three tables
productID   productname
1           book
2           watch
3           ring
productID   weight
1           130
2           50
productID   no_of_pages
1           500
The multi-table approach is more "normal" (in database terms) because it avoids columns that commonly store NULLs. It's also something of a pain in programming terms because you have to JOIN a bunch of tables to get your original entity back.
I suggest adopting a middle way. Weight seems to be a property of most products, if not all (indeed, a ring has a weight, even if small, and you'll probably want to know it for shipping purposes), so I'd leave that in the Products table. But number of pages applies only to a book, as do a slew of other unmentioned properties (author, ISBN, etc.). In this example, I'd use a Products table and a Books table. The Books table would extend the Products table in a fashion similar to class inheritance in object-oriented programming.
All book-specific properties go into the Books table, and you join only Products and Books to get a complete description of a book.
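As a sketch of that layout (column names taken from the question where they exist, the rest hypothetical; weight stays on Products as suggested):
CREATE TABLE products (
    productID   integer PRIMARY KEY,
    productname text NOT NULL,
    weight      numeric              -- kept here: it applies to (nearly) every product
);
CREATE TABLE books (
    productID   integer PRIMARY KEY REFERENCES products (productID),
    no_of_pages integer NOT NULL,
    author      text,
    isbn        text
);

-- A complete description of a book.
SELECT p.*, b.no_of_pages, b.author, b.isbn
FROM products p
JOIN books b USING (productID);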
I think this all depends on how the tables will be used. Maybe your examples are oversimplifying things too much, but it seems to me that the first option should be good enough.
You'd really use the second example if you're going to be doing extremely CPU-intensive work with the first table and will only need the second and third tables when more information about a product is needed.
If you're going to need the information in the second and third tables most times you query the table, then there's no reason to do the join every time, and you should just keep it all in one table.
I would suggest example a if there is a defined set of attributes for a product, and example c if you need a variable number of attributes (new attributes keep coming every now and then):
example c
productID   productName
1           book
2           watch
3           ring
attrID   productID   attrType      attrValue
1        1           weight        130
2        1           no_of_pages   500
3        2           weight        50
The table structure you have shown in example b is not normalized - separate id columns would be required in the second and third tables, since productID will be an FK and not a PK.
It depends on how many rows you are expecting in your PRODUCTS table. I would say that it would not make sense to normalize your tables to 3NF in this case because product name, weight, and no_of_pages each describe the product. If you had repeating data, such as manufacturers, it would make more sense to normalize your tables at that point.
Without knowing the background (data model), there is no way to tell which variant is more "correct". Both are fine in certain scenarios.
You want three tables, full stop. That's best because there's no chance of watches winding up with pages (no pun intended) and some books without. If you normalize, the server works for you. If you don't, you do the work instead, just not as well. Up to you.
I am asking this because in my scenario one product holding primary key could have sufficient/applicable data for only one column while other columns would be empty.
That's always true of nullable columns. Here's the rule: a nullable column has an optional relationship to the key. A nullable column can always be, and usually should be, in a separate table where it can be non-null.
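A minimal sketch of that rule using the earlier example (assuming a products table keyed on productID, as in example b): no_of_pages applies only to some products, so it moves to its own table, where it can be declared non-null.
CREATE TABLE product_pages (
    productID   integer PRIMARY KEY REFERENCES products (productID),
    no_of_pages integer NOT NULL
);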