Language features to implement relational algebra - orm

I've been trying to encode a relational algebra in Scala (which to my knowledge has one of the most advanced type systems) and just can't seem to find a way to get where I want.
As I'm not that experienced with the academic field of programming language design, I don't really know which features to look for.
So what language features would be needed, and what language has those features, to implement a statically verified relational algebra?
Some of the requirements:
A Tuple is a function mapping names, drawn from a statically defined set of valid names for the tuple in question, to values of the type associated with each name. Let's call this name-type set the domain.
A Relation is a Set of Tuples with the same domain, such that the range of any tuple is unique in the Set.
So far this model can easily be expressed in Scala simply by:
trait Tuple
trait Relation[T <: Tuple] extends Set[T]
The vals, vars and defs in Tuple form the name-type set defined above, but there shouldn't be two defs in Tuple with the same name. Vars and impure defs should probably be restricted too.
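To make the name-type idea concrete, a domain could look something like this, given the Tuple and Relation traits above (PersonTuple, name and age are invented names for illustration only):

// A concrete domain: the name-type set {name -> String, age -> Int}.
trait PersonTuple extends Tuple {
  def name: String
  def age: Int
}
// A relation over that domain would then be a Relation[PersonTuple].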
Now for the tricky part:
A join of two relations is a relation where the domain of the tuples is the union of the domains of the operands' tuples, such that only tuples having the same ranges for the intersection of their domains are kept.
def join[T1 <: Tuple, T2 <: Tuple](r1: Relation[T1], r2: Relation[T2]): Relation[T1 with T2]
should do the trick.
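As a quick sanity check that T1 with T2 really exposes the union of both domains, here is a minimal sketch (A, B and their members are made-up examples, not part of the actual design):

trait A extends Tuple { def id: Int; def name: String }
trait B extends Tuple { def id: Int; def score: Double }

// A value typed A with B statically offers every member of both domains.
def use(t: A with B): (Int, String, Double) = (t.id, t.name, t.score)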
A projection of a Relation is a Relation where the domain of the tuples is a subset of the domain of the operand's tuples.
def project[T <: Tuple, T2 >: T](r: Relation[T], ?1): Relation[T2]
This is where I'm not sure if it's even possible to find a solution. What do you think? What language features are needed to define project?
Implied above, of course, is that the API has to be usable. Layers and layers of boilerplate are not acceptable.

What you're asking for is to be able to structurally define a type as the difference of two other types (the original relation and the projection definition). I honestly can't think of any language which would allow you to do that. Types can be structurally cumulative (A with B), since A with B is a structural sub-type of both A and B. However, if you think about it, a type operation "A less B" would actually be a supertype of A, rather than a sub-type. You're asking for an arbitrary, contravariant typing relation on naturally covariant types. It hasn't even been proven that this sort of thing is sound with nominal existential types, much less structural declaration-point types.
I've worked on this sort of modeling before, and the route I took was to constrain projections to one of three domains: P == T, P == {F} where F in T, and P == {$_1} where $_1 is anonymous. The first is where the projection is equivalent to the input type, meaning it is a no-op (SELECT *). The second says that the projection is a single field contained within the input type. The third is the tricky one: it says that you are allowing the declaration of some anonymous type $_1 which has no static relationship to the input type. Presumably it will consist of fields which delegate to the input type, but we can't enforce that. This is roughly the strategy that LINQ takes.
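A rough sketch of the second case, the single-field projection, using the Tuple and Relation traits from the question (projectField and field are invented names, and the result is a plain Set rather than a Relation):

// P == {F}: the caller picks exactly one field with a selector T => F.
def projectField[T <: Tuple, F](r: Relation[T])(field: T => F): Set[F] =
  r.map(field)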
Sorry I couldn't be more helpful. I wish it were possible to do what you're asking, it would open up a lot of very neat possibilities.

I think I have settled on just using the normal facilities for mapping collections for the project part. The client just specifies a function [T <: Tuple](t: T) => P.
With some Java trickery to get at the class of P, I should be able to use reflection to implement the query logic.
For the join I'll probably use a DynamicProxy to implement the mapping function.
As a bonus I might be able to get the API to be usable with Scala's special for-syntax.
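A minimal sketch of that idea, assuming a ClassTag is an acceptable stand-in for the "Java trickery" mentioned above (project, f and the use of ClassTag are illustrative choices, not a fixed design):

import scala.reflect.ClassTag

// Projection as an ordinary map; ct.runtimeClass is the class of P,
// which the reflection-based query logic could inspect.
def project[T <: Tuple, P](r: Relation[T])(f: T => P)(implicit ct: ClassTag[P]): Set[P] = {
  val targetClass = ct.runtimeClass // where the reflective query builder would hook in
  r.map(f)
}

Because Relation extends Set, map is already available, so a for-comprehension such as for (t <- rel) yield t.name desugars onto it, which is where the for-syntax bonus comes from.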

Related

SQL Database Design ERD - Empty entity because of different function

As you can see below, the User is able to make a Call, the Operator will log it, writing the time (error on my part, Column2), his own ID and the ID of the caller. The Operator is also able to create a Solution, by generating a Solution ID and describing the solution.
Note that nothing differentiates the User from the Operator in terms of attributes. Indeed, they both inherit their ID from the Person entity.
So I have two questions.
First, as you can see, the Call entity has two attributes which are the same column (ID, for User and Operator), but which will always represent two different people (i.e. a User will never be an Operator). Is this the correct notation for such a thing?
Secondly, I am not sure about having User and Operator as separate entities, because no attribute distinguishes them from one another, only their ability to do something or not (a User can't create a Solution). This would mean that they don't have any attributes apart from the ones they inherit. Is this correct, or should the two entities be merged under the Person entity?
Thanks in advance.
It's valid to create subtypes with distinct relationships and/or constraints, even if they have no distinct attributes. You'll be able to use referential integrity to ensure that Operator IDs and User IDs don't get mixed up in the Call table, and it's possible to enforce mutual exclusion between the IDs in the User and Operator tables.
As far as notation is concerned, I would show the ID in the User and Operator tables, and use Crow's foot lines to represent the FK constraints between the tables. If I wanted to make the subtyping explicit, I would rather show that on an EER diagram using Chen's notation than on a table diagram.

UML composition attributes not featured in the class?

I have a class A and it has data members of class B and class C, which are composition relationships. As I am going to draw a composition relationship line from B to A and from C to A, does this mean I cannot also include the data members within the class A "box", because the relationship is inferred from the composition relationship lines?
I ask because the data member variable names seem a good way to help understand the context, and this cannot be represented if you omit the data members from the class A "box"?
I am not sure if there is a cast-iron rule in UML or whether I am free to choose. This is not for auto-generation of code, just for human reading.
At least, in UML you can show the name of each property, as in the figure below.
According to the UML specification, both representations of data members (a visually depicted association/composition between two classes, or an in-class data member display) are equivalent. Here is an example (your case, modified a little to make it clearer):
Note that the association end also shows the scope and collection type (besides the name, of course). col_B is defined as a private {ordered} collection (like an array).
So, getting back to the formal side of the UML spec... a, x, aa, col_b and m_c are all so-called structural features (or properties) of A. They can all be visually depicted using relationships between the classes or inside the class itself. You can even show the "int" data type as a class and link it using a composition!
Which way you use is up to you; it's largely a matter of personal taste.
I always apply a simple rule - basic data types (int, boolean, date, string, etc.) and their arrays are shown in the class itself, while class- and enumeration-based properties are depicted by a relationship (as in the example at the top).
As simple data types are clear and well known and do not have their own properties, I find it clear enough to show them in-class (the diagram is simpler and smaller).
The complex data types (classes and enumerations), however, typically have their own properties (data members, associations), even inheritance, and I want to make the class structure stand out on my diagram.
You can use your own logic though.
In a class diagram you cannot model the same composition by showing both the association and the attribute, because in UML semantics that would mean your class has two compositions :-)
If in your diagram you already have classes B and C, I suggest you opt for the association ("relationship lines") solution.
To better understand the context, you can put role names on the association: these are equivalent to the names of your attributes.

What is a database closure?

I came across this term called database closure.
I tried to look up what exactly it means, but I have not found any simple explanation.
Can someone please explain what the concept of closure is and specifically what a database closure is, whether it is good or bad, and how it can be used or avoided?
Also, it seems there is a general closure term: http://en.wikipedia.org/wiki/Closure_%28computer_science%29 which relates to binding variables to a function. Is a database closure related to this?
Thanks!
Closure is actually a relatively simple concept. When designing databases we want our database tables to have as little redundancy as possible. This means making sure that we have as few relationships between sets (or tables) as possible.
An example:
If we have two sets X and Y (which you can think of as two tables called X and Y) and they have a relationship with each other as so:
X -> Y (Read this as Y is dependent on X)
And we have another set Z which is dependent on Y:
Y -> Z (also read as Y determines Z)
To find the closure we find the minimum set of tables from which we can reach all relationships. In this case all we need is X.
So now, when we design our database, we know that we only need a relationship from X, since Y and Z can actually be derived from X. We can therefore make sure there are no extra relationships in our database which cause redundancy.
If you want to read more, closure is a part of a topic called normalisation.
Closure is mentioned in database theory / set theory discussions -- as in, Dr. Codd / design & normalization kind of stuff. It has to do with finding the minimally representational elements of sets (i.e., without redundancy, etc.). I tried reading up on it a long time ago, but my eyes went crossed and I got a really bad headache.
If you want to read a decent summary of closure, here is one: http://www.cs.sfu.ca/CC/354/jpei/slides/ClosureDecomposition.pdf
All operations are performed on an entire relation and result in an entire relation, a concept known as closure. That is one of the characteristics of relational database systems.
The closure is essentially the full set of attributes that can be determined from a set of known attributes, for a given database, using its functional dependencies.
Formal math definition:
Given a set of functional dependencies F and a set of attributes X, the closure of X is defined to be the set of attributes Y such that X -> Y follows from F.
Algorithm definition:
Closure(X, F)
1 INITIALIZE V := X
2 WHILE there is a Y -> Z in F such that:
    - Y is contained in V, and
    - Z is not contained in V
3 DO add Z to V
4 RETURN V
It can be shown that the two definitions coincide.
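For what it's worth, here is a small Scala sketch of that algorithm (the FD case class and the attribute-as-String representation are just illustrative choices):

// A functional dependency Y -> Z over attribute names.
case class FD(lhs: Set[String], rhs: Set[String])

// Attribute-set closure: keep adding the right-hand side of any FD whose
// left-hand side is already contained in the result, until nothing changes.
def closure(x: Set[String], f: Set[FD]): Set[String] = {
  var v = x
  var changed = true
  while (changed) {
    changed = false
    for (FD(y, z) <- f if y.subsetOf(v) && !z.subsetOf(v)) {
      v = v ++ z
      changed = true
    }
  }
  v
}

// With X -> Y and Y -> Z, the closure of {X} is {X, Y, Z}:
// closure(Set("X"), Set(FD(Set("X"), Set("Y")), FD(Set("Y"), Set("Z"))))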
A database closure might refer to the closure of all of the database attributes. According to the definitions above, this closure would be the set of all attributes of the database itself.
The closure (computer science) term that you linked to is not related to closure in databases, but the mathematical notion of closure is.
For a better understanding of functional dependencies and a simple example for closure in databases I suggest reading this.
If we are referring to Closure in the Functional Dependency sense (relating to database design),
The closure of a set F of functional dependencies is the set of all functional dependencies logically implied by F.
The minimal representation of sets is referred to as the canonical cover: the irreducible set of FDs that describes the closure.

Can I (theoretically) use a Collection (e.g., Array, List) as a foreign key in a relational Database schema?

Is it possible to use a Collection such as a Multiset or Array as the foreign key in a database schema?
Background: A student proposes to use such a construct (i.e., not using a join table for the n:m association) to store the following object structure in a database:
public class Person {
    String name;
    List<Resource> res;
    …
}

public class Resource {
    int id;
    List<Person> prs;
    …
}
SQL:2003
IMHO, the student didn't understand relational concepts. I don't know how collection types are implemented in today's databases, but they most probably store them in separate tables.
Edit
Even if it were technically possible, I doubt that it would be useful. Consider the query language: SQL is designed for relational structures, and I doubt that you could really have the same flexibility and possibilities using collection types. If you had them, you couldn't read the queries anymore. Consider indexes, etc.
Relational structures are primitive but very powerful and fast. You can do (almost) everything with them. Collection types are actually not needed, although they may be useful in certain cases. Using collections (for relational stuff) would just be more complex and less pure.
As David pointed out, theory allows attribute values to be of a collection type.
However, in your case, which is just to model n:m relationships (am I right about that?), it simply does not apply.
If a Person P1 has associated resources R1 and R2, the row for this person would be like {P1, {R1, R2}}. If that collection-typed column were a foreign key referencing some other table, it would mean that there had to be another table in which a row appeared with the collection value {R1, R2} in some column. Which table would that be in your example?
Collection-typed attributes are mostly useful if you have a need for dealing with empty collections alongside non-empty ones. There is no relational join in the world that will do its equivalent for you.
Simply put, I would have said no. I don't think that it is possible in SQL:2003, and in any case it would couple the code and the database structure too closely. Remember the good practice of structuring code so that a change to your database doesn't require a change to your code, and vice versa.
As Stefan said, you need separate tables for Resource and Person, with foreign key links between their indexes.
So, based on the classes shown, each table would need 3 columns.
You would then obtain your class data by using an appropriate query to the database.
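For illustration only, the row shapes of that kind of join-table mapping could be sketched roughly like this (Scala case classes standing in for table rows; all names are invented, and the elided fields (…) in the classes above would add further columns):

// One case class per table; the link table carries only the two foreign keys.
case class PersonRow(personId: Int, name: String)
case class ResourceRow(resourceId: Int)
case class PersonResourceRow(personId: Int, resourceId: Int)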
In principle, yes, you can implement such a referential constraint, assuming your RDBMS allows a suitable type for the set of values. For instance, it could be a relation value if relation-valued attributes (RVAs) are supported.
If it were an RVA, then the constraint could easily be expressed in the relational algebra/calculus or their equivalent. For instance, you can do it in an RDBMS like Rel, which supports the Tutorial D language. Doing it in SQL is probably going to be a lot harder - but then SQL is not a truly relational language.
Of course, the fact that you can do it relationally does not necessarily make it a good idea...

Model Heterogeneous Type in Database

I am trying to figure out the best way to model a set of "classes" in my system. Note that I'm not talking about OO classes, but classes of responses (in a survey). So the model goes like this:
A Class can be defined with three different types of data:
A Class of Coded Responses (where a coded response consists of a string label and an integer value)
A Class of Numeric Responses (defined as a set of intervals where each interval ranges from a min to a max value)
A Class of String Responses (defined as a set of regular expression patterns)
Right now we have a Class table (to define unique classes) and ClassCoded, ClassNumeric and ClassString tables (all with a ClassID as a foreign key to the Class table).
My problem is that right now a Class could technically be both Coded and Numeric in this system. Is there any way to define the set of tables to be able to handle this situation?
There are two main ways to handle subtypes: either use sparse columns, adding columns for every possible property (preferably with check constraints to make sure only one type has values), or create a table for the supertype and then three tables for the subtypes, each with foreign keys back to the supertype table. Then add a check constraint to ensure that only one of the three possible type columns is not null.
Personally, I decide which of the two implementations to use based on how similar the subtypes are. If 90% of the columns are shared, I use the sparse-columns approach; if very little information is shared, I use the multiple-tables approach.
Relational databases don't handle this elegantly. The simplest way is to define columns for all the different types of data, and only fill in the appropriate ones.
I don't understand what the issue is. This is just mixin inheritance. Why can't a Class just have an entry in both ClassCoded and ClassNumeric?
The enforcement of business rules isn't going to be done in the DB anyway, so you can easily enforce these constraints in the business-layer code, with special rules for Classes that have entries in both these tables.