Explanation of how Direct Mapping converts a many-to-many relationship - sparql

One of the W3C standards for RDB2RDF is Direct Mapping. I have heard that there is a problem when converting a many-to-many relationship from a relational database: they say it loses semantics. I need more explanation about this.

...there is a problem when converting a many-to-many relationship from a relational database
I'd say that direct mapping introduces additional "parasitic" semantics, treating normalization artefacts as first-class objects.
Let's consider the D011-M2MRelations test case.
Student
+---------+-----------+----------+
| ID (PK) | FirstName | LastName |
+---------+-----------+----------+
| 10      | Venus     | Williams |
| 11      | Fernando  | Alonso   |
| 12      | David     | Villa    |
+---------+-----------+----------+
Student_Sport
+------------+----------+
| ID_Student | ID_Sport |
+------------+----------+
| 10         | 110      |
| 11         | 111      |
| 11         | 112      |
| 12         | 111      |
+------------+----------+
Sport
+---------+-------------+
| ID (PK) | Description |
+---------+-------------+
| 110     | Tennis      |
| 111     | Football    |
| 112     | Formula1    |
+---------+-------------+
Direct mapping generates a lot of triples of this kind:
<Student_Sport/ID_Student=11;ID_Sport=111> <Student_Sport#ref-ID_Student> <Student/ID=11>.
<Student_Sport/ID_Student=11;ID_Sport=111> <Student_Sport#ref-ID_Sport> <Sport/ID=111>.
<Student_Sport/ID_Student=11;ID_Sport=112> <Student_Sport#ref-ID_Student> <Student/ID=11>.
<Student_Sport/ID_Student=11;ID_Sport=112> <Student_Sport#ref-ID_Sport> <Sport/ID=112>.
Modeling from scratch, you'd probably write something like this (R2RML allows you to achieve that; a sketch follows the triples):
<http://example.com/student/11> <http://example.com/plays> <http://example.com/sport/111>.
<http://example.com/student/11> <http://example.com/plays> <http://example.com/sport/112>.
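A minimal R2RML mapping that produces such triples could look roughly like this (a sketch; the ex:plays predicate and the IRI templates are assumptions chosen to match the triples above):
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/> .

<#PlaysMap>
    rr:logicalTable [ rr:tableName "Student_Sport" ] ;
    rr:subjectMap [ rr:template "http://example.com/student/{ID_Student}" ] ;
    rr:predicateObjectMap [
        rr:predicate ex:plays ;
        rr:objectMap [ rr:template "http://example.com/sport/{ID_Sport}" ]
    ] .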
Moreover, one can't improve the results by denormalizing the original tables or creating SQL views: without primary keys, the results are probably even worse.
To improve the results, a subsequent SPARQL DELETE/INSERT (or CONSTRUCT) seems to be the only option available; the process would then better be called ELT rather than ETL. Perhaps the following DM-generated triples were intended to help in such a transformation (a CONSTRUCT sketch follows them):
<Student_Sport/ID_Student=11;ID_Sport=111> <Student_Sport#ID_Student> "11"^^xsd:integer.
<Student_Sport/ID_Student=11;ID_Sport=111> <Student_Sport#ID_Sport> "111"^^xsd:integer.
<Student_Sport/ID_Student=11;ID_Sport=112> <Student_Sport#ID_Student> "11"^^xsd:integer.
<Student_Sport/ID_Student=11;ID_Sport=112> <Student_Sport#ID_Sport> "112"^^xsd:integer.
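For example, a CONSTRUCT over the ref- triples shown earlier could derive the hand-modeled form (a sketch: the relative IRIs are the DM-generated ones from above, to be resolved against the mapping's base IRI, and the plays predicate is an assumption):
# Rewrite link-table resources into direct student-plays-sport triples.
CONSTRUCT {
  ?student <http://example.com/plays> ?sport .
}
WHERE {
  ?link <Student_Sport#ref-ID_Student> ?student ;
        <Student_Sport#ref-ID_Sport>   ?sport .
}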
...they say it loses semantics
@JuanSequeda means that DM doesn't generate an OWL ontology from a relational schema; this behaviour is not specific to many-to-many relations.
See also links from Issue 14.

Related

How to define a relationship between two tables from different sources with different identifiers

Background:
I'm working on a project that does not allow me to share the data, but I'll do my best to give you some visualisation below. Before going further: I know (some) SQL, and I have done basic relationship work before, but that data was clean and simple, and for some reason I just can't figure out a solution here.
Problem (?)
I'm trying to define a relationship between two tables from two different sources that each work with different identifiers. I do, however, have a mapping table from one of those sources, but again the identifiers do not align. Let me try to explain visually:
| TABLE 1 (cies) |      | TABLE 2 (forms) |
| -------------- |      | --------------- |
| id(PK)         |      | id(PK)          |
| 4_digit_code   |      | 16_digit_code   |
| ...more fields |      | ...more fields  |
The second source provided me with a mapping table they use internally:
| MAPPING TABLE |
| ------------- |
| id(PK)        |
| 4_digit_code  |  (= the one in TABLE 1)
| 16_digit_code |  (= the one in TABLE 2)
My first thought was to create a script and just merge the info from the mapping table into TABLE 1, like so:
| TABLE 1        |      | TABLE 2        |
| -------------- |      | -------------- |
| id(PK)         |      | id(PK)         |
| 16_digit_code  | ==== | 16_digit_code  |
| 4_digit_code   |
The issue here is that the 16_digit_code is not unique, so I believe this does not work. Now comes something I have no experience with, so I am just thinking out loud here:
Can I keep the mapping table and reference it each time to get my data from one table via the other? On the other hand, shouldn't all values in a mapping table be unique as well for it to work? The reason there are non-unique values is that (some) very old numbers end up getting recycled.
For example, get me all forms from the company with id 1:
| TABLE 1        |      | MAPPING TABLE  |      | TABLE 2        |
| -------------- |      | -------------- |      | -------------- |
| id(PK)         |      | id(PK)         |      | id(PK)         |
| 16_digit_code  |      | 16_digit_code  | ==== | 16_digit_code  |
| 4_digit_code   | ==== | 4_digit_code   |      | ...more fields |
Even with the above, I would not know how to approach this problem efficiently. I really don't know whether what I am saying makes any sense, or whether I am missing something or making this way too complex.
Solution?
I'd love it if someone could point me in the right direction. And if you have a solution, I'd love to know the reasoning, not just the solution itself, as I'd obviously love to learn from this for the future.
Edit/Clarification:
Just for completeness' sake: the mapping combination (4-digit + 16-digit code) is unique, although, as I said earlier, one 16-digit code can be linked to multiple 4-digit codes.
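For what it's worth, a query through the mapping table might look roughly like this (a sketch; the table names come from the diagrams, and code_4/code_16 stand in for the 4_digit_code/16_digit_code columns):
-- All forms for the company with id = 1, joined through the mapping table.
SELECT DISTINCT f.*
FROM   cies    AS c
JOIN   mapping AS m ON m.code_4  = c.code_4
JOIN   forms   AS f ON f.code_16 = m.code_16
WHERE  c.id = 1;
-- DISTINCT guards against duplicate rows when one 16-digit code is linked
-- to several 4-digit codes.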

Flexibility of scenarios in Gherkin

I'm looking for a mechanism that will allow me to build more flexible scenarios.
For example, take these two very similar scenarios that test the existence of records in a database:
Scenario Outline: Testing a query with 1 attribute, with these 2 records in and another 2 out of the result
Given I'm connected to <db> database
When I select <query> from database
Then Result should contain fields:
| <row> |
| <yes1> |
| <yes2> |
And Result should not contain fields:
| <row> |
| <no1> |
| <no2> |
Examples:
| db | row | yes1 | yes2 | no1 | no2 | query |
| 1 | model | 1013 | 1006 | 1012 | 1007 | "SELECT model FROM pc WHERE speed >= 3.0;" |
| 1 | maker | E | A | C | H | "SELECT maker FROM product NATURAL JOIN laptop WHERE hd >= 100;" |
Scenario Outline: Testing a query with 2 attributes, with these 2 records in and another 2 out of the result
Given I'm connected to <db> database
When I select <query> from database
Then Result should contain fields:
| <rowA> | <rowB> |
| <yes1A> | <yes1B> |
| <yes2A> | <yes2B> |
And Result should not contain fields:
| <rowA> | <rowB> |
| <no1A> | <no1B> |
| <no2A> | <no2B> |
Examples:
| db | rowA | rowB | yes1A | yes1B | yes2A | yes2B | no1A | no1B | no2A | no2B | query |
| 1 | model | price | 1004 | 649 | 2007 | 1429 | 2004 | 1150 | 3007 | 200 | "SELECT model,price FROM product" |
| 2 | name | country | Yamato | Japan | North | USA | Repulse | Brit | Cal | USA | "SELECT name, country FROM clases" |
I would like to be able to write one scenario with a general number of attributes. It would be great if the number of tested rows were not fixed either.
My dream is to write only one general scenario:
Testing a query with N attributes, with these M records in and another L out of the result
How can I do this in Gherkin? Is it possible with any hacks?
The short answer is: no. Gherkin is not about flexibility; Gherkin is about concrete examples, and concrete examples are anything but flexible.
The long answer is:
You are describing a usage of Gherkin as a test tool. The purpose of Gherkin, however, is not to test things. The purpose of Gherkin is to facilitate communication between the developers and the stakeholders who want a specific behaviour.
If you want to test something, there is other tooling that will support exactly what you want. Any test framework would be usable; my personal choice would be JUnit, since I work mostly with Java.
The litmus test for deciding on the tooling is: who will have to be able to understand this?
If the answer is non-techs, I would probably use Gherkin with very concrete examples. Concrete examples most likely do not compare things in a database; they tend to describe the external, observable behaviour of the system.
If the answer is developers, then I would probably use a test framework where I have access to a programming language. This would allow for the flexibility you are asking for.
In your case, you are asking for a programming language, so Gherkin and Cucumber are not the right tools for you.
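To illustrate that route, a parameterized JUnit 5 test gives you the generality you are asking for (a sketch; the JDBC URL is hypothetical, and the schema and data are assumed to already exist):
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

class QueryContentTest {

    // Hypothetical JDBC URL; point this at your test database.
    private static final String JDBC_URL = "jdbc:h2:mem:testdb";

    @ParameterizedTest
    @CsvSource({
        // query, column, expected values, forbidden values
        "'SELECT model FROM pc WHERE speed >= 3.0', model, '1013;1006', '1012;1007'",
        "'SELECT maker FROM product NATURAL JOIN laptop WHERE hd >= 100', maker, 'E;A', 'C;H'"
    })
    void resultContainsAndExcludes(String query, String column,
                                   String expected, String forbidden) throws SQLException {
        // Collect the values of the tested column from the query result.
        Set<String> actual = new HashSet<>();
        try (Connection conn = DriverManager.getConnection(JDBC_URL);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                actual.add(rs.getString(column));
            }
        }
        // Any number of expected/forbidden values, separated by semicolons.
        for (String value : expected.split(";")) {
            assertTrue(actual.contains(value), "missing: " + value);
        }
        for (String value : forbidden.split(";")) {
            assertFalse(actual.contains(value), "unexpected: " + value);
        }
    }
}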
You can do it without any hacks, but I don't think you want to, at least not the entire scenario in a single line.
You will want to follow the BDD structure; otherwise, why use BDD?
You should have and follow a structure like:
Given
When
Then
You need to split the scenario and keep a clear delimitation between the initial context, the action(s) and the result(s). It is bad practice not to have a boundary between these.
Also note that a clear delimitation will increase reusability and readability, and will also help you a lot in debugging.
Please do some research on what BDD means and how it helps; it may be useful to keep a checklist of BDD best practices, which can also help in code review of the automated scenarios.

Database functional dependency for Nullable Columns

I have 4 columns in my non-decomposed, non-normalized Job Application table which are all nullable. For example, my table is:
Name | SSN | Education   | City | Job Applied   | Post       | Job Obtained | Post Obtained
John | 123 | High School | LA   | USPS          | MailMan    | USPS         | MailMan
John | 123 | High School | LA   | Dept. of Agri | Assistant  | *null*       | *null*
Sam  | 123 | BS          | NY   | Intel         | QA Analyst | Intel        | QA Analyst
The first 4 columns are non-nullable, so I can easily determine functional dependencies between them.
The last 4 columns may or may not have values, depending on whether a person has applied for a job and whether he/she has obtained one.
My question is: can I have functional dependencies on nullable columns, whether they appear on the LHS or the RHS?
The answer should be yes; please see:
http://en.wikipedia.org/wiki/Functional_dependency

RDBMS schema for unknown columns

I have a project with a MySQL database, and I would like to be able to upload various datasets. Say I am building a restaurant reviews aggregator: we would like to keep adding every source of restaurant reviews we can get our hands on, keeping all the information.
I have a table review_sources
==========================
| id | name              |
==========================
| 1  | Zagat             |
| 2  | GoodEats Magazine |
| ...                    |
| 50 | Allergy News      |
==========================
Now say I have a table reviews
=================================================================
| id  | Restaurant Name | source_id | Star Rating | Description |
=================================================================
| 0   | Joey's Burgers  | 1         | 3.5         | Wow!        |
| 1   | Jamal's Steaks  | 1         | 3.5         | Yummy!      |
| 2   | Jenny's Crepes  | 1         | 4.5         | Sweet!      |
| ...                                                           |
| 253 | Jeeva's Curries | 3         | 4           | Spicy!      |
=================================================================
Now suppose someone wants to add reviews from "Allergy News", which have a field "nut-free". Or a source of reviews could describe the degree of kashrut compliance, halal compliance or vegan-friendliness. As the designer, I don't know what optional fields future data sources may have. I want to be able to answer queries such as:
What are all the fields in the Zagat reviews?
For review id=x, what is the value of the optional field "vegan-friendly"?
So how do I design a schema that can handle these disparate data sources and answer these queries? My reasons for not going for NoSQL are that I do want certain types of normalization, and that this is part of an existing MySQL-based project.
I'd use a many-to-many relationship: a table of fields (e.g. "vegan-friendly"), and then of course a reviews_fields table to map one to the other, carrying the value of the field for that review (a sketch follows).
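A sketch of that design in MySQL (table and column names are hypothetical), together with the two queries from the question:
-- Field catalogue and the review-to-field mapping with values.
CREATE TABLE fields (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(64) NOT NULL UNIQUE      -- e.g. 'vegan-friendly', 'nut-free'
);

CREATE TABLE reviews_fields (
  review_id INT NOT NULL,
  field_id  INT NOT NULL,
  value     VARCHAR(255),
  PRIMARY KEY (review_id, field_id),
  FOREIGN KEY (review_id) REFERENCES reviews (id),
  FOREIGN KEY (field_id)  REFERENCES fields (id)
);

-- "What are all the fields in the Zagat reviews?" (source_id 1 = Zagat)
SELECT DISTINCT f.name
FROM reviews r
JOIN reviews_fields rf ON rf.review_id = r.id
JOIN fields f          ON f.id = rf.field_id
WHERE r.source_id = 1;

-- "For review id = x, what is the value of 'vegan-friendly'?" (x = 253 here)
SELECT rf.value
FROM reviews_fields rf
JOIN fields f ON f.id = rf.field_id
WHERE rf.review_id = 253 AND f.name = 'vegan-friendly';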

Schema design: to-one foreign relation with heterogeneous type (different targets), but same role

I have been working to build a more abstract schema: where there had been several tables modeling remarkably similar relationships, I want to model just the "essence". Due to the environment I am working with (Drupal 7), I can't change the nature of the issue: a relationship of the same essential type could reference one of two different tables for the object in one role. Let me bring in an example to clarify (this is not my actual problem domain, but a similar problem). Here are the requirements:
First, if you are unfamiliar with Drupal, here's the gist: Users in one table, every other entity in a single second table (gross generalization, but enough).
Let's say we want to model the "works for" relationship, and let's take as given that "companies" are of type "entity" and "supervisors" are of type "user" (by "type" I mean the table in the database where their tuples reside). Here are the simplified requirements:
A user can work for a company
A company can work for a company
These "works for" relationships should be in the same table.
I have two ideas, and neither sits well with my current disposition toward schema quality; this is where I would like some insight.
One foreign-key column paired with a 'type' column
Two foreign-key columns, always at most one utilized (ick!)
In case you are a visual thinker, here are the two options representing the fact that users 123 and 632, as well as entity 123, all work for entity 435:
Option 1
+---------------+-------------+---------------+-------------+
| employment_id | employee_id | employee_type | employer_id |
+---------------+-------------+---------------+-------------+
| 1             | 123         | user          | 435         |
+---------------+-------------+---------------+-------------+
| 2             | 123         | entity        | 435         |
+---------------+-------------+---------------+-------------+
| 3             | 632         | user          | 435         |
+---------------+-------------+---------------+-------------+
Option 2
+---------------+------------------+--------------------+-------------+
| employment_id | employee_user_id | employee_entity_id | employer_id |
+---------------+------------------+--------------------+-------------+
| 1             | 123              | <NULL>             | 435         |
+---------------+------------------+--------------------+-------------+
| 2             | <NULL>           | 123                | 435         |
+---------------+------------------+--------------------+-------------+
| 3             | 632              | <NULL>             | 435         |
+---------------+------------------+--------------------+-------------+
Thoughts on option 1: I like that the employee_id column has a concrete role, but I despise that it has an ambiguous target. Option 2 has an ambiguous role (which column is the employee?) but a concrete target for any given FK, so I can think of it this way:
+------------+-----------+----------+
|            |         ROLE         |
|            | ambiguous | concrete |
+------------+-----------+----------+
| T          |           |          |
| A  ambig.  |           |    1     |
| R          |           |          |
| G ---------+-----------+----------+
| E          |           |          |
| T  concr.  |     2     |    ?     |
|            |           |          |
+------------+-----------+----------+
Option 2 has very pragmatic benefits for my project, but I do not feel comfortable with so many nulls (you might not even call it 1NF!).
So here's the crux of my question for SO: How can option 1 be improved, or else what knowledge gap might I have that leaves me unsettled? While I can't bring to mind a specific rule which it violates, the design clearly is not in keeping with the intentions of normalization (requiring two columns to uniquely identify a relationship is not doing me any favors for safeguarding against anomalies).
I do understand that the ideal solution would be to redesign the users entity to be the same as what I have been calling "entity" here, but please consider that beside the point/circumstantial (or at least let's draw the pragmatic line right exactly there for this question).
Again, the essential question: What, in terms of normalization, is wrong with schema option 1, and how might you model this relationship given the constraint of not refactoring "user" into "entity"?
note: For this, I am more interested in theoretical purity than a pragmatic solution
The solutions you present contravene 4th normal form, as @podiluska says. If this is recast into the form below, then the solution removes this difficulty and is in 5NF (and even 6NF?).
Adopt one of the patterns for sub/supertypes. This uses the relation definitions set out below, plus the super/subtype constraint. This constraint is that each tuple in the supertype relation must correspond to exactly one subtype tuple; in other words, the subtypes must form a disjoint, covering set over the supertype.
I suspect the performance of this in a real situation might require some heavy tuning:
Table: Employment
+-------------+-------------+
| employee_id | employer_id |
+-------------+-------------+
| 1           | 435         |
+-------------+-------------+
| 2           | 435         |
+-------------+-------------+
| 3           | 435         |
+-------------+-------------+
Table: Employee (SuperType)
+-------------+
| employee_id |
+-------------+
| 1           |
+-------------+
| 2           |
+-------------+
| 3           |
+-------------+
Table: User employee (SubType)
+-------------+---------+
| employee_id | user_id |
+-------------+---------+
| 1           | 123     |
+-------------+---------+
| 3           | 632     |
+-------------+---------+
Table: Entity employee (SubType)
+-------------+-----------+
| employee_id | entity_id |
+-------------+-----------+
| 2           | 123       |
+-------------+-----------+
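In SQL, those relations might be declared like this (a sketch; users and entity stand for the pre-existing Drupal tables, and the disjoint/covering constraint itself cannot be expressed declaratively in most engines, so it would need triggers or application logic):
CREATE TABLE employee (
  employee_id INT PRIMARY KEY
);

CREATE TABLE user_employee (
  employee_id INT PRIMARY KEY,
  user_id     INT NOT NULL UNIQUE,
  FOREIGN KEY (employee_id) REFERENCES employee (employee_id),
  FOREIGN KEY (user_id)     REFERENCES users (id)
);

CREATE TABLE entity_employee (
  employee_id INT PRIMARY KEY,
  entity_id   INT NOT NULL UNIQUE,
  FOREIGN KEY (employee_id) REFERENCES employee (employee_id),
  FOREIGN KEY (entity_id)   REFERENCES entity (id)
);

CREATE TABLE employment (
  employee_id INT NOT NULL,
  employer_id INT NOT NULL,
  PRIMARY KEY (employee_id, employer_id),
  FOREIGN KEY (employee_id) REFERENCES employee (employee_id),
  FOREIGN KEY (employer_id) REFERENCES entity (id)
);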
What is wrong with option 1 (and option 2) is that it contains a multivalued dependency, and as such is a breach of 4th normal form. However, within the constraints you have given, there's not a lot you can do about that.
If you could replace the worksfor table with a view, then you could keep the user-company and company-company relations separate (a sketch follows).
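A sketch of that idea (hypothetical names): store the two relations in separate tables and present them through a single view.
CREATE TABLE user_works_for (
  user_id     INT NOT NULL,
  employer_id INT NOT NULL,
  PRIMARY KEY (user_id, employer_id)
);

CREATE TABLE entity_works_for (
  entity_id   INT NOT NULL,
  employer_id INT NOT NULL,
  PRIMARY KEY (entity_id, employer_id)
);

CREATE VIEW worksfor AS
SELECT user_id AS employee_id, 'user' AS employee_type, employer_id
FROM   user_works_for
UNION ALL
SELECT entity_id, 'entity', employer_id
FROM   entity_works_for;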
Of your two choices, Option 2 has the advantage that it may be easier to enforce the referential integrity, depending on your platform.
One potential, if icky, pragmatic solution within your current constraints could be to give companies positive IDs and users negative IDs, which eliminates the empty column of option 2 and turns the type column of option 1 into an implication; but I feel dirty even suggesting it.
Similarly, if you don't need to know what type the entity is, as long as you can determine it via joining, then using GUIDs as IDs would eliminate the need for the type column.