Transform / query SQL database to XML document? - sql

I have a very basic database with tables for nodes, node-properties and node-relationships. The relationships are just pointers with no particular implied graph structure.
Now I want to query this database to build an xml document with a hierarchical structure based on certain relationships in this database.
Let's say I have nodes A, B and C where C is related to B as parent, B->A:parent, C->A:user, A->B:context.
The database doesn't know anything about any hierarchies but now I want to build a hierarchical XML document based on the relationships named "parent".
On top of that I will also want to add the node properties and other relationships to this graph, as well as transient properties such as the number of nodes directly related to other nodes as for instance "owner" etc.
So my question is: is this something people commonly do? Can it be done with existing tools from Microsoft or third parties, or do I have to build this XML structure manually, step by step?
What I ideally want is something similar to how XSLT/XPath works where you navigate the source (sql-db in this case) and transform it into another structure.
I guess it's something similar to a pivot table (except that the result is a hierarchical graph rather than a table).

I think I found my answer now. "FOR XML" seems to be my friend.
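For anyone who cannot use FOR XML, here is a minimal sketch of the same idea done outside the database, using Python's sqlite3 and xml.etree modules. The table and column names (nodes, relationships, source, target, name) and the element names are assumptions, not from the question, and I read a row (source, target, 'parent') as "target is the parent of source". It also assumes the "parent" edges contain no cycles.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one table of nodes, one of named relationships.
    CREATE TABLE nodes (id TEXT PRIMARY KEY);
    CREATE TABLE relationships (source TEXT, target TEXT, name TEXT);
    INSERT INTO nodes VALUES ('A'), ('B'), ('C');
    INSERT INTO relationships VALUES
        ('C', 'B', 'parent'), ('B', 'A', 'parent'),
        ('C', 'A', 'user'),   ('A', 'B', 'context');
""")

def build_xml(conn):
    # Map each parent id to the ids of its direct children.
    children = {}
    has_parent = set()
    for source, target in conn.execute(
        "SELECT source, target FROM relationships "
        "WHERE name = 'parent' ORDER BY source"
    ):
        children.setdefault(target, []).append(source)
        has_parent.add(source)

    def to_element(node_id):
        elem = ET.Element("node", id=node_id)
        # Non-hierarchical relationships become <rel> child elements.
        for name, target in conn.execute(
            "SELECT name, target FROM relationships "
            "WHERE source = ? AND name <> 'parent' ORDER BY name",
            (node_id,),
        ):
            ET.SubElement(elem, "rel", name=name, target=target)
        # Recurse into the nodes that name this one as their parent.
        for child_id in children.get(node_id, []):
            elem.append(to_element(child_id))
        return elem

    root = ET.Element("nodes")
    for (node_id,) in conn.execute("SELECT id FROM nodes ORDER BY id"):
        if node_id not in has_parent:  # roots of the hierarchy
            root.append(to_element(node_id))
    return root

print(ET.tostring(build_xml(conn), encoding="unicode"))
```

With the sample rows, A becomes the only top-level node, B is nested inside A, and C inside B; the "user" and "context" relationships show up as rel elements on C and A.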

Related

SQL persisting graph like data structures

I'm trying to figure out the best way to store graph data structures in an SQL database. After some research, it seems that I can store graph Nodes in a table and just create a join table with the many-to-many relationships between them which would represent the edges (or connections). That seems exactly what I was looking for, but now I want to introduce the users who own the nodes.
From the performance point of view, would it make sense to create a new join table userNodes, or just save users as nodes assuming that node is a generic structure? And what are the implications of storing everything in a single table?
If you have individual attributes that should be stored on a per-node level, then those attributes should be in the nodes table. That is what the table is for.
If the attributes are really a list, then you would want another table. For instance, if multiple users could own a node, then one option would be a userNodes table. However, as you describe the data, there is only one user per node.

SQL Query For Multiple Indirect Relationships

I have a question about whether I can extract indirect relationship information from a SQL database, just using SQL.
A modelling tool I'm using is based upon an SQL database that essentially has just two tables: Elements and Relations. The table definitions are as follows
Elements:
id, type, name
Relations:
id, type, name, source, target
Elements is just a list of entities with a system generated unique ID, and Relations (also with a unique Id) uses the Elements' unique IDs to define relationships between them in the 'source' and 'target' columns.
What I'm trying to figure out is: for any two elements in the model, e.g. A and D, or A and E, in the diagram below - are they linked, and if so how - including by multiple indirect relationships through other elements?
I know I could solve this problem by writing some procedural code that recursively trawls the Relations table exhaustively checking each relationship between A and D (or E). But this is not straightforward to code.
So I wondered if anyone can point me at the best solution to this problem?
Specifically, is there an SQL-only solution to this problem or will I have to write some procedural code?
The SQL DB in question is alaSQL, which underpins the Archi modelling tool. But I'm also interested in other solutions, as pumping the data into a new DBMS is not a problem.
Thanks.
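One SQL-only approach is a recursive common table expression. Below is a rough sketch, assuming the data has been copied into an engine that supports WITH RECURSIVE (SQLite here, driven from Python). The sample rows and the find_paths helper are made up for illustration, relations are followed from source to target only, and ids are assumed not to contain a '/' character, since that is used as the path separator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Elements (id TEXT PRIMARY KEY, type TEXT, name TEXT);
    CREATE TABLE Relations (id TEXT PRIMARY KEY, type TEXT, name TEXT,
                            source TEXT, target TEXT);
    -- Hypothetical sample data: a -> b -> c -> d, plus a shortcut a -> c.
    INSERT INTO Elements VALUES
        ('a', 'node', 'A'), ('b', 'node', 'B'), ('c', 'node', 'C'),
        ('d', 'node', 'D'), ('e', 'node', 'E');
    INSERT INTO Relations VALUES
        ('r1', 'flow', 'A to B', 'a', 'b'),
        ('r2', 'flow', 'B to C', 'b', 'c'),
        ('r3', 'flow', 'C to D', 'c', 'd'),
        ('r4', 'flow', 'A to C', 'a', 'c');
""")

PATHS_SQL = """
WITH RECURSIVE walk(node, path) AS (
    -- Start at the first element; the path is kept as '/id/id/.../'.
    SELECT :start, '/' || :start || '/'
    UNION ALL
    -- Follow every relation leaving the current node, skipping
    -- elements already on the path so that cycles terminate.
    SELECT r.target, walk.path || r.target || '/'
    FROM walk
    JOIN Relations AS r ON r.source = walk.node
    WHERE instr(walk.path, '/' || r.target || '/') = 0
)
SELECT path FROM walk WHERE node = :goal ORDER BY path
"""

def find_paths(conn, start, goal):
    """Return every cycle-free path from start to goal as a list of ids."""
    rows = conn.execute(PATHS_SQL, {"start": start, "goal": goal})
    return [row[0].strip("/").split("/") for row in rows]

print(find_paths(conn, "a", "d"))  # [['a', 'b', 'c', 'd'], ['a', 'c', 'd']]
print(find_paths(conn, "a", "e"))  # []
```

An empty result means the two elements are not linked in that direction. To treat relations as undirected you could join on either end of the relation instead of only on source.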

SQL vs NoSQL for data that will be presented to a user after multiple filters have been added

I am about to embark on a project for work that is very outside my normal scope of duties. As a SQL DBA, my initial inclination was to approach the project using a SQL database but the more I learn about NoSQL, the more I believe that it might be the better option. I was hoping that I could use this question to describe the project at a high level to get some feedback on the pros and cons of using each option.
The project is relatively straightforward. I have a set of objects that have various attributes. Some of these attributes are common to all objects whereas some are common only to a subset of the objects. What I am tasked with building is a service where the user chooses a series of filters that are based on the attributes of an object and then is returned a list of objects that matches all^ of the filters. When the user selects a filter, he or she may be filtering on a common or subset attribute but that is abstracted on the front end.
^ There is a chance, depending on user feedback, that the list of objects may match only some of the filters and the quality of the match will be displayed to the user through a score that indicates how many of the criteria were matched.
After watching this talk by Martin Fowler (http://www.youtube.com/watch?v=qI_g07C_Q5I), it would seem that a document-style NoSQL database should suit my needs, but given that I have no experience with this approach, it is also possible that I am missing something obvious.
Some additional information - The database will initially have about 5,000 objects with each object containing 10 to 50 attributes but the number of objects will definitely grow over time and the number of attributes could grow depending on user feedback. In addition, I am hoping to have the ability to make rapid changes to the product as I get user feedback so flexibility is very important.
Any feedback would be very much appreciated and I would be happy to provide more information if I have left anything critical out of my discussion. Thanks.
This problem can be solved by using two separate pieces of technology. The first is a reasonably well-designed database schema on a modern RDBMS. By modeling the application using the usual principles of normalization, you'll get really good response times out of storage for individual CRUD statements.
Searching this schema, as you've surmised, is going to be a nightmare at scale. Don't do it. Instead look into using Solr/Lucene as your full text search engine. Solr's support for dynamic fields means you can add new properties to your documents/objects on the fly and immediately have the ability to search inside your data if you have designed your Solr schema correctly.
I'm not an expert in NoSQL, so I will not be advocating it. However, I have a few points that can help you address your questions regarding the relational database structure.
The first thing I see right away is that you are talking about inheritance (at least conceptually). Your objects inherit from each other, so derived objects have additional attributes. Say you are adding a new type of object: the first thing you need to do (conceptually) is find a base/super (parent) object type for it that has a subset of its attributes, and then add the rest on top (extending the base object type).
Once you get used to thinking this way, the next thing to look at is inheritance mapping patterns for relational databases. I'll steal terms from Martin Fowler to describe them here.
You can hold inheritance chain in the database by following one of the 3 ways:
1 - Single table inheritance: Whole inheritance chain is in one table. So, all new types of objects go into the same table.
Advantages: your search query has only one table to search, and it must be faster than a join for example.
Disadvantages: table grows faster than with option 2 for example; you have to add a type column that says what type of object is the row; some rows have empty columns because they belong to other types of objects.
2 - Concrete table inheritance: Separate table for each new type of object.
Advantages: if search affects only one type, you search only one table at a time; each table grows slower than in option 1 for example.
Disadvantages: you need to use union of queries if searching several types at the same time.
3 - Class table inheritance: One table for the base type object with its attributes only, additional tables with additional attributes for each child object type. So, child tables refer to the base table with PK/FK relations.
Advantages: all types are present in one table so easy to search all together using common attributes.
Disadvantages: base table grows fast because it contains part of child tables too; you need to use join to search all types of objects with all attributes.
Which one to choose?
It's a trade-off obviously. If you expect to have many types of objects added, I would go with Concrete table inheritance that gives reasonable query and scaling options. Class table inheritance seems to be not very friendly with fast queries and scalability. Single table inheritance seems to work with small number of types better.
Your call, my friend!
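To make option 3 (class table inheritance) a bit more concrete, here is a small sketch using SQLite from Python. The objects and laptops tables and their columns are invented for the example; they are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base table: attributes shared by every object type.
    CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    -- Child table: attributes that only one (hypothetical) subtype has.
    CREATE TABLE laptops (
        id INTEGER PRIMARY KEY REFERENCES objects(id),
        screen_inches REAL
    );
    INSERT INTO objects VALUES (1, 'desk', 120.0), (2, 'laptop', 900.0);
    INSERT INTO laptops VALUES (2, 15.6);
""")

# Filtering on a common attribute touches only the base table.
cheap = conn.execute(
    "SELECT name FROM objects WHERE price < 500 ORDER BY name"
).fetchall()

# Filtering on a subtype attribute needs a join back to the base table.
large = conn.execute("""
    SELECT o.name
    FROM objects AS o
    JOIN laptops AS l ON l.id = o.id
    WHERE l.screen_inches >= 15
""").fetchall()

print(cheap)  # [('desk',)]
print(large)  # [('laptop',)]
```

The join in the second query is the cost mentioned under the disadvantages of option 3; with single table inheritance the same filter would be a plain WHERE on a nullable column.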
May as well make this an answer. I should comment that I'm not strong in NoSQL, so I tend to lean towards SQL.
I'd do this as a three-table set. You will see it referred to as entity value pair logic on the web...it's a way of handling multiple dynamic attributes for items. Let's say you have a bunch of products and each one has a few attributes.
Prd 1 - a,b,c
Prd 2 - a,d,e,f
Prd 3 - a,b,d,g
Prd 4 - a,c,d,e,f
So here are 4 products and 6 attributes...the same theory will work for hundreds of products and thousands of attributes. The standard way of holding this in one table requires the product info along with 6 columns to store the data (in this setup at least one third of them are null). Adding a new attribute means altering the table to add another column and coming up with a script to populate it for existing rows, or just leaving it null for all of them. Not the most fun, and it can be a headache.
The alternative to this is a name-value pair setup. You want a 'header' table to hold the common values amongst your products (like name or price...things that all products always have). In our example above, you will notice that attribute 'a' is used on every record...this means attribute a can be part of the header table as well. We'll call the key column here 'header_id'.
The second table is a reference table that simply stores the attributes that can be assigned to each product and assigns an ID to each. We'll call the table attribute, with attr_id as its key. It is rather straightforward: each attribute above will be one row.
Quick example:
attr_id, attribute_name, notes
1,b, the length of time the product takes to install
2,c, spare part required
etc...
It's just a list of all of your attributes and what that attribute means. In the future, you will be adding a row to this table to open up a new attribute for each header.
Final table is a mapping table that actually holds the info. You will have your product id, the attribute id, and then the value. Normally called the detail table:
prd1, b, 5 mins
prd1, c, needs spare jack
prd2, d, 'misc text'
prd3, b, 15 mins
See how the data is stored as product key, value label, value? Any future product added can have any combination of any attributes stored in this table. Adding new attributes is adding a new line to the attribute table and then populating the details table as needed.
I believe there is a wiki article for it too... http://en.wikipedia.org/wiki/Entity-attribute-value_model
After this, it's simply figuring out the best methodology to pivot out your data (I'd recommend Postgres as an opensource db option here)
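Here is a rough sketch of the three tables and of a filter query that also returns the "how many criteria matched" score the question asks about. It uses SQLite from Python; the header, attribute and detail table names follow the description above, while the sample rows, the third attribute's note and the match_scores helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE header (header_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE attribute (
        attr_id INTEGER PRIMARY KEY, attribute_name TEXT, notes TEXT
    );
    CREATE TABLE detail (
        header_id INTEGER REFERENCES header(header_id),
        attr_id   INTEGER REFERENCES attribute(attr_id),
        value     TEXT,
        PRIMARY KEY (header_id, attr_id)
    );
    INSERT INTO header VALUES (1, 'prd1'), (2, 'prd2'), (3, 'prd3');
    INSERT INTO attribute VALUES
        (1, 'b', 'the length of time the product takes to install'),
        (2, 'c', 'spare part required'),
        (3, 'd', 'hypothetical free-text attribute');
    INSERT INTO detail VALUES
        (1, 1, '5 mins'), (1, 2, 'needs spare jack'),
        (2, 3, 'misc text'), (3, 1, '15 mins');
""")

def match_scores(conn, filters):
    """Return (product name, number of filters matched), best match first."""
    if not filters:
        return []
    # One "(attribute_name = ? AND value = ?)" test per filter; only
    # placeholders are interpolated, the values are bound as parameters.
    clause = " OR ".join(
        ["(a.attribute_name = ? AND d.value = ?)"] * len(filters)
    )
    params = [item for pair in filters.items() for item in pair]
    sql = f"""
        SELECT h.name, COUNT(*) AS score
        FROM detail AS d
        JOIN attribute AS a ON a.attr_id = d.attr_id
        JOIN header AS h ON h.header_id = d.header_id
        WHERE {clause}
        GROUP BY h.header_id, h.name
        ORDER BY score DESC, h.name
    """
    return conn.execute(sql, params).fetchall()

print(match_scores(conn, {"b": "15 mins", "c": "needs spare jack"}))
# [('prd1', 1), ('prd3', 1)]
```

To return only the products that match all of the filters, keep the rows whose score equals len(filters).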

Table design for hierarchical data

I am trying to design a table which contains sections, where each section contains tasks, each task contains sub tasks, and so on. I would like to do it in one table. Please let me know the best single-table approach that is scalable. I am pretty new to database design. If a single table is not the best approach, please also suggest what would be the best approach. I am using db2.
Put quite simply, I would say use 1 table for tasks.
In addition to all its various other attributes, each task should have a primary identifier, and another column to optionally contain the identifier of its parent task.
If you are using DB2 for z/OS, then you will use a recursive query with a common table expression. Otherwise you can use a hierarchical recursive query in DB2 for i, or possibly in DB2 for LUW (Linux, Unix, Windows).
Other designs requiring more tables, each specializing in a certain part of the task:subtask relationship, may needlessly introduce issues or limitations.
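As an illustration of the single-table design with a recursive common table expression, here is a minimal sketch. It uses SQLite from Python as a stand-in, so the exact syntax on DB2 may differ, and the tasks table, its columns and the subtree helper are assumed names rather than anything from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES tasks(id),  -- NULL for a top-level section
        name      TEXT NOT NULL
    );
    INSERT INTO tasks VALUES
        (1, NULL, 'Section 1'),
        (2, 1,    'Task 1.1'),
        (3, 2,    'Subtask 1.1.1'),
        (4, 1,    'Task 1.2'),
        (5, NULL, 'Section 2');
""")

SUBTREE_SQL = """
WITH RECURSIVE subtree(id, name, depth) AS (
    -- Anchor: the row we start from.
    SELECT id, name, 0 FROM tasks WHERE id = ?
    UNION ALL
    -- Recursive step: rows whose parent is already in the result.
    SELECT t.id, t.name, s.depth + 1
    FROM tasks AS t
    JOIN subtree AS s ON t.parent_id = s.id
)
SELECT id, name, depth FROM subtree ORDER BY id
"""

def subtree(conn, root_id):
    """Return (id, name, depth) for root_id and everything beneath it."""
    return conn.execute(SUBTREE_SQL, (root_id,)).fetchall()

for row in subtree(conn, 1):
    print(row)
```

Because the depth is computed while walking the tree, the table itself never needs to know how many levels exist.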
There are a few ways to do this.
One idea is to use two tables: Sections and Tasks
There could be a one-to-many relationship between the two. The Task table could be designed as a tree with a TaskId and a ParentTaskId, which means you can have Tasks that go n levels deep (sub tasks of sub tasks of sub tasks etc). Every Task except for the root task will have a parent.
I guess you can also solve this by using a single table where you just add a section column to the Task table I described above.
Putting everything into one table, although convenient, will be inefficient in the long run. It means storing unnecessarily repeated groups of data in your database, which is not processor or memory friendly at all. It would in fact violate the normalization rules, and more specifically the 1st Normal Form, which says that there should be no repeating groups in your table. It would also violate the 3rd Normal Form, which says there should be no transitive dependency of a non-key column on another non-key column.
To give you an illustration, I will put your design into one table. Although I will be guessing on the possible fields but just bear with it because this is for the sake of discussion. Look at the graphics below:
If you look at the graphic above (it is rather small, but you can download the image and look at it more closely), SectionName, Taskname, TaskInitiator, TaskStartDate and TaskEndDate are unnecessarily repeated, which, as I mentioned earlier, is a violation of the 1st Normal Form.
Secondly, Taskname, TaskInitiator, TaskStartDate and TaskEndDate are functionally dependent on TaskID, which is not a primary key, instead of on SectionID, which in this case should be the primary key (if on a separate table). This is a violation of the 3rd Normal Form, which says that there should be no transitive dependency, that is, no non-key column should depend on another non-key column.
Although there are instances where you have to denormalize, I believe this one should be normalized. In my estimation there should be three tables in your design, namely Sections, Tasks and SubTasks, which would look like the ones below.
Section is related to Tasks, that is, a section could have many Tasks.
And Task is related to Sub-Tasks, that is, a Task could have many Sub-tasks.
If I understand correctly, the original poster does not know how many levels of hierarchy will be needed (hence "and so on"). The problem is to create a design that can hold a structure of any depth.
Imho that is a complex issue that does not have a single answer. When implementing such a design you need to consider factors such as:
Will the structure be fairly constant? (How many writes?)
How often will this structure be read?
What operations will need to be possible? (Get all children objects of a given object? Get the parent object? Get the direct children?)
If the structure will be fairly constant, you could use the nested set model (http://en.wikipedia.org/wiki/Nested_set_model).
In this model the table has a 'left' and a 'right' column. The left and right values of a parent object encompass the values of all of its child objects.
That way you can list all the children of an object using a query like this:
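-- "tree" stands in for your table name; "left" and "right" are quoted because they are reserved words
SELECT child.id
FROM tree AS parent
JOIN tree AS child
  ON child."left" > parent."left"     -- strict comparisons leave out the parent itself
 AND child."right" < parent."right"
WHERE parent.id = :searchId           -- bind the id of the object whose children you want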
This design can be VERY fast to read, but is also EXTREMELY costly when the structure changes (for example when adding a child to any object You will have to update any object with a 'right' value that is higher than the inserted one).
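To show both the cheap read and the expensive write, here is a small sketch using SQLite from Python. The tree table, the lft/rgt column names (chosen to avoid the LEFT and RIGHT keywords) and the two helper functions are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tree (
        id INTEGER PRIMARY KEY, name TEXT, lft INTEGER, rgt INTEGER
    );
    -- Section(1,8) contains Task A(2,5) and Task B(6,7);
    -- Task A contains Subtask(3,4).
    INSERT INTO tree VALUES
        (1, 'Section', 1, 8),
        (2, 'Task A',  2, 5),
        (3, 'Subtask', 3, 4),
        (4, 'Task B',  6, 7);
""")

def descendants(conn, node_id):
    """Read: one join, no recursion, whatever the depth."""
    rows = conn.execute("""
        SELECT child.name
        FROM tree AS parent
        JOIN tree AS child
          ON child.lft > parent.lft AND child.rgt < parent.rgt
        WHERE parent.id = ?
        ORDER BY child.lft
    """, (node_id,))
    return [row[0] for row in rows]

def add_last_child(conn, parent_id, new_id, name):
    """Write: every boundary at or after the insertion point must shift."""
    (parent_rgt,) = conn.execute(
        "SELECT rgt FROM tree WHERE id = ?", (parent_id,)
    ).fetchone()
    # Open a gap of two numbers just before the parent's right edge.
    conn.execute("UPDATE tree SET rgt = rgt + 2 WHERE rgt >= ?", (parent_rgt,))
    conn.execute("UPDATE tree SET lft = lft + 2 WHERE lft > ?", (parent_rgt,))
    conn.execute(
        "INSERT INTO tree VALUES (?, ?, ?, ?)",
        (new_id, name, parent_rgt, parent_rgt + 1),
    )

print(descendants(conn, 1))  # ['Task A', 'Subtask', 'Task B']
add_last_child(conn, 2, 5, "Subtask 2")
print(descendants(conn, 2))  # ['Subtask', 'Subtask 2']
```

In this tiny tree the single insert already rewrites three of the four existing rows, which is the write cost described above.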
If you need to be able to make changes to structure in real time you should probably use a design with two tables - one holding the objects, the second the structure (something like parentId, childId, differenceInHierarchyLevels).
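A rough sketch of that two-table design (often called a closure table), again with SQLite from Python. The structure columns follow the names suggested above; the objects table, the helper functions, and the choice to store each object as its own ancestor at distance 0 are my assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE structure (
        parentId INTEGER REFERENCES objects(id),
        childId  INTEGER REFERENCES objects(id),
        differenceInHierarchyLevels INTEGER,
        PRIMARY KEY (parentId, childId)
    );
""")

def add_object(conn, obj_id, name, parent_id=None):
    conn.execute("INSERT INTO objects VALUES (?, ?)", (obj_id, name))
    # Assumption: every object is its own ancestor at distance 0.
    conn.execute("INSERT INTO structure VALUES (?, ?, 0)", (obj_id, obj_id))
    if parent_id is not None:
        # Copy the parent's ancestor rows, each one level further away.
        conn.execute("""
            INSERT INTO structure
            SELECT parentId, ?, differenceInHierarchyLevels + 1
            FROM structure
            WHERE childId = ?
        """, (obj_id, parent_id))

def descendants(conn, obj_id):
    """Return (name, levels below obj_id) for everything under obj_id."""
    return conn.execute("""
        SELECT o.name, s.differenceInHierarchyLevels
        FROM structure AS s
        JOIN objects AS o ON o.id = s.childId
        WHERE s.parentId = ? AND s.differenceInHierarchyLevels > 0
        ORDER BY s.differenceInHierarchyLevels, o.id
    """, (obj_id,)).fetchall()

add_object(conn, 1, "Section")
add_object(conn, 2, "Task", parent_id=1)
add_object(conn, 3, "Subtask", parent_id=2)
print(descendants(conn, 1))  # [('Task', 1), ('Subtask', 2)]
```

Adding an object only inserts rows for its own ancestors and leaves the rest of the table untouched; filtering on differenceInHierarchyLevels = 1 gives the direct children.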

Sql server 2005 - Returning data as XML or datatables

I have three classes
a) Parent - Which contains some properties, an Item collection and a collection of children
b) Child - Which contains some properties and an Item collection
c) Item - Which contains some properties
The relationship is depicted in the below XML structure.
<Parents>
<Parent1>
<Property1></Property1>
<Property2></Property2>
<Property3></Property3>
<Parent1Children>
<Child1>
<Child1Property1></Child1Property1>
<Child1Property2></Child1Property2>
</Child1>
<Child2>
<Child2Property1></Child2Property1>
<Child2Property2></Child2Property2>
</Child2>
</Parent1Children>
<Parent1SomeCollection>
<Item1>
<Item1Property1></Item1Property1>
<Item1Property2></Item1Property2>
</Item1>
<Item2>
<Item2Property1></Item2Property1>
<Item2Property2></Item2Property2>
</Item2>
</Parent1SomeCollection>
</Parent1>
</Parents>
I need to use the data from the tables of the three classes above. I can think of the following options:-
1 - Get this structure as XML from the database, as depicted above, from a stored procedure, and use it as an XDocument in the business layer.
2 - Get the data as three tables from the database, load them into DataTables, and establish relationships between them to get the data.
Which one would give optimum performance and is a better way or are there better ways to do the same?
A DataSet would hold the data better (assuming you're referring to .NET code) and maintain the relationships, while still letting you save the data as XML and perform mass inserts and updates. So performance should be very good. Using an ORM would also be a good idea, since the ORM would give you similar functionality, plus it may support LINQ (ORMs like Entity Framework and Linq-to-SQL).