SQL: persisting graph-like data structures

I'm trying to figure out the best way to store graph data structures in an SQL database. After some research, it seems that I can store graph Nodes in a table and just create a join table with the many-to-many relationships between them which would represent the edges (or connections). That seems exactly what I was looking for, but now I want to introduce the users who own the nodes.
From the performance point of view, would it make sense to create a new join table userNodes, or just save users as nodes assuming that node is a generic structure? And what are the implications of storing everything in a single table?

If you have individual attributes that should be stored on a per-node level, then those attributes should be in the nodes table. That is what the table is for.
If the attributes are really a list, then you would want another table. For instance, if multiple users could own a node, then one option would be a userNodes table. However, as you describe the data, there is only one user per node, so a single owner column on the nodes table is enough.
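A minimal sketch of that layout, using Python's built-in sqlite3 module so it can be run as-is. The table and column names (users, nodes, edges, owner_id, from_node, to_node) and the sample rows are assumptions made for illustration; they do not come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- One owner per node, so ownership is a plain column on the node row.
CREATE TABLE nodes (
    id       INTEGER PRIMARY KEY,
    label    TEXT NOT NULL,
    owner_id INTEGER NOT NULL REFERENCES users(id)
);
-- Edges are the many-to-many join table between nodes.
CREATE TABLE edges (
    from_node INTEGER NOT NULL REFERENCES nodes(id),
    to_node   INTEGER NOT NULL REFERENCES nodes(id),
    PRIMARY KEY (from_node, to_node)
);
""")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO nodes (id, label, owner_id) VALUES (?, ?, ?)",
                 [(1, "a", 1), (2, "b", 1), (3, "c", 2)])
conn.executemany("INSERT INTO edges (from_node, to_node) VALUES (?, ?)",
                 [(1, 2), (2, 3)])


def nodes_owned_by(conn, user_id):
    """Labels of all nodes owned by one user."""
    rows = conn.execute(
        "SELECT label FROM nodes WHERE owner_id = ? ORDER BY label", (user_id,))
    return [r[0] for r in rows]


def neighbours(conn, node_id):
    """Labels of the nodes reachable over one outgoing edge."""
    rows = conn.execute(
        "SELECT n.label FROM edges e JOIN nodes n ON n.id = e.to_node "
        "WHERE e.from_node = ? ORDER BY n.label", (node_id,))
    return [r[0] for r in rows]
```

If a node could later have several owners, the owner_id column would be replaced by a userNodes join table shaped like edges.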

Related

SQL Data modeling - Querying Records that have tags across multiple categories

I have a table that stores the different software services a company offers. Each service is tagged by the Industry it serves, the LoB it belongs to, and the technology involved. A service can have multiple tags for each of Industry, LoB, and Technology.
For example, the following could be the master data:
And the transaction data could look like this:
I need to create a view that can be used to query data by Industry, LoB, and Technology tags. For the time being I have left outer joined all the tag-to-service relation tables (service-technology, service-LoB, and service-industry) to the services transaction table, but this produces a huge number of records, because one service is typically tagged with up to 10-15 industries and technologies.
I would like to know the optimal way to model this data so that I can query for services by all three tags from within one view.
I am not a data modeling expert and this is my first venture into the data modeling side, so please pardon the 'noob'ness of my question :). I use SAP HANA as the database and expose data via an OData service, for which I want to use this view as a data source.
If you're asking about modeling the data: normally in your transaction table you keep the foreign keys, not the text columns that can be obtained from the master tables via those foreign keys. I bet that's what you meant as well, but the example shows text values in the transaction tables.
Other than that, I think what you have is sound and reasonable. These "tag" tables represent a different level of granularity from the "services" table, and it can be counterproductive to combine them into a single table (examples: a single column with comma-separated tags, XML / JSON columns, multiple columns [LOBTag1, LOBTag2, ...]) because that makes these columns non-indexable and/or hard to query. XML and JSON columns can sometimes be optimized, but they should not be considered unless the columns are too many and sparse.
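To make the row explosion from joining several tag tables concrete, here is a small sketch. It runs against SQLite through Python's sqlite3 module rather than SAP HANA, and every table name, column name, and sample row is an assumption for illustration. Only two of the three tag dimensions are shown; LoB would follow the same pattern. Filtering with EXISTS subqueries is one possible way to keep one row per service, not necessarily the best fit for an OData-backed view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE services     (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE industries   (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE technologies (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Junction tables hold foreign keys only, never the tag text.
CREATE TABLE service_industry (
    service_id  INTEGER NOT NULL REFERENCES services(id),
    industry_id INTEGER NOT NULL REFERENCES industries(id),
    PRIMARY KEY (service_id, industry_id)
);
CREATE TABLE service_technology (
    service_id    INTEGER NOT NULL REFERENCES services(id),
    technology_id INTEGER NOT NULL REFERENCES technologies(id),
    PRIMARY KEY (service_id, technology_id)
);
INSERT INTO services     VALUES (1, 'Billing'), (2, 'Analytics');
INSERT INTO industries   VALUES (1, 'Retail'), (2, 'Banking');
INSERT INTO technologies VALUES (1, 'Cloud'), (2, 'Mobile');
INSERT INTO service_industry   VALUES (1, 1), (1, 2), (2, 2);
INSERT INTO service_technology VALUES (1, 1), (1, 2), (2, 1);
""")


def joined_row_count(conn):
    """Rows produced when every tag table is outer joined to services."""
    return conn.execute("""
        SELECT COUNT(*)
        FROM services s
        LEFT JOIN service_industry   si ON si.service_id = s.id
        LEFT JOIN service_technology st ON st.service_id = s.id
    """).fetchone()[0]


def find_services(conn, industry, technology):
    """Services carrying both tags, one row per service."""
    rows = conn.execute("""
        SELECT s.name
        FROM services s
        WHERE EXISTS (SELECT 1
                      FROM service_industry si
                      JOIN industries i ON i.id = si.industry_id
                      WHERE si.service_id = s.id AND i.name = ?)
          AND EXISTS (SELECT 1
                      FROM service_technology st
                      JOIN technologies t ON t.id = st.technology_id
                      WHERE st.service_id = s.id AND t.name = ?)
        ORDER BY s.name
    """, (industry, technology))
    return [r[0] for r in rows]
```

With this sample data the two services turn into five joined rows (2 x 2 for Billing plus 1 x 1 for Analytics), while find_services never returns a service more than once.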

Should I use one-to-one relationships to avoid repetition even if the new table wasn't really needed?

I'm designing a database from scratch and I'm wondering if the way I'm using one-to-one relationships is correct.
Imagine I have a table that needs the columns city and country_id, the first being alphanumeric and the second being a foreign key to another table. Should I place these in a locations table and use a one-to-one relationship?
Another example:
I have a table with the factory information of a device like the serial number and other fields. These will later be used to register a device in another table. Of course this is a one-to-one relationship, but should the columns of the first table be in the second table instead? Keep in mind that the registrations table has another 4-5 columns.
I've read many times that these relationships can often be omitted. However, I like the separation of concerns that creating a new table can give in some cases.
Thanks in advance!
This may be a duplicate question, e.g. see:
SQL one to one relationship vs. single table
There's no perfect answer to this question as it depends on use case, but here's the rule of thumb I recommend abiding by: if you can already envision the potential need for separate tables, then I would err on the side of splitting them and using a 1:1 relationship. For example, imagine in the future that you want to have some kind of one-to-many relationship between some new table and the country table, or between some new table and the device table. In these cases you probably don't want city information mixed up with the former, and you probably don't want device registration information mixed up with the latter. By keeping your DB schema normalized, you can better future-proof it and you can mitigate the need for (likely extremely painful) updates that may have otherwise cropped up.
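For the device example, a 1:1 split can be sketched as a foreign key that is also declared UNIQUE. The snippet below uses SQLite through Python's sqlite3 module; the table names, column names, and the try_register helper are hypothetical and only illustrate the constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
-- Factory information lives on its own.
CREATE TABLE devices (
    id            INTEGER PRIMARY KEY,
    serial_number TEXT NOT NULL UNIQUE
);
-- UNIQUE on the foreign key turns one-to-many into one-to-one.
CREATE TABLE registrations (
    id            INTEGER PRIMARY KEY,
    device_id     INTEGER NOT NULL UNIQUE REFERENCES devices(id),
    registered_to TEXT NOT NULL
);
""")
conn.execute("INSERT INTO devices (id, serial_number) VALUES (1, 'SN-001')")
conn.execute("INSERT INTO devices (id, serial_number) VALUES (2, 'SN-002')")
conn.execute(
    "INSERT INTO registrations (device_id, registered_to) VALUES (1, 'alice')")


def try_register(conn, device_id, owner):
    """Return True if the registration was stored, False if a constraint
    (unknown device, or device already registered) rejected it."""
    try:
        conn.execute(
            "INSERT INTO registrations (device_id, registered_to) VALUES (?, ?)",
            (device_id, owner))
        return True
    except sqlite3.IntegrityError:
        return False
```

Dropping the UNIQUE keyword later is all it takes to allow several registrations per device, which is part of the future-proofing argument above.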

How to properly model a Master/Main/Detail kind of relation

The relationship I wanna model goes kinda like this:
A master TextResource object that stores high level shared non-localizable data, like maximum length.
One single detail object we can call SourceText, that needs to be tracked separately.
The rest of detail objects, that we can call TargetText.
Both Source and TargetText objects store a string localized in a particular language, along with other localization data.
But the string stored in SourceText is the original one and hence, even if the data schema is the same, they're not functionally equal and this piece of data needs to exist and be unique per each TextResource master object.
And the options I've thought of are:
Regular master-detail tables, but store the SourceText ID in the master table... Potentially creating a circular reference?
Regular master-detail tables, but add a flag/category column to the detail table that marks a row as Source or Target... Though this could potentially lead to having more than one "Source" detail per master and could make querying for the Source data less straight-forward
Store the Source data in the master table, even if this means having similar columns on both master and detail tables (and screwing normalization while at it)
Create three different tables: master, main and detail. Master (TextResource) and main (SourceText) would have a 1-to-1 relationship while there could be n detail (TargetText) rows per master, but other than that the Source and Target tables would share most of their schema
I see benefits and potential problems on all four approaches, so maybe you could lend me a hand deciding one?
What I want to achieve would boil down to:
Have one, and only one, Source string per Resource
Be able to query Text Resources and their Source strings easily and fast
Be able to query the localized strings of a given Resource, including the Source one, easily and fast
Be able to store versioned data of each localized string, including the Source one
And of course, avoid bad practices and observe normalization
Thanks in advance :)
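As an illustration of the second option (a flag column on the detail table), the concern about ending up with more than one Source row can be handled in databases that support partial unique indexes. The sketch below uses SQLite through Python's sqlite3 module; all table, column, and index names are assumptions, and it only guarantees at most one Source per resource. Requiring that a Source exists, and versioning, are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE text_resources (
    id         INTEGER PRIMARY KEY,
    max_length INTEGER NOT NULL
);
CREATE TABLE localized_texts (
    id          INTEGER PRIMARY KEY,
    resource_id INTEGER NOT NULL REFERENCES text_resources(id),
    language    TEXT NOT NULL,
    content     TEXT NOT NULL,
    is_source   INTEGER NOT NULL DEFAULT 0 CHECK (is_source IN (0, 1)),
    UNIQUE (resource_id, language)
);
-- Partial unique index: only rows flagged as source are indexed,
-- so a resource can have many targets but at most one source.
CREATE UNIQUE INDEX one_source_per_resource
    ON localized_texts (resource_id) WHERE is_source = 1;
INSERT INTO text_resources (id, max_length) VALUES (1, 40);
INSERT INTO localized_texts (resource_id, language, content, is_source)
VALUES (1, 'en', 'Hello', 1), (1, 'fr', 'Bonjour', 0);
""")


def add_text(conn, resource_id, language, content, is_source=False):
    """Insert a localized string; return False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO localized_texts "
            "(resource_id, language, content, is_source) VALUES (?, ?, ?, ?)",
            (resource_id, language, content, int(is_source)))
        return True
    except sqlite3.IntegrityError:
        return False


def source_text(conn, resource_id):
    """Return (language, content) of the source string, or None."""
    return conn.execute(
        "SELECT language, content FROM localized_texts "
        "WHERE resource_id = ? AND is_source = 1", (resource_id,)).fetchone()
```

Because Source and Target rows share one table, listing every localized string of a resource, Source included, is a single query on resource_id.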

SQL vs NoSQL for data that will be presented to a user after multiple filters have been added

I am about to embark on a project for work that is very outside my normal scope of duties. As a SQL DBA, my initial inclination was to approach the project using a SQL database but the more I learn about NoSQL, the more I believe that it might be the better option. I was hoping that I could use this question to describe the project at a high level to get some feedback on the pros and cons of using each option.
The project is relatively straightforward. I have a set of objects that have various attributes. Some of these attributes are common to all objects whereas some are common only to a subset of the objects. What I am tasked with building is a service where the user chooses a series of filters that are based on the attributes of an object and then is returned a list of objects that matches all^ of the filters. When the user selects a filter, he or she may be filtering on a common or subset attribute but that is abstracted on the front end.
^ There is a chance, depending on user feedback, that the list of objects may match only some of the filters and the quality of the match will be displayed to the user through a score that indicates how many of the criteria were matched.
After watching this talk by Martin Fowler (http://www.youtube.com/watch?v=qI_g07C_Q5I), it would seem that a document-style NoSQL database should suit my needs but given that I have no experience with this approach, it is also possible that I am missing something obvious.
Some additional information - The database will initially have about 5,000 objects with each object containing 10 to 50 attributes but the number of objects will definitely grow over time and the number of attributes could grow depending on user feedback. In addition, I am hoping to have the ability to make rapid changes to the product as I get user feedback so flexibility is very important.
Any feedback would be very much appreciated and I would be happy to provide more information if I have left anything critical out of my discussion. Thanks.
This problem can be solved by using two separate pieces of technology. The first is to use a relatively well designed database schema with a modern RDBMS. By modeling the application using the usual principles of normalization, you'll get really good response out of storage for individual CRUD statements.
Searching this schema, as you've surmised, is going to be a nightmare at scale. Don't do it. Instead look into using Solr/Lucene as your full text search engine. Solr's support for dynamic fields means you can add new properties to your documents/objects on the fly and immediately have the ability to search inside your data if you have designed your Solr schema correctly.
I'm not an expert in NoSQL, so I will not be advocating it. However, I have a few points that can help you address your questions regarding the relational database structure.
The first thing I see right away is that you are talking about inheritance (at least conceptually). Your objects inherit from each other, so derived objects have additional attributes. Say you are adding a new type of object: the first thing you need to do (conceptually) is to find a base/super (parent) object type for it that has a subset of its attributes, and then add the new attributes on top (extending the base object type).
Once you get used to thinking this way, the next thing is the inheritance mapping patterns for relational databases. I'll steal terms from Martin Fowler to describe them here.
You can hold the inheritance chain in the database in one of the following 3 ways:
1 - Single table inheritance: The whole inheritance chain is in one table. So, all new types of objects go into the same table.
Advantages: your search query has only one table to search, which is usually faster than a join.
Disadvantages: the table grows faster than with option 2, for example; you have to add a type column that says what type of object the row is; some rows have empty columns because those columns belong to other types of objects.
2 - Concrete table inheritance: Separate table for each new type of object.
Advantages: if a search affects only one type, you search only one table at a time; each table grows more slowly than in option 1.
Disadvantages: you need a union of queries if searching several types at the same time.
3 - Class table inheritance: One table for the base type with its attributes only, plus additional tables with the additional attributes for each child type. The child tables refer to the base table through PK/FK relations.
Advantages: all types are present in the base table, so it is easy to search them all together using common attributes.
Disadvantages: the base table grows fast because it holds a row for every object of every child type; you need joins to search objects by all of their attributes.
Which one to choose?
It's a trade-off, obviously. If you expect many types of objects to be added, I would go with Concrete table inheritance, which gives reasonable query and scaling options. Class table inheritance does not seem very friendly to fast queries and scalability. Single table inheritance seems to work better with a small number of types.
Your call, my friend!
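To show what the join cost in option 3 looks like in practice, here is a small class table inheritance sketch in SQLite, driven from Python's sqlite3 module. The object types (cars, boats), the columns, and the sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Base table: attributes common to every object.
CREATE TABLE objects (
    id    INTEGER PRIMARY KEY,
    type  TEXT NOT NULL,
    name  TEXT NOT NULL,
    color TEXT NOT NULL
);
-- Child tables: only the extra attributes, keyed by the base row's id.
CREATE TABLE cars (
    object_id INTEGER PRIMARY KEY REFERENCES objects(id),
    doors     INTEGER NOT NULL
);
CREATE TABLE boats (
    object_id INTEGER PRIMARY KEY REFERENCES objects(id),
    hull      TEXT NOT NULL
);
INSERT INTO objects VALUES
    (1, 'car',  'hatchback', 'red'),
    (2, 'boat', 'dinghy',    'red'),
    (3, 'car',  'sedan',     'blue');
INSERT INTO cars  VALUES (1, 3), (3, 4);
INSERT INTO boats VALUES (2, 'fiberglass');
""")


def find_by_color(conn, color):
    """Filter on a common attribute: the base table alone is enough."""
    rows = conn.execute(
        "SELECT name FROM objects WHERE color = ? ORDER BY name", (color,))
    return [r[0] for r in rows]


def find_cars(conn, color, doors):
    """Filter on a common plus a subtype attribute: needs a join."""
    rows = conn.execute("""
        SELECT o.name
        FROM objects o
        JOIN cars c ON c.object_id = o.id
        WHERE o.color = ? AND c.doors = ?
        ORDER BY o.name
    """, (color, doors))
    return [r[0] for r in rows]
```

Under single table inheritance the same data would sit in one table with nullable doors and hull columns; under concrete table inheritance, cars and boats would each repeat name and color and the cross-type search would become a UNION.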
May as well make this an answer. I should comment that I'm not strong in NoSQL, so I tend to lean towards SQL.
I'd do this as a three-table set. You will see it referred to as entity-attribute-value (or name-value pair) logic on the web... it's a way of handling multiple dynamic attributes for items. Let's say you have a bunch of products and each one has a few attributes.
Prd 1 - a,b,c
Prd 2 - a,d,e,f
Prd 3 - a,b,d,g
Prd 4 - a,c,d,e,f
So here are 4 products and 6 attributes... the same theory will work for hundreds of products and thousands of attributes. The standard way of holding this in one table requires the product info along with 6 columns to store the data (in this setup at least one third of them are null). A new attribute means altering the table to add another column and coming up with a script to populate it for existing rows, or just leaving it null for all existing rows. Not the most fun; it can be a headache.
The alternative to this is a name-value pair setup. You want a 'header' table to hold the common values amongst your products (like name or price... things that all products always have). In our example above, you will notice that attribute 'a' is used on every record... this means attribute 'a' can be part of the header table as well. We'll call the key column here 'header_id'.
The second table is a reference table that simply stores the attributes that can be assigned to each product and assigns an ID to each. We'll call the table attribute, with attr_id for a key. Rather straightforward: each attribute above will be one row.
Quick example:
attr_id, attribute_name, notes
1,b, the length of time the product takes to install
2,c, spare part required
etc...
It's just a list of all of your attributes and what that attribute means. In the future, you will be adding a row to this table to open up a new attribute for each header.
Final table is a mapping table that actually holds the info. You will have your product id, the attribute id, and then the value. Normally called the detail table:
prd1, b, 5 mins
prd1, c, needs spare jack
prd2, d, 'misc text'
prd3, b, 15 mins
See how the data is stored as product key, value label, value? Any future product added can have any combination of any attributes stored in this table. Adding new attributes is adding a new line to the attribute table and then populating the details table as needed.
I believe there is a wiki for it too... http://en.wikipedia.org/wiki/Entity-attribute-value_model
After this, it's simply a matter of figuring out the best methodology to pivot out your data (I'd recommend Postgres as an open-source db option here).
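A rough sketch of the three tables, plus a filter query that also produces the "how many criteria matched" score mentioned in the question. It runs on SQLite through Python's sqlite3 module. The header, attribute, and detail names follow the description above; the id given to attribute 'd', the name column, the two helper functions, and the assumption that each filter names a different attribute are illustrative additions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE header (
    header_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE attribute (
    attr_id        INTEGER PRIMARY KEY,
    attribute_name TEXT NOT NULL UNIQUE,
    notes          TEXT
);
-- One row per (product, attribute) pair that actually has a value.
CREATE TABLE detail (
    header_id INTEGER NOT NULL REFERENCES header(header_id),
    attr_id   INTEGER NOT NULL REFERENCES attribute(attr_id),
    value     TEXT NOT NULL,
    PRIMARY KEY (header_id, attr_id)
);
INSERT INTO header VALUES (1, 'prd1'), (2, 'prd2'), (3, 'prd3');
INSERT INTO attribute VALUES
    (1, 'b', 'the length of time the product takes to install'),
    (2, 'c', 'spare part required'),
    (3, 'd', NULL);
INSERT INTO detail VALUES
    (1, 1, '5 mins'),
    (1, 2, 'needs spare jack'),
    (2, 3, 'misc text'),
    (3, 1, '15 mins');
""")


def match_scores(conn, filters):
    """filters is a list of (attribute_name, value) pairs.
    Returns (product name, number of filters matched), best match first."""
    if not filters:
        return []
    where = " OR ".join(
        "(a.attribute_name = ? AND d.value = ?)" for _ in filters)
    params = [item for pair in filters for item in pair]
    sql = f"""
        SELECT h.name, COUNT(*) AS score
        FROM detail d
        JOIN attribute a ON a.attr_id = d.attr_id
        JOIN header h ON h.header_id = d.header_id
        WHERE {where}
        GROUP BY h.header_id, h.name
        ORDER BY score DESC, h.name
    """
    return conn.execute(sql, params).fetchall()


def match_all(conn, filters):
    """Only the products that satisfy every filter."""
    return [name for name, score in match_scores(conn, filters)
            if score == len(filters)]
```

Only the placeholders are generated dynamically; the filter values themselves are always passed as bound parameters.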

Recommend SQL data model for Semantic Network nodes?

We're building an RDBMS-based web site for a federal semantic network (RDF, Protege, etc). This is basically a large collection of nodes, each having a large and indefinite set of named relationships to (and from) other nodes.
My first thought is a single table for all the nodes (name, description, etc), plus one table per named relationship. Any better ideas out there?
On further reflection, two tables total might do: one for nodes (id, name, description) and another for relations (id, name, description, from, to), where from and to are ids in the nodes table (ints). Still on the right track?
You could optimize the performance by creating 2 rows per relation.
Let's say you have a table Items and a table Relations and that Person A has a relation with Person B. The Relations table has a left and right column, both referring to Items. Now, if you only have one row for this relation, and you want all relations for a certain Item, you would have a query looking like this:
SELECT * FROM Relations WHERE LeftItemId = @ItemId OR RightItemId = @ItemId
The OR in this query can ruin your performance, because a single index on one column cannot satisfy both conditions. If you duplicate the row and switch the relation (left becomes right and vice versa), the query looks like this:
SELECT * FROM Relations WHERE LeftItemId = @ItemId
With the right index this one will go blazingly fast.
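The mirrored-row idea can be sketched as follows, using SQLite through Python's sqlite3 module. The Items and Relations names and the LeftItemId / RightItemId columns follow the answer; the add_relation and related_items helpers and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Items (
    Id   INTEGER PRIMARY KEY,
    Name TEXT NOT NULL
);
-- The composite primary key doubles as an index that starts with
-- LeftItemId, which is the only column the lookup filters on.
CREATE TABLE Relations (
    LeftItemId  INTEGER NOT NULL REFERENCES Items(Id),
    RightItemId INTEGER NOT NULL REFERENCES Items(Id),
    PRIMARY KEY (LeftItemId, RightItemId)
);
INSERT INTO Items VALUES (1, 'Person A'), (2, 'Person B'), (3, 'Person C');
""")


def add_relation(conn, a, b):
    """Store the relation twice, once in each direction."""
    conn.executemany(
        "INSERT INTO Relations (LeftItemId, RightItemId) VALUES (?, ?)",
        [(a, b), (b, a)])


def related_items(conn, item_id):
    """All items related to item_id, with no OR in the WHERE clause."""
    rows = conn.execute(
        "SELECT RightItemId FROM Relations "
        "WHERE LeftItemId = ? ORDER BY RightItemId", (item_id,))
    return [r[0] for r in rows]


add_relation(conn, 1, 2)
add_relation(conn, 3, 1)
```

The trade-off is that the table holds twice as many rows and both copies have to be inserted and deleted together, so writes should go through a single helper like add_relation.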
No, that should be fine. Pay attention to the primary key and indexes, so that the performance is good.
If you didn't have a single table for the nodes, you'd have to define a lot of relation tables. Each new node type would require a new relation table with every old node type. That could get out of hand quickly.
So a single table sounds best. You can always use a 1:1 relation to extend it, if you need additional fields for certain node types.
If you're using SQL Server 2008, you might want to consider the new HierarchyID data type to store your hierarchy in. It's optimized for storage.