Are there any open source resources for SQL schema design patterns? - sql

I can barely count the number of times I've created a "users" table, and the same goes for "computers" and "customers". I've tried looking around, but haven't ever seen a resource for modeling these schemas that we see over and over again. It seems like some of these objects should be some-kind-of-solved by now. Is there anything like this?

I have never seen anything like this either and I'm not sure it's necessary. Yes, there are a lot of similarities, but every application is different. At one point I had built an internal library of some of my more "standard" tables (user is a good example) to use as a jumping-off point, but I have yet to create two identical tables for different systems.
Thus, I have yet to ever use the library I built because I can write the new table quicker and more error free than I can modify another existing example to work for the current project.

You could look at the source code of some popular open-source CRM/ERPs, such as OpenERP, though some of them are not great.
These are the top books on data modelling patterns:
Analysis Patterns, Fowler
Data Model Resource Book, vol. 1,2,3, Silverston
Enterprise Model Patterns, Hay
Patterns of Data Modeling, Blaha


Should I use EAV database design model or a lot of tables

I started a new application and now I am looking at two paths and don't know which is the better way to continue.
I am building something like an eCommerce site. I have categories and subcategories.
The problem is that there are different types of products on the site and each has different properties. And the site must be filterable by those product properties.
This is my initial database design:
Products{ProductId, Name, ProductCategoryId}
ProductCategories{ProductCategoryId, Name, ParentId}
CategoryProperties{CategoryPropertyId, ProductCategoryId, Name}
ProductPropertyValues{ProductId, CategoryPropertyId, Value}
Now after some analysis I see that this design is actually an EAV model, and I read that people usually don't recommend this design.
It seems that dynamic sql queries are required for everything.
That's one way and I am looking at it right now.
Another way that I see could be called the "a lot of work" way, but if it's better I want to go there.
That is, to make a table
Product{ProductId, CategoryId, Name, ManufacturerId}
and to use table inheritance in the database, which means making tables like
Cpus{ProductId ....}
HardDisks{ProductId ....}
MotherBoards{ProductId ....}
etc. for each product (1 to 1 relation).
I understand that this will be a very large database and a very large application domain, but is it better, easier and faster than option one with the EAV design?
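For illustration, here is a minimal sketch of what the second option (one subtype table per product kind, sharing the Product key) could look like. It uses Python's built-in sqlite3 module only so that it is runnable; the HardDisks columns (Platters, Interface, Rpm) and the sample rows are assumptions, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (
    ProductId      INTEGER PRIMARY KEY,
    CategoryId     INTEGER NOT NULL,
    Name           TEXT    NOT NULL,
    ManufacturerId INTEGER
);
-- One subtype table per product kind, 1 to 1 with Product.
-- These columns are illustrative assumptions.
CREATE TABLE HardDisks (
    ProductId INTEGER PRIMARY KEY REFERENCES Product(ProductId),
    Platters  INTEGER NOT NULL,
    Interface TEXT    NOT NULL,
    Rpm       INTEGER NOT NULL
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", [
    (1, 10, "Disk A", 100),
    (2, 10, "Disk B", 100),
])
conn.executemany("INSERT INTO HardDisks VALUES (?, ?, ?, ?)", [
    (1, 4, "SATA", 10000),
    (2, 2, "SATA", 7200),
])

# Filtering uses ordinary typed columns: one join, no casts, no dynamic SQL.
fast_disks = conn.execute("""
    SELECT p.Name
    FROM Product p
    JOIN HardDisks h ON h.ProductId = p.ProductId
    WHERE h.Rpm >= 10000
    ORDER BY p.Name
""").fetchall()
print(fast_disks)
```

The cost is the one the question already names: every new product kind needs its own table and its own application code.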
EAV is rarely a win. In your case I can see the appeal of EAV given that different categories will have different attributes and this will be hard to manage otherwise. However, suppose someone wants to search for "all hard drives with more than 3 platters, using a SATA interface, spinning at 10k rpm?" Your query in EAV will be painful. If you ever want to support a query like that, EAV is out.
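To make the pain concrete, here is a rough sketch of that hard drive search against the EAV tables from the question, run through Python's sqlite3 module. The property names ("platters", "interface", "rpm") and the sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (ProductId INTEGER PRIMARY KEY, Name TEXT, ProductCategoryId INTEGER);
CREATE TABLE CategoryProperties (CategoryPropertyId INTEGER PRIMARY KEY,
                                 ProductCategoryId INTEGER, Name TEXT);
CREATE TABLE ProductPropertyValues (ProductId INTEGER, CategoryPropertyId INTEGER, Value TEXT);
""")

# Hypothetical sample data: two hard drives in category 10.
conn.executemany("INSERT INTO Products VALUES (?, ?, ?)",
                 [(1, "Disk A", 10), (2, "Disk B", 10)])
conn.executemany("INSERT INTO CategoryProperties VALUES (?, ?, ?)",
                 [(1, 10, "platters"), (2, 10, "interface"), (3, 10, "rpm")])
conn.executemany("INSERT INTO ProductPropertyValues VALUES (?, ?, ?)", [
    (1, 1, "4"), (1, 2, "SATA"), (1, 3, "10000"),
    (2, 1, "2"), (2, 2, "SATA"), (2, 3, "10000"),
])

# Every filtered attribute costs two more joins, and every numeric
# comparison needs a cast because Value is stored as text.
eav_query = """
SELECT p.Name
FROM Products p
JOIN ProductPropertyValues v1 ON v1.ProductId = p.ProductId
JOIN CategoryProperties    c1 ON c1.CategoryPropertyId = v1.CategoryPropertyId
                             AND c1.Name = 'platters'
JOIN ProductPropertyValues v2 ON v2.ProductId = p.ProductId
JOIN CategoryProperties    c2 ON c2.CategoryPropertyId = v2.CategoryPropertyId
                             AND c2.Name = 'interface'
JOIN ProductPropertyValues v3 ON v3.ProductId = p.ProductId
JOIN CategoryProperties    c3 ON c3.CategoryPropertyId = v3.CategoryPropertyId
                             AND c3.Name = 'rpm'
WHERE CAST(v1.Value AS INTEGER) > 3
  AND v2.Value = 'SATA'
  AND CAST(v3.Value AS INTEGER) = 10000
"""
matches = conn.execute(eav_query).fetchall()
print(matches)
```

With a user-chosen set of filters, the join list has to be generated at runtime, which is where the dynamic SQL mentioned in the question comes from.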
There are other approaches however. You could consider an XML field with extended data or, if you are on PostgreSQL 9.2, a JSON field (XML is easier to search though). This would give you a significantly larger range of possible searches without the headaches of EAV. The tradeoff would be that schema enforcement would be harder.
This question seems to discuss the issue in greater detail.
Apart from performance, extensibility and complexity discussed there, also take into account:
SQL databases such as SQL Server have full-text search features; so if you have a single field describing the product - full text search will index it and will be able to provide advanced semantic searches
take a look at NoSQL systems that are all the rage right now; scalability should be quite good with them and they provide support for non-structured data such as the data you have. Hadoop and Cassandra are good starting points.
You could very well work with the EAV model.
We do something similar with a Logistics application. It is built on .net though.
Apart from the tables, your application code has to handle the objects correctly.
See if you can add a generic table for each object. It works for us.

Recursively querying through structured table data / process design

After my first try to misappropriate Ms-Access - with your help - turned out to be a great success, I have been sent back to do "more of this".
A bit of introduction you can skip if you want:
I am building a data foundation about certain projects from which I want to create analyses and overviews.
The data and findings are to be represented in programs like Excel or Powerpoint, so the process itself is very open. It will probably be very visual with detailed points on request.
However, the data might be changing periodically and if this turns out well, I might repeat the process.
Therefore I think the ideal way would be to have a data layer, then a fixed set of queries on that data and then I would (semi-)manually compile the results into a report in whatever format fits, maybe using external data analysis tools such as R in between.
Trouble is, the only database I have access to is.. well.. Ms Access 2010. I am not at liberty to install anything on this machine.
I could of course use non-install or online tools if you have recommendations for this.
tl;dr: I want to use Ms Access to query data from a relational db into tabular format to be processed further by hand, using as little of Ms-Access VBA and forms as possible.
I have since started to implement a prototype in ms-Access, a standard relational database.
One interesting problem I have come across with this kind of design is that I have a table for companies involved in the projects. Along with this, I have a table of "relationships" - like stakeholdings, ownerships or cooperations.
So let's say company A is building project A, but is just a subsidiary of company B, which is in turn partly owned by company C and so on.
Now let's say I want to query all companies involved in a project, but as owners I just want to show the last "elements" of the chain.
Imagine I want to sort the list by net assets, which is usually a figure which is only available for the public companies at the end of the chain, not the project subsidiaries further down the chain etc.
Is this possible with (Ms-)SQL or would I need to do this in VBA?
Right now I think I could manage to write a VBA function and dump it into a temporary table, but then I'd have to create forms and such.
Another idea that immediately springs from this is to answer the question "In which projects does company C have a stake?" by a query. You can see where this is going.
I would prefer the database and the queries to be as flexible as possible (and in this case, independent from the actual Access).
So this time, no mock-program or user-interface. It was a pain to get what I want from Access in the last project and that was with a very specific question set...
But in general I am also open to use different tools if I can.
Thank you so much!
Modelling hierarchies in an RDBMS is a fairly tricky process - some (like Oracle) have built-in functionality to query hierarchical data, but I don't think Access does.
The best solution is to use a "nested set" model. This allows you to model hierarchical data while using standard SQL; it's also pretty fast for querying.
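A minimal sketch of the nested set idea, assuming a strict tree (each company has at most one owner). Each row stores a left and right number, and a node's descendants are exactly the rows whose numbers fall inside its own pair. The table name, column names and the sample companies are made up for the example; sqlite3 is used here only to keep it runnable, the queries themselves are plain SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (name TEXT PRIMARY KEY, lft INTEGER NOT NULL, rgt INTEGER NOT NULL);
-- Hypothetical tree: C owns B and D, B owns A.
INSERT INTO company VALUES
    ('Company C', 1, 8),
    ('Company B', 2, 5),
    ('Company A', 3, 4),
    ('Company D', 6, 7);
""")

def descendants(name):
    """Everything below `name`: rows nested strictly inside its (lft, rgt) pair."""
    rows = conn.execute("""
        SELECT child.name
        FROM company AS parent
        JOIN company AS child
          ON child.lft > parent.lft AND child.rgt < parent.rgt
        WHERE parent.name = ?
        ORDER BY child.lft
    """, (name,)).fetchall()
    return [r[0] for r in rows]

def ancestors(name):
    """Everything above `name`, outermost owner first."""
    rows = conn.execute("""
        SELECT parent.name
        FROM company AS child
        JOIN company AS parent
          ON parent.lft < child.lft AND parent.rgt > child.rgt
        WHERE child.name = ?
        ORDER BY parent.lft
    """, (name,)).fetchall()
    return [r[0] for r in rows]

print(descendants("Company C"))
print(ancestors("Company A"))
```

The first entry returned by `ancestors` is the end of the ownership chain. The trade-off is that inserting or moving a node means renumbering part of the tree.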
If your data isn't strictly hierarchical, the nested set isn't so useful; the typical solution in that case is to introduce a table to map the relationship - typically including the two related entities, and often with a "relationship type" field (e.g. "parent", "part owner" etc.). This structure is often called a Directed Acyclic Graph or DAG. There are several ways of modelling these in a database; a "closure table" is probably the most efficient. This article shows how to do this - it's a heavy read, but I think it answers your question.
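As a rough sketch of the closure table idea (not taken from the article mentioned above): keep the direct relationships in one table, and store every (ancestor, descendant) pair, direct or indirect, in a second one. Both questions from the post then become plain, non-recursive queries. The table and column names and the sample companies are assumptions; the closure is rebuilt here in Python for brevity, but it could be maintained by any client code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ownership (owner TEXT, owned TEXT, kind TEXT);
CREATE TABLE closure (ancestor TEXT, descendant TEXT,
                      PRIMARY KEY (ancestor, descendant));
""")

# Hypothetical data: B is A's parent; C and D each own part of B.
conn.executemany("INSERT INTO ownership VALUES (?, ?, ?)", [
    ("Company B", "Company A", "parent"),
    ("Company C", "Company B", "part owner"),
    ("Company D", "Company B", "part owner"),
])

def rebuild_closure():
    """Recompute all direct and indirect (ancestor, descendant) pairs."""
    direct = conn.execute("SELECT owner, owned FROM ownership").fetchall()
    pairs = set(direct)
    changed = True
    while changed:
        changed = False
        for ancestor, middle in list(pairs):
            for owner, owned in direct:
                if owner == middle and (ancestor, owned) not in pairs:
                    pairs.add((ancestor, owned))
                    changed = True
    conn.execute("DELETE FROM closure")
    conn.executemany("INSERT INTO closure VALUES (?, ?)", sorted(pairs))

def ultimate_owners(company):
    """Owners at the end of the chain: ancestors that nobody else owns."""
    rows = conn.execute("""
        SELECT c.ancestor
        FROM closure c
        WHERE c.descendant = ?
          AND NOT EXISTS (SELECT 1 FROM ownership o WHERE o.owned = c.ancestor)
        ORDER BY c.ancestor
    """, (company,)).fetchall()
    return [r[0] for r in rows]

def holdings(company):
    """Everything the company has a direct or indirect stake in."""
    rows = conn.execute(
        "SELECT descendant FROM closure WHERE ancestor = ? ORDER BY descendant",
        (company,)).fetchall()
    return [r[0] for r in rows]

rebuild_closure()
print(ultimate_owners("Company A"))
print(holdings("Company C"))
```

Joining the result of `ultimate_owners` to a net assets column would give the sorted list the question asks for.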

I need advice choosing a NoSQL database for a project with a lot of minute related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about 100s of public transit agencies, their routes, stations, times, and other related information. I will be getting my information from here and the google code wiki page with similar info. There is a lot of data and it's partitioned into multiple CSV formatted text files. These can be huge, some in the range of 80-100 MB.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their GeoSpatial support that can really optimize getting small datasets. But I also need to be sure to link all the stops on a route because I will be propagating information along a transit route for that line. In this case I have found that I can benefit from a Graph DB like Neo4j and OrientDB, but from what I know, neither has GeoSpatial support nor am I 100% sure that a Graph DB would be what I need.
The perfect solution might not exist, but I come here asking for help on finding the best possible one for my situation. I know I will possibly have to work around limitations of whatever I choose, but I want to at least have done my research and know that it's the best I can get at the moment.
I have also been suggested to splinter the data into multiple DBs, but that could get very messy because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously a graph database fits your problem 100%. My advice here is to go for a geospatial module on top of Neo4j or OrientDB, although there are other free and open source implementations.
I think the best one right now, with all the geospatial functionality implemented, is the neo4j-spatial package. But as far as I know, you can also reproduce most of the geospatial functionality on your own if necessary.
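As an illustration of doing the geospatial part yourself, here is a small sketch in plain Python that finds stops within a radius using the haversine great-circle distance. The field names stop_id, stop_lat and stop_lon follow GTFS stops.txt; the sample stops, the function names and the radius are assumptions made up for the example. It is a linear scan with no spatial index, so it only suits small candidate sets.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def stops_within(stops, lat, lon, radius_km):
    """Return the ids of stops inside the radius, nearest first."""
    hits = [(haversine_km(lat, lon, s["stop_lat"], s["stop_lon"]), s["stop_id"])
            for s in stops]
    return [stop_id for dist, stop_id in sorted(hits) if dist <= radius_km]

# Hypothetical stops, shaped like rows of a GTFS stops.txt file.
stops = [
    {"stop_id": "S1", "stop_lat": 40.7580, "stop_lon": -73.9855},
    {"stop_id": "S2", "stop_lat": 40.7614, "stop_lon": -73.9776},
    {"stop_id": "S3", "stop_lat": 40.6413, "stop_lon": -73.7781},
]

nearby = stops_within(stops, 40.7580, -73.9855, radius_km=2.0)
print(nearby)
```

A real deployment would normally pre-filter with a bounding box or a spatial index before computing exact distances.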
BTW, talking about splitting: if the amount of data/queries will be high, I strongly recommend that you distribute the load and design the model with that in mind. There is surely something you can do there.
I've used Mongo's GeoSpatial features and can offer some guidance if you need help with a C# or javascript implementation - I would recommend it to start because it's super easy to use. I'm learning all about Neo4j right now and I am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross reference the documents in Mongo to the nodes in Neo4j using the Mongo object id.
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), by keeping that object id reference I can simply add some code instructing each node go out and grab a list of movies from the related profile.
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important - for instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo that may be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but that is not relevant to the point)...on the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if I find that pulling an entity from the database [in other words(in the context of a relational database) - querying data from the database that provides the data required to generate an entity or list of entities that fulfills the requested parameters] does not require significant processing (multiple joins, for instance)
Do you require ACID compliance (aside: if you have a graph problem, you can leverage Neo4j for this)? There are document databases with ACID compliance, but there's a reason Mongo is not: What does MongoDB not being ACID compliant really mean?
One use of Mongo I saw in the wild that I thought was worthy of mention - Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user based customization in terms of targeting offers, advertising, etc..
NoSQL only exists because MySQL users assume that all databases have their performance problems when their database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.

Migrating procedural, antique CRUD code and proprietary DBMS to OO ORM on SQL

Please excuse my long-winded explanation, but I wanted to be as explicit as possible in the hopes of getting as much useful feedback on my situation as possible. You can skip to the questions at the bottom if you are impatient.
At my current job, development is done in an antiquated language that is hard-wired to a proprietary DBMS that comes with the language. The language is CRUD-focused, and is essentially a glorified database querying/reporting/updating language with some programming features bolted on as an afterthought. Most programs are top-down procedures and there is very little code reuse; updating a record often requires updating many entangled, related records at the same time that you just need to "know about" as the proprietary database has no inherent foreign key relationships. If a table needs to be updated, we generally must grep our source code and update every procedure that creates/updates records for that table and recompile. I could go on with other annoyances, but needless to say, I am looking for a way to abstract away as much of this behavior as possible into reusable code segments.
The language has semi-recently added some support for object-oriented development, and I have been able to demonstrate the benefits of reusable code to my coworkers with a recent project written using OO constructs. However, my project was only possible because it was a rare task that did not require interacting with our database.
I have really been trying hard to find a way to create re-usable code using OO techniques with this language, but since everything is so database-focused, what I really need is a way to create container classes around our table designs, putting most of our data processing logic into class methods and merging N related tables into 1 singular class. This has brought me to the idea of ORM frameworks, which of course is non-existent on the language I am using at work.
What I have found, is that the DBMS for this language can run a SQL99 engine concurrently with the proprietary language engine, and it includes JDBC and ODBC drivers. This has opened the door for me to explore migration strategies, which is where I think we eventually need to go. Since the SQL engine runs concurrently with the old engine, it is possible for us to do an incremental migration, running new code alongside old code with an eventual goal of migrating our data to a "pure" SQL DBMS when all the old code is replaced.
I initially did quite a bit of reading and proposed Java (using JPA2 for ORM) to my manager, but I think I scared him as he views Java as being a bit heavyweight for our needs. I then did a little more digging and re-proposed Ruby using the JRuby interpreter (using either ActiveRecord or DataMapper for ORM), which was much better received, as Rails seems to fit in well with the shift to Web-based front-ends that we are attempting with our old kludgy code, and of course because the ability to interact with Java if the need arises is a great capability.
The Questions
Nearly all of the reading I have been doing about ORM is focused on starting with a class structure, and creating the mapped database structure as a secondary process. Is going the other way around (starting with an existing database and mapping classes to it) a very odd thing to do?
Assuming question #1 == true, how flexible are existing ORM frameworks such as JPA2, ActiveRecord, DataMapper etc. to "imperfect" table design? I am sure we will have to do some refactoring of existing table design, but would like to know if I am undertaking a Herculean task before I waste too much time on the effort.
If anyone has a better idea for language+ORM, I would love to hear it. It must be SQL-ready using JDBC or ODBC to fit into our incremental migration plan.
If anyone has any experience on a similar effort and could point out any helpful resources (especially books), I would be very grateful!
Nearly all of the reading I have been doing about ORM is focused on starting with a class structure, and creating the mapped database structure as a secondary process. Is going the other way around (starting with an existing database and mapping classes to it) a very odd thing to do?
Not really. There are several approaches when dealing with the persistence layer of an application:
Top-down: You start with the object model and the mappings and you derive the database schema from that data.
Bottom-up: You start with your data model i.e. the database schema and you derive the object model and the mappings from the tables.
Middle-out: You start with the mapping and you generate the object model and the tables.
Meet-in-the-middle: You start with an existing database schema and an existing object model, and you develop a mapping between the two (you can even introduce an additional object layer to bridge the existing one).
The top-down approach is the most object-oriented but the meet-in-the-middle approach is probably the most common.
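To give a feel for the bottom-up direction, here is a toy sketch in Python that derives a class from an existing table by reading the column list from the database. It is not how JPA, ActiveRecord or DataMapper work internally, only an illustration of "schema first, classes second". The legacy-style table CUST_MST, its columns and the helper class_from_table are hypothetical.

```python
import sqlite3
from dataclasses import make_dataclass

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical legacy table with terse column names.
CREATE TABLE CUST_MST (CUST_NO INTEGER PRIMARY KEY, CUST_NM TEXT, CR_LIMIT REAL);
INSERT INTO CUST_MST VALUES (1, 'Acme', 5000.0);
""")

def class_from_table(conn, table, class_name):
    """Build a dataclass whose fields mirror the table's columns.

    PRAGMA table_info yields one row per column:
    (cid, name, type, notnull, dflt_value, pk).
    The table name is interpolated directly, so it must come from
    trusted code, never from user input.
    """
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    cls = make_dataclass(class_name, [c.lower() for c in columns])
    return cls, columns

Customer, columns = class_from_table(conn, "CUST_MST", "Customer")

# Load one row and wrap it in the derived class.
row = conn.execute(
    f"SELECT {', '.join(columns)} FROM CUST_MST WHERE CUST_NO = ?", (1,)
).fetchone()
customer = Customer(*row)
print(customer)
```

In practice the derived classes are only a starting point: renaming cryptic columns and merging related tables into one class is exactly where the mapping layer has to be flexible.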
Assuming question #1 == true, how flexible are existing ORM frameworks such as JPA2, ActiveRecord, DataMapper etc. to "imperfect" table design? I am sure we will have to do some refactoring of existing table design, but would like to know if I am undertaking a Herculean task before I waste too much time on the effort.
I would say that JPA is not the most flexible, it will not deal very well with exotic or heavily denormalized schemas (the result might be ugly from an OO point of view). Accesses that don't go through JPA might also be a problem. A data mapper tool like iBatis (now mybatis) will give you more flexibility.
If anyone has a better idea for language+ORM, I would love to hear it. It must be SQL-ready using JDBC or ODBC to fit into our incremental migration plan.
I know that RoR can deal with existing databases, I'm just not sure what the result will look like. But I don't really have enough experience with RoR so I'll let experts elaborate on this.
If anyone has any experience on a similar effort and could point out any helpful resources (especially books), I would be very grateful!
I suggest browsing Scott Ambler's website and his book(s):
The Process of Database Refactoring: Strategies for Improving Database Quality
More food for thought:
Working Effectively with Legacy Code by Michael Feathers
Clean Code by Robert Martin

How do you maintain a library of useful SQL in a team environment?

At my work everyone has sql snippets that they use to answer questions. Some are specific to a customer, while some are generic for a given database. I want to consolidate those queries into a library/repository that can be accessed by anyone on the team. The requirements would be:
Taggable (multiple tags allowed per sql)
Exportable (create a document containing all queries with certain tags)
I'm interested in what has been found to work in other team environments.
You could use a wiki.
You could get started with something as simple as Tiddly wiki.
A wiki is a great approach.
For database specific or project specific snippets it's also very useful to have links to where a similar construct occurs in the code. We use trac's wiki, which gives nice integration with our SVN for this.
Rather than pasting SQL snippets, I would consider graduating to an ORM (Object-Relational Mapper) or some other library to make representing and manipulating the data easier. It provides a layer of encapsulation to guard against schema changes and a layer of abstraction so you can think of the data in terms of business logic (i.e. a user) rather than a collection of tables (i.e. a user table, a password table, an access table...).
In Perl this would be something like DBIx::Class.
Another approach you may want to look at is creating views in your database. 'select * from some_view' can hide quite a bit of SQL. You'll still want to use a wiki to document them, but if it's a view you don't have to worry about people keeping outdated copies.
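A small sketch of the view idea, using Python's sqlite3 module so it can be run as-is. The tables, the view name customer_totals and the sample rows are made up for the example; the point is only that the join and aggregation live in the database once, and everyone on the team queries the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);

-- Hypothetical sample data.
INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0);

-- The shared "snippet" is stored in the database as a view.
CREATE VIEW customer_totals AS
    SELECT c.name        AS customer,
           COUNT(o.id)   AS order_count,
           SUM(o.total)  AS total_spent
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name;
""")

# Team members only need to remember this one-liner.
totals = conn.execute(
    "SELECT * FROM customer_totals ORDER BY customer"
).fetchall()
print(totals)
```

If the underlying tables change, the view definition is updated in one place instead of in every private copy of the query.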