Recursively querying through structured table data / process design - sql

After my first try to misappropriate Ms-Access - with your help - turned out to be a great success, I have been sent back to do "more of this".
A bit of introduction you can skip if you want:
I am building a data foundation about certain projects from which I want to create analysises and overviews.
The data and findings are to be represented in programs like Excel or Powerpoint, so the process itself is very open. It will probably be very visual with detailed points on request.
However, the data might be changing periodically and if this turns out well, I might repeat the process.
Therefore I think the ideal way would be to have a data layer, then a fixed set of queries on that data and then I would (semi-)manually compile the results into a report in whatever format fits, maybe using external data analysis tools such as R in between.
Trouble is, the only database I have access to is.. well.. Ms Access 2010. I am not at liberty to install anything on this machine.
I could of course use non-install or online tools if you have recommendations for this.
tl;dr: I want to use Ms Access to query data from a relational db into tabular format to be processed further by hand, using as little of Ms-Access VBA and forms as possible.
I have since started to implement a prototype in ms-Access, a standard relational database.
One interesting problem I have come up with with this kind of design is that I have a table for companies involved in the projects. Along with this, I have a table of "relationship" - like stakeholding, ownerships or cooperations.
So let's say company A is building project A, but is just a subsidary of company B, which then partly owned by company C and so on.
Now let's say I want to query all companies involved in a project, but as owners I just want to show the last "elements" of the chain.
Imagine I want to sort the list by net assets, which is usually a figure which is only available for the public companies at the end of the chain, not the project subsidaries up the chain etc.
Is this possible with (Ms-)SQL or would I need to do this in VBA?
Right now I think I could manage do write a VBA function and dump it into a temporary table, but then I'd have to create forms and such.
Another idea that immediately springs from this is ´to answer the question "In which project does company C have stake" by a query. You can see where this is going.
I would prefer the database and the queries to be as flexible as possible (and in this case, independend from the actual Access).
So this time, no mock-program or user-interface. It was a pain to get what I want from Access in the last project and that was with a very specific question set...
But in general I am also open to use different tools if I can.
Thank you so much!

Modelling hierarchies in an RDBMS is a fairly tricky process - some (like Oracle) have built-in functionality to query hierarchical data, but I don't think Access does.
The best solution is to use a "nested set" model. This allows you to model hierarchical data while using standard SQL; it's also pretty fast for querying.
If your data isn't hierarchical, the nested set isn't so useful; the typical solution in that case is to introduce a table to map the relationship - typically including the two related entities, and often with a "relationship type" field (e.g. "parent", "part owner" etc.). This is often called a Directed Acyclical Graph or DAG. There are several ways of modelling these in a database; a "Closure table" is probably the most efficient. This article shows how to do this - it's a heavy read, but I think it answers your question.

Related

Normalisation and multi-valued fields

I'm having a problem with my students using multi-valued fields in access and getting confused about normalisation as a result.
Here is what I can make out. Given a 1-to-many relationship, e.g.
Articles Comments
-------- --------
artID{PK} commID{PK}
text text
artID{FK}
Access makes it possible to store this information into what appears to be one table, something like
Articles
--------
artID{PK}
text
comment
+ value
"value" referring to multiple comment values for the comment "column", which access actually stores as a separate table. The specifics of how the values are stored - table, its PK and FK - is completely hidden, but it is possible to query the multi-valued field, e.g. in the example above with the query
INSERT INTO article( [comment].Value )
VALUES ('thank you')
WHERE artID = 1;
But the query doesn't quite reveal the underlying structure of the hidden table implementing the multi-valued field.
Given this (disaster, in my view) - my problem is how to help newcomers to database design and normalisation understand what Access is offering them, why it may not be helpful, and that it is not a reason to ignore the basics of the relational model. More specifically:
Are there better ways, besides queries as above, to reveal the structure behind multi-valued fields?
Are there good examples of where the multi-valued field is not good enough, and shows the advantage of normalising explicitly?
Are there straightforward ways to obtain the multi-select visual output of Access multi-values, but based on separate, explicit tables?
Thanks!
I cannot give you advice in using this feature, because I never used it; however, I can give you reasons not to use it.
I want to have full control on what I'm doing. This is not the case for multi-valued fields, therefore I don't use them.
This feature is not expandable. What if you want to add a date field to your comments, for instance?
It is sometimes necessary to upsize an Access (backend) database to a "big" database (SQL Server, Oracle). These Databases don't offer such a feature. It is often the customer who decides which database has to be used. Recently I had to migrate an Access application (frontend) using an Oracle backend to a SQL-Server backend because my client decided to drop his Oracle server. Therefore it is a good idea to restrict yourself to use only common features.
For common tasks like editing lookup tables I created generic forms. My existing solutions will not work with multi-valued fields.
I have a (self-made) tool that synchronizes changes in the structure of the database on my developer’s site with the database on the client’s site. This tool cannot deal with multi-valued fields.
I have tools for the security management that can grant SELECT, INSERT, UPDATE and DELETE rights on tables or revoke them. Again, the management tool does not work with multi-valued fields.
Having a separate table for the comments allows you to quickly inspect all the comments (by opening the table). You cannot do this with multi-valued fields.
You will not see the 1 to n relation between the articles and the comments in a database diagram.
With a separate table you can choose whether you want to cascade deletes to the details table or not. If you don't, you will not be able to delete an article as long as there are comments attached to it. This can be desirable, if you want to protect the comments from being deleted inadvertently.
It is important to realize the difference between physical and logical relationships. Today the whole internet and web services (SOAP) quite much realizes on a data format that is multi-value in nature.
When you represent multi-value data with a relational database (such as Access), then behind the scenes you are using a traditional (and legitimate) relation. I cannot stress that as such, then the use of multi-value columns in Access is in fact a LEGITIMATE relational model.
The fact that table is not exposed does not negate this issue. In fact, if you represent an invoice (master record, and repeating details) as a XML data cube, then we see two things:
1) you can build and represent that invoice with a relational database like Access
2) such a relational data model that is normalized can ALSO be represented as a SINGLE xml string.
3) deleting the XML record (or string) means that cascade delete of the child rows (invoice details) MUST occur.
So while it is true that Multi-Value fields been added to Access to deal with SharePoint, it is MOST important to realize that such data can be mapped to a relational database (if you could not do this, then Access could not consume that XML data using relational database tables as ACCESS CURRENTLY DOES RIGHT NOW).
And with the web such as XML, and SharePoint then the need to consume and manage and utilize such data is not only widespread, but is in fact a basic staple of the internet.
As more and more data becomes of a complex nature, we find the requirement for multi-value data exploding in use. Anyone who used that so called "fad" the internet is thus relying and using data that is in fact VERY OFTEN XML and is multi-value (complex) in nature.
As long as the logical (not physical) relational data model is kept, then use of multi-value columns to represent such data is possible and this is exactly what Access is doing (it is mapping the relational data model to a complex model). Note that the complex (xml) data model does NOT necessary have to be relational in nature. However, if you ARE going to map such data to Access then the complex multi-value model MUST CONFORM TO A RELATIONAL data model.
This is EXACTLY what is occurring in Access.
The fact that such a correct and legitimate math relational model is not exposed is of little issue here. Are we to suggest that because Excel does not expose the binary codes used then users will never learn about computers? Or perhaps we all must program in assembler so we all correctly learn how computers works.
At the end of the day, who cares and why does this matter? The fact that people drive automatic cars today does not toss out the concept that they are using different gears to operate that car. The idea that we shut down all of society because someone is going to drive an automatic car or in this case use complex data would be galactic stupid on our part.
So keep in mind that extensions to SQL do exist in Access to query the multi-value data, but as well pointed out here those underlying tables are not exposed. However, as noted, exposing such tables would STILL REQUIRE one to not change or mess with cascade delete since that feature is required TO MAINTAIN A INTERSECTION OF FEATURES and a CORRECT MATH relational model between the complex data model (xml) and that of using two related tables to represent such data.
In other words, you can use related tables to represent the complex data model IF YOU REMOVE the ability of users to play with the referential integrity options. The RI options MUST remain as set in those hidden tables else such data will not be able to make the trip BACK to the XML or complex data model of which it was consumed from.
As noted, in regards to users being taught how gasoline reacts with oxygen for that of learning to drive a car, or using a word processor and being forced to learn a relational model and expose the underlying tables makes little sense here.
However, the points made here in regards to such tables being exposed are legitimate concerns.
The REAL problem is SQL server and Oracle etc. cannot consume or represent that complex data WHILE ACCESS CAN CONSUME such data.
As noted, the complex data ship has LONG ago sailed! XML, soap, and the basic technologies of the internet are based on this complex data model.
In effect, SQL server, Oracle and most databases cannot that consume this multi-value data represent it without users having to create and model such data in a relational fashion is a BIG shortcoming of SQL server etc.
Access stands alone in this ability to consume this data.
So, for anyone who used a smartphone, iPad or the web, you are using basic technologies that are built around using complex data, something that Access now allows.
It is likely that the rest of the industry will have to follow suit given that more and more data is complex in nature. If the database industry does not change, then the mainstream traditional relational database system will NOT be the resting place of such data.
A trend away from storing data in related tables is occurring at a rapid pace right now and products like SharePoint, or even Google docs is proof of this concept. So Access is only reacting to market pressures and it is likely that other database vendors will have to follow suit or simply give up on being part of the "fad" called the internet.
XML and complex data structures are STAPLE and fact of our industry right now – this is not an issue we all should run away from, but in fact embrace.
Albert D. Kallal (Access MVP)
Edmonton, Alberta Canada
kallal#msn.com
The technical discussion is interesting. I think the real problem lies in student understanding. Because it is available in Access students will use it, and initially it will probably provide a simple solution to some design problems. The negatives will occur later when they try and use the data. Maybe a simple example demonstrating the problems would persuade some students to avoid using multi-valued fields ? Maybe an example of storing the data in another, more usable format would help ?
Good luck !
Peter Bullard
MS Access does a great job of simplifying database management and abstracting out a lot of complexity. This however makes the learning of dbms concepts a bit difficult. Have you tried using other 'standard' dbms tools like MySQL (or even sqlite). From a learning perspective they may be better.
I know this post is old. But, it's not quite the same as every other post I've seen on this topic. This one has someone making a good case for using Multi Valued Fields...
As someone who is trying who is still trying very hard to get their head around Access, I find the discussion for and against using the Multi Valued Fields incredibly frustrating.
I'm trying to sort through it all, but if everyone is so against them, what is an alternative method? It seems that in every search result I find everyone is either telling you how to use Multi Valued Fields and Controls or telling you how horrible and what a mistake they are. Many people refer to an alternative to them, but nobody says "Here's an example". I'm here to learn about these things. And while I know that this is a simpler concept for a lot of people in these forums, I could really use some examples to take a look at.
I'm at a point where I have to decide which way to go. It would be wonderful to compare examples of using Multi Valued Fields and alternatives and using a control to select multiple values.
Or am I wrong and the functionality of a combobox where you can select multiple items is only available through Access?
I want to address the last of your questions first. There is a way of providing a visual presentation of a parent child relationship. It's called subforms. If you get help about subforms in Access, it will explain the concept.
I have used subforms in a project where I wanted to display the transaction header in a form and the transaction details in a subform. There is nothing to hinder this construct even when the data is stored in two normalized tables.
Of course, this affects the screen, not the database. That's the whole point. Normalization is relevant to storage and retrieval, not to other uses of data.

Custom user-driven reports on a known schema

There's an upcoming project at work to fill a requirement that end-users be able to generate custom reports off their data in within our fixed/known-schema relational database.
The interface needs to be very user friendly and so transposing all of t-sql's language concepts into a graphical paradigm is far too complex for both the project team and the end user.
What research or products, open-source or otherwise, exist around satisfying this of business need? I'm aware of general Business Analytic tools but this is more specific and I'm trying to understand the problem domain better rather than trying to reverse engineer it from vendor marketing materials.
I assume the research would be in the form of a some encoding of the schema that specifies which joins and tables are allowed, which fields are available, then then a method for allowing the user to select one particular valid combination among the possible many, generate the query, and display the results.
Brainstorming - feature support in order of complexity: SELECT, WHERE filters, FULL JOIN, LEFT JOIN, sorting, paging, grouping, aggregation, HAVING filter.
My backup plan is to just dumb it down to pre-written SQL Views (with JOINs built-in) with the ability to display available columns with custom row-wise filtering. Paging and sorting is doable. By itself, this doesn't allow for grouping, aggregate functions, HAVING filters, or other inter-row analysis.
As a follow-up to #Dems post (comment box wasn't bit enough :) )..
Agreed on most counts.. If your data is mostly analytic, then you might want to look into a tool like PowerPivot. In this case, you can write a general query then allow the users to derive reports based on the result set in a familiar tool (Excel).
At the core of every ad hoc reporting engine, you will find a few common themes:
Metadata
There will be some way of describing the schema such that the model may be easily consumed by the user. Sql Server Reporting Services (SSRS) requires you to build a metadata model in order to use the report builder. When using PowerPivot, you can alias column names to make them more readable, but in the end, you are simply providing a flat dataset and allowing the user to build the joins/relationships.
Query Builder
Once the metadata has been manipulated by the user, an intermediary system must be in place to convert the conceptual report into an actual query. Many tools are measured based on the complexity of the Sql that they produce as this can greatly affect performance. One way to get around this is to create views that the reporting engine may build queries against. One of the best open source examples of this that I have seen is the engine that backs Hibernate/NHibernate (look into how the various Dialects are used when building queries).
Rendering Engine
In my experience building a rendering engine is not a road you want to go down. There are many device-specific concerns as well as look & feel problems (i.e. how do you plan on representing cascading joins/relationships?). Every rendering engine has it's own quirks (PowerPivot uses Excel, SSRS has a service that builds the raw result and return it to the consuming application) that must be accounted for, so be careful how you choose.
Earlier I mentioned that I agreed on most counts. I would not recommend encouraging your users to learn Sql or allowing them to pass-through Sql to the underlying data-store. This opens the door to malicious code being written and can become a security nightmare. Not to mention that most business users think in terms of flat tables, not hierarchical sets.
Figure out what your users are comfortable with and try to fit your solution to that domain. I have often found that for sophisticated business users something like PowerPivot is perfect. For more day-to-day end users, having "canned" reports that might be modified by the end user via a simple user interface that allows them to modify restrictions/groupings/sorting is more useful.
There are many options out there, but the best of them cost money.
I really like QlikView as an easy to use report designed for semi-technical people. If your user base is more technically minded it may be a bit restrictive, but if your user base have no logical thought capabilities, it's too complicated. That's the biggest trap I see you falling in to...
- No, I want more than that!
- No, that's too complicated for me!
- At the same time...
If you were to build your own tool-set internally, you'd probably be best sticking with OLAP cubes. Let people slice and dice the data as they like, but with all the relationships pre-defined. Do it right and you can just point an Excel Pivot Table at the OLAP Cube and let them play...
The next up, as Bobby D says, could be SQL Server Reporting Services, or something similar.
But if your users end up wanting absolute flexibility, the tool they need is SQL itself. Unfortunately, all tools follow the same trend: The more flexible and powerful, the more time you need to spend learning/training.

Schema-agnostic OLAP-like tool?

Do there exist any (ideally free or open-source) tools for performing OLAP analyses on arbitrary tables in a relational database, without requiring any advance specification of dimensional hierarchies, cardinalities, or any other meta-information about the table beyond what can be extracted automatically from the table itself?
My inability to Google for anything like what I'm describing makes me suspect I'm using incorrect terminology and what I'm searching for isn't properly considered to be OLAP. If this is the case, what I specifically want is anything that would let technically unsophisticated users create cross-tab or contingency table aggregations using tables in a relational DB without needing to write elaborate SQL queries.
Or, in other words, I'd like something that mimics Excel's PivotTables on a larger scale. I appreciate that Excel does indeed generate extensive caches behind the scenes when you make a PivotTable, but it does this without the user having to explain to it which caches need creating. This is the functionality I'm trying to find elsewhere, if it exists.
The best options I know of are Excel and Access, but of course they are not open source. This space kinda got trampled in the explosion of interest in what is now called Business Intelligence and a lot of companies got bought by MS and others. It's pretty thin now as far as I can tell. I'll watch this thread though.
The most useful paradigm to attach to is I think spreadsheets and there's not much competition there any more. Google Docs spreadsheets can import csv etc. exported from databases, and there's a pivot chart available, but not much more.
The other place I've seen OLAP capabilities is in the Adobe Flex libraries to build on with ActionScript if you have any inclination in that direction. As usual, Adobe manages to get it about 90% right but doesn't quite provide a whole product.
icCube aims to setup an OLAP cube as simply as possible. It is not schema-agnostic, but I guess this is quite simple to define dimensions and facts from existing DB tables. Nevertheless, this could be not so "simple" depending on your tables - difficult to say without knowledge about them. I guess there's no generic easy solution ;-)
Then you can use Excel pivot table (amongst others) to access the cubes. Note as far as I know Excel does not do any caching neither aggregation when connecting to a cube. Indeed, it is generating all the required MDX requests to the cube.
Hope that helps.

Database System Architecture discussion

I'd like to start a discussion about the implementation of a database system.
I'm working for a company having a database system grown over ca. the last 10 years.
Let me try to describe what it's doing and how it's implemented:
The system is divided into 3 main parts handled by 3 different teams.
Entry:
The Entry Team is responsible for creating GUIs for the system. In the background is a huge MS SQL database (ca. 100 tables) and the GUI is created using .NET. There are different GUI applications and each application has lots of different tabs to fill in the corresponding tables. If e.g. a new column is added to the database, this column is added manually to the GUI application.
Dataflow:
The purpose of the Dataflow Team is to do do data calculations and prepare the data for the reporting team. This is done via multiple levels. Let me try to explain the process a little bit more in detail: The Dataflow Team uses the data from the Entry database copied to another server and another database via Transactional-Replication (this data contains information from all clients). Then once per hour a self-written application is checking for changed rows in the input tables (using a ChangedDate column) and then calling a stored procedure for each output table calculating new data using 1-N of the input tables. After that the data is copied to another database on another server using again Transaction-Replication. Here another stored procedure is called to calclulate additional new output tables. This stored procedure is started using a SQL job. From there the data is split to different databases, each database being client specific. This copying is done using another self-written application using the .NET bulkcopy command (filtering on the client). These client specific databases are copied to different client specific reporting databases on other servers via another self-written application which compares the reporting database with the client specific database to calculate the data difference. Just the data differences are copied (because the reporting database run in former times on the client servers).
This whole process is orchestrated by another self-written application to control e.g. if the Transactional-Replications are finished before starting the job to call the Stored procedure etc... Futhermore also the synchronisation between the different clients is orchestrated here. The process can be graphically displayed by a self-written monitoring tool which looks pretty complex as you can imagine...
The status of all this components is logged and can be viewed by another self-written application.
If new columns or tables are added all this components have to be manually changed.
For deployment installation instructions are written using MS Word. (ca. 10 people working in this team)
Reporting:
The Reporting Team created it's own platform written in .NET to allow the client to create custom reports via a GUI. The reports are accessible via the Web.
The biggest tables have around 1 million rows. So, I hope I didn't forget anything important.
Well, what I want to discuss is how other people realize this scenario, I can't imagine that every company writes it's own custom applications.
What are actually the possibilities to allow fast calculations on databases (next to using T-SQL). I'm somehow missing the link here to the object oriented programming I'm used to from my old company, but we never dealt with so much data and maybe for fast calculations this is the way to do it...Or is it possible using e.g. LINQ or BizTalk Server to create the algorithms and calculations, maybe even in a graphical way? The question is just how to convert the existing meter-long Stored procedures into the new format...
In future we want to use data warehousing, but that will take a while, so maybe it's possible to have a separate step to streamline the process.
Any comments are appreciated.
Thanks
Daniel
Why on earth would you want to convert existing working complex stored procs (which can be performance tuned) to LINQ (or am I misunderstanding you)? Because you personally don't like t-sql? Not a good enough reason. Are they too slow? Then they can be tuned (which is something you really don't want to try to do in LINQ). It is possible the process can be made better using SSIS, but as complex as SSIS is and the amount of time a rewrite of the process would take, I'm not sure you really would gain anything by doing so.
"I'm somehow missing the link here to the object oriented programming..." Relational databases are NOT Object-oriented and cannot perform well if you try to treat them like they are. Learn to think in terms of sets not objects when accessing databases. You are coming from the mindset of one user at a time inserting one record at a time, but this is not the mindset neeeded to deal with the transfer of large amounts of data. For these types of things, using the database to handle the problem is better than doing things in an object-oriented manner. Once you have a large amount of data and lots of reporting, people are far more interested in performance than you may have been used to in the past when you used some tools that might not be so good for performance. Whether you like T-SQL or not, it is SQL Server's native language and the database is optimized for it's use.
The best advice, having been here before, is to start by learning first how SQL works, and doing it in the context of the existing architecture sounds like a good way to start (since nothing you've described sounds irrational on the face of it.)
Whatever abstractions you try to lay on top (LINQ, Biztalk, whatever) all eventually resolve to pure SQL. And almost always they add overhead and complexity.
Your OO paradigms aren't transferable. Any suggestions about abstractions will need to be firmly defensible based on your firm grasp of the SQL consequences.
It will take a while, but it's all worth knowing, both professionally and personally.
I'm currently re-engineering a complex system which is moving from Focus (a database and language) to a data warehouse (separate team) and processing (my team) and reporting (separate team).
The current process is combined - data is loaded and managed in the Focus language and Focus database(s) and then reported (and historical data is retained)
In the new process, the DW is loaded and then our process begins. Our processes are completely coded in SQL, and a million row fact table (for one month) would be relatively small. We have some feeds where the monthly data is 25 million rows. There are some statistics tables produced which are over 200 million rows (a month). The processing can take several hours a month, end to end. We use tables to store intermediate results, and we ensure indexing strategies are suitable for the processing. Except for one piece implemented as an SSIS flow from the database back to itself because of extremely poor scalar UDF performance, the entire system is implemented as a series of T-SQl SPs.
We also have a process monitoring system similar to what you are discussing as well as having the dependencies in a table which ensures that each process runs only if all its prerequisites are satisfied. I've recently grafted on the MSAGL to graphically display and interact with the process (previously I was using graphviz to generate static images) from a .NET Windows application. The new system thus has much clearer dependency information as well as good information about process performance so effort can be concentrated on the slowest performing bottlenecks.
I would not plan on doing any re-engineering of any complex system without a clear strategy, a good inventory of the existing system and a large budget for time and money.
From the sounds of what you are saying, you have a three step process.
Input data
Analyze data
Report data
Steps one and three need to be completed by "users". Therefore, a GUI is needed for each respective team to do the task at hand, otherwise, they would be directly working on SQL Server, and would require extensive SQL knowledge. For these items, I do not see any issue with the approach your organization is taking, you are building a customized system to report on the data at hand. The only item that might be worth considering on these side, is standardization between the teams on common libraries and the technologies used.
Your middle step does seem to be a bit lengthy, with many moving parts. However, I've worked on a number of large reporting systems where that is truly the only way to get around it. WIthout knowing more of your organization and the exact nature of operations.
By "fast calculations" you must mean "fast retrieval" Data warehouses (both relational and otherwise) are fast with math because the answers are pre-calculated in advance. SQL, unless you are using CLR stored procedures, is usually a rather slow when it comes to math.
You'd be hard pressed to defeat the performance of BCP and SQL with anything else. If the update routines are long and bloated because they loop through the tables, then sure I can see why you'd want to go to .NET. But you'd probably increase performance by figuring out how to rewrite them all nice and SET based. BCP is not going to be able to be beaten. When I used SQL Server 2000 BCP was often faster than DTS. And SSIS in general (due to all the data type checking) seems to be way slower than DTS. If you kill performance no doubt people are going to be coming to you. Still if you are doing a ton of row by row complex calculations, optimizing that into a CLR stored procedure or even a .NET application that is called from SQL Server to do the processing will probably result in a speed up. Of course if you were row processing and you manage to rewrite the queries to do set processing you'd probably get a bigger speed up. But depending upon how complex the calculations are .NET may help.
Now if a front end change could immediately update and propagate the data, then you might want to change things to .NET so that as soon as a row is changed it can be recalculated and update all the clients. However if a lot of rows are changed or the database is just ginormous then you will kill performance. If the operation needs to be done in bulk then probably the way it is currently being done is the best.
The only thing I might as is that maybe there is a lot of duplicate SQL that looks exactly the same except for a table name and or the column names. If so, you can probably use .NET combined with SQL-SMO(or DMO if using SQL Server 2000) to code generate it.
Here's an example that I often see to load a datawarehouse
Assuming some row tables are loaded with the data from the source
select changed rows from source into temporary tables
see if any columns that matter were changed
if so terminate existing row (or clone it into some history table)
insert/update new row
I often see one of those queries per table and the only variations are the table/column names and maybe references to the key column. You can easily get the column definitions and key definitions out of SQL Server and then make a .NET program to create the INSERT/SELECT/ETC. In the worst case you may just have to store some type of table with TABLE_NAME, COLUMN_NAME for the columns that matter. Then instead of having to wrap your head around a complex ETL process and 20 or 200 update queries, you just need to wrap your head around UPDATE and one query. Any changes to the way things are done can be done once and applied to all the queries.
In particular my guess is that you can apply this technique to the individual client databases if you haven't already. Probably all the queries/bulk copy scripts are the same or almost the same with the exception of database/server name. So you can just autogenerate them based on a CLIENTs table or something.....

How to document a database [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
This post was edited and submitted for review 12 months ago and failed to reopen the post:
Original close reason(s) were not resolved
Improve this question
(Note: I realize this is close to How do you document your database structure? , but I don't think it's identical.)
I've started work at a place with a database with literally hundreds of tables and views, all with cryptic names with very few vowels, and no documentation. They also don't allow gratuitous changes to the database schema, nor can I touch any database except the test one on my own machine (which gets blown away and recreated regularly), so I can't add comments that would help anybody.
I tried using "Toad" to create an ER diagram, but after leaving it running for 48 hours straight it still hadn't produced anything visible and I needed my computer back. I was talking to some other recent hires and we all suggested that whenever we've puzzled out what a particular table or what some of its columns means, we should update it in the developers wiki.
So what's a good way to do this? Just list tables/views and their columns and fill them in as we go? The basic tools I've got to hand are Toad, Oracle's "SQL Developer", MS Office, and Visio.
In my experience, ER (or UML) diagrams aren't the most useful artifact - with a large number of tables, diagrams (especially reverse engineered ones) are often a big convoluted mess that nobody learns anything from.
For my money, some good human-readable documentation (perhaps supplemented with diagrams of smaller portions of the system) will give you the most mileage. This will include, for each table:
Descriptions of what the table means and how it's functionally used (in the UI, etc.)
Descriptions of what each attribute means, if it isn't obvious
Explanations of the relationships (foreign keys) from this table to others, and vice-versa
Explanations of additional constraints and / or triggers
Additional explanation of major views & procs that touch the table, if they're not well documented already
With all of the above, don't document for the sake of documenting - documentation that restates the obvious just gets in people's way. Instead, focus on the stuff that confused you at first, and spend a few minutes writing really clear, concise explanations. That'll help you think it through, and it'll massively help other developers who run into these tables for the first time.
As others have mentioned, there are a wide variety of tools to help you manage this, like Enterprise Architect, Red Gate SQL Doc, and the built-in tools from various vendors. But while tool support is helpful (and even critical, in bigger databases), doing the hard work of understanding and explaining the conceptual model of the database is the real win. From that perspective, you can even do it in a text file (though doing it in Wiki form would allow several people to collaborate on adding to that documentation incrementally - so, every time someone figures out something, they can add it to the growing body of documentation instantly).
One thing to consider is the COMMENT facility built into the DBMS. If you put comments on all of the tables and all of the columns in the DBMS itself, then your documentation will be inside the database system.
Using the COMMENT facility does not make any changes to the schema itself, it only adds data to the USER_TAB_COMMENTS catalog table.
In our team we came to useful approach to documenting legacy large Oracle and SQL Server databases. We use Dataedo for documenting database schema elements (data dictionary) and creating ERD diagrams. Dataedo comes with documentation repository so all your team can work on documenting and reading recent documentation online. And you don’t need to interfere with database (Oracle comments or SQL Server MS_Description).
First you import schema (all tables, views, stored procedures and functions – with triggers, foreign keys etc.). Then you define logical domains/modules and group all objects (drag & drop) into them to be able to analyze and work on smaller chunks of database. For each module you create an ERD diagram and write top level description. Then, as you discover meaning of tables and views write a short description for each. Do the same for each column. Dataedo enables you to add meaningful title for each object and column – it’s useful if object names are vague or invalid. Pro version enables you to describe foreign keys, unique keys/constraints and triggers – which is useful but not essential to understand a database.
You can access documentation through UI or you can export it to PDF or interactive HTML (the latter is available only in Pro version).
Described here is a continuous process rather than one time job. If your database changes (eg. new columns, views) you should sync your documentation on regular basis (couple clicks with Dataedo).
See sample documentation:
http://dataedo.com/download/Dataedo%20repository.pdf
Some guidelines on documentation process:
Diagrams:
Keep your diagrams small and readable – just include important tables, relations and columns – only the one that have any meaning to understand big picture – primary/business keys, important attributes and relations,
Use different color for key tables in a diagram,
You can have more than one diagram per module,
You can add diagram to description of most important tables/with most relations.
Descriptions:
Don’t document the obvious – don’t write description “Document date” for document.date column. If there’s nothing meaningful to add just leave it blank,
If objects stored in tables have types or statuses it’s good to list them in general description of a table,
Define format that is expected, eg. “mm/dd/yy” for a date that is stored in text field,
List all known/important values an it’s meaning, e.g. for status column could be something like this: “Document status: A – Active, C – Cancelled, D – Deleted”,
If there’s any API to a table – a view that should be used to read data and function/procedures to insert/update data – list it in the description of table,
Describe where does rows/columns’ values come from (procedure, form, interface etc.) ,
Use “[deprecated]” mark (or similar) for columns that should not be used (title column is useful for this, explain which field should be used instead in description field).
We use Enterprise Architect for our DB definitions. We include stored procedures, triggers, and all table definitions defined in UML. The three brilliant features of the program are:
Import UML Diagrams from an ODBC Connection.
Generate SQL Scripts (DDL) for the entire DB at once
Generate Custom Templated Documentation of your DB.
You can edit your class / table definitions within the UML tool, and generate a fully descriptive with pictures included document. The autogenerated document can be in multiple formats including MSWord. We have just less than 100 tables in our schema, and it's quite managable.
I've never been more impressed with any other tool in my 10+ years as a developer. EA supports Oracle, MySQL, SQL Server (multiple versions), PostGreSQL, Interbase, DB2, and Access in one fell swoop. Any time I've had problems, their forums have answered my problems promptly. Highly recommended!!
When DB changes come in, we make then in EA, generate the SQL, and check it into our version control (svn). We use Hudson for building, and it auto-builds the database from scripts when it sees you've modified the checked-in sql.
(Mostly stolen from another answer of mine)
This answer extends Kieveli's above, which I upvoted. If your version of EA supports Object Role Modeling (conceptual design, vs. logical design = ERD), reverse engineer to that and then fill out the model with the expressive richness it gives you.
The cheap and lighter-weight option is to download Visiomodeler for free from MS, and do the same with that.
The ORM (call it ORMDB) is the only tool I've ever found that supports and encourages database design conversations with non-IS stakeholders about BL objects and relationships.
Reality check - on the way to generating your DDL, it passes through a full-stop ERD phase where you can satisfy your questions about whether it does anything screwy. It doesn't. It will probably show you weaknesses in the ERD you designed yourself.
ORMDB is a classic case of the principle that the more conceptual the tool, the smaller the market. Girls just want to have fun, and programmers just want to code.
A wiki solution supports hyperlinks and collaborative editing, but a wiki is only as good as the people who keep it organized and up to date. You need someone to take ownership of the document project, regardless of what tool you use. That person may involve other knowledgeable people to fill in the details, but one person should be responsible for organizing the information.
If you can't use a tool to generate an ERD by reverse engineering, you'll have to design one by hand using TOAD or VISIO.
Any ERD with hundreds of objects is probably useless as a guide for developers, because it'll be unreadable with so many boxes and lines. In a database with so many objects, it's likely that there are "sub-systems" of a few dozen tables and views each. So you should make custom diagrams of these sub-systems, instead of expecting a tool to do it for you.
You can also design a pseudo-ERD, where groups of tables are represented by a single object in one diagram, and that group is expanded in another diagram.
A single ERD or set of ERD's are not sufficient to document a system of this complexity, any more than a class diagram would be adequate to document an OO system. You'll have to write a document, using the ERD's as illustrations. You need text descriptions of the meaning and use of each table, each column, and the relationships between tables (especially where such relationships are implicit instead of represented by referential integrity constraints).
All of this is a lot of work, but it will be worth it. If there's a clear and up-to-date place where the schema is documented, the whole team will benefit from it.
Since you have the luxury of working with fellow developers that are in the same boat, I would suggest asking them what they feel would convey the needed information, most easily. My company has over 100 tables, and my boss gave me an ERD for a specific set tables that all connect. So also, you might want to try breaking 1 massive ERD into a bunch of smaller, manageable, ERDs.
Well, a picture tells a thousand words so I would recommend creating ER diagrams where you can view the relationship between tables at a glance, something that is hard to do with a text-only description.
You don't have to do the whole database in one diagram, break it up into sections. We use Visual Paradigm at work but EA is a good alternative as is ERWIN, and no doubt there are lots of others that are just as good.
If you have the patience, then using html to document the tables and columns makes your documentation easier to access.
If describing your databases to your end users is your primary goal Ooluk Data Dictionary Manager can prove useful. It is a web-based multi-user software that allows you to attach descriptions to tables and columns and allows full text searches on those descriptions. It also allows you to logically group tables using labels and browse tables using those labels. Tables as well as columns can be tagged to find similar data items across your database/databases.
The software allows you to import metadata information such as table name, column name, column data type, foreign keys into its internal repository using an API. Support for JDBC data sources comes built-in and can be extended further as the API source is distributed under ASL 2.0. It is coded to read the COMMENTS/REMARKS from many RDBMSs.You can always manually override the imported information. The information you can store about tables and columns can be extended using custom fields.
The Data Dictionary Manager uses the "data object" and "attribute" terminology instead of table and column because it isn't designed specifically for relational databases.
Notes
If describing technical aspects of your database such as triggers,
indexes, statistics is important this software isn't the best option.
It is however possible to combine a technical solution with this
software using hyperlink custom fields.
The software doesn't produce an ERD
Disclosure: I work at the company that develops this product.