How should I organize complex SQL views in Rails? - sql

I manage a research database with Ruby on Rails. The data that is entered is primarily used by scientists who prefer to have all the relevant information for a study in one single massive table for use in their statistics software of choice. I'm currently presenting it as CSV, as it's very straightforward to do and compatible with the tools people want to use.
I've written many views (the SQL kind, not the Rails HTML/ERB kind) to make the output they expect a reality. Some of these views are quite large and have a fair amount of complexity behind them. I wrote them in SQL because there are many calculations and comparisons that are more easily done with SQL. They're currently loaded into the database straight from a file named views.sql. To get the requested data, I do a select * from my_view;.
The views.sql file is getting quite large. Part of the problem is that we're still figuring out what the data we collect means, so there's a lot of changes being made to the views all the time -- and a ton of them are being created. Many of them need to be repeatable.
I've recently run into issues organizing and testing these views. Rails works great for user interface stuff and business logic, but I'm not aware of much existing structure for handling the reporting we require.
Some options I've thought of:
Should I move them into the most relevant models somehow? Several of the views interact with each other, which makes this situation more complex than just doing a single find_by_sql, so I don't know if they should only be part of the model.
Perhaps they should be treated as a "view" in the MVC sense? (That is, they could be moved into app/views/ and live alongside the HTML, perhaps as files named something like my_view.csv.sql which return CSV.)
How would you deal with a complex reporting problem like this?
UPDATE for Mladen Jablanović
It started by having a couple of views for reporting purposes. My boss(es) decided they wanted more, so I started writing more. Some give couple hundred columns of data, based on the requirements I've been given.
I have a couple thousand lines of views all shoved in a single file now. I don't like that situation, so I want to reorganize/refactor the code. I'd also like an easy way of providing CSVs -- I'm currently running queries and emailing them by hand, which could easily be automated. Finally, I would like to be able to write some tests on the output of the views, since a couple of regressions have already popped up.

I haven't worked much with SQL and views directly, so I can't help you there, but you can certainly build an ActiveRecord model on top of a view, very easily in fact. The book Enterprise Rails has a whole chapter on it (here it is at Google Books).

We are using views in our DB extensively and some of them are exposed as Rails models. You work with them as you would with tables, except for you can't update them of course.
Also, some of the columns may be calculated using other columns (different ratios for example) so we don't do it in the view, but in the model instead (ok, not entirely true, we construct SQL snippet and pass it to :select => '' portion of find call).
Presentation logic (such as date and number formatting) goes to Rails views.
I'm afraid I can't help you with more concrete advice, as the scope of the question is pretty wide.
EDIT:
Hundreds of columns doesn't sound reasonable. Sounds like immense amount of data in one place. How do they use it at all? We have web application where they can drill down and filter the results, narrow timespan and time step etc, so they never have more then 10-20 columns in the reports.
We store our views one view per SQL file. Also, you can combine it with a numerical prefix in order to ensure proper creation order (in case some of them depend on others). No migrations there, whole DB layer is app-agnostic.
For CSV, you can create either a set of scripts you can invoke either manually, or using cron, or you can use FasterCSV from your Rails app and generate CSVs by HTTP request.

Related

Database Type Agnostic Select Query Encapsulation class

I am upgrading a webapp that will be using two different database types. The existing database is a MySQL database, and is tightly integrated with the current systems, and a MongoDB database for the extended functionality. The new functionality will also be relying pretty heavily on the MySQL database for environmental variables such as information on the current user, content, etc.
Although I know I can just assemble the queries independently, it got me thinking of a way that might make the construction of queries much simpler (only for easier legibility while building, once it's finished, converting back to hard coded queries) that would entail an encapsulation object that would contain:
what data is being selected (including functionally derived data)
source (including joined data, I know that join's are not a good idea for non-relational db's, but it would be nice to have the facility just in case, which can be re-written into two queries later for performance times)
where and having conditions (stored as their own object types so they can be processed later, potentially including other select queries that can be interpreted by whatever db is using it)
orders
groupings
limits
This data can then be passed to an interface adapter that can build and execute the query, returning it in an array, or object or whatever is desired.
Although this sounds good, I have no idea if any code like this exists. If so, can anybody point it out to me, if not, are there any resources on similar projects undertaken that might allow me to continue the work and build a basic version?
I know this is a complicated library, but I have been working on this update for the last few days, and constantly switching back and forth has been getting me muddled at times and allowing for mistakes to occur
I would study things like the SQL grammar: http://www.h2database.com/html/grammar.html
Gives you an idea of how queries should be constructed.
You can study existing libraries around LINQ (C#): https://code.google.com/p/linqbridge/
Maybe even check out this link about FQL (Facebook's query language): https://code.google.com/p/mockfacebook/issues/list?q=label:fql
Like you already know, this is a hard problem. It will be a big challenge to make it run efficiently. Maybe consider moving all data from MySQL and Mongo to a third data store that has a copy of all the data and then running queries against that? Replicating all writes to something like Redis or Elastic Search and then write your queries there?
Either way, good luck!

Select * vs specific columns & loading object properties

I always thought SELECT * was bad and that you should always return only the columns you are going to use. One of the reasons for this is that the DB can return the result without hitting any tables if all the columns needed are in the index.
I have a factory class that loads the properties of a Product object. It loads all the properties everytime GetProduct is called etc.
Many of the pages won't be using all of the Product properties even though they will be loaded from the database because of the SELECT*.
Is there any design advice/guidelines on this?
The trade-off here is between squeaking out every last bit of potential performance versus code maintainability. There is no question that bringing back columns you won't use wastes some CPU cycles. The question becomes: how many? Then you have to consider what is more expensive, your wasted CPU cycles or your programmers' time for building and maintaining the code?
If you are working on a system with huge performance requirements then it may very well pay to optimize your ORM / factory code. On the other hand, if you're building a departmental line of business app and you've got scores or hundreds of ORM classes, maybe you are better off keeping it simple for the programmers (and the people who have to pay for them) and stop worrying about a few cycles. This becomes even more the case if you use a framework that scaffolds up most of your ORM code for you with code generation - like Entity Framework (or many others)...
If you are building your system without the use of any kind of code generating framework, and if your data access layer is pretty close to bare metal SQL then only bringing back what you need is good advice. If you are building an app that is going to be used by thousands or millions of people simultaneously, then by all means tune your SQL from the outset. If, on the other hand, you work in a shop that uses ORM frameworks and RAD or agile then writing dozens of SQLs is counter productive.
I'd definitely avoid SELECT *. Just retrieve the data you know you'll need. I'd prefer to write a dozen queries to the same table, where each one refers to just the few columns I need for a particular purpose, rather than write one query that retrieves all the columns and just use that everywhere.
Even if you know you need every column currently in a table, list each one explicitly. That way, if someone adds half a dozen more columns to the table in the future, all your old queries won't suddenly be retrieving more data than is needed.

Pros and cons of putting logic in SQL? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
At a new job, I've just been exposed to the concept of putting logic into SQL statements.
In MySQL, a dumb example would be like this:
SELECT
P.LastName, IF(P.LastName='Baldwin','Michael','Bruce') AS FirstName
FROM
University.PhilosophyProfessors P
// This is like a ternary operator; if the condition is true, it returns
// the first value; else the second value. So if a professor's last name
// is 'Baldwin', we will get their first name as "Michael"; otherwise, "Bruce"**
For a more realistic example, maybe you're deciding whether a salesperson qualifies for a bonus. You could grab various sales numbers and do some calculations in your SQL query, and return true / false as a column value called "qualifies."
Previously, I would have gotten all the sales data back from the query, then done the calculation in my application code.
To me, this seems better, because if necessary, I can walk through the application logic step-by-step with a debugger, but whatever the database is doing is a black box to me. But I'm a junior developer, so I don't know what's normal.
What are the pros and cons of having the database server do some of your calculations / logic?
**Code example based on Monty Python sketch.
This way SQL becomes part of your domain model. It's one more (and not necessarily obvious) place where domain knowledge is implemented. Such leaks result in tighter coupling between business logic / application code and database, what usually is a bad idea.
One exception is views, report queries etc. But these usually are so isolated that it's obvious what role they play.
One of the most persuasive reasons to push logic out to the database is to minimise traffic. In the example given, there is little gain, since you are fetching the same amount of data whether the logic is in the query or in your app.
If you want to fetch only users with a first name of Michael, then it makes more sense to implement the logic on the server. Actually, in this simple example, it doesn't make much difference, since you could specify users who's lastname is Baldwin. But consider a more interesting problem, whereby you give each user a "popularity" score based on how common their first and last names are, and you want to fetch the 10 most "popular" users. Calculating "popularity" in the app would mean that you have to fetch every single user before ranking, sorting and choosing them locally. Calculating it on the server means you can fetch just 10 rows across the wire.
There aren't a lot of absolute pros and cons to this argument, so the answer is 'it depends.' Some scenarios with different conditions that affect this decision might be:
Client-server app
One example of a place where it might be appropriate to do this is an older 4GL or rich client application where all database operations were done through stored procedure based update, insert, delete sprocs. In this case the gist of the architecture was to have the sprocs act as the main interface for the database and all business logic relating to particular entities lived in the one place.
This type of architecture is somewhat unfashionable these days but at one point it was considered to be the best way to do it. Many VB, Oracle Forms, Informix 4GL and other client-server apps of the era were done like this and it actually works fairly well.
It's not without its drawbacks, however - SQL is not particularly good at abstraction, so it's quite easy to wind up with fairly obtuse SQL code that presents a maintenance issue through being hard to understand and not as modular as one might like.
Is it still relevant today? Quite often a rich client is the right platform for an application and there's certainly plenty of new development going on with Winforms and Swing. We do have good open-source ORMs today where a 1995 vintage Oracle Forms app might not have had the option of using this type of technology. However, the decision to use an ORM is certainly not a black and white one - Fowler's Patterns of Enterprise Application Architecture does quite a good job of running through a range of data access strategies and discussing their relative merits.
Three tier app with rich object model
This type of app takes the opposite approach, and places all of the business logic in the middle tier model object layer with a relatively thin database layer (or perhaps an off-the-shelf mechanism like an ORM). In this case you are attempting to place all the application logic in the middle-tier. The data access layer has relatively little intelligence, except perhaps for a handful of stored procedured needed to get around limits of an ORM.
In this case, SQL based business logic is kept to a minimum as the main repository of application logic is the middle-tier.
Overhight batch processes
If you have to do a periodic run to pick out records that match some complex criteria and do something with them it may be appropriate to implement this as a stored procedure. For something that may have to go over a significant portion of a decent sized database a sproc based approch is probably going to be the only reasonably performant way to do this sort of thing.
In this case SQL may well be the appropriate way to do this, although traditional 3GLs (particularly COBOL) were designed specifically for this type of processing. In really high volume environments (particularly mainframes) doing this type of processing with flat or VSAM files outside a database may be the fastest way to do it. In addition, some jobs may be inherently record-oriented and procedural, or may be much more transparent and maintanable if implemented in this way.
To paraphrase Ed Post, 'you can write COBOL in any language' - although you might not want to. If you want to keep it in the database, use SQL, but it's certainly not the only game in town.
Reporting
The nature of reporting tools tends to dictate the means of encoding business logic. Most are designed to work with SQL based data sources so the nature of the tool forces the choice on you.
Other domains
Some applications like ETL processing may be a good fit for SQL. ETL tools start to get unwiedly if the transformation gets too complex, so you may want to go for a stored procedure based architecture. Mixing Queries and transformations across extraction, ETL processing and stored-proc based processing can lead to a transformation process that is hard to test and troubleshoot.
Where you have a significant portion of your logic in sprocs it may be better to put all of the logic in this as it gives you a relatively homogeneous and modular code base. In fact I have it on fairly good authority that around half of all data warehouse projects in the banking and insurance sectors are done this way as an explicit design decision - for precisely this reason.
Many times the answer to this type of question is going to depend a great deal on deployment approach. Where it makes the most sense to place your logic depends on what you'll need to be able to get access to when making changes.
In the case of web applications that aren't compiled, it can be easier to deal with changes to a page or file than it is to work with queries (depending on query complexity, programming backgrounds / expertise, etc). In these kinds of situations, logic in the scripting language is typically ok and make make it easier to revise later.
In the case of desktop applications that require more effort to modify, placing this kind of logic in the database where it can be adjusted without requiring a recompilation of the application may benefit you. If there was a decision made that people used to qualify for bonuses at 20k, but now must make 25k, it'd be much easier to adjust that on the SQL Server than to recompile your accounting application for all of your users, for example.
I'm a strong advocate of putting as much logic as possible directly into the database. That means incorporating it in views and stored procedures. I believe that most follows the DRY principle.
For example, consider a table with FirstName and LastName columns, and an application that frequently makes use of a FullName field. You have three choices:
Query first and last name and compute the full name in application code.
Query first, last, and (first || last) in your application's SQL whenever you query the table.
Define a view CustomerExt that includes the first and last columns, and a computed full name column and then query against that view, rather than the customer table.
I believe option 3 is clearly correct. Consider the addition of a MiddleInitial field to the table and the full name computation. Using option 3, you simply need to replace the view and every application across your company will instantly use the new format for FullName. The view still makes the base columns available for those instances in which you need to do some special formatting, but for the standard instance everything works "automatically".
That's a simple case, but the principle is the same for more complex situations. Perform application- or company-wide data logic directly in the database and you do not need to concern yourself with keeping different applications up to date.
The answer depends on your expertise and your familiarity with the technologies involved. Also, if you're a technical manager, it depends on your analysis of the skills of the people working on your team and whom you intend on hiring / keeping on staff to support, extend and maintain the application in future.
If you are not literate and proficient in the database , (as you are not) then stick with doing it in code. If otoh, you are literate and proficient in database coding (as you should be), then there is nothing wrong (and a lot right) abput doing it in the database.
Two other considerations that might influence your decision are whether the logic is of such a complex nature that doing it in database code would be inordinately more complex or more abstract than in code, and second, if the process involved requires data from outside the database (from some other source) In either of these scenarios I would consider moving the logic to a code module.
The fact that you can step through the code in your IDE more easily is really the only advantage to your post-processing solution. Doing the logic in the database server reduces the sizes of result sets, often drastically, which leads to less network traffic. It also allows the query optimizer to get a much better picture of what you really want done, again often allowing better performance.
Therefore I would nearly always recommend SQL logic. If you treat a database as a mere dumb store, it will return the favor by behaving dumb, and depending on the situation, that can absolutely kill your performance - if not today, possibly next year when things have taken off...
That particular first example is a bad idea. Per-row functions do not scale well as the table gets bigger. In fact, a (likely) better way to do it would be to index LastName and use something like:
SELECT P.LastName, 'Michael' AS FirstName
FROM University.PhilosophyProfessors P
WHERE P.LastName = 'Baldwin'
UNION ALL SELECT P.LastName, 'Bruce' AS FirstName
FROM University.PhilosophyProfessors P
WHERE P.LastName <> 'Baldwin'
On databases where data are read more often than written (and that's most of them), these sorts of calculations should be done at write time such as using an insert/update trigger to populate a real FirstName field.
Databases should be used for storing and retrieving data, not doing massive non-databasey calculations that will slow down everything.
One big pro: a query may be all you can work with. Reports have been mentioned: many reporting tools or reporting plugins to existing programs only allow users to make their own queries (the results of which they will display).
If you cannot alter the code (because it isn't yours), you may yet be able to alter a query. And in some cases (data migration), you'll be writing queries to do migration as well.
I like to distinguish data vs business rules, and push the data rules into the stored procs as much as possible. There is not always a hard and fast distinction between the two, but in your example of calculating sales bonuses, the formula itself might be a business rule but the work of gathering and aggregating the various figures used in the formula is a data rule.
Sometimes, though, it depends on the deployment model and change control procedures. If the sales formula changes frequently and deployment of the business layer code is cumbersome, then tweaking just one function/stored proc in the database would be a great solution.
I'm a big fan of elegant database queries because the code is closer to the data and SQL works very well. But such queries, whether they're text in you app, generated by an OR mapper or stored in the database are harder to test, especially in the cloud, because you need a database to run against.
Database is exactly what it's called. DATABASE.
You should not mix the business logic with data layer.
Keep it separate as any close coupling between data and business makes impossible to follow best standards in programming.
I was working recently on a project where all logic was in MS SQL. Horrible idea, that back-fired after few years (energy company), no easy way to scale-out, no easy way to follow up CI/CD, Agile or code repos. Very difficult to co-work, very slow and very inefficient.
Company basically was reaching hardware limits in order to make it work (they've spent £100k on SSD SAN), while you could reach the same performance with C# for business and keep the database for data, with perhaps 3-4 cheap servers, that could easily scale-out.
Horrible, horrible idea. Guess what ? Company went under, as one time SQL server has reached it's potential (sometimes some queries were running for hours (very well written, but SQL is not for business logic. End of story)) when one time failed to bill all DD customers and basically didn't took the monthly payment that they needed to survive till next month (millions of pounds).

Database System Architecture discussion

I'd like to start a discussion about the implementation of a database system.
I'm working for a company having a database system grown over ca. the last 10 years.
Let me try to describe what it's doing and how it's implemented:
The system is divided into 3 main parts handled by 3 different teams.
Entry:
The Entry Team is responsible for creating GUIs for the system. In the background is a huge MS SQL database (ca. 100 tables) and the GUI is created using .NET. There are different GUI applications and each application has lots of different tabs to fill in the corresponding tables. If e.g. a new column is added to the database, this column is added manually to the GUI application.
Dataflow:
The purpose of the Dataflow Team is to do do data calculations and prepare the data for the reporting team. This is done via multiple levels. Let me try to explain the process a little bit more in detail: The Dataflow Team uses the data from the Entry database copied to another server and another database via Transactional-Replication (this data contains information from all clients). Then once per hour a self-written application is checking for changed rows in the input tables (using a ChangedDate column) and then calling a stored procedure for each output table calculating new data using 1-N of the input tables. After that the data is copied to another database on another server using again Transaction-Replication. Here another stored procedure is called to calclulate additional new output tables. This stored procedure is started using a SQL job. From there the data is split to different databases, each database being client specific. This copying is done using another self-written application using the .NET bulkcopy command (filtering on the client). These client specific databases are copied to different client specific reporting databases on other servers via another self-written application which compares the reporting database with the client specific database to calculate the data difference. Just the data differences are copied (because the reporting database run in former times on the client servers).
This whole process is orchestrated by another self-written application to control e.g. if the Transactional-Replications are finished before starting the job to call the Stored procedure etc... Futhermore also the synchronisation between the different clients is orchestrated here. The process can be graphically displayed by a self-written monitoring tool which looks pretty complex as you can imagine...
The status of all this components is logged and can be viewed by another self-written application.
If new columns or tables are added all this components have to be manually changed.
For deployment installation instructions are written using MS Word. (ca. 10 people working in this team)
Reporting:
The Reporting Team created it's own platform written in .NET to allow the client to create custom reports via a GUI. The reports are accessible via the Web.
The biggest tables have around 1 million rows. So, I hope I didn't forget anything important.
Well, what I want to discuss is how other people realize this scenario, I can't imagine that every company writes it's own custom applications.
What are actually the possibilities to allow fast calculations on databases (next to using T-SQL). I'm somehow missing the link here to the object oriented programming I'm used to from my old company, but we never dealt with so much data and maybe for fast calculations this is the way to do it...Or is it possible using e.g. LINQ or BizTalk Server to create the algorithms and calculations, maybe even in a graphical way? The question is just how to convert the existing meter-long Stored procedures into the new format...
In future we want to use data warehousing, but that will take a while, so maybe it's possible to have a separate step to streamline the process.
Any comments are appreciated.
Thanks
Daniel
Why on earth would you want to convert existing working complex stored procs (which can be performance tuned) to LINQ (or am I misunderstanding you)? Because you personally don't like t-sql? Not a good enough reason. Are they too slow? Then they can be tuned (which is something you really don't want to try to do in LINQ). It is possible the process can be made better using SSIS, but as complex as SSIS is and the amount of time a rewrite of the process would take, I'm not sure you really would gain anything by doing so.
"I'm somehow missing the link here to the object oriented programming..." Relational databases are NOT Object-oriented and cannot perform well if you try to treat them like they are. Learn to think in terms of sets not objects when accessing databases. You are coming from the mindset of one user at a time inserting one record at a time, but this is not the mindset neeeded to deal with the transfer of large amounts of data. For these types of things, using the database to handle the problem is better than doing things in an object-oriented manner. Once you have a large amount of data and lots of reporting, people are far more interested in performance than you may have been used to in the past when you used some tools that might not be so good for performance. Whether you like T-SQL or not, it is SQL Server's native language and the database is optimized for it's use.
The best advice, having been here before, is to start by learning first how SQL works, and doing it in the context of the existing architecture sounds like a good way to start (since nothing you've described sounds irrational on the face of it.)
Whatever abstractions you try to lay on top (LINQ, Biztalk, whatever) all eventually resolve to pure SQL. And almost always they add overhead and complexity.
Your OO paradigms aren't transferable. Any suggestions about abstractions will need to be firmly defensible based on your firm grasp of the SQL consequences.
It will take a while, but it's all worth knowing, both professionally and personally.
I'm currently re-engineering a complex system which is moving from Focus (a database and language) to a data warehouse (separate team) and processing (my team) and reporting (separate team).
The current process is combined - data is loaded and managed in the Focus language and Focus database(s) and then reported (and historical data is retained)
In the new process, the DW is loaded and then our process begins. Our processes are completely coded in SQL, and a million row fact table (for one month) would be relatively small. We have some feeds where the monthly data is 25 million rows. There are some statistics tables produced which are over 200 million rows (a month). The processing can take several hours a month, end to end. We use tables to store intermediate results, and we ensure indexing strategies are suitable for the processing. Except for one piece implemented as an SSIS flow from the database back to itself because of extremely poor scalar UDF performance, the entire system is implemented as a series of T-SQl SPs.
We also have a process monitoring system similar to what you are discussing as well as having the dependencies in a table which ensures that each process runs only if all its prerequisites are satisfied. I've recently grafted on the MSAGL to graphically display and interact with the process (previously I was using graphviz to generate static images) from a .NET Windows application. The new system thus has much clearer dependency information as well as good information about process performance so effort can be concentrated on the slowest performing bottlenecks.
I would not plan on doing any re-engineering of any complex system without a clear strategy, a good inventory of the existing system and a large budget for time and money.
From the sounds of what you are saying, you have a three step process.
Input data
Analyze data
Report data
Steps one and three need to be completed by "users". Therefore, a GUI is needed for each respective team to do the task at hand, otherwise, they would be directly working on SQL Server, and would require extensive SQL knowledge. For these items, I do not see any issue with the approach your organization is taking, you are building a customized system to report on the data at hand. The only item that might be worth considering on these side, is standardization between the teams on common libraries and the technologies used.
Your middle step does seem to be a bit lengthy, with many moving parts. However, I've worked on a number of large reporting systems where that is truly the only way to get around it. WIthout knowing more of your organization and the exact nature of operations.
By "fast calculations" you must mean "fast retrieval" Data warehouses (both relational and otherwise) are fast with math because the answers are pre-calculated in advance. SQL, unless you are using CLR stored procedures, is usually a rather slow when it comes to math.
You'd be hard pressed to defeat the performance of BCP and SQL with anything else. If the update routines are long and bloated because they loop through the tables, then sure I can see why you'd want to go to .NET. But you'd probably increase performance by figuring out how to rewrite them all nice and SET based. BCP is not going to be able to be beaten. When I used SQL Server 2000 BCP was often faster than DTS. And SSIS in general (due to all the data type checking) seems to be way slower than DTS. If you kill performance no doubt people are going to be coming to you. Still if you are doing a ton of row by row complex calculations, optimizing that into a CLR stored procedure or even a .NET application that is called from SQL Server to do the processing will probably result in a speed up. Of course if you were row processing and you manage to rewrite the queries to do set processing you'd probably get a bigger speed up. But depending upon how complex the calculations are .NET may help.
Now if a front end change could immediately update and propagate the data, then you might want to change things to .NET so that as soon as a row is changed it can be recalculated and update all the clients. However if a lot of rows are changed or the database is just ginormous then you will kill performance. If the operation needs to be done in bulk then probably the way it is currently being done is the best.
The only thing I might as is that maybe there is a lot of duplicate SQL that looks exactly the same except for a table name and or the column names. If so, you can probably use .NET combined with SQL-SMO(or DMO if using SQL Server 2000) to code generate it.
Here's an example that I often see to load a datawarehouse
Assuming some row tables are loaded with the data from the source
select changed rows from source into temporary tables
see if any columns that matter were changed
if so terminate existing row (or clone it into some history table)
insert/update new row
I often see one of those queries per table and the only variations are the table/column names and maybe references to the key column. You can easily get the column definitions and key definitions out of SQL Server and then make a .NET program to create the INSERT/SELECT/ETC. In the worst case you may just have to store some type of table with TABLE_NAME, COLUMN_NAME for the columns that matter. Then instead of having to wrap your head around a complex ETL process and 20 or 200 update queries, you just need to wrap your head around UPDATE and one query. Any changes to the way things are done can be done once and applied to all the queries.
In particular my guess is that you can apply this technique to the individual client databases if you haven't already. Probably all the queries/bulk copy scripts are the same or almost the same with the exception of database/server name. So you can just autogenerate them based on a CLIENTs table or something.....

SQL Server Views, blessing or curse? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I once worked with an architect who banned the use of SQL views. His main reason was that views made it too easy for a thoughtless coder to needlessly involve joined tables which, if that coder tried harder, could be avoided altogether. Implicitly he was encouraging code reuse via copy-and-paste instead of encapsulation in views.
The database had nearly 600 tables and was highly normalised, so most of the useful SQL was necessarily verbose.
Several years later I can see at least one bad outcome from the ban - we have many hundreds of dense, lengthy stored procs that verge on unmaintainable.
In hindsight I would say it was a bad decision, but what are your experiences with SQL views? Have you found them bad for performance? Any other thoughts on when they are or are not appropriate?
There are some very good uses for views; I have used them a lot for tuning and for exposing less normalized sets of information, or for UNION-ing results from multiple selects into a single result set.
Obviously any programming tool can be used incorrectly, but I can't think of any times in my experience where a poorly tuned view has caused any kind of drawbacks from a performance standpoint, and the value they can provide by providing explicitly tuned selects and avoiding duplication of complex SQL code can be significant.
Incidentally, I have never been a fan of architectural "rules" that are based on keeping developers from hurting themselves. These rules often have unintended side-effects -- the last place I worked didn't allow using NULLs in the database, because developers might forget to check for null. This ended up forcing us to work around "1/1/1900" dates and integers defaulted to "0" in all the software built against the databases, and introducing a litany of bugs caused by devs working around places where NULL was the appropriate value.
You've answered your own question:
he was encouraging code reuse via copy-and-paste
Reuse the code by creating a view. If the view performs poorly, it will be much easier to track down than if you have the same poorly performing code in several places.
Not a big fan of views (Can't remember the last time I wrote one) but wouldn't ban them entirely either. If your database allows you to put indexes on the views and not just on the table, you can often improve performance a good bit which makes them better. If you are using views, make sure to look into indexing them.
I really only see the need for views for partitioning data and for extremely complex joins that are really critical to the application (thinking of financial reports here where starting from the same dataset for everything might be critical). I do know some reporting tools seem to prefer views over stored procs.
I am a big proponent of never returning more records or fields than you need in a specific instance and the overuse of views tends to make people return more fields (and in way too many cases, too many joins) than they need which wastes system resources.
I also tend to see that people who rely on views (not the developer of the view - the people who only use them) often don't understand the database very well (so they would get the joins wrong if not using the view) and that to me is critical to writing good code against the database. I want people to understand what they are asking the db to do, not rely on some magic black box of a view. That is all personal opinion of course, your mileage may vary.
Like BlaM I personally haven't found them easier to maintain than stored procs.
Edited in Oct 2010 to add:
Since I orginally wrote this, I have had occasion to work with a couple of databases designed by people who were addicted to using views. Even worse they used views that called views that called views (to the point where eventually we hit the limit of the number of tables that can be called). This was a performance nightmare. It took 8 minutes to get a simple count(*) of the records in one view and much longer to get data. If you use views, be very wary of using views that call other views. You will be building a system that will very probably not work under the normal performance load on production. In SQL Server you can only index views that do not call other views, so what ends up happening when you use views in a chain, is that the entire record set has to be built for each view and it is not until you get to the last one that the where clause criteria are applied. You may need to generate millions of records just to see three. You may join to the same table 6 times when you really only need to join to it once, you may return many many more columns than you need in the final results set.
My current database was completely awash with countless small tables of no more than 5 rows each. Well, I could count them but it was cluttered. These tables simply held constant type values (think enum) and could very easily be combined into one table. I then made views that simulated each of the tables I deleted to ensure backward compactability. Worked great.
One thing that hasn't been mentioned thus far is use of views to provide a logical picture of the data to end users for ad hoc reporting or similar.
This has two merits:
To allow the user to single "tables" containing the data they expect rather requiring relatively non technical users to work out potentially complex joins (because the database is normalised)
It provides a means to allow some degree of ah hoc access without exposing the data or the structure to the end users.
Even with non ad-hoc reporting its sometimes signicantly easier to provide a view to the reporting system that contains the relveant data, neatly separating production of data from presentation of same.
Like all power, views have its own dark side. However, you cannot blame views for somebody writing bad performing code. Moreover views can limit the exposure of some columns and provide extra security.
Views are good for ad-hoc queries, the kind that a DBA does behind the scenes when he/she needs quick access to data to see what's going on with the system.
But they can be bad for production code. Part of the reason is that it's sort of unpredictable what indexes you will need with a view, since the where clause can be different, and therefore hard to tune. Also, you are generally returning a lot more data than is actually necesary for the individual queries that are using the view. Each of these queries could be tightened up and tuned individually.
There are specific uses of views in cases of data partitioning that can be extremely useful, so I'm not saying they should avoided altogether. I'm just saying that if a view can be replaced by a few stored procedures, you will be better off without the view.
We use views for all of our simple data exports to csv files. This simplifies the process of writing a package and embedding the sql within the package which becomes cumbersome and hard to debug against.
Using views, we can execute a view and see exactly what was exported, no cruft or unknowns. It greatly helps in troubleshooting problems with improper data exports and hides any complex joins behind the view. Granted, we use a very old legacy system from a TERMS based system that exports to sql, so the joins are a little more complex than usual.
Some time ago I've tried to maintain code that used views built from views built from views... That was a pain in the a**, so I got a little allergic to views :)
I usually prefer working with tables directly, especially for web applications where speed is a main concern. When accessing tables directly you have the chance to tweak your SQL-Queries to achieve the best performance. "Precompiled"/cached working plans might be one advantage of views, but in many cases just-in-time compilation with all given parameters and where clauses in consideration will result in faster processing over all.
However that does not rule out views totally, if used adequately. For example you can use a view with the "users" table joined with the "users_status" table to get an textual explanation for each status - if you need it. However if you don't need the explanation: use the "users" table, not the view. As always: Use your brain!
Views have been helpful to us in their role for use by public web based applications that dip from a production database. Simplified security is the primary advantage we see since the table design in the database may combine sensitive and non-sensitive data within the same table. A stored procedure shares much of this advantage, but the view is read-only, has potential interop advantages, and is a less complex thing for junior people to implement.
This security abstraction advantage also applies when views are used for end-user ad-hoc queries; this would be less of an advantage if we had a proper, flattened, data warehouse representation of our data.
From an application stand point which uses an ORM, it's a lot harder to execute a custom query than doing a select on a discretely mapped type (eg, the view).
For example, if you need just 5 fields of a table that has many (say 30 or 40) an ORM framework will create an entity to represent the table.
That means that even though you only need a few properties of the entity, the select query generated by the ORM framework will bring the entire entity in its full glory. A view on the other hand, although also mapped to an entity with the ORM framework, will only bring the data you need.
Second, since ORM frameworks map entities to tables, relationships between entities are generated (and hydrated) on the client side, meaning that the query has to execute and return to the app before linking of those entities can happen at runtime within the app.
Some frameworks bypass that by returning the data from multiple linked entities in a giant select (with multiple joins), bringing in the columns of all related tables in one call. Internally the framework disassembles the giant result set and structures the logical presentation of the linked entities before returning those entities to the caller app.
Point being is that views are a life saver for apps using ORM. The alternative is to manually make db calls, and manually passing the resulting recordsets into usable entities/models.
While this approach is good and definitely produces a result, it has lots of negative facets. Manual code... is manual; hard to maintain, cumbersome in implementation, and causes devs to worry more about the specifics of the DB provider API vs the logical domain model. Not to mention that it increases time to production (its a lot more labourious) costs for development, maintenance, surface area of bugs, etc.
So for anyone saying views are bad, please consider the other side of things; The stuff the high and mighty DBA's most often have no clue about.
Let's see if I can come up with a lame analogy ...
"I don't need a phillips screwdriver. I carry a flat head and a grinder!"
Dismissing views out of hand will cause pain long term. For one, it's easier to debug and modify a single view definition than it is to ship modified code.
Views can also reduce the size of complex queries (in the same way stored procs can).
This can reduce network bandwith for very busy databases.