What are your best practices for ensuring the correctness of the reports from SQL? - sql

Part of my work involves creating reports and data from SQL Server to be used as information for decision. The majority of the data is aggregated, like inventory, sales and costs totals from departments, and other dimensions.
When I am creating the reports, and more specifically, I am developing the SELECTs to extract the aggregated data from the OLTP database, I worry about mistaking a JOIN or a GROUP BY, for example, returning incorrect results.
I try to use some "best practices" to prevent me for "generating" wrong numbers:
When creating an aggregated data set, always explode this data set without the aggregation and look for any obvious error.
Export the exploded data set to Excel and compare the SUM(), AVG(), etc, from SQL Server and Excel.
Involve the people who would use the information and ask for some validation (ask people to help to identify mistakes on the numbers).
Never deploy those things in the afternoon - when possible, try to take a look at the T-SQL on the next morning with a refreshed mind. I had many bugs corrected using this simple procedure.
Even with those procedures, I always worry about the numbers.
What are your best practices for ensuring the correctness of the reports?

have you considered filling your tables with test data that produces known results and compare your query results with your expected results.

Signed, in writing
I've found that one of the best practices is that both the reader/client and the developers are on the same (documented) page. That way, when mysterious numbers appear (and they do), I can point to the specification in writing and say, "This is why you see this number. Would you like it to be different?".
Test, test, test
For seriously complicated reports, we went through test data up and down with the client, until all the numbers were correct, and client was were satisfied.
Edge Cases
We discovered a seriously complicated case in our reporting system that turned everything upside down (on our end). What if the user generates a report (say Year-End 2009) , enters data for the new year, and then comes back to generate the same report? The data has changed but that report should not. Thinking and working these cases out can save a lot of heartache.

Write some automated tests.
We have quite a lot of reporting services reports - we test them using Selenium. We use a test data page to squirt some known data into an empty database, then run the report and assert that the numbers are as expected.
The builds run every time we check in, so we know we haven't done anything too stupid

Related

Lower the number of Append Queries in Access 2000?

I have to make a report with 168 rows. Most of them are sequential data, but there are summation rows for which I need to build helper tables.
Therefore I need to build like 45-50 queries, most of them Append Queries.
Is there a way to minimize the number of queries and develop a large report with 168 rows?
Should I use code?
Just this last year I created a complicated, multi-part and multi-page report with graphs, summations, running averages, trends, "pivot-tables", etc. I did not count how many "rows" of data, but here are some things I did to manage the many queries:
Most important lesson learned: After much optimization and attempts to consolidate and reuse queries and temporary tables, it still turns out that there is no set of "magic few" queries that will return the data you need. Even if you reduce the number of SQL queries from 45 to 35 (which would be impressive in many cases), there are still many queries that you need to manage in an intelligent way. The point is to worry more about writing manageable queries and good infrastructure, rather than making the focus on reducing the number. (If your process is similar, you'll inevitably have to add more queries and more details later anyway.)
Union queries indeed have their place and are sometimes necessary, but simply combining queries to "reduce the number" can have negative consequences. 1) Union queries cannot be built or visualized using the Design View. I consider myself a "real coder", but I still appreciate the ability to use UI components when I can. Design View offers various useful syntax and datatype checks. 2) It is often useful in debugging and optimization to be able to run queries individually. 3) Unions do not improve efficiency and might actually slow down queries when duplicate removal and sorting are not necessary. 4) I have experienced certain perfectly correct queries that result in errors when combined in Unions. I haven't learned how to predict this behavior so it's almost not worth mentioning... except to not be fooled into thinking that the individual queries are somehow flawed. (There are usually workarounds.)
Create all related report queries and temporary tables in a separate Access database and link to the main database. In other words, create a separate reporting front-end if possible. Not only can this keep the source database cleaner, it can make it more efficient (highly dependent on number of users and how they're sharing the database).
Name queries using a consistent pattern. I tried using numbered queries with some success. I personally find that descriptive names are more useful than short, cryptic names. Much cut and pasting becomes necessary however.
VBA code or macros can be better than individual saved queries.
I rarely use complicated macros, so most of these tips are relevant to VBA code, but I won't argue against macros because they offer at least some similar benefits. It's also possible without much work to create a useful "dashboard" form that makes VBA code click-n-run similar to macros.
Comments can be included adjacent to the SQL. This can be invaluable in outlining ugly SQL. For example, it can be worth explaining why you choose a LEFT JOIN with extra WHERE criteria instead of an INNER JOIN, especially to prevent "helpful" coworkers (or yourself) from rewriting a query just to find they failed to consider all the original contexts.
An entire sequence of query texts and execution can be traced and debugged live. If error handling is coded appropriately, you can create custom logs and specialized handling of errors. SQL text can be edited and rerun without stopping the entire process.
Query parameters can be passed to queries without awkward UI prompts. Parameters can be used to code proper queries with user input (i.e. avoid SQL injection) and can reduce number of similar queries that simply have different input or criteria but are otherwise identical.
Multiple queries can be wrapped in a transactions and all committed or rolled back together! Not sure whether macros support this.
You can either move the SQL to VBA, to a macro, or if they're all appending to one table, make a large union subquery. All will reach that goal. For usability, I often go for the macro, since it's click to run. Just SetWarnings first, and then chain RunSQL statements.
The UNION query is also elegant solution if applicable.

Managing very large SQL queries

I'm looking for some ideas managing very large SQL queries in Oracle.
My employer is looking to build very wide reports ( 150 - 200 ) columns of data per report.
Each item is a sub-query or an element from a view. The data has to be real time, so DW style batch processing is not an option. We also don't use any BI tools , just a java app that generates Excel ( its a requirement to output data in Excel)
The query also contains unions as feeds from other systems.
The queries result in very large SQL ( about 1500 lines) that is very difficult to manage.
What strategies can I employ to make the work more manageable?
It is also not a performance problem. I was able to optimize the query to be very efficient , its mostly width of the query , managing 200 columns is a challenge in itself.
I deal with queries this length daily and here is some of what helps me out in manitaining them:
First alias every single one of the those columns. When you are building it you may know where each one came from but when it is time to make a change, it is really helpful to know exactly where each column came from. This applies to join conditions, group by and where conditions as well as the select columns.
Organize in easily understandable and testable chunks. I use temp tables to pull things that make sense together and so I can see the results before the final query while in test mode.
This brings me to test mode. If I have chunks of data, I design the proc with a test mode and then query individual temp tables when in test mode, so I can see where the data went wrong if there is a bug. Not sure how Oracle works but in SQL Server, I make this the last parameter and give it a default value, so that it doesn't need to be passed in by the application.
Consider logging the execution details and the values of passed in parameters and certainly log any error messages. This will help tremendously when you have to troubleshoot why this report that has functioned perfectly for six years doesn't work for this one user.
Put columns on a separate line for each one and do the same for where clauses. At times you may have to troublshoot by commenting out joins until you find the one that is causing the problem. It is easier if you can easily comment out the associated fields as well.
If you don't have a technical design document, then at least use comments to explain your thought process. You want to understand the whys not the hows in any comments. This stuff is hard to come back to later and understand even when you wrote it. Give your future self some help.
In developing from scratch, I put the select list in and then comment all but the first item. Then I build the query only until I get that value - testing until I am sure what I got was correct. Then I add the next one and whatever joins or where conditions I might need to get it. Test again making sure it is right. (Oops why did that go from 1000 records to 20000 when I added that? Hmm maybe there is something I need to handle there or is that right?) By adding only one thing at a time, you will find an error in the logic much faster and be much more confident of your results. It will also take you less time than trying to build a massive query in one go.
Finally, there is no substitute for understanding your data. There are plently of complex queries that work but do not give the correct answer. Know if you need an inner join or a left join. Know what where conditions you need to get the records you want. Know how to handle the records when you have a one-to-many relationship (this may require push back on the requirements); should you have 3 lines (one for each child record), or should you put that data in a comma delimited list or should you pick only one of the many records and have one line using aggregation. If the latter, what is the criteria for choosing the record you want to keep?
Without seeing the specifics of your problem, here are a couple of ideas that immediately come to mind:
If you are looking purely for management, I might suggest organizing your subqueries as a number of views and then referencing those views in your final query.
For performance on the other hand you may want to consider creating temp tables or even materialized views (which are fixed views) to break up the heavier parts of your process.
If your queries require an enormous amount of subquerying in order to gain usable data, you might need to rethink your database design and possibly create a number of datamarts to easily access reporting data. Think of these as mini-warehouses sans the multi-year trended data.
Finally, I know you said you don't use any BI tools but this problem certainly seems like one that might make sense by organizing your data into "cubes" or Business Object "universes". It might be worthwhile to at least entertain the cost of bringing on a BI tool vs. the programming hours to support the current setup.

For really complex reports, do people sometimes code in their language rather than in sql?

I have some pretty complex reports to write. Some of them... I'm not sure how I could write an sql query for just one of the values, let alone stuff them in a single query.
Is it common to just pull a crap load of data and figure it all via code instead? Or should I try and find a way to make all the reports rely on sql?
I have a very rich domain model. In fact, parts of code can be expanded on to calculate exactly what they want. The actual logic is not all that difficult to write - and it's nicer to work my domain model than with SQL. With SQL, writing the business logic, refactoring it, testing it and putting it version control is a royal pain because it's separate from your actual code.
For example, one the statistics they want is the % of how much they improved, especially in relation to other people in the same class, the same school, and compared to other schools. This requires some pretty detailed analysis of how they performed in the past to their latest information, as well as doing a calculation for the groups you are comparing against as a whole. I can't even imagine what the sql query would even look like.
The thing is, this % improvement is not a column in the database - it involves a big calculation in of itself by analyzing all the live data in real-time. There is no way to cache this data in a column as doing this calculation for every row it's needed every time the student does something is CRAZY.
I'm a little afraid about pulling out hundreds upon hundreds of records to get these numbers though. I may have to pull out that many just to figure out 1 value for 1 user... and if they want a report for all the users on a single screen, it's going to basically take analyzing the entire database. And that's just 1 column of values of many columns that they want on the report!
Basically, the report they want is a massive performance hog no matter what method I choose to write it.
Anyway, I'd like to ask you what kind of solutions you've used to these kind of a problems.
Sometimes a report can be generated by a single query. Sometimes some procedural code has to be written. And sometimes, even though a single query CAN be used, it's much better/faster/clearer to write a bit of procedural code.
Case in point - another developer at work wrote a report that used a single query. That query was amazing - turned a table sideways, did some amazing summation stuff - and may well have piped the output through hyperspace - truly a work of art. I couldn't have even conceived of doing something like that and learned a lot just from readying through it. It's only problem was that it took 45 minutes to run and brought the system to its knees in the process. I loved that query...but in the end...I admit it - I killed it. ((sob!)) I dismembered it with a chainsaw while humming "Highway To Hell"! I...I wrote a little procedural code to cover my tracks and...nobody noticed. I'd like to say I was sorry, but...in the end the job ran in 30 seconds. Oh, sure, it's easy enough to say "But performance matters, y'know"...but...I loved that query... ((sniffle...)) Anybody seen my chainsaw..? >;->
The point of the above is "Make Things As Simple As You Can, But No Simpler". If you find yourself with a query that covers three pages (I loved that query, but...) maybe it's trying to tell you something. A much simpler query and some procedural code may take up about the same space, page-wise, but could possibly be much easier to understand and maintain.
Share and enjoy.
Sounds like a challenging task you have ahead of you. I don't know all the details, but I think I would go at it from several directions:
Prioritize: You should try to negotiate with the "customer" and prioritize functionality. Chances are not everything is equally useful for them.
Manage expectations: If they have unrealistic expectations then tell them so in a nice way.
IMHO SQL is good in many respects, but it's not a brilliant programming language. So I'd rather just do calculations in the application rather than in the database.
I think I'd go for some delay in the system .. perhaps by caching calculated results for some minutes before recalculating. This is with a mind towards performance.
The short answer: for analysing large quantities of data, a SQL database is probably the best tool around.
However, that does not mean you should analyse this straight off your production database. I suggest you look into Datawarehousing.
For a one-off report, I'll write the code to produce it in whatever I can best reason about it in.
For a report that'll be generated more than once, I'll check on who is going to be producing it the next time. I'll still write the code in whatever I can best reason about it in, but I might add something to make it more attractive to use to that other person.
People usually use a third party report writing system rather than writing SQL. As an application developer, if you're spending a lot of time writing complex reports, I would severely question your manager's actions in NOT buying an off-the-shelf solution and letting less-skilled people build their own reports using some GUI.

Finding unused columns

I'm working with a legacy database which due to poor management and design has had a wildgrowth of columns which never have been or are no longer beeing used.
Is it possible to some how query for column usage? As in how often a column is beeing selected (either specifically or with *, or joined on)?
Seems to me like this is something we should be able to somehow retrieve but i have been unable to find anything like this.
Greetings,
F.B. ten Kate
Unfortunately, this analysis on the DB side isn't really going to be a full answer. I've seen a LOT of instances where application code only needed 3 columns of a 10+ column table, but selected them all anyway.
Your column would still show up on a usage report in any sort of trace or profiling you did, but it still may not ACTUALLY be in use.
You might have to either a) analyze the entire collection of apps that use this website or b) start drafting the a return-on-investment style doc on whether it's worth rebuilding.
This article will give you a good idea of how to search all fixed code (prodedures, views, functions and triggers) for the columns that are used. The code in the article searches for a specific table/column combination. You could easily adapt it to run for all columns. For anything dynamically executed, you'd probably have to set up a profiler trace.
Even if you could determine whether a column had been used in the past X period of time, would that be good enough? There may be some obscure program out there that populates a column once a week, a month, a year; or once every time they click the mystery button that no one ever clicks, or to log the report that only Fred in accounting ever runs (he quit two years ago), or that gets logged to if that one rare bug happens (during daylight savings time, perhaps?)
My point is, the only way you can truly be certain that a column is absolutely not used by anything is to review everything -- every call, every line of code, every ad hoc Excel data dump, every possible contingency -- everything that references the database . As this may be all but unachievable, try to get a formally defined group of programs and procedures that must be supported, bend over backwards to make sure they are supported, and be prepared to fix things when some overlooked or forgotten piece of functionality turns up.

Database System Architecture discussion

I'd like to start a discussion about the implementation of a database system.
I'm working for a company having a database system grown over ca. the last 10 years.
Let me try to describe what it's doing and how it's implemented:
The system is divided into 3 main parts handled by 3 different teams.
Entry:
The Entry Team is responsible for creating GUIs for the system. In the background is a huge MS SQL database (ca. 100 tables) and the GUI is created using .NET. There are different GUI applications and each application has lots of different tabs to fill in the corresponding tables. If e.g. a new column is added to the database, this column is added manually to the GUI application.
Dataflow:
The purpose of the Dataflow Team is to do do data calculations and prepare the data for the reporting team. This is done via multiple levels. Let me try to explain the process a little bit more in detail: The Dataflow Team uses the data from the Entry database copied to another server and another database via Transactional-Replication (this data contains information from all clients). Then once per hour a self-written application is checking for changed rows in the input tables (using a ChangedDate column) and then calling a stored procedure for each output table calculating new data using 1-N of the input tables. After that the data is copied to another database on another server using again Transaction-Replication. Here another stored procedure is called to calclulate additional new output tables. This stored procedure is started using a SQL job. From there the data is split to different databases, each database being client specific. This copying is done using another self-written application using the .NET bulkcopy command (filtering on the client). These client specific databases are copied to different client specific reporting databases on other servers via another self-written application which compares the reporting database with the client specific database to calculate the data difference. Just the data differences are copied (because the reporting database run in former times on the client servers).
This whole process is orchestrated by another self-written application to control e.g. if the Transactional-Replications are finished before starting the job to call the Stored procedure etc... Futhermore also the synchronisation between the different clients is orchestrated here. The process can be graphically displayed by a self-written monitoring tool which looks pretty complex as you can imagine...
The status of all this components is logged and can be viewed by another self-written application.
If new columns or tables are added all this components have to be manually changed.
For deployment installation instructions are written using MS Word. (ca. 10 people working in this team)
Reporting:
The Reporting Team created it's own platform written in .NET to allow the client to create custom reports via a GUI. The reports are accessible via the Web.
The biggest tables have around 1 million rows. So, I hope I didn't forget anything important.
Well, what I want to discuss is how other people realize this scenario, I can't imagine that every company writes it's own custom applications.
What are actually the possibilities to allow fast calculations on databases (next to using T-SQL). I'm somehow missing the link here to the object oriented programming I'm used to from my old company, but we never dealt with so much data and maybe for fast calculations this is the way to do it...Or is it possible using e.g. LINQ or BizTalk Server to create the algorithms and calculations, maybe even in a graphical way? The question is just how to convert the existing meter-long Stored procedures into the new format...
In future we want to use data warehousing, but that will take a while, so maybe it's possible to have a separate step to streamline the process.
Any comments are appreciated.
Thanks
Daniel
Why on earth would you want to convert existing working complex stored procs (which can be performance tuned) to LINQ (or am I misunderstanding you)? Because you personally don't like t-sql? Not a good enough reason. Are they too slow? Then they can be tuned (which is something you really don't want to try to do in LINQ). It is possible the process can be made better using SSIS, but as complex as SSIS is and the amount of time a rewrite of the process would take, I'm not sure you really would gain anything by doing so.
"I'm somehow missing the link here to the object oriented programming..." Relational databases are NOT Object-oriented and cannot perform well if you try to treat them like they are. Learn to think in terms of sets not objects when accessing databases. You are coming from the mindset of one user at a time inserting one record at a time, but this is not the mindset neeeded to deal with the transfer of large amounts of data. For these types of things, using the database to handle the problem is better than doing things in an object-oriented manner. Once you have a large amount of data and lots of reporting, people are far more interested in performance than you may have been used to in the past when you used some tools that might not be so good for performance. Whether you like T-SQL or not, it is SQL Server's native language and the database is optimized for it's use.
The best advice, having been here before, is to start by learning first how SQL works, and doing it in the context of the existing architecture sounds like a good way to start (since nothing you've described sounds irrational on the face of it.)
Whatever abstractions you try to lay on top (LINQ, Biztalk, whatever) all eventually resolve to pure SQL. And almost always they add overhead and complexity.
Your OO paradigms aren't transferable. Any suggestions about abstractions will need to be firmly defensible based on your firm grasp of the SQL consequences.
It will take a while, but it's all worth knowing, both professionally and personally.
I'm currently re-engineering a complex system which is moving from Focus (a database and language) to a data warehouse (separate team) and processing (my team) and reporting (separate team).
The current process is combined - data is loaded and managed in the Focus language and Focus database(s) and then reported (and historical data is retained)
In the new process, the DW is loaded and then our process begins. Our processes are completely coded in SQL, and a million row fact table (for one month) would be relatively small. We have some feeds where the monthly data is 25 million rows. There are some statistics tables produced which are over 200 million rows (a month). The processing can take several hours a month, end to end. We use tables to store intermediate results, and we ensure indexing strategies are suitable for the processing. Except for one piece implemented as an SSIS flow from the database back to itself because of extremely poor scalar UDF performance, the entire system is implemented as a series of T-SQl SPs.
We also have a process monitoring system similar to what you are discussing as well as having the dependencies in a table which ensures that each process runs only if all its prerequisites are satisfied. I've recently grafted on the MSAGL to graphically display and interact with the process (previously I was using graphviz to generate static images) from a .NET Windows application. The new system thus has much clearer dependency information as well as good information about process performance so effort can be concentrated on the slowest performing bottlenecks.
I would not plan on doing any re-engineering of any complex system without a clear strategy, a good inventory of the existing system and a large budget for time and money.
From the sounds of what you are saying, you have a three step process.
Input data
Analyze data
Report data
Steps one and three need to be completed by "users". Therefore, a GUI is needed for each respective team to do the task at hand, otherwise, they would be directly working on SQL Server, and would require extensive SQL knowledge. For these items, I do not see any issue with the approach your organization is taking, you are building a customized system to report on the data at hand. The only item that might be worth considering on these side, is standardization between the teams on common libraries and the technologies used.
Your middle step does seem to be a bit lengthy, with many moving parts. However, I've worked on a number of large reporting systems where that is truly the only way to get around it. WIthout knowing more of your organization and the exact nature of operations.
By "fast calculations" you must mean "fast retrieval" Data warehouses (both relational and otherwise) are fast with math because the answers are pre-calculated in advance. SQL, unless you are using CLR stored procedures, is usually a rather slow when it comes to math.
You'd be hard pressed to defeat the performance of BCP and SQL with anything else. If the update routines are long and bloated because they loop through the tables, then sure I can see why you'd want to go to .NET. But you'd probably increase performance by figuring out how to rewrite them all nice and SET based. BCP is not going to be able to be beaten. When I used SQL Server 2000 BCP was often faster than DTS. And SSIS in general (due to all the data type checking) seems to be way slower than DTS. If you kill performance no doubt people are going to be coming to you. Still if you are doing a ton of row by row complex calculations, optimizing that into a CLR stored procedure or even a .NET application that is called from SQL Server to do the processing will probably result in a speed up. Of course if you were row processing and you manage to rewrite the queries to do set processing you'd probably get a bigger speed up. But depending upon how complex the calculations are .NET may help.
Now if a front end change could immediately update and propagate the data, then you might want to change things to .NET so that as soon as a row is changed it can be recalculated and update all the clients. However if a lot of rows are changed or the database is just ginormous then you will kill performance. If the operation needs to be done in bulk then probably the way it is currently being done is the best.
The only thing I might as is that maybe there is a lot of duplicate SQL that looks exactly the same except for a table name and or the column names. If so, you can probably use .NET combined with SQL-SMO(or DMO if using SQL Server 2000) to code generate it.
Here's an example that I often see to load a datawarehouse
Assuming some row tables are loaded with the data from the source
select changed rows from source into temporary tables
see if any columns that matter were changed
if so terminate existing row (or clone it into some history table)
insert/update new row
I often see one of those queries per table and the only variations are the table/column names and maybe references to the key column. You can easily get the column definitions and key definitions out of SQL Server and then make a .NET program to create the INSERT/SELECT/ETC. In the worst case you may just have to store some type of table with TABLE_NAME, COLUMN_NAME for the columns that matter. Then instead of having to wrap your head around a complex ETL process and 20 or 200 update queries, you just need to wrap your head around UPDATE and one query. Any changes to the way things are done can be done once and applied to all the queries.
In particular my guess is that you can apply this technique to the individual client databases if you haven't already. Probably all the queries/bulk copy scripts are the same or almost the same with the exception of database/server name. So you can just autogenerate them based on a CLIENTs table or something.....