Best way to report comparisons of one agency to the rest of the state/nation - sql

When attempting to do some benchmarking type reports, I run into the issue of extreme slowness due to the amount of data residing in the database, and this will get incrementally worse. I'm curious of what would be considered the best approach for reports that show for example a percentage of patients entering the hospital within a certain date range that were there due to a specific condition, as well as how that particular hospital compares to the state percentage and also the national percentage. Of course this is all based on the hospitals whose data resides in the database. I have just been writing stored procedures to calculate these percentages, but I know this isn't the best approach. I'm curious how other more experienced reporting professionals would tackle this. I'm currently using SSRS for reporting. I know a little about SSAS, but not enough to know if I should consider it for this type of reporting.

This all depends on the data-structure and the kind of calculations you have to do.
You try to narrow down the amount of data you have to process and the complexity of operations in every possible way
If you have lots of data on a slow system you first try to select the needed data, transfer it to the calculation point and then keep it cached as long as you can.
If you have huge amounts of data you try to preprocess it as much as you can. E.g. for datawarehouses you have a datetime-table with year/month/day/day-of-week/week-of-year etc in it and just constraints to them in the other tables. Like this you can avoid timeconsuming calculations.
If the operations are complex you have to analyze them to make them simpler/faster but on this point it is impossible to predict how much (if at all) there is some room.
It all depends on your understanding of the data-structure and processes you need them for, in order to improve everything as much as you can.
I myself haven't worked with SSAS yet but this is also a great tool but (imho) more for lots of different analysis.

Related

Fast reporting with user parameters and temp result sets

I have come across a problem with reporting from SQL Server databases using SSRS, that I wonder if you could help me with.
When you have a huge amount of data in a table, and you want to select only those rows within a certain criteria, and you want to allow the users to specify that criteria (for example, it might be a start date and end date), and you then want to take that data (within the criteria) and perform a ton of other transformations on it, including producing various temporary result sets along the way (using CTEs or Table Variables or Temp tables) to finally produce the report, this basically takes ages in SQL. You can do it, but your users might have to wait an hour or two from the moment they've hit View Report, to their report being rendered.
I don't know much about MDX or DAX, cubes or tabular models, but I wonder if there is a quicker way to do what I want. Note the important aspect of the problem: the user is specifying a criteria that has to go all the way back to the original table, and then various transformations (including temp result sets) have to be applied to produce the final report.
What is the best way to do this? Am I doing it the only way possible? I know it's a broad question, but I'd like to know, theoretically, what the answer is. Where should I be looking? Should I be looking at Cubes? Tabular Models? Should I be using R in SQL Server?
There is always a balance when it comes to handling large datasets. Sometimes it makes sense to do some of the work ahead of time so that on-demand reports can run in a reasonable amount of time.
In order for a model to be a good option here are some general guidelines:
Many reports would be able to use common attributes from the model
The data involves aggregates, not just lists of records
The data does not need to be live
You have plenty of development and testing time
Anyone who would be using it as a data source will have to have be
trained on the structure and be at least slightly familiar with MDX
Another option for you to consider is to have a stored procedure that "prepares" the data for you overnight in a separate table. This table could be well indexed because the write time is not as important. They report would then point to this table to be able to quickly retrieve the data it needs to present. This shifts most of the preparation/aggregation work. You can still of course have parameters that limit how much of this data you pull back.
Based on the little bit of information you've given us (300 million rows in a single non-normalized table), there is definitely a faster way. However, there will not be any quick solutions and you haven't provided enough information for me to give any recommendations.
I think you may need to seek some professional help to review your infrastructure and needs along with your usage and objectives so you can be pointed in the right direction.

When to use a query or code [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I am asking for a concrete case for Java + JPA / Hibernate + Mysql, but I think you can apply this question to a great number of languages.
Sometimes I have to perform a query on a database to get some entities, such as employees. Let's say you need some specific employees (the ones with 'John' as their firstname), would you rather do a query returning this exact set of employees, or would you prefer to search for all the employees and then use a programming language to retrieve the ones that you are interested with? why (ease, efficiency)?
Which is (in general) more efficient?
Is one approach better than the other depending on the table size?
Considering:
Same complexity, reusability in both cases.
Always do the query on the database. If you do not you have to copy over more data to the client and also databases are written to efficiently filter data almost certainly being more efficient than your code.
The only exception I can think of is if the filter condition is computationally complex and you can spread the calculation over more CPU power than the database has.
In the cases I have had a database the server has had more CPU power than the clients so unless overloaded will just run the query more quickly for the same amount of code.
Also you have to write less code to do the query on the database using Hibernates query language rather than you having to write code to manipulate the data on the client. Hibernate queries will also make use of any client caching in the configiration without you having to write more code.
There is a general trick often used in programming - paying with memory for operation speedup. If you have lots of employees, and you are going to query a significant portion of them, one by one (say, 75% will be queried at one time or the other), then query everything, cache it (very important!), and complete the lookup in memory. The next time you query, skip the trip to RDBMS, go straight to the cache, and do a fast look-up: a roundtrip to a database is very expensive, compared to an in-memory hash lookup.
On the other hand, if you are accessing a small portion of employees, you should query just one employee: data transfer from the RDBMS to your program takes a lot of time, a lot of network bandwidth, a lot of memory on your side, and a lot of memory on the RDBMS side. Querying lots of rows to throw away all but one never makes sense.
In general, I would let the database do what databases are good at. Filtering data is something databases are really good at, so it would be best left there.
That said, there are some situations where you might just want to grab all of them and do the filtering in code though. One I can think of would be if the number of rows is relatively small and you plan to cache them in your app. In that case you would just look up all the rows, cache them, and do subsequent filtering against what you have in the cache.
It's situational. I think in general, it's better to use sql to get the exact result set.
The problem with loading all the entities and then searching programmatically is that you ahve to load all the entitites, which could take a lot of memory. Additionally, you have to then search all the entities. Why do that when you can leverage your RDBMS and get the exact results you want. In other words, why load a large dataset that could use too much memory, then process it, when you can let your RDBMS do the work for you?
On the other hand, if you know the size of your dataset is not too, you can load it into memory and then query it -- this has the advantage that you don't need to go to the RDBMS, which might or might not require going over your network, depending on your system architecture.
However, even then, you can use various caching utilities so that the common query results are cached, which removes the advantage of caching the data yourself.
Remember, that your approach should scale over time. What may be a small data set could later turn into a huge data set over time. We had an issue with a programmer that coded the application to query the entire table then run manipulations on it. The approach worked fine when there were only 100 rows with two subselects, but as the data grew over the years, the performance issues became apparent. Inserting even a date filter to query only the last 365 days, could help your application scale better.
-- if you are looking for an answer specific to hibernate, check #Mark's answer
Given the Employee example -assuming the number of employees can scale over time, it is better to use an approach to query the database for the exact data.
However, if you are considering something like Department (for example), where the chances of the data growing rapidly is less, it is useful to query all of them and have in memory - this way you don't have to reach to the external resource (database) every time, which could be costly.
So the general parameters are these,
scaling of data
criticality to bussiness
volume of data
frequency of usage
to put some sense, when the data is not going to scale frequently and the data is not mission critical and volume of data is manageable in memory on the application server and is used frequently - Bring it all and filter them programatically, if needed.
if otherwise get only specific data.
What is better: to store a lot of food at home or buy it little by little? When you travel a lot? Just when hosting a party? It depends, isn't? Similarly, the best approach is a matter of performance optimization. That involves a lot of variables. The art is to both prevent painting yourself into a corner when designing your solution and optimize later, when you know your real bottlenecks. A good starting point is here: en.wikipedia.org/wiki/Performance_tuning One think could be more or less universally helpful: encapsulate your data access well.

Thoughts on dimension measures for BI

I am working with a consultant who recommends creating a measure dimension and then adding the measure dimension key to our fact table.
I can see how this can make adding new measures easier by just adding rows instead of physically creating columns in the fact table. I can also see how this can add work to the ETL process, adds another join to the star schema, one generic column in fact table to hold all measure data etc.
I'm interested in how others have dealt with this situation. We currently have close to twenty measures.
Instinctively, I don't like it: it's the EAV model, which is not very popular (you can Google the reasons why).
The EAV model is generally considered to be a headache to query and maintain
Different measures go together with different dimensions; this approach could easily turn into "one giant fact table for everything" instead of multiple smaller fact tables for specific reporting areas
I suspect you would end up creating views to give the appearance of multiple fact tables anyway
You will multiply the number of rows in your fact table by the number of measures, resulting in a much bigger physical table
Even with a good indexing/partitioning scheme, queries that include more than one measure will have to read a lot more rows to get the data
What about measures with different data types?
Is this easily supported in your reporting tool?
I'm sure there are other issues, but those are the ones that come to mind immediately. As a rule of thumb, if someone suggests an EAV implementation in any context, you should be very wary and ask them exactly what advantages it offers and how it will be managed as the data and complexity increase. But I think you've already identified some key areas of concern.
SSAS will do this, and I know of a major vendor of insurance policy administration software that provided a M.I. solution for their system that works like this. You do get some flexibility from the approach in that you can add measures without having to deploy a build of the cube, although for 20 measures I don't think you need to worry about that.
'Measures' is essentially another dimension (and often referred to as such in the documentation). I believe SSAS uses a largely column-oriented structure behind the scenes.
However, a naive application of this approach does have some issues that could come and bite you to a greater or lesser extent.
You only have one measure, [Value], [Amount] or whatever it's called. If your tool won't let you inject calculated measures at the front-end then you can't sort the whole data set on the value of one of your attribute types. ProClarity and report builder >=2.0 will do this but Excel won't.
You can't do ratios or other calculated measures in this way. You will have to either embed them in the cube script (meaning you need to deploy a build to add them) or use a tool that lets you define them in the client.
Although it doesn't make a lot of differece to the cube it will be slow to query on the database and increase storage requirements. It's also fiddly to query on the database.

Database Normalisation and Searching it Quickly

I'm working on the technical architecture for a content solution integration. The data from the solution provider runs to millions of rows and normalised to 3NF. It is updated on a regular schedule (daily most likely) and its data is split down to a very granular level of atomicity.
I need to search and query this data and my current inclination is to leave the normalised data alone and create a denormalised database from its data (OLAP to OLTP). The 'transfer' can be a custom built program that can contain the necessary business logic in addition to the raw copying power and be run at a set schedule as required. The denormalised database would then reduce the atomicity and allow the keyword searches and queries to run efficiently. I was looking at using Lucene .NET for the keyword work on the denormalised database.
So before I sing loudly from the hills that this is the way forward, I wanted some expert opinion on this and what is the perceived "best practise". Is the method I have suggested the best way forward considering the data I will be provided? It was suggested that perhaps I could use a 'search engine' to search the normalised data. This scared the hell out of me, but raised the question; what search engine and how?
Opinions, flames, bad language and help appreciated :)
I have built reporting databases and data warehouses based on data stored in normalized form. There is quite a bit of work involved in the transfer program (ETL). Given your description of the data feed, maybe some of that work has been done for you by the feeder.
Millions of rows isn't a lot, these days. You may be able to get away with report oriented views into the existing database. Try it and see.
The biggest benefit to building an OLAP oriented database is not speed. It's flexibility. "We love this report, but now we want to see it weekly and quarterly instead of monthly. Bam! Done!" "Can you break it down by marketing category instead of manufacturing category? Bam! Done!" And so on.
A resonably normalized model (3NF/BCNF) provides the best average performance and the least amount of modification anomalies for the largest number of scenarios. That's big, so I would start from there. As your requirements are fuzzy, it's seems like the most sensible option.
Actually, the most sensible thing would be to go over the requirements until they are a bit more "crisp" ;)
Also, if you could get your hands on a few early extracts from your data provider you could experiment with it and get a feeling for the data distributions (not all people live in one country, and some countries holds more people than others. Not all people have children, and the number children per person is vastly different depending on the country). This is a major point and it is crucial that the optimizer can make good decisions.
Other than that, I agree with everything Walter said and also gave him my vote.

Would this method work to scale out SQL queries?

I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (vertically), one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly like it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the resultsets. I realize that an aggregate function like average need to be rewritten into a count and sum to the nodes and that the coordinator divides the sum of the sums with the sum of the counts to get the average.
What kinds of problems could not easily be solved using this model. I believe one issue would be the count distinct function.
Edit: I am getting so many nice suggestions, but none have addressed the method.
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing you data, you need to de-normalize it.
You mention in a comment that your data base is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several, identical SQL queries to individual "shard" databases, each containing a subset, then the simple selection query will scale to the network (i.e. you will eventually become network bound to the controller), as this is a truly, parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 node and inevitably have to summarize or sort the 1000 row result set from each node, resulting in 1M rows, you still have a long result time and large data processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But, my general point, is if you are running, specifically, aggregates, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running analysis services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying custom solution (grid) you introduce a lot of complexity. Maybe, it's your only solution, but first did you try partitioning the table (native solution)?
I'd seriously be looking into an OLAP solution. The trick with the Cube is once built it can be queried in lots of ways that you may not have considered. And as #HLGEM mentioned, have you addressed indexing?
Even at in millions of rows, a good search should be logarithmic not linear. If you have even one query which results in a scan then your performance will be destroyed. We might need an example of your structure to see if we can help more?
I also agree fully with #Mason, have you profiled your query and investigated the query plan to see where your bottlenecks are. Adding nodes improving speed makes me think that your query might be CPU bound.
David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau, will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by reaggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with pk/fk/constraints. If this were my world, I would probably create many indexed views on top of my table (and other views) that optimized for the particular queries I need to execute (which I have worked with successfully on 10million+ row tables.)
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.