I have been trying out a few indexed views and am impressed, but I nearly always need a MAX or a MIN as well and cannot understand why they don't work in these views. Can anyone explain why?
I KNOW they are not allowed, I just can't understand why! COUNT etc. is allowed, so why not MIN/MAX? I'm looking for an explanation.
These aggregates are not allowed because they cannot be recomputed solely based on the changed values.
Some aggregates, like COUNT_BIG() or SUM(), can be recomputed just by looking at the data that changed. These are allowed within an indexed view because, if an underlying value changes, the impact of that change can be directly calculated.
Other aggregates, like MIN() and MAX(), cannot be recomputed just by looking at the data that is being changed. If you delete the value that is currently the max or min, then the new max or min has to be searched for and found in the entire table.
The same principle applies to other aggregates, like AVG() or the standard deviation aggregates. SQL Server cannot recompute them just from the values changed, but needs to re-scan the entire table to get the new value.
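To make the argument above concrete, here is a minimal Python sketch of incremental maintenance. It is only a model, not how SQL Server is implemented: the class name `GroupSummary` and its fields are made up for illustration. COUNT and SUM are updated from the changed value alone, while deleting the current minimum forces a rescan of the remaining rows.

```python
class GroupSummary:
    """Hypothetical model of the per-group state a materialized aggregate stores."""

    def __init__(self, values):
        self.rows = list(values)   # stands in for the base table
        self.count = len(self.rows)
        self.total = sum(self.rows)
        self.minimum = min(self.rows) if self.rows else None
        self.rescans = 0           # how many times MIN forced a full scan

    def insert(self, value):
        self.rows.append(value)
        # COUNT, SUM and MIN can all absorb an insert from the new value alone.
        self.count += 1
        self.total += value
        if self.minimum is None or value < self.minimum:
            self.minimum = value

    def delete(self, value):
        self.rows.remove(value)
        # COUNT and SUM only need the deleted value.
        self.count -= 1
        self.total -= value
        if value == self.minimum:
            # The stored minimum is gone: the only way to find the next
            # one is to look at every remaining row.
            self.rescans += 1
            self.minimum = min(self.rows) if self.rows else None


summary = GroupSummary([5, 3, 8])
summary.delete(8)   # not the minimum: no rescan needed
summary.delete(3)   # the minimum: forces a rescan
print(summary.count, summary.total, summary.minimum, summary.rescans)
```

Deleting 8 is handled with simple arithmetic; deleting 3 is the case the answer describes, where the new minimum has to be searched for.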
Aggregate functions like MIN/MAX aren't supported in indexed views. You have to do the MIN/MAX in the query surrounding the view.
There's a full definition of what is and isn't allowed within an indexed view here (SQL 2005).
Quote:
The AVG, MAX, MIN, STDEV, STDEVP, VAR, or VARP aggregate functions. If AVG(expression) is specified in queries referencing the indexed view, the optimizer can frequently calculate the needed result if the view select list contains SUM(expression) and COUNT_BIG(expression). For example, an indexed view SELECT list cannot contain the expression AVG(column1). If the view SELECT list contains the expressions SUM(column1) and COUNT_BIG(column1), SQL Server can calculate the average for a query that references the view and specifies AVG(column1).
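The arithmetic behind the quoted passage can be checked with a small sketch. It uses Python's sqlite3 module purely as a convenient stand-in, not SQL Server: the `summary` table is an ordinary table playing the role of the indexed view, and the table and column names other than `column1` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (grp TEXT, column1 REAL);
    INSERT INTO readings VALUES ('a', 1.0), ('a', 2.0), ('a', 6.0), ('b', 4.0);
    -- Stand-in for the indexed view: only a sum and a count are stored.
    CREATE TABLE summary AS
        SELECT grp, SUM(column1) AS sum_col1, COUNT(column1) AS cnt_col1
        FROM readings
        GROUP BY grp;
""")

# The average derived from the stored sum and count ...
derived = dict(conn.execute("SELECT grp, sum_col1 / cnt_col1 FROM summary"))
# ... matches AVG computed directly over the base rows.
direct = dict(conn.execute("SELECT grp, AVG(column1) FROM readings GROUP BY grp"))
print(derived == direct)
```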
If you just want to see things ordered without adding a sort when you use a view, I just add a column with an ORDER BY in it:
id = row_number() over (order by col1, col2)
Besides the reasons specified by Remus, there is less practical need to support MIN and MAX.
Unlike COUNT() or SUM(), MAX and MIN are fast to calculate when the column is indexed - you are all set after just one lookup - you don't need to read a lot of data.
I have a pretty simple table with an ID, Date, and 20 value columns. Each column and each row can hold different types of data, with different units of measure, and the ID column defines each field's unit of measure. So basically the ID field helps identify the meaning behind each field. Naturally I have an explanatory table that holds these definitions by ID.
The table holds sensor data, and these sensors are inserting thousands of rows of data each second (each TYPE of sensor has its own ID).
My problem is: how to aggregate this kind of table? Each type of measurement requires a different aggregation (some measurements I need to average, others to sum or min or max etc...).
I think the perfect solution would be something like having an explanatory table by ID, which defines for each field (of that ID) how it should be aggregated, and the aggregation command (somehow... magically...) should be driven dynamically by this table...
Do you have any suggestion how I can accomplish that? Or is it even possible to make the aggregation function dynamic by a certain condition (in this case the explanatory table's value)?
Are you sure SQL is the right tool for the job? It sounds to me like a columnar DB or some other type of NoSQL store would fit better.
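If you do stay with SQL, one possible approach is to build the aggregate query from the explanatory table. The following is only a rough sketch under several assumptions: it uses Python's sqlite3 module, a cut-down schema with two value columns and no Date column, and hypothetical names (`sensor_data`, `agg_rules`, `aggregate`). Function and column names are checked against whitelists because they cannot be passed as bound parameters.

```python
import sqlite3

ALLOWED_FUNCS = {"AVG", "SUM", "MIN", "MAX"}   # whitelist of aggregate names
ALLOWED_FIELDS = {"v1", "v2"}                  # whitelist of value columns

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sensor_data (id INTEGER, v1 REAL, v2 REAL);
    CREATE TABLE agg_rules (id INTEGER, field TEXT, func TEXT);
    INSERT INTO sensor_data VALUES (1, 10, 1), (1, 20, 5), (2, 3, 7), (2, 4, 9);
    INSERT INTO agg_rules VALUES
        (1, 'v1', 'AVG'), (1, 'v2', 'MAX'),
        (2, 'v1', 'SUM'), (2, 'v2', 'MIN');
""")


def aggregate(conn, sensor_id):
    """Aggregate one sensor type using the rules stored for its ID."""
    rules = conn.execute(
        "SELECT field, func FROM agg_rules WHERE id = ? ORDER BY field",
        (sensor_id,),
    ).fetchall()
    parts = []
    for field, func in rules:
        # Identifiers are spliced into the SQL text, so validate them first.
        if func not in ALLOWED_FUNCS or field not in ALLOWED_FIELDS:
            raise ValueError(f"unsupported rule: {func}({field})")
        parts.append(f"{func}({field})")
    sql = f"SELECT {', '.join(parts)} FROM sensor_data WHERE id = ?"
    return conn.execute(sql, (sensor_id,)).fetchone()


result_1 = aggregate(conn, 1)   # AVG(v1), MAX(v2)
result_2 = aggregate(conn, 2)   # SUM(v1), MIN(v2)
print(result_1, result_2)
```

Whether this holds up at thousands of rows per second is a separate question that the sketch does not address.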
Today I had an apparently very common problem of selecting the row with the minimum value from each group of a dataset split by a group by. I found a solution that is unique to SQLite (it works incorrectly in MySQL and throws an error in PostgreSQL) and doesn't use any joins. It looks like this:
SELECT *, min(x) FROM table GROUP BY y
Here is a fiddle with an example.
However, I don't understand why this works - just by including an aggregate function each group was somehow implicitly sorted and returned the row to which the result of the aggregate function corresponds. Default SQL behavior is to select an arbitrary row. I dug through relevant SQLite documentation and found no explanation of this. This is what I'd like an explanation for.
Edit: both answers so far guess that this is a coincidence. It is not. In the actual table I have ~90 records split into ~30 groups with this method and it works as expected on every one. See for yourself.
To be compatible with MySQL, SQLite allows the use of columns that are neither aggregated nor grouped by.
MySQL does not guarantee that the values come from any specific row, and neither did SQLite before version 3.7.11. However, due to how grouping is implemented in SQLite, the values in such columns happened to come from the row that matches the min()/max() in certain cases.
Some paying customer found this useful and wanted a guarantee for this, so SQLite enforced it in all cases and documented it in the changelog of version 3.7.11, which makes it a supported feature (i.e., it's tested, and will never be removed).
While it is safe to use, this behaviour is an extension of the SQL standard that was never properly designed, and never meant to be a selling feature, so it is not mentioned in the actual documentation.
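The behaviour described in this answer can be tried from Python's sqlite3 module, assuming the bundled SQLite is 3.7.11 or later. The table name `items` and its data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (name TEXT, x INTEGER, y TEXT);
    INSERT INTO items VALUES
        ('p', 5, 'g1'), ('q', 2, 'g1'), ('r', 9, 'g1'),
        ('s', 7, 'g2'), ('t', 4, 'g2');
""")

# "name" is neither aggregated nor grouped by; SQLite takes it from the
# row that supplied min(x) in each group.
rows = conn.execute(
    "SELECT name, y, min(x) FROM items GROUP BY y ORDER BY y"
).fetchall()
print(sqlite3.sqlite_version, rows)
```

The same statement is not portable: as noted in the question, other databases either reject it or return values from an unspecified row.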
It probably works by accident. SQLite will return an arbitrary row for each group. The row does not necessarily have to have the minimum x value for the group.
Learn to express the query correctly:
SELECT t.*
FROM table t
WHERE t.x = (SELECT MIN(t2.x) FROM table t2 WHERE t2.y = t.y)
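As a quick check of the correlated-subquery form, here is a sketch run through Python's sqlite3 module. The table is named `items` (a hypothetical name) because `table` is a reserved word, and the data includes a tie to show one difference from a one-row-per-group result: every row that shares the group minimum is returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (name TEXT, x INTEGER, y TEXT);
    INSERT INTO items VALUES
        ('p', 5, 'g1'), ('q', 2, 'g1'),
        ('s', 4, 'g2'), ('t', 4, 'g2');
""")

# Standard SQL: keep each row whose x equals the minimum x of its group.
rows = conn.execute("""
    SELECT t.name, t.x, t.y
    FROM items AS t
    WHERE t.x = (SELECT MIN(t2.x) FROM items AS t2 WHERE t2.y = t.y)
    ORDER BY t.y, t.name
""").fetchall()
print(rows)
```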
The record you see was arbitrarily chosen.
You cannot count on behaviour that merely seems fixed to you.
It can change due to changes in the table structure (e.g. added/removed indexes), between versions etc.
https://www.sqlite.org/lang_select.html
If the SELECT statement is an aggregate query with a GROUP BY clause
...
Each expression in the result-set is then evaluated once for each
group of rows. If the expression is an aggregate expression, it is
evaluated across all rows in the group. Otherwise, it is evaluated
against a single arbitrarily chosen row from within the group. If
there is more than one non-aggregate expression in the result-set,
then all such expressions are evaluated for the same row.
This reminds me of a famous pitfall related to Oracle's GROUP BY.
Everybody just knew that if you use GROUP BY you can skip the ORDER BY because the result set is already ordered.
The reason the result set was ordered at that time is that Oracle used a sort based algorithm for the implementation of the group by.
In version 10gR2 Oracle added an additional GROUP BY algorithm based on HASH.
You can guess the rest of the story.
I have been trying to figure out what kinds of aggregates I can use to create an indexed view. FYI: I was able to create one with SUM(). I also found that I can't create an indexed view with MIN, MAX and AVG. How about others? Is it possible? I couldn't find any info on the web and also couldn't make it work on my computer.
According to TechNet, scalar aggregates are supported in indexed views. As to why Min/Max are not supported, see this answer.
sqlmag.com says:
Do Index Sorting, Grouping, and Aggregating Columns
You also need to consider indexing columns that you use to order by and those that you use in a grouping expression. You might benefit from indexing the columns that the MIN(), MAX(), COUNT(), SUM(), and AVG() functions use to aggregate the data. When you use the MIN() and MAX() functions, SQL Server does a simple lookup for the minimum and maximum values in the column, respectively. If an index's data values are arranged in ascending order, SQL Server can read the index to quickly determine the correct values of MIN() or MAX(). The range-of-values query incorporates a filter or constraint (expressed in the SELECT query's WHERE clause or HAVING clause) to limit the rows that the query returns. Similarly, when you have an index, you can optimize data sorting (by using the ORDER BY clause) and data grouping (by using the GROUP BY clause), especially if the table or tables you're querying contain many rows.
I want the computed column to store count totals from another table. How would I do it? Would the following work?
create table sample
(
column1 AS (SELECT COUNT(*) FROM table2) PERSISTED
)
For SQL Server you could potentially do this with an Indexed View.
Those present a host of other restrictions, though, so be sure the value is enough to justify the increased effort in maintenance.
One of the handier aspects of indexed views is that you don't need to query them directly to get the benefits - if the optimizer detects you querying an aggregate that is indexed it'll make use of it "behind the scenes".
Per MSDN:
A computed column is computed from an expression that can use other columns in the same table. The expression can be a noncomputed column name, constant, function, and any combination of these connected by one or more operators. The expression cannot be a subquery.
How does a function like SUM work? If I execute
select id,sum(a) from mytable group by id
does it sort by id and then sum over each range of equal id's? I am no planner expert, but it looks like that is what is happening, where mytable is maybe a hundred million rows with a few million distinct id's.
Or does it just keep a hash of id -> current_sum, and then at each row either increment the value for that id or add a new key? Isn't that far faster and less memory hungry?
SQL standards try to dictate external behavior, not internal behavior. In this particular case, a SQL implementation that conforms to (one of the many) standards is supposed to act like it does things in this order.
1. Build a working table from all the table constructors in the FROM clause. (There's only one in your example.)
2. In the GROUP BY clause, partition the working table into groups. Reduce each group to one row. Replace the working table with the grouped table.
3. Resolve the expressions in the SELECT clause.
Query optimizers that follow SQL standards are free to rearrange things however they like, as long as the result is the same as if it had followed those steps.
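The two strategies the question describes, sorting and then summing runs of equal ids versus keeping a hash of id -> current_sum, can be sketched in plain Python. This is only an illustration of the two ideas, not a description of any particular database's executor; the function names are made up.

```python
from itertools import groupby
from operator import itemgetter

rows = [(2, 10), (1, 5), (2, 7), (3, 1), (1, 4)]   # (id, a)


def sum_by_sort(rows):
    """Sort by id, then sum each run of equal ids."""
    ordered = sorted(rows, key=itemgetter(0))
    return [
        (key, sum(a for _, a in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def sum_by_hash(rows):
    """Single pass that keeps a running sum per id in a hash table."""
    totals = {}
    for key, a in rows:
        totals[key] = totals.get(key, 0) + a
    return totals


sorted_result = sum_by_sort(rows)
hashed_result = sum_by_hash(rows)
print(sorted_result)
print(hashed_result)
```

Both produce the same sums per id. The sort-based version needs the whole input ordered first but then holds only one group at a time, while the hash-based version makes a single pass but keeps one entry per distinct id in memory and returns groups in no particular sorted order.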
You can find more details in the answers and comments to this SO question.
So, I found this, http://helmingstay.blogspot.com/2009/06/postgresql-poetry-aggregate-median-with.html, which claims that it does indeed use the accumulator pattern. Hmmm.