Requirements for an indexed view - SQL

I am currently working towards my SQL Server certification (exam 70-461). I'm working through some practice tests at the moment and have come across a question on the requirements for an indexed view. I understand that indexed views must be created WITH SCHEMABINDING, that they must include COUNT_BIG(*) if a GROUP BY clause is used, and that the first index must be a unique clustered index, which then materialises the data.
CREATE VIEW VW_Test
AS
SELECT ColumnA, ColumnB FROM Table
WHERE ColumnC > 200
In the sample question, the index is to be created on ColumnA. ColumnB and ColumnC are both computed columns.
The question is, what are the requirements for ColumnB and ColumnC?
Deterministic
Precise
Marked PERSISTED
Unfortunately, in my training material I have not come across these terms in this context, so if you can give me some guidance on what they mean, I will be able to figure it out from there.

Deterministic: Refers to functions referenced by computed columns. Deterministic functions always return the same value when given the same input. For example, SUM is deterministic, but GETDATE is not.
Precise: A deterministic expression that does not contain float expressions.
Marked PERSISTED: When building a computed column, there is an option to mark it as 'PERSISTED' so that the computed column is physically stored to the database, rather than re-calculated on-the-fly when referenced.
As to the question itself about the requirements for columns B and C, it would seem that the following applies:
"Only precise deterministic expressions can participate in key columns and in WHERE or GROUP BY clauses of indexed views."
Create Indexed Views
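Putting the three terms together, here is a minimal T-SQL sketch (table, view and column names are invented) of a deterministic, precise, persisted computed column used by an indexed view:
-- Deterministic, precise computed column, physically stored via PERSISTED
CREATE TABLE dbo.Orders
(
    OrderID    INT NOT NULL PRIMARY KEY,
    CustomerID INT NOT NULL,
    Quantity   INT NOT NULL,
    UnitPrice  DECIMAL(10,2) NOT NULL,             -- precise type, no FLOAT/REAL
    LineTotal  AS (Quantity * UnitPrice) PERSISTED -- same inputs always give the same value
);
GO
CREATE VIEW dbo.VW_CustomerTotals
WITH SCHEMABINDING                                 -- required for an indexed view
AS
SELECT CustomerID,
       SUM(LineTotal) AS Total,
       COUNT_BIG(*)   AS RowCnt                    -- required because of the GROUP BY
FROM dbo.Orders
GROUP BY CustomerID;
GO
-- The first index must be unique and clustered; this materialises the view
CREATE UNIQUE CLUSTERED INDEX IX_VW_CustomerTotals
    ON dbo.VW_CustomerTotals (CustomerID);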

Related

Current State Query / Architecture

I expect this is a common enough use case, but I'm unsure of the best way to leverage database features to do it. Hopefully the community can help.
Given a business domain where a record is made up of a number of attributes; we can just call these a, b, c.
Each of these belongs to a parent record, of which there can be many.
Given an external data source that posts updates to those attributes at arbitrary times, and typically only to a subset of them, you get instructions like
z:{a:3}
or
y:{b:2,c:100}
What are good ways to query Postgres for the 'current state', i.e. to get a single-row result representing the most recent value of each of a, b, c, for each of the parent records?
Overall, the current state looks like
x:{a:0, b:0, c:1}
y:{a:1, b:2, c:3}
z:{a:2, b:65, c:6}
If it matters, the difference in time between updates to a single value could be arbitrarily long.
I am deliberately avoiding a table that keeps updating and rewriting an individual row for the state, because write contention could be a problem, and I think there must be a better overall pattern.
Your question is a bit theoretical, but in essence you are describing a top-1-per-group problem. In Postgres, you can use DISTINCT ON for this.
Assuming that your table is called mytable, that the parent record is identified by column parent_id, that attributes are stored in column attribute, and that column ordering_id defines the ordering of the rows (that could be a timestamp or a serial, for example), you would phrase the query as:
select distinct on (parent_id, attribute) t.*
from mytable t
order by parent_id, attribute, ordering_id desc
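If you then want the single-row-per-parent shape from the question (x:{a:0, b:0, c:1} and so on), you can pivot the result with conditional aggregation. A sketch, additionally assuming the attribute's value is stored in a column named value:
select parent_id,
       max(value) filter (where attribute = 'a') as a,
       max(value) filter (where attribute = 'b') as b,
       max(value) filter (where attribute = 'c') as c
from (
    -- latest row per (parent, attribute), as above
    select distinct on (parent_id, attribute) parent_id, attribute, value
    from mytable
    order by parent_id, attribute, ordering_id desc
) latest
group by parent_id;
Note that the FILTER clause requires Postgres 9.4 or later.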

Open SQL translated to native Oracle SQL chooses arbitrary indexes by WHERE clause?

We spotted some strange behaviour during a customer project. The underlying database is Oracle, so this question is tagged as Oracle; we only operate at the Open SQL layer, which makes our knowledge of the underlying database architecture/processes/optimizations practically zero.
Let me describe the design as it is:
We have two tables: one is the core table, the other is the optional additional table.
The core table is always populated; the optional table is not.
The core table has the keys MANDT, WERKS and our own running ID.
The optional table has the same keys, which are set up as a set of foreign keys referring to the core table.
The foreign-key setup is: "Control desired", no message, the kind of foreign key is "not specified", and no cardinality is set.
Let me now specify what we do/want to do:
We have our own query builder, which allows us to create SELECT statements almost entirely by drag and drop.
We want to select from the core table with an inner join to the optional table (yes, in this specific case the master data tells us that this optional table is always populated, too), based on the three key fields (two, actually).
We have some indexes on both tables: in the core table we use the indexed ERDAT (creation date), and in the optional table we use a CHAR20 field, also indexed.
We want to select exactly this way.
We debug...
Let me describe what we saw:
We analyzed the generated SELECT statement in the debugger and verified that
it looked the way we wanted it to, meaning the WHERE clause first covers ERDAT from the core table and the second line covers the CHAR20 field of the optional table.
We did runtime analyses (providing other values for the WHERE criteria), ran ST05, etc. Even though the underlying database is a black box to us, ST05 could apparently show the index order used in the generated native SQL.
The result: the CHAR20 index of the optional table is used first, and the index on ERDAT of the core table is used second.
Let me ask:
Can anybody explain why? (If more information is necessary, I will try to provide it.)
Oracle will most commonly use a cost-based optimizer to execute the query. Depending on how selective each index is and on how the join is best performed, the DB decides which table to query first.
I can only guess, but I think your core table has a lot of rows and the index on that table is not very selective. So querying the core table first might produce many result rows.
If the optional table has only a few rows, it is expected to deliver fewer rows when queried using the index on the CHAR20 field. In this case, the DB will start the join with the smaller result set, and hence prefers this option.
Query optimization is a very complex process; you can read more about it here:
https://docs.oracle.com/cd/B10501_01/server.920/a96533/optimops.htm
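Incidentally, you can verify the chosen join order yourself from the Oracle side; a minimal sketch with placeholder names (core_table, optional_table and char20_field stand in for the real objects):
EXPLAIN PLAN FOR
SELECT *
  FROM core_table c
  JOIN optional_table o
    ON o.mandt = c.mandt
   AND o.werks = c.werks
   AND o.id    = c.id
 WHERE c.erdat >= DATE '2023-01-01'
   AND o.char20_field = 'SOME_VALUE';

-- The plan shows which index is used first and the order in which the tables are joined
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);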

Storing flags in a SQL column, and indexing them

I need to store a set of flags that are related to an entity into database. Flags might not be the best word because these are not binary information (on/off), but rather a to-be-defined set of codes.
Normally, you would store each piece of information (say, each flag value) in a distinct column, but I'm exploring opportunities for storing such information in data structures other than one-column-per-attribute, to prevent a dramatic increase in column mappings. Since each flag applies to each attribute of an entity, you can see that for large entities, which intrinsically require a large number of columns, the total number of columns may grow as 2n.
Eventually, these codes can be mapped to a positional string.
I'm thinking about something like 02A, not interpreted as decimal 42, but rather as:
Flag 0 in position 1 (or zero if you prefer...)
Flag 2 in position 2
Flag A in position 3
Data formatted this way can easily be processed by high-level programming languages; PL/SQL is out of the scope of the question, and all these values are supposed to be processed in Java.
Now the real problem
One of my specs is to optimize searching. I have been asked to find a way (say, an efficient way) to search for entities that have a certain flag (or a special 0 flag) in a given position.
Normally, in SQL, given the RDBMS-specific substring function, you would
SELECT * FROM ENTITIES WHERE SUBSTRING(FLAGS,{POSITION},1) = {VALUE};
This works, but I'm afraid it may be a little slow on all platforms but Oracle, which, AFAIK, supports creating secondary indexes mapped to a substring.
However, my solution must work in MySQL, Oracle, SQL Server and DB2 thanks to Hibernate.
Given such a design, is there some, possibly cross-platform, indexing strategy that I'm missing?
If performance is an issue, I would go for a different model here.
Say, a table that stores the entities, plus a 1->N relation to another table (say, a flags table: entId (FK), flag, position), and that table would have an index on flag and position.
The issue here would be getting these flags back into a single column, which can be done in Java or even in the database (though it would be difficult to make that query cross-platform).
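A minimal sketch of that model (all names are invented), with a composite index so that a position + flag search becomes a plain index lookup:
CREATE TABLE entity_flags (
    ent_id   BIGINT  NOT NULL,  -- FK to the entities table
    position INTEGER NOT NULL,  -- position within the logical flag string
    flag     CHAR(1) NOT NULL,  -- the flag code at that position
    PRIMARY KEY (ent_id, position)
);
CREATE INDEX ix_entity_flags ON entity_flags (position, flag);

-- "entities with flag 'A' in position 3" no longer needs SUBSTRING:
SELECT e.*
FROM entities e
JOIN entity_flags f ON f.ent_id = e.id
WHERE f.position = 3
  AND f.flag = 'A';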
If you want a database-independent, reasonable method for storing such flags, then use typical SQL data types. For a binary flag, you can use bit or boolean (this differs among databases). For other flags, you can use tinyint or smallint.
Doing bit-fiddling is not going to be portable. If nothing else, the functions used to extract particular bits from data differ among databases.
Second, if performance is an issue, then you may need to create indexes to avoid full table scans. You can create indexes on normal SQL data types (although some databases may not allow indexes on bits).
It sounds like you are trying to be overly clever. You should first get the application to work using reasonable data structures. Then you will understand where the performance issues are and can work on fixing them.
I have improved my design and performed a benchmark and found an interesting result.
I created a dummy demographic entity with first/last name columns, birthdate, birthplace, email, SSN...
Then in version 1
I added a column VALIDATION VARCHAR(40) NULL DEFAULT NULL with an index on it.
Instead of positional flags, the new column contains an unordered set of codes, each representing a specific format error (e.g. A01 means "last name not specified", etc.). Each code is terminated by a colon (:).
Example values look like
NULL
'A01:A03:A10:'
'A05:'
Typical queries are:
SELECT * FROM ENTITIES WHERE VALIDATION IS {NOT} NULL
Search for entities that are valid/invalid (NULL = no problem)
SELECT * FROM ENTITIES WHERE VALIDATION LIKE '%AXX:%';
Selects entities with a specific problem
Then in version 2
I added a column VALID TINYINT NOT NULL with an index, where 0 = invalid and 1 = valid (Hibernate maps a Boolean to a TINYINT in MySQL).
I added a lookup table
CREATE TABLE ENTITY_VALIDATION (
    ID BIGINT NOT NULL PRIMARY KEY,
    PERSON_ID BIGINT NOT NULL, -- REFERENCES PERSONS(ID) omitted for performance
    ERROR CHAR(3) NOT NULL
)
with an index on both PERSON_ID and ERROR. This represents the 1:N relationship.
Queries:
SELECT * FROM ENTITIES WHERE VALID = {0|1}
Select invalid/valid entities
SELECT * FROM ENTITIES JOIN ENTITY_VALIDATION ON ENTITIES.ID = ENTITY_VALIDATION.PERSON_ID WHERE ERROR = 'Axx';
Selects entities with a given problem
Then I benchmarked
the COUNT(*) of each query via JUnit+JDBC: the same queries you see above, with * replaced by COUNT(*).
I ran several benchmarks, with the entity table containing 100k, 250k, 500k, 750k and 1M entities, and a mean entity:flag ratio of 1:3 (on average, each entity has 3 errors).
The result
While valid/invalid entity lookups perform equally, it looks like MySQL is faster with the LIKE operator than with the JOIN, even though both columns are indexed.
Of course,
this was only a benchmark on MySQL. While the approach is cross-platform, the benchmark does not (yet) compare performance across different DBMSes.

Computed column to store an aggregate count

I want a computed column that stores count totals from another table; how would I do it? (Would the following work?)
create table sample
(
column1 AS (SELECT COUNT(*) FROM table2) PERSISTED
)
For SQL Server you could potentially do this with an Indexed View.
Those present a host of other restrictions, though, so be sure the value is enough to justify the increased effort in maintenance.
One of the handier aspects of indexed views is that you don't need to query them directly to get the benefits - if the optimizer detects you querying an aggregate that is indexed it'll make use of it "behind the scenes".
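A sketch of what that could look like (object names are invented; I'm assuming table2 has some grouping column, since an indexed view with aggregates needs a GROUP BY column to key the clustered index on):
CREATE VIEW dbo.VW_Table2Counts
WITH SCHEMABINDING
AS
SELECT GroupKey,
       COUNT_BIG(*) AS RowCnt  -- COUNT_BIG(*) is mandatory here
FROM dbo.table2
GROUP BY GroupKey;
GO
-- Materialises the counts; SQL Server maintains them as table2 changes
CREATE UNIQUE CLUSTERED INDEX IX_VW_Table2Counts
    ON dbo.VW_Table2Counts (GroupKey);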
Per MSDN:
A computed column is computed from an expression that can use other columns in the same table. The expression can be a noncomputed column name, constant, function, and any combination of these connected by one or more operators. The expression cannot be a subquery.

When are computed columns appropriate?

I'm considering designing a table with a computed column in Microsoft SQL Server 2008. It would be a simple calculation like (ISNULL(colA,(0)) + ISNULL(colB,(0))) - like a total. Our application uses Entity Framework 4.
I'm not completely familiar with computed columns, so I'm curious what others have to say about when they are appropriate, as opposed to other mechanisms that achieve the same result, such as views or a computed Entity column.
Are there any reasons why I wouldn't want to use a computed column in a table?
If I do use a computed column, should it be persisted or not? I've read about different performance results using persisted, not persisted, with indexed and non indexed computed columns here. Given that my computation seems simple, I'm inclined to say that it shouldn't be persisted.
In my experience, they're most useful/appropriate when they can be used in other places like an index or a check constraint, which sometimes requires that the column be persisted (physically stored in the table). For further details, see Computed Columns and Creating Indexes on Computed Columns.
If your computed column is not persisted, it will be calculated every time you access it in e.g. a SELECT. If the data it's based on changes frequently, that might be okay.
If the data doesn't change frequently, e.g. if you have a computed column to turn your numeric OrderID INT into a human-readable ORD-0001234 or something like that, then definitely make your computed column persisted - in that case, the value will be computed and physically stored on disk, and any subsequent access to it is like reading any other column on your table - no re-computation over and over again.
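For example, the order-number case just described might look like this (a sketch; the exact format is made up):
ALTER TABLE dbo.Orders
ADD OrderNumber AS ('ORD-' + RIGHT('0000000' + CAST(OrderID AS VARCHAR(7)), 7)) PERSISTED;
-- Recomputed only when OrderID changes; otherwise read from disk like any other column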
We've also come to use (and highly appreciate!) computed columns to extract certain pieces of information from XML columns and surface them on the table as separate (persisted) columns. That makes querying for those items much more efficient than constantly having to poke into the XML with XQuery to retrieve the information. For this use case, I think persisted computed columns are a great way to speed up your queries!
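Since xml data type methods can't be used directly in a computed column definition, the usual pattern (sketched here with invented names and XPath) is to wrap the extraction in a schema-bound scalar function:
CREATE FUNCTION dbo.fnExtractCustomerName (@doc XML)
RETURNS NVARCHAR(100)
WITH SCHEMABINDING
AS
BEGIN
    -- Pull one scalar value out of the XML document
    RETURN @doc.value('(/Order/CustomerName)[1]', 'NVARCHAR(100)');
END;
GO
ALTER TABLE dbo.Orders
ADD CustomerName AS dbo.fnExtractCustomerName(OrderXml) PERSISTED;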
Let's say you have a computed column called ProspectRanking that is the result of the evaluation of the values in several columns: ReadingLevel, AnnualIncome, Gender, OwnsBoat, HasPurchasedPremiumGasolineRecently.
Let's also say that many decentralized departments in your large mega-corporation use this data, and they all have their own programmers on staff, but you want the ProspectRanking algorithms to be managed centrally by IT at corporate headquarters, who maintain close communication with the VP of Marketing. Let's also say that the algorithm is frequently tweaked to reflect some changing conditions, like the interest rate or the rate of inflation.
You'd want the computation to be part of the back-end database engine and not in the client consumers of the data, if managing the front-end clients would be like herding cats.
If you can avoid herding cats, do so.
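A sketch of how that central ownership could look (column names and weights are entirely invented): the ranking lives in one schema-bound function, and the computed column simply references it, so every consumer reads the same centrally managed value:
CREATE FUNCTION dbo.fnProspectRanking
    (@ReadingLevel INT, @AnnualIncome DECIMAL(12,2), @OwnsBoat BIT)
RETURNS INT
WITH SCHEMABINDING
AS
BEGIN
    -- Placeholder weights; corporate IT tweaks these in one place
    RETURN (@ReadingLevel * 2)
         + CASE WHEN @AnnualIncome > 100000 THEN 10 ELSE 0 END
         + CASE WHEN @OwnsBoat = 1 THEN 5 ELSE 0 END;
END;
GO
ALTER TABLE dbo.Prospects
ADD ProspectRanking AS dbo.fnProspectRanking(ReadingLevel, AnnualIncome, OwnsBoat);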
Make Sure You Are Querying Only Columns You Need
I have found computed columns to be very useful, even when not persisted, especially in an MVVM model where you are only fetching the columns you need for a specific view. As long as you don't put poorly performing logic in the computed column's code, you should be fine. The bottom line is that those computed (non-persisted) columns would have to be computed anyway if you are using that data.
When it Comes to Performance
For performance, you narrow your query to the rows and the computed columns you need. If you were putting an index on the computed column (which is allowed only when the expression is deterministic and precise), I would be cautious, because the execution engine might decide to use that index and hurt performance by computing those columns. Most of the time you are just getting a name or description from a join table, so I think this is fine.
Don't Brute Force It
The only time it wouldn't make sense to use a lot of computed columns is if you are using a single view-model class that captures all the data in all columns, including the computed ones. In this case, your performance will degrade with the number of computed columns and the number of rows you are selecting.
Computed Columns Work Great with an ORM
An object-relational mapper such as Entity Framework allows you to query a subset of the columns in your query. This works especially well using LINQ to Entities. By using computed columns you don't have to clutter your ORM class with mapped views for each of the model types.
var data = from e in db.Employees
           select new NarrowEmployeeView { Id = e.Id, Name = e.Name };
Only the Id and Name are queried.
var data = from e in db.Employees
           select new WiderEmployeeView { Id = e.Id, Name = e.Name, DepartmentName = e.DepartmentName };
Assuming DepartmentName is a computed column, the computed expression is executed only for the latter query.
Performance Profiler
If you use a performance profiler and filter on SQL queries, you can see that the computed columns are in fact ignored when they are not in the SELECT statement.
Computed columns can be appropriate if you plan to query by that information.
For instance, if you have a dataset that you are going to present in the UI, having a computed column allows you to page the view while still allowing sorting and filtering on the computed column. If that computed column existed only in code, it would be much more difficult to reasonably sort or filter the dataset for display based on that value.
A computed column is a business rule, and it's more appropriate to implement it on the client, not in the storage layer. A database is for storing and retrieving data, not for business-rule processing. The fact that it can do something doesn't mean you should do it that way. You are also free to jump off the Eiffel Tower, but it would be a bad decision :)