Database VIEW does not reflect the data in the underying TABLE - sql

Input:
The customer claims that the application (.NET) when querying for some data returns data different from when the customer looks into the data table directly.
I understand there could be various reasons and in completely different places. My purpose is not to solve it here, but rather to ask experienced DBAs and DB developers if:
Is it possible for a VIEW to show data that does not match the underlying TABLE(s)?
What are possible causes/reasons for this?
Can an UPDATE statement on a view cause future SELECTs to return 'updated' data, when the table really does not?
Possible causes (please comment on those with question-marks):
the reason is that there are two separate transactions, which would explain the customers' confusion.
the underlying table was altered, but the view was not refreshed (using sp_refreshview)
a different user is connecting and can see different data due to permissions ?
programmer error: wrong tables/columns, wrong filters (all-in-one here)
corruption occurs: DBCC CHECKDB should help
can SELECT ... FOR UPDATE cause this ???
? __
What really happened (THE ANSWER):
Column positions were altered in some tables: Apparently the customer gave full database access to a consultant for database usage analysis. That great guy changed the order of the columns to see the few audit fields at the beginning of the table when using SELECT * ... clauses.
Using dbGhost the database schema was compared to the schema of the backup taken few days before the problem appeared, and the column position differences were discovered.
What came next was nothing related to programming, but more an issue of politics.
Therefore the sp_refreshview was the solution. I just took one step more to find who caused the problem. Thank you all.

Yes, sort of.
Possible Causes:
The View needs to be refreshed or recompiled. Happens when source column definitions change and the View (or something it depends on) is using "*", can be nasty. Call sp_RefreshView. Can also happen because of views or functions (data sources) that it calls as well.
The View is looking at something different from what they/you think. They are looking at the wrong table or view.
The View is transforming the data in an unexpected way. It works right, just not like they expected.
The View is returning a different subset of the data than expected. Again, it works right, just not like they think.
They are looking at the wrong database/server or with a Logon/user identity that causes the View to alter what it shows. Particularly nefarious because unlike Management Studio, most client programs do not tell you what database/server they are pointed at.

it is possible if the underlying table has been changed and sp_refreshview has not been ran against the view, so the view will have missing columns if those were added to the table.
To see what I mean read how to make sure that the view will have the underlying table changes by using sp_refreshview

You can create views with locking hints which would mean you might be getting a dirty read. Or alternatively when they access the table directly, they might be using locking hints which could be giving them a dirty read at that point.
Another possibility that users don't seem to understand is that the data is fluid. The data you read at 3:00 in a view may not be the same data that you see at 3:30 looking directly at the table becasue there have been changes in the meantime.

A few possibilities:
Your .NET application may not be pointing to where you or they think it is pointing. For example, it's pointed to a test server by mistake
If the view has an index on a float or numeric value, the value may appear different from the underlying query due to rounding
The ANSI_NULLS setting is specific to the view when it was created. If it's different from the setting during the select(s) on the underlying tables it could cause discrepancies for certain kinds of queries
The underlying table structures have changed and the view hasn't been refreshed (especially a problem if it uses "SELECT *")
I'll edit this post if I think of any others.
EDIT: Here's an example of how the ANSI_NULLS setting can throw off your results:
SET ANSI_NULLS ON
DECLARE
#i INT,
#j INT
SET #i = NULL
SET #j = 1
SELECT
CASE WHEN #i <> #j THEN 'Not Equal' ELSE 'Equal' END
SET ANSI_NULLS OFF
SELECT
CASE WHEN #i <> #j THEN 'Not Equal' ELSE 'Equal' END
The results which you should receive are:
Equal
Not Equal

Assuming the view does not actually transform the data, technically it is possible if a corruption occurs. View retrieves data from one index, 'table' retrieves from another (ie. from clustered) and the two are out of sync. A DBCC CHECKDB should reveal the problem.
But human error is much more likely, ie. they are looking at different table than the view, or at different records.

For sure there are other things:
1) Derived attributes are pulling from wrong tables in the view
2) The view is using incorrect tables
3) incorrect or missing joins in the view
to name a few

Related

SQL Server, VIEW has mixed up data

My team and I are using Microsoft SQL Server 2014 - Standard Edition (64-bit)
We have created some views as we usually do, they work(ed) normally while we were coding and testing.
Then suddenly, one of our QA noticed that the data in the application was mixed up, for example, description data was in the name field, name was where sex was supposed to be, etc.
We cheched data in the Tables and it was correct, then we checked the view, querying the view like this SELECT * FROM VIEW and realized that the view had the data mixed up.. The next logic step was to check the view queries, for our surprise all the queries were correct. so what was happening?
Well, that is the question, why the data in a view is corrupt or mixed up if the queries within the view are correct and they were working well for long time?
We just ALTERED the view, not modifying anything and that fixed the issue.
But, we need to know the cause of the data corruption, because we don't want to monitor and alter views all the time.
VIEW CODE AS REQUESTED
ALTER VIEW [dbo].[pvvClient] AS
SELECT *
FROM Table
INNER JOIN Table 2 ON.....
The first thing that came to my mind was that the Table(s) has changed and that raised this behavior, do you think SCHEMABINDING can help to avoid this kind of issues
When you put * in the column list of a view and the underlying tables change your view will not automatically update to include the changed columns. In fact, if you delete a column you can get the data mixed up across columns. This has been discussed and documented many times. Aaron Bertrand has a great article covering this topic.
Bad habits to kick : using SELECT * / omitting the column list
Moral of the story, avoid using select * unless the select is inside an EXISTS.
It doesn't make sense that simply altering the view without changing any of the code would fix it, but an important point is this part of your view...
select pvtConsumidorFinanciero.*
If this table definition changes... that is, if more columns are added or some are removed, the columns in this view would also change. That is why it is good practice to never select * in a view, especially when querying another view.
Additionally, this table could have the same column names as other tables.
What also could have happened is in your application, you are select * from view. Again, if a DBA changed the view, this could mess up your application, so i would avoid it an explicitly list the columns you want returned in the order you want them returned.
I think this is is also part of the answer:
When you create views it is a good practice to use SCHEMABINDING, this way when you alter the table under the view, you are forced to review your view as well.

Will updating any field cause updating indexed view?

Here is a description in MSDN of the indexed view:"As modifications are made to the data in the base tables, the data modifications are reflected in the data stored in the indexed view".
But i'm confused now, will updating any field in base table cause the auto maintain of the indexed view, even this field isn't included in the definition of the the indexed view?
Thanks!
It's not documented, but I believe that some activity will occur.
From various hints in the documentation, I believe that SQL Server actually constructs something quite like a trigger on each base table. So the "trigger" will run for every update/insert/delete action on the table.
However, I believe that the trigger uses something like COLUMNS_UPDATED to check which columns have had actual update activity, and will exit early if no relevant columns were affected.
E.g. if you sent ANSI_NULLS to OFF (ON is required by indexed views), then any updates against columns not used by the view will work. Whereas any updates that mention a column used in the view (even if the SET is a no-op, e.g. Column1 = Column1), then you'll get an error message.

Can select * usage ever be justified?

I've always preached to my developers that SELECT * is evil and should be avoided like the plague.
Are there any cases where it can be justified?
I'm not talking about COUNT(*) - which most optimizers can figure out.
Edit
I'm talking about production code.
And one great example I saw of this bad practice was a legacy asp application that used select * in a stored procedure, and used ADO to loop through the returned records, but got the columns by index. You can imagine what happened when a new field was added somewhere other than the end of the field list.
I'm quite happy using * in audit triggers.
In that case it can actually prove a benefit because it will ensure that if additional columns are added to the base table it will raise an error so it cannot be forgotten to deal with this in the audit trigger and/or audit table structure.
(Like dotjoe) I am also happy using it in derived tables and column table expressions. Though I habitually do it the other way round.
WITH t
AS (SELECT *,
ROW_NUMBER() OVER (ORDER BY a) AS RN
FROM foo)
SELECT a,
b,
c,
RN
FROM t;
I'm mostly familiar with SQL Server and there at least the optimiser has no problem recognising that only columns a,b,c will be required and the use of * in the inner table expression does not cause any unnecessary overhead retrieving and discarding unneeded columns.
In principle SELECT * ought to be fine in a view as well as it is the final SELECT from the view where it ought to be avoided however in SQL Server this can cause problems as it stores column metadata for views which is not automatically updated when the underlying tables change and the use of * can lead to confusing and incorrect results unless sp_refreshview is run to update this metadata.
There are many scenarios where SELECT * is the optimal solution. Running ad-hoc queries in Management Studio just to get a sense of the data you're working with. Querying tables where you don't know the column names yet because it's the first time you've worked with a new schema. Building disposable quick'n'dirty tools to do a one-time migration or data export.
I'd agree that in "proper" development, you should avoid it - but there's lots of scenarios where "proper" development isn't necessarily the optimum solution to a business problem. Rules and best practices are great, as long as you know when to break them. :)
I'll use it in production when working with CTEs. But, in this case it's not really select *, because I already specified the columns in the CTE. I just don't want to respecify in the final select.
with t as (
select a, b, c from foo
)
select t.* from t;
None that I can think of, if you are talking about live code.
People saying that it makes adding columns easier to develop (so they automatically get returned and can be used without changing the Stored procedure) have no idea about writing optimal code/sql.
I only ever use it when writing ad-hoc queries that will not get reused (finding out the structure of a table, getting some data when I am not sure what the column names are).
I think using select * in an exists clause is appropriate:
select some_field from some_table
where exists
(select * from related_table [join condition...])
Some people like to use select 1 in this case, but it's not elegant, and it doesn't buy any performance improvements (early optimization strikes again).
In production code, I'd tend to agree 100% with you.
However, I think that the * more than justifies its existence when performing ad-hoc queries.
You've gotten a number of answers to your question, but you seem to be dismissing everything that isn't parroting back what you want to hear. Still, here it is for the third (so far) time: sometimes there is no bottleneck. Sometimes performance is way better than fine. Sometimes the tables are in flux, and amending every SELECT query is just one more bit of possible inconsistency to manage. Sometimes you've got to deliver on an impossible schedule and this is the last thing you need to think about.
If you live in bullet time, sure, type in all the column names. But why stop there? Re-write your app in a schema-less dbms. Hell, write your own dbms in assembly. That'd really show 'em.
And remember if you use select * and you have a join at least one field will be sent twice (the join field). This wastes database resources and network resources for no reason.
As a tool I use it to quickly refresh my memory as to what I can possibly get back from a query. As a production level query itself .. no way.
When creating an application that deals with the database, like phpmyadmin, and you are in a page where to display a full table, in that case using SELECT * can be justified, I guess.
About the only thing that I can think of would be when developing a utility or SQL tool application that is being written to run against any database. Even here though, I would tend to query the system tables to get the table structure and then build any necessary query from that.
There was one recent place where my team used SELECT * and I think that it was ok... we have a database that exists as a facade against another database (call it DB_Data), so it is primarily made up of views against the tables in the other database. When we generate the views we actually generate the column lists, but there is one set of views in the DB_Data database that are automatically generated as rows are added to a generic look-up table (this design was in place before I got here). We wrote a DDL trigger so that when a view is created in DB_Data by this process then another view is automatically created in the facade. Since the view is always generated to exactly match the view in DB_Data and is always refreshed and kept in sync, we just used SELECT * for simplicity.
I wouldn't be surprised if most developers went their entire career without having a legitimate use for SELECT * in production code though.
I've used select * to query tables optimized for reading (denormalized, flat data). Very advantageous since the purpose of the tables were simply to support various views in the application.
How else do the developers of phpmyadmin ensure they are displaying all the fields of your DB tables?
It is conceivable you'd want to design your DB and application so that you can add a column to a table without needing to rewrite your application. If your application at least checks column names it can safely use SELECT * and treat additional columns with some appropriate default action. Sure the app could consult system catalogs (or app-specific catalogs) for column information, but in some circumstances SELECT * is syntactic sugar for doing that.
There are obvious risks to this, however, and adding the required logic to the app to make it reliable could well simply mean replicating the DB's query checks in a less suitable medium. I am not going to speculate on how the costs and benefits trade off in real life.
In practice, I stick to SELECT * for 3 cases (some mentioned in other answers:
As an ad-hoc query, entered in a SQL GUI or command line.
As the contents of an EXISTS predicate.
In an application that dealt with generic tables without needing to know what they mean (e.g. a dumper, or differ).
Yes, but only in situations where the intention is to actually get all the columns from a table not because you want all the columns that a table currently has.
For example, in one system that I worked on we had UDFs (User Defined Fields) where the user could pick the fields they wanted on the report, the order as well as filtering. When building a result set it made more sense to simply "select *" from the temporary tables that I was building instead of having to keep track of which columns were active.
I have several times needed to display data from a table whose column names were unknown. So I did SELECT * and got the column names at run time.
I was handed a legacy app where a table had 200 columns and a view had 300. The risk exposure from SELECT * would have been no worse than from listing all 300 columns explicitly.
Depends on the context of the production software.
If you are writing a simple data access layer for a table management tool where the user will be selecting tables and viewing results in a grid, then it would seem *SELECT ** is fine.
In other words, if you choose to handle "selection of fields" through some other means (as in automatic or user-specified filters after retrieving the resultset) then it seems just fine.
If on the other hand we are talking about some sort of enterprise software with business rules, a defined schema, etc. ... then I agree that *SELECT ** is a bad idea.
EDIT: Oh and when the source table is a stored procedure for a trigger or view, "*SELECT **" should be fine because you're managing the resultset through other means (the view's definition or the stored proc's resultset).
Select * in production code is justifiable any time that:
it isn't a performance bottleneck
development time is critical
Why would I want the overhead of going back and having to worry about changing the relevant stored procedures, every time I add a field to the table?
Why would I even want to have to think about whether or not I've selected the right fields, when the vast majority of the time I want most of them anyway, and the vast majority of the few times I don't, something else is the bottleneck?
If I have a specific performance issue then I'll go back and fix that. Otherwise in my environment, it's just premature (and expensive) optimisation that I can do without.
Edit.. following the discussion, I guess I'd add to this:
... and where people haven't done other undesirable things like tried to access columns(i), which could break in other situations anyway :)
I know I'm very late to the party but I'll chip in that I use select * whenever I know that I'll always want all columns regardless of the column names. This may be a rather fringe case but in data warehousing, I might want to stage an entire table from a 3rd party app. My standard process for this is to drop the staging table and run
select *
into staging.aTable
from remotedb.dbo.aTable
Yes, if the schema on the remote table changes, downstream dependencies may throw errors but that's going to happen regardless.
If you want to find all the columns and want order, you can do the following (at least if you use MySQL):
SHOW COLUMNS FROM mytable FROM mydb; (1)
You can see every relevant information about all your fields. You can prevent problems with types and you can know for sure all the column names. This command is very quick, because you just ask for the structure of the table. From the results you will select all the name and will build a string like this:
"select " + fieldNames[0] + ", fieldNames[1]" + ", fieldNames[2] from mytable". (2)
If you don't want to run two separate MySQL commands because a MySQL command is expensive, you can include (1) and (2) into a stored procedure which will have the results as an OUT parameter, that way you will just call a stored procedure and every command and data generation will happen at the database server.

A SQL story in 2 parts - Are SQL views always good and how can I solve this example?

I'm building a reporting app, and so I'm crunching an awful lot of data. Part of my approach to creating the app in an agile way is to use SQL views to take the strain off the DB if multiple users are all bashing away.
One example is:
mysql_query("CREATE VIEW view_silverpop_clicks_baby_$email AS SELECT view_email_baby_position.EmailAddress, view_email_baby_position.days, silverpop_campaign_emails.id, silverpop_actions.`Click Name` , silverpop_actions.`Mailing Id`
FROM silverpop_actions
INNER JOIN view_email_baby_position ON (silverpop_actions.Email = view_email_baby_position.EmailAddress ) , silverpop_campaign_emails
WHERE silverpop_campaign_emails.id = $email
AND view_email_baby_position.days
BETWEEN silverpop_campaign_emails.low
AND silverpop_campaign_emails.high
AND silverpop_actions.`Event Type` = 'Click Through'") or die(mysql_error());
And then later in the script this view is used to calculate the number of clicks a particular flavour of this email has had.
$sql = "SELECT count(*) as count FROM `view_silverpop_clicks_baby_$email` WHERE `Click Name` LIKE '$countme%'";
My question is in 2 parts really:
Are views always good? Can you
have too many?
Could I create yet another set of
views to cache the count variable in
the second snippet of code. If so
how could I approach this? I can't
quite make this out yet.
Thanks!
To answer your questions.
1.) I don't know that I can think of an instance where views are BAD in and of themselves, but it would be bad to use them unnecessarily. Whether you can have too many really depends on your situation.
2.) Having another set of views will not cache the count variable so it wouldn't be beneficial from that standpoint.
Having said that, I think you have a misunderstanding on what a view actually does. A view is just a definition of a particular SQL statement and it does not cache data. When you execute a SELECT * FROM myView;, the database is still executing the select statement defined in the CREATE VIEW definition just as it would if a user was executing that statement.
Some database vendors offer a different kind of view called a materialized view. In this case the table data needed to create the view is stored/cached and is usually updated based on a refresh rate specified when you create it. This is "heavy" in the sense that your data is stored twice, but can create better execution plans because the data is already joined, aggregated, etc. Note though, you only see the data based on the last refresh of the materialized view, where with a normal view you see the data as it currently exists in the underlying tables. Currently, MySQL does not support materialized views.
Some useful uses of views are to:
Create easier/cleaner SQL statements for complex queries (which is something you are doing)
Security. If you have tables where you want a user to be able to see some columns or rows, but not other columns/rows, you restrict access to the base table and create a view of the base table that only selects the columns/rows that the user should have access too.
Create aggregations of tables
Views are used by query optimizer so they often help in querying for information more efficiently.
Indexed or materialized views however create a table with the required information which can make quite a difference. Think of it as denormalization of you db scheme without changing existing scheme. You get best of both worlds.
Some views are never used so they represent needles compexity -which is bad.
Indexed views cannot reference other views (mssql) so there's hardly a point in creating such view.

Date ranges in views - is this normal?

I recently started working at a company with an enormous "enterprisey" application. At my last job, I designed the database, but here we have a whole Database Architecture department that I'm not part of.
One of the stranger things in their database is that they have a bunch of views which, instead of having the user provide the date ranges they want to see, join with a (global temporary) table "TMP_PARM_RANG" with a start and end date. Every time the main app starts processing a request, the first thing it does it "DELETE FROM TMP_PARM_RANG;" then an insert into it.
This seems like a bizarre way of doing things, and not very safe, but everybody else here seems ok with it. Is this normal, or is my uneasiness valid?
Update I should mention that they use transactions and per-client locks, so it is guarded against most concurrency problems. Also, there are literally dozens if not hundreds of views that all depend on TMP_PARM_RANG.
Do I understand this correctly?
There is a view like this:
SELECT * FROM some_table, tmp_parm_rang
WHERE some_table.date_column BETWEEN tmp_parm_rang.start_date AND tmp_parm_rang.end_date;
Then in some frontend a user inputs a date range, and the application does the following:
Deletes all existing rows from
TMP_PARM_RANG
Inserts a new row into
TMP_PARM_RANG with the user's values
Selects all rows from the view
I wonder if the changes to TMP_PARM_RANG are committed or rolled back, and if so when? Is it a temporary table or a normal table? Basically, depending on the answers to these questions, the process may not be safe for multiple users to execute in parallel. One hopes that if this were the case they would have already discovered that and addressed it, but who knows?
Even if it is done in a thread-safe way, making changes to the database for simple query operations doesn't make a lot of sense. These DELETEs and INSERTs are generating redo/undo (or whatever the equivalent is in a non-Oracle database) which is completely unnecessary.
A simple and more normal way of accomplishing the same goal would be to execute this query, binding the user's inputs to the query parameters:
SELECT * FROM some_table WHERE some_table.date_column BETWEEN ? AND ?;
If the database is oracle, it's possibly a global temporary table; every session sees its own version of the table and inserts/deletes won't affect other users.
There must be some business reason for this table. I've seen views with dates hardcoded that were actually a partioned view and they were using dates as the partioning field. I've also seen joining on a table like when dealing with daylights saving times imagine a view that returned all activity which occured during DST. And none of these things would ever delete and insert into the table...that's just odd
So either there is a deeper reason for this that needs to be dug out, or it's just something that at the time seemed like a good idea but why it was done that way has been lost as tribal knowledge.
Personally, I'm guessing that it would be a pretty strange occurance. And from what you are saying two methods calling the process at the same time could be very interesting.
Typically date ranges are done as filters on a view, and not driven by outside values stored in other tables.
The only justification I could see for this is if there was a multi-step process, that was only executed once at a time and the dates are needed for multiple operations, across multiple stored procedures.
I suppose it would let them support multiple ranges. For example, they can return all dates between 1/1/2008 and 1/1/2009 AND 1/1/2006 and 1/1/2007 to compare 2006 data to 2008 data. You couldn't do that with a single pair of bound parameters. Also, I don't know how Oracle does it's query plan caching for views, but perhaps it has something to do with that? With the date columns being checked as part of the view the server could cache a plan that always assumes the dates will be checked.
Just throwing out some guesses here :)
Also, you wrote:
I should mention that they use
transactions and per-client locks, so
it is guarded against most concurrency
problems.
While that may guard against data consistency problems due to concurrency, it hurts when it comes to performance problems due to concurrency.
Do they also add one -in the application- to generate the next unique value for the primary key?
It seems that the concept of shared state eludes these folks, or the reason for the shared state eludes us.
That sounds like a pretty weird algorithm to me. I wonder how it handles concurrency - is it wrapped in a transaction?
Sounds to me like someone just wasn't sure how to write their WHERE clause.
The views are probably used as temp tables. In SQL Server we can use a table variable or a temp table (# / ##) for this purpose. Although creating views are not recommended by experts, I have created lots of them for my SSRS projects because the tables I am working on do not reference one another (NO FK's, seriously!). I have to workaround deficiencies in the database design; that's why I am using views a lot.
With the global temporary table GTT approach that you comment is being used here, the method is certainly safe with regard to a multiuser system, so no problem there. If this is Oracle then I'd want to check that the system either is using an appropriate level of dynamic sampling so that the GTT is joined appropriately, or that a call to DBMS_STATS is made to supply statistics on the GTT.