Find which table is causing duplicate rows in a view (SQL)

I have a view in SQL Server which should return one row per project, but a few of the projects have multiple rows. The view joins a lot of tables, so I would rather not run a script manually against each table to find the one causing the duplicates. Is there a quick, automated way to find out which table is the problem (i.e. the one with duplicate rows)?

The quickest way I've found is:
1) find an example dupe
2) copy out the query
3) comment out all the joins
4) add the joins back one at a time until you get another row
Whichever join you were restoring when the dupes reappeared is the one matching multiple records.
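A minimal sketch of that process (all table, column, and key values here are hypothetical):
SELECT p.project_id
FROM projects p
-- JOIN tasks t  ON t.project_id = p.project_id
-- JOIN owners o ON o.project_id = p.project_id
WHERE p.project_id = 42; -- a project known to duplicate
-- Re-add one join per run; the first join that makes this return
-- more than one row is the one fanning out your results.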

My technique is to make a copy of the view and modify it to return every column from every table in the order of the FROM clause, with an extra marker column between each table's columns, aliased with that table's name (see example below). Then select a few rows and slowly scan to the right until you find the table whose data does NOT repeat across the duplicate rows; that is the one causing the dupes.
SELECT
TableA = '----------', TableA.*,
TableB = '----------', TableB.*
FROM ...
This is usually a very fast way to find the culprit. The problem with commenting out joins is that you then also have to comment out the matching columns in the SELECT clause each time.

I used a variation of SpectralGhost's technique to get this working, even though neither method really solves the problem of avoiding a manual check of each table for duplicate rows.
My variation was to comment out the joins with a divide-and-conquer approach instead of commenting out each one individually. Given the sheer number of joins, this was much faster.
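Either way, the step both techniques converge on is a per-table check of the join key. A hedged sketch of that check (names hypothetical):
SELECT join_key, COUNT(*) AS cnt
FROM suspect_table
GROUP BY join_key
HAVING COUNT(*) > 1; -- any rows returned mean this table matches more than once per project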

Related

Does number of columns in a table affect performance if you don't use them in a join?

I have a table with 49 columns and I need to add two more. I could create another table, related to the first, to hold the new columns and avoid making the table bigger and bigger.
However, I would like to know how much having 2 more columns in that table would affect performance if they are not used in joins.
Is there really a performance difference when joining table A with table B if A has 4 columns versus 100, when you only use 3 of them?
Also, the table is not highly populated (it doesn't even have 500 rows), but I would still like to know, since the DBA doesn't like the idea and I want to understand his point of view.
Thanks.
EDIT:
I'll edit to explain that my only change to this table is to add 2 more columns to the existing 49, and that they will be bit columns. That's why I wanted to know whether increasing the column count would impact performance at all, assuming nobody does a SELECT * when joining to that table.
I think the best answer hinges on one question: will these new columns be empty most of the time for your rows?
Yes: create a new table and join. An empty column on every row is just wasted disk space.
No: you can probably add the columns to the main table, particularly if you need them most of the time you select rows from it.
NB: 50 columns seems horrible anyway...
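For reference, a sketch of the two designs being weighed (all names assumed; bit columns per the question, so T-SQL syntax):
-- Option A: add the flags in place
ALTER TABLE main_table ADD flag_one BIT NULL, flag_two BIT NULL;

-- Option B: a 1:1 extension table, joined on the primary key
CREATE TABLE main_table_ext (
    main_id  INT PRIMARY KEY REFERENCES main_table (main_id),
    flag_one BIT,
    flag_two BIT
);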
Adding these two columns to your table should not significantly impact performance, particularly as your table stores fewer than 500 rows. That said, your DBA probably objects because it doesn't follow best practices for table design, particularly if many column values will be NULL/empty, and such a design does not scale well. However, unless you anticipate that this table will grow rapidly, adding two columns should not pose a performance problem.
If you add another table, then I assume you will have to use joins to access that data properly. That could easily end up costing more than the two new attributes added to the single table would.
If your table can usefully be refactored, then that is the best option, but if not, you would only lose efficiency by attempting it. Don't make a second table simply to stay under 50 attributes. Two columns added to 49 is not an unworkable load by any measure, but there could be other reasons to redesign your table. If you have a lot of empty or NULL cells, you are wasting resources and giving your system more work to do; finding a way to eliminate those would undoubtedly have a greater effect on performance than adding a column or two.

Should I use DISTINCT in my queries?

Where I work, I have recently been told that using DISTINCT in your queries is a bad sign in a programmer. So I am wondering; I guess the only way to avoid it is to use a GROUP BY.
It was my understanding that DISTINCT works very similarly to GROUP BY, differing mainly in how it reads: DISTINCT compares each selected row as a whole to remove duplicates, while GROUP BY achieves the same result by grouping on the listed columns.
Keep in mind I only do reporting; I do not create or alter the data. So my question is: as a best practice, should I use DISTINCT or GROUP BY? If neither, is there an alternative? Maybe GROUP BY belongs in more complex queries than my made-up example here, but you get the idea. I could not find an answer that really explained why I should or should not use DISTINCT in my queries.
select distinct
spriden_user_id as "ID",
spriden_last_name as "last",
spriden_first_name as "first",
spriden_mi_name as "MI",
spraddr_street_line1 as "Street",
spraddr_street_line2 as "Street2",
spraddr_city as "city",
spraddr_stat_code as "State",
spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'
VS
select
spriden_user_id as "ID",
spriden_last_name as "last",
spriden_first_name as "first",
spriden_mi_name as "MI",
spraddr_street_line1 as "Street",
spraddr_street_line2 as "Street2",
spraddr_city as "city",
spraddr_stat_code as "State",
spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'
group by "ID","last","first","MI","Street","Street2","city","State","zip"
Databases are smart enough to recognize what you mean, and I expect both of your queries to perform equally well. What matters is that someone else maintaining your query knows what you meant: if you really meant to retrieve distinct records, use DISTINCT; if your intention was to do aggregation, use GROUP BY.
The answer provided by @zedfoxus is useful to understand the context.
However, I don't believe your query should require distinct records if the data is designed correctly.
It appears you are selecting the primary key of table spriden, so all of that data should be unique. You're also joining onto the spraddr table; does that table really contain valid duplicate data? Or is there perhaps an additional join criterion required to filter out those duplicates?
This is why I get nervous about the use of "distinct": the spraddr table may include additional columns which you should be using to filter the data, and "distinct" may be hiding that.
Also, you may be generating a massive result set which the "distinct" operation then has to collapse, which can cause performance issues. For instance, there might be a million rows in spraddr for each row in spriden, when you should be using an "is_current" flag to find the 2 or 3 "real" ones.
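For example, if such a flag exists (the column name here is an assumption), filtering on it removes the need for DISTINCT altogether:
SELECT spriden_user_id, spraddr_street_line1
FROM spriden
JOIN spraddr ON spraddr_id = spriden_user_id
WHERE spraddr_mail_type = 'MA'
  AND spraddr_is_current = 'Y' -- hypothetical "current address" flag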
Finally, I get nervous when I see "group by" used as a substitute for distinct, not because it's "wrong", but because stylistically, I believe group by should be used for aggregate functions. That's just a personal preference.
In your example, distinct and group by do the same thing. I think your colleagues mean that your query should not return duplicates in the first place, and that you should be able to write it without a distinct or group by clause. You may be able to eliminate the duplicates by extending your join conditions.
Ask them why it is a bad practice. A lot of people make up rules, or decide something is bad practice, after reading the first page of a book or the first result of a Google search. If it does the job and doesn't cause any issues, there is no reason to create more work by finding alternatives. Of the two options you posted, I would use distinct too, because it's shorter and easier to read and maintain.
Whoever told you using DISTINCT is a bad sign in itself is wrong. In reality, it all depends on what problem you are trying to solve by using DISTINCT in the first place.
If you're querying a table that is expected to have repeated values of some field or combination of fields, and you're reporting a list of the values or combinations of values (and not performing any aggregations on them), then DISTINCT is the most sensible thing to use. It doesn't really make sense in my mind to use GROUP BY instead just because somebody thinks DISTINCT shouldn't be used. Indeed, I think this is the kind of thing DISTINCT is designed for.
If OTOH you've found that your query has a bug meaning that repeated values are being returned, you shouldn't use either DISTINCT or GROUP BY to cancel out this bug. Rather, you should figure out the cause of the bug and fix it.
Using DISTINCT as a safety net is also poor practice, as it potentially hides problems, and furthermore it can be computationally expensive (typically O(n log n) or O(n²)). In this scenario, I can't see that using GROUP BY instead would help you.
Yes, DISTINCT tends to raise a little alarm in my head when I come across it in someone's query. It is required in some cases, of course, but most data models should not require it; it tends to be a last resort or an outlier case. It may also be symptomatic of a bad application sitting on top of the database, one that allows duplicate entries to be inserted or rows to be updated into duplicates (with, likewise, no database-level constraints to prevent such actions). So the first thing to check is the data: it could be a sign of bad data-model design. But most likely the query should never reach a stage in the select where duplicate rows are lingering.
When constructing a large query, I normally start with the nugget of a subquery that specifies the unique fields; every subquery after that must inner join or left join onto it, but must never add to or reduce the number of rows already defined by the nugget query (remembering to handle the possible NULLs of the left joins).
So, for example, the nugget query could also select the right rows by using partitions, e.g. to pick the most recent row of a joined table, or to do some other grouping at that stage.
In your example, I would not expect duplicates. If a person can have historical addresses, fine, but then do you need to see all addresses, or only the most recent? And if there were duplicate addresses for the same person, does that mean incorrectly duplicated data, or that the person left the address and later returned to it? In that case a partitioned select would fix it with much better control than a distinct, especially when someone later adds fields to the query and breaks the distinct-ness.
All other data then hangs off this nugget of a subquery: you stick the other fields onto the right of that core set of fields.
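A sketch of that nugget pattern applied to the question's tables, keeping only the most recent address per person (the ordering column is an assumption):
WITH nugget AS (
    SELECT spriden_user_id, spraddr_street_line1, spraddr_city,
           ROW_NUMBER() OVER (
               PARTITION BY spraddr_id
               ORDER BY spraddr_from_date DESC -- hypothetical date column
           ) AS rn
    FROM spriden
    JOIN spraddr ON spraddr_id = spriden_user_id
    WHERE spraddr_mail_type = 'MA'
)
SELECT spriden_user_id, spraddr_street_line1, spraddr_city
FROM nugget
WHERE rn = 1;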
If distincts are a last resort, they are typically reserved for cases where the data is known to have duplicate entries in that table for that set of fields, and that is perfectly normal. In my head, though, a distinct is a slow, post-select step in the plan, especially when a large result set is being returned. I ought to verify that one of these days.
Provided your queries are correct, DISTINCT and GROUP BY produce the same result set, but your colleagues are right that DISTINCT hides problems. If you are missing a join and using GROUP BY, you'll get back more information than you're expecting. If you are missing a join and using DISTINCT, the SQL engine will perform an unbounded (or partially bounded) join, narrow the results down, and still come up with the expected answer.
Beyond the obvious performance degradation of generating more data than is necessary, you also run the risk of filling your tempdb (i.e.: running out of room on the hard drive where your tempdb lives).
Use GROUP BY in production.

Determine if a SQL Insert/Update statement affects the result from a stored Select Statement

Thought this would be a good place to ask for some "brainstorming." Apologies if it's a little broad/off subject.
I was wondering if anyone here had any ideas on how to approach the following problem:
First assume that I have a select statement stored somewhere as an object (this can be the tree form of the query). For example (for simplicity):
SELECT A, B FROM table_A WHERE A > 10;
It's easy to determine the below would change the result of the above query:
INSERT INTO table_A (A,B) VALUES (12,15);
But given any possible insert/update/whatever statement, as well as any possible starting select (we know all the selects and can analyze them at length), I'd like to determine whether it would affect the result of the select statement.
It's fine to assume that there won't be any "outside" queries, and that we know about all the queries being sent to the DB. It is also assumed we know the DB schema.
No, this isn't for homework. Just a brain teaser I've been thinking about and started to get stuck on (obviously, SQL can get very complicated.)
Based on the reply to the comment, I'd say that without additional criteria, this ranges between very hard and impossible.
Very hard (leastways, it would be for me) because you'd have to write something to parse and interpret your SQL statements into a workable frame of reference for your goals. Doable, but can it be worth the effort?
Impossible because some queries transcend phrases like "Byzantinely complex". (Think nested queries, correlated subqueries, views, common table expressions, triggers, outer joins, and who knows what else.) Without criteria such as "no subqueries, no views or triggers, no more than X joins" and so forth, the problem becomes open-ended enough to look NP-complete.
My first thought would be to put a trigger on table_A which, whenever any of the columns you're interested in (col A in this case) changes to meet (or no longer meet) the condition (> 10 here), records that an "affecting" change has taken place.
E.g. have another little table to record a "last update" timestamp, into which the trigger pops a getdate() when it detects such a change.
Then, you could check that table to see if the timestamp has changed since the last time you ran the select query - if it has, then you know you need to re-run it, if it hasn't, then you know the results would be the same.
The table could hold many such timestamps (one per row, perhaps with the table/trigger name as a key value in another column) to service many such triggers.
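A minimal sketch of that arrangement (all names hypothetical; T-SQL, since the answer mentions getdate()):
CREATE TABLE query_watch (
    watch_name  VARCHAR(50) PRIMARY KEY,
    last_change DATETIME
);
GO
CREATE TRIGGER trg_table_A_watch ON table_A
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- Only rows that meet, or previously met, the A > 10 condition matter.
    IF EXISTS (SELECT 1 FROM inserted WHERE A > 10)
       OR EXISTS (SELECT 1 FROM deleted WHERE A > 10)
        UPDATE query_watch
        SET last_change = GETDATE()
        WHERE watch_name = 'select_on_table_A';
END;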
Advantage? Being done in a trigger on the table means no risk of a change that could affect the select statement being missed.
Disadvantage? I guess depending on how your select statements come into existence, you might have an undesirable/unmanageable overhead in creating the trigger(s).

Complexity comparison: temporary table + index creation versus multi-table group by without an index

I have two potential roads to take on the following problem; the try-it-and-see methodology won't pay off for this solution, as the load on the server is constantly in flux. The two approaches are as follows:
select *
from
(
select foo.a,bar.b,baz.c
from foo,bar,baz
-- updated for clarity's sake
where foo.a = bar.b
and bar.b = baz.c
)
group by a,b,c
versus
create table results as
select foo.a,bar.b,baz.c
from foo,bar,baz
where foo.a = bar.b
and bar.b = baz.c;
create index results_spanning on results(a,b,c);
select * from results group by a,b,c;
So, in case it isn't clear: the top query performs the group by directly against the multi-table select, which prevents me from using an index. The second approach stores the results of the query in a new table, creates a spanning index, and then runs the group by query so that it can use the index.
What is the complexity difference between these two approaches, i.e. how do they scale, and which is preferable for large quantities of data? The main issue is the performance of the overall select, and that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three columns indexed in their own right? How often do you want to run the query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table, as it is only valid for a brief moment in time; so yes, this table will only be queried against once.
If your query is executed frequently and is unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table" without the overhead of creating a real table every time.
You'll need to refresh the materialized view (preferably a fast refresh, if the tables are large), either on commit or on demand. There are some restrictions on how you can create on-commit, fast-refreshable views, and they add slightly to your commit-time processing, but they always give the same result as running the base query. On-demand MVs become stale as the underlying data changes, until they are refreshed; you'll need to determine whether that is acceptable.
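A sketch of the on-commit variant (Oracle syntax, names assumed; a fast-refreshable join MV needs materialized view logs on the base tables and their ROWIDs in its select list):
CREATE MATERIALIZED VIEW LOG ON foo WITH ROWID;
-- ...and likewise on bar and baz

CREATE MATERIALIZED VIEW results_mv
REFRESH FAST ON COMMIT AS
SELECT foo.rowid AS foo_rid, bar.rowid AS bar_rid, baz.rowid AS baz_rid,
       foo.a, bar.b, baz.c
FROM foo, bar, baz
WHERE foo.a = bar.b
  AND bar.b = baz.c;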
So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally only apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place? The query as posted doesn't do any aggregation, so the only reason for the GROUP BY would be to eliminate duplicate rows, i.e. a DISTINCT operation. If that is actually the case, then you are doing some form of Cartesian join, and one way to tune the query would be to fix the WHERE clause so that it only returns distinct records.

Oracle bug? SELECT returns no dupes, INSERT from SELECT has duplicate rows

I'm getting some strange behaviour from an Oracle instance I'm working on. This is 11gR1 on Itanium, no RAC, nothing fancy. Ultimately I'm moving data from one Oracle instance to another in a data warehouse scenario.
I have a semi-complex view running over a DB link; 4 inner joins over large-ish tables and 5 left joins over mid-size tables.
Here's the problem: when I test the view in SQL Developer (or SQL*Plus) it seems fine, no duplication whatsoever. However, when I actually use the view to insert data into a table I get a large number of dupes.
EDIT: The data is going into an empty table. All of the tables in the query are on the database link. The only thing passed into the query is a date (e.g. INSERT INTO target SELECT * FROM view WHERE view.datecol = dQueryDate).
I've tried adding a ROW_NUMBER() function to the select statement, partitioned by the view's PK. Run as a plain select, all rows come back numbered 1; run as an insert, the same statement generates the same dupes as before, now conveniently numbered. The number of duplicated rows is not the same per key: some records exist 4 times, some only once.
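The check looked roughly like this (the PK and date column names are placeholders):
SELECT v.*,
       ROW_NUMBER() OVER (PARTITION BY v.pk_col ORDER BY v.pk_col) AS rn
FROM my_view v
WHERE v.datecol = dQueryDate;
-- run as a SELECT, every rn is 1; run as INSERT ... SELECT, the
-- duplicates appear anyway, now conveniently numbered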
I find this behaviour extremely perplexing. :) It reminds me of working with Teradata, where you have SET tables (unique rows only) and MULTISET tables (duplicate rows allowed), but Oracle has no such functionality.
A select that returns rows to the client should behave identically to one that inserts those rows to another location. I can't imagine a legitimate reason for this to happen, but maybe I'm suffering from a failure of imagination. ;)
I wonder if anyone else has experienced this or if it's a bug on this platform.
SOLUTION
Thanks to @Gary, I was able to get to the bottom of this by using "EXPLAIN PLAN FOR {my query};" and "SELECT * FROM TABLE(dbms_xplan.display);". The plan actually used for the INSERT is very different from the one for the SELECT.
For the SELECT most of the plan operations are 'TABLE ACCESS BY INDEX ROWID' and 'INDEX UNIQUE SCAN'. The 'Predicate Information' block contains all of the joins and filters from the query. At the end it says "Note - fully remote statement".
For the INSERT there is no reference to the indexes. The 'Predicate Information' block is just three lines and a new 'Remote SQL' block shows 9 small SQL statements.
The database has split my query into 9 subqueries and then attempts to join them locally. By running the smaller selects I've located the source of the duplicates.
I believe this is a bug in the Oracle compiler's handling of remote links: it introduces logical flaws when rewriting the SQL. Basically, the compiler is not properly applying the WHERE clause. I just tested it with an IN list of 5 keys: the SELECT brings back 5 rows, while the INSERT puts 77,000+ rows into the target, totally ignoring the IN list.
{Still looking for a way to force the correct behaviour, I may have to ask for the view to be created on the remote database although that is not ideal from a development viewpoint. I'll edit this when I've got it working…}
This seems to be an Oracle bug; we have found the following workaround:
If you want your "insert into ... select ..." to behave like your plain "select ...", you can wrap the select in a subselect.
For example:
select x, y, z from table1, table2 where ...
--> no duplicates
insert into example_table
select x, y, z from table1, table2 where ...
--> duplicates
insert into example_table
select * from (
select x, y, z from table1, table2 where ...
)
--> no duplicates
One thing that comes to mind is that the optimizer will generally prefer a FIRST_ROWS plan for a SELECT, to give rows back to the caller early, whereas an INSERT...SELECT will prefer an ALL_ROWS plan, as it has to deliver the full dataset.
I'd check the query plans using DBMS_XPLAN.DISPLAY_CURSOR (using the sql_id from V$SQL).
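For example (the SQL text filter is only a placeholder for locating your statement):
SELECT sql_id, sql_text
FROM v$sql
WHERE sql_text LIKE 'INSERT INTO target%';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', NULL));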
I have a semi-complex view running over a DB link; 4 inner joins over large-ish tables and 5 left joins over mid-size tables.
...
All of the tables in the query are on the database link
Again, a potential trouble spot. If all the tables in the SELECT were at the other end of the DB link, the whole query would be sent to the remote database and the result set returned. Once you throw the INSERT in, it is more likely that the local database will take charge of the query and pull all the data from the child tables over. But that may depend on whether the view is defined in the local database or the remote one. In the latter case, as far as the local optimizer is concerned there is just one remote object to get data from, and the remote database will do the join.
What happens if you just go to the remote DB and do the INSERT on a table there ?
This is a bug in Oracle's handling of joins over DB links. I have a simpler situation which does not involve an INSERT versus a SELECT: if I run my query remotely I get duplicate rows, but if I run it locally I do not. The only difference between the queries is the "@..." database-link suffix appended to the tables in the remote query. I am querying a 9i database from a 10.2 database using Oracle SQL Developer 3.0.
This is even more stupid than that Oracle bug which prevents you from joining tables with more than 1000 total columns, which is VERY easy to hit when querying an ERP system. And no, the error message says nothing about tables having too many columns.
It's almost as stupid as that other Oracle database bug that prohibits querying tables containing LOB locators using ANSI syntax. Only Oracle syntax works!
Several options occur to me.
1) The dupes you see were already in the destination table?
2) If your select references the table you are inserting into, then the insert may be interacting with the select in your combined
INSERT ... SELECT ... FROM ...
in such a way (Cartesian products?) as to create the duplicates.
I can't help but think that maybe you are experiencing a side-effect from something else related to the table. Are there any triggers which may be manipulating data?
How did you determine that there are no dupes in the original table?
As others have noted, this seems to be the simplest explanation for this strange behaviour.
Check your JOINs carefully. You may have no duplicates in the individual tables, but underspecified joins can cause inadvertent CROSS JOINs, so your result set contains duplicates due to multiplicity; when inserted, these violate a uniqueness constraint in your destination table.
What I do in this case is to nest the query in a view or CTE and try to detect the duplicates straight from the SELECT:
WITH resultset AS (
-- blah, blah
)
SELECT a, b, c, COUNT(*)
FROM resultset
GROUP BY a, b, c
HAVING COUNT(*) > 1
I would suggest getting a plan for the query you are running and looking for a CARTESIAN JOIN in it. That could indicate a missing condition that is causing the duplicated rows.
As @Pop has already suggested, this behaviour could happen if you are using a different login in SQL*Plus from the login your insert runs under (that is, if the other login has a table/view/synonym with the same name).