I'm encountering a very strange problem concerning what appears to be a corrupt index of some kind. Not corrupt in the sense that dbcc checkdb will pick it up, but corrupt in the sense that it has rows that it shouldn't have.
I have two tables, TableA and TableB. For the purposes of my application, some rows are considered functionally duplicate, meaning while not all the column values are the same, the row is treated as a dup by my app. To filter these out, I created a view, called vTableAUnique. The view is defined as follows:
SELECT a.*
FROM TableA a
INNER JOIN
(
SELECT ID, ROW_NUMBER() OVER
(PARTITION By Col1
ORDER BY Col1) AS Num
FROM TableA
) numbered ON numbered.ID = a.ID
WHERE numbered.Num = 1
The view returns one row from TableA for each distinct value of Col1, so the functional duplicates are filtered out. For this example, let's say that TableA has 10 total rows, but only 8 distinct values of Col1, so 8 rows show up in vTableAUnique.
TableB is basically just a list of values that match the values of Col1 from TableA. In this case, let's say that TableB has all 8 unique values that appear in vTableAUnique. So the data from TableA, TableB, and vTableAUnique would look like:
TableA (ID, Col1, Col2, Col3)
1,A,X,X
2,A,X,X
3,B,X,X
4,A,X,X
5,E,X,X
6,F,X,X
7,G,X,X
8,H,X,X
9,I,X,X
10,J,X,X
TableB (ID)
A
B
C
D
E
F
G
H
I
J
vTableAUnique (ID, Col1, Col2, Col3)
1,A,X,X
3,B,X,X
5,E,X,X
6,F,X,X
7,G,X,X
8,H,X,X
9,I,X,X
10,J,X,X
So here is the strange part. Sometimes when I join vTableAUnique with TableB on Col1, I get back the non-distinct values from TableA. In other words, rows that do NOT exist in vTableAUnique, but that do exist in TableA, appear when I do the join. If I do the select just off vTableAUnique, I don't get these rows. In this case, I would get back not just rows with the ids of 1,3,5,6,7,8,9,10, but ALSO rows with the ids of 2 and 4!
After banging my head against my desk, I decided to try rebuilding all the indexes in the DB. Sure enough, the problem disappeared. The same query now returned the correct rows. After an indeterminate period of time, however, the problem comes back. DBCC CHECKDB doesn't show any issues, and I'm having a hard time tracking down which index might be causing this.
I'm using SQL Server 2008 Developer Edition on Vista x64.
HELP!
ROW_NUMBER() OVER (PARTITION By Col1 ORDER BY Col1)
is not a stable sort order: every row within a partition has the same Col1, so all of them tie and the numbering can change from query to query depending on the access path.
Your view may return different results each time it is run.
Rebuilding the indexes seems to affect the sort order.
Use this:
ROW_NUMBER() OVER (PARTITION By Col1 ORDER BY Id)
instead. As long as Id is unique, it guarantees a stable sort order.
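To make the fix concrete, here is a minimal sketch that rebuilds the example data from the question and defines the view with the Id tie-breaker. It uses Python's built-in sqlite3 module rather than SQL Server (an assumption made purely so the example is self-contained, and it needs SQLite 3.25 or newer for window functions), so it only illustrates the deterministic numbering and cannot reproduce the original flakiness.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (ID INTEGER PRIMARY KEY, Col1 TEXT, Col2 TEXT, Col3 TEXT);
CREATE TABLE TableB (ID TEXT PRIMARY KEY);

INSERT INTO TableA VALUES
  (1,'A','X','X'), (2,'A','X','X'), (3,'B','X','X'), (4,'A','X','X'),
  (5,'E','X','X'), (6,'F','X','X'), (7,'G','X','X'), (8,'H','X','X'),
  (9,'I','X','X'), (10,'J','X','X');
INSERT INTO TableB VALUES
  ('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('I'),('J');

-- ORDER BY ID breaks ties inside each Col1 partition, so the row
-- numbered 1 is always the one with the lowest ID.
CREATE VIEW vTableAUnique AS
SELECT a.*
FROM TableA a
INNER JOIN (
    SELECT ID, ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY ID) AS Num
    FROM TableA
) numbered ON numbered.ID = a.ID
WHERE numbered.Num = 1;
""")

# Join the view with TableB on Col1, as in the question.
joined_ids = [row[0] for row in conn.execute(
    "SELECT v.ID FROM vTableAUnique v "
    "JOIN TableB b ON b.ID = v.Col1 ORDER BY v.ID"
)]
print(joined_ids)
```

With the tie-breaker in place, rows 2 and 4 can never be numbered 1, so they cannot leak into the join.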
Script out the indexes and look at the script. Was any of them created with ALLOW_DUP_ROW? If so, that could be your problem.
Related
I'm trying to copy over all rows from one table into another that are distinct on one column (Using a Postgresql database). I know that this can be done like so:
INSERT INTO table2(col1, col2, col3, ...)
SELECT
DISTINCT ON (col1) col1, col2, col3, ...
FROM table1;
The problem I'm having is that table1 has 100+ columns and so I don't want to write out all of the column names. I tried to do something like:
INSERT INTO table2 (*)
SELECT
DISTINCT ON (col1) *
FROM table1;
which resulted in a syntax error. Could someone please provide a code snippet with the correct syntax?
If the columns exactly line up, you can use:
INSERT INTO table2
SELECT DISTINCT ON (col1) t1.*
FROM table1 t1
ORDER BY col1;
Very importantly: When using DISTINCT ON, you should always have an ORDER BY, where the keys for the ORDER BY match the expressions in parentheses.
Leaving out the explicit columns in the INSERT is dangerous -- precisely because there might be some slip-up (columns out of order or a different number of columns). Sometimes when you are writing scripts and you know that the destination table really does match the source table, though, it can be handy.
Let's assume we have the following table:
test (col1, col2)
1,"a"
2,"b"
3,"c"
4,"a"
In short, there are unique ids in col1 and some non-unique corresponding values in col2.
Say we want to find the rows where col2 values are not unique, e.g. in this example such rows are 1 and 4.
So I found the following cryptic-looking (for me) code that does the job (test is the name of the table above):
SELECT *
FROM test a
WHERE col2 IN (SELECT col2 FROM test b WHERE b.col1 <> a.col1);
Sure, one way to do the task is to group by col2 and filter out those values whose count(col1) equals 1, but what concerns me is not the task at hand but rather how the WHERE clause works in this context.
I am aware of how tables are explicitly joined with JOINs, and I also understand the common use of the WHERE clause, like WHERE somecol != value. Yet the way WHERE somecol != othercol works in this context is beyond me.
Could someone give me a clue as to how the code above works?
Maybe the question is stupid, sorry if that is the case.
Thanks!
edit:
Execution analysis here
In the absence of indexes, such a where clause is generally going to be implemented as a nested loop construct.
That is, for each row in the outer query, the engine is going to run the inner query. For each row, it will compare col1. And when these are not equal, it will check if col2 is the same in the outer query.
Engines do have a variety of algorithms so this is not guaranteed. However, non-equality conditions are harder to optimize and less frequent.
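As a rough mental model only, the nested-loop evaluation described above can be written out in plain Python. The function name and the list-of-tuples table are made up for illustration, and a real engine may pick a completely different plan.

```python
# (col1, col2) rows from the question's test table
rows = [(1, "a"), (2, "b"), (3, "c"), (4, "a")]

def rows_with_shared_col2(table):
    """Hypothetical helper mimicking the correlated IN subquery."""
    result = []
    for a_col1, a_col2 in table:          # outer query: FROM test a
        for b_col1, b_col2 in table:      # inner query: FROM test b
            # WHERE b.col1 <> a.col1, then the IN test on col2
            if b_col1 != a_col1 and b_col2 == a_col2:
                result.append((a_col1, a_col2))
                break                     # IN needs only one match
    return result

matches = rows_with_shared_col2(rows)
print(matches)
```

The inner loop is re-run for every outer row, which is why the subquery is called correlated: it refers to a.col1 from the row currently being examined.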
That said, there are much more efficient ways to express the query. For instance, you can use window functions. I believe this is the same logic -- assuming the values in the columns are not NULL:
select t.*
from (select t.*,
min(col1) over (partition by col2) as min_col1,
max(col1) over (partition by col2) as max_col1
from test t
) t
where min_col1 <> max_col1;
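If you want to convince yourself that the two forms agree on the sample data, a quick check along these lines should do. It uses Python's sqlite3 module as a stand-in engine (my assumption, the question does not say which database is in use) and needs SQLite 3.25 or newer for the window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (col1 INTEGER PRIMARY KEY, col2 TEXT);
INSERT INTO test VALUES (1,'a'), (2,'b'), (3,'c'), (4,'a');
""")

# Original correlated-subquery form from the question.
correlated = conn.execute("""
    SELECT *
    FROM test a
    WHERE col2 IN (SELECT col2 FROM test b WHERE b.col1 <> a.col1)
    ORDER BY col1
""").fetchall()

# Window-function form: a col2 value is shared by several rows
# exactly when the smallest and largest col1 in its partition differ.
windowed = conn.execute("""
    SELECT col1, col2
    FROM (SELECT t.*,
                 MIN(col1) OVER (PARTITION BY col2) AS min_col1,
                 MAX(col1) OVER (PARTITION BY col2) AS max_col1
          FROM test t) AS w
    WHERE min_col1 <> max_col1
    ORDER BY col1
""").fetchall()

print(correlated, windowed)
```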
I have two tables both with one column each. I want to copy/merge the data from those two tables into another table with both columns. So in the example below I want the data from Table1 and Table2 to go into Table3.
I used this query:
INSERT TABLE3 (BIGNUMBER)
SELECT BIGNUMBER
FROM TABLE1;
INSERT TABLE3 (SMALLNUMBER)
SELECT SMALLNUMBER
FROM TABLE2;
When I did this it copied the data from Table1 and Table2 but didn't put the data on the same lines. So it ended up like this:
I am trying to get the data to line up... match. So BIGNUMBER 1234567812345678 should have SMALLNUMBER 123456 next to it. If I am querying I could do this with a JOIN and a LIKE 'SMALLNUMBER%' but I am not sure how to do that here to make the data end up like this:
It doesn't have to be fancy comparing the smallnumber to the bignumber. When I BULK insert data into TABLE1 and TABLE2 they are in the same order so simply copying the data into TABLE3 without caring if SMALL is the start of BIG is fine with me.
There is no relationship at all in these tables. This is the simplest form I can think of. Basically two flat tables that need to be merged side by side. There is no logic to implement... start at row 1 and go to the end on BIGNUMBER. Start at row 1 again and go to the end on SMALLNUMBER. All that matters is that if BIGNUMBER has 50 rows and SMALLNUMBER has 50 rows, in the end there are still only 50 rows.
When I was using the query above I was going off of a page I was reading on MERGE. Now that I look over this I don't see MERGE anywhere... so maybe I just need to understand how to use MERGE.
If the order of the numbers is not important and you don't want to add another field to your source tables as jcropp suggested, you can use the ROW_NUMBER() function within a CTE to assign a number to each row and then join on those numbers:
WITH C1 AS(
SELECT ROW_NUMBER() OVER (ORDER BY TABLE1.BIGNUMBER) AS Rn1
,BIGNUMBER
FROM TABLE1
)
,C2 AS(
SELECT ROW_NUMBER() OVER (ORDER BY TABLE2.SMALLNUMBER) AS Rn2
,SMALLNUMBER
FROM TABLE2
)
INSERT INTO TABLE3
SELECT C1.BIGNUMBER
,C2.SMALLNUMBER
FROM C1
INNER JOIN C2 ON C1.Rn1 = C2.Rn2
More information about ROW_NUMBER(), CTE and INSERT INTO SELECT
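Here is a small runnable sketch of the same ROW_NUMBER() pairing, using Python's sqlite3 module in place of SQL Server (an assumption for the sake of a self-contained example) and made-up sample numbers. Note that the rows are paired by sorted value, not by the order they were bulk inserted, so this only lines up when both orders agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE1 (BIGNUMBER INTEGER);
CREATE TABLE TABLE2 (SMALLNUMBER INTEGER);
CREATE TABLE TABLE3 (BIGNUMBER INTEGER, SMALLNUMBER INTEGER);

-- Illustrative sample values only.
INSERT INTO TABLE1 VALUES
  (1234567812345678), (2345678923456789), (3456789034567890);
INSERT INTO TABLE2 VALUES (123456), (234567), (345678);

-- Number each table's rows, then join on the row numbers.
WITH C1 AS (
    SELECT ROW_NUMBER() OVER (ORDER BY BIGNUMBER) AS Rn1, BIGNUMBER
    FROM TABLE1
),
C2 AS (
    SELECT ROW_NUMBER() OVER (ORDER BY SMALLNUMBER) AS Rn2, SMALLNUMBER
    FROM TABLE2
)
INSERT INTO TABLE3 (BIGNUMBER, SMALLNUMBER)
SELECT C1.BIGNUMBER, C2.SMALLNUMBER
FROM C1
INNER JOIN C2 ON C1.Rn1 = C2.Rn2;
""")

merged = conn.execute(
    "SELECT BIGNUMBER, SMALLNUMBER FROM TABLE3 ORDER BY BIGNUMBER"
).fetchall()
print(merged)
```

Three rows in each source table produce exactly three rows in TABLE3, which is the "still only 50 rows" behaviour the question asks for.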
In order to use a JOIN statement to merge the two tables they each have to have a column that has common data. You don’t have that, but you may be able to introduce it:
Edit the structure of the first table. Add a column named something like id and set the attributes of the id column to autonumber.
Browse the table to make sure that the id column has been assigned numbers in the correct order.
Do the same for the second table.
After you’ve done a thorough check to ensure that the rows are numbered correctly, run a query to merge the tables:
SELECT TABLE1.id, TABLE1.BIGNUMBER, TABLE2.SMALLNUMBER INTO TABLE3
FROM TABLE1 INNER JOIN TABLE2 ON TABLE1.id = TABLE2.id
If my title hurts your head... I'm with you. I don't want to get into why this table exists, except that it is part of a legacy system. The system also does "record level access" (RLA), which I know will be an issue for many tables. I mention RLA because adding a column changes the table format, and then many very old programs will no longer work.
Apparently adding a PK has been shown not to change the table format. I've been told that a certain set of keys is guaranteed to be unique. Well, what do you know... it isn't. And now I need to show where they aren't.
All I can think of is:
Get the cross product where the table matches on its primary key.
Somehow get a count column onto the result set for the number of entries where the PK matches itself.
Filter that result set for values where the count is greater than 2.
I'm also going to see whether, if I expand the PK sufficiently, I'll actually find something unique.
Remove the constraints / unique indexes, insert the data, and then run this query:
SELECT col1, col2, ..., coln, COUNT(*)
FROM your_table
GROUP BY col1, col2, ..., coln
HAVING COUNT(*) > 1
where col1, col2, ..., coln is the list of columns in your key (one or more columns). The result will be the list of keys that occur more than once together with a count showing how often they occur.
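A tiny worked example of the GROUP BY / HAVING check, run through Python's sqlite3 module. The key columns key_a and key_b and the sample rows are placeholders I made up, so substitute your own key columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- No primary key or unique index, so duplicate keys can be loaded.
CREATE TABLE your_table (key_a TEXT, key_b INTEGER, payload TEXT);
INSERT INTO your_table VALUES
  ('x', 1, 'first'), ('x', 1, 'second'), ('x', 2, 'third'),
  ('y', 1, 'fourth'), ('y', 1, 'fifth'), ('y', 1, 'sixth');
""")

# One output row per key that occurs more than once, with its count.
duplicate_keys = conn.execute("""
    SELECT key_a, key_b, COUNT(*) AS occurrences
    FROM your_table
    GROUP BY key_a, key_b
    HAVING COUNT(*) > 1
    ORDER BY key_a, key_b
""").fetchall()
print(duplicate_keys)
```

The key ('x', 2) appears only once, so it is filtered out by the HAVING clause.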
select col1, ... from tab group by col1, ... having count(*)>1;
SELECT * FROM (SELECT ID, COUNT(*) CNT FROM MY_TABLE GROUP BY ID) t WHERE CNT > 1
Similar: How can I delete duplicate rows in a table
I have a feeling this is impossible and I'm going to have to do it the tedious way, but I'll see what you guys have to say.
I have a pretty big table, about 4 million rows, and 50-odd columns. It has a column that is supposed to be unique, Episode. Unfortunately, Episode is not unique - the logic behind this was that occasionally other fields in the row change, despite Episode being repeated. However, there is an actually unique column, Sequence.
I want to try and identify rows that have the same episode number, but something different between them (aside from sequence), so I can pick out how often this occurs, and whether it's worth allowing for or I should just nuke the rows and ignore possible mild discrepancies.
My hope is to create a table that shows the Episode number, and a column for each table column, identifying the value on both sides, where they are different:
SELECT Episode,
CASE WHEN a.Value1<>b.Value1
THEN a.Value1 + ',' + b.Value1
ELSE '' END AS Value1,
CASE WHEN a.Value2<>b.Value2
THEN a.Value2 + ',' + b.Value2
ELSE '' END AS Value2
FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode
WHERE a.Value1<>b.Value1
OR a.Value2<>b.Value2
(That is probably full of holes, but the idea of highlighting changed values comes through, I hope.)
Unfortunately, making a query like that for fifty columns is pretty painful. Obviously, it doesn't exactly have to be rock-solid if it will only be used the once, but at the same time, the more copy-pasta the code, the more likely something will be missed. As far as I know, I can't just do a search for DISTINCT, since Sequence is distinct and the same row will pop up as different.
Does anyone have a query or function that might help? Either something that will output a query result similar to the above, or a different solution? As I said, right now I'm not really looking to remove the duplicates, just identify them.
Use:
SELECT DISTINCT t.*
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
DISTINCT is just shorthand for writing a GROUP BY with all the columns involved. Grouping by all the columns will show you all the unique groups of records associated with the episode column in this case. So there's a risk of not having an accurate count of duplicates, but you will have the values so you can decide what to remove when you get to that point.
50 columns is a lot, but setting the ORDER BY will allow you to eyeball the list. Another alternative would be to export the data to Excel if you don't want to construct the ORDER BY, and use Excel's sorting.
UPDATE
I didn't catch that the sequence column would be a unique value. In that case you'd have to provide a list of all the columns you want to see, e.g.:
SELECT DISTINCT t.episode, t.column1, t.column2 --etc.
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
There's no notation that will let you use t.* but not this one column. Once the sequence column is omitted from the output, the duplicates will become apparent.
Instead of typing out all 50 columns, you could do this:
select column_name from information_schema.columns where table_name = 'your table name'
then paste them into a query that groups by all of the columns EXCEPT sequence, and filters by count > 1:
select
count(episode)
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
This should give you a list of all the rows that have the same episode number (but neither the sequence nor the episode numbers themselves). Here's the rub: you will need to join this result set to YourTable on ALL the columns except sequence and episode, since you don't have those columns here.
Here's where I like to use SQL to generate more SQL. This should get you started:
select 't1.' + column_name + ' = t2.' + column_name
from information_schema.columns where table_name = 'YourTable'
You'll plug in those join parameters to this query:
select * from YourTable t1
inner join (
select
count(episode) 'epcount'
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
) t2 on
...plug in all those join parameters here...
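If you have a scripting language handy, the "SQL that generates SQL" step can also be done outside the database. Below is a rough sketch in Python with sqlite3. It reads the column names with SQLite's PRAGMA table_info instead of information_schema (which SQLite does not have), and the table layout is a made-up stand-in for the real 50-column table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, shortened stand-in for the real table.
conn.execute(
    "CREATE TABLE YourTable ("
    "sequence INTEGER PRIMARY KEY, episode INTEGER, col1 TEXT, col2 TEXT)"
)

# Columns that must not take part in the join.
excluded = {"sequence", "episode"}

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
columns = [
    row[1]
    for row in conn.execute("PRAGMA table_info(YourTable)")
    if row[1] not in excluded
]

# Build the "t1.col = t2.col AND ..." text to paste into the query.
join_condition = "\n  AND ".join(f"t1.{c} = t2.{c}" for c in columns)
print(join_condition)
```

Generating the condition from the catalogue means a column cannot be forgotten or mistyped, which is the main risk with fifty hand-written comparisons.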
select count distinct ....
Should show you without having to guess. You can get your columns by viewing your table definition so you can copy/paste your non-sequence columns.
I think something like this is what you want:
select *
from t
where t.episode in (select episode from t group by episode having count(episode) > 1)
order by episode
This will give all rows that have episodes that are duplicated. Non-duplicate rows should stick out fairly obviously.
Of course, if you have access to some sort of scripting, you could just write a script to generate your query for you. It seems pretty straight-forward. (i.e. describe t and iterate over all the fields).
Also, your query should have some sort of ordering, like FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode AND a.Sequence < b.Sequence, otherwise you'll get duplicate non-duplicates.
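To show what the a.Sequence < b.Sequence condition buys you, here is a cut-down version of the question's query on made-up data, run through Python's sqlite3 module. SQLite concatenates strings with || where the original used +, and the two value columns stand in for the real fifty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (
    Sequence INTEGER PRIMARY KEY, Episode INTEGER,
    Value1 TEXT, Value2 TEXT);
INSERT INTO Table1 VALUES
  (1, 100, 'a', 'x'),   -- episode 100 differs in Value2
  (2, 100, 'a', 'y'),
  (3, 200, 'b', 'z'),   -- episode 200 is not repeated
  (4, 300, 'c', 'q'),   -- episode 300 is an exact repeat
  (5, 300, 'c', 'q');
""")

# a.Sequence < b.Sequence keeps each pair once and drops self-pairs.
diffs = conn.execute("""
    SELECT a.Episode,
           CASE WHEN a.Value1 <> b.Value1
                THEN a.Value1 || ',' || b.Value1 ELSE '' END AS Value1,
           CASE WHEN a.Value2 <> b.Value2
                THEN a.Value2 || ',' || b.Value2 ELSE '' END AS Value2
    FROM Table1 a
    INNER JOIN Table1 b
            ON a.Episode = b.Episode AND a.Sequence < b.Sequence
    WHERE a.Value1 <> b.Value1 OR a.Value2 <> b.Value2
    ORDER BY a.Episode
""").fetchall()
print(diffs)
```

Without the Sequence comparison, episode 100 would be reported twice, once as 'x,y' and once as 'y,x'.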
A relatively simple solution that Ponies sparked:
SELECT t.*
FROM Table t
INNER JOIN ( SELECT episode
FROM Table
GROUP BY Episode
HAVING COUNT(*) > 1
) AS x ON t.episode = x.episode
And then, copy-paste into Excel, and use this as conditional highlighting for the entire result set:
=AND($C2=$C1,A2<>A1)
Column C is Episode. This way, you get a visual highlight when the data's different from the row above (as long as both rows have the same value for episode).
Generate and store a hash key for each row, designed so the hash values mirror your definition of sameness. Depending on the complexity of your rows, updating the hash might be a simple trigger on modifying the row.
Query for duplicates of the hash key, which are your "very probably" identical rows.
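A minimal sketch of the hashing idea, done client-side in Python rather than in a trigger. The column names, the choice of SHA-256, and the rule that only Sequence is ignored are all assumptions for illustration, so adjust them to your own definition of sameness.

```python
import hashlib
from collections import defaultdict

# Made-up sample rows; Sequence is the only truly unique column.
rows = [
    {"Sequence": 1, "Episode": 100, "Value1": "a", "Value2": "x"},
    {"Sequence": 2, "Episode": 100, "Value1": "a", "Value2": "x"},
    {"Sequence": 3, "Episode": 100, "Value1": "a", "Value2": "y"},
    {"Sequence": 4, "Episode": 200, "Value1": "b", "Value2": "z"},
]

def row_hash(row, ignored=("Sequence",)):
    """Hash every column except the ignored ones, in a fixed order."""
    parts = [
        f"{name}={row[name]!r}"
        for name in sorted(row)
        if name not in ignored
    ]
    # A separator character avoids accidental collisions between
    # adjacent fields when they are joined into one string.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

# Group Sequence numbers by hash; groups of 2+ are probable duplicates.
groups = defaultdict(list)
for row in rows:
    groups[row_hash(row)].append(row["Sequence"])

probable_duplicates = sorted(
    seqs for seqs in groups.values() if len(seqs) > 1
)
print(probable_duplicates)
```

Rows 1 and 2 hash to the same value, while row 3 shares their episode but differs in Value2, so it gets a different hash.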