We have two tables we want to merge, say table1 and table2.
They have exactly the same columns and serve exactly the same purpose; the difference is that table2 has newer data.
We used a query with a LEFT JOIN to find the rows that are common to both tables and skip those rows while merging. The problem is this: both tables have 500M rows.
When we ran the query, it just kept going; after an hour it was still running. We were certain this was because of the large number of rows.
But when we wanted to see how many rows had already been inserted into table2, we ran select count(*) from table2, and it gave us exactly the same row count for table2 as when we started.
Our question is: is that how it's supposed to be? Do the rows all get inserted at the same time, after all the matches have been found?
If you would like to read uncommitted data, then the count query should be modified like this:
select count(*) from table2 WITH (NOLOCK)
NOLOCK is over-used, but in this specific scenario, it might be handy.
Rows are not inserted or updated one by one.
I don't see how that relates to select count(*) from table2 WITH (NOLOCK).
The join condition is taking too long to produce the result set that the insert operator will consume, so no insert is actually happening, because no result set has been produced yet.
The join query is taking too long because the LEFT JOIN condition produces a very high cardinality estimate,
so the join condition has to be fixed first.
For that we would need more information: the table schema, data types and lengths, existing indexes, and the requirements.
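As an illustration only (assuming both tables share a primary-key column id; the other column names are placeholders), the merge could be rewritten with NOT EXISTS and run in batches, so each statement commits on its own and progress becomes visible between batches:

INSERT INTO table2 (id, col1, col2)
SELECT TOP (100000) t1.id, t1.col1, t1.col2
FROM table1 AS t1
WHERE NOT EXISTS (SELECT 1 FROM table2 AS t2 WHERE t2.id = t1.id);

Repeating this statement until @@ROWCOUNT returns 0 keeps every transaction small; a count taken between batches would then show the table growing.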
I've got two tables containing a column with the same name. I'm trying to find out which distinct values exist in Table2 but don't exist in Table1. For that I have two SELECTs:
SELECT DISTINCT Field
FROM Table1
SELECT DISTINCT Field
FROM Table2
Both SELECTs finish within 2 seconds and return about 10 rows each. But if I restructure my query to find out which values are missing in Table1, the query takes several minutes to finish:
SELECT DISTINCT Field
FROM Table1
WHERE Field NOT IN
(
SELECT DISTINCT Field
FROM Table2
)
My temporary workaround is inserting the results of the second DISTINCT into a temporary table and comparing against it, but the performance still isn't great.
Does anyone know why this happens? I guess SQL Server keeps recalculating the second DISTINCT, but why would it? Shouldn't SQL Server optimize this somehow?
I'm not sure if this will improve performance, but I'd use EXCEPT:
SELECT Field
FROM Table1
EXCEPT
SELECT Field
FROM Table2
There is no need to use DISTINCT because EXCEPT is a set operator that removes duplicates.
EXCEPT returns distinct rows from the left input query that aren't output by the right input query.
The number and the order of the columns must be the same in all queries.
The data types must be compatible.
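If you'd rather keep a subquery, NOT EXISTS is usually a safer rewrite than NOT IN: if the subquery returns even a single NULL, NOT IN yields no rows at all, and the extra NULL handling can also lead to a worse plan. A sketch:

SELECT DISTINCT a.Field
FROM Table1 AS a
WHERE NOT EXISTS (SELECT 1
                  FROM Table2 AS b
                  WHERE b.Field = a.Field)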
I have a database where each object property is stored in a separate row. The query below does not return distinct values on a Redshift database, but it works as expected when tested on any MySQL-compatible database.
SELECT DISTINCT distinct_value
FROM
(
SELECT
uri,
( SELECT DISTINCT value_string
FROM `test_organization__app__testsegment` AS X
WHERE X.uri = parent.uri AND name = 'hasTestString' AND parent.value_string IS NOT NULL ) AS distinct_value
FROM `test_organization__app__testsegment` AS parent
WHERE
uri IN ( SELECT uri
FROM `test_organization__app__testsegment`
WHERE name = 'types' AND value_uri_multivalue = 'Document'
)
) AS T
WHERE distinct_value IS NOT NULL
ORDER BY distinct_value ASC
LIMIT 10000 OFFSET 0
This is not a bug; the behavior is intentional, though not straightforward.
In Redshift you can declare constraints on tables, but Redshift doesn't enforce them, i.e. it allows duplicate values if you insert them. The difference is this: when you run a SELECT DISTINCT query against a column that doesn't have a primary key declared, Redshift scans the whole column and collects the unique values; when you run the same query against a column that has a primary key constraint, it just returns the output without performing any uniqueness filtering. That is how you can end up with duplicate entries if you insert them.
Why is this done? Redshift is optimized for large datasets, and it's much faster to copy data if you don't need to check constraint validity for every row you copy or insert. You can still declare a primary key constraint as part of your data model, but you will need to support it explicitly, by removing duplicates or by designing your ETL so that no duplicates are produced.
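A minimal illustration of the effect (table name and values are made up):

CREATE TABLE demo_pk (id INT, PRIMARY KEY (id));
INSERT INTO demo_pk VALUES (1);
INSERT INTO demo_pk VALUES (1); -- accepted: Redshift does not enforce the constraint
SELECT DISTINCT id FROM demo_pk; -- can return 1 twice, because the planner trusts the declared key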
More information, with specific examples, is in the Heap blog post Redshift Pitfalls And How To Avoid Them.
Perhaps you can solve this by using appropriate joins.
For example, say I have duplicate values in table1 and I want the values of table1 by joining it to table2, with some logic behind joining the two tables according to your conditions.
Then I can do something like this:

select distinct table1.col1
from table1
left outer join table2 on table1.col1 = table2.col1

This worked very well for me: I got unique values from table1 and could remove the duplicates.
I have a huge table with no indexes on it, and indexes can't be added. I need to delete rows like this:

delete from table1
where id in (select id
             from table2
             inner join table3 on table2.col1 = table3.col1);

But since the table has a huge number of rows, this is taking too much time. What can I do to make it faster, other than indexing (not permitted)?
I am using an Oracle DB.
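One common workaround in Oracle (a sketch only, with an arbitrary batch size) is to delete in chunks and commit between them. It does not avoid the full scans, but it keeps undo small and makes the job restartable:

BEGIN
  LOOP
    DELETE FROM table1
    WHERE id IN (SELECT id
                 FROM table2
                 INNER JOIN table3 ON table2.col1 = table3.col1)
      AND ROWNUM <= 100000; -- at most 100k rows per pass
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT; -- commit each batch so the transaction stays small
  END LOOP;
  COMMIT;
END;
/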
I have a process that consolidates 40+ identically structured databases down to one consolidated database, the only difference being that the consolidated database adds a project_id field to each table.
In order to be as efficient as possible, I'm trying to copy/update a record from the source databases to the consolidated database only if it has been added or changed. I delete outdated records from the consolidated database, and then copy in any non-existing records. To delete outdated/changed records I'm using a query similar to this:
DELETE a
FROM <table> a
WHERE NOT EXISTS (SELECT <primary keys>
                  FROM <source> b
                  WHERE ((<b.fields = a.fields>) or
                         (b.fields is null and a.fields is null)))
AND a.PROJECT_ID = <project_id>
This works for the most part, but one of the tables in the source database has over 700,000 records, and this query takes over an hour to complete.
How can I make this query more efficient?
Use timestamps, or better yet audit tables, to identify the records that changed since time "X", and then save time "X" when the last sync starts. We use that for interface feeds.
You might want to try a LEFT JOIN with a NULL filter:
DELETE <table>
FROM <table> t
LEFT JOIN <source> b
ON (t.Field1 = b.Field1 OR (t.Field1 IS NULL AND b.Field1 IS NULL))
AND(t.Field2 = b.Field2 OR (t.Field2 IS NULL AND b.Field2 IS NULL))
--//...
WHERE t.PROJECT_ID = <project_id>
AND b.PrimaryKey IS NULL --// any of the PK fields will do, but I really hope you do not use composite PKs
But if you are comparing all non-PK columns, your query is going to suffer.
In that case it is better to add an UpdatedAt TIMESTAMP field (as DVK suggests) on both databases, which you could maintain with an AFTER UPDATE trigger; your sync procedure would then be much faster, given an index that includes the PKs and the UpdatedAt column.
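A minimal sketch of that trigger in SQL Server (table and key names are placeholders):

ALTER TABLE MyTable ADD UpdatedAt DATETIME NOT NULL DEFAULT GETUTCDATE();
GO

CREATE TRIGGER trg_MyTable_SetUpdatedAt
ON MyTable
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- stamp only the rows touched by the UPDATE
    UPDATE t
    SET UpdatedAt = GETUTCDATE()
    FROM MyTable AS t
    INNER JOIN inserted AS i ON i.PrimaryKey = t.PrimaryKey;
END;

The sync can then compare only rows WHERE UpdatedAt > @LastSyncTime instead of comparing every column.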
You can reorder the comparisons in the WHERE clause: there are four of them, so put the one most likely to fail first.
If you can alter the databases/application slightly, and you'll need to do this again, a bit field that says "updated" might not be a bad addition.
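A sketch of that flag approach (names are placeholders):

ALTER TABLE MyTable ADD IsChanged BIT NOT NULL DEFAULT 1;

-- the application (or a trigger) sets IsChanged = 1 on every change;
-- the sync copies the flagged rows and then resets the flag:
SELECT * FROM MyTable WHERE IsChanged = 1;
UPDATE MyTable SET IsChanged = 0 WHERE IsChanged = 1;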
I usually rewrite queries like this to avoid the NOT...
NOT IN is horrible for performance, although NOT EXISTS improves on it.
Check out this article: http://www.sql-server-pro.com/sql-where-clause-optimization.html
My suggestion:
Select your pkey column out into a working/temp table, add a column flag int default 0 not null, and index the pkey column. Mark flag = 1 if the record exists in your subquery (much quicker!).
Replace the subselect in your main query with one against the working table: (select pkey from temptable where flag = 0).
What this works out to is being able to build a list of 'not exists' values that can be used inclusively, from an all-inclusive set.
Here's our total set.
{1,2,3,4,5}
Here's the existing set
{1,3,4}
We create our working table from these two sets (technically a left outer join)
(record:exists)
{1:1, 2:0, 3:1, 4:1, 5:0}
Our set of 'not existing records'
{2,5} (Select * from where flag=0)
Our product... and much quicker (indexes!)
{1,2,3,4,5} in {2,5} = {2,5}
{1,2,3,4,5} not in {1,3,4} = {2,5}
This can be done without a working table, but its use makes visualizing what's happening easier.
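A sketch of that working-table approach in T-SQL (MainTable, SourceTable, and pkey are placeholder names):

-- 1. Copy the keys into a working table with a flag defaulting to 0
SELECT t.pkey, CAST(0 AS INT) AS flag
INTO #work
FROM MainTable AS t;

CREATE INDEX ix_work_pkey ON #work (pkey);

-- 2. Mark the keys that exist in the source: the 'existing' set
UPDATE w
SET flag = 1
FROM #work AS w
INNER JOIN SourceTable AS s ON s.pkey = w.pkey;

-- 3. flag = 0 is the 'not exists' set, now usable inclusively
DELETE FROM MainTable
WHERE pkey IN (SELECT pkey FROM #work WHERE flag = 0);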
Kris