MSSQL - Question about how insert queries run

We have two tables we want to merge. Say, table1 and table2.
They have the exact same columns, and the exact same purpose. The difference being table2 having newer data.
We used a query with a LEFT JOIN to find the rows that are common between them, and skip those rows while merging. The problem is this: both tables have 500M rows.
When we ran the query, it kept going on and on. For an hour it just kept running. We were certain this was because of the large number of rows.
But when we wanted to see how many rows had already been inserted into table2, we ran SELECT COUNT(*) FROM table2, and it gave us the exact same row count for table2 as when we started.
Our question is: is that how it's supposed to be? Do the rows get inserted all at once, after all the matches have been found?
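For reference, a merge of this shape typically looks something like the sketch below; the id and value columns are placeholders, since the original query was not posted:
INSERT INTO table2 (id, col1, col2)
SELECT t1.id, t1.col1, t1.col2
FROM table1 t1
LEFT JOIN table2 t2
    ON t2.id = t1.id
WHERE t2.id IS NULL -- keep only rows not already present in table2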

If you would like to read uncommitted data, then the count should be modified like this:
select count(*) from table2 WITH (NOLOCK)
NOLOCK is over-used, but in this specific scenario, it might be handy.

Rows are not inserted or updated one by one.
I don't see how SELECT COUNT(*) FROM table2 WITH (NOLOCK) is related here.
The join is taking too long to produce the result set that the insert operator consumes, so no insert is actually happening yet, because no result set has been produced.
The join query is slow because the LEFT JOIN condition produces a very high cardinality estimate, so the join condition has to be fixed first.
To do that, more information is needed: the table schemas, the data types and lengths, the existing indexes, and the exact requirement.
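In the meantime, one pattern worth trying (a sketch only, assuming an id key column, since the real schema isn't known here) is NOT EXISTS instead of LEFT JOIN ... IS NULL; the optimizer can often turn it into a cleaner anti-semi-join with a better estimate:
INSERT INTO table2 (id, col1, col2)
SELECT t1.id, t1.col1, t1.col2
FROM table1 t1
WHERE NOT EXISTS
(
    SELECT 1
    FROM table2 t2
    WHERE t2.id = t1.id -- skip rows whose key already exists in table2
)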

Related

Count rows with column varbinary NOT NULL takes a lot of time

This query
SELECT COUNT(*)
FROM Table
WHERE [Column] IS NOT NULL
takes a lot of time. The table has 5000 rows, and the column is of type VARBINARY(MAX).
What can I do?
Your query needs to do a table scan on a column that can potentially be very large without any way to index it. There isn't much you can do to fix this without changing your approach.
One option is to split the table into two tables. The first table could have all the details you have now in it and the second table would have just the file. You can make this a 1-1 table to ensure data is not duplicated.
You would only add the binary data as needed into the second table. If it is not needed anymore, you simply delete the record. This will allow you to simply write a JOIN query to get the information you are looking for.
SELECT COUNT(*)
FROM dbo.Table1
INNER JOIN dbo.Table2
    ON Table1.Id = Table2.Id
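A minimal sketch of that split, reusing the Table1/Table2 names from the query above (the INT key and the FileData column name are assumptions):
CREATE TABLE dbo.Table1 (
    Id INT PRIMARY KEY
    -- ...the detail columns you have today
);
CREATE TABLE dbo.Table2 (
    Id INT PRIMARY KEY REFERENCES dbo.Table1 (Id), -- PK doubles as FK: enforces the 1-1 relationship
    FileData VARBINARY(MAX) NOT NULL
);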

SQL - Two DISTINCTs performing very poorly

I've got two tables containing a column with the same name. I'm trying to find out which distinct values exist in Table1 but don't exist in Table2. For that I have two SELECTs:
SELECT DISTINCT Field
FROM Table1
SELECT DISTINCT Field
FROM Table2
Both SELECTs finish within 2 seconds and return about 10 rows each. If I restructure my query to find out which values are missing from Table2, the query takes several minutes to finish:
SELECT DISTINCT Field
FROM Table1
WHERE Field NOT IN
(
SELECT DISTINCT Field
FROM Table2
)
My temporary workaround is inserting the results of the second DISTINCT into a temporary table and comparing against it. But the performance still isn't great.
Does anyone know why this happens? I guess SQL Server keeps recalculating the second DISTINCT, but why would it? Shouldn't SQL Server optimize this somehow?
Not sure if this will improve performance, but I'd use EXCEPT:
SELECT Field
FROM Table1
EXCEPT
SELECT Field
FROM Table2
There is no need to use DISTINCT because EXCEPT is a set operator that removes duplicates.
EXCEPT returns distinct rows from the left input query that aren't output by the right input query.
The number and the order of the columns must be the same in all queries.
The data types must be compatible.
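One caveat with the original NOT IN form: if Field is nullable and Table2 contains even a single NULL, NOT IN returns no rows at all. A NOT EXISTS rewrite avoids that and is usually optimized just as well:
SELECT DISTINCT t1.Field
FROM Table1 t1
WHERE NOT EXISTS
(
    SELECT 1
    FROM Table2 t2
    WHERE t2.Field = t1.Field -- anti-join: keep values with no match in Table2
)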

How does a DELETE FROM with a SELECT in the WHERE work?

I am looking at an application and I found this SQL:
DELETE FROM Phrase
WHERE Modified < (SELECT Modified FROM PhraseSource WHERE Id = Phrase.PhraseId)
The intention of the SQL is to delete rows from Phrase where there are more recent rows in the PhraseSource table.
Now I know the tables Phrase and PhraseSource have the same columns, and that Modified holds the number of seconds since 1970, but I cannot understand how or why this works or what it is doing. When I look at it, it seems like on the left of the < there is just one column, while on the right side of the < there would be many rows. Does it even make any sense?
The two tables are identical and have the following structure
Id - GUID primary key
...
...
...
Modified int
The ... columns are about ten columns containing text and numeric data. The PhraseSource table may or may not contain more recent rows, with a higher number in the Modified column and different text and numeric data.
The SELECT statement in parentheses is a sub-query, or nested query.
What happens is that, for each row in the Phrase table, the Modified column value is compared with the result of the sub-query (which is run once for each of those rows).
The sub-query has a WHERE clause, so it finds the row that has the same Id as the Phrase row currently being evaluated and returns its Modified value (which comes from a single row, so it is actually a single scalar value).
The two Modified values are compared, and if the Phrase row was modified before the corresponding row in PhraseSource, it is deleted.
As you can see, this approach is not efficient, because it requires the database to run a separate query for each row in the Phrase table (although some databases may be smart enough to optimize this a little).
A better solution
The more efficient solution would be to use INNER JOIN:
DELETE p FROM Phrase p
INNER JOIN PhraseSource ps
ON p.PhraseId=ps.Id
WHERE p.Modified < ps.Modified
This should do exactly the same thing as your query, but using the efficient JOIN mechanism. INNER JOIN uses the ON clause to choose how to "match" rows in the two tables (which the database does very efficiently) and then compares the Modified values of the matching rows.
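Either way, with a delete like this it is worth previewing the affected rows first by running the same join as a SELECT:
SELECT p.*
FROM Phrase p
INNER JOIN PhraseSource ps
    ON p.PhraseId = ps.Id
WHERE p.Modified < ps.Modified -- these are the rows the DELETE would remove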

Inconsistent results from BigQuery: same query, different number of rows

I noticed today that one of my queries was returning inconsistent results: every time I run it, a different number of rows comes back (cache deactivated).
Basically the query looks like this:
SELECT *
FROM mydataset.table1 AS t1
LEFT JOIN EACH mydataset.table2 AS t2
ON t1.deviceId=t2.deviceId
LEFT JOIN EACH mydataset.table3 AS t3
ON t2.email=t3.email
WHERE t3.email IS NOT NULL
AND (t3.date IS NULL OR DATE_ADD(t3.date, 5000, 'MINUTE')<TIMESTAMP('2016-07-27 15:20:11') )
The tables are not updated between runs, so I'm wondering if you have noticed this kind of behaviour too.
I usually run queries that return a lot of rows (>1000), so a few missing rows here and there are hardly noticeable. But this query returns only a few rows, and the count varies every time, between 10 and 20 rows :-/
If a Google engineer is reading this, here are two Job ID of the same query with different results:
picta-int:bquijob_400dd739_1562d7e2410
picta-int:bquijob_304f4208_1562d7df8a2
Unless I'm missing something, the query that you provide is completely deterministic and so should give the same result every time you execute it. But you say it's "basically" the same as your real query, so this may be due to something you changed.
There are a couple of things you can do to try to find the cause:
Replace SELECT * with an explicit selection of fields from your tables (a combination of fields that uniquely determines each row).
Order the result by these fields, so that the order is the same each time you execute the query.
Simplify your query. In the query above, you can remove the first condition and turn the two left outer joins into inner joins and get the same result (see the sketch below). After that, you could start removing tables and conditions one by one.
After each step, check if you still get different result sets. Then when you have found the critical step, try to understand why it causes your problem. (Or ask here.)
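For example, the simplified form suggested above (first condition removed, the two outer joins turned into inner joins) would look like this:
SELECT *
FROM mydataset.table1 AS t1
JOIN EACH mydataset.table2 AS t2
ON t1.deviceId=t2.deviceId
JOIN EACH mydataset.table3 AS t3
ON t2.email=t3.email
WHERE t3.date IS NULL OR DATE_ADD(t3.date, 5000, 'MINUTE')<TIMESTAMP('2016-07-27 15:20:11')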

That was not the right table: Access wiped the wrong data

I... don't quite know if I have the right idea about Access here.
I wrote the following to grab some data that existed in two places:
SELECT TableOne.*
FROM TableOne INNER JOIN TableTwo
ON TableOne.[LINK] = TableTwo.[LINK]
Now, my interpretation of this is:
Find the table "TableOne"
Match the LINK field to the corresponding field in the table "TableTwo"
Show only records from TableOne that have a matching record in TableTwo
Just to make sure, I ran the query with some sample tables in SSMS, and it worked as expected.
So why, when I deleted the rows from within that query, did it delete the rows from TableTwo, and NOT from TableOne as expected? I've just lost ~3 days of work.
Edit: For clarity, I manually selected the rows in the query window and deleted them. I did not use a delete query - I've been stung by that a couple of times lately.
Since you deleted the records manually, your query has to be updateable. This means your query couldn't have been solely a Cartesian join or a join without referential integrity, since such queries are non-updateable in MS Access.
When I recreate your query based on two fields without indexes or primary keys, I am not even able to manually delete records. This leads me to believe a relationship was unknowingly established that deleted the records in TableTwo. Perhaps you should take a look at the design view of your queries and at the Relationships window, since the query itself should indeed select only records from TableOne.
Not sure why it got deleted, but I suggest rewriting your query:
DELETE FROM TableOne
WHERE [LINK] IN (SELECT [LINK] FROM TableTwo)
This should work for you:
DELETE a
FROM TableOne a
INNER JOIN TableTwo b
    ON b.[LINK] = a.[LINK]
    AND [my filter condition]