Data between test and production environments are different - sql

The count(*) of a table in both Test and Production returns the same value. However, a user of the table was doing some validation/testing and noticed that the sum of a column/field is different between the two environments. Being the better SQL user of the two of us, I'm trying to figure out how to find the discrepancies.
What's a good way to do so? This isn't that big of a table (~1 million rows), but I'd like to keep the query/statement rather small.
This is in Teradata.

Alright, here is a framework for you to build off of, then. Since you are looking at sums and the like, you'll need to assemble most of the WHERE clause yourself, since I don't have enough information to know what you are summing. So, I'll write this to find discrepancies in the rows themselves...
SELECT t1.id
FROM Production.[schema].table1 t1
INNER JOIN Test.[schema].table1 t2 ON t1.id = t2.id
WHERE t1.column <> t2.column
....
Just push the columns you want to compare into the WHERE clause. This will line up the two tables from TEST and PROD and let you look for differences between columns; it will return a list of row ids where there is a mismatch.
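If the mismatch is specifically in the column being summed, a minimal sketch along the same lines, assuming that column is called amount (an illustrative name, not from the original question), could be the following; the extra NULL checks catch rows where the value is NULL in only one environment:
-- "amount" stands in for whatever column is being summed
SELECT t1.id,
       t1.amount AS prod_amount,
       t2.amount AS test_amount
FROM Production.[schema].table1 t1
INNER JOIN Test.[schema].table1 t2
    ON t1.id = t2.id
WHERE t1.amount <> t2.amount
   OR (t1.amount IS NULL AND t2.amount IS NOT NULL)
   OR (t1.amount IS NOT NULL AND t2.amount IS NULL);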


Use SQL to detect changes between tables

I want to create a SQL script that would compare two of the same fields in two different tables. These tables may be on two different servers. I want to use this script to check that if a field gets updated in one table/server, it is also updated in the other table/server. Any ideas on how to approach this?
The first thing you need to be sure of is that your servers are linked; otherwise you won't easily be able to compare the two. If the servers are linked and the tables are identical, you can use an EXCEPT query to identify the changes, e.g.
select * from [server1].[db].[schema].[table]
except
select * from [server2].[db].[schema].[table]
This query will return all rows from the table in server1 that don't appear in server2. From here you can either wrap this in a count, as sketched below, or insert/update the missing/changed rows from one table to another.
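A minimal sketch of the count wrapper, assuming the same linked-server names as above:
-- Count how many rows differ between the two copies of the table
SELECT COUNT(*) AS differing_rows
FROM (
    SELECT * FROM [server1].[db].[schema].[table]
    EXCEPT
    SELECT * FROM [server2].[db].[schema].[table]
) AS diff;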
Identifying whether the rows have changed or been inserted relies on having a primary key; with that you can join one table to the other and identify what needs updating using a query like this:
select *
from [server1].[db].[schema].[table] t1
inner join [server2].[db].[schema].[table] t2 on t1.id = t2.id
where ( t1.col1 <> t2.col1 or t1.col2 <> t2.col2 ... )
Another way of tracking changes is to use a DML trigger and have it propagate changes from one table to another.
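As a rough, hedged sketch of that idea (SQL Server syntax; the trigger, table, and column names are illustrative, not from the original answer), an AFTER INSERT trigger could copy new rows to the linked server like this:
CREATE TRIGGER trg_table1_sync
ON dbo.table1
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Copy newly inserted rows to the copy of the table on the other server
    INSERT INTO [server2].[db].[schema].[table] (id, col1, col2)
    SELECT i.id, i.col1, i.col2
    FROM inserted AS i;
END;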
I was working on a SQL Server auditing tool that uses these principles; have a look through the code if you like. It's not 100% working: https://github.com/simtbak/panko/blob/main/archive/Panko%20v003.sql

SQL LIKE operator very slow when using a value from another table in AWS Athena

I have a SQL query in Athena that is very slow when the LIKE pattern comes from another table:
SELECT *
FROM table1 t1
WHERE t1.value LIKE (
    SELECT concat('%', t2.value, '%') AS val
    FROM table2 t2
    WHERE t2.id = 1
    LIMIT 1
)
The above query is very slow. When I use something like the query below, it runs very fast:
SELECT *
FROM table1 t1
WHERE t1.value LIKE '%somevalue%'
In my scenario the LIKE value is not fixed; it can change over time, which is why I need to read it from another table.
Please suggest the fastest way.
"Slow" is a relative term, but a query that joins two tables will always be slower than a query that doesn't. A query that compares against a pattern that needs to be looked up in another table at query time will always be slower than a query that uses a static pattern.
Does that mean that the query with the lookup is slow? Perhaps, but you have to base that on what you're actually asking the query engine to do.
Let's dissect what your query is doing:
The outer query looks for all columns of all rows of the first table where one of the columns contains a particular string.
That string is dynamically looked up by scanning every row in the second table looking for a row with a particular value for the id column.
In other words, the static-pattern query scans only the first table, but the lookup query scans both tables. That's always going to be slower, because it's doing a lot more work. How much more work? That depends on the sizes of the tables. You aren't specifying the running times of the queries or the sizes of the tables, so it's hard to know.
You don't provide enough context in your question to answer any more precisely than this. We can only respond with generalities like: if it's slow, don't use LIKE, that's a slow operation; don't use a subquery that reads the whole second table, that's slow.
I have found another method to do the same thing, and it's much faster in Athena:
SELECT *
FROM table1 t1
WHERE POSITION(
          (SELECT concat('%', t2.value, '%') AS val
           FROM table2 t2
           WHERE t2.id = 1
           LIMIT 1)
      IN t1.value) > 0

MSSQL - Question about how insert queries run

We have two tables we want to merge. Say, table1 and table2.
They have the exact same columns and the exact same purpose; the difference is that table2 has newer data.
We used a query with a LEFT JOIN to find the rows that are common between them and skip those rows while merging. The problem is this: both tables have 500M rows.
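A hedged sketch of the kind of statement described above (the key and column names here are illustrative, not from the original question):
INSERT INTO table2 (id, col1, col2)
SELECT t1.id, t1.col1, t1.col2
FROM table1 AS t1
LEFT JOIN table2 AS t2
    ON t1.id = t2.id
WHERE t2.id IS NULL;  -- skip rows that already exist in table2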
When we ran the query, it kept going on and on. For an hour it just kept running. We were certain this was because of the large number of rows.
But when we wanted to see how many rows had already been inserted into table2, we ran select count(*) from table2, and it gave us the exact same row count for table2 as when we started.
Our question is: is that how it's supposed to work? Do the rows get inserted all at once after all the matches have been found?
If you would like to read uncommitted data, then the count should be modified like this:
select count(*) from table2 WITH (NOLOCK)
NOLOCK is over-used, but in this specific scenario, it might be handy.
Data are not inserted or updated one row at a time, and I have no idea how this is related to "Select count(*) from table2 WITH (NOLOCK)".
The join is taking too long to produce the result set that the insert operator will consume, so no insert is actually happening yet because no result set is being produced. The join query is taking too long because the LEFT JOIN condition produces a very, very high cardinality estimate, so the join condition has to be fixed first.
To do that, more information is needed: the table schema, data types and lengths, existing indexes, and the requirement.

Optimize query that compares two tables with similar schema in different databases

I have two tables with similar schemas in different databases. What is the best way to compare records between these two tables? I need to find:
records that exist in the first table whose corresponding record does not exist in the second table, after filtering records from the first table with some WHERE clauses.
So far I have come with this SQL construct:
SELECT t1_col1, t1_col2
FROM table1
WHERE t1_col1 = <condition>
  AND t1_col2 = <condition>
  AND NOT EXISTS
      (SELECT *
       FROM table2
       WHERE t1_col1 = t2_col1
         AND t1_col2 = t2_col2)
Is there a better way to do this?
The above query seems fine, but I suspect it is doing a row-by-row comparison without first evaluating the conditions in the outer query, even though those conditions would reduce the result set considerably. Is this happening?
Just use the EXCEPT keyword!
SELECT t1_col1, t1_col2
FROM table1
WHERE t1_col1 = <condition>
  AND t1_col2 = <condition>
EXCEPT
SELECT t2_col1, t2_col2
FROM table2
It returns any distinct values from the query to the left of the EXCEPT operand that are not also returned from the right query.
For more information, see MSDN.
If the data in both tables are expected to have the same primary key, you can use the IN keyword to filter out rows that are not found in the other table, as sketched below. This could be the simplest way.
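A minimal sketch of that idea, assuming the shared primary key column is called pk_id (an illustrative name, not from the original question); note that NOT IN behaves unexpectedly if the subquery can return NULLs:
-- Rows in table1 whose key has no match in table2
SELECT t1_col1, t1_col2
FROM table1
WHERE pk_id NOT IN (SELECT pk_id FROM table2)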
If you are open to third-party tools like Redgate Data Compare, you can try it; it's a very nice tool. Visual Studio 2010 Ultimate edition also has this feature.

Is it better to join two fields together, or to compare them each to the same constant?

For example which is better:
select * from t1, t2 where t1.country='US' and t2.country=t1.country and t1.id=t2.id
or
select * from t1, t2 where t1.country='US' and t2.country='US' and t1.id=t2.id
Better as in less work for the database and faster results.
Note: Sybase, and there's an index on both tables of country+id.
I don't think there is a global answer to your question. It depends on the specific query. You would have to compare the execution plans for the two queries to see if there are significant differences.
I personally prefer the first form:
select * from t1, t2 where t1.country='US' and t2.country=t1.country and t1.id=t2.id
because if I want to change the literal there is only one change needed.
There are a lot of factors at play here that you've left out. What kind of database is it? Are those tables indexed? How are they indexed? How large are those tables?
(Premature optimization is the root of all evil!)
It could be that if "t1.id" and "t2.id" are indexed, the database engine joins them together based on those fields, and then uses the rest of the WHERE clause to filter out rows.
They could be indexed but be incredibly small tables that both fit in a page of memory, in which case the database engine might just do a full scan of both rather than bother loading up the index.
You just don't know, really, until you try.
I had a situation similar to this and this was the solution I resorted to:
Select *
FROM t1
INNER JOIN t2 ON t1.id = t2.id AND t1.country = t2.country AND t1.country = 'US'
I noticed that my query ran faster in this scenario. I made the assumption that joining on the constant saved the engine time because the WHERE clause executes at the end. Joining and then filtering by 'US' means you still pulled all the other countries from your table and then had to filter out the ones you wanted. This method pulls fewer records in the end, because it will only find US records.
The correct answer probably depends on your SQL engine. For MS SQL Server, the first approach is clearly better, because the statistical optimizer is given an additional clue which may help it find a better (more optimal) resolution path.
I think it depends on the library and database engine. Each one will execute the SQL differently, and there's no telling which one will be optimized.
I'd lean towards only including your constant in the code once. There might be a performance advantage one way or the other, but it's probably so small the maintenance advantage of only one parameter trumps it.
If you ever wish to make the query more general, perhaps substituting a parameter for the target country, then I'd go with your first example, as it requires only a single change. That's less to worry about getting wrong in the future.
I suspect this is going to depend on the tables, the data and the meta-data. I expect I could work up examples that would show results both ways - benchmark!
The expressions should be equivalent with any decent optimizer, but it depends on which database you're using and what indexes are defined on your table.
I would suggest using the EXPLAIN feature to figure out which of the expressions is the most optimal.
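The exact mechanism varies by engine; as a hedged example, in Sybase ASE (which the question mentions) you would typically turn on showplan rather than run an EXPLAIN statement:
-- Sybase ASE: show the query plan for the statements that follow
set showplan on
go
select * from t1, t2
where t1.country = 'US' and t2.country = t1.country and t1.id = t2.id
go
set showplan off
go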
I think a better SQL would be:
select * from t1, t2 where t1.id=t2.id
and t1.country ='US'
There's no need to use the second comparison to 'US' unless it's possible that the country in t2 could be different from t1 for the same id.
Rather than use an implicit inner join, I would explicitly join the tables.
Since you want both the id fields and country fields to be the same, and you mentioned that both are indexed (I'm presuming in the same index), I would include both columns in the join so you can make use of an index seek instead of a scan. Finally, add your where clause.
SELECT *
FROM t1
JOIN t2 ON t1.id = t2.id AND t1.country = t2.country
WHERE t1.country = 'US'