Comparing two tables for similar values and Inserting values - sql

I am working with databases and need some advice.
I have two tables, Data and Orders, each with many rows and columns; both contain customer addresses.
I need to check the addresses in Orders against the addresses in Data, using email as the matching criterion.
If an email matches, nothing needs to be done; otherwise the address from Orders should be inserted into Data.
I wrote this query, but I am getting an error.
INSERT INTO orders (orders_id, customers_id, customers_cid, customers_vat_id, customers_name, customers_email_address) VALUES( (select o.* from Test.dbo.orders o where o.customers_email_address not in ( select a.email0 from CobraDemoData.dbo.Data a)))
Any help is much appreciated..
Thanks,
subash

You can insert the values directly from a select statement; don't use values when you want to do that. Additionally, you can use not exists in lieu of not in: SQL Server often runs it faster, though it's case-by-case, so look at the query plan if performance is really an issue. not exists also behaves predictably when email0 contains NULLs, whereas not in returns no rows at all if the subquery yields a NULL.
insert into orders (orders_id, customers_id, customers_cid, customers_vat_id, customers_name, customers_email_address)
select
    o.*
from
    Test.dbo.orders o
where
    not exists (
        select 1
        from
            CobraDemoData.dbo.Data a
        where
            a.email0 = o.customers_email_address
    )
Also, you probably want to list the columns explicitly in the select statement instead of using o.*, just to make sure each value lands in the right target column.
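A minimal sketch of the insert-from-select pattern, run against an in-memory SQLite database through Python's sqlite3 module so it can be tried anywhere. It follows the goal stated in the question (copy the Orders addresses that are missing from Data), and the two-column tables and the name column are simplified assumptions, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, cut-down versions of the two tables.
    CREATE TABLE data   (email0 TEXT, name TEXT);
    CREATE TABLE orders (customers_email_address TEXT, customers_name TEXT);
    INSERT INTO data   VALUES ('ann@example.com', 'Ann');
    INSERT INTO orders VALUES ('ann@example.com', 'Ann'),
                              ('bob@example.com', 'Bob');
""")

# Copy only the order addresses whose email is not yet present in data.
# Columns are listed explicitly on both sides instead of using o.*.
conn.execute("""
    INSERT INTO data (email0, name)
    SELECT o.customers_email_address, o.customers_name
    FROM orders AS o
    WHERE NOT EXISTS (
        SELECT 1
        FROM data AS d
        WHERE d.email0 = o.customers_email_address
    )
""")

emails = [row[0] for row in conn.execute("SELECT email0 FROM data ORDER BY email0")]
print(emails)
```

Only the Bob row is added here, because Ann's email already exists in data.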

Related

Best way to combine two tables, remove duplicates, but keep all other non-duplicate values in SQL

I am looking for the best way to combine two tables so that duplicate records, identified by email, are removed, with the values from "Table 2" taking priority over any duplicates. I have considered FULL OUTER JOIN and UNION ALL, but the UNION ALL would be too large, as each table has more than 1,000 columns. I want to create this combined table as my full reference table and save it as a view, so I can reference it without always adding a union or something similar to my already complex statements. From my understanding, a full outer join will not necessarily remove duplicates. I want to:
a. Create table with ALL columns from both tables (fields that don't apply to records in one table will just have null values)
b. Remove duplicate records from this master table based on email field but only remove the table 1 records and keep the table 2 duplicates as they have the information that I want
c. A left-join will not work as both tables have unique records that I want to retain and I would like all 1000+ columns to be retained from each table
I don't know how feasible this even is but thank you so much for any answers!
If I understand your question correctly you want to join two large tables with thousands of columns that (hopefully) are the same between the two tables using the email column as the join condition and replacing duplicate records between the two tables with the records from Table 2.
I had to do something similar a few days ago so maybe you can modify my query for your purposes:
WITH only_in_table_1 AS (
    SELECT *
    FROM table_1 A
    WHERE NOT EXISTS
        (SELECT * FROM table_2 B WHERE B.email_field = A.email_field))
SELECT * FROM table_2
UNION ALL
SELECT * FROM only_in_table_1
If the columns/fields aren't the same between tables you can use a full outer join on only_in_table_1 and table_2
Try using a FULL OUTER JOIN between the two tables, then apply COALESCE to each result-set column to choose which table's column populates it.
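As a rough illustration of the UNION ALL plus NOT EXISTS approach, here is a small sketch that saves the combination as a view. It uses SQLite through Python's sqlite3 module, and the two-column tables with a city column are made-up stand-ins for the real wide tables; it assumes both tables share the same column layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables with identical columns.
    CREATE TABLE table_1 (email_field TEXT, city TEXT);
    CREATE TABLE table_2 (email_field TEXT, city TEXT);
    INSERT INTO table_1 VALUES ('a@example.com', 'old'),
                               ('b@example.com', 'only in 1');
    INSERT INTO table_2 VALUES ('a@example.com', 'new'),
                               ('c@example.com', 'only in 2');

    -- All of table_2, plus the table_1 rows whose email is not in table_2.
    CREATE VIEW combined AS
        SELECT email_field, city FROM table_2
        UNION ALL
        SELECT email_field, city
        FROM table_1 AS a
        WHERE NOT EXISTS (
            SELECT 1 FROM table_2 AS b WHERE b.email_field = a.email_field
        );
""")

combined = conn.execute(
    "SELECT email_field, city FROM combined ORDER BY email_field"
).fetchall()
for row in combined:
    print(row)
```

The shared email a@example.com appears once, carrying the table_2 value, while the rows unique to either table are kept.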

T-SQL Match records 1 to 1 without join condition

I have a group of entities, each of which needs to have a record from another table associated with it.
When I try to output an Id for the table to be matched on, it doesn't work, because OUTPUT can only reference the inserted (or deleted) pseudo-tables.
DECLARE @SignatureGlobalIdsTbl table (ID int,
                                      CompanyBankAccountId int);
INSERT INTO GlobalIds (TypeId)
-- I cannot output cba.Id into the table since it's not from inserted
OUTPUT Inserted.Id,
       cba.Id
INTO @SignatureGlobalIdsTbl (ID,
                             CompanyBankAccountId)
SELECT (@DocumentsGlobalTypeKey)
FROM CompanyBankAccounts cba
     INNER JOIN Companies c ON c.CompanyId = cba.CompanyId
WHERE SignatureDocumentId IS NULL
  AND (SignatureFile IS NOT NULL
       AND SignatureFile != '');
INSERT INTO Documents (DocumentPath,
                       DocumentType,
                       DocumentIsExternal,
                       OwnerGlobalId,
                       OwnerGlobalTypeID,
                       DocumentName,
                       Extension,
                       GlobalId)
SELECT SignatureFile,
       @SignatureDocumentTypeKey,
       1,
       CompanyGlobalId,
       @OwnerGlobalTypeKey,
       [dbo].[fnGetFileNameWithoutExtension](SignatureFile),
       [dbo].[fnGetFileExtension](SignatureFile),
       documentGlobalId
FROM (SELECT c.GlobalId AS CompanyGlobalId,
             cba.*,
             s.ID AS documentGlobalId
      FROM CompanyBankAccounts cba
           INNER JOIN Companies c ON c.CompanyId = cba.CompanyId
           CROSS JOIN @SignatureGlobalIdsTbl s) info
WHERE SignatureDocumentId IS NULL
  AND (SignatureFile IS NOT NULL
       AND SignatureFile != '');
I tried to use a cross join to avoid a Cartesian product, but that did not work. I also tried to output a row number over some value, but I could not get that stored in the table either.
If I have two separate queries that return the same number of records, how can I pair the records together without creating a Cartesian product?
'When I try to output an Id for the table ... it doesn't work.'
This seems to be because one of the columns you want to OUTPUT is not actually part of the insert. It's an annoying problem and I wish SQL Server would allow us to do it.
Someone may have a much better answer for this than I do, but the way I usually approach this is:
1. Create a temporary table (or similar) of the data I want to insert, with a column for the ID (initially blank).
2. Insert the correct number of rows, and capture the IDs in another temporary table.
3. Assign the IDs as appropriate within the original temporary table.
4. Go back and update the inserted rows with any additional data needed (though that's probably not needed here given you're just inserting a constant).
What this does is to flag/get the IDs ready for you to use, then you allocate them to your data as needed, then fill in the table with the data. It's relatively simple although it does do 2 table hits rather than 1.
Also consider doing it all within a transaction to keep the data consistent (though also probably not needed here).
How can I pair the records together?
A cross join unfortunately multiplies the rows (number of rows on left times the number of rows on the right). It is useful in some instances, but possibly not here.
I suggest when you do your inserts above, you get an identifier (e.g., companyID) in your temp table and join on that.
If you don't have a matching record and just want to assign them in order, you can use an answer similar to my answer in another recent question How to update multiple rows in a temp table with multiple values from another table using only one ID common between them?
Further notes
I suggest avoiding table variables (e.g., DECLARE @yourtable TABLE) and using temporary tables (CREATE TABLE #yourtable) instead, for performance reasons. With only a small number of rows it's OK, but it gets worse as the row count grows, because SQL Server typically assumes that a table variable holds only 1 row when it builds the plan.
In your bottom statement, why is there the SELECT statement in the FROM clause? Couldn't you just get rid of that select statement and have the FROM clause list the tables you want?
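To illustrate the idea of pairing two same-sized result sets in order without a cross join, here is a minimal sketch that numbers the rows on each side with ROW_NUMBER() and joins on that number. It runs on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later), and the accounts and new_ids tables are hypothetical stand-ins for the bank accounts and the freshly generated IDs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins: rows that need an ID, and the new IDs.
    CREATE TABLE accounts (account_id INTEGER);
    CREATE TABLE new_ids  (id INTEGER);
    INSERT INTO accounts VALUES (10), (20), (30);
    INSERT INTO new_ids  VALUES (501), (502), (503);
""")

# Number each side independently, then join on the row number.
# Every row gets exactly one partner instead of a Cartesian product.
pairs = conn.execute("""
    WITH a AS (
        SELECT account_id, ROW_NUMBER() OVER (ORDER BY account_id) AS rn
        FROM accounts
    ),
    n AS (
        SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM new_ids
    )
    SELECT a.account_id, n.id
    FROM a
    JOIN n ON n.rn = a.rn
    ORDER BY a.rn
""").fetchall()

print(pairs)
```

The same ROW_NUMBER() construct exists in T-SQL, so the query part should carry over to SQL Server with little change.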
I figured out a way to have access to the output, by using a merge statement.
DECLARE @LogoGlobalIdsTbl TABLE (ID INT, companyBankAccountID INT);
MERGE GlobalIds
USING
(
    SELECT (cba.CompanyBankAccountId)
    FROM CompanyBankAccounts cba
    INNER JOIN Companies c ON c.CompanyId = cba.CompanyId
    WHERE cba.LogoDocumentId IS NULL AND (cba.LogoFile IS NOT NULL AND cba.LogoFile != '')
) src ON (1=0)
WHEN NOT MATCHED
    THEN INSERT ( TypeId )
    VALUES (@DocumentsGlobalTypeKey)
OUTPUT [INSERTED].[Id], src.CompanyBankAccountId
INTO @LogoGlobalIdsTbl;

How to use a SQL table instead of a long string for a WHERE in (string) condition

I have the following SQL statement:
select customer_id, prod_id, prod_start, prod_price
from prod_table
where prod_id in (PRODLIST)
Unfortunately, PRODLIST contains about 68K 6-digit numbers. When I try to run this query on my server, I get an error that SQL can't handle so many prod_id as presented in a string.
My next thought was to put all the 68K 6-digit numbers into a single-column table included_prodlist with column heading included_prod_id. The resulting included_prodlist table would then be a single-column table with 68K rows, and each row would be a unique 6-digit number.
I could then filter the original query against included_prodlist as follows:
select customer_id, prod_id, prod_start, prod_price
from prod_table
where prod_id in (select included_prod_id from included_prodlist)
Unfortunately, this doesn't seem to be working i.e. the query returns no entries.
Is this the proper way to deal with long conditions?
Should I be using an inner join instead?
select customer_id, prod_id, prod_start, prod_price
from prod_table
inner join included_prodlist on prod_table.prod_id = included_prodlist.included_prod_id
Putting the values into a table is highly recommended. The one column should be the primary key.
Then, I would go for exists rather than in:
select p.customer_id, p.prod_id, p.prod_start, p.prod_price
from prod_table p
where exists (select 1
from included_prodlist ip
where ip.included_prod_id = p.prod_id
);
Of course, using an INNER JOIN can also help performance. As a best practice, create the index that the query execution plan recommends :)
An INNER JOIN on a single-column table is preferable to a nested query
In a nested query, the inner query conceptually runs first and its results are fed to the outer query
With a join, there is only one query, preferably on indexed columns
You can see the differences by adding EXPLAIN before the SELECT command
I would generate the product list table as a temp table with an indexed column; that way the join would run even faster
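A small sketch of the lookup-table approach, using SQLite through Python's sqlite3 module. The sample rows and the three-element wanted list are invented for illustration; the table and column names follow the question. It loads the IDs into a table keyed on included_prod_id and then filters with both EXISTS and an INNER JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod_table (customer_id INTEGER, prod_id INTEGER, prod_price REAL);
    -- The single column is the primary key, so lookups are indexed.
    CREATE TABLE included_prodlist (included_prod_id INTEGER PRIMARY KEY);
    INSERT INTO prod_table VALUES (1, 100001, 9.5), (2, 100002, 4.0), (3, 999999, 1.0);
""")

# Hypothetical list standing in for the 68K product IDs.
wanted = [100001, 100002, 100003]
conn.executemany("INSERT INTO included_prodlist VALUES (?)", [(p,) for p in wanted])

exists_rows = conn.execute("""
    SELECT p.customer_id, p.prod_id
    FROM prod_table AS p
    WHERE EXISTS (
        SELECT 1 FROM included_prodlist AS ip
        WHERE ip.included_prod_id = p.prod_id
    )
    ORDER BY p.customer_id
""").fetchall()

join_rows = conn.execute("""
    SELECT p.customer_id, p.prod_id
    FROM prod_table AS p
    INNER JOIN included_prodlist AS ip ON ip.included_prod_id = p.prod_id
    ORDER BY p.customer_id
""").fetchall()

print(exists_rows)
print(join_rows)
```

Both forms return the same rows here because the list has a primary key and therefore no repeated IDs; with duplicates in the list, the join would repeat rows while EXISTS would not.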

Create table that is table 1 minus table 2 based on three criteria

I have a table of LoggedDischarges and another table of ActualDischarges.
I am trying to generate a query that will give me all the fields from ActualDischarges excluding those already in LoggedDischarges based on AgencyID, Program and ActivityEndDate
A client can be in multiple programs and be discharged from multiple on the same day. I need to make sure I get LoggedDischarges from each program.
This is what I have but am not sure how to add the other criteria.
select * from ActualDischarges
where (agencychildid ) not in
(select agencyid from LoggedDischarges)
Thank you,
Steve Hathaway
Even if your DBMS supports multiple columns in a subquery, like
where (AgencyID, Program, ActivityEndDate) not in
    ( select AgencyID, Program, ActivityEndDate
      from ... )
you had better switch to NOT EXISTS, because NOT IN returns no rows at all if the subquery yields any NULLs:
select * from ActualDischarges as aD
where NOT EXISTS
    (select * from LoggedDischarges as lD
     where aD.AgencyID = lD.AgencyID
       and aD.Program = lD.Program
       and aD.ActivityEndDate = lD.ActivityEndDate)
For this type of match, I would recommend a LEFT JOIN with an IS NULL at the end to determine that the second table does not have the record:
SELECT a.*
FROM ActualDischarges AS a
LEFT JOIN LoggedDischarges AS l
    ON l.agencyid = a.agencychildid
    AND a.program = l.program
    AND a.ActivityEndDate = l.ActivityEndDate
WHERE l.agencyid IS NULL
As a side note, definitely avoid chaining multiple IN conditions for situations like this (WHERE x NOT IN (...) AND y NOT IN (...), etc.), as you end up excluding records that match different rows in LoggedDischarges on different columns, which is rarely the desired result.
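The difference between these approaches can be checked with a small experiment. The sketch below uses SQLite through Python's sqlite3 module, with invented sample rows and with both tables using the column name AgencyID (as in the NOT EXISTS answer). One logged row has a NULL AgencyID to show how NOT IN reacts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ActualDischarges (AgencyID INTEGER, Program TEXT, ActivityEndDate TEXT);
    CREATE TABLE LoggedDischarges (AgencyID INTEGER, Program TEXT, ActivityEndDate TEXT);
    -- Client 1 is discharged from two programs on the same day.
    INSERT INTO ActualDischarges VALUES
        (1, 'A', '2020-01-31'), (1, 'B', '2020-01-31'), (2, 'A', '2020-02-15');
    -- Only program A is logged for client 1; one logged row has no AgencyID.
    INSERT INTO LoggedDischarges VALUES
        (1, 'A', '2020-01-31'), (NULL, 'Z', '2020-03-01');
""")

not_exists_rows = conn.execute("""
    SELECT aD.* FROM ActualDischarges AS aD
    WHERE NOT EXISTS (
        SELECT 1 FROM LoggedDischarges AS lD
        WHERE lD.AgencyID = aD.AgencyID
          AND lD.Program = aD.Program
          AND lD.ActivityEndDate = aD.ActivityEndDate
    )
    ORDER BY aD.AgencyID, aD.Program
""").fetchall()

left_join_rows = conn.execute("""
    SELECT a.* FROM ActualDischarges AS a
    LEFT JOIN LoggedDischarges AS l
        ON l.AgencyID = a.AgencyID
       AND l.Program = a.Program
       AND l.ActivityEndDate = a.ActivityEndDate
    WHERE l.AgencyID IS NULL
    ORDER BY a.AgencyID, a.Program
""").fetchall()

# Single-column NOT IN: the NULL in the subquery makes every comparison
# unknown, so no row passes the filter.
not_in_rows = conn.execute("""
    SELECT * FROM ActualDischarges
    WHERE AgencyID NOT IN (SELECT AgencyID FROM LoggedDischarges)
""").fetchall()

print(not_exists_rows)
print(left_join_rows)
print(not_in_rows)
```

NOT EXISTS and the LEFT JOIN both return the two unlogged discharges, including client 1's second program, while the NOT IN version returns nothing because of the NULL.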

Select rows that are different in SQL

I have a table with way too many columns and a couple million rows that I need to query for differences.
On these rows there will hopefully be only one column that is different and that should be the Auto incremented id field.
What I need to do is check to see if these rows ARE actually the same and if there are any that have any differences in any of the fields.
So for example, if the "Name" column is supposed to be "Peter, Paul and Mary" and the "Order #" column is supposed to be "132" I need to find any rows where those values aren't true, but I need to find it for every column in the table AND I don't actually know what the correct values are (meaning I can't just create a "SELECT...WHERE Name='This'" for each column).
So how can I find the rows that are different? (using straight SQL, no programming)
Is this answer what you are looking for? Here's a link to the appropriate SQL query.
Let's suppose you coded an email newsletter signup form, but you forgot to double-check that the email address was not a duplicate, or already in the database. We can write a query to find all the emails in our table that are duplicates, or occur in more than one row.
The following SQL query works well for finding duplicate values in a table.
SELECT email,
COUNT(email) AS NumOccurrences
FROM users
GROUP BY email
HAVING ( COUNT(email) > 1 )
By using group by and then having a count greater than one, we find the email addresses that appear in more than one row.
If you know the limit of the wrong results (say 10 for example) then you could order them and get only the first 11 results. You see where I am going with this, right?
I have no SQL expertise whatsoever though :)
Do you need to do this programmatically, or can you just run a few queries yourself to check it?
If the latter, I'd just do "select distinct name, order#" to start. This should return a list that includes "Peter Paul and Mary, 132" and possibly some other things.
Then find the other things by doing select ... where name = "this" as you suggest.
You could get even more info out of that first query by doing "select distinct name, order#, count(*) from ... group by name, order#". This would give you both the list of values and the frequency of a given set of values.
If I understand you correctly (your question is not 100% clear to me), you are trying to find the rows that are unnecessary duplicates? If so, try these SQL queries:
Select A.Id, B.Id
From Table A
Join Table B
    On A.Id <> B.Id
    And A.ColA = B.ColA
    And A.ColB = B.ColB
    And A.ColC = B.ColC
...
Or
Select ColA, ColB, etc.
From Table
Group By ColA, ColB, etc.
Having Count(*) > 1
If you have a correlation between two "independent" columns where there is really only one "correct" value for column B whenever column A is a given value, then you have a broken database design, because these correlation should have been factored out as a separate table.
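As a sketch of how grouping on every column except the auto-incremented id can surface the odd rows, the example below counts each distinct combination. It uses SQLite through Python's sqlite3 module; the table t and its two data columns are hypothetical and far narrower than the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical narrow table; id is the only column expected to differ.
    CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, order_num INTEGER);
    INSERT INTO t (name, order_num) VALUES
        ('Peter, Paul and Mary', 132),
        ('Peter, Paul and Mary', 132),
        ('Peter, Paul and Marie', 132);
""")

# Group by every column except id. The most frequent combination is
# presumably the expected one; rare combinations are the rows that differ.
groups = conn.execute("""
    SELECT name, order_num, COUNT(*) AS n
    FROM t
    GROUP BY name, order_num
    ORDER BY n DESC, name
""").fetchall()

for name, order_num, n in groups:
    print(name, order_num, n)
```

No expected values have to be known in advance: the combination with a count of 1 is the row that deviates from the rest.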
Try this:
SELECT T1.Name, T1.OrderNum
FROM Orders T1
FULL OUTER JOIN (
    SELECT Name, OrderNum
    FROM Orders
    GROUP BY Name, OrderNum
    HAVING COUNT(*) > 1) T2
ON T1.Name = T2.Name
AND T1.OrderNum = T2.OrderNum
WHERE T2.Name IS NULL
The nested select identifies the duplicated combinations, so you will need to target your common fields. The outer join pairs every row with its duplicate group, if it has one, and the WHERE T2.Name IS NULL filter keeps only the rows that have no duplicate. So essentially you are joining the table to itself to identify the duplicates and exclude them from your results. If you want only the duplicates, drop the WHERE clause and change the FULL OUTER JOIN to just JOIN.