Best way to combine two tables, remove duplicates, but keep all other non-duplicate values in SQL - sql

I am looking for the best way to combine two tables in a way that will remove duplicate records based on email with a priority of replacing any duplicates with the values in "Table 2", I have considered full outer join and UNION ALL but Union all will be too large as each table has several 1000 columns. I want to create this combination table as my full reference table and save as a view so I can reference it without always adding a union or something to that effect in my already complex statements. From my understanding, a full outer join will not necessarily remove duplicates. I want to:
a. Create table with ALL columns from both tables (fields that don't apply to records in one table will just have null values)
b. Remove duplicate records from this master table based on email field but only remove the table 1 records and keep the table 2 duplicates as they have the information that I want
c. A left-join will not work as both tables have unique records that I want to retain and I would like all 1000+ columns to be retained from each table
I don't know how feasible this even is but thank you so much for any answers!

If I understand your question correctly you want to join two large tables with thousands of columns that (hopefully) are the same between the two tables using the email column as the join condition and replacing duplicate records between the two tables with the records from Table 2.
I had to do something similar a few days ago so maybe you can modify my query for your purposes:
WITH only_in_table_1 AS(
SELECT *
FROM table_1 A
WHERE NOT EXISTS
(SELECT * FROM table_2 B WHERE B.email_field = A.email_field))
SELECT * FROM table_2
UNION ALL
SELECT * FROM only_in_table_1
If the columns/fields aren't the same between tables you can use a full outer join on only_in_table_1 and table_2

try using a FULL OUTER JOIN between the two tables and then a COALESCE function on each resultset column to determine from which table/column the resultset column is populated

Related

How to choose all column of a single table in a joined table?

I have a query that is like this:
SELECT * FROM A JOIN B
Table A and B are almost equal, but I use the join to filter on certain rows. Now my result table is having all columns twice, and this is causing referencing trouble where I am using this table. Therefore I would like to only show table A's columns. Is there a more easy way to achieve this than retyping all column names on the spot of *?
If you only want the columns from A, then use:
select A.*
You can add additional columns as you like:
select A.*, B.special_column

How to drop one join key when joining two tables

I have two tables. Both have lot of columns. Now I have a common column called ID on which I would join.
Now since this variable ID is present in both the tables if I do simply this
select a.*,b.*
from table_a as a
left join table_b as b on a.id=b.id
This will give an error as id is duplicate (present in both the tables and getting included for both).
I don't want to write down separately each column of b in the select statement. I have lots of columns and that is a pain. Can I rename the ID column of b in the join statement itself similar to SAS data merge statements?
I am using Postgres.
Postgres would not give you an error for duplicate output column names, but some clients do. (Duplicate names are also not very useful.)
Either way, use the USING clause as join condition to fold the two join columns into one:
SELECT *
FROM tbl_a a
LEFT JOIN tbl_b b USING (id);
While you join the same table (self-join) there will be more duplicate column names. The query would make hardly any sense to begin with. This starts to make sense for different tables. Like you stated in your question to begin with: I have two tables ...
To avoid all duplicate column names, you have to list them in the SELECT clause explicitly - possibly dealing out column aliases to get both instances with different names.
Or you can use a NATURAL join - if that fits your unexplained use case:
SELECT *
FROM tbl_a a
NATURAL LEFT JOIN tbl_b b;
This joins on all columns that share the same name and folds those automatically - exactly the same as listing all common column names in a USING clause. You need to be aware of rules for possible NULL values ...
Details in the manual.

How does JOIN work exactly in SQL

I know that joins work by combining two or more tables by their attributes, so if you have two tables that both have three columns and both have column INDEX, if you use table1 JOIN table2 you will get a new table with 5 columns, but what if you do not have a column that is shared by both table1 and table2? Can you still use JOIN or do you have to use TIMES?
Join is not a method for combining tables. It is a method to select records (and selected fields) from 2 or more tables where every table in the query must carry a field that can be matched to a field in another table in the query. The matched fields need not have the same name, but must carry the same type of data. Lacking this would be like trying to create meaning from joining a list of license plates of cars in NYC, with height data from lumberjacks in Washington state -- not meaningful.
Ex:)
Select h.name, h.home_address, h.home_phone, w.work_address,
w.department
from home h, work w
where h.employee_id = w.emp_id
As long as both columns: employee_id and emp_id carry the same information this query will work
In Microsoft Access, to get five rows from a three column table joined to a two column table, you'd use:
SELECT Table1.*, Table2.* FROM Table1 INNER JOIN Table2 ON Table1.Field1 = Table2.Field1;
You can query whatever you want, and join whatever you want, though.
If your one table is a list of people, and your other is a list of cars, and you want to see what people have names that are also models of cars, you can do:
SELECT Table1.Name, Table1.Age, Table2.Make, Table2.Year
FROM Table1 INNER JOIN Table2 ON Table1.Name = Table2.Model;
Only when Name is the same as Model will it show a record.
This is the same idea for joining tables in any relational DBMS I've used.
You are right you can join two tables even if they do not have shared column.
Join uses primary to prevent mistakes on inserting or deleting when user trying to insert record that does not has a parent one or some thing like this.
join methods has many types you can view them here:
http://dev.mysql.com/doc/refman/5.7/en/join.html
LEFT JOIN: select all records from first table, then selecting all records from second table that fulfilling the condition after ON clause.
you can't join the tables if they do not share a common column. If you can find a 3rd table that has common columns with table1 and table2 you can get them to join that way. so join table2 and tabl3 on a common column and than join table3 back to table1 on a common column.

Create new table by merging two existing tables based on matching field

I am attempting to create a new table using columns from two existing tables and it's not behaving the way I expected.
Table A has 91255063 records and table B has 2372294 records. Both tables have a common field named link_id. Link_id is not unique in either table and will not always exist in table B.
The end result I am looking for is a new table with 91255063 records, essentially all of Table A with any additional data from table B for the records with matching link_id's. I had thought outer join would accomplish this as follows:
use database1
SELECT a.*
,b.[AdditionalData1]
,b.[AdditionalData2]
,b.[AdditionalData3]
into dbo.COMBINEDTABLE
FROM Table1 a
left outer join Table2 b
ON a.LINK_ID = b.LINK_ID
This seems to work when looking at the resulting data however my row total for the newly created table COMBINEDTABLE now has 98011015 rows. Am I not using the correct join method here?
Most likely you have duplicate LINK_IDs on the right, thus for quite a few rows from Table1, there are multiplle rows from Table2. You could try using DISTINCT in your SELECT, or specify that you want only the records with the smallest or highest identifier column value (if you have one).

How to get names present in both views?

I have a very large view containing 5 million records containing repeated names with each row having unique transaction number. Another view of 9000 records containing unique names is also present. Now I want to retrieve records in first view whose names are present in second view
select * from v1 where name in (select name from v2)
But the query is taking very long to run. Is there any short cut method?
Did you try just using a INNER JOIN. This will return all rows that exist in both tables:
select v1.*
from v1
INNER JOIN v2
on v1.name = v2.name
If you need help learning JOIN syntax, here is a great visual explanation.
You can add the DISTINCT keyword which will remove any duplicate values that the query returns.
use JOIN.
The DISTINCT will allow you to return only unique records from the list since you are joining from the other table and there could be possibilities that a record may have more than one matches on the other table.
SELECT DISTINCT a.*
FROM v1 a
INNER JOIN v2 b
ON a.name = b.name
For faster performance, add an index on column NAME on both tables since you are joining through it.
To further gain more knowledge about joins, kindly visit the link below:
Visual Representation of SQL Joins