SQL - remove duplicates from left join - sql

I'm creating a joined view of two tables, but am getting unwanted duplicates from table2.
For example: table1 has 9000 records and I need the resulting view to contain exactly the same; table2 may have multiple records with the same FKID but I only want to return one record (random chosen is ok with my customer). I have the following code that works correctly, but performance is slower than desired (over 14 seconds).
SELECT
OBJECTID
, PKID
,(SELECT TOP (1) SUBDIVISIO
FROM dbo.table2 AS t2
WHERE (t1.PKID = t2.FKID)) AS ProjectName
,(SELECT TOP (1) ASBUILT1
FROM dbo.table2 AS t2
WHERE (t1.PKID = t2.FKID)) AS Asbuilt
FROM dbo.table1 AS t1
Is there a way to do something similar with joins to speed up performance?
I'm using SQL Server 2008 R2.
I got close with the following code (~.5 seconds), but 'Distinct' only filters out records when all columns are duplicate (rather than just the FKID).
SELECT
t1.OBJECTID
,t1.PKID
,t2.ProjectName
,t2.Asbuilt
FROM dbo.table1 AS t1
LEFT JOIN (SELECT
DISTINCT FKID
,ProjectName
,Asbuilt
FROM dbo.table2) t2
ON t1.PKID = t2.FKID
table examples
table1 table2
OID, PKID FKID, ProjectName, Asbuilt
1, id1 id1, P1, AB1
2, id2 id1, P5, AB5
3, id4 id2, P10, AB2
5, id5 id5, P4, AB4
In the above example returned records should be id5/P4/AB4, id2/P10/AB2, and (id1/P1/AB1 OR id1/P5/AB5)
My search came up with similar questions, but none that resolved my problem. link, link
Thanks in advance for your help. This is my first post so let me know if I've broken any rules.

This will give the results you requested and should have the best performance.
SELECT
OBJECTID
, PKID
, t2.SUBDIVISIO,
, t2.ASBUILT1
FROM dbo.table1 AS t1
OUTER APPLY (
SELECT TOP 1 *
FROM dbo.table2 AS t2
WHERE t1.PKID = t2.FKID
) AS t2

Your original query is producing arbitrary values for the two columns (the use of top with no order by). You can get the same effect with this:
SELECT t1.OBJECTID, t1.PKID, t2.ProjectName, t2.Asbuilt
FROM dbo.table1 t1 LEFT JOIN
(SELECT FKID, min(ProjectName) as ProjectName, MIN(asBuilt) as AsBuilt
FROM dbo.table2
group by fkid
) t2
ON t1.PKID = t2.FKID
This version replaces the distinct with a group by.
To get a truly random row in SQL Server (which your syntax suggests you are using), try this:
SELECT t1.OBJECTID, t1.PKID, t2.ProjectName, t2.Asbuilt
FROM dbo.table1 t1 LEFT JOIN
(SELECT FKID, ProjectName, AsBuilt,
ROW_NUMBER() over (PARTITION by fkid order by newid()) as seqnum
FROM dbo.table2
) t2
ON t1.PKID = t2.FKID and t2.seqnum = 1
This assumes version 2005 or greater.

If you want described result, you need to use INNER JOIN and following query will satisfy your need:
SELECT
t1.OID,
t1.PKID,
MAX(t2.ProjectName) AS ProjectName,
MAX(t2.Asbuilt) AS Asbuilt
FROM table1 t1
JOIN table2 t2 ON t1.PKID = t2.FKID
GROUP BY
t1.OID,
t1.PKID
If you want to see all rows from left table (table1) whether it has pair in right table or not, then use LEFT JOIN and same query will gave you desired result.
EDITED
This construction has good performance, and you dont need to use subqueries.

Related

SQL Server query showing most recent distinct data

I am trying to build a SQL query to recover only the most young record of a table (it has a Timestamp column already) where the item by which I want to filter appears several times, as shown in my table example:
.
Basically, I have a table1 with Id, Millis, fkName and Price, and a table2 with Id and Name.
In table1, items can appear several times with the same fkName.
What I need to achieve is building up a single query where I can list the last record for every fkName, so that I can get the most actual price for every item.
What I have tried so far is a query with
SELECT DISTINCT [table1].[Millis], [table2].[Name], [table1].[Price]
FROM [table1]
JOIN [table2] ON [table2].[Id] = [table1].[fkName]
ORDER BY [table2].[Name]
But I don't get the correct listing.
Any advice on this? Thanks in advance,
A simple and portable approach to this greatest-n-per-group problem is to filter with a subquery:
select t1.millis, t2.name, t1.price
from table1 t1
inner join table2 t2 on t2.id = t1.fkName
where t1.millis = (select max(t11.millis) from table1 t11 where t11.fkName = t1.fkName)
order by t1.millis desc
using Common Table Expression:
;with [LastPrice] as (
select [Millis], [Price], ROW_NUMBER() over (Partition by [fkName] order by [Millis] desc) rn
from [table1]
)
SELECT DISTINCT [LastPrice].[Millis],[table2].[Name],[LastPrice].[Price]
FROM [LastPrice]
JOIN [table2] ON [table2].[Id] = [LastPrice].[fkName]
WHERE [LastPrice].rn = 1
ORDER BY [table2].[Name]

SQL Multiple INNER JOINS In One Select-Statement

I am using this code for inventory management system, in which i want to retrieve stock in hand from four tables. i have tried with two table and got accurate result as i need it.please help me out.
Table Schema
Productmastertb
prod_id,
Product_name
salesdetailstb
sales_id,
Prod_id,
Prod_qty
estimatedetailstb
est_id,
Prod_id,
Prod_qty
Purchasedetailstb
est_id,
Prod_id,
Prod_qty
Query example (working):
SELECT
productmastertb.prod_id,
productmastertb.prod_name,
sum(estimatedetailstd.prod_qty) as Est_qty
FROM
productmaster
INNER JOIN
estimatedetailstb ON productmastertb.prodid = estimatedetails.prodid
GROUP BY
productmastertb.prod_id, productmastertb.prod_name
Similarly I have to retrieve sum of salesdetailstb.qty and purchasedetailstb.qty
Thanks in advance
You want to summarize across different "dimensions" -- that is tables. One good approach is to aggregate before doing the JOINs. Or to use subqueries. Here is the latter approach:
SELECT pm.prod_id, pm.prod_name,
(SELECT SUM(ed.prod_qty)
FROM estimatedetailstb as ed
WHERE ed.prodid = ed.prodidas
) as Est_qty,
(SELECT SUM(sd.prod_qty)
FROM salesdetailstb as sd
WHERE sd.prodid = pm.prodidas
) as Sales_qty,
(SELECT SUM(pd.prod_qty)
FROM purchasedetailstb as pd
WHERE pd.prodid = pm.prodid
) as Sales_qty
FROM productmaster pm;
This will give you all products, even those missing from one or more of the other tables.
You can add multiple joins.
SELECT t1.id, t4.name, count(t4.name)
FROM Table1 AS t1
INNER JOIN Table2 AS t2 -- the AS statement renames the table within
-- this query to t2. Columns from this table can be used
-- as t2.columnname. This needs to be done when you have
-- columns with the same name in different tables.
ON t1.id = t2.id
INNER JOIN Table3 as t3
ON t1.id = t3.id
INNER JOIN Table4 as t4
ON t3.name = t4.name
GROUP BY t1.id, t4.name

Unpivot without using columns names

Right now my query is resulting data in below format.
dbid askid amid
================================
d1 m1 a1
I want to change the display as
SOURCE_ID DEST_ID
========================
m1 d1
a1 d1
We can't use union all, inner joins, cross joins because in real time it is very big query and interacts with n no.of multiple tables and it already have them many no.of
I tried using unpivot but every time it is storing column names also in to rows.
Sample Query which is resulting data right now:
select DEST_ID,SOURCE_ID from
(select
t1.id as dbid,
t2.mid as askid,
t3.m2idd as amid from
table1 t1, table2 t2, table3 t3 where
t1.actid = t2.senid
and t2.denid = t2.mkid
)
unpivot INCLUDE NULLS (SOURCE_ME_GUID FOR DEST_ME_GUID IN (amid, askid));
Use UNION ALL in combination with a WITH clause.
You got confused with your join criteria somehow. I suppose either mkid or denid resides in table3. Otherwise you'd be cross joining table3. You shouldn't use this 1980s join syntax anyway; it's been made redundant in 1992 for a reason.
with ids as
(
select t1.id as dbid, t2.mid as askid, t3.m2idd as amid
from table1 t1
join table2 t2 on t2.senid = t1.actid
join table3 t3 on t3.mkid = t2.denid
)
select askid as source_id, dbid as dest_id from ids
union all
select amid as source_id, dbid as dest_id from ids;
Solution for the above issue.
select dbid as DEST_ID, SOURCE_ID
from (
select 'd1' dbid,
cast('m1' as varchar2(200)) as askid,
cast('a1' as varchar2(200)) as amid
from dual
) unpivot INCLUDE NULLS (
SOURCE_ID FOR DEST_ID IN (amid, askid)
);

Comparing two tables for equality in HIVE

I have two tables, table1 and table2. Each with the same columns:
key, c1, c2, c3
I want to check to see if these tables are equal to eachother (they have the same rows). So far I have these two queries (<> = not equal in HIVE):
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key
where t2.key is null or t1.c1<>t2.c1 or t1.c2<>t2.c2 or t1.c3<>t2.c3
And
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t2.key is null
So my idea is that, if a zero count is returned, the tables are the same. However, I'm getting a zero count for the first query, and a non-zero count for the second query. How exactly do they differ? If there is a better way to check this certainly let me know.
The first one excludes rows where t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, or t2.c3 is null. That means that you effectively doing an inner join.
The second one will find rows that exist in t1 but not in t2.
To also find rows that exist in t2 but not in t1 you can do a full outer join. The following SQL assumes that all columns are NOT NULL:
select count(*) from table1 t1
full outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t1.key is null /* this condition matches rows that only exist in t2 */
or t2.key is null /* this condition matches rows that only exist in t1 */
If you want to check for duplicates and the tables have exactly the same structure and the tables do not have duplicates within them, then you can do:
select t.key, t.c1, t.c2, t.c3, count(*) as cnt
from ((select t1.*, 1 as which from table1 t1) union all
(select t2.*, 2 as which from table2 t2)
) t
group by t.key, t.c1, t.c2, t.c3
having cnt <> 2;
There are various ways that you can relax the conditions in the first paragraph, if necessary.
Note that this version also works when the columns have NULL values. These might be causing the problem with your data.
Well, the best way is calculate the hash sum of each table, and compare the sum of hash.
So no matter how many column are they, no matter what data type are they, as long as the two table has the same schema, you can use following query to do the comparison:
select sum(hash(*)) from t1;
select sum(hash(*)) from t2;
And you just need to compare the return values.
I used EXCEPT statement and it worked.
select * from Original_table
EXCEPT
select * from Revised_table
Will show us all the rows of the Original table that are not in the Revised table.
If your table is partitioned you will have to provide a partition predicate.
Fyi, partition values don't need to be provided if you use Presto and querying via SQL lab.
I would recommend you not using any JOINs to try to compare tables:
it is quite an expensive operations when tables are big (which is often the case in Hive)
it can give problems when some rows/"join keys" are repeated
(and it can also be unpractical when data are in different clusters/datacenters/clouds).
Instead, I think using a checksum approach and comparing the checksums of both tables is best.
I have developed a Python script that allows you to do easily such comparison, and see the differences in a webbrowser:
https://github.com/bolcom/hive_compared_bq
I hope that can help you!
First get count for both the tables C1 and C2. C1 and C2 should be equal. C1 and C2 can be obtained from the following query
select count(*) from table1
if C1 and C2 are not equal, then the tables are not identical.
2: Find distinct count for both the tables DC1 and DC2. DC1 and DC2 should be equal. Number of distinct records can be found using the following query:
select count(*) from (select distinct * from table1)
if DC1 and DC2 are not equal, the tables are not identical.
3: Now get the number of records obtained by performing a union on the 2 tables. Let it be U. Use the following query to get the number of records in a union of 2 tables:
SELECT count (*)
FROM
(SELECT *
FROM table1
UNION
SELECT *
FROM table2)
You can say that the data in the 2 tables is identical if distinct count for the 2 tables is equal to the number of records obtained by performing union of the 2 tables. ie DC1 = U and DC2 = U
another variant
select c1-c2 "different row counts"
, c1-c3 "mismatched rows"
from
( select count(*) c1 from table1)
,( select count(*) c2 from table2 )
,(select count(*) c3 from table1 t1, table2 t2
where t1.key= t2.key
and T1.c1=T2.c1 )
Try with WITH Clause:
With cnt as(
select count(*) cn1 from table1
)
select 'X' from dual,cnt where cnt.cn1 = (select count(*) from table2);
One easy solution is to do inner join. Let's suppose we have two hive tables namely table1 and table2. Both the table has same column namely col1, col2 and col3. The number of rows should also be same. Then the command would be as follows
**
select count(*) from table1
inner join table2
on table1.col1 = table2.col1
and table1.col2 = table2.col2
and table1.col3 = table2.col3 ;
**
If the output value is same as number of rows in table1 and table2 , then all the columns has same value, If however the output count is lesser than there are some data which are different.
Use a MINUS operator:
SELECT count(*) FROM
(SELECT t1.c1, t1.c2, t1.c3 from table1 t1
MINUS
SELECT t2.c1, t2.c2, t2.c3 from table2 t2)

Left Join with duplicate keys in the right table

I'm trying merging 2 tables as follow
SELECT * FROM T1
LEFT JOIN T2 ON T1.EMPnum = T2.EMPnum
The above works well but I need the joining to use only the first appearance of the common key EMPnum record in table T2 so that the query returns exactly the same number of rows as T1
Thanks Avi
SQL tables are inherently unordered, so there is no such thing as a "first" key. In most databases, you can do something like this:
with t2 as (
select t2.*, row_number() over (partition by EMPnum order by id) as seqnum
from t2
)
select *
from t1 left join
t2
on t1.EMPnum = t2.EMPnum and t2.seqnum = 1;
Here id is just any column that specifies the ordering. If none exist, you can use EMPnum to get an arbitrary row.