Different count results with join - sql

I have this SQL:
SELECT DISTINCT
count(KTT)
FROM
TRA.EVENT;
It returns a count of 1901335.
Now I want to extend the query with a join like this:
SELECT DISTINCT
count(E.KTT)
FROM
TRA.EVENT E
LEFT JOIN TRA.TMP_BNAME TBN ON E.KTT = TBN.KTT_DEF;
But here I get a result of 1942376.
I don't understand why. I expected a result of 1901335 here as well. I thought I was simply joining the values from TBN onto the entries of EVENT?
EDIT
SELECT DISTINCT
E.KTT,
TB.B_BEZEICHNER
FROM
TRA.EVENT E
LEFT JOIN TRA.TMP_BNAME TBN ON E.KTT = TBN.KTT_DEF
LEFT JOIN TRA.TMP_B TB ON TBN.B_ID = TB.B_ID;
What am I doing wrong?
Thanks for your help.
Stefan

You have not provided full details, so treat these comments as general ones.
When you join two tables, the join can produce "duplicate" rows from one of them. In your case, there may be more than one record with the same KTT_DEF in the TRA.TMP_BNAME table. When you join that to the TRA.EVENT table, it creates more than one output row for each matching original record in TRA.EVENT.
You can count the distinct values of KTT from TRA.EVENT using the DISTINCT keyword, but it has to go inside the COUNT: SELECT COUNT(DISTINCT E.KTT). This will match your first query's result only if the KTT values in TRA.EVENT are actually unique. If they are not, the count will differ from the first query.
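The row multiplication described above can be reproduced on a tiny data set. The following is only a sketch using Python's built-in sqlite3 module; the table and column names are simplified stand-ins for TRA.EVENT and TRA.TMP_BNAME, and the sample values are invented for illustration.

```python
import sqlite3

# In-memory database with simplified, hypothetical versions of the two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (ktt INTEGER);
    CREATE TABLE tmp_bname (ktt_def INTEGER, b_id INTEGER);
    INSERT INTO event VALUES (1), (2), (3);
    -- ktt_def = 2 appears twice, so event row 2 is matched twice.
    INSERT INTO tmp_bname VALUES (1, 10), (2, 20), (2, 21);
""")

# Count without any join: one per event row.
plain_count = conn.execute("SELECT COUNT(ktt) FROM event").fetchone()[0]

# Count after the LEFT JOIN: the duplicated key inflates the result.
joined_count = conn.execute("""
    SELECT COUNT(e.ktt)
    FROM event e
    LEFT JOIN tmp_bname tbn ON e.ktt = tbn.ktt_def
""").fetchone()[0]

# DISTINCT inside COUNT collapses the repeated ktt values again.
distinct_count = conn.execute("""
    SELECT COUNT(DISTINCT e.ktt)
    FROM event e
    LEFT JOIN tmp_bname tbn ON e.ktt = tbn.ktt_def
""").fetchone()[0]

print(plain_count, joined_count, distinct_count)  # 3 4 3
```

Event row 3 has no match and still counts once, because a LEFT JOIN keeps unmatched rows from the left table.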

You want to count the distinct KTT?
Then your code is wrong. You have to use:
SELECT count(DISTINCT KTT)
FROM TRA.EVENT;
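You get a different count because you are counting every row, not the distinct ones. In SELECT DISTINCT count(KTT), the DISTINCT is applied to the single result row after counting, so it has no effect on the count. And because the join adds more rows to the result, you get a bigger number.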

Try this:
SELECT COUNT(DISTINCT E.KTT)
FROM TRA.EVENT E
LEFT JOIN TRA.TMP_BNAME TBN ON E.KTT = TBN.KTT_DEF;

Related

Query with Left outer join and group by returning duplicates

To begin with, I have a table in my db that is fed with SalesForce info. When I run this example query it returns 2 rows:
select * from SalesForce_INT_Account__c where ID_SAP_BAYER__c = '3783513'
When I run this next query on the same table I obtain one of the rows, which is what I need:
SELECT MAX(ID_SAP_BAYER__c) FROM SalesForce_INT_Account__c where ID_SAP_BAYER__c = '3783513' GROUP BY ID_SAP_BAYER__c
Now, I have another table (PedidosEspecialesZarateCabeceras) which has a field (NroClienteDireccionEntrega) that I can match with the field I've been using in the SalesForce table (ID_SAP_BAYER__c). This table has a key that consists of just 1 field (NroPedido).
What I need to do is join these 2 tables to obtain a row from PedidosEspecialesZarateCabeceras with additional fields coming from the SalesForce table. If those additional fields are not available, they should come back as NULL values, so I'm using a LEFT OUTER JOIN.
The problem is that I have to match NroClienteDireccionEntrega with ID_SAP_BAYER__c, and since there are 2 rows in the SalesForce table with the same ID_SAP_BAYER__c, my query returns 2 duplicate rows from PedidosEspecialesZarateCabeceras (they both have the same NroPedido).
This is an example query that returns duplicates:
SELECT
cab.CUIT AS CUIT,
convert(nvarchar(4000), cab.NroPedido) AS NroPedido,
sales.BillingCity__c as Localidad,
sales.BillingState__c as IdProvincia,
sales.BillingState__c_Desc as Provincia,
sales.BillingStreet__c as Calle,
sales.Billing_Department__c as Distrito,
sales.Name as RazonSocial,
cab.NroCliente as ClienteId
FROM PedidosEspecialesZarateCabeceras AS cab WITH (NOLOCK)
LEFT OUTER JOIN
SalesForce_INT_Account__c AS sales WITH (NOLOCK) ON
cab.NroClienteDireccionEntrega = sales.ID_SAP_BAYER__c
and sales.ID_SAP_BAYER__c in
( SELECT MAX(ID_SAP_BAYER__c)
FROM SalesForce_INT_Account__c
GROUP BY ID_SAP_BAYER__c
)
WHERE cab.NroPedido ='5320'
Even though the join has MAX and GROUP BY, this returns 2 duplicate rows with different SalesForce information (because of the 2 SalesForce rows with the same ID_SAP_BAYER__c), which I thought should not be possible.
What I need is for the left outer join in my query to pick only ONE of the SalesForce rows, to prevent the duplication that is happening right now. For some reason the SELECT MAX with the GROUP BY is not working.
Maybe I should join these tables in a different way. Can anyone give me some other ideas on how to join the two tables so that just 1 row is returned? It doesn't matter if the SalesForce row that gets picked out of the 2 isn't the correct one; I just need it to pick one of them.
Your IN clause is not actually doing anything, since...
SELECT MAX(ID_SAP_BAYER__c)
FROM SalesForce_INT_Account__c
GROUP BY ID_SAP_BAYER__c
... returns all possible ID_SAP_BAYER__c values. (The GROUP BY says you want to return one row per unique ID_SAP_BAYER__c and then, since your MAX is operating on exactly one unique value per group, you simply return that value.)
You will want to change your query to operate on a value that is actually different between the two rows you are trying to differentiate (probably the MAX(ID) for the relevant ID_SAP_BAYER__c). Plus, you will want to link that inner query to your outer query.
You could probably do something like:
...
LEFT OUTER JOIN
SalesForce_INT_Account__c sales
ON cab.NroClienteDireccionEntrega = sales.ID_SAP_BAYER__c
and sales.ID in
(
SELECT MAX(ID)
FROM SalesForce_INT_Account__c sales2
WHERE sales2.ID_SAP_BAYER__c = cab.NroClienteDireccionEntrega
)
WHERE cab.NroPedido ='5320'
By using sales.ID in ... SELECT MAX(ID) ... instead of sales.ID_SAP_BAYER__c in ... SELECT MAX(ID_SAP_BAYER__c) ... this ensures you only match one of the two rows for that ID_SAP_BAYER__c. The WHERE sales2.ID_SAP_BAYER__c = cab.NroClienteDireccionEntrega condition links the inner query to the outer query.
There are multiple ways of doing the above, especially if you don't care which of the relevant rows you match on. You can use the above as a starting point and make it match your preferred style.
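To see why the original IN clause filters nothing while the correlated MAX(ID) version keeps a single row, here is a small sketch using Python's built-in sqlite3 module. The tables are cut-down, hypothetical versions of the real ones (cab for PedidosEspecialesZarateCabeceras, sales for SalesForce_INT_Account__c), and the id column is an assumption, just as in the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cab (nro_pedido TEXT PRIMARY KEY, nro_cliente TEXT);
    -- id is an assumed unique key; id_sap stands in for ID_SAP_BAYER__c.
    CREATE TABLE sales (id INTEGER PRIMARY KEY, id_sap TEXT, name TEXT);
    INSERT INTO cab VALUES ('5320', '3783513');
    INSERT INTO sales VALUES (1, '3783513', 'first'), (2, '3783513', 'second');
""")

# MAX over the column you group by returns every value, so nothing is filtered.
no_op_filter = conn.execute("""
    SELECT cab.nro_pedido, sales.name
    FROM cab
    LEFT JOIN sales
      ON cab.nro_cliente = sales.id_sap
     AND sales.id_sap IN (SELECT MAX(id_sap) FROM sales GROUP BY id_sap)
    WHERE cab.nro_pedido = '5320'
""").fetchall()

# MAX over a column that differs between the duplicates, correlated to the
# outer row, keeps exactly one sales row per order.
one_per_key = conn.execute("""
    SELECT cab.nro_pedido, sales.name
    FROM cab
    LEFT JOIN sales
      ON cab.nro_cliente = sales.id_sap
     AND sales.id IN (SELECT MAX(s2.id)
                      FROM sales s2
                      WHERE s2.id_sap = cab.nro_cliente)
    WHERE cab.nro_pedido = '5320'
""").fetchall()

print(len(no_op_filter), one_per_key)  # 2 [('5320', 'second')]
```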
An alternative might be to use OUTER APPLY with TOP 1. Something like:
SELECT
...
FROM PedidosEspecialesZarateCabeceras AS cab
OUTER APPLY(
SELECT TOP 1 *
FROM SalesForce_INT_Account__c s1
WHERE cab.NroClienteDireccionEntrega = s1.ID_SAP_BAYER__c
) sales
WHERE cab.NroPedido ='5320'
Without an ORDER BY the match that TOP 1 chooses will be arbitrary, but I think that's what you want anyway. (If not, you could add an ORDER BY).

Prevent duplicate record when inner join query in SQL

I used an INNER JOIN to get data from two tables.
But when I run the SQL query, I get the same record duplicated 48 times.
The SQL query I created is below:
SELECT
ABS_LIMIT.B1_NAME, ABS_LIMIT.B2_NAME, ABS_LIMIT.B3_NAME, ABS_LIMIT.ELEM_NAME
FROM
ABS_LIMIT
INNER JOIN
RTU_SCAN ON RTU_SCAN.B1_NAME = ABS_LIMIT.B1_NAME
WHERE
ABS_LIMIT.B3_NAME LIKE 'AMP%';
Does anyone have any idea how to remove the duplicate from the query result?
You never SELECT any columns from RTU_SCAN so you can use EXISTS rather than an INNER JOIN:
SELECT a.B1_NAME,
a.B2_NAME,
a.B3_NAME,
a.ELEM_NAME
FROM ABS_LIMIT a
WHERE EXISTS (SELECT 1 FROM RTU_SCAN r WHERE r.B1_NAME = a.B1_NAME)
AND a.B3_NAME LIKE 'AMP%';
Then, if there are duplicates in RTU_SCAN they will not propagate duplicate rows in the output.
Alternatively, you could use DISTINCT to remove duplicates:
SELECT DISTINCT
a.B1_NAME,
a.B2_NAME,
a.B3_NAME,
a.ELEM_NAME
FROM ABS_LIMIT a
INNER JOIN RTU_SCAN r
ON r.B1_NAME = a.B1_NAME
AND a.B3_NAME LIKE 'AMP%';
However, it will probably be less efficient to generate duplicates and then filter them out using DISTINCT compared to using EXISTS and not generating the duplicates in the first place.
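The difference between the three variants can be checked on toy data. This is a sketch with Python's built-in sqlite3 module; the rows are invented, and the point column in rtu_scan is a hypothetical extra column. It also shows one behavioural difference worth knowing: DISTINCT collapses rows that were already identical in ABS_LIMIT, whereas EXISTS returns each ABS_LIMIT row as it is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE abs_limit (b1_name TEXT, b3_name TEXT, elem_name TEXT);
    CREATE TABLE rtu_scan (b1_name TEXT, point TEXT);
    -- Two identical rows in abs_limit, three matching rows in rtu_scan.
    INSERT INTO abs_limit VALUES ('SUB1', 'AMP_A', 'x'), ('SUB1', 'AMP_A', 'x');
    INSERT INTO rtu_scan VALUES ('SUB1', 'p1'), ('SUB1', 'p2'), ('SUB1', 'p3');
""")

# Plain INNER JOIN: every abs_limit row repeats once per matching rtu_scan row.
join_rows = conn.execute("""
    SELECT a.b1_name, a.b3_name, a.elem_name
    FROM abs_limit a
    INNER JOIN rtu_scan r ON r.b1_name = a.b1_name
    WHERE a.b3_name LIKE 'AMP%'
""").fetchall()

# EXISTS: each abs_limit row appears at most once, whatever rtu_scan contains.
exists_rows = conn.execute("""
    SELECT a.b1_name, a.b3_name, a.elem_name
    FROM abs_limit a
    WHERE EXISTS (SELECT 1 FROM rtu_scan r WHERE r.b1_name = a.b1_name)
      AND a.b3_name LIKE 'AMP%'
""").fetchall()

# DISTINCT: removes all repeated output rows, including genuine duplicates.
distinct_rows = conn.execute("""
    SELECT DISTINCT a.b1_name, a.b3_name, a.elem_name
    FROM abs_limit a
    INNER JOIN rtu_scan r ON r.b1_name = a.b1_name
    WHERE a.b3_name LIKE 'AMP%'
""").fetchall()

print(len(join_rows), len(exists_rows), len(distinct_rows))  # 6 2 1
```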

JOIN results in too many rows

I would be super happy if I get help for this problem. Thank you in advance.
Table #1: station_temporar_con_station has 5984 rows and 7 columns:
ID_stations, latitude, longitude, connection_coord_city_type, coordinates_text, type_of_stations, ID_city
Table #2: air_quality_temporar has 11946 rows and 13 columns.
Now I should have a table with all the 11946 rows from air_quality_temporar supplemented with the column connection_coord_city_type from station_temporar_con_station.
What I've tried so far:
Solution #1:
SELECT
ID_measurement, ID_stations,
station_temporar_con_station.latitude,
station_temporar_con_station.longitude,
station_temporar_con_station.connection_coord_city_type,
station_temporar_con_station.coordinates_text,
type_of_stations, ID_city
FROM
station_temporar_con_station
JOIN
air_quality_temporar ON station_temporar_con_station.coordinates_text = air_quality_temporar.coordinates_text;
But this JOIN results in 14'377 rows instead of 11'946 rows.
Solution #2:
SELECT
reference, pm25, PM10, latitude, longitude,
(SELECT connection_coord_city_type
FROM station_temporar_con_station),
conc_pm25, conc_pm10, year, pm10_type, pm25_type, date_compiled
FROM
air_quality_temporar;
But only the first value from connection_coord_city_type is filled in, because the DB does not know which value it should assign to which row.
Does anyone have any input or a solution?
You should avoid duplicate matches by joining only on data that is unique. I added the latitude and longitude fields to the join condition below.
I also used a LEFT JOIN with air_quality_temporar as the left table, so that all 11946 rows are kept.
SELECT ID_measurement, ID_stations,
station_temporar_con_station.latitude,
station_temporar_con_station.longitude,
station_temporar_con_station.connection_coord_city_type,
station_temporar_con_station.coordinates_text, type_of_stations, ID_city
FROM air_quality_temporar
LEFT JOIN station_temporar_con_station
ON air_quality_temporar.coordinates_text = station_temporar_con_station.coordinates_text
AND air_quality_temporar.latitude = station_temporar_con_station.latitude AND
air_quality_temporar.longitude = station_temporar_con_station.longitude
You have duplicates in your tables. The one of interest is station_temporar_con_station.
To find the duplicates, use:
SELECT coordinates_text, MIN(connection_coord_city_type), MAX(connection_coord_city_type)
FROM station_temporar_con_station
GROUP BY coordinates_text;
Then you need to figure out what to do. I would suggest fixing the data.
If you just want to get any matching row in the query, you can use window functions:
SELECT aqt.*, stcs.*
FROM air_quality_temporar aqt LEFT JOIN
(SELECT stcs.*,
ROW_NUMBER() OVER (PARTITION BY coordinates_text ORDER BY coordinates_text) as seqnum
FROM station_temporar_con_station stcs
) stcs
ON stcs.coordinates_text = aqt.coordinates_text AND
stcs.seqnum = 1;
Note that this returns an arbitrary row when there are duplicates. I also replaced the JOIN with LEFT JOIN. The duplicate rows might be hiding the fact that some rows have no matches.
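A small sketch of the ROW_NUMBER() approach, using Python's built-in sqlite3 module (this assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added). Table and column names are shortened, hypothetical stand-ins, and the ORDER BY uses city_type instead of coordinates_text so that the chosen row is deterministic in the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE air_quality (id_measurement TEXT, coordinates_text TEXT);
    CREATE TABLE station (coordinates_text TEXT, city_type TEXT);
    INSERT INTO air_quality VALUES ('m1', 'A'), ('m2', 'B'), ('m3', 'C');
    -- 'A' is duplicated in station; 'C' has no station at all.
    INSERT INTO station VALUES ('A', 'urban'), ('A', 'rural'), ('B', 'urban');
""")

# Inner join: m1 is doubled and m3 is silently dropped.
inner_rows = conn.execute("""
    SELECT aq.id_measurement, s.city_type
    FROM air_quality aq
    JOIN station s ON s.coordinates_text = aq.coordinates_text
""").fetchall()

# LEFT JOIN against a de-duplicated station list: one row per measurement.
deduped_rows = conn.execute("""
    SELECT aq.id_measurement, s.city_type
    FROM air_quality aq
    LEFT JOIN (
        SELECT st.*,
               ROW_NUMBER() OVER (PARTITION BY st.coordinates_text
                                  ORDER BY st.city_type) AS seqnum
        FROM station st
    ) s
      ON s.coordinates_text = aq.coordinates_text
     AND s.seqnum = 1
    ORDER BY aq.id_measurement
""").fetchall()

print(len(inner_rows), deduped_rows)
```

In this toy data the inner join returns 3 rows, the same as the number of measurements, even though one measurement is doubled and another is missing. That is the kind of masking the note above warns about.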

Getting way more results than expected in SQL left join query

My code is such:
SELECT COUNT(*)
FROM earned_dollars a
LEFT JOIN product_reference b ON a.product_code = b.product_code
WHERE a.activity_year = '2015'
I'm trying to match two tables based on their product codes. I would expect the same number of results back from this as total records in table a (with a year of 2015). But for some reason I'm getting close to 3 million.
Table a has about 40,000,000 records and table b has 2000. When I run this statement without the join I get 2,500,000 results, so I would expect this even with the left join, but somehow I'm getting 300,000,000. Any ideas? I even referred to the diagram in this post.
It means that either your left join is using only part of the foreign key, which causes row multiplication, or there are simply duplicate rows in the joined table.
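One way to check the second possibility is to look for join-key values that occur more than once in the lookup table. The sketch below uses Python's built-in sqlite3 module; the column layout and all sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE earned_dollars (
        id INTEGER PRIMARY KEY, product_code TEXT, activity_year TEXT);
    CREATE TABLE product_reference (product_code TEXT, description TEXT);
    INSERT INTO earned_dollars (product_code, activity_year) VALUES
        ('P1', '2015'), ('P1', '2015'), ('P2', '2015'), ('P3', '2014');
    -- 'P1' is listed twice in the reference table.
    INSERT INTO product_reference VALUES
        ('P1', 'old name'), ('P1', 'new name'), ('P2', 'other');
""")

# Rows for 2015 without the join.
base_count = conn.execute(
    "SELECT COUNT(*) FROM earned_dollars WHERE activity_year = '2015'"
).fetchone()[0]

# Same filter with the LEFT JOIN: each 'P1' row is doubled.
joined_count = conn.execute("""
    SELECT COUNT(*)
    FROM earned_dollars a
    LEFT JOIN product_reference b ON a.product_code = b.product_code
    WHERE a.activity_year = '2015'
""").fetchone()[0]

# Diagnostic: which join keys are not unique in the lookup table?
dup_keys = conn.execute("""
    SELECT product_code, COUNT(*) AS n
    FROM product_reference
    GROUP BY product_code
    HAVING COUNT(*) > 1
""").fetchall()

print(base_count, joined_count, dup_keys)  # 3 5 [('P1', 2)]
```

If the diagnostic query returns no rows on the real table, the multiplication is more likely to come from an incomplete join condition.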
use COUNT(DISTINCT a.product_code)
What is the question you are trying to answer with the T-SQL?
Instead of select count(*), try select a.product_code, b.product_code. That will show you which records match and which don't.
You should also add a where b.product_code is not null. That should exclude the records that don't match.
Is b the parent table and a the child table? Try a right join instead.
Or use the table's unique identifier, i.e.
SELECT COUNT(a.earned_dollars_id)
Not sure what your datamodel looks like and how it is structured, but i'm guessing you only care about earned_dollars?
SELECT COUNT(*)
FROM earned_dollars a
WHERE a.activity_year = '2015'
and exists (select 1 from product_reference b where a.product_code = b.product_code)

SQL filter the data if existed in the other table

I'm using SQL Server 2005, and I have a script like this:
select INV_Nr, INV_Date, INV_Customer
from INVOICE A
left outer join CANCEL_INVOICE B on B.INV_Nr = A.INV_Nr
How can I add a 'where' clause or filter so that every INVOICE.INV_Nr that exists in CANCEL_INVOICE.INV_Nr is left out of the query result?
Thanks,
One way (probably the best) is NOT EXISTS:
SELECT inv_nr,
inv_date,
inv_customer
FROM invoice i
WHERE NOT EXISTS(SELECT 1
FROM cancel_invoice c
WHERE c.inv_nr = i.inv_nr)
The LEFT OUTER JOIN approach might work but is less efficient and leads to incorrect (or at least unexpected) results, since there is no way to differentiate between a row that doesn't exist and a row that does exist but where that column is NULL.
Try this!
It shows all invoices whose A.INV_Nr does not exist in table CANCEL_INVOICE:
SELECT A.INV_Nr, INV_Date, INV_Customer
FROM INVOICE A
LEFT OUTER JOIN CANCEL_INVOICE B ON A.INV_Nr = B.INV_Nr
WHERE B.INV_Nr IS NULL
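Both anti-join styles from this thread can be compared on toy data. This is a sketch using Python's built-in sqlite3 module with invented rows and a reduced column set. In this small case, where the IS NULL test is applied to the join column itself, the two queries return the same invoices even though CANCEL_INVOICE contains a repeated INV_Nr.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (inv_nr INTEGER PRIMARY KEY, inv_customer TEXT);
    CREATE TABLE cancel_invoice (inv_nr INTEGER);
    INSERT INTO invoice VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- Invoice 2 was cancelled, and the cancellation is recorded twice.
    INSERT INTO cancel_invoice VALUES (2), (2);
""")

# NOT EXISTS: keep invoices with no row at all in cancel_invoice.
not_exists_rows = conn.execute("""
    SELECT i.inv_nr
    FROM invoice i
    WHERE NOT EXISTS (SELECT 1 FROM cancel_invoice c WHERE c.inv_nr = i.inv_nr)
    ORDER BY i.inv_nr
""").fetchall()

# LEFT JOIN ... IS NULL: keep rows where the join found no partner.
left_join_rows = conn.execute("""
    SELECT i.inv_nr
    FROM invoice i
    LEFT JOIN cancel_invoice c ON c.inv_nr = i.inv_nr
    WHERE c.inv_nr IS NULL
    ORDER BY i.inv_nr
""").fetchall()

print(not_exists_rows, left_join_rows)  # [(1,), (3,)] [(1,), (3,)]
```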