My query is returning duplicates - sql

I have written an SQL query to filter for a number of conditions, and have used distinct to find only unique records.
Specifically, I need only for the AccountID field to be unique, there are multiple AddressClientIDs for each AccountID.
The query works but is however producing some duplicates.
Further caveats are:
There are multiple trans for each AccountID
There can be trans record both Y and N for an AccountID
I only want to return AccountIDs which have transaction for statuses other than what's specified, hence why I used not in, as I do not want the 2 statuses.
I would like to find only unique values for the AccountID column.
If anyone could help refine the query below, it would be much appreciated.
SELECT AFS_Account.AddressClientID
,afs_transunit.AccountID
,SUM(afs_transunit.Units)
FROM AFS_TransUnit
,AFS_Account
WHERE afs_transunit.AccountID IN (
-- Gets accounts which only have non post statuses
SELECT DISTINCT accountid
FROM afs_trans
WHERE accountid NOT IN (
SELECT accountid
FROM afs_trans
WHERE STATUS IN (
'POSTPEND'
,'POSTWAIT'
)
)
-- This gets the unique accountIDs which only have transactions with Y status,
-- and removes any which have both Y and N.
AND AccountID IN (
SELECT DISTINCT accountid
FROM afs_trans
WHERE IsAllocated = 'Y'
AND accountid NOT IN (
SELECT DISTINCT AccountID
FROM afs_trans
WHERE IsAllocated = 'N'
)
)
)
AND AFS_TransUnit.AccountID = AFS_Account.AccountID
GROUP BY afs_transunit.AccountID
,AFS_Account.AddressClientID
HAVING SUM(afs_transunit.Units) > 100
Thanks.

Since you confirmed that you have one-to-many relationship across two tables on AccountID column, you could use Max value of your AccountID to get distinct values:
SELECT afa.AddressClientID
,MAX(aft.AccountID)
,SUM(aft.Units)
FROM AFS_TransUnit aft
INNER JOIN AFS_Account afa ON aft.AccountID = afa.AccountID
GROUP BY afa.AddressClientID
HAVING SUM(aft.Units) > 100
AND MAX(aft.AccountID) IN (
-- Gets accounts which only have non post statuses
-- This gets the unique accountIDs which only have transactions with Y status,
-- and removes any which have both Y and N.
SELECT DISTINCT accountid
FROM afs_trans a
WHERE [STATUS] NOT IN ('POSTPEND','POSTWAIT')
AND a.accountid IN (
SELECT t.accountid
FROM (
SELECT accountid
,max(isallocated) AS maxvalue
,min(isallocated) AS minvalue
FROM afs_trans
GROUP BY accountid
) t
WHERE t.maxvalue = 'Y'
AND t.minvalue = 'Y'
)
)

SELECT AFS_Account.AddressClientID
,afs_transunit.AccountID
,SUM(afs_transunit.Units)
FROM AFS_TransUnit
INNER JOIN AFS_Account ON AFS_TransUnit.AccountID = AFS_Account.AccountID
INNER JOIN afs_trans ON afs_trans.acccountid = afs_transunit.accountid
WHERE afs_trans.STATUS NOT IN ('POSTPEND','POSTWAIT')
-- AND afs_trans.isallocated = 'Y'
GROUP BY afs_transunit.AccountID
,AFS_Account.AddressClientID
HAVING SUM(afs_transunit.Units) > 100
and max(afs_trans.isallocated) = 'Y'
and min(afs_trans.isallocated) = 'Y'
Modified your query with ANSI SQL join syntax. As you are joining the tables, you just need to specify the conditions without using the sub-queries you have.

Related

Including count combinations with null value in SQL

I have one dataset, and am trying to list all of the combinations of said dataset. However, I am unable to figure out how to include the combinations that are null. For example, Longitudinal? can be no and cohort can be 11-20, however for Region 1, there were no patients of that age in that region. How can I show a 0 for the count?
Here is the code:
SELECT "s_safe_005prod"."ig_eligi_group1"."site_name" AS "Site Name",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_ellong" AS "Longitudinal?",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_elcohort" AS "Cohort",
count(*) AS "count"
FROM "s_safe_005prod"."ig_eligi_group1"
GROUP BY "s_safe_005prod"."ig_eligi_group1"."site_name",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_ellong",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_elcohort"
ORDER BY "s_safe_005prod"."ig_eligi_group1"."site_name",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_ellong" ASC,
"s_safe_005prod"."ig_eligi_group1"."il_eligi_elcohort" ASC
Create a cross join across the unique values from each of the three grouping fields to create a set of all possible combinations. Then left join that to the counts you have originally and coalesce null values to zero.
WITH groups AS
(
SELECT a.site_name, b.longitudinal, c.cohort
FROM (SELECT DISTINCT site_name FROM s_safe_005prod.ig_eligi_group1) a,
(SELECT DISTINCT il_eligi_ellong AS longitudinal FROM s_safe_005prod.ig_eligi_group1) b,
(SELECT DISTINCT il_eligi_elcohort AS cohort FROM s_safe_005prod.ig_eligi_group1) c
),
dat AS
(
SELECT site_name,
il_eligi_ellong AS longitudinal,
il_eligi_elcohort AS cohort,
count(*) AS "count"
FROM s_safe_005prod.ig_eligi_group1
GROUP BY site_name,
il_eligi_ellong,
il_eligi_elcohort
)
SELECT groups.site_name,
groups.longitudinal,
groups.cohort,
COALESCE(dat.[count],0) AS "count"
FROM groups
LEFT JOIN dat ON groups.site_name = dat.site_name
AND groups.longitudinal = dat.longitudinal
AND groups.cohort = dat.cohort;

Cross joins in results

Mapping:A single address id can have different tracking ids. Each tracking id and each address id will have distinct lat and long pairs. Each tracking id can have multiple route ids although most of the time it will be a single route id to tracking id mapping.
Update: The tracking ids that I am selecting from T1_2 may or may not exist in the other tables. Also, there are no duplicates for each of the temp tables I am using for the final select statement(based on the key value).
I am having a problem with the results of the following query.The query is supposed to produce metrics for distance deviations of delivery points from addresses. Its performing some cross joins on the columns and hence the data is more than it should be. I know this is related to granularity and is a basic mistake but its hard for me to find where I went wrong. If someone can give me some pointers,please do. A subset of the results have been attached as a link and I have also highlighted a sample tracking id which should have come only once(with only route id).The results should include address ids repeating as many times as there are distinct tracking_ids which in turn should be in sync with no_pkg column. The query is also attached for reference.
Results subset
CREATE OR REPLACE FUNCTION f_stop_distance (Float, Float, Float, Float) /* This calculates distance in meters between two sets of lat and long */
RETURNS FLOAT
IMMUTABLE
AS $$
SELECT
2 * 6373000 * ASIN( SQRT( ( SIN( RADIANS(($3 - $1) / 2) ) ) ^ 2 + COS(RADIANS($1)) * COS(RADIANS($3)) * (SIN(RADIANS(($4 - $2) / 2))) ^ 2))
$$ LANGUAGE sql
;
CREATE TEMPORARY TABLE T1 AS /* This is to get top 1000 address ids which are unique identifiers for addresses in terms of orders frequency which is decided by number of distinct ordering order ids */
SELECT destination_address_id
,COUNT(DISTINCT ordering_order_id)a
,COUNT(DISTINCT tracking_id) no_pkg
FROM lmaa_pm.perfectmile_onroad_events_na
where shipment_status = 'DELIVERED'
AND delivery_station_code = 'DCH1'
AND event_day BETWEEN '2018-12-01' AND '2018-12-31'
AND tracking_id IS NOT NULL
GROUP BY destination_address_id,delivery_station_code
ORDER BY a DESC
LIMIT 1000
;
CREATE TEMPORARY TABLE T1_2 AS /* This is to get tracking ids corresponding to those top 1000 address ids */
SELECT DISTINCT destination_address_id
,tracking_id
FROM lmaa_pm.perfectmile_onroad_events_na
WHERE destination_address_id IN (SELECT destination_address_id FROM T1)
AND event_day BETWEEN '2018-12-01' AND '2018-12-31'
AND shipment_status = 'DELIVERED'
AND delivery_station_code = 'DCH1'
AND tracking_id IS NOT NULL
GROUP BY 1,2
;
CREATE TEMPORARY TABLE T2 AS /* This is to get lat long pairs for addresses and delivery point respectively */
SELECT DISTINCT gdd.lat1
,gdd.long1
,gdd.external_address_id destination_address_id
,gdd.tracking_id
,gdd.actual_lat
,gdd.actual_long
,ROW_NUMBER() OVER(PARTITION BY tracking_id ORDER BY deliverydate DESC) rn /* This is to avoid duplicates since this table contains duplicates */
FROM gtech.geocoding_data_daily_na gdd
WHERE gdd.shipment_status_id in (51,'DELIVERED')
AND tracking_id IN(SELECT tracking_id FROM T1_2)
AND confidence1 = 'high'
AND gdd.station_code='DCH1'
AND deliverydate BETWEEN '2018-12-01' AND '2018-12-31'
AND actual_lat IS NOT NULL
AND actual_long IS NOT NULL
;
CREATE TEMPORARY TABLE T2_2 AS
SELECT *
FROM T2
WHERE rn = 1
;
CREATE TEMPORARY TABLE T3 AS
SELECT T2_2.lat1
,T2_2.long1
,T2_2.actual_lat
,T2_2.actual_long
,T2_2.tracking_id
,T2_2.destination_address_id
,CASE /* This function is for identifying distance deviations in the order of 0 - 10 metres, 10-20 metres and so on */
WHEN f_stop_distance(lat1,long1,actual_lat,actual_long) <=10 THEN '0_to_10'
WHEN f_stop_distance(lat1,long1,actual_lat,actual_long) >10
and f_stop_distance(lat1,long1,actual_lat,actual_long) <=20 THEN '10_to_20'
WHEN f_stop_distance(lat1,long1,actual_lat,actual_long)>20
and f_stop_distance(lat1,long1,actual_lat,actual_long) <=50 THEN '20_to_50'
WHEN f_stop_distance(lat1,long1,actual_lat,actual_long) >50 THEN 'gt_50'
END AS Dev_from_address
FROM T2_2
ORDER BY T2_2.tracking_id
;
CREATE TEMPORARY TABLE T4 AS /* Doing some percentage calculations based on the new buckets created in the previous temp table namely percentage calculations out of total */
SELECT SUM(CASE WHEN Dev_from_address = '0_to_10' THEN 1 ELSE 0 END)a
,SUM(CASE WHEN Dev_from_address = '10_to_20' THEN 1 ELSE 0 END)b
,SUM(CASE WHEN Dev_from_address = '20_to_50' THEN 1 ELSE 0 END)c
,SUM(CASE WHEN Dev_from_address = 'gt_50' THEN 1 ELSE 0 END)d
,tracking_id
,(a/(a+b+c+d)::DECIMAL(10,2) * 100) AS e
,(b/(a+b+c+d)::DECIMAL(10,2) * 100) AS f
,(c/(a+b+c+d)::DECIMAL(10,2) * 100) AS g
,(d/(a+b+c+d)::DECIMAL(10,2) * 100) AS h
FROM T3
GROUP BY tracking_id
;
CREATE TEMPORARY TABLE T5 AS /* adding info for route id to the existing data */
SELECT DISTINCT route_id
,tracking_id
,ROW_NUMBER() OVER (PARTITION BY tracking_id ORDER BY DATE DESC) rnnn /* to avoid duplicates */
FROM omw.route_actuals_na
WHERE tracking_id IN (SELECT tracking_id FROM T1_2)
AND stop_type = 'Dropoff'
AND scan_status = 'DELIVERED'
;
CREATE TEMPORARY TABLE T5_final AS
SELECT *
FROM T5
WHERE rnnn = 1
;
/* final select */
SELECT DISTINCT T1_2.destination_address_id
,T3.lat1
,T3.long1
,T3.actual_lat
,T3.actual_long
,T3.Dev_from_address
,T1_2.tracking_id
,T1.no_pkg
,T4.e
,T4.f
,T4.g
,T4.h
,T5_final.route_id
FROM T3
JOIN T4 ON T4.tracking_id = T3.tracking_id
JOIN T1 ON T1.destination_address_id = T3.destination_address_id
JOIN T1_2 ON T1_2.destination_address_id = T3.destination_address_id
JOIN T5_final ON T5_final.tracking_id = T3.tracking_id
ORDER BY T1_2.destination_address_id
strictly - no full cross joins there - however you may have a many to many join.
To track this down try taking a look at each of your joins to see whether you have >1 key value
select tracking_id,count(*) from t4 group by 1 having count(*) > 1;
select destination_address_id,count(*) from t1 group by 1 having count(*) > 1;
select tracking_id ,count(*) from t5_final group by 1 having count(*) > 1;
where you have values returned, that could be your cause. this may help you identify where you have a many to many join.

How to search for matching staff number in sql

I am new to sql and trying to come up with a sql query which will list me the duplicate staff which were created in our system.
We have one staff which is created with id as 1234 and the same user has another account starting with staff id 01234. Is there anyway i can get the matching staff
Once i come up with correct duplicates i will than want to delete the accounts which don't have "0" at the start e.g deleted 1234 and only keep 01234
below is the sql
SELECT tps_user.tps_title AS [Name] , tps_user_type.tps_title AS [User Type]
FROM tps_user INNER JOIN
tps_user_type ON tps_user.tps_user_type_guid = tps_user_type.tps_guid
WHERE (tps_user.tps_title IN
(SELECT tps_title AS users
FROM tps_user AS t1
WHERE (tps_deleted = 0)
GROUP BY tps_title
HAVING (COUNT(tps_title) > 1))) AND (tps_user.tps_deleted = 0)
When you do you select try this:
SELECT DISTINCT CONVERT(INT,ID)
FROM your_table
WHERE ...
OR
SELECT ID
FROM your_table
WHERE ...
GROUP BY ID
This will convert all the id's to an int temporarily so when the distinct evaluates duplicates everything will be uniform to give you an accurate representation of the duplicates.
IF you don't want to convert them maybe convert them and insert them into a temporary table and add a flag to which ones have a leading zero. Or convert them then append a zero after you delete the duplicates since you want that anyway. It is easy to append a 0.
the below query will give you the list of duplicates with same Name and title. -
SELECT tps_user.tps_title AS [Name] ,
tps_user_type.tps_title AS [UserType],
COUNT(*) Duplicate_Count
FROM tps_user
INNER JOIN tps_user_type
ON tps_user.tps_user_type_guid = tps_user_type.tps_guid
group by tps_user.tps_title, tps_user_type.tps_title
having COUNT(*) > 1
order by Duplicate_Count desc
Select t1.stringId
from mytable t1
inner join mytable t2 on Convert(INT, t1.intId) = CONVERT(INT, t2.intId)
where t1.stringId not like '0%'
This should list all the persons that have duplicates but do not start with 0.

Find duplicates in MS SQL table

I know that this question has been asked several times but I still cannot figure out why my query is returning values which are not duplicates. I want my query to return only the records which have identical value in the column Credit. The query executes without any errors but values which are not duplicated are also being returned. This is my query:
Select
_bvGLTransactionsFull.AccountDesc,
_bvGLAccountsFinancial.Description,
_bvGLTransactionsFull.TxDate,
_bvGLTransactionsFull.Description,
_bvGLTransactionsFull.Credit,
_bvGLTransactionsFull.Reference,
_bvGLTransactionsFull.UserName
From
_bvGLAccountsFinancial Inner Join
_bvGLTransactionsFull On _bvGLAccountsFinancial.AccountLink =
_bvGLTransactionsFull.AccountLink
Where
_bvGLTransactionsFull.Credit
IN
(SELECT Credit AS NumOccurrences
FROM _bvGLTransactionsFull
GROUP BY Credit
HAVING (COUNT(Credit) > 1 ) )
Group By
_bvGLTransactionsFull.AccountDesc, _bvGLAccountsFinancial.Description,
_bvGLTransactionsFull.TxDate, _bvGLTransactionsFull.Description,
_bvGLTransactionsFull.Credit, _bvGLTransactionsFull.Reference,
_bvGLTransactionsFull.UserName, _bvGLAccountsFinancial.Master_Sub_Account,
IsNumeric(_bvGLTransactionsFull.Reference), _bvGLTransactionsFull.TrCode
Having
_bvGLTransactionsFull.TxDate > 01 / 11 / 2014 And
_bvGLTransactionsFull.Reference Like '5_____' And
_bvGLTransactionsFull.Credit > 0.01 And
_bvGLAccountsFinancial.Master_Sub_Account = '90210'
That's because you're matching on the credit field back to your table, which contains duplicates. You need to isolate the rows that are duplicated with ROW_NUMBER:
;WITH CTE AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY CREDIT ORDER BY (SELECT NULL)) AS RN
FROM _bvGLTransactionsFull)
Select
CTE.AccountDesc,
_bvGLAccountsFinancial.Description,
CTE.TxDate,
CTE.Description,
CTE.Credit,
CTE.Reference,
CTE.UserName
From
_bvGLAccountsFinancial Inner Join
CTE On _bvGLAccountsFinancial.AccountLink = CTE.AccountLink
WHERE CTE.RN > 1
Group By
CTE.AccountDesc, _bvGLAccountsFinancial.Description,
CTE.TxDate, CTE.Description,
CTE.Credit, CTE.Reference,
CTE.UserName, _bvGLAccountsFinancial.Master_Sub_Account,
IsNumeric(CTE.Reference), CTE.TrCode
Having
CTE.TxDate > 01 / 11 / 2014 And
CTE.Reference Like '5_____' And
CTE.Credit > 0.01 And
_bvGLAccountsFinancial.Master_Sub_Account = '90210'
Just as a side note, I would consider using aliases to shorten your queries and make them more readable. Prefixing the table name before each column in a join is very difficult to read.
I trust your code in terms of extracting all data per your criteria. With this, let me have a different approach and see your script "as-is". So then, lets keep first all the records in a temp.
Select
_bvGLTransactionsFull.AccountDesc,
_bvGLAccountsFinancial.Description,
_bvGLTransactionsFull.TxDate,
_bvGLTransactionsFull.Description,
_bvGLTransactionsFull.Credit,
_bvGLTransactionsFull.Reference,
_bvGLTransactionsFull.UserName
-- temp table
INTO #tmpTable
From
_bvGLAccountsFinancial Inner Join
_bvGLTransactionsFull On _bvGLAccountsFinancial.AccountLink =
_bvGLTransactionsFull.AccountLink
Where
_bvGLTransactionsFull.Credit
IN
(SELECT Credit AS NumOccurrences
FROM _bvGLTransactionsFull
GROUP BY Credit
HAVING (COUNT(Credit) > 1 ) )
Group By
_bvGLTransactionsFull.AccountDesc, _bvGLAccountsFinancial.Description,
_bvGLTransactionsFull.TxDate, _bvGLTransactionsFull.Description,
_bvGLTransactionsFull.Credit, _bvGLTransactionsFull.Reference,
_bvGLTransactionsFull.UserName, _bvGLAccountsFinancial.Master_Sub_Account,
IsNumeric(_bvGLTransactionsFull.Reference), _bvGLTransactionsFull.TrCode
Having
_bvGLTransactionsFull.TxDate > 01 / 11 / 2014 And
_bvGLTransactionsFull.Reference Like '5_____' And
_bvGLTransactionsFull.Credit > 0.01 And
_bvGLAccountsFinancial.Master_Sub_Account = '90210'
Then remove the "single occurrence" data by creating a row index and remove all those 1 time indexes.
SELECT * FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY Credit ORDER BY Credit) AS rowIdx
, *
FROM #tmpTable) AS innerTmp
WHERE
rowIdx != 1
You can change your preference through PARTITION BY <column name>.
Should you have any concerns, please raise it first as these are so far how I understood your case.
EDIT : To include those credits that has duplicates.
SELECT
tmp1.*
FROM #tmpTable tmp1
RIGHT JOIN (
SELECT
Credit
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY Credit ORDER BY Credit) AS rowIdx
, *
FROM #tmpTable) AS innerTmp
WHERE
rowIdx != 1
) AS tmp2
ON tmp1.Credit = tmp2.Credit

SQL query - Selecting distinct values from a table

I have a table in which i have multiple entries against a FK. I want to find out the value of FK which do not have certain entries e.g
my table has following entries.
PK----------------FK-----------------Column entries
1----------------100-----------------ab1
2----------------100-----------------ab2
3----------------100-----------------ab4
4----------------200-----------------ab1
5----------------200-----------------ab2
6----------------200-----------------ab3
7----------------300-----------------ab1
8----------------300-----------------ab2
9----------------300-----------------ab3
10---------------300-----------------ab4
Now, from this table i want to filter all those FK which do not have ab3 or ab4 in them. Certainly, i expect distinct values i.e. in this case result would be FK= 100 and 200.
The query which i am using is
select distinct(FK)
from table1
where column_entries != 'ab3'
or column_entries != 'ab4';
Certainly, this query is not fetching the desired result.
try the following :-
select distinct fk_col from table1
minus
(select distinct fk_col from table1 where col_entry='ab3'
intersect
select distinct fk_col from table1 where col_entry='ab4')
This will show all those FKs which do not have 'ab3' and 'ab4'. i.e. 100 and 200 in your case
The below script may be the solution if I got your question in a right way.
SELECT DISTINCT(TableForeignKey)
FROM Test
WHERE TableForeignKey NOT IN (
SELECT T1.TableForeignKey
FROM Test T1 INNER JOIN Test T2 ON T1.TableForeignKey = T2.TableForeignKey
WHERE T1.TableEntry = 'ab3' AND T2.TableEntry = 'ab4')
SQLFiddle Demo
You could use GROUP BY with conditional aggregation in HAVING:
SELECT FK
FROM table1
GROUP BY FK
HAVING COUNT(CASE column_entries WHEN 'ab3' THEN 1 END) = 0
OR COUNT(CASE column_entries WHEN 'ab4' THEN 1 END) = 0
;
The two conditional aggregates count 'ab3' and 'ab4' entries separately. If both end up with results greater than 0, then the corresponding FK has both 'ab3' and 'ab4' and is thus not returned. If at least one of the counts evaluates to 0, then FK is considered satisfying the requirements.