PostgreSQL: Removing duplicates after performing UNION ALL

I have a requirement where I need to remove some rows after combining two tables using UNION ALL.
Here are the tables:
Accounts1
id  username  department  salary
--------------------------------
1   Sam       IT          2000
2   Frodo     Accounts    1000
3   Natan     Service     800
4   Kenworth  Admin       900
Accounts2
id  username  department  salary
--------------------------------
5   Sam       IT          1600
6   Frodo     Accounts    800
The expected result of the UNION should be:
id  username  department  salary
--------------------------------
5   Sam       IT          1600
6   Frodo     Accounts    800
3   Natan     Service     800
4   Kenworth  Admin       900
As shown, where a user appears in both tables, the expected result contains the record from accounts2 (the one with the lower salary) in place of the record from accounts1. I have tried DISTINCT, but that does not meet the requirement. Any help is greatly appreciated.

You can use union all with filtering:
select a2.*
from accounts2 a2
union all
select a1.*
from accounts1 a1
where not exists (select 1
from accounts2 a2
where a2.username = a1.username and a2.department = a1.department
);
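As a quick check of the filtering query above, here is a minimal sketch that loads the question's sample rows into an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for PostgreSQL here, which is an assumption made only to keep the example self-contained; the query text is the one from the answer.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts1 (id INTEGER, username TEXT, department TEXT, salary INTEGER);
    CREATE TABLE accounts2 (id INTEGER, username TEXT, department TEXT, salary INTEGER);
    INSERT INTO accounts1 VALUES
        (1, 'Sam', 'IT', 2000), (2, 'Frodo', 'Accounts', 1000),
        (3, 'Natan', 'Service', 800), (4, 'Kenworth', 'Admin', 900);
    INSERT INTO accounts2 VALUES
        (5, 'Sam', 'IT', 1600), (6, 'Frodo', 'Accounts', 800);
""")

# Every accounts2 row, plus the accounts1 rows that have no
# username/department match in accounts2.
union_rows = conn.execute("""
    select a2.*
    from accounts2 a2
    union all
    select a1.*
    from accounts1 a1
    where not exists (select 1
                      from accounts2 a2
                      where a2.username = a1.username
                        and a2.department = a1.department)
""").fetchall()

# No ORDER BY was given, so sort in Python for a stable display.
for row in sorted(union_rows):
    print(row)
```

Note that this version always prefers the accounts2 row when a match exists; it does not compare salaries.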
EDIT:
If you want one row per username or username/department from either table with the minimum salary, then I would suggest union all with distinct on:
select distinct on (username, department) a.*
from ((select a1.*
from accounts1 a1
) union all
(select a2.*
from accounts2 a2
)
) a
order by username, department, salary;
Remove department accordingly if you want one row per employee.

After UNIONing the two sets, I would calculate ROW_NUMBER() OVER (PARTITION BY department, username ORDER BY salary, id). Then I would wrap that in one more SELECT to filter and retain only the rows where row_number = 1.
A little more code, but very explicit as to what is being performed and it has the advantage that if either data set happens to contain multiple values for a user you still get the one with the lowest salary.
This is a problem that comes up often: there are multiple records within a group and you want to choose "the best" one, even if you can't say in advance exactly which one that is. The row_number() window function lets your ORDER BY make the best choice float to the top, where it is assigned row_number 1. You can then filter and retain only the rows with row_number = 1 as the "best" choice within each group. This always means at least two SELECT statements, because window functions are evaluated after the WHERE and HAVING clauses.
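The row_number() approach described above can be sketched as follows. This is an illustrative example only: it runs against an in-memory SQLite database via Python's sqlite3 module (an assumption; window functions need SQLite 3.25 or later), and the alias names u, ranked and rn are made up for the example.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts1 (id INTEGER, username TEXT, department TEXT, salary INTEGER);
    CREATE TABLE accounts2 (id INTEGER, username TEXT, department TEXT, salary INTEGER);
    INSERT INTO accounts1 VALUES
        (1, 'Sam', 'IT', 2000), (2, 'Frodo', 'Accounts', 1000),
        (3, 'Natan', 'Service', 800), (4, 'Kenworth', 'Admin', 900);
    INSERT INTO accounts2 VALUES
        (5, 'Sam', 'IT', 1600), (6, 'Frodo', 'Accounts', 800);
""")

# Inner query: number the rows of each (department, username) group,
# lowest salary first. Outer query: keep only the first row of each group.
best_rows = conn.execute("""
    SELECT id, username, department, salary
    FROM (
        SELECT u.*,
               ROW_NUMBER() OVER (PARTITION BY department, username
                                  ORDER BY salary, id) AS rn
        FROM (SELECT * FROM accounts1
              UNION ALL
              SELECT * FROM accounts2) u
    ) ranked
    WHERE rn = 1
    ORDER BY id
""").fetchall()

for row in best_rows:
    print(row)
```

The filter on rn has to live in the outer SELECT, since the window function is not yet computed when the inner WHERE clause runs.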

Related

How to return all names that appear multiple times in table [duplicate]

Suppose I have the following schema:
student(name, siblings)
The student table holds names and siblings. Note that a name appears in as many rows as that individual has siblings. For instance, the table could contain:
Jack, Lucy
Jack, Tim
Meaning that Jack has Lucy and Tim as his siblings.
I want to identify an SQL query that reports the names of all students who have 2 or more siblings. My attempt is the following:
select name
from student
where count(name) >= 1;
I'm not sure I'm using count correctly in this SQL query. Can someone please help with identifying the correct SQL query for this?
You're almost there:
select name
from student
group by name
having count(*) > 1;
HAVING is like a WHERE clause that runs after grouping is done. In it you can use things that grouping makes available (like counts and other aggregates). By grouping on the name and filtering on the count (> 1 if you want two or more; not >= 1, because that would include names that appear only once) you get the names you want.
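To see the GROUP BY / HAVING query above in action, here is a small sketch using Python's sqlite3 module with an in-memory table. The sample rows are the jack/jane/john rows used later in this answer; SQLite is an assumption chosen only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, siblings TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("jack", "lucy"), ("jack", "tim"),
     ("jane", "bill"), ("jane", "fred"), ("jane", "tom"),
     ("john", "dave")],
)

# Group the rows by name, then keep only the groups with more than one row.
names = [row[0] for row in conn.execute("""
    select name
    from student
    group by name
    having count(*) > 1
    order by name
""")]

print(names)  # john has a single row, so he is filtered out
```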
This will just deliver "Jack" as a single result (in the example data from the question). If you then want all the detail, like who Jack's siblings are, you can join your grouped, filtered list of names back to the table:
select *
from
student
INNER JOIN
(
select name
from student
group by name
having count(*) > 1
) morethanone ON morethanone.name = student.name
You can't avoid doing this "joining back" because the grouping has thrown the detail away in order to create the group. The only way to get the detail back is to take the name list the group gave you and use it to filter the original detail data again
Full disclosure; it's a bit of a lie to say "can't avoid doing this": SQL Server supports something called a window function, which will effectively perform a grouping in the background and join it back to the detail. Such a query would look like:
select student.*, count(*) over(partition by name) n
from student
And for a table like this:
jack, lucy
jack, tim
jane, bill
jane, fred
jane, tom
john, dave
It would produce:
jack, lucy, 2
jack, tim, 2
jane, bill, 3
jane, fred, 3
jane, tom, 3
john, dave, 1
The rows with jack have n = 2 because there are two jack rows; there are 3 janes and 1 john. You could then wrap all that in a subquery and filter for n > 1, which would remove john:
select *
from
(
select student.*, count(*) over(partition by name) n
from student
) x
where x.n > 1
If SQL Server didn't have window functions, it would look more like:
select *
from
student
INNER JOIN
(
select name, count(*) as n
from student
group by name
) x ON x.name = student.name
The COUNT(*) OVER(PARTITION BY name) is like a mini "group by name and return the count, then auto join back to the main detail using the name as key", i.e. a short form of the latter query.
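The equivalence between the window-function form and the "join back" form can be checked with a short sketch. It uses Python's sqlite3 module and an in-memory SQLite table rather than SQL Server (an assumption; window functions need SQLite 3.25 or later), and adds the n > 1 filter to both queries so they can be compared directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, siblings TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("jack", "lucy"), ("jack", "tim"),
     ("jane", "bill"), ("jane", "fred"), ("jane", "tom"),
     ("john", "dave")],
)

# Window-function form: count per name is attached to every detail row.
window_rows = conn.execute("""
    select name, siblings, n
    from (
        select student.*, count(*) over (partition by name) as n
        from student
    ) x
    where x.n > 1
    order by name, siblings
""").fetchall()

# Join-back form: aggregate first, then join the counts to the detail rows.
joined_rows = conn.execute("""
    select student.name, student.siblings, x.n
    from student
    inner join (
        select name, count(*) as n
        from student
        group by name
    ) x on x.name = student.name
    where x.n > 1
    order by student.name, student.siblings
""").fetchall()

print(window_rows == joined_rows)
for row in window_rows:
    print(row)
```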
You can do:
select name
from student as s1
where exists (
select 1
from student as s2
where s1.name = s2.name and s1.siblings != s2.siblings
)
I think the best approach is the one 'Caius Jard' mentioned. However, here is an additional way, in case you also want to see how many siblings each name has:
SELECT name, COUNT(*) AS Occurrences
FROM student
GROUP BY name
HAVING (COUNT(*) > 1)
I wanted to share another solution I came up with:
select s1.name
from student s1, student s2
where s1.name = s2.name and s1.siblings != s2.siblings;

Need to count the filtered employees

I wrote a query that needs to filter employee data based on employee codes.
For instance, my XYZ table has 200 employees that I need to insert into the ABC table. Before inserting, I need to check whether all 200 employees exist in the system, so I first filter the employees and then insert into the ABC table.
Suppose 180 out of 200 employees match; then I insert 180 into the ABC table.
Now I want the count 200 - 180 = 20, i.e. the difference.
My query fetches only the matched records, not the count of the employees that were filtered out.
Select distinct SD.EMP_code
FROm SALARY_DETAIL_REPORT_012018 SD /*219 Employees*/
JOIN
(SELECT * FROM EMPLOYEE) tbl
ON tbl.EMP_CODE=to_char(SD.EMP_CODE)
WHERE SD.REFERENCE_ID like '1-%';
Final output: 213 employees
I want 219 - 213 = 6
I want those 6 employees. I also tried INTERSECT, but I got the same result.
Select distinct to_char(SD.EMP_code)
FROm SALARY_DETAIL_REPORT_012018 SD
WHERE SD.REFERENCE_ID like '1-%'
INTERSECT
SELECT EMP_CODE FROm EMPLOYEE;
Output: 213 employees
Kindly help me find the count of the filtered-out employees.
You can use NOT EXISTS :
SELECT DISTINCT SD.EMP_code
FROM SALARY_DETAIL_REPORT_012018 sd
WHERE NOT EXISTS (SELECT 1 FROM EMPLOYEE e WHERE e.EMP_CODE = TO_CHAR(SD.EMP_CODE)) AND
SD.REFERENCE_ID LIKE '1-%';
Use the EXCEPT operator (in Oracle before 21c, the equivalent operator is MINUS):
Select distinct to_char(SD.EMP_code)
FROM SALARY_DETAIL_REPORT_012018 SD
WHERE SD.REFERENCE_ID like '1-%'
except
SELECT EMP_CODE FROm EMPLOYEE;
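Both suggestions above return the unmatched employee codes; wrapping either one in COUNT(*) gives the difference the question asks for. The sketch below illustrates this with Python's sqlite3 module. The table names salary_detail and employee, the sample rows, and the use of CAST in place of TO_CHAR are all assumptions made so the example runs on SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, shortened stand-ins for the tables in the question.
    CREATE TABLE salary_detail (emp_code INTEGER, reference_id TEXT);
    CREATE TABLE employee (emp_code TEXT);
    INSERT INTO salary_detail VALUES
        (101, '1-A'), (102, '1-B'), (103, '1-C'),
        (103, '1-D'), (104, '1-E'), (105, '2-A');
    INSERT INTO employee VALUES ('101'), ('102'), ('105');
""")

# Anti-join with NOT EXISTS: codes in the report with no employee row.
not_exists_sql = """
    SELECT DISTINCT CAST(sd.emp_code AS TEXT)
    FROM salary_detail sd
    WHERE sd.reference_id LIKE '1-%'
      AND NOT EXISTS (SELECT 1 FROM employee e
                      WHERE e.emp_code = CAST(sd.emp_code AS TEXT))
"""

# Set difference with EXCEPT: report codes minus employee codes.
except_sql = """
    SELECT DISTINCT CAST(sd.emp_code AS TEXT)
    FROM salary_detail sd
    WHERE sd.reference_id LIKE '1-%'
    EXCEPT
    SELECT emp_code FROM employee
"""

missing_not_exists = sorted(r[0] for r in conn.execute(not_exists_sql))
missing_except = sorted(r[0] for r in conn.execute(except_sql))

# Count of filtered-out employees: wrap the query in COUNT(*).
missing_count = conn.execute(
    "SELECT COUNT(*) FROM (" + not_exists_sql + ")"
).fetchone()[0]

print(missing_not_exists, missing_except, missing_count)
```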

Case Statement for multiple criteria

I would like to ignore some of the results of my query because, for all intents and purposes, they are duplicates. Because of the way the request was made we need to use this hierarchy, and although we see different Company_Name values, we need to ignore one of the results.
Query:
SELECT
COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM
some_table AS A12
GROUP BY
2
ORDER BY
3 ASC, 2 ASC
This code omits half a dozen joins and WHERE clauses that are not germane to this question.
Results:
   Customer_Name_Count   Company_Name          Total_Sales
-------------------------------------------------------------
1  3                     Blockbuster           1,000
2  6                     Jimmy's Bar           1,500
3  6                     Jimmy's Restaurant    1,500
4  9                     Impala Hotel          2,000
5  12                    Sports Drink          2,500
In the above set, we can see that numbers 2 & 3 have the same count and the same total_sales number and similar company names. Is there a way to create a case statement that takes these 3 factors into consideration and then drops one or the other for Jimmy's enterprises? The other issue is that this has to be variable as there are other instances where this happens. And I would only want this to happen if the count and sales number match each other with a similar name in the company name.
Desired result:
   Customer_Name_Count   Company_Name          Total_Sales
--------------------------------------------------------------
1  3                     Blockbuster           1,000
2  6                     Jimmy's Bar           1,500
3  9                     Impala Hotel          2,000
4  12                    Sports Drink          2,500
It looks like the other answers are accurate, assuming that the Company_IDs are the same for both.
If the Company_IDs are different for Jimmy's Bar and Jimmy's Restaurant, then you can use something like this. I suggest you get functional users involved and do some data clean-up; otherwise you'll be maintaining this every time the issue arises:
SELECT
COUNT(DISTINCT CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END) AS Customer_Name_Count
,CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END AS Company_Name
,SUM(A12.Total_Sales) AS Total_Sales
FROM some_table AS A12
GROUP BY CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END
Your problem is that the joins you are using are multiplying the number of rows. Somewhere along the way, multiple names are associated with exactly the same entity (which is why the numbers are the same). You can fix this by aggregating by the right id:
SELECT COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
MAX(Company_Name) as Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM some_table AS A12
GROUP BY Company_id -- I'm guessing the column is something like this
ORDER BY 3 ASC, 2 ASC;
This might actually overstate the sales (I don't know). Better would be fixing the join so it only returned one name. One possibility is that it is a type-2 dimension, meaning that there is a time component for values that change over time. You may need to restrict the join to a single time period.
You need a function that returns a common name for the companies, and then use DISTINCT:
SELECT DISTINCT
Customer_Name_Count,
dbo.GetCommonName(Company_Name) as Company_Name,
Total_Sales
FROM dbo.theTable
You can try ROW_NUMBER as a window function to number the rows within each (Customer_Name_Count, Total_Sales) pair, then keep only rn = 1:
SELECT * FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY Customer_Name_Count,Total_Sales ORDER BY Company_Name) rn
FROM (
SELECT
COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM
some_table AS A12
GROUP BY
Company_Name
) t1
) t2
WHERE rn = 1
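The ROW_NUMBER idea above can be tried out with a small sketch. To keep it short, it starts from the already-aggregated result set shown in the question, stored in a hypothetical table named results, and runs on an in-memory SQLite database through Python's sqlite3 module (both are assumptions; window functions need SQLite 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the aggregated output from the question.
conn.execute("""
    CREATE TABLE results (
        customer_name_count INTEGER,
        company_name TEXT,
        total_sales INTEGER
    )
""")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [(3, "Blockbuster", 1000),
     (6, "Jimmy's Bar", 1500),
     (6, "Jimmy's Restaurant", 1500),
     (9, "Impala Hotel", 2000),
     (12, "Sports Drink", 2500)],
)

# Rows sharing the same count and sales form one partition; only the
# alphabetically first company name in each partition is kept.
kept = conn.execute("""
    SELECT customer_name_count, company_name, total_sales
    FROM (
        SELECT r.*,
               ROW_NUMBER() OVER (PARTITION BY customer_name_count, total_sales
                                  ORDER BY company_name) AS rn
        FROM results r
    ) ranked
    WHERE rn = 1
    ORDER BY total_sales
""").fetchall()

kept_names = [row[1] for row in kept]
print(kept_names)
```

This keeps one row per matching count and sales pair, but it does not check that the company names are similar, so two unrelated companies with identical figures would also be collapsed.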

MS Access equivalent for using dense_rank in select

In MS Access, I have a table with 2 million account records/rows with various columns of data. I wish to apply a sequence number to every account record. (i.e.- 1 for the first account record ABC111, 2 for the second account record DEF222..., etc.)
Then, I would like to assign a batch number sequence for every 5 distinct account number. (i.e - record 1 with account number ABC111 being associated with batch number 101, record 2 with account number DEF222 being associated with batch number of 101)
This is how I would do it with a sql server query:
select distinct p.accountnumber,
FLOOR(((50 + dense_rank() over(order by p.accountnumber)) - 1)/5) + 100 As BATCH
from db2inst1.account_table p
Raw Data:
AccountNumber
ABC111
DEF222
GHI333
JKL444
MNO555
PQR666
STU777
Resulting Data:
RecordNumber  AccountNumber  BatchNumber
1             ABC111         101
2             DEF222         101
3             GHI333         101
4             JKL444         101
5             MNO555         101
6             PQR666         102
7             STU777         102
I tried to make a query that uses SELECT as well as DENSE_RANK but I couldn't figure out how to make it work.
Thanks for reading my question
Something like this would probably work.
I'd first create a temporary table to hold the distinct account numbers, then I'd do an update query to assign the ranking.
CREATE TABLE tmpAccountRank
(AccountNumber TEXT(10)
CONSTRAINT PrimaryKey PRIMARY KEY,
AccountRank INTEGER NULL);
Then I'd use this table to generate the account ranking.
DELETE FROM tmpAccountRank;
INSERT INTO tmpAccountRank(AccountNumber)
SELECT DISTINCT AccountNumber FROM db2inst1.account_table;
UPDATE tmpAccountRank
SET AccountRank =
DCOUNT('AccountNumber', 'tmpAccountRank',
'AccountNumber < ''' + AccountNumber + '''') \ 5 + 101
I use DCOUNT and integer division (\ 5) to generate the ranking. This probably will have terrible performance but I think it's the way you would do it in MS Access.
If you want to skip the temp table, you can do it all in a nested subquery, but I don't think it's a great practice to do too much in a single query, especially in MS Access.
SELECT AccountNumber,
(SELECT COUNT(*) FROM
(SELECT DISTINCT AccountNumber
FROM db2inst1.account_table
WHERE AccountNumber < t.AccountNumber) q) \ 5 + 101
FROM db2inst1.account_table t
Actually, this won't work in MS Access; apparently you can't reference tables outside of multiple levels of nesting in a subquery.
You can do dense_rank() with a correlated subquery. The logic is:
select a.*,
(select count(distinct a2.accountnumber)
from db2inst1.account_table as a2
where a2.accountnumber <= a.accountnumber
) as dense_rank
from db2inst1.account_table as a;
Then, you can use this for getting the batch number. Unfortunately, I don't follow the logic in your question (dense_rank() produces a number but your batch number is not numeric). However, this should answer your question.
EDIT:
Oh, that's right. In MS Access you need nested subqueries:
select a.*,
(select count(*)
from (select distinct a2.accountnumber
from db2inst1.account_table as a2
) as a2
where a2.accountnumber <= a.accountnumber
) as dense_rank
from db2inst1.account_table as a;
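The correlated-subquery logic for dense_rank() and the batch number from the question's expected output (five distinct accounts per batch, starting at 101) can be sketched as below. It runs on an in-memory SQLite database via Python's sqlite3 module rather than MS Access, which is an assumption: SQLite accepts COUNT(DISTINCT ...) directly, so the nested form needed in Access is not used here. The table name account_table and the alias rnk are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account_table (accountnumber TEXT)")
conn.executemany(
    "INSERT INTO account_table VALUES (?)",
    [("ABC111",), ("ABC111",),  # duplicate row to show the rank stays dense
     ("DEF222",), ("GHI333",), ("JKL444",),
     ("MNO555",), ("PQR666",), ("STU777",)],
)

# rnk: number of distinct account numbers <= the current one (a dense rank).
# batch: integer division groups ranks 1-5 into 101, 6-10 into 102, and so on.
ranked = conn.execute("""
    SELECT DISTINCT accountnumber, rnk, (rnk - 1) / 5 + 101 AS batch
    FROM (
        SELECT a.accountnumber,
               (SELECT COUNT(DISTINCT a2.accountnumber)
                FROM account_table AS a2
                WHERE a2.accountnumber <= a.accountnumber) AS rnk
        FROM account_table AS a
    ) r
    ORDER BY rnk
""").fetchall()

for row in ranked:
    print(row)
```

On a table with millions of rows a correlated subquery like this is likely to be slow, so an index on the account number column would be worth considering.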

Display a blank row between every unique row?

I have a simple query like:
SELECT employee, ITEM_TYPE, COUNT(ITEM_TYPE)
FROM hr_database
GROUP BY employee, ITEM_TYPE
So the output may look like
BOB MUGS 4
BOB PENCILS 10
CAT MUGS 2
CAT PAPERCLIPS 7
SAL MUGS 11
But for readability, I want to put a blank row between each user in the output, like this:
BOB MUGS 4
BOB PENCILS 10

CAT MUGS 2
CAT PAPERCLIPS 7

SAL MUGS 11
Is there a way to do this in Oracle SQL? So far, I found this link, but it doesn't match what I need. I'm thinking of using a WITH clause in the query.
You can do it in the database, but this type of processing should really be done at the application layer.
But, it is kind of an amusing trick to figure out how to do it in the database, and that is your specific question:
WITH e AS (
SELECT employee, ITEM_TYPE, COUNT(ITEM_TYPE) as cnt
FROM hr_database
GROUP BY employee, ITEM_TYPE
)
SELECT (case when cnt is not null then employee end) as employee,
item_type, cnt
FROM (select employee, item_type, cnt, 1 as x from e union all
select distinct employee, NULL, NULL, 2 as x from e
) e
ORDER BY e.employee, x;
I emphasize, though, that this is really for amusement and perhaps for understanding better how SQL works. In the real world, you do this type of work at the application layer.
A summary of how this works: the UNION ALL brings in one additional row for each employee. The x is a priority for sorting, because you have to sort the result set to get the proper ordering. The CASE expression is needed to keep the employee name out of the first column of the separator rows. cnt should never be NULL for the valid rows.
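The separator-row trick summarised above can be run end to end with the following sketch. It uses Python's sqlite3 module and an in-memory SQLite table instead of Oracle (an assumption), and it names the displayed column employee_label (a made-up name) so that the ORDER BY can still sort on the real employee value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hr_database (employee TEXT, item_type TEXT)")

# Sample data matching the counts shown in the question.
counts = {("BOB", "MUGS"): 4, ("BOB", "PENCILS"): 10,
          ("CAT", "MUGS"): 2, ("CAT", "PAPERCLIPS"): 7,
          ("SAL", "MUGS"): 11}
for (employee, item_type), n in counts.items():
    conn.executemany("INSERT INTO hr_database VALUES (?, ?)",
                     [(employee, item_type)] * n)

rows = conn.execute("""
    WITH e AS (
        SELECT employee, item_type, COUNT(item_type) AS cnt
        FROM hr_database
        GROUP BY employee, item_type
    )
    SELECT CASE WHEN cnt IS NOT NULL THEN employee END AS employee_label,
           item_type, cnt
    FROM (SELECT employee, item_type, cnt, 1 AS x FROM e
          UNION ALL
          -- one all-NULL separator row per employee, sorted after the data
          SELECT DISTINCT employee, NULL, NULL, 2 AS x FROM e
         ) s
    ORDER BY s.employee, s.x, s.item_type
""").fetchall()

# Separator rows come back as NULLs; print them as blank lines.
for label, item_type, cnt in rows:
    print(label or "", item_type or "", "" if cnt is None else cnt)
```

As in the original query, a separator row also follows the last employee; the application layer can simply drop it.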
You can also try it like this, with a plain UNION and DISTINCT:
select emp,item_type,cnt from
(select distinct ' ' as emp,' ' as item_type ,' ' as cnt, employee
from hr_database
union
select employee as emp,item_type ,to_char(count(item_type)) as cnt, employee
from hr_database
group by employee,item_type)a
order by a.employee