How to eliminate duplicate rows based on column max value in SQL?

I have a view that returns duplicate rows with a performance field, which can be 1, 2, 3, or 4. I need to select only the rows that have the max performance. How do I do this? I tried this:
View:
Results needed as follows:
So basically, for each employee number group, I need to fetch the max Calender value and the corresponding Performance value.

You can express
the employee record with the highest performance
as:
there is no record (for this employee) with a higher performance
And that gives:
SELECT * FROM employee e
WHERE NOT EXISTS (
SELECT *
FROM employee nx
WHERE nx.employee_number = e.employee_number
AND nx.performance > e.performance
);
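To try the NOT EXISTS idea without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (employee_number, calender, performance) and the sample rows are assumptions made up for illustration, not the asker's real view.

```python
import sqlite3

def top_performance_rows(rows):
    """Return the rows that have no same-employee row with a higher performance."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema standing in for the asker's view.
    conn.execute(
        "CREATE TABLE employee "
        "(employee_number INTEGER, calender INTEGER, performance INTEGER)"
    )
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT e.employee_number, e.calender, e.performance
        FROM employee e
        WHERE NOT EXISTS (
            SELECT 1
            FROM employee nx
            WHERE nx.employee_number = e.employee_number
              AND nx.performance > e.performance
        )
        ORDER BY e.employee_number, e.calender
        """
    ).fetchall()
    conn.close()
    return result

# Made-up sample data: employee 2 has a tie on the highest performance.
sample = [(1, 2019, 2), (1, 2020, 4), (2, 2019, 3), (2, 2020, 3)]
print(top_performance_rows(sample))
```

Note that ties survive: if an employee has two rows sharing the highest performance, both are returned, so an extra tie-breaker is needed when exactly one row per employee is required.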

You need two queries: first read the max Calender for each employee group, then select the rows whose group and Calender match those values.
Select vm."Employee Number" as eGroup, max(vm.Calender) as Calender From view1 vm Group By vm."Employee Number"
That part is obvious. The problem is how to inject it as a criterion for
Select vd."Employee Number", vd.Calender, vd.Performance From view1 vd where ... ?
Since an aggregate function (like max) gives a single result per group, you can use a JOIN without the risk of producing a Cartesian product (where all N×M combinations of rows from the two selects are tried)
Select vd."Employee Number", vd.Calender, vd.Performance From view1 vd
Join (Select "Employee Number" as eGroup, max(Calender) as maxCal From view1 Group By "Employee Number") as vm
On (vd."Employee Number" = vm.eGroup) and (vd.Calender = vm.maxCal)
Note that this can still produce several rows with the same Employee Number and Calender if that is how they are in the table, unless you have a UNIQUE INDEX on both columns.
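The join against a grouped subquery can be checked with a small sqlite3 sketch. Here view1 is created as an ordinary table standing in for the real view, and the sample rows are invented. The last two rows are deliberate duplicates, to show the caveat about repeated Employee Number / Calender pairs.

```python
import sqlite3

def latest_rows(rows):
    """Return the rows whose Calender equals the max Calender of their employee."""
    conn = sqlite3.connect(":memory:")
    # A plain table used as a stand-in for the view (assumption).
    conn.execute(
        'CREATE TABLE view1 '
        '("Employee Number" INTEGER, Calender INTEGER, Performance INTEGER)'
    )
    conn.executemany("INSERT INTO view1 VALUES (?, ?, ?)", rows)
    result = conn.execute(
        '''
        SELECT vd."Employee Number", vd.Calender, vd.Performance
        FROM view1 vd
        JOIN (
            -- one row per employee: the latest Calender value
            SELECT "Employee Number" AS eGroup, MAX(Calender) AS maxCal
            FROM view1
            GROUP BY "Employee Number"
        ) AS vm
          ON vd."Employee Number" = vm.eGroup
         AND vd.Calender = vm.maxCal
        ORDER BY vd."Employee Number", vd.Performance
        '''
    ).fetchall()
    conn.close()
    return result

# Invented data; employee 2 has the same (Employee Number, Calender) pair twice.
sample_view = [
    (1, 201901, 2), (1, 201902, 4),
    (2, 201901, 3), (2, 201902, 1), (2, 201902, 1),
]
print(latest_rows(sample_view))
```

Employee 2 comes back twice because the source rows are duplicated on both join columns; adding DISTINCT to the outer SELECT would collapse rows that are identical in every selected column.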

Thanks all for your inputs. I tried a slightly different approach for this.
Step 1 : Since I did not have any unique identifier for each row, I made one by concatenating Employee_Number and Calender columns.
It becomes easier then and we can use following :
SELECT DISTINCT
T1.EMPLOYEE_NUMBER, T1.Performance, T1.Calender
FROM dbo.view1 AS T1 LEFT OUTER JOIN dbo.view1 AS T2
ON T2.EMPLOYEE_NUMBER = T1.EMPLOYEE_NUMBER
AND T2.maxval > T1.maxval
WHERE (T2.maxval IS NULL)
Here maxval is the concatenated column.
The actual answer is here: SQL Server: Only last entry in GROUP BY

Your stated goal doesn't match your results, but I would do this (at least in DB2):
select e.Employee_Number, e.Calendar, e.Performance
from Employee e
where e.Calendar = (select max(e2.Calendar) from Employee e2 where e2.Employee_Number = e.Employee_Number)

select v1.performance,max(v1.field)
from view1 v1
group by v1.performance

Related

Confused with the Group By function in SQL

Q1: After using the Group By function, why does it only output one row of each group at most? Does this mean that having is supposed to filter the group rather than filter the records in each group?
Q2: I want to find the records in each group whose ages are greater than the average age of that group. I tried the following, but it returns nothing. How should I fix this?
SELECT *, avg(age) FROM Mytable Group By country Having age > avg(age)
Thanks!!!!
You can calculate the average age for each country in a subquery and join that to your table for filtering:
SELECT mt.*, MtAvg.AvgAge
FROM Mytable mt
inner join
(
select mtavgs.country
, avg(mtavgs.age) as AvgAge
from Mytable mtavgs
group by mtavgs.country
) MTAvg
on mtavg.country=mt.country
and mt.Age > mtavg.AvgAge
GROUP BY always returns one row per unique combination of values in the listed GROUP BY columns (provided that the row is not removed by a HAVING clause). The subquery in our example (alias: MTAvg) calculates a single row per country. We use its results to filter the rows of the main table by applying the condition in the INNER JOIN clause, and we also report the calculated average age in the output.
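A quick way to see this behaviour is a sqlite3 sketch of the same join. The name column and the sample people are assumptions added only so the output is readable; the question does not show the table's columns beyond country and age.

```python
import sqlite3

def older_than_country_average(people):
    """Return people whose age is above the average age of their country."""
    conn = sqlite3.connect(":memory:")
    # The name column is hypothetical; country and age come from the question.
    conn.execute("CREATE TABLE Mytable (country TEXT, name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO Mytable VALUES (?, ?, ?)", people)
    rows = conn.execute(
        """
        SELECT mt.country, mt.name, mt.age, avgs.AvgAge
        FROM Mytable mt
        INNER JOIN (
            -- exactly one row per country
            SELECT country, AVG(age) AS AvgAge
            FROM Mytable
            GROUP BY country
        ) avgs
          ON avgs.country = mt.country
         AND mt.age > avgs.AvgAge
        ORDER BY mt.country, mt.name
        """
    ).fetchall()
    conn.close()
    return rows

# Invented sample: the average age is 40 in both countries.
people = [
    ("DE", "Anna", 30), ("DE", "Ben", 50),
    ("FR", "Chloe", 20), ("FR", "David", 40), ("FR", "Eve", 60),
]
print(older_than_country_average(people))
```

David is exactly at the French average, so the strict > comparison leaves him out.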
GROUP BY is a clause that is used together with aggregate functions such as AVG or COUNT. For further reading, see: SQL Group By tutorial
It lumps all the rows that share the same grouping values together into one row. In your example, it would lump all the rows with the same country together.
Not quite sure what exactly your query needs to be to solve your exact problem. I would however look into what are called window functions in SQL. I believe what you first need to do is write a window function to find the average age in each group. Then you can write a query to return the results you need
Depending on your DBMS type and version, you may be able to use a "window function" that calculates the average per country and makes the result available on every row. Once that data is present as a "derived table", you can simply use a WHERE clause to filter for the ages that are greater than the calculated average per country.
SELECT mt.*
FROM (
SELECT *
, avg(age) OVER(PARTITION BY country) AS AvgAge
FROM Mytable
) mt
WHERE mt.Age > mt.AvgAge

Why do I get extra rows in LEFT JOIN when joining to an ID and TIMESTAMP column?

I have a table that contains multiple registration periods (date and time for the start of the registration, as well as date and time for when that instance of registration ends). For each row (registration period), there is a status column that contains the status at the end of the registration period. I was trying to get the status associated with the most recent end date of registration for a given ID. I've used a window function to get the most recent end date of interest per ID, and then I wanted to LEFT JOIN on ID and end date to get the status from the same table on which I used the window function. There should really be just one combination of end date and status per ID, but somehow I get more rows than what's in the left table.
Like I mentioned earlier, my approach was to use a window function to get MAX(end_date) per ID and some other column, let's call it enrollment_number. Then use LEFT JOIN on this table and its parent table to bring in status associated with that date only. Later, I'd like to use the result of this join to bring in the status associated with the end date into other tables where I need it.
WITH
my_first_test AS
(
SELECT my_id,
enrollment_number,
MAX(end_date_of_enrollment) OVER (partition by my_id, enrollment_number) AS end_date_enrolled
FROM enrollments
)
SELECT mft.my_id, mft.end_date_enrolled, e.status
FROM my_first_test AS mft
LEFT JOIN enrollments AS e
ON mft.my_id = e.my_id AND mft.end_date_enrolled = e.end_date_of_enrollment;
The CTE returns 42917 rows, same number of rows as in the enrollments table, which it should be if I understand it correctly.
Then, I LEFT JOIN enrollments, to bring in information from the status column also contained in the enrollments table. The LEFT JOIN is done on my_id and end_date_enrolled.
I expect 42917 rows in the resulting table, because my_id and end_date_enrolled together should be unique. However, I get slightly more rows in my final table - 44408. I was wondering if the StackOverflow community would be able to help me solve this mystery. I am using SQL in AWS Redshift.
You have duplicates in enrollments. You can find them with aggregation:
SELECT my_id, end_date_of_enrollment, COUNT(*)
FROM enrollments AS e
GROUP BY my_id, end_date_of_enrollment
HAVING COUNT(*) > 1;
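To illustrate how duplicates on the join key inflate a LEFT JOIN, here is a small sqlite3 sketch. The column names follow the CTE in the question (with end_date_of_enrollment as the stored column), the rows are invented, and the window function is replaced by an equivalent correlated subquery so the example also runs on older SQLite builds. This is a substitution for the demo, not what Redshift requires.

```python
import sqlite3

def join_fanout(enrollments):
    """Return (left row count, joined row count, duplicated join keys)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE enrollments (my_id INTEGER, enrollment_number INTEGER, "
        "end_date_of_enrollment TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO enrollments VALUES (?, ?, ?, ?)", enrollments)
    # Same shape as the question's CTE: one output row per enrollments row.
    cte = """
        WITH my_first_test AS (
            SELECT e.my_id, e.enrollment_number,
                   (SELECT MAX(x.end_date_of_enrollment)
                    FROM enrollments x
                    WHERE x.my_id = e.my_id
                      AND x.enrollment_number = e.enrollment_number
                   ) AS end_date_enrolled
            FROM enrollments e
        )
    """
    left_count = conn.execute(cte + "SELECT COUNT(*) FROM my_first_test").fetchone()[0]
    joined_count = conn.execute(
        cte + """
        SELECT COUNT(*)
        FROM my_first_test AS mft
        LEFT JOIN enrollments AS e
          ON mft.my_id = e.my_id
         AND mft.end_date_enrolled = e.end_date_of_enrollment
        """
    ).fetchone()[0]
    duplicates = conn.execute(
        """
        SELECT my_id, end_date_of_enrollment, COUNT(*)
        FROM enrollments
        GROUP BY my_id, end_date_of_enrollment
        HAVING COUNT(*) > 1
        """
    ).fetchall()
    conn.close()
    return left_count, joined_count, duplicates

# Invented rows: my_id 2 has two rows with the same end date.
sample_enrollments = [
    (1, 1, "2020-01-31", "completed"),
    (1, 2, "2020-06-30", "withdrawn"),
    (2, 1, "2020-03-31", "completed"),
    (2, 1, "2020-03-31", "transferred"),
]
print(join_fanout(sample_enrollments))
```

Each of the two left rows for my_id 2 matches both duplicated rows on the right, so four left rows become six joined rows.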

Rank order ST_DWithin results by the number of radii a result appears in

I have a table of existing customers and another table of potential customers. I want to return a list of potential customers rank ordered by the number of radii of existing purchasers that they appear in.
There are many rows in the potential customers table per each existing customer, and the radius around a given existing customer could encompass multiple potential customers. I want to return a list of potential customers ordered by the count of the existing customer radii that they fall within.
SELECT pur.contact_id AS purchaser, count(pot.*) AS nearby_potential_customers
FROM purchasers_geocoded pur, potential_customers_geocoded pot
WHERE ST_DWithin(pur.geom,pot.geom,1000)
GROUP BY purchaser;
Does anyone have advice on how to proceed?
EDIT:
With some help, I wrote this query, which seems to do the job, but I'm verifying now.
WITH prequalified_leads_table AS (
SELECT *
FROM nearby_potential_customers
WHERE market_val > 80000
AND market_val < 120000
)
, proximate_to_existing AS (
SELECT pot.prop_id AS prequalified_leads
FROM purchasers_geocoded pur, prequalified_leads_table pot
WHERE ST_DWithin(pot.geom,pur.geom,100)
)
SELECT prequalified_leads, count(prequalified_leads)
FROM proximate_to_existing
GROUP BY prequalified_leads
ORDER BY count(*) DESC;
"I want to return a list of potential customers ordered by the count of the existing customer radii that they fall within."
Your query tried the opposite of your statement, counting potential customers around existing ones.
Inverting that, and after adding some tweaks:
SELECT pot.contact_id AS potential_customer
, rank() OVER (ORDER BY pur.nearby_customers DESC
, pot.contact_id) AS rnk
, pur.nearby_customers
FROM potential_customers_geocoded pot
LEFT JOIN LATERAL (
SELECT count(*) AS nearby_customers
FROM purchasers_geocoded pur
WHERE ST_DWithin(pur.geom, pot.geom, 1000)
) pur ON true
ORDER BY 2;
I suggest a subquery with LEFT JOIN LATERAL ... ON true to get counts, thereby retaining rows with 0 nearby customers in the result; your original join style would exclude those. It should make use of the spatial index that you undoubtedly have:
CREATE INDEX ON purchasers_geocoded USING gist (geom);
Related:
What is the difference between LATERAL and a subquery in PostgreSQL?
Then ORDER BY the resulting nearby_customers in the outer query (not: nearby_potential_customers).
It's not clear whether you want to add an actual rank. Use the window function rank() if so. I made the rank deterministic while being at it, breaking ties with an additional ORDER BY expression: pot.contact_id. Else, peers are returned in arbitrary order which can change for every execution.
ORDER BY 2 is short syntax for "order by the 2nd output column". See:
Select first row in each GROUP BY group?
Related:
How do I query all rows within a 5-mile radius of my coordinates?
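The counting logic (one count per potential customer, zero counts kept) can be sketched without PostGIS. The example below is only an approximation under stated assumptions: it uses sqlite3, plain x/y coordinates with a Euclidean distance test in place of ST_DWithin on geometries, and a correlated scalar subquery in place of LEFT JOIN LATERAL. Table names, columns, and data are invented.

```python
import sqlite3

def rank_potentials(purchasers, potentials, radius):
    """Return (potential contact_id, number of purchasers within radius),
    ordered by count descending, then contact_id to break ties."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical planar tables; real data would hold PostGIS geometries.
    conn.execute("CREATE TABLE purchasers (contact_id INTEGER, x REAL, y REAL)")
    conn.execute("CREATE TABLE potentials (contact_id INTEGER, x REAL, y REAL)")
    conn.executemany("INSERT INTO purchasers VALUES (?, ?, ?)", purchasers)
    conn.executemany("INSERT INTO potentials VALUES (?, ?, ?)", potentials)
    rows = conn.execute(
        """
        SELECT pot.contact_id,
               (SELECT COUNT(*)
                FROM purchasers pur
                -- squared distance compared with squared radius
                WHERE (pur.x - pot.x) * (pur.x - pot.x)
                    + (pur.y - pot.y) * (pur.y - pot.y) <= :r * :r
               ) AS nearby_customers
        FROM potentials pot
        ORDER BY nearby_customers DESC, pot.contact_id
        """,
        {"r": radius},
    ).fetchall()
    conn.close()
    return rows

# Invented coordinates.
sample_purchasers = [(1, 0, 0), (2, 10, 0), (3, 100, 100)]
sample_potentials = [(101, 5, 0), (102, 0, 3), (103, 500, 500)]
print(rank_potentials(sample_purchasers, sample_potentials, 6))
```

Potential customer 103 is kept with a count of 0, which is the behaviour the LEFT JOIN LATERAL ... ON true form gives in PostgreSQL.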

SELECT list expression references column user_id which is neither grouped nor aggregated at [8:5]

I have 2 data sets. One of all patients who got ill (endo-2) and one of a special group of patients that also exists in endo-2 called "xp-56"
I've been trying to run this query and I'm not sure why it isn't working. I want to do counts of 3 columns in endo-2 of those patients that belong in the xp-56 table.
This is the code I've been using, with the following error:
SELECT list expression references column user_id which is neither grouped nor aggregated at [8:5]
How do I fix this so I never make the same mistake again?
SELECT
Virus_Exposure,
Medical_Delivery,
Number_of_Site
FROM
(
SELECT
medical_id,
COUNT(DISTINCT Virus_id) AS Virus_Exposure,
COUNT(EndoCrin_id) AS Medical_Delivery,
COUNT (site_id_clinic) AS Number_of_Site
FROM
`endo-2`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP("2017-12-15")
AND TIMESTAMP("2018-01-10")) AS a
RIGHT JOIN
(
SELECT
medical_id
FROM
`xp-56`
ORDER BY
medical_id DESC) AS b
ON
a.medical_id=b.medical_id
GROUP BY
medical_id
Why doesn't the medical_id in table a work?
Why not just do this?
SELECT e.medical_id,
COUNT(DISTINCT e.Virus_id) AS Virus_Exposure,
COUNT(e.EndoCrin_id) AS Medical_Delivery,
COUNT(e.site_id_clinic) AS Number_of_Site
FROM `endo-2` e JOIN
`xp-56` x
ON x.medical_id = e.medical_id
WHERE e._PARTITIONTIME BETWEEN TIMESTAMP("2017-12-15") AND TIMESTAMP("2018-01-10")
GROUP BY e.medical_id;
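The rule behind the error is that every column in the SELECT list must either appear in GROUP BY or sit inside an aggregate. The sketch below checks the join-then-group shape with sqlite3 on invented data; the _PARTITIONTIME filter is BigQuery-specific and is left out here, and the column types are assumptions.

```python
import sqlite3

def counts_for_special_group(endo_rows, xp_ids):
    """Count per medical_id, restricted to ids that also exist in xp-56."""
    conn = sqlite3.connect(":memory:")
    # Column types are assumed; the hyphenated names need quoting in SQLite.
    conn.execute(
        'CREATE TABLE "endo-2" (medical_id INTEGER, Virus_id TEXT, '
        'EndoCrin_id TEXT, site_id_clinic TEXT)'
    )
    conn.execute('CREATE TABLE "xp-56" (medical_id INTEGER)')
    conn.executemany('INSERT INTO "endo-2" VALUES (?, ?, ?, ?)', endo_rows)
    conn.executemany('INSERT INTO "xp-56" VALUES (?)', [(i,) for i in xp_ids])
    rows = conn.execute(
        '''
        SELECT e.medical_id,                           -- grouped column
               COUNT(DISTINCT e.Virus_id) AS Virus_Exposure,
               COUNT(e.EndoCrin_id)       AS Medical_Delivery,
               COUNT(e.site_id_clinic)    AS Number_of_Site
        FROM "endo-2" e
        JOIN "xp-56" x ON x.medical_id = e.medical_id
        GROUP BY e.medical_id
        ORDER BY e.medical_id
        '''
    ).fetchall()
    conn.close()
    return rows

# Invented rows; None becomes NULL, which COUNT(column) skips.
sample_endo = [
    (1, "v1", "e1", "s1"),
    (1, "v1", "e2", "s2"),
    (1, "v2", None, "s3"),
    (2, "v3", "e3", "s1"),
    (3, "v4", "e4", "s4"),
]
print(counts_for_special_group(sample_endo, [1, 2]))
```

Patient 3 is absent from xp-56, so the inner join drops it before the counts are computed.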

SQL Query to fetch information based on one or more condition. Getting combinations instead of exact number

I have two tables. Table 1 has about 750,000 rows and table 2 has 4 million rows. Table two has an extra ID field in which I am interested, so I want to write a query that will check if the 750,000 table 1 records exist in table 2. For all those rows in table 1 that exist in table 2, I want the respective ID based on same SSN. I tried the following query:
SELECT distinct b.UID, a.*
FROM [Analysis].[dbo].[Table1] A, [Proteus_8_2].dbo.Table2 B
where a.ssn = b.ssn
Instead of getting 750,000 rows in the output, I am getting 5.4 million records. Where am I going wrong?
Please help?
If b.UID is a unique field in table two, your SELECT DISTINCT still returns every matching combination of rows, because no two combinations are identical.
Also, if SSN is not unique in table one, you can get a higher row count than the total row count of table 2.
You need to consider what you want from table 2 again.
EDIT
You can try this to return distinct combinations of ssn and uid when ssn is found in table 2 provided that ssn and uid have a cardinality of 1:1, i.e., every unique ssn has a single unique uid.
select distinct
a.ssn,b.[UID]
from [Analysis].[dbo].[Table1] a
cross apply
( select top 1 [uid] from [Proteus_8_2].[dbo].[Table2] where ssn = a.ssn ) b
where b.[UID] is not null
Try with LEFT JOIN
SELECT distinct b.UID, a.*
FROM [Analysis].[dbo].[Table1] A LEFT JOIN [Proteus_8_2].dbo.Table2 B
on a.ssn = b.ssn
Since the order detail table is in a one-to-many relationship to the order table, that is the expected result of any join. If you want something different, you need to define for us the business rule that tells us how to select only one record from the order detail table. You cannot effectively write SQL code without understanding the business rules of what you are trying to achieve. You should never just willy-nilly select one record out of the many; you need to understand which one you want.
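As an illustration of the one-to-many effect and of an explicit selection rule, here is a sqlite3 sketch with invented data. The rule used, "take the smallest UID per SSN", is purely an assumption for the demo; the real rule has to come from the business requirements. The name column in Table1 is also hypothetical.

```python
import sqlite3

def compare_joins(table1_rows, table2_rows):
    """Return (row count of the plain join, one (ssn, uid) pair per matched ssn)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Table1 (ssn TEXT, name TEXT)")
    conn.execute("CREATE TABLE Table2 (uid INTEGER, ssn TEXT)")
    conn.executemany("INSERT INTO Table1 VALUES (?, ?)", table1_rows)
    conn.executemany("INSERT INTO Table2 VALUES (?, ?)", table2_rows)
    # Plain join: one output row per matching (Table1 row, Table2 row) pair.
    plain_count = conn.execute(
        "SELECT COUNT(*) FROM Table1 a JOIN Table2 b ON a.ssn = b.ssn"
    ).fetchone()[0]
    # Explicit rule (assumed for this demo): keep the smallest uid per ssn.
    one_per_ssn = conn.execute(
        """
        SELECT a.ssn, MIN(b.uid) AS uid
        FROM Table1 a
        JOIN Table2 b ON a.ssn = b.ssn
        GROUP BY a.ssn
        ORDER BY a.ssn
        """
    ).fetchall()
    conn.close()
    return plain_count, one_per_ssn

# Invented rows: ssn 111 appears twice in Table2, ssn 333 not at all.
sample_t1 = [("111", "A"), ("222", "B"), ("333", "C")]
sample_t2 = [(10, "111"), (11, "111"), (20, "222")]
print(compare_joins(sample_t1, sample_t2))
```

The plain join yields three rows for two matching people, while the grouped version yields one row per matched SSN.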