SQL: find most common values for specific members in column 1 - sql

I have the following SQL related question:
Let us assume I have the following simple data table:
I would like to identify the most common street address and place it in column 3:
I think this should be fairly straight-forward using COUNT? Not quite sure how to go about it though. Any help is greatly appreciated
Regards

This is a very long method that I just wrote. It only lists the most frequent address. You have to get these values and insert them into the table. See if it works for you:
select * from
(select d.company, count(d.address) as final, c.maxcount,d.address
from dbo.test d inner join
(select a.company,max(a.add_count) as maxcount from
(select company,address,count(address) as add_count from dbo.test group by company,address)a
group by a.company) c
on (d.company = c.company)
group by d.company,c.maxcount,d.address)e
where e.maxcount=e.final

Here is a query in standard SQL. It first counts records per company and address, then ranks them per company giving the most often occurring address rank #1. Then it only keeps those best ranked address records, joins with the table again and shows the results.
select
mytable.company,
mytable.address,
ranked.address as most_common_address
from mytable
join
(
select
company,
address,
row_number() over (partition by company oder by cnt desc) as rn
from
(
select
company,
address,
count(*) over (partition by company, address) as cnt
from mytable
) counted
) ranked on ranked.rn = 1
and ranked.company = mytable.company
and ranked.address = mytable.address;

This select statement will give you the most frequent occurrence. Let us call this A.
SELECT `value`,
COUNT(`value`) AS `value_occurrence`
FROM `my_table`
GROUP BY `value`
ORDER BY `value_occurrence` DESC
LIMIT 1;
To INSERT this into your table,
INSERT INTO db (col1, col2, col3) VALUES (val1, val2, A)
Note that you want that whole select statment for A!

You don't mention your DBMS. Here is a solution for Oracle.
select
company,
address,
(
select stats_mode(address)
from mytable this_company_only
where this_company_only.company = mytable.company
) as most_common_address
from mytable;
This looks a bit clumsy, because STATS_MODE is only available as an aggregate function, not as an analytic window function.

Related

How do we find frequency of one column based off two other columns in SQL?

I'm relatively new to working with SQL and wasn't able to find any past threads to solve my question. I have three columns in a table, columns being name, customer, and location. I'd like to add an additional column determining which location is most frequent, based off name and customer (first two columns).
I have included a photo of an example where name-Jane customer-BEC in my created column would be "Texas" as that has 2 occurrences as opposed to one for California. Would there be anyway to implement this?
If you want 'Texas' on all four rows:
select t.Name, t.Customer, t.Location,
(select t2.location
from table1 t2
where t2.name = t.name
group by name, location
order by count(*) desc
fetch first 1 row only
) as most_frequent_location
from table1 t ;
You can also do this with analytic functions:
select t.Name, t.Customer, t.Location,
max(location) keep (dense_rank first order by location_count desc) over (partition by name) most_frequent_location
from (select t.*,
count(*) over (partition by name, customer, location) as location_count
from table1 t
) t;
Here is a db<>fiddle.
Both of these version put 'Texas' in all four rows. However, each can be tweaks with minimal effort to put 'California' in the row for ARC.
In Oracle, you can use aggregate function stats_mode() to compute the most occuring value in a group.
Unfortunately it is not implemented as a window function. So one option uses an aggregate subquery, and then a join with the original table:
select t.*, s.top_location
from mytable t
inner join (
select name, customer, stats_mode(location) top_location
from mytable
group by name, customer
) s where s.name = t.name and s.customer = t.customer
You could also use a correlated subquery:
select
t.*,
(
select stats_mode(t1.location)
from mytable t1
where t1.name = t.name and t1.customer = t.customer
) top_location
from mytable t
This is more a question about understanding the concepts of a relational database. If you want that information, you would not put that in an additional column. It is calculated data over multiple columns - why would you store that in the table itself ? It is complex to code and it would also be very expensive for the database (imagine all the rows you have to calculate that value for if someone inserted a million rows)
Instead you can do one of the following
Calculate it at runtime, as shown in the other answers
if you want to make it more persisent, you could embed that query above in a view
if you want to physically store the info, you could use a materialized view
Plenty of documentation on those 3 options in the official oracle documentation
Your first step is to construct a query that determines the most frequent location, which is as simple as:
select Name, Customer, Location, count(*)
from table1
group by Name, Customer, Location
This isn't immediately useful, but the logic can be used in row_number(), which gives you a unique id for each row returned. In the query below, I'm ordering by count(*) in descending order so that the most frequent occurrence has the value 1.
Note that row_number() returns '1' to only one row.
So, now we have
select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1 tb_
group by Name, Customer, Location
The final step puts it all together:
select tab.*, tb_.Location most_freq_location
from table1 tab
inner join
(select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1
group by Name, Customer, Location) tb_
on tb_.Name = tab.Name
and tb_.Customer = tab.Customer
and freq_name_cust = 1
You can see how it all works in this Fiddle where I deliberately inserted rows with the same frequency for California and Texas for one of the customers for illustration purposes.

SQL removing duplicates in a certain column

Good afternoon all!
Had a question regarding the removal of duplicates in one column and making it remove the whole row. I will provide an example in a screenshot in Excel as to not provide proprietary info.
I am looking to remove one of the rows highlighted in yellow for example but do not want to limit it to one Dr.Mike or one Health Partners clinic for example. Relatively new to this so any help would be greatly appreciated.
Thank you
You can do:
select distinct prov, clinic, address
from t;
I have used the below script to remove duplicates.
SELECT Prov, Clinic,Address, count(*)
FROM SomeTable
group by Prov, Clinic,Address
having count(*)>1
SELECT Prov, Clinic,Address, countrows = count(*)
INTO [holdkey]
FROM SomeTable
GROUP BY Prov, Clinic,Address
HAVING count(*) > 1
SELECT DISTINCT sometable.*
INTO holdingtable
FROM sometable, holdkey
WHERE sometable.Prov = holdkey.Prov
AND sometable.Clinic = holdkey.Clinic
AND sometable.Address = holdkey.Address
SELECT Prov, Clinic,Address, count(*)
FROM holdingtable
group by Prov, Clinic,Address
having count(*)>1
DELETE sometable
FROM sometable, holdkey
WHERE sometable.Prov = holdkey.Prov
AND sometable.Clinic = holdkey.Clinic
AND sometable.Address = holdkey.Address
INSERT sometable SELECT * FROM holdingtable

Select entry of each group having exactly 1 entry

I am looking for an optimized query
let me show you a small example.
Lets suppose I have a table having three field studentId, teacherId and subject as
Now I want those data in which a physics teacher is teaching to only one student, i.e
teacher 300 is only teaching student 3 and so on.
What I have tried till now
select sid,tid from tabletesting with(nolock)
where tid in (select tid from tabletesting with(nolock)
where subject='physics' group by tid having count(tid) = 1)
and subject='physics'
The above query is working fine. But I want different solution in which I don't have to scan the same table twice.
I also tried using Rank() and Row_Number() but no result.
FYI :
I have showed you an example, this is not the actual table i am playing with, my table contain huge number of rows and columns and where clause is also very complex(i.e date comparison etc.), so I don't want to give the same where clause in subquery and outquery.
You can do this with window functions. Assuming that there are no duplicate students for a given teacher (as in your sample data):
select tt.sid, tt.tid
from (select tt.*, count(*) over (partition by teacher) as scnt
from TableTesting tt
) tt
where scnt = 1;
Another way to approach this, which might be more efficient, is to use an exists clause:
select tt.sid, tt.tid
from TableTesting tt
where not exists (select 1 from TableTesting tt1 where tt1.tid = tt.tid and tt1.sid <> tt.sid)
Another option is to use an analytic function:
select sid, tid, subject from
(
select sid, tid, subject, count(sid) over (partition by subject, tid) cnt
from tabletesting
) X
where cnt = 1

Find duplicated rows that are not exactly same

Can i select all rows that have same column value (for example SSN field) but display them all separably. ?
I've searched for this answer but they all have "count(*) and group by" section that demands the rows to be exactly same.
Try This:
SELECT A, B FROM MyTable
WHERE A IN
(
SELECT A FROM MyTable GROUP BY A HAVING COUNT(*)>1
)
I have done with SQL server. But hope this is what you need
Here is another approach, which only references the table once, using an analytic function instead of a subquery to get the duplicate counts It might be faster; it also might not, depending on the particular data.
SELECT * FROM (
SELECT col1, col2, col3, ssn, COUNT(*) OVER (PARTITION BY ssn) ssn_dup_count
)
WHERE ssn_dup_count > 1
ORDER BY ssn_dup_count DESC
SELECT
*
FROM
MyTable
WHERE
EXISTS
(
SELECT
NULL
FROM
MyTable MT
WHERE
MyTable.SameColumnName = MT.SameColumnName
AND MyTable.DifferentColumnName <> MT.DifferentColumnName)
This will fetch the required data and show them in order so that we can see the grouped data together.
SELECT * FROM TABLENAME
WHERE SSN IN
(
SELECT SSN FROM TABLENAMEGROUP BY SSN HAVING COUNT(SSN)>1
)
ORDER BY SSN
Here SSN is the column names fro which similar value check is done.

How to find max value and its associated field values in SQL?

Say I have a list of student names and their marks. I want to find out the highest mark and the student, how can I write one select statement to do that?
Assuming you mean marks rather than remarks, use:
select name, mark
from students
where mark = (
select max(mark)
from students
)
This will generally result in a fairly efficient query. The subquery should be executed once only (unless your DBMS is brain-dead) and the result fed into the second query. You may want to ensure that you have an index on the mark column.
If you don't want to use a subquery:
SELECT name, remark
FROM students
ORDER BY remark DESC
LIMIT 1
select name, remarks
from student
where remarks =(select max(remarks) from student)
If you are using a database that supports windowing,
SELECT name, mark FROM
(SELECT name, mark, rank() AS rk
FROM student_marks OVER (ORDER BY mark DESC)
) AS subqry
WHERE subqry.rk=1;
This probably does not run as fast as the mark=(SELECT MAX(mark)... style query, but it would be worth checking out.
In SQL Server:
SELECT TOP 1 WITH TIES *
FROM Students
ORDER BY Mark DESC
This will return all the students that have the highest mark, whether there is just one of them or more than one. If you want only one row, drop the WITH TIES specifier. (But the actual row is not guaranteed to be always the same then.)
You can create view and join it with original table:
V1
select id , Max(columName)
from t1
group by id
select * from t1
where t1.id = V1.id and t1.columName = V1.columName
this is right if you need Max Values with related info
I recently had a need for something "kind of similar" to this post and wanted to share a technique. Say you have an Order and OrderDetail table, and you want to return info from the Order table along with the product name associated with the highest priced detail row. Here's a way to pull that off without subtables, RANK, etc.. The key is to create and aggregate that combined the key and value from the detailed table and then just max on that and substring out the value you want.
create table CustOrder(ID int)
create table CustOrderDetail(OrderID int, Price money, ProdName varchar(20))
insert into CustOrder(ID) values(1)
insert into CustOrderDetail(OrderID,Price,ProdName) values(1,10,'AAA')
insert into CustOrderDetail(OrderID,Price,ProdName) values(1,50,'BBB')
insert into CustOrderDetail(OrderID,Price,ProdName) values(1,10,'CCC')
select
o.ID,
JoinAggregate=max(convert(varchar,od.price)+'*'+od.prodName),
maxProd=
SUBSTRING(
max(convert(varchar,od.price)+'*'+od.prodName)
,CHARINDEX('*',max(convert(varchar,od.price)+'*'+convert(varchar,od.prodName))
)+1,9999)
from
CustOrder o
inner join CustOrderDetail od on od.orderID = o.ID
group by
o.ID