Duplicate values SQL (MS Access) - sql

I need to find duplicate records across 2 or more fields. But using this does not work in Access:
SELECT assay.depth_from, assay.au_gt
FROM assay
GROUP BY depth_from, au_gt
HAVING count(*) >1;
Am I missing something? It does match up with various answers here so not sure what.
I just get a records with duplicate depth_from but the au_gt are not duplicate. Actually not all the depth_from are even all duplicated.

I see two possible syntax issues with your SQL. First, you probably don't need to use the assay. prefix before your field names since you have specified which table you are selecting from, and this makes your reference to those fields in GROUP BY inconsistent. If you do use assay. in your SELECT statement use it in GROUP BY as well. Secondly, you should include count(*) in the SELECT statement. This is basically for the same reason- whatever you reference in GROUP BY and HAVING should be the column names you specified in SELECT. Try this:
SELECT depth_from, au_gt, count(*)
FROM assay
GROUP BY depth_from, au_gt
HAVING count(*) >1;

Related

In SQL how can I remove duplicates on a column without including it in the SELECT command/ the extract

I'm trying to remove duplicates on a column in SQL, without including that column in the extract (since it contains personally identifiable data). I thought I might be able to do this with nested queries (as below), however this isn't working. I also thought it might be possible to remove duplicates in the WHERE statement, but couldn't find anything from googling. Any ideas? Thanks in advance.
SELECT [ETHNIC], [RELIGION]
FROM
(SELECT DISTINCT [ID], [ETHNIC], [RELIGION]
FROM MainData)
Using distinct like that will apply distinct to the row, so if there are two rows with the same ID but different ETHNIC and RELIGION the distinct won't remove them. To do that you could use group by in your query, but then you need to use an aggregation (e.g. max):
SELECT [ETHNIC], [RELIGION]
FROM
(SELECT [ID], MAX([ETHNIC]) AS ETHNIC, MAX([RELIGION]) AS RELIGION
FROM MainData
GROUP BY [ID])
If that's not what you're looking for, some SQL dialects require that you name your inner select, so you could try adding AS X to the end of your query.

How to adapt a GROUP BY across multiple tables?

I'm trying to optimize a SQL query that uses a GROUP BY across multiple tables. Essentially, I have multiple tables which all contain a PID column, and the output I need is a record of every PID in all of the tables as well as a count of how many records across all of those tables contain that PID. When trying to use GROUP BY PID, I get a "column ambiguously defined" error if using multiple tables. Here is an example of the code I am using to retrieve the proper data from one table (can ignore the where clause):
select pid, count(*)
from table1
where vendor_id in(1,2)
and delay_code <=23
and age between 18 and 49 and sex = 'M'
group by pid
Essentially, I want to do this across a group of tables (i.e. table1, table2, table3 etc), but can't figure out how to do so without getting a "column ambiguously defined" error.
You need to identify what record you are referencing. You can do that either by specifying the table, or using an alias. Aliases are required when you have multiple references to the same table.
Specify table:
SELECT table1.pid, COUNT(*)
FROM table1
GROUP BY table1.pid
Use alias:
SELECT t1.pid, COUNT(*)
FROM table1 AS t1
GROUP BY t1.pid

"group by" needed in count(*) SQL statement?

The following statement works in my database:
select column_a, count(*) from my_schema.my_table group by 1;
but this one doesn't:
select column_a, count(*) from my_schema.my_table;
I get the error:
ERROR: column "my_table.column_a" must appear in the GROUP BY clause
or be used in an aggregate function
Helpful note: This thread: What does SQL clause "GROUP BY 1" mean? discusses the meaning of "group by 1".
Update:
The reason why I am confused is because I have often seen count(*) as follows:
select count(*) from my_schema.my_table
where there is no group by statement. Is COUNT always required to be followed by group by? Is the group by statement implicit in this case?
This error makes perfect sense. COUNT is an "aggregate" function. So you need to tell it which field to aggregate by, which is done with the GROUP BY clause.
The one which probably makes most sense in your case would be:
SELECT column_a, COUNT(*) FROM my_schema.my_table GROUP BY column_a;
If you only use the COUNT(*) clause, you are asking to return the complete number of rows, instead of aggregating by another condition. Your questing if GROUP BY is implicit in that case, could be answered with: "sort of": If you don't specify anything is a bit like asking: "group by nothing", which means you will get one huge aggregate, which is the whole table.
As an example, executing:
SELECT COUNT(*) FROM table;
will show you the number of rows in that table, whereas:
SELECT col_a, COUNT(*) FROM table GROUP BY col_a;
will show you the the number of rows per value of col_a. Something like:
col_a | COUNT(*)
---------+----------------
value1 | 100
value2 | 10
value3 | 123
You also should take into account that the * means to count everything. Including NULLs! If you want to count a specific condition, you should use COUNT(expression)! See the docs about aggragate functions for more details on this topic.
If you don't use the Group by clause at all then all that will be returned is a count of 1 for each row, which is already assumed anyway and therefore redundant data. By adding GROUP BY 1 you have categorized the information thereby making it non-redundant even though it returns the same result in theory as the statement that creates an error.
When you have a function like count, sum etc. you need to group the other columns. This would be equivalent to your query:
select column_a, count(*) from my_schema.my_table group by column_a;
When you use count(*) with no other column, you are counting all rows from SELECT * from the table. When you use count(*) alongside another column, you are counting the number of rows for each different value of that other column. So in this case you need to group the results, in order to show each value and its count only once.
group by 1 in this case refers to column_a which has the column position 1 in your query.
This why it works on your server. Indeed this is not a good practice in sql.
You should mention the column name because the column order may change in the table so it will be hard to maintain this code.
The best solution is:
select column_a, count(*) from my_schema.my_table group by column_a;

getting unique results without using group by

For some reason, I am not able to use GROUP BY but I can use DISTINCT like this:
SELECT DISTINCT(name), id, email from myTable
However above query lists people with same name also because I am selecting more than one columns whereas I want to select only unique names. Is there someway to get unique names without using GROUP BY ?
Although using GROUP BY is the most direct way, you can do other things if it is prohibited. For example, you can use NOT EXISTS with a subquery, like this:
SELECT name, id, email
FROM myTable t
WHERE NOT EXISTS (SELECT * FROM myTable tt WHERE tt.name=t.name AND tt.id < t.id)
This query uses NOT EXISTS to eliminate rows with the same name and IDs higher than the one selected. Note that since this query must pick a single user per name, it may eliminate some users based on their ID, which is a rather arbitrary criterion.

SQL Having on columns not in SELECT

I have a table with 3 columns:
userid mac_address count
The entries for one user could look like this:
57193 001122334455 42
57193 000C6ED211E6 15
57193 FFFFFFFFFFFF 2
I want to create a view that displays only those MAC's that are considered "commonly used" for this user. For example, I want to filter out the MAC's that are used <10% compared to the most used MAC-address for that user. Furthermore I want 1 row per user. This could easily be achieved with a GROUP BY, HAVING & GROUP_CONCAT:
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
And indeed, the result is as follows:
57193 001122334455,000C6ED211E6 42
However I really don't want the count-column in my view. But if I take it out of the SELECT statement, I get the following error:
#1054 - Unknown column 'count' in 'having clause'
Is there any way I can perform this operation without being forced to have a nasty count-column in my view? I know I can probably do it using inner queries, but I would like to avoid doing that for performance reasons.
Your help is very much appreciated!
As HAVING explicitly refers to the column names in the select list, it is not possible what you want.
However, you can use your select as a subselect to a select that returns only the rows you want to have.
SELECT a.userid, a.macs
FROM
(
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
) as a
UPDATE:
Because of a limitation of MySQL this is not possible, although it works in other DBMS like Oracle.
One solution would be to create a view for the subquery. Another solution seems cleaner:
CREATE VIEW YOUR_VIEW (userid, macs) AS
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
This will declare the view as returning only the columns userid and macs although the underlying SELECT statement returns more columns than those two.
Although I am not sure, whether the non-DBMS MySQL supports this or not...