Remove duplicate rows in MS-Access - sql

I am using Microsoft Access and in it, I have a table with data that is sometimes repeated. I'm not able to create an SQL query that removes duplicate data, leaving only distinct data in the table. Can someone help me?
My current table:
Date | Level | Name
---------+--------+--------
12/25/2021 | 2 | Jack
12/25/2021 | 2 | Jack
12/10/2021 | 3 | Ana
12/01/2021 | 1 | Lenon
12/01/2021 | 1 | Lenon
12/30/2021 | 3 | Ana
Expected result:
Date | Level | Name
---------+--------+--------
12/25/2021 | 2 | Jack
12/10/2021 | 3 | Ana
12/01/2021 | 1 | Lenon
12/30/2021 | 3 | Ana
PS: Ana appears twice in the expected result table because the dates of the two rows referring to Ana are different, so they are not duplicated values.

Just use select distinct:
select distinct t.*
from t;
I would add that tables should not have duplicate rows. Something is wrong with the table generation if you are getting duplicates -- either the query being used or the process for inserting rows into the table.
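Note that SELECT DISTINCT only returns a deduplicated result; the table itself still holds the repeated rows. One common way to make the change permanent is to copy the distinct rows into a new table and swap it in. The sketch below runs that idea against SQLite through Python's sqlite3 module purely so it can be executed; the helper name dedupe_table is made up for illustration, and in Access the copy step would more likely be a make-table query (SELECT DISTINCT ... INTO ...), which is an assumption about your setup rather than something tested here.

```python
import sqlite3


def dedupe_table(conn, table):
    """Replace `table` with a copy holding only its distinct rows.

    Hypothetical helper. CREATE TABLE ... AS does not carry over indexes
    or constraints, so this only suits simple tables like the one shown.
    """
    tmp = f"{table}_distinct"
    conn.execute(f'CREATE TABLE "{tmp}" AS SELECT DISTINCT * FROM "{table}"')
    conn.execute(f'DROP TABLE "{table}"')
    conn.execute(f'ALTER TABLE "{tmp}" RENAME TO "{table}"')
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t ("Date" TEXT, "Level" INTEGER, "Name" TEXT)')
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("12/25/2021", 2, "Jack"),
        ("12/25/2021", 2, "Jack"),
        ("12/10/2021", 3, "Ana"),
        ("12/01/2021", 1, "Lenon"),
        ("12/01/2021", 1, "Lenon"),
        ("12/30/2021", 3, "Ana"),
    ],
)

dedupe_table(conn, "t")
rows = conn.execute('SELECT "Date", "Level", "Name" FROM t').fetchall()
print(sorted(rows))
```

Ana still appears twice afterwards, because her two rows differ in the date column.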

You can do a group by of the Date, Level and Name columns.
Use this query:
SELECT [Date]
     , [Level]
     , [Name]
FROM <TableName>
GROUP BY [Date], [Level], [Name]

Related

How can I make duplicated rows show only one row, the most recently updated one, in PostgreSQL?

I am new to PostgreSQL and I am building a new project where I need to work mostly on the database. I have a table called products and I have filled it with some data. The problem is that I need to find all the duplicated rows in the table. The table looks like this:
id | name | created_date | updated_date |
-----|----------|---------------|---------------|
1 | hat | 01/05/2022 | 01/06/2022 |
2 | jeans | 01/05/2022 | 01/06/2022 |
3 | shoes | 01/05/2022 | 01/06/2022 |
4 | hat | 01/05/2022 | 01/06/2022 |
...
The duplicated rows are 1 and 4 with the name hat.
After finding the duplicated rows, I need to display the row with the most recent date and then update the table so that it no longer contains any duplicates.
Any ideas?
You can use DISTINCT ON (keeps only the first row of each set of rows where the given expressions evaluate to equal):
SELECT DISTINCT ON (name) *
FROM products
ORDER BY name, updated_date DESC
In this query, rows with the same name are sorted by updated_date in descending order, and all rows except the first one (the one with the most recent date) are eliminated.
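The question also asks to update the table so that the duplicates are gone. A minimal sketch of the same "keep the newest row per name" rule, written as a DELETE: it runs on SQLite through Python's sqlite3 module only so that it is executable (DISTINCT ON itself is PostgreSQL-specific). The sample dates are made up and stored as ISO strings so that they compare correctly, and using id as the tie-breaker when two rows share the same updated_date is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, name TEXT, created_date TEXT, updated_date TEXT)"
)
# Illustrative data: row 4 duplicates row 1 by name and was updated later.
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "hat", "2022-01-05", "2022-01-06"),
        (2, "jeans", "2022-01-05", "2022-01-06"),
        (3, "shoes", "2022-01-05", "2022-01-06"),
        (4, "hat", "2022-01-05", "2022-01-07"),
    ],
)

# Delete every row for which a newer row with the same name exists.
# Ties on updated_date are broken by the higher id (an assumption).
conn.execute(
    """
    DELETE FROM products
    WHERE EXISTS (
        SELECT 1
        FROM products AS newer
        WHERE newer.name = products.name
          AND (newer.updated_date > products.updated_date
               OR (newer.updated_date = products.updated_date
                   AND newer.id > products.id))
    )
    """
)
conn.commit()

remaining = conn.execute("SELECT id, name FROM products ORDER BY id").fetchall()
print(remaining)
```

After the DELETE, a unique constraint on name would keep new duplicates from creeping back in, if that matches your data model.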

MS-Access SQL DISTINCT GROUP BY

I am currently trying to SELECT the DISTINCT FirstNames in a GROUP, using Microsoft Access 2010.
The simplified relevant columns of my table look like this:
+----+-------------+-----------+
| ID | GroupNumber | FirstName |
+----+-------------+-----------+
| 1 | 1 | Peter |
| 2 | 1 | Bob |
| 3 | 1 | Peter |
| 4 | 2 | Rosemary |
| 5 | 2 | Jamie |
| 6 | 3 | Peter |
+----+-------------+-----------+
My actual table contains two columns to which I want to apply this process (separately), but I should be able to simply repeat the process for the other column. The GroupNumber column is a simplification: my table actually groups all rows in a ten-day interval together, but I've already solved that problem.
And I would like it to return this:
+-------------+------------+
| GroupNumber | FirstNames |
+-------------+------------+
| 1 | Peter |
| 1 | Bob |
| 2 | Rosemary |
| 2 | Jamie |
| 3 | Peter |
+-------------+------------+
This means that I want all Distinct FirstNames for each Group.
A regular DISTINCT would ignore group boundaries and only mention Peter once. All aggregate functions reduce my output to only one value or don't work on strings at all. Access also doesn't support SELECTing columns that are not aggregates or in the GROUP BY statement.
All other answers I've found either want an aggregate, are not applicable to MS Access or are solved by working around the data in ways not applicable to my case. (Standardized languages are a nice thing, aren't they?)
My current (invalid) query looks like this:
SELECT GroupNumber,
DISTINCT FirstNames -- This is illegal, distinct applies to all
-- columns and doesn't respect groups.
FROM Example AS b
-- Complicated stuff to make the groups
GROUP BY GroupNumber;
This query is a one-time thing and is used to analyze a 58000-row Excel spreadsheet exported from another database (not my fault), so optimizing for runtime is not necessary.
I would like to achieve this purely through SQL and without VBA if at all possible.
This should work:
SELECT DISTINCT GroupNumber, FirstName
FROM Example AS b
Another solution for this problem is to group by the columns GroupNumber and FirstName at the same time. The query is presented below:
SELECT GroupNumber, FirstName
FROM Example
GROUP BY GroupNumber, FirstName
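To see that both suggestions return the same rows, here is a small check against the sample table from the question. It uses SQLite through Python's sqlite3 module as a stand-in for Access, which is an assumption made only so the example can be run; the ORDER BY is added just to make the comparison deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Example (ID INTEGER PRIMARY KEY, GroupNumber INTEGER, FirstName TEXT)"
)
conn.executemany(
    "INSERT INTO Example VALUES (?, ?, ?)",
    [
        (1, 1, "Peter"),
        (2, 1, "Bob"),
        (3, 1, "Peter"),
        (4, 2, "Rosemary"),
        (5, 2, "Jamie"),
        (6, 3, "Peter"),
    ],
)

# DISTINCT applies to the whole (GroupNumber, FirstName) pair,
# so Peter survives once per group instead of once overall.
rows_distinct = conn.execute(
    "SELECT DISTINCT GroupNumber, FirstName FROM Example "
    "ORDER BY GroupNumber, FirstName"
).fetchall()

# Grouping by both columns collapses the same pairs.
rows_grouped = conn.execute(
    "SELECT GroupNumber, FirstName FROM Example "
    "GROUP BY GroupNumber, FirstName "
    "ORDER BY GroupNumber, FirstName"
).fetchall()

print(rows_distinct)
print(rows_distinct == rows_grouped)
```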

Remove newest redundant row and update timestamp

I'm working with a SQLite database that receives large data dumps on a regular basis from several sources. Unfortunately, those sources aren't intelligent about what they dump, and I end up with a lot of repeated records from one time to the next. I'm looking for a way to remove these repeated records without affecting the records that have legitimately changed from the past dump to this one.
Here's the general structure of the data (_id is the primary key):
| _id | _dateUpdated | _dateEffective | _dateExpired | name | status | location |
|-----|--------------|----------------|--------------|------|--------|----------|
| 1 | 2016-05-01 | 2016-05-01 | NULL | Fred | Online | USA |
| 2 | 2016-05-01 | 2016-05-01 | NULL | Jim | Online | USA |
| 3 | 2016-05-08 | 2016-05-08 | NULL | Fred | Offline| USA |
| 4 | 2016-05-08 | 2016-05-08 | NULL | Jim | Online | USA |
| 5 | 2016-05-15 | 2016-05-15 | NULL | Fred | Offline| USA |
| 6 | 2016-05-15 | 2016-05-15 | NULL | Jim | Online | USA |
I'd like to be able to reduce this data to something like this:
| _id | _dateUpdated | _dateEffective | _dateExpired | name | status | location |
|-----|--------------|----------------|--------------|------|--------|----------|
| 1 | 2016-05-01 | 2016-05-01 | 2016-05-07 | Fred | Online | USA |
| 2 | 2016-05-15 | 2016-05-01 | NULL | Jim | Online | USA |
| 3 | 2016-05-15 | 2016-05-08 | NULL | Fred | Offline| USA |
The idea here is that rows 4, 5, and 6 exactly duplicate rows 2 and 3 except for the timestamps (I'd need to compare by all three fields - name, status, location). However, row 3 does not duplicate row 1 (status changed from Online to Offline), so the _dateExpired field is set in row 1, and row 3 becomes the most recent record.
I'm querying this table with something like this:
SELECT * FROM Data WHERE
    date(_dateEffective) <= date('now')
    AND (_dateExpired IS NULL OR date(_dateExpired) > date('now'))
Is this sort of reduction possible in SQLite?
I am still a beginner to SQL and database design in general, so it's possible that I haven't structured the database in the best way. I'm open to suggestions there as well...I'm going for the ability to query data at a given point in time - for example, "what was Jim's status around 2016-05-06?"
Thanks in advance!
Consider using a staging table where the dump file goes into a DumpTable (regularly cleaned out before each dump) and then an INSERT...SELECT query migrates to your final table.
The SELECT portion uses a correlated subquery (to calculate the new [_dateExpired] for the rows that need one) and a derived-table subquery (to keep only the first row of each name/status combination, which filters out the repeats). Finally, the LEFT JOIN ... IS NULL against FinalTable ensures that no duplicate records are appended, assuming [_id] is a unique identifier. Below is the routine:
1. Clean out DumpTable:
DELETE FROM DumpTable;
2. Run the dump routine so that it appends into DumpTable.
3. Append records to FinalTable:
INSERT INTO FinalTable ([_id], [_dateUpdated], [_dateEffective], [_dateExpired],
                        [name], status, location)
SELECT d.[_id], d.[_dateUpdated], d.[_dateEffective],
       (SELECT Min(date(sub.[_dateEffective], '-1 day'))
        FROM DumpTable sub
        WHERE sub.[name] = d.[name]
          AND sub.[_dateEffective] > d.[_dateEffective]
          AND sub.status <> d.status) AS calcExpired,
       d.name, d.status, d.location
FROM DumpTable d
INNER JOIN
     (SELECT Min(DumpTable.[_id]) AS min_id,
             DumpTable.name, DumpTable.status
      FROM DumpTable
      GROUP BY DumpTable.name, DumpTable.status) AS c
   ON (c.name = d.name)
  AND (c.min_id = d.[_id])
  AND (c.status = d.status)
LEFT JOIN FinalTable f
   ON d.[_id] = f.[_id]
WHERE f.[_id] IS NULL;
-- INSERTED RECORDS:
-- _id _dateUpdated _dateEffective _dateExpired name status location
-- 1 2016-05-01 2016-05-01 2016-05-07 Fred Online USA
-- 2 2016-05-01 2016-05-01 Jim Online USA
-- 3 2016-05-08 2016-05-08 Fred Offline USA
Is this sort of reduction possible in SQLite?
The answer to any "reduction" question in SQL is always Yes. The trick is to find what axes you're reducing along.
Here's a partial solution to illustrate; it gives the first Online date for each name & location.
select min(_dateEffective) as start_date
, name
, location
from Data
where status = 'Online'
group by
name
, location
With an outer join back to the table (on name & location) where the status is 'Offline' and the _dateEffective is greater than start_date, you get your _dateExpired.
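A sketch of that outer join, run through Python's sqlite3 module against the six sample rows from the question. The aliases s and o and the output column name expired are made up for illustration, and setting the expiry to the day before the first later 'Offline' row is an assumption taken from the desired output in the question (2016-05-07 for Fred).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Data (_id INTEGER PRIMARY KEY, _dateUpdated TEXT, "
    "_dateEffective TEXT, _dateExpired TEXT, name TEXT, status TEXT, location TEXT)"
)
conn.executemany(
    "INSERT INTO Data VALUES (?, ?, ?, NULL, ?, ?, ?)",
    [
        (1, "2016-05-01", "2016-05-01", "Fred", "Online", "USA"),
        (2, "2016-05-01", "2016-05-01", "Jim", "Online", "USA"),
        (3, "2016-05-08", "2016-05-08", "Fred", "Offline", "USA"),
        (4, "2016-05-08", "2016-05-08", "Jim", "Online", "USA"),
        (5, "2016-05-15", "2016-05-15", "Fred", "Offline", "USA"),
        (6, "2016-05-15", "2016-05-15", "Jim", "Online", "USA"),
    ],
)

# s: first Online date per name and location (the partial solution above).
# o: later Offline rows for the same name and location, outer-joined so
#    that names which never went Offline keep a NULL expiry.
result = conn.execute(
    """
    SELECT s.name, s.location, s.start_date,
           date(MIN(o._dateEffective), '-1 day') AS expired
    FROM (SELECT name, location, MIN(_dateEffective) AS start_date
          FROM Data
          WHERE status = 'Online'
          GROUP BY name, location) AS s
    LEFT JOIN Data AS o
           ON o.name = s.name
          AND o.location = s.location
          AND o.status = 'Offline'
          AND o._dateEffective > s.start_date
    GROUP BY s.name, s.location, s.start_date
    ORDER BY s.name
    """
).fetchall()

for row in result:
    print(row)
```

Fred's Online period ends on 2016-05-07, while Jim never goes Offline, so his expiry stays NULL (None in Python).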
_id is the primary key
There is a commonly held misunderstanding that every table needs some kind of sequential "ID" number as a primary key. The key you really care about is known as a natural key: one or more columns in the data that uniquely identify a row. In your case, it looks to me like that's _dateEffective, name, status, and location. At the very least, declare them unique to prevent accidental duplication.

SQL deleting rows with duplicate dates conditional upon values in two columns

I have data on approx 1000 individuals, where each individual can have multiple rows, with multiple dates and where the columns indicate the program admitted to and a code number.
I need each row to contain a distinct date, so I need to delete the rows with duplicate dates from my table. Where there are multiple rows with the same date, I need to keep the row that has the lowest code number. If more than one row has both the same date and the same lowest code, then I need to keep the row for program (prog) B. For example:
| ID | DATE | CODE | PROG|
--------------------------------
| 1 | 1996-08-16 | 24 | A |
| 1 | 1997-06-02 | 123 | A |
| 1 | 1997-06-02 | 123 | B |
| 1 | 1997-06-02 | 211 | B |
| 1 | 1997-08-19 | 67 | A |
| 1 | 1997-08-19 | 23 | A |
So my desired output would look like this;
| ID | DATE | CODE | PROG|
--------------------------------
| 1 | 1996-08-16 | 24 | A |
| 1 | 1997-06-02 | 123 | B |
| 1 | 1997-08-19 | 23 | A |
I'm struggling to come up with a solution to this, so any help greatly appreciated!
Microsoft SQL Server 2012 (X64)
The following works with your test data
SELECT ID, date, MIN(code), MAX(prog) FROM table
GROUP BY ID, date
You can then use the results of this query to create a new table or populate a new table. Or to delete all records not returned by this query.
SQLFiddle http://sqlfiddle.com/#!9/0ebb5/5
You can use the min() function:
select ID, DATE, min(CODE), max(PROG)
from table
group by ID, DATE
I assume that your table has a valid primary key. However, I would recommend using ID as the primary key. I hope this helps.
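One caveat with both queries above: MIN(code) and MAX(prog) are computed independently, so on other data they can pair a code with a prog that never appeared together in any row (for example, rows 123/A and 211/B on the same date would yield 123/B). A sketch of an alternative that deletes every row for which a "better" row exists on the same ID and date follows. It runs on SQLite through Python's sqlite3 module purely so it can be executed; the table name admissions is made up, the comparison prog > prog assumes that B is the only program that should win over A, and rows that are exact duplicates of each other would all be kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (id INTEGER, date TEXT, code INTEGER, prog TEXT)")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?, ?)",
    [
        (1, "1996-08-16", 24, "A"),
        (1, "1997-06-02", 123, "A"),
        (1, "1997-06-02", 123, "B"),
        (1, "1997-06-02", 211, "B"),
        (1, "1997-08-19", 67, "A"),
        (1, "1997-08-19", 23, "A"),
    ],
)

# A row is deleted when another row with the same id and date has a lower
# code, or the same code and a "later" prog (B sorts after A).
conn.execute(
    """
    DELETE FROM admissions
    WHERE EXISTS (
        SELECT 1
        FROM admissions AS better
        WHERE better.id = admissions.id
          AND better.date = admissions.date
          AND (better.code < admissions.code
               OR (better.code = admissions.code
                   AND better.prog > admissions.prog))
    )
    """
)
conn.commit()

kept = conn.execute(
    "SELECT id, date, code, prog FROM admissions ORDER BY date"
).fetchall()
for row in kept:
    print(row)
```

On SQL Server the same rule is often written with ROW_NUMBER() OVER (PARTITION BY ID, DATE ORDER BY CODE, PROG DESC) inside a CTE, deleting the rows numbered above 1; that variant is not tested here.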

Calculating Number of Columns that have no Null value

I want to make a table like following
| ID | Sibling1 | Sibling2 | Sibling 3 | Total_Siblings |
______________________________________________________________
| 1 | Tom | Lisa | Null | 2 |
______________________________________________________________
| 2 | Bart | Jason | Nelson | 3 |
______________________________________________________________
| 3 | George | Null | Null | 1 |
______________________________________________________________
| 4 | Null | Null | Null | 0 |
For Sibling1, Sibling2, Sibling3: they are all nvarchar(50) (can't change this as the requirement).
My question is: how can I calculate the value for Total_Siblings so that it displays the number of siblings as above, using SQL? I attempted to use (Sibling1 + Sibling2), but it does not give the result I want.
Cheers
A query like this would do the trick.
SELECT ID,Sibling1,Sibling2,Sibling3
,COUNT(Sibling1)+Count(Sibling2)+Count(Sibling3) AS Total
FROM MyTable
GROUP BY ID
A little explanation is probably required here. COUNT with a field name counts the number of non-null values. Since you are grouping by ID, each COUNT will only ever return 0 or 1. Now, if you're using anything other than MySQL, you'll have to replace
GROUP BY ID
with
GROUP BY ID,Sibling1,Sibling2,Sibling3
because most other databases require that every selected column that is not inside an aggregate function appears in the GROUP BY clause.
Also, as an aside, you may want to consider changing your database schema to store the siblings in another table, so that each person can have any number of siblings.
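If you would rather avoid GROUP BY entirely, the same "count the non-null columns" idea can be written per row with CASE expressions. A minimal sketch, run on SQLite through Python's sqlite3 module so that it is executable; it assumes the Null entries in the question's table are real SQL NULLs and not the string 'Null'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable ("
    "ID INTEGER PRIMARY KEY, Sibling1 TEXT, Sibling2 TEXT, Sibling3 TEXT)"
)
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?, ?)",
    [
        (1, "Tom", "Lisa", None),
        (2, "Bart", "Jason", "Nelson"),
        (3, "George", None, None),
        (4, None, None, None),
    ],
)

# Each CASE contributes 1 when the column holds a value and 0 when it is NULL.
totals = conn.execute(
    """
    SELECT ID,
           CASE WHEN Sibling1 IS NOT NULL THEN 1 ELSE 0 END
         + CASE WHEN Sibling2 IS NOT NULL THEN 1 ELSE 0 END
         + CASE WHEN Sibling3 IS NOT NULL THEN 1 ELSE 0 END AS Total_Siblings
    FROM MyTable
    ORDER BY ID
    """
).fetchall()
print(totals)
```

Because nothing is aggregated, this form does not depend on how a given database treats non-aggregated columns in GROUP BY.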
You can do this by adding up individual counts:
select id,sibling1,sibling2,sibling3
,count(sibling1)+count(sibling2)+count(sibling3) as total_siblings
from table
group by 1,2,3,4;
However, your table structure makes this scale crappily (what if an id can belong to, say, 50 siblings?). If you store your data into a table with columns of id and sibling, then this query would be as simple as:
select id,count(sibling)
from table
group by id;
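A sketch of that two-table layout, again on SQLite through Python's sqlite3 module. The table names person and sibling are made up for illustration. The LEFT JOIN from person is an addition to the query above: without it, an id with no sibling rows at all (id 4 in the question) would simply be missing from the result instead of showing 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE sibling (id INTEGER REFERENCES person(id), sibling TEXT)")
conn.executemany("INSERT INTO person VALUES (?)", [(1,), (2,), (3,), (4,)])
conn.executemany(
    "INSERT INTO sibling VALUES (?, ?)",
    [
        (1, "Tom"),
        (1, "Lisa"),
        (2, "Bart"),
        (2, "Jason"),
        (2, "Nelson"),
        (3, "George"),
    ],
)

# COUNT(s.sibling) skips the NULL produced by the outer join,
# so a person without sibling rows gets a count of 0.
counts = conn.execute(
    """
    SELECT p.id, COUNT(s.sibling) AS total_siblings
    FROM person AS p
    LEFT JOIN sibling AS s ON s.id = p.id
    GROUP BY p.id
    ORDER BY p.id
    """
).fetchall()
print(counts)
```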