How to remove duplicate rows and keep one in an Access database? - sql

I need to remove duplicate rows in my Access database, does anyone have generic query to do this? As I have this problem with multiple tables

There are two things you need to do,
Determine what the criteria are for a unique record - what is the list of columns where two, or more, records would be considered duplicates, e.g. JobbID and HisGuid
Decide what you want to do with the duplicate records - do you want to hard delete them, or set the IsDeleted flag that you have on the table
Once you've determined the criteria that you want to use for uniqueness you then need to pick 1 record from each group of duplicates to retain. A query along the lines of:
SELECT MAX(ID)
FROM MyTable
GROUP
BY JobbID, HisGuid
Will give you (and I've assumed that the ID column is an auto-increment/identity column that is unique across all records in the table) the highest value for each group of records where JobbID and HisGuid are both the same. You could use MIN(ID) if you want, it's up to you - you just need to pick ONE record from each group to keep.
Assuming that you want to set the IsDeleted flag on the records you don't want to keep, you can then incorporate this into an update query:
UPDATE MyTable
SET IsDeleted = 1
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyTable
GROUP
BY JobbID, HisGuid
)
This takes the result of the query that retrieves the highest IDs and uses it to say set IsDeleted to 1 for all the records where the ID isn't the highest ID for each group of records where JobbID and HisGuid are the same.
The only part I can't help you with is running these queries in Access as I don't have it installed on the PC I'm using right now and my memory is a bit rusty regarding how/where to run arbitrary queries.

Related

Is there way to add a field in a parent query that will increment as the query goes through all values generated in a subquery?

I think I have a table that lacks a true primary key and I need to make one in the output. I cannot modify the table.
I need to run a select query to generate a list of values (list_A), then take those values and query them to show all the records related to them. From those records, I do another select to extract a now visible list called list_B. From list_B, I can search them to reveal all the records related to the original list (list_A), with many of those records missing the values from list_A but still need to be counted.
Here's my process so far:
I declared a sequence called 'temp_key', which starts from 1 and increments by 1.
I add a field called 'temp_key' to the parent query, so that it will hopefully show which element of the original list_A sub-query the resulting records are related to.
I run into trouble because I don't know how to make the temp_key increment as the list_A sub-query moves from the beginning to end of all the values in the list.
SELECT currval(temp_key) AS temp_key, list_A, list_B
FROM table
WHERE list_B IN (SELECT DISTINCT list_B
FROM table
WHERE list_A IN (SELECT DISTINCT list_A
from table);
As it is now, the above query doesn't work because there seems to be no way to make the current value of temp_key increment upward as it goes through values from the list originally generated from the lowest level sub-query (list_A).
For example, there might be only 10 values in list_A. And the output could have 100s of records, all labeled 1 through 10, with many of those values missing values in the list_A field. But they still need to be labeled 1 through 10 because the values of list_B connect the two sets.
Maybe you can create a new primary key column first with the following code (concatenating row number with list_a):
WITH T AS (
SELECT currval(temp_key) AS temp_key, list_A, list_B,
CONCAT(ROW_NUMBER() OVER(PARTITION BY list_A ORDER BY list_B),list_A) AS Prim_Key
FROM table )
SELECT * fROM T
Then you can specify in the where clause what keys you want to select

Does SQL server retrieve the same rows when you use top statement

My SQL table consists of three columns - Event (type xml), InsertedTime (type datetime) and status (type nvarchar - possible values processed and unprocessed). None of them are unique identifiers and all of these are mandatory.
As part of a select query, I retrieve the top 1000 rows of the table (based on the unprocessed status), use the XML to retrieve some values, and would like to update the status of these exact 1000 rows to processed status.
My question is: I'm using the SELECT TOP 1000 FROM table WHERE status ='Unprocessed' ORDER BY InsertedTime to retrieve and UPDATE TOP 1000 table WHERE status = 'Processed' ORDER BY InsertedTime statements to achieve this.
I understand that in Oracle, I can use the rowid pseudocolumn to ensure that I'm updating the same rows that were retrieved in the first place. But how do I achieve this without having any unique identifier or primary key in the table in SQL?
Note: The table is being written to continuously.
You're selecting rows to be handled and then trying to update the status of those rows? Instead of doing select + update you could use output clause, with something like this:
UPDATE TOP (1000) table
set status = 'Processed'
output deleted.Event, deleted.InsertedTime, deleted.status
where status = 'Unprocessed'
This will both update the rows + return Event, InsertedTime and status fields (old values). If you need the new values, you can use the virtual table inserted.
Assuming that processes may try to insert new rows while your two queries are processing, you have a few options:
Wrap the two queries in a transaction. This should guarantee atomicity between them, at the cost of extra locking on the table.
Find the oldest InsertedTime value from the first query and use that with the WHERE clause in the 2nd query.
Combine the UPDATE and SELECT into a single statement via an OUTPUT clause.

DELETE rows in a SELECT statement based on condition

I have a very simple table where I keep player bans. There are only two columns (not actually, but for simplicity's sake) - a player's unique ID (uid) and the ban expire date (expiredate)
I don't want to keep expired bans in the table, so all bans where expiredate < currentdate need to be deleted.
To see if a player is banned, I query this table with his uid and the current date to see if there are any entries. If there are - we determine that the player is banned.
So I need to run two queries. One that would fetch me the bans, and another that would clean up the table of redundant bans.
I was wondering if it would be possible to combine these queries into one. Select and return the entry if it is still relevant, and remove the entry and return nothing if it is not.
Are there any nice ways to do this in a single select query?
Edit:
To clarify, I actually have some other information in the table, such as the ban reason, the ban date, etc. so I need to return the row as well as delete irrelevant entries.
Unfortunately you cannot have a delete statement inside a select statement... I got you a link from sqlite docos for your reference.
Select statements are only used for retrieving data. Although a select could be used in a sub-query for a delete statement, it would still be used for retrieving data.
You must execute the two statements in separated in your case.
delete from player_bans p
where exists (
select 1 from player_bans
where p.uid = uid
and expire_date < current_date)
Is this what you are trying to do?
DELETE FROM TableName
WHERE expiredate < CURDATE();

How to renumber a table column

I have a SQLite table sorted by column ID. But I need to sort it by another numerical field called RunTime.
CREATE TABLE Pass_2 AS
SELECT RunTime, PosLevel, PosX, PosY, Speed, ID
FROM Pass_1
The table Pass_2 looks good, but I need to renumber the ID column from 1 .. n without resorting the records.
It is a principle of SQL databases that the underlying tables have no natural or guaranteed order to their records. You must specify the order in which you want to see the records when SELECTing from a table using an ORDER BY clause.
You can obtain the records you want using SELECT * FROM your_table ORDER BY RunTime, and that is the correct and reliable way to do this in any SQL database.
If you want to attempt to get the records in Pass_2 to "be" in RunTime order, you can add the ORDER BY clause to the SELECT you use to create the table but remember: you are not guaranteed to get the records back in the order in which they were added to the table.
When might you get the records back in a different order? This is most likely to happen when your query can be answered using columns in a covering index -- in that case the records are more likely to be returned in index order than any "natural" order (but again, no guarantees with an ORDER BY clause).
If you want a new ID column starting at 1, then use the ROW_NUMBER() function. Instead of ID in your query use this ROW_NUMBER() OVER(ORDER BY Runtime) AS ID.... This will replace the old ID column with a freshly calculated column

SQL update statement to populate numerical series for each distinct subset of table records

I need the SQL update statement to assign consecutive sequence numbers to subsets of records in a table. I'm using MS access.
Let's say the current table has records like:
notebook,blue
notebook.Yellow
pencil,yellow
chair,blue
desk,green
desk,blue
I would like to add another field to the table and populate it as follows:
notebook,blue,1
notebook.Yellow,1
pencil,yellow,2
chair,blue,2
desk,green,1
desk,blue,3
you see that I have given a consecutive number assignment based on a certain set of criteria. In this example, the criteria was a distinct value in the second field (in real life, the criteria will be a distinct combination of values from several fields, but all the relevant fields are within the same table... no join is needed to get the criteria). since there are three records with blue in field 2, these are numbered 1,2,3. And since there are two records with yellow, they are numbered 1,2.
So I can't derive the numbering from the row number, since I have several numbering series in the same table all starting with 1.
Also, I need it to be a query where I don't have to explicitly specify the value in the second field. I just want each unique value in the second field to get its own numbering series. that is, I don't want to have to explicitly write one query to generate the numbers for "blue", and write a separate query to generate the numbers for "yellow"
The maximum number of records in the series is under 1000. So I don't mind if I would need to create and auxiliary table with 1000 records, with a field containing the values 1 to 1000. Then the update statement to the primary table could pull in the next value from the auxiliary table.
But I don't know the SQL syntax to use for this update statement, or for the update statement for any other approach. So I need your advice.
I'm not sure how to do this with a single SQL statement, but here are 2 SQL statements that could be used to handle each case:
insert into table ('desk', 'blue', 1)
where not exists (select field3 from table where field1 = 'desk' and field2 = 'blue');
insert into table (field1, field2, field3)
select field1, field2, count(1) + 1
from table
where field1 = 'desk'
and field2 = 'blue'
group by field1, field2;
Create Table #TableAutoIncrement (ID int identity(1000 , 1) , item varchar(20), COLOR varchar(20) )
Insert INTO #TableAutoIncrement
(item, COLOR )
SELECT item, COLOR FROM YOURTABLE
--- GETTING all the values from the temporary table
SELECT * FROM #TableAutoIncrement
A colleague of mine worked out the necessary SQL. Here's the generalized solution (note that I really needed to number the multiple series in my data set based on a combination of two fields. In my simplified example in the original post, I was using only one field--color--but since I really need two fields, that's what I show in this solution.
SELECT *,
(SELECT COUNT(T1.ID)
FROM
[TableName] AS T1
WHERE T1.ID >= T2.ID and t1.[NameCriteriaField1] = t2.[NameCriteriaField1]
and t1.[NameCriteriaField2]= t2.[NameCriteriaField2])
AS Sequence into OutputResultsTableName
FROM
[TableName] AS T2
ORDER BY [NameCriteriaField2] , [NameCriteriaField1]
The source table is set up with "ID" as field with an integer value. Every record has a unique value of ID, but it does not matter if there are gaps in the ID or how the records are sequenced against the ID. (e.g., the typical MS access auto numbered primary key field serves this purpose)
This query is set up to assume that there are two fields in your data set that you want to use to group your records and assign a numerical series count to each record within each group. (Thus your table may contain multiple groups, and each group has its own numbering series starting with 1. But the way the query is formulated, there are exactly 2 criteria that define the group.) You cannot use any where clauses to further filter the records that get counted. Through experimentation, I found that adding where clauses gives unreliable results where records can get omitted. So if you need the results to be filtered so that some records are not to be included in the numerical series for a particular group, then do one of the following before running my query:
run a query to delete the undesired records from the source table
first copy all records from the source table into a new table and delete the records from the new table that should not be numbered, and run my query on the new table
deleting extraneous records before running this query is needed only if those records qualify as members of a group defined by criteria 1 and criteria 2. If there are extraneous records that don't match those two criteria, you can leave them in the table, because they will not impact the numbering of the records within the groups that you care about. They will just get their own independent numbering, which you can just ignore.**
The numbering of each group starts at 1, and the query dynamically defines the groups based on the distinct combinations of criteria1 and criteria2. However, if you have records that do not belong to any group, these records will all be numbered with 0. (Criteria1 and criteria2--at least to the extent of my testing--are non-null values. (In theory--at least on Microsoft Access, an empty string is different than Null, but I did not test this with empty strings either.) If you have records that have null in the criteria1 or criteria2 fields, MS Access consider these records as not belonging to any group and thus numbers them with 0. That is, these distinct groups need to define by non-null values for criteria1 and criteria2, and thus this is different than the way SQL DISTINCT statement works.
If you need to have NULL as a valid criteria for defining the group (and thus to have groups defined by NULL numbered), it's very simple. Prior to running my query, first run an update statement that changes all instances of null values in criteria1 or criteria2 to the phrase "placeholder for null field". Then run my query. On the result set (after the numbering has been assigned to the groups), run another update to change all occurrences of the placeholder phrase back to null.
Adjustment to syntax if your group is defined by only one field criteria
SELECT *,
(SELECT COUNT(T1.ID)
FROM
[TableName] AS T1
WHERE T1.ID >= T2.ID and t1.[NameCriteriaField1] = t2.[NameCriteriaField1] )
AS Sequence into OutputResultsTableName
FROM
[TableName] AS T2
ORDER BY [NameCriteriaField2] , [NameCriteriaField1]
Adjustment to syntax if your group is defined by combination of 3 field criteria
SELECT *,
(SELECT COUNT(T1.ID)
FROM
[TableName] AS T1
WHERE T1.ID >= T2.ID and t1.[NameCriteriaField1] = t2.[NameCriteriaField1]
and t1.[NameCriteriaField2]= t2.[NameCriteriaField2]
and t1.[NameCriteriaField3]= t2.[NameCriteriaField3])
AS Sequence into OutputResultsTableName
FROM
[TableName] AS T2
ORDER BY [NameCriteriaField2] , [NameCriteriaField1]