SQL - delete rows where only one column changes

I have a large table in SQL, in which an effective_from date column should update every time one of the other columns changes. However, for some reason, there are numerous rows in which the effective_from date changes, but no other values have changed. For example:
CODE NAME EFFECTIVE_FROM
CCWA Oak 1999
CCWA Willow 2001
CCWA Willow 2004
How can I delete the rows where the change in effective_from adds no information, e.g. the third row in the table above?
The tables are very large, so I would prefer to use SELECT statements rather than DELETE or ALTER, which seem to be slow.
Any help much appreciated!

I believe you are looking for:
SELECT Code, Name, MAX(EFFECTIVE_FROM)
FROM myTable
GROUP BY Code, Name

Since it is the later date that adds no information, you want to select the minimum date value.
SELECT Code, Name, MIN(EFFECTIVE_FROM)
FROM CodeTable
GROUP BY Code, Name
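As a quick check, the MIN() version reproduces the desired result on the question's sample data; here is a sketch using Python's sqlite3 (table name CodeTable follows the answer, everything else is the example data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CodeTable (Code TEXT, Name TEXT, EFFECTIVE_FROM INTEGER)")
conn.executemany(
    "INSERT INTO CodeTable VALUES (?, ?, ?)",
    [("CCWA", "Oak", 1999), ("CCWA", "Willow", 2001), ("CCWA", "Willow", 2004)],
)

rows = conn.execute(
    """SELECT Code, Name, MIN(EFFECTIVE_FROM)
       FROM CodeTable
       GROUP BY Code, Name
       ORDER BY 3"""
).fetchall()
# The redundant Willow/2004 row collapses into Willow/2001.
```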

Try this:
SELECT code, name, max(EFFECTIVE_FROM)
FROM tablename
GROUP BY code, name

You want to use lag(). The result set without duplicates:
select t.*
from (select t.*,
             lag(code) over (order by effective_from) as prev_code,
             lag(name) over (order by effective_from) as prev_name
      from t
     ) t
where prev_code is null
   or prev_code <> code
   or prev_name <> name;
A row is redundant only when both code and name match the previous row, so keep it when either value changes (or when there is no previous row). This assumes that code and name are never NULL; handling NULLs is easy to incorporate into the logic, but it makes the where clause a bit more complicated.
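The lag() approach can be tried in any database with window functions; here is a sketch using Python's sqlite3 (SQLite 3.25+), where the table name t and the sample rows are assumed from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (code TEXT, name TEXT, effective_from INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("CCWA", "Oak", 1999), ("CCWA", "Willow", 2001), ("CCWA", "Willow", 2004)],
)

# Keep a row when it is the first one, or when code or name differs
# from the immediately preceding row (ordered by effective_from).
rows = conn.execute(
    """SELECT code, name, effective_from
       FROM (SELECT t.*,
                    lag(code) OVER (ORDER BY effective_from) AS prev_code,
                    lag(name) OVER (ORDER BY effective_from) AS prev_name
             FROM t)
       WHERE prev_code IS NULL OR prev_code <> code OR prev_name <> name
       ORDER BY effective_from"""
).fetchall()
```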

Your question doesn't clarify the result you actually want: if you want to permanently delete rows from the table you need a DELETE, whereas if your goal is simply to filter out the duplicates you described, a SELECT will do (and the rows will remain in the table).
The fact that you are considering a DELETE makes me suppose these "duplicates" (identical except for the date) are not desirable.
In that case you could also add a trigger that prevents the insertion when the informative fields (everything except EFFECTIVE_FROM) are unchanged, so that only interesting data changes generate a new row.
Then you can run a one-shot operation that deletes all the duplicated rows that don't reflect any data change (ideally overnight, or whenever the system has a low load or no one is using it, since the table is very large as you said).
Note that this solution changes the nature of the table: you lose the history of updates that carried no real data change. Consider it only if that information isn't needed for your purposes.
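The trigger idea can be sketched in SQLite via Python; the trigger name, table name, and sample rows here are illustrative, and trigger syntax varies considerably between RDBMSs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (code TEXT, name TEXT, effective_from INTEGER)")
# Skip an INSERT whose informative field (name) matches the latest row
# for the same code; only the date would have changed.
conn.execute(
    """CREATE TRIGGER skip_nochange BEFORE INSERT ON t
       WHEN (SELECT name FROM t WHERE code = NEW.code
             ORDER BY effective_from DESC LIMIT 1) = NEW.name
       BEGIN
           SELECT RAISE(IGNORE);  -- silently drop the redundant row
       END"""
)

for row in [("CCWA", "Oak", 1999), ("CCWA", "Willow", 2001), ("CCWA", "Willow", 2004)]:
    conn.execute("INSERT INTO t VALUES (?, ?, ?)", row)

remaining = conn.execute("SELECT count(*) FROM t").fetchone()[0]
# The 2004 row changed nothing but the date, so the trigger skipped it.
```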

Related

Filling in Null rows of a dataset with LAG, OVER and undetermined amount of offset?

Hi everyone, first time I've posted here. I looked for a sticky thread with "do this before you post for the first time" info, but I may have missed it. So, here's the question:
I'm working on building out a dataset for an analysis, and I'm trying to fill in some null rows. I don't know if it's the best way, but I think I need to LAG OVER PARTITION BY this dataset. Here's an example of the table:
My goal would be to have all of the null values in the BidEnd field filled with the most recent cell above it. So, rows 1-4 would all be filled with 2020-01-03. The end goal is to be able to label all the rows as valid or not. If the bid start occurred after the bid end, then it would not be valid. The dataset will need to do this with all customers and then with all bid_ids grouped under that customer.
I'd much prefer to use the real code and an actual example, but I'm not allowed to share that information, so I've tried to recreate the scenario as best as possible. Sorry if it's confusing.
In standard SQL, you would use lag(ignore nulls):
select t.*,
lag(bidend ignore nulls) over (partition by customer2 order by row)
from t;
Although standard SQL, not all databases support the ignore nulls option on lag(). That is why tagging your database is important.
Actually, it looks like you have one value per customer2/bid_id pair. If that is true, you can use max():
select t.*,
max(bidend) over (partition by customer2, bid_id)
from t;
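SQLite has window functions but not the IGNORE NULLS option on lag(), so the max() variant is the one you can try locally; a sketch via Python's sqlite3, with invented sample data (column names follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t ("row" INTEGER, customer2 TEXT, bid_id INTEGER, bidend TEXT)')
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, "A", 10, None), (2, "A", 10, None), (3, "A", 10, "2020-01-03"),
     (4, "A", 11, "2020-02-01"), (5, "B", 10, None), (6, "B", 10, "2020-03-05")],
)

# max() as a window aggregate ignores NULLs, so every row in a
# customer2/bid_id group picks up the group's single BidEnd value.
rows = conn.execute(
    '''SELECT "row", max(bidend) OVER (PARTITION BY customer2, bid_id) AS bidend_filled
       FROM t ORDER BY "row"'''
).fetchall()
```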

Query to compare 3 different data columns with a reference value

I need to check whether a table (in an Oracle DB) contains entries that were updated after a certain date. "Updated" in this case means any of 3 columns (DateCreated, DateModified, DateDeleted) have a value greater than the reference.
The query I have come up with so far is this:
select * from myTable
where DateCreated > :reference_date
or DateModified > :reference_date
or DateDeleted > :reference_date
;
This works and gives the desired results, but it is not what I want, because I would like to enter the value for :reference_date only once.
Any ideas on how I could write a more elegant query?
While what you have looks fine and only uses one bind variable, if for some reason you have positional rather than named binds, you could avoid supplying the bind value multiple times by using an inline view or a CTE:
with cte as (select :reference_date as reference_date from dual)
select myTable.*
from cte
join myTable
on myTable.DateCreated > cte.reference_date
or myTable.DateModified > cte.reference_date
or myTable.DateDeleted > cte.reference_date
;
But again I wouldn't consider that better than your original unless you have a really compelling reason and a problem supplying the bind value. Having to set it three times from a calling program probably wouldn't count as compelling, for example, for me anyway. And I'd check it didn't affect performance before deploying - I'd expect Oracle to optimise something like this but the execution plan might be interesting.
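The bind-once CTE can be tried outside Oracle too (no dual needed elsewhere); a sketch in SQLite via Python, with made-up data and ISO date strings compared lexically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (DateCreated TEXT, DateModified TEXT, DateDeleted TEXT)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [("2021-01-01", "2021-01-05", None),
     ("2020-06-01", "2020-06-02", "2020-06-03")],
)

# Bind the reference date once; the CTE fans it out to the three comparisons.
rows = conn.execute(
    """WITH cte AS (SELECT ? AS reference_date)
       SELECT myTable.*
       FROM cte
       JOIN myTable
         ON myTable.DateCreated  > cte.reference_date
         OR myTable.DateModified > cte.reference_date
         OR myTable.DateDeleted  > cte.reference_date""",
    ("2021-01-02",),
).fetchall()
# Only the first row has any of its three dates after the reference.
```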
I suppose you could rewrite that as:
select * from myTable
where greatest(DateCreated, DateModified, DateDeleted) > :reference_date;
if you absolutely had to, but I wouldn't. Your original query is, IMHO, much easier to understand than this one, plus by using a function, you've lost any chance of using an index, should one exist (unless you have a function based index based on the new clause).
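For experimenting locally, SQLite's multi-argument max() plays the role of Oracle's greatest(); note that both return NULL if any argument is NULL, so a NULLable column needs a coalesce() (sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (DateCreated TEXT, DateModified TEXT, DateDeleted TEXT)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [("2021-01-01", "2021-01-05", None),
     ("2020-06-01", "2020-06-02", "2020-06-03")],
)

# greatest()/multi-argument max() return NULL when any argument is NULL,
# which would silently drop the row, hence the coalesce().
rows = conn.execute(
    """SELECT * FROM myTable
       WHERE max(DateCreated, DateModified, coalesce(DateDeleted, '')) > ?""",
    ("2021-01-02",),
).fetchall()
```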

SQL - renumbering a sequential column to be sequential again after deletion

I've researched and realize I have a unique situation.
First off, I am not allowed to post images yet to the board since I'm a new user, so see appropriate links below
I have multiple tables where a column (not always the identifier column) is sequentially numbered and shouldn't have any breaks in the numbering. My goal is to make sure this stays true.
Down and Dirty
We have an 'Event' table where we randomly select a percentage of the rows and insert the rows into table 'Results'. The "ID" column from the 'Results' is passed to a bunch of delete queries.
This more or less ensures that there are missing rows in several tables.
My problem:
Figuring out an SQL query that will renumber the column I specify. I'd prefer not to drop the column.
Example delete query:
delete ItemVoid
from ItemTicket
join ItemVoid
  on ItemTicket.item_ticket_id = ItemVoid.item_ticket_id
where ItemTicket.ID in (select ID from results)
Example Tables Before:
Example Tables After:
As you can see, two rows were deleted from both tables based on the ID column. So now I have to figure out how to renumber the item_ticket_id and item_void_id columns so that the higher numbers decrease to fill the missing values, the next highest decreases, and so on. Problem #2: if an item_ticket_id changes in order to stay sequential in ItemTickets, that change has to be propagated to ItemVoid's item_ticket_id.
I appreciate any advice you can give on this.
(answering an old question as it's the first search result when I was looking this up)
(MS T-SQL)
Resequencing an ID column (not an identity one) that has gaps can be done with a simple CTE that uses row_number() to generate the new sequence.
The UPDATE works through the CTE 'virtual table' without any extra work, updating the underlying original table.
Don't worry about ID values clashing during the update: even when a row is assigned an ID that already exists, the statement doesn't suffer that problem - the original sequence is changed to the new sequence in one go.
WITH NewSequence AS
(
    SELECT ID,
           ROW_NUMBER() OVER (ORDER BY ID) AS ID_New
    FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;
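For comparison, SQLite can't UPDATE through a CTE the way T-SQL can; a sketch via Python's sqlite3 that materializes the row_number() mapping into a temp table first (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (ID INTEGER, Payload TEXT)")
conn.executemany("INSERT INTO YourTable VALUES (?, ?)",
                 [(2, "a"), (5, "b"), (9, "c")])  # gaps left by deletes

# Materialize the old-ID -> new-ID mapping, then update from it.
conn.executescript(
    """CREATE TEMP TABLE NewSequence AS
           SELECT ID, ROW_NUMBER() OVER (ORDER BY ID) AS ID_New
           FROM YourTable;
       UPDATE YourTable
       SET ID = (SELECT ID_New FROM NewSequence
                 WHERE NewSequence.ID = YourTable.ID);"""
)

rows = conn.execute("SELECT ID, Payload FROM YourTable ORDER BY ID").fetchall()
# IDs are sequential again and each row keeps its payload.
```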
Since you are looking for advice on this, my advice is that you need to redesign, as I see a big flaw in your design.
Instead of deleting the records and then going through the hassle of renumbering the remaining records, use a bit flag to mark records as inactive. Then, when you query the records, just include a WHERE clause so that only active records are returned:
SELECT *
FROM yourTable
WHERE Inactive = 0
Then you never have to worry about re-numbering the records. This also gives you the ability to go back and see the records that would have been deleted and you do not lose the history.
If you really want to delete the records and renumber them then you can perform this task the following way:
Create a new table
Insert your original data into the new table, generating the new numbers
Drop the old table
Rename the new table to the old table's name
As you can see there would be a lot of steps involved in re-numbering the records. You are creating much more work this way when you could just perform an UPDATE of the bit flag.
You would change your DELETE query to something similar to this:
UPDATE ItemVoid
SET InActive = 1
FROM ItemVoid
JOIN ItemTicket
on ItemVoid.item_ticket_id = ItemTicket.item_ticket_id
WHERE ItemTicket.ID IN (select ID from results)
The bit flag is much easier and that would be the method that I would recommend.
The function that you are looking for is a window function; in standard SQL (supported by SQL Server, and MySQL 8.0+), it is row_number(). You use it as follows:
select row_number() over (order by <col>)
from <table>
In order to use this in your case, you would delete the rows from the table, then use a with statement to recalculate the row numbers, and then assign them using an update. For transactional integrity, you might wrap the delete and update into a single transaction.
Oracle supports similar functionality, but the syntax is a bit different. Oracle calls these functions analytic functions and they support a richer set of operations on them.
I would strongly caution you from using cursors, since these have lousy performance. Of course, this will not work on an identity column, since such a column cannot be modified.

SQL Server Implicit Order

I've got an issue due to the database design.
My data are grouped in a table which looks like:
IdGroup | IdValue
So for each group I've got the list of values.
Ideally we would have had an order column or an id, but I can't add one.
Do you know of any way to prove the order of the selected values based on the insert order?
I mean, if I inserted 1003, 1001, 1002, could I guarantee retrieving them in that order?
IdGroup | IdValue
1 | 1003
1 | 1001
1 | 1002
Of course, using an ORDER BY doesn't seem to fit because I don't have any usable column.
Any idea? Using a system proc or something like that.
Thanks a lot :)
Stop telling me to use an ORDER BY or to alter the table; it doesn't fit, and yes, I know that's the good practice to follow... thanks :)
A couple of ideas:
DBCC PAGE (undocumented) can be used to look at the raw data pages of the table. It may be possible to determine insert order by looking at the low level information.
If you cannot alter the table, can you add a table to the database? If so, consider creating a table with an identity column and use a trigger on the original table to insert the records in the new table.
Also, you should include which version(s) of SQL Server are involved. Doing anything this unusual will very often be version specific.
You shouldn't rely on the data being returned in a particular order; use an ORDER BY clause to guarantee the order.
(Despite the fact that data appears to be returned in clustered index order, this might not always be the case).
Whilst some small-scale tests may show the data coming back in what appears to be the right order, that just will not hold.
The golden rule remains: unless an ORDER BY clause is specified, there are no guarantees on the order of the returned data.
Edit: if you place a non-unique clustered index on the idgroup column, SQL Server is forced to add a hidden field, the uniquifier, since the key values are the same. The problem is that you can't access it in an ORDER BY clause, but from a forensic perspective you can determine the order the rows were inserted in.
As others have said, the only way to guarantee an ordering is with an ORDER BY clause. What isn't highlighted in their answers is that, the only place that this ORDER BY matters is in the SELECT statement. It doesn't* matter if you apply an ORDER BY clause during the INSERT statement; the system is free to return results from a select in whatever order it finds most efficient, unless an ORDER BY is specified at that time.
*There's a particular way to ensure what order IDENTITY values are assigned during an INSERT, using an ORDER BY, but I can't remember the exact details, and it still doesn't affect the order of SELECT.
Can you add a CreatedDate column? That way you can retrieve the records using an ORDER BY on CreatedDate. Moreover, set its default value to GETDATE().

Remove duplicate entries from database with conditions

I've had a good look around but haven't been able to find a solution, so I'm hoping someone can help with this one.
I have a MySQL table of results from an internal logging application which records a result from a check routine, there are a number of check routines which are identified with the tracker column:
id (int)(PK), tracker (int), time (timestamp), result (int)
A result only needs to be recorded if the previous result is not the same; only changes need to be captured. Unfortunately this was ignored when the application was built (in a hurry) a month ago, and results have been recorded blindly with no check on the previous result. That has now been corrected, but I'm still left with a few thousand rows, a significant number of which are duplicate entries, and I'm after a way of clearing these out to leave just the change points.
So I need to go through each row, look at the previous result recorded by that tracker, and delete the row if it's the same. This is a bit beyond my experience with MySQL, and my attempts so far have been fairly poor!
Can anyone help?
Use:
DELETE a
FROM YOUR_TABLE a
LEFT JOIN (SELECT MAX(t.id) AS latest_id
           FROM YOUR_TABLE t
           GROUP BY t.tracker, t.result) b
       ON b.latest_id = a.id
WHERE b.latest_id IS NULL
Alternate using IN:
DELETE FROM YOUR_TABLE
WHERE id NOT IN (SELECT x.latest_id
                 FROM (SELECT MAX(t.id) AS latest_id
                       FROM YOUR_TABLE t
                       GROUP BY t.tracker, t.result) x)
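A variant that keeps every change point, including a result that re-occurs after changing away, uses lag(); a sketch in SQLite via Python, assuming id increases with insertion time (column names from the question, sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, tracker INTEGER, result INTEGER)")
conn.executemany(
    "INSERT INTO logs (tracker, result) VALUES (?, ?)",
    [(1, 0), (1, 0), (1, 1), (1, 1), (1, 0), (2, 7), (2, 7)],
)

# Delete any row whose result equals the previous result for the same tracker.
conn.execute(
    """DELETE FROM logs WHERE id IN (
           SELECT id FROM (
               SELECT id, result,
                      lag(result) OVER (PARTITION BY tracker ORDER BY id) AS prev
               FROM logs)
           WHERE result = prev)"""
)

kept = [r[0] for r in conn.execute("SELECT id FROM logs ORDER BY id")]
# Rows 2, 4 and 7 merely repeated the previous result and are gone;
# row 5 survives because result 0 re-occurred after a change.
```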
There are complaints that this is slow to execute, but that probably doesn't affect you. It will certainly be faster than anything else you might do:
select DISTINCT id, tracker, time, result
from table;
I think you want a unique index on the table:
ALTER IGNORE TABLE table ADD UNIQUE INDEX (tracker, time, result)
http://dev.mysql.com/doc/refman/5.1/en/alter-table.html
You'll have to use INSERT IGNORE... when adding new rows, since a plain INSERT that duplicates an existing (tracker, time, result) key would otherwise cause an error.