Bahaviour of SQL update using one-to-many join - sql

Imagine I have two tables, t1 and t2. t1 has two fields, one containing unique values called a and another field called value. Table t2 has a field that does not contain unique values called b and a field also called value.
Now, if I use the following update query (this is using MS Access btw):
UPDATE t1
INNER JOIN t2 ON t1.a=t2.b
SET t1.value=t2.value
If I have the following data
t1 t2
a | value b | value
------------ ------------
'm' | 0.0 'm'| 1.1
'm'| 0.2
and run the query what value ends up in t1.value? I ran some tests but couldn't find consistent behaviour, so I'm guessing it might just be undefined. Or this kind of update query is something that just shouldn't be done? There is a long boring story about why I've had to do it this way, but it's irrelevant to the technical nature of my enquiry.

This is known as a non deterministic query, it means exactly what you have found that you can run the query multiple times with no changes to the query or underlying data and get different results.
In practice what happens is the value will be updated with the last record encountered, so in your case it will be updated twice, but the first update will be overwritten by last. What you have absolutely no control over is in what order the SQL engine accesses the records, it will access them it whatever order it deems fit, this could be simply a clustered index scan from the begining, or it could use other indexes and access the clustered index in a different order. You have no way of knowing this. It is quite likely that running the update multiple times would yield the same result, because with no changes to the data the sql optimiser will use the same query plan. But again there is no guarantee, so you should not rely on a non determinstic query to get deterministic results.
EDIT
To update the value in T1 to the Maximum corresponding value in T2 you can use DMax:
UPDATE T1
SET Value = DMax("Value", "T2", "b=" & T1.a);

When you execute the query as you’ve indicated, the “value” that ends up in “t1” for the row ‘m’ will be, effectively, random, due to the fact that “t2” has multiple rows for the identity value ‘m’.
Unless you specifically specify that you want the maximum (max function), minimum (min function) or some-other aggregate of the collection of rows with the identity ‘m’ the database has no ability to make a defined choice and as such returns whatever value it first comes across, hence the inconsistent behaviour.
Hope this helps.

Related

Index for join query with where clause PostgreSQL

I have to optimize the following query with the help of indexes.
SELECT f.*
FROM first f
JOIN second s on f.attributex_id = s.id
WHERE f.attributex_id IS NOT NULL AND f.attributey_id IS NULL
ORDER BY s.month ASC LIMIT 100;
Further infos:
attributex_id is a foreign key pointing to second.id
attributey_id is a foreign key pointing to another table not used in the query
Changing the query is not an option
Most entries (98%) in first the following will be true f.attributex_id IS NOT NULL. Same for the second condition f.attributey_id IS NULL
I tried to add as index as follows.
CREATE INDEX index_for_first
ON first (attributex_id, attributey_id)
WHERE attributex_id IS NOT NULL AND (attributey_id IS NULL)
But the index is not used (checked via Explain Analyze) when executing the query. What kind of indexes would I need to optimize the query and what am I doing wrong with the above index?
Does an index on s.month make sense, too (month is unique)?
Based on the query text and the fact that nearly all records in first satisfy the where clause, what you're essentially trying to do is
identify the 100 second records with the lowest month value
output the contents of the related records in the first table.
To achieve that you can create indexes on
second.month
first.attributex_id
Caveats
Since this query must be optimized, it's safe to say there are many rows in both tables. Since there are only 12 months in the year, the output of the query is probably not deterministic (i.e., it may return a different set of rows each time it's run, even if there is no activity in either table between runs) since many records likely share the same value for month. Adding "tie breaker" column(s) to the index on second may help, though your order by only includes month, so no guarantees. Also, if second.month can have null values, you'll need to decide whether those null values should collate first or last among values.
Also, this particular query is not the only one being run against your data. These indexes will take up disk space and incrementally slow down writes to the tables. If you have a dozen queries that perform poorly, you might fall into a trap of creating a couple indexes to help each one individually and that's not a solution that scales well.
Finally, you stated that
changing the query is not an option
Does that mean you're not allowed to change the text of the query, or the output of the query?
I personally feel like re-writing the query to select from second and then join first makes the goal of the query more obvious. The fact that your initial instinct was to add indexes to first lends credence to this idea. If the query were written as follows, it would have been more obvious that the thing to do is facilitate efficient access to the tiny set of rows in second that you're interested in:
...
from second s
join first f ...
where ...
order by s.month asc limit 100;

"Set" with "where exists" performs better then without

I come up by chances to this curious case.
Environment:
Oracle 12.2.2
Involved 2 tables.
N. of rows 16 milions
As far I know, and reported here Oracle / PLSQL: EXISTS Condition the use of where exists is in general less perfomant of other way.
In my case however when updating a table's column with the value with another on join condition with the exists, the query run in about 12-13 sec without issues(I did only some check, as I really do not know all the content of the table):
update fdm_auftrag ou
set (ou.e_hr,ou.e_budget) = ( select b.e_hr,b.e_budget
from fdm_budget_auftrag b
where b.fk_column1 = ou.fk_column1
and b.fk_column2 = ou.fk_column2
and b.fk_col3 = ou.fk_col3 )
where exists ( select b.e_hr,b.e_budget
from fdm_budget_auftrag b
where b.fk_column1 = ou.fk_column1
and b.fk_column2 = ou.fk_column2
and b.fk_col3 = ou.fk_col3 );
instead without the exists, it takes so much time then I even interrupt it.
I am just gessing as the condition in exist is valuated as a boolean, if the enginee found out at least one row, then had to do less touch on the DB, but I am not sure about it.
It is correct this "guess", have someone a more clear explanation?
The where clause is limiting the number of rows being updated.
Fewer updated rows means that the update query runs faster. There is a lot of overhead to updating a row, including stashing away information for roll-back purposes.
I am assuming that you are updating relatively few rows in a much larger table. If the where clause is selecting most of the rows, then there might be no performance difference.
And, finally, the two queries are not identical. Without the where unmatched values will be assigned NULL.

Inconsistent results from BigQuery: same query, different number of rows

I noticed today that one my query was having inconsistent results: every time I run it I have a different number of rows returned (cache deactivated).
Basically the query looks like this:
SELECT *
FROM mydataset.table1 AS t1
LEFT JOIN EACH mydataset.table2 AS t2
ON t1.deviceId=t2.deviceId
LEFT JOIN EACH mydataset.table3 AS t3
ON t2.email=t3.email
WHERE t3.email IS NOT NULL
AND (t3.date IS NULL OR DATE_ADD(t3.date, 5000, 'MINUTE')<TIMESTAMP('2016-07-27 15:20:11') )
The tables are not updated between each query. So I'm wondering if you also have noticed that kind of behaviour.
I usually make queries that return a lot of rows (>1000) so a few missing rows here and there is hardly noticeable. But this query return a few row, and it varies everytime between 10 and 20 rows :-/
If a Google engineer is reading this, here are two Job ID of the same query with different results:
picta-int:bquijob_400dd739_1562d7e2410
picta-int:bquijob_304f4208_1562d7df8a2
Unless I'm missing something, the query that you provide is completely deterministic and so should give the same result every time you execute it. But you say it's "basically" the same as your real query, so this may be due to something you changed.
There's a couple of things you can do to try to find the cause:
replace select * by an explicit selection of fields from your tables (a combination of fields that uniquely determine each row)
order the table by these fields, so that the order becomes the same each time you execute the query
simplify your query. In the above query, you can remove the first condition and turn the two left outer joins into inner joins and get the same result. After that, you could start removing tables and conditions one by one.
After each step, check if you still get different result sets. Then when you have found the critical step, try to understand why it causes your problem. (Or ask here.)

iSeries query changes selected RRN of subquery result rows

I'm trying to make an optimal SQL query for an iSeries database table that can contain millions of rows (perhaps up to 3 million per month). The only key I have for each row is its RRN (relative record number, which is the physical record number for the row).
My goal is to join the table with another small table to give me a textual description of one of the numeric columns. However, the number of rows involved can exceed 2 million, which typically causes the query to fail due to an out-of-memory condition. So I want to rewrite the query to avoid joining a large subset with any other table. So the idea is to select a single page (up to 30 rows) within a given month, and then join that subset to the second table.
However, I ran into a weird problem. I use the following query to retrieve the RRNs of the rows I want for the page:
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
This query works just fine, returning the correct RRNs for the rows I need. However, when I attempted to join the result of the subquery with another table, the RRNs changed. So I simplified the query to a subquery within a simple outer query, without any join:
select rrn(e) as RRN, e.*
from TABLE1 as e
where rrn(e) in (
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
)
order by e.UPDATED, e.ACCOUNT
The outer query simply grabs all of the columns of each row selected by the subquery, using the RRN as the row key. But this query does not work - it returns rows with completely different RRNs.
I need the actual RRN, because it will be used to retrieve more detailed information from the table in a subsequent query.
Any ideas about why the RRNs end up different?
Resolution
I decided to break the query into two calls, one to issue the simple subquery and return just the RRNs (rows-IDs), and the second to do the rest of the JOINs and so forth to retrieve the complete info for each row. (Since the table gets updated only once a day, and rows never get deleted, there are no potential timing problems to worry about.)
This approach appears to work quite well.
Addendum
As to the question of why an out-of-memory error occurs, this appears to be a limitation on only some of our test servers. Some can only handle up to around 2m rows, while others can handle much more than that. So I'm guessing that this is some sort of limit imposed by the admins on a server-by-server basis.
Trying to use RRN as a primary key is asking for trouble.
I find it hard to believe there isn't a key available.
Granted, there may be no explicit primary key defined in the table itself. But is there a unique key defined in the table?
It's possible there's no keys defined in the table itself ( a practice that is 20yrs out of date) but in that case there's usually a logical file with a unique key defined that is by the application as the de-facto primary key to the table.
Try looking for related objects via green screen (DSPDBR) or GUI (via "Show related"). Keyed logical files show in the GUI as views. So you'd need to look at the properties to determine if they are uniquely keyed DDS logicals instead of non-keyed SQL views.
A few times I've run into tables with no existing de-facto primary key. Usually, it was possible to figure out what could be defined as one from the existing columns.
When there truly is no PK, I simply add one. Usually a generated identity column. There's a technique you can use to easily add columns without having to recompile or test any heritage RPG/COBOL programs. (and note LVLCHK(*NO) is NOT it!)
The technique is laid out in Chapter 4 of the modernizing Redbook
http://www.redbooks.ibm.com/abstracts/sg246393.html
1) Move the data to a new PF (or SQL table)
2) create new LF using the name of the existing PF
3) repoint existing LF to new PF (or SQL table)
Done properly, the record format identifiers of the existing objects don't change and thus you don't have to recompile any RPG/COBOL programs.
I find it hard to believe that querying a table of mere 3 million rows, even when joined with something else, should cause an out-of-memory condition, so in my view you should address this issue first (or cause it to be addressed).
As for your question of why the RRNs end up different I'll take the liberty of quoting the manual:
If the argument identifies a view, common table expression, or nested table expression derived from more than one base table, the function returns the relative record number of the first table in the outer subselect of the view, common table expression, or nested table expression.
A construct of the type ...where something in (select somethingelse...) typically translates into a join, so there.
Unless you can specifically control it, e.g., via ALWCPYDTA(*NO) for STRSQL, SQL may make copies of result rows for any intermediate set of rows. The RRN() function always accesses physical record number, as contrasted with the ROW_NUMBER() function that returns a logical row number indicating the relative position in an ordered (or unordered) set of rows. If a copy is generated, there is no way to guarantee that RRN() will remain consistent.
Other considerations apply over time; but in this case it's as likely to be simple copying of intermediate result rows as anything.

SQL - renumbering a sequential column to be sequential again after deletion

I've researched and realize I have a unique situation.
First off, I am not allowed to post images yet to the board since I'm a new user, so see appropriate links below
I have multiple tables where a column (not always the identifier column) is sequentially numbered and shouldn't have any breaks in the numbering. My goal is to make sure this stays true.
Down and Dirty
We have an 'Event' table where we randomly select a percentage of the rows and insert the rows into table 'Results'. The "ID" column from the 'Results' is passed to a bunch of delete queries.
This more or less ensures that there are missing rows in several tables.
My problem:
Figuring out an sql query that will renumber the column I specify. I prefer to not drop the column.
Example delete query:
delete ItemVoid
from ItemTicket
join ItemVoid
on ItemTicket.item_ticket_id = itemvoid.item_ticket_id
where itemticket.ID in (select ID
from results)
Example Tables Before:
Example Tables After:
As you can see 2 rows were delete from both tables based on the ID column. So now I gotta figure out how to renumber the item_ticket_id and the item_void_id columns where the the higher number decreases to the missing value, and the next highest one decreases, etc. Problem #2, if the item_ticket_id changes in order to be sequential in ItemTickets, then
it has to update that change in ItemVoid's item_ticket_id.
I appreciate any advice you can give on this.
(answering an old question as it's the first search result when I was looking this up)
(MS T-SQL)
To resequence an ID column (not an Identity one) that has gaps,
can be performed using only a simple CTE with a row_number() to generate a new sequence.
The UPDATE works via the CTE 'virtual table' without any extra problems, actually updating the underlying original table.
Don't worry about the ID fields clashing during the update, if you wonder what happens when ID's are set that already exist, it
doesn't suffer that problem - the original sequence is changed to the new sequence in one go.
WITH NewSequence AS
(
SELECT
ID,
ROW_NUMBER() OVER (ORDER BY ID) as ID_New
FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;
Since you are looking for advice on this, my advice is you need to redesign this as I see a big flaw in your design.
Instead of deleting the records and then going through the hassle of renumbering the remaining records, use a bit flag that will mark the records as Inactive. Then when you are querying the records, just include a WHERE clause to only include the records are that active:
SELECT *
FROM yourTable
WHERE Inactive = 0
Then you never have to worry about re-numbering the records. This also gives you the ability to go back and see the records that would have been deleted and you do not lose the history.
If you really want to delete the records and renumber them then you can perform this task the following way:
create a new table
Insert your original data into your new table using the new numbers
drop your old table
rename your new table with the corrected numbers
As you can see there would be a lot of steps involved in re-numbering the records. You are creating much more work this way when you could just perform an UPDATE of the bit flag.
You would change your DELETE query to something similar to this:
UPDATE ItemVoid
SET InActive = 1
FROM ItemVoid
JOIN ItemTicket
on ItemVoid.item_ticket_id = ItemTicket.item_ticket_id
WHERE ItemTicket.ID IN (select ID from results)
The bit flag is much easier and that would be the method that I would recommend.
The function that you are looking for is a window function. In standard SQL (SQL Server, MySQL), the function is row_number(). You use it as follows:
select row_number() over (partition by <col>)
from <table>
In order to use this in your case, you would delete the rows from the table, then use a with statement to recalculate the row numbers, and then assign them using an update. For transactional integrity, you might wrap the delete and update into a single transaction.
Oracle supports similar functionality, but the syntax is a bit different. Oracle calls these functions analytic functions and they support a richer set of operations on them.
I would strongly caution you from using cursors, since these have lousy performance. Of course, this will not work on an identity column, since such a column cannot be modified.