Update join not updating the same number of rows as equivalent select - sql

When I run this SELECT statement, I receive 642 rows...
SELECT *
FROM _DevLoadIn a
JOIN ArticleCompanyList b ON b.Company = a.Name
When I run this UPDATE statement, only 630 rows are updated...
UPDATE b
SET b.BGCompanyId = a.RelatedId
FROM _DevLoadIn a
JOIN ArticleCompanyList b ON b.Company = a.Name
The JOIN is identical, so how can the number of affected rows be different? Both statements execute without error. I don't see how this could be possible. Can anyone provide any insight? Am I missing something about how an update/join works?

My best guess is that some rows in B match more than one row in A. The SELECT returns one row per matching pair, so those B rows appear several times, but the UPDATE modifies each B row only once.
In other words, the extra rows in your SELECT are repeats of B rows (not A rows).
---- updated post question edit -----
Are you sure that you are updating the right value? Make sure that the proper table (A or B) is on the left-hand side of the update statement. It appears you've edited your question and swapped the tables relative to what was originally posted. The theory is still the same, however.
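The duplicate-match theory can be checked on a toy copy of the tables. This is only a sketch: it uses Python's built-in sqlite3 module rather than SQL Server, and the three rows in _DevLoadIn and two rows in ArticleCompanyList are made up for illustration. It compares the number of rows the join returns with the number of ArticleCompanyList rows that have at least one match, and lists the names that occur more than once.

```python
import sqlite3

# Hypothetical miniature of the question's tables; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE _DevLoadIn (Name TEXT, RelatedId INTEGER);
    CREATE TABLE ArticleCompanyList (Company TEXT, BGCompanyId INTEGER);
    INSERT INTO _DevLoadIn VALUES ('Acme', 1), ('Acme', 2), ('Globex', 3);
    INSERT INTO ArticleCompanyList VALUES ('Acme', NULL), ('Globex', NULL);
""")

# One row per matching (a, b) pair: 'Acme' matches twice.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM _DevLoadIn a
    JOIN ArticleCompanyList b ON b.Company = a.Name
""").fetchone()[0]

# Rows of b with at least one match: the most an UPDATE of b can touch.
updatable_rows = conn.execute("""
    SELECT COUNT(*)
    FROM ArticleCompanyList b
    WHERE EXISTS (SELECT 1 FROM _DevLoadIn a WHERE a.Name = b.Company)
""").fetchone()[0]

# Names that occur more than once in a and therefore inflate the SELECT.
duplicate_names = conn.execute("""
    SELECT Name, COUNT(*)
    FROM _DevLoadIn
    GROUP BY Name
    HAVING COUNT(*) > 1
""").fetchall()

print(joined_rows, updatable_rows, duplicate_names)
```

If the same two counts on the real tables come out as 642 and 630, the twelve extra SELECT rows are repeated matches for rows that the UPDATE can only modify once.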

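If b.BGCompanyId is already equal to a.RelatedId, the row may not be counted as updated. This depends on the engine: MySQL by default reports only rows whose values actually changed, whereas SQL Server counts every matched row.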
You could verify this by modifying your original query like so:
SELECT *
FROM _DevLoadIn a
JOIN ArticleCompanyList b ON b.Company = a.Name
WHERE b.BGCompanyId != a.RelatedId

Related

Values after join are incorrect

I have two database tables. I need to fetch some records from table A based on a parameter that is passed in; there may or may not be an entry in table B with that key.
What I want to do is:
select a.col1,a.col2,a.col3
FROM a WHERE a.id = 123
This would fetch 20 rows. For one of the rows there is an entry in another table B.
select T_level from b where b.id = 123
Only one record appears, with the right value.
What I want is to get this in a single query. Something like:
select a.col1,a.col2,a.col3,b.T_level
from a,b
where a.id = 123
and a.id = b.id
When I do that, I get 20 rows with the column T_level as '50' for all of them, whereas it should be '50' for the one correct row and null for the rest.
I further tried:
select a.col1,a.col2,a.col3,nvl(b.T_level,0) from a,b
but that doesn't fetch the way I expect.
Firstly, please learn to use ANSI SQL join syntax. The Oracle join syntax you are using hasn't been considered good practice for decades.
SQL Join syntax
If you want to get all records from a and any matching records from b, then you need to use a LEFT OUTER JOIN.
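A minimal sketch of what the LEFT OUTER JOIN does, run through Python's sqlite3 module. The tables and rows are invented, and the sketch assumes the join key identifies a single row in a. In the question all 20 rows share id = 123, so joining on id alone would still attach '50' to every row; the ON clause would need whichever column actually singles out the matching row.

```python
import sqlite3

# Hypothetical tables: three rows in a, only id 2 has a row in b.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, col1 TEXT);
    CREATE TABLE b (id INTEGER, T_level INTEGER);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (2, 50);
""")

# Every row of a is kept; T_level is NULL (None in Python) where b has no match.
rows = conn.execute("""
    SELECT a.id, a.col1, b.T_level
    FROM a
    LEFT OUTER JOIN b ON b.id = a.id
    ORDER BY a.id
""").fetchall()

for row in rows:
    print(row)
```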

Determine datatypes of columns - SQL selection

Is it possible to determine the data type of each column after a SQL selection, based on the received results? I know it is possible through information_schema.columns, but the data I receive comes from multiple tables that are joined together, and the columns are renamed. Besides that, I'm not able to see or use this query or execute other queries myself.
My job is to store this received data in another table, but without knowing beforehand what I will receive. I'm obviously able to check, for example, whether a certain column contains numbers or text, but not whether it was originally stored as a TINYINT(1) or a BIGINT(128). How should I approach this? To clarify, it is alright if the data types of the source and destination columns aren't entirely the same, but I don't want to reserve too much space beforehand (or too little, for that matter).
As I'm typing, I realize I'm formulating the question wrong. What would be the best approach to handle the described situation? I thought about altering tables on the fly (e.g. increasing a column's size if needed), but that seems a bit, well, wrong and not the proper way.
Thanks
Can you issue the following query about your new table after you create it?
SELECT *
INTO JoinedQueryResults
FROM TableA AS A
INNER JOIN TableB AS B ON A.ID = B.ID;
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'JoinedQueryResults';
Is the query too big to run before knowing how big the results will be? You can get an idea of how many rows it may return first. The trick with queries that contain joins is to group on the columns you are joining on, so that the estimate comes back more quickly. Here's an example that returns just the row count of the query above, the one that would have created the JoinedQueryResults table.
SELECT SUM(A.NumRows * B.NumRows)
FROM (SELECT ID, COUNT(*) AS NumRows
FROM TableA
GROUP BY ID) AS A
INNER JOIN (SELECT ID, COUNT(*) AS NumRows
FROM TableB
GROUP BY ID) AS B ON A.ID = B.ID
The query above will run faster if all you need is a record count to help you estimate a size.
Also try instantiating a table for your results with a query like this.
SELECT TOP 0 *
INTO JoinedQueryResults
FROM TableA AS A
INNER JOIN TableB AS B ON A.ID = B.ID
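The grouped row-count estimate above can be sanity-checked on a small data set. The sketch below uses Python's sqlite3 module and made-up rows (the Payload column is hypothetical); it compares the grouped estimate with a plain COUNT(*) over the join.

```python
import sqlite3

# Hypothetical data: ID 1 appears 2x in TableA and 3x in TableB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID INTEGER, Payload TEXT);
    CREATE TABLE TableB (ID INTEGER, Payload TEXT);
    INSERT INTO TableA VALUES (1, 'a1'), (1, 'a2'), (2, 'a3'), (3, 'a4');
    INSERT INTO TableB VALUES (1, 'b1'), (1, 'b2'), (1, 'b3'), (2, 'b4'), (4, 'b5');
""")

# Estimate: per join key, rows in A times rows in B, summed over matching keys.
estimated = conn.execute("""
    SELECT SUM(A.NumRows * B.NumRows)
    FROM (SELECT ID, COUNT(*) AS NumRows FROM TableA GROUP BY ID) AS A
    INNER JOIN (SELECT ID, COUNT(*) AS NumRows FROM TableB GROUP BY ID) AS B
        ON A.ID = B.ID
""").fetchone()[0]

# Ground truth: actually run the join and count its rows.
actual = conn.execute("""
    SELECT COUNT(*)
    FROM TableA AS A
    INNER JOIN TableB AS B ON A.ID = B.ID
""").fetchone()[0]

print(estimated, actual)
```

Both numbers agree here (2 * 3 for ID 1 plus 1 * 1 for ID 2), which is the point of the technique: the grouped form never has to materialize the joined rows.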

Why and when to use CROSS JOIN instead of INNER JOIN with UPDATE statements?

Having coded in T-SQL for three months or so, I've just seen, for the first time, a CROSS JOIN used in an UPDATE statement in some code, and I can't figure out the use cases for such a construct.
Does anyone know?
Edit: here is a sample of the code I don't fully understand yet.
UPDATE a
SET a.COL1 = b.COL1
FROM Table1 AS a
CROSS JOIN Table2 AS b
And there are other updates in the code that provide a WHERE clause like:
UPDATE a
SET a.COL1 = b.COL1
FROM Table1 AS a
CROSS JOIN Table2 AS b
WHERE condition_on_columns_from_a_and_from_b
And the point is that, for each row of Table1, a SELECT on the cross join with the filtering returns more than one row.
I'm a bit confused about the behavior.
PS: Table1 takes up more than 5 gigabytes of space.
A cross join generates the cartesian product of two tables. This means it combines EVERY row of table A with EVERY row of table B. When Table A has n rows and table B has m rows, the result set has n*m rows.
There is no good reason that I can imagine to do this. The query is either written incorrectly, or just a test to slow down your system or to invalidate the target table's data (or perhaps, just to see what it does).
It will probably set COL1 of every row in Table1 to the same single arbitrary value from Table2's COL1 (though probably either the first or last such value). But it will do so very inefficiently (unless the optimizer in later versions of SQL Server has optimized out this useless case; I haven't tested it in years myself).
To understand the use case, you would need to look at the data. I can easily see using the first update if I were positive that Table2 would always contain exactly one record. This is especially true if that one record has no field to join to Table1 on. In this case you are updating the field in every row of Table1 with the value of that field in Table2. Normally this type of thing, where all records are updated, would only be for resetting values.
To see what would be updated, do this:
UPDATE a
SET a.COL1 = b.COL1
--select a.COL1,b.COL1, *
FROM Table1 AS a
CROSS JOIN Table2 AS b
WHERE condition_on_columns_from_a_and_from_b
Now you can run just the select part to see what value a.COL1 would be replaced with, and see the other fields in the tables to check whether the join and where clause appear to be correct. This will help you understand what the cross join is doing. You could then temporarily replace the cross join with a left join and an inner join to understand how its behavior differs from the other types of joins. Play around with the select for a while until you really understand what is happening. I never write an update without having the select in comments, so I can ensure I am updating what I think I should be before I move the code to prod. This is especially true if you write complex updates like I do, which can involve ten or fifteen joins and several where conditions.
Okay, with this query:
UPDATE a
SET COL1 = b.COL1
FROM Table1 AS a
CROSS JOIN Table2 AS b
WHERE condition_on_columns_from_a_and_from_b
If we take the set formed by a CROSS JOIN b (and before considering the WHERE clause), then we have a Cartesian product, where every row from a is paired with every row from b.
If we now consider the WHERE clause - unless this WHERE clause is sufficient to guarantee that each row from a is only represented once, then we will have an indeterminate result. That is, if there are two rows in the set which are both derived from the same row from a (but different rows from b), then there is no way to know, for sure, which of those two rows will be used to compute the SET a.COL1 = b.COL1 assignment.
I don't think it's even guaranteed, if we had the following:
UPDATE a
SET COL1 = b.COL1, COL2 = b.COL2
FROM --As before
that the same row from b will be used for both assignments.
All of the above is true for any UPDATE statement using the T-SQL FROM clause extension - unless you're careful to constrain your join conditions, then multiple assignments for the same row may be possible. But a CROSS JOIN just seems to make it far more likely to occur. And SQL Server issues no diagnostic messages if this occurs.
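One way to find out whether such an UPDATE is ambiguous is to count, per target row, how many source rows survive the WHERE clause. The sketch below does this with Python's sqlite3 module. The Id and Grp columns and the condition a.Grp = b.Grp are hypothetical stand-ins for the real key and for condition_on_columns_from_a_and_from_b.

```python
import sqlite3

# Hypothetical data: Table1 row 1 matches two Table2 rows, row 2 matches one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Id INTEGER PRIMARY KEY, Grp TEXT, COL1 TEXT);
    CREATE TABLE Table2 (Grp TEXT, COL1 TEXT);
    INSERT INTO Table1 VALUES (1, 'g1', NULL), (2, 'g2', NULL);
    INSERT INTO Table2 VALUES ('g1', 'red'), ('g1', 'blue'), ('g2', 'green');
""")

# Target rows with more than one candidate source row: for these the
# UPDATE ... FROM has no defined winner.
ambiguous = conn.execute("""
    SELECT a.Id, COUNT(*) AS Candidates
    FROM Table1 AS a
    CROSS JOIN Table2 AS b
    WHERE a.Grp = b.Grp
    GROUP BY a.Id
    HAVING COUNT(*) > 1
    ORDER BY a.Id
""").fetchall()

print(ambiguous)
```

An empty result means every target row has at most one candidate, so the assignment is well defined; any row listed could receive either of its candidate values.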

SQL Method of checking that INNER / LEFT join doesn't duplicate rows

Is there a good or standard SQL method of asserting that a join does not duplicate any rows (produces 0 or 1 copies of the source table row)? Assert as in causes the query to fail or otherwise indicate that there are duplicate rows.
A common problem in a lot of queries is when a table is expected to be 1:1 with another table, but there might exist 2 rows that match the join criteria. This can cause errors that are hard to track down, especially for people not necessarily entirely familiar with the tables.
It seems like there should be something simple and elegant - this would be very easy for the SQL engine to detect (have I already joined this source row to a row in the other table? ok, error out) but I can't seem to find anything on this. I'm aware that there are long / intrusive solutions to this problem, but for many ad hoc queries those just aren't very fun to work out.
EDIT / CLARIFICATION: I'm looking for a one-step query-level fix. Not a verification step on the results of that query.
If you are only testing for linked rows rather than requiring output, then you'd use EXISTS.
More correctly, you need a "semi-join", but most RDBMSs only support this through EXISTS.
SELECT a.*
FROM TableA a
WHERE EXISTS (SELECT * FROM TableB b WHERE a.id = b.id)
Also see:
Using 'IN' with a sub-query in SQL Statements
EXISTS vs JOIN and use of EXISTS clause
SELECT JoinField
FROM MyJoinTable
GROUP BY JoinField
HAVING COUNT(*) > 1
LIMIT 1
Is that simple enough? Don't have Postgres but I think it's valid syntax.
Something along the lines of
SELECT a.id, COUNT(b.id)
FROM TableA a
JOIN TableB b ON a.id = b.id
GROUP BY a.id
HAVING COUNT(b.id) > 1
This should return the rows in TableA that have more than one associated row in TableB.
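If a separate check is acceptable, the same idea can be turned into an assertion that fails loudly. This is a sketch only, using Python's sqlite3 module. The helper name check_join_is_one_to_one, the note column and the sample rows are hypothetical, and it assumes a.id is unique and non-null in TableA. It compares the number of joined rows with the number of distinct a.id values in the join and raises if they differ.

```python
import sqlite3

def check_join_is_one_to_one(conn):
    """Raise ValueError if joining TableA to TableB repeats any TableA row."""
    total, distinct_ids = conn.execute("""
        SELECT COUNT(*), COUNT(DISTINCT a.id)
        FROM TableA a
        JOIN TableB b ON a.id = b.id
    """).fetchone()
    if total != distinct_ids:
        raise ValueError(
            f"join duplicated rows: {total} joined rows for {distinct_ids} distinct ids"
        )

# Hypothetical data: at first each TableA row matches at most one TableB row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (id INTEGER PRIMARY KEY);
    CREATE TABLE TableB (id INTEGER, note TEXT);
    INSERT INTO TableA VALUES (1), (2), (3);
    INSERT INTO TableB VALUES (1, 'only match for id 1'), (2, 'first match for id 2');
""")

check_join_is_one_to_one(conn)  # no duplicates yet, so this returns quietly

# A second match for id 2 turns the join into one-to-many.
conn.execute("INSERT INTO TableB VALUES (2, 'second match for id 2')")
try:
    check_join_is_one_to_one(conn)
    duplicate_detected = False
except ValueError as exc:
    duplicate_detected = True
    print(exc)
```

Where the data model allows it, a UNIQUE constraint on the join column of TableB enforces the same guarantee inside the database, with no extra query.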

SQL (any) Request for insight on a query optimization

I have a particularly slow query due to the vast amount of information being joined together. However, I needed to add a where clause of the form id in (select id from table).
I want to know if there is any gain from the following and, more pressingly, whether it will even give the desired results.
select a.* from a where a.id in (select id from b where b.id = a.id)
as an alternative to:
select a.* from a where a.id in (select id from b)
Update:
MySQL
Can't be more specific sorry
table a is effectively a join between 7 different tables.
use of * is for examples
Edit, b doesn't get selected
Your question was about the difference between these two:
select a.* from a where a.id in (select id from b where b.id = a.id)
select a.* from a where a.id in (select id from b)
The former is a correlated subquery. It may cause MySQL to execute the subquery for each row of a.
The latter is a non-correlated subquery. MySQL should be able to execute it once and cache the results for comparison against each row of a.
I would use the latter.
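Both queries you list are equivalent to the following, provided b.id is unique (otherwise the join returns a row of a once per matching row of b):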
select a.*
from a
inner join b on b.id = a.id
Almost all optimizers will execute them in the same way.
You could post a real execution plan, and someone here might give you a way to speed it up. It helps if you specify what database server you are using.
YMMV, but I've often found using EXISTS instead of IN makes queries run faster.
SELECT a.* FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
Of course, without seeing the rest of the query and the context, this may not make the query any faster.
A JOIN may be the preferable option, but if a.id appears more than once in the id column of b, you would have to throw a DISTINCT in there, and you would more than likely go backwards in terms of optimization.
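The difference is easy to see on a small example where an id repeats in b. The sketch below uses Python's sqlite3 module with invented rows (the label column is hypothetical): IN and EXISTS return each row of a at most once, while the plain JOIN returns it once per match.

```python
import sqlite3

# Hypothetical data: id 1 appears twice in b, id 3 does not appear at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, label TEXT);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO b VALUES (1), (1), (2);
""")

in_rows = conn.execute(
    "SELECT a.id FROM a WHERE a.id IN (SELECT id FROM b) ORDER BY a.id"
).fetchall()

exists_rows = conn.execute(
    "SELECT a.id FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id) ORDER BY a.id"
).fetchall()

join_rows = conn.execute(
    "SELECT a.id FROM a JOIN b ON b.id = a.id ORDER BY a.id"
).fetchall()

print(in_rows)      # each matching row of a once
print(exists_rows)  # same as IN
print(join_rows)    # id 1 repeated, once per matching row of b
```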
I would never use a subquery like this. A join would be much faster.
select a.*
from a
join b on a.id = b.id
Of course, don't use select * either (especially when doing a join, as at least one field is repeated); it wastes network resources sending unneeded data.
Have you looked at the execution plan?
How about
select a.*
from a
inner join b
on a.id = b.id
presumably the id fields are primary keys?
Select a.* from a
inner join (Select distinct id from b) c
on a.id = c.id
I tried all 3 versions and they ran about the same. The execution plan was the same (inner join, IN (with and without where clause in subquery), Exists)
Since you are not selecting any other fields from b, I prefer to use WHERE ... IN (SELECT ...). Anyone who looks at the query will know what you are trying to do (only show a row of a if it is in b).
Your problem is most likely in the seven tables within "a".
Make the FROM table the one that contains "a.id".
Make the next join: inner join b on a.id = b.id
Then join in the other six tables.
You really need to show the entire query, list all indexes, and give approximate row counts for each table if you want real help.