Why does joining on different data types produce a conversion error inconsistently? - sql

As I try to join tables together on a value that's represented in different data types, I get really odd errors. Please consider the following:
I have two tables; let's say one is in database "CoffeeWarehouse," and the other is in database "CoffeeAnalytics":
Table 1: CoffeeWarehouse.dbo.BeanInfo
Table 2: CoffeeAnalytics.dbo.BeanOrderRecord
Now, both tables have a field called OrderNumber (although in table 2, it's spelled as [order number]); in Table 1, it's represented as a string, and in Table 2, it's represented as a float.
I proceed to join the tables together:
SELECT ordernumber,
bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor ON bor.[order number] = bni.ordernumber;
If I specify the order numbers I'd like by adding the following:
WHERE bni.ordernumber = '48911'
then I see the complete table I'd like: all the fields from the table I've joined are populated properly.
If I add more order numbers, it works too:
WHERE bni.ordernumber IN ('48911', '83716', '98811', ...)
Now for the problem:
Suppose I want to select everything in the table where another field, say CountryOfOrigin, is not null. I'm not going to enter several thousand order numbers; I just want to use a where clause to weed out the rows with incomplete data.
So I add the following to my original query:
WHERE bor.CountryOfOrigin IS NOT NULL
When I execute, I get this error:
Msg 8114, Level 16, State 5, Line 1
Error converting data type varchar to float.
I get the same error if I even simply use this as a where clause:
WHERE bni.ordernumber IS NOT NULL
Why is this the case? When I specify the order number, the join works well; when I want to select many order numbers, I get a conversion error.
Any help/insight?

The SQL Server query optimiser can choose different paths to get your results, even with the same query from minute to minute.
In this query, say:
SELECT ordernumber,
bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor ON bor.[order number] = bni.ordernumber
WHERE bni.ordernumber = '48911';
The query optimiser may, for example, take one of two paths:
It may choose to use BeanInfo as the "driving" table, use an index to narrow down the rows in that table to, say, a single row with order number 48911, and then join to BeanOrderRecord using just that one order number.
It may choose to use BeanOrderRecord as the driving table, join the two tables together by order number to get a full set of results, and then filter that resultset by the order number.
Which path the query optimiser takes will depend on a variety of things, including defined indexes, the number of rows in the table, cardinality, and so on.
Now, if it just so happens that one of your order numbers isn't convertible to a float—say someone typed '!2345' by accident—the first optimiser choice may always work, and the second one may always fail. But you don't get to choose which path the optimiser takes.
This is why you're seeing what you think of as weird results. In one of your queries, all the order numbers are being analysed, and that's triggering the error; in the other, only order numbers that are convertible to float are being analysed, so there's no error. But it's basically just luck that it's working out the way it is. It could just as well be the other way around, or neither query might ever work.
This is one reason it's bad to store things in inappropriate data types. Fixing that would be the obvious solution.
A dirty and terrible fix, however, might be to always cast your FLOAT to a VARCHAR when doing the order number comparison, as I believe it's always safe to cast from FLOAT to VARCHAR. Though you may need to experiment to make sure the resulting VARCHAR value is formatted the same as your order number (or cast to INTEGER first...)
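For example, a sketch of that dirty fix (assuming the order numbers are whole numbers small enough to fit in an INT, so casting to INT first avoids FLOAT formatting such as scientific notation):
SELECT bni.ordernumber,
       bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
-- cast the FLOAT side to VARCHAR so SQL Server never has to convert the
-- (possibly non-numeric) VARCHAR side to FLOAT
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor
    ON CONVERT(varchar(20), CAST(bor.[order number] AS int)) = bni.ordernumber
WHERE bor.CountryOfOrigin IS NOT NULL;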
You'll have to resort to some quite fiddly trickery to get any performance out of your existing setup, though. If they were both VARCHAR values you could easily make the table join very fast by indexing each order number column, but as it is the casting you'll have to do will render normal indexes unusable for a join.
If you're using a recent version of SQL Server, you can use TRY_CAST to find the problem row(s):
SELECT * FROM BeanInfo WHERE ordernumber IS NOT NULL AND TRY_CAST(ordernumber AS FLOAT) IS NULL
...will find any VARCHAR ordernumber which can't be converted to a FLOAT (i.e. the row(s) that break the join).

Related

Simple query hangs forever on large table

I'm trying to crosswalk some code values from another developer's code using the business objects frontend (I know, it's sub-optimal, but they haven't given me back-end access).
What I need to do is just pull a record from the relevant table to compare code values to display values. I'm guessing the problem has something to do with the table containing millions of records. Even when I narrow my query to one value, try only records from today, and set Max rows retrieved to 1, it's hanging forever.
The code it generated for my query is:
SELECT
CLINICAL_EVENT.EVENT_CD,
CV_EVENT.DISPLAY
FROM
CLINICAL_EVENT,
CODE_VALUE CV_EVENT
WHERE
( CLINICAL_EVENT.EVENT_CD=CV_EVENT.CODE_VALUE )
AND
(
CLINICAL_EVENT.EVENT_CD = 338743225
AND
CLINICAL_EVENT.EVENT_END_DT_TM
> '16-02-2017 00:00:00'
)
Can you by chance avoid the implicit cross join in your query by using explicit JOIN syntax instead of the comma notation? Perhaps the engine is optimizing to avoid the cross join, perhaps not.
SELECT
CLINICAL_EVENT.EVENT_CD,
CV_EVENT.DISPLAY
FROM
CLINICAL_EVENT
INNER JOIN CODE_VALUE CV_EVENT
on CLINICAL_EVENT.EVENT_CD=CV_EVENT.CODE_VALUE
WHERE CLINICAL_EVENT.EVENT_CD = 338743225
AND CLINICAL_EVENT.EVENT_END_DT_TM > '16-02-2017 00:00:00'
Additionally, what data type is EVENT_END_DT_TM? Explicitly casting your '16-02-2017 00:00:00' literal to a date or datetime might aid performance.
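If the back end is Oracle (an assumption; the question doesn't say), that explicit cast might look like this, with the format mask matching the literal:
AND CLINICAL_EVENT.EVENT_END_DT_TM
    > TO_DATE('16-02-2017 00:00:00', 'DD-MM-YYYY HH24:MI:SS')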
Expanding a bit on my comment:
The code values and corresponding display values you want to examine are effectively both coming from table CODE_VALUE. The only thing you're gaining from the join is duplication of those results according to the number of times the code value appears on the CLINICAL_EVENT rows satisfying the date criterion (in a sense that encompasses suppressing all appearances if there are no matching rows).
You seem to want simply to compare the code value and corresponding description, rather than to evaluate how many times that code appears. In that case, you are incurring a lot of unneeded work -- and possibly even some unwanted work -- by joining CODE_VALUE to CLINICAL_EVENT. Instead, just select the wanted row(s) directly from CODE_VALUE alone.
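In other words, a sketch along these lines (reusing the column names from the generated query) should be all that's needed:
SELECT cv.CODE_VALUE,
       cv.DISPLAY
FROM CODE_VALUE cv
WHERE cv.CODE_VALUE = 338743225;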

Inconsistent results from BigQuery: same query, different number of rows

I noticed today that one of my queries was returning inconsistent results: every time I run it I get a different number of rows returned (cache deactivated).
Basically the query looks like this:
SELECT *
FROM mydataset.table1 AS t1
LEFT JOIN EACH mydataset.table2 AS t2
ON t1.deviceId=t2.deviceId
LEFT JOIN EACH mydataset.table3 AS t3
ON t2.email=t3.email
WHERE t3.email IS NOT NULL
AND (t3.date IS NULL OR DATE_ADD(t3.date, 5000, 'MINUTE')<TIMESTAMP('2016-07-27 15:20:11') )
The tables are not updated between each query. So I'm wondering if you also have noticed that kind of behaviour.
I usually make queries that return a lot of rows (>1000), so a few missing rows here and there are hardly noticeable. But this query returns only a few rows, and the count varies every time between 10 and 20 rows :-/
If a Google engineer is reading this, here are two Job ID of the same query with different results:
picta-int:bquijob_400dd739_1562d7e2410
picta-int:bquijob_304f4208_1562d7df8a2
Unless I'm missing something, the query that you provide is completely deterministic and so should give the same result every time you execute it. But you say it's "basically" the same as your real query, so this may be due to something you changed.
There are a couple of things you can do to try to find the cause:
replace select * by an explicit selection of fields from your tables (a combination of fields that uniquely determine each row)
order the table by these fields, so that the order becomes the same each time you execute the query
simplify your query. In the above query, you can remove the first condition and turn the two left outer joins into inner joins and get the same result (see the sketch below). After that, you could start removing tables and conditions one by one.
After each step, check if you still get different result sets. Then when you have found the critical step, try to understand why it causes your problem. (Or ask here.)
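For instance, the simplification described in the last step might look like this sketch (same tables, legacy BigQuery SQL; the explicit field list and ordering are only illustrative):
SELECT t1.deviceId, t2.email, t3.date
FROM mydataset.table1 AS t1
JOIN EACH mydataset.table2 AS t2
ON t1.deviceId = t2.deviceId
JOIN EACH mydataset.table3 AS t3
ON t2.email = t3.email
WHERE t3.date IS NULL
   OR DATE_ADD(t3.date, 5000, 'MINUTE') < TIMESTAMP('2016-07-27 15:20:11')
ORDER BY t1.deviceId, t2.email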

How to do damage with SQL by adding to the end of a statement?

Perhaps I am not creative or knowledgeable enough with SQL... but it looks like there is no way to do a DROP TABLE or DELETE FROM within a SELECT without the ability to start a new statement.
Basically, we have a situation where our codebase has some gigantic, "less-than-robust" SQL generation component that never uses prepared statements and we now have an API that interacts with this legacy component.
Right now we can modify a query by appending to the end of it, but have been unable to insert any semicolons. Thus, we can do something like this:
/query?[...]&location_ids=loc1')%20or%20L1.ID%20in%20('loc2
which will result in this
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('loc1') or L1.ID in ('loc2');...
This is just one example.
Basically we can append pretty much anything to the end of any/most generated SQL queries, short of adding a semicolon.
Any ideas on how this could potentially do some damage? Can you add something to the end of a SQL query that deletes from or drops tables? Or create a query so absurd that it takes up all CPU and never completes?
You said that this:
/query?[...]&location_ids=loc1')%20or%20L1.ID%20in%20('loc2
will result in this:
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('loc1') or L1.ID in ('loc2');
so it looks like this:
/query?[...]&location_ids=');DROP%20TABLE users;--
will result in this:
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('');DROP TABLE users;--');
which is a SELECT, a DROP and a comment.
If it’s not possible to inject another statement, you're limited to the existing statement and its abilities.
Like in this case, if you are limited to SELECT and you know where the injection happens, have a look at PostgreSQL’s SELECT syntax to see what your options are. Since you’re injecting into the WHERE clause, you can only inject additional conditions or other clauses that are allowed after the WHERE clause.
If the result of the SELECT is returned back to the user, you may want to add your own SELECT with a UNION operation. However, PostgreSQL requires compatible data types for corresponding columns:
The two SELECT statements that represent the direct operands of the UNION must produce the same number of columns, and corresponding columns must be of compatible data types.
So you would need to know the number and data types of the columns of the original SELECT first.
The number of columns can be detected with the ORDER BY clause by specifying the column number like ORDER BY 3, which would order the result by the values of the third column. If the specified column does not exist, the query will fail.
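Following the loc1 pattern from above, such a probe might look like this (hypothetical payload; the trailing comment swallows the quote and parenthesis the application appends):
loc1') ORDER BY 3 --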
Now, after determining the number of columns, you can inject a UNION SELECT with the appropriate number of columns, using a NULL value for each column of your UNION SELECT:
loc1') UNION SELECT null,null,null,null,null --
Next, determine the type of each column by trying a different value for each column, one by one. If the types of a column are incompatible, you may get an error that hints at the expected data type, like:
ERROR: invalid input syntax for integer
ERROR: UNION types text and integer cannot be matched
After you have determined enough column types (one column may be sufficient when it’s one that is presented to the user), you can change your SELECT to select whatever you want.
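For instance, once the layout is known, a payload along these lines could start pulling data out (hypothetical: it assumes the five-column layout from above and that the second column is rendered back to the user as text; pg_user.usename is just one example of an interesting PostgreSQL catalog column):
loc1') UNION SELECT null, usename, null, null, null FROM pg_user --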

Where clause affecting join

Dumb question time. Oracle 10g.
Is it possible for a where clause to affect a join?
I've got a query in the form of:
select * from
(select product, product_name from products p
join product_serial ps on p.id = ps.id
join product_data pd on pd.product_value = to_number(p.product_value)) product_result
where product_name like '%prototype%';
Obviously this is a contrived example. No real need to show the table structure as it's all imaginary. Unfortunately, I can't show the real table structure or query. In this case, p.product_value is a VARCHAR2 field which in certain rows has an ID stored inside it rather than text. (Yes, bad design - but something I inherited and am unable to change)
The issue is in the join. If I leave out the where clause, the query works and rows are returned. However, if I add the where clause, I get "invalid number" error on the pd.product_value = to_number(p.product_value) join condition.
Obviously, the "invalid number" error happens when rows are joined which contain non-digits in the p.product_value field. However, my question is how are those rows being selected? If the join succeeds without the outer where clause, shouldn't the outer where clause just select rows from the result of the join? It appears what is happening is the where clause is affecting what rows are joined, despite the join being in an inner query.
Is my question making sense?
It affects the plan that's generated.
The actual order that tables are joined (and so filtered) is not dictated by the order you write your query, but by the statistics on the tables.
In one version, the plan generated coincidentally means that the 'bad' rows never get processed, because the preceding joins filtered the result set down to the point that they're never joined on.
The introduction of the WHERE clause has meant that ORACLE now believes a different order of join is better (because filtering by the product name requires a certain index, or because it narrows the data down a lot, etc).
This new order means that the 'bad' rows get processed before the join that filters them out.
I would endeavour to clean the data before querying it, possibly by creating a derived column where the value is already cast to a number, or left as NULL if it is not possible to do so.
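A sketch of that cleanup, assuming the numeric values are plain unsigned integers (product_value_num is a made-up column name; adjust the regular expression to whatever counts as a valid number in this data):
ALTER TABLE products ADD (product_value_num NUMBER);

UPDATE products
   SET product_value_num = CASE
                             -- convert only when the value is all digits;
                             -- everything else is left as NULL
                             WHEN REGEXP_LIKE(TRIM(product_value), '^[0-9]+$')
                             THEN TO_NUMBER(product_value)
                           END;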
You can also use EXPLAIN PLAN to see the different plans being generated from your queries.
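For example, using the query from the question:
EXPLAIN PLAN FOR
select * from
(select product, product_name from products p
 join product_serial ps on p.id = ps.id
 join product_data pd on pd.product_value = to_number(p.product_value)) product_result
where product_name like '%prototype%';

-- then display the plan the optimiser chose
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);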
Short answer: yes.
Long answer: the query engine is free to rewrite your query however it wants, as long as it returns the same results. All of the query is available to it to use for the purpose of producing the most efficient query it can.
In this case, I'd guess that there is an index that covers what you are wanting, but it doesn't cover product_name; when you add that to the where clause, the index isn't used and instead there's a scan where both conditions are tested at the same time, hence your error.
Which is really an error in your join condition: you shouldn't be using to_number unless you are sure the value is a number.
I guess your to_number(p.product_value) only applies for rows with a valid product_name.
What happens is that your join is applied before your where clause, resulting in the failure of the to_number function.
What you need to do is include your product_name like '%prototype%' as a JOIN clause like this:
select * from
(select product, product_name from products p
join product_serial ps on p.id = ps.id
join product_data pd on product_name like '%prototype%' AND
pd.product_value = to_number(p.product_value));
For more background (and a really good read), I'd suggest reading Jonathan Gennick's Subquery Madness.
Basically, the problem is that Oracle is free to evaluate predicates in any order. So it is free to push (or not push) the product_name predicate into your subquery. It is free to evaluate the join conditions in any order. So if Oracle happens to pick a query plan where it filters out the non-numeric product_value rows before it applies the to_number, the query will succeed. If it happens to pick a plan where it applies the to_number before filtering out the non-numeric product_value rows, you'll get an error. Of course, it's also possible that it will return the first N rows successfully and then you'll get an error when you try to fetch row N+1, because row N+1 is the first time that it is trying to apply the to_number predicate to non-numeric data.
Other than fixing the data model, you could potentially throw some hints into the query to force Oracle to evaluate the predicate that ensures that all the non-numeric data is filtered out before the to_number predicate is applied. But in general, it's a bit challenging to fully hint a query in a way that will force the optimizer to always evaluate things in the "proper" order.
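As a sketch of that kind of hinting (NO_MERGE and NO_PUSH_PRED are real Oracle hints that discourage merging the inline view and pushing the outer predicate into it, but whether they actually pin the evaluation order for your data is something to verify with EXPLAIN PLAN, not a guarantee):
select /*+ NO_MERGE(product_result) NO_PUSH_PRED(product_result) */ *
from (select product, product_name from products p
      join product_serial ps on p.id = ps.id
      join product_data pd on pd.product_value = to_number(p.product_value)) product_result
where product_name like '%prototype%';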

I have a large query, how do I debug this?

So, I get this error message:
EDT ERROR: syntax error at or near "union" at character 436
The query in question is a large query that consists of 12 smaller queries all connected together with UNION ALL, and each small query has two inner join statements. So, something like:
SELECT table.someid as id
,table.lastname as name
,table2.groupname as groupname
, 'Leads ' as Type
from table
inner join table3 on table3.specificid = table.someid
INNER JOIN table2 on table3.specificid=table2.groupid
where table3.deleted=0
and table.someid > 0
and table2.groupid in ('2','3','4')
LIMIT 5
UNION all
query2....
Note that table2 and table3 are the same tables in each query, and the fields from table2 and table3 are also the same, I think.
Quick question (I am still kinda new to all this):
What does 'Leads ' as Type mean? Unlike the other statements preceding an AS, this one isn't written like table.something.
Quick edit question: What does table2.groupid in ('2','3','4') mean?
I checked each small query one by one; each one works and returns a result, though the results are always empty for some reason (this may or may not be dependent on the user logged in, though, as some PHP code generated this query).
As for the results themselves, most of them look something like this (they are arranged horizontally though):
id(integer)
name (character varying(80))
groupname (character varying(100))
type (unknown)
The difference in the results are twofold:
1) Most of the results contain the same field names, but quite a few of them have different field lengths. Like some will say character varying (80), while others will say character varying (100); please correct me if this is actually not field length.
2) Two of the queries contain different fields, but only the id field is different, and it's probably because they don't have the "as id" part.
I am not quite sure what the requirements of UNION ALL are, but I think it is meant to only work if all the fields are the same; if that funky number changes (the one in the brackets), are the fields considered to be different even if they have the same name?
Also, what's strange is that some of the queries returned the exact same fields, with the same field length, so I tried to UNION ALL only those queries, but no luck; I still got a syntax error at UNION.
Another important thing I should mention is that the DB used to be MySQL, but we changed to PostgreSQL, so this bug might be a result of the change (i.e. code that might work in MySQL but not in Postgres).
Thanks for your time.
You can have only one "LIMIT xxx" clause at the top level: at the end of the query, and not before the UNION.
The error you get is due to missing parentheses here:
...
LIMIT 5
UNION all
...
The manual:
(ORDER BY and LIMIT can be attached to a subexpression if it is
enclosed in parentheses. Without parentheses, these clauses will be
taken to apply to the result of the UNION, not to its right-hand input
expression.)
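So a fixed version keeps each branch's LIMIT inside parentheses, along the lines of this sketch (reusing the placeholder names from the question):
(SELECT table.someid AS id,
        table.lastname AS name,
        table2.groupname AS groupname,
        'Leads ' AS Type
 FROM table
 INNER JOIN table3 ON table3.specificid = table.someid
 INNER JOIN table2 ON table3.specificid = table2.groupid
 WHERE table3.deleted = 0
   AND table.someid > 0
   AND table2.groupid IN ('2', '3', '4')
 LIMIT 5)
UNION ALL
-- each of the remaining queries wrapped the same way, with its own LIMIT
(query2.... LIMIT 5)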
Later example:
Sum results of a few queries and then find top 5 in SQL
The only real way I have found to debug big queries is to break the query into understandable parts and debug each subexpression independently:
Does each show the expected rows?
Are the resulting fields and types as expected?
For union, do the result fields and types exactly match corresponding other subexpressions?