I have a large query, how do I debug this?

So, I get this error message:
EDT ERROR: syntax error at or near "union" at character 436
The query in question is a large query that consists of 12 smaller queries all connected together with UNION ALL, and each small query has two inner join statements. So, something like:
SELECT table.someid as id
,table.lastname as name
,table2.groupname as groupname
, 'Leads ' as Type
from table
inner join table3 on table3.specificid = table.someid
INNER JOIN table2 on table3.specificid=table2.groupid
where table3.deleted=0
and table.someid > 0
and table2.groupid in ('2','3','4')
LIMIT 5
UNION all
query2....
Note that table2 and table3 are the same tables in each query, and the fields from table2 and table3 are also the same, I think.
Quick question (I am still kinda new to all this):
What does 'Leads ' as Type mean? Unlike the other expressions before an AS, this one isn't written like table.something.
Quick edit question: What does table2.groupid in ('2','3','4') mean?
I checked each small query one by one; each one runs and returns a result, though the result sets are always empty for some reason (this may or may not depend on the user logged in, as some PHP code generated this query).
As for the results themselves, most of them look something like this (they are arranged horizontally though):
id(integer)
name (character varying(80))
groupname (character varying(100))
type (unknown)
The differences in the results are twofold:
1) Most of the results contain the same field names, but quite a few of them have different field lengths. Some will say character varying(80), while others say character varying(100); please correct me if this is actually not field length.
2) Two of the queries contain different fields, but only the id field is different, probably because they don't have the "as id" part.
I am not quite sure what the requirements of UNION ALL are, but I think it is meant to work only if all the fields are the same. If that number in the brackets changes, are the fields considered different even if they have the same name?
Also, what's strange is that some of the queries returned the exact same fields, with the same field length, so I tried to UNION ALL only those queries, but no luck, still got a syntax error at UNION.
Another important thing I should mention is that the DB used to be MySQL, but we changed to PostgreSQL, so this bug might be a result of the change (i.e. code that might work in MySQL but not in Postgres).
Thanks for your time.

You can have only one top-level "LIMIT xxx" clause: at the end of the query, not before the UNION.

The error you get is due to missing parentheses here:
...
LIMIT 5
UNION all
...
The manual:
(ORDER BY and LIMIT can be attached to a subexpression if it is enclosed in parentheses. Without parentheses, these clauses will be taken to apply to the result of the UNION, not to its right-hand input expression.)
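Applied to the query in question, each input of the UNION ALL gets wrapped in parentheses so it can carry its own LIMIT (a sketch using the question's table names):
(
  SELECT table.someid AS id
       , table.lastname AS name
       , table2.groupname AS groupname
       , 'Leads ' AS Type
  FROM table
  INNER JOIN table3 ON table3.specificid = table.someid
  INNER JOIN table2 ON table3.specificid = table2.groupid
  WHERE table3.deleted = 0
    AND table.someid > 0
    AND table2.groupid IN ('2', '3', '4')
  LIMIT 5
)
UNION ALL
(query2....)  -- likewise wrapped, with its own LIMIT
A trailing ORDER BY or LIMIT outside the parentheses then applies to the combined result.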
Later example:
Sum results of a few queries and then find top 5 in SQL

The only real way I have found to debug big queries is to break it into understandable parts and debug each subexpression independently:
Does each show the expected rows?
Are the resulting fields and types as expected?
For union, do the result fields and types exactly match corresponding other subexpressions?
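In PostgreSQL you can answer the last two questions for one subexpression directly with pg_typeof (a sketch; the column aliases are the ones from the question):
SELECT pg_typeof(id), pg_typeof(name), pg_typeof(groupname), pg_typeof(type)
FROM (
  -- paste one of the twelve subqueries here, without its LIMIT
  SELECT table.someid AS id, table.lastname AS name,
         table2.groupname AS groupname, 'Leads ' AS type
  FROM ...
) AS sub
LIMIT 1;
If two subqueries report different types at the same column position, that position is a likely culprit for UNION ALL complaints.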


Why does joining on different data types produce a conversion type inconsistently?

As I try to join tables together on a value that's represented in different data types, I get really odd errors. Please consider the following:
I have two tables; let's say one is in database "CoffeeWarehouse," and the other is in database "CoffeeAnalytics":
Table 1: CoffeeWarehouse.dbo.BeanInfo
Table 2: CoffeeAnalytics.dbo.BeanOrderRecord
Now, both tables have a field called OrderNumber (although in table 2, it's spelled as [order number]); in Table 1, it's represented as a string, and in Table 2, it's represented as a float.
I proceed to join the tables together:
SELECT ordernumber,
bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor ON bor.[order number] = bni.ordernumber;
If I specify the order numbers I'd like by adding the following:
WHERE bni.ordernumber = '48911'
then I see the complete table I'd like- all the fields from the table I've joined are populated properly.
If I add more order numbers, it works too:
WHERE bni.ordernumber IN ('48911', '83716', '98811', ...)
Now for the problem:
Suppose I want to select everything in the table where another field, i.e. CountryOfOrigin, is not null. I'm not going to enter several thousand order numbers- I just want to use a where clause to weed out the rows with incomplete data.
So I add the following to my original query:
WHERE bor.CountryOfOrigin IS NOT NULL
When I execute, I get this error:
Msg 8114, Level 16, State 5, Line 1
Error converting data type varchar to float.
I get the same error if I even simply use this as a where clause:
WHERE bni.ordernumber IS NOT NULL
Why is this the case? When I specify the ordernumber, the join works well- when I want to select many ordernumbers, I get a conversion error.
Any help/insight?
The SQL Server query optimiser can choose different paths to get your results, even with the same query from minute to minute.
In this query, say:
SELECT ordernumber,
bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor ON bor.[order number] = bni.ordernumber
WHERE bni.ordernumber = '48911';
The query optimiser may, for example, take one of two paths:
1) It may choose to use BeanInfo as the "driving" table, use an index to narrow down the rows in that table to, say, a single row with order number 48911, and then join to BeanOrderRecord using just that one order number.
2) It may choose to use BeanOrderRecord as the driving table, join the two tables together by order number to get a full set of results, and then filter that resultset by the order number.
Which path the query optimiser takes will depend on a variety of things, including defined indexes, the number of rows in the table, cardinality, and so on.
Now, if it just so happens that one of your order numbers isn't convertible to a float—say someone typed '!2345' by accident—the first optimiser choice may always work, and the second one may always fail. But you don't get to choose which path the optimiser takes.
This is why you're seeing what you think of as weird results. In one of your queries, all the order numbers are being analysed, and that's triggering the error; in another, only order numbers that are convertible to float are being analysed, so there's no error. But it's basically just luck that it's working out the way it is. It could just as well be the other way around, or neither query might ever work.
This is one reason it's bad to store things in inappropriate data types. Fixing that would be the obvious solution.
A dirty and terrible fix, however, might be to always cast your FLOAT to a VARCHAR when doing the order number comparison, as I believe it's always safe to cast from FLOAT to VARCHAR. Though you may need to experiment to make sure the resulting VARCHAR value is formatted the same as your order number (or cast to INTEGER first...)
You'll have to resort to some quite fiddly trickery to get any performance out of your existing setup, though. If they were both VARCHAR values you could easily make the table join very fast by indexing each order number column, but as it is the casting you'll have to do will render normal indexes unusable for a join.
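A sketch of that workaround on the question's join, going through an integer type first so the float doesn't come out in scientific notation (this assumes all genuine order numbers are whole numbers that fit in a bigint):
SELECT bni.ordernumber, bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor
  -- float -> bigint -> varchar, so 48911 becomes '48911' rather than '4.8911e+004'
  ON CONVERT(varchar(30), CONVERT(bigint, bor.[order number])) = bni.ordernumber
WHERE bor.CountryOfOrigin IS NOT NULL;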
If you're using a recent version of SQL Server, you can use TRY_CAST to find the problem row(s). Note that the conversion that actually fails is VARCHAR to FLOAT, so it's the varchar side you need to check:
SELECT * FROM BeanInfo WHERE TRY_CAST(ordernumber AS FLOAT) IS NULL AND ordernumber IS NOT NULL
...will find rows whose VARCHAR ordernumber can't be converted to a FLOAT.

How to find rows in a table with similar string values

I have a Microsoft SQL Server database table with around 7 million crowd-sourced records, primarily containing a string name value with some related details. For nearly every record it seems there are a dozen similar typo records and I am trying to do some fuzzy matching to identify record groups such as "Apple", "Aple", "Apples", "Spple", etc. These names can also contain multiple words with spaces between them.
I've come up with a solution of using an edit-distance scalar function that returns the number of keystrokes required for transformation from string1 to string2, and using that function to join the table to itself. As you can imagine, this doesn't perform that well, since it's having to execute the function millions of times to evaluate a join.
So I put that in a cursor so that at least only one string1 is evaluated at a time. This at least gets results coming out, but after letting it run for weeks it has only made it through 150,000 records. With 7 million to evaluate, I don't think I have the kind of time my method is going to take.
I put full text indexes on the string names, but couldn't really find a way to use the full text predicates when I didn't have a static value I was searching.
Any ideas how I could do something like the following in a way that wouldn't take months to run?
SELECT t1.name, t2.name
FROM names AS t1
INNER JOIN names AS t2
ON EditDistance(t1.name,t2.name) = 1
AND t1.id != t2.id
You may use the DIFFERENCE ( character_expression , character_expression ) function to evaluate the difference between the SOUNDEX codes of the two character expressions.
DIFFERENCE returns an integer between 0 (the greatest possible difference) and 4 (the least possible difference, i.e. identical SOUNDEX codes). You could use this value to determine how closely matched the strings are (e.g. a condition like DIFFERENCE(column1, column2) >= 3 would match records whose SOUNDEX values differ in at most one position of the code).
Here is a link to the documentation of the DIFFERENCE function: https://technet.microsoft.com/en-us/library/ms188753(v=sql.105).aspx
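Applied to the self-join from the question, it might look like this (a sketch; DIFFERENCE is built into SQL Server, but it measures phonetic similarity via SOUNDEX, not true edit distance, so it will group differently than your EditDistance function):
SELECT t1.name, t2.name
FROM names AS t1
INNER JOIN names AS t2
  ON t1.id < t2.id                       -- each pair once, never a row against itself
WHERE DIFFERENCE(t1.name, t2.name) = 4;  -- 4 = identical SOUNDEX codes
Because SOUNDEX is deterministic, you could also persist SOUNDEX(name) in a computed column, index it, and join on equality of the codes, which turns the all-pairs scan into an index match.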
You need to find a way to avoid comparing each record to every other record. If you are just using a single field, you can use a special data structure like a trie, for example https://github.com/mattandahalfew/Levenshtein_search

Query applying Where clause BEFORE Joins?

I'm honestly really confused here, so I'll try to keep it simple.
We have Table A:
id
Table B:
id || number
Table A is a "prefilter" to B, since B contains a lot of different objects, including A.
So my query, trying to get all A's with a filter;
SELECT * FROM A a
JOIN B b ON b.id = a.id
WHERE CAST(SUBSTRING(b.number, 2, 30) AS integer) between 151843 and 151865
Since ALL instances of A start with a letter ("X******"), I just want to strip the first letter and let the filter do its work with the number specified by the user.
At first glance, there should be absolutely no worries. But it seems I was wrong. And on something I didn't expect to be...
It seems like my WHERE clause is executed BEFORE my JOIN. Therefore, since many B's have numbers with more than one letter at the start, I get an invalid conversion, despite the fact that it would NEVER happen if we stayed within A's.
I always thought that where clause was executed after joins, but in this case, it seems postgres wants to prove me wrong.
Any explanations ?
SQLFiddle demonstrating problem: http://sqlfiddle.com/#!15/cd7e6e/7
And even with the SubQuery, it still makes the same error...
You can use the regex form of substring to extract just the digits: CAST(substring(B.number from '\d+') AS integer).
See working example here: http://sqlfiddle.com/#!15/cd7e6e/18
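Plugged into the original query (a sketch; '\d+' grabs the first run of digits, and substring returns NULL when nothing matches, so the cast is safe no matter when the planner evaluates it):
SELECT *
FROM A a
JOIN B b ON b.id = a.id
WHERE CAST(substring(b.number from '\d+') AS integer) BETWEEN 151843 AND 151865;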
SQL is a declarative language. For a select statement, you declare the criteria the data you are looking for must meet. You don't get to choose the execution path, your query isn't executed procedurally.
Thus the optimizer is free to choose any execution plan it likes, as long as it returns records specified by your criteria.
I suggest you change your query to cast to string instead of to integer. Something like:
WHERE SUBSTRING(b.number, 2, 30) between CAST(151843 AS varchar) and CAST(151865 AS varchar)
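If you'd rather keep the comparison numeric, another common workaround is to guard the cast with a CASE expression, which PostgreSQL will normally evaluate only for rows that pass the test (a sketch; assumes A's numbers are exactly one letter followed by digits):
SELECT *
FROM A a
JOIN B b ON b.id = a.id
WHERE CASE WHEN b.number ~ '^[A-Za-z]\d+$'
           THEN CAST(SUBSTRING(b.number, 2, 30) AS integer)
      END BETWEEN 151843 AND 151865;  -- CASE yields NULL for non-matching rows, so no bad cast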
Do the records of A that are in B have the same id in table B as in A? If those records are inserted in a different order, this may not be the case, and the query may therefore return different records than expected.

How to do damage with SQL by adding to the end of a statement?

Perhaps I am not creative or knowledgeable enough with SQL... but it looks like there is no way to do a DROP TABLE or DELETE FROM within a SELECT without the ability to start a new statement.
Basically, we have a situation where our codebase has some gigantic, "less-than-robust" SQL generation component that never uses prepared statements and we now have an API that interacts with this legacy component.
Right now we can modify a query by appending to the end of it, but have been unable to insert any semicolons. Thus, we can do something like this:
/query?[...]&location_ids=loc1')%20or%20L1.ID%20in%20('loc2
which will result in this
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('loc1') or L1.ID in ('loc2');...
This is just one example.
Basically we can append pretty much anything to the end of any/most generated SQL queries, short of adding a semicolon.
Any ideas on how this could potentially do some damage? Can you add something to the end of a SQL query that deletes from or drops tables? Or create a query so absurd that it takes up all CPU and never completes?
You said that this:
/query?[...]&location_ids=loc1')%20or%20L1.ID%20in%20('loc2
will result in this:
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('loc1') or L1.ID in ('loc2');
so it looks like this:
/query?[...]&location_ids=');DROP%20TABLE users;--
will result in this:
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('');DROP TABLE users;--');
which is a SELECT, a DROP and a comment.
If it’s not possible to inject another statement, you’re limited to the existing statement and its abilities.
Like in this case, if you are limited to SELECT and you know where the injection happens, have a look at PostgreSQL’s SELECT syntax to see what your options are. Since you’re injecting into the WHERE clause, you can only inject additional conditions or other clauses that are allowed after the WHERE clause.
If the result of the SELECT is returned back to the user, you may want to add your own SELECT with a UNION operation. However, PostgreSQL requires compatible data types for corresponding columns:
The two SELECT statements that represent the direct operands of the UNION must produce the same number of columns, and corresponding columns must be of compatible data types.
So you would need to know the number and data types of the columns of the original SELECT first.
The number of columns can be detected with the ORDER BY clause by specifying the column number like ORDER BY 3, which would order the result by the values of the third column. If the specified column does not exist, the query will fail.
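With the question's URL format, the probing payloads might look like this (hypothetical values):
loc1') ORDER BY 1 --
loc1') ORDER BY 2 --
loc1') ORDER BY 3 --
Keep increasing the number; the first payload that errors out tells you the column count is one less.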
Now, after determining the number of columns, you can inject a UNION SELECT with the appropriate number of columns, with a null value for each column of your UNION SELECT:
loc1') UNION SELECT null,null,null,null,null --
Now you determine the type of each column by substituting a concrete value for each null, one at a time. If the value’s type is incompatible with the column’s, you may get an error that hints at the expected data type, like:
ERROR: invalid input syntax for integer
ERROR: UNION types text and integer cannot be matched
After you have determined enough column types (one column may be sufficient when it’s one that is presented to the user), you can change your SELECT to select whatever you want.
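For example, sticking with the five-column layout assumed above and supposing the second column turned out to be text, a payload like this (hypothetical) would pull table names out of the catalog:
loc1') UNION SELECT null, table_name, null, null, null FROM information_schema.tables --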

In SQL, 'distinct' reduces the number of result rows from one to zero

I have a SQL statement of the following structure:
select distinct ...
from table1,
(select from table2, table3, table4 where ...)
where ...
order by ...
With certain values in the where clauses, the statement returns zero rows in the result set. When I remove the 'distinct' keyword, it returns a single row. I would expect to see a single result row in both cases. Is there some property of the 'distinct' keyword that I am not aware of and that causes this behavior?
The database is Oracle 11g.
What you describe is not the expected behaviour of DISTINCT. This is:
SQL> select * from dual
2 /
D
-
X
1 row selected.
SQL> select distinct * from dual
2 /
D
-
X
1 row selected.
SQL>
So, if what you say is happening really is what is happening then it's a bug. However, you also say it's a rare occurrence which means there is a good chance it is some peculiarity in your data and/or transient conditions in your environment, and not a bug.
You need to create a reproducible test case, for two reasons. Partly, nobody will be able to investigate your problem without one. But mainly because building a test case is an investigation in its own right: attempting to isolate the precise combination of data and/or ambient factors often generates the insight which leads to a solution.
It turned out that one of the sub-selects resulted in a data set that contained, among others, a row where every column was NULL. It seems that this row influenced the evaluation of the DISTINCT in a non-obvious way (at least to me). Maybe this is due to some under-the-hood SQL optimizations. After I removed the cause of this NULL-filled row, the problem is gone and the statement evaluates to one row in the result as it should.