Force Presto to maintain order of WHERE clauses - sql

I'm trying to run something like the following query:
SELECT * FROM foo WHERE cardinality(bar) > 0 AND bar[1] = '...';
However, I'm getting Query failed: Array subscript out of bounds. I'm assuming this is because Presto is trying to optimize the query by checking bar[1] = '...' before checking cardinality(bar) > 0. Is there a way to force Presto to maintain the order of the clauses?

I've solved this in two ways when I've needed it.
1. Use the element_at function instead of the [] subscript notation. element_at returns NULL when indexing past the end of an array, so you could reduce your example to a single condition. element_at also works in the SELECT clause, although it isn't needed there given your WHERE clause:
SELECT bar[1] FROM foo WHERE element_at(bar, 1) = '...';
2. Do the first condition in a subquery using a WITH clause:
WITH populated_foo AS (SELECT * FROM foo WHERE cardinality(bar) > 0)
SELECT * FROM populated_foo WHERE bar[1] = '...';
The second approach doesn't make much sense for your example, but I've found it useful for more complex conditions involving row objects inside of arrays.
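For illustration, here is a sketch of the second approach against a hypothetical column events of type array(row(type varchar, ts bigint)); the column and field names are made up for this example:
-- events is assumed to be array(row(type varchar, ts bigint))
WITH populated_foo AS (
  SELECT * FROM foo WHERE cardinality(events) > 0
)
SELECT *
FROM populated_foo
WHERE events[1].type = 'click';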

Related

WHERE clause returning no results

I have the following query:
SELECT *
FROM public."Matches"
WHERE 'Matches.Id' = '24e81894-2f1e-4654-bf50-b75e584ed3eb'
I'm certain there is an existing match with this Id (tried it on other Ids as well), but it returns 0 rows. I'm new to querying with PgAdmin so it's probably just a simple error, but I've read the docs up and down and can't seem to find why this is returning nothing.
Single quotes are only used for string constants in SQL. So 'Matches.Id' is a string constant and obviously not the same as '24e81894-2f1e-4654-bf50-b75e584ed3eb', thus the WHERE condition is always false (it's like writing WHERE 1 = 0).
You need to use double quotes for identifiers, the same way you did in the FROM clause.
WHERE "Matches"."Id" = ...
In general the use of quoted identifiers is strongly discouraged.
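Putting it together, the corrected query from the question would be:
SELECT *
FROM public."Matches"
WHERE "Matches"."Id" = '24e81894-2f1e-4654-bf50-b75e584ed3eb';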

Optimizing multiple identical operator and function calls in Hive?

I'm new to Hive and trying to optimize a query that is taking a while to run. I have identical calls to regexp_extract and get_json in my SELECT and WHERE statements, and I was wondering if there is a way to optimize this by storing the results from one statement and using them in the other (or if Hive is already doing something like this in the background).
Example query:
SELECT
regexp_extract(get_json(json, 'url'), '.*[&?]q=([^&]*)') as query
FROM
api_request_logs
WHERE
LENGTH(regexp_extract(get_json(json, 'url'), '.*[&?]q=([^&]*)')) > 0
Thanks!
You can use a derived table to specify the regex only once, but I don't think it will run any faster:
select * from (
select regexp_extract(get_json(json, 'url'), '.*[&?]q=([^&]*)') as query
from api_request_logs
) t where length(query) > 0
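The same idea can be written as a CTE, assuming a Hive version that supports WITH (0.13 and later); like the derived table, this only avoids repeating the expression in your SQL, not necessarily the underlying work:
-- CTE variant of the derived-table rewrite above
WITH extracted AS (
  SELECT regexp_extract(get_json(json, 'url'), '.*[&?]q=([^&]*)') AS query
  FROM api_request_logs
)
SELECT query
FROM extracted
WHERE LENGTH(query) > 0;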

Check if value exists in Postgres array

Using Postgres 9.0, I need a way to test if a value exists in a given array. So far I came up with something like this:
select '{1,2,3}'::int[] @> (ARRAY[]::int[] || value_variable::int)
But I keep thinking there should be a simpler way to do this, I just can't see it. This seems better:
select '{1,2,3}'::int[] @> ARRAY[value_variable::int]
I believe it will suffice. But if you have other ways to do it, please share!
Simpler with the ANY construct:
SELECT value_variable = ANY ('{1,2,3}'::int[])
The right operand of ANY (between parentheses) can either be a set (result of a subquery, for instance) or an array. There are several ways to use it:
SQLAlchemy: how to filter on PgArray column types?
IN vs ANY operator in PostgreSQL
Important difference: Array operators (<@, @>, && et al.) expect array types as operands and support GIN or GiST indices in the standard distribution of PostgreSQL, while the ANY construct expects an element type as its left operand and can be supported with a plain B-tree index (with the indexed expression to the left of the operator, not the other way round like it seems to be in your example). Example:
Index for finding an element in a JSON array
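A minimal sketch of that index distinction, with hypothetical table and column names:
-- GIN index on an int[] column supports the array operators @>, <@, &&
CREATE INDEX tbl_arr_gin ON tbl USING gin (arr);
-- plain B-tree index on an element column supports col = ANY (...)
CREATE INDEX tbl_col_btree ON tbl (col);
SELECT * FROM tbl WHERE arr @> ARRAY[5];              -- can use the GIN index
SELECT * FROM tbl WHERE col = ANY ('{1,2,3}'::int[]); -- can use the B-tree index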
None of this works for NULL elements. To test for NULL:
Check if NULL exists in Postgres array
Watch out for the trap I got into: when checking whether a certain value is not present in an array, you shouldn't do:
SELECT value_variable != ANY('{1,2,3}'::int[])
but use
SELECT value_variable != ALL('{1,2,3}'::int[])
instead.
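A quick illustration of the trap with a literal value in place of value_variable:
SELECT 2 != ANY ('{1,2,3}'::int[]); -- true: 2 differs from at least one element (1)
SELECT 2 != ALL ('{1,2,3}'::int[]); -- false: 2 equals one of the elements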
You can compare two arrays. If any of the values in the left array overlap the values in the right array, then it returns true. It's kind of hackish, but it works.
SELECT '{1}' && '{1,2,3}'::int[]; -- true
SELECT '{1,4}' && '{1,2,3}'::int[]; -- true
SELECT '{4}' && '{1,2,3}'::int[]; -- false
In the first and second queries, the value 1 is in the right array.
Notice that the second query is true even though the value 4 is not contained in the right array.
For the third query, no values in the left array (i.e., 4) are in the right array, so it returns false.
unnest can be used as well.
It expands an array into a set of rows, and then checking whether a value exists (or not) is as simple as using IN or NOT IN.
e.g.
id => uuid
exception_list_ids => uuid[]
select * from table where id NOT IN (select unnest(exception_list_ids) from table2)
This one works fine for me and may be useful for someone:
select * from your_table where array_column::text ilike ANY (ARRAY['%text_to_search%'::text]);
"Any" works well. Just make sure that the any keyword is on the right side of the equal to sign i.e. is present after the equal to sign.
Below statement will throw error: ERROR: syntax error at or near "any"
select 1 where any('{hello}'::text[]) = 'hello';
whereas the example below works fine:
select 1 where 'hello' = any('{hello}'::text[]);
When looking for the existence of an element in an array, proper casting is required to get past the Postgres SQL parser. Here is one example query using the array-contains operator in the join clause.
For simplicity I only list the relevant part:
table1.other_name text[]; -- an array of text
The join part of the SQL:
from table1 t1 join table2 t2 on t1.other_name::text[] @> ARRAY[t2.panel::text]
The following also works
on t2.panel = ANY(t1.other_name)
I am just guessing that the extra casting is required because the parser does not fetch the table definition to figure out the exact type of the column. Others, please comment on this.

Aggregate functions and OrderBy

I have a query like the following one:
SELECT case_filed_by,
office_code,
desg_code,
court_code,
court_case_no,
COUNT(office_code) as count
FROM registration_of_case
WHERE TRUE
AND SUBSTR(office_code, 1, 0) = SUBSTR('', 1, 0)
ORDER BY court_code, court_case_no
I am getting the following error:
ERROR: column "registration_of_case.case_filed_by" must appear in the GROUP BY clause or be used in an aggregate function LINE 1: SELECT case_filed_by,office_code,desg_code, court_code,court […]
As you describe in your comments, you actually want the number of selected rows in a separate field of your result set.
You can achieve this by using a subselect for the count and then joining these two queries.
Something like this:
SELECT case_filed_by,
office_code,
desg_code,
court_code,
court_case_no,
office_code_count
FROM registration_of_case,
(SELECT COUNT(office_code) AS office_code_count
FROM registration_of_case
WHERE TRUE
AND SUBSTR(office_code, 1, 0) = SUBSTR('', 1, 0)
) AS count_query
WHERE TRUE
AND SUBSTR(office_code, 1, 0) = SUBSTR('', 1, 0)
ORDER BY court_code, court_case_no
I couldn't test the query, but it should work or at least point you in the right direction.
You are using COUNT(), which is an aggregate function, along with a number of fields that are not part of the GROUP BY (since there is none) or in the aggregate function (except office_code).
Now, in MySQL something like this is allowed because the engine will pick one record from the group and return that (although the query cannot influence which one, that's usually okay). PostgreSQL clearly does not allow it. I don't use PostgreSQL, so I can't work it out for you.
If Postgresql has a "non-strict" mode, I suggest you enable that; otherwise, either correct your query or change database types.
I would suggest an appropriate query, if I knew what Postgresql does, and doesn't, allow.
Add a GROUP BY clause like this:
group by case_filed_by, office_code, desg_code, court_code, court_case_no
Now try executing; it will work.
The simple logic is: if you want to use an aggregate function together with other columns in the table, group by those columns.
Check it out and comment if it works.
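Applied to the query from the question, that would look something like this sketch (untested):
SELECT case_filed_by,
       office_code,
       desg_code,
       court_code,
       court_case_no,
       COUNT(office_code) AS count
FROM registration_of_case
WHERE SUBSTR(office_code, 1, 0) = SUBSTR('', 1, 0)
GROUP BY case_filed_by, office_code, desg_code, court_code, court_case_no
ORDER BY court_code, court_case_no;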

"ORDER BY ... USING" clause in PostgreSQL

The ORDER BY clause is described in the PostgreSQL documentation as:
ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...]
Can someone give me some examples how to use the USING operator? Is it possible to get an alternating order of the resultset?
A very simple example would be:
> SELECT * FROM tab ORDER BY col USING <
But this is boring, because this is nothing you can't get with the traditional ORDER BY col ASC.
Also the standard catalog doesn't mention anything exciting about strange comparison functions/operators. You can get a list of them:
> SELECT amoplefttype::regtype, amoprighttype::regtype, amopopr::regoper
FROM pg_am JOIN pg_amop ON pg_am.oid = pg_amop.amopmethod
WHERE amname = 'btree' AND amopstrategy IN (1,5);
You will notice that there are mostly < and > operators for primitive types like integer, date, etc., and some more for arrays, vectors, and so on. None of these operators will help you get a custom ordering.
In most cases where a custom ordering is required, you can get away with something like ... ORDER BY somefunc(tablecolumn) ..., where somefunc maps the values appropriately. Because that works with every database, this is also the most common way. For simple things you can even write an expression instead of a custom function.
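For example, a sketch of such expression-based ordering (table and column names are hypothetical):
-- hypothetical: case-insensitive ordering of a text column via an expression
SELECT * FROM t ORDER BY lower(name);
-- hypothetical: order even ids before odd ones, then by id within each group
SELECT * FROM t ORDER BY (id % 2), id;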
Switching gears up
ORDER BY ... USING makes sense in several cases:
The ordering is so uncommon, that the somefunc trick doesn't work.
You work with a non-primitive type (like point, circle or imaginary numbers) and you don't want to repeat yourself in your queries with strange calculations.
The dataset you want to sort is so large that support from an index is desired or even required.
I will focus on the complex datatypes: often there is more than one way to sort them in a reasonable way. A good example is point: You can "order" them by the distance to (0,0), or by x first, then by y or just by y or anything else you want.
Of course, PostgreSQL has predefined operators for point:
> CREATE TABLE p ( p point );
> SELECT p <-> point(0,0) FROM p;
But none of them is declared usable for ORDER BY by default (see above):
> SELECT * FROM p ORDER BY p;
ERROR: could not identify an ordering operator for type point
TIP: Use an explicit ordering operator or modify the query.
Simple operators for point are the "below" and "above" operators <^ and >^. They simply compare the y part of the point. But:
> SELECT * FROM p ORDER BY p USING >^;
ERROR: operator > is not a valid ordering operator
TIP: Ordering operators must be "<" or ">" members of btree operator families.
ORDER BY USING requires an operator with defined semantics: Obviously it must be a binary operator, it must accept the same type as arguments and it must return boolean. I think it must also be transitive (if a < b and b < c then a < c). There may be more requirements. But all these requirements are also necessary for proper btree-index ordering. This explains the strange error messages containing the reference to btree.
ORDER BY USING also requires not just one operator to be defined but an operator class and an operator family. While one could implement sorting with only one operator, PostgreSQL tries to sort efficiently and minimize comparisons. Therefore, several operators are used even when you specify only one - the others must adhere to certain mathematical constraints - I've already mentioned transitivity, but there are more.
Switching gears up
Let's define something suitable: An operator for points which compares only the y part.
The first step is to create a custom operator family which can be used by the btree index access method:
> CREATE OPERATOR FAMILY xyzfam USING btree; -- superuser access required!
CREATE OPERATOR FAMILY
Next we must provide a comparator function which returns -1, 0, +1 when comparing two points. This function WILL be called internally!
> CREATE FUNCTION xyz_v_cmp(p1 point, p2 point) RETURNS int
AS $$BEGIN RETURN btfloat8cmp(p1[1],p2[1]); END $$ LANGUAGE plpgsql;
CREATE FUNCTION
Next we define the operator class for the family. See the manual for an explanation of the numbers.
> CREATE OPERATOR CLASS xyz_ops FOR TYPE point USING btree FAMILY xyzfam AS
OPERATOR 1 <^ ,
OPERATOR 3 ?- ,
OPERATOR 5 >^ ,
FUNCTION 1 xyz_v_cmp(point, point) ;
CREATE OPERATOR CLASS
This step combines several operators and functions and also defines their relationship and meaning. For example OPERATOR 1 means: This is the operator for less-than tests.
Now the operators <^ and >^ can be used in ORDER BY USING:
> INSERT INTO p SELECT point(floor(random()*100), floor(random()*100)) FROM generate_series(1, 5);
INSERT 0 5
> SELECT * FROM p ORDER BY p USING >^;
p
---------
(17,8)
(74,57)
(59,65)
(0,87)
(58,91)
Voila - sorted by y.
To sum it up: ORDER BY ... USING is an interesting look under the hood of PostgreSQL. But nothing you will require anytime soon unless you work in very specific areas of database technology.
Another example can be found in the Postgres docs, with source code for the example here and here. This example also shows how to create the operators.
Samples:
CREATE TABLE test
(
id serial NOT NULL,
"number" integer,
CONSTRAINT test_pkey PRIMARY KEY (id)
)
insert into test("number") values (1),(2),(3),(0),(-1);
select * from test order by number USING > -- gives 3 => 2 => 1 => 0 => -1
select * from test order by number USING < -- gives -1 => 0 => 1 => 2 => 3
So it is equivalent to DESC and ASC, but you may use your own operator; that's the essential feature of USING.
Nice answers, but they didn't mention one really valuable case for USING.
When you have created an index with a non-default operator family, for example varchar_pattern_ops (~>~, ~<~, ~>=~, ...) instead of <, >, >=, and you search based on that index and want the index to be used for the ORDER BY clause, you need to specify USING with the appropriate operator.
This can be illustrated with such example:
CREATE INDEX index_words_word ON words(word text_pattern_ops);
Let's compare these two queries:
SELECT * FROM words WHERE word LIKE 'o%' LIMIT 10;
and
SELECT * FROM words WHERE word LIKE 'o%' ORDER BY word LIMIT 10;
The difference between their execution times is nearly 100x in a 500K-word DB! And the results may also not be correct with a non-C locale.
How could this happen?
When you search with LIKE and an ORDER BY clause, you actually make this call:
SELECT * FROM words WHERE word ~>=~ 'o' AND word ~<~'p' ORDER BY word USING < LIMIT 10;
Your index was created with the ~<~ operator in mind, so PG cannot use it for the given ORDER BY clause. To get things done right, the query must be rewritten in this form:
SELECT * FROM words WHERE word ~>=~ 'o' AND word ~<~'p' ORDER BY word USING ~<~ LIMIT 10;
or
SELECT * FROM words WHERE word LIKE 'o%' ORDER BY word USING ~<~ LIMIT 10;
Optionally one can add the key word ASC (ascending) or DESC
(descending) after any expression in the ORDER BY clause. If not
specified, ASC is assumed by default. Alternatively, a specific
ordering operator name can be specified in the USING clause. An
ordering operator must be a less-than or greater-than member of some
B-tree operator family. ASC is usually equivalent to USING < and DESC
is usually equivalent to USING >.
PostgreSQL 9.0
It may look something like this, I think (I don't have Postgres at hand to verify right now, but will verify later):
SELECT Name FROM Person
ORDER BY NameId USING >