BigQuery query doesn't work with UNNEST()


I'm trying to search the Stack Overflow data in BigQuery with a query that matches a string pattern against answers and filters the corresponding questions by tag.
WITH question_answers_join AS (
SELECT *
FROM (
SELECT id, creation_date, title
, (SELECT AS STRUCT body b
FROM `bigquery-public-data.stackoverflow.posts_answers`
WHERE a.id=parent_id
) answers
, SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
)
)
SELECT *
FROM question_answers_join
WHERE 'google-bigquery' IN UNNEST(tags)
AND REGEXP_CONTAINS(answers.b, r"hello")
ORDER BY RAND()
LIMIT 100
However, I get this error:
Scalar subquery produced more than one element
What is it referring to? How can I fix this?

Below is for BigQuery Standard SQL
#standardSQL
WITH question_answers_join AS (
SELECT *
FROM (
SELECT id, creation_date, title
, ARRAY(SELECT body /* this line was the reason for error */
FROM `bigquery-public-data.stackoverflow.posts_answers`
WHERE a.id=parent_id
) answers
, SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
)
)
SELECT *
FROM question_answers_join
WHERE 'google-bigquery' IN UNNEST(tags)
AND EXISTS (
SELECT 1
FROM UNNEST(answers) answer
WHERE REGEXP_CONTAINS(answer, r"hello")
)
ORDER BY RAND()
LIMIT 100
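Comparing this with your original query shows just two differences. The first is the actual reason for the error you saw: SELECT AS STRUCT is a scalar subquery, so it fails as soon as a question has more than one answer, whereas ARRAY(...) collects all of them. The second follows from the first: because answers is now an array, the pattern match has to go through UNNEST(answers) inside EXISTS.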
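The same "one question, many answers" shape can be sketched outside BigQuery. The block below is only an illustration: it swaps in SQLite (through Python's sqlite3 module) for BigQuery, uses hypothetical tables named questions and answers, and replaces REGEXP_CONTAINS with a plain substring test via instr. It shows that a question can have several answers, which is why a single scalar value cannot hold them, and that EXISTS expresses "at least one answer matches".

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery; table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE answers (id INTEGER PRIMARY KEY, parent_id INTEGER, body TEXT);
    INSERT INTO questions VALUES (1, 'q1'), (2, 'q2');
    INSERT INTO answers VALUES
        (10, 1, 'nothing here'),
        (11, 1, 'hello world'),
        (12, 2, 'goodbye');
""")

def questions_with_matching_answer(conn, needle):
    """Return ids of questions that have at least one answer containing needle."""
    rows = conn.execute(
        """
        SELECT q.id
        FROM questions q
        WHERE EXISTS (
            SELECT 1
            FROM answers a
            WHERE a.parent_id = q.id
              AND instr(a.body, ?) > 0   -- substring test instead of a regex
        )
        ORDER BY q.id
        """,
        (needle,),
    ).fetchall()
    return [r[0] for r in rows]

# Question 1 has two answers, so "the answer of question 1" is not a single value.
answers_per_question = dict(
    conn.execute("SELECT parent_id, COUNT(*) FROM answers GROUP BY parent_id")
)
matching_ids = questions_with_matching_answer(conn, "hello")
```

Here answers_per_question shows question 1 with two answers, and matching_ids contains only the question whose answers include the search string.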

Related

BigQuery "Schrödingers Row" or why ROW_NUMBER() is not a good identifier

Situation
We have fairly complex internal logic to allocate marketing spend to various channels and recently started to rework some of our queries to simplify the setup. We came across a really puzzling case where using ROW_NUMBER() OVER() to identify unique rows led to very strange results.
Problem
In essence, using ROW_NUMBER() OVER() resulted in what I call Schrödinger's Rows, as they appear to be matched and unmatched at the same time (please find a reproducible query below). In the attached screenshot (which is a result of the query) it can be clearly seen that
german_spend + non_german_spend > total_spend
Which should not be the case.
Query
Please note that execution of the query will give you different results each time you run it as it relies on RAND() to generate dummy data. Also please be aware that the query is a very dumbed down version of what we are doing. For reasons beyond the scope of this post, we needed to uniquely identify the buckets.
###################
# CREATE Dummy Data
###################
DECLARE NUMBER_OF_DUMMY_RECORDS DEFAULT 1000000;
WITH data AS (
SELECT
num as campaign_id,
RAND() as rand_1,
RAND() as rand_2
FROM
UNNEST(GENERATE_ARRAY(1, NUMBER_OF_DUMMY_RECORDS)) AS num
),
spend_with_categories AS (
SELECT
campaign_id,
CASE
WHEN rand_1 < 0.25 THEN 'DE'
WHEN rand_1 < 0.5 THEN 'AT'
WHEN rand_1 < 0.75 THEN 'CH'
ELSE 'IT'
END AS country,
CASE
WHEN rand_2 < 0.25 THEN 'SMALL'
WHEN rand_2 < 0.5 THEN 'MEDIUM'
WHEN rand_2 < 0.75 THEN 'BIG'
ELSE 'MEGA'
END AS city_size,
CAST(RAND() * 1000000 AS INT64) as marketing_spend
FROM
data
),
###################
# END Dummy Data
###################
spend_buckets AS (
SELECT
country,
city_size,
CONCAT("row_", ROW_NUMBER() OVER()) AS identifier,
#MD5(CONCAT(country, city_size)) AS identifier, (this works)
SUM(marketing_spend) AS marketing_spend
FROM
spend_with_categories
GROUP BY 1,2
),
german_spend AS (
SELECT
country,
ARRAY_AGG(identifier) AS identifier,
SUM(marketing_spend) AS marketing_spend
FROM
spend_buckets
WHERE
country = 'DE'
GROUP BY
country
),
german_identifiers AS (
SELECT id AS identifier FROM german_spend, UNNEST(identifier) as id
),
non_german_spend AS (
SELECT SUM(marketing_spend) AS marketing_spend FROM spend_buckets WHERE identifier NOT IN (SELECT identifier FROM german_identifiers)
)
(SELECT "german_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM german_spend
UNION ALL
SELECT "non_german_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM non_german_spend
UNION ALL
SELECT "total_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM spend_buckets)
Solution
We were actually able to solve the problem by using a hash of the key instead of the ROW_NUMBER() OVER() identifier, but out of curiosity I would still love to understand what causes this.
Additional Notes
Using GENERATE_UUID() AS identifier instead of CONCAT("row_", ROW_NUMBER() OVER()) AS identifier leads to almost 0 matches. I.e. entire spend is classified as non-german.
Writing spend_buckets to a table also solves the problem, which leads me to believe that maybe ROW_NUMBER() OVER() is lazily executed or so?
Using a small number of dummy records also produces non-matching results, regardless of the method used to generate a "unique" id.

Hash functions are a much better way to mark rows than generating a row number, which can change every time it is computed.
CTEs (WITH tables) are not persisted; they can be recalculated each time they are referenced in your query.
Running the same CTE several times within a query can therefore give different results:
With test as (Select rand() as x)
Select * from test
union all Select * from test
union all Select * from test
A good solution is to use a temp table. A workaround is to look for CTEs that create a row_number or generate random numbers and are referenced more than once later in the query. Rename each such CTE and wrap it in a recursive CTE, then reference the wrapper in the later CTEs. In your example that is spend_buckets:
WITH recursive
...
spend_buckets_ as (
...),
spend_buckets as
(select * from spend_buckets_
union all select * from spend_buckets_
where false
),
Then the values will match.
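The temp-table suggestion can be sketched as follows. This is an illustration only: SQLite (through Python's sqlite3 module) is swapped in for BigQuery, and it says nothing about how either engine evaluates CTEs. It only shows that once a non-deterministic value has been written to a table, every later reference reads the same stored value.

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery to show the materialization idea.
conn = sqlite3.connect(":memory:")

# The random value is computed once, when the temp table is created.
conn.execute("CREATE TEMP TABLE test AS SELECT random() AS x")

# Every later reference reads the stored row instead of calling random() again.
rows = conn.execute(
    """
    SELECT x FROM test
    UNION ALL SELECT x FROM test
    UNION ALL SELECT x FROM test
    """
).fetchall()
values = [r[0] for r in rows]
```

All three entries in values are identical, which is the property the identifier column in spend_buckets needs.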

How to rewrite CONNECT BY PRIOR Oracle style query to RECURSIVE CTE Postgres for query with correlated WHERE clause? [closed]

I have the following working query for Oracle:
select * from (
select orgId, oNdId, stamp, op,
lgin, qwe, rty,
tusid, tnid, teid,
thid, tehid, trid,
name1, name2,
xtrdta, rownum as rnum from
(
select a.*
from tblADT a
where a.orgId=? and EXISTS(
SELECT oNdId, prmsn FROM (
SELECT oNdId, rp.prmsn FROM tblOND
LEFT JOIN tblRoleprmsn rp ON rp.roleId=? AND rp.prmsn='vors'
START WITH oNdId IN (
SELECT oNdId FROM tblrnpmsn rnp
WHERE rnp.roleId=?
AND rnp.prmsn=?
)
CONNECT BY PRIOR oNdId = parentId
)
WHERE oNdId = a.oNdId OR 1 = (
CASE WHEN prmsn IS NOT NULL THEN
CASE WHEN a.oNdId IS NULL THEN 1 ELSE 0 END
END
)
)
AND op IN (?)
order by stamp desc
) WHERE rownum < (? + ? + 1)
) WHERE rnum >= (? + 1)
Now I am trying to implement an equivalent for PostgreSQL. Based on my investigation, I could use a recursive CTE.
But I have not been successful. The examples I found are all without a WHERE clause, so it is not so easy.
Could you please help me with that?

The Oracle query seems to have a few extra quirks and conditions I'm not able to understand. It's probably related to the specific use case.
In the absence of sample data I'll show you the simple case. You say:
There is a table 'tblOND' which has 2 columns 'oNdId' and 'parentId' it is a hierarchy here
Here's a query that would get all the children of nodes, according to an initial filtering predicate:
create table tblond (
ondid int primary key not null,
parentid int references tblond (ondid)
);
with recursive
n as (
select ondid, parentid, 1 as lvl
from tblond
where <search_predicate> -- initial nodes
union all
select t.ondid, t.parentid, n.lvl + 1
from n
join tblond t on t.parentid = n.ondid -- #1
)
select * from n
Recursive CTEs are not limited to hierarchies; they work on any kind of graph. As long as you are able to describe the relationship used to "walk" to the next nodes (#1), you can keep adding rows.
Also the example shows a "made up" column lvl; you can produce as many columns as you need/want.
The section before the UNION ALL is the "anchor" query that is run only once. After the UNION ALL is the "iterative" query that is run iteratively until it does not return any more rows.
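To see the anchor and iterative parts at work, here is a minimal runnable sketch. It assumes SQLite through Python's sqlite3 module instead of PostgreSQL, a handful of made-up rows, and a hypothetical search predicate (ondid = ?) for the initial nodes; the recursive query itself has the same shape as the one above.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblond (
        ondid    INTEGER PRIMARY KEY,
        parentid INTEGER REFERENCES tblond (ondid)
    );
    -- Two small trees: 1 -> (2 -> 4, 3) and 5 -> 6
    INSERT INTO tblond VALUES (1, NULL), (2, 1), (3, 1), (4, 2), (5, NULL), (6, 5);
""")

def subtree(conn, root_id):
    """Return (ondid, lvl) for root_id and all of its descendants."""
    return conn.execute(
        """
        WITH RECURSIVE n AS (
            SELECT ondid, parentid, 1 AS lvl
            FROM tblond
            WHERE ondid = ?                       -- anchor: initial nodes
            UNION ALL
            SELECT t.ondid, t.parentid, n.lvl + 1
            FROM n
            JOIN tblond t ON t.parentid = n.ondid -- walk to the children
        )
        SELECT ondid, lvl FROM n ORDER BY lvl, ondid
        """,
        (root_id,),
    ).fetchall()

tree_1 = subtree(conn, 1)
tree_5 = subtree(conn, 5)
```

The anchor row comes back with lvl 1, and each pass of the iterative part adds the next level until no more children are found.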

SQL to show one result calculated by the other values?

It seems we can use a SQL statement as:
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
);
but we can't do
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
),
(
select
(c_foos / c_bars) as the_ratio
);
or
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
),
(c_foos / c_bars) as the_ratio;
Is there a way to do that showing all 3 numbers? Is there a more definite rule as to what can be done and what can't?

You can try this:
Define two CTEs (cte_num and cte_den) in a WITH clause, then use their results in the main query:
WITH
cte_num AS (
SELECT count(*) as c_foos
FROM foos
),
cte_den AS (
SELECT count(*) as c_bars
FROM bars
)
SELECT
cte_num.c_foos,
cte_den.c_bars,
cte_num.c_foos / cte_den.c_bars as the_ratio
from cte_num, cte_den;

There is a small number of simple rules... but SQL seems so easy that most programmers prefer to cut to the chase, and later complain they didn't get the plot :)
You can think of a query as a description of a flow: columns in a select share inputs (defined in from), but are evaluated "in parallel", without seeing each other. Your complex example boils down to the fact, that you cannot do this:
select 1 as a, 2 as b, a + b;
fields a and b are defined as outputs from the query, but there are no inputs called a and b. All you have to do is modify the query so that a and b are inputs:
select a + b from (select 1 as a, 2 as b) as inputs
And this will work (this is, btw., the solution for your queries).
Addendum:
The confusion comes from the fact, that in most SQL 101 cases outputs are created directly from inputs (data just passes through).
This flow model is useful, because it makes things easier to reason about in more complex cases. Also, we avoid ambiguities and loops. You can think about it in the context of query like: select name as last_name, last_name as name, name || ' ' || last_name from person;

Move the subqueries to the FROM clause:
select f.c_foos, b.c_bars, f.c_foos / b.c_bars
from (select count(*) as c_foos from foos
) f cross join
(select count(*) as c_bars from bars
) b;
Ironically, your first version will work in MySQL. I don't actually think this is intentional. I think it is an artifact of their parser -- meaning that it happens to work but might stop working in future versions.

The simplest way is to use a CTE that returns the 2 columns:
with cte as (
select
(select count(*) from foos) as c_foos,
(select count(*) from bars) as c_bars
)
select c_foos, c_bars, (c_foos / c_bars) as the_ratio
from cte
Note that the aliases of the two columns must be set outside each subquery's parentheses, not inside them.
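The CTE approach can be tried end to end with a small sketch. It assumes SQLite through Python's sqlite3 module and two made-up tables with three and two rows. It also adds one thing the answers above do not mention: count(*) is an integer, so dividing one count by another truncates in SQLite (and in PostgreSQL), and multiplying by 1.0 is one way to get a fractional ratio.

```python
import sqlite3

# Assumption: SQLite stands in for the asker's database; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foos (id INTEGER);
    CREATE TABLE bars (id INTEGER);
    INSERT INTO foos VALUES (1), (2), (3);
    INSERT INTO bars VALUES (1), (2);
""")

row = conn.execute(
    """
    WITH cte AS (
        SELECT (SELECT COUNT(*) FROM foos) AS c_foos,  -- alias outside the parentheses
               (SELECT COUNT(*) FROM bars) AS c_bars
    )
    SELECT c_foos,
           c_bars,
           c_foos / c_bars       AS int_ratio,  -- integer division truncates
           1.0 * c_foos / c_bars AS the_ratio   -- forces a fractional result
    FROM cte
    """
).fetchone()
```

Because c_foos and c_bars are columns of cte, they are inputs to the outer SELECT and can be combined freely there.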

Self join SQL is taking too much time to execute

The SQL below takes too much time to execute. I don't know what I am doing wrong, but it does return the correct result. Can I simplify this SQL further?
This is an Oracle database, and the jmc_job_step table contains a huge number of records.
select *
from
jmc_job_run_id jobrunid0_
inner join
jmc_job_step jobsteps1_
on jobrunid0_.id=jobsteps1_.job_run_id
where
(
jobsteps1_.creation_date in (
select
min(jobstep2_.creation_date)
from
jmc_job_step jobstep2_
where
jobrunid0_.id=jobstep2_.job_run_id
group by
jobstep2_.job_run_id ,
jobstep2_.job_step_no
)
)
or jobsteps1_.job_step_progress_value in (
select
max(jobstep3_.job_step_progress_value)
from
jmc_job_step jobstep3_
where
jobrunid0_.id=jobstep3_.job_run_id
group by
jobstep3_.job_run_id ,
jobstep3_.job_step_no
)
)
order by
jobrunid0_.job_start_time desc

This is useless: it says "I don't care what those columns contain", and yet you make the database engine check those values anyway.
(
upper(jobrunid0_.tenant_id) like '%'|| null
)
and (
upper(jobrunid0_.job_run_id) like '%'||null||'%'
)

ORDER BY the IN value list

I have a simple SQL query in PostgreSQL 8.3 that grabs a bunch of comments. I provide a sorted list of values to the IN construct in the WHERE clause:
SELECT * FROM comments WHERE (comments.id IN (1,3,2,4));
This returns comments in an arbitrary order, which in my case happens to be ids like 1,2,3,4.
I want the resulting rows sorted like the list in the IN construct: (1,3,2,4).
How to achieve that?

You can do it quite easily with VALUES (), () (introduced in PostgreSQL 8.2).
Syntax will be like this:
select c.*
from comments c
join (
values
(1,1),
(3,2),
(2,3),
(4,4)
) as x (id, ordering) on c.id = x.id
order by x.ordering
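When the id list comes from application code, the (id, ordering) pairs can be generated instead of typed by hand. The sketch below is one way to do that under stated assumptions: it uses SQLite through Python's sqlite3 module rather than PostgreSQL, a made-up comments table, and a hypothetical helper named fetch_in_order. SQLite does not accept a column list on a derived-table alias, so the VALUES list is placed in a CTE with named columns; the join and the ORDER BY are otherwise the same idea.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the comments rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [(i, f"comment {i}") for i in range(1, 6)],
)

def fetch_in_order(conn, ids):
    """Return comment ids in the same order as the given list (hypothetical helper)."""
    if not ids:
        return []
    # One "(?, ?)" pair per id: the id itself and its 1-based position in the list.
    placeholders = ", ".join("(?, ?)" for _ in ids)
    params = [v for pos, id_ in enumerate(ids, start=1) for v in (id_, pos)]
    sql = f"""
        WITH x (id, ordering) AS (VALUES {placeholders})
        SELECT c.id
        FROM comments c
        JOIN x ON c.id = x.id
        ORDER BY x.ordering
    """
    return [r[0] for r in conn.execute(sql, params)]

ordered_ids = fetch_in_order(conn, [1, 3, 2, 4])
```

Only the placeholders are interpolated into the SQL text; the ids and positions are passed as bound parameters.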

In Postgres 9.4 or later, this is simplest and fastest:
SELECT c.*
FROM comments c
JOIN unnest('{1,3,2,4}'::int[]) WITH ORDINALITY t(id, ord) USING (id)
ORDER BY t.ord;
WITH ORDINALITY was introduced in Postgres 9.4.
No need for a subquery, we can use the set-returning function like a table directly. (A.k.a. "table-function".)
A string literal to hand in the array instead of an ARRAY constructor may be easier to implement with some clients.
For convenience (optionally), copy the column name we are joining to ("id" in the example), so we can join with a short USING clause to only get a single instance of the join column in the result.
Works with any input type. If your key column is of type text, provide something like '{foo,bar,baz}'::text[].
Detailed explanation:
PostgreSQL unnest() with element number

Just because it is so difficult to find and deserves to be spread: in MySQL this can be done much more simply, but I don't know if it works in other SQL dialects.
SELECT * FROM `comments`
WHERE `comments`.`id` IN ('12','5','3','17')
ORDER BY FIELD(`comments`.`id`,'12','5','3','17')

With Postgres 9.4 this can be done a bit shorter:
select c.*
from comments c
join (
select *
from unnest(array[43,47,42]) with ordinality
) as x (id, ordering) on c.id = x.id
order by x.ordering;
Or a bit more compact without a derived table:
select c.*
from comments c
join unnest(array[43,47,42]) with ordinality as x (id, ordering)
on c.id = x.id
order by x.ordering
Removing the need to manually assign/maintain a position to each value.
With Postgres 9.6 this can be done using array_position():
with x (id_list) as (
values (array[42,48,43])
)
select c.*
from comments c, x
where id = any (x.id_list)
order by array_position(x.id_list, c.id);
The CTE is used so that the list of values only needs to be specified once. If that is not important this can also be written as:
select c.*
from comments c
where id in (42,48,43)
order by array_position(array[42,48,43], c.id);

I think this way is better:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY id=1 DESC, id=3 DESC, id=2 DESC, id=4 DESC

Another way to do it in Postgres would be to use the idx function.
SELECT *
FROM comments
ORDER BY idx(array[1,3,2,4], comments.id)
Don't forget to create the idx function first, as described here: http://wiki.postgresql.org/wiki/Array_Index

In Postgresql:
select *
from comments
where id in (1,3,2,4)
order by position(id::text in '1,3,2,4')

On researching this some more I found this solution:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY CASE "comments"."id"
WHEN 1 THEN 1
WHEN 3 THEN 2
WHEN 2 THEN 3
WHEN 4 THEN 4
END
However this seems rather verbose and might have performance issues with large datasets.
Can anyone comment on these issues?

To do this, I think you should probably have an additional "ORDER" table which defines the mapping of IDs to order (effectively doing what your response to your own question said), which you can then use as an additional column on your select which you can then sort on.
In that way, you explicitly describe the ordering you desire in the database, where it should be.

Without a SEQUENCE; works only on 8.4 or later:
select * from comments c
join
(
select id, row_number() over() as id_sorter
from (select unnest(ARRAY[1,3,2,4]) as id) as y
) x on x.id = c.id
order by x.id_sorter

SELECT * FROM "comments" JOIN (
SELECT 1 as "id",1 as "order" UNION ALL
SELECT 3,2 UNION ALL SELECT 2,3 UNION ALL SELECT 4,4
) j ON "comments"."id" = j."id" ORDER BY j.ORDER
or if you prefer evil over good:
SELECT * FROM "comments" WHERE ("comments"."id" IN (1,3,2,4))
ORDER BY POSITION(','+"comments"."id"+',' IN ',1,3,2,4,')

And here's another solution that works and uses a constant table (http://www.postgresql.org/docs/8.3/interactive/sql-values.html):
SELECT * FROM comments AS c,
(VALUES (1,1),(3,2),(2,3),(4,4) ) AS t (ord_id,ord)
WHERE (c.id IN (1,3,2,4)) AND (c.id = t.ord_id)
ORDER BY ord
But again I'm not sure that this is performant.
I've got a bunch of answers now. Can I get some voting and comments so I know which is the winner!
Thanks All :-)

create sequence serial start 1;
select * from comments c
join (select unnest(ARRAY[1,3,2,4]) as id, nextval('serial') as id_sorter) x
on x.id = c.id
order by x.id_sorter;
drop sequence serial;
[EDIT]
unnest is not yet built in in 8.3, but you can create one yourself (the beauty of any*):
create function unnest(anyarray) returns setof anyelement
language sql as
$$
select $1[i] from generate_series(array_lower($1,1),array_upper($1,1)) i;
$$;
That function works with any element type:
select unnest(array['John','Paul','George','Ringo']) as beatle;
select unnest(array[1,3,2,4]) as id;

Slight improvement over the version that uses a sequence I think:
CREATE OR REPLACE FUNCTION in_sort(anyarray, out id anyelement, out ordinal int)
LANGUAGE SQL AS
$$
SELECT $1[i], i FROM generate_series(array_lower($1,1),array_upper($1,1)) i;
$$;
SELECT
*
FROM
comments c
INNER JOIN (SELECT * FROM in_sort(ARRAY[1,3,2,4])) AS in_sort
USING (id)
ORDER BY in_sort.ordinal;

select * from comments where comments.id in
(select unnest(ids) from bbs where id=19795)
order by array_position((select ids from bbs where id=19795),comments.id)
Here, bbs is the main table, which has a field called ids,
and ids is the array that stores the comments.id values.
Tested in PostgreSQL 9.6.

Let's get a visual impression of what has already been said. For example, you have a table with some tasks:
SELECT a.id,a.status,a.description FROM minicloud_tasks as a ORDER BY random();
id | status | description
----+------------+------------------
4 | processing | work on postgres
6 | deleted | need some rest
3 | pending | garden party
5 | completed | work on html
And you want to order the list of tasks by its status.
The status is a list of string values:
(processing, pending, completed, deleted)
The trick is to give each status value an integer and order the list numerically:
SELECT a.id,a.status,a.description FROM minicloud_tasks AS a
JOIN (
VALUES ('processing', 1), ('pending', 2), ('completed', 3), ('deleted', 4)
) AS b (status, id) ON (a.status = b.status)
ORDER BY b.id ASC;
Which leads to:
id | status | description
----+------------+------------------
4 | processing | work on postgres
3 | pending | garden party
5 | completed | work on html
6 | deleted | need some rest
Credit #user80168

I agree with all other posters that say "don't do that" or "SQL isn't good at that". If you want to sort by some facet of comments then add another integer column to one of your tables to hold your sort criteria and sort by that value. eg "ORDER BY comments.sort DESC " If you want to sort these in a different order every time then... SQL won't be for you in this case.
