SQL to show one result calculated from the other values?

It seems we can use a SQL statement such as:
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
);
but we can't do
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
),
(
select
(c_foos / c_bars) as the_ratio
);
or
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
),
(c_foos / c_bars) as the_ratio;
Is there a way to do that showing all 3 numbers? Is there a more definite rule as to what can be done and what can't?

You can try this:
Define two CTEs in a WITH clause, so you can use their results in the main query built on the two CTE tables (cte_num and cte_den):
WITH
cte_num AS (
SELECT count(*) as c_foos
FROM foos
),
cte_den AS (
SELECT count(*) as c_bars
FROM bars
)
SELECT
cte_num.c_foos,
cte_den.c_bars,
cte_num.c_foos / cte_den.c_bars as the_ratio
from cte_num, cte_den;

There is a small number of simple rules... but SQL seems so easy that most programmers prefer to cut to the chase, and later complain they didn't get the plot :)
You can think of a query as a description of a flow: columns in a select share inputs (defined in from), but are evaluated "in parallel", without seeing each other. Your complex example boils down to the fact that you cannot do this:
select 1 as a, 2 as b, a + b;
Fields a and b are defined as outputs of the query, but there are no inputs called a and b. All you have to do is modify the query so that a and b are inputs:
select a + b from (select 1 as a, 2 as b) as inputs
And this will work (this is, btw., the solution for your queries).
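As a quick sanity check, the same rule can be exercised from Python against an in-memory SQLite database (SQLite is just a convenient stand-in here; name-resolution details vary by engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Referencing output aliases as inputs: most engines reject this,
# since a and b are outputs of the query, not inputs.
try:
    conn.execute("select 1 as a, 2 as b, a + b").fetchone()
    print("this engine happens to allow it")
except sqlite3.OperationalError as exc:
    print("rejected:", exc)

# Wrapping the values in a derived table turns them into inputs.
row = conn.execute(
    "select a + b from (select 1 as a, 2 as b) as inputs"
).fetchone()
print(row[0])  # 3
```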
Addendum:
The confusion comes from the fact that in most SQL 101 cases outputs are created directly from inputs (data just passes through).
This flow model is useful because it makes things easier to reason about in more complex cases, and it avoids ambiguities and loops. Consider what would happen in a query like: select name as last_name, last_name as name, name || ' ' || last_name from person;

Move the subqueries to the FROM clause:
select f.c_foos, b.c_bars, f.c_foos / b.c_bars
from (select count(*) as c_foos from foos
) f cross join
(select count(*) as c_bars from bars
) b;
Ironically, your first version will work in MySQL (see here). I don't actually think this is intentional. I think it is an artifact of their parser -- meaning that it happens to work but might stop working in future versions.
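Here is a minimal, hedged check of the derived-table version using an in-memory SQLite database from Python (the tables and row counts are invented for the demo, and a 1.0 * factor is added to avoid integer division):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table foos(id integer);
    create table bars(id integer);
    insert into foos values (1),(2),(3),(4),(5),(6);
    insert into bars values (1),(2),(3);
""")

row = conn.execute("""
    select f.c_foos, b.c_bars, 1.0 * f.c_foos / b.c_bars as the_ratio
    from (select count(*) as c_foos from foos) f
    cross join (select count(*) as c_bars from bars) b
""").fetchone()
print(row)  # (6, 3, 2.0)
```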

The simplest way is to use a CTE that returns the 2 columns:
with cte as (
select
(select count(*) from foos) as c_foos,
(select count(*) from bars) as c_bars
)
select c_foos, c_bars, (c_foos / c_bars) as the_ratio
from cte
Note that the aliases of the 2 columns must be set outside each subquery (after the closing parenthesis), not inside it.
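The same check for the CTE version, again sketched against in-memory SQLite with invented data (1.0 * added to force real division):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table foos(id integer);
    create table bars(id integer);
    insert into foos values (1),(2),(3),(4);
    insert into bars values (1),(2);
""")

row = conn.execute("""
    with cte as (
        select
            (select count(*) from foos) as c_foos,
            (select count(*) from bars) as c_bars
    )
    select c_foos, c_bars, 1.0 * c_foos / c_bars as the_ratio
    from cte
""").fetchone()
print(row)  # (4, 2, 2.0)
```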

Related

BigQuery "Schrödinger's Row" or why ROW_NUMBER() is not a good identifier

Situation
We have a fairly complex internal logic to allocate marketing spend to various channels and had recently started to rework some of our queries to simplify the setup. We came across a really puzzling case where using ROW_NUMBER() OVER() to identify unique rows led to very strange results.
Problem
In essence, using ROW_NUMBER() OVER() resulted in what I call Schrödinger's Rows, as they appear to be matched and unmatched at the same time (please find a replicable query below). In the attached screenshot (which is a result of the query) it can clearly be seen that
german_spend + non_german_spend > total_spend
Which should not be the case.
Query
Please note that execution of the query will give you different results each time you run it as it relies on RAND() to generate dummy data. Also please be aware that the query is a very dumbed down version of what we are doing. For reasons beyond the scope of this post, we needed to uniquely identify the buckets.
###################
# CREATE Dummy Data
###################
DECLARE NUMBER_OF_DUMMY_RECORDS DEFAULT 1000000;
WITH data AS (
SELECT
num as campaign_id,
RAND() as rand_1,
RAND() as rand_2
FROM
UNNEST(GENERATE_ARRAY(1, NUMBER_OF_DUMMY_RECORDS)) AS num
),
spend_with_categories AS (
SELECT
campaign_id,
CASE
WHEN rand_1 < 0.25 THEN 'DE'
WHEN rand_1 < 0.5 THEN 'AT'
WHEN rand_1 < 0.75 THEN 'CH'
ELSE 'IT'
END AS country,
CASE
WHEN rand_2 < 0.25 THEN 'SMALL'
WHEN rand_2 < 0.5 THEN 'MEDIUM'
WHEN rand_2 < 0.75 THEN 'BIG'
ELSE 'MEGA'
END AS city_size,
CAST(RAND() * 1000000 AS INT64) as marketing_spend
FROM
data
),
###################
# END Dummy Data
###################
spend_buckets AS (
SELECT
country,
city_size,
CONCAT("row_", ROW_NUMBER() OVER()) AS identifier,
#MD5(CONCAT(country, city_size)) AS identifier, (this works)
SUM(marketing_spend) AS marketing_spend
FROM
spend_with_categories
GROUP BY 1,2
),
german_spend AS (
SELECT
country,
ARRAY_AGG(identifier) AS identifier,
SUM(marketing_spend) AS marketing_spend
FROM
spend_buckets
WHERE
country = 'DE'
GROUP BY
country
),
german_identifiers AS (
SELECT id AS identifier FROM german_spend, UNNEST(identifier) as id
),
non_german_spend AS (
SELECT SUM(marketing_spend) AS marketing_spend FROM spend_buckets WHERE identifier NOT IN (SELECT identifier FROM german_identifiers)
)
(SELECT "german_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM german_spend
UNION ALL
SELECT "non_german_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM non_german_spend
UNION ALL
SELECT "total_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM spend_buckets)
Solution
We were actually able to solve the problem by using a hash of the key instead of the ROW_NUMBER() OVER() identifier, but out of curiosity I would still love to understand what causes this.
Additional Notes
Using GENERATE_UUID() AS identifier instead of CONCAT("row_", ROW_NUMBER() OVER()) AS identifier leads to almost 0 matches, i.e. the entire spend is classified as non-German.
Writing spend_buckets to a table also solves the problem, which leads me to believe that maybe ROW_NUMBER() OVER() is lazily executed or so?
Using a small number of dummy records also produces non-matching results, regardless of the method of generating a "unique" id.
Hash functions are a much better way of marking rows than generating a row number, which can change on each run.
CTEs (WITH tables) are not persistent; they are recalculated each time they are used in your query.
Referencing the same CTE several times within a query can therefore produce inconsistent results:
With test as (Select rand() as x)
Select * from test
union all Select * from test
union all Select * from test
A good solution is to use a temp table. A workaround is to look for a CTE that creates a row number or generates random numbers and is referenced more than once afterwards: rename that CTE, then define a CTE with the original name inside a WITH recursive clause that selects from the renamed one, and use that from then on. In your example it is spend_buckets:
WITH recursive
...
spend_buckets_ as (
...),
spend_buckets as
(select * from spend_buckets_
union all select * from spend_buckets_
where false
),
Then the values will match.
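The recompute-vs-materialize distinction can be pictured outside SQL. As a loose analogy (my own sketch, not BigQuery internals): a non-persistent CTE behaves like calling a random-number function again at every reference, while a temp table or forced materialization behaves like computing the value once and reusing it:

```python
import random

# Non-persistent CTE: the definition is re-evaluated at every reference,
# so each reference can see a different value.
def cte():
    return random.random()

recomputed = [cte(), cte(), cte()]

# Materialized (temp table): evaluated once, then reused everywhere.
value = random.random()
materialized = [value, value, value]

print(len(set(recomputed)))    # almost certainly 3 distinct values
print(len(set(materialized)))  # 1
```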

BigQuery query doesn't work with UNNEST()

I'm trying to search StackOverflow data through BigQuery by matching a string pattern on answers and filtering relevant questions by their tags.
WITH question_answers_join AS (
SELECT *
FROM (
SELECT id, creation_date, title
, (SELECT AS STRUCT body b
FROM `bigquery-public-data.stackoverflow.posts_answers`
WHERE a.id=parent_id
) answers
, SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
)
)SELECT *
FROM question_answers_join
WHERE 'google-bigquery' IN UNNEST(tags)
AND REGEXP_CONTAINS(answers.b, r"hello")
ORDER BY RAND()
LIMIT 100
However, I get this error:
Scalar subquery produced more than one element
What is it referring to? How can I fix this?
Below is for BigQuery Standard SQL
#standardSQL
WITH question_answers_join AS (
SELECT *
FROM (
SELECT id, creation_date, title
, ARRAY(SELECT body /* this line was the reason for error */
FROM `bigquery-public-data.stackoverflow.posts_answers`
WHERE a.id=parent_id
) answers
, SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
)
)
SELECT *
FROM question_answers_join
WHERE 'google-bigquery' IN UNNEST(tags)
AND EXISTS (
SELECT 1
FROM UNNEST(answers) answer
WHERE REGEXP_CONTAINS(answer, r"hello")
)
ORDER BY RAND()
LIMIT 100
I think it is easy to just compare the above with your original query to see the differences (hint: there are just two of them). The first difference is the actual reason for the error you saw. The second difference reflects the changes introduced by the first one: answers is now an ARRAY, so the pattern match has to be applied to its unnested elements.
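To make the contract concrete: a scalar subquery fills a single-value slot, so it may return at most one row, and a question with several answers violates that. A plain-Python analogy of the error and of the array-plus-UNNEST fix (the names here are illustrative only):

```python
import re

answers = ["hello world", "another answer"]  # one question, two answer bodies

# Scalar slot: valid only if there is at most one element.
def as_scalar(rows):
    if len(rows) > 1:
        raise ValueError("Scalar subquery produced more than one element")
    return rows[0] if rows else None

try:
    as_scalar(answers)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# Array slot plus an EXISTS-style check over the unnested elements.
has_hello = any(re.search(r"hello", a) for a in answers)
print(has_hello)  # True
```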

DB2 getting QDT Array List maximum exceeded using CTE and sql recursion

I am using a CTE to create a recursive query to merge data from multiple columns into one.
I have about 9 working CTEs (I need to merge columns a few times in one row per request, so I have the CTE helpers). When I add the 10th, I get an error. I am running the query in Visual Studio 2010, and here is the error:
And on the AS400 system, using the WRKOBJLCK MyUserProfile *USRPRF command, I see:
I can't find any information on this.
I am using DB2 running on an AS400 system, and using: Operating system: i5/OS Version: V5R4M0
I repeat these same 3 CTE's but with different conditions to compare against:
t1A (ROWNUM, PARTNO, LOCNAM, LOCCODE, QTY) AS
(
SELECT rownumber() over(partition by s2.LOCPART), s2.LOCPART, s2.LOCNAM, s2.LOCCODE, s2.LOCQTY
FROM (
SELECT distinct s1.LOCPART, L.LOCNAM, L.LOCCODE, L.LOCQTY
FROM(
SELECT COUNT(LOCPART) AS counts, LOCPART
FROM LOCATIONS
WHERE LOCCODE = 'A'
GROUP BY LOCPART) S1, LOCATIONS L
WHERE S1.COUNTS > 1 AND S1.LOCPART = L.LOCPART AND L.LOCCODE = 'A'
)s2
),
t2A(PARTNO, LIST, QTY, CODE, CNT) AS
(
select PARTNO, LOCNAM, QTY, LOCCODE, 1
from t1A
where ROWNUM = 1
UNION ALL
select t2A.PARTNO, t2A.LIST || ', ' || t1A.LOCNAM, t1A.QTY, t1A.LOCCODE, t2A.CNT + 1
FROM t2A, t1A
where t2A.PARTNO = t1A.PARTNO
AND t2A.CNT + 1 = t1A.ROWNUM
),
t3A(PARTNO, LIST, QTY, CODE, CNT) AS
(
select t2.PARTNO, t2.LIST, q.SQTY, t2.CODE, t2.CNT
from(
select SUM(QTY) as SQTY, PARTNO
FROM t1A
GROUP BY PARTNO
) q, t2A t2
where t2.PARTNO = q.PARTNO
)
Using these, I just call a simple select on one of the CTEs for testing, and I get the error each time I have more than 9 CTEs (even if only one is being called).
In the AS400 error (green screen snapshot) what does QDT stand for, and when am I using an Array here?
This was a mess. Error after error. The only way I could get around this was to create views and piece them together.
When creating the views I was only able to get one CTE to work at a time, not multiple, and what worked fine as one recursive CTE would not work when defined as a view. I had to break the subquery apart into views. I also could not create a view out of SELECT rownumber() over(partition by COL1, Col2) that contained a subquery; I had to break it down into two views. If I called SELECT rownumber() over(partition by COL1, Col2) using a view as its subquery and threw that into the CTE, it would not work. I had to put the SELECT rownumber() over(partition by COL1, Col2) with its inner view into another view, and then I was able to use it in the CTE and create a main view out of all of that.
Also, each error I got was a system error, not an SQL error.
So, in conclusion: I relied heavily on views to fix my issue, in case anyone ever runs across this same problem.
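For reference, the underlying rownumber-plus-recursive-concatenation technique is portable. Here is a hedged sketch of the same idea in SQLite via Python (the toy table and data are mine, and SQLite spells the function row_number() rather than DB2's rownumber()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table locations(locpart integer, locnam text);
    insert into locations values (1, 'A1'), (1, 'B7'), (1, 'C3'), (2, 'D9');
""")

rows = conn.execute("""
    with recursive
    t1 as (
        -- Number the rows within each part.
        select locpart, locnam,
               row_number() over (partition by locpart order by locnam) as rownum
        from locations
    ),
    t2(locpart, list, cnt) as (
        -- Anchor: start each part's list with its first row.
        select locpart, locnam, 1 from t1 where rownum = 1
        union all
        -- Step: append the next-numbered row's name to the list.
        select t2.locpart, t2.list || ', ' || t1.locnam, t2.cnt + 1
        from t2 join t1
          on t2.locpart = t1.locpart and t2.cnt + 1 = t1.rownum
    )
    -- Keep only the fully merged row per part.
    select locpart, list from t2
    where cnt = (select count(*) from t1 where t1.locpart = t2.locpart)
    order by locpart
""").fetchall()
print(rows)  # [(1, 'A1, B7, C3'), (2, 'D9')]
```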

Simplify SQL query which uses `row_to_json`

I am using row_to_json function available in PostgreSQL 9.3 to get a query result as JSON:
SELECT row_to_json(_customer_wishes) FROM (
SELECT
...
(SELECT row_to_json(_brand)
FROM (
SELECT b.id, b.name, b.url
) AS _brand
) AS brand,
JOIN brand AS b ON ...
WHERE ...
) AS _customer_wishes;
However, I don't like (SELECT row_to_json(_brand) FROM (SELECT b.*) AS _brand ) AS brand.
I would like something like (SELECT row_to_json(SELECT b.*)) AS brand, but I am not sure if it's possible.
Usually I use types for situations like this (we use the database in an object-oriented way).
Just for example:
CREATE TYPE type_brand AS
(
id integer,
name text,
url text
);
And your subquery:
row_to_json((b.id, b.name, b.url)::type_brand) AS brand
But, given the table structure and desired result, I think I could suggest a more elegant query, where you use row_to_json just once.
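To make the intent of row_to_json concrete, here is a miniature plain-Python imitation (a sketch of the behaviour, not PostgreSQL internals): each row becomes a JSON object keyed by column name, and the brand subquery just nests one such object inside another:

```python
import json

def row_to_json(colnames, row):
    # Imitates PostgreSQL's row_to_json: one row -> one JSON object.
    return dict(zip(colnames, row))

# Hypothetical brand row and customer-wish row for illustration.
brand = row_to_json(["id", "name", "url"], (7, "Acme", "https://acme.example"))
wish = row_to_json(["customer", "brand"], ("alice", brand))
print(json.dumps(wish))
# {"customer": "alice", "brand": {"id": 7, "name": "Acme", "url": "https://acme.example"}}
```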

SQL "WITH" to include multiple derived tables

Can I write something like the query below? It is not giving proper output in WinSQL/Teradata:
with
a (x) as ( select 1 ),
b (y) as ( select * from a )
select * from b
Do you really need to use CTEs for this particular solution when derived tables would work as well:
SELECT B.*
FROM (SELECT A.*
FROM (SELECT 1 AS Col1) A
) B;
That being said, I believe multiple CTEs became available in Teradata 14.10 or 15, while support for a single CTE and the WITH clause was introduced in Teradata 12 or 13.
Define the dependent CTE first and then its parent,
like this, and it will work. Why is it like that? Teradata likes people to play with it longer and spend more time with it, making it feel important.
with
"b" (y) as ( select * from "a" ),
"a" (x) as ( select '1' )
select * from b
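For comparison, engines such as SQLite accept the original top-down ordering as written, which suggests the reordering is a Teradata-specific quirk (quick check from Python):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute("""
    with
    a (x) as ( select 1 ),
    b (y) as ( select * from a )
    select * from b
""").fetchone()
print(row[0])  # 1
```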