Postgres SELECT execution order incorrect

The following query does not work in Postgres 9.4.5.
SELECT * FROM (
  SELECT M.NAME, M.VALUE AS V
  FROM METRICS AS M, METRICATTRIBUTES AS A
  WHERE M.NAME = A.NAME AND A.ISSTRING = 'FALSE'
) AS S1
WHERE CAST(S1.V AS NUMERIC) < 0
I get an error like:
invalid input syntax for type numeric: "astringvalue"
Read on to see why I made the query this complicated.
METRICS is a table of metric, value pairs. The values are stored as strings, and some of the values in the VALUE field are, in fact, strings. The METRICATTRIBUTES table identifies the metric names which may have string values. I populated the METRICATTRIBUTES table from an analysis of the METRICS table.
To check, I ran...
SELECT * FROM (
  SELECT M.NAME, M.VALUE AS V
  FROM METRICS AS M, METRICATTRIBUTES AS A
  WHERE M.NAME = A.NAME AND A.ISSTRING = 'FALSE'
) AS S1
WHERE S1.V LIKE 'a%'
This returns no values (as I would expect). The error seems to be in the execution plan, which looks something like this (sorry, I had to fat-finger this):
1 -> Hash Join
2      Hash Cond: ((M.NAME)::TEXT = (A.NAME)::TEXT)
3      -> Seq Scan on METRICS M
4           Filter: ((VALUE)::NUMERIC < 0::NUMERIC)
5      -> Hash
6           -> Seq Scan on METRICATTRIBUTES A
7                Filter: (NOT ISSTRING)
I am not an expert on this (only 1 week of Postgres experience), but it looks like Postgres is trying to apply the cast (line 4) before it applies the join condition (line 2). By doing this, it will try to apply the cast to invalid string values, which is precisely what I am trying to avoid!
Writing this with an explicit join did not make any difference. Writing it as a single select statement was my first attempt, never expecting this type of problem. That also did not work.
Any ideas?

As you can see from your plan, the table METRICS is being scanned in full (Seq Scan) and filtered with your condition CAST(S1.V AS NUMERIC) < 0; the join does not limit the scope at all.
Obviously, you have some rows that contain non-numeric data in METRICS.VALUE.
Check your table for such rows like this:
SELECT * FROM METRICS
WHERE VALUE !~ '^[+-]?[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?$'
Note that it is difficult to catch all possible combinations with a regular expression, so check out this related question: isnumeric() with PostgreSQL
VALUE is not a good name for the column, as it is a reserved word.
Edit: If you're absolutely sure that the joined tables will produce the wanted VALUE-s, then you can use CTEs, which act as an optimization fence in PostgreSQL:
WITH S1 AS (
  SELECT M.NAME, M.VALUE AS V
  FROM METRICS AS M
  JOIN METRICATTRIBUTES AS A USING (NAME)
  WHERE A.ISSTRING = 'FALSE'
)
SELECT *
FROM S1
WHERE CAST(S1.V AS NUMERIC) < 0;
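Note that from PostgreSQL 12 onward, CTEs may be inlined by the planner and no longer act as an automatic fence; there, you would request materialization explicitly. A minimal variant of the query above:

WITH S1 AS MATERIALIZED (
  SELECT M.NAME, M.VALUE AS V
  FROM METRICS AS M
  JOIN METRICATTRIBUTES AS A USING (NAME)
  WHERE A.ISSTRING = 'FALSE'
)
SELECT *
FROM S1
WHERE CAST(S1.V AS NUMERIC) < 0;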

Related

BigQuery SQL: convert array to columns

I have a table with a field A where each entry is a fixed-length array of integers (say, length = 1000). I want to know how to convert it into 1000 columns, with column names given by index_i for i = 0, 1, 2, ..., 999, where each element is the corresponding integer. I can get it done with something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1,
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. Thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals
  FROM raw
  CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This yields results like
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals
  FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce: you want a wide-format table containing the same information (relying on the fact that each array was the same length).
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals, offset
  FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
  ANY_VALUE(vals) AS vals
  FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the EXECUTE IMMEDIATE statement, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
EXECUTE IMMEDIATE """
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals, offset
  FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
  ANY_VALUE(vals) AS vals
  FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
  )
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffixes (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
  ANY_VALUE(vals) AS vals
  FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
  )
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
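For illustration, here is a minimal sketch of such a view; the source table d.raw_table and its array column a are assumed names, not from the original question:

-- Hypothetical source table d.raw_table with an array column a
CREATE OR REPLACE VIEW d.long_format AS
SELECT name, vals, offset
FROM d.raw_table, UNNEST(a) AS vals WITH OFFSET;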
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
Consider the approach below:
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
Applied to dummy data like below (with 10 elements instead of 1000)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is

index_0 | index_1 | index_2 | index_3 | index_4 | index_5 | index_6 | index_7 | index_8 | index_9
--------+---------+---------+---------+---------+---------+---------+---------+---------+--------
10      | 11      | 12      | 13      | 14      | 15      | 16      | 17      | 18      | 19
20      | 21      | 22      | 23      | 24      | 25      | 26      | 27      | 28      | 29
30      | 31      | 32      | 33      | 34      | 35      | 36      | 37      | 38      | 39

How to rotate a two-column table?

This might be a novice question – I'm still learning. I'm on PostgreSQL 9.6 with the following query:
SELECT locales, count(locales) FROM (
  SELECT lower((regexp_matches(locale, '([a-z]{2,3}(-[a-z]{2,3})?)', 'i'))[1]) AS locales
  FROM users
) AS _
GROUP BY locales
My query returns the following dynamic rows:

locales | count
--------+------
en      | 10
fr      | 7
de      | 3
...       (~300 additional locales with their counts)
I'm trying to rotate it so that locale values end up as columns with a single row, like this:

en | fr | de | ... (~300 additional locales)
---+----+----+-----------------------------
10 | 7  | 3  | ... (their counts)
I'm having to do this to play nice with a time-series db/app.
I've tried using crosstab(), but all the examples show better defined tables with 3 or more columns.
I've looked at examples using join, but I can't figure out how to do it dynamically.
Base query
In Postgres 10 or later you could use the simpler and faster regexp_match() instead of regexp_matches(). (Since you only take the first match per row anyway.) But don't bother and use the even simpler substring() instead:
SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
, count(*)::int AS ct
FROM users
WHERE locale ~* '[a-z]{2,3}' -- eliminate NULL, allow index support
GROUP BY 1
ORDER BY 2 DESC, 1
Simpler and faster than your original base query.
About those ordinal numbers in GROUP BY and ORDER BY:
Select first row in each GROUP BY group?
Subtle difference: regexp_matches() returns no row for no match, while substring() returns NULL. I added a WHERE clause to eliminate non-matches a priori and allow index support if applicable, but I don't expect indexes to help here.
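A quick illustration of that difference:

SELECT substring('n/a', '[0-9]+');              -- one row, containing NULL
SELECT * FROM regexp_matches('n/a', '[0-9]+');  -- zero rows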
Note the prefixed (?i), that's a so-called "embedded option" to use case-insensitive matching.
Added a deterministic ORDER BY clause. You'd need that for a simple crosstab().
Aside: you might need _ in the pattern instead of - for locales like "en_US".
Pivot
Try as you might, SQL does not allow dynamic result columns in a single query. You need two round trips to the server. See:
How do I generate a pivoted CROSS JOIN where the resulting table definition is unknown?
You can use a dynamically generated crosstab() query. Basics:
PostgreSQL Crosstab Query
Dynamic query:
PostgreSQL convert columns to rows? Transpose?
But since you generate a single row of plain integer values, I suggest a simple approach:
SELECT 'SELECT ' || string_agg(ct || ' AS ' || quote_ident(locale), ', ')
FROM (
SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
, count(*)::int AS ct
FROM users
WHERE locale ~* '[a-z]{2,3}'
GROUP BY 1
ORDER BY 2 DESC, 1
) t
Generates a query of the form:
SELECT 10 AS en, 7 AS fr, 3 AS de, 3 AS "de-at"
Execute it to produce your desired result.
In psql you can append \gexec to the generating query to feed the generated SQL string back to the server immediately. See:
My function returned a string. How to execute it?
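A minimal demonstration of \gexec (each result cell of the generating query is executed as a statement of its own):

SELECT 'SELECT 1 AS one, 2 AS two' \gexec

 one | two
-----+-----
   1 |   2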

Athena/Presto | Can't match ID row on self join

I'm trying to get the bi-grams on a string column.
I've followed the approach here but Athena/Presto is giving me errors at the final steps.
Source code so far
with word_list as (
SELECT
transaction_id,
words,
n,
regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
f70_remittance_info
FROM exploration_transaction
cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
where cardinality((regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)'))) > 1
and f70_remittance_info is not null
limit 50 )
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2
on wl1.transaction_id = wl2.transaction_id
The specific issue I'm having is on the very last line, when I try to self-join on the transaction ids: it always returns zero rows. It does work if I join only by wl1.n = wl2.n-1 (the position in the array), which is useless if I can't constrain it to the same id.
Athena doesn't support Presto's ngrams function, so I'm left with this approach.
Any clues why this isn't working?
Thanks!
This is speculation, but I note that your CTE is using LIMIT with no ORDER BY. That means that an arbitrary set of rows is being returned.
Although some databases materialize CTEs, many do not. They run the code independently each time it is referenced. My guess is that the code is run independently and the arbitrary set of 50 rows has no transaction ids in common.
One solution would be to add ORDER BY transaction_id in the subquery.
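A sketch of that fix applied to the CTE from the question, trimmed to the essential columns and with the adjacent-word join added back for the bi-gram goal:

with word_list as (
  select
    transaction_id,
    words,
    n
  from exploration_transaction
  cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality as t (words, n)
  where cardinality(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) > 1
    and f70_remittance_info is not null
  order by transaction_id   -- makes the LIMIT deterministic across both CTE scans
  limit 50
)
select wl1.n, wl1.words, wl2.n, wl2.words
from word_list wl1
join word_list wl2
  on wl1.transaction_id = wl2.transaction_id
 and wl1.n = wl2.n - 1      -- consecutive words form the bi-grams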

POSTGRES - evaluate (sub)expression error as NULL

I have sensor data in a Postgres table measurements with columns id, timestamp, s0, s1, s2, ...
Besides, there is an index on columns (id, timestamp). I want to allow for dynamic math expressions (in the example below: sin(s3)*0.1000/s5) for calculation of derived values.
SELECT
  timestamp,
  trunc((sin(s3) * 0.1000 / s5)::numeric, 3) AS "calculated"
FROM measurements
WHERE id = 42
ORDER BY timestamp DESC
LIMIT 10000;
Obviously, this is prone to a "division by zero" error which will make the query fail. Is there a way to catch this error and return e.g. NULL for the calculated value where the error would occur?
Inspired by
Postgres return null values on function error/failure when casting
Store a formula in a table and use the formula in a function
I already tried defining a Postgres function eval_numeric(sensors int[], formula text) that parses the formula and returns NULL on exception. The third line of the SQL statement above now reads
trunc(eval_numeric(ARRAY[s3,s5],'sin(var1)*0.1/var2'), 3) AS "calculated"
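For reference, a minimal sketch of what such a function can look like, assuming simple textual substitution of var1, var2, ... (I use numeric[] here for generality; my actual implementation may differ in details):

CREATE OR REPLACE FUNCTION eval_numeric(sensors numeric[], formula text)
RETURNS numeric AS
$$
DECLARE
  expr   text := formula;
  result numeric;
BEGIN
  -- substitute var1, var2, ... with the supplied sensor values
  FOR i IN 1 .. coalesce(array_length(sensors, 1), 0) LOOP
    expr := replace(expr, 'var' || i, coalesce(sensors[i]::text, 'NULL'));
  END LOOP;
  EXECUTE 'SELECT ' || expr INTO result;
  RETURN result;
EXCEPTION WHEN OTHERS THEN
  -- the EXCEPTION block sets up a subtransaction for every call,
  -- which is the likely source of the ~20x slowdown
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;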
This gives the desired behavior but execution time as reported by EXPLAIN ANALYZE increases by a factor of 20 (~20ms -> ~400ms). Any other ideas?
UPDATE
The dynamic expression to be evaluated stems from a web application user. So the formula above is only an example (might require checking for negative argument to square root). I'd rather have a generic error checking possibility and would prefer not having logic in the math expression. This would be easier for the end user and I could validate the allowed math e.g. with a math parser thereby preventing SQL injection.
Can you change the expression to this?
SELECT timestamp,
       trunc((sin(s3) * 0.1000 / nullif(s5, 0))::numeric, 3) AS "calculated"
FROM measurements
WHERE id = 42
ORDER BY timestamp DESC
LIMIT 10000;
This is the simplest way to accomplish what you want to do.
You could also use a CASE expression:
SELECT
  timestamp,
  CASE WHEN s5 != 0
       THEN trunc((sin(s3) * 0.1000 / s5)::numeric, 3)
       ELSE NULL
  END AS "calculated"
FROM measurements
WHERE id = 42
ORDER BY timestamp DESC
LIMIT 10000;
This has the potential benefit that you may replace the value with anything you want, including NULL.
Another option, if you don't care about rows which would have triggered a divide by zero, would be to just add the check on s5 to the WHERE clause and filter off those rows before the division happens.
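For example, a sketch of that filtering variant:

SELECT timestamp,
       trunc((sin(s3) * 0.1000 / s5)::numeric, 3) AS "calculated"
FROM measurements
WHERE id = 42
  AND s5 <> 0  -- filter out rows that would divide by zero
ORDER BY timestamp DESC
LIMIT 10000;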

Match count of a regular expression for every row

I use the query below to get the content rows that match my_regex_pattern, but I don't know how many times the pattern hits for each row. What is the best way to get the match count for every row in Postgres?
For example if a row's content is 'abcdefabcgh' and my regular expression is 'abc', I want 2 since 'abcdefabcgh' has two 'abc'.
SELECT content
FROM table1
WHERE content ~ 'my_regex_pattern'
Or how can I get the rows that have more than a specific number of matches? For example, just give me the records that have abc more than 4 times.
Of course you can make it work with regexp_matches(). Or better yet, regexp_split_to_table(). To apply to a whole table, use a LATERAL join (requires Postgres 9.3+):
SELECT content, ct
FROM table1 t, LATERAL (
  SELECT count(*) - 1 AS ct
  FROM regexp_split_to_table(t.content, 'abc')
) c
WHERE t.content ~ 'abc';  -- eliminate rows without match
For simple patterns like in the example in your question, you could also:
SELECT content, (length(content) - length(replace(content, 'abc', ''))) / length('abc')
FROM table1
WHERE content LIKE '%abc%';
Typically faster, since regular expression functions are costly. Also works for older versions.
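To answer the second part of the question (rows with more than a specific number of matches), filter on the computed count, building on the LATERAL variant above:

SELECT content, ct
FROM table1 t, LATERAL (
  SELECT count(*) - 1 AS ct
  FROM regexp_split_to_table(t.content, 'abc')
) c
WHERE ct > 4;  -- only rows where 'abc' occurs more than 4 times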