BigQuery SQL: convert array to columns - sql
I have a table with a field A where each entry is a fixed length array A of integers (say length=1000). I want to know how to convert it into 1000 columns, with column name given by index_i, for i=0,1,2,...,999, and each element is the corresponding integer. I can have it done by something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1,
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals
  FROM raw
  CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This yields results like
name | vals
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals
  FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce: you want a wide-format table containing the same information (relying on the fact that each array was the same length).
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT operator! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals, offset
  FROM raw
  CROSS JOIN UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
  ANY_VALUE(vals) AS vals
  FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
A 1 2 3 4 5
B 5 4 3 2 1
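For readers who take the scripting-language route mentioned at the start, the same unnest-then-pivot logic can be sketched in plain Python. This is only an illustration of what the query above does, not BigQuery code; the function names to_long and pivot are made up for the example, and the data mirrors the toy raw table.

```python
# Toy data mirroring the `raw` CTE in the query above.
raw = [("A", [1, 2, 3, 4, 5]), ("B", [5, 4, 3, 2, 1])]


def to_long(rows):
    """Unnest each array into (name, offset, value) rows, like UNNEST ... WITH OFFSET."""
    return [
        (name, offset, val)
        for name, arr in rows
        for offset, val in enumerate(arr)
    ]


def pivot(long_rows):
    """Collect long-format rows into one dict per name, keyed vals_<offset>."""
    wide = {}
    for name, offset, val in long_rows:
        # Each (name, offset) pair occurs once, so plain assignment plays
        # the role that ANY_VALUE plays in the SQL version.
        wide.setdefault(name, {})[f"vals_{offset}"] = val
    return wide


long_rows = to_long(raw)
wide = pivot(long_rows)
print(wide["A"])
```

The long-format list has one entry per array element, and the wide dict has one entry per name, which is the same shape change the PIVOT performs.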
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the statement EXECUTE IMMEDIATE, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
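EXECUTE IMMEDIATE """
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals, offset
  FROM raw
  CROSS JOIN UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
  ANY_VALUE(vals) AS vals
  FOR offset IN ("""
  || (SELECT STRING_AGG(CAST(x AS STRING) ORDER BY x) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
  || """
  )
)
"""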
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression, built on GENERATE_ARRAY(0,4), that generates the list of column suffixes (pivoted values of offset).
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with a subquery that reads the distinct values of offset from long_format itself.
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
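EXECUTE IMMEDIATE """
SELECT *
FROM d.long_format PIVOT(
  ANY_VALUE(vals) AS vals
  FOR offset IN ("""
  || (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
  || """
  )
)
"""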
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
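If you do build the query in a scripting language, a minimal sketch in Python might look like the following. The function name build_pivot_query is hypothetical, d.long_format is the view assumed above, and the snippet only builds the SQL string; submitting it to BigQuery is deliberately left out.

```python
def build_pivot_query(table: str, n_offsets: int) -> str:
    """Return a PIVOT query with the offsets 0..n_offsets-1 spelled out."""
    # Same role as the GENERATE_ARRAY / STRING_AGG expression in the SQL version.
    offsets = ",".join(str(i) for i in range(n_offsets))
    return (
        "SELECT *\n"
        f"FROM {table} PIVOT(\n"
        "  ANY_VALUE(vals) AS vals\n"
        f"  FOR offset IN ({offsets})\n"
        ")"
    )


# Hypothetical usage with the toy array length of 5.
query = build_pivot_query("d.long_format", 5)
print(query)
```

For the original question you would pass 1000 instead of 5, which avoids writing the column list by hand.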
Consider below approach
execute immediate ( select '''
select * except(id) from (
  select to_json_string(A) id, * except(A)
  from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If applied to dummy data like below (with 10 instead of 1000 elements)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is one row per input array, with columns index_0 through index_9 holding the array elements.
How to rotate a two-column table?
This might be a novice question; I'm still learning. I'm on PostgreSQL 9.6 with the following query:
SELECT locales, count(locales)
FROM (
  SELECT lower((regexp_matches(locale, '([a-z]{2,3}(-[a-z]{2,3})?)', 'i'))[1]) AS locales
  FROM users) AS _
GROUP BY locales
My query returns the following dynamic rows:
locales | count
en | 10
fr | 7
de | 3
n additional locales (~300)... | n-count
I'm trying to rotate it so that locale values end up as columns with a single row, like this:
en | fr | de | n additional locales (~300)...
10 | 7 | 3 | n-count
I'm having to do this to play nice with a time-series db/app.
I've tried using crosstab(), but all the examples show better defined tables with 3 or more columns. I've looked at examples using join, but I can't figure out how to do it dynamically.
Base query
In Postgres 10 or later you could use the simpler and faster regexp_match() instead of regexp_matches(). (Since you only take the first match per row anyway.) But don't bother and use the even simpler substring() instead:
SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
     , count(*)::int AS ct
FROM   users
WHERE  locale ~* '[a-z]{2,3}'  -- eliminate NULL, allow index support
GROUP  BY 1
ORDER  BY 2 DESC, 1
Simpler and faster than your original base query. About those ordinal numbers in GROUP BY and ORDER BY: Select first row in each GROUP BY group?
Subtle difference: regexp_matches() returns no row for no match, while substring() returns null. I added a WHERE clause to eliminate non-matches a priori and allow index support if applicable, but I don't expect indexes to help here.
Note the prefixed (?i), that's a so-called "embedded option" to use case-insensitive matching.
Added a deterministic ORDER BY clause. You'd need that for a simple crosstab().
Aside: you might need _ in the pattern instead of - for locales like "en_US".
Pivot
Try as you might, SQL does not allow dynamic result columns in a single query. You need two round trips to the server. See: How do I generate a pivoted CROSS JOIN where the resulting table definition is unknown?
You can use a dynamically generated crosstab() query. Basics: PostgreSQL Crosstab Query. Dynamic query: PostgreSQL convert columns to rows? Transpose?
But since you generate a single row of plain integer values, I suggest a simple approach:
SELECT 'SELECT ' || string_agg(ct || ' AS ' || quote_ident(locale), ', ')
FROM  (
   SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
        , count(*)::int AS ct
   FROM   users
   WHERE  locale ~* '[a-z]{2,3}'
   GROUP  BY 1
   ORDER  BY 2 DESC, 1
   ) t
Generates a query of the form:
SELECT 10 AS en, 7 AS fr, 3 AS de, 3 AS "de-at"
Execute it to produce your desired result.
In psql you can append \gexec to the generating query to feed the generated SQL string back to the server immediately. See: My function returned a string. How to execute it?
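The query-generating step can also be sketched client-side. The Python below is only an illustration of what the string_agg() / quote_ident() expression builds; its quote_ident is a simplified stand-in written for this example (it ignores reserved words, which the real PostgreSQL function also quotes), and the counts list is the sample data from the question.

```python
import re


def quote_ident(name: str) -> str:
    """Simplified stand-in for PostgreSQL's quote_ident().

    Leaves plain lower-case identifiers alone and double-quotes anything
    else. Reserved words are NOT handled here.
    """
    if re.fullmatch(r"[a-z_][a-z0-9_]*", name):
        return name
    return '"' + name.replace('"', '""') + '"'


# Sample (locale, count) rows, already ordered as the base query would return them.
counts = [("en", 10), ("fr", 7), ("de", 3), ("de-at", 3)]

query = "SELECT " + ", ".join(f"{ct} AS {quote_ident(loc)}" for loc, ct in counts)
print(query)
```

The resulting string is the second query to send to the server, which is the "two round trips" pattern described above.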
Match count of a regular expression for every row
I use the query below to get content rows that match my_regex_pattern, but I don't know how many times the pattern matched in each row. What is the best way to get the match count for every row in Postgres? For example, if a row's content is 'abcdefabcgh' and my regular expression is 'abc', I want 2, since 'abcdefabcgh' contains 'abc' twice.
SELECT content FROM table1 WHERE content ~ 'my_regex_pattern'
Or how can I get rows which have more than a specific number of matches? For example, just give me records which contain abc more than 4 times.
Of course you can make it work with regexp_matches(). Or better yet, regexp_split_to_table(). To apply to a whole table, use a LATERAL join (requires Postgres 9.3+):
SELECT content, ct
FROM   table1 t, LATERAL (
   SELECT count(*) - 1 AS ct
   FROM   regexp_split_to_table(t.content, 'abc')
   ) c
WHERE  t.content ~ 'abc';  -- eliminate rows without match
For simple patterns like in the example in your question, you could also:
SELECT content
     , (length(content) - length(replace(content, 'abc', ''))) / length('abc')
FROM   table1
WHERE  content LIKE '%abc%';
Typically faster, since regular expression functions are costly. Also works for older versions.
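Both counting tricks rest on simple arithmetic, which a short Python sketch can make concrete. This is an illustration of the logic only, not PostgreSQL code; the function names are invented for the example, and the split-based version assumes a pattern without capturing groups.

```python
import re


def count_by_split(content: str, pattern: str) -> int:
    """Splitting on the pattern yields one more piece than there are matches."""
    return len(re.split(pattern, content)) - 1


def count_by_replace(content: str, literal: str) -> int:
    """Removed characters divided by the literal's length gives the match count."""
    return (len(content) - len(content.replace(literal, ""))) // len(literal)


rows = ["abcdefabcgh", "no match here", "abc" * 5]

# Keep only rows with more than 4 matches, as asked in the question.
frequent = [r for r in rows if count_by_replace(r, "abc") > 4]
print(count_by_split("abcdefabcgh", "abc"), frequent)
```

In SQL the same filter would be a WHERE condition on the computed count.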
Can I turn multiple values stored in a single field into a set of rows from within a select statement?
This is asked regarding an Oracle 11g database. I'm trying to query an Atlassian Confluence calendar table. It stores calendar entries for an entire calendar into a single value in a single row, which is this gigantic glob of iCal crap. If the fields within each entry were in a consistent order, my regex fu would be strong enough to parse out the particular entry I am searching for... but since I need to search for a date, the description, and the summary, all of which can apparently be in any order within the BEGIN/END VEVENT, this is impossible. I'm halfway certain it would be impossible even with lookahead and lookbehind.
Is there a sql (not pl-sql) construction that would chop this single string/blob value out into multiple rows, so that I could do something like:
select * from (chopped up value) where x like '%something%';
This would make it sort of the reverse of a wm_concat() or group_concat...
A typical entry looks something like this (and it has 50 or 60 already):
BEGIN:VEVENT
SUMMARY:Richard Smichard
ATTENDEE;X-CONFLUENCE-USER=rismich: onfluence/display/~rismich
LOCATION:
DESCRIPTION:Primary
DTSTART;VALUE=DATE:20130726
DTEND;VALUE=DATE:20130729
DTSTAMP:20130724T153322Z
CREATED:20130724T153322Z
LAST-MODIFIED:20130724T153322Z
ORGANIZER;
SEQUENCE:0
END:VEVENT
I can't use PL-SQL or build a proper parser because the environment this will run in doesn't make that possible. I get to run a select statement, and it either returns the value I'm looking for, or it doesn't.
Also, NoSQL sucks. Big time.
This is a quick test:
with w1 as
(
select 'BEGIN:VEVENT\
SUMMARY:Richard Smichard
ATTENDEE;X-CONFLUENCE-USER=rismich: onfluence/display/~rismich
LOCATION:
DESCRIPTION:Primary
DTSTART;VALUE=DATE:20130726
DTEND;VALUE=DATE:20130729
DTSTAMP:20130724T153322Z
CREATED:20130724T153322Z
LAST-MODIFIED:20130724T153322Z
ORGANIZER;
SEQUENCE:0
END:VEVENT' text from dual
),
w2 as
(
select 'SUMMARY' label from dual
union all
select 'DESCRIPTION' label from dual
)
select regexp_substr(w1.text, 'UID.*') id,
       w2.label,
       substr(regexp_substr(w1.text, w2.label || '.*'),
              instr(regexp_substr(w1.text, w2.label || '.*'), ':') + 1) spl
from w1, w2;
It gives:
1 SUMMARY Richard Smichard
2 DESCRIPTION Primary
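The idea behind the query, looking each label up independently so that field order stops mattering, can be illustrated outside the database. The Python sketch below is not a substitute for the SELECT (the question rules out anything but SQL); the helper name field and the shortened sample entry are assumptions made for the example.

```python
import re

# Shortened sample entry based on the one in the question.
entry = """BEGIN:VEVENT
SUMMARY:Richard Smichard
LOCATION:
DESCRIPTION:Primary
DTSTART;VALUE=DATE:20130726
END:VEVENT"""


def field(text: str, label: str):
    """Return the value after the first ':' on the line starting with label, or None."""
    # [^:\n]* skips optional parameters such as ;VALUE=DATE before the colon.
    match = re.search(rf"^{re.escape(label)}[^:\n]*:(.*)$", text, flags=re.MULTILINE)
    return match.group(1) if match else None


for label in ("SUMMARY", "DESCRIPTION", "DTSTART"):
    print(label, field(entry, label))
```

Each lookup scans the whole entry on its own, which mirrors how the SQL joins the text against a small table of labels.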
Purposely having a query return blank entries at regular intervals
I want to write a query that returns 3 results followed by blank results followed by the next 3 results, and so on. So if my database had this data:
CREATE TABLE table (a integer, b integer, c integer, d integer);
INSERT INTO table (a,b,c,d)
VALUES (1,2,3,4),
       (5,6,7,8),
       (9,10,11,12),
       (13,14,15,16),
       (17,18,19,20),
       (21,22,23,24),
       (25,26,27,28);
I would want my query to return this
1,2,3,4
5,6,7,8
9,10,11,12
, , ,
13,14,15,16
17,18,19,20
21,22,23,24
, , ,
25,26,27,28
I need this to work for arbitrarily many entries that I select for, and have every three grouped together like this. I'm running postgresql 8.3
This should work flawlessly in PostgreSQL 8.3
SELECT a, b, c, d
FROM  (
   SELECT rn, 0 AS rk, (x[rn]).*
   FROM  (
      SELECT x, generate_series(1, array_upper(x, 1)) AS rn
      FROM  (SELECT ARRAY(SELECT tbl FROM tbl) AS x) x
      ) y
   UNION ALL
   SELECT generate_series(3, (SELECT count(*) FROM tbl), 3), 1, (NULL::tbl).*
   ORDER  BY rn, rk
   ) z
Major points
Works for a query that selects all columns of tbl.
Works for any table.
For selecting arbitrary columns you have to substitute (NULL::tbl).* with a matching number of NULL columns in the second query.
Assuming that NULL values are ok for "blank" rows. If not, you'll have to cast your columns to text in the first and substitute '' for NULL in the second SELECT.
Query will be slow with very big tables.
If I had to do it, I would write a plpgsql function that loops through the results and inserts the blank rows. But you mentioned you had no direct access to the db ...
In short, no, there's not an easy way to do this, and generally, you shouldn't try. The database is concerned with what your data actually is, not how it's going to be displayed. It's not an appropriate scope of responsibility to expect your database to return "dummy" or "extra" data so that some down-stream process produces a desired output. The generating script needs to do that.
As you can't change your down-stream process, you could (read that with a significant degree of skepticism and disdain) add things like this:
Select Top 3 a, b, c, d From table
Union
Select Top 1 '', '', '', '' From table
Union
Select Top 3 Skip 3 a, b, c, d From table
Please, don't actually try to do that.
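When the generating script can be changed, inserting the blank rows there is short. The following Python sketch assumes the rows have already been fetched into a list; the function name with_blank_rows and the empty-string blank row are choices made for this example.

```python
def with_blank_rows(rows, interval=3, blank=("", "", "", "")):
    """Return rows with a blank row after every `interval` rows, none at the end."""
    out = []
    for i, row in enumerate(rows, start=1):
        out.append(row)
        # Only add a separator when more data rows follow.
        if i % interval == 0 and i < len(rows):
            out.append(blank)
    return out


# Sample data from the question: 7 rows of 4 integers.
data = [tuple(range(n, n + 4)) for n in range(1, 29, 4)]

for row in with_blank_rows(data):
    print(",".join(str(v) for v in row))
```

With the seven sample rows this yields nine output rows, with blanks after the third and sixth data rows.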
You can do it (at least on DB2 - there doesn't appear to be equivalent functionality for your version of PostgreSQL). No looping needed, although there is a bit of trickery involved...
Please note that though this works, it's really best to change your display code.
Statement requires CTEs (although that can be re-written to use other table references), and OLAP functions (I guess you could re-write it to count() previous rows in a subquery, but...).
WITH dataList (rowNum, dataColumn) as (
    SELECT CAST(CAST(:interval as REAL) / (:interval - 1) * ROW_NUMBER() OVER(ORDER BY dataColumn) as INTEGER), dataColumn
    FROM dataTable),
blankIncluder(rowNum, dataColumn) as (
    SELECT rowNum, dataColumn
    FROM dataList
    UNION ALL
    SELECT rowNum - 1, :blankDataColumn
    FROM dataList
    WHERE MOD(rowNum - 1, :interval) = 0
    AND rowNum > :interval)
SELECT *
FROM blankIncluder
ORDER BY rowNum
This will generate a list of those elements from the datatable, with a 'blank' line every interval lines, as ordered by the initial query. The result set only has 'blank' lines between existing lines - there are no 'blank' lines on the ends.
Is there an SQL function which generates a given range of sequential numbers?
I need to generate an array of sequential integers with a given range in order to use it in:
SELECT tbl.pk_id FROM tbl WHERE tbl.pk_id NOT IN (sequential array);
If you have a given range - i.e. a start point and an end point - of sequential integers you should just be able to use the BETWEEN keyword:
SELECT tbl.pk_id FROM tbl WHERE tbl.pk_id NOT BETWEEN START_INT AND END_INT
or am I misunderstanding your question..?
Because you say you've already got a number table, I would suggest this:
SELECT element FROM series WHERE element NOT IN (SELECT pk_id FROM tbl)
Might possibly be more efficient than the query you've tried.
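What that query computes is the set of numbers in the series that have no matching pk_id. As a rough illustration of the same logic outside SQL, here is a Python sketch; the function name missing_ids and the sample ids are invented for the example.

```python
def missing_ids(existing, start, end):
    """Return the integers in [start, end] that do not appear in existing."""
    present = set(existing)  # set lookup plays the role of the NOT IN subquery
    return [i for i in range(start, end + 1) if i not in present]


# Hypothetical primary keys with gaps.
print(missing_ids([1, 2, 4, 7], 1, 7))
```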
Two thoughts . . .
First, there's no standard SQL function that does that. But some systems include a non-standard function that generates a series. In PostgreSQL, for example, you can use the generate_series() function.
select generate_series(1,100000);
1
2
3
...
100000
That function essentially returns a table; it can be used in joins. If Informix doesn't have a function that does something similar, maybe you can write an Informix SPL function that does.
Second, you could just create a one-column table and populate it with a series of integers. This works on all platforms, and doesn't require programming. It requires only minimal maintenance. (You need to keep more integers in this table than you're using in your production table.)
create table integers (
  i integer primary key
);
Use a spreadsheet or a utility program to generate a series of integers to populate it. The easiest way, if you have a Unix, Linux, or Cygwin environment lying around, is to use seq.
$ seq 1 5 > integers
$ cat integers
1
2
3
4
5
Informix has a free developer version you can download. Maybe you can build a compelling demo with it, and management will let you upgrade.
I'll suggest a generic solution to create a result set containing the integers 0 .. 2^k-1 for a given k, for subsequent use as a subquery, view or materialized view. The code below illustrates the technique for k=3.
SELECT bv0 + 2*bv1 + 4*bv2 val
FROM (
  SELECT *
  FROM (
    SELECT 0 bv0 FROM DUAL
    UNION
    SELECT 1 bv0 FROM DUAL
  ) bit0
  CROSS JOIN (
    SELECT 0 bv1 FROM DUAL
    UNION
    SELECT 1 bv1 FROM DUAL
  ) bit1
  CROSS JOIN (
    SELECT 0 bv2 FROM DUAL
    UNION
    SELECT 1 bv2 FROM DUAL
  ) bit2
) pow2
;
I hope that helps you with your task. Best regards, carsten
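The cross join of one-bit tables is a Cartesian product of k copies of {0, 1}, with each combination read as a binary number. A small Python sketch of that idea, with the function name integers_from_bits chosen for the example:

```python
from itertools import product


def integers_from_bits(k: int):
    """Combine k independent bits, weighted by powers of two, into 0 .. 2**k - 1."""
    return sorted(
        sum(bit << position for position, bit in enumerate(bits))
        for bits in product((0, 1), repeat=k)  # the CROSS JOIN of k bit tables
    )


print(integers_from_bits(3))
```

Every extra bit table doubles the number of rows, so k joins cover 2^k integers.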