BigQuery - Joining and pivoting large tables - google-bigquery

I know there are some posts on pivoting, which I have used to get where I am today (thanks to the BQ community!). But this post seeks advice on optimising this when a large number of pivot columns is needed and distributed table joins are needed as well, along with deduping. Not asking much, right!
Objective:
We have 2 large BQ tables, each with a full 10-year history, that need joining:
sales_order_header (13 GB - 1.35 million rows)
sales_order_line (50 GB - 5 million rows)
This is a typical 'header/line' one-to-many relationship. Unfortunately the data for the tables arrives as 2 separate streams rather than as 1 document-style stream where the line is nested inside the header, which would be ideal. Since it is not, distributed joins become necessary for some of the views that our BI tool (Tableau) calls periodically (every 60 mins) to ingest 'cleansed' data that is:
1. deduped (both tables, that is)
2. joined header to line (on salesOrderId)
3. unpacked: each table has its own array of 'sourceData' name/value pairs that needs unpacking ('pivoting') so that it is no longer an array
Point 3 presents an issue in its own right. We have a column called 'sourceData', which is basically where the core data is. It's an array of string name/value pairs (a row in BQ is a replication of a single row from a DB, so the name is a column name and the value is that column's value for a single row).
Now I think here lies the issue: as there are 250 array entries (we know the exact number up front), this equates to 250 'unnest' statements each, and the best approach I can think of uses sub-selects:
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
250 times
And this is done as a pattern in the respective views for each of the header and line tables.
So the SQL for the view for just retrieving a deduped, flattened/pivoted array for the sales_order_header table is as follows. The sales_order_line has the same pattern for its view:
#standardSQL
WITH latest_snapshot_dups AS (
SELECT
salesOrderId,
PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", lastUpdated) AS lastUpdatedTimestampUTC,
sourceData,
_PARTITIONTIME AS bqPartitionTime
FROM
`project.ds.sales_order_header_refdata`
),
latest_snapshot_nodups AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY salesOrderId ORDER BY lastUpdatedTimestampUTC DESC) AS rowNum
FROM latest_snapshot_dups
)
SELECT
salesOrderId,
lastUpdatedTimestampUTC,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'b') AS b,
....250 of these
FROM
latest_snapshot_nodups
WHERE
rowNum = 1
Although I'm showing just one here, we have two similar views (with a total of 250 + 300 = 550 unique subqueries that unnest/pivot). Now I want to join the header view with the line view, and I immediately run into an issue: exceeding a limit on subqueries.
Is there a better way to do this, assuming this is the data there is to work with? A better way to 'pivot' perhaps? Or a more efficient way of building a single view that optimises the order of things, rather than using 2 discrete views?
Thanks for your help BQ Community!

I run into an issue straight away exceeding a limit of subqueries
You are currently using the pattern below (non-essential parts of the code removed for simplicity):
#standardSQL
SELECT
salesOrderId,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'b') AS b,
....250 OF these
FROM latest_snapshot_nodups
Try the pattern below instead:
#standardSQL
SELECT
salesOrderId,
MAX(IF(name = 'a', val, NULL)) AS a,
MAX(IF(name = 'b', val, NULL)) AS b,
....250 OF these
FROM latest_snapshot_nodups, UNNEST(sourceData) kv
GROUP BY salesOrderId
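With 250 names, the repeated MAX(IF(...)) lines are tedious to write by hand, so they can be generated. The sketch below builds the select list from a list of names and checks the pivot logic locally. It is only an illustration under assumptions: the table name kv_long and the helper pivot_sql are hypothetical, SQLite stands in for BigQuery so the example can run anywhere, and CASE WHEN replaces IF because SQLite has no IF function. The kv_long table plays the role of latest_snapshot_nodups after UNNEST(sourceData).

```python
import sqlite3

def pivot_sql(names, table="kv_long"):
    """Build a conditional-aggregation pivot: one MAX(CASE ...) per name.

    Names are interpolated directly, so they must come from a trusted list.
    """
    cols = ",\n  ".join(
        f"MAX(CASE WHEN name = '{n}' THEN val END) AS {n}" for n in names
    )
    return (
        f"SELECT salesOrderId,\n  {cols}\n"
        f"FROM {table}\nGROUP BY salesOrderId\nORDER BY salesOrderId"
    )

# Hypothetical long-format data: one row per (order, name, value) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv_long (salesOrderId INTEGER, name TEXT, val TEXT)")
conn.executemany(
    "INSERT INTO kv_long VALUES (?, ?, ?)",
    [(1, "a", "x1"), (1, "b", "y1"), (2, "a", "x2")],
)

rows = conn.execute(pivot_sql(["a", "b"])).fetchall()
print(rows)
```

A name that is missing for a given order simply comes back as NULL, because the aggregate sees no matching value for that column.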

Related

Bigquery SQL: convert array to columns

I have a table with a field A where each entry is a fixed-length array of integers (say length=1000). I want to know how to convert it into 1000 columns, with column names given by index_i, for i=0,1,2,...,999, where each element is the corresponding integer. I can have it done by something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1,
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This yields results like:
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
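For readers who think more easily in a scripting language, the same unwrapping can be mimicked in plain Python. This is only an analogy for what CROSS JOIN UNNEST produces, not a description of how BigQuery executes it; the variable names mirror the toy query above.

```python
# The same toy data as the `raw` CTE: a name plus an array per row.
raw = [("A", [1, 2, 3, 4, 5]), ("B", [5, 4, 3, 2, 1])]

# Each array element becomes its own row, paired with the name
# from the row the array belonged to.
long_format = [(name, val) for name, arr in raw for val in arr]

for row in long_format:
    print(row)
```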
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce: you want a wide-format table containing the same information (relying on the fact that each array was the same length).
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT operator! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the EXECUTE IMMEDIATE statement, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffixes (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
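As a rough sketch of that last option, the query string can be assembled in Python instead of inside EXECUTE IMMEDIATE. The helper name build_pivot_query is hypothetical, and d.long_format is the view assumed earlier in this answer; submitting the string to BigQuery is left to whichever client you use and is not shown here.

```python
def build_pivot_query(source_table: str, n: int) -> str:
    """Return a PIVOT query with the offsets 0..n-1 listed explicitly."""
    offsets = ", ".join(str(i) for i in range(n))
    return (
        "SELECT *\n"
        f"FROM {source_table} PIVOT(\n"
        "  ANY_VALUE(vals) AS vals\n"
        f"  FOR offset IN ({offsets})\n"
        ")"
    )

# Assumes the array length (5 here) is known or has been looked up beforehand.
query = build_pivot_query("d.long_format", 5)
print(query)
```

The array length still has to come from somewhere, for example from a separate query that returns the maximum offset, before the string is built.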
Consider the approach below:
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If applied to dummy data like below (with 10 instead of 1000 elements)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is one row per input array, with columns index_0 through index_9.

Difference of values in the same column ms access sql (mdb)

I have a table which contains two columns with values that are not unique. Those values are generated automatically and I have no way to do anything about it: I cannot edit the table or the db, nor make custom functions.
With that in mind, I've solved this problem in SQL Server, but my solution uses some functions that do not exist in MS Access.
The columns are Volume and ComponentID; here is my code in SQL:
with rows as (
select row_number() over (order by volume) as rownum, volume
from test where componentid = 'S3')
select top 10
rowsMinusOne.volume, coalesce(rowsMinusOne.volume - rows.volume,0) as diff
from rows as rowsMinusOne
left outer join rows
on rows.rownum = rowsMinusOne.rownum - 1
Sample data:
58.29168
70.57396
85.67902
97.04888
107.7026
108.2022
108.3975
108.5777
109
109.8944
Expected results:
Volume | diff
---------------------
58.29168 | 0
70.57396 | 12.28228
85.67902 | 15.10506
97.04888 | 11.36986
107.7026 | 10.65368
108.2022 | 0.4996719
108.3975 | 0.1952896
108.5777 | 0.1801834
109 | 0.4223404
109.8944 | 0.89431
I have solved the coalesce part by replacing it with NZ. I have tried to use DCOUNT to solve the row_number part (How to show the record number in a MS Access report table?), but I receive an error that it cannot find the function (I am reading the data by code; that is the only thing I can do).
I also tried this, but as the answer says, I need a column with a unique value, which I do not have and cannot create: Microsoft Access query to duplicate ROW_NUMBER
Consider:
SELECT TOP 10 Table1.ComponentID,
DCount("*","Table1","ComponentID = 'S3' AND Volume<" & [Volume])+1 AS Seq, Table1.Volume,
Nz(Table1.Volume -
(SELECT Top 1 Dup.Volume FROM Table1 AS Dup
WHERE Dup.ComponentID = Table1.ComponentID AND Dup.Volume<Table1.Volume
ORDER BY Volume DESC),0) AS Diff
FROM Table1
WHERE (((Table1.ComponentID)="S3"))
ORDER BY Table1.Volume;
This will likely perform very slowly with a large dataset.
Alternative solutions:
build query that calculates difference, use that query as source for a report, use textbox RunningSum property to calculate sequence number
VBA looping through recordset and saving results to a 'temp' table
export to Excel
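The correlated-subquery idea in the query above can be tried out without Access. The sketch below is an approximation, not the Access query itself: it runs on SQLite through Python's sqlite3 module, uses COALESCE in place of Nz, and uses MAX(...) in place of TOP 1 ... ORDER BY Volume DESC, which picks the same "next lower" value. Only the first five sample volumes are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (ComponentID TEXT, Volume REAL)")
volumes = [58.29168, 70.57396, 85.67902, 97.04888, 107.7026]
conn.executemany("INSERT INTO Table1 VALUES ('S3', ?)", [(v,) for v in volumes])

# For each row, subtract the largest Volume that is strictly smaller;
# the smallest row has no predecessor, so its difference falls back to 0.
sql = """
SELECT t.Volume,
       COALESCE(t.Volume - (SELECT MAX(d.Volume)
                            FROM Table1 AS d
                            WHERE d.ComponentID = t.ComponentID
                              AND d.Volume < t.Volume), 0) AS Diff
FROM Table1 AS t
WHERE t.ComponentID = 'S3'
ORDER BY t.Volume
"""
rows = conn.execute(sql).fetchall()
for volume, diff in rows:
    print(f"{volume:.5f}  {diff:.5f}")
```

Because the comparison is strictly "less than", rows with identical Volume values all get the difference to the next lower distinct value, which matters here since the values are not unique.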

Splitting table PK values into roughly same-size ranges

I have a table in Postgres with about half a million rows and an integer primary key.
I'd like to split its entire PK space into N ranges of approximately same size for independent processing. How do I best do it?
I apparently can do it by fetching all PK values to a client and remembering every N-th value. This does a full scan and a fetch of all the values, while I want no more than N+1 of them.
I can select min and max values and cut the range, but if the PKs are not distributed quite evenly, it may give me some ranges of seriously different sizes.
I want ranges for index-based access later on, so any modulo-based tricks do not apply.
Is there any nice SQL-based solution that does not involve fetching all the keys to a client? Writing an N-specific query, e.g. with N clauses, is fine.
An example:
IDs in a range, say, from 1234 to 567890, N = 4.
I'd like to get 3 numbers, say 127123, 254789, 379860, so that there are approximately 125k records in each of the ranges of IDs [1234, 127123], [127123, 254789], [254789, 379860], [379860, 567890].
Update:
I've come up with a solution like this:
select
percentile_disc(0.25) within group (order by c.id) over() as pct_25
,percentile_disc(0.50) within group (order by c.id) over() as pct_50
,percentile_disc(0.75) within group (order by c.id) over() as pct_75
from customer c
limit 1
;
It does a decent job of giving me the exact range boundaries, and runs only a few seconds, which is fine for my purposes.
What bothers me is that I have to add the limit 1 clause to get just one row. Without it, I receive identical rows, one per record in the table. Is there a better way to get just one row of the percentiles?
I think you can use row_number() for this purpose. Something like this:
select t.*,
floor((seqnum * N) / cnt) as range
from (select t.*,
row_number() over (order by pk) - 1 as seqnum,
count(*) over () as cnt
from t
) t;
This assumes by range that you mean ranges on pk values. You can also move the range expression to a where clause to just select one particular range.
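To see what the floor((seqnum * N) / cnt) expression does, here is a small Python imitation of it. It is only a sketch for intuition: the function name range_boundaries and the sample keys are made up, and in practice the work would stay in the database as in the query above.

```python
def range_boundaries(pks, n):
    """Split primary keys into n near-equal buckets; return (low, high) per bucket."""
    ordered = sorted(pks)
    cnt = len(ordered)
    buckets = {}
    for seqnum, pk in enumerate(ordered):   # seqnum plays the role of row_number() - 1
        bucket = (seqnum * n) // cnt        # floor((seqnum * N) / cnt)
        low, _ = buckets.get(bucket, (pk, pk))
        buckets[bucket] = (low, pk)         # keep first key seen, update last key seen
    return [buckets[b] for b in sorted(buckets)]

# Unevenly distributed keys: dense at the start, sparse at the end.
pks = list(range(1, 9)) + [100, 200, 300, 400]
bounds = range_boundaries(pks, 4)
print(bounds)
```

Each bucket ends up with the same number of rows even though the key values themselves are spread very unevenly, which is the property a plain min/max split does not give.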

Migrating from non-partitioned to Partitioned tables

In June the BQ team announced support for date-partitioned tables. But the guide is missing how to migrate old non-partitioned tables into the new style.
I am looking for a way to update several or if not all tables to the new style.
Also, outside of DAY-type partitioning, what other options are available? Does the BQ UI show this? I wasn't able to create such a new partitioned table from the BQ Web UI.
What works for me is the following set of queries, run directly in BigQuery (BigQuery > create new query). The destination can be a new dataset if preferred.
CREATE TABLE dataset.new_table
PARTITION BY DATE(date_column)
AS SELECT * FROM dataset.table_to_copy;
Then as the next step I drop the table:
DROP TABLE dataset.table_to_copy;
I got this solution from https://fivetran.com/docs/warehouses/bigquery/partition-table, using only step 2.
from Pavan’s answer: Please note that this approach will charge you the scan cost of the source table for the query as many times as you query it.
from Pentium10 comments: So suppose I have several years of data, I need to prepare different query for each day and run all of it, and suppose I have 1000 days in history, I need to pay 1000 times the full query price from the source table?
As we can see, the main problem here is having a full scan for each and every day. The rest is less of a problem and can easily be scripted in any client of your choice.
So, the question addressed below is: how to partition a table while avoiding a full table scan for each and every day?
The step-by-step below shows the approach.
It is generic enough to extend/apply to any real use case. Meanwhile, I am using bigquery-public-data.noaa_gsod.gsod2017 and limiting the "exercise" to just 10 days to keep it readable.
Step 1 – Create Pivot table
In this step we
a) compress each row's content into a record/array
and
b) put them all into the respective "daily" column
#standardSQL
SELECT
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170101' THEN r END) AS day20170101,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170102' THEN r END) AS day20170102,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170103' THEN r END) AS day20170103,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170104' THEN r END) AS day20170104,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170105' THEN r END) AS day20170105,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170106' THEN r END) AS day20170106,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170107' THEN r END) AS day20170107,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170108' THEN r END) AS day20170108,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170109' THEN r END) AS day20170109,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170110' THEN r END) AS day20170110
FROM (
SELECT d, r, ROW_NUMBER() OVER(PARTITION BY d) AS line
FROM (
SELECT
stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t
GROUP BY stn, d
)
)
GROUP BY line
Run the above query in the Web UI with pivot_table (or whatever name is preferred) as the destination.
As we can see, this gives a table with 10 columns, one column per day, and the schema of each column is a copy of the schema of the original table.
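For more than a handful of days, the repeated ARRAY_CONCAT_AGG lines in Step 1 are easier to generate than to type. A minimal sketch, assuming the same dayYYYYMMDD naming used above; the helper name day_columns is hypothetical, and the output is only the select list, to be pasted into the Step 1 query.

```python
from datetime import date, timedelta

def day_columns(start: date, days: int) -> str:
    """Build the repeated ARRAY_CONCAT_AGG(...) select list, one column per day."""
    lines = []
    for i in range(days):
        label = "day" + (start + timedelta(days=i)).strftime("%Y%m%d")
        lines.append(
            f"ARRAY_CONCAT_AGG(CASE WHEN d = '{label}' THEN r END) AS {label}"
        )
    return ",\n".join(lines)

# Same 10-day window as the example query.
select_list = day_columns(date(2017, 1, 1), 10)
print(select_list)
```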
Step 2 – Processing partitions one-by-one ONLY scanning respective column (no full table scan) – inserting into respective partition
#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170101) AS r
Run the above query from the Web UI with the destination table named mytable$20170101
You can run the same for the next day:
#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170102) AS r
Now you should have the destination table mytable$20170102, and so on
You should be able to automate/script this step with any client of your choice
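One possible way to script Step 2 is to generate the per-day query together with its destination partition. The sketch below only builds the strings; it assumes the partition decorator carries the same date as the column name, the names pivot_table and mytable come from the steps above, and the helper partition_jobs is hypothetical. Submitting each pair as a query job with that destination is left to your BigQuery client.

```python
from datetime import date, timedelta

def partition_jobs(start: date, days: int, pivot="pivot_table", target="mytable"):
    """Yield (destination, query) pairs, one per daily partition."""
    for i in range(days):
        ymd = (start + timedelta(days=i)).strftime("%Y%m%d")
        # Each query touches only its own day column of the pivot table.
        query = f"#standardSQL\nSELECT r.*\nFROM {pivot}, UNNEST(day{ymd}) AS r"
        yield f"{target}${ymd}", query

jobs = list(partition_jobs(date(2017, 1, 1), 10))
for destination, query in jobs[:2]:
    print(destination)
    print(query)
```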
There are many variations of how you can use the above approach; it is up to your creativity.
Note: BigQuery allows up to 10000 columns in a table, so 365 columns for the respective days of one year is definitely not a problem here :o)
Unless there is a limitation on how far back you can go with new partitions: I heard (but haven't had a chance to check yet) that it is now no more than 90 days back.
Update
Please note:
The above version has a little extra logic for packing all aggregated cells into as few final rows as possible.
ROW_NUMBER() OVER(PARTITION BY d) AS line
and then
GROUP BY line
along with
ARRAY_CONCAT_AGG(…)
does this
This works well when the row size in your original table is not that big, so the final combined row size will still be within the row size limit that BigQuery has (which I believe is 10 MB as of now).
If your source table already has a row size close to that limit, use the adjusted version below.
In this version, grouping is removed so that each row has a value for only one column.
#standardSQL
SELECT
CASE WHEN d = 'day20170101' THEN r END AS day20170101,
CASE WHEN d = 'day20170102' THEN r END AS day20170102,
CASE WHEN d = 'day20170103' THEN r END AS day20170103,
CASE WHEN d = 'day20170104' THEN r END AS day20170104,
CASE WHEN d = 'day20170105' THEN r END AS day20170105,
CASE WHEN d = 'day20170106' THEN r END AS day20170106,
CASE WHEN d = 'day20170107' THEN r END AS day20170107,
CASE WHEN d = 'day20170108' THEN r END AS day20170108,
CASE WHEN d = 'day20170109' THEN r END AS day20170109,
CASE WHEN d = 'day20170110' THEN r END AS day20170110
FROM (
SELECT
stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t
GROUP BY stn, d
)
WHERE d BETWEEN 'day20170101' AND 'day20170110'
As you can see now, the pivot table (sparce_pivot_table) is sparse enough (the same 21.5 MB, but now 114,089 rows vs. 11,584 rows in pivot_table), so it has an average row size of 190 B vs. 1.9 KB in the initial version. That is about 10 times smaller, in line with the number of columns in the example.
So before using this approach some math needs to be done to project/estimate what can be done and how!
Still, each cell in the pivot table is a sort of JSON representation of a whole row in the original table. As such, it holds not just the values, as the rows in the original table did, but also a schema.
This makes it quite verbose, so the size of a cell can be multiple times bigger than the original size [which limits the usage of this approach ... unless you get even more creative :o) ... and there are still plenty of areas here to apply it :o) ]
As of today, you can now create a partitioned table from a non-partitioned table by querying it and specifying the partition column. You'll pay for one full table scan on the original (non-partitioned) table. Note: this is currently in beta.
https://cloud.google.com/bigquery/docs/creating-column-partitions#creating_a_partitioned_table_from_a_query_result
To create a partitioned table from a query result, write the results to a new destination table. You can create a partitioned table by querying either a partitioned table or a non-partitioned table. You cannot change an existing standard table to a partitioned table using query results.
Until the new feature is rolled out in BigQuery, there is another (much cheaper) way to partition the table(s) by using Cloud Dataflow. We used this approach instead of running hundreds of SELECT * statements, which would have cost us thousands of dollars.
Create the partitioned table in BigQuery using the normal partition command
Create a Dataflow pipeline and use a BigQuery.IO.Read source to read the table
Use a Partition transform to partition each row
Using a max of 200 shards/sinks at a time (any more than that and you hit API limits), create a BigQuery.IO.Write sink for each day/shard that will write to the corresponding partition using the partition decorator syntax - "$YYYYMMDD"
Repeat N times until all data is processed.
Here's an example on Github to get you started.
You still have to pay for the Dataflow pipeline(s), but it's a fraction of the cost of using multiple SELECT * in BigQuery.
If you have date sharded tables today, you can use this approach:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#converting_dated_tables_into_a_partitioned_table
If you have a single non-partitioned table to be converted to partitioned table, you can try the approach of running a SELECT * query with allow large results and using the table's partition as the destination (similar to how you'd restate data for a partition):
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition
Please note that this approach will charge you the scan cost of the source table for the query as many times as you query it.
We are working on something to make this scenario significantly better in the next few months.
CREATE TABLE `dataset.new_table`
PARTITION BY DATE(date_column)
AS SELECT * FROM `dataset.old_table`;
drop table `dataset.old_table`;
ALTER TABLE `dataset.new_table`
RENAME TO old_table;

Purposely having a query return blank entries at regular intervals

I want to write a query that returns 3 results followed by blank results followed by the next 3 results, and so on. So if my database had this data:
CREATE TABLE tbl (a integer, b integer, c integer, d integer);
INSERT INTO tbl (a,b,c,d)
VALUES (1,2,3,4),
(5,6,7,8),
(9,10,11,12),
(13,14,15,16),
(17,18,19,20),
(21,22,23,24),
(25,26,27,28);
I would want my query to return this
1,2,3,4
5,6,7,8
9,10,11,12
, , ,
13,14,15,16
17,18,19,20
21,22,23,24
, , ,
25,26,27,28
I need this to work for arbitrarily many entries that I select, with every three grouped together like this.
I'm running PostgreSQL 8.3.
This should work flawlessly in PostgreSQL 8.3:
SELECT a, b, c, d
FROM (
SELECT rn, 0 AS rk, (x[rn]).*
FROM (
SELECT x, generate_series(1, array_upper(x, 1)) AS rn
FROM (SELECT ARRAY(SELECT tbl FROM tbl) AS x) x
) y
UNION ALL
SELECT generate_series(3, (SELECT count(*) FROM tbl), 3), 1, (NULL::tbl).*
ORDER BY rn, rk
) z
Major points
Works for a query that selects all columns of tbl.
Works for any table.
For selecting arbitrary columns you have to substitute (NULL::tbl).* with a matching number of NULL columns in the second query.
Assuming that NULL values are ok for "blank" rows.
If not, you'll have to cast your columns to text in the first and substitute '' for NULL in the second SELECT.
Query will be slow with very big tables.
If I had to do it, I would write a plpgsql function that loops through the results and inserts the blank rows. But you mentioned you had no direct access to the db ...
In short, no, there's not an easy way to do this, and generally, you shouldn't try. The database is concerned with what your data actually is, not how it's going to be displayed. It's not an appropriate scope of responsibility to expect your database to return "dummy" or "extra" data so that some down-stream process produces a desired output. The generating script needs to do that.
As you can't change your down-stream process, you could (read that with a significant degree of skepticism and disdain) add things like this:
Select Top 3
a, b, c, d
From
table
Union Select Top 1
'', '', '', ''
From
table
Union Select Top 3 Skip 3
a, b, c, d
From
table
Please, don't actually try to do that.
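The advice above, to add the blank rows in the generating script rather than in SQL, could look something like the following. It is a minimal sketch under assumptions: the rows are taken to be already fetched and ordered, and the function name with_blank_rows is made up for illustration.

```python
def with_blank_rows(rows, group_size=3):
    """Yield rows, inserting a blank row between consecutive groups of group_size."""
    for i, row in enumerate(rows):
        # A separator goes before the first row of every group except the first,
        # so no blank row is produced at the very start or the very end.
        if i and i % group_size == 0:
            yield ("",) * len(row)
        yield row

data = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16),
        (17, 18, 19, 20), (21, 22, 23, 24), (25, 26, 27, 28)]
result = list(with_blank_rows(data))
for row in result:
    print(",".join(str(v) for v in row))
```

Because it is a generator, it works for arbitrarily many rows without loading a second copy of the result set.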
You can do it (at least on DB2 - there doesn't appear to be equivalent functionality for your version of PostgreSQL).
No looping needed, although there is a bit of trickery involved...
Please note that though this works, it's really best to change your display code.
Statement requires CTEs (although that can be re-written to use other table references), and OLAP functions (I guess you could re-write it to count() previous rows in a subquery, but...).
WITH dataList (rowNum, dataColumn) as (SELECT CAST(CAST(:interval as REAL) /
(:interval - 1) * ROW_NUMBER() OVER(ORDER BY dataColumn) as INTEGER),
dataColumn
FROM dataTable),
blankIncluder(rowNum, dataColumn) as (SELECT rowNum, dataColumn
FROM dataList
UNION ALL
SELECT rowNum - 1, :blankDataColumn
FROM dataList
WHERE MOD(rowNum - 1, :interval) = 0
AND rowNum > :interval)
SELECT *
FROM blankIncluder
ORDER BY rowNum
This will generate a list of those elements from the datatable, with a 'blank' line every interval lines, as ordered by the initial query. The result set only has 'blank' lines between existing lines - there are no 'blank' lines on the ends.