Perform loop and calculation on BigQuery Array type - sql

My original data, B is an array of INT64:
And I want to calculate the difference between B[n+1] - B[n], hence result in a new table as follow:
I figured out I can somehow achieve this by using LOOP and IF condition:
DECLARE x INT64 DEFAULT 0;
LOOP
SET x = x + 1
IF(x < array_length(table.B))
THEN INSERT INTO newTable (SELECT A, B[OFFSET(x+1)] - B[OFFSET(x)]) from table
END IF;
END LOOP;
The problem is that the above idea doesn't work on each row of my data, cause I still need to loop through each row in my data table, but I can't find a way to integrate my scripting part into a normal query, where I can
SELECT A, [calculation script] from table
Can someone point me how can I do it? Or any better way to solve this problem?
Thank you.

Below actually works - BigQuery
select * replace(
array(select diff from (
select offset, lead(el) over(order by offset) - el as diff
from unnest(B) el with offset
) where not diff is null
order by offset
) as B
)
from `project.dataset.table` t
if to apply to sample data in your question - output is

You can use unnest() with offset for this purpose:
select id, a,
array_agg(b_el - prev_b_el order by n) as b_diffs
from (select t.*, b_el, lag(b_el) over (partition by t.id order by n) as prev_b_el
from t cross join
unnest(b) b_el with offset n
) t
where prev_b_el is not null
group by t.id, t.a

Related

Comparing two arrays in BigQuery

Postgres has a nice utility in that arrays can be compared directly. For example:
SELECT ARRAY[1,2] < ARRAY[1,3];
# t
How could this behavior be emulated in BigQuery, for example with a function?
Consider below hint
select format('%t', ARRAY[1,2]) < format('%t', ARRAY[1,3])
You can join on the offset of the arrays and then compare the value at that offset at the first difference. Here is an example:
WITH
a AS (SELECT * from UNNEST([1,2]) AS elem WITH OFFSET),
b AS (SELECT * from UNNEST([1,3]) AS elem WITH OFFSET)
SELECT
a.elem < b.elem
FROM
a JOIN b USING (offset)
WHERE
a.elem != b.elem
LIMIT 1

Geography function over a column

I am trying to use the st_makeline() function in order to create lines for every points and the next one in a single column.
Do I need to create another column with the 2 points already ?
with t1 as(
SELECT *, ST_GEOGPOINT(cast(long as float64) , cast(lat as float64)) geometry FROM `my_table.faissal.trajets_flix`
where id = 1
order by index_loc
)
select index_loc geometry
from t1
Here are the results
Thanks for your help
You seems to want to write this code:
https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_makeline
WITH t1 as (
SELECT *, ST_GEOGPOINT(cast(long as float64), cast(lat as float64)) geometry
FROM `my_table.faissal.trajets_flix`
-- WHERE id = 1
)
SELECT id, ST_MAKELINE(ARRAY_AGG(geometry ORDER BY index_loc)) traj
FROM t1
GROUP BY id;
with output:
When visualized on the map.
Consider also below simple and cheap option
select st_geogfromtext(format('linestring(%s)',
string_agg(long || ' ' || lat order by index_loc))
) as path
from `my_table.faissal.trajets_flix`
where id = 1
if applied to sample data in your question - output is
which is visualized as

SQL: Divide long text in multiple rows

I would like to divide a long text in multiple rows; there are other questions similar to this one but none of them worked for me.
What I have
ID | Message
----------------------------------
1 | Very looooooooooooooooong text
2 | Short text
What I would like to do is divide that string every n characters
Result if n = 15:
Id | Message
------------------------------------------
1 | Very looooooooo
1 | oooooooong text
2 | Short text
Even better if the split is done at the first space after n character.
I tried with string_split and substring but I cannot find anything that works.
I thought to use something similar to this:
SELECT index, element FROM table, CAST(message AS SUPER) AS element AT index;
But it doesn't take into account the length and I don't like casting a varchar variable into a super.
You can use generate_series() to accomplish this:
select m.*, gs.posn, substring(m.message, gs.posn, 15) as split_message
from messages m
cross join lateral generate_series(1, length(message), 15) gs(posn);
Splitting on spaces after the length is a little trickier. We would have to split the message into words and then figure out how to break them into groups and then reaggregate.
I could not figure out how to split on spaces without recursion. I hope you don't mind that it treats all whitespace as word boundaries:
with recursive by_words as (
select m.*, s.n, s.word, length(s.word) as word_len,
max(s.n) over (partition by m.id) as num_words
from messages m
cross join lateral regexp_split_to_table(m.message, '\s+')
with ordinality as s(word, n)
), rejoin as (
select id, n, array[word] as words, word_len as cum_word_len,
word_len >= 15 as keep
from by_words
where n = 1
union all
select p.id, c.n,
case
when p.cum_word_len >= 15 then array[c.word]
else p.words||c.word
end as words,
case
when p.cum_word_len >= 15 then c.word_len
else p.cum_word_len + c.word_len + 1
end as cum_word_len,
(p.cum_word_len + c.word_len + 1 >= 15)
or (c.n = c.num_words) as keep
from rejoin p
join by_words c on (c.id, c.n) = (p.id, p.n + 1)
)
select id,
row_number() over (partition by id
order by n) as segnum,
array_to_string(words, ' ') as split_message
from rejoin
where keep
order by 1, 2
;
db<>fiddle here
Edit to add:
Can you please tell me whether the below works in Redshift?
with gs as (
select generate_series as posn
from generate_series(1, 150000, 15)
)
select *, substring(m.message, gs.posn, 15) as split_message
from messages m
join gs
on gs.posn <= greatest(1, length(m.message))
order by m.id, gs.posn
;
Thanks to #Mike Organek 's answer and his help I found a solution that works with Redshift too.
Problem in Mike's answer for Redshift is related to generate_series that is not well supported in Redshift, so here's a workaround.
with row as (
select t.*, row_number() over () as x
from table t -- big enough table
limit 100
),
result as
(
select (x-1)*15+1 as posn from row --change 15 to a number to split the long text with
)
select * into gs
from result
And then Mike's answer:
select *, substring(m.feedback from gs.posn for 15) as split_message
from messages m
join gs
on gs.posn <= greatest(1, length(m.message))
order by m.id, gs.posn

detect gaps in integer sequence

Intention: detect whether a numeric sequence contains gaps. No need to identify the missing elements, just flag (true / false) the sequence if it contains gaps.
CREATE TABLE foo(x INTEGER);
INSERT INTO foo(x) VALUES (1), (2), (4);
Below is my (apparently correctly functioning) query to detect gaps:
WITH cte AS
(SELECT DISTINCT x FROM foo)
SELECT
( (SELECT COUNT(*) FROM cte a
CROSS JOIN cte b
WHERE b.x=a.x-1)
=(SELECT COUNT(*)-1 FROM cte))
OR (NOT EXISTS (SELECT 1 FROM cte))
where the OR is needed for the edge case where the table is empty. The query's logic is based on the observation that in a contiguous sequence the number of links equals the number of elements minus 1.
Anything more idiomatic or performant (should I be worried by the CROSS JOIN in particularly long sequences?)
Try this:
SELECT
CASE WHEN ((MAX(x)-MIN(x)+1 = COUNT(DISTINCT X)) OR
(COUNT(DISTINCT X) = 0) )
THEN 'TRUE'
ELSE 'FALSE'
END
FROM foo
SQLFiddle demo
The following should detect whether or not there are gaps:
select (case when max(x) - min(x) + 1 = count(distinct x)
then 'No Gaps'
else 'Some Gaps'
end)
from foo;
If there are no gaps or duplicates, then the number of distinct values of x is the max minus the min plus 1.
A different approach...
If you subtract your min value from the max value, and add 1, you should equal the count.
if count = (max-min)+1 then "no gaps!"
If you can express that in SQL, it should be very efficient.
SELECT 'Has ' || count(*) - 1 || ' gaps.' AS gaps
FROM foo f1
LEFT JOIN foo f2 ON f2.id = f1.id + 1
WHERE f2.id IS NULL;
The trick is to count rows, where the next row is missing - which only happens for the last row(s) if there are no gaps.
If there are no rows, you get 'Has -1 gaps.'.
If there are no gaps, you get 'Has 0 gaps.'.
Else you get 'Has n gaps.' .. n being the exact number of gaps, no matter how big.
The count can be increased for duplicates, but 0 and -1 are immune to dupes.

How to create an aggregate function for median?

I need to create an aggregate function in Advantage-Database to calculate the median value.
SELECT
group_field
, MEDIAN(value_field)
FROM
table_name
GROUP BY
group_field
Seems the solutions I am finding are quite specific to the sql engine used.
There is no built-in median aggregate function in ADS as you can see in the help file:
http://devzone.advantagedatabase.com/dz/webhelp/Advantage10.1/index.html
I'm afraid that you have to write your own stored procedure or sql script to solve this problem.
The accepted answer to the following question might be a solution for you:
Simple way to calculate median with MySQL
I've updated this answer with a solution that avoids the join in favor of storing some data in a json object.
SOLUTION #1 (two selects and a join, one to get counts, one to get rankings)
This is a little lengthy, but it does work, and it's reasonably fast.
SELECT x.group_field,
avg(
if(
x.rank - y.vol/2 BETWEEN 0 AND 1,
value_field,
null
)
) as median
FROM (
SELECT group_field, value_field,
#r:= IF(#current=group_field, #r+1, 1) as rank,
#current:=group_field
FROM (
SELECT group_field, value_field
FROM table_name
ORDER BY group_field, value_field
) z, (SELECT #r:=0, #current:='') v
) x, (
SELECT group_field, count(*) as vol
FROM table_name
GROUP BY group_field
) y WHERE x.group_field = y.group_field
GROUP BY x.group_field
SOLUTION #2 (uses a json object to store the counts and avoids the join)
SELECT group_field,
avg(
if(
rank - json_extract(#vols, path)/2 BETWEEN 0 AND 1,
value_field,
null
)
) as median
FROM (
SELECT group_field, value_field, path,
#rnk := if(#curr = group_field, #rnk+1, 1) as rank,
#vols := json_set(
#vols,
path,
coalesce(json_extract(#vols, path), 0) + 1
) as vols,
#curr := group_field
FROM (
SELECT p.group_field, p.value_field, concat('$.', p.group_field) as path
FROM table_name
JOIN (SELECT #curr:='', #rnk:=1, #vols:=json_object()) v
ORDER BY group_field, value_field DESC
) z
) y GROUP BY group_field;