Adding missing date rows to a BigQuery Table - sql

I have a table where one of the columns is an integer that represents the row's time. The problem is that the table isn't complete; there are missing timestamps.
I would like to fill in the missing values such that there is a row every 10 seconds, with the rest of the columns set to NULL (I'll forward-fill those NULLs later).
10 seconds corresponds to 10,000 in this column.
If this were Python, the range would be:
range(
min(table[column]),
max(table[column]),
10000
)

If your values are strictly spaced 10 seconds apart, and only some multiples of the 10-second interval are missing, you can use the following approach to fill the holes in your data:
WITH minsmax AS (
SELECT
MIN(time) AS minval,
MAX(time) AS maxval
FROM `dataset.table`
)
SELECT
IF (d.time <= i.time, d.time, i.time) as time,
MAX(IF(d.time <= i.time, d.value, NULL)) as value
FROM (
SELECT time FROM minsmax m, UNNEST(GENERATE_ARRAY(m.minval, m.maxval+100, 100)) AS time
) AS i
LEFT JOIN `dataset.table` d ON 1=1
WHERE ABS(d.time - i.time) >= 100
GROUP BY 1
ORDER BY 1
Hope this helps.

You can use arrays. For numbers, you can do:
select n
from unnest(generate_array(1, 1000, 1)) n;
There are similar functions for generate_timestamp_array() and generate_date_array() if you really need those types.
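For example, a timestamp-typed version of the same idea might look like this (just a sketch; the 10-second step matches the question above, and the start and end timestamps are made up):
SELECT ts
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
    TIMESTAMP '2020-01-01 00:00:00',
    TIMESTAMP '2020-01-01 01:00:00',
    INTERVAL 10 SECOND)) AS ts;  -- one row every 10 seconds between the two bounds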

I ended up using the following query through the Python API:
"""
SELECT
i.time,
Sensor_Reading,
Sensor_Name
FROM (
SELECT time FROM UNNEST(GENERATE_ARRAY({min_time}, {max_time}+{sampling_period}+1, {sampling_period})) AS time
) AS i
LEFT JOIN
`{input_table}` AS input
ON
i.time =input.Time
ORDER BY i.time
""".format(sampling_period=sampling_period, min_time=min_time,
max_time=max_time,
input_table=input_table)
Thanks to both answers


Sum over a given time period

The following code gives the total duration that a light has been switched on.
CREATE TABLE switch_times (
id SERIAL PRIMARY KEY,
is1 BOOLEAN,
id_dec INTEGER,
label TEXT,
ts TIMESTAMP WITH TIME ZONE default current_timestamp
);
CREATE VIEW makecount AS
SELECT *, row_number() OVER (PARTITION BY id_dec ORDER BY id) AS count
FROM switch_times;
select c1.label, SUM(c2.ts-c1.ts) AS sum
from
(makecount AS c1
inner join
makecount AS c2 ON c2.count = c1.count + 1)
where c2.is1=FALSE AND c1.id_dec = c2.id_dec AND c2.is1 != c1.is1
GROUP BY c1.label;
Link to working demo https://dbfiddle.uk/ZR8pLEBk
Any suggestions on how to alter the code so that it gives the sum over a given specific time period, say the 25th, during which all three lights were switched on for 12 hours?
Problem 1: the current code gives the total sum, as follows.
Problem 2: all durations that have not yet ended are disregarded, because there is no switch-off time.
label sum
0x29 MH3 1 day 03:00:00
0x2B MH1 1 day 01:00:00
0x2C MH2 1 day 02:00:00
The expected result is just for a given date, i.e.
label sum
0x29 MH3 12:00:00
0x2B MH1 12:00:00
0x2C MH2 12:00:00
Assuming the following (which should be defined in the question):
Postgres 15.
The table is big, many rows per label, performance matters, we can add indexes.
All columns are actually NOT NULL, you just forgot to declare columns as such.
Evey "light" has a distinct id_dec and a distinct label. Having both in switch_times is redundant. (Normalization!)
A light is "switched on" if the most recent earlier entry has is1 IS TRUE. Else it's considered "off".
The order of rows is established by ts, not by id as used in your query (typically incorrect).
Consecutive entries do not have to change the state.
No duplicate entries for (id_dec, ts). (There is a unique index enforcing that.)
There is no minimum or maximum time interval between entries.
"The 25th" is supposed to mean tstzrange '[2022-11-25 0:0+02, 2022-11-26 0:0+02)' (Note the time zone offsets.)
You want results for all labels that were switched on at all during the given time interval.
There is a table "labels" with one distinct entry per relevant light. If you don't have one, create it.
Indexes
Have at least these indexes to make everything fast:
CREATE INDEX ON switch_times (id_dec, ts DESC);
CREATE INDEX ON switch_times (ts);
Optional step to create table labels
CREATE TABLE labels AS
WITH RECURSIVE cte AS (
(
SELECT id_dec, label
FROM switch_times
ORDER BY 1
LIMIT 1
)
UNION ALL
(
SELECT s.id_dec, s.label
FROM cte c
JOIN switch_times s ON s.id_dec > c.id_dec
ORDER BY 1
LIMIT 1
)
)
TABLE cte;
ALTER TABLE labels
ADD PRIMARY KEY (id_dec)
, ALTER COLUMN label SET NOT NULL
, ADD CONSTRAINT label_uni UNIQUE (label)
;
Why this way? See:
Optimize GROUP BY query to retrieve latest row per user
Main query
WITH bounds(lo, hi) AS (
SELECT timestamptz '2022-11-25 0:0+02' -- enter time interval here *once*
, timestamptz '2022-11-26 0:0+02'
)
, snapshot AS (
SELECT id_dec, label, is1, ts
FROM switch_times s, bounds b
WHERE s.ts >= b.lo
AND s.ts < b.hi
UNION ALL -- must be separate
SELECT s.*
FROM labels l
JOIN LATERAL ( -- latest earlier entry
SELECT s.id_dec, s.label, s.is1, b.lo AS ts -- cut off at lower bound
FROM switch_times s, bounds b
WHERE s.id_dec = l.id_dec
AND s.ts < b.lo
ORDER BY s.ts DESC
LIMIT 1
) s ON s.is1 -- ... if it's "on"
)
SELECT label, sum(z - a) AS duration
FROM (
SELECT label
, lag(is1, 1, false) OVER w AS last_is1
, lag(ts) OVER w AS a
, ts AS z
FROM snapshot
WINDOW w AS (PARTITION BY label ORDER BY ts ROWS UNBOUNDED PRECEDING)
) sub
WHERE last_is1
GROUP BY 1;
fiddle
CTE bounds is an optional convenience feature to enter the lower and upper bound of your time interval only once.
CTE snapshot collects all rows of interest, consisting of
all rows inside the time interval (1st leg of the UNION ALL query)
the latest earlier row, if it was "on" (2nd leg of the UNION ALL query)
We need to gather 2. separately to cover the corner case where a light was switched on earlier and has no entry within the given time interval. We can replace that row's timestamp with the lower bound immediately.
The final query gets the previous (is1, ts) for every row in a subquery, defaulting to "off" if there was no previous row.
Finally, sum up intervals in the outer SELECT. Only intervals that begin in the "on" state are summed (no matter the final state).
Related:
Jump SQL gap over specific condition & proper lead() usage
My assumption:
the actual on time is the time difference between a row where is1 is true and the next row where is1 is false, ordered by ts.
The query below calculates the total sum of on time between two dates:
select
id_dec ,
label,
sum(to_timestamp(nexttime)-ts) as time_def
from
(
select
id_dec,
"label",
ts,
is1,
case
when is1 = true then lead(extract(epoch from ts))over(partition by id_dec
order by
id_dec ,
ts asc)
else 0
end nexttime
from
switch_times
where
ts between '2022-11-24' and '2022-11-28'
) as a
where
nexttime <> 0
group by
id_dec,
label

Is there a better way to retrieve a random row from an Oracle table?

Not so long ago I needed to fetch a random row from a table in an Oracle database. The most widespread solution that I've found was this:
SELECT * FROM
( SELECT * FROM tabela WHERE warunek
ORDER BY dbms_random.value )
WHERE rownum = 1
However, this is very performance-heavy for large tables, as it sorts the whole table in random order first and then grabs the first row.
Today, one of my colleagues suggested a different way:
SELECT * FROM (
SELECT * FROM MAIN_PRODUCT
WHERE ROWNUM <= CAST((SELECT COUNT(*) FROM MAIN_PRODUCT)*dbms_random.value AS INTEGER)
ORDER BY ROWNUM DESC
) WHERE ROWNUM = 1;
It works way faster and seems to return random values, but does it really? Could someone give me an insight into whether it is really random and behaves as expected? I'm really curious why I haven't found this approach anywhere else while looking, and if it is indeed random and much better performance-wise, why isn't it more widespread?
This is (possibly) the simplest query possible to get the result.
But the SELECT COUNT(*) FROM MAIN_PRODUCT will do a table scan; I doubt you can get a query that avoids that.
P.S. This query assumes no deleted records.
Query
SELECT *
FROM
MAIN_PRODUCT
WHERE
ROWNUM = FLOOR(
(dbms_random.value * (SELECT COUNT(*) FROM MAIN_PRODUCT)) + 1
)
FLOOR(
(dbms_random.value * (SELECT COUNT(*) FROM MAIN_PRODUCT)) + 1
)
will generate a number between 1 and the row count of the table; see the demo for how that works when you refresh it.
Oracle12c+ Query
SELECT *
FROM
MAIN_PRODUCT
WHERE
ROWNUM <= FLOOR(
(dbms_random.value * (SELECT COUNT(*) FROM MAIN_PRODUCT)) + 1
)
ORDER BY
ROWNUM DESC
FETCH FIRST ROW ONLY
The second code you have
SELECT * FROM (
SELECT * FROM MAIN_PRODUCT
WHERE ROWNUM <= CAST((SELECT COUNT(*) FROM MAIN_PRODUCT)*dbms_random.value AS INTEGER)
ORDER BY ROWNUM DESC
) WHERE ROWNUM = 1;
is excellent, except that it will get subsequent elements. dbms_random.value returns a real number between 0 and 1. Multiplying it by the number of rows gives you a genuinely random number, and the bottleneck here is counting the number of rows rather than generating a random value for each row.
Proof
Consider a number 0 <= x < 1. If we multiply it by n, we get 0 <= n * x < n, which is exactly what you need if you want to load a single element. The reason this approach is not widespread is that in many cases the performance issue is not felt, because there are only a few thousand records.
EDIT
If you need k records, not just the first one, it becomes slightly more difficult, but it is still solvable. The algorithm would be something like this (I do not have Oracle installed to test it, so I only describe the algorithm):
randomize(n, k)
    randomized <- empty_set
    while (k > 0) do
        newValue <- random(n)
        n <- n - 1
        k <- k - 1
        // find out how many already-chosen elements are lower than newValue
        // and increase newValue by that amount
        // if newValue thereby became larger than further already-chosen values,
        // increase it by that amount as well
        // repeat until there is no need to increase newValue
        randomized <- randomized + {newValue}
    while end
randomize end
If you randomize k elements from n, then you will be able to use those values in your filter.
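For instance, if the k randomized positions came out as 4, 17 and 42, the filter step might look like this (a sketch; the positions are made up, and ROWNUM has to be materialized in a subquery before it can be filtered on):
SELECT *
FROM (SELECT p.*, ROWNUM AS rn FROM MAIN_PRODUCT p)  -- freeze ROWNUM as a real column
WHERE rn IN (4, 17, 42);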
The key to improving performance is to lessen the load of the ORDER BY.
If you know about how many rows match the conditions, then you can filter before the sort. For instance, the following takes about 1% of the rows:
SELECT *
FROM (SELECT *
FROM tabela
WHERE warunek AND dbms_random.value < 0.01
ORDER BY dbms_random.value
)
WHERE rownum = 1;
A variation is to calculate the number of matching values. Then randomly select a smaller sample. The following gets about 100 matching rows and then sorts them for the random selection:
SELECT a.*
FROM (SELECT *
FROM (SELECT a.*, COUNT(*) OVER () as cnt
FROM tabela a
WHERE warunek
) a
WHERE dbms_random.value < 100 / cnt
ORDER BY dbms_random.value
) a
WHERE rownum = 1;

Can we modify the previous row and use it in current row in a SQL query for a list?

I've looked around and found a few posts with LAG() and running-total type queries, but none seem to fit what I'm looking for. Maybe I'm not using the correct search terms, or maybe I'm overcomplicating the situation. I hope someone can help me out.
What I'm looking to do is take the previous result and multiply it by the current row's value, for a range of dates. The starting value is always some base number; let's use 10 to keep it simple. The values will be floats, but I kept them as round numbers here to better explain my question.
The first table shows the calculation and the second table below shows what the result should look like in the end.
date val1 calc_result
20120930 null 10
20121031 2 10*2=20
20121130 3 20*3=60
20121231 1 60*1=60
20130131 2 60*2=120
20130228 1 120*1=120
The query would return
20120930 10
20121031 20
20121130 60
20121231 60
20130131 120
20130228 120
I'm trying to see whether this can be done with a query-only solution, or whether a PL/SQL table/cursor would need to be used.
Any help would be appreciated.
You can do this with a recursive CTE:
with dates as (
select t.*, row_number() over (order by date) as seqnum
from t
),
cte as (
select t.date, t.val, 10 as calc_result
from dates t
where t.seqnum = 1
union all
select t.date, t.val, cte.calc_result * t.val
from cte join
dates t
on t.seqnum = cte.seqnum + 1
)
select cte.date, cte.calc_result
from cte
order by cte.date;
This is calculating a cumulative product. You can do it with some exponential arithmetic. Replace 10 in the query with the desired start value.
select date,val1
,case when row_number() over(order by date) = 1 then 10 --set start value for first row
else 10*exp(sum(ln(val1)) over(order by date)) end as res
from tbl
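As a quick sanity check of the identity this relies on (the exponential of a sum of logs is a product); note that ln() needs strictly positive inputs, so this trick assumes val1 > 0. A sketch, Oracle syntax assumed for the dual table:
SELECT 10 * EXP(LN(2) + LN(3)) AS calc_result FROM dual;  -- 60, matching the 20121130 row above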

Returning the lowest integer not in a list in SQL

Suppose you have a table T(A) where only positive integers are allowed, like:
1,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18
In the above example, the result is 10. We can always use ORDER BY and DISTINCT to sort and remove duplicates. However, to find the lowest integer not in the list, I came up with the following SQL query:
select list.x + 1
from (select x from (select distinct a as x from T order by a)) as list, T
where list.x + 1 not in T limit 1;
My idea was to start a counter at 1 and check whether that counter is in the list: if it is not, return it; otherwise increment it and look again. However, I have to start that counter at 1 and then increment. That query works in most cases, but there are corner cases, such as when 1 itself is missing. How can I accomplish this in SQL, or should I go in a completely different direction to solve this problem?
Because SQL works on sets, the intermediate SELECT DISTINCT a AS x FROM t ORDER BY a is redundant.
The basic technique of looking for a gap in a column of integers is to find where the current entry plus 1 does not exist. This requires a self-join of some sort.
Your query is not far off, but I think it can be simplified to:
SELECT MIN(a) + 1
FROM t
WHERE a + 1 NOT IN (SELECT a FROM t)
The NOT IN acts as a sort of self-join. This won't produce anything from an empty table, but should be OK otherwise.
SQL Fiddle
select min(y.a) as a
from
t x
right join
(
select a + 1 as a from t
union
select 1
) y on y.a = x.a
where x.a is null
It will work even in an empty table
SELECT min(t.a) - 1
FROM t
LEFT JOIN t t1 ON t1.a = t.a - 1
WHERE t1.a IS NULL
AND t.a > 1; -- exclude 0
This finds the smallest number greater than 1 whose next-smaller number is not in the table, and returns that missing number.
This works even for a missing 1. There are multiple answers checking in the opposite direction. All of them would fail with a missing 1.
SQL Fiddle.
You can do the following, although you may also want to define a range - in which case you might need a couple of UNIONs
SELECT x.id+1
FROM my_table x
LEFT
JOIN my_table y
ON x.id+1 = y.id
WHERE y.id IS NULL
ORDER
BY x.id LIMIT 1;
You can always create a table with all of the numbers from 1 to X and then join that table with the table you are comparing. Then just find the TOP value in your SELECT statement that isn't present in the table you are comparing
SELECT TOP 1 table_with_all_numbers.number, table_with_missing_numbers.number
FROM table_with_all_numbers
LEFT JOIN table_with_missing_numbers
ON table_with_missing_numbers.number = table_with_all_numbers.number
WHERE table_with_missing_numbers.number IS NULL
ORDER BY table_with_all_numbers.number ASC;
In SQLite 3.8.3 or later, you can use a recursive common table expression to create a counter.
Here, we stop counting when we find a value not in the table:
WITH RECURSIVE counter(c) AS (
SELECT 1
UNION ALL
SELECT c + 1 FROM counter WHERE c IN t)
SELECT max(c) FROM counter;
(This works for an empty table or a missing 1.)
This query ranks (starting from rank 1) each distinct number in ascending order and selects the lowest rank that's less than its number. If no rank is lower than its number (i.e. there are no gaps in the table) the query returns the max number + 1.
select coalesce(min(number),1) from (
select min(cnt) number
from (
select
number,
(select count(*) from (select distinct number from numbers) b where b.number <= a.number) as cnt
from (select distinct number from numbers) a
) t1 where number > cnt
union
select max(number) + 1 number from numbers
) t1
http://sqlfiddle.com/#!7/720cc/3
Just another method, using EXCEPT this time:
SELECT a + 1 AS missing FROM T
EXCEPT
SELECT a FROM T
ORDER BY missing
LIMIT 1;

Use google bigquery to build histogram graph

How can I write a query that makes histogram graph rendering easier?
For example, we have 100 million people with ages, we want to draw the histogram/buckets for age 0-10, 11-20, 21-30 etc... What does the query look like?
Has anyone done it? Did you try to connect the query result to google spreadsheet to draw the histogram?
You could also use the quantiles aggregation operator to get a quick look at the distribution of ages.
SELECT
quantiles(age, 10)
FROM mytable
Each row of this query would correspond to the age at that point in the list of ages. The first result is the age 1/10th of the way through the sorted list of ages, the second is the age 2/10ths of the way through, then 3/10ths, etc.
See the 2019 update below, with #standardSQL. --Fh
The subquery idea works, as does "CASE WHEN" and then doing a group by:
SELECT COUNT(field1), bucket
FROM (
SELECT field1, CASE WHEN age >= 0 AND age < 10 THEN 1
WHEN age >= 10 AND age < 20 THEN 2
WHEN age >= 20 AND age < 30 THEN 3
...
ELSE -1 END as bucket
FROM table1)
GROUP BY bucket
Alternatively, if the buckets are regular, you could just divide and cast to an integer:
SELECT COUNT(field1), bucket
FROM (
SELECT field1, INTEGER(age / 10) as bucket FROM table1)
GROUP BY bucket
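For reference, the same divide-and-floor idea in #standardSQL might look like this (a sketch; the table and column names are assumptions):
SELECT
  CAST(FLOOR(age / 10) AS INT64) AS bucket,  -- 0 for ages 0-9, 1 for 10-19, ...
  COUNT(*) AS cnt
FROM `project.dataset.table1`
GROUP BY bucket
ORDER BY bucket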
With #standardSQL and an auxiliary stats query, we can define the range the histogram should look into.
Here for the time to fly between SFO and JFK - with 10 buckets:
WITH data AS (
SELECT *, ActualElapsedTime datapoint
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate_year = "2018-01-01"
AND Origin = 'SFO' AND Dest = 'JFK'
)
, stats AS (
SELECT min+step*i min, min+step*(i+1)max
FROM (
SELECT max-min diff, min, max, (max-min)/10 step, GENERATE_ARRAY(0, 10, 1) i
FROM (
SELECT MIN(datapoint) min, MAX(datapoint) max
FROM data
)
), UNNEST(i) i
)
SELECT COUNT(*) count, (min+max)/2 avg
FROM data
JOIN stats
ON data.datapoint >= stats.min AND data.datapoint<stats.max
GROUP BY avg
ORDER BY avg
If you need round numbers, see: https://stackoverflow.com/a/60159876/132438
Using a cross join to get your min and max values (not that expensive on a single tuple) you can get a normalized bucket list of any given bucket count:
select
min(data.VAL) as min,
max(data.VAL) as max,
count(data.VAL) as num,
integer((data.VAL-value.min)/(value.max-value.min)*8) as group
from [table] data
CROSS JOIN (SELECT MAX(VAL) as max, MIN(VAL) as min from [table]) value
GROUP BY group
ORDER BY group
In this example we're getting 8 buckets (pretty self-explanatory), plus one for null VAL.
Write a subquery like this:
(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0)
Then you can do something like this:
SELECT * FROM
(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0),
(SELECT '2' AS agegroup, count(*) FROM people WHERE AGE <= 20 AND AGE >= 10),
(SELECT '3' AS agegroup, count(*) FROM people WHERE AGE <= 120 AND AGE >= 20)
Result will be like:
Row agegroup count
1 1 somenumber
2 2 somenumber
3 3 another number
I hope this helps you. Of course, for the age group you can write anything you like, such as '0 to 10'.
There is now the APPROX_QUANTILES aggregation function in standard SQL.
SELECT
APPROX_QUANTILES(column, number_of_bins)
...
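A minimal, self-contained example might look like this (a sketch; the table and column names are hypothetical). APPROX_QUANTILES(x, n) returns an array of n + 1 approximate boundaries, from the minimum to the maximum:
SELECT APPROX_QUANTILES(age, 10) AS age_deciles  -- 11 boundaries: min, 9 deciles, max
FROM `project.dataset.people`;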
I found gamars' approach quite intriguing and expanded on it a little bit, using scripting instead of the cross join. Notably, this approach also allows you to consistently change group sizes, as here with group sizes that increase exponentially.
declare stats default
(select as struct min(new_confirmed) as min, max(new_confirmed) as max
from `bigquery-public-data.covid19_open_data.covid19_open_data`
where new_confirmed >0 and date = date "2022-03-07"
);
declare group_amount default 10; -- change group amount here
SELECT
CAST(floor(
(ln(new_confirmed-stats.min+1)/ln(stats.max-stats.min+1)) * (group_amount-1))
AS INT64) group_flag,
concat('[',min(new_confirmed),',',max(new_confirmed),']') as group_value_range,
count(1) as quantity
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where new_confirmed >0 and date = date "2022-03-07"
GROUP BY group_flag
ORDER BY group_flag ASC
The basic approach is to label each value with its group_flag and then group by it. The flag is calculated by scaling the value down to a value between 0 and 1 and then scaling it up again to the range 0 - group_amount.
I just take the log of the corrected value and of the range before dividing them, to get the desired bias in group sizes. I also add 1 to make sure it never tries to take the log of 0.
You're looking for a single vector of information. I would normally query it like this:
select
count(*) as num,
integer( age / 10 ) as age_group
from mytable
group by age_group
A big case statement will be needed for arbitrary groups. It would be simple but much longer. My example should be fine if every bucket contains N years.
Take a look at custom SQL functions. This one works like this:
to_bin(10, [0, 100, 500]) => '... - 100'
to_bin(1000, [0, 100, 500, 0]) => '500 - ...'
to_bin(1000, [0, 100, 500]) => NULL
Read more here
https://github.com/AdamovichAleksey/BigQueryTips/blob/main/sql/functions/to_bins.sql
Any ideas and commits are welcome.