BigQuery - Cannot re-use lagged records - google-bigquery

When using LAG(value, offset), I don't seem to be able to re-use the output in other functions.
The output of the following query shows that previous_timestamp_utc exists, but neither casting it with DATE() nor passing it to DATEDIFF() returns a value.
SELECT
id,
timestamp_utc,
DATE(timestamp_utc) AS date_timestamp_utc,
previous_timestamp_utc,
DATE(previous_timestamp_utc) AS date_previous_timestamp_utc,
DATEDIFF(timestamp_utc,previous_timestamp_utc),
FROM (
SELECT
id,
timestamp_utc,
LAG(timestamp_utc,1) OVER (PARTITION BY id ORDER BY timestamp_utc) AS previous_timestamp_utc,
FROM (
SELECT
SEC_TO_TIMESTAMP (timestamp) AS timestamp_utc,
id,
num_characters,
FROM
[publicdata:samples.wikipedia] ) )
ORDER BY
4 DESC
LIMIT
1000
Can anyone explain why this is occurring?
Workaround: I'm unclear as to why this works, but one workaround I found is to cast the field to a date before applying LAG(), replacing
LAG(timestamp_utc,1) OVER (PARTITION BY id ORDER BY timestamp_utc)
with
LAG(date(timestamp_utc),1) OVER (PARTITION BY id ORDER BY timestamp_utc)
This allows previous_timestamp_utc to be used in DATE() and DATEDIFF(). This is not something we should be expected to do when using the LAG() function.

This is a bug in BigQuery in handling timestamps with the LAG function.
The timestamp type is lost in intermediate results. When the result table is written, the column is correctly stored as a timestamp, but any intermediate step interprets the value as a raw integer, which produces unexpected results.
You found the work-around: cast to a non-timestamp type before the LAG function.
This issue is logged in our internal issue tracker. Thank you for the bug report!
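For reference, here is a minimal sketch of what the original query is meant to compute: the previous timestamp per id, its date, and the number of days between the two. It runs on SQLite through Python's sqlite3 module rather than on BigQuery, so it does not reproduce the bug. The `edits` table and its rows are hypothetical, `JULIANDAY` stands in for `DATEDIFF`, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical sample rows: (id, timestamp_utc as ISO-8601 text).
rows = [
    (1, "2020-01-01 00:00:00"),
    (1, "2020-01-03 12:00:00"),
    (1, "2020-01-10 00:00:00"),
    (2, "2020-02-01 00:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edits (id INTEGER, timestamp_utc TEXT)")
conn.executemany("INSERT INTO edits VALUES (?, ?)", rows)

# The inner query attaches the previous timestamp of the same id;
# the outer query re-uses that lagged column in date functions.
query = """
SELECT
  id,
  timestamp_utc,
  previous_timestamp_utc,
  DATE(previous_timestamp_utc) AS date_previous_timestamp_utc,
  CAST(JULIANDAY(DATE(timestamp_utc))
       - JULIANDAY(DATE(previous_timestamp_utc)) AS INTEGER) AS days_between
FROM (
  SELECT
    id,
    timestamp_utc,
    LAG(timestamp_utc, 1) OVER (PARTITION BY id ORDER BY timestamp_utc)
      AS previous_timestamp_utc
  FROM edits
)
ORDER BY id, timestamp_utc
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The first row of each id has no predecessor, so the lagged column and everything derived from it are NULL there. That is expected and differs from the bug described above, where the derived columns are empty even when the lagged value exists.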

Related

Get first record based on time in PostgreSQL

Is there a way to get the first record of each day, based on the time?
For example:
get the first record of today, the first record of yesterday, the first record of the day before yesterday, and so on.
Note: I want to get all of these records, taking the time into account.
The expected output should look like:
first_record_today,
first_record_yesterday,..
As I understand the question, the "first" record per day is the earliest one.
For that, we can use RANK and do the PARTITION BY the day only, truncating the time.
In the ORDER BY clause, we will sort by the time:
SELECT sub.yourdate FROM (
SELECT yourdate,
RANK() OVER
(PARTITION BY DATE_TRUNC('DAY',yourdate)
ORDER BY DATE_TRUNC('SECOND',yourdate)) rk
FROM yourtable
) AS sub
WHERE sub.rk = 1
ORDER BY sub.yourdate DESC;
In the main query, we will sort the data beginning with the latest date, meaning today's one, if available.
We can try out here: db<>fiddle
If this understanding of the question is incorrect, please let us know what to change by editing your question.
A note: according to your description, a window function is not strictly necessary. A shorter GROUP BY, as shown in the other answer, can produce the correct result too and might be absolutely fine. I chose the window function approach because it makes it easy to add or change conditions that might not be usable in a simple GROUP BY.
EDIT because the question's author provided further information:
Here is the query that also fetches the first message:
SELECT sub.yourdate, sub.message FROM (
SELECT yourdate, message,
RANK() OVER (PARTITION BY DATE_TRUNC('DAY',yourdate)
ORDER BY DATE_TRUNC('SECOND',yourdate)) rk
FROM yourtable
) AS sub
WHERE sub.rk = 1
ORDER BY sub.yourdate DESC;
Or if only the message without the date should be selected:
SELECT sub.message FROM (
SELECT yourdate, message,
RANK() OVER (PARTITION BY DATE_TRUNC('DAY',yourdate)
ORDER BY DATE_TRUNC('SECOND',yourdate)) rk
FROM yourtable
) AS sub
WHERE sub.rk = 1
ORDER BY sub.yourdate DESC;
Updated fiddle here: db<>fiddle
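The RANK approach above can be tried locally with a small sketch like the following. It uses SQLite through Python's sqlite3 module instead of PostgreSQL, so `DATE(yourdate)` stands in for `DATE_TRUNC('DAY', yourdate)`, and the sample rows are hypothetical. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical sample data: (yourdate as ISO-8601 text, message).
rows = [
    ("2023-05-01 08:15:00", "first on May 1"),
    ("2023-05-01 17:40:00", "later on May 1"),
    ("2023-05-02 09:00:00", "first on May 2"),
    ("2023-05-02 09:30:00", "later on May 2"),
    ("2023-05-03 23:59:00", "only one on May 3"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (yourdate TEXT, message TEXT)")
conn.executemany("INSERT INTO yourtable VALUES (?, ?)", rows)

# Rank the rows of each calendar day by time and keep rank 1,
# i.e. the earliest record of every day, newest day first.
query = """
SELECT sub.yourdate, sub.message FROM (
  SELECT yourdate, message,
         RANK() OVER (PARTITION BY DATE(yourdate) ORDER BY yourdate) AS rk
  FROM yourtable
) AS sub
WHERE sub.rk = 1
ORDER BY sub.yourdate DESC
"""
first_per_day = conn.execute(query).fetchall()
for row in first_per_day:
    print(row)
```

Because RANK assigns the same rank to ties, two records with the identical earliest time on one day would both be returned. ROW_NUMBER would be the choice if exactly one row per day is required.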

Calculate difference between current row and next available date value in SQL

I have to get the difference between the current Issue Reset and the next stop time in PostgreSQL. I don't understand how to do this with window functions. I tried NEXT_VALUE and FIRST_VALUE, but the examples I find are all for moving aggregates. I need a single query to get this.
I need the difference between '22/08/2020 11:29:00' and '17/08/2020 11:19:00', which tells me the duration of the running time.
It looks like you want the difference between each stoptime and the latest prior issue. If issues and stop times are properly interleaved, as you explained in the comments, then you can use window functions as follows:
select t.*, stop_time - max_issue as diff
from (
select t.*, max(issue) over(order by issue) max_issue
from mytable t
) t
where stop_time is not null

How to select the correct date in the same column when data in different rows are equal

I have the following problem with my data on a DB2 database. I want to create an overview when a machine was used for a project with a begin and end date.
The following data is available:
||Machine name||Description||Project||Start date|
|Mach1|DB2_AIX|Team_1_PERS|TEST_1|07-03-2017|
|Mach1|DB2_AIX|Team_1_PERS|TEST_1|16-03-2017|
|Mach1|DB2_AIX|Team_1_PERS|TEST_1|24-04-2017|
|Mach1|DB2_AIX|Team_1_PERS|TEST_1|07-05-2017|
|Mach1|DB2_AIX|Team_1_PERS|TEST_2|13-05-2016|
|Mach1|DB2_AIX|Team_1_PERS|TEST_1|22-05-2017|
|Mach1|DB2_AIX|Team_1_PERS|TEST_1|12-06-2017|
The result that I'm looking for is:
|Machine name||Description||Project||Start date||Last date|
|Mach1|DB2_AIX|Team_1_PERS|TEST_1|07-03-2017|07-05-2017|
|Mach1|DB2_AIX|Team_1_PERS|TEST_2|13-05-2016|13-05-2017|
|Mach1|DB2_AIX|Team_1_PERS|TEST_1|22-05-2017|12-06-2017|
Does anybody have an idea how to create this result with a statement?
This is a classic gaps-and-islands problem, and the standard solutions will work just fine:
WITH Grouped_Run AS (SELECT name, description, project, test, executedOn,
ROW_NUMBER() OVER(ORDER BY executedOn) -
ROW_NUMBER() OVER(PARTITION BY name, description, project, test ORDER BY executedOn) AS groupingId
FROM Machine)
SELECT name, description, project, test, MIN(executedOn) as testStart
FROM Grouped_Run
GROUP BY name, description, project, test, groupingId
ORDER BY testStart
Fiddle example
(it's a little unclear if the group is going to be the whole row, but that's adjustable)
....will produce the results you're looking for.
Note that depending on what specific version you're on, there may be other/faster ways to achieve these results.
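As a rough illustration of the row-number-difference technique, the sketch below runs the same grouping on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer) rather than on DB2. The `Machine` table layout follows the column names used in the query above. The rows are hypothetical, modeled on the question's data with ISO dates and with the TEST_2 row dated 2017-05-13 so that it falls between the two TEST_1 runs. A `MAX(executedOn)` column is added here as an assumption about how the requested "Last date" could be obtained.

```python
import sqlite3

# Hypothetical rows: (name, description, project, test, executedOn).
rows = [
    ("Mach1", "DB2_AIX", "Team_1_PERS", "TEST_1", "2017-03-07"),
    ("Mach1", "DB2_AIX", "Team_1_PERS", "TEST_1", "2017-03-16"),
    ("Mach1", "DB2_AIX", "Team_1_PERS", "TEST_1", "2017-04-24"),
    ("Mach1", "DB2_AIX", "Team_1_PERS", "TEST_1", "2017-05-07"),
    ("Mach1", "DB2_AIX", "Team_1_PERS", "TEST_2", "2017-05-13"),
    ("Mach1", "DB2_AIX", "Team_1_PERS", "TEST_1", "2017-05-22"),
    ("Mach1", "DB2_AIX", "Team_1_PERS", "TEST_1", "2017-06-12"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Machine "
    "(name TEXT, description TEXT, project TEXT, test TEXT, executedOn TEXT)"
)
conn.executemany("INSERT INTO Machine VALUES (?, ?, ?, ?, ?)", rows)

# The overall row number minus the per-test row number stays constant
# within one uninterrupted run of the same test, so it identifies an island.
query = """
WITH Grouped_Run AS (
  SELECT name, description, project, test, executedOn,
         ROW_NUMBER() OVER (ORDER BY executedOn)
         - ROW_NUMBER() OVER (PARTITION BY name, description, project, test
                              ORDER BY executedOn) AS groupingId
  FROM Machine
)
SELECT test, MIN(executedOn) AS testStart, MAX(executedOn) AS testEnd
FROM Grouped_Run
GROUP BY name, description, project, test, groupingId
ORDER BY testStart
"""
islands = conn.execute(query).fetchall()
for row in islands:
    print(row)
```

With this data the two TEST_1 runs come out as separate rows because the TEST_2 row interrupts them, which is the shape of the result asked for in the question.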
It seems like you're trying to get the first and last of "Start date". Write a GROUP BY query with MIN(Start date) and another with MAX(Start date) then union the results. You'll have to select DISTINCT or do another GROUP BY to eliminate the duplicates that will occur when there's only one date.

Using date_add function with a timestamp in bigquery results in null for output

I'm trying to use the date_add function with a timestamp in bigquery, but I'm getting 'null' as a result from the output. I've used date_add successfully before, so I don't understand what the problem is. Here's a bit of code.
SELECT
userId,
MAX(most_recent_session) most_recent_session,
date_add(MAX(most_recent_session), 24, 'HOUR') as added_a_day,
FROM
(
SELECT
userId,
LAG(time, 0) OVER (PARTITION BY userId ORDER BY time) as most_recent_session,
LAG(time, 1) OVER (PARTITION BY userId ORDER BY time) as previous_session,
FROM TABLE_DATE_RANGE(dataset.tablename_, TIMESTAMP(DATE_ADD(CURRENT_TIMESTAMP(), -30, "DAY")), CURRENT_TIMESTAMP())
GROUP BY
userId,
time
)
)
group by
userID
I would expect three columns: the first containing userId, the second containing a timestamp for that user's most recent session, and the third containing that timestamp with 24 hours added. But instead of the value from the second column plus 24 hours, the third column contains 'null'.
Any thoughts?
I figured out the solution to the problem. You need to wrap the 'most_recent_session' reference in the outer level of the SQL in a USEC_TO_TIMESTAMP function. That struck me as odd, because BigQuery recognized the field as a timestamp, but it works.
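To illustrate why re-wrapping the value can help, here is a small Python analogy, not BigQuery code. It assumes the timestamp travels through the query as an integer count of microseconds since the Unix epoch, which is what USEC_TO_TIMESTAMP expects as input. The helper names `timestamp_to_usec` and `usec_to_timestamp` are hypothetical and exist only for this sketch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def timestamp_to_usec(ts: datetime) -> int:
    """Return the number of whole microseconds between the Unix epoch and ts."""
    return (ts - EPOCH) // timedelta(microseconds=1)

def usec_to_timestamp(usec: int) -> datetime:
    """Rebuild a timezone-aware timestamp from microseconds since the epoch."""
    return EPOCH + timedelta(microseconds=usec)

# A hypothetical "most recent session" value.
session = datetime(2015, 6, 1, 12, 0, 0, tzinfo=timezone.utc)

# After losing its type, the value is just a large integer.
raw = timestamp_to_usec(session)

# Restoring the type first makes date arithmetic meaningful again.
restored = usec_to_timestamp(raw)
added_a_day = restored + timedelta(hours=24)
print(raw, restored.isoformat(), added_a_day.isoformat())
```

The integer and the timestamp carry the same information. Only the typed form can be passed to date functions, which matches the observation that the outer query works once the type is restored explicitly.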

PostgreSQL SQL query clarification

Can anyone help me understand this SQL query in PostgreSQL ?
SELECT sum(count)
FROM (
SELECT count,
time,
max(time) OVER (PARTITION BY post_id) max_time
FROM totals
WHERE cust_id IN %s
AND time < %s
AND type = %s
) as ss
WHERE time = max_time;
Additional comment
To explain the comments I exchanged with a_horse_with_no_name on the original post, this query could be rewritten as follows:
SELECT sum(count)
FROM (
SELECT count,
time,
RANK() OVER (PARTITION BY post_id ORDER BY time DESC) time_desc
FROM totals
WHERE cust_id IN %s
AND time < %s
AND type = %s
) as ss
WHERE time_desc = 1;
I believe this makes it clearer what the query is doing (since it is a more standard form).
Original Comment
Let me make a guess: let's say count is the number of views and time is the time at which those views were recorded. My guess is that it is something like this. But KM won't tell us.
In any case if that is how it works then this is what the query does:
It gives the total views of all posts.
(As limited by the incoming parameters)
I could explain why, but I'll wait for you to apologize for cursing at me.
It returns the sum of the count column over the rows whose time value matches the latest time value for their post_id.
The totals that are checked are limited by cust_id, the time and the type. Values for those conditions are (apparently) passed as parameters.
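The behaviour described above can be checked with a small sketch. It uses SQLite through Python's sqlite3 module instead of PostgreSQL, with `?` placeholders in place of `%s`. The `totals` rows and the parameter values are hypothetical. The column names `count`, `time`, and `type` are quoted because they coincide with SQL function or type names. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical rows: (post_id, cust_id, type, time, count).
rows = [
    (1, 1, "view", 10, 5),
    (1, 1, "view", 20, 7),    # latest matching row for post 1
    (2, 2, "view", 15, 3),    # latest matching row for post 2
    (2, 2, "view", 30, 4),    # excluded: time is not below the cutoff
    (3, 9, "view", 5, 100),   # excluded: cust_id is not in the list
    (1, 1, "like", 22, 50),   # excluded: different type
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE totals '
    '(post_id INTEGER, cust_id INTEGER, "type" TEXT, "time" INTEGER, "count" INTEGER)'
)
conn.executemany("INSERT INTO totals VALUES (?, ?, ?, ?, ?)", rows)

params = (1, 2, 25, "view")  # two customer ids, time cutoff, type

# Original form: keep the rows whose time equals the per-post maximum.
max_query = """
SELECT sum("count") FROM (
  SELECT "count", "time",
         max("time") OVER (PARTITION BY post_id) AS max_time
  FROM totals
  WHERE cust_id IN (?, ?) AND "time" < ? AND "type" = ?
) AS ss
WHERE "time" = max_time
"""

# Rewritten form: rank rows per post by time, newest first, keep rank 1.
rank_query = """
SELECT sum("count") FROM (
  SELECT "count", "time",
         RANK() OVER (PARTITION BY post_id ORDER BY "time" DESC) AS time_desc
  FROM totals
  WHERE cust_id IN (?, ?) AND "time" < ? AND "type" = ?
) AS ss
WHERE time_desc = 1
"""

total_via_max = conn.execute(max_query, params).fetchone()[0]
total_via_rank = conn.execute(rank_query, params).fetchone()[0]
print(total_via_max, total_via_rank)
```

Both forms sum only the newest qualifying row of each post, so they agree on this data. They also treat ties the same way: if a post has several rows at its latest time, all of them are included in the sum.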