BigQuery - User Defined Function - Scalar Subquery Error

I am trying to create a user-defined function from the query below, but when I run it, it returns the following error. I would appreciate it if anyone here knows a workaround.
Scalar subquery produced more than one element:
create or replace function `dataset.list_of_days`
(user_id string, start_date date, end_date date) AS
((
with temp as (
select user_id, day from
unnest(generate_date_array(start_date, end_date)) day)
select as struct row_number() over (partition by user_id order by day asc) as row_num,
user_id, day
from temp
));
with temp as (
select '100110' as user_id, date('2020-01-31') as start_date,
date('2020-02-28') as end_date )
select dataset.list_of_days(user_id, start_date, end_date)
from temp;

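A BigQuery SQL UDF can only return a single value, but the subquery in your function body returns one row per day, which is what triggers the error. Since the output from inside the UDF does not look too large, you could wrap the subquery in ARRAY() so that the function returns a single array: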
create or replace function `dataset.list_of_days`
(user_id string, start_date date, end_date date) AS
(ARRAY( -- NOTE: ARRAY() was added to your original query
with temp as (
select user_id, day from
unnest(generate_date_array(start_date, end_date)) day)
select as struct row_number() over (partition by user_id order by day asc) as row_num,
user_id, day
from temp
));
with temp as (
select '100110' as user_id, date('2020-01-31') as start_date,
date('2020-02-28') as end_date )
select dataset.list_of_days(user_id, start_date, end_date)
from temp;
Note that the return type of the UDF is now ARRAY<STRUCT<row_num INT64, user_id STRING, day DATE>>. You can UNNEST the array if you need it as a table with multiple rows.
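As a minimal sketch of that unnesting step, the query below turns the returned array back into one row per day. It assumes the ARRAY() version of dataset.list_of_days shown above has already been created; the alias d is only illustrative.

```sql
-- Assumes the ARRAY() version of dataset.list_of_days already exists.
WITH temp AS (
  SELECT
    '100110' AS user_id,
    DATE '2020-01-31' AS start_date,
    DATE '2020-02-28' AS end_date
)
SELECT
  d.row_num,
  d.user_id,
  d.day
FROM temp,
  -- Correlated UNNEST: expands the array returned for each row of temp.
  UNNEST(dataset.list_of_days(temp.user_id, temp.start_date, temp.end_date)) AS d
ORDER BY d.row_num;
```

Each struct in the array becomes one output row, so the result has the same shape the original scalar subquery was trying to return.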

Related

DATE_TRUNC with :: and without

The query only works when there is :: DATE.
-- Wrap the query you wrote in a CTE named reg_dates
WITH reg_dates AS (
SELECT
user_id,
MIN(order_date) AS reg_date
FROM orders
GROUP BY user_id)
SELECT
-- Count the unique user IDs by registration month
DATE_TRUNC('month', reg_date) :: DATE AS delivr_month,
COUNT(DISTINCT user_id) AS regs
FROM reg_dates
GROUP BY delivr_month
ORDER BY delivr_month ASC;
Why is that required? When I run the query below, without :: DATE, it does not work.
-- Wrap the query you wrote in a CTE named reg_dates
WITH reg_dates AS (
SELECT
user_id,
MIN(order_date) AS reg_date
FROM orders
GROUP BY user_id)
SELECT
-- Count the unique user IDs by registration month
DATE_TRUNC('month', reg_date) AS delivr_month,
COUNT(DISTINCT user_id) AS regs
FROM reg_dates
GROUP BY delivr_month
ORDER BY delivr_month ASC;
Your RDBMS is most likely PostgreSQL. There, :: is the cast operator: in your query it converts the result of DATE_TRUNC to the date type. The form expression::type is PostgreSQL shorthand for the standard CAST(expression AS type).
Equivalently:
CAST (DATE_TRUNC('month', reg_date) AS DATE) AS delivr_month
What does "does not work" mean? Note that date_trunc in PostgreSQL returns a timestamp, not a date. So if your query needs a date in order to work, that is why you need ::date.
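One way to see the difference for yourself, assuming the database really is PostgreSQL, is to ask it for the type of each expression with pg_typeof. The date literal below is an arbitrary sample value, not taken from the orders table.

```sql
-- PostgreSQL only: compare the result types with and without the cast.
SELECT
  pg_typeof(DATE_TRUNC('month', DATE '2020-02-15'))               AS without_cast,
  pg_typeof(DATE_TRUNC('month', DATE '2020-02-15')::date)         AS with_cast,
  pg_typeof(CAST(DATE_TRUNC('month', DATE '2020-02-15') AS date)) AS with_standard_cast;
```

The first column should report a timestamp type, while the other two report date.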

how to calculate difference between dates in BigQuery

I have a table named Employees with columns PersonID, Name, and StartDate. I want to calculate 1) the difference in days between the newest and oldest employee and 2) the longest period of time (in days) without any new hires. I have tried to use DATEDIFF; however, the dates are in a single column and I'm not sure what other method I should use. Any help would be greatly appreciated.
Below is for BigQuery Standard SQL
#standardSQL
SELECT
SUM(days_before_next_hire) AS days_between_newest_and_oldest_employee,
MAX(days_before_next_hire) - 1 AS longest_period_without_new_hire
FROM (
SELECT
DATE_DIFF(
StartDate,
LAG(StartDate) OVER(ORDER BY StartDate),
DAY
) days_before_next_hire
FROM `project.dataset.your_table`
)
You can test and play with the above using dummy data, as in the example below:
#standardSQL
WITH `project.dataset.your_table` AS (
SELECT DATE '2019-01-01' StartDate UNION ALL
SELECT '2019-01-03' StartDate UNION ALL
SELECT '2019-01-13' StartDate
)
SELECT
SUM(days_before_next_hire) AS days_between_newest_and_oldest_employee,
MAX(days_before_next_hire) - 1 AS longest_period_without_new_hire
FROM (
SELECT
DATE_DIFF(
StartDate,
LAG(StartDate) OVER(ORDER BY StartDate),
DAY
) days_before_next_hire
FROM `project.dataset.your_table`
)
with result
Row days_between_newest_and_oldest_employee longest_period_without_new_hire
1 12 9
Note the use of -1 in calculating longest_period_without_new_hire. Whether to apply this adjustment is up to you and depends on how you prefer to count gaps: with it, only the days strictly between two hire dates are counted.
1) difference in days between the newest and oldest record
WITH table AS (
SELECT DATE(created_at) date, *
FROM `githubarchive.day.201901*`
WHERE _table_suffix<'2'
AND repo.name = 'google/bazel-common'
AND type='ForkEvent'
)
SELECT DATE_DIFF(MAX(date), MIN(date), DAY) max_minus_min
FROM table
2) the longest period of time (in days) without any new records
WITH table AS (
SELECT DATE(created_at) date, *
FROM `githubarchive.day.201901*`
WHERE _table_suffix<'2'
AND repo.name = 'google/bazel-common'
AND type='ForkEvent'
)
SELECT MAX(diff) max_diff
FROM (
SELECT DATE_DIFF(date, LAG(date) OVER(ORDER BY date), DAY) diff
FROM table
)
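The two approaches above can be combined into a single query. The sketch below is written against an inline stand-in for the Employees table; the three sample rows and the PersonID and Name values are invented for illustration, and the gap is counted without the -1 adjustment.

```sql
#standardSQL
WITH employees AS (
  -- Hypothetical sample rows standing in for the real Employees table.
  SELECT 1 AS PersonID, 'A' AS Name, DATE '2019-01-01' AS StartDate UNION ALL
  SELECT 2, 'B', DATE '2019-01-03' UNION ALL
  SELECT 3, 'C', DATE '2019-01-13'
),
gaps AS (
  SELECT
    StartDate,
    -- Days since the previous hire; NULL for the first hire.
    DATE_DIFF(StartDate, LAG(StartDate) OVER (ORDER BY StartDate), DAY)
      AS days_since_previous_hire
  FROM employees
)
SELECT
  DATE_DIFF(MAX(StartDate), MIN(StartDate), DAY) AS days_between_newest_and_oldest,
  MAX(days_since_previous_hire) AS longest_gap_days
FROM gaps;
```

With these sample rows the query returns 12 and 10. MAX ignores the NULL produced for the first hire, so no extra filtering is needed.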

Check if timestamp is contained in date

I'm trying to check whether a datetime falls within the current date, but I'm not able to do it.
This is my query:
select
date(timestamp) as event_date,
count(*)
from pixel_logs.full_logs f
where 1=1
where event_date = CUR_DATE()
How can I fix it?
Like Mikhail said, you need to use CURRENT_DATE(). Also, count(*) requires you to GROUP BY the date in your example. I do not know how your data is formatted, but one way to modify your query:
#standardSQL
WITH
table AS (
SELECT
1494977678 AS timestamp_secs) -- Sample Unix timestamp in seconds (2017-05-16 UTC, current when this was written)
SELECT
event_date,
COUNT(*) as count
FROM (
SELECT
DATE(TIMESTAMP_SECONDS(timestamp_secs)) AS event_date,
CURRENT_DATE()
FROM
table)
WHERE
event_date = CURRENT_DATE()
GROUP BY
event_date;
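Applied directly to the query from the question, the fix might look like the sketch below. It assumes the timestamp column in pixel_logs.full_logs is of type TIMESTAMP (the question does not say). It drops the duplicate WHERE and repeats the DATE(timestamp) expression in the filter, because a SELECT alias cannot be referenced in WHERE.

```sql
#standardSQL
SELECT
  DATE(timestamp) AS event_date,
  COUNT(*) AS event_count
FROM pixel_logs.full_logs
-- Assumption: `timestamp` is a TIMESTAMP column; DATE() uses UTC by default.
WHERE DATE(timestamp) = CURRENT_DATE()
GROUP BY event_date;
```

If the column instead holds Unix seconds, wrap it in TIMESTAMP_SECONDS() first, as in the answer above.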

Compare timestamps stored as strings to a string formatted date

event_date contains timestamps stored as strings.
1382623200
1382682600
1384248600
...
How can I SELECT rows where event_date is less than a string formatted date? This is my best attempt:
SELECT *
FROM [analytics:workspace.events]
WHERE TIMESTAMP(event_date) < PARSE_UTC_USEC("2013-05-02 09:09:29");
I get all rows regardless of what date I pass to PARSE_UTC_USEC()
It looks like the event_date strings represent Unix seconds. Try this using standard SQL (uncheck "Use Legacy SQL" under "Show Options"):
WITH T AS (
SELECT x, event_date
FROM UNNEST(['1382623200',
'1382682600',
'1384248600']) AS event_date WITH OFFSET x
)
SELECT *
FROM (
SELECT * REPLACE (TIMESTAMP_SECONDS(CAST(event_date AS INT64)) AS event_date)
FROM T
)
WHERE event_date < '2013-05-02 09:09:29';
The subquery converts the event_date string into a timestamp using the REPLACE clause.
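Against the real table, the same conversion can go straight into the WHERE clause. This is a sketch only: it assumes the legacy name [analytics:workspace.events] maps to `analytics.workspace.events` in standard SQL and that every event_date value is a numeric string.

```sql
#standardSQL
SELECT *
FROM `analytics.workspace.events`
-- CAST fails on non-numeric strings; SAFE_CAST would yield NULL instead.
WHERE TIMESTAMP_SECONDS(CAST(event_date AS INT64)) < TIMESTAMP '2013-05-02 09:09:29';
```

Unlike the REPLACE version, this leaves event_date in the output as the original string.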
Try the query below (legacy SQL). I hope it helps.
SELECT event_date, TIMESTAMP(event_date) as ts
FROM -- [analytics:workspace.events]
(
SELECT event_date FROM
(SELECT '1382623200' AS event_date),
(SELECT '1382682600' AS event_date),
(SELECT '1384248600' AS event_date)
)
WHERE TIMESTAMP(event_date) < PARSE_UTC_USEC("2013-10-25 07:30:00")
The above is just an example; in practice you would query your own table:
SELECT event_date, TIMESTAMP(event_date) as ts
FROM [analytics:workspace.events]
WHERE TIMESTAMP(event_date) < PARSE_UTC_USEC("2013-10-25 07:30:00")

BigQuery Date and Time Functions returning NULL on a timestamp column

I am pulling 3 timestamp columns: timestamp, prev_timestamp, next_timestamp from one timestamp column in a table using LAG() and LEAD(). I need to do some simple date & time formatting but when I use a function like MONTH() on either prev_timestamp or next_timestamp it returns NULL.
The schema type of the resulting column is correct (TIMESTAMP) and for some reason regular timestamp date and time formatting works. How do I make it so that it returns the month correctly for all 3 columns?
Example code that returns the month for timestamp column and NULL for prev and next timestamp columns:
SELECT
MONTH(timestamp) AS month,
MONTH(prev_timestamp) AS prev_month,
MONTH(next_timestamp) AS next_month
FROM (
SELECT
timestamp,
LAG(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) prev_timestamp,
LEAD(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) next_timestamp
FROM timestamp_table
)
Having tested and checked a couple of things, I took Mikhail's answer as a starting point and realized it is incorrect: LAG/LEAD does not return milliseconds but MICROseconds (why is anybody's guess).
SELECT
MONTH(timestamp) AS month,
MONTH(MSEC_TO_TIMESTAMP((prev_timestamp/1000))) AS prev_month,
MONTH(MSEC_TO_TIMESTAMP((next_timestamp/1000))) AS next_month
FROM (
SELECT
timestamp,
LAG(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) prev_timestamp,
LEAD(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) next_timestamp
FROM timestamp_table
)
This should work. I tested it by creating a table with three timestamp rows. Without the /1000, my lagged and lead values gave a different month: if you skip the division, you end up somewhere in the 47th millennium.
Try the following:
SELECT
MONTH(MSEC_TO_TIMESTAMP(timestamp)) AS month,
MONTH(MSEC_TO_TIMESTAMP(prev_timestamp)) AS prev_month,
MONTH(MSEC_TO_TIMESTAMP(next_timestamp)) AS next_month
FROM (
SELECT
timestamp,
LAG(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) prev_timestamp,
LEAD(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) next_timestamp
FROM timestamp_table
)
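The workarounds above are for legacy SQL. In standard SQL, LAG and LEAD return the same type as their input, so no conversion should be needed. The sketch below uses EXTRACT in place of MONTH(); the inline rows and the column name ts (standing in for the timestamp column) are assumptions made for illustration.

```sql
#standardSQL
WITH timestamp_table AS (
  -- Hypothetical sample rows standing in for the real table.
  SELECT 1 AS id, TIMESTAMP '2017-01-15 00:00:00' AS ts UNION ALL
  SELECT 1, TIMESTAMP '2017-02-15 00:00:00' UNION ALL
  SELECT 1, TIMESTAMP '2017-03-15 00:00:00'
)
SELECT
  EXTRACT(MONTH FROM ts) AS month,
  EXTRACT(MONTH FROM prev_ts) AS prev_month,
  EXTRACT(MONTH FROM next_ts) AS next_month
FROM (
  SELECT
    ts,
    -- LAG/LEAD keep the TIMESTAMP type in standard SQL.
    LAG(ts) OVER (PARTITION BY id ORDER BY ts) AS prev_ts,
    LEAD(ts) OVER (PARTITION BY id ORDER BY ts) AS next_ts
  FROM timestamp_table
)
ORDER BY ts;
```

prev_month is NULL for the first row of each id and next_month is NULL for the last, since there is no neighbouring row to read from.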