BigQuery SQL functions not working properly

I have a problem with the functions MAXIF and SUMIF. When I try to use either of them in my project, the console returns 'Function not found: sumif/maxif; Did you mean sum/max?'
It is odd, because COUNTIF works perfectly fine, and both MAXIF and SUMIF are described in the BigQuery documentation, so I'm not sure what I need to do to run the code properly.
Below is part of my code; any suggestions would be most welcome:
SELECT
  DISTINCT *,
  COUNTIF(status = 'completed') OVER (PARTITION BY id ORDER BY created_at) cpp,  -- this works
  SUMIF(value, status = 'completed') OVER (PARTITION BY id ORDER BY created_at) spp,  -- this doesn't
  MAXIF(created_at, status = 'completed') OVER (PARTITION BY id ORDER BY created_at DESC) lastpp,
FROM
  `production.payment_transactions`

Below is for BigQuery Standard SQL
#standardSQL
SELECT
  DISTINCT *,
  COUNTIF(status = 'completed') OVER (PARTITION BY id ORDER BY created_at) cpp,  -- this works
  SUM(IF(status = 'completed', value, NULL)) OVER (PARTITION BY id ORDER BY created_at) spp,  -- this now works
  MAX(IF(status = 'completed', created_at, NULL)) OVER (PARTITION BY id ORDER BY created_at DESC) lastpp,  -- this now works
FROM
  `production.payment_transactions`

SUMIF() and MAXIF() are not BigQuery functions. Use a CASE expression instead:
MAX(CASE WHEN status = 'completed' THEN created_at END) OVER (PARTITION BY id ORDER BY created_at DESC)
This is confusing because these functions are used in other parts of the GCP environment, particularly in a component called Dataprep.
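For completeness, a sketch of the whole query rewritten with CASE expressions (same table and columns as in the question; SUM() and MAX() ignore NULLs, so a CASE with no ELSE behaves like the IF(..., NULL) form):
#standardSQL
SELECT
  DISTINCT *,
  COUNTIF(status = 'completed') OVER (PARTITION BY id ORDER BY created_at) cpp,
  SUM(CASE WHEN status = 'completed' THEN value END) OVER (PARTITION BY id ORDER BY created_at) spp,
  MAX(CASE WHEN status = 'completed' THEN created_at END) OVER (PARTITION BY id ORDER BY created_at DESC) lastpp
FROM
  `production.payment_transactions`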


What is the advantage of using BigQuery’s QUALIFY operator?

I have just discovered BigQuery’s QUALIFY operator and have been reading about it at https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#qualify_clause
That documentation, though, does not explain why I should use QUALIFY instead of a normal WHERE predicate. If we take the example provided in the documentation:
SELECT item,
  RANK() OVER (PARTITION BY category ORDER BY purchases DESC) as rank
FROM Produce
WHERE Produce.category = 'vegetable'
QUALIFY rank <= 3
That query could also be written as
SELECT
  item,
  RANK() OVER (PARTITION BY category ORDER BY purchases DESC) as rank
FROM Produce
WHERE Produce.category = 'vegetable'
AND rank <= 3
and it would produce the same result. So what is the advantage of using QUALIFY?
One use of the QUALIFY clause is to filter results by the value of an analytic (window) function. As you mentioned in the comments, this could be seen as syntactic sugar, since the result of the analytic function can be computed in a subquery and then filtered with a WHERE clause. Note, though, that the rewrite in your question would not actually run: WHERE is evaluated before window functions and cannot reference the rank alias, which is exactly the gap QUALIFY fills.
USAGE 01: Find user_id's last login info
w/ QUALIFY
SELECT user_id, ip, country_code, os, ...,
FROM login_logs
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY log_datetime DESC) = 1
;
(The WHERE TRUE is needed because BigQuery requires at least one of WHERE, GROUP BY, or HAVING to be present in a query that uses QUALIFY.)
w/o QUALIFY
#standardSQL
WITH login_log AS (
  SELECT
    user_id, ip, country_code, os, ...,
    ROW_NUMBER() OVER (
      PARTITION BY user_id ORDER BY log_datetime DESC
    ) AS row_num
  FROM login_logs
)
SELECT user_id, ip, country_code, os, ...
FROM login_log
WHERE row_num = 1
;
Performance
I tested the two approaches on my data and found that the slot time of the two queries is almost identical. However, the QUALIFY clause has a slight advantage in shuffled bytes, since it doesn't need to keep the row_num column around.
USAGE 02: Find whether the OS changed between a user's logins
SELECT
  log_datetime, user_id, os,
  LAG(os, 1, NULL) OVER user_id_os_list AS previous_os,
FROM login_logs
WHERE TRUE
QUALIFY (previous_os != os)
WINDOW user_id_os_list AS (
  PARTITION BY user_id ORDER BY log_datetime
)
;
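For comparison, the same USAGE 02 query can be written without QUALIFY by pushing the LAG() into a subquery, as in USAGE 01. A sketch, assuming the same login_logs table:
#standardSQL
WITH os_changes AS (
  SELECT
    log_datetime, user_id, os,
    LAG(os, 1, NULL) OVER (
      PARTITION BY user_id ORDER BY log_datetime
    ) AS previous_os
  FROM login_logs
)
SELECT log_datetime, user_id, os, previous_os
FROM os_changes
WHERE previous_os != os
;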

Rank() based on column entries while the data is ordered by date

I'm trying to use the dense_rank() function over the pagename column after the data is ordered by time_id.
The expected output in the rank column, rn, is: [1,2,2,3,4].
Currently I have written it as:
with tbl2 as (
  select UID, pagename, date_id, time_id, source --, dense_rank() over(partition by UID order by pagename) as rn
  from tbl1
  order by time_id
)
select *, dense_rank() over(partition by UID order by time_id, pagename) as rn
from tbl2
Any help would be appreciated.
Edit 1: What I am trying to achieve here is to assign ranks, following the user's on-screen action flow, to the pages that are visited. If the same page 'A' is visited again after visiting a different page 'B', then the ranks for the visits A, B, A will be 1, 2, 3 (note that the same page A gets two different ranks, 1 and 3).
step-by-step demo: db<>fiddle
SELECT
  *,
  SUM(is_diff) OVER (ORDER BY date_id, time_id, page) AS rn
FROM (
  SELECT
    *,
    CASE WHEN page = LAG(page) OVER (ORDER BY date_id, time_id) THEN 0 ELSE 1 END AS is_diff
  FROM mytable
) s
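If the table holds more than one user, the windows presumably need a PARTITION BY as well; a sketch, assuming UID identifies the user:
SELECT
  *,
  SUM(is_diff) OVER (PARTITION BY uid ORDER BY date_id, time_id, page) AS rn
FROM (
  SELECT
    *,
    CASE WHEN page = LAG(page) OVER (PARTITION BY uid ORDER BY date_id, time_id) THEN 0 ELSE 1 END AS is_diff
  FROM mytable
) s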
This looks exactly like a problem I asked about some years ago: Window functions: PARTITION BY one column after ORDER BY another
You want to execute a window function over the columns (uid, page) but keep the current order, which is given by the unrelated columns (date_id, time_id).
The problem is that PARTITION BY groups the records before the ORDER BY clause is applied, so it defines the primary order, which is not what you expect here.
I found a solution for that back then and have adapted it to your use case. Please read the explanation over there: https://stackoverflow.com/a/52439794/3984221
Interesting part: your special rank() case is not explicitly required in the query, because my solution produces it out of the box ("by accident", so to speak ;) ).
Hmmm . . . If you want the pages ordered by their earliest time, then use two levels of window functions:
select t.*,
       dense_rank() over (partition by uid order by min_rn, pagename) as ranking
from (select t.*,
             min(rn) over (partition by uid, pagename) as min_rn
      from t
     ) t
Note: This uses rn as a convenient shortcut because the date/time is split into two columns. You can also combine them:
select t.*,
       dense_rank() over (partition by uid order by min_dt, pagename) as ranking
from (select t.*,
             min(date_id || time_id) over (partition by uid, pagename) as min_dt
      from t
     ) t;
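Note that ordering by date_id || time_id is only safe when both columns are fixed-width, zero-padded strings; otherwise the concatenation can sort incorrectly (for example, '9' sorts after '10'). A minimal sketch of a safer variant, assuming Postgres-style casts to text and hypothetical column widths:
select t.*,
       dense_rank() over (partition by uid order by min_dt, pagename) as ranking
from (select t.*,
             -- zero-pad so lexical order matches chronological order
             min(lpad(cast(date_id as text), 8, '0') ||
                 lpad(cast(time_id as text), 6, '0')) over (partition by uid, pagename) as min_dt
      from t
     ) t;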
Note: This solution is different from S_man's. On your sample data, they do the same thing. However, if the user returns to a page, S_man's gives the page a new ranking, while this one gives the page the same ranking as the first time it appeared. It is not clear which you really want.
You can use DENSE_RANK() like this for your requirement:
SELECT
  u_id,
  page_name,
  date_id,
  time_id,
  source,
  DENSE_RANK() OVER (
    PARTITION BY page_name
    ORDER BY u_id DESC
  ) rn
FROM (SELECT * FROM tbl1 ORDER BY time_id) AS result;

AS400 - Why are the tokens *, ! not valid? What could be an alternative way to run the query using STRSQL - SQL Interactive Session?

In AS400, how can I run this query using STRSQL?
For the query below I am getting the following error message instead of results:
"Token , was not valid. Valid tokens: FROM INTO."
Code Snippet for the Original Query:
WITH cte AS (
  SELECT *,
         !SUM(status != 'INACTIVE') OVER (PARTITION BY loc_code, user_id, service_area, service_sector) only_inactive,
         ROW_NUMBER() OVER (PARTITION BY loc_code, user_id, service_area, service_sector ORDER BY last_changed DESC) rn
  FROM test
)
SELECT *
FROM cte
WHERE only_inactive AND rn = 1
After checking the query, I found that the problem is in the SELECT statement inside the WITH clause. I don't know why this error occurs and have been unable to find a way to solve it.
I then removed * from the SELECT statement inside the WITH clause, which avoided that error. After executing my updated query, I got the same kind of error but with a different message:
Token ! was not valid. Valid tokens: ( + * - ? : DAY INF LAG NAN RID
ROW RRN CASE CAST CHAR DATE DAYS.
Code Snippet for the Updated Query:
WITH cte AS (
  SELECT !SUM(status != 'INACTIVE') OVER (PARTITION BY loc_code, user_id, service_area, service_sector) only_inactive,
         ROW_NUMBER() OVER (PARTITION BY loc_code, user_id, service_area, service_sector ORDER BY last_changed DESC) rn
  FROM test
)
SELECT *
FROM cte
WHERE only_inactive AND rn = 1
I tried:
Instead of the ! token I tried using <> and ¬=, but neither helped; I got the same kind of error message, now pointing at the <> and ¬= tokens.
Expected Result:
I want to return all records satisfying the logic given in the query, with all columns of my table.
Could someone please tell me how to solve this issue?
I tried the following updated query just to test whether it works without the ! or NOT token.
WITH cte AS (
  SELECT test.*,
         SUM(status < 'INACTIVE') OVER (PARTITION BY loc_code, user_id, service_area, service_sector) only_inactive,
         ROW_NUMBER() OVER (PARTITION BY loc_code, user_id, service_area, service_sector ORDER BY last_changed DESC) rn
  FROM test
)
SELECT *
FROM cte
WHERE only_inactive AND rn = 1
I expected this query to list some results, but this time I get this error message:
Token < was not valid. Valid tokens: ) ,.
I think the problem is with the tokens, but I am not sure what exactly.
Try writing the query like this:
WITH cte AS (
  SELECT t.*,
         MIN(status = 'INACTIVE') OVER (PARTITION BY loc_code, user_id, service_area, service_sector) AS only_inactive,
         ROW_NUMBER() OVER (PARTITION BY loc_code, user_id, service_area, service_sector ORDER BY last_changed DESC) AS rn
  FROM test t
)
SELECT *
FROM cte
WHERE only_inactive AND rn = 1
Here is a db<>fiddle showing that this syntax works in DB2.
Note that the types of the columns should not matter for the syntax error you are seeing.
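If MIN() over a boolean expression is rejected on your DB2 version, a sketch of a CASE-based equivalent that avoids boolean expressions entirely (same table and columns as the question; has_active is a name introduced here for illustration):
WITH cte AS (
  SELECT t.*,
         -- 1 if any row in the partition is not INACTIVE, else 0
         MAX(CASE WHEN status <> 'INACTIVE' THEN 1 ELSE 0 END)
           OVER (PARTITION BY loc_code, user_id, service_area, service_sector) AS has_active,
         ROW_NUMBER()
           OVER (PARTITION BY loc_code, user_id, service_area, service_sector ORDER BY last_changed DESC) AS rn
  FROM test t
)
SELECT *
FROM cte
WHERE has_active = 0 AND rn = 1
Here has_active = 0 means every row in the partition has status 'INACTIVE', which matches the intent of !SUM(status != 'INACTIVE').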

How to distribute ranks when prior rank is zero (part 2)

This is an extension of my prior question, How to distribute values when prior rank is zero. The solution worked great in the Postgres environment, but now I need to replicate it in a Databricks environment (Spark SQL).
The question is the same as before, but now I am trying to determine how to convert the Postgres query to Spark SQL. Basically, it sums up allocation amounts when there are gaps in the data (i.e., no micro_geos when grouping by location and geo3). The "imputed allocation" will equal 1 for all location and geo3 groups.
This is the postgres query, which works great:
select location_code, geo3, distance_group, has_micro_geo, imputed_allocation from
(
select ia.*,
(case when has_micro_geo > 0
then sum(allocation) over (partition by location_code, geo3, grp)
else 0
end) as imputed_allocation
from (select s.*,
count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
) ia
)z
But it doesn't translate well and produces this error in databricks:
Error in SQL statement: ParseException:
mismatched input 'from' expecting <EOF>(line 1, pos 78)
== SQL ==
select location_code, geo3, distance_group, has_micro_geo, imputed_allocation from
------------------------------------------------------------------------------^^^
Or, at a minimum, how can I convert just the part of the inner query that creates grp? Then perhaps the rest will work. I have been trying to replace the filter-where logic with something else, but my attempts have not worked as desired.
select s.*,
count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
Here's a db-fiddle with data, https://www.db-fiddle.com/f/wisvDZJL9BkWxNFkfLXdEu/0, which is currently set to Postgres, but again, I need to run this in a Spark SQL environment. I've tried breaking this down and creating different tables, but my groups do not come out as desired.
You need to rewrite this subquery:
select s.*,
count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
Although the filter() clause for window and aggregate functions is standard SQL, few databases support it so far. Instead, use a conditional window sum(), which produces the same result:
select s.*,
sum(case when has_micro_geo <> 0 then 1 else 0 end) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
I think that the rest of the query should run fine in Spark SQL.
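Putting it together, a sketch of the full query with the filter() clause replaced by the conditional sum (untested in Databricks, but it uses only constructs Spark SQL supports):
select location_code, geo3, distance_group, has_micro_geo, imputed_allocation
from (
  select ia.*,
         (case when has_micro_geo > 0
               then sum(allocation) over (partition by location_code, geo3, grp)
               else 0
          end) as imputed_allocation
  from (select s.*,
               -- conditional sum replaces count(*) filter (where ...)
               sum(case when has_micro_geo <> 0 then 1 else 0 end) over (partition by location_code, geo3 order by distance_group desc) as grp
        from staging_groups s
       ) ia
) z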
As has_micro_geo is already a 0/1 flag, you can rewrite the count(filter) as
sum(has_micro_geo)
  over (partition by location_code, geo3
        order by distance_group desc
        rows unbounded preceding) as grp
Adding rows unbounded preceding avoids the default range unbounded preceding, which might be less performant.
Btw, I wrote that already in my comment on Gordon's solution to your prior question :-)

using row_number() in db2

I have the following data, and I am trying to count each record whenever a combo of userid + customer appears. If the same combo returns after other combos in between, I want the counter to start from one again. However, with row_number() in the following Db2 code, I can't get the counter to restart at 1 each time a userid + customer combo reappears over time. Does anyone have other suggestions? Thank you so much.
select userid, customer, event_name, event_timestamp,
       row_number() over (partition by userid, customer
                          order by event_timestamp) as steps_rownum
from trackhist
order by userid, event_timestamp;
select userid, customer, event_name, event_timestamp,
       row_number() over (partition by userid, customer, grp
                          order by event_timestamp) as steps
from (
  select t.*,
         -- the difference of the two row numbers is constant within each
         -- consecutive run of the same (userid, customer) combo
         (row_number() over (order by event_timestamp) -
          row_number() over (partition by userid, customer order by event_timestamp)
         ) as grp
  from trackhist t
) t
order by userid, event_timestamp
Demo here: Rextester
Note that the demo uses SQL Server, since DB2 is not available on Rextester (at least for now). The query should work in DB2, however, since ROW_NUMBER() behaves the same in both databases.
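For intuition, here is a minimal sketch of the difference-of-row-numbers trick on made-up inline data (the VALUES rows and column values are purely illustrative):
select userid, customer, event_timestamp,
       row_number() over (order by event_timestamp) as global_rn,
       row_number() over (partition by userid, customer order by event_timestamp) as combo_rn,
       -- constant within each consecutive run of the same combo
       row_number() over (order by event_timestamp) -
       row_number() over (partition by userid, customer order by event_timestamp) as grp
from (values ('u1', 'c1', 1),
             ('u1', 'c1', 2),
             ('u2', 'c9', 3),
             ('u1', 'c1', 4)
     ) as t(userid, customer, event_timestamp)
order by event_timestamp;
The first two ('u1', 'c1') rows get grp = 0; after ('u2', 'c9') interrupts the run, the fourth row gets grp = 1, so the outer row_number() restarts counting at 1 for that new group.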