Issues connecting Google Data Studio to BigQuery with window function - google-bigquery

I have a complex BigQuery view that pulls data from various connected Google Sheets along with calculated data from within BigQuery. I'm trying to create a dashboard on top of the view in Data Studio.
I'm having an issue getting my data to show in Data Studio and have isolated it to a particular part of the underlying view in BigQuery.
I had an earlier problem that was answered by this question.
I am running effectively the query from that post, saved as a view and then connected to Data Studio.
SELECT order_id, order_date,
ARRAY_AGG(line_item) AS line_items
FROM (
SELECT order_id, order_date,
STRUCT(item_sku,
item_quantity,
item_subtotal,
cost.product_cost) AS line_item
FROM `order_data_table`, UNNEST(line_items) AS items
JOIN `price_history_table` AS cost
ON items.item_sku = cost.sku AND effective_date < order_date
QUALIFY 1 = ROW_NUMBER() OVER(PARTITION BY order_id, order_date, item_sku ORDER BY effective_date DESC)
)
GROUP BY order_id, order_date
This query uses a window function and it is this that is causing my issue. Whenever I try to connect to the data I get this.
With the details being
Data Studio cannot connect to your data set.
Failed to fetch data from the underlying data set
Removing the below line from the query solves the issue but then I don't have the desired data.
QUALIFY 1 = ROW_NUMBER() OVER(PARTITION BY order_id, order_date, item_sku ORDER BY effective_date DESC)
Is there a reason why this breaks Data Studio? Can I avoid it? Can I solve the original issue in a different way that doesn't use a window function?
UPDATE
Looks like there is an issue in Data Studio where it does not support the QUALIFY clause.
Any suggestions on how I can re-write this query without using QUALIFY?

As described in the issue, the workaround is to always include the WHERE clause. Without it, Data Studio gets confused and fails.
I had the same problem and adding WHERE TRUE to my queries fixed it without needing any other rewrite.

I fixed this by removing the QUALIFY clause and rewriting my query like this:
SELECT order_id, order_date,
ARRAY_AGG(line_item) AS line_items
FROM (
SELECT order_id, order_date,
STRUCT(item_sku,
item_quantity,
item_subtotal,
cost.product_cost,
ROW_NUMBER() OVER(PARTITION BY order_id, order_date, item_sku ORDER BY effective_date DESC) AS row_num) AS line_item
FROM `order_data_table`, UNNEST(line_items) AS items
JOIN `price_history_table` AS cost
ON items.item_sku = cost.sku AND effective_date < order_date
)
WHERE line_item.row_num = 1
GROUP BY order_id, order_date
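The same "number the rows in a subquery, then filter in the outer WHERE" pattern can be tried outside BigQuery. The following is a minimal sketch using Python's built-in sqlite3 module (assuming SQLite 3.25 or later for window functions). The tables order_items and price_history and their rows are made up for illustration, and the STRUCT/ARRAY_AGG part is left out because SQLite has no equivalent.

```python
import sqlite3

# Hypothetical, flattened stand-ins for the tables in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_items (order_id INTEGER, order_date TEXT, item_sku TEXT);
CREATE TABLE price_history (sku TEXT, effective_date TEXT, product_cost REAL);
INSERT INTO order_items VALUES (1, '2021-03-10', 'A'), (1, '2021-03-10', 'B');
INSERT INTO price_history VALUES
  ('A', '2021-01-01', 1.0), ('A', '2021-03-01', 1.5), ('A', '2021-04-01', 2.0),
  ('B', '2021-02-01', 5.0);
""")

# Number the candidate prices per order line, newest first, in the inner query,
# then keep only row 1 in the outer query. No QUALIFY is needed.
latest_cost = conn.execute("""
SELECT order_id, item_sku, product_cost
FROM (
  SELECT o.order_id, o.item_sku, c.product_cost,
         ROW_NUMBER() OVER (
           PARTITION BY o.order_id, o.order_date, o.item_sku
           ORDER BY c.effective_date DESC
         ) AS row_num
  FROM order_items AS o
  JOIN price_history AS c
    ON o.item_sku = c.sku AND c.effective_date < o.order_date
)
WHERE row_num = 1
ORDER BY item_sku
""").fetchall()

print(latest_cost)  # [(1, 'A', 1.5), (1, 'B', 5.0)]
```

SKU A picks the 2021-03-01 price because the 2021-04-01 price is not yet effective on the order date.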

Related

Case when in Rank partition By

I've been relearning SQL and I'm not sure whether this code can work. Can someone please provide feedback or an alternative for this case?
Overall, I'm looking for duplicate orders: orders submitted on the same day, at different times, by the same user.
For the second step, I was thinking of ranking the rows by date and time, so that a second order on the same day would be ranked 2.
Select * ( including orderDate)
RANK() OVER(PARTITION BY
Customer,
case
when (Orderstart(Datetimestamp) > OrderEnd(Datetimestamp) and OrderEnd<Orderstart ) AS Rank_Items
From FirstStep
This is just ranking everything now going up to 500+ ranks.
Sample Data
Desired Result:
I would use row_number() to identify multiple orders by the same customer on the same date:
row_number() over (partition by customer, cast(orderdatetime as date) order by orderdatetime)
The cast-to-date might vary by database.
This enumerates the orders for a customer on a given date, which seems to be what you want to accomplish.
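As a rough illustration of that expression, here is a small sketch using Python's built-in sqlite3 module (assuming SQLite 3.25 or later). The orders table and its rows are invented, and date(orderdatetime) is SQLite's way of casting to a date. Any row with seqnum greater than 1 is a repeat order for that customer on that day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, orderdatetime TEXT);  -- hypothetical table
INSERT INTO orders VALUES
  ('alice', '2021-05-01 09:00:00'),
  ('alice', '2021-05-01 14:30:00'),
  ('alice', '2021-05-02 10:00:00'),
  ('bob',   '2021-05-01 11:00:00');
""")

# Enumerate each customer's orders within a calendar day, earliest first.
rows = conn.execute("""
SELECT customer, orderdatetime,
       ROW_NUMBER() OVER (
         PARTITION BY customer, date(orderdatetime)
         ORDER BY orderdatetime
       ) AS seqnum
FROM orders
ORDER BY customer, orderdatetime
""").fetchall()

# Orders numbered 2 or higher are same-day repeats.
repeat_orders = [r for r in rows if r[2] > 1]
print(repeat_orders)  # [('alice', '2021-05-01 14:30:00', 2)]
```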

SQL select the first match in an ordered list

Using Microsoft SQL Server Management Studio version 14.0.17213.0
I have a list of events that go in order. I want to select the highest precedent acct_no, complete_date and event.
My problem is if I use
select
account_number, event, max(complete_date) as mx_comp
from
mytable
where
event in ('event1','event2'....)
then I get all my acct_numbers, all the events in the list and the max complete date for that event. But I want acct_no listed with the maximum completed date for any item in the list and the associated event.
Furthermore, it's entirely possible that two events occurred on the same date, so I cannot do
select *
from mytable mt
join
(select acct_number, max(complete_date)
from mytable) t on mt.acct_number = t.account_number
and mt.complete_date = t.complete_date
because if two events occurred on the same day then I still get duplicate results.
I have tried to do a similar thing with
row_number() over (order by account_number) as RowNum
but it did not work, because I still get matches to all the events, not just my highest-precedence event.
It really boils down to needing to return the acct_number, event and complete date associated with the highest-importance match from the items in an ordered list.
I am sure it is easy, but despite all my Google and Stack Overflow searching I cannot figure it out.
I have recently been thinking that it might be possible with something like coalesce(mylist), because I would be able to put my list in order, but I cannot figure out how to use coalesce in a meaningful way for this problem.
The real solution would be to create a table with precedence numbers or to have a most-recent indicator, but I don't have unlimited access to create any tables I want.
Any help or ideas on how to match to an ordered list would be appreciated.
You seem to want:
select t.*
from (select t.*,
row_number() over (partition by account_number order by complete_date desc) as seqnum
from mytable t
where event in ('event1', 'event2', ....)
) t
where seqnum = 1;
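If two events can share the same complete_date, the query above picks one of them arbitrarily. One possible way to make the choice follow the precedence list, without creating a lookup table, is to add a CASE expression as a tie-breaker in the ORDER BY. The sketch below uses Python's built-in sqlite3 module (assuming SQLite 3.25 or later); the sample rows and the precedence order (event1 before event2) are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (account_number INTEGER, event TEXT, complete_date TEXT);
INSERT INTO mytable VALUES
  (1, 'event2', '2020-01-05'),
  (1, 'event1', '2020-01-05'),   -- same day as the row above
  (1, 'event1', '2020-01-01'),
  (2, 'event2', '2020-02-01'),
  (2, 'event3', '2020-03-01');   -- not in the list, so filtered out
""")

# Latest date wins; on a tie, the CASE expression ranks events by the
# assumed precedence (a lower number means higher precedence).
rows = conn.execute("""
SELECT account_number, event, complete_date
FROM (SELECT t.*,
             ROW_NUMBER() OVER (
               PARTITION BY account_number
               ORDER BY complete_date DESC,
                        CASE event WHEN 'event1' THEN 1
                                   WHEN 'event2' THEN 2 END
             ) AS seqnum
      FROM mytable t
      WHERE event IN ('event1', 'event2')
     ) t
WHERE seqnum = 1
ORDER BY account_number
""").fetchall()

print(rows)  # [(1, 'event1', '2020-01-05'), (2, 'event2', '2020-02-01')]
```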

The alias name RANK() function is not recognized in the where clause with DISTINCT columns

I have 2 tables with columns (customer, position, product ,sales_cycle, call_count , cntry_cd , owner_cd , cr8) and I am facing some challenges as mentioned below Kindly please help me to fix this
My Requirement
I have 2 tables test.table1 and test.table2
I need to insert values into "test.table2" by selecting from "test.table1", but I am getting duplicates while loading data into "test.table2".
I have totally 8 columns in both the table but while loading I need to take the highest rank of the column "call_count" with condition of unique values of these columns (customer, position, product ,sales_cycle)
Query what I tried
select
distinct (customer, position, product ,sales_cycle),
rank () over (order by call_count desc) rnk,
cntry_cd,
owner_cd,
cr8
from test.table1
where rnk=1
I am facing few challenges in the above query (The database I am using is RedShift)
1. I can't do distinct for only few columns
2. The alias name "rnk" is not recognized in the where clause
Kindly please help me to fix this , Thanks
You can't use a column alias on the same level where it's introduced. You need to wrap the query in a derived table. The distinct as shown is useless as well if you use rank()
select customer, position, product, sales_cycle,
cntry_cd, owner_cd, cr8
from (
select customer, position, product, sales_cycle,
cntry_cd, owner_cd, cr8,
rank () over (order by call_count desc) rnk
from test.table1
) t
where rnk=1;
The derived table adds no overhead to the processing time. In this case it is merely syntactic sugar to allow you to reference the column alias.
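Note that rank() without a partition by ranks the whole table, so rnk = 1 keeps only the rows with the single highest call_count overall. If the goal is the highest call_count per (customer, position, product, sales_cycle), as the question describes, a partition by on those columns is probably needed. The sketch below compares the two using Python's built-in sqlite3 module (assuming SQLite 3.25 or later) rather than Redshift; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (customer TEXT, position TEXT, product TEXT,
                     sales_cycle TEXT, call_count INTEGER,
                     cntry_cd TEXT, owner_cd TEXT, cr8 TEXT);
INSERT INTO table1 VALUES
  ('c1', 'p1', 'x', '2020Q1', 3, 'US', 'o1', 'a'),
  ('c1', 'p1', 'x', '2020Q1', 7, 'US', 'o2', 'b'),
  ('c2', 'p1', 'x', '2020Q1', 2, 'DE', 'o3', 'c');
""")

query = """
SELECT customer, call_count, owner_cd
FROM (
  SELECT customer, call_count, owner_cd,
         RANK() OVER ({window} ORDER BY call_count DESC) AS rnk
  FROM table1
) t
WHERE rnk = 1
ORDER BY customer
"""

# Rank across the whole table: only the global maximum survives.
global_top = conn.execute(query.format(window="")).fetchall()

# Rank within each key: one winner per (customer, position, product, sales_cycle).
per_key_top = conn.execute(query.format(
    window="PARTITION BY customer, position, product, sales_cycle"
)).fetchall()

print(global_top)   # [('c1', 7, 'o2')]
print(per_key_top)  # [('c1', 7, 'o2'), ('c2', 2, 'o3')]
```

Because rank() assigns the same rank to ties, two rows with an equal top call_count in one group would both be returned; row_number() would keep exactly one.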

SQL 'partition by order by' turns count() into rank()?

I am trying to figure out how to use partition by properly, and looking for a brief explanation to the following results. (I apologize for including the test data without proper SQL code.)
Example 1: Counts the IDs (e.g. shareholders) for each company and adds it to the original data frame (as "newvar").
select ID, company,
count(ID) over(partition by company) as newvar
from testdata;
Example 2: When I now add order by shares count() somehow seems to turn into rank(), so that the output is merely a ranking variable.
select ID, company,
count(ID) over(partition by company order by shares) as newvar
from testdata;
I thought order by just orders the data, but it seems to have an impact on "newvar".
Is there a simple explanation to this?
Many thanks in advance!
.csv file that contains testdata:
ID;company;shares
1;a;10
2;a;20
3;a;70
1;b;50
4;b;10
5;b;10
6;b;30
2;c;80
3;c;10
7;c;10
1;d;20
2;d;30
3;d;25
6;d;10
7;d;15
count() with an order by does a cumulative count. The result ends up looking like rank() or row_number(), depending on ties in the shares value and on which default window frame the database applies when none is specified (rows between or range between).
If you want to just order the data, then the order by should be after the from clause:
select ID, company,
count(ID) over(partition by company) as newvar
from testdata
order by shares;
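To see the effect of the window frame on the test data, here is a small sketch using Python's built-in sqlite3 module (assuming SQLite 3.25 or later, whose default frame with an order by is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). It looks only at company b, which has a tie on shares = 10. The column aliases are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testdata (ID INTEGER, company TEXT, shares INTEGER)")
conn.executemany("INSERT INTO testdata VALUES (?, ?, ?)", [
    (1, "a", 10), (2, "a", 20), (3, "a", 70),
    (1, "b", 50), (4, "b", 10), (5, "b", 10), (6, "b", 30),
    (2, "c", 80), (3, "c", 10), (7, "c", 10),
    (1, "d", 20), (2, "d", 30), (3, "d", 25), (6, "d", 10), (7, "d", 15),
])

rows = conn.execute("""
SELECT ID, shares,
       -- no ORDER BY: the frame is the whole partition
       COUNT(ID) OVER (PARTITION BY company) AS total,
       -- ORDER BY with the default RANGE frame: tied rows are counted together
       COUNT(ID) OVER (PARTITION BY company ORDER BY shares) AS running_range,
       -- explicit ROWS frame: counts one row at a time, like row_number()
       COUNT(ID) OVER (PARTITION BY company ORDER BY shares
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
         AS running_rows
FROM testdata
WHERE company = 'b'
ORDER BY shares, ID
""").fetchall()

for row in rows:
    print(row)
# total is 4 on every row; running_range is 2, 2, 3, 4.
# running_rows takes the values 1 to 4, but which of the two tied rows
# gets 1 and which gets 2 is not guaranteed.
```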

Returning data from a single Child record (sorted by date) with Parent data also

As a SQL noob, I have what I assume is a basic question about one-to-many child records.
I have an order table and an Order_Status child table.
Order table
ID Order_Number Status Order_Date etc.
Order_Status table
StatusTo StatusFrom Order_ID StatusChange_Date
The child table can have many entries for the status changing for a single parent order.
How do I pull back the following information as a single record with the child table's (os) most recent record for that parent (p)? (p.Order_Number, p.Status, p.Order_Date, os.StatusTo, os.StatusChange_Date).
I need to know because I am concerned the final os.statusto does not match the p.status.
Thanks in advance!
Steve
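You can join to a subquery that finds the most recent status change date per order, then join back to Order_Status to pick up the StatusTo of that row
e.g.
SELECT p.Order_Number, p.Status, p.Order_Date, os.StatusTo, os.StatusChange_Date
FROM ORDER p
LEFT JOIN (
-- latest change date per order; grouping by StatusTo as well would return one row per status
SELECT Order_ID, MAX(StatusChange_Date) as StatusChange_Date
FROM Order_Status
GROUP BY Order_ID
) latest ON latest.Order_ID = p.ID
LEFT JOIN Order_Status os ON os.Order_ID = latest.Order_ID
AND os.StatusChange_Date = latest.StatusChange_Date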
I believe this should work. Assuming that you only care about those orders that have changes, and where the change is different than what is recorded (should be trivial to modify).
WITH Most_Recent_Change (order_id, statusTo, changedAt, rownum) as
(SELECT order_id, statusTo, statusChange_date,
ROW_NUMBER() OVER(PARTITION BY order_id
ORDER BY statusChange_date DESC)
FROM Order_Status)
SELECT Order.order_number, Order.status, Order.order_date,
Most_Recent_Change.statusTo, Most_Recent_Change.changedAt
FROM Order
JOIN Most_Recent_Change
ON Most_Recent_Change.order_id = Order.id
AND Most_Recent_Change.rownum = 1
AND Most_Recent_Change.statusTo <> Order.status
(would have an SQLFiddle example, but it's acting weird at the moment)
Please note you should be careful of the commit level you run this at, as otherwise you may get false positives from rows being concurrently updated.
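In place of the SQLFiddle, here is a small self-contained sketch of the same CTE approach using Python's built-in sqlite3 module (assuming SQLite 3.25 or later). The table is named orders rather than Order to avoid the reserved word, the column names are snake_case stand-ins for the ones in the question, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, order_number TEXT,
                     status TEXT, order_date TEXT);
CREATE TABLE order_status (order_id INTEGER, status_from TEXT,
                           status_to TEXT, status_change_date TEXT);
INSERT INTO orders VALUES
  (1, 'A-100', 'SHIPPED', '2021-01-01'),
  (2, 'A-101', 'NEW',     '2021-01-02');
INSERT INTO order_status VALUES
  (1, 'NEW',  'PAID',      '2021-01-02'),
  (1, 'PAID', 'SHIPPED',   '2021-01-03'),
  (2, 'NEW',  'PAID',      '2021-01-03'),
  (2, 'PAID', 'CANCELLED', '2021-01-04');
""")

# Number each order's status changes newest first, keep the newest one,
# and report only orders whose recorded status disagrees with it.
mismatches = conn.execute("""
WITH most_recent_change AS (
  SELECT order_id, status_to, status_change_date,
         ROW_NUMBER() OVER (PARTITION BY order_id
                            ORDER BY status_change_date DESC) AS rn
  FROM order_status
)
SELECT o.order_number, o.status, m.status_to, m.status_change_date
FROM orders AS o
JOIN most_recent_change AS m
  ON m.order_id = o.id AND m.rn = 1
WHERE m.status_to <> o.status
""").fetchall()

print(mismatches)  # [('A-101', 'NEW', 'CANCELLED', '2021-01-04')]
```

Order A-100 is not reported because its latest change (to SHIPPED) matches the parent row.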
Other notes:
Don't use reserved words (like ORDER) for identifiers. It's just a hassle in general
Don't suffix columns with their datatypes, especially if those types may change in the future. I'm aware that order_date isn't being strictly named in this fashion, but it's dangerously close. It should probably be something like orderedOn (if of a strict 'solar day' type) or the better orderedAt (timestamp, in UTC or with timezone).