In BigQuery, I can successfully run the following query using standard SQL:
SELECT
COUNT(*) AS totalCount,
city,
DATE_TRUNC(timeInterval.intervalStart, YEAR) AS start
FROM
sandbox.CountByCity
GROUP BY
city, start
But it fails when I nest the start value in a STRUCT, like this...
SELECT
COUNT(*) AS totalCount,
city,
STRUCT(
DATE_TRUNC(timeInterval.intervalStart, YEAR) AS start
) as timeSpan
FROM
sandbox.CountByCity
GROUP BY
city, timeSpan.start
In this case, I get the following error message:
Cannot GROUP BY field references from SELECT list alias timeSpan at [10:11]
What is the correct way to write the query so that the start value is nested within a STRUCT?
You can do this using ANY_VALUE. The struct value that you get is well-defined, since the value is the same for the entire group:
SELECT
COUNT(*) AS totalCount,
city,
ANY_VALUE(STRUCT(
DATE_TRUNC(timeInterval.intervalStart, YEAR) AS start
)) as timeSpan
FROM
sandbox.CountByCity
GROUP BY
city, DATE_TRUNC(timeInterval.intervalStart, YEAR);
Here is an example using some sample data:
WITH `sandbox.CountByCity` AS (
SELECT 'Seattle' AS city, STRUCT(DATE '2017-12-11' AS intervalStart) AS timeInterval UNION ALL
SELECT 'Seattle', STRUCT(DATE '2016-11-10' AS intervalStart) UNION ALL
SELECT 'Seattle', STRUCT(DATE '2017-03-24' AS intervalStart) UNION ALL
SELECT 'Kirkland', STRUCT(DATE '2017-02-01' AS intervalStart)
)
SELECT
COUNT(*) AS totalCount,
city,
ANY_VALUE(STRUCT(
DATE_TRUNC(timeInterval.intervalStart, YEAR) AS start
)) as timeSpan
FROM
`sandbox.CountByCity`
GROUP BY
city, DATE_TRUNC(timeInterval.intervalStart, YEAR);
You could also consider submitting a feature request to enable GROUP BY with STRUCT types.
Not sure why exactly you would want this - but I believe you have a reason - so try the query below (at least formally it does what you ask):
#standardSQL
SELECT
totalCount,
city,
STRUCT(start) timeSpan
FROM (
SELECT
COUNT(*) AS totalCount,
city,
DATE_TRUNC(timeInterval.intervalStart, YEAR) AS start
FROM `sandbox.CountByCity`
GROUP BY city, start
)
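If you want to sanity-check it, here is the same rewrite wired to the sample rows used in the previous answer:
#standardSQL
WITH `sandbox.CountByCity` AS (
SELECT 'Seattle' AS city, STRUCT(DATE '2017-12-11' AS intervalStart) AS timeInterval UNION ALL
SELECT 'Seattle', STRUCT(DATE '2016-11-10' AS intervalStart) UNION ALL
SELECT 'Seattle', STRUCT(DATE '2017-03-24' AS intervalStart) UNION ALL
SELECT 'Kirkland', STRUCT(DATE '2017-02-01' AS intervalStart)
)
SELECT
totalCount,
city,
STRUCT(start) timeSpan
FROM (
SELECT
COUNT(*) AS totalCount,
city,
DATE_TRUNC(timeInterval.intervalStart, YEAR) AS start
FROM `sandbox.CountByCity`
GROUP BY city, start
)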
Related
I have snapshots backed up everyday and named in this format: TableName_20221218
I want to extract the date from the name of the snapshots to create a date column
Currently, I am manually adding date columns this way, but it is inconvenient because I have to update the code every day:
select id, date('2022-11-17') as date, rfm_r, rfm_f, rfm_m, recency_score, phone, recency_score_detail, gender, age_group, city from `backup.t1_customer_dependent_20221117`
union all
select id, date('2022-11-18') as date, rfm_r, rfm_f, rfm_m, recency_score, phone, recency_score_detail, gender, age_group, city from `backup.t1_customer_dependent_20221118`
union all
select id, date('2022-11-19') as date, rfm_r, rfm_f, rfm_m, recency_score, phone, recency_score_detail, gender, age_group, city from `backup.t1_customer_dependent_20221119`
union all
select id, date('2022-11-20') as date, rfm_r, rfm_f, rfm_m, recency_score, phone, recency_score_detail, gender, age_group, city from `backup.t1_customer_dependent_20221120`
Instead, I want to automatically take the date from the name of the snapshot to create the date column, turning the code into something like this:
select id, date, rfm_r, rfm_f, rfm_m, recency_score, phone, recency_score_detail, gender, age_group, city from `backup.t1_customer_dependent*`
Does anyone know how to do this? I'm new to BigQuery, so any help will be greatly appreciated. Thanks!
Consider using _TABLE_SUFFIX for wildcard tables.
SELECT id, PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS date, rfm_r ...
FROM `backup.t1_customer_dependent_*`
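A fuller sketch, assuming the same column list as in the question:
SELECT
id,
PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS date,
rfm_r, rfm_f, rfm_m, recency_score, phone,
recency_score_detail, gender, age_group, city
FROM `backup.t1_customer_dependent_*`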
I am new to the SQL world and working with the query below; the table contains 3,000,000+ records. Can you please suggest how to reduce the query run time, or another query that produces the same result?
I tried two queries:
#1
SELECT *
FROM (SELECT ID,
Priority,
Agent_Name,
Urgency,
Status,
Agent_Group_Name,
Country,
Region,
Due_by,
Type,
Created_Date,
Resolved_Date,
Closed_Date,
Resolution_Status,
Requester_Location,
WH_Region,
ExecDate,
Date,
Full_Date,
Datatype,
Department_Name,
Requester_Emails,
ROW_NUMBER()
OVER (
PARTITION BY ID
ORDER BY Execdate DESC ) nn
FROM weekly_tickets
WHERE Created_date >= '2022-01-01 12:00:00 AM') sub_table
WHERE sub_table.nn = 1
#2
WITH cte
AS (SELECT ID,
Priority,
Agent_Name,
Urgency,
Status,
Category,
Item_Category,
Agent_Group_Name,
What_is_the_Impact_,
Country,
Impact,
Region,
Resolution_Time_in_Bhrs,
Sub_Category,
Due_by,
Type,
Issue_Owner,
Created_Date,
Number_of_Users,
Approval_Status,
Resolved_Date,
Closed_Date,
How_is_the_issue_affecting_the_service_,
Number_of_Users_staffed,
Resolution_Status,
Sites,
Requester_Location,
Number_of_Users_affected,
WH_Region,
CampaignOriginId,
ExecDate,
Date,
Full_Date,
AgeEvol,
Datatype,
Department_Name,
Requester_Emails,
ROW_NUMBER()
OVER (
PARTITION BY ID
ORDER BY Execdate DESC ) nn
FROM weekly_tickets
WHERE Created_date >= '2022-01-01 12:00:00 AM')
SELECT *
FROM cte
WHERE cte.nn = 1
Always read the execution plan, or show it to someone else if it's too complex for you. This query's plan seems fairly easy to predict with a high degree of certainty. I assume the following steps:
Scan the table and filter by Created_date, returning all columns required by later steps (here, all columns used by the SELECT clause)
Order by ID, ExecDate DESC
Segment
[..]
The filter comes from the WHERE clause; the ordering and segmenting come from the OVER clause. I also assume that the date filter significantly reduces the number of rows returned. Since there are fewer rows to process, the query should perform better.
All that means you should start with the following index:
CREATE INDEX IX_Predicted ON weekly_tickets(Created_Date) INCLUDE (ID,
Priority,
Agent_Name,
Urgency,
Status,
Category,
Item_Category,
Agent_Group_Name,
What_is_the_Impact_,
Country,
Impact,
Region,
Resolution_Time_in_Bhrs,
Sub_Category,
Due_by,
Type,
Issue_Owner,
Number_of_Users,
Approval_Status,
Resolved_Date,
Closed_Date,
How_is_the_issue_affecting_the_service_,
Number_of_Users_staffed,
Resolution_Status,
Sites,
Requester_Location,
Number_of_Users_affected,
WH_Region,
CampaignOriginId,
ExecDate,
Date,
Full_Date,
AgeEvol,
Datatype,
Department_Name,
Requester_Emails);
You could create an index to help with performance:
CREATE NONCLUSTERED INDEX index_POC_ticket
ON dbo.weekly_tickets (ID, Execdate DESC)
WITH (DROP_EXISTING = ON);
Edit: added DESC to the Execdate column in the index, since I believe it should support the ordering in your window function.
If ID is the primary key, I don't believe you need to include it in the above index (maybe someone more knowledgeable can comment here).
I should think that your query can be rewritten as:
SELECT ID,
Priority,
Agent_Name,
Urgency,
STATUS,
Agent_Group_Name,
Country,
Region,
Due_by,
Type,
Created_Date,
Resolved_Date,
Closed_Date,
Resolution_Status,
Requester_Location,
WH_Region,
ExecDate,
Date,
Full_Date,
Datatype,
Department_Name,
Requester_Emails
FROM weekly_tickets AS wt
WHERE Created_date >= '2022-01-01 12:00:00 AM'
AND NOT EXISTS(SELECT *
FROM weekly_tickets AS t
WHERE wt.ID = t.ID
AND wt.Execdate < t.Execdate);
And it can be much more efficient with the following index:
CREATE INDEX X ON weekly_tickets (ID, Execdate);
No window function, just NOT EXISTS... Test it!
I'm trying to use listagg to group categories by date, but the field is date-time. Categories are appearing on separate lines. Is it possible to group by date only? I've tried CAST as well as DATE in the group by, but it's still not working. Here's the base query:
select ACCOUNT,
ID,
NAME,
TERM,
listagg(CATEGORY, ', ') within group (order by CATEGORY) as cat_by_date,
trunc(TRANSACTION_DATE) short_date
from TABLE
where term= '2022'
and CATEGORY in ('T', 'H', 'P')
group by
ACCOUNT_UID,
ID,
NAME,
TERM,
TRANSACTION_DATE
order by 1
Use TRUNC in the GROUP BY as well:
...
group by
ACCOUNT_UID,
ID,
NAME,
TERM,
TRUNC(TRANSACTION_DATE)
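Putting it together, the whole query would look something like this (keeping the table and column names from the question, and assuming ACCOUNT and ACCOUNT_UID refer to the same column):
select ACCOUNT,
ID,
NAME,
TERM,
listagg(CATEGORY, ', ') within group (order by CATEGORY) as cat_by_date,
trunc(TRANSACTION_DATE) short_date
from TABLE
where term = '2022'
and CATEGORY in ('T', 'H', 'P')
group by
ACCOUNT,
ID,
NAME,
TERM,
TRUNC(TRANSACTION_DATE)
order by 1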
How can I count the new users for each category who bought in the category for the first time, by year? For instance, for 2015-2020 by year: if someone bought in 2015 for the first time, they will be counted as a new user in 2015 but not in 2016-2020.
Table_1 (Columns: product_name, date, category, sales, user_id)
I want to get the result as below.
You'll want to start with a subquery to get the first date each user purchased in the category. This is a pretty straightforward GROUP BY problem:
select
user_id,
category,
min(date) as first_category_purchase
from my_table
group by user_id, category;
Next, you can use Postgres's date_trunc function to group by year and category, using your first query as a subquery:
select
category,
date_trunc('year', first_category_purchase),
count(*)
from (
select
user_id,
category,
min(date) as first_category_purchase
from my_table
group by user_id, category
) a
group by 1, 2;
In Postgres, one method is group by after a distinct on:
select date, count(*) as num_new_users
from (select distinct on (user_id, category) t.*
from t
order by user_id, category, date asc
) d
group by date
order by date;
If date is really a date and not a year, then you need something like to_char() or date_trunc() to convert it to a year.
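For instance, a minimal sketch of the by-year variant (assuming date is a DATE column, and keeping category in the output since the question asks for counts per category):
select category,
date_trunc('year', date) as year,
count(*) as num_new_users
from (select distinct on (user_id, category) t.*
from t
order by user_id, category, date asc
) d
group by 1, 2
order by 1, 2;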
I am facing a simple problem with an SQL query that I do not know how to tackle.
I have a table with the following structure
CITY COUNTRY DATES TEMPERATURE
Note that for a given country, I can have several cities. And, for a given city, I have several rows giving me the TEMPERATURE at each available DATE. This is just a time series.
I would like to write a query which gives me, for every city, the DATE where the TEMPERATURE is the MIN and the DATE where the TEMPERATURE is the MAX. The query should return something like this:
CITY COUNTRY DATE_MIN_TEMPERATURE MIN_TEMPERATURE DATE_MAX_TEMPERATURE MAX_TEMPERATURE
Any idea on how to achieve this?
Best regards,
Deny
Oracle provides keep/dense_rank first for this purpose:
select city,
min(temperature) as min_temperature,
max(date) keep (dense_rank first order by temperature asc) as min_temperature_date,
max(temperature) as max_temperature,
max(date) keep (dense_rank first order by temperature desc) as max_temperature_date
from t
group by city;
Note that this returns only one date if there are ties. If you want to handle that, more logic is needed:
select city, min(temperature) as min_temperature,
listagg(case when seqnum_min = 1 then date end, ',') within group (order by date) as mindates,
max(temperature) as max_temperature,
listagg(case when seqnum_max = 1 then date end, ',') within group (order by date) as maxdates
from (select t.*,
rank() over (partition by city order by temperature) as seqnum_min,
rank() over (partition by city order by temperature desc) as seqnum_max
from t
) t
where seqnum_min = 1 or seqnum_max = 1
group by city;
In Oracle 11 and above, you can use PIVOT. In the solution below I use LISTAGG to show all the dates in case of ties. Another option is, in the case of ties, to show the most recent date when the extreme temperature was reached; if that is preferred, simply replace LISTAGG(dt, ....) (including the WITHIN GROUP clause) with MAX(dt). However, in that case the first solution offered by Gordon (using the first function) is more efficient anyway - no need for pivoting.
Note that I changed "date" to "dt" - DATE is a reserved word in Oracle. I also show the rows by country first, then city (the more logical ordering). I created test data in a WITH clause, but the solution is everything below the comment line.
with
inputs ( city, country, dt, temperature ) as (
select 'Palermo', 'Italy' , date '2014-02-13', 3 from dual union all
select 'Palermo', 'Italy' , date '2002-01-23', 3 from dual union all
select 'Palermo', 'Italy' , date '1998-07-22', 42 from dual union all
select 'Palermo', 'Italy' , date '1993-08-24', 30 from dual union all
select 'Maseru' , 'Lesotho', date '1994-01-11', 34 from dual union all
select 'Maseru' , 'Lesotho', date '2004-08-13', 12 from dual
)
-- >> end test data; solution (SQL query) begins with the next line
select country, city,
"'min'_DT" as date_min_temp, "'min'_TEMP" as min_temp,
"'max'_DT" as date_max_temp, "'max'_TEMP" as max_temp
from (
select city, country, dt, temperature,
case when temperature = min(temperature)
over (partition by city, country) then 'min'
when temperature = max(temperature)
over (partition by city, country) then 'max'
end as flag
from inputs
)
pivot ( listagg(to_char(dt, 'dd-MON-yyyy'), ', ')
within group (order by dt) as dt, min(temperature) as temp
for flag in ('min', 'max'))
order by country, city -- ORDER BY is optional
;
COUNTRY CITY DATE_MIN_TEMP MIN_TEMP DATE_MAX_TEMP MAX_TEMP
------- ------- ------------------------ ---------- -------------- ----------
Italy Palermo 23-JAN-2002, 13-FEB-2014 3 22-JUL-1998 42
Lesotho Maseru 13-AUG-2004 12 11-JAN-1994 34
2 rows selected.
Instead of the keep/dense_rank first function you can also use FIRST_VALUE and LAST_VALUE. Note that LAST_VALUE needs an explicit window frame: the default frame ends at the current row, which would return the current row's date rather than the date of the maximum temperature:
select distinct city,
MIN(temperature) OVER (PARTITION BY city) as min_temperature,
FIRST_VALUE(date) OVER (PARTITION BY city ORDER BY temperature) AS min_temperature_date,
MAX(temperature) OVER (PARTITION BY city) as max_temperature,
LAST_VALUE(date) OVER (PARTITION BY city ORDER BY temperature
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max_temperature_date
FROM t;