Why is the ORDER BY in BigQuery not working? - sql

I am trying to use ORDER BY in BigQuery to sort my query results by the week number of the year, but the ordering doesn't seem to take effect, and no syntax error is reported either.
SELECT *
FROM (
  SELECT
    concat(cast(EXTRACT(week FROM elt.event_datetime) AS string), ', ', EXTRACT(year FROM elt.event_datetime)) WEEK,
    elt.msg_source SOURCE,
    (elt.source_timedelta_s_ + elt.pipeline_timedelta_s_) Latency
  FROM <table> elt
  JOIN <table1> ai ON elt.msg_id = ai.msg_id
  WHERE ai.report_type <> 'PFR'
    AND EXTRACT(date FROM elt.event_datetime) > EXTRACT(date FROM (date_sub(current_timestamp(), INTERVAL 30 day)))
  ORDER BY WEEK DESC
) PIVOT (AVG(Latency) FOR SOURCE IN ('FLYHT', 'SMTP')) t
Basically, I want my results ordered as they are numbered in green in the image below.
Can someone point out what the issue is?

SELECT *
FROM (
  SELECT
    concat(cast(EXTRACT(week FROM elt.event_datetime) AS string), ', ', EXTRACT(year FROM elt.event_datetime)) WEEK,
    elt.msg_source SOURCE,
    (elt.source_timedelta_s_ + elt.pipeline_timedelta_s_) Latency
  FROM <table> elt
  JOIN <table1> ai ON elt.msg_id = ai.msg_id
  WHERE ai.report_type <> 'PFR'
    AND EXTRACT(date FROM elt.event_datetime) > EXTRACT(date FROM (date_sub(current_timestamp(), INTERVAL 30 day)))
)
PIVOT (AVG(Latency) FOR SOURCE IN ('FLYHT', 'SMTP')) t
ORDER BY (SELECT RIGHT(t.WEEK, 4)) DESC, (SELECT regexp_substr(t.WEEK, '[^,]+')) DESC
as suggested by @Shipra Sarkar in the comments. This works because an ORDER BY inside a subquery does not determine the order of the final result; the sort has to sit on the outermost query, after the PIVOT. And since WEEK is a string like '44, 2023', the fix sorts on the year part and the week-number part separately rather than on the raw string.
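To illustrate the underlying rule with a minimal, self-contained example (dummy data, not the poster's tables): the ORDER BY only carries through when it is attached to the outermost query, after the PIVOT.
SELECT *
FROM (
  SELECT week, source, latency
  FROM UNNEST([
    STRUCT(44 AS week, 'FLYHT' AS source, 10.0 AS latency),
    STRUCT(44, 'SMTP', 30.0),
    STRUCT(45, 'FLYHT', 20.0),
    STRUCT(45, 'SMTP', 25.0)
  ])
)
PIVOT (AVG(latency) FOR source IN ('FLYHT', 'SMTP'))
ORDER BY week DESC;  -- the sort is applied to the pivoted result, not inside the subquery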

Related

SQL Optimization: multiplication of two calculated fields generated by window functions

Given two time-series tables tbl1(time, b_value) and tbl2(time, u_value).
https://www.db-fiddle.com/f/4qkFJZLkZ3BK2tgN4ycCsj/1
Suppose we want to find, for each day, the last value of u_value on that day, the cumulative sum of b_value up to and including that day, and their product, i.e. daily_u_value * b_value_cum_sum.
The following query calculates the desired output:
WITH cte AS (
SELECT
t1.time,
t1.b_value,
t2.u_value * t1.b_value AS bu_value,
last_value(t2.u_value)
OVER
(PARTITION BY DATE_TRUNC('DAY', t1.time) ORDER BY DATE_TRUNC('DAY', t2.time) ) AS daily_u_value
FROM stackoverflow.tbl1 t1
LEFT JOIN stackoverflow.tbl2 t2
ON
t1.time = t2.time
)
SELECT
DATE_TRUNC('DAY', c.time) AS time,
AVG(c.daily_u_value) AS daily_u_value,
SUM( SUM(c.b_value)) OVER (ORDER BY DATE_TRUNC('DAY', c.time) ) as b_value_cum_sum,
AVG(c.daily_u_value) * SUM( SUM(c.b_value) ) OVER (ORDER BY DATE_TRUNC('DAY', c.time) ) as daily_u_value_mul_b_value
FROM cte c
GROUP BY 1
ORDER BY 1 DESC
I was wondering what I can do to optimize this query? Is there any alternative solution that generates the same result?
db fiddle demo
Your query: Execution Time: 250.666 ms; my query: Execution Time: 205.103 ms.
So there is some progress. The main saving comes from cutting down the casts, since your query casts from timestamptz to timestamp many times; why not just add another date column?
I executed my query first and then yours, which makes the comparison fair, since a second execution is generally faster than the first.
alter table tbl1 add column t1_date date;
alter table tbl2 add column t2_date date;
update tbl1 set t1_date = time::date;
update tbl2 set t2_date = time::date;
WITH cte AS (
SELECT
t1.t1_date,
t1.b_value,
t2.u_value * t1.b_value AS bu_value,
last_value(t2.u_value)
OVER
(PARTITION BY t1_date ORDER BY t2_date ) AS daily_u_value
FROM stackoverflow.tbl1 t1
LEFT JOIN stackoverflow.tbl2 t2
ON
t1.time = t2.time
)
SELECT
t1_date,
AVG(c.daily_u_value) AS daily_u_value,
SUM( SUM(c.b_value)) OVER (ORDER BY t1_date ) as b_value_cum_sum,
AVG(c.daily_u_value) * SUM( SUM(c.b_value) ) OVER
(ORDER BY t1_date ) as daily_u_value_mul_b_value
FROM cte c
GROUP BY 1
ORDER BY 1 DESC
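One follow-up on the helper columns: the one-off UPDATE statements only backfill rows that already exist. If new rows keep arriving, a trigger along the lines below should keep t1_date in sync. This is just a sketch, assuming PostgreSQL 11+, with made-up function/trigger names; tbl2 would need the same treatment for t2_date.
CREATE FUNCTION stackoverflow.sync_t1_date() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  -- same expression as the backfill UPDATE above
  NEW.t1_date := NEW.time::date;
  RETURN NEW;
END;
$$;

CREATE TRIGGER tbl1_sync_date
BEFORE INSERT OR UPDATE OF time ON stackoverflow.tbl1
FOR EACH ROW EXECUTE FUNCTION stackoverflow.sync_t1_date();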

How to calculate weekly retention in BigQuery using SQL

I have the following table with the week number and the retention rate.
|creation_week |num_engaged_users |num_users_in_cohort |retention_rate|
|:------------:|:-----------------:|:------------------:|:------------:|
|37| 3114 |4604 |67.637|
|38| 1860 |4604 |40.4|
|39| 1233 |4604 |26.781|
|40| 668 |4604 |14.509|
|41| 450 |4604 |9.774|
|42| 463| 4604|10.056|
What I need is to make it look something like this
|week |week0 |week1 |week2|week3|week4|week5|week6|
|:---:|:----:|:----:|:---:|:---:|:---:|:---:|:---:|
|week37|100|ret.rate|ret.rate|ret.rate|ret.rate|ret.rate|ret.rate|
|week38|100|ret.rate|ret.rate|ret.rate|ret.rate|ret.rate|
|week39|100|ret.rate|ret.rate|ret.rate|ret.rate|
|week40|100|ret.rate|ret.rate|ret.rate|
|week41|100|ret.rate|ret.rate|
|week42|100|ret.rate|
how can I do that using BigQuery SQL?
For some reason Stack Overflow doesn't allow posting this question unless all the tables are marked as code...
I will post the SQL code I used as the first answer, because it doesn't let me include it here either.
WITH
new_user_cohort AS (
WITH
#table with cookie and user_ids for the later matching
table_1 AS (
SELECT
DISTINCT props.value.string_value AS cookie_id,
user_id
FROM
`stockduel.analytics.events`,
UNNEST(event_properties) AS props
WHERE
props.key = 'cookie_id'
AND user_id>0),
#second table which gives access to the sample with the users who performed the event
table_2 AS (
SELECT
DISTINCT props.value.string_value AS cookie_id,
EXTRACT(WEEK
FROM
creation_date) AS first_week
FROM
`stockduel.analytics.events`,
UNNEST(event_properties) AS props
WHERE
props.key = 'cookie_id'
AND event_type = 'launch_first_time'
#set the date from when starting cohort analysis
AND EXTRACT(WEEK
FROM
creation_date) = EXTRACT(WEEK
FROM
DATE '2021-09-15'))
#join user id with cookie_id and group the elements to remove the duplicates
SELECT
user_id,
first_week
FROM
table_2
JOIN
table_1
ON
table_1.cookie_id = table_2.cookie_id
#group the results to avoid duplicates
GROUP BY
user_id,
first_week ),
num_new_users AS (
SELECT
COUNT(*) AS num_users_in_cohort,
first_week
FROM
new_user_cohort
GROUP BY
first_week ),
engaged_users_by_day AS (
SELECT
COUNT(DISTINCT `stockduel.analytics.ws_raw_sessions_v2`.user_id) AS num_engaged_users,
EXTRACT(WEEK
FROM
started_at) AS creation_week,
FROM
`stockduel.analytics.ws_raw_sessions_v2`
JOIN
new_user_cohort
ON
new_user_cohort.user_id = `stockduel.analytics.ws_raw_sessions_v2`.user_id
WHERE
EXTRACT(WEEK
FROM
started_at) BETWEEN EXTRACT(WEEK
FROM
DATE '2021-09-15')
AND EXTRACT(WEEK
FROM
DATE '2021-10-22')
GROUP BY
creation_week )
SELECT
creation_week,
num_engaged_users,
num_users_in_cohort,
ROUND((100*(num_engaged_users / num_users_in_cohort)), 3) AS retention_rate
FROM
engaged_users_by_day
CROSS JOIN
num_new_users
ORDER BY
creation_week
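As for the triangle layout asked about above: once there is one row per cohort week and per week offset (creation_week, weeks_since_signup, retention_rate), the reshaping itself is just conditional aggregation. A hedged sketch follows; the table name weekly_retention and the column weeks_since_signup are hypothetical, and the query above would need to be extended to produce them.
SELECT
  CONCAT('week', CAST(creation_week AS STRING)) AS week,
  MAX(IF(weeks_since_signup = 0, retention_rate, NULL)) AS week0,  -- 100 by construction
  MAX(IF(weeks_since_signup = 1, retention_rate, NULL)) AS week1,
  MAX(IF(weeks_since_signup = 2, retention_rate, NULL)) AS week2,
  MAX(IF(weeks_since_signup = 3, retention_rate, NULL)) AS week3,
  MAX(IF(weeks_since_signup = 4, retention_rate, NULL)) AS week4,
  MAX(IF(weeks_since_signup = 5, retention_rate, NULL)) AS week5,
  MAX(IF(weeks_since_signup = 6, retention_rate, NULL)) AS week6
FROM weekly_retention
GROUP BY creation_week
ORDER BY creation_week;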

Attempting to calculate absolute change and % change in 1 query

I'm having trouble with the SELECT portion of this query. I can calculate the absolute change just fine, but when I want to also find out the percent change I get lost in all the subqueries. Using BigQuery. Thank you!
SELECT
station_name,
ridership_2013,
ridership_2014,
absolute_change_2014 / ridership_2013 * 100 AS percent_change,
(ridership_2014 - ridership_2013) AS absolute_change_2014,
It will probably be beneficial to organize your query with CTEs and descriptive aliases to make things a bit easier. For example...
with
data as (select * from project.dataset.table),
ridership_by_year as (
select
extract(year from ride_date) as yr,
count(*) as rides
from data
group by 1
),
ridership_by_year_and_station as (
select
extract(year from ride_date) as yr,
station_name,
count(*) as rides
from data
group by 1,2
),
yearly_changes as (
select
this_year.yr,
this_year.rides,
prev_year.rides as prev_year_rides,
this_year.rides - coalesce(prev_year.rides,0) as absolute_change_in_rides,
safe_divide( this_year.rides - coalesce(prev_year.rides), prev_year.rides) as relative_change_in_rides
from ridership_by_year this_year
left join ridership_by_year prev_year on this_year.yr = prev_year.yr + 1
),
yearly_station_changes as (
select
this_year.yr,
this_year.station_name,
this_year.rides,
prev_year.rides as prev_year_rides,
this_year.rides - coalesce(prev_year.rides,0) as absolute_change_in_rides,
safe_divide( this_year.rides - coalesce(prev_year.rides), prev_year.rides) as relative_change_in_rides
from ridership_by_year_and_station this_year
left join ridership_by_year_and_station prev_year
  on this_year.yr = prev_year.yr + 1
  and this_year.station_name = prev_year.station_name
)
select * from yearly_changes
--select * from yearly_station_changes
Yes this is a bit longer, but IMO it is much easier to understand.
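One more note on why the original SELECT fails: in BigQuery a select-list alias generally can't be referenced by another expression in the same SELECT list, so percent_change can't use absolute_change_2014 directly. The straightforward fix is to repeat the expression (or compute it in a CTE first). A sketch, with a placeholder table name:
SELECT
  station_name,
  ridership_2013,
  ridership_2014,
  ridership_2014 - ridership_2013 AS absolute_change_2014,
  safe_divide(ridership_2014 - ridership_2013, ridership_2013) * 100 AS percent_change
FROM project.dataset.table;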

Not able to run a simple beam pipeline

I have a simple SQL query with some aggregations. There is no problem with the query itself, but when I look at its execution plan I can't tell where the aggregations in the plan come from in the query:
Table:
Query (it contains string operations, GROUP BY, ORDER BY and a join; the purpose is to find the reporting periods where the total amount increased by more than a certain threshold over the years):
WITH cte
AS (SELECT Year(orderdate) AS yr,
Month(orderdate) AS mon,
Ltrim(Rtrim(Str(Year(orderdate)))) + '-'
+ Ltrim(Rtrim(Str(Month(orderdate)))) AS theMonth,
Sum(totalamount) AS theAmount
FROM [order]
GROUP BY Year(orderdate),
Month(orderdate),
Ltrim(Rtrim(Str(Year(orderdate)))) + '-'
+ Ltrim(Rtrim(Str(Month(orderdate)))))
SELECT TOP 3 cte.themonth,
cte_prev.themonth AS thePrevMonth,
cte.theamount,
cte_prev.theamount AS thePrevAmount,
( cte.theamount - cte_prev.theamount ) AS diff
FROM cte
JOIN cte cte_prev
ON cte.yr = cte_prev.yr + 1
AND cte.mon = cte_prev.mon
WHERE ( cte.theamount - cte_prev.theamount ) / cte_prev.theamount > 0.8
ORDER BY ( cte.theamount - cte_prev.theamount ) / cte_prev.theamount DESC
Execution plan:
I wonder how I can create a better/simpler query to calculate the difference between two reporting periods. Also, the string trimming is really annoying here: why is there no simple single TRIM, so that I have to use LTRIM and RTRIM?
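On the trimming: STR() returns a fixed-width, space-padded string (10 characters by default), which is why trimming is needed at all; CONCAT() converts numbers to strings without any padding, and SQL Server 2017+ also has a single TRIM(). For the year-over-year difference itself, LAG() avoids the self-join. A sketch along those lines, using the same table and columns but untested against the actual data:
WITH monthly AS (
  SELECT YEAR(orderdate)                                AS yr,
         MONTH(orderdate)                               AS mon,
         CONCAT(YEAR(orderdate), '-', MONTH(orderdate)) AS theMonth,
         SUM(totalamount)                               AS theAmount
  FROM [order]
  GROUP BY YEAR(orderdate), MONTH(orderdate),
           CONCAT(YEAR(orderdate), '-', MONTH(orderdate))
),
with_prev AS (
  SELECT yr, mon, theMonth, theAmount,
         LAG(theMonth)  OVER (PARTITION BY mon ORDER BY yr) AS thePrevMonth,
         LAG(theAmount) OVER (PARTITION BY mon ORDER BY yr) AS thePrevAmount
  FROM monthly
)
SELECT TOP 3 theMonth, thePrevMonth, theAmount, thePrevAmount,
       theAmount - thePrevAmount AS diff
FROM with_prev
WHERE (theAmount - thePrevAmount) / thePrevAmount > 0.8
ORDER BY (theAmount - thePrevAmount) / thePrevAmount DESC;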

Filling in missing dates DB2 SQL

My initial query looks like this:
select process_date, count(*) batchCount
from T1.log_comments
group by process_date
order by process_date asc;
I need to be able to do some quick analysis for weekends that are missing, but wanted to know if there was a quick way to fill in the missing dates not present in process_date.
I've seen the solution here but am curious if there's any magic hidden in db2 that could do this with only a minor modification to my original query.
Note: not tested; I framed it based on my exposure to SQL Server/Oracle, but I guess it gives you the idea:
*now amended and tested on DB2*
WITH MaxDateQry(MaxDate) AS
(
SELECT MAX(process_date) FROM T1.log_comments
),
MinDateQry(MinDate) AS
(
SELECT MIN(process_date) FROM T1.log_comments
),
DatesData(ProcessDate) AS
(
SELECT MinDate from MinDateQry
UNION ALL
SELECT (ProcessDate + 1 DAY) FROM DatesData WHERE ProcessDate < (SELECT MaxDate FROM MaxDateQry)
)
SELECT a.ProcessDate, b.batchCount
FROM DatesData a LEFT JOIN
(
SELECT process_date, COUNT(*) batchCount
FROM T1.log_comments
GROUP BY process_date
) b
ON a.ProcessDate = b.process_date
ORDER BY a.ProcessDate ASC;
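One small follow-up: dates that have no rows in log_comments come back with a NULL batchCount after the LEFT JOIN; if zeros are preferable, select COALESCE(b.batchCount, 0) AS batchCount in the outer query instead.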