How to get the difference between (multiple) two different rows? - google-bigquery

I have a set of data containing some fields: month, customer_id, row_num (RANK), and verified_date.
The rank field indicates the first (1) and second (2) purchase of each customer. I would like to know the time difference between the first and second purchase for each customer, shown only against the month of the first purchase (the month where row_num = 1).
https://i.ibb.co/PjJk5Y0/Capture.png
So my expected result is like below image:
https://i.ibb.co/y5Mww7k/Capture-2.png
I'm using StandardSQL in Google Bigquery.
row_num, verified_date
from table
GROUP BY 1, 2

We can try using a pivot query here, aggregating by the customer_id:
SELECT
MAX(CASE WHEN row_num = 1 THEN month END) AS month,
customer_id,
1 AS row_num,
DATE_DIFF(MAX(CASE WHEN row_num = 2 THEN verified_date END),
MAX(CASE WHEN row_num = 1 THEN verified_date END), DAY) AS difference
FROM yourTable
GROUP BY
customer_id;
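As a rough illustration of the conditional-aggregation idea above, here is a minimal sketch that runs the same pattern against an in-memory SQLite database from Python. The table name `purchases` and the sample rows are made up for the example, and SQLite's `julianday()` stands in for BigQuery's `DATE_DIFF`, which SQLite does not have.

```python
import sqlite3

# Hypothetical sample data: customer 1 has two purchases, customer 2 has one.
rows = [
    ("2019-01", 1, 1, "2019-01-10"),
    ("2019-03", 1, 2, "2019-03-01"),
    ("2019-02", 2, 1, "2019-02-05"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    "month TEXT, customer_id INTEGER, row_num INTEGER, verified_date TEXT)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)

# One output row per customer: each CASE picks the value from the row with
# the wanted rank, and MAX collapses the group to that single value.
query = """
SELECT
    MAX(CASE WHEN row_num = 1 THEN month END) AS month,
    customer_id,
    CAST(
        julianday(MAX(CASE WHEN row_num = 2 THEN verified_date END))
      - julianday(MAX(CASE WHEN row_num = 1 THEN verified_date END))
    AS INTEGER) AS difference
FROM purchases
GROUP BY customer_id
ORDER BY customer_id
"""
result = conn.execute(query).fetchall()
print(result)
```

A customer with no second purchase gets a NULL difference (`None` in Python), because the `row_num = 2` branch never matches for that group.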

Related

Create partitions based on column values in sql

I am very new to SQL and query writing, and after a lot of trying I am asking for help.
As shown in the picture, I want to partition the data based on is_late = 1 and show its count (that is, 2), but at the same time capture the value of last_status where is_late = 0, all displayed in a single row.
The task is to calculate how many times the rider was late and the time he took from the first occurrence of the estimated time to the last_status.
Desired output:
You can use the following query:
SELECT
rider_id,
task_created_time,
expected_time_to_arrive,
is_late,
last_status,
task_count,
CONVERT(VARCHAR(5), DATEADD(MINUTE, DATEDIFF(MINUTE, expected_time_to_arrive, last_status), 0), 114) AS time_delayed
FROM
(SELECT
rider_id,
task_created_time,
expected_time_to_arrive,
is_late,
SUM(CASE WHEN is_late = 1 THEN 1 ELSE 0 END) OVER(PARTITION BY rider_id ORDER BY rider_id) AS task_count,
ROW_NUMBER() OVER(PARTITION BY rider_id ORDER BY rider_id) AS num,
MAX(last_status) OVER(PARTITION BY rider_id ORDER BY rider_id) AS last_status
FROM myTestTable) t
WHERE num = 1

How do I display records in the same row although I am using group by on 2 columns that appear in different rows right now?

This is the output I am getting now, but I want all the records for one gateway in one row. I am trying to find the damage count and total count of packages processed by an airport in a week. Currently I am grouping by airport and week, so I get the records for each airport and week in different rows. I want the records for a particular airport in a single row, with the weeks in that same row.
I tried putting a conditional group by but that did not work.
select tmp.gateway,tmp.weekbucket, sum(tmp.damaged_count) as DamageCount, sum(tmp.total_count) as TotalCount, round(sum(tmp.DPMO),0) as DPMO from
(
select a.gateway,
date_trunc('week', (a.processing_date + interval '1 day')) - interval '1 day' as weekbucket,
count(distinct(b.fulfillment_shipment_id||b.package_id)) as damaged_count,
count(distinct(a.fulfillment_shipment_id||a.package_id)) as total_count,
count(distinct(b.fulfillment_shipment_id||b.package_id))*1.00/count(distinct(a.Fulfillment_Shipment_id || a.package_id))*1000000 as DPMO
from booker.d_air_shipments_na a
left join trex.d_ps_packages b
on (a.fulfillment_shipment_id||a.package_id =b.Fulfillment_Shipment_id||b.package_id)
where a.processing_date >= current_date-7
and (exception_summary in ('Reprint-Damaged Label') or exception_summary IS NULL)
and substring(route, position(a.gateway IN route) +6, 1) <> 'K'
group by a.gateway, weekbucket) as tmp
group by tmp.gateway, tmp.weekbucket
order by tmp.gateway, tmp.weekbucket desc;
Because the seven-day window can span two calendar weeks, you are likely to get two rows for each gateway. You can try removing weekbucket from the outer GROUP BY and taking MAX(weekbucket) instead, so that the counts from both weeks are summed into one row:
select
tmp.gateway, max(tmp.weekbucket),
sum(tmp.damaged_count) as DamageCount,
sum(tmp.total_count) as TotalCount,
round(sum(tmp.DPMO),0) as DPMO
from
(
select a.gateway,
date_trunc('week', (a.processing_date + interval '1 day')) - interval '1 day' as weekbucket,
count(distinct(b.fulfillment_shipment_id||b.package_id)) as damaged_count,
count(distinct(a.fulfillment_shipment_id||a.package_id)) as total_count,
count(distinct(b.fulfillment_shipment_id||b.package_id))*1.00/count(distinct(a.Fulfillment_Shipment_id || a.package_id))*1000000 as DPMO
from booker.d_air_shipments_na a
left join trex.d_ps_packages b
on (a.fulfillment_shipment_id||a.package_id =b.Fulfillment_Shipment_id||b.package_id)
where a.processing_date >= current_date-7
and (exception_summary in ('Reprint-Damaged Label') or exception_summary IS NULL)
and substring(route, position(a.gateway IN route) +6, 1) <> 'K'
group by a.gateway, weekbucket) as tmp
group by tmp.gateway
order by tmp.gateway,
max(tmp.weekbucket) desc;
So you want to pivot the two weeks into a single row with two sets of aggregates?
select
tmp.gateway,
min(case when rn = 1 then tmp.damaged_count end) as DamageCountWeek1,
min(case when rn = 2 then tmp.damaged_count end) as DamageCountWeek2,
min(case when rn = 1 then tmp.total_count end) as TotalCountWeek1,
min(case when rn = 2 then tmp.total_count end) as TotalCountWeek2,
min(case when rn = 1 then round(tmp.DPMO, 0) end) as DPMOWeek1,
min(case when rn = 2 then round(tmp.DPMO, 0) end) as DPMOWeek2
from (
select row_number() over (partition by gateway order by weekbucket) as rn,
...
) as tmp
group by tmp.gateway
order by tmp.gateway;

Calculate % of total - redshift / sql

I'm trying to calculate the percentage of one column over a secondary total column.
I wrote:
create temporary table screenings_count_2018 as
select guid,
datepart(y, screening_screen_date) as year,
sum(case when screening_package = 4 then 1 end) as count_package_4,
sum(case when screening_package = 3 then 1 end) as count_package_3,
sum(case when screening_package = 2 then 1 end) as count_package_2,
sum(case when screening_package = 1 then 1 end) as count_package_1,
sum(case when screening_package in (1, 2, 3, 4) then 1 end) as count_total_packages
from prod.leasing_fact
where year = 2018
group by guid, year;
That table establishes the initial count and total count columns. All columns look correct.
Then, I'm using ratio_to_report to calculate the percentage (referencing this tutorial):
create temporary table screenings_percentage as
select
guid,
year,
ratio_to_report(count_package_1) over (partition by count_total_packages) as percentage_package_1
from screenings_count_2018
group by guid, year,count_package_1,count_total_packages
order by percentage_package_1 desc;
I also tried:
select
guid,
year,
sum(count_package_1/count_total_packages) as percentage_package_1
-- ratio_to_report(count_package_1) over (partition by count_total_packages) as percentage_package_1
from screenings_count_2018
group by guid, year,count_package_1,count_total_packages
order by percentage_package_1 desc;
Unfortunately, percentage_package_1 just returns all null values (this is not correct - I'm expecting percentages). Neither query is working.
What am I doing wrong?
Thanks!
Since you have already laid out the columns with components and a total when creating screenings_count_2018, do you actually need to use ratio_to_report?
select guid
, year
, count_package_1/count_total_packages as percentage_package_1
, count_package_2/count_total_packages as percentage_package_2
, count_package_3/count_total_packages as percentage_package_3
, count_package_4/count_total_packages as percentage_package_4
from screenings_count_2018
That should work. NB are you guaranteed to never have count_total_packages be zero? If it can be zero you'll need to handle it. One way is with a case statement.
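To make the zero-total caveat concrete, here is a small sketch run through SQLite from Python. The table `counts` and its two rows are invented for the example. It uses `NULLIF` as a shorthand for the case expression mentioned above, and multiplies by `100.0` so that the division is not done in integer arithmetic. Note that SQLite itself returns NULL on division by zero, whereas many engines raise an error, so the `NULLIF` guard is what keeps the pattern portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE counts ("
    "guid TEXT, count_package_1 INTEGER, count_total_packages INTEGER)"
)
# Hypothetical rows: 'a' has 1 of 4 packages, 'b' has no packages at all.
conn.executemany("INSERT INTO counts VALUES (?, ?, ?)", [("a", 1, 4), ("b", 0, 0)])

query = """
SELECT
    guid,
    -- integer / integer truncates: 1 / 4 gives 0, not 0.25
    count_package_1 / count_total_packages AS integer_ratio,
    -- 100.0 forces a floating-point division; NULLIF turns a zero total into NULL
    ROUND(100.0 * count_package_1 / NULLIF(count_total_packages, 0), 1)
        AS percentage_package_1
FROM counts
ORDER BY guid
"""
result = conn.execute(query).fetchall()
print(result)
```

The row with a zero total comes back with a NULL percentage instead of failing, which is usually easier to handle downstream than an error.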
If you wish for the per-package percentages to appear in a single column, then you can use ratio_to_report -- it is a "window" analytic function and it will be something like this against the original table.
with count_table as (
select guid
, datepart(y, screening_screen_date) as year
, screening_package
, count(1) as count
from prod.leasing_fact
where year = 2018
group by guid
, datepart(y, screening_screen_date)
, screening_package
)
select guid
, year
, screening_package
, ratio_to_report(count) over(partition by guid, year) as perc_of_total
from count_table
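ratio_to_report divides each value by the sum of the values in its partition, so the same result can be sketched with a plain SUM window. The example below does that in SQLite from Python (SQLite has no ratio_to_report, and window functions need SQLite 3.25 or later). The sample counts are made up, and the column is called `cnt` here only to avoid clashing with the COUNT function name. The partition is by guid and year only: if screening_package were also in the partition, every partition would hold a single row and every ratio would be 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE count_table ("
    "guid TEXT, year INTEGER, screening_package INTEGER, cnt INTEGER)"
)
# Hypothetical pre-aggregated counts per guid, year and package.
conn.executemany(
    "INSERT INTO count_table VALUES (?, ?, ?, ?)",
    [("g1", 2018, 1, 1), ("g1", 2018, 2, 3), ("g2", 2018, 1, 2)],
)

query = """
SELECT guid, year, screening_package,
       -- each package's count divided by the total for its guid and year
       cnt * 1.0 / SUM(cnt) OVER (PARTITION BY guid, year) AS perc_of_total
FROM count_table
ORDER BY guid, year, screening_package
"""
result = conn.execute(query).fetchall()
print(result)
```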
You will need round(100.0*count_package_1/count_total_packages,1), and so on, since you have already calculated the subtotals and the total.

SQL Date intelligence: filtering data by seconds ran from last known valid result

Help! We're trying to create a new column (Is Valid?) to reproduce the logic below.
It is a binary result that:
it is 1 if it is the first known value of an ID
it is 1 if it is 3 seconds or later than the previous "1" of that ID
Note 1: this is not the difference in seconds from the previous record
It is 0 if it is less than 3 seconds after the previous "1" of that ID
Note 2: there are many IDs in the data set
Note 3: original dataset has ID and Date
Attached a PoC of the data and the expected result.
You would have to do this using a recursive CTE, which is quite expensive:
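with recursive tt as (
select t.*, row_number() over (partition by id order by time) as seqnum
from t
),
cte as (
-- anchor: the first row of each id starts the first group
select tt.*, time as grp_start
from tt
where seqnum = 1
union all
-- keep the current group start while the row is within 3 seconds of it;
-- otherwise the row starts a new group
select tt.*,
(case when tt.time < cte.grp_start + interval '3 second'
then cte.grp_start
else tt.time
end)
from cte join
tt
on tt.id = cte.id and tt.seqnum = cte.seqnum + 1
)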
select cte.*,
(case when grp_start = lag(grp_start) over (partition by id order by time)
then 0 else 1
end) as isValid
from cte;
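The rule itself is easier to see as a sequential pass over the rows. The following is a minimal Python sketch of that logic, not a replacement for the SQL: the function name `flag_valid` and the sample timestamps are invented for illustration. For each ID it remembers the time of the last row flagged 1 and compares every later row against that time, rather than against the immediately preceding row.

```python
from datetime import datetime, timedelta


def flag_valid(events, gap=timedelta(seconds=3)):
    """Return (id, time, is_valid) tuples for (id, time) input pairs."""
    last_valid = {}  # id -> time of the most recent row flagged 1
    flagged = []
    for id_, ts in sorted(events):  # order by id, then by time
        prev = last_valid.get(id_)
        if prev is None or ts - prev >= gap:
            # First row of this id, or at least `gap` after the last valid row.
            last_valid[id_] = ts
            flagged.append((id_, ts, 1))
        else:
            flagged.append((id_, ts, 0))
    return flagged


base = datetime(2020, 1, 1, 12, 0, 0)
# Hypothetical data: id "A" at 0, 2, 3, 4, 6 and 7 seconds, id "B" at 1 second.
events = [("A", base + timedelta(seconds=s)) for s in (0, 2, 3, 4, 6, 7)]
events.append(("B", base + timedelta(seconds=1)))

flags = [is_valid for _, _, is_valid in flag_valid(events)]
print(flags)
```

In the sample, the row at 3 seconds is valid because it is 3 seconds after the valid row at 0 seconds, even though it is only 1 second after the previous record.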

SQL sum over(partition) not subtracting negative values in SUM

I have the following query which outputs a list of transactions per user - units spent and units earned - column 'Amount'.
I have managed to group this per user and do a running total - column 'Running_Total_Spend'.
However, it is ADDING the negative 'Amount' values rather than subtracting them, so I am pretty sure it is the SUM part of the query that is not working.
WITH cohort AS(
SELECT DISTINCT userID FROM events_live WHERE startDate = '2018-07-26' LIMIT 50),
my_events AS (
SELECT events_live.* FROM events_live WHERE eventDate >= '2018-07-26')
SELECT cohort.userID,
my_events.eventDate,
my_events.eventTimestamp,
CASE
--spent resource outputs a negative value ---working
WHEN transactionVector = 'SPENT' THEN -abs(my_events.productAmount)
--earned resource outputs a positive value ---working
WHEN transactionVector = 'RECEIVED' THEN my_events.productAmount END AS Amount,
ROW_NUMBER() OVER (PARTITION BY cohort.userID ORDER BY cohort.userID, eventTimestamp asc) AS row,
--sum the values in column 'Amount' for this partition
--should sum positive and negative values ---NOT WORKING--converting negatives into positive
--------------------------------------------------
SUM(CASE WHEN my_events.productAmount >= 0 THEN my_events.productAmount
WHEN my_events.productAmount <0 THEN -abs(my_events.productAmount) end) OVER(PARTITION BY cohort.userID ORDER BY cohort.userID, eventTimestamp asc) AS Running_Total_Spend
---------------------------------------------------
FROM cohort
INNER JOIN my_events ON cohort.userID=my_events.userID
WHERE productName = 'COINS' AND transactionVector IN ('SPENT','RECEIVED')
I suspect you want that logic around transactionvector for the sum too as my_events.productamount seems to be always positive.
...
sum(CASE
WHEN transactionvector = 'SPENT' THEN
-my_events.productamount
WHEN transactionvector = 'RECEIVED' THEN
my_events.productamount
END) OVER (PARTITION BY cohort.userid
ORDER BY cohort.userid,
eventTimestamp) running_total_spend
...
Update your SUM function to:
SUM(my_events.productAmount) OVER(PARTITION BY cohort.userID ORDER BY cohort.userID, eventTimestamp asc) AS Running_Total_Spend
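For comparison, here is a small sketch of the first suggestion in this thread, signing the amount by transactionVector inside the windowed SUM. It assumes, as that answer suspects, that productAmount is always stored as a positive number. It runs in SQLite from Python (window functions need SQLite 3.25 or later); the table `events` and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "userID TEXT, eventTimestamp INTEGER, "
    "transactionVector TEXT, productAmount INTEGER)"
)
# Hypothetical events: amounts are stored unsigned, direction is in transactionVector.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("u1", 1, "RECEIVED", 100),
        ("u1", 2, "SPENT", 30),
        ("u1", 3, "SPENT", 50),
        ("u2", 1, "RECEIVED", 10),
    ],
)

query = """
SELECT userID, eventTimestamp,
       SUM(CASE WHEN transactionVector = 'SPENT' THEN -productAmount
                WHEN transactionVector = 'RECEIVED' THEN productAmount END)
           OVER (PARTITION BY userID ORDER BY eventTimestamp) AS running_total
FROM events
ORDER BY userID, eventTimestamp
"""
result = conn.execute(query).fetchall()
print(result)
```

With unsigned amounts, a plain SUM(productAmount) over the same window would give 100, 130, 180 for u1, which matches the "adding instead of subtracting" symptom described in the question.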