SUM for a week's worth of data - SQL

Currently, I've got a query at work that works like this:
SELECT ColumnA, ColumnB, PurchaseAmount/100 FROM TableX
JOIN TableY ON TableX.A = TableY.B
WHERE PurchaseAmount/100 > 299
and PurchaseTimestamp >= DATE_ADD(CURRENT_TIMESTAMP(),INTERVAL -1 DAY)
So this query runs once a day. Very simple, works great. My question is: how could I, say, SUM the PurchaseAmount (so, separate purchases) up to $1000 for a week? Is that even possible?
So a sample result from original query would be:
Col A (User ID)    PurchaseAmount
1                  $600
1                  $800
2                  $700
3                  $1100
And I would want to return 1 and 3, since these are the "IDs" that have a SUM over $1000. (Complicating this, of course, is that these purchases are happening over the course of a week, not one single day.)

with data as (
SELECT ColumnA, ColumnB, PurchaseAmount / 100 as Amt,
sum(PurchaseAmount / 100) over (partition by ColumnA) as CustomerTotal
FROM TableX INNER JOIN TableY ON TableX.A = TableY.B
WHERE PurchaseAmount / 100 > 299
AND PurchaseTimestamp >= DATE_ADD(CURRENT_TIMESTAMP(), INTERVAL -7 DAY)
)
select ColumnA, ColumnB, Amt
from data where CustomerTotal > 1000;
I am guessing that you want to expand the selection to a full seven days rather than just the single day of the posted query and then return all customer purchases in that range. Here's one way to augment your query with a grand total per customer that can be used for filtering. The purpose of division by 100 is unclear but you should be able to easily adjust that according to your data. Similarly with the comparison against 299.
You might also want to just use a subquery to compare against a list of customers that match the weekly test via a separate query.
SELECT ColumnA, ColumnB, PurchaseAmount / 100 as Amt
FROM TableX INNER JOIN TableY ON TableX.A = TableY.B
WHERE PurchaseAmount / 100 > 299
AND PurchaseTimestamp >= DATE_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
AND ColumnA IN (
SELECT ColumnA
FROM TableX INNER JOIN TableY ON TableX.A = TableY.B
WHERE PurchaseAmount / 100 > 299
AND PurchaseTimestamp >= DATE_ADD(CURRENT_TIMESTAMP(), INTERVAL -7 DAY)
GROUP BY ColumnA
HAVING SUM(PurchaseAmount / 100) > 1000
);
To get the total along with the results you might try:
with p as (
SELECT ColumnA, ColumnB, PurchaseTimestamp, PurchaseAmount / 100 as Amt,
sum(PurchaseAmount / 100) over (partition by ColumnA) as CustomerTotal
FROM TableX INNER JOIN TableY ON TableX.A = TableY.B
WHERE PurchaseAmount / 100 > 299
AND PurchaseTimestamp >= DATE_ADD(CURRENT_TIMESTAMP(), INTERVAL -7 DAY)
)
select * from p
where PurchaseTimestamp >= DATE_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
and CustomerTotal > 1000;
This does assume that the final result is a subset of the larger result used to compute the weekly threshold. That seems to be true in your case.
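If that assumption ever breaks (for example, if the one-day slice picks up extra predicates that the weekly window does not), a hedged alternative is to compute the weekly totals independently and join them back in. This is only a sketch, reusing the table and column names and the DATE_ADD syntax from the question:

with weekly as (
SELECT ColumnA, SUM(PurchaseAmount / 100) AS CustomerTotal
FROM TableX INNER JOIN TableY ON TableX.A = TableY.B
WHERE PurchaseAmount / 100 > 299
AND PurchaseTimestamp >= DATE_ADD(CURRENT_TIMESTAMP(), INTERVAL -7 DAY)
GROUP BY ColumnA
)
SELECT t.ColumnA, t.ColumnB, t.Amt, w.CustomerTotal
FROM (
SELECT ColumnA, ColumnB, PurchaseAmount / 100 AS Amt
FROM TableX INNER JOIN TableY ON TableX.A = TableY.B
WHERE PurchaseAmount / 100 > 299
AND PurchaseTimestamp >= DATE_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
) AS t
INNER JOIN weekly AS w ON w.ColumnA = t.ColumnA
WHERE w.CustomerTotal > 1000;

Here the weekly total no longer depends on the daily rows being a subset of the weekly window.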

Related

SQL - Calculate percentage by group, for multiple groups

I have a table in GBQ in the following format:
UserId Orders Month
XDT 23 1
XDT 0 4
FKR 3 6
GHR 23 4
... ... ...
It shows the number of orders per user and month.
I want to calculate the percentage of users who have orders, and I did it as follows:
SELECT
HasOrders,
ROUND(COUNT(*) * 100 / CAST( SUM(COUNT(*)) OVER () AS float64), 2) Parts
FROM (
SELECT
*,
CASE WHEN Orders = 0 THEN 0 ELSE 1 END AS HasOrders
FROM `Table` )
GROUP BY
HasOrders
ORDER BY
Parts
It gives me the following result:
HasOrders Parts
0 35
1 65
I need to calculate the percentage of users who have orders, by month, in a way that every month = 100%
Currently, to do this I execute the query once per month, which is not practical:
SELECT
HasOrders,
ROUND(COUNT(*) * 100 / CAST( SUM(COUNT(*)) OVER () AS float64), 2) Parts
FROM (
SELECT
*,
CASE WHEN Orders = 0 THEN 0 ELSE 1 END AS HasOrders
FROM `Table` )
WHERE Month = 1
GROUP BY
HasOrders
ORDER BY
Parts
Is there a way to execute the query once and get this result?
HasOrders Parts Month
0 25 1
1 75 1
0 45 2
1 55 2
... ... ...
SELECT
SIGN(Orders),
ROUND(COUNT(*) * 100.000 / SUM(COUNT(*)) OVER (PARTITION BY Month), 2) AS Parts,
Month
FROM T
GROUP BY Month, SIGN(Orders)
ORDER BY Month, SIGN(Orders)
Demo on Postgres:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=4cd2d1455673469c2dfc060eccea8020
You've stated that it's important for the total to be 100%, so you might consider rounding down for the no-orders group and rounding up for the has-orders group in those scenarios where the percentage falls precisely on an odd multiple of 0.5%. Or perhaps rounding toward even, or rounding the smaller of the two percentages down, would be better options:
WITH DATA AS (
SELECT SIGN(Orders) AS HasOrders, Month,
COUNT(*) * 100.000 / SUM(COUNT(*)) OVER (PARTITION BY Month) AS PartsPercent
FROM T
GROUP BY Month, SIGN(Orders)
ORDER BY Month, SIGN(Orders)
)
select HasOrders, Month, PartsPercent,
PartsPercent - TRUNC(PartsPercent) AS Fraction,
CASE WHEN HasOrders = 0
THEN FLOOR(PartsPercent) ELSE CEILING(PartsPercent)
END AS PartsRound0Down,
CASE WHEN PartsPercent - TRUNC(PartsPercent) = 0.5
AND MOD(TRUNC(PartsPercent), 2) = 0
THEN FLOOR(PartsPercent) ELSE ROUND(PartsPercent) -- halfway up
END AS PartsRoundTowardEven,
CASE WHEN PartsPercent - TRUNC(PartsPercent) = 0.5 AND PartsPercent < 50
THEN FLOOR(PartsPercent) ELSE ROUND(PartsPercent) -- halfway up
END AS PartsSmallestTowardZero
from DATA
It's usually not advisable to test floating-point values for equality, and I don't know how BigQuery's float64 will behave in the comparison against 0.5; one half is, at least, exactly representable in binary. See these options in a case where the breakout is 101 vs 99. I don't have immediate access to BigQuery, so be aware that Postgres's rounding behavior is different:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=c8237e272427a0d1114c3d8056a01a09
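If the floating-point equality test does turn out to be a problem, one hedged workaround (a sketch only, assuming BigQuery Standard SQL and the same T/Orders/Month names as above) is to detect the halfway case with integer arithmetic instead: the percentage lands exactly on an odd multiple of 0.5% when COUNT(*) * 200 divides evenly by the month's total and the quotient is odd.

WITH counts AS (
SELECT SIGN(Orders) AS HasOrders, Month,
COUNT(*) AS cnt,
SUM(COUNT(*)) OVER (PARTITION BY Month) AS total
FROM T
GROUP BY Month, SIGN(Orders)
)
SELECT HasOrders, Month,
cnt * 100.0 / total AS PartsPercent,
-- exact halfway test done purely on INT64 values, no float comparison
(MOD(cnt * 200, total) = 0 AND MOD(DIV(cnt * 200, total), 2) = 1) AS IsExactHalf
FROM counts
ORDER BY Month, HasOrders

You could then branch on IsExactHalf to apply whichever rounding rule you settle on.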
Consider below approach
select hasOrders, round(100 * parts, 2) as parts, month from (
select month,
countif(orders = 0) / count(*) `0`,
countif(orders > 0) / count(*) `1`,
from your_table
group by month
)
unpivot (parts for hasOrders in (`0`, `1`))
with output like below

See the distribution of secondary requests grouped by time interval in SQL

I have the following table:
RequestId  Type  Date        ParentRequestId
1          1     2020-10-15  null
2          2     2020-10-19  1
3          1     2020-10-20  null
4          2     2020-11-15  3
For this example I am only interested in request types 1 and 2, to keep things simple. My task is to query a big database and see the distribution of the secondary requests based on the difference in dates from the parent one. So the result would look like:
Interval     Percentage
0-7 days     50 %
8-15 days    0 %
16-50 days   50 %
So for the first line of the expected result we have the request with id 2, and for the third line of the expected result we have the request with id 4, because the date difference fits in that interval.
How to achieve this?
I'm using sql server 2014.
We'd like to see your attempts, but by the looks of it you're going to need to treat this table as two tables and do a basic GROUP BY, made a bit fancier by grouping on a CASE expression.
WITH dateDiffs as (
/* perform our date calculations first, to get that out of the way */
SELECT
DATEDIFF(Day, parent.[Date], child.[Date]) as daysDiff,
1 as rowsFound
FROM (SELECT RequestID, [Date] FROM myTable WHERE Type = 1) parent
INNER JOIN (SELECT ParentRequestID, [Date] FROM myTable WHERE Type = 2) child
ON parent.requestID = child.parentRequestID
)
/* Now group and aggregate and enjoy your maths! */
SELECT
case when daysDiff between 0 and 7 then '0-7'
when daysDiff between 8 and 15 then '8-15'
when daysDiff between 16 and 50 THEN '16-50'
else '50+'
end as myInterval,
sum(rowsFound) as totalFound,
(select sum(rowsFound) from dateDiffs) as totalRows,
1.0 * sum(rowsFound) / (select sum(rowsFound) from dateDiffs) * 100.00 as percentFound
FROM dateDiffs
GROUP BY
case when daysDiff between 0 and 7 then '0-7'
when daysDiff between 8 and 15 then '8-15'
when daysDiff between 16 and 50 THEN '16-50'
else '50+'
end;
This seems like basically a join and group by query:
with dates as (
select 0 as lo, 7 as hi, '0-7 days' as grp union all
select 8 as lo, 15 as hi, '8-15 days' union all
select 16 as lo, 50 as hi, '16-50 days'
)
select d.grp,
count(*) as cnt,
count(*) * 1.0 / sum(count(*)) over () as ratio
from dates left join
(t join
t tp
on tp.RequestId = t.ParentRequestId
)
on datediff(day, tp.date, t.date) between d.lo and d.hi
group by d.grp
order by d.lo;
The only trick is generating all the date groups, so you have rows with zero values.
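As a side note, and purely as a hedged alternative for generating those interval rows in SQL Server, a VALUES row constructor can replace the UNION ALL chain; the rest of the query stays the same:

with dates as (
select lo, hi, grp
from (values (0, 7, '0-7 days'),
             (8, 15, '8-15 days'),
             (16, 50, '16-50 days')) v(lo, hi, grp)
)
select * from dates;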

How to find value in a range of following rows - SQL Teradata

I have a table with the following columns:
account, validity_date, validity_month, amount.
For each row I want to check whether the value in the "amount" field exists anywhere within the rows of the next month. If yes, indicator = 1; else 0.
account validity_date validity_month amount **required_column**
------- ------------- --------------- ------- ----------------
123 15oct2019 201910 400 0
123 20oct2019 201910 500 1
123 15nov2019 201911 1000 0
123 20nov2019 201911 500 0
123 20nov2019 201911 2000 1
123 15dec2019 201912 400
123 15dec2019 201912 2000
Can anyone help?
Thanks
validity_month/100*12 + validity_month MOD 100 calculates a month number (for comparing across years, Jan to previous Dec) and the inner ROW_NUMBER reduces multiple rows with the same amount per month to a single row (kind of DISTINCT):
SELECT dt.*
,CASE -- next row is from next month
WHEN Lead(nextMonth IGNORE NULLS)
Over (PARTITION BY account, amount
ORDER BY validity_date)
= (validity_month/100*12 + validity_month MOD 100) +1
THEN 1
ELSE 0
END
FROM
(
SELECT t.*
,CASE -- one row per account/month/amount
WHEN Row_Number()
Over (PARTITION BY account, amount, validity_month
ORDER BY validity_date ) = 1
THEN validity_month/100*12 + validity_month MOD 100
END AS nextMonth
FROM tab AS t
) AS dt
Edit:
The previous query is for exactly matching amounts; for a range match the query is probably very hard to write with OLAP functions, but easy with a correlated subquery:
SELECT t.*
,CASE
WHEN
( -- check if there's a row in the next month matching the current amount +/- 10 percent
SELECT Count(*)
FROM tab AS t2
WHERE t2.account = t.account
AND (t2.validity_month/100*12 + t2.validity_month MOD 100)
= ( t.validity_month/100*12 + t.validity_month MOD 100) +1
AND t2.amount BETWEEN t.amount * 0.9 AND t.amount * 1.1
) > 0
THEN 1
ELSE 0
END
FROM tab AS t
But then performance might be really bad...
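If that turns out to be too slow, a hedged variant (same tab table and columns as above) is to phrase the check as an EXISTS predicate, which is standard SQL inside a CASE expression and can let the optimizer stop at the first qualifying row instead of counting them all; whether your Teradata release accepts EXISTS in this position is worth verifying:

SELECT t.*
,CASE
WHEN EXISTS
( -- stop at the first next-month row within +/- 10 percent of the current amount
SELECT 1
FROM tab AS t2
WHERE t2.account = t.account
AND (t2.validity_month/100*12 + t2.validity_month MOD 100)
= ( t.validity_month/100*12 + t.validity_month MOD 100) +1
AND t2.amount BETWEEN t.amount * 0.9 AND t.amount * 1.1
)
THEN 1
ELSE 0
END
FROM tab AS t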
Assuming the values are unique within a month and you have a value for each month for each account, you can simplify this to:
select t.*,
(case when lead(seqnum) over (partition by account, amount order by validity_month) = seqnum + 1
then 1 else 0
end)
from (select t.*,
dense_rank() over (partition by account order by validity_month) as seqnum
from t
) t;
Note: This puts 0 for the last month rather than NULL, but that can easily be adjusted.
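For example (just a sketch on top of the query above), you could NULL out the flag for each account's last month explicitly:

select t.*,
(case when validity_month = max(validity_month) over (partition by account)
then null
when lead(seqnum) over (partition by account, amount order by validity_month) = seqnum + 1
then 1 else 0
end)
from (select t.*,
dense_rank() over (partition by account order by validity_month) as seqnum
from t
) t;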
You can do this without the subquery by using month arithmetic. It is not clear what the data type of validity_month is. If I assume a number:
select t.*,
(case when lead(floor(validity_month / 100) * 12 + (validity_month mod 100)
) over (partition by account, amount order by validity_month) =
floor(validity_month / 100) * 12 + (validity_month mod 100) + 1
then 1 else 0
end)
from t;
Just to add another way to do this using Standard SQL. This query will return 1 when the condition is met, 0 when it is not, and null when there isn't a next month to evaluate (as implied in your result column).
It is assumed that we're partitioning on the account field. Also includes a 10% range match on the amount field based on the comment made. Note that if you have an id field, you should include it (if two rows have the same account, validity_date, validity_month, amount there will only be one resulting row, due to DISTINCT).
Performance-wise, it should be similar to the answer from @dnoeth.
SELECT DISTINCT
t1.account,
t1.validity_date,
t1.validity_month,
t1.amount,
CASE
WHEN t2.amount IS NOT NULL THEN 1
WHEN MAX(t1.validity_month) OVER (PARTITION BY t1.account) > t1.validity_month THEN 0
ELSE NULL
END AS flag
FROM `project.dataset.table` t1
LEFT JOIN `project.dataset.table` t2
ON
t2.account = t1.account AND
DATE_DIFF(
PARSE_DATE("%Y%m", CAST(t2.validity_month AS STRING)),
PARSE_DATE("%Y%m", CAST(t1.validity_month AS STRING)),
MONTH
) = 1 AND
t2.amount BETWEEN t1.amount * 0.9 AND t1.amount * 1.1;

Cohort/ Retention query in BigQuery using Google Analytics exported data

I need help formulating a cohort/retention query
I am trying to build a query to look at visitors who performed Action X on their first visit (in the time frame) and then see how many days later they returned to perform Action X again.
The output I (eventually) need looks like this...
The table I am dealing with is an export of Google Analytics to BigQuery
Could anyone help me with this, or share a similar query that I could adapt?
Thanks
Just to give you a simple idea / direction:
Below is for BigQuery Standard SQL
#standardSQL
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
You can test it with below dummy data from your question
#standardSQL
WITH `OutputFromQuery` AS (
SELECT '01.07.17' AS Date_of_action_first_taken, 1000 AS Visits, 800 AS later_1_day, 400 AS later_2_days, 300 AS later_3_days UNION ALL
SELECT '02.07.17', 1000, 860, 780, 860 UNION ALL
SELECT '29.07.17', 1000, 780, 120, 0 UNION ALL
SELECT '30.07.17', 1000, 710, 0, 0
)
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
The OutputFromQuery data is as below:
Date_of_action_first_taken Visits later_1_day later_2_days later_3_days
01.07.17 1000 800 400 300
02.07.17 1000 860 780 860
29.07.17 1000 780 120 0
30.07.17 1000 710 0 0
and the final output is:
Date_of_action_first_taken later_1_day later_2_days later_3_days
01.07.17 80.0 40.0 30.0
02.07.17 90.0 78.0 86.0
29.07.17 80.0 12.0 0.0
30.07.17 70.0 0.0 0.0
I found this query on Turn Your App Data into Answers with Firebase and BigQuery (Google I/O'19)
It should work :)
#standardSQL
###################################################
# Part 1: Cohort of New Users Starting on DEC 24
###################################################
WITH
new_user_cohort AS (
SELECT DISTINCT
user_pseudo_id as new_user_id
FROM
`[your_project].[your_firebase_table].events_*`
WHERE
event_name = '[chosen_event]' AND
#set the date from when starting cohort analysis
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) = '20191224' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
),
num_new_users AS (
SELECT count(*) as num_users_in_cohort FROM new_user_cohort
),
#############################################
# Part 2: Engaged users from Dec 24 cohort
#############################################
engaged_users_by_day AS (
SELECT
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) as event_day,
COUNT(DISTINCT user_pseudo_id) as num_engaged_users
FROM
`[your_project].[your_firebase_table].events_*`
INNER JOIN
new_user_cohort ON new_user_id = user_pseudo_id
WHERE
event_name = 'user_engagement' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
GROUP BY
event_day
)
####################################################################
# Part 3: Daily Retention = [Engaged Users / Total Users]
####################################################################
SELECT
event_day,
num_engaged_users,
num_users_in_cohort,
ROUND((num_engaged_users / num_users_in_cohort), 3) as retention_rate
FROM
engaged_users_by_day
CROSS JOIN
num_new_users
ORDER BY
event_day
So I think I may have cracked it... from this output I then would need to manipulate it (pivot table it) to make it look like the desired output.
Can anyone review this for me and let me know what you think?
WITH
cohort_items AS (
SELECT
MIN( TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 +
h.time*1000)), DAY) ) AS cohort_day, fullVisitorID
FROM
TABLE123 AS U,
UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN'
GROUP BY 2
),
user_activites AS (
SELECT
A.fullVisitorID,
DATE_DIFF(DATE(TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 + h.time*1000)), DAY)), DATE(C.cohort_day), DAY) AS day_number
FROM `Table123` A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID,
UNNEST(hits) AS h
WHERE
A._TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN'
GROUP BY 1,2),
cohort_size AS (
SELECT
cohort_day,
count(1) as number_of_users
FROM
cohort_items
GROUP BY 1
ORDER BY 1
),
retention_table AS (
SELECT
C.cohort_day,
A.day_number,
COUNT(1) AS number_of_users
FROM
user_activites A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID
GROUP BY 1,2
)
SELECT
B.cohort_day,
S.number_of_users as total_users,
B.day_number,
B.number_of_users / S.number_of_users as percentage
FROM retention_table B
LEFT JOIN cohort_size S ON B.cohort_day = S.cohort_day
WHERE B.cohort_day IS NOT NULL
ORDER BY 1, 3
Thank you in advance!
If you use some techniques available in BigQuery, you can potentially solve this type of problem with a very cost- and performance-effective solution. As an example:
SELECT
init_date,
ARRAY((SELECT AS STRUCT days, freq, ROUND(freq * 100 / MAX(freq) OVER(), 2) FROM UNNEST(data) ORDER BY days)) data
FROM(
SELECT
init_date,
ARRAY_AGG(STRUCT(days, freq)) data
FROM(
SELECT
init_date,
data AS days,
COUNT(data) freq
FROM(
SELECT
init_date,
ARRAY(SELECT DATE_DIFF(PARSE_DATE("%Y%m%d", dts), PARSE_DATE("%Y%m%d", init_date), DAY) AS dt FROM UNNEST(dts) dts) data
FROM(
SELECT
MIN(date) init_date,
ARRAY_AGG(DISTINCT date) dts
FROM `Table123`
WHERE TRUE
AND EXISTS(SELECT 1 FROM UNNEST(hits) where eventinfo.eventCategory = 'recommendation') -- This is your 'ACTION TAKEN' filter
AND _TABLE_SUFFIX BETWEEN "20170724" AND "20170731"
GROUP BY fullvisitorid
)
),
UNNEST(data) data
GROUP BY init_date, days
)
GROUP BY init_date
)
I tested this query against our G.A data and selected customers who interacted with our recommendation system (as you can see in the filter selection WHERE EXISTS...). Example of result (omitted absolute values of freq for privacy reasons):
As you can see, on day 28 for instance, 8% of customers came back 1 day later and interacted with the system again.
I recommend playing around with this query to see if it works well for you. It's simpler, cheaper, faster and hopefully easier to maintain.

SQL need to apply a variable rate if values occur over consecutive time periods

I have a table that looks like this:
Within the query I need to find the Maximum Import value that occurs over two time periods (rows) where the value is greater than a defined Threshold, and apply a rate. If it happens over more than two time periods, a different rate will be used.
Threshold = 1000
Rate 1 (2 consecutive) = 100
Rate 2 (> 2 consecutive) = 200
Id DateTime Import Export Total
1 2016-01-13 00:00 1000 500 1500
2 2016-01-13 00:15 2500 100 3000
3 2016-01-13 00:30 1900 200 2100
4 2016-01-13 01:00 900 100 1200
Ids 2 and 3 are > Threshold, so the query should return the MIN value of those (2500, 1900) = 1900, minus the Threshold (1000) = 900. Apply the rate: 900 * Rate1 = 90,000.
If we change the value of Id 4 to 1200 then the MIN value would be 1200. Less the threshold = 200. 200 * Rate2 = 40,000.
Any help would be greatly appreciated!
Update after feedback. My challenge appears to be that I'm not grabbing the 2nd highest value. Here is an example of the dataset:
Dataset example
I added another variable to shrink the list down to test the gaps-and-islands portion. Here is a smaller subset:
Subset
Here is the code:
WITH CTE AS (
SELECT LogTable.[LocalTimestamp] as thetime,LogTable.[SystemImport] as import, LogTable.[Id] - ROW_NUMBER() OVER (ORDER BY LogTable.[Id]) AS grp
FROM {System_KWLogRaw} LogTable
WHERE LogTable.[SystemImport] between #DemandThreshold and #In1 and
DATEPART(year,#inDate) = DATEPART(year, LogTable.[LocalTimestamp]) and
DATEPART(month,#inDate) = DATEPART(month, LogTable.[LocalTimestamp]) and
DATEPART(day,#inDate) = DATEPART(day, LogTable.[LocalTimestamp])
),
counted AS (
SELECT *, COUNT(*) OVER (PARTITION BY grp) AS cnt
FROM CTE
)
SELECT MAX(counted.import) as again1
FROM counted
WHERE cnt > 3 and counted.import < (SELECT MAX(counted.import) FROM counted)
This returns 3555.53 instead of 3543.2, which is the 2nd highest value.
This will do what you're asking for:
with x as (
select
t1.Id,
t1.DateTime,
t1.Import,
t1.Export,
t1.Total,
count(t2.Import) over (partition by 1) as [QualifyingImports],
min(t2.Import) over (partition by 1) as [MinQualifyingImport]
from
myTable t1
left join myTable t2 on t2.Import > 1000 and t2.Id = t1.Id
where
t1.DateTime >= '2016-01-13'
and t1.DateTime < dateadd(d, 1,'2016-01-13')
)
select
x.Id,
x.DateTime,
x.Import,
x.Export,
x.Total,
case when x.[QualifyingImports] > 2 then (x.MinQualifyingImport - 1000) * 200 else (x.MinQualifyingImport - 1000) * 100 end as [Rate]
from x
I've put together a Fiddle so you can play around with different values for Id # 4.
I really wanted to make the values of things like threshold and period into #variables, but it doesn't appear to be supported inside CTEs so I just had to hard code them.
EDIT
Turns out the CTE is overkill, you can shrink it down to this and use #variables, yay!
declare #period smalldatetime = '2016-01-13'
declare #threshold float = 1000
declare #rate1 float = 100
declare #rate2 float = 200
select
t1.Id,
t1.DateTime,
t1.Import,
t1.Export,
t1.Total,
case
when count(t2.Import) over (partition by 1) > 2 then (min(t2.Import) over (partition by 1) - #threshold) * #rate2
else (min(t2.Import) over (partition by 1) - #threshold) * #rate1
end as [Rate]
from
myTable t1
left join myTable t2 on t2.Import > #threshold and t2.Id = t1.Id
where
t1.DateTime >= #period
and t1.DateTime < dateadd(d, 1, #period)
New Fiddle