How to segment customers in SQL using APPROX_QUANTILES

I am using the ga_sessions sample data in BigQuery and want to divide customers into segments based on how often they placed an order, using APPROX_QUANTILES. Ideally, the output should tell me which segment each customer belongs to based on their orders.
Unfortunately the query does not behave as expected: every percentile column returns 1, and the 100% column returns 36 every time. Any idea how to improve this query?
WITH transdata AS (
SELECT
DISTINCT fullVisitorId AS VisitorId
,COUNT(DISTINCT FORMAT('%s%i',fullVisitorId,visitId)) AS uniqueVisits
,SUM(totals.transactions) AS total_transactions
,SUM(totals.totalTransactionRevenue) AS total_transaction_revenue
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_table_suffix BETWEEN '20160801' AND '20170801'
GROUP BY 1
ORDER BY 3 DESC
)
-- determine percentiles for total transactions per customer
SELECT
a.*
,b.percentiles[offset (20)] AS v20
,b.percentiles[offset (40)] AS v40
,b.percentiles[offset (60)] AS v60
,b.percentiles[offset (80)] AS v80
,b.percentiles[offset (100)] AS v100
FROM
transdata AS a,
(SELECT APPROX_QUANTILES(total_transactions, 100) percentiles FROM transdata) AS b
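To make the goal concrete, here is a minimal Python sketch of the bucketing step: compute the cut points once, then label each customer by comparing their total against them. It uses exact quantiles from the standard library rather than APPROX_QUANTILES, and the order counts and the `segment_of` helper are hypothetical.

```python
from bisect import bisect_left
from statistics import quantiles

# Hypothetical per-customer order counts (not taken from ga_sessions).
orders = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1, "f": 1, "g": 2, "h": 3, "i": 5, "j": 36}

# Four cut points split the customers into five segments (quintiles).
cuts = quantiles(orders.values(), n=5, method="inclusive")

def segment_of(value, cut_points):
    """Return a 1-based segment: 1 + number of cut points strictly below value."""
    return bisect_left(cut_points, value) + 1

segments = {customer: segment_of(total, cuts) for customer, total in orders.items()}
```

With data this skewed, the lowest cut points coincide, so some segments stay empty. That may be related to seeing the same value in most percentile columns.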

Related

Return data where the running total of amounts 30 days from another row is greater than or equal to the amount for that row

Let's say I have a table that contains the date, account, product, purchase type, and amount like below:
Looking at this table, you can see that for any particular account/product combination, there are buys and sells. Essentially, what I'd like to write is a SQL query that flags the following: Are there accounts that bought at a certain amount and then sold the same aggregate amount or more 30 days from that buy?
So for example, we can see account 1 bought product A for 20k on 8/1. If we look at the running sum of sells by account 1 for product A over the next 30 days, we see they sold a total of 20k - the same as the initial buy:
Ideally, the query would return results that flag all of these instances: for each individual buy, find all sells for that product/account 30 days from that buy, and only return rows where the running total of sells is greater than or equal to that initial buy.
EDIT: Using the sample data provided, the desired output should look more or less like the following:
You'll see that the buy on 8/2 for product B/account 2 is not returned because the running sum of sells for that product/account/buy combination over the next 30 days does not equal or exceed the buy amount of 35k but it does return rows for the buy on 8/3 for product B/ account 2 because the sells do exceed the buy amount of 10k.
I know I need to self join the sells against the buys, where the accounts/products are equal and the datediff is less than or equal to 30, and I basically have that part structured. What I can't seem to get working is the running total part and only returning data when that total is greater than or equal to that buy. I know I likely need to use the over/partition by clauses for the running sum, but I'm struggling to produce the right results and to optimize properly. Any help on this would be greatly appreciated; I'm just looking for some general direction on how to approach this.
Bonus: Would be even more powerful to stop returning the sells once the running total passes the buy, so for example, the last two rows in the desired output I provided aren't technically needed - since the first two sells following the buy had already eclipsed the buy amount.
In SQL Server, one option uses a lateral join:
select
    t.*,
    -- flag buys whose sells over the following 30 days reach or exceed the buy amount
    case when x.amount >= t.amount then 1 else 0 end as is_returned
from mytable t
cross apply (
    select sum(t1.amount) as amount
    from mytable t1
    where
        t1.purchase_type = 'Sell'
        and t1.account = t.account
        and t1.product = t.product
        and t1.date >= t.date
        and t1.date <= dateadd(day, 30, t.date)
) x
where t.purchase_type = 'Buy'
The lateral join sums the amounts of the "sells" for the same account and product within the following 30 days, which you can then compare with the amount of the buy. The query gives you one row per buy, with a flag that indicates whether those sells add up to at least the buy amount.
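The same comparison can be tried end to end with a small sketch. This one assumes SQLite (through Python's `sqlite3` module) in place of SQL Server, so a correlated subquery stands in for `cross apply`. The rows and dates are invented for illustration, and `sells_30d` is a hypothetical column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable (date text, account int, product text, purchase_type text, amount int)"
)
# Invented sample rows; dates are ISO strings so they compare correctly as text.
rows = [
    ("2020-08-01", 1, "A", "Buy", 20000),
    ("2020-08-05", 1, "A", "Sell", 15000),
    ("2020-08-20", 1, "A", "Sell", 5000),
    ("2020-08-02", 2, "B", "Buy", 35000),
    ("2020-08-10", 2, "B", "Sell", 10000),
    ("2020-10-15", 2, "B", "Sell", 30000),  # more than 30 days after the buy
]
conn.executemany("insert into mytable values (?, ?, ?, ?, ?)", rows)

query = """
select t.date, t.account, t.product, t.amount,
       coalesce((select sum(t1.amount)
                 from mytable t1
                 where t1.purchase_type = 'Sell'
                   and t1.account = t.account
                   and t1.product = t.product
                   and t1.date >= t.date
                   and t1.date <= date(t.date, '+30 days')), 0) as sells_30d
from mytable t
where t.purchase_type = 'Buy'
order by t.date
"""
# One tuple per buy: (date, account, product, sells within 30 days reach the buy amount)
flags = [(d, acc, prod, sells >= amt) for d, acc, prod, amt, sells in conn.execute(query)]
```

Here the first buy is flagged because its two sells total 20000, while the second is not because only 10000 of sells fall inside its 30-day window.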
In databases that support the range specification to window functions, this would be more efficiently expressed with a window sum:
select *
from (
    select
        t.*,
        case when sum(case when purchase_type = 'Sell' then amount end) over(
            partition by account, product
            order by date
            range between current row and interval '30' day following
        ) >= amount then 1 else 0 end as is_returned
    from mytable t
) t
where purchase_type = 'Buy'
Edit: this would generate a result set similar to the third table in your question:
select t.*, x.*
from mytable t
cross apply (
    select
        t1.date sale_date,
        t1.amount sell_amount,
        sum(t1.amount) over(order by t1.date) running_sell_amount,
        sum(t1.amount) over() total_sell_amount
    from mytable t1
    where
        t1.purchase_type = 'Sell'
        and t1.account = t.account
        and t1.product = t.product
        and t1.date >= t.date
        and t1.date <= dateadd(day, 30, t.date)
) x
-- keep only buys whose sells in the window reach or exceed the buy amount
where t.purchase_type = 'Buy' and x.total_sell_amount >= t.amount

Trying to create a SQL query

I am trying to create a query that retrieves only the ten companies with the highest number of pickups over the six-month period. This means pickup occasions, not the number of items picked up.
This is what I have so far:
SELECT *
FROM customer
JOIN (SELECT manifest.pickup_customer_ref reference,
DENSE_RANK() OVER (PARTITION BY manifest.pickup_customer_ref ORDER BY COUNT(manifest.trip_id) DESC) rnk
FROM manifest
INNER JOIN trip ON manifest.trip_id = trip.trip_id
WHERE trip.departure_date > TRUNC(SYSDATE) - interval '6' month
GROUP BY manifest.pickup_customer_ref) cm ON customer.reference = cm.reference
WHERE cm.rnk < 11;
This uses DENSE_RANK to order the customers, with the highest number of trips first.
Hmm, well, I don't have Oracle so I can't test it 100% (the query below uses SQL Server syntax: TOP, DATEADD, GETDATE), but I believe you're looking for something like the following.
Keep in mind that when you use group by, the non-aggregated fields in the select have to be the same fields you group by. Hope this at least gives you an idea of what to look at.
select top 10
    c.company_name,
    m.pickup_customer_ref,
    count(distinct m.trip_id) as pickup_count  -- pickup occasions, assuming one trip is one occasion
from customer c
inner join manifest m on m.pickup_customer_ref = c.reference
inner join trip t on t.trip_id = m.trip_id
where t.departure_date >= dateadd(month, -6, getdate())  -- trips in the last six months
group by c.company_name, m.pickup_customer_ref
order by pickup_count desc, c.company_name, m.pickup_customer_ref
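To check the counting logic on a small scale, here is a sketch that assumes SQLite via Python's `sqlite3` rather than Oracle or SQL Server. The rows, the fixed cut-off date (used instead of the current date so the result is repeatable), and the `pickups` alias are all made up for illustration. It counts distinct trips per customer, so several items on one trip count as a single pickup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customer (reference integer, company_name text);
create table trip (trip_id integer, departure_date text);
create table manifest (trip_id integer, pickup_customer_ref integer, item text);
insert into customer values (1, 'Acme'), (2, 'Bolt'), (3, 'Crate');
insert into trip values (10, '2024-05-01'), (11, '2024-05-02'),
                        (12, '2024-05-03'), (13, '2023-01-01');
insert into manifest values
    (10, 1, 'box'), (10, 1, 'crate'), (10, 1, 'pallet'),
    (11, 2, 'box'), (12, 2, 'box'),
    (13, 3, 'box'), (13, 3, 'bag');
""")

query = """
select c.company_name, count(distinct m.trip_id) as pickups
from customer c
join manifest m on m.pickup_customer_ref = c.reference
join trip t on t.trip_id = m.trip_id
where t.departure_date >= :since
group by c.reference, c.company_name
order by pickups desc, c.company_name
limit :n
"""
# Acme has three items on one trip, Bolt has two trips, Crate's trip is too old.
top = conn.execute(query, {"since": "2024-01-01", "n": 2}).fetchall()
```

Bolt ranks first with two pickups even though Acme shipped more items, which is the distinction between pickup occasions and items picked up.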

How to check newer than or equal to a date on access SQL

Currently I have two tables, using Access 2007
TimeSheet(empID, TimeSheet, hours)
and
Rates(empID,Rate,PromotionDate)
How do I select the correct billing rate for each employee based on their date of promotion?
For example, I have
Rates(001,10,#01/01/2013#)
Rates(001,15,#01/05/2013#)
Rates(002,10,#01/01/2013#)
and
Timesheet(001,#02/01/2013#,5)
Timesheet(001,#02/05/2013#,5)
Timesheet(002,#02/01/2013#,7)
In this case, I want to show that if empID 001 submitted a timesheet on 02/01/2013 it would be billed at $10/hr, but his timesheets starting on May 1st would be billed at $15/hr.
My query right now is
SELECT t.empID , t.timesheet, r.hours ,
(SELECT rate FROM rate WHERE t.timeSheet >= r.promotionDate) AS RateBilled
FROM rate AS r , timesheet AS t
WHERE r.empid = t.empid
When run, it shows the message “At most one record can be returned by this subquery”.
Any help would be appreciated, thanks.
Edit:
I get some strange output using this SQL:
SELECT t.empID, t.timesheet, r.rate AS RateBilled
FROM Rates AS r, timesheet AS t
WHERE r.empid=t.empid And t.timeSheet>=r.promotionDate
GROUP BY t.empID, t.timesheet, r.rate, r.promotionDate
HAVING r.promotionDate=MAX(r.promotionDate);
As you can see, RateBilled in the output table for empID 1 switches back and forth between 10 and 15, even though after May 1st it should always be 15.
Any help is appreciated, thanks.
The select subquery you have set up can return multiple values where only one is allowed. Consider the case where there are two promotions and a recent timesheet: the select returns two values because the timesheet is newer than both promotions.
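Try using the following as your subquery. It gives the inner table its own alias, matches on empID, and keeps only the most recent promotion on or before the timesheet date. With this subquery the outer query only needs FROM timesheet AS t; joining Rates there as well would repeat each timesheet once per rate row.
(SELECT TOP 1 r2.rate FROM Rates AS r2
WHERE r2.empID = t.empID AND r2.promotionDate <= t.timeSheet
ORDER BY r2.promotionDate DESC)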
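N.B. I don't think the above is terribly efficient. Instead, try finding the latest applicable promotion date per timesheet first and then matching it back to Rates to get the rate, something like
SELECT t.empID, t.timeSheet, t.hours, r.rate AS RateBilled
FROM timesheet AS t, Rates AS r,
    (SELECT t2.empID, t2.timeSheet, MAX(r2.promotionDate) AS lastPromotion
     FROM timesheet AS t2, Rates AS r2
     WHERE r2.empID = t2.empID AND r2.promotionDate <= t2.timeSheet
     GROUP BY t2.empID, t2.timeSheet) AS p
WHERE p.empID = t.empID AND p.timeSheet = t.timeSheet
    AND r.empID = t.empID AND r.promotionDate = p.lastPromotion;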
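The "latest promotion on or before the timesheet date" rule can be checked against the question's sample rows with a small sketch. It assumes SQLite via Python's `sqlite3` instead of Access (so `LIMIT 1` replaces `TOP 1`), reads the question's dates as day/month/year, and renames the timesheet date column to the hypothetical `sheetDate` to avoid clashing with the table name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Rates (empID text, rate integer, promotionDate text);
create table TimeSheet (empID text, sheetDate text, hours integer);
insert into Rates values ('001', 10, '2013-01-01'),
                         ('001', 15, '2013-05-01'),
                         ('002', 10, '2013-01-01');
insert into TimeSheet values ('001', '2013-01-02', 5),
                             ('001', '2013-05-02', 5),
                             ('002', '2013-01-02', 7);
""")

query = """
select t.empID, t.sheetDate, t.hours,
       (select r.rate
        from Rates r
        where r.empID = t.empID
          and r.promotionDate <= t.sheetDate
        order by r.promotionDate desc
        limit 1) as rateBilled
from TimeSheet t
order by t.empID, t.sheetDate
"""
# Each timesheet picks the rate from the most recent promotion not after it.
billed = conn.execute(query).fetchall()
```

Employee 001 is billed at 10 for the January timesheet and at 15 for the one in May, and each timesheet appears exactly once.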

TERADATA: Aggregate across multiple tables

Consider the following query where aggregation happens across two tables: Sales and Promo and the aggregate values are again used in a calculation.
SELECT
sales.article_id,
avg((sales.euro_value - ZEROIFNULL(promo.euro_value)) / NULLIFZERO(sales.qty - ZEROIFNULL(promo.qty)))
FROM
( SELECT
sales.article_id,
sum(sales.euro_value),
sum(sales.qty)
from SALES_TABLE sales
where year >= 2011
group by article_id
) sales
LEFT OUTER JOIN
( SELECT
promo.article_id,
sum(promo.euro_value),
sum(promo.qty)
from PROMOTION_TABLE promo
where year >= 2011
group by article_id
) promo
ON sales.article_id = promo.article_id
GROUP BY sales.article_id;
Some notes on the query:
Both inner queries return a huge number of rows because of the large number of articles. According to EXPLAIN on Teradata, the inner queries themselves take very little time, but the join takes a long time.
Assume primary key on article_id is present and both the tables are partitioned by year.
Left Outer Join because second table contains optional data.
So, can you suggest a better way of writing this query? Thanks for reading this far :)
Not really sure how the avg function got into the mix, so I'm removing it.
SELECT article_id,
    (SUM(sales_value) - SUM(promo_value)) /
    NULLIFZERO(SUM(sales_qty) - SUM(promo_qty))  -- avoid division by zero, as in the original query
FROM (
    SELECT
        article_id,
        sum(euro_value) AS sales_value,
        sum(qty) AS sales_qty,
        0 AS promo_value,
        0 AS promo_qty
    from SALES_TABLE sales
    where year >= 2011
    group by article_id
    UNION ALL
    SELECT
        article_id,
        0 AS sales_value,
        0 AS sales_qty,
        sum(euro_value) AS promo_value,
        sum(qty) AS promo_qty
    from PROMOTION_TABLE promo  -- the second branch reads the promotion table
    where year >= 2011
    group by article_id
) AS comb
GROUP BY article_id;

Sum Until Value Reached - Teradata

In Teradata, I need a query to first identify all members in the MEM TABLE that currently have a negative balance, let's call that CUR_BAL. Then, for all of those members only, sum all transactions from the TRAN TABLE in order by date until the sum of those transactions is equal to the CUR_BAL.
Editing to add a third ADJ table that contains MEM_NBR, ADJ_DT and ADJ_AMT that need to be included in the running total in order to capture all of the records.
I would like the outcome to include the MEM.MEM_NBR, MEM.CUR_BAL, TRAN.TRAN_DATE OR ADJ.ADJ_DT (date associated with the transaction that resulted in the running total to equal CUR_BAL), MEM.LST_UPD_DT. I don't need to know if the balance is negative as a result of a transaction or adjustment, just the date that it went negative.
Thank you!
select
    mem_nbr,
    cur_bal,
    tran_date,
    tran_type,
    lst_upd_dt
from (
    select
        a.mem_nbr,
        a.cur_bal,
        b.tran_date,
        b.tran_type,
        a.lst_upd_dt,
        -- running total of transactions and adjustments per member, in date order
        sum(b.tran_amt) over (partition by b.mem_nbr order by b.tran_date rows between unbounded preceding and current row) as cumulative_bal
    from mem a
    inner join (
        select
            mem_nbr,
            tran_date,
            tran_amt,
            'Tran' as tran_type
        from tran
        union all
        select
            mem_nbr,
            adj_dt,
            adj_amt,
            'Adj' as tran_type
        from adj
    ) b
        on a.mem_nbr = b.mem_nbr
    where a.cur_bal < 0
    qualify cumulative_bal < 0
) z
qualify rank() over (partition by mem_nbr order by tran_date) = 1
The subquery picks up all instances where the cumulative balance is negative, then the outer query picks up the earliest instance of it. If you want the latest, add desc after tran_date in the final qualify line.
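The two steps described above (a running total over the combined transactions and adjustments, then the earliest date on which it is negative) can be sketched in plain Python. The `first_negative_date` helper and the sample rows are hypothetical, and this only illustrates the logic for a single member, not the Teradata execution.

```python
from itertools import accumulate

def first_negative_date(events):
    """Return the earliest date whose running total is below zero, or None.

    events is an iterable of (iso_date, amount) pairs for one member.
    """
    ordered = sorted(events)  # ISO date strings sort chronologically
    running = accumulate(amount for _, amount in ordered)
    for (date, _), total in zip(ordered, running):
        if total < 0:
            return date
    return None

# Made-up rows for one member: transactions plus one adjustment.
trans = [("2021-01-05", 50), ("2021-02-10", -80), ("2021-03-01", -50)]
adjs = [("2021-01-20", 40)]

with_adjustments = first_negative_date(trans + adjs)
without_adjustments = first_negative_date(trans)
```

In this example, leaving out the adjustment moves the date earlier, which is why the adjustments are unioned in before the running total is computed.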