How to NTILE over distinct values in BigQuery?

I have a query that I'm trying to put together in Google BigQuery that would decile sales for each customer. The problem I'm running into is that if a decile breaks at the point where many customers have the same sales value, they could end up in different deciles despite having the same sales.
For example, if there were 20 customers in total, and one spent $100, 18 spent $50, and one spent $25, the 18 customers who spent $50 would still be spread across all the deciles because NTILE creates equal-sized groups, whereas I would want them all placed in the same decile.
The data that I'm using is obviously a bit more complex -- there are about 10 million customers, and the sales are deciled within a particular group to which each customer belongs.
Example code:
NTILE(10) OVER (PARTITION BY customer_group ORDER BY yearly_sales asc) as current_sales_decile
The NTILE function works, but I just run into the problem described above and haven't figured out how to fix it. Any suggestions welcome.

Calculate the ntile yourself:
select ceiling(rank() over (partition by customer_group order by yearly_sales) * 10.0 /
               count(*) over (partition by customer_group)
       ) as current_sales_decile
from t  -- t stands in for your customer table
This gives you more control over how the tiles are formed. In particular, all rows with the same value go in the same tile.
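Walking through the example above by hand (a hypothetical check, not output from real data): the $25 customer has rank 1, so ceiling(1 * 10.0 / 20) = 1; all 18 customers at $50 share rank 2, so each gets ceiling(2 * 10.0 / 20) = 1; the $100 customer has rank 20, so ceiling(20 * 10.0 / 20) = 10. The tied customers land in the same decile, at the cost of the deciles no longer being equal-sized.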

Related

SQL calculating running total as you go down the rows but also taking other fields into account

I'm hoping you can help with this problem.
I have a set of data which I have displayed via Excel.
I'm trying to work out the rolling new cap allowance, but I need to deduct the previous weeks' bookings, and I don't want to use a cursor.
I'm grouping by the product id, so the calculation will need to start afresh for every product.
In the image, columns A to D are fixed and I am trying to calculate the data in column E ('New Cap'); the 'New Cap' values shown are the expected results.
Column F gives a detailed formula of what I'm trying to do.
Thanks
Update:
The formula is: the new cap for a week is the running sum of caps up to and including that week, minus the running sum of bookings through the previous week.
You want the sum of the cap through this row minus the sum of booked through the previous row. This is easy to do with window functions:
select t.*,
       (sum(cap - booked) over (partition by productid order by weekbeg) + booked
       ) as new_cap
from t;
You can get the new running total using lag and sum over window functions: first calculate cap minus the previous week's booked (using lag), then use sum() over () for the running total:
select weekbeg, ProductId, Cap, Booked,
       Sum(n) over (partition by productid order by weekbeg) as New_Cap
from (
      select *,
             Cap - Lag(Booked, 1, 0) over (partition by productid order by weekbeg) as n
      from t
     ) t;
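The original screenshot isn't available, but a small hypothetical example (a cap of 100 every week, column names as in the queries above) shows what both queries compute:
weekbeg   cap   booked   new_cap
week 1    100    80       100    (100 - 0)
week 2    100    90       120    (200 - 80)
week 3    100    70       130    (300 - 170)
Each week's new_cap is the caps accumulated so far minus the bookings accumulated through the previous week.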

SQL bucket based on totals

I have a list of retailers and their total sales. I want to bucket them into 4 categories based on their total sales. I want to show that 10% of retailers cover 70% of sales.
In the example below, I am trying to divide the retailers into 4 quantiles.
In this case, total sales for all 10 retailers is 4500. To divide these retailers into 4 quantiles, I have sorted the data by sales from high to low and assigned each retailer a quantile.
The sum of sales for the retailers in each quantile should be around 4500/4 = 1125.
How can I replicate this logic in SQL?
If I understand correctly, you can use cumulative sums and some arithmetic. I think this does what you want.
select t.*,
       ceiling(running_total * 4.0 / total_total) as quantile
from (select t.*,
             sum(total) over (order by total desc) as running_total,
             sum(total) over () as total_total
      from t
     ) t;
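The question's sample data didn't survive, but with hypothetical totals that sum to 4500 (sorted high to low), the arithmetic works out like this, with each quantile boundary at a multiple of 4500/4 = 1125:
total   running_total   ceiling(running_total * 4.0 / 4500)
 700         700            1
 650        1350            2
 600        1950            2
 550        2500            3
 500        3000            3
 450        3450            4
 400        3850            4
 350        4200            4
 200        4400            4
 100        4500            4
The sales covered by each quantile (700, 1250, 1050, 1500) land close to the 1125 target, though ties and very large individual retailers can make the buckets uneven.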

Optimize Average of Averages SQL Query

I have a table where each row is a vendor with a sale made on some date.
I'm trying to compute average daily sales per vendor for the year 2019 and get a single number, which I think means computing an average of averages.
This is the query I'm considering, but it takes a very long time on this large table. Is there a smarter way to compute this average without this much nesting? I have a feeling I'm scanning rows more times than I need to.
-- Average of all vendors' average daily sale counts
SELECT AVG(vendor_avgs.avg_daily_sales) avg_of_avgs
FROM (
    -- Get the average number of daily sales for each vendor
    SELECT vendor_daily_totals.vendorid,
           AVG(vendor_daily_totals.cnt) avg_daily_sales
    FROM (
        -- Get the total number of sales per vendor per day
        SELECT vendorid, COUNT(*) cnt
        FROM vendor_sales
        WHERE year = 2019
        GROUP BY vendorid, month, day
    ) vendor_daily_totals
    GROUP BY vendor_daily_totals.vendorid
) vendor_avgs;
I'm curious if there is in general a way to compute an average of averages more efficiently.
This is running in Impala, by the way.
I think you can just do the calculation in one shot:
SELECT AVG(t.avgs)
FROM (
    SELECT vendorid,
           COUNT(*) * 1.0 / COUNT(DISTINCT month, day) as avgs
    FROM vendor_sales
    WHERE year = 2019
    GROUP BY vendorid
) t
This gets the total and divides by the number of days. However, COUNT(DISTINCT) might be even slower than nested GROUP BYs in Impala, so you need to test this.
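If an approximate day count is acceptable, one option worth testing is Impala's NDV(), a fast HyperLogLog-based estimate of COUNT(DISTINCT). NDV() takes a single expression, so the month and day have to be folded into one value; this is a sketch under the assumption that month and day are numeric columns, not a tested solution:
SELECT AVG(t.avgs)
FROM (
    SELECT vendorid,
           -- NDV() is Impala's approximate COUNT(DISTINCT);
           -- month * 100 + day folds the pair into one expression
           -- (valid because day is always < 100)
           COUNT(*) * 1.0 / NDV(month * 100 + day) AS avgs
    FROM vendor_sales
    WHERE year = 2019
    GROUP BY vendorid
) t;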

SQL Server - How to calculate # of entities to hit 80% of sum total?

I have a list of companies, their industry, and their annual revenue.
I need to partition the list by industry and figure out how many companies in each industry it takes to account for 80% of the industry's total revenue.
I can run the partition, I can figure out what 80% of each industry's revenue is, but I have zero idea how to figure out how many companies it takes to hit 80%. My only idea is to pull a list for each industry, sort revenue high to low, and sum down until I hit the 80% number.
Are there any built-in functions or clever approaches that can help me here?
Thanks!
I would use window functions:
select industry, count(*)
from (select t.*,
             sum(revenue) over (partition by industry order by revenue desc) as running_revenue,
             sum(revenue) over (partition by industry) as total_revenue
      from t
     ) t
where running_revenue - revenue < 0.8 * total_revenue
group by industry;
The where clause includes every company up to and including the first one that passes the 80% threshold.
There are other functions such as ntile() and percentile_cont() that can be used, but I find it simplest to do the calculation directly using sum().
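As a sanity check with made-up numbers: an industry with revenues 50, 30, 10, 10 has total_revenue = 100, so 0.8 * total_revenue = 80. The running totals are 50, 80, 90, 100, and running_revenue - revenue gives 0, 50, 80, 90; only the first two rows satisfy < 80, so the count is 2, and indeed the top two companies (50 + 30 = 80) account for exactly 80% of revenue.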

DB2 - Ranking data by timeframe

I am trying to write a report (DB2 9.5 on Solaris) to do the following:
I have a set of data; let's say it's an order table. I want to run a report which will give me, for each month, the number of orders per customer, and their "rank" that month. The rank would be based on the number of orders.
I was playing around with RANK() OVER clauses, but I can't seem to get it to give me a rank per month (or other "group by"). If there are 100 customers and 12 months of data, I would expect 1,200 rows in the report: 100 per month, each with a rank between 1 and 100.
Let me know if more detail would be helpful. Thanks in advance.
The solution is to use the PARTITION BY clause.
For example, see page 5 here: http://cmsaville.ca/documents/MiscDocs/TopNQueries.pdf
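A sketch of what that looks like here, assuming an orders table with customer_id and order_date columns (the table and column names are illustrative, not from the question):
SELECT YEAR(order_date) AS order_year,
       MONTH(order_date) AS order_month,
       customer_id,
       COUNT(*) AS order_count,
       -- PARTITION BY restarts the ranking for every month,
       -- so each customer is ranked against that month's peers
       RANK() OVER (PARTITION BY YEAR(order_date), MONTH(order_date)
                    ORDER BY COUNT(*) DESC) AS month_rank
FROM orders
GROUP BY YEAR(order_date), MONTH(order_date), customer_id;
With 100 customers and 12 months of data this produces the expected 1,200 rows, one rank per customer per month.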