How to capture the average of multiple categories? - sql

I am trying to find the average number of purchases by buyer by store without surfacing buyer because there are millions.
I'm getting an error of invalid identifier trying to group by store and am not sure what I'm missing or if there's a better way to do this. The sample data looks like this, but with millions of records.
Purchase_ID | Buyer_ID | Store
------------+----------+------
abc         | 1a       | East
abd         | 1a       | East
abe         | 1b       | East
abf         | 1c       | West
abg         | 1c       | West
abh         | 1d       | South
abi         | 1e       | North
abj         | 1f       | North
And the ideal output would look like:
t.store | average_purchases_per_store
--------+----------------------------
East    | 1.5
West    | 2
South   | 1
North   | 1
Sample code:
SELECT t.store,AVG(T.distinct_purchases) as average_purchases_per_store
FROM
(SELECT COUNT(DISTINCT(purchase_id)) AS distinct_purchases
FROM table GROUP BY buyer) AS T GROUP BY t.store
Any help would be hugely appreciated.

Greg's answer is almost correct, but it drops the DISTINCT, so if a line repeats, the result is thrown off:
with T1(PURCHASE_ID,BUYER_ID, STORE) as (
select * from values
('abc','1a','East'),
('abc','1a','East'),
('abd','1a','East'),
('abe','1b','East'),
('abf','1c','West'),
('abg','1c','West'),
('abh','1d','South'),
('abi','1e','North'),
('abj','1f','North')
), BUYER_PURCHASES as (
select BUYER_ID
,STORE
,count(distinct PURCHASE_ID) as PURCHASES
from T1
group by 1,2
)
select STORE
,avg(PURCHASES) as average_purchases_per_store
from BUYER_PURCHASES
group by STORE
gives:
STORE | AVERAGE_PURCHASES_PER_STORE
------+----------------------------
East  | 1.5
West  | 2
North | 1
South | 1

You just need to aggregate to buyers and stores first, and from that intermediate result aggregate to store:
create or replace table T1(PURCHASE_ID string, BUYER_ID string, STORE string);
insert into T1 (PURCHASE_ID,BUYER_ID, STORE) values
('abc','1a','East'),
('abd','1a','East'),
('abe','1b','East'),
('abf','1c','West'),
('abg','1c','West'),
('abh','1d','South'),
('abi','1e','North'),
('abj','1f','North');
with BUYER_PURCHASES as
(
select BUYER_ID
,STORE
,count(*) as PURCHASES
from T1
group by BUYER_ID, STORE
)
select STORE
,avg(PURCHASES) as average_purchases_per_store
from BUYER_PURCHASES
group by STORE
;
Output:
STORE | AVERAGE_PURCHASES_PER_STORE
------+----------------------------
East  | 1.5
West  | 2
South | 1
North | 1
Note that you don't need to use the distinct keyword unless you have to filter out duplicate rows. If you do have duplicates, that should be addressed in the ETL/ELT process.
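To see the difference between the two answers on this point, here is a minimal sketch that runs the same two-step aggregation with both counters. It uses SQLite through Python's sqlite3 module rather than Snowflake (an assumption made purely so the example is runnable), and it adds one deliberately duplicated row to the question's sample data.

```python
import sqlite3

# Sample rows from the question, plus one deliberately duplicated row
# ('abc', '1a', 'East') to show when DISTINCT changes the result.
ROWS = [
    ("abc", "1a", "East"), ("abc", "1a", "East"), ("abd", "1a", "East"),
    ("abe", "1b", "East"), ("abf", "1c", "West"), ("abg", "1c", "West"),
    ("abh", "1d", "South"), ("abi", "1e", "North"), ("abj", "1f", "North"),
]

# Step 1 aggregates to (buyer, store); step 2 averages per store.
QUERY = """
with buyer_purchases as (
    select buyer_id, store, {counter} as purchases
    from t1
    group by buyer_id, store
)
select store, avg(purchases)
from buyer_purchases
group by store
"""

def average_purchases(counter):
    """Run the two-step aggregation with the given counting expression."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t1 (purchase_id text, buyer_id text, store text)")
    conn.executemany("insert into t1 values (?, ?, ?)", ROWS)
    result = dict(conn.execute(QUERY.format(counter=counter)).fetchall())
    conn.close()
    return result

with_distinct = average_purchases("count(distinct purchase_id)")
with_count_star = average_purchases("count(*)")
print(with_distinct)    # East stays at 1.5
print(with_count_star)  # East is inflated by the duplicated row
```

With clean data both counters agree; they only diverge once a purchase row is repeated.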

Hopefully this is enough to get you started. There are literally thousands of possible approaches, and depending on your dataset (you mentioned there are millions of rows) some may give you more flexibility or speed. The high-level approach is to reduce the number of rows as early as possible: the first count-distinct query should include as many predicates as you can to prevent any extra work. Hope this helps :-)
SELECT
STORE
,AVG(DISTINCT_STORE_PURCHASES) AVG_PURCHASES_PER_STORE
,AVG(DISTINCT_BUYER_PURCHASES) AVG_BUYER_PURCHASES_PER_STORE
FROM
(SELECT
STORE
, COUNT(DISTINCT PURCHASE_ID) OVER (PARTITION BY BUYER_ID) DISTINCT_BUYER_PURCHASES
, DIV0(COUNT(DISTINCT PURCHASE_ID) OVER (PARTITION BY STORE), COUNT(DISTINCT BUYER_ID) OVER (PARTITION BY STORE) ) DISTINCT_STORE_PURCHASES
FROM CTE)
GROUP BY
STORE ;

Related

percentile functions with GROUPBY in BigQuery

In my CENSUS table, I'd like to group by State, and for each State get the median county population and the number of counties.
In psql, redshift, and snowflake, I can do this:
psql=> SELECT state, count(county), PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "population2000") AS median FROM CENSUS GROUP BY state;
state | count | median
----------------------+-------+----------
Alabama | 67 | 36583
Alaska | 24 | 7296.5
Arizona | 15 | 116320
Arkansas | 75 | 20229
...
I'm trying to find a nice way to do this in standard BigQuery. I've noticed that there's undocumented percentile_cont analytical function available, but I have to do some major hacks to get it to do what I want.
I'd like to be able to do the same sort of thing with what I've gathered are the correct arguments:
SELECT
state,
COUNT(county),
PERCENTILE_CONT(population2000,
0.5) OVER () AS `medPop`
FROM
CENSUS
GROUP BY
state;
but this query yields the error
SELECT list expression references column population2000 which is neither grouped nor aggregated at
I can get the answer I want, but I'd be very disappointed if this is the recommended way to do what I want to do:
SELECT
MAX(nCounties) AS nCounties,
state,
MAX(medPop) AS medPop
FROM (
SELECT
nCounties,
T1.state,
(PERCENTILE_CONT(population2000,
0.5) OVER (PARTITION BY T1.state)) AS `medPop`
FROM
census T1
LEFT OUTER JOIN (
SELECT
COUNT(county) AS `nCounties`,
state
FROM
census
GROUP BY
state) T2
ON
T1.state = T2.state) T3
GROUP BY
state
Is there a better way to do what I want to do? Also, is the PERCENTILE_CONT function ever going to be documented?
Thanks for reading!
Thanks for your interest. PERCENTILE_CONT is under development, and we will publish the documentation once it is GA. We will support it as analytic function first, and we plan to support it as aggregate function (allowing GROUP BY) later. Between these 2 releases, a simpler workaround would be
SELECT
state,
ANY_VALUE(nCounties) AS nCounties,
ANY_VALUE(medPop) AS medPop
FROM (
SELECT
state,
COUNT(county) OVER (PARTITION BY state) AS nCounties,
PERCENTILE_CONT(population2000,
0.5) OVER (PARTITION BY state) AS medPop
FROM
CENSUS)
GROUP BY
state
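For readers who want to check what the query above should return, here is a small sketch of the same per-state calculation in plain Python. It assumes the usual PERCENTILE_CONT definition (linear interpolation between the two nearest ordered values); the function names and the sample rows are hypothetical and are not taken from the CENSUS table.

```python
from collections import defaultdict

def percentile_cont(values, fraction):
    """Linearly interpolated percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)   # fractional index into ordered
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])

def county_stats(rows):
    """Map each state to (number of counties, median county population)."""
    by_state = defaultdict(list)
    for state, _county, population in rows:
        by_state[state].append(population)
    return {s: (len(p), percentile_cont(p, 0.5)) for s, p in by_state.items()}

# Hypothetical rows: (state, county, population2000)
sample = [
    ("A", "c1", 10), ("A", "c2", 20), ("A", "c3", 40),
    ("B", "c1", 5), ("B", "c2", 8),
]
stats = county_stats(sample)
print(stats)
```

An even number of counties gives a value halfway between the two middle populations, which matches the 7296.5 shown for Alaska in the psql output.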

Select top three records grouping by two factors

I am trying to identify the three records with the highest values grouped by two factors. I realize this question is similar to this one PostgreSQL: select top three in each group, but I can't figure out how to generalize from this example which includes a single factor, to two factors. I have tried searching stack overflow for an answer to this question beyond the one listed above and I can't find one, but perhaps I'm not searching for the correct terms.
Briefly, I'm connecting to a table with the following schema
city, country, value
I only have a single row per city, country combination, but the number of city entries I have per country is variable. For example, I have a few dozen cities for Canada, a hundred for the United States, but only two for Uzbekistan.
What I want as output is a table with the same schema, but only containing the rows with the highest three values for city, nested within country. For example, if Canada has the cities and values of
{Canada, toronto, 100}, {Canada, vancouver, 80},
{Canada, montreal,112}, {Canada, calgary, 109},
{Canada, edmonton, 76}, {Canada, winnipeg, 73},
and the United States has the entries of
{us, nyc, 104}, {us, chicago, 87},
{us, boston, 98}, {us, seattle, 105},
{us, sanfran, 88}, {us, minneapolis, 84},
{us, miami, 103}, {us, houston, 112},
{us, dallas, 78}, {us, tucson, 83}
and Uzbekistan has the entries of
{uzbekistan, qarshi, 95}, {uzbekistan, gluiston, 101}
What I would like as output would be
Canada, montreal, 112
Canada, calgary, 109
Canada, toronto, 100
us, houston, 112
us, seattle, 105
us, nyc, 104
uzbekistan, gluiston, 101
uzbekistan, qarshi, 95
I've tried the following query
SELECT logincity, logincountry, VAL
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY logincountry, logincity ORDER BY
val DESC) AS Row_ID
FROM a_table)
WHERE Row_ID < 4
ORDER BY logincity
But I end up with more than three cities per country.
Can someone help me out?
Thanks Stack Overflow!
I think you only need to partition by logincountry:
SELECT logincity, logincountry, VAL
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY logincountry
ORDER BY val DESC) AS Row_ID
FROM a_table ) T
WHERE Row_ID < 4
ORDER BY logincity
TIP: You will probably spot the problem if you include Row_ID in the SELECT:
SELECT logincity, logincountry, VAL, Row_ID
In your query every Row_ID = 1, because each (logincountry, logincity) partition contains a single row.
TIP 2: You want the top 3 cities for each country, so the only partition column is the country. The linked question is therefore the right answer: top 3 of each group, where the group in this case is the country.
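As a quick check of the partitioning advice, the sketch below loads the question's sample rows into an in-memory SQLite database and partitions by country only. SQLite (via Python's sqlite3, and needing SQLite 3.25 or newer for window functions) is an assumption standing in for the asker's database; the table and column names follow the question.

```python
import sqlite3

# (logincity, logincountry, val) rows from the question.
ROWS = [
    ("toronto", "Canada", 100), ("vancouver", "Canada", 80),
    ("montreal", "Canada", 112), ("calgary", "Canada", 109),
    ("edmonton", "Canada", 76), ("winnipeg", "Canada", 73),
    ("nyc", "us", 104), ("chicago", "us", 87), ("boston", "us", 98),
    ("seattle", "us", 105), ("sanfran", "us", 88), ("minneapolis", "us", 84),
    ("miami", "us", 103), ("houston", "us", 112), ("dallas", "us", 78),
    ("tucson", "us", 83),
    ("qarshi", "uzbekistan", 95), ("gluiston", "uzbekistan", 101),
]

def top_three_per_country(rows):
    """Return up to three highest-valued cities for each country."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table a_table (logincity text, logincountry text, val integer)"
    )
    conn.executemany("insert into a_table values (?, ?, ?)", rows)
    query = """
        select logincountry, logincity, val
        from (
            select *, row_number() over (
                partition by logincountry   -- one partition per country
                order by val desc
            ) as row_id
            from a_table
        ) as t
        where row_id < 4
        order by logincountry, val desc
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

top = top_three_per_country(ROWS)
for row in top:
    print(row)
```

Countries with fewer than three cities, such as Uzbekistan here, simply return all the rows they have.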

weighted ranking/ combined score in Google Big Query

...Spent several hours trying what not and researching this forum. Quite pessimistic at this point about the usefulness of Google Big Query (GBQ) for anything more than trivial queries, but here is one last desperate try, maybe someone has better ideas:
Let's say we have a COUNTRY table with average population weight(in kilograms) and height (in meters) per country as follows:
country | continent | weight | height |
============================================
US | America | 200 | 2.00 |
Canada | America | 170 | 1.90 |
France | Europe | 160 | 1.78 |
Germany | Europe | 110 | 2.00 |
Let's say you want to pick out and live in the European country with "smallest" people, where you define the measure "smallness" as the weighted sum of body weight and height with some constant weights, such as 0.6 for body weight and 0.4 for body height.
In Oracle or MS SQL server this can be done elegantly and compactly by using analytic window functions such as rank() and row_number(), for example:
select country, combined_score
from (select
country
,( 0.6*rank() over (order by weight) + 0.4*rank() over (order by height) ) combined_score
from country
where continent = 'Europe')
order by combined_score
Note that the ranking is done after the filtering for continent. The continent filter is dynamic (say input from a web form), so the ranking can not be pre-calculated and stored in the table in advance!
In GBQ there are no rank() , row_number() or over(). Even if you try some "poor man" hacks it is still not going to work because GBQ does not support correlated queries. Here are similar attempts by other people with some pretty unsatisfactory and inefficient results:
BigQuery SQL running totals
Row number in BigQuery?
Any ideas how this can be done? I can even restructure the data to use nested records, if it helps. Thank you in advance!
In your specific example, I think you can compute the result without using RANK and OVER at all:
SELECT country, score
FROM (SELECT country, 0.6 * weight + 0.4 * height AS score
FROM t WHERE continent = 'Europe')
ORDER BY score;
However, I'm assuming that this is a toy example and that your real problem involves use of RANK more in line with your example query. In that case, BigQuery does not yet support analytic functions directly, but we'll consider this to be a feature request. :-)
A close substitute for RANK in BigQuery is ROW_NUMBER(); the two differ only in how ties are numbered.
For example, the top 5 contributors to Wikipedia, with row_number giving their place:
SELECT
ROW_NUMBER() OVER() row_number,
contributor_username,
count,
FROM (
SELECT contributor_username, COUNT(*) count,
FROM [publicdata:samples.wikipedia]
GROUP BY contributor_username
ORDER BY COUNT DESC
LIMIT 5)
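For reference, the combined-rank query from the question can be tried out on the four sample rows with the sketch below. It runs in SQLite through Python's sqlite3 (SQLite 3.25 or newer), which is an assumption chosen for runnability and says nothing about what BigQuery supports; the function name and the `scored` alias are made up for the example.

```python
import sqlite3

# (country, continent, weight, height) rows from the question.
COUNTRIES = [
    ("US", "America", 200, 2.00),
    ("Canada", "America", 170, 1.90),
    ("France", "Europe", 160, 1.78),
    ("Germany", "Europe", 110, 2.00),
]

def smallest_first(continent):
    """Rank countries of one continent by 0.6*weight rank + 0.4*height rank."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table country "
        "(country text, continent text, weight real, height real)"
    )
    conn.executemany("insert into country values (?, ?, ?, ?)", COUNTRIES)
    query = """
        select country, combined_score
        from (
            select country,
                   0.6 * rank() over (order by weight)
                 + 0.4 * rank() over (order by height) as combined_score
            from country
            where continent = ?      -- the filter is applied before ranking
        ) as scored
        order by combined_score
    """
    result = conn.execute(query, (continent,)).fetchall()
    conn.close()
    return result

europe = smallest_first("Europe")
print(europe)
```

Because the continent filter sits inside the subquery, the ranks are computed only over the filtered rows, which is the behaviour the question asks for.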

SQL Selecting distinct rows from multiple columns based on max value in one column

This is my SQL View - lets call it MyView :
ECode SHCode TotalNrShare CountryCode Country
000001 +00010 100 UKI United Kingdom
000001 ABENSO 900 USA United States
000355 +00012 1000 ESP Spain
000355 000010 50 FRA France
000042 009999 10 GER Germany
000042 +00012 999 ESP Spain
000787 ABENSO 500 USA United States
000787 000150 500 ITA Italy
001010 009999 100 GER Germany
I would like to return the single row with the highest number in the column TotalNrShare for each ECode.
For example, I’d like to return these results from the above view:
ECode SHCode TotalNrShare CountryCode Country
000001 ABENSO 900 USA United States
000355 +00012 1000 ESP Spain
000042 +00012 999 ESP Spain
000787 ABENSO 500 USA United States
001010 009999 100 GER Germany
(Note: in the case of ECode 000787, where there are two SHCodes with 500 each, we can just return the first row rather than both. It isn't important to me which row is returned, since this will happen very rarely and my analysis doesn't need to be 100% accurate.)
I've tried various things but do not seem to be able to return either unique results or the additional country code/country info that I need.
This is one of my attempts (based on other solutions on this site, but I am doing something wrong):
SELECT tsh.ECode, tsh.SHCode, tsh.TotalNrShare, tsh.CountryCode, tsh.Country
FROM dbo.MyView AS tsh INNER JOIN
(SELECT DISTINCT ECode, MAX(TotalNrShare) AS MaxTotalSH
FROM dbo.MyView
GROUP BY ECode) AS groupedtsh ON tsh.ECode = groupedtsh.ECode AND tsh.TotalNrShare = groupedtsh.MaxTotalSH
WITH
sequenced_data AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY ECode ORDER BY TotalNrShare DESC) AS sequence_id
FROM
myView
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
This should, however, give the same results as your example query. It's simply a different approach to accomplish the same thing.
As you say that something is wrong, however, please could you elaborate on what is going wrong? Is TotalNrShare actually a string for example? And is that messing up your ordering (and so the MAX())?
EDIT:
Even if the above code was not compatible with your SQL Server, it shouldn't crash it out completely. You should just get an error message. Try executing Select * By Magic, for example, and it should just give an error. I strongly suggest getting your installation of Management Studio looked at and/or re-installed.
In terms of an alternative, you could do this...
SELECT
*
FROM
(SELECT ECode FROM MyView GROUP BY ECode) AS base
CROSS APPLY
(SELECT TOP 1 * FROM MyView WHERE ECode = base.ECode ORDER BY TotalNrShare DESC) AS data
Ideally you would replace the base sub-query with a table that already has a distinct list of all the ECodes that you are interested in.
Try this:
with cte as(
SELECT tsh.ECode, tsh.SHCode, tsh.TotalNrShare, tsh.CountryCode, tsh.Country,
ROW_NUMBER() over (partition by tsh.ECode order by tsh.TotalNrShare desc) as row_num
FROM dbo.MyView AS tsh)
select * from cte where row_num=1
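The practical difference between joining on MAX() and picking row number 1 shows up on the tied ECode 000787. The sketch below is a hedged illustration in SQLite via Python's sqlite3 (SQLite 3.25 or newer), not SQL Server; a plain table stands in for the view, and the extra `SHCode` tie-breaker in the ORDER BY is an addition made only to keep the output deterministic.

```python
import sqlite3

# Rows copied from the question; a plain table stands in for the view.
MY_VIEW = [
    ("000001", "+00010", 100, "UKI", "United Kingdom"),
    ("000001", "ABENSO", 900, "USA", "United States"),
    ("000355", "+00012", 1000, "ESP", "Spain"),
    ("000355", "000010", 50, "FRA", "France"),
    ("000042", "009999", 10, "GER", "Germany"),
    ("000042", "+00012", 999, "ESP", "Spain"),
    ("000787", "ABENSO", 500, "USA", "United States"),
    ("000787", "000150", 500, "ITA", "Italy"),
    ("001010", "009999", 100, "GER", "Germany"),
]

# The question's approach: join back on the per-ECode maximum.
JOIN_ON_MAX = """
    select tsh.ECode, tsh.SHCode, tsh.TotalNrShare
    from MyView as tsh
    inner join (
        select ECode, max(TotalNrShare) as MaxTotalSH
        from MyView
        group by ECode
    ) as g on tsh.ECode = g.ECode and tsh.TotalNrShare = g.MaxTotalSH
"""

# Window-function approach: number rows per ECode, highest share first.
ROW_NUMBER_PICK = """
    select ECode, SHCode, TotalNrShare
    from (
        select *, row_number() over (
            partition by ECode
            order by TotalNrShare desc, SHCode  -- SHCode only breaks ties
        ) as sequence_id
        from MyView
    ) as sequenced
    where sequence_id = 1
"""

def run(query):
    """Load the sample rows into a fresh database and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table MyView (ECode text, SHCode text, TotalNrShare integer, "
        "CountryCode text, Country text)"
    )
    conn.executemany("insert into MyView values (?, ?, ?, ?, ?)", MY_VIEW)
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows

joined = run(JOIN_ON_MAX)
picked = run(ROW_NUMBER_PICK)
print(len(joined), len(picked))
```

The join returns both tied rows for 000787, while the row-number version returns exactly one row per ECode. Which of the tied rows it keeps depends on the tie-breaker, which the question says does not matter.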

SQL SUM with Repeating Sub Entries - Best Practice?

I hit this issue regularly but here is an example....
I have Order and Delivery tables. Each order can have one to many deliveries.
I need to report totals based on the Order Table but also show deliveries line by line.
I can write the SQL and associated Access Report for this with ease ....
SELECT xxx
FROM
Order
LEFT OUTER JOIN
Delivery on Delivery.OrderNO = Order.OrderNo
until I get to the summing element. I obviously only want to sum each Order once, not the 1-many times there are deliveries for that order.
e.g. The SQL might return the following based on 2 Orders (ignore the banalness of the report, this is very much simplified)
Region OrderNo Value Delivery Date
North 1 £100 12-04-2012
North 1 £100 14-04-2012
North 2 £73 01-05-2012
North 2 £73 03-05-2012
North 2 £73 07-05-2012
South 3 £50 23-04-2012
I would want to report:
Total Sales North - £173
Delivery 12-04-2012
Delivery 14-04-2012
Delivery 01-05-2012
Delivery 03-05-2012
Delivery 07-05-2012
Total Sales South - £50
Delivery 23-04-2012
The bit I'm referring to is the calculation of the £173 and £50, the first of which obviously shouldn't be £419!
In the past I've used things like MAX (for a given Order) but that seems like a fudge.
Surely there must be a regular answer to this seemingly common problem but I can't find one.
I don't necessarily need the code - just a helpful point in the right direction.
Many thanks,
Chris.
A ROLLUP operator may not look pretty. However, it would do the regular aggregates that you see now, and it would also show the subtotals of the order. This is what you're looking for.
SELECT xxx
FROM
Order
LEFT OUTER JOIN
Delivery on Delivery.OrderNO = Order.OrderNo
GROUP BY xxx
WITH ROLLUP;
I'm not exactly sure how the rest of your query is set up, but it would look something like this:
Region OrderNo Value Delivery Date
North 1 £100 12-04-2012
North 1 £100 14-04-2012
North 2 £73 01-05-2012
North 2 £73 03-05-2012
North 2 £73 07-05-2012
NULL NULL £419 NULL
I believe what you want is called a windowing function for your aggregate operation. It looks like the following:
SELECT xxx, SUM(Value) OVER (PARTITION BY Order.Region) as OrderTotal
FROM
Order
LEFT OUTER JOIN
Delivery on Delivery.OrderNO = Order.OrderNo
Here's the MSDN article. The PARTITION BY tells the SUM to be done separately for each distinct Order.Region.
Edit: I just noticed that I missed what you said about orders being counted multiple times. One thing you could do is SUM() the values before joining, as a CTE (guessing at your schema a bit):
WITH RegionOrders AS (
SELECT Region, OrderNo, Value, SUM(Value) OVER (PARTITION BY Region) AS RegionTotal
FROM Order
)
SELECT Region, OrderNo, Value, DeliveryDate, RegionTotal
FROM RegionOrders RO
INNER JOIN Delivery D on D.OrderNo = RO.OrderNo
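To make the double counting concrete, here is a small sketch of the "sum before joining" idea using the figures from the question. It runs in SQLite through Python's sqlite3 rather than Access or SQL Server, and the table names `orders` and `delivery`, the column names, and the ISO date strings are all assumptions made for the example.

```python
import sqlite3

# Figures from the question: (order number, region, value in pounds).
ORDERS = [(1, "North", 100), (2, "North", 73), (3, "South", 50)]
# (order number, delivery date)
DELIVERIES = [
    (1, "2012-04-12"), (1, "2012-04-14"),
    (2, "2012-05-01"), (2, "2012-05-03"), (2, "2012-05-07"),
    (3, "2012-04-23"),
]

# Summing after the join counts each order once per delivery.
NAIVE = """
    select o.region, sum(o.value)
    from orders o
    left join delivery d on d.orderno = o.orderno
    group by o.region
"""

# Summing per region first, then attaching the delivery lines.
PRE_AGGREGATED = """
    with region_totals as (
        select region, sum(value) as region_total
        from orders
        group by region
    )
    select o.region, rt.region_total, d.deliverydate
    from orders o
    join region_totals rt on rt.region = o.region
    left join delivery d on d.orderno = o.orderno
    order by o.region, d.deliverydate
"""

def run(query):
    """Build the two sample tables and run one query against them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table orders (orderno integer, region text, value integer)")
    conn.execute("create table delivery (orderno integer, deliverydate text)")
    conn.executemany("insert into orders values (?, ?, ?)", ORDERS)
    conn.executemany("insert into delivery values (?, ?)", DELIVERIES)
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows

naive_totals = dict(run(NAIVE))
detail_rows = run(PRE_AGGREGATED)
print(naive_totals)
for row in detail_rows:
    print(row)
```

The naive version reports £419 for North, while the pre-aggregated version carries the correct £173 on every delivery line, so the report can print the total once per region and list the deliveries underneath.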