Select top three records grouping by two factors - sql

I am trying to identify the three records with the highest values, grouped by two factors. I realize this question is similar to "PostgreSQL: select top three in each group", but I can't figure out how to generalize from that example, which uses a single factor, to two factors. I have searched Stack Overflow for an answer beyond the one listed above and can't find one, but perhaps I'm not searching for the correct terms.
Briefly, I'm connecting to a table with the following schema
city, country, value
I only have a single row per city, country combination, but the number of city entries per country varies. For example, I have a few dozen cities for Canada, a hundred for the United States, but only two for Uzbekistan.
What I want as output is a table with the same schema, but containing only the rows with the three highest values for city, nested within country. For example, if Canada has the cities and values of
{Canada, toronto, 100}, {Canada, vancouver, 80},
{Canada, montreal,112}, {Canada, calgary, 109},
{Canada, edmonton, 76}, {Canada, winnipeg, 73},
and the United States has the entries of
{{us, nyc, 104}, {us, chicago, 87},
{us, boston, 98}, {us, seattle, 105},
{us, sanfran, 88}, {us, minneapolis, 84},
{us, miami, 103}, {us, houston, 112},
{us, dallas, 78}, {us, tucson, 83}}
and Uzbekistan has the entries of
{uzbekistan, qarshi, 95}, {uzbekistan, gluiston, 101}
What I would like as output would be
Canada, Montreal, 112
Canada, Toronto, 100
Canada, Calgary, 109
us, houston, 112
us, seattle, 105
us, nyc, 104
uzbekistan, qarshi, 95
uzbekistan, gluiston, 101
I've tried the following query
SELECT logincity, logincountry, VAL
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY logincountry, logincity ORDER BY
val DESC) AS Row_ID
FROM a_table)
WHERE Row_ID < 4
ORDER BY logincity
But I end up with more than three cities per country.
Can someone help me out?
Thanks Stack Overflow!

I think you only need to partition by logincountry:
SELECT logincity, logincountry, VAL
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY logincountry
ORDER BY val DESC) AS Row_ID
FROM a_table ) T
WHERE Row_ID < 4
ORDER BY logincity
TIP: You will probably spot the problem if you include Row_ID in the SELECT:
SELECT logincity, logincountry, VAL, Row_ID
In your query every Row_ID = 1, because each (logincountry, logincity) partition holds a single row.
TIP 2: You want the top 3 cities for each country, so there is only one partition column: country. The linked question is therefore the right answer: top 3 of each group, where the group here is the country.
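To see the difference between the two PARTITION BY clauses, here is a small sketch that runs both against an in-memory SQLite table from Python. The table contents are a subset of the question's sample rows, the helper name top_three is made up for illustration, and it assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Hypothetical in-memory copy of a_table, filled with a subset of the question's rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a_table (logincountry TEXT, logincity TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO a_table VALUES (?, ?, ?)",
    [
        ("canada", "toronto", 100), ("canada", "vancouver", 80),
        ("canada", "montreal", 112), ("canada", "calgary", 109),
        ("us", "nyc", 104), ("us", "seattle", 105),
        ("us", "miami", 103), ("us", "houston", 112),
        ("uzbekistan", "qarshi", 95), ("uzbekistan", "gluiston", 101),
    ],
)

def top_three(partition_cols):
    """Return rows whose row number within the given partition is below 4."""
    query = f"""
        SELECT logincountry, logincity, val
        FROM (
            SELECT *, ROW_NUMBER() OVER (
                PARTITION BY {partition_cols} ORDER BY val DESC
            ) AS row_id
            FROM a_table
        ) AS t
        WHERE row_id < 4
        ORDER BY logincountry, val DESC
    """
    return conn.execute(query).fetchall()

# Partitioning by country alone keeps at most three cities per country.
per_country = top_three("logincountry")
# Partitioning by country AND city gives every row row_id = 1, so nothing is filtered.
per_country_city = top_three("logincountry, logincity")

print(per_country)
print(len(per_country_city))
```

With the second call every row survives the filter, which is the "more than three cities per country" symptom from the question.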

Related

How to capture the average of multiple categories?

I am trying to find the average number of purchases by buyer by store without surfacing buyer because there are millions.
I'm getting an error of invalid identifier trying to group by store and am not sure what I'm missing or if there's a better way to do this. The sample data looks like this, but with millions of records.
Purchase_ID  Buyer_ID  Store
-----------  --------  -----
abc          1a        East
abd          1a        East
abe          1b        East
abf          1c        West
abg          1c        West
abh          1d        South
abi          1e        North
abj          1f        North
And the ideal output would look like:
t.store  average_purchases_per_store
-------  ---------------------------
East     1.5
West     2
South    1
North    1
Sample code:
SELECT t.store,AVG(T.distinct_purchases) as average_purchases_per_store
FROM
(SELECT COUNT(DISTINCT(purchase_id)) AS distinct_purchases
FROM table GROUP BY buyer) AS T GROUP BY t.store
Any help would be hugely appreciated.
Greg's answer is almost correct, but he dropped the DISTINCT, so if a line repeats (as 'abc' does below), the count is wrong:
with T1(PURCHASE_ID,BUYER_ID, STORE) as (
select * from values
('abc','1a','East'),
('abc','1a','East'),
('abd','1a','East'),
('abe','1b','East'),
('abf','1c','West'),
('abg','1c','West'),
('abh','1d','South'),
('abi','1e','North'),
('abj','1f','North')
), BUYER_PURCHASES as (
select BUYER_ID
,STORE
,count(distinct PURCHASE_ID) as PURCHASES
from T1
group by 1,2
)
select STORE
,avg(PURCHASES) as average_purchases_per_store
from BUYER_PURCHASES
group by STORE
gives:
STORE  AVERAGE_PURCHASES_PER_STORE
-----  ---------------------------
East   1.5
West   2
North  1
South  1
You just need to aggregate to buyers and stores first, and from that intermediate result aggregate to store:
create or replace table T1(PURCHASE_ID string, BUYER_ID string, STORE string);
insert into T1 (PURCHASE_ID,BUYER_ID, STORE) values
('abc','1a','East'),
('abd','1a','East'),
('abe','1b','East'),
('abf','1c','West'),
('abg','1c','West'),
('abh','1d','South'),
('abi','1e','North'),
('abj','1f','North');
with BUYER_PURCHASES as
(
select BUYER_ID
,STORE
,count(*) as PURCHASES
from T1
group by BUYER_ID, STORE
)
select STORE
,avg(PURCHASES) as average_purchases_per_store
from BUYER_PURCHASES
group by STORE
;
Output:
STORE  AVERAGE_PURCHASES_PER_STORE
-----  ---------------------------
East   1.5
West   2
South  1
North  1
Note that you don't need to use the distinct keyword unless you have to filter out duplicate rows. If you do have duplicates, that should be addressed on ETL/ELT.
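To make the DISTINCT point concrete, here is a sketch that runs the same two-step aggregation in SQLite from Python, once with COUNT(DISTINCT purchase_id) and once with COUNT(*), over data containing one duplicated row. The lowercase table and column names and the helper avg_per_store are illustrative, not part of the original answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (purchase_id TEXT, buyer_id TEXT, store TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [
        ("abc", "1a", "East"),
        ("abc", "1a", "East"),  # deliberate duplicate row
        ("abd", "1a", "East"),
        ("abe", "1b", "East"),
        ("abf", "1c", "West"),
        ("abg", "1c", "West"),
        ("abh", "1d", "South"),
        ("abi", "1e", "North"),
        ("abj", "1f", "North"),
    ],
)

def avg_per_store(count_expr):
    """Aggregate to (buyer, store) first, then average those counts per store."""
    query = f"""
        WITH buyer_purchases AS (
            SELECT buyer_id, store, {count_expr} AS purchases
            FROM t1
            GROUP BY buyer_id, store
        )
        SELECT store, AVG(purchases)
        FROM buyer_purchases
        GROUP BY store
        ORDER BY store
    """
    return dict(conn.execute(query).fetchall())

with_distinct = avg_per_store("COUNT(DISTINCT purchase_id)")
without_distinct = avg_per_store("COUNT(*)")

print(with_distinct)     # East stays at 1.5 despite the duplicate
print(without_distinct)  # East is inflated to 2.0 by the duplicate
```

On clean data both variants agree; they only diverge when duplicate rows are present.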
Hopefully this is enough to get you started. There are literally thousands of possible approaches that, depending on your datasets (you mentioned there are millions of rows), may give you more flexibility or speed. The high-level approach is to reduce the number of rows as quickly as possible. The first count-distinct query should include as many predicates as you can to avoid extra work. Hope this helps :-)
SELECT
STORE
,AVG(DISTINCT_STORE_PURCHASES) AVG_PURCHASES_PER_STORE
,AVG(DISTINCT_BUYER_PURCHASES) AVG_BUYER_PURCHASES_PER_STORE
FROM
(SELECT
STORE
, COUNT(DISTINCT PURCHASE_ID) OVER (PARTITION BY BUYER_ID) DISTINCT_BUYER_PURCHASES
, DIV0(COUNT(DISTINCT PURCHASE_ID) OVER (PARTITION BY STORE), COUNT(DISTINCT BUYER_ID) OVER (PARTITION BY STORE) ) DISTINCT_STORE_PURCHASES
FROM CTE)
GROUP BY
STORE ;

How to get the differences between two rows **and** the name of the field where the difference is, in BigQuery?

I have a table in BigQuery like this:
Name  Phone Number  Address
----  ------------  --------------------
John  123456778564  1 Penny Lane
John  873452987424  1 Penny Lane
Mary  845704562848  87 5th Avenue
Mary  845704562848  54 Lincoln Rd.
Amy   342847327234  4 Ocean Drive Avenue
Amy   347907387469  98 Truman Rd.
I want to get a table with the differences between two consecutive rows and the name of the field where the difference occurs.
I mean this:
Name  Field         Before                After
----  ------------  --------------------  --------------
John  Phone Number  123456778564          873452987424
Mary  Address       87 5th Avenue         54 Lincoln Rd.
Amy   Phone Number  342847327234          347907387469
Amy   Address       4 Ocean Drive Avenue  98 Truman Rd.
How can I do this ? I've looked on other posts but couldn't find something that corresponds to my need.
Thank you
Consider below BigQuery'ish solution
select Name, ['Phone Number', 'Address'][offset(offset)] Field,
prev_field as Before, field as After
from (
select timestamp, Name, offset, field,
lag(field) over (partition by Name, offset order by timestamp) as prev_field
from yourtable,
unnest([`Phone Number`, Address]) field with offset
)
where prev_field != field
If applied to the sample data in your question, the output is the table of differences you asked for.
No matter how many columns in your table you need to compare, it is still just one query, with no unions and such.
You just need to enumerate your columns in two places
['Phone Number', 'Address'][offset(offset)] Field
and
unnest([`Phone Number`, Address]) field with offset
Note: you can further refactor above using scripting's execute immediate to compose such lists within the query on the fly (check my other answers - I frequently use such technique in them)
One method is just to use lag() and union all:
select name, 'phone', prev_phone as before, phone as after
from (select name, phone,
lag(phone) over (partition by name order by timestamp) as prev_phone
from t
) t
where prev_phone <> phone
union all
select name, 'address', prev_address as before, address as after
from (select name, address,
lag(address) over (partition by name order by timestamp) as prev_address
from t
) t
where prev_address <> address
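The lag() plus union all approach can be tried out locally. The sketch below runs it in SQLite from Python rather than BigQuery; the ts column is an assumption (both answers order by a timestamp column that the question's table does not show), and the table name t follows the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ts is an assumed ordering column; the sample table in the question has none.
conn.execute("CREATE TABLE t (name TEXT, phone TEXT, address TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("John", "123456778564", "1 Penny Lane", 1),
        ("John", "873452987424", "1 Penny Lane", 2),
        ("Mary", "845704562848", "87 5th Avenue", 1),
        ("Mary", "845704562848", "54 Lincoln Rd.", 2),
        ("Amy", "342847327234", "4 Ocean Drive Avenue", 1),
        ("Amy", "347907387469", "98 Truman Rd.", 2),
    ],
)

# One branch per compared column; each keeps only rows whose value changed.
query = """
    SELECT name, 'Phone Number' AS field, prev_phone AS before, phone AS after
    FROM (
        SELECT name, phone,
               LAG(phone) OVER (PARTITION BY name ORDER BY ts) AS prev_phone
        FROM t
    ) AS p
    WHERE prev_phone <> phone
    UNION ALL
    SELECT name, 'Address', prev_address, address
    FROM (
        SELECT name, address,
               LAG(address) OVER (PARTITION BY name ORDER BY ts) AS prev_address
        FROM t
    ) AS a
    WHERE prev_address <> address
"""
diffs = set(conn.execute(query).fetchall())

for row in sorted(diffs):
    print(row)
```

The first row of each name has a NULL previous value, so the <> comparison drops it without an explicit IS NOT NULL check.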

match tables with intermediate mapping table (fuzzy joins with similar strings)

I'm using BigQuery.
I have two simple tables with "bad" data quality from our systems. One represents revenue and the other production rows for bus journeys.
I need to match every journey to a revenue transaction but I only have a set of fields and no key and I don't really know how to do this matching.
This is a sample of the data:
Revenue
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, London, Manchester, Qwerty
Journeys
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, Kings Cross, Piccadilly Gardens, Qwer
2020, 123123, Kings Cross, Victoria Station, Qwert
2020, 123123, London, Manchester, Qwerty
Every station has a maximum of 9 alternative names and these are stored in a "station" table.
Stations
Station Name, Station Name 2, Station Name 3,...
London, Kings Cross, Euston,...
Manchester, Piccadilly Gardens, Victoria Station,...
I would like to first try matching or joining the tables on the original fields. This will generate some matches, but many journeys will remain unmatched. For the unmatched revenue rows, I would like to change the product name (shorten it to two letters, which may produce many matches from the production table) and then the station names, first changing station_origin and then station_destination. With a shorter product name I could get many matches, but I want the row from the production table with the most common product.
Something like this:
1. Do a direct match. That is, I can use the fields as they are in the tables.
2. Do a match where the revenue.product is changed by shortening it to two letters. substr(product,0,2)
3. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
4. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product is changed as above with a substr(product,0,2) but rev.station_destination is not changed.
5. Change the rev.station_destination to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
I was told that maybe I should create an intermediate table with all combinations of stations and products and let a rank column decide the order. The station names in the station's table are in order of importance so "station name" is more important than "station name 2" and so on.
I started to do a query with a subquery per rank and do a UNION ALL but there are so many combinations that there must be another way to do this.
Don't know if this makes any sense but I would appreciate any help or ideas to do this in a better way.
Cheers,
Cris
To implement a complex joining strategy with approximate matching, it might make more sense to define the strategy within JavaScript - and call the function from a BigQuery SQL query.
For example, the following query:
1. Takes the top 200 male names in the US.
2. Checks whether one of the top 200 female names matches exactly.
3. If not, looks for the most similar female name among the options.
Note that the logic to choose the closest option is encapsulated within the JS UDF fhoffa.x.fuzzy_extract_one(). See https://medium.com/#hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83 to learn more about this.
WITH data AS (
SELECT name, gender, SUM(number) c
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY 1,2
), top_men AS (
SELECT * FROM data WHERE gender='M'
ORDER BY c DESC LIMIT 200
), top_women AS (
SELECT * FROM data WHERE gender='F'
ORDER BY c DESC LIMIT 200
)
SELECT name male_name,
COALESCE(
(SELECT name FROM top_women WHERE name=a.name)
, fhoffa.x.fuzzy_extract_one(name, ARRAY(SELECT name FROM top_women))
) female_version
FROM top_men a
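A rough Python analogue of the same strategy (exact match first, closest match as a fallback), as a sketch only. The standard-library difflib is used as a stand-in and is not the algorithm behind fhoffa.x.fuzzy_extract_one(); the name lists and the helper female_version are made up for illustration.

```python
from difflib import get_close_matches

# Made-up stand-ins for the top_women / top_men result sets.
top_women = ["Mary", "Patricia", "Jean", "Leslie"]
top_men = ["Leslie", "John"]

def female_version(name, candidates):
    """Return an exact match if there is one, otherwise the most similar candidate."""
    if name in candidates:
        return name
    # cutoff=0.0 means "always return the best candidate, however weak the match".
    closest = get_close_matches(name, candidates, n=1, cutoff=0.0)
    return closest[0] if closest else None

matches = {name: female_version(name, top_women) for name in top_men}
print(matches)
```

In practice you would probably raise the cutoff so that very weak matches are rejected instead of silently accepted.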

SQL (COUNT(*) / locations.area)

We are learning SQL at school, and my professor has this sql code in his documents.
SELECT wp.city, (COUNT(*) / locations.area) AS population_density
FROM world_poulation AS wp
INNER JOIN location
ON wp.city = locations.city
WHERE locations.state = “Hessen”
GROUP BY wp.city, locations.area
Everything is almost clear for me, just the aggregate function with /locations.area doesn't make any sense to me. Can anybody help?
Thank you in advance!
Look at what the query is grouped on, that tells you what each group consists of. In this case, each group is a city, and contains all the rows that have the same value for wp.city (and as the location table is joined on that value too, the locations.area is only included in the grouping so that it can be used in the result).
So each group has a number of rows, and the COUNT(*) aggregate will contain the number of rows for each group. The value of (COUNT(*) / locations.area) will be the number of rows in the group divided by the value of locations.area for that group.
If you would have data like this:
world_population
name city
--------- ---------
John London
Peter London
Sarah London
Malcolm London
Ian Cardiff
Johanna Stockholm
Sven Stockholm
Egil Stockholm
locations
city state area
----------- -------------- ---------
London Hessen 2
Cardiff Somewhere else 14
Stockholm Hessen 1
Then you would get a result with two groups (as Cardiff is not in the state Hessen). One group has four people from London which has the area 2, so the population density would be 2. The other group has three people from Stockholm which has the area 1, so the population density would be 3.
Side note: There is a typo in the query, as it joins in the table location but refers to it as locations everywhere else.
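To check the numbers above, here is a sketch that loads the sample data into SQLite from Python and runs the query with the table name corrected to locations and plain single quotes around 'Hessen'. The integer area column is an assumption; with integer operands SQLite performs integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE world_population (name TEXT, city TEXT);
    CREATE TABLE locations (city TEXT, state TEXT, area INTEGER);
    INSERT INTO world_population VALUES
        ('John', 'London'), ('Peter', 'London'), ('Sarah', 'London'),
        ('Malcolm', 'London'), ('Ian', 'Cardiff'), ('Johanna', 'Stockholm'),
        ('Sven', 'Stockholm'), ('Egil', 'Stockholm');
    INSERT INTO locations VALUES
        ('London', 'Hessen', 2), ('Cardiff', 'Somewhere else', 14),
        ('Stockholm', 'Hessen', 1);
""")

# Each group is one city; COUNT(*) is the number of people rows in that group.
density = conn.execute("""
    SELECT wp.city, COUNT(*) / locations.area AS population_density
    FROM world_population AS wp
    INNER JOIN locations ON wp.city = locations.city
    WHERE locations.state = 'Hessen'
    GROUP BY wp.city, locations.area
    ORDER BY wp.city
""").fetchall()

print(density)  # [('London', 2), ('Stockholm', 3)]
```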
Try writing it like:
SELECT wp.city,
locations.area,
COUNT(*) AS population,
(COUNT(*) / locations.area) AS population_density
FROM world_poulation AS wp
INNER JOIN locations
ON wp.city = locations.city
WHERE locations.state = 'Hessen'
GROUP BY wp.city, locations.area
The key is the GROUP BY statement. You are showing pairs of cities and areas. The COUNT(*) is the number of times a given pair shows up in the table you created by joining world population and location. The area is just a number, so you can divide the COUNT by the area.

Select distinct values with count in PostgreSQL

This is a heavily simplified version of an SQL problem I'm dealing with. Let's say I've got a table of all the cities in the world, like this:
country city
------------
Canada Montreal
Cuba Havanna
China Beijing
Canada Victoria
China Macau
I want to count how many cities each country has, so that I would end up with a table as such:
country city_count
------------------
Canada 50
Cuba 10
China 200
I know that I can get the distinct country values with SELECT distinct country FROM T1 and I suspect I need to construct a subquery for the city_count column. But my non-SQL brain is just telling me I need to loop through the results...
Thanks!
Assuming the only reason for a new row is a unique city:
select country, count(country) AS City_Count
from T1
group by country
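No loop is needed: GROUP BY produces one output row per country and COUNT runs once per group. A quick way to convince yourself, sketched with SQLite from Python using the five sample rows from the question (the table name T1 is taken from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (country TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO T1 VALUES (?, ?)",
    [
        ("Canada", "Montreal"), ("Cuba", "Havanna"), ("China", "Beijing"),
        ("Canada", "Victoria"), ("China", "Macau"),
    ],
)

# One row per country, with the number of city rows in that country's group.
city_counts = conn.execute("""
    SELECT country, COUNT(city) AS city_count
    FROM T1
    GROUP BY country
    ORDER BY country
""").fetchall()

print(city_counts)  # [('Canada', 2), ('China', 2), ('Cuba', 1)]
```

If the same city could appear more than once per country, COUNT(DISTINCT city) would be the safer choice.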