Nested partitioning and ranking in Google BigQuery

This is what the data looks like:
I want to sort this data at several levels to reach the final output.
Level 1:
Whenever there are duplicate values for name, I want to get the lowest ranking for each distinct (id, name, last_name, gender) tuple.
Level 1 result:
Level 2:
In level 2, I want to get the lowest ranking for each gender category for a particular name.
Level 2 result:
Final output:
For each name, if the 'male' and 'female' ranks are the same, return whichever occurs first in the table. If they differ, return the record with the lowest rank.
Expected final result:

Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY ranking, id LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY name

I do suspect that you can just partition by name:
select *
from (
select
t.*,
row_number() over(partition by name order by ranking, id) rn
from mytable t
) t
where rn = 1
The second sort criterion, id, breaks the tie.
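As a rough check of the partition-by-name idea, here is a small sketch that runs the same row_number() query through Python's built-in sqlite3 module instead of BigQuery. The sample rows are made up, and it assumes that "occurs first in the table" can be read as "has the lowest id".

```python
import sqlite3

# Hypothetical sample rows: (id, name, last_name, gender, ranking).
ROWS = [
    (1, "alex", "smith", "male", 3),
    (2, "alex", "jones", "female", 1),
    (3, "alex", "brown", "male", 1),
    (4, "sam", "lee", "female", 2),
]


def best_row_per_name(rows):
    """Return one row per name: lowest ranking, ties broken by lowest id."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE mytable "
        "(id INTEGER, name TEXT, last_name TEXT, gender TEXT, ranking INTEGER)"
    )
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        SELECT id, name, last_name, gender, ranking
        FROM (
            SELECT t.*,
                   ROW_NUMBER() OVER (PARTITION BY name ORDER BY ranking, id) AS rn
            FROM mytable t
        )
        WHERE rn = 1
        ORDER BY name
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


print(best_row_per_name(ROWS))
```

For "alex", ids 2 and 3 share the lowest ranking, so the id in the ORDER BY decides which of them is kept.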

Related

Getting MAX of a column and adding one more

I'm trying to make an SQL query that returns the greatest number from a column and its respective id.
For more information: I have two columns, ID and NUMBER. Both of them have 2 entries, and I want to get the highest number with the ID next to it. This is what I tried, without success.
SELECT ID, MAX(NUMBER) AS MAXNUMB
FROM TABLE1
GROUP BY ID, MAXNUMB;
The problem I'm experiencing is that it just shows ALL the entries, and if I add a "where" expression it shows the same thing (all entries [ids+numbers]).
P.S.: Yes, I got what I wanted, but only with one column (number); if I add another column (ID) to the select, it breaks.

Try:
SELECT
ID,
A_NUMBER
FROM TABLE1
WHERE A_NUMBER = (
SELECT MAX(A_NUMBER)
FROM TABLE1);
Presuming you want the IDs* of the row with the highest number (and not, instead, the highest number for each ID -- if IDs were not unique in your table, for example).
* there may be more than one ID returned if there are two or more IDs with equal maximum numbers
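To illustrate the footnote, here is a minimal sketch that runs the same subquery through Python's sqlite3 module. The table contents are invented for the example; only the query shape comes from the answer above.

```python
import sqlite3


def ids_with_max(rows):
    """Return every (id, a_number) row whose a_number equals the table maximum."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE table1 (id INTEGER, a_number INTEGER)")
    con.executemany("INSERT INTO table1 VALUES (?, ?)", rows)
    result = con.execute(
        """
        SELECT id, a_number
        FROM table1
        WHERE a_number = (SELECT MAX(a_number) FROM table1)
        ORDER BY id
        """
    ).fetchall()
    con.close()
    return result


# One clear maximum: a single row comes back.
print(ids_with_max([(1, 10), (2, 25)]))
# Two ids share the maximum: both rows come back.
print(ids_with_max([(1, 25), (2, 25), (3, 7)]))
```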

You can try this:
Select ID,maxNumber
From
(
SELECT
ID,
(Select Max(NUMBER) from Tmp where Id = t.Id) maxNumber
FROM
Tmp t
)T1
Group By ID,maxNumber

The query you posted has an illegal column name (number) and groups by the alias for the max value, which is illegal and also doesn't make sense; you can't include the unaliased max() in the group by either. So it's likely you're actually doing something like:
select id, max(numb) as maxnumb
from table1
group by id;
which will give one row per ID, with the maximum numb (which is the new name I've made up for your numeric column) for each ID. Or, since you said you get "ALL the entries", you might have group by id, numb, which would show all rows from the table (unless there are duplicate combinations).
To get the maximum numb and the corresponding id you could group by id only, order by descending maxnumb, and then return the first row only:
select id, max(numb) as maxnumb
from table1
group by id
order by maxnumb desc
fetch first 1 row only
If there are two IDs with the same maxnumb then you would only get one of them, and which one is indeterminate unless you modify the order by; in that case you might prefer to use first 1 row with ties to see them all.
You could achieve the same thing with a subquery and an analytic function to generate a ranking, and have the outer query return the highest-ranking row(s):
select id, numb as maxnumb
from (
select id, numb, dense_rank() over (order by numb desc) as rnk
from table1
)
where rnk = 1
You could also use keep to get the same result as first 1 row only:
select max(id) keep (dense_rank last order by numb) as id, max(numb) as maxnumb
from table1

Using the append model to do partial row updates in BigQuery

Suppose I have the following record in BQ:
id name age timestamp
1 "tom" 20 2019-01-01
I then perform two "updates" on this record by using the streaming API to 'append' additional data -- https://cloud.google.com/bigquery/streaming-data-into-bigquery. This is mainly to get around the update quota that BQ enforces (and it is a high-write application we have).
I then append two edits to the table, one update that just modifies the name, and then one update that just modifies the age. Here are the three records after the updates:
id name age timestamp
1 "tom" 20 2019-01-01
1 "Tom" null 2019-02-01
1 null 21 2019-03-03
I then want to query this record to get the most "up-to-date" information. Here is how I have started:
SELECT id, **name**, **age**,max(timestamp)
FROM table
GROUP BY id
-- 1,"Tom",21,2019-03-03
How would I get the correct name and age here? Note that there could be thousands of updates to a record, so I don't want to have to write 1000 case statements, if at all possible.
For various other reasons, I usually won't have all row data at one time, I will only have the RowID + FieldName + FieldValue.
I suppose plan B here is to do a query to get the current data and then add my changes to insert the new row, but I'm hoping there's a way to do this in one go without having to do two queries.

Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
You can test, play with above using sample data from your question as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "tom" name, 20 age, DATE '2019-01-01' ts UNION ALL
SELECT 1, "Tom", NULL, '2019-02-01' UNION ALL
SELECT 1, NULL, 21, '2019-03-03'
)
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
with result
Row id name age ts
1 1 Tom 21 2019-03-03

This is a classic case of application of analytic functions in Standard SQL.
Here is how you can achieve your results:
select id, name, age from (
select id, name, age, ts, rank() over (partition by id order by ts desc) rnk
from `yourdataset.yourtable`
)
where rnk = 1
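This partitions your records by id and picks the one with the most recent ts (the record most recently added for a given id). Note that it returns that row as it is: a column that is null in the latest row, such as name in the sample data, stays null, so partial updates are not merged.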
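The difference between "latest row" and "latest non-null value per column" can be checked with a small sketch. It uses Python's sqlite3 module rather than BigQuery, and because SQLite has no ARRAY_AGG ... IGNORE NULLS, the merge is written with correlated subqueries instead; that substitution and the table name updates are assumptions of this example. The rows are the three records from the question.

```python
import sqlite3

# The three appended records from the question: (id, name, age, ts).
UPDATES = [
    (1, "tom", 20, "2019-01-01"),
    (1, "Tom", None, "2019-02-01"),
    (1, None, 21, "2019-03-03"),
]


def run(query):
    """Load UPDATES into an in-memory table and return the query's rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE updates (id INTEGER, name TEXT, age INTEGER, ts TEXT)")
    con.executemany("INSERT INTO updates VALUES (?, ?, ?, ?)", UPDATES)
    rows = con.execute(query).fetchall()
    con.close()
    return rows


# Latest row per id: nulls in that row are returned unchanged.
latest_row = run(
    """
    SELECT id, name, age
    FROM (
        SELECT id, name, age,
               RANK() OVER (PARTITION BY id ORDER BY ts DESC) AS rnk
        FROM updates
    )
    WHERE rnk = 1
    """
)

# Latest non-null value per column, looked up separately for name and age.
merged = run(
    """
    SELECT u.id,
           (SELECT n.name FROM updates n
             WHERE n.id = u.id AND n.name IS NOT NULL
             ORDER BY n.ts DESC LIMIT 1) AS name,
           (SELECT a.age FROM updates a
             WHERE a.id = u.id AND a.age IS NOT NULL
             ORDER BY a.ts DESC LIMIT 1) AS age,
           u.ts
    FROM (SELECT id, MAX(ts) AS ts FROM updates GROUP BY id) u
    """
)

print(latest_row)
print(merged)
```

The ISO-formatted date strings sort correctly as text, which is why ts can be stored as TEXT in this sketch.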

how to select the most recent records

Select id, name , max(modify_time)
from customer
group by id, name
but I get all records.

Order by modify_time desc and use row_number to number the rows for each id. Then select the row with row_number = 1 for each id.
select id,modify_time,name
from (
select id,modify_time,name,row_number() over(partition by id order by modify_time desc) as r_no
from customer
) a
where a.r_no=1

If the ids are unique, grouping by id will return the same table.
My suggestion would be to order the table by "modify_time" descending and limit the result to 1 (maybe something like the following):
Select id, name, modify_time from customer ORDER BY modify_time DESC limit 1

The reason you are getting the whole table as a result is because you are grouping by id AND name. That means every unique combination of id and name is returned. And since all names per id are different, the whole table is returned.
If you want the last modification per id (or name) you should only group by id (or name respectively).
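The effect of the grouping columns can be seen on a few made-up rows. This sketch uses Python's sqlite3 module and assumes that an id can appear more than once with a different name each time; the data is invented for illustration.

```python
import sqlite3

# Hypothetical rows: (id, name, modify_time); the name changes between edits.
CUSTOMERS = [
    (1, "Ann", "2020-01-01"),
    (1, "Anne", "2020-02-01"),
    (2, "Bob", "2020-01-15"),
]


def run(query):
    """Load CUSTOMERS into an in-memory table and return the query's rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customer (id INTEGER, name TEXT, modify_time TEXT)")
    con.executemany("INSERT INTO customer VALUES (?, ?, ?)", CUSTOMERS)
    rows = con.execute(query).fetchall()
    con.close()
    return rows


# Grouping by id AND name: every distinct (id, name) pair survives.
by_id_and_name = run(
    """
    SELECT id, name, MAX(modify_time)
    FROM customer
    GROUP BY id, name
    ORDER BY id, name
    """
)

# Numbering rows per id and keeping the newest one: one row per id.
latest_per_id = run(
    """
    SELECT id, name, modify_time
    FROM (
        SELECT id, name, modify_time,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY modify_time DESC) AS r_no
        FROM customer
    )
    WHERE r_no = 1
    ORDER BY id
    """
)

print(by_id_and_name)
print(latest_per_id)
```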

I need the Top 10 results from table

I need to get the Top 10 results for each Region, Market and Name along with those with highest counts (Gaps). There are 4 Regions with 1 to N Markets. I can get the Top 10 but cannot figure out how to do this without using a Union for every Market. Any ideas on how to do this?
SELECT DISTINCT TOP 10
Region, Market, Name, Gaps
FROM
TableName
ORDER BY
Region, Market, Gaps DESC

One approach would be to use a CTE (Common Table Expression) if you're on SQL Server 2005 and newer (you aren't specific enough in that regard).
With this CTE, you can partition your data by some criteria - i.e. your Region, Market, Name - and have SQL Server number all your rows starting at 1 for each of those "partitions", ordered by some criteria.
So try something like this:
;WITH RegionsMarkets AS
(
SELECT
Region, Market, Name, Gaps,
RN = ROW_NUMBER() OVER(PARTITION BY Region, Market, Name ORDER BY Gaps DESC)
FROM
dbo.TableName
)
SELECT
Region, Market, Name, Gaps
FROM
RegionsMarkets
WHERE
RN <= 10
Here, I am selecting the first 10 entries for each "partition" (i.e. for each Region, Market, Name tuple), ordered by Gaps in a descending fashion.
With this, you get the top 10 rows for each (Region, Market, Name) tuple - does that approach what you're looking for??

I think you want row_number():
select t.*
from (select t.*,
row_number() over (partition by region, market order by gaps desc) as seqnum
from tablename t
) t
where seqnum <= 10;
I am not sure if you want name in the partition by clause. If you have more than one name within a market, that may be what you are looking for. (Hint: Sample data and desired results can really help clarify a question.)
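A small top-N-per-group sketch, run through Python's sqlite3 module rather than SQL Server. The rows are invented, the partition is (region, market) without name, and the limit is passed in as a parameter so the example can use 2 instead of 10.

```python
import sqlite3

# Hypothetical rows: (region, market, name, gaps).
ROWS = [
    ("East", "NY", "a", 5),
    ("East", "NY", "b", 9),
    ("East", "NY", "c", 7),
    ("East", "BOS", "d", 1),
    ("West", "LA", "e", 4),
    ("West", "LA", "f", 8),
    ("West", "LA", "g", 6),
]


def top_n_per_market(rows, n):
    """Return the n rows with the highest gaps within each (region, market)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE tablename (region TEXT, market TEXT, name TEXT, gaps INTEGER)"
    )
    con.executemany("INSERT INTO tablename VALUES (?, ?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT region, market, name, gaps
        FROM (
            SELECT t.*,
                   ROW_NUMBER() OVER (
                       PARTITION BY region, market ORDER BY gaps DESC
                   ) AS seqnum
            FROM tablename t
        )
        WHERE seqnum <= ?
        ORDER BY region, market, gaps DESC
        """,
        (n,),
    ).fetchall()
    con.close()
    return result


print(top_n_per_market(ROWS, 2))
```

Adding name to the PARTITION BY list would restart the numbering for every (region, market, name) combination, which only makes sense if a name can occur several times within a market.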

sql query finding most often level appear

I have a table Student in SQL Server with these columns:
[ID], [Age], [Level]
I want a query that returns each age value that appears in Student and finds the level value that appears most often for that age. For example, if there are more 'a' level students aged 18 than 'b' or 'c', it should print the pair (18, a).
I am new to SQL Server and I want a simple answer with nested query.

You can do this using window functions:
select t.*
from (select age, level, count(*) as cnt,
row_number() over (partition by age order by count(*) desc) as seqnum
from student s
group by age, level
) t
where seqnum = 1;
The inner query aggregates the data to count the number of students at each level for each age. row_number() then enumerates these rows within each age (the partition by column), largest count first. The where clause keeps the row with the highest count for each age.
In the case of ties, this returns just one of the values. If you want all of them, use rank() instead of row_number().
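The tie behaviour can be tried out with a short sketch using Python's sqlite3 module. The student rows are made up, and the aggregation is moved into a CTE before the window function is applied, which is a restructuring of the query above rather than the original form.

```python
import sqlite3

# Hypothetical rows: (id, age, level). Age 19 has a tie between 'a' and 'b'.
STUDENTS = [
    (1, 18, "a"),
    (2, 18, "a"),
    (3, 18, "b"),
    (4, 19, "a"),
    (5, 19, "b"),
]


def most_common_level(rows, window_fn="RANK"):
    """Return (age, level) pairs ranked first by count within each age."""
    if window_fn not in ("RANK", "ROW_NUMBER"):
        raise ValueError("window_fn must be 'RANK' or 'ROW_NUMBER'")
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE student (id INTEGER, age INTEGER, level TEXT)")
    con.executemany("INSERT INTO student VALUES (?, ?, ?)", rows)
    query = f"""
        WITH counts AS (
            SELECT age, level, COUNT(*) AS cnt
            FROM student
            GROUP BY age, level
        )
        SELECT age, level
        FROM (
            SELECT age, level,
                   {window_fn}() OVER (PARTITION BY age ORDER BY cnt DESC) AS seqnum
            FROM counts
        )
        WHERE seqnum = 1
        ORDER BY age, level
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


# RANK keeps both tied levels for age 19.
print(most_common_level(STUDENTS, "RANK"))
# ROW_NUMBER keeps exactly one row per age; which tied level wins is not defined.
print(most_common_level(STUDENTS, "ROW_NUMBER"))
```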

One more option uses the ROW_NUMBER ranking function in the ORDER BY clause. WITH TIES returns every row that ties for last place in the limited result set; here that is every row numbered 1, i.e. one row per age.
SELECT TOP 1 WITH TIES age, level
FROM dbo.Student
GROUP BY age, level
ORDER BY ROW_NUMBER() OVER(PARTITION BY age ORDER BY COUNT(*) DESC)
Or a second version of the query, which compares the count for each (age, level) pair with the maximum count per age.
SELECT *
FROM (
SELECT age, level, COUNT(*) AS cnt,
MAX(COUNT(*)) OVER(PARTITION BY age) AS mCnt
FROM dbo.Student
GROUP BY age, level
)x
WHERE x.cnt = x.mCnt

Another option, but it will require a later version of SQL Server:
;WITH x AS
(
SELECT age,
level,
occurrences = COUNT(*)
FROM Student
GROUP BY age,
level
)
SELECT *
FROM x x
WHERE EXISTS (
SELECT *
FROM x y
WHERE x.occurrences > y.occurrences
)
I realise it doesn't quite answer the question, as it only returns the age/level combinations where there is more than one level for the age.
Maybe someone can help amend it so that it also includes the single-level ages in the result set: http://sqlfiddle.com/#!3/d597b/9

with combinations as (
select age, level, count(*) occurrences
from Student
group by age, level
)
select age, level
from combinations c
where occurrences = (select max(occurrences)
from combinations
where age = c.age)
This finds every age and level combination in the Student table and counts the number of occurrences of each combination.
Then, for each age, it finds the combination whose occurrences are the highest for that age and returns the age and level for that row.
This has the advantage of not being tied to SQL Server - it's vanilla SQL. However, a window function like Gordon pointed out may perform better on SQL Server.
