SQL GROUP BY prioritise value based on other column

I have a table something like this:
value | high_priority | grouping
1     | TRUE          | one
2     | FALSE         | one
3     | FALSE         | one
3     | FALSE         | two
4     | FALSE         | two
I would like to get the MAX value per grouping, unless an entry is high_priority, in which case that entry should be prioritised over the non-high_priority entries.
For example, on the above table I want these results:
value | grouping
1     | one
4     | two
The simple solution of GROUP BY won't account for the high_priority entries:
SELECT
  MAX(value) AS value,
  grouping
FROM the_table
GROUP BY grouping
How can I extend this to also account for the high_priority entries?

Based on your description, you can use aggregation with a FILTER clause (PostgreSQL) like this:
select grouping,
       coalesce(max(value) filter (where high_priority),
                max(value)
       ) as value
from the_table
group by grouping;
However, DISTINCT ON might be a simpler solution (the value desc tiebreaker ensures the highest value is picked within each priority level):
select distinct on (grouping) t.*
from the_table t
order by grouping, high_priority desc, value desc;

Alternatively group by grouping, high_priority and row_number the results as needed.
SELECT value, grouping
FROM (
  SELECT
    MAX(value) AS value,
    grouping, high_priority,
    row_number() over(partition by grouping order by high_priority desc) rn
  FROM the_table
  GROUP BY grouping, high_priority
) t
WHERE rn = 1

By joining the table to itself and using COALESCE, this provides the correct results:
SELECT
  COALESCE(MAX(high_priorities.value), MAX(the_table.value)) AS value,
  the_table.grouping
FROM the_table
LEFT JOIN the_table AS high_priorities
  ON the_table.grouping = high_priorities.grouping
  AND high_priorities.high_priority = TRUE
GROUP BY the_table.grouping
This works because, whenever a high-priority row exists for a grouping, the join finds it and that value is picked. When none exists, the first argument of COALESCE becomes NULL and it falls back to the second option.
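To sanity-check the behaviour, here is a minimal self-contained sketch (assuming PostgreSQL; the VALUES CTE simply stands in for the_table with the sample rows from the question):
WITH the_table (value, high_priority, grouping) AS (
  VALUES (1, TRUE,  'one'),
         (2, FALSE, 'one'),
         (3, FALSE, 'one'),
         (3, FALSE, 'two'),
         (4, FALSE, 'two')
)
SELECT
  COALESCE(MAX(high_priorities.value), MAX(the_table.value)) AS value,
  the_table.grouping
FROM the_table
LEFT JOIN the_table AS high_priorities
  ON the_table.grouping = high_priorities.grouping
  AND high_priorities.high_priority = TRUE
GROUP BY the_table.grouping;
-- returns (1, 'one') and (4, 'two'), matching the expected results above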

Related

BigQuery - Extract last entry of each group

I have one table where multiple records are inserted for each product group. Now, I want to extract (SELECT) only the last entries. For more, see the screenshot: the yellow highlighted records should be returned by the select query.
The HAVING MAX and HAVING MIN clauses for the ANY_VALUE function are now in preview.
HAVING MAX and HAVING MIN were just introduced for some aggregate functions - https://cloud.google.com/bigquery/docs/release-notes#February_06_2023
With them, the query can be very simple; consider the below approach:
select any_value(t having max datetime).*
from your_table t
group by t.id, t.product
Applied to the sample data in your question, it returns the expected output.
You might consider the below as well:
SELECT *
FROM sample_table
QUALIFY DateTime = MAX(DateTime) OVER (PARTITION BY ID, Product);
If you're more familiar with an aggregate function than a window function, below might be another option.
SELECT ARRAY_AGG(t ORDER BY DateTime DESC LIMIT 1)[SAFE_OFFSET(0)].*
FROM sample_table t
GROUP BY t.ID, t.Product
Query results
You can use a window function to partition based on the key and pick the required row by defining an ORDER BY field.
For example:
select * from (
  select *,
         rank() over (partition by product order by DateTime desc) as rank
  from `project.dataset.table`)
where rank = 1
You can use this query to select last record of each group:
Select Top(1) * from Tablename group by ID order by DateTime Desc

Get minimum without using row number/window function in Bigquery

I have a table like the one shown below.
What I would like to do is get the minimum value for each subject. Though I am able to do this with the row_number function, I would like to do it with a GROUP BY and min() approach, but it doesn't work.
row_number approach - works fine
SELECT * FROM (
  select subject_id, value, id, min_time, max_time, time_1,
         row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
  from table A
) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see, just the two columns (subject_id and id) are enough to group the items together; they will help differentiate the groups. But why am I not able to use the other columns in the select clause? If I use the other columns, I may not get the expected output because time_1 has different values.
I expect my output to be as shown below.
In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(0)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
After aggregating, you take the first element of the array. The .* transforms the record referred to by a into its component columns.
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you can easily remove it:
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
             ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
      FROM table a
     ) a
WHERE rank = 1;
Are you looking for something like the below?
SELECT
A.subject_id,
A.id,
A.min_time,
A.max_time,
A.time_1,
A.value
FROM table A
INNER JOIN(
SELECT subject_id, MIN(value) Value
FROM table
GROUP BY subject_id
) B ON A.subject_id = B.subject_id
AND A.Value = B.Value
If you do not need to select the time_1 column's value, the following query will work (as the values in the min_time and max_time columns are the same within a group):
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, the best approach, if you can, is to apply something like CAST(time_1 AS DATE) to your time column. This considers only the date part and ignores the time part. The query will be:
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- CAST(time_1 AS DATE) is valid BigQuery syntax;
-- adjust if your column type differs.
Below is for BigQuery Standard SQL and is the most efficient approach for cases like the one in your question.
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases leads to a "Resources exceeded" error.
Note: a self join is also a very inefficient way of achieving your objective.
A bit late to the party, but here is a cte-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id

BigQuery - Select only first row in BigQuery

I have a table where, in column A, I have groups of repeating data (one after another).
I want to select only the first row of each group based on the values in column A only (no other criteria). Mind you, I want all the corresponding columns selected as well for each such row (I don't want to exclude them).
Can someone help me with a proper query?
Here is a sample:
SAMPLE
Thanks!
#standardSQL
SELECT row.*
FROM (
SELECT ARRAY_AGG(t LIMIT 1)[OFFSET(0)] row
FROM `project.dataset.table` t
GROUP BY columnA
)
you can try something like this:
#standardSQL
SELECT
* EXCEPT(rn)
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY columnA ORDER BY columnA) AS rn
FROM
your_dataset.your_table)
WHERE rn = 1
that will return:
Row  columnA       col2        ...
1    AC1001        Z_Creation
2    ACO112BISPIC  QN
...
Add LIMIT 1 at the end of the query
something like
SELECT name, year FROM person_table ORDER BY year LIMIT 1
You can now use qualify for a more concise solution:
select *
from your_dataset.your_table
where true
qualify ROW_NUMBER() OVER(PARTITION BY columnA ORDER BY columnA) = 1
In BigQuery the physical sequence of rows is not significant. “BigQuery does not guarantee a stable ordering of rows in a table. Only the result of a query with an explicit ORDER BY clause has well-defined ordering.”[1].
First, you need to define which property determines the first row of each group; then you can run Vasily Bronsky's query, changing the ORDER BY to that property. This means you should either add another column to the table to store the order of the rows or pick one of the columns you already have.
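For instance, assuming a hypothetical insert_time column is the property that defines which row of each group came first, Vasily Bronsky's query would become something like:
SELECT
  * EXCEPT(rn)
FROM (
  SELECT
    *,
    -- insert_time is a made-up column; use whatever actually defines "first" in your data
    ROW_NUMBER() OVER(PARTITION BY columnA ORDER BY insert_time) AS rn
  FROM
    your_dataset.your_table)
WHERE rn = 1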

Select last duplicate row with different id Oracle 11g

I have a table that looks like this:
The problem is that I need to get the last record among the duplicates in the column "NRODENUNCIA".
You can use MAX(DENUNCIAID), along with GROUP BY... HAVING to find the duplicates and select the row with the largest DENUNCIAID:
SELECT MAX(DENUNCIAID), NRODENUNCIA, FECHAEMISION, ADUANA, MES, NOMBREESTADO
FROM YourTable
GROUP BY NRODENUNCIA, FECHAEMISION, ADUANA, MES, NOMBREESTADO
HAVING COUNT(1) > 1
This will only show rows that have at least one duplicate. If you want to see non-duplicate rows too, just remove the HAVING COUNT(1) > 1
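That is, to also see the groups that occur only once, simply drop the HAVING line:
SELECT MAX(DENUNCIAID), NRODENUNCIA, FECHAEMISION, ADUANA, MES, NOMBREESTADO
FROM YourTable
GROUP BY NRODENUNCIA, FECHAEMISION, ADUANA, MES, NOMBREESTADO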
There are a number of solutions for your problem. One is to use row_number.
Note that I've ordered by DENUNCIAID in the OVER clause. This defines the "Last Record" as the one that has the largest DENUNCIAID. If you want to define it differently, you'd need to change the field that is being ordered.
with dupes as (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY NRODENUNCIA ORDER BY DENUNCIAID DESC) RN,
    t.*
  FROM YourTable t
)
SELECT * FROM dupes WHERE rn = 1
This only gets the last record per dupe.
If you want to include only records that have dupes, then change the where clause to
WHERE rn =1
and NRODENUNCIA in (select NRODENUNCIA from dupes where rn > 1)
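Putting those pieces together, the full statement would look something like this:
with dupes as (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY NRODENUNCIA ORDER BY DENUNCIAID DESC) RN,
    t.*
  FROM YourTable t
)
SELECT *
FROM dupes
WHERE rn = 1
  AND NRODENUNCIA in (select NRODENUNCIA from dupes where rn > 1)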

How to do a Postgresql group aggregation: 2 fields using one to select the other

I have a table - Data - of rows, simplified, like so:
Name,Amount,Last,Date
A,16,31,1-Jan-2014
A,27,38,1-Feb-2014
A,12,34,1-Mar-2014
B,8,37,1-Jan-2014
B,3,38,1-Feb-2014
B,17,39,1-Mar-2014
I wish to group them similar to:
select Name,sum(Amount),aggr(Last),max(Date) from Data group by Name
For aggr(Last) I want the value of 'Last' from the row that contains max(Date)
So the result I want would be 2 rows
Name,Amount,Last,Date
A,55,34,1-Mar-2014
B,28,39,1-Mar-2014
i.e. in both cases, the value of Last is the one from the row that contained 1-Mar-2014
The query I'm actually doing is basically the same, but with many more sum() fields and millions of rows, so I'm guessing an aggregate function could avoid multiple extra requests for each group of incoming rows.
Instead of a special aggregate function, use row_number() and conditional aggregation:
select Name, sum(Amount),
       max(case when seqnum = 1 then Last end) as Last,
       max(date)
from (select d.*,
             row_number() over (partition by name order by date desc) as seqnum
      from data d
     ) d
group by Name;
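If you want to check this against the sample rows from the question, here is a quick self-contained sketch (assuming PostgreSQL; the VALUES CTE stands in for the Data table):
with data (name, amount, last, date) as (
  values ('A', 16, 31, date '2014-01-01'),
         ('A', 27, 38, date '2014-02-01'),
         ('A', 12, 34, date '2014-03-01'),
         ('B',  8, 37, date '2014-01-01'),
         ('B',  3, 38, date '2014-02-01'),
         ('B', 17, 39, date '2014-03-01')
)
select name, sum(amount) as amount,
       max(case when seqnum = 1 then last end) as last,
       max(date) as date
from (select d.*,
             row_number() over (partition by name order by date desc) as seqnum
      from data d
     ) d
group by name;
-- returns (A, 55, 34, 2014-03-01) and (B, 28, 39, 2014-03-01)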