SQL: Select the largest value for a group - sql

I have a table like this:
someOtherCols | shelf | type | count
...           | row1  | A    | 2
...           | row1  | B    | 3
...           | row2  | C    | 2
...           | row2  | D    | 2
I would like to group by shelf, and only keep the type that has the highest count. If there's a tie, choose any row.
So the result would be like:
someOtherCols | shelf | type | count
...           | row1  | B    | 3
...           | row2  | C    | 2
I'm using AWS Athena and trying the following query, which I found in another answer:
SELECT * FROM table
WHERE count IN (MAX(count) FROM shoppingAggregate GROUP BY someOtherCols, shelf)
Seems like Athena does not like it. How can I achieve this? Thanks a lot!

Use window functions:
You will have to apply an OVER clause to each attribute in "some other cols". I used min below because the query has to be deterministic; you can choose whichever aggregate you prefer.
select distinct -- you need to remove duplicates
min(othercol) over (partition by row),
row, max(type) over (partition by row),
count(*) over (partition by row)
from relation;
SQLite will allow a non-deterministic query that is significantly simpler, using GROUP BY, but it will not work in other DBMSs.

I tried to use ROW_NUMBER() and was able to solve this problem:
WITH ranked AS (
SELECT table.*, ROW_NUMBER() OVER (PARTITION BY othercols, shelf ORDER BY counts DESC) AS rn
FROM table
)
SELECT * FROM ranked WHERE rn = 1;
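As a rough check of the ROW_NUMBER() approach on the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module rather than Athena. The table name `shopping`, the column name `cnt` (standing in for `count`), and the placeholder value for the other columns are assumptions made for illustration; window functions need SQLite 3.25 or newer.

```python
import sqlite3

# In-memory database with the four sample rows from the question.
# "shopping" and "cnt" are hypothetical names chosen for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shopping (other TEXT, shelf TEXT, type TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO shopping VALUES (?, ?, ?, ?)",
    [
        ("x", "row1", "A", 2),
        ("x", "row1", "B", 3),
        ("x", "row2", "C", 2),
        ("x", "row2", "D", 2),
    ],
)

# Number the rows inside each shelf from highest to lowest cnt,
# breaking ties by type so the result is deterministic, then keep rank 1.
top_per_shelf = conn.execute(
    """
    WITH ranked AS (
        SELECT shelf, type, cnt,
               ROW_NUMBER() OVER (PARTITION BY shelf ORDER BY cnt DESC, type) AS rn
        FROM shopping
    )
    SELECT shelf, type, cnt FROM ranked WHERE rn = 1 ORDER BY shelf
    """
).fetchall()

print(top_per_shelf)
```

The extra `type` in the ORDER BY is only there to make the tie on row2 resolve the same way every time; the question allows any row on a tie, so it can be dropped.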

Presto offers the function max_by() which I think is in Athena as well:
select shelf, max(count), max_by(type, count)
from t
group by shelf

Related

Use window functions to select the value from a column based on the sum of another column, in an aggregate query

Consider this data (View on DB Fiddle):
id | dept | value
1  | A    | 5
1  | A    | 5
1  | B    | 7
1  | C    | 5
2  | A    | 5
2  | A    | 5
2  | B    | 15
2  | A    | 2
The base query I am running is pretty simple. Just get the total value by id and the most frequent dept.
SELECT
id,
MODE() WITHIN GROUP(ORDER BY dept) AS dept_freq,
SUM(value) AS value
FROM test
GROUP BY id
;
id | dept_freq | value
1  | A         | 22
2  | A         | 27
But I also need to get, for each id, the dept that concentrates the greatest value (so the greatest sum of value by id and dept, not the highest individual value in the original table).
Is there any way to use window functions to achieve that and do it directly in the base query above?
The expected output for this particular example would be:
id | dept_freq | dept_value | value
1  | A         | A          | 22
2  | A         | B          | 27
I could achieve that with the query below, then joining its output with the results of the base query above:
SELECT * FROM(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY value DESC) as row
FROM (
SELECT id, dept, SUM(value) AS value
FROM test
GROUP BY id, dept
) AS alias1
) AS alias2
WHERE alias2.row = 1
;
id | dept | value | row
1  | A    | 10    | 1
2  | B    | 15    | 1
But it is not easy to read or maintain, and it also seems pretty inefficient. So I thought it should be possible to achieve this using window functions directly in the base query, which might also help Postgres come up with a better query plan that makes fewer passes over the data. But none of my attempts using OVER (PARTITION BY ...) and FILTER worked.
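You can fetch the dept with the highest total using the first_value() window function. Because the question asks for the greatest sum of value per id and dept (not the highest individual value), first compute that per-dept total with SUM() OVER, then order by it. Adding this before your mode() grouping should do it:
SELECT
id,
highest_value_dept,
MODE() WITHIN GROUP(ORDER BY dept) AS dept_freq,
SUM(value) as value
FROM (
SELECT
id,
dept,
value,
-- dept whose summed value is largest within each id
FIRST_VALUE(dept) OVER (PARTITION BY id ORDER BY dept_total DESC) as highest_value_dept
FROM (
SELECT id, dept, value,
SUM(value) OVER (PARTITION BY id, dept) as dept_total
FROM test
) t
) s
GROUP BY 1,2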
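To check the intent of the question on its sample data (ranking departments by their per-id total rather than by a single row's value), here is a small sketch that runs on SQLite through Python's sqlite3 module. SQLite has no MODE() ordered-set aggregate, so this only computes `dept_value` and the overall sum; the tie-break on `dept` is an added assumption, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, dept TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        (1, "A", 5), (1, "A", 5), (1, "B", 7), (1, "C", 5),
        (2, "A", 5), (2, "A", 5), (2, "B", 15), (2, "A", 2),
    ],
)

# Innermost query: attach each dept's total within its id to every row.
# Middle query: pick the dept with the largest total per id.
# Outer query: collapse to one row per id with the overall sum.
rows = conn.execute(
    """
    SELECT id, dept_value, SUM(value) AS value
    FROM (
        SELECT id, value,
               FIRST_VALUE(dept) OVER (
                   PARTITION BY id ORDER BY dept_total DESC, dept
               ) AS dept_value
        FROM (
            SELECT id, dept, value,
                   SUM(value) OVER (PARTITION BY id, dept) AS dept_total
            FROM test
        )
    )
    GROUP BY id, dept_value
    ORDER BY id
    """
).fetchall()

print(rows)
```

For id 1 the largest single value belongs to dept B (7), but dept A has the largest total (10), which is why the ordering has to use the per-dept sum.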

BigQuery - Select only first row in BigQuery

I have a table where Column A contains groups of repeating values, one after another.
I want to select only the first row of each group, based on the values in column A only (no other criteria). I also want every other column of that row returned, not just column A.
Can someone help me with a proper query?
Thanks!
#standardSQL
SELECT row.*
FROM (
SELECT ARRAY_AGG(t LIMIT 1)[OFFSET(0)] row
FROM `project.dataset.table` t
GROUP BY columnA
)
You can try something like this:
#standardSQL
SELECT
* EXCEPT(rn)
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY columnA ORDER BY columnA) AS rn
FROM
your_dataset.your_table)
WHERE rn = 1
that will return:
Row columnA col2 ...
1 AC1001 Z_Creation
2 ACO112BISPIC QN
...
Add LIMIT 1 at the end of the query. Note that this returns a single row for the whole query, not one row per group. Something like:
SELECT name, year FROM person_table ORDER BY year LIMIT 1
You can now use qualify for a more concise solution:
select
*
from
your_dataset.your_table
where true
qualify ROW_NUMBER() OVER(PARTITION BY columnA ORDER BY columnA) = 1
In BigQuery the physical sequence of rows is not significant. “BigQuery does not guarantee a stable ordering of rows in a table. Only the result of a query with an explicit ORDER BY clause has well-defined ordering.”[1].
First, you need to decide which property determines the first row of each group; then you can run the ROW_NUMBER() query above, replacing the ORDER BY expression with that property. That means you should either add another column to the table to store the order of the rows, or pick one of the columns you already have.

How to sum two columns in sql without group by

I have columns such as pagecount, convertedpages and changedpages in a table along with many other columns.
pagecount is the sum of convertedpages and changedpages.
I need to select all rows along with pagecount, and I can't group them. Is there any way to do it?
This select is part of a view. Can I use another SQL statement to compute just the sum and then somehow make it part of the main query?
Thank you.
SELECT
*,
(ConvertedPages + ChangedPages) as PageCount
FROM Table
If I'm understanding your question correctly, while I'm not sure why you can't use group by, another option would be to use a correlated subquery:
select distinct id,
(select sum(field) from yourtable y2 where y.id = y2.id) summedresult
from yourtable y
This assumes you have data such as:
id | field
1 | 10
1 | 15
2 | 10
And would be equivalent to:
select id, sum(field)
from yourtable
group by id
Not 100% sure what you're after here, but if you want a total across rows without grouping, you can use OVER() with an aggregate in SQL Server:
SELECT *, SUM(convertedpages) OVER() AS TotalConvertedPages
, SUM(changedpages) OVER() AS TotalChangedPages
, SUM(changedpages + convertedpages) OVER() AS TotalPageCount
FROM Table
This repeats the total on every row. Use PARTITION BY inside OVER() if you'd like the aggregate grouped by some fields while still displaying the full detail of all rows. The aliases are distinct from the existing column names because a view cannot contain duplicate column names.
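The difference between a per-row sum, a grand total with OVER(), and a per-group total with PARTITION BY can be seen in a small sketch. It runs on SQLite via Python's sqlite3 module rather than SQL Server, and the table `pages`, the grouping column `doc`, and the sample numbers are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: "doc" is an invented grouping column.
conn.execute(
    "CREATE TABLE pages (doc TEXT, convertedpages INTEGER, changedpages INTEGER)"
)
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [("a", 3, 1), ("a", 2, 2), ("b", 5, 0)],
)

rows = conn.execute(
    """
    SELECT doc,
           convertedpages + changedpages AS pagecount,            -- per-row sum
           SUM(convertedpages + changedpages) OVER () AS total,   -- grand total
           SUM(convertedpages + changedpages)
               OVER (PARTITION BY doc) AS doc_total               -- total per doc
    FROM pages
    ORDER BY doc, pagecount
    """
).fetchall()

for row in rows:
    print(row)
```

Every input row is still present in the output; only the window columns carry aggregated values, which is what makes this usable where GROUP BY is not an option.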

SQL select segment

I'm using SQL Server 2008.
I have a table with x amount of rows. I would like to always divide x by 5 and select the 3rd group of records.
Let's say there are 100 records in the table:
100 / 5 = 20
the 3rd segment will be record 41 to 60.
How will I be able in SQL to calculate and select this 3rd segment only?
Thanks.
You can use NTILE.
Distributes the rows in an ordered partition into a specified number of groups.
Example:
SELECT col1, col2, ..., coln
FROM
(
SELECT
col1, col2, ..., coln,
NTILE(5) OVER (ORDER BY id) AS groupno
FROM yourtable
) AS t -- SQL Server requires an alias on a derived table
WHERE groupno = 3
That's a perfect use for the NTILE ranking function.
Basically, you define your query inside a CTE and add an NTILE to your rows - a number going from 1 to n (the argument to NTILE). You order your rows by some column, and then you get the n groups of rows you're looking for, and you can operate on any one of those "groups" of data.
So try something like this:
;WITH SegmentedData AS
(
SELECT
(list of your columns),
GroupNo = NTILE(5) OVER (ORDER BY SomeColumnOfYours)
FROM dbo.YourTable
)
SELECT *
FROM SegmentedData
WHERE GroupNo = 3
Of course, you can also use an UPDATE statement after the CTE to update those rows.
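To confirm that NTILE(5) yields records 41 to 60 as the third segment of 100 rows, here is a minimal sketch on SQLite through Python's sqlite3 module (not SQL Server 2008). The table name `records` and its single `id` column are assumptions for the example; window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding ids 1..100.
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO records VALUES (?)", [(i,) for i in range(1, 101)])

# NTILE(5) assigns each row a bucket number 1..5 in id order;
# keeping bucket 3 selects the middle fifth of the table.
segment = [
    row[0]
    for row in conn.execute(
        """
        SELECT id
        FROM (
            SELECT id, NTILE(5) OVER (ORDER BY id) AS groupno
            FROM records
        )
        WHERE groupno = 3
        ORDER BY id
        """
    )
]

print(segment[0], segment[-1], len(segment))
```

When the row count is not divisible by 5, NTILE gives the earlier groups one extra row each, so the segment boundaries shift slightly compared with a plain x / 5 calculation.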

Select all from a table, where 2 columns are Distinct

Hi, I have a table of deals. I need to return the entire table, but with the title and price distinct, as there are quite a few duplicates. I've put an example scenario below:
ID | Title | Price | Source
a  | b     | c     | d
a  | b     | c     | b
b  | a     | a     | c
b  | a     | a     | 1
Expected result:
ID | Title | Price | Source
a  | b     | c     | d
b  | a     | a     | c
I'm not sure whether or not to use distinct or group by here, any suggestions would be appreciated
Cheers
Scott
=======================
Looking at some of your suggestions I'm going to have to rethink this, Thanks guys
This will arbitrarily pick one of the rows for each distinct (price,title) pair
;WITH myCTE AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY Price, Title ORDER BY Source) AS rn
FROM
MyTable
)
SELECT
*
FROM
myCTE
WHERE
rn = 1
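Here is a small sketch of this de-duplication on the sample rows from the question, run on SQLite via Python's sqlite3 module. The table name `deals` is an assumption, and the ordering is by Source descending purely so that the output matches the expected result shown in the question; with ascending order, the other row of each pair would be kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "deals" is a hypothetical name for the table in the question.
conn.execute("CREATE TABLE deals (id TEXT, title TEXT, price TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO deals VALUES (?, ?, ?, ?)",
    [
        ("a", "b", "c", "d"),
        ("a", "b", "c", "b"),
        ("b", "a", "a", "c"),
        ("b", "a", "a", "1"),
    ],
)

# Number the rows inside each (price, title) pair and keep the first one.
# The ORDER BY inside OVER decides which duplicate survives.
deduped = conn.execute(
    """
    WITH numbered AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY price, title ORDER BY source DESC
               ) AS rn
        FROM deals
    )
    SELECT id, title, price, source
    FROM numbered
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()

print(deduped)
```

Unlike DISTINCT or GROUP BY on two columns, every returned row is a real row from the table, so ID and Source always belong together.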
You can use group by, but to return only title and price, ID and source would have to be ignored
You are asking for the entire table, but your sample output drops two records, and with them their 'Col Source' values:
a b c b
b a a 1
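GROUP BY lets you write a very simple query, although most databases require every non-grouped column to be wrapped in an aggregate (MIN here, so id and source are not guaranteed to come from the same row):
select min(id) as id, title, price, min(source) as source from table group by title, price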
DISTINCT and GROUP BY usually generate the same query plan, so performance should be the same for both constructs. GROUP BY should be used when you need to apply aggregate operators to each group. If all you need is to remove duplicates, use DISTINCT. If sub-queries are involved, the execution plan can vary, so check the plan before deciding which is faster.
You should go for GROUP BY, since you need all the columns in your result set, whereas DISTINCT only returns the unique combinations of the columns you list. The columns that are not grouped need an aggregate in most databases:
SELECT MIN(ID) AS ID, Title, Price, MIN(Source) AS Source
FROM table as t
GROUP BY Title, Price