Redshift - optimal way to group by and select multiple non-aggregate columns - sql

I'm trying to aggregate a table based on grouping a column. The table looks like this-
The primary objective is to get a sum based on similar IDs -
SELECT id,sum(amt) FROM tbl GROUP BY id;
This returns a valid result where the 'amt' column is summed up correctly -
But in addition to this I also want to get other columns based on certain logic. A result like this -
I tried -
SELECT id, sum(amt), location, date FROM tbl GROUP BY id;
This fails because the SELECT list contains columns (location, date, etc.) that are neither in the GROUP BY clause nor wrapped in an aggregate function. My only option here seems to be to use other aggregate functions such as MIN() or MAX().
To circumvent this I tried a different approach using window functions in Redshift like this -
WITH AGG_TBL AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS ROW,
id,
SUM(amt) OVER (PARTITION BY id) AS sum,
location,
date
FROM tbl
)
SELECT x.*
FROM AGG_TBL x
WHERE x.row = 1
The code above selects the first row based on the partition logic and I'm able to select multiple columns.
I will be running this aggregation query on 600+ million rows, so I want to know whether this is the most efficient way to do it in Redshift, or whether there is a better approach.
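For comparison, here is a minimal sketch of the two query shapes discussed in the question. It runs against an in-memory SQLite database through Python's sqlite3 module rather than Redshift, the sample rows are made up, and ordering by date to pick the kept row is an assumption. It only illustrates how the results differ and says nothing about performance at 600+ million rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, amt INTEGER, location TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, 10, "LA", "2020-01-02"),
        (1, 20, "NY", "2020-01-01"),
        (2, 5, "SF", "2020-01-03"),
    ],
)

# Shape 1: plain GROUP BY, picking the other columns with aggregates.
# MIN(location) and MIN(date) are computed independently, so the values
# can come from different source rows.
group_by_rows = conn.execute(
    """
    SELECT id, SUM(amt) AS total, MIN(location) AS location, MIN(date) AS date
    FROM tbl
    GROUP BY id
    ORDER BY id
    """
).fetchall()

# Shape 2: window functions, keeping one whole row per id.
# ORDER BY date (an assumption) makes the kept row deterministic;
# ORDER BY id inside PARTITION BY id would leave the choice arbitrary.
window_rows = conn.execute(
    """
    WITH agg_tbl AS (
        SELECT
            ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn,
            id,
            SUM(amt) OVER (PARTITION BY id) AS total,
            location,
            date
        FROM tbl
    )
    SELECT id, total, location, date
    FROM agg_tbl
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()
```

With this data, the GROUP BY version pairs "LA" with "2020-01-01" for id 1 even though no single row holds both values, while the window version returns the complete earliest row.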

Related

BigQuery - Extract last entry of each group

I have one table where multiple records are inserted for each group of products. Now I want to extract (SELECT) only the last entries. For more, see the screenshot. The yellow highlighted records should be returned by the select query.
The HAVING MAX and HAVING MIN clause for the ANY_VALUE function is now in preview
HAVING MAX and HAVING MIN were just introduced for some aggregate functions - https://cloud.google.com/bigquery/docs/release-notes#February_06_2023
With them, the query can be very simple. Consider the approach below:
select any_value(t having max datetime).*
from your_table t
group by t.id, t.product
if applied to sample data in your question - output is
You might consider below as well
SELECT *
FROM sample_table
QUALIFY DateTime = MAX(DateTime) OVER (PARTITION BY ID, Product);
If you're more familiar with aggregate functions than window functions, the query below might be another option.
SELECT ARRAY_AGG(t ORDER BY DateTime DESC LIMIT 1)[SAFE_OFFSET(0)].*
FROM sample_table t
GROUP BY t.ID, t.Product
Query results
You can use a window function to partition the rows by a key and then select the required row according to an ORDER BY field.
For Example:
select * from (
select *,
rank() over (partition by product order by DateTime Desc) as rank
from `project.dataset.table`)
where rank = 1
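One thing to keep in mind with the rank() version: if two rows in a group share the latest DateTime, both get rank 1 and both are returned. A small sketch of that difference, run on an in-memory SQLite database via Python's sqlite3 instead of BigQuery; the table contents are invented, the partition follows the question's ID and Product grouping, and latest_rows is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_table (ID INTEGER, Product TEXT, DateTime TEXT)")
conn.executemany(
    "INSERT INTO sample_table VALUES (?, ?, ?)",
    [
        (1, "A", "2023-01-01 10:00"),
        (1, "A", "2023-01-02 10:00"),
        (1, "A", "2023-01-02 10:00"),  # exact tie on the latest timestamp
        (2, "B", "2023-01-05 09:00"),
    ],
)

def latest_rows(ranking_function):
    # ranking_function is interpolated directly into the SQL text,
    # so only pass the literal names RANK or ROW_NUMBER.
    query = f"""
        SELECT ID, Product, DateTime
        FROM (
            SELECT *,
                   {ranking_function}() OVER (
                       PARTITION BY ID, Product ORDER BY DateTime DESC
                   ) AS rnk
            FROM sample_table
        )
        WHERE rnk = 1
        ORDER BY ID
    """
    return conn.execute(query).fetchall()

rank_rows = latest_rows("RANK")              # keeps both tied rows for ID 1
row_number_rows = latest_rows("ROW_NUMBER")  # keeps exactly one row per group
```

If exactly one row per group is required, ROW_NUMBER() (ideally with a tie-breaking column in the ORDER BY) is the safer choice.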
You can use this query to select last record of each group:
Select Top(1) * from Tablename group by ID order by DateTime Desc

Get minimum without using row number/window function in Bigquery

I have a table like as shown below
What I would like to do is get the minimum of each subject. Though I am able to do this with row_number function, I would like to do this with groupby and min() approach. But it doesn't work.
row_number approach - works fine
SELECT * FROM (select subject_id,value,id,min_time,max_time,time_1,
row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
from table A) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see, just the two columns (subject_id and id) are enough to group the items together; they differentiate the groups. But why am I not able to use the other columns in the SELECT clause? If I use the other columns, I may not get the expected output, because time_1 has different values.
I expect my output to be like as shown below
In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(0)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
After aggregating, you want the first (and only) element of each array. The .* expands the record referred to by a into its component columns.
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you can easily remove it:
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
FROM A
) a
WHERE RANK = 1;
Are you looking for something like below-
SELECT
A.subject_id,
A.id,
A.min_time,
A.max_time,
A.time_1,
A.value
FROM table A
INNER JOIN(
SELECT subject_id, MIN(value) Value
FROM table
GROUP BY subject_id
) B ON A.subject_id = B.subject_id
AND A.Value = B.Value
If you do not need to select the Time_1 column, the following query will work (as far as I can see, the values in min_time and max_time are the same within each group):
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, the best approach, if it suits your data, is to apply something like CAST(Time_1 AS DATE) to your time column. This keeps only the date part and ignores the time part. The query would be:
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- Check that the CAST ... AS DATE syntax in BigQuery
-- matches what I have written here; it may differ slightly.
The query below is for BigQuery Standard SQL and is the most efficient way to handle cases like the one in your question:
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases leads to a "Resources exceeded" error.
Note: a self join is also a very inefficient way of achieving your objective.
A bit late to the party, but here is a cte-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id and m.min_value = t.value
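Here is a minimal sketch of this CTE-and-join pattern, run on an in-memory SQLite database through Python's sqlite3 rather than BigQuery. The readings table and its rows are invented for illustration, and the join also matches on the minimum value so that only the minimum rows come back. Note that ties on the minimum return more than one row per subject.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (subject_id INTEGER, id INTEGER, time_1 TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2020-01-01 08:00", 7),
        (1, 10, "2020-01-01 09:00", 3),
        (1, 10, "2020-01-01 10:00", 5),
        (2, 20, "2020-01-02 08:00", 4),
        (2, 20, "2020-01-02 09:00", 4),  # tie on the minimum value
    ],
)

min_rows = conn.execute(
    """
    WITH mins AS (
        SELECT subject_id, MIN(value) AS min_value
        FROM readings
        GROUP BY subject_id
    )
    SELECT t.subject_id, t.id, t.time_1, t.value
    FROM readings t
    JOIN mins m
      ON m.subject_id = t.subject_id
     AND m.min_value = t.value   -- without this, every row of the subject matches
    ORDER BY t.subject_id, t.time_1
    """
).fetchall()
```

Subject 1 yields its single minimum row, while subject 2 yields both tied rows, which is the main behavioral difference from the ROW_NUMBER() and ARRAY_AGG(... LIMIT 1) approaches.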

Postgres another issue with "column must appear in the GROUP BY clause or be used in an aggregate function"

I have 2 tables in my Postgres database.
vehicles
- veh_id PK
- veh_number
positions
- position_id PK
- vehicle_id FK
- time
- latitude
- longitude
.... few more fields
I have multiple entries in the positions table for every vehicle. I would like to get only the newest position for each vehicle (the one where the time field is latest). I tried a query like this:
SELECT *
FROM positions
GROUP BY vehicle_id
ORDER BY time DESC
But there's an error:
column "positions.position_id" must appear in the GROUP BY clause or be used in an aggregate function
I tried to change it to:
SELECT *
FROM positions
GROUP BY vehicle_id, position_id
ORDER BY time DESC
but then it doesn't group entries.
I tried to find similar problems, e.g.:
PostgreSQL - GROUP BY clause or be used in an aggregate function
or
GroupingError: ERROR: column must appear in the GROUP BY clause or be used in an aggregate function
but they didn't really help with my problem.
Could you help me fix my query?
It is simple: any column in the SELECT list must also appear in the GROUP BY clause unless it is wrapped in an aggregate function.
Also, don't use *; list the column names instead.
SELECT col1, col2, MAX(col3), COUNT(col4), AVG(col5) -- aggregated columns
-- dont go in GROUP BY
FROM yourTable
GROUP BY col1, col2 -- all not aggregated field
Now regarding your query, looks like you want
SELECT *
FROM (
SELECT * ,
row_number() over (partition by vehicle_id order by time desc) rn
FROM positions
) t
WHERE t.rn = 1;
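To see both points on a small example, here is a sketch using an in-memory SQLite database via Python's sqlite3 instead of Postgres; the three sample positions are made up. It shows that grouping by vehicle_id and position_id collapses nothing, because position_id is unique, and that the ROW_NUMBER() query keeps one newest row per vehicle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positions ("
    "position_id INTEGER PRIMARY KEY, vehicle_id INTEGER, "
    "time TEXT, latitude REAL, longitude REAL)"
)
conn.executemany(
    "INSERT INTO positions VALUES (?, ?, ?, ?, ?)",
    [
        (1, 100, "2021-05-01 08:00", 52.1, 21.0),
        (2, 100, "2021-05-01 09:00", 52.2, 21.1),
        (3, 200, "2021-05-01 07:30", 50.0, 19.9),
    ],
)

# Grouping by a unique key yields one group per row, so nothing is merged.
not_grouped = conn.execute(
    "SELECT position_id, vehicle_id FROM positions GROUP BY vehicle_id, position_id"
).fetchall()

# Number the rows of each vehicle from newest to oldest and keep number 1.
newest = conn.execute(
    """
    SELECT position_id, vehicle_id, time
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY vehicle_id ORDER BY time DESC) AS rn
        FROM positions
    ) t
    WHERE t.rn = 1
    ORDER BY vehicle_id
    """
).fetchall()
```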
try to use this group by clause
GROUP BY position_id,vehicle_id
primary key then FK

Return row data based on Distinct Column value

I'm using SQL Azure with asp script, and for the life of me, have had no luck trying to get this to work. The table I'm running a query on has many columns, but I want to query for distinct values on 2 columns (name and email), from there I want it to return the entire row's values.
What my query looks like now:
SELECT DISTINCT quote_contact, quote_con_email
FROM quote_headers
WHERE quote_contact LIKE '"&query&"%'
But I need it to return the whole row so I can retrieve other data points. Had I been smart a year ago, I would have created a separate table just for the contacts, but that's a year ago.
And before I implemented LiveSearch features.
One approach would be to use a CTE (Common Table Expression). With this CTE, you can partition your data by some criteria - i.e. your quote_contact and quote_con_email - and have SQL Server number all your rows starting at 1 for each of those partitions, ordered by some other criteria - i.e. probably SomeDateTimeStamp.
So try something like this:
;WITH DistinctContacts AS
(
SELECT
quote_contact, quote_con_email, (other columns here),
RN = ROW_NUMBER() OVER(PARTITION BY quote_contact, quote_con_email ORDER BY SomeDateTimeStamp DESC)
FROM
dbo.quote_headers
WHERE
quote_contact LIKE '"&query&"%'
)
SELECT
quote_contact, quote_con_email, (other columns here)
FROM
DistinctContacts
WHERE
RN = 1
Here, I am selecting only the last entry for each "partition" (i.e. for each pair of name/email) - ordered in a descending fashion by the time stamp.
Does that approach what you're looking for??
You need to provide more details.
This is what I could come up with without them:
WITH dist as (
SELECT DISTINCT quote_contact, quote_con_email, RANK() OVER(ORDER BY quote_contact, quote_con_email) rankID
FROM quote_headers
WHERE quote_contact LIKE '"&query&"%'
),
data as (
SELECT *, RANK() OVER(ORDER BY quote_contact, quote_con_email) rankID FROM quote_headers
)
SELECT * FROM dist d INNER JOIN data src ON d.rankID = src.rankID

SELECT *, COUNT(*) in SQLite

If I perform a standard query in SQLite:
SELECT * FROM my_table
I get all records in my table as expected. If I perform the following query:
SELECT *, 1 FROM my_table
I get all records as expected, with the rightmost column holding '1' in all records. But if I perform the query:
SELECT *, COUNT(*) FROM my_table
I get only ONE row (with rightmost column is a correct count).
Why do I get such results? I'm not very good at SQL, so maybe this behavior is expected? It seems very strange and illogical to me :(.
SELECT *, COUNT(*) FROM my_table is not what you want, and it's not really valid standard SQL (SQLite happens to accept it): every selected column that is not inside an aggregate has to appear in GROUP BY.
You'd want something like
SELECT somecolumn,someothercolumn, COUNT(*)
FROM my_table
GROUP BY somecolumn,someothercolumn
If you want to count the number of records in your table, simply run:
SELECT COUNT(*) FROM your_table;
count(*) is an aggregate function. Aggregate functions need a grouping to give meaningful results. You can read: count columns group by
If what you want is the total number of records in the table appended to each row you can do something like
SELECT *
FROM my_table
CROSS JOIN (SELECT COUNT(*) AS COUNT_OF_RECS_IN_MY_TABLE
FROM MY_TABLE)
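A minimal sketch that reproduces both behaviors with Python's built-in sqlite3 module; the single-column my_table and its three rows are made up for the example, and the totals alias is only there for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?)", [("a",), ("b",), ("c",)])

# An aggregate without GROUP BY collapses the whole table into one group,
# so SQLite returns a single row; the bare column comes from an arbitrary row.
collapsed = conn.execute("SELECT *, COUNT(*) FROM my_table").fetchall()

# Cross join against a one-row subquery to repeat the total on every row.
per_row = conn.execute(
    """
    SELECT my_table.name, totals.n
    FROM my_table
    CROSS JOIN (SELECT COUNT(*) AS n FROM my_table) AS totals
    ORDER BY my_table.name
    """
).fetchall()
```

The first query returns one row whose last column is 3, while the second returns all three rows, each carrying the total count.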