Postgres another issue with "column must appear in the GROUP BY clause or be used in an aggregate function" - sql

I have 2 tables in my Postgres database.
vehicles
- veh_id PK
- veh_number
positions
- position_id PK
- vehicle_id FK
- time
- latitude
- longitude
.... few more fields
I have multiple entries in the positions table for every vehicle. I would like to get one position per vehicle, namely the newest one (the row where the time field is latest). I tried a query like this:
SELECT *
FROM positions
GROUP BY vehicle_id
ORDER BY time DESC
But there's an error:
column "positions.position_id" must appear in the GROUP BY clause or be used in an aggregate function
I tried to change it to:
SELECT *
FROM positions
GROUP BY vehicle_id, position_id
ORDER BY time DESC
but then it doesn't group entries.
I tried to find similar problems, e.g.:
PostgreSQL - GROUP BY clause or be used in an aggregate function
or
GroupingError: ERROR: column must appear in the GROUP BY clause or be used in an aggregate function
but they didn't really help with my problem.
Could you help me fix my query?

It's simple: every column in the SELECT list must also appear in the GROUP BY clause, unless it is wrapped in an aggregate function.
Also, don't use *; list the column names.
SELECT col1, col2, MAX(col3), COUNT(col4), AVG(col5) -- aggregated columns
-- do not go in GROUP BY
FROM yourTable
GROUP BY col1, col2 -- all non-aggregated columns
Now regarding your query, looks like you want
SELECT *
FROM (
SELECT * ,
row_number() over (partition by vehicle_id order by time desc) rn
FROM positions
) t
WHERE t.rn = 1;
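As a rough check of the row_number() query above, here is a minimal sketch that runs the same pattern against an in-memory SQLite database from Python. SQLite is only a stand-in for Postgres here (it needs version 3.25 or newer for window functions), and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE positions (
        position_id INTEGER PRIMARY KEY,
        vehicle_id  INTEGER,
        time        TEXT,
        latitude    REAL,
        longitude   REAL
    );
    -- Hypothetical sample data: two positions per vehicle.
    INSERT INTO positions VALUES
        (1, 10, '2020-01-01 08:00', 50.0, 19.0),
        (2, 10, '2020-01-01 09:00', 50.1, 19.1),
        (3, 20, '2020-01-01 07:30', 51.0, 20.0),
        (4, 20, '2020-01-01 07:00', 51.1, 20.1);
""")

# Number the rows of each vehicle from newest to oldest, keep only row 1.
latest = conn.execute("""
    SELECT position_id, vehicle_id, time
    FROM (
        SELECT p.*,
               row_number() OVER (PARTITION BY vehicle_id ORDER BY time DESC) AS rn
        FROM positions p
    ) t
    WHERE t.rn = 1
    ORDER BY vehicle_id
""").fetchall()

print(latest)
```

Each vehicle contributes exactly one row, and the other columns of that row (latitude, longitude, and so on) are still available because nothing was grouped away.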

try to use this group by clause
GROUP BY position_id,vehicle_id
primary key then FK

Related

Get minimum without using row number/window function in Bigquery

I have a table like the one shown below
What I would like to do is get the minimum for each subject. Although I can do this with the row_number function, I would like to do it with a GROUP BY and min() approach, but that doesn't work.
row_number approach - works fine
SELECT * FROM (select subject_id,value,id,min_time,max_time,time_1,
row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
from table A) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see, just the two columns (subject_id and id) are enough to group the items together. They help differentiate the groups. But why am I not able to use the other columns in the select clause? If I add the other columns to the GROUP BY, I may not get the expected output because time_1 has different values.
I expect my output to be like as shown below
In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(0)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
After aggregating, you want the first element of the array. The .* expands the record referred to by a into its component columns.
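To make the per-group logic easier to follow, here is a plain Python analogy of what the aggregation computes: for each subject, sort that subject's rows by value, keep at most one, and take the first element. This is only an illustration of the idea, not BigQuery code, and the sample rows and the helper name min_row_per_subject are invented for the sketch.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows; time_1 differs within a subject.
rows = [
    {"subject_id": 1, "id": 101, "time_1": "09:00", "value": 7},
    {"subject_id": 1, "id": 101, "time_1": "10:00", "value": 3},
    {"subject_id": 2, "id": 202, "time_1": "09:30", "value": 5},
    {"subject_id": 2, "id": 202, "time_1": "11:00", "value": 9},
]

def min_row_per_subject(rows):
    """Return the whole row with the smallest value for each subject_id."""
    result = []
    by_subject = sorted(rows, key=itemgetter("subject_id"))  # groupby needs sorted input
    for _, group in groupby(by_subject, key=itemgetter("subject_id")):
        limited = sorted(group, key=itemgetter("value"))[:1]  # like ORDER BY value LIMIT 1
        result.append(limited[0])  # first (and only) element of the one-row list
    return result

picked = min_row_per_subject(rows)
for row in picked:
    print(row)
```

The point of the analogy is that the whole row travels with the minimum, so columns such as time_1 never need to appear in a GROUP BY.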
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you can easily remove it:
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
FROM A
) a
WHERE RANK = 1;
Are you looking for something like below-
SELECT
A.subject_id,
A.id,
A.min_time,
A.max_time,
A.time_1,
A.value
FROM table A
INNER JOIN(
SELECT subject_id, MIN(value) Value
FROM table
GROUP BY subject_id
) B ON A.subject_id = B.subject_id
AND A.Value = B.Value
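One thing worth checking with this join is how it behaves on ties. The sketch below runs the same join shape in an in-memory SQLite database from Python; SQLite is a stand-in for BigQuery, and the table name readings and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (subject_id INTEGER, id INTEGER, time_1 TEXT, value INTEGER);
    -- Hypothetical data: subject 1 reaches its minimum value twice.
    INSERT INTO readings VALUES
        (1, 101, '09:00', 3),
        (1, 101, '10:00', 3),
        (1, 101, '11:00', 8),
        (2, 202, '09:30', 5);
""")

# Join every row to the per-subject minimum and keep the matching rows.
joined = conn.execute("""
    SELECT a.subject_id, a.time_1, a.value
    FROM readings a
    INNER JOIN (
        SELECT subject_id, MIN(value) AS value
        FROM readings
        GROUP BY subject_id
    ) b ON a.subject_id = b.subject_id AND a.value = b.value
    ORDER BY a.subject_id, a.time_1
""").fetchall()

print(joined)
```

Subject 1 comes back twice because both of its rows share the minimum value. A row_number() filter would return exactly one row per subject instead, so which behaviour is wanted depends on the data.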
If you do not need to select the Time_1 column's value, the following query will work (as far as I can see, the values in the min_time and max_time columns are the same within a group)-
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, the best approach is if you can apply something like CAST(Time_1 AS DATE) on your time column. This will consider only the date part regardless of the time part. The query will be
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- Check whether the CAST ... AS DATE syntax in BigQuery
-- is as written here or slightly different.
Below is for BigQuery Standard SQL and is the most efficient way for cases like the one in your question
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases leads to a "Resources exceeded" error.
Note: a self join is also a very inefficient way of achieving your objective
A bit late to the party, but here is a cte-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id

SQL Query Multiple Columns Using Distinct on One Column Only and Using Order By

I am trying to select all the distinct names from a column and order them by another column, sort_order.
I've tried several things:
select distinct ( name ), sort_order from table1 where active=1 order by sort_order
The above code outputs two columns, however, some repeat names have different sort_order values and still appear.
select name, sort_order from table1 where
name in (Select min(name) FROM table1 where active=1 group by sort_order )
The above code produces the error message:
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
I tried replacing the order by with a group by, but this produces the list in the wrong order.
select distinct name as flavors from table1 where active=1 order by sort_order
The above code produces the error message:
ORDER BY items must appear in the select list if SELECT DISTINCT is specified.
I need the name column to display all distinct names and the sort_order column should display all the corresponding sort_order numbers (some may repeat).
Use aggregation:
select name, max(sort_order)
from z_mflavs
where active = 1
group by name
order by max(sort_order); -- or min() or avg()
Note that in your query, the parentheses around (name) are utterly superfluous. SELECT DISTINCT is a clause in the SQL language and it applies to all columns being selected, regardless of whether any are expressions in parentheses.
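The difference between the two forms can be seen in a small sketch run through SQLite from Python. SQLite stands in for the asker's database here, and the rows in table1 are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (name TEXT, sort_order INTEGER, active INTEGER);
    -- Hypothetical data: 'vanilla' appears with two different sort_order values.
    INSERT INTO table1 VALUES
        ('vanilla', 1, 1),
        ('vanilla', 4, 1),
        ('mint',    2, 1),
        ('cherry',  3, 0);
""")

# DISTINCT applies to the whole (name, sort_order) pair; the parentheses change nothing.
distinct_rows = conn.execute("""
    SELECT DISTINCT (name), sort_order
    FROM table1 WHERE active = 1
    ORDER BY sort_order
""").fetchall()

# Aggregation collapses each name to one row and picks one sort_order for it.
grouped_rows = conn.execute("""
    SELECT name, MAX(sort_order) AS sort_order
    FROM table1 WHERE active = 1
    GROUP BY name
    ORDER BY MAX(sort_order)
""").fetchall()

print(distinct_rows)
print(grouped_rows)
```

In the first result 'vanilla' still appears twice, once per distinct pair; in the second it appears once, with the sort_order chosen by the aggregate.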
Use this query if you want the values of the name and sort_order columns:
SELECT name, MAX(sort_order) as sort_order FROM table1
Group by name ORDER BY MAX(sort_order) DESC

Redshift - optimal way to group by and select multiple non-aggregate columns

I'm trying to aggregate a table based on grouping a column. The table looks like this-
The primary objective is to get a sum based on similar IDs -
SELECT id,sum(amt) FROM tbl GROUP BY id;
This returns a valid result where the 'amt' column is summed up correctly -
But in addition to this I also want to get other columns based on certain logic. A result like this -
I tried -
SELECT id, sum(amt), location, date FROM tbl GROUP BY id;
This fails because the GROUP BY clause does not contain all the columns in the SELECT clause (location, date, etc.). My only option here is to use other aggregate functions like MIN(), MAX(), etc.
To circumvent this I tried a different approach using window functions in Redshift like this -
WITH AGG_TBL AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS ROW,
id,
SUM(amt) OVER (PARTITION BY id) AS sum,
location,
date
FROM tbl
)
SELECT x.*
FROM AGG_TBL x
WHERE x.row = 1
The code above selects the first row based on the partition logic and I'm able to select multiple columns.
I will be running this aggregation query on 600+million rows so I want to know if this is indeed the most optimal way to do this in Redshift or if there is a better more efficient way?

error using "order by" in a select statement (error: column is not contained in "either an aggregate function or the GROUP BY clause")

I have a table below and want to count the number of consecutive occurrences each letter appears. The code to reproduce the table I am using is listed for those helping to save time.
CREATE TABLE table1 (id integer, names varchar(50));
INSERT INTO table1 VALUES (1,'A');
INSERT INTO table1 VALUES (2,'A');
INSERT INTO table1 VALUES (3,'B');
INSERT INTO table1 VALUES (4,'B');
INSERT INTO table1 VALUES (5,'B');
INSERT INTO table1 VALUES (6,'B');
INSERT INTO table1 VALUES (7,'C');
INSERT INTO table1 VALUES (8,'B');
INSERT INTO table1 VALUES (9,'B');
select * from table1;
I found code already written to accomplish this online, which I've tested and can confirm it runs successfully. It's shown here.
select names, count(*) as count
from (select id, names, (row_number() over (order by id) - row_number() over (partition by names order by id)) as grp
from table1
) as temp
group by grp, names
I am trying to add in the ORDER BY clause at the end, like so:
select names, count(*) as count
from (select id, names, (row_number() over (order by id) - row_number() over (partition by names order by id)) as grp
from table1
) as temp
group by grp, names
order by id -- added this here, but it creates an error.
but kept getting the error "Column "temp.id" is invalid in the ORDER BY clause because it is not contained in either an aggregate function or the GROUP BY clause." However, I am able to order by "names." What is the difference here?
Also, why can't I add in the "order by id" in the subquery? If I run this subquery on its own (see below), then the "order by id" is fine, but all together it cannot run. Why is this?
select names, count(*) as count
from (select id, names, (row_number() over (order by id) - row_number() over (partition by names order by id)) as grp
from table1
order by id -- added this in here, but it creates an error.
) as temp
group by grp, names
order by names
A select statement returns rows in an arbitrary order -- unless it has an order by. This is an extension of the fact that SQL operates on unordered sets.
Your select has no order by, so you should not assume the data will come back in any particular order. To get the results ordered by id, add an order by to the select.
kept getting the error "Column "temp.id" is invalid in the ORDER BY
clause because it is not contained in either an aggregate function or
the GROUP BY clause." However, I am able to order by "names." What is
the difference here?
SQL does things in a certain order. If your query has a GROUP BY (which yours does), the grouping is done before the ORDER BY. After grouping, the only things SQL has are the columns that are grouped by and the aggregates computed per group, so those are the only things that can be used in the order by clause.
As an example, think of houses in a street. If you did a query on houses, returning colour & count, you might get something like Red 2, White 10, Green 3. But asking to sort that by address number makes no sense, because that information is not in data we've returned. In your case you are returning names, count, and you used grp in the group by clause, so those are the only things you can use to sort the final data, because they are all you have, and all that makes sense.
Also, why can't I add in the "order by id" in the subquery? If I run
this subquery on its own (see below), then the "order by id" is fine,
but all together it cannot run. Why is this?
When you have a subquery, the results are used as if they were a table. You can join on it, or query from it like you are, but the point is that the order of that table has no effect on anything else. The entry order of the underlying table is no guarantee that your query will come out in that order, unless you use an order by clause. And because you are doing a group by, that order means nothing anyway. Because the order of the subquery has no effect, SQL won't let you put it in.
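One way to get the runs back in their original sequence, consistent with the explanation above, is to order by an aggregate of id such as MIN(id), since an aggregate is well defined for each group. The sketch below tries this on the question's data using SQLite from Python. SQLite (3.25 or newer) is a stand-in for the asker's database, and using MIN(id) is a suggestion of this sketch rather than something the answers above proposed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, names VARCHAR(50))")
# Same rows as in the question: ids 1..9 with names A, A, B, B, B, B, C, B, B.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    list(enumerate("AABBBBCBB", start=1)),
)

# Each run of consecutive equal names gets its own grp value;
# MIN(id) is the id at which that run starts, so it can order the groups.
runs = conn.execute("""
    SELECT names, COUNT(*) AS count
    FROM (
        SELECT id, names,
               row_number() OVER (ORDER BY id)
             - row_number() OVER (PARTITION BY names ORDER BY id) AS grp
        FROM table1
    ) AS temp
    GROUP BY grp, names
    ORDER BY MIN(id)
""").fetchall()

print(runs)
```

The second run of B stays separate from the first and appears last, which matches the order of the rows in the table.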

Optimal Oracle SQL Query to complete group-by on multiple columns in single table containing ~ 7,000,000 records

I am a SQL Novice in need of some advice. What is the most efficient (fastest running query) way to do the following-
Select all columns from a table after-
-Performing a "Group By" based on the unique values contained in two columns: "top_line_id" and "external_reference".
-Selecting a single record from each group based on the max or min value (doesn't matter which one) contained in a different field such as support_id.
Someone on my team provided the below query, but I can't seem to get it working. I receive an error message stating "invalid relational operator" when I attempt to execute it.
Select *
from STAGE.SFS_GH_R3_IB_ENTLMNT_CONTACTS
Where support_id, external_reference, top_line_id in (
select max(support_id),
external_reference,
top_line_id from STAGE.SFS_GH_R3_IB_ENTLMNT_CONTACTS
)
One more thing - the columns on which we are performing the Group By contain null values in some records. We would like those excluded from the query.
Any assistance you can provide is very much appreciated.
Although you phrase this as a group by query, there is another approach using row_number(). This enumerates each row in the group, based on the "order by" clause. In the following query, it enumerates each group based on external_reference and top_line_id, ordered by support_id:
select *
from (Select t.*,
row_number() over (partition by external_reference, top_line_id
order by support_id) as seqnum
from STAGE.SFS_GH_R3_IB_ENTLMNT_CONTACTS t
)
where seqnum = 1
This should work (can't test it):
SELECT
*
FROM
stage.sfs_gh_r3_ib_entlmnt_contacts
WHERE
(support_id, external_reference, top_line_id) IN
(
SELECT
max(support_id),
external_reference,
top_line_id
FROM
stage.sfs_gh_r3_ib_entlmnt_contacts
WHERE
external_reference IS NOT NULL AND
top_line_id IS NOT NULL
GROUP BY
top_line_id, external_reference
)
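Since the query above was posted untested, here is a small sketch that runs the same shape (a parenthesised column list compared with IN against a grouped subquery) in SQLite from Python. SQLite is a stand-in for Oracle, so this only checks the logic and not Oracle's syntax or performance; the table name contacts and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (support_id INTEGER, external_reference TEXT, top_line_id INTEGER);
    -- Hypothetical data: two complete groups plus two rows with a NULL grouping column.
    INSERT INTO contacts VALUES
        (1, 'X',  100),
        (2, 'X',  100),
        (3, 'Y',  200),
        (4, NULL, 200),
        (5, 'Y',  NULL);
""")

# Keep the row holding the highest support_id in each
# (top_line_id, external_reference) group, skipping NULL keys.
picked = conn.execute("""
    SELECT support_id, external_reference, top_line_id
    FROM contacts
    WHERE (support_id, external_reference, top_line_id) IN (
        SELECT MAX(support_id), external_reference, top_line_id
        FROM contacts
        WHERE external_reference IS NOT NULL
          AND top_line_id IS NOT NULL
        GROUP BY top_line_id, external_reference
    )
    ORDER BY support_id
""").fetchall()

print(picked)
```

The rows with a NULL in either grouping column drop out, and each remaining group contributes the single row with its largest support_id.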