Get values from first and last row per group - sql

I'm new to Postgres, coming from MySQL and hoping that one of y'all would be able to help me out.
I have a table with three columns: name, week, and value. This table has a record of the names, the week at which they recorded the height, and the value of their height.
Something like this:
Name | Week | Value
------+--------+-------
John | 1 | 9
Cassie| 2 | 5
Luke | 6 | 3
John | 8 | 14
Cassie| 5 | 7
Luke | 9 | 5
John | 2 | 10
Cassie| 4 | 4
Luke | 7 | 4
What I want is a list per user of the value at the minimum week and the max week. Something like this:
Name |minWeek | Value |maxWeek | value
------+--------+-------+--------+-------
John | 1 | 9 | 8 | 14
Cassie| 2 | 5 | 5 | 7
Luke | 6 | 3 | 9 | 5
In Postgres, I use this query:
select name, week, value
from table t
inner join(
select name, min(week) as minweek
from table
group by name)
ss on t.name = ss.name and t.week = ss.minweek
group by t.name
;
However, I receive an error:
column "w.week" must appear in the GROUP BY clause or be used in an aggregate function
Position: 20
This worked fine for me in MySQL so I'm wondering what I'm doing wrong here?

There are various simpler and faster ways.
2x DISTINCT ON
SELECT *
FROM (
SELECT DISTINCT ON (name)
name, week AS first_week, value AS first_val
FROM tbl
ORDER BY name, week
) f
JOIN (
SELECT DISTINCT ON (name)
name, week AS last_week, value AS last_val
FROM tbl
ORDER BY name, week DESC
) l USING (name);
Or shorter:
SELECT *
FROM (SELECT DISTINCT ON (1) name, week AS first_week, value AS first_val FROM tbl ORDER BY 1,2) f
JOIN (SELECT DISTINCT ON (1) name, week AS last_week , value AS last_val FROM tbl ORDER BY 1,2 DESC) l USING (name);
Simple and easy to understand. Also fastest in my old tests. Detailed explanation for DISTINCT ON:
Select first row in each GROUP BY group?
2x window function, 1x DISTINCT ON
SELECT DISTINCT ON (name)
name, week AS first_week, value AS first_val
, first_value(week) OVER w AS last_week
, first_value(value) OVER w AS last_value
FROM tbl t
WINDOW w AS (PARTITION BY name ORDER BY week DESC)
ORDER BY name, week;
The explicit WINDOW clause only shortens the code, no effect on performance.
first_value() of composite type
The aggregate functions min() or max() do not accept composite types as input. You would have to create custom aggregate functions (which is not that hard).
But the window functions first_value() and last_value() do. Building on that we can devise simple solutions:
Simple query
SELECT DISTINCT ON (name)
name, week AS first_week, value AS first_value
,(first_value((week, value)) OVER (PARTITION BY name ORDER BY week DESC))::text AS l
FROM tbl t
ORDER BY name, week;
The output has all data, but the values for the last week are stuffed into an anonymous record (optionally cast to text). You may need decomposed values.
Decomposed result with opportunistic use of table type
For that we need a well-known composite type. An adapted table definition would allow for the opportunistic use of the table type itself directly:
CREATE TABLE tbl (week int, value int, name text); -- optimized column order
week and value come first, so now we can sort by the table type itself:
SELECT (l).name, first_week, first_val
, (l).week AS last_week, (l).value AS last_val
FROM (
SELECT DISTINCT ON (name)
week AS first_week, value AS first_val
, first_value(t) OVER (PARTITION BY name ORDER BY week DESC) AS l
FROM tbl t
ORDER BY name, week
) sub;
Decomposed result from user-defined row type
Adapting the table definition like that is probably not possible in most cases. Instead, register a composite type with CREATE TYPE (permanent) or with CREATE TEMP TABLE (for the duration of the session):
CREATE TEMP TABLE nv(last_week int, last_val int); -- register composite type
SELECT name, first_week, first_val, (l).last_week, (l).last_val
FROM (
SELECT DISTINCT ON (name)
name, week AS first_week, value AS first_val
, first_value((week, value)::nv) OVER (PARTITION BY name ORDER BY week DESC) AS l
FROM tbl t
ORDER BY name, week
) sub;
Custom aggregate functions first() & last()
Create functions and aggregates once per database:
CREATE OR REPLACE FUNCTION public.first_agg (anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1';
CREATE AGGREGATE public.first(anyelement) (
SFUNC = public.first_agg
, STYPE = anyelement
, PARALLEL = safe
);
CREATE OR REPLACE FUNCTION public.last_agg (anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $2';
CREATE AGGREGATE public.last(anyelement) (
SFUNC = public.last_agg
, STYPE = anyelement
, PARALLEL = safe
);
Then:
SELECT name
, first(week) AS first_week, first(value) AS first_val
, last(week) AS last_week , last(value) AS last_val
FROM (SELECT * FROM tbl ORDER BY name, week) t
GROUP BY name;
Probably the most elegant solution. Faster with the additional module first_last_agg providing a C implementation.
Compare instructions in the Postgres Wiki.
Related:
Calculating follower growth over time for each influencer
Each of these queries was substantially faster than the currently accepted answer in a quick test on a table with 50k rows with EXPLAIN ANALYZE.
There are more ways. Depending on data distribution, different query styles may be (much) faster, yet. See:
Optimize GROUP BY query to retrieve latest row per user

This is a bit of a pain, because Postgres has the nice window functions first_value() and last_value(), but these are not aggregation functions. So, here is one way:
select t.name, min(t.week) as minWeek, max(firstvalue) as firstvalue,
max(t.week) as maxWeek, max(lastvalue) as lastValue
from (select t.*, first_value(value) over (partition by name order by week) as firstvalue,
first_value(value) over (partition by name order by week desc) as lastvalue
from tbl t
) t
group by t.name;
The last value is taken with first_value() over the descending order, because last_value() with the default window frame only sees rows up to the current one.
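As a sanity check of the window-function approach, here is a minimal sketch that runs against the sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in for Postgres (an assumption made only so the example is self-contained), and the table name tbl is borrowed from the other answer. With an ORDER BY in the window, the default frame ends at the current row, so the sketch spells out a frame covering the whole partition to make last_value() return the real last row.

```python
import sqlite3

# Sample rows from the question. SQLite is only a stand-in for Postgres here.
rows = [
    ("John", 1, 9), ("Cassie", 2, 5), ("Luke", 6, 3),
    ("John", 8, 14), ("Cassie", 5, 7), ("Luke", 9, 5),
    ("John", 2, 10), ("Cassie", 4, 4), ("Luke", 7, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (name TEXT, week INTEGER, value INTEGER)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows)

# The explicit frame makes last_value() look at the whole partition,
# not just the rows up to the current one.
query = """
SELECT DISTINCT
       name,
       first_value(week)  OVER w AS first_week,
       first_value(value) OVER w AS first_val,
       last_value(week)   OVER w AS last_week,
       last_value(value)  OVER w AS last_val
FROM   tbl
WINDOW w AS (PARTITION BY name ORDER BY week
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER  BY name
"""
first_last = conn.execute(query).fetchall()
for row in first_last:
    print(row)
```

Every row of a partition carries the same four window values, so DISTINCT collapses each name to a single row.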

Related

Selecting pair(including reverse order) with highest date value

I have a messages table like this
Messages Table
I want to select each unique pair (including reversed order) with the highest date. The result of the SELECT statement should look like this:
from_id | to_id | date  | message
--------+-------+-------+------------------
1       | 2     | 13:06 | I'm Alp
2       | 3     | 13:06 | I'm Oliver
3       | 1     | 11:38 | From third to one
I tried to use distinct with max function but it didn't help.
You can use window functions:
select *
from (
select m.*,
row_number() over(partition by min(from_id, to_id), max(from_id, to_id) order by date desc) rn
from messages m
) m
where rn = 1
Note: counter-intuitively enough, SQLite's min() and max() functions, when given several arguments, are the equivalent of least() and greatest() in other databases.
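To see the pair normalization at work, here is a small sketch using Python's sqlite3 module. The question's table was posted as an image, so the rows below are hypothetical and chosen only so that the newest message per pair matches the expected output; the older rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (from_id INTEGER, to_id INTEGER, date TEXT, message TEXT)"
)
# Hypothetical rows: each pair appears in both directions.
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?)",
    [
        (2, 1, "12:00", "older reply"),
        (1, 2, "13:06", "I'm Alp"),
        (3, 2, "10:15", "older message"),
        (2, 3, "13:06", "I'm Oliver"),
        (1, 3, "09:00", "older message"),
        (3, 1, "11:38", "From third to one"),
    ],
)

# min(a, b) / max(a, b) with two arguments are scalar functions in SQLite,
# so (1, 2) and (2, 1) land in the same partition.
query = """
SELECT from_id, to_id, date, message
FROM (
    SELECT m.*,
           row_number() OVER (
               PARTITION BY min(from_id, to_id), max(from_id, to_id)
               ORDER BY date DESC
           ) AS rn
    FROM messages m
) AS ranked
WHERE rn = 1
ORDER BY from_id
"""
latest = conn.execute(query).fetchall()
for row in latest:
    print(row)
```

In databases other than SQLite, the same idea would use least() and greatest() in the PARTITION BY.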

Select multiple columns having distinct just in 3 of them

I've got a table from which I need to return about 14 column values, but only one row for duplicates on some of the columns.
The second problem is that, among the duplicates, I need to keep the one with the biggest int in a column that is not required to be unique.
Since the table is somewhat big, I am seeking advice on doing this in the most efficient way.
Should I be doing a GROUP BY?
My table looks somewhat like this (I have simplified the number of columns):
ID (UniqueIdentifier) | ACCID (UniqueIdentifier) | DateTime (DateTime) | distance (int) | type (int)
28761188-0886-E911-822F-DD1FA635D450 | 1238FD8A-BD00-411A-A81C-0F6F5C026BCC | 2019-06-03 14:04:41.000 | 2 | 3
41761188-0886-E911-822F-DD1FA635D450 | 1238FD8A-BD00-411A-A81C-0F6F5C026BCC | 2019-06-03 14:04:41.000 | 1 | 3
I should only select one row per unique combination of ACCID and DateTime. The ID column is the primary key, so it is never duplicated, and I need to keep the row with the biggest distance.
You can use the ROW_NUMBER() window function, as in:
select *
from (
select
id,
accid,
datetime,
distance,
type,
row_number() over(partition by accid, datetime order by distance desc) as rn
from t
) x
where rn = 1
If you want to show multiple "ties", then replace ROW_NUMBER() by RANK().
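The difference between the two ranking functions only shows up when rows tie. Here is a minimal sketch with Python's sqlite3 module, ordering by distance as the question asks. The short ids and the second group with a tie are hypothetical data, and top_ids is a helper name made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id TEXT PRIMARY KEY, accid TEXT, datetime TEXT,"
    " distance INTEGER, type INTEGER)"
)
# Hypothetical rows: group X has a clear winner, group Y has a tie on distance.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        ("a", "X", "2019-06-03 14:04:41", 2, 3),
        ("b", "X", "2019-06-03 14:04:41", 1, 3),
        ("c", "Y", "2019-06-03 15:00:00", 5, 1),
        ("d", "Y", "2019-06-03 15:00:00", 5, 2),
    ],
)


def top_ids(ranking_function):
    """Return ids ranked first per (accid, datetime) by descending distance.

    ranking_function is interpolated into the SQL text, so pass only
    trusted names such as "row_number" or "rank".
    """
    sql = f"""
    SELECT id
    FROM (
        SELECT id,
               {ranking_function}() OVER (
                   PARTITION BY accid, datetime ORDER BY distance DESC
               ) AS rn
        FROM t
    ) AS x
    WHERE rn = 1
    ORDER BY id
    """
    return [r[0] for r in conn.execute(sql)]


print(top_ids("row_number"))  # one id per group; the tie in Y is broken arbitrarily
print(top_ids("rank"))        # both tied rows of group Y are kept
```

If the choice among tied rows matters, adding a tie-breaker column (for example id) to the ORDER BY makes ROW_NUMBER() deterministic.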
I would suggest a correlated subquery with the right index as the fastest method:
select t.*
from t
where t.id = (select top (1) t2.id
from t t2
where t2.ACCID = t.ACCID
and t2.DateTime = t.DateTime
order by t2.distance desc
) ;
The best index is on (ACCID, DateTime, distance desc, id).

PostgreSQL using sum in where clause

I have a table which has a numeric column named 'capacity'. I want to select the first rows whose total capacity is no greater than X, something like this query:
select * from table where sum(capacity )<X
But I know I cannot use aggregate functions in the WHERE clause. So what other ways exist to solve this problem?
Here is some sample data
id| capacity
1 | 12
2 | 13.5
3 | 15
I want to list the rows, in order of id, whose running sum is less than 26, so a query like this:
select * from table where sum(capacity )<26 order by id
and it must give me
id| capacity
1 | 12
2 | 13.5
because 12+13.5<26
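A bit late to the party, but for future reference: if you need to filter on a per-group total rather than a running total, a HAVING clause works. Note that with one row per id, as in the OP's sample data, this returns every row whose own capacity is below 26: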
SELECT id, sum(capacity)
FROM table
GROUP BY id
HAVING sum(capacity) < 26
ORDER by id ASC;
Use the PostgreSQL docs for reference to aggregate functions: https://www.postgresql.org/docs/9.1/tutorial-agg.html
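Use a HAVING clause. It goes after GROUP BY and before ORDER BY, and it filters per group rather than on a running total:
select id, sum(capacity) from table group by id having sum(capacity)<X order by id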
You can use the window variant of sum to produce a cumulative sum, and then use it in the where clause. Note that window functions can't be placed directly in the where clause, so you'd need a subquery:
SELECT id, capacity
FROM (SELECT id, capacity, SUM(capacity) OVER (ORDER BY id ASC) AS cum_sum
FROM mytable) t
WHERE cum_sum < 26
ORDER BY id ASC;
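To check the cumulative-sum query against the sample data, here is a minimal sketch using Python's sqlite3 module as a stand-in for PostgreSQL (an assumption made only to keep the example self-contained). The table name mytable is taken from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, capacity REAL)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, 12), (2, 13.5), (3, 15)],  # sample data from the question
)

# Running totals are 12, 25.5 and 40.5, so only the first two rows pass.
query = """
SELECT id, capacity
FROM (SELECT id, capacity,
             SUM(capacity) OVER (ORDER BY id ASC) AS cum_sum
      FROM mytable) t
WHERE cum_sum < 26
ORDER BY id ASC
"""
selected = conn.execute(query).fetchall()
print(selected)
```

Because id is unique here, each row's running total covers exactly the rows up to and including itself.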

SQL Query to group text based on numeric column

I have a table 'TEST' as shown below
Number | Seq | Name
-------+-------+------
123 | 1 | Hello
123 | 2 | Hi
123 | 3 | Greetings
234 | 1 | Goodbye
234 | 2 | Bye
I want to write a query, to group the table by 'Number', and select the rows with the maximum sequence number (MAX(Seq)). The output of the query would be
Number | Seq | Name
-------+-------+------
123 | 3 | Greetings
234 | 2 | Bye
How do I go about this?
EDIT: TEST is actually a table that is the result from a long query (joining multiple tables) that I have already written. I already have a (SELECT ...) statement to get the values I need. Is there a way to remove duplicate rows (with the same 'Number' as shown above) and select only the one with maximum 'Seq' value.
I am on Microsoft SQL Server 2008 (SP2)
I was hoping there would be a way to achieve this by
SELECT * FROM (SELECT ...) TEST <condition to group>
You can use a row-value IN clause with a subselect (supported in e.g. Oracle, MySQL and PostgreSQL, but not in SQL Server):
select * from test
where (number, seq) in (select number, max(seq) from test group by number)
Another option is to use a windowed ROW_NUMBER() function with a partition on the number:
With Cte As
(
Select *,
Row_Number() Over (Partition By Number Order By Seq Desc) RN
From TEST
)
Select Number, Seq, Name
From Cte
Where RN = 1
SELECT *
FROM (SELECT test.*, MAX (seq) OVER (PARTITION BY num) max_seq
FROM test) t
WHERE seq = max_seq
I changed the column name from number to num because NUMBER is a reserved word in some databases (Oracle, for example). This is pretty much the same as the other answers, except that it explicitly gets the maximum sequence number for each NUM.
You want to use an ANALYTIC function together with a conditional clause to get only the rows of TEST that you want.
WITH TEST AS (
...your really complex query that generates TEST...
), ranked AS (
SELECT
Number, Seq, Name,
RANK() OVER (PARTITION BY Number ORDER BY Seq DESC) AS aRank
FROM TEST
)
SELECT Number, Seq, Name
FROM ranked
WHERE aRank = 1
;
This returns the Number, Seq, Name for each Number grouping where the Seq is maximum. The ranking has to be computed in its own CTE because a window function's alias cannot be referenced in the WHERE clause of the same SELECT.
The solution to this is to do a self join on only the MAX(Seq) values.
This answer can be found at SQL Select only rows with Max Value on a Column
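A minimal sketch of that self join, run against the sample data from the question with Python's sqlite3 module. SQLite stands in for SQL Server here (an assumption for the sake of a self-contained example), and the alias max_seq is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (number INTEGER, seq INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        (123, 1, "Hello"), (123, 2, "Hi"), (123, 3, "Greetings"),
        (234, 1, "Goodbye"), (234, 2, "Bye"),
    ],
)

# Join every row to the per-number maximum and keep only the matching rows.
query = """
SELECT t.number, t.seq, t.name
FROM test AS t
JOIN (SELECT number, MAX(seq) AS max_seq
      FROM test
      GROUP BY number) AS m
  ON m.number = t.number AND m.max_seq = t.seq
ORDER BY t.number
"""
max_rows = conn.execute(query).fetchall()
print(max_rows)
```

If two rows share the maximum Seq for a Number, this form returns both of them, much like RANK() does.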

How to write excluding sum query?

I have a values table:
+------------+---------+
| name | value |
+------------+---------+
| parameter1 | 53.8462 |
| parameter2 | 7.6923 |
| parameter3 | 23.0769 |
| parameter4 | 15.3846 |
+------------+---------+
What is the query to sum the values of the last three parameters (parameter2, parameter3, parameter4), excluding the first parameter (parameter1)?
SELECT SUM(value) tot
FROM table
WHERE name='parameter2' OR name='parameter3' OR name='parameter4'
or
SELECT SUM(value) tot
FROM table
WHERE name<>'parameter1'
This may be a bit simplistic, but can't you do this:
select sum(value) from table where name != 'parameter1'
If what you are really after is the sum past n-th value, you could do this (in SQL Server):
WITH OrderedRows AS
(
SELECT name, value,
ROW_NUMBER() OVER (ORDER BY name) AS 'RowNumber'
FROM table
)
SELECT sum(value)
FROM OrderedRows
WHERE RowNumber > 1;
If you only need this specific case:
SELECT SUM(value) tot
FROM table
WHERE name<>'parameter1'
but if you need a generic solution, then do not use this.
With some null-checking, so the sum still works:
SELECT SUM(coalesce(value, 0)) your_total
FROM table
WHERE coalesce(name, '') <> 'parameter1'
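The null check on name matters because NULL <> 'parameter1' evaluates to unknown, so a row with a NULL name is silently dropped by the plain comparison. Here is a small sketch with Python's sqlite3 module; the table name vals and the integer values are hypothetical and chosen only to make the difference easy to see.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vals (name TEXT, value INTEGER)")
# Hypothetical rows, including one whose name is NULL.
conn.executemany(
    "INSERT INTO vals VALUES (?, ?)",
    [("parameter1", 50), ("parameter2", 10), ("parameter3", 20), (None, 5)],
)

# Plain comparison: the NULL-named row is filtered out.
plain = conn.execute(
    "SELECT SUM(value) FROM vals WHERE name <> 'parameter1'"
).fetchone()[0]

# With coalesce, the NULL name becomes '' and the row is kept.
null_safe = conn.execute(
    "SELECT SUM(coalesce(value, 0)) FROM vals"
    " WHERE coalesce(name, '') <> 'parameter1'"
).fetchone()[0]

print(plain, null_safe)
```

Whether rows with a NULL name should count toward the total depends on the data, so the second form is not automatically the right one.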
select sum(value) from values where name!='parameter1';
In place of != you could also use <>.
If your goal is to sum the last three rows, even on bigger tables than your example, you are looking for moving window functions.
In Oracle you can write
WITH T AS (
SELECT 'parameter1' PAR, 2 VAL FROM DUAL
UNION ALL
SELECT 'parameter2' PAR, 3 VAL FROM DUAL
UNION ALL
SELECT 'parameter3' PAR, 5 VAL FROM DUAL
UNION ALL
SELECT 'parameter4' PAR, 7 VAL FROM DUAL
)
SELECT PAR, SUM(VAL) OVER (ORDER BY PAR ROWS 2 PRECEDING) LAST3SUM FROM T;
This yields:
PAR LAST3SUM
---------- ----------
parameter1 2
parameter2 5
parameter3 10
parameter4 15
You should look at the Oracle documentation on analytic functions and keep the following in mind:
Note that the query uses SUM, but no GROUP BY. This is because we are not aggregating data, but calculating the SUM for each row we select.
Note that order is important, in my example I ORDER BY PAR, but you can as well order by any other column available in your query.
Oracle Data Warehousing Guide also discusses windowing functions, giving a lot of useful examples.