Group by repeating attribute - SQL

Basically I have a table messages, with a user_id field that identifies the user who created each message.
When I display a conversation (a set of messages) between two users, I want to be able to group the messages by user_id, but in a tricky way:
Let's say there are some messages (sorted by created_at desc):
id: 1, user_id: 1
id: 2, user_id: 1
id: 3, user_id: 2
id: 4, user_id: 2
id: 5, user_id: 1
I want to get 3 message groups in the below order:
[1,2], [3,4], [5]
It should group by *user_id* until it sees a different one, and then group by that one.
I'm using PostgreSQL and would be happy to use something specific to it, whatever would give the best performance.

Try something like this:
SELECT user_id, array_agg(id)
FROM (
    SELECT id,
           user_id,
           row_number() OVER (ORDER BY created_at) -
           row_number() OVER (PARTITION BY user_id ORDER BY created_at) AS conv_id
    FROM table1
) t
GROUP BY user_id, conv_id;
The expression:

row_number() OVER (ORDER BY created_at) -
row_number() OVER (PARTITION BY user_id ORDER BY created_at) AS conv_id

gives every message group a distinguishing id. This conv_id can repeat for different user_id values, but the pair (user_id, conv_id) identifies each message group uniquely.
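To see what it computes, here is a minimal sketch run against the sample data from the question (inlined via VALUES; the aliases rn_all and rn_user are purely illustrative):

SELECT id, user_id,
       row_number() OVER (ORDER BY id) AS rn_all,
       row_number() OVER (PARTITION BY user_id ORDER BY id) AS rn_user,
       row_number() OVER (ORDER BY id) -
       row_number() OVER (PARTITION BY user_id ORDER BY id) AS conv_id
FROM (VALUES (1, 1), (2, 1), (3, 2), (4, 2), (5, 1)) AS t(id, user_id);

-- id | user_id | rn_all | rn_user | conv_id
--  1 |       1 |      1 |       1 |       0
--  2 |       1 |      2 |       2 |       0
--  3 |       2 |      3 |       1 |       2
--  4 |       2 |      4 |       2 |       2
--  5 |       1 |      5 |       3 |       2

Note how conv_id = 2 appears for both users, yet (user_id, conv_id) still yields the three distinct groups [1,2], [3,4], [5].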
My SQLFiddle with example.
Details: row_number(), OVER (PARTITION BY ... ORDER BY ...)

Proper SQL
I want to get 3 message groups in the below order: [1,2], [3,4], [5]
To get the requested order, add ORDER BY min(id):
SELECT grp, user_id, array_agg(id) AS ids
FROM (
    SELECT id
         , user_id
         , row_number() OVER (ORDER BY id) -
           row_number() OVER (PARTITION BY user_id ORDER BY id) AS grp
    FROM tbl
    ORDER BY 1 -- for ordered arrays in result
) t
GROUP BY grp, user_id
ORDER BY min(id);
db<>fiddle here
Old SQLFiddle
The addition would barely warrant another answer. The more important issue is this:
Faster with PL/pgSQL
I'm using PostgreSQL and would be happy to use something specific to it, whatever would give the best performance.
Pure SQL is all nice and shiny, but a procedural server-side function is much faster for this task. While processing rows procedurally is generally slower, PL/pgSQL wins this competition big-time because it can make do with a single table scan and a single ORDER BY operation:
CREATE OR REPLACE FUNCTION f_msg_groups()
  RETURNS TABLE (ids int[])
  LANGUAGE plpgsql AS
$func$
DECLARE
   _id   int;
   _uid  int;
   _id0  int;  -- id of last row
   _uid0 int;  -- user_id of last row
BEGIN
   FOR _id, _uid IN
      SELECT id, user_id FROM messages ORDER BY id
   LOOP
      IF _uid <> _uid0 THEN
         RETURN QUERY VALUES (ids);  -- output finished group (skipped on 1st row: _uid0 IS NULL)
         ids := ARRAY[_id];          -- start new array
      ELSE
         ids := ids || _id;          -- add to array (also initializes it on the 1st row)
      END IF;

      _id0  := _id;
      _uid0 := _uid;                 -- remember last row
   END LOOP;

   RETURN QUERY VALUES (ids);        -- output last group
END
$func$;
Call:
SELECT * FROM f_msg_groups();
Benchmark and links
I ran a quick test with EXPLAIN ANALYZE on a similar real-life table with 60k rows (executed several times, picking the fastest result to exclude caching effects):
SQL:
Total runtime: 1009.549 ms
PL/pgSQL:
Total runtime: 336.971 ms
Related:
GROUP BY and aggregate sequential numeric values
GROUP BY consecutive dates delimited by gaps
Ordered count of consecutive repeats / duplicates

The GROUP BY clause will collapse the result into 2 records - one for user_id 1 and one for user_id 2 - no matter what the ORDER BY clause says, so I recommend sending just ORDER BY created_at and grouping in application code, e.g.:
prev_id = -1
messages.each do |m|
  if m.user_id != prev_id
    prev_id = m.user_id
    # start a new message group here
  end
  # append m to the current group
end

You can use chunk:
Message = Struct.new :id, :user_id
messages = []
messages << Message.new(1, 1)
messages << Message.new(2, 1)
messages << Message.new(3, 2)
messages << Message.new(4, 2)
messages << Message.new(5, 1)
messages.chunk(&:user_id).each do |user_id, records|
  p "#{user_id} - #{records.inspect}"
end
The output:
"1 - [#<struct Message id=1, user_id=1>, #<struct Message id=2, user_id=1>]"
"2 - [#<struct Message id=3, user_id=2>, #<struct Message id=4, user_id=2>]"
"1 - [#<struct Message id=5, user_id=1>]"

Related

Ranking players in SQL database always returns 1 when WHERE id_user is used

I basically have a table "race" with columns "id_race" and "id_user", plus columns for user predictions: "pole_position", "1st", "2nd", "3rd" and "fastest_lap". Each prediction column also has a control column: "PPC", "1eC", "2eC", "3eC" and "srC". Those control columns are compared by a query against a "result" table, and points are awarded in the control columns of race for correct predictions.
[screenshot of the race table]
I want to add up those results per user and then rank the users, showing the rank on each player's user page. I have a query which works fine in itself and gives me a list with a rank column.
SELECT
    @rownum := @rownum + 1 AS rank,
    total,
    id_user
FROM
    (SELECT
         SUM(PPC + 1eC + 2eC + 3eC + srC) AS total,
         id_user
     FROM race
     GROUP BY id_user
     ORDER BY total DESC) T,
    (SELECT @rownum := 0) a;
Output of rank query: [screenshot]
However, when I add a WHERE id_user = ... condition, the user always gets the first rank. Does anyone have an idea whether this can be solved, and how I could add a WHERE to my rank query?
I've already tried filtering. I have also tried the ROW_NUMBER function; it likewise always returns 1, because only one user remains after filtering. I am unable to obtain the correct position. So please help!
You have to compute the rank first, for example in a view, and extract it afterwards. Once you use a WHERE clause, you get the rank based on the filtered subset rather than the whole population.
Please find an indicative answer on fiddle, where a CTE and the ROW_NUMBER function are used. The indicative code is:
WITH sum_cte AS (
    SELECT ROW_NUMBER() OVER (ORDER BY SUM(PPC + 1eC + 2eC + 3eC + srC) DESC) AS Row,
           id_user,
           SUM(PPC + 1eC + 2eC + 3eC + srC) AS total_sum
    FROM race
    GROUP BY id_user
)
SELECT Row, id_user, total_sum
FROM sum_cte
WHERE id_user = 1;
User 1, having the second-highest score, will then appear with Row = 2.
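If the ranking is needed in several places, the same idea can be wrapped in a view, as suggested above (a sketch assuming MySQL 8+; the view name user_ranks is hypothetical, and Row is backticked because ROW is a reserved word there):

CREATE VIEW user_ranks AS
SELECT ROW_NUMBER() OVER (ORDER BY SUM(PPC + 1eC + 2eC + 3eC + srC) DESC) AS `Row`,
       id_user,
       SUM(PPC + 1eC + 2eC + 3eC + srC) AS total_sum
FROM race
GROUP BY id_user;

-- The rank is computed over all users first, then filtered:
SELECT `Row`, id_user, total_sum FROM user_ranks WHERE id_user = 1;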

More than one row returned by a subquery used as an expression when UPDATE on multiple rows

I'm trying to update rows in a single table by splitting them into two "sets" of rows.
The top part of the set should have its status set to X and the bottom one should have its status set to Y.
I've tried putting together a query that looks like this:
WITH x_status AS (
    SELECT id
    FROM people
    WHERE surname = 'foo'
    ORDER BY date_registered DESC
    LIMIT 5
), y_status AS (
    SELECT id
    FROM people
    WHERE surname = 'foo'
    ORDER BY date_registered DESC
    OFFSET 5
)
UPDATE people
SET status = folks.status
FROM (VALUES
    ((SELECT id FROM x_status), 'X'),
    ((SELECT id FROM y_status), 'Y')
) AS folks (ids, status)
WHERE id IN (folks.ids);
When I run this query I get the following error:
pq: more than one row returned by a subquery used as an expression
This makes sense: folks.ids is expected to be a list of IDs (hence the IN clause in the UPDATE statement), but I suspect the problem is that I cannot return the list from the VALUES statement in the FROM clause, as it expands into something like this:
(1, 2, 3, 4, 5, 5)
(6, 7, 8, 9, 1)
Is there a way this UPDATE can be done using a CTE query at all? I could split this into two separate UPDATE queries, but a single CTE query would be nicer and in theory faster.
I think I understand now... if I get your problem, you want to set the status to 'X' for the five most recent records and 'Y' for everything else?
In that case I think the row_number() analytic would work -- and it should do it in a single pass, two scans, and eliminating one order by. Let me know if something like this does what you seek.
with ranked as (
    select id,
           row_number() over (order by date_registered desc) as rn
    from people
    -- add "where surname = 'foo'" here if the update should be scoped as in the question
)
update people p
set status = case when r.rn <= 5 then 'X' else 'Y' end
from ranked r
where p.id = r.id;
Any time you do an update from another data set, it's helpful to have a where clause that defines the relationship between the two datasets (the non-ANSI join syntax). This makes it iron-clad what you are updating.
Also I believe this code is pretty readable so it will be easier to build on if you need to make tweaks.
Let me know if I missed the boat.
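A quick way to sanity-check the outcome afterwards (a trivial sketch against the people table from the question):

select status, count(*)
from people
group by status;
-- expect 5 rows with status 'X' and the remainder with 'Y'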
So after more tinkering, I've come up with a solution.
The reason the previous query fails is that we are not grouping the IDs in the subqueries into arrays, so the result expands into a huge list, as I suspected.
The solution is to aggregate the IDs in the subqueries with ARRAY(...) - that way each set is returned as a single array value in the ids column.
This is the query that does the job. Note that we must unnest the IDs in the WHERE clause:
WITH x_status AS (
    SELECT id
    FROM people
    WHERE surname = 'foo'
    ORDER BY date_registered DESC
    LIMIT 5
), y_status AS (
    SELECT id
    FROM people
    WHERE surname = 'foo'
    ORDER BY date_registered DESC
    OFFSET 5
)
UPDATE people
SET status = folks.status
FROM (VALUES
    (ARRAY(SELECT id FROM x_status), 'X'),
    (ARRAY(SELECT id FROM y_status), 'Y')
) AS folks (ids, status)
WHERE id IN (SELECT * FROM unnest(folks.ids));
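As a side note: in Postgres the unnest subquery can equivalently be replaced by an array comparison (assuming id is an integer):

WHERE id = ANY (folks.ids);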

Select all rows where the sum of column X is greater than or equal to Y

I need to find a group of lots to satisfy a demand for X items. I can't do it with aggregate functions alone; it seems to me that I need something more than a window function. Do you know anything that can help me solve this problem?
For example, if I have a demand for 1 item, the query should return any lot with a quantity greater than or equal to 1. But if I have a demand for 15 and there are no lots with that availability, it should return a lot of 10 and another with 5, or one of 10 and two of 3, etc.
With a programming language like Java this is simple, but is it possible with SQL? I am trying to achieve it with window functions, but I cannot find a way to accumulate the available quantity of the current row until reaching the required quantity.
SELECT id, VC_NUMERO_LOTE, SF_FECHA_CREACION, SI_ID_M_ARTICULO, VI_CANTIDAD, NEXT,
       VI_CANTIDAD + NEXT AS TOT
FROM (
    SELECT row_number() OVER (ORDER BY SF_FECHA_CREACION DESC) AS id,
           VC_NUMERO_LOTE, SF_FECHA_CREACION, SI_ID_M_ARTICULO, VI_CANTIDAD,
           LEAD(VI_CANTIDAD, 1) OVER (ORDER BY SF_FECHA_CREACION DESC) AS NEXT
    FROM PUBLIC.M_LOTE
    WHERE SI_ID_M_ARTICULO = 44974
      AND VI_CANTIDAD > 0
) AS T
WHERE MOD(id, 2) != 0
I tried using LEAD to then sum only the odd-numbered records, but I saw that this is not the way. Any suggestions?
You need a recursive query like this:
demo:db<>fiddle
WITH RECURSIVE lots_with_rowcount AS ( -- 1
    SELECT *,
           row_number() OVER (ORDER BY avail_qty DESC) AS rowcnt
    FROM mytable
), lots AS ( -- 2
    SELECT -- 3
        lot_nr,
        avail_qty,
        rowcnt,
        avail_qty AS total_qty
    FROM lots_with_rowcount
    WHERE rowcnt = 1

    UNION

    SELECT
        t.lot_nr,
        t.avail_qty,
        t.rowcnt,
        l.total_qty + t.avail_qty -- 4
    FROM lots_with_rowcount t
    JOIN lots l ON t.rowcnt = l.rowcnt + 1
               AND l.total_qty < --<your demand here>
)
SELECT * FROM lots -- 5
1. This CTE only adds a row count to each record, which is used within the recursion to join the next record.
2. This is the recursive CTE. A recursive CTE consists of two parts: the initial SELECT statement and the recursion.
3. Initial part: queries the lot record with the highest avail_qty value. Naturally, you can order the lots any way you like; largest quantity first yields the smallest output.
4. After the UNION, the recursion part: the current row is joined to the previous output, with the additional condition that the join only happens if the previous output does not yet cover the demand value. In that case, the next total_qty value is calculated from the previous total and the current quantity.
5. The recursion ends when no record is left that fits the join condition. Then you can SELECT the entire recursion output.
Notice: if your demand were higher than all your available quantities in total, this would return the entire table, because the recursion runs until either the demand is reached or the table ends. You should run a check beforehand:
SELECT SUM(avail_qty) > demand FROM mytable
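For instance, with a demand of 15 the guard could look like this (a minimal sketch with the demand inlined as a literal):

SELECT SUM(avail_qty) >= 15 AS demand_can_be_met FROM mytable;

If it returns false, skip the recursive query entirely.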
I gratefully fiddled around with S-Man's fiddle and found a query that is at least simpler to understand:
select lot_nr, avail_qty, tot_amount
from (
    select lot_nr, avail_qty,
           sum(avail_qty) over (order by avail_qty desc
                                rows between unbounded preceding and current row) as tot_amount,
           sum(avail_qty) over (order by avail_qty desc
                                rows between unbounded preceding and current row) - avail_qty as last_amount
    from mytable
) amounts
where last_amount < 15 -- your amount here
This lists all rows where the running total of the predecessors (in descending order of avail_qty) has not yet reached the limit.
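To illustrate with hypothetical lots A (10), B (5) and C (3) and a demand of 15: the running totals are 10, 15, 18 and the last_amount values are 0, 10, 15, so only A and B pass the filter:

select lot_nr, avail_qty, tot_amount
from (
    select lot_nr, avail_qty,
           sum(avail_qty) over (order by avail_qty desc
                                rows between unbounded preceding and current row) as tot_amount,
           sum(avail_qty) over (order by avail_qty desc
                                rows between unbounded preceding and current row) - avail_qty as last_amount
    from (values ('A', 10), ('B', 5), ('C', 3)) as mytable(lot_nr, avail_qty)
) amounts
where last_amount < 15;

-- lot_nr | avail_qty | tot_amount
-- A      |        10 |         10
-- B      |         5 |         15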
Here is a simple old-school PL/pgSQL version that uses a (slow) loop. As an illustration it returns only the lot numbers. Basically, it returns the lot numbers for a particular item_id in a certain order (one that reflects the required business rules) and accumulates the available quantities until the allocated quantity reaches or exceeds the required quantity.
create function get_lots(required_item integer, required_qty integer) returns setof text as
$$
declare
    r record;
    allocated_qty integer := 0;
begin
    for r in select * from lots where item_id = required_item order by <your biz-rule> loop
        return next r.lot_number;
        allocated_qty := allocated_qty + r.available_qty;
        exit when allocated_qty >= required_qty;
    end loop;
end;
$$ language plpgsql;
-- Use
select lot_id from get_lots(1, 17) lot_id;

merging two tables, while applying aggregates on the duplicates (max,min and sum)

I have a table (let's call it log) with a few millions of records. Among the fields I have Id, Count, FirstHit, LastHit.
Id - The record id
Count - number of times this Id has been reported
FirstHit - earliest timestamp with which this Id was reported
LastHit - latest timestamp with which this Id was reported
This table only has one record for any given Id
Everyday I get into another table (let's call it feed) with around half a million records with these fields among many others:
Id
Timestamp - Entry date and time.
This table can have many records for the same id
What I want to do is to update log in the following way.
Count - log count value, plus the count() of records for that id found in feed
FirstHit - the earliest of the current value in log or the minimum value in feed for that id
LastHit - the latest of the current value in log or the maximum value in feed for that id.
It should be noticed that many of the ids in feed are already in log.
The simple thing that worked is to create a temporary table and insert into it the union of both, as in:
Select Id, Min(Timestamp) As FirstHit, MAX(Timestamp) as LastHit, Count(*) as Count FROM feed GROUP BY Id
UNION ALL
Select Id, FirstHit,LastHit,Count FROM log;
From that temporary table I do a select that aggregates min(FirstHit), max(LastHit) and sum(Count):
Select Id, Min(FirstHit),Max(LastHit),Sum(Count) FROM #temp GROUP BY Id;
and that gives me the end result. I could then delete everything from log and replace it with everything from temp, or craft an update for the common records and insert the new ones. However, I think both are highly inefficient.
Is there a more efficient way of doing this, perhaps by updating the log table in place?
If your SQL Server version is 2008 or later then you can try this:
MERGE INTO log l
USING (SELECT Id, MIN(Timestamp) AS FirstHit, MAX(Timestamp) AS LastHit, COUNT(*) AS Count
       FROM feed
       GROUP BY Id) f
ON l.Id = f.Id
WHEN MATCHED THEN
    UPDATE SET
        FirstHit = CASE WHEN l.FirstHit < f.FirstHit THEN l.FirstHit ELSE f.FirstHit END,
        LastHit  = CASE WHEN l.LastHit  > f.LastHit  THEN l.LastHit  ELSE f.LastHit  END,
        Count    = l.Count + f.Count
WHEN NOT MATCHED THEN
    INSERT (Id, FirstHit, LastHit, Count)
    VALUES (f.Id, f.FirstHit, f.LastHit, f.Count);
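One caveat worth noting: MERGE alone is not safe against concurrent writers; if several feed loads could run at the same time, the common recommendation is a HOLDLOCK hint on the target table (a sketch - only the first line changes):

MERGE INTO log WITH (HOLDLOCK) AS l
-- ... rest of the statement unchanged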
The keyword here is EVERYDAY. You should have a (batch) job which runs the process at the end of each day. The idea is to process only the records from yesterday; this is way better than processing the whole feed table.
Updated information:
The feed table contains only the hits since the last run. This makes the MERGE that updates the log table much simpler.
Notice: we can say FirstHit will never be updated for matched rows, only LastHit and Count. Improved from @dened's answer.
MERGE INTO log l
USING (SELECT Id, MIN(Timestamp) AS FirstHit, MAX(Timestamp) AS LastHit, COUNT(*) AS TodayHit
       FROM feed
       GROUP BY Id) f
ON l.Id = f.Id
WHEN MATCHED THEN
    UPDATE SET
        LastHit = f.LastHit,
        Count   = l.Count + f.TodayHit
WHEN NOT MATCHED THEN
    INSERT (Id, FirstHit, LastHit, Count)
    VALUES (f.Id, f.FirstHit, f.LastHit, f.TodayHit);
I can't test it, but I think this should work; not sure how it will perform, though:
select
    coalesce(log.Id, feedsum.Id) as Id
  , case when log.FirstHit is null then feedsum.FirstHit
         when feedsum.FirstHit is null then log.FirstHit
         when log.FirstHit < feedsum.FirstHit then log.FirstHit
         else feedsum.FirstHit
    end as FirstHit
  , case when log.LastHit is null then feedsum.LastHit
         when feedsum.LastHit is null then log.LastHit
         when log.LastHit > feedsum.LastHit then log.LastHit
         else feedsum.LastHit
    end as LastHit
  , coalesce(log.Count, 0) + coalesce(feedsum.Count, 0) as Count -- summed counts, missing in the original draft
from log
full outer join (
    select Id, min(Timestamp) as FirstHit, max(Timestamp) as LastHit, count(*) as Count
    from feed
    group by Id
) feedsum using (Id)

Debugging a SQL Query

I have a table structure like below. I need to select the row where User_Id = 100 and User_sub_id = 1, time_used is the minimum of all, and the Timestamp is the highest. The output of my query should be:
US;1365510103204;NY;1365510103;100;1;678;
My query looks like this.
select *
from my_table
where CODE='DE'
and User_Id = 100
and User_sub_id = 1
and time_used = (select min(time_used)
from my_table
where CODE='DE'
and User_Id=100
and User_sub_id= 1);
This returns all 4 matching rows; I need only 1, the one with the highest timestamp.
Many thanks
CODE; Timestamp;     Location; Time_recorded; User_Id; User_sub_Id; time_used
US;   1365510102420; NY;       1365510102;    100;     1;           1078
US;   1365510102719; NY;       1365510102;    100;     1;           978
US;   1365510103204; NY;       1365510103;    100;     1;           878
US;   1365510102232; NY;       1365510102;    100;     1;           678
US;   1365510102420; NY;       1365510102;    100;     1;           678
US;   1365510102719; NY;       1365510102;    100;     1;           678
US;   1365510103204; NY;       1365510103;    100;     1;           678
US;   1365510102420; NY;       1365510102;    101;     1;           678
US;   1365510102719; NY;       1365510102;    101;     1;           638
US;   1365510103204; NY;       1365510103;    101;     1;           638
Another, possibly faster, solution is to use window functions:
select *
from (
    select code,
           timestamp,
           min(time_used) over (partition by user_id, user_sub_id) as min_used,
           row_number() over (partition by user_id, user_sub_id order by timestamp desc) as rn,
           time_used,
           user_id,
           user_sub_id
    from my_table
    where CODE = 'US'
      and User_Id = 100
      and User_sub_id = 1
) t
where time_used = min_used
  and rn = 1;
This only needs to scan the table once instead of twice as your solution with the sub-select is doing.
I would strongly recommend renaming the column timestamp.
First, it is a reserved word, and using reserved words as identifiers is not recommended.
Second, it doesn't document anything - it's a horrible name as such. time_used is much better, and you should find something similar for timestamp. Is it the "recording time", the "expiration time", the "due time", or something completely different?
Then try this:
select *
from my_table
where CODE = 'DE'
  and User_Id = 100
  and User_sub_id = 1
  and time_used = (
      select min(time_used)
      from my_table
      where CODE = 'DE'
        and User_Id = 100
        and User_sub_id = 1
  )
order by "timestamp" desc -- <-- this adds sorting
limit 1;                  -- <-- this retrieves only one row
Add the following to the end of your query:
ORDER BY Timestamp DESC LIMIT 1