Is there a straightforward way to order results *first*, *then* group by another column, with SQL?

I see that in an SQL query, the GROUP BY has to precede the ORDER BY expression. Does this imply that ordering is applied only after grouping has already discarded identical rows?
Because I seem to need to order rows by a timestamp first, and only then discard the rows with identical timestamps. And I don't know how to accomplish this.
I am using MySQL 5.1.41.
Here is the definition of the table expressed with create table:
create table tbl
(
A int,
B timestamp
)
The data could be:
+-----+-----------+
| A   | B         |
+-----+-----------+
| 1   | today     |
| 1   | yesterday |
| 2   | yesterday |
| 2   | tomorrow  |
+-----+-----------+
The results of the query on the above table, which I am after, would be:
+-----+-----------+
| A   | B         |
+-----+-----------+
| 1   | today     |
| 2   | tomorrow  |
+-----+-----------+
Basically, I want the rows with the latest timestamp in column "B" (hence the mention of ORDER BY), and only one row for each value in column "A" (think DISTINCT or GROUP BY).
The actual problem behind the simplified example above:
In reality, I have two tables - users and payment_receipts:
create table users
(
phone_nr int(10) unsigned not null,
primary key (phone_nr)
)
create table payment_receipts
(
phone_nr int(10) unsigned not null,
payed_ts timestamp default current_timestamp not null,
payed_until_ts timestamp not null,
primary key (phone_nr, payed_ts, payed_until_ts)
)
The tables may include other columns, but I omit those as irrelevant. Implementing a payment scheme, I have to send SMS messages to users across the cellular network at periodic intervals, depending on whether a payment is due. The payment is actualized when the SMS is sent, as the recipient is taxed for it. I use the payment_receipts table to keep records of all payments made, i.e. for book-keeping. This is intended to model a real shop where both the buyer and the seller get a copy of the receipt of purchase, for reference. This table stores my (the seller's) copy of each receipt. The customer's receipt is the received SMS itself.
Each time an SMS is sent (and thus a payment is made), a receipt record is inserted into the table, stating who paid, when, and "until when". To explain the latter, imagine a subscription service that spans indefinitely until the user explicitly opts out, at which point the corresponding user record is removed. A payment is made a month in advance, so as a rule the difference between payed_ts and payed_until_ts is 30 days' worth of time.
I have a batch job that executes every day and needs to select the list of users that are due the monthly payment as part of the automatic subscription renewal described above. To link this to the dummy example earlier, the phone number column phone_nr would be column "A" and payed_until_ts would be column "B", but in reality there are two tables, which has to do with the following behaviour: when a user record is removed, the receipt must remain, for book-keeping. So not only do I need to group payments by date and discard all but the latest payment receipt date, I also need to watch out not to select receipts for which there is no longer a matching user record.
To solve the problem of selecting the required records -- those that are due payment -- I need to find the receipts with the latest payed_until_ts timestamp for each phone_nr (there may be several, obviously), and out of those records I further need to select only the phone numbers where payed_until_ts is earlier than the time the batch job executes. I would then send an SMS to each of these numbers, inserting a receipt record for each sent SMS, where payed_ts is now() and payed_until_ts is now() + interval 30 days.
But I can't seem to come up with the query required.

select a, b from (select a, b from tbl order by b desc) as c group by a;

Yes, grouping is done first, and it affects a single select whereas ordering affects all the results from all select statements in a union, such as:
select a, 'max', max(b) from tbl group by a
union all select a, 'min', min(b) from tbl group by a
order by 1, 2
(using field numbers in order by since I couldn't be bothered to name my columns). Each group by affects only its select, the order by affects the combined result set.
It seems that what you're after can be achieved with:
select A, max(B) from tbl group by A
This uses the max aggregate function to do your pre-group ordering, in effect (it doesn't actually sort in any decent DBMS; rather, it will simply pick the maximum from a suitable index, if one is available).
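Applied to the single-table simplification above, the due-payment filter from the question can ride on the same aggregate via HAVING; a minimal sketch, not part of the original answer:
-- values of A whose latest B is already in the past
SELECT A, MAX(B) AS latest_b
FROM tbl
GROUP BY A
HAVING MAX(B) < NOW();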

SELECT DISTINCT a,b
FROM tbl t
WHERE b = (SELECT MAX(b) FROM tbl WHERE tbl.a = t.a);

According to your new rules (tested with PostgreSQL), the query you'd want is:
SELECT pr.phone_nr, pr.payed_ts, pr.payed_until_ts
FROM payment_receipts pr
JOIN users
ON (pr.phone_nr = users.phone_nr)
JOIN (select phone_nr, max(payed_until_ts) as payed_until_ts
from payment_receipts
group by phone_nr
) sub
ON ( pr.phone_nr = sub.phone_nr
AND pr.payed_until_ts = sub.payed_until_ts)
ORDER BY pr.phone_nr, pr.payed_ts, pr.payed_until_ts;
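The daily batch job described in the question would then restrict this to lapsed subscriptions and write a renewal receipt per SMS sent. A sketch in MySQL syntax, as an illustration beyond the original answer (the ? placeholder stands for the number just billed):
SELECT sub.phone_nr
FROM (SELECT phone_nr, MAX(payed_until_ts) AS payed_until_ts
      FROM payment_receipts
      GROUP BY phone_nr) sub
JOIN users ON users.phone_nr = sub.phone_nr
WHERE sub.payed_until_ts < NOW();

-- after each successfully sent SMS:
INSERT INTO payment_receipts (phone_nr, payed_ts, payed_until_ts)
VALUES (?, NOW(), NOW() + INTERVAL 30 DAY);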
Original Answer (with updates):
CREATE TABLE foo (a NUMERIC, b TEXT, c DATE);
INSERT INTO foo VALUES
(1,'a','2010-07-30'),
(1,'b','2010-07-30'),
(1,'c','2010-07-31'),
(1,'d','2010-07-31'),
(1,'a','2010-07-29'),
(1,'c','2010-07-29'),
(2,'a','2010-07-29'),
(2,'a','2010-08-01');
-- table contents
SELECT * FROM foo ORDER BY c,a,b;
a | b | c
---+---+------------
1 | a | 2010-07-29
1 | c | 2010-07-29
2 | a | 2010-07-29
1 | a | 2010-07-30
1 | b | 2010-07-30
1 | c | 2010-07-31
1 | d | 2010-07-31
2 | a | 2010-08-01
-- The following solutions both retrieve records based on the latest date
-- they both return the same result set, solution 1 is faster, solution 2
-- is easier to read
-- Solution 1:
SELECT foo.a, foo.b, foo.c
FROM foo
JOIN (select a, max(c) as c from foo group by a) bar
ON (foo.a=bar.a and foo.c=bar.c)
ORDER BY foo.a, foo.b, foo.c;
-- Solution 2:
SELECT a, b, MAX(c) AS c
FROM foo main
GROUP BY a, b
HAVING MAX(c) = (select max(c) from foo sub where main.a=sub.a group by a)
ORDER BY a, b;
a | b | c
---+---+------------
1 | c | 2010-07-31
1 | d | 2010-07-31
2 | a | 2010-08-01
(3 rows)
Comment:
a = 1 is returned twice because there are multiple b values for the latest date. This is acceptable (and advised). Your data should never have this problem, because c is based on b's value.
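If exactly one row per a is wanted even when several b values tie on the latest date, PostgreSQL's DISTINCT ON can break the tie; a sketch, going beyond the answer above:
SELECT DISTINCT ON (a) a, b, c
FROM foo
ORDER BY a, c DESC, b;  -- keeps one row with the latest c per a; b breaks ties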

create table user_payments
(
phone_nr int NOT NULL,
payed_until_ts datetime NOT NULL
)
insert into user_payments
(phone_nr, payed_until_ts)
values
(1, '2016-01-28'), -- today
(1, '2016-01-27'), -- yesterday
(2, '2016-01-27'), -- yesterday
(2, '2016-01-29') -- tomorrow
select phone_nr, MAX(payed_until_ts) as latest_payment
from user_payments
group by phone_nr
-- OUTPUT:
-- phone_nr latest_payment
-- 1 2016-01-28 00:00:00.000
-- 2 2016-01-29 00:00:00.000
In the above example I have used a datetime column, but a similar query should work for a timestamp column.
The MAX function will in effect do the "ORDER BY" on the payed_until_ts column and pick the latest value for each phone_nr.
Also, you will get only one row for each phone_nr due to the "GROUP BY" clause.
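When you need further columns from the winning row (not just the grouped maximum), a window function avoids the self-join. A hedged sketch that assumes a DBMS with ROW_NUMBER support (SQL Server 2005+, MySQL 8+; notably not the original asker's MySQL 5.1):
SELECT phone_nr, payed_until_ts AS latest_payment
FROM (
    SELECT phone_nr, payed_until_ts,
           ROW_NUMBER() OVER (PARTITION BY phone_nr
                              ORDER BY payed_until_ts DESC) AS rn
    FROM user_payments
) ranked
WHERE rn = 1;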

Related

How to consolidate blocks of time?

I have a derived table with a list of relative seconds to a foreign key (ID):
CREATE TABLE Times (
ID INT
, TimeFrom INT
, TimeTo INT
);
The table contains mostly non-overlapping data, but there are occasions where one record's TimeFrom falls before the TimeTo of an earlier record:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 |       10 |     30 |
| 10 |       50 |     70 |
| 10 |       60 |    150 |
| 10 |       75 |    150 |
| .. |      ... |    ... |
+----+----------+--------+
The result set is meant to be a flattened linear idle report, but with too many of these overlaps I end up with negative time in use. That is, if the window above for ID = 10 was 150 seconds long, and I summed the differences of relative seconds to subtract from the window size, I'd wind up with 150-(20+20+90+75)=-55. I tried this approach, and it is what led me to realize there were overlaps that needed to be flattened.
So, what I'm looking for is a solution to flatten the overlaps into one set of times:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 |       10 |     30 |
| 10 |       50 |    150 |
| .. |      ... |    ... |
+----+----------+--------+
Considerations: Performance is very important here, as this is part of a larger query that performs well on its own, and I'd rather not impact its performance much if I can help it.
Regarding a comment about "which seconds have an interval": this is something I have tried for the end result, and I am looking for something with better performance. Adapted to my example:
SELECT SUM(C.N)
FROM (
    SELECT A.N, ROW_NUMBER() OVER (ORDER BY A.N) RowID
    FROM (SELECT TOP 60  1 N FROM master..spt_values) A,
         (SELECT TOP 720 1 N FROM master..spt_values) B
) C
WHERE EXISTS (
    SELECT 1
    FROM Times SE
    WHERE SE.ID = 10
      AND SE.TimeFrom <= C.RowID
      AND SE.TimeTo   >= C.RowID
      AND EXISTS (
          SELECT 1
          FROM Times2 D
          WHERE D.ID = SE.ID
            AND D.TimeFrom <= C.RowID
            AND D.TimeTo   >= C.RowID
      )
    GROUP BY SE.ID
)
The problem I have with this solution is that I get a Row Count Spool out of the EXISTS query in the query plan, with a number of executions equal to COUNT(C.*). I left the real numbers in that query to illustrate that getting around this approach is for the best: even with the Row Count Spool reducing the cost of the query by quite a bit, its execution count increases the cost of the query as a whole by quite a bit as well.
Further Edit: The end goal is to put this in a procedure, so Table Variables and Temp Tables are also a possible tool to use.
OK. I'm still trying to do this with just one SELECT. But this totally works:
DECLARE @tmp TABLE (ID INT, GroupId INT, TimeFrom INT, TimeTo INT)

INSERT INTO @tmp
SELECT ID, 0, TimeFrom, TimeTo
FROM Times
ORDER BY Id, TimeFrom

DECLARE @timeTo int, @id int, @groupId int
SET @groupId = 0

UPDATE @tmp
SET
    @groupId = CASE WHEN id != @id THEN 0
                    WHEN TimeFrom > @timeTo THEN @groupId + 1
                    ELSE @groupId END,
    GroupId = @groupId,
    @timeTo = TimeTo,
    @id = id

SELECT Id, MIN(TimeFrom), MAX(TimeTo) FROM @tmp
GROUP BY ID, GroupId ORDER BY ID
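On SQL Server 2012+ the same island grouping can be computed set-based, without the quirky-update trick, using a framed running MAX; a sketch, named plainly as a swapped-in window-function technique rather than the answer's original method:
WITH marked AS (
    SELECT ID, TimeFrom, TimeTo,
           -- 1 when this row starts after every earlier interval has closed
           CASE WHEN TimeFrom > MAX(TimeTo) OVER (
                         PARTITION BY ID ORDER BY TimeFrom, TimeTo
                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                THEN 1 ELSE 0 END AS IsStart
    FROM Times
), grouped AS (
    SELECT ID, TimeFrom, TimeTo,
           SUM(IsStart) OVER (PARTITION BY ID ORDER BY TimeFrom, TimeTo
                              ROWS UNBOUNDED PRECEDING) AS GroupId
    FROM marked
)
SELECT ID, MIN(TimeFrom) AS TimeFrom, MAX(TimeTo) AS TimeTo
FROM grouped
GROUP BY ID, GroupId
ORDER BY ID, TimeFrom;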
Left join each row to its successor overlapping row on the same ID value (where one exists).
Now, for each row in the result set of LHS left join RHS, the contribution to the elapsed time for the ID is:
isnull(RHS.TimeFrom,LHS.TimeTo) - LHS.TimeFrom as TimeElapsed
Summing these by ID should give you the correct answer.
Note that:
- where there isn't an overlapping successor row the calculation is simply
LHS.TimeTo - LHS.TimeFrom
- where there is an overlapping successor row the calculation will net to
(RHS.TimeFrom - LHS.TimeFrom) + (RHS.TimeTo - RHS.TimeFrom)
which simplifies to
RHS.TimeTo - LHS.TimeFrom
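Written out, that prose might look like the following sketch. It assumes SQL Server (for OUTER APPLY) and, per the notes above, that each overlapping successor extends past its predecessor's TimeTo:
SELECT LHS.ID,
       SUM(ISNULL(nxt.TimeFrom, LHS.TimeTo) - LHS.TimeFrom) AS TimeElapsed
FROM Times LHS
OUTER APPLY (
    -- nearest successor row that starts inside this interval
    SELECT TOP 1 RHS.TimeFrom
    FROM Times RHS
    WHERE RHS.ID = LHS.ID
      AND RHS.TimeFrom > LHS.TimeFrom
      AND RHS.TimeFrom <= LHS.TimeTo
    ORDER BY RHS.TimeFrom
) nxt
GROUP BY LHS.ID;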
What about something like below (assumes SQL Server 2005+ due to the CTE):
WITH Overlaps
AS
(
SELECT t1.Id,
TimeFrom = MIN(t1.TimeFrom),
TimeTo = MAX(t2.TimeTo)
FROM dbo.Times t1
INNER JOIN dbo.Times t2 ON t2.Id = t1.Id
AND t2.TimeFrom > t1.TimeFrom
AND t2.TimeFrom < t1.TimeTo
GROUP BY t1.Id
)
SELECT o.Id,
o.TimeFrom,
o.TimeTo
FROM Overlaps o
UNION ALL
SELECT t.Id,
t.TimeFrom,
t.TimeTo
FROM dbo.Times t
INNER JOIN Overlaps o ON o.Id = t.Id
AND (o.TimeFrom > t.TimeFrom OR o.TimeTo < t.TimeTo);
I do not have a lot of data to test with, but it seems decent on the smaller data sets I have.
I also wrapped my head around this issue, and after all I found that the problem is your data.
You claim (if I get that right) that these entries should reflect the relative times when a user goes idle / comes back.
So you should consider sanitizing your data and refactoring your inserts to produce valid data sets.
For instance, take these two rows:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 |       50 |     70 |
| 10 |       60 |    150 |
+----+----------+--------+
How can it be possible that a user is idle until second 70, but goes idle again at second 60? That already implies that he came back at around second 59 at the latest.
I can only assume that this issue comes from different threads and/or browser windows (tabs) with which a user might be using your application (each having its own "idle detection").
So instead of working around the symptoms, you should fix the cause! Why is this data entry inserted into the table at all? You could avoid it by simply checking whether the user is already idle before inserting a new row.
Create a unique key constraint on ID and TimeTo
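In MySQL syntax that could be the following (the constraint name is an assumption):
ALTER TABLE Times ADD CONSTRAINT uq_times_id_timeto UNIQUE (ID, TimeTo);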
Whenever an idle-event is detected, execute the following query:
INSERT IGNORE INTO Times (ID, TimeFrom, TimeTo) VALUES ('10', currentTimeStamp, -1);
-- (If the user is already "idle" - nothing will happen)
Whenever a comeback-event is detected, execute the following query:
UPDATE Times SET TimeTo=currentTimeStamp WHERE ID='10' and TimeTo=-1
-- (If the user is already "back" - nothing will happen)
The fiddle linked here: http://sqlfiddle.com/#!2/dcb17/1 would reproduce the chain of events for your example, but resulting in a clean and logical set of idle-windows:
ID  TIMEFROM  TIMETO
10        10      30
10        50      70
10        75     150
Note: the output is slightly different from the output you desired, but I feel that this is more accurate, for the reason outlined above: a user cannot go idle at second 70 without first returning from his current idle state. Either he STAYS idle (and a second thread/tab runs into the idle-event), or he came back in between.
Especially given your need to maximize performance, you should fix the data and not invent a work-around query. This is maybe 3 ms upon inserts, but could be worth 20 seconds upon select!
Edit: if multi-threading / multiple sessions are the cause of the wrong inserts, you would also need to implement a check whether most_recent_come_back_time < now() - idleTimeout; otherwise a user might come back on tab1 and be recorded idle on tab2 a few seconds later, because tab2 ran into its idle timeout while the user only refreshed tab1.
I once had the 'same' problem with 'days' (additionally without counting weekends and holidays).
The word 'counting' gave me the following idea:
create table Seconds ( sec INT);
insert into Seconds values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9), ...
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom and t.timeto-1
and id=10;
You can shift the start down to 0 (I put the '10' here in parentheses):
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom- (10) and t.timeto- (10)-1
and id=10;
and finally:
select count(distinct sec) from times t, seconds s,
(select min(timefrom) m from times where id=10) as m
where s.sec between t.timefrom-m.m and t.timeto-m.m-1
and id=10;
Additionally, you can "ignore" e.g. 10 seconds at a time by dividing; you lose some precision but gain speed:
select count(distinct sec)*d.d from times t, seconds s,
(select min(timefrom) m from times where id=10) as m,
(select 10 d) as d
where s.sec between (t.timefrom-m.m)/d.d and (t.timeto-m.m)/d.d-1
and id=10;
Sure, it depends on the range you have to look at, but a 'day' or two of seconds should work (although I did not test it).
fiddle ...
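Populating Seconds by hand for a full day is impractical; one way to generate the rows is a recursive CTE. A sketch that assumes MySQL 8+, so it would not run on the old fiddle above:
SET SESSION cte_max_recursion_depth = 100000;  -- default recursion cap is 1000

INSERT INTO Seconds (sec)
WITH RECURSIVE n (sec) AS (
    SELECT 0
    UNION ALL
    SELECT sec + 1 FROM n WHERE sec < 86399
)
SELECT sec FROM n;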

SQL Get n last unique entries by date

I have an Access database that I'm well aware is quite poorly designed; unfortunately, it is what I must use. It looks a little like the following:
(Row# is not a column in the database; it's just there to help me describe what I'm after.)
Row#  ID   Date       Misc
1     001  01/8/2013  A
2     001  01/8/2013  B
3     001  01/8/2013  C
4     002  02/8/2013  D
5     002  02/8/2013  A
6     003  04/8/2013  B
7     003  04/8/2013  D
8     003  04/8/2013  D
What I'm trying to do is obtain all the information entered for the last n entries (by date), where an 'entry' is all rows sharing one ID.
So if I want the last 1 entry, I will get rows 6, 7 and 8. The last two entries will get me rows 4-8, etc.
I've tried to get the IDs needed in a subselect and then select all entries where those IDs appear, but I couldn't get it to work. Any help appreciated.
Thanks.
The proper Access syntax:
select *
from t
where ID in (select top 10 ID
from t
group by ID
order by max([date]) desc
)
I think this will work:
select *
from table
where Date in (
select distinct(Date) as unique_date from table order by unique_date DESC limit <num>
)
The idea is to use the subselect with a limit to only identify dates you care about.
EDIT: Some databases do not allow a LIMIT in a subquery (I'm looking at you, MySQL). In that case, you'll have to make a temporary table out of the subquery and then select * from it.
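A sketch of that workaround in MySQL, assuming the table is called t and n = 2 (both names are placeholders):
CREATE TEMPORARY TABLE recent_dates AS
    SELECT DISTINCT `Date` AS unique_date
    FROM t
    ORDER BY unique_date DESC
    LIMIT 2;

SELECT * FROM t
WHERE `Date` IN (SELECT unique_date FROM recent_dates);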

SQL Query to remove cyclic redundancy

I have a table that looks like this:
Column A | Column B | Counter
-----------------------------
A        | B        | 53
B        | C        | 23
A        | D        | 11
C        | B        | 22
I need to remove the last row because it is cyclic to the second row, but I can't seem to figure out how to do it.
EDIT
There is an indexed date field. This is for a Sankey diagram. The data in the sample table is actually the result of a query. The underlying table has:
date | source node | target node | path count
The query to build the table is:
SELECT source_node, target_node, COUNT(1)
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
GROUP BY source_node, target_node
In the sample, the last row (C to B) is going backwards and I need to ignore it or the Sankey won't display. I need to show only the forward path.
Removing all edges from your graph where the tuple (source_node, target_node) is not ordered alphabetically and the symmetric row exists should give you what you want:
DELETE
FROM sankey_table t1
WHERE source_node > target_node
AND EXISTS (
SELECT NULL from sankey_table t2
WHERE t2.source_node = t1.target_node
AND t2.target_node = t1.source_node)
If you don't want to DELETE them, just use this WHERE clause in your query for generating the input for the diagram.
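For example, applied to the aggregation query from the question, the non-destructive version might look like this (a sketch; you may also want to restrict t2 to the same date):
SELECT source_node, target_node, COUNT(1)
FROM sankey_table t1
WHERE TO_CHAR(data_date, 'yyyy-mm-dd') = '2013-08-19'
AND NOT (source_node > target_node
         AND EXISTS (
             SELECT NULL FROM sankey_table t2
             WHERE t2.source_node = t1.target_node
               AND t2.target_node = t1.source_node))
GROUP BY source_node, target_node;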
If you can adjust how your table is populated, you can change the query you're using to only retrieve the values for the first direction (for that date) in the first place, with a little bit of analytic manipulation:
SELECT source_node, target_node, counter FROM (
SELECT source_node,
target_node,
COUNT(*) OVER (PARTITION BY source_node, target_node) AS counter,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
)
WHERE rnk = 1;
The inner query gets the same data you collect now but adds a ranking column, which will be 1 for the first row for any source/target pair in any order for a given day. The outer query then just ignores everything else.
This might be a candidate for a materialised view if you're truncating and repopulating it daily.
If you can't change your intermediate table but can still see the underlying table you could join back to it using the same kind of idea; assuming the table you're querying from is called sankey_agg_table:
SELECT sat.source_node, sat.target_node, sat.counter
FROM sankey_agg_table sat
JOIN (SELECT source_node, target_node,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table) st
ON st.source_node = sat.source_node
AND st.target_node = sat.target_node
AND st.rnk = 1;
SQL Fiddle demos.
DELETE FROM yourTable
where [Column A]='C'
given that these are all your rows
EDIT
I would recommend that you clean up your source data if you can, i.e. delete the rows that you call backwards, if those rows are incorrect as you state in your comments.

how to select one tuple in rows based on variable field value

I'm quite new to SQL and I'd like to make a SELECT statement that retrieves only the first row of a set based on a column value. I'll try to make it clearer with a table example.
Here is my table data:
chip_id | sample_id
--------+----------
1       | 45
1       | 55
1       | 5986
2       | 453
2       | 12
3       | 4567
3       | 9
I'd like to have a SELECT statement that fetches one row for each of chip_id = 1, 2, 3.
Like this:
chip_id | sample_id
--------+----------
1       | 45 or 55 or whatever
2       | 12 or 453 ...
3       | 9 or ...
How can I do this?
Thanks
I'd probably:
- set a variable = 0
- order your table by chip_id
- read the table in row by row
- if table[row] > variable, store table[row] in a result array and increment the variable
- loop till done
- return your result array
Though depending on your DB, query, and versions, you'll probably get unpredictable/unreliable returns.
You can get one value using row_number():
select chip_id, sample_id
from (select chip_id, sample_id,
             row_number() over (partition by chip_id order by rand()) as seqnum
      from your_table
     ) t
where seqnum = 1
This returns a random value per chip_id. In SQL, tables are inherently unordered, so there is no concept of "first". You need an auto-incrementing id, a creation date, or some other way of defining "first" to get the "first".
If you have such a column, then replace rand() with the column.
Provided I understood your output, if you are using PostgreSQL 9+, you can use this:
SELECT chip_id,
       string_agg(sample_id::text, ' or ')
FROM your_table
GROUP BY chip_id
You need to group your data with a GROUP BY query.
When you group, you generally want the max, the min, or some other value to represent the group. You can do sums, counts, and all kinds of group operations.
For your example, you don't seem to want a specific group operation, so the query could be as simple as this:
SELECT chip_id, MAX(sample_id)
FROM your_table
GROUP BY chip_id
This way you are retrieving the maximum sample_id for each chip_id.

sql logical compression of records

I have a table in SQL with more than 1 million records which I want to compress using the following algorithm, and I'm now looking for the best way to do that, preferably without using a cursor.
If the table contains all 10 possible last digits (from 0 to 9) for a number (like 252637 in the following example), we find the most used Source (in our example 'A'), then remove the rows for that number where Source = 'A' and insert the collapsed digit (here 252637) in their place.
The example below should help understanding.
Original table:
Digit (bigint) | Source
---------------+-------
2526370        | A
2526371        | A
2526372        | A
2526373        | B
2526374        | C
2526375        | A
2526376        | B
2526377        | A
2526378        | B
2526379        | B
Compressed result:
Digit   | Source
--------+-------
252637  | A
2526373 | B
2526374 | C
2526376 | B
2526378 | B
2526379 | B
This is just another version of Tom Morgan's accepted answer. It uses division instead of substring to trim the least significant digit off the BIGINT digit column:
SELECT
    t.Digit / 10,
    (
        -- For each t, get the Source character that is most abundant (the statistical mode).
        SELECT TOP 1
            i.Source
        FROM
            yourTable i
        WHERE
            i.Digit / 10 = t.Digit / 10
        GROUP BY
            i.Source
        ORDER BY
            COUNT(*) DESC
    )
FROM
    yourTable t
GROUP BY
    t.Digit / 10
HAVING
    COUNT(*) = 10
I think it'll be faster, but you should test it and see.
You could identify the rows which are candidates for compression without a cursor (I think) by GROUPing on a substring of the Digit (its length - 1), HAVING count = 10. That would identify digits with 10 child rows. You could use this list to insert into a new table, then use it again to delete from the original table. What would be left are the rows that don't have all 10, which you'd also want to insert into the new table (or copy the new data back to the original).
Does that make sense? I can write it out a bit better if it doesn't.
Possible SQL Solution (note that T-SQL SUBSTRING is 1-based and requires a character argument, hence the casts):
SELECT
    SUBSTRING(CAST(t.Digit AS VARCHAR(20)), 1, LEN(CAST(t.Digit AS VARCHAR(20))) - 1),
    (SELECT TOP 1 i.Source
     FROM yourTable i
     WHERE SUBSTRING(CAST(i.Digit AS VARCHAR(20)), 1, LEN(CAST(i.Digit AS VARCHAR(20))) - 1)
         = SUBSTRING(CAST(t.Digit AS VARCHAR(20)), 1, LEN(CAST(t.Digit AS VARCHAR(20))) - 1)
     GROUP BY i.Source
     ORDER BY COUNT(*) DESC
    )
FROM yourTable t
GROUP BY SUBSTRING(CAST(t.Digit AS VARCHAR(20)), 1, LEN(CAST(t.Digit AS VARCHAR(20))) - 1)
HAVING COUNT(*) = 10
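Written out, the insert-then-delete plan described above might look like this sketch, using the division variant (yourTable and newTable are assumed names; ties for the most frequent Source are broken arbitrarily):
-- 1. collapsed rows: one per complete prefix, carrying its most frequent Source
INSERT INTO newTable (Digit, Source)
SELECT t.Digit / 10,
       (SELECT TOP 1 i.Source
        FROM yourTable i
        WHERE i.Digit / 10 = t.Digit / 10
        GROUP BY i.Source
        ORDER BY COUNT(*) DESC)
FROM yourTable t
GROUP BY t.Digit / 10
HAVING COUNT(*) = 10;

-- 2. drop the original rows that each collapsed row replaces
DELETE o
FROM yourTable o
JOIN newTable n
  ON o.Digit / 10 = n.Digit
 AND o.Source = n.Source;

-- 3. carry over everything that is left (other sources, incomplete prefixes)
INSERT INTO newTable (Digit, Source)
SELECT Digit, Source FROM yourTable;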