SQL select all rows per group after a condition is met - sql

I would like to select all rows for each group after the last time a condition is met for that group. This related question has an answer using correlated subqueries.
In my case I will have millions of categories and hundreds of millions/billions of rows. Is there a way to achieve the same results using a more performant query?
Here is an example. I want all rows (per group) after the last 0 in the condition column.
category | timestamp | condition
--------------------------------------
A | 1 | 0
A | 2 | 1
A | 3 | 0
A | 4 | 1
A | 5 | 1
B | 1 | 0
B | 2 | 1
B | 3 | 1
The result I would like to achieve is
category | timestamp | condition
--------------------------------------
A | 4 | 1
A | 5 | 1
B | 2 | 1
B | 3 | 1

If you want everything after the last 0, you can use window functions:
select t.*
from (select t.*,
max(case when condition = 0 then timestamp end) over (partition by category) as max_timestamp_0
from t
) t
where timestamp > max_timestamp_0 or
max_timestamp_0 is null;
With an index on (category, condition, timestamp), the correlated subquery version might also perform quite well:
select t.*
from t
where t.timestamp > all (select t2.timestamp
from t t2
where t2.category = t.category and
t2.condition = 0
);
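To check the window-function version against the sample data from the question, here is a minimal sketch that runs it through Python's sqlite3 module. It assumes a SQLite build with window-function support (3.25 or later); the table name t comes from the answer, while the helper name rows_after_last_zero and the added ORDER BY are only for illustration.

```python
import sqlite3

# Sample rows from the question: (category, timestamp, condition).
ROWS = [
    ("A", 1, 0), ("A", 2, 1), ("A", 3, 0), ("A", 4, 1), ("A", 5, 1),
    ("B", 1, 0), ("B", 2, 1), ("B", 3, 1),
]

# Same logic as the answer; ORDER BY added so the output order is stable.
QUERY = """
select category, timestamp, condition
from (select t.*,
             max(case when condition = 0 then timestamp end)
                 over (partition by category) as max_timestamp_0
      from t
     ) t
where timestamp > max_timestamp_0 or
      max_timestamp_0 is null
order by category, timestamp
"""

def rows_after_last_zero(rows):
    """Load rows into an in-memory table and run the window query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table t (category text, timestamp integer, condition integer)"
        )
        conn.executemany("insert into t values (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()

result = rows_after_last_zero(ROWS)
print(result)
```

The `max_timestamp_0 is null` branch matters for a category that never has a 0: every row of that category is kept.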

You might want to try window functions:
select category, timestamp, condition
from (
select
t.*,
min(condition) over(partition by category order by timestamp desc) min_cond
from mytable t
) t
where min_cond = 1
The window min() with the order by clause computes the minimum value of condition over the current and following rows of the same category: we can use it as a filter to eliminate rows for which there is a more recent row with a 0.
Compared to the correlated subquery approach, the upside of using window functions is that it reduces the number of scans needed on the table. Of course, this computation also has a cost, so you'll need to assess both solutions against your sample data.
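To see what the running min() produces per row, the sketch below runs the inner query on its own and then the full query, using Python's sqlite3 module as a harness. It assumes SQLite 3.25 or later and that condition only ever holds 0 or 1; the constant names INNER and FULL are made up for this illustration.

```python
import sqlite3

# Sample rows from the question: (category, timestamp, condition).
ROWS = [
    ("A", 1, 0), ("A", 2, 1), ("A", 3, 0), ("A", 4, 1), ("A", 5, 1),
    ("B", 1, 0), ("B", 2, 1), ("B", 3, 1),
]

# Inner query only: the minimum of condition over the current row and all
# later rows (by timestamp) of the same category.
INNER = """
select t.category, t.timestamp, t.condition,
       min(condition) over (partition by category order by timestamp desc) as min_cond
from mytable t
"""

# Full query from the answer, with an ORDER BY added for stable output.
FULL = f"""
select category, timestamp, condition
from ({INNER}) t
where min_cond = 1
order by category, timestamp
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable (category text, timestamp integer, condition integer)"
)
conn.executemany("insert into mytable values (?, ?, ?)", ROWS)

annotated = conn.execute(INNER + " order by category, timestamp").fetchall()
kept = conn.execute(FULL).fetchall()
conn.close()

for row in annotated:
    print(row)   # last field is min_cond
print(kept)
```

Rows whose min_cond is 0 still have a 0 at or after them, so the filter drops them.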

Related

postgres to grab the first N most populous groups and put the rest in an "Other" group

I have a table with the columns status, operator, and cost, where the columns status and operator are categorical and I want to sum up the cost per (status, operator) pair. Usually I would do this with a simple statement like
SELECT SUM(cost), status, operator FROM my_table GROUP BY status, operator;
but the hard part is that there could be hundreds of unique operators, which I can't visualize for the client in a meaningful way. What I want to do is explicitly show only the top N operator categories (meaning the top N operators that have the highest SUM(cost) across the entire dataset) and then group all of the remaining rows in an "Other" operator. An inefficient way to do this would be the following:
-- letting N = 12
SELECT
SUM(cost),
status,
CASE
WHEN operator IN (
SELECT t.operator
FROM my_table AS t
GROUP BY t.operator ORDER BY SUM(t.cost) DESC
LIMIT 12
) THEN operator
ELSE 'Other'
END AS operator
FROM my_table
GROUP BY
status,
CASE
WHEN operator IN (
SELECT t.operator
FROM my_table AS t
GROUP BY t.operator ORDER BY SUM(t.cost) DESC
LIMIT 12
) THEN operator
ELSE 'Other'
END;
While the inefficient way works, in production it is far too slow. In actuality, the cost is not a simple column in a table but is computed by a subquery that is very slow to compute and the table is large, so I can't afford to use the CASE statement with an IN clause. What I would rather do is somehow have the full table where I use the GROUP BY statement I listed first in a FROM-clause subquery, then aggregate that to get the top N operator categories and an "Other" category. I tried to do this with window functions but I don't really understand how those work and I could not find something that got the right answer. If somebody could help it would be greatly appreciated.
EDIT: The cost column is not an actual column. I should have been more clear. It's computed by a very expensive subquery so I want to compute the cost for each row of the original table as few times as possible.
Example:
Say we have a table that looks like:
pk | status | operator | cost
----+-----------+-------------------+----------------------
1 | A | op_1 | 1
2 | A | op_1 | 5
3 | A | op_1 | 3
4 | A | op_1 | 7
5 | B | op_2 | 10
6 | B | op_2 | 15
7 | A | op_3 | 100
8 | A | op_4 | 1000
9 | B | op_5 | 12000
10 | A | op_5 | 10200
11 | B | op_5 | 10020
If I only want the top 3 operators (meaning the three operators with the highest SUM(cost) - in this case operators 3, 4, 5), the query should return:
status | operator | cost
-----------+-------------------+----------------------
B | op_5 | 32220
A | op_4 | 1000
A | op_3 | 100
B | Other | 25
A | Other | 16
In this example, operators 1-2 get rolled up into the "Other" operator, since we only want the top 3 given explicitly. So the first "Other" row in the result table sums all rows where status=B and operator is not one of the top three operators. The second "Other" row sums up all the rows where status=A and operator is not one of the top three operators.
I have converted your query from sub-query to join. You can use Left join as follows:
SELECT
SUM(cost),
status,
CASE
WHEN tt.operator is not null
THEN tt.operator
ELSE 'Other'
END AS operator
FROM my_table t
LEFT JOIN (
SELECT t.operator FROM my_table AS t
GROUP BY t.operator
ORDER BY SUM(t.cost) DESC LIMIT 12 ) tt
On t.operator = tt.operator
GROUP BY
status,
CASE
WHEN tt.operator is not null
THEN tt.operator
ELSE 'Other'
END;
Now, coming to what I understood from the description: with N = 12, you want at most 13 rows (12 operators and 1 "Other") for each status. You can use the row_number window function as follows:
SELECT SUM(cost) AS cost,
status,
operator
FROM
(SELECT SUM(cost) AS cost,
status,
CASE WHEN ROW_NUMBER() OVER (PARTITION BY status ORDER BY SUM(cost) DESC) <= 12
THEN operator
ELSE 'Other'
END AS operator
FROM my_table
GROUP BY status, operator) t
GROUP BY status, operator;
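If the top N should be global (across the entire dataset, as the question describes) rather than per status, one option is to aggregate per (status, operator) once and derive everything else from that small intermediate result. The following is a rough sketch of that idea, run through Python's sqlite3 module on the sample table with N = 3. The CTE names agg and top_ops and the total_cost alias are made up for illustration, and whether a given database actually evaluates agg only once depends on its planner, so it is worth checking the plan on Postgres.

```python
import sqlite3

# Sample table from the question: (pk, status, operator, cost).
ROWS = [
    (1, "A", "op_1", 1), (2, "A", "op_1", 5), (3, "A", "op_1", 3),
    (4, "A", "op_1", 7), (5, "B", "op_2", 10), (6, "B", "op_2", 15),
    (7, "A", "op_3", 100), (8, "A", "op_4", 1000), (9, "B", "op_5", 12000),
    (10, "A", "op_5", 10200), (11, "B", "op_5", 10020),
]

QUERY = """
WITH agg AS (
    -- the only place that reads the base table (and the expensive cost)
    SELECT status, operator, SUM(cost) AS cost
    FROM my_table
    GROUP BY status, operator
),
top_ops AS (
    -- top N operators by total cost, computed from the small aggregate
    SELECT operator
    FROM agg
    GROUP BY operator
    ORDER BY SUM(cost) DESC
    LIMIT :n
)
SELECT a.status,
       COALESCE(tt.operator, 'Other') AS operator,
       SUM(a.cost) AS total_cost
FROM agg a
LEFT JOIN top_ops tt ON tt.operator = a.operator
GROUP BY a.status, COALESCE(tt.operator, 'Other')
ORDER BY total_cost DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (pk INTEGER, status TEXT, operator TEXT, cost INTEGER)"
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", ROWS)
result = conn.execute(QUERY, {"n": 3}).fetchall()
conn.close()

for row in result:
    print(row)
```

Because the final grouping still includes status, op_5 comes back as one row per status (22020 for B and 10200 for A) rather than as a single combined row.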

Select Top N rows plus another Select based on previous result

I am quite new to SQL.
I have a MS SQL DB where I would like to fetch the top 3 rows with datetime above a specific input PLUS get all the rows where the datetime value is equal to the last row of the previous fetch.
| rowId | Timestamp | data |
|-------|--------------------------|------|
| rsg | 2019-01-01T00:00:00.000Z | 120 |
| zqd | 2020-01-01T00:00:00.000Z | 36 |
| ylp | 2020-01-01T00:00:00.000Z | 48 |
| abt | 2022-01-01T00:00:00.000Z | 53 |
| zio | 2022-01-01T00:00:00.000Z | 12 |
Here is my current request to fetch the 3 rows.
SELECT
TOP 3 *
FROM
Table
WHERE
Timestamp >= '2020-01-01T00:00:00.000Z'
ORDER BY
Timestamp ASC
Here I would like to get in one request the last 4 rows.
Thanks for your help
One possibility, using ROW_NUMBER:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY Timestamp) rn
FROM yourTable
WHERE Timestamp >= '2020-01-01T00:00:00.000Z'
)
SELECT *
FROM cte
WHERE
Timestamp <= (SELECT Timestamp FROM cte WHERE rn = 3);
Matching records should be in the first 3 rows or should have timestamps equal to the timestamp in the third row. We can combine these conditions by restricting to timestamps equal to or before the timestamp in the third row.
Or maybe use TOP 3 WITH TIES:
SELECT TOP 3 WITH TIES *
FROM yourTable
WHERE Timestamp >= '2020-01-01T00:00:00.000Z'
ORDER BY Timestamp;
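As a quick check of the ROW_NUMBER approach on the sample data, here is a small sketch using Python's sqlite3 module (SQLite has no TOP ... WITH TIES, so only the CTE variant is shown). It assumes SQLite 3.25 or later, applies the date filter from the question inside the CTE, and stores the timestamps as ISO-8601 text so that string comparison matches chronological order. The column is named row_id here instead of rowId to stay clear of SQLite's built-in rowid, and the parameter names start and n are illustrative.

```python
import sqlite3

# Sample rows from the question: (row_id, Timestamp, data).
ROWS = [
    ("rsg", "2019-01-01T00:00:00.000Z", 120),
    ("zqd", "2020-01-01T00:00:00.000Z", 36),
    ("ylp", "2020-01-01T00:00:00.000Z", 48),
    ("abt", "2022-01-01T00:00:00.000Z", 53),
    ("zio", "2022-01-01T00:00:00.000Z", 12),
]

QUERY = """
WITH cte AS (
    SELECT row_id, Timestamp, data,
           ROW_NUMBER() OVER (ORDER BY Timestamp) AS rn
    FROM yourTable
    WHERE Timestamp >= :start
)
SELECT row_id, Timestamp, data
FROM cte
WHERE Timestamp <= (SELECT Timestamp FROM cte WHERE rn = :n)
ORDER BY Timestamp, row_id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (row_id TEXT, Timestamp TEXT, data INTEGER)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?, ?)", ROWS)
params = {"start": "2020-01-01T00:00:00.000Z", "n": 3}
result = conn.execute(QUERY, params).fetchall()
conn.close()

row_ids = [r[0] for r in result]
print(row_ids)
```

One caveat of this form: if fewer than n rows pass the filter, the scalar subquery yields NULL and the outer query returns nothing.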

Aggregate rows between two rows with certain value

I'm trying to formulate a query to aggregate rows that are between rows with a specific value: in this example I want to collapse and sum time of all rows that have an ID other than 1, but still show rows with ID 1.
This is my table:
ID | Time
----+-----------
1 | 60
2 | 10
3 | 15
1 | 30
4 | 100
1 | 20
This is the result I'm looking for:
ID | Time
--------+-----------
1 | 60
Other | 25
1 | 30
Other | 100
1 | 20
I have attempted to SUM and add a condition with CASE, but so far my solutions only get me to sum ALL rows and I lose the intervals, so I get this:
ID | Time
------------+-----------
Other | 125
1 | 110
Any help or suggestions in the right direction would be greatly appreciated, thanks!
You need to define the groupings. SQLite is not great for this sort of manipulation, but you can do it by summing the "1" values up to each value.
In SQLite, we can use the rowid column for the ordering:
select (case when id = 1 then '1' else 'other' end) as which,
sum(time)
from (select t.*,
(select count(*) from t t2 where t2.rowid <= t.rowid and t2.id = 1) as grp
from t
) t
group by (case when id = 1 then '1' else 'other' end), grp
order by grp, which;
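The grouping trick can be tried directly, since the query is already SQLite. The sketch below runs it through Python's sqlite3 module on the sample table and, for comparison, also runs a variant that computes grp with a running window sum instead of the correlated count. The window variant is an assumption on my part (it needs SQLite 3.25 or later) and is not part of the answer; both rely on rowid reflecting insertion order.

```python
import sqlite3

# Sample rows from the question, in insertion order: (id, time).
ROWS = [(1, 60), (2, 10), (3, 15), (1, 30), (4, 100), (1, 20)]

# Query from the answer: grp counts the id = 1 rows seen so far.
CORRELATED = """
select (case when id = 1 then '1' else 'other' end) as which,
       sum(time) as total
from (select t.*,
             (select count(*) from t t2
              where t2.rowid <= t.rowid and t2.id = 1) as grp
      from t
     ) t
group by (case when id = 1 then '1' else 'other' end), grp
order by grp, which
"""

# Illustrative variant: the same grp via a running sum of the 0/1 test.
WINDOWED = """
select (case when id = 1 then '1' else 'other' end) as which,
       sum(time) as total
from (select t.*,
             sum(id = 1) over (order by rowid) as grp
      from t
     ) t
group by (case when id = 1 then '1' else 'other' end), grp
order by grp, which
"""

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, time integer)")
conn.executemany("insert into t values (?, ?)", ROWS)
correlated_result = conn.execute(CORRELATED).fetchall()
windowed_result = conn.execute(WINDOWED).fetchall()
conn.close()

print(correlated_result)
print(windowed_result)
```

Each id = 1 row starts a new grp value, and the non-1 rows that follow it inherit that value, which is what keeps the intervals separate.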

SQL - Calculate next row based on previous in the same column

I have spent hours trying to solve this with loops and the lag function, but neither solves my problem. I have a table where the first row of a particular field is populated; each following row is calculated by subtracting one column from the other in the previous row, and the row after that is based on that result. Below are the original table and the desired result set:
Original table:
a | b
-------+--------
502.5 | 33.85
 | 25.46
 | 20.83
 | 133.07
 | 144.65
 | 144.65
Result set:
a | b
-------+--------
502.5 | 33.85
468.65 | 25.46
443.19 | 20.83
422.36 | 133.07
289.29 | 144.65
144.64 | 144.65
I have tried several different methods with stored procedures and can get the second row of the result set, but I can't get it to continue and calculate the rest of the fields. It's easy in Excel but not so in SQL. Any suggestions?
If your RDBMS supports windowed aggregate functions, and assuming you have an id or similar column that determines the order of your rows (as you indicated there is a first row), you can use the max() over() and sum() over() windowed aggregate functions. In this case min() works in place of max() as well, because a has only one non-null value:
select
id
, max(a) over (order by id) - (sum(b) over (order by id) - b) as a
, b
from t
rextester demo: http://rextester.com/MGKM17497
returns:
+----+--------+--------+
| id | a | b |
+----+--------+--------+
| 1 | 502,50 | 33,85 |
| 2 | 468,65 | 25,46 |
| 3 | 443,19 | 20,83 |
| 4 | 422,36 | 133,07 |
| 5 | 289,29 | 144,65 |
| 6 | 144,64 | 144,65 |
+----+--------+--------+
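The arithmetic behind that query is: carry the first a forward with max() over(), then subtract the sum of all earlier b values, which is the running sum of b minus the current b. A minimal sketch that replays it on the sample data with Python's sqlite3 module follows; it assumes SQLite 3.25 or later and an id column numbered 1 to 6, and it rounds to two decimals because the values are stored as floating point.

```python
import sqlite3

# Sample data from the question; a is only populated on the first row.
ROWS = [
    (1, 502.5, 33.85), (2, None, 25.46), (3, None, 20.83),
    (4, None, 133.07), (5, None, 144.65), (6, None, 144.65),
]

# Query from the answer, with an ORDER BY added for stable output.
QUERY = """
select id,
       max(a) over (order by id) - (sum(b) over (order by id) - b) as a,
       b
from t
order by id
"""

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, a real, b real)")
conn.executemany("insert into t values (?, ?, ?)", ROWS)
rows = conn.execute(QUERY).fetchall()
conn.close()

# Round to two decimals to hide floating-point noise.
computed_a = [round(a, 2) for _, a, _ in rows]
print(computed_a)
```

For row 2, for example, this evaluates to 502.5 - (59.31 - 25.46) = 468.65.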
In case the data looks the way it did before the question was edited, here is another approach. This solution also assumes that you have an id column and that the row order depends on it:
with t(id, a, b) as(
select 1, 502.5, 33.85 union all
select 2, 25.46, null union all
select 3, 20.83, null union all
select 4, 133.07, null union all
select 5, 144.65, null union all
select 6, 144.65, null
)
select case when id = 1 then a else b end as a, case when id = 1 then (select b from t order by id offset 0 rows fetch next 1 rows only) else a end as b from (
select id, a, lag((select a from t order by id offset 0 rows fetch next 1 rows only)-s) over(order by id) as b from (
select id, a, sum(case when b is null then a else b end ) over(order by id) s
from t
) tt
) ttt

Is it possible to do so without using nested SELECTS?

Suppose I have the following table:
--------------------------------------------
ReceiptNo | Date | EmployeeID | Qty
--------------------------------------------
1 | 12-DEC-2015 | 1 | 200
2 | 13-DEC-2015 | 1 | 500
3 | 13-DEC-2015 | 1 | 100
4 | 13-DEC-2015 | 3 | 100
5 | 13-DEC-2015 | 3 | 500
6 | 13-DEC-2015 | 2 | 75
--------------------------------------------
Show the tuples with maximum Qty.
Answer:
--------------------------------------------
2 | 13-DEC-2015 | 1 | 500
5 | 13-DEC-2015 | 3 | 500
--------------------------------------------
I need to use aggregate function MAX().
Is it possible to do so without using nested SELECTS?
Try this in SQL Server:
SELECT TOP 1 WITH TIES *
FROM TABLE
ORDER BY QTY DESC
No.
You can't show the tuples with maximum Qty, using the max aggregate function while avoiding nested selects.
VR46 posted a nice way to do it without using nested selects, but also without the max aggregate. A similar approach can be used in Oracle 12c using the FETCH clause:
select *
from table
order by qty desc
fetch first row with ties
If you want to use the max aggregate, this is the way to do it:
select *
from table
where qty = (select max(qty) from table)
Another way to do it would be using the rank or dense_rank window functions, but they require a nested select, and do not use the max aggregate function:
select *
from (select t.*,
dense_rank() over (order by t.qty desc) as rnk
from table t) t
where t.rnk = 1
Not using max, but plain "cross-platform" ANSI SQL without nested queries:
SELECT t1.*
FROM mytable t1
LEFT OUTER JOIN mytable t2 ON t2.Qty > t1.Qty
WHERE t2.Qty IS NULL
Retrieves all records for which there is no record with a greater quantity in the same table.
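To confirm that the self-join returns the same rows as the MAX() subquery on the sample table, here is a small sketch using Python's sqlite3 module. The table name mytable follows the last answer; the ORDER BY clauses are added only to make the two results directly comparable.

```python
import sqlite3

# Sample table from the question: (ReceiptNo, Date, EmployeeID, Qty).
ROWS = [
    (1, "12-DEC-2015", 1, 200),
    (2, "13-DEC-2015", 1, 500),
    (3, "13-DEC-2015", 1, 100),
    (4, "13-DEC-2015", 3, 100),
    (5, "13-DEC-2015", 3, 500),
    (6, "13-DEC-2015", 2, 75),
]

# Anti-join: keep rows for which no row with a greater Qty exists.
ANTI_JOIN = """
SELECT t1.*
FROM mytable t1
LEFT OUTER JOIN mytable t2 ON t2.Qty > t1.Qty
WHERE t2.Qty IS NULL
ORDER BY t1.ReceiptNo
"""

# Reference version using the MAX aggregate in a nested select.
MAX_SUBQUERY = """
SELECT *
FROM mytable
WHERE Qty = (SELECT MAX(Qty) FROM mytable)
ORDER BY ReceiptNo
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (ReceiptNo INTEGER, Date TEXT, EmployeeID INTEGER, Qty INTEGER)"
)
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", ROWS)
anti_join_rows = conn.execute(ANTI_JOIN).fetchall()
max_rows = conn.execute(MAX_SUBQUERY).fetchall()
conn.close()

print(anti_join_rows)
print(max_rows)
```

The two forms differ when Qty can be NULL: the anti-join also returns rows whose Qty is NULL, because no comparison against them is ever true, while the MAX() version does not.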