SQL: select top fewest rows with sum more than 0.7

The raw data table is
+--------+--------+
| id | value |
+--------+--------+
| 1 | 0.1 |
| 1 | 0.2 |
| 1 | 0.3 |
| 1 | 0.2 |
| 1 | 0.2 |
| 2 | 0.4 |
| 2 | 0.5 |
| 2 | 0.1 |
| 3 | 0.5 |
| 3 | 0.5 |
+--------+--------+
For each id, the values sum to 1. I want to select the fewest rows of each id whose value sum is greater than or equal to 0.7, like
+--------+--------+
| id | value |
+--------+--------+
| 1 | 0.3 |
| 1 | 0.2 |
| 1 | 0.2 |
| 2 | 0.5 |
| 2 | 0.4 |
| 3 | 0.5 |
| 3 | 0.5 |
+--------+--------+
How to solve this problem?

It's neither pretty nor efficient but it's the best I can come up with.
Disclaimer: I'm sure this will perform horribly on any real-world dataset.
with recursive calc (id, row_list, value_list, total_value) as (
    select id, array[ctid], array[value]::numeric(6,2)[], value::numeric(6,2) as total_value
    from data
    union all
    select c.id, p.row_list||c.ctid, (p.value_list||c.value)::numeric(6,2)[], (p.total_value + c.value)::numeric(6,2)
    from data as c
      join calc as p on p.id = c.id and c.ctid <> all(p.row_list)
)
select id, unnest(min(value_list)) as value
from (
    select id,
           value_list,
           array_length(row_list,1) num_values,
           min(array_length(row_list,1)) over (partition by id) as min_num_values
    from calc
    where total_value >= 0.7
) as result
where num_values = min_num_values
group by id
SQLFiddle example: http://sqlfiddle.com/#!15/8966b/1
How does this work?
The recursive CTE (the with recursive part) creates all possible combinations of values from the table. To make sure that the same value is not counted twice, I'm collecting the CTIDs (a Postgres-internal unique identifier for each row) of all rows already processed into an array. The recursive join condition (p.id = c.id and c.ctid <> all(p.row_list)) then makes sure only values for the same id are added, and only rows that have not yet been processed.
The result of the CTE is then reduced to all rows where the total sum (the column total_value) is >= 0.7.
The final outer select (the alias result) is then filtered down to those rows where the number of values making up the total sum is the smallest. The group by id together with min(value_list) picks a single combination per id, and unnest then transforms that array back into a proper "table". Picking just one array is necessary because the CTE collects every ordering of each combination, so that for e.g. id=2 the value_list column will contain both {0.50,0.40} and {0.40,0.50}. Without it, unnest would return both orderings, making a total of four rows for id=2.
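The combinatorial idea behind the CTE (enumerate combinations, keep the smallest one that reaches the threshold) can also be sketched outside the database. `smallest_subset` is an illustrative helper of mine, not part of the query above:

```python
from itertools import combinations

# Sample data from the question: (id, value) pairs.
rows = [(1, 0.1), (1, 0.2), (1, 0.3), (1, 0.2), (1, 0.2),
        (2, 0.4), (2, 0.5), (2, 0.1),
        (3, 0.5), (3, 0.5)]

def smallest_subset(values, threshold=0.7):
    """Return the fewest values whose sum reaches the threshold, largest first."""
    # Mirror what the recursive CTE enumerates: try combinations in
    # increasing size and stop at the first one over the mark.
    for size in range(1, len(values) + 1):
        for combo in combinations(values, size):
            if sum(combo) >= threshold:
                return sorted(combo, reverse=True)
    return None

by_id = {}
for id_, value in rows:
    by_id.setdefault(id_, []).append(value)

result = {id_: smallest_subset(values) for id_, values in by_id.items()}
print(result)
```

Unlike the SQL version, this stops as soon as a small-enough combination is found, but it makes the exponential nature of the search obvious.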

This also isn't that pretty, but I think it'd be more efficient (and more portable between RDBMSs)
with unique_data as (
select id
, value
, row_number() over ( partition by id order by value desc ) as rn
from my_table
)
, cumulative_sum as (
select id
, value
, sum(value) over ( partition by id order by rn ) as csum
from unique_data
)
, first_over_the_mark as (
select id
, value
, csum
, lag(csum) over ( partition by id order by csum ) as prev_value
from cumulative_sum
)
select *
from first_over_the_mark
where coalesce(prev_value, 0) < 0.7
SQL Fiddle
I've done it with CTEs to make it easier to see what's happening but there's no need to use them.
It uses a cumulative sum. The first CTE makes the rows unique: without it, the three 0.2 values for id 1 count as the same value in the ordering, so all rows with 0.2 would get summed together. The second CTE works out the running sum, and the third picks up the previous running sum with lag(). If the previous cumulative sum is strictly less than 0.7, pick up the row: the idea being that if the cumulative sum before the current row is still below 0.7, the current row is needed to reach (or exceed) 0.7.
It's worth noting that this will break down if you have any rows in your table where the value is 0.
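As a sanity check, the three CTEs above run unchanged on SQLite 3.25+ (which added window functions), e.g. via Python's sqlite3 module, using the sample data from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table my_table (id integer, value real)")
con.executemany("insert into my_table values (?, ?)",
                [(1, 0.1), (1, 0.2), (1, 0.3), (1, 0.2), (1, 0.2),
                 (2, 0.4), (2, 0.5), (2, 0.1),
                 (3, 0.5), (3, 0.5)])

# Same three-step query as the answer: de-duplicate with row_number(),
# compute the running sum, then look one row back with lag().
sql = """
with unique_data as (
    select id, value,
           row_number() over (partition by id order by value desc) as rn
    from my_table
),
cumulative_sum as (
    select id, value, rn,
           sum(value) over (partition by id order by rn) as csum
    from unique_data
),
first_over_the_mark as (
    select id, value, csum,
           lag(csum) over (partition by id order by csum) as prev_value
    from cumulative_sum
)
select id, value
from first_over_the_mark
where coalesce(prev_value, 0) < 0.7
order by id, value desc
"""
rows = [(id_, round(value, 2)) for id_, value in con.execute(sql)]
print(rows)
```

The rounding in Python is only cosmetic, to keep floating-point noise out of the printed result.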

This is a variant on Ben's method, but it is simpler to implement. You just need a cumulative sum, ordered by value in reverse, and then to take everything where the cumulative sum is less than 0.7, plus the first row that crosses that threshold.
select t.*
from (select t.*,
sum(value) over (partition by id order by value desc) as csum
from t
) t
where csum - value < 0.7;
The expression csum - value is the cumulative sum minus the current value (you can also get this using something like rows between unbounded preceding and 1 preceding). Your condition is that this value is less than some threshold.
EDIT:
Ben's comment is right about duplicate values. His solution is fine. Here is another solution:
select t.*
from (select t.*,
sum(value) over (partition by id order by value desc, random()) as csum
from t
) t
where csum - value < 0.7;
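This variant can be tried on SQLite too. Note that SQLite's default window framing (RANGE) gives tied values the same csum, which is exactly the duplicate-value problem the edit addresses and which the random() tie-break fixes. To keep the arithmetic exact, this sketch (an assumption of mine, not part of the answer) stores values as integer tenths, so the 0.7 threshold becomes 7:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t (id integer, value integer)")  # values in tenths
con.executemany("insert into t values (?, ?)",
                [(1, 1), (1, 2), (1, 3), (1, 2), (1, 2),
                 (2, 4), (2, 5), (2, 1),
                 (3, 5), (3, 5)])

# random() breaks ties between equal values, so each duplicate row gets
# its own position in the cumulative sum instead of sharing one csum.
rows = sorted(con.execute("""
    select id, value
    from (select t.*,
                 sum(value) over (partition by id order by value desc, random()) as csum
          from t
         ) t
    where csum - value < 7
"""))
print(rows)
```

Because the three duplicate 0.2 rows of id 1 are interchangeable, the result is deterministic as a set even though random() decides their order.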

Related

Is there a way to calculate average based on distinct rows without using a subquery?

If I have data like so:
+----+-------+
| id | value |
+----+-------+
| 1 | 10 |
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 2 | 20 |
+----+-------+
How do I calculate the average based on the distinct id WITHOUT using a subquery (i.e. querying the table directly)?
For the above example it would be (10+20+30)/3 = 20
I tried to do the following:
SELECT AVG(IF(id = LAG(id) OVER (ORDER BY id), NULL, value)) AS avg
FROM table
Basically I was thinking that if I order by id and check the previous row to see if it has the same id, the value should be NULL and thus it would not be counted into the calculation, but unfortunately I can't put analytical functions inside aggregate functions.
As far as I know, you can't do this without a subquery. I would use:
SELECT AVG(avg_value)
FROM
(
SELECT AVG(value) AS avg_value
FROM yourTable
GROUP BY id
) t;
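A quick check of the subquery version on SQLite via Python's sqlite3 (yourTable is the placeholder name from the answer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table yourTable (id integer, value integer)")
con.executemany("insert into yourTable values (?, ?)",
                [(1, 10), (1, 10), (2, 20), (3, 30), (2, 20)])

# Inner query: one average per id (10, 20, 30); outer query averages those.
(result,) = con.execute("""
    select avg(avg_value)
    from (select avg(value) as avg_value
          from yourTable
          group by id
         ) t
""").fetchone()
print(result)  # 20.0
```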
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
    FROM yourTable
    QUALIFY rn = 1
)
SELECT AVG(value)
FROM ranked
(QUALIFY is available in dialects such as Teradata, Snowflake, BigQuery and DuckDB; elsewhere, filter rn = 1 in an outer query instead.)
The outer query will have other parameters that need to access all the data in the table
I interpret this comment as wanting an average on every row -- rather than doing an aggregation. If so, you can use window functions:
select t.*,
avg(case when seqnum = 1 then value end) over () as overall_avg
from (select t.*,
row_number() over (partition by id order by id) as seqnum
from t
) t;
Yes, there is a way.
Simply use DISTINCT inside the AVG function, as below:
select avg(distinct value) from tab;
http://sqlfiddle.com/#!4/9d156/2/0
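One caveat with this shortcut: AVG(DISTINCT value) de-duplicates values, not ids, so it matches the per-id average only while different ids never share a value. A quick SQLite check of both cases (the id 4 row is my own addition, to illustrate the failure mode):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table tab (id integer, value integer)")
con.executemany("insert into tab values (?, ?)",
                [(1, 10), (1, 10), (2, 20), (3, 30), (2, 20)])

(ok,) = con.execute("select avg(distinct value) from tab").fetchone()
print(ok)  # 20.0 -- works because ids 1, 2, 3 all have distinct values

# But if two different ids share a value, the distinct trick undercounts:
con.execute("insert into tab values (4, 10)")  # id 4 also has value 10
(off,) = con.execute("select avg(distinct value) from tab").fetchone()
print(off)  # still 20.0, although the per-id average is now (10+20+30+10)/4 = 17.5
```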

Find the count of IDs that have the same value

I'd like to get a count of all of the IDs that have the same value (Drops) as other IDs. For instance, the illustration below shows that IDs 1 and 3 have A drops, so the query would count them. Similarly, IDs 7 and 18 have B drops, which is another two IDs, for a total of 4 IDs that share the same values; that's what my query would return.
+------+-------+
| ID | Drops |
+------+-------+
| 1 | A |
| 2 | C |
| 3 | A |
| 7 | B |
| 18 | B |
+------+-------+
I've tried several approaches; the following query was my last attempt.
With cte1 (Id1, D1) as
(
select Id, Drops
from Posts
),
cte2 (Id2, D2) as
(
select Id, Drops
from Posts
)
Select count(distinct c1.Id1) newcnt, c1.D1
from cte1 c1
left outer join cte2 c2 on c1.D1 = c2.D2
group by c1.D1
The result, if written out in full, would be a single value, but the records that the query should be choosing look as follows:
+------+-------+
| ID | Drops |
+------+-------+
| 1 | A |
| 3 | A |
| 7 | B |
| 18 | B |
+------+-------+
Any advice would be great. Thanks
You can use a CTE to generate a list of Drops values that have more than one corresponding ID value, and then JOIN that to Posts to find all rows which have a Drops value that has more than one Post:
WITH CTE AS (
SELECT Drops
FROM Posts
GROUP BY Drops
HAVING COUNT(*) > 1
)
SELECT P.*
FROM Posts P
JOIN CTE ON P.Drops = CTE.Drops
Output:
ID Drops
1 A
3 A
7 B
18 B
If desired you can then count those posts in total (or grouped by Drops value):
WITH CTE AS (
SELECT Drops
FROM Posts
GROUP BY Drops
HAVING COUNT(*) > 1
)
SELECT COUNT(*) AS newcnt
FROM Posts P
JOIN CTE ON P.Drops = CTE.Drops
Output
newcnt
4
Demo on SQLFiddle
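Both queries run unchanged on SQLite; a quick check of the counting version via Python's sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table Posts (ID integer, Drops text)")
con.executemany("insert into Posts values (?, ?)",
                [(1, 'A'), (2, 'C'), (3, 'A'), (7, 'B'), (18, 'B')])

# The CTE keeps only Drops values shared by more than one post; the join
# then counts every post carrying one of those values.
(newcnt,) = con.execute("""
    with cte as (
        select Drops
        from Posts
        group by Drops
        having count(*) > 1
    )
    select count(*) as newcnt
    from Posts p
    join cte on p.Drops = cte.Drops
""").fetchone()
print(newcnt)  # 4
```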
You may use dense_rank() to solve this: within each drops partition, duplicate ids get the same rank, so count(distinct rnk) counts the distinct ids per drops value.
Here is the demo.
with cte as
(
select
drops,
count(distinct rnk) as newCnt
from
( select
*,
dense_rank() over (partition by drops order by id) as rnk
from myTable
) t
group by
drops
having count(distinct rnk) > 1
)
select
sum(newCnt) as newCnt
from cte
Output:
|newcnt |
|------ |
| 4 |
First group to get the count of ids per drops value, then sum the counts greater than 1.
select sum(countdrops) as total from
(select drops , count(id) as countdrops from yourtable group by drops) as temp
where countdrops > 1;

SQL select all rows per group after a condition is met

I would like to select all rows for each group after the last time a condition is met for that group. This related question has an answer using correlated subqueries.
In my case I will have millions of categories and hundreds of millions/billions of rows. Is there a way to achieve the same results using a more performant query?
Here is an example. The condition is all rows (per group) after the last 0 in the conditional column.
category | timestamp | condition
--------------------------------------
A | 1 | 0
A | 2 | 1
A | 3 | 0
A | 4 | 1
A | 5 | 1
B | 1 | 0
B | 2 | 1
B | 3 | 1
The result I would like to achieve is
category | timestamp | condition
--------------------------------------
A | 4 | 1
A | 5 | 1
B | 2 | 1
B | 3 | 1
If you want everything after the last 0, you can use window functions:
select t.*
from (select t.*,
max(case when condition = 0 then timestamp end) over (partition by category) as max_timestamp_0
from t
) t
where timestamp > max_timestamp_0 or
max_timestamp_0 is null;
With an index on (category, condition, timestamp), the correlated subquery version might also perform quite well:
select t.*
from t
where t.timestamp > all (select t2.timestamp
from t t2
where t2.category = t.category and
t2.condition = 0
);
You might want to try window functions:
select category, timestamp, condition
from (
select
t.*,
min(condition) over(partition by category order by timestamp desc) min_cond
from mytable t
) t
where min_cond = 1
The window min() with the descending order by clause computes the minimum value of condition over the current row and all rows of the same category with a later timestamp: we can use it as a filter to eliminate rows that are followed by a 0.
Compared to the correlated subquery approach, the upside of using window functions is that it reduces the number of scans needed on the table. Of course this computing also has a cost, so you'll need to assess both solutions against your sample data.
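Both approaches work on SQLite as well; here is a minimal check of the min(condition) version against the sample data (mytable is the answer's placeholder name):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table mytable (category text, timestamp integer, condition integer)")
con.executemany("insert into mytable values (?, ?, ?)",
                [('A', 1, 0), ('A', 2, 1), ('A', 3, 0), ('A', 4, 1), ('A', 5, 1),
                 ('B', 1, 0), ('B', 2, 1), ('B', 3, 1)])

# min() over a descending timestamp order sees the current row and every
# later row in the same category; min_cond = 1 means no 0 occurs afterwards.
rows = list(con.execute("""
    select category, timestamp, condition
    from (select t.*,
                 min(condition) over (partition by category
                                      order by timestamp desc) as min_cond
          from mytable t
         ) t
    where min_cond = 1
    order by category, timestamp
"""))
print(rows)
```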

How to pass the value of previous row to current row?

How can I pass the result of the previous row into the computation of the current row?
Given the unit and the cost, I need to get the average cost of each transaction:
The formula:
Average cost is the sum of transaction cost
If Type is Sub then Trx cost is equal to cost
If Type is Red then Trx cost is Unit * (sum of previous trx cost/sum of previous units)
+-----+------+------+-------+----------+----------+
| Row | Type | Unit | Cost  | TrxCost  | Ave_cost |
+-----+------+------+-------+----------+----------+
| 1   | Sub  |  0.2 |  1000 |  1000    | 1000     |
| 2   | Sub  |  0.3 |  2500 |  2500    | 3500     |
| 3   | Sub  |  0.1 |   600 |   600    | 4100     |
| 4   | Red  | -0.2 | -1100 | -1366.67 | 2733.33  |
| 5   | Sub  |  0.3 |  1000 |  1000    | 3733.33  |
| 6   | Red  | -0.6 |  -600 | -3200    |  533.33  |
+-----+------+------+-------+----------+----------+
Update:
Order is based on row number.
Thanks.
You may use a recursive CTE:
WITH cte (row_num, type, unit, sum_of_unit, cost, trxcost, ave_cost) AS (
    SELECT row_num,
           type,
           unit,
           unit AS sum_of_unit,
           cost,
           cost AS trxcost,
           cost AS ave_cost
    FROM t
    WHERE row_num IN (SELECT MIN(row_num) FROM t)
    UNION ALL
    SELECT t.row_num,
           t.type,
           t.unit,
           c.sum_of_unit + t.unit AS sum_of_unit,
           t.cost,
           CASE t.type
               WHEN 'Sub' THEN t.cost
               WHEN 'Red' THEN t.unit * ( c.ave_cost / c.sum_of_unit )
           END AS trxcost,
           c.ave_cost + CASE t.type
               WHEN 'Sub' THEN t.cost
               WHEN 'Red' THEN t.unit * ( c.ave_cost / c.sum_of_unit )
           END AS ave_cost
    FROM t
    JOIN cte c ON t.row_num = c.row_num + 1
)
SELECT * FROM cte
Dbfiddle Demo
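For reference, the same recursive CTE runs on SQLite (spelled WITH RECURSIVE) with only cosmetic changes, and reproduces the expected TrxCost and Ave_cost figures; table and column names are those of the answer:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t (row_num integer, type text, unit real, cost real)")
con.executemany("insert into t values (?, ?, ?, ?)",
                [(1, 'Sub', 0.2, 1000), (2, 'Sub', 0.3, 2500), (3, 'Sub', 0.1, 600),
                 (4, 'Red', -0.2, -1100), (5, 'Sub', 0.3, 1000), (6, 'Red', -0.6, -600)])

# Each step carries the running unit sum and running ave_cost forward,
# so a 'Red' row can price itself off the state accumulated so far.
rows = list(con.execute("""
with recursive cte (row_num, type, unit, sum_of_unit, cost, trxcost, ave_cost) as (
    select row_num, type, unit, unit, cost, cost, cost
    from t
    where row_num = (select min(row_num) from t)
    union all
    select t.row_num, t.type, t.unit,
           c.sum_of_unit + t.unit,
           t.cost,
           case t.type
               when 'Sub' then t.cost
               when 'Red' then t.unit * (c.ave_cost / c.sum_of_unit)
           end,
           c.ave_cost + case t.type
               when 'Sub' then t.cost
               when 'Red' then t.unit * (c.ave_cost / c.sum_of_unit)
           end
    from t
    join cte c on t.row_num = c.row_num + 1
)
select row_num, round(trxcost, 2), round(ave_cost, 2) from cte
"""))
print(rows)
```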
You can do this in two passes: one to get the TrxCost, then one to get the Ave_cost.
What you are calling "average" is a running total, by the way; you are merely adding up values.
You need window functions with ROWS BETWEEN clauses. (In the case of SUM(...) OVER (ORDER BY ...) the frame is implicitly BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, however.)
select
id, type, unit, cost, round(trxcost, 2) as trxcost,
round(sum(trxcost) over (order by id), 2) as ave_cost
from
(
select
id, type, unit, cost,
case
when type = 'Sub' then cost
else
unit *
sum(cost) over (order by id rows between unbounded preceding and 1 preceding) /
sum(unit) over (order by id rows between unbounded preceding and 1 preceding)
end as trxcost
from mytable
)
order by id;
I renamed your row column id, because ROW is a reserved word.
The last row's results differ from yours. I used your formula, but get different figures.
Rextester demo: https://rextester.com/ASXFY4323
See SQL window functions, which allow you to access values from other rows in the result set. In your case, you will need to tell us some more criteria, such as when to stop looking:
select
lag(unit,1) over (partition by type order by whatever)
* lag(cost,1) over (partition by type order by whatever)
from Trx
But I'm still missing how you want to correlate the transactions and the reductions to each other. There must be some column you're not telling us about. If that column (PartNumber?) is known, you can simply group by and sum by that.

SQL Aggregate Sum to Only Net Out Negative Rows

I'm trying to roll up product values based on dates. The example below starts out with 20,000, adds 5,000, and then subtracts 7,000. The result should be eating through the entire 5,000 and then into the prior positive row. This would remove the 5,000 row.
I think this would be as simple as doing a sum window function ordered by date descending. However, as you can see below, I want to stop summing at any row that remains positive and then move to the next.
I cannot figure out the logic in SQL to make this work. In my head, it should be:
SUM(Value) OVER (PARTITION BY Product, (positive valued rows) ORDER BY Date DESC)
But there could be multiple positive valued rows in a row where a negative valued row could eat through all of them, or there could be multiple negative values in a row.
This post seemed promising, but I don't think its logic would hold up if a negative value were larger than the positive value.
HAVE:
+------------+----------------+-------+
| Date | Product | Value |
+------------+----------------+-------+
| 01/13/2015 | Prod1 | 20000 |
| 08/13/2015 | Prod1Addition1 | 5000 |
| 12/13/2015 | Prod1Removal | -7000 |
| 02/13/2016 | Prod1Addition2 | 2000 |
| 03/13/2016 | Prod1Addition3 | 1000 |
| 04/13/2016 | Prod1Removal | -1500 |
+------------+----------------+-------+
WANT:
+------------+----------------+-------+
| Date | Product | Value |
+------------+----------------+-------+
| 01/13/2015 | Prod1 | 18000 |
| 02/13/2016 | Prod1Addition2 | 1500 |
+------------+----------------+-------+
I can only think of a recursive CTE solution:
; with
cte as
(
    select Date, Product, Value, rn = row_number() over (order by Date)
    from yourtable
),
rcte as
(
    select Date, Product, Value, rn, grp = 1
    from cte
    where rn = 1
    union all
    select Date    = case when r.Value < 0 then c.Date else r.Date end,
           Product = case when r.Value < 0 then c.Product else r.Product end,
           c.Value,
           c.rn,
           grp     = case when r.Value < 0 then r.grp + 1 else r.grp end
    from rcte r
    inner join cte c on r.rn = c.rn - 1
)
select Date, Product, Value = sum(Value)
from rcte
group by Date, Product, grp
order by Date
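Translated to SQLite (WITH RECURSIVE, standard `as` aliases, and ISO date strings so that ordering by Date works), the recursive approach reproduces the wanted two rows; this is a sketch of mine, not the answerer's exact dialect:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('create table yourtable ("Date" text, Product text, Value integer)')
con.executemany("insert into yourtable values (?, ?, ?)",
                [('2015-01-13', 'Prod1', 20000),
                 ('2015-08-13', 'Prod1Addition1', 5000),
                 ('2015-12-13', 'Prod1Removal', -7000),
                 ('2016-02-13', 'Prod1Addition2', 2000),
                 ('2016-03-13', 'Prod1Addition3', 1000),
                 ('2016-04-13', 'Prod1Removal', -1500)])

rows = list(con.execute("""
with recursive cte as (
    select Date, Product, Value, row_number() over (order by Date) as rn
    from yourtable
),
rcte as (
    select Date, Product, Value, rn, 1 as grp
    from cte
    where rn = 1
    union all
    -- a negative running value starts a new group anchored at the next row
    select case when r.Value < 0 then c.Date else r.Date end,
           case when r.Value < 0 then c.Product else r.Product end,
           c.Value, c.rn,
           case when r.Value < 0 then r.grp + 1 else r.grp end
    from rcte r
    join cte c on c.rn = r.rn + 1
)
select Date, Product, sum(Value) as Value
from rcte
group by Date, Product, grp
order by Date
"""))
print(rows)
```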
I think that you want this:
select Date,
Product,
Sum(Value) As Value
From TABLE_NAME
Group By Date, Product
Order by Date, Product;
Is that correct?