How can I pass the result of the previous row to the computation of the current row?
Given the unit and the cost, I need to get the average cost of each transaction:
The formula:
Average cost is the running sum of the transaction cost (TrxCost).
If Type is Sub, then TrxCost equals Cost.
If Type is Red, then TrxCost is Unit * (sum of previous TrxCost / sum of previous Units).
| Row | Type |  Unit |  Cost | TrxCost  | Ave_cost |
|-----|------|-------|-------|----------|----------|
| 1   | Sub  |   0.2 |  1000 |  1000    |  1000    |
| 2   | Sub  |   0.3 |  2500 |  2500    |  3500    |
| 3   | Sub  |   0.1 |   600 |   600    |  4100    |
| 4   | Red  |  -0.2 | -1100 | -1366.67 |  2733.33 |
| 5   | Sub  |   0.3 |  1000 |  1000    |  3733.33 |
| 6   | Red  |  -0.6 |  -600 | -3200    |   533.33 |
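For example, in row 4 the previous rows add up to 4100 of cost over 0.6 units, so TrxCost = -0.2 * (4100 / 0.6) = -1366.67 and Ave_cost = 4100 - 1366.67 = 2733.33.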
Update:
Order is based on row number.
Thanks.
You may use a recursive CTE:
WITH cte (row_num,
type,
unit,
sum_of_unit,
cost,
trxcost,
ave_cost
) AS (
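-- anchor member: the row with the lowest row_num seeds the running unit, cost and average totals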
SELECT row_num,
type,
unit,
unit AS sum_of_unit,
cost,
cost AS trxcost,
cost AS ave_cost
FROM t
WHERE row_num IN (
SELECT MIN(row_num)
FROM t
)
UNION ALL
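-- recursive member: join the next row (row_num + 1) and carry the running totals forward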
SELECT t.row_num,
t.type,
t.unit,
c.sum_of_unit + t.unit AS sum_of_unit,
t.cost,
CASE t.type
WHEN 'Sub' THEN t.cost
WHEN 'Red' THEN t.unit * ( c.ave_cost / c.sum_of_unit )
END
AS trxcost,
c.ave_cost + CASE t.type
WHEN 'Sub' THEN t.cost
WHEN 'Red' THEN t.unit * ( c.ave_cost / c.sum_of_unit )
END AS ave_cost
FROM t
JOIN cte c ON t.row_num = c.row_num + 1
)
SELECT * FROM cte
Dbfiddle Demo
You can do this in two passes: one to get the TrxCost, then one to get the Ave_cost.
What you are calling "average" is a running total by the way; you are merely adding up values.
You need window functions with ROWS BETWEEN clauses. (In the case of SUM(...) OVER (ORDER BY ...), the frame is implicitly RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, however.)
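A minimal illustration of that default frame (assuming a table t with a unique id column and a numeric val column; both expressions return the same running total):
select id,
       sum(val) over (order by id) as implicit_frame,
       sum(val) over (order by id rows between unbounded preceding and current row) as explicit_frame
from t;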
select
id, type, unit, cost, round(trxcost, 2) as trxcost,
round(sum(trxcost) over (order by id), 2) as ave_cost
from
(
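-- first pass: TrxCost is the cost itself for Sub rows, or unit times the average cost per unit of all preceding rows for Red rows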
select
id, type, unit, cost,
case
when type = 'Sub' then cost
else
unit *
sum(cost) over (order by id rows between unbounded preceding and 1 preceding) /
sum(unit) over (order by id rows between unbounded preceding and 1 preceding)
end as trxcost
from mytable
)
order by id;
I renamed your row column id, because ROW is a reserved word.
The last row's results differ from yours. I used your formula, but get different figures.
Rextester demo: https://rextester.com/ASXFY4323
See the SQL window functions, which allow you to access values from other rows in the result set. In your case, you will need to tell us some more criteria for when to stop looking, etc.:
select
lag(unit,1) over (partition by type order by whatever)
* lag(cost,1) over (partition by type order by whatever)
from Trx
But I'm still missing how you want to correlate the transactions and the reductions to each other. There must be some column you're not telling us about. If that column (PartNumber?) is known, you can simply group by and sum by that.
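If such a column exists, the grouping is straightforward; a minimal sketch (PartNumber is the hypothetical linking column mentioned above):
select PartNumber,
       sum(unit) as total_unit,
       sum(cost) as total_cost
from Trx
group by PartNumber;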
Related
I have a column in table LIKE below
| Column A | Column B     |
|----------|--------------|
| Active   | 202211210423 |
| XYZ      | 202211210424 |
| XYZ      | 202211210424 |
...
| PQR      | 202211210426 |
| Active   | 202211210523 |
| abc      | 202211210525 |
Table_Input
How do I count the distinct records from Column A that fall between "Active" rows?
The output can look like the following, where Column C is the distinct count between "Active" rows.
| Column A | Column B     | Column C |
|----------|--------------|----------|
| Active   | 202211210423 | x        |
| XYZ      | 202211210424 | 24       |
| XYZ      | 202211210424 | 24       |
...
| PQR      | 202211210426 | 24       |
| Active   | 202211210523 | 24       |
| abc      | 202211210525 | y        |
Expected_output
Can we use Analytical functions to do that?
I tried using the FIRST_VALUE function, but it did not work because all rows end up pointing to the first appearance of Active.
Not sure I clearly understood your requirements, but you might consider the approach below.
SELECT * EXCEPT(agg, part),
(SELECT COUNT(DISTINCT x.timestamp) FROM t.agg x WHERE x.colB <> 'Active') Count
FROM (
SELECT *, ARRAY_AGG(STRUCT(timestamp, colB)) OVER (PARTITION BY colA, part) agg FROM (
SELECT *, COUNT(1) OVER w0 - COUNTIF(colB <> 'Active') OVER w1 AS part
FROM sample_data
WINDOW w0 AS (PARTITION BY colA ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
w1 AS (PARTITION BY colA ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
) t
ORDER BY timestamp;
First we generate some sample data in table tbl.
For distinct counts, a group identifier is helpful; it is created as row_group. Then each partition is viewed individually.
with tbl as (select *, 1000+cast(rand()*5 as int64)+10*row_number() over () as colB from unnest(["Active","Y","X","Active","Y","Z","X","Active"])colA ),
helper as (select *, countif(colA="Active") over (order by colB rows between unbounded preceding and 1 preceding) as row_group from tbl)
select *,
-1+count(distinct colA) over (partition by row_group)
from helper
For counting the total number of rows, the following way is easier:
We create a helper table with a row number ordered by colB. The minimum of this row number over the following rows and the maximum over the preceding rows locate the next and the previous "Active" row; subtracting the two gives the total amount of rows between these keywords.
with tbl as (select *, 1000+cast(rand()*5 as int64)+10*row_number() over () as colB from unnest(["Active","Y","X","Active","Y","Z","X","Active"])colA ),
helper as (select *, row_number() over (order by colB) as row_id from tbl)
select *,
min(if(colA="Active",row_id,null)) over (order by colB rows between current row and unbounded following)-
ifnull(max(if(colA="Active",row_id,null)) over (order by colB rows between unbounded preceding and 1 preceding),0)
from helper
I have a table with user retail transactions. It includes sales and cancellations: if Qty is positive it is a sale, if negative it is a cancellation. I want to attach each cancellation to the most appropriate sale. So, I have a table like this:
| CustomerId | StockId | Qty | Date |
|--------------+-----------+-------+------------|
| 1 | 100 | 50 | 2020-01-01 |
| 1 | 100 | -10 | 2020-01-10 |
| 1 | 100 | 60 | 2020-02-10 |
| 1 | 100 | -20 | 2020-02-10 |
| 1 | 100 | 200 | 2020-03-01 |
| 1 | 100 | 10 | 2020-03-05 |
| 1 | 100 | -90 | 2020-03-10 |
User with ID 1 has the following actions: buy 50 -> return 10 -> buy 60 -> return 20 -> buy 200 -> buy 10 -> return 90. For each cancel row (with negative Qty) I want to find the closest previous row (by Date) with a positive Qty greater than the cancelled Qty.
So I need a BigQuery query that produces a table like this:
| CustomerId | StockId | Qty | Date | CancelQty |
|--------------+-----------+-------+------------+-------------|
| 1 | 100 | 50 | 2020-01-01 | -10 |
| 1 | 100 | 60 | 2020-02-10 | -20 |
| 1 | 100 | 200 | 2020-03-01 | -90 |
| 1 | 100 | 10 | 2020-03-05 | 0 |
Can anybody help me with this? I have created one candidate query (split cancels and sales, join them, and do some cleanup to remove rows), but it works incorrectly in the above case.
I use BigQuery, so any BQ SQL features could be applied.
Any ideas will be helpful.
You can use the following query.
;WITH result AS (
select t1.*,t2.Qty as cQty,t2.Date as Date_t2 from
(select *,ROW_NUMBER() OVER (ORDER BY qty DESC) AS [ROW NUMBER] from Test) t1
join
(select *,ROW_NUMBER() OVER (ORDER BY qty) AS [ROW NUMBER] from Test) t2
on t1.[ROW NUMBER] = t2.[ROW NUMBER]
)
select CustomerId,StockId,Qty,Date,ISNULL(cQty, 0) As CancelQty,Date_t2
from (select CustomerId,StockId,Qty,Date,case
when cQty < 0 then cQty
else NULL
end AS cQty,
case
when cQty < 0 then Date_t2
else NULL
end AS Date_t2 from result) t
where qty > 0
order by cQty desc
result: https://dbfiddle.uk
You can do this as a gaps-and-islands problem. Basically, add a grouping column to the rows based on a cumulative reverse count of negative values. Then within each group, choose the first row where the sum is positive. So:
select t.* except (cancelqty, grp),
       (case when min(case when cancelqty + qty >= 0 then date end) over (partition by customerid, grp) = date
             then cancelqty
             else 0
        end) as cancelqty
from (select t.*,
             min(qty) over (partition by customerid, grp) as cancelqty
      from (select t.*,
                   countif(qty < 0) over (partition by customerid order by date desc) as grp
            from transactions t
           ) t
     ) t;
Note: This works for the data you have provided. However, there may be complicated scenarios where this does not work. In fact, I don't think there is a simple optimal solution assuming that the returns are not connected to the original sales. I would suggest that you fix the data model so you record where the returns come from.
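For illustration only, a sketch of what that could look like (the OriginalSaleDate column is hypothetical, filled in at the time the return is recorded); with such a link the attribution becomes a plain join and aggregate:
select s.CustomerId, s.StockId, s.Qty, s.Date,
       ifnull(sum(r.Qty), 0) as CancelQty
from transactions s
left join transactions r
  on r.CustomerId = s.CustomerId
 and r.StockId = s.StockId
 and r.Qty < 0
 and r.OriginalSaleDate = s.Date  -- hypothetical column linking a return to its sale
where s.Qty > 0
group by s.CustomerId, s.StockId, s.Qty, s.Date;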
The query below seems to satisfy the conditions and the output mentioned. The solution is based on mapping the base table (t) and bringing the corresponding cancelled-qty row alongside from the same table (t1).
First, a self join on CustomerId and StockId is done, since the rows need to belong to the same customer and product.
Additionally, we bring in the cancelled transactions t1 that happened on or after the base row in table t (t.Dt<=t1.Dt), and the clause t1.Qty<0 ensures these are negative quantities.
Further, we cannot attribute a cancelled qty to a sale smaller than it, so I check that the positive qty is greater than or equal to the cancelled qty. Negating the cancel qty makes the comparison easy: -(t1.Qty)<=t.Qty.
After the join, we are interested only in the positive quantities, so the where clause t.Qty>0 filters out the cancel rows coming from the base table t.
Now each base row is joined to every later cancelled-qty row it could absorb. For example, the Qty 50 could have all the cancelled quantities mapped to it, but we are interested only in the first one that came after it. So we group the base rows and keep only the earliest cancel date via the HAVING clause condition HAVING IFNULL(t1.dt, '0')=MIN(IFNULL(t1.dt, '0')).
Finally we get the rows we need; the last column can be excluded if required using an outer select query.
SELECT t.CustomerId,t.StockId,t.Qty,t.Dt,IFNULL(t1.Qty, 0) CancelQty
,t1.dt dt_t1
FROM tbl t
LEFT JOIN tbl t1 ON t.CustomerId=t1.CustomerId AND
t.StockId=t1.StockId
AND t.Dt<=t1.Dt AND t1.Qty<0 AND -(t1.Qty)<=t.Qty
WHERE t.Qty>0
GROUP BY 1,2,3,4
HAVING IFNULL(t1.dt, '0')=MIN(IFNULL(t1.dt, '0'))
ORDER BY 1,2,4,3
fiddle
Consider the approach below.
with sales as (
select * from `project.dataset.table` where Qty > 0
), cancels as (
select * from `project.dataset.table` where Qty < 0
)
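-- for each sale, pick the earliest later cancel that the sale can fully absorb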
select any_value(s).*,
ifnull(array_agg(c.Qty ignore nulls order by c.Date limit 1)[safe_offset(0)], 0) as CancelQty
from sales s
left join cancels c
on s.CustomerId = c.CustomerId
and s.StockId = c.StockId
and s.Date <= c.Date
and s.Qty > abs(c.Qty)
group by format('%t', s)
If applied to the sample data in your question, the output matches the expected result above.
I have a table with two columns: id and score. I'd like to create a third column that equals the quantile that an individual's score falls in. I'd like to do this in BigQuery's standardSQL.
Here's my_table:
+----+--------+
| id | score |
+----+--------+
| 1 | 2 |
| 2 | 13 |
| 3 | -2 |
| 4 | 7 |
+----+--------+
and afterwards I'd like to have the following table (example shown with quartiles, but I'd be interested in quartiles/quintiles/deciles)
+----+--------+----------+
| id | score | quaRtile |
+----+--------+----------+
| 1 | 2 | 2 |
| 2 | 13 | 4 |
| 3 | -2 | 1 |
| 4 | 7 | 3 |
+----+--------+----------+
It would be excellent if this were to work on 100 million rows. I've looked around to see a couple solutions that seem to use legacy sql, and the solutions using RANK() functions don't seem to work for really large datasets. Thanks!
If I understand correctly, you can use ntile(). For instance, if you wanted a value from 1-4, you can do:
select t.*, ntile(4) over (order by score) as tile
from t;
If you want to enumerate the values, then use rank() or dense_rank():
select t.*, rank() over (order by score) as tile
from t;
I see, your problem is getting the code to work, because BigQuery tends to run out of resources without a partition by. One method is to break up the score into different groups. I think this logic does what you want:
select *,
       ( (count(*) over (partition by cast(score / 1000 as int64) order by cast(score / 1000 as int64)) -
          count(*) over (partition by cast(score / 1000 as int64))
         ) +
         rank() over (partition by cast(score / 1000 as int64) order by id)
       ) as therank
       -- rank() over (order by score) as therank
from t;
This breaks the score into 1000 groups (perhaps that is too many for an integer). And then reconstructs the ranking.
If your score has relatively low cardinality, then join with aggregation works:
select t.*, (running_cnt - cnt + 1) as therank
from t join
(select score, count(*) as cnt, sum(count(*)) over (order by score) as running_cnt
from t
group by score
) s
on t.score = s.score;
Once you have the rank() (or row_number()) you can easily calculate the tiles yourself (hint: division).
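A sketch of that division (assuming the table t from above; replace 4 with 5 for quintiles or 10 for deciles):
select t.*,
       cast(ceil(rank() over (order by score) * 4 / count(*) over ()) as int64) as quartile
from t;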
The output suggests rank() to me:
SELECT *, RANK() OVER (ORDER BY score) as quantile
FROM table t
ORDER BY id;
I'm trying to roll up product values based on dates. The example below starts out with 20,000, adds 5,000, and then subtracts 7,000. The subtraction should eat through the entire 5,000 and then into the prior positive row, which removes the 5,000 row.
I think this would be as simple as doing a sum window function ordered by date descending. However, as you can see below, I want to stop summing at any row that remains positive and then move to the next.
I cannot figure out the logic in SQL to make this work. In my head, it should be:
SUM(Value) OVER (PARTITION BY Product, (positive valued rows) ORDER BY Date DESC)
But there could be several consecutive positive-valued rows that a single negative-valued row eats through entirely, or there could be several negative values in a row.
This post seemed promising, but I don't think the logic would work if a negative value were larger than the positive value.
HAVE:
+------------+----------------+-------+
| Date | Product | Value |
+------------+----------------+-------+
| 01/13/2015 | Prod1 | 20000 |
| 08/13/2015 | Prod1Addition1 | 5000 |
| 12/13/2015 | Prod1Removal | -7000 |
| 02/13/2016 | Prod1Addition2 | 2000 |
| 03/13/2016 | Prod1Addition3 | 1000 |
| 04/13/2016 | Prod1Removal | -1500 |
+------------+----------------+-------+
WANT:
+------------+----------------+-------+
| Date | Product | Value |
+------------+----------------+-------+
| 01/13/2015 | Prod1 | 18000 |
| 02/13/2016 | Prod1Addition2 | 1500 |
+------------+----------------+-------+
I can only think of a recursive CTE solution:
; with
cte as
(
select Date, Product, Value, rn = row_number() over (order by Date)
from yourtable
),
rcte as
(
select Date, Product, Value, rn, grp = 1
from cte
where rn = 1
union all
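-- a negative Value on the previous row closes the current group; the next row starts a new group labelled with its own Date and Product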
select Date = case when r.Value < 0 then c.Date else r.Date end,
Product = case when r.Value < 0 then c.Product else r.Product end,
c.Value,
c.rn,
grp = case when r.Value < 0 then r.grp + 1 else r.grp end
from rcte r
inner join cte c on r.rn = c.rn - 1
)
select Date, Product, Value = sum(Value)
from rcte
group by Date, Product, grp
order by Date
I think that you want this:
select Date,
Product,
Sum(Value) As Value
From TABLE_NAME
Group By Date, Product
Order by Date, Product;
Is that correct?
The raw data table is
+--------+--------+
| id | value |
+--------+--------+
| 1 | 0.1 |
| 1 | 0.2 |
| 1 | 0.3 |
| 1 | 0.2 |
| 1 | 0.2 |
| 2 | 0.4 |
| 2 | 0.5 |
| 2 | 0.1 |
| 3 | 0.5 |
| 3 | 0.5 |
+--------+--------+
For each id, its values sum to 1. I want to select, for each id, the fewest top rows whose value sum is greater than or equal to 0.7, like:
+--------+--------+
| id | value |
+--------+--------+
| 1 | 0.3 |
| 1 | 0.2 |
| 1 | 0.2 |
| 2 | 0.5 |
| 2 | 0.4 |
| 3 | 0.5 |
| 3 | 0.5 |
+--------+--------+
How to solve this problem?
It's neither pretty nor efficient but it's the best I can come up with.
Disclaimer: I'm sure this will perform horribly on any real-world dataset.
with recursive calc (id, row_list, value_list, total_value) as (
select id, array[ctid], array[value]::numeric(6,2)[], value::numeric(6,2) as total_value
from data
union all
select c.id, p.row_list||c.ctid, (p.value_list||c.value)::numeric(6,2)[], (p.total_value + c.value)::numeric(6,2)
from data as c
join calc as p on p.id = c.id and c.ctid <> all(p.row_list)
)
select id, unnest(min(value_list)) as value
from (
select id,
value_list,
array_length(row_list,1) num_values,
min(array_length(row_list,1)) over (partition by id) as min_num_values
from calc
where total_value >= 0.7
) as result
where num_values = min_num_values
group by id
SQLFiddle example: http://sqlfiddle.com/#!15/8966b/1
How does this work?
The recursive CTE (the with recursive part) creates all possible combinations of values from the table. To make sure that the same value is not counted twice, I'm collecting the CTIDs (a Postgres-internal unique identifier for each row) of the rows already processed into an array. The recursive join condition (p.id = c.id and c.ctid <> all(p.row_list)) then makes sure only values for the same id are added, and only rows that have not yet been processed.
The result of the CTE is then reduced to all rows where the total sum (the column total_value) is >= 0.7.
The final outer select (the alias result) is then filtered down to those rows where the number of values making up the total sum is the smallest. The min(value_list) aggregation and unnest then transform the arrays back into a proper "table". The aggregation is necessary because the CTE collects all combinations, so that e.g. for id=2 the value_list array will contain both {0.40,0.50} and {0.50,0.40}. Without it, the unnest would return both combinations, making it a total of four rows for that id.
This also isn't that pretty, but I think it'd be more efficient (and more transferable between RDBMSs):
with unique_data as (
select id
, value
, row_number() over ( partition by id order by value desc ) as rn
from my_table
)
, cumulative_sum as (
select id
, value
, sum(value) over ( partition by id order by rn ) as csum
from unique_data
)
, first_over_the_mark as (
select id
, value
, csum
, lag(csum) over ( partition by id order by csum ) as prev_value
from cumulative_sum
)
select *
from first_over_the_mark
where coalesce(prev_value, 0) < 0.7
SQL Fiddle
I've done it with CTEs to make it easier to see what's happening but there's no need to use them.
It uses a cumulative sum. The first CTE makes the rows unique; without it, the two 0.2 rows would be treated as the same value and all rows with 0.2 would get summed together. The second works out the running sum. The third then works out the previous running sum. If the previous value is strictly less than 0.7, the row is picked up; the idea is that if the previous cumulative sum is less than 0.7, then the current row is needed to reach (or pass) that number.
It's worth noting that this will break down if you have any rows in your table where the value is 0.
This is a variant on Ben's method, but it is simpler to implement. You just need a cumulative sum, ordered by value in reverse, and then to take everything where the cumulative sum is less than 0.7, plus the first row that exceeds that value.
select t.*
from (select t.*,
sum(value) over (partition by id order by value desc) as csum
from t
) t
where csum - value < 0.7;
The expression csum - value is the cumulative sum minus the current value (you can also get this using something like rows between unbounded preceding and 1 preceding). Your condition is that this value is less than some threshold.
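A sketch of the same condition written with that explicit frame (the preceding-rows sum is NULL on the first row per id, hence the coalesce):
select t.*
from (select t.*,
             coalesce(sum(value) over (partition by id
                                       order by value desc
                                       rows between unbounded preceding and 1 preceding), 0) as prev_csum
      from t
     ) t
where prev_csum < 0.7;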
EDIT:
Ben's comment is right about duplicate values. His solution is fine. Here is another solution:
select t.*
from (select t.*,
sum(value) over (partition by id order by value desc, random()) as csum
from t
) t
where csum - value < 0.7;