Retrieve values from rows other than the previous in a recursive CTE - sql

I am running a recursive CTE in order to calculate the average weighted cost of a product for x given warehouses. In this table, we can see a very simplified version of what the original data looks like:
Simplified original data
The first two rows are the initial values for the warehouses. That is why they have "N/A" in the Movement column.
The AVG_Weighted_Price column is 0 for the remaining rows because that is the value I wish to calculate with the recursive CTE.
I have created a recursive CTE that is meant to calculate the AVG_Weighted_Price column using the following simplified (and frankly wrong) formula: (b.Movement * a.AVG_Weighted_Price) / b.Total_Quantity, where a is the previous row and b is the row being calculated.
In the table, it is clear this will not work because I have to retrieve the most recent value from the same Warehouse, which is not always the previous row. This could be solved simply by using the first two values as anchors and running the recursive cte for the A warehouse parent first, and later for the B warehouse parent.
However, because the AVG_Weighted_Price in one warehouse will affect the other, I have to run the recursion using the field "ID" as the order since it represents the order in which the movements (rows) happened. Nonetheless, the initial values (row 1 and 2) will pass with their original values and will not undergo any calculations (row 1 because it is the anchor and row 2 because it will be hardcoded to do so).
If I could run the recursion in the order of the warehouses and not necessarily in the order of the ID, the following query would be correct (#Sample_Table is the table shown in the picture above):
DROP TABLE IF EXISTS #RS
;WITH cte
AS
(
SELECT *
FROM #Sample_Table
WHERE Warehouse_Order = 1
UNION ALL
SELECT b.Warehouse
,b.Movement
,b.Total_Quantity
,CASE WHEN b.Warehouse_Order = 1 THEN b.AVG_Weighted_Price
ELSE (b.Movement * a.AVG_Weighted_Price) / b.Total_Quantity END AS AVG_Weighted_Price
,b.ID
,b.Warehouse_Order
FROM cte a
INNER JOIN #Sample_Table b
ON b.Warehouse = a.Warehouse AND b.Warehouse_Order = a.Warehouse_Order + 1
)
SELECT *
INTO #RS
FROM cte
This would be the result of this query:
Result from first query
This, however, is incorrect because, as I said before, the recursion must run in the same order as the ID.
For this reason, I tried to apply a LAG that retrieves the most recent value from the same warehouse. However, as far as I am aware, LAG doesn't work in recursive CTEs; it always returns NULL. Here is the code I tried to use (note the changes in the anchor's WHERE clause and in the JOIN conditions, as well as the LAG in the calculated field):
DROP TABLE IF EXISTS #RS
;WITH cte
AS
(
SELECT *
FROM #Sample_Table
WHERE ID = 1
UNION ALL
SELECT b.Warehouse
,b.Movement
,b.Total_Quantity
,CASE WHEN b.Warehouse_Order =1 THEN b.AVG_Weighted_Price
ELSE (b.Movement * LAG(b.AVG_Weighted_Price) OVER (PARTITION BY b.Warehouse ORDER BY b.ID)) / b.Total_Quantity END AS AVG_Weighted_Price
,b.ID
,b.Warehouse_Order
FROM cte a
INNER JOIN #Sample_Table b
ON b.ID = a.ID + 1
)
SELECT *
INTO #RS
FROM cte
The result of this query is as follows:
Result from second query
I understand why the LAG returns NULL values and why we cannot use it here, but I honestly can't seem to find another solution.
The original data has tens of centers and millions of rows, so a WHILE loop that treats these cases one by one would be too time-consuming (already tested).
If anyone could help me solve this issue, I would be forever thankful, as I have been banging my head against this problem for quite some time now. Thank you for your patience, and sorry if I was confusing at any point.
Edit: I created an Excel in order to better clarify the issue. I hope this helps:

This looks like a fairly simple problem. However, your question is difficult to understand.
Here is what I would want:
A SQL script to recreate a representative sample of your data, like this:
create table #test
(Warehouse varchar(20)
,Movement decimal (19,2)
, Total_Quantity decimal (19,2)
, Avg_Weighted_Price decimal (19,2)
, ID int
, Warehouse_Order int)
insert #test
values
( 'A', null, '100', '10', '1', '1')
,( 'B', null, '30', '5', '2', '1')
,( 'A', '10', '110', '0', '3', '2')
,( 'A', '-5', '105', '0', '4', '3')
,( 'B', '30', '60', '0', '5', '2')
,( 'B', '5', '65', '0', '6', '3')
,( 'B', '-25', '40', '0', '7', '4')
,( 'A', '10', '115', '0', '8', '4')
,( 'B', '10', '50', '0', '9', '5')
,( 'A', '10', '125', '0', '10', '5')
SELECT * FROM #test
A description of your data.
So far, as I understand it, the starting value (opening balance) of inventory in a warehouse is shown in the rows that have a null value for Movement.
Movement: positive values are additions/receipts of the item; negative values are reductions.
Total_Quantity shows the current position (opening balance + movement).
What you are trying to do with this data.
As I understand it, you want to update each row with a computed Avg_Weighted_Price.
How is your average weighted price determined?
I think I understand what you are trying to do with LAG, and I am fairly certain that your approach is wrong. (First clue: there is no cost associated with each receipt.)
Try your formula in a simpler way: use Excel or pen and paper and manually calculate the Avg_Weighted_Price. That might clarify things a bit.
Try explaining the purpose of this exercise. Why do you need this Avg_Weighted_Price on every row?
When the problem is well defined, I expect the solution will be fairly simple.
Edit 1, responding to the Excel sample:
Lag will work only once for you, as you can see here:
SELECT *
, lag(Avg_Weighted_Price, 1, 0) over (partition by Warehouse order by Warehouse_Order, id) as Lagprice
, case when Movement is null then Avg_Weighted_Price
when Total_Quantity <> 0
then Movement * lag(Avg_Weighted_Price, 1, 0) over (partition by Warehouse order by Warehouse_Order, id)/Total_Quantity
else 0 end as ComputedAvgPrice
From #test order by ID
Notice that the Avg_Weighted_Price falls from 10 to 0.909.
Is that the result you want?
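To make the ordering issue from the question concrete, here is a minimal Python sketch (not T-SQL) that walks the sample rows in ID order while remembering the most recent price per warehouse. It applies the question's own simplified formula, which the asker already describes as wrong, so the numbers only illustrate the lookup, not a correct weighted cost. The function name running_price and the tuple layout are assumptions made for this example.

```python
# Rows from the sample script: (warehouse, movement, total_quantity, price, id)
ROWS = [
    ("A", None, 100, 10.0, 1),
    ("B", None, 30, 5.0, 2),
    ("A", 10, 110, 0.0, 3),
    ("A", -5, 105, 0.0, 4),
    ("B", 30, 60, 0.0, 5),
    ("B", 5, 65, 0.0, 6),
    ("B", -25, 40, 0.0, 7),
    ("A", 10, 115, 0.0, 8),
    ("B", 10, 50, 0.0, 9),
    ("A", 10, 125, 0.0, 10),
]


def running_price(rows):
    """Process rows in ID order, carrying the latest price per warehouse."""
    last_price = {}  # warehouse -> most recently computed price
    result = {}      # id -> price
    for warehouse, movement, total_qty, price, row_id in sorted(rows, key=lambda r: r[4]):
        if movement is not None:
            # Question's simplified formula, using the previous row of the
            # same warehouse rather than the previous row overall.
            # Assumes total_qty is never 0, as in the sample data.
            price = movement * last_price[warehouse] / total_qty
        last_price[warehouse] = price
        result[row_id] = price
    return result


prices = running_price(ROWS)
```

Row 3 comes out at about 0.909, the same drop from 10 that the LAG query produces. Unlike LAG, which reads the stored 0 for every later row, this loop keeps feeding the recomputed value forward, which is the behaviour the recursion would need.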

How to specify a linear programming-like constraint (i.e. max number of rows for a dimension's attributes) in SQL server?

I'm looking to assign unique person IDs to marketing programs, but I need to optimize based on each person's probability score (some people can be sent to multiple programs, some to only one), subject to constraints such as the budgeted mail quantity for each program.
I'm using SQL Server and am able to put IDs into their highest-scoring program using row_number() over(partition by PersonID order by ProbabilityScore desc). What I need is to return a table where each ID is assigned to a program, and I'm not sure how to add the max mail quantity constraint specific to each individual program. I've looked into the CHECK() constraint functionality, but I'm not sure if that's applicable.
create table test_marketing_table(
PersonID int,
MarketingProgram varchar(255),
ProbabilityScore real
);
insert into test_marketing_table (PersonID, MarketingProgram, ProbabilityScore)
values (1, 'A', 0.07)
,(1, 'B', 0.06)
,(1, 'C', 0.02)
,(2, 'A', 0.02)
,(3, 'B', 0.08)
,(3, 'C', 0.13)
,(4, 'C', 0.02)
,(5, 'A', 0.04)
,(6, 'B', 0.045)
,(6, 'C', 0.09);
--this section assigns everyone to their highest scoring program,
--but this isn't necessarily what I need
with x
as
(
select *, row_number()over(partition by PersonID order by ProbabilityScore desc) as PersonScoreRank
from test_marketing_table
)
select *
from x
where PersonScoreRank = 1;
I also need to specify some constraints: at most two C packages, one A package, and one B package can be sent. How can I reassign the IDs to a program while also using the highest probability score left available?
The final result should look like:
PersonID  MarketingProgram  ProbabilityScore  PersonScoreRank
3         C                 0.13              1
6         C                 0.09              1
1         A                 0.07              1
6         B                 0.045             2
You need to rethink your ROW_NUMBER() formula based on your actual need, and you should also have a table of Marketing Programs to make this work efficiently. This covers the basic ideas you need to incorporate to efficiently perform the filtering you need.
MarketingPrograms Table
CREATE TABLE MarketingPrograms (
ProgramID varchar(10),
PeopleDesired int
)
Populate the MarketingPrograms Table
INSERT INTO MarketingPrograms (ProgramID, PeopleDesired) Values
('A', 1),
('B', 1),
('C', 2)
Use the MarketingPrograms Table
with x as (
select *,
row_number() over (partition by MarketingProgram order by ProbabilityScore desc) as ProgramScoreRank
from test_marketing_table
)
select *
from x
INNER JOIN MarketingPrograms m
ON x.MarketingProgram = m.ProgramID
WHERE x.ProgramScoreRank <= m.PeopleDesired
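As a quick check of what this per-program ranking returns on the question's sample data, here is a small sketch that runs the same idea through Python's built-in sqlite3 module instead of SQL Server. It assumes an SQLite build with window-function support (3.25 or later), and it partitions by MarketingProgram because test_marketing_table has no ProgramId column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_marketing_table (PersonID INT, MarketingProgram TEXT, ProbabilityScore REAL);
    INSERT INTO test_marketing_table VALUES
        (1,'A',0.07),(1,'B',0.06),(1,'C',0.02),(2,'A',0.02),(3,'B',0.08),
        (3,'C',0.13),(4,'C',0.02),(5,'A',0.04),(6,'B',0.045),(6,'C',0.09);
    CREATE TABLE MarketingPrograms (ProgramID TEXT, PeopleDesired INT);
    INSERT INTO MarketingPrograms VALUES ('A',1),('B',1),('C',2);
""")

# Rank people within each program, then keep only as many as the program wants.
rows = conn.execute("""
    WITH x AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY MarketingProgram
                                  ORDER BY ProbabilityScore DESC) AS ProgramScoreRank
        FROM test_marketing_table
    )
    SELECT x.MarketingProgram, x.PersonID, x.ProgramScoreRank
    FROM x
    JOIN MarketingPrograms m ON x.MarketingProgram = m.ProgramID
    WHERE x.ProgramScoreRank <= m.PeopleDesired
    ORDER BY x.MarketingProgram, x.ProgramScoreRank
""").fetchall()

for row in rows:
    print(row)
```

With the sample data this selects person 1 for A, person 3 for B, and persons 3 and 6 for C. Each program is capped independently, so the same person can still be picked by more than one program; that differs from the expected output in the question, which would need an additional de-duplication step.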

avoiding group by for column used in datediff?

As the database is currently constructed, I can only use the date field of a certain table inside a DATEDIFF function in a query that also contains a COUNT aggregation (the count is not on the date field itself, but on the entity where that date field is not null). The GROUP BY at the end messes up the counting, since that one entry is counted on its own, as its own group.
In some detail:
Our lead recruiter wants a report that shows the number of applications and conducted interviews per opening. So far, no problem. Additionally, he would like to see the total duration per opening, from making it public to signing a new employee, and of course only if the opening has already been filled.
I have 4 tables to join:
table 1 holds the data of the opening
table 2 has the single applications
table 3 has the interview data of the applications
table 4 has the data regarding the publication of the openings (with the date when a certain opening was made public)
The problem is the duration requirement. Table 4 holds the starting point, and in table 2 one (or no) applicant per opening has a date field filled with the time he returned a signed contract, at which point the opening counts as filled. When I use that field in a DATEDIFF, I'm forced to also put that column in the GROUP BY clause, and that results in 2 rows per opening: one row has all the numbers as wanted, and the second row always contains just the one person who has an entry in that date field...
So far I haven't come up with a way of avoiding that problem, except for explaining to the colleague that he gets his time-to-fill number in another report.
SELECT
table1.col1 as NameOfProject,
table1.col2 as Company,
table1.col3 as OpeningType,
table1.col4 as ReasonForOpening,
count (table2.col2) as NumberOfApplications,
sum (case when table2.colSTATUS = 'withdrawn' then 1 else 0 end) as NumberOfApplicantsWhoWithdraw,
sum (case when table3.colTypeInterview = 'PhoneInterview' then 1 else 0 end) as NumberOfPhoneInterview,
...more sum columns...,
table1.finished, -- shows "1" if opening is occupied
DATEDIFF(day, table4.colValidFrom, **table2.colContractReceived**) as DaysToCompletion
FROM
table2 left join table3 on table2.REF_NR = table3.REF_NR
join table1 on table2.PROJEKT = table1.KBEZ
left join table4 on table1.REFNR = table4.PRJ_REFNR
GROUP BY
**table2.colContractReceived**
and all other columns except the ones in aggregate (sum and count) functions go in the GROUP BY section
ORDER BY table1.NameOfProject
Here is a short rebuild of what it looks like. First a row where the opening is not filled and all aggregations come out in one row as wanted. The next project/opening shows up double, because the field used in the datediff is grouped independently...
project   company  no_of_applications  no_of_phoneinterview  no_of_personalinterview  ...  time_to_fill_in_days  filled?
2018_312  comp a   27                  4                     2                        ...  null                  0
2018_313  comp b   54                  7                     4                        ...  null                  0
2018_313  comp b   1                   1                     1                        ...  42                    1
I'd be glad to get any idea how to solve this. Thanks for considering my request!
(During the 'translation' of all the specific column and table names I might have built in a syntax error here and there, but the query worked well except for that unwanted extra aggregation per filled opening.)
If I've understood your requirement properly, the issue is that you need to show the number of days between the starting point and the time at which an applicant returned a signed contract, but with only a single row per opening regardless of whether the position was filled (if it was filled, show the filled row; if not, show the unfilled row).
I've achieved this result by assuming that you count a position as filled using the "ContractsReceived" column. This may be wrong; however, the principle should still provide what you are looking for.
I've essentially wrapped your query in a subquery, ranked the rows by the ContractsReceived column in descending order, partitioned by project. Then, in the outer query, I filter for the first row of this ranking.
Even if my assumption about the column structure and data types is wrong, this should provide you with a model to work with.
The only issue you might have with this ranking solution is if you want to aggregate over both rows within one (so include all of the summed columns for both the position filled and position not filled row per project). If this is the case let me know and we can work around that.
Please let me know if you have any questions.
create table #table1 (
REFNR int,
NameOfProject nvarchar(20),
Company nvarchar(20),
OpeningType nvarchar(20),
ReasonForOpening nvarchar(20),
KBEZ int
);
create table #table2 (
NumberOfApplications int,
Status nvarchar(15),
REF_NR int,
ReturnedApplicationDate datetime,
ContractsReceived bit,
PROJEKT int
);
create table #table3 (
TypeInterview nvarchar(25),
REF_NR int
);
create table #table4 (
PRJ_REFNR int,
StartingPoint datetime
);
insert into #table1 (REFNR, NameOfProject, Company, OpeningType, ReasonForOpening, KBEZ)
values (1, '2018_312', 'comp a' ,'Permanent', 'Business growth', 1),
(2, '2018_313', 'comp a', 'Permanent', 'Business growth', 2),
(3, '2018_313', 'comp a', 'Permanent', 'Business growth', 3);
insert into #table2 (NumberOfApplications, Status, REF_NR, ReturnedApplicationDate, ContractsReceived, PROJEKT)
values (27, 'Processed', 4, '2018-04-01 08:00', 0, 1),
(54, 'Withdrawn', 5, '2018-04-02 10:12', 0, 2),
(1, 'Processed', 6, '2018-04-15 15:00', 1, 3);
insert into #table3 (TypeInterview, REF_NR)
values ('Phone', 4),
('Phone', 5),
('Personal', 6);
insert into #table4 (PRJ_REFNR, StartingPoint)
values (1, '2018-02-25 08:00'),
(2, '2018-03-04 15:00'),
(3, '2018-03-04 15:00');
select * from
(
SELECT
RANK()OVER(Partition by NameOfProject, Company order by ContractsReceived desc) as rowno,
table1. NameOfProject,
table1.Company,
table1.OpeningType,
table1.ReasonForOpening,
case when ContractsReceived >0 then datediff(DAY, StartingPoint, ReturnedApplicationDate) else null end as TimeToFillInDays,
ContractsReceived Filled
FROM
#table2 table2 left join #table3 table3 on table2.REF_NR = table3.REF_NR
join #table1 table1 on table2.PROJEKT = table1.KBEZ
left join #table4 table4 on table1.REFNR = table4.PRJ_REFNR
group by NameOfProject, Company, OpeningType, ReasonForOpening, ContractsReceived,
StartingPoint, ReturnedApplicationDate
) x where rowno=1
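A different technique from the ranking above, offered only as a sketch: put the date difference inside an aggregate, so the contract date no longer has to appear in GROUP BY and each opening stays on one row. The example below uses Python's built-in sqlite3 module with hypothetical, heavily simplified tables (openings, applications, publications) standing in for table1, table2 and table4; julianday() plays the role of DATEDIFF here. In T-SQL the analogous expression would be MAX(DATEDIFF(day, ...)) over the same two columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified stand-ins for table1, table2 and table4.
    CREATE TABLE openings     (id INT, name TEXT);
    CREATE TABLE applications (opening_id INT, contract_received TEXT);
    CREATE TABLE publications (opening_id INT, valid_from TEXT);

    INSERT INTO openings VALUES (1, '2018_312'), (2, '2018_313');
    INSERT INTO publications VALUES (1, '2018-02-25'), (2, '2018-03-04');
    INSERT INTO applications VALUES
        (1, NULL), (1, NULL),
        (2, NULL), (2, NULL), (2, '2018-04-15');
""")

rows = conn.execute("""
    SELECT o.name,
           COUNT(*) AS number_of_applications,
           -- MAX ignores NULLs, so only the signed contract contributes.
           CAST(MAX(julianday(a.contract_received) - julianday(p.valid_from)) AS INTEGER)
               AS days_to_completion
    FROM applications a
    JOIN openings o     ON o.id = a.opening_id
    JOIN publications p ON p.opening_id = o.id
    GROUP BY o.name
    ORDER BY o.name
""").fetchall()

for row in rows:
    print(row)
```

The unfilled opening gets NULL (None in Python) for the duration, and the filled one gets 42 days while still counting all three of its applications in a single row.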

How do I exclude entries from a recursive CTE?

How can I exclude entries from a recursive CTE with SQLite?
CREATE TABLE GroupMembers (
group_id VARCHAR,
member_id VARCHAR
);
INSERT INTO GroupMembers(group_id, member_id) VALUES
('1', '10'),
('1', '20'),
('1', '30'),
('1', '-50'),
('2', '30'),
('2', '40'),
('3', '1'),
('3', '50'),
('4', '-10'),
('10', '50'),
('10', '60');
I want a query that will give me the list of members (recursively) in the group. However, a member with the first character being '-' means that the id that comes after the minus is NOT in the group.
For example, the members of '1' are '10', '20', '30', and '-50'. '10', however, is a group so we need to add its children '50' and '60'. However, '-50' is already a member so we cannot include '50'. In conclusion the members of '1' are '10', '20', '30', '-50', and '60'.
It seems like this query should work:
WITH RECURSIVE members(id) AS (
VALUES('1')
UNION
SELECT gm.member_id
FROM members m
INNER JOIN GroupMembers gm ON gm.group_id=m.id
LEFT OUTER JOIN members e ON '-' || gm.member_id=e.id
WHERE e.id IS NULL
)
SELECT id FROM members;
But I get the error: multiple references to recursive table: members
How can I fix/rewrite this to do what I want?
Note: it doesn't matter whether the '-50' entry is returned in the result set.
I don't have a SQLite available for testing, but assuming the -50 also means that 50 should be excluded as well, I think you are looking for this:
WITH RECURSIVE members(id) AS (
VALUES('1')
UNION
SELECT gm.member_id
FROM GroupMembers gm
JOIN members m ON gm.group_id=m.id
WHERE member_id not like '-%'
AND not exists (select 1
from groupMembers g2
where g2.member_id = '-'||gm.member_id)
)
SELECT id
FROM members;
(The above works in Postgres)
You usually select from the base table in the recursive part and then join back to the actual CTE. The filtering of unwanted rows is then done with a regular WHERE clause, not by joining the CTE again. A recursive CTE terminates when the JOIN finds no more rows.
SQLFiddle (Postgres): http://sqlfiddle.com/#!15/04405/1
Edit after the requirements have changed (have been detailed):
As you need to exclude the rows based on their position (a detail that you didn't provide in your original question), the filter can only be done outside of the CTE. Again, I can't test this with SQLite, only with Postgres:
WITH RECURSIVE members(id, level) AS (
VALUES('4', 1)
UNION
SELECT gm.member_id, m.level + 1
FROM GroupMembers gm
JOIN members m ON gm.group_id=m.id
)
SELECT m.id, m.level
FROM members m
where id not like '-%'
and not exists (select 1
from members m2
where m2.level < m.level
and m2.id = '-'||m.id);
Updated SQLFiddle: http://sqlfiddle.com/#!15/ec0f9/3
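Since the level-based query was only tried on Postgres, here is a small sketch that runs it on SQLite through Python's built-in sqlite3 module. The only deliberate change is the starting group: it uses '1', the example walked through in the question, rather than '4'. The CTE is referenced twice in the outer SELECT, which SQLite accepts; its "multiple references" restriction applies inside the recursive definition itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE GroupMembers (group_id VARCHAR, member_id VARCHAR);
    INSERT INTO GroupMembers (group_id, member_id) VALUES
        ('1','10'),('1','20'),('1','30'),('1','-50'),
        ('2','30'),('2','40'),('3','1'),('3','50'),
        ('4','-10'),('10','50'),('10','60');
""")

rows = conn.execute("""
    WITH RECURSIVE members(id, level) AS (
        VALUES ('1', 1)
        UNION
        SELECT gm.member_id, m.level + 1
        FROM GroupMembers gm
        JOIN members m ON gm.group_id = m.id
    )
    SELECT m.id
    FROM members m
    WHERE m.id NOT LIKE '-%'
      AND NOT EXISTS (SELECT 1
                      FROM members m2
                      WHERE m2.level < m.level
                        AND m2.id = '-' || m.id)
""").fetchall()

# Sort in Python so the result does not depend on SQLite's output order.
member_ids = sorted(row[0] for row in rows)
print(member_ids)
```

For group '1' this yields '1', '10', '20', '30' and '60': '50' is dropped because '-50' appears at a shallower level, and the starting group itself is included in the output. Note that the sample data has no cycle reachable from '1'; with the level column in the row, a cycle would keep producing new rows.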

SQL query to separate a column into separate columns

I would like to have separate columns for H and T's prices, with 'period' as the common index. Any suggestions as to how I should go about this?
This is what my SQL query produces at the moment:
You can use GROUP BY and a conditional, like this:
SELECT
period
, SUM(CASE NAME WHEN 'H' THEN price ELSE 0 END) as HPrice
, SUM(CASE NAME WHEN 'T' THEN price ELSE 0 END) as TPrice
FROM MyTable
GROUP BY period
You can do the following:
SELECT period,
max(CASE WHEN name = 'H' THEN price END) as h_price,
max(CASE WHEN name = 'T' THEN price END) as t_price
FROM myTable
GROUP by period
Or do you mean to recreate the table? In that case:
1) Create a new table with columns period, price_h and price_t.
2) Copy all (distinct) values of period into the new table's period column.
3) Copy every price where name = 'H' into the new table's price_h, joining on the period column.
4) Repeat step 3 for price_t.
Good luck!
A little late to the game on this, but you could also pivot the data.
Let's create a sample table.
CREATE TABLE myData(period int, price decimal(12,4), name varchar(10))
GO
-- Inserting Data into Table
INSERT INTO myData
(period, price, name)
VALUES
(1, 53.0450, 'H'),
(1, 55.7445, 'T'),
(2, 61.2827, 'H'),
(2, 66.0544, 'T'),
(3, 61.3405, 'H'),
(3, 66.0327, 'T');
Now the select with the pivot performed.
SELECT period, H, T
FROM (
SELECT period, price, name
FROM myData) d
PIVOT (SUM(price) FOR name IN (H, T)) AS pvt
ORDER BY period
I've used this technique when I needed to build a dynamic SQL script that took in the columns to be displayed in the header of the table. No need for CASE expressions.
I'm not sure how the performance of CASE and PIVOT compares. Maybe someone with a little more experience could comment on which performs better.
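One practical difference between the two conditional-aggregation answers is what they return when a period has no row for a given name. The sketch below, run through Python's built-in sqlite3 module, uses the myData sample plus one hypothetical extra row (period 4 with only an 'H' price) to show it; the column aliases t_sum and t_max are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myData (period INT, price REAL, name TEXT);
    INSERT INTO myData (period, price, name) VALUES
        (1, 53.0450, 'H'), (1, 55.7445, 'T'),
        (2, 61.2827, 'H'), (2, 66.0544, 'T'),
        (3, 61.3405, 'H'), (3, 66.0327, 'T'),
        (4, 70.0000, 'H');   -- hypothetical row: period 4 has no 'T' price
""")

rows = conn.execute("""
    SELECT period,
           SUM(CASE name WHEN 'T' THEN price ELSE 0 END) AS t_sum,
           MAX(CASE WHEN name = 'T' THEN price END)      AS t_max
    FROM myData
    GROUP BY period
    ORDER BY period
""").fetchall()

for row in rows:
    print(row)
```

For periods 1 to 3 both columns agree. For period 4 the SUM ... ELSE 0 form reports 0, while the MAX form without ELSE reports NULL (None in Python), which makes a missing price distinguishable from a price of zero.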

SQL Showing Less information depending on date

I have this code, which returns a list of some clients, but it lists too many. This is because it lists the same client several times, just with different dates. I only want to show the latest date and none of the others. I tried a GROUP BY Client_Code, but it didn't work; it just threw up a "not an aggregate function" error or something similar (I can get the exact message if needed). What I have been asked to get is all of our clients, with all the details listed in the 'as' parts, and they all pull through properly. If I take out:
I.DATE_LAST_POSTED as 'Last Posted',
I.DATE_LAST_BILLED as 'Last Billed'
It shows up okay, but I need only the last billed date to appear. Putting these lines in shows the client several times, listing all the different bill dates. I think that is because it is pulling across the different matters in the MATTER_MASTER table. Essentially, I would like to show the client information only for the highest matter, with their last billed date.
Please let me know if this needs clarification; I'm trying to explain as best I can.
SELECT DISTINCT
A.DIWOR as 'ID',
B.Client_alpha_Name as 'Client Name',
A.ClientCODE as 'Client Code',
B.Client_address as 'Client Address',
D.COMM_NO AS 'Contact',
E.Contact_full_name as 'Possible Key Contact',
G.LOBSICDESC as 'LOBSIC Code',
H.EARNERNAME as 'Client Care Parnter',
A.CLIENTCODE + '/' + LTRIM(STR(A.LAST_MATTER_NUM)) as 'Last Matter Code',
I.DATE_LAST_POSTED as 'Last Posted',
I.DATE_LAST_BILLED as 'Last Billed'
FROM CLIENT_MASTER A
JOIN CLIENT_INFO B
ON A.CLIENTCODE=B.CLIENT_CODE
JOIN MATTER_MASTER C
ON A.DIWOR=C.CLIENTDIWOR
JOIN COMMINFO D
ON A.DIWOR=D.DIWOR
JOIN CONTACT E
ON A.CLIENTCODE=E.CLIENTCODE
JOIN VW_CONTACT F
ON E.NAME_DIWOR=F.NAME_DIWOR
JOIN LOBSIC_CODES G
ON A.LOBSICDIWOR=G.DIWOR
JOIN STAFF H
ON A.CLIENTCAREPARTNER=H.DIWOR
JOIN MATTER I
ON C.DIWOR=I.MATTER_DIWOR
WHERE F.COMPANY_FLAG='Y'
AND C.MATTER_MANAGER NOT IN ('78','466','2','104','408','73','51','561','504','101','13','534','16','461','531','144','57','365','83','107','502','514','451')
AND I.DATE_LAST_BILLED > 0
GROUP BY A.ClientCODE
ORDER BY A.DIWOR
Your problem is that you aren't using enough aggregate functions, which is probably why you're using both the DISTINCT clause and the GROUP BY clause (the recommendation is to use GROUP BY, and not DISTINCT).
So... remove DISTINCT, add the necessary (unique, more or less) list of columns to the GROUP BY clause, and wrap the rest in aggregate functions, constants, or subselects. In the specific case of wanting the largest date, wrap it in a MAX() function.
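The advice above can be sketched on a tiny, made-up client table: group by the columns that identify the client and wrap the date in MAX(). The example runs through Python's built-in sqlite3 module; the table name t, its three columns and the rows are hypothetical and much simpler than the real query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical simplified output of the original multi-join query.
    CREATE TABLE t (ClientCode INT, ClientAddress TEXT, DateLastBilled TEXT);
    INSERT INTO t VALUES
        (1, 'address1', '2011-01-01'), (1, 'address1', '2011-01-04'),
        (1, 'address1', '2011-01-02'), (2, 'address2', '2011-01-10'),
        (2, 'address2', '2011-01-07');
""")

# No DISTINCT: every non-aggregated column is listed in GROUP BY,
# and the date collapses to its latest value per client.
rows = conn.execute("""
    SELECT ClientCode,
           ClientAddress,
           MAX(DateLastBilled) AS LastBilled
    FROM t
    GROUP BY ClientCode, ClientAddress
    ORDER BY ClientCode
""").fetchall()

for row in rows:
    print(row)
```

Each client appears once with its latest bill date. In the real query, every selected column that is not inside an aggregate would have to be listed in GROUP BY in the same way.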
If I understood right:
--=======================
-- sample data - simplified output of your query
--=======================
create table #t
(
ClientCode int,
ClientAddress varchar(50),
DateLastBilled datetime
-- the rest of the fields are skipped
)
insert into #t values (1, 'address1', '2011-01-01')
insert into #t values (1, 'address1', '2011-01-02')
insert into #t values (1, 'address1', '2011-01-03')
insert into #t values (1, 'address1', '2011-01-04')
insert into #t values (2, 'address2', '2011-01-07')
insert into #t values (2, 'address2', '2011-01-08')
insert into #t values (2, 'address2', '2011-01-09')
insert into #t values (2, 'address2', '2011-01-10')
--=======================
-- solution
--=======================
select distinct
ClientCode,
ClientAddress,
DateLastBilled
from
(
select
ClientCode,
ClientAddress,
DateLastBilled,
-- list of remaining fields
MaxDateLastBilled = max(DateLastBilled) over(partition by ClientCode)
from
(
-- here should be your query
select * from #t
) t
) t
where MaxDateLastBilled = DateLastBilled