Remove duplicates from a single field only in a rollup query - SQL

I have a table of data for individual audits on inventory. Every audit has a location, an expected value, a variance value, and some other data that aren't really important here.
I am writing a query for Cognos 11 which summarizes a week of these audits. Currently, it rolls everything up into sums by location class. My problem is that there may be multiple audits for individual locations. I want the variance field to sum the data from all audits, regardless of whether it's the first count on that location, but I only want the expected value summed once per distinct location (i.e. only SUM expected_cost where the location is distinct).
Below is a simplified version of the query. Is this even possible, or will I have to write a separate query in Cognos and make it two reports that have to be combined after the fact? As you can likely tell, I'm fairly new to SQL and Cognos.
SELECT COALESCE(CASE
WHEN location_class IN (
'A'
,'C'
)
THEN 'Active'
WHEN location_class IN (
'R'
,'0'
)
THEN 'Reserve'
END, 'Grand Total') "Row Labels"
,SUM(NVL(expected_cost, 0)) "Sum of Expected Cost"
,SUM(NVL(variance_cost, 0)) "Sum of Variance Cost"
,SUM(ABS(NVL(variance_cost, 0))) "Sum of Absolute Cost"
,COUNT(DISTINCT location) "Count of Locations"
,(SUM(NVL(variance_cost, 0)) / SUM(NVL(expected_cost, 0))) "Variance"
FROM audit_table
WHERE audit_datetime <= #prompt('EndDate')# AND audit_datetime >= #prompt('StartDate')#
GROUP BY ROLLUP(CASE
WHEN location_class IN (
'A'
,'C'
)
THEN 'Active'
WHEN location_class IN (
'R'
,'0'
)
THEN 'Reserve'
END)
ORDER BY 1 ASC
This is what I'm hoping to end up with:
Thanks for any help!

Have you tried taking a look at the OVER clause in SQL? It allows you to use windowed functions within a result set so that you can get aggregates based on specific conditions. This would probably help, since you seem to be trying to get a summation of data based on a different grouping within a larger grouping.
For example, let's say we have the below dataset:
group1      group2      val         dateadded
----------- ----------- ----------- -----------------------
1           1           1           2020-11-18
1           1           1           2020-11-20
1           2           10          2020-11-18
1           2           10          2020-11-20
2           3           100         2020-11-18
2           3           100         2020-11-20
2           4           1000        2020-11-18
2           4           1000        2020-11-20
Using a single query we can return both the sums of "val" over "group1" and the summation of the first (based on dateadded) "val" records in "group2":
declare @table table (group1 int, group2 int, val int, dateadded datetime)
insert into @table values (1, 1, 1, getdate())
insert into @table values (1, 1, 1, dateadd(day, 1, getdate()))
insert into @table values (1, 2, 10, getdate())
insert into @table values (1, 2, 10, dateadd(day, 1, getdate()))
insert into @table values (2, 3, 100, getdate())
insert into @table values (2, 3, 100, dateadd(day, 1, getdate()))
insert into @table values (2, 4, 1000, getdate())
insert into @table values (2, 4, 1000, dateadd(day, 1, getdate()))
select t.group1, sum(t.val) as group1_sum, group2_first_val_sum
from @table t
inner join
(
select group1, sum(group2_first_val) as group2_first_val_sum
from
(
select group1, val as group2_first_val, row_number() over (partition by group2 order by dateadded) as rownumber
from @table
) y
where rownumber = 1
group by group1
) x on t.group1 = x.group1
group by t.group1, x.group2_first_val_sum
This returns the below result set:
group1      group1_sum  group2_first_val_sum
----------- ----------- --------------------
1           22          11
2           2200        1100
The innermost subquery in the joined table numbers the rows based on "group2", so each record gets either a "1" or a "2" in the "rownumber" column, since there are only 2 records in each "group2".
The next subquery filters out any rows that are not the first (rownumber = 1) and sums their "val" data per "group1".
The main query gets the sum of "val" in each "group1" from the main table and then joins on the subqueried table to pick up the "val" sum of only the first records in each "group2".
There are more efficient ways to write this, such as moving the summation of the "group1" values to a subquery in the SELECT statement to get rid of one of the nested derived tables, but I wanted to show how to do it without subqueries in the SELECT statement.
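Applied back to the original audit_table, the same idea might look like the sketch below (untested; it assumes the earliest audit_datetime marks the first count per location, and it leaves the Cognos prompt filter out for brevity):
SELECT COALESCE(CASE
           WHEN location_class IN ('A', 'C') THEN 'Active'
           WHEN location_class IN ('R', '0') THEN 'Reserve'
       END, 'Grand Total') "Row Labels"
      ,SUM(CASE WHEN audit_seq = 1 THEN NVL(expected_cost, 0) ELSE 0 END) "Sum of Expected Cost"
      ,SUM(NVL(variance_cost, 0)) "Sum of Variance Cost"
FROM (
      SELECT a.*
            ,ROW_NUMBER() OVER (PARTITION BY location ORDER BY audit_datetime) audit_seq
      FROM audit_table a
     ) t
GROUP BY ROLLUP(CASE
                    WHEN location_class IN ('A', 'C') THEN 'Active'
                    WHEN location_class IN ('R', '0') THEN 'Reserve'
                END)
ORDER BY 1 ASC
Here expected_cost is only summed for each location's first audit, while variance_cost is summed over every audit, which matches what the question asks for.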

Have you tried putting the DISTINCT at the bottom, like this?
(SUM(NVL(variance_cost,0)) / SUM(NVL(expected_cost,0))) "Variance",
COUNT(DISTINCT location) "Count of Locations"
FROM audit_table

Related

Generate Identifier for consecutive rows with same value

I'm trying to write a SQL Server query that partitions the data so that consecutive rows with the same Type value, ordered by date, get the same unique identifier.
Let's say I have the following table
declare @test table
(
CustomerId varchar(10),
Type int,
date datetime
)
insert into @test values ('aaaa', 1, '2015-10-24 22:52:47')
insert into @test values ('bbbb', 1, '2015-10-23 22:56:47')
insert into @test values ('cccc', 2, '2015-10-22 21:52:47')
insert into @test values ('dddd', 2, '2015-10-20 22:12:47')
insert into @test values ('aaaa', 1, '2015-10-19 20:52:47')
insert into @test values ('dddd', 2, '2015-10-18 12:52:47')
insert into @test values ('aaaa', 3, '2015-10-18 12:52:47')
I want my output column to be something like this (the numbers do not need to be ordered, all I need are unique identifiers for each group).
0
0
1
1
2
3
4
Explanation: the first 2 rows get UD 0 because they both have type "1"; the next row has a different type ("2"), so it gets another identifier, UD 1 in this case; the following row still has the same type, so the UD stays the same; then the next one has a different type ("1"), so another identifier, UD 2, and so on.
The CustomerId column is irrelevant to the query; the condition should be based on the Type and date columns.
My current query almost does the trick, but it fails in some cases, giving the same ID to rows with different Type values.
SELECT
ROW_NUMBER() OVER (ORDER BY date) -
ROW_NUMBER() OVER (PARTITION BY Type ORDER BY date)
FROM @test
This is a Gaps & Islands problem, solved with the usual approach: use lag() to flag the first row of each island, then take a running sum of the flags to number the islands.
For example:
select
*,
sum(inc) over(order by date desc, type) as grp
from (
select *,
case when type <> lag(type) over(order by date desc, type)
then 1 else 0 end as inc
from test
) x
order by date desc, type
Result:
CustomerId  Type  date                  inc  grp
----------  ----  --------------------  ---  ---
aaaa        1     2015-10-24T22:52:47Z  0    0
bbbb        1     2015-10-23T22:56:47Z  0    0
cccc        2     2015-10-22T21:52:47Z  1    1
dddd        2     2015-10-20T22:12:47Z  0    1
aaaa        1     2015-10-19T20:52:47Z  1    2
dddd        2     2015-10-18T12:52:47Z  1    3
aaaa        3     2015-10-18T12:52:47Z  1    4
See example at SQL Fiddle.

Do different sums in the same query

I currently have an issue with a query:
I have 2 tables.
Table 1: (screenshot omitted)
Table 2: (screenshot omitted; sample data is reconstructed in the answer below)
I'm trying to join both tables on DateHour (that works) and, for each campaign, each PRF_ID, each LENGTH and each Type, to calculate count the occurrences of the HPP column of table 2, year per year.
So for instance, for a given PRF_ID, length, type and campaign, I will have a range of dates in Table 1 between 01/04/2019 and 01/04/2020.
In this case, I need a new column giving me, for all the dates between 01/04/2019 and 31/12/2019, the sum of HPP occurrences in this period.
For 2020, the sum would be between 01/01/2020 and 01/04/2020.
I tried doing something like this:
SELECT Table1.DateHour,
SUM(Table2.HPP) OVER (PARTITION BY YEAR(Table2.DateHour))
FROM Table1
LEFT JOIN Table2 ON Table2.DateHour = Table1.DateHour
But that gives me really odd results; the OVER PARTITION BY does not seem to work.
Your question is confusing because it mixes terminology.
Count versus sum
... to calculate count the occurrences ... the sum would be ...
Counting occurrences is not the same as adding them up. Every record that can be joined counts as an occurrence. Calculating the sum means adding up the values of a column. I added both calculations to the solution below (see IsHour_Sum versus Table2_Count).
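A minimal sketch of the difference, using a hypothetical @demo table loaded with the four HPP values (0, 0, 1, 1) that end up joined for 2018 in the sample data below:
declare @demo table (HPP int);
insert into @demo values (0), (0), (1), (1);

select count(*) as Table2_Count -- 4: every joined record counts as an occurrence
     , sum(HPP) as IsHour_Sum   -- 2: the HPP values themselves added up
from @demo;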
Grouping on combinations
Do different sums in the same query [question title]
... and for each campaign, for each PRF_ID, for each LENGTH and for each Type ...
Do you want to aggregate over each combination of those columns or do you want to aggregate over each column individually? I have assumed you are after the combinations in my solution. Example to clarify:
If
column A has 3 possible values (A1, A2, A3) and
column B has 2 possible values (B1, B2)
Then
there are 5 counts (3 + 2) when aggregating (A) and (B) individually
there are 6 counts (3 * 2) when aggregating each combination of (A,B)
Again:
(A) and (B) individually:    each combination of (A,B):
A1 -> count 1                A1,B1 -> count 1
A2 -> count 2                A1,B2 -> count 2
A3 -> count 3                A2,B1 -> count 3
B1 -> count 4                A2,B2 -> count 4
B2 -> count 5                A3,B1 -> count 5
                             A3,B2 -> count 6
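In SQL, the two variants could be sketched roughly like this (assuming a hypothetical table t with columns A and B):
-- each combination of (A,B): up to 6 groups
select A, B, count(*) as Cnt from t group by A, B;

-- (A) and (B) individually: up to 5 groups
select A, B, count(*) as Cnt from t group by grouping sets ((A), (B));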
Sample data
I left out the column Value from table1 because it is not part of your question. I also changed the dates for table2 to 2018-05-23 to match the majority of the table1 records; otherwise all counts and sums would be 0.
declare @table1 table
(
DateHour datetime,
PRF_ID int,
Campaign nvarchar(5),
Length int,
ContractType nvarchar(5)
);
insert into @table1 (DateHour, PRF_ID, Campaign, Length, ContractType) values
('2018-05-23 00:00', 1, 'Q218', 1, 'G'),
('2018-05-23 01:00', 1, 'Q218', 1, 'G'),
('2018-05-23 02:00', 1, 'Q218', 1, 'G'),
('2020-05-23 03:00', 1, 'Q120', 1, 'G'),
('2018-05-23 04:00', 1, 'Q218', 1, 'G'),
('2019-07-23 01:00', 1, 'Q219', 1, 'G');
declare @table2 table
(
DateHour datetime,
HPP int
);
insert into @table2 (DateHour, HPP) values
('2018-05-23 00:00', 0),
('2018-05-23 01:00', 0),
('2018-05-23 02:00', 1),
('2018-05-23 03:00', 0),
('2018-05-23 04:00', 1),
('2018-05-23 05:00', 0),
('2018-05-23 06:00', 0),
('2018-05-23 07:00', 0);
Solution
The easiest way to aggregate on the year of the dates is to split off that part as a new column instead of using an over(partition by ...) construction. If you do not need the new column Year in your output, you can simply remove it from the field list (after select), but it must remain in the grouping clause (after group by).
select year(t1.DateHour) as 'Year',
t1.PRF_ID,
t1.Campaign,
t1.Length,
t1.ContractType,
isnull(sum(t2.HPP), 0) as 'IsHour_Sum',
count(t2.DateHour) as 'Table2_Count'
from @table1 t1
left join @table2 t2
on t2.DateHour = t1.DateHour
/* -- specify date filter as required
where t1.DateHour >= '2019-04-01 00:00'
and t1.DateHour < '2020-04-01 00:00'
*/
group by year(t1.DateHour),
t1.PRF_ID,
t1.Campaign,
t1.Length,
t1.ContractType;
Result
Year  PRF_ID  Campaign  Length  ContractType  IsHour_Sum  Table2_Count
----  ------  --------  ------  ------------  ----------  ------------
2018  1       Q218      1       G             2           4
2019  1       Q219      1       G             0           0
2020  1       Q120      1       G             0           0
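If you need to keep the individual hourly rows instead of collapsing them, the same yearly aggregation can also be expressed with a window function. A sketch against the sample tables above (untested):
select t1.DateHour,
       t1.PRF_ID, t1.Campaign, t1.Length, t1.ContractType,
       sum(isnull(t2.HPP, 0)) over (partition by year(t1.DateHour),
                                    t1.PRF_ID, t1.Campaign, t1.Length, t1.ContractType) as IsHour_Sum
from @table1 t1
left join @table2 t2
    on t2.DateHour = t1.DateHour;
This keeps one row per joined record and repeats the yearly sum on each of them, which is probably what the original over(partition by ...) attempt was aiming at.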

Count length of consecutive duplicate values for each id

I have a table as shown in the screenshot (first two columns) and I need to create a column like the last one. I'm trying to calculate the length of each sequence of consecutive values for each id.
For this, the last column is required. I played around with
row_number() over (partition by id, value)
but did not have much success, since the circled number was (quite predictably) computed as 2 instead of 1.
Please help!
First of all, we need a way to define how the rows are ordered. For example, in your sample data there is no way to be sure that the 'first' row (1, 1) will always be displayed before the 'second' row (1, 0).
That's why in my sample data I have added an identity column. In your real case, the details can be ordered by a row ID, a date column or something else, but you need to ensure the rows can be sorted by unique criteria.
So, the task is pretty simple:
calculate a trigger switch (when the value changes)
calculate groups
calculate row numbers
That's it. I have used common table expressions and left all columns in, to make the logic easy to follow. You are free to break this into separate statements and remove some of the columns.
DECLARE @DataSource TABLE
(
[RowID] INT IDENTITY(1, 1)
,[ID]INT
,[value] INT
);
INSERT INTO @DataSource ([ID], [value])
VALUES (1, 1)
,(1, 0)
,(1, 0)
,(1, 1)
,(1, 1)
,(1, 1)
--
,(2, 0)
,(2, 1)
,(2, 0)
,(2, 0);
WITH DataSourceWithSwitch AS
(
SELECT *
,IIF(LAG([value]) OVER (PARTITION BY [ID] ORDER BY [RowID]) = [value], 0, 1) AS [Switch]
FROM @DataSource
), DataSourceWithGroup AS
(
SELECT *
,SUM([Switch]) OVER (PARTITION BY [ID] ORDER BY [RowID]) AS [Group]
FROM DataSourceWithSwitch
)
SELECT *
,ROW_NUMBER() OVER (PARTITION BY [ID], [Group] ORDER BY [RowID]) AS [GroupRowID]
FROM DataSourceWithGroup
ORDER BY [RowID];
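Worked through by hand (hand-derived, not machine output), the sample data should come out like this:
RowID  ID  value  Switch  Group  GroupRowID
-----  --  -----  ------  -----  ----------
1      1   1      1       1      1
2      1   0      1       2      1
3      1   0      0       2      2
4      1   1      1       3      1
5      1   1      0       3      2
6      1   1      0       3      3
7      2   0      1       1      1
8      2   1      1       2      1
9      2   0      1       3      1
10     2   0      0       3      2
The requested sequence length is then the highest GroupRowID within each [ID], [Group] pair.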
You want results that depend on the actual data ordering in the data source. In SQL you operate on relations, sometimes on ordered sets of relation rows. Your desired end result is not well-defined in terms of SQL unless you introduce an additional column in your source table over which your data is ordered (e.g. an auto-increment or timestamp column).
Note: this answers the original question and doesn't take into account the additional timestamp column mentioned in the comments. I'm not updating my answer since there is already an accepted answer.
One way to solve it could be through a recursive CTE:
create table #tmp (i int identity,id int, value int, rn int);
insert into #tmp (id,value) VALUES
(1,1),(1,0),(1,0),(1,1),(1,1),(1,1),
(2,0),(2,1),(2,0),(2,0);
WITH numbered AS (
SELECT i,id,value, 1 seq FROM #tmp WHERE i=1 UNION ALL
SELECT a.i,a.id,a.value, CASE WHEN a.id=b.id AND a.value=b.value THEN b.seq+1 ELSE 1 END
FROM #tmp a INNER JOIN numbered b ON a.i=b.i+1
)
SELECT * FROM numbered -- OPTION (MAXRECURSION 1000)
This will return the following:
i   id  value  seq
--  --  -----  ---
1   1   1      1
2   1   0      1
3   1   0      2
4   1   1      1
5   1   1      2
6   1   1      3
7   2   0      1
8   2   1      1
9   2   0      1
10  2   0      2
See my little demo here: https://rextester.com/ZZEIU93657
A prerequisite for the CTE to work is a sequenced table (e.g. a table with an identity column in it) as a source. In my example I introduced the column i for this. As a starting point I need to find the first entry of the source table; in my case this was the entry with i=1.
For a longer source table you might run into a recursion-limit error, as the default for MAXRECURSION is 100. In that case you should uncomment the OPTION setting after my SELECT clause above. You can either set it to a higher value (as shown) or switch the limit off completely by setting it to 0.
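For example, with the limit switched off entirely, the uncommented tail of the statement would read (a sketch; OPTION must sit on the outermost statement):
SELECT * FROM numbered
OPTION (MAXRECURSION 0) -- 0 removes the recursion limit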
IMHO, this is easier to do with a cursor and a loop, but maybe there is a way to do the job with a self join:
declare @t table (id int, val int)
insert into @t (id, val)
select 1 as id, 1 as val
union all select 1, 0
union all select 1, 0
union all select 1, 1
union all select 1, 1
union all select 1, 1
;with cte1 (id , val , num ) as
(
select id, val, row_number() over (ORDER BY (SELECT 1)) as num from @t
)
, cte2 (id, val, num, N) as
(
select id, val, num, 1 from cte1 where num = 1
union all
select t1.id, t1.val, t1.num,
case when t1.id=t2.id and t1.val=t2.val then t2.N + 1 else 1 end
from cte1 t1 inner join cte2 t2 on t1.num = t2.num + 1 where t1.num > 1
)
select * from cte2

Teradata: Recursively Subtract

I have a set of data as follows:
Product  Customer  Sequence  Amount
-------  --------  --------  -------
A        123       1         928.69
A        123       2         5032.81
A        123       3         6499.19
A        123       4         7908.57
What I want to do is recursively subtract the amounts based on the result of the previous subtraction (keeping the first amount as-is), into in a 'Result' column
e.g. Subtract 0 from 928.69 = 928.69, subtract 928.69 from 5032.81 = 4104.12, subtract 4104.12 from 6499.19 = 2395.07, etc (for each product/customer)
The results I'm trying to achieve are:
Product  Customer  Sequence  Amount   Result
-------  --------  --------  -------  -------
A        123       1         928.69   928.69
A        123       2         5032.81  4104.12
A        123       3         6499.19  2395.07
A        123       4         7908.57  5513.50
I had been trying to achieve this using combinations of LEAD & LAG, but couldn't figure out how to use the result in the next row.
I'm thinking it's possible using a recursive statement, iterating over the sequence; however, I'm not familiar with Teradata recursion and couldn't successfully adapt the samples I found.
Can anyone please direct me on how to format a recursive Teradata SQL statement to achieve the above result? I'm also open to non-recursive options if there are any.
CREATE VOLATILE TABLE MY_TEST (Product CHAR(1), Customer INTEGER, Sequence INTEGER, Amount DECIMAL(16,2)) ON COMMIT PRESERVE ROWS;
INSERT INTO MY_TEST VALUES ('A', 123, 1, 928.69);
INSERT INTO MY_TEST VALUES ('A', 123, 2, 5032.81);
INSERT INTO MY_TEST VALUES ('A', 123, 3, 6499.19);
INSERT INTO MY_TEST VALUES ('A', 123, 4, 7908.57);
This is really weird because of the alternation of the + and -. Expanding the recurrence shows why: result_n = amount_n - result_(n-1), so result_n = amount_n - amount_(n-1) + amount_(n-2) - ..., i.e. a running sum of the amounts with alternating signs.
If you know the value is always positive, then this works:
with t as (
      select 'A' as product, 1 as sequence, 928.69 as amount, 928.69 as expected_result union all
      select 'A', 2, 5032.81, 4104.12 union all
      select 'A', 3, 6499.19, 2395.07 union all
      select 'A', 4, 7908.57, 5513.50
     )
select t.*,
       abs(sum(case when sequence mod 2 = 1 then - amount else amount end
              ) over (partition by product order by sequence rows unbounded preceding)
          ) as result
from t;
The abs() is really a shortcut. If the resulting value could be negative, you can use an outer case expression to determine whether the result should be multiplied by -1 or 1:
select t.*,
       ((case when sequence mod 2 = 1 then -1 else 1 end) *
        sum(case when sequence mod 2 = 1 then - amount else amount end
           ) over (partition by product order by sequence rows unbounded preceding)
       ) as result
from t;
select A.col_A - B.der_col_A
from my_table A
inner join (
    select col_B,
           coalesce(min(col_A) over (partition by col_B order by col_A
                                     rows between 1 following and 1 following), 0) as der_col_A
    from my_table
) B
on A.col_B = B.col_B;
Replace col_A and col_B with your key columns: Product, Customer and Sequence in your case.
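For reference, a direct recursive formulation is also possible. A sketch (untested), assuming Teradata's WITH RECURSIVE syntax and the MY_TEST table above, implementing Result_n = Amount_n - Result_(n-1):
WITH RECURSIVE results (Product, Customer, Sequence, Amount, Result) AS (
    -- anchor row: keep the first amount as-is
    SELECT Product, Customer, Sequence, Amount, Amount
    FROM MY_TEST
    WHERE Sequence = 1
    UNION ALL
    -- recursive step: subtract the previous result from the current amount
    SELECT t.Product, t.Customer, t.Sequence, t.Amount, t.Amount - r.Result
    FROM MY_TEST t
    INNER JOIN results r
        ON  t.Product  = r.Product
        AND t.Customer = r.Customer
        AND t.Sequence = r.Sequence + 1
)
SELECT *
FROM results
ORDER BY Product, Customer, Sequence;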

SQL Server Sum a specific number of rows based on another column

Here are the important columns in my table
ItemId  RowID  CalculatedNum
------  -----  -------------
1       1      3
1       2      0
1       3      5
1       4      25
1       5      0
1       6      8
1       7      14
1       8      2
.....
The rowID increments to 141 before the ItemID increments to 2. This cycle repeats for about 122 million rows.
I need to SUM the CalculatedNum field in groups of 6. So sum 1-6, then 7-12, etc. I know I end up with an odd number at the end. I can discard the last three rows (numbers 139, 140 and 141). I need it to start the SUM cycle again when I get to the next ItemID.
I know I need to group by the ItemID, but I am having trouble figuring out how to get SQL to SUM just 6 CalculatedNums at a time. Everything else I have come across SUMs based on a column where the values are the same.
I did find something on Microsoft's site that used the ROW_NUMBER function, but I couldn't quite make sense of it. Please let me know if this question is not clear.
Thank you
You need to group by (RowId - 1) / 6 and ItemId. Like this:
drop table if exists dbo.Items;
create table dbo.Items (
ItemId int
, RowId int
, CalculatedNum int
);
insert into dbo.Items (ItemId, RowId, CalculatedNum)
values (1, 1, 3), (1, 2, 0), (1, 3, 5), (1, 4, 25)
, (1, 5, 0), (1, 6, 8), (1, 7, 14), (1, 8, 2);
select
tt.ItemId
, sum(tt.CalculatedNum) as CalcSum
from (
select
*
, (t.RowId - 1) / 6 as Grp
from dbo.Items t
) tt
group by tt.ItemId, tt.Grp
You could use integer division and group by.
SELECT ItemId, (RowId - 1) / 6 as Batch, sum(CalculatedNum)
FROM your_table GROUP BY ItemId, (RowId - 1) / 6
To discard incomplete batches:
SELECT ItemId, (RowId - 1) / 6 as Batch, sum(CalculatedNum), count(*) as Cnt
FROM your_table GROUP BY ItemId, (RowId - 1) / 6 HAVING count(*) = 6
EDIT: Fix an off by one error.
To ensure you're querying 6 rows at a time you can try to use the modulo function: https://technet.microsoft.com/fr-fr/library/ms173482(v=sql.110).aspx
Hope this can help.
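For the record, integer division and modulo play complementary roles here; a small sketch against the dbo.Items table from the first answer:
SELECT RowId
     , (RowId - 1) / 6 AS Batch      -- which batch of 6 the row belongs to (0, 0, ..., 1, ...)
     , (RowId - 1) % 6 AS PosInBatch -- position of the row within its batch (0..5)
FROM dbo.Items;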
Thanks everyone. This was really helpful.
Here is what we ended up with.
SELECT ItemID, MIN(RowID) AS StartingRow, SUM(CalculatedNum)
FROM dbo.table
GROUP BY ItemID, (RowID - 1) / 6
ORDER BY ItemID, StartingRow
I am not sure why it did not like the integer division in the select statement, but I checked the results against a sample of the data and the math is correct.