Using a CTE in OVER(PARTITION BY) - sql

I'm trying to calculate volume from 3 columns in a table and return only unique volumes. We have many rows with the same Width, Height, and Length, so naturally my volume calculation produces duplicate values for Volume. I am under the impression that, in order to accomplish this, I must use OVER, PARTITION BY, and a CTE, since aliases are not allowed to be referenced in OVER.
WITH cteVolume (Id, Volume) AS
(
    SELECT Id, Width * Height * [Length] AS Volume
    FROM PackageMaterialDimensions
)
SELECT *
INTO #volumeTempTable
FROM (
    SELECT pp.ID, (pp.Width * pp.Height * pp.[Length]) AS Volume,
           ROW_NUMBER() OVER (PARTITION BY cte.Volume ORDER BY pp.ID DESC) AS rn
    FROM PlanPricing pp
    INNER JOIN cteVolume cte ON pp.ID = cte.Id
) a
WHERE rn = 1

SELECT * FROM #volumeTempTable
ORDER BY Volume DESC

DROP TABLE #volumeTempTable
Note: the reason for the temp table is that I plan on doing some extra work with this data. I am also currently debugging, so I am using it to output results to the data window.
Here is what is wrong with this query:
- It is still returning duplicates
- It is only returning one volume for every row
- It is only returning about 75 rows when there are 71000 rows in the table
How can I modify this query to essentially do the following:
- Calculate volume for EVERY row in the table
- SELECT rows with unique volume calculations. (I do not want to see the same volume twice in my result set)
Edit - providing data as requested.
Current data set (ignore the extra columns):
What I would like is
ID | Volume
193 | 280
286 | 350
274 | 550
241 | 720
Basically, I want to calculate volume for every row, then I would like to somehow group by volume in order to cut down duplicates and select the first row from each group

Does this do what you want?
WITH cteVolume (Id, Volume) AS (
    SELECT Id, Width * Height * [Length] AS Volume
    FROM PackageMaterialDimensions
)
SELECT DISTINCT Volume
FROM cteVolume;
If you want one id per volume:
WITH cteVolume (Id, Volume) AS (
    SELECT Id, Width * Height * [Length] AS Volume
    FROM PackageMaterialDimensions
)
SELECT Volume, MIN(Id) AS Id
FROM cteVolume
GROUP BY Volume;

Perhaps your issue comes from partitioning by cte.Volume, which is computed from the PackageMaterialDimensions table, while also selecting a Volume computed from the PlanPricing table?
I'm not able to confirm without more information on your data set and tables.
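If so, one way to sidestep the CTE and the join entirely is to repeat the expression inside OVER; a column alias can't be referenced there, but the raw expression can. A rough sketch against PlanPricing alone (assuming that's the table you want deduplicated):
SELECT ID, Volume
FROM (
    SELECT pp.ID,
           pp.Width * pp.Height * pp.[Length] AS Volume,
           -- partition by the expression itself instead of an alias
           ROW_NUMBER() OVER (PARTITION BY pp.Width * pp.Height * pp.[Length]
                              ORDER BY pp.ID DESC) AS rn
    FROM PlanPricing pp
) a
WHERE rn = 1;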

As far as I can see, you can't use window functions inside the recursive part of a CTE. You have to maintain the numbering manually, inside the CTE itself.
So, instead of
ROW_NUMBER() OVER(PARTITION BY cte.Volume ORDER BY pp.ID DESC) rn
Just write
1 as rn
in the first part, and
rn+1 as rn
in the second part.
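Putting that together, a minimal sketch of a recursive CTE with a hand-maintained counter (SomeTable and ParentId are hypothetical, just to show the pattern):
WITH cte (Id, rn) AS (
    -- anchor part: seed the counter
    SELECT Id, 1 AS rn
    FROM SomeTable
    WHERE ParentId IS NULL
    UNION ALL
    -- recursive part: increment manually instead of using ROW_NUMBER()
    SELECT t.Id, c.rn + 1
    FROM SomeTable t
    INNER JOIN cte c ON t.ParentId = c.Id
)
SELECT * FROM cte;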

Related

Teradata SQL - How to transpose data in one row

I have a table as below:
I currently have a query that selects records where SEQ=450 AND RESULT='LT' OR SEQ=650 AND RESULT='LT'. For a particular ID, if rows exist for both SEQ 450 and SEQ 650 with RESULT='LT', as in this case, I only keep the row with SEQ=450 and RESULT='LT'. However, what I want as the final output also includes the SEQ and CODE values from the row above the SEQ 450 row, like below:
And if only 650 exists for an ID, then:
Obviously in this case I would only choose seq=450.
The current query I have is:
CREATE MULTISET VOLATILE TABLE DOM AS (
    WITH cte AS (
        SELECT DISTINCT ID
        FROM MASTER
        WHERE SEQ = 450
    )
    SELECT DISTINCT SCAN.ID, SCAN.SEQ, SCAN.CODE, SCAN.RESULT
    FROM MASTER SCAN
    WHERE (SCAN.SEQ = 450 OR (SCAN.SEQ = 650 AND NOT EXISTS (
        SELECT 1 FROM cte WHERE cte.ID = SCAN.ID
    ))) AND SCAN.RESULT = 'LT'
) WITH DATA PRIMARY INDEX (ID, SEQ) ON COMMIT PRESERVE ROWS;
This gives me the output for this particular ID as:
How can I modify the query to also get the other columns? Note: the SEQ_BEFORE will not always be 300 or 600, so I cannot just use that SEQ number as a reference in the query.
This is how I understand the task: you want to get the data for SEQ 450 and 650 along with their predecessor values (SEQ 300 and 600 in your example). Then, per ID, you only want to select the row with the lesser SEQ of the two: if only one of the two sequences 450 and 650 exists, you show it; if both exist, you only show 450.
Use LAG to get the predecessor's values. Use MIN OVER to get the lesser seq per ID.
select *
from
(
    select
        id, seq, result, code,
        lag(code) over (partition by id order by seq) as code_before,
        lag(seq) over (partition by id order by seq) as seq_before
    from mytable
) with_values_before
where seq in (450, 650)
qualify seq = min(seq) over (partition by id)
order by id;

How to query samples in relativity?

I have a large data set with about 100 million rows. I want to 'compress' it and get a 1% sample of the entire dataset while preserving the relative distribution of the groups.
How can such a query be implemented?
Step 1: create the helper table
You can use aggregation to group records by visit_number, and CROSS JOIN with a query that computes the total number of records in the table to compute the distribution percent:
CREATE TABLE my_helper AS
SELECT
    t.visit_number,
    COUNT(*) AS visit_count,
    SUM(t.purchase_id) AS sum_purchase,
    COUNT(*) * 1.0 / total.cnt AS distribution  -- multiply by 1.0 to avoid integer division
FROM mytable t
CROSS JOIN (SELECT COUNT(*) AS cnt FROM mytable) total
GROUP BY t.visit_number, total.cnt
Step 2: sample the main table using the helper table
Within a subquery, you can use ROW_NUMBER() OVER(PARTITION BY visit_number ORDER BY RANDOM()) to assign a random rank to each record within groups of records sharing the same visit_number. Then, in the outer query, you can join on the helper table to select the correct number of records for each visit_number:
SELECT x.*
FROM (
    SELECT
        t.*,
        ROW_NUMBER() OVER (PARTITION BY visit_number ORDER BY RANDOM()) AS rn
    FROM mytable t
) x
INNER JOIN my_helper h ON h.visit_number = x.visit_number
WHERE x.rn <= 1000000 * h.distribution
Side notes:
- this only works if there are indeed more than 1 million records in the source table
- the exact number of records in the output might be slightly below or above 1 million (depending on the distribution in the original table)
- it should be possible to combine both queries into a single one, which would avoid the need for a helper table (see the sketch below)
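For instance, a single-query sketch (untested, same assumptions and dialect as above) that replaces the helper table with window aggregates:
SELECT x.*
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY visit_number ORDER BY RANDOM()) AS rn,
           COUNT(*) OVER (PARTITION BY visit_number) AS visit_count,  -- per-group size
           COUNT(*) OVER () AS total_count                            -- table size
    FROM mytable t
) x
WHERE x.rn <= 1000000.0 * x.visit_count / x.total_count;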
This is doable. A quick way is to take every nth record only:
1) order by some column (probably ID)
2) apply a rownum attribute (e.g. ROW_NUMBER())
3) filter on mod(rownum, n) = 0 for whatever percent makes sense (e.g. for 1%, keep rows where rownum mod 100 = 0)
You may need steps 1 and 2 in a subquery and step 3 on the outside, as in the sketch below.
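A sketch of those three steps (assuming a generic mytable with an id column; the modulo syntax varies by dialect):
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY id) AS rn  -- steps 1 and 2
    FROM mytable t
) x
WHERE MOD(x.rn, 100) = 0;  -- step 3: keep every 100th row, i.e. a 1% sample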
Enjoy and good luck!

Using SQL Server: how to use select criteria based on sum

Given the below table and using SQL (SQL Server preferred), how can I select only the ProductIDs whose Orders sum to the first 200 orders or less?
In other words, I'd like the IDs for 'Corn Flakes' and 'Wheeties' returned, since this is close to the sum of 200 orders, but returning anything more would be over the limit.
Given that 108 + 92 = 200, I must assume that you want the product ids in order.
In that case, you can use a cumulative sum:
select t.*
from (
    select t.*,
           sum(orders) over (order by product_id) as running_orders
    from t
) t
where running_orders <= 200;
Not sure which is more appropriate for your level and version:
select * from T as t
where (
    select sum(Orders) from T as t2
    where t2.ProductID <= t.ProductID -- *
) <= 200;

with data as (
    select *,
           sum(Orders) over (order by ProductID) as cumm -- *
    from T
)
select * from data where cumm <= 200;
Both of these essentially assume there will be no ties, or at least no ties that would straddle the 200-order cutoff.
If you discover that you intended to sort by number of orders rather than product id, change the column references in the lines marked with asterisks, as in the example below.
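For example, accumulating by number of orders instead (with ProductID as an assumed tiebreaker) would look like:
with data as (
    select *,
           -- order the running total by Orders; ProductID breaks ties deterministically
           sum(Orders) over (order by Orders, ProductID) as cumm
    from T
)
select * from data where cumm <= 200;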

Getting a single row based on unique column

I think this is mostly a terminology issue, where I'm having a hard time articulating the problem.
I've got a table with a couple of columns that manage some historical log data. The two columns I'm interested in are timestamp (or Id, as the id is generated sequentially) and terminalID.
I'd like to supply a list of terminal ids and find only the latest data, that is, the highest id or timestamp per terminalID.
I ended up using the GROUP BY solution as @Danny suggested, along with the other solution he referenced.
I found the time difference to be quite noticeable, so I'm posting both results here for anyone's FYI.
S1:
SELECT UR.*
FROM (
    SELECT TerminalID, MAX(ID) AS lID
    FROM dbo.Results
    WHERE TerminalID IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
    GROUP BY TerminalID
) GT
LEFT JOIN dbo.Results UR ON UR.ID = GT.lID
S2:
SELECT *
FROM (
    SELECT TOP 100
        Row_Number() OVER (PARTITION BY TerminalID ORDER BY ID DESC) AS [Row], *
    FROM dbo.Results
    WHERE TerminalID IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
    ORDER BY ID DESC
) a
WHERE a.[Row] = 1
The results were:
S1:
CPU time = 297 ms, elapsed time = 343 ms.
Query cost: 36%
Missing index impact: 94%
S2:
CPU time = 562 ms, elapsed time = 1000 ms.
Query cost: 64%
Missing index impact: 41%
After adding the missing index to solution one (indexing ID only, as opposed to S2, where multiple columns needed an index), I got the query down to 15 ms.
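For reference, the added index was presumably something along these lines (hypothetical name and shape, since the post only says ID was indexed):
-- hypothetical reconstruction of the index described above
CREATE NONCLUSTERED INDEX IX_Results_ID ON dbo.Results (ID);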
Use the TOP keyword:
SELECT TOP 1 ID, TerminalID FROM MyTable WHERE <your condition> ORDER BY <something that orders it the way you need so that the correct top row is returned>
I think you're on the right track with GROUP BY. Sounds like you want:
SELECT TerminalID, MAX(Timestamp) AS LastTimestamp
FROM [Table_Name]
WHERE TerminalID IN (.., .., .., ..)
GROUP BY TerminalID
While not as obvious as using MAX with a GROUP BY, this can offer extra flexibility if you need to have more than one column determining which row or rows you want pulled back.
SELECT *
FROM (
    SELECT
        Row_Number() OVER (PARTITION BY terminalID ORDER BY Id DESC) AS [Row],
        [terminalID], [Id], [timestamp]
    FROM <TABLE>
) a
WHERE a.[Row] = 1

MSSQL 2008 SP pagination and count number of total records

In my SP I have the following:
with Paging (RowNo, ID, Name, TotalOccurrences) as
(
    SELECT ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo,
           V.ID, V.Name, R.TotalOccurrences
    FROM dbo.Videos V
    INNER JOIN ....
)
SELECT * FROM Paging WHERE RowNo BETWEEN 1 and 50
SELECT COUNT(*) FROM Paging
The result is that I get the error: invalid object name 'Paging'.
Can I query the Paging table again? I don't want to include the count for all results as a new column ... I would prefer to return it as another data set. Is that possible?
Thanks, Radu
After more research I found another way of doing this:
with Paging (RowNo, ID, Name, TotalOccurrences) AS
(
    SELECT ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo,
           V.ID, V.Name, R.TotalOccurrences
    FROM dbo.Videos V
    INNER JOIN ....
)
select RowNo, ID, Name, TotalOccurrences,
       (select COUNT(*) from Paging) as TotalResults
from Paging
where RowNo between (@PageNumber - 1) * @PageSize + 1 and @PageNumber * @PageSize;
I think this has better performance than running the query twice.
You can't do that because the CTE you are defining will only be available to the FIRST query that appears after it's been defined. So when you run the COUNT(*) query, the CTE is no longer available to reference. That's just a limitation of CTEs.
So to do the COUNT as a separate step, you'd need to skip the CTE and run the COUNT against the full query instead.
Or, you could wrap the CTE up in an inline table valued function and use that instead, to save repeating the main query, something like this:
CREATE FUNCTION dbo.ufnExample()
RETURNS TABLE
AS
RETURN
(
    with Paging (RowNo, ID, Name, TotalOccurrences) as
    (
        SELECT ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo,
               V.ID, V.Name, R.TotalOccurrences
        FROM dbo.Videos V
        INNER JOIN ....
    )
    SELECT * FROM Paging
)
SELECT * FROM dbo.ufnExample() x WHERE RowNo BETWEEN 1 AND 50
SELECT COUNT(*) FROM dbo.ufnExample() x
Please be aware that Radu D's solution's query plan shows double hits to those tables; it is doing two executions under the covers. However, this still may be the best way, as I haven't found a truly scalable single-query design.
A less scalable single-query design is to dump the complete ordered list into a @table variable, SELECT @@ROWCOUNT to get the full count, and select from the @table variable where the row number is between X and Y. This works well for < 10,000 rows, but with results in the millions of rows, populating that @table variable gets expensive.
A hybrid approach is to populate this temp table/variable up to 10,000 rows. If fewer than 10,000 rows come back, you're set. If all 10,000 rows are filled, you'll need to rerun the search to get the full count. This works well if most of your queries return well under 10,000 rows; the 10,000 limit is a rough approximation, and you can play around with this threshold for your case.
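A rough T-SQL sketch of that hybrid (the table, columns, and the 10,000 threshold are placeholders):
DECLARE @page TABLE (RowNo int, ID int);

-- populate up to the threshold in one ordered pass
INSERT INTO @page (RowNo, ID)
SELECT TOP (10000)
       ROW_NUMBER() OVER (ORDER BY TotalOccurrences DESC) AS RowNo, V.ID
FROM dbo.Videos V;

-- if the variable holds everything, count it directly; otherwise rerun for the full count
IF @@ROWCOUNT < 10000
    SELECT COUNT(*) AS TotalResults FROM @page;
ELSE
    SELECT COUNT(*) AS TotalResults FROM dbo.Videos;

-- page from the variable
SELECT * FROM @page WHERE RowNo BETWEEN 1 AND 50;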
Write "AS" after the CTE table name Paging as below:
with Paging AS (RowNo, ID, Name, TotalOccurrences) as
(
ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo, V.ID, V.Name, R.TotalOccurrences FROM dbo.Videos V INNER JOIN ....
)
SELECT * FROM Paging WHERE RowNo BETWEEN 1 and 50
SELECT COUNT(*) FROM Paging