Selecting SUM of TOP 2 values within a table with multiple GROUP in SQL

Selecting SUM of TOP 2 values within a table with multiple GROUP in SQL - sql

I've been playing with sets in SQL Server 2000 and have the following table structure for one of my temp tables (#Periods):
RestCTR HoursCTR Duration Rest
----------------------------------------
1 337 2 0
2 337 46 1
3 337 2 0
4 337 46 1
5 338 1 0
6 338 46 1
7 338 2 0
8 338 46 1
9 338 1 0
10 339 46 1
...
What I'd like to do is to calculate the Sum of the 2 longest Rest periods for each HoursCTR, preferably using sets and temp tables (rather than cursors, or nested subqueries).
Here's the dream query that just won't work in SQL (no matter how many times I run it):
Select HoursCTR, SUM ( TOP 2 Duration ) as LongestBreaks
FROM #Periods
WHERE Rest = 1
Group By HoursCTR
The HoursCTR can have any number of Rest periods (including none).
My current solution is not very elegant and basically involves the following steps:
Get the max duration of rest, group by HoursCTR
Select the first (min) RestCTR row that returns this max duration for each HoursCTR
Repeat step 1 (excluding the rows already collected in step 2)
Repeat step 2 (again, excluding rows collected in step 2)
Combine the RestCTR rows (from step 2 and 4) into single table
Get SUM of the Duration pointed to by the rows in step 5, grouped by HoursCTR
If there are any set functions that cut this process down, they would be very welcome.

The best way to do this in SQL Server is with a common table expression, numbering the rows in each group with the windowing function ROW_NUMBER():
WITH NumberedPeriods AS (
SELECT HoursCTR, Duration, ROW_NUMBER()
OVER (PARTITION BY HoursCTR ORDER BY Duration DESC) AS RN
FROM #Periods
WHERE Rest = 1
)
SELECT HoursCTR, SUM(Duration) AS LongestBreaks
FROM NumberedPeriods
WHERE RN <= 2
GROUP BY HoursCTR
edit: I've added an ORDER BY clause in the partitioning, to get the two longest rests.
Mea culpa, I did not notice that you need this to work in Microsoft SQL Server 2000. That version doesn't support CTE's or windowing functions. I'll leave the answer above in case it helps someone else.
In SQL Server 2000, the common advice is to use a correlated subquery:
SELECT p1.HoursCTR, (SELECT SUM(t.Duration) FROM
(SELECT TOP 2 p2.Duration FROM #Periods AS p2
WHERE p2.HoursCTR = p1.HoursCTR
ORDER BY p2.Duration DESC) AS t) AS LongestBreaks
FROM #Periods AS p1

SQL 2000 does not have CTE's, nor ROW_NUMBER().
Correlated subqueries can need an extra step when using group by.
This should work for you:
SELECT
F.HoursCTR,
MAX (F.LongestBreaks) AS LongestBreaks -- Dummy max() so that groupby can be used.
FROM
(
SELECT
Pm.HoursCTR,
(
SELECT
COALESCE (SUM (S.Duration), 0)
FROM
(
SELECT TOP 2 T.Duration
FROM #Periods AS T
WHERE T.HoursCTR = Pm.HoursCTR
AND T.Rest = 1
ORDER BY T.Duration DESC
) AS S
) AS LongestBreaks
FROM
#Periods AS Pm
) AS F
GROUP BY
F.HoursCTR

Unfortunately for you, Alex, you've got the right solution: correlated subqueries, depending upon how they're structured, will end up firing multiple times, potentially giving you hundreds of individual query executions.
Put your current solution into the Query Analyzer, enable "Show Execution Plan" (Ctrl+K), and run it. You'll have an extra tab at the bottom which will show you how the engine went about the process of gathering your results. If you do the same with the correlated subquery, you'll see what that option does.
I believe that it's likely to hammer the #Periods table about as many times as you have individual rows in that table.
Also - something's off about the correlated subquery, seems to me. Since I avoid them like the plague, knowing that they're evil, I'm not sure how to go about fixing it up.

Related

Fetching a minimum of N rows, plus all peers of the last row

I have a sample table named assets which looks like this:
id
name
block_no
1
asset1
2
2
asset2
2
3
asset3
3
There can be any number of assets in a specific block. I need a minimum of 100 rows from the table, and containing all the data from the block_no. Like, if there are 95 rows to block_no 2 and around 20 on block_no 3, I need all 20 of block_no 3 as if I am fetching data in packets based on block_no.
Is this possible and feasible?

Postgres 13 or later
There is a dead simple solution using WITH TIES in Postgres 13 or later:
SELECT *
FROM assets
WHERE block_no >= 2 -- your starting block
ORDER BY block_no
FETCH FIRST 100 ROWS WITH TIES;
This will return at least 100 rows (if enough qualify), plus all peers of the 100th row.
If your table isn't trivially small, an index on (block_no) is essential for performance.
See:
Get top row(s) with highest value, with ties
Older versions
Use the window function rank() in a subquery:
SELECT (a).*
FROM (
SELECT a, rank() OVER (ORDER BY block_no) AS rnk
FROM assets a
) sub
WHERE rnk <= 100;
Same result.
I use a little trick with the row type to strip the added rnk from the result. That's an optional addition.
See:
PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?

How to do a SQL count only up to x number?

I want to know whether a given query has more than x elements in a performant way. Let's say I have a query that outputs 2 billion rows but I only want to know if the result set is bigger than 10k, how would I do this without the SQL engine counting up to the 2 billion?
I tried this
SELECT 1
WHERE EXISTS (SELECT COUNT(1) FROM mytable
WHERE somefilter = 58
HAVING COUNT(1) < 10000)
but it seems just as slow (or more) than a simple count
SELECT COUNT(1)
WHERE somefilter = 58
This is for SQL Server 2016.
Any ideas?

You can do:
select 1
where (select count(*)
from (select top (10000) 1
from mytable
where somefilter = 58
) x
) < 10000
The innermost subquery returns at most 10,000 rows. If there are more than 10,000 then the query stops at the first 10,000 and returns 10,000.
Note: If there are fewer than 10,000 rows, then this will not affect performance, because all rows will need to be generated. You may need to add indexes or partitions to really improve performance.

SQL: Select Top 2 Query is Excluding Records with more than 2 Records

I just joined after having a problem writing a query in MS Access. I am trying to write a query that will pull out the first two valid samples in from a list of replicated sample results and then would like to average the sample values. I have written a query that does pull samples with only two valid samples and averages these values. However, my query doesn't pull samples where there are more than two valid sample results. Here's my query:
SELECT temp_platevalid_table.samp_name AS samp_name, avg (temp_platevalid_table.mean_conc) AS fin_avg, count(temp_platevalid_table.samp_valid) AS sample_count
FROM Temp_PlateValid_table
WHERE (Temp_PlateValid_table.id In (SELECT TOP 2 S.id
FROM Temp_PlateValid_table as S
WHERE S.samp_name = S.samp_name and s.samp_valid=1 and S.samp_valid=1
ORDER BY ID))
GROUP BY Temp_PlateValid_table.samp_name
HAVING ((Count(Temp_PlateValid_table.samp_valid))=2)
ORDER BY Temp_PlateValid_table.samp_name;
Here's an example of what I'm trying to do:
ID Samp_Name Samp_Valid Mean_Conc
1 54d2d2 1 15
2 54d2d2 1 20
3 54d2d2 1 25
The average mean_conc should be 17.5, however, with my current query, I wouldn't receive a value at all for 54d2d2. Is there a way to tweak my query so that I get a value for samples that have more than two valid values? Please note that I'm using MS Access, so I don't think I can use fancier SQL code (partition by, etc.).
Thanks in advance for your help!

Is this what you want?
select pv.samp_name, avg(pv.value_conc)
from Temp_PlateValid_table pv
where pv.samp_valid = 1 and
pv.id in (select top 2 id
from Temp_PlateValid_table as pv2
where pv2.samp_name = pv.samp_name and pv2.samp_valid = 1
)
group by pv.samp_name;
You might need avg(pv.value_conc * 1.0).

Manually specify starting value for Row_Number()

I want to define the start of ROW_NUMBER() as 3258170 instead of 1.
I am using the following SQL query
SELECT ROW_NUMBER() over(order by (select 3258170)) as 'idd'.
However, the above query is not working. When I say not working I mean its executing but its not starting from 3258170. Can somebody help me?
The reason I want to specify the row number is I am inserting Rows from one table to another. In the first Table the last record's row number is 3258169 and when I insert new records I want them to have the row number from 3258170.

Just add the value to the result of row_number():
select 3258170 - 1 + row_number() over (order by (select NULL)) as idd
The order by clause of row_number() is specifying what column is used for the order by. By specifying a constant there, you are simply saying "everything has the same value for ordering purposes". It has nothing, nothing at all to do with the first value chosen.
To avoid confusion, I replaced the constant value with NULL. In SQL Server, I have observed that this assigns a sequential number without actually sorting the rows -- an observed performance advantage, but not one that I've seen documented, so we can't depend on it.

I feel this is easier
ROW_NUMBER() OVER(ORDER BY Field) - 1 AS FieldAlias (To start from 0)
ROW_NUMBER() OVER(ORDER BY Field) + 3258169 AS FieldAlias (To start from 3258170)

Sometimes....
The ROW_NUMBER() may not be the best solution especially when there could be duplicate records in the underlying data set (for JOIN queries etc.). This may result in more rows returned than expected. You may consider creating a SEQUENCE which can be in some cases considered a cleaner solution.
i.e.:
CREATE SEQUENCE myRowNumberId
START WITH 1
INCREMENT BY 1
GO
SELECT NEXT VALUE FOR myRowNumberId AS 'idd' -- your query
GO
DROP SEQUENCE myRowNumberId; -- just to clean-up after ourselves
GO
The downside is that sequences may be difficult to use in complex queries with DISTINCT, WINDOW functions etc. See the complete sequence documentation here.

I had a situation where I was importing a hierarchical structure into an application where a seq number had to be unique within each hierarchical level and start at 110 (for ease of subsequent manual insertion). The data beforehand looked like this...
Level Prod Type Component Quantity Seq
1 P00210005 R NZ1500 57.90000000 120
1 P00210005 C P00210005M 1.00000000 120
2 P00210005M R M/C Operation 20.00000000 110
2 P00210005M C P00210006 1.00000000 110
2 P00210005M C P00210007 1.00000000 110
I wanted the row_number() function to generate the new sequence numbers but adding 10 and then multiplying by 10 wasn't achievable as expected. To force the sequence of arithmetic functions you have to enclose the entire row_number(), and partition clause in brackets. You can only perform simple addition and substraction on the row_number() as such.
So, my solution for this problem was
,10*(10+row_number() over (partition by Level order by Type desc, [Seq] asc)) [NewSeq]
Note the position of the brackets to allow the multiplication to occur after the addition.
Level Prod Type Component Quantity [Seq] [NewSeq]
1 P00210005 R NZ1500 57.90000000 120 110
1 P00210005 C P00210005M 1.00000000 120 120
2 P00210005M R M/C Operation 20.00000000 110 110
2 P00210005M C P00210006 1.00000000 110 120
2 P00210005M C P00210007 1.00000000 110 130

ROW_NUMBER() OVER(ORDER BY Field) - 1 AS FieldAlias (To start from 0)
ROW_NUMBER() OVER(ORDER BY Field) - 2862718 AS FieldAlias (To start from 2862718)
The order by clause of row_number() is specifying what column is used for the order by. By specifying a constant there, you are simply saying "everything has the same value for ordering purposes". It has nothing, nothing at all to do with the first value chosen.

SQL Server 2005 - SUM'ing one field, but only for the first occurence of a second field

Platform: SQL Server 2005 Express
Disclaimer: I’m quite a novice to SQL and so if you are happy to help with what may be a very simple question, then I won’t be offended if you talk slowly and use small words :-)
I have a table where I want to SUM the contents of multiple rows. However, I want to SUM one column only for the first occurrence of text in a different column.
Table schema for table 'tblMain'
fldOne {varchar(100)} Example contents: “Dandelion“
fldTwo {varchar(8)} Example contents: “01:00:00” (represents hh:mm:ss)
fldThree {numeric(10,0)} Example contents: “65”
Contents of table:
Row number fldOne fldTwo fldThree
------------------------------------------------
1 Dandelion 01:00:00 99
2 Daisy 02:15:00 88
3 Dandelion 00:45:00 77
4 Dandelion 00:30:00 10
5 Dandelion 00:15:00 200
6 Rose 01:30:00 55
7 Daisy 01:00:00 22
etc. ad nausium
If I use:
Select * from tblMain where fldTwo < ’05:00:00’ order by fldOne, fldTwo desc
Then all rows are correctly returned, ordered by fldOne and then fldTwo in descending order (although in the example data I've shown, all the data is already in the correct order!)
What I’d like to do is get the SUM of each fldThree, but only from the first occurrence of each fldOne.
So, SUM the first Dandelion, Daisy and Rose that I come across. E.g.
99+88+55
At the moment, I’m doing this programmatically; return a RecordSet from the Select statement above, and MoveNext through each returned row, only adding fldThree to my ‘total’ if I’ve never seen the text from fldOne before. It works, but most of the Select queries return over 100k rows and so it’s quite slow (slow being a relative term – it takes about 50 seconds on my setup).
The actual select statement (selecting about 100k rows from 1.5m total rows) completes in under a second which is fine. The current programatic loop is quite small and tight, it's just the number of loops through the RecordSet that takes time. I'm using adOpenForwardOnly and adLockReadOnly when I open the record set.
This is a routine that basically runs continuously as more data is added, and also the fldTwo 'times' vary, so I can't be more specific with the Select statement.
Everything that I’ve so far managed to do natively with SQL seems to run quickly and I’m hoping I can take the logic (and work) away from my program and get SQL to take the strain.
Thanks in advance

The best way to approach this is with window functions. These let you enumerate the rows within a group. However, you need some way to identify the first row. SQL tables are inherently unordered, so you need a column to specify the ordering. Here are some ideas.
If you have an id column, which is defined as an identity so it is autoincremented:
select sum(fldThree)
from (select m.*,
row_number() over (partition by fldOne order by id) as seqnum
from tblMain m
) m
where seqnum = 1
To get an arbitrary row, you could use:
select sum(fldThree)
from (select m.*,
row_number() over (partition by fldOne order by (select NULL as noorder)) as seqnum
from tblMain m
) m
where seqnum = 1
Or, if FldTwo has the values in reverse order:
select sum(fldThree)
from (select m.*,
row_number() over (partition by fldOne order by FldTwo desc) as seqnum
from tblMain m
) m
where seqnum = 1

Maybe this?
SELECT SUM(fldThree) as ExpectedSum
FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY fldOne ORDER BY fldTwo DSEC) Rn
FROM tblMain) as A
WHERE Rn = 1

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas