How to SELECT TOP 95% of the rows in a table - sql

I want to create a performance report based on table data.
I don't know how many rows there are in the table. I would like to get the top 95% of the rows, filtered by some WHERE condition.
Table Structure -
Column Name - txid , start_time, end_time
For my performance report I need the average of end_time - start_time. The value of (end_time - start_time) typically ranges from 100 ms to 1 sec.
However, a few transactions (less than 2%) took around 100 to 2,000 sec due to one technical error or another.
I want to exclude those rows to get a fair average. Including them in my report is a major concern.

You can use a subquery. I would just go for row_number() and count(*), although other window functions such as ntile(), percentile_cont(), and percentile_disc() could be used for this purpose:
select t.*
from (select t.*,
row_number() over (order by <ordering col>) as seqnum,
count(*) over () as cnt
from t
where . . .
) t
where seqnum <= 0.95 * cnt;
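As a rough illustration of the row_number()/count(*) approach applied to the average from the question, here is a minimal sketch using Python's built-in sqlite3 module. The table name tx, the sample data, and the use of SQLite (3.25 or later, for window functions) are assumptions made for the demo; only the column names come from the question.

```python
import sqlite3

# Hypothetical data: 95 fast transactions (200 ms) and 5 slow outliers (500,000 ms).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (txid INTEGER PRIMARY KEY, start_time INTEGER, end_time INTEGER)")
rows = [(i, 0, 200) for i in range(1, 96)] + [(i, 0, 500_000) for i in range(96, 101)]
conn.executemany("INSERT INTO tx VALUES (?, ?, ?)", rows)

# Number the rows by duration (fastest first) and keep the first 95% of them.
query = """
SELECT AVG(duration), COUNT(*)
FROM (SELECT end_time - start_time AS duration,
             ROW_NUMBER() OVER (ORDER BY end_time - start_time) AS seqnum,
             COUNT(*) OVER () AS cnt
      FROM tx) t
WHERE seqnum <= 0.95 * cnt
"""
trimmed_avg, kept = conn.execute(query).fetchone()

# Plain average over all rows, for comparison.
raw_avg = conn.execute("SELECT AVG(end_time - start_time) FROM tx").fetchone()[0]
print(trimmed_avg, kept, raw_avg)
```

Ordering by the duration itself is what pushes the slow outliers into the discarded 5%; ordering by any other column would drop an arbitrary 5% instead.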

Supposing you have a table TABLE with a field id:
select top (
(select count(Id) FROM [TABLE])*95/100
) id FROM [TABLE]

In TSQL:
DECLARE @ourCount AS int
DECLARE @topNinetyFive AS int
SELECT @ourCount = COUNT(1) FROM [ourDatabase].[dbo].[ourTable]
SET @topNinetyFive = ROUND(0.95 * @ourCount, 0)
SELECT TOP (@topNinetyFive) * FROM [ourDatabase].[dbo].[ourTable]
-- NOTE: a more meaningful criteria could be based on one of the columns with a 'where' clause

Related

How to query samples in relativity?

I have a large data set with about 100 million rows. I want to 'compress' it by taking a 1% sample of the entire data set while preserving the relative distribution of the data.
How can such a query be implemented?
Step 1: create the helper table
You can use aggregation to group records by visit_number, and CROSS JOIN with a query that computes the total number of records in the table to compute each group's share of the total:
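CREATE TABLE my_helper AS
SELECT
t.visit_number,
COUNT(*) visit_count,
SUM(t.purchase_id) sum_purchase,
-- multiply by 1.0 so the ratio is not truncated by integer division
COUNT(*) * 1.0 / total.cnt distribution
FROM
mytable t
CROSS JOIN (SELECT COUNT(*) cnt FROM mytable) total
-- total.cnt is a single value, but most databases require it in GROUP BY
GROUP BY t.visit_number, total.cnt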
Step 2: sample the main table using the helper table
Within a subquery, you can use ROW_NUMBER() OVER(PARTITION BY visit_number ORDER BY RANDOM()) to assign a random rank to each record within groups of records sharing the same visit_number. Then, in the outer query, you can join on the helper table to select the correct number of records for each visit_number:
SELECT x.*
FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY visit_number ORDER BY RANDOM()) rn
FROM mytable t
) x
INNER JOIN my_helper h ON h.visit_number = x.visit_number
WHERE x.rn <= 1000000 * h.distribution
Side notes:
this only works if there are indeed more than 1 million records in the source table
the exact number of records in the output might be slightly below or above 1 million (depending on the distribution in the original table)
it should be possible to combine both queries into a single one, which would avoid the need to use a helper table
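One possible way to fold both steps into a single query is to compute the per-group count with a window function instead of a helper table. The sketch below is only an illustration under assumptions: it uses Python's sqlite3 module (SQLite 3.25 or later), a small made-up mytable, and an integer comparison rn * 100 <= group_cnt to express "1% of each group", which plays the role of 1000000 * distribution when the source has 100 million rows.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, visit_number INTEGER)")
# Hypothetical data: 9,000 rows with visit_number 1 and 1,000 rows with visit_number 2.
conn.executemany("INSERT INTO mytable (visit_number) VALUES (?)",
                 [(1,)] * 9000 + [(2,)] * 1000)

# Random rank within each group, plus the group's size, in one pass.
query = """
SELECT id, visit_number
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY visit_number ORDER BY RANDOM()) AS rn,
             COUNT(*) OVER (PARTITION BY visit_number) AS group_cnt
      FROM mytable t) x
WHERE rn * 100 <= group_cnt   -- keep 1% of each group
"""
sample = conn.execute(query).fetchall()

# How many sampled rows each visit_number contributed.
sample_sizes = Counter(visit_number for _, visit_number in sample)
print(len(sample), dict(sample_sizes))
```

The rows chosen differ on every run because of RANDOM(), but the number taken from each group stays proportional to that group's size.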
This is doable. A quick way is to take only every nth record.
1) order by an arbitrary column (probably ID)
2) assign a row number (rownum) to each record
3) filter with mod(rownum, n) = 0 for whatever percentage makes sense (e.g. 1% would be mod(rownum, 100) = 0)
You may need steps 1 and 2 in a subquery and step 3 in the outer query.
Enjoy and good luck!
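The three steps above can be sketched as follows. This is a minimal illustration using Python's sqlite3 module (SQLite 3.25 or later); the table contents are made up, and ROW_NUMBER() stands in for the rownum attribute mentioned in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, visit_number INTEGER)")
# Hypothetical data: 10,000 rows with ids 1..10000.
conn.executemany("INSERT INTO mytable (visit_number) VALUES (?)",
                 [(i % 7,) for i in range(10_000)])

# Steps 1 and 2 happen in the subquery, step 3 in the outer query.
query = """
SELECT id, visit_number
FROM (SELECT t.*, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM mytable t) x
WHERE rn % 100 = 0            -- every 100th row, i.e. a 1% sample
ORDER BY rn
"""
sample = conn.execute(query).fetchall()
print(len(sample), sample[0])
```

Unlike the per-group approach, this takes a systematic sample: it preserves the distribution only as far as the ordering column is unrelated to the values being sampled.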

Get two most frequent data from SQL tbl?

I have a table, call it tbl_test, into which data is continuously being inserted; it holds approximately 10^6 records at a time. It has the columns
Acquire_Id (value between 1 and 20),
Status_Msg (value between 'A' and 'Z'),
Status_Code (value between 1 and 26)
There is a one-to-one mapping between Status_Msg and Status_Code.
Now I want to get the two most frequent Status_Msg and Status_Code values, with their counts, for each acquirer, if they are present in the table.
The query should be inexpensive to run.
Most databases support the ANSI standard window functions. You can get what you want using row_number() (or rank() or dense_rank(), depending on how ties are returned) after aggregating the values.
The following returns exactly two rows for each acquirer (even if there are ties).
select t.*
from (select t.acquire_id, t.status_msg, t.status_code, count(*) as cnt,
row_number() over (partition by t.acquire_id order by count(*) desc) as seqnum
from tbl_test t
group by t.acquire_id, t.status_msg, t.status_code
) t
where seqnum <= 2;
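To see the shape of the result, here is a small runnable sketch using Python's sqlite3 module (SQLite 3.25 or later). The sample data is invented, and the aggregation is moved into an inner subquery rather than combined with the window function in one SELECT; that is a variation chosen here for portability, not a change in the technique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_test (acquire_id INTEGER, status_msg TEXT, status_code INTEGER)")
# Hypothetical data: (acquire_id, status_msg, status_code, number of occurrences).
data = [(1, "A", 1, 5), (1, "B", 2, 3), (1, "C", 3, 1), (2, "Z", 26, 2), (2, "A", 1, 4)]
for acq, msg, code, n in data:
    conn.executemany("INSERT INTO tbl_test VALUES (?, ?, ?)", [(acq, msg, code)] * n)

# Aggregate first, then rank the aggregated rows within each acquirer.
query = """
SELECT acquire_id, status_msg, status_code, cnt
FROM (SELECT c.*,
             ROW_NUMBER() OVER (PARTITION BY acquire_id ORDER BY cnt DESC) AS seqnum
      FROM (SELECT acquire_id, status_msg, status_code, COUNT(*) AS cnt
            FROM tbl_test
            GROUP BY acquire_id, status_msg, status_code) c) t
WHERE seqnum <= 2
ORDER BY acquire_id, cnt DESC
"""
top_two = conn.execute(query).fetchall()
for row in top_two:
    print(row)
```

Swapping ROW_NUMBER() for RANK() or DENSE_RANK() would return more than two rows per acquirer whenever counts are tied.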

Oracle SQL to get the min timestamp when the count of results is larger than a number

To improve performance, I need SQL that implements the following requirement.
Suppose there is a table with the following columns:
id timestamp value
How can I get the minimum timestamp (e.g. :t1) for which the count of the result is > 100000?
That is, the count(*) returned by the following SQL should be > 100000:
select count(*) from table where timestamp < :t1
My understanding of your question is: Find the earliest timestamp in the table for which there are at least 100,000 earlier rows.
There are probably many ways to do it; the main difficulty is trying to come up with an efficient one.
I think an analytic-function approach is most likely to work well. The most obvious choice is to use COUNT:
select min(timestamp) from (
select timestamp, count(*) over (order by timestamp rows between unbounded preceding and 1 preceding) earlier_rows
from table
)
where earlier_rows >= 100000
But I suspect using RANK or something similar will be faster:
select min(timestamp) from (
select timestamp, rank() over (order by timestamp) time_rank
from table
)
where time_rank > 100000
I'm not sure off the top of my head, but these may give slightly different results if there are duplicate timestamps.
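The difference with duplicate timestamps can be checked on a tiny data set. The sketch below uses Python's sqlite3 module (SQLite 3.25 or later) rather than Oracle, a made-up table events with a column ts, and a threshold of 3 instead of 100000; all of these are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER)")
# Hypothetical data with a duplicated timestamp.
conn.executemany("INSERT INTO events VALUES (?)", [(1,), (2,), (2,), (2,), (3,)])
THRESHOLD = 3

# COUNT over physically preceding rows: tied rows count each other as "earlier".
count_query = """
SELECT MIN(ts) FROM (
  SELECT ts,
         COUNT(*) OVER (ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS earlier_rows
  FROM events) c
WHERE earlier_rows >= ?
"""

# RANK is 1 + the number of rows with a strictly smaller ts, so ties share a rank.
rank_query = """
SELECT MIN(ts) FROM (
  SELECT ts, RANK() OVER (ORDER BY ts) AS time_rank FROM events) r
WHERE time_rank > ?
"""

via_count = conn.execute(count_query, (THRESHOLD,)).fetchone()[0]
via_rank = conn.execute(rank_query, (THRESHOLD,)).fetchone()[0]
print(via_count, via_rank)
```

On this data the COUNT version returns 2, even though only one row has ts < 2, while the RANK version returns 3, which does have at least three strictly earlier rows. If the requirement is defined by timestamp < :t1, the RANK form appears to be the one that matches it.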
This will give you the min and max value and the count
select
count(*),
min(t.timestamp),
max(t.timestamp)
from table t
where ( select count(*) from table t2 where t2.timestamp < :t1 ) > 100000

Group by every N records in T-SQL

I have some performance test results on the database, and what I want to do is to group every 1000 records (previously sorted in ascending order by date) and then aggregate results with AVG.
I'm actually looking for a standard SQL solution, however any T-SQL specific results are also appreciated.
The query looks like this:
SELECT TestId,Throughput FROM dbo.Results ORDER BY id
WITH T AS (
SELECT RANK() OVER (ORDER BY ID) Rank,
P.Field1, P.Field2, P.Value1, ...
FROM P
)
SELECT (Rank - 1) / 1000 GroupID, AVG(...)
FROM T
GROUP BY ((Rank - 1) / 1000)
;
Something like that should get you started. If you can provide your actual schema I can update as appropriate.
Credit for the answer goes to Yuck; I am only posting this as an answer so that I can include a code block. I ran a count test to see whether it was grouping by 1000, and the first set was 999. The query below produced set sizes of 1,000. Great query, Yuck.
WITH T AS (
SELECT RANK() OVER (ORDER BY sID) Rank, sID
FROM docSVsys
)
SELECT (Rank-1) / 1000 GroupID, count(sID)
FROM T
GROUP BY ((Rank-1) / 1000)
order by GroupID
I +1'd @Yuck, because I think that is a good answer. But it's worth mentioning NTILE().
Reason being, if you have 10,010 records (for example), then you'll have 11 groupings -- the first 10 with 1000 in them, and the last with just 10.
If you're comparing averages between each group of 1000, then you should either discard the last group as it's not a representative group, or...you could make all the groups the same size.
NTILE() would make all groups the same size; the only caveat is that you'd need to know how many groups you wanted.
So if your table had 25,250 records, you'd use NTILE(25), and your groupings would be approximately 1000 in size -- they'd actually be 1010 in size; the benefit being, they'd all be the same size, which might make them more relevant to each other in terms of whatever comparison analysis you're doing.
You could get the number of groups simply by
DECLARE @ntile int
SET @ntile = (SELECT count(1) from myTable) / 1000
And then modifying @Yuck's approach with the NTILE() substitution:
;WITH myCTE AS (
SELECT NTILE(@ntile) OVER (ORDER BY id) myGroup,
col1, col2, ...
FROM dbo.myTable
)
SELECT myGroup, col1, col2...
FROM myCTE
GROUP BY (myGroup), col1, col2...
;
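The 10,010-record example can be reproduced to compare the two groupings. This is a minimal sketch using Python's sqlite3 module (SQLite 3.25 or later) instead of T-SQL; the table contents are made up, and the group count is interpolated into the query text as a plain integer because NTILE expects a constant argument.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY)")
# Hypothetical data: 10,010 rows.
conn.executemany("INSERT INTO myTable VALUES (?)", [(i,) for i in range(1, 10_011)])

# Fixed-size groups of 1000 via integer division of the row number.
fixed = conn.execute("""
SELECT (rn - 1) / 1000 AS grp, COUNT(*)
FROM (SELECT ROW_NUMBER() OVER (ORDER BY id) AS rn FROM myTable) t
GROUP BY grp
ORDER BY grp
""").fetchall()

# Number of groups, computed the same way as @ntile in the answer.
n_groups = conn.execute("SELECT COUNT(*) / 1000 FROM myTable").fetchone()[0]

# Evenly sized groups via NTILE.
ntile = conn.execute(f"""
SELECT grp, COUNT(*)
FROM (SELECT NTILE({n_groups}) OVER (ORDER BY id) AS grp FROM myTable) t
GROUP BY grp
ORDER BY grp
""").fetchall()

print([size for _, size in fixed])
print([size for _, size in ntile])
```

The first query yields ten groups of 1000 plus a trailing group of 10, while NTILE spreads the extra rows so that all ten groups hold 1001 rows.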
The answer above does not actually assign a unique group ID to each 1000 records when / performs floating-point division, as it does in BigQuery; adding Floor() is needed in that case. The following will return all records from your table, with a unique GroupID for each 1000 rows:
WITH T AS (
SELECT RANK() OVER (ORDER BY your_field) Rank,
your_field
FROM your_table
WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field
FROM T
And for my needs, I wanted my GroupID to be a random set of characters, so I changed the Floor(...) GroupID to:
TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 10) AS STRING),'seed1'))) GroupID
Without the seed value, you and I would get exactly the same output, because we would just be computing SHA256 of the plain group numbers. Adding the seed makes the output specific to you while keeping it repeatable.
This is BigQuery syntax. T-SQL might be slightly different.
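The seeded-hash idea can be illustrated outside the database as well. The following is a rough Python sketch using hashlib; the function name group_token and its parameters are hypothetical, and the tokens are not guaranteed to match BigQuery's byte for byte, since BigQuery may render the Floor() result as a string differently.

```python
import hashlib

def group_token(rank: int, group_size: int = 1000, seed: str = "seed1") -> str:
    """Return a hex SHA-256 token for the group that a 1-based rank falls into."""
    group_index = (rank - 1) // group_size  # integer division, so no floor needed
    # Concatenate the group number and the seed, then hash, as in the SQL expression.
    return hashlib.sha256(f"{group_index}{seed}".encode("utf-8")).hexdigest()

# Ranks 1 and 1000 share a group; rank 1001 starts the next one.
print(group_token(1) == group_token(1000))
print(group_token(1000) == group_token(1001))
# A different seed gives a different, but still repeatable, token.
print(group_token(1, seed="another-seed") == group_token(1))
```

Because the token depends only on the group number and the seed, rerunning with the same seed reproduces the same IDs, which is the repeatability the answer describes.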
Lastly, if you want to leave off the last chunk that is not a full 1000, you can find it by doing:
WITH T AS (
SELECT RANK() OVER (ORDER BY your_field) Rank,
your_field
FROM your_table
WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field
, COUNT(*) OVER(PARTITION BY TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 1000) AS STRING),'seed1')))) AS CountInGroup
FROM T
ORDER BY CountInGroup
You can also use Row_Number() instead of rank. No Floor required.
declare @GroupSize int = 50
;with ct1 as ( select YourColumn, RowID = Row_Number() over(order by YourColumn)
from YourTable
)
select YourColumn, RowID, GroupID = (RowID-1)/@GroupSize + 1
from ct1
I read more about NTILE after reading @user15481328's answer
(resource: https://www.sqlservertutorial.net/sql-server-window-functions/sql-server-ntile-function/ )
and this solution allowed me to find the max date within each of the 25 groups of my data set:
with cte as (
select date,
NTILE(25) OVER ( order by date ) bucket_num
from mybigdataset
)
select max(date), bucket_num
from cte
group by bucket_num
order by bucket_num

MSSQL 2008 SP pagination and count number of total records

In my SP I have the following:
with Paging(RowNo, ID, Name, TotalOccurrences) as
(
SELECT ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo, V.ID, V.Name, R.TotalOccurrences FROM dbo.Videos V INNER JOIN ....
)
SELECT * FROM Paging WHERE RowNo BETWEEN 1 and 50
SELECT COUNT(*) FROM Paging
The result is that I get the error: invalid object name 'Paging'.
Can I query again the Paging table? I don't want to include the count for all results as a new column ... I would prefer to return as another data set. Is that possible?
Thanks, Radu
After more research I found another way of doing this:
with Paging(RowNo, ID, Name, TotalOccurrences) AS
(
SELECT ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo, V.ID, V.Name, R.TotalOccurrences FROM dbo.Videos V INNER JOIN ....
)
select RowNo, ID, Name, TotalOccurrences, (select COUNT(*) from Paging) as TotalResults from Paging where RowNo between (@PageNumber - 1) * @PageSize + 1 and @PageNumber * @PageSize;
I think this performs better than running the query twice.
You can't do that because the CTE you are defining will only be available to the FIRST query that appears after it's been defined. So when you run the COUNT(*) query, the CTE is no longer available to reference. That's just a limitation of CTEs.
So to do the COUNT as a separate step, you'd need to not use the CTE and instead use the full query to COUNT on.
Or, you could wrap the CTE up in an inline table valued function and use that instead, to save repeating the main query, something like this:
CREATE FUNCTION dbo.ufnExample()
RETURNS TABLE
AS
RETURN
(
with Paging(RowNo, ID, Name, TotalOccurrences) as
(
SELECT ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo, V.ID, V.Name, R.TotalOccurrences FROM dbo.Videos V INNER JOIN ....
)
SELECT * FROM Paging
)
GO
SELECT * FROM dbo.ufnExample() x WHERE RowNo BETWEEN 1 AND 50
SELECT COUNT(*) FROM dbo.ufnExample() x
Please be aware that Radu D's solution's query plan shows double hits to those tables. It is doing two executions under the covers. However, this still may be the best way as I haven't found a truly scalable 1-query design.
A less scalable 1-query design is to dump the complete ordered list into a @tablevariable, SELECT @@ROWCOUNT to get the full count, and select from @tablevariable where the row number is between X and Y. This works well for <10000 rows, but with results in the millions of rows, populating that @tablevariable gets expensive.
A hybrid approach is to populate this temp/variable up to 10000 rows. If not all 10000 rows are filled up, you're set. If 10000 rows are filled up, you'll need to rerun the search to get the full count. This works well if most of your queries return well under 10000 rows. The 10000 limit is a rough approximation, you can play around with this threshold for your case.
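A rough sketch of the hybrid approach follows. It is written with Python's sqlite3 module (SQLite 3.25 or later) rather than as a T-SQL stored procedure; the videos table, its columns, the temp table name capped, and the function page_with_total are all hypothetical stand-ins for the real search query.

```python
import sqlite3

def page_with_total(conn, page_number, page_size, threshold=10_000):
    """Return (page_rows, total_count) using a capped temp table."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS capped (row_no INTEGER, id INTEGER, name TEXT)")
    conn.execute("DELETE FROM capped")
    # Materialize at most `threshold` rows of the ordered result.
    conn.execute(
        """INSERT INTO capped
           SELECT ROW_NUMBER() OVER (ORDER BY total_occurrences DESC, id), id, name
           FROM videos
           ORDER BY total_occurrences DESC, id
           LIMIT ?""",
        (threshold,),
    )
    filled = conn.execute("SELECT COUNT(*) FROM capped").fetchone()[0]
    if filled < threshold:
        total = filled  # the cap was not reached, so this is the full count
    else:
        # Cap reached: rerun the search as a plain count to get the true total.
        total = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
    first = (page_number - 1) * page_size + 1
    last = page_number * page_size
    rows = conn.execute(
        "SELECT row_no, id, name FROM capped WHERE row_no BETWEEN ? AND ? ORDER BY row_no",
        (first, last),
    ).fetchall()
    return rows, total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (id INTEGER PRIMARY KEY, name TEXT, total_occurrences INTEGER)")
# Hypothetical data: 250 videos.
conn.executemany("INSERT INTO videos VALUES (?, ?, ?)",
                 [(i, f"video{i}", i % 50) for i in range(1, 251)])

small_page, small_total = page_with_total(conn, page_number=2, page_size=50)
capped_page, capped_total = page_with_total(conn, page_number=1, page_size=50, threshold=100)
print(len(small_page), small_total, len(capped_page), capped_total)
```

In this sketch, pages that lie beyond the threshold are not available from the capped table, so a caller would need to fall back to the full query for them.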
Write "AS" after the CTE table name Paging as below:
with Paging AS (RowNo, ID, Name, TotalOccurrences) as
(
ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo, V.ID, V.Name, R.TotalOccurrences FROM dbo.Videos V INNER JOIN ....
)
SELECT * FROM Paging WHERE RowNo BETWEEN 1 and 50
SELECT COUNT(*) FROM Paging