Random Samples of XX rows per Column Value - sql

I'm using T-SQL and require some sample output of random rows.
Typically I would write some SQL as per below
Select top 10 *
from SampleTable as ST
Order by NewID()
However, this time I want, say, 100 rows, split equally by another column's value, for instance the column 'Type'.
That is, 100 rows made up of a sample of 25 rows for Type A, 25 rows for Type B, 25 rows for Type C, and lastly 25 rows for Type D.
My 'Type' values are saved to a temp table
Select top 10 *
from SampleTable as ST
Inner Join #Types as TY
on TY.Type = ST.Type
Order by NewID()
I've seen NTILE but not sure if applicable for my problem.
Thanks.

Use ROW_NUMBER in conjunction with NEWID():
WITH cte AS (
SELECT ST.*, ROW_NUMBER() OVER (PARTITION BY ST.Type ORDER BY NEWID()) AS rn
FROM SampleTable AS ST
INNER JOIN #Types AS TY ON TY.Type = ST.Type
)
SELECT *
FROM cte
WHERE rn <= 25;
The above solution will return 25 records from each type (or however many fewer might be available), randomly.
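To check the per-type counts without a SQL Server instance, here is a minimal sketch of the same ROW_NUMBER pattern run through Python's sqlite3 module. It is an analog, not T-SQL: SQLite's random() stands in for NEWID(), a plain table named types stands in for the #Types temp table, and the row counts per type are made-up test data. It assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3
from collections import Counter


def sample_per_type(rows_per_type):
    """Return up to rows_per_type random rows for each type listed in types."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE SampleTable (id INTEGER PRIMARY KEY, type TEXT)")
    con.execute("CREATE TABLE types (type TEXT)")  # stand-in for #Types
    # Hypothetical data: 40 rows each for A, B, C and E, but only 10 for D.
    # E is not in the types list, so it must never be sampled.
    sizes = [("A", 40), ("B", 40), ("C", 40), ("D", 10), ("E", 40)]
    data = [(t,) for t, n in sizes for _ in range(n)]
    con.executemany("INSERT INTO SampleTable (type) VALUES (?)", data)
    con.executemany("INSERT INTO types VALUES (?)",
                    [("A",), ("B",), ("C",), ("D",)])
    query = """
        WITH cte AS (
            SELECT st.id, st.type,
                   ROW_NUMBER() OVER (PARTITION BY st.type
                                      ORDER BY random()) AS rn
            FROM SampleTable AS st
            INNER JOIN types AS ty ON ty.type = st.type
        )
        SELECT id, type FROM cte WHERE rn <= ?
    """
    result = con.execute(query, (rows_per_type,)).fetchall()
    con.close()
    return result


# Count how many rows came back for each type.
counts = Counter(row_type for _, row_type in sample_per_type(25))
print(dict(counts))
```

Type D only has 10 rows in this test data, so it contributes 10 rows rather than 25, which matches the "however many fewer might be available" remark.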

Related

How to find Max value in a column in SQL Server 2012

I want to find the max value in a column
ID CName Tot_Val PName
--------------------------------
1 1 100 P1
2 1 10 P2
3 2 50 P2
4 2 80 P1
Above is my table structure. I want to find the row with the maximum Tot_Val for each CName. In those four rows, ID 1 and 2 have the same CName but different Tot_Val and PName values. What I expect is to get the row with the larger Tot_Val out of ID 1 and 2, and likewise for ID 3 and 4.
Expected result:
ID CName Tot_Val PName
--------------------------------
1 1 100 P1
4 2 80 P1
I need the result to look like the above.
select Max(Tot_Val), CName
from table1
where PName in ('P1', 'P2')
group by CName
This is the query I have tried, but I am not able to bring PName into the result. If I add PName to the select list and the GROUP BY list, the rows multiply: for example, the result is 100 rows without PName but 600 rows with it. That is the problem.
Can someone please help me resolve this?
One possible option is to use a subquery. Give each row a number within each CName group ordered by Tot_Val. Then select the rows with a row number equal to one.
select x.*
from ( select mt.ID,
mt.CName,
mt.Tot_Val,
mt.PName,
row_number() over(partition by mt.CName order by mt.Tot_Val desc) as No
from MyTable mt ) x
where x.No = 1;
An alternative would be to use a common table expression (CTE) instead of a subquery to isolate the first result set.
with x as
(
select mt.ID,
mt.CName,
mt.Tot_Val,
mt.PName,
row_number() over(partition by mt.CName order by mt.Tot_Val desc) as No
from MyTable mt
)
select x.*
from x
where x.No = 1;
See both solutions in action in this fiddle.
You can search top-n-per-group for this kind of a query.
There are two common ways to do it. The most efficient method depends on your indexes and data distribution and whether you already have another table with the list of all CName values.
Using ROW_NUMBER
WITH
CTE
AS
(
SELECT
ID, CName, Tot_Val, PName,
ROW_NUMBER() OVER (PARTITION BY CName ORDER BY Tot_Val DESC) AS rn
FROM table1
)
SELECT
ID, CName, Tot_Val, PName
FROM CTE
WHERE rn=1
;
Using CROSS APPLY
WITH
CTE
AS
(
SELECT CName
FROM table1
GROUP BY CName
)
SELECT
A.ID
,A.CName
,A.Tot_Val
,A.PName
FROM
CTE
CROSS APPLY
(
SELECT TOP(1)
table1.ID
,table1.CName
,table1.Tot_Val
,table1.PName
FROM table1
WHERE
table1.CName = CTE.CName
ORDER BY
table1.Tot_Val DESC
) AS A
;
See a very detailed answer on dba.se, Retrieving n rows per group, or here: Get top 1 row of each group.
CROSS APPLY might be as fast as a correlated subquery, but the correlated subquery often has very good performance (and better than ROW_NUMBER()):
select t.*
from t
where t.tot_val = (select max(t2.tot_val)
from t t2
where t2.cname = t.cname
);
Note: The performance depends on having an index on (cname, tot_val).
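One difference between these approaches is easy to miss: when two rows in a group tie for the maximum, ROW_NUMBER keeps exactly one of them, while the correlated MAX subquery keeps all of them. The sketch below shows this on the question's four rows plus a hypothetical tied group (CName 3). It runs the queries through Python's sqlite3 module as a stand-in for SQL Server, and the ID tie-breaker in the ORDER BY is an addition to make the output deterministic.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE table1 (ID INTEGER, CName INTEGER, Tot_Val INTEGER, PName TEXT)"
)
con.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (1, 1, 100, "P1"), (2, 1, 10, "P2"),   # rows from the question
        (3, 2, 50, "P2"), (4, 2, 80, "P1"),
        (5, 3, 70, "P1"), (6, 3, 70, "P2"),    # hypothetical tie on Tot_Val
    ],
)

# One row per CName; ties are broken by the lowest ID.
row_number_sql = """
    SELECT ID, CName, Tot_Val, PName
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY CName
                                    ORDER BY Tot_Val DESC, ID) AS rn
          FROM table1 AS t) AS x
    WHERE rn = 1
    ORDER BY ID
"""

# Every row whose Tot_Val equals its group's maximum, so ties all survive.
correlated_sql = """
    SELECT t.ID, t.CName, t.Tot_Val, t.PName
    FROM table1 AS t
    WHERE t.Tot_Val = (SELECT MAX(t2.Tot_Val)
                       FROM table1 AS t2
                       WHERE t2.CName = t.CName)
    ORDER BY t.ID
"""

by_row_number = con.execute(row_number_sql).fetchall()
by_correlated = con.execute(correlated_sql).fetchall()
print(by_row_number)
print(by_correlated)
```

If exactly one row per group is required, the ROW_NUMBER form (or TOP(1) inside CROSS APPLY) is the safer choice; if tied rows should all be returned, the correlated subquery does that without extra work.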

How to average the top n in each SQL group

I'm trying to figure out how to average the top N values within each group. I have a table with two columns, Group and Value. My goal is to average the top N values within each group where N is different based on another table.
For group A, N equals 3 and is highlighted in red. The output is the average of the top 3 values.
For group B, N equals 2 and is highlighted in green. Because we only have 1 value of 2.2 for group B, we need to go to the filler table. The filler value for group B is 2.0, so we will average 2.2 and 2.0. If N = 5, then the filler value will be repeated 4 times for Group B.
My initial idea is to:
Rank the values in each group
Join it to the second table
Use where Rank <= N to remove the duplicates before averaging
However, I'm not sure how the filler table could be incorporated, since N could be greater than the number of values I have. I need to use SQL Server 2008.
First of all, I hope that you're using more adequate names than Group and Value. Here's some sample code that first numbers the rows within each group, then keeps the top N values and averages them, padding with the filler value when a group has fewer than N values. The code is untested as you didn't provide consumable sample data.
WITH CTE AS(
SELECT *,
ROW_NUMBER() OVER( PARTITION BY [Group] ORDER BY [Value] DESC) AS rn,
COUNT(*) OVER( PARTITION BY [Group]) ItemCount
FROM TableWithValues
)
SELECT c.[Group],
(SUM(c.[Value]) + CASE WHEN N.n > c.ItemCount
THEN (N.n - c.ItemCount) * F.Filler
ELSE 0 END) / N.n AS [Value]
FROM CTE c
JOIN TableWithN N ON c.[Group] = N.[Group] AND c.rn <= N.n
JOIN Fillers F ON c.[Group] = F.[Group]
-- Every non-aggregated column used in the SELECT list must be grouped.
GROUP BY c.[Group], N.n, c.ItemCount, F.Filler;
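Since the query above was posted untested, here is a small sketch that exercises the same logic on made-up data, using Python's sqlite3 module in place of SQL Server 2008. The table and column names (vals, top_n, fillers, grp, val) are hypothetical stand-ins for the question's tables, and group A's values are invented because the original picture is not available. Only group B (one value of 2.2, N = 2, filler 2.0) comes from the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE vals (grp TEXT, val REAL);
    CREATE TABLE top_n (grp TEXT, n INTEGER);
    CREATE TABLE fillers (grp TEXT, filler REAL);
    -- Group A values are invented; group B matches the question.
    INSERT INTO vals VALUES ('A', 5.0), ('A', 4.0), ('A', 3.0), ('A', 1.0),
                            ('B', 2.2);
    INSERT INTO top_n VALUES ('A', 3), ('B', 2);
    INSERT INTO fillers VALUES ('A', 1.5), ('B', 2.0);
""")

query = """
    WITH cte AS (
        SELECT grp, val,
               ROW_NUMBER() OVER (PARTITION BY grp ORDER BY val DESC) AS rn,
               COUNT(*) OVER (PARTITION BY grp) AS item_count
        FROM vals
    )
    SELECT c.grp,
           (SUM(c.val)
            + CASE WHEN tn.n > c.item_count
                   THEN (tn.n - c.item_count) * f.filler
                   ELSE 0 END) / tn.n AS avg_top_n
    FROM cte AS c
    JOIN top_n AS tn ON c.grp = tn.grp AND c.rn <= tn.n
    JOIN fillers AS f ON c.grp = f.grp
    GROUP BY c.grp, tn.n, c.item_count, f.filler
    ORDER BY c.grp
"""

# Map each group to the average of its top N values (padded with filler).
averages = dict(con.execute(query).fetchall())
print(averages)
```

Group A averages its top three values (5.0, 4.0, 3.0) and ignores the filler, while group B averages 2.2 with one copy of the 2.0 filler, as described in the question.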

get ROW NUMBER of random records

For a simple SQL like,
SELECT top 3 MyId FROM MyTable ORDER BY NEWID()
how to add row numbers to them so that the row numbers become 1,2, and 3?
UPDATE:
I thought I could simplify my question as above, but it turns out to be more complicated. So here is a fuller version: I need to give three random picks (from MyTable) to each person, with a pick/row number of 1, 2, and 3, and there is no logical join between person and picks.
SELECT * FROM Person
LEFT JOIN (
SELECT top 3 MyId FROM MyTable ORDER BY NEWID()
) D ON 1=1
The problems with the above SQL are:
Obviously, a pick/row number of 1, 2, and 3 should be added.
What is less obvious is that the above SQL gives each person the same picks, whereas I need to give each person different picks.
Here is a working SQL to test it out:
SELECT TOP 15 database_id, create_date, cs.name FROM sys.databases
CROSS apply (
SELECT top 3 Row_number()OVER(ORDER BY (SELECT NULL)) AS RowNo,*
FROM (SELECT top 3 name from sys.all_views ORDER BY NEWID()) T
) cs
So, please help.
NOTE: This is NOT about MySQL but T-SQL. Their syntax is different, so the solution is different as well.
Add Row_number to outer query. Try this
SELECT Row_number()OVER(ORDER BY (SELECT NULL)),*
FROM (SELECT TOP 3 MyId
FROM MyTable
ORDER BY Newid()) a
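Logically, TOP is processed after the SELECT list. If the row number were generated in the same query as the TOP, it would be assigned across all rows first, and only then would the 3 random records be pulled, so the numbers would not be 1, 2, and 3. That is why the row number should be generated in the outer query and not in the original query.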
Update
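It can be achieved through CROSS APPLY. Replace the column name inside the CROSS APPLY's WHERE clause with a valid column name from the Person table. The reference to the outer table is what makes the inner query correlated, so it is evaluated again for each person.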
SELECT *
FROM Person p
CROSS apply (SELECT Row_number()OVER(ORDER BY (SELECT NULL)) rn,*
FROM (SELECT TOP 3 MyId
FROM MyTable
WHERE p.some_col = p.some_col -- Replace it with some column from person table
ORDER BY Newid())a) cs
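As a way to sanity-check the goal (three different random picks per person, numbered 1 to 3), here is a sketch of a different technique from the CROSS APPLY above: cross join every person to every candidate row, then number the candidates randomly within each person and keep the first three. It is run through Python's sqlite3 module with random() in place of NEWID(); the PersonId and Name columns and all the data are hypothetical. The same shape should carry over to T-SQL, but note that it materializes persons times candidates rows, so it only suits small tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Person (PersonId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE MyTable (MyId INTEGER PRIMARY KEY);
    INSERT INTO Person (Name) VALUES ('Ann'), ('Bob'), ('Cy');
""")
# Twenty hypothetical candidate rows to pick from.
con.executemany("INSERT INTO MyTable VALUES (?)", [(i,) for i in range(1, 21)])

query = """
    SELECT PersonId, MyId, rn
    FROM (SELECT p.PersonId, m.MyId,
                 ROW_NUMBER() OVER (PARTITION BY p.PersonId
                                    ORDER BY random()) AS rn
          FROM Person AS p
          CROSS JOIN MyTable AS m) AS ranked
    WHERE rn <= 3
    ORDER BY PersonId, rn
"""

# Each tuple is (PersonId, MyId, pick number).
picks = con.execute(query).fetchall()
for row in picks:
    print(row)
```

Because the random ordering is computed inside each person's partition, the picks are drawn independently per person, and the row number doubles as the pick number.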

Select random rows from multiple tables in one query

I'm trying to insert some dummy data into a table (A), for which I need the IDs from two other tables (B and C). How can I get n rows, each with a random B.Id and a random C.Id?
I've got:
select
(Select top 1 ID from B order by newid()) as 'B.Id',
(select top 1 ID from C order by newid()) as 'C.Id'
which gives me random Ids from each table, but what's the best way to get n of these? I've tried joining on a large table and doing top n, but the IDs from B and C are the same random Ids repeated for each row.
So looking to end up with something like this, but able to specify N rows.
INSERT INTO A (B-Id,C-Id,Note)
select
(Select top 1 ID from B order by newid()) as 'B.Id',
(select top 1 ID from C order by newid()) as 'C.Id',
'Rar'
So if B had Ids 1,2,3,4 and C had Ids 11,12,13,14, I'm after the equivalent of:
INSERT INTO A (B-Id,C-Id,Note)
Values
(3,11,'rar'), (1,14,'rar'),(4,11,'rar')
Where the Ids from each table are combined at random
If you want to avoid duplicates, you can use row_number() to enumerate the values in each table (randomly) and then join them:
select b.id as b_id, c.id as c_id
from (select b.*, row_number() over (order by newid()) as seqnum
from b
) b join
(select c.*, row_number() over (order by newid()) as seqnum
from c
) c
on b.seqnum = c.seqnum;
You can just add top N or where seqnum <= N to limit the number.
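Here is a small sketch of that seqnum join on the question's example Ids (1 to 4 in B, 11 to 14 in C), run through Python's sqlite3 module with random() standing in for newid(). The aliases rb and rc and the random_pairs helper are illustrative names only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (id INTEGER);
    INSERT INTO b VALUES (1), (2), (3), (4);
    INSERT INTO c VALUES (11), (12), (13), (14);
""")

# Number each table's rows in an independent random order, then pair
# row 1 with row 1, row 2 with row 2, and so on.
query = """
    SELECT rb.id AS b_id, rc.id AS c_id
    FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY random()) AS seqnum
          FROM b) AS rb
    JOIN (SELECT id, ROW_NUMBER() OVER (ORDER BY random()) AS seqnum
          FROM c) AS rc
      ON rb.seqnum = rc.seqnum
    WHERE rb.seqnum <= ?
"""


def random_pairs(n):
    """Return up to n (b_id, c_id) pairs with no id repeated on either side."""
    return con.execute(query, (n,)).fetchall()


pairs = random_pairs(3)
print(pairs)
```

Note that this pairing can never return more rows than the smaller of the two tables holds, because each id is used at most once.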
If I'm reading your question correctly, I think you want N random rows from the union of the two tables - so on any given execution you will get X rows from table B and N-X rows from table C. To accomplish this, you first UNION tables B and C together, then ORDER BY the random value generated by NEWID() while pulling your overall TOP N.
SELECT TOP 50 --or however many you like
DerivedUnionOfTwoTables.[ID],
DerivedUnionOfTwoTables.[Source]
FROM
(
(SELECT NEWID() AS [Random ID], [ID], 'Table B' AS [Source] FROM B)
UNION ALL
(SELECT NEWID() AS [Random ID], [ID], 'Table C' AS [Source] FROM C)
) DerivedUnionOfTwoTables
ORDER BY
[Random ID] DESC
I included a column showing which source table any given record comes from so you could see the distribution of the two table sources changing each time it is executed. If you don't need it and/or don't care to verify, simply comment it out from the topmost select.
You shouldn't need to join to a large table - Select top N ID from B order by newid() should work as newid() works per-row (unlike RAND()). Your join is probably doing a cross-join which will give you multiple results for each newid value.

get random top n rows where n is greater than quantity of rows in table

I'm writing a script that generates random data. I have two tables, one that stores first names and a second that stores surnames.
I want to get, e.g., 1000 random pairs of first name and surname. I can achieve it using the following code:
with x as (
select top 1000 f.firstName from dbo.firstNames f order by newid()
), xx as (
select x.firstName, row_number() over(order by x.firstName) as nameNo from x
), y as (
select top 1000 s.surName from dbo.surNames s order by newid()
), yy as (
select y.surName, row_number() over(order by y.surName) as nameNo from y
)
select xx.firstName, yy.surName
from xx inner join yy on (xx.nameNo=yy.nameNo)
...but what if one of my tables contains fewer than 1000 rows?
I wondered how to get n rows from a table when n is greater than the number of rows in the table/view and you don't mind repeated results.
The only way I could think of is to use a temp table and a while loop, filling it with random rows until there are enough rows. But I wonder if it's possible to do it with a single select? I'm currently using SQL Server 2012 on my PC, but I would appreciate it if I could run it under SQL Server 2008, too.
You could do the randomization after the cross join:
select top 1000 fn.firstname, sn.surname
from firstnames fn cross join
surnames sn
order by newid();
I'm the first to admit that the problem with this approach is performance, but it does work in theory. And performance is probably fine if the tables have at most a few hundred rows.
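To see how the cross join behaves when the tables are small, here is a sketch using Python's sqlite3 module as a stand-in for SQL Server, with ORDER BY random() LIMIT n in place of TOP n ... ORDER BY newid(). The names in the two tables are made-up test data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE firstNames (firstName TEXT);
    CREATE TABLE surNames (surName TEXT);
    INSERT INTO firstNames VALUES ('Ada'), ('Ben'), ('Cleo'), ('Dan'), ('Eve');
    INSERT INTO surNames VALUES ('Kowalski'), ('Nowak'), ('Smith'), ('Jones');
""")

query = """
    SELECT fn.firstName, sn.surName
    FROM firstNames AS fn
    CROSS JOIN surNames AS sn
    ORDER BY random()
    LIMIT ?
"""


def random_name_pairs(n):
    """Return up to n distinct (firstName, surName) pairs in random order."""
    return con.execute(query, (n,)).fetchall()


pairs_12 = random_name_pairs(12)      # more rows than either table holds
pairs_1000 = random_name_pairs(1000)  # more rows than the cross join holds
print(len(pairs_12), len(pairs_1000))
```

With 5 first names and 4 surnames the cross join has 20 rows, so asking for 12 pairs works even though neither table has 12 rows, but asking for 1000 still returns only 20. The approach therefore covers the question only while the product of the two row counts is at least n, and each pair appears at most once.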
If you want 1000 random pairs then 32 from each table should suffice (32*32=1024):
WITH f1 AS (
SELECT TOP 32 firstName FROM dbo.firstNames ORDER BY newid()
), s1 AS (
SELECT TOP 32 surName FROM dbo.surNames ORDER BY newid()
)
SELECT f1.firstName, s1.surName
FROM f1 CROSS JOIN s1;
If that's not random enough then you might try the following:
WITH f1 AS (
SELECT TOP 100 firstName FROM dbo.firstNames ORDER BY newid()
), s1 AS (
SELECT TOP 100 surName FROM dbo.surNames ORDER BY newid()
)
SELECT TOP 1000 f1.firstName, s1.surName
FROM f1 CROSS JOIN s1
ORDER BY newid();
The above would get the 10,000 combinations and select 1,000 of them at random.