Select random rows from multiple tables in one query - sql

I'm trying to insert some dummy data into a table (A), for which I need the IDs from two other tables (B and C). How can I get n rows with a random B.Id and a random C.Id.
I've got:
select
(Select top 1 ID from B order by newid()) as 'B.Id',
(select top 1 ID from C order by newid()) as 'C.Id'
which gives me random Ids from each table, but what's the best way to get n of these? I've tried joining on a large table and doing top n, but the IDs from B and C are the same random Ids repeated for each row.
So looking to end up with something like this, but able to specify N rows.
INSERT INTO A (B-Id,C-Id,Note)
select
(Select top 1 ID from B order by newid()) as 'B.Id',
(select top 1 ID from C order by newid()) as 'C.Id',
'Rar'
So if B had Ids 1,2,3,4 and C had Ids 11,12,13,14, i'm after the equivalent of:
INSERT INTO A (B-Id,C-Id,Note)
Values
(3,11,'rar'), (1,14,'rar'),(4,11,'rar')
Where the Ids from each table are combined at random

If you want to avoid duplicates, you can use row_number() to enumerate the values in each table (randomly) and then join them:
select b.id as b_id, c.id as c_id
from (select b.*, row_number() over (order by newid()) as seqnum
from b
) b join
(select c.*, row_number() over (order by newid()) as seqnum
from c
) c
on b.seqnum = c.seqnum;
You can just add top N or where seqnum <= N to limit the number.

If I'm reading your question correctly, I think you want N random rows from the union of the two tables - so on any given execution you will get X rows from table B and N-X rows from table C. To accomplish this, you first UNION tables B and C together, then ORDER BY the random value generated by NEWID() while pulling your overall TOP N.
SELECT TOP 50 --or however many you like
DerivedUnionOfTwoTables.[ID],
DerivedUnionOfTwoTables.[Source]
FROM
(
(SELECT NEWID() AS [Random ID], [ID], 'Table B' AS [Source] FROM B)
UNION ALL
(SELECT NEWID() AS [Random ID], [ID], 'Table C' AS [Source] FROM C)
) DerivedUnionOfTwoTables
ORDER BY
[Random ID] DESC
I included a column showing which source table any given record comes from so you could see the distribution of the two table sources changing each time it is executed. If you don't need it and/or don't care to verify, simply comment it out from the topmost select.

You shouldn't need to join to a large table - Select top N ID from B order by newid() should work as newid() works per-row (unlike RAND()). Your join is probably doing a cross-join which will give you multiple results for each newid value.

Related

Random Samples of XX rows per Column Value

I'm using T-SQL and require some sample output of random rows.
Typically I would write some SQL as per below
Select top 10 *
from SampleTable as ST
Order by NewID()
However this time I want say 100 rows but them split equally by another column value for instance Column 'Type'.
100 Rows with a sample of 25 rows for TypeA , 25 rows for Type B, 25 rows for Type C and lastly 25 rows for Type D scenerio.
My 'Type' values are saved to a temp table
Select top 10 *
from SampleTable as ST
Inner Join #Types as TY
on TY.Type = ST.Type
Order by NewID()
I've seen NTILE but not sure if applicable for my problem.
Thanks.
Use ROW_NUMBER in conjunction with NEWID():
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ST.Type ORDER BY NEWID()) rn
FROM SampleTable AS ST
INNER JOIN #TypesAS TY ON TY.Type = ST.Type
)
SELECT *
FROM cte
WHERE rn <= 25;
The above solution will return 25 records from each type (or however many fewer might be available), randomly.

How to average the top n in each SQL group

I'm trying to figure out how to average the top N values within each group. I have a table with two columns, Group and Value. My goal is to average the top N values within each group where N is different based on another table.
For group A, N equals 3 and is highlighted in red. The output is the average of the top 3 values.
For group B, N equals 2 and is highlighted in green. Because we only have 1 value of 2.2 for group B, we need to go to the filler table. The filler value for group B is 2.0, so we will average 2.2 and 2.0. If N = 5, then the filler value will be repeated 4 times for Group B.
My initial idea is to:
Rank the values in each group
Join it to the second table
Use where Rank <= N to remove the duplicates before averaging
However, I not sure how the filling table could be incorporated since N could be greater than the number of values I have. I do need to use SQL Server 2008.
First of all, I hope that you're using more adequate names instead of Group and Value. Here's a sample code that first defines the order to later define the N values that will be used and get an average from those. The code is untested as you didn't provide consumable sample data.
WITH CTE AS(
SELECT *,
ROW_NUMBER() OVER( PARTITION BY [Group] ORDER BY [Value] DESC) AS rn,
COUNT(*) OVER( PARTITION BY [Group]) ItemCount
FROM TableWithValues
)
SELECT [Group],
(SUM( [Value]) + CASE WHEN N.n > c.ItemCount
THEN (N.n - c.ItemCount) * F.Filler
ELSE 0 END)/ N.n AS [Value]
FROM CTE c
JOIN TableWithN N ON c.[Group] = N.[Group] AND c.rn <= N.n
JOIN Fillers F ON c.[Group] = F.[Group]
GROUP BY [Group];

get ROW NUMBER of random records

For a simple SQL like,
SELECT top 3 MyId FROM MyTable ORDER BY NEWID()
how to add row numbers to them so that the row numbers become 1,2, and 3?
UPDATE:
I thought I can simplify my question as above, but it turns out to be more complicated. So here is a fuller version -- I need to give three random picks (from MyTable) for each person, with pick/row number of 1, 2, and 3, and there is no logical joining between person and picks.
SELECT * FROM Person
LEFT JOIN (
SELECT top 3 MyId FROM MyTable ORDER BY NEWID()
) D ON 1=1
The problem with above SQL are,
Obviously, pick/row number of 1, 2, and 3 should be added
and what is not obvious is that, the above SQL will give each person the same picks, whereas I need to give different person different picks
Here is a working SQL to test it out:
SELECT TOP 15 database_id, create_date, cs.name FROM sys.databases
CROSS apply (
SELECT top 3 Row_number()OVER(ORDER BY (SELECT NULL)) AS RowNo,*
FROM (SELECT top 3 name from sys.all_views ORDER BY NEWID()) T
) cs
So, Please help.
NOTE: This is NOT about MySQL byt T-SQL as their syntax are different, Thus the solution is different as well.
Add Row_number to outer query. Try this
SELECT Row_number()OVER(ORDER BY (SELECT NULL)),*
FROM (SELECT TOP 3 MyId
FROM MyTable
ORDER BY Newid()) a
Logically TOP keyword is processed after Select. After Row Number is generated random 3 records will be pulled. So you should not generate Row Number in original query
Update
It can be achieved through CROSS APPLY. Replace the column names inside cross apply where clause with valid column name from Person table
SELECT *
FROM Person p
CROSS apply (SELECT Row_number()OVER(ORDER BY (SELECT NULL)) rn,*
FROM (SELECT TOP 3 MyId
FROM MyTable
WHERE p.some_col = p.some_col -- Replace it with some column from person table
ORDER BY Newid())a) cs

How can I get a random cartesian product in PostgreSQL?

I have two tables, custassets and tags. To generate some test data I'd like to do an INSERT INTO a many-to-many table with a SELECT that gets random rows from each (so that a random primary key from one table is paired with a random primary key from the second). To my surprise this isn't as easy as I first thought, so I'm persisting with this to teach myself.
Here's my first attempt. I select 10 custassets and 3 tags, but both are the same in each case. I'd be fine with the first table being fixed, but I'd like to randomise the tags assigned.
SELECT
custassets_rand.id custassets_id,
tags_rand.id tags_rand_id
FROM
(
SELECT id FROM custassets WHERE defunct = false ORDER BY RANDOM() LIMIT 10
) AS custassets_rand
,
(
SELECT id FROM tags WHERE defunct = false ORDER BY RANDOM() LIMIT 3
) AS tags_rand
This produces:
custassets_id | tags_rand_id
---------------+--------------
9849 | 3322 }
9849 | 4871 } this pattern of tag PKs is repeated
9849 | 5188 }
12145 | 3322
12145 | 4871
12145 | 5188
17837 | 3322
17837 | 4871
17837 | 5188
....
I then tried the following approach: doing the second RANDOM() call in the SELECT column list. However this one was worse, as it chooses a single tag PK and sticks with it.
SELECT
custassets_rand.id custassets_id,
(SELECT id FROM tags WHERE defunct = false ORDER BY RANDOM() LIMIT 1) tags_rand_id
FROM
(
SELECT id FROM custassets WHERE defunct = false ORDER BY RANDOM() LIMIT 30
) AS custassets_rand
Result:
custassets_id | tags_rand_id
---------------+--------------
16694 | 1537
14204 | 1537
23823 | 1537
34799 | 1537
36388 | 1537
....
This would be easy in a scripting language, and I'm sure can be done quite easily with a stored procedure or temporary table. But can I do it just with a INSERT INTO SELECT?
I did think of choosing integer primary keys using a random function, but unfortunately the primary keys for both tables have gaps in the increment sequences (and so an empty row might be chosen in each table). That would have been fine otherwise!
Note that what you are looking for is not a Cartesian product, which would produce n*m rows; rather a random 1:1 association, which produces GREATEST(n,m) rows.
To produce truly random combinations, it's enough to randomize rn for the bigger set:
SELECT c_id, t_id
FROM (
SELECT id AS c_id, row_number() OVER (ORDER BY random()) AS rn
FROM custassets
) x
JOIN (SELECT id AS t_id, row_number() OVER () AS rn FROM tags) y USING (rn);
If arbitrary combinations are good enough, this is faster (especially for big tables):
SELECT c_id, t_id
FROM (SELECT id AS c_id, row_number() OVER () AS rn FROM custassets) x
JOIN (SELECT id AS t_id, row_number() OVER () AS rn FROM tags) y USING (rn);
If the number of rows in both tables do not match and you do not want to lose rows from the bigger table, use the modulo operator % to join rows from the smaller table multiple times:
SELECT c_id, t_id
FROM (
SELECT id AS c_id, row_number() OVER () AS rn
FROM custassets -- table with fewer rows
) x
JOIN (
SELECT id AS t_id, (row_number() OVER () % small.ct) + 1 AS rn
FROM tags
, (SELECT count(*) AS ct FROM custassets) AS small
) y USING (rn);
Window functions were added with PostgreSQL 8.4.
WITH a_ttl AS (
SELECT count(*) AS ttl FROM custassets c),
b_ttl AS (
SELECT count(*) AS ttl FROM tags),
rows AS (
SELECT gs.*
FROM generate_series(1,
(SELECT max(ttl) AS ttl FROM
(SELECT ttl FROM a_ttl UNION SELECT ttl FROM b_ttl) AS m))
AS gs(row)),
tab_a_rand AS (
SELECT custassets_id, row_number() OVER (order by random()) as row
FROM custassets),
tab_b_rand AS (
SELECT id, row_number() OVER (order by random()) as row
FROM tags)
SELECT a.custassets_id, b.id
FROM rows r
JOIN a_ttl ON 1=1 JOIN b_ttl ON 1=1
LEFT JOIN tab_a_rand a ON a.row = (r.row % a_ttl.ttl)+1
LEFT JOIN tab_b_rand b ON b.row = (r.row % b_ttl.ttl)+1
ORDER BY 1,2;
You can test this query on SQL Fiddle.
Here is a different approach to pick a single combination from 2 tables by random, assuming two tables a and b, both with primary key id. The tables needn't be of same size, and the second row is independently chosen from the first, which might not be that important for testdata.
SELECT * FROM a, b
WHERE a.id = (
SELECT id
FROM a
OFFSET (
SELECT random () * (SELECT count(*) FROM a)
)
LIMIT 1)
AND b.id = (
SELECT id
FROM b
OFFSET (
SELECT random () * (SELECT count(*) FROM b)
)
LIMIT 1);
Tested with two tables, one of size 7000 rows, one with 100k rows, result: immediately. For more than one result, you have to call the query repeatedly - increasing the LIMIT and changing x.id = to x.id IN would produce (aA, aB, bA, bB) result patterns.
It bugs me that after all these years of relational databases, there doesn't seem to be very good cross database ways of doing things like this. The MSDN article http://msdn.microsoft.com/en-us/library/cc441928.aspx seems to have some interesting ideas, but of course that's not PostgreSQL. And even then, their solution requires a single pass, when I'd think it ought to be able to be done without the scan.
I can imagine a few ways that might work without a pass (in selection), but it would involve creating another table that maps your table's primary keys to random numbers (or to linear sequences that you later randomly select, which in some ways may actually be better), and of course, that may have issues as well.
I realize this is probably a non-useful comment, I just felt I needed to rant a bit.
If you just want to get a random set of rows from each side, use a pseudo-random number generator. I would use something like:
select *
from (select a.*, row_number() over (order by NULL) as rownum -- NULL may not work, "(SELECT NULL)" works in MSSQL
from a
) a cross join
(select b.*, row_number() over (order by NULL) as rownum
from b
) b
where a.rownum <= 30 and b.rownum <= 30
This is doing a Cartesian product, which returns 900 rows assuming a and b each have at least 30 rows.
However, I interpreted your question as getting random combinations. Once again, I'd go for the pseudo-random approach.
select *
from (select a.*, row_number() over (order by NULL) as rownum -- NULL may not work, "(SELECT NULL)" works in MSSQL
from a
) a cross join
(select b.*, row_number() over (order by NULL) as rownum
from b
) b
where modf(a.rownum*107+b.rownum*257+17, 101) < <some vaue>
This let's you get combinations among arbitrary rows.
Just a plain carthesian product ON random() appears to work reasonably well. Simple comme bonjour...
-- Cartesian product
-- EXPLAIN ANALYZE
INSERT INTO dirgraph(point_from,point_to,costs)
SELECT p1.the_point , p2.the_point, (1000*random() ) +1
FROM allpoints p1
JOIN allpoints p2 ON random() < 0.002
;

Get row count including column values in sql server

I need to get the row count of a query, and also get the query's columns in one single query. The count should be a part of the result's columns (It should be the same for all rows, since it's the total).
for example, if I do this:
select count(1) from table
I can have the total number of rows.
If I do this:
select a,b,c from table
I'll get the column's values for the query.
What I need is to get the count and the columns values in one query, with a very effective way.
For example:
select Count(1), a,b,c from table
with no group by, since I want the total.
The only way I've found is to do a temp table (using variables), insert the query's result, then count, then returning the join of both. But if the result gets thousands of records, that wouldn't be very efficient.
Any ideas?
#Jim H is almost right, but chooses the wrong ranking function:
create table #T (ID int)
insert into #T (ID)
select 1 union all
select 2 union all
select 3
select ID,COUNT(*) OVER (PARTITION BY 1) as RowCnt from #T
drop table #T
Results:
ID RowCnt
1 3
2 3
3 3
Partitioning by a constant makes it count over the whole resultset.
Using CROSS JOIN:
SELECT a.*, b.numRows
FROM YOUR_TABLE a
CROSS JOIN (SELECT COUNT(*) AS numRows
FROM YOUR_TABLE) b
Look at the Ranking functions of SQL Server.
SELECT ROW_NUMBER() OVER (ORDER BY a) AS 'RowNumber', a, b, c
FROM table;
You could do it like this:
SELECT x.total, a, b, c
FROM
table
JOIN (SELECT total = COUNT(*) FROM table) AS x ON 1=1
which will return the total number of records in the first column, followed by fields a,b & c