SQL Sampling: one element from each bucket - sql

Here is a simplified version of the setup I have: each person can hold multiple possessions.
Persons table:
id name
1 Carl
2 Sam
3 Tom
4 Jack
Possessions table:
possession personId
car 2
shoes 2
shovel 2
tent 3
matches 3
axe 4
I want to generate a random set of possessions belonging to a random set of people, one possession per person.
So, in a non-SQL world I would generate a set of N random people and then pick a random possession for each person in the set. But how do I implement that in SQL semantics?
I thought of getting a random sample of possessions with some variation of:
SELECT * FROM Possessions WHERE 0.01 >= RAND()
And then filtering out duplicate persons, but that is no good: it favors persons with a large number of possessions, and I want each person to have an equal chance of being selected.
Is there a canonical way to solve this?
P.S. Persons contains ~50000 rows and Possessions contains ~2500000 rows, but I only need to perform this sampling once, so it can be somewhat slow.

Why not take a random set of persons and join it to the possessions ranked in random order? Something like the query below. Apologies for any typos; I don't have a database at hand to check it:
select * from (
    (select top 1 percent * from Persons order by newid()) a
    inner join
    (select p.*, ROW_NUMBER() OVER (partition by personId order by newid()) r from Possessions p) b
    on (a.id = b.personId)
)
where r = 1;
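The ranking idea in the answer above can be tried out without SQL Server. What follows is a minimal sketch, assuming Python's built-in sqlite3 module and a SQLite build with window functions (3.25 or later). SQLite's RANDOM() stands in for newid(), LIMIT stands in for TOP, and the helper names build_db and sample_ranked are made up for illustration. Unlike the query above, the sketch draws only from people who own something, so the requested count is honoured.

```python
import sqlite3


def build_db():
    """Create an in-memory database holding the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Persons (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE Possessions (possession TEXT, personId INTEGER);
        INSERT INTO Persons VALUES (1,'Carl'),(2,'Sam'),(3,'Tom'),(4,'Jack');
        INSERT INTO Possessions VALUES
            ('car',2),('shoes',2),('shovel',2),
            ('tent',3),('matches',3),('axe',4);
    """)
    return conn


def sample_ranked(conn, people_count):
    """Pick people_count random owners, then keep the possession that ranks
    first in a random per-person ordering (ROW_NUMBER() = 1)."""
    sql = """
        SELECT a.id, a.name, b.possession
        FROM (SELECT id, name FROM Persons
              WHERE id IN (SELECT personId FROM Possessions)
              ORDER BY RANDOM() LIMIT ?) AS a
        JOIN (SELECT possession, personId,
                     ROW_NUMBER() OVER (PARTITION BY personId
                                        ORDER BY RANDOM()) AS r
              FROM Possessions) AS b
          ON a.id = b.personId
        WHERE b.r = 1
    """
    return conn.execute(sql, (people_count,)).fetchall()


conn = build_db()
for person_id, name, possession in sample_ranked(conn, 2):
    print(person_id, name, possession)
```

Every owner is equally likely to be chosen regardless of how many possessions they hold, because the person sample is drawn before the possessions are looked at.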

One way would be (for 2 persons below and one possession per person)
DECLARE @PeopleCount INT = 2,
        @PossessionsPerPersonCount INT = 1;
SELECT *
FROM (SELECT TOP (@PeopleCount) *
      FROM Persons
      ORDER BY CRYPT_GEN_RANDOM(4)) RandomPersons
OUTER APPLY (SELECT TOP (@PossessionsPerPersonCount) * FROM Possessions p
             WHERE RandomPersons.id = p.personId
             ORDER BY CRYPT_GEN_RANDOM(4)) RandomPossessions
Hopefully Possession has an index on personId so that it can seek into the relevant rows per person (average 50) rather than scanning all 2,500,000 in the table for each person.
I've used OUTER APPLY above as not all the people in your example data have possessions (i.e. Carl doesn't).
If you only want to include people with possessions and want one possession per person you can use this instead.
DECLARE @PeopleCount INT = 2;
SELECT TOP (@PeopleCount) *
FROM Persons
CROSS APPLY (SELECT TOP (1) * FROM Possessions p
             WHERE Persons.id = p.personId
             ORDER BY CRYPT_GEN_RANDOM(4)) RandomPossessions
ORDER BY CRYPT_GEN_RANDOM(4);

The following query (MySQL syntax) will generate a sample of 3 random persons, each with one random possession:
SELECT p.personId,
       (SELECT possession FROM Possessions p1 WHERE p1.personId = p.personId ORDER BY RAND() LIMIT 1) AS possession
FROM Possessions p
GROUP BY p.personId
ORDER BY RAND()
LIMIT 3
The sub-query picks a random possession for each person, while the outer query picks the random persons.
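The correlated sub-query pattern can be exercised on the question's sample data. This is a minimal sketch, assuming Python's built-in sqlite3 module, with SQLite's RANDOM() in place of RAND(). The helper names and the include_empty flag are hypothetical: the flag switches between keeping people without possessions (the OUTER APPLY behaviour) and dropping them (the CROSS APPLY behaviour).

```python
import sqlite3


def build_db():
    """Create an in-memory database holding the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Persons (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE Possessions (possession TEXT, personId INTEGER);
        INSERT INTO Persons VALUES (1,'Carl'),(2,'Sam'),(3,'Tom'),(4,'Jack');
        INSERT INTO Possessions VALUES
            ('car',2),('shoes',2),('shovel',2),
            ('tent',3),('matches',3),('axe',4);
    """)
    return conn


def sample_possessions(conn, people_count, include_empty=False):
    """Return (id, name, possession) for people_count random persons.

    With include_empty=True a person without possessions may be drawn and
    gets None as possession; otherwise only owners are eligible.
    """
    owner_filter = "" if include_empty else (
        "WHERE EXISTS (SELECT 1 FROM Possessions x WHERE x.personId = p.id)"
    )
    sql = f"""
        SELECT rp.id, rp.name,
               (SELECT po.possession FROM Possessions po
                WHERE po.personId = rp.id
                ORDER BY RANDOM() LIMIT 1) AS possession
        FROM (SELECT p.id, p.name FROM Persons p
              {owner_filter}
              ORDER BY RANDOM() LIMIT ?) AS rp
    """
    return conn.execute(sql, (people_count,)).fetchall()


conn = build_db()
print(sample_possessions(conn, 2))
print(sample_possessions(conn, 4, include_empty=True))
```

With include_empty=True and a count of 4, Carl appears with None as his possession, since he owns nothing in the sample data.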

Related

SQL - ordering table by information from multiple tables

Title of the question may not have been very clear - I am not really sure how to name this question, but I hope that my explanation will make my problem clearer.
I have 3 tables:
[1] score
id rating_type
1 UPVOTE
2 UPVOTE
3 DOWNVOTE
4 UPVOTE
5 DOWNVOTE
6 DOWNVOTE
[2] post_score
post_id score_id
1 1
1 2
1 3
2 4
2 5
2 6
and [3] post
id title
1 title1
2 title2
My goal is to order [3] post table by score.
Assume UPVOTE represents a value of 1 and DOWNVOTE a value of -1. In this example, the post with id = 1 has 3 scores related to it, with values UPVOTE, UPVOTE, DOWNVOTE, making the "numeric score" of this post 1;
likewise, the post with id = 2 also has 3 scores, with values UPVOTE, DOWNVOTE, DOWNVOTE, making its "numeric score" -1.
How would I order post table by this score? In this example, if I ordered by score asc, I would expect the following result:
id title
2 title2
1 title1
My attempts didn't go far, I am stuck here with this query currently, which doesn't really do anything useful yet:
WITH fullScoreInformation AS (
SELECT * FROM score s
JOIN post_score ps ON s.id = ps.score_id),
upvotes AS (SELECT * FROM fullScoreInformation WHERE rating_type = 'UPVOTE'),
downvotes AS (SELECT * FROM fullScoreInformation WHERE rating_type = 'DOWNVOTE')
SELECT p.id, rating_type, title FROM post p JOIN fullScoreInformation fsi on p.id = fsi.post_id
I am using PostgreSQL. Queries will be used in my Spring Boot application (I normally use native queries).
Perhaps this data structure is bad and I should have constructed my entities differently ?
"My goal is to order post table by score. Assume UPVOTE represents value of 1 and DOWNVOTE value of -1"
One option uses a subquery to count the upvotes and downvotes of each post:
select p.*, s.*
from post p
cross join lateral (
select
count(*) filter(where s.rating_type = 'UPVOTE' ) as cnt_up,
count(*) filter(where s.rating_type = 'DOWNVOTE') as cnt_down
from post_score ps
inner join score s on s.id = ps.score_id
where ps.post_id = p.id
) s
order by s.cnt_up - s.cnt_down desc
"Perhaps this data structure is bad and I should have constructed my entities differently ?"
As it stands, I don't see the need for two distinct tables post_score and score. For the data you have shown, this is a 1-1 relationship, so just one table should be sufficient, storing the post id and the rating type.
It is better to use a LEFT JOIN; otherwise you won't get posts that have no votes yet. Then aggregate to get the filtered sums of the scores. Wrap each sum in coalesce() so that a missing side counts as 0 (a filtered sum over no rows is NULL, and NULL plus anything is NULL), add the two, and order by the result.
SELECT p.id,
       p.title
FROM post p
LEFT JOIN post_score ps
       ON ps.post_id = p.id
LEFT JOIN score s
       ON s.id = ps.score_id
GROUP BY p.id,
         p.title
ORDER BY coalesce(sum(1) FILTER (WHERE rating_type = 'UPVOTE'), 0)
       + coalesce(sum(-1) FILTER (WHERE rating_type = 'DOWNVOTE'), 0);
I second GMB's comment about the superfluous table.
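To check the ordering on the sample data, here is a minimal sketch assuming Python's built-in sqlite3 module rather than PostgreSQL. It uses SUM(CASE ...) instead of FILTER so that it also runs on older SQLite builds. Post 3 ("title3") is a hypothetical extra row with no votes, added only to show the effect of the LEFT JOIN and COALESCE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE score (id INTEGER PRIMARY KEY, rating_type TEXT);
    CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE post_score (post_id INTEGER, score_id INTEGER);
    INSERT INTO score VALUES (1,'UPVOTE'),(2,'UPVOTE'),(3,'DOWNVOTE'),
                             (4,'UPVOTE'),(5,'DOWNVOTE'),(6,'DOWNVOTE');
    INSERT INTO post VALUES (1,'title1'),(2,'title2'),
                            (3,'title3');  -- hypothetical post without votes
    INSERT INTO post_score VALUES (1,1),(1,2),(1,3),(2,4),(2,5),(2,6);
""")

# Map each rating to +1 / -1, sum per post, and treat "no votes" as 0.
ORDER_BY_SCORE = """
    SELECT p.id, p.title,
           COALESCE(SUM(CASE s.rating_type WHEN 'UPVOTE' THEN 1
                                           WHEN 'DOWNVOTE' THEN -1 END),
                    0) AS net_score
    FROM post p
    LEFT JOIN post_score ps ON ps.post_id = p.id
    LEFT JOIN score s ON s.id = ps.score_id
    GROUP BY p.id, p.title
    ORDER BY net_score ASC
"""

ranked = conn.execute(ORDER_BY_SCORE).fetchall()
for row in ranked:
    print(row)
```

On this data the ascending order is title2 (-1), title3 (0), title1 (1).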

How to Limit Results Per Match on a Left Join - SQL Server

I have a table with student info [STU] and a table with parent info [PAR]. I want to return an email address for each student, but just one. So I run this query:
SELECT [STU].[ID], [PAR].[EM]
FROM (SELECT [STU].* FROM DB1.STU) STU
LEFT JOIN (SELECT [PAR].* FROM DB1.PAR) PAR ON [STU].[ID] = [PAR].[ID]
This gives me the below table:
Student ID ParentEmail
1 jim@email.com
1 sarah@email.com
2 paul@email.com
2 tim@email.com
3 bill@email.com
3 frank@email.com
3 joyce@email.com
4 greg@email.com
5 tony@email.com
5 sam@email.com
Each student has multiple parent emails, but I only want one. In other words, I want the output to look like this:
Student ID ParentEmail
1 jim@email.com
2 paul@email.com
3 frank@email.com
4 greg@email.com
5 sam@email.com
I've tried so many things. I've tried using GROUP BY and MIN/MAX and I've tried complex CASE statements, and I've tried COALESCE but I just can't seem to figure it out.
I think OUTER APPLY is the simplest method:
SELECT [STU].[ID], [PAR].[EM]
FROM DB1.STU OUTER APPLY
(SELECT TOP (1) [PAR].*
FROM DB1.PAR
WHERE [STU].[ID] = [PAR].[ID]
) PAR;
Normally, there would be an ORDER BY in the subquery, to give you control over which email you want: the longest, shortest, oldest, or whatever. Without an ORDER BY, it returns one arbitrary email per student, which is all you are asking for.
If you just want one column from the parent table, a simple approach is a correlated subquery:
select
s.id student_id,
(select max(p.em) from db1.par p where p.id = s.id) parent_email
from db1.stu s
This gives you the greatest parent email per student.
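Both answers boil down to "pick one row per student according to some ordering". Here is a minimal sketch of that idea on the sample rows from the question, assuming Python's built-in sqlite3 module. SQLite has no OUTER APPLY or TOP, so a correlated sub-query with ORDER BY ... LIMIT 1 is used instead. The DB1 schema prefix is dropped, and the function name and its direction parameter are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STU (ID INTEGER PRIMARY KEY);
    CREATE TABLE PAR (ID INTEGER, EM TEXT);
    INSERT INTO STU VALUES (1),(2),(3),(4),(5);
    INSERT INTO PAR VALUES
        (1,'jim@email.com'),(1,'sarah@email.com'),
        (2,'paul@email.com'),(2,'tim@email.com'),
        (3,'bill@email.com'),(3,'frank@email.com'),(3,'joyce@email.com'),
        (4,'greg@email.com'),
        (5,'tony@email.com'),(5,'sam@email.com');
""")


def one_email_per_student(conn, direction="ASC"):
    """Return {student_id: email}, taking the alphabetically first (ASC)
    or last (DESC) parent email of each student."""
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be 'ASC' or 'DESC'")
    sql = f"""
        SELECT s.ID,
               (SELECT p.EM FROM PAR p
                WHERE p.ID = s.ID
                ORDER BY p.EM {direction} LIMIT 1) AS ParentEmail
        FROM STU s
        ORDER BY s.ID
    """
    return dict(conn.execute(sql).fetchall())


first_emails = one_email_per_student(conn, "ASC")
last_emails = one_email_per_student(conn, "DESC")
print(first_emails)
print(last_emails)
```

The DESC variant returns the same rows as the max(p.em) query above; a student without any parent row would get None rather than being dropped.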

Help construct a query given a schema

Here is the schema for the database: http://i.stack.imgur.com/omX60.png
Question is: How many people have at least five entitlements?
I've got this, please tell me how wrong it is and fix it.
select count(personId)
from serialNumber_tbl natural join entitlement_tbl
group by personId
having sum(entitlementID) > 5
Thank you.
The condition for at least 5 is >= 5, not > 5
You need to count the distinct ids in the entitlement table, not person
This gives you the persons, next you need to subquery it to find the count of persons.
select count(personId)
FROM
(
select personId
from serialNumber_tbl natural join entitlement_tbl
group by personId
having count(distinct entitlementID) >= 5
) X
Your request isn't exactly clear. Are you asking for the count of people with at least five entitlement rows, whether or not they are spread over multiple serial numbers? If so, you could do something like:
Select Count(*) As CountOfPeople
From Person_tbl As P
Where Exists (
    Select 1
    From serialNumber_tbl As S1
    Join entitlement_tbl As E1
        On E1.serialNumberId = S1.serialNumberId
    Where S1.personId = P.personId
    Having Count(*) >= 5
)
Or are you asking for the number of people who have a single serial number with at least five entitlements? If that is the case, then you could do something like:
Select Count(*) As CountOfPeople
From Person_tbl As P
Where Exists (
    Select 1
    From serialNumber_tbl As S1
    Join entitlement_tbl As E1
        On E1.serialNumberId = S1.serialNumberId
    Where S1.personId = P.personId
    Group By S1.serialNumberId
    Having Count(*) >= 5
)
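The two readings of the question give different counts, which a small experiment makes visible. This is a minimal sketch assuming Python's built-in sqlite3 module. The real schema is only available as an image, so the column names (serialNumberId, personId, entitlementId) and all rows below are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema and data; the real schema is not shown in the text.
    CREATE TABLE serialNumber_tbl (serialNumberId INTEGER PRIMARY KEY,
                                   personId INTEGER);
    CREATE TABLE entitlement_tbl (entitlementId INTEGER PRIMARY KEY,
                                  serialNumberId INTEGER);
    INSERT INTO serialNumber_tbl VALUES (10,1),(11,1),(20,2),(30,3);
    INSERT INTO entitlement_tbl VALUES
        (1,10),(2,10),(3,10),(4,11),(5,11),               -- person 1: 3 + 2
        (6,20),(7,20),(8,20),(9,20),                      -- person 2: 4
        (10,30),(11,30),(12,30),(13,30),(14,30),(15,30);  -- person 3: 6
""")

# Reading 1: at least five entitlements in total, across all serial numbers.
people_total = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT s.personId
        FROM serialNumber_tbl s
        JOIN entitlement_tbl e ON e.serialNumberId = s.serialNumberId
        GROUP BY s.personId
        HAVING COUNT(DISTINCT e.entitlementId) >= 5
    )
""").fetchone()[0]

# Reading 2: at least five entitlements on one single serial number.
people_single_serial = conn.execute("""
    SELECT COUNT(DISTINCT s.personId)
    FROM serialNumber_tbl s
    WHERE (SELECT COUNT(*) FROM entitlement_tbl e
           WHERE e.serialNumberId = s.serialNumberId) >= 5
""").fetchone()[0]

print(people_total, people_single_serial)
```

Person 1 reaches five entitlements only when both serial numbers are added together, so that person counts under the first reading but not under the second.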

How do I select unique pairs of rows from a table at random?

I have two tables like these:
CREATE TABLE people (
id INT NOT NULL,
PRIMARY KEY (id)
)
CREATE TABLE pairs (
person_a_id INT,
person_b_id INT,
FOREIGN KEY (person_a_id) REFERENCES people(id),
FOREIGN KEY (person_b_id) REFERENCES people(id)
)
I want to select pairs of people at random from the people table, and after selecting them I add the randomly select pair to the pairs table. person_a_id always refers to the person with the lower id of the pair (since the order of the pair is not relevant).
The thing is that I never want to select the same pair twice, so I need to check the pairs table before I return my randomly selected pair.
Is it possible to do this using just a single SQL query in a reasonably efficient and elegant manner?
(I'm doing this using the Java Persistence API, but hopefully I'll be able to translate any answers into JPA code)
select a.id, b.id
from people a
inner join people b on a.id < b.id
where not exists (
    select *
    from pairs c
    where c.person_a_id = a.id
    and c.person_b_id = b.id)
order by rand()
limit 1;
Limit 1 returns just one pair if you are "drawing lots" one at a time. Otherwise, up the limit to however many pairs you need.
The above query assumes that you can get
1 - 2
2 - 7
and that the pairing 2 - 7 is valid since it doesn't exist, even if 2 is featured again. If you only want a person to feature in only one pair ever, then
select a.id, b.id
from people a
inner join people b on a.id < b.id
where not exists (
    select *
    from pairs c
    where c.person_a_id in (a.id, b.id))
and not exists (
    select *
    from pairs c
    where c.person_b_id in (a.id, b.id))
order by rand()
limit 1;
If multiple pairs are to be generated in one single query, AND the destination table is still empty, you could use this single query. Take note that LIMIT 6 returns only 3 pairs.
select min(a) a, min(b) b
from
(
    select
    case when mod(@p,2) = 1 then id end a,
    case when mod(@p,2) = 0 then id end b,
    @p:=@p+1 grp
    from (
        select id
        from (select @p:=1) p, people
        order by rand()
        limit 6
    ) x
) y
group by floor(grp/2)
This cannot be accomplished with a single set-based query, because the set has no knowledge of which pairs have already been inserted into the pairs table.
Instead, you should loop:
WHILE EXISTS(SELECT * FROM people
WHERE id NOT IN (SELECT person_a_id FROM pairs)
AND id NOT IN (SELECT person_b_id FROM pairs))
This will loop while there are unmatched people.
Then you should generate two random numbers from 1 to the COUNT(*) of that set, which is the number of unmatched people. If you get the same number twice, roll again. (If you're worried about this, draw the two numbers from separate halves of the set, but then you lose some randomness depending on your sort criteria.)
Pair those people.
Wash, rinse, repeat.
Your only "redo" happens when you generate the same random number twice. With n unmatched people the chance of that is 1/n, so it becomes more likely as fewer people remain, reaching 50% when only two are left.
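The draw-one-unused-pair query from the first answer can be exercised end to end. This is a minimal sketch assuming Python's built-in sqlite3 module instead of MySQL and JPA, with RANDOM() in place of rand(). The four people and the helper name draw_pair are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER NOT NULL PRIMARY KEY);
    CREATE TABLE pairs (
        person_a_id INTEGER REFERENCES people(id),
        person_b_id INTEGER REFERENCES people(id)
    );
    INSERT INTO people VALUES (1),(2),(3),(4);  -- hypothetical sample people
""")


def draw_pair(conn):
    """Pick one random pair (a < b) that is not yet in pairs, record it,
    and return it; return None once every pair has been used."""
    row = conn.execute("""
        SELECT a.id, b.id
        FROM people a
        JOIN people b ON a.id < b.id
        WHERE NOT EXISTS (SELECT 1 FROM pairs c
                          WHERE c.person_a_id = a.id
                            AND c.person_b_id = b.id)
        ORDER BY RANDOM()
        LIMIT 1
    """).fetchone()
    if row is not None:
        conn.execute("INSERT INTO pairs VALUES (?, ?)", row)
    return row


# Four people allow 4 * 3 / 2 = 6 distinct pairs.
drawn = [draw_pair(conn) for _ in range(6)]
print(drawn)
print(draw_pair(conn))  # every pair is used up by now
```

The self-join enumerates all candidate pairs, which is quadratic in the number of people, so this is a sketch for small tables rather than a tuned solution.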

SQL query for finding row with same column values that was created most recently

If I have three columns in my MySQL table people, say id, name, created, where name is a string and created is a timestamp, what's the appropriate query for the following scenario? I have 10 rows, and each row has a name. Every row has a unique id, but names can repeat, so you can have three Bobs, two Marys, one Jack and four Phils.
There is also a hobbies table with the columns id, hobby, person_id.
Basically I want a query that will do the following:
Return all of the people with zero hobbies, but only check by the latest distinct person created, if that makes sense. Meaning if there is a Bob person that was created yesterday, and one created today.. I only want to know if the Bob created today has zero hobbies. The one from yesterday is no longer relevant.
select pp.id
from people pp, (select name, max(created) as created from people group by name) p
where pp.name = p.name
and pp.created = p.created
and pp.id not in ( select person_id from hobbies )
SELECT latest_person.* FROM (
SELECT p1.* FROM people p1
WHERE NOT EXISTS (
SELECT * FROM people p2
WHERE p1.name = p2.name AND p1.created < p2.created
)
) AS latest_person
LEFT OUTER JOIN hobbies h ON h.person_id = latest_person.id
WHERE h.id IS NULL;
Try this:
Select *
From people p
Where created =
    (Select Max(created)
     From people
     Where name = p.name
     And not exists
         (Select * From hobbies
          Where person_id = p.id))
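The "latest row per name, then check for hobbies" logic can be verified on a handful of rows. This is a minimal sketch assuming Python's built-in sqlite3 module instead of MySQL. All rows below are hypothetical, and timestamps are stored as ISO-8601 text so that MAX() compares them chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, created TEXT);
    CREATE TABLE hobbies (id INTEGER PRIMARY KEY, hobby TEXT,
                          person_id INTEGER);
    -- Hypothetical rows: two Bobs, one Mary, two Jacks.
    INSERT INTO people VALUES
        (1,'Bob','2024-01-01 09:00:00'),
        (2,'Bob','2024-01-02 09:00:00'),
        (3,'Mary','2024-01-01 10:00:00'),
        (4,'Jack','2024-01-01 11:00:00'),
        (5,'Jack','2024-01-03 11:00:00');
    INSERT INTO hobbies VALUES (1,'chess',2),(2,'golf',4);
""")

# Keep only the most recently created person per name, then drop those
# who have at least one hobby.
LATEST_WITHOUT_HOBBIES = """
    SELECT p.id, p.name
    FROM people p
    JOIN (SELECT name, MAX(created) AS created
          FROM people
          GROUP BY name) latest
      ON latest.name = p.name AND latest.created = p.created
    WHERE NOT EXISTS (SELECT 1 FROM hobbies h WHERE h.person_id = p.id)
    ORDER BY p.id
"""

result = conn.execute(LATEST_WITHOUT_HOBBIES).fetchall()
print(result)
```

The older Bob (id 1) has no hobbies but is ignored because a newer Bob exists, and that newer Bob has a hobby, so no Bob is returned. The newer Jack (id 5) is returned even though the older Jack had a hobby.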