SQL Sampling based on the whole population

I have a population of records...let's say 10,000 athletes, grouped by sports, where (numbers below would be variable):
4,000 are from NBA
2,000 are from NHL
3,000 are from MLB
1,000 are from NFL
How can I build a query that samples 100 records based on the population, not fully random, but along the lines of:
NBA / Whole Population = X
Select Top X * From MainTable Where League = 'NBA' (something like this)
so that:
40 names are from NBA
20 names are from NHL
30 names are from MLB
10 names are from NFL.
This is just a sample of the population; the logic here is to calculate the ratios relative to the whole population and then apply them to the sample size.
Regards

Consider using a correlated count subquery to compute a rank order, which you then use as the filtering criterion for the sample ratio.
SELECT main.*
FROM
(SELECT *,
(SELECT Count(*) FROM MainTable sub
WHERE sub.League = t.League AND sub.UniqueID <= t.UniqueID) As Rank
FROM MainTable t) AS main
WHERE main.Rank <= CInt((SELECT Count(*) FROM MainTable sub
WHERE sub.League = main.League) /
(SELECT Count(*) FROM MainTable) * 100)
ORDER BY main.League, main.Rank
To explain the above query with its nested subqueries and derived tables:
The derived table, main, is the source MainTable with a new column called Rank that gives an ordinal count of records for each League. So the first NBA record (not necessarily the first row) is tagged rank 1, the next NBA record (which can appear anywhere, say as the 89th row) is tagged 2, and so on for each League. And yes, Rank will go up to 4,000 if needed!
Once this Rank field is calculated, giving ordinal 1, 2, 3, ... indicators for each League grouping, we position this SELECT statement as a derived table in the FROM clause in order to use Rank in the WHERE filter for the sample ratio. We cannot calculate a column and filter on it in the same SELECT call.
The sample ratio is the last two subqueries, used as a quotient that calculates (# of League records matching the current row) / (total # of table records). This value is then multiplied by 100, the sample quota. CInt (like Round(..., 0)) rounds the possibly fractional ratio to the nearest whole number; use Int(...) instead if you prefer to truncate the decimal part.
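To see the per-league quotas on their own, a grouped sketch along these lines (untested, but following the same Access SQL pattern and table names as above, with a fixed sample size of 100) should return 40 for NBA, 20 for NHL, 30 for MLB and 10 for NFL with the example counts:
SELECT m.League,
       Count(*) AS LeagueCount,
       CInt(Count(*) / t.TotalCount * 100) AS SampleQuota
FROM MainTable AS m,
     (SELECT Count(*) AS TotalCount FROM MainTable) AS t
GROUP BY m.League, t.TotalCount;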

Dim db As DAO.Database
Dim cf As DAO.Recordset, samp As DAO.Recordset
Dim Leagues(1 To 4) As String
Dim sqql As String
Dim x As Long, y As Long, i As Long

Leagues(1) = "NBA"
Leagues(2) = "NHL"
Leagues(3) = "MLB"
Leagues(4) = "NFL"

Set db = CurrentDb
Set samp = db.OpenRecordset("RANDOMSAMPLE") ' destination table for the sample
Randomize
For x = 1 To 4
    y = 0
    sqql = "SELECT * FROM MainTable WHERE League = '" & Leagues(x) & "'"
    Set cf = db.OpenRecordset(sqql)
    cf.MoveLast ' force the full RecordCount to be populated
    Do While y < (x * 1000) ' adjust as necessary; swagged in 1000 from league 1, 2000 from league 2, etc.
        cf.MoveFirst
        i = Int(cf.RecordCount * Rnd) ' random offset 0 .. RecordCount - 1
        If i > 0 Then cf.Move i ' note: this samples with replacement
        With samp
            .AddNew
            .Fields("YourFieldHere") = cf!YourFieldHere
            ' repeat for other fields as necessary
            .Update
        End With
        y = y + 1
    Loop
    cf.Close
Next x
samp.Close

Sample certain number of result rows from a postgres table based on given proportions

Let's say I have a table named population with 1000 rows and two columns, group_name and value.
And I have another table named proportions that holds the desired proportion of each group_name that I want to extract, e.g. 0.5 for group A, 0.3 for group B and 0.2 for group C.
I want to randomly sample 100 rows from the population table such that the proportions of the group_names within the sample are in line with the Proportion field of the proportions table. So in that 100-row sample, 50 rows should be group A, 30 rows should be group B and 20 rows should be group C.
I can manually sample like:
CREATE EXTENSION tsm_system_rows;
SELECT * FROM population TABLESAMPLE SYSTEM_ROWS(100);
But I do not know how to sample from population programmatically based on the proportions table, especially if the proportions table has many more group_names than the 3 shown in the example.
The main problem that you will be facing is that TABLESAMPLE takes the sample before applying your group filter. Say that you want 20 rows from group C. The chances of getting those 20 by running
SELECT *
FROM population TABLESAMPLE system_rows(20)
WHERE group_name = 'C'
are pretty slim if group C is small relative to other groups in population.
I'd solve this by writing a stored function that receives as parameters the group name and wanted amount of rows, and samples the table until reaching the wanted amount of rows.
You should also limit the number of iterations, in case the group is very sparse or there are not enough rows to fulfill the need.
So the function could look like this:
CREATE OR REPLACE FUNCTION sample_group (p_group_name text, sample_size int, max_iterations int)
    RETURNS int[]
    LANGUAGE PLPGSQL AS $$
DECLARE
    result int[];
    i int := 0;
BEGIN
    WHILE i < max_iterations AND coalesce(array_length(result, 1), 0) < sample_size LOOP
        WITH sample AS (
            SELECT group_name, value
            FROM population TABLESAMPLE BERNOULLI (1)
            LIMIT 10 * sample_size
        ), add_rows AS (
            SELECT result || array_agg(value) arr
            FROM sample
            WHERE group_name = p_group_name
        )
        SELECT array_agg(DISTINCT value), i + 1
        INTO result, i
        FROM add_rows, unnest(arr) AS t(value);
    END LOOP;
    RETURN result[1:sample_size];
END;
$$;
I'm using BERNOULLI sampling to avoid getting the same rows over and over.
The function does most of the work for you. All that remains is to call it. In this example I'm setting an upper limit of 500 iterations.
SELECT group_name, unnest(sample_group(group_name, (100*proportion)::int, 500)) AS value
from proportions;
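If you need the full sampled rows rather than just the sampled values, a sketch like the following should work, assuming value identifies a row within its group (the function call is the same as above; the join back to population is my own addition):
SELECT p.*
FROM proportions pr
CROSS JOIN LATERAL unnest(sample_group(pr.group_name, (100 * pr.proportion)::int, 500)) AS s(value)
JOIN population p
  ON p.group_name = pr.group_name
 AND p.value = s.value;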
You can sample based on randomly assigned row numbers:
select *
from
(
    select *
         , case
               when row_number()
                    over (partition by pop.group_name
                          order by random()) <= pr.proportion * 100 -- sample size
                   then 1
               else 0
           end as flag
    from population as pop
    join proportions as pr
      on pop.group_name = pr.group_name
) as dt
where flag = 1
Edit:
If the table is large, creating a SAMPLE before ROW_NUMBER might greatly reduce the number of rows processed. Of course, the SAMPLE size must be large enough to contain at least the required number of rows, i.e. well over 100 rows.
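A rough sketch of that idea, assuming the same population and proportions tables as above (the sampling percentage is made up and must be tuned so that every group still yields enough rows):
select *
from
(
    select s.*
         , row_number() over (partition by s.group_name
                              order by random()) as rn
         , pr.proportion
    from (select * from population tablesample system (30)) as s
    join proportions as pr
      on s.group_name = pr.group_name
) as dt
where rn <= proportion * 100 -- sample size
SYSTEM sampling is block-based, so on small tables BERNOULLI may give a more even spread.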

Slow MS Access Query (Using DSum & DCount Functions)

I'm having an issue in Microsoft Access where my query runs extremely slowly (it takes hours and hours). The query reads a table that has 150,000 records, and each record belongs to one of 4,000 unique groups (identified by a field called API_10).
The goal of the query is to calculate a running cumulative production value (organized by API_10 and Date) such that the running cumulative production starts over at each new API_10 group. Each record in the table has a field called No which is an autonumber that MS Access calculates so that the table has a Primary Key. An example of what I'm describing is shown below:
MyTable:
No  API_10  Date      Production
1   1       1/1/2010  1000
2   1       2/1/2010  500
3   2       7/1/2014  300
4   2       8/1/2014  400
MyQuery:
No  API_10  Date      Production  Cumulative_Production
1   1       1/1/2010  1000        1000
2   1       2/1/2010  500         1500
3   2       7/1/2014  300         300
4   2       8/1/2014  400         700
Here is a sample of the code (typed in the Expression Builder on MS Access) used to create the Cumulative_Production column in MyQuery:
Cumulative_Production:
DSum("[Production]","[MyTable]","[API_10]='" & [API_10] & "' AND [No]<=" & [No])
Do note that this is a simplified version of the actual query/table. The real query also computes another field called Normalized_Prod_Month which counts the number of production dates (starting at 1) for each unique API_10 as shown below:
NORMALIZED_PROD_MONTH:
DCount("[Date]","[MyTable]","[API_10]='" & [API_10] & "' AND [No]<=" & [No])
Any tips for improving these types of calculations would greatly help!!
If you apply this query to each record, you must access n * (n + 1) / 2 records per group, where n is the group size. If all 4,000 groups have about the same size of 38 records, you get 4000 * 38 * (38 + 1) / 2 ≈ 3 million accesses. But this is the best case, since larger groups have a disproportionately higher cost because of the quadratic nature of n * (n + 1) / 2.
You are better off creating the running sum in a VBA loop, accessing each record only once.
Dim db As DAO.Database, rs As DAO.Recordset
Dim lastApi10 As Variant, runningSum As Long
Set db = CurrentDb
Set rs = db.OpenRecordset("SELECT * FROM MyTable ORDER BY API_10, [No]")
Do Until rs.EOF
    If rs!API_10 <> lastApi10 Then
        runningSum = 0              ' new group: restart the running sum
        lastApi10 = rs!API_10
    End If
    runningSum = runningSum + rs!Production
    'TODO: insert the result into a temporary table
    rs.MoveNext
Loop
rs.Close: Set rs = Nothing
db.Close: Set db = Nothing
Or use the following query. It still has quadratic cost, but a single query always performs better than multiple calls to DCount, DSum or DLookup.
SELECT
    A.API_10,
    A.[Date],
    A.Production,
    (SELECT Sum(B.Production)
     FROM MyTable B
     WHERE B.API_10 = A.API_10 And B.[No] <= A.[No]) AS Cumulative_Production
FROM MyTable AS A
ORDER BY A.API_10, A.[Date];
This assumes that the No column is consistent with the date sequence. If the dates are unique, you can also replace B.[No] <= A.[No] with B.[Date] <= A.[Date].

SQL: Find rows that match closely but not exactly

I have a table inside a PostgreSQL database with columns c1,c2...cn. I want to run a query that compares each row against a tuple of values v1,v2...vn. The query should not return an exact match but should return a list of rows ordered in descending similarity to the value vector v.
Example:
The table contains sports records:
1,USA,basketball,1956
2,Sweden,basketball,1998
3,Sweden,skating,1998
4,Switzerland,golf,2001
Now when I run a query against this table with v=(Sweden,basketball,1998), I want to get all records that have a similarity with this vector, sorted by number of matching columns in descending order:
2,Sweden,basketball,1998 --> 3 columns match
3,Sweden,skating,1998 --> 2 columns match
1,USA,basketball,1956 --> 1 column matches
Row 4 is not returned because it does not match at all.
Edit: All columns are equally important. Although, when I really think of it... it would be a nice add-on if I could give each column a different weight factor as well.
Is there any possible SQL query that would return the rows in a reasonable amount of time, even when I run it against a million rows?
What would such a query look like?
SELECT * FROM countries
WHERE country = 'Sweden'
OR sport = 'basketball'
OR year = 1998
ORDER BY
cast(country = 'Sweden' AS integer) +
cast(sport = 'basketball' as integer) +
cast(year = 1998 as integer) DESC
It's not beautiful, but well. You can cast the boolean expressions as integers and sum them.
You can easily change the weights by adding a multiplier:
cast(sport = 'basketball' as integer) * 5 +
This is how I would do it. The multiplication factors used in the CASE statements handle the importance (weight) of each match, and they ensure that records matching the columns with the highest weight come out on top even if the other columns don't match for those particular records.
/*
-- Initial Setup
-- drop table sport
create table sport (id int, Country varchar(20) , sport varchar(20) , yr int )
insert into sport values
(1,'USA','basketball','1956'),
(2,'Sweden','basketball','1998'),
(3,'Sweden','skating','1998'),
(4,'Switzerland','golf','2001')
select * from sport
*/
select * ,
CASE WHEN Country='sweden' then 1 else 0 end * 100 +
CASE WHEN sport='basketball' then 1 else 0 end * 10 +
CASE WHEN yr=1998 then 1 else 0 end * 1 as Match
from sport
WHERE
country = 'sweden'
OR sport = 'basketball'
OR yr = 1998
ORDER BY Match Desc
It might help if you wrote a stored function that calculates a "similarity metric" between two rows. Then your query could refer to the return value of that function directly rather than having umpteen conditions in the WHERE expression and the ORDER BY expression.
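A minimal sketch of that idea for PostgreSQL, assuming the countries table and columns used in the answers above (the function name and the > 0 filter are my own choices):
CREATE OR REPLACE FUNCTION similarity_score(c_country text, c_sport text, c_year int,
                                            v_country text, v_sport text, v_year int)
RETURNS int
LANGUAGE sql IMMUTABLE AS $$
    SELECT (c_country IS NOT DISTINCT FROM v_country)::int
         + (c_sport   IS NOT DISTINCT FROM v_sport)::int
         + (c_year    IS NOT DISTINCT FROM v_year)::int;
$$;

SELECT *
FROM countries
WHERE similarity_score(country, sport, year, 'Sweden', 'basketball', 1998) > 0
ORDER BY similarity_score(country, sport, year, 'Sweden', 'basketball', 1998) DESC;
Weights can be added by multiplying each term, just like the multiplier shown earlier. Keep in mind that filtering on a function result prevents per-column index use, so on a million rows the OR-based WHERE clause above may still be the faster filter.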

Split a query result based on the result count

I have a query based on basic criteria that will return X number of records on any given day.
I'm trying to check the result count of the basic query, then apply a percentage split to it based on the total X and split the records into 2 buckets. Each bucket will be a percentage of the total query result X.
For example:
Query A returns 3,500 records.
If the number of records returned from Query A is <= 3,000, then split the records into a 40% / 60% split (e.g. 1,400 / 2,100 for 3,500 records).
If the number of records returned from Query A is >= 3,001 and <= 50,000, then split the records into a 10% / 90% split. Etc.
I want the actual records returned, not just the math acting on the records that returns a single row with a number in it.
I'm not sure how you want to display the different parts of the resulting set of rows, so I've just added an additional column (part) to the result set that contains 1 for rows belonging to the first part and 2 for the second part.
select z.*
, case
when cnt_all <= 3000 and cnt <= 40
then 1
when (cnt_all between 3001 and 50000) and (cnt <= 10)
then 1
else 2
end part
from (select t.*
, 100*(count(col1) over(order by col1) / count(col1) over() )cnt
, count(col1) over() cnt_all
from split_rowset t
order by col1
) z
Demo #1: number of rows = 3,000.
Demo #2: number of rows = 3,500.
For better usability you can create a view using the query above and then query that view, filtering by the part column, as sketched below.
Demo #3: using a view.
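A minimal sketch of that suggestion, reusing the query above (the view name is made up; the inner ORDER BY is left out because ordering can be applied when querying the view):
create view split_rowset_parts as
select z.*
     , case
           when cnt_all <= 3000 and cnt <= 40 then 1
           when (cnt_all between 3001 and 50000) and (cnt <= 10) then 1
           else 2
       end part
from (select t.*
           , 100*(count(col1) over(order by col1) / count(col1) over() ) cnt
           , count(col1) over() cnt_all
      from split_rowset t
     ) z;

select * from split_rowset_parts where part = 1 order by col1; -- first bucket
select * from split_rowset_parts where part = 2 order by col1; -- second bucket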

Giving Range to the SQL Column

I have a SQL table in which I have a column and a Probability column. I want to select one row from it at random, but I want to give more chances to the rows with a higher weighted probability. I can do this with
Order By abs(checksum(newid()))
But the differences between the probabilities are too big, so it gives far too many chances to the highest probability. It is like: after picking that value 74 times, it picks another value once, then again around 74 times. I want to reduce this to something like 3-4 times versus the others. I am thinking of giving a range to the probabilities, like
Row[i] = Row[i-1]+Row[i]
How can I do this? Do I need to create a function? Is there any other way to achieve this? I am a newbie. Any help will be appreciated. Thank you.
EDIT:
I have a solution to my problem. I have one more question.
If I have a table as follows:
Column1  Column2
1        50
2        30
3        20
can I get:
Column1  Column2  Column3
1        50       50
2        30       80
3        20       100
Each time I want to add the value to the existing running total. Is there any way?
UPDATE:
Finally got the solution after 3 hours. I just take repeated square roots of my probabilities; that way I can narrow the difference between them. It is like adding a column with
sqrt(sqrt(sqrt(Probability)))....:-)
I'd handle it by something like
ORDER BY rand()*pow(<probability-field-name>,<n>)
For different values of n you will distort the linear probabilities into a simple polynomial. Small values of n (e.g. 0.5) will compress the probabilities towards 1 and thus make less probable choices more probable; big values of n (e.g. 2) will do the opposite and further reduce the probability of already improbable values.
Since the difference in probabilities is too great, you need to add a computed field with a revised weighting that has a more even probability distribution. How you do that depends on your data and preferred distribution. One way is to "normalize" the weighting to an integer between 1 and 10, so that the lowest probability is never more than ten times smaller than the highest.
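A linear rescaling along those lines might look like the sketch below (SQL Server syntax; the table name MyProbTable is made up, and truncation vs. rounding is a matter of taste). The smallest Probability maps to 1 and the largest to 10, and NormWeight can then be used in place of Probability in the weighted approaches below.
SELECT t.*,
       -- maps [MinP, MaxP] linearly onto 1..10; all-equal probabilities yield NULL via NULLIF
       1 + CAST(9.0 * (t.Probability - x.MinP) / NULLIF(x.MaxP - x.MinP, 0) AS int) AS NormWeight
FROM MyProbTable AS t
CROSS JOIN (SELECT MIN(Probability) AS MinP, MAX(Probability) AS MaxP
            FROM MyProbTable) AS x;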
Answer to your recent question:
SELECT t.Column1,
t.Column2,
(SELECT SUM(Column2)
FROM table t2
WHERE t2.Column1 <= t.Column1) Column3
FROM table t
Here is a basic example of how to select one row from the table, taking into account the assigned row weights.
Suppose we have table:
CREATE TABLE TableWithWeights(
Id int NOT NULL PRIMARY KEY,
DataColumn nvarchar(50) NOT NULL,
Weight decimal(18, 6) NOT NULL -- Weight column
)
Let's fill the table with sample data.
INSERT INTO TableWithWeights VALUES(1, 'Frequent', 50)
INSERT INTO TableWithWeights VALUES(2, 'Common', 30)
INSERT INTO TableWithWeights VALUES(3, 'Rare', 20)
This is the query that returns one random row, taking into account the given row weights.
SELECT * FROM
(SELECT tww1.*, -- Select original table data
-- Add column with the sum of all weights of previous rows
(SELECT SUM(tww2.Weight)- tww1.Weight
FROM TableWithWeights tww2
WHERE tww2.id <= tww1.id) as SumOfWeightsOfPreviousRows
FROM TableWithWeights tww1) as tww,
-- Add column with random number within the range [0, SumOfWeights)
(SELECT RAND()* sum(weight) as rnd
FROM TableWithWeights) r
WHERE
(tww.SumOfWeightsOfPreviousRows <= r.rnd)
and ( r.rnd < tww.SumOfWeightsOfPreviousRows + tww.Weight)
To check the query results we can run it 100 times.
DECLARE @count as int;
SET @count = 0;
WHILE (@count < 100)
BEGIN
-- This is the query that returns one random row with
-- taking into account given row weights
SELECT * FROM
(SELECT tww1.*, -- Select original table data
-- Add column with the sum of all weights of previous rows
(SELECT SUM(tww2.Weight)- tww1.Weight
FROM TableWithWeights tww2
WHERE tww2.id <= tww1.id) as SumOfWeightsOfPreviousRows
FROM TableWithWeights tww1) as tww,
-- Add column with random number within the range [0, SumOfWeights)
(SELECT RAND()* sum(weight) as rnd
FROM TableWithWeights) r
WHERE
(tww.SumOfWeightsOfPreviousRows <= r.rnd)
and ( r.rnd < tww.SumOfWeightsOfPreviousRows + tww.Weight)
-- Increase counter
SET @count += 1
END
P.S. The query was tested on SQL Server 2008 R2. And of course the query can be optimized (it's easy to do once you get the idea).