Consider the following scenario. I have a list of 3000 first names and a list of 2500 last names. Each of them has a "ranking" that represents its position in a top of names. Two or more names can have the same ranking. Also, a table with 1500 cities is given, each with 4 census values from certain years.
From the tables above I must generate 5 million random entries, each containing the first name, last name, birth date and place of birth of one person, and they should follow the rules given by the ranking of the names and the population numbers of the cities.
This has to be generated using just Oracle (stored functions, stored procedures and so on). How can I do this?
Disclaimer: I'm not a statistics expert, and there are probably way more efficient means to do that.
The most challenging task seems to be the creation of 5 million names according to ranks. In the real world, those would be distributed unevenly among the population: the difference between the second-to-last and last ranks would be 1-2 persons, while the difference between the first and second ranks could be thousands of people. That said, I have no idea how to achieve that, so we'll model it another way. Suppose we have a total population of 100 and a list of four ranked names:
Alice: 1
Bob: 2
Betty: 2
Claire: 3
We can make the distribution "even", so that rank 3 has X people, rank 2 twice as many, and rank 1 three times as many. If the ranks were unique, the formula would be as simple as X + 2X + 3X = 100, but we have two names in rank 2, so it becomes X + 2*2X + 3X = 100, giving X = 12.5. We can truncate it to an integer and get people counts for all ranks except the first (12, 24 and 24), and the first rank gets what remains: 40. Seems good enough, though it will not work for the edge case where you have multiple names in the first rank.
There's a little problem, though. For 3000 different names, the sum of coefficients would be 4,501,500. So the truncated X would be 1, making rank 3000 through rank 2 get 1 to 2,999 people respectively, and rank 1 a little over 500,000. That's not quite good enough. To illustrate with the four names above, assume a total count of 15. With the current algorithm, X will be 1 as well, and the distribution will be 1-2-2-10. Luckily, we'll be processing ranks one by one in the procedure, so we can remove the already processed people from the equation and recalculate X. E.g. first it's X + 2*2X + 3X = 15 with X=1, then 2*2X + 3X = 14 with X=2. This way, the distribution becomes 1-4-4-6, which is far from ideal, but better.
Now, this can already be expressed in PL/SQL. I suggest creating the result table with the following columns: LAST_NAME, FIRST_NAME, BIRTHDAY, CITY, RAND_ROWNO.
First of all, let's fill it with 5M last names. Assuming your table for them is last_names(name, name_rank), you'll need something like the following:
declare
  --Ranks with their name counts and coefficients;
  --process lowest coefficients first so rank 1 comes last and gets the remainder
  cursor cur_last_name_ranks is
    select name_rank, count(*) cnt,
           row_number() over (order by name_rank desc) coeff
    from last_names l
    group by name_rank
    order by name_rank desc;
  cursor cur_last_names (c_rank number) is
    select name
    from last_names
    where name_rank = c_rank;
  v_coeff_sum          number;
  v_total_people_count number := 5000000;
  v_remaining_people   number;
  v_x                  number;
  v_insert_cnt         number;
begin
  --Get a sum of all coefficients for our formula
  select sum(coeff) into v_coeff_sum
  from
  (
    select count(*) * row_number() over (order by name_rank desc) coeff
    from last_names l
    group by name_rank
  );
  v_remaining_people := v_total_people_count;
  --Now, loop over all ranks
  for r in cur_last_name_ranks loop
    --Recalculate X
    v_x := trunc(v_remaining_people / v_coeff_sum);
    --First, determine how many rows should be inserted per last name with such rank
    if r.name_rank = 1 then
      if r.cnt > 1 then
        --raise an exception here, we don't allow multiple first ranks
        raise TOO_MANY_ROWS;
      end if;
      v_insert_cnt := v_remaining_people;
    else
      v_insert_cnt := v_x * r.coeff;
    end if;
    --Insert last names N times.
    --Instead of multiple INSERT statements, use select from dual with the connect by trick.
    for n in cur_last_names(r.name_rank) loop
      insert into result_table(last_name)
      select n.name from dual connect by level <= v_insert_cnt;
    end loop;
    commit;
    --Calculate remaining people count
    v_remaining_people := v_remaining_people - v_x * r.cnt * r.coeff;
    --Recalculate remaining coefficients
    v_coeff_sum := v_coeff_sum - r.cnt * r.coeff;
  end loop;
end;
Now you have 5 million rows with last names filled according to ranks. Next, we'll need to assign a random number from 1 to 5,000,000 to each row - you'll see why shortly. This is done with a single query using a merge on self:
merge into result_table t1
using (select rowid rid,
              row_number() over (order by DBMS_RANDOM.VALUE) rnk
       from result_table) t2
on (t1.rowid = t2.rid)
when matched then update set t1.rand_rowno = t2.rnk;
Note that it will take some time because of the table's size.
Now you must repeat the same procedure for first names. It'll be very similar to last names, except you'll be updating existing records, not inserting new ones. If you keep track of how many rows you've updated already, it'll be as simple as putting this in the inner loop:
update result_table
set first_name = n.name
where rand_rowno between (v_processed_rows + 1)
                     and (v_processed_rows + v_insert_cnt);
v_processed_rows := v_processed_rows + v_insert_cnt;
That does it - you now have a decent sample of 5M names generated according to your ranking, with last names randomly matched to first names.
Now, for the census. I don't really understand your format, but it's relatively simple. If you get the data into the form "N people were born in city C between DATE1 and DATE2", you can update the table in a loop, setting N rows to have CITY = C and BIRTHDAY = a random date between DATE1 and DATE2. You'll need a function that returns a random date from a time period, see this. Also, don't forget to assign random row numbers again before doing that.
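A quick sketch of such a function (assuming the standard DBMS_RANDOM package; the name random_date is mine):
create or replace function random_date(p_from date, p_to date) return date is
begin
  --DBMS_RANDOM.VALUE(low, high) returns a number in [low, high);
  --adding a fractional number of days to a DATE yields a DATE
  return p_from + dbms_random.value(0, p_to - p_from);
end;
/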
I'll leave the census part for you to implement; I've spent too much time writing this already. Thanks for a good brain exercise!
Related
I have a table with the following fields:
ID,Content,QuestionMarks,TypeofQuestion
350, What is the symbol used to represent Bromine?,2,MCQ
758,What is the symbol used to represent Bromine? ,2,MCQ
2425,What is the symbol used to represent Bromine?,3,Essay
2080,A quadrilateral has four sides, four angles ,1,MCQ
2614,A circular cone has a curved surface area of ,2,MCQ
2520,Two triangles have sides 5 cm, 11 cm, 2 cm . ,2,MCQ
2196,Life supporting process mediated by water? ,2,Essay
I would like to get random questions where the total marks equal an input number.
For example, if I say 25, the result should be a set of random questions whose Sum(QuestionMarks) is 25 (+/- 1).
Is this really possible using SQL like the following?
select content, id, questionmarks, sum(questionmarks)
from quiz_question
group by content, id, questionmarks;
Expected input: 25
Expected result: a set of questions whose Sum(QuestionMarks) = 25
Update:
How do I ensure I get at least 2 Essay-type questions? (This is just an example; I would extend this to other conditions.) Thank you for all the help.
S-Man's cumulative sum is the right approach. For your logic, though, I think you want to include rows up to the first one whose running total reaches 24 or more. That logic is:
where total - questionmarks < 24
If you have enough questions, then you could get exactly 25 using:
with q25 as (
      select *
      from (select t.*,
                   sum(questionmarks) over (order by random()) as running_questionmark
            from t
           ) t
      where running_questionmark < 25
     )
select q.ID, q.Content, q.QuestionMarks, q.TypeofQuestion
from q25 q
union all
(select t.ID, t.Content, t.QuestionMarks, t.TypeofQuestion
 from t cross join
      (select sum(questionmarks) as questionmark_25 from q25) x
 where not exists (select 1 from q25 where q25.id = t.id)
 order by abs(questionmarks - (25 - questionmark_25))
 limit 1
)
This selects questions up to, but not reaching, 25. It then tries to find one more question that brings the total to 25.
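For the update about requiring at least 2 Essay questions, one possible extension of the same idea (an untested sketch, still assuming Postgres and the table t) is to pick the two Essay questions first and then run the running-sum trick over what remains of the budget:
with essay2 as (
      -- pick two random Essay questions up front
      select *
      from t
      where TypeofQuestion = 'Essay'
      order by random()
      limit 2
     ),
     rest as (
      -- random running total over the remaining questions
      select t.*,
             sum(questionmarks) over (order by random()) as running_questionmark
      from t
      where not exists (select 1 from essay2 e where e.id = t.id)
     )
select ID, Content, QuestionMarks, TypeofQuestion from essay2
union all
select ID, Content, QuestionMarks, TypeofQuestion
from rest
where running_questionmark <= 25 - (select sum(questionmarks) from essay2);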
Supposing questionmarks is of type integer, you want to get some records in random order whose questionmarks sum is not more than 25:
You can use the cumulative SUM() window function. The order is random. The cumulative SUM() adds every current value to the previous sum, so you can filter on SUM() <= <your value>:
demo:db<>fiddle
SELECT *
FROM (
    SELECT
        *,
        SUM(questionmarks) OVER (ORDER BY random()) as total
    FROM t
) s
WHERE total <= 25
Note:
This returns a list of records whose total is no more than 25 and as close as possible to it, in random order.
Finding an exact match for your value is a sort of combinatorial problem that shouldn't be solved in a database, especially when a random factor is involved. What if your current sum is 22 and the next randomly chosen value is 4? Would you retry, maybe until infinity, until you randomly find a value = 3? Or would you remove an already counted record with value = 1?
Let's say I have a table named population with 1000 rows like the following:
And I have another table named proportions that holds the desired proportions of different Group_Names that I want to extract:
I want to randomly sample 100 rows from the population table such that the proportions of the Group_Names within the sample follow the Proportion field of the proportions table. So in that 100-row sample, 50 rows should be Group-A, 30 rows Group-B and 20 rows Group-C.
I can sample manually like:
CREATE EXTENSION tsm_system_rows;
SELECT * FROM population TABLESAMPLE SYSTEM_ROWS(100);
But I do not know how to sample from population programmatically based on the proportions table, especially if the proportions table has many more Group_Names than the 3 shown in the example.
The main problem you will be facing is that TABLESAMPLE takes the sample before applying your group filter. Say you want 20 rows from group C. The chances of getting those 20 by running
SELECT *
FROM population TABLESAMPLE system_rows(20)
WHERE group_name = 'C'
are pretty slim if group C is small relative to the other groups in population.
I'd solve this by writing a stored function that receives the group name and the wanted number of rows as parameters, and samples the table until it reaches that number of rows.
You should also limit the number of iterations, in case the group is very sparse or there are not enough rows to fulfill the need.
So the function could look like this:
CREATE OR REPLACE FUNCTION sample_group (p_group_name text, sample_size int, max_iterations int)
RETURNS int[]
LANGUAGE PLPGSQL AS $$
DECLARE
    result int[];
    i int := 0;
BEGIN
    -- Keep sampling until we have enough rows or run out of iterations
    WHILE i < max_iterations AND coalesce(array_length(result, 1), 0) < sample_size LOOP
        WITH sample AS (
            -- Draw a fresh random sample of the table
            SELECT group_name, value
            FROM population TABLESAMPLE BERNOULLI (1)
            LIMIT 10 * sample_size
        ), add_rows AS (
            -- Append the values of the wanted group to the accumulated result
            SELECT result || array_agg(value) arr
            FROM sample
            WHERE group_name = p_group_name
        )
        SELECT array_agg(DISTINCT value), i + 1
        INTO result, i
        FROM add_rows, unnest(arr) AS t(value);
    END LOOP;
    -- Trim to the exact requested size
    RETURN result[1:sample_size];
END;
$$;
I'm using BERNOULLI sampling to avoid getting the same rows over and over.
The function does most of the work for you; all that remains is to call it. In this example I'm setting an upper limit of 500 iterations.
SELECT group_name,
       unnest(sample_group(group_name, (100 * proportion)::int, 500)) AS value
FROM proportions;
You can sample based on randomly assigned row numbers:
select *
from
(
    select *
          ,case
               when row_number() over (partition by pop.group_name
                                       order by random())
                    <= pr.proportion * 100 -- sample size
               then 1
               else 0
           end as flag
    from population as pop
    join proportions as pr
      on pop.group_name = pr.group_name
) as dt
where flag = 1
Edit:
If the table is large, creating a SAMPLE before applying ROW_NUMBER might greatly reduce the number of rows processed. Of course, the SAMPLE size must be large enough to contain at least the required number of rows per group, i.e. well over 100 rows.
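A sketch of that combination, reusing the tsm_system_rows extension from the question (the SYSTEM_ROWS(10000) size is an arbitrary assumption and must comfortably exceed the total sample size):
select *
from
(
    select pop.*
          ,pr.proportion
          ,row_number() over (partition by pop.group_name
                              order by random()) as rn
    from (select *
          from population tablesample system_rows(10000)) as pop
    join proportions as pr
      on pop.group_name = pr.group_name
) as dt
where rn <= proportion * 100 -- sample size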
I currently have a two-column table, with the columns probability and age. I have a given probability, and I need to search the table and return the age related to the closest probability. The table is already in ascending order of probability next to age, for example:
age   probability
20    0.01050
21    0.02199
22    0.03155
23    0.04710
The only thing I can think of doing right now is returning all ages with probabilities greater than the given probability, and taking the first one.
select age from mydb.mytest
where probability > givenProbability;
I'm sure there is a better approach than this, so I'm wondering what it would be.
What about something like this:
SELECT * FROM mytest
ORDER BY ABS( .0750 - probability )
LIMIT 1
It should return the single closest value, based on sorting by the absolute difference between probability and givenProbability.
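One caveat: ORDER BY ABS(...) cannot use an index on probability, so it scans the whole table. If the table is big, an index-friendly alternative (a sketch, with .0750 again standing in for givenProbability) is to take the nearest row from each side of the value and keep the closer one:
SELECT age
FROM (
    -- nearest row at or below the given probability
    (SELECT age, probability
     FROM mytest
     WHERE probability <= .0750
     ORDER BY probability DESC
     LIMIT 1)
    UNION ALL
    -- nearest row above it
    (SELECT age, probability
     FROM mytest
     WHERE probability > .0750
     ORDER BY probability ASC
     LIMIT 1)
) t
ORDER BY ABS(probability - .0750)
LIMIT 1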
Different solutions will work for different DBMS. This one works in DB2 and is standard SQL:
select age
from (
    select age,
           row_number() over (order by abs(probability - givenProbability)) as rn
    from mydb.mytest
)
where rn = 1
I have a SQL table with a value column and a Probability column. I want to select one row from it at random, but I want to give more chances to rows with higher weighted probability. I can do this with
Order By abs(checksum(newid()))
But the differences between the probabilities are too big, so it gives too much of a chance to the highest probability: after picking that value around 74 times, it picks another value once, and then the first one again around 74 times. I want to reduce this to something like 3-4 times before the others get picked. I am thinking of giving ranges to the probabilities, like
Row[i] = Row[i-1] + Row[i]
How can I do this? Do I need to create a function? Is there any other way to achieve this? I am a newbie; any help will be appreciated. Thank you.
EDIT:
I have a solution to my problem. I have one question.
If I have a table as follows:
Column1  Column2
1        50
2        30
3        20
can I get:
Column1  Column2  Column3
1        50       50
2        30       80
3        20       100
Each time I want to add the current value to the running total of the previous ones. Is there any way?
UPDATE:
I finally got the solution after 3 hours: I just take repeated square roots of my probabilities, and that way I can narrow the difference between them. I added a column with
sqrt(sqrt(sqrt(Probability))) (effectively Probability^(1/8)) ....:-)
I'd handle it with something like
ORDER BY rand() * pow(<probability-field-name>, <n>)
For different values of n you distort the linear probabilities with a simple polynomial. Small values of n (e.g. 0.5) compress the probabilities towards 1 and thus make less probable choices more probable; big values of n (e.g. 2) do the opposite and further reduce the probability of already improbable values.
Since the difference in probabilities is too great, you need to add a computed field with a revised weighting that has a more even probability distribution. How you do that depends on your data and your preferred distribution. One way is to "normalize" the weighting to an integer between 1 and 10, so that the lowest probability is never more than ten times smaller than the highest.
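For example, a sketch of such a normalization (assuming SQL Server and a hypothetical table name YourTable; a linear rescale into the range [1, 10]):
SELECT t.*,
       -- rescale so the smallest Probability maps to 1 and the largest to 10;
       -- NULLIF guards against division by zero when all weights are equal
       CAST(ROUND(1 + 9.0 * (t.Probability - m.MinP)
                          / NULLIF(m.MaxP - m.MinP, 0), 0) AS int) AS NormWeight
FROM YourTable t
CROSS JOIN (SELECT MIN(Probability) AS MinP,
                   MAX(Probability) AS MaxP
            FROM YourTable) m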
Answer to your recent question:
SELECT t.Column1,
       t.Column2,
       (SELECT SUM(t2.Column2)
        FROM your_table t2
        WHERE t2.Column1 <= t.Column1) Column3
FROM your_table t
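On SQL Server 2012 or later, the same running total can also be computed with a window function instead of the correlated subquery (a sketch under that assumption, using the same your_table placeholder):
SELECT Column1,
       Column2,
       -- running total of Column2 in Column1 order
       SUM(Column2) OVER (ORDER BY Column1
                          ROWS UNBOUNDED PRECEDING) AS Column3
FROM your_table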
Here is a basic example of how to select one row from a table, taking the assigned row weights into account.
Suppose we have the table:
CREATE TABLE TableWithWeights(
    Id int NOT NULL PRIMARY KEY,
    DataColumn nvarchar(50) NOT NULL,
    Weight decimal(18, 6) NOT NULL -- Weight column
)
Let's fill the table with sample data.
INSERT INTO TableWithWeights VALUES(1, 'Frequent', 50)
INSERT INTO TableWithWeights VALUES(2, 'Common', 30)
INSERT INTO TableWithWeights VALUES(3, 'Rare', 20)
This is the query that returns one random row, taking the given row weights into account.
SELECT * FROM
    (SELECT tww1.*, -- Select original table data
        -- Add column with the sum of all weights of previous rows
        (SELECT SUM(tww2.Weight) - tww1.Weight
         FROM TableWithWeights tww2
         WHERE tww2.id <= tww1.id) as SumOfWeightsOfPreviousRows
     FROM TableWithWeights tww1) as tww,
    -- Add column with a random number within the range [0, SumOfWeights)
    (SELECT RAND() * SUM(weight) as rnd
     FROM TableWithWeights) r
WHERE (tww.SumOfWeightsOfPreviousRows <= r.rnd)
  AND (r.rnd < tww.SumOfWeightsOfPreviousRows + tww.Weight)
To check the query results, we can run it 100 times.
DECLARE @count as int;
SET @count = 0;
WHILE (@count < 100)
BEGIN
    -- This is the query that returns one random row,
    -- taking given row weights into account
    SELECT * FROM
        (SELECT tww1.*, -- Select original table data
            -- Add column with the sum of all weights of previous rows
            (SELECT SUM(tww2.Weight) - tww1.Weight
             FROM TableWithWeights tww2
             WHERE tww2.id <= tww1.id) as SumOfWeightsOfPreviousRows
         FROM TableWithWeights tww1) as tww,
        -- Add column with a random number within the range [0, SumOfWeights)
        (SELECT RAND() * SUM(weight) as rnd
         FROM TableWithWeights) r
    WHERE (tww.SumOfWeightsOfPreviousRows <= r.rnd)
      AND (r.rnd < tww.SumOfWeightsOfPreviousRows + tww.Weight)
    -- Increase counter
    SET @count += 1
END
P.S. The query was tested on SQL Server 2008 R2. And of course the query can be optimized (it's easy to do once you get the idea).
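One such optimization, as a sketch (assuming SQL Server 2012 or later for the windowed SUM): compute the running sums in a single pass instead of using correlated subqueries.
WITH w AS (
    SELECT *,
           -- sum of weights of all previous rows
           SUM(Weight) OVER (ORDER BY Id
                             ROWS UNBOUNDED PRECEDING) - Weight AS PrevSum,
           -- total weight of the whole table
           SUM(Weight) OVER () AS TotalWeight
    FROM TableWithWeights
)
SELECT w.*
FROM w
CROSS JOIN (SELECT RAND() AS rnd) r
WHERE r.rnd * w.TotalWeight >= w.PrevSum
  AND r.rnd * w.TotalWeight <  w.PrevSum + w.Weight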
I want grouped ranking on a very large table. I've found a couple of solutions for this problem, e.g. in this post and in other places on the web. I am, however, unable to figure out the worst-case complexity of these solutions. The specific problem consists of a table where each row has a number of points and a name associated. I want to be able to request rank intervals such as 1-4. Here is some example data:
name | points
Ab   | 14
Ac   | 14
B    | 16
C    | 16
Da   | 15
De   | 13
With these values the following "ranking" is created:
Query id | Rank | Name
1        | 1    | B
2        | 1    | C
3        | 3    | Da
4        | 4    | Ab
5        | 4    | Ac
6        | 6    | De
And it should be possible to request an interval over the query ids, e.g. 2-5, giving ranks 1, 3, 4 and 4.
The database holds about 3 million records, so if possible I want to avoid a solution with complexity greater than log(n). There are constant updates and inserts on the database, so those actions should preferably run in log(n) complexity as well. I am not sure that's possible, though, and I've tried wrapping my head around it for some time. I've come to the conclusion that a binary search should be possible, but I haven't been able to create a query that does this. I am using a MySQL server.
I will elaborate on how the pseudocode for the filtering could work. Firstly, an index on (points, name) is needed. As input you give a fromrank and a tillrank. The total number of records in the database is n. The pseudocode should look something like this:
Find the median point value and count the rows less than this value (the count gives a rough estimate of rank, not considering ties on points). If the number returned is greater than the fromrank delimiter, we subdivide the first half and find its median. We keep doing this until we are pinpointed to the points value where fromrank starts. Then we do the same within that points value using the name part of the index, finding medians until we reach the correct row. We do the exact same thing for tillrank.
The result should be log(n) subdivisions. So given that the median and count can be computed in log(n) time, it should be possible to solve the problem in worst-case log(n) complexity. Correct me if I am wrong.
You need a stored procedure to be able to call this with parameters:
CREATE TABLE rank (name VARCHAR(20) NOT NULL, points INTEGER NOT NULL);
CREATE INDEX ix_rank_points ON rank(points, name);
CREATE PROCEDURE prc_ranks(fromrank INT, tillrank INT)
BEGIN
    SET @fromrank = fromrank;
    SET @tillrank = tillrank;
    PREPARE STMT FROM
    '
    SELECT rn, rank, name, points
    FROM (
        SELECT CASE WHEN @cp = points THEN @rank ELSE @rank := @rn + 1 END AS rank,
               @rn := @rn + 1 AS rn,
               @cp := points,
               r.*
        FROM (
            SELECT @cp := -1, @rn := 0, @rank := 1
        ) var,
        (
            SELECT *
            FROM rank
            FORCE INDEX (ix_rank_points)
            ORDER BY points DESC, name DESC
            LIMIT ?
        ) r
    ) o
    WHERE rn >= ?
    ';
    EXECUTE STMT USING @tillrank, @fromrank;
END;
CALL prc_ranks (2, 5);
If you create the index and force MySQL to use it (as in my query), then the complexity of the query will not depend on the number of rows at all; it will depend only on tillrank.
It will actually take the last tillrank values from the index, perform some simple calculations on them and filter out the first fromrank values.
The time of this operation, as you can see, depends only on tillrank; it does not depend on how many records there are.
I just checked it on 400,000 rows: it selects ranks from 5 to 100 in 0.004 seconds (that is, instantly).
Important: this only works if you sort names in DESCENDING order. MySQL does not support a DESC clause in indexes, which means that points and name must be sorted in the same order for the INDEX SORT to be usable (either both ASCENDING or both DESCENDING). If you want fast ASC sorting by name, you will need to keep negative points in the database and change the sign in the SELECT clause.
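(A side note: since MySQL 8.0, descending index keys are supported, so the negative-points trick is no longer needed there; something like the following should work. Note that RANK became a reserved word in 8.0, so the table name needs quoting.)
-- MySQL 8.0+ only; older versions parse DESC here but ignore it
CREATE INDEX ix_rank_points ON `rank` (points DESC, name ASC);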
You may also remove name from the index altogether and perform the final ORDER BY without using an index:
CREATE INDEX ix_rank_points ON rank(points);

CREATE PROCEDURE prc_ranks(fromrank INT, tillrank INT)
BEGIN
    SET @fromrank = fromrank;
    SET @tillrank = tillrank;
    PREPARE STMT FROM
    '
    SELECT rn, rank, name, points
    FROM (
        SELECT CASE WHEN @cp = points THEN @rank ELSE @rank := @rn + 1 END AS rank,
               @rn := @rn + 1 AS rn,
               @cp := points,
               r.*
        FROM (
            SELECT @cp := -1, @rn := 0, @rank := 1
        ) var,
        (
            SELECT *
            FROM rank
            FORCE INDEX (ix_rank_points)
            ORDER BY points DESC
            LIMIT ?
        ) r
    ) o
    WHERE rn >= ?
    ORDER BY rank, name
    ';
    EXECUTE STMT USING @tillrank, @fromrank;
END;
That will impact performance on big ranges, but you will hardly notice it on small ranges.