Oracle random segments select - sql

I would like to select a random sample of 5500 rows made up of the following segments:
Subcategory (sex): - 3300 men
- 2200 women
Subcategory (age): - 2140 between 18-34 years
- 2100 between 35-54 years
- 1260 between 55-99 years
How could I solve this in a select statement?

The problem is that you use the word "random" but you have a very precise breakdown of cohorts by age and sex. A truly random single query won't produce such exact quotas. So your query must necessarily be complicated: you need to divide the whole table into subsets which meet your constraints, then randomly select from those subsets. Something like this...
select * from (
select * from whatever
where sex = 'M'
and age between 18 and 34
order by dbms_random.value
)
where rownum <= 1284
union all
select * from (
select * from whatever
where sex = 'M'
and age between 35 and 54
order by dbms_random.value
)
where rownum <= 1260
union all
select * from (
select * from whatever
where sex = 'M'
and age between 55 and 99
order by dbms_random.value
)
where rownum <= 756
union all
select * from (
select * from whatever
where sex = 'F'
and age between 18 and 34
order by dbms_random.value
)
where rownum <= 856
union all
select * from (
select * from whatever
where sex = 'F'
and age between 35 and 54
order by dbms_random.value
)
where rownum <= 840
union all
select * from (
select * from whatever
where sex = 'F'
and age between 55 and 99
order by dbms_random.value
)
where rownum <= 504
This may perform poorly, depending on the usual factors - size of table, indexing, etc - but it will produce those exact cohorts.
In case it's not obvious, each rownum bound is the quota for the age group multiplied by that sex's share of the total: 3/5 for men and 2/5 for women (the 3:2 ratio of men to women).
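The arithmetic behind those bounds can be checked with a short sketch. It assumes, as the answer does, that the 3:2 split applies equally inside every age group; the variable names are only illustrative.

```python
# Quotas from the question: rows wanted per age group, and the 3:2 men-to-women split.
age_totals = {"18-34": 2140, "35-54": 2100, "55-99": 1260}
sex_parts = {"M": 3, "F": 2}          # 3300 : 2200 reduces to 3 : 2
parts_sum = sum(sex_parts.values())   # 5

# One quota per (sex, age group) cell; integer arithmetic avoids float rounding.
quotas = {
    (sex, band): total * part // parts_sum
    for band, total in age_totals.items()
    for sex, part in sex_parts.items()
}

men = sum(n for (sex, _), n in quotas.items() if sex == "M")
women = sum(n for (sex, _), n in quotas.items() if sex == "F")

for (sex, band), n in sorted(quotas.items()):
    print(sex, band, n)
print("men:", men, "women:", women, "total:", men + women)
```

The six cell quotas are the rownum bounds used in the query, and they add back up to 3300 men, 2200 women and 5500 rows overall.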

Related

How to sum and count in grouping in sql

I have a table like the one below.
I would like to group the rows and then count the groups according to their score.
customer score
A 10
A 20
B 30
B 40
C 50
C 60
First, I would like to sum the score by customer.
I achieved this with GROUP BY.
customer score
A 30
B 70
C 110
Second, I would like to count the customers by binning the sums into the following bands.
I couldn't figure out how to count after grouping.
band count
0-50 1
50-100 1
100- 0
Is there any way to achieve this?
Thanks
You could use a union approach:
WITH cte AS (
SELECT SUM(score) AS score
FROM yourTable
GROUP BY customer
)
SELECT '0-50' AS band, COUNT(*) AS count, 0 AS position FROM cte WHERE score <= 50 UNION ALL
SELECT '50-100', COUNT(*), 1 FROM cte WHERE score > 50 AND score <= 100 UNION ALL
SELECT '100-', COUNT(*), 2 FROM cte WHERE score > 100
ORDER BY position;
Try the following with a case expression:
select
case
when score >= 0 and score <= 50 then '0-50'
when score > 50 and score <= 100 then '50-100'
when score > 100 then '100-'
end as band,
count(*) as count
from
(
select
customer,
sum(score) as score
from myTable
group by
customer
) val
group by
case
when score >= 0 and score <= 50 then '0-50'
when score > 50 and score <= 100 then '50-100'
when score > 100 then '100-'
end
I would recommend two levels of aggregation but with a left join:
select b.band, count(c.customer)
from (select 0 as lo, 50 as hi, '0-50' as band from dual union all
select 50 as lo, 100 as hi, '50-100' as band from dual union all
select 100 as lo, null as hi, '100+' as band from dual
) b left join
(select customer, sum(score) as score
from myTable
group by customer
) c
on c.score >= b.lo and
(c.score < b.hi or b.hi is null)
group by b.band;
This structure also suggests that the bands can be stored in a separate reference table. That can be quite handy, if you intend to reuse these across different queries or over time.
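As a rough illustration of the reference-table idea, here is a minimal sketch using Python's sqlite3 module rather than Oracle. The `bands` table and its extra '100-200' and '200+' rows are assumptions made for the example (they are not in the question); the sample scores are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myTable (customer TEXT, score INTEGER);
    INSERT INTO myTable VALUES
        ('A', 10), ('A', 20), ('B', 30), ('B', 40), ('C', 50), ('C', 60);

    -- Hypothetical reference table: lo is inclusive, hi is exclusive,
    -- and hi IS NULL marks an open-ended band.
    CREATE TABLE bands (lo INTEGER, hi INTEGER, band TEXT);
    INSERT INTO bands VALUES
        (0, 50, '0-50'), (50, 100, '50-100'), (100, 200, '100-200'), (200, NULL, '200+');
""")

# Aggregate per customer first, then LEFT JOIN onto the bands so that
# bands without any customer still show up with a count of 0.
rows = conn.execute("""
    SELECT b.band, COUNT(c.customer) AS cnt
    FROM bands AS b
    LEFT JOIN (SELECT customer, SUM(score) AS score
               FROM myTable
               GROUP BY customer) AS c
      ON c.score >= b.lo AND (c.score < b.hi OR b.hi IS NULL)
    GROUP BY b.band, b.lo
    ORDER BY b.lo
""").fetchall()

for band, cnt in rows:
    print(band, cnt)
```

Because the counting is driven by the `bands` table, changing the band boundaries only requires editing its rows, not the query.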

stratified sample on ranges

I have table_1, that has data such as:
Range Start Range End Frequency
10 20 90
20 30 68
30 40 314
40 40 191 (here, it means we have just 40 as data point repeating 191 times)
table_2:
group value
10 56.1
10 88.3
20 53
20 20
30 55
I need to get the stratified sample on the basis of range from table_1, the table_2 can have millions of rows but the result should be restricted to just 10k points.
Tried below query:
SELECT
d.*
FROM
(
SELECT
ROW_NUMBER() OVER(
PARTITION BY group
ORDER BY group
) AS seqnum,
COUNT(*) OVER() AS ct,
COUNT(*) OVER(PARTITION BY group) AS cpt,
group, value
FROM
table_2 d
) d
WHERE
seqnum < 10000 * ( cpt * 1.0 / ct )
but I am a bit confused about how the analytic functions are used here.
Expecting 10k records as a stratified sample from table_2:
Result table:
group value
10 56.1
20 53
20 20
30 55
If you need at least one record from each group, plus more records chosen at random, then try this:
SELECT GROUP, VALUE FROM
(SELECT T2.GROUP, T2.VALUE,
ROW_NUMBER()
OVER (PARTITION BY T2.GROUP ORDER BY NULL) AS RN
FROM TABLE_1 T1
JOIN TABLE_2 T2
ON(T1.RANGE = T2.GROUP))
WHERE RN = 1 OR
CASE WHEN RN > 1
AND RN = CEIL(DBMS_RANDOM.VALUE(1,RN))
THEN 1 END = 1
FETCH FIRST 10000 ROWS ONLY;
Here, a row number is assigned within each group; the result keeps row number 1 of every group, plus any other row whose row number satisfies the random condition.
If I understand what you want - which is by no means certain - then I think you want to get a maximum of 10000 rows, with the number of group values proportional to the frequencies. So you can get the number of rows you want from each range with:
select range_start, range_end, frequency,
frequency/sum(frequency) over () as proportion,
floor(10000 * frequency/sum(frequency) over ()) as limit
from table_1;
RANGE_START RANGE_END FREQUENCY PROPORTION LIMIT
----------- ---------- ---------- ---------- ----------
10 20 90 .135746606 1357
20 30 68 .102564103 1025
30 40 314 .473604827 4736
40 40 191 .288084465 2880
Those limits don't quite add up to 10000; you could go slightly above with ceil instead of floor.
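To see how far floor and ceil land from the target, and one way to hit it exactly, here is a small sketch. The largest-remainder step is an extra technique that is not part of the answer above; it simply hands the leftover rows to the ranges with the biggest fractional parts.

```python
from math import ceil, floor

frequencies = [90, 68, 314, 191]   # frequency column of table_1
target = 10000
total = sum(frequencies)

exact = [target * f / total for f in frequencies]
floors = [floor(x) for x in exact]
ceils = [ceil(x) for x in exact]

# Largest-remainder allocation (assumption, not from the answer):
# start from the floors, then give one extra row to the ranges
# with the largest fractional parts until the target is reached.
leftover = target - sum(floors)
by_remainder = sorted(
    range(len(exact)), key=lambda i: exact[i] - floors[i], reverse=True
)
limits = floors[:]
for i in by_remainder[:leftover]:
    limits[i] += 1

print("floor:", floors, sum(floors))
print("ceil: ", ceils, sum(ceils))
print("exact:", limits, sum(limits))
```

With these frequencies the floors sum to 9998 and the ceilings to 10002, while the adjusted limits sum to exactly 10000.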
You can then assign a nominal row number to each entry in table_2 based on which range it is in, and then restrict the number of rows from that range via that limit:
with cte1 (range_start, range_end, limit) as (
select range_start, range_end, floor(10000 * frequency/sum(frequency) over ())
from table_1
),
cte2 (grp, value, limit, rn) as (
select t2.grp, t2.value, cte1.limit,
row_number() over (partition by cte1.range_start order by t2.value) as rn
from cte1
join table_2 t2
on (cte1.range_end > cte1.range_start and t2.grp >= cte1.range_start and t2.grp < cte1.range_end)
or (cte1.range_end = cte1.range_start and t2.grp = cte1.range_start)
)
select grp, value
from cte2
where rn <= limit;
...
9998 rows selected.
I've used order by t2.value in the row_number() call because it isn't clear how you want to pick which rows in the range you actually want; you might want to order by dbms_random.value or something else.

How can I select top 3 for each group based on another column in sqlite?

I'm trying to get top 3 most profitable UserIDs in each country in one table using sqlite. I'm not sure where to use LIMIT 3.
Here is the table I have:
Country | UserID | Profit
US 1 100
US 12 98
US 13 10
US 5 8
US 2 5
IR 9 95
IR 3 90
IR 8 70
IR 4 56
IR 15 40
the result should look like this:
Country | UserID | Profit
US 1 100
US 12 98
US 13 10
IR 9 95
IR 3 90
IR 8 70
One pretty simple method is:
select t.*
from t
where t.profit >= (select t2.profit
from t t2
where t2.country = t.country
order by t2.profit desc
limit 1 offset 2
);
This assumes at least three records for each country. You can get around that with coalesce():
select t.*
from t
where t.profit >= coalesce((select t2.profit
from t t2
where t2.country = t.country
order by t2.profit desc
limit 1 offset 2
), t.profit
);
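The effect of coalesce() is easiest to see with a country that has fewer than three rows. The sketch below uses Python's sqlite3 module; the 'DE' rows are made-up data for illustration, and the outer table is given an explicit alias (`o`) purely for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (country TEXT, userid INTEGER, profit INTEGER);
    INSERT INTO t VALUES
        ('US', 1, 100), ('US', 12, 98), ('US', 13, 10), ('US', 5, 8),
        ('DE', 7, 40), ('DE', 21, 30);  -- hypothetical country with only two rows
""")

# Third-highest profit in the outer row's country; NULL if there is no third row.
THIRD_BEST = """(SELECT t2.profit FROM t AS t2
                 WHERE t2.country = o.country
                 ORDER BY t2.profit DESC
                 LIMIT 1 OFFSET 2)"""

QUERY = """
    SELECT o.country, o.userid
    FROM t AS o
    WHERE o.profit >= {threshold}
    ORDER BY o.country, o.profit DESC
"""

plain = conn.execute(QUERY.format(threshold=THIRD_BEST)).fetchall()
with_coalesce = conn.execute(
    QUERY.format(threshold=f"COALESCE({THIRD_BEST}, o.profit)")
).fetchall()

print("without coalesce:", plain)
print("with coalesce:   ", with_coalesce)
```

Without coalesce() the two 'DE' rows disappear, because comparing against a NULL threshold is never true; with it they are kept. Note that ties on the third-highest profit would return more than three rows for that country.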
If your SQLite version does not support window functions (they were added in SQLite 3.25.0), you can compute a per-country sequence number (seqnum) in a correlated subquery and then keep the top 3.
You can try this query:
select t.Country,t.UserID,t.Profit
from(
select t.*,
(select count(*)
from T t2
where t2.Country = t.Country and t2.Profit >= t.Profit
) as seqnum
from T t
)t
where t.seqnum <=3
sqlfiddle:https://www.db-fiddle.com/f/tmNhRLGG2oKqCKXJEDsjfe/0
LIMIT won't be useful here, as it applies to the whole result set.
I would create an auxiliary column "CountryRank" like this:
SELECT *, (SELECT COUNT() FROM Data AS d WHERE d.Country=Data.Country AND d.Profit>Data.Profit)+1 AS CountryRank
FROM Data;
And query on that result:
SELECT Country, UserID, Profit
FROM (
SELECT *, (SELECT COUNT() FROM Data AS d WHERE d.Country=Data.Country AND d.Profit>Data.Profit)+1 AS CountryRank FROM Data)
WHERE CountryRank<=3
ORDER BY Country, CountryRank;
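A quick way to try the rank-by-counting approach is to load the question's sample rows into an in-memory SQLite database from Python. This is only a sketch: the table name `Data` follows the answer above, and the outer table is aliased as `o` here to make the correlation explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Data (Country TEXT, UserID INTEGER, Profit INTEGER);
    INSERT INTO Data VALUES
        ('US', 1, 100), ('US', 12, 98), ('US', 13, 10), ('US', 5, 8), ('US', 2, 5),
        ('IR', 9, 95), ('IR', 3, 90), ('IR', 8, 70), ('IR', 4, 56), ('IR', 15, 40);
""")

# CountryRank = 1 + number of rows in the same country with a strictly higher profit.
top3 = conn.execute("""
    SELECT Country, UserID, Profit
    FROM (
        SELECT o.*,
               (SELECT COUNT(*) FROM Data AS d
                 WHERE d.Country = o.Country
                   AND d.Profit > o.Profit) + 1 AS CountryRank
        FROM Data AS o
    )
    WHERE CountryRank <= 3
    ORDER BY Country, CountryRank
""").fetchall()

for row in top3:
    print(row)
```

The correlated subquery runs once per row, so on large tables an index on (Country, Profit) is likely to matter.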

How to select different percentages of data based in a column value?

I need to query a table that has a "gender" column, like so:
| id | gender | name |
-------------------------
| 1 | M | Michael |
-------------------------
| 2 | F | Hanna |
-------------------------
| 3 | M | Louie |
-------------------------
And I need to extract the first N results which have, for example 80% males and 20% females. So, if I needed 1000 results I would want to retrieve 800 males and 200 females.
Is it possible to do it in a single query? How?
If I don't have enough records (imagine I have only 700 males on the example above) is it possible to select 700 / 300 automatically?
Basically, you want to get as many 'M' as you can, but not more than your percentage and then get enough 'F' so you have total 1000 rows:
with cte_m as (
select * from Table1 where gender = 'M' limit (1000 * 0.8)
), cte as (
select *, 0 as ord from cte_m
union all
select *, 1 as ord from Table1 where gender = 'F'
order by ord
limit 1000
)
select id, gender, name
from cte
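The same "cap the males, then fill with females" idea can be tried out in SQLite from Python. This is a sketch under assumptions: the helper name `sample_by_gender`, the generated sample rows and the ordering by id are all illustrative, and the SQL is SQLite's dialect rather than PostgreSQL's.

```python
import sqlite3

def sample_by_gender(conn, total, male_share):
    """Return up to `total` rows: at most total*male_share males, the rest females."""
    male_cap = round(total * male_share)
    return conn.execute("""
        SELECT id, gender, name FROM (
            SELECT id, gender, name, 0 AS ord
            FROM (SELECT * FROM Table1 WHERE gender = 'M' ORDER BY id LIMIT :male_cap)
            UNION ALL
            SELECT id, gender, name, 1 AS ord
            FROM Table1 WHERE gender = 'F'
            ORDER BY ord, id      -- males first, so females only fill what is left
            LIMIT :total
        )
    """, {"male_cap": male_cap, "total": total}).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (id INTEGER, gender TEXT, name TEXT)")
# Hypothetical data: 7 males (ids 1-7) and 5 females (ids 8-12).
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [(i, "M" if i <= 7 else "F", f"person_{i}") for i in range(1, 13)],
)

short_on_males = sample_by_gender(conn, total=10, male_share=0.8)
enough_males = sample_by_gender(conn, total=5, male_share=0.8)
print(sorted(g for _, g, _ in short_on_males))
print(sorted(g for _, g, _ in enough_males))
```

Asking for 10 rows at 80% would want 8 males, but only 7 exist, so the remaining 3 slots go to females; asking for 5 rows yields 4 males and 1 female.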
How about the following, which assumes you are supplying a row count ("lmt"), and floats for the M/F distribution:
create table gen (
id integer,
gender text,
name text
);
-- inserts 80% males and 20% females into the source table ("gen")
insert into gen select n, case when mod(n,5) = 0 then 'F' else 'M' end, (case when mod(n,5) = 0 then 'F' else 'M' end)||'_'||n::text
from generate_series(1,20000) n;
-- extract 80/20 M vs F
with conf as (select 1000 as lmt, .80::FLOAT as mpct, .20::FLOAT as fpct),
g as (select id,gender,name,row_number() over (partition by gender order by gender) rn from gen)
select *
from g
where (gender = 'M' and rn <= (select lmt*mpct from conf))
or (gender = 'F' and rn <= (select lmt*fpct from conf));
-- Same query, to show the percent M vs F:
with conf as (select 1000 as lmt, .80::FLOAT as mpct, .20::FLOAT as fpct),
g as (select id,gender,name,row_number() over (partition by gender order by gender) rn from gen)
select gender,count(*)
from (
select *
from g
where (gender = 'M' and rn <= (select lmt*mpct from conf))
or (gender = 'F' and rn <= (select lmt*fpct from conf))
) y
group by gender
I don't have PostgreSQL with me, but the first scenario is pretty easy with a union in MS SQL 2012. I assume you can do it similarly in PostgreSQL:
declare @MaxRows INT
,@PercentageMale INT
,@PercentageFemale INT
select @MaxRows = 1000
,@PercentageMale = 80
,@PercentageFemale = 20
select top (@MaxRows*@PercentageMale/100) *
FROM someTable
WHERE Gender = 'M'
UNION
select top (@MaxRows*@PercentageFemale/100) *
FROM someTable
WHERE Gender = 'F'
The second bit is actually quite easy. Basically you want to select the top % of males, and then fill the rest of the list with females, up to the total number of rows. The number of females is not actually relevant:
declare @MaxRows INT
,@PercentageMale INT
select @MaxRows = 1000
,@PercentageMale = 80
SELECT TOP (@MaxRows) *
FROM
(
select top (@MaxRows*@PercentageMale/100) *
FROM someTable
WHERE Gender = 'M'
UNION
select top (@MaxRows) * --we never want more than @MaxRows
--so no need to check for a %,
--just fill in the rest of the data set
FROM someTable
WHERE Gender = 'F'
) a

Combine two statements with LIMITS using UNION

Is there a way to combine these two statements into one without having duplicate entries?
SELECT * FROM Seq where JULIANDAY('2012-05-25 19:02:00')<=JULIANDAY(TimeP)
order by TimeP limit 50
SELECT * FROM Seq where JULIANDAY('2012-05-29 06:20:50')<=JULIANDAY(TimeI)
order by TimeI limit 50
My first, obvious attempt is not supported by SQLITE (Syntax error: Limit clause should come after UNION not before):
SELECT * FROM Seq where JULIANDAY('2012-05-25 19:02:00')<=JULIANDAY(TimeP)
order by TimeP limit 50
UNION
SELECT * FROM Seq where JULIANDAY('2012-05-29 06:20:50')<=JULIANDAY(TimeI)
order by TimeI limit 50
Use subqueries and perform the limit within them.
SELECT *
FROM ( SELECT *
FROM Seq
WHERE JULIANDAY('2012-05-25 19:02:00') <= JULIANDAY(TimeP)
ORDER BY TimeP
LIMIT 50
)
UNION
SELECT *
FROM ( SELECT *
FROM Seq
WHERE JULIANDAY('2012-05-29 06:20:50') <= JULIANDAY(TimeI)
ORDER BY TimeI
LIMIT 50
)
Queries are processed in stages:
1. FROM clause and all the joins;
2. WHERE clause and all the predicates. So if you want to see NULL values in the result set, you should never filter OUTER-joined table columns in the WHERE section, as this will turn your query into an INNER join;
3. GROUP BY and HAVING clauses;
4. Query combinations: UNION, INTERSECT, EXCEPT or MINUS;
5. ORDER BY;
6. LIMIT.
Therefore, as others pointed out, it is syntactically wrong to use ORDER BY and LIMIT before the UNION clause. You should use subqueries:
SELECT *
FROM (SELECT * FROM Seq
WHERE JULIANDAY('2012-05-25 19:02:00') <= JULIANDAY(TimeP)
ORDER BY TimeP LIMIT 50) AS tab1
UNION
SELECT *
FROM (SELECT * FROM Seq
WHERE JULIANDAY('2012-05-29 06:20:50') <= JULIANDAY(TimeI)
ORDER BY TimeI LIMIT 50) AS tab2;
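Both behaviours can be reproduced with Python's sqlite3 module. The three sample rows below are invented for the sketch; the queries themselves mirror the ones in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Seq (id INTEGER PRIMARY KEY, TimeP TEXT, TimeI TEXT);
    -- Hypothetical sample rows
    INSERT INTO Seq (TimeP, TimeI) VALUES
        ('2012-05-24 10:00:00', '2012-05-28 10:00:00'),
        ('2012-05-26 10:00:00', '2012-05-30 10:00:00'),
        ('2012-05-27 10:00:00', '2012-05-31 10:00:00');
""")

BY_TIMEP = """SELECT * FROM Seq
              WHERE JULIANDAY('2012-05-25 19:02:00') <= JULIANDAY(TimeP)
              ORDER BY TimeP LIMIT 50"""
BY_TIMEI = """SELECT * FROM Seq
              WHERE JULIANDAY('2012-05-29 06:20:50') <= JULIANDAY(TimeI)
              ORDER BY TimeI LIMIT 50"""

# ORDER BY / LIMIT directly before UNION is rejected by SQLite.
try:
    conn.execute(f"{BY_TIMEP} UNION {BY_TIMEI}")
    direct_error = None
except sqlite3.OperationalError as exc:
    direct_error = str(exc)

# Wrapping each limited query in a subquery is accepted, and UNION removes duplicates.
rows = conn.execute(
    f"SELECT * FROM ({BY_TIMEP}) UNION SELECT * FROM ({BY_TIMEI})"
).fetchall()

print("direct form error:", direct_error)
print("subquery form rows:", rows)
```

In this sample both limited queries match the same two rows, so the UNION returns each of them once; UNION ALL would return them twice.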
SELECT * from
(SELECT *
FROM Seq
where JULIANDAY('2012-05-25 19:02:00')<=JULIANDAY(TimeP)
order by TimeP limit 50)
UNION
SELECT * from
(SELECT *
FROM Seq
where JULIANDAY('2012-05-29 06:20:50')<=JULIANDAY(TimeI)
order by TimeI limit 50)
I have a table buysell_product. I want to select top 5 based on views_1 column and another top 5 using views_2 column, then merge the two. Columns are the same in both queries. Expecting this to work, it does not:
SELECT id, name, views_1, views_2
FROM buysell_product
ORDER BY views_1 DESC
LIMIT 5
UNION
SELECT id, name, views, views_2
FROM buysell_product as b
ORDER BY views_2 DESC
LIMIT 5
Error:
Execution finished with errors.
Result: ORDER BY clause should come after UNION not before
At line 1:
...
They work separately but I need to merge them:
SELECT * FROM (
SELECT id, views_1, views_2, name
FROM buysell_product
ORDER BY views_1 DESC
LIMIT 5
)
UNION
SELECT * FROM (
SELECT id, views_1, views_2, name
FROM buysell_product as b
ORDER BY views_2 DESC
LIMIT 5
)
Result:
id | views_1 | views_2 | name
2 41 16 Excellent 2013 ford ecosport
3 72 10 Excellent Hyundai creta
5 39 39 iPhone 11 128gb
7 12 84 Excellent Hyundai creta sx
9 37 84 Volkswagen Polo 1.2 GT AMT 2017
44 34 81 Usupso Massage Anti Skid Slippers
45 15 75 Garlic Powder - 100Gm
57 35 11 Iphone 13 and 14
67 15 73 Universal Touch Screen Capacitive Stylus