Distribution of data in buckets - Oracle 11g - sql

I have a table with two columns, BRANCH and ACTIVITIES, where BRANCH is the unique id of a location and ACTIVITIES is the number of records belonging to that BRANCH. These records need to be distributed into 5 buckets so that all buckets contain an almost equal number of records (a difference of +/-1000 does not matter).
The challenge is that if a branch is assigned to a bucket, then all activities of that branch must go into the same bucket; in other words, the activities belonging to one BRANCH cannot be split. Let's take a very simple example to explain what I am trying to achieve:
Total Branches=10
Total Number of activities (records) = 55,000
Average (total activities/total buckets) = 11,000
Sample Data
After Distribution
All buckets contain 11,000 records, but things are not so straightforward when we look at real data.
All Oracle query masters are requested to please look into this. Your expert opinion will be highly appreciated.

Unfortunately, this is a bin-packing problem and a "perfect" solution requires -- essentially -- searching through all possible assignments of buckets and then choosing the "best" one. And such an approach is not really suitable for SQL.
For a "good-enough" solution, though, something like a round-robin approach often works well. Simply enumerate the branches from biggest to smallest and assign them to buckets:
select a.branch,
1 + mod(seqnum, 5) as bucket
from (select a.branch, count(*) as cnt,
row_number() over (order by count(*) desc) as seqnum
from activities a
group by a.branch
) a;
Because the branches are ordered by size, the same bucket receives the largest branch of every group of five, so this generally creates buckets of different sizes. A slight variation assigns the buckets in the order 1-2-3-4-5-5-4-3-2-1:
select a.branch,
-- seqnum starts at 1, so subtract 1 to begin the cycle at bucket 1
(case when mod(seqnum - 1, 10) in (0, 9) then 1
when mod(seqnum - 1, 10) in (1, 8) then 2
when mod(seqnum - 1, 10) in (2, 7) then 3
when mod(seqnum - 1, 10) in (3, 6) then 4
when mod(seqnum - 1, 10) in (4, 5) then 5
end) as bucket
from (select a.branch, count(*) as cnt,
row_number() over (order by count(*) desc) as seqnum
from activities a
group by a.branch
) a;
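To see how the 1-2-3-4-5-5-4-3-2-1 pattern behaves, here is a minimal sketch run against ten made-up branches with 1,000 to 10,000 activities. It assumes one row per branch with a precomputed ACTIVITIES total, as in the question, rather than one row per activity. The sample_data, ranked and assigned names are placeholders, and the CASE expression computes the bucket arithmetically from a zero-based position instead of listing every remainder.

```sql
-- Assumption: sample_data stands in for the real BRANCH / ACTIVITIES table.
with sample_data (branch, activities) as (
  select level, level * 1000 from dual connect by level <= 10
),
ranked as (
  select branch, activities,
         row_number() over (order by activities desc, branch) as seqnum
  from sample_data
),
assigned as (
  select branch, activities,
         -- positions 0..4 map to buckets 1..5, positions 5..9 map back to 5..1
         case when mod(seqnum - 1, 10) < 5
              then mod(seqnum - 1, 10) + 1
              else 10 - mod(seqnum - 1, 10)
         end as bucket
  from ranked
)
select bucket,
       count(*)        as branches,
       sum(activities) as total_activities
from assigned
group by bucket
order by bucket;
```

With this evenly spaced sample every bucket ends up with two branches and 11,000 activities. Real data will be less regular, so the per-bucket totals are worth checking against the acceptable tolerance.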

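You could also try the query below. The inline view stats_cols_added_tab adds some statistics columns and splits the branches into a lower and an upper half (grp). I then apply the dense_rank analytic function within each half, ascending for the lower half and descending for the upper half, so that small and large branches are paired up. Finally, the NTILE analytic function produces the five groups.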
with sample_data (branch, activities) as (
select 1, 1000 from dual union all
select 2, 2000 from dual union all
select 3, 3000 from dual union all
select 4, 4000 from dual union all
select 5, 5000 from dual union all
select 6, 6000 from dual union all
select 7, 7000 from dual union all
select 8, 8000 from dual union all
select 9, 9000 from dual union all
select 10, 10000 from dual
)
,
stats_cols_added_tab as (
select s.*
, count(*)over() total_branches
, sum(activities)over() total_number_of_activities
, avg(activities)over() * 2 Average
, case when row_number()over(order by s.branch) <= count(*)over() / 2 then 1 else 2 end grp
from sample_data s
)
SELECT BRANCH, ACTIVITIES, NTILE(5) OVER (ORDER BY ranked_grp, BRANCH) AS bucket
FROM (
select BRANCH, ACTIVITIES
, dense_rank()over(
PARTITION BY grp
order by decode(grp, 1, activities, -1 * activities)
) ranked_grp
from stats_cols_added_tab t
) t
order by ranked_grp, BRANCH
;

Related

Recursive ORDER BY

I have a USERS table which is a membership matrix like below. Table is unique on ID, and each ID belongs to at least one group, but could belong to all 3.
SELECT 1 AS ID, 0 AS IS_A, 0 AS IS_B, 1 AS IS_C FROM DUAL UNION ALL
SELECT 2,0,1,0 FROM DUAL UNION ALL
SELECT 3,0,1,1 FROM DUAL UNION ALL
SELECT 4,1,1,0 FROM DUAL UNION ALL
SELECT 5,1,1,0 FROM DUAL UNION ALL
SELECT 6,1,1,1 FROM DUAL UNION ALL
SELECT 7,0,1,1 FROM DUAL UNION ALL
SELECT 8,0,0,1 FROM DUAL UNION ALL
SELECT 9,1,0,0 FROM DUAL UNION ALL
SELECT 10,1,0,1 FROM DUAL UNION ALL
SELECT 11,0,0,1 FROM DUAL UNION ALL
SELECT 12,0,1,1 FROM DUAL
The final goal is to SELECT randomly a sample of at least 4 users from A, 3 from B and 5 from C (just an example) but with exactly 10 distinct IDs (otherwise the solution is trivial; just SELECT *).
The focus is less on determining whether this is possible at all and more on making a best effort to maximize memberships.
The output is expected to be unique on ID.
I can only think of a procedural way to achieve this:
Take the first ID with MAX(IS_A+IS_B+IS_C)
Check if the quotas are reached
If, for example, we already have 4 users from A, then we'll continue with the next ID with MAX(IS_B+IS_C), completely ignoring any further contributions from IS_A column
If we have already achieved all quotas, revert back to taking MAX(IS_A+IS_B+IS_C) to get "bonus" points
Stop upon reaching the overall maximum of 10
In essence, we prioritize and incrementally take the ID that has the most memberships in groups that have not reached the quota
However, I can't figure out how to do this in Oracle SQL since the ORDER BY would depend on not just the current row's values, but also recursively on whether the earlier rows have filled up the respective quotas.
I've tried ROWNUM, ROW_NUMBER(), SUM(IS_A) OVER (ORDER BY ...), and a recursive CTE, but to no avail. The best I have is:
WITH CTE AS (
SELECT ID, IS_A, IS_B, IS_C
, ROW_NUMBER() OVER (ORDER BY IS_A+IS_B+IS_C DESC) AS RN
FROM USERS
)
, CTE2 AS (
SELECT CTE.*
, GREATEST(4 - SUM(IS_A) OVER (ORDER BY RN), 0.001) AS QUOTA_A --clip negatives to 0.001
, GREATEST(3 - SUM(IS_B) OVER (ORDER BY RN), 0.001) AS QUOTA_B --so that when all quotas are exhausted,
, GREATEST(5 - SUM(IS_C) OVER (ORDER BY RN), 0.001) AS QUOTA_C --we still prioritize those that contribute most number of concurrent memberships
FROM CTE
)
SELECT ID FROM CTE2
ORDER BY QUOTA_A*IS_A + QUOTA_B*IS_B + QUOTA_C*IS_C DESC
FETCH NEXT 10 ROWS ONLY
but it does not work because QUOTA_A is computed based on ORDER BY RN instead of recursively.
Thanks in advance!
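The procedural steps listed in the question can be sketched as a PL/SQL anonymous block instead of a single query. This is only an illustration under assumptions: a USERS table with an integer ID and 0/1 columns IS_A, IS_B, IS_C, the example quotas 4/3/5 and the limit of 10 from the question. The weight of 10 is an arbitrary choice that makes memberships in still-open groups dominate the score. It is a greedy best effort and does not guarantee that every quota is met.

```sql
declare
  need_a     pls_integer := 4;   -- remaining quota for group A
  need_b     pls_integer := 3;   -- remaining quota for group B
  need_c     pls_integer := 5;   -- remaining quota for group C
  type t_taken is table of boolean index by pls_integer;
  taken      t_taken;            -- IDs picked so far
  best_id    pls_integer;
  best_score pls_integer;
  best_a     pls_integer;
  best_b     pls_integer;
  best_c     pls_integer;
  score      pls_integer;
begin
  for pick in 1 .. 10 loop
    best_id    := null;
    best_score := -1;
    -- Scan in random order so that ties are broken randomly.
    for u in (select id, is_a, is_b, is_c
              from users
              order by dbms_random.value) loop
      if not taken.exists(u.id) then
        -- Memberships in groups with an open quota count 10 each;
        -- total memberships act as the "bonus" tie-breaker.
        score := 10 * (case when need_a > 0 then u.is_a else 0 end
                     + case when need_b > 0 then u.is_b else 0 end
                     + case when need_c > 0 then u.is_c else 0 end)
                 + u.is_a + u.is_b + u.is_c;
        if score > best_score then
          best_score := score;
          best_id    := u.id;
          best_a     := u.is_a;
          best_b     := u.is_b;
          best_c     := u.is_c;
        end if;
      end if;
    end loop;
    exit when best_id is null;   -- fewer than 10 users available
    taken(best_id) := true;
    need_a := need_a - best_a;
    need_b := need_b - best_b;
    need_c := need_c - best_c;
    dbms_output.put_line(best_id);
  end loop;
end;
/
```

The block rescans the table once per pick, which is fine for a small matrix but would need a different design for large tables. The selected IDs are written with dbms_output, so server output has to be enabled in the client to see them.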

ORACLE SQL group based on values in a reference table

The Customer and Acct tables have global scope: they share and increment the same SEQ_NO value.
Below is the customer table. SEQ_NO 1 is the beginning of one customer's data, and SEQ_NO 238 is the beginning of another customer's data.
The other table is the account table. All accounts whose SEQ_NO falls inside a customer's boundary should get the same group (I want to group those accounts under the same customer so that I can use listAgg to concatenate the account ids). For example, the accounts below with SEQ_NO from 2 to 224 (inclusive) should be assigned to the same group.
Is there a way to do that in SQL? The worst case I was thinking of is to define an Oracle type and use a function to do it.
Any help is appreciated.
If I understand your question correctly, you want to be able to assign rows in the account table to groups, one per customer, so that you can then aggregate based on these groups.
So, the question is how to identify to which customer each account belongs, based on the sequence boundaries given in the first table ("customer") and the specific account numbers in the second table ("account").
This can be done in plain SQL, and relatively easily. You need a join between the accounts table and a subquery based on the customers table. The subquery must show the first and the last sequence number allocated to each client; to do that, you can use the lead analytic function. A bit of care must be taken regarding the last customer, for whom there is no upper limit for the sequence numbers.
You didn't provide test data in a usable format, so I created sample data in the with clause below (which is not part of the query - it's just there as a placeholder for test data).
with
customer (cust_id, seq_no) as (
select 101, 1 from dual union all
select 102, 34 from dual union all
select 200, 58 from dual union all
select 130, 90 from dual
)
, account (acct_id, seq_no) as (
select 1003, 3 from dual union all
select 1005, 11 from dual union all
select 1007, 33 from dual union all
select 1008, 60 from dual union all
select 1103, 77 from dual union all
select 1140, 92 from dual union all
select 1145, 99 from dual
)
select c.cust_id,
listagg(a.acct_id, ',') within group (order by a.acct_id) as acct_list
from (
select cust_id, seq_no as lower_no,
lead(seq_no) over (order by seq_no) - 1 as upper_no
from customer
) c
left outer join account a
on a.seq_no between c.lower_no and nvl(c.upper_no, a.seq_no)
group by c.cust_id
order by c.cust_id
;
OUTPUT
CUST_ID ACCT_LIST
------- --------------------
101 1003,1005,1007
102
130 1140,1145
200 1008,1103

Put results into a group of 2, or any number I specify

I need a way to put results into groups of a size that I specify.
I have tried the ntile() function, which I thought would do this, but it's not working:
WITH CTE AS (
SELECT 1 as Number
UNION ALL
SELECT Number+1
FROM CTE
WHERE Number < 100
)
SELECT *, ntile(80) over (order by number desc) as 'test'
FROM CTE
For the expected results, the Quartile column should output a number for every 2 entries (as specified in NTILE(80)), but it can be 2, 4, 10, or any number I specify.
Maybe NTILE() is not the right function but is there a function that does what I want?
So, if I specify 3, then the result should group every 3 records. If I specify 15, then the result should group every 15 records and move onto next group.
Hope I'm being clear
...should output a number for every 2 entries...
No, you have 100 entries and you want to divide them into 80 groups. You'll get some groups with 1 entry and other groups with 2 entries.
Read the definition of NTILE(). If you want groups with 2 entries, you can do it as shown below by dividing the rows into 50 groups:
WITH recursive
CTE AS (
SELECT 1 as Number
UNION ALL
SELECT Number + 1
FROM CTE
WHERE Number < 100
)
SELECT *,
ntile(50) -- changed here
over (order by number desc) as test
FROM CTE
You didn't say what database engine you are using, so I assumed PostgreSQL.
I think you simply want the modulus operator:
WITH CTE AS (
SELECT 1 as Number
UNION ALL
SELECT Number+1
FROM CTE
WHERE Number < 100
)
SELECT cte.*,
(ROW_NUMBER() OVER (ORDER BY Number DESC) - 1) % 3 -- or however many groups that you want
FROM CTE
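Note that the modulus expression above cycles through 0, 1, 2 row by row, which gives three interleaved groups. If the goal is consecutive blocks of a fixed size, as "group every 3 records and move onto next group" suggests, integer division of the row number is one alternative. The sketch below uses Oracle syntax to generate the 100 test rows and the column name n (both are assumptions, since the thread does not name the database); it also orders ascending for readability. The grouping expression itself should carry over to other engines.

```sql
-- Assumption: Oracle row generator; n replaces the "Number" column of the question.
with cte (n) as (
  select level from dual connect by level <= 100
)
select n,
       -- change 3 to the desired group size
       floor((row_number() over (order by n) - 1) / 3) + 1 as grp
from cte
order by n;
-- rows 1-3 get grp 1, rows 4-6 get grp 2, ..., row 100 gets grp 34
```

Subtracting 1 before dividing keeps the first block full: without it, rows 1 and 2 would form a short first group.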

SQL - Select by status

I need to do the following selection.
I have a list of 100,000 users with different statuses (1 - 4). I need to choose 10,000 users out of this list: first all with status 4, but if there are fewer than 10,000 with status 4, then also those with status 3, then with status 2, and then with status 1.
Help is highly appreciated.
Thanks for the replies. I tried to go with Gordon's version. As I need a union, I now have the following. But this only lets me prioritize by score for the second selection (%#test2.com), whereas I need it for every selection that I create. If I put the "order by" before the union, I receive an invalid syntax error:
SELECT TOP 10000 *
FROM [table]
WHERE Email like '%#test1.com' and score in ( 1, 2, 3, 4)
UNION
SELECT TOP 5000 *
FROM [table]
WHERE Email like '%#test2.com' and score in ( 1, 2, 3, 4)
order by score desc
This is a prioritization query. You can handle it using the ANSI standard function row_number():
select t.*
from (select t.*,
row_number() over (order by status desc) as seqnum
from t
where status in (1, 2, 3, 4)
) t
where seqnum <= 10000;
You can also simplify this to:
select t.*
from t
where status in (1, 2, 3, 4)
order by status desc
fetch first 10000 rows only;
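For the follow-up with several selections, one possible approach is to apply the row_number() prioritization inside each branch of the union, so that each selection is ranked and capped on its own. This is a sketch only: the table name t and the columns email and score are taken from the follow-up query and are otherwise assumptions.

```sql
-- Assumption: table t has columns email and score.
select x.*
from (select t.*,
             row_number() over (order by score desc) as seqnum
      from t
      where email like '%#test1.com'
        and score in (1, 2, 3, 4)
     ) x
where x.seqnum <= 10000      -- best 10,000 rows of the first selection
union all
select y.*
from (select t.*,
             row_number() over (order by score desc) as seqnum
      from t
      where email like '%#test2.com'
        and score in (1, 2, 3, 4)
     ) y
where y.seqnum <= 5000;      -- best 5,000 rows of the second selection
```

Because the two email filters cannot match the same row, union all is enough here and avoids the duplicate-elimination step that union performs.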

How to count the number of times an element appears consecutively in a table in Teradata?

I have a table that looks like this
ID, Order, Segment
1, 1, A
1, 2, B
1, 3, B
1, 4, C
1, 5, B
1, 6, B
1, 7, B
1, 8, B
Basically, with the data ordered by the Order column, I would like to find the number of consecutive B's for each ID. Ideally the output I would like is
ID, Consec
1, 2
1, 4
Because the segment B appears consecutively in row 2 and 3 (2 times), and then again in row 5,6,7,8 (4 times).
I can't think of a solution in SQL since there is no loop facility in SQL.
Are there elegant solutions in Teradata SQL?
P.S. The data I am dealing with has ~20 million rows.
The way to do it in R has been published here.
How to count the number of times an element appears consecutively in a data.table?
It is easy to do with analytic functions. While I don't know anything about Teradata, a quick search suggests that it does support analytic functions.
In any case, I've tested the following in Oracle --
select id,
count(*)
from (select x.*,
row_number() over(partition by id order by ord) -
row_number() over(partition by id, seg order by ord) as grp
from tbl x) x
where seg = 'B'
group by id, grp
order by grp
The trick is establishing the 'groups' of Bs.
Fiddle: http://sqlfiddle.com/#!4/4ed6c/2/0
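To make the grouping trick visible, the sketch below lists the two row numbers and their difference for the sample rows from the question. It is written in Oracle syntax like the answer, and reuses the answer's names tbl, ord and seg; the inline sample data is a stand-in for the real table.

```sql
with tbl (id, ord, seg) as (
  select 1, 1, 'A' from dual union all
  select 1, 2, 'B' from dual union all
  select 1, 3, 'B' from dual union all
  select 1, 4, 'C' from dual union all
  select 1, 5, 'B' from dual union all
  select 1, 6, 'B' from dual union all
  select 1, 7, 'B' from dual union all
  select 1, 8, 'B' from dual
)
select id, ord, seg,
       row_number() over (partition by id order by ord)      as rn_all,
       row_number() over (partition by id, seg order by ord) as rn_seg,
       row_number() over (partition by id order by ord)
     - row_number() over (partition by id, seg order by ord) as grp
from tbl
order by id, ord;
-- B rows at ord 2-3: rn_all 2,3     rn_seg 1,2     => grp 1 (2 rows)
-- B rows at ord 5-8: rn_all 5,6,7,8 rn_seg 3,4,5,6 => grp 2 (4 rows)
```

Within an unbroken run of the same segment both counters advance together, so their difference stays constant. A row of another segment in between advances only rn_all, which shifts the difference for the next run.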