Distribute records based on various percentages with tsql

I have a table of items with around 800k rows. I need to create a SQL statement that allows my users to pass in up to 5 percentages that total 100%. Each row is then assigned a group number according to those percentages.
For example, a user may request rows to be split using the following percentages (the user decides the percentages):
1. 20%, 20%, 30%, 30%
2. 12%, 12%, 12%, 12%, 52%
3. 30%, 30%, 40%
4. 100%
Based on above percentages, I need to return the following:
Field 1 | Field 2 | Group
--------------------------------
Data | Data | 1
Data | Data | 1
The group is a number corresponding to the percentages. For example, with percentage set 1 above, there would be 4 groups: group 1 holds the first 20% of all items selected, group 2 the next 20%, group 3 the next 30%, and group 4 the last 30%. Therefore, if there were a total of 200 records, group 1 should have 40 records, group 2 should have 40, group 3 should have 60, and group 4 should have 60.
Sorry if I'm over explaining this but trying to reduce any ambiguity in my question so it's clear.
This data is stored in Azure SQL so any solution provided can use anything Azure SQL and/or SQL 2016 (in most cases) offers.
Thanks in advance to the SQL geniuses out there that are sure to make me feel appreciative and inferior all at the same time! :)

Passing in the percentages is the hard part. The work is done by percent_rank():
with p as (
      -- cume_p is the cumulative percentage at which each group starts
      select ind, p, (sum(p) over (order by ind) - p) as cume_p
      from (values (1, 0.2), (2, 0.2), (3, 0.3), (4, 0.3)) v(ind, p)
     )
select t.*, v.grp
from (select t.*, percent_rank() over (order by ?) as pr
      from t
     ) t cross apply
     (select max(ind)
      from p
      where p.cume_p <= t.pr
     ) v(grp);
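To see what the query above computes, here is a minimal Python sketch of the same bucketing logic. The function name assign_groups is made up for illustration, and it assumes the rows are already in the desired order; it is not a replacement for the SQL, just a way to check the expected group sizes.

```python
from bisect import bisect_right
from itertools import accumulate

def assign_groups(n_rows, percentages):
    """Return a 1-based group number for each of n_rows ordered rows."""
    # Cumulative start of each group: [0.2, 0.2, 0.3, 0.3] -> roughly [0, 0.2, 0.4, 0.7]
    starts = [0.0] + list(accumulate(percentages))[:-1]
    groups = []
    for i in range(n_rows):
        # Same definition as percent_rank(): (rank - 1) / (rows - 1), 0 for a single row
        pr = i / (n_rows - 1) if n_rows > 1 else 0.0
        # Highest group whose start is <= pr, like max(ind) where cume_p <= pr
        groups.append(bisect_right(starts, pr))
    return groups

groups = assign_groups(200, [0.2, 0.2, 0.3, 0.3])
counts = [groups.count(g) for g in (1, 2, 3, 4)]
print(counts)  # [40, 40, 60, 60]
```

With 200 rows this reproduces the 40 / 40 / 60 / 60 split from the question.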

Related

Query smallest number of rows to match a given value threshold

I would like to create a query that operates similar to a cash register. Imagine a cash register full of coins of different sizes. I would like to retrieve a total value of coins in the fewest number of coins possible.
Given this table:
id | value
----------
1 | 100
2 | 100
3 | 500
4 | 500
5 | 1000
How would I query for a list of rows that:
has a total value of AT LEAST a given threshold
with the minimum excess value (value above the threshold)
in the fewest possible rows
For example, if my threshold is 1050, this would be the expected result:
id | value
----------
1 | 100
5 | 1000
I'm working with postgres and elixir/ecto. If it can be done in a single query great, if it requires a sequence of multiple queries no problem.

I had a go at this myself, using answers from previous questions:
Using ABS() to order by the closest value to the threshold
Select rows until a sum reduction of a single column reaches a threshold
Based on @TheImpaler's comment above, this prioritises the minimum number of rows over the minimum excess. It's not 100% what I was looking for, so I'm open to improvements if anyone has them, but if not I think this is going to be good enough:
-- inner subquery adds a running total, ordered by how close each value is to the threshold
-- outer query keeps rows until the running total first reaches the threshold
SELECT
    *
FROM (
    SELECT
        i.*,
        SUM(i.value) OVER (
            ORDER BY
                ABS(i.value - $THRESHOLD),
                i.id
        ) AS total
    FROM
        inputs i
) t
WHERE
    t.total - t.value < $THRESHOLD;
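The trade-off mentioned above is easier to see when the same logic is traced by hand. The following Python sketch mirrors the query (the name pick_rows is hypothetical): sort by distance to the threshold, then keep rows while the total before the current row is still below the threshold.

```python
def pick_rows(rows, threshold):
    """rows: list of (id, value). Returns the rows the query would keep."""
    # ORDER BY ABS(value - threshold), id
    ordered = sorted(rows, key=lambda r: (abs(r[1] - threshold), r[0]))
    picked, total = [], 0
    for row_id, value in ordered:
        # WHERE total - value < threshold: stop once the sum so far reaches the threshold
        if total >= threshold:
            break
        picked.append((row_id, value))
        total += value
    return picked

coins = [(1, 100), (2, 100), (3, 500), (4, 500), (5, 1000)]
picked = pick_rows(coins, 1050)
print(picked)  # [(5, 1000), (3, 500)]
```

For the sample table and a threshold of 1050 this returns two rows totalling 1500, rather than the ids 1 and 5 (total 1100) from the expected result. It uses the fewest rows but not the minimum excess, which is the limitation described above.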

fetch aggregate value along with data

I have a table with the following fields
ID,Content,QuestionMarks,TypeofQuestion
350, What is the symbol used to represent Bromine?,2,MCQ
758,What is the symbol used to represent Bromine? ,2,MCQ
2425,What is the symbol used to represent Bromine?,3,Essay
2080,A quadrilateral has four sides, four angles ,1,MCQ
2614,A circular cone has a curved surface area of ,2,MCQ
2520,Two triangles have sides 5 cm, 11 cm, 2 cm . ,2,MCQ
2196,Life supporting process mediated by water? ,2,Essay
I would like to get random questions where the total marks equal an input number.
For example, if I say 25, the result should be a random set of questions whose Sum(QuestionMarks) is 25 (+/-1).
Is this really possible using SQL?
select content,id,questionmarks,sum(questionmarks) from quiz_question
group by content,id,questionmarks;
Expected Input 25
Expected Result (Sum of Question Marks =25)
Update:
How do I ensure I get at least 2 Essay type questions? (This is just an example; I would extend this to other conditions.) Thank you for all the help.

S-Man's cumulative sum is the right approach. For your logic, though, I think you want to get up to the first row where the running total is 24 or more. That logic is:
where total - QuestionMarks < 24
If you have enough questions, then you could get exactly 25 using:
with q25 as (
      select *
      from (select t.*,
                   sum(QuestionMarks) over (order by random()) as running_questionmark
            from t
           ) t
      where running_questionmark < 25
     )
select q.ID, q.Content, q.QuestionMarks, q.TypeofQuestion
from q25 q
union all
(select t.ID, t.Content, t.QuestionMarks, t.TypeofQuestion
 from t cross join
      (select sum(QuestionMarks) as questionmark_25 from q25) x
 where not exists (select 1 from q25 where q25.id = t.id)
 order by abs(t.QuestionMarks - (25 - x.questionmark_25))
 limit 1
)
This selects questions whose running total stays below 25. It then tries to find one more question to bring the total to 25.
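The two steps can be sketched outside the database as well. This Python version is only an illustration under assumptions: pick_questions is a made-up name, marks are positive integers, and the rng parameter exists only so the shuffle can be made reproducible.

```python
import random

def pick_questions(questions, target=25, rng=random):
    """questions: list of (id, marks). Mirrors q25 plus the one-row top-up."""
    shuffled = list(questions)
    rng.shuffle(shuffled)  # ORDER BY random()
    picked, total = [], 0
    for q in shuffled:
        if total + q[1] >= target:  # the running total must stay below target
            break
        picked.append(q)
        total += q[1]
    picked_ids = {q[0] for q in picked}
    rest = [q for q in questions if q[0] not in picked_ids]
    if rest:
        # One extra question whose marks are closest to what is still missing
        picked.append(min(rest, key=lambda q: abs(q[1] - (target - total))))
    return picked

pool = [(i, 1) for i in range(40)]  # 40 one-mark questions
result = pick_questions(pool, 25, random.Random(0))
print(sum(marks for _, marks in result))  # 25
```

The total is exactly 25 only when an unpicked question with the missing number of marks exists; otherwise the top-up lands as close as it can.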

Suppose questionmark is of type integer. Then you want to get some records in random order whose questionmark sum is not more than 25.
You can use SUM() as a cumulative window function. The order is random, and the cumulative SUM() adds each current value to the previous sum. So you can filter where SUM() <= <your value>:
SELECT
    *
FROM (
    SELECT
        *,
        SUM(questionmark) OVER (ORDER BY random()) AS total
    FROM
        t
) s
WHERE total <= 25
Note:
This returns a list of records in random order whose marks sum to no more than 25; it stops at the first row that would push the total past 25.
Finding an exact match for your value is a combinatorial problem, which shouldn't be solved in a database, especially when there's a random factor. What if your current SUM is 22 and the next randomly chosen value is 4? Would you retry, maybe forever, until you randomly find a value of 3? Or would you try to remove an already counted record with a value of 1?
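If an exact total really is required, one option is to do the search in application code. The sketch below uses a standard subset-sum dynamic programming approach, which is a technique swapped in here for illustration and not something the answers above propose. It assumes marks are positive integers; exact_subset is a hypothetical name. Shuffling the input first would vary which subset is found.

```python
def exact_subset(questions, target):
    """Return a list of (id, marks) whose marks sum to target, or None."""
    # reachable[s] holds one set of questions whose marks add up to s
    reachable = {0: []}
    for q in questions:
        # Iterate over a snapshot so each question is used at most once
        for s, chosen in list(reachable.items()):
            new_sum = s + q[1]
            if new_sum <= target and new_sum not in reachable:
                reachable[new_sum] = chosen + [q]
    return reachable.get(target)

subset = exact_subset([(1, 22), (2, 4), (3, 3), (4, 1)], 25)
print(subset)  # [(1, 22), (3, 3)]
```

This handles the 22 + 4 case from the note: instead of retrying at random, it skips the 4-mark question and finds the 3-mark one.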

How do I SELECT minimum set of rows to cover all possible values of each columns in SQL?

I am running a SQL query to get data from a table in order to map all the different possible values of the categories represented by each column.
How do I write the SELECT query so that it returns the minimum number of rows needed to include every possible value of every column?
For example, if I have a table of 10 rows and 3 columns, each column containing 3 possible values:
TABLE sales
--------------------------------
brandID color size
--------------------------------
2 red big
3 blue big
2 blue big
2 red small
2 blue medium
3 green small
3 red big
1 green medium
2 red medium
2 blue big
Of course I could SELECT all rows from table without filter, but that would be an expensive query of 10 rows.
However, as you can see, if we filter the SELECT query to only return the following rows below, it is possible to cover all the possible values of all columns:
1,2,3 for brandID
red,blue,green for color
big,small,medium for size
--------------------------------
brandID color size
--------------------------------
3 blue big
2 red small
1 green medium
How do I do that in SQL query?

This one does what you expect:
select b.brandid, c.color, s.size
from (
  select brandid, row_number() over (order by brandid) as rn
  from sales
  group by brandid
) b
  full join (
    select color, row_number() over (order by color) as rn
    from sales
    group by color
  ) c on b.rn = c.rn
  full join (
    select size, row_number() over (order by size) as rn
    from sales
    group by size
  ) s on b.rn = s.rn;
Online example: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=e72e7d1dfed43825025c5703b5d3671a
But this only works properly if you have the same number of (distinct) brands, colors and sizes. If you have, for example, 5 brands, 6 colors and 7 sizes, the result is rather "strange":
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=4417a4d97ecf7601364f09d65f6522fa

First, a query that returns ten rows is not "expensive".
Second, this is a very hard problem. It involves looking at all combinations of rows to see whether the set covers every value of every column. I suspect that any algorithm will need to basically search through all possible combinations -- although there may be some efficiencies, such as automatically including all rows with a unique value in any column.
Because this is a hard problem that involves comparing zillions of sets, SQL is not really an appropriate language for addressing it.

This is a rather weird requirement... But you might try something along these lines:
DECLARE @sales TABLE(BrandID INT, color VARCHAR(10), size VARCHAR(10));
INSERT INTO @sales VALUES
 (2,'red',   'big'),
 (3,'blue',  'big'),
 (2,'blue',  'big'),
 (2,'red',   'small'),
 (2,'blue',  'medium'),
 (3,'green', 'small'),
 (3,'red',   'big'),
 (1,'green', 'medium'),
 (2,'red',   'medium'),
 (2,'blue',  'big');
WITH AllBrands AS (SELECT ROW_NUMBER() OVER(ORDER BY BrandID) AS RowInx, BrandID FROM @sales GROUP BY BrandID)
    ,AllColors AS (SELECT ROW_NUMBER() OVER(ORDER BY color) AS RowInx, color FROM @sales GROUP BY color)
    ,AllSizes AS (SELECT ROW_NUMBER() OVER(ORDER BY size) AS RowInx, size FROM @sales GROUP BY size)
SELECT COALESCE(b.RowInx,c.RowInx,s.RowInx) AS RowInx
      ,b.BrandID
      ,c.color
      ,s.size
FROM AllBrands b
FULL OUTER JOIN AllColors c ON COALESCE(b.RowInx,c.RowInx)=c.RowInx
FULL OUTER JOIN AllSizes s ON COALESCE(b.RowInx,c.RowInx,s.RowInx)=s.RowInx;
This solution is similar to @a_horse_with_no_name's, but avoids gaps in the result when the columns have unequal numbers of distinct values.
The idea in short:
We create a numbered set of all distinct values per column and join all sets on this number. As we don't know the counts in advance, I use COALESCE to pick the first row number that is not null.
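The same idea, numbering the distinct values of each column and lining them up by position, can be sketched in a few lines of Python. The function name distinct_values_table is made up for illustration. As with the SQL, the output rows are not necessarily rows that exist in the table; they are just a compact listing of every distinct value.

```python
from itertools import zip_longest

def distinct_values_table(rows):
    """rows: list of equal-length tuples. One output row per position in the longest column."""
    # Sorted distinct values per column, like ROW_NUMBER() OVER (ORDER BY col) on a GROUP BY
    columns = [sorted(set(col)) for col in zip(*rows)]
    # Line the columns up by position; shorter columns are padded with None (NULL)
    return list(zip_longest(*columns))

sales = [
    (2, "red", "big"), (3, "blue", "big"), (2, "blue", "big"),
    (2, "red", "small"), (2, "blue", "medium"), (3, "green", "small"),
    (3, "red", "big"), (1, "green", "medium"), (2, "red", "medium"),
    (2, "blue", "big"),
]
table = distinct_values_table(sales)
print(table)  # [(1, 'blue', 'big'), (2, 'green', 'medium'), (3, 'red', 'small')]
```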

This is not a good problem if you demand ONE AND ONLY ONE query and ONE AND ONLY ONE of each result set, and ONE AND ONLY ONE instance of each result. As Gordon Linoff accurately put it, that is not a problem for SQL. I get that maybe you have a MUCH larger table, but he's absolutely right.
But add another layer, and you can have exactly what you want, with all the efficiency you want, and a readable output. Use a cursor and some basic SELECT from dynamic SQL with a SELECT columns.name from sys.tables JOIN sys.columns ON tables.object_id = columns.object_id, if you absolutely have to do this with TSQL alone.
And if you're willing to build a basic application with any framework with a SQL driver, you can just SELECT DISTINCT FROM < and put the various results into arrays.
Alternatively: reword your question, with the understanding that the results of any SQL query are gonna be x rows by x columns. Not an array for each column.

I think your example confuses things by having exactly 3 values for each field, which makes the requested result seem like a reasonable thing to expect. But what happens when two more brands are added, or a new colour? Then what would you expect to be returned?
Really you are asking three questions, so I feel this should be done as three queries:
"What are the different brands?"
"What are the different colours?"
"What are the different sizes?"
If they need to be displayed in a neat table, stitch them together afterwards in your application layer. You could maybe do it in SQL with something like what a_horse_with_no_name suggests, but really it's the wrong place.
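As a rough sketch of the three-query approach, the following uses Python's built-in sqlite3 module with an in-memory copy of the sample table. SQLite is only a stand-in here for whatever database is actually in use, and the helper name distinct is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (brandID INTEGER, color TEXT, size TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(2, "red", "big"), (3, "blue", "big"), (2, "blue", "big"),
     (2, "red", "small"), (2, "blue", "medium"), (3, "green", "small"),
     (3, "red", "big"), (1, "green", "medium"), (2, "red", "medium"),
     (2, "blue", "big")],
)

def distinct(column):
    """Run one SELECT DISTINCT for a known column and return its values as a list."""
    # Column names cannot be bound as parameters, so only accept a fixed set
    if column not in ("brandID", "color", "size"):
        raise ValueError(f"unknown column: {column}")
    query = f"SELECT DISTINCT {column} FROM sales ORDER BY {column}"
    return [row[0] for row in conn.execute(query)]

brands, colours, sizes = distinct("brandID"), distinct("color"), distinct("size")
print(brands, colours, sizes)
```

Each question gets its own query and its own list, so adding a fourth brand or colour does not change the shape of the other results.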

SQL Server - selecting an item based on the previous counter value (same foreign key)

Not sure how to word this above so hopefully this will explain it better. I currently have a table of data as follows which is fetched using this query (the query is looking at a view)
CODE
SELECT
AppRunningPercentages.ProjectID,
AppRunningPercentages.AppID,
AppRunningPercentages.AppCounter,
AppRunningPercentages.PercentageComplete,
RunningPercentage= NULL
from AppRunningPercentages
where ProjectID = 123
DATA
ProjectID(FK) AppID AppCounter PercentageComplete RunningPercentage
123 1 1 50%
123 4 2 40%
123 7 3 10%
My SELECT statement returns the values shown above, but I am unsure how to display the RunningPercentage. For the scenario above, I would like the same SELECT statement to calculate it as shown below, but I am unsure how to achieve this running total.
RunningPercentage
0%
50%
90%
When AppCounter = 1, I want the RunningPercentage to display as 0, so that I can calculate a value correctly against the current percentage. It is effectively adding the previous percentages together, so when AppCounter = 1, it is looking for an AppCounter with the value of 0.
When AppCounter = 2, it will add the 0% and the 50% together (50%)
When AppCounter = 3, it will add the 0%, 50% and 40% together (90%)
......And so on
Thank you for any help with this

In SQL Server 2012+, you would use a cumulative sum:
select t.*,
       (sum(PercentageComplete) over (partition by projectid
                                      order by appcounter
                                     ) - PercentageComplete
       ) as RunningPercentage
from t;
Note: you can use a rows between clause instead of subtracting the value in the current row. I find subtracting the value in the current row to be simpler for this logic.
In earlier versions, you can use outer apply:
select t.*, coalesce(t2.RunningPercentage, 0) as RunningPercentage
from t outer apply
     (select sum(PercentageComplete) as RunningPercentage
      from t t2
      where t2.projectid = t.projectid and t2.appcounter < t.appcounter
     ) t2;
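Both queries compute a running total that excludes the current row. A small Python sketch of that logic, with the hypothetical name running_percentage and percentages held as plain integers, may help when checking results:

```python
from collections import defaultdict

def running_percentage(rows):
    """rows: list of (project_id, app_counter, pct). Appends the sum of earlier pcts to each row."""
    totals = defaultdict(int)  # one running total per project (PARTITION BY projectid)
    out = []
    for project_id, app_counter, pct in sorted(rows):  # ORDER BY appcounter within a project
        # Record the total before adding the current row, so the first row gets 0
        out.append((project_id, app_counter, pct, totals[project_id]))
        totals[project_id] += pct
    return out

result_rows = running_percentage([(123, 1, 50), (123, 2, 40), (123, 3, 10)])
print([r[3] for r in result_rows])  # [0, 50, 90]
```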

How can I add a running total of one column to an Access query?

I have a query that contains in one field the percentage of total sales corresponding to a specific product in the past 12 months. For example:
Product 1 - 38%
Product 2 - 25%
Product 3 - 16%
(...)
The records are sorted in descending order by the percentage column, and the sum of that has to be 100%. I want to create a new column that adds the previous percentages as a running total, like this:
Product 1 - 38% - 38%
Product 2 - 25% - 63%
Product 3 - 16% - 79%
(... until it reaches the last product and a 100% sub-total)
How could I do this?

If you have an ID field, or a date field, you can use a variation of this correlated subquery. Nz() turns the Null that the subquery returns for the first row into 0, so the first running sum is not Null.
SELECT t.*,
       t.productpct + Nz([prev_value], 0) AS RunningSum,
       (select sum([ProductPct])
        from test AS t2
        WHERE t2.ID < t.ID
       ) AS Prev_Value
FROM test AS t;
There are people who are way better at SQL than I, however if this helps or gives you your answer then great.
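For comparison, the expected numbers from the question can be reproduced with a running total that includes the current row. This is only a quick check in Python using the three sample percentages, not an Access solution.

```python
from itertools import accumulate

percentages = [38, 25, 16]  # already sorted in descending order, as in the question
running = list(accumulate(percentages))  # each entry is the sum of itself and all earlier rows
print(running)  # [38, 63, 79]
```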
