Distribute numbers as close to possible - sql

This seems to be a 2 step problem I'm trying to solve.
Let's say we have N records, and we are trying to distribute as evenly as possible into K groups.
The second problem - each group in K can only accept an M amount of records.
For example, if we have 5 records, and 3 groups, then we would distribute 2 into Group K1, 2 into Group K2 and 1 record into Group K3. However, if say in group 1, it only accepts at most 1 record. Then the arrangement would need to be 1 into Group K1, 2 into Group K2, and 2 into Group K3.
I'm not necessary after the solution but what algorithm I might need to use to solve this? Apparently for the distribution, I need to use the Greedy algorithm? But for the second step, this seems to be a bit more complicated
Edit:
The example I'm looking at is:
Number of records: 23
Groups: 10
Max records for each group
G1 = 4
G2 = 1
G3 = 0
G4 = 5
G5 = 0
G6 = 0
G7 = 2
G8 = 4
G9 = 2
G10 = 2

if N=12 and K=3 then in normal situation,you just split it V=12/3=4 for each group. but since you have M limitation, and for example K3 can only accept 1 then the distribution can be 6-5-1 which is not evenly distributed.
So i guess you need to sort K based on the M limitation, so for the example above the groups order become K3-K1-K2.
then if the distributed value V is bigger than the accepted amount M for that group, you need to take the remainder and distribute it again to the remaining group (K3=1, then 4-1=3 must be distributed to K1 and K2).
the implementation might be complicated, i hope you can find more simple solution for this

From what I understood, you need to separate all groups which allows a fixed number of values first and then equally distribute records among remaining groups. Let's take an example, let's say we have 15 records which needs to be distributed among 5 groups (G1, G2, G3, G4 and G5). Also let's assume that G2 and G4 allows max records of 2 and 4 respectively. Now algorithm should go like this:
Get average(ceiling integer) of records based on number of groups (In this example we'll get 3).
Add all max allowed records which are smaller than our average (In this example it's G2 only who's max limit(i.e. 2) is less than our average hence the number comes as 2).
Now subtract our number from step 2 from total records and also subtract the number of groups involved in step 2 from total groups. (remaining total records: 13, remaining total groups 4).
Get the new average(ceiling integer) using remaining records and groups. (New average 4).
Get average (Integer) (i.e. 3) and allot equal number of records to remaining groups - 1.
Get Mod (i.e. 1) and allot that number to the last group.
Now what we finally will have here:
G1(No limit): 4
G2(Limit 2): 2
G3(No limit): 4
G4(Limit 4): 4
G5(No limit): 1
Let me know if you think that this algo might fail for some scenarios.
Formula to get ceiling integer average
floor((#total_records + #total_groups-1) / #total_groups)

Related

sql more complicated querying measurements

I have two tables (sql server), as shown below:
locations
id cubicfeet order
-------------------------------------
1 5 1
2 10 1
3 6 1
items
id cubic feet order
--------------------------------------
1 6 1
2 6 1
3 6 1
I need a query to tell me if all the items will fit into all the locations (for a given order). If all items will not fit into 1 or all locations then I need to create a new location for that given order - and then move any items that DID fit into the locations before to the new location (as many as fit). The new location will only be given a certain amount of cubic feet also - say 17. In this example, sum won't work because all 3 records are 6 so the sum is 18, which is less than the sum of 5,10,6, but the location with volume 5 can't fit any of the items since they are all volume 6 cubic feet.
the only way I think I can do it is creating temp tables in my sp and using a while loop to go through them and update the locations 1 at a time to see if it still fits more...

How do I remove contiguous sequences of almost identical records from database

I have a SQL Server database containing real-time stock quotes.
There is a Quotes table containing what you would expect-- a sequence number, ticker symbol, time, price, bid, bid size, ask, ask size, etc.
The sequence number corresponds to a message that was received containing data for a set of ticker symbols being tracked. A new message (with a new, incrementing sequence number) is received whenever anything changes for any of the symbols being tracked. The message contains data for all symbols (even for those where nothing changed).
When the data was put into the database, a record was inserted for every symbol in each message, even for symbols where nothing changed since the prior message. So a lot of records contain redundant information (only the sequence number changed) and I want to remove these redundant records.
This is not the same as removing all but one record from the entire database for a combination of identical columns (already answered). Rather, I want to compress each contiguous block of identical records (identical except for sequence number) into a single record. When finished, there may be duplicate records but with differing records between them.
My approach was to find contiguous ranges of records (for a ticker symbol) where everything is the same except the sequence number.
In the following sample data I simplify things by showing only Sequence, Symbol, and Price. The compound primary key would be Sequence+Symbol (each symbol appears only once in a message). I want to remove records where Price is the same as the prior record (for a given ticker symbol). For ticker X it means I want to remove the range [1, 6], and for ticker Y I want to remove the ranges [1, 2], [4, 5] and [7, 7]:
Before:
Sequence Symbol Price
0 X $10
0 Y $ 5
1 X $10
1 Y $ 5
2 X $10
2 Y $ 5
3 X $10
3 Y $ 6
4 X $10
4 Y $ 6
5 X $10
5 Y $ 6
6 X $10
6 Y $ 5
7 X $11
7 Y $ 5
After:
Sequence Symbol Price
0 X $10
0 Y $ 5
3 Y $ 6
6 Y $ 5
7 X $11
Note that (Y, $5) appears twice but with (Y, $6) between.
The following generates the ranges I need. The left outer join ensures I select the first group of records (where there is no earlier record that is different), and the BETWEEN is intended to reduce the number of records that need to be searched to find the next-earlier different record (the results are the same without the BETWEEN, but slower). I would need only to add something like "DELETE FROM Quotes WHERE Sequence BETWEEN StartOfRange AND EndOfRange".
SELECT
GroupsOfIdenticalRecords.Symbol,
MIN(GroupsOfIdenticalRecords.Sequence)+1 AS StartOfRange,
MAX(GroupsOfIdenticalRecords.Sequence) AS EndOfRange
FROM
(
SELECT
Q1.Symbol,
Q1.Sequence,
MAX(Q2.Sequence) AS ClosestEarlierDifferentRecord
FROM
Quotes AS Q1
LEFT OUTER JOIN
Quotes AS Q2
ON
Q2.Sequence BETWEEN Q1.Sequence-100 AND Q1.Sequence-1
AND Q2.Symbol=Q1.Symbol
AND Q2.Price<>Q1.Price
GROUP BY
Q1.Sequence,
Q1.Symbol
) AS GroupsOfIdenticalRecords
GROUP BY
GroupsOfIdenticalRecords.Symbol,
GroupsOfIdenticalRecords.ClosestEarlierDifferentRecord
The problem is that this is way too slow and runs out of memory (crashing SSMS- remarkably) for the 2+ million records in the database. Even if I change "-100" to "-2" it is still slow and runs out of memory. I expected the "ON" clause of the LEFT OUTER JOIN to limit the processing and memory usage (2 million iterations, processing about 100 records each, which should be tractable), but it seems like SQL Server may first be generating all combinations of the 2 instances of the table, Q1 and Q2 (about 4e12 combinations) before selecting based on the criteria specified in the ON clause.
If I run the query on a smaller subset of the data (for example, by using "(SELECT TOP 100000 FROM Quotes) AS Q1", and similar for Q2), it completes in a reasonable amount time. I was trying to figure out how to automatically run this 20 or so times using "WHERE Sequence BETWEEN 0 AND 99999", then "...BETWEEN 100000 AND 199999", etc. (actually I would use overlapping ranges such as [0,99999], [99900, 199999], etc. to remove ranges that span boundaries).
The following generates sets of ranges to split the data into 100000 record blocks ([0,99999], [100000, 199999], etc). But how do I apply the above query repeatedly (once for each range)? I keep getting stuck because you can't group these using "BETWEEN" without applying an aggregate function. So instead of selecting blocks of records, I only know how to get MIN(), MAX(), etc. (single values) which does not work with the above query (as Q1 and Q2). Is there a way to do this? Is there totally different (and better) approach to the problem?
SELECT
CONVERT(INTEGER, Sequence / 100000)*100000 AS BlockStart,
MIN(((1+CONVERT(INTEGER, Sequence / 100000))*100000)-1) AS BlockEnd
FROM
Quotes
GROUP BY
CONVERT(INTEGER, Sequence / 100000)*100000
You can do this with a nice little trick. The groups that you want can be defined as the difference between two sequences of numbers. One is assigned for each symbol in order by sequence. The other is assigned for each symbol and price. This is what is looks like for your data:
Sequence Symbol Price seq1 seq2 diff
0 X $10 1 1 0
0 Y $ 5 1 1 0
1 X $10 2 2 0
1 Y $ 5 2 2 0
2 X $10 3 3 0
2 Y $ 5 3 3 0
3 X $10 4 4 0
3 Y $ 6 4 1 3
4 X $10 5 5 0
4 Y $ 6 5 2 3
5 X $10 6 6 0
5 Y $ 6 6 3 3
6 X $10 7 7 0
6 Y $ 5 7 4 3
7 X $11 8 1 7
7 Y $ 5 8 5 3
You can stare at this and figure out that the combination of symbol, diff, and price define each group.
The following puts this into a SQL query to return the data you want:
select min(q.sequence) as sequence, symbol, price
from (select q.*,
(row_number() over (partition by symbol order by sequence) -
row_number() over (partition by symbol, price order by sequence)
) as grp
from quotes q
) q
group by symbol, grp, price;
If you want to replace the data in the original table, I would suggest that you store the results of the query in a temporary table, truncate the original table, and then re-insert the values from the temporary table.
Answering my own question. I want to add some additional comments to complement the excellent answer by Gordon Linoff.
You're right. It is a nice little trick. I had to stare at it for a while to understand how it works. Here's my thoughts for the benefit of others.
The numbering by Sequence/Symbol (seq1) always increases, whereas the numbering by Symbol/Price (seq2) only increases sometimes (within each group, only when a record for Symbol contains the group's Price). Therefore seq1 either remains in lock step with seq2 (i.e., diff remains constant, until either Symbol or Price changes), or seq1 "runs away" from seq2 (while it is busy "counting" other Prices and other Symbols-- which increases the difference between seq1 and seq2 for a given Symbol and Price). Once seq2 falls behind, it can never "catch up" to seq1, so a given value of diff is never seen again once diff moves to the next larger value (for a given Price). By taking the minimum value within each Symbol/Price group, you get the first record in each contiguous block, which is exactly what I needed.
I don't use SQL a lot, so I wasn't familiar with the OVER clause. I just took it on faith that the first clause generates seq1 and the second generates seq2. I can kind of see how it works, but that's not the interesting part.
My data contained more than just Price. It was a simple thing to add the other fields (Bid, Ask, etc.) to the second OVER clause and the final GROUP BY:
row_number() over (partition by Symbol, Price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time order by Sequence)
group by Symbol, grp, price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time
Also, I was able to use use >MIN(...) and <=MAX(...) to define ranges of records to delete.

SPSS Compute Variable

Below is some data:
Test Day1 Day2 Score
A 1 2 100
B 1 3 62
C 3 4 90
D 2 4 20
E 4 5 80
I am trying to take the values from column 'day' and 'day2' and use them to select the row number for the column score. For example for Test A I would like to find the sum of 100 and 62 because that is the values of the first and second rows of score. Test B I would like to find the sum of 100, 62 and 90.
Is their anyway to do this in the Compute Variable window? Found in the menu Transform-Compute Variable?
I tried the following:
Score(MEAN(VALUE(Day1), VALUE(DAY2)))
This is not the proper way to call the cell location of Score and I received an error.
Can anyone help?
Thank you!
You really have two different datasets here. One is a dataset of scores numbered 1 through 5.
The other is a dataset that includes indexes into the score dataset. So the steps would be something like this.
First take the scores dataset and transpose it so that it has one row and 5 columns (Data>Transpose)
Then match that dataset to each case in the main dataset (Data>Merge Files>Add Variables).
Next you have to resort to using syntax directly.
You would declare a vector for the scores (VECTOR)
Finally, you use COMPUTE to index into the scores.
For your real problem, I suppose that you might have batches of scores and maybe there are some gaps. The Restructure Data Wizard can help you generalize this - convert cases into variables, but let's not go there yet.
HTH,
Jon Peck

How to generate sequences with distinct subsums?

I'm looking for a way to generate some (6 for default) equations where all subsums are unique.
For example,
a+b+c=50
d+e+f=50
g+h+i=50
a, d and g have to be distinct.
a+b and d+e have to be distinct.
e+f and h+i have to be distinct.
a+c and d+f have to be distinct.
But, a+b and e+f can be the same. So I only care about the subsums of aligned parameters..
I could only found ways to check whether some sequence is subsum-distinct, but I found nothing on how to generate such a sequence..
You didn't state whether you need it to be a random sequence, so suppose that this is not required.
One simple approach is this:
1 + 2 + 47 = 50
3 + 4 + 43 = 50
5 + 6 + 39 = 50
7 + 8 + 35 = 50
9 + 10 + 31 = 50
11 + 12 + 27 = 50
First two numbers are 2 smallest available numbers, the third number is final sum - those numbers.
a and b are always increasing, c is always decreasing
a + b is always increasing, b + c and a + c are always decreasing
You can generate it this way in a loop.
EDIT after comment that it has to be a random sequence:
Possibly you could create several sets (some sort of hashset/hashmap would be the most appropriate)
set of first summands
set of sums of first and second summands
set of sums of second and third summands
set of sums of first and third summands
set of previously generated triples
You would generate random triples this way:
If total number of demanded triples was not achieved generate a random triple, otherwise finish.
Check if the triple was not previously generated, if not proceed with step 3.
Conduct checks for first four sets. If no sums are contained within those sets, add triple and proceed with step 1.
However, I am not sure if this approach guarantees that you will get results (especially in small final sums).
So, I would add an counter, if too many consecutive attempts are not successful, then I would switch to brute force approach (which should not be problem if final sums are small and on other hand is very unlikely to happen if a final sum is large).
Overall, performance should be good.

how to find Sum(field) in condition ie "select * from table where sum(field) < 150"

I have to retrieve only particular records whose sum value of size field is <=150.
I have table like below ...
userid size
1 70
2 100
3 50
4 25
5 120
6 90
The output should be ...
userid size
1 70
3 50
4 25
For example, if we add 70,50,25 we get 145 which is <=150.
How would I write a query to accomplish this?
Here's a query which will produce the above results:
SELECT * FROM `users` u
WHERE (select sum(size) from `users` where size <= u.size order by size) < 150
ORDER BY userid
However, the problem you describe of wanting the selection of users which would most closely fit into a given size, is a bin packing problem. This is an NP-Hard problem, and won't be easily solved with ANSI SQL. However, the above seems to return the right result, but in fact it simply starts with the smallest item, and continues to add items until the bin is full.
A general, more effective bin packing algorithm would is to start with the largest item and continue to add smaller ones as they fit. This algorithm would select users 5 and 4.
What you're looking for is a greedy algorithm. You can't really do this with one SQL statement.
It's similar to the subset sum problem. You are definitely going to be into exponential time ...
There are several ways to solve subset
sum in time exponential in N. The most
naïve algorithm would be to cycle
through all subsets of N numbers and,
for every one of them, check if the
subset sums to the right number. The
running time is of order O(2^N*N), since
there are 2N subsets and, to check
each subset, we need to sum at most N
elements.
Unless you can constrain the problem to smaller subsets.
According to your definition as it stands you could get any of these tables:
userid size userid size
1 70 2 100
userid size userid size
3 50 4 25
userid size userid size
5 120 6 90
userid size userid size
1 70 2 100
3 50 3 50
userid size userid size
1 70 2 100
4 25 4 25
userid size userid size
1 70 4 25
3 50 6 90
4 25
userid size userid size
4 25 3 50
5 120 6 90
SQL sucks at guessing. Do you mean to say you want the most users who's total size is under a certain limit? You'll need to create a temp table of all the combinations of users, then select the ones who's total size is less then the limit, then select the one with the most users, and possibly the lowest user ID or something. Either way, it won't be fast due to the first step.
But do you want to maximize the number of results or minimize or you simply don't care? first two cases is constraints optimization for which there should be solution using SQL, the latter (as mentioned above) requires greedy strategy.