Sorting/Optimization problem with rows in a pandas dataframe [duplicate] - pandas

So if I was given a sorted list/array i.e. [1,6,8,15,40], the size of the array, and the requested number..
How would you find the minimum number of values required from that list to sum to the requested number?
For example, given the array [1,6,8,15,40] and the requested number 23, it would take 2 values from the list (8 and 15) to equal 23. The function would then return 2 (the number of values). Furthermore, there is an unlimited number of 1s available in the array (so the function will always return a value).
Any help is appreciated

The NP-complete subset-sum problem trivially reduces to your problem: given a set S of integers and a target value s, we construct set S' having values (n+1) xk for each xk in S and set the target equal to (n+1) s. If there's a subset of the original set S summing to s, then there will be a subset of size at most n in the new set summing to (n+1) s, and such a set cannot involve extra 1s. If there is no such subset, then the subset produced as an answer must contain at least n+1 elements since it needs enough 1s to get to a multiple of n+1.
So, the problem will not admit any polynomial-time solution without a revolution in computing. With that disclaimer out of the way, you can consider some pseudopolynomial-time solutions to the problem, which work well in practice if the values in the set (and hence the target) are small.
Here's a Python algorithm that will do this:
import functools

S = [1, 6, 8, 15, 40]  # must contain only positive integers

@functools.lru_cache(maxsize=None)  # memoizing decorator
def min_subset(k, s):
    # returns the minimum size of a subset of S[:k] summing to s,
    # including any extra 1s needed to get there
    best = s  # use all ones
    for i, j in enumerate(S[:k]):
        if j <= s:
            sz = min_subset(i, s - j) + 1
            if sz < best:
                best = sz
    return best

print(min_subset(len(S), 23))  # prints 2
This is tractable even for fairly large lists (I tested a random list of n=50 elements), provided their values are bounded. With S = [random.randint(1, 500) for _ in range(50)], min_subset(len(S), 8489) takes less than 10 seconds to run.
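The same quantity can also be computed bottom-up. What follows is only a sketch and not the method used above: it swaps the memoized recursion for a 0/1 knapsack-style table, and the name min_values is made up for illustration. It assumes positive integers, each list value usable at most once, and an unlimited supply of extra 1s as stated in the question.

```python
def min_values(values, target):
    """Minimum number of values summing to target.

    Each entry of `values` may be used at most once; any remainder is
    filled with extra 1s (assumed to be unlimited, as in the question).
    """
    INF = float("inf")
    # dp[t] = fewest list entries that sum exactly to t (INF if impossible)
    dp = [0] + [INF] * target
    for v in values:
        # iterate downwards so each value is used at most once
        for t in range(target, v - 1, -1):
            if dp[t - v] + 1 < dp[t]:
                dp[t] = dp[t - v] + 1
    # top up the remaining (target - t) with extra 1s
    return min(dp[t] + (target - t) for t in range(target + 1))


print(min_values([1, 6, 8, 15, 40], 23))  # 2, from 8 + 15
```

The table has target + 1 entries and is updated once per list value, so the running time grows with len(values) * target, which is the pseudopolynomial behaviour described above.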

There may be a simpler solution, but if your lists are sufficiently short, you can just try every set of values, i.e.:
1 --> Not 23
6 --> Not 23
...
1 + 6 = 7 --> Not 23
1 + 8 = 9 --> Not 23
...
1 + 40 = 41 --> Not 23
6 + 8 = 14 --> Not 23
...
8 + 15 = 23 --> Oh look, it's 23, and we added 2 values
If you know your list is sorted, you can skip some tests, since if 6 + 20 > 23, then there's no need to test 6 + 40.
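As a rough sketch of this try-everything idea (the function name smallest_subset is hypothetical, and the unlimited extra 1s from the question are ignored here), itertools.combinations can enumerate the candidate sets in order of increasing size:

```python
from itertools import combinations


def smallest_subset(values, target):
    """Return the first smallest combination of values summing to target."""
    for size in range(1, len(values) + 1):
        # try all combinations of this size before moving to larger ones
        for combo in combinations(values, size):
            if sum(combo) == target:
                return combo
    return None  # no combination of the given values hits the target


print(smallest_subset([1, 6, 8, 15, 40], 23))  # (8, 15)
```

Because sizes are tried in increasing order, the first hit is guaranteed to use the fewest values. The number of combinations grows exponentially with the list length, so this only suits short lists.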

Related

SQL Max Consecutive Values in a number set using recursion

The following SQL query is supposed to return the max consecutive numbers in a set.
WITH RECURSIVE Mystery(X,Y) AS (
    (SELECT A AS X, A AS Y FROM R)
    UNION
    (SELECT m1.X, m2.Y
     FROM Mystery m1, Mystery m2
     WHERE m2.X = m1.Y + 1)
)
SELECT MAX(Y-X) + 1 FROM Mystery;
This query on the set {7, 9, 10, 14, 15, 16, 18} returns 3, because {14 15 16} is the longest chain of consecutive numbers and there are three numbers in that chain. But when I try to work through this manually I don't see how it arrives at that result.
For example, given the number set above I could create two columns:
m1.x  m2.y
7     7
9     9
10    10
14    14
15    15
16    16
18    18
If we are working on rows and columns, not the actual data, as I understand it WHERE m2.X = m1.Y + 1 takes the value from the next row in Y and puts it in the current row of X, like so
m1.X  m2.Y
9     7
10    9
14    10
15    14
16    15
18    16
18    Null?
The main part on which I am uncertain is where in the SQL recursion actually happens. According to Denis Lukichev recursion is the R part - or in this case the RECURSIVE Mystery(X,Y) - and stops when the table is empty. But if the above is true, how would the table ever empty?
Since I don't know how to proceed with the above, let me try a different direction. If WHERE m2.X = m1.Y + 1 is actually a comparison, the result should be:
m1.X  m2.Y
14    14
15    15
16    16
But at this point, it seems that it should continue recursively on this until only two rows are left (nothing else to compare). If it stops here to get the correct count of 3 rows (2 + 1), what is actually stopping the recursion?
I understand that for the above example the MAX(Y-X) + 1 effectively returns the actual number of recursion steps and adds 1.
But if I have 7 consecutive numbers and the recursion flows down to 2 rows, should this not end up with an incorrect 3 as the result? I understand recursion in C++ and other languages, but this is confusing to me.
Full disclosure, yes it appears this is a common university question, but I am retired, discovered this while researching recursion for my use, and need to understand how it works to use similar recursion in my projects.
Based on this db<>fiddle shared previously, you may find it instructive to alter the CTE to include an iteration number as follows, and then to show the content of the CTE rather than the output of final SELECT. Here's an amended CTE and its content after the recursion is complete:
Amended CTE
WITH RECURSIVE Mystery(X,Y) AS (
    (SELECT A AS X, A AS Y, 1 AS Z FROM R)
    UNION
    (SELECT m1.X, m2.A, Z+1
     FROM Mystery m1
     JOIN R m2 ON m2.A = m1.Y + 1)
)
CTE Content
x   y   z
7   7   1
9   9   1
10  10  1
14  14  1
15  15  1
16  16  1
18  18  1
9   10  2
14  15  2
15  16  2
14  16  3
The Z field holds the iteration count. Where Z = 1 we've simply got the rows from the table R. The values X and Y are both from the field A. In terms of what we are attempting to achieve, these represent sequences of consecutive numbers, which start at X and continue to (at least) Y.
Where Z = 2, the second iteration, we find all the rows of the first iteration where there is a value in R which is one higher than our Y value, that is, one higher than the last member of our sequence of consecutive numbers. That becomes the new highest number, and we add one to the number of iterations. As only three numbers in our original data set have successors within the set, there are only three rows output in the second iteration.
Where Z = 3, the third iteration, we find all the rows of the second iteration (note we are not considering all the rows of the first iteration again) where there is, again, a value in R which is one higher than our Y value, that is, one higher than the last member of our sequence of consecutive numbers. That, again, becomes the new highest number, and we add one to the number of iterations.
The process will attempt a fourth iteration, but as there are no rows in R where the value is one more than the Y values from our third iteration, no extra data gets added to the CTE and recursion ends.
Going back to the original db<>fiddle, the process then searches our CTE content to output MAX(Y-X) + 1, which is the maximum difference between the first and last values in any consecutive sequence, plus one. This finds its value from the record produced in the third iteration, using ((16-14) + 1), which has a value of 3.
For this specific piece of code, the output always equals the largest value in the Z field, as every row added through the recursion adds one to Z and adds one to Y.
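To make the iteration order concrete, here is a small Python simulation of the amended CTE. It is a sketch of the evaluation described above, not how any particular database engine is implemented, and the names longest_run, working and cte are invented for the illustration. It assumes a non-empty list of integers.

```python
def longest_run(values):
    """Simulate the recursive CTE: rows are (X, Y, Z) triples."""
    r = set(values)
    # Z = 1: the anchor part, one row (A, A, 1) per value in R
    working = {(a, a, 1) for a in r}
    cte = set(working)
    while working:  # recursion stops once an iteration adds no rows
        # the recursive part only sees rows produced by the previous iteration
        working = {(x, y + 1, z + 1) for (x, y, z) in working if y + 1 in r}
        working -= cte  # UNION discards rows that already exist
        cte |= working
    return max(y - x for x, y, _ in cte) + 1, cte


length, rows = longest_run([7, 9, 10, 14, 15, 16, 18])
print(length)     # 3
print(len(rows))  # 11 rows, as in the CTE content shown above
```

The loop ends because each pass can only extend a run by one, and a run cannot be extended past the largest value in R, so eventually a pass produces nothing new.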

Get coherent subsets from pandas series

I'm rather new to pandas and recently ran into a problem. I have a pandas DataFrame that I need to process. I need to extract parts of the DataFrame where specific conditions are met. However, I want these parts to be coherent blocks, not one big set.
Example:
Consider the following pandas DataFrame
   col1  col2
0     3    11
1     7    15
2     9     1
3    11     2
4    13     2
5    16    16
6    19    17
7    23    13
8    27     4
9    32     3
I want to extract the subframes where the values of col2 >= 10, resulting maybe in a list of DataFrames in the form of (in this case):
   col1  col2
0     3    11
1     7    15
   col1  col2
5    16    16
6    19    17
7    23    13
Ultimately, I need to do further analysis on the values in col1 within the resulting parts. However, the start and end of each of these blocks are important to me, so simply creating a subset using pandas.DataFrame.loc isn't going to work for me, I think.
What I have tried:
Right now I have a workaround that gets the subset using pandas.DataFrame.loc and then extracts the start and end index of each coherent block afterwards, by iterating through the subset and checking whether there is a jump in the indices. However, it feels rather clumsy and I feel that I'm missing a basic pandas function here that would make my code more efficient and clean.
This is code representing my current workaround as adapted to the above example
# here the blocks will be collected for further computations
blocks = []
# get all the items where col2 >10 using 'loc[]'
subset = df.loc[df['col2']>10]
block_start = 0
block_end = None
#loop through all items in subset
for i in range(1, len(subset)):
    # if the difference between the current index and the last is greater than 1 ...
    if subset.index[i]-subset.index[i-1] > 1:
        # ... this is the current blocks end
        next_block_start = i
        # extract the according block and add it to the list of all blocks
        block = subset[block_start:next_block_start]
        blocks.append(block)
        #the next_block_start index is now the new block's starting index
        block_start = next_block_start
#close and add last block
blocks.append(subset[block_start:])
Edit: I was by mistake previously referring to 'pandas.DataFrame.where' instead of 'pandas.DataFrame.loc'. I seem to be a bit confused by my recent research.
You can split your problem into parts. First you check the condition:
df['mask'] = (df['col2']>10)
We use this to see where a new subset starts:
df['new'] = df['mask'].gt(df['mask'].shift(fill_value=False))
Now you can combine this information into a group number. The cumsum will generate a step function which we set to zero (via the mask column) if this is not a group we are interested in.
df['grp'] = (df.new + 0).cumsum() * df['mask']
EDIT
You don't have to do the group calculation in your df:
s = (df['col2']>10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s
After that you can split this into a dict of separate DataFrames
import numpy as np

grp = {}
# group 0 marks the rows outside any block, so skip it
for i in np.unique(s)[1:]:
    grp[i] = df.loc[s == i, ['col1', 'col2']]
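A variant of the same idea, sketched here with groupby, returns the blocks as a list of DataFrames. This is only one possible arrangement: the names mask, block_id and blocks are illustrative, and the condition col2 >= 10 is taken from the question text (the code above uses > 10).

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [3, 7, 9, 11, 13, 16, 19, 23, 27, 32],
    "col2": [11, 15, 1, 2, 2, 16, 17, 13, 4, 3],
})

mask = df["col2"] >= 10
# the id increases by one every time the mask flips between True and False
block_id = (mask != mask.shift()).cumsum()
# keep only rows where the condition holds, one DataFrame per contiguous block
blocks = [block for _, block in df[mask].groupby(block_id[mask])]

for block in blocks:
    # original index labels survive, so start and end are easy to read off
    print(block.index[0], block.index[-1])
```

Since each block keeps its original index, the start and end positions that matter for the later analysis of col1 are available as block.index[0] and block.index[-1].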

Create 20 unique bingo cards

I'm trying to create 20 unique cards with numbers, but I'm struggling a bit. Basically, I need to create 20 unique 3x3 matrices having numbers 1-10 in the first column, numbers 11-20 in the second column and 21-30 in the third column. Any ideas? I'd prefer to have it done in R, especially as I don't know Visual Basic. In Excel I know how to generate the cards, but I'm not sure how to ensure they are unique.
It seems to be quite precise and straightforward to me. Anyway, I needed to create 20 matrices that would look like:
     [,1] [,2] [,3]
[1,]    5   17   23
[2,]    8   18   22
[3,]    3   16   24
Each of the matrices should be unique and each of the columns should consist of three unique numbers ( the 1st column - numbers 1-10, the 2nd column 11-20, the 3rd column - 21-30).
Generating random numbers is easy, but how do I make sure that the generated cards are unique? Please have a look at the post that I voted for as an answer, as it gives a thorough explanation of how to achieve it.
(N.B. : I misread "rows" instead of "columns", so the following code and explanation will deal with matrices with random numbers 1-10 on 1st row, 11-20 on 2nd row etc., instead of columns, but it's exactly the same just transposed)
This code should guarantee uniqueness and good randomness :
library(gtools)
# helper function
getKthPermWithRep <- function(k,n,r){
  k <- k - 1
  # after the decrement, valid values of k are 0 .. n^r - 1
  if(n^r <= k){
    stop('k is greater than possible permutations')
  }
  v <- rep.int(0,r)
  index <- length(v)
  while ( k != 0 )
  {
    remainder <- k %% n
    k <- k %/% n
    v[index] <- remainder
    index <- index - 1
  }
  return(v+1)
}
# get all possible permutations of 10 elements taken 3 at a time
# (singlerowperms = 720)
allperms <- permutations(10,3)
singlerowperms <- nrow(allperms)
# get 20 random and unique bingo cards
cards <- lapply(sample.int(singlerowperms^3,20),FUN=function(k){
  perm2use <- getKthPermWithRep(k,singlerowperms,3)
  m <- allperms[perm2use,]
  m[2,] <- m[2,] + 10
  m[3,] <- m[3,] + 20
  return(m)
  # if you want transpose the result just do:
  # return(t(m))
})
Explanation
(disclaimer tl;dr)
To guarantee both randomness and uniqueness, one safe approach is to generate all the possible bingo cards and then choose randomly among them without replacement.
To generate all the possible cards, we should:
(1) generate all the possibilities for each row of 3 elements
(2) get the cartesian product of them
Step (1) can be easily obtained using the function permutations of package gtools (see the object allperms in the code). Note that we just need the permutations for the first row (i.e. 3 elements taken from 1-10) since the permutations of the other rows can be easily obtained from the first by adding 10 and 20 respectively.
Step (2) is also easy to get in R, but let's first consider how many possibilities will be generated. Step (1) returned 720 cases for each row, so in the end we will have 720*720*720 = 720^3 = 373248000 possible bingo cards!
Generating all of them is not practical since the occupied memory would be huge, thus we need to find a way to get 20 random elements in this big range of possibilities without actually keeping them in memory.
The solution comes from the function getKthPermWithRep, which, given an index k, returns the k-th permutation with repetition of r elements taken from 1:n (note that in this case permutation with repetition corresponds to the cartesian product).
e.g.
# all permutations with repetition of 2 elements in 1:3 are
permutations(n = 3, r = 2,repeats.allowed = TRUE)
# [,1] [,2]
# [1,] 1 1
# [2,] 1 2
# [3,] 1 3
# [4,] 2 1
# [5,] 2 2
# [6,] 2 3
# [7,] 3 1
# [8,] 3 2
# [9,] 3 3
# using the getKthPermWithRep you can get directly the k-th permutation you want :
getKthPermWithRep(k=4,n=3,r=2)
# [1] 2 1
getKthPermWithRep(k=8,n=3,r=2)
# [1] 3 2
Hence now we just choose 20 random indexes in the range 1:720^3 (using sample.int function), then for each of them we get the corresponding permutation of 3 numbers taken from 1:720 using function getKthPermWithRep.
Finally, these triplets of numbers can be converted to actual card rows by using them as indexes to subset allperms and get our final matrix (after, of course, adding +10 and +20 to the 2nd and 3rd row).
Bonus
Explanation of getKthPermWithRep
If you look at the example above (permutations with repetition of 2 elements in 1:3), and subtract 1 from every number of the result, you get this:
> permutations(n = 3, r = 2,repeats.allowed = T) - 1
[,1] [,2]
[1,] 0 0
[2,] 0 1
[3,] 0 2
[4,] 1 0
[5,] 1 1
[6,] 1 2
[7,] 2 0
[8,] 2 1
[9,] 2 2
If you consider each number of each row as a number digit, you can notice that those rows (00, 01, 02...) are all the numbers from 0 to 8, represented in base 3 (yes, 3 as n). So, when you ask the k-th permutation with repetition of r elements in 1:n, you are also asking to translate k-1 into base n and return the digits increased by 1.
Therefore, given the algorithm to change any number from base 10 to base n :
changeBase <- function(num,base){
  v <- NULL
  while ( num != 0 )
  {
    remainder = num %% base # assume K > 1
    num = num %/% base # integer division
    v <- c(remainder,v)
  }
  if(is.null(v)){
    return(0)
  }
  return(v)
}
you can easily obtain getKthPermWithRep function.
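For readers who do not use R, the same indexing trick can be sketched in Python. This is an illustrative port under assumptions, not a translation of the gtools-based code: the names kth_tuple and bingo_cards are invented, the triples are built per column directly (so no transpose is needed), and itertools.permutations stands in for gtools::permutations.

```python
import random
from itertools import permutations


def kth_tuple(k, n, r):
    """Return the k-th (1-based) r-tuple over 1..n, i.e. k-1 written in base n."""
    if not 1 <= k <= n ** r:
        raise ValueError("k is outside 1..n**r")
    k -= 1
    digits = [0] * r
    for pos in range(r - 1, -1, -1):  # fill from the least significant digit
        k, digits[pos] = divmod(k, n)
    return [d + 1 for d in digits]


def bingo_cards(count=20, seed=None):
    rng = random.Random(seed)
    triples = list(permutations(range(1, 11), 3))  # 720 ordered triples from 1..10
    m = len(triples)
    # distinct indices in 1..m**3 map to distinct cards, so uniqueness is guaranteed
    indices = rng.sample(range(1, m ** 3 + 1), count)
    cards = []
    for k in indices:
        picks = kth_tuple(k, m, 3)  # one triple index per column
        columns = [[v + 10 * c for v in triples[p - 1]]
                   for c, p in enumerate(picks)]
        cards.append([list(row) for row in zip(*columns)])  # 3 rows of 3 numbers
    return cards


print(kth_tuple(4, 3, 2))  # [2, 1]
print(kth_tuple(8, 3, 2))  # [3, 2]
```

Sampling from a range object avoids materialising all 720^3 indices, which mirrors the point made above about not keeping every card in memory.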
One 3x3 matrix with the desired value range can be generated with the following code:
mat <- matrix(c(sample(1:10,3), sample(11:20,3), sample(21:30, 3)), nrow=3)
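Furthermore, you can use a for loop to generate a list of 20 matrices as follows. Note that random sampling alone does not guarantee uniqueness, although with 720^3 possible cards a collision among 20 is very unlikely:
mats <- vector("list", 20)
for (i in 1:20) {
  mats[[i]] <- matrix(c(sample(1:10,3), sample(11:20,3), sample(21:30,3)), nrow=3)
  print(mats[[i]])
}
# anyDuplicated(mats) returns 0 when all cards are distinct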
Well OK I may fall on my face here but I propose a checksum (using Excel).
This is a unique signature for each bingo card which remains invariant if the order of numbers within any column is changed without changing the actual numbers. The formula is
=SUM(10^MOD(A2:A4,10)+2*10^MOD(B2:B4,10)+4*10^MOD(C2:C4,10))
where the bingo numbers for the first card are in A2:C4.
The idea is to generate a 10-digit number for each column, then multiply each by a constant and add them to get the signature.
So here I have generated two random bingo cards using a standard formula from here plus two which are deliberately made to be just permutations of each other.
Then I check if any of the signatures are duplicates using the formula
=MAX(COUNTIF(D5:D20,D5:D20))
which shouldn't give an answer greater than 1.
In the unlikely event that there were duplicates, then you would just press F9 and generate some new cards.
All formulae are array formulae and must be entered with Ctrl+Shift+Enter.
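The same checksum can be sketched outside Excel. The Python below mirrors the formula term by term; the function name signature and the sample cards are made up for the illustration, and a card is assumed to be a list of three rows with three numbers each.

```python
def signature(card):
    """Order-invariant checksum of a 3x3 card given as a list of rows."""
    weights = (1, 2, 4)  # one weight per column, as in the Excel formula
    return sum(w * 10 ** (value % 10)
               for row in card
               for w, value in zip(weights, row))


card_a = [[5, 17, 23], [8, 18, 22], [3, 16, 24]]
card_b = [[8, 17, 23], [5, 18, 22], [3, 16, 24]]  # first column reordered
card_c = [[5, 17, 23], [8, 18, 22], [4, 16, 24]]  # one number changed

print(signature(card_a) == signature(card_b))  # True
print(signature(card_a) == signature(card_c))  # False
```

Within a column the last digits are all different, so each decimal position of the result receives at most one contribution per column, and the weights 1, 2 and 4 keep the three columns separable.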
Here is an inelegant way to do this. Generate all possible combinations and then sample without replacement. These are permutations, not combinations: order does matter in bingo.
library(dplyr)
library(tidyr)
library(magrittr)
generate_samples = function(n) {
  first = data_frame(first = (n-9):n)
  first %>%
    merge(first %>% rename(second = first)) %>%
    merge(first %>% rename(third = first)) %>%
    sample_n(20)
}
suffix = function(df, suffix)
  df %>%
    setNames(names(.) %>%
      paste0(suffix))
generate_samples(10) %>% suffix(10) %>%
  bind_cols(generate_samples(20) %>% suffix(20)) %>%
  bind_cols(generate_samples(30) %>% suffix(30)) %>%
  rowwise %>%
  do(matrix = t(.) %>% matrix(3)) %>%
  use_series(matrix)

How to generate sequences with distinct subsums?

I'm looking for a way to generate some (6 by default) equations where all subsums are unique.
For example,
a+b+c=50
d+e+f=50
g+h+i=50
a, d and g have to be distinct.
a+b and d+e have to be distinct.
e+f and h+i have to be distinct.
a+c and d+f have to be distinct.
But, a+b and e+f can be the same. So I only care about the subsums of aligned parameters..
I could only find ways to check whether some sequence is subsum-distinct, but I found nothing on how to generate such a sequence.
You didn't state whether you need it to be a random sequence, so suppose that this is not required.
One simple approach is this:
1 + 2 + 47 = 50
3 + 4 + 43 = 50
5 + 6 + 39 = 50
7 + 8 + 35 = 50
9 + 10 + 31 = 50
11 + 12 + 27 = 50
First two numbers are 2 smallest available numbers, the third number is final sum - those numbers.
a and b are always increasing, c is always decreasing
a + b is always increasing, b + c and a + c are always decreasing
You can generate it this way in a loop.
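Such a loop might look like the following sketch. The function name and its parameters total and count are assumptions for illustration; it raises an error when the pattern would need a non-positive third summand.

```python
def sequential_triples(total=50, count=6):
    """Build triples (a, b, c) with a + b + c == total using the pattern above."""
    triples = []
    for i in range(count):
        a, b = 2 * i + 1, 2 * i + 2  # the two smallest numbers not used yet
        c = total - a - b            # the third summand completes the sum
        if c <= 0:
            raise ValueError("total is too small for this many triples")
        triples.append((a, b, c))
    return triples


for a, b, c in sequential_triples():
    print(f"{a} + {b} + {c} = {a + b + c}")
```

With the defaults this prints the six equations listed above, from 1 + 2 + 47 down to 11 + 12 + 27.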
EDIT after comment that it has to be a random sequence:
Possibly you could create several sets (some sort of hashset/hashmap would be the most appropriate)
set of first summands
set of sums of first and second summands
set of sums of second and third summands
set of sums of first and third summands
set of previously generated triples
You would generate random triples this way:
1. If the total number of demanded triples has not been reached, generate a random triple; otherwise finish.
2. Check whether the triple was previously generated. If it was not, proceed with step 3; otherwise go back to step 1.
3. Conduct checks against the first four sets. If none of the triple's sums are contained within those sets, add the triple (recording its sums in the sets) and proceed with step 1.
However, I am not sure if this approach guarantees that you will get results (especially with small final sums).
So, I would add a counter: if too many consecutive attempts are unsuccessful, I would switch to a brute force approach (which should not be a problem if final sums are small and, on the other hand, is very unlikely to be needed if the final sum is large).
Overall, performance should be good.
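A minimal sketch of this rejection scheme follows. The names random_triples and max_failures are hypothetical, positive summands are assumed, and instead of switching to brute force the sketch simply raises an error once the counter runs out.

```python
import random


def random_triples(total=50, count=6, max_failures=10_000, seed=None):
    """Draw random triples summing to total whose aligned subsums are distinct."""
    rng = random.Random(seed)
    firsts, ab_sums, bc_sums, ac_sums = set(), set(), set(), set()
    triples = []
    failures = 0  # consecutive rejected attempts
    while len(triples) < count:
        a = rng.randint(1, total - 2)
        b = rng.randint(1, total - a - 1)
        c = total - a - b  # always at least 1 by construction
        # a repeated triple would repeat its first summand, so the check on
        # `firsts` also covers the "previously generated" test
        if (a in firsts or a + b in ab_sums
                or b + c in bc_sums or a + c in ac_sums):
            failures += 1
            if failures > max_failures:
                raise RuntimeError("too many consecutive failed attempts")
            continue
        failures = 0
        firsts.add(a)
        ab_sums.add(a + b)
        bc_sums.add(b + c)
        ac_sums.add(a + c)
        triples.append((a, b, c))
    return triples


print(random_triples(seed=0))
```

Because the total is fixed, a + b is distinct exactly when c is distinct, and likewise for the other pairs, so the four sets together amount to requiring that each position holds different values across all triples.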

Hash function to iterate through a matrix

Given a NxN matrix and a (row,column) position, what is a method to select a different position in a random (or pseudo-random) order, trying to avoid collisions as much as possible?
For example: consider a 5x5 matrix and start from (1,2)
0 0 0 0 0
0 0 X 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
I'm looking for a method like
(x,y) hash (x,y);
to jump to a different position in the matrix, avoiding collisions as much as possible
(do not care how to return two different values, it doesn't matter, just think of an array).
Of course, I can simply use
row = rand()%N;
column = rand()%N;
but it's not that good to avoid collisions.
I thought I could apply twice a simple hash method for both row and column and use the results as new coordinates, but I'm not sure this is a good solution.
Any ideas?
Can you determine the order of the walk before you start iterating? If your matrices are large, this approach isn't space-efficient, but it is straightforward and collision-free. I would do something like:
1. Generate an array of all of the coordinates. Remove the starting position from the list.
2. Shuffle the list (there's sample code for a Fisher-Yates shuffle here)
3. Use the shuffled list for your walk order.
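In Python these steps could be sketched as below. The function name shuffled_walk is made up, and random.shuffle (which performs a Fisher-Yates style shuffle) is used instead of hand-written shuffle code.

```python
import random


def shuffled_walk(n, start, seed=None):
    """Return every cell of an n x n matrix except start, in random order."""
    rng = random.Random(seed)
    # step 1: all coordinates, minus the starting position
    cells = [(r, c) for r in range(n) for c in range(n) if (r, c) != start]
    # step 2: shuffle in place
    rng.shuffle(cells)
    # step 3: the shuffled list is the walk order
    return cells


walk = shuffled_walk(5, (1, 2), seed=42)
print(len(walk))  # 24 cells, each visited exactly once
```

The list costs memory proportional to n * n, which is the space trade-off mentioned above.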
Edit 2 & 3: A modular approach: Given s array elements, choose a prime p of form 2+3*n, p>s. For i=1 to p, use cells (i*i*i)%p when that value is in range 0...s-1. (For row-length r, cell #c subscripts are c%r, c/r.)
Effectively, this method uses H(i) = (i*i*i) mod p as a hash function. The reference shows that as i ranges from 1 to p, H(i) takes on each of the values from 0 to p-1, exactly one time each.
For example, with s=25 and p=29 or 47, this uses cells in following order:
p=29: 1 8 6 9 13 24 19 4 14 17 22 18 11 7 12 3 15 10 5 16 20 23 2 21 0
p=47: 1 8 17 14 24 13 15 18 7 4 10 2 6 21 3 22 9 12 11 23 5 19 16 20 0
according to bc code like
s=25;p=29;for(i=1;i<=p;++i){t=(i^3)%p; if(t<s){print " ",t}}
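The bc one-liner translates to Python roughly as follows. This is a sketch with invented names (cubic_walk, to_subscripts); it checks the form of p but assumes, without testing, that p is actually prime.

```python
def cubic_walk(s, p):
    """Visit cells 0..s-1 in the order given by H(i) = i**3 mod p."""
    if p < s or p % 3 != 2:
        raise ValueError("p must be a prime of form 3n+2 with p >= s")
    order = []
    for i in range(1, p + 1):
        t = pow(i, 3, p)  # modular cube
        if t < s:         # skip values that fall outside the array
            order.append(t)
    return order


def to_subscripts(cell, row_length):
    """Convert a cell number into the (c % r, c / r) subscripts described above."""
    return cell % row_length, cell // row_length


walk = cubic_walk(25, 29)
print(walk[:7])  # [1, 8, 6, 9, 13, 24, 19]
```

Each cell number from 0 to s-1 appears exactly once in the result, and at most p - s iterations are wasted on out-of-range values.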
The text above shows the suggestion I made in Edit 2 of my answer. The text below shows my first answer.
Edit 0: (This is the suggestion to which Seamus's comment applied): A simple method to go through a vector in a "random appearing" way is to repeatedly add d (d>1) to an index. This will access all elements if d and s are coprime (where s=vector length). Note, my example below is in terms of a vector; you could do the same thing independently on the other axis of your matrix, with a different delta for it, except a problem mentioned below would occur. Note, "coprime" means that gcd(d,s)=1. If s is variable, you'd need gcd() code.
Example: Say s is 10. gcd(s,x) is 1 for x in {1,3,7,9} and is not 1 for x in {2,4,5,6,8,10}. Suppose we choose d=7, and start with i=0. i will take on values 0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, which modulo 10 is 0, 7, 4, 1, 8, 5, 2, 9, 6, 3, 0.
Edit 1 & 3: Unfortunately this will have a problem in the two-axis case; for example, if you use d=7 for x axis, and e=3 for y-axis, while the first 21 hits will be distinct, it will then continue repeating the same 21 hits. To address this, treat the whole matrix as a vector, use d with gcd(d,s)=1, and convert cell numbers to subscripts as above.
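The stride idea, including the flattening of the matrix into one vector, can be sketched like this. The function names are hypothetical and the choice of d is left to the caller.

```python
from math import gcd


def stride_walk(s, d, start=0):
    """Visit indices 0..s-1 by repeatedly adding d modulo s."""
    if gcd(d, s) != 1:
        raise ValueError("d and s must be coprime to reach every index")
    return [(start + k * d) % s for k in range(s)]


def matrix_walk(n, d, start=0):
    """Treat an n x n matrix as one vector of n*n cells, then unflatten."""
    return [divmod(cell, n) for cell in stride_walk(n * n, d, start)]


print(stride_walk(10, 7))  # [0, 7, 4, 1, 8, 5, 2, 9, 6, 3]
print(len(set(matrix_walk(5, 7))))  # 25 distinct (row, column) pairs
```

Walking the flattened vector with a single stride avoids the repetition that occurs when each axis cycles independently.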
If you just want to iterate through the matrix, what is wrong with row++; if (row == N) {row = 0; column++}?
If you iterate through the row and the column independently, and each cycles back to the beginning after N steps, then the (row, column) pair will iterate through only N of the N^2 cells of the matrix.
If you want to iterate through all of the cells of the matrix in pseudo-random order, you could look at questions here on random permutations.
This is a companion answer to address a question about my previous answer: How to find an appropriate prime p >= s (where s = the number of matrix elements) to use in the hash function H(i) = (i*i*i) mod p.
We need to find a prime of form 3n+2, where n is any odd integer such that 3*n+2 >= s. Note that n odd gives 3n+2 = 3(2k+1)+2 = 6k+5 where k need not be odd. In the example code below, p = 5+6*(s/6); initializes p to be a number of form 6k+5, and p += 6; maintains p in this form.
The code below shows that half-a-dozen lines of code are enough for the calculation. Timings are shown after the code, which is reasonably fast: 12 us at s=half a million, 200 us at s=half a billion, where us denotes microseconds.
// timing how long to find primes of form 2+3*n by division
// jiw 20 Sep 2011
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
double ttime(double base) {
    struct timeval tod;
    gettimeofday(&tod, NULL);
    return tod.tv_sec + tod.tv_usec/1e6 - base;
}
int main(int argc, char *argv[]) {
    int d, s, p, par=0;
    double t0=ttime(0);
    ++par; s=5000; if (argc > par) s = atoi(argv[par]);
    p = 5+6*(s/6);
    while (1) {
        for (d=3; d*d<p; d+=2)
            if (p%d==0) break;
        if (d*d >= p) break;
        p += 6;
    }
    printf ("p = %d after %.6f seconds\n", p, ttime(t0));
    return 0;
}
Timing results on 2.5GHz Athlon 5200+:
qili ~/px > for i in 0 00 000 0000 00000 000000; do ./divide-timing 500$i; done
p = 5003 after 0.000008 seconds
p = 50021 after 0.000010 seconds
p = 500009 after 0.000012 seconds
p = 5000081 after 0.000031 seconds
p = 50000021 after 0.000072 seconds
p = 500000003 after 0.000200 seconds
qili ~/px > factor 5003 50021 500009 5000081 50000021 500000003
5003: 5003
50021: 50021
500009: 500009
5000081: 5000081
50000021: 50000021
500000003: 500000003
Update 1 Of course, timing is not determinate (ie, can vary substantially depending on the value of s, other processes on machine, etc); for example:
qili ~/px > time for i in 000 004 010 058 070 094 100 118 184; do ./divide-timing 500000$i; done
p = 500000003 after 0.000201 seconds
p = 500000009 after 0.000201 seconds
p = 500000057 after 0.000235 seconds
p = 500000069 after 0.000394 seconds
p = 500000093 after 0.000200 seconds
p = 500000099 after 0.000201 seconds
p = 500000117 after 0.000201 seconds
p = 500000183 after 0.000211 seconds
p = 500000201 after 0.000223 seconds
real 0m0.011s
user 0m0.002s
sys 0m0.004s
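For comparison, the search loop of the C program can be sketched in Python without the timing code. The function name next_prime_2_mod_3 is invented, and s is assumed to be a positive integer.

```python
def next_prime_2_mod_3(s):
    """Smallest prime p of form 6k+5 (hence of form 3n+2) found from s upwards."""
    p = 5 + 6 * (s // 6)  # first candidate of form 6k+5, never below s
    while True:
        d = 3
        # candidates are odd, so trial division by odd d is enough
        while d * d <= p and p % d != 0:
            d += 2
        if d * d > p:  # no divisor found up to sqrt(p): p is prime
            return p
        p += 6         # stay in the form 6k+5


print(next_prime_2_mod_3(25))    # 29
print(next_prime_2_mod_3(5000))  # 5003
```

The results agree with the values used earlier: p = 29 for the 25-cell example and p = 5003 for the program's default s = 5000.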
Consider using a double hash function to get a better distribution inside the matrix, but given that you cannot avoid collisions, what I suggest is to use an array of sentinels and mark the positions you visit; this way you are sure to visit each cell only once.