generate maximum number of combinations [duplicate] - vba

This question already has answers here:
How to generate a power set of a given set?
(8 answers)
Closed 4 years ago.
I am trying to find an algorithm to generate the full list of possible combinations of x given numbers.
Example: the possible combinations of 3 numbers (a, b, c) are:
a, b, c, a+b, a+c, b+c, a+b+c
Many thanks in advance for your help!

Treat the binary representation of the numbers from 0 to 2^x-1 as set membership. E.g., for ABC:
0 = 000 = {}
1 = 001 = {C}
2 = 010 = {B}
3 = 011 = {B,C}
4 = 100 = {A}
etc...
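
For illustration, a minimal Python sketch of this bitmask idea (the helper name is mine; the same mapping works in VBA or any other language):

# Enumerate all non-empty subsets of the input via bitmasks.
# Bit i of the mask decides whether items[i] is included.
def subsets(items):
    n = len(items)
    for mask in range(1, 2 ** n):  # 1 .. 2^n - 1; starting at 1 skips the empty set
        yield [items[i] for i in range(n) if mask & (1 << i)]

for combo in subsets(['a', 'b', 'c']):
    print(combo)
# ['a'], ['b'], ['a', 'b'], ['c'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']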

Do you mean to generate all possible sums of subsets of the numbers? If so:
Start with a set s = {0}.
For each number a, b, c:
duplicate the existing set s, add the number to each element of the duplicate, and merge the results back into s.
Example:
s = {0}
for a:
duplicate s, s' = {0}
add a to each of s', s' = {a}
add s' back to s, s = {0,a}
for b:
duplicate s, s' = {0,a}
add b to each of s' = {b,a+b}
add s' back to s, s= {0,a,b,a+b}
for c:
duplicate s, s' = {0,a,b,a+b}
add c to each of s' = {c,a+c,b+c,a+b+c}
add s' to s, s = {0,a,b,a+b,c,a+c,b+c,a+b+c}
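
The same doubling construction as a short Python sketch (the function name is mine):

# Each new number doubles the set of reachable subset sums.
def subset_sums(nums):
    sums = [0]  # the empty-set sum
    for x in nums:
        sums += [s + x for s in sums]  # duplicate, add x to each copy, merge back
    return sums

print(subset_sums([1, 2, 4]))
# [0, 1, 2, 3, 4, 5, 6, 7]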

Related

Groupby and A) concatenate matching strings (and/or substrings) B) sum the values

I have df:
row_numbers  ID   code     amount
1            med  a        1
2            med  a, b     1
3            med  b, c     1
4            med  c        1
5            med  d        10
6            cad  a, b     1
7            cad  a, b, d  0
8            cad  e        2
I want to group by column ID and then A) combine the strings in column code whenever a string/substring matches, and B) sum the values of column amount.
Expected results:
Explanation:
Column row_numbers plays no role in the df; I only include it to explain the output.
A) Grouping on column ID and looking at column code: row 1's string a matches a substring of row 2; row 2's substring b matches a substring of row 3; row 3's substring c matches the string of row 4. Hence rows 1, 2, 3 and 4 are combined. Row 5's string matches no other string/substring, so it forms a separate group. B) Based on this, the amounts of rows 1-4 are added together, and row 5 stays as a separate group.
Thanks in advance for your time and thoughts:).
EDIT - 1
Pasting the real data.
Expected output:
Explanation:
I have to group on column id, concatenate the values of column code, and sum the values of columns units and vol. The matching (to-be-concatenated) values of column code are color coded. Row 1 has a link with rows 5 and 9; row 9 in turn has a link with row 3, hence rows 1, 5, 9 and 3 are combined. Similarly rows 2 and 7, and so on. Row 8 has no link with any of the values within group med (column id) and hence stays as a separate row.
Thanks!
Update: given your latest sample data, this is not simple data munging. There is no vectorized solution; it relates to graph theory. You need to find the connected components within each group of ID and do the calculation on each connected component.
Consider each string a node of a graph. If two strings overlap, they are connected nodes, and for every node you need to traverse all paths connected to it, doing the calculation on all nodes reachable through those paths. This traversal can be done with depth-first search logic.
However, before running the depth-first search, you need to preprocess the strings into sets to check for overlap.
Method 1: Recursive
Do the following:
Define a function dfs to recursively run depth-first search.
Define a function gfunc to use with groupby apply. This function will traverse the elements of each group of ID and return the desired dataframe.
Strip the blank spaces from each string, split it, and convert the parts to sets using replace, split and map, assigning the result to a new column new_code of df.
Call groupby on ID and apply with function gfunc. Call droplevel and reset_index to get the desired output.
Code as follows:
import numpy as np
import pandas as pd

def dfs(node, index, glist, g_checked_rows):
    # start from this row's values and fold in every overlapping, unvisited row
    ret_arr = df.loc[index, ['code', 'amount', 'volume']].values
    g_checked_rows.add(index)
    for j, s in glist:
        if j not in g_checked_rows and not node.isdisjoint(s):
            t_arr = dfs(s, j, glist, g_checked_rows)
            ret_arr[0] += ', ' + t_arr[0]
            ret_arr[1:] += t_arr[1:]
    return ret_arr

def gfunc(x):
    checked_rows = set()
    final = []
    code_list = list(x.new_code.items())
    for i, row in code_list:
        if i not in checked_rows:
            final.append(dfs(row, i, code_list, checked_rows))
    return pd.DataFrame(final, columns=['code', 'units', 'vol'])

df['new_code'] = df.code.str.replace(' ', '').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc).droplevel(1).reset_index()
Out[16]:
    ID                                         code  units  vol
0  med   CO-96, CO-B15, CO-B15, CO-96, OA-18, OA-18      4    4
1  med                         CO-16, CO-B20, CO-16      3    3
2  med                        CO-252, CO-252, CO-45      3    3
3  med                                       OA-258      1    1
4  cad                         PR-96, PR-96, CO-243      4    4
5  cad                         PR-87, OA-258, PR-87      3    3
Note: I assume your pandas version is 0.24+. If it is < 0.24, in the last step you need to use reset_index and drop instead of droplevel and reset_index, as follows:
df_final = df.groupby('ID', sort=False).apply(gfunc).reset_index().drop('level_1', 1)
Method 2: Iterative
To make this answer complete, I also implemented a version of gfunc using an iterative process instead of recursion. The iterative process requires only one function, but that function is more complicated. The logic of the iterative process is as follows:

1. Push the first node onto a deque.
2. While the deque is not empty, pop the top node off.
3. If the node is not marked as checked, process it and mark it checked.
4. Push all of its unchecked neighbors, taken in the reverse order of the node list, onto the deque, then continue from step 2.

Code as follows:
from collections import deque

def gfunc_iter(x):
    checked_rows = set()
    final = []
    q = deque()
    code_list = list(x.new_code.items())
    code_list_rev = code_list[::-1]
    for i, row in code_list:
        if i not in checked_rows:
            q.append((i, row))
            ret_arr = np.array(['', 0, 0], dtype='O')
            while q:
                n, node = q.pop()
                if n in checked_rows:
                    continue
                ret_arr_child = df.loc[n, ['code', 'amount', 'volume']].values
                if not ret_arr[0]:
                    ret_arr = ret_arr_child.copy()
                else:
                    ret_arr[0] += ', ' + ret_arr_child[0]
                    ret_arr[1:] += ret_arr_child[1:]
                checked_rows.add(n)
                # push to `q` all unchecked neighbors from the reversed node list
                for j, s in code_list_rev:
                    if j not in checked_rows and not node.isdisjoint(s):
                        q.append((j, s))
            final.append(ret_arr)
    return pd.DataFrame(final, columns=['code', 'units', 'vol'])

df['new_code'] = df.code.str.replace(' ', '').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc_iter).droplevel(1).reset_index()
I believe the three main ideas for doing what you want are:
create an accumulator data structure (a DataFrame in this case);
iterate over pairs of rows, so that in each iteration you have (currentRow, nextRow);
pattern-match the current row against the next row, and against the accumulated rows.
It's not totally clear exactly which pattern match you're looking for, so I assumed that if any letter of currentRow's code appears in the next one, they should be concatenated.
Using a data.csv (with space separators) as an example:
row_numbers ID code amount
1 med a 1
2 med a,b 1
3 med b,c 1
4 med c 1
5 med d 10
6 cad a,b 1
7 cad a,b,d 0
8 cad e 2
import pandas as pd
from itertools import zip_longest

def generate_pairs(group):
    '''generate pairs (currentRow, nextRow)'''
    group_curriterrows = group.iterrows()
    group_nextiterrows = group.iterrows()
    group_nextiterrows.__next__()
    zip_list = zip_longest(group_curriterrows, group_nextiterrows)
    return zip_list

def generate_lists_to_check(currRow, nextRow, accumulated_rows):
    '''generate a list marking which current letters are in the next row and
    another list marking which accumulated codes contain any next letter'''
    currLetters = str(currRow["code"]).split(",")
    nextLetters = str(nextRow["code"]).split(",")
    letter_inNext = [letter in nextLetters for letter in currLetters]
    unique_acc_codes = [str(v) for v in accumulated_rows["code"].unique()]
    letter_inHistory = [any(letter in unq for letter in nextLetters)
                        for unq in unique_acc_codes]
    return letter_inNext, letter_inHistory

def create_newRow(accumulated_rows, nextRow):
    nextRow["row_numbers"] = str(nextRow["row_numbers"])
    accumulated_rows = accumulated_rows.append(nextRow, ignore_index=True)
    return accumulated_rows

def update_existingRow(accumulated_rows, match_idx, Row):
    accumulated_rows.loc[match_idx]["code"] += "," + Row["code"]
    accumulated_rows.loc[match_idx]["amount"] += Row["amount"]
    accumulated_rows.loc[match_idx]["volume"] += Row["volume"]
    accumulated_rows.loc[match_idx]["row_numbers"] += ',' + str(Row["row_numbers"])
    return accumulated_rows

if __name__ == "__main__":
    df = pd.read_csv("extended.tsv", sep=" ")
    groups = pd.DataFrame(columns=df.columns)
    for ID, group in df.groupby(["ID"], sort=False):
        accumulated_rows = pd.DataFrame(columns=df.columns)
        group_firstRow = group.iloc[0]
        accumulated_rows.loc[len(accumulated_rows)] = group_firstRow.values
        row_numbers = str(group_firstRow.values[0])
        accumulated_rows.set_value(0, 'row_numbers', row_numbers)
        zip_list = generate_pairs(group)
        for (currRow_idx, currRow), Next in zip_list:
            if Next is not None:
                (nextRow_idx, nextRow) = Next
                letter_inNext, letter_inHistory = \
                    generate_lists_to_check(currRow, nextRow, accumulated_rows)
                if any(letter_inNext):
                    accumulated_rows = update_existingRow(
                        accumulated_rows, len(accumulated_rows) - 1, nextRow)
                elif any(letter_inHistory):
                    matches = [idx for (idx, bool_val) in enumerate(letter_inHistory)
                               if bool_val]
                    first_match_idx = matches[0]
                    accumulated_rows = update_existingRow(
                        accumulated_rows, first_match_idx, nextRow)
                    for match_idx in matches[1:]:
                        accumulated_rows = update_existingRow(
                            accumulated_rows, first_match_idx,
                            accumulated_rows.loc[match_idx])
                        accumulated_rows = accumulated_rows.drop(match_idx)
                else:
                    accumulated_rows = create_newRow(accumulated_rows, nextRow)
        groups = groups.append(accumulated_rows)
    groups.reset_index(inplace=True, drop=True)
    print(groups)
OUTPUT (normal row order). The lines using column volume were removed from the code above for this run, because the first example has no volume column:
row_numbers ID code amount
0 1 med a,a,b,b,c,c 4
1 5 med d 10
2 6 cad a,b,a,b,d 1
3 8 cad e 2
OUTPUT for the new example:
row_numbers ID code amount volume
0 1,5,9,3 med CO-96,CO-B15,CO-B15,CO-96,OA-18,OA-18 4 4
1 2,7 med CO-16,CO-B20,CO-16 3 3
2 4,6 med CO-252,CO-252,CO-45 3 3
3 8 med OA-258 1 1
4 10,13 cad PR-96,PR-96,CO-243 4 4
5 11,12 cad PR-87,OA-258,PR-87 3 3

optimization algorithm for grouping sets of numbers

There are N subsets of the natural numbers between 1 and K (sample set: {2,9,32}). The number of items in each subset varies, but it cannot exceed K. 50% of the subsets are 1- or 2-element sets. The distribution can be visualised as:
number of elem.|frequency
1 #########################
2 ##############
3 #####
4 ###
...
n #
We can combine sets - this is just a simple union of sets, i.e. if A = {1,2,5,6}, B = {2,6,33} then A + B = {1,2,5,6,33}.
We have to cluster these sets so that the number of elements in each cluster (the size of the union) is minimal, with at least P elements per cluster.
For example: A = {1,2,3}, B = {5,6}, C = {7,8}, D = {9,10,11}; the output should be: group 1: AB, group 2: CD (or AC and BD) - we get 2 groups of 5 elements each. Grouping AD and BC is not optimal because we get 6 and 4 elements respectively.
N and P can be arbitrary numbers, in my case 25000<N<35000, 10<P<30. The problem is very practical, not only a math task.
How can I approach this? Which algorithm is most appropriate?
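
Not a full answer, but a minimal Python sketch of one plausible greedy baseline (the function name and the cost choice are my assumptions, not a known-optimal algorithm): grow each cluster by the candidate set that adds the fewest new elements, until the union reaches P.

# Greedy heuristic sketch: grow each cluster with the set that contributes
# the fewest new elements until the cluster union has at least P elements.
def greedy_cluster(subsets, P):
    remaining = [set(s) for s in subsets]
    clusters = []
    while remaining:
        cluster = remaining.pop(0)
        while len(cluster) < P and remaining:
            best = min(remaining, key=lambda s: len(s - cluster))
            remaining.remove(best)
            cluster |= best
        clusters.append(cluster)
    return clusters

print(greedy_cluster([{1, 2, 3}, {5, 6}, {7, 8}, {9, 10, 11}], P=5))
# [{1, 2, 3, 5, 6}, {7, 8, 9, 10, 11}] (set print order may vary)

Note the last cluster can end up with fewer than P elements and would need merging, and with 25000 < N < 35000 the linear min scan per step is slow; an inverted index from elements to sets, or a local-search refinement pass, would be the next step.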

Add string to the end of column names in a dataframe

I have the following df:
          BRET  CRET  NET
SEEN         4     4    5
NOT SEEN     5     9    9
DELETED      9    14   13
I would like to add a string to each of the column headers.
The desired output would look like this:
          BRET :this M  CRET : past M  NET : past 2 M
SEEN                 4              4               5
NOT SEEN             5              9               9
DELETED              9             14              13
The point is that I don't want to rename the columns outright, but simply append the strings to the end of the existing column names.
Would that be possible?
Yes, you can use .values:
df.columns.values[0] = df.columns.values[0] + ' :this M'
df.columns.values[1] = df.columns.values[1] + ' : past M'
df.columns.values[2] = df.columns.values[2] + ' : past 2 M'
Output:
          BRET :this M  CRET : past M  NET : past 2 M
SEEN                 4              4               5
NOT SEEN             5              9               9
DELETED              9             14              13
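
For reference, the same effect can be achieved by rebuilding the column index in one assignment rather than mutating the underlying values array (a sketch using the suffixes from the question):

# Append a suffix to each existing column name in one pass.
suffixes = [' :this M', ' : past M', ' : past 2 M']
df.columns = [col + suf for col, suf in zip(df.columns, suffixes)]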

Replicate a SQL Join + Group By in data.table

I am trying to replicate a SQL sparse matrix multiplication using data.table. The SQL expression would be:
SELECT a.i, b.j, SUM(a.value*b.value)
FROM a, b
WHERE a.j = b.i
GROUP BY a.i, b.j;
where my data is structured as |i|j|value in each table
To create this data in R you can use:
library(reshape2)
library(data.table)
A <- matrix(runif(25),5,5)
B <- matrix(runif(25),5,5)
ADT <- data.table(melt(A))
BDT <- data.table(melt(B))
setnames(ADT,old = c("Var1","Var2","value"), new = c("Ai","Aj","AVal"))
setnames(BDT,old = c("Var1","Var2","value"), new = c("Bi","Bj","BVal"))
To merge using data.table's X[Y] join syntax, we need to set the keys that we join on:
setkey(ADT,"Aj")
setkey(BDT,"Bi")
Building it up piece by piece:
ADT[BDT, allow.cartesian = T]
Ai Aj AVal Bj BVal
1: 1 1 0.39230905 1 0.7083956
2: 2 1 0.89523490 1 0.7083956
3: 3 1 0.92464689 1 0.7083956
4: 4 1 0.15127499 1 0.7083956
5: 5 1 0.88838458 1 0.7083956
---
121: 1 5 0.70144360 5 0.7924433
122: 2 5 0.50409075 5 0.7924433
123: 3 5 0.15693879 5 0.7924433
124: 4 5 0.09164371 5 0.7924433
125: 5 5 0.63787487 5 0.7924433
So far so good. The merge worked properly; Bi has disappeared, but it is encoded by Aj anyway. We now want to multiply AVal by BVal and then sum over the created groups (in the same expression - I know I could store the join and apply a second expression here). I had thought this would be:
ADT[BDT, j = list(Ai, Bj, Value = sum(AVal*BVal)), by = c("Ai","Bj") , allow.cartesian = T]
but I get the error: object 'Bj' not found. In fact, none of the values from BDT are usable once I insert the by = clause (systematically remove Bj, BVal and "Bj" from the expression above, left to right, and you will see what I mean).
Looking into the .EACHI expression, it seems the intent there is close to what I want, but .EACHI groups on the merged index, not on a separate variable.
Sounds like you simply want to aggregate after the merge:
ADT[BDT, allow.cartesian = T][, sum(AVal * BVal), by = .(Ai, Bj)]
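
For comparison, the same join-then-aggregate shape in pandas (a sketch, not part of the original data.table question; it assumes DataFrames a and b with columns i, j, value):

import pandas as pd

# a.j joins b.i; overlapping column names get _a / _b suffixes
merged = a.merge(b, left_on='j', right_on='i', suffixes=('_a', '_b'))
result = (merged.assign(value=merged['value_a'] * merged['value_b'])
                .groupby(['i_a', 'j_b'], as_index=False)['value'].sum())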

Find all paths of at most length 2 from a set of relationships

I have a connection data set in which each row marks a direct connection, "A connects B", in the form A B. Each direct connection between A and B appears only once, either as A B or as B A. I want to find all the connections at most one hop away, i.e. A and C are at most one hop away if A and C are directly connected, or if A connects to C through some B.
For example, I have the following direct connection data
1 2
2 4
3 7
4 5
Then the resulting data I want is
1 {2,4}
2 {1,4,5}
3 {7}
4 {1,2,5}
5 {2,4}
7 {3}
Could anybody help me find a way to do this as efficiently as possible? Thank you.
You could do this:
myudf.py
@outputSchema('bagofnums: {(num:int)}')
def merge_distinct(b1, b2):
    out = []
    # each bag element is a (num, link) tuple; keep only the link field
    for ignore, n in b1:
        out.append(n)
    for ignore, n in b2:
        out.append(n)
    return out
script.pig
register 'myudf.py' using jython as myudf ;
A = LOAD 'foo.in' USING PigStorage(' ') AS (num: int, link: int) ;
-- Essentially flips A
B = FOREACH A GENERATE link AS num, num AS link ;
-- We need to union the flipped A with A so that we will know:
-- 3 links to 7
-- 7 links to 3
-- Instead of just:
-- 3 links to 7
C = UNION A, B ;
-- C is in the form (num, link)
-- You can't do JOIN C BY link, C BY num ;
-- So, T is just C replicated
T = FOREACH C GENERATE * ;
D = JOIN C BY link, T BY num ;
E = FOREACH (FILTER D BY $0 != $3) GENERATE $0 AS num, $3 AS link_hopped ;
-- The output from E is (num, link) pairs where the link is one hop away. E.g. given:
-- 1 links to 2
-- 2 links to 4
-- 3 links to 7
-- The output will be:
-- 1 links to 4
F = COGROUP C BY num, E BY num ;
-- I use a UDF here to merge the bags together. Otherwise you will end
-- up with a bag for C (direct links) and E (links one hop away).
G = FOREACH F GENERATE group AS num, myudf.merge_distinct(C, E) ;
Schema and output for G using your sample input:
G: {num: int,bagofnums: {(num: int)}}
(1,{(2),(4)})
(2,{(4),(1),(5)})
(3,{(7)})
(4,{(5),(2),(1)})
(5,{(4),(2)})
(7,{(3)})
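
For readers who don't use Pig, a plain-Python sketch of the same logic (variable names are mine), using the sample edges from the question:

from collections import defaultdict

edges = [(1, 2), (2, 4), (3, 7), (4, 5)]

# Symmetric direct-link map (the UNION A, B step above).
direct = defaultdict(set)
for a, b in edges:
    direct[a].add(b)
    direct[b].add(a)

# Union each node's direct links with its neighbors' links
# (the self-join step), excluding the node itself.
result = {}
for n, links in direct.items():
    hop = set(links)
    for m in links:
        hop |= direct[m]
    hop.discard(n)
    result[n] = hop

for n in sorted(result):
    print(n, sorted(result[n]))
# 1 [2, 4]
# 2 [1, 4, 5]
# 3 [7]
# 4 [1, 2, 5]
# 5 [2, 4]
# 7 [3]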