Fast row-wise boolean operations in sparse matrices (pandas)

I have a dataframe of ~4.4M purchase orders. I'm interested in a column that indicates the presence of certain items in each order. It is structured like this:
df['item_arr'].head()
1 [a1, a2, a5]
2 [b1, b2, c3...
3 [b3]
4 [a2]
There are 4k different items, and each row always contains at least one. I have generated another 4.4M x 4k dataframe df_te_sub with a sparse structure indicating the same arrays in terms of booleans, i.e.
c = df_te_sub.columns[[10, 20, 30]]
df_te_sub[c].head()
a10 b8 c1
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
The names of the columns are not important, although they are in alphabetical order, for what it's worth.
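(In case it helps reproduce the setup, here is a rough sketch, not from the original post, of one way such a sparse boolean frame could be built from the list column; everything beyond df and item_arr is illustrative:)
import pandas as pd

# Explode the list column so each (order, item) pair becomes one row,
# then cross-tabulate presence and store it with a sparse boolean dtype.
exploded = df['item_arr'].explode()
df_te_sub = (
    pd.crosstab(exploded.index, exploded)
      .astype(bool)
      .astype(pd.SparseDtype(bool, False))
)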
Given a subset g of items, I am trying to extract the orders (rows) for two different cases:
At least one of the items is present in the row
The items present in the row are all a subset of g
For the first rule, the best way I found was:
c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)
The second rule presented a challenge. I tried different things but they are all slow:
# using set
s = set(g)
df['item_arr'].apply(s.issuperset)
# using the "non selected items"
c = df_te_sub[df_te_sub.columns[~df_te_sub.columns.isin(g)]].sparse.to_coo()
x = np.ones(len(df_te_sub), dtype='bool')
x[c.row] = False
# mix
s = set(g)
c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)
df['item_arr'].iloc[rows].apply(s.issuperset)
Any ideas to improve performance? I need to do this for several subsets.
The output can be given either in rows (e.g. [0, 2, 3]) or as a boolean mask (e.g. True False True True ....), as both will work to slice the order dataframe.

I feel like you're overthinking this. If you have a boolean array of membership you've already done 90% of the work.
from scipy.sparse import csc_matrix
# Turn your sparse data into a sparse array
arr = csc_matrix(df_te_sub.sparse.to_coo())
# Get the number of items per row
row_len = arr.sum(axis=1).A.flatten()
# Get the column indices for each item and slice your array
arr_col_idx = [df_te_sub.columns.get_loc(g_val) for g_val in g]
# Sum the number of items in g in the slice per row
arr_g = arr[:, arr_col_idx].sum(axis=1).A.flatten()
# Find all the rows with at least one thing in g
arr_one_g = arr_g > 0
# Find all the things in the rows which are subsets of G
# This assumes row_len is always greater than 0; if it isn't, add a test for that
arr_subset_g = (row_len - arr_g) == 0
arr_one_g and arr_subset_g are 1-d boolean arrays that should select the rows you want.
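For example (this snippet is mine, not part of the original answer), assuming df and df_te_sub are positionally aligned as in the question, the masks can be used directly to slice the orders:
import numpy as np

# Orders with at least one item from g
orders_any_g = df[arr_one_g]

# Orders whose items are all contained in g
orders_subset_g = df[arr_subset_g]

# Row positions instead of a boolean mask, if that is more convenient
rows_any_g = np.flatnonzero(arr_one_g)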


Is it possible to add columns to a pandas dataframe without filling it with any values?

So I have a pandas dataframe which is being passed from function to function. However, at the moment I do not have any data to populate the rows with.
Furthermore, because of the way the code is structured, the dataframe needs to have certain columns.
Is it possible to add columns to a dataframe without mapping them to any value? I also don't want to map them to 0 or None or any default value. I would just like the empty dataframe with certain columns.
e.g.
...
def _trades(self, trades_df):
    trades_df = trades_df.rename(columns={'timestamp': 'trade_timestamp'})
    trades_df['publication_timestamp'] = trades_df['trade_timestamp']
    trades_df['trade_id'] = trades_df['trade_id'].astype(str)
    # set printable column - this way is empty dataframe safe
    trades_df['printable'] = True
    # No trade_types to map explicitly
    trades_df['trade_type'] = None
    trades_df['implied'] = 0
    return trades_df
As you can see above, the implied column is mapped to 0 and trade_type is mapped to None.
However, I just want to add the columns without mapping them to any default value.
In pandas, the dataframe object is tabular. This means it contains a rectangular collection of values. This rectangle can have zero rows, in which case columns can be added without any values in those columns.
However, if the rectangle has a non-zero number of rows, then each row in a column must have a value. This value can be None (python's null object value) or NaN (numpy's not-a-number value) or the empty string, or even an empty python sequence (tuple or list). But there is no such thing, in a dataframe with both axes (rows and columns) having non-zero length, as a cell without any value.
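As a minimal illustration of the zero-row case (the column names here are just for the example):
import pandas as pd

# A frame with zero rows can carry columns that hold no values at all
empty = pd.DataFrame(columns=['trade_id', 'trade_type', 'implied'])
print(empty.shape)   # (0, 3)

# Adding another column is still value-free, because there are no rows to fill
empty['printable'] = True
print(empty.shape)   # (0, 4)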
The one other thing you can do is to initialize the data in a new column using numpy.empty() which according to the docs will:
Return a new array of given shape and type, without initializing entries.
Consider this code:
trades_df['trade_type'] = np.empty([len(trades_df)])
trades_df['implied'] = np.empty([len(trades_df)])
Input:
trade_timestamp publication_timestamp trade_id printable
0 1 1 101 True
1 2 2 102 True
2 3 3 103 True
Output:
trade_timestamp publication_timestamp trade_id printable trade_type
0 1 1 101 True 6.953347e-310
1 2 2 102 True 6.953347e-310
2 3 3 103 True 6.953347e-310
trade_timestamp publication_timestamp trade_id printable trade_type implied
0 1 1 101 True 6.953347e-310 1.232637e-311
1 2 2 102 True 6.953347e-310 1.232637e-311
2 3 3 103 True 6.953347e-310 1.232637e-311
The above example passes the default dtype argument float to numpy.empty(), but it is possible to use other numpy scalar types instead.
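For instance (a sketch under the same setup, not from the original answer), an integer dtype yields arbitrary leftover integers, while dtype=object is documented by NumPy to initialize entries to None:
import numpy as np

# Arbitrary leftover integers rather than junk floats
trades_df['implied'] = np.empty(len(trades_df), dtype=int)

# NumPy documents that object arrays from np.empty are initialized to None,
# so this is effectively the same as assigning None directly
trades_df['trade_type'] = np.empty(len(trades_df), dtype=object)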
yes, of course:
df["C"] = ""

How to process 50 million rows fast in pandas?

I am trying to measure sentence similarity between 2 sets of questions using Spacy, and then output the pairs along with their similarity score in 3 columns (one dataframe) using pandas.
This will result in about 50 million rows, and it has been processing for 12 hours.
Is there any way I can speed up this process please?
user_inputs_vector = [nlp(row) for row in user_inputs_df["User_Input"]]  # vectorise user inputs
# creating doc object for comparison
sample_df_vector = [nlp(row) for row in sample_df["Frage"]]  # vectorise sample sentences

# for loop to compare questions and then append similarity score in empty lists
similarity_score_list = []
sample_list = []
user_input_list = []
for i in range(len(sample_df_vector)):
    for j in range(len(user_inputs_vector)):
        similar_frage = user_inputs_vector[j].similarity(sample_df_vector[i])
        similarity_score_list.append(similar_frage)
        sample_list.append(sample_df_vector[i])
        user_input_list.append(user_inputs_vector[j])

similarity_dataframe = pd.DataFrame(
    list(zip(sample_list, user_input_list, similarity_score_list)),
    columns=["Samples", "User Inputs", "Similarity Score"])
I have run this code on a very small dataframe and it works fine. It's just that the actual dataset has millions of rows and I don't see it ending yet. Please help.

Groupby and A) concatenate matching strings (and/or substrings) B) sum the values

I have df:
row_numbers ID code amount
1 med a 1
2 med a, b 1
3 med b, c 1
4 med c 1
5 med d 10
6 cad a, b 1
7 cad a, b, d 0
8 cad e 2
I wanted to do a groupby on column ID and A) combine the strings if a string/substring matches (on column code), B) sum the values of column amount.
Expected results:
Explanation:
Column row_numbers has no role here; I just included it to explain the output.
A) Grouping on column ID and looking at column code: row 1's string (a) matches row 2's substring; row 2's substring (b) matches row 3's substring; row 3's substring (c) matches row 4's string, hence rows 1, 2, 3 and 4 are combined. Row 5's string does not match any other string/substring, so it forms a separate group. B) Based on this, the values of rows 1, 2, 3 and 4 are summed, and row 5 stays as a separate group.
Thanks in advance for your time and thoughts:).
EDIT - 1
Pasting the real data.
Expected output:
Explanation:
I have to group on column id, concatenate the values of column code, and sum the values of columns units and vol. The matching (to be concatenated) values of column code are colour coded. Row 1 has a link with rows 5 and 9; row 9 in turn has a link with row 3, hence rows 1, 5, 9 and 3 are combined. Similarly rows 2 and 7, and so on. Row 8 has no link with any of the values within group med (column id) and hence stays as a separate row.
Thanks!
Update: given your latest sample data, this is not simple data munging. There is no vectorized solution; it relates to graph theory. You need to find the connected components within each group of ID and do the calculation on each connected component.
Consider each string as a node of a graph. If two strings overlap, they are connected nodes. For every node, you need to traverse all paths connected to it and do the calculation on all nodes reached through these paths. This traversal can be done using depth-first search.
However, before running the depth-first search, you need to preprocess each string into a set so that overlaps can be checked.
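To make the overlap test concrete, here is a tiny illustration of that preprocessing (the code values are taken from the sample output below; this snippet itself is not part of the original answer):
# Each code string becomes a set of tokens
row1 = set('CO-96, CO-B15'.replace(' ', '').split(','))   # {'CO-96', 'CO-B15'}
row2 = set('CO-B15, OA-18'.replace(' ', '').split(','))   # {'CO-B15', 'OA-18'}
row3 = set('PR-87'.replace(' ', '').split(','))           # {'PR-87'}

# Two rows (nodes) are connected if their sets share at least one code
print(not row1.isdisjoint(row2))   # True  -> connected
print(not row1.isdisjoint(row3))   # False -> not connected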
Method 1: Recursive
Do the following:
Define a function dfs to recursively run depth-first search.
Define a function gfunc to use with groupby apply. This function will traverse the elements of each group of ID and return the desired dataframe.
Get rid of any blank spaces in each string, split it, and convert it to a set using replace, split and map; assign the result to a new column new_code of df.
Call groupby on ID and apply using function gfunc. Call droplevel and reset_index to get the desired output.
Code as follows:
import numpy as np
import pandas as pd

def dfs(node, index, glist, g_checked_rows):
    ret_arr = df.loc[index, ['code', 'amount', 'volume']].values
    g_checked_rows.add(index)
    for j, s in glist:
        if j not in g_checked_rows and not node.isdisjoint(s):
            t_arr = dfs(s, j, glist, g_checked_rows)
            ret_arr[0] += ', ' + t_arr[0]
            ret_arr[1:] += t_arr[1:]
    return ret_arr

def gfunc(x):
    checked_rows = set()
    final = []
    code_list = list(x.new_code.items())
    for i, row in code_list:
        if i not in checked_rows:
            final.append(dfs(row, i, code_list, checked_rows))
    return pd.DataFrame(final, columns=['code', 'units', 'vol'])

df['new_code'] = df.code.str.replace(' ', '').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc).droplevel(1).reset_index()
Out[16]:
ID code units vol
0 med CO-96, CO-B15, CO-B15, CO-96, OA-18, OA-18 4 4
1 med CO-16, CO-B20, CO-16 3 3
2 med CO-252, CO-252, CO-45 3 3
3 med OA-258 1 1
4 cad PR-96, PR-96, CO-243 4 4
5 cad PR-87, OA-258, PR-87 3 3
Note: I assume your pandas version is 0.24+. If it is < 0.24, in the last step you need to use reset_index and drop instead of droplevel and reset_index, as follows:
df_final = df.groupby('ID', sort=False).apply(gfunc).reset_index().drop('level_1', 1)
Method 2: Iterative
To make this complete, I implemented a version of gfunc using an iterative process instead of recursion. The iterative process requires only one function; however, the function is more complicated. The logic of the iterative process is as follows:
Push the first node to the deque. If the deque is not empty, pop the top node out.
If a node is not marked checked, process it and mark it as checked.
Find all its neighbours (in the reverse order of the list of nodes) that haven't been marked, and push them to the deque.
If the deque is not empty, pop a node from the top of the deque and process it from step 2.
Code as follows:
from collections import deque

def gfunc_iter(x):
    checked_rows = set()
    final = []
    q = deque()
    code_list = list(x.new_code.items())
    code_list_rev = code_list[::-1]
    for i, row in code_list:
        if i not in checked_rows:
            q.append((i, row))
            ret_arr = np.array(['', 0, 0], dtype='O')
            while (q):
                n, node = q.pop()
                if n in checked_rows:
                    continue
                ret_arr_child = df.loc[n, ['code', 'amount', 'volume']].values
                if not ret_arr[0]:
                    ret_arr = ret_arr_child.copy()
                else:
                    ret_arr[0] += ', ' + ret_arr_child[0]
                    ret_arr[1:] += ret_arr_child[1:]
                checked_rows.add(n)
                # push to `q` all neighbors in the reversed list of nodes
                for j, s in code_list_rev:
                    if j not in checked_rows and not node.isdisjoint(s):
                        q.append((j, s))
            final.append(ret_arr)
    return pd.DataFrame(final, columns=['code', 'units', 'vol'])

df['new_code'] = df.code.str.replace(' ', '').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc_iter).droplevel(1).reset_index()
I believe the three main ideas for executing what you want are:
create an accumulator data structure (a DataFrame in this case)
iterate over pairs of rows, so in each iteration you have (currentRow, nextRow)
pattern-match the current row against the next row, and against the accumulated rows
It's not totally clear exactly which pattern match you're looking for, so I assumed that if any letter of currentRow's code is in the next one, they should be concatenated.
Using a data.csv (with space separators) as an example:
row_numbers ID code amount
1 med a 1
2 med a,b 1
3 med b,c 1
4 med c 1
5 med d 10
6 cad a,b 1
7 cad a,b,d 0
8 cad e 2
import pandas as pd
from itertools import zip_longest

def generate_pairs(group):
    ''' generate pairs (currentRow, nextRow) '''
    group_curriterrows = group.iterrows()
    group_nextiterrows = group.iterrows()
    group_nextiterrows.__next__()
    zip_list = zip_longest(group_curriterrows, group_nextiterrows)
    return zip_list

def generate_lists_to_check(currRow, nextRow, accumulated_rows):
    ''' generate list if any next letters are in current ones and
    another list if any next letters are in the accumulated codes '''
    currLetters = str(currRow["code"]).split(",")
    nextLetters = str(nextRow["code"]).split(",")
    letter_inNext = [letter in nextLetters for letter in currLetters]
    unique_acc_codes = [str(v) for v in accumulated_rows["code"].unique()]
    letter_inHistory = [any(letter in unq for letter in nextLetters)
                        for unq in unique_acc_codes]
    return letter_inNext, letter_inHistory

def create_newRow(accumulated_rows, nextRow):
    nextRow["row_numbers"] = str(nextRow["row_numbers"])
    accumulated_rows = accumulated_rows.append(nextRow, ignore_index=True)
    return accumulated_rows

def update_existingRow(accumulated_rows, match_idx, Row):
    accumulated_rows.loc[match_idx]["code"] += "," + Row["code"]
    accumulated_rows.loc[match_idx]["amount"] += Row["amount"]
    accumulated_rows.loc[match_idx]["volume"] += Row["volume"]
    accumulated_rows.loc[match_idx]["row_numbers"] += ',' + str(Row["row_numbers"])
    return accumulated_rows

if __name__ == "__main__":
    df = pd.read_csv("extended.tsv", sep=" ")
    groups = pd.DataFrame(columns=df.columns)
    for ID, group in df.groupby(["ID"], sort=False):
        accumulated_rows = pd.DataFrame(columns=df.columns)
        group_firstRow = group.iloc[0]
        accumulated_rows.loc[len(accumulated_rows)] = group_firstRow.values
        row_numbers = str(group_firstRow.values[0])
        accumulated_rows.set_value(0, 'row_numbers', row_numbers)
        zip_list = generate_pairs(group)
        for (currRow_idx, currRow), Next in zip_list:
            if not (Next is None):
                (nextRow_idx, nextRow) = Next
                letter_inNext, letter_inHistory = \
                    generate_lists_to_check(currRow, nextRow, accumulated_rows)
                if any(letter_inNext):
                    accumulated_rows = update_existingRow(accumulated_rows, (len(accumulated_rows) - 1), nextRow)
                elif any(letter_inHistory):
                    matches = [idx for (idx, bool_val) in enumerate(letter_inHistory) if bool_val == True]
                    first_match_idx = matches[0]
                    accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, nextRow)
                    for match_idx in matches[1:]:
                        accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, accumulated_rows.loc[match_idx])
                        accumulated_rows = accumulated_rows.drop(match_idx)
                elif not any(letter_inNext):
                    accumulated_rows = create_newRow(accumulated_rows, nextRow)
        groups = groups.append(accumulated_rows)
    groups.reset_index(inplace=True, drop=True)
    print(groups)
OUTPUT for the first example, in normal row order (the lines using column volume were removed from the code above because the first example has no volume column):
row_numbers ID code amount
0 1 med a,a,b,b,c,c 4
1 5 med d 10
2 6 cad a,b,a,b,d 1
3 8 cad e 2
OUTPUT for the new example:
row_numbers ID code amount volume
0 1,5,9,3 med CO-96,CO-B15,CO-B15,CO-96,OA-18,OA-18 4 4
1 2,7 med CO-16,CO-B20,CO-16 3 3
2 4,6 med CO-252,CO-252,CO-45 3 3
3 8 med OA-258 1 1
4 10,13 cad PR-96,PR-96,CO-243 4 4
5 11,12 cad PR-87,OA-258,PR-87 3 3

Tukey-Test Grouping and plotting in SciPy

I'm trying to plot results from a Tukey test, but I am struggling with putting data into groups based on a p-value. This is the equivalent in R which I am trying to replicate. I have been using the SciPy one-way ANOVA test and the Tukey test from statsmodels, but can't get these groups done in the same way.
Any help is greatly appreciated
I've also just found another example in R of what I want to do in Python.
I have been struggling to do the same thing. I found a paper that tells you how to code the letters.
Hans-Peter Piepho (2004) An Algorithm for a Letter-Based Representation of All-Pairwise Comparisons, Journal of Computational and Graphical Statistics, 13:2, 456-466, DOI: 10.1198/1061860043515
Doing the coding was a little tricky, as you need to check and replicate columns and then combine columns. I tried to add some comments to the code. I figured out a method where you can run tukeyhsd and then compute the letters from the results. It should be possible to turn this into a function, or hopefully make it part of tukeyhsd. My data is not posted, but it is a column of data and then a column describing the groups. The groups for me are the five boroughs of NYC. You can also just toggle the comments and use random data the first time.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import MultiComparison
import matplotlib.pyplot as plt

# Read data. Comment this line out and uncomment the next ones to use random data instead.
df = pd.read_excel('anova_test.xlsx')
#n = 1000
#df = pd.DataFrame(columns=['Groups','Data'], index=np.arange(n))
#df['Groups'] = np.random.randint(1, 4, size=n)
#df['Data'] = df['Groups']*np.random.random_sample(size=n)

# define columns for data and then grouping
col_to_group = 'Groups'
col_for_data = 'Data'

# Now take the data and regroup for anova
samples = [cols[1] for cols in df.groupby(col_to_group)[col_for_data]]  # I am not sure how this works but it makes a numpy array for each group
f_val, p_val = stats.f_oneway(*samples)  # I am not sure what this star does but this passes all the numpy arrays correctly
#print('F value: {:.3f}, p value: {:.3f}\n'.format(f_val, p_val))

# this if statement can be uncommented if you don't want to go further when p >= 0.05
#if p_val < 0.05:  # If the p value is less than 0.05 it then does the tukey
mod = MultiComparison(df[col_for_data], df[col_to_group])
thsd = mod.tukeyhsd()
#print(mod.tukeyhsd())

# This does the Piepho method: An Algorithm for a Letter-Based Representation of All-Pairwise Comparisons.
tot = len(thsd.groupsunique)
# make an empty dataframe that is a square matrix of the size of the groups; set the first column to 1
df_ltr = pd.DataFrame(np.nan, index=np.arange(tot), columns=np.arange(tot))
df_ltr.iloc[:, 0] = 1
count = 0
df_nms = pd.DataFrame('', index=np.arange(tot), columns=['names'])  # I make a dummy dataframe to put axis labels into. sd stands for significant difference
for i in np.arange(tot):  # I loop through and make all pairwise comparisons.
    for j in np.arange(i+1, tot):
        #print('i=', i, 'j=', j, thsd.reject[count])
        if thsd.reject[count] == True:
            for cn in np.arange(tot):
                if df_ltr.iloc[i, cn] == 1 and df_ltr.iloc[j, cn] == 1:  # If the column contains both i and j, shift and duplicate
                    df_ltr = pd.concat([df_ltr.iloc[:, :cn+1], df_ltr.iloc[:, cn+1:].T.shift().T], axis=1)
                    df_ltr.iloc[:, cn+1] = df_ltr.iloc[:, cn]
                    df_ltr.iloc[i, cn] = 0
                    df_ltr.iloc[j, cn+1] = 0
                    # Now we need to check all columns for absorption.
                    for cleft in np.arange(len(df_ltr.columns)-1):
                        for cright in np.arange(cleft+1, len(df_ltr.columns)):
                            if (df_ltr.iloc[:, cleft].isna()).all() == False and (df_ltr.iloc[:, cright].isna()).all() == False:
                                if (df_ltr.iloc[:, cleft] >= df_ltr.iloc[:, cright]).all() == True:
                                    df_ltr.iloc[:, cright] = 0
                                    df_ltr = pd.concat([df_ltr.iloc[:, :cright], df_ltr.iloc[:, cright:].T.shift(-1).T], axis=1)
                                if (df_ltr.iloc[:, cleft] <= df_ltr.iloc[:, cright]).all() == True:
                                    df_ltr.iloc[:, cleft] = 0
                                    df_ltr = pd.concat([df_ltr.iloc[:, :cleft], df_ltr.iloc[:, cleft:].T.shift(-1).T], axis=1)
        count += 1

# I sort so that the first column becomes A
df_ltr = df_ltr.sort_values(by=list(df_ltr.columns), axis=1, ascending=False)

# I assign letters to each column
for cn in np.arange(len(df_ltr.columns)):
    df_ltr.iloc[:, cn] = df_ltr.iloc[:, cn].replace(1, chr(97+cn))
    df_ltr.iloc[:, cn] = df_ltr.iloc[:, cn].replace(0, '')
    df_ltr.iloc[:, cn] = df_ltr.iloc[:, cn].replace(np.nan, '')

# I put all the letters into one string
df_ltr = df_ltr.astype(str)
df_ltr.sum(axis=1)
#print(df_ltr)
#print('\n')
#print(df_ltr.sum(axis=1))

# Now to plot like R with a violin/box plot
fig, ax = plt.subplots()
df.boxplot(column=col_for_data, by=col_to_group, ax=ax, fontsize=16, showmeans=True,
           boxprops=dict(linewidth=2.0), whiskerprops=dict(linewidth=2.0))  # This makes the boxplot
ax.set_ylim([-10, 20])
grps = pd.unique(df[col_to_group].values)  # Finds the group names
grps.sort()  # This is critical! Puts the groups in alphabetical order to make it match the plotting
props = dict(facecolor='white', alpha=1)
for i, grp in enumerate(grps):  # I loop through the groups to make the scatters and figure out the axis labels.
    x = np.random.normal(i+1, 0.15, size=len(df[df[col_to_group] == grp][col_for_data]))
    ax.scatter(x, df[df[col_to_group] == grp][col_for_data], alpha=0.5, s=2)
    name = "{}\navg={:0.2f}\n(n={})".format(grp,
                                            df[df[col_to_group] == grp][col_for_data].mean(),
                                            df[df[col_to_group] == grp][col_for_data].count())
    df_nms['names'][i] = name
    ax.text(i+1, ax.get_ylim()[1]*1.1, df_ltr.sum(axis=1)[i], fontsize=10,
            verticalalignment='top', horizontalalignment='center', bbox=props)
ax.set_xticklabels(df_nms['names'], rotation=0, fontsize=10)
ax.set_title('')
fig.suptitle('')
fig.savefig('anovatest.jpg', dpi=600, bbox_inches='tight')
Results showing the letters above the plots using tukeyhsd:
Here is a function that returns letter labels if you have a symmetric matrix of p-values from a Tukey test:
import numpy as np

def tukeyLetters(pp, means=None, alpha=0.05):
    '''TUKEYLETTERS - Produce list of group labels for TukeyHSD
    letters = TUKEYLETTERS(pp), where PP is a symmetric matrix of
    probabilities from a Tukey test, returns alphabetic labels
    for each group to indicate clustering. PP may also be a vector
    from PAIRWISE_TUKEYHSD.
    Optional argument MEANS specifies group means, which is used for
    ordering the letters. ("a" gets assigned to the group with lowest
    mean.) Without this argument, ordering is arbitrary.
    Optional argument ALPHA specifies cutoff for treating groups as
    part of the same cluster.'''
    if len(pp.shape) == 1:
        # vector
        G = int(3 + np.sqrt(9 - 4*(2 - len(pp)))) // 2
        ppp = .5*np.eye(G)
        ppp[np.triu_indices(G, 1)] = pp
        pp = ppp + ppp.T
    conn = pp > alpha
    G = len(conn)
    if np.all(conn):
        return ['a' for g in range(G)]
    conns = []
    for g1 in range(G):
        for g2 in range(g1+1, G):
            if conn[g1, g2]:
                conns.append((g1, g2))

    letters = [[] for g in range(G)]
    nextletter = 0
    for g in range(G):
        if np.sum(conn[g, :]) == 1:
            letters[g].append(nextletter)
            nextletter += 1
    while len(conns):
        grp = set(conns.pop(0))
        for g in range(G):
            if all(conn[g, np.sort(list(grp))]):
                grp.add(g)
        for g in grp:
            letters[g].append(nextletter)
        for g in grp:
            for h in grp:
                if (g, h) in conns:
                    conns.remove((g, h))
        nextletter += 1

    if means is None:
        means = np.arange(G)
    means = np.array(means)
    groupmeans = []
    for k in range(nextletter):
        ingroup = [g for g in range(G) if k in letters[g]]
        groupmeans.append(means[np.array(ingroup)].mean())
    ordr = np.empty(nextletter, int)
    ordr[np.argsort(groupmeans)] = np.arange(nextletter)
    result = []
    for ltr in letters:
        lst = [chr(97 + ordr[x]) for x in ltr]
        lst.sort()
        result.append(''.join(lst))
    return result
To make that concrete, here is a full example:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
data = [ 1,2,2,1,4,5,4,5,7,8,7,8,1,3,4,5 ]
group = [ 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 ]
tuk = pairwise_tukeyhsd(data, group)
letters = tukeyLetters(tuk.pvalues)
This will result in letters containing ['a', 'c', 'b', 'ac']

Create 20 unique bingo cards

I'm trying to create 20 unique cards with numbers, but I'm struggling a bit. So basically I need to create 20 unique 3x3 matrices with numbers 1-10 in the first column, numbers 11-20 in the second column and 21-30 in the third column. Any ideas? I'd prefer to have it done in R, especially as I don't know Visual Basic. In Excel I know how to generate the cards, but I'm not sure how to ensure they are unique.
It seems quite precise and straightforward to me. Anyway, I needed to create 20 matrices that would look like:
[,1] [,2] [,3]
[1,] 5 17 23
[2,] 8 18 22
[3,] 3 16 24
Each of the matrices should be unique and each of the columns should consist of three unique numbers ( the 1st column - numbers 1-10, the 2nd column 11-20, the 3rd column - 21-30).
Generating random numbers is easy, but how do you make sure that the generated cards are unique? Please have a look at the post that I accepted as an answer, as it gives a thorough explanation of how to achieve it.
(N.B. : I misread "rows" instead of "columns", so the following code and explanation will deal with matrices with random numbers 1-10 on 1st row, 11-20 on 2nd row etc., instead of columns, but it's exactly the same just transposed)
This code should guarantee uniqueness and good randomness :
library(gtools)

# helper function
getKthPermWithRep <- function(k, n, r){
  k <- k - 1
  if(n^r < k){
    stop('k is greater than possible permutations')
  }
  v <- rep.int(0, r)
  index <- length(v)
  while (k != 0)
  {
    remainder <- k %% n
    k <- k %/% n
    v[index] <- remainder
    index <- index - 1
  }
  return(v + 1)
}

# get all possible permutations of 10 elements taken 3 at a time
# (singlerowperms = 720)
allperms <- permutations(10, 3)
singlerowperms <- nrow(allperms)

# get 20 random and unique bingo cards
cards <- lapply(sample.int(singlerowperms^3, 20), FUN = function(k){
  perm2use <- getKthPermWithRep(k, singlerowperms, 3)
  m <- allperms[perm2use, ]
  m[2, ] <- m[2, ] + 10
  m[3, ] <- m[3, ] + 20
  return(m)
  # if you want to transpose the result just do:
  # return(t(m))
})
Explanation
(disclaimer tl;dr)
To guarantee both randomness and uniqueness, one safe approach is to generate all the possible bingo cards and then choose randomly among them without replacement.
To generate all the possible cards, we should :
generate all the possibilities for each row of 3 elements
get the cartesian product of them
Step (1) can be easily obtained using the function permutations of package gtools (see the object allperms in the code). Note that we just need the permutations for the first row (i.e. 3 elements taken from 1-10), since the permutations of the other rows can be easily obtained from the first by adding 10 and 20 respectively.
Step (2) is also easy to get in R, but let's first consider how many possibilities will be generated. Step (1) returned 720 cases for each row, so, in the end we will have 720*720*720 = 720^3 = 373248000 possible bingo cards!
Generating all of them is not practical since the occupied memory would be huge; thus we need a way to get 20 random elements from this big range of possibilities without actually keeping them all in memory.
The solution comes from the function getKthPermWithRep, which, given an index k, returns the k-th permutation with repetition of r elements taken from 1:n (note that in this case a permutation with repetition corresponds to the cartesian product).
e.g.
# all permutations with repetition of 2 elements in 1:3 are
permutations(n = 3, r = 2,repeats.allowed = TRUE)
# [,1] [,2]
# [1,] 1 1
# [2,] 1 2
# [3,] 1 3
# [4,] 2 1
# [5,] 2 2
# [6,] 2 3
# [7,] 3 1
# [8,] 3 2
# [9,] 3 3
# using the getKthPermWithRep you can get directly the k-th permutation you want :
getKthPermWithRep(k=4,n=3,r=2)
# [1] 2 1
getKthPermWithRep(k=8,n=3,r=2)
# [1] 3 2
Hence now we just choose 20 random indexes in the range 1:720^3 (using sample.int function), then for each of them we get the corresponding permutation of 3 numbers taken from 1:720 using function getKthPermWithRep.
Finally, these triplets of numbers can be converted to actual card rows by using them as indexes to subset allperms and get our final matrix (after, of course, adding +10 and +20 to the 2nd and 3rd rows).
Bonus
Explanation of getKthPermWithRep
If you look at the example above (permutations with repetition of 2 elements in 1:3) and subtract 1 from all numbers in the results, you get this:
> permutations(n = 3, r = 2,repeats.allowed = T) - 1
[,1] [,2]
[1,] 0 0
[2,] 0 1
[3,] 0 2
[4,] 1 0
[5,] 1 1
[6,] 1 2
[7,] 2 0
[8,] 2 1
[9,] 2 2
If you consider each number of each row as a digit, you can see that those rows (00, 01, 02, ...) are all the numbers from 0 to 8 represented in base 3 (yes, 3 as n). So when you ask for the k-th permutation with repetition of r elements of 1:n, you are also asking to translate k-1 into base n and return the digits increased by 1.
Therefore, given the algorithm to change any number from base 10 to base n :
changeBase <- function(num, base){
  v <- NULL
  while (num != 0)
  {
    remainder = num %% base  # assume K > 1
    num = num %/% base       # integer division
    v <- c(remainder, v)
  }
  if(is.null(v)){
    return(0)
  }
  return(v)
}
you can easily obtain getKthPermWithRep function.
One 3x3 matrix with the desired value range can be generated with the following code:
mat <- matrix(c(sample(1:10,3), sample(11:20,3), sample(21:30, 3)), nrow=3)
Furthermore, you can use a for loop to build a list of 20 such matrices as follows (note that this on its own does not guarantee uniqueness across cards):
mat_list <- vector("list", 20)
for (i in 1:20) {
  mat_list[[i]] <- matrix(c(sample(1:10, 3), sample(11:20, 3), sample(21:30, 3)), nrow = 3)
  print(mat_list[[i]])
}
Well OK I may fall on my face here but I propose a checksum (using Excel).
This is a unique signature for each bingo card which will remain invariant if the order of numbers within any column is changed without changing the actual numbers. The formula is
=SUM(10^MOD(A2:A4,10)+2*10^MOD(B2:B4,10)+4*10^MOD(C2:C4,10))
where the bingo numbers for the first card are in A2:C4.
The idea is to generate a 10-digit number for each column, then multiply each by a constant and add them to get the signature.
So here I have generated two random bingo cards using a standard formula from here plus two which are deliberately made to be just permutations of each other.
Then I check if any of the signatures are duplicates using the formula
=MAX(COUNTIF(D5:D20,D5:D20))
which shouldn't give an answer greater than 1.
In the unlikely event that there were duplicates, then you would just press F9 and generate some new cards.
All formulae are array formulae and must be entered with Ctrl+Shift+Enter.
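If you'd rather compute the same kind of signature outside Excel, here is a rough Python sketch of the idea (a hypothetical helper, not part of the original answer); it mirrors the formula above, so any reordering of numbers within a column leaves the signature unchanged:
def card_signature(card):
    """card: 3x3 nested list; column 0 holds 1-10, column 1 holds 11-20, column 2 holds 21-30."""
    weights = (1, 2, 4)          # one constant per column, as in the Excel formula
    sig = 0
    for col, w in enumerate(weights):
        for row in range(3):
            sig += w * 10 ** (card[row][col] % 10)   # 10^MOD(number, 10)
    return sig

# Two cards that are column-wise permutations of each other share a signature
a = [[5, 17, 23], [8, 18, 22], [3, 16, 24]]
b = [[3, 18, 24], [5, 16, 23], [8, 17, 22]]
print(card_signature(a) == card_signature(b))   # True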
Here is an inelegant way to do this: generate all possible combinations and then sample without replacement. (These are permutations, not combinations: order does matter in bingo.)
library(dplyr)
library(tidyr)
library(magrittr)

generate_samples = function(n) {
  first = data_frame(first = (n - 9):n)
  first %>%
    merge(first %>% rename(second = first)) %>%
    merge(first %>% rename(third = first)) %>%
    sample_n(20)
}

suffix = function(df, suffix)
  df %>%
    setNames(names(.) %>%
               paste0(suffix))

generate_samples(10) %>% suffix(10) %>%
  bind_cols(generate_samples(20) %>% suffix(20)) %>%
  bind_cols(generate_samples(30) %>% suffix(30)) %>%
  rowwise %>%
  do(matrix = t(.) %>% matrix(3)) %>%
  use_series(matrix)