Most efficient way to find overlap between two tables - sql

Given two tables with the column "title" that is not sorted or unique:
|id|title |
|1 |book_1|
|2 |book_2|
|3 |book_3|
|4 |book_4|
|5 |book_5|
|6 |book_5|
|7 |book_5|
|8 |book_6|
|9 |book_7|
|user_id|book_id|state |title |
|1 |2 |"in progress"|book_2 |
|1 |4 |"completed" |book_4 |
|1 |6 |"completed" |book_5 |
|2 |3 |"completed" |book_3 |
|2 |6 |"completed" |book_5 |
|3 |1 |"completed" |book_1 |
|3 |2 |"completed" |book_2 |
|3 |4 |"completed" |book_4 |
|3 |7 |"in progress"|book_5 |
|3 |8 |"completed" |book_6 |
|3 |9 |"completed" |book_7 |
I'd like to create a binary matrix of users and book titles with state "completed".
[0, 0, 0, 1, 1, 0, 0]
[0, 0, 1, 0, 1, 0, 0]
[1, 1, 0, 1, 0, 1, 1]
The Ruby code below gets the results I'd like, but it has very high algorithmic complexity, so I am hoping to get the same results with SQL.
How much simpler could it be if state were boolean and titles were unique?
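matrix = []
User.all.each do |user|
  # order (not sort) is the ActiveRecord method for sorting a relation
  books = Book.distinct.order(title: :asc).pluck(:title).uniq
  user_books = UserBook.where(user: user, state: "completed").order(title: :asc).pluck(:title)
  # one 0/1 flag per distinct title
  matrix << books.map { |v| user_books.include?(v) ? 1 : 0 }
end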

SQL is not very good at matrices. But you can store the values as (x,y) pairs. You want to include 0 values as well as 1, so the idea is to generate the rows using a cross join and then bring in the existing data:
select b.book_id, u.user_id,
       (case when ub.user_id is not null then 1 else 0 end) as is_completed
from books b cross join
     users u left join
     user_books ub
     on ub.user_id = u.user_id and
        ub.book_id = b.book_id and
        ub.state = 'completed';
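To make the cross-join idea concrete, here is a minimal Python sketch (not SQL) that builds every (user, title) pair and flags the completed ones. The sample rows are copied from the question. The function name `completion_matrix` is made up for illustration, and the user list is derived from user_books because the question does not show a users table.

```python
# Sample rows copied from the question's two tables.
books = [(1, "book_1"), (2, "book_2"), (3, "book_3"), (4, "book_4"),
         (5, "book_5"), (6, "book_5"), (7, "book_5"), (8, "book_6"),
         (9, "book_7")]
user_books = [
    (1, 2, "in progress"), (1, 4, "completed"), (1, 6, "completed"),
    (2, 3, "completed"), (2, 6, "completed"),
    (3, 1, "completed"), (3, 2, "completed"), (3, 4, "completed"),
    (3, 7, "in progress"), (3, 8, "completed"), (3, 9, "completed"),
]


def completion_matrix(books, user_books):
    """Return one 0/1 row per user, with one column per distinct title."""
    title_of = dict(books)                      # book id -> title
    titles = sorted(set(title_of.values()))     # distinct titles, sorted
    # Assumption: users are whoever appears in user_books.
    users = sorted({user_id for user_id, _, _ in user_books})
    completed = {(user_id, title_of[book_id])
                 for user_id, book_id, state in user_books
                 if state == "completed"}
    # "Cross join" users x titles, then flag the pairs that exist.
    return [[1 if (u, t) in completed else 0 for t in titles] for u in users]


for row in completion_matrix(books, user_books):
    print(row)
```

On the sample data this prints the three rows shown in the question.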

You could group UserBook by user_id and use an aggregate function to select the list of books in each group. The entire snippet is as follows:
books = Book.order(title: :asc).pluck(:title).uniq
matrix = []
UserBook.where(state: "completed")
  .group(:user_id)
  .select("string_agg(title, ',') as grouped_name")
  .each do |group|
    user_books = group.grouped_name.split(',')
    matrix << books.map { |title| user_books.include?(title) ? 1 : 0 }
  end
In MySQL you need to replace string_agg(title, ',') with GROUP_CONCAT(title).

Should you consider producing the desired array using Ruby, rather than SQL, first read data from the table Book into an array book:
book = [
  [1, "book_1"], [2, "book_2"], [3, "book_3"], [4, "book_4"],
  [5, "book_5"], [6, "book_5"], [7, "book_5"], [8, "book_6"],
  [9, "book_7"]
]
and data from the table UserBook into an array user_book:
user_book = [
  [1, 2, :in_progress], [1, 4, :completed], [1, 6, :completed],
  [2, 3, :completed], [2, 6, :completed],
  [3, 1, :completed], [3, 2, :completed], [3, 4, :completed], [3, 7, :in_progress],
  [3, 8, :completed], [3, 9, :completed]
]
Note the first element of each element of book, an integer, is the book_id, and the first two elements of each element of user_book, integers, are respectively the user_id and book_id.
You could then construct the desired array as follows:
h = book.map { |book_id, title| [book_id, title[/\d+\z/].to_i - 1] }.to_h
  #=> {1=>0, 2=>1, 3=>2, 4=>3, 5=>4, 6=>4, 7=>4, 8=>5, 9=>6}
cols = h.values.max + 1
  #=> 7
# Assumption: user ids run from 1 to the largest id present in user_book.
rows = user_book.map(&:first).max
  #=> 3
arr = Array.new(rows) { Array.new(cols, 0) }
  #=> [[0, 0, 0, 0, 0, 0, 0],
  #    [0, 0, 0, 0, 0, 0, 0],
  #    [0, 0, 0, 0, 0, 0, 0]]
user_book.each do |user_id, book_id, status|
  arr[user_id-1][h[book_id]] = 1 if status == :completed
end
arr
  #=> [[0, 0, 0, 1, 1, 0, 0],
  #    [0, 0, 1, 0, 1, 0, 0],
  #    [1, 1, 0, 1, 0, 1, 1]]

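In straight SQL:
select * from books join user_books on (books.id = user_books.book_id)
where user_books.state = 'completed';
In Ruby ActiveRecord (state lives on user_books, so the condition is scoped to that table):
Book.joins(:user_books).where(user_books: { state: 'completed' })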


How to use an input layer that also feeds on a previous layer of a neural network?

Let's say I want to predict the winner of a tag-team race, where some drivers usually place higher in certain weather conditions:
Race |Driver | Weather | Time
Dummy1 |D1 | Rain | 2:00
Dummy1 |D2 | Rain | 5:00
Dummy1 |D3 | Rain | 4:50
Dummy2 |D1 | Sunny | 3:00
Dummy2 |D2 | Sunny | 2:50
Dummy2 |D2 | Sunny | 2:30
The logic is that a team composed of D1 and D3 would outperform any other combination on Rain, but wouldn't have the same luck on other weather. With that said, I thought about the following model:
Layer 1 | Layer 2 | Layer 3 (output)
Driver encoding | weather encoding | expected race time
Input of 0 or 1 | sum(Layer 1 * weights | sum(Layer 2 * weights)
| * Input of 0 or 1) |
This means that layer 2 uses layer 1 as well as input values to compute a value.
The reason I want this architecture instead of having every feature on layer 1 is that I want different features to multiply each other instead of their sum.
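As a rough illustration of the table above, the forward pass can be sketched in plain Python: layer 1's weighted sums are multiplied element-wise by the weather inputs before the final weighted sum. The function name `gated_forward` and all weight values are invented for this example; nothing here is trained.

```python
def gated_forward(drivers, weather, w1, w2):
    """Forward pass where the weather inputs gate the driver layer's outputs.

    drivers: 0/1 flags, one per driver
    weather: 0/1 flags, one per weather condition
    w1:      one weight row per weather condition (len(drivers) weights each)
    w2:      one output weight per weather condition
    """
    # sum(Layer 1 * weights): one weighted sum per weather unit
    hidden = [sum(w * d for w, d in zip(row, drivers)) for row in w1]
    # ... * Input of 0 or 1: features multiply instead of adding
    gated = [h * g for h, g in zip(hidden, weather)]
    # sum(Layer 2 * weights): expected race time
    return sum(w * g for w, g in zip(w2, gated))


# Made-up weights: rows are [rain, sunny], columns are [D1, D2, D3].
w1 = [[0.5, 2.0, 1.0],
      [1.0, 0.2, 1.5]]
w2 = [1.0, 1.0]

team = [1, 0, 1]  # D1 and D3
print(gated_forward(team, [1, 0], w1, w2))  # rain
print(gated_forward(team, [0, 1], w1, w2))  # sunny
```

Because the weather flag multiplies the hidden value, only the weights tied to the active weather condition contribute to the output, which is the interaction a plain sum of features cannot express.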
I could not find anything like this, but it is probably just me not knowing the name of this approach. Can someone point me to sources or explain how to replicate this in TensorFlow, PyTorch, or any other library?
It turns out it was actually pretty simple. For anyone who might stumble upon this post and would like to test this approach, here's some rough code:
Racing dataset
# TEAM 1 TEAM 2 "Weather" "WON"
# "A","B","C","D","E", "A","B","C","D","E", W1 W2 W3 (combined times of team 1< combined times of team 2)
dataset=[[ 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1],
[ 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1],
[ 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],
[ 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1],
[ 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
[ 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0],
[ 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
[ 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
[ 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
[ 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
]
inputs=[[x[0:-4],x[-4:-1]] for x in dataset]
results=[[x[-1]] for x in dataset]
Typings to make code more readable
from typing import Iterator
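class InputLayer():
    def __init__(self, inputs, useBias=False):
        self.inputs = inputs      # number of input values fed to this layer
        self.useBias = useBias
    def __str__(self):
        return "Layer of size " + str(self.inputs)
    def __repr__(self) -> str:
        return self.__str__()
class InputLayerValue():
    def __init__(self, values):
        self.values = values      # the 0/1 inputs for one layer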
Actual model
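import numpy as np
import torch
from torch import nn
class MutipleInputModel(nn.Module):
    def __init__(self, input_layers: Iterator[InputLayer], output_size):
        super(MutipleInputModel, self).__init__()
        # NOTE: the layer-creation lines were missing from the post; the ones below are a reconstruction (assumption).
        self.nns = []  # attribute names of the Linear layers, in order
        for i in range(len(input_layers) - 1):
            current, following = input_layers[i], input_layers[i + 1]
            self.nns.append("nn" + str(i))
            # To have hidden layers, you need to either use another model or create and attach multiple Linear models - nn.Linear(next.inputs,next.inputs)
            # models must be directly under self to be found by model.parameters()
            self.__setattr__(self.nns[i], nn.Linear(current.inputs, following.inputs, bias=current.useBias))
        last = input_layers[-1]
        self.nns.append("nn" + str(len(input_layers) - 1))
        self.__setattr__(self.nns[-1], nn.Linear(last.inputs, output_size, bias=last.useBias))
    def forward(self, inputs: Iterator[InputLayerValue]):
        inputsLen = len(inputs[0])  # reconstructed: number of input sets per sample
        if inputsLen != len(self.nns):
            raise Exception("Number of input values provided and input layers must be equal. Provided "+str(inputsLen)+" sets of inputs for a "+str(len(self.nns))+"-input-layer network")
        # Initialize first layer of inputs with ones which will then be multiplied by the actual input values
        lastOutput = torch.ones(len(inputs), len(inputs[0][0].values))                       # Layer 1 Outputs | Layer 2 provided Inputs | Layer 2 actual Inputs
        for i in range(inputsLen):                                                           # lastOutput      | multiplier              | input
            multiplier = torch.from_numpy(np.array([x[i].values for x in inputs])).float()   # 0.2             | 0                       | 0
            input = lastOutput * multiplier                                                  # 1.5             | 1                       | 1.5
            lastOutput = self.__getattr__(self.nns[i])(input)                                # 1.0             | 5                       | 5
        return lastOutput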
# Define hyperparameters
model = MutipleInputModel(input_layers=[InputLayer(len(x)) for x in inputs[0]], output_size=1)
n_epochs = 1000
lr = 0.01  # assumed value: the learning rate was lost from the post
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
for epoch in range(1, n_epochs + 1):
    optimizer.zero_grad()  # Clears existing gradients from previous epoch
    output = model([[InputLayerValue(y) for y in x] for x in inputs])
    loss = criterion(output, torch.from_numpy(np.array(results)).float())
    loss.backward()   # Backpropagate to compute gradients
    optimizer.step()  # Update the weights
    print('Epoch: {}/{}.............'.format(epoch, n_epochs), end=' ')
    print("Loss: {:.4f}".format(loss.item()))
def predict(model, input):
    input = [[InputLayerValue(y) for y in input]]
    out = model(input)
    return nn.Sigmoid()(out[0][0]).item()
print(predict(model,[[1, 1, 0, 0, 0, 0, 0, 1, 1, 0], [1, 0, 0]]))
print(predict(model,[[1, 1, 0, 0, 0, 0, 0, 1, 1, 0], [0, 1, 0]]))
print(predict(model,[[1, 1, 0, 0, 0, 0, 0, 1, 1, 0], [0, 0, 1]]))
This is a really basic implementation, but could easily be modified to have hidden layers.
Clearly needs further testing to see if it is actually better than a traditional NN, but I would say it is great for NN explainability.

Split row into multiple rows to limit length of array in column (spark / scala)

I have a dataframe that looks like this:
|id | items |
| 1|[a, b, .... x, y, z]|
| 1|[q, z, .... x, b, 5]|
| 2|[q, z, .... x, b, 5]|
I want to split the rows so that the array in the items column is at most length 20. If an array has length greater than 20, I would want to make new rows and split the array up so that each array is of length 20 or less. So for the first row in my example dataframe, if we assume the length is 10 and I want at most length 3 for each row, I would like for it to be split like this:
|id | items |
| 1|[a, b, c] |
| 1|[z, y, z] |
| 1|[e, f, g] |
| 1|[q] |
Ideally, every row should have length 3, except the last one when the array length is not evenly divisible by the maximum desired length. Note that the id column is not unique.
Using higher-order functions transform + filter along with slice, you can split the array into sub arrays of size 20 then explode it:
import org.apache.spark.sql.functions.{explode, expr}

val l = 20
val df1 = df.withColumn(
  "items",
  explode(
    expr(s"filter(transform(items, (x,i)-> IF(i%$l=0, slice(items,i+1,$l), null)), x-> x is not null)")
  )
)
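For readers less familiar with Spark's higher-order functions, the expression can be read as: at every index that is a multiple of l, take a slice of length l starting there, and drop everything else. A rough Python equivalent of that logic follows. The name `chunk` is hypothetical, and Python slices are 0-based where Spark's slice is 1-based.

```python
def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` elements."""
    # transform: at index i, emit a slice if i is a multiple of size, else None
    candidates = [
        items[i:i + size] if i % size == 0 else None
        for i in range(len(items))
    ]
    # filter: keep only the real slices
    return [c for c in candidates if c is not None]


print(chunk(list("abcdefghij"), 3))
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i'], ['j']]
```

In the Spark version, explode then turns each of these pieces into its own row.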
You could try this:
import pandas as pd

max_item_length = 3
df = pd.DataFrame(
    {"fake_index": [1, 2, 3],
     "items": [["a", "b", "c", "d", "e"], ["f", "g", "h", "i", "j"], ["k", "l"]]}
)
print(df)
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list instead.
rows = []
for i in df.index:
    fake_index, items = int(df.iloc[i, 0]), df.iloc[i, 1]
    if len(items) > max_item_length:
        # Split into the first max_item_length items and the remainder.
        # Note: the remainder is not split again, so this handles at most two pieces.
        rows.append({"fake_index": fake_index, "items": items[:max_item_length]})
        rows.append({"fake_index": fake_index, "items": items[max_item_length:]})
    else:
        rows.append({"fake_index": fake_index, "items": items})
df = pd.DataFrame(rows)
print(df)
fake_index items
0 1 [a, b, c, d, e]
1 2 [f, g, h, i, j]
2 3 [k, l]
fake_index items
0 1 [a, b, c]
1 1 [d, e]
2 2 [f, g, h]
3 2 [i, j]
4 3 [k, l]
Since this requires a more complex transformation, I've used datasets. This might not be as performant, but it will get what you want.
Creating some sample data to mimic your data.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}

val arrayData = Seq(
  Row(1, List(1, 2, 3, 4, 5, 6, 7)),
  Row(2, List(1, 2, 3, 4)),
  Row(3, List(1, 2)),
  Row(4, List(1, 2, 3))
)
val arraySchema = new StructType().add("id", IntegerType).add("values", ArrayType(IntegerType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData), arraySchema)
|id |values |
|1 |[1, 2, 3, 4, 5, 6, 7]|
|2 |[1, 2, 3, 4] |
|3 |[1, 2] |
|4 |[1, 2, 3] |
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// encoder for custom type of transformation
implicit val encoder = ExpressionEncoder[(Int, Array[Array[Int]])]
// Here we are using a sliding window of size 3 and step 3.
// This can be made into a generic function for a window of size k.
val df2 = df.map(r => {
  val id = r.getInt(0)
  val a = r.getSeq[Int](1).toArray
  val arrays = a.sliding(3, 3).toArray
  (id, arrays)
})
|_1 |_2 |
|1 |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6), WrappedArray(7)]|
|2 |[WrappedArray(1, 2, 3), WrappedArray(4)] |
|3 |[WrappedArray(1, 2)] |
|4 |[WrappedArray(1, 2, 3)] |
val df3 = df2
.withColumnRenamed("_1", "id")
.withColumnRenamed("_2", "values")
|id |values |
|1 |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6), WrappedArray(7)]|
|2 |[WrappedArray(1, 2, 3), WrappedArray(4)] |
|3 |[WrappedArray(1, 2)] |
|4 |[WrappedArray(1, 2, 3)] |
Use explode
Explode will create a new row for each array entry in the second column.
val df4 = df3.withColumn("values", functions.explode($"values"))
|id |values |
|1 |[1, 2, 3]|
|1 |[4, 5, 6]|
|1 |[7] |
|2 |[1, 2, 3]|
|2 |[4] |
|3 |[1, 2] |
|4 |[1, 2, 3]|
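The two steps above (non-overlapping windows, then one row per window) can be checked outside Spark with a small Python sketch on the same sample data. The function name `split_rows` is made up for this illustration.

```python
def split_rows(rows, size):
    """Turn (id, values) rows into one (id, window) row per window of `size`."""
    out = []
    for row_id, values in rows:
        # like sliding(size, size): consecutive, non-overlapping windows
        windows = [values[i:i + size] for i in range(0, len(values), size)]
        # like explode: one output row per window, repeating the id
        out.extend((row_id, window) for window in windows)
    return out


sample = [
    (1, [1, 2, 3, 4, 5, 6, 7]),
    (2, [1, 2, 3, 4]),
    (3, [1, 2]),
    (4, [1, 2, 3]),
]
for row in split_rows(sample, 3):
    print(row)
```

The printed rows match the final table above, row for row.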
This approach is not without limitations.
Primarily, it will not be as performant on larger datasets since this code is no longer using dataframe built-in optimizations. However, the dataframe API might require the use of window functions, which can also have limited performance based on the size of the data. If it's possible to alter this data at the source, this would be recommended.
This approach also requires defining an encoder for something more complex. If the data schema changes, then different encoders will have to be used.

Reading values within pandas.groupby

I have a dataframe like below
name item
0 Jack A
1 Sarah B
2 Ross A
3 Sean C
4 Jack C
5 Ross B
What I would like to do is produce a dictionary that connects people to the products they are related to.
{Jack: [1, 0, 1], Sarah: [0, 1, 0], Ross:[1, 1, 0], Sean:[0, 0, 1]}
I feel like this should be done fairly easily using pandas.groupby
I have tried looping through the dataframe, but I have >1E7 entries, and looping does not look very efficient.
Check with crosstab and to_dict:
pd.crosstab(df['item'], df['name']).to_dict('list')
{'Jack': [1, 0, 1], 'Ross': [1, 1, 0], 'Sarah': [0, 1, 0], 'Sean': [0, 0, 1]}
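Another interesting option is using str.get_dummies:
# if you need counts
df.set_index('item')['name'].str.get_dummies().groupby(level=0).sum().to_dict('list')
# if you want to record boolean indicators
df.set_index('item')['name'].str.get_dummies().groupby(level=0).max().to_dict('list')
# {'Jack': [1, 0, 1], 'Ross': [1, 1, 0], 'Sarah': [0, 1, 0], 'Sean': [0, 0, 1]}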

Selecting only non-NULL keys in a Postgres JSONB field

I have a postgres 9.6 table with a JSONB column
> SELECT id, data FROM my_table ORDER BY id LIMIT 4;
id | data
1 | {"a": [1, 7], "b": null, "c": [8]}
2 | {"a": [2, 9], "b": [1], "c": null}
3 | {"a": [8, 9], "b": null, "c": [3, 4]}
4 | {}
As you can see, some JSON keys have null values.
I'd like to exclude these - is there an easy way to SELECT only the non-null key-value pairs to produce:
id | data
1 | {"a": [1, 7], "c": [8]}
2 | {"a": [2, 9], "b": [1]}
3 | {"a": [8, 9], "c": [3, 4]}
4 | {}
You can use jsonb_strip_nulls()
select id, jsonb_strip_nulls(data) as data
from my_table;
Note that this function would not remove null values inside the arrays.
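To illustrate that last point, here is a small Python analogue of the behaviour described above. It is not how Postgres implements the function; the name `strip_nulls` is made up, and Python's None stands in for JSON null. Object keys with null values are dropped recursively, while nulls that are array elements are left alone.

```python
def strip_nulls(value):
    """Drop dict keys whose value is None, recursively; keep None inside lists."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        # Array elements are kept as-is, but nested objects are still cleaned.
        return [strip_nulls(v) for v in value]
    return value


print(strip_nulls({"a": [1, 7], "b": None, "c": [8]}))
# {'a': [1, 7], 'c': [8]}
print(strip_nulls({"a": [1, None], "b": {"x": None, "y": 2}}))
# {'a': [1, None], 'b': {'y': 2}}
```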

Finding minimal subset of columns that make rows in a matrix unique

What is a generic, efficient algorithm to find the minimal subset of columns in a discrete-valued matrix that makes the rows unique?
For example, consider this matrix (with named columns):
a b c d
2 1 0 0
2 0 0 0
2 1 2 2
1 2 2 2
2 1 1 0
Each row in the matrix is unique. However, if we remove columns a and d we maintain that same property.
I could enumerate all possible combinations of the columns, however, that will quickly become intractable as my matrix grows. Is there a faster, optimal algorithm for doing this?
Actually, my original formulation wasn't very good. This is better as a set cover.
import pulp
# Input data
A = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0]
]
# Preprocess the data a bit.
# Bikj = 1 if Aij != Akj, 0 otherwise
B = []
for i in range(len(A)):
    Bi = []
    for k in range(len(A)):
        Bik = [int(A[i][j] != A[k][j]) for j in range(len(A[i]))]
        Bi.append(Bik)
    B.append(Bi)
model = pulp.LpProblem('Tim', pulp.LpMinimize)
# Variables turn on and off columns.
x = [pulp.LpVariable('x_%d' % j, cat=pulp.LpBinary) for j in range(len(A[0]))]
# Objective: keep as few columns as possible.
# (This line was missing from the post and is reconstructed; without it the model has nothing to minimize.)
model += pulp.lpSum(x)
# The sum of elementwise absolute difference per element and row:
# every pair of rows must differ in at least one selected column.
for i in range(len(A)):
    for k in range(i + 1, len(A)):
        model += sum(B[i][k][j] * x[j] for j in range(len(A[i]))) >= 1
assert model.solve() == pulp.LpStatusOptimal
print([xi.value() for xi in x])
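The set-cover view can be checked on small inputs without a solver. The sketch below is a brute-force baseline, not the ILP above: for each pair of rows it records the columns that tell them apart, then looks for the smallest column sets that hit every pair. The function names are invented for illustration, and the search is exponential in the number of columns, so it is only meant for sanity checks.

```python
from itertools import combinations


def distinguishing_columns(matrix):
    """For each pair of rows, the set of column indices where they differ."""
    return [
        {j for j, (a, b) in enumerate(zip(r1, r2)) if a != b}
        for r1, r2 in combinations(matrix, 2)
    ]


def minimum_column_subsets(matrix):
    """All smallest column subsets that keep every row distinct."""
    pair_sets = distinguishing_columns(matrix)
    n_cols = len(matrix[0])
    for size in range(n_cols + 1):
        hits = [
            cols for cols in combinations(range(n_cols), size)
            # set cover constraint: every pair needs at least one chosen column
            if all(s & set(cols) for s in pair_sets)
        ]
        if hits:
            return hits
    return []  # rows are not unique even when every column is kept


A = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0],
]
print(minimum_column_subsets(A))  # [(1, 2)], i.e. columns b and c
```

Each per-pair set corresponds to one `>= 1` constraint in the ILP, so the two approaches should agree on the size of the optimum.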
An observation: if M has unique rows without both columns i and j, then it has unique rows without column i and without column j independently (in other words, adding a column to a matrix with unique rows cannot make the rows not unique). Therefore, you should be able to find the minimum (not just minimal) solution by using a depth first search.
def has_unique_rows(M):
    return len(set([tuple(i) for i in M])) == len(M)
def remove_cols(M, cols):
    ret = []
    for row in M:
        new_row = []
        for i in range(len(row)):
            if i in cols:
                continue  # skip removed columns
            new_row.append(row[i])
        ret.append(new_row)
    return ret
def minimum_unique_rows(M):
    if not has_unique_rows(M):
        raise ValueError("M must have unique rows")
    cols = list(range(len(M[0])))
    def _cols_to_remove(M, removed_cols=(), max_removed_cols=()):
        for i in set(cols) - set(removed_cols):
            new_removed_cols = removed_cols + (i,)
            new_M = remove_cols(M, new_removed_cols)
            if not has_unique_rows(new_M):
                continue  # removing column i breaks uniqueness, try the next one
            if len(new_removed_cols) > len(max_removed_cols):
                max_removed_cols = new_removed_cols
            return _cols_to_remove(M, new_removed_cols, max_removed_cols)
        return max_removed_cols
    removed_cols = _cols_to_remove(M)
    return remove_cols(M, removed_cols), removed_cols
(note that my variable naming is terrible)
Here's it on your matrix
In [172]: rows = [
.....: [2, 1, 0, 0],
.....: [2, 0, 0, 0],
.....: [2, 1, 2, 2],
.....: [1, 2, 2, 2],
.....: [2, 1, 1, 0]
.....: ]
In [173]: minimum_unique_rows(rows)
Out[173]: ([[1, 0], [0, 0], [1, 2], [2, 2], [1, 1]], (0, 3))
I generated a random matrix (using sympy.randMatrix) which is shown below
⎡0 1 0 1 0 1 1⎤
⎢ ⎥
⎢0 1 1 2 0 0 2⎥
⎢ ⎥
⎢1 0 1 1 1 0 0⎥
⎢ ⎥
⎢1 2 2 1 1 2 2⎥
⎢ ⎥
⎢2 0 0 0 0 1 1⎥
⎢ ⎥
⎢2 0 2 2 1 1 0⎥
⎢ ⎥
⎢2 1 2 1 1 0 1⎥
⎢ ⎥
⎢2 2 1 2 1 0 1⎥
⎢ ⎥
⎣2 2 2 1 1 2 1⎦
(note that sorting the rows of M helps a lot in checking these things by hand)
In [224]: M1 = [[0, 1, 0, 1, 0, 1, 1], [0, 1, 1, 2, 0, 0, 2], [1, 0, 1, 1, 1, 0, 0], [1, 2, 2, 1, 1, 2, 2], [2, 0, 0, 0, 0, 1, 1], [2, 0, 2, 2, 1, 1, 0], [2, 1, 2, 1, 1, 0, 1], [2, 2, 1, 2, 1, 0, 1], [2, 2, 2, 1, 1, 2, 1]]
In [225]: minimum_unique_rows(M1)
Out[225]: ([[1, 1, 1], [2, 0, 2], [1, 0, 0], [1, 2, 2], [0, 1, 1], [2, 1, 0], [1, 0, 1], [2, 0, 1], [1, 2, 1]], (0, 1, 2, 4))
Here's a brute-force check that it's the minimum answer (actually there are quite a few minimums).
In [229]: from itertools import combinations
In [230]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 6)])
[False, False, False, False, False, False, False]
In [231]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 5)])
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]
In [232]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 4)])
[False, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, True, False, False, False, False, False, True, True]
Here is my greedy solution. (Yes, that fails your "optimal" criterion.) Randomly pick a column that can be safely thrown away and throw it away. Keep going until there are no more such columns. I'm sure is_valid could be optimized.
rows = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0]
]
col_names = [0, 1, 2, 3]
def is_valid(rows, col_names):
    # it's valid if every row has a distinct "signature"
    signatures = { tuple(row[col] for col in col_names) for row in rows }
    return len(signatures) == len(rows)
import random
def minimal_distinct_columns(rows, col_names):
    col_names = col_names[:]
    # Shuffle so that a random removable column is tried first
    # (reconstructed: the post imports random but the call itself was lost).
    random.shuffle(col_names)
    for i, col in enumerate(col_names):
        fewer_col_names = col_names[:i] + col_names[(i+1):]
        if is_valid(rows, fewer_col_names):
            return minimal_distinct_columns(rows, fewer_col_names)
    return col_names
Since it's greedy, it doesn't get the best answer always, but it should be relatively speedy (and simple).
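Although I'm sure there are better approaches, this fondly reminded me of some Genetic Algorithms stuff I did in the 90s. I wrote up a quick version using R's GA package.
library(GA)
# The two middle columns were lost from the post; they are filled in here
# from the question's matrix (an assumption).
matrix_to_minimize <- matrix(c(2,2,1,1,2,
                               1,0,1,2,1,
                               0,0,2,2,1,
                               0,0,2,2,0), ncol=4)
evaluate <- function(indices) {
  # The return(0) lines and the closing braces are reconstructed (assumption).
  if(all(indices == 0)) {
    return(0)
  }
  selected_cols <- matrix_to_minimize[, as.logical(indices), drop=FALSE]
  are_unique <- nrow(selected_cols) == nrow(unique(selected_cols))
  if (are_unique == FALSE) {
    return(0)
  }
  retval <- (1/sum(as.logical(indices)))
  return(retval)
}
ga_results <- ga("binary", evaluate,
                 nBits=ncol(matrix_to_minimize), # reconstructed: one bit per column
                 popSize=10 * ncol(matrix_to_minimize), #why not
                 run=10) #probably want to play with this
print("Best Solution: ")
print(ga_results@solution)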
I don't know that it's good or optimal, but I bet it will provide a reasonably good answer in a reasonable amount of time? :)