I want to create a subset of my data by applying tf.data.Dataset filter operation. I have this data:
data = tf.convert_to_tensor([[1, 2, 1, 1, 5, 5, 9, 12], [1, 2, 3, 8, 4, 5, 9, 12]])
dataset = tf.data.Dataset.from_tensor_slices(data)
I want to retrieve a subset of 'dataset' which corresponds to all elements whose first column is equal to 1. So, result should be:
[[1, 1, 1], [1, 3, 8]] # dtype : dataset
I tried this:
subset = dataset.filter(lambda x: tf.equal(x[0], 1))
But I don't get the correct result, since it sends me back x[0]
Someone to help me ?
I finally resolved it:
a = tf.convert_to_tensor([1, 2, 1, 1, 5, 5, 9, 12])
b = tf.convert_to_tensor([1, 2, 3, 8, 4, 5, 9, 12])
data_set = tf.data.Dataset.from_tensor_slices((a, b))
subset = data_set.filter(lambda x, y: tf.equal(x, 1))
If I create a numpy array, and another to serve as a selective index into it:
>>> x
array([[ 2, 3, 4],
[ 5, 6, 7],
[ 6, 7, 8],
[11, 12, 13]])
>>> nz
array([ True, True, False, True], dtype=bool)
then direct use of nz returns a view of the original array:
>>> x[nz,:]
array([[ 2, 3, 4],
[ 5, 6, 7],
[11, 12, 13]])
>>> x[nz,:] += 2
>>> x
array([[ 4, 5, 6],
[ 7, 8, 9],
[ 6, 7, 8],
[13, 14, 15]])
however, naturally, an assignment makes a copy:
>>> v = x[nz,:]
Any operation on v is on the copy, and has no effect on the original array.
Is there any way to create a named view, from x[nz,:], simply to abbreviate code, or which I can pass around, so operations on the named view will affect only the selected elements of x?
Numpy has masked_array, which might be what you are looking for:
import numpy as np
x = np.asarray([[ 2, 3, 4],[ 5, 6, 7],[ 6, 7, 8],[11, 12, 13]])
nz = np.asarray([ True, True, False, True], dtype=bool)
mx = np.ma.masked_array(x, ~nz.repeat(3)) # True means masked, so "~" is needed
mx += 2
# x changed as well because it is the base of mx
print(x)
print(x is mx.base)
There is a similar question ( Is BigQuery ROLLUP supports grouping by repeated fields ), but it lacks an example.
Consider following code:
SELECT user_segments AS user_segments,
SUM(impressions) AS imps,
SUM(clicks) AS clicks,
FROM [theTable]
GROUP BY ROLLUP (user_segments)
ORDER BY imps DESC
LIMIT 1000
Where theTable contains impressions and clicks of two users (table only has 10 rows, and impressions = 1 on every row):
{"impressions": 1, "user_segments": [0, 1], "user_id": "A0", "clicks": 0}
{"impressions": 1, "user_segments": [1, 2], "user_id": "A1", "clicks": 1}
{"impressions": 1, "user_segments": [0, 1], "user_id": "A0", "clicks": 2}
{"impressions": 1, "user_segments": [1, 2], "user_id": "A1", "clicks": 0}
{"impressions": 1, "user_segments": [0, 1], "user_id": "A0", "clicks": 1}
{"impressions": 1, "user_segments": [1, 2], "user_id": "A1", "clicks": 2}
{"impressions": 1, "user_segments": [0, 1], "user_id": "A0", "clicks": 0}
{"impressions": 1, "user_segments": [1, 2], "user_id": "A1", "clicks": 1}
{"impressions": 1, "user_segments": [0, 1], "user_id": "A0", "clicks": 2}
{"impressions": 1, "user_segments": [1, 2], "user_id": "A1", "clicks": 0}
Query output is:
user_segments imps clicks
null 20 18
1 10 9
2 5 4
0 5 5
But there are only 10 (ten!) impressions in the table. In my opinion correct values for totals would be:
user_segments imps clicks
null 10 9
1 10 9
2 5 4
0 5 5
Is there any way to get to the correct totals w/o a separate query? Thanks!
below is obvious workaround you most likely using already - but still posting just in case
SELECT * FROM (
SELECT
user_segments AS user_segments,
SUM(impressions) AS imps,
SUM(clicks) AS clicks
FROM theTable
GROUP BY user_segments
), (
SELECT
NULL AS user_segments,
SUM(impressions) AS imps,
SUM(clicks) AS clicks
FROM theTable
)
ORDER BY imps DESC, user_segments
What is a generic, efficient algorithm to find the minimal subset of columns in a discrete-valued matrix that makes that rows unique.
For example, consider this matrix (with named columns):
a b c d
2 1 0 0
2 0 0 0
2 1 2 2
1 2 2 2
2 1 1 0
Each row in the matrix is unique. However, if we remove columns a and d we maintain that same property.
I could enumerate all possible combinations of the columns, however, that will quickly become intractable as my matrix grows. Is there a faster, optimal algorithm for doing this?
Actually, my original formulation wasn't very good. This is better as a set cover.
import pulp
# Input data
A = [
[2, 1, 0, 0],
[2, 0, 0, 0],
[2, 1, 2, 2],
[1, 2, 2, 2],
[2, 1, 1, 0]
]
# Preprocess the data a bit.
# Bikj = 1 if Aij != Akj, 0 otherwise
B = []
for i in range(len(A)):
Bi = []
for k in range(len(A)):
Bik = [int(A[i][j] != A[k][j]) for j in range(len(A[i]))]
Bi.append(Bik)
B.append(Bi)
model = pulp.LpProblem('Tim', pulp.LpMinimize)
# Variables turn on and off columns.
x = [pulp.LpVariable('x_%d' % j, cat=pulp.LpBinary) for j in range(len(A[0]))]
# The sum of elementwise absolute difference per element and row.
for i in range(len(A)):
for k in range(i + 1, len(A)):
model += sum(B[i][k][j] * x[j] for j in range(len(A[i]))) >= 1
model.setObjective(pulp.lpSum(x))
assert model.solve() == pulp.LpStatusOptimal
print([xi.value() for xi in x])
An observation: if M has unique rows without both columns i and j, then it has unique rows without column i and without column j independently (in other words, adding a column to a matrix with unique rows cannot make the rows not unique). Therefore, you should be able to find the minimum (not just minimal) solution by using a depth first search.
def has_unique_rows(M):
return len(set([tuple(i) for i in M])) == len(M)
def remove_cols(M, cols):
ret = []
for row in M:
new_row = []
for i in range(len(row)):
if i in cols:
continue
new_row.append(row[i])
ret.append(new_row)
return ret
def minimum_unique_rows(M):
if not has_unique_rows(M):
raise ValueError("M must have unique rows")
cols = list(range(len(M[0])))
def _cols_to_remove(M, removed_cols=(), max_removed_cols=()):
for i in set(cols) - set(removed_cols):
new_removed_cols = removed_cols + (i,)
new_M = remove_cols(M, new_removed_cols)
if not has_unique_rows(new_M):
continue
if len(new_removed_cols) > len(max_removed_cols):
max_removed_cols = new_removed_cols
return _cols_to_remove(M, new_removed_cols, max_removed_cols)
return max_removed_cols
removed_cols = _cols_to_remove(M)
return remove_cols(M, removed_cols), removed_cols
(note that my variable naming is terrible)
Here's it on your matrix
In [172]: rows = [
.....: [2, 1, 0, 0],
.....: [2, 0, 0, 0],
.....: [2, 1, 2, 2],
.....: [1, 2, 2, 2],
.....: [2, 1, 1, 0]
.....: ]
In [173]: minimum_unique_rows(rows)
Out[173]: ([[1, 0], [0, 0], [1, 2], [2, 2], [1, 1]], (0, 3))
I generated a random matrix (using sympy.randMatrix) which is shown below
⎡0 1 0 1 0 1 1⎤
⎢ ⎥
⎢0 1 1 2 0 0 2⎥
⎢ ⎥
⎢1 0 1 1 1 0 0⎥
⎢ ⎥
⎢1 2 2 1 1 2 2⎥
⎢ ⎥
⎢2 0 0 0 0 1 1⎥
⎢ ⎥
⎢2 0 2 2 1 1 0⎥
⎢ ⎥
⎢2 1 2 1 1 0 1⎥
⎢ ⎥
⎢2 2 1 2 1 0 1⎥
⎢ ⎥
⎣2 2 2 1 1 2 1⎦
(note that sorting the rows of M helps a lot in checking these things by hand)
In [224]: M1 = [[0, 1, 0, 1, 0, 1, 1], [0, 1, 1, 2, 0, 0, 2], [1, 0, 1, 1, 1, 0, 0], [1, 2, 2, 1, 1, 2, 2], [2, 0, 0, 0, 0, 1, 1], [2, 0, 2, 2, 1, 1, 0], [2, 1, 2, 1, 1, 0
, 1], [2, 2, 1, 2, 1, 0, 1], [2, 2, 2, 1, 1, 2, 1]]
In [225]: minimum_unique_rows(M1)
Out[225]: ([[1, 1, 1], [2, 0, 2], [1, 0, 0], [1, 2, 2], [0, 1, 1], [2, 1, 0], [1, 0, 1], [2, 0, 1], [1, 2, 1]], (0, 1, 2, 4))
Here's a brute-force check that it's the minimum answer (actually there are quite a few minimums).
In [229]: from itertools import combinations
In [230]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 6)])
[False, False, False, False, False, False, False]
In [231]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 5)])
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]
In [232]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 4)])
[False, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, True, False, False, False, False, False, True, True]
Here is my greedy solution. (Yes, that fails your "optimal" criterion.) Randomly pick a row that can be safely thrown away and throw it away. Keep going until no more such rows. I'm sure the is_valid could be optimized.
rows = [
[2, 1, 0, 0],
[2, 0, 0, 0],
[2, 1, 2, 2],
[1, 2, 2, 2],
[2, 1, 1, 0]
]
col_names = [0, 1, 2, 3]
def is_valid(rows, col_names):
# it's valid if every row has a distinct "signature"
signatures = { tuple(row[col] for col in col_names) for row in rows }
return len(signatures) == len(rows)
import random
def minimal_distinct_columns(rows, col_names):
col_names = col_names[:]
random.shuffle(col_names)
for i, col in enumerate(col_names):
fewer_col_names = col_names[:i] + col_names[(i+1):]
if is_valid(rows, fewer_col_names):
return minimal_distinct_columns(rows, fewer_col_names)
return col_names
Since it's greedy, it doesn't get the best answer always, but it should be relatively speedy (and simple).
Although I'm sure there's better approaches, this fondly reminded me of some Genetic Algorithms stuff I did in the 90s. I wrote up a quick version using R's GA package.
library(GA)
matrix_to_minimize <- matrix(c(2,2,1,1,2,
1,0,1,2,1,
0,0,2,2,1,
0,0,2,2,0), ncol=4)
evaluate <- function(indices) {
if(all(indices == 0)) {
return(0)
}
selected_cols <- matrix_to_minimize[, as.logical(indices), drop=FALSE]
are_unique <- nrow(selected_cols) == nrow(unique(selected_cols))
if (are_unique == FALSE) {
return(0)
}
retval <- (1/sum(as.logical(indices)))
return(retval)
}
ga_results <- ga("binary", evaluate,
nBits=ncol(matrix_to_minimize),
popSize=10 * ncol(matrix_to_minimize), #why not
maxiter=1000,
run=10) #probably want to play with this
print("Best Solution: ")
print(ga_results#solution)
I don't know that it's good or optimal, but I bet it will provide a reasonably good answer in a reasonable amount of time? :)
Googled around a bit and couldn't seem to find anything on this.
Is there an option to access data in a pandas data frame using "not index"?
So something like
df_index = asdf = pandas.MultiIndex(levels=[
['2014-10-19', '2014-10-20', '2014-10-21', '2014-10-22', '2014-10-30'],
[u'after_work', u'all_day', u'breakfast', u'lunch', u'mid_evening']],
labels=[[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4],
[4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 2, 0, 1, 3, 4]],
names=[u'start_date', u'time_group'])
And then I would like to be able to call the following to get everything not in df_index
df.ix[~df_index]
I know you can do it for logical indexing within pandas. Just curious if I could do it using an Index Object
you can use df.drop(df_index, errors="ignore").