Weird behavior of zipped tensorflow dataset with random tensors - tensorflow

In the example below (Tensorflow 2.0), we have a dummy tensorflow dataset with three elements. We map a function on it (replace_with_float) that returns a randomly generated value in two copies. As we expect, when we take elements from the dataset, the first and second coordinates have the same value.
Now, we create two "slice" datasets from the first coordinates and the second coordinates, respectively and we zip the two datasets back together. The slicing and the zipping operations seems inverses of each other, so I would expect the resulting dataset to be equivalent to the previous one. However, as we see, now the first and second coordinates are different randomly generated values.
Maybe even more interestingly, if we zip the "same" dataset with itself by
df = tf.data.Dataset.zip((df.map(lambda x, y: x), df.map(lambda x, y: x))), the two coordinates will also have different values.
How can this behavior be explained? Perhaps two different graphs are constructed for the two datasets to be zipped and they are run independently?
import tensorflow as tf
def replace_with_float(element):
rand = tf.random.uniform([])
return (rand, rand)
df = tf.data.Dataset.from_tensor_slices([0, 0, 0])
df = df.map(replace_with_float)
print('Before zipping: ')
for x in df:
print(x[0].numpy(), x[1].numpy())
df = tf.data.Dataset.zip((df.map(lambda x, y: x), df.map(lambda x, y: y)))
print('After zipping: ')
for x in df:
print(x[0].numpy(), x[1].numpy())
Sample output:
Before zipping:
0.08801079 0.08801079
0.638958 0.638958
0.800568 0.800568
After zipping:
0.9676769 0.23045003
0.91056764 0.6551999
0.4647777 0.6758332

The short answer is that datasets don't cache intermediate values between full iterations, unless you explicitly request that using df.cache(), and they don't deduplicate common inputs either.
So in the second loop, the entire pipeline runs again.
Similarly, in the second instance, the two df.map calls cause df to run twice.
Adding a tf.print helps explain what happens:
def replace_with_float(element):
rand = tf.random.uniform([])
tf.print('replacing', element, 'with', rand)
return (rand, rand)
I've also pulled the lambdas on separate lines to avoid the autograph warning:
first = lambda x, y: x
second = lambda x, y: y
df = tf.data.Dataset.zip((df.map(first), df.map(second)))
Before zipping:
replacing 0 with 0.624579549
0.62457955 0.62457955
replacing 0 with 0.471772075
0.47177207 0.47177207
replacing 0 with 0.394005418
0.39400542 0.39400542
After zipping:
replacing 0 with 0.537954807
replacing 0 with 0.558757305
0.5379548 0.5587573
replacing 0 with 0.839109302
replacing 0 with 0.878996611
0.8391093 0.8789966
replacing 0 with 0.0165234804
replacing 0 with 0.534951568
0.01652348 0.53495157
To avoid the duplicate input problem, you can use use a single map call:
swap = lambda x, y: (y, x)
df = df.map(swap)
Or you can use df = df.cache() to avoid both effects:
df = df.map(replace_with_float)
df = df.cache()
Before zipping:
replacing 0 with 0.728474379
0.7284744 0.7284744
replacing 0 with 0.419658661
0.41965866 0.41965866
replacing 0 with 0.911524653
0.91152465 0.91152465
After zipping:
0.7284744 0.7284744
0.41965866 0.41965866
0.91152465 0.91152465

Related

calculating the covariance matrix fast in python with some minor customizing

I have a pandas data frame and I'm trying to find the covariance of the percentage change of each column. For each pair, I want rows with missing values to be dropped, and the percentage be calculated afterwards. That is, I want something like this:
import pandas as pd
import numpy as np
# create dataframe example
N_ROWS, N_COLS = 249, 3535
df = pd.DataFrame(np.random.random((N_ROWS, N_COLS)))
df.iloc[np.random.choice(N_ROWS, N_COLS), np.random.choice(10, 50)] = np.nan
cov_df = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
for col_j in df:
cov = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().cov()
cov_df.loc[col_i, col_j] = cov.iloc[0, 1]
The thing is this is super slow. The code below gives me results that is similar (but not exactly) what I want, but it runs quite fast
df.dropna(how='any', axis=0).pct_change().cov()
I am not sure why the second one runs so much faster. I want to speed up my code in the first, but I can't figure out how.
I have tried using combinations from itertools to avoid repeating the calculation for (col_i, col_j) and (col_j, col_i), and using map from multiprocessing to do the computations in parallel, but it still hasn't finished running after 90+ mintues.
somehow this works fast enough, although I am not sure why
from scipy.stats import pearsonr
corr = np.zeros((x.shape[1], x.shape[1]))
for i in range(x.shape[1]):
for j in range (i + 1, x.shape[1]):
y = x[:, [i, j]]
y = y[~np.isnan(y).any(axis=1)]
y = np.diff(y, axis=0) / y[:-1, :]
if len(y) < 2:
corr[i, j] = np.nan
continue
y = pearsonr(y[:, 0], y[:, 1])[0]
corr[i, j] = y
corr = corr + corr.T
np.fill_diagonal(corr, 1)
This takes within 8 minutes, which is fast enough for my use case.
On the other hand, this has been running for 30 minutes but still isn't done.
corr = pd.DataFrame(index=nav.columns, columns=nav.columns)
for col_i in df:
for col_j in df:
corr_ij = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().corr().iloc[0, 1]
corr.loc[col_i, col_j] = corr_ij
t1 = time.time()
Don't know why this is but anyways the first one is a good enough solution for me now.

Tensorflow indexing into python list during tf.while_loop

I have this annoying problem and i dont know how to solve it.
I am reading in batches of data from a CSV using a dataset reader and am wanting to gather certain columns. The reader returns a tuple of tensors and, depending on which reader i use, columns are either indexed via integer or string.
I can easily enough do a for loop in python and slice the columns I want however I am wanting to do this in a tf.while_loop to take advantage of parallel execution.
This is where my issue lies - the iterator in the while loop is tensor based and i cannot use this to index into my dataset. If i try and evaluate it I get an error about the session not being the same etc etc
How can i use a while loop (or a map function) and have the function be able to index into a python list/dict without evaluating or running the iterator tensor?
Simple example:
some_data = [1,2,3,4,5]
x = tf.constant(0)
y = len(some_data)
c = lambda x: tf.less(x, y)
b = lambda x: some_data[x] <--- You cannot index like this!
tf.while_loop(c, b, [x])
Does this fit your requirement somewhat ? It does nothing apart from print the value.
import tensorflow as tf
from tensorflow.python.framework import tensor_shape
some_data = [11,222,33,4,5,6,7,8]
def func( v ):
print (some_data[v])
return some_data[v]
with tf.Session() as sess:
r = tf.while_loop(
lambda i, v: i < 4,
lambda i, v: [i + 1, tf.py_func(func, [i], [tf.int32])[0]],
[tf.constant(0), tf.constant(2, tf.int32)],
[tensor_shape.unknown_shape(), tensor_shape.unknown_shape()])
r[1].eval()
It prints
11
4
222
33
The order changes everytime but I guess tf.control_dependencies may be useful to control that.

Linear 1D interpolation on multiple datasets using loops

I'm interested in performing Linear interpolation using the scipy.interpolate library. The dataset looks somewhat like this:
DATAFRAME for interpolation between X, Y for different RUNs
I'd like to use this interpolated function to find the missing Y from this dataset:
DATAFRAME to use the interpolation function
The number of runs given here is just 3, but I'm running on a dataset that will run into 1000s of runs. Hence appreciate if you could advise how to use the iterative functions for the interpolation ?
from scipy.interpolate import interp1d
for RUNNumber in range(TotalRuns)
InterpolatedFunction[RUNNumber]=interp1d(X, Y)
As I understand it, you want a separate interpolation function defined for each run. Then you want to apply these functions to a second dataframe. I defined a dataframe df with columns ['X', 'Y', 'RUN'], and a second dataframe, new_df with columns ['X', 'Y_interpolation', 'RUN'].
interpolating_functions = dict()
for run_number in range(1, max_runs):
run_data = df[df['RUN']==run_number][['X', 'Y']]
interpolating_functions[run_number] = interp1d(run_data['X'], run_data['Y'])
Now that we have interpolating functions for each run, we can use them to fill in the 'Y_interpolation' column in a new dataframe. This can be done using the apply function, which takes a function and applies it to each row in a dataframe. So let's define an interpolate function that will take a row of this new df and use the X value and the run number to calculate an interpolated Y value.
def interpolate(row):
int_func = interpolating_functions[row['RUN']]
interp_y = int_func._call_linear([row['X'])[0] #the _call_linear method
#expects and returns an array
return interp_y[0]
Now we just use apply and our defined interpolate function.
new_df['Y_interpolation'] = new_df.apply(interpolate,axis=1)
I'm using pandas version 0.20.3, and this gives me a new_df that looks like this:

Sklearn and Sparse Matrices ValueError

I'm aware similar questions have been asked before, and I've tried everything suggested in them, but I'm still stumped. I have a dataset with 2 columns: The first with vectors representing words stored as a 1x10000 sparse csr matrix (so a matrix in each cell), and the second contains integer ratings which I will use for classification. When I run the following code
for index, row in data.iterrows():
print(row)
print(row[0].shape)
I get the correct output for all the rows
Name: 0, dtype: object
(1, 10000)
Vector (0, 0)\t1.0\n (0, 1)\t1.0\n (0, 2)\t1.0\n ...
Rating 5
Now when I try passing my data in any SKlearn classifier like so:
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vectors"], data["Ratings"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
What am I doing wrong? I've made sure all my sparse matrices are the same size and I've tried reshaping my data in various ways, but with no luck, and the Sklearn classifiers are supposed to be able to deal with csr matrices.
Update: Converting the entire "Vectors" column into one large 2-D matrix did the trick, but for completeness sake the following is the code I used to generate my dataframe if anyone is curious and wants to try solving the original issue. Assume data is a pandas dataframe with rows that look like
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
def vectorize(feature, size):
"""Given a numeric string generated from a vocabulary table return a binary vector representation of
each feature"""
vector = sparse.lil_matrix((1, size))
for number in feature.split(' '):
try:
vector[0, int(number) - 1] = 1
except ValueError:
pass
return vector
def vectorize_dataset(data, vectorize, size):
"""Given a dataset in the appropriate "num num num..." format, a specific vectorization format, and a vector size,
returns the dataset in vectorized form"""
result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
for index, row in data.iterrows():
# All the mixing up of decodings and encoding has made it so that Pandas incorrectly parses EOF chars
if type(row[0]) == type('str'):
result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
result_data.iat[index, 1] = data.loc[index][1]
return result_data

Transform a numpy 3D ndarray to a symmetric form with respect to a specific index

In the case of a matrix mat n x n, i can do the following
sym = 0.5 * (mat + mat.T)
the operation gives the desired result sym[i,j] = sym[j,i]
Suppose we have a 3D array ndarr[i,j,k], where i,j,k 0,1,...n,
then ndarr is n x n x n. The idea is to obtain the following "symmetric" form
nsym[i,j,k] = nsym[j,i,k] using ndarr. I tried this:
import numpy as np
# Generate some random matrix, n = 5
ndarr = np.random.beta(0.1,1,(5,5,5))
# First attempt to symmetrize
sym1 = np.array([0.5*(ndarr[:,:,k]+ndarr[:,:,k].T) for k in range(5)])
The problem here is that sym1[i,j,k] != sym1[j,i,k] as it is required. In fact I obtain sym1[i,j,k] = sym1[i,k,j], symmetric under the exchange of the last two symbols!
# Second attempt
sym2 = 0.5*(ndarr+ndarr.T)
Same problem here and sym2 is symmetric with respect the second index sym2[i,j,k]=sym2[k,j,i].
To resume, the goal is to find a symmetric form for a 3D array with respect to the third index and to preserve the values in the diagonal for the original ndarr[i,i,i].
The problem here is that you're not using the correct transpose:
sym = 0.5 * (ndarr + np.transpose(ndarr, (1, 0, 2)))
By default, np.transpose and the .T property will reverse the order of the axes. In your case, we want to only flip the first two axes: (0,1,2) -> (1,0,2).
EDIT: The reason your first attempt failed is because you were concatenating each symmetrized matrix along the first axis, not the last. It's more clear if you make ndarr with shape (5, 5, 3):
In [16]: sym = np.array([0.5*(ndarr[:,:,k]+ndarr[:,:,k].T) for k in range(3)])
In [17]: sym.shape
Out[17]: (3L, 5L, 5L)
In any case, the version above with np.transpose is cleaner and more efficient.