scipy sparse A[:,0] = ndarray ValueError - indexing

Setting the first row of a scipy sparse array with A[0,:] = np.ones(n) works fine,
but trying to set the first column with A[:,0] = np.ones(n) raises a ValueError.
Is this a bug in scipy 1.5.2, or have I missed documentation that describes this?
Update 13 Sep: this is a known bug area, see scipy issues/10695
and the newest scipy/sparse/_index.py.
However, I have not tested A[:,0] against that.
""" scipy sparse A[:,0] = ndarray ValueError """
# sparse A[0,:] = ndarray works, sparse A[:,0] = ndarray raises ValueError
# https://stackoverflow.com/search?q=[scipy] [sparse-matrix] ValueError > 100
import numpy as np
from scipy import sparse
# import warnings
# warnings.simplefilter( "ignore", sparse.SparseEfficiencyWarning )
def versionstr():
import numpy, scipy, sys
return "versions: numpy %s scipy %s python %s " % (
numpy.__version__, scipy.__version__ , sys.version.split()[0] )
print( versionstr() ) # 11 Sep 2020: numpy 1.19.2 scipy 1.5.2 python 3.7.6
#...........................................................................
n = 3
ones = np.ones( n )
for A in [
np.eye(n),
sparse.eye( n ).tolil(),
sparse.eye( n ).tocsr(),
sparse.eye( n ).tocsr(),
]:
print( "\n-- A:", type(A).__name__, A.shape )
print( "A[0,:] = ones" )
A[0,:] = ones
print( "A: \n", getattr( A, "A", A )) # dense
# first column = ones --
if sparse.issparse( A ):
A[:,0] = ones.reshape( n, 1 ) # ok
A[:,0] = np.matrix( ones ).T # ok
A[range(n),0] = ones # ok
try:
print( "A[:,0] = ones" )
A[:,0] = ones # A dense ok, A sparse ValueError
except ValueError as msg:
print( "ValueError:", msg )
# ValueError: cannot reshape array of size 9 into shape (3,1)

I'd probably call this a bug, yes - that's not the behavior I would have expected. Under the hood, it looks like this is driven by np.broadcast_arrays(), which is called when the fill array is dense. This function treats 1-D arrays as if they were 2-D (1, N) arrays. Based on the behavior of numpy slicing, I would have expected a 1-D array to be used without broadcasting when its size is correct.
Column slices:
>>> np.broadcast_arrays(np.ones((3,1)), A[:,0].A)
[array([[1.],
       [1.],
       [1.]]), array([[1.],
       [0.],
       [0.]])]
>>> np.broadcast_arrays(np.ones((3,)), A[:,0].A)
[array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]]), array([[1., 1., 1.],
       [0., 0., 0.],
       [0., 0., 0.]])]
>>> np.broadcast_arrays(np.ones((1, 3)), A[:,0].A)
[array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]]), array([[1., 1., 1.],
       [0., 0., 0.],
       [0., 0., 0.]])]
Row slices:
>>> np.broadcast_arrays(np.ones((3, )), A[0, :].A)
[array([[1., 1., 1.]]), array([[1., 0., 0.]])]
>>> np.broadcast_arrays(np.ones((3, 1)), A[0, :].A)
[array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]]), array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])]
>>> np.broadcast_arrays(np.ones((1, 3)), A[0, :].A)
[array([[1., 1., 1.]]), array([[1., 0., 0.]])]
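Until that changes upstream, a workaround consistent with the tests in the question is to give the fill values an explicit (n, 1) column shape, so that np.broadcast_arrays() sees two matching 2-D shapes. A minimal sketch:
import numpy as np
from scipy import sparse

n = 3
A = sparse.eye(n).tolil()
ones = np.ones(n)

# An explicit column avoids the 1-D -> (1, n) broadcasting problem:
A[:, 0] = ones.reshape(n, 1)     # works; ones[:, None] is equivalent
print(A.A)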

Related

Add a row to mnist images dataset

I have to add a row of ones to the MNIST images dataset, which is batched into groups of 32 samples. Here is the code:
(mnist_images, mnist_labels), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis]/255, tf.float32),
     tf.cast(mnist_labels, tf.int64)))
dataset = dataset.shuffle(1000).batch(32)
for images, labels in dataset.take(1):
    print("Logits: ", mnist_model(images[0:1]).numpy())
    b = tf.reshape(images, [784, 32], tf.float32)
    c = tf.concat(b, tf.ones([1, 32], tf.float32), 0)
I get the following error, even though both tensors are dtype float32:
ValueError: Tensor conversion requested dtype int32 for Tensor with dtype float32:
<tf.Tensor: shape=(1, 32), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
      dtype=float32)>
Is there another way to add a row to the images tensor?
It seems like you've just forgotten to use brackets [ ].
use:
c = tf.concat([b,tf.ones([1,32], tf.float32)],0)
instead of :
c = tf.concat(b,tf.ones([1,32], tf.float32),0)
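tf.concat expects a list of tensors as its first argument and the axis as the second. A minimal self-contained sketch of the corrected call, using random stand-in data for the batch and keeping the question's [784, 32] reshape (the trailing tf.float32 passed to tf.reshape is dropped here, since that positional argument is the op name, not a dtype):
import tensorflow as tf

images = tf.random.uniform([32, 28, 28, 1])               # stand-in for one MNIST batch
b = tf.reshape(images, [784, 32])                         # as in the question
c = tf.concat([b, tf.ones([1, 32], tf.float32)], axis=0)  # list of tensors, then axis
print(c.shape)                                            # (785, 32)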

What is the logic of the extra columns in Tensorflow categorical encoding?

I am following the official Tensorflow tutorial for preprocessing layers, and I am not sure I understand why I end up with these extra columns after the categorical encoding.
Here is a stripped-down minimal reproducible example (including the data):
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import pathlib

dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'
tf.keras.utils.get_file('petfinder_mini.zip', dataset_url, extract=True, cache_dir='.')
df = pd.read_csv(csv_file)

# In the original dataset "4" indicates the pet was not adopted.
df['target'] = np.where(df['AdoptionSpeed']==4, 0, 1)
# Drop un-used columns.
df = df.drop(columns=['AdoptionSpeed', 'Description'])

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('target')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    return ds

batch_size = 5
ds = df_to_dataset(df, batch_size=batch_size)
[(train_features, label_batch)] = ds.take(1)

def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    # Create a StringLookup layer which will turn strings into integer indices
    if dtype == 'string':
        index = preprocessing.StringLookup(max_tokens=max_tokens)
    else:
        index = preprocessing.IntegerLookup(max_values=max_tokens)
    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    # Learn the set of possible values and assign them a fixed integer index.
    index.adapt(feature_ds)
    # Create a Discretization for our integer indices.
    encoder = preprocessing.CategoryEncoding(max_tokens=index.vocab_size())
    #encoder = preprocessing.CategoryEncoding(max_tokens=2)
    # Prepare a Dataset that only yields our feature.
    feature_ds = feature_ds.map(index)
    # Learn the space of possible indices.
    encoder.adapt(feature_ds)
    # Apply one-hot encoding to our indices. The lambda function captures the
    # layer so we can use them, or include them in the functional model later.
    return lambda feature: encoder(index(feature))
So, after running
type_col = train_features['Type']
layer = get_category_encoding_layer('Type', ds, 'string')
layer(type_col)
I get a result as:
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]], dtype=float32)>
indeed similar to what is shown in the tutorial.
Notice that this is a binary classification problem (Cat/Dog):
np.unique(type_col)
# array([b'Cat', b'Dog'], dtype=object)
So, what is the logic of the 2 extra columns after the categorical encoding shown in the result above? What do they represent, and why are there 2 (and not, say, 1, or 3, or more)?
(I am perfectly aware that, should I wish for a simple one-hot encoding, I could simply use to_categorical(), but this is not the question here)
As already implied in the question, categorical encoding is somewhat richer than simple one-hot encoding. To see what these two columns represent, it suffices to add a diagnostic print somewhere inside the get_category_encoding_layer() function:
print(index.get_vocabulary())
Then the result of the last commands will be:
['', '[UNK]', 'Dog', 'Cat']
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]], dtype=float32)>
The hint should hopefully be clear: the two extra columns here represent the empty value '' and the unknown (out-of-vocabulary) token '[UNK]', respectively, which could be present in future (unseen) data.
This is actually determined from the default arguments, not of CategoryEncoding, but of the preceding StringLookup; from the docs:
mask_token=''
oov_token='[UNK]'
You can end up with a somewhat tighter encoding (only 1 extra column instead of 2) by asking for oov_token='' instead of oov_token='[UNK]'; replace the call to StringLookup in the get_category_encoding_layer() function with
index = preprocessing.StringLookup(oov_token='', mask_token=None, max_tokens=max_tokens)
after which, the result will be:
['', 'Dog', 'Cat']
<tf.Tensor: shape=(5, 3), dtype=float32, numpy=
array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]], dtype=float32)>
i.e. with only 3 columns (without a dedicated one for '[UNK]'). AFAIK, this is the lowest you can go - attempting to set both mask_token and oov_token to None will result in an error.
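For a standalone sanity check of where those first columns come from, here is a minimal sketch against the TF 2.3-era experimental preprocessing API used above (the defaults differ in newer TF releases), adapting a StringLookup on toy data:
import numpy as np
from tensorflow.keras.layers.experimental import preprocessing

lookup = preprocessing.StringLookup()          # defaults: mask_token='', oov_token='[UNK]'
lookup.adapt(np.array(['Cat', 'Dog', 'Dog']))  # toy data; the real code adapts on the feature dataset
print(lookup.get_vocabulary())                 # ['', '[UNK]', 'Dog', 'Cat']
The learned tokens come after the mask and OOV slots, ordered by frequency.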

How can I convert a sparse representation in a .txt file to a dense matrix in scipy?

I have a .txt file from the Epinions data set which is a sparse representation (i.e.
23 387 5 represents the fact "user 23 has rated item 387 as 5"). From this sparse format I want to convert it to its dense representation in scipy so I can do matrix factorization on it.
I have loaded the file with loadtxt() from numpy and it is a [664824, 3] array. Using scipy.sparse.csr_matrix I converted it to a sparse matrix, and with todense() I was hoping to get the dense format, but I always get the same [664824, 3] matrix. How can I turn it into the original [40163, 139738] dense representation?
import numpy as np
from io import StringIO
from scipy.sparse import csr_matrix

d = np.loadtxt("MFCode/Epinions_dataset.txt")
S = csr_matrix(d)
D = S.todense()
I expected a dense matrix with the shape of [40163,139738]
A small sample of csv-like text:
In [219]: txt = """0 1 3
     ...: 1 0 4
     ...: 2 2 5
     ...: 0 3 6""".splitlines()
In [220]: data = np.loadtxt(txt)
In [221]: data
Out[221]:
array([[0., 1., 3.],
       [1., 0., 4.],
       [2., 2., 5.],
       [0., 3., 6.]])
Using sparse, with the (data, (row, col)) style of input:
In [222]: from scipy import sparse
In [223]: M = sparse.coo_matrix((data[:,2], (data[:,0], data[:,1])), shape=(5,4))
In [224]: M
Out[224]:
<5x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in COOrdinate format>
In [225]: M.A
Out[225]:
array([[0., 3., 0., 6.],
       [4., 0., 0., 0.],
       [0., 0., 5., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
Alternatively fill in a zeros array directly:
In [226]: arr = np.zeros((5,4))
In [227]: arr[data[:,0].astype(int), data[:,1].astype(int)]=data[:,2]
In [228]: arr
Out[228]:
array([[0., 3., 0., 6.],
       [4., 0., 0., 0.],
       [0., 0., 5., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
But beware that np.zeros([40163, 139738]) could raise a memory error - a float64 array of that shape needs about 40163 * 139738 * 8 bytes, roughly 45 GB. M.A (i.e. M.toarray()) could also do that.
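Applying the same (data, (row, col)) pattern to the Epinions file itself might look like the sketch below; it assumes the three loadtxt columns are user id, item id and rating, and that the ids are 0-based, which should be checked against the real file:
import numpy as np
from scipy import sparse

d = np.loadtxt("MFCode/Epinions_dataset.txt")    # shape (664824, 3)
rows = d[:, 0].astype(int)
cols = d[:, 1].astype(int)
R = sparse.coo_matrix((d[:, 2], (rows, cols)), shape=(40163, 139738)).tocsr()
# Keep R sparse for matrix factorization; R.toarray() would need ~45 GB of RAM.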

Tensorflow skflow, Data seems compatible, Valuerror, Shape error

Trying to run a very simple nn classifier with skflow.
classifier = skflow.TensorFlowDNNClassifier(
    hidden_units=[10, 10, 10],
    n_classes=10,
    batch_size=100,
    learning_rate=0.05)
print (data.train.images).shape
print (data.train.labels).shape
classifier.fit(data.train.images, data.train.labels)
The output is:
(73257, 3072)
(73257, 10)
and the error is:
  in assert_same_rank
    "Shapes %s and %s must have the same rank" % (self, other))
ValueError: Shapes (?, 10) and (?, 10, 10) must have the same rank
I do not really understand what the problem is here :(
Perhaps the labels of your dataset are one-hot vectors. (To illustrate, I use the MNIST dataset here; see also https://www.tensorflow.org/versions/r0.7/tutorials/mnist/beginners/index.html)
In [1]: from tensorflow.examples.tutorials.mnist import input_data
In [2]: mnist = input_data.read_data_sets("MNIST_DATA/", one_hot=True)
Extracting MNIST_DATA/train-images-idx3-ubyte.gz
Extracting MNIST_DATA/train-labels-idx1-ubyte.gz
Extracting MNIST_DATA/t10k-images-idx3-ubyte.gz
Extracting MNIST_DATA/t10k-labels-idx1-ubyte.gz
In [3]: mnist.train.labels
Out[3]:
array([[ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  1.,  0.]])
In [4]: mnist.train.labels.shape
Out[4]: (55000, 10)
In [5]: import skflow
In [6]: classifier = skflow.TensorFlowDNNClassifier(hidden_units=[10, 10, 10], n_classes=10, batch_size=100, learning_rate=0.05)
In [7]: classifier.fit(mnist.train.images, mnist.train.labels)
Then I got the same error.
ValueError: Shapes (?, 10) and (?, 10, 10) must have the same rank
But skflow assumes that labels are numbers between 0 and 9 (one_hot=False).
In [5]: mnist = input_data.read_data_sets("MNIST_DATA/", one_hot=False)
Extracting MNIST_DATA/train-images-idx3-ubyte.gz
Extracting MNIST_DATA/train-labels-idx1-ubyte.gz
Extracting MNIST_DATA/t10k-images-idx3-ubyte.gz
Extracting MNIST_DATA/t10k-labels-idx1-ubyte.gz
In [6]: mnist.train.labels
Out[6]: array([7, 3, 4, ..., 5, 6, 8], dtype=uint8)
In [7]: classifier.fit(mnist.train.images, mnist.train.labels)
Step #99, avg. train loss: 2.31658
Step #199, avg. train loss: 1.63361
Out[7]:
TensorFlowDNNClassifier(batch_size=100, class_weight=None,
            config_addon=<skflow.addons.config_addon.ConfigAddon object at 0x11cf7eb90>,
            continue_training=False, hidden_units=[10, 10, 10],
            keep_checkpoint_every_n_hours=10000, learning_rate=0.05,
            max_to_keep=5, n_classes=10, optimizer='SGD', steps=200,
            tf_master='', tf_random_seed=42, verbose=1)
Please give it a try.
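If you would rather keep the one-hot labels you already have (like the (73257, 10) array in the question), a minimal sketch of the alternative is to collapse them to integer class ids before fitting (this assumes data.train.labels is the one-hot array from the question):
import numpy as np

labels = np.argmax(data.train.labels, axis=1)   # (n_samples, 10) one-hot -> (n_samples,) class ids
classifier.fit(data.train.images, labels)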