Collect Spark dataframe into NumPy matrix

I've used spark to compute the PCA on a large dataset, now I have a spark dataframe with the following structure:
Row(pcaFeatures=DenseVector(elem1, elem2, ...))
where elem1, ..., elemN are doubles. I would like to transform it into a NumPy matrix. Right now I'm using the following code:
numpymatrix = datapca.toPandas().as_matrix()
but I get an array with elements of type object instead of a numeric matrix. Is there a way to get the matrix I need?

Your request makes sense only if the resulting data can fit into your main memory (i.e. you can safely use collect()); on the other hand, if this is the case, admittedly you have absolutely no reason to use Spark at all.
Anyway, given this assumption, here is a general way to convert a single-column features Spark dataframe (Rows of DenseVector) to a NumPy array using toy data:
spark.version
# u'2.2.0'
from pyspark.ml.linalg import Vectors
import numpy as np
# toy data:
df = spark.createDataFrame([(Vectors.dense([0, 45, 63, 0, 0, 0, 0]),),
                            (Vectors.dense([0, 0, 0, 85, 0, 69, 0]),),
                            (Vectors.dense([0, 89, 56, 0, 0, 0, 0]),),
                           ], ['features'])
dd = df.collect()
dd
# [Row(features=DenseVector([0.0, 45.0, 63.0, 0.0, 0.0, 0.0, 0.0])),
# Row(features=DenseVector([0.0, 0.0, 0.0, 85.0, 0.0, 69.0, 0.0])),
# Row(features=DenseVector([0.0, 89.0, 56.0, 0.0, 0.0, 0.0, 0.0]))]
np.asarray([x[0] for x in dd])
# array([[ 0., 45., 63.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0., 85.,  0., 69.,  0.],
#        [ 0., 89., 56.,  0.,  0.,  0.,  0.]])
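If you prefer to keep the toPandas() route from the question, a minimal sketch (assuming the datapca dataframe with its single 'pcaFeatures' column of DenseVectors, and that the data fits in driver memory) is to convert each DenseVector to a NumPy array first and then stack the rows:
import numpy as np

# Sketch only: `datapca` is the Spark dataframe from the question.
pdf = datapca.toPandas()
numpymatrix = np.vstack(pdf['pcaFeatures'].apply(lambda v: v.toArray()).values)
Either way the result is a plain float64 array rather than an object-dtype column.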

Related

Numpy matrix is resetting the values inside it

I am implementing a co-occurrence matrix for an image in order to detect the edges of the image through changes in brightness. I made a 256x256 NumPy matrix to store the co-occurrences, and then I wrote a function that sets an occurrence count to 0 if the change between the two brightness values is less than a certain threshold, e.g. 30; i.e. if the difference between the row index i and the column index j of the matrix is less than 30, the value inside that cell is set to 0.
Here is the function; it takes the co-occurrence matrix and sets those values to 0.
def nullify(matrix):
    for i in range(0, matrix.shape[0]):
        for j in range(0, matrix.shape[1]):
            if abs(i - j) < 30:
                matrix[i, j] = 0
    return matrix
But for some reason it turns the entire matrix into 0's; the function works perfectly when I'm using a smaller matrix like a 3x3.
This is the function that I use to calculate the co-occurrence matrices:
def calculateCooccurrence(im):
    Horizontal = np.zeros((256, 256))
    for i in range(0, im.size[0]):
        for j in range(0, im.size[1]-1):
            pixelRGB = im.getpixel((i, j))
            R, G, B = pixelRGB
            brightness = int(sum([R, G, B])/3)
            pixelRGB1 = im.getpixel((i, j+1))
            R1, G1, B1 = pixelRGB
            brightness1 = int(sum([R1, G1, B1])/3)
            Horizontal[brightness, brightness1] += 1
    Vertical = np.zeros((256, 256))
    for i in range(0, im.size[0]-1):
        for j in range(0, im.size[1]):
            pixelRGB = im.getpixel((i, j))
            R, G, B = pixelRGB
            brightness = int(sum([R, G, B])/3)
            pixelRGB1 = im.getpixel((i+1, j))
            R1, G1, B1 = pixelRGB
            brightness1 = int(sum([R1, G1, B1])/3)
            Vertical[brightness, brightness1] += 1
    return Horizontal, Vertical
And this is what I do exactly
horiz,vertic=calculateCooccurrence(im2)
horizon=nullify(horiz)
There are some things to point out about your code. I can't tell why your entire matrix turns into zeros, but these points should help you:
This might be due to formatting on Stack Overflow, but your matrix is returned after the first iteration of the i-loop.
You are not actually working on the values of the matrix; you iterate over the index ranges 0...256, which sets all cells close to the diagonal to 0. Based on your text, where you say that you want to detect edges, I am not sure this is what you actually want to do.
Instead of creating a variable named difference, you can also put the expression directly in the if-statement: if abs(i-j) < 30:
Edit: Found the problem: your code works as intended, and that is the problem. All the elements are on the diagonal. I just used a test image myself and found that np.sum(matrix) and np.trace(matrix) returned the same result. So when your code eliminates all elements along the diagonal, it turns all elements to 0.
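A minimal sketch of that diagnostic check (assuming horiz is the matrix returned by calculateCooccurrence above):
import numpy as np

# Diagnostic sketch: if np.sum equals np.trace, every co-occurrence sits on the
# main diagonal, and nullify() (which zeros a band around the diagonal) wipes everything.
print(np.sum(horiz), np.trace(horiz))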
Your code seems to work properly for Python 3.7.7 and NumPy 1.18.1:
>>> import numpy as np
>>> matrix = np.ones((256, 256))
>>> def nullify(matrix):
...     for i in range(0, matrix.shape[0]):
...         for j in range(0, matrix.shape[1]):
...             if abs(i - j) < 30:
...                 matrix[i, j] = 0
...     return matrix
...
>>> nullify(matrix)
array([[0., 0., 0., ..., 1., 1., 1.],
       [0., 0., 0., ..., 1., 1., 1.],
       [0., 0., 0., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.]])
You can see that the elements far away from the diagonal remain unchanged. You should provide a minimal working example of what goes wrong in your code.
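As an aside (not part of the original answers), the double loop in nullify() can also be expressed with vectorized NumPy indexing; a minimal sketch:
import numpy as np

def nullify_vectorized(matrix, threshold=30):
    """Zero out all cells whose row and column indices differ by less than `threshold`."""
    i, j = np.indices(matrix.shape)
    matrix[np.abs(i - j) < threshold] = 0
    return matrix

# Usage: equivalent to nullify(matrix) from the question, but without Python loops.
m = nullify_vectorized(np.ones((256, 256)))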

How to write a custom kernel in GPflow for the covariance matrix RBF plus noise only on the main block diagonal?

The required covariance matrix is an RBF kernel over time plus noise only on the main block diagonal (given as an image in the original question), where t is 1D time and k = {0, 1}.
A sample from the kernel should look like the plot in the original question, with the orange sequence corresponding to k=0 and the blue one to k=1.
It sounds like you're looking for a kernel for different discrete outputs. You can achieve this in GPflow for example with the Coregion kernel, for which there is a tutorial notebook.
To construct a coregionalization kernel that is block-diagonal (all off-block-diagonal entries are zero), you can set rank=0. Note that you need to explicitly specify which kernel should act on which dimensions:
import gpflow
k_time = gpflow.kernels.SquaredExponential(active_dims=[0])
k_coreg = gpflow.kernels.Coregion(output_dim=2, rank=0, active_dims=[1])
You can combine them with * as in the notebook, or with + as specified in the question:
k = k_time + k_coreg
You can see that the k_coreg term is block-diagonal as you specified: Evaluating
test_inputs = np.array([
    [0.1, 0.0],
    [0.5, 0.0],
    [0.7, 1.0],
    [0.1, 1.0],
])
k_coreg(test_inputs)
returns
<tf.Tensor: shape=(4, 4), dtype=float64, numpy=
array([[1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [0., 0., 1., 1.],
       [0., 0., 1., 1.]])>
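For completeness (not shown in the original answer), the combined kernel k = k_time + k_coreg can be evaluated on the same points; the squared-exponential similarities in time are then added on top of the block pattern:
# Sketch: evaluate the summed kernel defined above on the same test inputs.
# k(test_inputs) returns the full 4x4 covariance as a tf.Tensor.
K_full = k(test_inputs)
print(K_full.numpy())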
And you can get samples as in the graph in the question by running
import numpy as np

num_inputs = 51
num_outputs = 2
X = np.linspace(0, 5, num_inputs)
Q = np.arange(num_outputs)
XX, QQ = np.meshgrid(X, Q, indexing='ij')
pts = np.c_[XX.flatten(), QQ.flatten()]
K = k(pts)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(K)))
num_samples = 3
v = np.random.randn(len(L), num_samples)
f = L @ v

import matplotlib.pyplot as plt
for i in range(num_samples):
    plt.plot(X, f[:, i].reshape(num_inputs, num_outputs))
In GPflow, you can construct this kernel using a sum kernel consisting of a Squared Exponential (RBF) and a White kernel.
import gpflow
kernel = gpflow.kernels.SquaredExponential() + gpflow.kernels.White()
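A quick check (not in the original answer, assuming the kernel object defined just above): on a 1-D grid of time points, the White term only contributes variance on the diagonal of the covariance matrix:
import numpy as np

# Sketch: evaluate SquaredExponential() + White() on a small 1-D grid.
X = np.linspace(0.0, 5.0, 6).reshape(-1, 1)
K = kernel(X).numpy()   # SE covariance everywhere, plus White noise added on the diagonal
print(np.round(K, 3))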

What is the logic of the extra columns in Tensorflow categorical encoding?

I am following the official Tensorflow tutorial for preprocessing layers, and I am not sure I get why I end up getting these extra columns after the categorical encoding.
Here is a stripped-down minimal reproducible example (including the data):
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import pathlib
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'
tf.keras.utils.get_file('petfinder_mini.zip', dataset_url, extract=True, cache_dir='.')
df = pd.read_csv(csv_file)
# In the original dataset "4" indicates the pet was not adopted.
df['target'] = np.where(df['AdoptionSpeed']==4, 0, 1)
# Drop un-used columns.
df = df.drop(columns=['AdoptionSpeed', 'Description'])
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('target')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    return ds
batch_size = 5
ds = df_to_dataset(df, batch_size=batch_size)
[(train_features, label_batch)] = ds.take(1)
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    # Create a StringLookup layer which will turn strings into integer indices
    if dtype == 'string':
        index = preprocessing.StringLookup(max_tokens=max_tokens)
    else:
        index = preprocessing.IntegerLookup(max_values=max_tokens)
    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    # Learn the set of possible values and assign them a fixed integer index.
    index.adapt(feature_ds)
    # Create a Discretization for our integer indices.
    encoder = preprocessing.CategoryEncoding(max_tokens=index.vocab_size())
    # encoder = preprocessing.CategoryEncoding(max_tokens=2)
    # Prepare a Dataset that only yields our feature.
    feature_ds = feature_ds.map(index)
    # Learn the space of possible indices.
    encoder.adapt(feature_ds)
    # Apply one-hot encoding to our indices. The lambda function captures the
    # layer so we can use them, or include them in the functional model later.
    return lambda feature: encoder(index(feature))
So, after running
type_col = train_features['Type']
layer = get_category_encoding_layer('Type', ds, 'string')
layer(type_col)
I get a result as:
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]], dtype=float32)>
similar to what is shown in the tutorial indeed.
Notice that this is a binary classification problem (Cat/Dog):
np.unique(type_col)
# array([b'Cat', b'Dog'], dtype=object)
So, what is the logic of the 2 extra columns after the categorical encoding shown in the result above? What do they represent, and why are there 2 (and not, say, 1, or 3, or more)?
(I am perfectly aware that, should I wish for a simple one-hot encoding, I could simply use to_categorical(), but this is not the question here)
As already implied in the question, categorical encoding is somewhat richer than simple one-hot encoding. To see what these two columns represent, it suffices to add a diagnostic print somewhere inside the get_category_encoding_layer() function:
print(index.get_vocabulary())
Then the result of the last commands will be:
['', '[UNK]', 'Dog', 'Cat']
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]], dtype=float32)>
The hint should hopefully be clear: the extra two columns here represent the empty value '' and unknown ones '[UNK]', respectively, which could be present in future (unseen) data.
This is actually determined from the default arguments, not of CategoryEncoding, but of the preceding StringLookup; from the docs:
mask_token=''
oov_token='[UNK]'
You can end up with a somewhat tighter encoding (only 1 extra column instead of 2) by asking for oov_token='' instead of oov_token='[UNK]'; replace the call to StringLookup in the get_category_encoding_layer() function with
index = preprocessing.StringLookup(oov_token='', mask_token=None, max_tokens=max_tokens)
after which, the result will be:
['', 'Dog', 'Cat']
<tf.Tensor: shape=(5, 3), dtype=float32, numpy=
array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]], dtype=float32)>
i.e. with only 3 columns (without a dedicated one for '[UNK]'). AFAIK, this is the lowest you can go - attempting to set both mask_token and oov_token to None will result in an error.
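If you really do want a plain two-column one-hot on top of this pipeline, one option (a sketch, not part of the original answer) is to slice off the leading OOV column from the encoder output:
# Sketch: with oov_token='' and mask_token=None the vocabulary is ['', 'Dog', 'Cat'],
# so dropping the first (OOV) column leaves a plain Dog/Cat one-hot encoding.
# Assumes the `index` and `encoder` layers built inside get_category_encoding_layer().
encode_plain = lambda feature: encoder(index(feature))[:, 1:]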


NumPy matrix to SciPy sparse matrix: What is the safest way to add a scalar?

First off, I'm no mathematician. I admit that. Yet I still need to understand how SciPy's sparse matrices work arithmetically in order to switch from a dense NumPy matrix to a SciPy sparse matrix in an application I have to work on. The issue is memory usage. A large dense matrix will consume tons of memory.
The formula portion at issue is where a matrix is added to a scalar.
A = V + x
Where V is a square matrix (it's large, say 60,000 x 60,000) and sparsely populated, and x is a float.
With NumPy, the operation will (if I'm not mistaken) add x to every entry of V. Please let me know if I'm completely off base and x will only be added to non-zero values in V.
With SciPy, not all sparse matrices support the same features, such as scalar addition. dok_matrix (Dictionary of Keys) supports scalar addition, but it looks like (in practice) it allocates every matrix entry, effectively rendering my sparse dok_matrix a dense matrix with extra overhead (not good).
The other matrix types (CSR, CSC, LIL) don't support scalar addition.
I could try constructing a full matrix with the scalar value x and then adding that to V. I would have no problems with matrix types, as they all seem to support matrix addition. However, I would have to eat up a lot of memory to construct x as a matrix, and the result of the addition could end up being a fully populated matrix as well.
There must be an alternative way to do this that doesn't require allocating 100% of a sparse matrix.
I'm willing to accept that large amounts of memory are needed, but I thought I would seek some advice first. Thanks.
Admittedly sparse matrices aren't really in my wheelhouse, but ISTM the best way forward depends on the matrix type. If you're using DOK:
>>> import numpy as np
>>> from scipy.sparse import dok_matrix
>>> S = dok_matrix((5, 5))
>>> S[2, 3] = 10; S[4, 1] = 20
>>> S.todense()
matrix([[  0.,   0.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,   0.],
        [  0.,   0.,   0.,  10.,   0.],
        [  0.,   0.,   0.,   0.,   0.],
        [  0.,  20.,   0.,   0.,   0.]])
Then you could update:
>>> S.update(zip(S.keys(), np.array(S.values()) + 99))
>>> S
<5x5 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Dictionary Of Keys format>
>>> S.todense()
matrix([[   0.,    0.,    0.,    0.,    0.],
        [   0.,    0.,    0.,    0.,    0.],
        [   0.,    0.,    0.,  109.,    0.],
        [   0.,    0.,    0.,    0.,    0.],
        [   0.,  119.,    0.,    0.,    0.]])
Not particularly performant, but it is O(nonzero).
OTOH, if you have something like COO, CSC, or CSR, you can modify the data attribute directly:
>>> C = S.tocoo()
>>> C
<5x5 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
>>> C.data
array([ 119., 109.])
>>> C.data += 1000
>>> C
<5x5 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
>>> C.todense()
matrix([[    0.,     0.,     0.,     0.,     0.],
        [    0.,     0.,     0.,     0.,     0.],
        [    0.,     0.,     0.,  1109.,     0.],
        [    0.,     0.,     0.,     0.,     0.],
        [    0.,  1119.,     0.,     0.,     0.]])
Note that you're probably going to want to add an additional
>>> C.eliminate_zeros()
to handle the possibility that you've added a negative number and so there's now a 0 which is actually being recorded. By itself, that should work fine, but the next time you did the C.data += some_number trick, it would add some_number to that zero you introduced.
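Pulling the pieces of this answer together, a minimal sketch of a helper that adds a scalar only to the stored (non-zero) entries of a sparse matrix might look like this (the name add_to_nonzero is mine, not from the answer):
import numpy as np
from scipy.sparse import random as sparse_random

def add_to_nonzero(M, x):
    """Return a CSR copy of M with the scalar x added only to stored entries."""
    A = M.tocsr(copy=True)
    A.data += x                  # touches only the stored (non-zero) values
    A.eliminate_zeros()          # drop entries the addition turned into exact zeros
    return A

# Usage: a 60,000 x 60,000 dense array would be huge, but the sparse version stays small.
V = sparse_random(60000, 60000, density=1e-6, format='csr')
A = add_to_nonzero(V, 0.5)
Whether adding x only to the non-zero entries is the right semantics depends on your formula; if you truly need x added to every entry, the result is dense by definition and a sparse representation cannot avoid that.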