Image in the form of a NumPy array in a cell of a PySpark data frame

I would like to store an image represented as a NumPy array in a PySpark data frame.
When I try, I get a "data type not supported" error.
Looking at the data types supported in PySpark, I don't see NumPy arrays, so I am wondering if there is a way to store one.
I also tried storing the array as a string, but for some reason the string is truncated and contains ...
Any suggestions or solutions?
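On the truncation: NumPy abbreviates the printed form of large arrays with "...", so str(arr) does not keep all the values. One possible workaround, sketched below, is to store the raw bytes of the array together with its shape and dtype, and rebuild the array when reading. This assumes the data frame column can hold binary data (PySpark has a BinaryType for that); the helper names encode_image and decode_image are made up for this illustration, and the fake image is just test data.

```python
import numpy as np

def encode_image(arr):
    """Flatten an array into (raw bytes, shape, dtype name) for storage."""
    arr = np.ascontiguousarray(arr)
    return arr.tobytes(), list(arr.shape), str(arr.dtype)

def decode_image(raw, shape, dtype):
    """Rebuild the array from the stored bytes, shape, and dtype name."""
    return np.frombuffer(raw, dtype=dtype).reshape(shape)

# A fake 4x5 RGB image stands in for real image data.
image = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)

raw, shape, dtype = encode_image(image)
restored = decode_image(raw, shape, dtype)

# The string form of a large array is summarized, which explains the "...".
summarized = str(np.arange(2000))
```

The three values returned by encode_image are plain Python types (bytes, a list of ints, a string), so they could go into three ordinary columns of a row. Note that the array returned by np.frombuffer is read-only; call .copy() on it if it needs to be modified.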

Related

Why are numpy arrays called homogeneous?

Why are numpy arrays called homogeneous when you can have elements of different type in the same numpy array like this?
np.array([1,2,3,4,"a"])
I understand that some operations are not possible on such an array; for example,
np1*4 here results in an error.
But my question really is: if it can have elements of different types, why is it called homogeneous?
NumPy automatically converts the elements to the most applicable common data type.
e.g.,
>>> np.array([1,2,3,4,"a"]).dtype.type
numpy.str_
In short, this means all elements are strings.
>>> np.array([1,2,3,4]).dtype.type
numpy.int64
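The conversion can be checked element by element. The sketch below shows that the integers in the mixed list end up as strings, and that even dtype=object (an option not mentioned above) still gives a single dtype for the whole array. The variable names are only for illustration.

```python
import numpy as np

mixed = np.array([1, 2, 3, 4, "a"])

# NumPy picks one dtype for the whole array; here it is a Unicode string type.
kind = mixed.dtype.kind                               # 'U' means Unicode string
all_strings = all(isinstance(x, str) for x in mixed)  # the ints became strings

# Asking for dtype=object keeps the original Python objects, but the array
# is still homogeneous: every slot holds a reference to a Python object.
as_objects = np.array([1, 2, 3, 4, "a"], dtype=object)
element_types = [type(x).__name__ for x in as_objects]
```

So "homogeneous" refers to the dtype of the array, not to what was passed to np.array.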

Ways to save data frame of tensors into a file for easy loading and access?

I have a dataframe with 2600 rows, and for each row there are torch tensors of shape (192,).
How can I save this dataframe into a file so that when I load it back I can still use "dictionary-like" access to its contents?
Saving with to_csv() converts each tensor into a string, causing a mess that I would need to parse.
It turns out that the pandas DataFrame.to_pickle() method retains the dataframe format, and I can still access the data conveniently after loading it back.
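A minimal sketch of the pickle round trip, under some assumptions: NumPy arrays stand in for the torch tensors so the example runs without torch, the frame has 3 rows instead of 2600, and the column names and file name are invented for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Each row holds a vector of shape (192,); NumPy arrays stand in for torch tensors.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "embedding": [np.full(192, i, dtype=np.float32) for i in range(3)],
})

path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
df.to_pickle(path)             # cells are pickled as objects, not turned into strings
loaded = pd.read_pickle(path)

# Column and row access works as before, and the cell is still an array.
vec = loaded.loc[1, "embedding"]
```

As with any pickle file, only load files from a source you trust.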

Writing data frame with object dtype to HDF5 only works after converting to string

I have a big dataframe and I want to write it to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way to cope with this is.
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length),})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting various lines leads to the following results.
Filling the column with numbers only gives the data type int32, and the data is stored without problems.
Setting one element to abc changes the column's data type to object, but to_hdf seems to infer a different data type internally and throws an error: TypeError: object of type 'int' has no len()
Explicitly converting the column to str leads to success, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and whether there is a way to prevent it. The only way I found was to go through all columns, check if they are dtype('O'), and explicitly convert them to str.
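The column-by-column workaround described in the question could look roughly like this. It is only a sketch: the frame is built directly with a mixed column instead of assigning "abc" afterwards, the second column "b" is added for contrast, and the final to_hdf call is left out so the example does not depend on PyTables.

```python
import pandas as pd

# Mixed ints and a string give the column dtype object, as in the second case.
df = pd.DataFrame({"a": [12345678, "abc", 87654321], "b": [1.0, 2.0, 3.0]})

# Convert only the object columns; numeric columns are left untouched.
object_cols = [col for col in df.columns if df[col].dtype == object]
for col in object_cols:
    df[col] = df[col].astype(str)
```

After the loop every value in column "a" is a Python string, which matches the third case above where storing succeeded.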
Instead of using hdf5, I have found a generic pickling library which seems to be perfect for the job: joblib
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")

Can a numpy matrix be converted to database table?

Can a numpy matrix created within a plpython function be converted into a database table?
You can use the dbTable library to store a numpy array as a table. Please refer to the complete documentation - https://pypi.org/project/dbTable/

Save a numpy sparse matrix into file

I want to save the result of TfidfVectorizer in sklearn.feature_extraction.text into a text file for future use. As far as I can tell, it is a sparse matrix of type ''. However, when I try to save it using the following code
np.savetxt('Feature_TfIdf.txt', X_Tfidf, fmt='%2.6f')
I get an error like this
IndexError: tuple index out of range
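Use joblib.dump for this (older scikit-learn versions also exposed it as sklearn.externals.joblib.dump, which has since been removed). NumPy's savetxt does not understand SciPy sparse matrices.
Simple example:
import joblib
joblib.dump(tfidf, 'TfIdf.pkl')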
I managed to solve the problem by converting the sparse matrix to a full (dense) matrix and then saving that. However, this approach is not practical for large arrays, so it is better to save the matrix in .pkl format.
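A small sketch of saving a sparse matrix in .pkl format without converting it to a dense matrix first. The standard-library pickle module is used here as a stand-in for joblib so the example has fewer dependencies, and the tiny hand-made matrix is a placeholder for real TfidfVectorizer output.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy import sparse

# Small stand-in for the TF-IDF output: a 3x4 CSR matrix that is mostly zeros.
X = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.5],
    [2.0, 0.0, 0.0, 0.0],
]))

path = os.path.join(tempfile.mkdtemp(), "TfIdf.pkl")

# Pickle the sparse matrix directly; it is never expanded to a dense array.
with open(path, "wb") as f:
    pickle.dump(X, f)

with open(path, "rb") as f:
    X_loaded = pickle.load(f)

# The loaded matrix is still sparse and has no entries that differ from X.
same = (X_loaded != X).nnz == 0
```

Because the matrix stays sparse throughout, the file size grows with the number of non-zero entries rather than with rows times columns.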