Save a numpy sparse matrix into a file

I want to save the result of TfidfVectorizer in sklearn.feature_extraction.text into a text file for future use. As I found, the result is a SciPy sparse matrix. However, when I try to save it using the following code
np.savetxt('Feature_TfIdf.txt', X_Tfidf, fmt='%2.6f')
I get an error like this
IndexError: tuple index out of range

Use joblib.dump for this (on older scikit-learn versions, sklearn.externals.joblib.dump). NumPy's savers don't understand SciPy sparse matrices.

Simple example:
joblib.dump(tfidf, 'TfIdf.pkl')

I managed to solve the problem by converting the sparse matrix to a dense matrix and then saving the result. This approach, however, is not useful for large arrays, so it is better to save the matrix in .pkl format.
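For that "future use", a minimal sketch of the full joblib round trip; the docs corpus here is a made-up example:
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first document", "second document"]  # hypothetical corpus
X_Tfidf = TfidfVectorizer().fit_transform(docs)  # scipy.sparse CSR matrix
joblib.dump(X_Tfidf, 'Feature_TfIdf.pkl')  # serialize the sparse matrix as-is
X_loaded = joblib.load('Feature_TfIdf.pkl')  # restore it later, still sparse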

Related

How to convert a scipy sparse matrix to pyspark dataframe without calling toPandas or todense?

I have a big scipy.sparse matrix data_transformed of the following size:
<101772x69768 sparse matrix of type '<class 'numpy.float64'>'
with 17317540 stored elements in Compressed Sparse Row format>
And I'd like to convert it to a pyspark.DataFrame without collecting it on the driver. My attempts:
Batch processing by rows
spark.createDataFrame(pd.DataFrame(np.array(data_transformed[:5].todense())))
but it seems that Spark has trouble inferring a schema for this many columns...
Batch processing by columns
data_transformed_sp_list = []
for i in tqdm(range(0, data_transformed.shape[1])):
    data_transformed_sp_list.append(
        spark.createDataFrame(pd.DataFrame(np.array(data_transformed[:, i].todense())))
    )
but it's also not feasible, as the tqdm estimate shows:
1%| | 436/69768 [01:04<2:42:39, 7.10it/s]
Is there an elegant way to do it?
Seeing that the matrix is in CSR format, you can try to create a sparse DataFrame directly:
pd.DataFrame.sparse.from_spmatrix(data_transformed)
(see the pandas documentation for pd.DataFrame.sparse.from_spmatrix)
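A minimal sketch of that call (requires pandas 0.25+); a small random CSR matrix stands in for data_transformed:
import pandas as pd
from scipy import sparse

mat = sparse.random(5, 4, density=0.3, format='csr', random_state=0)  # stand-in for data_transformed
sdf = pd.DataFrame.sparse.from_spmatrix(mat)  # wraps the CSR matrix without densifying it
print(sdf.dtypes)  # each column is Sparse[float64, 0]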

Object dtype dtype('O') has no native HDF5 equivalent

Well, it seems a couple of similar questions were asked here on Stack Overflow, but none of them seems to be answered correctly or properly, nor do they describe exact examples.
I have a problem with saving an array or list into HDF5 ...
I have several files, each containing an array of shape (n, 35), where n may differ between files. Each of them can be saved to HDF5 with the code below.
hdf = hf.create_dataset(fname, data=d)
However, if I want to merge them into a 3D dataset, the error below occurs.
Object dtype dtype('O') has no native HDF5 equivalent
I have no idea why the dtype turns into object, since all I have done is this:
all_data = list()
for fname in file_list:
    d = np.load(fname)
    all_data.append(d)
hdf = hf.create_dataset('all_data', data=all_data)
How can I save such data?
I tried a couple of tests, and it seems that all_data gets dtype object when I convert it with
all_data = np.array(all_data)
which seems to hit the same problem when saving to HDF5.
Again, how can I save such data in HDF5?
I was running into a similar issue with h5py, and changing the type of the NumPy array using array.astype worked for me (I believe this changes the type from dtype('O') to the data type you specify). Please see the code snippet below:
import numpy as np

print(X.dtype)
# --> dtype('O')
print(X.astype(np.float64).dtype)
# --> dtype('float64')
When I ran h5.create_dataset with this data type conversion, I was able to successfully create an HDF5 dataset. Hope this helps!
ONE ADDITIONAL UPDATE: I believe the NumPy object dtype 'O' is created when the array holds arbitrary Python objects rather than one scalar type, e.g. mixed element types or sub-arrays of different lengths.
dtype('O') stands for object. In my case I had a list of lists where the lengths differed, and I got the same error. If you convert it to a NumPy array, NumPy warns about creating an ndarray from ragged nested sequences. HDF5 files can't handle this type of data; for more info see this post.
This error came up when I used:
with h5py.File(peakfilename, 'w') as pfile:  # saves the data
    pfile['peakY'] = np.array(X)
    pfile['peakX'] = np.array(Y)
However, when I specified a dtype while saving the arrays, the problem went away. I guess h5py is not able to create datasets from undefined data types.
with h5py.File(peakfilename, 'w') as pfile:  # saves the data
    pfile['peakY'] = np.array(X, dtype=np.float32)
    pfile['peakX'] = np.array(Y, dtype=np.float32)
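Putting the answers together, a minimal sketch for the original merge problem; the file names are hypothetical, and a shape check decides between one stacked dataset and separate per-file datasets:
import numpy as np
import h5py

arrays = [np.load(f) for f in ['a.npy', 'b.npy']]  # hypothetical files of shape (n, 35)
with h5py.File('merged.h5', 'w') as hf:
    if len({a.shape for a in arrays}) == 1:
        # equal n: np.stack gives a regular (files, n, 35) float array, no object dtype
        hf.create_dataset('all_data', data=np.stack(arrays).astype(np.float64))
    else:
        # different n: stacking would yield a ragged object array, which HDF5 rejects,
        # so store each array as its own dataset instead
        for i, a in enumerate(arrays):
            hf.create_dataset(f'data_{i}', data=a.astype(np.float64))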

Vectorizing text from data frame column using pandas

I have a DataFrame with a text column, headline_text. I am trying to vectorize every row, but only from the text column. I wrote this code:
vectorizerCount = CountVectorizer(stop_words='english')
# tokenize and build vocab
allDataVectorized = allData.apply(vectorizerCount.fit_transform(allData.iloc[:]['headline_text']), axis=1)
The error says:
TypeError: ("'csr_matrix' object is not callable", 'occurred at index 0')
Doing some research and trying changes, I found out that fit_transform returns a scipy.sparse.csr.csr_matrix, which is not callable.
Is there another way to do this?
Thanks!
There are a number of problems with your code. You probably need something like
allDataVectorized = pd.DataFrame.sparse.from_spmatrix(vectorizerCount.fit_transform(allData['headline_text']))
allData['headline_text'] (with single brackets) is a Series of strings, which is the iterable of raw documents fit_transform expects.
fit_transform returns a CSR matrix.
pd.DataFrame.sparse.from_spmatrix(...) wraps that CSR matrix in a DataFrame without densifying it.
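A self-contained usage example of that fix with a tiny made-up frame (get_feature_names_out needs scikit-learn 1.0+):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

allData = pd.DataFrame({'headline_text': ['cats and dogs', 'dogs bark loudly']})
vectorizerCount = CountVectorizer(stop_words='english')
counts = vectorizerCount.fit_transform(allData['headline_text'])  # CSR matrix
allDataVectorized = pd.DataFrame.sparse.from_spmatrix(
    counts, columns=vectorizerCount.get_feature_names_out())  # one column per vocabulary term
print(allDataVectorized)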

How to load and convert .mat file into numpy 2D array?

I have data in a .mat file (observations and features) and I want to load it into a NumPy 2D array. I don't want to convert it to CSV first and then load the CSV into NumPy.
Use SciPy's loadmat (see the API docs).
The docs should be sufficient to get you going, but make sure to read the notes.
There is also the SciPy I/O tutorial with some examples.
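A minimal sketch; the file name 'data.mat' and the variable name 'observations' are assumptions:
import numpy as np
from scipy.io import loadmat

mat = loadmat('data.mat')  # returns a dict mapping variable names to arrays
X = np.asarray(mat['observations'])  # hypothetical variable holding the 2D data
print(X.shape)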

Check if a numpy array is a numpy masked array

As output of a script, I have a NumPy masked array and a standard NumPy array. How do I easily check, while the script runs, whether an array is a masked one (has data and mask attributes) or not?
You can check explicitly whether it is a masked array with isinstance(arr, np.ma.MaskedArray), or you can check for the attribute with hasattr(arr, 'mask'). I'd probably recommend the first approach in general.
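A quick demonstration of both checks:
import numpy as np

arr = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(isinstance(arr, np.ma.MaskedArray))  # True only for masked arrays
print(hasattr(arr, 'mask'))  # also True here, but only a duck-typed check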