Save numpy ndarray with indices to file - numpy

I have a large NumPy array, say 1000*1000.
I want to save it to a file where each line has the following format.
row col matrix[row,col]
I can't seem to find a method that does it efficiently.
I could do a nested for loop but it's too slow.
Constructing a larger matrix which contains the indices would be too expensive on memory.
I was thinking of a list comprehension, but are there other ways of doing it?
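In case it helps, here is a minimal sketch of one possible approach (the array and file name are just placeholders): write the array one row at a time, building the row/col/value columns with vectorised calls, so only a single loop over rows remains and only one small block is held in memory at once.
import numpy as np

matrix = np.random.rand(1000, 1000)  # placeholder for your array
cols = np.arange(matrix.shape[1])

with open('matrix.txt', 'w') as fh:
    # Only one (ncols, 3) block exists in memory at a time.
    for row in range(matrix.shape[0]):
        block = np.column_stack((np.full_like(cols, row), cols, matrix[row]))
        np.savetxt(fh, block, fmt='%d %d %.6f')
If memory allows the temporary 3-column array, the loop can be dropped entirely by using np.indices(matrix.shape), flattening everything, and calling np.savetxt once.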

Related

How can I open an arbitrary netCDF file with xarray and get the nth time slice as a NumPy array?

When I open a netCDF file with xarray in Python, I open it as a Dataset object:
ds = xr.open_dataset(file_path)
How do I get the nth time slice of this dataset as a NumPy array?
I know that I can get that if I know the NetCDF variable name, like so:
xvar = ds.data_vars[var_name]
array = xvar.isel(time=n).values
but that requires knowing var_name, i.e., the NetCDF variable name, which I may not know for all netCDF files.
With iris, this name is available as the attribute var_name in the resulting Cube object after loading the netCDF file with iris.load_cube. How can I get the same variable name in xarray after loading the netCDF file into an xarray dataset?
Or is there any even simpler way to get the nth time slice of the netCDF file as a NumPy array with xarray?
Have you tried:
results = ds.isel(time=n)
That should return the time slice for all the variables, as you desire. Obviously you will have problems if there are multiple variables and you only want one of them, but there is no way you can know which you want without knowing the variable name anyway, so I don't think that should really be an issue.
If your question is "how can I specifically extract only one variable from a list of others without knowing its name", then that doesn't really make sense. If you don't know the data you want, how do you expect to get the data you want?
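If the goal is just to discover the variable name, note that ds.data_vars behaves like a mapping, so its keys are the netCDF variable names. A small sketch (file_path and n are assumed to be defined):
import xarray as xr

ds = xr.open_dataset(file_path)

# Keys of data_vars are the netCDF variable names
var_names = list(ds.data_vars)

# nth time slice of the first data variable as a NumPy array
array = ds[var_names[0]].isel(time=n).values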

Ways to save data frame of tensors into a file for easy loading and access?

I have a dataframe with 2600 rows, and for each row there are torch tensors of shape (192,).
How can I save this dataframe to a file so that, when I load it back, I can still use "dictionary-like" access to its contents?
Saving with to_csv() converts the tensors into strings, causing a mess that I then need to parse.
It turns out that using the DataFrame.to_pickle() method retains the dataframe format, and I can still access the data conveniently after loading it back.
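A minimal sketch of that round trip (the file name is just an example, and df is assumed to be the dataframe holding the (192,)-shaped tensors):
import pandas as pd

# Persist the dataframe, tensors and all, without converting to strings
df.to_pickle('tensor_frame.pkl')

# The tensors come back as torch objects, so dictionary-like access still works
df = pd.read_pickle('tensor_frame.pkl')
row = df.iloc[0]  # example access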

Is there a Pandas DataFrame implementation that lazily loads records from a table in an HDF5 file?

I am trying to convert millions of existing HDF5 files to Parquet format. The problem is that neither the input nor the output fits in memory, so I need a way to process the input data (tables in an HDF5 file) in chunks and somehow have a Pandas DataFrame that lazily loads these chunks while fastparquet's write function reads from it.
Pandas read_hdf() and HDFStore's select() do take chunksize as a parameter, but they do not return a usable dataframe. Without the chunksize parameter the program runs out of memory because Pandas loads the whole dataset into memory.
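One way to keep memory bounded is to iterate over the chunk iterator that read_hdf() returns and append each chunk to the Parquet file as a new row group. A rough sketch using fastparquet, assuming the HDF5 data is stored in pandas table format (paths, key, and chunk size are placeholders, and fastparquet's append behaviour should be verified for your file layout):
import pandas as pd
import fastparquet

# read_hdf with chunksize returns an iterator of DataFrames, not a DataFrame
reader = pd.read_hdf('input.h5', key='table', chunksize=100_000)

for i, chunk in enumerate(reader):
    # The first chunk creates the file; later chunks are appended as row groups
    fastparquet.write('output.parquet', chunk, append=(i > 0))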

Pandas DataFrame chunks: writing a DataFrame generator object to_csv

I'm reading a large amount of data from a database via pd.read_sql(...chunksize=10000), which returns a generator of DataFrame chunks.
While I can still work with each chunk, e.g. merging it with pd.merge(df, df2, ...), some functions are no longer available on the generator, such as df.to_csv(...).
What is the best way to handle that? How can I write such a dataframe to a CSV? Do I need to iterate over it manually?
You can either process each chunk individually, or combine them using e.g. pd.concat to operate on all chunks as a whole.
Individually, you would indeed iterate over the chunks like so:
for chunk in pd.read_sql(...chunksize=10000):
    # process each chunk here
To combine them, you can use a list comprehension:
df = pd.concat([chunk for chunk in pd.read_sql(...chunksize=10000)])
# process the combined df
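If the end goal is a single CSV without holding everything in memory, another option is to write each chunk with to_csv in append mode. A sketch (query, conn, and the output path are placeholders):
import pandas as pd

for i, chunk in enumerate(pd.read_sql(query, conn, chunksize=10000)):
    # Write the header only for the first chunk, then append the rest
    chunk.to_csv('output.csv', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)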

Save a numpy sparse matrix into file

I want to save the result of TfidfVectorizer from sklearn.feature_extraction.text to a text file for future use. As I found, it is a SciPy sparse matrix. However, when I try to save it using the following code
np.savetxt('Feature_TfIdf.txt', X_Tfidf, fmt='%2.6f')
I get an error like this
IndexError: tuple index out of range
Use joblib.dump or sklearn.externals.joblib.dump for this. NumPy doesn't understand SciPy sparse matrices.
Simple example:
joblib.dump(tfidf, 'TfIdf.pkl')
I managed to solve the problem by converting the sparse matrix to a dense matrix and then saving the result. This approach, however, is not practical for large arrays, so it is better to save the matrix in .pkl format.
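For completeness, a sketch of the joblib round trip suggested above (X_Tfidf is assumed to be the sparse matrix returned by the vectorizer; the file name is arbitrary):
import joblib

# Persist the sparse matrix without densifying it
joblib.dump(X_Tfidf, 'Feature_TfIdf.pkl')

# Load it back later as a SciPy sparse matrix
X_Tfidf = joblib.load('Feature_TfIdf.pkl')
scipy.sparse.save_npz / load_npz is another option for saving a plain sparse matrix.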