Huge sparse dataframe to scipy sparse matrix without dense transform - pandas

I have data with more than 1 million rows and 30 columns; one of the columns is user_id (more than 1500 different users).
I want to one-hot encode this column and use the data in ML algorithms (xgboost, FFM, scikit). But because of the huge number of rows and unique user values, the matrix will be ~1 million x 1500, so I need to do this in a sparse format (otherwise the data eats all the RAM).
For me the most convenient way to work with the data is through a pandas DataFrame, which now also supports a sparse format:
df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
This works pretty fast and has a small footprint in RAM. But to work with scikit algorithms and xgboost, the dataframe has to be transformed into a sparse matrix.
Is there any way to do this other than iterating through the columns and hstacking them into one scipy sparse matrix?
I tried df.as_matrix() and df.values, but both first transform the data to dense, which raises a MemoryError :(
P.S.
The same applies to getting a DMatrix for xgboost.
UPDATE:
So I implemented the following solution (I would be thankful for optimisation suggestions):
import pandas as pd
from scipy import sparse

def sparse_df_to_sparse_matrix(sparse_df):
    index_list = sparse_df.index.values.tolist()
    matrix_columns = []
    sparse_matrix = None
    for column in sparse_df.columns:
        sps_series = sparse_df[column]
        # to_coo() needs a two-level MultiIndex: (row label, column label)
        sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
        curr_sps_column, rows, cols = sps_series.to_coo()
        # use `is not None`: `!=` on a scipy sparse matrix compares element-wise
        if sparse_matrix is not None:
            sparse_matrix = sparse.hstack([sparse_matrix, curr_sps_column])
        else:
            sparse_matrix = curr_sps_column
        matrix_columns.extend(cols)
    return sparse_matrix, index_list, matrix_columns
And the following code produces the sparse dataframe:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
full_sparse_df = one_hot_df.to_sparse(fill_value=0)
I have created a sparse matrix of 1.1 million rows x 1150 columns. But while it is being created it still uses a significant amount of RAM (~10 GB, on the edge with my 12 GB).
I don't know why, because the resulting sparse matrix uses only 300 MB (after loading from HDD). Any ideas?
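One possible way to avoid the intermediate sparse DataFrame and the repeated hstack calls, sketched here under the assumption that only the one-hot encoded columns are needed: factorize each column and build a CSR block directly from the integer codes. The helper name one_hot_csr and the toy data are hypothetical, not part of the original question.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def one_hot_csr(series):
    """Hypothetical helper: one-hot encode a Series straight into a CSR matrix."""
    codes, uniques = pd.factorize(series)
    n_rows = len(series)
    valid = codes >= 0  # factorize marks missing values with -1
    rows = np.arange(n_rows)[valid]
    data = np.ones(rows.shape[0], dtype=np.uint8)
    mat = sparse.csr_matrix(
        (data, (rows, codes[valid])), shape=(n_rows, len(uniques))
    )
    return mat, uniques


# Toy stand-in for the real data (assumption: same column names as in the question)
df = pd.DataFrame({"user_id": [10, 20, 10, 30], "type": ["a", "b", "a", "a"]})

blocks, names = [], []
for col in ["user_id", "type"]:
    mat, uniques = one_hot_csr(df[col])
    blocks.append(mat)
    names.extend(f"{col}_{u}" for u in uniques)

# A single hstack over all blocks instead of one hstack per column
X = sparse.hstack(blocks, format="csr")
```

Each block stores one entry per row, so peak memory should stay close to the size of the final matrix. Any remaining numeric columns could be appended as one more block in the same hstack call.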

You should be able to use the experimental .to_coo() method in pandas [1] in the following way:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
one_hot_df, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()
Instead of taking a DataFrame (rows / columns), this method takes a Series with the rows and columns in a MultiIndex (this is why you need the .stack() method). This Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So you need to call the .to_sparse() method before calling .to_coo().
The Series returned by .stack(), even though it is not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).
[1] http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
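For readers on a newer pandas: SparseSeries, SparseDataFrame and .to_sparse() were removed in pandas 1.0. A minimal sketch of the same conversion using the DataFrame.sparse accessor follows. It assumes a pandas version that ships this accessor and a frame in which every column is sparse, which is why the toy frame below contains only the two encoded columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; only the columns to be encoded are present,
# so every column of the result is sparse (the accessor requires this).
df = pd.DataFrame({"user_id": [10, 20, 10, 30], "type": ["a", "b", "a", "a"]})

one_hot_df = pd.get_dummies(
    df, columns=["user_id", "type"], sparse=True, dtype=np.uint8
)

# Convert to a scipy COO matrix without densifying, then to CSR for ML libraries
coo = one_hot_df.sparse.to_coo()
csr = coo.tocsr()
```

The resulting CSR matrix can be passed to scikit-learn estimators that accept sparse input, and xgboost.DMatrix also accepts scipy sparse matrices.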

Does my answer from a few months back help?
Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory
It was accepted but I didn't get any further feedback.
I'm familiar with the scipy sparse formats and their inputs, but don't know much about pandas sparse.

Related

How to convert a scipy sparse matrix to pyspark dataframe without calling toPandas or todense?

I have a big scipy.sparse matrix data_transformed of the following size:
<101772x69768 sparse matrix of type '<class 'numpy.float64'>'
with 17317540 stored elements in Compressed Sparse Row format>
And I'd like to convert it to pyspark.DataFrame without collecting it on driver. My tries:
Batch processing by rows
spark.createDataFrame(pd.DataFrame(np.array(data_transformed[:5].todense())))
but it seems that spark is having trouble in inferring a schema for this many columns...
Batch processing by columns
data_transformed_sp_list = []
for i in tqdm(range(0, data_transformed.shape[1])):
    data_transformed_sp_list.append(
        spark.createDataFrame(pd.DataFrame(np.array(data_transformed[:, i].todense())))
    )
but it's also not feasible as per tqdm:
1%| | 436/69768 [01:04<2:42:39, 7.10it/s]
Is there an elegant way to do it?
Seeing that the matrix is in CSR format, you can try to create a sparse dataframe directly:
pd.DataFrame.sparse.from_spmatrix(data_transformed)
See the pandas documentation for DataFrame.sparse.from_spmatrix.
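A small sketch of what from_spmatrix does, on a toy matrix rather than the 101772x69768 one from the question. It only illustrates the pandas step; whether Spark can then ingest the resulting frame without densifying it is not shown here.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy CSR matrix: 4 rows x 3 columns with 3 stored elements
mat = sparse.csr_matrix(
    ([1.0, 2.0, 3.0], ([0, 1, 3], [0, 2, 1])), shape=(4, 3)
)

# Each column becomes a pandas sparse column; no dense array is built
sdf = pd.DataFrame.sparse.from_spmatrix(mat)

# Round trip back to scipy through the sparse accessor
back = sdf.sparse.to_coo()
```

The columns of sdf keep a sparse dtype, so sdf.sparse.density reports the fraction of stored values.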

Memory problems with turning a large Xarray to a DataFrame, Python

I have an x,y,time xarray of about 2 million datapoints. I have tried turning it into a dask array and then turning that into a dataframe, but there isn't enough memory. I use Colab. Is there a different way to do this that might fit into Colab's memory?
dask_array = da.array.from_array(xarray.values, chunks=(xarray.shape[0], 100, 100))
# reshape and transpose to make rows of timeseries for each point
dask_t = dask_array.transpose(1, 2, 0).reshape(dask_array.shape[1] * dask_array.shape[2], dask_array.shape[0])
# prepare column names for timeseries
feat_cols = ['time_' + str(i) for i in range(dask_array.shape[0])]
# turn into a dataframe
pd.DataFrame(dask_t.compute(), columns=feat_cols)

Clean np array of NaN while deleting entries in other array accordingly

I have two numpy arrays, one of which contains about 1% NaNs.
a = np.array([-2,5,np.nan,6])
b = np.array([2,3,1,0])
I'd like to compute the mean squared error of a and b using sklearn's mean_squared_error.
So my question is, what's the pythonic way of removing all NaNs from a while at the same time deleting all corresponding entries from b as efficiently as possible?
You can simply use vanilla NumPy's np.nanmean for this purpose:
In [136]: np.nanmean((a-b)**2)
Out[136]: 18.666666666666668
If this didn't exist, or you really wanted to use the sklearn method, you could create a mask to index the NaNs:
In [148]: mask = ~np.isnan(a)
In [149]: mean_squared_error(a[mask], b[mask])
Out[149]: 18.666666666666668
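The masking approach can be sketched as a self-contained script. As an assumption beyond the question, the mask below also guards against NaNs in b, and np.mean stands in for sklearn's mean_squared_error so that only NumPy is needed.

```python
import numpy as np

a = np.array([-2, 5, np.nan, 6])
b = np.array([2, 3, 1, 0])

# Keep only positions where neither array has a NaN
mask = ~(np.isnan(a) | np.isnan(b))

# Mean squared error over the surviving pairs
mse = np.mean((a[mask] - b[mask]) ** 2)
```

The same mask can be used to index both arrays before passing them to mean_squared_error.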

Keep pandas index while applying sklearn

I have a dataset which has a DateTime index and I'm using PCA from sklearn to reduce the number of dimensions.
The following question bugs me - will PCA keep the order of the points in my series so that I can reuse the index from the original dataframe?
df = pd.DataFrame(...)
df2 = pca.fit_transform(df)
df2.index = df.index
Moreover, is there a better (safer) approach than doing this?
Although PCA drops the index, the underlying order of the rows is preserved (see the implementation of PCA's transform function*). So it is safe to set df2.index = df.index.
*fit_transform is the same as fit followed by transform. Neither of them reorders the rows.
Moreover, is there a better (safer) approach than doing this?
What you do is safe. But a cleaner way to do this is to wrap the output in either a DataFrame or Series and provide the original index. In your example:
df = pd.DataFrame(...)
df2 = pd.DataFrame(pca.fit_transform(df), index=df.index)
This is very useful when dealing with prediction vectors (np.ndarrays) coming out of a scikit-learn model:
y_pred = pd.Series(clf.predict(X_train), index=X_train.index)
This is important when you have a more complicated index, like a MultiIndex.

How to create sparse boolean mask in Pandas?

I have the following code for mask filtering of df :
for i, y in enumerate(cols):
    dfm = df[y].str.contains(s)
    mask = dfm if i == 0 else np.column_stack((mask, dfm))
df is not sparse, but the resulting filter mask is sparse.
Storing the mask as a full boolean array consumes a lot of memory with a large dataframe (50 million rows * 100 columns).
So, as the mask is very sparse (0.1% is True), I'm wondering if there is a way to use a sparse boolean mask instead of an array mask in order to reduce the memory load.
I could not find a solution even though pandas already has a sparse array, since it is not clear how to use it for storing and applying the mask, i.e.:
mask_sparse = pd.SparseArray(mask)
EDIT 2: Clarification of the question:
Can we get the filter result mask directly into a sparse array
without manipulating the full array?
You can create sparse dataframes easily. But there is one major gotcha!
Consider the following dataframe df and its memory footprint
# 10,000,000 cells with 1% ones and 99% zeros
df = pd.DataFrame(np.random.choice((0, 1), size=(10000, 1000), p=(.99, .01)))
df.memory_usage().sum()
80000080
Let's try to sparsify
df_sparse = df.to_sparse()
df_sparse.memory_usage().sum()
80000080
Hmmm, that didn't do anything. That's because we need to specify the value that acts as the majority placeholder (the fill value). Let's see
df_sparse_2 = df.to_sparse(1)
df_sparse_2.memory_usage().sum()
79196744
And
df_sparse_3 = df.to_sparse(0)
df_sparse_3.memory_usage().sum()
803416
That's better. Make sure to specify the place holder value.
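Note that DataFrame.to_sparse was removed in pandas 1.0. A rough sketch of how the mask from the question could be built as a sparse boolean frame in newer pandas follows: each column's match result is converted to a SparseArray with fill_value=False as soon as it is computed, so only one dense boolean column exists at a time. The toy frame and search string are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real 50-million-row frame
df = pd.DataFrame({
    "a": ["foo", "bar", "baz", "qux"],
    "b": ["spam", "foo bar", "eggs", "ham"],
})
s = "foo"

# One sparse boolean column per source column; False is the fill value,
# so only the True entries are actually stored.
mask = pd.DataFrame(
    {
        col: pd.arrays.SparseArray(
            df[col].str.contains(s, regex=False, na=False).to_numpy(dtype=bool),
            fill_value=False,
        )
        for col in df.columns
    },
    index=df.index,
)

# The sparse mask converts to scipy without densifying
coo = mask.sparse.to_coo()
```

From the COO matrix, coo.row and coo.col give the positions of the matches, which avoids ever materialising the full boolean array.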