Keep pandas index while applying sklearn - pandas

I have a dataset which has a DateTime index and I'm using PCA from sklearn to reduce the number of dimensions.
The following question bugs me - will PCA keep the order of the points in my series so that I can reuse the index from the original dataframe?
df = pd.DataFrame(...)
df2 = pca.fit_transform(df)
df2.index = df.index
Moreover, is there a better (safer) approach than doing this?

PCA drops the index, but the underlying order of the rows is preserved (see the implementation of PCA's transform function*), so it is safe to reuse df.index for the transformed rows.
*fit_transform is the same as fit followed by transform. Neither of them reorders the rows.

Moreover, is there a better (safer) approach than doing this?
What you do is safe. But a cleaner way to do this is to wrap the output in either a DataFrame or Series and provide the original index. In your example:
df = pd.DataFrame(...)
df2 = pd.DataFrame(pca.fit_transform(df), index=df.index)
This is very useful when dealing with prediction vectors (np.ndarrays) coming out of a scikit-learn model:
y_pred = pd.Series(clf.predict(X_train), index=X_train.index)
This is important when you have a more complicated index, like a MultiIndex.
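To see both points together, here is a minimal sketch. The data, the column names (a, b, c, pc1, pc2) and the random seed are made up for illustration; it assumes scikit-learn's default behaviour, where fit_transform returns a plain ndarray.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data: 6 daily observations of 3 features
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=6, freq="D")
df = pd.DataFrame(rng.normal(size=(6, 3)), index=idx, columns=["a", "b", "c"])

pca = PCA(n_components=2)
# The output carries no index by default, so re-attach the original one explicitly
df2 = pd.DataFrame(pca.fit_transform(df), index=df.index, columns=["pc1", "pc2"])

# Row-order check: transforming row 3 on its own matches row 3 of the full output
single = pca.transform(df.iloc[[3]])
same_row = np.allclose(single[0], df2.iloc[3].to_numpy())
```

If same_row is True for every position you try, the rows came back in the order they went in, which is what makes reusing the index valid.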

Related

Numpy iterating over rows

I have the impression that for loops should be avoided in NumPy for speed reasons. For example:
import numpy
a = numpy.array([[2,0,1,3],[0,2,3,1]])
targets = numpy.array([[1,1,1,1,1,1,1]])
output = numpy.zeros((2,1))
for i in range(2):
    output[i] = numpy.mean(targets[a[i]])
Is this a good way to get the mean on selected positions of each row? Feels like there might be ways to slice the array first then apply mean directly.
I think you are looking for this:
targets[a].mean(1)
Note that in your example, targets needs to be 1-D, not 2-D. Otherwise, your loop raises an out-of-bounds IndexError, because the index is interpreted as a row index rather than a column index.
numpy actually handles this for you: targets[a] works "row-wise", so np.mean(targets[a], axis=1), as suggested by @hpaulj in the comments, does exactly what you want:
import numpy
a = numpy.array([[2,0,1,3],[0,2,3,1]])
targets = numpy.arange(1,6) # distinct values instead of all ones
output = numpy.mean(targets[a], axis=1) # the i-th row of targets[a] is targets[a[i]]
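A small check that the loop and the vectorized form agree. The arrays below are illustrative values, not the ones from the question, chosen so that the two rows give different means.

```python
import numpy as np

a = np.array([[2, 0, 1, 3], [0, 2, 3, 4]])           # positions to pick, one row per output
targets = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # must be 1-D

# Loop version: one mean per row of a
loop_out = np.zeros(len(a))
for i in range(len(a)):
    loop_out[i] = targets[a[i]].mean()

# Vectorized version: targets[a] has the same shape as a, then average along each row
vec_out = targets[a].mean(axis=1)
# both give array([25. , 32.5])
```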

Using .apply vs subsets

As far as I know using .apply() in pandas is rather inefficient because it isn't vectorized.
I have a bunch of relatively normal operations like addition or multiplication which I want to do differently depending on the content of certain columns.
The central question is what are the advantages and disadvantages of the two below code snippets:
df['col'] = df['col'].apply(lambda x: x/df['col'].max() if x < 1000 else x)
# or
df.loc[df['col']<1000,'col'] = df["col"]/df['col'].max()
I've noticed that the first is slower but I've seen it recommended a lot and I sometimes get slice errors for the second version so was hesitant to use it.
When you use loc to set a subset on the LHS, you should also subset on the RHS so it's explicit. This will avoid errors in cases where the index might be duplicated.
import pandas as pd
df = pd.DataFrame({'col': range(997,1003)})
m = df['col'].lt(1000)
df.loc[m, 'col'] = df.loc[m, 'col']/df['col'].max()
#            col
# 0     0.995010
# 1     0.996008
# 2     0.997006
# 3  1000.000000
# 4  1001.000000
# 5  1002.000000
Alternatively, use np.where for an if-else clause:
import numpy as np
df = pd.DataFrame({'col': range(997,1003)})
df['col'] = np.where(df['col'].lt(1000), df['col']/df['col'].max(), df['col'])
As for using apply, this question has much more thorough answers; in particular, see @jpp's answer. You may have seen .apply suggested for a groupby object, or to perform column-wise calculations on a narrow DataFrame, both of which are typically fine.
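As a sanity check, the three approaches can be compared on a small frame. This is only a sketch: the names via_apply, via_loc and via_where are made up here, and the column is created as float so that writing fractions back into it does not change its dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": np.arange(997, 1003, dtype=float)})
col_max = df["col"].max()
m = df["col"].lt(1000)  # boolean mask of the rows to rescale

# 1) element-wise apply: a Python-level call per value
via_apply = df["col"].apply(lambda x: x / col_max if x < 1000 else x)

# 2) masked assignment, with the same mask on both sides
via_loc = df["col"].copy()
via_loc.loc[m] = via_loc.loc[m] / col_max

# 3) vectorized if-else
via_where = pd.Series(np.where(m, df["col"] / col_max, df["col"]), index=df.index)
```

All three produce the same values; the difference is that the second and third stay vectorized, while the first calls the lambda once per element.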

How to create sparse boolean mask in Pandas?

I have the following code for mask filtering of df :
for i, y in enumerate(cols):
    dfm = df[y].str.contains(s)
    mask = dfm if i == 0 else np.column_stack((mask, dfm))
df is not sparse, but the resulting filter mask is.
Storing the mask as a full boolean array consumes a lot of memory for a large dataframe (50 million rows × 100 columns).
Since the mask is very sparse (0.1% is True), I am wondering whether there is a way to use a sparse boolean mask instead of a dense array in order to reduce the memory load.
I could not find a solution, even though pandas already has a sparse array,
because it is not clear how to use it to store and apply the mask.
ie
mask_sparse = pd.SparseArray(mask)
EDIT 2: Clarification of the question:
Can we get the filter result mask directly into a sparse array
without materializing the full array?
You can create sparse dataframes easily. But there is one major gotcha!
Consider the following dataframe df and its memory footprint
# 10,000,000 cells (10,000 rows x 1,000 columns) with 1% ones and 99% zeros
df = pd.DataFrame(np.random.choice((0, 1), size=(10000, 1000), p=(.99, .01)))
df.memory_usage().sum()
80000080
Let's try to sparsify
df_sparse = df.to_sparse()
df_sparse.memory_usage().sum()
80000080
Hmmm, that didn't do anything. That's because we need to specify the fill value, i.e. the value that occupies the majority of the cells and therefore does not need to be stored. Let's see
df_sparse_2 = df.to_sparse(1)
df_sparse_2.memory_usage().sum()
79196744
And
df_sparse_3 = df.to_sparse(0)
df_sparse_3.memory_usage().sum()
803416
That's better. Make sure to specify the place holder value.
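Note that to_sparse() was removed in pandas 1.0. A rough sketch of the same idea with the current sparse API follows, assuming a recent pandas version. The data, the column names and the search string are invented for illustration; the mask is built one column at a time, so only one dense boolean column should exist at any moment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, cols, s = 10_000, ["c0", "c1", "c2"], "hit"

# Hypothetical text data: roughly 1% of the cells contain the search string
df = pd.DataFrame({
    c: np.where(rng.random(n_rows) < 0.01, "a hit here", "miss")
    for c in cols
})

# Each column's mask is converted to a SparseArray with False as the fill value
sparse_mask = pd.DataFrame({
    c: pd.arrays.SparseArray(
        df[c].str.contains(s, regex=False).to_numpy(dtype=bool),
        fill_value=False,
    )
    for c in cols
})

# Dense copy, built here only to compare memory footprints
dense_mask = sparse_mask.sparse.to_dense()
sparse_bytes = sparse_mask.memory_usage(index=False).sum()
dense_bytes = dense_mask.memory_usage(index=False).sum()

# Counterpart of the removed df.to_sparse(0): astype with an explicit fill value
df_int = pd.DataFrame(rng.choice((0, 1), size=(1000, 100), p=(0.99, 0.01)))
df_int_sparse = df_int.astype(pd.SparseDtype("int64", 0))
```

As in the answer above, the savings depend on choosing the majority value (False or 0 here) as the fill value.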

Huge sparse dataframe to scipy sparse matrix without dense transform

I have data with more than 1 million rows and 30 columns; one of the columns is user_id (more than 1500 different users).
I want to one-hot encode this column and use the data in ML algorithms (xgboost, FFM, scikit). But because of the huge number of rows and unique user values, the matrix will be ~1 million × 1500, so I need to do this in a sparse format (otherwise the data eats all the RAM).
For me, the convenient way to work with the data is through a pandas DataFrame, which now also supports a sparse format:
df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
This works pretty fast and has a small footprint in RAM. But to work with scikit algorithms and xgboost, it is necessary to transform the dataframe into a sparse matrix.
Is there any way to do this other than iterating through the columns and hstacking them into one scipy sparse matrix?
I tried df.as_matrix() and df.values, but both first convert the data to dense, which raises a MemoryError :(
P.S.
The same applies to getting a DMatrix for xgboost.
UPDATE:
So I came up with the following solution (I would be thankful for optimisation suggestions):
import pandas as pd
from scipy import sparse

def sparse_df_to_saprse_matrix(sparse_df):
    index_list = sparse_df.index.values.tolist()
    matrix_columns = []
    sparse_matrix = None
    for column in sparse_df.columns:
        sps_series = sparse_df[column]
        # to_coo() needs a two-level MultiIndex: (row label, column label)
        sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
        curr_sps_column, rows, cols = sps_series.to_coo()
        # "is not None" avoids an element-wise comparison on the sparse matrix
        if sparse_matrix is not None:
            sparse_matrix = sparse.hstack([sparse_matrix, curr_sps_column])
        else:
            sparse_matrix = curr_sps_column
        matrix_columns.extend(cols)
    return sparse_matrix, index_list, matrix_columns
And the following code produces the sparse dataframe:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
full_sparse_df = one_hot_df.to_sparse(fill_value=0)
I have created a sparse matrix of 1.1 million rows × 1150 columns. But creating it still uses a significant amount of RAM (~10 GB, close to the limit of my 12 GB).
I don't know why, because the resulting sparse matrix uses only 300 MB (after loading from HDD). Any ideas?
You should be able to use the experimental .to_coo() method in pandas [1] in the following way:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
one_hot_df, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()
Instead of taking a DataFrame (rows / columns), this method takes a Series whose rows and columns are the levels of a MultiIndex (this is why you need the .stack() method). This Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So you need to call .to_sparse() before calling .to_coo().
The Series returned by .stack(), even though it is not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).
[1] http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
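In more recent pandas versions, SparseSeries and SparseDataFrame no longer exist, and a DataFrame whose columns are all sparse exposes a .sparse accessor instead. A minimal sketch under that assumption follows; the toy data is made up, and scipy must be installed. Passing dtype=np.uint8 is a choice made here to get a numeric matrix with 0 as the fill value.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-in for the real data; both columns are one-hot encoded,
# so every resulting column is sparse (required by the .sparse accessor)
df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 3],
    "type": ["a", "b", "a", "b", "a", "b"],
})

one_hot_df = pd.get_dummies(df, columns=["user_id", "type"], sparse=True, dtype=np.uint8)

# Convert straight to a scipy COO matrix, then to CSR for row slicing and estimators
coo = one_hot_df.sparse.to_coo()
csr = coo.tocsr()
```

If the frame also contains dense columns, the accessor is not available on the whole frame, so one option is to select only the sparse dummy columns before converting.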
Does my answer from a few months back help?
Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory
It was accepted but I didn't get any further feedback.
I'm familiar with the scipy sparse formats and their inputs, but don't know much about pandas sparse.

Pandas idiom for attaching a predictions column to a dataframe

What is the Pandas idiom for attaching the results of a prediction to the dataframe on which the prediction was made?
For example, if I have something like (where QualityLog is the result of a statsmodels fit)
qualityTrain = quality_data[some_selection_criterion]
pred1 = QualityLog.predict(qualityTrain)
qualityTrain = pd.concat([qualityTrain, pd.DataFrame(pred1, columns=['Pred1'])], axis=1)
the 'Pred1' values are not aligned correctly with the rest of qualityTrain. If I modify the last line so that it reads
...pd.DataFrame(pred1, columns=['Pred1'], index=qualityTrain.index)...
I get the results I expect.
Is there a better idiom for attaching results to a dataframe where the dataframe may have an arbitrary index?
You can just do
qualityTrain['Pred1'] = pred1
Note that we're (statsmodels) going to have pandas-in, pandas-out for predict pretty soon, so it'll hopefully alleviate some of these pain points.
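The difference comes down to alignment: assigning an ndarray is positional, while concat aligns on the index. A small sketch with made-up data follows, where quality_train and pred1 are stand-ins and pred1 is assumed to be a plain ndarray, as in the question.

```python
import numpy as np
import pandas as pd

quality_data = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=[10, 11, 12, 13, 14])
quality_train = quality_data[quality_data["x"] > 2].copy()  # index is now 12, 13, 14

# Stand-in for model.predict(...): an ndarray with no index
pred1 = quality_train["x"].to_numpy() * 2

# concat aligns on labels: the new frame gets index 0, 1, 2, so nothing lines up
misaligned = pd.concat(
    [quality_train, pd.DataFrame(pred1, columns=["Pred1"])], axis=1
)

# Column assignment of an ndarray is positional, so the existing index is kept
quality_train["Pred1"] = pred1
```

If the predictions come back as a Series with its own index, column assignment aligns on that index instead, so passing index=qualityTrain.index (or converting to an ndarray first) remains the safe choice.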