What is the best way to drop NaNs and match the lengths of two datasets when running a pearsonr correlation?
I am currently running a pearsonr correlation between returns and various fundamental indicators. The only issue is that the data contain NaNs: if I leave them in, pearsonr returns nan, and if I dropna() each series separately, I end up with datasets of different sizes and get an error about the shapes.
operands could not be broadcast together with shapes (469099,) (539093,)
It is not clear from the question exactly what you are trying to do; however, I assume you are trying to drop NaNs from the data so that both sets match in shape. If you are running dropna(), make sure to pass inplace=True as a parameter or to assign the result back to a dataframe.
Either
df.dropna(inplace=True)
or
df = df.dropna()
You can also check: Can't drop NAN with dropna in pandas
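To keep the two series the same length, one option is to drop the NaNs jointly from a single DataFrame rather than from each series separately. A minimal sketch, with toy series standing in for the real returns and fundamental indicator data:

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Toy stand-ins for the real returns / indicator series (illustrative only)
returns = pd.Series([0.01, np.nan, 0.03, -0.02, 0.05])
indicator = pd.Series([1.2, 0.8, np.nan, 1.1, 0.9])

# Dropping NaNs on the combined frame keeps both columns aligned and equal in length
aligned = pd.DataFrame({"returns": returns, "indicator": indicator}).dropna()
r, p = pearsonr(aligned["returns"], aligned["indicator"])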
I'm trying to drop NAs by column in Dask, given a certain threshold, and I receive the error below.
This should be working, but it isn't. Please advise.
Reproducible example:
import pandas as pd
import numpy as np
import dask.dataframe as dd

# Create the pandas DataFrame
data = [['tom', 10], ['nick', 15], ['juli', 5]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Introduce a missing value and convert to a Dask DataFrame
df = df.replace(5, np.nan)
ddf = dd.from_pandas(df, npartitions=2)

# This fails: Dask's dropna does not accept an axis argument
ddf.dropna(axis='columns')
Passing axis is not supported for Dask dataframes as of now. You can also print the docstring of the function via ddf.dropna? and it will tell you the same:
Signature: ddf.dropna(how='any', subset=None, thresh=None)
Docstring:
Remove missing values.
This docstring was copied from pandas.core.frame.DataFrame.dropna.
Some inconsistencies with the Dask version may exist.
See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0 (Not supported in Dask)
Determine if rows or columns which contain missing values are
removed.
* 0, or 'index' : Drop rows which contain missing values.
* 1, or 'columns' : Drop columns which contain missing value.
.. versionchanged:: 1.0.0
Pass tuple or list to drop on multiple axes.
Only a single axis is allowed.
how : {'any', 'all'}, default 'any'
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.
* 'any' : If any NA values are present, drop that row or column.
* 'all' : If all values are NA, drop that row or column.
thresh : int, optional
Require that many non-NA values.
subset : array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False (Not supported in Dask)
If True, do operation inplace and return None.
Returns
-------
DataFrame or None
DataFrame with NA entries dropped from it or None if ``inplace=True``.
It is worth noting that the Dask documentation is copied from pandas in many instances like this. But wherever it does this, it specifically states:
This docstring was copied from pandas.core.frame.DataFrame.drop. Some
inconsistencies with the Dask version may exist.
Therefore it's always best to check the docstring of Dask's pandas-derived functions instead of relying on the documentation alone.
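Outside of IPython's ? syntax, the same docstring can be printed with the standard help() call:

# Equivalent to `ddf.dropna?` in an IPython/Jupyter session
help(ddf.dropna)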
The reason this isn't supported in Dask is that it requires computing the entire dataframe for Dask to know the shape of the result. This is very different from the row-wise case, where the number of columns and partitions won't change, so the operation can be scheduled without doing any work.
Dask deliberately omits parts of the pandas API that look like ordinary pandas operations but in reality can't be scheduled without triggering a compute on the current frame. You're running into this issue by design: while .dropna(axis=0) works just fine as a scheduled operation, .dropna(axis=1) has a very different implication.
You can do this manually with the following:
ddf[ddf.columns[~ddf.isna().any(axis=0)]]
but the filtering operation ddf.columns[~ddf.isna().any(axis=0)] will trigger a compute on the whole dataframe. It probably makes sense to persist prior to running this if you can fit the dataframe in your cluster's memory.
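A sketch of the same workaround with the compute made explicit (the intermediate names are illustrative):

# Column-wise reduction: True for every column that contains at least one NaN.
# .compute() materialises this as a small pandas Series indexed by column name.
has_nan = ddf.isna().any(axis=0).compute()

# Selecting the surviving columns is cheap once the boolean Series is in memory.
keep = ddf.columns[~has_nan]
ddf_clean = ddf[list(keep)]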
I am trying to split my data set into train and test sets by using:
for train_set, test_set in stratified.split(complete_df, complete_df["loan_condition_int"]):
    stratified_train = complete_df.loc[train_set]
    stratified_test = complete_df.loc[test_set]
My dataframe complete_df does not have any NaN values. I made sure of this with complete_df.isnull().sum().max(), which returned 0.
But I still get a warning saying:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
This leads to an error later on. I have tried some techniques I found online, but it still isn't fixed.
First, you should clarify what stratified is. I'm assuming it's sklearn's StratifiedShuffleSplit object.
my data set complete_df does not have any NAN value.
The "missing labels" in the warning message don't refer to missing values, i.e. NaNs. The warning is saying that train_set and/or test_set contain values (labels) that are not present in the index of complete_df. That's because .loc performs indexing based on row (and column) labels, not row position, while train_set and test_set contain row numbers. So if the index of your DataFrame doesn't coincide with the integer positions of its rows, which seems to be the case, the warning is raised.
To select by row position, use .iloc instead. This should work:
for train_set, test_set in stratified.split(complete_df, complete_df["loan_condition_int"]):
    stratified_train = complete_df.iloc[train_set]
    stratified_test = complete_df.iloc[test_set]
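A small standalone illustration of the difference, with toy data and a non-default index (not the asker's data):

import pandas as pd

# The index labels (100, 200, 300) deliberately differ from the row positions (0, 1, 2)
df = pd.DataFrame({"x": [10, 20, 30]}, index=[100, 200, 300])

df.iloc[[0, 2]]   # positional: returns the rows labelled 100 and 300
df.loc[[0, 2]]    # label-based: labels 0 and 2 don't exist, so this warns and eventually raises a KeyError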
I'm trying to change the NaN values of item_price to the mean value based on item_id
in the following dask dataframe:
all_data['item_price'] = all_data[['item_id','item_price']].groupby('item_id')['item_price'].apply(lambda x: x.fillna(x.mean()))
all_data.head()
Unfortunately I get the following error:
ValueError: cannot reindex from a duplicate axis
Any idea how to avoid this error, or any other way to change NaN values to mean values in a Dask dataframe?
I found a solution to the problem. Fillna along with map can be used instead:
all_data['item_price'] = all_data['item_price'].fillna(
    all_data['item_id'].map(
        # per-item mean prices; .compute() turns the Dask result into a pandas Series
        all_data.groupby('item_id')['item_price'].mean().compute()
    )
)
This gets rid of the duplicate axis problem. Note that you have to call .compute() inside map(), as shown above, for it to work without an error.
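For comparison, in plain pandas (without Dask) the same per-group fill can be written with groupby().transform(); this is only a sketch of the equivalent logic on toy data, not something that runs unchanged on a Dask dataframe:

import numpy as np
import pandas as pd

# Toy data with a missing price for one of the items (illustrative only)
pdf = pd.DataFrame({
    "item_id": [1, 1, 2, 2],
    "item_price": [10.0, np.nan, 5.0, 7.0],
})

# Fill each NaN with the mean price of its item_id group
pdf["item_price"] = pdf["item_price"].fillna(
    pdf.groupby("item_id")["item_price"].transform("mean")
)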
I have two numpy arrays, one of which contains about 1% NaNs.
import numpy as np

a = np.array([-2, 5, np.nan, 6])
b = np.array([2, 3, 1, 0])
I'd like to compute the mean squared error of a and b using sklearn's mean_squared_error.
So my question is, what's the pythonic way of removing all NaNs from a while at the same time deleting all corresponding entries from b as efficiently as possible?
You can simply use vanilla NumPy's np.nanmean for this purpose:
In [136]: np.nanmean((a-b)**2)
Out[136]: 18.666666666666668
If this didn't exist, or you really wanted to use the sklearn method, you could create a mask that filters out the NaNs:
In [148]: mask = ~np.isnan(a)
In [149]: mean_squared_error(a[mask], b[mask])
Out[149]: 18.666666666666668
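If b could also contain NaNs (not the case in the question, so this is just an extension of the same idea), combine the two masks:

from sklearn.metrics import mean_squared_error

# keep only the positions where both arrays are non-NaN
mask = ~np.isnan(a) & ~np.isnan(b)
mean_squared_error(a[mask], b[mask])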
I have the following code for mask filtering of df:
for i, y in enumerate(cols):
    dfm = df[y].str.contains(s)
    mask = dfm if i == 0 else np.column_stack((mask, dfm))
df is not sparse, but the resulting filter mask is.
Storing the mask as a full boolean array consumes a lot of memory for a large dataframe (50 million rows × 100 columns).
So, since the mask is very sparse (0.1% is True), I'm wondering if there is a way to use a sparse boolean mask instead of a dense array mask in order to reduce the memory load.
I could not find any solution, even though pandas already has sparse arrays, since it is not clear how to use them for storing and applying the mask, i.e.
mask_sparse = pd.SparseArray(mask)
EDIT 2: Clarification of the question:
Can we get the filter result mask directly into a sparse array, without manipulating the full dense array?
You can create sparse dataframes easily. But there is one major gotcha!
Consider the following dataframe df and its memory footprint:
# 10,000,000 cells (10,000 rows × 1,000 columns) with 1% ones and 99% zeros
df = pd.DataFrame(np.random.choice((0, 1), size=(10000, 1000), p=(.99, .01)))
df.memory_usage().sum()
80000080
Let's try to sparsify
df_sparse = df.to_sparse()
df_sparse.memory_usage().sum()
80000080
Hmm, that didn't do anything. That's because we need to specify the value that is the majority placeholder. Let's see:
df_sparse_2 = df.to_sparse(1)
df_sparse_2.memory_usage().sum()
79196744
And
df_sparse_3 = df.to_sparse(0)
df_sparse_3.memory_usage().sum()
803416
That's better. Make sure to specify the placeholder value.
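Note that DataFrame.to_sparse was deprecated in pandas 0.25 and removed in pandas 1.0. On recent versions the equivalent is to cast to a sparse dtype, passing the majority placeholder as the fill value; a minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice((0, 1), size=(10000, 1000), p=(.99, .01)))

# 0 is the fill value (the majority placeholder), so zeros are not stored explicitly
df_sparse = df.astype(pd.SparseDtype("int64", 0))
df_sparse.memory_usage().sum()   # far smaller than the dense footprint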