Trying to drop values by column (I convert these values to NaN, but they could be anything) not working - pandas

I'm trying to drop NAs by column in Dask, given a certain threshold, but I receive an error. As far as I can tell this should work. Please advise.
Here is a reproducible example:
import pandas as pd
import numpy as np
import dask.dataframe as dd

data = [['tom', 10], ['nick', 15], ['juli', 5]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
df = df.replace(5, np.nan)
ddf = dd.from_pandas(df, npartitions=2)
ddf.dropna(axis='columns')

Passing axis is not supported for Dask DataFrames as of now. You can also print the docstring of the function via ddf.dropna? and it will tell you the same:
Signature: ddf.dropna(how='any', subset=None, thresh=None)
Docstring:
Remove missing values.
This docstring was copied from pandas.core.frame.DataFrame.dropna.
Some inconsistencies with the Dask version may exist.
See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0 (Not supported in Dask)
    Determine if rows or columns which contain missing values are
    removed.

    * 0, or 'index' : Drop rows which contain missing values.
    * 1, or 'columns' : Drop columns which contain missing value.

    .. versionchanged:: 1.0.0

       Pass tuple or list to drop on multiple axes.
       Only a single axis is allowed.

how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have
    at least one NA or all NA.

    * 'any' : If any NA values are present, drop that row or column.
    * 'all' : If all values are NA, drop that row or column.

thresh : int, optional
    Require that many non-NA values.
subset : array-like, optional
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include.
inplace : bool, default False (Not supported in Dask)
    If True, do operation inplace and return None.

Returns
-------
DataFrame or None
    DataFrame with NA entries dropped from it or None if ``inplace=True``.
Worth noting that the Dask documentation is copied from pandas in many instances like this. But wherever it does so, it specifically states:
This docstring was copied from pandas.core.frame.DataFrame.drop. Some
inconsistencies with the Dask version may exist.
Therefore it's always best to check the docstring of Dask's pandas-derived functions instead of relying on the documentation alone.

The reason this isn't supported in dask is because it requires computing the entire dataframe in order for dask to know the shape of the result. This is very different from the row-wise case, where the number of columns and partitions won't change, so the operation can be scheduled without doing any work.
Dask deliberately omits parts of the pandas API that look like ordinary pandas operations but in reality can't be scheduled without triggering a compute on the current frame. You're running into this by design: while .dropna(axis=0) would work just fine as a scheduled operation, .dropna(axis=1) has a very different implication.
You can do this manually with the following:
ddf[ddf.columns[~ddf.isna().any(axis=0)]]
but the filtering operation ddf.columns[~ddf.isna().any(axis=0)] will trigger a compute on the whole dataframe. It probably makes sense to persist prior to running this if you can fit the dataframe in your cluster's memory.
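If you prefer to make that compute explicit, here is a minimal sketch of the same idea (continuing the ddf from the example above; the variable names are my own): compute the per-column "any NaN" mask once, then select the surviving columns.
na_mask = ddf.isna().any(axis=0).compute()       # pandas Series indexed by column name
cols_to_keep = na_mask[~na_mask].index.tolist()  # columns with no missing values
ddf_no_na_cols = ddf[cols_to_keep]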

Related

Convert type object column to float

I have a table with a column named "price". This column is of type object, so it contains numbers as strings as well as NaN or '?' characters. I want to find the mean of this column, but first I have to remove the NaN and '?' values and convert it to float.
I am using the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('Automobile_data.csv', sep = ',')
df = df.dropna('price', inplace=True)
df['price'] = df['price'].astype('int')
df['price'].mean()
But, this doesn't work. The error says:
ValueError: No axis named price for object type DataFrame
How can I solve this problem?
Edit: in pandas 1.3 and earlier, you need subset=[col] wrapped in a list/array. In version 1.4 and greater you can pass a single column as a string.
You've got a few problems:
df.dropna() takes the axis and then the subset: axis selects rows or columns, and subset is which labels along the other axis to check for missing values. So you want this to be (I think) df.dropna(axis='rows', subset='price')
Using inplace=True makes the whole call return None, so you have set df = None. You don't want that. If you use inplace=True, don't assign the result to anything; the whole line would just be df.dropna(..., inplace=True).
Better: don't use inplace=True and just do the assignment, i.e. df = df.dropna(axis='rows', subset='price')
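Since the column also contains '?' strings (per the question), here is a minimal sketch that handles both issues at once: coerce anything non-numeric to NaN with pd.to_numeric, drop those rows, then take the mean. The file and column names are taken from the question.
import pandas as pd

df = pd.read_csv('Automobile_data.csv', sep=',')
# Coerce non-numeric entries such as '?' to NaN, then drop them
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df = df.dropna(subset=['price'])
print(df['price'].mean())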

Sparse columns in pandas: directly access the indices of non-null values

I have a large dataframe (approx. 10^8 rows) with some sparse columns. I would like to be able to quickly access the non-null values in a given column, i.e. the values that are actually saved in the array. I figured that this could be achieved by df.<column name>[<indices of non-null values>]. However, I can't see how to access <indices of non-null values> directly, i.e. without any computation. When I try df.<column name>.index it tells me that it's a RangeIndex, which doesn't help. I can even see <indices of non-null values> when I run df.<column name>.values, but looking through dir(df.<column name>.values) I still can't see a way to access them.
To make clear what I mean, here is a toy example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1], [np.nan], [4], [np.nan], [9]]),
                  columns=['a'])
In this example <indices of non-null values> is [0, 2, 4].
EDIT: The answer below by @Piotr Żak is a viable solution, but it requires computation. Is there a way to access <indices of non-null values> directly via an attribute of the column or array?
Just filter without NaN:
filtered_df = df[df['a'].notnull()]
Transform the column from the filtered df to an array:
s_array = filtered_df[["a"]].to_numpy()
Or get the indexes of the non-null values from the filtered df as a list:
filtered_df.index.tolist()
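Putting those pieces together on the toy frame (a sketch; np.flatnonzero is just one alternative way to get the positions without building a filtered frame):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1], [np.nan], [4], [np.nan], [9]]), columns=['a'])

mask = df['a'].notnull()
non_null_indices = df.index[mask].tolist()      # [0, 2, 4]
non_null_values = df.loc[mask, 'a'].to_numpy()  # array([1., 4., 9.])

# Alternative: positional indices straight from the boolean mask
positions = np.flatnonzero(df['a'].notna().to_numpy())  # array([0, 2, 4])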

Using Dask Delayed on Small/Partitioned Dataframes

I am working with time series data that is formatted as each row is a single instance of a ID/time/data. This means that the rows don't correspond 1 to 1 for each ID. Each ID has many rows across time.
I am trying to use dask delayed to have a function run on an entire ID sequence (it makes sense that the operation should be able to run on each individual ID at the same time, since they don't affect each other). To do this I first loop through each of the ID tags, pull/locate all the data for that ID (with .loc in pandas, so it is a separate "mini" df), then delay the function call on the mini df, add a column with the delayed values, and append it to a list of all mini dfs. At the end of the for loop I want to call dask.compute() on all the mini dfs at once, but for some reason the mini dfs' values are still delayed. Below is some pseudocode for what I just described.
I have a feeling that this may not be the best way to go about it, but it's what made sense at the time, and I can't understand what's wrong, so any help would be very much appreciated.
Here is what I am trying to do:
list_of_mini_dfs = []
for id in big_df:
    curr_df = big_df.loc[big_df['id'] == id]
    curr_df['new value 1'] = dask.delayed(myfunc)(args1)
    curr_df['new value 2'] = dask.delayed(myfunc)(args2)  # same func as previous line
    list_of_mini_dfs.append(curr_df)
list_of_mini_dfs = dask.delayed(list_of_mini_dfs).compute()
# Concat all mini dfs into new big df.
As you can see from the code, I have to reach into my big/overall dataframe to pull out each ID's sequence of data, since it is interspersed throughout the rows. I want to be able to call a delayed function on that single ID's data and then return the values from the function call into the big/overall dataframe.
Currently this method is not working: when I concat all the mini dataframes back together, the two values I delayed are still delayed. This leads me to think it is due to the way I am delaying a function within a df and then trying to compute the list of dataframes, but I just can't see how to fix it.
Hopefully this was relatively clear and thank you for the help.
IIUC you are trying to do a sort of transform using dask.
import pandas as pd
import dask.dataframe as dd
import numpy as np

# generate big_df
dates = pd.date_range(start='2019-01-01',
                      end='2020-01-01')
l = len(dates)
out = []
for i in range(1000):
    df = pd.DataFrame({"ID": [i] * l,
                       "date": dates,
                       "data0": np.random.randn(l),
                       "data1": np.random.randn(l)})
    out.append(df)

big_df = pd.concat(out, ignore_index=True)\
           .sample(frac=1)\
           .reset_index(drop=True)
Now you want to apply your function fun to the columns data0 and data1.
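fun is not defined in the question, so for the snippets below assume it is some per-group aggregation that returns one row per ID, for example the per-ID mean:
def fun(x):
    # placeholder for the real per-ID computation
    return x.mean()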
Pandas
out = big_df.groupby("ID")[["data0", "data1"]]\
            .apply(fun)\
            .reset_index()
df_pd = pd.merge(big_df, out, how="left", on="ID")
Dask
df = dd.from_pandas(big_df, npartitions=4)
out = df.groupby("ID")[["data0", "data1"]]\
        .apply(fun, meta={'data0': 'f8',
                          'data1': 'f8'})\
        .rename(columns={'data0': 'new_values0',
                         'data1': 'new_values1'})\
        .compute()  # Here you need to compute otherwise you'll get NaNs
df_dask = dd.merge(df, out,
                   how="left",
                   left_on=["ID"],
                   right_index=True)
The Dask version is not necessarily faster than the pandas one, particularly if your df fits in RAM.

Reshaping for Pearsonr correlation

What is the best way to drop values and match the lengths of two datasets when computing a pearsonr correlation?
I am currently running a pearsonr correlation between returns and various fundamental indicators. The only issue is when I have NaNs: if I run it as-is I get nan, and if I dropna() on each dataset separately I end up with different-size datasets and get an error about the shapes:
operands could not be broadcast together with shapes (469099,) (539093,)
It is not clear from the question what you are trying to do; however, I assume you are trying to drop NAs from the data so that both sets match in shape. If you are running dropna(), make sure to either set inplace=True as a parameter or assign the result back to a dataframe.
Either
df.dropna(inplace = True)
or
df = df.dropna()
You can also check: Can't drop NAN with dropna in pandas
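If the returns and the indicator live in separate objects, the usual fix is to align them in one frame and drop rows where either value is missing, so both arrays passed to pearsonr have the same length. A minimal sketch with made-up data:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

returns = pd.Series([0.01, np.nan, 0.03, -0.02, 0.05])
indicator = pd.Series([1.2, 0.8, np.nan, 1.1, 0.9])

# Align both series and drop any row where either is NaN
paired = pd.DataFrame({"returns": returns, "indicator": indicator}).dropna()
r, p = pearsonr(paired["returns"], paired["indicator"])
print(r, p)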

Slice Spark’s DataFrame SQL by row (pyspark)

I have a Spark DataFrame saved as a parquet file, which can be read by Spark as follows:
df = sqlContext.read.parquet('path_to/example.parquet')
df.registerTempTable('temp_table')
I want to slice my dataframe, df, by row (i.e. the equivalent of df.iloc[0:4000], df.iloc[4000:8000], etc. in a pandas dataframe), since I want to convert each small chunk to a pandas dataframe to work on later. I only know how to do it by sampling a random fraction, i.e.
df_sample = df.sample(False, fraction=0.1) # sample 10 % of my data
df_pandas = df_sample.toPandas()
It would be great if there were a method to slice my dataframe df by row. Thanks in advance.
You can use monotonically_increasing_id() to add an ID column to your dataframe and use that to get a working set of any size.
import pyspark.sql.functions as f
# add an index column
df = df.withColumn('id', f.monotonically_increasing_id())
# Sort by index and get first 4000 rows
working_set = df.sort('id').limit(4000)
Then, you can remove the working set from your dataframe using subtract().
# Remove the working set, and use this `df` to get the next working set
df = df.subtract(working_set)
Rinse and repeat until you're done processing all rows (see the sketch below). It's not the ideal way to do things, but it works. Also consider filtering your Spark dataframe down to just the rows and columns you need before converting it to pandas.
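A rough sketch of that loop (chunk_size and pandas_chunks are my own names; note that count() and subtract() are not free, so this re-scans the remaining data on every pass):
import pyspark.sql.functions as f

chunk_size = 4000
remaining = df.withColumn('id', f.monotonically_increasing_id())

pandas_chunks = []
while remaining.count() > 0:
    working_set = remaining.sort('id').limit(chunk_size)
    pandas_chunks.append(working_set.toPandas())  # work on each chunk in pandas
    remaining = remaining.subtract(working_set)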