Convert type object column to float - pandas

I have a table with a column named "price". The column is of type object: it contains numbers stored as strings, plus NaN and ? placeholders. I want to compute the mean of this column, but first I have to remove the NaN and ? values and convert the rest to float.
I am using the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('Automobile_data.csv', sep = ',')
df = df.dropna('price', inplace=True)
df['price'] = df['price'].astype('int')
df['price'].mean()
But, this doesn't work. The error says:
ValueError: No axis named price for object type DataFrame
How can I solve this problem?

Edit: in pandas 1.3 and earlier, subset must be wrapped in a list/array, e.g. subset=['price']. In version 1.4 and later you can pass a single column name as a plain string.
You've got a few problems:
df.dropna()'s arguments are the axis and then the subset: axis selects rows/columns, and subset says which labels along the other axis to look at. So you want this to be (I think) df.dropna(axis='rows', subset='price').
Using inplace=True makes the whole call return None, so you have set df = None. You don't want that. If you use inplace=True, you don't assign the result to anything; the whole line would just be df.dropna(..., inplace=True).
Better, don't use inplace=True at all, just do the assignment: df = df.dropna(axis='rows', subset='price').
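Putting the pieces together, a minimal sketch of the whole pipeline. It also coerces the ? placeholders mentioned in the question to NaN via pd.to_numeric; the file and column names are taken from the question.
import pandas as pd

df = pd.read_csv('Automobile_data.csv', sep=',')

# errors='coerce' turns '?' (or anything else non-numeric) into NaN,
# and the result is already float, so no separate astype is needed.
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df = df.dropna(subset=['price'])  # list form works on every pandas version
print(df['price'].mean())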

Related

Trying to drop values by column (I convert these values to NaN, but they could be anything) not working

Trying to drop NAs by column in Dask, given a certain threshold, I receive the error below. As far as I can tell this should be working. Please advise.
Reproducible example:
import pandas as pd
import numpy as np
import dask.dataframe as dd

data = [['tom', 10], ['nick', 15], ['juli', 5]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
df = df.replace(5, np.nan)
ddf = dd.from_pandas(df, npartitions=2)
ddf.dropna(axis='columns')
Passing axis is not supported for Dask dataframes as of now. You can also print the function's docstring via ddf.dropna? and it will tell you the same:
Signature: ddf.dropna(how='any', subset=None, thresh=None)
Docstring:
Remove missing values.

This docstring was copied from pandas.core.frame.DataFrame.dropna.
Some inconsistencies with the Dask version may exist.

See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0 (Not supported in Dask)
    Determine if rows or columns which contain missing values are
    removed.

    * 0, or 'index' : Drop rows which contain missing values.
    * 1, or 'columns' : Drop columns which contain missing value.

    .. versionchanged:: 1.0.0

       Pass tuple or list to drop on multiple axes.
       Only a single axis is allowed.

how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have
    at least one NA or all NA.

    * 'any' : If any NA values are present, drop that row or column.
    * 'all' : If all values are NA, drop that row or column.

thresh : int, optional
    Require that many non-NA values.
subset : array-like, optional
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include.
inplace : bool, default False (Not supported in Dask)
    If True, do operation inplace and return None.

Returns
-------
DataFrame or None
    DataFrame with NA entries dropped from it or None if ``inplace=True``.
Worth noting that the Dask documentation is copied from pandas in many instances like this. But wherever that is the case, it specifically states:
This docstring was copied from pandas.core.frame.DataFrame.drop. Some
inconsistencies with the Dask version may exist.
Therefore it's always best to check the docstring of Dask's pandas-derived functions instead of relying on the documentation alone.
The reason this isn't supported in Dask is that it requires computing the entire dataframe just for Dask to know the shape of the result. This is very different from the row-wise case, where the number of columns and partitions won't change, so the operation can be scheduled without doing any work.
Dask leaves out some parts of the pandas API that look like ordinary operations that could be ported, but in reality can't be scheduled without triggering a compute on the current frame. You're running into this by design: while .dropna(axis=0) works just fine as a scheduled operation, .dropna(axis=1) has a very different implication.
You can do this manually with the following:
ddf[ddf.columns[~ddf.isna().any(axis=0)]]
but the filtering operation ddf.columns[~ddf.isna().any(axis=0)] will trigger a compute on the whole dataframe. It probably makes sense to persist prior to running this if you can fit the dataframe in your cluster's memory.
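For instance, a small sketch of that pattern on toy data, persisting first (assumes the frame fits in your cluster's memory):
import pandas as pd
import numpy as np
import dask.dataframe as dd

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [np.nan, np.nan, np.nan], 'c': [4, np.nan, 6]})
ddf = dd.from_pandas(pdf, npartitions=2)

ddf = ddf.persist()  # materialize once; the NaN scan and the selection both reuse it

# An explicit compute() makes the whole-frame scan visible.
keep = ddf.columns[~ddf.isna().any(axis=0).compute()]
print(ddf[list(keep)].compute())  # only column 'a' survives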

Writing data frame with object dtype to HDF5 only works after converting to string

I have a large dataframe and I want to write it to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way is to cope with this.
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length),})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting various lines leads me to the following.
Just filling the column with numbers leads to dtype int32, and the frame stores without problems.
Setting one element to abc changes the dtype to object, but it seems that to_hdf internally infers another data type and throws an error: TypeError: object of type 'int' has no len()
Explicitly converting the column to str leads to success, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and whether there is a way to prevent this. The only way I found was to go through all columns, check whether they are dtype('O'), and explicitly convert them to str.
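For reference, a sketch of that column-by-column workaround (it assumes every object column really should be stored as text):
# Cast every object-dtype column to str before writing, so to_hdf's
# type inference sees homogeneous text instead of mixed int/str cells.
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].astype(str)
df.to_hdf("df.hdf5", key="data", format="table")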
Instead of using HDF5, I have found a generic pickling library which seems perfect for the job: joblib
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")

Importing Numbers using DataFrame

I'm trying to import numbers from an xlsx file using a pandas DataFrame, but I'm getting the numbers in a slightly different format.
Let's say the number is: 9582*****4
The number I get using this code is 9582*****4.0
df=pd.read_excel("Contacts.xlsx")
for i in range(len(df)):
    print(df.iloc[i, 0])
It was working just fine till last night.
I guess you need to change the data type from float to int:
df = pd.read_excel("Contacts.xlsx")
df = df.astype(int)  # for all columns
df = df.astype({"Column_name": int})  # for a specific column
for i in range(len(df)):
    print(df.iloc[i, 0])
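Note that astype(int) raises on missing values. If the column can contain NaN, one hedged alternative is pandas' nullable Int64 dtype ("Column_name" is a placeholder):
import pandas as pd

df = pd.read_excel("Contacts.xlsx")
# Capital-I 'Int64' is the nullable integer dtype, so NaN cells survive the cast.
df = df.astype({"Column_name": "Int64"})
for i in range(len(df)):
    print(df.iloc[i, 0])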

Holoviz panel will not print pandas dataframe row in Jupyter notebook

I'm trying to recreate the first panel.interact example in the Holoviz tutorial using a Pandas dataframe instead of a Dask dataframe. I get the slider, but the pandas dataframe row does not show.
See the original example at: http://holoviz.org/tutorial/Building_Panels.html
I've tried using Dask as in the Holoviz example. Dask rows print out just fine, but it demonstrates that panel seems to treat Dask dataframe rows differently from Pandas dataframe rows for printing. Here's my minimal code:
import pandas as pd
import panel
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
def select_row(rowno=0):
    row = df.loc[rowno]
    return row
panel.extension()
panel.extension('katex')
panel.interact(select_row, rowno=(0, 5))
I've included a line loading the katex extension because without it I get a warning that it is needed, and the slider doesn't even appear.
I can call the select_row(rowno=0) function separately in a Jupyter cell and get a nice printout of the row, so it appears the function is working as it should.
Any help in getting this to work would be most appreciated. Thanks.
Got a solution. With Pandas, loc[rowno:rowno] returns a pandas.core.frame.DataFrame object of length 1, which works fine with panel, while loc[rowno] returns a pandas.core.series.Series object, which does not work so well. Thus modifying the select_row() function like this makes it all work:
def select_row(rowno=0):
    row = df.loc[rowno:rowno]
    return row
Still not sure, however, why panel will print out the Dataframe object and not the Series object.
Note: if you use iloc, then you need to add 1, i.e., df.iloc[rowno:rowno+1].
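Another option, sketched below on the assumption that panel renders any DataFrame the same way: keep loc[rowno] but turn the Series into a one-row DataFrame before returning it.
def select_row(rowno=0):
    # to_frame() gives a one-column DataFrame; .T flips it into one row
    return df.loc[rowno].to_frame().T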

Dask DataFrame after Apply cannot reindex from a duplicate axis

I'm trying to change NaN values of item_price to the mean value based on item_id
in the following Dask dataframe:
all_data['item_price'] = all_data[['item_id','item_price']].groupby('item_id')['item_price'].apply(lambda x: x.fillna(x.mean()))
all_data.head()
Unfortunately I get the following error:
ValueError: cannot reindex from a duplicate axis
Any idea how to avoid this error, or any other way to change NaN values to mean values for a Dask dataframe?
I found a solution to the problem: fillna along with map can be used instead.
all_data['item_price'] = all_data['item_price'].fillna(
    all_data['item_id'].map(
        all_data.groupby('item_id')['item_price'].mean().compute()
    )
)
This gets rid of the duplicate-axis problem. Beware that you have to call compute() on the grouped mean inside map, as shown, for it to work without an error.
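For completeness, a self-contained sketch of this pattern on toy data (column names as in the question):
import pandas as pd
import numpy as np
import dask.dataframe as dd

pdf = pd.DataFrame({'item_id': [1, 1, 2, 2],
                    'item_price': [10.0, np.nan, 20.0, np.nan]})
all_data = dd.from_pandas(pdf, npartitions=2)

# Compute the per-item means eagerly, then map them back onto each row.
means = all_data.groupby('item_id')['item_price'].mean().compute()
all_data['item_price'] = all_data['item_price'].fillna(all_data['item_id'].map(means))
print(all_data.compute())  # NaNs replaced by 10.0 and 20.0 respectively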