Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative - pandas

I am trying to split my data set into train and test sets by using:
for train_set, test_set in stratified.split(complete_df, complete_df["loan_condition_int"]):
    stratified_train = complete_df.loc[train_set]
    stratified_test = complete_df.loc[test_set]
My dataframe complete_df does not have any NaN values. I made sure of that by running complete_df.isnull().sum().max(), which returned 0.
But I still get a warning saying:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
And it leads to an error later. I tried some techniques I found online, but they still do not fix it.

First, you should clarify what stratified is. I'm assuming it's sklearn's StratifiedShuffleSplit object.
my data set complete_df does not have any NAN value.
The "missing labels" in the warning message do not refer to missing values, i.e. NaNs. The warning is saying that train_set and/or test_set contain values (labels) that are not present in the index of complete_df. That's because .loc performs indexing based on row (and column) labels, not row position, while train_set and test_set hold row positions. So if the index of your DataFrame doesn't coincide with the integer positions of the rows, which seems to be the case, the warning is raised.
To select by row position, use iloc. This should work:
for train_set, test_set in stratified.split(complete_df, complete_df["loan_condition_int"]):
    stratified_train = complete_df.iloc[train_set]
    stratified_test = complete_df.iloc[test_set]
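To make the label-versus-position distinction concrete, here is a minimal sketch on a made-up four-row frame. The index values 10 to 40 and the positions [0, 2] are assumptions for illustration only; they stand in for a complete_df whose index is not 0..n-1 and for the position arrays that a splitter returns.

```python
import pandas as pd

# Hypothetical frame whose index labels (10, 20, 30, 40) differ from the
# row positions (0, 1, 2, 3), as happens after filtering or dropping rows.
df = pd.DataFrame({"loan_condition_int": [0, 1, 0, 1]}, index=[10, 20, 30, 40])

# Splitters return integer row positions, not index labels.
positions = [0, 2]

# .iloc selects by position, so this picks the first and third rows.
by_position = df.iloc[positions]

# .loc looks for the *labels* 0 and 2, which do not exist in this index.
try:
    df.loc[positions]
    loc_failed = False
except KeyError:
    loc_failed = True

# Alternative: reset the index so labels and positions coincide again.
df_reset = df.reset_index(drop=True)
by_label_after_reset = df_reset.loc[positions]
```

If you prefer to keep using .loc, resetting the index once before splitting (as in the last two lines) is one way to make labels and positions agree.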

Related

Convert type object column to float

I have a table with a column named "price". This column is of type object, so it contains numbers as strings and also NaN or ? characters. I want to find the mean of this column, but first I have to remove the NaN and ? values and convert the column to float.
I am using the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('Automobile_data.csv', sep = ',')
df = df.dropna('price', inplace=True)
df['price'] = df['price'].astype('int')
df['price'].mean()
But, this doesn't work. The error says:
ValueError: No axis named price for object type DataFrame
How can I solve this problem?
edit: in pandas version 1.3 and earlier, you need subset=[col] wrapped in a list/array. In version 1.4 and greater you can pass a single column as a string.
You've got a few problems:
The first positional argument of df.dropna() is the axis, so 'price' was interpreted as an axis name, which is what the ValueError is complaining about. The axis is rows/columns, and subset is which of the labels along the other axis to look at. So you want this to be (I think) df.dropna(axis='rows',subset='price')
Using inplace=True makes the call return None, so you have set df = None. You don't want to do that. If you are using inplace=True, then you don't assign the result to anything; the whole line would just be df.dropna(...,inplace=True).
Don't use inplace=True, just do the assignment. That is, you should use df=df.dropna(axis='rows',subset='price')
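The points above do not cover the ? entries mentioned in the question, which would still break the numeric conversion. A minimal sketch of one way to handle them, assuming a small in-memory CSV in place of Automobile_data.csv (the rows and the make column are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical stand-in for Automobile_data.csv: one "?" and one empty price.
csv_text = "make,price\naudi,13950\nbmw,?\nvolvo,\nmazda,6795\n"
df = pd.read_csv(io.StringIO(csv_text))

# Turn anything that is not a number (such as "?") into NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Drop the rows without a usable price; assign the result instead of using inplace.
df = df.dropna(subset=["price"])

mean_price = df["price"].mean()
```

Another option is to pass na_values="?" to pd.read_csv so the placeholders are read as NaN from the start.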

DataFrame to DataFrameRow conversion (Julia)

I'm using Pingouin.jl to test normality.
In their docs, we have
dataset = Pingouin.read_dataset("mediation")
Pingouin.normality(dataset, method="jarque_bera")
Which should return a DataFrame with normality true or false for each name in the dataset.
Currently, this broadcasting is deprecated, and I'm unable to concatenate the result in one DataFrame for each unique-column-output (which is working and outputs a DataFrame).
So, what I have so far.
function var_norm(df)
    norm = DataFrame([])
    for i in 1:1:length(names(df))
        push!(norm, Pingouin.normality(df[!,names(df)[i]], method="jarque_bera"))
    end
    return norm
end
The error I get:
julia> push!(norm, Pingouin.normality(df[!,names(df)[1]], method="jarque_bera"))
ERROR: ArgumentError: `push!` does not allow passing collections of type DataFrame to be pushed into a DataFrame. Only `Tuple`, `AbstractArray`, `AbstractDict`, `DataFrameRow` and `NamedTuple` are allowed.
Stacktrace:
 [1] push!(df::DataFrame, row::DataFrame; promote::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:1603
 [2] push!(df::DataFrame, row::DataFrame)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:1601
 [3] top-level scope
   @ REPL[163]:1
EDIT: The push! call was not written properly in the first version of this post, but the error persists after the change. How can I convert the DataFrame output from Pingouin into a DataFrameRow?
As Pingouin.normality returns a DataFrame, you will have to iterate over its rows and push them one by one:
df = Pingouin.normality(…)
for row in eachrow(df)
    push!(norms, row)
end
If you are sure Pingouin.normality returns a DataFrame with exactly one row, you can simply write
push!(norms, only(Pingouin.normality(…)))

Dask DataFrame after Apply cannot reindex from a duplicate axis

I am trying to replace the NaN values of item_price with the mean value per item_id
in the following dask dataframe:
all_data['item_price'] = all_data[['item_id','item_price']].groupby('item_id')['item_price'].apply(lambda x: x.fillna(x.mean()))
all_data.head()
Unfortunately I get the following error:
ValueError: cannot reindex from a duplicate axis
Any idea how to avoid this error or any other way to change nan values to mean values for a dask dataframe?
I found a solution to the problem. fillna along with map can be used instead:
all_data['item_price'] = all_data['item_price'].fillna(
    all_data['item_id'].map(
        all_data.groupby('item_id')['item_price'].mean().compute()
    )
)
This gets rid of the duplicate axis problem. Note that the group means have to be materialized with compute() before they are passed to map, as in the code above; otherwise it fails with an error.
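The same fillna-plus-map idea can be checked on a tiny example. The sketch below uses plain pandas rather than dask, so no compute() call is needed, and the five rows are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two items, each with one missing price.
all_data = pd.DataFrame(
    {
        "item_id": [1, 1, 2, 2, 2],
        "item_price": [10.0, np.nan, 4.0, np.nan, 8.0],
    }
)

# Mean price per item (NaNs are skipped), indexed by item_id.
group_means = all_data.groupby("item_id")["item_price"].mean()

# Map every row's item_id to its group mean, then use it only where the price is missing.
all_data["item_price"] = all_data["item_price"].fillna(
    all_data["item_id"].map(group_means)
)
```

Because map produces a Series aligned with the original rows, no reindexing on item_id is involved, which is why the duplicate axis error does not come up.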

Reshaping for Pearsonr correlation

What is the best way to drop missing values and keep the datasets the same length when computing a pearsonr correlation?
I am currently running a pearsonr correlation on returns and various fundamental indicators. The only issue is the NaNs: when I leave them in, the result is nan, and when I call dropna() the datasets end up with different sizes and I get an error regarding the shapes.
operands could not be broadcast together with shapes (469099,) (539093,)
It is not clear from the question what you are trying to do; however, I assume you are trying to drop NaN values from the data so that both sets match in shape. If you are running dropna(), make sure to either set inplace=True as a parameter or assign the result back to a dataframe.
Either
df.dropna(inplace = True)
or
df = df.dropna()
You can also check: Can't drop NAN with dropna in pandas
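If the two series are cleaned separately, they can still end up with different lengths, which would explain the shape error in the question. A minimal sketch of dropping NaNs pairwise instead, assuming both series live in one DataFrame under the hypothetical column names returns and indicator:

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column is missing a value in a different row.
df = pd.DataFrame(
    {
        "returns": [0.01, np.nan, 0.03, 0.02, 0.05],
        "indicator": [1.0, 2.0, np.nan, 2.0, 5.0],
    }
)

# Keep only the rows where BOTH columns are present, so the lengths always match.
paired = df[["returns", "indicator"]].dropna()

# Pearson correlation on the aligned columns (pandas' default method).
r = paired["returns"].corr(paired["indicator"])
```

The two aligned columns of paired can also be passed to scipy.stats.pearsonr, since they now have the same length.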

DataFrame.apply(func, raw=True) doesn't seem to take effect?

I am trying to hash together only a few columns of my dataframe df, so I do
temp = df[['field1', 'field2']]
df["hash"] = temp.apply(lambda x: hash(x), raw=True, axis=1)
I set raw to True because the doc (I am using 0.22) says it will pass a numpy array instead of a mutable Series. But even with raw=True I am getting a Series. Why?
File "/nix/store/9ampki9dbq0imhhm7i27qkh56788cjpz-python3.6-pandas-0.22.0/lib/python3.6/site-packages/pandas/core/frame.py", line 4877, in apply
ignore_failures=ignore_failures)
File "/nix/store/9ampki9dbq0imhhm7i27qkh56788cjpz-python3.6-pandas-0.22.0/lib/python3.6/site-packages/pandas/core/frame.py", line 4973, in _apply_standard
results[i] = func(v)
File "/home/teto/mptcpanalyzer/mptcpanalyzer/data.py", line 190, in _hash_row
return hash(x)
File "/nix/store/9ampki9dbq0imhhm7i27qkh56788cjpz-python3.6-pandas-0.22.0/lib/python3.6/site-packages/pandas/core/generic.py", line 1045, in __hash__
' hashed'.format(self.__class__.__name__))
TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 1')
It's strange, as I can't reproduce your exact error (that is, for me, raw=True indeed results in an np.ndarray being passed). In any case, neither a Series nor an np.ndarray is hashable. The following works, though:
temp.apply(lambda x: hash(tuple(x)), axis=1)
A tuple is hashable.
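A small self-contained sketch of the tuple approach follows. The three rows and the extra column other are made up for illustration; field1 and field2 are the column names from the question.

```python
import pandas as pd

# Hypothetical frame: the first two rows agree on field1/field2, the third differs.
df = pd.DataFrame(
    {
        "field1": [1, 1, 3],
        "field2": [2, 2, 4],
        "other": [7, 8, 9],
    }
)

# Select the columns to hash with a list of names (note the double brackets).
temp = df[["field1", "field2"]]

# Convert each row to a tuple first; tuples of hashable values are hashable.
df["hash"] = temp.apply(lambda row: hash(tuple(row)), axis=1)
```

Rows with equal values in the selected columns get the same hash, so the new column can be used to spot duplicates across those fields. Keep in mind that Python randomizes string hashes per process, so such values are not stable across runs if the columns contain strings.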