AttributeError: 'Index' object has no attribute 'mean' - dataframe

I am encountering an error that I can't get past.
I have split my columns according to the number of unique values in each variable:
cats_df = df.columns[df.nunique() < 6]
num_df = df.columns[df.nunique()>= 6]
Then I wanted to replace the missing values in the numerical columns (those with >= 6 unique values) with the column mean:
num_df = num_df.fillna(num_df.mean())
But I get this error message :
AttributeError Traceback (most recent call last)
<ipython-input-22-59bfd4048c41> in <module>
----> 1 num_df = num_df.fillna(num_df.mean())
      2 num_df
AttributeError: 'Index' object has no attribute 'mean'
Can you help me solve this problem?

The problem is that num_df is an Index (the column labels), not a DataFrame. You probably need something like this:
num_df = df[df.columns[df.nunique()>= 6]]
num_df = num_df.fillna(num_df.mean())
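For completeness, a minimal sketch of the same idea using a boolean mask (assuming df is your original DataFrame), which gives you both sub-DataFrames at once:

mask = df.nunique() >= 6          # boolean Series indexed by column name
num_df = df.loc[:, mask]          # numerical columns as a DataFrame, not an Index
cats_df = df.loc[:, ~mask]        # low-cardinality columns as a DataFrame
num_df = num_df.fillna(num_df.mean())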

Related

Pandasql returns error with a basic example

The following code, when run,
import pandas as pd
from pandasql import sqldf
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [10, 20, 30, 40]})
query = "SELECT * FROM df WHERE col1 > 2"
result = sqldf(query, globals())
print(result)
gives the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
File ~/.virtualenvs/r-reticulate/lib64/python3.11/site-packages/sqlalchemy/engine/base.py:1410, in Connection.execute(self, statement, parameters, execution_options)
1409 try:
-> 1410 meth = statement._execute_on_connection
1411 except AttributeError as err:
AttributeError: 'str' object has no attribute '_execute_on_connection'
The above exception was the direct cause of the following exception:
ObjectNotExecutableError Traceback (most recent call last)
Cell In[1], line 11
8 query = "SELECT * FROM df WHERE col1 > 2"
10 # Execute the query using pandasql
---> 11 result = sqldf(query, globals())
13 print(result)
File ~/.virtualenvs/r-reticulate/lib64/python3.11/site-packages/pandasql/sqldf.py:156, in sqldf(query, env, db_uri)
124 def sqldf(query, env=None, db_uri='sqlite:///:memory:'):
125 """
126 Query pandas data frames using sql syntax
127 This function is meant for backward compatibility only. New users are encouraged to use the PandaSQL class.
(...)
154 >>> sqldf("select avg(x) from df;", locals())
...
1416 distilled_parameters,
1417 execution_options or NO_OPTIONS,
1418 )
ObjectNotExecutableError: Not an executable object: 'SELECT * FROM df WHERE col1 > 2'
Could someone please help me?
pandasql passes the query to SQLAlchemy as a plain string, which SQLAlchemy 2.x no longer accepts as an executable object, hence the ObjectNotExecutableError. The problem can be fixed by downgrading SQLAlchemy:
pip install SQLAlchemy==1.4.46
See the bug report for more details.
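If downgrading is not an option, the same query can be expressed in plain pandas as a workaround (a sketch of the equivalent filter, not a pandasql fix):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [10, 20, 30, 40]})
# equivalent of "SELECT * FROM df WHERE col1 > 2" without pandasql
result = df[df['col1'] > 2]
print(result)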

Code that works gives KeyError: 0 when I iterate it

When I run the code below, it works:
df[df['column1'].isin([data['column2'][0]])]['column3'][0]
But when I iterate it as below, it gives KeyError: 0:
newlist2 = []
for i in datalist:
    newlist2.append(df[df['column1'].isin([globals()[i]['column2'][0]])]['column3'][0])
Error:
KeyError Traceback (most recent call last)
Input In [67], in <cell line: 4>()
3 newlist2=[]
4 for i in datalist:
----> 5 newlist2.append(mergeddata[mergeddata['DATE_OF_RESTRUCTURE'].isin([globals()[i]['REPORT_DATE'][0]])]['CONTRACT_NUMBER'][0])
Let's split the following code into four parts:
df[df['column1'].isin([globals()[i]['column2'][0]])]['column3'][0]
1. Condition check: Series.isin()
2. Boolean indexing: df[boolean]
3. Column selection: sub_df['column3']
4. Index selection: column3[0]
When Series.isin() returns all False and you do boolean indexing with it, the resulting sub_df is empty, so your [0] lookup throws a KeyError.
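A defensive version of the loop (a sketch using the anonymised names from the question) that skips empty matches and uses positional access for the first value:

newlist2 = []
for i in datalist:
    # select the matching 'column3' values in one step
    matches = df.loc[df['column1'].isin([globals()[i]['column2'][0]]), 'column3']
    if not matches.empty:
        newlist2.append(matches.iloc[0])  # iloc avoids relying on the label 0 being present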

Getting the mean of a column using Groupby

I am attempting to get the mean of the columns in a df but keep getting this error when using groupby:
grouped_df = df.groupby('location')['number of orders'].mean()
print(grouped_df)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-8cc491c4c100> in <module>
----> 1 grouped_df = df.groupby('location')['number of orders'].mean()
2 print(grouped_df)
NameError: name 'df' is not defined
If I understand your comment correctly, your DataFrame is called df_dig. Accordingly:
grouped_df = df_dig.groupby('location')['number of orders'].mean()
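If you also want the mean of every numeric column per location rather than just 'number of orders', a possible variant (still assuming the DataFrame is df_dig):

grouped_all = df_dig.groupby('location').mean(numeric_only=True)
print(grouped_all)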

Substituting variables when using Dataframes

I am trying to iterate to_datetime formatting across multiple columns and create a new column with a prefix. The issue I seem to be having is substituting the column header in the to_datetime command. Manually, the command below works:
pipeline['pyCreated_Date'] = pd.to_datetime(pipeline.Created_Date, errors='raise')
But I get an AttributeError: 'DataFrame' object has no attribute 'dh' when I try to iterate. I have searched for answers and tried various attempts based on Renaming pandas data frame columns using a for loop, but I appear to be missing some fundamental information. Here's my most recent code:
date_header = ['Created_Date', 'End_Date', 'Expected_Book_Date', 'Last_Modified_Date',
               'Start_Date', 'Workspace_Won/Lost_Date', 'pyCreated_Date']
for dh in date_header:
    pipeline['py' + dh.format()] = pd.to_datetime(
        pipeline.dh.format(), errors='raise')
It appears dh is not being recognised, as the error reads:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-121-d00bf0a5a7fd> in <module>()
3 date_header = ['Created_Date', 'End_Date', 'Expected_Book_Date', 'Last_Modified_Date', 'Start_Date', 'Workspace_Won/Lost_Date']
4 for dh in date_header:
----> 5 pipeline['py' + dh.format()] = pd.to_datetime(pipeline.dh.format(), errors='raise')
/usr/local/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'dh'
What is the correct syntax to achieve this please? Apologies if it's a rookie mistake but I appreciate your support.
Many thanks
UPDATED after ALollz's kind help!
Here's what finally worked
for col_name in date_header:
    pipeline['py' + col_name.format()] = pd.to_datetime(pipeline[col_name], errors='coerce')
    print(f"{pipeline['py' + col_name.format()].value_counts(dropna=False)}")

Tensorflow tf.split() list index out of range?

Here's the code:
a = tf.constant([1,2,3,4])
b = tf.constant([4])
c = tf.split(a, tf.squeeze(b))
Then it raises an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jeff/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1203, in split
num = size_splits_shape.dims[0]
IndexError: list index out of range
But why?
The docs state,
If num_or_size_splits is a tensor, size_splits, then splits value into len(size_splits) pieces. The shape of the i-th piece has the same size as the value except along dimension axis where the size is size_splits[i].
Note that size_splits needs to be sliceable.
However, when you squeeze(b), because it has only one element in your example, it returns a scalar with no dimensions. A scalar cannot be sliced:
b_ = tf.squeeze(b)
b_[0] # error
Hence your error.
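Two possible fixes (sketches): pass num_or_size_splits as a plain Python int, or keep the size tensor one-dimensional instead of squeezing it:

a = tf.constant([1, 2, 3, 4])
b = tf.constant([4])
c = tf.split(a, 4)  # four pieces of size 1 along axis 0
d = tf.split(a, b)  # b stays 1-D, so this is one piece of size 4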