Getting the mean of a column using Groupby - pandas

I am attempting to get the mean of a column in a df but keep getting this error when using groupby:
grouped_df = df.groupby('location')['number of orders'].mean()
print(grouped_df)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-8cc491c4c100> in <module>
----> 1 grouped_df = df.groupby('location')['number of orders'].mean()
2 print(grouped_df)
NameError: name 'df' is not defined

If I understand your comment correctly, your DataFrame is called df_dig. Accordingly:
grouped_df = df_dig.groupby('location')['number of orders'].mean()
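A minimal, self-contained sketch (with made-up data, reusing the question's column names) showing that the groupby works once the DataFrame name matches:

import pandas as pd

# Hypothetical toy data using the question's column names
df_dig = pd.DataFrame({'location': ['NY', 'NY', 'LA'],
                       'number of orders': [10, 20, 30]})
grouped_df = df_dig.groupby('location')['number of orders'].mean()
print(grouped_df)
# location
# LA    30.0
# NY    15.0
# Name: number of orders, dtype: float64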

Related

Pyspark pandas TypeError when try to concatenate two dataframes

I got the below error while trying to concatenate two pandas dataframes:
TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
At first I thought it arose because one of the dataframes contains lists in some column, so I tried concatenating two dataframes whose columns contain no lists. But I got the same error. I printed the types of the dataframes to be sure: both of them are pandas.core.frame.DataFrame. Why do I get this error even though they are not lists?
import pyspark.pandas as ps
split_col = split_col.toPandas()
split_col2 = split_col2.toPandas()
dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_1455538/463168233.py in <module>
2 split_col = split_col.toPandas()
3 split_col2 = split_col2.toPandas()
----> 4 dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
/home/anaconda3/envs/virtenv/lib/python3.10/site-packages/pyspark/pandas/namespace.py in concat(objs, axis, join, ignore_index, sort)
2464 for obj in objs:
2465 if not isinstance(obj, (Series, DataFrame)):
-> 2466 raise TypeError(
2467 "cannot concatenate object of type "
2468 "'{name}"
TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
type(split_col)
pandas.core.frame.DataFrame
type(split_col2)
pandas.core.frame.DataFrame
I want to concatenate the two dataframes but I'm stuck. Do you have any suggestions?
You're getting this error because you're trying to concatenate two plain pandas DataFrames using the pandas API for pyspark.
Instead of converting your pyspark dataframes to pandas dataframes using the toPandas() method, try the following:
split_col = split_col.to_pandas_on_spark()
split_col2 = split_col2.to_pandas_on_spark()
More documentation on this method: https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.DataFrame.to_pandas_on_spark.html
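A short sketch of the whole flow, assuming split_col and split_col2 start out as Spark DataFrames (if they are already plain pandas frames, ps.from_pandas would convert them instead):

import pyspark.pandas as ps

# Convert the Spark DataFrames to pandas-on-Spark, not to plain pandas
psdf1 = split_col.to_pandas_on_spark()
psdf2 = split_col2.to_pandas_on_spark()

# ps.concat accepts only ps.Series and ps.DataFrame objects
dfNew = ps.concat([psdf1, psdf2], axis=1)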

Vaex Dataframe - Groupby on a calculated field - throws error

I have the referenced vaex dataframe
The column "Amount_INR" is calculated using the other three attributes using the function:
from forex_python.converter import CurrencyRates

def convert_curr(x, y, z):
    # convert amount y from currency x to INR at date z
    c = CurrencyRates()
    return c.convert(x, 'INR', y, z)
data_df_usd['Amount_INR'] = data_df_usd.apply(convert_curr,arguments=[data_df_usd.CURRENCY_CODE,data_df_usd.TOTAL_AMOUNT,data_df_usd.SUBSCRIPTION_START_DATE_DATE])
I'm trying to perform a groupby operation using the below code:
data_df_usd.groupby('CONTENTID', agg={'Revenue':vaex.agg.sum('Amount_INR')})
The code throws the below error:
RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/vaex/scopes.py", line 113, in evaluate
result = self[expression]
File "/usr/local/lib/python3.7/dist-packages/vaex/scopes.py", line 198, in __getitem__
raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'lambda_function(CURRENCY_CODE, TOTAL_AMOUNT, SUBSCRIPTION_START_DATE_DATE)'"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/forex_python/converter.py", line 103, in convert
converted_amount = rate * amount
TypeError: can't multiply sequence by non-int of type 'float'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.7/dist-packages/vaex/expression.py", line 1616, in _apply
scalar_result = self.f(*[fix_type(k[i]) for k in args], **{key: value[i] for key, value in kwargs.items()})
File "<ipython-input-7-8cc933ccf57d>", line 3, in convert_curr
return c.convert(x, 'INR', y, z)
File "/usr/local/lib/python3.7/dist-packages/forex_python/converter.py", line 107, in convert
"convert requires amount parameter is of type Decimal when force_decimal=True")
forex_python.converter.DecimalFloatMismatchError: convert requires amount parameter is of type Decimal when force_decimal=True
"""
The above exception was the direct cause of the following exception:
DecimalFloatMismatchError Traceback (most recent call last)
<ipython-input-13-cc7b243be138> in <module>
----> 1 data_df_usd.groupby('CONTENTID', agg={'Revenue':vaex.agg.sum('Amount_INR')})
According to the error screenshot, this does not look like it is related to the groupby. Something is happening with the convert_curr function.
You get the error
TypeError: can't multiply sequence by non-int of type 'float'
See if you can evaluate Amount_INR in the first place.
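A minimal check along those lines (assuming vaex's DataFrame.evaluate, which accepts i1/i2 row bounds): force the virtual column to materialize for a few rows, so any failure inside convert_curr surfaces on its own, outside the groupby.

# Evaluate only the first 5 rows of the computed column; if convert_curr
# is broken, this should raise the same error without the groupby noise.
preview = data_df_usd.evaluate('Amount_INR', i1=0, i2=5)
print(preview)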

Py4JJavaError while using streaming with PySpark

For the following code:
%%time
steps = df.select("step").distinct().collect()
for step in steps[:]:
    _df = df.where(f"step = {step[0]}")
    # by adding coalesce(1) we save the dataframe to one file
    _df.coalesce(1).write.mode("append").option("header", "true").csv("paysim1")
I am getting the following error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_4892/519786224.py in <module>
2 _df = df.where(f"step = {step[0]}")
3 # by adding coalesce(1) we save the dataframe to one file
----> 4 _df.coalesce(1).write.mode("append").option("header", "true").csv("paysim1")
I need a solution for this.
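The traceback shown is cut off before the Java-side cause, so the actual failure can't be pinned down from this alone. As an aside, a sketch under the assumption that the goal is simply one output folder per step value: Spark's partitionBy can do the split in a single write, without the Python loop.

# Writes one subdirectory per distinct step value (e.g. paysim1/step=1/)
df.write.partitionBy("step").mode("overwrite").option("header", "true").csv("paysim1")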

AttributeError: 'Index' object has no attribute 'mean'

I am encountering an error that I can't get out of:
I have separated my columns according to the unique values of the variables:
cats_df = df.columns[df.nunique() < 6]
num_df = df.columns[df.nunique() >= 6]
And I wanted to replace the missing values of the numerical columns >= 6 with the average:
num_df = num_df.fillna(num_df.mean())
But I get this error message :
AttributeError Traceback (most recent call last)
<ipython-input-22-59bfd4048c41> in <module>
----> 1 num_df = num_df.fillna(num_df.mean())
      2 num_df
AttributeError: 'Index' object has no attribute 'mean'
Can you help me solve this problem?
The problem is that num_df is an Index, not a DataFrame; you may need something like this:
num_df = df[df.columns[df.nunique() >= 6]]
num_df = num_df.fillna(num_df.mean())
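A toy demonstration of the distinction (hypothetical data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0],
                   'cat': list('xxyyxzy')})
cols = df.columns[df.nunique() >= 6]   # an Index of column labels, has no .mean()
num_df = df[cols]                      # selecting with it yields a DataFrame
num_df = num_df.fillna(num_df.mean())  # the NaN in 'a' becomes the column mean
print(num_df)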

Substituting variables when using Dataframes

I am trying to iterate to_datetime formatting across multiple columns and create a new column with a prefix. The issue I seem to be having is substituting the column header in the to_datetime command. Manually, the command below works:
pipeline['pyCreated_Date'] = pd.to_datetime(pipeline.Created_Date, errors='raise')
But I get an AttributeError: 'DataFrame' object has no attribute 'dh' when I try to iterate. I have searched for answers and tried various attempts based on "Renaming pandas data frame columns using a for loop", but I appear to be missing some fundamental information. Here's my most recent code:
date_header = ['Created_Date', 'End_Date', 'Expected_Book_Date', 'Last_Modified_Date',
               'Start_Date', 'Workspace_Won/Lost_Date', 'pyCreated_Date']
for dh in date_header:
    pipeline['py' + dh.format()] = pd.to_datetime(
        pipeline.dh.format(), errors='raise')
It appears dh is not being recognised, as the error reads:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-121-d00bf0a5a7fd> in <module>()
3 date_header = ['Created_Date', 'End_Date', 'Expected_Book_Date', 'Last_Modified_Date', 'Start_Date', 'Workspace_Won/Lost_Date']
4 for dh in date_header:
----> 5 pipeline['py' + dh.format()] = pd.to_datetime(pipeline.dh.format(), errors='raise')
/usr/local/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'dh'
What is the correct syntax to achieve this please? Apologies if it's a rookie mistake but I appreciate your support.
Many thanks
UPDATED after ALollz's kind help!
Here's what finally worked
for col_name in date_header:
    pipeline['py' + col_name.format()] = pd.to_datetime(pipeline[col_name], errors='coerce')
    print(f"{pipeline['py' + col_name.format()].value_counts(dropna=False)}")