Substituting variables when using Dataframes - pandas

I am trying to iterate to_datetime formatting across multiple columns and create a new column with a prefix. The issue I seem to be having is substituting the column header into the to_datetime command. Run manually, the command below works:
pipeline['pyCreated_Date'] = pd.to_datetime(pipeline.Created_Date, errors='raise')
But I get an AttributeError: 'DataFrame' object has no attribute 'dh' when I try to iterate. I have searched for answers and tried various attempts based on "Renaming pandas data frame columns using a for loop", but I appear to be missing some fundamental information. Here's my most recent code:
date_header = ['Created_Date', 'End_Date', 'Expected_Book_Date', 'Last_Modified_Date',
               'Start_Date', 'Workspace_Won/Lost_Date', 'pyCreated_Date']
for dh in date_header:
    pipeline['py' + dh.format()] = pd.to_datetime(
        pipeline.dh.format(), errors='raise')
It appears dh is not being recognised, as the error reads:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-121-d00bf0a5a7fd> in <module>()
      3 date_header = ['Created_Date', 'End_Date', 'Expected_Book_Date', 'Last_Modified_Date', 'Start_Date', 'Workspace_Won/Lost_Date']
      4 for dh in date_header:
----> 5     pipeline['py' + dh.format()] = pd.to_datetime(pipeline.dh.format(), errors='raise')

/usr/local/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   4370         if self._info_axis._can_hold_identifiers_and_holds_name(name):
   4371             return self[name]
-> 4372         return object.__getattribute__(self, name)
   4373 
   4374     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'dh'
What is the correct syntax to achieve this, please? Apologies if it's a rookie mistake, but I appreciate your support.
Many thanks
UPDATED after ALollz's kind help!
Here's what finally worked:
for col_name in date_header:
    pipeline['py' + col_name] = pd.to_datetime(pipeline[col_name], errors='coerce')
    print(f"{pipeline['py' + col_name].value_counts(dropna=False)}")

Related

Pyspark pandas TypeError when trying to concatenate two dataframes

I got the below error while trying to concatenate two pandas dataframes:
TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
At first I thought it occurred because one of the dataframes has a list in some of its columns, so I tried concatenating two dataframes whose columns contain no lists, but I got the same error. I printed the types of the dataframes to be sure: both are pandas.core.frame.DataFrame. Why do I get this error even though they are not lists?
import pyspark.pandas as ps
split_col = split_col.toPandas()
split_col2 = split_col2.toPandas()
dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_1455538/463168233.py in <module>
      2 split_col = split_col.toPandas()
      3 split_col2 = split_col2.toPandas()
----> 4 dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)

/home/anaconda3/envs/virtenv/lib/python3.10/site-packages/pyspark/pandas/namespace.py in concat(objs, axis, join, ignore_index, sort)
   2464     for obj in objs:
   2465         if not isinstance(obj, (Series, DataFrame)):
-> 2466             raise TypeError(
   2467                 "cannot concatenate object of type "
   2468                 "'{name}"

TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
type(split_col)
pandas.core.frame.DataFrame
type(split_col2)
pandas.core.frame.DataFrame
I want to concatenate the two dataframes, but I'm stuck. Do you have any suggestions?
You're getting this error because you're trying to concatenate two plain pandas DataFrames using the pandas API on Spark.
Instead of converting your pyspark dataframes to pandas dataframes with the toPandas() method, try the following:
split_col = split_col.to_pandas_on_spark()
split_col2 = split_col2.to_pandas_on_spark()
More documentation on this method: https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.DataFrame.to_pandas_on_spark.html
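To make the two consistent routes explicit, here is a sketch, assuming split_col and split_col2 start out as regular Spark DataFrames on Spark 3.2+ (pick one world and stay in it):
import pandas as pd
import pyspark.pandas as ps

# Route 1: stay distributed -- convert to pandas-on-Spark and use ps.concat
ps_df1 = split_col.to_pandas_on_spark()
ps_df2 = split_col2.to_pandas_on_spark()
dfNew = ps.concat([ps_df1, ps_df2], axis=1, ignore_index=True)

# Route 2: go fully local -- plain pandas DataFrames with plain pd.concat
pd_df1 = split_col.toPandas()
pd_df2 = split_col2.toPandas()
dfNew = pd.concat([pd_df1, pd_df2], axis=1, ignore_index=True)
The original code failed because it mixed the two: toPandas() produces plain pandas objects, which ps.concat refuses.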

AttributeError: 'Index' object has no attribute 'mean'

I am encountering an error that I can't get out of:
I have separated my columns according to the unique values of the variables:
cats_df = df.columns[df.nunique() < 6]
num_df = df.columns[df.nunique()>= 6]
And I wanted to replace the missing values of the numerical columns (those with >= 6 unique values) with the average:
num_df = num_df.fillna(num_df.mean())
But I get this error message:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-22-59bfd4048c41> in <module>
----> 1 num_df = num_df.fillna(num_df.mean())
      2 num_df

AttributeError: 'Index' object has no attribute 'mean'
Can you help me solve this problem?
The problem is that num_df is an Index of column names, not a DataFrame; you may need something like this:
num_df = df[df.columns[df.nunique()>= 6]]
num_df = num_df.fillna(num_df.mean())
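A minimal, self-contained reproduction of the fix (the data is made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'blue', 'red', 'blue', 'red'],  # 2 unique values -> categorical
    'price': [10.0, 20.0, np.nan, 40.0, 50.0, 60.0, 70.0],          # 6 unique values -> numerical
})

# df.columns[...] is only an Index of column *names*; wrapping it in
# df[...] selects the actual sub-DataFrame, which does have .mean()
num_df = df[df.columns[df.nunique() >= 6]]
num_df = num_df.fillna(num_df.mean())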

Getting the mean of a column using Groupby

I am attempting to get the mean of a column in a df but keep getting this error when using groupby:
grouped_df = df.groupby('location')['number of orders'].mean()
print(grouped_df)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-8cc491c4c100> in <module>
----> 1 grouped_df = df.groupby('location')['number of orders'].mean()
      2 print(grouped_df)

NameError: name 'df' is not defined
If I understand your comment correctly, your DataFrame is called df_dig. Accordingly:
grouped_df = df_dig.groupby('location')['number of orders'].mean()
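For completeness, a small self-contained sketch of the pattern (the data is invented for illustration):
import pandas as pd

df_dig = pd.DataFrame({
    'location': ['NY', 'NY', 'LA'],
    'number of orders': [10, 20, 30],
})

# Mean of 'number of orders' within each location
grouped_df = df_dig.groupby('location')['number of orders'].mean()
print(grouped_df)
# location
# LA    30.0
# NY    15.0
# Name: number of orders, dtype: float64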

How do I change the format of a single cell in a pandas data frame?

My data frame has a summary of values for different types of metrics. They are all floats, but I need the budget to show up with a '$' and the two bottom rows as percentages instead of decimals. I have provided screenshots, as I do not know how else to properly display my Jupyter notebook code on Stack Overflow.
I tried using .iloc with .map and .format, but it did not work:
district_summary.iloc[[6,1].map('{:,%.2f}'.format)]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-40-a2f927a382d3> in <module>
----> 1 district_summary.iloc[[6,1].map('{:,%.2f}'.format)]

AttributeError: 'list' object has no attribute 'map'
I need the budget to show with a preceding $ and no decimal points, and the two percentages in the bottom 2 rows to show up as %xx.xx.
This is one way to get it to work; however, better solutions are out there. The df is converted to strings so that all datatypes within the cells remain consistent, i.e. a '%' or '$' cannot be stored in a float.
# Silence the chained-assignment warning for the slice edits below
pd.set_option('mode.chained_assignment', None)

# Scale the two percentage rows from decimals to percentages
df['Value'][6:8] = df['Value'][6:8] * 100

# Cast every cell to string so '$' and '%' can live alongside the numbers
df = df.applymap(str)

# Prefix the budget row with '$'
df['Value'][2] = '$' + df['Value'][2]

# Pad the first three rows with a trailing '0'
df['Value'][:3] = df['Value'][:3] + '0'

# Prefix the percentage rows with '%' and pad with a trailing '0'
df['Value'][6:8] = '%' + df['Value'][6:8] + '0'
String Dataframe Result:
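As one of those better solutions, here is a sketch using .loc and Python format specs instead of chained assignment; the row labels and values are invented, since the original frame is only shown in screenshots:
import pandas as pd

district_summary = pd.DataFrame(
    {'Value': [15.0, 39170.0, 24649428.0, 78.99, 81.88, 74.98, 0.7498, 0.8581]},
    index=['Total Schools', 'Total Students', 'Total Budget', 'Avg Math Score',
           'Avg Reading Score', '% Overall Passing', '% Passing Math', '% Passing Reading'],
)

formatted = district_summary.astype(object)  # object dtype so strings can replace floats

# Budget: leading '$', thousands separators, no decimal places
formatted.loc['Total Budget', 'Value'] = '${:,.0f}'.format(
    district_summary.loc['Total Budget', 'Value'])

# Bottom two rows: leading '%', two decimal places
for row in ['% Passing Math', '% Passing Reading']:
    formatted.loc[row, 'Value'] = '%{:.2f}'.format(
        district_summary.loc[row, 'Value'] * 100)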

to_dataframe() bug when query returns no results

If a valid BigQuery query returns 0 rows, to_dataframe() crashes. (btw, I am running this on Google Cloud Datalab)
for example:
q = bq.Query('SELECT * FROM [isb-cgc:tcga_201510_alpha.Somatic_Mutation_calls] WHERE ( Protein_Change="V600E" ) LIMIT 10')
r = q.results()
r.to_dataframe()
produces:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-de55245104c0> in <module>()
----> 1 r.to_dataframe()

/usr/local/lib/python2.7/dist-packages/gcp/bigquery/_table.pyc in to_dataframe(self, start_row, max_rows)
    628         # Need to reorder the dataframe to preserve column ordering
    629         ordered_fields = [field.name for field in self.schema]
--> 630         return df[ordered_fields]
    631 
    632     def to_file(self, destination, format='csv', csv_delimiter=',', csv_header=True):

TypeError: 'NoneType' object has no attribute '__getitem__'
Is this a known bug?
Certainly not a known bug. Please do log a bug as mentioned by Felipe.
Contributions, both bug reports, and of course fixes, are welcome! :)
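In the meantime, a defensive workaround sketch on the caller's side (this assumes the results object exposes .schema, which the traceback above suggests it does):
import pandas as pd

r = q.results()
try:
    df = r.to_dataframe()
except TypeError:
    # Zero-row result: fall back to an empty frame with the schema's columns
    df = pd.DataFrame(columns=[field.name for field in r.schema])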