AttributeError when transforming a PySpark DataFrame into a pandas DataFrame

I am trying to implement PageRank using PySpark. At one step I need to convert a PySpark DataFrame into a regular pandas one, but when I run it I get the following error:
An exception was thrown from a UDF: 'AttributeError: 'DataFrame' object has no attribute 'Id''
The code I am using is the following:
n = 0
change = 1
while n <= max_iterations and change > t:
    print("Iteration", n)
    # We create a new UDF each iteration, as pageRankPDF changes between iterations
    new_pagerank_udf = udf(lambda x, y: new_pagerank(x, pageRankPDF, y), DoubleType())
    # We build a new Spark DataFrame with the updated page rank value
    NewPageRankDF = ReverseDF.select(ReverseDF["id"], new_pagerank_udf(ReverseDF["links"], ReverseDF["n_out"]).alias("PR"))
    # We convert NewPageRankDF to pandas in order to be able to operate with it
    NewPageRankPDF = NewPageRankDF.toPandas()
    # We update the exit conditions
    n += 1
    change = np.linalg.norm(pageRankPDF["PR"] - NewPageRankPDF["PR"])
    # NewPageRankPDF becomes pageRankPDF for the next iteration
    pageRankPDF = NewPageRankPDF
The error points at the line
NewPageRankPDF=NewPageRankDF.toPandas()
If you can share any insight into what might be causing the error, I would greatly appreciate it.
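For what it's worth, toPandas() is only where the lazy Spark plan actually executes; the traceback says the failure happens inside the UDF, which closes over pageRankPDF. A minimal sketch of a common workaround, assuming pageRankPDF is a pandas DataFrame with id and PR columns and that spark is the active SparkSession (both assumptions on my part, not from the question):
# Hand the UDF a plain dict (optionally broadcast) instead of a DataFrame;
# attribute lookups on a DataFrame captured in a UDF closure are a frequent
# source of AttributeError on the executors.
pr_map = dict(zip(pageRankPDF["id"], pageRankPDF["PR"]))  # assumed column names
pr_bc = spark.sparkContext.broadcast(pr_map)
# new_pagerank would then look ranks up in pr_bc.value instead of a DataFrame
new_pagerank_udf = udf(lambda x, y: new_pagerank(x, pr_bc.value, y), DoubleType())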

Related

How to calculate rolling.agg('max') utilising a dataframe column as input to my function

I'm working with a kline dataframe, to which I'm adding Swing_High and Swing_Low columns.
I've run into an error where, during low-volatility periods, my Close == Swing_Low price. This produces an inf in another function of mine that computes close / Swing_Low.
To fix this I need to calculate the max/min value based on whether Close == Swing_Low or not. The default rolling period is 10, but if the above is true the rolling period should increase to 15.
Below is how I calculated Swing_High and Swing_Low up to encountering the inf error.
import pandas as pd
df = pd.read_csv('Data/bybit_BTCUSD_15m.csv')
df["Date"] = df["Date"].astype('datetime64[ns]')
# Calculate the swing high and low for a given length
df['Swing_High'] = df['High'].rolling(10).agg('max')
df['Swing_Low'] = df['Low'].rolling(10).agg('min')
I tried the function below, but it gives me a ValueError: The truth value of a Series is ambiguous
def swing_high(close, high, period1, period2):
    a = high.rolling(period1).agg('max')
    b = high.rolling(period2).agg('max')
    if a != close:
        return a
    else:
        return b
df['Swing_High'] = swing_high(df['Close'], df['High'], 10, 15)
How do I fix this or is there a better way to achieve my desired outcome?
A simple solution for what you're trying to achieve, using the where function.
Here's the basic syntax of the pandas where() method: the caller's values are kept where the condition is True and replaced where it is False:
df['col'] = (value_if_true).where(condition, value_if_false)
df['Swing_High_10'] = df['High'].rolling(10).agg('max')
df['Swing_High_15'] = df['High'].rolling(15).agg('max')
df['Swing_High'] = df['Swing_High_10'].where(df['Swing_High_10'] != df['Close'], df['Swing_High_15'])
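For a two-branch choice like this, numpy's where is an equally direct alternative (a small sketch using the same column names as above):
import numpy as np
# keep the 10-period swing high unless it equals Close, else fall back to 15 periods
df['Swing_High'] = np.where(df['Swing_High_10'] != df['Close'],
                            df['Swing_High_10'],
                            df['Swing_High_15'])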

Pandas Rolling Operation on Categorical column

The code I am trying to execute:
for cat_name in df['movement_state'].cat.categories:
    transformed_df[f'{cat_name} Count'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts()[cat])
    transformed_df[f'{cat_name} Ratio'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts(normalize=True)[cat])
For reproduction purposes just assume the following:
import numpy as np
import pandas as pd
d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3
I want to do rolling window operations on my GroupBy Object. I am selecting the column movement_state beforehand. This column is categorical as shown below.
grouped_df['movement_state'].dtypes
# Output
CategoricalDtype(categories=['moving', 'parking', 'standing'], ordered=False)
If I execute this, I get these error messages:
pandas.core.base.DataError: No numeric types to aggregate
TypeError: cannot handle this type -> category
ValueError: could not convert string to float: 'standing'
Inside this code snippet of rolling.py from the pandas source code, I read that the data must be converted to float64 before it can be processed by Cython:
def _prep_values(self, values: ArrayLike) -> np.ndarray:
    """Convert input to numpy arrays for Cython routines"""
    if needs_i8_conversion(values.dtype):
        raise NotImplementedError(
            f"ops for {type(self).__name__} for this "
            f"dtype {values.dtype} are not implemented"
        )
    else:
        # GH #12373 : rolling functions error on float32 data
        # make sure the data is coerced to float64
        try:
            if isinstance(values, ExtensionArray):
                values = values.to_numpy(np.float64, na_value=np.nan)
            else:
                values = ensure_float64(values)
        except (ValueError, TypeError) as err:
            raise TypeError(f"cannot handle this type -> {values.dtype}") from err
My question to you
Is it possible to count the values of a categorical column in a pandas DataFrame using the rolling method as I tried to do?
A possible workaround I came up with is to just use the codes of the categorical column instead of the string values. But even then, s.value_counts()[cat] would raise a KeyError if the window I am looking at does not contain every possible value.
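One workaround sketch (not from the original thread) that sidesteps both the float64 coercion and the KeyError: one-hot encode the categorical column and rolling-sum the indicator columns. Each sum is the count of that category in the window, and categories missing from a window simply produce 0:
import numpy as np
import pandas as pd

d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3

# the indicator columns are plain numeric, so the Cython rolling ops accept them
dummies = pd.get_dummies(grouped_df['movement_state'])
counts = dummies.rolling(rolling_window_size, closed='both').sum()
ratios = counts.div(counts.sum(axis=1), axis=0)  # equivalent of value_counts(normalize=True)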

Got TypeError: string indices must be integers with .apply [duplicate]

I have a dataframe; one column is a URL and the other is a name. I'm simply trying to add a third column that takes the URL and creates an HTML link.
The column newsSource has the link name, and url has the URL. For each row in the dataframe, I want to create a column that contains:
<a href="[url]">[newsSource name]</a>
Trying the below throws the error
File "C:\Users\AwesomeMan\Documents\Python\MISC\News Alerts\simple_news.py", line 254, in
df['sourceURL'] = df['url'].apply(lambda x: '{1}'.format(x, x[0]['newsSource']))
TypeError: string indices must be integers
df['sourceURL'] = df['url'].apply(lambda x: '{1}'.format(x, x['source']))
But I've used x[colName] before? The below line works fine, it simply creates a column of the source's name:
df['newsSource'] = df['source'].apply(lambda x: x['name'])
Why suddenly ("suddenly" to me) is it saying I can't access the indices?
pd.Series.apply has access only to a single series, i.e. the series on which you are calling the method. In other words, the function you supply, irrespective of whether it is named or an anonymous lambda, will only have access to df['source'].
To access multiple series by row, you need pd.DataFrame.apply along axis=1:
def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
Note there is an overhead associated with passing an entire series in this way; pd.DataFrame.apply is just a thinly veiled, inefficient loop.
You may find a list comprehension more efficient:
df['sourceURL'] = ['<a href="{0}">{1}</a>'.format(i, j)
                   for i, j in zip(df['url'], df['source'])]
Here's a working demo:
df = pd.DataFrame([['BBC', 'http://www.bbc.o.uk']],
                  columns=['source', 'url'])

def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
print(df)

  source                  url                                sourceURL
0    BBC  http://www.bbc.o.uk  <a href="http://www.bbc.o.uk">BBC</a>
With zip and old-school %-style string formatting:
df['sourceURL'] = ['<a href="%s">%s</a>' % (x, y) for x, y in zip(df['url'], df['source'])]
And the f-string version:
df['sourceURL'] = [f'<a href="{x}">{y}</a>' for x, y in zip(df['url'], df['source'])]

How to show truncated form of large pandas dataframe after style.apply?

Normally, a relatively long dataframe like
df = pd.DataFrame(np.random.randint(0,10,(100,2)))
df
will display in a truncated form in a Jupyter notebook: the head, the tail, an ellipsis in between, and the row/column count at the end.
However, after style.apply
def highlight_max(x):
    return ['background-color: yellow' if v == x.max() else '' for v in x]

df.style.apply(highlight_max)
all rows are displayed.
Is it possible to still display the truncated form of dataframe after style.apply?
Something simple like this?
def display_df(dataframe, function):
    display(dataframe.head().style.apply(function))
    display(dataframe.tail().style.apply(function))
    print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')

display_df(df, highlight_max)
Output:
**** EDIT ****
def display_df(dataframe, function):
    display(pd.concat([dataframe.iloc[:5, :],
                       pd.DataFrame(index=['...'], columns=dataframe.columns),
                       dataframe.iloc[-5:, :]]).style.apply(function))
    print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')

display_df(df, highlight_max)
Output:
The Jupyter preview is basically something like this:
def display_df(dataframe):
    display(pd.concat([dataframe.iloc[:5, :],
                       pd.DataFrame(index=['...'], columns=dataframe.columns, data={0: '...', 1: '...'}),
                       dataframe.iloc[-5:, :]]))
but if you try to apply the style you get an error (TypeError: '>=' not supported between instances of 'int' and 'str') because it tries to compare and highlight the string values '...'.
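One possible way around that TypeError (a sketch, not from the original answer) is to coerce each column to numeric inside the styling function, so the '...' placeholder row becomes NaN and the comparison stays numeric:
def highlight_max_safe(x):
    # '...' coerces to NaN, which max() skips and which never equals the max
    numeric = pd.to_numeric(x, errors='coerce')
    return ['background-color: yellow' if v == numeric.max() else '' for v in numeric]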
You can also control what you display each time by slicing the dataframe before styling it. Note that the Styler returned by style.apply has no head or tail of its own, so slice first, then style:
output = df.head(10).style.apply(highlight_max)  # 10 -> number of rows to display
output
If you want to see more varied data you can also use sample, which takes random rows:
df.sample(10).style.apply(highlight_max)

Error: missing 1 required positional argument: 'g', 'occurred at index time'

I am trying to write each of the dataframes created by a groupby on two columns to its own sheet of an Excel file.
list_dfs = []
TT = Dataframe.groupby(['change', 'x2'])
for group, name in TT:
    list_dfs.append(group)

writer = pd.ExcelWriter('output.xlsx')

def dt(_, g):
    for _, g in Dataframe.groupby(Dataframe.index):
        print(g)
        _.to_excel(writer, g)
    writer.save()

DT = Dataframe.apply(dt)
It keeps giving me this error:
TypeError: ("dt() missing 1 required positional argument: 'g'", 'occurred at index time')
Your function
def dt(_,g):
takes two arguments.
DataFrame.apply takes a function (or lambda) that accepts only one argument (either a Series or an ndarray):
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Try changing the signature of function dt to:
def dt(g):
I just use a small trick, as follows.
Dataframe['x2'] = res  # res: the second grouping key's values, from my own context
TT = Dataframe.groupby(['change', res])
writer = pd.ExcelWriter('output.xls')
for name, group in TT:
    group.to_excel(writer, sheet_name='Sheet_{}'.format(name))
writer.save()
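As a side note, in recent pandas versions ExcelWriter.save() has been removed in favor of close(), and the idiomatic form is a context manager; a sketch using the groupby object TT from above:
# the file is written automatically when the with-block exits
with pd.ExcelWriter('output.xlsx') as writer:
    for name, group in TT:
        group.to_excel(writer, sheet_name='Sheet_{}'.format(name))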