I am trying to write each dataframe produced by a groupby on two columns to its own sheet in an Excel file.
list_dfs = []
TT = Dataframe.groupby(['change', 'x2'])
for group, name in TT:
    list_dfs.append(group)

writer = pd.ExcelWriter('output.xlsx')

def dt(_, g):
    for _, g in Dataframe.groupby(Dataframe.index):
        print(g)
        _.to_excel(writer, g)
    writer.save()

DT = Dataframe.apply(dt)
It keeps giving me this error:
TypeError: ("dt() missing 1 required positional argument: 'g'", 'occurred at index time')
Your function
def dt(_,g):
takes two arguments.
DataFrame.apply takes a function (or lambda) that accepts only one argument (a Series, or an ndarray when raw=True).
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Try changing the signature of function dt to:
def dt(g):
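For instance, here is a minimal, self-contained sketch of how apply feeds your function:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df.apply(lambda s: s.sum()))                        # each column arrives as one Series
print(df.apply(lambda row: row['a'] + row['b'], axis=1))  # axis=1: each row arrives as one Series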
I use a small trick, as follows:

Dataframe['x2'] = res
TT = Dataframe.groupby(['change', res])
writer = pd.ExcelWriter('output.xls')
for name, group in TT:
    group.to_excel(writer, sheet_name='Sheet_{}'.format(name))
writer.save()
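One caveat: when grouping by two columns, name is a tuple, so the sheet names above come out like Sheet_('a', 1). A small sketch that flattens the tuple and respects Excel's 31-character sheet-name limit:

for name, group in TT:
    sheet = 'Sheet_{}_{}'.format(*name)[:31]  # flatten the 2-tuple; Excel caps sheet names at 31 chars
    group.to_excel(writer, sheet_name=sheet)
writer.save()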
I am trying to implement PageRank using PySpark. In one step I need to convert a PySpark dataframe into a normal pandas one, but when I try to run it I get the following error:
An exception was thrown from a UDF: 'AttributeError: 'DataFrame' object has no attribute 'Id''
The code I am using is the following:
n = 0
change = 1
while n <= max_iterations and change > t:
    print("Iteration", n)
    # We create a new UDF each time, as we will be changing the pageRankPDF
    new_pagerank_udf = udf(lambda x, y: new_pagerank(x, pageRankPDF, y), DoubleType())
    # We create a new PageRank dataframe with the updated page rank value
    NewPageRankDF = ReverseDF.select(ReverseDF["id"], new_pagerank_udf(ReverseDF["links"], ReverseDF["n_out"]).alias("PR"))
    # We transform the NewPageRankDF to pandas in order to be able to operate on it
    NewPageRankPDF = NewPageRankDF.toPandas()
    # We update the exit conditions
    n += 1
    change = np.linalg.norm(pageRankPDF["PR"] - NewPageRankDF["PR"])
    # We make NewPageRankPDF the pageRankPDF for the next iteration
    pageRankPDF = NewPageRankPDF
The error occurs on the line

NewPageRankPDF = NewPageRankDF.toPandas()

If you can share any insight about what might be causing the error, I would greatly appreciate it.
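One observation that may help (hedged, since new_pagerank itself isn't shown): Spark evaluates lazily, so the UDF only actually runs when toPandas() forces the computation, which is why the traceback points at that line even though the failure happens inside the UDF. The closure captures the pandas pageRankPDF, so if new_pagerank does something like pageRankPDF.Id while the pandas column is actually named id, you would get exactly this AttributeError. A quick way to localize it:

# Force evaluation right after the select; if the UDF is the culprit,
# the same AttributeError should surface here instead of at toPandas()
NewPageRankDF = ReverseDF.select(
    ReverseDF["id"],
    new_pagerank_udf(ReverseDF["links"], ReverseDF["n_out"]).alias("PR"),
)
NewPageRankDF.first()  # triggers the UDF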
The code I am trying to execute:
for cat_name in df['movement_state'].cat.categories:
transformed_df[f'{cat_name} Count'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts()[cat])
transformed_df[f'{cat_name} Ratio'] = grouped_df['movement_state'].rolling(rolling_window_size, closed='both').apply(lambda s, cat=cat_name: s.value_counts(normalize=True)[cat])
For reproduction purposes just assume the following:
import numpy as np
import pandas as pd
d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3
I want to do rolling window operations on my GroupBy Object. I am selecting the column movement_state beforehand. This column is categorical as shown below.
grouped_df['movement_state'].dtypes
# Output
CategoricalDtype(categories=['moving', 'parking', 'standing'], ordered=False)
If I execute the code above, I get these error messages:
pandas.core.base.DataError: No numeric types to aggregate
TypeError: cannot handle this type -> category
ValueError: could not convert string to float: 'standing'
Inside this code snippet of rolling.py from the pandas source code, I read that the data must be converted to float64 before it can be processed by Cython.
def _prep_values(self, values: ArrayLike) -> np.ndarray:
    """Convert input to numpy arrays for Cython routines"""
    if needs_i8_conversion(values.dtype):
        raise NotImplementedError(
            f"ops for {type(self).__name__} for this "
            f"dtype {values.dtype} are not implemented"
        )
    else:
        # GH #12373 : rolling functions error on float32 data
        # make sure the data is coerced to float64
        try:
            if isinstance(values, ExtensionArray):
                values = values.to_numpy(np.float64, na_value=np.nan)
            else:
                values = ensure_float64(values)
        except (ValueError, TypeError) as err:
            raise TypeError(f"cannot handle this type -> {values.dtype}") from err
My question to you
Is it possible to count the values of a categorical column in a pandas DataFrame using the rolling method as I tried to do?
A possible workaround I came up with is to just use the codes of the categorical column instead of the string values. But this way, s.value_counts()[cat] would raise a KeyError if the window I am looking at does not contain every possible value.
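Building on that idea, here is a hedged sketch that rolls over the integer codes and counts matches directly, so a window missing a category yields 0 instead of a KeyError (it uses the reproduction setup above; closed='both' is dropped for simplicity):

import numpy as np
import pandas as pd

d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3

transformed_df = pd.DataFrame(index=grouped_df.index)
# Integer codes are numeric, so the Cython rolling machinery accepts them
codes = grouped_df['movement_state'].cat.codes
for i, cat_name in enumerate(grouped_df['movement_state'].cat.categories):
    rolled = codes.rolling(rolling_window_size)
    # (s == c) marks occurrences of this category inside the window
    transformed_df[f'{cat_name} Count'] = rolled.apply(lambda s, c=i: (s == c).sum(), raw=True)
    transformed_df[f'{cat_name} Ratio'] = rolled.apply(lambda s, c=i: (s == c).mean(), raw=True)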
I have a dataframe, one column is a URL, the other is a name. I'm simply trying to add a third column that takes the URL, and creates an HTML link.
The column newsSource has the Link name, and url has the URL. For each row in the dataframe, I want to create a column that has:
<a href="url">newsSource name</a>
Trying the below throws the error:

File "C:\Users\AwesomeMan\Documents\Python\MISC\News Alerts\simple_news.py", line 254, in <module>
df['sourceURL'] = df['url'].apply(lambda x: '<a href="{0}">{1}</a>'.format(x, x[0]['newsSource']))
TypeError: string indices must be integers

df['sourceURL'] = df['url'].apply(lambda x: '<a href="{0}">{1}</a>'.format(x, x['source']))
But I've used x[colName] before? The below line works fine; it simply creates a column of the source's name:
df['newsSource'] = df['source'].apply(lambda x: x['name'])
Why suddenly ("suddenly" to me) is it saying I can't access the indices?
pd.Series.apply has access only to a single series, i.e. the series on which you are calling the method, and your function receives its elements one at a time. In other words, the function you supply, irrespective of whether it is named or an anonymous lambda, only ever sees a single string from df['url'], which is why x['source'] fails with a string-indices error.
To access multiple series by row, you need pd.DataFrame.apply along axis=1:
def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
Note there is an overhead associated with passing an entire series in this way; pd.DataFrame.apply is just a thinly veiled, inefficient loop.
You may find a list comprehension more efficient:
df['sourceURL'] = ['<a href="{0}">{1}</a>'.format(i, j)
                   for i, j in zip(df['url'], df['source'])]
Here's a working demo:
df = pd.DataFrame([['BBC', 'http://www.bbc.o.uk']],
                  columns=['source', 'url'])

def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
print(df)

  source                  url                              sourceURL
0    BBC  http://www.bbc.o.uk  <a href="http://www.bbc.o.uk">BBC</a>
With zip and old-school %-style string formatting:

df['sourceURL'] = ['<a href="%s">%s</a>' % (x, y) for x, y in zip(df['url'], df['source'])]
And the f-string version:

df['sourceURL'] = [f'<a href="{x}">{y}</a>' for x, y in zip(df['url'], df['source'])]
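Sanity check against the one-row demo frame above (a hedged expected output):

print([f'<a href="{x}">{y}</a>' for x, y in zip(df['url'], df['source'])])
# ['<a href="http://www.bbc.o.uk">BBC</a>']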
I'm trying to create 2 columns in Databricks: one the result of subtracting the values of 2 columns, the other of adding them.
This is the code I've entered.
dfPrep = dfCleanYear.withColumn(df.withColumn("NuevaCol", df["AverageTemperature"] - df["AverageTemperatureUncertainty"])).withColumn(df.withColumn("NuevaCol", df["AverageTemperature"] + df["AverageTemperatureUncertainty"]))
dfPrep.show()
And this is the error.
TypeError: withColumn() takes exactly 3 arguments (2 given)
Would you know which argument is missing?
Thanks
It's not clear which Spark version/flavour you're using, but the Databricks documentation is usually clear about this: the first parameter in the .withColumn call should be a DataFrame.
Example: https://docs.azuredatabricks.net/spark/1.6/sparkr/functions/withColumn.html
Syntax:
withColumn(df, "newColName", colExpr)
Parameters:
df: Any SparkR DataFrame
newColName: String, name of new column to be added
colExpr: Column Expression
We can rewrite your code:
a = df.withColumn("NuevaCol", df["AverageTemperature"] - df["AverageTemperatureUncertainty"])
b = df.withColumn("NuevaCol", df["AverageTemperature"] + df["AverageTemperatureUncertainty"])
dfPrep = dfCleanYear.withColumn(a).withColumn(b)
The first two lines are fine. The error comes from the 3rd one. There are two problems with this line:
The withColumn syntax should be dataframe.withColumn("New_col_name", expression); here there is only one argument in the brackets.
What you want here is to take a column from one dataframe df and add to another dataframe dfCleanYear. So, you should use join, not withColumn.
Something like this (not tested):
df = df.withColumn("NuevaCol_A", df["AverageTemperature"] - df["AverageTemperatureUncertainty"])
df = df.withColumn("NuevaCol_B", df["AverageTemperature"] + df["AverageTemperatureUncertainty"])
dfPrep = dfCleanYear.join(df, "KEY")
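Alternatively, if dfCleanYear itself already carries the two temperature columns (the original one-liner hints at this, since it calls withColumn on dfCleanYear), a sketch that skips the join entirely:

dfPrep = (dfCleanYear
          .withColumn("NuevaCol_A", dfCleanYear["AverageTemperature"] - dfCleanYear["AverageTemperatureUncertainty"])
          .withColumn("NuevaCol_B", dfCleanYear["AverageTemperature"] + dfCleanYear["AverageTemperatureUncertainty"]))
dfPrep.show()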
What is the rule/process when a function is called through pandas apply() via a lambda vs. not? Examples below. Without a lambda, apparently the entire series (df[column name]) is passed to the test function, which throws an error trying to do a boolean operation on a series.
If the same function is called via a lambda, it works: apply iterates over the rows, each row is passed as x, and x[column name] returns a single value for that column in the current row.
It's like lambda is removing a dimension. Anyone have an explanation or point to the specific doc on this? Thanks.
Example 1 with lambda, works OK
print("probPredDF columns:", probPredDF.columns)
def test( x, y):
if x==y:
r = 'equal'
else:
r = 'not equal'
return r
probPredDF.apply( lambda x: test( x['yTest'], x[ 'yPred']), axis=1 ).head()
Example 1 output
probPredDF columns: Index([0, 1, 'yPred', 'yTest'], dtype='object')
Out[215]:
0 equal
1 equal
2 equal
3 equal
4 equal
dtype: object
Example 2 without lambda, throws boolean operation on series error
print("probPredDF columns:", probPredDF.columns)
def test( x, y):
if x==y:
r = 'equal'
else:
r = 'not equal'
return r
probPredDF.apply( test( probPredDF['yTest'], probPredDF[ 'yPred']), axis=1 ).head()
Example 2 output
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
There is nothing magic about a lambda. Lambdas are just functions that can be defined inline and have no name; you can use a named function wherever a lambda is expected, but it also has to take a single parameter. The real difference in Example 2 is that test(probPredDF['yTest'], probPredDF['yPred']) is evaluated immediately, before apply ever runs, so x == y compares two whole Series and raises the ambiguous-truth-value error. You need to do something like...
Define it as:
def wrapper(x):
    return test(x['yTest'], x['yPred'])
Use it as:
probPredDF.apply(wrapper, axis=1)
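A self-contained demo of the wrapper pattern on made-up data (hypothetical values, same column names as above):

import pandas as pd

probPredDF = pd.DataFrame({'yTest': [1, 0, 1], 'yPred': [1, 1, 1]})

def test(x, y):
    if x == y:
        return 'equal'
    return 'not equal'

def wrapper(x):
    return test(x['yTest'], x['yPred'])

print(probPredDF.apply(wrapper, axis=1))
# 0        equal
# 1    not equal
# 2        equal
# dtype: object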