Error performing np.std on an array - numpy

This is my code; I'm trying to calculate the standard deviation of a list imported from a file, shown below:
import csv
import statistics

b = []
#time=[]
with open('nt.txt') as csvfile:
    data = csv.reader(csvfile, delimiter='\t')
    index = 0
    for line in data:
        b.append(line[1])
        #out=line[0]
        #new=out.split(" ")
        #b.append(new[0])
        #else:break
x = statistics.stdev(b)
print(x)
With b = ['-0,002549', '-0,002040', '-0,001530'] as my output, I get:

...
raise TypeError(msg.format(type(x).__name__)) from None
TypeError: can't convert type 'str' to numerator/denominator

You have to set the type of the numpy array, not the list:

results = np.array([[x], [b]]).astype(np.float32)
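Note that the values in b use a comma as the decimal separator, so a plain float cast would still fail on them. A minimal sketch (assuming every entry follows the '-0,002549' format shown above) that normalizes the separator before computing the deviation:

import numpy as np

b = ['-0,002549', '-0,002040', '-0,001530']
# Swap the comma decimal separator for a dot, then cast to float.
values = np.array([s.replace(',', '.') for s in b], dtype=np.float64)
print(np.std(values, ddof=1))  # ddof=1 matches statistics.stdev (sample std)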

Related

Attribute Error when transforming a Pyspark dataframe into a Pandas dataframe

I am trying to implement PageRank using PySpark. In one step I need to convert a PySpark dataframe into a normal pandas one, but when I run it I get the following error:
An exception was thrown from a UDF: 'AttributeError: 'DataFrame' object has no attribute 'Id''
The code I am using is the following:
n = 0
change = 1
while n <= max_iterations and change > t:
    print("Iteration", n)
    # We create a new function, as we will be changing the pageRankPDF
    new_pagerank_udf = udf(lambda x, y: new_pagerank(x, pageRankPDF, y), DoubleType())
    # We create a new pageRankPDF with the updated page rank value
    NewPageRankDF = ReverseDF.select(ReverseDF["id"],
                                     new_pagerank_udf(ReverseDF["links"], ReverseDF["n_out"]).alias("PR"))
    # We transform the NewPageRankDF to pandas in order to be able to operate with it
    NewPageRankPDF = NewPageRankDF.toPandas()
    # We update the exit conditions
    n += 1
    change = np.linalg.norm(pageRankPDF["PR"] - NewPageRankPDF["PR"])
    # We transform the NewPageRankPDF into the pageRankPDF
    pageRankPDF = NewPageRankPDF
The error points at the line:

NewPageRankPDF = NewPageRankDF.toPandas()

If you can share any insight into what might be causing the error, I would greatly appreciate it.
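One note on where to look: Spark evaluates UDFs lazily, so an AttributeError raised inside new_pagerank only surfaces once toPandas() materializes the result; the failing line is not necessarily where the bug lives. A quick sanity check (a sketch; the 'Id' attribute name is taken from the error message) before building the UDF:

# Verify the pandas frame captured in the UDF closure actually has the
# column the UDF body seems to access ('Id' comes from the error message).
assert 'Id' in pageRankPDF.columns, f"columns are: {list(pageRankPDF.columns)}"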

Pandas Rolling Operation on Categorical column

The code I am trying to execute:
for cat_name in df['movement_state'].cat.categories:
    transformed_df[f'{cat_name} Count'] = grouped_df['movement_state'].rolling(
        rolling_window_size, closed='both'
    ).apply(lambda s, cat=cat_name: s.value_counts()[cat])
    transformed_df[f'{cat_name} Ratio'] = grouped_df['movement_state'].rolling(
        rolling_window_size, closed='both'
    ).apply(lambda s, cat=cat_name: s.value_counts(normalize=True)[cat])
For reproduction purposes just assume the following:
import numpy as np
import pandas as pd
d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3
I want to do rolling window operations on my GroupBy Object. I am selecting the column movement_state beforehand. This column is categorical as shown below.
grouped_df['movement_state'].dtypes
# Output
CategoricalDtype(categories=['moving', 'parking', 'standing'], ordered=False)
If I execute this, I get these error messages:
pandas.core.base.DataError: No numeric types to aggregate
TypeError: cannot handle this type -> category
ValueError: could not convert string to float: 'standing'
Inside this snippet of rolling.py from the pandas source code, I read that the data must be coerced to float64 before it can be processed by the Cython routines:
def _prep_values(self, values: ArrayLike) -> np.ndarray:
    """Convert input to numpy arrays for Cython routines"""
    if needs_i8_conversion(values.dtype):
        raise NotImplementedError(
            f"ops for {type(self).__name__} for this "
            f"dtype {values.dtype} are not implemented"
        )
    else:
        # GH #12373 : rolling functions error on float32 data
        # make sure the data is coerced to float64
        try:
            if isinstance(values, ExtensionArray):
                values = values.to_numpy(np.float64, na_value=np.nan)
            else:
                values = ensure_float64(values)
        except (ValueError, TypeError) as err:
            raise TypeError(f"cannot handle this type -> {values.dtype}") from err
My question to you
Is it possible to count the values of a categorical column in a pandas DataFrame using the rolling method as I tried to do?
A possible workaround I came up with is to use the codes of the categorical column instead of the string values. But then s.value_counts()[cat] would raise a KeyError if the window I am looking at does not contain every possible value; the sketch below avoids that.
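A minimal sketch of that codes-based workaround, reusing the reproduction setup from the question; counting with a boolean comparison instead of value_counts()[cat] sidesteps the KeyError for windows that lack a category:

import numpy as np
import pandas as pd

d = {'movement_state': pd.Categorical(np.random.choice(['moving', 'standing', 'parking'], 20))}
grouped_df = pd.DataFrame.from_dict(d)
rolling_window_size = 3

# Rolling works on the integer codes because they are numeric.
codes = grouped_df['movement_state'].cat.codes
for code, cat_name in enumerate(grouped_df['movement_state'].cat.categories):
    counts = codes.rolling(rolling_window_size, closed='both').apply(
        lambda s, c=code: (s == c).sum(), raw=True)
    grouped_df[f'{cat_name} Count'] = counts
    # Divide by the number of observations in each window to mimic normalize=True.
    grouped_df[f'{cat_name} Ratio'] = counts / codes.rolling(
        rolling_window_size, closed='both').count()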

Error: missing 1 required positional argument, 'occurred at index time'

I am trying to write each dataframe produced by a two-column groupby to its own sheet of an Excel file in Python.
list_dfs = []
TT = Dataframe.groupby(['change', 'x2'])
for group, name in TT:
    list_dfs.append(group)

writer = pd.ExcelWriter('output.xlsx')

def dt(_, g):
    for _, g in Dataframe.groupby(Dataframe.index):
        print(g)
        _.to_excel(writer, g)
    writer.save()

DT = Dataframe.apply(dt)
It keeps giving me this error:
TypeError: ("dt() missing 1 required positional argument: 'g'", 'occurred at index time')
Your function
def dt(_,g):
takes two arguments.
DataFrame.apply takes a function (or lambda) that takes in only one argument (either a Series or ndarray).
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Try changing the signature of function dt to:
def dt(g):
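A tiny illustration of that one-argument contract, with a hypothetical two-column frame; apply hands the function one Series per column by default:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# The lambda receives a single Series; apply calls it once per column.
print(df.apply(lambda col: col.sum()))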
I just do a small trick, as follows:

Dataframe['x2'] = res
TT = Dataframe.groupby(['change', res])
writer = pd.ExcelWriter('output.xls')
for name, group in TT:
    group.to_excel(writer, sheet_name='Sheet_{}'.format(name))
writer.save()

How to add a column to a numpy matrix?

Here is my code to add an additional column to x_vals, but I keep getting this error:

x_vals = np.array([x[0:4] for x in iris.data])
np.concatenate([x_vals, np.array([x[0] for x in iris.data])], 1)

ValueError: all the input arrays must have same number of dimensions

Can anyone help me?
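The message says the inputs differ in dimensionality: x_vals is 2-D while the appended array is 1-D. A sketch of one fix (assuming iris is scikit-learn's bundled dataset) is to give the new column an explicit second axis before concatenating:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
x_vals = np.array([x[0:4] for x in iris.data])             # shape (150, 4)
# reshape(-1, 1) turns the 1-D array into a single column.
col = np.array([x[0] for x in iris.data]).reshape(-1, 1)   # shape (150, 1)
combined = np.concatenate([x_vals, col], axis=1)           # shape (150, 5)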

Getting Error while performing Undersampling for Sklearn

I am trying to build a random forest classifier for binary classification. My data is imbalanced, hence I am performing undersampling.
from imblearn.under_sampling import RandomUnderSampler
from sklearn import metrics

train = data.drop(['Co_Name', 'Cust_ID', 'Phone', 'Shpr_ID', 'Resi_Cnt', 'Buz_Cnt',
                   'Nearby_Cnt', 'parseNumber', 'removeString', 'Qty', 'bins',
                   'Adj_Addr', 'Resi', 'Weight', 'Resi_Area', 'Lat', 'Lng'], axis=1)
Y = data['Resi']

rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_sample(train, Y)
I am getting the error below:
446 # make sure we actually converted to numeric:
447 if dtype_numeric and array.dtype.kind == "O":
--> 448 array = array.astype(np.float64)
449 if not allow_nd and array.ndim >= 3:
450 raise ValueError("Found array with dim %d. %s expected <= 2."
ValueError: setting an array element with a sequence.
How do I fix this?
Can you share the dataframe, or a sample of it?
This error can mean a lot of things. For example, if you try:

np.asarray(
    [
        [1, 2],
        [2, 3, 4]
    ],
    dtype=float)
You will get:
ValueError: setting an array element with a sequence.
This is because the nested lists have different lengths: NumPy can't build a rectangular array from rows whose column counts don't match.
Your error, though, is probably related to the shapes of train and Y, or to the types held inside train: the under-sampler's fit step performs a conversion that raises this error. Confirm that train contains only numeric, scalar values before calling RandomUnderSampler; a quick check follows.
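A quick way to spot offending columns (a sketch, assuming train is the dataframe built above):

# Columns with dtype 'object' often hold strings or nested sequences,
# which sklearn's array conversion cannot coerce to float64.
object_cols = train.columns[train.dtypes == object]
print(object_cols.tolist())

# Inspect what Python types such a column actually stores.
for col in object_cols:
    print(col, train[col].map(type).unique())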