df.ix not working, what's the right iloc method? - pandas

This is my program:
# n = number of days
def ATR(df, n):
    df['H-L'] = abs(df['High'] - df['Low'])
    df['H-PC'] = abs(df['High'] - df['Close'].shift(1))
    df['L-PC'] = abs(df['Low'] - df['Close'].shift(1))
    df['TR'] = df[['H-L','H-PC','L-PC']].max(axis=1)
    df['ATR'] = np.nan
    df.ix[n-1,'ATR'] = df['TR'][:n-1].mean()
    for i in range(n, len(df)):
        df['ATR'][i] = (df['ATR'][i-1]*(n-1) + df['TR'][i])/n
    return
An error shows up:
'DataFrame' object has no attribute 'ix'
I tried to replace it with iloc:
df.iloc[df.index[n-1],'ATR'] = df['TR'][:n-1].mean()
But this time another error pops up :
only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
How to fix this?

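Converting code is a pain, and we have all been there. .ix was removed in pandas 1.0, and .iloc accepts only integer positions, so it rejects the column name 'ATR'.
df.ix[n-1,'ATR'] = df['TR'][:n-1].mean()
should become
df['ATR'].iloc[n-1] = df['TR'][:n-1].mean()
Because this is chained assignment, it can trigger SettingWithCopyWarning and does not update df when Copy-on-Write is enabled. Selecting the row and the column in a single call avoids that:
df.loc[df.index[n-1], 'ATR'] = df['TR'][:n-1].mean()
Hope this fits the bill.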
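For reference, here is a minimal sketch of the whole function without .ix or chained assignment. It assumes the column names 'High', 'Low' and 'Close' from the question. The function name atr, the choice to return a copy, and seeding the first ATR value with the mean of the first n true ranges (the question slices [:n-1]) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def atr(df, n):
    """Return a copy of df with 'TR' and 'ATR' columns added (sketch)."""
    prev_close = df["Close"].shift(1)
    # True range: the largest of the three candidate ranges per row.
    # max(axis=1) skips the NaN produced by shift(1) in the first row.
    tr = pd.concat(
        [
            (df["High"] - df["Low"]).abs(),
            (df["High"] - prev_close).abs(),
            (df["Low"] - prev_close).abs(),
        ],
        axis=1,
    ).max(axis=1)

    tr_values = tr.to_numpy()
    out = np.full(len(df), np.nan)
    # Assumption: seed with the mean of the first n true ranges.
    out[n - 1] = tr_values[:n].mean()
    # Smoothed average, same recurrence as in the question.
    for i in range(n, len(df)):
        out[i] = (out[i - 1] * (n - 1) + tr_values[i]) / n

    result = df.copy()
    result["TR"] = tr
    result["ATR"] = out
    return result
```

Working on a plain NumPy array inside the loop sidesteps the df['ATR'][i] = ... pattern entirely, which is the other chained assignment in the original program.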

Related

Python - Slicing an Array of float

I have two 1-D arrays of floats ('Xdata' and 'tdata') and I want to compute a new variable named 'ratedata'. When I run the code, the console shows "IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices". How can I fix this problem? Thank you.
The code:
dxdt_a = np.array(pd.read_excel('T50-katalis1-m14.xlsx',index_col=0,header=5))
Xdata = dxdt_a[:,1]
tdata = dxdt_a[:,0]
ratedata = np.zeros(len(Xdata))
for i in ratedata:
    ratedata[i] = (Xdata[i+1]-Xdata[i])/(tdata[1]-tdata[0])
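The loop above iterates over the values of ratedata (floats such as 0.0), not over positions, and a float cannot be used as an index. A minimal sketch of two ways around this follows. The sample arrays are hypothetical stand-ins for the Excel columns, and leaving the last entry at zero is an assumption, since there is no Xdata[i+1] for the final point.

```python
import numpy as np

# Hypothetical sample data standing in for the two Excel columns.
tdata = np.array([0.0, 1.0, 2.0, 3.0])
Xdata = np.array([0.0, 0.5, 0.8, 0.9])
dt = tdata[1] - tdata[0]  # constant time step, as in the question

# Option 1: loop over integer positions, stopping one short of the end.
ratedata = np.zeros(len(Xdata))
for i in range(len(Xdata) - 1):
    ratedata[i] = (Xdata[i + 1] - Xdata[i]) / dt

# Option 2: the same forward difference without a Python loop.
ratedata_vec = np.zeros(len(Xdata))
ratedata_vec[:-1] = np.diff(Xdata) / dt
```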

How to resolve an error using apply to create a new column in pandas?

I am trying to create a new column with a function that converts positions given as strings (degrees, minutes and a direction symbol) into decimal numbers.
The column is:
Latitud
45º27.19'N
45º17,4'N
46º18,8'N
45º19.54'N
45º32.47'N
....
def formatear(x):
    deg, minutes, direction = re.split('[º\']', x)
    valor = float(deg) + float(minutes.replace(",","."))/60 * (-1 if direction in ['W', 'S'] else 1)
    return valor
I apply the function to create a new column:
df["LatitudDec"] = df["Latitud"].apply(formatear)
When I apply the function, the error is:
ValueError: not enough values to unpack (expected 3, got 2)
The question does not provide enough information to be answered properly. The error means that, for at least one row, re.split returned only two pieces, so that value is missing one of the two separators (º or '). Here is a small modification to track down which values cause it:
def formatear(x):
    try:
        deg, minutes, direction = re.split('[º\']', x)
        # Apply the sign to the whole value, not only to the minutes.
        valor = (float(deg) + float(minutes.replace(",","."))/60) * (-1 if direction in ['W', 'S'] else 1)
    except ValueError:
        print(f'There has been an error associated with the Latitud {x}')
        valor = np.nan
    return valor
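Once the offending values are known, a stricter parser can help. The sketch below validates each string against a regular expression and returns NaN when it does not match. The name to_decimal_degrees and the decision to accept the degree sign ° as well as º are assumptions, since the question does not show the rows that fail.

```python
import re

import numpy as np

# Degrees, a º or ° separator, minutes with '.' or ',' decimals, ', direction.
PATTERN = re.compile(
    r"^\s*(\d+)\s*[º°]\s*(\d+(?:[.,]\d+)?)\s*'\s*([NSEW])\s*$"
)


def to_decimal_degrees(text):
    """Convert e.g. "45º27.19'N" to decimal degrees, or NaN if malformed."""
    match = PATTERN.match(str(text))
    if match is None:
        return np.nan
    deg, minutes, direction = match.groups()
    value = float(deg) + float(minutes.replace(",", ".")) / 60
    # South and west are negative by convention.
    return -value if direction in ("S", "W") else value
```

It can be used the same way as before, for example df["LatitudDec"] = df["Latitud"].apply(to_decimal_degrees), and the NaN rows can then be inspected separately.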

Facing an IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I have been working on a link prediction problem in which the data set, a NumPy array, has to be parsed and stored in another NumPy array. When I try to do this, line 9 throws an IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices. I even tried casting the indices with int, but that does not seem to work. What am I missing here?
1. train_edges, test_edges, = train_test_split(edgeL,test_size=0.3,random_state=16)
2. out_dim = int(W_out.shape[1])
3. in_dim = int(W_in.shape[1])
4. train_x = np.zeros((len(train_edges), (out_dim + in_dim) * 2))
5. train_y = np.zeros((len(train_edges), 1))
6. for i, edge in enumerate(train_edges):
7.     u = edge[0]
8.     v = edge[1]
9.     train_x[int(i), : int(out_dim)] = W_out[u]
10.     train_x[int(i), int(out_dim): int(out_dim + in_dim)] = W_in[u]
11.     train_x[i, out_dim + in_dim: out_dim * 2 + in_dim] = W_out[v]
12.     train_x[i, out_dim * 2 + in_dim:] = W_in[v]
13.     if edge[2] > 0:
14.         train_y[i] = 1
15.     else:
16.         train_y[i] = -1
EDIT:
For reference, The W_out is a 64-dimensional tuple which looks like this
print(W_out[0])
type(W_out.shape[1])
Output:
[[0.10160154 0. 0.70414263 0.6772633 0.07685234 0.75205046
0.421092 0.1776721 0.8622188 0.15669271 0. 0.40653425
0.5768579 0.75861764 0.6745151 0.37883565 0.18074909 0.73928916
0.6289512 0. 0.33160248 0.7441727 0. 0.8810399
0.1110919 0.53732747 0. 0.33330196 0.36220717 0.298112
0.10643011 0.8997948 0.53510064 0.6845873 0.03440218 0.23005858
0.8097505 0.7108275 0.38826624 0.28532124 0.37821335 0.3566149
0.42527163 0.71940386 0.8075657 0.5775364 0.01444144 0.21734199
0.47439903 0.21176265 0.32279345 0.00187511 0.43511534 0.4302601
0.39407462 0.20941389 0.199842 0.8710182 0.2160332 0.30246672
0.27159846 0.19009161 0.32349357 0.08938174]]
int
And edge is a tuple which is from training data set which has source, destination, sign. It looks like this...
train_edges, test_edges, = train_test_split(edgeL,test_size=0.3,random_state=16)
for i, edge in enumerate(train_edges):
    print(edge)
    print(i)
    type(i)
    type(edge)
Output:
Streaming output truncated to the last 5000 lines.
2936
['16936', '17031', '1']
2937
['15307', '14904', '1']
2938
['22852', '13045', '1']
2939
['14291', '96703', '1']
2940
Any help/suggestion is highly appreciated.
Your syntax is causing the error.
It looks like accessing the edge object may be the issue. Debug using type() and len() on edge and its elements to see what triggers the index error.
Explicitly wrapping i in int() is not needed, because enumerate already yields integers, so the issue will be in the values assigned into train_x, or your enumeration logic is not right.
As mentioned by @indigo_4_alpha, the error is caused by the edge[0] element, which is a string.
Code for checking the train_edges
train_edges, test_edges, = train_test_split(edgeL,test_size=0.3,random_state=16)
for i, edge in enumerate(train_edges):
    print(edge)
    print(i)
    print(edge[0], edge[1],edge[2])
    print(type(edge[0]))
Output
['11635' '22046' '1']
2608
11635 22046 1
<class 'str'>
After observing the output, I noticed that edge[0] on its own is a string. Then I realized that the int() casts around i and out_dim have no effect when u itself is a string, because W_out[u] is still indexed with a str.
So I cast u = edge[0] to u = int(edge[0]), and likewise for v, in lines 7 and 8 of the code, as shown below.
Master code for Train and test data split
1. train_edges, test_edges, = train_test_split(edgeL,test_size=0.3,random_state=16)
2. out_dim = int(W_out.shape[1])
3. in_dim = int(W_in.shape[1])
4. train_x = np.zeros((len(train_edges), (out_dim + in_dim) * 2))
5. train_y = np.zeros((len(train_edges), 1))
6. for i, edge in enumerate(train_edges):
7.     u = int(edge[0])
8.     v = int(edge[1])
Thank you one and all for sparing your time and giving me your valuable suggestions.
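To illustrate the diagnosis, here is a minimal sketch that reproduces the error on a small array and shows two ways to convert the string ids. The array W_out below is a hypothetical stand-in for the real embedding matrix, and the edge values are made up.

```python
import numpy as np

# Hypothetical stand-in for the embedding matrix: 4 nodes, 3 dimensions.
W_out = np.arange(12, dtype=float).reshape(4, 3)
edge = ["2", "0", "1"]  # source, destination, sign, all stored as strings

# Indexing with a str raises the IndexError from the question.
try:
    W_out[edge[0]]
    raised = False
except IndexError:
    raised = True

# Converting the ids to int first makes the lookup valid.
u, v = int(edge[0]), int(edge[1])
row_u = W_out[u]

# Alternative: convert the whole edge list once, before the loop.
edges = np.array([["2", "0", "1"], ["1", "3", "-1"]]).astype(int)
labels = np.where(edges[:, 2] > 0, 1, -1)
```

Converting the whole edge list up front also makes the later comparison edge[2] > 0 operate on integers rather than on strings.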

Converting a PySpark DataFrame fails on a 'NoneType' object

I have a PySpark DataFrame 'data3' with many columns. I am trying to run k-means on all of them except the first two, but when I run my code the tasks always fail with TypeError: float() argument must be a string or a number, not 'NoneType'. What am I doing wrong?
def f(x):
    rel = {}
    #rel['features'] = Vectors.dense(float(x[0]),float(x[1]),float(x[2]),float(x[3]))
    rel['features'] = Vectors.dense(float(x[2]),float(x[3]),float(x[4]),float(x[5]),float(x[6]),float(x[7]),float(x[8]),float(x[9]),float(x[10]),float(x[11]),float(x[12]),float(x[13]),float(x[14]),float(x[15]),float(x[16]),float(x[17]),float(x[18]),float(x[19]),float(x[20]),float(x[21]),float(x[22]),float(x[23]),float(x[24]),float(x[25]),float(x[26]),float(x[27]),float(x[28]),float(x[29]),float(x[30]),float(x[31]),float(x[32]),float(x[33]),float(x[34]),float(x[35]),float(x[36]),float(x[37]),float(x[38]),float(x[39]),float(x[40]),float(x[41]),float(x[42]),float(x[43]),float(x[44]),float(x[45]),float(x[46]),float(x[47]),float(x[48]),float(x[49]))
    return rel
data= data3.rdd.map(lambda p: Row(**f(p))).toDF()
kmeansmodel = KMeans().setK(7).setFeaturesCol('features').setPredictionCol('prediction').fit(data)
TypeError: float() argument must be a string or a number, not 'NoneType'
Your error comes from converting each x[i] to float: you probably have missing values (None) in some rows.
    rel['features'] = Vectors.dense(float(x[2]),float(x[3]),float(x[4]),float(x[5]),float(x[6]),float(x[7]),float(x[8]),float(x[9]),float(x[10]),float(x[11]),float(x[12]),float(x[13]),float(x[14]),float(x[15]),float(x[16]),float(x[17]),float(x[18]),float(x[19]),float(x[20]),float(x[21]),float(x[22]),float(x[23]),float(x[24]),float(x[25]),float(x[26]),float(x[27]),float(x[28]),float(x[29]),float(x[30]),float(x[31]),float(x[32]),float(x[33]),float(x[34]),float(x[35]),float(x[36]),float(x[37]),float(x[38]),float(x[39]),float(x[40]),float(x[41]),float(x[42]),float(x[43]),float(x[44]),float(x[45]),float(x[46]),float(x[47]),float(x[48]),float(x[49]))
    return rel
You can convert each value to float only when it is not missing, and substitute a default otherwise (0.0 here is an arbitrary placeholder). For example:
    values = [float(v) if v is not None else 0.0 for v in x[2:50]]
    rel['features'] = Vectors.dense(values)
Or drop the rows with missing values before mapping, with data3.dropna(). Note that rel is a plain dict and has no dropna method.
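The per-row check can be sketched in plain Python, without a Spark session. Plain tuples stand in for PySpark Row objects here (Row is a tuple subclass), and the helper name row_to_features and the sample rows are hypothetical.

```python
def row_to_features(row, start=2, stop=None):
    """Return row[start:stop] as floats, or None if any value is missing."""
    values = row[start:stop]
    if any(v is None for v in values):
        return None
    return [float(v) for v in values]


# Hypothetical rows: two leading id columns, then the feature columns.
rows = [
    ("id1", "a", 1, "2.5", 3.0),
    ("id2", "b", None, "4.0", 5.0),
]

# Keep only the rows whose features could all be converted.
features = [f for f in (row_to_features(r) for r in rows) if f is not None]
```

Skipping incomplete rows rather than filling them with a placeholder keeps invented values out of the clustering, at the cost of losing those rows.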

genfromtxt in Python-3.5

I am trying to fix a data set using genfromtxt in Python 3.5, but I keep getting the following error:
ndtype = np.dtype(dict(formats=ndtype, names=names))
TypeError: data type not understood
This is the code I'm using. Any help will be appreciated!
names = ["country", "year"]
names.extend(["col%i" % (idx+1) for idx in range(682)])
dtype = "S64,i4" + ",".join(["f18" for idx in range(682)])
dataset = np.genfromtxt(data_file, dtype=dtype, names=names, delimiter=",", skip_header=1, autostrip=2)
dtype = "S64,i4" + ",".join(["f18" for idx in range(682)])
is going to produce something like:
S64,i4f18,f18,f18,f18...
Note the lack of a comma after the i4. Also, f18 is not a valid NumPy type code; an 8-byte float is f8.
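One way to avoid the missing comma is to build the whole list of type codes first and join it once. The sketch below does that and checks the result on a tiny in-memory file. The helper name build_dtype, the use of f8 for the value columns, and the sample data are assumptions. The real file would use 682 value columns instead of 3.

```python
from io import StringIO

import numpy as np


def build_dtype(n_value_cols):
    """Comma-separated dtype string: one S64, one i4, then n f8 columns."""
    return ",".join(["S64", "i4"] + ["f8"] * n_value_cols)


n_cols = 3  # the question uses 682
names = ["country", "year"] + ["col%i" % (idx + 1) for idx in range(n_cols)]
dtype = build_dtype(n_cols)

# Hypothetical sample standing in for data_file.
sample = StringIO(
    "country,year,col1,col2,col3\n"
    "Chile,2001,1.5,2.5,3.5\n"
    "Peru,2002,4.5,5.5,6.5\n"
)
dataset = np.genfromtxt(
    sample, dtype=dtype, names=names, delimiter=",",
    skip_header=1, autostrip=True,
)
```

Joining a single list means every type code is separated the same way, so adding or removing a leading column cannot silently fuse two codes.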