Error in using np.NaN is vectorize functions - pandas

I am using Python 3 on 64bit Win1o. I had issues with the following simple function:
def skudiscounT(t):
s = t.find("ITEMADJ")
if s >= 0:
t = t[s + 8:]
if t.find("-") == 2:
return t
else:
return np.nan # if change to "" it will work fine!
I tried to use this function in np.Vectorize and got the following error:
Traceback (most recent call last):
File "C:/Users/lz09/Desktop/P3/SODetails_Clean_V1.py", line 45, in <module>
SO["SKUDiscount"] = np.vectorize(skudiscounT)(SO['Description'])
File "C:\PD\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 2739, in __call__
return self._vectorize_call(func=func, args=vargs)
File "C:\PD\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 2818, in _vectorize_call
res = array(outputs, copy=False, subok=True, dtype=otypes[0])
ValueError: could not convert string to float: '23-126-408'
When I replace the last line [return np.nan] to [return ''] it worked fine. Anyone know why this is case? Thanks!

Without otypes the dtype of the return array is determined by the first trial result:
In [232]: f = np.vectorize(skudiscounT)
In [234]: f(['abc'])
Out[234]: array([ nan])
In [235]: _.dtype
Out[235]: dtype('float64')
I'm trying to find an argument that returns a string. It looks like your function can also return None.
From the docs:
The data type of the output of vectorized is determined by calling
the function with the first element of the input. This can be avoided
by specifying the otypes argument.
With otypes:
In [246]: f = np.vectorize(skudiscounT, otypes=[object])
In [247]: f(['abc', '23-126ITEMADJ408'])
Out[247]: array([nan, None], dtype=object)
In [248]: f = np.vectorize(skudiscounT, otypes=['U10'])
In [249]: f(['abc', '23-126ITEMADJ408'])
Out[249]:
array(['nan', 'None'],
dtype='<U4')
But for returning a generic object dtype, I'd use the slightly faster:
In [250]: g = np.frompyfunc(skudiscounT, 1,1)
In [251]: g(['abc', '23-126ITEMADJ408'])
Out[251]: array([nan, None], dtype=object)
So what kind of array do you want? float that can hold np.nan, string? or object that can hold 'anything'.

Related

different dimension numpy array broadcasting issue with '+=' operator

I'm new to numpy, and I have an interesting observation on the broadcasting. When I'm adding a 3x5 array directly to a 3x1 array, and update the original 3x1 array with the result, there is no broadcasting issue.
import numpy as np
total = np.random.uniform(-1,1, size=(3))[:,np.newaxis]
print(f'init = \n {total}')
for i in range(3):
total = total + np.ones(shape=(3,5))
print(f'total_{i} = \n {total}')
However, if i'm using '+=' operator to increment the 3x1 array with the value of 3x5 array, there is a broadcasting issue. May I know which rule of numpy broadcasting did I violate in the latter case?
total = np.random.uniform(-1,1, size=(3))[:,np.newaxis]
print(f'init = \n {total}')
for i in range(3):
total += np.ones(shape=(3,5))
print(f'total_{i} = \n {total}')
Thank you!
hawkoli1987
according to add function overridden in numpy array,
def add(x1, x2, *args, **kwargs): # real signature unknown; NOTE: unreliably restored from __doc__
"""
add(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])
Add arguments element-wise.
Parameters
----------
x1, x2 : array_like
The arrays to be added.
If ``x1.shape != x2.shape``, they must be broadcastable to a common
shape (which becomes the shape of the output).
out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have
a shape that the inputs broadcast to. If not provided or None,
a freshly-allocated array is returned. A tuple (possible only as a
keyword argument) must have length equal to the number of outputs.
add function returns a freshly-allocated array when dimensions of arrays are different.
In python, a=a+b and a+=b aren't absolutly same. + calls __add__ function and += calls __iadd__.
a = np.array([1, 2])
b = np.array([3, 4])
first_id = id(a)
a = a + b
second_id = id(a)
assert first_id == second_id # False
a = np.array([1, 2])
b = np.array([3, 4])
first_id = id(a)
a += b
second_id = id(a)
assert first_id == second_id # True
+= function does not create new objects and updates the value to the same address.
numpy add function updates an existing instance when adding an array of the same dimensions, but returns a new object when adding arrays of different dimensions. So when use += functions, the two functions must have the same dimension because the results of the add function must be updated on the same object.
For example,
a = np.array()
total = np.random.uniform(-1,1, size=(3))[:,np.newaxis]
print(id(total))
for i in range(3):
total += np.ones(shape=(3,1))
print(id(total))
id(total) are all same because add function just updates the instance in same address because dimmension of two arrays are same.
In [29]: arr = np.zeros((1,3))
The actual error message is:
In [30]: arr += np.ones((2,3))
Traceback (most recent call last):
Input In [30] in <cell line: 1>
arr += np.ones((2,3))
ValueError: non-broadcastable output operand with shape (1,3) doesn't match the broadcast shape (2,3)
I read that as say that arr on the left is "non-broadcastable", where as arr+np.ones((2,3)) is the result of broadcasting. The wording may be awkward; it's probably produced in some deep compiled function where it makes more sense.
We get a variant on this when we try to assign an array to a slice of an array:
In [31]: temp = arr + np.ones((2,3))
In [32]: temp.shape
Out[32]: (2, 3)
In [33]: arr[:] = temp
Traceback (most recent call last):
Input In [33] in <cell line: 1>
arr[:] = temp
ValueError: could not broadcast input array from shape (2,3) into shape (1,3)
This is clearer, saying that the RHS (2,3) cannot be put into the LHS (1,3) slot.
Or trying to put the (2,3) into one "row" of arr:
In [35]: arr[0] = temp
Traceback (most recent call last):
Input In [35] in <cell line: 1>
arr[0] = temp
ValueError: could not broadcast input array from shape (2,3) into shape (3,)
arr[0] = arr works because it tries to put a (1,3) into a (3,) shape - that's a workable broadcasting combination.
arr[0] = arr.T tries to put a (3,1) into a (3,), and fails.

IndexError: invalid index to scalar variable error

For each gene, I want to perform McNemar's test and then evaluate if the p-value > 0.05. I want to calculate the number of genes that pass the test.
My code raised IndexError: invalid index to scalar variable.
import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.gof import chisquare_effectsize
from statsmodels.stats.power import GofChisquarePower
def generate_gene_df(gene, n):
df = pd.DataFrame.from_dict(
{"Gene" : gene,
"Cells": (f'Cell{x}' for x in range(1, n+1)),
"Control": np.random.choice([1,0], p=[0.1, 0.9], size=n),
"Experimental": np.random.choice([1,0], p=[0.1+0.1, 0.9-0.1], size=n)},
orient='columns'
)
df = df.set_index(["Gene","Cells"])
return df
table = pd.crosstab([df["Gene"], df["Cells"]],
[df["Control"], df["Experimental"]]).to_numpy()
# List of simulated genes
gene_df_list = [generate_gene_df(gene, n) for gene in gene_list]
df = pd.concat(gene_df_list)
df = df.reset_index()
alpha=0.05
p_adjusted=[]
pass_test = []
# McNemar's test
result = mcnemar(table, exact=True)
# Bonferroni correction
p_adjusted = multipletests(result.pvalue, alpha=alpha, method="bonferroni")
for index, value in np.ndenumerate(table):
if result.pvalue > alpha:
np.append(pass_test, result.pvalue[index])
num_passed = len(pass_test)
print("Number of genes that failed to reject H0 is: %i" %num_passed)
Traceback:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/tmp/ipykernel_593/1521754442.py in <module>
11 for index, value in np.ndenumerate(table):
12 if result.pvalue > alpha:
---> 13 np.append(pass_test, result.pvalue[index])
14
15
IndexError: invalid index to scalar variable.
I haven't looked at mcnemar, so don't know what it produces. But the error tells us/youthat result.pvalue is a scalar value, and not indexable.
if result.pvalue > alpha:
np.append(pass_test, result.pvalue[index])
I also can deduce that from the fact that the if line works. It would return an ambiguity error if result.pvalue was an array.
Besides rereading the mcnemar docs, I'd suggest rereading the np.append docs (assuming you even did that!).
Generally we discourage use of np.append in a loop. np.append is not a list append clone

ValueError: invalid literal for int() with base 10: 'O'

I am relatively new to python, and as such I don't always understand why I get errors. I keep getting this error:
Traceback (most recent call last):
File "python", line 43, in <module>
ValueError: invalid literal for int() with base 10: 'O'
This is the line it's referring to:
np.insert(arr, [i,num], "O")
I'm trying to change a value in a numpy array.
Some code around this line for context:
hOne = [one,two,three]
hTwo = [four,five,six]
hThree = [seven, eight, nine]
arr = np.array([hOne, hTwo, hThree])
test = "O"
while a != Answer :
Answer = input("Please Enter Ready to Start")
if a == Answer:
while win == 0:
for lists in arr:
print(lists)
place = int(input("Choose a number(Use arabic numerals 1,5 etc.)"))
for i in range(0,len(arr)):
for num in range(0, len(arr[i])):
print(arr[i,num], "test")
print(arr)
if place == arr[i,num]:
if arr[i,num]:
np.delete(arr, [i,num])
np.insert(arr, [i,num], "O")
aiTurn = 1
else:
print(space_taken)
The number variables in the lists just hold the int version of themselves, so one = 1, two = 2 three = 3, etc
I've also tried holding "O" as a variable and changing it that way as well.
Can anyone tell me why I'm getting this error?

Computing Edit Distance (feed_dict error)

I've written some code in Tensorflow to compute the edit-distance between one string and a set of strings. I can't figure out the error.
import tensorflow as tf
sess = tf.Session()
# Create input data
test_string = ['foo']
ref_strings = ['food', 'bar']
def create_sparse_vec(word_list):
num_words = len(word_list)
indices = [[xi, 0, yi] for xi,x in enumerate(word_list) for yi,y in enumerate(x)]
chars = list(''.join(word_list))
return(tf.SparseTensor(indices, chars, [num_words,1,1]))
test_string_sparse = create_sparse_vec(test_string*len(ref_strings))
ref_string_sparse = create_sparse_vec(ref_strings)
sess.run(tf.edit_distance(test_string_sparse, ref_string_sparse, normalize=True))
This code works and when run, it produces the output:
array([[ 0.25],
[ 1. ]], dtype=float32)
But when I attempt to do this by feeding the sparse tensors in through sparse placeholders, I get an error.
test_input = tf.sparse_placeholder(dtype=tf.string)
ref_input = tf.sparse_placeholder(dtype=tf.string)
edit_distances = tf.edit_distance(test_input, ref_input, normalize=True)
feed_dict = {test_input: test_string_sparse,
ref_input: ref_string_sparse}
sess.run(edit_distances, feed_dict=feed_dict)
Here is the error traceback:
Traceback (most recent call last):
File "<ipython-input-29-4e06de0b7af3>", line 1, in <module>
sess.run(edit_distances, feed_dict=feed_dict)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 597, in _run
for subfeed, subfeed_val in _feed_fn(feed, feed_val):
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 558, in _feed_fn
return feed_fn(feed, feed_val)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 268, in <lambda>
[feed.indices, feed.values, feed.shape], feed_val)),
TypeError: zip argument #2 must support iteration
Any idea what is going on here?
TL;DR: For the return type of create_sparse_vec(), use tf.SparseTensorValue instead of tf.SparseTensor.
The problem here comes from the return type of create_sparse_vec(), which is tf.SparseTensor, and is not understood as a feed value in the call to sess.run().
When you feed a (dense) tf.Tensor, the expected value type is a NumPy array (or certain objects that can be converted to an array). When you feed a tf.SparseTensor, the expected value type is a tf.SparseTensorValue, which is similar to a tf.SparseTensor but its indices, values, and shape properties are NumPy arrays (or certain objects that can be converted to arrays, like the lists in your example.
The following code should work:
def create_sparse_vec(word_list):
num_words = len(word_list)
indices = [[xi, 0, yi] for xi,x in enumerate(word_list) for yi,y in enumerate(x)]
chars = list(''.join(word_list))
return tf.SparseTensorValue(indices, chars, [num_words,1,1])

TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')

Strange error from numpy via matplotlib when trying to get a histogram of a tiny toy dataset. I'm just not sure how to interpret the error, which makes it hard to see what to do next.
Didn't find much related, though this nltk question and this gdsCAD question are superficially similar.
I intend the debugging info at bottom to be more helpful than the driver code, but if I've missed something, please ask. This is reproducible as part of an existing test suite.
if n > 1:
return diff(a[slice1]-a[slice2], n-1, axis=axis)
else:
> return a[slice1]-a[slice2]
E TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
../py2.7.11-venv/lib/python2.7/site-packages/numpy/lib/function_base.py:1567: TypeError
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> py2.7.11-venv/lib/python2.7/site-packages/numpy/lib/function_base.py(1567)diff()
-> return a[slice1]-a[slice2]
(Pdb) bt
[...]
py2.7.11-venv/lib/python2.7/site-packages/matplotlib/axes/_axes.py(5678)hist()
-> m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
py2.7.11-venv/lib/python2.7/site-packages/numpy/lib/function_base.py(606)histogram()
-> if (np.diff(bins) < 0).any():
> py2.7.11-venv/lib/python2.7/site-packages/numpy/lib/function_base.py(1567)diff()
-> return a[slice1]-a[slice2]
(Pdb) p numpy.__version__
'1.11.0'
(Pdb) p matplotlib.__version__
'1.4.3'
(Pdb) a
a = [u'A' u'B' u'C' u'D' u'E']
n = 1
axis = -1
(Pdb) p slice1
(slice(1, None, None),)
(Pdb) p slice2
(slice(None, -1, None),)
(Pdb)
I got the same error, but in my case I am subtracting dict.key from dict.value. I have fixed this by subtracting dict.value for corresponding key from other dict.value.
cosine_sim = cosine_similarity(e_b-e_a, w-e_c)
here I got error because e_b, e_a and e_c are embedding vector for word a,b,c respectively. I didn't know that 'w' is string, when I sought out w is string then I fix this by following line:
cosine_sim = cosine_similarity(e_b-e_a, word_to_vec_map[w]-e_c)
Instead of subtracting dict.key, now I have subtracted corresponding value for key
I had a similar issue where an integer in a row of a DataFrame I was iterating over was of type numpy.int64. I got the
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
error when trying to subtract a float from it.
The easiest fix for me was to convert the row using pd.to_numeric(row).
Why is it applying diff to an array of strings.
I get an error at the same point, though with a different message
In [23]: a=np.array([u'A' u'B' u'C' u'D' u'E'])
In [24]: np.diff(a)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-9d5a62fc3ff0> in <module>()
----> 1 np.diff(a)
C:\Users\paul\AppData\Local\Enthought\Canopy\User\lib\site-packages\numpy\lib\function_base.pyc in diff(a, n, axis)
1112 return diff(a[slice1]-a[slice2], n-1, axis=axis)
1113 else:
-> 1114 return a[slice1]-a[slice2]
1115
1116
TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray'
Is this a array the bins parameter? What does the docs say bins should be?
I am fairly new to this myself, but I had a similar error and found that it is due to a type casting issue. I was trying to concatenate rather than take the difference but I think the principle is the same here. I provided a similar answer on another question so I hope that is OK.
In essence you need to use a different data type cast, in my case I needed str not float, I suspect yours is the same so my suggested solution is. I am sorry I cannot test it before suggesting but I am unclear from your example what you were doing.
return diff(str(a[slice1])-str(a[slice2]), n-1, axis=axis)
Please see my example code below for the fix to my code, the change occurs on the third to last line. The code is to produce a basic random forest model.
import scipy
import math
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing, metrics, cross_validation
Data = pd.read_csv("Free_Energy_exp.csv", sep=",")
Data = Data.fillna(Data.mean()) # replace the NA values with the mean of the descriptor
header = Data.columns.values # Ues the column headers as the descriptor labels
Data.head()
test_name = "Test.csv"
npArray = np.array(Data)
print header.shape
npheader = np.array(header[1:-1])
print("Array shape X = %d, Y = %d " % (npArray.shape))
datax, datay = npArray.shape
names = npArray[:,0]
X = npArray[:,1:-1].astype(float)
y = npArray[:,-1] .astype(float)
X = preprocessing.scale(X)
XTrain, XTest, yTrain, yTest = cross_validation.train_test_split(X,y, random_state=0)
# Predictions results initialised
RFpredictions = []
RF = RandomForestRegressor(n_estimators = 10, max_features = 5, max_depth = 5, random_state=0)
RF.fit(XTrain, yTrain) # Train the model
print("Training R2 = %5.2f" % RF.score(XTrain,yTrain))
RFpreds = RF.predict(XTest)
with open(test_name,'a') as fpred :
lenpredictions = len(RFpreds)
lentrue = yTest.shape[0]
if lenpredictions == lentrue :
fpred.write("Names/Label,, Prediction Random Forest,, True Value,\n")
for i in range(0,lenpredictions) :
fpred.write(RFpreds[i]+",,"+yTest[i]+",\n")
else :
print "ERROR - names, prediction and true value array size mismatch."
This leads to an error of;
Traceback (most recent call last):
File "min_example.py", line 40, in <module>
fpred.write(RFpreds[i]+",,"+yTest[i]+",\n")
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
The solution is to make each variable a str() type on the third to last line then write to file. No other changes to then code have been made from the above.
import scipy
import math
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing, metrics, cross_validation
Data = pd.read_csv("Free_Energy_exp.csv", sep=",")
Data = Data.fillna(Data.mean()) # replace the NA values with the mean of the descriptor
header = Data.columns.values # Ues the column headers as the descriptor labels
Data.head()
test_name = "Test.csv"
npArray = np.array(Data)
print header.shape
npheader = np.array(header[1:-1])
print("Array shape X = %d, Y = %d " % (npArray.shape))
datax, datay = npArray.shape
names = npArray[:,0]
X = npArray[:,1:-1].astype(float)
y = npArray[:,-1] .astype(float)
X = preprocessing.scale(X)
XTrain, XTest, yTrain, yTest = cross_validation.train_test_split(X,y, random_state=0)
# Predictions results initialised
RFpredictions = []
RF = RandomForestRegressor(n_estimators = 10, max_features = 5, max_depth = 5, random_state=0)
RF.fit(XTrain, yTrain) # Train the model
print("Training R2 = %5.2f" % RF.score(XTrain,yTrain))
RFpreds = RF.predict(XTest)
with open(test_name,'a') as fpred :
lenpredictions = len(RFpreds)
lentrue = yTest.shape[0]
if lenpredictions == lentrue :
fpred.write("Names/Label,, Prediction Random Forest,, True Value,\n")
for i in range(0,lenpredictions) :
fpred.write(str(RFpreds[i])+",,"+str(yTest[i])+",\n")
else :
print "ERROR - names, prediction and true value array size mismatch."
These examples are from a larger code so I hope the examples are clear enough.
I think #James is right. I got stuck by same error while working on Polyval(). And yeah solution is to use the same type of variabes. You can use typecast to cast all variables in the same type.
BELOW IS A EXAMPLE CODE
import numpy
P = numpy.array(input().split(), float)
x = float(input())
print(numpy.polyval(P,x))
here I used float as an output type. so even the user inputs the INT value (whole number). the final answer will be typecasted to float.
I ran into the same issue, but in my case it was just a Python list instead of a Numpy array used. Using two Numpy arrays solved the issue for me.