Confusing error using factorial with numpy or scipy arrays

The following code identifies prime numbers:
import math
import numpy as np
import scipy
for x in range(20, 50):
    print(x, math.factorial(x - 1) % x)
The output is:
20 0
21 0
22 0
23 22
24 0
25 0
26 0
27 0
28 0
29 28
30 0
31 30
32 0
33 0
34 0
35 0
36 0
37 36
38 0
39 0
40 0
41 40
42 0
43 42
44 0
45 0
46 0
47 46
48 0
49 0
The residue of the calculation is non-zero exactly for the primes (by Wilson's theorem, (n-1)! ≡ -1 mod n when n is prime, while (n-1)! ≡ 0 mod n for every composite n > 4).
When I try to do the same calculation with arrays, my results are different.
from scipy.special import factorial
arr = np.arange(20, 50)
modFactArr = factorial(arr - 1) % arr
print(np.column_stack((arr, modFactArr)), '\n\n')
smodFactArr = scipy.special.factorial(arr - 1) % arr
print(np.column_stack((arr, smodFactArr)), '\n\n')
gives
[[20. 0.]
[21. 0.]
[22. 0.]
[23. 22.]
[24. 0.]
[25. 22.]
[26. 16.]
[27. 11.]
[28. 12.]
[29. 16.]
[30. 24.]
[31. 8.]
[32. 0.]
[33. 18.]
[34. 16.]
[35. 33.]
[36. 20.]
[37. 12.]
[38. 20.]
[39. 10.]
[40. 0.]
[41. 25.]
[42. 26.]
[43. 6.]
[44. 4.]
[45. 3.]
[46. 36.]
[47. 40.]
[48. 0.]
[49. 12.]]
[[20. 0.]
[21. 0.]
[22. 0.]
[23. 22.]
[24. 0.]
[25. 22.]
[26. 16.]
[27. 11.]
[28. 12.]
[29. 16.]
[30. 24.]
[31. 8.]
[32. 0.]
[33. 18.]
[34. 16.]
[35. 33.]
[36. 20.]
[37. 12.]
[38. 20.]
[39. 10.]
[40. 0.]
[41. 25.]
[42. 26.]
[43. 6.]
[44. 4.]
[45. 3.]
[46. 36.]
[47. 40.]
[48. 0.]
[49. 12.]]
Notice that composite numbers like 26, 27, and 28 now give non-zero residues. Is this an error in my code, or is there a reason that scipy and numpy do modulo arithmetic differently?

It's happening because of the NumPy dtype: the array computation is done in fixed-width machine types instead of Python's arbitrary-precision int, so large factorials overflow (an np.int64 tops out at 2**63 - 1, which anything past 20! exceeds), while a native Python int cannot overflow.
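A quick check of where the overflow starts:
import math
print(math.factorial(20) <= 2**63 - 1)   # True: 20! still fits in int64
print(math.factorial(21) <= 2**63 - 1)   # False: 21! already exceeds int64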

From the docs:
With exact=False the factorial is approximated using the gamma function
exact=False is the default.
You can tell the result is approximated because it is a float (hence the .0), and floats cannot represent integers exactly beyond 2**53.
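To keep exact integer arithmetic with array input you can pass exact=True instead. A sketch; with array input this returns an object array of arbitrary-precision Python ints, so the modulo comes out exact and only the primes show non-zero residues again:
import numpy as np
from scipy.special import factorial

arr = np.arange(20, 50)
exactModFactArr = factorial(arr - 1, exact=True) % arr   # exact ints, no overflow
print(np.column_stack((arr, exactModFactArr)))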

Is it possible to vectorize the following function?

I have the following function that I would like to vectorize in order to speed it up. The function takes a column vector of arbitrary length and must output a single number. The function is:
import numpy as np

xs = np.array([0.3, 5.01, 13.5, -1.01, 66.0, -101.6, 41.44, 111.0, 36.2, 9.0])
def func(xs):
    fitness = xs[0]
    for i in xs:
        if np.abs(i) > fitness:
            fitness = np.abs(i - 30)
    return fitness
Expected output: 131.6
Let's rewrite the function to better visualize the action:
In [221]: xs = np.array([0.3, 5.01, 13.5, -1.01, 66.0, -101.6, 41.44, 11.0, 36.2, 9.0])
     ...: def func(xs):
     ...:     x1 = np.abs(xs); x2 = np.abs(xs - 30)
     ...:     print(x1); print(x2)
     ...:     fitness = x1[0]
     ...:     for i, j in zip(x1, x2):
     ...:         if i > fitness:
     ...:             fitness = j
     ...:         print(fitness)
     ...:     return fitness
     ...:
In [222]: xs
Out[222]:
array([ 0.3 , 5.01, 13.5 , -1.01, 66. , -101.6 , 41.44,
11. , 36.2 , 9. ])
In [223]: func(xs)
[ 0.3 5.01 13.5 1.01 66. 101.6 41.44 11. 36.2 9. ]
[ 29.7 24.99 16.5 31.01 36. 131.6 11.44 19. 6.2 21. ]
0.3
24.990000000000002
24.990000000000002
24.990000000000002
36.0
131.6
131.6
131.6
131.6
131.6
Out[223]: 131.6
In [224]: x1 = np.abs(xs); x2 = np.abs(xs - 30)
So the successive values of fitness look a lot like the accumulated maximum of x2:
In [225]: np.maximum.accumulate(x2)
Out[225]:
array([ 29.7 , 29.7 , 29.7 , 31.01, 36. , 131.6 , 131.6 , 131.6 ,
131.6 , 131.6 ])
Or skipping the first value of x2:
In [226]: np.maximum.accumulate(x2[1:])
Out[226]:
array([ 24.99, 24.99, 31.01, 36. , 131.6 , 131.6 , 131.6 , 131.6 ,
131.6 ])
That's not a perfect substitute, but it may give us some ideas of how to make it better.
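Because each update depends on the running fitness value, an exact one-shot vectorization is awkward. If the goal is simply speed, compiling the original loop is another route; a sketch, assuming numba is available:
import numpy as np
from numba import njit

@njit
def func_fast(xs):
    # same sequential logic as the original func, just JIT-compiled
    fitness = xs[0]
    for i in range(xs.shape[0]):
        if abs(xs[i]) > fitness:
            fitness = abs(xs[i] - 30)
    return fitness

xs = np.array([0.3, 5.01, 13.5, -1.01, 66.0, -101.6, 41.44, 111.0, 36.2, 9.0])
print(func_fast(xs))  # 131.6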

Strange result when subtracting two tensors

I try to subtract two tensors and then convert every negative value to zero using the relu function, but I cannot do that, because when I subtract the two tensors, TensorFlow for some reason adds 256 to every negative value!
img = mpimg.imread('/home/moumenshobaky/tensorflow_files/virtualenv/archive/training/Dessert/827.jpg')
img2 = tf.math.floordiv(img,64)*64
img3 = img2-img
# showing an example of the Flatten class and operation
from tensorflow.keras.layers import Flatten
flatten = Flatten(dtype='float32')
print(flatten(img2))
print(img3)
The result is:
tf.Tensor(
[[ 0 0 0 ... 64 64 0]
[ 0 0 0 ... 64 0 0]
[64 64 0 ... 64 0 0]
...
[64 64 64 ... 64 64 64]
[64 64 64 ... 64 64 64]
[64 64 64 ... 64 64 64]], shape=(384, 1536), dtype=uint8)
tf.Tensor(
[[198 197 213 ... 229 252 202]
[194 193 207 ... 235 193 207]
[250 253 198 ... 238 193 207]
...
[227 217 207 ... 218 230 242]
[226 216 206 ... 217 230 239]
[225 215 203 ... 214 227 235]], shape=(384, 1536), dtype=uint8)
I could make it work using tf.keras.utils.img_to_array to convert the image into a numpy array, avoiding the unknown behaviour.
I used this image from, I presume, the same dataset.
img = mpimg.imread('C:/Users/as/Downloads/15.jpg')
img2 = tf.math.floordiv(img,64)*64
# convert to arrays
img = tf.keras.utils.img_to_array(img)
img2 = tf.keras.utils.img_to_array(img2)
img3 = img2-img
# showing an example of the Flatten class and operation
from tensorflow.keras.layers import Flatten
flatten = Flatten(dtype='float32')
print(flatten(img2))
print(img3)
output:
tf.Tensor(
[[192. 128. 64. ... 0. 0. 0.]
[192. 128. 64. ... 0. 0. 0.]
[192. 128. 128. ... 0. 0. 0.]
...
[ 64. 0. 0. ... 64. 0. 0.]
[ 64. 0. 0. ... 64. 0. 0.]
[ 64. 0. 0. ... 64. 0. 0.]], shape=(512, 1536), dtype=float32)
[[[-17. -43. -58.]
[-23. -51. -3.]
[-31. -61. -16.]
...
[-20. -7. -14.]
[-20. -7. -14.]
[-20. -7. -14.]]
[[-19. -45. -60.]
[-25. -53. -5.]
[-33. -63. -18.]
...
[-20. -7. -14.]
[-20. -7. -14.]
[-21. -8. -15.]]
[[-21. -49. -1.]
...
[-32. -31. -11.]
...
[-30. -60. -15.]
[-30. -60. -15.]
[-30. -60. -15.]]]
The problem occurred because the image dtype is uint8, which is unsigned, so it cannot represent negative values: the subtraction wraps around modulo 256 (for example, -58 becomes 198).
So I found out you can also solve the problem with tf.cast(img, dtype=tf.float32).
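A minimal sketch of that cast-based approach, combined with the relu step from the question (tf.nn.relu clamps negatives to zero; the file path is just a placeholder):
import tensorflow as tf
import matplotlib.image as mpimg

img = mpimg.imread('15.jpg')              # loaded as uint8
img = tf.cast(img, dtype=tf.float32)      # signed dtype, so differences can go negative
img2 = tf.math.floordiv(img, 64) * 64
img3 = tf.nn.relu(img2 - img)             # negatives become zero
# note: img2 <= img everywhere, so img2 - img is never positive;
# use img - img2 instead if the remainders are what you want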

Convert a multiway pandas.crosstab to an xarray

I want to create a multiway contingency table from my pandas dataframe and store it in an xarray. It seems to me it ought to be straightforward enough using pandas.crosstab followed by DataFrame.to_xarray(), but I'm getting "TypeError: Cannot interpret 'interval[int64]' as a data type" in pandas v1.1.5. (v1.0.1 gives "ValueError: all arrays must be same length").
In [1]: import numpy as np
...: import pandas as pd
...: pd.__version__
Out[1]: '1.1.5'
In [2]: import xarray as xr
...: xr.__version__
Out[2]: '0.17.0'
In [3]: n = 100
...: np.random.seed(42)
...: x = pd.cut(np.random.uniform(low=0, high=3, size=n), range(5))
...: x
Out[3]:
[(1, 2], (2, 3], (2, 3], (1, 2], (0, 1], ..., (1, 2], (1, 2], (1, 2], (0, 1], (0, 1]]
Length: 100
Categories (4, interval[int64]): [(0, 1] < (1, 2] < (2, 3] < (3, 4]]
In [4]: x.value_counts().sort_index()
Out[4]:
(0, 1] 41
(1, 2] 28
(2, 3] 31
(3, 4] 0
dtype: int64
Note I need my table to include empty categories such as (3, 4].
In [6]: idx=pd.date_range('2001-01-01', periods=n, freq='8H')
...: df = pd.DataFrame({'x': x}, index=idx)
...: df['xlag'] = df.x.shift(1, 'D')
...: df['h'] = df.index.hour
...: xtab = pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize='index')
...: xtab
Out[6]:
x (0, 1] (1, 2] (2, 3] (3, 4]
h xlag
0 (0, 1] 0.000000 0.700000 0.300000 0.0
(1, 2] 0.470588 0.411765 0.117647 0.0
(2, 3] 0.500000 0.333333 0.166667 0.0
(3, 4] 0.000000 0.000000 0.000000 0.0
8 (0, 1] 0.588235 0.000000 0.411765 0.0
(1, 2] 1.000000 0.000000 0.000000 0.0
(2, 3] 0.428571 0.142857 0.428571 0.0
(3, 4] 0.000000 0.000000 0.000000 0.0
16 (0, 1] 0.333333 0.250000 0.416667 0.0
(1, 2] 0.444444 0.222222 0.333333 0.0
(2, 3] 0.454545 0.363636 0.181818 0.0
(3, 4] 0.000000 0.000000 0.000000 0.0
That's fine, but my actual application has more categories and more dimensions, so this seems a clear use-case for xarray, but I get an error:
In [8]: xtab.to_xarray()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-aaedf730bb97> in <module>
----> 1 xtab.to_xarray()
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/pandas/core/generic.py in to_xarray(self)
2818 return xarray.DataArray.from_series(self)
2819 else:
-> 2820 return xarray.Dataset.from_dataframe(self)
2821
2822 @Substitution(returns=fmt.return_docstring)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in from_dataframe(cls, dataframe, sparse)
5131 obj._set_sparse_data_from_dataframe(idx, arrays, dims)
5132 else:
-> 5133 obj._set_numpy_data_from_dataframe(idx, arrays, dims)
5134 return obj
5135
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in _set_numpy_data_from_dataframe(self, idx, arrays, dims)
5062 data = np.zeros(shape, values.dtype)
5063 data[indexer] = values
-> 5064 self[name] = (dims, data)
5065
5066 @classmethod
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in __setitem__(self, key, value)
1427 )
1428
-> 1429 self.update({key: value})
1430
1431 def __delitem__(self, key: Hashable) -> None:
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in update(self, other)
3897 Dataset.assign
3898 """
-> 3899 merge_result = dataset_update_method(self, other)
3900 return self._replace(inplace=True, **merge_result._asdict())
3901
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/merge.py in dataset_update_method(dataset, other)
958 priority_arg=1,
959 indexes=indexes,
--> 960 combine_attrs="override",
961 )
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/merge.py in merge_core(objects, compat, join, combine_attrs, priority_arg, explicit_coords, indexes, fill_value)
609 coerced = coerce_pandas_values(objects)
610 aligned = deep_align(
--> 611 coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value
612 )
613 collected = collect_variables_and_indexes(aligned)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/alignment.py in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
428 indexes=indexes,
429 exclude=exclude,
--> 430 fill_value=fill_value,
431 )
432
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
352 if not valid_indexers:
353 # fast path for no reindexing necessary
--> 354 new_obj = obj.copy(deep=copy)
355 else:
356 new_obj = obj.reindex(
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in copy(self, deep, data)
1218 """
1219 if data is None:
-> 1220 variables = {k: v.copy(deep=deep) for k, v in self._variables.items()}
1221 elif not utils.is_dict_like(data):
1222 raise ValueError("Data must be dict-like")
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in <dictcomp>(.0)
1218 """
1219 if data is None:
-> 1220 variables = {k: v.copy(deep=deep) for k, v in self._variables.items()}
1221 elif not utils.is_dict_like(data):
1222 raise ValueError("Data must be dict-like")
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/variable.py in copy(self, deep, data)
2632 """
2633 if data is None:
-> 2634 data = self._data.copy(deep=deep)
2635 else:
2636 data = as_compatible_data(data)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/indexing.py in copy(self, deep)
1484 # 8000341
1485 array = self.array.copy(deep=True) if deep else self.array
-> 1486 return PandasIndexAdapter(array, self._dtype)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/indexing.py in __init__(self, array, dtype)
1407 dtype_ = array.dtype
1408 else:
-> 1409 dtype_ = np.dtype(dtype)
1410 self._dtype = dtype_
1411
TypeError: Cannot interpret 'interval[int64]' as a data type
I can avoid the error by converting x (and xlag) to a different dtype instead of pandas.Categorical before using pandas.crosstab, but then I lose any empty categories, which I need to keep in my real application.
The issue here is not the use of a CategoricalIndex but that the category labels (x.categories) are an IntervalIndex, which xarray doesn't like.
To remedy this, you can simply replace the categories within your x variable with their string representation, which coerces x.categories to be an "object" dtype instead of an "interval[int64]" dtype:
x = (
pd.cut(np.random.uniform(low=0, high=3, size=n), range(5))
.rename_categories(str)
)
Then calculate your crosstab as you have already done and it should work!
To get your dataset in the coordinates you want (I think), all you need to do is stack everything into a single MultiIndex row shape (instead of the crosstab's MultiIndex-row/Index-column shape).
xtab = (
pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize="index")
.stack()
.reorder_levels(["x", "h", "xlag"])
.sort_index()
)
xtab.to_xarray()
If you want to shorten your code and lose some of the explicit ordering of index levels, you can also use unstack instead of stack which gives you the correct ordering right away:
xtab = (
pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize="index")
.unstack([0, 1])
)
xtab.to_xarray()
Regardless of the stack() vs unstack([0, 1]) approach you use, you get this output:
<xarray.DataArray (x: 4, h: 3, xlag: 4)>
array([[[0. , 0.47058824, 0.5 , 0. ],
[0.58823529, 1. , 0.42857143, 0. ],
[0.33333333, 0.44444444, 0.45454545, 0. ]],
[[0.7 , 0.41176471, 0.33333333, 0. ],
[0. , 0. , 0.14285714, 0. ],
[0.25 , 0.22222222, 0.36363636, 0. ]],
[[0.3 , 0.11764706, 0.16666667, 0. ],
[0.41176471, 0. , 0.42857143, 0. ],
[0.41666667, 0.33333333, 0.18181818, 0. ]],
[[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ]]])
Coordinates:
* x (x) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
* h (h) int64 0 8 16
* xlag (xlag) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
@Cameron-Riddell's answer is the key to my problem, but there are a couple of additional reshaping wrinkles to smooth out. Applying rename_categories(str) to my x variable as he suggests, then proceeding as in my question, allows the final line to work:
In [8]: xtab = pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize='index')
...: xtab.to_xarray()
Out[8]:
<xarray.Dataset>
Dimensions: (h: 3, xlag: 4)
Coordinates:
* h (h) int64 0 8 16
* xlag (xlag) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
Data variables:
(0, 1] (h, xlag) float64 0.0 0.4706 0.5 0.0 ... 0.3333 0.4444 0.4545 0.0
(1, 2] (h, xlag) float64 0.7 0.4118 0.3333 0.0 ... 0.25 0.2222 0.3636 0.0
(2, 3] (h, xlag) float64 0.3 0.1176 0.1667 0.0 ... 0.3333 0.1818 0.0
(3, 4] (h, xlag) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
But I wanted a 3-d array with one variable, not a 2-d array with 3 variables. To convert it I need to apply .to_array(dim='x'). But then my dimensions are in the order x, h, xlag, and I clearly don't want h in the middle, so I also need to transpose them:
In [9]: xtab.to_xarray().to_array(dim='x').transpose('h', 'xlag', 'x')
Out[9]:
<xarray.DataArray (h: 3, xlag: 4, x: 4)>
array([[[0. , 0.7 , 0.3 , 0. ],
[0.47058824, 0.41176471, 0.11764706, 0. ],
[0.5 , 0.33333333, 0.16666667, 0. ],
[0. , 0. , 0. , 0. ]],
[[0.58823529, 0. , 0.41176471, 0. ],
[1. , 0. , 0. , 0. ],
[0.42857143, 0.14285714, 0.42857143, 0. ],
[0. , 0. , 0. , 0. ]],
[[0.33333333, 0.25 , 0.41666667, 0. ],
[0.44444444, 0.22222222, 0.33333333, 0. ],
[0.45454545, 0.36363636, 0.18181818, 0. ],
[0. , 0. , 0. , 0. ]]])
Coordinates:
* h (h) int64 0 8 16
* xlag (xlag) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
* x (x) <U6 '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
That's what I'd envisaged! It displays similarly to pd.crosstab, but it's a 3-d xarray instead of a pandas dataframe with a multiindex. That'll be much easier to handle in the subsequent stages of my program (the crosstab is just an intermediate step, not a result in itself).
I must say that ended up more complicated than I'd anticipated... I found a question from @kilojoules back in 2017, "When to use multiindexing vs. xarray in pandas", to which @Tkanno wrote an answer beginning "There does seem to be a transition to xarray for doing work on multi-dimensional arrays." It seems a shame to me that there isn't a version of pd.crosstab that returns an xarray - or am I asking for more pandas-xarray integration than is possible?
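For what it's worth, a small wrapper gets close to that wish. This is just a sketch, not a pandas API: crosstab_to_xarray is a hypothetical name, and it assumes any interval categories have already been renamed to strings as above.
import pandas as pd
import xarray as xr

def crosstab_to_xarray(rows, col, **kwargs):
    # crosstab, stack into a MultiIndex Series, then let xarray map
    # the named index levels to dimensions
    xtab = pd.crosstab(rows, col, dropna=False, **kwargs)
    return xr.DataArray.from_series(xtab.stack())

# usage: crosstab_to_xarray([df.h, df.xlag], df.x, normalize='index')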

"Unknown label type" with numpy and scikit

X = np.array(...,dtype=np.float64).astype(np.float)
y = np.array(...,dtype=np.float64).astype(np.float)
print(X)
[[-0.5 0. ]
[ 0. 0. ]
[ 0. 0. ]
[ 0. 0. ]
[ 0. -3.5]
[-3.5 -3.5]
[-3.5 -2. ]
[-2. 0. ]
[ 0. 0. ]
[ 0. -3. ]]
print(y)
[-3.5 -3.5 -3.5 -3.5 -3.5 -2. -3. -3. -3. 1. ]
(10, 2)
(10,)
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X, y)
No matter what I do, I get:
Traceback (most recent call last):
File "scripts/wave-pool.py", line 174, in <module>
clf.fit(X, y)
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 192, in fit
sample_weight=sample_weight)
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 355, in _partial_fit
if _check_partial_fit_first_call(self, classes):
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/utils/multiclass.py", line 320, in _check_partial_fit_first_call
clf.classes_ = unique_labels(classes)
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/utils/multiclass.py", line 96, in unique_labels
raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array([-3.5, -3. , -2. , 1. ]),)
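The last line of the traceback is the clue: sklearn's unique_labels rejects the target because GaussianNB is a classifier and expects discrete class labels, whereas a float vector like y is treated as a continuous target. A minimal sketch of one way out, encoding the distinct float values as classes (switching to a regression model is the other option):
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB

X = np.array([[-0.5, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, -3.5],
              [-3.5, -3.5], [-3.5, -2.0], [-2.0, 0.0], [0.0, 0.0], [0.0, -3.0]])
y = np.array([-3.5, -3.5, -3.5, -3.5, -3.5, -2.0, -3.0, -3.0, -3.0, 1.0])

le = LabelEncoder()
y_classes = le.fit_transform(y)              # -3.5, -3.0, -2.0, 1.0 -> 0, 1, 2, 3
clf = GaussianNB().fit(X, y_classes)
print(le.inverse_transform(clf.predict(X)))  # predictions mapped back to floats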

Weird behavior of multiply in tensorflow

I am trying to use multiply in my program, but I find the behavior of this op abnormal. It seems to be calculating wrong results. Minimal example:
import tensorflow as tf
batchSize = 2
maxSteps = 3
max_cluster_size = 4
x = tf.Variable(tf.random_uniform(dtype=tf.int32, maxval=20, shape=[batchSize, maxSteps, max_cluster_size]))
y = tf.sequence_mask(tf.random_uniform(minval=1, maxval=max_cluster_size-1, dtype=tf.int32, shape=[batchSize, maxSteps]), maxlen=max_cluster_size)
y = tf.cast(y, tf.int32)
z = tf.multiply(x, y)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    x_v = sess.run(x)
    y_v = sess.run(y)
    z_v = sess.run(z)
print(x_v.shape)
print(x_v)
print('----------------------------')
print(y_v.shape)
print(y_v)
print('----------------------------')
print(z_v.shape)
print(z_v)
print('----------------------------')
Result:
(2, 3, 4)
[[[ 7 12 19 3]
[10 18 15 7]
[18 9 2 7]]
[[ 4 5 16 1]
[ 2 14 15 14]
[ 5 18 8 18]]]
----------------------------
(2, 3, 4)
[[[1 1 0 0]
[1 0 0 0]
[1 1 0 0]]
[[1 1 0 0]
[1 1 0 0]
[1 1 0 0]]]
----------------------------
(2, 3, 4)
[[[ 7 12 0 0]
[10 0 0 0]
[18 0 0 0]]
[[ 4 5 0 0]
[ 2 0 0 0]
[ 5 0 0 0]]]
----------------------------
Where z_v is expected to be:
[[[ 7 12 0 0]
[10 0 0 0]
[18 9 0 0]]
[[ 4 5 0 0]
[ 2 14 0 0]
[ 5 18 0 0]]]
When I test multiply in other programs, it goes just fine.
I suspect that this may be related to the fact that x and y are random. Can anyone give a hint on this?
Instead of these lines:
x_v = sess.run(x)
y_v = sess.run(y)
z_v = sess.run(z)
you need to use this:
x_v, y_v, z_v = sess.run([x, y, z])
With the first, separate version, each sess.run call re-executes the part of the graph it depends on, including any random ops. Here x is a Variable, so it keeps the value it was given at initialization, but y is recomputed from a fresh random draw on every call; sess.run(z) therefore multiplies x by a different mask than the y_v you printed.
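If separate run calls are needed for some reason, another sketch of a fix (same graph as in the question) is to freeze the mask in a Variable as well, so the random lengths are drawn once at initialization:
lengths = tf.random_uniform(minval=1, maxval=max_cluster_size - 1,
                            dtype=tf.int32, shape=[batchSize, maxSteps])
y = tf.Variable(tf.cast(tf.sequence_mask(lengths, maxlen=max_cluster_size), tf.int32))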