numpy change elements matching conditions - numpy

I have two numpy arrays, a and b:
a=[1,2,3] b=[4,5,6]
I want to replace the elements of a that are less than 2.5 with the corresponding elements of b. So I tried
a[a<2.5]=b
hoping to get a=[4,5,3],
but this raises an error:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
a[a<2.5]=b
ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 2 output values where the mask is true
What is the problem?

The issue you're seeing is a result of how masks work on numpy arrays.
When you write
a[a < 2.5]
you get back the elements of a which match the mask a < 2.5. In this case, that will be the first two elements only.
Attempting to do
a[a < 2.5] = b
is an error because b has three elements, but a[a < 2.5] has only two.
An easy way to achieve the result you're after in numpy is to use np.where.
The syntax of this is np.where(condition, valuesWhereTrue, valuesWhereFalse).
In your case, you could write
newArray = np.where(a < 2.5, b, a)
Alternatively, if you don't want the overhead of a new array, you could perform the replacement in-place (as you're trying to do in the question). To achieve this, you can write:
idxs = a < 2.5
a[idxs] = b[idxs]
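As a quick check, here is a minimal runnable sketch of both approaches, using the arrays from the question. The name new_array is only illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Option 1: build a new array, taking b where the condition holds and a elsewhere.
new_array = np.where(a < 2.5, b, a)

# Option 2: replace in place. The same mask is applied to both sides,
# so the number of values assigned matches the number of slots selected.
mask = a < 2.5
a[mask] = b[mask]

print(new_array)
print(a)
```

Both print statements show the array the question was hoping for.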

Related

How to run all integers in an array through an equation and append to a new array with a for loop?

Maybe there is a better way to do this, but I want to take an array of values from -80 to 0, and sub them into an equation where the missing variable, T, takes each of those values from the first array and then runs the equation with that and makes a new array. See code below:
T = np.arange(-80, 2, 2)
empty = []
esw = 6.11*np.exp(53.49*(6808/T)-5.09*np.log(T))
i = 0
for i in range(T):
6.11*np.exp(53.49*(6808/T[i])-5.09*np.log(T[i]))
x = np.append(empty)
i = I+1
I know this is probably some miserable code, any help would be appreciated, thanks!
First of all, I have taken the liberty of rewriting your expression using exp(x - 5.09*log(T)) = exp(x) / T**5.09. With an integer exponent this form could in principle handle negative numbers, but because your exponent is 5.09 it still doesn't work for negative T values, and it fails at T = 0 because of the division in the exponent. We now have:
esw = 6.11 * np.exp(53.49*6808/t) / t**(5.09)
where t is some value in your T array of values.
If you're trying to get esw for each value in T, you can structure your code in 2 main ways. The non-vectorised way, which loops through every value of T and handles failures one value at a time, looks like this:
# If the calculation cannot be done, None is returned.
def func(t):
    try:
        # By default NumPy only warns on division by zero or invalid values;
        # errstate(all='raise') turns those warnings into FloatingPointError.
        with np.errstate(all='raise'):
            esw = 6.11 * np.exp(53.49*6808/t) / t**(5.09)
    except FloatingPointError:
        esw = None
    return esw
# new_arr has the same length as T. Each value in new_arr is the corresponding value of T
# put through the esw calculation.
new_arr = [func(t) for t in T]
Note that for the values of T you chose (-80 to 0 in steps of 2), every calculation fails, so every entry of new_arr is None.
If your calculations were possible (i.e. all your T values were > 0), you could vectorise your code (a good idea because it's easier to read and also faster), like so:
new_arr = 6.11 * np.exp(53.49*6808/T) / T**(5.09)
This method gives you less control over failures: by default NumPy only emits a RuntimeWarning and stores nan or inf for any value it cannot compute, instead of giving you None for that value in T. With your T values, every entry of the result is nan or inf.
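One possible middle ground, sketched here as an assumption rather than part of the original answer, is to stay vectorised but compute only the entries where T > 0 and leave nan everywhere else. The function name esw_vectorised and the sample inputs are made up for illustration.

```python
import numpy as np

def esw_vectorised(T):
    """Evaluate the expression where T > 0 and leave nan elsewhere."""
    T = np.asarray(T, dtype=float)
    out = np.full(T.shape, np.nan)
    valid = T > 0
    t = T[valid]
    # Small positive t can still overflow to inf, so silence that warning only.
    with np.errstate(over="ignore"):
        out[valid] = 6.11 * np.exp(53.49 * 6808 / t) / t**5.09
    return out

# Hypothetical inputs: two values that cannot be computed, two that can.
result = esw_vectorised([-2, 0, 600, 800])
print(result)
```

The boolean mask plays the role of the try/except in the loop version: invalid inputs are skipped up front instead of being caught afterwards.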
There are some basic Python errors, suggesting that you haven't read much of a Python intro and haven't learned to test your code step by step.
You create an array, e.g.:
In [123]: T = np.arange(-3,2)
In [124]: T
Out[124]: array([-3, -2, -1, 0, 1])
and try to iterate:
In [125]: for i in range(T):print(i)
Traceback (most recent call last):
Input In [125] in <cell line: 1>
for i in range(T):print(i)
TypeError: only integer scalar arrays can be converted to a scalar index
In [126]: range(T)
Traceback (most recent call last):
Input In [126] in <cell line: 1>
range(T)
TypeError: only integer scalar arrays can be converted to a scalar index
range takes a number, not an array. It's a basic Python function that you should know, and use correctly. You can get a number by taking the length of the array or list, len(T):
In [127]: range(len(T))
Out[127]: range(0, 5)
In [128]: list(_)
Out[128]: [0, 1, 2, 3, 4]
You set i=0 before the loop and also do some sort of assignment to i inside it, which means you don't understand (or care about) how the loop assigns i.
In [129]: for i in range(4):
...: print(i)
...: i = i+10
...:
0
1
2
3
Adding 10 to i did nothing; the for assigns the next value from the range. Again this is basic Python iteration.
As for the empty and np.append:
In [130]: empty=[]
In [131]: np.append(empty)
Traceback (most recent call last):
Input In [131] in <cell line: 1>
np.append(empty)
File <__array_function__ internals>:179 in append
TypeError: _append_dispatcher() missing 1 required positional argument: 'values'
The correct way to use list append is:
In [132]: alist = []
...: for i in T:
...: alist.append(i*2)
...:
In [133]: alist
Out[133]: [-6, -4, -2, 0, 2]
List append works in-place, and it is reasonably fast. np.append is a poorly named function that should not exist. It is not a list append clone.
Since T is an array, we don't need to iterate.
In [134]: T*2
Out[134]: array([-6, -4, -2, 0, 2])
An alternative to the list append loop is a list comprehension. It's a bit faster than the iteration, though not as fast as the direct array calculation.
In [135]: [i*2 for i in T]
Out[135]: [-6, -4, -2, 0, 2]
Finally, that line
6.11*np.exp(53.49*(6808/T[i])-5.09*np.log(T[i]))
in the loop does nothing; it does not even assign a value to a variable (as you did outside the loop with esw=...). Did you really think it did something? Or was this just a careless mistake?
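Putting these points together, the sketch below shows the loop, the list comprehension and the direct array calculation side by side. The T values are hypothetical positive inputs chosen so that the question's expression stays finite. The helper name esw is taken from the question's variable.

```python
import numpy as np

# Hypothetical positive inputs; the question's range (-80 to 0) cannot be evaluated.
T = np.arange(600, 1000, 100)

def esw(t):
    # The expression from the question, unchanged.
    return 6.11 * np.exp(53.49 * (6808 / t) - 5.09 * np.log(t))

# 1. Explicit loop: range() gets a number, and the result is appended to a list.
values = []
for i in range(len(T)):
    values.append(esw(T[i]))
loop_result = np.array(values)

# 2. List comprehension: same work, more compact.
comp_result = np.array([esw(t) for t in T])

# 3. Direct array calculation: no Python-level iteration at all.
vec_result = esw(T)

print(np.allclose(loop_result, vec_result))
```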

Is there a better way of finding summary statistics in Python?

The following is my code for finding the 5 point summary statistics. I keep getting this error:
list indices must be integers or slices, not str
It seems like the way I'm using the describe function that I created is wrong.
from statistics import stdev,median,mean
def describe(key):
a=[]
for i in scripts:
a.append(i[key])
a=scripts[key]
total = sum(script[key] for script in scripts)
avg = total/len(a)
avg=mean(a)
s = stdev(a)
q25 = min(a)+(max(a)-min(a))*25
med = min(a)+(max(a)-min(a))*50
med=median(a)
q75 = min(a)+(max(a)-min(a))*75
return (total, avg, s, q25, med, q75)
summary = [('items', describe('items')),
('quantity', describe('quantity')),
('nic', describe('nic')),
('act_cost', describe('act_cost'))]
I keep getting this error:
TypeError Traceback (most recent call last)
<ipython-input-8-ba78d5218ead> in <module>()
----> 1 summary = [('items', describe('items')),
2 ('quantity', describe('quantity')),
3 ('nic', describe('nic')),
4 ('act_cost', describe('act_cost'))]
<ipython-input-1-bcf37f98eb7d> in describe(key)
4 for i in scripts:
5 a.append(i[key])
----> 6 a=scripts[key]
7 total = sum(script[key] for script in scripts)
8 avg = total/len(a)
TypeError: list indices must be integers or slices, not str
It is hard to understand your problem, since we don't know what scripts looks like. It is a global variable which is not defined in your script. The error states that scripts is of type list, but your code seems to assume it is a dataframe. So please check the type of scripts.
Also, did you know that there is an easy way to calculate a five-number summary with numpy? For example:
import numpy as np
# In NumPy 1.22 and later the keyword is method; on older versions use interpolation='midpoint'.
minimum, q25, med, q75, maximum = np.percentile(a, [0, 25, 50, 75, 100], method='midpoint')
For description, see:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
From your question, scripts appears to be a list of dictionaries.
Indexing the list directly with a key does not work, because list indices must be integers.
Instead, collect the values for that key from every dictionary:
getValues = lambda key,inputData: [subVal[key] for subVal in inputData if key in subVal]
In this case,
getValues('key', scripts) gives the corresponding list, and it is then easy to compute the statistics of that list.
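Here is a minimal sketch of a describe function along those lines. It assumes scripts is a list of dictionaries with numeric values. The sample data is made up, since the real scripts is not shown in the question, and passing the records in as a second argument is a choice made here to avoid the global.

```python
from statistics import mean, median, stdev

import numpy as np

# Made-up sample data standing in for the real `scripts`.
scripts = [
    {"items": 1, "quantity": 10},
    {"items": 2, "quantity": 20},
    {"items": 3, "quantity": 30},
    {"items": 4, "quantity": 40},
    {"items": 5, "quantity": 50},
]

def describe(key, records):
    # Pull the value for `key` out of every dictionary that has it.
    values = [rec[key] for rec in records if key in rec]
    total = sum(values)
    # Default (linear) interpolation between data points.
    q25, q75 = np.percentile(values, [25, 75])
    return (total, mean(values), stdev(values), q25, median(values), q75)

summary = [(key, describe(key, scripts)) for key in ("items", "quantity")]
print(summary)
```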

What is the meaning of `numpy.array(value)`?

numpy.array(value) evaluates to true, if value is int, float or complex. The result seems to be a shapeless array (numpy.array(value).shape returns ()).
Reshaping the above like so numpy.array(value).reshape(1) works fine and numpy.array(value).reshape(1).squeeze() reverses this and again results in a shapeless array.
What is the rationale behind this behavior? Which use-cases exist for this behaviour?
When you create a zero-dimensional array like np.array(3), you get an object that behaves as an array in 99.99% of situations. You can inspect the basic properties:
>>> x = np.array(3)
>>> x
array(3)
>>> x.ndim
0
>>> x.shape
()
>>> x[None]
array([3])
>>> type(x)
numpy.ndarray
>>> x.dtype
dtype('int32')
So far so good. The logic behind this is simple: you can process any array-like object the same way, regardless of whether it is a number, list or array, just by wrapping it in a call to np.array.
One thing to keep in mind is that when you index an array, the index tuple must have ndim or fewer elements. So you can't do:
>>> x[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: too many indices for array
Instead, you have to use a zero-sized tuple (since x[] is invalid syntax):
>>> x[()]
3
You can also use the array as a scalar instead:
>>> y = x + 3
>>> y
6
>>> type(y)
numpy.int32
Adding two scalars produces a scalar instance of the dtype, not another array. That being said, you can use y from this example in exactly the same way you would x, 99.99% of the time, since NumPy scalar types expose nearly the same attributes and methods as ndarray. It does not matter that 3 is a Python int, since np.add will wrap it in an array regardless. y = x + x will yield identical results.
One difference between x and y in these examples is that x is not officially considered to be a scalar:
>>> np.isscalar(x)
False
>>> np.isscalar(y)
True
The indexing issue can potentially throw a monkey wrench in your plans to index any array-like object. You can easily get around it by supplying ndmin=1 as an argument to the constructor, or using a reshape:
>>> x1 = np.array(3, ndmin=1)
>>> x1
array([3])
>>> x2 = np.array(3).reshape(-1)
>>> x2
array([3])
I generally recommend the former method, as it requires no prior knowledge of the dimensionality of the input.
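To make the use case concrete, here is a small sketch of a helper that accepts a number, a list or an array and can always index the result. The name first_element is hypothetical.

```python
import numpy as np

def first_element(value):
    # ndmin=1 guarantees at least one dimension, so arr[0] is always valid.
    arr = np.array(value, ndmin=1)
    return arr[0]

x = np.array(3)       # zero-dimensional array
print(x.shape)        # empty shape tuple
print(x[()])          # index with an empty tuple to get the value out

print(first_element(3))
print(first_element([4, 5]))
```

Without ndmin=1, the call with a plain number would fail with the IndexError shown earlier.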
Further reading:
Why are 0d arrays in Numpy not considered scalar?

Pandas dataframe - multiplying DF's elementwise on same dates - something wrong?

I've been banging my head over this, I just cannot seem to get it right and I don't understand what is the problem... So I tried to do the following:
#!/usr/bin/env python
import matplotlib.pyplot as plt
import numpy as np
import quandl
btc_usd_price_kraken = quandl.get('BCHARTS/KRAKENUSD', returns="pandas")
btc_usd_price_kraken.replace(0, np.nan, inplace=True)
plt.plot(btc_usd_price_kraken.index, btc_usd_price_kraken['Weighted Price'])
plt.grid(True)
plt.title("btc_usd_price_kraken")
plt.show()
eur_usd_price = quandl.get('BUNDESBANK/BBEX3_D_USD_EUR_BB_AC_000', returns="pandas")
eur_dkk_price = quandl.get('ECB/EURDKK', returns="pandas")
usd_dkk_price = eur_dkk_price / eur_usd_price
btc_dkk = btc_usd_price_kraken['Weighted Price'] * usd_dkk_price
plt.plot(btc_dkk.index, btc_dkk) # WHY IS THIS [4785 rows x 1340 columns] ???
plt.grid(True)
plt.title("Historic value of 1 BTC converted to DKK")
plt.show()
As you can see in the comment, I don't understand why I get a result (which I'm trying to plot) that has size: [4785 rows x 1340 columns] ?
Anyway, the code results in a lot of error messages, something like e.g.
> Traceback (most recent call last): File
> "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_qt5agg.py",
> line 197, in __draw_idle_agg
> FigureCanvasAgg.draw(self) File "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_agg.py",
...
> return _from_ordinalf(x, tz) File "/usr/lib/python3.6/site-packages/matplotlib/dates.py", line 254, in
> _from_ordinalf
> dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC) ValueError: ordinal must be >= 1
I read some posts, and I understand that pandas multiplies elementwise only on matching dates (so if one DF has a time series for e.g. 1999-2017 and the other only has e.g. 2012-2015, then only the common dates between 2012-2015 are multiplied, i.e. the intersection of the two data sets). So my problem is understanding the error message(s) and finding the solution. The whole problem is related to calculating the btc_dkk variable and plotting it (the price of Bitcoin in the currency DKK).
This should work:
usd_dkk_price.multiply(btc_usd_price_kraken['Weighted Price'], axis='index').dropna()
You are multiplying on columns, not on the index. This happens because you are multiplying a dataframe by a series: pandas aligns the index of the series with the columns of the dataframe. If you had selected the column in usd_dkk_price, this would not have happened. Afterwards, just drop the rows with NaN.
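The following sketch reproduces the fix on tiny made-up data. The frames rate and price are stand-ins for usd_dkk_price and btc_usd_price_kraken['Weighted Price'], and the column name "Value" is an assumption.

```python
import pandas as pd

dates_a = pd.date_range("2020-01-01", periods=4)
dates_b = pd.date_range("2020-01-03", periods=4)

# Stand-in for usd_dkk_price: a one-column DataFrame indexed by date.
rate = pd.DataFrame({"Value": [2.0, 2.0, 2.0, 2.0]}, index=dates_a)
# Stand-in for the 'Weighted Price' column: a Series indexed by date.
price = pd.Series([10.0, 20.0, 30.0, 40.0], index=dates_b, name="Weighted Price")

# Fix 1: tell pandas to align the Series with the DataFrame's index (the dates).
fixed = rate.multiply(price, axis="index").dropna()

# Fix 2: select the column first, so that both operands are Series.
series_result = (rate["Value"] * price).dropna()

print(fixed)
print(series_result)
```

Only the two dates present in both inputs survive the dropna, which is the intersection behaviour the question expected.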

Selecting from pandas dataframe (or numpy ndarray?) by criterion

I find myself coding this sort of pattern a lot:
tmp = <some operation>
result = tmp[<boolean expression>]
del tmp
...where <boolean expression> is to be understood as a boolean expression involving tmp. (For the time being, tmp is always a pandas dataframe, but I suppose that the same pattern would show up if I were working with numpy ndarrays--not sure.)
For example:
tmp = df.xs('A')['II'] - df.xs('B')['II']
result = tmp[tmp < 0]
del tmp
As one can guess from the del tmp at the end, the only reason for creating tmp at all is so that I can use a boolean expression involving it inside an indexing expression applied to it.
I would love to eliminate the need for this (otherwise useless) intermediate, but I don't know of any efficient [1] way to do this. (Please, correct me if I'm wrong!)
As second best, I'd like to push off this pattern to some helper function. The problem is finding a decent way to pass the <boolean expression> to it. I can only think of indecent ones. E.g.:
def filterobj(obj, criterion):
return obj[eval(criterion % 'obj')]
This actually works [2]:
filterobj(df.xs('A')['II'] - df.xs('B')['II'], '%s < 0')
# Int
# 0 -1.650107
# 2 -0.718555
# 3 -1.725498
# 4 -0.306617
# Name: II
...but using eval always leaves me feeling all yukky 'n' stuff... Please let me know if there's some other way.
[1] E.g., any approach I can think of involving the filter built-in is probably inefficient, since it would apply the criterion (some lambda function) by iterating, "in Python", over the pandas (or numpy) object...
[2] The definition of df used in the last expression above would be something like this:
import itertools
import pandas as pd
import numpy as np
a = ('A', 'B')
i = range(5)
ix = pd.MultiIndex.from_tuples(list(itertools.product(a, i)),
names=('Alpha', 'Int'))
c = ('I', 'II', 'III')
df = pd.DataFrame(np.random.randn(len(ix), len(c)), index=ix, columns=c)
Because of the way Python works, I think this one's going to be tough. I can only think of hacks which only get you part of the way there. Something like
def filterobj(obj, fn):
return obj[fn(obj)]
filterobj(df.xs('A')['II'] - df.xs('B')['II'], lambda x: x < 0)
should work, unless I've missed something. Using lambdas this way is one of the usual tricks for delaying evaluation.
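Here is a runnable sketch of the lambda approach on a small made-up Series. As an addition not mentioned in the original answer, pandas indexers such as .loc also accept a callable, which gives the same result without a helper function.

```python
import pandas as pd

def filterobj(obj, fn):
    # fn is called with obj and must return a boolean mask.
    return obj[fn(obj)]

# Made-up data standing in for df.xs('A')['II'] - df.xs('B')['II'].
s = pd.Series([1.5, -0.5, 2.0, -3.0])

via_helper = filterobj(s - 1, lambda x: x < 0)

# Same selection without a helper: .loc calls the function with the object itself.
via_loc = (s - 1).loc[lambda x: x < 0]

print(via_helper)
print(via_loc)
```

Neither form needs a named temporary, and the comparison still runs as a vectorised operation.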
Thinking out loud: one could make a this object which isn't evaluated but just sticks around as an expression, something like
>>> this
this
>>> this < 3
this < 3
>>> df[this < 3]
Traceback (most recent call last):
File "<ipython-input-34-d5f1e0baecf9>", line 1, in <module>
df[this < 3]
[...]
KeyError: u'no item named this < 3'
and then either special-case the treatment of this into pandas or still have a function like
def filterobj(obj, criterion):
return obj[eval(str(criterion.subs({"this": "obj"})))]
(with enough work we could lose the eval, this is simply proof of concept) after which something like
>>> tmp = df["I"] + df["II"]
>>> tmp[tmp < 0]
Alpha Int
A 4 -0.464487
B 3 -1.352535
4 -1.678836
Dtype: float64
>>> filterobj(df["I"] + df["II"], this < 0)
Alpha Int
A 4 -0.464487
B 3 -1.352535
4 -1.678836
Dtype: float64
would work. I'm not sure any of this is worth the headache, though; Python simply isn't very conducive to this style.
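For what it's worth, a this placeholder can be sketched without eval by overloading the comparison operators so that they return a deferred function. This is only a proof of concept under that assumption. The class name _This and the helper _defer are hypothetical, and only four comparisons are covered.

```python
import operator

import pandas as pd

def _defer(op):
    # Build a method that, instead of comparing, returns a function
    # which performs the comparison later on whatever object it is given.
    def method(self, other):
        return lambda obj: op(obj, other)
    return method

class _This:
    """Placeholder whose comparisons produce deferred criteria."""
    __lt__ = _defer(operator.lt)
    __le__ = _defer(operator.le)
    __gt__ = _defer(operator.gt)
    __ge__ = _defer(operator.ge)

    def __repr__(self):
        return "this"

this = _This()

def filterobj(obj, criterion):
    # criterion is the function produced by an expression such as `this < 0`.
    return obj[criterion(obj)]

# Made-up data standing in for df["I"] + df["II"].
s = pd.Series([1.0, -2.0, 3.0, -4.0])
negatives = filterobj(s, this < 0)
print(negatives)
```

Compound expressions such as (this < 0) & (this > -3) would need more machinery, which is part of why this may not be worth the headache.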
This is as concise as I could get:
(df.xs('A')['II'] - df.xs('B')['II']).apply(lambda x: x if (x<0) else np.nan).dropna()
Int
0 -4.488312
1 -0.666710
2 -1.995535
Name: II