Combination function in numpy that can be applied as a vectorized method on a data frame - pandas

I have a dataframe with around 45 million rows and I need to apply a method that calculates combinations of two columns, so the function needs to be applied to all rows. A for loop with the comb function from the math module works but takes a lot of time, and .apply also does not seem to be a viable option. I tried 3 other options:
Option 1
comb function from the math module as a vectorized operation
df1['comb'] = comb(df1['c1'],df1['c2'])
This throws the error TypeError: 'Series' object cannot be interpreted as an integer
Option 2
df1['comb'] = np.vectorize(comb_fun)(df1['c1'],df1['c2'])
This works, however it still takes time.
Option 3
import scipy.special as ss
df1['comb'] = ss.comb(df1['c1'],df1['c2'])
This works and gives fast results, however it returns floating-point values, which affects my further calculations. When I use exact=True to avoid floating-point precision issues, it gives the following error:
TypeError: cannot convert the series to <class 'int'>
If anyone knows another function or way to calculate combinations that can be applied as a vectorized operation on a data frame, please suggest it. Thanks.
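For what it's worth, here is a sketch of one possible workaround (not from the original post; the toy frame below only mirrors the c1/c2 columns from the question): keep the fast vectorized scipy call and round its floating-point output back to integers, which stays exact as long as the counts fit in a float64 (roughly below 2**53).

import numpy as np
import pandas as pd
import scipy.special as ss

# Toy stand-in for the 45-million-row frame
df1 = pd.DataFrame({'c1': [10, 20, 30], 'c2': [2, 3, 4]})

# ss.comb is vectorized over whole columns; np.rint removes the
# floating-point representation, exact while the counts stay below ~2**53
df1['comb'] = np.rint(ss.comb(df1['c1'], df1['c2'])).astype('int64')
print(df1)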

Gdal: how to assign values to pixels based on a condition?

I would like to change the pixel values of a geotiff raster so that they are 1 if the pixel values are between 50 and 100, and 0 otherwise.
Following this post, this is what I am doing:
gdal_calc.py -A input.tif --outfile=output.tif --calc="1*(50<=A<=100)" --NoDataValue=0
but I got the following error:
0.. evaluation of calculation 1*(50<=A<=100) failed
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I think such a notation would only work if the expression returns a single boolean, but this returns an array of booleans. Hence the suggestion to aggregate the array to a scalar with something like any() or all().
You should be able to write it in a way compatible with Numpy arrays with something like this:
1 * ((50 <= A) & (A <= 100))
Your original expression has an implicit and in it, whereas this uses an explicit &, which behaves like np.logical_and and performs an element-wise test of whether both values on either side are True.
I'm not sure what the multiplication by one adds in this case; it casts the bool to an int32 datatype. Even if you need to write the result as int32, you can probably still leave the casting to GDAL here.
A toy example replicating this would be:
import numpy as np
a = np.random.randint(0, 2, 5, dtype=np.bool_)
b = np.random.randint(0, 2, 5, dtype=np.bool_)
With this data, a and b would fail in the same way, because Python can't evaluate an entire array as True/False, whereas a & b would return a new array with the element-wise result.
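For illustration, here is a small numpy-only sketch (the array values are made up) of the same element-wise classification that the corrected gdal_calc expression performs:

import numpy as np

A = np.array([[10, 60, 120],
              [75, 50, 101]])

# 1 where 50 <= A <= 100, 0 elsewhere -- the element-wise form of the expression
result = 1 * ((50 <= A) & (A <= 100))
print(result)
# [[0 1 0]
#  [1 1 0]]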

Pandas df.describe() returning wrong result

I tried getting into Kaggle with the house sales competition.
I spent some time getting rid of columns that, according to the output of df.describe(), seemed useless (all zeros).
Why does df.describe() return a result that is not true?
The reason for this seemingly weird behaviour is that I used pd.options.display.chop_threshold = 100 without knowing what it does.
From the pandas docs:
"If set to a float value, all float values smaller than the given threshold will be displayed as exactly 0 by repr and friends. [default: None] [currently: None]"
After removing that line, the describe() function returns the result as expected.
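A minimal sketch reproducing the effect (the column and values are made up): the underlying numbers are untouched, only their printed representation changes.

import pandas as pd

df = pd.DataFrame({'x': [0.001, 0.002, 0.003]})

pd.options.display.chop_threshold = 100   # floats below 100 are displayed as 0
print(df.describe())                      # every statistic prints as 0

pd.options.display.chop_threshold = None  # back to the default
print(df.describe())                      # the real statistics reappear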

Calculating implied volatility using Scipy optimize brentq error

I want to calculate implied volatility using the scipy optimize brentq root-finding algorithm:
def calcimpliedvol(S, K, T, r, marketoptionPrice):
    d1 = (np.log(S/K) + (r - 0.5*sigma**2)*T) / (sigma*np.sqrt(T))
    d2 = d1 - sigma*np.sqrt(T)
    BSprice_call = S*si.norm.cdf(d1, 0, 1) - K*np.exp(-r*T)*si.norm.cdf(d2, 0, 1)
    fx = BSprice_call - marketoptionPrice
    return optimize.brentq(fx, 0, 1, maxiter=1000)
However, when I run the function with all the inputs specified (K=6, S=8, T=0.25, r=0, OptionPrice=4), I get an error saying sigma is not defined. Sigma is what I want to find with the optimisation algorithm.
Could someone please advise what I am doing wrong in defining the function?
There are multiple issues with your code:
brentq needs a function as its first argument, whose root it then finds. You passed it a variable. This is the main issue.
The Black-Scholes formula was wrong (it is (r + 0.5*sigma**2), not (r - 0.5*sigma**2), in d1).
The code does not work for sigma=0, as you divide by sigma. At the very least you should not pass 0 as one of the bounds; better yet, handle the sigma=0 case separately inside the code.
The value of 4 for the option price is very high with S=8, K=6, T=0.25. The implied volatility in this case is 2.18 (i.e. 218%), which is outside the upper bound you gave your root solver.
Here is the corrected code. For the first point, note how we define the function bs_price inside your function and then pass it to the solver. The other issues are also addressed.
import numpy as np
from scipy import optimize
import scipy.stats as si

def calcimpliedvol(S, K, T, r, marketoptionPrice):
    def bs_price(sigma):
        d1 = (np.log(S/K) + (r + 0.5*sigma**2)*T) / (sigma*np.sqrt(T))
        d2 = d1 - sigma*np.sqrt(T)
        BSprice_call = S*si.norm.cdf(d1, 0, 1) - K*np.exp(-r*T)*si.norm.cdf(d2, 0, 1)
        fx = BSprice_call - marketoptionPrice
        return fx
    return optimize.brentq(bs_price, 0.0001, 100, maxiter=1000)

calcimpliedvol(S=8, K=6, T=0.25, r=0, marketoptionPrice=4)
It returns 2.188862879492475.
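As a side note on the sigma=0 point above (a sketch, not part of the original answer): in the limit sigma -> 0 the Black-Scholes call price collapses to its intrinsic value, so that boundary case can be handled before calling the solver.

import numpy as np

def bs_call_zero_vol(S, K, T, r):
    # Limit of the Black-Scholes call price as sigma -> 0
    return max(S - K * np.exp(-r * T), 0.0)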

Issues with Decomposing Trend, Seasonal, and Residual Time Series Elements

I am quite a newbie to Time Series Analysis, and this might be a stupid question.
I am trying to generate the trend, seasonal, and residual time series elements; however, my timestamp index actually consists of strings (let's say 'window1', 'window2', 'window3'). Now, when I try to apply seasonal_decompose(data, model='multiplicative'), it returns the error 'Index' object has no attribute 'inferred_freq', which is pretty understandable.
However, how can I get around this issue while keeping strings as the time series index?
Basically, you need to specify the freq parameter here.
Suppose you have the following dataset:
import pandas as pd

s = pd.Series([102, 200, 322, 420], index=['window1', 'window2', 'window3', 'window4'])
s
>>> window1    102
window2    200
window3    322
window4    420
dtype: int64
Now specify the freq parameter; in this case I used freq=1:
import statsmodels.api as sm
import matplotlib.pyplot as plt

plt.style.use('default')
plt.figure(figsize=(16, 8))
sm.tsa.seasonal_decompose(s.values, freq=1).plot()
result = sm.tsa.stattools.adfuller(s, maxlag=1)
plt.show()
I am not allowed to post an image, but I hope this code will solve your problem. Also, maxlag by default gave an error for my dataset, therefore I used maxlag=1. If you are not sure about its value, use the default value for maxlag.

Emulating fixed precision in python

For a university course in numerical analysis we are transitioning from Maple to a combination of Numpy and Sympy for various illustrations of the course material. This is because the students already learn Python in the preceding semester.
One of the difficulties we have is in emulating fixed precision in Python. Maple allows the user to specify a decimal precision (say 10 or 20 digits) and from then on every calculation is made with that precision so you can see the effect of rounding errors. In Python we tried some ways to achieve this:
Sympy has a rounding function to a specified number of digits.
Mpmath supports custom precision.
This is however not what we're looking for. These options calculate the exact result and round the exact result to the specified number of digits. We are looking for a solution that does every intermediate calculation in the specified precision. Something that can show, for example, the rounding errors that can happen when dividing two very small numbers.
The best solution so far seems to be the custom data types in Numpy. Using float16, float32 and float64 we were able to at least give an indication of what could go wrong. The problem here is that we always need to use arrays of one element and that we are limited to these three data types.
Does anything more flexible exist for our purpose? Or is the very thing we're looking for hidden somewhere in the mpmath documentation? Of course there are workarounds, such as wrapping every element of a calculation in a rounding function, but this obscures the code for the students.
You can use decimal. There are several ways to use it, for example via localcontext or getcontext.
Example with getcontext from the documentation:
>>> from decimal import *
>>> getcontext().prec = 6
>>> Decimal(1) / Decimal(7)
Decimal('0.142857')
Example of localcontext usage:
>>> from decimal import Decimal, localcontext
>>> with localcontext() as ctx:
...     ctx.prec = 4
...     print(Decimal(1) / Decimal(3))
...
0.3333
To reduce typing, you can abbreviate the constructor (example from the documentation):
>>> import decimal
>>> D = decimal.Decimal
>>> D('1.23') + D('3.45')
Decimal('4.68')
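Because the context precision is applied to every individual arithmetic operation, not just to the final result, decimal gives the kind of intermediate rounding the question asks for. A short sketch of the rounding error propagating at 4 significant digits:

>>> from decimal import Decimal, getcontext
>>> getcontext().prec = 4
>>> a = Decimal(1) / Decimal(3)   # already rounded to 4 digits: Decimal('0.3333')
>>> a * 3                         # the rounding error survives the next step
Decimal('0.9999')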