Running sum of complicated functions using pandas data frame values - pandas

In this simplified example I have three lists of the same length: list a, list b, and list c. I want to compute the following running summations:
from math import exp
a = [1.3, 4.5, 7.8, 9.2, 4.1]
b = [2.1, 1.1, 1.0, 1.0, -2.0]
c = [3.1, 4.0, 5.0, 6.0, 7.0]
# This simple, but SLOW, method
sum1 = 0.0
for i in range(0, 3):
    sum1 += a[i] ** b[i] + 3.1 * c[i]

sum2 = 0.0
for i in range(2, 4):
    sum2 += (b[i] / a[i]) * exp(-c[i])

total = sum1 + sum2
print(total)  # yields 52.27644
The above code works just fine; however, for examples with MUCH larger lists it runs very slowly. If I were to combine the lists in a pandas data frame, is there some built-in, vectorized capability to perform these same running summations on the data frame? Something like the below.
import pandas as pd
df_dict = {'A': [1.3, 4.5, 7.8, 9.2, 4.1],
           'B': [2.1, 1.1, 1.0, 1.0, -2.0],
           'C': [3.1, 4.0, 5.0, 6.0, 7.0]}
df = pd.DataFrame(df_dict)
# Some version of a running summation here!

I do not think you need a dataframe here; just use numpy's functions:
import numpy as np

step1 = np.power(a[:3], b[:3])
step2 = np.multiply(c[:3], 3.1)
sum1 = np.add(step1, step2).sum()
step3 = np.divide(b[2:4], a[2:4])
step4 = np.exp(np.multiply(c[2:4], -1))
sum2 = np.multiply(step3, step4).sum()
result = sum1 + sum2
result
52.27644589942484
This should be significantly faster as the list size grows; plus you can optimize it further.
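For completeness, roughly the same computation can also be written directly against the DataFrame columns from the question. This is only a sketch of the idea (using the df defined above), not a dedicated built-in:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.3, 4.5, 7.8, 9.2, 4.1],
                   'B': [2.1, 1.1, 1.0, 1.0, -2.0],
                   'C': [3.1, 4.0, 5.0, 6.0, 7.0]})

# Vectorized over whole columns, then summed over the row slices of interest
sum1 = (df['A'] ** df['B'] + 3.1 * df['C']).iloc[0:3].sum()
sum2 = ((df['B'] / df['A']) * np.exp(-df['C'])).iloc[2:4].sum()
print(sum1 + sum2)  # ~52.2764
Under the hood this is the same numpy vectorization, so the speed-up over the Python loop should be comparable.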

Related

Problem Adding A Column Vector To A Matrix Using ND4J

I'm playing with ND4J basics to come up to speed with its linear algebra capabilities.
I'm running on a Macbook Pro using nd4j-api and nd4j-native dependencies version 1.0.0-M2.1, Open JDK version 17, Kotlin 1.7.20, and IntelliJ 2022.2.2 Ultimate Edition.
I'm writing JUnit 5 tests to perform simple operations: add, subtract, multiply, and divide a 2x2 matrix and a scalar. All are successful and pass just fine.
I was successful at adding a 1x2 row vector to the first and second rows of a 2x2 matrix:
@ParameterizedTest
@ValueSource(longs = [0L, 1L])
fun `add a row vector to each row in a matrix`(rowIndex : Long) {
    // setup
    val a = Nd4j.create(doubleArrayOf(1.0, 2.0, 3.0, 4.0), intArrayOf(2, 2))
    val row = Nd4j.create(doubleArrayOf(11.0, 13.0), intArrayOf(2))
    // expected results: the row vector added to row 0 or row 1
    val expected = arrayOf(
        Nd4j.create(doubleArrayOf(12.0, 15.0, 3.0, 4.0), intArrayOf(2, 2)),
        Nd4j.create(doubleArrayOf(1.0, 2.0, 14.0, 17.0), intArrayOf(2, 2)))
    // exercise
    a.getRow(rowIndex).addi(row)
    // assert
    Assertions.assertEquals(expected[rowIndex.toInt()], a)
}
I try to duplicate the trick by adding a 2x1 column vector to the 2x2 matrix:
@Test
fun `add a column vector to the second column of a matrix`() {
    // setup
    val a = Nd4j.create(doubleArrayOf(1.0, 2.0, 3.0, 4.0), intArrayOf(2, 2))
    val col = Nd4j.create(doubleArrayOf(11.0, 13.0), intArrayOf(2, 1))
    // expected result after adding the column vector
    val expected = Nd4j.create(doubleArrayOf(1.0, 2.0, 14.0, 17.0), intArrayOf(2, 2))
    // exercise
    a.getColumn(1).addi(col)
    // assert
    Assertions.assertEquals(expected, a)
}
I get an error saying that the array shapes don't match:
java.lang.IllegalStateException: Cannot perform in-place operation "addi": result array shape does not match the broadcast operation output shape: [2].addi([2, 1]) != [2].
In-place operations like x.addi(y) can only be performed when x and y have the same shape, or x and y are broadcastable with x.shape() == broadcastShape(x,y)
I have not been successful in figuring out why. Can anyone see where I've gone wrong and suggest a solution?
We have a function for that already. For matrix + column use addiColumnVector.
For views:
Ensure that the shapes match exactly by using reshape. For example, with some vector:
INDArray vec = Nd4j.zeros(5);
vec.getColumn(0).addi(vec.reshape(5,1));
This solution did the trick. Thanks to Adam Gibson for pointing out the need for reshape:
@ParameterizedTest
@ValueSource(longs = [0L, 1L])
fun `add a column vector to each column in a matrix`(colIndex : Long) {
    // setup
    val a = Nd4j.create(doubleArrayOf(1.0, 2.0, 3.0, 4.0), intArrayOf(2, 2))
    val col = Nd4j.create(doubleArrayOf(11.0, 13.0), intArrayOf(2, 1))
    val expected = arrayOf(
        Nd4j.create(doubleArrayOf(12.0, 2.0, 16.0, 4.0), intArrayOf(2, 2)),
        Nd4j.create(doubleArrayOf(1.0, 13.0, 3.0, 17.0), intArrayOf(2, 2)))
    // exercise
    // Adds the column vector to the indexed column
    a.getColumn(colIndex).reshape(intArrayOf(2, 1)).addi(col)
    // assert
    Assertions.assertEquals(expected[colIndex.toInt()], a)
}

Gekko - infeasible solution to optimal scheduling, comparison w/ gurobi

I am somewhat familiar with Gurobi, but I am transitioning to Gekko since the latter appears to have some advantages. I am running into one issue though, which I will illustrate using my imaginary apple orchard. The 5-week harvest period (horizon: T=5) is upon us, and my - very meagre - produce will be:
[3.0, 7.0, 9.0, 5.0, 4.0]
Some apples I keep for myself [2.0, 4.0, 2.0, 4.0, 2.0]; the remaining produce I will sell at the farmer's market at the following prices: [0.8, 0.9, 0.5, 1.2, 1.5]. I have storage space with room for 6 apples, so I can plan ahead and sell apples at the most profitable moments, maximizing my revenue. I try to determine the optimal schedule with the following model:
from gekko import GEKKO
import numpy as np

m = GEKKO()
m.time = np.linspace(0,4,5)
orchard = m.Param([3.0, 7.0, 9.0, 5.0, 4.0])
demand = m.Param([2.0, 4.0, 2.0, 4.0, 2.0])
price = m.Param([0.8, 0.9, 0.5, 1.2, 1.5])
### manipulated variables
# selling on the market
sell = m.MV(lb=0)
sell.DCOST = 0
sell.STATUS = 1
# saving apples
storage_out = m.MV(value=0, lb=0)
storage_out.DCOST = 0
storage_out.STATUS = 1
storage_in = m.MV(lb=0)
storage_in.DCOST = 0
storage_in.STATUS = 1
### storage space
storage = m.Var(lb=0, ub=6)
### constraints
# storage change
m.Equation(storage.dt() == storage_in - storage_out)
# balance equation
m.Equation(sell + storage_in + demand == storage_out + orchard)
# Objective: argmax sum(sell[t]*price[t]) for t in [0,4]
m.Maximize(sell*price)
m.options.IMODE=6
m.options.NODES=3
m.options.SOLVER=3
m.options.MAX_ITER=1000
m.solve()
For some reason this is infeasible (error code = 2). Interestingly, if I set demand[0] to 3.0 instead of 2.0 (i.e. equal to orchard[0]), the model does produce a successful solution.
Why is this the case?
Even the "successful" output values are a bit weird: the storage space is never used, and storage_out is not properly constrained in the last timestep. Clearly, I am not formulating the constraints correctly. What should I do to get realistic results, comparable to the Gurobi output (see code below)?
import pandas as pd

output = {'sell': list(sell.VALUE),
          's_out': list(storage_out.VALUE),
          's_in': list(storage_in.VALUE),
          'storage': list(storage.VALUE)}
df_gekko = pd.DataFrame(output)
df_gekko.head()
   sell      s_out      s_in  storage
0   0.0   0.000000  0.000000      0.0
1   3.0   0.719311  0.719311      0.0
2   7.0   0.859239  0.859239      0.0
3   1.0   1.095572  1.095572      0.0
4  26.0  24.124924  0.124923      0.0
The Gurobi model below was solved with demand = [3.0, 4.0, 2.0, 4.0, 2.0]. Note that Gurobi also produces a solution with demand = [2.0, 4.0, 2.0, 4.0, 2.0]; this only has a trivial impact on the outcome: the number of apples sold at t=0 becomes 1.
import gurobipy as gp

T = 5
m = gp.Model()
### horizon (five weeks)
### supply, demand and price data
orchard = [3.0, 7.0, 9.0, 5.0, 4.0]
demand = [3.0, 4.0, 2.0, 4.0, 2.0]
price = [0.8, 0.9, 0.5, 1.2, 1.5]
### manipulated variables
# selling on the market
sell = m.addVars(T)
# saving apples
storage_out = m.addVars(T)
m.addConstr(storage_out[0] == 0)
storage_in = m.addVars(T)
# storage space
storage = m.addVars(T)
m.addConstrs((storage[t]<=6) for t in range(T))
m.addConstrs((storage[t]>=0) for t in range(T))
m.addConstr(storage[0] == 0)
# storage change
#m.addConstr(storage[0] == (0 - storage_out[0]*delta_t + storage_in[0]*delta_t))
m.addConstrs(storage[t] == (storage[t-1] - storage_out[t] + storage_in[t]) for t in range(1, T))
# balance equation
m.addConstrs(sell[t] + demand[t] + storage_in[t] == (storage_out[t] + orchard[t]) for t in range(T))
# Objective: argmax sum(a_sell[t]*a_price[t] - b_buy[t]*b_price[t])
obj = gp.quicksum((price[t]*sell[t]) for t in range(T))
m.setObjective(obj, gp.GRB.MAXIMIZE)
m.optimize()
output:
   sell  storage_out  storage_in  storage
0   0.0          0.0         0.0      0.0
1   3.0          0.0         0.0      0.0
2   1.0          0.0         6.0      6.0
3   1.0          0.0         0.0      6.0
4   8.0          6.0         0.0      0.0
You can get a successful solution with:
m.options.NODES=2
The issue is that it is solving the balance equation in between the primary node points with NODES=3. Your differential equation has a linear solution so NODES=2 should be sufficiently accurate.
Here are a couple other ways to improve the solution:
Set a small penalty on moving inventory into or out of storage; otherwise the solver can find large arbitrary values with storage_in = storage_out. I used m.Minimize(1e-6*storage_in) and m.Minimize(1e-6*storage_out).
Because the initial condition is typically fixed, I used zero values at the beginning just to make sure that the first point is calculated.
I also switched to integer variables, since the apples are sold and stored in integer units. You need to switch to the APOPT solver with SOLVER=1 if you want an integer solution.
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 0.058899999999999994 sec
Objective : -17.299986
Successful solution
---------------------------------------------------
Sell
[0.0, 0.0, 4.0, 1.0, 1.0, 8.0]
Storage Out
[0.0, 0.0, 1.0, 0.0, 0.0, 6.0]
Storage In
[0.0, 1.0, 0.0, 6.0, 0.0, 0.0]
Storage
[0.0, 1.0, 0.0, 6.0, 6.0, 0.0]
Here is the modified script.
from gekko import GEKKO
import numpy as np
m = GEKKO(remote=False)
m.time = np.linspace(0,5,6)
orchard = m.Param([0.0, 3.0, 7.0, 9.0, 5.0, 4.0])
demand = m.Param([0.0, 2.0, 4.0, 2.0, 4.0, 2.0])
price = m.Param([0.0, 0.8, 0.9, 0.5, 1.2, 1.5])
### manipulated variables
# selling on the market
sell = m.MV(lb=0, integer=True)
sell.DCOST = 0
sell.STATUS = 1
# saving apples
storage_out = m.MV(value=0, lb=0, integer=True)
storage_out.DCOST = 0
storage_out.STATUS = 1
storage_in = m.MV(lb=0, integer=True)
storage_in.DCOST = 0
storage_in.STATUS = 1
### storage space
storage = m.Var(lb=0, ub=6, integer=True)
### constraints
# storage change
m.Equation(storage.dt() == storage_in - storage_out)
# balance equation
m.Equation(sell + storage_in + demand == storage_out + orchard)
# Objective: argmax sum(sell[t]*price[t]) for t in [0,4]
m.Maximize(sell*price)
m.Minimize(1e-6 * storage_in)
m.Minimize(1e-6 * storage_out)
m.options.IMODE=6
m.options.NODES=2
m.options.SOLVER=1
m.options.MAX_ITER=1000
m.solve()
print('Sell')
print(sell.value)
print('Storage Out')
print(storage_out.value)
print('Storage In')
print(storage_in.value)
print('Storage')
print(storage.value)
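For a side-by-side comparison with the Gurobi table, the Gekko results can be collected into a DataFrame in the same way as in the question. A small sketch, assuming the modified script above has just been run:
import pandas as pd

df_gekko = pd.DataFrame({'sell': sell.value,
                         'storage_out': storage_out.value,
                         'storage_in': storage_in.value,
                         'storage': storage.value})
print(df_gekko)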

Values in a list to a new dataframe

I have my data as a list, say
r1=[['Pearson Chi-square ( 4.0) = ', 1021938.0], ['p-value = ', 0.0], ["Cramer's V = ", 1.0]]
I want to extract the 4.0 inside the parentheses of 'Pearson Chi-square ( 4.0)' and form a separate column called DOF.
I want to extract the 1021938.0 from ['Pearson Chi-square ( 4.0) = ', 1021938.0] and form a separate column called Chisquare.
From ['p-value = ', 0.0] I want the 0.0 in a column called Pvalue.
From ["Cramer's V = ", 1.0] I want the 1.0 in a column called Cramers'V.
So that my output df will be
df
  DOF  Chisquare  Pvalue  Cramers'V
    4  1021938.0     0.0        1.0
I have tried these lines of code:
DOF=r1[0][1]
chisquare_stat=r1[0][0]
p_value=r1[1][1]
cramers_v=r1[2][1]
I need some help in extracting the individual values alone and writing them to a new df for easy reference.
If you create a new df with
import pandas as pd

new_df = pd.DataFrame(r1)
T = new_df.T  # transpose of new_df
var = new_df[1].values
i_want = var[0]
i_want
the output is
1021938.0
This code makes an ugly DF in my opinion; you could also create the new df with a dictionary. Anyway, I think the df['column'].values method should fix the problem, and the transposed matrix can also be helpful.
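For the exact output frame asked for in the question, a more direct sketch is to parse the label string with a regular expression and build a one-row DataFrame. This is just illustrative code (the regex and column names are my own choices), not the only way to do it:
import re
import pandas as pd

r1 = [['Pearson Chi-square ( 4.0) = ', 1021938.0],
      ['p-value = ', 0.0],
      ["Cramer's V = ", 1.0]]

# Pull the degrees of freedom out of the label text, e.g. 'Pearson Chi-square ( 4.0) = '
dof = float(re.search(r'\(\s*([\d.]+)\s*\)', r1[0][0]).group(1))

df = pd.DataFrame([{'DOF': dof,
                    'Chisquare': r1[0][1],
                    'Pvalue': r1[1][1],
                    "Cramers'V": r1[2][1]}])
print(df)
#    DOF  Chisquare  Pvalue  Cramers'V
# 0  4.0  1021938.0     0.0        1.0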

Testing Numpy operations

Whenever I need to test a moderately complex numpy expression, say,
c = np.multiply.outer(a, b)
d = np.einsum('kjij->ijk', c)
I end up doing hacks such as, e.g., setting a and b thus
a = np.arange(9).reshape(3,3)
b = a / 10
so that I can then track what d contains.
This is ugly and not very convenient. Ideally, I would be able to do something like the following:
a = np.array(list("abcdefghi")).reshape(3,3)
b = np.array(list("ABCDEFGHI")).reshape(3,3)
c = np.add.outer(a, b)
d = np.einsum('kjij->ijk', c)
so that, e.g., d[0, 1, 2] could be seen to correspond to 'hB', which is much clearer than 0.7 (which is what the other assignment to a and b would give). This cannot be done, because the ufunc add does not take characters.
In summary, once I start chaining a few transformations (an outer product, an einsum, broadcasting or slicing, etc.) I lose track and need to see for myself what my transformations are actually doing. That's when I need to run a few examples, and that's where my current method of doing so strikes me as suboptimal. Is there any standard, or better, way to do this?
In [454]: a = np.array(list("abcdefghi")).reshape(3,3)
...: b = np.array(list("ABCDEFGHI")).reshape(3,3)
np.add can't be used because add has not been defined for the string dtype:
In [455]: c = np.add.outer(a,b)
....
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
But np.char has functions that apply Python string methods to ndarray elements (these aren't fast, just convenient):
Signature: np.char.add(x1, x2)
Docstring:
Return element-wise string concatenation for two arrays of str or unicode.
Using broadcasting I can perform your outer string concatenation:
In [457]: c = np.char.add(a[:,:,None,None], b[None,None,:,:])
In [458]: c.shape
Out[458]: (3, 3, 3, 3)
In [459]: c
Out[459]:
array([[[['aA', 'aB', 'aC'],
         ['aD', 'aE', 'aF'],
         ['aG', 'aH', 'aI']],

        [['bA', 'bB', 'bC'],
         ['bD', 'bE', 'bF'],
         ['bG', 'bH', 'bI']],
        ....
        [['iA', 'iB', 'iC'],
         ['iD', 'iE', 'iF'],
         ['iG', 'iH', 'iI']]]], dtype='<U2')
I was skeptical that einsum could handle this array, since normally einsum is used for np.dot-like sum-of-products calculations. But with this indexing, it is just selecting a diagonal and rearranging axes, so it does work:
In [460]: np.einsum('kjij->ijk', c)
Out[460]:
array([[['aA', 'dA', 'gA'],
        ['bB', 'eB', 'hB'],
        ['cC', 'fC', 'iC']],

       [['aD', 'dD', 'gD'],
        ['bE', 'eE', 'hE'],
        ['cF', 'fF', 'iF']],

       [['aG', 'dG', 'gG'],
        ['bH', 'eH', 'hH'],
        ['cI', 'fI', 'iI']]], dtype='<U2')
The d from the numeric test case:
array([[[0. , 3. , 6. ],
        [1.1, 4.1, 7.1],
        [2.2, 5.2, 8.2]],

       [[0.3, 3.3, 6.3],
        [1.4, 4.4, 7.4],
        [2.5, 5.5, 8.5]],

       [[0.6, 3.6, 6.6],
        [1.7, 4.7, 7.7],
        [2.8, 5.8, 8.8]]])
The pattern with these numeric values is just as clear as with strings.
I like to use distinct array shapes where possible, because it makes tracking dimensions across changes easier:
In [496]: a3 = np.arange(1,13).reshape(4,3)
...: b3 = np.arange(1,7).reshape(2,3) / 10
In [497]: c3 = np.add.outer(a3,b3)
In [498]: d3 = np.einsum('kjij->ijk', c3)
In [499]: c3.shape
Out[499]: (4, 3, 2, 3)
In [500]: d3.shape
Out[500]: (2, 3, 4)
In [501]: d3
Out[501]:
array([[[ 1.1,  4.1,  7.1, 10.1],
        [ 2.2,  5.2,  8.2, 11.2],
        [ 3.3,  6.3,  9.3, 12.3]],

       [[ 1.4,  4.4,  7.4, 10.4],
        [ 2.5,  5.5,  8.5, 11.5],
        [ 3.6,  6.6,  9.6, 12.6]]])
This, for example, would raise an error if I tried 'kjik->ijk'.
With numeric values I can perform the multiply.outer with einsum:
In [502]: c4 = np.multiply.outer(a3,b3)
In [503]: np.allclose(c4,np.einsum('ij,kl',a3,b3))
Out[503]: True
In [504]: d4 = np.einsum('kjij->ijk', c4)
In [505]: np.allclose(d4,np.einsum('kj,ij->ijk',a3,b3))
Out[505]: True
In [506]: d4
Out[506]:
array([[[0.1, 0.4, 0.7, 1. ],
        [0.4, 1. , 1.6, 2.2],
        [0.9, 1.8, 2.7, 3.6]],

       [[0.4, 1.6, 2.8, 4. ],
        [1. , 2.5, 4. , 5.5],
        [1.8, 3.6, 5.4, 7.2]]])
That 'kj,ij->ijk' gives me a better idea of what is happening than the d display.
Another way to put it:
(4,3) + (2,3) => (2,3,4)
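Building on the np.char idea above, here is a small hypothetical helper (my own sketch, not part of numpy) that generates letter-labeled arrays of any small shape, so a chain of transformations can be traced by eye:
import numpy as np
from string import ascii_lowercase, ascii_uppercase

def labels(shape, letters=ascii_lowercase):
    # Array of distinct single-character labels with the given shape
    n = int(np.prod(shape))
    assert n <= len(letters), "not enough letters for this shape"
    return np.array(list(letters[:n])).reshape(shape)

a = labels((3, 3))                    # 'a'..'i'
b = labels((3, 3), ascii_uppercase)   # 'A'..'I'
c = np.char.add(a[:, :, None, None], b[None, None, :, :])  # outer concatenation
d = np.einsum('kjij->ijk', c)
print(d[0, 1, 2])  # 'hB' -> built from a[2, 1] and b[0, 1]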

Why does this numpy array comparison fail?

I try to compare the results of some numpy.array calculations with expected results, and I constantly get a failed comparison even though the printed arrays look the same, e.g.:
from numpy import array
import numpy.testing as npt

def test_gen_sine():
    A, f, phi, fs, t = 1.0, 10.0, 1.0, 50.0, 0.1
    expected = array([0.54030231, -0.63332387, -0.93171798, 0.05749049, 0.96724906])
    result = gen_sine(A, f, phi, fs, t)
    npt.assert_array_equal(expected, result)
prints back:
> raise AssertionError(msg)
E AssertionError:
E Arrays are not equal
E
E (mismatch 100.0%)
E x: array([ 0.540302, -0.633324, -0.931718, 0.05749 , 0.967249])
E y: array([ 0.540302, -0.633324, -0.931718, 0.05749 , 0.967249])
My gen_sine function is:
import numpy as np

def gen_sine(A, f, phi, fs, t):
    sampling_period = 1 / fs
    num_samples = fs * t
    samples_range = (np.arange(0, num_samples) * 2 * f * np.pi * sampling_period) + phi
    return A * np.cos(samples_range)
Why is that? How should I compare the two arrays?
(I'm using numpy 1.9.3 and pytest 2.8.1)
The problem is that assert_array_equal (from numpy.testing) returns None and does the assertion internally. It is incorrect to preface it with a separate assert as you do:
assert np.assert_array_equal(x,y)
Instead in your test you would just do something like:
import numpy as np
from numpy.testing import assert_array_equal

def test_equal():
    assert_array_equal(np.arange(0, 3), np.array([0, 1, 2]))  # No assertion raised
    assert_array_equal(np.arange(0, 3), np.array([2, 0, 1]))  # Raises AssertionError
Update:
A few comments
Don't rewrite your entire original question, because then it becomes unclear what an answer was actually addressing.
As for your updated question, the issue is that assert_array_equal is not appropriate for comparing floating-point arrays, as explained in the documentation. Instead, use assert_allclose and set the desired relative and absolute tolerances.
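A minimal sketch of the test rewritten with assert_allclose (tolerances chosen here for illustration; tighten or loosen as needed):
import numpy.testing as npt
from numpy import array

def test_gen_sine():
    A, f, phi, fs, t = 1.0, 10.0, 1.0, 50.0, 0.1
    expected = array([0.54030231, -0.63332387, -0.93171798, 0.05749049, 0.96724906])
    result = gen_sine(A, f, phi, fs, t)
    # Compare to within tolerances instead of exact (bit-for-bit) equality
    npt.assert_allclose(result, expected, rtol=1e-7, atol=1e-8)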