I'm trying to do the right thing for my career by learning how to really use the data science tools and automate Excel.
There is a pretty basic resistor circuit called a Pi attenuator.
On a particular component, two of the terminals are shorted together by design.
Because two of the elements share a node (Terminal 3 shorts R1 and R2 - see spreadsheet screenshot), it isn't possible to just put a meter across an element and read its real value. What is measured is always that element in parallel with the series combination of the other two.
I have measurement data from both before and after exposing the parts to a high-temperature oven.
The left side of the sheet has the before-oven measurements (measured and converted).
I was able to use scipy fsolve to get the correct element values (as if R1 and R2 weren't common). This code absolutely works and gives the correct info.
The F[] equations came from solving for the series/parallel value seen when an ohmmeter is put across pins (1,3), (2,3), and (1,2). I know those are correct. As I understand it, the system is non-linear because the variables appear in the denominators.
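Concretely, with x = R1, y = R2, w = R3 (the same names as in the code below), each meter reading is one element in parallel with the series combination of the other two:

meas_a = 1/(1/x + 1/(y+w))
meas_b = 1/(1/y + 1/(x+w))
meas_c = 1/(1/w + 1/(x+y))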
I used the fsolve code with manually entered starting measurements and converted them to their real values.
The problem is that I need to do this hundreds of times for many test groups, and will need to test other attenuator values. I don't want to have to re-run the script after typing in successive measurements if I can help it.
I'm able to get the data out of Excel and into numpy arrays, but when I pass the data to the fsolve function I get printed values with a bunch of extra decimal places, plus warnings like
RuntimeWarning: overflow encountered in double_scalars
RuntimeWarning: divide by zero encountered in double_scalars
To me these look like scary, insurmountable issues. The exact lines the warnings point at have changed as I've edited, but they always happen during the F[] equation assignments.
I wanted to be able to provide the spreadsheet (the "Actual excel datafile" link) but I'm told the link won't work here.
The code I'm working with that uses openpyxl has been limited to 1 row of data for my sanity. If I can get correct values for one row, I'll tackle the rest from there.
I tried to do what I could to lay out the basis of what I'm doing with screenshots and by providing as much detail as possible. I'm hoping to receive feedback about how to do what I've explained by using numpy arrays, fsolve, and openpyxl.
In general (in my personal programming "experience") I have issues with making sure I've cast to the correct type. I wouldn't be surprised if that was part of my problem. I also have issues with scope, I overthink everything, and the majority of my EE schooling used C or assembly.
It's embarrassing to say how long it took me to even get this far. I'm so ignorant that I don't even know what I don't know. I've put it down and picked it back up enough times that I'm starting to get emotional. I'm trying to fail forward here, but I need another set of eyes.
I tried:
-enforcing float64 (or float32) dtype on the np arrays
-creating other np arrays inside the convert_measured function
-casting the parameters passed to convert_measured to separate variables within the function
-rounding the values from the cells because they seem to expand. I don't need more than 2 decimal places of precision for this
import numpy as np
from scipy.optimize import fsolve
import openpyxl
wb = openpyxl.load_workbook("datafile.xlsx")
ws = wb['Group A']
""" Excel Data Cell Locations
Measured R1,R2,R3 Effective Starting Values
Group A,B,C,D worksheet -Cell Range I15 to K34
I = column 9
K = colum 11
"""
#Data limited to just one row here to debug
MeasColBegin = int(9)
MeasColEnd = int(11)
MeasRowBegin = int(15) # Measured values start in row 15
MeasRowEnd = int(15) # Row 34 is the real end. If this is set to 15 then it's only looking at one row
rows_input = ws.iter_rows(min_row=MeasRowBegin,max_row=MeasRowEnd, min_col=MeasColBegin,max_col=MeasColEnd)
"""
Calculated R1,R2,R3 Actual Finished Values
Group A,B,C,D worksheet -Cell Range I39 to K58
I = column 9
K = colum 11
"""
#These aren't being used yet
CalcColBegin = int(9)
CalcColEnd = int(11)
CalcRowBegin = int(39)
CalcRowEnd = int(58)
#row iterator for actual/calculated values
#This isn't being used yet
rows_output = ws.iter_rows(min_row=CalcRowBegin,max_row=CalcRowEnd, min_col=CalcColBegin,max_col=CalcColEnd)
#This is called by fsolve
#This works when I don't use data passed from the excel spreadsheet
def convert_measured(z):  # Maybe I need to cast z to a different type/size...?
    F = np.empty(3)
    F.astype('float64')  # note: astype returns a copy, so this line has no effect
    #print("z datatypes ", z[0].dtype, z[1].dtype, z[2].dtype)
    # IDK why this prints so many decimal places (or why it prints so many times)
    # I assume it has to do with how the optimizer works
    #print("z[]= ", z[0], z[1], z[2])
    # Same assignments as the simpler version of the code that gives correct answers
    x = z[0]
    y = z[1]
    w = z[2]
    #print("x=", x, " y=", y, " w=", w)
    # This is certainly wrong
    F[0] = 1/(1/z[0] + 1/(z[1]+z[2])) - z[0]
    F[1] = 1/(1/z[1] + 1/(z[0]+z[2])) - z[1]
    F[2] = 1/(1/z[2] + 1/(z[0]+z[1])) - z[2]
    # I tried thinking that rounding might help. Nope.
    #F[0] = 1/(1/x + 1/(y+w)) - np.around(z[0], 4)
    #F[1] = 1/(1/y + 1/(x+w)) - np.around(z[1], 4)
    #F[2] = 1/(1/w + 1/(x+y)) - np.around(z[2], 4)
    # Original code from the example that works when I enter the numbers by hand
    #F[0] = 1/(1/x + 1/(y+w)) - 148.884
    #F[1] = 1/(1/y + 1/(x+w)) - 148.853
    #F[2] = 1/(1/w + 1/(x+y)) - 16.506
    # Bottom line: one of my problems is that I don't know how to represent F[0,1,2] in terms of z
    return F
def main():
    # numpy array where measured values to be converted will go
    zGuess = np.array([1, 1, 1], dtype='float64')
    # numpy array where converted solution/values will go
    zSol = np.array([1, 1, 1], dtype='float64')
    # For loop used to iterate through rows and extract measurements
    # These are passed to the convert_measured function
    for row in rows_input:
        print("row[]= ", row[0].value, row[1].value, row[2].value)  # print values to check it's the right data
        # store values into the np array that will be sent to convert_measured
        zGuess[0] = row[0].value
        zGuess[1] = row[1].value
        zGuess[2] = row[2].value
        print("zGuess[]=", zGuess[0], zGuess[1], zGuess[2])  # print again because I had problems with dtype
        # Solve for measurements / Convert to actual values
        zSol = fsolve(convert_measured, zGuess)
        # This should print the true values of the elements as if no shunt between R1 and R2 exists
        print("zSol= ", zSol[0], zSol[1], zSol[2])
        # Todo: store correct solutions into array and write to spreadsheet

if __name__ == "__main__":
    main()
I've made some changes to your code and ran it on your manually calculated data. The result looks the same apart from a few rounding differences.
Note that 'rows_input' currently points to the range C15:E34 in the code sample below (the other rows_input line is commented out).
The main change was to the line that calls the convert function
zSol = fsolve(convert_measured, zGuess)
so that the measured values are passed through as fsolve's extra args and the result is rounded to 2 decimal places:
zSol = np.round(fsolve(convert_measured, zSol, zGuess), 2)
The convert_measured function was also changed to accept the measured inputs as a second parameter, so the residuals are formed against the measured data rather than against the unknowns themselves, which is what made the original equations blow up with overflow and divide-by-zero warnings.
Changed code sample below (I commented out the print statements except for the zSol values):
import numpy as np
from scipy.optimize import fsolve
import openpyxl
wb = openpyxl.load_workbook("datafile.xlsx")
ws = wb['Group A']
""" Excel Data Cell Locations
Measured R1,R2,R3 Effective Starting Values
Group A,B,C,D worksheet -Cell Range I15 to K34
I = column 9
K = colum 11
"""
# Data limited to just one row here to debug
MeasColBegin = int(9)
MeasColEnd = int(11)
MeasRowBegin = int(15) # Measured values start in row 15
MeasRowEnd = int(15) # Row 34 is the real end. If this is set to 15 then it's only looking at one row
# rows_input = ws.iter_rows(min_row=MeasRowBegin, max_row=MeasRowEnd, min_col=MeasColBegin, max_col=MeasColEnd)
rows_input = ws.iter_rows(min_row=15, max_row=34, min_col=3, max_col=5)
"""
Calculated R1,R2,R3 Actual Finished Values
Group A,B,C,D worksheet -Cell Range I39 to K58
I = column 9
K = colum 11
"""
# These aren't being used yet
CalcColBegin = int(9)
CalcColEnd = int(11)
CalcRowBegin = int(39)
CalcRowEnd = int(58)
# row iterator for actual/calculated values
# This isn't being used yet
rows_output = ws.iter_rows(min_row=CalcRowBegin, max_row=CalcRowEnd, min_col=CalcColBegin, max_col=CalcColEnd)
def convert_measured(z, xl):
    F = np.empty(3)
    x = z[0]
    y = z[1]
    w = z[2]
    # Residuals: model (element in parallel with the series pair) minus the measured value
    F[0] = 1/(1/x + 1/(y+w)) - xl[0]
    F[1] = 1/(1/y + 1/(x+w)) - xl[1]
    F[2] = 1/(1/w + 1/(x+y)) - xl[2]
    return F
def main():
    # numpy array where measured values to be converted will go
    zGuess = np.array([1, 1, 1], dtype='float64')
    # numpy array where converted solution/values will go
    zSol = np.array([1, 1, 1], dtype='float64')
    # Iterate through rows and extract measurements
    # These are passed to the convert_measured function
    for row in rows_input:
        # print("row[]= ", row[0].value, row[1].value, row[2].value)
        # store values into the np array that will be sent to convert_measured
        zGuess[0] = row[0].value
        zGuess[1] = row[1].value
        zGuess[2] = row[2].value
        # print("zGuess[]=", zGuess[0], zGuess[1], zGuess[2])
        # Solve for measurements / Convert to actual values.
        # zSol (the previous solution) is used as the starting guess; the measured
        # values in zGuess are passed through as extra args to convert_measured.
        zSol = np.round(fsolve(convert_measured, zSol, zGuess), 2)
        # This should print the true values of the elements as if no shunt between R1 and R2 exists
        print("zSol= ", zSol[0], zSol[1], zSol[2])
        # Todo: store correct solutions into array and write to spreadsheet

if __name__ == "__main__":
    main()
Output looks like this
zSol= 290.03 288.94 16.99
zSol= 283.68 294.84 16.97
zSol= 280.83 297.43 17.07
zSol= 292.67 286.63 16.99
zSol= 277.51 301.04 16.93
zSol= 294.98 284.66 16.95
...
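If it helps with the remaining Todo, here's a minimal sketch of the write-back step. It collects the solutions first, then fills the calculated range via the existing rows_output iterator; the output filename is arbitrary:

def main():
    zGuess = np.array([1, 1, 1], dtype='float64')
    zSol = np.array([1, 1, 1], dtype='float64')
    solutions = []
    for row in rows_input:
        zGuess[:] = [row[0].value, row[1].value, row[2].value]
        zSol = np.round(fsolve(convert_measured, zSol, zGuess), 2)
        solutions.append(zSol)
    # Write each solved triplet into the calculated range (I39:K58) and save a copy
    for out_row, sol in zip(rows_output, solutions):
        for cell, value in zip(out_row, sol):
            cell.value = float(value)
    wb.save("datafile_solved.xlsx")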
How do I detect multivariate outliers in a large dataset with more than 50 variables? Do I need to plot all of the variables, do I have to group them into independent and dependent variables, or do I need an algorithm for this?
There is a distance metric designed for exactly this: the Mahalanobis distance (MD).
The MD measures the separation between a distribution D and a data point x by generalizing the z-score: it indicates how far x is from the mean of D in terms of standard deviations. For a point x, mean vector m, and covariance matrix C, it is MD(x) = sqrt((x − m)^T · C^(−1) · (x − m)).
You can use the function below to find outliers. It returns the indexes of the outliers along with the distances.
from scipy.stats import chi2
from scipy import linalg
import numpy as np

def mahalanobis_method(df):
    # M-Distance
    x_minus_mean = df - np.mean(df)     # center each column on its mean
    cov = np.cov(df.values.T)           # covariance matrix
    inv_covmat = linalg.inv(cov)        # inverse covariance
    left_term = np.dot(x_minus_mean, inv_covmat)
    mahal = np.dot(left_term, x_minus_mean.T)  # note: forms an n-by-n matrix, fine for moderate n
    md = np.sqrt(mahal.diagonal())      # MD for each observation
    # Flag as outliers
    outliers = []
    # Cut-off point: chi-squared quantile at alpha = 0.001,
    # degrees of freedom = number of variables
    C = np.sqrt(chi2.ppf(1 - 0.001, df=df.shape[1]))
    for i, v in enumerate(md):
        if v > C:
            outliers.append(i)
    return outliers, md
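A quick usage sketch with made-up data (the function expects a pandas DataFrame, which is what df.values implies):

import pandas as pd

# 500 observations of 50 variables, with one obvious outlier injected at row 10
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(500, 50)))
data.iloc[10] = 25.0

outliers, md = mahalanobis_method(data)
print(outliers)  # should include 10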
If you want to study more about Mahalanobis Distance and its formula you can read this blog.
So, how should we understand the formula above? Take the (x − m)^T · C^(−1) term. (x − m) is essentially the distance of the vector from the mean. We then "divide" by the covariance matrix (more precisely, multiply by the inverse of the covariance matrix). If you think about it, this is essentially a multivariate equivalent of regular standardization (z = (x − μ)/σ).
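As a quick sanity check, with a single variable the MD collapses to the absolute z-score, since C is just the variance:

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
z = np.abs((x - x.mean()) / x.std())                # |z-score|
md = np.abs(x - x.mean()) * np.sqrt(1.0 / x.var())  # 1-D Mahalanobis distance
print(np.allclose(z, md))                           # True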
I have a vector of numbers (here random). For each position in the vector, I'd like to calculate the relation between something computed on the left side and on the right side of that position (here the mean, just to make the example concrete).
Here is a procedural example. I'm interested in the vectorized form.
from numpy.random import rand
import numpy as np

numbers = rand(40)
k = np.zeros(numbers.shape)
for i in range(*numbers.shape):
    k[i] = np.mean(numbers[:i]) / np.mean(numbers[i:])
This example will return nan for the first element (the mean of an empty slice), but that is not a problem for now.
Here's a vectorized way -
n = len(numbers)
fwd = numbers.cumsum()/np.arange(1,n+1)                 # fwd[i] = mean(numbers[:i+1])
bwd = (numbers[::-1].cumsum()[::-1])/np.arange(n,0,-1)  # bwd[i] = mean(numbers[i:])
k_out = np.r_[np.nan,fwd[:-1]]/bwd                      # shift fwd so k_out[i] = mean(numbers[:i])/mean(numbers[i:])
Optimizing a bit further with one cumsum, it would be -
n = len(numbers)
r = np.arange(1,n+1)
c = numbers.cumsum()
fwd = c/r                       # running means from the left
b = c[-1]-c                     # b[i] = sum(numbers[i+1:])
bwd = np.r_[1,b[:-1]]/r[::-1]   # bwd[i] = mean(numbers[i:]); the leading 1 is a dummy, divided into nan anyway
k_out = np.r_[np.nan,fwd[:-1]]/bwd
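To verify either vectorized version against the original loop (index 0 is nan in both, so compare from index 1):

# loop reference over the same `numbers` used above
k_ref = np.zeros(numbers.shape)
for i in range(len(numbers)):
    k_ref[i] = np.mean(numbers[:i]) / np.mean(numbers[i:])

print(np.allclose(k_ref[1:], k_out[1:]))  # True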
I spent some time on this and there is a simple and universal alternative: numpy.vectorize with the excluded parameter, where the vector to be split is excluded from vectorization. The example still uses np.mean but it can be replaced with any function:
def split_mean(vect, i):
    return np.mean(vect[:i]) / np.mean(vect[i:])

v_split_mean = np.vectorize(split_mean)
v_split_mean.excluded.add(0)   # don't broadcast over the vector itself, only over i

numbers = np.random.rand(30)
indexes = np.arange(*numbers.shape)
v_split_mean(numbers, indexes)
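One caveat worth knowing: np.vectorize is essentially a Python-level loop under the hood (the NumPy docs say as much), so this is a readability win rather than a speed win; the cumsum approach above remains the fast path. It does agree with the procedural loop apart from the nan at index 0:

k_ref = np.array([np.mean(numbers[:i]) / np.mean(numbers[i:])
                  for i in range(len(numbers))])
print(np.allclose(v_split_mean(numbers, indexes)[1:], k_ref[1:]))  # True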
How can I vectorize this loop in NumPy? It uses sampling from NumPy's binomial() function to estimate the probability that, out of 55 events, exactly m of a particular type occur, where the probability of a single event is 5%; i.e. it estimates C(55,m) * 0.05^m * 0.95^(55-m), where C(55,m) = 55!/(m!(55-m)!).
import numpy as np

M = 7
m = np.arange(M+1)
ntrials = 1000000
p = np.empty(M+1)
for r in m:
    p[r] = np.sum(np.random.binomial(55, 0.05, ntrials)==r)/ntrials
Here is the equivalent code:
p = np.zeros(M+1)
print p
I imagine you didn't intend for your output to always be all zero, but it is! (Under Python 2, np.sum() returns an integer here, so dividing by ntrials is integer division and truncates to 0.) So the first thing to do is add a dtype=float argument to your np.sum() call. With that out of the way, we can vectorize the whole thing like this:
samples = np.random.binomial(55, 0.05, (ntrials, M+1))    # one independent column of draws per value of m
p = np.sum(samples == m, dtype=float, axis=0) / ntrials   # column i counts how often a draw equals m[i]
This produces an equivalent, though not identical, result. The reason is that the random number generation is done in a different sequence, so you will get an answer which is "correct" but not identical to the old code. If you want the identical result to before, you can get that by changing the first line to this:
samples = np.random.binomial(55, 0.05, (M+1, ntrials)).T
Then you draw in the same order as before, with no real performance penalty.
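As a sanity check on the estimates, scipy can give the exact pmf the simulation approximates (this assumes scipy is available; it isn't needed for the vectorization itself):

from scipy.stats import binom

exact = binom.pmf(m, 55, 0.05)   # exact C(55,m) * 0.05**m * 0.95**(55-m)
print(np.abs(p - exact).max())   # should be small, roughly 1e-3 or less for ntrials = 1000000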
For each location in the result matrix, instead of storing the dot product of the corresponding row and column of the argument matrices, I would like to store their element-wise product, which is a vector extending into a third dimension.
One idea would be to convert the argument matrices to vectors with vector entries and then take their outer product, but I'm not sure how to do that either.
EDIT:
I figured it out before I saw there was a reply. Here is my solution:
def newdot(A, B):
    A = A.reshape((1,) + A.shape)
    B = B.reshape((1,) + B.shape)
    A = A.transpose(2, 1, 0)
    B = B.transpose(1, 0, 2)
    return A * B
What I am doing is taking apart each row and column pair that will have their outer product taken, and forming two lists of them, which then get their contents matrix multiplied together in parallel.
It's a little convoluted (and difficult to explain) but this function should get you what you're looking for:
def f(m1, m2):
    return (m2.A.T * m1.A.reshape(m1.shape[0], 1, m1.shape[1]))

m3 = m1 * m2       # ordinary matrix product
m3_el = f(m1, m2)  # element-wise products; the third axis holds the individual terms
# The two agree once the third axis is summed, i.e. for every i, j:
m3[i,j] == sum(m3_el[i,j,:])
m3 == m3_el.sum(2)
The basic idea is to turn the matrices into arrays and do element-by-element multiplication. One of the arrays gets reshaped to have a size of one in its middle dimension, and array broadcasting rules expand this dimension out to match the height of the other array.
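For what it's worth, the same element-wise product tensor can also be spelled as a single np.einsum call, which makes the index bookkeeping explicit (assuming m1 is i-by-k and m2 is k-by-j, as above):

m3_el = np.einsum('ik,kj->ijk', np.asarray(m1), np.asarray(m2))
m3 = m3_el.sum(axis=2)  # summing the third axis recovers the ordinary matrix product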