Testing Numpy operations - numpy

Whenever I need to test a moderately complex numpy expression, say,
c = np.multiply.outer(a, b)
d = np.einsum('kjij->ijk', c)
I end up resorting to hacks such as setting a and b like this:
a = np.arange(9).reshape(3,3)
b = a / 10
so that I can then track what d contains.
This is ugly and not very convenient. Ideally, I would be able to do something like the following:
a = np.array(list("abcdefghi")).reshape(3,3)
b = np.array(list("ABCDEFGHI")).reshape(3,3)
c = np.add.outer(a, b)
d = np.einsum('kjij->ijk', c)
so that, e.g., d[0, 1, 2] could be seen to correspond to 'hB', which is much clearer than .7 (which is what the other assignment to a and b would give.) This cannot be done, because the ufunc add does not take characters.
In summary, once I start chaining a few transformations (an outer product, an einsum, broadcasting or slicing, etc.) I lose track and need to see for myself what my transformations are actually doing. That's when I need to run a few examples, and that's where my current method of doing so strikes me as suboptimal. Is there any standard, or better, way to do this?

In [454]: a = np.array(list("abcdefghi")).reshape(3,3)
...: b = np.array(list("ABCDEFGHI")).reshape(3,3)
np.add can't be used because add has not been defined for the string dtype:
In [455]: c = np.add.outer(a,b)
....
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
But np.char has functions that apply Python string methods to ndarray elements (these aren't fast, just convenient):
Signature: np.char.add(x1, x2)
Docstring:
Return element-wise string concatenation for two arrays of str or unicode.
Using broadcasting I can perform your outer string concatenation:
In [457]: c = np.char.add(a[:,:,None,None], b[None,None,:,:])
In [458]: c.shape
Out[458]: (3, 3, 3, 3)
In [459]: c
Out[459]:
array([[[['aA', 'aB', 'aC'],
['aD', 'aE', 'aF'],
['aG', 'aH', 'aI']],
[['bA', 'bB', 'bC'],
['bD', 'bE', 'bF'],
['bG', 'bH', 'bI']],
....
[['iA', 'iB', 'iC'],
['iD', 'iE', 'iF'],
['iG', 'iH', 'iI']]]], dtype='<U2')
I was skeptical that einsum could handle this array, since einsum is normally used for np.dot-like sum-of-products calculations. But with this indexing it is just selecting a diagonal and rearranging axes, so it does work:
In [460]: np.einsum('kjij->ijk', c)
Out[460]:
array([[['aA', 'dA', 'gA'],
['bB', 'eB', 'hB'],
['cC', 'fC', 'iC']],
[['aD', 'dD', 'gD'],
['bE', 'eE', 'hE'],
['cF', 'fF', 'iF']],
[['aG', 'dG', 'gG'],
['bH', 'eH', 'hH'],
['cI', 'fI', 'iI']]], dtype='<U2')
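The same labelled result can also be built without einsum, in case einsum does not accept string arrays in a given NumPy version (that is an assumption on my part; the output above shows it worked here). A minimal sketch, reading 'kjij->ijk' as d[i, j, k] = a[k, j] + b[i, j] and letting broadcasting do the rest:

```python
import numpy as np

a = np.array(list("abcdefghi")).reshape(3, 3)
b = np.array(list("ABCDEFGHI")).reshape(3, 3)

# 'kjij->ijk' applied to the outer result means d[i, j, k] = a[k, j] + b[i, j].
# a.T[None, :, :] is indexed [_, j, k]; b[:, :, None] is indexed [i, j, _].
d = np.char.add(a.T[None, :, :], b[:, :, None])

print(d.shape)     # (3, 3, 3)
print(d[0, 1, 2])  # hB
```

Writing the index equation out first and then translating it into broadcast axes is a useful cross-check on the einsum subscript string.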
The d from the numeric test case (computed with np.add.outer so that it matches the string version):
array([[[0. , 3. , 6. ],
[1.1, 4.1, 7.1],
[2.2, 5.2, 8.2]],
[[0.3, 3.3, 6.3],
[1.4, 4.4, 7.4],
[2.5, 5.5, 8.5]],
[[0.6, 3.6, 6.6],
[1.7, 4.7, 7.7],
[2.8, 5.8, 8.8]]])
The pattern with these numeric values is just as clear as with strings.
I like to use distinct array shapes where possible, because it makes tracking dimensions across changes easier:
In [496]: a3 = np.arange(1,13).reshape(4,3)
...: b3 = np.arange(1,7).reshape(2,3) / 10
In [497]: c3 = np.add.outer(a3,b3)
In [498]: d3 = np.einsum('kjij->ijk', c3)
In [499]: c3.shape
Out[499]: (4, 3, 2, 3)
In [500]: d3.shape
Out[500]: (2, 3, 4)
In [501]: d3
Out[501]:
array([[[ 1.1, 4.1, 7.1, 10.1],
[ 2.2, 5.2, 8.2, 11.2],
[ 3.3, 6.3, 9.3, 12.3]],
[[ 1.4, 4.4, 7.4, 10.4],
[ 2.5, 5.5, 8.5, 11.5],
[ 3.6, 6.6, 9.6, 12.6]]])
With distinct shapes, a mistyped subscript string such as 'kjik->ijk' raises an error instead of silently producing a result.
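To illustrate why distinct shapes help, here is a small sketch that feeds the same mistyped subscript string to a square test case and to one with distinct axis lengths. The variable names (bad_spec, square, distinct) are mine, and I am assuming einsum reports the mismatch as a ValueError:

```python
import numpy as np

bad_spec = 'kjik->ijk'  # typo for 'kjij->ijk': k is repeated instead of j

# Square inputs: every axis has length 3, so the typo goes unnoticed.
square = np.add.outer(np.arange(9).reshape(3, 3),
                      np.arange(9).reshape(3, 3) / 10)
silent = np.einsum(bad_spec, square)  # runs, but computes something else

# Distinct shapes: the two 'k' axes have lengths 4 and 3, so einsum refuses.
distinct = np.add.outer(np.arange(1, 13).reshape(4, 3),
                        np.arange(1, 7).reshape(2, 3) / 10)
try:
    np.einsum(bad_spec, distinct)
    caught = False
except ValueError as exc:
    caught = True
    print("einsum rejected the spec:", exc)
```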
With numeric values I can perform the multiply.outer with einsum:
In [502]: c4 = np.multiply.outer(a3,b3)
In [503]: np.allclose(c4,np.einsum('ij,kl',a3,b3))
Out[503]: True
In [504]: d4 = np.einsum('kjij->ijk', c4)
In [505]: np.allclose(d4,np.einsum('kj,ij->ijk',a3,b3))
Out[505]: True
In [506]: d4
Out[506]:
array([[[0.1, 0.4, 0.7, 1. ],
[0.4, 1. , 1.6, 2.2],
[0.9, 1.8, 2.7, 3.6]],
[[0.4, 1.6, 2.8, 4. ],
[1. , 2.5, 4. , 5.5],
[1.8, 3.6, 5.4, 7.2]]])
That 'kj,ij->ijk' gives me a better idea of what is happening than the d display does.
Another way to put it:
(4,3) + (2,3) => (2,3,4)
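When I want to be sure about a chain like this, one option is to compare it against a deliberately naive loop version on small inputs. A sketch under the same shapes as above (the name ref is just for illustration):

```python
import numpy as np

a = np.arange(1, 13).reshape(4, 3)
b = np.arange(1, 7).reshape(2, 3) / 10

c = np.multiply.outer(a, b)      # shape (4, 3, 2, 3)
d = np.einsum('kjij->ijk', c)    # shape (2, 3, 4)

# Reference: spell out d[i, j, k] = a[k, j] * b[i, j] with plain loops.
ref = np.empty((2, 3, 4))
for i in range(2):
    for j in range(3):
        for k in range(4):
            ref[i, j, k] = a[k, j] * b[i, j]

print(np.allclose(d, ref))                               # True
print(np.allclose(d, np.einsum('kj,ij->ijk', a, b)))     # True
```

The loop is slow, but it states the intended index relationship directly, so a disagreement points at the vectorized expression rather than at the test.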

Related

Julia "MethodError: no method matching build_tree"

I have a very simple sample script:
using Pkg
Pkg.add("DecisionTree")
Pkg.add("DataFrames")
using DataFrames
using DecisionTree
dat = DataFrame(A=[1, 2, 3, 4, 5], B=[2, 5, 1, 2, 6])
model = build_tree(dat[!, "A"], dat[!, "B"])
Which returns an error:
ERROR: LoadError: MethodError: no method matching build_tree(::Vector{Int64}, ::Vector{Int64})
Closest candidates are:
build_tree(::AbstractVector{T}, ::AbstractMatrix{S}) where {S, T} at C:\Users\**\.julia\packages\DecisionTree\iWCbW\src\classification\main.jl:74
build_tree(::AbstractVector{T}, ::AbstractMatrix{S}, ::Any) where {S, T} at C:\Users\**\.julia\packages\DecisionTree\iWCbW\src\classification\main.jl:74
build_tree(::AbstractVector{T}, ::AbstractMatrix{S}, ::Any, ::Any) where {S, T} at C:\Users\**\.julia\packages\DecisionTree\iWCbW\src\classification\main.jl:74
What is going on? How do I deal with that?
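Your data types do not match: as the candidate signatures show, build_tree expects a matrix as its second argument, but dat[!, "B"] is a vector. Try this: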
C = reshape(dat[!, "B"], (1, 5))
model = DecisionTree.build_tree(dat[!, "A"], C')

Scala: how to get the mean and variance and covariance of a matrix?

I am new to scala and I desperately need some guidance on the following problem:
I have a dataframe like the one below (some elements may be NULL)
val dfDouble = Seq(
(1.0, 1.0, 1.0, 3.0),
(1.0, 2.0, 0.0, 0.0),
(1.0, 3.0, 1.0, 1.0),
(1.0, 4.0, 0.0, 2.0)).toDF("m1", "m2", "m3", "m4")
dfDouble.show
+---+---+---+---+
| m1| m2| m3| m4|
+---+---+---+---+
|1.0|1.0|1.0|3.0|
|1.0|2.0|0.0|0.0|
|1.0|3.0|1.0|1.0|
|1.0|4.0|0.0|2.0|
+---+---+---+---+
I need to get the following statistics out of this dataframe:
a vector that contains the mean of each column (some elements might be NULL and I want to calculate the mean using only the non-NULL elements); I would also like to refer to each element of the vector by name for example, vec_mean["m1_mean"] would return the first element
vec_mean: Vector(m1_mean, m2_mean, m3_mean, m4_mean)
a variance-covariance matrix that is (4 x 4), where the diagonals are var(m1), var(m2),..., and the off-diagonals are cov(m1,m2), cov(m1,m3) ... Here, I would also like to only use the non-NULL elements in the variance-covariance calculation
A vector that contains the number of non-null for each column
vec_n: Vector(m1_n, m2_n, m3_n, m4_n)
A vector that contains the standard deviation of each column
vec_stdev: Vector(m1_stde, m2_stde, m3_stde, m4_stde)
In R I would convert everything to a matrix and then the rest is easy. But in Scala I'm unfamiliar with matrices, and there are apparently multiple matrix types (DenseMatrix, IndexedMatrix, etc.), which is confusing.
Edited: apparently it makes a difference if the content of the dataframe is Double or Int. Revised the elements to be double
Used the following command per suggested answer and it worked!
val rdd = dfDouble0.rdd.map {
case a: Row => (0 until a.length).foldRight(Array[Double]())((b, acc) =>
{ val k = a.getAs[Double](b)
if(k == null)
acc.+:(0.0)
else acc.+:(k)}).map(_.toDouble)
}
You can work with Spark's RowMatrix. It supports operations such as computing the covariance matrix (using each row as an observation), the mean, the variance, and so on. The only thing you need to know is how to build it from a DataFrame.
It turns out that a DataFrame in Spark carries a schema describing the types of data it can store, and those are not limited to arrays of floating-point numbers. So the first step is to transform the DataFrame into an RDD of vectors (dense vectors in this case).
Having this DF:
val df = Seq(
(1, 1, 1, 3),
(1, 2, 0, 0),
(1, 3, 1, 1),
(1, 4, 0, 2),
(1, 5, 0, 1),
(2, 1, 1, 3),
(2, 2, 1, 1),
(2, 3, 0, 0)).toDF("m1", "m2", "m3", "m4")
Convert it to an RDD[DenseVector] representation. There must be dozens of ways of doing this. One could be:
val rdd = df.rdd.map {
case a: Row =>
(0 until a.length).foldRight(Array[Int]())((b, acc) => {
val k = a.getAs[Int](b)
if(k == null) acc.+:(0) else acc.+:(k)
}).map(_.toDouble)
}
As you can see in your IDE, the inferred type is RDD[Array[Double]] (because of the final .map(_.toDouble)). Now convert this to an RDD[DenseVector], which is as simple as:
val rowsRdd = rdd.map(Vectors.dense(_))
And now you can build your Matrix:
val mat: RowMatrix = new RowMatrix(rowsRdd)
Once you have the matrix, you can easily compute the different metrics per column:
println("Mean: " + mat.computeColumnSummaryStatistics().mean)
println("Variance: " + mat.computeColumnSummaryStatistics().variance)
It gives:
Mean: [1.375,2.625,0.5,1.375]
Variance:
[0.26785714285714285,1.9821428571428572,0.2857142857142857,1.4107142857142858]
You can read more about the capabilities of Spark and these distributed types in the docs: https://spark.apache.org/docs/latest/mllib-data-types.html#data-types-rdd-based-api
You can also compute the covariance matrix, perform an SVD, and so on.

Running sum of complicated functions using pandas data frame values

In this simplified example I have three lists of the same length, list a, list b, and list c. I want to find the following running summations
from math import exp
a = [1.3, 4.5, 7.8, 9.2, 4.1]
b = [2.1, 1.1, 1.0, 1.0, -2.0]
c = [3.1, 4.0, 5.0, 6.0, 7.0]
# This is a simple, but SLOW, method
sum1 = 0.0
for i in range(0, 3):
sum1 += a[i] ** b[i] + 3.1 * c[i]
sum2 = 0.0
for i in range(2,4):
sum2 += (b[i] / a[i]) * exp(-c[i])
total = sum1 + sum2
print(total) # yields 52.27644
The above code works just fine; however, for examples with MUCH larger lists it runs very slowly. If I were to combine the lists in a pandas data frame, is there some built-in, vectorized capability to perform these same running summations with the data frame? Something like below.
import pandas as pd
df_dict = {'A': [1.3, 4.5, 7.8, 9.2, 4.1],
'B': [2.1, 1.1, 1.0, 1.0, -2.0],
'C': [3.1, 4.0, 5.0, 6.0, 7.0]}
df = pd.DataFrame(df_dict)
# Some version of a running summation here!
I do not think you need a dataframe here; just use NumPy's functions on slices of the lists:
import numpy as np
step1 = np.power(a[:3], b[:3])
step2 = np.multiply(c[:3], 3.1)
sum1 = np.add(step1, step2).sum()
step3 = np.divide(b[2:4], a[2:4])
step4 = np.exp(np.multiply(c[2:4], -1))
sum2 = np.multiply(step3, step4).sum()
result = sum1 + sum2
result
52.27644589942484
This should be significantly faster as the list size grows; plus you can optimize it further.
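As one possible tidy-up (a sketch, not the only way), converting the lists to arrays once lets the two sums be written with ordinary operators, which reads closer to the original loop bodies. The names s1 and s2 are hypothetical helpers for the two index ranges:

```python
import numpy as np

a = np.array([1.3, 4.5, 7.8, 9.2, 4.1])
b = np.array([2.1, 1.1, 1.0, 1.0, -2.0])
c = np.array([3.1, 4.0, 5.0, 6.0, 7.0])

s1 = slice(0, 3)   # same rows as range(0, 3)
s2 = slice(2, 4)   # same rows as range(2, 4)

# Each expression is evaluated element-wise over the slice, then summed.
sum1 = np.sum(a[s1] ** b[s1] + 3.1 * c[s1])
sum2 = np.sum(b[s2] / a[s2] * np.exp(-c[s2]))
total = sum1 + sum2

print(total)   # approximately 52.27644, as in the question
```

Converting once up front avoids re-converting the Python lists on every NumPy call, which matters more as the lists grow.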

Numpy fancy indexing with 2D array - explanation

I am (re)building up my knowledge of numpy, having used it a little while ago.
I have a question about fancy indexing with multidimensional (in this case 2D) arrays.
Given the following snippet:
>>> a = np.arange(12).reshape(3,4)
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> i = np.array( [ [0,1], # indices for the first dim of a
... [1,2] ] )
>>> j = np.array( [ [2,1], # indices for the second dim
... [3,3] ] )
>>>
>>> a[i,j] # i and j must have equal shape
array([[ 2, 5],
[ 7, 11]])
Could someone explain, in simple English, the logic being applied to give the results produced? Ideally, the explanation would also apply when 3D and higher-rank arrays are used to index an array.
Conceptually (in terms of restrictions placed on "rows" and "columns"), what does it mean to index using a 2D array?
It means you are constructing a 2D array R such that R = A[B, C]. That is, $r_{ij} = a_{b_{ij},\,c_{ij}}$, or in indexing notation R[i,j] = A[B[i,j], C[i,j]].
So it means that the item located at R[0,0] is the item in A with as row index B[0,0] and as column index C[0,0]. The item R[0,1] is the item in A with row index B[0,1] and as column index C[0,1], etc.
So in this specific case:
>>> b = a[i,j]
>>> b
array([[ 2, 5],
[ 7, 11]])
b[0,0] = 2 since i[0,0] = 0 and j[0,0] = 2, and thus a[0,2] = 2. b[0,1] = 5 since i[0,1] = 1 and j[0,1] = 1, and thus a[1,1] = 5. b[1,0] = 7 since i[1,0] = 1 and j[1,0] = 3, and thus a[1,3] = 7. b[1,1] = 11 since i[1,1] = 2 and j[1,1] = 3, and thus a[2,3] = 11.
So you can say that i determines the "row indices" and j determines the "column indices". This concept holds in more dimensions as well: the first "indexer" determines the indices along the first axis, the second "indexer" the indices along the second axis, and so on. The result always has the (broadcast) shape of the indexers, not the shape of a.
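The rule can be checked by building the result element by element. This is only an illustrative sketch (the name manual is mine); it visits every position of the index arrays and looks up the corresponding element of a:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
i = np.array([[0, 1],
              [1, 2]])   # row indices
j = np.array([[2, 1],
              [3, 3]])   # column indices

result = a[i, j]

# Element-by-element equivalent: manual[p, q] = a[i[p, q], j[p, q]]
manual = np.empty(i.shape, dtype=a.dtype)
for idx in np.ndindex(i.shape):
    manual[idx] = a[i[idx], j[idx]]

print(result)
print(np.array_equal(result, manual))   # True
```

Because the loop runs over the shape of the index arrays, the same code works unchanged if i and j are 3D or higher, as long as they have the same shape.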

Calculating the sum of vectors

I have two matrices a and b and would like to calculate all the sums between them into a tensor. How can I do this more efficiently than doing the following code:
a = np.array([[1,2],[3,4],[5,6]])
b = np.array([[4,5],[6,7]])
n1 = a.shape[0]
n2 = b.shape[0]
f = a.shape[1]
c = np.zeros((n1,n2,f))
for i in range(n1):
for j in range(n2):
c[i,j,:] = a[i,:] + b[j,:]
An Einstein summation (einsum) and the like obviously does not work here, and neither does an outer product. Is there an appropriate method?
You can transform your loop expression into a broadcasting one:
c[i,j,:] = a[i,:] + b[j,:]
c[i,j,:] = a[i,None,:] + b[None,j,:] # fill in the missing dimensions
c = a[:,None,:] + b[None,:,:]
In [167]: a[:,None,:]+b[None,:,:]
Out[167]:
array([[[ 5, 7],
[ 7, 9]],
[[ 7, 9],
[ 9, 11]],
[[ 9, 11],
[11, 13]]])
In [168]: _.shape
Out[168]: (3, 2, 2)
a[:,None]+b does the same thing, since broadcasting adds missing leading dimensions to b automatically, and trailing : slices can be omitted when indexing a.
Use broadcasting, adding the extra dimensions by indexing with None (np.newaxis):
a[:,None,:]+b[None,:,:]
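A quick way to convince yourself that the broadcast expression matches the loop is to compute both on the small example and compare. A minimal sketch (c_loop, c_full and c_short are just illustrative names):

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)
b = np.array([[4, 5], [6, 7]])           # shape (2, 2)

# Loop version from the question.
c_loop = np.zeros((a.shape[0], b.shape[0], a.shape[1]))
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        c_loop[i, j, :] = a[i, :] + b[j, :]

# Broadcast versions: (3, 1, 2) + (1, 2, 2) -> (3, 2, 2)
c_full = a[:, None, :] + b[None, :, :]
c_short = a[:, None] + b

print(c_full.shape)                      # (3, 2, 2)
print(np.array_equal(c_loop, c_full))    # True
print(np.array_equal(c_full, c_short))   # True
```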