Scala: how to get the mean and variance and covariance of a matrix? - dataframe

I am new to scala and I desperately need some guidance on the following problem:
I have a dataframe like the one below (some elements may be NULL)
val dfDouble = Seq(
(1.0, 1.0, 1.0, 3.0),
(1.0, 2.0, 0.0, 0.0),
(1.0, 3.0, 1.0, 1.0),
(1.0, 4.0, 0.0, 2.0)).toDF("m1", "m2", "m3", "m4")
dfDouble.show
+---+---+---+---+
| m1| m2| m3| m4|
+---+---+---+---+
|1.0|1.0|1.0|3.0|
|1.0|2.0|0.0|0.0|
|1.0|3.0|1.0|1.0|
|1.0|4.0|0.0|2.0|
+---+---+---+---+
I need to get the following statistics out of this dataframe:
a vector that contains the mean of each column (some elements might be NULL and I want to calculate the mean using only the non-NULL elements); I would also like to refer to each element of the vector by name for example, vec_mean["m1_mean"] would return the first element
vec_mean: Vector(m1_mean, m2_mean, m3_mean, m4_mean)
a variance-covariance matrix that is (4 x 4), where the diagonals are var(m1), var(m2),..., and the off-diagonals are cov(m1,m2), cov(m1,m3) ... Here, I would also like to only use the non-NULL elements in the variance-covariance calculation
A vector that contains the number of non-null for each column
vec_n: Vector(m1_n, m2_n, m3_n, m4_n)
A vector that contains the standard deviation of each column
vec_stdev: Vector(m1_stde, m2_stde, m3_stde, m4_stde)
In R I would convert everything to a matrix and then the rest is easy. But in scala, I'm unfamiliar with matrices and there are apparently multiple types of matrices, which are confusing (DenseMatrix, IndexedMatrix, etc.)
Edited: apparently it makes a difference if the content of the dataframe is Double or Int. Revised the elements to be double
Used the following command per suggested answer and it worked!
val rdd = dfDouble0.rdd.map {
case a: Row => (0 until a.length).foldRight(Array[Double]())((b, acc) =>
{ val k = a.getAs[Double](b)
if(k == null)
acc.+:(0.0)
else acc.+:(k)}).map(_.toDouble)
}

Yo can work with Spark RowMatrix. It has these kind of operations like computing the co-variance matrix using each row as an observation, mean, varianze, etc... The only thing that you have to know is how to build it from a Dataframe.
It turns out that a Dataframe in Spark contains a schema that represents the type of information that can be stored in it, and it is not only floating point numbers arrays. So the first thing is to transform this DF to a RDD of vectors(dense vector in this case).
Having this DF:
val df = Seq(
(1, 1, 1, 3),
(1, 2, 0, 0),
(1, 3, 1, 1),
(1, 4, 0, 2),
(1, 5, 0, 1),
(2, 1, 1, 3),
(2, 2, 1, 1),
(2, 3, 0, 0)).toDF("m1", "m2", "m3", "m4")
Convert it to a RDD Row[DenseVector] representation. There must be dozens of ways of doing this. One could be:
val rdd = df.rdd.map {
case a: Row =>
(0 until a.length).foldRight(Array[Int]())((b, acc) => {
val k = a.getAs[Int](b)
if(k == null) acc.+:(0) else acc.+:(k)
}).map(_.toDouble)
}
As you can see in your IDE, the inferred type is RDD[Array[Float]. Now convert this to a RDD[DenseVector]. As simple as doing:
val rowsRdd = rdd.map(Vectors.dense(_))
And now you can build your Matrix:
val mat: RowMatrix = new RowMatrix(rowsRdd)
Once you have the matrix, you can easily compute the different metrix per column:
println("Mean: " + mat.computeColumnSummaryStatistics().mean)
println("Variance: " + mat.computeColumnSummaryStatistics().variance)
It gives:
Mean: [1.375,2.625,0.5,1.375]
Variance:
[0.26785714285714285,1.9821428571428572,0.2857142857142857,1.4107142857142858]
you can read more info about the capabilities of Spark and these distributed types in the doc: https://spark.apache.org/docs/latest/mllib-data-types.html#data-types-rdd-based-api
You can also compute the Covariance matrix, doing the SVD, etc...

Related

Julia Jump : Getting all feasible solutions to mip

I would like to have instead of only the vector of optimal solution to a mip , all the feasible (suboptimal) vectors.
I found some old questions here, but I am not sure how they work.
First of all, is there any new library tool/way to do that automatically ?
I tried this but, it did nothing:
if termination_status(m) == MOI.FEASIBLE_POINT
println(x)
end
optimize!(m);
If not, what's the easiest way?
I thought of scanning the optimal solution till I find the first non -zero decision variable, then constraint this variable to be zero and solving the model again.
for i in 1:active_variables
if value.(z[i])==1
#constraint(m, x[i] == 0)
break
end
end
optimize!(m);
But I see this problem with this method** :
Ιf I constraint x[i] to be zero, in the next step I will want maybe to drop again this constraint? This comes down to whether there can exist two(or more) different solutions in which x[i]==1
JuMP supports returning multiple solutions.
Documentation: https://jump.dev/JuMP.jl/stable/manual/solutions/#Multiple-solutions
The workflow is something like:
using JuMP
model = Model()
#variable(model, x[1:10] >= 0)
# ... other constraints ...
optimize!(model)
if termination_status(model) != OPTIMAL
error("The model was not solved correctly.")
end
an_optimal_solution = value.(x; result = 1)
optimal_objective = objective_value(model; result = 1)
for i in 2:result_count(model)
#assert has_values(model; result = i)
println("Solution $(i) = ", value.(x; result = i))
obj = objective_value(model; result = i)
println("Objective $(i) = ", obj)
if isapprox(obj, optimal_objective; atol = 1e-8)
print("Solution $(i) is also optimal!")
end
end
But you need a solver that supports returning multiple solutions, and to configure the right solver-specific options.
See this blog post: https://jump.dev/tutorials/2021/11/02/tutorial-multi-jdf/
The following is an example of all-solution finder for a boolean problem. Such problems are easier to handle since the solution space is easily enumerated (even though it can still grow exponentially big).
First, let's get the packages and define the sample problem:
using Random, JuMP, HiGHS, MathOptInterface
function example_knapsack()
profit = [5, 3, 2, 7, 4]
weight = [2, 8, 4, 2, 5]
capacity = 10
minprofit = 10
model = Model(HiGHS.Optimizer)
set_silent(model)
#variable(model, x[1:5], Bin)
#objective(model, FEASIBILITY_SENSE, 0)
#constraint(model, weight' * x <= capacity)
#constraint(model, profit' * x >= minprofit)
return model
end
(it is a knapsack problem from the JuMP docs).
Next, we use recursion to explore the tree of all possible solutions. The tree does not go down branches with no solution (so the running time is not always exponential):
function findallsol(model, x)
perm = shuffle(1:length(x))
res = Vector{Float64}[]
_findallsol!(res, model, x, perm, 0)
return res
end
function _findallsol!(res, model, x, perm, depth)
n = length(x)
depth > n && return
optimize!(model)
if termination_status(model) == MathOptInterface.OPTIMAL
if depth == n
push!(res, value.(x))
return
else
idx = perm[depth+1]
v = value(x[idx])
newcon = #constraint(model, x[idx] == v)
_findallsol!(res, model, x, perm, depth + 1)
delete(model, newcon)
newcon = #constraint(model, x[idx] == 1 - v)
_findallsol!(res, model, x, perm, depth + 1)
delete(model, newcon)
end
end
return
end
Now we can:
julia> m = example_knapsack()
A JuMP Model
Maximization problem with:
Variables: 5
...
Names registered in the model: x
julia> res = findallsol(m, m.obj_dict[:x])
5-element Vector{Vector{Float64}}:
[1.0, 0.0, 0.0, 1.0, 1.0]
[0.0, 0.0, 0.0, 1.0, 1.0]
[1.0, 0.0, 1.0, 1.0, 0.0]
[1.0, 0.0, 0.0, 1.0, 0.0]
[0.0, 1.0, 0.0, 1.0, 0.0]
And we get a vector with all the solutions.
If the problem in question is a boolean problem, this method might be used, as is. In case it has non-boolean variables, the recursion will have to split the feasible space in some even fashion. For example, choosing a variable and cutting its domain in half, and recursing to each half with a smaller domain on this variable (to ensure termination).
P.S. This is not the optimal method. This problem has been well studied. Possible terms to search for are 'model counting' (especially in the boolean domain).
(UPDATE: Changed objective to use FEASIBLE)

Problem Adding A Column Vector To A Matrix Using ND4J

I'm playing with ND4J basics to come up to speed with its linear algebra capabilities.
I'm running on a Macbook Pro using nd4j-api and nd4j-native dependencies version 1.0.0-M2.1, Open JDK version 17, Kotlin 1.7.20, and IntelliJ 2022.2.2 Ultimate Edition.
I'm writing JUnit 5 tests to perform simple operations: add, subtract, multiply, and divide a 2x2 matrix and a scalar. All are successful and pass just fine.
I was successful at adding a 1x2 row vector to the first and second rows of a 2x2 matrix:
#ParameterizedTest
#ValueSource(longs = [0L, 1L])
fun `add a row vector to each row in a matrix`(rowIndex : Long) {
// setup
val a = Nd4j.create(doubleArrayOf(1.0, 2.0, 3.0, 4.0), intArrayOf(2, 2))
val row = Nd4j.create(doubleArrayOf(11.0, 13.0), intArrayOf(2))
// Adds the row vector to all rows
val expected = arrayOf(
Nd4j.create(doubleArrayOf(12.0, 15.0, 3.0, 4.0), intArrayOf(2, 2)),
Nd4j.create(doubleArrayOf(1.0, 2.0, 14.0, 17.0), intArrayOf(2, 2)))
// exercise
a.getRow(rowIndex).addi(row)
// assert
Assertions.assertEquals(expected[rowIndex.toInt()], a)
}
I try to duplicate the trick by adding a 2x1 column vector to the 2x2 matrix:
#Test
fun `add a column vector to the second column of a matrix`() {
// setup
val a = Nd4j.create(doubleArrayOf(1.0, 2.0, 3.0, 4.0), intArrayOf(2, 2))
val col = Nd4j.create(doubleArrayOf(11.0, 13.0), intArrayOf(2, 1))
// Adds the row vector to all rows
val expected = Nd4j.create(doubleArrayOf(1.0, 2.0, 14.0, 17.0), intArrayOf(2, 2))
// exercise
a.getColumn(1).addi(col)
// assert
Assertions.assertEquals(expected, a)
}
I get an error saying that the array shapes don't match:
java.lang.IllegalStateException: Cannot perform in-place operation "addi": result array shape does not match the broadcast operation output shape: [2].addi([2, 1]) != [2].
In-place operations like x.addi(y) can only be performed when x and y have the same shape, or x and y are broadcastable with x.shape() == broadcastShape(x,y)
I have not been successful in figuring out why. Can anyone see where I've gone wrong and suggest a solution?
We have a function for that already. For matrix + column use addiColumnVector.
For views:
Ensure that you have the exact same shape with the reshape. Do that with some vector:
INDArray vec = Nd4j.zeros(5);
vec.getColumn(0).addi(vec.reshape(5,1));
This solution did the trick. Thanks to Adam Gibson for pointing out the need for reshape:
#ParameterizedTest
#ValueSource(longs = [0L, 1L])
fun `add a column vector to each column in a matrix`(colIndex : Long) {
// setup
val a = Nd4j.create(doubleArrayOf(1.0, 2.0, 3.0, 4.0), intArrayOf(2, 2))
val col = Nd4j.create(doubleArrayOf(11.0, 13.0), intArrayOf(2, 1))
val expected = arrayOf(
Nd4j.create(doubleArrayOf(12.0, 2.0, 16.0, 4.0), intArrayOf(2, 2)),
Nd4j.create(doubleArrayOf(1.0, 13.0, 3.0, 17.0), intArrayOf(2, 2)))
// exercise
// Adds the column vector to successive columns
a.getColumn(colIndex).reshape(intArrayOf(2, 1)).addi(col)
// assert
Assertions.assertEquals(expected[colIndex.toInt()], a)
}

Gratia package: probability instead of log odds for plotting generalized additive model (GAM)

I was trying to plot a logistic GAM model using the gratia package (since it uses ggplot2), however I would like the effects (or partial effects) to be plotted in terms of probabilities instead of log odds.
I have tried by hand using probabilities, however I prefer to use the gratia package. Would there be a way to plot the probabilities specifically using the package?
The model (I created some data):
set.seed(1)
Perf1 <- rlnorm(100)
Sex <- sample(c(rep(1, 40), rep(0, 60)))
set.seed(2)
Group <- sample(c(rep(1, 30), rep(0, 70)))
set.seed(3)
Perf2 <- rlnorm(200)
G <- sample(c(rep(1, 20), rep(0, 80)))
Age <- sample(c(rep(7, 15), rep(8, 20), rep(9, 30), rep(10, 10), rep(11, 15), rep(12, 10)))
sample_data <-data.frame(Age = Age,
Sex = Sex,
G = G,
Group = Group,
Perf1 = Perf1,
Perf2 = Perf1
)
gam_fit <- gam(Group ~ Age + Sex + G + s(Perf1, k = 20) +
s(Perf2, k = 20),
data = sample_data,
family = "binomial",
method="REML", select = F)
draw(gam_fit, parametric = T)
Plotting using gratia:
gam_fit
The effect or partial effect is on the log odds scale, while I would like probabilities instead, but I am unsure how to achieve this.
You'll have to add the model constant term and transform by the inverse of the link function:
draw(gam_fit, constant = coef(gam_fit)[1], fun = inv_link(gam_fit))
(and I'm not sure if constant or fun work on the parametric terms just now.)

How to sort array of struct type in Spark DataFrame by particular field?

Given following code:
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object SortQuestion extends App{
val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate()
import spark.implicits._
case class ABC(a: Int, b: Int, c: Int)
val first = Seq(
ABC(1, 2, 3),
ABC(1, 3, 4),
ABC(2, 4, 5),
ABC(2, 5, 6)
).toDF("a", "b", "c")
val second = Seq(
(1, 2, (Date.valueOf("2018-01-02"), 30)),
(1, 3, (Date.valueOf("2018-01-01"), 20)),
(2, 4, (Date.valueOf("2018-01-02"), 50)),
(2, 5, (Date.valueOf("2018-01-01"), 60))
).toDF("a", "b", "c")
first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b")).groupBy("a").agg(sort_array(collect_list("c2")))
.show(false)
}
Spark produces following result:
+---+----------------------------------+
|a |sort_array(collect_list(c2), true)|
+---+----------------------------------+
|1 |[[2018-01-01,20], [2018-01-02,30]]|
|2 |[[2018-01-01,60], [2018-01-02,50]]|
+---+----------------------------------+
This implies that Spark is sorting an array by date (since it is the first field), but I want to instruct Spark to sort by specific field from that nested struct.
I know I can reshape array to (value, date) but it seems inconvenient, I want a general solution (imagine I have a big nested struct, 5 layers deep, and I want to sort that structure by particular column).
Is there a way to do that? Am I missing something?
According to the Hive Wiki:
sort_array(Array<T>) : Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0).
This means that the array will be sorted lexicographically which holds true even with complex data types.
Alternatively, you can create a UDF to sort it (and witness performance degradation) based on the second element:
val sortUdf = udf { (xs: Seq[Row]) => xs.sortBy(_.getAs[Int](1) )
.map{ case Row(x:java.sql.Date, y: Int) => (x,y) }}
first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
.groupBy("a")
.agg(sortUdf(collect_list("c2")))
.show(false)
//+---+----------------------------------+
//|a |UDF(collect_list(c2, 0, 0)) |
//+---+----------------------------------+
//|1 |[[2018-01-01,20], [2018-01-02,30]]|
//|2 |[[2018-01-02,50], [2018-01-01,60]]|
//+---+----------------------------------+
For Spark 3+, you can pass a custom comparator function to array_sort:
The comparator will take two arguments representing two elements of
the array. It returns -1, 0, or 1 as the first element is less than,
equal to, or greater than the second element. If the comparator
function returns other values (including null), the function will fail
and raise an error.
val df = first
.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
.groupBy("a")
.agg(collect_list("c2").alias("list"))
val df2 = df.withColumn(
"list",
expr(
"array_sort(list, (left, right) -> case when left._2 < right._2 then -1 when left._2 > right._2 then 1 else 0 end)"
)
)
df2.show(false)
//+---+------------------------------------+
//|a |list |
//+---+------------------------------------+
//|1 |[[2018-01-01, 20], [2018-01-02, 30]]|
//|2 |[[2018-01-02, 50], [2018-01-01, 60]]|
//+---+------------------------------------+
Where _2 is the name of the struct field you wan to use for sorting
If you have complex object it is much better to use statically typed Dataset.
case class Result(a: Int, b: Int, c: Int, c2: (java.sql.Date, Int))
val joined = first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
joined.as[Result]
.groupByKey(_.a)
.mapGroups((key, xs) => (key, xs.map(_.c2).toSeq.sortBy(_._2)))
.show(false)
// +---+----------------------------------+
// |_1 |_2 |
// +---+----------------------------------+
// |1 |[[2018-01-01,20], [2018-01-02,30]]|
// |2 |[[2018-01-02,50], [2018-01-01,60]]|
// +---+----------------------------------+
In simple cases it is also possible to udf, but leads to inefficient and fragile code in general and quickly goes out of control, when complexity of objects grows.

Cython Typing List of Strings

I'm trying to use cython to improve the performance of a loop, but I'm running
into some issues declaring the types of the inputs.
How do I include a field in my typed struct which is a string that can be
either 'front' or 'back'
I have a np.recarray that looks like the following (note the length of the
recarray is unknown as compile time)
import numpy as np
weights = np.recarray(4, dtype=[('a', np.int64), ('b', np.str_, 5), ('c', np.float64)])
weights[0] = (0, "front", 0.5)
weights[1] = (0, "back", 0.5)
weights[2] = (1, "front", 1.0)
weights[3] = (1, "back", 0.0)
as well as inputs of a list of strings and a pandas.Timestamp
import pandas as pd
ts = pd.Timestamp("2015-01-01")
contracts = ["CLX16", "CLZ16"]
I am trying to cythonize the following loop
def ploop(weights, contracts, timestamp):
cwts = []
for gen_num, position, weighting in weights:
if weighting != 0:
if position == "front":
cntrct_idx = gen_num
elif position == "back":
cntrct_idx = gen_num + 1
else:
raise ValueError("transition.columns must contain "
"'front' or 'back'")
cwts.append((gen_num, contracts[cntrct_idx], weighting, timestamp))
return cwts
My attempt involved typing the weights input as a struct in cython,
in a file struct_test.pyx as follows
import numpy as np
cimport numpy as np
cdef packed struct tstruct:
np.int64_t gen_num
char[5] position
np.float64_t weighting
def cloop(tstruct[:] weights_array, contracts, timestamp):
cdef tstruct weights
cdef int i
cdef int cntrct_idx
cwts = []
for k in xrange(len(weights_array)):
w = weights_array[k]
if w.weighting != 0:
if w.position == "front":
cntrct_idx = w.gen_num
elif w.position == "back":
cntrct_idx = w.gen_num + 1
else:
raise ValueError("transition.columns must contain "
"'front' or 'back'")
cwts.append((w.gen_num, contracts[cntrct_idx], w.weighting,
timestamp))
return cwts
But I am receiving runtime errors, which I believe are related to the
char[5] position.
import pyximport
pyximport.install()
import struct_test
struct_test.cloop(weights, contracts, ts)
ValueError: Does not understand character buffer dtype format string ('w')
In addition I am a bit unclear how I would go about typing contracts as well
as timestamp.
Your ploop (without the timestamp variable) produces:
In [226]: ploop(weights, contracts)
Out[226]: [(0, 'CLX16', 0.5), (0, 'CLZ16', 0.5), (1, 'CLZ16', 1.0)]
Equivalent function without a loop:
def ploopless(weights, contracts):
arr_contracts = np.array(contracts) # to allow array indexing
wgts1 = weights[weights['c']!=0]
mask = wgts1['b']=='front'
wgts1['b'][mask] = arr_contracts[wgts1['a'][mask]]
mask = wgts1['b']=='back'
wgts1['b'][mask] = arr_contracts[wgts1['a'][mask]+1]
return wgts1.tolist()
In [250]: ploopless(weights, contracts)
Out[250]: [(0, 'CLX16', 0.5), (0, 'CLZ16', 0.5), (1, 'CLZ16', 1.0)]
I'm taking advantage of the fact that returned list of tuples has same (int, str, int) layout as the input weight array. So I'm just making a copy of weights and replacing selected values of the b field.
Note that I use the field selection index before the mask one. The boolean mask produces a copy, so we have to careful about indexing order.
I'm guessing that loop-less array version will be competitive in time with the cloop (on realistic arrays). The string and list operations in cloop probably limit its speedup.