I was trying to implement linear regression in Keras/TensorFlow and was very surprised how difficult it is. The standard examples work great on random data. However, if we change the input data a little bit, all examples stop work correctly.
I try to find coefficients for y = 0.5 * x1 + 0.5 * x2.
np.random.seed(1443)
n = 100000
x = np.zeros((n, 2))
y = np.zeros((n, 1))
x[:,0] = sorted(preprocessing.scale( np.random.poisson(1000000, (n)) ))
x[:,1] = sorted(preprocessing.scale( np.random.poisson(1000000, (n)) ) )
y = (x[:,0] + x[:,1]) /2
model = keras.Sequential()
model.add( keras.layers.Dense(1, input_shape =(2,), dtype="float32" ))
model.compile(loss='mean_squared_error', optimizer='sgd')
model.fit(x,y, epochs=1000, batch_size=64)
print(model.get_weights())
The results:
| epochs| batch_size | bias | x1 | x2
| ------+------------+------------+------------+-----------
| 1000 | 64 | -5.83E-05 | 0.90410435 | 0.09594361
| 1000 | 1024 | -5.71E-06 | 0.98739249 | 0.01258729
| 1000 | 10000 | -3.07E-07 | -0.2441376 | 1.2441349
My first thought was that it is a bug in Keras. So, I tried R/Tensorflow library:
floatType <- "float32"
p <- 2L
X <- tf$placeholder(floatType, shape = shape(NULL, p), name = "x-data")
Y <- tf$placeholder(floatType, name = "y-data")
W <- tf$Variable(tf$zeros(list(p, 1L), dtype=floatType))
b <- tf$Variable(tf$zeros(list(1L), dtype=floatType))
Y_hat <- tf$add(tf$matmul(X, W), b)
cost <- tf$reduce_mean(tf$square(Y_hat - Y))
generator <- tf$train$GradientDescentOptimizer(learning_rate=0.01)
optimizer <- generator$minimize(cost)
session <- tf$Session()
session$run(tf$global_variables_initializer())
set.seed(1443)
n <- 10^5
x <- matrix( replicate(p, sort(scale((rpois(n, 10^6))))) , nrow = n )
y <- matrix((x[,1]+x[,2])/2)
i <- 1
batch_size <- 10000
epoch_number <- 1000
iterationNumber <- n*epoch_number / batch_size
while (iterationNumber > 0) {
feed_dict <- dict(X = x[i:(i+batch_size-1),, drop = F], Y = y[i:(i+batch_size-1),, drop = F])
session$run(optimizer, feed_dict = feed_dict)
i <- i+batch_size
if( i > n-batch_size)
i <- i %% batch_size
iterationNumber <- iterationNumber - 1
}
r_model <- lm(y ~ x)
tf_coef <- c(session$run(b), session$run(W))
r_coef <- r_model$coefficients
print(rbind(tf_coef, r_coef))
The results:
| epochs| batch_size | bias | x1 | x2
| ------+------------+------------+------------+-----------
|2000 | 64 | -1.33E-06 | 0.500307 | 0.4996932
|1000 | 1000 | 2.79E-08 | 0.5000809 | 0.499919
|1000 | 10000 | -4.33E-07 | 0.5004921 | 0.499507
|1000 | 100000 | 2.96E-18 | 0.5 | 0.5
Tensorflow finds the correct result only when batch size = samples number and the optimization algorithm is SGD. If optimization algorithm was "adam" or "adagrad", errors were much larger.
For obvious reasons, I cannot choose hyperparameter batch_size = n. Could you recommend any approaches to solve this problem with precision 1E-07 for Keras or TensorFlow?
Why TensorFlow finds better solutions than Keras?
Comment 1.
Based on post "today" below:
Train dataset shuffling will significantly improve the performance of TensorFlow version:
shuffledIndex<-sample(1:(nrow(x)))
x <- x[shuffledIndex,]
y <- y[shuffledIndex,,drop=FALSE]
For batch size = 2000:
|(Intercept) | x1 | x2
|----------------+-----------+----------
|-1.130693e-09 | 0.5000004 | 0.4999989
The problem is that you are sorting the generated random numbers for each feature value. So they end up very close to each other:
>>> np.mean(np.abs(x[:,0]-x[:,1]))
0.004125721684553685
As a result we would have:
y = (x1 + x2) / 2
~= (x1 + x1) / 2
= x1
= 0.5 * x1 + 0.5 * x1
= 0.3 * x1 + 0.7 * x1
= -0.3 * x1 + 1.3 * x1
= 10.1 * x1 - 9.1 * x1
= thousands of other possible combinations
In this case the solution that Keras would converge to would really depend on the initial value of the weights and bias of Dense layer. With different initial values you would get different results (and possibly for some of them, it may not converge at all):
# set the initial weight of Dense layer
model.layers[0].set_weights([np.array([[0], [1]]), np.array([0])])
# fit the model ...
# the final weights
model.get_weights()
[array([[0.00203656],
[0.9981099 ]], dtype=float32),
array([4.5520876e-05], dtype=float32)] # because: y = 0 * x1 + 1 * x1 = x1 ~= (x1 + x2) / 2
# again set the weights to something different
model.layers[0].set_weights([np.array([[0], [0]]), np.array([1])])
# fit the model...
# the final weights
model.get_weights()
[array([[0.49986306],
[0.50013727]], dtype=float32),
array([1.4176634e-08], dtype=float32)] # the one you were looking for!
However, if you don't sort the features (i.e. just remove sorted) it is very likely that the converged weights to be very close to [0.5, 0.5].
Related
starting with this example of Dirichlet regression here.
My variable y is a vector of N = 3 elements and the Dirichlet regression model estimates N-1 coeff.
Let’s say I am interested in all 3 coefficients, how can I get them?
Thanks!
library(brms)
library(rstan)
library(dplyr)
bind <- function(...) cbind(...)
N <- 20
df <- data.frame(
y1 = rbinom(N, 10, 0.5), y2 = rbinom(N, 10, 0.7),
y3 = rbinom(N, 10, 0.9), x = rnorm(N)
) %>%
mutate(
size = y1 + y2 + y3,
y1 = y1 / size,
y2 = y2 / size,
y3 = y3 / size
)
df$y <- with(df, cbind(y1, y2, y3))
make_stancode(bind(y1, y2, y3) ~ x, df, dirichlet())
make_standata(bind(y1, y2, y3) ~ x, df, dirichlet())
fit <- brm(bind(y1, y2, y3) ~ x, df, dirichlet())
summary(fit)
Family: dirichlet
Links: muy2 = logit; muy3 = logit; phi = identity
Formula: bind(y1, y2, y3) ~ x
Data: df (Number of observations: 20)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
muy2_Intercept 0.29 0.10 0.10 0.47 1.00 2830 2514
muy3_Intercept 0.56 0.09 0.38 0.73 1.00 2833 2623
muy2_x 0.04 0.11 -0.17 0.24 1.00 3265 2890
muy3_x -0.00 0.10 -0.20 0.19 1.00 3229 2973
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi 39.85 9.13 23.83 59.78 1.00 3358 2652
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
I have a dataframe df with 100 columns:
index | col1 | col2 | col3 | ...
2021-04-01 | qwe | 1 | 1.1 | ...
2021-04-02 | asd | 2 | 2.2 | ...
2021-04-03 | yxc | 3 | 3.3 | ...
dtypes:
col1: category
col2: int32
col3: float64
I want to scale all columns that are not of type "category" AND return it as a dataframe, not a numpy array.
My code so far:
y_feature = "col2"
y = df[[y_feature]] # Set predictor y
X = df.drop(
[
y_feature,
],
axis=1,
)
days = (
pd.date_range(start=df.index.min(), end=df.index.max())
.to_frame(name="date")
.reset_index()
.drop("index", axis=1)
)
limit_training_days = int(len(days.index) * 0.85)
X_train_limit = days.iloc[limit_training_days, 0]
print(f"Date for training: {X_train_limit}")
X_train, y_train = (
X.query("date <= #X_train_limit").squeeze(),
y.query("date <= #X_train_limit").squeeze(),
)
X_test, y_test = (
X.query("date > #X_train_limit").squeeze(),
y.query("date > #X_train_limit").squeeze(),
)
categorical_feature = X_train.select_dtypes("category").columns.tolist()
num_cols = X.drop(categorical_feature, axis=1).columns.tolist()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test_sc[num_cols] = scaler.transform(X_test[num_cols])
After updating my packages it now throws this error for the last 2 lines of code:
SettingWithCopyError: A value is trying to be set on a copy of a
slice from a DataFrame. Try using .loc[row_indexer,col_indexer] =
value instead
See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
How can I scale only non-category columns (keeping category columns untouched) AND return it as a dataframe?
The problem is you trying to change X_train and X_test which are parts of a bigger dataframe. Try:
X_train, X_test = X_train.copy(), X_test.copy()
before scaling.
You can also do:
X_train, y_train = (
X.query("date <= #X_train_limit").squeeze().copy(), # here
y.query("date <= #X_train_limit").squeeze(),
)
How to sample without replacement in TensorFlow? Like numpy.random.choice(n, size=k, replace=False) for some very large integer n (e.g. 100k-100M), and smaller k (e.g. 100-10k).
Also, I want it to be efficient and on the GPU, so other solutions like this with tf.py_func are not really an option for me. Anything which would use tf.range(n) or so is also not an option because n could be very large.
This is one way:
n = ...
sample_size = ...
idx = tf.random_shuffle(tf.range(n))[:sample_size]
EDIT:
I had posted the answer below but then read the last line of your post. I don't think there is a good way to do it if you absolutely cannot produce a tensor with size O(n) (numpy.random.choice with replace=False is also implemented as a slice of a permutation). You could resort to a tf.while_loop until you have unique indices:
n = ...
sample_size = ...
idx = tf.zeros(sample_size, dtype=tf.int64)
idx = tf.while_loop(
lambda i: tf.size(idx) == tf.size(tf.unique(idx)),
lambda i: tf.random_uniform(sample_size, maxval=n, dtype=int64))
EDIT 2:
About the average number of iterations in the previous method. If we call n the number of possible values and k the length of the desired vector (with k ≤ n), the probability that an iteration is successful is:
p = product((n - (i - 1) / n) for i in 1 .. k)
Since each iteartion can be considered a Bernoulli trial, the average number of trials unitl first success is 1 / p (proof here). Here is a function that calculates the average numbre of trials in Python for some k and n values:
def avg_iter(k, n):
if k > n or n <= 0 or k < 0:
raise ValueError()
avg_it = 1.0
for p in (float(n) / (n - i) for i in range(k)):
avg_it *= p
return avg_it
And here are some results:
+-------+------+----------+
| n | k | Avg iter |
+-------+------+----------+
| 10 | 5 | 3.3 |
| 100 | 10 | 1.6 |
| 1000 | 10 | 1.1 |
| 1000 | 100 | 167.8 |
| 10000 | 10 | 1.0 |
| 10000 | 100 | 1.6 |
| 10000 | 1000 | 2.9e+22 |
+-------+------+----------+
You can see it varies wildy depending on the parameters.
It is possible, though, to construct a vector in a fixed number of steps, although the only algorithm I can think of is O(k2). In pure Python it goes like this:
import random
def sample_wo_replacement(n, k):
sample = [0] * k
for i in range(k):
sample[i] = random.randint(0, n - 1 - len(sample))
for i, v in reversed(list(enumerate(sample))):
for p in reversed(sample[:i]):
if v >= p:
v += 1
sample[i] = v
return sample
random.seed(100)
print(sample_wo_replacement(10, 5))
# [2, 8, 9, 7, 1]
print(sample_wo_replacement(10, 10))
# [6, 5, 8, 4, 0, 9, 1, 2, 7, 3]
This is a possible way to do it in TensorFlow (not sure if the best one):
import tensorflow as tf
def sample_wo_replacement_tf(n, k):
# First loop
sample = tf.constant([], dtype=tf.int64)
i = 0
sample, _ = tf.while_loop(
lambda sample, i: i < k,
# This is ugly but I did not want to define more functions
lambda sample, i: (tf.concat([sample,
tf.random_uniform([1], maxval=tf.cast(n - tf.shape(sample)[0], tf.int64), dtype=tf.int64)],
axis=0),
i + 1),
[sample, i], shape_invariants=[tf.TensorShape((None,)), tf.TensorShape(())])
# Second loop
def inner_loop(sample, i):
sample_size = tf.shape(sample)[0]
v = sample[i]
j = i - 1
v, _ = tf.while_loop(
lambda v, j: j >= 0,
lambda v, j: (tf.cond(v >= sample[j], lambda: v + 1, lambda: v), j - 1),
[v, j])
return (tf.where(tf.equal(tf.range(sample_size), i), tf.tile([v], (sample_size,)), sample), i - 1)
i = tf.shape(sample)[0] - 1
sample, _ = tf.while_loop(lambda sample, i: i >= 0, inner_loop, [sample, i])
return sample
And an example:
with tf.Graph().as_default(), tf.Session() as sess:
tf.set_random_seed(100)
sample = sample_wo_replacement_tf(10, 5)
for i in range(10):
print(sess.run(sample))
# [3 0 6 8 4]
# [5 4 8 9 3]
# [1 4 0 6 8]
# [8 9 5 6 7]
# [7 5 0 2 4]
# [8 4 5 3 7]
# [0 5 7 4 3]
# [2 0 3 8 6]
# [3 4 8 5 1]
# [5 7 0 2 9]
This is quite intesive on tf.while_loops, though, which are well-known not to be particularly fast in TensorFlow, so I wouldn't know how fast can you really get with this method without some kind of benchmarking.
EDIT 4:
One last possible method. You can divide the range of possible values (0 to n) in "chunks" of size c and pick a random amount of numbers from each chunk, then shuffle everything. The amount of memory that you use is limited by c, and you don't need nested loops. If n is divisible by c, then you should get about a perfect random distribution, otherwise values in the last "short" chunk would receive some extra probability (this may be negligible depending on the case). Here is a NumPy implementation. It is somewhat long to account for different corner cases and pitfalls, but if c ≥ k and n mod c = 0 several parts get simplified.
import numpy as np
def sample_chunked(n, k, chunk=None):
chunk = chunk or n
last_chunk = chunk
parts = n // chunk
# Distribute k among chunks
max_p = min(float(chunk) / k, 1.0)
max_p_last = max_p
if n % chunk != 0:
parts += 1
last_chunk = n % chunk
max_p_last = min(float(last_chunk) / k, 1.0)
p = np.full(parts, 2)
# Iterate until a valid distribution is found
while not np.isclose(np.sum(p), 1) or np.any(p > max_p) or p[-1] > max_p_last:
p = np.random.uniform(size=parts)
p /= np.sum(p)
dist = (k * p).astype(np.int64)
sample_size = np.sum(dist)
# Account for rounding errors
while sample_size < k:
i = np.random.randint(len(dist))
while (dist[i] >= chunk) or (i == parts - 1 and dist[i] >= last_chunk):
i = np.random.randint(len(dist))
dist[i] += 1
sample_size += 1
while sample_size > k:
i = np.random.randint(len(dist))
while dist[i] == 0:
i = np.random.randint(len(dist))
dist[i] -= 1
sample_size -= 1
assert sample_size == k
# Generate sample parts
sample_parts = []
for i, v in enumerate(np.nditer(dist)):
if v <= 0:
continue
c = chunk if i < parts - 1 else last_chunk
base = chunk * i
sample_parts.append(base + np.random.choice(c, v, replace=False))
sample = np.concatenate(sample_parts, axis=0)
np.random.shuffle(sample)
return sample
np.random.seed(100)
print(sample_chunked(15, 5, 4))
# [ 8 9 12 13 3]
A quick benchmark of sample_chunked(100000000, 100000, 100000) takes about 3.1 seconds in my computer, while I haven't been able to run the previous algorithm (sample_wo_replacement function above) to completion with the same parameters. It should be possible to implement it in TensorFlow, maybe using tf.TensorArray, although it would require significant effort to get it exactly right.
use the gumbel-max trick here: https://github.com/tensorflow/tensorflow/issues/9260
z = -tf.log(-tf.log(tf.random_uniform(tf.shape(logits),0,1)))
_, indices = tf.nn.top_k(logits + z,K)
indices are what you want. This tick is so easy~!
The following works fairly fast on the GPU, and I did not encounter memory issues when using n~100M and k~10k (using NVIDIA GeForce GTX 1080 Ti):
def random_choice_without_replacement(n, k):
"""equivalent to 'numpy.random.choice(n, size=k, replace=False)'"""
return tf.math.top_k(tf.random.uniform(shape=[n]), k, sorted=False).indices
I'd like to compare a betareg regression vs. the same regression using rjags
library(betareg)
d = data.frame(p= sample(c(.1,.2,.3,.4),100, replace= TRUE),
id = seq(1,100,1))
# I am looking to reproduce this regression with jags
b=betareg(p ~ id, data= d,
link = c("logit"), link.phi = NULL, type = c("ML"))
summary(b)
Below I am trying to do the same regression with rjags
#install.packages("rjags")
library(rjags)
jags_str = "
model {
#model
y ~ dbeta(alpha, beta)
alpha <- mu * phi
beta <- (1-mu) * phi
logit(mu) <- a + b*id
#priors
a ~ dnorm(0, .5)
b ~ dnorm(0, .5)
t0 ~ dnorm(0, .5)
phi <- exp(t0)
}"
id = d$id
y = d$p
model <- jags.model(textConnection(jags_str),
data = list(y=y,id=id)
)
update(model, 10000, progress.bar="none"); # Burnin for 10000 samples
samp <- coda.samples(model,
variable.names=c("mu"),
n.iter=20000, progress.bar="none")
summary(samp)
plot(samp)
I get an error on this line
model <- jags.model(textConnection(jags_str),
data = list(y=y,id=id)
)
Error in jags.model(textConnection(jags_str), data = list(y = y, id = id)) :
RUNTIME ERROR:
Invalid vector argument to ilogit
Can you advise
(1) how to fix the error
(2) how to set priors for the beta regression
Thank you.
This error occurs because you have supplied the id vector to the scalar function logit. In Jags inverse link functions cannot be vectorized. To address this, you need to use a for loop to go through each element of id. To do this I would probably add an additional element to your data list that denotes how long id is.
d = data.frame(p= sample(c(.1,.2,.3,.4),100, replace= TRUE),
id = seq(1,100,1), len_id = length(seq(1,100,1)))
From there you just need to make a small edit to your jags code.
for(i in 1:(len_id)){
y[i] ~ dbeta(alpha[i], beta[i])
alpha[i] <- mu[i] * phi
beta[i] <- (1-mu[i]) * phi
logit(mu[i]) <- a + b*id[i]
}
However, if you track mu it is going to be a matrix that is 20000 (# of iterations) by 100 (length of id). You are likely more interested in the actual parameters (a, b, and phi).
I'm trying to optimize a piece of code that solves a large sparse nonlinear system using an interior point method. During the update step, this involves computing the Hessian matrix H, the gradient g, then solving for d in H * d = -g to get the new search direction.
The Hessian matrix has a symmetric tridiagonal structure of the form:
A.T * diag(b) * A + C
I've run line_profiler on the particular function in question:
Line # Hits Time Per Hit % Time Line Contents
==================================================
386 def _direction(n, res, M, Hsig, scale_var, grad_lnprior, z, fac):
387
388 # gradient
389 44 1241715 28220.8 3.7 g = 2 * scale_var * res - grad_lnprior + z * np.dot(M.T, 1. / n)
390
391 # hessian
392 44 3103117 70525.4 9.3 N = sparse.diags(1. / n ** 2, 0, format=FMT, dtype=DTYPE)
393 44 18814307 427597.9 56.2 H = - Hsig - z * np.dot(M.T, np.dot(N, M)) # slow!
394
395 # update direction
396 44 10329556 234762.6 30.8 d, fac = my_solver(H, -g, fac)
397
398 44 111 2.5 0.0 return d, fac
Looking at the output it's clear that constructing H is by far the most costly step - it takes considerably longer than actually solving for the new direction.
Hsig and M are both CSC sparse matrices, n is a dense vector and z is a scalar. The solver I'm using requires H to be either a CSC or CSR sparse matrix.
Here's a function that produces some toy data with the same formats, dimensions and sparseness as my real matrices:
import numpy as np
from scipy import sparse
def make_toy_data(nt=200000, nc=10):
d0 = np.random.randn(nc * (nt - 1))
d1 = np.random.randn(nc * (nt - 1))
M = sparse.diags((d0, d1), (0, nc), shape=(nc * (nt - 1), nc * nt),
format='csc', dtype=np.float64)
d0 = np.random.randn(nc * nt)
Hsig = sparse.diags(d0, 0, shape=(nc * nt, nc * nt), format='csc',
dtype=np.float64)
n = np.random.randn(nc * (nt - 1))
z = np.random.randn()
return Hsig, M, n, z
And here's my original approach for constructing H:
def original(Hsig, M, n, z):
N = sparse.diags(1. / n ** 2, 0, format='csc')
H = - Hsig - z * np.dot(M.T, np.dot(N, M)) # slow!
return H
Timing:
%timeit original(Hsig, M, n, z)
# 1 loops, best of 3: 483 ms per loop
Is there a faster way to construct this matrix?
I get close to a 4x speed-up in computing the product M.T * D * M out of the three diagonal arrays. If d0 and d1 are the main and upper diagonal of M, and d is the main diagonal of D, then the following code creates M.T * D * M directly:
def make_tridi_bis(d0, d1, d, nc=10):
d00 = d0*d0*d
d11 = d1*d1*d
d01 = d0*d1*d
len_ = d0.size
data = np.empty((3*len_ + nc,))
indices = np.empty((3*len_ + nc,), dtype=np.int)
# Fill main diagonal
data[:2*nc:2] = d00[:nc]
indices[:2*nc:2] = np.arange(nc)
data[2*nc+1:-2*nc:3] = d00[nc:] + d11[:-nc]
indices[2*nc+1:-2*nc:3] = np.arange(nc, len_)
data[-2*nc+1::2] = d11[-nc:]
indices[-2*nc+1::2] = np.arange(len_, len_ + nc)
# Fill top diagonal
data[1:2*nc:2] = d01[:nc]
indices[1:2*nc:2] = np.arange(nc, 2*nc)
data[2*nc+2:-2*nc:3] = d01[nc:]
indices[2*nc+2:-2*nc:3] = np.arange(2*nc, len_+nc)
# Fill bottom diagonal
data[2*nc:-2*nc:3] = d01[:-nc]
indices[2*nc:-2*nc:3] = np.arange(len_ - nc)
data[-2*nc::2] = d01[-nc:]
indices[-2*nc::2] = np.arange(len_ - nc ,len_)
indptr = np.empty((len_ + nc + 1,), dtype=np.int)
indptr[0] = 0
indptr[1:nc+1] = 2
indptr[nc+1:len_+1] = 3
indptr[-nc:] = 2
np.cumsum(indptr, out=indptr)
return sparse.csr_matrix((data, indices, indptr), shape=(len_+nc, len_+nc))
If your matrix M were in CSR format, you can extract d0 and d1 as d0 = M.data[::2] and d1 = M.data[1::2], I modified you toy data making routine to return those arrays as well, and here's what I get:
In [90]: np.allclose((M.T * sparse.diags(d, 0) * M).A, make_tridi_bis(d0, d1, d).A)
Out[90]: True
In [92]: %timeit make_tridi_bis(d0, d1, d)
10 loops, best of 3: 124 ms per loop
In [93]: %timeit M.T * sparse.diags(d, 0) * M
1 loops, best of 3: 501 ms per loop
The whole purpose of the above code is to take advantage of the structure of the non-zero entries. If you draw a diagram of the matrices you are multiplying together, it is relatively easy to convince yourself that the main (d_0) and top and bottom (d_1) diagonals of the resulting tridiagonal matrix are simply:
d_0 = np.zeros((len_ + nc,))
d_0[:len_] = d00
d_0[-len_:] += d11
d_1 = d01
The rest of the code in that function is simply building the tridiagonal matrix directly, as calling sparse.diags with the above data is several times slower.
I tried running your test case and had problems with the np.dot(N, M). I didn't dig into it, but I think my numpy/sparse combo (both pretty new) had problems using np.dot on sparse arrays.
But H = -Hsig - z*M.T.dot(N.dot(M)) runs just fine. This uses the sparse dot.
I haven't run a profile, but here are Ipython timings for several parts. It takes longer to generate the data than to do that double dot.
In [37]: timeit Hsig,M,n,z=make_toy_data()
1 loops, best of 3: 2 s per loop
In [38]: timeit N = sparse.diags(1. / n ** 2, 0, format='csc')
1 loops, best of 3: 377 ms per loop
In [39]: timeit H = -Hsig - z*M.T.dot(N.dot(M))
1 loops, best of 3: 1.55 s per loop
H is a
<2000000x2000000 sparse matrix of type '<type 'numpy.float64'>'
with 5999980 stored elements in Compressed Sparse Column format>