Spark ml LogisticRegression failing for large libsvm format data - optimization

I am trying to fit a Spark (1.6.0) ml.LogisticRegression model with pure L1 regularization (elasticNetParam = 1.0, regParam = 0.1) to a large dataset (2920874 rows, 2564827 features) saved in libsvm format. Since this is pure L1, the OWLQN optimizer is called for the optimization step. However, in the resulting model the regression weights remain at their initial state and the number of iterations appears to be just one, which suggests the optimizer exits right after initialization for some reason. Moreover, the optimizer does not print any info messages, which is further evidence that it never goes beyond the initialization step. My code is attached here:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.mllib.linalg.Vector
import java.io._
import org.apache.spark.mllib.util.MLUtils
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val training = sqlContext.read.format("libsvm").option("numFeatures", "2564828").load("/path/to/large/libsvm/training/data")
val test = sqlContext.read.format("libsvm").option("numFeatures", "2564828").load("/path/to/large/libsvm/test/data")
val lr = (new LogisticRegression()).setMaxIter(100).setElasticNetParam(1.0).setRegParam(0.1)
val lrm = lr.fit(training)
val output = lrm.transform(test)
val evaluator = (new BinaryClassificationEvaluator())
val metric = evaluator.evaluate(output)
A few relevant points:
This code was run from spark-shell (1.6.0) with 40 cores (10 executors with 4 cores each), 4 GB driver memory and 4 GB executor memory.
The same code with L2 regularization (elasticNetParam = 0.0) runs without any issues.
The same code (pure L1) runs without issues on a smaller dataset in the same format, as used in the LogisticRegression test case.

Related

How to use a custom loss function in GPflow 2?

I am new to GPflow and I am trying to figure out how to write a custom loss function to optimize the model. For my purpose, I need to manipulate the predicted output of the GP through different data treatments, and it is the output I get after these treatments that I would like to optimise the GP model against. For that purpose I would like to use the root mean square error as the loss function.
Workflow:
Input -> GP model -> GP_output -> Data treatment -> Predicted_output -> RMSE(Predicted_output, Observations)
I hope this makes sense.
Normally, models are optimised by doing something like this:
import gpflow as gf
import numpy as np
X = np.linspace(0, 100, num=100).reshape(-1, 1)  # GPflow expects 2D arrays of shape (N, 1)
n = np.random.normal(scale=8, size=X.shape)
y_obs = 10 * np.sin(X) + n
model = gf.models.GPR(
    data=(X, y_obs),
    kernel=gf.kernels.SquaredExponential(),
)
optimizer_config = dict(maxiter=100)  # any scipy.optimize options, defined here for completeness
gf.optimizers.Scipy().minimize(
    model.training_loss, model.trainable_variables, options=optimizer_config
)
I have figured out how to do a workaround using the scipy minimize function to optimise using RMSE, but I would like to stay within the GPflow framework, where I can just pass model.trainable_variables as an argument and have a general function that also works if I have multiple input/output dimensions.
def objective_func(params):
    model.kernel.lengthscales.assign(params[0])
    model.kernel.variance.assign(params[1])
    model.likelihood.variance.assign(params[2])
    GP_output = model.predict_y(X)[0]
    GP_output = GP_output.numpy()
    Predicted_output = data_treatment_func(GP_output)
    return np.sqrt(np.square(np.subtract(Predicted_output, y_obs)).mean())

from scipy.optimize import minimize
res = minimize(objective_func, x0=(1.0, 1.0, 1.0))
I found the answer myself.
If you write your objective_func using TensorFlow instead of NumPy (e.g. tf.math.sqrt, tf.reduce_mean) you can simply pass that to gf.optimizers.Scipy().minimize(...) instead of model.training_loss:
import tensorflow as tf

def objective_func():
    GP_output = model.predict_y(X)[0]
    Predicted_output = data_treatment_func(GP_output)
    return tf.sqrt(tf.reduce_mean(tf.square(Predicted_output - y_obs)))

gf.optimizers.Scipy().minimize(
    objective_func, model.trainable_variables, options=optimizer_config
)

set seed for entire model training/testing

I have code using tensorflow v1 and I'd like to migrate it to native tensorflow 2.
The code defines random objects (using numpy.random or random), a neural network (keras weight initialization, etc.) and other tensorflow random functions. At the end, it makes predictions on a random test set and outputs the loss/accuracy of the model.
For this task, I keep the original code and a copy of it, and I change the code of the copy part by part. I want to make sure that the behaviour is the same, so I want to fix the randomness so that I can monitor whether the loss/accuracy changes.
However, even after setting the seeds of the various random modules in my original file, launching it multiple times still gives different loss/accuracy values.
Here are my libraries:
import time
import random
import my_file as mf  # file in directory scope
import numpy as np
import copy
import os
from matplotlib import pyplot as plt
import tensorflow.compat.v1 as tf
and I'm setting the seeds at the beginning like this:
tf.set_random_seed(42)
random.seed(42)
np.random.seed(42)
My module my_file uses the random library and I'm also setting the seed there.
I do understand from the docs that tf.set_random_seed only sets the global seed and that each random operation in tensorflow also uses its own operation seed, resulting in different behaviors for consecutive calls. For example, if I call the training/testing cell 3 times I get consecutive loss values L1 -> L2 -> L3.
However, this should still result in the same behavior if I restart the environment, so why isn't that the case? If I restart the kernel and execute 3 times I get L1' =/= L1 -> L2' =/= L2 -> L3' =/= L3.
What else should I verify to make sure the behaviour is the same every time I restart the notebook kernel?
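For reference, here is a minimal sketch of the kind of setup I mean (TF2-style APIs; the PYTHONHASHSEED and op-level seed lines are extras I have seen recommended, not something my current code does):
import os
import random
import numpy as np
import tensorflow as tf

os.environ["PYTHONHASHSEED"] = "42"  # only effective if exported before the interpreter starts
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)               # global seed (tf.compat.v1.set_random_seed in v1 code)

x = tf.random.uniform([2], seed=1)   # op-level seed, combined with the global seed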

Fastest way to compute many 3x3 matrix-matrix multiplications

I need to compute the composition (matrix product) of many 3x3 rotation matrices.
Here is a comparison of applying functools.reduce on matmul with numpy and cupy:
import timeit
from functools import reduce
import numpy as np
import cupy as cp
from pyrr.matrix33 import create_from_axis_rotation
# generate random rotation matrices
axes = np.random.rand(10000, 3)
angles = np.pi * np.random.rand(10000)
rotations = [create_from_axis_rotation(*params) for params in zip(axes, angles)]
# then reduce with matmul
xp = np # numpy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")
xp = cp # cupy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")
On a good machine with a Titan GPU, this gives:
numpy: 1.63e+02ms
cupy: 8.78e+02ms
For some reason the GPU is much slower.
In any case, is there a way to calculate this significantly faster?
Edit
I found a rather simple solution that works for all chains of small linear transformations (and can easily be extended to affine transformations).
def reduce_loop(matrices):
    """ non-optimized reduce """
    mat = matrices[0]
    for _mat in matrices[1:]:
        mat = mat @ _mat
    return mat

def reduce_split(matrices):
    """ reduce by multiplying pairs of matrices recursively """
    matrices = np.asarray(matrices)  # accept a list of (3, 3) arrays or an (N, 3, 3) array
    if len(matrices) == 1:
        return matrices[0]
    neven = (len(matrices) // 2) * 2
    reduced = matrices[:neven:2] @ matrices[1:neven:2]  # batched matmul of consecutive pairs
    if len(matrices) > neven:  # len(matrices) is odd
        reduced[-1] = reduced[-1] @ matrices[-1]
    return reduce_split(reduced)
time = timeit.timeit("reduce_loop(rotations)", number=10, globals=globals())
print(f"reduce_loop: {time * 1000:0.3}ms")
time = timeit.timeit("reduce_split(rotations)", number=10, globals=globals())
print(f"reduce_split: {time * 1000:0.3}ms")
Giving:
reduce_loop: 2.14e+02ms
reduce_split: 24.5ms
I'm sure it's not optimal, but it uses numpy's (and probably cupy's) optimization.
reduce() was removed from Python's built-ins because it is inefficient and not pythonic; there is no CuPy equivalent, only the host-side version in the functools library.
Your CuPy code is spending most of its time fruitlessly copying data from host to device and back again... thousands of times, because reduce() runs only on the host, not on the GPU. You are straining your PCI bus, not the GPU.
Consider making the list "rotations" into a single CuPy array and then using striding, not a Python list (a sketch of this follows below).
Use a CuPy ReductionKernel to do the matmul:
https://docs.cupy.dev/en/stable/reference/generated/cupy.ReductionKernel.html
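A minimal sketch of the "single CuPy array" suggestion, reusing the rotations list and the numpy import from the question: stack the matrices into one (N, 3, 3) device array once, then reduce them pairwise with batched matmul so every step stays on the GPU.
import cupy as cp

def reduce_split_gpu(stacked):
    """Pairwise reduction of an (N, 3, 3) CuPy array of matrices, entirely on the device."""
    while stacked.shape[0] > 1:
        n_even = (stacked.shape[0] // 2) * 2
        reduced = cp.matmul(stacked[:n_even:2], stacked[1:n_even:2])
        if stacked.shape[0] > n_even:  # odd count: fold the last matrix into the last pair
            reduced[-1] = reduced[-1] @ stacked[-1]
        stacked = reduced
    return stacked[0]

rotations_gpu = cp.asarray(np.stack(rotations))  # a single host-to-device copy
combined = reduce_split_gpu(rotations_gpu)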

How to use torch to speed up some common computations?

I am trying to do some common computations, like matrix multiplication, but without gradient computation. An example of my computation looks like this:
import numpy as np
from scipy.special import logsumexp
var = 1e-8
a = np.random.randint(0, 10, (128, 20))
result = logsumexp(a, axis=1) / 2. + np.log(np.pi * var)
I want to use torch (gpu) to speed up the computation. Here is the code
import numpy as np
import torch
var = 1e-8
a = np.random.randint(0, 10, (128, 20))
a = torch.from_numpy(a).cuda()
result = torch.logsumexp(a, dim=1) / 2. + np.log(np.pi * var)
but I have some questions:
Could the above code speed up the computation? I don't know if it works.
Do I need to convert all values into torch.tensor, like from var to torch.tensor(var).cuda() and from np.log(np.pi*var) to a torch.tensor?
Do I need to convert all tensors into gpu by myself, especially for some intermediate variable?
If the above code doesn't work, how can I speed up the computation with gpu?
You could use torch alone to do the computations.
import torch
# optimization by passing device argument, tensor is created on gpu and hence move operation is saved
# convert to float to use with logsumexp
a = torch.randint(0,10, (128,20), device="cuda").float()
result = torch.logsumexp(a, dim=1)/ 2.
Answers to some of your questions:
Could the above code speed up the computation?
It depends. If you have many matrix multiplications, using the GPU can give a speedup.
Do I need to convert all values into torch.tensor, like from var to torch.tensor(var).cuda() and from np.log(np.pi*var) to a torch.tensor?
Yes
Do I need to convert all tensors into gpu by myself, especially for some intermediate variable?
Only leaf variables need to be converted; intermediate variables will be placed on the device on which the operations are done. For example, if a and b are on the GPU, then as a result of the operation c = a + b, c will also be on the GPU.
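A small sketch of that device-propagation point (it assumes a CUDA device is available; the no_grad block reflects that you said you do not need gradients):
import torch

with torch.no_grad():              # no gradient bookkeeping needed for these ops
    a = torch.randn(3, device="cuda")
    b = torch.randn(3, device="cuda")
    c = a + b                      # intermediate result is created on the GPU automatically
print(c.device)                    # cuda:0
result = c.cpu().numpy()           # copy back to host only when NumPy is actually needed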

Make Pandas DataFrame apply() use all cores?

As of August 2017, Pandas DataFrame.apply() is unfortunately still limited to working with a single core, meaning that a multi-core machine will waste the majority of its compute time when you run df.apply(myfunc, axis=1).
How can you use all your cores to run apply on a dataframe in parallel?
You may use the swifter package:
pip install swifter
(Note that you may want to use this in a virtualenv to avoid version conflicts with installed dependencies.)
Swifter works as a plugin for pandas, allowing you to reuse the apply function:
import swifter
def some_function(data):
    return data * 10

data['out'] = data['in'].swifter.apply(some_function)
It will automatically figure out the most efficient way to parallelize the function, no matter if it's vectorized (as in the above example) or not.
More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.
Also note that this will not work automatically for string columns. When using strings, Swifter will fall back to a “simple” Pandas apply, which will not be parallel. In this case, even forcing it to use dask will not create performance improvements, and you would be better off just splitting your dataset manually and parallelizing using multiprocessing.
The simplest way is to use Dask's map_partitions. You need these imports (you will need to pip install dask):
import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get
and the syntax is
data = <your_pandas_dataframe>
ddata = dd.from_pandas(data, npartitions=30)
def myfunc(x,y,z, ...): return <whatever>
res = ddata.map_partitions(lambda df: df.apply((lambda row: myfunc(*row)), axis=1)).compute(get=get)
(I believe that 30 is a suitable number of partitions if you have 16 cores). Just for completeness, I timed the difference on my machine (16 cores):
import timeit
import numpy as np

data = pd.DataFrame()
data['col1'] = np.random.normal(size = 1500000)
data['col2'] = np.random.normal(size = 1500000)
ddata = dd.from_pandas(data, npartitions=30)
def myfunc(x,y): return y*(x**2+1)
def apply_myfunc_to_DF(df): return df.apply((lambda row: myfunc(*row)), axis=1)
def pandas_apply(): return apply_myfunc_to_DF(data)
def dask_apply(): return ddata.map_partitions(apply_myfunc_to_DF).compute(get=get)
def vectorized(): return myfunc(data['col1'], data['col2'] )
t_pds = timeit.Timer(lambda: pandas_apply())
print(t_pds.timeit(number=1))
28.16970546543598
t_dsk = timeit.Timer(lambda: dask_apply())
print(t_dsk.timeit(number=1))
2.708152851089835
t_vec = timeit.Timer(lambda: vectorized())
print(t_vec.timeit(number=1))
0.010668013244867325
Giving a factor of 10 speedup going from pandas apply to dask apply on partitions. Of course, if you have a function you can vectorize, you should - in this case the function (y*(x**2+1)) is trivially vectorized, but there are plenty of things that are impossible to vectorize.
You can try pandarallel instead: a simple and efficient tool to parallelize your pandas operations on all your CPUs (on Linux & macOS).
Parallelization has a cost (instantiating new processes, sending data via shared memory, etc.), so parallelization is efficient only if the amount of computation to parallelize is high enough. For very small amounts of data, parallelization is not always worth it.
Functions applied should NOT be lambda functions.
from pandarallel import pandarallel
from math import sin
pandarallel.initialize()
# FORBIDDEN
df.parallel_apply(lambda x: sin(x**2), axis=1)
# ALLOWED
def func(x):
    return sin(x**2)

df.parallel_apply(func, axis=1)
see https://github.com/nalepae/pandarallel
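For reference, initialize() also accepts a few options; a small sketch based on my reading of the project README (nb_workers and progress_bar are the arguments I mean), reusing func from above:
from pandarallel import pandarallel

# limit the number of workers and show per-worker progress bars
pandarallel.initialize(nb_workers=4, progress_bar=True)

df.parallel_apply(func, axis=1)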
If you want to stay in native python:
import multiprocessing as mp

with mp.Pool(mp.cpu_count()) as pool:
    df['newcol'] = pool.map(f, df['col'])
will apply function f in a parallel fashion to column col of dataframe df
Just want to give an updated answer for Dask:
import dask.dataframe as dd

def your_func(row):
    # do something
    return row

ddf = dd.from_pandas(df, npartitions=30)  # find your own number of partitions
ddf_update = ddf.apply(your_func, axis=1).compute()
On my 100,000 records, without Dask:
CPU times: user 6min 32s, sys: 100 ms, total: 6min 32s
Wall time: 6min 32s
With Dask:
CPU times: user 5.19 s, sys: 784 ms, total: 5.98 s
Wall time: 1min 3s
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the number of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply
mapply.init(n_workers=-1)
...
df.mapply(myfunc, axis=1)
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (usually twice the number of physical CPUs would show up as logical cores), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
Depending on your definition of all your cores, you could also use all logical cores instead (beware that this way the CPU-bound processes will be fighting for physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
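Either count can then be passed back into mapply's init; a small sketch reusing n_workers from above and the myfunc name from the earlier snippet:
import mapply

# use the explicitly computed worker count instead of the -1 default
mapply.init(n_workers=n_workers)

df.mapply(myfunc, axis=1)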
Here is an example of an sklearn base transformer in which the pandas apply is parallelized:
import multiprocessing as mp

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class ParallelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_jobs=1):
        """
        n_jobs - parallel jobs to run
        """
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        return self

    def transform(self, X, *_):
        X_copy = X.copy()
        cores = mp.cpu_count()
        partitions = 1
        if self.n_jobs <= -1:
            partitions = cores
        elif self.n_jobs <= 0:
            partitions = 1
        else:
            partitions = min(self.n_jobs, cores)
        if partitions == 1:
            # transform sequentially
            return X_copy.apply(self._transform_one)
        # split the data into batches
        data_split = np.array_split(X_copy, partitions)
        pool = mp.Pool(cores)
        # the reduce step: concatenation of the transformed batches
        data = pd.concat(
            pool.map(self._transform_part, data_split)
        )
        pool.close()
        pool.join()
        return data

    def _transform_part(self, df_part):
        return df_part.apply(self._transform_one)

    def _transform_one(self, line):
        # some kind of transformation here
        return line
for more info see https://towardsdatascience.com/4-easy-steps-to-improve-your-machine-learning-code-performance-88a0b0eeffa8
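A quick usage sketch for the transformer above (column names are hypothetical; the real per-row logic goes into _transform_one):
# apply the parallelized transformer to a single column
transformer = ParallelTransformer(n_jobs=-1)        # -1: use all available cores
df['out'] = transformer.fit_transform(df['in'])     # runs _transform_one on each element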
A native Python solution (with numpy) that can be applied to the whole DataFrame, as the original question asks (not only to a single column):
import numpy as np
import pandas as pd
import multiprocessing as mp

dfs = np.array_split(df, 8000)  # divide the dataframe as desired

def f_app(df):
    return df.apply(myfunc, axis=1)

with mp.Pool(mp.cpu_count()) as pool:
    res = pd.concat(pool.map(f_app, dfs))
Here is another one using Joblib and some helper code from scikit-learn. It is lightweight (if you already have scikit-learn) and good if you prefer more control over what it is doing, since joblib is easily hackable.
import pandas as pd
from joblib import parallel_backend, Parallel, delayed, effective_n_jobs
from sklearn.utils import gen_even_slices
from sklearn.utils.validation import _num_samples

def parallel_apply(df, func, n_jobs=-1, **kwargs):
    """ Pandas apply in parallel using joblib.
    Uses sklearn.utils to partition input evenly.
    Args:
        df: Pandas DataFrame, Series, or any other object that supports slicing and apply.
        func: Callable to apply
        n_jobs: Desired number of workers. Default value -1 means use all available cores.
        **kwargs: Any additional parameters will be supplied to the apply function
    Returns:
        Same as for normal Pandas DataFrame.apply()
    """
    if effective_n_jobs(n_jobs) == 1:
        return df.apply(func, **kwargs)
    else:
        ret = Parallel(n_jobs=n_jobs)(
            delayed(type(df).apply)(df[s], func, **kwargs)
            for s in gen_even_slices(_num_samples(df), effective_n_jobs(n_jobs)))
        return pd.concat(ret)
Usage: result = parallel_apply(my_dataframe, my_func)
Instead of
df["new"] = df["old"].map(fun)
do
from joblib import Parallel, delayed
df["new"] = Parallel(n_jobs=-1, verbose=10)(delayed(fun)(i) for i in df["old"])
To me this is a slight improvement over
import multiprocessing as mp

with mp.Pool(mp.cpu_count()) as pool:
    df["new"] = pool.map(fun, df["old"])
as you get a progress indication and automatic batching if the jobs are very small.
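If the automatic batching does not suit your workload, Parallel also takes an explicit batch_size argument; a small sketch reusing fun from above (the value 100 is just an example, "auto" is the default):
from joblib import Parallel, delayed

# a fixed batch size can reduce scheduling overhead when each call to fun is very cheap
df["new"] = Parallel(n_jobs=-1, verbose=10, batch_size=100)(
    delayed(fun)(i) for i in df["old"]
)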
Since the question was "How can you use all your cores to run apply on a dataframe in parallel?", the answer can also be modin. You can run all cores in parallel, though in my test the real time was worse.
See https://github.com/modin-project/modin. It runs on top of dask or ray. They say "Modin is a DataFrame designed for datasets from 1MB to 1TB+." I tried pip3 install "modin[ray]". Modin vs pandas was 12 sec on six cores vs. 6 sec.
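A minimal usage sketch (hypothetical file and column names; modin.pandas is meant as a drop-in replacement for the pandas namespace, and the MODIN_ENGINE variable selects the backend):
import os
os.environ["MODIN_ENGINE"] = "ray"   # select the execution engine before importing modin.pandas

import modin.pandas as pd            # drop-in replacement for pandas

df = pd.read_csv("data.csv")         # hypothetical input
df["out"] = df["in"].apply(myfunc)   # Modin distributes the apply across cores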
In case you need to do something based on the column name inside the function, beware that the .apply function may give you some trouble. In my case I needed to change the column type using the astype() function based on the column name. This is probably not the most efficient way of doing it, but it suffices for the purpose and keeps the column names the same as in the original.
import multiprocessing as mp
import pandas as pd

def f(df):
    """ the function that you want to apply to each column """
    column_name = df.columns[0]  # this is the same as the original column name
    # do whatever you need to do to that column
    return df

# Here I just make a list of all the columns. If you don't use .to_frame()
# it will pass a Series instead of a DataFrame.
dfs = [df[column].to_frame() for column in df.columns]

with mp.Pool(mp.cpu_count()) as pool:
    processed_df = pd.concat(pool.map(f, dfs), axis=1)